This is a first version of the Wikimedia project etytree. The aim of the project is to visualize, in an interactive web page, the etymological tree (i.e., the etymology of a word in the form of a tree, with ancestors, cognate words, derived words, etc.) of any word in any language, using data extracted from Wiktionary.
If you have comments on the project, please leave them on the project's talk page.
This project has been inspired by my interest in etymology, in open source collaborative projects and in interactive visualizations.
The code is distributed under the MIT licence and the data is distributed under Creative Commons Attribution-ShareAlike 3.0.
The site's html files are contained in the repo root. The main page is index.html. To view the site you just need to navigate to the root of the repo.
GruntJS is currently used as the task runner. It runs JSLint on the JavaScript files under resources/js. To use these tools, take the following steps:
- Install Node
- Run
npm install
from the repo root.
Then you can lint the JS files by running
grunt js
on the command line, from the repo root.
Files contained in resources/data are imported from Wiktionary and updated when a new dump of the English Wiktionary is generated (updated on 07/22/2017).
The file etymology-only_languages.csv was created from Wiktionary data with a Lua module available here.
The file iso-639-3.tab was downloaded from this link (the first line has been removed).
The file list_of_languages.csv was downloaded from Wiktionary.
This code queries the wmflabs etytree-virtuoso SPARQL endpoint, which I have set up and populated with RDF data produced with dbnary_etymology. The extracted data is kept in sync with Wiktionary each time a new dump is generated (we are a little behind now: data was extracted on 01/06/2017).
I have defined an ontology for etymologies here. In particular I have defined properties etymologicallyRelatedTo, etymologicallyDerivesFrom and etymologicallyEquivalentTo.
Besides etymological relationships, the data also contains parts of speech, definitions, senses and more, as extracted by dbnary. The ontology for dbnary is defined here.
An example query to the SPARQL endpoint follows:
PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>
SELECT DISTINCT ?p ?o {
eng:__ee_get ?p ?o
}
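A minimal sketch of how a query like the one above might be turned into a request URL from JavaScript. The endpoint path `http://etytree-virtuoso.wmflabs.org/sparql` and the `format` parameter are assumptions, not taken from this README; `query` is the standard SPARQL protocol parameter, and the helper name `buildSparqlUrl` is made up for illustration.

```javascript
// Assumption: the endpoint is served at /sparql on the etytree-virtuoso host.
const ENDPOINT = 'http://etytree-virtuoso.wmflabs.org/sparql';

// Build a GET URL for a SPARQL query.
// `query` is the standard SPARQL protocol parameter; `format` is assumed
// to be accepted by the server to request JSON results.
function buildSparqlUrl(endpoint, query) {
  const params = new URLSearchParams({
    query: query,
    format: 'application/sparql-results+json'
  });
  return endpoint + '?' + params.toString();
}

// The example query from this README, joined into a single string.
const exampleQuery = [
  'PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>',
  'SELECT DISTINCT ?p ?o {',
  '  eng:__ee_get ?p ?o',
  '}'
].join('\n');

const exampleUrl = buildSparqlUrl(ENDPOINT, exampleQuery);
console.log(exampleUrl);
```

The resulting URL could then be passed to whatever HTTP client the page uses; the sketch only covers encoding the query.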
The property http://www.w3.org/2000/01/rdf-schema#seeAlso links to the Wiktionary page the etymological entry was extracted from. If you want to find all entries whose label contains the string "door":
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?s {
?s rdfs:label ?label .
?label bif:contains "door" .
}
If you want to find ancestors of "door":
PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org//dbnaryetymology#>
PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>
SELECT DISTINCT ?o {
eng:__ee_1_door dbetym:etymologicallyRelatedTo+ ?o .
}
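If the endpoint returns results in the W3C SPARQL 1.1 Query Results JSON Format, the values bound to `?o` could be collected roughly as follows. This is only a sketch: the helper name `bindingValues` is hypothetical, and the sample response uses invented example.org IRIs rather than real data from the endpoint.

```javascript
// Collect the values bound to one variable in a SPARQL JSON result
// (W3C "SPARQL 1.1 Query Results JSON Format").
// Rows in which the variable is unbound are skipped.
function bindingValues(result, variable) {
  return result.results.bindings
    .filter(function (row) { return row[variable] !== undefined; })
    .map(function (row) { return row[variable].value; });
}

// Hypothetical response: these IRIs are invented for illustration only.
const sampleResult = {
  head: { vars: ['o'] },
  results: {
    bindings: [
      { o: { type: 'uri', value: 'http://example.org/entry/a' } },
      { o: { type: 'uri', value: 'http://example.org/entry/b' } }
    ]
  }
};

const ancestors = bindingValues(sampleResult, 'o');
console.log(ancestors);
```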
The RDF database of etymological relationships is periodically extracted when a new dump of the English Wiktionary is released. The code used to extract the data is dbnary_etymology.
dbnary_etymology is a Maven project (use Java 8 and Maven 3).
cd dbnary_etymology/extractor/
mvn site
mvn javadoc:jar
cd dbnary_etymology
mvn package
VERSION=20170601
EXEC=~/dbnary_etymology/dbnary-extractor/target/dbnary-extractor-2.0e-SNAPSHOT-jar-with-dependencies.jar
DUMP=/srv/datasets/dumps/$VERSION/enwiktionary-$VERSION-pages-articles.utf-16.xml
FPAGE=0
TPAGE=2000000
# Braces are required here: $VERSION_x_ would be read as an (unset) variable named VERSION_x_
LOG=extracts/lemon/en/$VERSION/enwkt-${VERSION}_x_${FPAGE}_${TPAGE}.ttl.log
OUT=extracts/lemon/en/$VERSION/enwkt-${VERSION}_x_${FPAGE}_${TPAGE}.ttl
ETY=extracts/lemon/en/$VERSION/enwkt-${VERSION}_x_${FPAGE}_${TPAGE}.etymology.ttl
PREFIX=http://etytree-virtuoso.wmflabs.org/dbnary
rm -f "${OUT}"
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp ${EXEC} org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage ${FPAGE} --topage ${TPAGE} -E ${ETY} -o ${OUT} ${DUMP} 3>&1 1>>${LOG} 2>&1
WORD="door"
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary.eng=debug -cp $EXEC org.getalp.dbnary.cli.GetExtractedSemnet -l en --prefix http://etytree-virtuoso.wmflabs.org/ --etymology $DUMP $WORD
WORD="door"
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary.eng=debug -cp $EXEC org.getalp.dbnary.cli.GetExtractedSemnet -x -l en --etymology testfile $DUMP $WORD
- To speed up queries, edit the ontology and add new properties to identify the different etymological categories described in Wiktionary's Etymology help page:
- inherited word (template inherited)
- borrowed word (template borrowed)
- named from people
- developed from initialism
- surface analysis
- long detailed etymology - propose a new template??
- invented word/coined expression (coined by)
- back-formation (e.g.: burglar -> burgle, play the tambourine -> tambour, i.e. remove a morpheme, real or perceived) (template back-form)
- compound (template compound)
- phrase
- initialism
- acronym
- abbreviation
- clipping
- blend/portmanteau (template blend)
- calque/loan translation
- year template - propose a new template??
- cognates - I plan to ignore this
- Parse glosses in templates.
- I would like to add a preferred direction to the graph, going from left to right and following the evolution of a word from the past to the present. In terms of the force field, this would mean adding a magnetic field that orients arrows towards a preferred direction.
- Add zoom to tooltip, set zoom also in Google Chrome and other browsers.
- Add etymology controversies.
- Currently, for some words the Virtuoso server doesn't return data because the query times out. I want to try a different query, like the following:
DEFINE input:inference "etymology_ontology"
PREFIX dbetym: <http://kaiko.getalp.org/dbnaryetymology#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?source ?p ?o ?cognate ?pcognate ?scognate
{
?source ?p ?o .
FILTER (?p in (dbetym:etymologicallyDerivesFrom, dbetym:descendsFrom, dbetym:derivesFrom,dbetym:etymologicallyEquivalentTo))
# {
# SELECT ?source
# {
# ?source dbetym:etymologicallyRelatedTo{1,} <http://kaiko.getalp.org/dbnary/eng/__ee_1_water> .
# }
# }
# UNION
{
SELECT ?source
{
<http://kaiko.getalp.org/dbnary/eng/__ee_1_water> dbetym:etymologicallyRelatedTo{1,} ?source .
}
}
OPTIONAL
{
?source dbetym:etymologicallyRelatedTo{1,} ?cognate .
?scognate ?pcognate ?cognate .
FILTER (?pcognate in (dbetym:etymologicallyDerivesFrom, dbetym:descendsFrom, dbetym:derivesFrom,dbetym:etymologicallyEquivalentTo))
}
}
- Click on a word and query the server to get data about the word.
- Search for words containing spaces or accents.
- Extract reconstructed words.
- Maybe consider Dialects:
Module:da:Dialects ?
Module:en:Dialects This module provides labels to {{alter}}, which is used in the Alternative forms section.
Module:grc:Dialects This module translates from dialect codes to dialect names for templates such as {{alter}}. (e.g. aio -> link = 'Aeolic Greek', display = 'Aeolic')
Module:he:Dialects
Module:hy:Dialects ?
Module:la:Dialects (e.g.: aug -> link = Late Latin#Late and post-classical Latin, display = post-Augustan)
- Maybe consider additional modules:
Module:families/data mapping language code -> language name (e.g.: aav -> canonicalName = "Austro-Asiatic", otherNames = {"Austroasiatic"})
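
For the preferred left-to-right direction mentioned in the list above, one possible approach is a small custom force applied on every tick of the layout. The sketch below is library-independent and rests on assumptions: nodes are plain objects with `x` and `vx` fields, each link has a `source` (the older word) and a `target` (the newer word), and the names `applyDirectionalForce`, `strength` and `minGap` are hypothetical.

```javascript
// Nudge each link so that its target tends to sit to the right of its source.
// Assumption: nodes are plain objects with x (position) and vx (velocity),
// and link.source is the ancestor while link.target is the descendant.
function applyDirectionalForce(links, alpha, strength, minGap) {
  links.forEach(function (link) {
    const gap = link.target.x - link.source.x;
    if (gap < minGap) {
      // Push the two ends apart along x, in proportion to how far the
      // link is from pointing left-to-right with the desired gap.
      const push = (minGap - gap) * strength * alpha / 2;
      link.source.vx -= push;
      link.target.vx += push;
    }
  });
}

// Tiny demonstration with one link that currently points right-to-left.
const older = { x: 100, vx: 0 };
const newer = { x: 40, vx: 0 };
const demoLinks = [{ source: older, target: newer }];
applyDirectionalForce(demoLinks, 1, 0.5, 30);
console.log(older.vx, newer.vx);
```

Links that already point left to right with at least `minGap` of horizontal distance are left untouched, so the force only acts where the orientation is wrong.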