Lexicography is the science of words and their semantic relationships. Wouldn't it be beneficial to take advantage of linked data for lexicography too? Well, this is the motivation behind Ontolex-Lemon.
Lemon stands for the lexicon model for ontologies, which provides rich linguistic grounding for ontologies. By "rich", the creators mean that the model aims, in principle, to cover all types of information that dictionaries record about words, such as morphological and syntactic properties. Ontolex-Lemon is the result of the work of the W3C Ontology-Lexica Community Group.
One of the useful modules in Ontolex-Lemon is the Ontolex-Lemon lexicography module (lexicog).
This tool takes lexicographic data in a tabular format, such as comma-separated values (CSV) or tab-separated values (TSV), as input. The current version of the tool can convert the following:
- headwords
- part-of-speech tags
- senses
- examples
- idioms
- see also
The conversion can be configured using a configuration file called `configuration.json`. In this file, you can set various information, such as the source and target languages with their codes and the PoS tags according to the LexInfo module.
To run the code, clone or download this repository and pass the input file and the configuration file with the `-input` and `-config` arguments, respectively, on the command line:
python -input Sample_dictionary.tsv -config configuration.json
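As a minimal sketch of how the two arguments could be read, the snippet below uses Python's standard `argparse` module with single-dash long options. Only the `-input` and `-config` flag names come from the command above; the function name `build_parser` and the help texts are assumptions made for illustration, and the script itself may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the -input and -config flags shown above."""
    parser = argparse.ArgumentParser(
        description="Convert a tabular dictionary to Ontolex-Lemon (illustrative sketch)"
    )
    # argparse accepts multi-character options with a single dash;
    # the attribute names become "input" and "config".
    parser.add_argument("-input", required=True, help="path to the CSV/TSV dictionary")
    parser.add_argument("-config", required=True, help="path to configuration.json")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["-input", "Sample_dictionary.tsv", "-config", "configuration.json"]
)
print(args.input, args.config)
```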
Please note that this script can deal with relatively simple structures for the moment.
These are a few entries from a Kurdish dictionary in tabular format (original data in TSV):
| Headword | POS | Sense (translation) | Example | Expression | Cf. |
|---|---|---|---|---|---|
| aferîde | m | creature | | | |
| aferîn | excl | bravo | bravo ji ... re: good for ... | | |
| afirandin | v.t. | to create | | | |
| afîş | f | poster | | | |
| aga | adj | aware | | | |
| agadarî | f | information | announcement; awareness | | |
| agah | | | | | aga |
| agahdarî | | | | | agadarî |
| agir | m | fire | agir danîn bi: to set fire to | | |
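Reading such a file can be sketched with Python's standard `csv` module. The column keys in `COLUMNS` and the function `read_entries` are hypothetical names chosen for this illustration, and the three sample rows assume that cross-references such as `aga` sit in the last (Cf.) column; the actual TSV file may use different headers or column order.

```python
import csv
import io

# Hypothetical keys for the six columns of the table above.
COLUMNS = ["headword", "pos", "sense", "example", "expression", "cf"]

# Three rows from the table, tab-separated; empty cells are empty strings.
SAMPLE_TSV = (
    "aferîde\tm\tcreature\t\t\t\n"
    "afirandin\tv.t.\tto create\t\t\t\n"
    "agah\t\t\t\t\taga\n"
)


def read_entries(handle):
    """Yield one dict per non-empty row, keyed by COLUMNS, with cells trimmed."""
    # QUOTE_NONE keeps quotation marks inside cells as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        # Pad short rows so every entry has all six keys.
        row = row + [""] * (len(COLUMNS) - len(row))
        yield {name: cell.strip() for name, cell in zip(COLUMNS, row)}


entries = list(read_entries(io.StringIO(SAMPLE_TSV)))
for entry in entries:
    print(entry["headword"], entry["pos"], entry["sense"], entry["cf"])
```

Keeping empty cells as empty strings makes it easy to tell a full entry (headword, tag, and sense) from a pure cross-reference row, which carries only a headword and a Cf. target.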
In order to carry out the conversion correctly, we set a few conventions:
- Senses are separated using `;` or `,`.
- Any part-of-speech tag can be used, as long as the correct mappings are provided in the configuration file. This applies to the `Word`, `MultiwordExpression` and `Affix` classes in Ontolex-Lemon.
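The two conventions can be sketched as follows. The `POS_MAPPING` dictionary is purely hypothetical: it stands in for whatever mappings the configuration file provides, and the pairing of each tag with an Ontolex-Lemon class and a LexInfo part of speech is an assumption made for this example, as are the function names `split_senses` and `map_pos`.

```python
import re

# Hypothetical mapping from dictionary tags to
# (Ontolex-Lemon class, LexInfo part of speech); the real mappings
# are whatever the configuration file declares.
POS_MAPPING = {
    "m": ("Word", "noun"),
    "f": ("Word", "noun"),
    "adj": ("Word", "adjective"),
    "v.t.": ("Word", "verb"),
    "excl": ("Word", "interjection"),
}


def split_senses(cell):
    """Split a cell on ';' or ',' and drop empty or whitespace-only parts."""
    return [part.strip() for part in re.split(r"[;,]", cell) if part.strip()]


def map_pos(tag):
    """Return the mapped (class, part of speech) pair, or None for unknown tags."""
    return POS_MAPPING.get(tag)


print(split_senses("announcement; awareness"))
print(map_pos("adj"))
```

Returning `None` for an unmapped tag, rather than guessing, leaves it to the caller to decide whether to skip the entry or report the missing mapping.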
The results are written to a folder named after the source language, for example Kurmanji.
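To give an idea of what an Ontolex-Lemon entry can look like, here is a minimal sketch that serializes one entry as Turtle using plain string formatting. It is not the tool's actual output: the `http://example.org/lexicon#` base namespace, the language code `kmr` for Kurmanji, the use of `skos:definition` for the translation, and the function name `entry_to_turtle` are all assumptions, and the lexicog layer is left out for brevity.

```python
PREFIXES = """\
@prefix : <http://example.org/lexicon#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
"""


def entry_to_turtle(entry_id, headword, source_lang, lexinfo_pos, senses, target_lang="en"):
    """Serialize one entry as Turtle. Strings are not escaped, so inputs
    must not contain double quotes or backslashes."""
    sense_ids = [f":{entry_id}_sense{i}" for i in range(1, len(senses) + 1)]
    form = f'    ontolex:canonicalForm [ ontolex:writtenRep "{headword}"@{source_lang} ]'
    lines = [
        f":{entry_id} a ontolex:Word ;",
        f"    lexinfo:partOfSpeech lexinfo:{lexinfo_pos} ;",
    ]
    if sense_ids:
        # Link the entry to each of its senses, then close the statement.
        lines.append(form + " ;")
        lines.append(f"    ontolex:sense {', '.join(sense_ids)} .")
    else:
        lines.append(form + " .")
    for sense_id, gloss in zip(sense_ids, senses):
        lines.append("")
        lines.append(f"{sense_id} a ontolex:LexicalSense ;")
        lines.append(f'    skos:definition "{gloss}"@{target_lang} .')
    return PREFIXES + "\n" + "\n".join(lines) + "\n"


turtle = entry_to_turtle("agir", "agir", "kmr", "noun", ["fire"])
print(turtle)
```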