Converting tabular data to Ontolex-Lexicog

Ontolex-Lemon and Lexicog

Lexicography is the science of words and their semantic relationships. Wouldn't it be beneficial to take advantage of linked data for lexicography too? Well, this is the motivation behind Ontolex-Lemon.

Lemon, the lexicon model for ontologies, provides rich linguistic grounding for ontologies. By "rich", its creators mean that the model aims to cover all types of information recorded for words in dictionaries, such as morphological and syntactic properties. Ontolex-Lemon was developed by the W3C Ontology-Lexica Community Group.

One of the useful modules in Ontolex-Lemon is the lexicography module (lexicog).

Conversion to Ontolex-Lexicog

This tool takes lexicographic data in a tabular format as input, such as comma-separated values (CSV) or tab-separated values (TSV). The current version of the tool can convert the following:

headwords
part-of-speech tags
senses
examples
idioms
cross-references (see also)

The conversion is configured through a file called configuration.json. In this file, you can set information such as the source and target languages with their language codes, and the mapping of PoS tags to the LexInfo module.

To run the code, clone or download this repository, then pass the input file and the configuration file with the -input and -config arguments, respectively, on the command line:

python converter.py -input Sample_dictionary.tsv -config configuration.json

Please note that, for the moment, this script can only deal with relatively simple structures.

A working example

These are a few entries from a Kurdish dictionary in tabular format (the original data is in TSV):

Headword	POS	Sense (translation)	Example	Expression	Cf.
aferîde	m	creature
aferîn	excl	bravo	bravo ji ... re: good for ...
afirandin	v.t.	to create
afîş	f	poster
aga	adj	aware
agadarî	f	information	announcement; awareness
agah					aga
agahdarî					agadarî
agir	m	fire		agir danîn bi: to set fire to
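
As a rough illustration of how rows like these can be read, here is a minimal sketch using Python's standard csv module. It is not taken from converter.py: the function name read_entries and the in-memory sample are assumptions made for the example, and the column names are simply those of the header row above.

```python
import csv
import io

# Two rows from the sample above, with the tab separators written explicitly.
SAMPLE_TSV = (
    "Headword\tPOS\tSense (translation)\tExample\tExpression\tCf.\n"
    "aferîde\tm\tcreature\t\t\t\n"
    "agah\t\t\t\t\taga\n"
)


def read_entries(handle):
    """Yield one dict per dictionary row, keyed by the header names.

    Hypothetical helper for illustration; the real script may read
    its input differently.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Missing trailing cells come back as None; normalise them to "".
        # Surplus cells (stored under the key None) are ignored.
        yield {
            key: (value or "").strip()
            for key, value in row.items()
            if key is not None
        }


entries = list(read_entries(io.StringIO(SAMPLE_TSV)))
for entry in entries:
    print(entry["Headword"], "->", entry["Sense (translation)"] or entry["Cf."])
```

For a file on disk, the same function can be given open("Sample_dictionary.tsv", encoding="utf-8", newline="") instead of the in-memory sample.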

In order to carry out the conversion correctly, we set a few conventions:

Senses are separated using ; or ,.
Any part-of-speech tag can be used, as long as the correct mappings are provided in the configuration file. This applies to the Word, MultiwordExpression and Affix classes in Ontolex-Lemon.
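
The two conventions above can be sketched as follows. This is only an illustration under stated assumptions: the POS_TO_LEXINFO table is hypothetical (the actual layout of configuration.json is not shown here, and treating m and f as nouns is a guess based on the sample), and the helper names split_senses and map_pos are not from converter.py.

```python
import re

# Hypothetical mapping from the dictionary's own tags to LexInfo
# part-of-speech names. The real configuration.json may be organised
# differently; "m" and "f" are assumed to mark masculine and feminine nouns.
POS_TO_LEXINFO = {
    "m": "noun",
    "f": "noun",
    "adj": "adjective",
    "v.t.": "verb",
    "excl": "interjection",
}


def split_senses(cell):
    """Split a sense cell on ';' or ',' and drop empty pieces."""
    return [part.strip() for part in re.split(r"[;,]", cell) if part.strip()]


def map_pos(tag):
    """Return the mapped LexInfo name, or None when the tag has no mapping."""
    return POS_TO_LEXINFO.get(tag.strip())


print(split_senses("announcement; awareness"))
print(map_pos("v.t."))
```

Returning None for an unmapped tag, rather than raising an error, leaves it to the caller to decide whether to skip the tag or report it.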

The results are created in a folder named after the source language, as in Kurmanji.
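
To give an idea of what an Ontolex-Lexicog description of one row can look like, here is a minimal sketch that builds Turtle by hand. It does not reproduce the tool's actual output: the base namespace http://example.org/kurmanji/, the identifier scheme, the function names, and the use of skos:definition for the translation gloss are all assumptions made for illustration. The language tags kmr (Kurmanji) and en are likewise assumed defaults.

```python
PREFIXES = """\
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexicog: <http://www.w3.org/ns/lemon/lexicog#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix : <http://example.org/kurmanji/> .

"""


def turtle_literal(text, lang):
    """Return a language-tagged Turtle literal with quotes and backslashes escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_turtle(entry_id, headword, senses, source_lang="kmr", target_lang="en"):
    """Serialise one headword and its senses as Turtle (hypothetical layout).

    entry_id is assumed to be a plain ASCII identifier such as "e1".
    """
    lex = f":{entry_id}_lex"
    sense_ids = [f":{entry_id}_sense{i}" for i in range(1, len(senses) + 1)]
    written = turtle_literal(headword, source_lang)
    form = f"ontolex:canonicalForm [ ontolex:writtenRep {written} ]"
    out = [
        f":{entry_id} a lexicog:Entry ;",
        f"    lexicog:describes {lex} .",
        "",
        f"{lex} a ontolex:LexicalEntry ;",
    ]
    if sense_ids:
        out.append(f"    {form} ;")
        out.append(f"    ontolex:sense {', '.join(sense_ids)} .")
    else:
        out.append(f"    {form} .")
    for sense_id, gloss in zip(sense_ids, senses):
        out.append("")
        out.append(f"{sense_id} a ontolex:LexicalSense ;")
        out.append(f"    skos:definition {turtle_literal(gloss, target_lang)} .")
    return "\n".join(out) + "\n"


print(PREFIXES + entry_to_turtle("e1", "agir", ["fire"]))
```

In a real conversion, an RDF library would normally take care of escaping and serialisation; the string-based version is used here only to keep the example free of dependencies.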
