Embeddings can be visualized with SemioGraph. We have already created many embeddings, which are available for download at http://embeddings.texttechnologylab.org.
Mehler et al. 2020b: "The Frankfurt Latin Lexicon. From Morphological Expansion and Word Embeddings to SemioGraphs"
The preprocessing was done with TextImager (https://textimager.hucompute.org/). The resulting file (data/example.tei) was then converted into a training file (data/trainingfile.txt) in the input format expected by Mikolov's word2vec; other formats are also possible. The example is a small extract from the New York Times corpus (2018 / 01).
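As a rough illustration of the training-file step, the sketch below turns tokenized sentences into plain text with whitespace-separated tokens and one sentence per line, which is the kind of input the word2vec tool reads. The function names and the in-memory list of sentences are assumptions made for this example; it does not parse data/example.tei and is not the conversion used for data/trainingfile.txt.

```python
from pathlib import Path


def to_training_lines(sentences):
    """Yield one whitespace-separated line per tokenized sentence.

    `sentences` is an iterable of token lists (hypothetical input; the
    TEI parsing step is not shown here).
    """
    for tokens in sentences:
        cleaned = [t.strip() for t in tokens if t.strip()]
        if cleaned:  # skip empty sentences
            yield " ".join(cleaned)


def write_training_file(sentences, path):
    """Write the training lines to `path` as UTF-8 text."""
    text = "\n".join(to_training_lines(sentences)) + "\n"
    Path(path).write_text(text, encoding="utf-8")
```

Calling write_training_file(sentences, "trainingfile.txt") would produce a file that can be passed as $trainingsfile in the commands below.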
After the training file has been created, the embeddings have to be computed. There are several tools for this; we use word2vec and fastText. Adjusting their parameters yields different embeddings. The parameters used to generate each embedding in SemioGraph are documented in the individual entries.
word2vec
word2vec -train $trainingsfile -output $trainingresult
word2vec -train $trainingsfile -output $trainingresult -size 500 -window 5 -iter 5
- size Size (dimensionality) of the word vectors
- window Maximum skip length between words
- iter Number of training iterations
- cbow Set to 1 to use the continuous bag-of-words model, or to 0 to use the skip-gram model
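To make the window and cbow options more concrete, here is a minimal sketch of how training examples can be derived from a token sequence. It assumes a fixed symmetric window; the reference word2vec implementation additionally samples a smaller effective window at random for each position and applies subsampling, both of which are omitted here. The function names are made up for this illustration.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: skip-gram predicts each context word."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cbow_examples(tokens, window):
    """Yield (context_words, center) examples: CBOW predicts the center word."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            yield context, center
```

A larger window yields more pairs per token and tends to capture broader, more topical similarity, at the cost of longer training.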
fastText
fasttext skipgram -input $trainingsfile -output $trainingresult
fasttext skipgram -input $trainingsfile -output $trainingresult -dim 300 -minn 3 -maxn 6 -epoch 5
- skipgram Replace with cbow to use continuous bag of words model instead
- dim Set size of word vectors
- epoch Number of training iterations (epochs)
- minn, maxn Minimum/maximum length of character n-grams
For a list of all available arguments see https://github.com/facebookresearch/fastText/
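The effect of minn and maxn can be illustrated with a short sketch that lists the character n-grams of a word. It assumes the usual fastText convention of wrapping the word in the boundary markers < and > before extracting n-grams; the function name is hypothetical, and details of the real implementation (hashing n-grams into buckets, the order in which they are generated, and the separate entry for the whole word) are left out.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return all character n-grams of `word` with minn <= n <= maxn.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from word-internal n-grams.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


print(char_ngrams("cat", minn=3, maxn=4))
# ['<ca', 'cat', 'at>', '<cat', 'cat>']
```

Because a word vector is built from the vectors of its n-grams, smaller minn values share more information between morphologically related forms, which is relevant for highly inflected languages.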
The SemioGraph web interface uses SemioGraph shelve files for visualization. The shelves contain precomputed nearest-neighbor data, so that neighbors can be looked up quickly without loading the full embedding files. Optionally, we also add DDC topic labels using text2ddc (available at https://textimager.hucompute.org/DDC/). To generate these shelves we use Gensim.
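The idea of precomputing neighbors can be sketched with the Python standard library alone. The example below is only an illustration under stated assumptions: it reads vectors in the word2vec text format (a header line followed by one word and its values per line), uses a brute-force cosine similarity instead of Gensim, and stores a plain list of (word, similarity) tuples per word. The actual key and value layout written by semiograph_shelve_create.py is not described here and may differ.

```python
import math
import shelve


def load_text_vectors(lines):
    """Parse word2vec text format: 'count dim' header, then 'word v1 v2 ...'."""
    rows = iter(lines)
    next(rows)  # skip the header line
    vectors = {}
    for line in rows:
        parts = line.split()
        if len(parts) > 1:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, top_n):
    """Map each word to its top_n most similar other words (brute force)."""
    result = {}
    for word, vec in vectors.items():
        scored = [(other, cosine(vec, other_vec))
                  for other, other_vec in vectors.items() if other != word]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[word] = scored[:top_n]
    return result


def write_shelve(neighbors, path):
    """Store the neighbor lists in a shelve file keyed by word."""
    with shelve.open(path) as db:
        for word, nearest in neighbors.items():
            db[word] = nearest
```

The brute-force loop is quadratic in the vocabulary size, so it only suits small examples; for real embeddings a vectorized similarity search such as the one Gensim provides is the practical choice.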
python3 semiograph_shelve_create.py [-n TOP_N] [--is_binary_file] [-l LANG] [--ignore_pos] [-i EMBEDDING_ID] [-c MIN_COUNT] [--min_count_corpus MIN_COUNT_CORPUS] embeddings_input_file shelve_output_file
python3 semiograph_shelve_create.py $trainingresult "${trainingresult}.shelve" -n 50 --ignore_pos
We provide a simple shelve-viewer script for checking the contents of a shelve. It opens a command prompt that allows you to query the shelve for words.
python3 semiograph_shelve_viewer.py "${trainingresult}.shelve"
When using SemioGraph, please cite the following publication, in accordance with the AGPL licence.
@article{Mehler:et:al:2020b,
author={Mehler, Alexander and Geelhaar, Tim and Henlein, Alexander and Abrami, Giuseppe and Baumartz, Daniel and Uslu, Tolga and Hemati, Wahed and Jussen, Bernhard},
title={The Frankfurt Latin Lexicon. From Morphological Expansion and Word Embeddings to SemioGraphs},
journal={Studi e Saggi Linguistici},
year={2020},
note={accepted}
}

