This is a tool for presenting word2vec models as graphs using the networkx library.
This work is part of the VerbNet project developed in the Computational Linguistics master's programme at the Philology Department of the Higher School of Economics, Moscow. For further information on the project, visit http://web-corpora.net/wsgi3/ru-verbs/
We are building a lexicon of Russian verbs modeled on Princeton WordNet, and we aim to get the best of both data-driven and lexicography-based approaches. There are many data-driven ways to populate the synset base. The first one we tried is to build a proto semantic network from a word2vec model, which is essentially a distributional semantic model that represents words as dense vectors of reduced dimensionality while largely preserving the distances between them. The general idea is as follows:
- Take a word2vec model trained by Andrey Kutuzov.
- Fix either a similarity threshold T or a number of most similar words N.
- Create a graph of all the verbs in the model that are also present in the Lyashevskaya-Sharoff frequency dictionary. The frequency dictionary is used to filter out typos and obscure words, which are of little use to a statistical model anyway.
- For each vertex in the graph, add an edge between it and every other vertex that either has a similarity score above T according to the model or is among that vertex's N most similar words.
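The steps above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the function name `build_graph`, the toy vectors, and the `allowed` set (standing in for the frequency dictionary) are all assumptions made for the example. A real run would take similarities from the word2vec model instead of the hand-written cosine function.

```python
import math

import networkx as nx


def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def build_graph(vectors, allowed, threshold=None, top_n=None):
    """Build a similarity graph (hypothetical helper, not the notebook's API).

    vectors: dict mapping a word to its vector (stand-in for the model).
    allowed: set of words found in the frequency dictionary.
    Exactly one of threshold (T) or top_n (N) must be given.
    """
    if (threshold is None) == (top_n is None):
        raise ValueError("set exactly one of threshold or top_n")

    # Keep only words that pass the frequency dictionary filter.
    words = sorted(w for w in vectors if w in allowed)
    graph = nx.Graph()
    graph.add_nodes_from(words)

    for word in words:
        # Score every other vertex, most similar first.
        scored = sorted(
            ((cosine(vectors[word], vectors[other]), other)
             for other in words if other != word),
            reverse=True,
        )
        if threshold is not None:
            neighbours = [(s, o) for s, o in scored if s > threshold]
        else:
            neighbours = scored[:top_n]
        for score, other in neighbours:
            graph.add_edge(word, other, weight=score)
    return graph


# Toy data: two pairs of near-synonyms and one typo absent from the dictionary.
vectors = {
    "идти": [1.0, 0.1],
    "шагать": [0.9, 0.2],
    "думать": [0.1, 1.0],
    "размышлять": [0.2, 0.9],
    "шогать": [0.9, 0.25],
}
allowed = {"идти", "шагать", "думать", "размышлять"}

graph = build_graph(vectors, allowed, threshold=0.8)
print(sorted(graph.edges()))
```

Because the graph is undirected, using N most similar words means an edge appears if either endpoint lists the other among its top N, so a vertex can end up with more than N neighbours.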
The graph constructed this way is then analyzed one connected component at a time using a graph community detection algorithm, which is basically a way of clustering graph vertices. The resulting communities often look like feasible synsets, or make good sense as (somewhat narrow) semantic classes that can be partitioned further to get synset-like communities. The results are intended to be used for populating the wordnet database.
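A rough sketch of the per-component analysis is given below. It is an assumption-laden illustration: the notebook relies on the bundled community.py module, whereas this example swaps in networkx's own `greedy_modularity_communities` so that it stays self-contained. The helper name `detect_communities`, the `min_size` cutoff, and the toy graph are hypothetical.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def detect_communities(graph, min_size=3):
    """Return a list of vertex sets, found one connected component at a time.

    Components smaller than min_size are kept whole, since there is
    nothing meaningful to split in them (assumption of this sketch).
    """
    communities = []
    for nodes in nx.connected_components(graph):
        if len(nodes) < min_size:
            communities.append(set(nodes))
            continue
        component = graph.subgraph(nodes)
        # Stand-in algorithm: greedy modularity maximization from networkx.
        for community in greedy_modularity_communities(component):
            communities.append(set(community))
    return communities


# Toy graph: two triangles joined by a single bridge, plus a separate pair.
toy = nx.Graph()
toy.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # first dense group
    ("d", "e"), ("e", "f"), ("d", "f"),   # second dense group
    ("c", "d"),                           # weak link between the groups
    ("g", "h"),                           # small separate component
])

for community in detect_communities(toy):
    print(sorted(community))
```

To partition a broad community further, the same function can be applied again to `graph.subgraph(community)`.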
You can check out the initial report and slides on the matter.
- python 3.4+
- ipython/jupyter notebook
- scipy
- numpy
- matplotlib
- gensim
- networkx
- Clone this repository or just download and unzip it.
- Create a models folder in the same folder as your verb2graph.ipynb.
- Put your models there. You can get them from http://ling.go.mail.ru/dsm/en/about#models.
- Start your ipython/jupyter notebook there, specify the file name of the model you want to work with, and you are all set.
community.py is a module developed by third-party developers. It is placed here only for your convenience, as it was originally written in Python 2 and had to be converted to work with a Python 3 notebook.
You can also: