GraphGlove

Supplementary code for the EMNLP 2020 paper "Embedding Words in Non-Vector Space with Unsupervised Graph Learning".

What does it do?

It learns to represent words as a graph with learned weights and topology. The resulting graph outperforms vector-based embeddings (GloVe and Poincare Glove) on a set of word similarity and analogy benchmarks.
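To make the idea concrete, here is a minimal sketch of how a distance between two words can be read off a weighted graph as the length of the shortest path between them. It is an illustration only, not the repository's implementation: the function name `shortest_path_length`, the toy words, and the edge weights are all invented for this example.

```python
import heapq

def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph with non-negative weights."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")  # target is unreachable from source

# Hypothetical toy graph: the words and weights are made up for illustration.
edges = [("cat", "dog", 1.0), ("dog", "wolf", 1.5),
         ("cat", "wolf", 4.0), ("wolf", "forest", 2.0)]
toy_graph = {}
for a, b, w in edges:
    toy_graph.setdefault(a, {})[b] = w  # store each edge in both directions
    toy_graph.setdefault(b, {})[a] = w

print(shortest_path_length(toy_graph, "cat", "wolf"))  # 2.5, via "dog"
```

In this toy graph the direct edge between "cat" and "wolf" is longer than the route through "dog", so the graph distance is determined by topology as well as by individual edge weights.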

What do I need to run it?

A machine with many CPU cores (preferably 8+)
- Our implementation does not support GPUs, but it can run across multiple CPU machines;
A popular Linux x64 distribution
- Tested on Ubuntu 16.04 and Ubuntu 18.04; should work fine on any popular 64-bit Linux and even macOS;
- Windows and x32 systems may require heavy wizardry to run;
- When in doubt, use Docker, e.g. this one.

Environment setup

Clone or download this repo. cd to its root directory.
Grab or build a working Python environment. Anaconda works fine.
sudo apt-get install redis-server build-essential gcc g++ swig gcc-4.8 g++-4.8
- If you don't have superuser privileges, redis-server can be installed locally;
Install packages from requirements.txt. All but three can be installed with pip:

faiss is installed using the official installation guide;
torch can be installed with pip, but we recommend this page for more options;
glove-python installs normally on Python 3.6; for Python >= 3.7, use this workaround.

You're now ready to run the code!

Training the model

Training runs in two processes:

scripts/run_master.py - builds the model, does the optimization, evaluates metrics - basically runs everything except for pathfinding;
scripts/run_worker.py - periodically loads the model and finds shortest paths for training batches, sends results back to master;
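The division of labour between the two scripts can be sketched with in-process queues. This is only an illustration of the pattern, under the assumption that the master hands out batches of word pairs and collects path lengths. The real scripts run as separate processes that talk over the network (the setup installs redis-server), and the names `worker_loop`, `run_master`, and `toy_lengths` are hypothetical.

```python
import queue
import threading

def worker_loop(tasks, results, path_length):
    """Worker side: take a batch of (source, target) pairs, answer it, send it back."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel: the master asks this worker to stop
            break
        batch_id, pairs = item
        results.put((batch_id, [path_length(s, t) for s, t in pairs]))

def run_master(batches, path_length, n_workers=2):
    """Master side: hand out batches, collect answers, return them in batch order."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=worker_loop, args=(tasks, results, path_length))
               for _ in range(n_workers)]
    for worker in workers:
        worker.start()
    for batch_id, pairs in enumerate(batches):
        tasks.put((batch_id, pairs))
    collected = {}
    for _ in batches:
        batch_id, lengths = results.get()
        collected[batch_id] = lengths  # a real master would run an optimization step here
    for _ in workers:
        tasks.put(None)
    for worker in workers:
        worker.join()
    return [collected[i] for i in range(len(batches))]

# Stand-in for pathfinding: a lookup table of made-up path lengths.
toy_lengths = {("a", "b"): 1.0, ("b", "c"): 2.0, ("a", "c"): 3.0}
batches = [[("a", "b")], [("b", "c"), ("a", "c")]]
print(run_master(batches, lambda s, t: toy_lengths[(s, t)]))
```

Keeping pathfinding behind a queue means the master never blocks on a single slow batch, which is the same reason the real setup can add workers on other machines.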

To run locally, first download wiki17 50k co-occurrences:

# assuming you are in repository root folder
mkdir -p word_embeddings_50k
cd word_embeddings_50k
wget https://www.dropbox.com/s/rub6334e2cuxcar/init_glove_50k_dot.npz?dl=1 -O init_glove.npz
wget https://www.dropbox.com/s/7wrhagpz4elk200/cooc_50k.pkl?dl=1 -O cooc.pkl
cd -

# If you want to brew your own corpora, use ./scripts/data_preprocessing.py
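Before starting training, it can be worth checking that the downloads produced real archives rather than an HTML error page. The sketch below relies only on the fact that an .npz file is a zip archive of .npy arrays. The helper name `list_npz_arrays` is hypothetical, and no assumption is made about which array names the file contains.

```python
import os
import zipfile

def list_npz_arrays(path):
    """Return the array names stored in an .npz archive (a zip of .npy files)."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a zip archive; the download may have failed")
    with zipfile.ZipFile(path) as archive:
        return [name[:-len(".npy")] for name in archive.namelist()
                if name.endswith(".npy")]

if __name__ == "__main__":
    # Path taken from the download commands above; skipped if the file is absent.
    path = os.path.join("word_embeddings_50k", "init_glove.npz")
    if os.path.exists(path):
        print(list_npz_arrays(path))
```

The same check does not apply to cooc.pkl, which is a pickle file and can only be validated by loading it in an environment where the repository's dependencies are installed.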

Then run the master script:

python scripts/run_master.py --port 6999 --password securepassword123 --batch_size 64 --buffer_size_multiple 2 \
  --data_dir word_embeddings_50k --report_frequency 250 --lr 0.01 --exp_name word2graph_simple_50k \
  --soft --clip_prob_max 0.99 --knn_edges 64 --random_edges 10 --base_lambda 5 --lambda_warmup 5000

Finally, run the worker script:

python3 scripts/run_worker.py --host localhost --port 6999 --password securepassword123 --n_threads -1 --restart_on_error

The worker is designed to wait until the master begins training. If training has not started yet, the worker will periodically print error messages and restart. You can set the datasets for word similarity evaluation with --track-benchmarks. For word analogy evaluation, refer

Distributed training: you can train faster by using multiple CPU machines.

First, run the master process on the most powerful machine.
Then, run one worker (scripts/run_worker.py) on each additional machine, changing --host localhost to the master's host.

This setup assumes that workers can access master on port 6999. If not, either pick a different port or use ssh port forwarding.
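A quick way to check this assumption from a worker machine is to attempt a plain TCP connection to the master. The helper below is a hypothetical sketch that uses only the Python standard library. It tests reachability of the port and nothing else, so it does not verify the password or that the master has started training.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unresolvable host, and similar failures
        return False

if __name__ == "__main__":
    # Replace "localhost" with the master's host when checking from a worker machine.
    print(can_connect("localhost", 6999))
```

If this prints False while the master is running, the port is probably blocked between the two machines, which is the case where a different port or port forwarding is needed.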

Training does not scale infinitely: once the trainer reports that it "receives a new batch" in less than 0.05 seconds, you have hit the worker cap, and adding more workers will not speed up training.

Play around with trained embeddings

You can explore the trained embeddings using this notebook. By default, it works with a pre-trained model that was obtained on the 50K most frequent tokens from Wikipedia 2017 with a distance-based loss. However, you can seamlessly replace that pre-trained model with the one you trained in the previous section.

Contacts

For any issues with running the code or questions about our work, please create an issue or contact us by email.

References

If you find this repository useful, please cite the paper:

@inproceedings{ryabinin2020embedding,
  title={Embedding Words in Non-Vector Space with Unsupervised Graph Learning},
  author={Ryabinin, Max and Popov, Sergei and Prokhorenkova, Liudmila and Voita, Elena},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2020}
}
