Code for Mimicking Word Embeddings using Subword RNNs (EMNLP 2017)

Mimick

I'm adding details to this documentation as I go. When I'm through, this comment will be gone.

Dependencies

The main dependency for this project is DyNet; install it by following the DyNet project's own instructions. DyNet 2.0 has just been released, and I hope to upgrade this project and its models to that version at some point.

Create Mimick models

The mimick directory contains the scripts relevant to the Mimick model: dataset creation, model creation, and intrinsic analysis. Its models subdirectory contains models trained for all 23 languages mentioned in the paper. If you're using the pre-trained models, you don't need anything else from the mimick directory in order to run the tagging model. If you train new models, please add them here via pull request!

Tag parts-of-speech and morphosyntactic attributes using trained models

The root directory of this repository contains the code required to perform extrinsic analysis on Universal Dependencies data. Vocabulary files are supplied in the vocabs directory.

The entry point is model.py, which consumes tagging datasets created by the make_dataset.py script. model.py accepts pre-trained word embeddings as text files with no header line. For Mimick models, the mimick/model.py script writes this exact format to the path given in its --output argument. For Word2Vec, FastText, or Polyglot models, such a file can be created with the scripts/output_word_vectors.py script, which accepts a model (.pkl or .bin) and the desired output vocabulary (.txt).
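
To make the "text file with no header" format concrete, here is a minimal sketch of writing and reading such a file. It assumes one word per line followed by its whitespace-separated vector components, which is the usual convention but is not spelled out above. The function names write_embeddings and read_embeddings are hypothetical and are not part of this repository.

```python
import io


def write_embeddings(vectors, stream):
    """Write {word: [floats]} as headerless text: 'word v1 v2 ... vn' per line.

    Assumed layout; no 'vocab_size dimension' header line is written.
    """
    for word, vec in vectors.items():
        stream.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


def read_embeddings(stream):
    """Read headerless embeddings and check that every vector has the same length.

    A word2vec-style header such as '1000 300' would be parsed as a
    one-dimensional vector, so the next line triggers the dimension check.
    """
    vectors = {}
    dim = None
    for line_no, line in enumerate(stream, start=1):
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)
        elif len(values) != dim:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(values)}"
            )
        vectors[word] = values
    return vectors


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_embeddings({"cat": [0.1, -0.2], "dog": [0.3, 0.4]}, buffer)
buffer.seek(0)
loaded = read_embeddings(buffer)
```

If a file exported from another toolkit starts with a header line, the dimension check above will raise an error rather than silently loading a bogus first entry; in that case the header needs to be stripped before the file is passed on.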