Pre-trained ELMo Representations for Many Languages

We release ELMo representations trained on many languages. They helped us win the CoNLL 2018 shared task on Universal Dependencies Parsing, as ranked by LAS.

Technique Details

We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN. For each language, we train the parameters on 20 million words randomly sampled from the raw text released by the shared task (Wikipedia dump + Common Crawl). Our code is largely based on AllenNLP, with the following changes:

We support Unicode characters;
We use the sampled softmax technique (Jean et al., 2015) to make training with a large vocabulary feasible. However, we use a window of words surrounding the target word as negative samples, which performed better in our preliminary experiments.
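
The window-based negative sampling described above can be sketched in pure Python. This is only an illustration of the idea, not the training code in this repo: the function names `window_candidates` and `sampled_softmax_loss`, the window size, and the use of plain lists for vectors are all assumptions made for the example.

```python
import math

def window_candidates(token_ids, position, window=2):
    """Return the target id followed by the distinct ids within `window` tokens of it."""
    target = token_ids[position]
    lo = max(0, position - window)
    hi = min(len(token_ids), position + window + 1)
    negatives = []
    for i in range(lo, hi):
        tok = token_ids[i]
        # Skip the target itself and avoid duplicate negatives.
        if i != position and tok != target and tok not in negatives:
            negatives.append(tok)
    return [target] + negatives

def sampled_softmax_loss(hidden, output_embeddings, candidates):
    """Cross-entropy of the target (candidates[0]) under a softmax over `candidates` only."""
    logits = [sum(h * w for h, w in zip(hidden, output_embeddings[c]))
              for c in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[0]
```

The point of the restriction is cost: the normalizer is computed over a handful of candidates rather than the full vocabulary.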

The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU.

Downloads


Arabic	Bulgarian	Catalan	Czech
Old Church Slavonic	Danish	German	Greek
English	Spanish	Estonian	Basque
Persian	Finnish	French	Irish
Galician	Ancient Greek	Hebrew	Hindi
Croatian	Hungarian	Indonesian	Italian
Japanese	Korean	Latin	Latvian
Norwegian Bokmål	Dutch	Norwegian Nynorsk	Polish
Portuguese	Romanian	Russian	Slovak
Slovene	Swedish	Turkish	Uyghur
Ukrainian	Urdu	Vietnamese	Chinese

The models are hosted on the NLPL Vectors Repository.

ELMo for Simplified Chinese

We also provide an ELMo for Simplified Chinese. It was trained on the Xinhua portion of Chinese Gigaword v5, unlike the Traditional Chinese ELMo, which was trained on Wikipedia.

Pre-requirements

Python >= 3.6 is required (with Python 3.5 you will run into issue HIT-SCIR#8)
PyTorch 0.4
the other requirements of AllenNLP
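
A quick way to check the first two requirements before running anything is sketched below. The helper names `check_python` and `check_torch` are hypothetical and not part of this repo; the second function only inspects a version string, so it can be called with `torch.__version__` once PyTorch is installed.

```python
import sys

def check_python(version_info=sys.version_info):
    """True if the interpreter is Python 3.6 or newer."""
    return tuple(version_info[:2]) >= (3, 6)

def check_torch(version_string):
    """True if a version string such as '0.4.1' belongs to the 0.4 series."""
    return version_string.split(".")[:2] == ["0", "4"]
```

For example, `check_torch(torch.__version__)` after `import torch`.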

Usage

First, after unzipping the model, change the "config_path" field in ${lang}.model/config.json to ${project_home}/configs/cnn_50_100_512_4096_sample.json.

Then prepare your input file in the CoNLL-U format, for example:
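
If you prefer to script this step, a minimal sketch using only the standard library follows. The function name `set_config_path` is hypothetical, and the sketch assumes config.json is a plain JSON object; every other field is written back unchanged.

```python
import json
import os

def set_config_path(model_dir, project_home,
                    config_name="cnn_50_100_512_4096_sample.json"):
    """Point the model's config.json at the network config shipped in this repo."""
    model_config = os.path.join(model_dir, "config.json")
    with open(model_config, encoding="utf-8") as f:
        config = json.load(f)
    # Only the "config_path" field is changed.
    config["config_path"] = os.path.join(project_home, "configs", config_name)
    with open(model_config, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
    return config["config_path"]
```

For example, `set_config_path("zht.model", "/path/to/ELMoForManyLangs")`, where both paths are placeholders for your own locations.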

1   Sue    Sue    _   _   _   _   _   _   _
2   likes  like   _   _   _   _   _   _   _
3   coffee coffee _   _   _   _   _   _   _
4   and    and    _   _   _   _   _   _   _
5   Bill   Bill   _   _   _   _   _   _   _
6   tea    tea    _   _   _   _   _   _   _

Fields should be separated by '\t'. Only the second column is used, and it may contain spaces (' '); in Vietnamese, for example, a word can contain spaces. Remember to tokenize your text first!
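
If your sentences are already tokenized in Python, a file in this layout can be produced with a few lines. This is a sketch under assumptions: the helper name `write_conllu` is hypothetical, every column except the index and the word form is filled with the `_` placeholder, and sentences are separated by a blank line as in standard CoNLL-U.

```python
def write_conllu(sentences, path):
    """Write tokenized sentences as 10-column, tab-separated lines."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            for index, token in enumerate(tokens, start=1):
                # Column 1 is the token index, column 2 the word form;
                # the remaining eight columns are '_' placeholders.
                columns = [str(index), token] + ["_"] * 8
                f.write("\t".join(columns) + "\n")
            # A blank line ends each sentence.
            f.write("\n")
```

Because the columns are joined with tabs, a token that contains spaces stays within the second column.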

When it's all set, run

python src/gen_elmo.py test \
    --input_format conll \
    --input /path/to/your/input \
    --model /path/to/your/model \
    --output_prefix /path/to/your/output \
    --output_format hdf5 \
    --output_layer -1

This dumps an HDF5-encoded dict to disk, where each key is the '\t'-separated words of a sentence and the value is its 3-layer averaged ELMo representation. You can also dump the CNN-encoded words with --output_layer 0, the first LSTM layer with --output_layer 1, and the second LSTM layer with --output_layer 2.
We are actively changing the interface to bring it closer to the AllenNLP ELMo and make it friendlier to programmatic use.
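
Reading the dump back could look like the sketch below, which assumes the third-party h5py package is installed. The helper names are hypothetical, and the output file name is passed in explicitly because how it is derived from --output_prefix is not stated here.

```python
def sentence_key(tokens):
    """Build the lookup key: the words of the sentence joined by tabs."""
    return "\t".join(tokens)

def load_sentence_vectors(hdf5_path, tokens):
    """Return the stored array for one sentence from the dumped HDF5 file."""
    import h5py  # third-party; imported lazily so the key helper works without it
    with h5py.File(hdf5_path, "r") as f:
        # Indexing with [()] reads the whole dataset into a numpy array.
        # Caveat: HDF5 treats '/' as a path separator, so tokens containing
        # '/' may need special handling; this sketch does not cover that.
        return f[sentence_key(tokens)][()]
```

For the sample input above, the key would be `sentence_key(['Sue', 'likes', 'coffee', 'and', 'Bill', 'tea'])`.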

Convert lists of tokens to vectors inside your own code

With the Embedder Python object, you can easily integrate ELMo into your own code like this:

# Import from outside the top directory of this repo.
from ELMoForManyLangs import elmo

e = elmo.Embedder()

# A list of sentences, each a list of tokens
# (segment the text first if necessary).
sents = [['今', '天', '天氣', '真', '好', '阿'],
         ['潮水', '退', '了', '就', '知道', '誰', '沒', '穿', '褲子']]

e.sents2elmo(sents)
# Returns a list of numpy arrays,
# each with shape=(seq_len, embedding_size).

the parameters to init Embedder:

class Embedder(model_dir='zht.model/', batch_size=64):

model_dir: the relative path from the repo's top directory to your model directory. (default: zht.model/)
batch_size: the batch size used during inference; choose it according to your GPU/CPU memory. (default: 64)

the parameters of the function sents2elmo:

def sents2elmo(sents, output_layer=-1):

sents: a list of lists holding the sentences, segmented into tokens first if necessary.
output_layer: the target layer to output.
- 0 for the word encoder
- 1 for the first LSTM hidden layer
- 2 for the second LSTM hidden layer
- -1 for an average of 3 layers. (default)
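
To make the meaning of output_layer concrete, here is a small pure-Python sketch of the selection rule. It illustrates the semantics listed above and is not the code inside Embedder; the function name `select_layer` and the nested-list representation are assumptions for the example.

```python
def select_layer(layers, output_layer=-1):
    """Pick one layer or average all of them.

    `layers` holds three matrices of shape (seq_len, dim), ordered as
    word encoder, first LSTM layer, second LSTM layer.
    """
    if output_layer == -1:
        seq_len = len(layers[0])
        dim = len(layers[0][0])
        # Element-wise mean over the three layers.
        return [[sum(layer[t][d] for layer in layers) / len(layers)
                 for d in range(dim)]
                for t in range(seq_len)]
    if output_layer in (0, 1, 2):
        return layers[output_layer]
    raise ValueError("output_layer must be -1, 0, 1 or 2")
```

In every case the result keeps the (seq_len, dim) shape, so downstream code does not need to change when you switch layers.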

Training Your Own ELMo

Please run

python src/biLM.py train -h

for more details about ELMo training. Note, however, that the training process is not very stable: in some cases we end up with a loss of nan. We are actively working on this and hope to improve it in the future.
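
If you write your own training loop, one common way to notice this early is to test the loss before applying an update. The sketch below is a generic guard and an assumption on our part, not something src/biLM.py is stated to do; `should_apply_update` is a hypothetical helper name.

```python
import math

def should_apply_update(loss_value):
    """Return True only when the loss is a finite number (neither nan nor inf)."""
    return math.isfinite(loss_value)
```

A loop could skip the batch, or stop and restore the last checkpoint, whenever this returns False.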

Citation

If our ELMo gave you nice improvements, please cite us.

@InProceedings{che-EtAl:2018:K18-2,
  author    = {Che, Wanxiang  and  Liu, Yijia  and  Wang, Yuxuan  and  Zheng, Bo  and  Liu, Ting},
  title     = {Towards Better {UD} Parsing: Deep Contextualized Word Embeddings, Ensemble, and Treebank Concatenation},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {55--64},
  url       = {http://www.aclweb.org/anthology/K18-2005}
}

Please also cite the NLPL Vectors Repository for hosting the models.

@InProceedings{fares-EtAl:2017:NoDaLiDa,
  author    = {Fares, Murhaf  and  Kutuzov, Andrey  and  Oepen, Stephan  and  Velldal, Erik},
  title     = {Word vectors, reuse, and replicability: Towards a community repository of large-text resources},
  booktitle = {Proceedings of the 21st Nordic Conference on Computational Linguistics},
  month     = {May},
  year      = {2017},
  address   = {Gothenburg, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {271--276},
  url       = {http://www.aclweb.org/anthology/W17-0237}
}
