LASERtrain

This package reproduces the architecture described by Artetxe and Schwenk (2018, 2019) to train language-agnostic sentence embeddings. The authors have released a large model covering 93 languages as part of the LASER project; however, the code used to train it remains unreleased. The code in this repository approximates the authors' implementation, based on the description of the architecture and training parameters provided in their publications.

At the moment, the models produced with this software are not compatible with the models available in the LASER project; this limitation will be tackled in the near future.

The package includes instructions to reproduce the experiments described in Artetxe and Schwenk (2019), in which a model is trained on the UN v1.0 corpus and evaluated on the data released for the BUCC'18 shared task.

Requirements

The following packages are required to run this package and reproduce the reported results:

  • Fairseq
  • Python
  • PyTorch
  • NumPy
  • Sentencepiece
  • Faiss, for fast similarity search and bitext mining
  • jieba 0.39, Chinese segmenter (pip install jieba)
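
As a quick sanity check before starting the tutorial, the following minimal sketch reports which of the packages listed above cannot be found in the current Python environment. The import names (fairseq, torch, numpy, sentencepiece, faiss, jieba) are the usual module names for these packages and are assumed here; the helper missing_packages is hypothetical and not part of this repository. The sketch only checks that each module can be located; it does not verify versions such as jieba 0.39.

```python
import importlib.util

# Assumed import names for the packages listed in the requirements.
REQUIRED_MODULES = {
    "Fairseq": "fairseq",
    "PyTorch": "torch",
    "NumPy": "numpy",
    "Sentencepiece": "sentencepiece",
    "Faiss": "faiss",
    "jieba": "jieba",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the display names of packages whose module cannot be located."""
    return [
        name
        for name, module in modules.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```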

Tutorial: train and evaluate your LASER model

In this section, we reproduce the experiments carried out by Artetxe and Schwenk (2019).

Download and prepare data

The first step is to download the datasets needed to train and evaluate our model. Two datasets are required:

  • UN v1.0 corpus: the multilingual corpus on which our model will be trained
  • BUCC 2018 shared task data: the training and test data for the BUCC 2018 shared task that will be used to evaluate our model

Note that the UN corpus does not cover one of the language pairs in BUCC 2018: German-English. To deal with this, Artetxe and Schwenk (2019) train a second model on the Europarl multilingual corpus. This tutorial does not cover that second experiment, although the same steps could be applied to train it.

For BUCC 2018, download all four training data packages and all four test data packages to the sub-directory data. Once downloaded, uncompress each package with tar, for example: tar xjf bucc2018-ru-en.test.tar.bz2
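
Running tar once per package can also be scripted. The sketch below is one possible way to do it with Python's standard tarfile module. It assumes that every downloaded package is a .tar.bz2 archive placed directly in the data sub-directory and that the contents should be extracted into that same directory; the function name extract_packages is hypothetical and not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_packages(data_dir="data"):
    """Extract every .tar.bz2 archive found directly in data_dir.

    Archives are extracted into data_dir itself (an assumption that mirrors
    running `tar xjf` from inside that directory). Returns the names of the
    archives that were processed, in sorted order.
    """
    extracted = []
    for archive in sorted(Path(data_dir).glob("*.tar.bz2")):
        with tarfile.open(archive, "r:bz2") as tar:
            tar.extractall(path=data_dir)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    for name in extract_packages():
        print("Extracted", name)
```

Only archives obtained from the official shared task site should be extracted this way, since extractall trusts the paths stored in the archive.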

Acknowledgements

Developed by Universitat d'Alacant as part of its contribution to the GoURMET project, which received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825299.

References
