This is a production-ready tool for aligning parallel documents at the sentence level without the need for a machine translation system or lexicon. It uses Vecalign, an accurate sentence alignment algorithm that is fast even for very long documents.
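To illustrate what sentence alignment over embeddings means, here is a minimal sketch of a monotonic alignment by dynamic programming. It is not Vecalign itself: Vecalign also handles many-to-many alignments and uses an approximation that runs in linear time, whereas this toy version is quadratic and only allows 1-1 matches and skips. The function names `cosine` and `align` and the `skip_cost` value are assumptions chosen for illustration, not part of this tool.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(left_vecs, right_vecs, skip_cost=0.6):
    """Return monotonic (left_index, right_index) pairs; None marks a skipped side.

    Hypothetical simplification: only 1-1 matches and single-sentence skips,
    computed with a full O(n * m) table. skip_cost is an arbitrary toy value.
    """
    n, m = len(left_vecs), len(right_vecs)
    # cost[i][j]: cheapest alignment of the first i left and first j right sentences.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                # Match left sentence i-1 with right sentence j-1.
                c = cost[i - 1][j - 1] + 1.0 - cosine(left_vecs[i - 1], right_vecs[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i > 0:
                # Leave left sentence i-1 unaligned.
                c = cost[i - 1][j] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j > 0:
                # Leave right sentence j-1 unaligned.
                c = cost[i][j - 1] + skip_cost
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)
    # Walk the back-pointers from the end of both documents to the start.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((i - 1, j - 1))
        elif pi == i - 1:
            pairs.append((i - 1, None))
        else:
            pairs.append((None, j - 1))
        i, j = pi, pj
    pairs.reverse()
    return pairs


# Toy vectors standing in for sentence embeddings.
left = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
right = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(align(left, right))  # [(0, 0), (1, None), (2, 1)]
```

In the toy input the second left sentence has no counterpart on the right, so it is skipped while the other two are matched.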
Instead of LASER, we use LaBSE from the sentence-transformers pre-trained models, because it performs better on parallel text mining tasks.
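As a rough sketch of how LaBSE embeddings can be obtained with sentence-transformers, the snippet below encodes sentences and compares them. How this repository actually loads and calls the model is not shown here, so the helper names `embed_sentences` and `similarity_matrix` and the model identifier `sentence-transformers/LaBSE` are assumptions for illustration.

```python
def embed_sentences(sentences, model_name="sentence-transformers/LaBSE"):
    """Encode sentences into L2-normalised vectors (hypothetical helper).

    Requires the sentence-transformers package; the model is downloaded
    on first use.
    """
    # Imported lazily so similarity_matrix works without the dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences, normalize_embeddings=True)


def similarity_matrix(left_vecs, right_vecs):
    """Dot products of already-normalised vectors, i.e. cosine similarities."""
    return [[sum(a * b for a, b in zip(u, v)) for v in right_vecs] for u in left_vecs]
```

With normalised embeddings, `similarity_matrix(embed_sentences(left), embed_sentences(right))` gives one row per left sentence and one column per right sentence, and translations of the same sentence are expected to score close to 1.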
To support the specific African languages in WMT22, we forked laserembeddings and integrated a SentencePiece tokenizer into this repository.
python3 setup.py install

- two separate docs

vector-align \
--left-doc tests/zh.file \
--right-doc tests/en.file \
--output output.txt

- one bilingual doc

vector-align \
--left-doc tests/bilingual.file \
--right-doc tests/bilingual.file \
--output output.txt

Please cite the paper if you use this tool:
@inproceedings{thompson-koehn-2019-vecalign,
title = "{V}ecalign: Improved Sentence Alignment in Linear Time and Space",
author = "Thompson, Brian and Koehn, Philipp",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1136",
doi = "10.18653/v1/D19-1136",
pages = "1342--1348",
}