Vector Align

This is a production-ready tool for aligning parallel documents at the sentence level without requiring a machine translation system or a lexicon. It uses Vecalign, an accurate sentence alignment algorithm that remains fast even on very long documents.

Instead of LASER, we use LaBSE from the sentence-transformers pre-trained models, because it performs better on parallel text mining tasks.
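As a rough illustration of why multilingual sentence embeddings allow alignment without a lexicon, the sketch below scores candidate sentence pairs by cosine similarity. It is not this repository's internal code: the helper names `cosine_matrix` and `embed_with_labse` are hypothetical, and the model identifier `sentence-transformers/LaBSE` is an assumption about which checkpoint is meant.

```python
import numpy as np


def cosine_matrix(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Return the cosine similarity between every row of `left` and `right`."""
    # Scale each embedding to unit length so a dot product equals the cosine.
    left_unit = left / np.linalg.norm(left, axis=1, keepdims=True)
    right_unit = right / np.linalg.norm(right, axis=1, keepdims=True)
    return left_unit @ right_unit.T


def embed_with_labse(sentences):
    """Embed sentences with LaBSE (hypothetical helper; downloads the model)."""
    # Imported lazily so cosine_matrix works without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    # Assumed model identifier; adjust if a different checkpoint is used.
    model = SentenceTransformer("sentence-transformers/LaBSE")
    return model.encode(sentences, convert_to_numpy=True)
```

With the model available, `cosine_matrix(embed_with_labse(left_sentences), embed_with_labse(right_sentences))` would give one score per candidate pair. An aligner such as Vecalign then searches for a consistent path through scores of this kind rather than matching each sentence independently.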

To support the specific African languages in WMT22, we forked laserembeddings and integrated a SentencePiece tokenizer into this repository.

Installation (Recommended)

python3 setup.py install

Example

Two separate documents

vector-align \
    --left-doc tests/zh.file \
    --right-doc tests/en.file \
    --output output.txt

One bilingual document

vector-align \
    --left-doc tests/bilingual.file \
    --right-doc tests/bilingual.file \
    --output output.txt
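If the aligner needs to be called from a Python pipeline, a thin wrapper around the command line is one option. The sketch below uses only the three flags shown in the examples above; the function names `build_command` and `run_vector_align` are hypothetical and not part of this package, and the sketch assumes `vector-align` is on the `PATH` after installation.

```python
import subprocess


def build_command(left_doc, right_doc, output):
    """Build the vector-align argument list from the flags shown above."""
    return [
        "vector-align",
        "--left-doc", str(left_doc),
        "--right-doc", str(right_doc),
        "--output", str(output),
    ]


def run_vector_align(left_doc, right_doc, output):
    """Run vector-align; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(left_doc, right_doc, output), check=True)
```

For instance, `run_vector_align("tests/zh.file", "tests/en.file", "output.txt")` mirrors the first example, and passing the same path for both documents mirrors the bilingual one.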

Reference

Please cite the paper if you use this tool:

@inproceedings{thompson-koehn-2019-vecalign,
    title = "{V}ecalign: Improved Sentence Alignment in Linear Time and Space",
    author = "Thompson, Brian and Koehn, Philipp",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1136",
    doi = "10.18653/v1/D19-1136",
    pages = "1342--1348",
}
