volctrans/vector-align


Vector Align

This is a production-ready tool for aligning parallel documents at the sentence level without the need for a machine translation system or lexicon. It uses Vecalign, an accurate sentence alignment algorithm that remains fast even for very long documents.
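To give a feel for what sentence alignment over a similarity matrix involves, here is a minimal sketch of monotonic alignment by dynamic programming. It is not Vecalign itself: Vecalign also handles many-to-many alignments and uses a coarse-to-fine approximation to stay linear in time and space, while this toy version only finds 1-1 pairs in quadratic time. The function name `align_monotonic` and the `skip_penalty` parameter are hypothetical and are not part of this repository.

```python
from typing import List, Optional, Tuple


def align_monotonic(sim: List[List[float]], skip_penalty: float = 0.2) -> List[Tuple[int, int]]:
    """Return ordered 1-1 pairs (left_index, right_index) maximizing total similarity.

    sim[i][j] is the similarity between left sentence i and right sentence j.
    Leaving a sentence unaligned costs skip_penalty (an illustrative value).
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    # best[i][j]: best score for aligning the first i left and first j right sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[str]]] = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((best[i - 1][j - 1] + sim[i - 1][j - 1], "match"))
            if i > 0:
                candidates.append((best[i - 1][j] - skip_penalty, "skip_left"))
            if j > 0:
                candidates.append((best[i][j - 1] - skip_penalty, "skip_right"))
            best[i][j], back[i][j] = max(candidates)

    # Walk the back-pointers from the end to recover the aligned pairs.
    pairs: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_left":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

For example, with three left sentences and two right sentences where the middle left sentence has no good counterpart, `align_monotonic([[0.9, 0.1], [0.2, 0.1], [0.1, 0.8]])` pairs left 0 with right 0 and left 2 with right 1, and leaves left 1 unaligned.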

Instead of LASER, we use LaBSE from the sentence-transformers pre-trained models, because it performs better on parallel text mining tasks.
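As an illustration of how LaBSE embeddings can be turned into cross-lingual similarity scores, the sketch below loads the public `sentence-transformers/LaBSE` checkpoint and compares sentences by cosine similarity. This is an assumption about typical usage of the sentence-transformers library, not necessarily how this repository calls the model internally. The helper names `embed` and `cosine_matrix` are hypothetical.

```python
import math
from typing import List, Sequence


def embed(sentences: List[str], model_name: str = "sentence-transformers/LaBSE") -> List[List[float]]:
    """Encode sentences with LaBSE (downloads the model on first use)."""
    # Imported lazily so the rest of this sketch works without the dependency installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences, normalize_embeddings=True).tolist()


def cosine_matrix(left: Sequence[Sequence[float]], right: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the matrix of cosine similarities between every left and right vector."""

    def cosine(u: Sequence[float], v: Sequence[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    return [[cosine(u, v) for v in right] for u in left]
```

A typical call would be `cosine_matrix(embed(chinese_sentences), embed(english_sentences))`, where both sentence lists are supplied by the caller. Higher scores indicate sentence pairs that are more likely to be translations of each other.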

To support the specific African languages in WMT22, we forked laserembeddings and integrated a SentencePiece tokenizer into this repository.

Installation (Recommended)

python3 setup.py install

Example

  1. Two separate documents
vector-align \
    --left-doc tests/zh.file \
    --right-doc tests/en.file \
    --output output.txt
  2. One bilingual document
vector-align \
    --left-doc tests/bilingual.file \
    --right-doc tests/bilingual.file \
    --output output.txt

Reference

Please cite the paper if you use this tool:

@inproceedings{thompson-koehn-2019-vecalign,
    title = "{V}ecalign: Improved Sentence Alignment in Linear Time and Space",
    author = "Thompson, Brian and Koehn, Philipp",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1136",
    doi = "10.18653/v1/D19-1136",
    pages = "1342--1348",
}

About

Tool for extracting parallel sentences from web documents
