# 500-English Translation Demo

**Thamme "TG" Gowda**  
SARAL Team  
USC Information Sciences Institute  

http://rtg.isi.edu/many-eng

# Overview

* Our tools for machine translation
  * `mtdata`: dataset catalogue and downloader
  * `nlcodec`: vocabulary manager and training data storage
  * `rtg`: PyTorch-based NMT toolkit

* 500→English Translation
  * Demo

# MTData

* Easy access to translation datasets
* `pip install mtdata` https://github.com/thammegowda/mtdata/ 
* Unified interface to [OPUS, Statmt, Paracrawl, TEDTalks, ....](https://github.com/thammegowda/mtdata/#current-status)
  * Parses XML, TMX, SGM, TSV, etc.
  * Handles compression formats: `.gz`, `.tar.gz`/`.tgz`, `.xz`, `.zip`
* Uses ISO 639-3 standard names and codes
* Stores a signature that helps reproduce datasets

_Special thanks to Kenneth Heafield for helping add more datasets to mtdata_

## ISO 639-3 
All language IDs and names are mapped to ISO 639-3 codes:
```bash
$ mtdata-iso german de deu ge ka kn iloko spa fas fa
Input	ISO639_3	Name
german	deu	German
de	deu	German
deu	deu	German
ge	hmj	Ge
ka	kat	Georgian
kn	kan	Kannada
iloko	ilo	Iloko
spa	spa	Spanish
fas	fas	Persian
fa	fas	Persian
```

# `mtdata list` 

List datasets for languages of interest; for example, *Farsi-English*:

`$ mtdata list -l fas-eng | cut -f1,2`
```
2021-09-13 22:07:15 main.list_data:18 INFO:: Found 26
cc_aligned	eng-fas
OPUS_CCAligned_v1	eng-fas
OPUS_GNOME_v1	eng-fas
OPUS_GlobalVoices_v2018q4	eng-fas
OPUS_KDE4_v2	eng-fas
OPUS_OpenSubtitles_v2016	eng-fas
OPUS_OpenSubtitles_v2018	eng-fas
OPUS_QED_v2_0a	eng-fas
OPUS_TED2020_v1	eng-fas
OPUS_TEP_v1	eng-fas
OPUS_Tanzil_v1	eng-fas
OPUS_Ubuntu_v14_10	eng-fas
OPUS_WikiMatrix_v1	eng-fas
OPUS_Wikipedia_v1_0	eng-fas
OPUS_infopankki_v1	eng-fas
OPUS_tico19_v20201028	eng-fas
OPUS_wikimedia_v20210402	eng-fas
JW300	eng-fas
OPUS100v1_train	eng-fas
OPUS100v1_dev	eng-fas
OPUS100v1_test	eng-fas
WikiMatrix_v1	eng-fas
neulab_tedtalksv1_train	eng-fas
neulab_tedtalksv1_test	eng-fas
neulab_tedtalksv1_dev	eng-fas
ELRC_wikipedia_health	eng-fas
```
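
The listing is tab-separated, so it is easy to post-process. The following is a minimal sketch, not part of `mtdata`, that parses a captured copy of the output above and picks out the training splits. The helper name `parse_listing` and the shortened `sample` text are illustrative assumptions.

```python
def parse_listing(text):
    """Parse `name<TAB>lang-pair` rows, skipping log lines and blanks."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # e.g. the "INFO:: Found 26" log line contains no tab
        rows.append((parts[0], parts[1]))
    return rows


# A few rows copied from the output above (not the full listing).
sample = (
    "2021-09-13 22:07:15 main.list_data:18 INFO:: Found 26\n"
    "JW300\teng-fas\n"
    "OPUS100v1_train\teng-fas\n"
    "OPUS100v1_dev\teng-fas\n"
    "neulab_tedtalksv1_train\teng-fas\n"
)

rows = parse_listing(sample)
# Assumption: training splits are the names ending in "_train".
train_names = [name for name, _ in rows if name.endswith("_train")]
print(train_names)
```

Filtering on the `_train` suffix is only a naming heuristic based on the dataset names shown here; datasets such as `JW300` have no split suffix and need to be chosen by hand.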

# `mtdata get`

Easily obtain training and test datasets
```bash
mtdata get -l fas-eng --out fas-eng \
  --merge --train JW300 neulab_tedtalksv1_train \
          --test neulab_tedtalksv1_{dev,test}
```

```shell
$ tree fas-eng
fas-eng
├── mtdata.signature.txt
├── tests
│   ├── neulab_tedtalksv1_dev-fas_eng.eng
│   ├── neulab_tedtalksv1_dev-fas_eng.fas
│   ├── neulab_tedtalksv1_test-fas_eng.eng
│   └── neulab_tedtalksv1_test-fas_eng.fas
├── train-parts
│   ├── JW300-fas_eng.eng
│   ├── JW300-fas_eng.fas
│   ├── neulab_tedtalksv1_train-fas_eng.eng
│   └── neulab_tedtalksv1_train-fas_eng.fas
├── train.eng
├── train.fas
├── train.meta.gz
└── train.stats.json
```
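
The merged `train.fas` and `train.eng` files can then be read side by side. The sketch below assumes, as is conventional for parallel corpora, that line *i* of one file pairs with line *i* of the other. `read_parallel` is a hypothetical helper, not an `mtdata` function.

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, which exposes
            # a line-count mismatch instead of silently truncating.
            if src_line is None or tgt_line is None:
                raise ValueError("files have different numbers of lines")
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")
```

For the directory above, it would be called as `read_parallel("fas-eng/train.fas", "fas-eng/train.eng")`.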

```bash
$ cat fas-eng/mtdata.signature.txt
mtdata get -l fas-eng -tr JW300 neulab_tedtalksv1_train -ts neulab_tedtalksv1_dev neulab_tedtalksv1_test -o <out-dir>
mtdata version 0.2.10
```

```bash
$ cat fas-eng/train.stats.json
```
```json
{
  "total": 435104,
  "parts": {
    "JW300-fas_eng": 284138,
    "neulab_tedtalksv1_train-fas_eng": 150966
  }
}
```
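
In this example the `total` equals the sum of the per-dataset counts. A quick sanity check along those lines might look like the sketch below; `parts_match_total` is an illustrative helper, and the JSON is copied from the output above rather than read from disk.

```python
import json

# Contents of fas-eng/train.stats.json, as shown above.
stats_text = """
{
  "total": 435104,
  "parts": {
    "JW300-fas_eng": 284138,
    "neulab_tedtalksv1_train-fas_eng": 150966
  }
}
"""


def parts_match_total(stats):
    """Return True if the per-dataset counts add up to the reported total."""
    return sum(stats["parts"].values()) == stats["total"]


stats = json.loads(stats_text)
print(parts_match_total(stats))  # True for the numbers shown above
```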

# View BibTeX Citation 

```
$ mtdata list -l fas-eng --full \
   -n JW300 neulab_tedtalksv1_train


JW300 eng-fas
@inproceedings{agic-vulic-2019-jw300,
    title = "{JW}300: A Wide-Coverage Parallel Corpus for Low-Resource Languages",
    author = "Agi{\'c}, {\v{Z}}eljko  and
      Vuli{\'c}, Ivan",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1310",
    doi = "10.18653/v1/P19-1310",
    pages = "3204--3210",
}
@inproceedings{tiedemann2012parallel,
  title={Parallel Data, Tools and Interfaces in OPUS.},
  author={Tiedemann, J{\"o}rg},
  booktitle={Lrec},
  volume={2012},
  pages={2214--2218},
  year={2012}
}

neulab_tedtalksv1_train	eng-fas
@inproceedings{Ye2018WordEmbeddings,
    author = "Ye, Qi and Devendra, Sachan and Matthieu, Felix and Sarguna, Padmanabhan and Graham, Neubig",
    title = "When and Why are pre-trained word embeddings useful for Neural Machine Translation",
    booktitle = "HLT-NAACL",
    year = "2018"
}
```

# NLCodec

* Vocabulary types: word, char, BPE subword, ...
* (Optional) pyspark backend for large datasets
* https://isi-nlp.github.io/nlcodec/  `pip install nlcodec`
* `nlcodec [learn|encode|decode]`
* `nlcodec-db`: Efficient storage of large scale text datasets
  * Adapts the datatype to the vocabulary size, e.g., 1 byte per token ID for vocab_size < 256 and 2 bytes for vocab_size up to 65,536
  * Supports parallel writes by Spark, and parallel reads during distributed training
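
The datatype adaptation can be sketched with Python's standard `array` module. This is an illustration of the idea only, not `nlcodec-db`'s actual implementation. The thresholds mirror the bullet above, and the names `pick_typecode` and `encode_ids` are hypothetical.

```python
from array import array


def pick_typecode(vocab_size):
    """Pick the narrowest unsigned `array` typecode for IDs below vocab_size."""
    if vocab_size < 256:
        return "B"  # unsigned char, 1 byte
    if vocab_size <= 65_536:
        return "H"  # unsigned short, 2 bytes on common platforms
    return "I"      # unsigned int, typically 4 bytes


def encode_ids(ids, vocab_size):
    """Store token IDs compactly using the typecode chosen for this vocabulary."""
    return array(pick_typecode(vocab_size), ids)


small = encode_ids([3, 17, 200], vocab_size=250)
large = encode_ids([3, 17, 40_000], vocab_size=64_000)
print(small.itemsize, large.itemsize)
```

With a typical subword vocabulary of a few tens of thousands of types, 2 bytes per token halves the storage compared with 4-byte integers.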

# Reader Translator Generator (RTG)

* NMT toolkit based on PyTorch
* https://isi-nlp.github.io/rtg/    `pip install rtg`
* Reproducible experiments: all hyperparameters are stored in `conf.yml`
* All necessary ingredients for NMT research -> production
  * Flexible vocabulary options: `sentencepiece`, `nlcodec`, shared or separate vocabulary
  * Distributed training, mixed-precision training, gradient accumulation
  * Beam search, length penalty, ensembling
  * Transfer learning, weights freezing, ...  
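
To illustrate why beam search is paired with a length penalty, here is a minimal sketch. The formula RTG uses is not stated here, so the sketch assumes the GNMT-style normalization `((5 + length) / 6) ** alpha`. The function names and the hypothesis scores are made up for illustration.

```python
def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty (an assumption; RTG's formula may differ)."""
    return ((5 + length) / 6) ** alpha


def rescore(hypotheses, alpha=0.6):
    """Sort (tokens, log_prob) hypotheses by length-normalized score, best first."""
    return sorted(
        hypotheses,
        key=lambda h: h[1] / length_penalty(len(h[0]), alpha),
        reverse=True,
    )


# Made-up hypotheses and log-probabilities, for illustration only.
beam = [
    (["we", "can", "be", "friends"], -1.9),
    (["friends"], -1.6),
]
best_raw = max(beam, key=lambda h: h[1])  # raw log-prob favors the short output
best_norm = rescore(beam)[0]              # normalized score favors the fuller one
print(best_raw[0], best_norm[0])
```

Because every extra token adds a negative log-probability, unnormalized scores favor short outputs; dividing by the penalty offsets that bias.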

# 500→English Translation

* `mtdata`: downloads datasets
   * [Data selection](https://github.com/thammegowda/006-many-to-eng/blob/master/data/dataset-selection.tsv)
   * [Statistics](http://rtg.isi.edu/many-eng/data-v1.html)
* `nlcodec`: creates the vocabulary, then encodes and stores training sentences
* `rtg`: trains (multilingual) NMT models
  * Models released to [DockerHub](https://hub.docker.com/repository/docker/tgowda/rtg-model)
* Demo and more info http://rtg.isi.edu/many-eng

  
> *Caveat:* Since no language identifier is used, translating short phrases and single words without sufficient context may produce suboptimal results. We strongly recommend translating full sentences (i.e., words in context) at a time.

# Examples
Copy and paste these into [the web app](http://rtg.isi.edu/many-eng/v1):
```
प्रधानमंत्री ने शिकागो में स्वामी विवेकानंद के 1893 के प्रतिष्ठित भाषण को याद किया
శికాగో లో1893వ సంవత్సరం లో స్వామి వివేకానంద ప్రతిష్ఠిత ఉపన్యాసాన్ని స్మరించుకొన్న ప్రధానమంత్రి 
ಸ್ವಾಮಿ ವಿವೇಕಾನಂದರ 1893ರ ಷಿಕಾಗೊ ಐತಿಹಾಸಿಕ ಭಾಷಣ; ಪ್ರಧಾನ ಮಂತ್ರಿ ಸ್ಮರಣೆ
PM recalls Swami Vivekananda’s iconic 1893 speech at Chicago

Мы же можем вместе играть в крикет, мы можем дружить.
We can play cricket together, we can be friends.

فرانس کی طرف سے جیکوئس کارٹیئر 1534 میں اور ڈی چیمپلین 1603 میں آئے۔  
From France, Jacques Cartier in 1534 and De Champlain came in 1603

Δεν μπορώ να επιτρέψω στο εξάχρονο παιδί να καθορίσει τη ζωή μου.
I can't let that six-year-old keep dictating my life anymore.

עם עזיבתו של מר תומס, זה היה כמו שענן הורם מעל מנספילד פארק.
With Sir Thomas's departure, it was as if a cloud had been lifted from Mansfield Park.

குற்றம் சாட்டப்பட்ட 20 வயது நபர் திங்கட்கிழமையன்று கைது செய்யப்பட்டதாக கேஇசட்என் காவல்துறை செய்தித் தொடர்பாளர் கர்னல் தெம்பெகா எம்பில் கூறுகிறார்.
According to KZN police spokesperson Colonel Thembeka Mbele, the 20-year-old accused was arrested on Monday.

থাইল্যান্ড থেকে এক সমর্থক পিকচার মেসেজ (ছবির বার্তা) পাঠিয়েছে কম্বোডিয়া থেকে মার্গারেট পসনেট এই বার্তা ওয়েবসাইটের জন্য পাঠিয়েছে।
Picture message sent by a supporter from Thailand Margaret Posnett from Cambodia texted this message on the website
```

# Thanks


<sub><sup>`mtdata` downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or make any claims regarding license to use these datasets. It is your responsibility to determine whether you have permission to use a dataset under its license. We request all users of this tool to cite the original creators of the datasets; citations may be obtained from `mtdata list -n <NAME> -l <L1-L2> --full`.  
 If you're a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue.</sup></sub> Thanks for your contribution to the machine learning and natural language processing community!
