<a href="https://colab.research.google.com/github/tianshuailu/NMT-Adapt_ml_IN/blob/main/scripts/Pretrain_mBART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Downloading mbart.cc25**

In [None]:
!wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.v2.tar.gz 
!tar -xzvf mbart.cc25.v2.tar.gz

**Byte-Pair Encoding**

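We used the SentencePiece model that ships with mbart.cc25 to apply byte-pair encoding to the Hindi-English parallel data. The command below encodes the English training file; the Hindi side and the other splits need the same treatment, since fairseq-preprocess later reads them all.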

In [None]:
!cat train.en_XX | spm_encode --model /your_path/sentence.bpe.model > train.spm.en_XX
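
The same encoding step can be looped over every split and language from Python. The following is a minimal sketch, assuming the raw files are named `<split>.<lang>` (for example `train.hi_IN`) and sit in one directory; the helper names `load_spm_encoder`, `encode_file`, and `encode_corpus` are our own and hypothetical, while `SentencePieceProcessor(model_file=...)` and `encode(..., out_type=str)` come from the `sentencepiece` Python package.

```python
from pathlib import Path


def load_spm_encoder(model_path):
    """Return a function that maps a sentence to a list of subword pieces."""
    # Imported lazily so the file helpers below work without sentencepiece installed.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor(model_file=str(model_path))
    return lambda sentence: processor.encode(sentence, out_type=str)


def encode_file(encode, src_path, dst_path):
    """Write one space-joined piece sequence per input line; return the line count."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(" ".join(encode(line.rstrip("\n"))) + "\n")
            count += 1
    return count


def encode_corpus(encode, data_dir,
                  splits=("train", "valid", "test"),
                  langs=("hi_IN", "en_XX")):
    """Encode <split>.<lang> into <split>.spm.<lang> for every file that exists."""
    data_dir = Path(data_dir)
    written = {}
    for split in splits:
        for lang in langs:
            src = data_dir / f"{split}.{lang}"
            if src.exists():
                dst = data_dir / f"{split}.spm.{lang}"
                written[dst.name] = encode_file(encode, src, dst)
    return written
```

A call would look like `encode_corpus(load_spm_encoder("/your_path/sentence.bpe.model"), "/your_path/data")`, which returns the number of lines written per output file.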

**Generating dictionary**

The Python script build_vocab.py, which generates the vocabulary, comes from this GitHub issue: https://github.com/facebookresearch/fairseq/issues/2120

In [None]:
!python build_vocab.py --corpus-data "/your_path/data" --langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN --output ./dict.txt
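
A fairseq dictionary file is plain text with one `<token> <count>` pair per line. To illustrate that format, here is a small standard-library sketch that counts the tokens in SentencePiece-encoded files and writes them out. It is not the build_vocab.py script from the issue and does not reproduce its behaviour; the function names `count_tokens` and `write_dictionary` are hypothetical, and the ordering (most frequent first, ties alphabetical) is our own choice.

```python
from collections import Counter


def count_tokens(paths):
    """Count whitespace-separated tokens across SentencePiece-encoded files."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                counts.update(line.split())
    return counts


def write_dictionary(counts, output_path):
    """Write `<token> <count>` lines, most frequent first, ties broken alphabetically."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    with open(output_path, "w", encoding="utf-8") as handle:
        for token, count in ordered:
            handle.write(f"{token} {count}\n")
    return len(ordered)
```

Special symbols and language tags are not written here; fairseq adds those when it loads the dictionary for the task.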

**Trimming mBART**

The Python script trim_mbart.py, which trims mBART, comes from this GitHub issue: https://github.com/facebookresearch/fairseq/issues/2120

In [None]:
# we changed the last three language labels to ml_XX, no_ML, and no_HI, referring to Malayalam, noised Malayalam, and noised Hindi
!python trim_mbart.py --pre-train-dir /your_path/mbart.cc25.v2 --ft-dict /your_path/dict.txt --langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,ml_XX,no_ML,no_HI --output /your_path/model.pt
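
Conceptually, trimming keeps only the embedding rows whose tokens appear in the reduced dictionary. The sketch below shows that row selection on plain Python lists. It is an illustration under our own assumptions, not the trim_mbart.py script: the real script works on the checkpoint's tensors, and details such as special symbols and language tags are left out here. The names `build_row_map` and `trim_embedding` are hypothetical.

```python
def build_row_map(old_tokens, new_tokens):
    """For each token in the reduced vocabulary, find its row in the original one."""
    old_index = {token: row for row, token in enumerate(old_tokens)}
    missing = [token for token in new_tokens if token not in old_index]
    if missing:
        raise KeyError(f"tokens absent from the original vocabulary: {missing[:5]}")
    return [old_index[token] for token in new_tokens]


def trim_embedding(embedding, row_map):
    """Copy only the selected rows, in the order of the reduced vocabulary."""
    return [list(embedding[row]) for row in row_map]


# Toy example: a 4-token vocabulary reduced to 3 tokens.
old_vocab = ["<s>", "a", "b", "c"]
new_vocab = ["<s>", "c", "a"]
toy_embedding = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
trimmed = trim_embedding(toy_embedding, build_row_map(old_vocab, new_vocab))
```

After this call `trimmed` has one row per token of `new_vocab`, so the model's embedding matrix shrinks to the size of the new dictionary.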

**Preprocessing data with fairseq-preprocess**

In [None]:
# we used fairseq-preprocess to binarize the data; --srcdict and --tgtdict should point to the same dictionary file
!fairseq-preprocess \
--source-lang hi_IN \
--target-lang en_XX \
--trainpref /your_path/train.spm \
--validpref /your_path/valid.spm \
--testpref /your_path/test.spm \
--destdir /your_path/pp_data \
--thresholdtgt 0 \
--thresholdsrc 0 \
--srcdict /your_path/dict.txt \
--tgtdict /your_path/dict.txt \
--workers 70
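
fairseq-preprocess appends the language code to each prefix, so `--trainpref /your_path/train.spm` with the languages above reads `/your_path/train.spm.hi_IN` and `/your_path/train.spm.en_XX`. Before running it, a quick sanity check can confirm that these files exist and that both sides of each split have the same number of lines. The helper below is a sketch with hypothetical names (`expected_inputs`, `count_lines`, `check_parallel`), not part of fairseq.

```python
from pathlib import Path


def expected_inputs(prefixes, source_lang, target_lang):
    """List the files implied by the --trainpref/--validpref/--testpref prefixes."""
    return [Path(f"{prefix}.{lang}")
            for prefix in prefixes
            for lang in (source_lang, target_lang)]


def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(prefixes, source_lang="hi_IN", target_lang="en_XX"):
    """Return a list of problems; an empty list means the data looks consistent."""
    problems = []
    for prefix in prefixes:
        src, tgt = expected_inputs([prefix], source_lang, target_lang)
        absent = [str(path) for path in (src, tgt) if not path.exists()]
        if absent:
            problems.append("missing: " + ", ".join(absent))
            continue
        n_src, n_tgt = count_lines(src), count_lines(tgt)
        if n_src != n_tgt:
            problems.append(f"{prefix}: {n_src} source lines vs {n_tgt} target lines")
    return problems
```

For the command above, the call would be `check_parallel(["/your_path/train.spm", "/your_path/valid.spm", "/your_path/test.spm"])`.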

**Pretraining mBART**

In [None]:
# we used fairseq-train to train the model; the data directory is the --destdir written by fairseq-preprocess
!fairseq-train /your_path/pp_data \
--encoder-normalize-before --decoder-normalize-before \
--arch mbart_large --layernorm-embedding \
--task translation_from_pretrained_bart \
--source-lang hi_IN --target-lang en_XX \
--criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
--optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
--lr-scheduler polynomial_decay --lr 3e-05 --min-lr -1 --warmup-updates 2500 --total-num-update 40000 \
--dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
--max-tokens 1024 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
--seed 222 --log-format simple --log-interval 2 \
--save-dir /your_path/checkpoints \
--restore-file /your_path/pretrained_model.pt \
--reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
--langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,ml_XX,no_ML,no_HI \
--ddp-backend no_c10d
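
To make the schedule flags concrete, the sketch below computes the learning rate at a given update for a polynomial decay with linear warmup, using the values from the command (`--lr 3e-05`, `--warmup-updates 2500`, `--total-num-update 40000`). The end learning rate of 0.0 and power of 1.0 are assumptions, since the command does not set them; the formula is our reading of the `polynomial_decay` scheduler and may differ between fairseq versions. The second helper gives the upper bound on tokens per optimizer step; `num_gpus` is a hypothetical parameter.

```python
def polynomial_decay_lr(step, peak_lr=3e-5, warmup_updates=2500,
                        total_num_update=40000, end_lr=0.0, power=1.0):
    """Learning rate at `step`: linear warmup, then polynomial decay to end_lr."""
    if warmup_updates > 0 and step <= warmup_updates:
        return peak_lr * step / warmup_updates
    if step >= total_num_update:
        return end_lr
    remaining = 1.0 - (step - warmup_updates) / (total_num_update - warmup_updates)
    return (peak_lr - end_lr) * remaining ** power + end_lr


def tokens_per_update(max_tokens=1024, update_freq=2, num_gpus=1):
    """Upper bound on tokens per optimizer step (--max-tokens is a per-batch cap)."""
    return max_tokens * update_freq * num_gpus


# Learning rate at a few points of the run, and the per-step token budget.
schedule = {step: polynomial_decay_lr(step) for step in (0, 1250, 2500, 21250, 40000)}
budget = tokens_per_update()
```

Under these assumptions the rate rises linearly to 3e-05 at update 2500, falls linearly to zero at update 40000, and each optimizer step on a single GPU sees at most 2048 tokens.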

**Checkpoint Evaluation**

In [None]:
# we used the following command to check the translations and BLEU score of a checkpoint
!fairseq-generate /your_path/pp_data \
  --path /your_path/checkpoint_last.pt \
  --results-path /your_path/eval_result \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -t en_XX -s hi_IN \
  --bpe 'sentencepiece' --sentencepiece-model /your_path/spm.model \
  --scoring sacrebleu \
  --batch-size 32 --langs ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,ml_XX,no_ML,no_HI
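
With `--results-path`, the translations are written to a text file in that directory (assumed here to be `generate-test.txt` for `--gen-subset test`). Its lines are tab-separated and prefixed with `S-` (source), `T-` (reference), `H-` (hypothesis with score), and `D-` (detokenized hypothesis with score), each followed by the sentence id. The parser below is a sketch based on that layout, which may vary between fairseq versions; `parse_generate_output` is a hypothetical helper name.

```python
def parse_generate_output(lines):
    """Collect source, reference, and hypothesis text keyed by sentence id."""
    records = {}
    for line in lines:
        tag, sep, rest = line.rstrip("\n").partition("-")
        if sep == "" or tag not in {"S", "T", "H", "D"}:
            continue  # skip log lines, P- score lines, and the final BLEU summary
        fields = rest.split("\t")
        if not fields[0].isdigit():
            continue
        sent_id = int(fields[0])
        # H and D lines carry a score before the text; S and T lines do not.
        # In both cases the text is the last tab-separated field.
        text = fields[-1] if len(fields) > 1 else ""
        records.setdefault(sent_id, {})[tag] = text
    return [records[sent_id] for sent_id in sorted(records)]
```

For example, `parse_generate_output(open("/your_path/eval_result/generate-test.txt", encoding="utf-8"))` returns one dictionary per test sentence, in sentence-id order, which makes it easy to inspect hypotheses next to their references.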