# Environment Settings

In [3]:
! python --version
! nvcc --version

Python 3.10.16
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [None]:
! pip install pip==24
! pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
! pip install fairseq
! pip install sacremoses tensorboardX tensorboard sacrebleu sentencepiece
! pip install datasets transformers

Let's create a working directory

In [5]:
import os
os.makedirs('fairseq', exist_ok=True)
os.chdir('fairseq')

# Data Preprocessing

- We need two datasets to train a model: a training set and a validation set.
- Data format: two separate text files for the source and target languages with the same prefix, e.g., `train.en` and `train.hi`
- Subword tokenization: source/target texts should be tokenized into subwords.
- Additional preprocessing: normalization, pre-tokenization, truecasing, and length-based cleaning. Ref: [MOSES](https://www2.statmt.org/moses/?n=Moses.Baseline)
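
Length-based cleaning is not applied in the cells below, so here is a minimal sketch of what it could look like. The function name `clean_parallel` and the thresholds (`max_len=250`, `max_ratio=9.0`) are illustrative assumptions, not values taken from this notebook.

```python
def clean_parallel(src_lines, tgt_lines, min_len=1, max_len=250, max_ratio=9.0):
    """Keep sentence pairs whose token counts and length ratio look sane.

    Thresholds are illustrative assumptions; tune them for your data.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is too short or too long.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Drop pairs whose lengths differ too much (guard against division by zero).
        if max(n_src, n_tgt) / max(1, min(n_src, n_tgt)) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept


# Toy subword-tokenized lines: one good pair, one empty source, one overlong source.
pairs = clean_parallel(
    ["▁The ▁w inner", "", "▁a " * 300],
    ["▁विजे ता", "▁एक", "▁एक"],
)
print(len(pairs))
```

Counting is done on whitespace-separated subwords, so it would be applied after tokenization and before binarization.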

In [None]:
from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

- Training data

In [4]:
import os
os.makedirs('data', exist_ok=True)

import datasets

train_data = datasets.load_dataset('ai4bharat/indic-instruct-data-v0.1', 'nmt-seed')
print(train_data)

with open('data/train.en', 'w', encoding='utf-8') as fout:
    for entry in train_data['hi']:
        text = entry['input_text'].replace('\n', ' ')
        text = ' '.join(tokenizer.tokenize(text))
        print(text, file=fout)

with open('data/train.hi', 'w', encoding='utf-8') as fout:
    for entry in train_data['hi']:
        text = entry['output_text'].replace('\n', ' ')
        text = ' '.join(tokenizer.tokenize(text))
        print(text, file=fout)

! head -n1 data/train.*

DatasetDict({
    hi: Dataset({
        features: ['id', 'input_text', 'output_text', 'input_language', 'output_language', 'bucket'],
        num_rows: 50000
    })
})
==> data/train.en <==
▁The ▁w inner ▁is ▁announced ▁at ▁an ▁event ▁in ▁Sydney ▁in ▁March .

==> data/train.hi <==
▁मार्च ▁में ▁सि ड नी ▁में ▁एक ▁कार्यक्रम ▁में ▁विजे ता ▁की ▁घोषणा ▁की ▁जाती ▁है ।


- Validation data

In [5]:
flores_en_data = datasets.load_dataset('facebook/flores', 'eng_Latn', trust_remote_code=True)
flores_hi_data = datasets.load_dataset('facebook/flores', 'hin_Deva', trust_remote_code=True)
print(flores_en_data)

# Validation
with open('data/valid.en', 'w', encoding='utf-8') as fout:
    for sentence in flores_en_data['dev']['sentence']:
        sentence = ' '.join(tokenizer.tokenize(sentence))
        print(sentence, file=fout)

with open('data/valid.hi', 'w', encoding='utf-8') as fout:
    for sentence in flores_hi_data['dev']['sentence']:
        sentence = ' '.join(tokenizer.tokenize(sentence))
        print(sentence, file=fout)

# Test
with open('data/test.en', 'w', encoding='utf-8') as fout:
    for sentence in flores_en_data['devtest']['sentence']:
        sentence = ' '.join(tokenizer.tokenize(sentence))
        print(sentence, file=fout)

with open('data/test.hi', 'w', encoding='utf-8') as fout:
    for sentence in flores_hi_data['devtest']['sentence']:
        sentence = ' '.join(tokenizer.tokenize(sentence))
        print(sentence, file=fout)

! head -n1 data/valid.*
! head -n1 data/test.*

DatasetDict({
    dev: Dataset({
        features: ['id', 'URL', 'domain', 'topic', 'has_image', 'has_hyperlink', 'sentence'],
        num_rows: 997
    })
    devtest: Dataset({
        features: ['id', 'URL', 'domain', 'topic', 'has_image', 'has_hyperlink', 'sentence'],
        num_rows: 1012
    })
})
==> data/valid.en <==
▁On ▁Monday , ▁scientists ▁from ▁the ▁Stanford ▁University ▁School ▁of ▁Medicine ▁announced ▁the ▁inv ention ▁of ▁a ▁new ▁diagnostic ▁tool ▁that ▁can ▁sort ▁cells ▁by ▁type : ▁a ▁tiny ▁pr inta ble ▁chip ▁that ▁can ▁be ▁manufact ured ▁using ▁standard ▁ink jet ▁prin ters ▁for ▁possi bly ▁about ▁one ▁U . S . ▁cent ▁each .

==> data/valid.hi <==
▁सोम वार ▁को , ▁स्ट ैन फ़ ो र्ड ▁यूनि वर् सिटी ▁स्कूल ▁ऑफ़ ▁मेड िस िन ▁के ▁वैज्ञानिक ों ▁ने ▁एक ▁नए ▁डाय ग् नो स्टिक ▁उपकरण ▁के ▁आवि ष्कार ▁की ▁घोषणा ▁की ▁जो ▁कोशिका ओं ▁को ▁उनके ▁प्रकार ▁के ▁आधार ▁पर ▁छा ँ ट ▁सकता ▁है : ▁एक ▁छोटी ▁प्रि ंट ▁करने ▁योग्य ▁चि प ▁जिसे ▁स्ट ै ण्ड र्ड ▁इ ंक जे ट ▁प्रि ं टर ▁का ▁उपयोग ▁करके ▁लगभग ▁एक
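
Before binarizing, it can help to confirm that the source and target files of each split have the same number of lines. A minimal sketch, where the helper names `count_lines` and `check_parallel` are hypothetical and the demo uses temporary files rather than the real `data/` directory:

```python
import os
import tempfile


def count_lines(path):
    """Count lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as fin:
        return sum(1 for _ in fin)


def check_parallel(prefix, src="en", tgt="hi"):
    """Raise if <prefix>.<src> and <prefix>.<tgt> differ in line count."""
    n_src = count_lines(f"{prefix}.{src}")
    n_tgt = count_lines(f"{prefix}.{tgt}")
    if n_src != n_tgt:
        raise ValueError(f"{prefix}: {n_src} {src} lines vs {n_tgt} {tgt} lines")
    return n_src


# Demo on throwaway files; in the notebook you would call check_parallel('data/train').
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "toy")
    with open(prefix + ".en", "w", encoding="utf-8") as fout:
        fout.write("▁Hello\n▁World\n")
    with open(prefix + ".hi", "w", encoding="utf-8") as fout:
        fout.write("▁नमस्ते\n▁दुनिया\n")
    demo_count = check_parallel(prefix)
print(demo_count)
```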

Binarize the data

In [11]:
! fairseq-preprocess \
    --source-lang en --target-lang hi \
    --trainpref data/train \
    --validpref data/valid \
    --destdir databin \
    --joined-dictionary

2025-01-12 17:19:49 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, log_file=None, aim_repo=None, aim_run_hash=None, tensorboard_logdir=None, wandb_project=None, azureml_logging=False, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, on_cpu_convert_precision=False, min_loss_scale=0.0001, threshold_loss_scale=None, amp=False, amp_batch_retries=2, amp_init_scale=128, amp_scale_window=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, quantization_config_path=None, profile=False, reset_logging=False, suppress_crashes=False, use_plasma_view=False, plasma_path='/tmp/plasma', criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang='en', target_lang='hi', tr
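
`fairseq-preprocess` writes plain-text dictionaries (`dict.en.txt` and `dict.hi.txt`) next to the binarized data; with `--joined-dictionary` both files share one vocabulary. A minimal sketch for inspecting such a file, assuming the usual one `token count` pair per line; the helper name and the sample counts are made up for illustration:

```python
def parse_fairseq_dict(text):
    """Parse 'token count' lines into a {token: count} mapping."""
    vocab = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last space so the token itself is kept intact.
        token, count = line.rsplit(" ", 1)
        vocab[token] = int(count)
    return vocab


# Made-up sample; in the notebook, read databin/dict.en.txt instead.
sample = "▁में 1200\n. 950\n▁the 900\n"
vocab = parse_fairseq_dict(sample)
print(len(vocab), vocab["▁में"])
```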

# Model Training

- Basic configuration
```
fairseq-train \
    databin \
    --source-lang en --target-lang hi \
    --arch transformer \
    --save-dir models \
    --max-tokens 4000 \
    --max-epoch 1000 --no-epoch-checkpoints \
    --optimizer adam
```

- Additional configuration for low-resource languages, following the [FLoRes v1 README](https://github.com/facebookresearch/flores/blob/main/previous_releases/floresv1/README.md)
- Optimization
```
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --stop-min-lr 1e-9 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy
```
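
To see what the `inverse_sqrt` scheduler does with these values, here is a small sketch of the schedule: a linear warmup from `--warmup-init-lr` to `--lr` over `--warmup-updates` steps, followed by decay proportional to the inverse square root of the step. The function name is hypothetical and this is a re-implementation for illustration, not fairseq's own code.

```python
def inverse_sqrt_lr(step, lr=1e-3, warmup_updates=4000, warmup_init_lr=1e-7):
    """Learning rate at a given update under an inverse-sqrt schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup: decay with 1/sqrt(step), scaled so lr is reached at the end of warmup.
    return lr * (warmup_updates ** 0.5) * (step ** -0.5)


for step in (0, 2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step))
```

The peak of 1e-3 is reached at update 4000, and the rate halves each time the step count quadruples.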

- Architecture
```
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 2 --decoder-attention-heads 2 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.4 --attention-dropout 0.2 --relu-dropout 0.2
```

- It's common to select the best checkpoint using a validation set
```
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "lenpen": 1.2}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe=sentencepiece \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric
```

- We can monitor and debug model training using the command-line output, tensorboard, or wandb
- tensorboard: `--tensorboard-logdir models/tensorboard`
- wandb: `--wandb-project COLING2025`

In [None]:
! mkdir -p models models/tensorboard
! fairseq-train \
    databin \
    --source-lang en --target-lang hi \
    --arch transformer_tiny \
    --save-dir models \
    --max-tokens 4000 \
    --max-epoch 1000 --no-epoch-checkpoints \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0 \
    --weight-decay 0.0001 \
    --label-smoothing 0.2 --criterion label_smoothed_cross_entropy \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 --stop-min-lr 1e-9 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "lenpen": 1.2}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe=sentencepiece \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --tensorboard-logdir models/tensorboard \
    --wandb-project COLING2025

# Generation and Evaluation

In [None]:
! mkdir -p outputs
! cat data/test.en \
| fairseq-interactive \
    databin/ \
    --source-lang en --target-lang hi \
    --path pretrained/models/checkpoint_best.pt \
    --beam 5 --lenpen 1.2 \
    --remove-bpe=sentencepiece > outputs/test.hi
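
The file written by `fairseq-interactive` mixes several tab-separated line types, with hypotheses on lines starting with `H-`. A minimal Python sketch for pulling them out, assuming the `H-<id>`, score, text layout; the function name and the sample output are illustrative, not real log lines:

```python
def extract_hypotheses(output_text):
    """Return hypothesis strings from fairseq-style output, ordered by sentence id."""
    hyps = {}
    for line in output_text.splitlines():
        if not line.startswith("H-"):
            continue
        parts = line.split("\t", 2)
        # parts[0] is 'H-<id>', parts[1] the score, parts[2] the hypothesis text.
        hyps[int(parts[0][2:])] = parts[2] if len(parts) > 2 else ""
    return [hyps[i] for i in sorted(hyps)]


# Illustrative sample in the assumed format (not actual model output).
sample_output = (
    "S-1\t▁Good ▁morning\n"
    "H-1\t-0.51\tसुप्रभात\n"
    "S-0\t▁Hello ▁world\n"
    "H-0\t-0.42\tनमस्ते दुनिया\n"
    "P-0\t-0.40 -0.44\n"
)
print(extract_hypotheses(sample_output))
```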

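- BLEU, chrF, and TER evaluation using [SacreBLEU](https://github.com/prajdabre/sacrebleu). In upstream SacreBLEU, `-w 2` sets the output width to two decimal places; chrF++ would additionally need `--chrf-word-order 2`.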

In [5]:
! cat data/test.hi | sed 's/ //g' | sed 's/▁/ /g' > data/test.hi.detok
! cat outputs/test.hi | grep '^H-' | cut -f3- \
| python -m sacrebleu \
    data/test.hi.detok \
    -b -m bleu chrf ter -w 2

[
19.29,
44.83,
69.28
]
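
The two `sed` commands used to detokenize the reference can be sketched in Python as follows. The function name is hypothetical; unlike the `sed` version, it also strips the leading space left by the first `▁`.

```python
def detok_sentencepiece(line):
    """Undo space-joined SentencePiece tokens: drop spaces, then turn '▁' into spaces."""
    return line.replace(" ", "").replace("▁", " ").strip()


print(detok_sentencepiece("▁The ▁w inner ▁is ▁announced ."))
```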