# NLLB-200 demo

This demo shows how to load a distilled version of the [NLLB-200 model](https://github.com/facebookresearch/fairseq/tree/nllb), translate with it, and fine-tune it on a small additional corpus.

For more details, see the GitHub repository linked above, the [Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/nllb#nllb), and [our paper](https://research.facebook.com/publications/no-language-left-behind/).

In [1]:
!nvidia-smi

Fri Feb 24 12:25:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    45W / 400W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install transformers datasets evaluate numpy sacrebleu sentencepiece

Defaulting to user installation because normal site-packages is not writeable
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 100.2 MB/s eta 0:00:00
Collecting datasets
  Downloading datasets-2.10.0-py3-none-any.whl (469 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 95.9 MB/s eta 0:00:00
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.4/81.4 kB 24.4 MB/s eta 0:00:00
Collecting sacrebleu
  Downloading sacrebleu-2.3.1-py3-none-any.whl (118 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 118.9/118.9 kB 32.1 MB/s eta 0:00:00
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.

For this demo, we will be testing out translation from English into Central Kanuri written in Latin script (`knc_Latn`).

In [9]:
SRC_LANG = "eng_Latn"
TGT_LANG = "knc_Latn"
TRAIN_BATCH_SIZE = 100
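
The language codes above follow the FLORES-200 convention: a three-letter ISO 639-3 language code, an underscore, and a four-letter ISO 15924 script code. The helper below is a hypothetical convenience (it is not part of `transformers`) that splits a code and checks its shape, which can catch a mistyped code before the model is loaded. It checks the format only, not whether the model actually supports the language.

```python
import re

# Shape of a FLORES-200-style code: 3 lowercase letters, "_", then a
# capitalised 4-letter script tag (e.g. "eng_Latn").
_LANG_CODE_RE = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")

def split_lang_code(code):
    """Return (language, script), or raise ValueError for a malformed code."""
    match = _LANG_CODE_RE.match(code)
    if match is None:
        raise ValueError(f"Not a FLORES-200-style language code: {code!r}")
    return match.group(1), match.group(2)

print(split_lang_code("eng_Latn"))  # ('eng', 'Latn')
print(split_lang_code("knc_Latn"))  # ('knc', 'Latn')
```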

# Load data and create train/test/dev sets

In [4]:
#Paths

data_dir = ""

path_TMX_train = data_dir + "TMX_train.json"
path_TMX_dev = data_dir + "TMX_dev.json"
path_TMX_test = data_dir + "TMX_test.json"

path_FLORES_dev = data_dir + "FLORES_dev.json"
path_FLORES_test = data_dir + "FLORES_test.json"

In [5]:
import json

# Each file holds a JSON list of sentence-pair records.
with open(path_TMX_train) as f:
  data_TMX_train = json.load(f)
with open(path_TMX_dev) as f:
  data_TMX_dev = json.load(f)
with open(path_TMX_test) as f:
  data_TMX_test = json.load(f)

with open(path_FLORES_dev) as f:
  data_FLORES_dev = json.load(f)
with open(path_FLORES_test) as f:
  data_FLORES_test = json.load(f)

In [7]:
data_train = data_TMX_train
data_dev = data_TMX_dev + data_FLORES_dev

print("data_train", len(data_train))
print("data_dev", len(data_dev))
print("data_test_FLORES", len(data_FLORES_test))
print("data_test_TMX", len(data_TMX_test))

data_train 73155
data_dev 1997
data_test_FLORES 1012
data_test_TMX 1000
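
Because the train, dev, and test sets come from separate files, it can be worth checking that no test sentence also appears in the training data. This is a minimal sketch, assuming each record is a dict keyed by `sentence_eng_Latn` and `sentence_knc_Latn` (the keys used later in this notebook). `find_overlap` is a hypothetical helper and the toy records are placeholders, not real data.

```python
def find_overlap(train_records, test_records, key):
    """Return the sorted, normalised sentences that occur in both sets."""
    # Normalise case and whitespace so trivial variants count as duplicates.
    def norm(text):
        return " ".join(text.lower().split())

    train_seen = {norm(r[key]) for r in train_records}
    return sorted({norm(r[key]) for r in test_records} & train_seen)

# Toy records standing in for data_train / data_TMX_test.
toy_train = [
    {"sentence_eng_Latn": "Good morning.", "sentence_knc_Latn": "<target 1>"},
    {"sentence_eng_Latn": "Thank you.", "sentence_knc_Latn": "<target 2>"},
]
toy_test = [
    {"sentence_eng_Latn": "thank  you.", "sentence_knc_Latn": "<target 3>"},
    {"sentence_eng_Latn": "See you later.", "sentence_knc_Latn": "<target 4>"},
]

overlap = find_overlap(toy_train, toy_test, "sentence_eng_Latn")
print(overlap)  # ['thank you.']
```

On the real data the call would look like `find_overlap(data_train, data_TMX_test, "sentence_eng_Latn")`.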


# Load model

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M", use_cache=False)
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M", src_lang=SRC_LANG, tgt_lang=TGT_LANG)



Downloading (…)lve/main/config.json:   0%|          | 0.00/846 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/564 [00:00<?, ?B/s]

Downloading (…)ncepiece.bpe.model";:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

Downloading (…)"tokenizer.json";:   0%|          | 0.00/17.3M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/3.55k [00:00<?, ?B/s]

## 1. Inference

In [10]:
#Test inference
translator = pipeline("translation", model=model, tokenizer=tokenizer, device=0, src_lang=SRC_LANG, tgt_lang=TGT_LANG)

test_src_sentence = "Do you think this is right?"
print("Test inference before fine-tuning")
print("SRC:", test_src_sentence)
translator(test_src_sentence)

Test inference before fine-tuning
SRC: Do you think this is right?




[{'translation_text': 'Adǝ shima zauro kәla ro waljin?'}]

## 2. Fine-tuning

We build the fine-tuning set from the 73,155 training pairs loaded above, and the validation set from the 1,997 dev pairs:

In [11]:
from datasets import Dataset, DatasetDict, load_dataset

src_key = "sentence_" + SRC_LANG
tgt_key = "sentence_" + TGT_LANG

# data = [
#     {src_key: "Emmo acciantou un chigheumao.", tgt_key: "We've planted a cucumber."},
#     {src_key: "Emmo acciantou un meigranâ.", tgt_key: "We've planted a pomegranate."},
#     {src_key: "Gh'é un chigheumao into frigo.", tgt_key: "There's a cucumber in the fridge."},
#     {src_key: "Te gusta o meigranâ?", tgt_key: "Do you like pomegranate?"},
#     {src_key: "Tutto insemme, o chigheumao e i faxolin vëgnan dexe euro.", tgt_key: "All together, the cucumber and the green beans are ten euros."},
#     {src_key: "O fruto do meigranâ o l'é ben ben doçe!", tgt_key: "The pomegranate fruit is very sweet!"},
#     {src_key: "O no te gusta o chigheumao?", tgt_key: "Don't you like the cucumber?"},
#     {src_key: "O chigheumao o no ne sa de ninte...", tgt_key: "Cucumbers don't taste of anything..."},
#     {src_key: "Ti gh'æ di chigheumai inte l'òrto?", tgt_key: "Do you have cucumbers in your vegetable garden?"},
#     {src_key: "Ò un chigheumao", tgt_key: "I have a cucumber"},
#     {src_key: "Ò un meigranâ", tgt_key: "I have a pomegranate"},
#     {src_key: "O mei e o chigheumao en di fruti.", tgt_key: "The apple and the cucumber are fruits."},
#     {src_key: "Mangemmo un chigheumao", tgt_key: "We eat a cucumber"},
#     {src_key: "Mangemmo un meigranâ", tgt_key: "We eat a pomegranate"},
# ]
data_finetune = Dataset.from_list(data_train)
# NB: The validation set is the full dev data assembled above (TMX dev + FLORES dev).
# On a memory-constrained instance you may want to limit its size, e.g. with
# `data_validate.select(range(n))`.
data_validate = Dataset.from_list(data_dev)

We then tokenise both sets:

In [12]:
def tokenize_fn(examples):
  return tokenizer(examples[src_key], text_target=examples[tgt_key], padding="max_length", truncation=True)

tokenized_finetune = data_finetune.map(tokenize_fn, batched=True)
tokenized_validate = data_validate.map(tokenize_fn, batched=True)

Map:   0%|          | 0/73155 [00:00<?, ? examples/s]

Map:   0%|          | 0/1997 [00:00<?, ? examples/s]
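
With `padding="max_length"` and no explicit `max_length`, every example is padded to the maximum length the model accepts. Padding to the longest sequence in each batch usually leaves far fewer pad tokens to process. The pure-Python sketch below illustrates the difference on made-up token IDs. `pad_batch` and `count_pads` are hypothetical helpers, and the pad ID of 1 matches the `pad_token_id` printed in the generation config further down.

```python
def pad_batch(batch, pad_id=1, max_length=None):
    """Right-pad token-ID lists to max_length, or to the longest one if None."""
    target = max_length if max_length is not None else max(len(seq) for seq in batch)
    padded = []
    for seq in batch:
        seq = seq[:target]  # truncate so no sequence exceeds the target
        padded.append(seq + [pad_id] * (target - len(seq)))
    return padded

def count_pads(padded, pad_id=1):
    """Count pad tokens across the whole batch."""
    return sum(tok == pad_id for seq in padded for tok in seq)

# Made-up token IDs; real ones come from the tokenizer.
batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]

fixed = pad_batch(batch, max_length=16)  # analogous to padding="max_length"
dynamic = pad_batch(batch)               # analogous to padding="longest"

print(count_pads(fixed))    # 39
print(count_pads(dynamic))  # 3
```

One common way to get per-batch padding in `transformers` is to omit `padding` in the tokenizer call and pass a `DataCollatorForSeq2Seq` to the `Trainer`.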

We fine-tune the model with it:

In [None]:
from transformers import TrainingArguments, Trainer, logging

In [30]:
TRAIN_BATCH_SIZE = 3
import gc
import torch
torch.cuda.empty_cache()
gc.collect()

221

In [None]:
# NB: We work with small batch sizes and checkpointing due to the limitations of
# this free instance of Colab. In practice you'd want to use settings closer to
# what we use in the paper.
training_args = TrainingArguments(
    output_dir="tmp",
    num_train_epochs=40,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=TRAIN_BATCH_SIZE,
    gradient_accumulation_steps=4,
    eval_accumulation_steps=4,
    gradient_checkpointing=True,
    fp16=True,
    fp16_full_eval=True,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_finetune,
    eval_dataset=tokenized_validate,
)
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `M2M100ForConditionalGeneration.forward` and have been ignored: sentence_eng_Latn, sentence_knc_Latn. If sentence_eng_Latn, sentence_knc_Latn are not expected by `M2M100ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 73155
  Num Epochs = 40
  Instantaneous batch size per device = 3
  Total train batch size (w. parallel, distributed & accumulation) = 12
  Gradient Accumulation steps = 4
  Total optimization steps = 243840
  Number of trainable parameters = 615073792


Epoch,Training Loss,Validation Loss


Saving model checkpoint to tmp/checkpoint-500
Configuration saved in tmp/checkpoint-500/config.json
Configuration saved in tmp/checkpoint-500/generation_config.json
Model weights saved in tmp/checkpoint-500/pytorch_model.bin
Saving model checkpoint to tmp/checkpoint-1000
Configuration saved in tmp/checkpoint-1000/config.json
Configuration saved in tmp/checkpoint-1000/generation_config.json
Model weights saved in tmp/checkpoint-1000/pytorch_model.bin
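
The figures in the training log above can be reproduced by hand. This is a minimal sketch, assuming a single GPU and assuming that the `Trainer` floors the number of optimizer updates per epoch (an assumption that matches the logged total):

```python
import math

num_examples = 73155   # "Num examples" in the log
per_device_batch = 3   # TRAIN_BATCH_SIZE
grad_accum_steps = 4   # gradient_accumulation_steps
num_epochs = 40        # num_train_epochs
num_devices = 1        # assumption: one GPU, as in the nvidia-smi output

# Examples consumed per optimizer update.
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Forward/backward passes per epoch (the last batch may be smaller).
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))

# Optimizer updates per epoch; leftover batches are not counted as a full update.
updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)

total_updates = updates_per_epoch * num_epochs

print(effective_batch)  # 12
print(total_updates)    # 243840
```

With a checkpoint every 500 steps (as the `checkpoint-500` and `checkpoint-1000` directories suggest), a full run would write several hundred checkpoints, so setting `save_total_limit` in `TrainingArguments` may be worthwhile.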


Let's try translating again, to see if things have improved:

In [None]:
print("Test inference after fine-tuning")
print("SRC:", test_src_sentence)
translator(test_src_sentence)

[{'translation_text': 'A salad with tomatoes, pomegranate and cucumber cut into slices.'}]

In [49]:
model.save_pretrained("my-fine-tuned-model.pt")

Configuration saved in my-fine-tuned-model.pt/config.json
Configuration saved in my-fine-tuned-model.pt/generation_config.json
Model weights saved in my-fine-tuned-model.pt/pytorch_model.bin


# Evaluate on the test sets

## FLORES

In [None]:
from tqdm.notebook import tqdm
import sacrebleu

In [52]:
#Translate FLORES test set 
print("======Testing on FLORES======")
src_FLORES_test_eng = []
tgt_FLORES_test_knc = []
inf_FLORES_test_eng_knc = []
for sent in tqdm(data_FLORES_test):
  src = sent[src_key]
  tgt = sent[tgt_key]
  inf = translator(src)[0]['translation_text']

  src_FLORES_test_eng.append(src)
  tgt_FLORES_test_knc.append(tgt)
  inf_FLORES_test_eng_knc.append(inf)

  0%|          | 0/1012 [00:00<?, ?it/s]

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "eos_token_id": 2,
  "max_length": 200,
  "pad_token_id": 1,
  "transformers_version": "4.26.1",
  "use_cache": false
}



In [None]:
#Save inference results to file
# Write UTF-8 explicitly: the translations contain non-ASCII characters.
with open("inference_FLORES_eng_knc.txt", "w", encoding="utf-8") as f:
  for line in inf_FLORES_test_eng_knc:
    f.write(line + "\n")
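
Scoring tools that read hypothesis files line by line assume one translation per line, so a translation containing a newline would shift every later line. Below is a minimal sketch of a safer writer. `write_lines` and `read_lines` are hypothetical helpers, and the file name and strings are placeholders.

```python
import os
import tempfile

def write_lines(path, sentences):
    """Write one sentence per line as UTF-8, collapsing internal whitespace."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # split()/join() turns embedded newlines and tabs into single spaces.
            f.write(" ".join(sentence.split()) + "\n")

def read_lines(path):
    """Read the file back as a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Placeholder strings standing in for model output.
outputs = ["first line", "second\nline with a break", "third line"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "inference_example.txt")
    write_lines(out_path, outputs)
    restored = read_lines(out_path)

print(len(restored) == len(outputs))  # True
print(restored[1])                    # second line with a break
```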

In [58]:
print("First sample")
print("SRC:", src_FLORES_test_eng[0])
print("TGT:", tgt_FLORES_test_knc[0])
print("INF:", inf_FLORES_test_eng_knc[0])

First sample
SRC: "We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.
TGT: “Kǝrmaro nanden Njiluwa kǝntagǝ 4 soga mbeji sandi duwo buron kasuwa shwurǝ be diye sandiya zǝamzǝna ma, ammaa kǝrma sandiya kolzǝna," sǝ gullono.
INF: "Suro adəye dәga musko 4 kәntagәna mbeji sandiya kәla diyabetebedәro waljin kuru sandiya kәla diyabetebedәro waljin", shiye waltәye wono.


In [59]:
# Calculate BLEU
bleu = sacrebleu.corpus_bleu(inf_FLORES_test_eng_knc, [tgt_FLORES_test_knc], tokenize='flores200')
print("BLEU:", round(bleu.score, 2))

# Calculate CHRF
chrf = sacrebleu.corpus_chrf(inf_FLORES_test_eng_knc, [tgt_FLORES_test_knc])
print("CHRF:", round(chrf.score, 2))

# Calculate TER
metric = sacrebleu.metrics.TER()
ter = metric.corpus_score(inf_FLORES_test_eng_knc, [tgt_FLORES_test_knc])
print("TER:", round(ter.score, 2))



BLEU: 2.97
CHRF: 27.39
TER: 86.96


## TMX

In [60]:
#Translate TMX test set 
print("======Testing on TMX======")
src_TMX_test_eng = []
tgt_TMX_test_knc = []
inf_TMX_test_eng_knc = []
for sent in tqdm(data_TMX_test):
  src = sent[src_key]
  tgt = sent[tgt_key]
  inf = translator(src)[0]['translation_text']

  src_TMX_test_eng.append(src)
  tgt_TMX_test_knc.append(tgt)
  inf_TMX_test_eng_knc.append(inf)



  0%|          | 0/1000 [00:00<?, ?it/s]

Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "eos_token_id": 2,
  "max_length": 200,
  "pad_token_id": 1,
  "transformers_version": "4.26.1",
  "use_cache": false
}



In [None]:
#Save inference results to file
# Write UTF-8 explicitly: the translations contain non-ASCII characters.
with open("inference_TMX_eng_knc.txt", "w", encoding="utf-8") as f:
  for line in inf_TMX_test_eng_knc:
    f.write(line + "\n")

In [62]:
print("First sample")
print("SRC:", src_TMX_test_eng[0])
print("TGT:", tgt_TMX_test_knc[0])
print("INF:", inf_TMX_test_eng_knc[0])

First sample
SRC: Do you think this is right?
TGT: Akaidǝ kalkallo ruwinna?
INF: Adǝ shima zauro kәla ro waljin?


In [63]:
# Calculate BLEU
bleu = sacrebleu.corpus_bleu(inf_TMX_test_eng_knc, [tgt_TMX_test_knc], tokenize='flores200')
print("BLEU:", round(bleu.score, 2))

# Calculate CHRF
chrf = sacrebleu.corpus_chrf(inf_TMX_test_eng_knc, [tgt_TMX_test_knc])
print("CHRF:", round(chrf.score, 2))

# Calculate TER
metric = sacrebleu.metrics.TER()
ter = metric.corpus_score(inf_TMX_test_eng_knc, [tgt_TMX_test_knc])
print("TER:", round(ter.score, 2))



BLEU: 4.03
CHRF: 15.45
TER: 200.0
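
TER divides the number of edits by the reference length, so it is not bounded by 100: hypotheses much longer than their references can push it well above that. The sketch below uses plain word-level edit distance and omits the shift operation of full TER, so it only approximates what `sacrebleu` computes. `word_edit_distance` and `simple_ter` are hypothetical helpers, and the inputs are placeholder tokens.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance over words (insert, delete, substitute)."""
    prev = list(range(len(ref_words) + 1))
    for i, hyp_word in enumerate(hyp_words, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_words, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def simple_ter(hypotheses, references):
    """Corpus-level edits / reference words, as a percentage (no shifts)."""
    edits = sum(word_edit_distance(h.split(), r.split())
                for h, r in zip(hypotheses, references))
    ref_words = sum(len(r.split()) for r in references)
    return 100.0 * edits / ref_words

# Placeholder tokens, not real translations.
print(simple_ter(["a b c d"], ["x y"]))  # 200.0
print(simple_ter(["x y"], ["x y"]))      # 0.0
```

In the first call, a four-word hypothesis against a two-word reference needs two substitutions and two deletions, giving 4 edits over 2 reference words.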
