In this notebook, I show how to fine-tune an NLLB-200 machine translation model for a new language.

The new language is Pular (ISO 639-3 code `fuf`), and I use a French-Pular parallel corpus (the `fra` and `fuf` columns below) as the training data.

I am running this notebook on Google Colab with a T4 GPU that has 15 GB of memory. If you run it elsewhere, you may want to adjust the batch size so that there are no out-of-memory (OOM) errors but the GPU is still well utilized.
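
One way to pick a batch size on other hardware is to start high and halve it until a training step fits. The sketch below only illustrates that idea: `try_step` and `fake_step` are hypothetical names, and the simulated memory limit is made up. It relies on the fact that PyTorch reports CUDA out-of-memory conditions as a `RuntimeError` (or a subclass of it).

```python
def find_max_batch_size(try_step, start=64, min_size=1):
    """Halve the batch size until `try_step(batch_size)` stops raising RuntimeError.

    `try_step` is a hypothetical callable that runs one forward/backward pass.
    """
    batch_size = start
    while batch_size >= min_size:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError:
            batch_size //= 2
    raise RuntimeError("even the smallest batch size does not fit")


def fake_step(batch_size, limit=20):
    # Stand-in for a real training step: pretend anything above `limit` runs out of memory.
    if batch_size > limit:
        raise RuntimeError("CUDA out of memory (simulated)")


best_batch_size = find_max_batch_size(fake_step, start=64)
print(best_batch_size)
```

In a real run, `try_step` would build a batch of the longest expected sentences and call the model once, with a cleanup of the GPU cache between attempts.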

# 0. Preliminaries

I run this notebook in Google Colab, whose file system is ephemeral, so I read the dataset from Google Drive and write the resulting model there. The cell below mounts the drive.

In [None]:
from google.colab import drive
import os
if not os.path.exists('/gd'):
    drive.mount('/gd')

Mounted at /gd


Installing dependencies:
* `transformers`, as a neural network framework
* `sentencepiece`, a backend for my tokenizer (the algorithm for converting a text into symbols from the model's vocabulary)
* `sacremoses`, a package for the text preprocessing with which NLLB models were pretrained
* `sacrebleu`, a package for evaluating translation models

In [None]:
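import locale
# Workaround for Colab: force UTF-8 as the preferred encoding, so that `!pip` shell commands do not fail with a locale error
def gpe(x=None):
    return "UTF-8"
locale.getpreferredencoding = gpe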

In [None]:
!pip install sentencepiece transformers==4.33 datasets sacremoses sacrebleu  -q


# 1. Exploring the data

In this section, I look at the training data that I have and check how suitable it is for fine-tuning an NLLB model.

In [None]:
import pandas as pd

In [None]:

trans_df = pd.read_csv('/gd/MyDrive/fra_fuf_all.tsv', sep="\t")
print(trans_df.shape)
print(trans_df.columns)

(20322, 3)
Index(['fra', 'fuf', 'split'], dtype='object')


In [None]:
pd.options.display.max_colwidth = 100

In [None]:
trans_df.sample(10)

Unnamed: 0,fra,fuf,split
3585,"Du mort, Il fait sortir le vivant, et du vivant, Il fait sortir le mort. Et Il redonne la vie à ...","Hino jeyaa e maandeeji Makko ɗin: tagugol kammuuli ɗin e leydi ndin, e luutondirgol ɗemɗe mon e ...",train
15818,cracher,kilooti,train
10204,"Crétois et Arabes , nous les entendons parler dans nos langues des merveilles de Dieu . »","fii no Joomiraaɗo on yeɗira on saa'iiji fowtere immorde e makko , e fii no o additirana on on Al...",train
13830,"Bien-aimés , si notre cœur ne nous condamne pas , nous avons de l'assurance devant Dieu;","Ko fii kala on hiwriiɗo mo , haray tawdaama e kuuɗe makko bonɗe ɗen .",train
5206,"Si Nous voulions, Nous la rendrions salée. Pourquoi n'êtes-vous donc pas reconnaissants?","Ɓen wuddooɓe ɓe yamira yimɓe ɓen nguddam... kala ɗuurniiɗo, pellet, Allah kan ko Galo, Yettiniiɗo.",train
4968,"Ceux qui ne croient pas en l'au-delà donnent aux Anges des noms de femmes,",hika dogira e ndenka Amen. Ɗum ko njoɓdi [Nuuhu] fennanooɗo on.,train
10342,"Mais si elle est de Dieu , vous ne pourrez pas la renverser , et l'on trouverait même que vous l...","Fewndo ko jamaa on mottondirnoo ka wulaa , ko kanko wondunoo e malaa'ikaajo yewtaynooɗo mo on ka...",train
6060,"Vous passerez, certes, par des états successifs!",Woɗɗitoo mbo ɓurɗo hiiteede.,train
2507,"Nous n'avons point fait descendre sur toi le Coran pour que tu sois malheureux,",Ko Jippinaande immorde e On Taguɗo leydi ndin e kammuuli toowuɗi ɗin.,train
2504,"Nous l'avons rendu (le Coran) facile [à comprendre] en ta langue, afin que tu annonces par lui l...",Taa Haa.,train


In [None]:
trans_df.isnull().sum()

fra      0
fuf      0
split    0
dtype: int64

In [None]:
trans_df.split.value_counts()

train    20122
dev        100
test       100
Name: split, dtype: int64

In [None]:
df_train = trans_df[trans_df.split=='train'].copy() # 20 122 items
df_dev = trans_df[trans_df.split=='dev'].copy()     # 100 items
df_test = trans_df[trans_df.split=='test'].copy()   # 100 items

# 2. How well does the data fit into an NLLB tokenizer?

In [None]:
from transformers import NllbTokenizer
from tqdm.auto import tqdm, trange

In [None]:
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')


In [None]:
import re

def word_tokenize(text):
    # a very naive word tokenizer for languages with English-like orthography
    return re.findall(r'(\w+|[^\w\s])', text)

In [None]:
smpl = df_train.sample(10000, random_state=1)

smpl['fra_toks'] = smpl.fra.apply(tokenizer.tokenize)
smpl['fuf_toks'] = smpl.fuf.apply(tokenizer.tokenize)

smpl['fra_words'] = smpl.fra.apply(word_tokenize)
smpl['fuf_words'] = smpl.fuf.apply(word_tokenize)

In [None]:
smpl.sample(5)[['fra', 'fra_words', 'fra_toks', 'fuf', 'fuf_words', 'fuf_toks']]

Unnamed: 0,fra,fra_words,fra_toks,fuf,fuf_words,fuf_toks
11573,"afin que vous la receviez dans le Seigneur d'une manière digne des saints , et que vous l'aidiez...","[afin, que, vous, la, receviez, dans, le, Seigneur, d, ', une, manière, digne, des, saints, ,, e...","[▁afin, ▁que, ▁vous, ▁la, ▁rece, vie, z, ▁dans, ▁le, ▁Seigneur, ▁d, ', une, ▁manière, ▁dig, ne, ...",Min tigi miɗo wonirnoo takko mon e noone no lo'iri e kulol e diwnol tiiɗungol .,"[Min, tigi, miɗo, wonirnoo, takko, mon, e, noone, no, lo, ', iri, e, kulol, e, diwnol, tiiɗungol...","[▁Min, ▁tigi, ▁mi, ɗo, ▁won, ir, noo, ▁tak, ko, ▁mon, ▁e, ▁no, one, ▁no, ▁lo, ', iri, ▁e, ▁kul, ..."
3847,afin [qu'Allah] les récompense pleinement et leur ajoute de Sa grâce. Il est Pardonneur et Recon...,"[afin, [, qu, ', Allah, ], les, récompense, pleinement, et, leur, ajoute, de, Sa, grâce, ., Il, ...","[▁afin, ▁[, qu, ', Allah, ], ▁les, ▁ré, com, pense, ▁plein, ement, ▁et, ▁leur, ▁ajoute, ▁de, ▁Sa...","Pellet, Alla ko Annduɗo wirniiɗi kammuuli ɗin e leydi ndin. Ko O Annduɗo ko suuɗii e ɓerɗe.","[Pellet, ,, Alla, ko, Annduɗo, wirniiɗi, kammuuli, ɗin, e, leydi, ndin, ., Ko, O, Annduɗo, ko, s...","[▁Pel, let, ,, ▁Alla, ▁ko, ▁An, ndu, ɗo, ▁wir, nii, ɗi, ▁kam, mu, uli, ▁ɗin, ▁e, ▁leydi, ▁ndin, ..."
19712,je me suis planté une écharde dans le doigt,"[je, me, suis, planté, une, écharde, dans, le, doigt]","[▁je, ▁me, ▁suis, ▁plant, é, ▁une, ▁é, char, de, ▁dans, ▁le, ▁do, igt]",Miɗo aami immagol .,"[Miɗo, aami, immagol, .]","[▁Mi, ɗo, ▁aami, ▁immag, ol, ▁.]"
8205,"La crainte s'empara de tous les habitants des environs , et l'on parlait de toutes ces paroles d...","[La, crainte, s, ', empara, de, tous, les, habitants, des, environs, ,, et, l, ', on, parlait, d...","[▁La, ▁cra, inte, ▁s, ', emp, ara, ▁de, ▁tous, ▁les, ▁habitants, ▁des, ▁en, vir, ons, ▁,, ▁et, ▁...",Tawi mawɓe Iisaa ɓen yahayno hitaande kala Yerusalaam fii Juldeere Yawtaneede nden .,"[Tawi, mawɓe, Iisaa, ɓen, yahayno, hitaande, kala, Yerusalaam, fii, Juldeere, Yawtaneede, nden, .]","[▁Ta, wi, ▁maw, ɓe, ▁I, isaa, ▁ɓen, ▁yahay, no, ▁hitaande, ▁kala, ▁Yer, us, ala, am, ▁fii, ▁Jul,..."
11778,"Ou bien n'avons-nous pas le droit , Barnabé et moi , de ne pas travailler ?","[Ou, bien, n, ', avons, -, nous, pas, le, droit, ,, Barnabé, et, moi, ,, de, ne, pas, travailler...","[▁Ou, ▁bien, ▁n, ', avons, -, nous, ▁pas, ▁le, ▁droit, ▁,, ▁Barna, bé, ▁et, ▁moi, ▁,, ▁de, ▁ne, ...","Mi mantii on fii ko on yejjitaali fii an kon e ɗi fow , e ko maanditiɗon janndeeji ɗin kon , wan...","[Mi, mantii, on, fii, ko, on, yejjitaali, fii, an, kon, e, ɗi, fow, ,, e, ko, maanditiɗon, jannd...","[▁Mi, ▁man, tii, ▁on, ▁fii, ▁ko, ▁on, ▁ye, jj, ita, ali, ▁fii, ▁an, ▁kon, ▁e, ▁ɗi, ▁fow, ▁,, ▁e,..."


In [None]:
stats = smpl[['fuf_toks', 'fra_toks', 'fuf_words', 'fra_words']].applymap(len).describe()
stats

Unnamed: 0,fuf_toks,fra_toks,fuf_words,fra_words
count,10000.0,10000.0,10000.0,10000.0
mean,28.9311,26.973,18.9516,21.885
std,25.176138,24.12413,16.988617,19.911866
min,1.0,1.0,1.0,1.0
25%,4.0,4.0,2.0,2.0
50%,27.0,25.0,18.0,20.0
75%,44.0,40.0,29.0,33.0
max,344.0,382.0,235.0,316.0


In [None]:
print(stats.fuf_toks['mean'] / stats.fuf_words['mean'])
print(stats.fra_toks['mean'] / stats.fra_words['mean'])

1.5265782308617744
1.2324880054832075


In [None]:
print(tokenizer.unk_token, tokenizer.unk_token_id)

<unk> 3


Good news: for both French (`fra`) and Pular (`fuf`), the NLLB tokenizer produces fewer than 2 tokens per word on average (more precisely, about 1.2 and 1.5), which suggests that the translation quality of fine-tuned NLLB may be decent even without vocabulary extension.
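
The tokens-per-word ratio can be reproduced on any tokenizer with a few lines. The sketch below uses only the standard library: `toy_subword_tokenize` is a hypothetical stand-in for `tokenizer.tokenize` (it simply cuts words into pieces of at most four characters), so the number it prints says nothing about NLLB itself.

```python
import re

def word_tokenize(text):
    # Same naive word splitter as in the notebook.
    return re.findall(r"(\w+|[^\w\s])", text)

def toy_subword_tokenize(text, piece_len=4):
    # Hypothetical stand-in for `tokenizer.tokenize`: cut every word into
    # pieces of at most `piece_len` characters.
    pieces = []
    for word in word_tokenize(text):
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces

def tokens_per_word(texts, subword_tokenize):
    # Total number of subword tokens divided by total number of words.
    n_tokens = sum(len(subword_tokenize(t)) for t in texts)
    n_words = sum(len(word_tokenize(t)) for t in texts)
    return n_tokens / n_words

sample = ["Miɗo aami immagol .", "cracher"]
ratio = tokens_per_word(sample, toy_subword_tokenize)
print(round(ratio, 2))
```

Passing `tokenizer.tokenize` instead of the toy function should give the same kind of ratio as the one computed from `stats` above.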

One more check: how often does the `<unk>` token occur in the tokenizer output for the `fuf` texts? If it occurs too often, we need to fix it somehow.

In [None]:
texts_with_unk = [text for text in tqdm(trans_df.fuf) if tokenizer.unk_token_id in tokenizer(text).input_ids]
print(len(texts_with_unk))

  0%|          | 0/20322 [00:00<?, ?it/s]

2459


In [None]:
import random
s = random.sample(texts_with_unk, 5)
s

["Iisaa wi'i mo kadi fahin : « Mariyama ! » Debbo on fewti mo , wi'i e haala Yahuudiyanke : « Rabbunii ! » ( Ko woni ɗun ko « Karamoko'en » . )",
 "Kono Iisaa wi'i : « Wota on toŋan mo ɗun . Ko fii hay gooto waawataa waɗude kaawake moƴƴo e innde an , yiltoo kisan ɓawto ɗun , wowlammi ko boni .",
 "Mi andinii on , ko wano non kadi malaa'ikaaɓe Alla ɓen weltorta tuma nde junuubankeejo gooto tuubi . »",
 'ko fii e hino ko Allaahu on daali kon ka deftere : « Wonee laaɓuɓe , ko fii ko mi Laaɓuɗo . »',
 'Jooni non meɗen andi wonde hiɗon andi kala huunde , awa kadi jaraa ka goɗɗo landoo on . Ko ɗun waɗi si meɗen gomɗini wonde ko ka Alla iwruɗon . »']

Apparently, most of the 2459 texts with unknown tokens just contain punctuation unfamiliar to the NLLB tokenizer (the samples above all contain the guillemets « or »).

This is because the NLLB model was pretrained on normalized texts. If we reproduce the normalization, most of the problems would be fixed.

In [None]:
# this code is adapted from the Stopes repo of the NLLB team
# https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

import re
import sys
import typing as tp
import unicodedata
from sacremoses import MosesPunctNormalizer


mpn = MosesPunctNormalizer(lang="en")
mpn.substitutions = [
    (re.compile(r), sub) for r, sub in mpn.substitutions
]


def get_non_printing_char_replacer(replace_by: str = " ") -> tp.Callable[[str], str]:
    non_printable_map = {
        ord(c): replace_by
        for c in (chr(i) for i in range(sys.maxunicode + 1))
        # same as \p{C} in perl
        # see https://www.unicode.org/reports/tr44/#General_Category_Values
        if unicodedata.category(c) in {"C", "Cc", "Cf", "Cs", "Co", "Cn"}
    }

    def replace_non_printing_char(line) -> str:
        return line.translate(non_printable_map)

    return replace_non_printing_char

replace_nonprint = get_non_printing_char_replacer(" ")

def preproc(text):
    clean = mpn.normalize(text)
    clean = replace_nonprint(clean)
    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca
    clean = unicodedata.normalize("NFKC", clean)
    return clean

In [None]:
texts_with_unk_normed = [text for text in tqdm(texts_with_unk) if tokenizer.unk_token_id in tokenizer(preproc(text)).input_ids]
print(len(texts_with_unk_normed))

  0%|          | 0/2459 [00:00<?, ?it/s]

0


Indeed, after normalization, none of the texts contain unknown tokens. We will use this as one more piece of evidence that we don't have to update the tokenizer vocabulary to use it with Pular (`fuf`).
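
To see what the normalization does without installing anything, here is a much reduced sketch that uses only the standard library. `PUNCT_MAP` and `light_preproc` are hypothetical names, and the punctuation map is a tiny hand-picked subset; the real `preproc` above relies on `MosesPunctNormalizer`, which covers many more cases.

```python
import unicodedata

# Simplified, hypothetical punctuation map; the notebook itself relies on
# sacremoses.MosesPunctNormalizer, which handles many more substitutions.
PUNCT_MAP = str.maketrans({"«": '"', "»": '"', "“": '"', "”": '"', "’": "'"})

def light_preproc(text):
    text = text.translate(PUNCT_MAP)
    # Replace characters from the Unicode "Other" categories (control, format, ...) by spaces.
    text = "".join(" " if unicodedata.category(c).startswith("C") else c for c in text)
    # NFKC folds compatibility characters, e.g. styled mathematical letters, into plain ones.
    return unicodedata.normalize("NFKC", text)

print(light_preproc("« Rabbunii ! »"))
print(light_preproc("𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞"))
```

The three steps are applied in the same order as in `preproc`: punctuation first, then non-printing characters, then NFKC.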

# 3 (optional). Expanding the vocabulary

# 4. Adding a new language tag to the tokenizer and model

In [None]:
from transformers import AutoModelForSeq2SeqLM
from transformers import NllbTokenizer

In [None]:
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')
print(len(tokenizer))
print(tokenizer.convert_ids_to_tokens([256202, 256203]))

256204
['zul_Latn', '<mask>']


In [None]:
def fix_tokenizer(tokenizer, new_lang='fuf_Latn'):
    """
    Add a new language token to the tokenizer vocabulary
    (this should be done each time after its initialization)
    """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

In [None]:
fix_tokenizer(tokenizer)

In [None]:
print(tokenizer.convert_ids_to_tokens([256202, 256203, 256204])) # ['zul_Latn', 'fuf_Latn', '<mask>']
print(tokenizer.convert_tokens_to_ids(['zul_Latn', 'fuf_Latn', '<mask>'])) # [256202, 256203, 256204]
# this is consistent now, wow!

['zul_Latn', 'fuf_Latn', '<mask>']
[256202, 256203, 256204]
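
The id bookkeeping behind `fix_tokenizer` can be pictured on a plain dictionary: the new language code takes the id that `<mask>` used to have, and `<mask>` moves one position up. The function `add_lang_code` below is a hypothetical illustration of that shift, not part of the tokenizer API.

```python
def add_lang_code(token_to_id, new_lang, mask_token="<mask>"):
    """Give `new_lang` the id currently held by `<mask>` and shift `<mask>` one up."""
    updated = dict(token_to_id)
    if new_lang in updated:
        return updated  # already added; nothing to do
    mask_id = updated[mask_token]
    updated[new_lang] = mask_id
    updated[mask_token] = mask_id + 1
    return updated

# Toy vocabulary that reuses the last two ids of the original tokenizer.
vocab = {"zul_Latn": 256202, "<mask>": 256203}
vocab = add_lang_code(vocab, "fuf_Latn")
print(sorted(vocab.items(), key=lambda kv: kv[1]))
```

Calling the function twice leaves the dictionary unchanged, which mirrors the requirement that `fix_tokenizer` can be run after every tokenizer initialization.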


In [None]:
added_token_id = tokenizer.convert_tokens_to_ids('fuf_Latn')
similar_lang_id = tokenizer.convert_tokens_to_ids('fuv_Latn')
print(added_token_id, similar_lang_id)

256203 256059


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/nllb-200-distilled-600M')
model.resize_token_embeddings(len(tokenizer))


You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 256205. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Embedding(256205, 1024)

In [None]:
# moving the embedding for "mask" to its new position
model.model.shared.weight.data[added_token_id+1] = model.model.shared.weight.data[added_token_id]
# initializing new language token with a token of a similar language
model.model.shared.weight.data[added_token_id] = model.model.shared.weight.data[similar_lang_id]
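
The order of the two assignments matters: the old `<mask>` vector has to be moved before its row is overwritten. Below is a small sketch of the same two steps on a list of rows instead of a tensor; `init_new_lang_row` and the toy numbers are made up for illustration.

```python
def init_new_lang_row(weights, new_id, similar_id):
    """Mimic the two tensor assignments above on a list of rows."""
    # The old `<mask>` vector sits at `new_id`; move it one row further first.
    weights[new_id + 1] = list(weights[new_id])
    # Then start the new language tag from the vector of a related language.
    weights[new_id] = list(weights[similar_id])
    return weights

# Tiny stand-in for the resized embedding matrix: 5 rows of dimension 2.
# Row 1 plays the role of fuv_Latn, row 3 of the old <mask>, row 4 is the freshly added row.
toy = [[0.0, 0.0], [1.0, 1.5], [2.0, 2.5], [9.0, 9.5], [0.1, 0.2]]
toy = init_new_lang_row(toy, new_id=3, similar_id=1)
print(toy)
```

Swapping the two assignments would copy the related-language vector into both rows and lose the `<mask>` embedding.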

# 5. Preparing the training loop

In [None]:
import gc
import random
import numpy as np
import torch
from tqdm.auto import tqdm, trange
from transformers.optimization import Adafactor
from transformers import get_constant_schedule_with_warmup

def cleanup():
    """Try to free GPU memory"""
    gc.collect()
    torch.cuda.empty_cache()

cleanup()

In [None]:
model.cuda();

In [None]:
optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    scale_parameter=False,
    relative_step=False,
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
)

In [None]:
batch_size = 16  # 32 already doesn't fit well into 15 GB of GPU memory
max_length = 128
warmup_steps = 1_000
training_steps = 57000

In [None]:
losses = []
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
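
With this scheduler, the learning rate grows linearly from 0 to its base value during the warmup steps and then stays constant. The function below is a plain-Python sketch of that shape with the values used in this notebook; `constant_with_warmup_lr` is a hypothetical helper and not part of `transformers`.

```python
def constant_with_warmup_lr(step, base_lr=1e-4, warmup_steps=1000):
    """Learning rate at `step`: linear ramp from 0 to `base_lr`, then constant."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr

for step in (0, 250, 1000, 57000):
    print(step, constant_with_warmup_lr(step))
```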

In [None]:
LANGS = [('fra', 'fra_Latn'), ('fuf', 'fuf_Latn')]

def get_batch_pairs(batch_size, data=df_train):
    (l1, long1), (l2, long2) = random.sample(LANGS, 2)
    xx, yy = [], []
    for _ in range(batch_size):
        item = data.iloc[random.randint(0, len(data)-1)]
        xx.append(preproc(item[l1]))
        yy.append(preproc(item[l2]))
    return xx, yy, long1, long2

print(get_batch_pairs(1))
# (["Feɲɲinannde Iisaa Almasiihu on . Ndee feɲɲinannde ko Allaahu on halfini nde Iisaa Almasiihu on , fii yo o hollu jiyaaɓe makko ɓen kon ko saatii arude ko ɓooyaa . Kanko kadi o ɓanginiri ɗun nulugol malaa'ikaajo makko goo e oo jiyaaɗo makko Yuuhanna ,"], ['Nous savons que celui qui est né de Dieu ne pèche pas , mais celui qui est né de Dieu se garde lui-même , et le malin ne le touche pas .'], 'fuf_Latn', 'fra_Latn')

(['Tawi maaya-jungoojo no nder ton . Nde tawnoo hiɓe faalaa tooɲude Iisaa , ɓe landii mo , ɓe wi\'i: " E hara no dagii ka goɗɗo ɲawndee e asewe ? "'], ["N'ayez donc pas peur d'eux , car il n'y a rien de caché qui ne soit révélé , ni de dissimulé qui ne soit connu ."], 'fuf_Latn', 'fra_Latn')


In [None]:
MODEL_SAVE_PATH = '/gd/MyDrive/models/nllb-fra-fuf-v2'

# 6. The training loop

In [None]:
# Run this cell only if the Colab instance has shut down; otherwise skip it.

# You can always load the model back from the Google Drive folder where you saved it.
# => if you get the error "Repo id must be in the form 'repo_name' or 'namespace/repo_name'",
# this usually means that the local path was not found (e.g. Google Drive is not mounted or nothing has been saved yet):
# you might need to rerun most of the cells above...
# or simply restart the session (Runtime -> Restart session)
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/gd/MyDrive/models/nllb-fra-fuf-v2'. Use `repo_type` argument if needed.

In [None]:
model.train()
x, y, loss = None, None, None
cleanup()

tq = trange(len(losses), training_steps)
for i in tq:
    xx, yy, lang1, lang2 = get_batch_pairs(batch_size)
    try:
        tokenizer.src_lang = lang1
        x = tokenizer(xx, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
        tokenizer.src_lang = lang2
        y = tokenizer(yy, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
        y.input_ids[y.input_ids == tokenizer.pad_token_id] = -100

        loss = model(**x, labels=y.input_ids).loss
        loss.backward()
        losses.append(loss.item())

        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

    except RuntimeError as e:
        optimizer.zero_grad(set_to_none=True)
        x, y, loss = None, None, None
        cleanup()
        print('error', max(len(s) for s in xx + yy), e)
        continue

    if i % 1000 == 0:
        print(i, np.mean(losses[-1000:]))

    if i % 1000 == 0 and i > 0:
        model.save_pretrained(MODEL_SAVE_PATH)
        tokenizer.save_pretrained(MODEL_SAVE_PATH)

  0%|          | 0/57000 [00:00<?, ?it/s]

0 3.8596949577331543
1000 3.342402245283127
2000 2.83847035741806
3000 2.685642701148987
4000 2.5417772048711775
5000 2.418346631169319
6000 2.3161424453258515
7000 2.2317524782419205
8000 2.1189011729955674
9000 2.0626422716379165
10000 1.9833402442932129
11000 1.917804942548275
12000 1.8667315596938132
13000 1.8009946464896203
14000 1.755530894100666
15000 1.6837286068201065
16000 1.6139905810952186
17000 1.5744221299290657
18000 1.5123122119307517
19000 1.424876333296299
20000 1.402962335586548
21000 1.3352301238775253
22000 1.290425343811512
23000 1.255201721817255
24000 1.1615479020178319
25000 1.144034013926983
26000 1.0928158876001834
27000 1.047685709029436
28000 0.9951075100004673
29000 0.9454264369606972
30000 0.8993389782607555
31000 0.880986488699913
32000 0.8158053965121508
33000 0.7774984947443009
34000 0.7357727001458406
35000 0.7115383298397064
36000 0.6665979711860418
37000 0.6393418710231781
38000 0.6052015965729952
39000 0.566942078396678
40000 0.5415387770980596
410
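
One detail of the loop that is easy to miss is the line that sets padded label positions to `-100`: this is the index that PyTorch's cross-entropy loss ignores by default, so padding does not contribute to the loss. A minimal sketch of the same masking on plain lists follows; the token ids and the pad id `1` are assumptions for illustration, so check `tokenizer.pad_token_id` in a real run.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids by IGNORE_INDEX in a batch given as a list of lists."""
    return [[IGNORE_INDEX if tok == pad_token_id else tok for tok in row] for row in label_ids]

# Toy batch; the ids and the pad id 1 are made up for illustration.
batch = [[256203, 17, 42, 2], [256203, 99, 2, 1]]
masked = mask_padding(batch, pad_token_id=1)
print(masked)
```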

In [None]:
import pandas as pd  # needed again if the session was restarted after the first cells

pd.Series(losses).ewm(100).mean().plot();
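
The smoothing used for the loss plot can also be computed without pandas. The sketch below implements an exponentially weighted mean with the weighting that pandas applies by default (`adjust=True`), where `com=100` corresponds to a smoothing factor of 1/101. The helper name `ewm_mean` is hypothetical, and the toy losses are the first few logged values, rounded.

```python
def ewm_mean(values, com=100):
    """Exponentially weighted mean with pandas' default `adjust=True` weighting."""
    decay = 1 - 1 / (1 + com)  # weight multiplier applied to older observations
    smoothed, num, den = [], 0.0, 0.0
    for x in values:
        num = num * decay + x    # weighted sum of the observations so far
        den = den * decay + 1.0  # sum of the weights
        smoothed.append(num / den)
    return smoothed

toy_losses = [3.86, 3.34, 2.84, 2.69, 2.54]
print([round(v, 3) for v in ewm_mean(toy_losses)])
```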

In [None]:
def translate(text, src_lang='fra_Latn', tgt_lang='fuf_Latn', a=16, b=1.5, max_input_length=1024, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        **kwargs
    )
    #print(inputs.input_ids.shape[1], result.shape[1])
    return tokenizer.batch_decode(result, skip_special_tokens=True)
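
The parameters `a` and `b` of `translate` bound the output length: the model may generate at most `a + b * input_length` new tokens. The small sketch below just evaluates that formula for a few input lengths; `max_new_tokens_budget` is a hypothetical helper name.

```python
def max_new_tokens_budget(input_len, a=16, b=1.5):
    """Generation budget used by `translate`: a fixed allowance plus a multiple of the input length."""
    return int(a + b * input_len)

for n in (1, 10, 128):
    print(n, max_new_tokens_budget(n))
```

A larger `a` helps with very short inputs such as single dictionary words, while `b` matters for long sentences.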

In [None]:
xx, yy, lang1, lang2 = get_batch_pairs(1, data=df_dev)
print(xx)
print(yy)
model.eval()
print(translate(xx[0], lang1, lang2, no_repeat_ngram_size=3, num_beams=5))

['Mbela a anndaa wonde ko Alla jeyi laamu asamaanji e lesdi on ngalaa keedanoowo mon wolla ballo sonaa Alla.']
["Ne sais-tu pas qu'à Allah, appartient le royaume des cieux et de la terre, et qu'en dehors d'Allah vous n'avez ni protecteur ni secoureur?"]
["Ne savez-vous pas qu'Allah a la royauté des cieux et de la terre, et qu'aucun protecteur ne vous est donné en dehors d'Allah?"]


In [None]:
!ls -alsh $MODEL_SAVE_PATH

total 2.3G
1.0K -rw------- 1 root root  898 Jan 18 00:07 config.json
 512 -rw------- 1 root root  184 Jan 18 00:07 generation_config.json
2.3G -rw------- 1 root root 2.3G Jan 18 00:07 pytorch_model.bin
4.7M -rw------- 1 root root 4.7M Jan 14 15:10 sentencepiece.bpe.model
3.5K -rw------- 1 root root 3.5K Jan 18 00:07 special_tokens_map.json
1.0K -rw------- 1 root root  570 Jan 18 00:07 tokenizer_config.json


# 7. Using the model

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig
from tqdm.auto import tqdm, trange

In [None]:
trans_df = pd.read_csv('/gd/MyDrive/fra_fuf_all.tsv', sep="\t")
trans_df.dropna(subset=['fra', 'fuf'], inplace=True)
df_train = trans_df[trans_df.split=='train'].copy() # 20 122 items
df_dev = trans_df[trans_df.split=='dev'].copy()     # 100 items
df_test = trans_df[trans_df.split=='test'].copy()   # 100 items



#df_train, df_devtest = train_test_split(trans_df, test_size=200, random_state=1)
#df_dev, df_test = train_test_split(df_devtest, test_size=0.5, random_state=1)

In [None]:
# this code is adapted from the Stopes repo of the NLLB team
# https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

import re
import sys
import typing as tp
import unicodedata
from sacremoses import MosesPunctNormalizer


mpn = MosesPunctNormalizer(lang="en")
mpn.substitutions = [
    (re.compile(r), sub) for r, sub in mpn.substitutions
]


def get_non_printing_char_replacer(replace_by: str = " ") -> tp.Callable[[str], str]:
    non_printable_map = {
        ord(c): replace_by
        for c in (chr(i) for i in range(sys.maxunicode + 1))
        # same as \p{C} in perl
        # see https://www.unicode.org/reports/tr44/#General_Category_Values
        if unicodedata.category(c) in {"C", "Cc", "Cf", "Cs", "Co", "Cn"}
    }

    def replace_non_printing_char(line) -> str:
        return line.translate(non_printable_map)

    return replace_non_printing_char

replace_nonprint = get_non_printing_char_replacer(" ")

def preproc(text):
    clean = mpn.normalize(text)
    clean = replace_nonprint(clean)
    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca
    clean = unicodedata.normalize("NFKC", clean)
    return clean

In [None]:
def fix_tokenizer(tokenizer, new_lang='fuf_Latn'):
    """ Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def translate(text, src_lang='fra_Latn', tgt_lang='fuf_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

In [None]:
t = "Où est la vache ?"
print(translate(t, 'fra_Latn', 'fuf_Latn'))
# 'Honto cowɗe ɗen woni?'

['Honto cowɗe ɗen woni?']


In [None]:
translate(t, 'fra_Latn', 'fuf_Latn', do_sample=True, num_beams=1, temperature=1.5)

['Nènè laamadu ndun hino woodaa.']

In [None]:
def batched_translate(texts, batch_size=16, **kwargs):
    """Translate texts in batches of similar length"""
    idxs, texts2 = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in trange(0, len(texts2), batch_size):
        results.extend(translate(texts2[i: i+batch_size], **kwargs))
    return [p for i, p in sorted(zip(idxs, results))]
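
The sorting trick in `batched_translate` groups texts of similar length into the same batch (which reduces padding) and then puts the results back into the input order. Here is a self-contained sketch of the same pattern, with a hypothetical `fake_translate` function that upper-cases texts in place of the model.

```python
def batched_apply(texts, fn, batch_size=2):
    """Sort texts by length, process them in batches with `fn`, then restore the input order."""
    idxs, ordered = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in range(0, len(ordered), batch_size):
        results.extend(fn(list(ordered[i:i + batch_size])))
    # Pair every result with its original position and sort by that position.
    return [r for _, r in sorted(zip(idxs, results), key=lambda p: p[0])]

def fake_translate(batch):
    # Hypothetical stand-in for `translate`: upper-case every text.
    return [t.upper() for t in batch]

print(batched_apply(["gomme", "Où est la vache ?", "banque"], fake_translate))
```

Note that the function expects a list of strings; iterating over a DataFrame directly would yield its column names instead of the texts.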

In [None]:
fuf_translated = batched_translate(df_test.fra.tolist(), src_lang='fra_Latn', tgt_lang='fuf_Latn')

  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
df_test['fuf_translated'] = [translate(t, 'fra_Latn', 'fuf_Latn')[0] for t in tqdm(df_test.fra)]
df_test['fra_translated'] = [translate(t, 'fuf_Latn', 'fra_Latn')[0] for t in tqdm(df_test.fuf)]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

In [None]:
import sacrebleu
bleu_calc = sacrebleu.BLEU()
chrf_calc = sacrebleu.CHRF(word_order=2)  # this metric is called ChrF++

In [None]:
xx, yy = ['Bonjour'], ['Tanaala']
print(bleu_calc.corpus_score(xx, [yy]))
print(chrf_calc.corpus_score(xx, [yy]))
print(chrf_calc.corpus_score(yy, [xx]))

BLEU = 0.00 0.0/0.0/0.0/0.0 (BP = 1.000 ratio = 1.000 hyp_len = 1 ref_len = 1)
chrF2++ = 2.04
chrF2++ = 2.04


In [None]:
print(bleu_calc.corpus_score(df_test['fuf_translated'].tolist(), [df_test['fuf'].tolist()]))
print(chrf_calc.corpus_score(df_test['fuf_translated'].tolist(), [df_test['fuf'].tolist()]))
print(bleu_calc.corpus_score(df_test['fra_translated'].tolist(), [df_test['fra'].tolist()]))
print(chrf_calc.corpus_score(df_test['fra_translated'].tolist(), [df_test['fra'].tolist()]))

BLEU = 16.03 48.5/20.8/11.4/6.2 (BP = 0.983 ratio = 0.983 hyp_len = 4005 ref_len = 4074)
chrF2++ = 39.76
BLEU = 15.37 47.6/21.3/11.9/7.4 (BP = 0.889 ratio = 0.895 hyp_len = 4456 ref_len = 4979)
chrF2++ = 37.39


In [None]:
pd.options.display.max_colwidth = 100

In [None]:
df_dev.sample(10, random_state=5)[['fra', 'fuf', 'fra_translated', 'fuf_translated']]

Unnamed: 0,fra,fuf,fra_translated,fuf_translated
6468,"Alors Jésus lui dit : « Va derrière moi , Satan ! Car il est écrit : « Tu adoreras le Seigneur t...","On mo karhu-maa kadi rondanagol mo dongal , naɓaa yeru sagara gooto , naɓan mo yeru sagara ɗiɗi .","Alors Jésus lui dit: "" Va derrière moi, Satan! Car il est écrit: "" Tu adoreras le Seigneur ton D...","On mo karhu-maa kadi rondanagol mo dongal, naɓan mo yeru sagara gooto, naɓan mo yeru sagara ɗiɗi."
14687,agileté,lontondiral,agileté,lontondiral
16919,gomme,uumugol,monstre,uumugol
20242,endurcir le corps,Ko fawñere fii laamu addi hawre hakkunde maɓɓe .,endurcir le corps,Ko fawñere fii laamu addi hawre hakkunde maɓɓe.
5607,"En vérité notre Seigneur - que Sa grandeur soit exaltée - ne S'est donné ni compagne, ni enfant!",Accidam e mo Mi tagi on kañun tun.,"En vérité notre Seigneur - que Sa grandeur soit exaltée - ne S'est donné ni compagne, ni enfant!",Accidam e mo Mi tagi on kañun tun.
15737,copeau,kuruyee,copeau,kuruyee
3018,Dis: «L'a fait descendre Celui qui connaît les secrets dans les cieux et la terre. Et Il est Par...,"Maaɗum werlee e makko ngalu, maaɗum laatanoo mo ngesa o ñaama e mabba. Tooñooɓe ɓen wi'i: ""On jo...","Dis: ""L'a fait descendre Celui qui connaît les secrets dans les cieux et la terre. Et Il est Par...","Maaɗum werlee e makko ngalu, maaɗum laatanoo mo ngesa o ñaama e mabba. Tooñooɓe ɓen wi'i: ""On jo..."
6054,"Car il était tout joyeux parmi les siens,",O watti ngol yoorko wayliiko;,"Car il était tout joyeux parmi les siens,",O watti ngol yoorko wayliiko;
15048,banque,beereeru,banque,beereeru
13315,"( car ils ont été établis prêtres sans serment ) , mais il a été établi prêtre avec serment par ...",Himo wi'i taho : « A yiɗaali e a weltoraali sadakaaji e dokke e sadakaaji sunneteeɗi e sadakaaji...,"(car ils ont été établis prêtres sans serment), mais il a été établi prêtre avec serment par cel...","Himo wi'i taho: "" A yiɗaali e a weltoraali sadakaaji e dokke e sadakaaji fii junuubaaji, waɗiran..."


In [None]:
print((df_dev.fuf == df_dev.fuf_translated).mean())
print((df_dev.fra == df_dev.fra_translated).mean())

0.44
0.4


In [None]:
!pip install editdistance



In [None]:
import editdistance

def ed_similarity(text1, text2):
    return max(0, 1 - editdistance.eval(text1, text2) / min(len(text1), len(text2)))

print(ed_similarity('кот', 'собака'))
print(ed_similarity('кот', 'кит'))

0
0.6666666666666667
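
If installing `editdistance` is not an option, the same similarity can be computed with a short dynamic-programming implementation of the Levenshtein distance. This is a sketch under one extra assumption that the cell above does not make: empty strings are handled explicitly to avoid a division by zero. The names `levenshtein` and `ed_similarity_pure` are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def ed_similarity_pure(text1, text2):
    shortest = min(len(text1), len(text2))
    if shortest == 0:
        # Assumption: two empty strings count as identical, anything else as fully different.
        return float(text1 == text2)
    return max(0, 1 - levenshtein(text1, text2) / shortest)

print(ed_similarity_pure('кот', 'собака'))
print(ed_similarity_pure('кот', 'кит'))
```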


In [None]:
pd.Series([ed_similarity(row.ru, row.rus_translated) for row in df_dev.itertuples()]).describe()

count    500.000000
mean       0.516367
std        0.392761
min        0.000000
25%        0.116013
50%        0.507009
75%        1.000000
max        1.000000
dtype: float64

In [None]:
pd.Series([ed_similarity(row.tyv, row.tyv_translated) for row in df_dev.itertuples()]).describe()

count    500.000000
mean       0.506007
std        0.382357
min        0.000000
25%        0.111111
50%        0.504902
75%        0.979730
max        1.000000
dtype: float64

In [None]:
df_dev.index.name = "row_id"

In [None]:
df_dev.to_csv(model_load_name + "/dev_set_translated.tsv", sep="\t")

Evaluating another model (with extended vocabulary)

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'

In [None]:
cfg = AutoConfig.from_pretrained(model_load_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name + "/pytorch_model_60k.bin", config=cfg).cuda()

In [None]:
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
df_dev['rus_translated2'] = [translate(t, 'tyv_Cyrl', 'rus_Cyrl')[0] for t in tqdm(df_dev.tyv)]


In [None]:
df_dev['tyv_translated2'] = [translate(t, 'rus_Cyrl', 'tyv_Cyrl')[0] for t in tqdm(df_dev.ru)]


In [None]:
print(bleu_calc.corpus_score(df_dev['rus_translated2'].tolist(), [df_dev['ru'].tolist()]))
print(chrf_calc.corpus_score(df_dev['rus_translated2'].tolist(), [df_dev['ru'].tolist()]))
print(bleu_calc.corpus_score(df_dev['tyv_translated2'].tolist(), [df_dev['tyv'].tolist()]))
print(chrf_calc.corpus_score(df_dev['tyv_translated2'].tolist(), [df_dev['tyv'].tolist()]))

BLEU = 25.18 52.4/31.3/20.4/13.3 (BP = 0.976 ratio = 0.976 hyp_len = 2269 ref_len = 2324)
chrF2++ = 49.85
BLEU = 23.22 51.6/29.4/18.3/11.6 (BP = 0.975 ratio = 0.975 hyp_len = 2312 ref_len = 2371)
chrF2++ = 49.87
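
Each sacreBLEU line packs several numbers: the four n-gram precisions, the brevity penalty (BP), and the hypothesis and reference lengths. As a rough sketch of how they combine, the snippet below recomputes the first BLEU score from its printed components, using the standard formula (BP times the geometric mean of the precisions). The function name `bleu_from_components` is hypothetical. A small gap to the printed 25.18 is expected, because the precisions are rounded to one decimal in the output.

```python
import math

def bleu_from_components(precisions, hyp_len, ref_len):
    """Recombine n-gram precisions (in percent) and lengths into a BLEU score."""
    # brevity penalty: 1 if the hypothesis is at least as long as the reference
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    # geometric mean of the n-gram precisions
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return bp * geo_mean, bp

# numbers copied from the tyv->rus BLEU line above
score, bp = bleu_from_components([52.4, 31.3, 20.4, 13.3], hyp_len=2269, ref_len=2324)
print(f"BP = {bp:.3f}, BLEU ~ {score:.2f}")
```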


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()

In [None]:
df_dev['rus_translated3'] = [translate(t, 'tyv_Cyrl', 'rus_Cyrl')[0] for t in tqdm(df_dev.tyv)]
df_dev['tyv_translated3'] = [translate(t, 'rus_Cyrl', 'tyv_Cyrl')[0] for t in tqdm(df_dev.ru)]


In [None]:
print(bleu_calc.corpus_score(df_dev['rus_translated3'].tolist(), [df_dev['ru'].tolist()]))
print(chrf_calc.corpus_score(df_dev['rus_translated3'].tolist(), [df_dev['ru'].tolist()]))
print(bleu_calc.corpus_score(df_dev['tyv_translated3'].tolist(), [df_dev['tyv'].tolist()]))
print(chrf_calc.corpus_score(df_dev['tyv_translated3'].tolist(), [df_dev['tyv'].tolist()]))

BLEU = 23.06 51.1/29.1/18.1/11.5 (BP = 0.978 ratio = 0.978 hyp_len = 2273 ref_len = 2324)
chrF2++ = 48.56
BLEU = 26.12 53.4/32.5/21.0/13.6 (BP = 0.985 ratio = 0.985 hyp_len = 2336 ref_len = 2371)
chrF2++ = 52.60


In [None]:
df_dev['rus2eng'] = [translate(t, 'tyv_Cyrl', 'eng_Latn')[0] for t in tqdm(df_dev.tyv)]
df_dev['tyv2eng'] = [translate(t, 'rus_Cyrl', 'eng_Latn')[0] for t in tqdm(df_dev.ru)]


Results with num_beams=1:
```
V1
BLEU = 23.21 51.2/29.1/18.0/11.8 (BP = 0.978 ratio = 0.978 hyp_len = 2273 ref_len = 2324)
chrF2++ = 47.88
BLEU = 22.03 51.5/29.7/17.9/10.4 (BP = 0.952 ratio = 0.953 hyp_len = 2260 ref_len = 2371)
chrF2++ = 49.37
V2
BLEU = 24.08 50.9/29.5/19.1/12.3 (BP = 0.988 ratio = 0.988 hyp_len = 2297 ref_len = 2324)
chrF2++ = 48.96
BLEU = 22.50 50.5/28.5/17.7/11.1 (BP = 0.974 ratio = 0.974 hyp_len = 2310 ref_len = 2371)
chrF2++ = 48.85
V3
BLEU = 22.25 49.8/27.8/17.2/11.0 (BP = 0.983 ratio = 0.983 hyp_len = 2284 ref_len = 2324)
chrF2++ = 47.89
BLEU = 25.28 52.2/31.2/20.0/13.1 (BP = 0.989 ratio = 0.989 hyp_len = 2346 ref_len = 2371)
chrF2++ = 51.87
```

Results with 4 beams:
```
V1
BLEU = 24.14 52.5/30.4/18.9/12.1 (BP = 0.981 ratio = 0.981 hyp_len = 2281 ref_len = 2324)
chrF2++ = 49.49
BLEU = 23.41 52.1/31.0/18.9/11.3 (BP = 0.966 ratio = 0.967 hyp_len = 2292 ref_len = 2371)
chrF2++ = 50.89
V2
BLEU = 25.18 52.4/31.3/20.4/13.3 (BP = 0.976 ratio = 0.976 hyp_len = 2269 ref_len = 2324)
chrF2++ = 49.85
BLEU = 23.22 51.6/29.4/18.3/11.6 (BP = 0.975 ratio = 0.975 hyp_len = 2312 ref_len = 2371)
chrF2++ = 49.87
V3
BLEU = 23.06 51.1/29.1/18.1/11.5 (BP = 0.978 ratio = 0.978 hyp_len = 2273 ref_len = 2324)
chrF2++ = 48.56
BLEU = 26.12 53.4/32.5/21.0/13.6 (BP = 0.985 ratio = 0.985 hyp_len = 2336 ref_len = 2371)
chrF2++ = 52.60
```

Which means:
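* For all directions and models, beam search with 4 beams improves both BLEU and chrF2++ over greedy decoding (`num_beams=1`).
* Longer training (V3 compared with V2) builds up quality for translation into Tyvan (BLEU 23.22 → 26.12 with 4 beams), but decreases it for translation into Russian (25.18 → 23.06).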
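
Both observations can be checked directly from the BLEU scores listed above. The sketch below copies those numbers into a dictionary (the layout and variable names are illustrative, not from the notebook), prints the beam-search gain per model and direction, and then the difference between V3 and V2 with 4 beams.

```python
# BLEU scores copied from the result listings: (tyv->rus, rus->tyv)
bleu = {
    'V1': {1: (23.21, 22.03), 4: (24.14, 23.41)},
    'V2': {1: (24.08, 22.50), 4: (25.18, 23.22)},
    'V3': {1: (22.25, 25.28), 4: (23.06, 26.12)},
}
directions = ('tyv->rus', 'rus->tyv')

# gain from switching num_beams=1 to num_beams=4
beam_gain = {
    (name, d): round(scores[4][i] - scores[1][i], 2)
    for name, scores in bleu.items()
    for i, d in enumerate(directions)
}
for key, gain in beam_gain.items():
    print(key, f"{gain:+.2f}")

# V3 versus V2 with 4 beams
v3_vs_v2 = {
    d: round(bleu['V3'][4][i] - bleu['V2'][4][i], 2)
    for i, d in enumerate(directions)
}
print(v3_vs_v2)
```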

```
                                  | tyv->rus | rus->tyv
Model v1 (no vocabulary update):  |
    no beam search                |   23.21  |  22.03
    num_beams = 4                 |   24.14  |  23.41
Model v2 (with vocabulary update):|
    no beam search                |   24.08  |  22.50
    num_beams = 4                 |   25.18  |  23.22
```

In [None]:
df_dev.to_csv(model_load_name + "/dev_set_translated.tsv", sep="\t")

Here are some examples of how translation has changed:

In [None]:
df_dev.sample(5, random_state=1)[['tyv', 'ru', 'rus_translated']]

row_id,tyv,ru,rus_translated
5442,транспорт херекселдерин ажыглаарының база шимчээшкинниң айыыл чок чоруунуң дүрүмнери,правила безопасности движения и эксплуатации транспортных средств,правила безопасности движения и эксплуатации транспортных средств
57777,аъш-чем садыы,продовольственный магазин,продовольственный магазин
104130,"Бүгү чүве төнген, бойлаан.","Все было кончено, потеряно.","Все было кончено, самостоятельно."
49344,фокуска кирбес,не попасть в фокус,не попасть в фокус
28319,рекорд тургузар,установить рекорд,поставить рекорд


In [None]:
df_dev.sample(20, random_state=1)[[
    'tyv', 'tyv_translated', 'tyv_translated2', 'tyv_translated3', 'tyv2eng',
    'ru', 'rus_translated', 'rus_translated2', 'rus_translated3', 'rus2eng',
]]

row_id,tyv,tyv_translated,tyv_translated2,tyv_translated3,tyv2eng,ru,rus_translated,rus_translated2,rus_translated3,rus2eng
5442,транспорт херекселдерин ажыглаарының база шимчээшкинниң айыыл чок чоруунуң дүрүмнери,шимчээшкинниң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери,транспорт аймаан шимчээшкининиң болгаш ажыглаарының айыыл чок чоруунуң дүрүмнери,транспорт аймаан шимчээшкининиң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери,ң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери,правила безопасности движения и эксплуатации транспортных средств,правила безопасности движения и эксплуатации транспортных средств,правила эксплуатации транспортных средств и безопасности движения,правила безопасности эксплуатации транспортных средств и движения,дүрүмнер транспорт херекселдерин ажыглаарының болгаш шимчээшкинниң айыыл чок чоруунуң дугайында
57777,аъш-чем садыы,аъш-чем садыы,аъш-чем садыы,аъш-чем садыы,садыы,продовольственный магазин,продовольственный магазин,продовольственный магазин,продовольственный магазин,садыы
104130,"Бүгү чүве төнген, бойлаан.","Шупту чүве доозулган, читкен.","Шупту чүве төнген, читкен.","Бүгү чүве бойлаан, читкен.","-даа, читкен-даа.","Все было кончено, потеряно.","Все было кончено, самостоятельно.","Все кончилось, разошёлся.","Все кончено, кончено.","-ла, бүгү чүве кончилось."
49344,фокуска кирбес,илби-шидиге алзыр арга чок,илбиге алыспас,илби-шидиге туттурбас,гге күш четпес,не попасть в фокус,не попасть в фокус,не попасться в фокусы,не попасть на фокус,г
28319,рекорд тургузар,рекорд тургузар,рекорд тургузар,рекорд тургузар,г тургузар,установить рекорд,поставить рекорд,установить рекорд,установить рекорд,г тургузар
43534,чурукту делгээр,чурукту делгээр,чурук делгээр,чурукту делгээр,ң чурукту делгээр,выставлять картину,выставлять картину,развернуть картину,экспонировать картину,чурукту делгередип чуруур
37159,ылап хөделир,ылап хөделир,бүзүрелдиг хөделир,ылап хөделир,г хөделир,действовать наверняка,действовать аккуратно,действовать наверняка,действовать наверняка,г хөделир
36993,колдуктап алгаш чоруур,шыңганның адаанга көдүрүп алгаш чоруур,колдук адаанга аппаар,колдуктап алгаш чоруур,г шыгжаар,нести под мышкой,нести под мышками,носить под мышкой,нести под мышкой,алгаш чоруур
116009,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,чко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,вничко-Крижевачка
113178,Лампаң,Лампанг,Лампаң,Лампанг,ң,Лампанг,Лампанг,Лампанг,Лампанг,


In [None]:
cols = ['ind', 'tyv', 'ru']
splits = {'train': df_train[df_train.index<=49_454], 'test': df_test, 'dev': df_dev}
df_joint = []
for k, v in splits.items():
    v = v[cols].copy()
    v.index.name = "row_id"
    v['split'] = k
    df_joint.append(v)
df_joint = pd.concat(df_joint)
df_joint.shape

(50000, 4)

In [None]:
df_joint.sample(5)

row_id,ind,tyv,ru,split
314,328,Өг-бүле бүрүзү эвээш дээрге-ле 500-600 ивини азырап өстүрзүн.,Пусть на каждую семью было хотя бы по 500-600 оленей.,train
4376,4390,Өрээл аяннаны берген,Комната приняла хороший вид,train
13377,13392,кым-бир кижи-биле силер деп чугаалажыр,быть на вы с кем-либо,train
91144,97279,"Идээледир чемнениринде база эки чүү-даа чүве чок, ынчангаш арай эвээшти чиңер.","Ничего хорошего нет и в переедании, так что ешьте поменьше.",dev
307,321,"Оларның аразында 14 суурда 500 четпес, а 8 суурда 250 хире чурттакчы бар.","Среди них в 14 селах менее 500, в восьми - менее 250 человек.",train


In [None]:
df_joint.to_csv("/gd/MyDrive/datasets/nlp/tyvan/rus_tyv_parallel_50k.tsv", sep="\t")

# Publishing the model to HF

In [None]:
#!huggingface-cli login


from huggingface_hub import notebook_login
notebook_login()


In [None]:
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig

In [None]:
def fix_tokenizer(tokenizer, new_lang='fuf_Latn'):
    """ Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}
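
To make the id arithmetic in `fix_tokenizer` concrete, here is a toy sketch of the vocabulary layout it enforces: SentencePiece pieces first, then the language codes, with `<mask>` always last. The sizes below (256,000 pieces, an offset of 1, 202 original language codes ending in `zul_Latn`) are assumptions inferred from the token ids printed elsewhere in this notebook, and `toy_layout` is a hypothetical helper that does not touch the real tokenizer. The new code is called `tyv_Cyrl` here; any other new language code would land in the same slot.

```python
def toy_layout(n_pieces, fairseq_offset, lang_codes):
    """Return token -> id for language codes and <mask>, mimicking fix_tokenizer's layout."""
    ids = {
        code: n_pieces + fairseq_offset + i
        for i, code in enumerate(lang_codes)
    }
    # <mask> goes right after the last language code
    ids['<mask>'] = n_pieces + fairseq_offset + len(lang_codes)
    return ids

# assumed sizes; placeholder names stand in for the first 201 language codes
old_codes = [f'lang_{i:03d}' for i in range(201)] + ['zul_Latn']

before = toy_layout(256_000, 1, old_codes)
after = toy_layout(256_000, 1, old_codes + ['tyv_Cyrl'])

print(before['zul_Latn'], before['<mask>'])
print(after['zul_Latn'], after['tyv_Cyrl'], after['<mask>'])
```

Under these assumptions the new language code takes over the id that `<mask>` used to have, and `<mask>` moves one position up.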

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name)
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
#upload_repo = "slone/nllb-rus-tyv-v1"
upload_repo = "flutter-painter/nllb-fra-fuf-v2"
tokenizer.push_to_hub(upload_repo)
model.push_to_hub(upload_repo)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/flutter-painter/nllb-fra-fuf-v2/commit/27546302ffd142ac6065b3e55431f587a8993e73', commit_message='Upload M2M100ForConditionalGeneration', commit_description='', oid='27546302ffd142ac6065b3e55431f587a8993e73', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-rus-tyv-v2-extvoc'
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
cfg = AutoConfig.from_pretrained(model_load_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name + "/pytorch_model_60k.bin", config=cfg)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
upload_repo = "slone/nllb-rus-tyv-v2-extvoc"
tokenizer.push_to_hub(upload_repo)
model.push_to_hub(upload_repo)

sentencepiece.bpe.model:   0%|          | 0.00/5.14M [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.51G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/slone/nllb-rus-tyv-v2-extvoc/commit/48e9b1269b037fe08280bfec990c189e5748bccd', commit_message='Upload M2M100ForConditionalGeneration', commit_description='', oid='48e9b1269b037fe08280bfec990c189e5748bccd', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
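print(tokenizer.convert_ids_to_tokens([256202, 256203, 256204])) # expected: ['zul_Latn', 'tyv_Cyrl', '<mask>']
print(tokenizer.convert_tokens_to_ids(['zul_Latn', 'tyv_Cyrl', '<mask>'])) # expected: [256202, 256203, 256204]
# The two mappings agree with each other, but '<mask>' and 'tyv_Cyrl' come out swapped
# relative to the expected order (fix_tokenizer was not called after loading this tokenizer).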

['zul_Latn', '<mask>', 'tyv_Cyrl']
[256202, 256204, 256203]


Testing that it works

In [None]:
MODEL_URL = 'flutter-painter/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True)
fix_tokenizer(tokenizer)

config.json:   0%|          | 0.00/898 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.46G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/184 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/4.85M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/3.56k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def translate(
    text,
    model,
    tokenizer,
    src_lang='fra_Latn',
    tgt_lang='fuf_Latn',
    max_length='auto',
    num_beams=4,
    no_repeat_ngram_size=4,
    n_out=None,
    **kwargs
):
    tokenizer.src_lang = src_lang
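    # padding=True lets a list of texts with different lengths be batched into one tensor
    encoded = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)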
    if max_length == 'auto':
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out
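
With `max_length='auto'`, the generation budget grows linearly with the input: 32 tokens plus twice the number of input tokens. A tiny sketch of that rule in isolation (the helper name `auto_max_length` is hypothetical):

```python
def auto_max_length(n_input_tokens, base=32, factor=2.0):
    """Generation length cap used when max_length='auto' in translate()."""
    return int(base + factor * n_input_tokens)

for n in (5, 50, 512):
    print(n, '->', auto_max_length(n))
```

Since inputs are truncated to 512 tokens, the cap never exceeds 1056 tokens.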

In [None]:
translate("le riz était très bon", model=model, tokenizer=tokenizer)

'rizun hino moƴƴi fota'

In [None]:
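# note: src_lang keeps its default 'fra_Latn' in this call; pass src_lang='fuf_Latn' to tag the input as Fula
translate("O watti ngol yoorko wayliiko", model=model, tokenizer=tokenizer, tgt_lang='fra_Latn')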

'Ousmane eut trois enfants: deux mourûrent, le troisième survécut.'

In [None]:
lang_to_code = {
    'Русский | Russian': 'rus_Cyrl',
    'Тувинский | Tyvan': 'tyv_Cyrl',
}

In [None]:
def translate_wrapper(text, src, trg, correct=None):
    src_lang = lang_to_code.get(src)
    tgt_lang = lang_to_code.get(trg)
    if src == trg:
        return 'Please choose two different languages'
    print(text, src, trg)
    result = translate(
        text=text,
        model=model,
        tokenizer=tokenizer,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
    return result
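
A possible hardening of the wrapper, sketched with a stub in place of the real model so that it runs on its own: validate the language names before calling the translator. `resolve_pair` and `fake_translate` are hypothetical names introduced for illustration.

```python
lang_to_code = {
    'Русский | Russian': 'rus_Cyrl',
    'Тувинский | Tyvan': 'tyv_Cyrl',
}

def resolve_pair(src, trg):
    """Map display names to NLLB language codes, rejecting bad input early."""
    if src == trg:
        raise ValueError('Please choose two different languages')
    try:
        return lang_to_code[src], lang_to_code[trg]
    except KeyError as err:
        raise ValueError(f'Unknown language: {err.args[0]}') from None

def fake_translate(text, src_lang, tgt_lang):
    # stand-in for translate(); it only echoes its arguments
    return f'[{src_lang}->{tgt_lang}] {text}'

src_lang, tgt_lang = resolve_pair('Русский | Russian', 'Тувинский | Tyvan')
print(fake_translate('красная птица', src_lang, tgt_lang))
```

Raising an error for an unknown name avoids passing `None` as a language code further down, where the failure would be harder to read.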

In [None]:
translate_wrapper("красная птица", 'Русский | Russian', 'Тувинский | Tyvan')

красная птица Русский | Russian Тувинский | Tyvan


'кызыл куш'