In this notebook, I show how to fine-tune an NLLB-200 machine translation model for a new language.

The new language is Pular (ISO 639-3 code `fuf`), and I use a French-Pular parallel corpus as the training data.

I am running this notebook on Google Colab with a T4 GPU that has 15 GB of memory. If you run it elsewhere, you may want to adjust the batch size so that the GPU is well utilized but there are no out-of-memory (OOM) errors.

# 0. Preliminaries

I run this notebook in Google Colab, whose runtime is ephemeral, so I read the dataset from Google Drive and write the resulting model back to it. The cell below mounts the drive.

In [None]:
from google.colab import drive
import os
if not os.path.exists('/gd'):
    drive.mount('/gd')

Mounted at /gd


Installing dependencies:
* `transformers`, as a neural network framework
* `sentencepiece`, a backend for my tokenizer (the algorithm for converting a text into symbols from the model's vocabulary)
* `sacremoses`, a package for the text preprocessing that was applied when NLLB models were pretrained
* `sacrebleu`, a package for evaluating translation models

In [None]:
import locale
def gpe(x=None):
    return "UTF-8"
locale.getpreferredencoding = gpe

In [None]:
!pip install sentencepiece transformers==4.33 datasets sacremoses sacrebleu  -q


# 1. Exploring the data

In this section, I look at the training data that I have and check how suitable it is for fine-tuning an NLLB model.

In [None]:
import pandas as pd

In [None]:

trans_df = pd.read_csv('/gd/MyDrive/fra_fuf_all.tsv', sep="\t")
print(trans_df.shape)
print(trans_df.columns)

(20322, 3)
Index(['fra', 'fuf', 'split'], dtype='object')


In [None]:
pd.options.display.max_colwidth = 100

In [None]:
trans_df.sample(10)

Unnamed: 0,fra,fuf,split
3585,"Du mort, Il fait sortir le vivant, et du vivant, Il fait sortir le mort. Et Il redonne la vie à ...","Hino jeyaa e maandeeji Makko ɗin: tagugol kammuuli ɗin e leydi ndin, e luutondirgol ɗemɗe mon e ...",train
15818,cracher,kilooti,train
10204,"Crétois et Arabes , nous les entendons parler dans nos langues des merveilles de Dieu . »","fii no Joomiraaɗo on yeɗira on saa'iiji fowtere immorde e makko , e fii no o additirana on on Al...",train
13830,"Bien-aimés , si notre cœur ne nous condamne pas , nous avons de l'assurance devant Dieu;","Ko fii kala on hiwriiɗo mo , haray tawdaama e kuuɗe makko bonɗe ɗen .",train
5206,"Si Nous voulions, Nous la rendrions salée. Pourquoi n'êtes-vous donc pas reconnaissants?","Ɓen wuddooɓe ɓe yamira yimɓe ɓen nguddam... kala ɗuurniiɗo, pellet, Allah kan ko Galo, Yettiniiɗo.",train
4968,"Ceux qui ne croient pas en l'au-delà donnent aux Anges des noms de femmes,",hika dogira e ndenka Amen. Ɗum ko njoɓdi [Nuuhu] fennanooɗo on.,train
10342,"Mais si elle est de Dieu , vous ne pourrez pas la renverser , et l'on trouverait même que vous l...","Fewndo ko jamaa on mottondirnoo ka wulaa , ko kanko wondunoo e malaa'ikaajo yewtaynooɗo mo on ka...",train
6060,"Vous passerez, certes, par des états successifs!",Woɗɗitoo mbo ɓurɗo hiiteede.,train
2507,"Nous n'avons point fait descendre sur toi le Coran pour que tu sois malheureux,",Ko Jippinaande immorde e On Taguɗo leydi ndin e kammuuli toowuɗi ɗin.,train
2504,"Nous l'avons rendu (le Coran) facile [à comprendre] en ta langue, afin que tu annonces par lui l...",Taa Haa.,train


In [None]:
trans_df.isnull().sum()

fra      0
fuf      0
split    0
dtype: int64

In [None]:
trans_df.split.value_counts()

train    20122
dev        100
test       100
Name: split, dtype: int64

In [None]:
df_train = trans_df[trans_df.split=='train'].copy() # 20 122 items
df_dev = trans_df[trans_df.split=='dev'].copy()     # 100 items
df_test = trans_df[trans_df.split=='test'].copy()   # 100 items

# 2. How well does the data fit into an NLLB tokenizer?

In [None]:
from transformers import NllbTokenizer
from tqdm.auto import tqdm, trange

In [None]:
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')


In [None]:
import re

def word_tokenize(text):
    # a very naive word tokenizer for languages with English-like orthography
    # raw string, so that \w and \s are regex classes rather than (invalid) string escapes
    return re.findall(r'(\w+|[^\w\s])', text)
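
To make the behaviour of this naive tokenizer concrete, here is a small self-contained check. The function is repeated so that the snippet runs on its own, and the two sample strings are short Pular fragments chosen only for illustration. Note that an apostrophe inside a word splits it into three "words", which slightly inflates the word counts.

```python
import re

def word_tokenize(text):
    # runs of word characters, or any single character that is neither word nor space
    return re.findall(r'(\w+|[^\w\s])', text)

# letters such as "ɗ" count as word characters, so they stay inside the word
example_words = word_tokenize("Miɗo aami immagol .")
print(example_words)

# the apostrophe is returned as a separate item
apostrophe_words = word_tokenize("lo'iri e kulol")
print(apostrophe_words)
```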

In [None]:
smpl = df_train.sample(10000, random_state=1)

smpl['fra_toks'] = smpl.fra.apply(tokenizer.tokenize)
smpl['fuf_toks'] = smpl.fuf.apply(tokenizer.tokenize)

smpl['fra_words'] = smpl.fra.apply(word_tokenize)
smpl['fuf_words'] = smpl.fuf.apply(word_tokenize)

In [None]:
smpl.sample(5)[['fra', 'fra_words', 'fra_toks', 'fuf', 'fuf_words', 'fuf_toks']]

Unnamed: 0,fra,fra_words,fra_toks,fuf,fuf_words,fuf_toks
11573,"afin que vous la receviez dans le Seigneur d'une manière digne des saints , et que vous l'aidiez...","[afin, que, vous, la, receviez, dans, le, Seigneur, d, ', une, manière, digne, des, saints, ,, e...","[▁afin, ▁que, ▁vous, ▁la, ▁rece, vie, z, ▁dans, ▁le, ▁Seigneur, ▁d, ', une, ▁manière, ▁dig, ne, ...",Min tigi miɗo wonirnoo takko mon e noone no lo'iri e kulol e diwnol tiiɗungol .,"[Min, tigi, miɗo, wonirnoo, takko, mon, e, noone, no, lo, ', iri, e, kulol, e, diwnol, tiiɗungol...","[▁Min, ▁tigi, ▁mi, ɗo, ▁won, ir, noo, ▁tak, ko, ▁mon, ▁e, ▁no, one, ▁no, ▁lo, ', iri, ▁e, ▁kul, ..."
3847,afin [qu'Allah] les récompense pleinement et leur ajoute de Sa grâce. Il est Pardonneur et Recon...,"[afin, [, qu, ', Allah, ], les, récompense, pleinement, et, leur, ajoute, de, Sa, grâce, ., Il, ...","[▁afin, ▁[, qu, ', Allah, ], ▁les, ▁ré, com, pense, ▁plein, ement, ▁et, ▁leur, ▁ajoute, ▁de, ▁Sa...","Pellet, Alla ko Annduɗo wirniiɗi kammuuli ɗin e leydi ndin. Ko O Annduɗo ko suuɗii e ɓerɗe.","[Pellet, ,, Alla, ko, Annduɗo, wirniiɗi, kammuuli, ɗin, e, leydi, ndin, ., Ko, O, Annduɗo, ko, s...","[▁Pel, let, ,, ▁Alla, ▁ko, ▁An, ndu, ɗo, ▁wir, nii, ɗi, ▁kam, mu, uli, ▁ɗin, ▁e, ▁leydi, ▁ndin, ..."
19712,je me suis planté une écharde dans le doigt,"[je, me, suis, planté, une, écharde, dans, le, doigt]","[▁je, ▁me, ▁suis, ▁plant, é, ▁une, ▁é, char, de, ▁dans, ▁le, ▁do, igt]",Miɗo aami immagol .,"[Miɗo, aami, immagol, .]","[▁Mi, ɗo, ▁aami, ▁immag, ol, ▁.]"
8205,"La crainte s'empara de tous les habitants des environs , et l'on parlait de toutes ces paroles d...","[La, crainte, s, ', empara, de, tous, les, habitants, des, environs, ,, et, l, ', on, parlait, d...","[▁La, ▁cra, inte, ▁s, ', emp, ara, ▁de, ▁tous, ▁les, ▁habitants, ▁des, ▁en, vir, ons, ▁,, ▁et, ▁...",Tawi mawɓe Iisaa ɓen yahayno hitaande kala Yerusalaam fii Juldeere Yawtaneede nden .,"[Tawi, mawɓe, Iisaa, ɓen, yahayno, hitaande, kala, Yerusalaam, fii, Juldeere, Yawtaneede, nden, .]","[▁Ta, wi, ▁maw, ɓe, ▁I, isaa, ▁ɓen, ▁yahay, no, ▁hitaande, ▁kala, ▁Yer, us, ala, am, ▁fii, ▁Jul,..."
11778,"Ou bien n'avons-nous pas le droit , Barnabé et moi , de ne pas travailler ?","[Ou, bien, n, ', avons, -, nous, pas, le, droit, ,, Barnabé, et, moi, ,, de, ne, pas, travailler...","[▁Ou, ▁bien, ▁n, ', avons, -, nous, ▁pas, ▁le, ▁droit, ▁,, ▁Barna, bé, ▁et, ▁moi, ▁,, ▁de, ▁ne, ...","Mi mantii on fii ko on yejjitaali fii an kon e ɗi fow , e ko maanditiɗon janndeeji ɗin kon , wan...","[Mi, mantii, on, fii, ko, on, yejjitaali, fii, an, kon, e, ɗi, fow, ,, e, ko, maanditiɗon, jannd...","[▁Mi, ▁man, tii, ▁on, ▁fii, ▁ko, ▁on, ▁ye, jj, ita, ali, ▁fii, ▁an, ▁kon, ▁e, ▁ɗi, ▁fow, ▁,, ▁e,..."


In [None]:
stats = smpl[['fuf_toks', 'fra_toks', 'fuf_words', 'fra_words']].applymap(len).describe()
stats

Unnamed: 0,fuf_toks,fra_toks,fuf_words,fra_words
count,10000.0,10000.0,10000.0,10000.0
mean,28.9311,26.973,18.9516,21.885
std,25.176138,24.12413,16.988617,19.911866
min,1.0,1.0,1.0,1.0
25%,4.0,4.0,2.0,2.0
50%,27.0,25.0,18.0,20.0
75%,44.0,40.0,29.0,33.0
max,344.0,382.0,235.0,316.0


In [None]:
print(stats.fuf_toks['mean'] / stats.fuf_words['mean'])
print(stats.fra_toks['mean'] / stats.fra_words['mean'])

1.5265782308617744
1.2324880054832075


In [None]:
print(tokenizer.unk_token, tokenizer.unk_token_id)

<unk> 3


Good news: for both French and Pular, the NLLB tokenizer produces fewer than 2 tokens per word (more precisely, about 1.2 and 1.5), which means that the translation quality of fine-tuned NLLB may be decent even without vocabulary extension.
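
As a rough illustration of how such a tokens-per-word ratio is computed, here is a minimal sketch that does not need the real tokenizer. `toy_subword_tokenize` is a hypothetical stand-in that simply chops words into pieces of at most four characters; with the real model one would pass `tokenizer.tokenize` instead. Dividing the mean token count by the mean word count over the same texts is the same as dividing the totals, which is what the helper does.

```python
import re

def word_tokenize(text):
    return re.findall(r'(\w+|[^\w\s])', text)

def toy_subword_tokenize(text, max_piece_len=4):
    """Hypothetical subword tokenizer: split every word into chunks of max_piece_len characters."""
    pieces = []
    for word in word_tokenize(text):
        pieces.extend(word[i:i + max_piece_len] for i in range(0, len(word), max_piece_len))
    return pieces

def tokens_per_word(texts, subword_tokenize):
    """Total number of subword tokens divided by the total number of words."""
    n_tokens = sum(len(subword_tokenize(t)) for t in texts)
    n_words = sum(len(word_tokenize(t)) for t in texts)
    return n_tokens / n_words

texts = ["je me suis planté", "endurcir le corps"]
ratio = tokens_per_word(texts, toy_subword_tokenize)
print(round(ratio, 3))
```

A ratio close to 1 means that most words are already single tokens in the vocabulary; a ratio of 3 or more would suggest that the tokenizer is falling back to very short pieces.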

One more check: how often does the `<unk>` token occur in the tokenizer output for Pular? If it occurs too often, we need to fix it somehow.

In [None]:
texts_with_unk = [text for text in tqdm(trans_df.fuf) if tokenizer.unk_token_id in tokenizer(text).input_ids]
print(len(texts_with_unk))

  0%|          | 0/20322 [00:00<?, ?it/s]

2459


In [None]:
import random
s = random.sample(texts_with_unk, 5)
s

["Iisaa wi'i mo kadi fahin : « Mariyama ! » Debbo on fewti mo , wi'i e haala Yahuudiyanke : « Rabbunii ! » ( Ko woni ɗun ko « Karamoko'en » . )",
 "Kono Iisaa wi'i : « Wota on toŋan mo ɗun . Ko fii hay gooto waawataa waɗude kaawake moƴƴo e innde an , yiltoo kisan ɓawto ɗun , wowlammi ko boni .",
 "Mi andinii on , ko wano non kadi malaa'ikaaɓe Alla ɓen weltorta tuma nde junuubankeejo gooto tuubi . »",
 'ko fii e hino ko Allaahu on daali kon ka deftere : « Wonee laaɓuɓe , ko fii ko mi Laaɓuɗo . »',
 'Jooni non meɗen andi wonde hiɗon andi kala huunde , awa kadi jaraa ka goɗɗo landoo on . Ko ɗun waɗi si meɗen gomɗini wonde ko ka Alla iwruɗon . »']

Apparently, most of the 2459 texts with unknown tokens just have some punctuation unfamiliar to the NLLB tokenizer.

This is because the NLLB model was pretrained on normalized texts. If we reproduce the normalization, most of the problems should be fixed.

In [None]:
# this code is adapted from  the Stopes repo of the NLLB team
# https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

import re
import sys
import typing as tp
import unicodedata
from sacremoses import MosesPunctNormalizer


mpn = MosesPunctNormalizer(lang="en")
mpn.substitutions = [
    (re.compile(r), sub) for r, sub in mpn.substitutions
]


def get_non_printing_char_replacer(replace_by: str = " ") -> tp.Callable[[str], str]:
    non_printable_map = {
        ord(c): replace_by
        for c in (chr(i) for i in range(sys.maxunicode + 1))
        # same as \p{C} in perl
        # see https://www.unicode.org/reports/tr44/#General_Category_Values
        if unicodedata.category(c) in {"C", "Cc", "Cf", "Cs", "Co", "Cn"}
    }

    def replace_non_printing_char(line) -> str:
        return line.translate(non_printable_map)

    return replace_non_printing_char

replace_nonprint = get_non_printing_char_replacer(" ")

def preproc(text):
    clean = mpn.normalize(text)
    clean = replace_nonprint(clean)
    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca
    clean = unicodedata.normalize("NFKC", clean)
    return clean

In [None]:
texts_with_unk_normed = [text for text in tqdm(texts_with_unk) if tokenizer.unk_token_id in tokenizer(preproc(text)).input_ids]
print(len(texts_with_unk_normed))

  0%|          | 0/2459 [00:00<?, ?it/s]

0


Indeed, after normalization, none of the texts contain unknown tokens. I take this as one more piece of evidence that we don't have to update the tokenizer vocabulary to use it with Pular.
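
To show what two of the three normalization steps do in isolation, here is a minimal sketch that uses only the standard library. It is a simplification, not the notebook's `preproc`: it leaves out the Moses punctuation normalizer (the step that rewrites quotes such as « »), and `replace_non_printing` and `simple_preproc` are illustrative names.

```python
import unicodedata

def replace_non_printing(text, replace_by=" "):
    # general categories starting with "C": control, format, surrogate, private use, unassigned
    return "".join(
        replace_by if unicodedata.category(ch).startswith("C") else ch
        for ch in text
    )

def simple_preproc(text):
    text = replace_non_printing(text)
    # NFKC folds compatibility variants (fancy math letters, no-break spaces) into plain forms
    return unicodedata.normalize("NFKC", text)

# "Francesca" written with mathematical script and fraktur letters
fancy = "\U0001d4d5\U0001d52f\U0001d51e\U0001d52b\U0001d520\U0001d522\U0001d530\U0001d520\U0001d51e"
print(simple_preproc(fancy))
print(repr(simple_preproc("a\u200bb")))   # zero-width space (category Cf)
print(repr(simple_preproc("a\u00a0b")))   # no-break space
```

Letters that are genuinely part of the orthography, such as "ɗ" or "ɓ", have no compatibility decomposition, so NFKC leaves them untouched.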

# 3 (optional). Expanding the vocabulary

# 4. Adding a new language tag to the tokenizer and model

In [None]:
from transformers import AutoModelForSeq2SeqLM
from transformers import NllbTokenizer

In [None]:
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')
print(len(tokenizer))
print(tokenizer.convert_ids_to_tokens([256202, 256203]))

256204
['zul_Latn', '<mask>']


In [None]:
def fix_tokenizer(tokenizer, new_lang='fuf_Latn'):
    """
    Add a new language token to the tokenizer vocabulary
    (this should be done each time after its initialization)
    """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

In [None]:
fix_tokenizer(tokenizer)

In [None]:
print(tokenizer.convert_ids_to_tokens([256202, 256203, 256204])) # ['zul_Latn', 'fuf_Latn', '<mask>']
print(tokenizer.convert_tokens_to_ids(['zul_Latn', 'fuf_Latn', '<mask>'])) # [256202, 256203, 256204]
# this is consistent now, wow!

['zul_Latn', 'fuf_Latn', '<mask>']
[256202, 256203, 256204]
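
The id bookkeeping behind this can be sketched with plain dictionaries. This is only a toy model of the idea, assuming (as the printed ids suggest) that `<mask>` occupies the last id: the new language code takes over the old `<mask>` id and `<mask>` moves one position up. `add_language_token` is a hypothetical helper, not a tokenizer method.

```python
def add_language_token(token_to_id, new_lang, mask_token="<mask>"):
    """Return a copy of the mapping with new_lang inserted right before mask_token."""
    updated = dict(token_to_id)
    if new_lang in updated:
        return updated                 # already present: calling it again changes nothing
    mask_id = updated[mask_token]
    updated[new_lang] = mask_id        # the new language takes the old <mask> slot
    updated[mask_token] = mask_id + 1  # <mask> stays the last token
    return updated

toy_vocab = {"zul_Latn": 256202, "<mask>": 256203}
new_vocab = add_language_token(toy_vocab, "fuf_Latn")
id_to_token = {i: t for t, i in new_vocab.items()}
print(new_vocab)
```

Keeping the helper idempotent matters here because the fix has to be reapplied every time the tokenizer is loaded.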


In [None]:
added_token_id = tokenizer.convert_tokens_to_ids('fuf_Latn')
similar_lang_id = tokenizer.convert_tokens_to_ids('fuv_Latn')
print(added_token_id, similar_lang_id)

256203 256059


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/nllb-200-distilled-600M')
model.resize_token_embeddings(len(tokenizer))


You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 256205. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc


Embedding(256205, 1024)

In [None]:
# moving the embedding for "mask" to its new position
model.model.shared.weight.data[added_token_id+1] = model.model.shared.weight.data[added_token_id]
# initializing new language token with a token of a similar language
model.model.shared.weight.data[added_token_id] = model.model.shared.weight.data[similar_lang_id]
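
The order of these two assignments matters, which is easier to see on a toy example. The sketch below uses nested Python lists in place of the embedding tensor and assumes the matrix has already been extended by one row at the end; `move_and_init_rows` is an illustrative name.

```python
def move_and_init_rows(weights, new_id, similar_id):
    """weights: list of embedding rows, already one row longer than the old vocabulary."""
    # 1) the row at new_id still holds the old <mask> embedding: copy it to the last row first
    weights[new_id + 1] = list(weights[new_id])
    # 2) only then overwrite new_id with the embedding of a related language
    weights[new_id] = list(weights[similar_id])
    return weights

# rows 0-2: language tags, row 3: <mask>, row 4: freshly added (randomly initialized) row
toy = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [9.0, 9.9], [0.5, 0.5]]
move_and_init_rows(toy, new_id=3, similar_id=1)
print(toy)
```

Swapping the two steps would overwrite the `<mask>` embedding before it is copied, and the last row would end up holding the related-language embedding instead.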

# 5. Preparing the training loop

In [None]:
import gc
import random
import numpy as np
import torch
from tqdm.auto import tqdm, trange
from transformers.optimization import Adafactor
from transformers import get_constant_schedule_with_warmup

def cleanup():
    """Try to free GPU memory"""
    gc.collect()
    torch.cuda.empty_cache()

cleanup()

In [None]:
model.cuda();

In [None]:
optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    scale_parameter=False,
    relative_step=False,
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
)

In [None]:
batch_size = 16  # 32 already doesn't fit well into 15 GB of GPU memory
max_length = 128
warmup_steps = 1_000
training_steps = 57000

In [None]:
losses = []
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)
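
As far as I understand this scheduler, it multiplies the base learning rate by a factor that grows linearly from 0 to 1 during the warmup steps and stays at 1 afterwards. The sketch below reproduces that shape in plain Python under this assumption; `constant_with_warmup_factor` is an illustrative name, not a `transformers` function.

```python
def constant_with_warmup_factor(step, num_warmup_steps):
    """Learning-rate multiplier: linear ramp during warmup, then constant."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0

base_lr = 1e-4
warmup_steps = 1_000
lr_at = {
    step: base_lr * constant_with_warmup_factor(step, warmup_steps)
    for step in (0, 500, 1_000, 57_000)
}
print(lr_at)
```

With the settings used here, the model trains at half the target rate at step 500 and at the full `1e-4` from step 1000 until the end.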

In [None]:
LANGS = [('fra', 'fra_Latn'), ('fuf', 'fuf_Latn')]

def get_batch_pairs(batch_size, data=df_train):
    (l1, long1), (l2, long2) = random.sample(LANGS, 2)
    xx, yy = [], []
    for _ in range(batch_size):
        item = data.iloc[random.randint(0, len(data)-1)]
        xx.append(preproc(item[l1]))
        yy.append(preproc(item[l2]))
    return xx, yy, long1, long2

print(get_batch_pairs(1))
# (["Feɲɲinannde Iisaa Almasiihu on . Ndee feɲɲinannde ko Allaahu on halfini nde Iisaa Almasiihu on , fii yo o hollu jiyaaɓe makko ɓen kon ko saatii arude ko ɓooyaa . Kanko kadi o ɓanginiri ɗun nulugol malaa'ikaajo makko goo e oo jiyaaɗo makko Yuuhanna ,"], ['Nous savons que celui qui est né de Dieu ne pèche pas , mais celui qui est né de Dieu se garde lui-même , et le malin ne le touche pas .'], 'fuf_Latn', 'fra_Latn')

(['Tawi maaya-jungoojo no nder ton . Nde tawnoo hiɓe faalaa tooɲude Iisaa , ɓe landii mo , ɓe wi\'i: " E hara no dagii ka goɗɗo ɲawndee e asewe ? "'], ["N'ayez donc pas peur d'eux , car il n'y a rien de caché qui ne soit révélé , ni de dissimulé qui ne soit connu ."], 'fuf_Latn', 'fra_Latn')


In [None]:
MODEL_SAVE_PATH = '/gd/MyDrive/models/nllb-fra-fuf-v2'

# 6. The training loop

In [None]:
# Run this cell only if the Colab instance has shut down; otherwise skip it.

# You can always load the model back from the Google Drive folder where it was saved.
# If you get the error "Repo id must be in the form 'repo_name' or 'namespace/repo_name'",
# the path was probably not found as a local directory (for example, the drive is not mounted
# or nothing has been saved there yet). You may need to rerun most of the cells above,
# or simply restart the session (Runtime -> Restart session).
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/gd/MyDrive/models/nllb-fra-fuf-v2'. Use `repo_type` argument if needed.

In [None]:
model.train()
x, y, loss = None, None, None
cleanup()

tq = trange(len(losses), training_steps)
for i in tq:
    xx, yy, lang1, lang2 = get_batch_pairs(batch_size)
    try:
        tokenizer.src_lang = lang1
        x = tokenizer(xx, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
        tokenizer.src_lang = lang2
        y = tokenizer(yy, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
        y.input_ids[y.input_ids == tokenizer.pad_token_id] = -100

        loss = model(**x, labels=y.input_ids).loss
        loss.backward()
        losses.append(loss.item())

        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

    except RuntimeError as e:
        optimizer.zero_grad(set_to_none=True)
        x, y, loss = None, None, None
        cleanup()
        print('error', max(len(s) for s in xx + yy), e)
        continue

    if i % 1000 == 0:
        print(i, np.mean(losses[-1000:]))

    if i % 1000 == 0 and i > 0:
        model.save_pretrained(MODEL_SAVE_PATH)
        tokenizer.save_pretrained(MODEL_SAVE_PATH)

  0%|          | 0/57000 [00:00<?, ?it/s]

0 3.8596949577331543
1000 3.342402245283127
2000 2.83847035741806
3000 2.685642701148987
4000 2.5417772048711775
5000 2.418346631169319
6000 2.3161424453258515
7000 2.2317524782419205
8000 2.1189011729955674
9000 2.0626422716379165
10000 1.9833402442932129
11000 1.917804942548275
12000 1.8667315596938132
13000 1.8009946464896203
14000 1.755530894100666
15000 1.6837286068201065
16000 1.6139905810952186
17000 1.5744221299290657
18000 1.5123122119307517
19000 1.424876333296299
20000 1.402962335586548
21000 1.3352301238775253
22000 1.290425343811512
23000 1.255201721817255
24000 1.1615479020178319
25000 1.144034013926983
26000 1.0928158876001834
27000 1.047685709029436
28000 0.9951075100004673
29000 0.9454264369606972
30000 0.8993389782607555
31000 0.880986488699913
32000 0.8158053965121508
33000 0.7774984947443009
34000 0.7357727001458406
35000 0.7115383298397064
36000 0.6665979711860418
37000 0.6393418710231781
38000 0.6052015965729952
39000 0.566942078396678
40000 0.5415387770980596
410
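
One detail of the loop that is easy to miss is the line that replaces padding ids in the labels with `-100`. PyTorch's cross-entropy loss ignores the label value `-100` by default, so padded positions do not contribute to the loss. The sketch below imitates this in plain Python; the padding id `1` and the per-position log-probability dictionaries are assumptions made for the example.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss

def mask_padding(label_ids, pad_token_id):
    """Replace padding ids in the labels with IGNORE_INDEX."""
    return [[IGNORE_INDEX if t == pad_token_id else t for t in row] for row in label_ids]

def mean_nll(log_probs, labels):
    """Average negative log-likelihood over positions that are not ignored.
    log_probs[b][t] maps token id -> log-probability (toy representation)."""
    total, count = 0.0, 0
    for row_lp, row_labels in zip(log_probs, labels):
        for lp, label in zip(row_lp, row_labels):
            if label == IGNORE_INDEX:
                continue
            total -= lp[label]
            count += 1
    return total / count

pad_id = 1  # assumed padding id for this toy example
labels = mask_padding([[5, 6, pad_id, pad_id]], pad_id)
step_lp = {5: math.log(0.5), 6: math.log(0.25), pad_id: math.log(0.25)}
loss = mean_nll([[step_lp] * 4], labels)
print(labels, round(loss, 4))
```

Without the masking, the two padded positions would be averaged into the loss and the model would be rewarded for predicting padding.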

In [None]:
import pandas as pd  # re-import in case the session was restarted since section 1
pd.Series(losses).ewm(100).mean().plot();
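
The smoothing used for this plot is an exponentially weighted mean. The first positional argument of `ewm` is the center of mass `com`, so `ewm(100)` corresponds to a smoothing factor `alpha = 1 / (1 + 100)`. A dependency-free sketch of the same computation, assuming pandas' default adjusted weighting:

```python
def ewm_mean(values, com):
    """Exponentially weighted mean with adjusted weights (older points decay by 1 - alpha)."""
    alpha = 1.0 / (1.0 + com)
    decay = 1.0 - alpha
    num, den, out = 0.0, 0.0, []
    for v in values:
        num = num * decay + v    # weighted sum of the values seen so far
        den = den * decay + 1.0  # sum of the weights
        out.append(num / den)
    return out

smoothed = ewm_mean([4.0, 2.0, 2.0], com=1)
print(smoothed)
```

A larger `com` gives a smoother but more delayed curve; with `com=100` each point effectively averages over the last few hundred training steps.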

In [None]:
def translate(text, src_lang='fra_Latn', tgt_lang='fuf_Latn', a=16, b=1.5, max_input_length=1024, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        **kwargs
    )
    #print(inputs.input_ids.shape[1], result.shape[1])
    return tokenizer.batch_decode(result, skip_special_tokens=True)

In [None]:
xx, yy, lang1, lang2 = get_batch_pairs(1, data=df_dev)
print(xx)
print(yy)
model.eval()
print(translate(xx[0], lang1, lang2, no_repeat_ngram_size=3, num_beams=5))

['Mbela a anndaa wonde ko Alla jeyi laamu asamaanji e lesdi on ngalaa keedanoowo mon wolla ballo sonaa Alla.']
["Ne sais-tu pas qu'à Allah, appartient le royaume des cieux et de la terre, et qu'en dehors d'Allah vous n'avez ni protecteur ni secoureur?"]
["Ne savez-vous pas qu'Allah a la royauté des cieux et de la terre, et qu'aucun protecteur ne vous est donné en dehors d'Allah?"]


In [None]:
!ls -alsh $MODEL_SAVE_PATH

total 2.3G
1.0K -rw------- 1 root root  898 Jan 18 00:07 config.json
 512 -rw------- 1 root root  184 Jan 18 00:07 generation_config.json
2.3G -rw------- 1 root root 2.3G Jan 18 00:07 pytorch_model.bin
4.7M -rw------- 1 root root 4.7M Jan 14 15:10 sentencepiece.bpe.model
3.5K -rw------- 1 root root 3.5K Jan 18 00:07 special_tokens_map.json
1.0K -rw------- 1 root root  570 Jan 18 00:07 tokenizer_config.json


# 7. Using the model

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig
from tqdm.auto import tqdm, trange

In [None]:
trans_df = pd.read_csv('/gd/MyDrive/fra_fuf_all.tsv', sep="\t")
trans_df.dropna(subset=['fra', 'fuf'], inplace=True)
df_train = trans_df[trans_df.split=='train'].copy() # 20 122 items
df_dev = trans_df[trans_df.split=='dev'].copy()     # 100 items
df_test = trans_df[trans_df.split=='test'].copy()   # 100 items



#df_train, df_devtest = train_test_split(trans_df, test_size=200, random_state=1)
#df_dev, df_test = train_test_split(df_devtest, test_size=0.5, random_state=1)

In [None]:
# this code is adapted from  the Stopes repo of the NLLB team
# https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

import re
import sys
import typing as tp
import unicodedata
from sacremoses import MosesPunctNormalizer


mpn = MosesPunctNormalizer(lang="en")
mpn.substitutions = [
    (re.compile(r), sub) for r, sub in mpn.substitutions
]


def get_non_printing_char_replacer(replace_by: str = " ") -> tp.Callable[[str], str]:
    non_printable_map = {
        ord(c): replace_by
        for c in (chr(i) for i in range(sys.maxunicode + 1))
        # same as \p{C} in perl
        # see https://www.unicode.org/reports/tr44/#General_Category_Values
        if unicodedata.category(c) in {"C", "Cc", "Cf", "Cs", "Co", "Cn"}
    }

    def replace_non_printing_char(line) -> str:
        return line.translate(non_printable_map)

    return replace_non_printing_char

replace_nonprint = get_non_printing_char_replacer(" ")

def preproc(text):
    clean = mpn.normalize(text)
    clean = replace_nonprint(clean)
    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca
    clean = unicodedata.normalize("NFKC", clean)
    return clean

In [None]:
def fix_tokenizer(tokenizer, new_lang='fuf_Latn'):
    """ Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def translate(text, src_lang='fra_Latn', tgt_lang='fuf_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)
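
The parameters `a` and `b` define a simple length budget for generation: the output may be at most `a + b * input_length` tokens long. A tiny sketch of that formula, with `max_new_tokens_budget` as an illustrative helper name:

```python
def max_new_tokens_budget(input_len, a=32, b=3):
    """Upper bound on generated tokens: a constant allowance plus b tokens per input token."""
    return int(a + b * input_len)

short_budget = max_new_tokens_budget(10, a=16, b=1.5)  # settings of the earlier translate()
long_budget = max_new_tokens_budget(10)                # defaults of the version in this section
print(short_budget, long_budget)
```

The constant term keeps very short inputs from being cut off, while the multiplier prevents the model from generating endlessly on long inputs.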

In [None]:
t = "Où est la vache ?"
print(translate(t, 'fra_Latn', 'fuf_Latn'))
# 'Honto cowɗe ɗen woni?'

['Honto cowɗe ɗen woni?']


In [None]:
translate(t, 'fra_Latn', 'fuf_Latn', do_sample=True, num_beams=1, temperature=1.5)

['Nènè laamadu ndun hino woodaa.']

In [None]:
def batched_translate(texts, batch_size=16, **kwargs):
    """Translate texts in batches of similar length"""
    idxs, texts2 = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in trange(0, len(texts2), batch_size):
        results.extend(translate(texts2[i: i+batch_size], **kwargs))
    return [p for i, p in sorted(zip(idxs, results))]
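
The sort-then-restore trick in this function can be checked without a model. In the sketch below, `fake_translate` is a stand-in that just upper-cases its inputs, and `batched_apply` is an illustrative generalization of the function above: texts are sorted by length so that each batch needs little padding, and the results are put back into the original order at the end.

```python
def batched_apply(texts, fn, batch_size=16):
    """Apply fn (list of str -> list of str) in length-sorted batches, preserving input order."""
    if not texts:
        return []
    idxs, sorted_texts = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in range(0, len(sorted_texts), batch_size):
        results.extend(fn(list(sorted_texts[i:i + batch_size])))
    # pair every result with its original position and sort by that position
    return [r for _, r in sorted(zip(idxs, results), key=lambda p: p[0])]

def fake_translate(batch):
    return [t.upper() for t in batch]

out = batched_apply(["bb", "a", "cccc", "ddd"], fake_translate, batch_size=2)
print(out)
```

Note that the input has to be a list of strings; iterating over a pandas DataFrame directly would yield its column names rather than its rows.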

In [None]:
# pass the list of French sentences; iterating over the DataFrame itself would yield column names
fuf_translated = batched_translate(df_test.fra.tolist(), src_lang='fra_Latn', tgt_lang='fuf_Latn')

In [None]:
df_test['fuf_translated'] = [translate(t, 'fra_Latn', 'fuf_Latn')[0] for t in tqdm(df_test.fra)]
df_test['fra_translated'] = [translate(t, 'fuf_Latn', 'fra_Latn')[0] for t in tqdm(df_test.fuf)]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

In [None]:
import sacrebleu
bleu_calc = sacrebleu.BLEU()
chrf_calc = sacrebleu.CHRF(word_order=2)  # this metric is called ChrF++

In [None]:
xx, yy = ['Bonjour'], ['Tanaala']
print(bleu_calc.corpus_score(xx, [yy]))
print(chrf_calc.corpus_score(xx, [yy]))
print(chrf_calc.corpus_score(yy, [xx]))

BLEU = 0.00 0.0/0.0/0.0/0.0 (BP = 1.000 ratio = 1.000 hyp_len = 1 ref_len = 1)
chrF2++ = 2.04
chrF2++ = 2.04


In [None]:
print(bleu_calc.corpus_score(df_test['fuf_translated'].tolist(), [df_test['fuf'].tolist()]))
print(chrf_calc.corpus_score(df_test['fuf_translated'].tolist(), [df_test['fuf'].tolist()]))
print(bleu_calc.corpus_score(df_test['fra_translated'].tolist(), [df_test['fra'].tolist()]))
print(chrf_calc.corpus_score(df_test['fra_translated'].tolist(), [df_test['fra'].tolist()]))

BLEU = 16.03 48.5/20.8/11.4/6.2 (BP = 0.983 ratio = 0.983 hyp_len = 4005 ref_len = 4074)
chrF2++ = 39.76
BLEU = 15.37 47.6/21.3/11.9/7.4 (BP = 0.889 ratio = 0.895 hyp_len = 4456 ref_len = 4979)
chrF2++ = 37.39


In [None]:
pd.options.display.max_colwidth = 100

In [None]:
df_dev.sample(10, random_state=5)[['fra', 'fuf', 'fra_translated', 'fuf_translated']]

Unnamed: 0,fra,fuf,fra_translated,fuf_translated
6468,"Alors Jésus lui dit : « Va derrière moi , Satan ! Car il est écrit : « Tu adoreras le Seigneur t...","On mo karhu-maa kadi rondanagol mo dongal , naɓaa yeru sagara gooto , naɓan mo yeru sagara ɗiɗi .","Alors Jésus lui dit: "" Va derrière moi, Satan! Car il est écrit: "" Tu adoreras le Seigneur ton D...","On mo karhu-maa kadi rondanagol mo dongal, naɓan mo yeru sagara gooto, naɓan mo yeru sagara ɗiɗi."
14687,agileté,lontondiral,agileté,lontondiral
16919,gomme,uumugol,monstre,uumugol
20242,endurcir le corps,Ko fawñere fii laamu addi hawre hakkunde maɓɓe .,endurcir le corps,Ko fawñere fii laamu addi hawre hakkunde maɓɓe.
5607,"En vérité notre Seigneur - que Sa grandeur soit exaltée - ne S'est donné ni compagne, ni enfant!",Accidam e mo Mi tagi on kañun tun.,"En vérité notre Seigneur - que Sa grandeur soit exaltée - ne S'est donné ni compagne, ni enfant!",Accidam e mo Mi tagi on kañun tun.
15737,copeau,kuruyee,copeau,kuruyee
3018,Dis: «L'a fait descendre Celui qui connaît les secrets dans les cieux et la terre. Et Il est Par...,"Maaɗum werlee e makko ngalu, maaɗum laatanoo mo ngesa o ñaama e mabba. Tooñooɓe ɓen wi'i: ""On jo...","Dis: ""L'a fait descendre Celui qui connaît les secrets dans les cieux et la terre. Et Il est Par...","Maaɗum werlee e makko ngalu, maaɗum laatanoo mo ngesa o ñaama e mabba. Tooñooɓe ɓen wi'i: ""On jo..."
6054,"Car il était tout joyeux parmi les siens,",O watti ngol yoorko wayliiko;,"Car il était tout joyeux parmi les siens,",O watti ngol yoorko wayliiko;
15048,banque,beereeru,banque,beereeru
13315,"( car ils ont été établis prêtres sans serment ) , mais il a été établi prêtre avec serment par ...",Himo wi'i taho : « A yiɗaali e a weltoraali sadakaaji e dokke e sadakaaji sunneteeɗi e sadakaaji...,"(car ils ont été établis prêtres sans serment), mais il a été établi prêtre avec serment par cel...","Himo wi'i taho: "" A yiɗaali e a weltoraali sadakaaji e dokke e sadakaaji fii junuubaaji, waɗiran..."


In [None]:
print((df_dev.fuf == df_dev.fuf_translated).mean())
print((df_dev.fra == df_dev.fra_translated).mean())

0.44
0.4


In [None]:
!pip install editdistance



In [None]:
import editdistance

def ed_similarity(text1, text2):
    return max(0, 1 - editdistance.eval(text1, text2) / min(len(text1), len(text2)))

print(ed_similarity('кот', 'собака'))
print(ed_similarity('кот', 'кит'))

0
0.6666666666666667
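
For readers who prefer not to install an extra package, the same similarity can be sketched with a small pure-Python Levenshtein distance. `levenshtein` and `ed_similarity_pure` are illustrative names; the handling of empty strings is my own assumption, since the version above would divide by zero in that case.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def ed_similarity_pure(text1, text2):
    shortest = min(len(text1), len(text2))
    if shortest == 0:
        # assumption: two empty strings are identical, otherwise nothing is shared
        return 1.0 if text1 == text2 else 0.0
    return max(0, 1 - levenshtein(text1, text2) / shortest)

print(ed_similarity_pure('кот', 'собака'))
print(ed_similarity_pure('кот', 'кит'))
```

Because the distance is divided by the length of the shorter string, the raw value can become negative for very different strings, which is why it is clipped at 0.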


In [None]:
pd.Series([ed_similarity(row.ru, row.rus_translated) for row in df_dev.itertuples()]).describe()

count    500.000000
mean       0.516367
std        0.392761
min        0.000000
25%        0.116013
50%        0.507009
75%        1.000000
max        1.000000
dtype: float64

In [None]:
pd.Series([ed_similarity(row.tyv, row.tyv_translated) for row in df_dev.itertuples()]).describe()

count    500.000000
mean       0.506007
std        0.382357
min        0.000000
25%        0.111111
50%        0.504902
75%        0.979730
max        1.000000
dtype: float64

In [None]:
df_dev.index.name = "row_id"

In [None]:
df_dev.to_csv(model_load_name + "/dev_set_translated.tsv", sep="\t")

Evaluating another model (with extended vocabulary)

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'

In [None]:
cfg = AutoConfig.from_pretrained(model_load_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name + "/pytorch_model_60k.bin", config=cfg).cuda()

In [None]:
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
df_dev['rus_translated2'] = [translate(t, 'tyv_Cyrl', 'rus_Cyrl')[0] for t in tqdm(df_dev.tyv)]


In [None]:
df_dev['tyv_translated2'] = [translate(t, 'rus_Cyrl', 'tyv_Cyrl')[0] for t in tqdm(df_dev.ru)]


In [None]:
print(bleu_calc.corpus_score(df_dev['rus_translated2'].tolist(), [df_dev['ru'].tolist()]))
print(chrf_calc.corpus_score(df_dev['rus_translated2'].tolist(), [df_dev['ru'].tolist()]))
print(bleu_calc.corpus_score(df_dev['tyv_translated2'].tolist(), [df_dev['tyv'].tolist()]))
print(chrf_calc.corpus_score(df_dev['tyv_translated2'].tolist(), [df_dev['tyv'].tolist()]))

BLEU = 25.18 52.4/31.3/20.4/13.3 (BP = 0.976 ratio = 0.976 hyp_len = 2269 ref_len = 2324)
chrF2++ = 49.85
BLEU = 23.22 51.6/29.4/18.3/11.6 (BP = 0.975 ratio = 0.975 hyp_len = 2312 ref_len = 2371)
chrF2++ = 49.87
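
The BLEU line printed by sacrebleu can be sanity-checked by hand: BLEU is the brevity penalty (BP) multiplied by the geometric mean of the four n-gram precisions. The sketch below recomputes the first (tyv->rus) score from the rounded numbers printed above, so it agrees only up to rounding error. The function names are illustrative.

```python
import math

def brevity_penalty(hyp_len: int, ref_len: int) -> float:
    """BP = 1 if the hypothesis is longer than the reference, else exp(1 - ref/hyp)."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)

def bleu_from_precisions(precisions, hyp_len, ref_len):
    """BLEU = BP * exp(mean of log n-gram precisions); precisions are given in percent."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)

bp = brevity_penalty(hyp_len=2269, ref_len=2324)
bleu = bleu_from_precisions([52.4, 31.3, 20.4, 13.3], hyp_len=2269, ref_len=2324)
print(f"BP = {bp:.3f}, BLEU = {bleu:.2f}")
```

A BP slightly below 1 means that the translations are, in total, a little shorter than the references, which costs a fraction of a BLEU point.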


In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name).cuda()

In [None]:
df_dev['rus_translated3'] = [translate(t, 'tyv_Cyrl', 'rus_Cyrl')[0] for t in tqdm(df_dev.tyv)]
df_dev['tyv_translated3'] = [translate(t, 'rus_Cyrl', 'tyv_Cyrl')[0] for t in tqdm(df_dev.ru)]


In [None]:
print(bleu_calc.corpus_score(df_dev['rus_translated3'].tolist(), [df_dev['ru'].tolist()]))
print(chrf_calc.corpus_score(df_dev['rus_translated3'].tolist(), [df_dev['ru'].tolist()]))
print(bleu_calc.corpus_score(df_dev['tyv_translated3'].tolist(), [df_dev['tyv'].tolist()]))
print(chrf_calc.corpus_score(df_dev['tyv_translated3'].tolist(), [df_dev['tyv'].tolist()]))

BLEU = 23.06 51.1/29.1/18.1/11.5 (BP = 0.978 ratio = 0.978 hyp_len = 2273 ref_len = 2324)
chrF2++ = 48.56
BLEU = 26.12 53.4/32.5/21.0/13.6 (BP = 0.985 ratio = 0.985 hyp_len = 2336 ref_len = 2371)
chrF2++ = 52.60


In [None]:
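# Note: the column names are swapped relative to their content:
# 'rus2eng' holds the English translations of the Tyvan sentences, 'tyv2eng' those of the Russian ones.
df_dev['rus2eng'] = [translate(t, 'tyv_Cyrl', 'eng_Latn')[0] for t in tqdm(df_dev.tyv)]
df_dev['tyv2eng'] = [translate(t, 'rus_Cyrl', 'eng_Latn')[0] for t in tqdm(df_dev.ru)]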


Results with num_beams=1:
```
V1
BLEU = 23.21 51.2/29.1/18.0/11.8 (BP = 0.978 ratio = 0.978 hyp_len = 2273 ref_len = 2324)
chrF2++ = 47.88
BLEU = 22.03 51.5/29.7/17.9/10.4 (BP = 0.952 ratio = 0.953 hyp_len = 2260 ref_len = 2371)
chrF2++ = 49.37
V2
BLEU = 24.08 50.9/29.5/19.1/12.3 (BP = 0.988 ratio = 0.988 hyp_len = 2297 ref_len = 2324)
chrF2++ = 48.96
BLEU = 22.50 50.5/28.5/17.7/11.1 (BP = 0.974 ratio = 0.974 hyp_len = 2310 ref_len = 2371)
chrF2++ = 48.85
V3
BLEU = 22.25 49.8/27.8/17.2/11.0 (BP = 0.983 ratio = 0.983 hyp_len = 2284 ref_len = 2324)
chrF2++ = 47.89
BLEU = 25.28 52.2/31.2/20.0/13.1 (BP = 0.989 ratio = 0.989 hyp_len = 2346 ref_len = 2371)
chrF2++ = 51.87
```

Results with 4 beams:
```
V1
BLEU = 24.14 52.5/30.4/18.9/12.1 (BP = 0.981 ratio = 0.981 hyp_len = 2281 ref_len = 2324)
chrF2++ = 49.49
BLEU = 23.41 52.1/31.0/18.9/11.3 (BP = 0.966 ratio = 0.967 hyp_len = 2292 ref_len = 2371)
chrF2++ = 50.89
V2
BLEU = 25.18 52.4/31.3/20.4/13.3 (BP = 0.976 ratio = 0.976 hyp_len = 2269 ref_len = 2324)
chrF2++ = 49.85
BLEU = 23.22 51.6/29.4/18.3/11.6 (BP = 0.975 ratio = 0.975 hyp_len = 2312 ref_len = 2371)
chrF2++ = 49.87
V3
BLEU = 23.06 51.1/29.1/18.1/11.5 (BP = 0.978 ratio = 0.978 hyp_len = 2273 ref_len = 2324)
chrF2++ = 48.56
BLEU = 26.12 53.4/32.5/21.0/13.6 (BP = 0.985 ratio = 0.985 hyp_len = 2336 ref_len = 2371)
chrF2++ = 52.60
```

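Which means:
* For every model and translation direction, beam search with 4 beams scores higher than greedy decoding on both BLEU and chrF2++ (BLEU gains of about 0.7 to 1.4 points).
* Longer training (V3 versus V2) improves translation into Tyvan (BLEU 23.22 -> 26.12 with 4 beams) but degrades translation into Russian (25.18 -> 23.06).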

```
BLEU                               | tyv->rus | rus->tyv
Model v1 (no vocabulary update):   |          |
    no beam search                 |   23.21  |  22.03
    num_beams = 4                  |   24.14  |  23.41
Model v2 (with vocabulary update): |          |
    no beam search                 |   24.08  |  22.50
    num_beams = 4                  |   25.18  |  23.22
Model v3 (v2 trained longer):      |          |
    no beam search                 |   22.25  |  25.28
    num_beams = 4                  |   23.06  |  26.12
```
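
The two conclusions above can be checked directly against the reported scores. The sketch below hard-codes the BLEU values from the result listings (V1, V2, V3, with 1 and 4 beams) and computes the gain from beam search and the change from V2 to V3. The dictionary layout is just one convenient way to hold the numbers.

```python
# BLEU scores copied from the listings above: (tyv->rus, rus->tyv)
bleu = {
    ("V1", 1): (23.21, 22.03),
    ("V1", 4): (24.14, 23.41),
    ("V2", 1): (24.08, 22.50),
    ("V2", 4): (25.18, 23.22),
    ("V3", 1): (22.25, 25.28),
    ("V3", 4): (23.06, 26.12),
}

# Gain from beam search (4 beams minus greedy), per model and direction
beam_gain = {
    model: tuple(round(b4 - b1, 2) for b1, b4 in zip(bleu[(model, 1)], bleu[(model, 4)]))
    for model in ("V1", "V2", "V3")
}

# Effect of longer training: V3 minus V2, both with 4 beams
v3_vs_v2 = tuple(round(v3 - v2, 2) for v2, v3 in zip(bleu[("V2", 4)], bleu[("V3", 4)]))

for model, (to_rus, to_tyv) in beam_gain.items():
    print(f"{model}: beam search gain tyv->rus {to_rus:+.2f}, rus->tyv {to_tyv:+.2f}")
print(f"V3 vs V2 (4 beams): tyv->rus {v3_vs_v2[0]:+.2f}, rus->tyv {v3_vs_v2[1]:+.2f}")
```

Differences of around one BLEU point on a 500-sentence dev set are small, so these comparisons are best read as tendencies rather than firm rankings.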

In [None]:
df_dev.to_csv(model_load_name + "/dev_set_translated.tsv", sep="\t")

Here are some examples of how translation has changed:

In [None]:
df_dev.sample(5, random_state=1)[['tyv', 'ru', 'rus_translated']]

row_id,tyv,ru,rus_translated
5442,транспорт херекселдерин ажыглаарының база шимчээшкинниң айыыл чок чоруунуң дүрүмнери,правила безопасности движения и эксплуатации транспортных средств,правила безопасности движения и эксплуатации транспортных средств
57777,аъш-чем садыы,продовольственный магазин,продовольственный магазин
104130,"Бүгү чүве төнген, бойлаан.","Все было кончено, потеряно.","Все было кончено, самостоятельно."
49344,фокуска кирбес,не попасть в фокус,не попасть в фокус
28319,рекорд тургузар,установить рекорд,поставить рекорд


In [None]:
df_dev.sample(20, random_state=1)[[
    'tyv', 'tyv_translated', 'tyv_translated2', 'tyv_translated3', 'tyv2eng',
    'ru', 'rus_translated', 'rus_translated2', 'rus_translated3', 'rus2eng',
]]

row_id,tyv,tyv_translated,tyv_translated2,tyv_translated3,tyv2eng,ru,rus_translated,rus_translated2,rus_translated3,rus2eng
5442,транспорт херекселдерин ажыглаарының база шимчээшкинниң айыыл чок чоруунуң дүрүмнери,шимчээшкинниң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери,транспорт аймаан шимчээшкининиң болгаш ажыглаарының айыыл чок чоруунуң дүрүмнери,транспорт аймаан шимчээшкининиң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери,ң болгаш транспорт аймаан ажыглаарының айыыл чок чоруунуң дүрүмнери,правила безопасности движения и эксплуатации транспортных средств,правила безопасности движения и эксплуатации транспортных средств,правила эксплуатации транспортных средств и безопасности движения,правила безопасности эксплуатации транспортных средств и движения,дүрүмнер транспорт херекселдерин ажыглаарының болгаш шимчээшкинниң айыыл чок чоруунуң дугайында
57777,аъш-чем садыы,аъш-чем садыы,аъш-чем садыы,аъш-чем садыы,садыы,продовольственный магазин,продовольственный магазин,продовольственный магазин,продовольственный магазин,садыы
104130,"Бүгү чүве төнген, бойлаан.","Шупту чүве доозулган, читкен.","Шупту чүве төнген, читкен.","Бүгү чүве бойлаан, читкен.","-даа, читкен-даа.","Все было кончено, потеряно.","Все было кончено, самостоятельно.","Все кончилось, разошёлся.","Все кончено, кончено.","-ла, бүгү чүве кончилось."
49344,фокуска кирбес,илби-шидиге алзыр арга чок,илбиге алыспас,илби-шидиге туттурбас,гге күш четпес,не попасть в фокус,не попасть в фокус,не попасться в фокусы,не попасть на фокус,г
28319,рекорд тургузар,рекорд тургузар,рекорд тургузар,рекорд тургузар,г тургузар,установить рекорд,поставить рекорд,установить рекорд,установить рекорд,г тургузар
43534,чурукту делгээр,чурукту делгээр,чурук делгээр,чурукту делгээр,ң чурукту делгээр,выставлять картину,выставлять картину,развернуть картину,экспонировать картину,чурукту делгередип чуруур
37159,ылап хөделир,ылап хөделир,бүзүрелдиг хөделир,ылап хөделир,г хөделир,действовать наверняка,действовать аккуратно,действовать наверняка,действовать наверняка,г хөделир
36993,колдуктап алгаш чоруур,шыңганның адаанга көдүрүп алгаш чоруур,колдук адаанга аппаар,колдуктап алгаш чоруур,г шыгжаар,нести под мышкой,нести под мышками,носить под мышкой,нести под мышкой,алгаш чоруур
116009,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,чко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,Копривничко-Крижевачка,вничко-Крижевачка
113178,Лампаң,Лампанг,Лампаң,Лампанг,ң,Лампанг,Лампанг,Лампанг,Лампанг,


In [None]:
cols = ['ind', 'tyv', 'ru']
splits = {'train': df_train[df_train.index<=49_454], 'test': df_test, 'dev': df_dev}
df_joint = []
for k, v in splits.items():
    v = v[cols].copy()
    v.index.name = "row_id"
    v['split'] = k
    df_joint.append(v)
df_joint = pd.concat(df_joint)
df_joint.shape

(50000, 4)

In [None]:
df_joint.sample(5)

row_id,ind,tyv,ru,split
314,328,Өг-бүле бүрүзү эвээш дээрге-ле 500-600 ивини азырап өстүрзүн.,Пусть на каждую семью было хотя бы по 500-600 оленей.,train
4376,4390,Өрээл аяннаны берген,Комната приняла хороший вид,train
13377,13392,кым-бир кижи-биле силер деп чугаалажыр,быть на вы с кем-либо,train
91144,97279,"Идээледир чемнениринде база эки чүү-даа чүве чок, ынчангаш арай эвээшти чиңер.","Ничего хорошего нет и в переедании, так что ешьте поменьше.",dev
307,321,"Оларның аразында 14 суурда 500 четпес, а 8 суурда 250 хире чурттакчы бар.","Среди них в 14 селах менее 500, в восьми - менее 250 человек.",train


In [None]:
df_joint.to_csv("/gd/MyDrive/datasets/nlp/tyvan/rus_tyv_parallel_50k.tsv", sep="\t")

# Publishing the model to HF

In [None]:
#!huggingface-cli login


from huggingface_hub import notebook_login
notebook_login()


In [None]:
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig

In [None]:
def fix_tokenizer(tokenizer, new_lang='fuf_Latn'):
    """ Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}
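
The id arithmetic in `fix_tokenizer` is easier to follow on a toy model of the vocabulary layout. The sketch below uses plain Python objects instead of a real tokenizer. It assumes the layout "SentencePiece pieces, then language codes, then `<mask>`", with sizes (256,000 pieces, an offset of 1, 202 original language codes) chosen so that the result reproduces the ids 256202 to 256204 that appear later in this notebook. `ToyVocab` and `add_language` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class ToyVocab:
    """Toy stand-in for the id layout: [SentencePiece pieces][language codes][<mask>]."""
    sp_size: int = 256_000   # assumed number of SentencePiece pieces
    offset: int = 1          # assumed shift between SentencePiece ids and model ids
    lang_codes: list = field(default_factory=lambda: [f"lang{i:03d}" for i in range(202)])

    def lang_id(self, code: str) -> int:
        return self.sp_size + self.offset + self.lang_codes.index(code)

    @property
    def mask_id(self) -> int:
        # <mask> always sits right after the last language code
        return self.sp_size + self.offset + len(self.lang_codes)

    def __len__(self) -> int:
        return self.mask_id + 1


def add_language(vocab: ToyVocab, new_lang: str) -> int:
    """Give new_lang the slot previously held by <mask>; <mask> moves one position up."""
    if new_lang not in vocab.lang_codes:
        vocab.lang_codes.append(new_lang)
    return vocab.lang_id(new_lang)


vocab = ToyVocab()
old_len = len(vocab)
new_id = add_language(vocab, "tyv_Cyrl")
print(old_len, new_id, vocab.mask_id)
```

In this toy layout the new language code receives the id `old_len - 1`, and `<mask>` ends up at the very last position, which mirrors the two assignments at the top of `fix_tokenizer`.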

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name)
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
fix_tokenizer(tokenizer)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
#upload_repo = "slone/nllb-rus-tyv-v1"
upload_repo = "flutter-painter/nllb-fra-fuf-v2"
tokenizer.push_to_hub(upload_repo)
model.push_to_hub(upload_repo)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



CommitInfo(commit_url='https://huggingface.co/flutter-painter/nllb-fra-fuf-v2/commit/27546302ffd142ac6065b3e55431f587a8993e73', commit_message='Upload M2M100ForConditionalGeneration', commit_description='', oid='27546302ffd142ac6065b3e55431f587a8993e73', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
model_load_name = '/gd/MyDrive/models/nllb-rus-tyv-v2-extvoc'
tokenizer = NllbTokenizer.from_pretrained(model_load_name)
cfg = AutoConfig.from_pretrained(model_load_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_load_name + "/pytorch_model_60k.bin", config=cfg)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
upload_repo = "slone/nllb-rus-tyv-v2-extvoc"
tokenizer.push_to_hub(upload_repo)
model.push_to_hub(upload_repo)


CommitInfo(commit_url='https://huggingface.co/slone/nllb-rus-tyv-v2-extvoc/commit/48e9b1269b037fe08280bfec990c189e5748bccd', commit_message='Upload M2M100ForConditionalGeneration', commit_description='', oid='48e9b1269b037fe08280bfec990c189e5748bccd', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
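print(tokenizer.convert_ids_to_tokens([256202, 256203, 256204])) # expected: ['zul_Latn', 'tyv_Cyrl', '<mask>']
print(tokenizer.convert_tokens_to_ids(['zul_Latn', 'tyv_Cyrl', '<mask>'])) # expected: [256202, 256203, 256204]
# The two mappings agree with each other (ids -> tokens -> ids round-trips).
# In the output below, '<mask>' and 'tyv_Cyrl' are swapped relative to the expected values,
# presumably because fix_tokenizer was not applied to this freshly loaded tokenizer.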

['zul_Latn', '<mask>', 'tyv_Cyrl']
[256202, 256204, 256203]


Testing that it works

In [None]:
MODEL_URL = 'flutter-painter/nllb-fra-fuf-v2'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True)
fix_tokenizer(tokenizer)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def translate(
    text,
    model,
    tokenizer,
    src_lang='fra_Latn',
    tgt_lang='fuf_Latn',
    max_length='auto',
    num_beams=4,
    no_repeat_ngram_size=4,
    n_out=None,
    **kwargs
):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if max_length == 'auto':
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out
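
With `max_length='auto'`, the generation budget grows linearly with the input length. Here is a quick worked example of that rule, using a hypothetical helper `auto_max_length` that copies the formula from `translate`:

```python
def auto_max_length(n_input_tokens: int, base: int = 32, factor: float = 2.0) -> int:
    """Same formula as in translate(): a fixed allowance plus twice the input length."""
    return int(base + factor * n_input_tokens)

for n in (5, 50, 512):  # 512 is the truncation limit applied to the input
    print(n, "->", auto_max_length(n))
```

Very short inputs still get a few dozen output tokens, while the longest possible input (truncated to 512 tokens) is capped at 1056 generated tokens.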

In [None]:
translate("le riz était très bon", model=model, tokenizer=tokenizer)

'rizun hino moƴƴi fota'

In [None]:
translate("O watti ngol yoorko wayliiko", model=model, tokenizer=tokenizer, tgt_lang='fra_Latn')

'Ousmane eut trois enfants: deux mourûrent, le troisième survécut.'

In [None]:
lang_to_code = {
    'Русский | Russian': 'rus_Cyrl',
    'Тувинский | Tyvan': 'tyv_Cyrl',
}

In [None]:
def translate_wrapper(text, src, trg, correct=None):
    src_lang = lang_to_code.get(src)
    tgt_lang = lang_to_code.get(trg)
    if src == trg:
        return 'Please choose two different languages'
    print(text, src, trg)
    result = translate(
        text=text,
        model=model,
        tokenizer=tokenizer,
        src_lang=src_lang,
        tgt_lang=tgt_lang,
    )
    return result
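
`translate_wrapper` looks the language names up with `dict.get`, so an unknown name silently becomes `None` and fails only later, inside `translate`. One possible way to fail early is sketched below. `resolve_languages` is a hypothetical helper, and the mapping is repeated here so that the snippet runs on its own.

```python
LANG_TO_CODE = {
    "Русский | Russian": "rus_Cyrl",
    "Тувинский | Tyvan": "tyv_Cyrl",
}

def resolve_languages(src: str, trg: str, mapping=LANG_TO_CODE):
    """Return (src_code, tgt_code) or raise ValueError with a readable message."""
    for name in (src, trg):
        if name not in mapping:
            raise ValueError(f"Unknown language: {name!r}; choose one of {sorted(mapping)}")
    if src == trg:
        raise ValueError("Please choose two different languages")
    return mapping[src], mapping[trg]

print(resolve_languages("Русский | Russian", "Тувинский | Tyvan"))
```

Raising an exception instead of returning an error string is a design choice for this sketch; in a demo UI, returning the message (as `translate_wrapper` does) may be more convenient.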

In [None]:
translate_wrapper("красная птица", 'Русский | Russian', 'Тувинский | Tyvan')

красная птица Русский | Russian Тувинский | Tyvan


'кызыл куш'