<a href="https://colab.research.google.com/github/souvorinkg/Eng2Kin/blob/main/tutorial/training_NLLB_en_kin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune the NLLB model:
In the previous tutorial (link) on translation, we used backtranslation on a monolingual corpus to create a parallel corpus, then posted it on the HuggingFace Hub. You do not need to have completed that tutorial to complete this one. In this tutorial, we will use our HuggingFace dataset to finetune the NLLB model.

* First, we will load a dataset, tokenizer, and model from the HuggingFace Hub.
* Then, we will tokenize the data with the NLLB tokenizer and look for any irregularities.
* After that, we will train the model on our data.
* Finally, we will test our model and post it to the HuggingFace Hub, where we will build a simple GUI for it with a Gradio Space.

For this project I have selected English and Kinyarwanda. Kinyarwanda is a language spoken by roughly 15 million people in the nation of Rwanda, where it is universally spoken as a first language. I would like to thank [Gaelle Agahozo](https://github.com/GaelleAgahozo), whose initial translations and feedback were crucial to this project's success. I am grateful to [David Dale](https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865), whose article greatly aided me in this project. The code for training the NLLB model found here belongs to him. Finally, I would like to thank my advisor, [Dr. Ferrer](https://github.com/gjf2a/), who has taught me AI and guided this project.

# The NLLB Model
No Language Left Behind ([NLLB](https://ai.meta.com/blog/nllb-200-high-quality-machine-translation/)) is a transformer model released by Facebook in 2022, allowing for high-quality translations between any pair of the more than 200 languages it supports. This is far more languages than most other available translation models support. The model takes in a sentence as input and returns a sentence as output, making it a Seq2Seq (sequence-to-sequence) model. A parallel corpus is a collection of sentence pairs, where each pair holds the same sentence in two languages. We will train the model to predict a translation of the input sentence as the output.

# Part 1: Loading and Regularizing the text
Before running this notebook, I highly recommend switching to a GPU. GPUs, or Graphics Processing Units, are many times faster than CPUs at performing matrix operations, which we will be doing a lot of in this tutorial. In the "Runtime" menu, click "Change runtime type" and select the T4 GPU to greatly increase the speed of the training process.

We will need to install the following libraries:

* [Sentencepiece](https://github.com/google/sentencepiece/tree/master): Unsupervised text tokenization
* [Transformers](https://huggingface.co/docs/transformers/index): HuggingFace library for pretrained models
* [Datasets](https://huggingface.co/docs/datasets/index): HuggingFace library for public data
* [sacremoses](https://github.com/hplt-project/sacremoses) and [sacrebleu](https://github.com/mjpost/sacrebleu): Text normalization and translation quality (BLEU) scoring

In [1]:
!pip install sentencepiece transformers==4.33 datasets sacremoses sacrebleu -q

We will need to save and load items from Google Drive during this tutorial. This will allow us to save and load models throughout without fear of losing our progress. You will receive a prompt asking you to sign in, along with a request for a HuggingFace (HF) secret token.

In [2]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


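Next, we force Python's preferred encoding to UTF-8. This works around locale errors that Colab can raise when running shell commands later in the notebook.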

In [3]:
import locale
def gpe(x=None):
    return "UTF-8"
locale.getpreferredencoding = gpe

I have constructed a parallel dataset between English and Kinyarwanda and saved it to my Google Drive. You can download that dataset from [HuggingFace](https://huggingface.co/datasets/souvorinkg/english_kinyarwanda/tree/main).

Now, we open up the data from the previous tutorial from our Google Drive. You may have saved your data under a different name or location, so adjust as necessary.

In [4]:
# import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# importing data
filepath = '/content/gdrive/MyDrive/models/trans.csv'
df = pd.read_csv(filepath) # replace with a file path to your csv file

# head of the data
print(df.head())


                                                 kin  \
0   Bya heroine birenze urugero muri motel ihendutse   
1                                 Yapfuye atarashaje   
2  Paine yapfuye mu 1976 azize kunywa ibiyobyabwenge   
3                      Uzi ibitangaza bitagira igihe   
4                        Nahoraga nkunda flicks nini   

                                         eng  id  \
0      Of a heroin overdose in a cheap motel   1   
1                  He died before he got old   2   
2      Paine died in 1976 of a drug overdose   3   
3                  You know dateless wonders   4   
4  I always preferred the big monster flicks   5   

                                         translation  
0  { 'en': 'Of a heroin overdose in a cheap motel...  
1  { 'en': 'He died before he got old', ' kn': 'Y...  
2  { 'en': 'Paine died in 1976 of a drug overdose...  
3  { 'en': 'You know dateless wonders', ' kn': 'U...  
4  { 'en': 'I always preferred the big monster fl...  


# Split the data into testing and training.
We will now split the data into a training dataset, which the model learns from during training, and a test dataset, which is held out so the model can be evaluated on sentences it has never seen. Evaluating on held-out data lets us detect overfitting, the problem of a model matching the training data too closely. I have set the test size to be a random quarter of the available sentences. Feel free to play around with this ratio!

In [5]:
X = df['eng']
y = df['kin']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    random_state=104,
    test_size=0.25,
    shuffle=True,
)

# printing out train and test sets

print('X_train : ')
print(X_train.head())
print('')
print('X_test : ')
print(X_test.head())
print('')
print('y_train : ')
print(y_train.head())
print('')
print('y_test : ')
print(y_test.head())

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

# save df_train and df_test as csv files
df_train.to_csv('train.csv', index=False)
df_test.to_csv('test.csv', index=False)




X_train : 
21718    The process was known as the enclosure movement
5863          There was a joyful flutter among the girls
29534                       SNOW Are nt men sweet to you
10497                           I m with my friend Kaley
43071                  My only problem is shower curtain
Name: eng, dtype: object

X_test : 
54398                           You were lookin for me too
32997                            No ham no turkey no goose
29942                              Well we ll talk a break
47293                          So about those new pot laws
16956    Unforgiving reconciliation is an ethical form ...
Name: eng, dtype: object

y_train : 
21718          Inzira yari izwi nkigikorwa cyo kuzitira
5863            Habayeho kuvuza akanyamuneza mu bakobwa
29534                SNOW Ntabwo nt abagabo bakuryoheye
10497                  Ndi kumwe n'inshuti yanjye Kaley
43071    Ikibazo cyanjye gusa ni umwenda wo kwiyuhagira
Name: kin, dtype: object

y_test : 
54398               

In [6]:
df_train.head()

| | eng | kin |
|---|---|---|
| 21718 | The process was known as the enclosure movement | Inzira yari izwi nkigikorwa cyo kuzitira |
| 5863 | There was a joyful flutter among the girls | Habayeho kuvuza akanyamuneza mu bakobwa |
| 29534 | SNOW Are nt men sweet to you | SNOW Ntabwo nt abagabo bakuryoheye |
| 10497 | I m with my friend Kaley | Ndi kumwe n'inshuti yanjye Kaley |
| 43071 | My only problem is shower curtain | Ikibazo cyanjye gusa ni umwenda wo kwiyuhagira |


Now, let's examine our dataset with pandas, a data analysis library. My data has four columns; the important ones are the first two, which hold the Kinyarwanda and English sentences. Depending on the dataset you use, your columns may look different.

In [7]:
print(df.shape) # (54589, 4)
print(df.columns) # ['kin', 'eng', 'id', 'translation']
df.sample(10)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(54589, 4)
Index(['kin', 'eng', 'id', 'translation'], dtype='object')
(40941,)
(13648,)
(40941,)
(13648,)


# Part 2: Tokenize the Dataset

Models on Hugging Face commonly use the [transformer](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)) architecture, introduced in 2017, which greatly improved the ability of neural networks to handle lengthy input. It is now the dominant architecture in [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP), of which machine translation is a subfield. You have likely already used a transformer: it is the T in ChatGPT. For now, all you need to know is that we will work with two parts: a tokenizer, covered in this part, and a model, covered in the next part.

### Tokenization:
Transformers take input in the form of tokens. Naively, we can say tokens represent words. A transformer has a vocabulary of a fixed size, where each token in the vocabulary is a piece of text it recognizes. Anything the tokenizer cannot match to its vocabulary receives a special unknown token, meaning it is out of vocabulary (OOV). This is similar to how a ten-year-old might have a vocabulary of 2,000 words, recognizing "shiny" but attaching no meaning to "iridescent." With proper tokenization, we can minimize the number of OOV tokens!

### Tokens:
We described a token as a word, but it can also be a part of a word, like the prefixes "un-" and "re-" or the suffix "-ing." English words usually carry only a few prefixes and suffixes, so we'd expect most words to need only a few tokens. Kinyarwanda words, however, can stack many prefixes and suffixes onto a single stem. We therefore expect more tokens per word in Kinyarwanda!

If the tokenizer is performing well, there should be more tokens than words, but not too many more! If there are too many tokens, words are getting split unnecessarily. For instance, "berry" should not be split into "ber" and "ry." Additionally, our tokens should be constructed so that as few OOV tokens as possible are assigned, as we get no meaning out of those.

It is closer to the [truth](https://www.3blue1brown.com/lessons/gpt) that each token is represented by a number. Ultimately, the model turns each of those numbers into a vector, so a sentence becomes a matrix of numbers when it is processed. However, we can treat tokens as words for the purposes of this tutorial.
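
To make subword tokens and OOV concrete, here is a toy sketch. It is not how NLLB's SentencePiece tokenizer works internally: the tiny vocabulary, the `toy_tokenize` helper, and the greedy longest-match rule are all assumptions made purely for illustration.

```python
# Toy vocabulary (hypothetical): each known piece maps to an integer id.
VOCAB = {"<unk>": 0, "un": 1, "re": 2, "do": 3, "ing": 4, "berry": 5, "happy": 6}

def toy_tokenize(word, vocab=VOCAB):
    """Greedy longest-match split of one word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: treat the whole word as out of vocabulary.
            return ["<unk>"]
    return pieces

for word in ["redoing", "berry", "unhappy", "iridescent"]:
    pieces = toy_tokenize(word)
    # Print the word, its pieces, and the integer ids the model would see.
    print(word, pieces, [VOCAB[p] for p in pieces])
```

In this sketch "redoing" splits into three pieces, "berry" stays whole, and "iridescent" falls back to the unknown token, which mirrors the shiny/iridescent analogy above.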



In [8]:
from transformers import NllbTokenizer
from tqdm.auto import tqdm, trange

In [9]:
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')




In [10]:
import re

def word_tokenize(text):
    """
    Split a text into words, numbers, and punctuation marks
    (for languages where words are separated by spaces)
    """
    # raw string so that \w and \s are passed to the regex engine unchanged
    return re.findall(r'(\w+|[^\w\s])', text)

smpl = df.sample(10000, random_state=1)
smpl['eng_toks'] = smpl.eng.apply(tokenizer.tokenize) #Samples of Tokens
smpl['kin_toks'] = smpl.kin.apply(tokenizer.tokenize)
smpl['eng_words'] = smpl.eng.apply(word_tokenize) #Samples of Words
smpl['kin_words'] = smpl.kin.apply(word_tokenize)

Here, we find the average number of tokens and divide it by the average number of words. There should be more tokens than words, up to roughly twice as many! If you get a number less than one, or greater than four, check the quality of your data before running the model again. If you are translating into Kinyarwanda or another language with lots of prefixes and suffixes, it should have a higher ratio than English.

In [11]:
stats = smpl[
    ['eng_toks', 'kin_toks', 'eng_words', 'kin_words']
].applymap(len).describe()
print(stats.eng_toks['mean'] / stats.eng_words['mean'])  # 1.22
print(stats.kin_toks['mean'] / stats.kin_words['mean'])  # 1.69
stats

1.2207900933112408
1.6986711415483697


| | eng_toks | kin_toks | eng_words | kin_words |
|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 6.9994 | 7.4908 | 5.7335 | 4.4098 |
| std | 2.262943 | 3.103946 | 1.436692 | 1.798885 |
| min | 2.0 | 1.0 | 2.0 | 1.0 |
| 25% | 5.0 | 5.0 | 5.0 | 3.0 |
| 50% | 7.0 | 7.0 | 6.0 | 4.0 |
| 75% | 8.0 | 9.0 | 7.0 | 5.0 |
| max | 24.0 | 25.0 | 9.0 | 16.0 |
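
The ratio check described above can be wrapped in a small helper so it is easy to rerun on new data. The sketch below is only an illustration: `tokens_per_word`, `ratio_looks_reasonable`, and the stand-in `chunk_tokenize` function are hypothetical names, and the thresholds of one and four are the rule of thumb from this tutorial rather than a standard.

```python
import re

def word_tokenize(text):
    """Split text into words and punctuation marks."""
    return re.findall(r"(\w+|[^\w\s])", text)

def tokens_per_word(sentences, tokenize):
    """Total token count divided by total word count over all sentences."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(word_tokenize(s)) for s in sentences)
    return n_tokens / n_words

def ratio_looks_reasonable(ratio, low=1.0, high=4.0):
    """Rule of thumb from this tutorial: between one and four tokens per word."""
    return low <= ratio <= high

# Stand-in tokenizer (assumption): chop each word into chunks of up to 4 characters.
def chunk_tokenize(text, size=4):
    return [w[i:i + size] for w in word_tokenize(text) for i in range(0, len(w), size)]

sentences = ["Ndi kumwe n'inshuti yanjye Kaley", "He died before he got old"]
ratio = tokens_per_word(sentences, chunk_tokenize)
print(round(ratio, 2), ratio_looks_reasonable(ratio))
```

In the notebook itself you would pass `tokenizer.tokenize` in place of `chunk_tokenize` to measure the real NLLB tokenizer.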


There may be nonstandard characters in your dataset. Here is a cleaning function:

In [12]:
import re
import sys
import unicodedata
from sacremoses import MosesPunctNormalizer

mpn = MosesPunctNormalizer(lang="en")
mpn.substitutions = [
    (re.compile(r), sub) for r, sub in mpn.substitutions
]

def get_non_printing_char_replacer(replace_by: str = " "):
    non_printable_map = {
        ord(c): replace_by
        for c in (chr(i) for i in range(sys.maxunicode + 1))
        # same as \p{C} in perl
        # see https://www.unicode.org/reports/tr44/#General_Category_Values
        if unicodedata.category(c) in {"C", "Cc", "Cf", "Cs", "Co", "Cn"}
    }

    def replace_non_printing_char(line) -> str:
        return line.translate(non_printable_map)

    return replace_non_printing_char

replace_nonprint = get_non_printing_char_replacer(" ")

def preproc(text):
    clean = mpn.normalize(text)
    clean = replace_nonprint(clean)
    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca
    clean = unicodedata.normalize("NFKC", clean)
    return clean
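
To see what two of the cleaning steps do in isolation, here is a small sketch that uses only Python's standard library. The sample strings and the `strip_format_chars` helper are made up for illustration; the punctuation normalization done by `MosesPunctNormalizer` is not reproduced here.

```python
import unicodedata

# Hypothetical sample strings, written with escapes so the odd characters are visible.
samples = {
    "ligature": "\ufb01ne",                # "fine" written with the single-character fi ligature
    "fullwidth": "\uff2b\uff49\uff4e",     # "Kin" written with fullwidth letters
    "zero-width space": "mu\u200bri",      # invisible format character inside a word
}

def strip_format_chars(text, replace_by=" "):
    """Replace characters whose Unicode category starts with 'C' (control, format, ...)."""
    return "".join(
        replace_by if unicodedata.category(c).startswith("C") else c
        for c in text
    )

for name, raw in samples.items():
    # Same order as preproc: replace non-printing characters, then apply NFKC.
    cleaned = unicodedata.normalize("NFKC", strip_format_chars(raw))
    print(f"{name}: {raw!r} -> {cleaned!r}")
```

NFKC folds compatibility characters such as ligatures and fullwidth letters into their plain equivalents, while the category filter replaces invisible characters with a space.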

Without cleaning, 140 Kinyarwanda sentences contain at least one token the tokenizer does not recognize. The random sample below shows the cause: a nonstandard (curly) apostrophe inside some words.

In [13]:
from tqdm.auto import tqdm, trange
import random
texts_with_unk = [
    text for text in tqdm(df.kin)
    if tokenizer.unk_token_id in tokenizer(text).input_ids # check if all words are in the vocabulary
]
print(len(texts_with_unk)) # any sentence with an OOV token
# 140
s = random.sample(texts_with_unk, 5) # 5 random flagged sentences
print(s)


140
['Itegeko ry’amazi meza ryasukuye inzuzi n’ibigobe', 'Imihindagurikire y’ibihe 2016', 'Umunyamakuru w’Ubwongereza Michael Nicholson', 'Hasidisimu hamwe n’idini ry’Abayahudi byakomeje kuba byiza kugeza mu 1939', 'Covert Igikorwa hamwe n’amasosiyete menshi']


After applying the cleaning, there should be zero sentences with unknown tokens. If this number is not zero, modify the cleaning function. Try to find the problem characters by printing random samples of the sentences that are still flagged.

In [14]:
texts_with_unk_normed = [
    text for text in tqdm(texts_with_unk)
    if tokenizer.unk_token_id in tokenizer(preproc(text)).input_ids # clean any nonstandard characters if the sentence has OOV tokens
]
print(len(texts_with_unk_normed))  # 0


0


# Part 3: Preparing our Model
To train our NLLB model, we first need to import it from the HuggingFace Hub. Instead of using the full-sized NLLB model, we will use a [distilled](https://en.wikipedia.org/wiki/Knowledge_distillation) version, which has [600 million](https://huggingface.co/facebook/nllb-200-distilled-600M) rather than [3.3 billion](https://huggingface.co/facebook/nllb-200-3.3B) parameters. The 600-million version is still quite large, at roughly 2.5 gigabytes. The model object can be trained in a training loop that we define. We can also call translation methods on it directly. Finally, we can save the model to our Google Drive. This is useful if our code gets interrupted for any reason, as we can reload the latest saved version.

In [15]:
import gc
import random
import numpy as np
import torch
from tqdm.auto import tqdm, trange
from transformers.optimization import Adafactor
from transformers import get_constant_schedule_with_warmup

def cleanup():
    """Try to free GPU memory"""
    gc.collect()
    torch.cuda.empty_cache()

cleanup()

In [16]:
from transformers import AutoModelForSeq2SeqLM
from transformers import NllbTokenizer
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/nllb-200-distilled-600M') # Change if using a different version





In [17]:
# Move the model to the GPU when one is available; otherwise it stays on the CPU.
if torch.cuda.is_available():
    model.cuda()

# The Optimizer

Transformers are trained with [backpropagation](https://www.3blue1brown.com/lessons/backpropagation), which computes how much each weight of the neural network contributed to the loss. After each batch, an optimizer uses those gradients to adjust the weights so that the loss shrinks. Minimizing the output of a function is an optimization problem, similar to finding the minimum value of a function in calculus. In machine learning, the simplest optimization algorithm is [gradient descent](https://www.3blue1brown.com/lessons/gradient-descent).

Gradient descent is simple to implement but can converge to suboptimal values. To visualize this, imagine you are at the top of a mountain, and you want to get to an ocean. One strategy would be to always walk downhill until you reach a river, and follow it downstream until you reach the shore. However, there are cases where this strategy can get you stuck in a valley, such as the Great Basin of Nevada. In a similar way, gradient descent can leave you at a "false minimum" (a local minimum) of your loss function. Plain gradient descent can also be slow, because the weights change very little in each training step. To speed up our process, we will use the [Adafactor](https://paperswithcode.com/method/adafactor) algorithm, an adaptive optimizer that scales each parameter's update using running statistics of past gradients and needs less memory than similar optimizers such as Adam.
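
The "stuck in a valley" problem is easy to reproduce in one dimension. The sketch below runs plain gradient descent on a made-up function with two valleys; the function `f`, the learning rate, and the starting points are illustrative assumptions and have nothing to do with the actual NLLB loss.

```python
def f(x):
    """A bumpy 1-D 'landscape' with two valleys (chosen only for illustration)."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=500):
    """Repeatedly step downhill along the negative gradient."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

right = gradient_descent(2.0)    # starts on the right-hand slope
left = gradient_descent(-2.0)    # starts on the left-hand slope
print(f"from +2: x = {right:.3f}, f(x) = {f(right):.3f}")
print(f"from -2: x = {left:.3f}, f(x) = {f(left):.3f}")
```

Both runs stop where the slope is zero, but the run that starts on the right settles in the shallower valley: where you end up depends on where you start.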


In [None]:
optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    scale_parameter=False,
    relative_step=False,
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
)

# Hyperparameters
Hyperparameters are variables used to control the training process.
* Batch Size: the number of sentence pairs processed before the model's parameters are updated. I recommend keeping this at 16 unless you have a very powerful GPU.
* Max Length: Any sentence longer than 128 tokens is truncated. The sample data has been pre-truncated in previous cleaning, so this will not affect our model.
* Warm up steps: the number of training steps at the beginning during which the [learning rate](https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/) ramps up from a low value. This prevents your model from overfitting on the earliest batches. Don't change unless you have a different optimization algorithm.
* Training steps: the number of batches the model is trained on. One step processes one batch, which is different from an [epoch](https://www.geeksforgeeks.org/epoch-in-machine-learning/#), a full pass over the dataset. **Change this to control how long you will train your model**. The cell below sets it to 1,000 steps; 10,000 steps take roughly an hour to run with a GPU, and a very, very long time to run on a CPU. I tried that so you didn't have to.

In short, the only thing you will probably adjust here is the training steps. I ended up training for 60,000 steps over several hours, and got good performance with the model.
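
To picture what the warm-up does, the sketch below computes the learning rate at a few steps, assuming a linear ramp from zero to the base rate followed by a constant rate. The helper `lr_at_step` is hypothetical; it reflects my understanding of a constant schedule with warm-up and is not the library's own code.

```python
def lr_at_step(step, base_lr=1e-4, warmup_steps=100):
    """Learning rate under a constant schedule with linear warm-up (illustrative)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warm-up period.
        return base_lr * step / max(1, warmup_steps)
    # After warm-up the learning rate stays constant.
    return base_lr

for step in [0, 25, 50, 100, 1000]:
    print(step, lr_at_step(step))
```

With 100 warm-up steps, the first few batches barely move the weights, and from step 100 onward every batch is applied at the full learning rate.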

In [None]:
batch_size = 16
max_length = 128  # max length of any sentence
warmup_steps = 100
training_steps = 1000 # You can set a large number and interrupt the program
losses = []  # with this list, I do very simple tracking of average loss
MODEL_SAVE_PATH = '/content/gdrive/MyDrive/'  # on my Google Drive

In [None]:
losses = []
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps)

# Language selection:
Each tuple in `LANGS` (defined a little further down) describes one language: the first string is the language as I name it in my dataset columns, and the second string is the code the model uses for it. The first tuple is English and the second is Kinyarwanda; during training, either one may be picked as the source language. NLLB identifies its supported [languages](https://huggingface.co/facebook/nllb-200-distilled-600M/blob/main/README.md) by a three-letter ISO 639 [code](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) followed by a script tag, such as `eng_Latn`. The following code collects a few preprocessed sentence pairs, the raw material for the small batches used in each round of training.


In [None]:
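xx, yy = [], []
for _ in range(10):
    # pick a random row from the training split
    item = df_train.iloc[random.randint(0, len(df_train)-1)]
    # append (rather than overwrite) so that xx and yy end up as lists of sentences
    xx.append(preproc(item['eng']))
    yy.append(preproc(item['kin']))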

In an earlier step, we split our data into a training and a test dataset. This is going to come in handy here, as now we will only use the portion of our data set aside for training, and test on the remainder of the data.

In [None]:
LANGS = [('eng', 'eng_Latn'), ('kin', 'kin_Latn')]

def get_batch_pairs(batch_size, data=df_train): #rename to your data here
    (l1, long1), (l2, long2) = random.sample(LANGS, 2)
    xx, yy = [], []
    for _ in range(batch_size):
        item = data.iloc[random.randint(0, len(data)-1)]
        xx.append(preproc(item[l1]))
        yy.append(preproc(item[l2]))
    return xx, yy, long1, long2

print(get_batch_pairs(1))
# (['Navuze ko nanga amazi akonje'], ['Have I mentioned that I hate cold water'], 'kin_Latn', 'eng_Latn')


### Mount Google Drive
Now, we will mount Google Drive again. In case of a failure, the training loop saves a version of the model to my Google Drive, which it remounts before saving. If the code stops for any reason, I can reload this model from my Google Drive and continue the training process. The progress made up to the last save will be intact!

### Optional: Load a previous model
If you are continuing training after an interruption, uncomment the matching line in this block of code. Otherwise, ignore it. Use your own model name and location if they are different.

In [None]:
model_save_name = 'classifier.pt'
path = f"/content/gdrive/MyDrive/{model_save_name}"  # same location the training loop saves to

# Use if on GPU
#model.load_state_dict(torch.load(path))

# Use if on CPU
#model.load_state_dict(torch.load(path, map_location=torch.device('cpu')))

# Part 4: The Training Loop
Now, we train. This loop will execute for the number of training steps selected above. At each step, a random batch of sentence pairs is selected and tokenized. The model computes the loss, which measures how far its predicted translation is from the reference translation, and the Adafactor optimizer uses the gradients of that loss to update the model's weights. Finally, our scheduler updates the learning rate based on the warm-up steps selected earlier.

In [None]:
model.train()
x, y, loss = None, None, None
cleanup()

tq = trange(len(losses), training_steps)
for i in tq:
    xx, yy, lang1, lang2 = get_batch_pairs(batch_size)
    try:
        tokenizer.src_lang = lang1
        x = tokenizer(xx, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
        tokenizer.src_lang = lang2
        y = tokenizer(yy, return_tensors='pt', padding=True, truncation=True, max_length=max_length).to(model.device)
        y.input_ids[y.input_ids == tokenizer.pad_token_id] = -100

        loss = model(**x, labels=y.input_ids).loss
        loss.backward()
        losses.append(loss.item())

        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()

    except RuntimeError as e:  # usually, it is out-of-memory
        optimizer.zero_grad(set_to_none=True)
        x, y, loss = None, None, None
        cleanup()
        print('error', max(len(s) for s in xx + yy), e)
        continue

    if i % 1000 == 0:
        # Report the average loss every 1000 steps; it should be decreasing. When it stops decreasing, interrupt the training loop
        print(i, np.mean(losses[-1000:]))

    if i % 1000 == 0 and i > 0:
        drive.mount('/content/gdrive', force_remount=True) #Save your progress every 1000 steps
        model_save_name = 'classifier.pt'
        path = F"/content/gdrive/MyDrive/{model_save_name}"
        torch.save(model.state_dict(), path)
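
One line in the loop deserves a closer look: padding positions in the labels are set to -100. That value is the default "ignore" label for PyTorch's cross-entropy loss, so padded positions do not contribute to the loss. The pure-Python sketch below illustrates the idea on a toy three-token vocabulary; `masked_nll` and the numbers are invented for illustration and are not the model's actual loss code.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def masked_nll(probs, labels, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions whose label is not ignored.

    probs:  one probability distribution (list of floats) per target position
    labels: the correct token id per position, or ignore_index for padding
    """
    kept = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index]
    return sum(kept) / len(kept)

# Toy predictions for three positions; the last position is padding.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.4, 0.3, 0.3]]
labels = [0, 1, IGNORE_INDEX]

print(masked_nll(probs, labels))     # the padding position is skipped
print(masked_nll(probs, [0, 1, 2]))  # loss if padding were scored as a real token
```

Without the mask, the model would be penalized for whatever it predicts at padded positions, which would distort the loss for batches that mix short and long sentences.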

In [None]:
model_save_name = 'classifier.pt'
path = f"/content/gdrive/MyDrive/{model_save_name}"  # same location the training loop saves to

# Use if on GPU
model.load_state_dict(torch.load(path))
# Use if on CPU
#model.load_state_dict(torch.load(path, map_location=torch.device('cpu')))

Now, we can test out the model on a single sentence, stored in the variable `t`.

In [None]:
def translate(
    text, src_lang='eng_Latn', tgt_lang='kin_Latn',
    a=32, b=3, max_input_length=1024, num_beams=4, **kwargs
):
    """Turn a text or a list of texts into a list of translations"""
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text, return_tensors='pt', padding=True, truncation=True,
        max_length=max_input_length
    )
    model.eval() # turn off training mode
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams, **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

# Example usage:
t = 'I like funny men'
print(translate(t, 'eng_Latn', 'kin_Latn'))

In [None]:
!huggingface-cli login

In [None]:
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig

In [None]:
def fix_tokenizer(tokenizer, new_lang='tyv_Cyrl'):
    """ Add a new language token to the tokenizer vocabulary (this should be done each time after its initialization) """
    old_len = len(tokenizer) - int(new_lang in tokenizer.added_tokens_encoder)
    tokenizer.lang_code_to_id[new_lang] = old_len-1
    tokenizer.id_to_lang_code[old_len-1] = new_lang
    # always move "mask" to the last position
    tokenizer.fairseq_tokens_to_ids["<mask>"] = len(tokenizer.sp_model) + len(tokenizer.lang_code_to_id) + tokenizer.fairseq_offset

    tokenizer.fairseq_tokens_to_ids.update(tokenizer.lang_code_to_id)
    tokenizer.fairseq_ids_to_tokens = {v: k for k, v in tokenizer.fairseq_tokens_to_ids.items()}
    if new_lang not in tokenizer._additional_special_tokens:
        tokenizer._additional_special_tokens.append(new_lang)
    # clear the added token encoder; otherwise a new token may end up there by mistake
    tokenizer.added_tokens_encoder = {}
    tokenizer.added_tokens_decoder = {}

In [None]:
model_save_name = 'classifier.pt'
path = f"/content/gdrive/MyDrive/{model_save_name}"  # same location the training loop saves to
# Use if on GPU
model.load_state_dict(torch.load(path))
# Use if on CPU
#model.load_state_dict(torch.load(path, map_location=torch.device('cpu')))

In [None]:
upload_repo = "souvorinkg/nllb"
tokenizer.push_to_hub(upload_repo)
model.push_to_hub(upload_repo)

Now, we will test the model by pulling it from your repository on Hugging Face. Here, replace the repository name with the name of the repository that holds your model.

In [None]:
model.cuda()

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("souvorinkg/nllb")
model = AutoModelForSeq2SeqLM.from_pretrained("souvorinkg/nllb")
# apply the fix once, after the tokenizer has been (re)loaded
fix_tokenizer(tokenizer)

In [None]:
def translate(
    text,
    model,
    tokenizer,
    src_lang='eng_Latn',
    tgt_lang='kin_Latn',
    max_length='auto',
    num_beams=4,
    no_repeat_ngram_size=4,
    n_out=None,
    **kwargs
):
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    if max_length == 'auto':
        max_length = int(32 + 2.0 * encoded.input_ids.shape[1])
    model.eval()
    generated_tokens = model.generate(
        **encoded.to(model.device),
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
        num_beams=num_beams,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=n_out or 1,
        **kwargs
    )
    out = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
    if isinstance(text, str) and n_out is None:
        return out[0]
    return out

In [None]:
translate("I ran up the hill", model=model, tokenizer=tokenizer)

In [None]:
from huggingface_hub import HfApi
repo_id = "souvorinkg/nllb" # use your own Hugging Face Repo here
api = HfApi()

# For example with a Gradio SDK
#api.create_repo(repo_id=repo_id, repo_type="space", space_sdk="gradio")

### Hosting the Model in a Gradio Space
Gradio Spaces are a GUI on Hugging Face that allows others to interact with your model and perform translations without touching any code. Here, I create a Space in my repository. Then, I click the link to the repository and upload an "app.py" file with the following code:

```python
import gradio as gr

description = "Use a command to translate sentences between English and Kinyarwanda"
title = "English-Kinyarwanda Translator"
examples = [["Translate English to Kinyarwanda: Double or nothing."],["Translate Kinyarwanda into English: Ntekereza ko umuyaga uza"]]

interface = gr.load("models/souvorinkg/nllb", examples=examples)

interface.launch()
```

Congratulations on finishing this tutorial! We opened a dataset from Hugging Face and cleaned it to our specifications. Then, we tokenized the data into word-sized chunks that are represented by numbers. Afterwards, we imported the NLLB model, then fine-tuned it on our tokenized data. Finally, we tested the model and uploaded it to the Hugging Face Hub, where we can download it or interact with it through Gradio.