<a href="https://colab.research.google.com/github/souvorinkg/Eng2Kin/blob/main/tutorial/EnKinEvaluate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Evaluate The Model
In the previous two tutorials, we [created a parralel corpus](https://github.com/souvorinkg/Eng2Kin/blob/main/tutorial/EnKinDemo.ipynb), then used it to [train a NLLB translation model](https://github.com/souvorinkg/Eng2Kin/blob/main/tutorial/training_NLLB_en_kin.ipynb). In this tutorial, we will evaluate this model's performance on translation tasks, compared to the untrained model. We will use [BLEU](https://en.wikipedia.org/wiki/BLEU), chrF2++, and the [edit distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to determine the model's performance.

For this project I have selected English and Kinyarwanda. Kinyarwanda is a language spoken by roughly 15 million people in the nation of Rwanda, where it is universally spoken as a first language. I would like to thank [Gaelle Agahozo](https://github.com/GaelleAgahozo), whose initial translations and feedback were crucial to this project's success. I am grateful to [David Dale](https://cointegrated.medium.com/how-to-fine-tune-a-nllb-200-model-for-translating-a-new-language-a37fc706b865), whose article greatly aided me in this project. The code for training the NLLB model found here belongs to him. Finally, I would like to thank my advisor, [Dr. Ferrer](https://github.com/gjf2a/), who has taught me AI and guided this project.

Let's get our environment set up, import some libraries, and connect to our google drive.

In [45]:
from google.colab import drive
import os
if not os.path.exists('/gd'):
    drive.mount('/gd')

Mounted at /gd


In [46]:
import locale
def gpe(x=None):
    return "UTF-8"
locale.getpreferredencoding = gpe

In [47]:
!pip install sentencepiece transformers==4.33 datasets sacremoses sacrebleu  -q

In [48]:
import pandas as pd
import gc
import random
import numpy as np
import torch
from tqdm.auto import tqdm, trange
from transformers.optimization import Adafactor
from transformers import get_constant_schedule_with_warmup
from transformers import AutoModelForSeq2SeqLM
from sklearn.model_selection import train_test_split
from transformers import NllbTokenizer, AutoModelForSeq2SeqLM, AutoConfig

def cleanup():
    """Try to free GPU memory"""
    gc.collect()
    torch.cuda.empty_cache()

cleanup()

# 6. Using the model

This cleans any nonstandard characters.

In [49]:
# this code is adapted from  the Stopes repo of the NLLB team
# https://github.com/facebookresearch/stopes/blob/main/stopes/pipelines/monolingual/monolingual_line_processor.py#L214

import re
import sys
import typing as tp
import unicodedata
from sacremoses import MosesPunctNormalizer


mpn = MosesPunctNormalizer(lang="en")
mpn.substitutions = [
    (re.compile(r), sub) for r, sub in mpn.substitutions
]


def get_non_printing_char_replacer(replace_by: str = " ") -> tp.Callable[[str], str]:
    non_printable_map = {
        ord(c): replace_by
        for c in (chr(i) for i in range(sys.maxunicode + 1))
        # same as \p{C} in perl
        # see https://www.unicode.org/reports/tr44/#General_Category_Values
        if unicodedata.category(c) in {"C", "Cc", "Cf", "Cs", "Co", "Cn"}
    }

    def replace_non_printing_char(line) -> str:
        return line.translate(non_printable_map)

    return replace_non_printing_char

replace_nonprint = get_non_printing_char_replacer(" ")

def preproc(text):
    clean = mpn.normalize(text)
    clean = replace_nonprint(clean)
    # replace 𝓕𝔯𝔞𝔫𝔠𝔢𝔰𝔠𝔞 by Francesca
    clean = unicodedata.normalize("NFKC", clean)
    return clean

Load the model from Hugging Face. Use your model location here from the last tutorial!

In [50]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


Here, we load the model from the Hugging Face hub. Your model will be saved at a different URL. This should be the same model you created in the previous tutorial.

In [51]:
MODEL_URL = 'souvorinkg/nllb'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
tokenizer = NllbTokenizer.from_pretrained('facebook/nllb-200-distilled-600M')
# tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True)



Here, we load a simple translate function to test the model's performance.

In [52]:
def translate(text, src_lang='eng_Latn', tgt_lang='kin_Latn', a=32, b=3, max_input_length=1024, num_beams=4, **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=max_input_length)
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams,
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

In [53]:
t = "I ran to the store"
print(translate(t, 'eng_Latn', 'kin_Latn'))
# ['Nanyarukiye mu iduka']

['Nanyarukiye mu iduka']


The do_sample method being set to true allow us to get the next token from a probability distribution of tokens, rather than just the most likely token. In our case, we will be using [beam search](https://en.wikipedia.org/wiki/Beam_search), which is a search algorithm that performs breadth first search on a limited number of paths. Currently, we are searching down 2 beams, or paths. The [temperature](https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683) reflects the likelihood that an unlikely translation will be selected, the higher the temperature, the more likely the model is to try a different translation.
Feel free to adjust these parameters to see what works best for you!

In [54]:
translate(t, 'eng_Latn', 'kin_Latn', do_sample=True, num_beams=2, temperature=1.5)

['Nanyarukiye mu iduka']

This function gets all of the translations of a similar length in a batch, then translates them all at once, which is more efficient.

In [55]:
def batched_translate(texts, batch_size=16, **kwargs):
    """Translate texts in batches of similar length"""
    idxs, texts2 = zip(*sorted(enumerate(texts), key=lambda p: len(p[1]), reverse=True))
    results = []
    for i in trange(0, len(texts2), batch_size):
        results.extend(translate(texts2[i: i+batch_size], **kwargs))
    return [p for i, p in sorted(zip(idxs, results))]

This loads the test.csv file we created in the previous tutorial through our test/train split.

In [56]:
filename = "test.csv"
file_path = F"/content/gdrive/MyDrive/{filename}"
df_test = pd.read_csv(file_path)
df_test.head()

Unnamed: 0,eng,kin
0,You were lookin for me too,Nanjye washakaga
1,No ham no turkey no goose,Nta ham nta turkiya nta ngagi
2,Well we ll talk a break,Nibyiza ko tuzaganira kuruhuka
3,So about those new pot laws,Kubijyanye rero naya mategeko mashya
4,Unforgiving reconciliation is an ethical form of retribution,Ubwiyunge butababarira nuburyo bwimyitwarire yo guhana


In [57]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


This gets a subsection of our data to test. The test dataset has already been shuffled, so this is a random sample.

In [58]:
df_trial = df_test[:100]
print(df_trial.head())

                                                            eng  \
0                                    You were lookin for me too   
1                                     No ham no turkey no goose   
2                                       Well we ll talk a break   
3                                   So about those new pot laws   
4  Unforgiving reconciliation is an ethical form of retribution   

                                                      kin  
0                                        Nanjye washakaga  
1                           Nta ham nta turkiya nta ngagi  
2                          Nibyiza ko tuzaganira kuruhuka  
3                    Kubijyanye rero naya mategeko mashya  
4  Ubwiyunge butababarira nuburyo bwimyitwarire yo guhana  


In [59]:
kin_translated = batched_translate(df_trial.eng, src_lang='eng_Latn', tgt_lang='kin_Latn')

  0%|          | 0/7 [00:00<?, ?it/s]

In [60]:
eng_translated = batched_translate(df_trial.kin, src_lang='kin_Latn', tgt_lang='eng_Latn')

  0%|          | 0/7 [00:00<?, ?it/s]

Now, we create a new subsection of df_trial that has all of the translations in it.

In [61]:
df_trial['eng_translated'] = eng_translated
df_trial['kin_translated'] = kin_translated

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trial['eng_translated'] = eng_translated
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trial['kin_translated'] = kin_translated


BLEU is nne of the most popular metrics for evaluating machine translation. It works by getting a score between 0 and 100, the closer it is to 100, the closer it is to the reference text. Determines how many of the reference words are in the outputted translation. However, BLEU depends on the tokenization of the words. To standardize this process, we will use SacreBLUE, which uses the Tokens from the workshop on Machine Translation (WMT) evaluation metrics. Both BLEU and chrF++ are similar metrics to evaluate the text, using [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall). Precision measures how many of the translated words appeared in the reference. Recall measures how many of the reference words were used. Here is an example:

In [62]:
import sacrebleu
bleu_calc = sacrebleu.BLEU()
chrf_calc = sacrebleu.CHRF(word_order=2)  # this metric is called ChrF++

In [63]:
xx, yy = ['I ran to the store'], ['I went to the store']
print(bleu_calc.corpus_score(xx, [yy]))
print(chrf_calc.corpus_score(xx, [yy]))
print(chrf_calc.corpus_score(yy, [xx]))

BLEU = 42.73 80.0/50.0/33.3/25.0 (BP = 1.000 ratio = 1.000 hyp_len = 5 ref_len = 5)
chrF2++ = 64.01
chrF2++ = 66.45


As you can see, BLEU gives scores far less than 100 for translations that are practically identical. So, we will be shooting for a score in the 30's pr 40's.

In [64]:
print(bleu_calc.corpus_score(df_trial['eng_translated'].tolist(), [df_trial['eng'].tolist()]))
print(chrf_calc.corpus_score(df_trial['eng_translated'].tolist(), [df_trial['eng'].tolist()]))
print(bleu_calc.corpus_score(df_trial['kin_translated'].tolist(), [df_trial['kin'].tolist()]))
print(chrf_calc.corpus_score(df_trial['kin_translated'].tolist(), [df_trial['kin'].tolist()]))

BLEU = 37.28 64.5/43.6/31.5/25.3 (BP = 0.964 ratio = 0.965 hyp_len = 572 ref_len = 593)
chrF2++ = 53.63
BLEU = 40.51 65.8/47.0/34.3/26.6 (BP = 0.989 ratio = 0.989 hyp_len = 445 ref_len = 450)
chrF2++ = 64.86


Both of these scores are quite reasonable! Interestingly, the model was better at translating English into Kinyarwanda that Kinyarwanda into English. Our English data was of higher quality, so it is reasonable that the model was better at using it. However, in previous papers, Seq2Seq models performed reasonable translations into English from low-resource languages without fine-tuning, because almost every translation system uses English as an intermediary.

In [65]:
pd.options.display.max_colwidth = 100

In [66]:
df_trial.sample(10, random_state=5)[['kin', 'eng', 'kin_translated', 'eng_translated']]

Unnamed: 0,kin,eng,kin_translated,eng_translated
66,Igisubizo ni kimwe,The answer is the same,Igisubizo ni kimwe,The answer is the same
32,JOHNNIE COCHRAN Yarangije,JOHNNIE COCHRAN Is she finished,JOHNNIE COCHRAN Ararangije,JOHNNIE COCHRAN Finished
46,Ubu noneho ariteguye,This time she is ready,Icyo gihe ariteguye,Now he s ready
28,Koresha amashusho yo guhanga,Use some creative visualization,Koresha amashusho ahanga,Use creative imagery
74,DeFlores nka Yank,DeFlores as the Yank,DeFlores nka Yank,DeFlorescences as Yank
23,Ndatera imbere cyane,I m aggressively progrowth,Ndi gutera imbere cyane,I m progressing enormously
10,JKV Kugisha inama Bwana,JKV Consulting of which Mr,JKV Kugisha inama Bwana,JKV Consulting Mr
20,Ibyo aribyo byose dukeneye kumenya,That s all we need to know,Ibyo aribyo byose dukeneye kumenya,That s all we need to know
17,Nagiye mu modoka nsubira mu rugo,I drove off and went back home,Nagiye gutwara imodoka nsubira murugo,I drove to the car and back home
35,Haracyari imbere y'urupfu mu gikapu,Still ahead death in a duffel bag,Biracyari imbere y'urupfu mumufuka,Still facing death in the bag


These translations also apprear quite reasonable. We will see what the average word overlap for the two translations are:

In [67]:
print((df_trial.eng == df_trial.eng_translated).mean())
print((df_trial.kin == df_trial.kin_translated).mean())

0.14
0.26


We will test on one more metric, the edit distance. The edit distance measures how many character insertions, deletions, and subsititions are needed to go from the source word to the translated version.

In [68]:
!pip install editdistance



In [80]:
import editdistance

def ed_similarity(text1, text2):
    return max(0, 1 - editdistance.eval(text1, text2) / min(len(text1), len(text2)))

print(ed_similarity('dog', 'cafeteria'))
print(ed_similarity('dog', 'dung'))
print(ed_similarity('dog', 'dug'))

0
0.33333333333333337
1.0


In [70]:
pd.Series([ed_similarity(row.eng, row.eng_translated) for row in df_trial.itertuples()]).describe()

count    100.000000
mean       0.601231
std        0.284621
min        0.000000
25%        0.443089
50%        0.609524
75%        0.808150
max        1.000000
dtype: float64

In [71]:
pd.Series([ed_similarity(row.kin, row.kin_translated) for row in df_trial.itertuples()]).describe()

count    100.000000
mean       0.735636
std        0.251979
min        0.000000
25%        0.573916
50%        0.764171
75%        1.000000
max        1.000000
dtype: float64

This is also quite alot of overlap, our translations were close to the source translation!

In [72]:
df_trial.index.name = "row_id"
model_load_name = '/content/drive/MyDrive'

In [73]:
df_trial.to_csv(model_load_name + "/dev_set_translated.tsv", sep="\t")

Now, to give more meaning to these evaluation metrics, let's compare them to the baseline model, without any finetuing. We will load the original NLLB model from Hugging Face:

In [75]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")



In [76]:
df_trial['kin_translated2'] = batched_translate(df_trial.eng, src_lang='eng_Latn', tgt_lang='kin_Latn')
#df_trial['eng_translated2'] = [translate(t, 'kin_Latn', 'eng_Latn')[0] for t in tqdm(df_trial.kin)]

  0%|          | 0/7 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trial['eng_translated2'] = batched_translate(df_trial.eng, src_lang='eng_Latn', tgt_lang='kin_Latn')


In [77]:
df_trial['eng_translated2'] = batched_translate(df_trial.kin, src_lang='kin_Latn', tgt_lang='eng_Latn')
#df_trial['kin_translated2'] = [translate(t, 'eng_Latn', 'kin_Latn')[0] for t in tqdm(df_trial.eng)]

  0%|          | 0/7 [00:00<?, ?it/s]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_trial['kin_translated2'] = batched_translate(df_trial.kin, src_lang='kin_Latn', tgt_lang='eng_Latn')


In [81]:
print(bleu_calc.corpus_score(df_trial['eng_translated2'].tolist(), [df_trial['eng'].tolist()]))
print(chrf_calc.corpus_score(df_trial['eng_translated2'].tolist(), [df_trial['eng'].tolist()]))
print(bleu_calc.corpus_score(df_trial['kin_translated2'].tolist(), [df_trial['kin'].tolist()]))
print(chrf_calc.corpus_score(df_trial['kin_translated2'].tolist(), [df_trial['kin'].tolist()]))

BLEU = 21.08 47.7/27.0/15.9/9.6 (BP = 1.000 ratio = 1.081 hyp_len = 641 ref_len = 593)
chrF2++ = 45.55
BLEU = 13.03 34.6/17.7/10.3/4.6 (BP = 1.000 ratio = 1.329 hyp_len = 598 ref_len = 450)
chrF2++ = 46.65


In [82]:
pd.Series([ed_similarity(row.eng, row.kin_translated2) for row in df_trial.itertuples()]).describe()

count    100.000000
mean       0.487518
std        0.286381
min        0.000000
25%        0.280000
50%        0.500000
75%        0.736253
max        1.000000
dtype: float64

In [83]:
pd.Series([ed_similarity(row.kin, row.eng_translated2) for row in df_trial.itertuples()]).describe()

count    100.000000
mean       0.423698
std        0.327217
min        0.000000
25%        0.114316
50%        0.393327
75%        0.693144
max        1.000000
dtype: float64

There is noticeable improvement in both BLEU and edit distance metrics from the base-line model to our fine-tuned version in both metrics. Kinyarwanda to English went up roughly 75%, and English to Kinyarwanda went up over 100%! This is important, as we trained this model without a premade parallel corpus, the industry standard for builing translators. Instead, we created our own artificial parallel corpus usign only backtranslation. This has lead to a massive jump in BLEU, as well as a noticeable increase in the edit distance metric. Due to limited computational resources, we used a limited number of translations, 50,000 pairs of sentences. Additionally, we only performed 30,000 training steps due to limited amounts of GPU availible. Scaled up, it is possible we would be able to get a BLEU score in the 50's with a corpus with hundreds of thouasands of sentences.

In addition to quantatative analysis, some fellow students at Hendrix College who are bilingual in English and Kinyarwanda are qualitatively checking the quality of the translations. When their feedback is complete, it will be posted here as well.  

Thank you for completing this tutorial! First, we built a parralel corpus from scratch, using the COCA dataset. Then, we used our parallel corpus to train and fine-tune a NLLB model to translate between English and Kinyarwanda. Finally, we evaluated the model, using inustry standard metrics, and determined that our model's improvements were significant. With this knowledge, you can finetune a NLLB model for any language. Additionally, you should feel more comfortable interacting with large text datasets and hugging face models.

In [79]:
df_trial.to_csv("/content/drive/MyDrive" + "/dev_set_translated.tsv", sep="\t")