# **Fine-Tuning a Transformer for Neural Machine Translation**

## *What is Machine Translation?*

*Machine translation is a field that focuses on using automated systems to convert text or speech from one language to another. By leveraging techniques such as computational models and linguistic analysis, machine translation aims to overcome language barriers and facilitate communication across languages. The goal is to produce accurate and fluent translations through the use of sophisticated algorithms and large-scale data processing.*

## *What is Statistical Machine Translation (SMT)?*

*Statistical machine translation is an approach within machine translation that relies on statistical models and vast amounts of bilingual text data. Unlike rule-based methods, SMT works by analyzing patterns and relationships between words or phrases in the source and target languages. By training on large corpora, SMT models estimate the probability of translation choices and generate output based on statistical likelihood. SMT has been widely used and has achieved acceptable translation quality for numerous language pairs.*
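
To make the idea of "estimating the probability of translation choices" concrete, here is a minimal sketch that counts word co-occurrences in a toy parallel corpus and turns them into relative frequencies. The corpus and the function names are made up for illustration. Real SMT systems rely on proper alignment models and phrase tables, which this sketch does not attempt.

```python
from collections import Counter, defaultdict

# Hypothetical three-sentence parallel corpus (illustrative only).
parallel_corpus = [
    ("a lua", "the moon"),
    ("a casa", "the house"),
    ("lua cheia", "full moon"),
]

# Count how often each source word appears alongside each target word.
cooccurrence = defaultdict(Counter)
for source_sentence, target_sentence in parallel_corpus:
    for source_word in source_sentence.split():
        for target_word in target_sentence.split():
            cooccurrence[source_word][target_word] += 1


def translation_probability(source_word: str, target_word: str) -> float:
    """Relative-frequency estimate of P(target_word | source_word)."""
    counts = cooccurrence[source_word]
    total = sum(counts.values())
    return counts[target_word] / total if total else 0.0


def most_likely_translation(source_word: str) -> str:
    """Pick the target word with the highest co-occurrence count."""
    return cooccurrence[source_word].most_common(1)[0][0]


print(most_likely_translation("lua"), translation_probability("lua", "moon"))
```

Even with three sentence pairs, "lua" pairs with "moon" more often than with any other word, which is the kind of signal a statistical model picks up at scale.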

## *What is Neural Machine Translation (NMT)?*

*Neural machine translation is a modern approach to machine translation that uses artificial neural networks, particularly recurrent neural networks (RNNs) or transformer models. NMT systems operate on a more holistic level, learning to capture the contextual meaning of the input text and generating translations based on this understanding. By leveraging the power of neural networks, NMT models excel at capturing long-range dependencies and producing coherent and fluent translations. This approach has surpassed traditional methods in translation quality and has become the prevailing paradigm in machine translation research and development.*
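
Transformer models capture long-range dependencies through attention, where every token can look at every other token in one step. The sketch below shows scaled dot-product attention in NumPy on random vectors. The function name, shapes, and data are assumptions for illustration and are not taken from the model used later in this notebook.

```python
import numpy as np


def scaled_dot_product_attention(query, key, value):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)
    # Subtract the row-wise max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value, weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 token vectors of width 8 (made-up data)
output, weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1 and says how much one token draws on every other token, regardless of how far apart they are in the sentence.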

## Reference Links for the Code Below

1. [Google Colab](https://colab.research.google.com/drive/1ge0aqzAbCRWS7CJIbdDVPk-NBBSxjKlv#scrollTo=biPo8vFTx5Ue)
2. [Medium Article](https://medium.com/@tskumar1320/how-to-fine-tune-pre-trained-language-translation-model-3e8a6aace9f)
3. [HuggingFace Documentation](https://huggingface.co/docs/transformers/model_doc/t5)
4. [HuggingFace Fine-Tuning Tips](https://discuss.huggingface.co/t/t5-finetuning-tips/684)


## Installing Dependencies

In [2]:
# ! pip install --trusted-host pypi.org --trusted-host pypi.python.org --trusted-host files.pythonhosted.org <package_name>
# ! pip install datasets sacrebleu torch transformers sentencepiece transformers[sentencepiece]
# ! pip install accelerate -U

## Required Imports

In [3]:
import warnings
import numpy as np
import pandas as pd

import torch
import transformers

from datasets import Dataset
from datasets import load_metric

from tqdm import tqdm
from transformers import AutoTokenizer
from sklearn.model_selection import train_test_split
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

warnings.filterwarnings("ignore")

## Constants

In [4]:
BATCH_SIZE = 16
BLEU = "bleu"
ENGLISH = "en"
ENGLISH_TEXT = "english_text"
EPOCH = "epoch"
INPUT_IDS = "input_ids"
FILENAME = "TranslationDataset.csv"
GEN_LEN = "gen_len"
MAX_INPUT_LENGTH = 128
MAX_TARGET_LENGTH = 128
MODEL_CHECKPOINT = "unicamp-dl/translation-pt-en-t5"
MODEL_NAME = MODEL_CHECKPOINT.split("/")[-1]
LABELS = "labels"
PREFIX = ""
PORTUGUESE = "pt"
PORTUGUESE_TEXT = "portuguese_text"
SCORE = "score"
SOURCE_LANG = "pt"
TARGET_LANG = "en"
TRANSLATION = "translation"
UNNAMED_COL = "Unnamed: 0"

## Helper Functions

In [5]:
def postprocess_text(preds: list, labels: list) -> tuple:
    """Performs post processing on the prediction text and labels"""

    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def prep_data_for_model_fine_tuning(source_lang: list, target_lang: list) -> dict:
    """Takes the input data lists and converts them into a dict holding a list of translation dicts"""

    data_dict = dict()
    data_dict[TRANSLATION] = []

    for sr_text, tr_text in zip(source_lang, target_lang):
        temp_dict = dict()
        temp_dict[PORTUGUESE] = sr_text
        temp_dict[ENGLISH] = tr_text

        data_dict[TRANSLATION].append(temp_dict)

    return data_dict


def generate_model_ready_dataset(dataset: list, source: str, target: str,
                                 model_checkpoint: str,
                                 tokenizer: AutoTokenizer):
    """Makes the data training ready for the model"""

    preped_data = []

    for row in dataset:
        inputs = PREFIX + row[source]
        targets = row[target]

        model_inputs = tokenizer(inputs, max_length=MAX_INPUT_LENGTH,
                                 truncation=True, padding=True)

        model_inputs[TRANSLATION] = row

        # Tokenize the targets, truncating to the target length limit
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(targets, max_length=MAX_TARGET_LENGTH,
                               truncation=True, padding=True)
            model_inputs[LABELS] = labels[INPUT_IDS]

        preped_data.append(model_inputs)

    return preped_data



def compute_metrics(eval_preds: tuple) -> dict:
    """computes bleu score and other performance metrics """

    metric = load_metric("sacrebleu")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)

    preds, labels = eval_preds

    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {BLEU: result[SCORE]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]

    result[GEN_LEN] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}

    return result
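
The two NumPy steps inside `compute_metrics` can be tried in isolation. This is a small sketch on made-up token ids. `PAD_TOKEN_ID = 0` is an assumption about the tokenizer and should be checked against `tokenizer.pad_token_id` for the checkpoint in use.

```python
import numpy as np

PAD_TOKEN_ID = 0  # assumption: check tokenizer.pad_token_id for your checkpoint

# Made-up label ids; -100 marks padded positions that the loss ignores.
label_ids = np.array([[71, 1782, 1, -100, -100],
                      [37, 1, -100, -100, -100]])

# Same substitution as in compute_metrics: -100 becomes the pad id,
# because -100 is not a valid vocabulary index and cannot be decoded.
decodable = np.where(label_ids != -100, label_ids, PAD_TOKEN_ID)

# Same length measure as gen_len: count the non-pad positions per row.
lengths = [int(np.count_nonzero(row != PAD_TOKEN_ID)) for row in decodable]
print(decodable.tolist(), lengths)
```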

## Loading and Preparing The Dataset

In [6]:
translation_data = pd.read_csv(FILENAME)
translation_data = translation_data.drop([UNNAMED_COL], axis=1)
translation_data

|     | portuguese_text | english_text |
| --- | --- | --- |
| 0 | A lua está brilhando no céu noturno. | The moon is shining in the night sky. |
| 1 | Estamos planejando uma viagem à praia. | We're planning a trip to the beach. |
| 2 | O livro que estou lendo é uma aventura. | The book I'm reading is an adventure. |
| 3 | Vou cozinhar um jantar especial para a família. | I'm going to cook a special dinner for the fam... |
| 4 | O museu de história é fascinante. | The history museum is fascinating. |
| ... | ... | ... |
| 660 | A música eletrônica é conhecida por suas batid... | Electronic music is known for its catchy beats... |
| 661 | O mercado de produtos locais oferece itens fre... | The local products market offers fresh items d... |
| 662 | Estamos explorando a arquitetura de edifícios ... | We're exploring the architecture of abandoned ... |
| 663 | Eles estão liderando um projeto de preservação... | They're leading a historical heritage preserva... |


## Train, Test & Validation Split of Data

In [7]:
X = translation_data[PORTUGUESE_TEXT]
y = translation_data[ENGLISH_TEXT]

In [8]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.10,
                                                    shuffle=True,
                                                    random_state=100)

print("INITIAL X-TRAIN SHAPE: ", x_train.shape)
print("INITIAL Y-TRAIN SHAPE: ", y_train.shape)
print("X-TEST SHAPE: ", x_test.shape)
print("Y-TEST SHAPE: ", y_test.shape)

INITIAL X-TRAIN SHAPE:  (598,)
INITIAL Y-TRAIN SHAPE:  (598,)
X-TEST SHAPE:  (67,)
Y-TEST SHAPE:  (67,)


In [9]:
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train,
                                                  test_size=0.20,
                                                  shuffle=True,
                                                  random_state=100)

print("FINAL X-TRAIN SHAPE: ", x_train.shape)
print("FINAL Y-TRAIN SHAPE: ", y_train.shape)
print("X-VAL SHAPE: ", x_val.shape)
print("Y-VAL SHAPE: ", y_val.shape)

FINAL X-TRAIN SHAPE:  (478,)
FINAL Y-TRAIN SHAPE:  (478,)
X-VAL SHAPE:  (120,)
Y-VAL SHAPE:  (120,)
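
The printed shapes can be reproduced with a little arithmetic. The sketch below assumes the held-out share is rounded up to a whole row, which matches the shapes above. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_fraction: float) -> tuple:
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


n_total = 598 + 67  # rows implied by the shapes printed above
n_train_initial, n_test = split_sizes(n_total, 0.10)
n_train_final, n_val = split_sizes(n_train_initial, 0.20)
print(n_train_final, n_val, n_test)
```

The validation set is 20% of the remaining 598 rows, not of the full dataset, so the overall split is roughly 72% train, 18% validation, and 10% test.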


## Load Tokenizer from AutoTokenizer Class

In [10]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)


## Prepare the model ready dataset

In [11]:
training_data = prep_data_for_model_fine_tuning(x_train.values, y_train.values)

validation_data = prep_data_for_model_fine_tuning(x_val.values, y_val.values)

test_data = prep_data_for_model_fine_tuning(x_test.values, y_test.values)

In [12]:
train_data = generate_model_ready_dataset(dataset=training_data[TRANSLATION],
                                          tokenizer=tokenizer,
                                          source=PORTUGUESE,
                                          target=ENGLISH,
                                          model_checkpoint=MODEL_CHECKPOINT)

validation_data = generate_model_ready_dataset(dataset=validation_data[TRANSLATION],
                                               tokenizer=tokenizer,
                                               source=PORTUGUESE,
                                               target=ENGLISH,
                                               model_checkpoint=MODEL_CHECKPOINT)

test_data = generate_model_ready_dataset(dataset=test_data[TRANSLATION],
                                         tokenizer=tokenizer,
                                         source=PORTUGUESE,
                                         target=ENGLISH,
                                         model_checkpoint=MODEL_CHECKPOINT)

In [13]:
train_df = pd.DataFrame.from_records(train_data)
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   attention_mask  478 non-null    object
 1   input_ids       478 non-null    object
 2   labels          478 non-null    object
 3   translation     478 non-null    object
dtypes: object(4)
memory usage: 15.1+ KB


In [14]:
validation_df = pd.DataFrame.from_records(validation_data)
validation_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   attention_mask  120 non-null    object
 1   input_ids       120 non-null    object
 2   labels          120 non-null    object
 3   translation     120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


In [15]:
test_df = pd.DataFrame.from_records(test_data)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67 entries, 0 to 66
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   attention_mask  67 non-null     object
 1   input_ids       67 non-null     object
 2   labels          67 non-null     object
 3   translation     67 non-null     object
dtypes: object(4)
memory usage: 2.2+ KB


## Convert dataframe to Dataset Class object

In [16]:
train_dataset = Dataset.from_pandas(train_df)
train_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'translation'],
    num_rows: 478
})

In [17]:
validation_dataset = Dataset.from_pandas(validation_df)
validation_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'translation'],
    num_rows: 120
})

In [18]:
test_dataset = Dataset.from_pandas(test_df)
test_dataset

Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'translation'],
    num_rows: 67
})

## Load model, Create Model Training Args and Data Collator

In [19]:
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)


In [22]:
model_args = Seq2SeqTrainingArguments(
    f"{MODEL_NAME}-finetuned-{SOURCE_LANG}-to-{TARGET_LANG}",
    evaluation_strategy=EPOCH,
    learning_rate=2e-4,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    weight_decay=0.02,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True
)

In [23]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
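
The helper functions above leave each example at its own length, so the batch has to be padded at collation time. The sketch below illustrates the idea for the labels only, in plain Python. `pad_labels` is a hypothetical helper and not the collator's actual implementation. Padding with -100 is assumed here because that is the index PyTorch's cross-entropy loss ignores by default.

```python
LABEL_PAD_ID = -100  # positions with this id are skipped by the loss


def pad_labels(batch_labels: list, pad_id: int = LABEL_PAD_ID) -> list:
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(labels) for labels in batch_labels)
    return [labels + [pad_id] * (longest - len(labels)) for labels in batch_labels]


# Two made-up label sequences of different lengths.
padded = pad_labels([[71, 1782, 1], [37, 1]])
print(padded)
```

Padding per batch rather than to a fixed global length keeps short sentences, like most of this dataset, from carrying long runs of padding.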

## Fine Tuning the Model, Finally!

In [24]:
trainer = Seq2SeqTrainer(
    model,
    model_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [25]:
trainer.train()

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Bleu | Gen Len |
| --- | --- | --- | --- | --- |
| 1 | No log | 0.16852 | 69.153 | 17.0417 |
| 2 | No log | 0.153269 | 74.1109 | 17.1167 |
| 3 | No log | 0.158066 | 72.1328 | 17.0583 |
| 4 | No log | 0.158588 | 74.1592 | 17.0583 |
| 5 | No log | 0.157569 | 74.4749 | 17.1667 |
| 6 | No log | 0.16103 | 74.5978 | 17.1 |
| 7 | No log | 0.164639 | 74.7785 | 17.1 |
| 8 | No log | 0.168339 | 74.0974 | 17.125 |
| 9 | No log | 0.167091 | 74.7655 | 17.0833 |
| 10 | No log | 0.166775 | 74.8013 | 17.0833 |



TrainOutput(global_step=300, training_loss=0.0757621955871582, metrics={'train_runtime': 117.034, 'train_samples_per_second': 40.843, 'train_steps_per_second': 2.563, 'total_flos': 95435119411200.0, 'train_loss': 0.0757621955871582, 'epoch': 10.0})
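
The `global_step=300` in the output follows from the dataset size and the training arguments. This is a quick check, assuming a single device and no gradient accumulation:

```python
import math

train_rows = 478   # FINAL X-TRAIN SHAPE
batch_size = 16    # BATCH_SIZE
epochs = 10        # num_train_epochs

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

This would also explain the "No log" entries in the Training Loss column: assuming the default logging interval of 500 steps, a 300-step run never reaches the first logging point. Passing a smaller `logging_steps` to `Seq2SeqTrainingArguments` should make the training loss appear.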

## Saving the Fine Tuned Transformer

In [26]:
trainer.save_model("FineTunedTransformer")

## Perform Translation on Test Dataset

In [27]:
test_results = trainer.predict(test_dataset)

In [28]:
print("Test Bleu Score: ", test_results.metrics["test_bleu"])

Test Bleu Score:  75.2502
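
To get a feel for what the score measures, the sketch below computes clipped n-gram precision, one ingredient of BLEU, for a single sentence pair taken from the test predictions. It is a simplification: it splits on whitespace and leaves out the brevity penalty and the combination of several n-gram orders, so its numbers are not comparable to the sacreBLEU score above. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction: str, reference: str, n: int) -> float:
    """Share of predicted n-grams that also occur in the reference."""
    pred_counts = ngrams(prediction.split(), n)
    ref_counts = ngrams(reference.split(), n)
    # Each predicted n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0


prediction = "We're planning a bird observation trip."
reference = "We're planning a bird-watching trip."
unigram = clipped_precision(prediction, reference, 1)
bigram = clipped_precision(prediction, reference, 2)
print(round(unigram, 4), round(bigram, 4))
```

A translation that reads fine to a human ("bird observation" for "bird-watching") still loses n-gram matches, which is worth keeping in mind when reading BLEU scores on a small test set.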


## Generate Prediction Sentences

In [30]:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [31]:
predictions = []
test_input = test_dataset[TRANSLATION]

for input_text in tqdm(test_input):
    source_sentence = input_text[PORTUGUESE]
    encoded_source = tokenizer(source_sentence,
                               return_tensors="pt",  # "pt" = PyTorch tensors, not the language code
                               padding=True,
                               truncation=True)
    encoded_source = encoded_source.to(device)  # Move input tensors to the same device as the model

    translated = model.generate(**encoded_source)

    # One sentence in, one sequence out: decode the first (only) result
    predictions.append(tokenizer.decode(translated[0], skip_special_tokens=True))

# Move the model back to CPU if needed
model.to("cpu")

100%|██████████| 67/67 [00:31<00:00,  2.13it/s]



In [37]:
y_true_en = []
y_true_pt = []

for input_text in tqdm(test_input):
    y_true_pt.append(input_text[PORTUGUESE])
    y_true_en.append(input_text[ENGLISH])

100%|██████████| 67/67 [00:00<00:00, 56531.56it/s]


In [39]:
output_df = pd.DataFrame({"y_true_port": y_true_pt, "y_true_eng": y_true_en, "predicted_text": predictions})
output_df

|     | y_true_port | y_true_eng | predicted_text |
| --- | --- | --- | --- |
| 0 | Eles estão restaurando uma mansão histórica pa... | They're restoring a historic mansion to preser... | They're restoring a historic mansão to preserv... |
| 1 | A arquitetura neoclássica da biblioteca é impo... | The neoclassical architecture of the library i... | The neoclassical architecture of the library i... |
| 2 | A cultura local é rica e diversa. | The local culture is rich and diverse. | The local culture is rich and diverse. |
| 3 | A arte de rua nesta área é colorida e vibrante. | The street art in this area is colorful and vi... | Street art in this area is colorful and vibrant. |
| 4 | Estamos explorando as ruínas antigas de uma ci... | We're exploring the ancient ruins of a lost ci... | We're exploring the ancient ruins of a lost ci... |
| ... | ... | ... | ... |
| 62 | Estamos criando uma instalação de arte sonora ... | We're creating an interactive sound art instal... | We're creating an interactive sound art instal... |
| 63 | O restaurante serve pratos internacionais deli... | The restaurant serves delicious international ... | The restaurant serves delicious international ... |
| 64 | Eles estão organizando um festival de cinema i... | They're organizing an independent film festival. | They're organizing an independent film festival. |
| 65 | Estamos planejando uma viagem de observação de... | We're planning a bird-watching trip. | We're planning a bird observation trip. |


## Loading the stored Model and using it for translation

In [41]:
ft_model_tokenizer = T5Tokenizer.from_pretrained("FineTunedTransformer")
ft_model = T5ForConditionalGeneration.from_pretrained("FineTunedTransformer")

You are using the legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


In [42]:
ft_prediction = []

for sentence in tqdm(x_test):
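    # "pt" requests PyTorch tensors; it is not the Portuguese language code
    encoded_text = ft_model_tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    translated = ft_model.generate(**encoded_text)
    # Decode with the tokenizer that was loaded alongside the saved model
    ft_prediction.append(ft_model_tokenizer.decode(translated[0], skip_special_tokens=True))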

  0%|          | 0/67 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 67/67 [01:19<00:00,  1.19s/it]


In [43]:
ft_prediction

["They're restoring a historic mansão to preserve heritage.",
 'The neoclassical architecture of the library is impressive',
 'The local culture is rich and diverse.',
 'Street art in this area is colorful and vibrant.',
 "We're exploring the ancient ruins of a lost civilization.",
 "She's painting an urban landscape in a realistic style",
 "I'm tired after a long day.",
 "We're exploring forest trails to study ecology",
 'Teamwork is essential.',
 'The adventure waits for us beyond the horizon.',
 'Electronic music creates unique sound environments.',
 "We're planning a camping trip near the lake.",
 'Georgian architecture features symmetry and refined',
 'The exciting soccer game attracted a great audience.',
 'The potent sun paints the sky with warm tones.',
 "We're organizing a charity event to help the needy.",
 "We're creating a visual art installation.",
 "They're decorating the house for Christmas.",
 'Baroque architecture features detailed and dramatic o',
 "We're experimentin