# English-to-French Machine Translation with T5

T5 (Text-to-Text Transfer Transformer) is a language model developed by Google's AI research team. It is based on the Transformer architecture, which has achieved strong results across natural language processing (NLP) tasks such as machine translation, text summarization, and question answering.
Here’s an overview of the T5 model architecture and its suitability for sequence-to-sequence tasks:
1.	Architecture:
* T5 employs an encoder-decoder architecture. Unlike previous models like BERT that used only encoders, T5 combines both an encoder and a decoder.
*	The encoder processes the input sequence, while the decoder generates the output sequence.
*	This architecture allows T5 to handle sequence-to-sequence tasks effectively.
2.	Text-to-Text Paradigm:
*	T5 recasts all target tasks into sequence-to-sequence tasks using the text-to-text paradigm.
*	In this paradigm, both input and output are treated as text, regardless of the specific NLP task.
* For example, machine translation becomes a sequence-to-sequence task where the input is the source language text, and the output is the target language text.
3.	Generative Span-Corruption Pre-training Task:
*	T5's pre-training involves a generative span-corruption task.
* During pre-training, randomly chosen spans of the input text are each replaced with a unique sentinel token.
* The model learns to generate the dropped spans, delimited by their sentinel tokens, from the corrupted input.
*	This task encourages the model to understand context and improve its ability to generate coherent sequences during fine-tuning.
4.	Attention Mechanisms:
*	T5 incorporates attention mechanisms to enhance translation quality.
*	Attention allows the model to focus on relevant parts of the input sequence when generating the output.
* Specifically, self-attention layers let T5 weigh the importance of different tokens within a sequence, while cross-attention layers in the decoder attend to the encoder's representation of the input.
*	By attending to relevant context, T5 produces more accurate translations.
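
The span-corruption task from item 3 can be illustrated with a small, library-free sketch. The helper name `span_corrupt` and the hand-picked spans are assumptions made for illustration; real pre-training samples spans randomly and works on subword tokens rather than whitespace-separated words. The `<extra_id_N>` strings follow the sentinel naming used by the Hugging Face T5 tokenizer.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel token.

    Returns the corrupted input and the target the model should generate.
    Spans are half-open index ranges and must not overlap.
    """
    inputs, targets = [], []
    cursor = 0
    for index, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{index}>"
        inputs.extend(tokens[cursor:start])   # keep the text before the span
        inputs.append(sentinel)               # the span collapses to one sentinel
        targets.append(sentinel)              # the target announces the sentinel...
        targets.extend(tokens[start:end])     # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The decoder only has to produce the missing spans, not the whole sentence, which keeps the target sequences short during pre-training.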


Resource: [Hugging Face: T5 model documentation](https://huggingface.co/docs/transformers/model_doc/t5)

## Setup Libraries

The required packages are installed with pip, the necessary libraries are imported, the random seed is set for reproducibility, and the dataset is loaded.

In [None]:
%%bash
pip install numpy torch datasets transformers~=4.28.0 evaluate torchinfo sacrebleu --quiet
pip freeze | grep -E '^numpy|^torch|^datasets|^transformers|^evaluate|^torchinfo|^sacrebleu'

In [None]:
# import the libraries used to load the data and set the seed
from datasets import load_dataset
import torch
import numpy as np



In [None]:
# set seed for reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
raw_datasets = load_dataset('opus_books', 'en-fr')
display(raw_datasets)



In [None]:
# let's see an example of the dataset

# each data point has a unique 'id' and a dictionary called 'translation'
# with the English text under the 'en' key and the French text under the 'fr' key.
raw_datasets['train'][0]

## Split Dataset (Data Handling)
* The OPUS Books dataset (`opus_books`, `en-fr` configuration), a collection of English-to-French translations from books, is loaded using the `load_dataset` function from the Hugging Face `datasets` library.
* The dataset only ships with a `train` split, so it is divided into training (90%), validation (5%), and test (5%) sets with two calls to the `train_test_split` method.
* The number of words in the English and French sentences is analyzed afterwards in the Exploratory Data Analysis section.

In [None]:
# split dataset with seed and shuffling

from datasets import DatasetDict

train_val_datasets = raw_datasets['train'].train_test_split(test_size=0.1, seed=SEED, shuffle=True)
val_test_datasets = train_val_datasets['test'].train_test_split(test_size=0.5, seed=SEED, shuffle=True)
split_datasets = DatasetDict({
    'train': train_val_datasets['train'],
    'validation': val_test_datasets['train'],
    'test': val_test_datasets['test'],
})

# display the split
split_datasets
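
The two-stage split above yields roughly a 90/5/5 partition. The sketch below works out the sizes for a hypothetical dataset of 1,000 examples. The function name `three_way_split_sizes` is illustrative, and rounding the held-out portion up is an assumption about how `train_test_split` rounds, so counts on the real dataset may differ by one.

```python
import math


def three_way_split_sizes(num_examples, holdout_fraction=0.1, test_fraction=0.5):
    """Sizes produced by holding out a fraction, then halving the held-out part."""
    holdout = math.ceil(num_examples * holdout_fraction)  # first split: 10% held out
    train = num_examples - holdout
    test = math.ceil(holdout * test_fraction)             # second split: half becomes test
    validation = holdout - test
    return {"train": train, "validation": validation, "test": test}


sizes = three_way_split_sizes(1000)
print(sizes)
```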

## Exploratory Data Analysis

* An Exploratory Data Analysis (EDA) is conducted to analyze the number of words in the English and French sentences across all splits.

In [None]:
# let's get the number of words in the dataset

english_word_counts = []
french_word_counts = []

for split_type in split_datasets:
    for example in split_datasets[split_type]:
        english_word_counts.append(len(example['translation']['en'].split(' ')))
        french_word_counts.append(len(example['translation']['fr'].split(' ')))

print(f"MIN ENGLISH WORD COUNT: {min(english_word_counts)}")
print(f"MAX ENGLISH WORD COUNT: {max(english_word_counts)}")
print(f"MEAN ENGLISH WORD COUNT: {sum(english_word_counts)/len(english_word_counts)}")

print(f"MIN FRENCH WORD COUNT: {min(french_word_counts)}")
print(f"MAX FRENCH WORD COUNT: {max(french_word_counts)}")
print(f"MEAN FRENCH WORD COUNT: {sum(french_word_counts)/len(french_word_counts)}")
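
One detail worth knowing when counting words this way: `split(' ')` treats every single space as a separator, so repeated spaces produce empty strings that are counted as words, whereas `split()` with no argument collapses runs of whitespace. The sketch below shows the difference on made-up sentences; `word_count_stats` is a hypothetical helper, not part of the notebook.

```python
def word_count_stats(sentences):
    """Min, max, and mean whitespace-delimited word counts for a list of sentences."""
    counts = [len(sentence.split()) for sentence in sentences]
    return {"min": min(counts), "max": max(counts), "mean": sum(counts) / len(counts)}


samples = ["The  wanderer", "Le grand Meaulnes", "Bonjour"]

# The double space in the first sample yields an empty string with split(' ').
naive_count = len(samples[0].split(" "))
robust_count = len(samples[0].split())
print(naive_count, robust_count)
print(word_count_stats(samples))
```

On clean text the two approaches agree, so the statistics above are only affected where sentences contain consecutive spaces.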

## Tokenize Dataset
<!--
* The code loads the smallest version of Google's T5 model using ``` AutoTokenizer.from_pretrained(CHECKPOINT)```.
* The dataset is tokenized using the T5 model, where a specific translation prompt ("translate English to French:") is added to the English sentences, and the French sentences are set as target outputs.
* Tokenization includes truncation and a maximum length limit to speed up training. -->

Tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenization strategy used. In the context of Natural Language Processing (NLP), tokenization is a crucial step as it converts raw text into a format that machine learning models can understand and process. Here's a detailed explanation of tokenization as applied in the provided code:

1. **Tokenization Process**:
   - The code uses the `AutoTokenizer.from_pretrained(CHECKPOINT)` function from the Hugging Face `transformers` library to load the tokenizer associated with Google's T5 model.
   - This tokenizer is specifically designed for the T5 model and implements a tokenization strategy that splits text into tokens suitable for the model's input format.

2. **Tokenization Strategy**:
   - T5's tokenizer is a SentencePiece-based subword tokenizer. It breaks words into smaller subword units, which helps in handling rare words, domain-specific terminology, and words that do not appear in the vocabulary as whole tokens.
   - Subword tokenization is particularly effective for machine translation tasks like English-to-French, as it can handle variations in word forms, such as verb conjugations, plural forms, and compound words.

3. **Prompt Addition**:
   - During tokenization, a specific translation prompt is added to the beginning of each English text input. In this case, the prompt is "translate English to French:". This prompt helps the model understand the task it needs to perform (i.e., translation from English to French).
   - Adding prompts is a common practice in fine-tuning models for specific tasks, as it provides context and guidance to the model regarding the desired output.

4. **Target Outputs**:
   - In addition to tokenizing the English inputs with the prompt, the French sentences from the dataset are also tokenized separately as target outputs. This allows the model to learn the mapping between English inputs and their corresponding French translations during training.
   - The tokenized French sentences are stored as the `labels` that the model is trained to predict, so the training loss is computed against the reference translation.

5. **Max Length and Truncation**:
   - To manage memory and computational resources efficiently, tokenization includes parameters such as maximum sequence length (`max_length`) and truncation. In the code, the `max_length` parameter is set to 512, which limits the length of input sequences to 512 tokens.
   - Truncation is applied to ensure that input sequences longer than the maximum length are appropriately truncated to fit the model's input size. This helps prevent memory errors and ensures smooth training and inference.
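
The steps above (prefixing, subword splitting, truncation) can be sketched without any library. This is a toy illustration only: the vocabulary `TOY_VOCAB`, the greedy longest-match rule in `toy_subword_split`, and the `encode` helper are all assumptions made for the example and are not how SentencePiece actually segments text. The sketch does mirror one real behavior: truncation leaves room for the end-of-sequence token `</s>`.

```python
TASK_PREFIX = "translate English to French:"
TOY_VOCAB = {"translate", "english", "to", "french", ":",
             "un", "believ", "able", "walk", "ing"}


def toy_subword_split(word, vocab):
    """Greedy longest-match split; characters with no match become '<unk>'."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append("<unk>")
            start += 1
    return pieces


def encode(text, vocab, max_length):
    """Prefix the text, split it into subwords, truncate, and append '</s>'."""
    prefixed = f"{TASK_PREFIX} {text}".lower().replace(":", " :")
    pieces = []
    for word in prefixed.split():
        pieces.extend(toy_subword_split(word, vocab))
    return pieces[: max_length - 1] + ["</s>"]


full = encode("unbelievable walking", TOY_VOCAB, max_length=512)
truncated = encode("unbelievable walking", TOY_VOCAB, max_length=8)
print(full)
print(truncated)
```

Words missing from the vocabulary as whole units ("unbelievable", "walking") are still representable as sequences of known pieces, which is the property that makes subword tokenization useful for translation.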



In [None]:
# we will be using the smallest Google T5 model to speed up the process

from transformers import AutoTokenizer

CHECKPOINT = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)

In [None]:
# let's tokenize the dataset

# get languages of dataset
source_language, target_language = raw_datasets['train'].features['translation'].languages
print(f"SOURCE LANGUAGE: {source_language} (English)")
print(f"TARGET LANGUAGE: {target_language} (French)")

# T5 expects a task prefix in front of each input text
t5_translation_prompt = 'translate English to French:'

def tokenize_function(batch):
    # we add the prompt in front of all the English text inputs
    source_inputs = [f"{t5_translation_prompt} {example[source_language]}" for example in batch['translation']]
    # we get all the French text outputs
    target_outputs = [example[target_language] for example in batch['translation']]
    # we tokenize with truncation and cap max length 512 given the max length of french is 324 to speed up training
    return tokenizer(source_inputs, text_target=target_outputs, max_length=512, truncation=True)

# tokenize dataset in batch for speed
tokenized_datasets = split_datasets.map(tokenize_function, batched=True, remove_columns=raw_datasets['train'].column_names)
tokenized_datasets

# Setup Training

## Clone Model

In [None]:
# let's clone a small pre-trained t5 model

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

In [None]:
# let's checkout the model architecture

from torchinfo import summary

summary(model)

The provided summary outlines the architecture and parameter count of the T5 model used in the code. Here's a breakdown of the key components and their respective parameter counts:

1. **T5ForConditionalGeneration**:
   - This is the main model for conditional text generation, specifically designed for tasks like translation, summarization, etc.
   - It comprises an embedding layer and two T5Stacks for processing input and generating output.

2. **Embedding**:
   - The embedding layer has a total of 16,449,536 parameters.
   - It handles the conversion of input tokens into dense vectors suitable for processing by the model.

3. **T5Stack**:
   - The T5Stack consists of multiple T5Blocks and associated layers for processing input and generating output.
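   - The 16,449,536 figure that appears under each T5Stack in the summary matches the size of the token embedding, which both stacks reuse from the shared embedding layer. It is the same weight matrix in both places, not an additional 32,899,072 parameters.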

4. **T5Block**:
   - Each T5Block within the T5Stack has its own set of parameters for operations such as self-attention, feed-forward layers, etc.
   - The model contains multiple T5Blocks, each contributing significantly to the overall parameter count.

5. **T5LayerNorm and Dropout**:
   - T5LayerNorm layers are used for normalization, and Dropout layers are used for regularization and preventing overfitting.
   - These layers collectively contribute to the model's performance and stability during training.

6. **Linear Layer**:
   - The Linear layer (the language-modeling head) has 16,449,536 parameters and converts the final hidden states into logits over the vocabulary for token generation. In `t5-small` its weights are tied to the embedding matrix.

7. **Total Parameters**:
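   - The summary reports a total of 109,855,232 parameters, all of which are trainable. This figure counts the shared 16,449,536-parameter embedding matrix four times (shared embedding, encoder, decoder, and output layer); the number of unique parameters in `t5-small` is 60,506,624.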
   - These parameters are crucial for the model's ability to learn and generate high-quality text outputs.
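
The parameter counts can be cross-checked with plain arithmetic. The sketch below assumes the published `t5-small` configuration (vocabulary 32,128, model width 512, feed-forward width 2,048, 6 layers per stack, 8 heads, 32 relative-attention buckets) and assumes that layers have no bias terms and that layer norms carry a scale vector only. These values are not printed anywhere in this notebook, so treat them as assumptions.

```python
# Assumed t5-small configuration values (not shown in the notebook output).
VOCAB_SIZE, D_MODEL, D_FF = 32128, 512, 2048
NUM_LAYERS, NUM_HEADS, NUM_BUCKETS = 6, 8, 32

embedding = VOCAB_SIZE * D_MODEL           # shared token embedding matrix
attention = 4 * D_MODEL * D_MODEL          # q, k, v, o projections, no biases
feed_forward = 2 * D_MODEL * D_FF          # two linear layers, no biases
layer_norm = D_MODEL                       # scale vector only
relative_bias = NUM_BUCKETS * NUM_HEADS    # stored once per stack, in the first block

encoder_block = attention + feed_forward + 2 * layer_norm
decoder_block = 2 * attention + feed_forward + 3 * layer_norm  # self- and cross-attention

encoder = NUM_LAYERS * encoder_block + relative_bias + layer_norm  # plus final layer norm
decoder = NUM_LAYERS * decoder_block + relative_bias + layer_norm

unique_total = embedding + encoder + decoder
# Counting the tied embedding again for the encoder, decoder, and output layer:
summed_with_repeats = unique_total + 3 * embedding

print(f"embedding:           {embedding:,}")
print(f"unique parameters:   {unique_total:,}")
print(f"with repeated ties:  {summed_with_repeats:,}")
```

Under these assumptions the last figure reproduces the total shown in the summary, which suggests the gap comes from the tied embedding being counted more than once rather than from extra weights.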



## Setup Data Collator

In [None]:
# setup data collator designed for seq2seq models like t5

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)
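
Conceptually, the collator pads every example in a batch to the longest one. The sketch below is a simplified, library-free stand-in, not the actual `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and it assumes right padding with pad id 0 for inputs and -100 for labels so that padded label positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Right-pad input_ids and labels to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_gap = max_input - len(f["input_ids"])
        label_gap = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * input_gap)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_gap)
        # -100 marks label positions that the loss should skip
        batch["labels"].append(f["labels"] + [label_pad_token_id] * label_gap)
    return batch


# Made-up token ids; 1 stands in for the end-of-sequence token.
features = [
    {"input_ids": [5, 6, 7, 1], "labels": [9, 1]},
    {"input_ids": [5, 1], "labels": [8, 8, 8, 1]},
]
batch = pad_batch(features)
print(batch)
```

Padding per batch rather than to a global maximum keeps most batches much shorter than the 512-token cap used during tokenization.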

## Setup Training Metrics

In [None]:
import evaluate
import numpy as np

# import metric
bleu_metric = evaluate.load('bleu')

def compute_metrics(eval_preds):
    preds, labels = eval_preds

    # decode predicted sentence and skip special tokens
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # replace -100 (positions ignored by the loss) with the pad token id, then decode the reference labels
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # postprocess for bleu metric
    post_decoded_preds = [pred.strip() for pred in decoded_preds]
    post_decoded_labels = [[label.strip()] for label in decoded_labels]

    # compute bleu score
    result = bleu_metric.compute(predictions=post_decoded_preds, references=post_decoded_labels)

    return {'bleu': result['bleu']}

## Setup Trainer

In [None]:
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# we set up the training configuration
# 1. we set the seed for reproducibility
# 2. we write checkpoints and outputs to the local 'results' directory
# 3. we use the PyTorch implementation of the AdamW optimizer
# 4. we train for 3 epochs
# 5. we use a batch size of 16 per device
# 6. we save and evaluate once per epoch
# 7. we load the best model (lowest validation loss) at the end
# 8. we disable reporting to external loggers with report_to='none'
# 9. we enable fp16 mixed precision to speed up the run
# 10. we generate predictions during evaluation so BLEU can be computed
training_args = Seq2SeqTrainingArguments(
    seed=SEED,
    output_dir='results',
    optim='adamw_torch',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    load_best_model_at_end=True,
    report_to='none',
    fp16=True,
    predict_with_generate=True,
)

# we setup trainer with all previous variables
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Here is a more detailed explanation of each aspect of the training configuration for the T5 model:

1. **Seed for Reproducibility**:
   - A random seed is set (`seed=SEED`) to ensure that the training process is reproducible. Setting a seed means that the random number generation during training will be the same across runs, leading to consistent results for debugging and comparison purposes.

2. **Output Directory**:
   - The `output_dir='results'` parameter specifies the directory where the trained model and training logs will be saved. After training, the model checkpoints, evaluation results, and other relevant information will be stored in this directory.

3. **Optimizer**:
   - The optimizer used for training is specified as `'adamw_torch'`. This refers to the AdamW optimizer, which is a variant of the Adam optimizer with weight decay regularization. AdamW is commonly used for training transformer-based models like T5.

4. **Number of Epochs**:
   - The `num_train_epochs=3` parameter defines the number of training epochs. An epoch refers to one complete pass through the entire training dataset. In this case, the model will be trained for 3 epochs.

5. **Batch Sizes**:
   - The batch size for training and evaluation is set using `per_device_train_batch_size=16` and `per_device_eval_batch_size=16`. These parameters control the number of training examples processed in each iteration (batch) during training and evaluation.
   - Using batch processing helps in optimizing memory usage and computational efficiency.

6. **Save Strategy**:
   - The `save_strategy='epoch'` parameter specifies the strategy for saving model checkpoints during training. In this case, a checkpoint is saved after each epoch. The save strategy has to match the evaluation strategy so that `load_best_model_at_end` can pick the best checkpoint.

7. **Evaluation Strategy**:
   - The `evaluation_strategy='epoch'` parameter determines when model evaluation is performed. With `'epoch'`, evaluation runs after each training epoch, which makes it possible to monitor validation loss and BLEU as training progresses. No early stopping is configured in this setup.

8. **Load Best Model**:
   - The `load_best_model_at_end=True` parameter indicates that the best checkpoint is loaded at the end of training. Since no other metric is specified, the best checkpoint is the one with the lowest validation loss. This ensures that the model used for inference is the best one seen during training rather than simply the last one.

9. **Logging**:
   - Logging during training is controlled by `report_to='none'`, which disables logging to any external platforms or services. This can help in reducing unnecessary output and focusing on the training process.

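10. **FP16 Precision**:
    - The `fp16=True` parameter enables mixed-precision training using the 16-bit floating-point format (`FP16`). Mixed precision can lead to faster training times and reduced memory usage, especially on GPUs that support FP16 computation.

11. **Generation During Evaluation**:
    - The `predict_with_generate=True` parameter makes the trainer call the model's `generate()` method during evaluation, so that `compute_metrics` receives generated token sequences from which BLEU can be computed.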
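
To get a feel for what these settings imply, the number of optimizer steps can be estimated from the dataset size and batch size. The sketch below is a rough estimate assuming no gradient accumulation and that the last partial batch is kept; `optimizer_steps` is a hypothetical helper and the 100,000-example training set is a made-up figure, not the size of the dataset used here.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, num_epochs, num_devices=1):
    """Estimate steps per epoch and total steps (no gradient accumulation)."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)  # last partial batch kept
    return steps_per_epoch, steps_per_epoch * num_epochs


# Hypothetical training-set size, with the batch size and epochs configured above.
per_epoch, total = optimizer_steps(100_000, per_device_batch_size=16, num_epochs=3)
print(per_epoch, total)
```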


# Train Model

In [None]:
# let's measure the pre-trained model's performance on the test set before fine-tuning

trainer.evaluate(tokenized_datasets['test'])

Running `trainer.evaluate(tokenized_datasets['test'])` returns a dictionary of metrics describing the T5 model's performance on the test dataset. The most important entries are:

1. **`eval_loss`: 2.1427161693573**
   - This value represents the evaluation loss, which is a measure of how well the model's predictions match the actual target outputs during evaluation. A lower evaluation loss indicates better performance.

2. **`eval_bleu`: 0.03824929014537052**
   - This value corresponds to the BLEU score achieved by the model during evaluation. BLEU (Bilingual Evaluation Understudy) is a metric commonly used to evaluate the quality of machine-translated text. A higher BLEU score indicates better translation quality, with a maximum value of 1.

The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high quality reference translations. ([Google Cloud](https://cloud.google.com/translate/automl/docs/evaluate#:~:text=The%20BLEU%20score%20is%20a,of%20high%20quality%20reference%20translations.))
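
To make the metric less opaque, here is a minimal sketch of how a BLEU-style score is assembled from clipped n-gram precisions and a brevity penalty. It is a simplified single-sentence, single-reference version with whitespace tokenization and no smoothing; the `evaluate` metric used in this notebook aggregates statistics over the whole corpus, so its numbers will not match this toy function. The names `ngram_counts` and `simple_bleu` are illustrative.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(prediction, reference, max_order=4):
    """Geometric mean of clipped 1..max_order-gram precisions times a brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_order + 1):
        pred_counts, ref_counts = ngram_counts(pred, n), ngram_counts(ref, n)
        # Counter intersection clips each n-gram at its count in the reference
        overlap = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precision_sum += math.log(overlap / total) / max_order
    # Penalize predictions that are shorter than the reference
    brevity_penalty = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity_penalty * math.exp(log_precision_sum)


reference = "le chat est sur le tapis"
exact = simple_bleu("le chat est sur le tapis", reference)
close = simple_bleu("le chat est sur le lit", reference)
unrelated = simple_bleu("bonjour tout le monde ici", reference)
print(exact, close, unrelated)
```

A single wrong word lowers every n-gram precision at once, which is one reason BLEU scores for imperfect translations drop quickly.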

In [None]:
# let's train the model on the dataset

trainer.train()

The per-epoch metrics reported during training can be read as follows:
* Training loss: it decreases from epoch to epoch, indicating that the model is learning from the training data.
* Validation loss: it also decreases, suggesting that the model generalizes to unseen data rather than only memorizing the training set.
* BLEU score: it improves slightly across epochs, indicating that the generated translations overlap more with the reference translations.

In [None]:
# let's evaluate the best checkpoint (lowest validation loss) on the validation set

trainer.evaluate()

In [None]:

trainer.evaluate(tokenized_datasets['test'])

* The evaluation loss (`eval_loss`) of approximately 1.47 is lower than the 2.14 measured on the same test set before fine-tuning, so the model's predictions are closer to the target outputs.
* The BLEU score (`eval_bleu`) of approximately 0.061 is higher than the 0.038 measured before fine-tuning, so the translations on the test set have improved, although the score remains low on BLEU's 0-to-1 scale.

# Let's try out some examples

In [None]:
# setup english to french translator pipeline

from transformers import pipeline

translator = pipeline('translation_en_to_fr', model=model.to('cpu'), tokenizer=tokenizer)

In [None]:
text = "translate English to French: I am Reza Mirjalili, PhD student."
translator(text)

In [None]:
text = "translate English to French: I am currently running some codes here!"
translator(text)

In [None]:
text = "translate English to French: My name is amele polon!"
translator(text)

## Some dialogue from The Godfather


In [None]:
from IPython import display
display.Image(url= "http://res.cloudinary.com/ybmedia/image/upload/c_crop,h_314,w_477,x_0,y_0/c_scale,f_auto,q_auto,w_700/v1/m/5/5/55aaeb27ceae3941a5ee151520ccfa004a7a2cb7/wheres-michael.png")

In [None]:
text = "What are you worried about, if I wanted to kill you, you'd be dead already."
translator(text)

## Advantages:

The T5 model with its attention mechanisms offers several advantages for machine translation, especially when dealing with complex linguistic patterns and varying sequence lengths. Let's delve into these advantages:
1.	Attention Mechanisms:
*	T5 incorporates self-attention mechanisms, allowing it to focus on relevant parts of the input sequence during translation.
*  By attending to contextually important tokens, T5 captures intricate linguistic nuances, resulting in more accurate translations.
*  Attention mechanisms enable the model to weigh the significance of different words, considering their impact on the overall meaning.
2.	Handling Sequence Lengths:
*  T5's architecture is well-suited for handling variable-length sequences.
*  Unlike models with learned absolute position embeddings (such as BERT), T5 uses relative position biases, so it is not tied to a fixed maximum input length, although memory use still grows with sequence length.
*  This flexibility is crucial for machine translation, where source and target sentences can vary significantly in length.
3.	Text-to-Text Paradigm:
*  T5's text-to-text paradigm simplifies the translation task.
*  Regardless of the specific NLP task, both input and output are treated as text.
*  For machine translation, this means treating the source sentence as input text and the target sentence as output text.
*  The consistent framework streamlines training and fine-tuning.
4.	Generative Span-Corruption Pre-training Task:
*  During pre-training, T5 learns to reconstruct the masked spans of a corrupted input.
*  This encourages the model to understand context and generate coherent sequences.
*  For translation, this ability to reconstruct meaningful sentences enhances translation quality.
5.	Capturing Linguistic Patterns:
*  T5's attention mechanisms allow it to capture long-range dependencies.
*  It can recognize patterns that span across multiple tokens, such as subject-verb agreements or idiomatic expressions.
*  This capability contributes to accurate translations, especially when dealing with complex linguistic structures.
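
The idea of "weighing the significance of different words" can be made concrete with a small sketch of standard scaled dot-product attention for a single query. This is the generic textbook formulation with made-up 2-dimensional vectors, not T5's exact layer: T5's implementation differs in details such as its learned relative position biases. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dims = len(values[0])
    context = [sum(w * value[d] for w, value in zip(weights, values)) for d in range(dims)]
    return weights, context


# Toy vectors: the query points in the same direction as the first key.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, context = attend(query, keys, values)
print(weights)
print(context)
```

The output is a blend of the values dominated by the best-matching key, which is how a decoder position can draw mostly on the source tokens relevant to the word it is about to generate.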

## Limitations:
The T5 model also comes with limitations and challenges when used for machine translation:
1.	Computational Resources:
*  Training T5 requires substantial computational resources due to its large-scale architecture.
*  Fine-tuning T5 on specific tasks, including machine translation, demands powerful GPUs or TPUs.
*  Smaller organizations or researchers with limited resources may find it challenging to train and fine-tune T5 effectively.
2.	Inference Time:
*  During inference (when translating new sentences), T5's inference time can be relatively slow.
*  The model’s depth and attention mechanisms contribute to this slower processing speed.
*  Real-time applications may face latency issues when using T5 for translation.
3.	Memory Requirements:
*  T5's large number of parameters necessitates significant memory during both training and inference.
*  Handling long sequences (such as lengthy paragraphs) can be memory-intensive.
*  Memory constraints can impact the batch size and overall efficiency.
4.	Domain-Specific Vocabulary:
*  T5's pre-training data includes a wide range of text from diverse domains.
*  However, it may struggle with domain-specific vocabulary not well-represented in its training data.
*  For specialized domains (e.g., legal, medical, or technical), T5 may produce suboptimal translations due to lack of exposure to relevant terminology.
5.	Rare Words and Phrases:
*  While T5 performs well on common words and phrases, it may encounter difficulties with rare or infrequent terms.
*  Rare words may not have sufficient context in the training data, leading to inaccurate translations.
*  Handling out-of-vocabulary (OOV) words remains a challenge.
6.	Contextual Ambiguity:
*  T5's attention mechanisms help capture context, but ambiguities can still arise.
*  Some sentences have multiple valid translations based on context.
*  T5 may struggle with disambiguating homonyms or polysemous words.
7.	Long-Range Dependencies:
*  Although T5's attention spans are impressive, capturing very long-range dependencies can be challenging.
*  Extremely distant context may not influence the translation as effectively.
*  Splitting long sentences into shorter segments can mitigate this issue.
8.	Fine-Tuning Data Size:
*  Fine-tuning T5 requires task-specific data, including parallel corpora for translation.
*  Availability of high-quality, domain-specific parallel data can be limited.
*  Insufficient fine-tuning data may hinder T5's performance on specific language pairs.
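
Item 7 suggests splitting long inputs into shorter segments before translating them. A minimal sketch of that idea follows, assuming sentences end with `.`, `!`, or `?` and using a word budget as a rough proxy for token length. `split_into_chunks` is a hypothetical helper; a production setup would more likely measure length with the model's tokenizer.

```python
import re


def split_into_chunks(text, max_words):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact in its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))  # close the chunk before it overflows
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "I came. I saw the city. I conquered it quickly."
small_chunks = split_into_chunks(text, max_words=5)
larger_chunks = split_into_chunks(text, max_words=6)
print(small_chunks)
print(larger_chunks)
```

Each chunk can then be translated separately and the outputs concatenated, at the cost of losing context that crosses chunk boundaries.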

## Improvements

Enhancing translation quality using the T5 model involves exploring various avenues. Here are some potential improvements and extensions:
1.	Fine-Tuning on Domain-Specific Data:
* Collect and curate parallel corpora specific to the target domain (e.g., legal, medical, technical).
* Fine-tune T5 on this domain-specific data to adapt the model to specialized terminology and context.
* Domain-specific fine-tuning improves translation accuracy within specific professional fields.
2.	Ensemble Methods:
* Combine predictions from multiple T5 models using ensemble techniques.
* Train several T5 models with different initializations or architectures.
* Aggregate their outputs (e.g., weighted averaging or voting) to improve overall translation quality.
* Ensembles mitigate individual model biases and enhance robustness.
3.	Handling Out-of-Vocabulary (OOV) Words:
* Address OOV words by:
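    * Subword Tokenization: T5 already uses a SentencePiece subword tokenizer, which splits rare words into subword units; extending or retraining its vocabulary on in-domain text is one option.
    * BPE (Byte-Pair Encoding): BPE is an alternative subword scheme that could be used to handle OOV terms during tokenization.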
    * Fallback Strategies: When encountering OOV words, fall back to a dictionary-based translation or a nearest neighbor approach.
    * Copy Mechanisms: Implement copy mechanisms to directly copy OOV words from the source to the target.
4.	Contextualized Embeddings for Rare Terms:
* Enhance T5's embeddings with contextualized word representations (e.g., ELMo, BERT, or RoBERTa).
* These embeddings capture context and improve handling of rare or unseen words.
* Combine T5's pre-trained embeddings with contextualized embeddings during fine-tuning.
5.	Multi-Task Learning:
* Extend T5's capabilities by incorporating multi-task learning.
* Train T5 jointly on machine translation and related tasks (e.g., text summarization, question answering).
* Shared representations across tasks can enhance translation quality.
6.	Adaptive Attention and Positional Encodings:
* Explore adaptive attention mechanisms that dynamically adjust attention weights based on context.
* Experiment with learned positional encodings to better capture word order and dependencies.
* These techniques can improve translation quality for long sentences.
7.	Domain Adaptation:
* Use techniques like domain adaptation or domain adversarial training.
* Align T5's representations with domain-specific data during fine-tuning.
* Domain adaptation helps T5 generalize better to specific domains.
8.	Post-Editing and Human Feedback:
* Deploy T5 in a human-in-the-loop setup.
* Generate initial translations and allow human translators to post-edit.
* Collect feedback to iteratively improve T5's performance.
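
As an illustration of the ensemble idea in item 2, the sketch below averages the next-token probability distributions of several models and picks the most likely token. Everything here is hypothetical: `ensemble_next_token` is an illustrative helper, the distributions are made up, and a real ensemble would combine full-vocabulary probabilities at every decoding step.

```python
def ensemble_next_token(distributions, weights=None):
    """Weighted average of per-model next-token distributions.

    Each distribution maps token -> probability. Returns the best token
    and the averaged distribution.
    """
    if weights is None:
        weights = [1.0] * len(distributions)
    total_weight = sum(weights)
    averaged = {}
    for distribution, weight in zip(distributions, weights):
        for token, probability in distribution.items():
            averaged[token] = averaged.get(token, 0.0) + weight * probability / total_weight
    best_token = max(averaged, key=averaged.get)
    return best_token, averaged


# Made-up probabilities from two hypothetical fine-tuned models.
model_a = {"chat": 0.6, "chien": 0.4}
model_b = {"chat": 0.3, "chien": 0.7}

equal_choice, equal_avg = ensemble_next_token([model_a, model_b])
weighted_choice, weighted_avg = ensemble_next_token([model_a, model_b], weights=[0.8, 0.2])
print(equal_choice, equal_avg)
print(weighted_choice, weighted_avg)
```

With equal weights the ensemble sides with the more confident model; weighting lets a model that is known to be stronger dominate the decision.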



## Resources
1. [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)
2. [Clinical-T5: Large Language Models Built Using MIMIC Clinical Text](https://www.physionet.org/content/clinical-t5/1.0.0/)

