# Evaluating the model

Download the model and upload it to the Colab runtime in a folder with the same name as the one used in the code (`my_uk_to_us_t5`). All model files should be enclosed in that folder.

In [9]:
# Load the model and tokenizer for future use
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
import re
from io import StringIO
import pandas as pd


# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

model = T5ForConditionalGeneration.from_pretrained("my_uk_to_us_t5").to(device)
tokenizer = T5Tokenizer.from_pretrained("my_uk_to_us_t5")

# Preprocessing function for text
def preprocess_text(text):
    text = text.strip().lower()  # Remove surrounding spaces and convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s\.,!?]', '', text)  # Keep only letters, digits, whitespace and . , ! ? (drops emoji, apostrophes and other symbols)
    text = re.sub(r'\s+', ' ', text)  # Replace runs of whitespace with a single space
    text = re.sub(r'\s+([.,!?;])', r'\1', text)  # Remove spaces left before punctuation
    text = text.capitalize()  # Capitalize the first letter (the rest stays lowercase)
    return text

# Function for Inference
def translate_uk_to_us(text):
    input_text = "UK to US: " + text
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Quick check on a single sentence
# test_sentence = "I CoLoUr 🎨 the centre of my favourite book."
test_sentence = "He is travelling to the theatre."
# Note: translate_uk_to_us() prepends "UK to US: " again, so the model sees the prefix twice
test_sentence = "UK to US: " + preprocess_text(test_sentence)
print(test_sentence)
print("Translated: ", translate_uk_to_us(test_sentence))

UK to US: He is travelling to the theatre.
Translated:  He is traveling to the theater.
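
To see what the preprocessing step does before a sentence reaches the model, the sketch below copies `preprocess_text` from the cell above and applies it to three inputs taken from the test data used later in this notebook. It is only an illustration of the cleaning rules. Note in particular that apostrophes are removed and that all text after the first letter ends up lowercase.

```python
import re

# Copy of preprocess_text from the cell above, repeated so this snippet runs on its own
def preprocess_text(text):
    text = text.strip().lower()
    text = re.sub(r'[^a-zA-Z0-9\s\.,!?]', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\s+([.,!?;])', r'\1', text)
    text = text.capitalize()
    return text

samples = [
    "The PROGRAMME 🗓️ will start at 6 O'CLOCK.",
    "The cheque 💳 arrived late 😢.",
    "The programme 🗓️ starts at 10 AM.",
]

cleaned = [preprocess_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
# The first sample becomes 'The programme will start at 6 oclock.'
# The second sample becomes 'The cheque arrived late.'
# The third sample becomes 'The programme starts at 10 am.'
```

Because the same function is applied to the reference column, references such as "6 o'clock" are compared in their cleaned form ("6 oclock").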


For a more structured evaluation of the model, we can compare the predicted translations with the ground-truth US English translations in the test set. Metrics such as BLEU, ROUGE, and exact-match accuracy give a quantitative assessment of the model's performance.

Metrics used for evaluation:
- BLEU (Bilingual Evaluation Understudy): Measures how many n-grams in the predicted sentence also appear in the reference (ground truth), i.e., n-gram precision.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures n-gram overlap from the reference side, i.e., how many n-grams in the reference sentence are covered by the prediction. The code below reports the F-measure, which combines precision and recall.
- Accuracy: Measures how many predictions match the ground truth exactly.

Let's modify the code to include these evaluations.
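
As a rough illustration of what "n-gram overlap" means, the sketch below computes clipped unigram and bigram precision for a single sentence pair using only the standard library. The helper names `ngrams` and `clipped_precision` are made up for this example. This is not full BLEU (it omits the brevity penalty and the geometric mean over n-gram orders), so the library functions used later will give different numbers.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n):
    """Fraction of predicted n-grams that also occur in the reference.

    Each predicted n-gram is counted at most as often as it occurs in the reference.
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    total = sum(pred_counts.values())
    return overlap / total if total else 0.0

reference = "He is traveling to the theater."
prediction = "He is traveling to the theatre."  # one word left unconverted

p1 = clipped_precision(prediction, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(prediction, reference, 2)  # 4 of 5 bigrams match
exact = prediction == reference                   # False: not an exact match

print(f"unigram precision: {p1:.3f}, bigram precision: {p2:.3f}, exact match: {exact}")
```

A single unconverted word lowers the overlap scores only slightly but turns the exact match into a miss, which is why accuracy is reported alongside BLEU and ROUGE.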

## 1. Import Required Libraries
First, you need to install and import the required libraries for evaluation:

In [10]:
!pip install datasets nltk rouge_score sacrebleu



In [11]:
import nltk
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer
import sacrebleu

In [12]:
data = """input_text,target_text
"I CoLoUr 🎨 the centre of my favourite book.","I color the center of my favorite book."
"He is travelling ✈️ to the THEATRE.","He is traveling to the theater."
"I have a flat near the lift.","I have an apartment near the elevator."
"I have a flat near the lift. ","I have an apartment near the elevator."
"The PROGRAMME 🗓️ will start at 6 O'CLOCK.","The program will start at 6 o'clock."
"HE has a cheque 💳 for payment.","He has a check for payment."
"She wears jewellery 💎 on occasions...","She wears jewelry on occasions."
" THEY are Practising   ⚽ for the football MATCH.","They are practicing for the soccer game."
"He is using a spanner for the repair.","He is using a wrench for the repair."
"The aeroplane ✈️ landed on time.","The airplane landed on time."
"hello... 😃 how are you?","hello... 😃 how are you?"
"She bought some colour pencils.","She bought some color pencils."
"I am going to the lift.","I am going to the elevator."
"His behaviour 🤔 is unacceptable.","His behavior is unacceptable."
"The cheque 💳 arrived late 😢.","The check arrived late."
"Do you know where the lift is?","Do you know where the elevator is?"
"The labor union is organizing a programme 🗓️.","The labour union is organizing a program."
"He enjoys playing football ⚽.","He enjoys playing soccer."
"I love visiting the theatre.","I love visiting the theater."
"Their practise sessions are improving.","Their practice sessions are improving."
"He likes the colour red.","He likes the color red."
"The cheque has been approved.","The check has been approved."
"The aeroplane ✈️ was delayed.","The airplane was delayed."
"Their neighbourhood is beautiful.","Their neighborhood is beautiful."
"They've cancelled the programme.","They've canceled the program."
"She practises yoga regularly.","She practices yoga regularly."
"The cheque has not arrived yet.","The check has not arrived yet."
"He is organizing a theatre play.","He is organizing a theater play."
"I prefer the lift to the stairs.","I prefer the elevator to the stairs."
"His behaviour has been exemplary.","His behavior has been exemplary."
"Is the cheque ready for collection?","Is the check ready for collection?"
"Please colour 🎨 this drawing.","Please color this drawing."
"The aeroplane ✈️ has landed safely.","The airplane has landed safely."
"They're still practising football ⚽.","They're still practicing soccer."
"Her jewellery collection is stunning.","Her jewelry collection is stunning."
"What's the programme for tomorrow?","What's the program for tomorrow?"
"Their labour union is powerful.","Their labor union is powerful."
"They enjoy going to the theatre.","They enjoy going to the theater."
"Her favourite dish is lasagna.","Her favorite dish is lasagna."
"I need to go to the flat.","I need to go to the apartment."
"The cheque is invalid.","The check is invalid."
"The aeroplane ✈️ is ready for boarding.","The airplane is ready for boarding."
"He prefers the colour blue.","He prefers the color blue."
"The theatre play was amazing.","The theater play was amazing."
"The programme 🗓️ starts at 10 AM.","The program starts at 10 AM."
"Their neighbourhood is very welcoming.","Their neighborhood is very welcoming."
"Please practise before the event.","Please practice before the event."
"Her jewellery is antique.","Her jewelry is antique."
"The cheque 💳 bounced.","The check bounced."
"She wears jewellery every day.","She wears jewelry every day."
"He works in the theatre.","He works in the theater."
"Her behaviour 🤔 is strange lately.","Her behavior is strange lately."
"The cheque is in processing.","The check is in processing."
"They are rehearsing for the programme.","They are rehearsing for the program."
"The aeroplane ✈️ is landing shortly.","The airplane is landing shortly."
"Her favourite sport is football ⚽.","Her favorite sport is soccer."
"The cheque will be sent tomorrow.","The check will be sent tomorrow."
"The aeroplane has been delayed again.","The airplane has been delayed again."
"They prefer the colour green.","They prefer the color green."
"She is visiting the theatre tomorrow.","She is visiting the theater tomorrow."
"The programme is about to begin.","The program is about to begin."
"The cheque 💳 is ready for pickup.","The check is ready for pickup."
"Her favourite pastime is painting.","Her favorite pastime is painting."
"His favourite sport is rugby.","His favorite sport is rugby."
"The aeroplane ✈️ is taking off.","The airplane is taking off."
"She practises football daily.","She practices soccer daily."
"The cheque is overdue.","The check is overdue."
"Her behaviour has been concerning.","Her behavior has been concerning."
"The cheque is being reissued.","The check is being reissued."
"The theatre group is performing tonight.","The theater group is performing tonight."
"They are enjoying the programme.","They are enjoying the program."
"Their jewellery is made of gold.","Their jewelry is made of gold."
"The cheque has been misplaced.","The check has been misplaced."
"Her favourite flower is a rose.","Her favorite flower is a rose."
"He is practicing football ⚽ right now.","He is practicing soccer right now."
"Her jewellery box is full.","Her jewelry box is full."
"The cheque 💳 has been canceled.","The check has been canceled."
"The aeroplane ✈️ was on time.","The airplane was on time."
"He loves the colour yellow.","He loves the color yellow."
"She is practising for the marathon.","She is practicing for the marathon."
"The programme 🗓️ was postponed.","The program was postponed."
"The aeroplane ✈️ has already taken off.","The airplane has already taken off."
"The cheque will be delivered tomorrow.","The check will be delivered tomorrow."
"They enjoy watching theatre performances.","They enjoy watching theater performances."
"She painted the colour blue on the wall.","She painted the color blue on the wall."
"He is participating in the programme.","He is participating in the program."
"The aeroplane ✈️ was delayed again.","The airplane was delayed again."
"The cheque 💳 is ready for withdrawal.","The check is ready for withdrawal."
"She has a collection of beautiful jewellery 💎.","She has a collection of beautiful jewelry."
"The cheque is still pending.","The check is still pending."
"The aeroplane ✈️ will arrive shortly.","The airplane will arrive shortly."
"The theatre's performance was breathtaking.","The theater's performance was breathtaking."
"Her behaviour has been commendable.","Her behavior has been commendable."
"The cheque was never received.","The check was never received."
"The aeroplane ✈️ took off on time.","The airplane took off on time."
"She wears jewellery for special occasions.","She wears jewelry for special occasions."
"""

In [13]:
df = pd.read_csv(StringIO(data))

# Apply preprocessing to the input and target text columns
df["input_text"] = df["input_text"].apply(preprocess_text)
df["target_text"] = df["target_text"].apply(preprocess_text)

## 2. Define Evaluation Functions
We can define functions to calculate BLEU, ROUGE, and accuracy.

In [14]:
def compute_bleu(predictions, references):
    """
    Compute the BLEU score between predicted and reference texts
    """
    # BLEU uses list of tokenized sentences, where each sentence is a list of words
    tokenized_predictions = [pred.split() for pred in predictions]
    tokenized_references = [[ref.split()] for ref in references]  # Reference is a list of lists

    return corpus_bleu(tokenized_references, tokenized_predictions)

def compute_rouge(predictions, references):
    """
    Compute the mean ROUGE-1, ROUGE-2, and ROUGE-L F-measure over all predicted/reference pairs
    """
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge_scores = {"rouge1": [], "rouge2": [], "rougeL": []}

    for pred, ref in zip(predictions, references):
        score = scorer.score(ref, pred)
        rouge_scores["rouge1"].append(score["rouge1"].fmeasure)
        rouge_scores["rouge2"].append(score["rouge2"].fmeasure)
        rouge_scores["rougeL"].append(score["rougeL"].fmeasure)

    # Average the scores
    rouge_scores = {k: sum(v) / len(v) for k, v in rouge_scores.items()}
    return rouge_scores

def compute_accuracy(predictions, references):
    """
    Compute the exact match accuracy
    """
    exact_matches = sum([1 for pred, ref in zip(predictions, references) if pred == ref])
    return exact_matches / len(predictions)

def compute_sacrebleu(predictions, references):
    """
    Compute the SacreBLEU score
    """
    return sacrebleu.corpus_bleu(predictions, [references])



BLEU Score Calculation
The BLEU score measures how many n-grams in the prediction also appear in the reference text (n-gram precision), with a penalty for predictions that are shorter than the reference.

ROUGE Score Calculation
The ROUGE score measures n-gram overlap between the predicted and reference text. It was designed as a recall-oriented metric; the function above reports the F-measure, which balances precision and recall.

Accuracy Calculation
Accuracy measures how many exact matches there are between predicted and reference sentences.

SacreBLEU Calculation
SacreBLEU is an implementation of BLEU that standardizes the evaluation and uses a common tokenizer, often preferred for consistent BLEU scoring.
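
To make the ROUGE-L idea concrete, here is a minimal sketch that computes it from the longest common subsequence (LCS) of two token lists. The function names are invented for this example, and tokenization is a plain lowercase whitespace split. The `rouge_score` package applies its own tokenization and optional stemming, so its values may differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(prediction, reference):
    """Return (precision, recall, f_measure) based on the LCS of the two sentences."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# An untranslated sentence scored against its US English reference
reference = "I have an apartment near the elevator."
prediction = "I have a flat near the lift."

precision, recall, f_measure = rouge_l(prediction, reference)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")
```

Here the common subsequence is "i have ... near the" (4 of 7 tokens on each side), so precision, recall, and F-measure all equal 4/7.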

## 3. Evaluate the Model
Now, we can integrate these evaluation metrics into the evaluation part of your code. For each sentence in the test set, we will predict the translation using your fine-tuned model and then calculate the metrics.

In [15]:
# Prepare lists to store predictions and references
predictions = []
references = []

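# Loop through the test dataset to generate predictions and store references
for text, target_text in zip(df["input_text"], df["target_text"]):
    # df["input_text"] is already preprocessed, and translate_uk_to_us() adds the
    # "UK to US: " prefix again, so the model sees the prefix twice
    text = "UK to US: " + preprocess_text(text)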

    # Generate prediction
    pred = translate_uk_to_us(text)
    predictions.append(pred)
    references.append(target_text)

# Calculate BLEU
bleu_score = compute_bleu(predictions, references)
print(f"BLEU score: {bleu_score}")

# Calculate ROUGE
rouge_scores = compute_rouge(predictions, references)
print(f"ROUGE scores: {rouge_scores}")

# Calculate Accuracy
accuracy = compute_accuracy(predictions, references)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Calculate SacreBLEU
sacrebleu_score = compute_sacrebleu(predictions, references)
print(f"SacreBLEU score: {sacrebleu_score.score}")


BLEU score: 0.9598071478068125
ROUGE scores: {'rouge1': 0.9796626984126983, 'rouge2': 0.9754464285714285, 'rougeL': 0.9796626984126983}
Accuracy: 92.71%
SacreBLEU score: 96.94409002653215
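
Aggregate scores do not show which sentences went wrong. A small helper like the one below can list the pairs that fail the exact-match check. The function name `find_mismatches` and the toy lists are made up for illustration and are not the model's actual outputs; in the notebook it would be called with the `predictions` and `references` lists built above.

```python
def find_mismatches(predictions, references):
    """Return (index, prediction, reference) for every pair that is not an exact match."""
    return [
        (i, pred, ref)
        for i, (pred, ref) in enumerate(zip(predictions, references))
        if pred != ref
    ]

# Toy data for illustration only
toy_predictions = [
    "He is traveling to the theater.",
    "The cheque bounced.",
    "I love visiting the theater.",
]
toy_references = [
    "He is traveling to the theater.",
    "The check bounced.",
    "I love visiting the theater.",
]

mismatches = find_mismatches(toy_predictions, toy_references)
for index, pred, ref in mismatches:
    print(f"[{index}] predicted: {pred!r} | expected: {ref!r}")
```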


## 4. Result Interpretation
- BLEU Score: The NLTK implementation used here ranges from 0 to 1, where a higher score means that the generated translations are closer to the reference translations in terms of n-gram overlap.
- ROUGE Scores: These show how much n-gram overlap exists between prediction and reference. The values reported here are F-measures, which combine recall (how many reference n-grams appear in the prediction) and precision.
- Accuracy: The percentage of exact matches between the predicted and reference translations.
- SacreBLEU Score: Similar to BLEU but reported on a 0 to 100 scale, and more reproducible because it applies a standardized tokenizer.

### Evaluation Metrics Breakdown:
#### BLEU Score: 0.96

This is an excellent BLEU score. With NLTK's default settings it means that nearly all 1- to 4-grams in the model's predictions also appear in the reference translations.

#### ROUGE Scores:

ROUGE-1: 0.98. Very high unigram (single-word) overlap, meaning the predictions contain almost all of the words in the references and few extra ones.

ROUGE-2: 0.98. High bigram (two-word sequence) overlap, suggesting the model reproduces adjacent word pairs well.

ROUGE-L: 0.98. This metric is based on the longest common subsequence of words, and the score suggests that the predictions preserve the word order of the references.

#### Accuracy: 92.71%

This indicates that 89 of the 96 predicted sentences (nearly 93%) exactly match the preprocessed reference sentences, which is a strong performance given that a single differing character counts as a miss.
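
The reported percentage can be turned back into a count of correct sentences with a few lines of arithmetic. The sketch assumes the test set has 96 rows (the number of data lines in the CSV string above) and mirrors the `:.2f` formatting used when the accuracy was printed.

```python
n_sentences = 96       # assumed: one row per data line in the test CSV
reported = "92.71"     # accuracy as printed by the evaluation cell

# Find every count of exact matches that would print as the reported percentage
candidates = [
    k for k in range(n_sentences + 1)
    if f"{(k / n_sentences) * 100:.2f}" == reported
]

print(candidates)                    # counts of exact matches consistent with the report
print(n_sentences - candidates[0])   # number of sentences that did not match
```

Under that assumption, the only consistent count is 89 exact matches, leaving 7 sentences to inspect by hand.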

#### SacreBLEU Score: 96.94

This score is on a 0 to 100 scale and agrees closely with the NLTK BLEU score of 0.96. It reinforces that the model's predictions are very close to the reference translations when a standardized tokenizer is used.
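
One reason the two BLEU figures are not identical is tokenization: `compute_bleu` splits on whitespace, while SacreBLEU applies its own tokenizer that separates punctuation from words. The sketch below imitates that difference with a simple regular expression. It is a rough stand-in chosen for illustration and not SacreBLEU's actual tokenizer.

```python
import re

def whitespace_tokens(text):
    """Tokenize by splitting on whitespace, as compute_bleu does."""
    return text.split()

def punctuation_aware_tokens(text):
    """Illustrative tokenizer: words and single punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

reference = "The check bounced."
prediction = "The check bounced"  # same words, final period missing

ref_ws = whitespace_tokens(reference)
ref_pa = punctuation_aware_tokens(reference)

# Predicted tokens that also occur in the reference under each tokenization
ws_shared = [t for t in whitespace_tokens(prediction) if t in ref_ws]
pa_shared = [t for t in punctuation_aware_tokens(prediction) if t in ref_pa]

print(ref_ws, ws_shared)
print(ref_pa, pa_shared)
```

With whitespace splitting, "bounced" and "bounced." are different tokens, so a missing period costs a whole word match. With punctuation split off, only the period itself is missed.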

#### Conclusion:
These evaluation results indicate that the model performs very well at converting UK English to US English on this 96-sentence test set, both in terms of lexical accuracy (correct word choices) and structural accuracy (sentence flow). The close agreement between the NLTK BLEU score and the SacreBLEU score shows that the result holds under a standardized evaluation framework as well.