# **Fine-Tuning T5 for UK-to-US Dialect Conversion**

## 1. Introduction
The goal of this project is to develop a machine learning model capable of converting UK English sentences into their US English equivalents. Although rule-based systems can handle simple transformations like spelling changes, they lack the ability to comprehend context. For a more robust solution, we use the EnglishVoice/t5-base-uk-to-us-english model, a pre-trained T5 (Text-to-Text Transfer Transformer) variant, which is specifically fine-tuned for UK-to-US English conversion tasks. This model is able to handle both lexical and contextual transformations, making it more effective in dialect conversion.

In [None]:
!pip install datasets
# !pip install transformers

In [2]:
# Import necessary libraries and modules
from io import StringIO  # For handling string data as a file
import re  # For regular expression based text cleaning
import pandas as pd  # For data manipulation
import torch  # PyTorch library
from sklearn.model_selection import train_test_split  # For splitting the dataset
from datasets import Dataset  # Hugging Face's Datasets library
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments  # Transformers library for text generation


# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


These imports bring in the required modules for data manipulation, model training, and text processing.
StringIO is used to treat the raw string data as if it were a file, allowing it to be read into a pandas DataFrame.
train_test_split will be used to split the dataset into training and test sets.

This block checks whether a GPU is available for model training. If a GPU is found, it uses the "cuda" device; otherwise, it defaults to "cpu".
torch.cuda.is_available() is used to detect if CUDA (GPU acceleration) is accessible.

In [3]:
data = """input_text,target_text
"I CoLoUr 🎨 the centre of my favourite book.","I color the center of my favorite book."
"He is travelling ✈️ to the THEATRE.","He is traveling to the theater."
"I have a flat near the lift.","I have an apartment near the elevator."
"I have a flat near the lift. ","I have an apartment near the elevator."
"The PROGRAMME 🗓️ will start at 6 O'CLOCK.","The program will start at 6 o'clock."
"HE has a cheque 💳 for payment.","He has a check for payment."
"She wears jewellery 💎 on occasions...","She wears jewelry on occasions."
" THEY are Practising   ⚽ for the football MATCH.","They are practicing for the soccer game."
"He is using a spanner for the repair.","He is using a wrench for the repair."
"The aeroplane ✈️ landed on time.","The airplane landed on time."
"hello... 😃 how are you?","hello... 😃 how are you?"
"She bought some colour pencils.","She bought some color pencils."
"I am going to the lift.","I am going to the elevator."
"His behaviour 🤔 is unacceptable.","His behavior is unacceptable."
"The cheque 💳 arrived late 😢.","The check arrived late."
"Do you know where the lift is?","Do you know where the elevator is?"
"The labor union is organizing a programme 🗓️.","The labour union is organizing a program."
"He enjoys playing football ⚽.","He enjoys playing soccer."
"I love visiting the theatre.","I love visiting the theater."
"Their practise sessions are improving.","Their practice sessions are improving."
"He likes the colour red.","He likes the color red."
"The cheque has been approved.","The check has been approved."
"The aeroplane ✈️ was delayed.","The airplane was delayed."
"Their neighbourhood is beautiful.","Their neighborhood is beautiful."
"They've cancelled the programme.","They've canceled the program."
"She practises yoga regularly.","She practices yoga regularly."
"The cheque has not arrived yet.","The check has not arrived yet."
"He is organizing a theatre play.","He is organizing a theater play."
"I prefer the lift to the stairs.","I prefer the elevator to the stairs."
"His behaviour has been exemplary.","His behavior has been exemplary."
"Is the cheque ready for collection?","Is the check ready for collection?"
"Please colour 🎨 this drawing.","Please color this drawing."
"The aeroplane ✈️ has landed safely.","The airplane has landed safely."
"They're still practising football ⚽.","They're still practicing soccer."
"Her jewellery collection is stunning.","Her jewelry collection is stunning."
"What's the programme for tomorrow?","What's the program for tomorrow?"
"Their labour union is powerful.","Their labor union is powerful."
"They enjoy going to the theatre.","They enjoy going to the theater."
"Her favourite dish is lasagna.","Her favorite dish is lasagna."
"I need to go to the flat.","I need to go to the apartment."
"The cheque is invalid.","The check is invalid."
"The aeroplane ✈️ is ready for boarding.","The airplane is ready for boarding."
"He prefers the colour blue.","He prefers the color blue."
"The theatre play was amazing.","The theater play was amazing."
"The programme 🗓️ starts at 10 AM.","The program starts at 10 AM."
"Their neighbourhood is very welcoming.","Their neighborhood is very welcoming."
"Please practise before the event.","Please practice before the event."
"Her jewellery is antique.","Her jewelry is antique."
"The cheque 💳 bounced.","The check bounced."
"She wears jewellery every day.","She wears jewelry every day."
"He works in the theatre.","He works in the theater."
"Her behaviour 🤔 is strange lately.","Her behavior is strange lately."
"The cheque is in processing.","The check is in processing."
"They are rehearsing for the programme.","They are rehearsing for the program."
"The aeroplane ✈️ is landing shortly.","The airplane is landing shortly."
"Her favourite sport is football ⚽.","Her favorite sport is soccer."
"The cheque will be sent tomorrow.","The check will be sent tomorrow."
"The aeroplane has been delayed again.","The airplane has been delayed again."
"They prefer the colour green.","They prefer the color green."
"She is visiting the theatre tomorrow.","She is visiting the theater tomorrow."
"The programme is about to begin.","The program is about to begin."
"The cheque 💳 is ready for pickup.","The check is ready for pickup."
"Her favourite pastime is painting.","Her favorite pastime is painting."
"His favourite sport is rugby.","His favorite sport is rugby."
"The aeroplane ✈️ is taking off.","The airplane is taking off."
"She practises football daily.","She practices soccer daily."
"The cheque is overdue.","The check is overdue."
"Her behaviour has been concerning.","Her behavior has been concerning."
"The cheque is being reissued.","The check is being reissued."
"The theatre group is performing tonight.","The theater group is performing tonight."
"They are enjoying the programme.","They are enjoying the program."
"Their jewellery is made of gold.","Their jewelry is made of gold."
"The cheque has been misplaced.","The check has been misplaced."
"Her favourite flower is a rose.","Her favorite flower is a rose."
"He is practicing football ⚽ right now.","He is practicing soccer right now."
"Her jewellery box is full.","Her jewelry box is full."
"The cheque 💳 has been canceled.","The check has been canceled."
"The aeroplane ✈️ was on time.","The airplane was on time."
"He loves the colour yellow.","He loves the color yellow."
"She is practising for the marathon.","She is practicing for the marathon."
"The programme 🗓️ was postponed.","The program was postponed."
"The aeroplane ✈️ has already taken off.","The airplane has already taken off."
"The cheque will be delivered tomorrow.","The check will be delivered tomorrow."
"They enjoy watching theatre performances.","They enjoy watching theater performances."
"She painted the colour blue on the wall.","She painted the color blue on the wall."
"He is participating in the programme.","He is participating in the program."
"The aeroplane ✈️ was delayed again.","The airplane was delayed again."
"The cheque 💳 is ready for withdrawal.","The check is ready for withdrawal."
"She has a collection of beautiful jewellery 💎.","She has a collection of beautiful jewelry."
"The cheque is still pending.","The check is still pending."
"The aeroplane ✈️ will arrive shortly.","The airplane will arrive shortly."
"The theatre's performance was breathtaking.","The theater's performance was breathtaking."
"Her behaviour has been commendable.","Her behavior has been commendable."
"The cheque was never received.","The check was never received."
"The aeroplane ✈️ took off on time.","The airplane took off on time."
"She wears jewellery for special occasions.","She wears jewelry for special occasions."
"""

This block defines the raw data in CSV format and loads it into a pandas DataFrame. The input_text column contains UK English sentences, and the target_text column contains the corresponding US English sentences.

In [4]:

df = pd.read_csv(StringIO(data))

# Preprocessing function for text
def preprocess_text(text):
    text = text.strip().lower()  # Remove extra spaces and convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s\.,!?]', '', text)  # Remove unwanted characters (emoji, special symbols, etc.)
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = re.sub(r'\s+([.,!?;])', r'\1', text)  # Fix spaces before punctuation
    text = text.capitalize()  # Capitalize the first letter
    return text

# Apply preprocessing to the input and target text columns
df["input_text"] = df["input_text"].apply(preprocess_text)
df["target_text"] = df["target_text"].apply(preprocess_text)




This function cleans up the text by:
- Stripping leading/trailing spaces and converting the text to lowercase.
- Removing non-alphanumeric characters (except spaces and punctuation).
- Ensuring punctuation marks are correctly spaced.
- Capitalizing the first letter of the sentence.

The function is applied to both input_text and target_text columns.


In [5]:
df

Unnamed: 0,input_text,target_text
0,I colour the centre of my favourite book.,I color the center of my favorite book.
1,He is travelling to the theatre.,He is traveling to the theater.
2,I have a flat near the lift.,I have an apartment near the elevator.
3,I have a flat near the lift.,I have an apartment near the elevator.
4,The programme will start at 6 oclock.,The program will start at 6 oclock.
...,...,...
91,The theatres performance was breathtaking.,The theaters performance was breathtaking.
92,Her behaviour has been commendable.,Her behavior has been commendable.
93,The cheque was never received.,The check was never received.
94,The aeroplane took off on time.,The airplane took off on time.


In [6]:
# Split into Train (80%) and Test (20%)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Train size: {len(train_df)}, Test size: {len(test_df)}")

Train size: 76, Test size: 20


Here, the dataset is split into training and testing subsets using an 80-20 split. The random_state=42 ensures reproducibility.
The size of the resulting training and test sets is printed out for verification.

## 2. Methodology

### 2.1 Dataset Preparation

The dataset consists of pairs of UK and US English sentences, demonstrating the various linguistic differences between the two dialects (spelling, word choice, and capitalization). The dataset was split into 80% for training and 20% for testing.

#### Example Data:

| Input Text (UK English)                           | Target Text (US English)                          |
|---------------------------------------------------|---------------------------------------------------|
| I CoLoUr 🎨 the centre of my favourite book.       | I color the center of my favorite book.           |
| He is travelling ✈️ to the THEATRE.               | He is traveling to the theater.                   |
| I have a flat near the lift.                      | I have an apartment near the elevator.            |

### 2.2 Tokenization

To prepare the data for input into the model, we used the `T5Tokenizer`. The text was preprocessed and tokenized, with a special prefix `translate UK to US:` added to the input sentences to indicate the task.

```python
inputs = ["translate UK to US: " + text for text in examples["input_text"]]
```

### 2.3 Model Selection
We selected EnglishVoice/t5-base-uk-to-us-english, a pre-trained T5 model specifically fine-tuned for UK-to-US conversion, from Hugging Face's model hub. This model leverages a sequence-to-sequence architecture that is ideal for translation tasks.

#### Why T5?
- Pre-trained for UK-to-US conversion: This model is specifically fine-tuned to handle UK-to-US English transformations.
- Text-to-Text Framework: T5 is designed to handle text generation tasks by treating all NLP problems as text-to-text transformations, making it suitable for tasks like translation and dialect conversion.
- Pre-trained on Large Corpus: The model has been trained on a large dataset, allowing it to capture linguistic nuances and contextual understanding.

### 2.4 Training Strategy
We used the following training setup:
- Loss Function: Cross-entropy loss, which is commonly used in sequence-to-sequence models for text generation.
- Batch Size: 4 or 5 (chosen based on GPU memory constraints).
- Epochs: 5 (chosen experimentally to ensure convergence without overfitting).
- Evaluation Strategy: We evaluated the model after each epoch to monitor its progress.

#### Training Arguments:
```python
training_args = TrainingArguments(
    output_dir="./results",  # Directory to save model checkpoints
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch
    per_device_train_batch_size=4,  # Batch size for training
    per_device_eval_batch_size=4,  # Batch size for evaluation
    num_train_epochs=5,  # Number of epochs
    save_steps=100,  # Save checkpoint every 100 steps
    save_total_limit=2,  # Keep only the last 2 checkpoints
    logging_dir="./logs"  # Directory to save logs
)
```

## 3. Architectural Choices
### 3.1 Why EnglishVoice/t5-base-uk-to-us-english?
The EnglishVoice/t5-base-uk-to-us-english model was selected because:

- It is pre-trained specifically for UK-to-US conversion.
- Its text-to-text framework is ideal for tasks like dialect conversion.
- The model’s pre-training on a large corpus allows it to understand linguistic nuances such as spelling, punctuation, and word usage differences between UK and US English.

### 3.2 Alternative Approaches Considered

Approach | Pros| Cons
--- | --- | ---
Rule-Based Approach | Simple and fast	Limited to predefined transformations | lacks context awareness
Seq2Seq Models (RNN-based) | Captures context well | Computationally expensive, long training time
Transformer-based Models (BART/T5) | Scalable, handles context effectively | Requires task-specific fine-tuning

In [7]:
# Load the pre-trained tokenizer and model
model_name = "EnglishVoice/t5-base-uk-to-us-english"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)

# Function to preprocess data for model input
def preprocess_function(examples):
    inputs = ["UK to US: " + text for text in examples["input_text"]] # Prefix for input text
    targets = examples["target_text"]
    model_inputs = tokenizer(inputs, max_length=500, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=500, truncation=True, padding="max_length")

    model_inputs["labels"] = labels["input_ids"] # Set the target as labels for model training
    return model_inputs

# Convert DataFrames to Datasets
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Tokenize datasets
train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)

# Remove original text columns
train_dataset = train_dataset.remove_columns(["input_text", "target_text"])
test_dataset = test_dataset.remove_columns(["input_text", "target_text"])

# Convert to PyTorch format
train_dataset.set_format("torch")
test_dataset.set_format("torch")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

Map:   0%|          | 0/76 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]

Here, the T5 base model for UK to US english conversion is loaded from the Hugging Face model hub. The model is to be fine-tuned for UK-to-US English translation based on our dataset.
The model and tokenizer are moved to the GPU (if available) for faster processing.


This preprocessing function prepares the data for training by adding a prefix ("UK to US: ") to the input sentences, which helps the model understand the task.
The inputs and targets are tokenized, truncated, and padded to a maximum length of 128 tokens.
The target labels are also tokenized and added to the model inputs.


The original text columns (input_text and target_text) are removed from the dataset since they are no longer needed after tokenization.
The datasets are then converted to PyTorch format (torch), which is the format expected by the Trainer.

In [8]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    per_device_train_batch_size=5,
    per_device_eval_batch_size=5,
    num_train_epochs=5,
    save_steps=100,
    save_total_limit=2,
    logging_dir="./logs"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)
# Train the model
trainer.train()

# API key for wandb = 89c40aa5196a528799371a485a706fb4e1d82aa4



<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33msiddhanthbiswas[0m ([33msiddhanthbiswas-[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch,Training Loss,Validation Loss
1,No log,22.303503
2,No log,17.596125
3,No log,14.390613
4,No log,12.473543
5,No log,11.806203


TrainOutput(global_step=80, training_loss=16.143385314941405, metrics={'train_runtime': 352.8858, 'train_samples_per_second': 1.077, 'train_steps_per_second': 0.227, 'total_flos': 225980467200000.0, 'train_loss': 16.143385314941405, 'epoch': 5.0})

TrainingArguments: Specifies the parameters for training the model, such as the batch size, number of epochs, logging, and saving intervals.
Trainer: Handles the training loop. It takes in the model, training arguments, and datasets.

In [9]:
# Save the fine-tuned model and tokenizer
model.save_pretrained("fine_tuned_t5_uktous")
tokenizer.save_pretrained("fine_tuned_t5_uktous")

# Function for Inference
def translate_uk_to_us(text):
    input_text = "UK to US: " + text
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test the model with a sample sentence
test_sentence = "I CoLoUr 🎨 the centre of my favourite book."
test_sentence = preprocess_text(test_sentence)
print(translate_uk_to_us(test_sentence))


I color the center of my favorite book.


After training, the model and tokenizer are saved to disk so that they can be re-used for inference later.

This function is used for inference. It takes a UK English sentence, tokenizes it, passes it through the model, and returns the translated US English text.

This is an example of testing the trained model by passing a UK English sentence and printing the translated US English output.

## 4. Results
The fine-tuned model was tested on unseen data to evaluate its performance.

Example Inference:
Input:
```"I CoLoUr 🎨 the centre of my favourite book."```

Predicted Output:
```"I color the center of my favorite book."```

The model correctly converted UK spelling, punctuation, and even handled emojis within the text.

In [10]:
# Evaluate on test data
test_df["input_text"] = test_df["input_text"].apply(preprocess_text)
for text in test_df["input_text"]:
    print(f"UK: {text}")
    text = "UK to US: " + text
    print(f"US: {translate_uk_to_us(text)}\n")

UK: The programme was postponed.
US: The program was postponed.

UK: The aeroplane was on time.
US: The airplane was on time.

UK: Her favourite flower is a rose.
US: Her favorite flower is a rose.

UK: The aeroplane took off on time.
US: The airplane took off on time.

UK: Theyre still practising football.
US: They are still practicing soccer.

UK: She is practising for the marathon.
US: She is practicing for the marathon.

UK: The theatre group is performing tonight.
US: The theater group is performing tonight.

UK: He prefers the colour blue.
US: He prefers the color blue.

UK: I colour the centre of my favourite book.
US: I color the center of my favorite book.

UK: Hello... how are you?
US: Hello... how are you?

UK: The aeroplane is taking off.
US: The airplane is taking off.

UK: Is the cheque ready for collection?
US: Is the check ready for collection?

UK: I love visiting the theatre.
US: I love visiting the theater.

UK: The programme will start at 6 oclock.
US: The program w

This evaluates the model by running it on the entire test dataset and printing the UK-to-US translations for each sentence.

In [12]:


# Load the model and tokenizer for future use
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("fine_tuned_t5_uktous").to(device)
tokenizer = T5Tokenizer.from_pretrained("fine_tuned_t5_uktous")

# Evaluate the model on the test dataset
# test_sentence = "I CoLoUr 🎨 the centre of my favourite book."
test_sentence = "He is travelling to the theatre."
test_sentence = "UK to US: " + preprocess_text(test_sentence)
print(test_sentence)
print("Translated: ",translate_uk_to_us(test_sentence))



UK to US: He is travelling to the theatre.
Translated:  He is traveling to the theater.


## 5. Conclusion and Future Work
#### Conclusion:
This project demonstrates that fine-tuning EnglishVoice/t5-base-uk-to-us-english can effectively perform UK-to-US English dialect conversion, handling both lexical and contextual transformations. The model performs well with diverse vocabulary, including text with emojis.

#### Future Work:
- Training on a larger dataset: To improve model generalization and performance on diverse text inputs.
- Further Fine-tuning: Extended fine-tuning could improve accuracy.
- API Deployment: The model could be deployed as an API for real-time UK-to-US text conversion.

#### Suggestions for Improvement
- Model Evaluation Metrics: To formally evaluate the model, can use metrics like BLEU, ROUGE, or Accuracy to quantitatively assess performance. For example, comparing the output to ground-truth translations and calculating the BLEU score will give a more precise measure of performance.
- Emoji Handling: Emojis were removed during preprocessing, but in some cases, they might be relevant for context. May want to treat them differently (e.g., keeping them as is or adding special tokens for emojis).
- Logging and Checkpoints: Consider adding more logging to the training process (such as through wandb or TensorBoard). Could also save intermediate checkpoints during training so you can pick up training without starting over.
- Data Augmentation: dataset is relatively small (though sufficient for demonstration). If possible, use data augmentation techniques to artificially increase the size of your dataset, especially for cases where dialect conversion might require more diverse training data.
- Batch Size: The batch size of 5 is quite small. If the hardware supports it, you could try increasing the batch size for faster training and better gradient estimates.

This demonstrates how to load the fine-tuned model and tokenizer for inference after training.
Another test sentence is processed and translated.