If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers, 🤗 Datasets, and the evaluation libraries used below. Run the following cell.

In [None]:
 !pip install transformers datasets evaluate sacrebleu torchtext

In [None]:
from tqdm.auto import tqdm

## Q1: Dataset Preparation (5 points)

In [None]:
from datasets import load_dataset

We use the ```load_dataset()``` function to download the dataset. Replace the dummy arguments to download the wmt14 dataset for fr-en translation as provided here: https://huggingface.co/datasets/wmt/wmt14

In [None]:
dataset = load_dataset("wmt14", "fr-en", split='train[:15000]')
dataset

In [None]:
print(dataset)

Now, we split the dataset into training and testing splits. This is done using the ```train_test_split``` function. Replace the dummy arguments with appropriate parameters.

In [None]:
split_datasets = dataset.train_test_split(train_size=0.8, seed=2025)
split_datasets


Define the test dataset as follows:

In [None]:
test_dataset = split_datasets["test"]
test_dataset

Now, follow the same process to split the train dataset into training and validation splits.

In [None]:
# load the validation set
split_to_val = load_dataset("wmt14", "fr-en", split='validation')
# further split the train set into 0.8 train and 0.2 evaluation datasets
train_eval_split = split_datasets["train"].train_test_split(train_size=0.8, seed=2025)
train_dataset = train_eval_split["train"]
eval_dataset = train_eval_split["test"]

In [None]:
# test code
train_eval_split
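
As a rough sanity check on the two splits above, the expected sizes can be worked out by hand. The sketch below is plain integer arithmetic, not the library's implementation; `split_sizes` is a hypothetical helper, and the library's own rounding may differ by one example when the sizes do not divide evenly.

```python
def split_sizes(n_examples, train_percent):
    """Return (train, held_out) sizes for a percentage split (simplified arithmetic)."""
    n_train = n_examples * train_percent // 100
    return n_train, n_examples - n_train

# 15,000 examples -> 80% train+validation pool, 20% test
pool_size, test_size = split_sizes(15000, 80)
# the pool is split again -> 80% train, 20% validation
train_size, eval_size = split_sizes(pool_size, 80)
print(pool_size, test_size, train_size, eval_size)
```

If `len(train_dataset)`, `len(eval_dataset)` and `len(test_dataset)` differ noticeably from these numbers, the split arguments are worth rechecking.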

## Q2 Prepare for training RNNs (10)
In this part, you are required to define the tokenizers for English and French, tokenize the data, and define the dataloaders.

Choose and initialize the tokenizer

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased') # CHOOSE AN APPROPRIATE MULTILINGUAL MODEL such as https://huggingface.co/google-bert/bert-base-multilingual-cased

You will need to create a PyTorch dataset to process the tokens in the required format. Complete the implementation of the dataset.

In [None]:
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    def __init__(self, dataset, input_size, output_size):
        source_texts = [text["translation"]['fr'] for text in dataset]
        target_texts = [text["translation"]['en'] for text in dataset]
        self.source_sentences = tokenizer(source_texts, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.target_sentences = tokenizer(target_texts, padding='max_length', truncation=True, return_tensors="pt")["input_ids"]
        self.input_size = input_size
        self.output_size = output_size

    def __len__(self):
        return len(self.source_sentences)

    def __getitem__(self, idx):
        return self.source_sentences[idx], self.target_sentences[idx]
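
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to one sequence of token IDs. `pad_or_truncate`, the pad ID of 0, and the toy IDs are assumptions for illustration; the real tokenizer also handles special tokens during truncation and returns an attention mask.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, or right-pad it with pad_id up to max_length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

# a short sequence is padded, a long one is cut; both end up with the same length
short = pad_or_truncate([101, 7, 8, 102], max_length=6)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 6, 102], max_length=6)
print(short)
print(long)
```

Because every sequence ends up with the same length, the tokenized sentences can be stacked into a single tensor, which is what lets `__getitem__` index them directly.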

Initialize the datasets

In [None]:
# use the tokenizer's vocabulary size as both the input and output size
vocab_size = tokenizer.vocab_size
train_dataset_rnn = TranslationDataset(train_dataset, vocab_size, vocab_size)
eval_dataset_rnn = TranslationDataset(eval_dataset, vocab_size, vocab_size)
test_dataset_rnn = TranslationDataset(test_dataset, vocab_size, vocab_size)

In [None]:
# test code
len(test_dataset_rnn)

Get the vocab size from the tokenizer

In [None]:
vocab_size = tokenizer.vocab_size # This size is used somewhere in the model, think.

Initialize and define the dataloaders

In [None]:
#Instantiate the DataLoaders
from torch.utils.data import DataLoader
# batch sizes are usually powers of 2 for efficient memory usage
BATCH_SIZE = 8
train_dataloader = DataLoader(train_dataset_rnn, batch_size=BATCH_SIZE, shuffle=True)
eval_dataloader = DataLoader(eval_dataset_rnn, batch_size=BATCH_SIZE)
test_dataloader = DataLoader(test_dataset_rnn, batch_size=BATCH_SIZE)

In [None]:
# test code
len(train_dataloader)
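
The length of a dataloader is the number of batches it yields per epoch. A small sketch of that arithmetic follows; `num_batches` is a hypothetical helper, and the figure of 9,600 training examples assumes the split sizes implied by the two 80/20 splits above.

```python
import math

def num_batches(n_examples, batch_size, drop_last=False):
    """Number of batches per epoch; the last batch may be smaller unless dropped."""
    if drop_last:
        return n_examples // batch_size
    return math.ceil(n_examples / batch_size)

# assumed: 9,600 training examples with a batch size of 8
print(num_batches(9600, 8))
# an uneven case: 10 examples in batches of 4 gives 4 + 4 + 2
print(num_batches(10, 4))
```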

## Q3: Implementing RNNs (10)
Define the RNN model as an encoder-decoder RNN for the task of translation in the cell below. You may refer to: https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

In [None]:
import torch
import torch.nn as nn

In [None]:
class Seq2SeqRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_p=0.5):
        super(Seq2SeqRNN, self).__init__()
        # YOUR CODE HERE
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, hidden_size) # embedding layer
        self.encoder = nn.RNN(hidden_size, hidden_size, batch_first=True) # encoder
        self.dropout = nn.Dropout(p=dropout_p) # drop out layer following tutorial
        self.decoder = nn.RNN(hidden_size, hidden_size, batch_first=True) # decoder
        self.out = nn.Linear(hidden_size, output_size) #output layer

    def forward(self, x):
        # YOUR CODE HERE
        embedded = self.dropout(self.embedding(x))
        encoder_outputs, hidden = self.encoder(embedded)
        decoder_outputs, _ = self.decoder(encoder_outputs, hidden)
        output = self.out(decoder_outputs)
        return output

In [None]:
input_size = tokenizer.vocab_size
output_size = tokenizer.vocab_size
model = Seq2SeqRNN(input_size = input_size, hidden_size= 128, output_size = output_size)
model
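
To get a feel for where the parameters of this model sit, the count can be estimated by hand. The sketch below assumes the layer layout defined above (one embedding, two single-layer unidirectional RNNs with biases, one linear output layer); the helper names and the vocabulary size of 1,000 are illustrative assumptions, not values from the notebook.

```python
def rnn_params(input_dim, hidden_dim):
    """Input-to-hidden weights, hidden-to-hidden weights, and two bias vectors."""
    return hidden_dim * input_dim + hidden_dim * hidden_dim + 2 * hidden_dim

def seq2seq_params(vocab_in, hidden, vocab_out):
    """Parameter count per layer for the embedding/encoder/decoder/output layout."""
    counts = {
        "embedding": vocab_in * hidden,
        "encoder": rnn_params(hidden, hidden),
        "decoder": rnn_params(hidden, hidden),
        "out": hidden * vocab_out + vocab_out,  # weights plus bias
    }
    counts["total"] = sum(counts.values())
    return counts

# assumed toy vocabulary of 1,000 tokens with hidden size 128
counts = seq2seq_params(vocab_in=1000, hidden=128, vocab_out=1000)
print(counts)
```

With a large multilingual vocabulary, the embedding and output layers dominate the total, while the two recurrent layers stay small.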

## Q4: Training RNNs (15)
In this question, you will define the hyperparameters, loss and optimizer for training. You will then implement a custom training loop.

In [None]:
if torch.cuda.is_available():
    model = model.cuda()
torch.cuda.is_available()

Define the optimizer and the loss function

In [None]:
#from torch.optim import IMPORT_OPTIMIZER
import torch.optim as optim
#from torch.nn import IMPORT_LOSS_FUNCTION

num_train_epochs = 3 # define epochs for training
num_training_steps = num_train_epochs * len(train_dataloader)
criterion = nn.CrossEntropyLoss(ignore_index=0)# YOUR LOSS FUNCTION
optimizer = optim.Adam(model.parameters(), lr=0.01) # YOUR OPTIMIZER HERE
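
The `ignore_index=0` argument tells the loss to skip positions whose target is the padding ID. A pure-Python sketch of that behaviour is shown below; `cross_entropy_ignore` is a hypothetical helper written for illustration, it omits the numerical stabilisation a real implementation would use, and the logits are toy values.

```python
import math

def cross_entropy_ignore(logits, targets, ignore_index=0):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    total, counted = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]
        counted += 1
    return total / counted if counted else float("nan")

# two real positions with uniform logits, one padded position (target 0)
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0]]
targets = [1, 1, 0]
loss = cross_entropy_ignore(logits, targets)
print(loss)
```

Without the ignore index, the many padded positions in each sequence would dominate the average and the model would mostly be rewarded for predicting padding.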

Write the training loop

In [None]:
from tqdm import tqdm
progress_bar = tqdm(total=num_training_steps, desc="Training Progress")
# use whichever device the model was moved to above (GPU if available, else CPU)
device = next(model.parameters()).device

for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    total_loss = 0
    for batch_src, batch_tgt in train_dataloader:
        optimizer.zero_grad()  # reset gradients from the previous step
        batch_src = batch_src.to(device)
        batch_tgt = batch_tgt.to(device)
        output = model(batch_src)  # logits of shape (batch, seq_len, vocab_size)
        # flatten so that every token position is one classification example
        loss = criterion(output.view(-1, output_size), batch_tgt.view(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        progress_bar.update(1)

    # Evaluation Phase
    model.eval()
    total_eval_loss = 0
    total_batches = 0
    with torch.no_grad():
        for batch_src, batch_tgt in eval_dataloader:
            batch_src = batch_src.to(device)
            batch_tgt = batch_tgt.to(device)
            output = model(batch_src)
            loss = criterion(output.view(-1, output_size), batch_tgt.view(-1))
            total_eval_loss += loss.item()
            total_batches += 1

    avg_loss = total_eval_loss / total_batches if total_batches > 0 else float("inf")
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")

## Q5: Evaluating RNNs for Machine Translation (5)

Implement the calculation of BLEU-1,2,3,4 scores using the ```sacrebleu``` library for the test dataset.

In [None]:
import sacrebleu

In [None]:
model.eval()
device = next(model.parameters()).device  # same device the model lives on
bleu1, bleu2, bleu3, bleu4 = [], [], [], []
# Complete the testing loop
for batch_src, batch_tgt in tqdm(test_dataloader):
    batch_src = batch_src.to(device)
    batch_tgt = batch_tgt.to(device)
    with torch.no_grad():
        output = model(batch_src)
    pred_tokens = torch.argmax(output, dim=-1).tolist()
    ref_tokens = batch_tgt.tolist()

    # sacrebleu expects a list of reference streams, each aligned with the hypotheses
    reference = [[" ".join(map(str, ref)) for ref in ref_tokens]]
    hypothese = [" ".join(map(str, hyp)) for hyp in pred_tokens]

    # score the batch once and reuse its n-gram precisions
    score = sacrebleu.corpus_bleu(hypothese, reference)
    bleu1.append(score.precisions[0])  # 1-gram precision
    bleu2.append(score.precisions[1])  # 2-gram precision
    bleu3.append(score.precisions[2])  # 3-gram precision
    bleu4.append(score.precisions[3])  # 4-gram precision

print("BLEU-1: ", sum(bleu1) / len(bleu1))
print("BLEU-2: ", sum(bleu2) / len(bleu2))
print("BLEU-3: ", sum(bleu3) / len(bleu3))
print("BLEU-4: ", sum(bleu4) / len(bleu4))



In [None]:
sacrebleu.corpus_bleu(hypothese, reference)
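
The per-order numbers reported above are n-gram precisions. As a rough illustration of what they measure, here is a minimal pure-Python sketch of clipped (modified) n-gram precision for a single sentence pair. `ngrams`, `modified_precision` and the example sentences are assumptions for illustration; sacrebleu additionally applies its own tokenization, smoothing and a brevity penalty for the final BLEU score, and reports precisions on a 0-100 scale.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
precisions = [modified_precision(hyp, ref, n) for n in (1, 2, 3, 4)]
print(precisions)
```

Precision drops quickly as n grows, since a single wrong word breaks every longer n-gram that contains it.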

Congratulations! You can now work with RNNs for the task of Machine Translation!

## Q6: Prepare for training transformers (10)

In this part we cover the initial setup required before training a transformer, including data preprocessing and setting up data collators and data loaders.

Ensure you have loaded the dataset!

In [None]:
print(dataset)

We will begin by tokenizing the data. Based on your model selection, load the appropriate tokenizer. We are using models from AutoModelForSeq2SeqLM in this assignment. You can check out all the available models here: https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoModelForSeq2SeqLM

In [None]:
from transformers import AutoTokenizer

checkpoint = "google-t5/t5-base"  # select a seq2seq model of your choice (T5-base is used here)
#checkpoint ='bert-base-multilingual-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

We will need to tokenize both our inputs and outputs. Thus we use the preprocess_function() function to generate tokenized model inputs and targets. Ensure you use truncation and padding! The max length will be 128.

In [None]:
##Implement the preprocess function
def preprocess_function(examples):
    inputs = [example["fr"] for example in examples["translation"]]
    targets = [example["en"] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, padding = "max_length", truncation=True, max_length=128, return_tensors="pt")  # tokenize the source sentences to produce the model inputs
    labels = tokenizer(targets, padding = "max_length", truncation=True, max_length=128, return_tensors="pt")
    # add tokenized target sentence to model inputs
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
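
When a function is mapped with `batched=True`, it receives a dictionary of columns, where each column is a list with one entry per example. The sketch below mimics that shape with a toy batch to show what the two list comprehensions extract; the sentences are made up for illustration and no tokenizer is involved.

```python
# a toy batch in the shape passed to a batched map function
examples = {
    "translation": [
        {"fr": "Bonjour le monde.", "en": "Hello world."},
        {"fr": "Merci beaucoup.", "en": "Thank you very much."},
    ]
}

# one source string and one target string per example, in matching order
inputs = [pair["fr"] for pair in examples["translation"]]
targets = [pair["en"] for pair in examples["translation"]]
print(inputs)
print(targets)
```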

In [None]:
tokenized_train_data = train_dataset.map(preprocess_function, batched=True)

In [None]:
# changed original code to eval_dataset, because val_dataset does not exist
tokenized_eval_data = eval_dataset.map(preprocess_function, batched=True)


We remove the 'translation' column because it is not required for training. Columns other than those created by preprocess_function can also cause errors during training, because the model may not know which inputs to use.

In [None]:
tokenized_train_data = tokenized_train_data.remove_columns(train_dataset.column_names)
tokenized_eval_data = tokenized_eval_data.remove_columns(eval_dataset.column_names)

In [None]:
tokenized_train_data.set_format("torch")
tokenized_eval_data.set_format("torch")

To construct batches of training data for model training, we require collators that set the properties for the batches and data loaders that generate the batches.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer) #INSTANTIATE THE COLLATOR

In [None]:
#Instantiate the DataLoader for training and evaluation data

from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_train_data, batch_size=2, shuffle=True)
eval_dataloader = DataLoader(tokenized_eval_data, batch_size=2)
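
To illustrate what a collator does conceptually, here is a simplified pure-Python sketch that pads a batch of variable-length features to a common length. It is not the `DataCollatorForSeq2Seq` implementation; `collate_with_padding`, the pad ID of 0 and the toy token IDs are assumptions. Labels are padded with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
def collate_with_padding(features, pad_id=0, label_pad_id=-100):
    """Pad inputs and labels to the longest sequence in the batch."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_in - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        batch["labels"].append(f["labels"] + [label_pad_id] * (max_lab - len(f["labels"])))
    return batch

features = [
    {"input_ids": [5, 6, 7], "labels": [8, 9]},
    {"input_ids": [5], "labels": [8, 9, 10]},
]
batch = collate_with_padding(features)
print(batch)
```

In this notebook every example is already padded to 128 tokens during preprocessing, so the batches have a fixed shape either way; padding per batch mainly saves computation when sequences are short.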

## Q7) Choosing & Loading the Model (5)

Choose a pre-trained transformer model that you will use for fine-tuning on the translation dataset

In [None]:
from transformers import AutoModelForSeq2SeqLM
checkpoint = "google-t5/t5-base"
#checkpoint = 'bert-base-multilingual-cased'
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

## Q8) Training the Transformer Model

Now that we have our data tokenized and batched and the model chosen, we can begin training. To do so, we must set up the right hyperparameters and then implement the training loop to train our model!

For training, we require an optimizer and a scheduler to manage the learning rate during training. Let's set them up before our training loop.

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

num_train_epochs = 2
# For faster training, use the fixed number of steps below. If time permits, use num_train_epochs * len(train_dataloader) instead.
num_training_steps = 2000 #num_train_epochs * len(train_dataloader)


optimizer = AdamW(model.parameters(), lr=0.05)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
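
The "linear" schedule decays the learning rate from its initial value down to zero over `num_training_steps`, after an optional warmup. A minimal sketch of that rule, written as a plain function, is given below; `linear_schedule_lr` is a hypothetical helper for illustration, not the library's code.

```python
def linear_schedule_lr(base_lr, step, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

# values matching the cell above: lr=0.05, no warmup, 2000 scheduled steps
for step in (0, 1000, 2000, 3000):
    print(step, linear_schedule_lr(0.05, step, 2000))
```

Under this rule the learning rate is zero from step 2000 onwards, so any optimizer steps taken beyond `num_training_steps` would no longer change the weights.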

In [None]:
output_size = tokenizer.vocab_size
output_size

In [None]:
# debug

In [None]:
# debug
num_training_steps

Finally, we are here!

In the training loop, you will run a forward pass, compute the loss, compute the gradients, and then update the weights. (Don't forget to set the gradients to zero!)

During the eval phase we simply do a forward pass and compute the loss!

In [None]:
import torch
torch.cuda.is_available()

In [None]:
from tqdm.auto import tqdm
import torch


progress_bar = tqdm(total=num_training_steps, desc="Training Progress")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
for epoch in range(num_train_epochs):
    # Training Phase
    model.train()
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        optimizer.zero_grad()  # reset gradients from the previous step
        # pass the attention mask so that padding tokens in the input are ignored
        output = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"],
        )
        loss = output.loss  # the model computes the loss when labels are given
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        progress_bar.update(1)

    # Evaluation Phase
    model.eval()
    total_batches = 0
    total_eval_loss = 0
    with torch.no_grad():
        for batch in eval_dataloader:
            batch = {k: v.to(device) for k, v in batch.items()}
            output = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"],
            )
            total_eval_loss += output.loss.item()
            total_batches += 1

    avg_loss = total_eval_loss / total_batches if total_batches > 0 else float('inf')
    print(f"Epoch {epoch}: Average Eval Loss: {avg_loss:.4f}")

The above training phase took about 90 min to run.

Congratulations on completing the training! Now don't forget to save your model and the tokenizer.

In [None]:
# Save model and tokenizer
output_dir = "HW2-transformer-model"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

## Q9) Evaluating Transformer for Machine Translation

We will now test our trained model and analyze its performance using BLEU-1, 2, 3, 4 scores from the sacrebleu library. You will create a task evaluator for translation, load and process the test dataset, and compute the results on an existing trained model.

Below we load a model trained for French-to-English translation. You can read more about it here: https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-fr-en

In [None]:
#pip install sentencepiece

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Initialize an evaluator for translation task

In [None]:
## Load Evaluator for translation
from evaluate import evaluator
task_evaluator = evaluator("translation")

We will need to change our test dataset so that it has separate input and target columns. Thus we will use split_translations to split the translation column into two columns, 'en' and 'fr'.

In [None]:
#  Implement the split function
def split_translations(example):
    en_text = example["translation"]["en"]
    fr_text = example["translation"]["fr"]
    example['en'] = en_text
    example['fr'] = fr_text
    return example

In [None]:
# debug
test_dataset

In [None]:
test_data = test_dataset.map(split_translations)

In [None]:
print(test_data)

In [None]:
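# Since I trained my model on my local Mac, I could not use CUDA;
# the alternative is MPS, with a fallback for other machines
import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)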

You can now go ahead and compute the results by appropriately setting up the task_evaluator.compute()

In [None]:
import sacrebleu
import evaluate
bleu_metric = evaluate.load("sacrebleu")
results = task_evaluator.compute(
    model_or_pipeline= model,
    data= test_data,
    tokenizer= tokenizer,
    metric=bleu_metric,
    input_column="fr",
    label_column="en"
)

The above cell took 49 min to run

In [None]:
print(results)

## Q10) Inferencing on Transformers

Let's check how well this trained model translates. You can try a few French sentences and see how well it handles them.

To do so, we will set up a pipeline using the existing trained model.


Loading the tokenizer and model for the pipeline

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
checkpoint = "Helsinki-NLP/opus-mt-tc-big-fr-en"

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Set up the pipeline for translation using your model and tokenizer. You can read about pipelines here: https://huggingface.co/docs/transformers/en/main_classes/pipelines

In [None]:
from transformers import pipeline
# Instantiate a pipeline for translation using the model and tokenizer
pipeline = pipeline("translation_fr_to_en", model=model, tokenizer=tokenizer)

Translate the given sentence using the pipeline

In [None]:
input_text = "J'ai mes joies, mes peines." # input French words/sentences
translation_result = pipeline(input_text)

In [None]:
print(translation_result)

In [None]:
input_text1 = "Chicago est célèbre pour ses pizzas profondes, son jazz et son architecture époustouflante."
input_text2 = "J’ai traduit cette phrase du français vers l’anglais."
input_text3 = "Vous avez maintenant terminé le deuxième devoir de ce cours."

In [None]:
translate1 = pipeline(input_text1)
translate2 = pipeline(input_text2)
translate3 = pipeline(input_text3)

In [None]:
print(translate1)
print(translate2)
print(translate3)

Note: For questions Q1-Q6, all the models were trained using Google Colab with CUDA. For the remaining questions, the models were trained on a Mac with MPS.