# Task 2: Fine-tuning a Sequence-to-Sequence Model
## Deep Learning Final Assignment

**Objective:** Fine-tune the T5-base model on the SQuAD dataset for generative question answering.

## Setup and Installation

First, we'll install all necessary libraries for this assignment.

In [1]:
!pip install -q transformers datasets torch evaluate rouge-score accelerate

## Import Required Libraries

Import all necessary libraries for data processing, model training, and evaluation.

In [2]:
import torch
import numpy as np
import pandas as pd
import gc
from datasets import load_dataset
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    Trainer,
    TrainingArguments,
    DataCollatorForSeq2Seq
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")

if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
gc.collect()



Device: cpu


90


# 1. Data Preprocessing

In this section, we'll:
1. Load the SQuAD dataset from Hugging Face
2. Explore the dataset structure
3. Preprocess the data by formatting inputs as 'question: [Q] context: [C]'
4. Tokenize the data for the T5 model

## 1.1 Load the SQuAD Dataset

In [3]:
dataset = load_dataset("squad")
print(f"Train: {len(dataset['train']):,} | Validation: {len(dataset['validation']):,}")



Train: 87,599 | Validation: 10,570


## 1.2 Explore the Dataset

Let's look at a sample example to understand the data structure.

In [25]:
sample = dataset['train'][0]

print(f"Context: {sample['context'][:200]}...")
print(f"\nQuestion: {sample['question']}")
print(f"Answer: {sample['answers']['text'][0]}")

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper sta...

Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer: Saint Bernadette Soubirous


## 1.3 Initialize T5 Tokenizer

Load the T5-base tokenizer that will be used to convert text to tokens.

In [5]:
tokenizer = T5Tokenizer.from_pretrained("t5-base")
print(f"Tokenizer loaded | Vocab size: {tokenizer.vocab_size:,}")



Tokenizer loaded | Vocab size: 32,000


## 1.4 Data Preprocessing Function

Create a preprocessing function that formats each input as: **'question: [Q] context: [C]'**

This is the required format for our generative QA task.

In [None]:
MAX_INPUT_LENGTH = 384  
MAX_TARGET_LENGTH = 64   

def preprocess_function(examples):
    """Format: 'question: [Q] context: [C]' -> '[Answer]'"""
    inputs = [
        f"question: {question} context: {context}"
        for question, context in zip(examples['question'], examples['context'])
    ]
    targets = [answers['text'][0] for answers in examples['answers']]

    model_inputs = tokenizer(
        inputs,
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
        padding="max_length"
    )

    labels = tokenizer(
        targets,
        max_length=MAX_TARGET_LENGTH,
        truncation=True,
        padding="max_length"
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
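
One detail worth flagging: because the labels are padded to `MAX_TARGET_LENGTH` with the tokenizer's pad id (0 for T5), the padded positions are scored by the loss like any other token. The cross-entropy loss used by Hugging Face seq2seq models skips only positions labelled `-100`. Below is a minimal sketch of the usual masking step, run on hand-written token ids rather than real tokenizer output. The helper name `mask_pad_labels` and the example ids are hypothetical.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss


def mask_pad_labels(batch_label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id
         for token_id in label_ids]
        for label_ids in batch_label_ids
    ]


# Two made-up label sequences, already padded to length 6.
padded_labels = [
    [2788, 8942, 1, 0, 0, 0],
    [305, 1, 0, 0, 0, 0],
]
masked_labels = mask_pad_labels(padded_labels)
print(masked_labels)
```

In `preprocess_function`, this would be applied to `labels["input_ids"]` before assigning it to `model_inputs["labels"]`. Masked labels cannot be decoded directly, so the `-100` entries would need to be mapped back to the pad id first. Loss values computed with and without masking are not directly comparable.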

## 1.5 Apply Preprocessing to Dataset

Apply the preprocessing function to a subset of the dataset (the first 1,000 training and 200 validation examples) to keep training time manageable.

In [None]:
TRAIN_SAMPLES = 1000  
VAL_SAMPLES = 200     

train_dataset = dataset['train'].select(range(TRAIN_SAMPLES))
val_dataset = dataset['validation'].select(range(VAL_SAMPLES))

tokenized_train = train_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=train_dataset.column_names,
    batch_size=16  
)

tokenized_val = val_dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=val_dataset.column_names,
    batch_size=16
)

print(f"Preprocessed | Train: {len(tokenized_train)} | Val: {len(tokenized_val)}")


Preprocessed | Train: 1000 | Val: 200


## 1.6 Verify Preprocessing

Let's verify that our preprocessing worked correctly by decoding a sample.

In [24]:
sample_idx = 0
input_text = tokenizer.decode(tokenized_train[sample_idx]['input_ids'], skip_special_tokens=True)
label_text = tokenizer.decode(tokenized_train[sample_idx]['labels'], skip_special_tokens=True)

print(f"Input: {input_text[:200]}...")
print(f"\nTarget: {label_text}")

Input: question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue o...

Target: Saint Bernadette Soubirous



# 2. Model Training

In this section, we'll:
1. Load the T5-base model
2. Configure training arguments
3. Set up the Trainer API
4. Fine-tune the model on the SQuAD dataset

## 2.1 Load T5-base Model

In [None]:
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model = model.to(device)

model.gradient_checkpointing_enable()

print(f"Model loaded | Parameters: {model.num_parameters():,}")
print("Gradient checkpointing enabled for memory optimization")

Model loaded | Parameters: 222,903,552
Gradient checkpointing enabled for memory optimization


## 2.2 Setup Data Collator

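The data collator assembles examples into batches. It supports dynamic padding, but because the preprocessing step already pads every example to a fixed length, it adds no further padding here.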

In [10]:
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)
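
To make "dynamic padding" concrete, here is a simplified sketch of the idea in plain Python. It is not the collator's actual implementation: it only shows that each batch is padded to its own longest sequence instead of a global maximum. The function name `pad_to_longest` and the token ids are made up for illustration.

```python
def pad_to_longest(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = longest - len(ids)
        padded.append(ids + [pad_token_id] * n_pad)   # real tokens, then padding
        masks.append([1] * len(ids) + [0] * n_pad)    # 1 = attend, 0 = ignore
    return padded, masks


# Two unpadded sequences of different lengths (hypothetical ids).
batch = [[822, 10, 571, 1], [822, 10, 1]]
input_ids, attention_mask = pad_to_longest(batch)
print(input_ids)
print(attention_mask)
```

`DataCollatorForSeq2Seq` does the same for the inputs and, by default, pads the labels with `-100` (`label_pad_token_id`) so that padded label positions are excluded from the loss.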

## 2.3 Configure Training Arguments

Set up hyperparameters and training configuration using Hugging Face TrainingArguments.

In [12]:
OUTPUT_DIR = "./t5-squad-finetuned"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    learning_rate=3e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    logging_dir="./logs",
    logging_steps=50,
    warmup_steps=100,
    fp16=torch.cuda.is_available(),
    dataloader_num_workers=0,
    report_to="none",
    push_to_hub=False,
    gradient_checkpointing=True,
    optim="adafactor",
)

print(f"Training: {training_args.num_train_epochs} epochs | Batch: {training_args.per_device_train_batch_size} | Grad Accum: {training_args.gradient_accumulation_steps}")
print(f"Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")

Training: 2 epochs | Batch: 4 | Grad Accum: 4
Effective batch size: 16
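
The same arithmetic gives a rough count of optimizer updates, which is useful for judging `warmup_steps` and `logging_steps`. This is a back-of-the-envelope sketch assuming a single device and that the final partial accumulation group is dropped. The exact count can differ by a step or two depending on how the installed `transformers` version rounds.

```python
train_samples = 1000
per_device_batch_size = 4
gradient_accumulation_steps = 4
num_epochs = 2
warmup_steps = 100

effective_batch_size = per_device_batch_size * gradient_accumulation_steps
# Assumption: incomplete accumulation groups are dropped (floor division).
steps_per_epoch = train_samples // effective_batch_size
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Share of steps spent in warmup: {warmup_fraction:.0%}")
```

With these settings, roughly four out of five optimizer updates fall inside the warmup phase, so the learning rate reaches its configured peak only near the end of training. A smaller `warmup_steps` value may be worth trying on a subset this small.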


## 2.4 Initialize Trainer

Create a Trainer instance with our model, datasets, and training configuration.

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
)



## 2.5 Start Training

Begin the fine-tuning process. This may take some time depending on your hardware.

**Note:** The run recorded below fine-tunes on 1,000 samples for 2 epochs and took about 3.7 hours (13,239 s) on CPU. A GPU should be considerably faster.

In [None]:
print("Training started...\n")

train_result = trainer.train()

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("\n" + "="*80)
print(f"Training Loss: {train_result.training_loss:.4f}")
print(f"Runtime: {train_result.metrics['train_runtime']:.2f}s")
print("="*80)

Training started...



`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 7.6637        | 0.024563        |
| 2     | 0.0268        | 0.022383        |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight'].



Training Loss: 3.0566
Runtime: 13239.30s


## 2.6 Save the Fine-tuned Model

Save the model and tokenizer for future use.

In [15]:
FINAL_MODEL_DIR = "./t5-squad-final"

trainer.save_model(FINAL_MODEL_DIR)
tokenizer.save_pretrained(FINAL_MODEL_DIR)

print(f"Model saved to: {FINAL_MODEL_DIR}")

Model saved to: ./t5-squad-final



# 3. Model Evaluation

In this section, we'll:
1. Evaluate the model on the validation set
2. Test the model with custom questions and contexts
3. Demonstrate the model's question-answering capabilities

## 3.1 Evaluate on Validation Set

In [16]:
eval_results = trainer.evaluate()

print("="*80)
print("Validation Results:")
print("="*80)
print(f"Loss: {eval_results['eval_loss']:.4f}")
print(f"Runtime: {eval_results['eval_runtime']:.2f}s")
print("="*80)



Validation Results:
Loss: 0.0224
Runtime: 306.58s


## 3.2 Create Inference Function

Create a helper function to generate answers for custom questions and contexts.

In [17]:
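def generate_answer(question, context, max_length=128):
    """Generate an answer for a question/context pair using beam search."""
    input_text = f"question: {question} context: {context}"

    # Tokenize and move the tensors to the same device as the model.
    encoded = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True
    ).to(device)

    model.eval()
    with torch.no_grad():
        # Beam search is deterministic; sampling options such as `temperature`
        # only apply when do_sample=True, so none are passed here.
        outputs = model.generate(
            input_ids=encoded.input_ids,
            attention_mask=encoded.attention_mask,
            max_length=max_length,
            num_beams=4,
            early_stopping=True,
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)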

## 3.3 Test with Examples from Validation Set

Let's test the model with examples from the SQuAD validation set.

In [23]:
import random

test_indices = random.sample(range(len(val_dataset)), 3)

for i, idx in enumerate(test_indices, 1):
    example = val_dataset[idx]
    predicted = generate_answer(example['question'], example['context'])

    print(f"\nExample {i}:")
    print(f"Context: {example['context'][:200]}...")
    print(f"\nQuestion: {example['question']}")
    print(f"True Answer: {example['answers']['text'][0]}")
    print(f"Predicted: {predicted}")


Example 1:
Context: For the third straight season, the number one seeds from both conferences met in the Super Bowl. The Carolina Panthers became one of only ten teams to have completed a regular season with only one los...

Question: What seed was the Denver Broncos?
True Answer: number one
Predicted: number one

Example 2:
Context: The Broncos took an early lead in Super Bowl 50 and never trailed. Newton was limited by Denver's defense, which sacked him seven times and forced him into three turnovers, including a fumble which th...

Question: How many tackles did Von Miller get during the game?
True Answer: 5
Predicted: five

Example 3:
Context: The Broncos took an early lead in Super Bowl 50 and never trailed. Newton was limited by Denver's defense, which sacked him seven times and forced him into three turnovers, including a fumble which th...

Question: Who won the Super Bowl MVP?
True Answer: Von Miller
Predicted: Von Miller
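
The three examples above can also be scored automatically. The sketch below follows the general recipe of SQuAD-style exact match and token-level F1 (normalize both strings, then compare). It is a simplified re-implementation for illustration, not the official evaluation script, and the function names are illustrative. It compares against a single reference answer, whereas SQuAD validation examples list several.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# (prediction, reference) pairs taken from the three examples above.
pairs = [("number one", "number one"), ("five", "5"), ("Von Miller", "Von Miller")]
scores = [(exact_match(p, r), token_f1(p, r)) for p, r in pairs]
print(scores)
```

Example 2 shows a limitation of string-based metrics: "five" and "5" mean the same thing but share no tokens, so both scores are zero.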


## 3.4 Test with Custom Question and Context

Now let's test the model with a completely custom example.

In [22]:
custom_context = """
The Transformer is a deep learning model introduced in 2017, used primarily in the field of
natural language processing (NLP). It was proposed in the paper 'Attention is All You Need' by
Vaswani et al. The Transformer architecture uses self-attention mechanisms to process input
sequences in parallel, making it more efficient than recurrent neural networks (RNNs).
This architecture has become the foundation for many state-of-the-art models like BERT, GPT,
and T5.
"""

custom_question = "When was the Transformer model introduced?"
custom_answer = generate_answer(custom_question, custom_context)

print(f"Question: {custom_question}")
print(f"Answer: {custom_answer}")

Question: When was the Transformer model introduced?
Answer: 2017


## 3.5 Interactive Testing

Test the model on a few more hand-written examples. Edit `test_examples` to try your own questions and contexts.

In [20]:
test_examples = [
    {
        "context": """Deep Learning is a subset of machine learning that uses artificial neural
        networks with multiple layers to learn from data. It has achieved remarkable success in
        various domains including computer vision, natural language processing, and speech recognition.""",
        "question": "What is Deep Learning?"
    },
    {
        "context": """Python is a high-level, interpreted programming language created by Guido
        van Rossum and first released in 1991. Python emphasizes code readability with its notable
        use of significant whitespace.""",
        "question": "Who created Python?"
    },
    {
        "context": """The T5 (Text-to-Text Transfer Transformer) model treats every NLP problem
        as a text-to-text problem. T5 was pre-trained on the Colossal Clean Crawled Corpus (C4)
        dataset and can be fine-tuned for specific tasks.""",
        "question": "What dataset was T5 pre-trained on?"
    }
]

for i, example in enumerate(test_examples, 1):
    answer = generate_answer(example["question"], example["context"])
    print(f"\n{i}. Q: {example['question']}")
    print(f"   A: {answer}")


1. Q: What is Deep Learning?
   A: a subset of machine learning

2. Q: Who created Python?
   A: Guido van Rossum

3. Q: What dataset was T5 pre-trained on?
   A: Colossal Clean Crawled Corpus


## 3.6 Model Performance Summary

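Let's summarize the run. Note that the reported training loss is the average over all training steps, so it is dominated by the high loss early in training and is not directly comparable to the final validation loss.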

In [26]:

print("PERFORMANCE SUMMARY")

print("Model: T5-base fine-tuned on SQuAD")
print(f"Training Samples: {len(tokenized_train):,}")
print(f"Validation Samples: {len(tokenized_val):,}")
print(f"Epochs: {training_args.num_train_epochs}")
print(f"\nTraining Loss: {train_result.training_loss:.4f}")
print(f"Validation Loss: {eval_results['eval_loss']:.4f}")
print(f"\nSaved: {FINAL_MODEL_DIR}")


PERFORMANCE SUMMARY
Model: T5-base fine-tuned on SQuAD
Training Samples: 1,000
Validation Samples: 200
Epochs: 2

Training Loss: 3.0566
Validation Loss: 0.0224

Saved: ./t5-squad-final
