# Assignment 2: Transformer Architecture Exercise
Use this notebook as a starting point and expand on your understanding of transformer models by completing the following structured tasks. You are encouraged to experiment, analyze, and critically reflect on your findings in your report.

## Part 1: Model Training & Implementation
### 1. Dataset Preparation
- Choose one standard text dataset suitable for generative tasks. Options include:
  - CNN/DailyMail → summarization
  - WikiText-2 → language modeling (text generation)
  - SQuAD v1.1 → question answering
- Briefly describe why you selected this dataset and what task you’ll evaluate (summarization, QA, or text generation).
- Show how you preprocessed the data (tokenization, train/val split, max length, etc.).

### 2. Model Implementation

Implement and train the following:
- Decoder-only model (GPT-style): e.g., GPT-2 small from Hugging Face.
- Encoder-only model (BERT-style): e.g., BERT-base, used for masked-language-modeling or extractive QA/summarization.
- Encoder-decoder model (T5-style): e.g., T5-small, trained for the same dataset/task as the other two.

### 3. Training Documentation

- Document your training setup (batch size, learning rate, optimizer, epochs, hardware).
- Save a few training/validation loss curves or logs to show how training progressed.
- Mention any difficulties you faced and how you addressed them (e.g., memory limits, convergence).

## Part 2: Evaluation & Analysis

### 4. Performance Evaluation

- Evaluate all three models on the same task.
- Report results using at least two metrics:
  - Text generation/summarization: BLEU, ROUGE, perplexity
  - Question answering: F1, Exact Match (EM), BLEU
- Include 1–2 sample outputs per model to illustrate qualitative differences.

### 5. Comparative Discussion

- Compare the strengths and weaknesses of each architecture on your chosen task.
- Suggested angles:

  - Decoder-only: fluent text generation, but weaker at bidirectional context.
  - Encoder-only: strong understanding of context, but not designed for open generation.
  - Encoder-decoder: flexible, strong on conditional generation tasks (summarization, QA).

- Which model seemed easiest to fine-tune?
- Which produced the best outputs on your dataset?
- Which was the most efficient (speed, memory)?

### 6. Reflections on Applicability

- In what real-world scenarios would you prefer each architecture?
- Briefly note whether you think CoT reasoning would have helped these models if you had added it (conceptual discussion only—no experiments required).


# Assignment 2: Transformer Architecture Exercise

This notebook serves as a reference implementation for **Assignment 2** of the generative AI course.  The goal is to compare three prominent transformer architectures—**decoder‑only**, **encoder‑only**, and **encoder‑decoder**—on a common generative task.  The assignment requires training each architecture on the same dataset, evaluating their performance with common metrics, and analysing the implications of architectural differences on generative tasks and chain‑of‑thought reasoning.

## Dataset selection

For this exercise we use the **CNN/DailyMail** summarisation dataset (version `3.0.0`) from Hugging Face’s `datasets` library.  The dataset comprises news articles paired with human‑written summaries; each article–summary pair provides a natural input/output example for a generative model.  Because the data are already split into training/validation/test splits and are widely used for abstractive summarisation research, this dataset is appropriate for comparing generative architectures.  Although `WikiText` could be used for language modelling tasks, summarisation requires models to generate structured output given an input, which better illustrates differences between decoder‑only, encoder‑only, and encoder‑decoder designs.  For compute efficiency in this notebook we subsample the dataset (e.g. a few hundred training examples) rather than using the full corpus.



## Overview of transformer architectures

We train three different transformer models:

* **Decoder‑only (GPT‑style):** These models consist of stacked self‑attention blocks in which each token can attend only to previous tokens (causal masking).  We use `GPT‑2` as the base model and fine‑tune it to generate a summary from an article.  Because GPT‑2 is a pure language model, we construct input prompts of the form `"summarize: <article>"` and train the model to predict the target summary.  During training we mask out the prompt part of the input so that the loss is computed only on the summary tokens.

* **Encoder‑only (BERT‑style):** Encoder‑only models such as `BERT` learn bi‑directional contextual representations using masked language modelling (MLM).  They are not inherently generative; they excel at understanding tasks (e.g. classification, token classification).  For a fair comparison on generative tasks we fine‑tune BERT on the same corpus using MLM, combining article and summary text into a single sequence.  At evaluation time we assess perplexity and use the `fill‑mask` capability to approximate generation.  This highlights BERT’s limitations on tasks requiring free‑form generation.

* **Encoder‑decoder (T5‑style):** Models like `T5` encode the input sequence with an encoder and decode the output sequence with a separate decoder.  They can perform a wide range of text‑to‑text tasks, including summarisation and question answering.  We fine‑tune `T5‑small` on the CNN/DailyMail dataset using the standard prefix `"summarize: "` in the input to indicate the task.  During evaluation we compute ROUGE metrics on generated summaries.

The following sections implement data loading, preprocessing, model fine‑tuning, and evaluation for each architecture.


In [1]:
import sys
print(f"Using Python {sys.version.split()[0]}")

# Install required packages into the current notebook environment
%pip install -qU numpy matplotlib scikit-learn

# Verify versions
import numpy as np, matplotlib, sklearn
print("numpy       :", np.__version__)
print("matplotlib  :", matplotlib.__version__)
print("scikit-learn:", sklearn.__version__)
print("✅ Setup complete!")

Using Python 3.12.10
Note: you may need to restart the kernel to use updated packages.
numpy       : 2.3.3
matplotlib  : 3.10.6
scikit-learn: 1.7.2
✅ Setup complete!


In [2]:
!pip install -U datasets transformers evaluate
!pip install torch
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoModelForSeq2SeqLM,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
import evaluate
from transformers import logging

# Silence warnings for cleaner output
logging.set_verbosity_error()

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")



  from .autonotebook import tqdm as notebook_tqdm


Using device: cpu


## Load and inspect the dataset

We load the WikiText dataset using the Hugging Face `datasets` library.  To accelerate training for demonstration purposes we take a small subset of the training and validation sets (e.g. 500 training examples and 100 validation examples).  

Below we load the dataset, inspect a few examples, and create the smaller subsets used for fine‑tuning.

### wikitext has multiple configurations:

"wikitext-2-raw-v1" → raw text (no tokenization/normalization applied).

"wikitext-103-raw-v1" → larger dataset (same style).

"wikitext-2-v1" → pre-tokenized version (less common for transformers).

For transformer language modeling tasks, use "wikitext-2-raw-v1" so the tokenizer can handle text.

In [3]:
from datasets import load_dataset

# Load the WikiText-2 dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# For quick experimentation, take small subsets
train_size = 500
val_size = 100
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(train_size))
small_val_dataset = dataset["validation"].shuffle(seed=42).select(range(val_size))

print("Dataset splits:", dataset.keys())
print("Example training record:", small_train_dataset[0])
print("Example validation record:", small_val_dataset[0])

Dataset splits: dict_keys(['test', 'train', 'validation'])
Example training record: {'text': ' Continuous , short @-@ arc , high pressure xenon arc lamps have a color temperature closely approximating noon sunlight and are used in solar simulators . That is , the chromaticity of these lamps closely approximates a heated black body radiator that has a temperature close to that observed from the Sun . After they were first introduced during the 1940s , these lamps began replacing the shorter @-@ lived carbon arc lamps in movie projectors . They are employed in typical 35mm , IMAX and the new digital projectors film projection systems , automotive HID headlights , high @-@ end " tactical " flashlights and other specialized uses . These arc lamps are an excellent source of short wavelength ultraviolet radiation and they have intense emissions in the near infrared , which is used in some night vision systems . \n'}
Example validation record: {'text': ' = = Modern times = = \n'}


## Read the same data text

In [4]:
for i in range(20):
    print(dataset["train"][i]["text"])


 = Valkyria Chronicles III = 


 Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . 

 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series 

## Encoder‑decoder model: T5 fine‑tuning

T5 is a text‑to‑text model that uses a separate encoder and decoder.  It naturally handles generative tasks such as summarisation.  We prepend the prefix `"summarize: "` to each article, then tokenize the input and the summary separately.  A `DataCollatorForSeq2Seq` takes care of padding the inputs and shifting the decoder labels.  During evaluation we use greedy decoding to produce summaries and compute ROUGE scores against the reference summaries.


In [5]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset
# Load the cnn_dailymail dataset (version 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0")

# For quick experimentation, take a small subset
train_size = 500
val_size = 100
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(train_size))
small_val_dataset = dataset["validation"].shuffle(seed=42).select(range(val_size))
print("Dataset splits:", dataset.keys())
print("Example training record:", small_train_dataset[0])

# Load T5 tokenizer and model
t5_model_name = "t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_model_name)

def preprocess_t5(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = t5_tokenizer(inputs, max_length=512, truncation=True)

    # Tokenize targets
    labels = t5_tokenizer(examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_t5 = small_train_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["train"].column_names)
val_t5 = small_val_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["validation"].column_names)

# Data collator for seq2seq tasks
data_collator_t5 = DataCollatorForSeq2Seq(tokenizer=t5_tokenizer, model=t5_model_name)

# Load T5 model
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_model_name)

# Training arguments for T5
training_args_t5 = TrainingArguments(
    output_dir="./t5-summarization",
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=500,
    save_total_limit=1,
    warmup_steps=50,
    gradient_accumulation_steps=4,
    fp16=torch.cuda.is_available(),
    report_to=[],
)

trainer_t5 = Trainer(
    model=t5_model,
    args=training_args_t5,
    train_dataset=train_t5,
    eval_dataset=val_t5,
    data_collator=data_collator_t5,
    tokenizer=t5_tokenizer,
)

# Uncomment the line below to train the T5 model
trainer_t5.train()



Dataset splits: dict_keys(['train', 'validation', 'test'])
Example training record: {'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in 

Map: 100%|██████████| 500/500 [00:00<00:00, 674.99 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 669.25 examples/s]
  trainer_t5 = Trainer(


{'train_runtime': 1097.9807, 'train_samples_per_second': 0.455, 'train_steps_per_second': 0.057, 'train_loss': 2.420558142283606, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=2.420558142283606, metrics={'train_runtime': 1097.9807, 'train_samples_per_second': 0.455, 'train_steps_per_second': 0.057, 'train_loss': 2.420558142283606, 'epoch': 1.0})

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset
# Load the cnn_dailymail dataset (version 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0")

# For quick experimentation, take a small subset
train_size = 500
val_size = 100
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(train_size))
small_val_dataset = dataset["validation"].shuffle(seed=42).select(range(val_size))
print("Dataset splits:", dataset.keys())
print("Example training record:", small_train_dataset[0])

# Load T5 tokenizer and model
t5_model_name = "t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_model_name)

def preprocess_t5(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = t5_tokenizer(inputs, max_length=128, truncation=True)

    # Tokenize targets
    labels = t5_tokenizer(examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_t5 = small_train_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["train"].column_names)
val_t5 = small_val_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["validation"].column_names)

# Data collator for seq2seq tasks
data_collator_t5 = DataCollatorForSeq2Seq(tokenizer=t5_tokenizer, model=t5_model_name)

# Load T5 model
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_model_name)

# Training arguments for T5
training_args_t5 = TrainingArguments(
    output_dir="./t5-summarization",
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=500,
    save_total_limit=1,
    warmup_steps=50,
    gradient_accumulation_steps=4,
    fp16=torch.cuda.is_available(),
    report_to=[],
)

trainer_t5 = Trainer(
    model=t5_model,
    args=training_args_t5,
    train_dataset=train_t5,
    eval_dataset=val_t5,
    data_collator=data_collator_t5,
    tokenizer=t5_tokenizer,
)

# Uncomment the line below to train the T5 model
trainer_t5.train()

Dataset splits: dict_keys(['train', 'validation', 'test'])
Example training record: {'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in 

Map: 100%|██████████| 500/500 [00:00<00:00, 1394.75 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 1417.90 examples/s]
  trainer_t5 = Trainer(


{'train_runtime': 557.683, 'train_samples_per_second': 0.897, 'train_steps_per_second': 0.113, 'train_loss': 3.3584403386191717, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=3.3584403386191717, metrics={'train_runtime': 557.683, 'train_samples_per_second': 0.897, 'train_steps_per_second': 0.113, 'train_loss': 3.3584403386191717, 'epoch': 1.0})

In [7]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset
# Load the cnn_dailymail dataset (version 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0")

# For quick experimentation, take a small subset
train_size = 500
val_size = 100
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(train_size))
small_val_dataset = dataset["validation"].shuffle(seed=42).select(range(val_size))
print("Dataset splits:", dataset.keys())
print("Example training record:", small_train_dataset[0])

# Load T5 tokenizer and model
t5_model_name = "t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_model_name)

def preprocess_t5(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = t5_tokenizer(inputs, max_length=256, truncation=True)

    # Tokenize targets
    labels = t5_tokenizer(examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_t5 = small_train_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["train"].column_names)
val_t5 = small_val_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["validation"].column_names)

# Data collator for seq2seq tasks
data_collator_t5 = DataCollatorForSeq2Seq(tokenizer=t5_tokenizer, model=t5_model_name)

# Load T5 model
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_model_name)

# Training arguments for T5
training_args_t5 = TrainingArguments(
    output_dir="./t5-summarization",
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=500,
    save_total_limit=1,
    warmup_steps=50,
    gradient_accumulation_steps=4,
    fp16=torch.cuda.is_available(),
    report_to=[],
)

trainer_t5 = Trainer(
    model=t5_model,
    args=training_args_t5,
    train_dataset=train_t5,
    eval_dataset=val_t5,
    data_collator=data_collator_t5,
    tokenizer=t5_tokenizer,
)

# Uncomment the line below to train the T5 model
trainer_t5.train()

Dataset splits: dict_keys(['train', 'validation', 'test'])
Example training record: {'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in 

Map: 100%|██████████| 500/500 [00:00<00:00, 1223.54 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 1540.97 examples/s]
  trainer_t5 = Trainer(


{'train_runtime': 270.2395, 'train_samples_per_second': 1.85, 'train_steps_per_second': 0.233, 'train_loss': 2.806276593889509, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=2.806276593889509, metrics={'train_runtime': 270.2395, 'train_samples_per_second': 1.85, 'train_steps_per_second': 0.233, 'train_loss': 2.806276593889509, 'epoch': 1.0})

In [8]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Trainer, TrainingArguments
from datasets import load_dataset
# Load the cnn_dailymail dataset (version 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0")

# For quick experimentation, take a small subset
train_size = 500
val_size = 100
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(train_size))
small_val_dataset = dataset["validation"].shuffle(seed=42).select(range(val_size))
print("Dataset splits:", dataset.keys())
print("Example training record:", small_train_dataset[0])

# Load T5 tokenizer and model
t5_model_name = "t5-small"
t5_tokenizer = AutoTokenizer.from_pretrained(t5_model_name)

def preprocess_t5(examples):
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = t5_tokenizer(inputs, max_length=384, truncation=True)

    # Tokenize targets
    labels = t5_tokenizer(examples["highlights"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_t5 = small_train_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["train"].column_names)
val_t5 = small_val_dataset.map(preprocess_t5, batched=True, remove_columns=dataset["validation"].column_names)

# Data collator for seq2seq tasks
data_collator_t5 = DataCollatorForSeq2Seq(tokenizer=t5_tokenizer, model=t5_model_name)

# Load T5 model
t5_model = AutoModelForSeq2SeqLM.from_pretrained(t5_model_name)

# Training arguments for T5
training_args_t5 = TrainingArguments(
    output_dir="./t5-summarization",
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=500,
    save_total_limit=1,
    warmup_steps=50,
    gradient_accumulation_steps=4,
    fp16=torch.cuda.is_available(),
    report_to=[],
)

trainer_t5 = Trainer(
    model=t5_model,
    args=training_args_t5,
    train_dataset=train_t5,
    eval_dataset=val_t5,
    data_collator=data_collator_t5,
    tokenizer=t5_tokenizer,
)

# Uncomment the line below to train the T5 model
trainer_t5.train()

Dataset splits: dict_keys(['train', 'validation', 'test'])
Example training record: {'article': "By . Anthony Bond . PUBLISHED: . 07:03 EST, 2 March 2013 . | . UPDATED: . 08:07 EST, 2 March 2013 . Three members of the same family who died in a static caravan from carbon monoxide poisoning would have been unconscious 'within minutes', investigators said today. The bodies of married couple John and Audrey Cook were discovered alongside their daughter, Maureen, at the mobile home they shared on Tremarle Home Park in Camborne, west Cornwall. The inquests have now opened into the deaths last Saturday, with investigators saying the three died along with the family's pet dog, of carbon monoxide poisoning from a cooker. Tragic: The inquests have opened into the deaths of three members of the same family who were found in their static caravan last weekend. John and Audrey Cook are pictured . Awful: The family died following carbon monoxide poisoning at this caravan at the Tremarle Home Park in 

Map: 100%|██████████| 500/500 [00:00<00:00, 1247.89 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 1424.87 examples/s]
  trainer_t5 = Trainer(


{'train_runtime': 527.6601, 'train_samples_per_second': 0.948, 'train_steps_per_second': 0.119, 'train_loss': 2.5492606995597717, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=2.5492606995597717, metrics={'train_runtime': 527.6601, 'train_samples_per_second': 0.948, 'train_steps_per_second': 0.119, 'train_loss': 2.5492606995597717, 'epoch': 1.0})

In [9]:
!pip install nltk rouge-score absl-py



# Evaluation

In [11]:
# T5 evaluation
from transformers import GenerationConfig

# Define ROUGE metric
evaluate_rouge = evaluate.load("rouge")
def compute_metrics_rouge(preds, refs):
    # Compute ROUGE scores; use newline separation between sentences in each text
    result = evaluate_rouge.compute(predictions=preds, references=refs, use_stemmer=True)
    return {k: round(v * 100, 2) for k, v in result.items()}

# Function to compute perplexity from evaluation loss
def compute_perplexity(eval_output):
    loss = eval_output["eval_loss"]
    return round(torch.exp(torch.tensor(loss)).item(), 3)


t5_eval_results = trainer_t5.evaluate()
t5_perplexity = compute_perplexity(t5_eval_results)

def evaluate_t5(model, tokenizer, dataset, num_samples=10):
     model.eval()
     preds, refs = [], []
     for i, example in enumerate(dataset.select(range(num_samples))):
         input_text = "summarize: " + example["article"]
         inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512).to(model.device)
         with torch.no_grad():
             output_ids = model.generate(**inputs, max_length=128)
         summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
         preds.append(summary)
         refs.append(example["highlights"])
     rouge_scores = compute_metrics_rouge(preds, refs)
     return rouge_scores

rouge_t5 = evaluate_t5(t5_model, t5_tokenizer, small_val_dataset)
print("T5 Perplexity:", t5_perplexity)
print("T5 ROUGE:", rouge_t5)


{'eval_loss': 2.220358371734619, 'eval_runtime': 12.2417, 'eval_samples_per_second': 8.169, 'eval_steps_per_second': 4.084, 'epoch': 1.0}
T5 Perplexity: 9.211
T5 ROUGE: {'rouge1': np.float64(37.64), 'rouge2': np.float64(15.44), 'rougeL': np.float64(25.26), 'rougeLsum': np.float64(30.62)}


## Evaluation T5 Model
### T5 (Encoder-Decoder, Seq2Seq)
####	Loss / Perplexity:
-	Eval loss: 2.22 → Perplexity ≈ 9.2 (much lower than GPT-2 → more confident predictions).
####	ROUGE scores (summarization quality):
-	Rouge-1 = 37.64
-	Rouge-2 = 15.44
-	Rouge-L = 25.26
-	Rouge-Lsum = 30.62
####	Runtime:
-	Fastest (12s, ~8.2 samples/sec).
####	Inference:
-	T5 is clearly the strongest for summarization, with significantly better ROUGE scores than GPT-2.
-	Encoder-decoder structure helps it understand input context fully and generate higher-quality summaries.
-	Also runs much faster than GPT-2.
