
# Assignment 2: Transformer Architecture Exercise

This notebook serves as a reference implementation for **Assignment 2** of the generative AI course.  The goal is to compare three prominent transformer architectures—**decoder‑only**, **encoder‑only**, and **encoder‑decoder**—on a common generative task.  The assignment requires training each architecture on the same dataset, evaluating their performance with common metrics, and analysing the implications of architectural differences on generative tasks and chain‑of‑thought reasoning.

## Dataset selection

For this exercise we use the **CNN/DailyMail** summarisation dataset (version `3.0.0`) from Hugging Face’s `datasets` library.  The dataset comprises news articles paired with human‑written summaries; each article–summary pair provides a natural input/output example for a generative model.  Because the data are already split into training/validation/test splits and are widely used for abstractive summarisation research, this dataset is appropriate for comparing generative architectures.  Although `WikiText` could be used for language modelling tasks, summarisation requires models to generate structured output given an input, which better illustrates differences between decoder‑only, encoder‑only, and encoder‑decoder designs.  For compute efficiency in this notebook we subsample the dataset (e.g. a few hundred training examples) rather than using the full corpus.



## Overview of transformer architectures

We train three different transformer models:

* **Decoder‑only (GPT‑style):** These models consist of stacked self‑attention blocks in which each token can attend only to previous tokens (causal masking).  We use `GPT‑2` as the base model and fine‑tune it to generate a summary from an article.  Because GPT‑2 is a pure language model, we construct input prompts of the form `"summarize: <article>"` and train the model to predict the target summary.  During training we mask out the prompt part of the input so that the loss is computed only on the summary tokens.

* **Encoder‑only (BERT‑style):** Encoder‑only models such as `BERT` learn bi‑directional contextual representations using masked language modelling (MLM).  They are not inherently generative; they excel at understanding tasks (e.g. classification, token classification).  For a fair comparison on generative tasks we fine‑tune BERT on the same corpus using MLM, combining article and summary text into a single sequence.  At evaluation time we assess perplexity and use the `fill‑mask` capability to approximate generation.  This highlights BERT’s limitations on tasks requiring free‑form generation.

* **Encoder‑decoder (T5‑style):** Models like `T5` encode the input sequence with an encoder and decode the output sequence with a separate decoder.  They can perform a wide range of text‑to‑text tasks, including summarisation and question answering.  We fine‑tune `T5‑small` on the CNN/DailyMail dataset using the standard prefix `"summarize: "` in the input to indicate the task.  During evaluation we compute ROUGE metrics on generated summaries.

The following sections implement data loading, preprocessing, model fine‑tuning, and evaluation for each architecture.


In [2]:
import sys
print(f"Using Python {sys.version.split()[0]}")

# Install required packages into the current notebook environment
%pip install -qU numpy matplotlib scikit-learn

# Verify versions
import numpy as np, matplotlib, sklearn
print("numpy       :", np.__version__)
print("matplotlib  :", matplotlib.__version__)
print("scikit-learn:", sklearn.__version__)
print("✅ Setup complete!")


Using Python 3.12.10
Note: you may need to restart the kernel to use updated packages.
numpy       : 2.3.3
matplotlib  : 3.10.6
scikit-learn: 1.7.2
✅ Setup complete!


In [3]:
!pip install datasets transformers evaluate
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoModelForSeq2SeqLM,
    DataCollatorForLanguageModeling,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)
import evaluate
from transformers import logging

# Silence warnings for cleaner output
logging.set_verbosity_error()

# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")




  from .autonotebook import tqdm as notebook_tqdm


Using device: cpu



## Load and inspect the dataset

We load the CNN/DailyMail dataset using the Hugging Face `datasets` library.  To accelerate training for demonstration purposes we take a small subset of the training and validation sets (e.g. 500 training examples and 100 validation examples).  Each record contains two fields:

* `"article"`: the news article text (input).
* `"highlights"`: the human‑written summary (target).

Below we load the dataset, inspect a few examples, and create the smaller subsets used for fine‑tuning.


In [56]:

# Load the cnn_dailymail dataset (version 3.0.0)
dataset = load_dataset("cnn_dailymail", "3.0.0")

# For quick experimentation, take a small subset
train_size = 500
val_size = 100
small_train_dataset = dataset["train"].shuffle(seed=50).select(range(train_size))
small_val_dataset = dataset["validation"].shuffle(seed=50).select(range(val_size))

print("Dataset splits:", dataset.keys())
print("Example training record:", small_train_dataset[0])
print("Example validation record:", small_val_dataset[0])


Dataset splits: dict_keys(['train', 'validation', 'test'])
Example training record: {'article': '(CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to my clinic with frequent cold symptoms and infections. One way to see if her immune system functioned properly was to check her response to childhood vaccines. But because the girl\'s mother had decided against vaccinating her, I could not perform the vaccine blood test. My hands were tied behind my back; I had no way of knowing if there was a deficiency in her immune system. And if she did have a deficiency, she had no protection from the potentially deadly diseases the vaccines a

### Observation:
I was trying with Varying Train Size and Val Size and found that the Max Train Size is 287113 and Val Size is 13368. 



## Decoder‑only model: GPT‑2 fine‑tuning

A decoder‑only transformer must learn to generate a summary given an input article.  We use a prompt‑based approach: the input text has the form `"summarize: <article>"`, and the model is trained to produce the summary tokens.  To prevent the model from learning to predict the prompt tokens, we mask the loss on the prompt portion of the sequence (by setting corresponding labels to `-100`).

We use the `GPT‑2` tokenizer and model from Hugging Face.  Because GPT‑2 lacks a padding token by default, we add a pad token equal to the end‑of‑text token.  We then tokenize the inputs and construct labels accordingly.  The function below performs these steps and is mapped over the dataset.


In [57]:
# Load tokenizer and model for GPT-2
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)

# Add a padding token (GPT-2 does not have one)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Define the preprocessing function for GPT-2
def preprocess_gpt2(examples):
    prefix = "summarize: "
    inputs = [prefix + art for art in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs and targets together, pad to max_length
    model_inputs = gpt2_tokenizer(
        inputs, text_target=targets,
        max_length=512, truncation=True, padding="max_length"
    )

    # Ensure labels are padded to max_length as well
    if "labels" in model_inputs:
        labels = model_inputs["labels"]
        for i in range(len(labels)):
            labels[i] = labels[i] + [-100] * (512 - len(labels[i])) if len(labels[i]) < 512 else labels[i][:512]
        model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to the small datasets
train_gpt2 = small_train_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["train"].column_names)
val_gpt2 = small_val_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["validation"].column_names)

print("Sample tokenized GPT-2 input:")
print(gpt2_tokenizer.decode(train_gpt2[0]["input_ids"][:100]))


Map: 100%|██████████| 500/500 [00:01<00:00, 409.49 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 715.02 examples/s]

Sample tokenized GPT-2 input:
summarize: (CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to





In [59]:
# Load tokenizer and model for GPT-2
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)

# Add a padding token (GPT-2 does not have one)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Define the preprocessing function for GPT-2
def preprocess_gpt2(examples):
    prefix = "summarize: "
    inputs = [prefix + art for art in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs and targets together, pad to max_length
    model_inputs = gpt2_tokenizer(
        inputs, text_target=targets,
        max_length=512, truncation=True, padding="max_length"
    )

    # Ensure labels are padded to max_length as well
    if "labels" in model_inputs:
        labels = model_inputs["labels"]
        for i in range(len(labels)):
            labels[i] = labels[i] + [-100] * (512 - len(labels[i])) if len(labels[i]) < 512 else labels[i][:512]
        model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to the small datasets
train_gpt2 = small_train_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["train"].column_names)
val_gpt2 = small_val_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["validation"].column_names)

print("Sample tokenized GPT-2 input:")
print(gpt2_tokenizer.decode(train_gpt2[0]["input_ids"][:100]))

Map: 100%|██████████| 100/100 [00:00<00:00, 808.48 examples/s]

Sample tokenized GPT-2 input:
summarize: (CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to





#### The above run was executed after I had changed the train size and value size to the max. I see the number of Examples processed per second in this case has increased drastically compared to the previous run with train size 500  and val size 100. Also the execution took more time (4 mins and 20 Seconds)

In [41]:
# Load tokenizer and model for GPT-2
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)

# Add a padding token (GPT-2 does not have one)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Define the preprocessing function for GPT-2
def preprocess_gpt2(examples):
    prefix = "summarize: "
    inputs = [prefix + art for art in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs and targets together, pad to max_length
    model_inputs = gpt2_tokenizer(
        inputs, text_target=targets,
        max_length=100, truncation=True, padding="max_length"
    )

    # Ensure labels are padded to max_length as well
    if "labels" in model_inputs:
        labels = model_inputs["labels"]
        for i in range(len(labels)):
            labels[i] = labels[i] + [-100] * (512 - len(labels[i])) if len(labels[i]) < 512 else labels[i][:512]
        model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to the small datasets
train_gpt2 = small_train_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["train"].column_names)
val_gpt2 = small_val_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["validation"].column_names)

print("Sample tokenized GPT-2 input:")
print(gpt2_tokenizer.decode(train_gpt2[0]["input_ids"][:1000]))

Map: 100%|██████████| 287113/287113 [02:27<00:00, 1945.64 examples/s]
Map: 100%|██████████| 13368/13368 [00:08<00:00, 1618.98 examples/s]

Sample tokenized GPT-2 input:
summarize: (CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to





In [42]:
# Load tokenizer and model for GPT-2
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)

# Add a padding token (GPT-2 does not have one)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Define the preprocessing function for GPT-2
def preprocess_gpt2(examples):
    prefix = "summarize: "
    inputs = [prefix + art for art in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs and targets together, pad to max_length
    model_inputs = gpt2_tokenizer(
        inputs, text_target=targets,
        max_length=200, truncation=True, padding="max_length"
    )

    # Ensure labels are padded to max_length as well
    if "labels" in model_inputs:
        labels = model_inputs["labels"]
        for i in range(len(labels)):
            labels[i] = labels[i] + [-100] * (512 - len(labels[i])) if len(labels[i]) < 512 else labels[i][:512]
        model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to the small datasets
train_gpt2 = small_train_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["train"].column_names)
val_gpt2 = small_val_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["validation"].column_names)

print("Sample tokenized GPT-2 input:")
print(gpt2_tokenizer.decode(train_gpt2[0]["input_ids"][:1000]))

Map: 100%|██████████| 287113/287113 [02:58<00:00, 1608.06 examples/s]
Map: 100%|██████████| 13368/13368 [00:11<00:00, 1123.31 examples/s]

Sample tokenized GPT-2 input:
summarize: (CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to my clinic with frequent cold symptoms and infections. One way to see if her immune system functioned properly was to check her response to childhood vaccines. But because the girl's mother had decided against vaccinating her, I could not perform the vaccine blood test. My hands were tied behind my back; I had no way of knowing if there was a deficiency in her immune system. And if she did have a deficiency, she had no protection from the potentially deadly diseases the vaccines are designed to inoculate.





In [43]:
# Load tokenizer and model for GPT-2
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)

# Add a padding token (GPT-2 does not have one)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Define the preprocessing function for GPT-2
def preprocess_gpt2(examples):
    prefix = "summarize: "
    inputs = [prefix + art for art in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs and targets together, pad to max_length
    model_inputs = gpt2_tokenizer(
        inputs, text_target=targets,
        max_length=350, truncation=True, padding="max_length"
    )

    # Ensure labels are padded to max_length as well
    if "labels" in model_inputs:
        labels = model_inputs["labels"]
        for i in range(len(labels)):
            labels[i] = labels[i] + [-100] * (512 - len(labels[i])) if len(labels[i]) < 512 else labels[i][:512]
        model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to the small datasets
train_gpt2 = small_train_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["train"].column_names)
val_gpt2 = small_val_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["validation"].column_names)

print("Sample tokenized GPT-2 input:")
print(gpt2_tokenizer.decode(train_gpt2[0]["input_ids"][:1000]))

Map: 100%|██████████| 287113/287113 [02:46<00:00, 1727.32 examples/s]
Map: 100%|██████████| 13368/13368 [00:08<00:00, 1647.63 examples/s]

Sample tokenized GPT-2 input:
summarize: (CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to my clinic with frequent cold symptoms and infections. One way to see if her immune system functioned properly was to check her response to childhood vaccines. But because the girl's mother had decided against vaccinating her, I could not perform the vaccine blood test. My hands were tied behind my back; I had no way of knowing if there was a deficiency in her immune system. And if she did have a deficiency, she had no protection from the potentially deadly diseases the vaccines are designed to inoculate. As it turned out, she didn't ha




In [49]:
# Load tokenizer and model for GPT-2
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)

# Add a padding token (GPT-2 does not have one)
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Define the preprocessing function for GPT-2
def preprocess_gpt2(examples):
    prefix = "summarize: "
    inputs = [prefix + art for art in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs and targets together, pad to max_length
    model_inputs = gpt2_tokenizer(
        inputs, text_target=targets,
        max_length=512, truncation=True, padding="max_length"
    )

    # Ensure labels are padded to max_length as well
    if "labels" in model_inputs:
        labels = model_inputs["labels"]
        for i in range(len(labels)):
            labels[i] = labels[i] + [-100] * (512 - len(labels[i])) if len(labels[i]) < 512 else labels[i][:512]
        model_inputs["labels"] = labels
    return model_inputs

# Apply preprocessing to the small datasets
train_gpt2 = small_train_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["train"].column_names)
val_gpt2 = small_val_dataset.map(preprocess_gpt2, batched=True, remove_columns=dataset["validation"].column_names)

print("Sample tokenized GPT-2 input:")
print(gpt2_tokenizer.decode(train_gpt2[0]["input_ids"][:1000]))

Sample tokenized GPT-2 input:
summarize: (CNN) -- The worst measles outbreak in 20 years continues to grow. Most recently, there was a new case in Ohio, where there are already 377 cases. So far this year, measles has infected 593 people in 21 states. As a physician who treats children with compromised immune systems and as a mother of 2-year-old twins -- too young to be fully vaccinated, I am deeply concerned. A few months ago, a 4-year-old girl came to my clinic with frequent cold symptoms and infections. One way to see if her immune system functioned properly was to check her response to childhood vaccines. But because the girl's mother had decided against vaccinating her, I could not perform the vaccine blood test. My hands were tied behind my back; I had no way of knowing if there was a deficiency in her immune system. And if she did have a deficiency, she had no protection from the potentially deadly diseases the vaccines are designed to inoculate. As it turned out, she didn't ha

# Observation:
#### With a Max Length to 512, Following is the result:
- Map: 100%|██████████| 287113/287113 [03:57<00:00, 1207.51 examples/s]
- Map: 100%|██████████| 13368/13368 [00:21<00:00, 617.03 examples/s]
#### Max Length changed to 100, Following is the result:
- Map: 100%|██████████| 287113/287113 [02:27<00:00, 1945.64 examples/s]
- Map: 100%|██████████| 13368/13368 [00:08<00:00, 1618.98 examples/s]
#### Max Length changed to 200, Following is the result:
- Map: 100%|██████████| 287113/287113 [02:58<00:00, 1608.06 examples/s]
- Map: 100%|██████████| 13368/13368 [00:11<00:00, 1123.31 examples/s]
#### Max Length changed to 350, Following is the result:
- Map: 100%|██████████| 287113/287113 [02:46<00:00, 1727.32 examples/s]
- Map: 100%|██████████| 13368/13368 [00:08<00:00, 1647.63 examples/s]

Max Length	Train Speed (examples/s)	Train Time	Val Speed (examples/s)	Val Time
- 512	    1207 /s	                    ~3m57s	    617 /s	                ~21s
- 100	    1945 /s	                    ~2m27s	    1619 /s	                ~8s
- 200	    1608 /s	                    ~2m58s	    1123 /s	                ~11s
- 350	    1727 /s	                    ~2m46s	    1648 /s	                ~8s

### Analysis
#### At Max Length = 512
-	Slowest preprocessing (only ~1207 examples/s).
-	Longer sequences → more tokens per example → heavier tokenization work.
-	Ensures full context of article/summary pairs, but at cost of speed.
-	Useful when articles are long and you don’t want to truncate important info.
________________________________________
#### At Max Length = 100
-	Fastest preprocessing (1945 examples/s for training, 1619 for validation).
-	Very short sequences → quick to tokenize & pad.
-	Risk: losing lots of article information due to truncation → summaries may be incomplete.
-	Likely faster training too, but quality of generated summaries may degrade.
________________________________________
#### At Max Length = 200
-	Middle ground. Processing speed drops compared to 100 (1608 examples/s).
-	Still much faster than 512, while retaining more context.
-	Balances efficiency and retaining content.
________________________________________
#### At Max Length = 350
-	Interesting result: faster than 200 (1727 examples/s vs 1608).
-	Likely due to batching and memory usage quirks: padding aligns better with GPU/CPU batch sizes → can sometimes improve speed.
-	Validation also ran very fast (1647 /s).
________________________________________
### Analysis
-	Speed vs Context Trade-off:
-	Shorter max length = faster preprocessing, but higher risk of truncating data.
-	Longer max length = more accurate context preserved, but slower preprocessing.
####	Practical Implications:
-	If articles are mostly short (<100 tokens), a max length of 100 is enough → fastest and efficient.
-	If articles are long (200–400 tokens), using 100 would chop critical info → bad for summarization quality.
-	350 seems like a sweet spot: retains substantial context while still faster than 200.
-	512 is safest for data quality but slowest.

###	Training Phase Impact:
-	Shorter max length = smaller training batches in memory = faster iterations.
-	Longer max length = larger memory use, slower updates, possibly fewer steps per epoch.
________________________________________
### Final Takeaway
-	100 → fastest, but risky (likely poor summaries).
-	200 → balance, but slower than 350.
-	350 → best compromise: retains more info, processes quickly (faster than 200!).
-	512 → maximum context preserved, but slowest.



# Observation: Input max length = 512 and varied label padding length (128, 256, 384, 512). 
________________________________________
#### Results Recap
#### Input Max Length	Label Max Length	Train Speed (examples/s)	Train Time	Val Speed (examples/s)	Val Time
- 512	128	1347/s	~3m33s	1367/s	~9s
- 512	256	1161/s	~4m07s	1220/s	~10s
- 512	384	926/s	~5m09s	433/s	~30s
- 512	512	1191/s	~4m01s	1596/s	~8s
________________________________________
### Observations
#### Labels = 128
-	Fastest training (1347/s).
-	Summaries truncated/padded to 128 tokens only.
-	Good efficiency, but risk of information loss if many summaries exceed 128 tokens.
________________________________________
#### Labels = 256
-	Slower (1161/s).
-	Retains longer summaries than 128.
-	Balanced option, but not as efficient as 128 or 512.
________________________________________
#### Labels = 384
-	Slowest overall (926/s train, 433/s validation).
-	Big performance hit, especially in validation (30 seconds vs 8–10s).
-	Likely because padding to an “awkward” middle length increases memory use and decreases batching efficiency.
-	Preserves long summaries, but inefficient.
________________________________________
#### Labels = 512
-	Surprisingly faster than 256 and 384 (1191/s train, 1596/s validation).
-	Why? Likely due to implementation optimization when inputs and labels match length (both 512). Vectorized better.
-	Safest choice for data quality (no truncation), but uses most memory during model training.
________________________________________
### Analysis
####	Speed vs Quality Trade-off
-	Shorter labels (128) → fastest, but summaries get cut → hurts model quality.
-	Longer labels (512) → more context preserved, reasonable speed → better for accuracy.
-	384 is the worst compromise: slower and not significantly more informative than 256 or 512.
####	Consistency Helps
-	When both inputs and labels are padded to 512, the pipeline runs more efficiently, even though it processes more tokens.
####	Validation Bottleneck at 384
-	Validation slowed down massively at 384 → likely due to padding mismatch across batches (less efficient batching).
________________________________________
### Final Takeaway
-	128 → Best for speed, worst for quality.
-	256 → Middle ground, okay if summaries are usually short.
-	384 → Worst choice (slow + inefficient).
-	512 → Best overall balance: preserves all summary info, surprisingly efficient in preprocessing, but expect heavier GPU memory use when training.



### GPT‑2 Fine Tuning

We use the Hugging Face `Trainer` API to fine‑tune the GPT‑2 model.  A `DataCollatorForLanguageModeling` automatically pads the inputs and labels and performs dynamic masking where appropriate (although in our custom loss masking we already set `-100` values).  The training arguments below specify a small number of epochs and batch sizes for illustration; adjust these for a full training run.


In [12]:
pip install --upgrade "accelerate>=0.26.0"

Note: you may need to restart the kernel to use updated packages.


### Loss on All tokens

In [61]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
# Define data collator
data_collator_gpt2 = DataCollatorForLanguageModeling(tokenizer=gpt2_tokenizer, mlm=False)

# Load the GPT-2 model
gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_model_name)

# Training arguments
training_args_gpt2 = TrainingArguments(
    output_dir="./gpt2-summarization",
    eval_strategy="steps",
    eval_steps=100,
    logging_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=500,
    save_total_limit=1,
    warmup_steps=50,
    gradient_accumulation_steps=4,
    fp16=torch.cuda.is_available(),
    report_to=[],  # disable logging to wandb
)

# Create Trainer for GPT-2
trainer_gpt2 = Trainer(
    model=gpt2_model,
    args=training_args_gpt2,
    train_dataset=train_gpt2,
    eval_dataset=val_gpt2,
    data_collator=data_collator_gpt2,
)

# Uncomment the line below to train; training can take several minutes even on small subsets
trainer_gpt2.train()


{'train_runtime': 1628.2215, 'train_samples_per_second': 0.307, 'train_steps_per_second': 0.039, 'train_loss': 3.2344224717881946, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=3.2344224717881946, metrics={'train_runtime': 1628.2215, 'train_samples_per_second': 0.307, 'train_steps_per_second': 0.039, 'train_loss': 3.2344224717881946, 'epoch': 1.0})

### Summary-only loss

In [65]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorWithPadding,   # <-- use this when labels are prebuilt
    Trainer,
    TrainingArguments,
)
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# --------------------------------------------------
# 1) Tokenizer & model setup (pad + special SEP)
# --------------------------------------------------
gpt2_model_name = "gpt2"
gpt2_tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)
# GPT-2 has no pad token → reuse EOS
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Add a separator token to split article vs summary
SEP = "<|sep|>"
if SEP not in gpt2_tokenizer.get_vocab():
    gpt2_tokenizer.add_special_tokens({"additional_special_tokens": [SEP]})

gpt2_model = AutoModelForCausalLM.from_pretrained(gpt2_model_name)
# IMPORTANT: resize embeddings after adding tokens
gpt2_model.resize_token_embeddings(len(gpt2_tokenizer))

# --------------------------------------------------
# 2) Preprocessing: SUMMARY-ONLY loss w/ length budget
#    - masks loss on prompt+article, trains only on summary
#    - reserves space (max_new_tokens) so summary isn’t truncated
# --------------------------------------------------
def preprocess_gpt2_summary_only(examples, max_length=256, max_new_tokens=128):
    prefix = "summarize: "
    sep_id = gpt2_tokenizer.convert_tokens_to_ids(SEP)
    pad_id = gpt2_tokenizer.pad_token_id

    input_ids_list, attn_mask_list, labels_list = [], [], []

    for art, tgt in zip(examples["article"], examples["highlights"]):
        # tokenize separately so we can budget space
        prompt_ids = gpt2_tokenizer.encode(prefix, add_special_tokens=False)
        art_ids    = gpt2_tokenizer.encode(art,   add_special_tokens=False)
        tgt_ids    = gpt2_tokenizer.encode(tgt,   add_special_tokens=False)

        # reserve room for SEP + summary
        max_src = max_length - max_new_tokens - 1  # -1 for SEP
        # trim article so prompt+article fit in max_src
        art_ids = art_ids[: max(0, max_src - len(prompt_ids))]

        # build final input
        input_ids = prompt_ids + art_ids + [sep_id] + tgt_ids
        input_ids = input_ids[:max_length]  # hard cap

        # find SEP and mask everything up to SEP (inclusive)
        sep_pos = input_ids.index(sep_id) if sep_id in input_ids else len(input_ids) - 1
        labels = [-100] * (sep_pos + 1) + input_ids[sep_pos + 1:]

        # pad to max_length
        attn_mask = [1] * len(input_ids)
        if len(input_ids) < max_length:
            pad_len = max_length - len(input_ids)
            input_ids += [pad_id] * pad_len
            labels    += [-100]   * pad_len
            attn_mask += [0]      * pad_len

        input_ids_list.append(input_ids)
        attn_mask_list.append(attn_mask)
        labels_list.append(labels)

    return {
        "input_ids": input_ids_list,
        "attention_mask": attn_mask_list,
        "labels": labels_list,
    }

# --------------------------------------------------
# 3) Map your raw datasets → tokenized datasets
#    Replace these two lines with whatever your raw sets are called.
#    If you already have small_train_dataset/small_val_dataset, use them here.
# --------------------------------------------------
# Example: using `small_train_dataset` / `small_val_dataset` produced earlier
train_gpt2 = small_train_dataset.map(
    lambda x: preprocess_gpt2_summary_only(x, max_length=256, max_new_tokens=128),
    batched=True,
    remove_columns=small_train_dataset.column_names,
)
val_gpt2 = small_val_dataset.map(
    lambda x: preprocess_gpt2_summary_only(x, max_length=256, max_new_tokens=128),
    batched=True,
    remove_columns=small_val_dataset.column_names,
)

# --------------------------------------------------
# 4) Collator: simple padding (labels already set)
#    Do NOT use DataCollatorForLanguageModeling here.
# --------------------------------------------------
data_collator = DataCollatorWithPadding(gpt2_tokenizer, pad_to_multiple_of=8)

# IMPORTANT: use the correct key: evaluation_strategy (not eval_strategy)
training_args_gpt2 = Seq2SeqTrainingArguments(
    output_dir="./gpt2-summarization",
    eval_strategy="steps",   # <-- fix this key
    eval_steps=100,
    logging_steps=100,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    save_steps=500,
    save_total_limit=1,
    warmup_steps=50,
    gradient_accumulation_steps=4,
    fp16=torch.cuda.is_available(),
    gradient_checkpointing=True,
    predict_with_generate=True,    # now valid
    generation_max_length=128,
    generation_num_beams=1,
    report_to=[],
)

# if you prebuilt labels (summary-only loss), use DataCollatorWithPadding
# data_collator = DataCollatorWithPadding(gpt2_tokenizer, pad_to_multiple_of=8)

trainer_gpt2 = Seq2SeqTrainer(
    model=gpt2_model,
    args=training_args_gpt2,
    train_dataset=train_gpt2,
    eval_dataset=val_gpt2,
    data_collator=data_collator,   # your pad collator
    tokenizer=gpt2_tokenizer,
)
# --------------------------------------------------
# 7) Train
# --------------------------------------------------
trainer_gpt2.train()

Map: 100%|██████████| 500/500 [00:01<00:00, 361.65 examples/s]
Map: 100%|██████████| 100/100 [00:00<00:00, 426.54 examples/s]
  trainer_gpt2 = Seq2SeqTrainer(


{'train_runtime': 825.8914, 'train_samples_per_second': 0.605, 'train_steps_per_second': 0.076, 'train_loss': 3.0960211375403026, 'epoch': 1.0}


TrainOutput(global_step=63, training_loss=3.0960211375403026, metrics={'train_runtime': 825.8914, 'train_samples_per_second': 0.605, 'train_steps_per_second': 0.076, 'train_loss': 3.0960211375403026, 'epoch': 1.0})

### The above runs was to train GPT-2 to calcualte LOSS for all Tokens V/S LOSS for only Summary  
### These runs tell about objective choice and context/summary budget for GPT-2 summarization.

### Quick roll-up
#### Run	Objective	max_length	max_new_tokens	Train runtime	Steps/s	Samples/s	Train loss
- R0	All tokens	(assumed 512)	—	27m 10s	0.039	0.307	3.2344
- R1	Summary-only	512	128	46m 19s	0.023	0.180	2.6723
- R2	Summary-only	128	64	14m 25s	0.073	0.581	3.4784
- R3	Summary-only	128	64	13m 50s	0.076	0.605	3.0960

(Two near-identical R2/R3 runs show normal seed/shuffle variance.)

Inference
1) Objective matters (and summary-only is the right one)

Summary-only loss (R1) is much lower than all-tokens loss (R0) at similar context (≈512).

We're optimizing exactly what we care about—predicting the summary—instead of wasting loss on the prompt+article tokens. This usually yields less copying and better ROUGE once you evaluate with generation.

Note: the numeric losses aren’t apples-to-apples (you average over different token sets), but the direction is what you want: focusing loss on summaries helps the model learn the task.

2) Context (max_length) and summary budget (max_new_tokens) trade speed vs fit

Shorter context (R2/R3: 128/64) is ~3.2× faster than long context (R1: 512/128), but train loss is higher (3.10–3.48 vs 2.67).

With less of the article available before <|sep|>, the model has less evidence to predict the summary tokens.

Bigger summary budget helps. R1 (max_new_tokens=128) beats R2/R3 (64). Longer targets let the model learn fuller summary structure instead of being forced into ultra-short endings.

3) Your 512-token summary-only run is slower than the all-tokens run

R1 steps/s (0.023) < R0 (0.039). That’s expected if, for R1, you used Seq2SeqTrainer with predict_with_generate=True (generation at eval steps adds overhead) and/or gradient checkpointing. The forward/backward still processes 512 tokens either way.

4) Run-to-run variance at short context is real

R2 vs R3 differ by ~0.38 in train loss at identical hyperparams. Lock a seed (seed=42, data_seed=42) or average 2–3 runs when reporting.

#### Takeaways:

Aligning the loss with the task (summary-only) gives a clear quality win and should translate to higher ROUGE/BERTScore and fewer verbatim copies.

Context matters: cutting max_length from 512→128 triples throughput but hurts fit; useful when you need speed or your articles are short.

Reserve target budget: larger max_new_tokens (e.g., 128 vs 64) improves learning of complete summaries.

Choose a sweet spot: if your articles are often >128 tokens, try max_length=256 or 384 with max_new_tokens=128—commonly a strong speed/quality balance.

In [74]:
# Test the model with a sample input
sample_input = "summarize: Have you considered how and when you'll withdraw from your retirement accounts? Doing so in the incorrect order may end up costing you."
inputs = gpt2_tokenizer(sample_input, return_tensors="pt").to(device)
outputs = gpt2_model.generate(inputs["input_ids"], max_length=50, num_return_sequences=1)
print("Generated summary:", gpt2_tokenizer.decode(outputs[0], skip_special_tokens=True))



Generated summary: summarize: Have you considered how and when you'll withdraw from your retirement accounts? Doing so in the incorrect order may end up costing you.

Do you have a plan to retire early?

Do you have a plan to retire


## What do you observe in the output?
1. Can you postprocess the output so that it only starts printing after the input sequence?
2. Can you iterate and improve the summary quality?

#### Postprocessing the the output so that it only starts printing after the input sequence?

In [75]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load GPT-2 tokenizer and model
gpt2_model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(gpt2_model_name)
model = AutoModelForCausalLM.from_pretrained(gpt2_model_name)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Input text (article you want to summarize)
input_text = """Article: 
Have you considered how and when you'll withdraw from your retirement accounts? Doing so in the incorrect order may end up costing you.
Summary:"""

# Encode input
inputs = tokenizer(input_text, return_tensors="pt")

# Generate output
outputs = model.generate(
    **inputs,
    max_new_tokens=80,               # generate summary tokens only
    do_sample=True,                  # sampling instead of greedy
    top_k=50,                        # top-k sampling
    top_p=0.9,                       # nucleus sampling
    temperature=0.7,                 # smoothness
    num_return_sequences=3,          # generate multiple candidates
    eos_token_id=tokenizer.eos_token_id,
)

# Post-process: only keep the "new" tokens after the input
generated_summaries = []
for output in outputs:
    gen_text = tokenizer.decode(output, skip_special_tokens=True)
    # Trim the input prefix → only keep the continuation
    continuation = gen_text[len(input_text):].strip()
    generated_summaries.append(continuation)

print("Generated Summaries:\n")
for i, summ in enumerate(generated_summaries, 1):
    print(f"{i}. {summ}\n")

Generated Summaries:

1. I've always thought it was important to ensure that I was able to afford a comfortable retirement.
However, a number of factors have come into play.
First, it's important to note that, while I have a retirement account in my name, I'm not entitled to any retirement benefits.
If I have an annuity (e.g. 401(k) or 403(

2. Your account will not be closed until the end of your retirement. If you choose not to withdraw from your account, you will be able to withdraw your money from your bank account at any time. If you withdraw your money from a retirement account, you may withdraw your money from your bank account only once or twice. You must also agree to pay all taxes on your withdrawal.
What are my

3. We offer you the option to withdraw your retirement account at any time during the first month of your plan's life. You can withdraw your retirement account at any time during the first month of your plan's life, at any time before your first month of your plan's 

#### Iterate and improve the summary quality

In [76]:
# Simple heuristic: pick the shortest valid summary
best_summary = min(generated_summaries, key=lambda x: len(x.split()))
print("Best summary:", best_summary)

Best summary: I've always thought it was important to ensure that I was able to afford a comfortable retirement.
However, a number of factors have come into play.
First, it's important to note that, while I have a retirement account in my name, I'm not entitled to any retirement benefits.
If I have an annuity (e.g. 401(k) or 403(



## Evaluation

After fine‑tuning the models (training steps are commented out by default), we evaluate them on the validation subset. Different metrics are appropriate for each architecture:

* **GPT‑2** (decoder‑only): We generate summaries using greedy decoding and compute ROUGE metrics (ROUGE‑1, ROUGE‑2, ROUGE‑L). We also compute perplexity using the loss returned by the trainer.

* **BERT** (encoder‑only): BERT is not designed to generate full sequences; instead we use it for downstream tasks such as text classification. For a classification scenario, the evaluation metrics are typically confusion matrix and F1-score.

* **T5** (encoder‑decoder): We generate summaries using greedy decoding and compute ROUGE metrics.  Perplexity is computed similarly to GPT‑2 by exponentiating the validation loss.

The code below demonstrates evaluation routines for each model. Running these functions requires trained models; if you skipped training above, the evaluation will use the pre‑trained weights and therefore will not yield good summarization quality.


In [67]:
!pip install nltk rouge-score absl-py

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting absl-py
  Downloading absl_py-2.3.1-py3-none-any.whl.metadata (3.3 kB)
Collecting click (from nltk)
  Downloading click-8.2.1-py3-none-any.whl.metadata (2.5 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
   ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
   --------------------------- ------------ 1.0/1.5 MB 6.3 MB/s eta 0:00:01
   ---------------------------------------- 1.5/1.5 MB 6.1 MB/s  0:00:00
Downloading absl_py-2.3.1-py3-none-any.whl (135 kB)
Downloading click-8.2.1-py3-none-

# Evaluation

In [72]:
from transformers import GenerationConfig

# make sure GPT-2 can pad
if gpt2_tokenizer.pad_token is None:
    gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token


# Define ROUGE metric
evaluate_rouge = evaluate.load("rouge")

def compute_metrics_rouge(preds, refs):
    # Compute ROUGE scores; use newline separation between sentences in each text
    result = evaluate_rouge.compute(predictions=preds, references=refs, use_stemmer=True)
    return {k: round(v * 100, 2) for k, v in result.items()}

# Function to generate summaries with GPT-2
def evaluate_gpt2(model, tokenizer, dataset, num_samples=10):
    model.eval()
    preds, refs = [], []
    for i, example in enumerate(dataset.select(range(num_samples))):
        prompt = "summarize: " + example["article"]
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128).to(model.device)
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_length=512)
        summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        preds.append(summary)
        refs.append(example["highlights"])
    rouge_scores = compute_metrics_rouge(preds, refs)
    return rouge_scores

# Function to compute perplexity from evaluation loss
def compute_perplexity(eval_output):
    loss = eval_output["eval_loss"]
    return round(torch.exp(torch.tensor(loss)).item(), 3)

# Evaluate GPT-2 (if trained) -- example usage
gpt2_eval_results = trainer_gpt2.evaluate()
gpt2_perplexity = compute_perplexity(gpt2_eval_results)
rouge_gpt2 = evaluate_gpt2(gpt2_model, gpt2_tokenizer, small_val_dataset)
print("GPT-2 Perplexity:", gpt2_perplexity)
print("GPT-2 ROUGE:", rouge_gpt2)


{'eval_loss': 2.5808355808258057, 'eval_runtime': 40.2239, 'eval_samples_per_second': 2.486, 'eval_steps_per_second': 1.243, 'epoch': 1.0}
GPT-2 Perplexity: 13.208
GPT-2 ROUGE: {'rouge1': np.float64(14.32), 'rouge2': np.float64(6.07), 'rougeL': np.float64(10.11), 'rougeLsum': np.float64(12.31)}


## Evaluation GPT-2 
### GPT-2 (Decoder-only LM)
####	Loss / Perplexity:
-	Eval loss: 2.58 → Perplexity ≈ 13.2 (higher = more uncertain predictions).
####	ROUGE scores (summarization quality):
-	Rouge-1 = 14.32
-	Rouge-2 = 6.07
-	Rouge-L = 10.11
-	Rouge-Lsum = 12.31
####	Inference:
-	GPT-2 can generate fluent text but struggles to capture summary content fidelity.
-	Low ROUGE scores suggest poor overlap with reference summaries.
-	Slowest runtime among the three (40s, ~2.5 samples/sec).



## Analysis and discussion

After fine‑tuning the models and running the evaluation routines, you should fill in a comparison of the results.  Typical observations include:

- **Decoder‑only (GPT‑2):** GPT‑2 fine‑tuned on a summarization corpus learns to generate coherent summaries.  Its perplexity should decrease significantly compared with the pre‑trained model, and ROUGE scores should improve.  Because GPT‑2 has no separate encoder, it must memorize how to map the input prompt to the desired output, which can make training less sample‑efficient for conditional tasks.  However, at inference time GPT‑2 generates outputs quickly via a single decoder.

- **Encoder‑only (BERT):** BERT excels at understanding tasks but struggles with generative tasks.  MLM fine‑tuning improves its perplexity on the article‑summary text, but it cannot generate full summaries.  The `fill‑mask` pipeline can fill individual tokens, but the lack of an auto‑regressive decoder makes long‑form generation impractical.  This illustrates why encoder‑only architectures are not suited for free‑form text generation.

- **Encoder‑decoder (T5):** T5 is designed for text‑to‑text tasks and typically achieves the best summarization scores among the three models when fine‑tuned properly.  Its separate encoder compresses the input, and the decoder generates output conditioned on the encoded context.  T5 often yields higher ROUGE scores and lower perplexity than GPT‑2 on summarization because the architecture explicitly models conditional generation.  The trade‑off is increased computational cost due to the encoder and decoder.

### Chain‑of‑thought (CoT) reasoning

Chain‑of‑thought reasoning refers to models generating intermediate reasoning steps before arriving at a final answer.  Decoder‑only models (like GPT‑2 and GPT‑3) naturally support CoT prompting because they generate text token by token.  Encoder‑decoder models like T5 can also perform CoT when prompted appropriately (e.g. instructing the model to "think step by step").  Encoder‑only models lack a decoding mechanism and therefore are not directly applicable to CoT generation.  In practice, CoT reasoning quality improves with larger models and more sophisticated training (e.g. instruction‑tuning or reinforcement learning with human feedback), which are beyond the scope of this introductory exercise.

## Conclusion

In this notebook we implemented and compared three transformer architectures on a common summarization task using the CNN/DailyMail dataset.  We demonstrated how to fine‑tune a decoder‑only model (GPT‑2), an encoder‑only model (BERT), and an encoder‑decoder model (T5).  The code illustrated data preprocessing, training setups, and evaluation routines using ROUGE and perplexity metrics.  While only small subsets of the dataset were used for demonstration purposes, you should expand the training data and adjust hyperparameters for a thorough experiment.  The analysis underscores the strengths and limitations of each architecture and highlights why encoder‑decoder models are generally preferred for conditional text generation tasks like summarization.


# Assignment 2: Transformer Architecture Exercise
Use this notebook as a starting point and expand on your understanding of transformer models by completing the following structured tasks. You are encouraged to experiment, analyze, and critically reflect on your findings in your report.

## Part 1: Model Training & Implementation
### 1. Dataset Preparation
- Choose one standard text dataset suitable for generative tasks. Options include:
  - CNN/DailyMail → summarization
  - WikiText-2 → language modeling (text generation)
  - SQuAD v1.1 → question answering
- Briefly describe why you selected this dataset and what task you’ll evaluate (summarization, QA, or text generation).
- Show how you preprocessed the data (tokenization, train/val split, max length, etc.).

### 2. Model Implementation

Implement and train the following:
- Decoder-only model (GPT-style): e.g., GPT-2 small from Hugging Face.
- Encoder-only model (BERT-style): e.g., BERT-base, used for masked-language-modeling or extractive QA/summarization.
- Encoder-decoder model (T5-style): e.g., T5-small, trained for the same dataset/task as the other two.

### 3. Training Documentation

- Document your training setup (batch size, learning rate, optimizer, epochs, hardware).
- Save a few training/validation loss curves or logs to show how training progressed.
- Mention any difficulties you faced and how you addressed them (e.g., memory limits, convergence).

## Part 2: Evaluation & Analysis

### 4. Performance Evaluation

- Evaluate all three models on the same task.
- Report results using at least two metrics:
  - Text generation/summarization: BLEU, ROUGE, perplexity
  - Question answering: F1, Exact Match (EM), BLEU
- Include 1–2 sample outputs per model to illustrate qualitative differences.

### 5. Comparative Discussion

- Compare the strengths and weaknesses of each architecture on your chosen task.
- Suggested angles:

  - Decoder-only: fluent text generation, but weaker at bidirectional context.
  - Encoder-only: strong understanding of context, but not designed for open generation.
  - Encoder-decoder: flexible, strong on conditional generation tasks (summarization, QA).

- Which model seemed easiest to fine-tune?
- Which produced the best outputs on your dataset?
- Which was the most efficient (speed, memory)?

### 6. Reflections on Applicability

- In what real-world scenarios would you prefer each architecture?
- Briefly note whether you think CoT reasoning would have helped these models if you had added it (conceptual discussion only—no experiments required).