# CSC 8614 - Language Models
## CI2 - Fine-tuning a language model for text classification

In this TP, you will work on fine-tuning a language model to move from text generation to text classification, specifically working on Spam Detection.

The exercise (and code) has been adapted from the book _Build a Large Language Model (From Scratch)_, by Sebastian Raschka, and its [official GitHub repository](https://github.com/rasbt/LLMs-from-scratch).

This TP will be done in this notebook, and requires some additional files (available from the course website). You will have to fill the missing portions of code, and perform some additional experiments by testing different parameters.

Working on this TP:
- The easiest way is probably to work directly in the notebook, using Jupyter Notebook or Visual Studio Code. An alternative is to use Google Colab.
- You should be able to run everything on your machine, but you can connect to the GPUs if needed.

Some files are required, and are available on the course website:
- `requirements.txt`
- `gpt_utils.py`


## About the report
You will have to return this notebook (completed), as well as a mini-report (`TP2/rapport.md`).

The notebook and report shall be submitted via a GitHub repository, similarly to what you did for the first session (remember to use a different folder: `TP2`).
For the notebook, it is sufficient to complete the code and submit the final version.

For the mini-report, you have to answer the questions asked in this notebook, and discuss some of your findings as requested.
As for the first session:
- "You must include: short answers, observed results (copies of outputs), the requested screenshots, and a brief interpretation."
- "Do not paste entire pages: be concise and select the relevant elements."

Reproducibility: 
- fix a random seed and write it in the report
- indicate in the report the Python version, the OS, and the library versions.
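
The two reproducibility items above can be covered by a short helper cell. What follows is only a minimal sketch: `set_seed`, `environment_report`, the seed value, and the package list are illustrative names and choices, not something imposed by the course.

```python
import platform
import random
from importlib import metadata

SEED = 123  # example value; report whichever seed you actually use


def set_seed(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def environment_report(packages=("torch", "tiktoken", "pandas")) -> str:
    """Return a few lines describing Python, the OS, and package versions."""
    lines = [
        f"Python: {platform.python_version()}",
        f"OS: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


set_seed(SEED)
print(f"Seed: {SEED}")
print(environment_report())
```

The printed lines can be pasted as-is into the header of the report.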

**Question 1**: In `TP2/rapport.md`, immediately add a short header (a few lines) containing: (i) your last name/first name, (ii) the environment installation/activation command you used, (iii) the versions (Python + main libraries).

Then, as you progress through the TP, add sections/headings as you see fit, as long as your answers and your evidence of execution can be clearly found.

In [3]:
# [Instructor code: install requirements]
!pip install -r requirements.txt






## Preparing the model

In [2]:
# --- [INSTRUCTOR CODE: load the model weights into memory] ---
import torch
import tiktoken
from gpt_utils import GPTModel, download_and_load_gpt2, load_weights_into_gpt

# Download the model weights (124M param version) / This function (which we put in gpt_utils) handles the downloading
settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2_weights")
print("Weights downloaded and loaded into memory.")

File already exists and is up-to-date: gpt2_weights\124M\checkpoint
File already exists and is up-to-date: gpt2_weights\124M\encoder.json
File already exists and is up-to-date: gpt2_weights\124M\hparams.json
File already exists and is up-to-date: gpt2_weights\124M\model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2_weights\124M\model.ckpt.index
File already exists and is up-to-date: gpt2_weights\124M\model.ckpt.meta
File already exists and is up-to-date: gpt2_weights\124M\vocab.bpe
Weights downloaded and loaded into memory.


`download_and_load_gpt2` returns two objects, `settings` and `params`, both read from the GPT-2 checkpoint made publicly available by OpenAI.

**Question 2**: What type is the object `settings`, and what is its structure (e.g. if it is a list, its length; if a dictionary, its keys, etc.)?

**Question 3**: What type is the object `params`, and what is its structure?

In [3]:
# Analyse `settings`
print("settings :")
print("type settings:", type(settings))

if isinstance(settings, dict):
    print("keys settings:", list(settings.keys()))
    print("\nDétails:")
    for k, v in settings.items():
        if isinstance(v, (int, float, str, bool, type(None))):
            short = v
        elif isinstance(v, (list, tuple)):
            short = f"{type(v).__name__}(len={len(v)})"
        elif hasattr(v, "shape"):
            short = f"{type(v).__name__}(shape={getattr(v,'shape',None)})"
        else:
            short = type(v).__name__
        print(f"- {k}: {short}")
elif isinstance(settings, (list, tuple)):
    print("len settings:", len(settings))
    print("type first:", type(settings[0]) if len(settings) > 0 else None)
else:
    print("repr settings :", repr(settings)[:800])

# Analyse `params`
print("params :")
print("type params:", type(params))

if isinstance(params, dict):
    print("keys:", list(params.keys()))
    print("sous structures :")
    for k, v in params.items():
        if isinstance(v, dict):
            print(f"- {k}: dict (keys sample={list(v.keys())[:10]})")
        elif torch.is_tensor(v):
            print(f"- {k}: tensor shape={tuple(v.shape)}, dtype={v.dtype}")
        else:
            print(f"- {k}: {type(v).__name__}")
elif isinstance(params, (list, tuple)):
    print("len params:", len(params))
    if len(params) > 0:
        v0 = params[0]
        print("type first:", type(v0))
else:
    print("repr params:", repr(params)[:800])



settings :
type settings: <class 'dict'>
keys settings: ['n_vocab', 'n_ctx', 'n_embd', 'n_head', 'n_layer']

Détails:
- n_vocab: 50257
- n_ctx: 1024
- n_embd: 768
- n_head: 12
- n_layer: 12
params :
type params: <class 'dict'>
keys: ['blocks', 'b', 'g', 'wpe', 'wte']
sous structures :
- blocks: list
- b: ndarray
- g: ndarray
- wpe: ndarray
- wte: ndarray


Look at the `GPTModel` in the file `gpt_utils.py`. In the `__init__` method, we have to pass a config (parameter `cfg`). 

**Question 4:** 
Analyse the `__init__` method, and check the required structure of the `cfg` parameter. Is the `settings` variable we obtained in the right format? If not, perform the mapping to convert the variable `settings` into a variable `model_config` with the right structure.

In [4]:
# Configure the model, mapping OpenAI specific keys to our model's keys (if needed)

model_config = {
    "vocab_size": settings["n_vocab"],
    "context_length": settings["n_ctx"],
    "emb_dim": settings["n_embd"],
    "n_heads": settings["n_head"],
    "n_layers": settings["n_layer"],
    "drop_rate": 0.1,
    "qkv_bias": True,
}

In [5]:
model = GPTModel(model_config)

# Load the pre-trained weights
load_weights_into_gpt(model, params)
model.eval() 

print("GPT-2 Model Loaded and Configured successfully!")

GPT-2 Model Loaded and Configured successfully!


## Preparing the data

Context from the lecture: The raw data is just text messages. 

The model needs numbers (token IDs). We also need to pad the messages so they are all the same length in a batch.

We will use a `SpamDataset` class (provided below) to tokenize the text.
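
To make the truncate-then-pad step concrete before reading the class, here is a small standalone sketch. The helper name `pad_or_truncate` and the toy token IDs are made up for illustration; only the padding ID 50256 is taken from the default used in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_token_id):
    """Clip a token list to max_length, then right-pad it to exactly max_length."""
    clipped = list(token_ids[:max_length])
    return clipped + [pad_token_id] * (max_length - len(clipped))


# Two toy "messages" of different lengths (the IDs are arbitrary).
batch = [[11, 22, 33], [44, 55, 66, 77, 88, 99, 111]]
padded = [pad_or_truncate(ids, max_length=5, pad_token_id=50256) for ids in batch]

for row in padded:
    print(len(row), row)
```

After this step every row has the same length, which is what allows a batch to be stacked into a single tensor.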

In [7]:
# --- [INSTRUCTOR CODE: Run this cell to define the Dataset Class] ---
from torch.utils.data import Dataset
import pandas as pd
import urllib.request
import zipfile
import os

class SpamDataset(Dataset):
    def __init__(self, csv_file, tokenizer, max_length=120, pad_token_id=50256):
        self.data = pd.read_csv(csv_file)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.pad_token_id = pad_token_id
        # Encode labels: "spam" -> 1, "ham" -> 0
        self.data["label_encoded"] = self.data["Label"].map({"spam": 1, "ham": 0})

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.iloc[idx]["Text"]
        label = self.data.iloc[idx]["label_encoded"]
        # Tokenize
        encoded = self.tokenizer.encode(text, allowed_special={'<|endoftext|>'})       
        # Truncate if too long
        encoded = encoded[:self.max_length]
        # Pad if too short
        pad_len = self.max_length - len(encoded)
        encoded += [self.pad_token_id] * pad_len
        return torch.tensor(encoded, dtype=torch.long), torch.tensor(label, dtype=torch.long)

# Download the dataset zip file
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"
extract_path = "sms_spam_collection"
data_file_path = os.path.join(extract_path, "SMSSpamCollection")
if not os.path.exists(zip_path):
    print("Downloading dataset...")
    urllib.request.urlretrieve(url, zip_path)
    print("Download complete.")
# Unzip
if not os.path.exists(extract_path):
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_path)
# Read the TSV file
df = pd.read_csv(
    data_file_path, 
    sep="\t", 
    header=None, 
    names=["Label", "Text"]
)
print(f"Total samples loaded: {len(df)}")

# 4. Create Train/Test Split (80 train / 20 test)
df = df.sample(frac=1, random_state=123).reset_index(drop=True)
# Split index
split_idx = int(0.8 * len(df))

# TODO: if needed (for performance reasons), you can come back here and reduce the size of the training set.
train_df = df.iloc[:split_idx][:2000]  # Remove the trailing [:2000] to train on the full 80% split
test_df = df.iloc[split_idx:]

# Save as CSVs, so the SpamDataset class can read them.
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print("Created 'train.csv' and 'test.csv' successfully!")
print(f"Train size: {len(train_df)}")
print(f"Test size: {len(test_df)}")

Total samples loaded: 5572
Created 'train.csv' and 'test.csv' successfully!
Train size: 2000
Test size: 1115


**Question 5.1**: In the cell above, why did we do `df = df.sample(frac=1, random_state=123)` when creating the train/test split?

**Question 5.2**: Analyse the datasets, what is the distribution of the two classes in the train set? Are they balanced or unbalanced? In case they are unbalanced, might this lead to issues for the fine-tuning of the model?

In [8]:
# TODO: Your code here.
train_counts = train_df["Label"].value_counts()
train_props = train_df["Label"].value_counts(normalize=True)

print("Train class counts:\n", train_counts)
print("\nTrain class proportions:\n", train_props)



Train class counts:
 Label
ham     1726
spam     274
Name: count, dtype: int64

Train class proportions:
 Label
ham     0.863
spam    0.137
Name: proportion, dtype: float64
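
One way to see why this imbalance matters is to look at a trivial baseline that always answers "ham". The sketch below only reuses the counts printed above; the variable names are illustrative.

```python
# Class counts copied from the output above.
counts = {"ham": 1726, "spam": 274}
total = sum(counts.values())

# A classifier that always predicts the majority class.
majority = max(counts, key=counts.get)
baseline_accuracy = counts[majority] / total

# Such a classifier never predicts the minority class, so it detects no spam.
minority_detection_rate = 0.0

print(f"Always predicting '{majority}': accuracy = {baseline_accuracy:.3f}")
print(f"Fraction of spam detected: {minority_detection_rate:.3f}")
```

A high global accuracy can therefore hide a model that is useless for spam detection, which is why the accuracy on the spam class is tracked separately later in the notebook.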


**Question 6**: Create the dataloaders for training and test.

In [9]:
# TODO: add any imports which are needed
from torch.utils.data import DataLoader
# Create the Tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

# Instantiate the Dataset
train_dataset = SpamDataset("train.csv", tokenizer)
test_dataset = SpamDataset("test.csv", tokenizer)

# --- TODO: Create DataLoaders ---
# 1. Create a train_loader with batch_size=16 and shuffle=True
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# 2. Create a test_loader with batch_size=16 and shuffle=False
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)


In [10]:
# Check your work
for input_batch, target_batch in train_loader:
    print("Input batch shape:", input_batch.shape) # Should be [16, 120] (unless you use batch_size != 16)
    print("Target batch shape:", target_batch.shape) # Should be [16]
    break

Input batch shape: torch.Size([16, 120])
Target batch shape: torch.Size([16])


**Question 7**: Looking at the batch size and the training size, how many batches will you have in total? If you reduced the training data due to performance constraints, please report the size of the subsampled training set.

In [None]:
# TODO: add your code.
import math

train_size = len(train_dataset)
batch_size = train_loader.batch_size
num_batches = math.ceil(train_size / batch_size)

print("Train size:", train_size)
print("Batch size:", batch_size)
print("Number of batches per epoch:", num_batches)


Train size: 2000
Batch size: 16
Number of batches per epoch: 125


## Fine-tuning

**Context**: GPT-2 was trained to predict the next word (output size ~50,000). We want to predict binary classes (output size 2), so we must replace the final layer.
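
As a rough order of magnitude, the size of the two heads can be compared with simple arithmetic. This sketch assumes the values reported earlier in this notebook (`n_embd = 768`, `n_vocab = 50257`) and a language-model head without bias; `linear_param_count` is an illustrative helper.

```python
def linear_param_count(in_features, out_features, bias):
    """Number of parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


emb_dim = 768       # n_embd in `settings`
vocab_size = 50257  # n_vocab in `settings`

lm_head = linear_param_count(emb_dim, vocab_size, bias=False)
cls_head = linear_param_count(emb_dim, 2, bias=True)

print(f"Next-token head parameters:      {lm_head:,}")
print(f"Binary classifier head parameters: {cls_head:,}")
```

The new head is tiny compared with the original one, so training it is cheap even on a CPU.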

**Question 8**:

**8.1**: In the cell below, define the number of output classes (`num_classes`) for the new spam detection task.

**8.2**: Also, print the original and updated output heads (hint: `out_head` from `GPTModel`).

**8.3**: Why do we freeze the internal layers with `param.requires_grad = False`?

In [12]:
# Freeze the internal layers
import torch.nn as nn

for param in model.parameters():
    param.requires_grad = False

print(f"Original output head: {model.out_head}") # TODO: YOUR CODE HERE

num_classes = 2 # TODO: YOUR CODE HERE
model.out_head = nn.Linear(768, num_classes, bias=True) # TODO: YOUR CODE HERE
# Hint: The input size of the last layer in GPT-2 small is 768.

# Enable gradient calculation ONLY for the new head and the last block's second LayerNorm (`norm2`)
for param in model.out_head.parameters():
    param.requires_grad = True
for param in model.trf_blocks[-1].norm2.parameters():
    param.requires_grad = True

print(f"New output head: {model.out_head}") # TODO: YOUR CODE HERE

Original output head: Linear(in_features=768, out_features=50257, bias=False)
New output head: Linear(in_features=768, out_features=2, bias=True)


You now have to **finalise the code for the training loop** (see individual steps below).

In the first cell below you can find the code to move the model to GPU (if available), define the optimizer, and calculate the accuracy. The following cell contains the code for the training (fine-tuning) loop.

You will have to complete the code of the training loop, by answering the following questions:

**Question 9.1**: Reset the gradients of the `optimizer` ([hint](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html)).

**Question 9.2**: Compute cross-entropy loss ([hint](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.cross_entropy.html)).

**Question 9.3**: Add code for the backward pass, to compute the gradient ([hint](https://docs.pytorch.org/docs/stable/generated/torch.Tensor.backward.html)).

**Question 9.4**: Add code for the optimizer step, to update the weights ([hint](https://docs.pytorch.org/docs/stable/generated/torch.optim.Optimizer.step.html))

**Question 9.5**: Add code to calculate the accuracy on train and test (hint: you can use the `calc_accuracy` method).

**Note about the speed**: On my laptop's CPU, one epoch with the full training dataset (~4400 samples, batch_size=16) took ~20 minutes; one epoch with a training set of 2000 samples (batch_size=16) took ~12 minutes.

To iterate more quickly, you could:
- i) set `num_epochs = 1` (but only at the beginning), just to make sure that the code is working;
- ii) increase batch_size to 32 or 64 (but careful with possible memory issues).
- iii) reduce the size of the training dataset, by going back to the *Preparing the data* section, and changing the line `train_df = df.iloc[:split_idx]` to `train_df = df.iloc[:split_idx][:2000]` or similar. Be careful that if you reduce the training data too much, the model will not have enough data for fine-tuning.
- iv) use a GPU; it would be much quicker (a few minutes on the whole training data).


In [13]:
# [--- INSTRUCTOR CODE ---]

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Measure imbalance
count_ham = len(train_df[train_df['Label']=='ham'])
count_spam = len(train_df[train_df['Label']=='spam'])

# Calculate weight: penalize missing the minority class (Spam) more
# Weight = Count(Majority) / Count(Minority)
pos_weight = count_ham / count_spam  # approx 6.46 (for full training dataset)
class_weights = torch.tensor([1.0, pos_weight]).to(device)
print(f"Using class weights: {class_weights}")

# Define Optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.1)

# Calculate Accuracy Helper Function
def calc_accuracy(loader, model, device):
    correct, total = 0, 0
    # Track spam specifically
    spam_correct, spam_total = 0, 0
    model.eval()
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)[:, -1, :]
            predicted = torch.argmax(logits, dim=-1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            # Filter for Spam (Label 1)
            spam_mask = (labels == 1)
            spam_total += spam_mask.sum().item()
            spam_correct += (predicted[spam_mask] == labels[spam_mask]).sum().item()
    # Avoid division by zero
    spam_acc = spam_correct / spam_total if spam_total > 0 else 0.0
    global_acc = correct / total
    return global_acc, spam_acc


Using class weights: tensor([1.0000, 6.2993])
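
To see what these class weights do to the loss, here is a pure-Python sketch of a weighted cross-entropy. It assumes the behaviour documented for `torch.nn.functional.cross_entropy` with `weight` and the default mean reduction, namely a weighted sum of per-sample losses divided by the sum of the weights of the targets. The helper names and the toy logits are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, class_weights):
    """Weighted mean of the per-sample negative log-likelihoods."""
    weighted_sum, weight_total = 0.0, 0.0
    for logits, y in zip(batch_logits, targets):
        w = class_weights[y]
        weighted_sum += -w * log_softmax(logits)[y]
        weight_total += w
    return weighted_sum / weight_total


# Toy batch: the model leans towards "ham" (class 0) for both messages,
# but the second message is actually spam (class 1).
batch_logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

plain = weighted_cross_entropy(batch_logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(batch_logits, targets, [1.0, 1726 / 274])

print(f"Unweighted loss: {plain:.4f}")
print(f"Weighted loss:   {weighted:.4f}")
```

With the weights, the missed spam message dominates the batch loss, so the gradient pushes harder towards detecting the minority class.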


In [None]:
# TODO: Add your code where needed in this cell
import torch.nn.functional as F


num_epochs = 3  # TODO: (Optional questions: update this to see how the fine-tuning changes with more epochs)

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        inputs, targets = inputs.to(device), targets.to(device)

        # 9.1. Reset Gradients (of the `optimizer`)
        # TODO: YOUR CODE HERE
        optimizer.zero_grad()

        # Forward Pass. With the new head, the model outputs (batch, seq_len, num_classes).
        # We only want the prediction for the LAST token in the sequence.
        logits = model(inputs)[:, -1, :]

        # 9.2. Calculate the cross entropy loss
        loss = F.cross_entropy(logits, targets, weight=class_weights)

        # 9.3. Backward Pass
        loss.backward()
        # 9.4 Optimizer Step
        optimizer.step()

        if batch_idx % 10 == 0:
            print(f"Epoch {epoch+1}, Batch {batch_idx}, Loss: {loss.item():.4f}")

    # 9.5 Add code to calculate the accuracy on train and test
    train_acc, train_spam_acc = calc_accuracy(train_loader, model, device)
    test_acc, test_spam_acc = calc_accuracy(test_loader, model, device)
    print(f"Epoch {epoch+1}: Train Acc: {train_acc*100:.2f}% (Spam: {train_spam_acc*100:.2f}%) | Test Acc: {test_acc*100:.2f}% (Spam: {test_spam_acc*100:.2f}%)")

Epoch 1, Batch 0, Loss: 1.1945
Epoch 1, Batch 10, Loss: 1.4102
Epoch 1, Batch 20, Loss: 1.4365
Epoch 1, Batch 30, Loss: 0.5337
Epoch 1, Batch 40, Loss: 0.8444
Epoch 1, Batch 50, Loss: 1.5472
Epoch 1, Batch 60, Loss: 0.9079
Epoch 1, Batch 70, Loss: 0.8715
Epoch 1, Batch 80, Loss: 1.0448
Epoch 1, Batch 90, Loss: 0.5626
Epoch 1, Batch 100, Loss: 1.0816
Epoch 1, Batch 110, Loss: 0.5925
Epoch 1, Batch 120, Loss: 0.5880
Epoch 1: Train Acc: 13.85% (Spam: 100.00%) | Test Acc: 13.63% (Spam: 100.00%)
Epoch 2, Batch 0, Loss: 0.7279
Epoch 2, Batch 10, Loss: 0.4161
Epoch 2, Batch 20, Loss: 0.7002
Epoch 2, Batch 30, Loss: 0.7854
Epoch 2, Batch 40, Loss: 0.8618
Epoch 2, Batch 50, Loss: 0.5622
Epoch 2, Batch 60, Loss: 0.7506
Epoch 2, Batch 70, Loss: 0.6329
Epoch 2, Batch 80, Loss: 0.8232
Epoch 2, Batch 90, Loss: 0.5807
Epoch 2, Batch 100, Loss: 0.6686
Epoch 2, Batch 110, Loss: 0.8737
Epoch 2, Batch 120, Loss: 0.9309
Epoch 2: Train Acc: 16.10% (Spam: 100.00%) | Test Acc: 15.87% (Spam: 99.33%)
Epoch 3, 

**Question 10**: 

Now run the cell above. You should see how the training loss changes after each batch (and epoch).
Describe the trend: what do you see, is the model learning?
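
Per-batch losses are noisy, so one possible way to read the trend is to average the logged values per epoch. The sketch below uses the losses printed every 10 batches in the run recorded above (epochs 1 and 2 only, since the log stops there). It is an illustration, not a substitute for your own run.

```python
# Losses logged every 10 batches in the recorded run above.
logged_losses = {
    1: [1.1945, 1.4102, 1.4365, 0.5337, 0.8444, 1.5472, 0.9079,
        0.8715, 1.0448, 0.5626, 1.0816, 0.5925, 0.5880],
    2: [0.7279, 0.4161, 0.7002, 0.7854, 0.8618, 0.5622, 0.7506,
        0.6329, 0.8232, 0.5807, 0.6686, 0.8737, 0.9309],
}

# Average the logged values for each epoch to smooth out batch noise.
epoch_means = {epoch: sum(v) / len(v) for epoch, v in logged_losses.items()}

for epoch, mean_loss in epoch_means.items():
    print(f"Epoch {epoch}: mean of logged losses = {mean_loss:.4f}")
```

A falling average suggests the loss is decreasing even though individual batches fluctuate; it should still be read together with the accuracy figures.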

**Question 11 (optional)**: Change the number of epochs and/or the learning rate and/or the size of the training data, and investigate how the loss/accuracy of the model changes. You can do this by editing and re-running the cells above, or by creating new cells below.

In [None]:
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

train_limit = 1000
num_epochs = 4
learning_rate = 1e-4
batch_size = 16
seed = 123
torch.manual_seed(seed)
print(f"Running experiment: train_limit={train_limit}, epochs={num_epochs}, lr={learning_rate}, batch_size={batch_size}, seed={seed}")
train_df_small = train_df.iloc[:train_limit].copy()
train_df_small.to_csv("train_tmp.csv", index=False)
train_dataset_small = SpamDataset("train_tmp.csv", tokenizer)
train_loader_small = DataLoader(train_dataset_small, batch_size=batch_size, shuffle=True)
print("Subsampled train size:", len(train_dataset_small))

count_ham = (train_df_small["Label"] == "ham").sum()
count_spam = (train_df_small["Label"] == "spam").sum()
pos_weight = count_ham / max(count_spam, 1)
class_weights = torch.tensor([1.0, pos_weight], dtype=torch.float32, device=device)
print(f"Class distribution (train subsample): ham={count_ham}, spam={count_spam}")
print(f"Using class weights: {class_weights} (dtype={class_weights.dtype})")

model_exp = GPTModel(model_config)
load_weights_into_gpt(model_exp, params)
model_exp.to(device)  # move to the device AFTER loading, so the loaded weights end up there too

for p in model_exp.parameters():
    p.requires_grad = False

model_exp.out_head = torch.nn.Linear(768, 2, bias=True).to(device)
for p in model_exp.out_head.parameters():
    p.requires_grad = True

for p in model_exp.trf_blocks[-1].norm2.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model_exp.parameters(), lr=learning_rate, weight_decay=0.1)
for epoch in range(num_epochs):
    model_exp.train()
    running_loss = 0.0
    for batch_idx, (inputs, targets) in enumerate(train_loader_small):
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        logits = model_exp(inputs)[:, -1, :] 
        loss = F.cross_entropy(logits, targets, weight=class_weights)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    avg_loss = running_loss / len(train_loader_small)
    train_acc, train_spam_acc = calc_accuracy(train_loader_small, model_exp, device)
    test_acc, test_spam_acc = calc_accuracy(test_loader, model_exp, device)

    print(
        f"Epoch {epoch+1}/{num_epochs} | "
        f"loss={avg_loss:.4f} | "
        f"train_acc={train_acc*100:.2f}% (spam={train_spam_acc*100:.2f}%) | "
        f"test_acc={test_acc*100:.2f}% (spam={test_spam_acc*100:.2f}%)"
    )


Running experiment: train_limit=1000, epochs=4, lr=0.0001, batch_size=16, seed=123
Subsampled train size: 1000
Class distribution (train subsample): ham=860, spam=140
Using class weights: tensor([1.0000, 6.1429]) (dtype=torch.float32)
Epoch 1/4 | loss=0.7445 | train_acc=81.90% (spam=27.86%) | test_acc=84.30% (spam=32.67%)
Epoch 2/4 | loss=0.7429 | train_acc=83.30% (spam=82.86%) | test_acc=85.56% (spam=81.33%)
Epoch 3/4 | loss=0.7274 | train_acc=81.90% (spam=90.71%) | test_acc=84.57% (spam=91.33%)
Epoch 4/4 | loss=0.6913 | train_acc=78.80% (spam=95.00%) | test_acc=80.27% (spam=93.33%)
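
The global accuracy and the spam accuracy printed above can be combined to recover the accuracy on the "ham" class, which helps to interpret the trade-off across epochs. This is a sketch under one stated assumption: the test set (1115 samples) is taken to contain 150 spam messages, a figure inferred from the printed percentages rather than read from the data. Check it against your own `test_df`. `ham_accuracy` is an illustrative helper.

```python
def ham_accuracy(global_acc, spam_acc, n_total, n_spam):
    """Recover the accuracy on the ham class from global and spam accuracy."""
    correct_total = global_acc * n_total
    correct_spam = spam_acc * n_spam
    return (correct_total - correct_spam) / (n_total - n_spam)


N_TEST = 1115      # test size printed earlier in the notebook
N_TEST_SPAM = 150  # ASSUMPTION inferred from the percentages, not measured here

# (test_acc, test_spam_acc) per epoch, copied from the output above.
test_results = [(0.8430, 0.3267), (0.8556, 0.8133), (0.8457, 0.9133), (0.8027, 0.9333)]

ham_accs = [ham_accuracy(acc, spam, N_TEST, N_TEST_SPAM) for acc, spam in test_results]
for epoch, ((acc, spam), ham) in enumerate(zip(test_results, ham_accs), start=1):
    print(f"Epoch {epoch}: global={acc:.2%} spam={spam:.2%} ham={ham:.2%}")
```

Under that assumption, the gain in spam detection over the epochs comes with a loss of accuracy on ham, which is consistent with the slight drop in global accuracy.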


**Question 12 (optional)**: Now test the model *on your own text*.

In [21]:
def classify_text(text, model, tokenizer, device, max_length=120, pad_token_id=50256):
    model.eval()

    encoded = tokenizer.encode(text, allowed_special={'<|endoftext|>'})
    encoded = encoded[:max_length]
    encoded += [pad_token_id] * (max_length - len(encoded))

    x = torch.tensor(encoded).unsqueeze(0).to(device)  # [1, max_length]

    with torch.no_grad():
        logits = model(x)[:, -1, :]
        pred = torch.argmax(logits, dim=-1).item()

    return "SPAM" if pred == 1 else "NOT SPAM"


# --- Test the model on custom texts ---
text = (
    "URGENT! You have WON a free iPhone 15. Claim now at http://bit.ly/free-iphone "
    "or reply WIN to get your prize!"
)

print(f" {text} -> {classify_text(text, model, tokenizer, device)}")


 URGENT! You have WON a free iPhone 15. Claim now at http://bit.ly/free-iphone or reply WIN to get your prize! -> SPAM
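
`classify_text` returns only the most likely label. If a confidence score is also of interest, the two logits can be passed through a softmax. The sketch below is standalone and uses made-up logits; `softmax` and `label_with_confidence` are illustrative helpers, not part of `gpt_utils.py`.

```python
import math

LABELS = ("NOT SPAM", "SPAM")  # index 0 = ham, index 1 = spam, as in SpamDataset


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the predicted label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


example_logits = [-0.4, 1.1]  # made-up values for illustration
label, confidence = label_with_confidence(example_logits)
print(f"{label} (confidence {confidence:.2f})")
```

In the notebook, the list of logits for a single message could be obtained with `logits[0].tolist()` inside `classify_text`.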


---