# Homework 5: LLM Fine-tuning with Transformers

In this homework you will be **fine-tuning a small instruction-tuned large language model (LLM) to do well on previous 189 exam problems while maintaining its general knowledge capabilities.** We will evaluate you on a hidden test set containing 189 exam problems and general knowledge questions, all in multiple-choice format.

We will walk you through how to fine-tune on custom datasets using the standard Hugging Face `transformers` and `trl` libraries.

In this notebook, we use **Qwen/Qwen2.5-0.5B-Instruct**. This is a small but capable model (0.5 billion parameters) that fits easily on most GPUs and trains quickly, allowing us to perform **full fine-tuning** (updating all weights) rather than needing parameter-efficient methods like LoRA (although you are welcome to use LoRA instead).

In this notebook, we provide a small subset of CS189 exam questions for you to test on, along with a walkthrough of a simple fine-tuning pipeline. The actual test set you will make predictions on is a mix of different question types (full details are provided in the accompanying PDF).

## Overview
Your task is to:
1. **Adapt the provided notebook**
2. **Generate predictions** on the private test questions (test.csv)
3. **Submit** your results to Kaggle  

**Kaggle competition link:**  
https://www.kaggle.com/t/11c8ffdc967fe3f27755cde6fb5810e8

---
### Rules

You are encouraged to improve the model's performance!

**What you CAN change:**
- **Parsing Logic:** You can improve `parse_choice_from_boxed` to handle more edge cases or different output formats.
- **Training and Testing Data:** You can mix in additional datasets to the training set and build your own eval sets to test if your model is overfitting.
- **Test-Time Adaptations:** You can try different decoding strategies, majority voting, or other inference-time techniques.
- **Prompt Engineering:** You can experiment with Chain-of-Thought (CoT) prompting or different system prompts during inference.

**What you CANNOT change:**
- **The Model:** You must train the `Qwen/Qwen2.5-0.5B-Instruct` model. Do not switch to a different model architecture or size.
- **Colab Compatibility:** Your final notebook must be runnable in Google Colab. Do not add dependencies or steps that break this compatibility.
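
One of the test-time adaptations listed above, majority voting, can be sketched as follows. This is a minimal illustration, not part of the provided pipeline: `majority_vote` is a hypothetical helper, and it assumes you have already produced several parsed answers per question (for example by calling `generate` with `do_sample=True` a few times and parsing each output).

```python
from collections import Counter

def majority_vote(choices):
    """Return the most common non-None choice, or None if there are no valid votes.

    Ties are broken in favor of the choice that appeared first.
    """
    valid = [c for c in choices if c is not None]
    if not valid:
        return None
    counts = Counter(valid)
    best = max(counts.values())
    # Walk the votes in order so that ties go to the earliest choice.
    for c in valid:
        if counts[c] == best:
            return c

# Hypothetical parsed outputs from five sampled generations for one question.
sampled = ["B", None, "B", "C", "C"]
print(majority_vote(sampled))
```

Unparseable generations (`None`) are simply ignored here; you may prefer a different policy, such as falling back to the greedy answer.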

#### Evaluation 

**Important:** Your model will be evaluated on a hidden test set containing both:
1.  **CS189 Exam Problems:** Similar to the ones in your training set.
2.  **General Knowledge Questions:** To check if the model has retained its general capabilities.

**Catastrophic Forgetting:**
As shown in the paper you read for this homework, fine-tuning on a narrow dataset (e.g. just CS189 MCQs) can sometimes cause the model to "forget" how to answer general questions or lose its reasoning abilities.

**Recommendation:**
We strongly encourage you to build your own **test set** that includes both domain-specific and general knowledge questions. Use this to monitor your model's performance and ensure it isn't suffering from catastrophic forgetting. You might want to mix in some general datasets during training or use early stopping to prevent this (the paper you read will be useful here).
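
A rough sketch of the recommendation above, using plain Python lists so the idea is independent of any dataset library. The function name `mix_and_split`, the 30% general-knowledge share, and the 20% hold-out share are all illustrative assumptions, not values prescribed by the assignment.

```python
import random

def mix_and_split(domain, general, general_fraction=0.3, eval_fraction=0.2, seed=189):
    """Hold out an eval slice from each source, then mix the rest for training.

    general_fraction is the target share of general examples in the training
    mix and must be strictly less than 1.
    """
    rng = random.Random(seed)

    def split(examples):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_eval = max(1, int(len(shuffled) * eval_fraction))
        return shuffled[n_eval:], shuffled[:n_eval]

    domain_train, domain_eval = split(domain)
    general_train, general_eval = split(general)

    # How many general examples are needed to reach the target share.
    n_general = int(len(domain_train) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_train))

    train = domain_train + general_train[:n_general]
    rng.shuffle(train)
    return train, domain_eval, general_eval

# Toy stand-ins for real examples.
domain = [("cs189", i) for i in range(100)]
general = [("general", i) for i in range(100)]
train, domain_eval, general_eval = mix_and_split(domain, general)
print(len(train), len(domain_eval), len(general_eval))
```

Tracking accuracy on `domain_eval` and `general_eval` separately makes it easier to notice when domain accuracy rises while general accuracy falls. With Hugging Face datasets, `concatenate_datasets` (already imported in this notebook) can play the role of the list concatenation.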

---
## The Finetuning Pipeline

Now let's walk through a simple finetuning pipeline using the Hugging Face `transformers` and `trl` libraries. Even though we are using a smaller model, the core structure of the finetuning pipeline follows the classic **ML Lifecycle** covered in lecture!

<img src="https://i.imgur.com/ya2hBEk.png" width="60%">

---

## Learning Problem (P)

**Goal:** Decide what behavior we want the LLM to learn, and from what data.

We want to fine-tune our base Qwen model to better perform on CS189-style multiple choice questions. Our objective is to minimize cross-entropy loss on a dataset of these questions.

Concretely, we will:
- **Load the CS189 MCQ dataset:** A CSV file containing questions, options (A-E), and the correct answer.
- **Format the data:** Convert each row into a "chat" format that the model understands.
  - User: The question + options.
  - Assistant: The correct answer (e.g., `\boxed{A}`).

---
## Model Design (L)

**Goal:** Decide which model we use and how we adapt it.

We use **Qwen/Qwen2.5-0.5B-Instruct**.
- **Architecture:** A Transformer-based Causal Language Model.
- **Adaptation:** We use **Full Fine-tuning**. Since the model is small, we can update all parameters. This differs from "LoRA" (Low-Rank Adaptation) which is often used for larger models (7B+) to save memory.
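
To see why LoRA saves memory, it helps to count trainable parameters for a single weight matrix. The sketch below uses illustrative dimensions (a 1024 x 1024 matrix and rank 8); these are not the actual layer sizes of the Qwen model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for LoRA's two low-rank factors.

    LoRA learns B (d_out x rank) and A (rank x d_in) and keeps the
    original weight matrix frozen.
    """
    return rank * (d_out + d_in)

d, r = 1024, 8  # illustrative sizes only
print(full_params(d, d))     # 1048576
print(lora_params(d, d, r))  # 16384
```

In this toy case LoRA trains 64 times fewer parameters for that matrix, which also shrinks the gradient and optimizer-state memory. For a 0.5B-parameter model the full update is still affordable, which is why this notebook defaults to full fine-tuning.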

---
## Optimization (M)

**Goal:** Train the model on the dataset by minimizing loss.

We use TRL's `SFTTrainer` (Supervised Fine-Tuning Trainer) to perform gradient-based optimization:
- **Loss function:** Standard token-level cross-entropy loss.
- **Optimizer:** AdamW (8-bit version to save some memory, though standard AdamW may also fit depending on your GPU).
- **Hyperparameters:** Learning rate, batch size, etc.
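
As a concrete picture of token-level cross-entropy, here is a small standard-library sketch. It computes the loss from raw logits for a toy vocabulary of three tokens; the helper names are hypothetical, and which tokens the trainer actually includes in the loss (prompt, answer, or both) depends on its configuration and is not modeled here.

```python
import math

def token_cross_entropy(logits, target_index):
    """Cross-entropy for one token: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def sequence_loss(logits_per_token, targets):
    """Mean token-level cross-entropy over a sequence."""
    losses = [token_cross_entropy(l, t) for l, t in zip(logits_per_token, targets)]
    return sum(losses) / len(losses)

# Toy example: vocabulary of 3 tokens, sequence of 2 target tokens.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(sequence_loss(logits, targets))
```

The second position has uniform logits, so its loss is `log(3)`; the first position favors the correct token, so its loss is smaller. Training pushes the logits of the correct next token up at every position.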

---
## Predict & Evaluate (O)

**Goal:** Check whether the fine-tuned model behaves as desired.

After training, we:
- **Run inference:** Ask the model to answer the MCQs.
- **Compute accuracy:** Check if the model's output (parsed from `\boxed{X}`) matches the ground truth.
- **Compare:** We measure accuracy *before* and *after* fine-tuning to quantify improvement.

---
## Part 0: Environment Setup

This cell installs and imports the required libraries.

In [None]:
import sys
IS_COLAB = 'google.colab' in sys.modules
if IS_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd /content/drive/MyDrive/cs189/hw/hw5
    ! pip install -q transformers==4.57.2 accelerate datasets trl bitsandbytes

In [None]:
import os
import re
import math
import pandas as pd
import torch
from datasets import Dataset, concatenate_datasets, load_dataset, load_from_disk
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
#MAKE SURE YOU ARE USING GPU
print('Using device:', device)

---
## Part 1: Configuration & Model Loading

Here we define all our settings and load the base model.

**Model Design (L):** We select `Qwen/Qwen2.5-0.5B-Instruct`.

In [None]:
# ============================================================================
# === CONFIGURATION - ALL SETTINGS IN ONE PLACE ===
# ============================================================================

# --- Model Configuration ---
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct" # YOU CANNOT CHANGE THIS

# --- Dataset Configuration ---
#TODO: REPLACE WITH YOUR OWN PATH
MCQ_CSV_PATH = "hw5_sample_eval.csv"  # Path to CS189 MCQ sample eval dataset

# --- Training Configuration (feel free to adjust!) ---
TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 4
WARMUP_STEPS = 5
MAX_STEPS = 50  # or set num_train_epochs instead
LEARNING_RATE = 1e-5
WEIGHT_DECAY = 0.01
LR_SCHEDULER_TYPE = "linear"
OPTIM = "adamw_8bit"  # requires bitsandbytes
SEED = 189

# --- Evaluation Configuration ---
EVAL_MAX_NEW_TOKENS = 64  # How many tokens to generate for inference
OUTPUT_DIR = "./mcq_finetuned_model"

In [None]:
# === Load base model & tokenizer ===
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Ensure we have a pad token for training
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)
model.resize_token_embeddings(len(tokenizer))
model.to(device)
model.eval()
print('Model loaded.')

---
## Part 2: Data Preparation

**Learning Problem (P):** We need to format our raw CSV data into training examples.

We define helper functions to:
1.  Load the CSV.
2.  Build a "prompt" (Question + Options).
3.  Build the full "SFT text" (Prompt + Answer) using the model's chat template.

In [None]:
# === MCQ helpers ===
LETTER_SET = set("ABCDE")

def load_mcq_dataset(csv_path: str = MCQ_CSV_PATH):
    """Load the CS189 MCQ dataset.

    Expected columns:
        - question
        - A, B, C, D, E
        - answer (single letter A-E)
    """
    df = pd.read_csv(csv_path)
    required = ["question", "A", "B", "C", "D", "E", "answer"]
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns in MCQ CSV: {missing}")

    df = df.copy()
    df["answer"] = (
        df["answer"]
        .astype(str)
        .str.strip()
        .str.upper()
    )
    df = df[df["answer"].isin(LETTER_SET)].reset_index(drop=True)
    return df

def build_mcq_prompt(row):
    """Prompt for inference: instruction + question + options.

    The model is expected to answer with the correct letter in \\boxed{} format.
    """
    q = str(row["question"]).strip()
    options = "\n".join([
        f"A. {row['A']}",
        f"B. {row['B']}",
        f"C. {row['C']}",
        f"D. {row['D']}",
        f"E. {row['E']}",
    ])
    prompt = (
        "Choose exactly one correct option from A, B, C, D, and E.\n"
        "Return your answer inside a LaTeX box.\n\n"
        f"{q}\n\n{options}\n\nAnswer:"
    )
    return prompt

### Understanding the Chat Format (OpenAI Style)

To fine-tune a chat model, we need to structure our data as a conversation. This is often called the **OpenAI Chat Format** or **Messages Format**.

Instead of a single string of text, each example is a list of dictionaries, where each dictionary represents a message in the conversation:
-   `{"role": "user", "content": "..."}`: The input prompt or question.
-   `{"role": "assistant", "content": "..."}`: The model's desired response.

For our MCQ task, we structure it as:
1.  **User**: "Choose exactly one correct option... [Question] ... [Options]"
2.  **Assistant**: "\\boxed{A}"

We then use `tokenizer.apply_chat_template()` to convert this structured list into the specific string format that the model expects (e.g., adding special tokens like `<|im_start|>user...<|im_end|>`).
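
To make the conversion concrete, the sketch below renders a messages list by hand. `render_chatml` is a hypothetical, simplified stand-in for `tokenizer.apply_chat_template()`: the model's real template may differ in details (for instance, it may prepend a default system message), so use this only to build intuition and rely on the tokenizer in actual code.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Approximate ChatML rendering: one <|im_start|>role ... <|im_end|> block per message."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Choose exactly one correct option...\n\nAnswer:"},
    {"role": "assistant", "content": "\\boxed{A}"},
]

# Training text: the full conversation, including the answer.
print(render_chatml(messages))

# Inference prompt: the user turn only, with an open assistant turn.
print(render_chatml(messages[:1], add_generation_prompt=True))
```

The two calls mirror the two ways the template is used in this notebook: with the answer included for training, and with `add_generation_prompt=True` for inference.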

In [None]:
def parse_choice_from_boxed(text: str):
    """Parse an MCQ choice A-E from the model output.

    We first look for a literal '\\boxed{X}' pattern. If not found, we
    fallback to the last standalone A-E in the decoded text.
    """
    if text is None:
        return None
    # Direct \\boxed{A} ... \\boxed{E}
    m = re.search(r"\\boxed\{\s*([A-E])\s*\}", text)
    if m:
        return m.group(1)
    # Fallback: last standalone A-E
    letters = re.findall(r"\b([A-E])\b", text.upper())
    if letters:
        return letters[-1]
    return None
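
One edge case worth knowing about: the fallback above upper-cases the text before searching, so the English article "a" counts as choice A. For "I think it is C, which is a common choice." the function returns `A`. A possible stricter variant is sketched below; `parse_choice_robust` is a hypothetical name, and the extra patterns are assumptions about what the model might output, not an exhaustive list.

```python
import re

def parse_choice_robust(text):
    """Hypothetical variant of parse_choice_from_boxed with stricter fallbacks."""
    if not text:
        return None
    # 1. Prefer the last \boxed{X}, since a model may revise its answer.
    boxed = re.findall(r"\\boxed\{\s*\(?([A-Ea-e])\)?\s*\}", text)
    if boxed:
        return boxed[-1].upper()
    # 2. Phrases such as "Answer: C" or "answer is (C)" (capital letter only).
    phrased = re.findall(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-E])\)?(?![A-Za-z])", text)
    if phrased:
        return phrased[-1]
    # 3. Last standalone capital A-E; no upper-casing, so the article "a" is ignored.
    letters = re.findall(r"\b([A-E])\b", text)
    return letters[-1] if letters else None

print(parse_choice_robust("I think it is C, which is a common choice."))
```

Whatever variant you use, checking it against the `decoded` column of the evaluation details is a quick way to find outputs it still misreads.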

### **Load and Format the (Eval) Dataset**

We load the MCQ dataset and apply the formatting function.
This is the sample eval set we provide. The actual test set you will make predictions on is a mix of different question types (full details are provided in the accompanying PDF).

In [None]:
# === Load MCQ CSV (Evaluation Data) ===
try:
    mcq_df = load_mcq_dataset(MCQ_CSV_PATH)
    print(f"Loaded MCQ dataset with {len(mcq_df)} rows from {MCQ_CSV_PATH}.")
except Exception as e:
    mcq_df = None
    print("Error loading MCQ CSV: check MCQ_CSV_PATH.")
    raise e
mcq_df

### **Load Training Dataset (MMLU)**

We will use the **MMLU (Massive Multitask Language Understanding)** dataset, specifically the `machine_learning` subset, as our training data. This helps the model learn general machine learning concepts which should transfer to the CS189 exam problems.

In [None]:
# === MMLU Helper Functions ===
def load_mmlu_dataset(subset: str = "machine_learning", split: str = "test"):
    """Load a subset of the MMLU dataset from Hugging Face."""
    print(f"Loading MMLU dataset (subset={subset}, split={split})...")
    ds = load_dataset("cais/mmlu", subset, split=split)
    return ds

def build_mmlu_prompt(row):
    """Prompt for inference: instruction + question + options."""
    q = str(row["question"]).strip()
    choices = row["choices"]

    options_list = []
    for i, choice in enumerate(choices):
        letter = chr(ord("A") + i)
        options_list.append(f"{letter}. {choice}")
    options_str = "\n".join(options_list)

    prompt = (
        "Choose exactly one correct option from the choices provided.\n"
        "Return your answer inside a LaTeX box.\n\n"
        f"{q}\n\n{options_str}\n\nAnswer:"
    )
    return prompt

def build_mmlu_sft_text(row, tokenizer):
    """Build properly formatted chat template text for training."""
    user_content = build_mmlu_prompt(row)

    answer_int = row["answer"]
    answer_letter = chr(ord("A") + answer_int)
    assistant_content = f"\\boxed{{{answer_letter}}}"

    messages = [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": assistant_content}
    ]

    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False
    )

# === Load MMLU Machine Learning Dataset ===
mmlu_ds = load_mmlu_dataset("machine_learning", split="test")
mmlu_text_ds = mmlu_ds.map(lambda x: {"text": build_mmlu_sft_text(x, tokenizer)})
print("Loaded MMLU ML dataset with", len(mmlu_text_ds), "rows")

# Set the training dataset - you can mix and match datasets here
train_dataset = mmlu_text_ds

Let's look at an example of the training data to see what the model is actually seeing as input. Note that there are now start and stop tokens `<|im_start|>` and `<|im_end|>` as well as the role `user` and `assistant` indicating who is speaking.

In [None]:
# print out what the first row looks like
print(train_dataset[0]['text'])

---
## Part 3: Baseline Evaluation

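**Predict & Evaluate (O):** Before we train, let's see how the base model performs on our CS189 questions. The prompt built by `build_mcq_prompt` contains no worked examples, so this is a "zero-shot" baseline; adding solved examples to the prompt would make it "few-shot".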

In [None]:
def eval_mcq_accuracy(
    curr_model,
    curr_tokenizer,
    df,
    max_new_tokens: int = 64,
    return_details: bool = False,
):
    """Evaluate a model on the MCQ dataset using greedy decoding.

    If return_details=True, also return a pandas DataFrame with
    [idx, question, A, B, C, D, E, gold, prompt, decoded, parsed, correct].
    """
    curr_model.eval()
    n = len(df)
    correct = 0
    total = 0
    records = []

    for idx in range(n):
        row = df.iloc[idx]
        user_content = build_mcq_prompt(row)

        # Apply chat template for inference
        messages = [{"role": "user", "content": user_content}]
        prompt = curr_tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = curr_tokenizer(prompt, return_tensors="pt").to(device)

        with torch.no_grad():
            outputs = curr_model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
            )

        gen_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        decoded = curr_tokenizer.decode(gen_tokens, skip_special_tokens=True)

        pred = parse_choice_from_boxed(decoded)
        is_correct = (pred is not None and pred == row["answer"])
        if is_correct:
            correct += 1
        total += 1

        records.append({
            "idx": idx,
            "question": row["question"],
            "A": row["A"],
            "B": row["B"],
            "C": row["C"],
            "D": row["D"],
            "E": row["E"],
            "gold": row["answer"],
            "prompt": prompt,
            "decoded": decoded,
            "parsed": pred,
            "correct": is_correct,
        })

        if (idx + 1) % 20 == 0:
            print(f"Processed {idx + 1}/{n} questions...")

    acc = correct / max(total, 1)
    print(f"MCQ accuracy: {acc * 100:.2f}% ({correct}/{total})")

    details_df = pd.DataFrame(records)
    if return_details:
        return acc, details_df
    return acc

# === Baseline MCQ accuracy before fine-tuning ===
print("Evaluating baseline model on MCQ dataset...")
baseline_acc, baseline_details = eval_mcq_accuracy(
    model,
    tokenizer,
    mcq_df,
    max_new_tokens=EVAL_MAX_NEW_TOKENS,
    return_details=True,
)
baseline_details.head()

---
## Part 4: Training (Optimization)

**Optimization (M):** We now configure the `SFTTrainer`.

We set:
- `dataset_text_field="text"`: Tells the trainer which column contains the formatted chat.
- `learning_rate`, `batch_size`: Standard hyperparameters.
- `optim="adamw_8bit"`: Efficient optimizer.

In [None]:
# === Set up SFTTrainer ===
sft_config = SFTConfig(
    dataset_text_field="text",
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=WARMUP_STEPS,
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    logging_steps=1,
    optim=OPTIM,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type=LR_SCHEDULER_TYPE,
    seed=SEED,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    eval_dataset=None,
    processing_class=tokenizer,
)

trainer

### Run Training

This will iterate through the dataset and update the model's weights.

In [None]:
# === Fine-tune the model ===
model.train()
trainer.train()
model.eval()

---
## Part 5: Post-Training Evaluation

**Predict & Evaluate (O):** Now that the model is trained, we evaluate it again on the same MCQ dataset to see if accuracy improved.
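
A single accuracy number can hide the fact that fine-tuning fixed some questions and broke others. A small sketch for breaking the change down per question follows; `compare_runs` is a hypothetical helper that takes two equally ordered sequences of booleans.

```python
def compare_runs(baseline_correct, finetuned_correct):
    """Count questions by how their correctness changed between two evaluations."""
    counts = {"fixed": 0, "broken": 0, "still_right": 0, "still_wrong": 0}
    for before, after in zip(baseline_correct, finetuned_correct):
        if before and after:
            counts["still_right"] += 1
        elif before and not after:
            counts["broken"] += 1
        elif after:
            counts["fixed"] += 1
        else:
            counts["still_wrong"] += 1
    return counts

# Toy correctness flags for four questions.
print(compare_runs([True, False, False, True], [True, True, False, False]))
```

Once both evaluations have been run, the `correct` columns of `baseline_details` and `ft_details` can be passed in directly, since both are built from the same rows of `mcq_df` in the same order.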

In [None]:
# === Evaluate MCQ accuracy after fine-tuning ===
print("Evaluating fine-tuned model on MCQ dataset...")
ft_acc, ft_details = eval_mcq_accuracy(
    model,
    tokenizer,
    mcq_df,
    max_new_tokens=EVAL_MAX_NEW_TOKENS,
    return_details=True,
)
print(f"Baseline acc: {baseline_acc:.4f}, Fine-tuned acc: {ft_acc:.4f}")
ft_details.head()  # last expression in the cell, so the notebook displays it

### Save the Model

We save the fine-tuned model and tokenizer so we can use them later.

In [None]:
# === Save fine-tuned model (optional) ===
os.makedirs(OUTPUT_DIR, exist_ok=True)
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
print("Saved fine-tuned model to", OUTPUT_DIR)

---

## 🧩 **YOUR TURN: Final Kaggle Submission**

Now it's your turn to run the *full* ML lifecycle.

You should now be able to load and inspect the **private test CSV**, which contains **169 questions** with mixed types.  
Use your **fine-tuned LLM** to make predictions on these questions.  
**Beware of formatting**: Kaggle will reject incorrectly formatted submissions!

### Objective

Your task is to:

1. **Use the provided notebook** to fine-tune the base model on training data of your choice (feel free to adapt the data we provided).
2. **Adapt the same pipeline** to run inference on `test.csv`.
3. For each row in `test.csv`, output **exactly one letter** from the set  
   **{A, B, C, D, E}**.
4. Save these predictions in the **strict submission format** described below and
   upload your CSV to Kaggle:
   - https://www.kaggle.com/t/11c8ffdc967fe3f27755cde6fb5810e8



### **Submission Format (Strict)**

Your submission must be a **CSV** with exactly **two columns**, `id` and `prediction`, and a **single header row**.

A valid submission looks like:

| id          | prediction |
|-------------|------------|
| test_00001  | A          |
| test_00002  | A          |
| test_00003  | C          |
| test_00004  | E          |
| ...         | ...        |
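
Before uploading, it can save a rejected submission to check the file yourself. The sketch below is one possible sanity check using only the standard library. `validate_submission` is a hypothetical helper, and the specific checks are assumptions based on the format described above; the competition page remains the authority on what Kaggle accepts.

```python
import csv
import io

VALID_LETTERS = {"A", "B", "C", "D", "E"}

def validate_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission CSV (empty list means none found)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != ["id", "prediction"]:
        return ["header must be exactly: id,prediction"]

    body = [r for r in rows[1:] if r]  # skip blank lines
    if any(len(r) != 2 for r in body):
        return ["every row must have exactly two columns"]

    problems = []
    ids = [r[0] for r in body]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if set(ids) != set(expected_ids):
        problems.append("ids do not match the test set")
    bad = [r[0] for r in body if r[1] not in VALID_LETTERS]
    if bad:
        problems.append(f"invalid predictions for ids: {bad[:5]}")
    return problems

good = "id,prediction\ntest_00001,A\ntest_00002,C\n"
print(validate_submission(good, ["test_00001", "test_00002"]))
```

In the notebook, the text of a saved file can be read with `open(path).read()` and the expected ids taken from the `id` column of the test CSV.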




### Evaluation Metric

Submissions are evaluated using **accuracy**: the fraction of test examples for which your
predicted answer matches the hidden correct answer.


### Example

| **id**        | **True Answer** | **Your Prediction** | **Correct?** |
|---------------|-----------------|---------------------|--------------|
| `test_00001`  | A               | A                   | Yes          |
| `test_00002`  | A               | B                   | No           |
| `test_00003`  | A               | A                   | Yes          |

Here, the accuracy would be 2/3, or approximately 66.7%.

The leaderboard is split into:

- **Public leaderboard**: 50% of the test data  
- **Private leaderboard**: remaining 50% (used for final ranking and grading)


In [None]:
YOUR_PATH_TO_TEST_CSV = 'kaggle_test.csv' #TODO: REPLACE WITH YOUR OWN PATH
test_questions = pd.read_csv(YOUR_PATH_TO_TEST_CSV)
test_questions
#TODO:
# 1. Make predictions on your finetuned model
# 2. submit to kaggle following the expected format (id, prediction)

In [None]:
# Dummy Place Holder

# Create a dummy submission dataframe with all "A"
submission = pd.DataFrame({
    "id": test_questions["id"],
    "prediction": ["A"] * len(test_questions)
})

# Save to CSV
submission.to_csv("dummy_submission.csv", index=False)

print("Saved dummy_submission.csv")
submission.head()