

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

This notebook is a starter guide designed to get you up and running quickly. We'll walk through a simplified training process using a small subset of the data (5,000 examples) and lightweight parameters. The main goal here is to understand the complete workflow, from loading data to generating a submission file, not to achieve a top score.

Good luck, and have fun! 🎉

- Anthony Olcek
- N18364039

## **Step 1: Install Necessary Libraries**

First, we need to install the required Python libraries. We'll be using the unsloth library, which provides highly efficient, memory-saving training methods for large language models, making it possible to fine-tune powerful models on a single free-tier GPU. The commented-out second line shows how to pin xformers and related libraries to specific versions, should you need them.


In [None]:
# %%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps "xformers<0.0.26" "trl<0.9.0" "peft<0.12.0" "accelerate<0.32.0" "bitsandbytes<0.44.0" "transformers<4.43.0"

Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-hxdo4gc0/unsloth_ee50b154a87b4ee2961cfd021fe04dcc
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-hxdo4gc0/unsloth_ee50b154a87b4ee2961cfd021fe04dcc
  Resolved https://github.com/unslothai/unsloth.git to commit 2267b5c5532957141a33bfa5bb9f0b220a4b3efe
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting unsloth_zoo>=2025.10.13 (from unsloth@ git+https://github.com/unslothai/unsloth.git->unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Downloading unsloth_zoo-2025.10.13-py3-none-any.whl.metadata (32 kB)
Collecting tyro (from unsloth@ git+https://github.com/unslothai/unsloth.

In [None]:
!pip install unsloth_zoo



## **Step 2: Load the Model and Tokenizer**

Next, we'll load the Llama-3-8B model, which is the only model permitted for this competition. We'll use Unsloth's FastLanguageModel to handle this efficiently.

A key technique we'll use is 4-bit quantization (load_in_4bit = True). Think of this as compressing the model's knowledge into a much smaller file size. This significantly reduces the amount of GPU memory required, allowing us to fine-tune this large model even on a free platform like Google Colab.
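To make the saving concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone at different precisions. It is a sketch rather than a measurement: `N_PARAMS` is an assumed, rounded parameter count for an 8B-class model, and the calculation ignores quantization constants, activations, optimizer state, and the LoRA adapters.

```python
# Rough weight-memory estimate; N_PARAMS is an assumed, rounded parameter count.
N_PARAMS = 8.03e9

def weight_gib(bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB, at the given precision."""
    return N_PARAMS * bits_per_param / 8 / 2**30

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label:>10}: {weight_gib(bits):5.1f} GiB")
```

The real 4-bit footprint is typically somewhat larger than this figure, since quantized formats also store per-block scaling constants, but the fourfold gap to 16-bit weights is the main reason the model fits on a small GPU.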



In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Choose any sequence length
compress_pos_emb = 2  # Not passed to from_pretrained below, so it currently has no effect

dtype = None  # This will auto-detect the best data type for your GPU
load_in_4bit = True  # Use 4-bit quantization to save memory

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

ModuleNotFoundError: No module named 'unsloth'

## **Step 3: Prepare the Dataset**

This is a crucial step where we format our data into a structure the model can learn from. The process involves three parts:

1.  **Loading**: We'll load the official competition dataset from Hugging Face.
2.  **Splitting**: The full dataset is massive. For this starter notebook, we'll create a much smaller, more manageable version to speed things up: **5,000 samples for training** and **500 for validation**.
3.  **Prompting**: We will format each data sample into a clear instructional prompt. This helps the model understand its role as a mathematician verifying a solution.



In [None]:
from datasets import load_dataset
from sklearn.model_selection import KFold

# Load the full training dataset from Hugging Face
full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

# Shuffle for randomness and reproducibility
shuffled_dataset = full_dataset.shuffle(seed=42)

# Define subset sizes
NUM_TRAIN = 5000
NUM_VALID = 500

# Standard single train/validation split (fast)
train_dataset = shuffled_dataset.select(range(NUM_TRAIN))
validation_dataset = shuffled_dataset.select(range(NUM_TRAIN, NUM_TRAIN + NUM_VALID))


# K-Fold Mini Cross-Validation Setup
# This mimics your neural model selection lab, allowing rotation through folds.
# Each fold can act as a validation split for fine-tuning experiments.
N_FOLDS = 5
kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=42)

fold_indices = list(kf.split(range(NUM_TRAIN + NUM_VALID)))
train_idx, val_idx = fold_indices[0]  # Example: use first fold

cv_train_dataset = shuffled_dataset.select(train_idx.tolist())
cv_valid_dataset = shuffled_dataset.select(val_idx.tolist())

print(f"Fold 1 → Train samples: {len(cv_train_dataset)} | Validation samples: {len(cv_valid_dataset)}")


Fold 1 → Train samples: 4400 | Validation samples: 1100
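The cell above uses only the first fold. Rotating through all of them means repeating the train/evaluate cycle once per fold, so that every sample serves as validation data exactly once. The sketch below shows just the index bookkeeping in plain Python. `k_fold_indices` is a hypothetical helper written for illustration, and unlike the `KFold(shuffle=True)` call above it does not shuffle.

```python
# Plain-Python stand-in for K-fold splitting (no shuffling), for illustration only.
def k_fold_indices(n_samples: int, n_folds: int):
    """Yield (train_idx, val_idx) lists, one pair per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(5500, 5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={len(train_idx)} val={len(val_idx)}")
```

In a full cross-validation run, each `(train_idx, val_idx)` pair would be passed to `shuffled_dataset.select(...)` and a fresh model fine-tuned per fold, which multiplies the training cost by the number of folds.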


In [None]:
# The instructional prompt template for training
training_prompt = """You are an expert mathematician evaluating whether a given solution correctly answers a math question. Follow this rigorous process:

1. **Parse the Question Carefully**: Read the question word by word. Identify all variables, constants, constraints, and exactly what is being asked.

2. **Analyze the Provided Solution**: Examine the solution step by step. Check for:
   - Correct interpretation of the question
   - Proper mathematical operations and formulas
   - Logical reasoning flow
   - Computational accuracy
   - Appropriate units and rounding

3. **Solve Independently**: Work through the problem yourself step by step without looking at the provided solution. Show your work clearly.

4. **Cross-Verify**: Compare your independent solution with the provided solution. Check if they arrive at the same final answer through valid methods.

5. **Validate Logic**: Ensure the solution actually answers what was asked in the question, not a similar or related question.

6. **Final Judgment**: Only if the provided solution is completely correct in both method and final answer should you respond with 'True'. Any errors in reasoning, calculation, or interpretation warrant 'False'.

After completing this analysis, respond with ONLY a single word: 'True' or 'False'.

Question:
{}
Solution:
{}
Output:
{}"""

# Retrieve the model's End-Of-Sequence token
EOS_TOKEN = tokenizer.eos_token

# Function to format each dataset entry into the prompt template
def formatting_prompts_func(examples):
    texts = []
    for q, s, o in zip(examples["question"], examples["solution"], examples["is_correct"]):
        formatted = training_prompt.format(q.strip(), str(s).strip(), str(o).strip())
        # Append EOS token to mark end of completion
        formatted += EOS_TOKEN
        texts.append(formatted)
    return {"text": texts}

# Apply the formatting to both training and validation datasets
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

# Optional: format the K-Fold datasets if using cross-validation
formatted_cv_train = cv_train_dataset.map(formatting_prompts_func, batched=True)
formatted_cv_valid = cv_valid_dataset.map(formatting_prompts_func, batched=True)

print("Datasets formatted successfully and ready for tokenization.")

Datasets formatted successfully and ready for tokenization.
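Before training, it is worth confirming that the label actually lands in the formatted text. Python's `str.format` silently ignores surplus positional arguments, so a template with fewer `{}` placeholders than arguments drops the extra value without raising an error. The check below uses two shortened stand-in templates (not the full prompt from this notebook) and a hypothetical helper, `label_is_present`.

```python
# str.format ignores surplus positional arguments instead of raising an error.
two_slots = "Question:\n{}\nSolution:\n{}\nOutput:"
three_slots = "Question:\n{}\nSolution:\n{}\nOutput:\n{}"

args = ("What is 2 + 2?", "2 + 2 = 4", "True")

dropped = two_slots.format(*args)   # third argument is silently discarded
kept = three_slots.format(*args)    # third argument appears after "Output:"

def label_is_present(text: str, label: str) -> bool:
    """Check that the text after the final 'Output:' contains the label."""
    return label in text.split("Output:")[-1]

print(label_is_present(dropped, "True"))
print(label_is_present(kept, "True"))
```

Running the same check on `formatted_train_dataset[0]["text"]` is a cheap way to catch a model that would otherwise learn to emit nothing after `Output:`.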


## **Step 4: Configure LoRA and Set Up the Trainer**

### **LoRA Configuration**

Instead of training the entire model (which has billions of parameters), we'll use a technique called **Lo**w-**R**ank **A**daptation (LoRA). 🎛️

Think of it like this: rather than rewriting an entire textbook, we're just adding small, efficient "sticky notes" (the LoRA adapters) to update the model's knowledge. This is much faster and requires significantly less memory. We'll use a small **rank** (`r = 8`) to keep the training process light and quick for this starter notebook.
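To see how small the "sticky notes" are, the sketch below counts the parameters LoRA adds at rank 8. Each adapted weight matrix of shape `d_in × d_out` gains two low-rank factors with `r × (d_in + d_out)` parameters in total. The layer dimensions are assumptions taken from the published Llama-3-8B configuration, and the module list mirrors the seven projection matrices targeted in this notebook.

```python
# Count LoRA parameters: each adapted (d_in x d_out) matrix gains r * (d_in + d_out).
# The dimensions below are the published Llama-3-8B shapes, stated here as assumptions.
HIDDEN = 4096          # model width
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336   # MLP width
NUM_LAYERS = 32
RANK = 8

module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total agrees with the `Trainable parameters = 20,971,520` line that Unsloth prints in the Step 5 training log. Doubling `RANK` doubles the count.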


In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,  # higher rank → more learning capacity
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,       # ≈ 2 × r
    lora_dropout = 0.05,   # regularization to prevent overfitting
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 42,
)

NameError: name 'FastLanguageModel' is not defined


### **SFTTrainer Setup**

Now we'll set up the `SFTTrainer` (Supervised Fine-tuning Trainer). This is the main tool from the `trl` library that will handle the entire training loop for us. We'll give it our model, tokenizer, dataset, and a set of training instructions, such as the batch size and number of epochs.

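We will cap training at **200 optimizer steps** to keep this demonstration fast. With an effective batch size of 8 (2 samples per device × 4 gradient-accumulation steps), that is 1,600 samples, or about a third of one pass over our 5,000-sample dataset.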

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback

# Format validation dataset for evaluation and early stopping
formatted_val_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_train_dataset,
    eval_dataset = formatted_val_dataset,        # add validation set
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,               # smoother warmup
        max_steps = 200,                  # more training iterations
        learning_rate = 1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 10,
        optim = "paged_adamw_8bit",       # memory-efficient optimizer
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",     # gradual decay of learning rate
        seed = 42,
        output_dir = "outputs",
        report_to = "none",
        save_total_limit = 2,             # keep only latest checkpoints

        # Compatible argument names for transformers <4.43
        eval_strategy = "steps",          # evaluate periodically
        eval_steps = 20,                  # run validation every 20 steps
        load_best_model_at_end = True,    # reload best checkpoint automatically
    ),
)

# Add early stopping (stop if no improvement for 2 consecutive evaluations)
trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=2))
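
The combination of `warmup_ratio = 0.1` and `lr_scheduler_type = "cosine"` gives a learning rate that ramps up linearly and then decays along a half cosine. The sketch below reproduces the shape of that curve for the settings above. It is an illustration of the formula, not the library's own scheduler code, and `lr_at` is a hypothetical helper.

```python
import math

BASE_LR = 1e-4
TOTAL_STEPS = 200
WARMUP_STEPS = 20  # warmup_ratio 0.1 x 200 steps

def lr_at(step: int) -> float:
    """Learning rate after `step` updates: linear warmup, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 10, 20, 110, 200):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

The peak of `1e-4` is reached at step 20, and the rate is back to half that value at step 110, midway through the decay phase.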


## **Step 5: Start Training!**

Now, we'll call the `train()` function on our `trainer` object. This will kick off the fine-tuning process. With `max_steps = 200` and an effective batch size of 8, the run covers 1,600 examples, roughly a third of an epoch over our 5,000 examples.

Grab a coffee: in the run recorded below, training took about 109 minutes (6,535 seconds). ☕


In [None]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 20,971,520 of 8,051,232,768 (0.26% trained)


| Step | Training Loss | Validation Loss |
|---|---|---|
| 20 | 1.0696 | 0.652294 |
| 40 | 0.4714 | 0.448503 |
| 60 | 0.4342 | 0.424793 |
| 80 | 0.4584 | 0.415681 |
| 100 | 0.4216 | 0.411614 |
| 120 | 0.4449 | 0.409185 |
| 140 | 0.4296 | 0.407599 |
| 160 | 0.4186 | 0.40685 |
| 180 | 0.4117 | 0.406252 |
| 200 | 0.4164 | 0.406157 |


TrainOutput(global_step=200, training_loss=0.52573037981987, metrics={'train_runtime': 6535.2159, 'train_samples_per_second': 0.245, 'train_steps_per_second': 0.031, 'total_flos': 3.972345621287731e+16, 'train_loss': 0.52573037981987, 'epoch': 0.32})
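
The `'epoch': 0.32` in the training output above follows directly from the step budget rather than from the dataset size. The short calculation below reproduces it, along with the time per step, using only numbers copied from the `TrainingArguments` and the log.

```python
# Settings copied from the TrainingArguments and the training log above.
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 200
train_samples = 5000
train_runtime_s = 6535.2159

effective_batch = per_device_batch * grad_accum_steps * num_gpus
samples_seen = effective_batch * max_steps
epochs = samples_seen / train_samples
seconds_per_step = train_runtime_s / max_steps

print(f"effective batch = {effective_batch}")
print(f"samples seen    = {samples_seen}")
print(f"epochs          = {epochs:.2f}")
print(f"seconds/step    = {seconds_per_step:.1f}")
```

To train for a full pass over the 5,000 samples at this batch size, `max_steps` would need to be 625, or it could be replaced by `num_train_epochs = 1`.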


## **Step 6: Inference and Evaluation**

Now that our model is trained, we need to test it on our validation set. We'll use a slightly different prompt for inference: one where we leave the `Output:` section blank for the model to complete.

Let's test it on a single example from our validation set to see what it predicts.

In [None]:
from torch.nn.functional import softmax

# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are an expert mathematician evaluating whether a given solution correctly answers a math question. Follow this rigorous process:

1. **Parse the Question Carefully**: Read the question word by word. Identify all variables, constants, constraints, and exactly what is being asked.

2. **Analyze the Provided Solution**: Examine the solution step by step. Check for:
   - Correct interpretation of the question
   - Proper mathematical operations and formulas
   - Logical reasoning flow
   - Computational accuracy
   - Appropriate units and rounding

3. **Solve Independently**: Work through the problem yourself step by step without looking at the provided solution. Show your work clearly.

4. **Cross-Verify**: Compare your independent solution with the provided solution. Check if they arrive at the same final answer through valid methods.

5. **Validate Logic**: Ensure the solution actually answers what was asked in the question, not a similar or related question.

6. **Final Judgment**: Only if the provided solution is completely correct in both method and final answer should you respond with 'True'. Any errors in reasoning, calculation, or interpretation warrant 'False'.

After completing this analysis, respond with ONLY a single word: 'True' or 'False'.

Question:
{}
Solution:
{}
Output:"""

# Select a sample from the validation set (change index as needed)
example = validation_dataset[10]
question = example["question"]
solution = example["solution"]

# Format the prompt with the validation data
inputs = tokenizer(
    [inference_prompt.format(question, str(solution))],
    return_tensors="pt"
).to("cuda")

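# Generate the model's response
outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
# Skip special tokens so markers such as <|end_of_text|> do not end up in the parsed answer
response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

# Optional: probability of the single most likely next token after the prompt.
# no_grad avoids building an autograd graph (and the related warning) at inference time.
with torch.no_grad():
    logits = model(**inputs).logits[:, -1, :]
probs = softmax(logits, dim=-1)
confidence = probs.max().item()

# Parse prediction text
try:
    prediction = response_text.split("Output:")[1].strip().split()[0]
except IndexError:
    prediction = "UNKNOWN"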

# Display results
print("#### QUESTION ####")
print(question)
print("\n#### SOLUTION ####")
print(solution)
print("\n#### MODEL'S PREDICTION ####")
print(prediction)
print(f"\n#### CONFIDENCE ####\n{confidence:.3f}")
print("\n#### CORRECT ANSWER ####")
print(example["is_correct"])

#### QUESTION ####
Jason and Jeremy want to paint their wall white and agreed to split the cost of the paint. A gallon of paint costs $45 and can cover up to 400 square feet. How much will each of them contribute to the cost of the paint if their walls have a total area of 1600 square feet and will need a second coat?

#### SOLUTION ####
The question asks how much each of them will pay for the paint. So let's first find the total cost of the paint. Let's solve this problem using Python code.
<llm-code>
# area of one gallon of paint
area_per_gallon = 400
# number of gallons
gallons = 1600 / area_per_gallon
# price per gallon
price_per_gallon = 45
# number of coats
number_of_coats = 2
cost_of_paint = gallons * number_of_coats * price_per_gallon
cost_of_paint
</llm-code>
<llm-code-output>
360.0
</llm-code-output>
Each will pay \boxed{180} dollars.

#### MODEL'S PREDICTION ####
<|end_of_text|>

#### CONFIDENCE ####
1.000

#### CORRECT ANSWER ####
True


Consider using tensor.detach() first. (Triggered internally at /pytorch/torch/csrc/autograd/generated/python_variable_methods.cpp:835.)
  confidence = float(probs.max())
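
The confidence printed above is the highest next-token probability over the whole vocabulary, so it says how sure the model is about its next token, not how sure it is about True versus False. A more informative score restricts the softmax to the two candidate answers. The sketch below shows that calculation on made-up logits; `binary_confidence` is a hypothetical helper. With a real model, the two inputs would be read from the last-position logits at the token ids of the two answers, and how those answers tokenize depends on the tokenizer, so that lookup is left out here.

```python
import math

def binary_confidence(logit_true: float, logit_false: float) -> dict:
    """Softmax restricted to the two candidate answers."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logit_true, logit_false)
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    total = e_true + e_false
    return {"True": e_true / total, "False": e_false / total}

# Hypothetical next-token logits for the "True" and "False" tokens.
probs = binary_confidence(logit_true=2.0, logit_false=0.5)
prediction = max(probs, key=probs.get)
print(prediction, round(probs[prediction], 3))
```

A score near 0.5 flags examples where the model is effectively guessing, which is useful when deciding which validation errors to inspect first.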


## **Step 7: Generate Submission File**

This is the final step! We will now run our fine-tuned model on the official `test` dataset.

We will loop through each example in the test set, generate a prediction, and format the results into a CSV file with two columns: `ID` and `is_correct`, as required by the competition.


In [None]:
# Run the fine-tuned model on the official test dataset and produce submission.csv

import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Function to parse 'True' or 'False' from the model's generated output
def parse_output(response_text):
    # Extract text after "Output:" if present
    if "Output:" in response_text:
        output_part = response_text.split("Output:")[-1]
    else:
        output_part = response_text
    # Normalize and check for 'true' or 'false'
    text_lower = output_part.strip().lower()
    if "true" in text_lower:
        return True
    elif "false" in text_lower:
        return False
    return False  # Default fallback

# Generate predictions for all test examples
for example in tqdm(test_dataset, desc="Generating predictions"):
    question = example["question"]
    solution = example["solution"]

    # Format inference prompt (same template as in Step 6)
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate model output
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Parse the generated output
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create submission DataFrame with required format
submission = pd.DataFrame({
    "ID": range(len(predictions)),
    "is_correct": predictions
})

# Save results to CSV file
submission.to_csv("submission.csv", index=False)

print("\nSubmission file 'submission.csv' created successfully.")
print("You can now download this file and submit it to the competition.")

Generating predictions: 100%|██████████| 10000/10000 [1:52:11<00:00,  1.49it/s]


Submission file 'submission.csv' created successfully.
You can now download this file and submit it to the competition.
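
One thing to watch in `parse_output` is that it matches substrings and falls back to `False` for anything it cannot read, so an empty generation and a genuine "False" are indistinguishable in the submission, and a response containing both words is read as `True`. A stricter variant, sketched below under the hypothetical name `parse_output_strict`, looks only at the first word after `Output:` and returns `None` when that word is neither answer, which makes it possible to count unparseable responses before choosing a fallback.

```python
from typing import Optional

def parse_output_strict(response_text: str) -> Optional[bool]:
    """Return True/False from the first word after 'Output:', or None if unparseable."""
    tail = response_text.split("Output:")[-1].strip().lower()
    words = tail.split()
    # Strip trailing punctuation such as "false." before comparing.
    first_word = words[0].strip(".,!'\"") if words else ""
    if first_word == "true":
        return True
    if first_word == "false":
        return False
    return None

samples = ["...Output: True", "...Output:\nFalse.", "...Output:", "...Output: not false"]
parsed = [parse_output_strict(s) for s in samples]
n_unparsed = sum(p is None for p in parsed)
print(parsed, f"unparsed: {n_unparsed}")
```

If a large share of test responses come back as `None`, that points to a problem in the prompt or the training data rather than to the model's judgment.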





# SAVE THE MODEL TO DRIVE AND RUN INFERENCE
Add code to save the model checkpoint to Google Drive, load the model from the checkpoint, and generate the final submission CSV file.

## Mount google drive

### Subtask:
Mount Google Drive to save the model checkpoint.





In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Save model checkpoint

### Subtask:
Save the trained model checkpoint to the specified path in Google Drive.


**Reasoning**:
Define the save path and save the model and tokenizer to Google Drive.



In [None]:
import os

# Define the path to save the model checkpoint in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Create the directory if it doesn't exist
os.makedirs(save_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print(f"Model checkpoint and tokenizer saved to: {save_path}")

Model checkpoint and tokenizer saved to: /content/drive/MyDrive/llama3_8b_math_verifier_checkpoint


## Load model from checkpoint

### Subtask:
Load the model from the saved checkpoint.


**Reasoning**:
Load the model and tokenizer from the saved checkpoint path in Google Drive and prepare the model for inference.



In [None]:
from unsloth import FastLanguageModel

# Define the path where the model checkpoint was saved in Google Drive
save_path = "/content/drive/MyDrive/llama3_8b_math_verifier_checkpoint"

# Repeat the Step 2 settings so this cell also runs in a fresh session
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load the model and tokenizer from the saved path
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = save_path,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

# Prepare the loaded model for faster inference
FastLanguageModel.for_inference(model)

print(f"Model and tokenizer loaded from: {save_path}")

NameError: name 'FastLanguageModel' is not defined

## Generate submission file

### Subtask:
Generate the submission CSV file using the loaded model.


**Reasoning**:
Generate the submission CSV file by iterating through the test dataset, generating predictions using the loaded model, and saving the results to a pandas DataFrame.



In [None]:
# Generate predictions for the official test dataset and create submission.csv

import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

# Create the prompt template for inference (no answer included)
inference_prompt = """You are an expert mathematician evaluating whether a given solution correctly answers a math question. Follow this rigorous process:

1. **Parse the Question Carefully**: Read the question word by word. Identify all variables, constants, constraints, and exactly what is being asked.

2. **Analyze the Provided Solution**: Examine the solution step by step. Check for:
   - Correct interpretation of the question
   - Proper mathematical operations and formulas
   - Logical reasoning flow
   - Computational accuracy
   - Appropriate units and rounding

3. **Solve Independently**: Work through the problem yourself step by step without looking at the provided solution. Show your work clearly.

4. **Cross-Verify**: Compare your independent solution with the provided solution. Check if they arrive at the same final answer through valid methods.

5. **Validate Logic**: Ensure the solution actually answers what was asked in the question, not a similar or related question.

6. **Final Judgment**: Only if the provided solution is completely correct in both method and final answer should you respond with 'True'. Any errors in reasoning, calculation, or interpretation warrant 'False'.

After completing this analysis, respond with ONLY a single word: 'True' or 'False'.

Question:
{}
Solution:
{}
Output:"""

# Function to parse the model's generated output into a boolean
def parse_output(response_text):
    # Extract portion after "Output:" if present
    if "Output:" in response_text:
        output_part = response_text.split("Output:")[-1]
    else:
        output_part = response_text
    # Normalize text for comparison
    text_lower = output_part.strip().lower()
    if "true" in text_lower:
        return True
    elif "false" in text_lower:
        return False
    return False  # Default fallback if unclear

# Loop through test dataset
for example in tqdm(test_dataset, desc="Generating predictions"):
    question = example["question"]
    solution = example["solution"]

    # Construct inference prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate model response
    outputs = model.generate(**inputs, max_new_tokens=8, use_cache=True)
    response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Parse and store prediction
    prediction = parse_output(response_text)
    predictions.append(prediction)

# Create submission DataFrame
submission = pd.DataFrame({
    "ID": range(len(predictions)),
    "is_correct": predictions
})

# Save predictions to CSV
submission.to_csv("submission.csv", index=False)

print("\nSubmission file 'submission.csv' created successfully.")
print("You can now download this file and submit it to the competition.")

Generating predictions:   0%|          | 0/10000 [00:00<?, ?it/s]


NameError: name 'tokenizer' is not defined