<a href="https://colab.research.google.com/github/sankirthk/CS-GY-ECE-GY-6953-7123-DL-FL25-Midterm/blob/main/DL_Midterm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Step 1: Install Necessary Libraries**
First, we install the required Python libraries. We use the **unsloth** library, which provides memory-efficient training methods for large language models and makes it possible to fine-tune an 8B-parameter model on a single GPU. Installing `unsloth` and `unsloth_zoo` pulls in the rest of the fine-tuning stack as dependencies (for example `trl`, `peft`, `accelerate`, `bitsandbytes`, and `datasets`), and we add `wandb` for experiment tracking. We install through `uv` because it resolves and installs packages much faster than plain `pip`.

In [1]:
!pip install uv # installing via uv is much quicker

!uv pip install  unsloth unsloth_zoo
!uv pip install wandb

Collecting uv
  Downloading uv-0.9.7-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Downloading uv-0.9.7-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.4 MB)
Installing collected packages: uv
Successfully installed uv-0.9.7
Using Python 3.12.12 environment at: /usr
Resolved 91 packages in 462ms
Prepared 12 packages in 2.42s
Uninstalled 3 packages in 56ms
Installed 12 packages in 19ms
 + bitsandbytes==0.48.2
 + cut-cross-entropy==25.1.1
 - datasets==4.0.0
 + datasets==4.3.0
 + msgspec==0.19.0
 - pyarrow==18.1.0
 + pyarrow==22.0.0
 

## **Step 1.1: Wandb setup**


In [2]:
import os
import wandb


wandb.login()
# Shared between both users
WANDB_PROJECT = "nyu_math_eval_colab_experiment5"
WANDB_ENTITY = "KachraSweep-Colab"   # replace with your team/org/user handle on W&B

# Unique per user
USER_NAME = "Sankirth"          # teammate sets this to their own name

# Export for automatic detection
os.environ["WANDB_PROJECT"] = WANDB_PROJECT
os.environ["WANDB_ENTITY"] = WANDB_ENTITY
os.environ["WANDB_RUN_GROUP"] = "incremental_training"

print(f"✅ Configured for project: {WANDB_ENTITY}/{WANDB_PROJECT}")
print(f"User run name prefix: {USER_NAME}")


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: sk11617 (sk11617-new-york-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


✅ Configured for project: KachraSweep-Colab/nyu_math_eval_colab_experiment5
User run name prefix: Sankirth


## **Step 2: Mount Google Drive and Set Paths**
Mounting Google Drive is essential for this competition. It allows us to **save and load model checkpoints** and our **processed dataset**. This ensures your progress is not lost if the Colab session disconnects, enabling you to resume training where you left off. We also define a base path for all project files.
**! IMPORTANT:** Please update the `DRIVE_BASE_PATH` variable to a path in your own Google Drive.

In [3]:
from google.colab import drive
import os

try:
    drive.mount('/content/drive')
    print("\n Google Drive mounted successfully.")
except Exception as e:
    print(f"\n Could not mount Google Drive. Training will not be persistent. Error: {e}")

# Define a base path in your Google Drive for all competition files
# ! UPDATE THIS PATH to your desired location in Google Drive
# ==== Common shared base path ====
DRIVE_BASE_PATH = "/content/drive/MyDrive/DL_Fall_2025_Kaggle"

# ==== Unique per-user path ====
# Replace with your actual short name or initials
USER_NAME = "Sankirth"     # or "userB"

# ==== Subdirectories ====
CHECKPOINT_BASE = f"{DRIVE_BASE_PATH}/checkpoints/{USER_NAME}"
DATASET_BASE = f"{DRIVE_BASE_PATH}/dataset"
RESULTS_BASE = f"{DRIVE_BASE_PATH}/results"

# ==== Create directories ====
os.makedirs(CHECKPOINT_BASE, exist_ok=True)
os.makedirs(DATASET_BASE, exist_ok=True)
os.makedirs(RESULTS_BASE, exist_ok=True)

print(f"✅ Configured paths for {USER_NAME}")
print(f"CHECKPOINT_BASE: {CHECKPOINT_BASE}")

Mounted at /content/drive

 Google Drive mounted successfully.
✅ Configured paths for Sankirth
CHECKPOINT_BASE: /content/drive/MyDrive/DL_Fall_2025_Kaggle/checkpoints/Sankirth


## **Step 3: Load the Model and Tokenizer**
Next, we load the competition-approved **Llama-3.1-8B** model and its tokenizer. We use **Unsloth's FastLanguageModel** for high efficiency.

A crucial technique here is **4-bit Quantization** (`load_in_4bit = True`). This compresses the model's parameters, dramatically reducing GPU memory usage. This makes fine-tuning the 8-billion parameter model feasible even on free-tier GPUs. We also set a standard sequence length and let Unsloth automatically select the optimal data type (`bf16` or `fp16`).
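To make the memory saving concrete, here is a back-of-the-envelope sketch. It assumes a round 8 billion parameters and counts only the weights; activations, optimizer state, and the small overhead of quantization constants are ignored, so real usage will be higher.

```python
# Rough weight-memory estimate for an 8B-parameter model.
# N_PARAMS is an assumed round figure, not the exact parameter count.
N_PARAMS = 8_000_000_000


def weight_memory_gb(n_params: int, bits_per_param: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights:  ~{int4_gb:.0f} GB")
```

Under these assumptions the weights shrink by a factor of four, which is what lets the model fit alongside LoRA adapters and activations on a single GPU.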

# If starting training from scratch:

In [None]:
from unsloth import FastLanguageModel
import torch

# Configuration constants
MAX_SEQ_LENGTH = 2048 # Standard sequence length for instruction fine-tuning
DTYPE = None          # Auto-detect the best data type for the GPU (e.g., bfloat16)
LOAD_IN_4BIT = True   # Enable 4-bit quantization to save memory

# Clean up any existing model to free VRAM
try:
    del model
except NameError:
    pass  # no model loaded yet (UnboundLocalError is a subclass of NameError)

# Load the model and tokenizer from Hugging Face
# Note: We use the base model, not a 4-bit pre-quantized one,
# to ensure we start from the official weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B", # Competition-approved model
    max_seq_length = MAX_SEQ_LENGTH,
    dtype = DTYPE,
    load_in_4bit = LOAD_IN_4BIT,
)
print(f"\nModel '{model.config._name_or_path}' and Tokenizer loaded.")

# If resuming training:

In [4]:
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    "/content/drive/MyDrive/DL_Fall_2025_Kaggle/checkpoints/Sankirth/Sankirth_run_20000_to_40000/final_model",
    load_in_4bit=True,
    device_map="auto",
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.1: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

Unsloth 2025.11.1 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## **Step 4: Advanced Dataset Preparation and Balancing**
The raw competition dataset is massive (over 1 million samples) and may be imbalanced (skewed toward 'True' or 'False' answers).

This block performs three key actions:
1.  **Check Balance:** Analyze the distribution of 'True' and 'False' in the full dataset.
2.  **Create a Balanced Subset:** We sample an equal number of 'True' and 'False' examples (a fixed 400,000 per class, which must not exceed the size of the smaller class) from the shuffled dataset. Balancing prevents the model from scoring well by simply predicting the majority class.
3.  **Save/Load Balanced Data:** The balanced dataset is saved to Google Drive, so you only need to run this step once.
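The balancing idea can be sketched on plain Python lists, independent of the `datasets` library. The helper `balanced_indices` and the toy labels are hypothetical and only illustrate the technique; unlike the fixed per-class count used in the notebook, this variant caps the request at the minority class size so it cannot ask for more samples than exist.

```python
import random
from collections import Counter


def balanced_indices(labels, per_class=None, seed=42):
    """Return shuffled indices holding an equal number of True and False labels.

    per_class defaults to the minority class size and is capped at it,
    so the request can never exceed what is available.
    """
    rng = random.Random(seed)
    true_idx = [i for i, y in enumerate(labels) if y]
    false_idx = [i for i, y in enumerate(labels) if not y]
    limit = min(len(true_idx), len(false_idx))
    k = limit if per_class is None else min(per_class, limit)
    chosen = rng.sample(true_idx, k) + rng.sample(false_idx, k)
    rng.shuffle(chosen)  # avoid all-True-then-all-False ordering
    return chosen


toy_labels = [True] * 7 + [False] * 3  # hypothetical imbalanced labels
idx = balanced_indices(toy_labels)
counts = Counter(toy_labels[i] for i in idx)
print(counts[True], counts[False])
```

With the toy labels above, both classes end up with three examples each, the size of the smaller class.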

In [6]:
from datasets import load_dataset, load_from_disk, concatenate_datasets
import pandas as pd
import os

# Define the save path for balanced dataset
save_path = f"{DATASET_BASE}/balanced_dataset"

# Check if balanced dataset already exists
print(f"\n{'='*50}")
if os.path.exists(save_path):
    print(f"Balanced dataset found at: {save_path}")
    print("Loading existing balanced dataset...")
    balanced_dataset = load_from_disk(save_path)
    print(f"Loaded {len(balanced_dataset):,} samples")
    print(f"True: {sum(balanced_dataset['is_correct']):,}, False: {len(balanced_dataset) - sum(balanced_dataset['is_correct']):,}")
else:
    print("Balanced dataset not found. Creating new balanced dataset...")

    # Load the full training dataset
    full_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="train")

    print(f"Total training samples: {len(full_dataset)}")
    print("\nDataset structure:")
    print(full_dataset)

    # Analyze class distribution
    labels = full_dataset["is_correct"]
    true_count = sum(labels)
    false_count = len(labels) - true_count

    print(f"\n{'='*50}")
    print("CLASS DISTRIBUTION ANALYSIS")
    print(f"{'='*50}")
    print(f"True labels:  {true_count:,} ({true_count/len(labels)*100:.2f}%)")
    print(f"False labels: {false_count:,} ({false_count/len(labels)*100:.2f}%)")
    print(f"Imbalance ratio: {max(true_count, false_count) / min(true_count, false_count):.2f}:1")

    # Sample equal numbers from each class
    n_samples_per_class = 400000  # Must not exceed the smaller class's size, or select() below will fail
    shuffled_dataset = full_dataset.shuffle(seed=42)

    # Get balanced samples
    true_samples = shuffled_dataset.filter(lambda x: x["is_correct"] == True).select(range(n_samples_per_class))
    false_samples = shuffled_dataset.filter(lambda x: x["is_correct"] == False).select(range(n_samples_per_class))

    # Combine and shuffle thoroughly
    balanced_dataset = concatenate_datasets([true_samples, false_samples]).shuffle(seed=42)

    print(f"\nBalanced dataset created: {len(balanced_dataset):,} samples")
    print(f"True: {sum(balanced_dataset['is_correct']):,}, False: {len(balanced_dataset) - sum(balanced_dataset['is_correct']):,}")

    # Save to Google Drive
    os.makedirs(save_path, exist_ok=True)
    balanced_dataset.save_to_disk(save_path)

    print(f"Balanced dataset saved to: {save_path}")

# === CRITICAL: Additional shuffle to ensure no ordering artifacts ===
print("\n" + "="*50)
print("APPLYING ADDITIONAL SHUFFLE FOR ROBUSTNESS")
print("="*50)
balanced_dataset = balanced_dataset.shuffle(seed=42)
print("✅ Dataset re-shuffled with seed=42")


Balanced dataset found at: /content/drive/MyDrive/DL_Fall_2025_Kaggle/dataset/balanced_dataset
Loading existing balanced dataset...
Loaded 800,000 samples
True: 400,000, False: 400,000

APPLYING ADDITIONAL SHUFFLE FOR ROBUSTNESS
✅ Dataset re-shuffled with seed=42


## **Step 5: Create training and validation splits**
Create training and validation splits from the balanced dataset with support for incremental training. We train on successive chunks of the training pool (e.g., samples 0 to 20k, then 20k to 40k), resuming from the previous checkpoint each time, to improve the model iteratively within compute constraints.

**Strategy:**
- **Incremental Training**: Train on successive chunks of data, resuming from previous checkpoints
- **Fixed Validation Set**: Use a consistent validation set (5,000 samples, 2,500 per class, held out once and saved to disk) across all runs to reliably track improvement
- **Versioned Checkpoints**: Automatically name checkpoints based on training indices for easy tracking
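As a minimal sketch of the chunking and naming scheme described above, the hypothetical helper `chunk_schedule` below enumerates every incremental run for a given pool size. The pool size and chunk size mirror the values used in this notebook; the function itself is illustrative and not part of the training code.

```python
def chunk_schedule(pool_size, chunk_size, user="user"):
    """Yield (start, end, run_name) for successive incremental runs."""
    start = 0
    while start < pool_size:
        end = min(start + chunk_size, pool_size)  # last chunk may be shorter
        yield start, end, f"{user}_run_{start}_to_{end}"
        start = end


schedule = list(chunk_schedule(pool_size=795_000, chunk_size=20_000, user="Sankirth"))

for start, end, name in schedule[:3]:
    print(start, end, name)
print(len(schedule), "runs; last chunk:", schedule[-1][:2])
```

Each run's end index becomes the next run's start index, so the chunks never overlap and the run name alone identifies which slice of the pool a checkpoint has seen.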

In [8]:
# === Incremental Training Configuration (Shared + User-Aware) ===
from pathlib import Path
from datasets import DatasetDict

# These are already defined earlier:
# USER_NAME, DRIVE_BASE_PATH, CHECKPOINT_BASE, DATASET_BASE

# Ensure subdirectories exist
Path(CHECKPOINT_BASE).mkdir(parents=True, exist_ok=True)
Path(DATASET_BASE).mkdir(parents=True, exist_ok=True)

# === STEP 1: Create stratified train/val split (only needs to be done once) ===
split_save_path = f"{DATASET_BASE}/train_val_split"

if os.path.exists(split_save_path):
    print("="*60)
    print("Loading existing train/validation split...")
    print("="*60)
    split_datasets = load_from_disk(split_save_path)
    training_pool = split_datasets['train']
    validation_dataset = split_datasets['validation']
    print(f"Loaded training pool: {len(training_pool):,} samples")
    print(f"Loaded validation set: {len(validation_dataset):,} samples")
else:
    print("="*60)
    print("Creating stratified train/validation split...")
    print("="*60)

    n_val_samples = 5000

    # Manual stratified split since stratify_by_column doesn't work with boolean
    # Separate True and False samples
    true_samples = balanced_dataset.filter(lambda x: x["is_correct"] == True)
    false_samples = balanced_dataset.filter(lambda x: x["is_correct"] == False)

    print(f"Total True samples: {len(true_samples):,}")
    print(f"Total False samples: {len(false_samples):,}")

    # Split each class proportionally (50% of validation set from each class)
    n_val_per_class = n_val_samples // 2

    # Shuffle each class separately for random selection
    true_shuffled = true_samples.shuffle(seed=42)
    false_shuffled = false_samples.shuffle(seed=42)

    # Create validation set (first n_val_per_class from each)
    val_true = true_shuffled.select(range(n_val_per_class))
    val_false = false_shuffled.select(range(n_val_per_class))
    validation_dataset = concatenate_datasets([val_true, val_false]).shuffle(seed=42)

    # Create training pool (remaining samples from each class)
    train_true = true_shuffled.select(range(n_val_per_class, len(true_shuffled)))
    train_false = false_shuffled.select(range(n_val_per_class, len(false_shuffled)))
    training_pool = concatenate_datasets([train_true, train_false]).shuffle(seed=42)

    # Verify balance
    val_true_count = sum(validation_dataset["is_correct"])
    val_false_count = len(validation_dataset) - val_true_count
    train_true_count = sum(training_pool["is_correct"])
    train_false_count = len(training_pool) - train_true_count

    print(f"\nSplit created:")
    print(f"Training pool: {len(training_pool):,} samples")
    print(f"  True:  {train_true_count:,} ({train_true_count/len(training_pool)*100:.1f}%)")
    print(f"  False: {train_false_count:,} ({train_false_count/len(training_pool)*100:.1f}%)")
    print(f"\nValidation set: {len(validation_dataset):,} samples")
    print(f"  True:  {val_true_count:,} ({val_true_count/len(validation_dataset)*100:.1f}%)")
    print(f"  False: {val_false_count:,} ({val_false_count/len(validation_dataset)*100:.1f}%)")

    # Save the split for consistency across runs
    split_to_save = DatasetDict({
        'train': training_pool,
        'validation': validation_dataset
    })
    split_to_save.save_to_disk(split_save_path)
    print(f"\nTrain/validation split saved to: {split_save_path}")
    print("   (This ensures consistent validation across all incremental training runs)")

# === STEP 2: Configure incremental training ===
print("\n" + "="*60)
print("INCREMENTAL TRAINING CONFIGURATION")
print("="*60)

train_start_idx = 0        # UPDATE THIS: 0 for first run, 20000 for second, etc.
n_train_samples = 20000    # Number of samples to train per run

train_end_idx = min(train_start_idx + n_train_samples, len(training_pool))

# Select current training chunk from the training pool
train_dataset = training_pool.select(range(train_start_idx, train_end_idx))

print(f"Training pool size: {len(training_pool):,}")
print(f"Current training chunk: {len(train_dataset):,} samples")
print(f"  └─ Indices: {train_start_idx:,} → {train_end_idx:,}")
print(f"Validation set: {len(validation_dataset):,} samples (FIXED across all runs)")

# Verify balance in current training chunk
train_true = sum(train_dataset["is_correct"])
train_false = len(train_dataset) - train_true
val_true = sum(validation_dataset["is_correct"])
val_false = len(validation_dataset) - val_true

print(f"\nData Balance:")
print(f"Training chunk - True: {train_true:,} ({train_true/len(train_dataset)*100:.1f}%), False: {train_false:,} ({train_false/len(train_dataset)*100:.1f}%)")
print(f"Validation set - True: {val_true:,} ({val_true/len(validation_dataset)*100:.1f}%), False: {val_false:,} ({val_false/len(validation_dataset)*100:.1f}%)")

# === STEP 3: Checkpoint configuration ===
print("\n" + "="*60)
print("CHECKPOINT CONFIGURATION")
print("="*60)

run_name = f"{USER_NAME}_run_{train_start_idx}_to_{train_end_idx}_colab"
output_dir = f"{CHECKPOINT_BASE}/{run_name}"

# Determine if we should resume from a previous checkpoint
previous_run_name = (
    f"{USER_NAME}_run_{train_start_idx - n_train_samples}_to_{train_start_idx}"
    if train_start_idx > 0 else None
)

resume_checkpoint = None
if previous_run_name:
    # Check for the final model first, then fall back to checkpoint-final
    possible_checkpoints = [
        f"{CHECKPOINT_BASE}/{previous_run_name}/final_model",
        f"{CHECKPOINT_BASE}/{previous_run_name}/checkpoint-final",
    ]

    for ckpt_path in possible_checkpoints:
        if os.path.exists(ckpt_path):
            resume_checkpoint = ckpt_path
            break

print(f"Current user: {USER_NAME}")
print(f"Current run:  {run_name}")
print(f"Output dir:   {output_dir}")

if resume_checkpoint:
    print(f"Will resume from: {resume_checkpoint}")
else:
    if train_start_idx > 0:
        print(f"WARNING: No previous checkpoint found!")
        print(f"   Expected: {CHECKPOINT_BASE}/{previous_run_name}/")
        print(f"   Starting fresh training (not recommended for incremental training)")
    else:
        print("Starting fresh training (first run)")

print(f"\nFor next incremental run:")
print(f"   Set train_start_idx = {train_end_idx}")
print(f"   Keep n_train_samples = {n_train_samples} (or adjust)")
print(f"   This will train on indices {train_end_idx:,} → {min(train_end_idx + n_train_samples, len(training_pool)):,}")

# Calculate progress
progress = (train_end_idx / len(training_pool)) * 100
print(f"\nTraining Progress: {progress:.1f}% of training pool")
print(f"   ({train_end_idx:,} / {len(training_pool):,} samples)")

Creating stratified train/validation split...


Filter:   0%|          | 0/800000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/800000 [00:00<?, ? examples/s]

Total True samples: 400,000
Total False samples: 400,000

Split created:
Training pool: 795,000 samples
  True:  397,500 (50.0%)
  False: 397,500 (50.0%)

Validation set: 5,000 samples
  True:  2,500 (50.0%)
  False: 2,500 (50.0%)


Saving the dataset (0/2 shards):   0%|          | 0/795000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]


Train/validation split saved to: /content/drive/MyDrive/DL_Fall_2025_Kaggle/dataset/train_val_split
   (This ensures consistent validation across all incremental training runs)

INCREMENTAL TRAINING CONFIGURATION
Training pool size: 795,000
Current training chunk: 20,000 samples
  └─ Indices: 0 → 20,000
Validation set: 5,000 samples (FIXED across all runs)

Data Balance:
Training chunk - True: 10,001 (50.0%), False: 9,999 (50.0%)
Validation set - True: 2,500 (50.0%), False: 2,500 (50.0%)

CHECKPOINT CONFIGURATION
Current user: Sankirth
Current run:  Sankirth_run_0_to_20000_colab
Output dir:   /content/drive/MyDrive/DL_Fall_2025_Kaggle/checkpoints/Sankirth/Sankirth_run_0_to_20000_colab
Starting fresh training (first run)

For next incremental run:
   Set train_start_idx = 20000
   Keep n_train_samples = 20000 (or adjust)
   This will train on indices 20,000 → 40,000

Training Progress: 2.5% of training pool
   (20,000 / 795,000 samples)


## **Step 6: Format training data**
Define the instructional prompt template that structures how the model receives questions and solutions. We format each training example into this template and add an EOS (End of Sequence) token to signal completion. This consistent formatting helps the model learn the verification task effectively.
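To see what a single formatted training string looks like, here is a small standalone sketch. `PROMPT` is a shortened stand-in for the full instruction template, and `EOS` is a placeholder string; in the notebook the real value comes from `tokenizer.eos_token`.

```python
# Shortened stand-in for the full instruction template used in training.
PROMPT = "Question:\n{}\nSolution:\n{}\nOutput:\n{}"
EOS = "<eos>"  # placeholder; the real token is read from the tokenizer


def format_example(question, solution, is_correct, eos=EOS):
    """Fill the template and append EOS so the model learns where to stop."""
    label = "True" if is_correct else "False"
    return PROMPT.format(question, str(solution), label) + eos


text = format_example("What is 2 + 2?", "2 + 2 = 5", False)
print(text)
```

The label is the only text after `Output:`, followed immediately by the EOS marker, so at inference time the model should emit a single word and then stop.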

In [9]:
# The instructional prompt template for training
training_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
{}"""

# We must add an End Of Sequence (EOS) token to tell the model when a completion is finished.
EOS_TOKEN = tokenizer.eos_token

# This function formats our data samples into the prompt template.
def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    outputs = examples["is_correct"]
    texts = []
    for question, solution, output in zip(questions, solutions, outputs):
        # Convert boolean to string explicitly: True -> "True", False -> "False"
        output_text = "True" if output else "False"

        # Format the prompt and add the EOS token
        text = training_prompt.format(question, str(solution), output_text) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts }

# Apply the formatting function to our training dataset
formatted_train_dataset = train_dataset.map(formatting_prompts_func, batched=True)

formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True)

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

## **Step 7: LoRA config**
Configure LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. Instead of updating all 8 billion parameters, LoRA adds small trainable adapter layers, dramatically reducing memory requirements and training time.

**Key Parameters:**
- **Rank (r=64)**: A higher rank gives the adapters more capacity to learn nuanced patterns in mathematical reasoning, at the cost of more trainable parameters and memory. Ranks of 32 to 64 are a common starting range for complex reasoning tasks.
- **Alpha (64)**: A common default is alpha = r. You can experiment with alpha = 2*r for stronger updates, or alpha < r (e.g., 16) for more stable training with high ranks.
- **Dropout (0)**: Set to 0 here, the value Unsloth's optimized code path is built for. A small non-zero dropout (0.05-0.15) adds regularization, which can help when training for multiple epochs or on smaller datasets.
- **Target Modules**: All attention and feed-forward projection layers. Embedding layers or bias training can optionally be added for potential gains.
- **Gradient Checkpointing**: Trades extra compute for lower activation memory, leaving room for long sequences (the model is loaded with a 2048-token limit) so the full question and solution fit in context, especially for LaTeX-heavy or code solutions.

**Note on Compute Constraints:**
Given limited compute, focus on:
1. Single strong model with optimal hyperparameters rather than ensembles
2. Efficient training with gradient checkpointing
3. Smart data sampling (incremental training on balanced data)
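The effect of the rank on adapter size can be estimated directly. The sketch below assumes the layer shapes from the published Llama-3.1-8B configuration (hidden size 4096, MLP size 14336, 32 layers, and a 1024-wide key/value projection from grouped-query attention); each LoRA adapter on a linear layer adds `r * (in_features + out_features)` weights.

```python
# Assumed Llama-3.1-8B shapes, taken from the published model config.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM = 1024  # 8 key/value heads x 128 head dim (grouped-query attention)

# (in_features, out_features) for each targeted projection in one layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank):
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    per_layer = sum(rank * (fin + fout) for fin, fout in SHAPES.values())
    return per_layer * LAYERS


def lora_scale(alpha, rank):
    """Standard (non-rsLoRA) scaling applied to the adapter update."""
    return alpha / rank


for r in (16, 64):
    print(f"r={r}: {lora_trainable_params(r):,} trainable parameters")
print("scale at alpha=64, r=64:", lora_scale(64, 64))
```

The count scales linearly with rank, so r=64 trains four times as many adapter weights as r=16, while alpha = r keeps the update scale at 1.0.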

In [None]:
# Skip this step if we are continuing training from a saved model file.
from unsloth import FastLanguageModel

target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"
]

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=target_modules,
    lora_alpha=64,
    lora_dropout=0,      # 0 is the value Unsloth's fast patching is optimized for
    bias="none",
    use_gradient_checkpointing="unsloth",
    use_rslora=False,
    loftq_config=None,
)
# Calculate trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTrainable params: {trainable_params:,} ({trainable_params/total_params*100:.2f}% of total)")

Unsloth 2025.10.12 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.



Trainable params: 41,943,040 (0.90% of total)


## **Step 8: Trainer Config**
Configure the Supervised Fine-Tuning Trainer with optimized hyperparameters for mathematical reasoning. Key considerations:

**Training Hyperparameters:**
- **Learning Rate (1e-4)**: A moderate LR for LoRA fine-tuning. Too high a rate can cause plateaus or divergence; too low a rate slows convergence.
- **Batch Size (effective=8)**: Balance between gradient stability and memory. Effective batch = per_device_train_batch_size × gradient_accumulation_steps = 4 × 2.
- **Warmup Steps (100)**: Longer warmup stabilizes initial training, especially important with LoRA.
- **Epochs (3)**: Multiple epochs with validation monitoring. Single epoch often insufficient for complex reasoning.
- **Weight Decay (0.01)**: Regularization to prevent overfitting.
- **LR Schedule (cosine)**: Smooth learning rate decay for better convergence.

**Monitoring Strategy:**
- Evaluate on the validation set every 500 steps
- Save a checkpoint every 500 steps and reload the one with the lowest validation loss at the end
- Early stopping if validation loss fails to improve for 3 consecutive evaluations
- Log training metrics frequently for debugging

**Note:** With incremental training, we'll resume from previous checkpoint if available.
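The step count and learning-rate schedule implied by these settings can be checked with a little arithmetic. This is a simplified sketch: `total_steps` and `lr_at` are hypothetical helpers that mirror the usual linear-warmup plus half-cosine shape, not the trainer's own implementation.

```python
import math


def total_steps(n_samples, per_device_bs, grad_accum, epochs, n_gpus=1):
    """Optimizer steps for a full run at the given effective batch size."""
    steps_per_epoch = math.ceil(n_samples / (per_device_bs * grad_accum * n_gpus))
    return steps_per_epoch * epochs


def lr_at(step, total, peak_lr, warmup):
    """Linear warmup to peak_lr, then a single half-cosine decay to zero."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_steps(20_000, per_device_bs=4, grad_accum=2, epochs=3)
print("total steps:", steps)
for s in (0, 50, 100, steps // 2, steps):
    print(f"step {s:>5}: lr = {lr_at(s, steps, 1e-4, 100):.2e}")
```

With 20,000 samples, an effective batch of 8, and 3 epochs, this gives 7,500 steps, which agrees with the `Total steps = 7,500` line in the training log.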

In [13]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback
import os

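# === Output directory ===
# Overrides the output_dir set in Step 5: the `_colab` suffix is dropped so the
# path matches the name the resume lookup expects for the next incremental run.
output_dir = f"{CHECKPOINT_BASE}/{USER_NAME}_run_{train_start_idx}_to_{train_end_idx}"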

args = TrainingArguments(
    output_dir=output_dir,
    run_name=run_name,
    logging_dir=f"{output_dir}/logs",
    logging_steps=10,
    per_device_train_batch_size = 4,  # Controls the batch size per device
    gradient_accumulation_steps = 2,  # Accumulates gradients to simulate a larger batch
    num_train_epochs = 3,
    learning_rate=1e-4,
    warmup_steps=100,
    weight_decay=0.01,
    optim="paged_adamw_32bit",
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    # lr_scheduler_kwargs={"num_cycles": 4},
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    save_total_limit=1,
    seed=3407,
    load_best_model_at_end = True,    # Loads the best model at the end
    report_to="wandb",          # send metrics to W&B
    eval_strategy='steps',
    eval_steps=500,  # evaluate every 500 optimizer steps
    save_steps=500,  # save on the same schedule (required by load_best_model_at_end)
    metric_for_best_model="eval_loss",   # NEW: Explicitly set metric
    greater_is_better=False,
    gradient_checkpointing=True,
)

# === Training setup ===
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    dataset_text_field="text",
    train_dataset=formatted_train_dataset,
    eval_dataset=formatted_validation_dataset,
    max_seq_length=1024,
    args=args,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]
)



In [None]:
# === STEP: TRAIN MODEL ===
import os

print("="*60)
print("STARTING TRAINING")
print("="*60)
print(f"Training samples: {len(formatted_train_dataset):,}")
print(f"Validation samples: {len(formatted_validation_dataset):,}")
print(f"Output directory: {output_dir}")
print(f"Model will be saved to: {output_dir}/final_model")
print("="*60 + "\n")

# Train the model
train_result = trainer.train()

print("\n" + "="*60)
print("TRAINING COMPLETE")
print("="*60)
print(f"Training loss: {train_result.training_loss:.4f}")
if hasattr(train_result, 'metrics'):
    print(f"Training metrics: {train_result.metrics}")
print("="*60 + "\n")

# Save final model checkpoint
final_checkpoint_path = f"{output_dir}/final_model"
print(f"Saving final model to: {final_checkpoint_path}")

os.makedirs(final_checkpoint_path, exist_ok=True)
trainer.save_model(final_checkpoint_path)
tokenizer.save_pretrained(final_checkpoint_path)

print(f"\nTraining complete!")
print(f"Model saved to: {final_checkpoint_path}")

# Print best checkpoint info (useful for analysis)
if trainer.state.best_model_checkpoint:
    print(f"Best checkpoint: {trainer.state.best_model_checkpoint}")
    print(f"Best metric: {trainer.state.best_metric:.4f}")
else:
    print("ℹ️ No best checkpoint tracked (load_best_model_at_end might be False)")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 20,000 | Num Epochs = 3 | Total steps = 7,500
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with `pip install weave` then add `import weave` to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/


Unsloth: Will smartly offload gradients to save VRAM!


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 1500 | 0.4539 | 0.463056 |
| 3000 | 0.3664 | 0.454211 |
| 4500 | 0.3551 | 0.440888 |
| 6000 | 0.2993 | 0.463426 |


Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


KeyboardInterrupt: 

## **Step 9: Validation Error Analysis**
Evaluate the fine-tuned model on the fixed validation set to identify systematic mistakes.  
This step computes accuracy, precision, recall, and a confusion matrix, then lists misclassified examples.  
Use it to detect whether the model is too lenient (predicts *True* too often), too skeptical (predicts *False* too often), or struggles with specific math types.  
Insights from this step guide prompt tweaks or targeted retraining before test-set inference.
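The metrics named above can be sketched in plain Python from two parallel lists. `confusion_summary` and the toy predictions are hypothetical; the sketch treats `True` as the positive class and counts unparseable outputs (`None`) as wrong, which matches dividing the number of correct answers by the total number of samples.

```python
def confusion_summary(preds, truths):
    """Confusion counts plus accuracy, precision, and recall.

    preds may contain None for outputs that could not be parsed;
    those are tallied separately and count against accuracy.
    """
    tp = fp = tn = fn = malformed = 0
    for p, t in zip(preds, truths):
        if p is None:
            malformed += 1
        elif p and t:
            tp += 1
        elif p and not t:
            fp += 1  # too lenient: accepted a wrong solution
        elif not p and t:
            fn += 1  # too skeptical: rejected a correct solution
        else:
            tn += 1
    total = len(truths)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn, "malformed": malformed,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


toy_preds = [True, True, False, None, False, True]    # hypothetical outputs
toy_truths = [True, False, False, True, True, True]   # hypothetical labels
summary = confusion_summary(toy_preds, toy_truths)
print(summary)
```

A false-positive count well above the false-negative count points to a lenient model, and the reverse points to a skeptical one.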

In [15]:
import random
from tqdm import tqdm
import torch

# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

def parse_output(response_text: str):
    """
    Parse the model's output to extract True/False prediction.
    The response_text contains the full prompt + generated output.
    """
    # Split by "Output:\n" to get only the generated part
    if "Output:\n" in response_text:
        output_part = response_text.split("Output:\n")[-1].strip()
    else:
        output_part = response_text.strip()

    # Check what the model generated (case-insensitive for robustness)
    output_lower = output_part.lower()

    if output_lower.startswith("true"):
        return True
    elif output_lower.startswith("false"):
        return False
    else:
        return None  # Malformed output


def evaluate_accuracy(model, tokenizer, dataset, n=100, seed=42):
    """
    Evaluate model accuracy on n random samples from the dataset.
    """
    random.seed(seed)
    indices = random.sample(range(len(dataset)), n)
    correct = 0
    malformed = 0

    for i in tqdm(indices, desc="Evaluating"):
        ex = dataset[i]
        question, solution, truth = ex["question"], ex["solution"], ex["is_correct"]

        prompt = inference_prompt.format(question, str(solution))
        inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=8,
                do_sample=False,  # greedy decoding; temperature is unused when sampling is off
                pad_token_id=tokenizer.eos_token_id,
                use_cache=True,
            )
        text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        pred = parse_output(text)

        if pred is None:
            malformed += 1
            # Optionally print malformed outputs for debugging
            # print(f"\nMalformed output: {text}")
        else:
            correct += int(pred == truth)

    acc = correct / n
    print(f"\n{'='*60}")
    print(f"Evaluated {n} random samples")
    print(f"Correct: {correct}/{n} ({acc*100:.1f}%)")
    print(f"Accuracy: {acc:.4f}")
    if malformed > 0:
        print(f"Malformed outputs: {malformed}/{n} ({malformed/n*100:.1f}%)")
    print(f"{'='*60}")

    return acc

# === Optional: Detailed error analysis ===
def detailed_error_analysis(model, tokenizer, dataset, n=100, seed=42):
    """
    Analyze where the model makes mistakes.
    """
    random.seed(seed)
    n = min(n, len(dataset))  # random.sample raises ValueError if n exceeds the dataset size
    indices = random.sample(range(len(dataset)), n)

    errors = {
        "false_positives": [],  # Predicted True, actually False
        "false_negatives": [],  # Predicted False, actually True
        "malformed": []         # Could not parse output
    }

    for i in tqdm(indices, desc="Analyzing errors"):
        ex = dataset[i]
        question, solution, truth = ex["question"], ex["solution"], ex["is_correct"]

        prompt = inference_prompt.format(question, str(solution))
        inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=8,
                temperature=0.0,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
                use_cache=True,
            )
        text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        pred = parse_output(text)

        if pred is None:
            errors["malformed"].append({
                "question": question[:100],  # First 100 chars
                "output": text.split("Output:\n")[-1] if "Output:\n" in text else text
            })
        elif pred != truth:
            if pred and not truth:
                errors["false_positives"].append({
                    "question": question[:100],
                    "solution": str(solution)[:100]
                })
            else:
                errors["false_negatives"].append({
                    "question": question[:100],
                    "solution": str(solution)[:100]
                })

    print(f"\n{'='*60}")
    print("ERROR ANALYSIS")
    print(f"{'='*60}")
    print(f"False Positives (said True, was False): {len(errors['false_positives'])}")
    print(f"False Negatives (said False, was True): {len(errors['false_negatives'])}")
    print(f"Malformed outputs: {len(errors['malformed'])}")

    if errors["malformed"]:
        print(f"\nSample malformed outputs:")
        for i, err in enumerate(errors["malformed"][:3]):  # Show first 3
            print(f"  {i+1}. Output: '{err['output']}'")

    return errors


In [16]:
# === Run evaluation ===
print("Starting validation evaluation...")
validation_accuracy = evaluate_accuracy(model, tokenizer, validation_dataset, n=500)  # Use 500 for better estimate

Starting validation evaluation...


Evaluating: 100%|██████████| 500/500 [02:34<00:00,  3.24it/s]


Evaluated 500 random samples
Correct: 424/500 (84.8%)
Accuracy: 0.8480
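
As a rough guide to how far the 500-sample figure can be trusted, the sketch below computes a normal-approximation 95% interval for an accuracy estimate. The helper name `accuracy_interval` is hypothetical (it is not part of this notebook), and the calculation assumes the evaluated samples are independent draws.

```python
import math

def accuracy_interval(correct: int, n: int, z: float = 1.96):
    """Return (accuracy, lower, upper) using the normal approximation to the binomial."""
    acc = correct / n
    stderr = math.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    half_width = z * stderr
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)

# Counts taken from the evaluation output above: 424 correct out of 500.
acc, lower, upper = accuracy_interval(424, 500)
print(f"accuracy={acc:.3f}, approx. 95% interval=[{lower:.3f}, {upper:.3f}]")
```

With these counts the interval is roughly three percentage points either side of the estimate, which is why small differences between runs evaluated on a few hundred samples should be read with caution.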





In [17]:
errors = detailed_error_analysis(model, tokenizer, validation_dataset, n=200)

Analyzing errors: 100%|██████████| 200/200 [01:01<00:00,  3.26it/s]


ERROR ANALYSIS
False Positives (said True, was False): 11
False Negatives (said False, was True): 18
Malformed outputs: 0
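
The counts above can be turned into an accuracy figure and an error breakdown with a few lines of arithmetic. This is a minimal sketch: `summarize_errors` is a hypothetical helper, and it treats malformed outputs as incorrect, matching how `evaluate_accuracy` scores them. Because true positives and true negatives are not recorded separately, precision and recall cannot be derived from this output alone.

```python
def summarize_errors(n: int, false_positives: int, false_negatives: int, malformed: int = 0):
    """Summarize accuracy and the share of each error type among all mistakes."""
    wrong = false_positives + false_negatives + malformed
    return {
        "accuracy": (n - wrong) / n,
        "fp_share": false_positives / wrong if wrong else 0.0,
        "fn_share": false_negatives / wrong if wrong else 0.0,
    }

# Counts from the error analysis above: 200 samples, 11 FP, 18 FN, 0 malformed.
summary = summarize_errors(200, 11, 18, 0)
print(summary)
```

Here 171 of the 200 samples are correct (85.5%), and false negatives account for 18 of the 29 mistakes, so the model errs more often by rejecting correct solutions than by accepting incorrect ones.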





## **Step 10: Generate Submission File**
Use the fine-tuned model to predict `is_correct` for every example in the official test set.  
The first cell spot-checks a single validation example; the second runs inference over the test set with the final prompt, parses each prediction as **True** or **False**, and saves the results in a Kaggle-ready CSV file.  
Before submission, open the file to confirm that it contains only `True`/`False` values and that the label distribution looks reasonable.  
Upload the resulting file (written as `submission_{USER_NAME}_{train_start_idx}_to_{train_end_idx}.csv`) to Kaggle to evaluate leaderboard performance.
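
The manual check described above can also be scripted. The sketch below is one possible way to do it, assuming the CSV has exactly the columns `ID` and `is_correct` as written by the submission cell; `check_submission` is a hypothetical helper, and the demo file is a throwaway example rather than real predictions.

```python
import csv
import os
import tempfile
from collections import Counter

def check_submission(path, expected_rows=None):
    """Validate a submission CSV and return a Counter of its labels."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["ID", "is_correct"]:
            raise ValueError(f"Unexpected columns: {reader.fieldnames}")
        rows = list(reader)

    if expected_rows is not None and len(rows) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, found {len(rows)}")

    ids = [row["ID"] for row in rows]
    if len(set(ids)) != len(ids):
        raise ValueError("Duplicate IDs found")

    counts = Counter(row["is_correct"] for row in rows)
    unexpected = set(counts) - {"True", "False"}
    if unexpected:
        raise ValueError(f"Unexpected labels: {sorted(unexpected)}")
    return counts

# Demo on a tiny throwaway file; in practice, pass the path of the saved submission CSV.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_submission.csv")
    with open(demo_path, "w", newline="") as f:
        f.write("ID,is_correct\n0,True\n1,False\n2,True\n")
    demo_counts = check_submission(demo_path, expected_rows=3)
    print(dict(demo_counts))
```

Passing `expected_rows=len(test_dataset)` would additionally confirm that no test example was dropped.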

In [18]:
# Prepare the model for faster inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

# Select a sample from the validation set
example_idx = 99  # Change this to test different examples
example = validation_dataset[example_idx]
question = example["question"]
solution = example["solution"]
truth = example["is_correct"]

# Format the prompt with the validation data
inputs = tokenizer(
    [inference_prompt.format(question, str(solution))],
    return_tensors="pt"
).to("cuda")

# Generate the model's response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=8,
        temperature=0.0,        # Deterministic output
        do_sample=False,        # No sampling
        pad_token_id=tokenizer.eos_token_id,
        use_cache=True
    )

response = tokenizer.batch_decode(outputs, skip_special_tokens=True)

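# Extract the generated prediction (take the text after the last "Output:\n",
# matching parse_output, in case the question itself contains that marker)
full_response = response[0]
if "Output:\n" in full_response:
    prediction_text = full_response.split("Output:\n")[-1].strip()
else:
    prediction_text = full_response.strip()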

# Parse the prediction
prediction = None
if prediction_text.lower().startswith("true"):
    prediction = True
elif prediction_text.lower().startswith("false"):
    prediction = False

# Print the results
print("="*60)
print(f"VALIDATION EXAMPLE #{example_idx}")
print("="*60)

print("\n#### QUESTION ####")
print(question)

print("\n#### SOLUTION ####")
print(solution)

print("\n#### MODEL'S PREDICTION ####")
print(f"Raw output: '{prediction_text}'")
print(f"Parsed as: {prediction}")

print("\n#### CORRECT ANSWER ####")
print(truth)

print("\n#### RESULT ####")
if prediction == truth:
    print("CORRECT!")
else:
    print("INCORRECT")
    if prediction is None:
        print("Warning: Could not parse model output")

print("="*60)

VALIDATION EXAMPLE #99

#### QUESTION ####
Elmo makes $N$ sandwiches for a fundraiser. For each sandwich he uses $B$ globs of peanut butter at $4$ cents per glob and $J$ blobs of jam at $5$ cents per blob.  The cost of the peanut butter and jam to make all the sandwiches is $\$2.53$. Assume that  $B$, $J$, and $N$ are positive integers with $N>1$. What is the cost, in dollars, of the jam Elmo uses to make the sandwiches?

#### SOLUTION ####
Let's write down the equations to solve:
We have $N$ sandwiches so we multiply cost of peanut butter ($4$ cents per glob * $B$ globs) and jam ($5$ cents per blob * $J$ blobs) for all sandwiches.
Then the total cost is $2.53$.
$4 * B * N + 5 * J * N = 2.53$
Since we know the cost of peanut butter and jam, we can calculate $B$ and $J$.
Let's use sympy:
<llm-code>
from sympy import symbols, Eq, solve

# Define the variables
N = 2
B = symbols('B')
J = symbols('J')

eq1 = Eq(4 * B * N + 5 * J * N, 2.53)

solutions = solve(eq1, (B, J))
B_value, J_value = 

In [None]:
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import torch

# Load the official test set
test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")

# Prepare model for inference
FastLanguageModel.for_inference(model)

# Create the prompt template for inference (no answer included)
inference_prompt = """You are a great mathematician and you are tasked with finding if a solution to a given maths question is correct or not. Your response should be 'True' if the solution is correct, otherwise 'False'. Below is the Question and Solution.
Question:
{}
Solution:
{}
Output:
"""

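# Parse 'True' or 'False' from the model's raw output.
# Note: this redefinition replaces the earlier parse_output, which returned None
# for malformed output; here malformed output is logged and mapped to False.
def parse_output(response_text):
    """
    Parse the model's output to extract a True/False prediction.
    Malformed output defaults to False so that every test row gets a label.
    """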
    # Extract the part after "Output:\n"
    if "Output:\n" in response_text:
        output_part = response_text.split("Output:\n")[-1].strip()
    else:
        output_part = response_text.strip()

    # Check what was generated (case-insensitive)
    output_lower = output_part.lower()

    if output_lower.startswith("true"):
        return True
    elif output_lower.startswith("false"):
        return False
    else:
        # Default to False if malformed (you can also return None and handle separately)
        print(f"Malformed output: '{output_part[:50]}...'")
        return False

# Store predictions and tracking info
predictions = []
prediction_details = []
malformed_count = 0

# Loop through the test dataset and generate a prediction for each example
print(f"\nGenerating predictions for {len(test_dataset):,} test examples...")
print("="*80)

for idx, example in enumerate(tqdm(test_dataset, desc="Predicting")):
    question = example["question"]
    solution = example["solution"]

    # Format the prompt
    prompt = inference_prompt.format(question, str(solution))
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")

    # Generate the prediction
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=8,
            temperature=0.0,
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
            use_cache=True
        )

    response_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

    # Parse the prediction
    prediction = parse_output(response_text)
    predictions.append(prediction)

    # Extract just the generated output for display
    if "Output:\n" in response_text:
        generated_output = response_text.split("Output:\n")[-1].strip()
    else:
        generated_output = response_text.strip()

    # Track malformed outputs
    if not (generated_output.lower().startswith("true") or generated_output.lower().startswith("false")):
        malformed_count += 1

    # Store details for later analysis
    prediction_details.append({
        'ID': idx,
        'prediction': prediction,
        'raw_output': generated_output,
        'question_preview': question[:100]  # First 100 chars
    })

    # Print first 10 predictions as examples
    if idx < 10:
        print(f"ID {idx:4d}: Predicted = {str(prediction):5s} | Raw: '{generated_output}'")

print("="*80)

# Create the submission DataFrame
submission = pd.DataFrame({
    'ID': range(len(predictions)),
    'is_correct': predictions
})

# Display prediction statistics
print("\n" + "="*80)
print("PREDICTION STATISTICS")
print("="*80)
print(f"Total predictions: {len(predictions):,}")
print(f"Predicted True:  {sum(predictions):,} ({sum(predictions)/len(predictions)*100:.1f}%)")
print(f"Predicted False: {len(predictions) - sum(predictions):,} ({(len(predictions) - sum(predictions))/len(predictions)*100:.1f}%)")
if malformed_count > 0:
    print(f"Malformed outputs: {malformed_count} ({malformed_count/len(predictions)*100:.1f}%)")
print("="*80)

# Save the DataFrame to a CSV file
submission_filename = f'submission_{USER_NAME}_{train_start_idx}_to_{train_end_idx}.csv'
submission.to_csv(submission_filename, index=False)

print(f"\nSubmission file '{submission_filename}' created successfully!")
print(f"   Rows: {len(submission)}")
print(f"   Columns: {list(submission.columns)}")

# Display first few rows
print("\nFirst 10 rows of submission:")
print(submission.head(10))

# Save detailed predictions for analysis
details_df = pd.DataFrame(prediction_details)
details_filename = f'prediction_details_{USER_NAME}_{train_start_idx}_to_{train_end_idx}.csv'
details_df.to_csv(details_filename, index=False)
print(f"\nDetailed predictions saved to '{details_filename}'")

# Optional: Save to Google Drive
import os
import shutil
os.makedirs(RESULTS_BASE, exist_ok=True)  # Make sure the target folder exists
drive_submission_path = f'{RESULTS_BASE}/{submission_filename}'
shutil.copy(submission_filename, drive_submission_path)
print(f"Saved to Google Drive: {drive_submission_path}")

print("\nYou can now download and submit to Kaggle!")

## **Step 11: Save Fine-Tuned Model and Tokenizer**
Save the final fine-tuned model and tokenizer for future inference or continued training.  
`model.save_pretrained` writes the LoRA adapter weights and `tokenizer.save_pretrained` writes the tokenizer files; a small JSON file of run metadata is stored alongside them.  
You can later reload both with `model, tokenizer = FastLanguageModel.from_pretrained(save_dir)` to resume training or generate new predictions.
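
Before reloading a saved run, it can help to read back the `training_metadata.json` file that the save cell writes, to confirm which data slice and hyperparameters the checkpoint corresponds to. The sketch below is illustrative only: `load_training_metadata` is a hypothetical helper, and the demo uses a throwaway directory with placeholder values rather than real results.

```python
import json
import os
import tempfile

def load_training_metadata(save_dir):
    """Read training_metadata.json from a saved-model directory."""
    with open(os.path.join(save_dir, "training_metadata.json")) as f:
        return json.load(f)

# Demo with a throwaway directory and placeholder values (not real results).
with tempfile.TemporaryDirectory() as demo_dir:
    placeholder = {"user": "example_user", "lora_r": 64, "lora_alpha": 128}
    with open(os.path.join(demo_dir, "training_metadata.json"), "w") as f:
        json.dump(placeholder, f, indent=2)
    loaded = load_training_metadata(demo_dir)
    print(loaded)

# In the notebook, after checking the metadata, the model could then be reloaded with:
# model, tokenizer = FastLanguageModel.from_pretrained(save_dir, load_in_4bit=True)
```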

In [None]:
import os
import json

print("\n" + "="*60)
print("SAVING FINAL MODEL")
print("="*60)

# Create save directory
save_dir = os.path.join(output_dir, "final_model")
os.makedirs(save_dir, exist_ok=True)

# Save model and tokenizer
print(f"Saving to: {save_dir}")
model.save_pretrained(save_dir)
print("Model saved")
tokenizer.save_pretrained(save_dir)
print("Tokenizer saved")

# Save metadata
metadata = {
    "user": USER_NAME,
    "train_start_idx": train_start_idx,
    "train_end_idx": train_end_idx,
    "n_train_samples": len(train_dataset),
    "learning_rate": args.learning_rate,
    "lora_r": 64,
    "lora_alpha": 128,
}

with open(os.path.join(save_dir, "training_metadata.json"), 'w') as f:
    json.dump(metadata, f, indent=2)
print("Metadata saved")

# Verify and report
saved_files = os.listdir(save_dir)
total_size = sum(os.path.getsize(os.path.join(save_dir, f)) for f in saved_files)

print(f"\nSaved {len(saved_files)} files ({total_size / (1024**3):.2f} GB)")
print(f"\nModel saved successfully!")
print(f"\nTo reload: FastLanguageModel.from_pretrained(\"{save_dir}\", load_in_4bit=True)")
print("="*60)


Model and tokenizer saved to: /content/drive/MyDrive/DL_Fall_2025_Kaggle/checkpoints/Sankirth/Sankirth_run_20000_to_40000/final_model
You can later reload them with:
  model, tokenizer = FastLanguageModel.from_pretrained("/content/drive/MyDrive/DL_Fall_2025_Kaggle/checkpoints/Sankirth/Sankirth_run_20000_to_40000/final_model")
