

# Welcome to the Math Question Answer Verification Competition! 🚀

The goal is to fine-tune a Llama-3-8B model to predict if a given solution to a math problem is correct or not. Your model should output True if the solution is correct, and False otherwise.

This notebook is a starter guide designed to get you up and running quickly. We'll walk through a simplified training process using a small subset of the data (5,000 examples) and lightweight parameters. The main goal here is to understand the complete workflow, from loading data to generating a submission file, not to achieve a top score.

Good luck, and have fun! 🎉

Updated: small ritual

Author: Ziaho Li

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("Google Drive mounted at /content/drive")


Mounted at /content/drive
Google Drive mounted at /content/drive


In [None]:
# Install pinned versions (uninstall first so stale builds do not linger)
!pip uninstall -y unsloth unsloth_zoo trl transformers torchao xformers

!pip install --upgrade --force-reinstall --no-cache-dir \
    "unsloth==2025.10.8" \
    "unsloth_zoo==2025.10.9" \
    "trl==0.16.1" \
    "transformers==4.56.1" \
    "accelerate>=1.0.0"

print("restart runtime:")



Found existing installation: transformers 4.57.1
Uninstalling transformers-4.57.1:
  Successfully uninstalled transformers-4.57.1
Found existing installation: torchao 0.10.0
Uninstalling torchao-0.10.0:
  Successfully uninstalled torchao-0.10.0
Collecting unsloth==2025.10.8
  Downloading unsloth-2025.10.8-py3-none-any.whl.metadata (59 kB)
Collecting unsloth_zoo==2025.10.9
  Downloading unsloth_zoo-2025.10.9-py3-none-any.whl.metadata (31 kB)
Collecting trl==0.16.1
  Downloading trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Collecting transformers==4.56.1
  Downloading transformers-4.56.1-py3-none-any.whl.metadata (42 kB)
Collecting accelerate>=1.0.0
  Downloading accelerate-1.11.0-py3-none-any.whl.metadata (19 kB)
Collecting wheel>

restart runtime:


In [None]:
# Work around the missing align_logprobs_with_mask entry (it does not affect the training method we use).
import unsloth
from unsloth.models import rl

if "align_logprobs_with_mask" not in rl.RL_REPLACEMENTS:
    print("Injecting missing RL replacement: align_logprobs_with_mask...")
    rl.RL_REPLACEMENTS["align_logprobs_with_mask"] = lambda *args, **kwargs: None
else:
    print("align_logprobs_with_mask already exists")


align_logprobs_with_mask already exists


In [None]:
import sys, unsloth, transformers, trl
from importlib import import_module

print("Python:", sys.version)
print("unsloth =", unsloth.__version__)
print("transformers =", transformers.__version__)
print("trl =", trl.__version__)

rl = import_module("unsloth.models.rl")
print("align_logprobs_with_mask:", "align_logprobs_with_mask" in rl.RL_REPLACEMENTS)
#final check for versions

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
unsloth = 2025.10.8
transformers = 4.56.1
trl = 0.16.1
align_logprobs_with_mask: True


# 11th submission ver2

This run still uses the method from submission 10, but with a larger learning rate and fewer training steps (decreased from 3000 to 2700).

This was originally the 11th submission; it is now the 12th.

# Imports and base config

In [None]:
import unsloth
from unsloth import FastLanguageModel
from datasets import load_dataset, concatenate_datasets

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import TrainingArguments, EarlyStoppingCallback
from trl import SFTTrainer

import pandas as pd
from tqdm import tqdm


max_seq_length = 2048
dtype = None
load_in_4bit = True

# Load the base model

In [None]:
model_id = "unsloth/Meta-Llama-3.1-8B"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = model_id,
    max_seq_length = max_seq_length,
    dtype          = dtype,
    load_in_4bit   = load_in_4bit,
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.truncation_side = "left"
tokenizer.padding_side    = "left"

==((====))==  Unsloth 2025.10.8: Fast Llama patching. Transformers: 4.56.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



# LoRA

In [None]:
try:
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16,
        lora_alpha = 32,
        target_modules = [
            "q_proj","k_proj","v_proj","o_proj",
            "gate_proj","up_proj","down_proj",
        ],
        lora_dropout = 0.05,
        bias = "none",
        use_gradient_checkpointing = "unsloth",
        random_state = 42,
    )
    print("LoRA adapters configured.")
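except Exception as e:
    # Surface the real error instead of assuming the adapters already exist.
    print(f"LoRA setup failed or adapters already present ({e}); restart the runtime or move on.")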

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.10.8 patched 32 layers with 0 QKV layers, 0 O layers and 0 MLP layers.


LoRA adapters configured.
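
As a sanity check on the LoRA configuration, the number of trainable parameters can be worked out by hand. The sketch below assumes the published Llama-3.1-8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of width 128, 32 decoder layers), none of which are stated in this notebook. Each adapted linear layer of shape in × out contributes r · (in + out) LoRA weights.

```python
# Assumed Llama-3.1-8B dimensions (not stated in this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128          # 8 key/value heads, 128 dims each
NUM_LAYERS = 32
R = 16                    # LoRA rank used in get_peft_model above

# (in_features, out_features) of each targeted projection.
TARGET_SHAPES = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj":   (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r) next to a frozen linear layer."""
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(per_layer, total)  # 1310720 41943040
```

Under these assumptions the total agrees with the "Trainable parameters = 41,943,040" line that the trainer banner prints later in this notebook.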


# Dataset formatting

In [None]:
DATASET_ID = "ad6398/nyu-dl-teach-maths-comp"
full_dataset = load_dataset(DATASET_ID, split="train").shuffle(seed=42)

# Split by label
true_data  = full_dataset.filter(lambda x: x["is_correct"] == True)
false_data = full_dataset.filter(lambda x: x["is_correct"] == False)
print(len(true_data), len(false_data))

# Target sizes
train_per_class = 45_000   #90k total
val_per_class   = 3000

train_true         = true_data.select(range(train_per_class))
train_false        = false_data.select(range(train_per_class))
validation_true    = true_data.select(range(train_per_class, train_per_class + val_per_class))
validation_false   = false_data.select(range(train_per_class, train_per_class + val_per_class))

# Merge + shuffle (concatenate_datasets is already imported above)
train_dataset       = concatenate_datasets([train_true, train_false]).shuffle(seed=42)
validation_dataset  = concatenate_datasets([validation_true, validation_false]).shuffle(seed=42)


print(f"Train size: {len(train_dataset)}   (True={train_per_class}, False={train_per_class})")
print(f"Validation size: {len(validation_dataset)} (True={val_per_class}, False={val_per_class})")

EOS_TOKEN      = tokenizer.eos_token or "<|end_of_text|>"
ANSWER_TAG     = "Your Final Answer (True/False):\n"
RESERVE_TOKENS = 12

PROMPT_TEMPLATE = (
    "You are a strict and highly accurate math verifier. And you are doing your best. Because you know you are getting $300 if you made a right verification\n You will loss $500 if you made a wrong verication!\n"
    "Your task:\n"
    "1) Recalculate the correct answer yourself.Make sure you consider everything so the solution and the answer you got it right! \n"
    "2) Compare it with the provided solution ONLY.\n"
    "3) Judge if the provided solution is correct.\n"
    "4) Step by step, make sure to check provided solution step by step.\n"
    "5) Becarful to the all small things!\n"
    "Important rules:\n"
    "- Think privately inside <scratch></scratch>.\n"
    "- DO NOT reveal intermediate calculations.\n"
    "- Output ONLY 'True' or 'False' after the final answer tag.\n\n"
    "Question:\n{question}\n\n"
    "<scratch>Step-by-step internal reasoning here (not visible to users)</scratch>\n"
    "Provided Solution:\n{provided_solution}\n\n"
    f"{ANSWER_TAG}"
)

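def safe_join(question: str, provided_solution: str, label: str) -> str:
    """Build one training example, left-trimming the solution if the prompt is too long."""
    prefix = PROMPT_TEMPLATE.format(question=question, provided_solution=provided_solution)
    ids = tokenizer(prefix, add_special_tokens=False)["input_ids"]
    budget = max_seq_length - RESERVE_TOKENS  # leave room for the label and EOS
    if len(ids) > budget:
        extra = len(ids) - budget
        # `extra` counts tokens, so trim tokens (not characters) from the start of the solution.
        sol_ids = tokenizer(provided_solution, add_special_tokens=False)["input_ids"]
        provided_solution = tokenizer.decode(sol_ids[extra:], skip_special_tokens=True)
        prefix = PROMPT_TEMPLATE.format(question=question, provided_solution=provided_solution)
    return prefix + label + EOS_TOKEN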

def formatting_prompts_func(batch):
    texts = []
    for q, s, o in zip(batch["question"], batch["solution"], batch["is_correct"]):
        label = "True" if bool(o) else "False"
        texts.append(safe_join(q.strip(), str(s).strip(), label))
    return {"text": texts}

formatted_train_dataset      = train_dataset.map(formatting_prompts_func,      batched=True, num_proc=8)
formatted_validation_dataset = validation_dataset.map(formatting_prompts_func, batched=True, num_proc=8)

# check the ANSWER_TAG
def _has_answer_tag(example):
    ids = tokenizer(example["text"], add_special_tokens=False)["input_ids"]
    tag_ids = tokenizer(ANSWER_TAG, add_special_tokens=False)["input_ids"]
    L, T = len(ids), len(tag_ids)
    for i in range(max(0, L - T + 1)):
        if ids[i:i+T] == tag_ids:
            return True
    return False

formatted_train_dataset      = formatted_train_dataset.filter(_has_answer_tag,      num_proc=8)
formatted_validation_dataset = formatted_validation_dataset.filter(_has_answer_tag, num_proc=8)

print("Final Dataset:", len(formatted_train_dataset), len(formatted_validation_dataset))


400000 600000
Train size: 90000   (True=45000, False=45000)
Validation size: 6000 (True=3000, False=3000)



Final Dataset: 90000 6000
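
The balanced split above takes the same number of examples from each class and uses non-overlapping index ranges for training and validation. A minimal, library-free sketch of that idea follows; `balanced_split` and the toy labels are hypothetical names used only for illustration, and the real cell works on `datasets.Dataset` objects instead of Python lists.

```python
import random

def balanced_split(labels, train_per_class, val_per_class, seed=42):
    """Return (train_idx, val_idx) with equal True/False counts and no overlap."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)

    # Group the shuffled indices by class.
    by_class = {True: [], False: []}
    for i in order:
        by_class[bool(labels[i])].append(i)

    train_idx, val_idx = [], []
    need = train_per_class + val_per_class
    for cls, idxs in by_class.items():
        if len(idxs) < need:
            raise ValueError(f"class {cls} has only {len(idxs)} examples, need {need}")
        train_idx += idxs[:train_per_class]      # first slice goes to training
        val_idx += idxs[train_per_class:need]    # next slice goes to validation

    rng.shuffle(train_idx)
    rng.shuffle(val_idx)
    return train_idx, val_idx

# Toy labels with the same 40/60 class imbalance as the full dataset.
labels = [True] * 40 + [False] * 60
train_idx, val_idx = balanced_split(labels, train_per_class=10, val_per_class=3)
print(len(train_idx), len(val_idx))  # 20 6
```

Balancing the classes this way means a model that always answers "False" scores 50% on the validation set instead of the 60% it would get on the raw distribution.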


# Collator

In [None]:
class CompletionOnlyCollator:
    def __init__(self, tokenizer, response_template, max_length=2048):
        self.tokenizer     = tokenizer
        self.max_length    = max_length
        self.template_ids  = tokenizer(response_template, add_special_tokens=False)["input_ids"]

    def _mask_after_template(self, input_ids: torch.Tensor) -> torch.Tensor:
        tpl    = torch.tensor(self.template_ids, dtype=torch.long)
        labels = input_ids.clone()
        L, T   = len(input_ids), len(tpl)
        start  = -1
        for i in range(max(0, L - T + 1)):
            if torch.equal(input_ids[i:i+T], tpl):
                start = i
                break
        if start == -1:
            labels[:] = -100
        else:
            labels[:start+T] = -100
        return labels

    def __call__(self, features):
        ids_list, attn_list, labels_list = [], [], []
        for f in features:
            enc   = self.tokenizer(f["text"], truncation=True, max_length=self.max_length, add_special_tokens=False)
            ids   = torch.tensor(enc["input_ids"], dtype=torch.long)
            attn  = torch.ones_like(ids, dtype=torch.long)
            labs  = self._mask_after_template(ids)
            ids_list.append(ids); attn_list.append(attn); labels_list.append(labs)
        input_ids      = pad_sequence(ids_list,  batch_first=True, padding_value=self.tokenizer.pad_token_id)
        attention_mask = pad_sequence(attn_list, batch_first=True, padding_value=0)
        labels         = pad_sequence(labels_list, batch_first=True, padding_value=-100)
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

collator = CompletionOnlyCollator(tokenizer, ANSWER_TAG, max_seq_length)
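
To make the masking rule concrete, here is a small sketch of the same logic on plain Python lists. The token ids are made up, and `mask_prompt_tokens` is a hypothetical helper, not part of the collator above. Everything up to and including the answer tag is set to -100 so that the loss only covers the label tokens that follow it.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_prompt_tokens(input_ids, template_ids):
    """Mask everything up to and including the first occurrence of template_ids.

    If the template is not found, the whole sequence is masked, so the
    example contributes nothing to the loss.
    """
    n, t = len(input_ids), len(template_ids)
    for start in range(n - t + 1):
        if input_ids[start:start + t] == template_ids:
            end = start + t
            return [IGNORE_INDEX] * end + list(input_ids[end:])
    return [IGNORE_INDEX] * n

# Made-up ids: [7, 8] plays the role of the answer tag, 42 the label, 2 the EOS.
ids = [11, 12, 13, 7, 8, 42, 2]
print(mask_prompt_tokens(ids, [7, 8]))   # [-100, -100, -100, -100, -100, 42, 2]
print(mask_prompt_tokens(ids, [9, 9]))   # [-100, -100, -100, -100, -100, -100, -100]
```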

# Training args

In [None]:
output_dir = "/content/drive/MyDrive/llama3_math_contest/Sunday_NOV_2_2025_Fin_Check" #you can change the output dir as you wish

training_args = TrainingArguments(
    output_dir = output_dir,
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,
    max_steps = 3000,
    warmup_steps = 150,
    learning_rate = 8e-5,
    lr_scheduler_type = "cosine",

    logging_steps     = 25,
    save_strategy     = "steps",
    save_steps        = 150,
    save_total_limit  = 2,

    eval_strategy           = "steps",
    eval_steps              = 150,
    load_best_model_at_end  = True,
    metric_for_best_model   = "eval_loss",
    greater_is_better       = False,

    remove_unused_columns = False,

    bf16  = torch.cuda.is_bf16_supported(),
    fp16  = not torch.cuda.is_bf16_supported(),
    optim = "adamw_8bit",
    weight_decay = 0.02,
    seed = 42,
    report_to = "none",
)
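
For intuition about `warmup_steps`, `learning_rate` and `lr_scheduler_type = "cosine"`, the sketch below evaluates the usual schedule shape: linear warmup from zero to the peak rate, followed by half a cosine wave down to zero. It assumes the scheduler follows this standard form; `cosine_lr` is a hypothetical helper, and the values the Trainer actually logs may differ slightly.

```python
import math

def cosine_lr(step, base_lr=8e-5, warmup_steps=150, max_steps=3000):
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

for step in (0, 75, 150, 1575, 3000):
    print(step, f"{cosine_lr(step):.2e}")
```

With these settings the rate peaks at step 150, is back to half the peak at step 1575 (the midpoint of the decay phase), and reaches zero at step 3000.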

# Train

In [10]:
trainer = SFTTrainer(
    model              = model,
    tokenizer          = tokenizer,
    train_dataset      = formatted_train_dataset,
    eval_dataset       = formatted_validation_dataset,
    dataset_text_field = "text",
    data_collator      = collator,
    max_seq_length     = max_seq_length,
    packing            = False,
    args               = training_args,
    callbacks          = [EarlyStoppingCallback(early_stopping_patience=3)],
)


trainer.train()


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 90,000 | Num Epochs = 2 | Total steps = 3,000
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 8 x 1) = 32
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!



Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


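| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 150  | 0.3226 | 0.23587  |
| 300  | 0.2108 | 0.206314 |
| 450  | 0.2013 | 0.173101 |
| 600  | 0.1843 | 0.165886 |
| 750  | 0.1522 | 0.153211 |
| 900  | 0.1556 | 0.152585 |
| 1050 | 0.1552 | 0.145136 |
| 1200 | 0.1549 | 0.155731 |
| 1350 | 0.1413 | 0.143877 |
| 1500 | 0.1323 | 0.131725 |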
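
The logged validation losses also show why early stopping with a patience of 3 did not fire in the steps shown. The sketch below is a simplified model of a patience rule (stop after `patience` consecutive evaluations without a new best loss). It is not the `EarlyStoppingCallback` implementation, and `early_stop_index` is a hypothetical helper.

```python
def early_stop_index(eval_losses, patience=3):
    """Index of the evaluation at which training would stop, or None if it never does."""
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            best = loss        # new best: reset the counter
            bad_evals = 0
        else:
            bad_evals += 1     # no improvement at this evaluation
            if bad_evals >= patience:
                return i
    return None

# Validation losses logged every 150 steps, from step 150 to step 1500.
logged = [0.23587, 0.206314, 0.173101, 0.165886, 0.153211,
          0.152585, 0.145136, 0.155731, 0.143877, 0.131725]
print(early_stop_index(logged))  # None
```

The only regression in the log is at step 1200 (0.155731 after 0.145136), and the next evaluation sets a new best, so the counter never gets past 1.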


TrainOutput(global_step=3000, training_loss=0.16142667500178018, metrics={'train_runtime': 29660.0909, 'train_samples_per_second': 3.237, 'train_steps_per_second': 0.101, 'total_flos': 2.3831484370967593e+18, 'train_loss': 0.16142667500178018})
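
The numbers in the banner and in `TrainOutput` can be cross-checked with a few lines of arithmetic. This is only a consistency check on values already printed in this notebook; the variable names are illustrative.

```python
per_device_batch = 4
grad_accum = 8
num_gpus = 1
max_steps = 3000
train_examples = 90_000
train_runtime_s = 29660.0909   # 'train_runtime' from TrainOutput

effective_batch = per_device_batch * grad_accum * num_gpus
samples_seen = effective_batch * max_steps
epochs = samples_seen / train_examples
samples_per_second = samples_seen / train_runtime_s
hours = train_runtime_s / 3600

print(effective_batch, samples_seen)                              # 32 96000
print(round(epochs, 3), round(samples_per_second, 3), round(hours, 2))  # 1.067 3.237 8.24
```

The run covers slightly more than one pass over the 90,000 training examples, which is presumably why the banner rounds up to "Num Epochs = 2", and the computed throughput matches the reported `train_samples_per_second` of 3.237.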

Running it again to check reproducibility.

---



# Save

In [11]:
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print("✅ Saved model to:", output_dir)

✅ Saved model to: /content/drive/MyDrive/llama3_math_contest/Sunday_NOV_2_2025_Fin_Check


# Inference + submission

In [12]:
# Reload best checkpoint adapters
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name     = output_dir,
    max_seq_length = 2048,
    dtype          = None,
    load_in_4bit   = True,
)
model.eval().to("cuda")

def parse_true_false(text: str) -> bool:
    t = text.strip().lower()
    if t.startswith("true"):
        return True
    if t.startswith("false"):
        return False
    if "true" in t and "false" not in t:  return True
    if "false" in t and "true"  not in t: return False
    return False

test_dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp", split="test")
predictions = []

for ex in tqdm(test_dataset):
    q = ex["question"].strip()
    s = str(ex["solution"]).strip()

    prompt = PROMPT_TEMPLATE.format(question=q, provided_solution=s)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens = 8,      # small headroom for "True"/"False" plus EOS
            do_sample      = False,  # greedy decoding, so no temperature is needed
            eos_token_id   = tokenizer.eos_token_id,
            use_cache      = True,
        )

    new_tokens = outputs[0, inputs["input_ids"].size(1):]
    decoded    = tokenizer.decode(new_tokens, skip_special_tokens=True)

    pred = parse_true_false(decoded)
    predictions.append(pred)

submission = pd.DataFrame({
    "ID": range(len(predictions)),
    "is_correct": predictions
})
submission.to_csv("submission.csv", index=False)
print("submission.csv is ready")



100%|██████████| 10000/10000 [45:51<00:00,  3.63it/s]

submission.csv is ready
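
Training above is monitored only through `eval_loss`. Before submitting, it can also help to run the same generate-and-parse loop on the held-out validation split and look at accuracy directly. The sketch below shows only the scoring step on toy data; `accuracy` and `confusion_counts` are hypothetical helpers, and the predictions are assumed to come from `parse_true_false`.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(bool(t) == bool(p) for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Counts keyed by (true_label, predicted_label)."""
    return Counter((bool(t), bool(p)) for t, p in zip(y_true, y_pred))

# Toy labels and predictions, for illustration only.
y_true = [True, True, False, False, False]
y_pred = [True, False, False, False, True]

print(accuracy(y_true, y_pred))           # 0.6
counts = confusion_counts(y_true, y_pred)
print(counts[(True, False)], counts[(False, True)])  # 1 1
```

The off-diagonal counts show whether the errors lean toward accepting wrong solutions or rejecting correct ones, which the loss alone does not reveal.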



