# Math Question Answer Verification Competition

## Team: fallguys

### Members:
* Sahil Faizal (sf4140)
* Rohit Mohanty (rm6201)
* Jack Chenghao Yang (cy2668)

Adapted from the [official Unsloth implementation](https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=MKX_XKs_BNZR).

# **Installations**

In [None]:
import torch
print(torch.__version__)
import fastai
print(fastai.__version__)
# %%capture
# This cell will take time
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

2.5.1+cu121
2.7.18
Found existing installation: unsloth 2024.11.7
Uninstalling unsloth-2024.11.7:
  Successfully uninstalled unsloth-2024.11.7
Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-fnx78bqg/unsloth_306d8effbbe14ce29c1dab74e4af4bd4
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-fnx78bqg/unsloth_306d8effbbe14ce29c1dab74e4af4bd4
  Resolved https://github.com/unslothai/unsloth.git to commit f26d4e739ed507de7a9088da53d10fd02f58d160
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: unsloth
  Building wheel for unsloth (pyproject.toml) ... done
  Created wheel for unsloth: filename=unsloth-2

# **Loading the model**

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048 # Setting the max sequence length param
dtype = None # None for auto detection
load_in_4bit = True # Use 4bit quantization to reduce memory usage

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

==((====))==  Unsloth 2024.11.7: Fast Llama patching. Transformers = 4.46.2.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



## Load model and wrap with LoRA adapters

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # rank of lora adapter
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 128,
    lora_dropout = 0, # lora dropout, 0 is optimized
    bias = "none",    # "none" is optimized
    use_gradient_checkpointing = "unsloth", # for very long context
    random_state = 3407,
    use_rslora = True,  # rank stabilized LoRA
    loftq_config = None, # LoftQ disabled
)

Unsloth 2024.11.7 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
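
As a rough cross-check on the adapter size, the number of trainable LoRA parameters can be worked out by hand. The sketch below assumes the standard Llama 3.1 8B projection shapes (hidden size 4096, MLP size 14336, 1024-dimensional key/value projections, 32 layers); these dimensions are not printed anywhere in this notebook. It also compares the rank-stabilized scaling factor with the standard LoRA one for `r = 64` and `lora_alpha = 128`.

```python
# Assumed Llama 3.1 8B projection shapes as (in_features, out_features).
# These dimensions are an assumption; the notebook does not print them.
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 4096, 1024, 14336, 32
RANK, ALPHA = 64, 128  # r and lora_alpha from the cell above

shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

# Each adapted matrix gets A (r x in) and B (out x r): r * (in + out) weights.
per_layer = sum(RANK * (d_in + d_out) for d_in, d_out in shapes.values())
total_trainable = per_layer * NUM_LAYERS

# Rank-stabilized LoRA scales the update by alpha / sqrt(r) instead of alpha / r.
rslora_scale = ALPHA / RANK ** 0.5
plain_scale = ALPHA / RANK

print(f"{total_trainable:,} trainable parameters")
print(f"rsLoRA scale = {rslora_scale}, standard LoRA scale = {plain_scale}")
```

Under these assumptions the total comes to 167,772,160, which agrees with the trainable-parameter count the trainer reports when training starts.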


## Competition dataset

In [None]:
# download and load competition dataset

from datasets import load_dataset
dataset = load_dataset("ad6398/nyu-dl-teach-maths-comp")
# print and see dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 1000000
    })
    test: Dataset({
        features: ['question', 'is_correct', 'answer', 'solution'],
        num_rows: 10000
    })
})

# **Creating the prompt template and loading the train dataset**

In [None]:
prompt = """As a renowned mathematician, determine if the given answer to the math question is correct. Respond strictly with 'True' or 'False' and provide no additional explanation.

### Question:
{}

### Solution:
{}

### Answer:
{}

### Output:
{}"""


EOS_TOKEN = tokenizer.eos_token


def formatting_prompts_func(examples):
    questions = examples["question"]
    solutions = examples["solution"]
    answers = examples["answer"]
    outputs = examples["is_correct"]
    texts = []
    for q, s, a, o in zip(questions, solutions, answers, outputs):
        text = prompt.format(q, s, a, o) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

train_dataset = dataset['train'].map(formatting_prompts_func, batched=True)



In [None]:
# Print a sample training example
train_dataset['text'][0]

"As a renowned mathematician, determine if the given answer to the math question is correct. Respond strictly with 'True' or 'False' and provide no additional explanation.\n\n### Question:\nWhat is the radius of the circle inscribed in triangle $ABC$ if $AB = 22, AC=12,$ and $BC=14$? Express your answer in simplest radical form.\n\n### Solution:\nThe circle is inscribed in a triangle, and we know the sides of the triangle.\nTo use the inradius formula, we need to know the area of the triangle.\nWe can use Heron's formula to calculate the area.\n<llm-code>\nimport math\nfrom sympy import *\n\nAB, AC, BC = 22, 12, 14\n\n# Calculate the semiperimeter and area using Heron's formula\ns = (AB + AC + BC) / 2\nK = sqrt(s * (s - AB) * (s - AC) * (s - BC))\n\nprint(K)\n</llm-code>\n<llm-code-output>\n75.8946638440411\n</llm-code-output>\nLet's now use the formula for the radius of the inscribed circle.\n<llm-code>\nr = K / s\nprint(r)\n</llm-code>\n<llm-code-output>\n3.16227766016838\n</llm-code
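
To make the template mechanics concrete, here is a small self-contained sketch of how one row becomes a training string and how the same row becomes an inference prompt. The row is hypothetical (not taken from the competition data), and `EOS` is a placeholder string standing in for `tokenizer.eos_token`.

```python
# Same four-slot template as in the notebook cell above.
TEMPLATE = """As a renowned mathematician, determine if the given answer to the math question is correct. Respond strictly with 'True' or 'False' and provide no additional explanation.

### Question:
{}

### Solution:
{}

### Answer:
{}

### Output:
{}"""

EOS = "<eos>"  # placeholder; the notebook uses tokenizer.eos_token

# Hypothetical row, invented for illustration only.
row = {
    "question": "What is 2 + 3?",
    "solution": "2 + 3 = 5",
    "answer": "5",
    "is_correct": True,
}

# Training text: the boolean label fills the last slot and EOS marks where to stop.
train_text = TEMPLATE.format(
    row["question"], row["solution"], row["answer"], row["is_correct"]
) + EOS

# Inference prompt: the last slot is left empty so the model completes it.
infer_prompt = TEMPLATE.format(row["question"], row["solution"], row["answer"], "")

# The only difference between the two strings is the label plus EOS.
print(train_text[len(infer_prompt):])
```

The slot order (question, solution, answer, output) has to be the same at training and inference time, since the slots are positional.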

# **Installing and initiating the weights and biases monitoring tool**

In [None]:
!pip install wandb



In [None]:
import wandb
wandb.login(key="")  # supply your own API key

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## **SFT Param Setup**

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        warmup_steps=2,
        max_steps=2500,
        learning_rate = 5e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps =10,
        optim = "adamw_hf",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "wandb",
        save_strategy = "steps",
        save_steps = 100,
    )


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = training_args
)

Map (num_proc=4):   0%|          | 0/1000000 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
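
The training budget implied by these arguments can be estimated with a few lines of arithmetic. This is a sketch, assuming a single GPU (as the trainer banner reports) and assuming that the `"linear"` scheduler follows the usual linear-warmup, linear-decay-to-zero shape; `linear_lr` is a hypothetical helper, not part of any library.

```python
per_device_batch = 4
grad_accum = 8
max_steps = 2500
warmup_steps = 2
peak_lr = 5e-5
num_train_examples = 1_000_000

# One optimizer step consumes batch_size * accumulation_steps examples (single GPU).
effective_batch = per_device_batch * grad_accum
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_train_examples


def linear_lr(step):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


print(f"effective batch size: {effective_batch}")
print(f"examples seen: {examples_seen:,} ({fraction_of_epoch:.0%} of the training split)")
print(f"learning rate at step 1251: {linear_lr(1251):.2e}")
```

With `max_steps=2500`, the run therefore covers only a fraction of the 1,000,000 training rows rather than a full epoch.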


# **Starting the model training**

In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,000,000 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 4 | Gradient Accumulation steps = 8
\        /    Total batch size = 32 | Total steps = 2,500
 "-____-"     Number of trainable parameters = 167,772,160
wandb: Currently logged in as: sf4140 (sf4140-new-york-university). Use `wandb login --relogin` to force relogin


Step,Training Loss
10,0.9933
20,0.7519
30,0.7493
40,0.7013
50,0.7159
60,0.6902
70,0.6723
80,0.7177
90,0.6907
100,0.7131



## **Performing inference**

In [None]:
# Sample inference data point
test_dataset = dataset['test']

sample_ques = test_dataset['question'][0]
sample_sol = test_dataset['solution'][0]
sample_ans = test_dataset['answer'][0]

In [None]:
test_dataset[0]

{'question': 'The Parker family needs to leave the house by 5 pm for a dinner party. Mrs. Parker was waiting to get into the bathroom at 2:30 pm. Her oldest daughter used the bathroom for 45 minutes and her youngest daughter used the bathroom for another 30 minutes. Then her husband used it for 20 minutes. How much time will Mrs. Parker have to use the bathroom to leave on time?',
 'is_correct': True,
 'answer': '205',
 'solution': "Let's solve this problem using Python code.\n<llm-code>\nminutes_per_hour = 60\nminutes_left_before_5 = 5 * minutes_per_hour\ntotal_time_spent_by_family = 45 + 30 + 20\nminutes_before_5_after_family = minutes_left_before_5 - total_time_spent_by_family\nminutes_before_5_after_family\n</llm-code>\n<llm-code-output>\n205\n</llm-code-output>\nThus Mrs. Parker will have \\boxed{205} minutes in the bathroom before the family leaves."}

In [None]:
sample_exp = test_dataset['solution'][0]


In [None]:
# Run inference on a single test example
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
input_prompt = prompt.format(
        sample_ques, # question
        sample_exp, # solution text (second slot of the template)
        sample_ans, # given answer (third slot of the template)
        "", # output: left blank so the LLM generates 'True' or 'False'
    )

print("Input Prompt:\n", input_prompt)
inputs = tokenizer(
[
    input_prompt
], return_tensors = "pt").to("cuda")

input_shape = inputs['input_ids'].shape
input_token_len = input_shape[1] # index 1 is the sequence length; index 0 is the batch dimension
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
# To get the whole generated text (prompt included), uncomment the line below
# text_generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)

response = tokenizer.batch_decode([outputs[0][input_token_len:]], skip_special_tokens=True)
response

Input Prompt:
 As a renowned mathematician, determine if the given answer to the math question is correct. Respond strictly with 'True' or 'False' and provide no additional explanation.

### Question:
The Parker family needs to leave the house by 5 pm for a dinner party. Mrs. Parker was waiting to get into the bathroom at 2:30 pm. Her oldest daughter used the bathroom for 45 minutes and her youngest daughter used the bathroom for another 30 minutes. Then her husband used it for 20 minutes. How much time will Mrs. Parker have to use the bathroom to leave on time?

### Solution:
205

### Answer:
Let's solve this problem using Python code.
<llm-code>
minutes_per_hour = 60
minutes_left_before_5 = 5 * minutes_per_hour
total_time_spent_by_family = 45 + 30 + 20
minutes_before_5_after_family = minutes_left_before_5 - total_time_spent_by_family
minutes_before_5_after_family
</llm-code>
<llm-code-output>
205
</llm-code-output>
Thus Mrs. Parker will have \boxed{205} minutes in the bathroom before 

['False']

In [None]:
print(type(test_dataset))  # The split is a datasets Dataset object
print(type(test_dataset[0]))  # Indexing a single row returns a dict


<class 'datasets.arrow_dataset.Dataset'>
<class 'dict'>


# **Saving the model**

In [None]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

# **Running inference on test data**

In [None]:

batch_size = 10
predictions = []

model.eval()
# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"

for start_idx in range(0, len(test_dataset), batch_size):
    # Slicing a Dataset returns a dict of column lists
    batch_samples = test_dataset[start_idx:start_idx + batch_size]
    batch_questions = batch_samples['question']
    batch_solutions = batch_samples['solution']
    batch_answers = batch_samples['answer']

    # Leave the output slot empty so the model fills in 'True' or 'False'
    batch_prompts = [
        prompt.format(q, s, a, "")
        for q, s, a in zip(batch_questions, batch_solutions, batch_answers)
    ]

    inputs = tokenizer(
        batch_prompts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=2048).to("cuda")

    with torch.no_grad():
        outputs = model.generate(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_new_tokens=256,
            do_sample=False,
        )

    # Every prompt in the batch is padded to the same length,
    # so the generated tokens start at input_len for each row
    input_len = inputs['input_ids'].shape[1]
    for i in range(len(outputs)):
        output_ids = outputs[i][input_len:]
        response = tokenizer.decode(output_ids, skip_special_tokens=True).strip()

        # Anything other than an exact "True" is recorded as "False"
        predictions.append("True" if response == "True" else "False")

    torch.cuda.empty_cache()
    print(f"Processed {min(start_idx + batch_size, len(test_dataset))}/{len(test_dataset)} examples.")



Processed 10/10000 examples.
Processed 20/10000 examples.
Processed 30/10000 examples.
Processed 40/10000 examples.
Processed 50/10000 examples.
Processed 60/10000 examples.
Processed 70/10000 examples.
Processed 80/10000 examples.
Processed 90/10000 examples.
Processed 100/10000 examples.
Processed 110/10000 examples.
Processed 120/10000 examples.
Processed 130/10000 examples.
Processed 140/10000 examples.
Processed 150/10000 examples.
Processed 160/10000 examples.
Processed 170/10000 examples.
Processed 180/10000 examples.
Processed 190/10000 examples.
Processed 200/10000 examples.
Processed 210/10000 examples.
Processed 220/10000 examples.
Processed 230/10000 examples.
Processed 240/10000 examples.
Processed 250/10000 examples.
Processed 260/10000 examples.
Processed 270/10000 examples.
Processed 280/10000 examples.
Processed 290/10000 examples.
Processed 300/10000 examples.
Processed 310/10000 examples.
Processed 320/10000 examples.
Processed 330/10000 examples.
Processed 340/10000
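
The loop above records a prediction as "True" only when the decoded text is exactly `True`; any other output, including `true` or `True.`, falls back to "False". A more tolerant parser is one possible alternative. The sketch below uses a hypothetical helper, `parse_verdict`, that is not part of the original notebook: it inspects only the first alphabetic word, ignores case, and keeps the same "False" fallback.

```python
import re


def parse_verdict(response: str) -> str:
    """Map raw generated text to 'True' or 'False' (hypothetical helper).

    Only the first alphabetic word is inspected, case-insensitively.
    Anything that is not 'true' falls back to 'False'.
    """
    match = re.search(r"[A-Za-z]+", response)
    if match and match.group(0).lower() == "true":
        return "True"
    return "False"


# Hand-written examples of the kinds of strings a model might emit.
for raw in ["True", " true.\n", "False", "TRUE because ...", "", "42"]:
    print(repr(raw), "->", parse_verdict(raw))
```

Whether the looser matching helps depends on how often the fine-tuned model deviates from the bare label, which this notebook does not measure.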

# **Preparing predictions file**

In [None]:
import pandas as pd
# Create the submission file
submission = pd.DataFrame({
    "ID": range(len(predictions)),
    "is_correct": predictions
})
submission.to_csv("/content/submission.csv", index=False)

print("Submission file saved as 'submission.csv'")

Submission file saved as 'submission.csv'
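
Before uploading, it can be worth checking that the file has the shape the cell above intends: an `ID` column counting up from 0 and an `is_correct` column holding only `True` or `False`. The sketch below uses a hypothetical helper, `validate_submission`, and a toy file in a temporary directory rather than `/content/submission.csv`. It assumes this column layout is what the competition expects, which the notebook does not state explicitly.

```python
import csv
import os
import tempfile


def validate_submission(path, expected_rows):
    """Return a list of problems found in a submission CSV (hypothetical helper)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != ["ID", "is_correct"]:
            return [f"unexpected header: {reader.fieldnames}"]
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for position, row in enumerate(rows):
        if row["ID"] != str(position):
            problems.append(f"row {position}: ID is {row['ID']!r}")
        if row["is_correct"] not in ("True", "False"):
            problems.append(f"row {position}: label is {row['is_correct']!r}")
    return problems


# Toy file standing in for the real submission; the last label is deliberately invalid.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "submission.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "is_correct"])
        writer.writerows([[0, "True"], [1, "False"], [2, "maybe"]])
    demo_problems = validate_submission(demo_path, expected_rows=3)

print(demo_problems)
```

For the real file, the call would take the saved path and the size of the test split (10,000 rows).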


# **Downloading the model files**

In [None]:
import shutil
from google.colab import files

# Define the folder to zip and the output zip file name
folder_to_zip = 'lora_model'  # Replace with the folder you want to zip
zipped_file_name = 'lora_model.zip'  # Replace with your desired zip file name

# Zip the folder
shutil.make_archive(zipped_file_name.replace('.zip', ''), 'zip', folder_to_zip)

# Download the zip file
files.download(zipped_file_name)
