# Contribution

SEAN TAN KAI WEN (U2240611G) - Trained and tested model

CHAN MIN ADELINE ALYSSA (U2221138E) - Trained and tested model


# Project 1: Teaching NanoGPT to Do Math


Our project attempts to use reinforcement learning DPO to teach NanoGPT to solve some simple algebra and arithmetic problems.

References used:
- https://huggingface.co/blog/pref-tuning
- https://github.com/togethercomputer/together-cookbook/blob/main/Finetuning/DPO_Finetuning.ipynb
- https://docs.pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingLR.html

### Step 1: Install necesscary packages






In [2]:
!pip install ipykernel matplotlib torch numpy transformers datasets tiktoken wandb tqdm

import sys
sys.path.append('.')  # so 'model.py' is importable



### Step 2: Package imports and configuration


Above, we start by configuring our hyperparameters. Several key parameters we explored and modified include:

|Paramter| Value |
|-----------|-----------|
| base_lr   | 1e-3  |
| epochs   | 20 |
| max_new_tokens   | 50   |
| temperature   | 1e-30   |
| top_k   | None   |
| beta   | 0.8   |


### base_lr:
*   The base learning rate for our AdamW optimizer. Controls how much the model's weights are updated based on computed gradients.
*   We varied this value from 1e-6 to 1e-3 until we discovered the optimal base_lr by comparing the output from our testset. If we set the learning rate too small, the model may make very little progress in finding the optimal loss. Likewise, if we set the learning rate too large, the model may overlook the optimal loss.

### epochs:
*   number of passes done over the dataset. A balance here had to be struck between a high number of epochs leading to overfitting on our training dataset and computational inefficiencies and a low number of epochs leading to underfitting and not converging.


### max_new_tokens:
*   determines the maximum number of generated tokens. It controls the length of the generated output. Sticking to the default value of 200 results the model producing outputs and that includes unnecessary and meaningless text. Hence max_new_tokens was reduced to 50, which was ideal in this math context.

### temperature:
*    We discovered that a low temperature is ideal for our model as high temperatures lead to more randomess in the output. Hence, a low temperature serves more deterministic outputs which is ideal for arithemtic equations. Allows greedy decoding. On a technical side, if we chose temperature smaller than 1e-30, it is essentially the same as setting temperature to 0 because it will throw CUDA device assert errors.

### top_k:
*   determines how many of the top tokens will be considered. Since we're using a near-zero temperature, our model will always choose the token with the highest probability, maintaining the deterministic behavior we expect from it. top_k is set to None because DPO needs unaltered probability distribution and math outputs do not benefit from controlled randomness.

### beta:
*   is the hyperparamter of the implicit reward of the perferred (positive) response over the negative response. Here we tested several betas from 0.5 to 0.9 and compared the models output on the testing set and identified the optimal beta. Since the optimal beta is set at 0.8, positive samples are weighted 25% more strongly in the DPO algorithm.


In [3]:
import sys
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import torch
import torch.nn as nn
import torch.nn.functional as F
import random
import pickle
sys.path.append(os.path.abspath('..'))  # add parent dir relative to notebook
from model import GPT, GPTConfig
import random
from tqdm import tqdm
import time
import json
import matplotlib.pyplot as plt
# Configuration
device = 'cuda' if torch.cuda.is_available() else 'cpu'
base_lr = 1e-3
epochs = 20
beta = 0.8
batch_size = 64
max_length =64
num_samples = 1
max_new_tokens = 50
temperature = 1e-30
top_k = None


### Tokenization and Data Encoding

Here, we implemented character tokenization that converts text to integer sequences that the model can process.
unk_id is returned to handle any unknown characters.


In [4]:
# tokenizer
with open("../sft/meta.pkl", "rb") as f:
    meta = pickle.load(f)
stoi, itos = meta["stoi"], meta["itos"]

unk_id = stoi.get("<unk>", 0)  # fallback to 0 if no <unk> in vocab

def encode(s):
    return [stoi.get(c, unk_id) for c in s]
# def encode(s): return [stoi[c] for c in s]
def decode(l): return ''.join([itos[i] for i in l])

EOS_ID = stoi.get("\n", None)  # we'll append this at train time and stop on it at test time

In [5]:
import torch, platform
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())

torch: 2.8.0 cuda: None
cuda available: False
device count: 0


In [None]:
meta['stoi']['!'] = 0
meta['itos'][0] = '!'

stoi is string to integer mapping, while itos is integer to string mapping.

If there is an unknown character faced, or special symbols such as !, they will be encoded to 0.

If there is a line spacing "\n", the encoder returns None

### Step 3: Define helper functions

In [None]:
def compute_logprob(input_ids):
    inputs = input_ids[:, :-1]
    targets = input_ids[:, 1:]
    logits, _ = gpt(inputs, full_seq=True)
    B, T, V = logits.size()
    logits_flat = logits.reshape(-1, V)
    targets_flat = targets.reshape(-1)
    loss = F.cross_entropy(logits_flat, targets_flat, ignore_index=0, reduction='none')
    loss = loss.reshape(B, T)
    attention_mask = (targets != 0).float()
    loss = (loss * attention_mask).sum(dim=1) / attention_mask.sum(dim=1)
    return -loss

def pad_or_truncate(seq, max_length):
    return seq[-max_length:] if len(seq) > max_length else seq + [0] * (max_length - len(seq))

def get_batches(lines, batch_size):
    random.shuffle(lines)
    #for l in lines:
    #    print(l[1])
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i+batch_size]
        if len(batch) < batch_size:
            continue
        neg_inputs = [pad_or_truncate(encode(p['negative'] + '\n\n\n\n'), max_length) for p in batch]
        pos_inputs = [pad_or_truncate(encode(p['positive'] + '\n\n\n\n'), max_length) for p in batch]
        neg_tensor = torch.tensor(neg_inputs, dtype=torch.long, device=device)
        pos_tensor = torch.tensor(pos_inputs, dtype=torch.long, device=device)
        yield neg_tensor, pos_tensor

### Step 4: Load the pretrained NanoGPT model

In [6]:
ckpt = torch.load("../sft/gpt.pt", map_location=device)
gptconf = GPTConfig(**ckpt['model_args'])
gpt = GPT(gptconf)
state_dict = ckpt['model']
unwanted_prefix = '_orig_mod.'
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
gpt.to(device).train()

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(74, 348)
    (wpe): Embedding(256, 348)
    (drop): Dropout(p=0.2, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=348, out_features=1044, bias=False)
          (c_proj): Linear(in_features=348, out_features=348, bias=False)
          (attn_dropout): Dropout(p=0.2, inplace=False)
          (resid_dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=348, out_features=1392, bias=False)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=1392, out_features=348, bias=False)
          (dropout): Dropout(p=0.2, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=348, out_features=74, bias=False)
)

### Step 5: Load Data

We generated this training json dataset using a generator python script (pos_neg_pairs_idk_less_extreme_FIXED.py) that outputs a .json. It consists of 100k items with each items being of either:

            * "x_times_b_eq_c",
            * "x_plus_b_eq_c",
            * "x_minus_b_eq_c",
            * "b_minus_x_eq_c",
            * "b_over_x_eq_c",
            * "a_plus_b_q",
            * "a_minus_b_q",
            * "a_times_b_q",
            * "a_over_b_q",


Additionally, to ensure the model is trained to recognize that the position of operands may vary while the underlying arithmetic relationship remains the same, we included sequential variations. For example:

x + 4 = 10

4 + x = 10

While the dataset only contains 100k entries, careful thought and experimentation was carried out in order to balance potential overfitting and wider generalization.

* A dataset with a narrower numerical range could result in overfitting to these specific values but failing to generalize beyond unseen values.
* A dataset with a wider numerical range may lead to many extreme cases which the model might struggle to learn pattens from. An example would be 8/11 = 0.727272..., or 999*999, which are not covered in the test set.
* Hence a tradeoff must be found between coverage and diversity

The key component used behind the generation of data was the random number generator, where it will randomly select a expression of a specific operand. After that, the random number generator is used again to randomly choose a number within a specified range, to be used in the expression.

We realized that using the random number generator to randomly choose the expressions leads to a skewed distribution of expressions, which led to suboptimal test results. As such, we decided to have a equal proportion of expressions to further improve the test results.






In [7]:
# load the dataset
with open("pos_neg_pairs_idk_less_extreme_FIXED.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# check length
print("Number of samples:", len(data))

# inspect a few entries
for i in range(3):
    print(data[i])


Number of samples: 100000
{'positive': '3-2=? The answer is 1 because 3-2 equals 1.', 'negative': '3-2=? Sorry, I do not know!', 'Q': '3-2=?', 'A': '1', 'type': 'a_minus_b_q'}
{'positive': '522/29=? The answer is 18 because 522/29 equals 18.', 'negative': '522/29=? Sorry, I do not know!', 'Q': '522/29=?', 'A': '18', 'type': 'a_over_b_q'}
{'positive': '86-69=? The answer is 17 because 86-69 equals 17.', 'negative': '86-69=? Sorry, I do not know!', 'Q': '86-69=?', 'A': '17', 'type': 'a_minus_b_q'}


In DPO, the model is taught by showing it the positive and negative responses to the same question:

* Positive Response: The correct, preferred response - 'x+95=98, x=? The answer is 3 because 98-95 equals 3.'
* Negative Response: An incorrect, undesirable response - 'x+95=98, x=? Sorry, I do not know!'

The model learns to increase probability of positive samples and decrease probability of negative samples. These contrastive examples are used to ensure the model attempts to computes the answer as we intend the model to learn mathematical reasoning.

### Step 6: Build the optimizer and scheduler

We use an AdamW optimizer with a warmup + cosine annealing learning rate schedule to ensure stable and efficient training. AdamW is a variant of the Adam optimizer; however, AdamW decouples weight decay from the gradiant update leading to more consistent regularization and better generalization.


In [None]:
# --- OPTIMIZER ---
optimizer = torch.optim.AdamW(gpt.parameters(), lr=base_lr, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

# --- SCHEDULER ---
import math
steps_per_epoch = max(1, len(data) // batch_size)  # clamp to at least 1
total_steps = epochs * steps_per_epoch
warmup_steps = max(1, int(0.4 * total_steps)) # CHANGED HERE
min_lr = base_lr * 0.05 # CHANGED HERE

def lr_lambda(step):
    if step < warmup_steps: # Warmup Phase
        return step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress)) # Cosine Delay Phase
    return (min_lr / base_lr) + (1.0 - (min_lr / base_lr)) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

|Paramter| Value |
|-----------|-----------|
| warmup_ratio   | 0.4  |
| min_lr_ratio   | 0.05 |




lr_lambda defines the model's learning rate schedule


Warmup ratio: indicates a linear increase of learning rate from 0 to base_lr over warmup_steps. In this case 40% of total training steps is spent ramping up the learning rate from 0 to base_lr

Min_lr_ratio: the ratio between minimum learning rate and base learning rate during cosine or linear decay, used in calculating the minimum learning rate.

Cosine decay phase: gradually decreases learning rate from base_lr to min_lr using a half cosine curve.

This ensures a smooth transition and prevents sudden jumps in learning rate.

### Step 7: Begin training

Here we begin our training loop for the model using a DPO algorithm to train the model on positive and negative sequences.

For each epoch loop, we then loop over each batch in said epoch.
The negative and positive examples passed into compute_logprob and the respective log proability of the sequence is returned for both.

neg_logprob and pos_logprob is then passed to the loss function:

        loss = -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean() - pos_logprob.mean() * 0.1

pos_logprob - neg_logprob measures how much more likely the model thinks the positive is compared to the negative example. logsigmoid then converts this to a smooth and differentiable loss.

      - pos_logprob.mean() * 0.1
is a KL-Divergence Regularization penalty to encourages positive sequences to remain highly probable

The beta parameter controls the weight of this preference

Our loss function directly encourages the model to assign higher probailities to correct answers rather than incorrect ones.

We then start the backpropagation process and compute the gradient loss over all the parameters. Gradiants are then rescaled to not exceed 1.0 to present exploding gradiants that can destabilize the model.

The AdamW optimizer then update the model's parameters and the learning rate is adjusted to the cosine delay with warmup. We then clear the optimization for the next batch.

In [None]:
import numpy as np
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

total_steps = len(data) // batch_size
global_step = 0
for epoch in range(epochs):
    pbar = tqdm(get_batches(data, batch_size))
    for step, (neg_tensor,pos_tensor) in enumerate(pbar):
        ###########################################################
        # Please complete the training code here!
        # Examples:
        # ...
        neg_logprob = compute_logprob(neg_tensor)
        pos_logprob = compute_logprob(pos_tensor)
        loss = -F.logsigmoid((pos_logprob - neg_logprob) / beta).mean() - pos_logprob.mean() * 0.1
        loss.backward()
        torch.nn.utils.clip_grad_norm_(gpt.parameters(), 1.0) #prevent exploding gradients
        optimizer.step() #optimizer update
        scheduler.step() #learning rate update
        optimizer.zero_grad()
        global_step += 1
        pbar.update(1)
        pbar.set_postfix(loss=float(loss.item()),
                         lr=float(optimizer.param_groups[0]["lr"]))
        # ...
        ###########################################################
    pbar.close()
    ckpt_path = f"dpo_epoch{epoch+1}.pt"
    torch.save({
        "model_state_dict": gpt.state_dict(),
        "model_args": ckpt['model_args'],
    }, ckpt_path)
    print(f"Saved checkpoint to {ckpt_path}")

1562it [01:07, 22.99it/s, loss=0.0284, lr=0.000125]


Saved checkpoint to ./dpo/dpo_epoch1.pt


1562it [01:06, 23.46it/s, loss=0.024, lr=0.00025]


Saved checkpoint to ./dpo/dpo_epoch2.pt


1562it [01:06, 23.32it/s, loss=0.0224, lr=0.000375]


Saved checkpoint to ./dpo/dpo_epoch3.pt


1562it [01:06, 23.49it/s, loss=0.0198, lr=0.0005]


Saved checkpoint to ./dpo/dpo_epoch4.pt


1562it [01:06, 23.64it/s, loss=0.0198, lr=0.000625]


Saved checkpoint to ./dpo/dpo_epoch5.pt


1562it [01:06, 23.59it/s, loss=0.0191, lr=0.00075]


Saved checkpoint to ./dpo/dpo_epoch6.pt


1562it [01:06, 23.61it/s, loss=0.0195, lr=0.000875]


Saved checkpoint to ./dpo/dpo_epoch7.pt


1562it [01:06, 23.56it/s, loss=0.0193, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch8.pt


1562it [01:06, 23.63it/s, loss=0.0182, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch9.pt


1562it [01:06, 23.59it/s, loss=0.0189, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch10.pt


1562it [01:06, 23.62it/s, loss=0.0177, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch11.pt


1562it [01:06, 23.42it/s, loss=0.0173, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch12.pt


1562it [01:06, 23.34it/s, loss=0.0183, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch13.pt


1562it [01:06, 23.52it/s, loss=0.0183, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch14.pt


1562it [01:06, 23.60it/s, loss=0.0185, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch15.pt


1562it [01:06, 23.49it/s, loss=0.0178, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch16.pt


1562it [01:06, 23.53it/s, loss=0.0185, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch17.pt


1562it [01:06, 23.46it/s, loss=0.0181, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch18.pt


1562it [01:06, 23.50it/s, loss=0.0173, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch19.pt


1562it [01:06, 23.57it/s, loss=0.0174, lr=0.001]


Saved checkpoint to ./dpo/dpo_epoch20.pt


The model is then saved to memory, creating a checkpoint, enabling the testing process to be more efficient in the future. The checkpoint also allows us to compare the model at various epochs.

### Step 8: Begin testing

The pre-trained model checkpoint is loaded and then run inference (test) mode to generate answers for several test cases.

Each prompt is tokenized and is fed into generate() to output an answer. These generated token are decoded back to text and regex is used to extract the final numerical output.

In [None]:
import re
import numpy as np


SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)


# Load the fine-tuned model
ckpt_path = "dpo_epoch20.pt"
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
gpt = GPT(gptconf).cuda()
try:
    state_dict = checkpoint['model']
except:
    state_dict = checkpoint['model_state_dict']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)
# Test
gpt.eval()
test_set = ["17+19=?", "3*17=?", "72/4=?", "72-x=34,x=?", "x*11=44,x=?", "3*17=?", "72/4=?", "72-x=34,x=?"]
with torch.no_grad():
    for prompt in test_set:
        prompt_ids = encode(prompt)
        ###########################################################
        # Please complete the test code here!
        # ...
        x = torch.tensor(prompt_ids, dtype=torch.long, device=device).unsqueeze(0)
        out = gpt.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
        if isinstance(out, tuple):
          y = out[0]           # token ids
          # logits = out[1]    # optional, if you need them
        else:
          y = out

        gen_tokens = y[0, x.size(1):].tolist()   # newly generated tokens
        text = decode(gen_tokens)
        ans_line = text.split("\n", 1)[0].strip()
        nums = re.findall(r"-?\d+", ans_line)
        ans = nums[0] if nums else ans_line
        print(f"Q: {prompt}\nA: {ans}\n")

Q: 17+19=?
A: 36

Q: 3*17=?
A: 51

Q: 72/4=?
A: 18

Q: 72-x=34,x=?
A: 38

Q: x*11=44,x=?
A: 4

Q: 3*17=?
A: 51

Q: 72/4=?
A: 18

Q: 72-x=34,x=?
A: 38



To further test the correctness of our model, we then tested it on a larger test set of 1000 samples. This dataset was generated using the same generator for the training dataset to ensure it comes from the same data distribution. Hence, we are able to fairly test it's generalizability and accuracy. We also break down its error by equation type to identify dataset skew. If the model consistently fails on certain types of problems, it may indicate that those types are underrepresented in the training data. Detecting such skews allows us to rebalance the dataset and improving overall generalization.

In [None]:
import json
import re
with open("pos_neg_pairs_idk_less_extreme_FIXED_test.json", "r", encoding="utf-8") as f:
    raw_data = json.load(f)


test_data = []
for ex in raw_data:
    q = ex["positive"].split("The answer is")[0].strip()
    m = re.search(r"The answer is\s+(-?\d+)", ex["positive"])
    a = m.group(1) if m else None
    test_data.append({"Q": q, "A": a, "type": ex['type']})

for i in range(3):
  print(raw_data[i])
  print(test_data[i])

print(f'Loaded {len(test_data)} test cases.')


{'positive': '36+18=? The answer is 54 because 36+18 equals 54.', 'negative': '36+18=? Sorry, I do not know!', 'Q': '36+18=?', 'A': '54', 'type': 'a_plus_b_q'}
{'Q': '36+18=?', 'A': '54', 'type': 'a_plus_b_q'}
{'positive': '89-x=42,x=? The answer is 47 because 89-42 equals 47.', 'negative': '89-x=42,x=? Sorry, I do not know!', 'Q': '89-x=42,x=?', 'A': '47', 'type': 'b_minus_x_eq_c'}
{'Q': '89-x=42,x=?', 'A': '47', 'type': 'b_minus_x_eq_c'}
{'positive': '58-38=? The answer is 20 because 58-38 equals 20.', 'negative': '58-38=? Sorry, I do not know!', 'Q': '58-38=?', 'A': '20', 'type': 'a_minus_b_q'}
{'Q': '58-38=?', 'A': '20', 'type': 'a_minus_b_q'}
Loaded 1000 test cases.


In [None]:
from collections import defaultdict
import re

# Load the fine-tuned model
ckpt_path = "dpo_epoch20.pt"
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
#gpt = GPT(gptconf).cuda()
gpt = GPT(gptconf)

try:
    state_dict = checkpoint['model']
except:
    state_dict = checkpoint['model_state_dict']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
gpt.load_state_dict(state_dict)

gpt.to(device)
# Test
gpt.eval()
with torch.no_grad():

        total = 0
        correct_count = 0
        breakdown = defaultdict(lambda: {"correct": 0, "incorrect": 0})
        incorrect_list = []

        with open(f"test_cases_with_answers_final.txt", "w") as f:
            for prompt in test_data:
              prompt_ids = encode(prompt['Q'])

              x = torch.tensor(prompt_ids, dtype=torch.long, device=device).unsqueeze(0)
              out = gpt.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
              if isinstance(out, tuple):
                y = out[0]           # token ids
              else:
                y = out

              gen_tokens = y[0, x.size(1):].tolist()
              text = decode(gen_tokens)
              ans_line = text.split("\n", 1)[0].strip()
              nums = re.findall(r"-?\d+", ans_line)
              ans = nums[0] if nums else ans_line

              ground_truth = prompt['A']
              eq_type = prompt['type']

              total += 1

              if str(ans) == str(ground_truth):
                  correct = "Correct"
                  correct_count += 1
                  breakdown[eq_type]["correct"] += 1
              else:
                  correct = "Incorrect"
                  breakdown[eq_type]["incorrect"] += 1
                  incorrect_list.append({
                      "Q": prompt['Q'],
                      "A": ans,
                      "Ground Truth": ground_truth,
                      "Type": eq_type
                    })


              f.write(f"Q: {prompt['Q']}\n")
              f.write(f"A: {ans}\n")
              f.write(f"Ground Truth: {ground_truth}\n")
              f.write(f"Equation Type: {eq_type}\n")
              f.write(f"Result: {correct}\n")
              if total % 100 == 0:
                print(f"{total} test cases done...")

            # summary
            f.write("\n=== Summary ===\n")
            f.write(f"Total: {total}, Correct: {correct_count}, Incorrect: {total - correct_count}\n")
            f.write(f"Accuracy: {correct_count/total:.2%}\n\n")

            # breakdown by type
            f.write("=== Breakdown by Equation Type ===\n")
            for t, stats in breakdown.items():
                f.write(f"{t}: Correct={stats['correct']}, Incorrect={stats['incorrect']}, Percentage ={stats['correct']/(stats['incorrect']+stats['correct']):.3%}\n")
            f.write("\n")


            f.write("=== Incorrect Equations ===\n")
            for item in incorrect_list:
                f.write(f"Q: {item['Q']}\n")
                f.write(f"A: {item['A']}\n")
                f.write(f"Ground Truth: {item['Ground Truth']}\n")
                f.write(f"Type: {item['Type']}\n")
                f.write("-"*40+"\n")

            print(f"Total: {total}, Correct: {correct_count}, Incorrect: {total - correct_count}\n")
            print(f"Accuracy: {correct_count/total:.2%}\n\n")
            print("=== Breakdown by Equation Type ===\n")
            for t, stats in breakdown.items():
                print(f"{t}: Correct={stats['correct']}, Incorrect={stats['incorrect']}, Percentage ={stats['correct']/(stats['incorrect']+stats['correct']):.3%}\n")



100 test cases done...
200 test cases done...
300 test cases done...
400 test cases done...
500 test cases done...
600 test cases done...
700 test cases done...
800 test cases done...
900 test cases done...
1000 test cases done...
Total: 1000, Correct: 978, Incorrect: 22

Accuracy: 97.80%


=== Breakdown by Equation Type ===

a_plus_b_q: Correct=99, Incorrect=0, Percentage =100.000%

b_minus_x_eq_c: Correct=114, Incorrect=0, Percentage =100.000%

a_minus_b_q: Correct=102, Incorrect=0, Percentage =100.000%

x_minus_b_eq_c: Correct=123, Incorrect=0, Percentage =100.000%

x_times_b_eq_c: Correct=127, Incorrect=0, Percentage =100.000%

a_times_b_q: Correct=107, Incorrect=0, Percentage =100.000%

b_over_x_eq_c: Correct=106, Incorrect=12, Percentage =89.831%

x_plus_b_eq_c: Correct=103, Incorrect=0, Percentage =100.000%

a_over_b_q: Correct=97, Incorrect=10, Percentage =90.654%

