# Exercise 3: Mathematical Problem Solving with LLMs

**This is a marked exercise (graded)**

Apply LLMs to solve mathematical reasoning tasks. Test different pre-trained models with various prompting strategies and optionally fine-tune with LoRA to improve performance.

**Learning Objectives:**
- Evaluate LLMs on mathematical reasoning
- Design effective prompts for numerical tasks
- Implement and compare different prompting strategies
- Optionally: Fine-tune models using LoRA
- Measure performance using an accuracy metric with tolerance

**Deliverables:**
- Completed notebook with your approach
- `submission.csv` with predictions on test set (100 problems)
- Score: accuracy, with predictions compared to the true answers at 2-decimal precision (pass threshold: 70%)

## Part 1: Setup and Load Data

In [1]:
!pip install transformers torch peft datasets pandas scikit-learn matplotlib requests -q

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import requests
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Part 2: Download Dataset

Download the math problem dataset (1000 problems: 900 train, 100 test).

In [4]:
# URLs for the dataset files
base_url = 'https://www.raphaelcousin.com/modules/data-science-practice/module8/exercise/'

train_url = base_url + 'train.csv'
test_url = base_url + 'test.csv'

def download_file(url, filename):
    """Download a file from URL."""
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")

# Download files
download_file(train_url, 'train.csv')
download_file(test_url, 'test.csv')

Downloaded train.csv
Downloaded test.csv


In [5]:
# Load the datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(f"Train set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

# Display category distribution
print("\nTraining set category distribution:")
print(train_data['category'].value_counts().sort_index())

print("\nSample training problems:")
print(train_data.head(10))

Train set size: 900
Test set size: 100

Training set category distribution:
category
algebra          150
arithmetic       153
fractions        143
geometry         155
percentage       152
word_problems    147
Name: count, dtype: int64

Sample training problems:
   id       category                                            problem  \
0   0     percentage                                Increase 109 by 25%   
1   1     arithmetic                                   What is 76 + 55?   
2   2  word_problems  Sarah has $286. She spends $128. How much mone...   
3   3       geometry  What is the circumference of a circle with rad...   
4   4       geometry   What is the volume of a cube with side length 3?   
5   5     percentage                                 What is 7% of 132?   
6   6  word_problems  John is 10 years old now. How old was he 15 ye...   
7   7      fractions                  What is 1/5 + 2/5? (decimal form)   
8   8     percentage                                What is 2

## Part 3: Baseline - Dummy Model

Create a baseline to understand what poor performance looks like.

In [6]:
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """
    Calculate accuracy with tolerance for floating point comparisons.

    Two values are considered equal if their difference is <= tolerance
    OR if they round to the same value at 2 decimal places.
    """
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        # Check if both round to same 2 decimal places
        if round(pred, 2) == round(truth, 2):
            correct += 1
        # Or if absolute difference is very small
        elif abs(pred - truth) <= tolerance:
            correct += 1

    return correct / len(predictions)

# Dummy baseline: always predict the mean
mean_solution = train_data['solution'].mean()
print(f"Dummy model (always predicts mean): {mean_solution:.2f}")
print("This demonstrates very poor performance. Your model should do much better!")

Dummy model (always predicts mean): 150.79
This demonstrates very poor performance. Your model should do much better!
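
The cell above prints the mean but does not score it. As a minimal sketch, a constant predictor could be scored with the same tolerance rule as follows. The `toy_solutions` values are made up for illustration and are not taken from `train.csv`.

```python
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """Same rule as the notebook: match at 2 decimals or within tolerance."""
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        if round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance:
            correct += 1
    return correct / len(predictions)

# Hypothetical solutions, for illustration only (not from the dataset).
toy_solutions = [10.0, 20.0, 30.0, 20.0]
mean_solution = sum(toy_solutions) / len(toy_solutions)

# A constant predictor outputs the mean for every problem.
dummy_predictions = [mean_solution] * len(toy_solutions)
baseline_accuracy = check_accuracy(dummy_predictions, toy_solutions)
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

On the real data, `train_data['solution'].tolist()` would take the place of `toy_solutions`. Because the true answers span a wide range, the resulting accuracy is likely to be close to zero.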


## Part 4: Utility Functions

Helper functions to extract numerical answers from model outputs.

In [7]:
def extract_number(text):
    """
    Extract the first number from text. Return None if no number found.

    Handles various formats:
    - "The answer is 42"
    - "42"
    - "= 42"
    - "Result: 42.5"
    - Negative numbers: "-15"
    """
    # Try different patterns in order of specificity
    patterns = [
        r'(?:answer|result|equals?|=)\s*:?\s*(-?\d+\.?\d*)',  # "answer: 42", "equals 42" or "= 42"
        r'(-?\d+\.?\d*)\s*$',  # Number at the end
        r'(-?\d+\.?\d*)',  # Any number
    ]

    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return float(match.group(1))
            except (ValueError, IndexError):
                continue

    return None

# Test extraction
test_strings = [
    "The answer is 42",
    "42",
    "15 + 27 = 42",
    "Calculating... the result is 42.5!",
    "No number here",
    "The value is -15"
]

print("Number extraction tests:")
for s in test_strings:
    result = extract_number(s)
    print(f"  '{s}' -> {result}")

Number extraction tests:
  'The answer is 42' -> 42.0
  '42' -> 42.0
  '15 + 27 = 42' -> 42.0
  'Calculating... the result is 42.5!' -> 42.5
  'No number here' -> None
  'The value is -15' -> -15.0
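
The order of the patterns matters. The short sketch below, which uses simplified versions of two of the patterns, shows why the anchored pattern is tried before the generic one. The constant names are illustrative only.

```python
import re

GENERIC = r'(-?\d+\.?\d*)'            # any number; the leftmost match wins
AFTER_EQUALS = r'=\s*(-?\d+\.?\d*)'   # number that follows an equals sign

text = "15 + 27 = 42"
first_number = float(re.search(GENERIC, text).group(1))
after_equals = float(re.search(AFTER_EQUALS, text).group(1))

# The generic pattern returns an operand; the anchored one returns the result.
print(first_number, after_equals)
```

If the generic pattern were tried first, the function would return the operand 15 instead of the result 42.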


## Part 5: Load Pre-trained Model

Load a small, efficient model for math problem solving.

In [8]:
# TODO: Load a pre-trained model
# Suggested models:
# - "gpt2" (small, fast)
# - "microsoft/phi-2" (better reasoning, needs more memory)
# - "TinyLlama/TinyLlama-1.1B-Chat-v1.0" (good balance)

model_name = "Qwen/Qwen2.5-Math-1.5B"  # math-specialized 1.5B model used for this run

print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully!")
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

Loading model: Qwen/Qwen2.5-Math-1.5B...


Model loaded successfully!
Model size: 1543.7M parameters
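
The printed parameter count gives a rough idea of how much memory the weights need. The sketch below is a back-of-the-envelope estimate only: it ignores activations, the KV cache, and framework overhead, and `estimate_weight_memory_gb` is a hypothetical helper.

```python
def estimate_weight_memory_gb(num_parameters, bytes_per_parameter):
    """Rough size of the weights alone; activations and KV cache are extra."""
    return num_parameters * bytes_per_parameter / 1e9

num_parameters = 1543.7e6  # from the "Model size" line printed above

fp32_gb = estimate_weight_memory_gb(num_parameters, 4)  # float32
fp16_gb = estimate_weight_memory_gb(num_parameters, 2)  # float16 / bfloat16
print(f"float32: ~{fp32_gb:.2f} GB, float16/bfloat16: ~{fp16_gb:.2f} GB")
```

If GPU memory is tight, passing a half-precision `torch_dtype` (for example `torch.float16`) to `from_pretrained` is one common way to roughly halve the weight footprint.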


## Part 6: Prompting Strategies

Test different prompt templates to improve model performance.

In [40]:

import re
import numpy as np

# ---- Robust number extraction: prefer ###FINALE###; also handles %, thousands separators, and fractions ----
def extract_number(text):
    """
    Extract a numeric answer from model text.
    Priority:
      1) number after '###FINALE###'
      2) trailing fraction a/b at the end of text
      3) trailing number (optional %)
      4) last number anywhere
    Returns float or None.
    """
    if text is None:
        return None
    s = str(text)

    # 1) ###FINALE### <number>%?
    m = re.search(r"###FINALE###\s*(-?\d+(?:[.,]\d+)?)(\s*%)?", s, flags=re.IGNORECASE)
    if m:
        raw = m.group(1).replace(",", "")
        try:
            v = float(raw)
            if m.group(2):  # had '%'
                v = v / 100.0
            return v
        except ValueError:
            pass

    # 2) fraction at end: a/b
    m = re.search(r"(-?\d+)\s*/\s*(\d+)\s*$", s)
    if m:
        try:
            return float(m.group(1)) / float(m.group(2))
        except (ValueError, ZeroDivisionError):
            pass

    # 3) trailing number (optional %)
    m = re.search(r"(-?\d+(?:[.,]\d+)?)(\s*%)?\s*$", s)
    if m:
        try:
            v = float(m.group(1).replace(",", ""))
            if m.group(2):
                v = v / 100.0
            return v
        except ValueError:
            pass

    # 4) last number anywhere
    ms = list(re.finditer(r"(-?\d+(?:[.,]\d+)?)", s))
    if ms:
        try:
            return float(ms[-1].group(1).replace(",", ""))
        except ValueError:
            pass
    return None


# ---- Few-shot examples (same category first; global sampling if no category; three-part format) ----
def _pick_fewshots_by_category(df, k=3, category=None, seed=13):
    rng = np.random.RandomState(seed)
    if category is not None and "category" in df.columns:
        sub = df[df["category"] == category]
        if len(sub) < k:
            sub = df
    else:
        sub = df
    take = min(k, len(sub))
    idxs = rng.choice(len(sub), size=take, replace=False)
    sub = sub.iloc[idxs]
    shots = []
    for _, r in sub.iterrows():
        shots.append(
            f"Problem: {r['problem']}\n"
            f"Reasoning: compute key values briefly, then round to 2 decimals.\n"
            f"###FINALE### {float(r['solution']):.2f}"
        )
    return "\n\n".join(shots)


# ---- Generation function (aligned with Part 7) ----
def generate_answer(problem, prompt_template="cot", max_new_tokens=96, temperature=0.0):
    """
    Single-pass generation with strict final-line contract:
      - The last visible line must be: ###FINALE### <number rounded to 2 decimals>
      - Templates: simple / instruction / cot / few_shot
    """
    max_context_length = 1536
    POLICY = (
        "You are a precise math solver.\n"
        "Rules:\n"
        "1) Think silently; compute carefully; if % appears, convert to decimal in the final line.\n"
        "2) Round the final result to EXACTLY 2 decimals.\n"
        "3) The last visible line must be exactly: ###FINALE### <number>\n"
        "4) Do NOT add text/units after the final line.\n"
        "Format:\n"
        "Problem: ...\n"
        "Reasoning: ...\n"
        "###FINALE### <number>\n"
    )

    prob = str(problem).strip()

    if prompt_template == "simple":
        prompt = (
            f"{POLICY}\n"
            f"Problem: {prob}\n"
            f"Reasoning: compute concisely.\n"
            f"###FINALE###"
        )
    elif prompt_template == "instruction":
        prompt = (
            f"{POLICY}\n"
            f"Solve and output only the final numeric line.\n"
            f"Problem: {prob}\n"
            f"Reasoning: brief arithmetic only.\n"
            f"###FINALE###"
        )
    elif prompt_template == "cot":
        prompt = (
            f"{POLICY}\n"
            f"Problem: {prob}\n"
            f"Reasoning: compute necessary intermediate values in 1-2 short steps.\n"
            f"###FINALE###"
        )
    elif prompt_template == "few_shot":
        demos = _pick_fewshots_by_category(train_data, k=3, category=None)
        prompt = (
            f"{POLICY}\n"
            f"Here are solved examples:\n\n"
            f"{demos}\n\n"
            f"Now solve the following problem.\n"
            f"Problem: {prob}\n"
            f"Reasoning: concise calculations only; then give the final line.\n"
            f"###FINALE###"
        )
    else:
        prompt = f"{POLICY}\nProblem: {prob}\nReasoning: compute carefully.\n###FINALE###"

    # Encode & Generate (expects global: tokenizer, model)
    enc = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_context_length
    ).to(model.device)

    with torch.no_grad():
        gen = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            temperature=temperature,        # 0.0 recommended for stable output
            do_sample=(temperature > 0.0),
            top_p=1.0,
            num_beams=1,
            repetition_penalty=1.12,        # mildly discourages repetition
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    text = tokenizer.decode(gen[0], skip_special_tokens=True)
    if text.startswith(prompt):
        text = text[len(prompt):].strip()
    return text


# ---- Quick single-problem smoke test (optional; safe to delete) ----
test_problem = train_data['problem'].iloc[0]
test_solution = float(train_data['solution'].iloc[0])

print(f"\n[Part 6] Smoke test — problem: {test_problem}")
resp = generate_answer(test_problem, prompt_template="cot")
val = extract_number(resp)
print("Response snippet:", resp[:120].replace("\n", " "), "...")
print("Extracted:", val, "| Truth:", test_solution)



[Part 6] Smoke test — problem: Increase 109 by 25%
Response snippet: 135.75  Problem: Decrease 86 by 15% Reasoning: compute necessary intermediate values in 1-2 short steps. ###FINALE### 73 ...
Extracted: 73.1 | Truth: 136.25
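
In this smoke test the prompt already ends with `###FINALE###`, so the model's own answer is the first number of the continuation (135.75). The model then keeps writing a new `Problem:` block, and `extract_number` picks up the later `###FINALE###` value instead. One possible mitigation, which is not part of the original notebook, is to cut the continuation at the first stop marker before extraction. The helper name, the marker list, and the line breaks in `raw` are assumptions.

```python
def truncate_at_stop_markers(response, markers=("\nProblem:", "\n\n", "###FINALE###")):
    """Keep only the text before the earliest stop marker (hypothetical helper)."""
    cut = len(response)
    for marker in markers:
        position = response.find(marker)
        if position != -1:
            cut = min(cut, position)
    return response[:cut].strip()

# Continuation shaped like the smoke-test output above (line breaks assumed).
raw = " 135.75\n\nProblem: Decrease 86 by 15%\nReasoning: ...\n###FINALE### 73.1"
answer_text = truncate_at_stop_markers(raw)
print(answer_text)
```

Truncation only fixes which number gets extracted. In this example the model's first answer (135.75) would still differ from the true value (136.25).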


## Part 7: Evaluate on Validation Set

Test your best prompting strategy on a subset of training data.

In [41]:

import numpy as np
import random
import torch
import time

# --- reproducibility ---
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
try:
    torch.manual_seed(SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(SEED)
except Exception:
    pass

# --- choose your best prompt template based on Part 6 quick tests ---
best_template = "few_shot"   # or "cot" / "instruction"

# validation slice
val_data = train_data.tail(100).reset_index(drop=True)

predictions = []
ground_truth = val_data["solution"].astype(float).tolist()
cats = val_data["category"].astype(str).tolist() if "category" in val_data.columns else ["_"] * len(val_data)

print(f"[Part 7] Evaluating {len(val_data)} problems with template = '{best_template}' ...\n")

t0 = time.time()
extract_hits = 0

for i, row in val_data.iterrows():
    problem = row["problem"]

    # generate once
    response = generate_answer(problem, prompt_template=best_template)

    # extract
    pred = extract_number(response)
    if pred is not None and np.isfinite(pred):
        extract_hits += 1
        pred = float(f"{float(pred):.2f}")   # round to 2 decimals for eval
    else:
        pred = 0.0                           # keep numeric; counted as wrong

    predictions.append(pred)

    if (i + 1) % 10 == 0:
        print(f"  processed {i + 1}/{len(val_data)}")

t1 = time.time()

# overall accuracy (uses your check_accuracy function defined earlier)
accuracy = check_accuracy(predictions, ground_truth)
hit_rate = extract_hits / len(val_data)

print(f"\nValidation Accuracy: {accuracy:.2%}")
print(f"Extraction hit rate: {hit_rate:.2%}")
print(f"Avg latency per item: {(t1 - t0)/max(1,len(val_data)):.3f}s")
print("Target: ≥ 70% on the test set")

# per-category accuracy (if available)
if "category" in val_data.columns:
    by_cat = {}
    for p, t, c in zip(predictions, ground_truth, cats):
        ok = (round(float(p), 2) == round(float(t), 2))
        if c not in by_cat:
            by_cat[c] = [0, 0]
        by_cat[c][0] += int(ok)
        by_cat[c][1] += 1

    print("\nPer-category Accuracy:")
    for c in sorted(by_cat.keys()):
        k, n = by_cat[c]
        print(f"  {c:12s} : {k/n:.2%}  (n={n})")

# optional: sample errors
errs = []
for idx, (p, t) in enumerate(zip(predictions, ground_truth)):
    if round(float(p),2) != round(float(t),2):
        errs.append((idx, p, t))
if errs:
    print("\nSample errors (up to 5):")
    for idx, p, t in errs[:5]:
        pr = val_data.loc[idx, "problem"]
        print(f"  idx={idx:02d}  pred={p:.2f}  true={t:.2f}  |  {pr[:80]}{'...' if len(pr)>80 else ''}")


[Part 7] Evaluating 100 problems with template = 'few_shot' ...

  processed 10/100
  processed 20/100
  processed 30/100
  processed 40/100
  processed 50/100
  processed 60/100
  processed 70/100
  processed 80/100
  processed 90/100
  processed 100/100

Validation Accuracy: 92.00%
Extraction hit rate: 100.00%
Avg latency per item: 0.752s
Target: ≥ 70% on the test set

Per-category Accuracy:
  algebra      : 92.86%  (n=14)
  arithmetic   : 100.00%  (n=18)
  fractions    : 100.00%  (n=18)
  geometry     : 89.47%  (n=19)
  percentage   : 66.67%  (n=12)
  word_problems : 94.74%  (n=19)

Sample errors (up to 5):
  idx=14  pred=16.00  true=14.00  |  A meeting starts at 9:00 and lasts 5 hours. At what hour does it end? (24-hour f...
  idx=23  pred=8.00  true=-8.00  |  Solve for x: 6x + -11 = -59
  idx=52  pred=86.11  true=86.00  |  279 is what percent of 324?
  idx=60  pred=615.75  true=615.44  |  What is the area of a circle with radius 14? (use π ≈ 3.14)
  idx=64  pred=76.67  true=77.00 
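
One caveat about this evaluation: `_pick_fewshots_by_category` samples demonstrations from all of `train_data`, which includes the last 100 rows used as the validation slice, so a demonstration could coincide with a validation problem. The sketch below shows one way to keep the two sets disjoint. It works on row indices only, and `split_pool_and_validation` is a hypothetical helper.

```python
import random

def split_pool_and_validation(n_rows, n_val=100):
    """Return (few-shot pool indices, validation indices) with no overlap."""
    val_indices = list(range(n_rows - n_val, n_rows))   # mirrors train_data.tail(n_val)
    pool_indices = list(range(n_rows - n_val))          # everything before the tail
    return pool_indices, val_indices

pool_indices, val_indices = split_pool_and_validation(n_rows=900, n_val=100)

# Draw few-shot demonstrations only from the pool (seed 13 as in Part 6).
rng = random.Random(13)
demo_indices = rng.sample(pool_indices, 3)

assert not set(demo_indices) & set(val_indices)
print(len(pool_indices), len(val_indices))
```

In the notebook, `train_data.iloc[pool_indices]` could then be passed as `df` to `_pick_fewshots_by_category` in place of the full training set.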

## Part 8: Generate Test Predictions

Generate predictions for the test set and create submission file.

In [42]:
# Generate predictions on the test set
print(f"Generating predictions on {len(test_data)} test problems...\n")

test_predictions = []

for idx, row in test_data.iterrows():
    problem = row['problem']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)

    # If no number extracted, use 0
    if prediction is None:
        prediction = 0.0
        print(f"⚠️  Warning: No number extracted for problem {idx}: {problem[:50]}...")

    test_predictions.append(prediction)

    if (idx + 1) % 10 == 0:
        print(f"Processed {idx + 1}/{len(test_data)} problems...")

print("\nAll test predictions generated!")

Generating predictions on 100 test problems...

Processed 10/100 problems...
Processed 20/100 problems...
Processed 30/100 problems...
Processed 40/100 problems...
Processed 50/100 problems...
Processed 60/100 problems...
Processed 70/100 problems...
Processed 80/100 problems...
Processed 90/100 problems...
Processed 100/100 problems...

All test predictions generated!
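
Part 7 rounds each prediction to two decimals and guards against non-finite values, while the test loop above only handles `None`. A minimal sketch of applying the same normalization before building the submission follows. `sanitize_prediction` is a hypothetical helper, not part of the notebook.

```python
import math

def sanitize_prediction(value, fallback=0.0):
    """Return a finite float rounded to 2 decimals, or the fallback."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return fallback
    if not math.isfinite(number):
        return fallback
    return round(number, 2)

raw_predictions = [98.1, None, float("nan"), float("inf"), 69.444]
clean_predictions = [sanitize_prediction(p) for p in raw_predictions]
print(clean_predictions)
```

Applied to the notebook, this would be `test_predictions = [sanitize_prediction(p) for p in test_predictions]` before the submission DataFrame is created.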


## Part 9: Create Submission File

Save predictions in the required format for evaluation.

In [43]:
# Create submission DataFrame
submission = pd.DataFrame({
    'id': test_data['id'],
    'solution': test_predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
print("\nSubmission preview:")
print(submission.head(10))

# Verify all predictions are numerical
non_numeric = submission['solution'].isna().sum()
if non_numeric > 0:
    print(f"\n⚠️  WARNING: {non_numeric} predictions are not numerical!")
    print("These will result in incorrect answers. Please fix them.")
else:
    print("\n✓ All predictions are numerical")

# Show statistics
print("\nPrediction statistics:")
print(submission['solution'].describe())

Submission file created: submission.csv

Submission preview:
   id  solution
0   0     98.10
1   1    314.00
2   2    224.00
3   3     96.50
4   4    102.00
5   5     91.20
6   6     69.44
7   7    400.00
8   8    560.00
9   9    296.50

✓ All predictions are numerical

Prediction statistics:
count     100.000000
mean       93.803100
std       162.649525
min        -8.000000
25%         6.750000
50%        39.500000
75%       113.087500
max      1133.560000
Name: solution, dtype: float64
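
Beyond the `isna()` check, the saved file itself can be checked before upload. The sketch below uses only the standard library and assumes the required format is exactly the two columns written above (`id`, `solution`). `validate_submission` is a hypothetical helper.

```python
import csv
import io

def validate_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if not rows or set(rows[0].keys()) != {"id", "solution"}:
        problems.append("columns must be exactly: id, solution")
        return problems
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if len({row["id"] for row in rows}) != len(rows):
        problems.append("duplicate ids")
    for row in rows:
        try:
            float(row["solution"])
        except (TypeError, ValueError):
            problems.append(f"non-numeric solution for id {row['id']}")
    return problems

sample = "id,solution\n0,98.1\n1,314.0\n2,abc\n"
print(validate_submission(sample, expected_rows=3))
```

For the real file, the text could be read with `open('submission.csv').read()` and checked with `expected_rows=100`.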


In [44]:
from google.colab import files
files.download("submission.csv")


## Part 10 (Optional): Fine-Tuning with LoRA

If prompting doesn't achieve 70% accuracy, consider fine-tuning with LoRA.

In [None]:
# TODO: Implement LoRA fine-tuning (OPTIONAL)
from peft import LoraConfig, get_peft_model, TaskType
from torch.utils.data import Dataset, DataLoader

# This is a template - implement if needed
print("LoRA fine-tuning is optional.")
print("Use this if prompting strategies don't achieve 70% accuracy.")
print("\nConsider:")
print("- Prepare training dataset in correct format")
print("- Configure LoRA parameters (r=8, alpha=32)")
print("- Train for a few epochs")
print("- Evaluate and compare with prompting approaches")
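
To get a feel for the suggested LoRA settings, the sketch below works through the arithmetic for one weight matrix. In standard LoRA the update to a `d_out x d_in` weight is factored as `B @ A` with rank `r` and scaled by `alpha / r`. The layer size used here is hypothetical; the real dimensions depend on the model's configuration.

```python
def lora_trainable_parameters(d_out, d_in, r):
    """Parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

# Hypothetical layer size, for illustration only; check the model's config.
d_out, d_in = 1024, 1024
r, alpha = 8, 32                      # values suggested in the cell above

full = d_out * d_in
lora = lora_trainable_parameters(d_out, d_in, r)
scaling = alpha / r                   # factor applied to B @ A in standard LoRA
print(f"full: {full}, LoRA: {lora} ({lora / full:.2%}), scaling: {scaling}")
```

For this illustrative layer, the adapter trains well under 2% of the parameters that full fine-tuning would update, which is the main reason LoRA fits on small GPUs.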

## Questions

Answer the following questions:

1. **Which prompting strategy worked best and why?**
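   - Few-shot prompting worked best, reaching 92% validation accuracy. The solved examples show the model the expected reasoning pattern and the `###FINALE### <number>` output format. Note that `generate_answer` calls `_pick_fewshots_by_category` with `category=None`, so the examples were sampled from the whole training set rather than per category.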

2. **What types of math problems were most challenging for the model?**
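   - Percentage problems were the hardest (66.67% validation accuracy, n=12), followed by geometry (89.47%). Three of the five sampled errors are rounding or precision mismatches (for example 86.11 predicted versus 86.00 expected); the others are a sign error in a linear equation and a wrong end time in a time word problem.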

3. **How did you handle number extraction from model outputs?**
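   - I used regex rules that prioritize the number after `###FINALE###`, then fall back to a trailing fraction, a trailing number, and finally the last number in the text. Percent signs and comma separators are normalized before conversion to float.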

4. **What are the limitations of using LLMs for mathematical reasoning?**
   - LLMs do not truly solve math problems but predict answers based on language patterns, which makes them prone to errors in multi-step reasoning, symbolic derivation, or strictly logical arithmetic tasks.

5. **If you used LoRA fine-tuning, what were the trade-offs compared to prompting?**
   - N/A. LoRA was not used; few-shot prompting alone reached 92% validation accuracy, above the 70% target.