# Exercise 3: Mathematical Problem Solving with LLMs

**This is a marked exercise (graded)**

Apply LLMs to solve mathematical reasoning tasks. Test different pre-trained models with various prompting strategies and optionally fine-tune with LoRA to improve performance.

**Learning Objectives:**
- Evaluate LLMs on mathematical reasoning
- Design effective prompts for numerical tasks
- Implement and compare different prompting strategies
- Optionally: Fine-tune models using LoRA
- Measure performance using an accuracy metric with a numerical tolerance

**Deliverables:**
- Completed notebook with your approach
- `submission.csv` with predictions on test set (100 problems)
- Score: accuracy, with predictions compared to the answers at 2-decimal precision (passing threshold: 70%)

## Part 1: Setup and Load Data

In [None]:
!pip install transformers torch peft datasets pandas scikit-learn matplotlib requests -q

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import requests
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Part 2: Download Dataset

Download the math problem dataset (1000 problems: 900 train, 100 test).

In [None]:
# URLs for the dataset files
base_url = 'https://www.raphaelcousin.com/modules/data-science-practice/module8/exercise/'

train_url = base_url + 'train.csv'
test_url = base_url + 'test.csv'

def download_file(url, filename):
    """Download a file from a URL and save it locally."""
    response = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")

# Download files
download_file(train_url, 'train.csv')
download_file(test_url, 'test.csv')

In [None]:
# Load the datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(f"Train set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

# Display category distribution
print("\nTraining set category distribution:")
print(train_data['category'].value_counts().sort_index())

print("\nSample training problems:")
print(train_data.head(10))

## Part 3: Baseline - Dummy Model

Create a baseline to understand what poor performance looks like.

In [None]:
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """
    Calculate accuracy with tolerance for floating point comparisons.

    Two values are considered equal if their difference is <= tolerance
    OR if they round to the same value at 2 decimal places.
    """
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        # Check if both round to same 2 decimal places
        if round(pred, 2) == round(truth, 2):
            correct += 1
        # Or if absolute difference is very small
        elif abs(pred - truth) <= tolerance:
            correct += 1

    return correct / len(predictions)

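# Dummy baseline: always predict the mean of the training solutions
mean_solution = train_data['solution'].mean()
dummy_predictions = [mean_solution] * len(train_data)
dummy_accuracy = check_accuracy(dummy_predictions, train_data['solution'].astype(float).tolist())
print(f"Dummy model (always predicts mean): {mean_solution:.2f}")
print(f"Dummy accuracy on the training set: {dummy_accuracy:.2%}")
print("Your model should do much better than this!")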
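
To make the scoring rule concrete, the sketch below re-implements the same two-step comparison on three hand-picked pairs. The helper name `is_match` and the sample values are illustrative assumptions, not part of the exercise's grader.

```python
def is_match(pred: float, truth: float, tolerance: float = 0.01) -> bool:
    """Mirror of the rule above: same 2-decimal rounding OR a small absolute gap."""
    return round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance

# (prediction, ground truth) pairs chosen for illustration only
pairs = [
    (3.14, 3.1449),  # both round to 3.14 -> correct
    (3.15, 3.1449),  # roundings differ, but the gap is about 0.005 -> correct via tolerance
    (3.17, 3.1449),  # gap is about 0.025 -> incorrect
]
results = [is_match(p, t) for p, t in pairs]
accuracy = sum(results) / len(results)
print(results, f"{accuracy:.2%}")
```

The second pair shows why the tolerance branch matters: the prediction is off by about half a cent, yet the two values round to different 2-decimal results.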

## Part 4: Utility Functions

Helper functions to extract numerical answers from model outputs.

In [None]:
import re, math, torch

def extract_number(text: str):
    """
    Robust extraction, tried in order:
      1) '###FINALE### <num>' (an optional trailing % is converted to a decimal fraction)
      2) a trailing fraction 'a/b'
      3) a trailing number (an optional trailing % is converted to a decimal fraction)
      4) the last number found anywhere in the text
    Commas are stripped as thousands separators, so '1,5' is read as 15.
    Returns a float, or None if nothing matches.
    """
    if text is None:
        return None
    s = str(text)

    m = re.search(r"###FINALE###\s*(-?\d+(?:[.,]\d+)?)(\s*%)?", s, flags=re.IGNORECASE)
    if m:
        raw = m.group(1).replace(",", "")
        try:
            v = float(raw)
            if m.group(2):  # a '%' followed the number: return it as a decimal fraction
                v = v / 100.0
            return v
        except ValueError:
            pass

    m = re.search(r"(-?\d+)\s*/\s*(\d+)\s*$", s)
    if m:
        try:
            return float(m.group(1)) / float(m.group(2))
        except (ValueError, ZeroDivisionError):
            pass

    m = re.search(r"(-?\d+(?:[.,]\d+)?)(\s*%)?\s*$", s)
    if m:
        try:
            v = float(m.group(1).replace(",", ""))
            if m.group(2):
                v = v / 100.0
            return v
        except ValueError:
            pass

    ms = list(re.finditer(r"(-?\d+(?:[.,]\d+)?)", s))
    if ms:
        try:
            return float(ms[-1].group(1).replace(",", ""))
        except ValueError:
            pass
    return None

# Test extraction
test_strings = [
    "The answer is 42",
    "42",
    "15 + 27 = 42",
    "Calculating... the result is 42.5!",
    "No number here",
    "The value is -15"
]

print("Number extraction tests:")
for s in test_strings:
    result = extract_number(s)
    print(f"  '{s}' -> {result}")
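
The extractor above strips every comma, so a decimal comma such as `1,5` is read as `15`. If model outputs may mix both conventions, a small normaliser can disambiguate before calling `float`. The sketch below is one possible heuristic, not the exercise's reference behaviour: `normalize_number` is a hypothetical helper, and its rule (a comma followed by exactly three digits is a thousands separator, any other comma is a decimal mark) is an assumption that misreads inputs like `1,500` when they are meant as 1.5.

```python
import re

def normalize_number(token: str) -> float:
    """Parse '1,234.5', '1,234', '1,5' or '12.5' into a float (heuristic)."""
    t = token.strip()
    # A comma followed by exactly three digits is treated as a thousands separator.
    t = re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", t)
    # Any remaining comma is treated as a decimal mark.
    t = t.replace(",", ".")
    return float(t)

for raw in ["1,234.5", "1,234", "1,5", "12.5", "-0,75"]:
    print(raw, "->", normalize_number(raw))
```

Whether this is an improvement depends on how the dataset and the model format numbers, so it is worth checking a sample of raw outputs first.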

In [None]:
import ast, operator as op, re

_ops = {
    ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv,
    ast.Pow: op.pow, ast.USub: op.neg, ast.UAdd: op.pos,
    ast.FloorDiv: op.floordiv, ast.Mod: op.mod
}

def _eval_expr(expr: str) -> float:
    expr = re.sub(r'(?<=\d),(?=\d{3}\b)', '', expr)  # 1,000 -> 1000
    expr = expr.replace(',', '.')                    # 1,5 -> 1.5
    expr = expr.replace('^', '**')
    def _ev(n):
        if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)): return n.value
        if isinstance(n, ast.BinOp) and type(n.op) in _ops: return _ops[type(n.op)](_ev(n.left), _ev(n.right))
        if isinstance(n, ast.UnaryOp) and type(n.op) in _ops: return _ops[type(n.op)](_ev(n.operand))
        raise ValueError(f"Unsupported node: {type(n).__name__}")
    return float(_ev(ast.parse(expr, mode='eval').body))

def _to_float(x):
    return float(str(x).replace(',', '.'))

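# Keywords that route a problem to the percentage-specific path.
# Matching is by substring and several entries ("of", "to", "from") are very
# common, so this check is broad and will flag many non-percentage problems too.
PCT_KW = ("percent", "%", "increase", "decrease", "of", "more than", "less than", "discount", "sale", "from", "to")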

def is_percentage_problem(problem: str) -> bool:
    s = (problem or "").lower()
    return any(k in s for k in PCT_KW)

def rule_solver(problem: str):
    """Exact rule-based solver, with extended coverage for 'percentage' problems."""
    if not isinstance(problem, str): return None
    s = problem.strip().lower().replace('\u00a0', ' ')
    s = re.sub(r'\s+', ' ', s)

    # A) p% of y
    m = re.search(r'([-+]?\d+(?:[.,]\d+)?)\s*%\s*(?:of|de)\s*([-+]?\d+(?:[.,]\d+)?)', s)
    if m:
        p,y = map(_to_float, m.groups()); return p/100.0*y

    # B) increase/decrease A by B%
    m = re.search(r'(increase|decrease)\s+([-+]?\d+(?:[.,]\d+)?)\s+by\s+([-+]?\d+(?:[.,]\d+)?)\s*%', s)
    if m:
        kind,a,b = m.groups(); a=_to_float(a); b=_to_float(b)
        return a*(1+b/100.0) if kind=="increase" else a*(1-b/100.0)

    # C) increase/decrease A by B (no %)
    m = re.search(r'(increase|decrease)\s+([-+]?\d+(?:[.,]\d+)?)\s+by\s+([-+]?\d+(?:[.,]\d+)?)\b', s)
    if m and '%' not in s:
        kind,a,b = m.groups(); a=_to_float(a); b=_to_float(b)
        return a+b if kind=="increase" else a-b

    # D) x is what percent of y?
    m = re.search(r'([-+]?\d+(?:[.,]\d+)?)\s+is\s+what\s+percent\s+of\s+([-+]?\d+(?:[.,]\d+)?)', s)
    if m:
        x,y = map(_to_float, m.groups())
        if y != 0: return 100.0*x/y

    # E) from X to Y, what percent (increase/decrease)?
    m = re.search(r'from\s+([-+]?\d+(?:[.,]\d+)?)\s+to\s+([-+]?\d+(?:[.,]\d+)?)\s*,?\s+what\s+percent', s)
    if m:
        x,y = map(_to_float, m.groups())
        if x != 0: return 100.0*(y-x)/x

    # F) discount / final price: "price P with a D% discount" → P*(1-D%)
    m = re.search(r'(?:price|cost)\s*([-+]?\d+(?:[.,]\d+)?)\s*(?:with|after)\s*(?:a\s*)?([-+]?\d+(?:[.,]\d+)?)\s*%\s*(?:discount|off)', s)
    if m:
        P,D = map(_to_float, m.groups()); return P*(1-D/100.0)

    # G) A is P% more/less than B → if the question asks "what percent", return P
    m = re.search(r'([-+]?\d+(?:[.,]\d+)?)\s+is\s+([-+]?\d+(?:[.,]\d+)?)\s*%\s*(more|less)\s+than\s+([-+]?\d+(?:[.,]\d+)?)', s)
    if m:
        a,p,ml,b = m.groups(); a=_to_float(a); p=_to_float(p); b=_to_float(b)
        # if the statement is consistent, return P (usually only the percentage is asked for)
        if b != 0 and abs(a - (1 + (p/100.0 if ml=="more" else -p/100.0))*b) < 1e-6:
            return p

    # H) fraction-of: a/b of n
    m = re.search(r'([-+]?\d+(?:[.,]\d+)?)/([-+]?\d+(?:[.,]\d+)?)\s*(?:of|de)\s*([-+]?\d+(?:[.,]\d+)?)', s)
    if m:
        a,b,n = map(_to_float, m.groups())
        if b != 0: return (a/b)*n

    # I) "calculate/compute/evaluate ..." -> expression
    m = re.search(r'(?:calculate|compute|evaluate)\s+(.+?)(?:\?|$)', s)
    if m:
        try: return _eval_expr(m.group(1))
        except Exception: pass  # not a pure arithmetic expression; try the next rule

    # J) "what is ..." expression
    m = re.search(r'what\s+is\s+(.+?)(?:\?|$)', s)
    if m:
        try: return _eval_expr(m.group(1))
        except Exception: pass

    # K) bare expression
    m = re.search(r'^([0-9\.,\s\+\-\*\/\^\(\)]+)\s*(?:\?|=)?\s*$', s)
    if m:
        try: return _eval_expr(m.group(1))
        except Exception: pass

    return None
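
Because `_eval_expr` accepts `**`, an input such as `9^9^9` would ask Python to build an astronomically large integer. A guarded power operator is one way to bound that. The sketch below is a standalone variant rather than a drop-in replacement: `safe_pow`, `safe_eval`, and the limit `MAX_EXPONENT = 100` are assumptions chosen for illustration.

```python
import ast
import operator as op

MAX_EXPONENT = 100  # assumed cap; tune to the dataset

def safe_pow(base, exp):
    """Power with a bounded exponent to avoid runaway big-integer arithmetic."""
    if abs(exp) > MAX_EXPONENT:
        raise ValueError(f"exponent {exp} exceeds limit {MAX_EXPONENT}")
    return op.pow(base, exp)

OPS = {
    ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv,
    ast.Pow: safe_pow, ast.USub: op.neg, ast.UAdd: op.pos,
}

def safe_eval(expr: str) -> float:
    """Evaluate +, -, *, / and ** on numeric literals only."""
    def ev(node):
        is_number = isinstance(node, ast.Constant) and isinstance(node.value, (int, float))
        if is_number and not isinstance(node.value, bool):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError(f"unsupported node: {type(node).__name__}")
    return float(ev(ast.parse(expr, mode="eval").body))

print(safe_eval("2 ** 10 + 3 * (4 - 1)"))  # 1033.0
try:
    safe_eval("9 ** 9 ** 9")
except ValueError as err:
    print("rejected:", err)
```

Walking the syntax tree with an operator whitelist, instead of calling `eval`, keeps names, calls, and attribute access out of reach; the exponent cap addresses the remaining resource risk.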


## Part 5: Load Pre-trained Model

Load a small, efficient model for math problem solving.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "Qwen/Qwen2.5-Math-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
).to("cuda" if torch.cuda.is_available() else "cpu")

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


## Part 6: Prompting Strategies

Test different prompt templates to improve model performance.

In [None]:
import numpy as np

def _pick_fewshots_percentage(df, k=3, seed=7):
    """Pick k examples from the 'percentage' category (falling back to all rows), formatted as Reasoning + ###FINALE###."""
    rng = np.random.RandomState(seed)
    if "category" in df.columns:
        sub = df[df["category"].str.lower() == "percentage"]
        if len(sub) < k: sub = df
    else:
        sub = df
    take = min(k, len(sub))
    idxs = rng.choice(len(sub), size=take, replace=False)
    sub = sub.iloc[idxs]
    shots = []
    for _, r in sub.iterrows():
        shots.append(
            "Problem: " + str(r['problem']) + "\n"
            "Reasoning: compute key values briefly, round at the end.\n"
            f"###FINALE### {float(r['solution']):.2f}"
        )
    return "\n\n".join(shots)

def generate_answer(problem, prompt_template="few_shot", max_new_tokens=96, temperature=0.0):
    POLICY = (
        "You are a precise math solver.\n"
        "Rules:\n"
        "1) Think briefly; compute carefully.\n"
        "2) Round the final result to EXACTLY 2 decimals.\n"
        "3) The last visible line must be exactly: ###FINALE### <number>\n"
        "4) Do NOT add text/units after the final line.\n"
        "Format:\n"
        "Problem: ...\n"
        "Reasoning: ...\n"
        "###FINALE### <number>\n"
    )
    prob = str(problem).strip()

    if prompt_template == "simple":
        prompt = f"{POLICY}\nProblem: {prob}\nReasoning: brief.\n###FINALE###"
    elif prompt_template == "instruction":
        prompt = f"{POLICY}\nSolve and output only the final numeric line.\nProblem: {prob}\nReasoning: brief.\n###FINALE###"
    elif prompt_template == "cot":
        prompt = f"{POLICY}\nProblem: {prob}\nReasoning: compute necessary intermediate values in 1-2 short steps.\n###FINALE###"
    else:  # default: few_shot
        demos = _pick_fewshots_percentage(train_data, k=3)
        prompt = (
            f"{POLICY}\n"
            f"Here are solved examples:\n\n{demos}\n\n"
            f"Now solve the following problem.\n"
            f"Problem: {prob}\n"
            f"Reasoning: concise calculations only; then give the final line.\n"
            f"###FINALE###"
        )

    enc = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1536).to(model.device)

    # Pass sampling parameters only when sampling is enabled; greedy decoding
    # (temperature == 0) does not use them.
    do_sample = temperature > 0.0
    sampling_kwargs = {"temperature": temperature, "top_p": 1.0} if do_sample else {}

    with torch.no_grad():
        gen = model.generate(
            **enc,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            num_beams=1,
            repetition_penalty=1.12,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            return_dict_in_generate=True,
            **sampling_kwargs,
        )

    # Decode only the newly generated tokens (skip the prompt)
    gen_ids = gen.sequences[0][enc["input_ids"].shape[1]:]
    text = tokenizer.decode(gen_ids, skip_special_tokens=True).strip()
    return text


# Test different prompts on a sample problem
test_problem = train_data['problem'].iloc[0]
test_solution = train_data['solution'].iloc[0]

print(f"Testing problem: {test_problem}")
print(f"Correct answer: {test_solution}\n")
print("="*70)

for template in ["simple", "instruction", "cot", "few_shot"]:
    response = generate_answer(test_problem, template)
    extracted = extract_number(response)

    correct = "✓" if extracted is not None and round(extracted, 2) == round(test_solution, 2) else "✗"

    print(f"{correct} {template}:")
    print(f"  Response: {response[:100]}{'...' if len(response) > 100 else ''}")
    print(f"  Extracted: {extracted}\n")
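
If some training rows are later held out for validation, few-shot demonstrations should not be drawn from them, otherwise the model may see a validation answer in its prompt. The sketch below shows one way to build a few-shot prompt with an exclusion list. It works on plain dictionaries with made-up problems; `build_few_shot_prompt` and its `exclude` parameter are hypothetical and not part of the notebook.

```python
import random

def build_few_shot_prompt(examples, problem, k=3, exclude=(), seed=7):
    """Format up to k solved examples, skipping excluded problems, then append the query."""
    banned = set(exclude) | {problem}
    pool = [ex for ex in examples if ex["problem"] not in banned]
    shots = random.Random(seed).sample(pool, min(k, len(pool)))
    blocks = [
        f"Problem: {ex['problem']}\n###FINALE### {float(ex['solution']):.2f}"
        for ex in shots
    ]
    blocks.append(f"Problem: {problem}\n###FINALE###")
    return "\n\n".join(blocks)

# Made-up data for illustration only
train = [
    {"problem": "What is 10% of 50?", "solution": 5},
    {"problem": "What is 2 + 2?", "solution": 4},
    {"problem": "What is 3 * 7?", "solution": 21},
    {"problem": "What is 9 - 4?", "solution": 5},
]
prompt = build_few_shot_prompt(train, "What is 6 * 6?", k=2, exclude=["What is 2 + 2?"])
print(prompt)
```

A fixed seed keeps the demonstrations identical across calls, which makes comparisons between prompt templates easier to interpret.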

In [None]:
def generate_sc_percentage(problem, k=3):
    vals = []
    for _ in range(k):
        r = generate_answer(problem, prompt_template="few_shot", max_new_tokens=96, temperature=0.3)
        x = extract_number(r)
        if x is not None and math.isfinite(x):
            vals.append(x)
    if vals:
        vals.sort()
        return vals[len(vals)//2]  # median (upper middle value when the count is even)
    return None
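
Self-consistency is often described with a majority vote rather than a median. For numeric answers the two differ when samples disagree: the median picks a central value, while a vote picks the most frequent one. Below is a sketch of the voting variant, assuming answers are bucketed by rounding to 2 decimals. `majority_vote` is a hypothetical helper, and breaking ties in favour of the earliest sample is an arbitrary choice.

```python
import math
from collections import Counter
from typing import Iterable, Optional

def majority_vote(values: Iterable[Optional[float]], ndigits: int = 2) -> Optional[float]:
    """Return the most frequent finite value after rounding, or None if there is none."""
    rounded = [round(v, ndigits) for v in values if v is not None and math.isfinite(v)]
    if not rounded:
        return None
    # Among equal counts, the value seen first wins.
    return Counter(rounded).most_common(1)[0][0]

samples = [12.5, 12.504, 80.0, None, 12.5]
print(majority_vote(samples))
```

With only `k=3` samples a vote and a median usually agree; the difference becomes more visible with larger `k` or with outputs that scatter widely.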


## Part 7: Evaluate on Validation Set

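Test your best prompting strategy on a held-out subset of the training data (here, the last 50 rows). Note that `_pick_fewshots_percentage` samples from all of `train_data`, so a few-shot demonstration can coincide with one of these rows; treat the result as a possibly optimistic estimate.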

In [None]:
best_template = "few_shot"

val_data = train_data.tail(50).reset_index(drop=True)
predictions = []
ground_truth = val_data['solution'].astype(float).tolist()

print(f"Evaluating on {len(val_data)} validation problems...\n")
rule_hits, sc_calls, llm_calls = 0, 0, 0

for i, row in val_data.iterrows():
    q = row["problem"]

    # 1) Try the exact rules first
    pred = rule_solver(q)
    if pred is not None:
        rule_hits += 1
    else:
        # 2) Otherwise fall back to the LLM
        if is_percentage_problem(q):
            # Targeted self-consistency for percentage-like problems
            sc_calls += 1
            pred = generate_sc_percentage(q, k=3)
            if pred is None:
                # Fallback: a single deterministic (greedy) generation
                resp = generate_answer(q, prompt_template=best_template, max_new_tokens=96, temperature=0.0)
                pred = extract_number(resp)
        else:
            llm_calls += 1
            resp = generate_answer(q, prompt_template=best_template, max_new_tokens=96, temperature=0.0)
            pred = extract_number(resp)

    if pred is None:
        pred = 0.0

    predictions.append(float(f"{float(pred):.2f}"))  # round to 2 decimals

    if (i + 1) % 10 == 0:
        print(f"Processed {i+1}/{len(val_data)} problems...")

acc = check_accuracy(predictions, ground_truth)
print(f"\n✅ Validation Accuracy: {acc:.2%}")
print(f"Rules solved: {rule_hits}/{len(val_data)} | SC calls (percentage): {sc_calls} | Plain LLM calls: {llm_calls}")
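
A single accuracy figure hides which problem types fail. Since the training file has a `category` column, grouping the validation results by category is a cheap diagnostic. The sketch below uses plain Python lists with made-up category names and values; `accuracy_by_category` is a hypothetical helper that applies the same matching rule as `check_accuracy`.

```python
from collections import defaultdict

def accuracy_by_category(categories, predictions, ground_truth, tolerance=0.01):
    """Return {category: (n_correct, n_total)} using the 2-decimal/tolerance rule."""
    stats = defaultdict(lambda: [0, 0])
    for cat, pred, truth in zip(categories, predictions, ground_truth):
        ok = round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance
        stats[cat][0] += int(ok)
        stats[cat][1] += 1
    return {cat: tuple(counts) for cat, counts in stats.items()}

# Made-up values for illustration only
cats = ["arithmetic", "percentage", "percentage", "algebra"]
preds = [42.0, 30.0, 12.5, 7.0]
truth = [42.0, 30.0, 15.0, 3.0]
breakdown = accuracy_by_category(cats, preds, truth)
for cat, (hit, total) in sorted(breakdown.items()):
    print(f"{cat:<12} {hit}/{total} = {hit / total:.0%}")
```

In the notebook, the same function could be called with `val_data['category']`, `predictions`, and `ground_truth` from the cell above.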


## Part 8: Generate Test Predictions

Generate predictions for the test set and create submission file.

In [None]:
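# Generate predictions on the test set with the same pipeline validated in Part 7:
# exact rules first, then the LLM (self-consistency for percentage-like problems).
print(f"Generating predictions on {len(test_data)} test problems...\n")

test_predictions = []

for n, (idx, row) in enumerate(test_data.iterrows(), start=1):
    problem = row['problem']

    prediction = rule_solver(problem)
    if prediction is None and is_percentage_problem(problem):
        prediction = generate_sc_percentage(problem, k=3)
    if prediction is None:
        response = generate_answer(problem, prompt_template=best_template)
        prediction = extract_number(response)

    # If no number could be extracted, fall back to 0
    if prediction is None:
        prediction = 0.0
        print(f"⚠️  Warning: No number extracted for problem {idx}: {str(problem)[:50]}...")

    test_predictions.append(round(float(prediction), 2))

    if n % 10 == 0:
        print(f"Processed {n}/{len(test_data)} problems...")

print("\nAll test predictions generated!")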

## Part 9: Create Submission File

Save predictions in the required format for evaluation.

In [None]:
# Create submission DataFrame
submission = pd.DataFrame({
    'id': test_data['id'],
    'solution': test_predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
print("\nSubmission preview:")
print(submission.head(10))

# Verify all predictions are numerical
non_numeric = submission['solution'].isna().sum()
if non_numeric > 0:
    print(f"\n⚠️  WARNING: {non_numeric} predictions are not numerical!")
    print("These will result in incorrect answers. Please fix them.")
else:
    print("\n✓ All predictions are numerical")

# Show statistics
print("\nPrediction statistics:")
print(submission['solution'].describe())
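
Before uploading, it can help to re-read the file from disk and check it the way a grader might: expected columns, expected row count, and finite numeric values. The grader's exact checks are not documented here, so the function below is only a plausible sanity check; `validate_submission` and its `expected_rows` parameter are hypothetical.

```python
import csv
import io
import math

def validate_submission(fileobj, expected_rows=100):
    """Return a list of human-readable problems; an empty list means the file looks fine."""
    problems = []
    reader = csv.DictReader(fileobj)
    if reader.fieldnames != ["id", "solution"]:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        try:
            value = float(row["solution"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: non-numeric solution {row['solution']!r}")
            continue
        if not math.isfinite(value):
            problems.append(f"line {line_no}: non-finite solution {value}")
    return problems

# Self-contained demo on an in-memory file with two bad rows
demo = io.StringIO("id,solution\n0,12.50\n1,abc\n2,nan\n")
print(validate_submission(demo, expected_rows=3))
```

For the real file, open it with `open('submission.csv', newline='')` and pass the handle to `validate_submission`.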

In [None]:
# Colab only: download submission.csv to your machine (skip this cell elsewhere)
from google.colab import files
files.download('submission.csv')

## Part 10 (Optional): Fine-Tuning with LoRA

If prompting doesn't achieve 70% accuracy, consider fine-tuning with LoRA.

In [None]:
# TODO: Implement LoRA fine-tuning (OPTIONAL)
from peft import LoraConfig, get_peft_model, TaskType
from torch.utils.data import Dataset, DataLoader

# This is a template - implement if needed
print("LoRA fine-tuning is optional.")
print("Use this if prompting strategies don't achieve 70% accuracy.")
print("\nConsider:")
print("- Prepare training dataset in correct format")
print("- Configure LoRA parameters (r=8, alpha=32)")
print("- Train for a few epochs")
print("- Evaluate and compare with prompting approaches")
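
To get a feel for what `r=8` means: LoRA freezes a weight matrix of shape `d_out x d_in` and learns two small matrices, `B` (`d_out x r`) and `A` (`r x d_in`), so each adapted matrix adds `r * (d_in + d_out)` trainable parameters, and the learned update is scaled by `alpha / r`. The arithmetic below assumes a square 1536 x 1536 projection, which is an illustrative size rather than a value read from the loaded model's configuration; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    return r * (d_in + d_out)

d_in = d_out = 1536  # assumed layer size, for illustration only
r, alpha = 8, 32     # values suggested in the template above

full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
print(f"full matrix:  {full:,} parameters")
print(f"LoRA adapter: {added:,} parameters ({added / full:.2%} of the full matrix)")
print(f"update scaling alpha / r = {alpha / r}")
```

Under these assumptions the adapter trains roughly 1% of the parameters of the matrix it wraps, which is why LoRA fits on modest GPUs; the trade-off is a lower-rank update than full fine-tuning.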

## Questions

Answer the following questions:

1. **Which prompting strategy worked best and why?**
   - YOUR ANSWER HERE

2. **What types of math problems were most challenging for the model?**
   - YOUR ANSWER HERE

3. **How did you handle number extraction from model outputs?**
   - YOUR ANSWER HERE

4. **What are the limitations of using LLMs for mathematical reasoning?**
   - YOUR ANSWER HERE

5. **If you used LoRA fine-tuning, what were the trade-offs compared to prompting?**
   - YOUR ANSWER HERE (or N/A if you didn't use LoRA)