# Exercise 3: Mathematical Problem Solving with LLMs

**This is a marked exercise (graded)**

Apply LLMs to solve mathematical reasoning tasks. Test different pre-trained models with various prompting strategies and optionally fine-tune with LoRA to improve performance.

**Learning Objectives:**
- Evaluate LLMs on mathematical reasoning
- Design effective prompts for numerical tasks
- Implement and compare different prompting strategies
- Optionally: Fine-tune models using LoRA
- Measure performance using an accuracy metric with a numeric tolerance

**Deliverables:**
- Completed notebook with your approach
- `submission.csv` with predictions on test set (100 problems)
- Score: accuracy, with answers compared at 2-decimal precision (threshold: 70%)

## Part 1: Setup and Load Data

In [1]:
!pip install transformers torch peft datasets pandas scikit-learn matplotlib requests -q

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import requests
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Part 2: Download Dataset

Download the math problem dataset (1000 problems: 900 train, 100 test).

In [3]:
# URLs for the dataset files
base_url = 'https://www.raphaelcousin.com/modules/data-science-practice/module8/exercise/'

train_url = base_url + 'train.csv'
test_url = base_url + 'test.csv'

def download_file(url, filename):
    """Download a file from URL."""
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")

# Download files
download_file(train_url, 'train.csv')
download_file(test_url, 'test.csv')

Downloaded train.csv
Downloaded test.csv


In [4]:
# Load the datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(f"Train set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

# Display category distribution
print("\nTraining set category distribution:")
print(train_data['category'].value_counts().sort_index())

print("\nSample training problems:")
print(train_data.head(10))

Train set size: 900
Test set size: 100

Training set category distribution:
category
algebra          150
arithmetic       153
fractions        143
geometry         155
percentage       152
word_problems    147
Name: count, dtype: int64

Sample training problems:
   id       category                                            problem  \
0   0     percentage                                Increase 109 by 25%   
1   1     arithmetic                                   What is 76 + 55?   
2   2  word_problems  Sarah has $286. She spends $128. How much mone...   
3   3       geometry  What is the circumference of a circle with rad...   
4   4       geometry   What is the volume of a cube with side length 3?   
5   5     percentage                                 What is 7% of 132?   
6   6  word_problems  John is 10 years old now. How old was he 15 ye...   
7   7      fractions                  What is 1/5 + 2/5? (decimal form)   
8   8     percentage                                What is 2

## Part 3: Baseline - Dummy Model

Create a baseline to understand what poor performance looks like.

In [5]:
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """
    Calculate accuracy with tolerance for floating point comparisons.

    Two values are considered equal if their difference is <= tolerance
    OR if they round to the same value at 2 decimal places.
    """
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        # Check if both round to same 2 decimal places
        if round(pred, 2) == round(truth, 2):
            correct += 1
        # Or if absolute difference is very small
        elif abs(pred - truth) <= tolerance:
            correct += 1

    return correct / len(predictions)

# Dummy baseline: always predict the mean
mean_solution = train_data['solution'].mean()
print(f"Dummy model (always predicts mean): {mean_solution:.2f}")
print("This demonstrates very poor performance. Your model should do much better!")

Dummy model (always predicts mean): 150.79
This demonstrates very poor performance. Your model should do much better!
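
The cell above prints the mean prediction but never scores it. A minimal sketch of how the dummy baseline could be scored with the same rule as `check_accuracy` follows. The `solutions` list is made-up illustrative data, not values read from `train.csv`.

```python
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """Same rule as the notebook: match at 2 decimals, or differ by at most `tolerance`."""
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        if round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance:
            correct += 1
    return correct / len(predictions)

# Hypothetical solutions standing in for train_data['solution']
solutions = [136.25, 131.0, 158.0, 27.0, 9.24]
mean_solution = sum(solutions) / len(solutions)

# The dummy model predicts the same value for every problem
dummy_predictions = [mean_solution] * len(solutions)
dummy_accuracy = check_accuracy(dummy_predictions, solutions)
print(f"Dummy accuracy: {dummy_accuracy:.2%}")
```

In the notebook, the same call would take `[mean_solution] * len(train_data)` and `train_data['solution'].tolist()` as arguments.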


## Part 4: Utility Functions

Helper functions to extract numerical answers from model outputs.

In [6]:
def extract_number(text):
    """
    Extract a number from text using patterns tried in priority order. Return None if no number is found.

    Handles various formats:
    - "The answer is 42"
    - "42"
    - "= 42"
    - "Result: 42.5"
    - Negative numbers: "-15"
    """
    # Try different patterns in order of specificity
    patterns = [
        r'(?:answer|result|equals?|=)\s*:?\s*(-?\d+\.?\d*)',  # Keyword or "=" directly before the number: "Result: 42.5", "= 42"
        r'(-?\d+\.?\d*)\s*$',  # Number at the end
        r'(-?\d+\.?\d*)',  # Any number
    ]

    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return float(match.group(1))
            except (ValueError, IndexError):
                continue

    return None

# Test extraction
test_strings = [
    "The answer is 42",
    "42",
    "15 + 27 = 42",
    "Calculating... the result is 42.5!",
    "No number here",
    "The value is -15"
]

print("Number extraction tests:")
for s in test_strings:
    result = extract_number(s)
    print(f"  '{s}' -> {result}")

Number extraction tests:
  'The answer is 42' -> 42.0
  '42' -> 42.0
  '15 + 27 = 42' -> 42.0
  'Calculating... the result is 42.5!' -> 42.5
  'No number here' -> None
  'The value is -15' -> -15.0


## Part 5: Load Pre-trained Model

Load a small, efficient model for math problem solving.

In [7]:
# TODO: Load a pre-trained model
# Suggested models:
# - "gpt2" (small, fast)
# - "microsoft/phi-2" (better reasoning, needs more memory)
# - "TinyLlama/TinyLlama-1.1B-Chat-v1.0" (good balance)

model_name = "Qwen/Qwen2.5-Math-1.5B"  # Math-specialized 1.5B model (not one of the suggestions above)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device)

print(f"Loading model: {model_name}...")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model loaded successfully!")
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Loading model: Qwen/Qwen2.5-Math-1.5B...
What is your name? adress
 adress
[" adress" repeated 16 more times]
Model loaded successfully!
Model size: 1543.7M parameters


## Part 6: Prompting Strategies

Test different prompt templates to improve model performance.

In [None]:
# Redefine extract_number: the Part 4 version kept producing errors on few_shot outputs
import re

def extract_number(text):
    """
    Extract a number from text.
    1. If "###finish### <number>" exists, return that number.
    2. Otherwise, return the last number in the text.
    Returns None if no number is found or conversion fails.
    """
    # Try to find the special final tag first
    match = re.search(r"###finish###\s*(-?\d+\.?\d*)", text)
    if match:
        try:
            return float(match.group(1))
        except ValueError:
            return None

    # Otherwise, find the last number in the text
    matches = re.findall(r"-?\d+\.?\d*", text)
    if matches:
        try:
            return float(matches[-1])
        except ValueError:
            return None

    return None
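
To make the two-stage rule concrete, here is a small self-contained sketch that re-declares a compact version of the function and runs it on a few strings. The sample strings are hypothetical, loosely modeled on chain-of-thought style outputs, and are not recorded model responses.

```python
import re

def extract_number(text):
    """Prefer the number after '###finish###'; otherwise fall back to the last number."""
    match = re.search(r"###finish###\s*(-?\d+\.?\d*)", text)
    if match:
        return float(match.group(1))
    matches = re.findall(r"-?\d+\.?\d*", text)
    return float(matches[-1]) if matches else None

# Hypothetical model outputs
samples = [
    "1. 7% = 0.07\n2. 0.07 * 132 = 9.24\n###finish### 9.24\n\nProblem: What is 12% of 50?",
    "0.07 * 132 = 9.24",
    "No number here",
]
for s in samples:
    print(repr(s[:30]), "->", extract_number(s))
```

The tag matters when the model keeps writing after its answer: without it, the last-number fallback would return a number from the trailing text (here `50.0`) rather than the answer.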


In [None]:
def generate_answer(problem, prompt_template="simple", max_new_tokens=500, temperature=0.1):
    """
    Generate an answer using one of several prompt templates.

    Templates:
    - simple: Just the problem
    - instruction: Add instruction to solve
    - cot: Chain-of-thought prompting
    - few_shot: Include examples from training data

    max_new_tokens defaults to 500, the value the generation call previously
    hard-coded (the old default of 10 was never used).
    """
    if prompt_template == "simple":
        prompt = f"{problem}\nAnswer:"

    elif prompt_template == "instruction":
        prompt = f"Solve this math problem and provide only the numerical answer.\n\nProblem: {problem}\nAnswer:"

    elif prompt_template == "cot":
        prompt = f"Solve this math problem step by step, then provide the final numerical answer.\n\nProblem: {problem}\nSolution:\n"

    elif prompt_template == "few_shot":
        examples = []
        for i in range(min(3, len(train_data))):
            example_problem = train_data['problem'].iloc[i]
            example_solution = train_data['solution'].iloc[i]

            # "Détails de calcul" is French for "calculation details"; the string
            # is kept unchanged so the prompt matches the recorded runs.
            cot_example = (
                f"Problem: {example_problem}\n"
                f"Solution:\n"
                f"1. Analyse the problem.\n"
                f"2. Calculate: [Détails de calcul...]\n"
                f"###finish### {example_solution}"
            )
            examples.append(cot_example)

        examples_text = "\n\n".join(examples)

        prompt = f"{examples_text}\n\nSolve the next math problem step by step.\n\nProblem: {problem}\nSolution:\n"

    else:
        prompt = problem

    # Generate
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=temperature > 0,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) breaks when the prompt is truncated or decodes differently.
    prompt_length = inputs["input_ids"].shape[-1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    return response

# Test different prompts on a sample problem
test_problem = train_data['problem'].iloc[5]
test_solution = train_data['solution'].iloc[5]

print(f"Testing problem: {test_problem}")
print(f"Correct answer: {test_solution}\n")
print("="*70)

for template in ["simple", "instruction", "cot", "few_shot"]:
    response = generate_answer(test_problem, template)
    extracted = extract_number(response)

    correct = "✓" if extracted is not None and round(extracted, 2) == round(test_solution, 2) else "✗"

    print(f"{correct} {template}:")
    print(f"  Response: {response[:100]}{'...' if len(response) > 100 else ''}")
    print(f"  Extracted: {extracted}\n")

Testing problem: What is 7% of 132?
Correct answer: 9.24

✓ simple:
  Response: 9.24

To find 7% of 132, we can use the formula:

Percentage = (Percentage / 100) * Number

In this ...
  Extracted: 9.24

✓ instruction:
  Response: ___________
To find 7% of 132, we can follow these steps:

1. Convert the percentage to a decimal. S...
  Extracted: 9.24

✓ cot:
  Response: 1. Convert the percentage to a decimal: 7% = 0.07
2. Multiply the decimal by the number: 0.07 * 132 ...
  Extracted: 9.24

✓ few_shot:
  Response: 1. Analyse the problem.
2. Calculate: [Détails de calcul...]
###FINALE### 9.24

Problem: What is 12%...
  Extracted: 9.24
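
The `few_shot` response above shows the model continuing with a new `Problem:` after it has answered. One possible post-processing step, sketched here as an assumption rather than as part of the exercise template, is to cut the response at that marker before extracting a number. The helper name `truncate_at_next_problem` and the sample `response` string are hypothetical.

```python
def truncate_at_next_problem(response, marker="Problem:"):
    """Keep only the text before the model starts writing a new problem."""
    head, _, _ = response.partition(marker)
    return head.strip()

# Hypothetical response, modeled on the few_shot output shape
response = (
    "1. Convert 7% to 0.07.\n"
    "2. 0.07 * 132 = 9.24\n"
    "###finish### 9.24\n\n"
    "Problem: What is 12% of 50?\nSolution:\n"
)
print(truncate_at_next_problem(response))
```

Applied before `extract_number`, this keeps the last-number fallback from picking up numbers in text the model invented after its answer.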



## Part 7: Evaluate on Validation Set

Test your best prompting strategy on a subset of training data.

In [28]:
# TODO: Choose your best prompt template
best_template = "few_shot"  # Change based on your experiments

# Evaluate on a small validation set (last 50 training examples)
val_data = train_data.tail(50)

predictions = []
ground_truth = val_data['solution'].tolist()

print(f"Evaluating on {len(val_data)} validation problems...\n")

for idx, row in val_data.iterrows():
    problem = row['problem']
    solution = row['solution']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)

    # If no number extracted, use 0 (will be wrong)
    if prediction is None:
        prediction = 0.0

    predictions.append(prediction)

    if (len(predictions) % 10) == 0:
        print(f"Processed {len(predictions)}/{len(val_data)} problems...")

# Calculate accuracy
accuracy = check_accuracy(predictions, ground_truth)
print(f"\nValidation Accuracy: {accuracy:.2%}")
print(f"Need to achieve: 70% on test set")

Evaluating on 50 validation problems...

Processed 10/50 problems...
Processed 20/50 problems...
Processed 30/50 problems...
Processed 40/50 problems...
Processed 50/50 problems...

Validation Accuracy: 82.00%
Need to achieve: 70% on test set
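
A single accuracy figure hides which categories the model struggles with. The sketch below breaks accuracy down per category using the same matching rule as `check_accuracy`. The function name `accuracy_by_category` and the sample values are illustrative assumptions, not results from the run above.

```python
from collections import defaultdict

def accuracy_by_category(categories, predictions, ground_truth, tolerance=0.01):
    """Return {category: accuracy} using the 2-decimal / tolerance matching rule."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for cat, pred, truth in zip(categories, predictions, ground_truth):
        totals[cat] += 1
        if round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance:
            hits[cat] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

# Hypothetical values for illustration only
categories = ["percentage", "percentage", "geometry", "fractions"]
predictions = [9.24, 136.0, 27.0, 0.6]
ground_truth = [9.24, 136.25, 27.0, 0.6]

per_category = accuracy_by_category(categories, predictions, ground_truth)
for cat, acc in sorted(per_category.items()):
    print(f"{cat:<12} {acc:.0%}")
```

In the notebook, the call would be `accuracy_by_category(val_data['category'], predictions, ground_truth)`.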


## Part 8: Generate Test Predictions

Generate predictions for the test set and create submission file.

In [26]:
# TODO: Generate predictions on test set
print(f"Generating predictions on {len(test_data)} test problems...\n")

test_predictions = []

for idx, row in test_data.iterrows():
    problem = row['problem']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)

    # If no number extracted, use 0
    if prediction is None:
        prediction = 0.0
        print(f"⚠️  Warning: No number extracted for problem {idx}: {problem[:50]}...")

    test_predictions.append(prediction)

    if (idx + 1) % 10 == 0:
        print(f"Processed {idx + 1}/{len(test_data)} problems...")

print("\nAll test predictions generated!")

Generating predictions on 100 test problems...

Processed 10/100 problems...
Processed 20/100 problems...
Processed 30/100 problems...
Processed 40/100 problems...
Processed 50/100 problems...
Processed 60/100 problems...
Processed 70/100 problems...
Processed 80/100 problems...
Processed 90/100 problems...
Processed 100/100 problems...

All test predictions generated!


## Part 9: Create Submission File

Save predictions in the required format for evaluation.

In [27]:
# Create submission DataFrame
submission = pd.DataFrame({
    'id': test_data['id'],
    'solution': test_predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
print("\nSubmission preview:")
print(submission.head(10))

# Verify all predictions are numerical
non_numeric = submission['solution'].isna().sum()
if non_numeric > 0:
    print(f"\n⚠️  WARNING: {non_numeric} predictions are not numerical!")
    print("These will result in incorrect answers. Please fix them.")
else:
    print("\n✓ All predictions are numerical")

# Show statistics
print("\nPrediction statistics:")
print(submission['solution'].describe())

Submission file created: submission.csv

Submission preview:
   id  solution
0   0     98.10
1   1    314.00
2   2    224.00
3   3     96.50
4   4    102.00
5   5     91.20
6   6     69.44
7   7    400.00
8   8    560.00
9   9    304.50

✓ All predictions are numerical

Prediction statistics:
count     100.000000
mean       94.012333
std       162.816939
min        -8.000000
25%         7.000000
50%        39.500000
75%       113.087500
max      1133.860000
Name: solution, dtype: float64
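
The `isna()` check above only catches missing values. A slightly stricter sanity check on the saved file could look like the sketch below, which uses only the standard library. It assumes the grader expects exactly the `id` and `solution` columns written above, and the `validate_submission` helper is hypothetical.

```python
import csv
import io
import math

def validate_submission(csv_text, expected_ids):
    """Return a list of problems found; an empty list means the file looks fine."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows or set(rows[0].keys()) != {"id", "solution"}:
        return ["expected exactly the columns 'id' and 'solution'"]

    problems = []
    if [row["id"] for row in rows] != [str(i) for i in expected_ids]:
        problems.append("ids do not match the expected test ids")

    for row in rows:
        try:
            value = float(row["solution"])
        except (TypeError, ValueError):
            problems.append(f"id {row['id']}: non-numeric solution {row['solution']!r}")
            continue
        if not math.isfinite(value):
            problems.append(f"id {row['id']}: solution is not finite")
    return problems

good = "id,solution\n0,98.1\n1,314.0\n"
bad = "id,solution\n0,abc\n1,nan\n"
print(validate_submission(good, [0, 1]))
print(validate_submission(bad, [0, 1]))
```

In the notebook, `csv_text` could be read with `open('submission.csv').read()` and `expected_ids` taken from `test_data['id']`.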


## Part 10 (Optional): Fine-Tuning with LoRA

If prompting doesn't achieve 70% accuracy, consider fine-tuning with LoRA.

In [13]:
# TODO: Implement LoRA fine-tuning (OPTIONAL)
from peft import LoraConfig, get_peft_model, TaskType
from torch.utils.data import Dataset, DataLoader

# This is a template - implement if needed
print("LoRA fine-tuning is optional.")
print("Use this if prompting strategies don't achieve 70% accuracy.")
print("\nConsider:")
print("- Prepare training dataset in correct format")
print("- Configure LoRA parameters (r=8, alpha=32)")
print("- Train for a few epochs")
print("- Evaluate and compare with prompting approaches")

LoRA fine-tuning is optional.
Use this if prompting strategies don't achieve 70% accuracy.

Consider:
- Prepare training dataset in correct format
- Configure LoRA parameters (r=8, alpha=32)
- Train for a few epochs
- Evaluate and compare with prompting approaches
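
For the first item in the list above ("Prepare training dataset in correct format"), one plausible layout is to reuse the prompt shape from the `few_shot` template so that fine-tuning and inference look alike. This is a sketch under that assumption. The `format_training_example` helper is hypothetical, and the two example rows are copied by hand from problems shown earlier rather than loaded from `train.csv`.

```python
def format_training_example(problem, solution, tag="###finish###"):
    """Render one (problem, solution) pair as a single training text."""
    return f"Problem: {problem}\nSolution:\n{tag} {solution}"

# Hand-written rows standing in for train_data[['problem', 'solution']]
rows = [
    ("What is 7% of 132?", 9.24),
    ("What is 76 + 55?", 131.0),
]
texts = [format_training_example(problem, solution) for problem, solution in rows]
print(texts[0])
```

Each text would then be tokenized and passed to whichever training loop is used with the LoRA-wrapped model. Keeping the `###finish###` tag in the targets means the tag-first branch of `extract_number` still applies after fine-tuning.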


## Questions

Answer the following questions:

1. **Which prompting strategy worked best and why?**
   - YOUR ANSWER HERE

2. **What types of math problems were most challenging for the model?**
   - YOUR ANSWER HERE

3. **How did you handle number extraction from model outputs?**
   - YOUR ANSWER HERE

4. **What are the limitations of using LLMs for mathematical reasoning?**
   - YOUR ANSWER HERE

5. **If you used LoRA fine-tuning, what were the trade-offs compared to prompting?**
   - YOUR ANSWER HERE (or N/A if you didn't use LoRA)