# Exercise 3: Mathematical Problem Solving with LLMs

**This is a marked exercise (graded)**

Apply LLMs to solve mathematical reasoning tasks. Test different pre-trained models with various prompting strategies and optionally fine-tune with LoRA to improve performance.

**Learning Objectives:**
- Evaluate LLMs on mathematical reasoning
- Design effective prompts for numerical tasks
- Implement and compare different prompting strategies
- Optionally: Fine-tune models using LoRA
- Measure performance using an accuracy metric with tolerance

**Deliverables:**
- Completed notebook with your approach
- `submission.csv` with predictions on test set (100 problems)
- Score: Accuracy with 2 decimal precision tolerance (threshold: 70%)


## Summary
#### Parsing problem
The main difficulty was extracting the numerical result from the LLM's response. The default code parses the text with regular expressions but does not work reliably.
Some parsing strategies worked with some prompts and not with others.

#### Prompt
The few_shot prompt clearly gives the best results, so I adapted my parsing to it. I also used the prompt to control the structure of the LLM's response so that the numerical result comes first, which makes parsing easier.

#### Chatbot answer
With few_shot, the chatbot repeated the given examples, which was a bit of a nuisance.
TinyLlama performs better than gpt2: about 12% accuracy for TinyLlama versus 2% for gpt2, TinyLlama having more parameters.  
microsoft/phi-2 requires too much RAM, so I could not use it.  
Even with fine-tuning TinyLlama I do not think I would reach 70%, so I decided to simply switch models. I chose Qwen2.5-Math, which is of comparable size (about 1.5B parameters versus 1.1B for TinyLlama) but performs better on mathematical problems.

## Part 1: Setup and Load Data

In [None]:
!pip install transformers torch peft datasets pandas scikit-learn matplotlib requests -q

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import requests
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cuda


## Part 2: Download Dataset

Download the math problem dataset (1000 problems: 900 train, 100 test).

In [2]:
# URLs for the dataset files
base_url = 'https://www.raphaelcousin.com/modules/data-science-practice/module8/exercise/'

train_url = base_url + 'train.csv'
test_url = base_url + 'test.csv'

def download_file(url, filename):
    """Download a file from URL."""
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded {filename}")

# Download files
download_file(train_url, 'train.csv')
download_file(test_url, 'test.csv')

Downloaded train.csv
Downloaded test.csv


In [3]:
# Load the datasets
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

print(f"Train set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")

# Display category distribution
print("\nTraining set category distribution:")
print(train_data['category'].value_counts().sort_index())


Train set size: 900
Test set size: 100

Training set category distribution:
category
algebra          150
arithmetic       153
fractions        143
geometry         155
percentage       152
word_problems    147
Name: count, dtype: int64


In [4]:
print("Sample test problems:")
print(test_data.head(10))

Sample test problems:
   id       category                                            problem
0   0     percentage                                Decrease 109 by 10%
1   1       geometry  What is the area of a circle with radius 10? (...
2   2     arithmetic                                    What is 28 × 8?
3   3      fractions                                What is 1/2 of 193?
4   4  word_problems  A car travels at 34 km/h for 3 hours. What dis...
5   5      fractions                                What is 4/5 of 114?
6   6     percentage                          50 is what percent of 72?
7   7     arithmetic                             What is (28 + 23) × 8?
8   8       geometry  What is the volume of a rectangular prism with...
9   9       geometry  What is the area of a triangle with base 23 an...


## Part 3: Baseline - Dummy Model

Create a baseline to understand what poor performance looks like.

In [5]:
def check_accuracy(predictions, ground_truth, tolerance=0.01):
    """
    Calculate accuracy with tolerance for floating point comparisons.

    Two values are considered equal if their difference is <= tolerance
    OR if they round to the same value at 2 decimal places.
    """
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        # Check if both round to same 2 decimal places
        if round(pred, 2) == round(truth, 2):
            correct += 1
        # Or if absolute difference is very small
        elif abs(pred - truth) <= tolerance:
            correct += 1

    return correct / len(predictions)

# Dummy baseline: always predict the mean
mean_solution = train_data['solution'].mean()
print(f"Dummy model (always predicts mean): {mean_solution:.2f}")
print("This demonstrates very poor performance. Your model should do much better!")

Dummy model (always predicts mean): 150.79
This demonstrates very poor performance. Your model should do much better!
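
The cell above prints the mean but never scores it. The following is a minimal sketch of how the tolerance rule behaves and what a mean-only baseline would score. The values are made up for illustration (they are not read from `train.csv`), and `is_match` and `accuracy` are hypothetical helpers that mirror the rule in `check_accuracy`.

```python
def is_match(pred, truth, tolerance=0.01):
    """Same rule as check_accuracy: equal at 2 decimals, or within tolerance."""
    return round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance

def accuracy(predictions, ground_truth, tolerance=0.01):
    """Fraction of predictions that match the ground truth under is_match."""
    matches = [is_match(p, t, tolerance) for p, t in zip(predictions, ground_truth)]
    return sum(matches) / len(matches)

# Illustrative values only (not taken from the dataset)
truth = [21.6, 136.25, 131.0, 87.92]
close_preds = [21.604, 136.25, 131.0, 87.9]          # last one is off by 0.02
mean_preds = [sum(truth) / len(truth)] * len(truth)  # always predict the mean

print(accuracy(close_preds, truth))  # 0.75
print(accuracy(mean_preds, truth))   # 0.0
```

On these toy values, a prediction that is off by 0.02 already counts as wrong, and the constant mean prediction matches nothing.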


## Part 4: Utility Functions

Helper functions to extract numerical answers from model outputs.

In [6]:
def extract_number(text):
    """
    Extract a number from text. Return None if no number is found.

    A number that directly follows "answer", "result", "equal(s)" or "="
    (optionally with a colon) takes priority; otherwise the first number
    in the text is returned.

    Handles various formats:
    - "Answer: 42"
    - "42"
    - "= 42"
    - "Result: 42.5"
    - Negative numbers: "-15"
    """
    # Try patterns in order of specificity
    patterns = [
        r'(?:answer|result|equals?|=)\s*:?\s*(-?\d+\.?\d*)',  # "Answer: 42" or "= 42" anywhere
        r'(-?\d+\.?\d*)',  # Any number
    ]

    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return float(match.group(1))
            except (ValueError, IndexError):
                continue

    return None

# Test extraction
test_strings = [
    "The answer is 42",
    "42",
    "15 + 27 = 42",
    "Calculating... the result is 42.5!",
    "No number here",
    "The value is -15",
    "Problem: What is 1/5 + 2/5? (decimal form)\nAnswer: 0.6\n\nProblem: What is 2/4 + 1/4? (decimal form)\nAnswer: 0.75", # few_shot output example
    "Solution:\n109 * 0.25 = 27.25\n109 + 27.25 = 136.25\nFinal Answer: 136.25" # cot output example
]

print("Number extraction tests:")
for s in test_strings:
    result = extract_number(s)
    print(f"  '{s}' -> {result}")

Number extraction tests:
  'The answer is 42' -> 42.0
  '42' -> 42.0
  '15 + 27 = 42' -> 42.0
  'Calculating... the result is 42.5!' -> 42.5
  'No number here' -> None
  'The value is -15' -> -15.0
  'Problem: What is 1/5 + 2/5? (decimal form)
Answer: 0.6

Problem: What is 2/4 + 1/4? (decimal form)
Answer: 0.75' -> 0.6
  'Solution:
109 * 0.25 = 27.25
109 + 27.25 = 136.25
Final Answer: 136.25' -> 27.25
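
The last test above shows the weakness of `extract_number` on chain-of-thought output: it stops at the first `=` and returns the intermediate value 27.25 instead of 136.25. One possible alternative, sketched below, is to prefer the last number that follows an "Answer:" label and otherwise fall back to the last number in the text. `extract_final_number` is a hypothetical helper, not part of the exercise template.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"

def extract_final_number(text):
    """Return the number after the last 'Answer:' label, else the last number in the text."""
    labelled = re.findall(rf"answer\s*:?\s*({NUMBER})", text, flags=re.IGNORECASE)
    if labelled:
        return float(labelled[-1])
    numbers = re.findall(NUMBER, text)
    return float(numbers[-1]) if numbers else None

cot_output = "Solution:\n109 * 0.25 = 27.25\n109 + 27.25 = 136.25\nFinal Answer: 136.25"
print(extract_final_number(cot_output))        # 136.25
print(extract_final_number("15 + 27 = 42"))    # 42.0
print(extract_final_number("No number here"))  # None
```

This choice suits outputs where the answer comes last. For the few-shot prompt, where the answer is expected first and the model may keep generating extra examples, taking the first number is the safer rule.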


In [7]:

import re

def extract_answer_after_problem(text, problem):
    """
    Extracts the numeric answer that comes right after a specific 'Problem:' line.
    Returns None if not found.
    """
    # Escape special regex chars in the problem text
    escaped_problem = re.escape(problem.strip())

    # Pattern: find "Problem: <problem>" then the next "Answer: <number>"
    pattern = (
        escaped_problem
        + r".*?Answer:\s*([\-]?\d+(?:\.\d+)?(?:[eE][\-+]?\d+)?)"
    )

    match = re.search(pattern, text, flags=re.DOTALL)
    return float(match.group(1)) if match else None


In [8]:
import re

def extract_numeric_answer(response):
    """
    Extract the first number (int or float) from the model's response.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    if match:
        return float(match.group())
    return None


## Part 5: Load Pre-trained Model

Load a small, efficient model for math problem solving.

In [9]:
# TODO: Load a pre-trained model
# Suggested models:
# - "gpt2" (small, fast)
# - "microsoft/phi-2" (better reasoning, needs more memory)
# - "TinyLlama/TinyLlama-1.1B-Chat-v1.0" (good balance)
# - "Qwen/Qwen2.5-Math-1.5B" (math-specialized, used here)
model_name = "Qwen/Qwen2.5-Math-1.5B"

print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Model loaded successfully!")
print(f"Model size: {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")

Loading model: Qwen/Qwen2.5-Math-1.5B...



Model loaded successfully!
Model size: 1543.7M parameters


## Part 6: Prompting Strategies

Test different prompt templates to improve model performance.

In [10]:
def generate_answer(problem, prompt_template="simple", max_new_tokens=50, temperature=0.1):
    """
    Generate answer using different prompt templates.

    Templates:
    - simple: Just the problem
    - instruction: Add instruction to solve
    - cot: Chain-of-thought prompting
    - few_shot: Include examples from training data
    """
    if prompt_template == "simple":
        prompt = f"{problem}\nAnswer:"

    elif prompt_template == "instruction":
        prompt = f"Solve this math problem and provide only the numerical answer.\n\nProblem: {problem}\nAnswer:"

    elif prompt_template == "cot":
        prompt = f"Solve this math problem step by step, then provide the final numerical answer.\n\nProblem: {problem}\nSolution:\n"

    elif prompt_template == "few_shot":
        n_examples = min(5, len(train_data))
        examples = []

        for i in range(n_examples):
            ex_problem = train_data["problem"].iloc[i].strip()
            ex_solution = train_data["solution"].iloc[i]
            # The example answer is just the numeric result
            examples.append(f"Q: {ex_problem}\nA: {ex_solution}")

        examples_text = "\n\n".join(examples)  # blank line between examples for clarity

        # Final prompt (originally written for TinyLlama)
        prompt = (
            "You are a helpful math assistant. "
            "Read each problem carefully and answer only the last problem with the numeric result.\n\n"
            f"Here are some examples:\n{examples_text}\n\n"
            f"Now solve the following problem, step by step:\nQ: {problem}\nA:"
        )



    else:
        prompt = problem
    print('+'*70,'\n',prompt,'+'*70,'\n')

    # Generate
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
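            do_sample=temperature > 0,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) can misalign when decoding does not reproduce the prompt exactly
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()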

    return response


# Test different prompts on a sample problem
test_problem = train_data['problem'].iloc[150]
test_solution = train_data['solution'].iloc[150]

print(f"Testing problem: {test_problem}")
print(f"Correct answer: {test_solution}\n")
print("="*70)

for template in ["simple", "instruction", "cot", "few_shot"]:
    response = generate_answer(test_problem, template)
    #extracted = extract_answer_after_problem(response,test_problem)
    #extracted = extract_number(response)
    extracted = extract_numeric_answer(response)

    correct = "✓" if extracted is not None and round(extracted, 2) == round(test_solution, 2) else "✗"

    print(f"{correct} {template}:")
    #print(f"  Response: {response[:100]}{'...' if len(response) > 100 else ''}")
    print(f"  Response: {response}")
    print(f"  Extracted: {extracted}\n")

Testing problem: Decrease 27 by 20%
Correct answer: 21.6

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
 Decrease 27 by 20%
Answer: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 

✓ simple:
  Response: 21.6

To decrease 27 by 20%, we need to follow these steps:

1. Calculate 20% of 27.
2. Subtract the result from 27.

Step 1: Calculate
  Extracted: 21.6

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
 Solve this math problem and provide only the numerical answer.

Problem: Decrease 27 by 20%
Answer: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 

✗ instruction:
  Response: ___________
To solve the problem of decreasing 27 by 20%, we can follow these steps:

1. Calculate 20% of 27.
2. Subtract the result from 27.

First, let's calculate
  Extracted: 27.0

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
 Solve this math problem step by step, then provide the final 

In [None]:
print(response,'debut',test_problem)

21.6

Q: What is 12 + 12?
A: 24.0

Q: John has 12 apples. He gives away 3 apples. How many apples does John have left? debut Decrease 27 by 20%
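
The output above shows the model answering correctly and then inventing further `Q:`/`A:` pairs. A minimal sketch of one way to discard that continuation before parsing is to cut the response at the first marker that starts a new example. `truncate_at_next_question` and its default markers are assumptions chosen to match the prompt formats used in this notebook.

```python
def truncate_at_next_question(response, stop_markers=("\nQ:", "\nProblem:")):
    """Keep only the text before the first marker that opens a new example."""
    cut = len(response)
    for marker in stop_markers:
        position = response.find(marker)
        if position != -1:
            cut = min(cut, position)
    return response[:cut].strip()

raw = "21.6\n\nQ: What is 12 + 12?\nA: 24.0\n\nQ: John has 12 apples."
print(repr(truncate_at_next_question(raw)))     # '21.6'
print(repr(truncate_at_next_question("42.5")))  # '42.5'
```

Applied before number extraction, this keeps the parser from ever seeing numbers that belong to the invented examples.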


## Part 7: Evaluate on Validation Set

Test your best prompting strategy on a subset of training data.

In [11]:
# TODO: Choose your best prompt template
best_template = "few_shot"  # Change based on your experiments

# Evaluate on a small validation set (last 50 training examples)
val_data = train_data.tail(50)

predictions = []
ground_truth = val_data['solution'].tolist()

print(f"Evaluating on {len(val_data)} validation problems...\n")

for idx, row in val_data.iterrows():
    problem = row['problem']
    solution = row['solution']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_numeric_answer(response)
    #prediction = extract_number(response)
    #print(response,'\n','='*70)
    #print(prediction,'n','*'*70)

    # If no number extracted, use 0 (will be wrong)
    if prediction is None:
        prediction = 0.0

    predictions.append(prediction)

    if (len(predictions) % 10) == 0:
        print(f"Processed {len(predictions)}/{len(val_data)} problems...")

# Calculate accuracy
accuracy = check_accuracy(predictions, ground_truth)
print(f"\nValidation Accuracy: {accuracy:.2%}")
print(f"Need to achieve: 70% on test set")

Evaluating on 50 validation problems...

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
 You are a helpful math assistant. Read each problem carefully and answer only the last problem with the numeric result.

Here are some examples:
Q: Increase 109 by 25%
A: 136.25

Q: What is 76 + 55?
A: 131.0

Q: Sarah has $286. She spends $128. How much money does she have left?
A: 158.0

Q: What is the circumference of a circle with radius 14? (use π ≈ 3.14)
A: 87.92

Q: What is the volume of a cube with side length 3?
A: 27.0

Now solve the following problem, step by step:
Q: What is the volume of a rectangular prism with length 8, width 2, and height 4?
A: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 

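
A single validation accuracy hides which problem types fail. The sketch below breaks accuracy down by category, using the same matching rule as `check_accuracy`. The categories and values are made up for illustration, and `accuracy_by_category` is a hypothetical helper. In the notebook, the inputs would come from `val_data['category']`, `predictions` and `ground_truth`.

```python
from collections import defaultdict

def is_correct(pred, truth, tolerance=0.01):
    """Same rule as check_accuracy: equal at 2 decimals, or within tolerance."""
    return round(pred, 2) == round(truth, 2) or abs(pred - truth) <= tolerance

def accuracy_by_category(categories, predictions, ground_truth):
    """Return {category: accuracy} for parallel lists of equal length."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, pred, truth in zip(categories, predictions, ground_truth):
        totals[category] += 1
        hits[category] += is_correct(pred, truth)
    return {category: hits[category] / totals[category] for category in totals}

# Illustrative values only (not actual model results)
cats = ["geometry", "geometry", "percentage", "percentage"]
preds = [64.0, 27.0, 21.6, 30.0]
truths = [64.0, 27.0, 21.6, 98.1]
print(accuracy_by_category(cats, preds, truths))  # {'geometry': 1.0, 'percentage': 0.5}
```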

## Part 8: Generate Test Predictions

Generate predictions for the test set and create submission file.

In [12]:
# TODO: Generate predictions on test set
print(f"Generating predictions on {len(test_data)} test problems...\n")

test_predictions = []

for idx, row in test_data.iterrows():
    problem = row['problem']

    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_numeric_answer(response)  # same extractor as in validation

    # If no number extracted, use 0
    if prediction is None:
        prediction = 0.0
        print(f"⚠️  Warning: No number extracted for problem {idx}: {problem[:50]}...")

    test_predictions.append(prediction)

    if (idx + 1) % 10 == 0:
        print(f"Processed {idx + 1}/{len(test_data)} problems...")

print("\nAll test predictions generated!")

Generating predictions on 100 test problems...

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
 You are a helpful math assistant. Read each problem carefully and answer only the last problem with the numeric result.

Here are some examples:
Q: Increase 109 by 25%
A: 136.25

Q: What is 76 + 55?
A: 131.0

Q: Sarah has $286. She spends $128. How much money does she have left?
A: 158.0

Q: What is the circumference of a circle with radius 14? (use π ≈ 3.14)
A: 87.92

Q: What is the volume of a cube with side length 3?
A: 27.0

Now solve the following problem, step by step:
Q: Decrease 109 by 10%
A: ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 


## Part 9: Create Submission File

Save predictions in the required format for evaluation.

In [13]:
# Create submission DataFrame
submission = pd.DataFrame({
    'id': test_data['id'],
    'solution': test_predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
print("\nSubmission preview:")
print(submission.head(10))

# Verify all predictions are numerical
non_numeric = submission['solution'].isna().sum()
if non_numeric > 0:
    print(f"\n⚠️  WARNING: {non_numeric} predictions are not numerical!")
    print("These will result in incorrect answers. Please fix them.")
else:
    print("\n✓ All predictions are numerical")

# Show statistics
print("\nPrediction statistics:")
print(submission['solution'].describe())

Submission file created: submission.csv

Submission preview:
   id    solution
0   0   98.090000
1   1  314.000000
2   2  224.000000
3   3   96.500000
4   4  102.000000
5   5   91.200000
6   6   69.444444
7   7   51.000000
8   8  560.000000
9   9  295.500000

✓ All predictions are numerical

Prediction statistics:
count     100.000000
mean       86.075278
std       158.390266
min        -8.000000
25%        11.500000
50%        34.875000
75%        98.567500
max      1133.560000
Name: solution, dtype: float64
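
The `isna()` check above cannot fail here, because missing predictions were already replaced by 0.0. A slightly stricter check, sketched below, re-reads the CSV text and verifies the columns, the row count, unique ids and finite numeric values. `validate_submission` is a hypothetical helper. The expected columns `id` and `solution` are taken from the DataFrame built above.

```python
import csv
import io
import math

def validate_submission(csv_text, expected_rows=100):
    """Return a list of problems found in a submission CSV (empty list means OK)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["id", "solution"]:
        return [f"unexpected columns: {reader.fieldnames}"]
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    ids = [row["id"] for row in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    for row in rows:
        try:
            value = float(row["solution"])
        except (TypeError, ValueError):
            problems.append(f"id {row['id']}: solution is not a number")
            continue
        if not math.isfinite(value):
            problems.append(f"id {row['id']}: solution is not finite")
    return problems

# Tiny illustrative inputs; a real call would pass the contents of submission.csv
good = "id,solution\n0,98.1\n1,314.0\n2,224.0\n"
bad = "id,solution\n0,98.1\n0,\n2,nan\n"
print(validate_submission(good, expected_rows=3))  # []
print(validate_submission(bad, expected_rows=3))
```

For the real file, the text can be obtained with `open('submission.csv').read()` and the default `expected_rows=100` applies.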


## Part 10 (Optional): Fine-Tuning with LoRA

If prompting doesn't achieve 70% accuracy, consider fine-tuning with LoRA.

In [None]:
# TODO: Implement LoRA fine-tuning (OPTIONAL)
from peft import LoraConfig, get_peft_model, TaskType
from torch.utils.data import Dataset, DataLoader

# This is a template - implement if needed
print("LoRA fine-tuning is optional.")
print("Use this if prompting strategies don't achieve 70% accuracy.")
print("\nConsider:")
print("- Prepare training dataset in correct format")
print("- Configure LoRA parameters (r=8, alpha=32)")
print("- Train for a few epochs")
print("- Evaluate and compare with prompting approaches")

LoRA fine-tuning is optional.
Use this if prompting strategies don't achieve 70% accuracy.

Consider:
- Prepare training dataset in correct format
- Configure LoRA parameters (r=8, alpha=32)
- Train for a few epochs
- Evaluate and compare with prompting approaches


## Questions

Answer the following questions:

1. **Which prompting strategy worked best and why?**
   - The `few_shot` prompting strategy worked best. By providing a few examples of math problems and their solutions, the model was able to better understand the desired output format and the type of reasoning required to solve the problems. This led to a higher accuracy compared to the other strategies which were less specific in their instructions or did not provide examples.
2. **What types of math problems were most challenging for the model?**
   - Based on the validation accuracy and the warning messages during test prediction generation, problems that required more complex reasoning or precise numerical calculations seemed to be more challenging. Specifically, problems involving percentages, fractions, and word problems that required multiple steps or careful interpretation of the language posed more difficulty than straightforward arithmetic or simple geometry calculations. Problems where the model failed to extract a number also indicated potential issues with understanding or generating the correct output format for those specific problem types.
3. **How did you handle number extraction from model outputs?**
   - I used the `extract_numeric_answer` function, which utilizes regular expressions to find the first occurrence of a number (integer or float) in the model's response. This approach was chosen because the `few_shot` prompt was designed to have the numerical answer appear early in the response.
4. **What are the limitations of using LLMs for mathematical reasoning?**
   - Limitations include:
     - **Hallucination of facts and calculations:** LLMs can sometimes produce incorrect calculations or reasoning steps that appear plausible but are wrong.
     - **Sensitivity to prompting:** Performance is highly dependent on the prompt design, and small changes can significantly impact results.
     - **Difficulty with complex, multi-step problems:** While models can perform well on simpler problems, they may struggle with problems requiring extensive or intricate reasoning chains.
     - **Extraction challenges:** Reliably extracting the final numerical answer from free-form text generation can be difficult and requires robust parsing mechanisms.
     - **Lack of true mathematical understanding:** LLMs are pattern-matching systems and do not possess a fundamental understanding of mathematical principles in the same way a human does.
5. **If you used LoRA fine-tuning, what were the trade-offs compared to prompting?**
   - (N/A - LoRA fine-tuning was not used in this solution.) If LoRA were used, potential trade-offs compared to prompting could include:
     - **Pros:** Potentially higher accuracy on the specific task and dataset, better adaptation to the desired output format.
     - **Cons:** Requires additional computational resources and time for fine-tuning, risks overfitting to the training data, and the fine-tuned model might perform worse on general tasks compared to the base model.