# Exercise 3: Mathematical Problem Solving ⭐

**This is a marked (graded) exercise.**

Apply LLMs to mathematical reasoning tasks. Test different pre-trained models with various prompting strategies and optionally fine-tune with LoRA to improve performance.

**Learning Objectives:**
- Evaluate LLMs on mathematical reasoning
- Design effective prompts for numerical tasks
- Implement and compare different prompting strategies
- Optionally: Fine-tune models using LoRA
- Measure performance using the mean squared error (MSE) metric

**Deliverables:**
- Completed notebook with your approach
- `submission.csv` with predictions on the test set
- Score: MSE on the test set (lower is better; +inf if the submission contains non-numerical values)
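
The scoring rule can be sketched in a few lines of plain Python. This is only an illustration of the rule as stated above, not the actual grader: the function name `score_submission` is hypothetical, and treating NaN and infinite values as non-numerical is an assumption.

```python
import math

def score_submission(y_true, y_pred):
    """Return the MSE, or +inf if any prediction is not a finite number.

    Assumption: NaN and infinite predictions count as non-numerical.
    """
    total = 0.0
    for truth, pred in zip(y_true, y_pred):
        try:
            value = float(pred)
        except (TypeError, ValueError):
            return math.inf
        if not math.isfinite(value):
            return math.inf
        total += (value - truth) ** 2
    return total / len(y_true)

print(score_submission([42, 9], [40, 10]))      # (4 + 1) / 2 = 2.5
print(score_submission([42, 9], [40, "nine"]))  # inf
```

Because errors are squared, a single prediction that is far off can dominate the score, so a sensible numeric fallback matters as much as getting easy problems exactly right.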

## Part 1: Setup and Load Data

In [None]:
!pip install transformers torch peft datasets pandas scikit-learn matplotlib -q

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
# Load the math dataset
# The dataset will be provided with:
# - train.csv: problem, solution (numerical value)
# - test.csv: problem (no solution)

# For now, we'll create a sample dataset
# You should replace this with the actual dataset when provided

train_data = pd.DataFrame({
    'problem': [
        "What is 15 + 27?",
        "Calculate 8 * 9",
        "What is 100 - 35?",
        "Divide 144 by 12",
        "What is 25 + 30 + 15?",
        "Calculate 7 * 8 - 10",
        "What is (20 + 10) * 2?",
        "Find the result of 50 / 5 + 3",
        "What is 3^3?",
        "Calculate 100 - 25 * 2"
    ],
    'solution': [42, 72, 65, 12, 70, 46, 60, 13, 27, 50]
})

test_data = pd.DataFrame({
    'problem': [
        "What is 23 + 19?",
        "Calculate 6 * 7",
        "What is 90 - 45?",
        "Divide 81 by 9",
        "What is 12 + 18 + 24?"
    ]
})

# Ground truth for evaluation (not given to students)
test_ground_truth = [42, 42, 45, 9, 54]

print(f"Train set size: {len(train_data)}")
print(f"Test set size: {len(test_data)}")
print("\nSample training data:")
print(train_data.head())

## Part 2: Baseline - Dummy Model

First, create a baseline to understand what "bad" performance looks like.

In [None]:
# TODO: Implement dummy baseline
def dummy_model_predict(problems):
    """Always predict the mean of training set."""
    mean_solution = train_data['solution'].mean()
    return [mean_solution] * len(problems)

# Evaluate dummy model
dummy_predictions = dummy_model_predict(test_data['problem'])
dummy_mse = mean_squared_error(test_ground_truth, dummy_predictions)

print("Dummy Model (predicts mean):")
print(f"  Predicted value: {dummy_predictions[0]:.2f}")
print(f"  MSE: {dummy_mse:.2f}")
print("\nThis demonstrates bad performance. Your model should do much better!")

## Part 3: Utility Functions

Helper functions to extract numerical answers from model outputs.
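
One design question is which number to pick when the output contains several, as step-by-step responses usually do. A minimal sketch, assuming the final answer is the last number in the text; the helper name `extract_last_number` and the thousands-separator handling are illustrative choices, not part of the exercise template:

```python
import re

NUMBER = re.compile(r'-?\d+(?:\.\d+)?')

def extract_last_number(text):
    """Return the last number in text as a float, or None if there is none."""
    # Drop thousands separators such as the comma in "1,250",
    # but leave list commas such as "3,4" alone.
    cleaned = re.sub(r'(?<=\d),(?=\d{3}\b)', '', text)
    matches = NUMBER.findall(cleaned)
    return float(matches[-1]) if matches else None

for s in ["15 + 27 = 42", "First 7 * 8 = 56, then 56 - 10 = 46.", "No number here"]:
    print(f"'{s}' -> {extract_last_number(s)}")
```

Taking the last match suits chain-of-thought outputs, but it can pick up trailing text such as "Step 3" if the model keeps generating after the answer, so it is worth checking against real responses.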

In [None]:
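def extract_number(text):
    """Extract a number from text. Return None if no number is found.

    Patterns are tried in order: an explicit answer ("answer is", "equals", "="),
    then a number at the end of the text, then the first number anywhere.
    """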
    # TODO: Implement number extraction
    # Look for patterns like: "The answer is 42", "42", "= 42", etc.
    
    # Try to find number patterns
    patterns = [
        r'(?:answer is|equals|=)\s*(-?\d+\.?\d*)',  # "answer is 42" or "= 42"
        r'(-?\d+\.?\d*)\s*$',  # Number at the end
        r'(-?\d+\.?\d*)',  # Any number
    ]
    
    for pattern in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            try:
                return float(match.group(1))
            except ValueError:
                continue
    
    return None

# Test extraction
test_strings = [
    "The answer is 42",
    "42",
    "15 + 27 = 42",
    "Calculating... the result is 42!",
    "No number here"
]

print("Number extraction tests:")
for s in test_strings:
    result = extract_number(s)
    print(f"  '{s}' -> {result}")

## Part 4: Approach A - Prompt Engineering

Test different prompting strategies with various pre-trained models.
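
When few-shot examples come from the training set and the query is itself a training problem, the prompt can end up containing the answer to the question being asked. A minimal sketch of prompt construction that avoids this, assuming examples are passed as `(problem, solution)` pairs; `build_few_shot_prompt` is a hypothetical helper, not part of the template:

```python
def build_few_shot_prompt(problem, examples, k=3):
    """Build a few-shot prompt from (problem, solution) pairs.

    Pairs whose problem text equals the query are skipped, so the prompt
    never contains the answer to the question being asked.
    """
    shots = [(p, s) for p, s in examples if p != problem][:k]
    blocks = [f"Problem: {p}\nAnswer: {s}" for p, s in shots]
    blocks.append(f"Problem: {problem}\nAnswer:")
    return "\n\n".join(blocks)

examples = [
    ("What is 15 + 27?", 42),
    ("Calculate 8 * 9", 72),
    ("What is 100 - 35?", 65),
    ("Divide 144 by 12", 12),
]
print(build_few_shot_prompt("What is 15 + 27?", examples))
```

Keeping prompt construction separate from generation also makes it possible to inspect prompts without loading a model.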

In [None]:
# TODO: Load a pre-trained model
# Suggested models: gpt2, microsoft/phi-2, TinyLlama/TinyLlama-1.1B-Chat-v1.0

model_name = "gpt2"  # Start with GPT-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to(device)

# Set padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Loaded model: {model_name}")

In [None]:
def generate_answer(problem, prompt_template="simple", max_new_tokens=50):
    """
    Generate answer using different prompt templates.
    
    Templates:
    - simple: Just the problem
    - instruction: Add instruction to solve
    - cot: Chain-of-thought prompting
    - few_shot: Include examples
    """
    # TODO: Implement different prompt templates
    
    if prompt_template == "simple":
        prompt = f"{problem}\nAnswer:"
    
    elif prompt_template == "instruction":
        prompt = f"Solve this math problem and provide only the numerical answer.\n\n{problem}\nAnswer:"
    
    elif prompt_template == "cot":
        prompt = f"Solve this math problem step by step.\n\n{problem}\nLet's think step by step:\n"
    
    elif prompt_template == "few_shot":
        # Include examples from training data
        examples = "\n\n".join([
            f"Problem: {train_data['problem'].iloc[i]}\nAnswer: {train_data['solution'].iloc[i]}"
            for i in range(min(3, len(train_data)))
        ])
        prompt = f"{examples}\n\nProblem: {problem}\nAnswer:"
    
    else:
        prompt = problem
    
    # Generate
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.1,  # Low temperature for more deterministic output
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    
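    # Decode only the newly generated tokens. Slicing by token count is safer
    # than slicing the decoded string by len(prompt), because decoding does not
    # always reproduce the prompt text character for character.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()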
    
    return response

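# Test different prompts on a training problem that is not one of the
# few-shot examples (the few_shot template uses the first three rows)
test_idx = len(train_data) - 1
test_problem = train_data['problem'].iloc[test_idx]
print(f"Testing problem: {test_problem}")
print(f"Correct answer: {train_data['solution'].iloc[test_idx]}\n")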

for template in ["simple", "instruction", "cot", "few_shot"]:
    response = generate_answer(test_problem, template)
    extracted = extract_number(response)
    print(f"{template}:")
    print(f"  Response: {response[:100]}...")
    print(f"  Extracted: {extracted}\n")

## Part 5: Evaluate on Test Set

Apply your best prompting strategy to the test set.
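
Some responses will contain no extractable number, and the value substituted in that case affects the score. A minimal sketch comparing two fallbacks on toy values; `fill_missing` is a hypothetical helper, and 45.7 is the mean of the sample training solutions defined in Part 1:

```python
def fill_missing(predictions, fallback):
    """Replace None entries with a numeric fallback value."""
    return [fallback if p is None else float(p) for p in predictions]

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

truth = [42, 54]           # toy ground truth, for illustration only
extracted = [42.0, None]   # the second extraction failed

print(mse(truth, fill_missing(extracted, 0)))     # (0 + 54**2) / 2 = 1458.0
print(mse(truth, fill_missing(extracted, 45.7)))  # roughly 8.3**2 / 2
```

Which fallback is better depends on the answer distribution: for a true answer of 9, falling back to 0 would beat the training mean.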

In [None]:
# TODO: Choose your best prompt template and generate predictions
best_template = "few_shot"  # Change this based on your experiments

predictions = []
responses = []

for problem in test_data['problem']:
    response = generate_answer(problem, prompt_template=best_template)
    prediction = extract_number(response)
    
    # If no number was extracted, fall back to 0 (likely a large error for that item)
    if prediction is None:
        prediction = 0
    
    predictions.append(prediction)
    responses.append(response)
    
    print(f"Problem: {problem}")
    print(f"Response: {response[:100]}...")
    print(f"Prediction: {prediction}\n")

# Calculate MSE
mse = mean_squared_error(test_ground_truth, predictions)
print(f"\nMSE on test set: {mse:.2f}")
print(f"Improvement over dummy: {(dummy_mse - mse) / dummy_mse * 100:.1f}%")

## Part 6: (Optional) Fine-Tuning with LoRA

If prompting doesn't work well enough, fine-tune the model using LoRA.
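
LoRA keeps the pre-trained weights frozen and trains two small low-rank matrices per adapted weight, so the number of trainable parameters is easy to estimate before running anything. The sketch below does that arithmetic for rank 8 on GPT-2's `c_attn` projection; the helper `lora_param_count` is hypothetical, and the figures assume the smallest GPT-2 checkpoint (12 layers, hidden size 768) with no bias terms trained.

```python
def lora_param_count(d_in, d_out, r, n_layers):
    """Trainable parameters added by LoRA adapters of rank r.

    Each adapted weight of shape (d_in, d_out) gets two low-rank factors:
    A with r * d_in entries and B with d_out * r entries.
    """
    return n_layers * r * (d_in + d_out)

# GPT-2 small: 12 layers, hidden size 768; c_attn maps 768 -> 3 * 768.
print(lora_param_count(d_in=768, d_out=3 * 768, r=8, n_layers=12))  # 294912
```

That is well under 1% of the roughly 124 million parameters in the base model, which is why LoRA fits on modest hardware.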

In [None]:
# TODO: Implement LoRA fine-tuning (OPTIONAL)
from peft import LoraConfig, get_peft_model, TaskType
from torch.utils.data import Dataset, DataLoader

class MathDataset(Dataset):
    def __init__(self, problems, solutions, tokenizer, max_length=128):
        self.problems = problems
        self.solutions = solutions
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.problems)
    
    def __getitem__(self, idx):
        # Format: "Problem: ... Answer: ..."
        text = f"Problem: {self.problems[idx]}\nAnswer: {self.solutions[idx]}"
        
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
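        input_ids = encoding['input_ids'].flatten()
        attention_mask = encoding['attention_mask'].flatten()

        # Exclude padding positions from the loss (-100 is ignored by the LM loss)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }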

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # Rank
    lora_alpha=32,
    target_modules=["c_attn"],  # For GPT-2
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Create PEFT model
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()

print("LoRA fine-tuning setup complete (commented out for now)")
print("Uncomment and implement if you want to fine-tune")

## Part 7: Try Different Models

Compare performance across different pre-trained models.
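
A comparison is easier to keep fair if every model goes through the same scoring function. The harness below is a sketch under that assumption: `compare_predictors` is a hypothetical helper, and the two constant predictors are toy stand-ins for real models, used only to exercise the code.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def compare_predictors(predictors, problems, y_true):
    """Score each named predictor and return (name, mse) pairs, best first."""
    results = {
        name: mse(y_true, [predict(p) for p in problems])
        for name, predict in predictors.items()
    }
    return sorted(results.items(), key=lambda item: item[1])

# Toy stand-ins for real models (not actual model results).
problems = ["What is 23 + 19?", "Divide 81 by 9"]
y_true = [42, 9]
predictors = {
    "always_zero": lambda problem: 0.0,
    "always_forty": lambda problem: 40.0,
}

for name, score in compare_predictors(predictors, problems, y_true):
    print(f"{name}: MSE = {score:.2f}")
```

In the notebook, each predictor would wrap one loaded model together with its prompt template and number extraction, so that the loop over `models_to_try` only has to register a callable.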

In [None]:
# TODO: Compare different models
# Suggested models to try:
# - gpt2
# - gpt2-medium
# - microsoft/phi-2 (if you have enough memory)
# - TinyLlama/TinyLlama-1.1B-Chat-v1.0

models_to_try = [
    "gpt2",
    # Add more models here
]

results = {}

for model_name in models_to_try:
    print(f"\nTesting {model_name}...")
    # Load model, generate predictions, calculate MSE
    # YOUR CODE HERE
    pass

# Plot comparison
# YOUR CODE HERE

## Part 8: Create Submission File

Generate predictions for the test set and save them to `submission.csv`.
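
Since a non-numerical value makes the score +inf, it can help to re-read the saved file and check it independently of the DataFrame it came from. A minimal sketch using only the standard library; `invalid_rows` is a hypothetical helper, and the column name `prediction` is assumed to match the submission format used in this notebook.

```python
import csv
import io
import math

def invalid_rows(csv_text, column="prediction"):
    """Return the 0-based indices of rows whose prediction is not a finite number."""
    bad = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        try:
            ok = math.isfinite(float(row[column]))
        except (KeyError, TypeError, ValueError):
            ok = False
        if not ok:
            bad.append(i)
    return bad

sample = (
    "problem,prediction\n"
    "What is 23 + 19?,42.0\n"
    "Divide 81 by 9,\n"
    "What is 90 - 45?,nan\n"
)
print(invalid_rows(sample))  # [1, 2]
```

To check the real file, pass its contents, for example `invalid_rows(open('submission.csv').read())`. The empty field in the sample is how pandas writes a missing value by default.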

In [None]:
# TODO: Generate final predictions and create submission file
submission = pd.DataFrame({
    'problem': test_data['problem'],
    'prediction': predictions
})

# Save to CSV
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
print("\nSubmission preview:")
print(submission)

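# Verify all predictions are finite numbers (NaN, inf, and non-numeric values all fail)
numeric = pd.to_numeric(submission['prediction'], errors='coerce')
non_numeric = (~np.isfinite(numeric)).sum()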
if non_numeric > 0:
    print(f"\n⚠️  WARNING: {non_numeric} predictions are not numerical!")
    print("These will result in +inf score. Please fix them.")
else:
    print("\n✓ All predictions are numerical")

## Questions

Answer the following questions:

1. **Which prompting strategy worked best and why?**
   - YOUR ANSWER HERE

2. **How do different models compare in mathematical reasoning?**
   - YOUR ANSWER HERE

3. **What are the main challenges in extracting numerical answers from LLM outputs?**
   - YOUR ANSWER HERE

4. **How could you improve the model's performance further?**
   - YOUR ANSWER HERE

5. **If you used LoRA fine-tuning, what were the trade-offs compared to prompting?**
   - YOUR ANSWER HERE (or N/A if you didn't use LoRA)

## Deliverables Checklist

- [ ] Implemented at least 3 different prompting strategies
- [ ] Tested at least 2 different pre-trained models
- [ ] Calculated MSE on test set
- [ ] Created `submission.csv` with all numerical predictions
- [ ] Answered all questions
- [ ] (Optional) Implemented LoRA fine-tuning