# TP2: Mathematical Problem Solving with LLMs

**Day 2 - AI for Sciences Winter School**

**Instructor:** Raphael Cousin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/racousin/ai_for_sciences/blob/main/day2/tp2_bonus.ipynb)

---

## Objectives

By the end of this practical, you will understand:

1. **Context Engineering**: How prompt design affects LLM performance
2. **Prompting Strategies**: Zero-shot, few-shot, and chain-of-thought
3. **Fine-tuning with LoRA**: Adapting models efficiently with limited resources
4. **Evaluation**: Measuring accuracy on mathematical reasoning tasks

---

# Part 1: Setup and Data Loading

In [None]:
!pip install -q git+https://github.com/racousin/ai_for_sciences.git
!pip install -q transformers torch peft accelerate bitsandbytes

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print("Setup complete!")

## Load the Math Problems Dataset

We have 900 math problems across different categories: arithmetic, algebra, geometry, fractions, percentages, and word problems.

In [None]:
# Load data from local file (or download if running on Colab)
import os

data_path = 'data/maths.csv'
if not os.path.exists(data_path):
    # Download from GitHub if not available locally
    !mkdir -p data
    !wget -q -O data/maths.csv https://raw.githubusercontent.com/racousin/ai_for_sciences/main/day2/data/maths.csv

data = pd.read_csv(data_path)
print(f"Dataset size: {len(data)} problems")
print(f"\nCategory distribution:")
print(data['category'].value_counts())

In [None]:
# Look at some example problems
print("Sample problems:\n")
for i in range(5):
    row = data.iloc[i]
    print(f"[{row['category']}] {row['problem']}")
    print(f"  Answer: {row['solution']}\n")

In [None]:
# Create a test set (100 problems) for evaluation
test_data = data.sample(n=100, random_state=42)
# Drop the sampled rows by their original index, then reset both indices
train_data = data.drop(test_data.index).reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

print(f"Train set: {len(train_data)} problems (for few-shot examples & fine-tuning)")
print(f"Test set: {len(test_data)} problems (for evaluation)")

---

# Part 2: Utility Functions

We need functions to extract numerical answers from model outputs and evaluate accuracy.
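
To make the extraction step concrete, here is a minimal standalone sketch of pulling an `answer` field out of free-form text. The helper name `find_answer` and the sample strings are illustrative only, and the pattern assumes a flat JSON object with no nested braces.

```python
import json
import re

# Matches the first flat JSON object that contains an "answer" key
ANSWER_PATTERN = re.compile(r'\{[^{}]*"answer"\s*:\s*[^{}]*\}')

def find_answer(text):
    """Return the "answer" value as a float, or None if nothing usable is found."""
    match = ANSWER_PATTERN.search(text)
    if match is None:
        return None
    try:
        return float(json.loads(match.group())["answer"])
    except (ValueError, TypeError, KeyError):
        # Malformed JSON or a non-numeric answer
        return None

print(find_answer('Sure! {"thought": "5 + 3 = 8", "answer": 8} Done.'))
print(find_answer('The answer is 8'))
print(find_answer('{"thought": "unsure", "answer": "n/a"}'))
```

Only the first call yields a number. The other two return `None`, which the evaluation should treat as a wrong answer rather than as a numeric guess.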

In [None]:
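import json  # used by extract_number to parse the model's JSON output

def extract_number(text):
    """Extract the numeric answer from a JSON object embedded in a model response."""
    if text is None:
        return None

    # Find the first flat JSON object containing an "answer" key
    match = re.search(r'\{[^{}]*"answer"\s*:\s*[^{}]*\}', text)
    if match:
        try:
            parsed = json.loads(match.group())
            return float(parsed["answer"])
        except (ValueError, TypeError, KeyError):
            # Malformed JSON or a non-numeric answer
            pass

    return None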


# Test the extraction function
test_cases = [
    '{"thought": "5 + 3 = 8", "answer": 8}',
    '{"thought": "The result is negative", "answer": -15.5}',
    'Solution: {"thought": "calculating", "answer": 42} more text',
    'Some explanation\n{"thought": "...", "answer": 100}\nProblem: next',
    'No JSON here',
]

print("Number extraction tests:")
for s in test_cases:
    result = extract_number(s)
    print(f"  {s[:50]:50} -> {result}")

In [None]:
def compute_accuracy(predictions, ground_truth, tolerance=0.01):
    """
    Calculate accuracy with tolerance for floating point comparisons.
    
    Two values are considered equal if:
    - They round to the same value at 2 decimal places, OR
    - Their absolute difference is <= tolerance
    """
    correct = 0
    for pred, truth in zip(predictions, ground_truth):
        if pred is None:
            continue
        if round(pred, 2) == round(truth, 2):
            correct += 1
        elif abs(pred - truth) <= tolerance:
            correct += 1
    
    return correct / len(predictions)

---

# Part 3: Load Pre-trained Model

We'll use a small instruction-tuned model that can run on limited hardware. Qwen2-0.5B-Instruct is a good choice: small enough for Colab, but capable of following instructions and doing basic math.

In [None]:
# Load a small instruction-tuned model (works on Colab free tier)
model_name = "Qwen/Qwen2-0.5B-Instruct"

print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
model = model.to(device)

# Set padding token if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model loaded!")
print(f"Parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")

In [None]:
def generate_answer(prompt, model, tokenizer, max_new_tokens=50):
    """
    Generate an answer using the specified prompt.
    Uses greedy decoding for deterministic, reproducible outputs.
    """
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # Greedy decoding for reproducible outputs
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens (everything after the prompt tokens)
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()

    return response

---

# Part 4: Context Engineering - Prompting Strategies

**Context engineering** (or prompt engineering) is the art of designing prompts that help LLMs perform better. The same model can produce very different results depending on how you ask!

We'll explore three main strategies:

1. **Zero-shot**: Just ask the question directly
2. **Few-shot**: Provide examples before asking
3. **Chain-of-Thought (CoT)**: Ask the model to reason step by step

## Strategy 1: Zero-shot Prompting

The simplest approach: just ask the question.

In [None]:
def make_zero_shot_prompt(problem):
    """Simple prompt - ask for JSON output."""
    return f'''Solve this math problem. Reply with JSON: {{"thought": "your reasoning", "answer": number}}

Problem: {problem}'''

# Test on one problem
test_problem = test_data.iloc[0]
prompt = make_zero_shot_prompt(test_problem['problem'])

print("ZERO-SHOT PROMPT:")
print("-" * 50)
print(prompt)
print("-" * 50)

response = generate_answer(prompt, model, tokenizer)
extracted = extract_number(response)

print(f"\nModel response: {response[:150]}{'...' if len(response) > 150 else ''}")
print(f"Extracted answer: {extracted}")
print(f"Correct answer: {test_problem['solution']}")

## Strategy 2: Few-shot Prompting

Provide examples to show the model what format we expect.

In [None]:
def make_few_shot_prompt(problem, n_examples=3):
    """Include examples showing the expected JSON format with real reasoning."""
    # Hand-crafted examples with actual step-by-step reasoning
    examples = [
        ('What is 15% of 80?', '{"thought": "15% = 0.15, so 0.15 × 80 = 12", "answer": 12}'),
        ('Calculate 48 + 37', '{"thought": "48 + 37 = 85", "answer": 85}'),
        ('What is the area of a rectangle with length 6 and width 4?', 
         '{"thought": "Area = length × width = 6 × 4 = 24", "answer": 24}'),
    ]
    
    examples_text = "\n\n".join([f'Problem: {p}\n{a}' for p, a in examples[:n_examples]])
    return f'''Solve math problems. Reply with JSON: {{"thought": "your reasoning", "answer": number}}

{examples_text}

Problem: {problem}'''

# Test on the same problem
prompt = make_few_shot_prompt(test_problem['problem'])

print("FEW-SHOT PROMPT:")
print("-" * 50)
print(prompt)
print("-" * 50)

response = generate_answer(prompt, model, tokenizer)
extracted = extract_number(response)

print(f"\nModel response: {response[:150]}{'...' if len(response) > 150 else ''}")
print(f"Extracted answer: {extracted}")
print(f"Correct answer: {test_problem['solution']}")

## Strategy 3: Chain-of-Thought Prompting

Ask the model to reason step by step. This often improves performance on math problems.

In [None]:
def make_cot_prompt(problem):
    """Chain-of-thought: ask for detailed reasoning in JSON."""
    return f'''Solve this math problem step by step. Reply with JSON: {{"thought": "detailed step-by-step reasoning", "answer": number}}

Problem: {problem}'''

# Test on the same problem
prompt = make_cot_prompt(test_problem['problem'])

print("CHAIN-OF-THOUGHT PROMPT:")
print("-" * 50)
print(prompt)
print("-" * 50)

response = generate_answer(prompt, model, tokenizer, max_new_tokens=100)
extracted = extract_number(response)

print(f"\nModel response: {response[:200]}{'...' if len(response) > 200 else ''}")
print(f"Extracted answer: {extracted}")
print(f"Correct answer: {test_problem['solution']}")

## Compare All Strategies

Let's evaluate each strategy on a subset of the test set.
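
With a small evaluation subset, each accuracy figure is a noisy estimate, which matters when comparing strategies. The sketch below uses the normal approximation to a binomial proportion to gauge that noise. The function name `accuracy_std_error` and the example accuracy of 50% are illustrative and not part of the notebook.

```python
import math

def accuracy_std_error(accuracy, n_problems):
    """Standard error of an accuracy estimate (binomial, normal approximation)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)

for n in (20, 100):
    half_width = 1.96 * accuracy_std_error(0.5, n)
    print(f"n={n:3d}: accuracy 50% +/- {half_width:.1%} (approx. 95% interval)")
```

Under these assumptions, a difference of one or two problems out of 20 between strategies is well within the noise, so small gaps should be read with caution.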

In [None]:
def evaluate_strategy(strategy_name, prompt_fn, test_subset, model, tokenizer, max_new_tokens=50):
    """Evaluate a prompting strategy on test data."""
    predictions = []
    ground_truth = test_subset['solution'].tolist()
    
    for idx, row in test_subset.iterrows():
        prompt = prompt_fn(row['problem'])
        response = generate_answer(prompt, model, tokenizer, max_new_tokens)
        pred = extract_number(response)
        # Keep None when no answer could be parsed; compute_accuracy counts it as incorrect
        predictions.append(pred)
    
    accuracy = compute_accuracy(predictions, ground_truth)
    return accuracy, predictions

# Evaluate on a small subset (20 problems) for speed
eval_subset = test_data.head(20)

print("Evaluating prompting strategies on 20 test problems...\n")

strategies = {
    'Zero-shot': make_zero_shot_prompt,
    'Few-shot (3 examples)': make_few_shot_prompt,
    'Chain-of-Thought': make_cot_prompt,
}

results = {}
for name, prompt_fn in strategies.items():
    print(f"Testing {name}...", end=" ")
    max_tokens = 100 if name == 'Chain-of-Thought' else 50
    acc, preds = evaluate_strategy(name, prompt_fn, eval_subset, model, tokenizer, max_tokens)
    results[name] = acc
    print(f"Accuracy: {acc:.1%}")

print("\n" + "="*50)
print("RESULTS SUMMARY:")
print("="*50)
for name, acc in results.items():
    bar = "*" * int(acc * 20)
    print(f"{name:25} {acc:6.1%}  {bar}")

### Question 1

1. Which prompting strategy performed best? Why do you think that is?
2. Qwen2-0.5B is a small model (500M parameters). How might results differ with larger models?
3. Can you think of other prompting strategies that might help?

---

# Part 5: Fine-tuning with LoRA

When prompting isn't enough, we can **fine-tune** the model on our specific task.

**LoRA (Low-Rank Adaptation)** is an efficient fine-tuning technique that:
- Freezes the original model weights
- Adds small trainable matrices to specific layers
- Cuts the number of trainable parameters, and the gradient and optimizer memory that goes with them, by orders of magnitude compared to full fine-tuning
- Can be trained on consumer hardware

```
Original weight matrix W (frozen)
         +
LoRA matrices: A @ B (trainable, low-rank)
         =
Adapted weights: W + A @ B
```
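
The diagram above can be made concrete with a tiny pure-Python sketch. The sizes (`d = 8`, `r = 2`), the random values, and the helpers `matmul` and `adapted_weights` are made up for illustration. Real implementations such as `peft` work on tensors and also scale the update by `lora_alpha / r`, which is omitted here.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(0)
d, r = 8, 2  # illustrative sizes: W is d x d, the update has rank at most r

# Frozen weight matrix W (identity, purely for readability)
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]

# Trainable factors: A is d x r (small random values), B is r x d (zeros).
# Starting one factor at zero means the adapted weights initially equal W.
A = [[random.gauss(0.0, 0.1) for _ in range(r)] for _ in range(d)]
B = [[0.0] * d for _ in range(r)]

def adapted_weights(W, A, B):
    """Return W + A @ B without modifying W."""
    delta = matmul(A, B)  # d x d update built from only 2 * d * r numbers
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, delta)]

print("Unchanged at initialization:", adapted_weights(W, A, B) == W)

B[0][0] = 1.0  # pretend a training step moved a single entry of B
changed = adapted_weights(W, A, B)
print("First column after the update:", [round(row[0], 3) for row in changed])

print(f"Full update: {d * d} values, LoRA update: {d * r + r * d} values")
```

The saving grows with the layer size: the full update needs `d * d` values while the low-rank factors need `2 * d * r`, so the ratio is `d / (2 * r)`.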

In [None]:
# Clear memory before fine-tuning (important for Colab!)
import gc

del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Memory cleared, ready for fine-tuning")

In [None]:
from peft import LoraConfig, get_peft_model, TaskType

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                    # Rank of the update matrices (lower = fewer parameters)
    lora_alpha=32,          # Scaling factor
    lora_dropout=0.1,       # Dropout for regularization
    target_modules=["q_proj", "v_proj"],  # Attention layers to adapt
)

print("LoRA Configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Target modules: {lora_config.target_modules}")

In [None]:
# Reload the base model for fine-tuning
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
base_model = base_model.to(device)

# Apply LoRA
lora_model = get_peft_model(base_model, lora_config)

# Count parameters
total_params = sum(p.numel() for p in lora_model.parameters())
trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)

print(f"\nParameter comparison:")
print(f"  Total parameters: {total_params / 1e6:.0f}M")
print(f"  Trainable parameters: {trainable_params / 1e6:.2f}M")
print(f"  Trainable %: {100 * trainable_params / total_params:.2f}%")
print(f"\nWe're only training {trainable_params / total_params:.2%} of the model!")

## Prepare Training Data

We need to format our math problems for causal language modeling.
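
As a rough sketch of what this formatting involves: each example becomes one text string (the problem followed by the JSON target), and the labels are a copy of the token ids in which padding positions are set to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The token ids, `PAD_ID`, and the helper names below are invented for illustration; a real tokenizer would produce the ids.

```python
import json

PAD_ID = 0            # hypothetical padding token id
IGNORE_INDEX = -100   # default ignore_index of torch.nn.CrossEntropyLoss

def format_example(problem, solution):
    """Build the training text: the problem followed by the JSON target."""
    target = json.dumps({"thought": "solving", "answer": solution})
    return f"Problem: {problem}\n{target}"

def pad_and_label(token_ids, max_length):
    """Truncate or right-pad token ids and build labels that skip the padding."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [PAD_ID] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    labels = [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, attention_mask)]
    return input_ids, attention_mask, labels

print(format_example("What is 2 + 3?", 5))
print(pad_and_label([11, 42, 7], max_length=5))
```

Masking the padding keeps the loss focused on the problem and answer tokens instead of on predicting pad tokens.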

In [None]:
from torch.utils.data import Dataset, DataLoader

class MathDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Format as JSON output
        text = f'Problem: {row["problem"]}\n{{"thought": "solving", "answer": {row["solution"]}}}'
        
        encoding = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        
        input_ids = encoding['input_ids'].squeeze(0)
        attention_mask = encoding['attention_mask'].squeeze(0)

        # For causal LM, labels = input_ids, with padding masked out
        # (-100 is ignored by the loss)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }

# Create datasets
# Note: head(800) covers the whole training split (900 problems minus the 100 test problems).
# Lower this number for a faster run, or add your own math problems for better results.
train_subset = train_data.head(800)
train_dataset = MathDataset(train_subset, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

print(f"Training dataset: {len(train_dataset)} examples")
print(f"Batches per epoch: {len(train_loader)}")

## Training Loop

In [None]:
from torch.optim import AdamW

# Training setup
optimizer = AdamW(lora_model.parameters(), lr=1e-4)
num_epochs = 3

lora_model.train()
losses = []

print("Starting LoRA fine-tuning...")
print("="*50)

for epoch in range(num_epochs):
    epoch_losses = []
    
    for batch_idx, batch in enumerate(train_loader):
        # Move to device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        # Forward pass
        outputs = lora_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_losses.append(loss.item())
        losses.append(loss.item())
    
    avg_loss = np.mean(epoch_losses)
    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}")

print("="*50)
print("Fine-tuning complete!")

In [None]:
# Plot training loss
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(losses, alpha=0.7)
ax.set_xlabel('Training Step')
ax.set_ylabel('Loss')
ax.set_title('LoRA Fine-tuning Loss')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Evaluate Fine-tuned Model

In [None]:
lora_model.eval()

print("Comparing Prompting Strategies vs LoRA Fine-tuning")
print("="*60)

# Get best prompting result from Part 4

In [None]:
best_prompting = max(results.items(), key=lambda x: x[1])
print(f"\nBest prompting strategy: {best_prompting[0]} ({best_prompting[1]:.1%})")

# Evaluate fine-tuned model (zero-shot, since it learned the task)

In [None]:
eval_subset = test_data.head(20)
print("Evaluating LoRA model (zero-shot)...", end=" ")
lora_acc, _ = evaluate_strategy("LoRA", make_zero_shot_prompt, eval_subset, lora_model, tokenizer)
print(f"Accuracy: {lora_acc:.1%}")

print("\n" + "="*60)
print("COMPARISON:")
print(f"  Best prompting ({best_prompting[0]}): {best_prompting[1]:.1%}")
print(f"  LoRA fine-tuned (zero-shot):  {lora_acc:.1%}")
if lora_acc > best_prompting[1]:
    print(f"  Improvement: +{lora_acc - best_prompting[1]:.1%}")
else:
    print(f"  (Fine-tuning may need more data or epochs)")

In [None]:
# Compare base model vs fine-tuned model
lora_model.eval()

print("Comparing Base Model vs LoRA Fine-tuned Model")
print("="*60)

# Evaluate both on same test subset
eval_subset = test_data.head(20)

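# Base model (few-shot). The original `model` was deleted to free memory,
# so temporarily disable the LoRA adapter to recover base-model behavior.
print("\nEvaluating base model (few-shot)...", end=" ")
with lora_model.disable_adapter():
    base_acc, _ = evaluate_strategy("Base", make_few_shot_prompt, eval_subset, lora_model, tokenizer)
print(f"Accuracy: {base_acc:.1%}")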

# Fine-tuned model (zero-shot, since it learned the task)
print("Evaluating LoRA model (zero-shot)...", end=" ")
lora_acc, _ = evaluate_strategy("LoRA", make_zero_shot_prompt, eval_subset, lora_model, tokenizer)
print(f"Accuracy: {lora_acc:.1%}")

print("\n" + "="*60)
print("COMPARISON:")
print(f"  Base model (few-shot):     {base_acc:.1%}")
print(f"  LoRA fine-tuned (zero-shot): {lora_acc:.1%}")
if lora_acc > base_acc:
    print(f"  Improvement: +{lora_acc - base_acc:.1%}")
else:
    print(f"  (Fine-tuning may need more data or epochs)")

### Question 2

1. How does the fine-tuned model compare to the prompting approaches?
2. Why might fine-tuning help (or not help) for this task?
3. What are the trade-offs between prompting vs fine-tuning?

---

# Part 6: Exercise - Improve the Results

Now it's your turn! Try to improve the model's accuracy.

## Exercise 1: Better Prompts

Design a better prompt template. Consider:
- More specific instructions
- Different number of few-shot examples
- Category-specific examples
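
One possible way to implement category-specific examples is to keep a small pool of solved problems per category and build the prompt from the matching pool. The sketch below is hypothetical: `EXAMPLE_POOL`, `build_category_prompt`, the category keys, and the sample problems are invented for illustration, and it assumes the problem's category is known at prompt time (the dataset has a `category` column, whose exact values may differ from the keys used here).

```python
# Hypothetical pool of worked examples, keyed by category name
EXAMPLE_POOL = {
    "arithmetic": [
        ("Calculate 48 + 37", '{"thought": "48 + 37 = 85", "answer": 85}'),
    ],
    "percentages": [
        ("What is 15% of 80?", '{"thought": "0.15 * 80 = 12", "answer": 12}'),
    ],
}

HEADER = 'Solve math problems. Reply with JSON: {"thought": "your reasoning", "answer": number}'

def build_category_prompt(problem, category, n_examples=1):
    """Few-shot prompt using examples from the same category, if any exist."""
    examples = EXAMPLE_POOL.get(category, [])[:n_examples]
    parts = [HEADER]
    parts += [f"Problem: {p}\n{a}" for p, a in examples]
    parts.append(f"Problem: {problem}")
    return "\n\n".join(parts)

print(build_category_prompt("What is 20% of 50?", "percentages"))
```

For an unknown category the function falls back to a zero-shot prompt, since no examples are available to include.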

In [None]:
# TODO: Create your own prompt template
def make_custom_prompt(problem):
    """Your custom prompt strategy. Use JSON format: {"thought": "...", "answer": X}"""
    # <-- Modify this!
    prompt = f'''Solve this math problem. Reply with JSON: {{"thought": "your reasoning", "answer": number}}

Problem: {problem}'''
    return prompt

# Test your prompt
test_problem = test_data.iloc[0]
prompt = make_custom_prompt(test_problem['problem'])
print("Your prompt:")
print(prompt)
print("\n" + "-"*50)

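# `model` was deleted before fine-tuning; disable the LoRA adapter to query the base model
with lora_model.disable_adapter():
    response = generate_answer(prompt, lora_model, tokenizer)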
print(f"Response: {response}")
print(f"Extracted: {extract_number(response)}")
print(f"Correct: {test_problem['solution']}")

## Exercise 2: Experiment with LoRA Parameters

Try different LoRA configurations:
- `r`: 4, 8, 16 (higher = more parameters)
- `lora_alpha`: 16, 32, 64
- More training epochs
- Different learning rates
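
To get a feel for how `r` and `lora_alpha` interact before running experiments, the sketch below counts the trainable values LoRA adds to one linear layer, `r * (d_in + d_out)`, and the `lora_alpha / r` factor that scales the update. The 1024 x 1024 layer size is an arbitrary illustration, not the actual shape of any Qwen2 projection, and `lora_params_per_layer` is a hypothetical helper.

```python
def lora_params_per_layer(d_in, d_out, r):
    """Trainable values added by one LoRA adapter (factors of size d_in x r and r x d_out)."""
    return r * (d_in + d_out)

d_in = d_out = 1024  # illustrative layer size only
for r, alpha in [(4, 16), (8, 32), (16, 64)]:
    n = lora_params_per_layer(d_in, d_out, r)
    print(f"r={r:2d}, lora_alpha={alpha:2d}: {n:6d} trainable values, scaling = {alpha / r}")
```

All three pairs keep `lora_alpha / r` at 4, so they differ only in capacity. Changing `r` while holding `lora_alpha` fixed also changes the effective scale of the update, which is worth keeping in mind when comparing runs.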

In [None]:
# TODO: Experiment with different LoRA configurations
# Hint: Reload the base model and try different settings

# Example:
# new_lora_config = LoraConfig(
#     task_type=TaskType.CAUSAL_LM,
#     r=16,  # <-- Try different values
#     lora_alpha=64,
#     lora_dropout=0.05,
#     target_modules=["q_proj", "v_proj"],  # Attention layers for Qwen2
# )

print("Experiment with different configurations!")

---

# Summary

## Key Takeaways

### Context Engineering (Prompting)
- **Zero-shot**: Simple but often insufficient for complex tasks
- **Few-shot**: Providing examples helps the model understand the task format
- **Chain-of-Thought**: Step-by-step reasoning often improves math performance, especially with larger models
- The same model can perform very differently based on how you prompt it

### Fine-tuning with LoRA
- **LoRA** adds small trainable matrices while keeping base model frozen
- Much more efficient than full fine-tuning (trains ~0.1-1% of parameters)
- Can be done on consumer hardware
- Trade-off: requires training data and compute, but can outperform prompting

### When to Use What?

| Approach | When to Use |
|----------|-------------|
| Zero-shot | Quick experiments, simple tasks |
| Few-shot | Have a few examples, need better format adherence |
| Chain-of-Thought | Reasoning tasks (math, logic) |
| LoRA Fine-tuning | Have training data, need maximum performance |


---

## Reflection Questions

1. **In your research domain**, what tasks might benefit from context engineering vs fine-tuning?

2. **What kind of examples** would you include in few-shot prompts for your domain?

3. **If you were to fine-tune** a model for your research, what data would you use?