# Math Reasoning Optimization with DSPy

This notebook demonstrates how to optimize mathematical reasoning using DSPy:
- Building math reasoning modules with step-by-step solutions
- Using DSPy optimizers to improve accuracy
- Handling different types of mathematical problems
- Creating training data and evaluation metrics
- Comparing optimized vs unoptimized performance

Mathematical reasoning is a demanding task for language models: a single arithmetic slip invalidates the final answer, which makes it a good test case for systematic prompt optimization.

## Setup and Imports

In [None]:
import os
import sys
sys.path.append('../../')

import dspy
import re
from typing import List, Dict, Any
from utils import setup_default_lm, print_step, print_result, print_error
from utils.datasets import get_sample_math_data
from dotenv import load_dotenv

# Load environment variables
load_dotenv('../../.env')

## Configure Language Model

In [None]:
print_step("Configuring Language Model", "Setting up DSPy with OpenAI")

try:
    lm = setup_default_lm(provider="openai", model="gpt-3.5-turbo", max_tokens=1200)
    dspy.configure(lm=lm)
    print_result("Language model configured successfully!")
except Exception as e:
    print_error(f"Failed to configure language model: {e}")
    print("Make sure you have set your OPENAI_API_KEY in the .env file")
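
The configuration cell above only reports a failure after `setup_default_lm` has been called. A minimal sketch of failing earlier with a clearer message is shown below; `require_env` is a hypothetical helper written for this illustration, not part of the notebook's `utils` module.

```python
import os

def require_env(name: str, env=None) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration. `env` defaults to os.environ and
    can be replaced with a plain dict, which keeps the function easy to test.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file before configuring the LM"
        )
    return value

# Example with an explicit mapping instead of the real environment
print(require_env("OPENAI_API_KEY", env={"OPENAI_API_KEY": "sk-example"}))
```

Calling `require_env("OPENAI_API_KEY")` right after `load_dotenv` would stop the notebook before any model call is attempted.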

## Basic Math Reasoning Module

Let's start with a basic math reasoning module before optimization.

In [None]:
print_step("Basic Math Reasoning", "Creating unoptimized math solver")

class MathReasoning(dspy.Signature):
    """Solve mathematical problems step by step."""
    problem = dspy.InputField(desc="The mathematical problem to solve")
    reasoning = dspy.OutputField(desc="Step-by-step solution showing all work")
    answer = dspy.OutputField(desc="The final numerical answer")

class BasicMathSolver(dspy.Module):
    """Basic mathematical problem solver."""
    
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought(MathReasoning)
    
    def forward(self, problem):
        result = self.solve(problem=problem)
        return dspy.Prediction(
            problem=problem,
            reasoning=result.reasoning,
            answer=result.answer
        )

# Create basic solver
basic_solver = BasicMathSolver()

# Test with sample problems
test_problems = [
    "If a train travels 60 miles per hour for 2.5 hours, how far does it travel?",
    "What is the area of a rectangle with length 8 and width 5?",
    "Solve for x: 2x + 5 = 13",
    "A store offers a 20% discount on a $80 item. What is the final price?",
    "If 3 apples cost $2.25, how much do 8 apples cost?"
]

print("Basic Math Solver Results:")
print("=" * 70)

basic_results = []
for i, problem in enumerate(test_problems, 1):
    result = basic_solver(problem=problem)
    basic_results.append(result)
    
    print(f"\n{i}. Problem: {problem}")
    print(f"   Reasoning: {result.reasoning}")
    print(f"   Answer: {result.answer}")
    print("-" * 50)

print("\n" + "=" * 70)

## Create Training and Test Data

Let's create a small hand-written training set (11 examples) and a held-out test set (3 examples) for optimization.

In [None]:
print_step("Creating Training Data", "Generating math problems with solutions")

def create_math_training_data():
    """Create training data for math reasoning optimization."""
    
    training_examples = [
        # Basic arithmetic
        dspy.Example(
            problem="What is 15 + 27?",
            reasoning="To add 15 + 27: 15 + 27 = 42",
            answer="42"
        ),
        dspy.Example(
            problem="Calculate 144 ÷ 12",
            reasoning="To divide 144 by 12: 144 ÷ 12 = 12",
            answer="12"
        ),
        
        # Word problems - distance/speed/time
        dspy.Example(
            problem="A car travels 45 miles per hour for 3 hours. How far does it travel?",
            reasoning="Distance = Speed × Time\nDistance = 45 mph × 3 hours = 135 miles",
            answer="135 miles"
        ),
        dspy.Example(
            problem="If a runner covers 10 miles in 80 minutes, what is their speed in miles per hour?",
            reasoning="Speed = Distance ÷ Time\nFirst convert 80 minutes to hours: 80 ÷ 60 = 4/3 ≈ 1.33 hours\nSpeed = 10 miles ÷ (4/3) hours = 7.5 mph",
            answer="7.5 mph"
        ),
        
        # Geometry
        dspy.Example(
            problem="What is the area of a circle with radius 6?",
            reasoning="Area of circle = π × r²\nArea = π × 6² = π × 36 = 36π ≈ 113.1 square units",
            answer="36π or approximately 113.1 square units"
        ),
        dspy.Example(
            problem="Find the perimeter of a rectangle with length 12 and width 8",
            reasoning="Perimeter of rectangle = 2(length + width)\nPerimeter = 2(12 + 8) = 2(20) = 40 units",
            answer="40 units"
        ),
        
        # Algebra
        dspy.Example(
            problem="Solve for x: 3x - 7 = 14",
            reasoning="3x - 7 = 14\nAdd 7 to both sides: 3x = 21\nDivide by 3: x = 7",
            answer="x = 7"
        ),
        dspy.Example(
            problem="If y = 2x + 3 and x = 4, what is y?",
            reasoning="Substitute x = 4 into y = 2x + 3\ny = 2(4) + 3 = 8 + 3 = 11",
            answer="y = 11"
        ),
        
        # Percentages and ratios
        dspy.Example(
            problem="What is 25% of 120?",
            reasoning="25% = 25/100 = 0.25\n25% of 120 = 0.25 × 120 = 30",
            answer="30"
        ),
        dspy.Example(
            problem="A shirt costs $60. If there's a 15% discount, what's the sale price?",
            reasoning="Discount = 15% of $60 = 0.15 × $60 = $9\nSale price = $60 - $9 = $51",
            answer="$51"
        ),
        
        # Multi-step problems
        dspy.Example(
            problem="A recipe calls for 2 cups of flour for 12 cookies. How much flour is needed for 30 cookies?",
            reasoning="Set up proportion: 2 cups / 12 cookies = x cups / 30 cookies\nCross multiply: 2 × 30 = 12 × x\n60 = 12x\nx = 5 cups",
            answer="5 cups"
        )
    ]
    
    # Create test examples (different from training)
    test_examples = [
        dspy.Example(
            problem="A bicycle wheel has a radius of 14 inches. What is its circumference?",
            reasoning="Circumference = 2πr = 2 × π × 14 = 28π ≈ 87.96 inches",
            answer="28π or approximately 87.96 inches"
        ),
        dspy.Example(
            problem="Solve for x: 5x + 8 = 33",
            reasoning="5x + 8 = 33\nSubtract 8: 5x = 25\nDivide by 5: x = 5",
            answer="x = 5"
        ),
        dspy.Example(
            problem="If 4 pencils cost $1.20, how much do 10 pencils cost?",
            reasoning="Cost per pencil = $1.20 ÷ 4 = $0.30\nCost of 10 pencils = 10 × $0.30 = $3.00",
            answer="$3.00"
        )
    ]
    
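    # Mark `problem` as the input field. BootstrapFewShot calls example.inputs(),
    # which raises an error if the input keys were never set.
    training_examples = [ex.with_inputs("problem") for ex in training_examples]
    test_examples = [ex.with_inputs("problem") for ex in test_examples]

    return training_examples, test_examples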

# Create training and test data
train_data, test_data = create_math_training_data()

print_result(f"Created {len(train_data)} training examples and {len(test_data)} test examples")

# Display a few training examples
print("\nSample Training Examples:")
for i, example in enumerate(train_data[:3], 1):
    print(f"\n{i}. Problem: {example.problem}")
    print(f"   Expected Reasoning: {example.reasoning}")
    print(f"   Expected Answer: {example.answer}")
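
The optimizer treats the labeled answers as ground truth, so an arithmetic slip in the training data would be learned as if it were correct. The following is a minimal sketch of re-deriving a few labels with exact rational arithmetic; the `checks` dictionary is hand-copied from the examples above and is an assumption of this illustration, not something the notebook builds automatically.

```python
from fractions import Fraction

# Re-derive a few labeled answers with exact rational arithmetic.
# Each entry pairs the labeled value with an independent computation.
checks = {
    "runner speed (mph)": (Fraction("7.5"), Fraction(10) / Fraction(80, 60)),
    "15% off $60": (Fraction(51), Fraction(60) * (1 - Fraction(15, 100))),
    "flour for 30 cookies (cups)": (Fraction(5), Fraction(2, 12) * 30),
    "10 pencils ($)": (Fraction(3), Fraction("1.20") / 4 * 10),
}

# Collect the names of any labels that disagree with the recomputed value
mismatches = [name for name, (labeled, computed) in checks.items() if labeled != computed]

for name, (labeled, computed) in checks.items():
    print(f"{name}: labeled={labeled}, computed={computed}")
print("Mismatches:", mismatches)
```

Using `Fraction` avoids the rounding that appears when 80 minutes is written as 1.33 hours.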

## Create Evaluation Metric

Let's create a metric to evaluate mathematical reasoning accuracy.

In [None]:
print_step("Creating Evaluation Metric", "Defining accuracy measurement for math problems")

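def extract_number_from_answer(answer_text):
    """Extract numerical value from answer text."""
    import math
    import re

    # Normalize case and drop thousands separators ("$20,000" -> "$20000")
    answer_text = answer_text.lower().strip()
    answer_text = re.sub(r'(?<=\d),(?=\d{3})', '', answer_text)

    # An integer or decimal with an optional leading minus sign. The minus only
    # counts when it does not follow a digit or letter, so "5-6" is not read as -6.
    num = r'((?:(?<![\w.])-)?\d+(?:\.\d+)?)'

    # Unit-specific patterns, tried in order
    number_patterns = [
        rf'x\s*=\s*{num}',   # x = 5
        rf'y\s*=\s*{num}',   # y = 11
        rf'\${num}',         # $51
        rf'{num}\s*miles',   # 135 miles
        rf'{num}\s*mph',     # 7.5 mph
        rf'{num}\s*units',   # 40 units
        rf'{num}\s*cups',    # 5 cups
        rf'{num}\s*inches',  # 87.96 inches
    ]

    for pattern in number_patterns:
        match = re.search(pattern, answer_text)
        if match:
            return float(match.group(1))

    # "28π" means 28 × π, so compare it by value rather than by its coefficient
    match = re.search(rf'{num}\s*π', answer_text)
    if match:
        return float(match.group(1)) * math.pi

    # Fall back to the first number anywhere in the text
    match = re.search(num, answer_text)
    return float(match.group(1)) if match else None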

def math_accuracy_metric(example, pred, trace=None):
    """Evaluate if the predicted answer matches the expected answer."""
    
    # Extract expected and predicted numerical values
    expected_num = extract_number_from_answer(example.answer)
    predicted_num = extract_number_from_answer(pred.answer)
    
    # If we can't extract numbers, fall back to string comparison
    if expected_num is None or predicted_num is None:
        expected_clean = example.answer.lower().strip()
        predicted_clean = pred.answer.lower().strip()
        return expected_clean == predicted_clean
    
    # Allow small numerical differences (for floating point precision)
    tolerance = 0.01
    return abs(expected_num - predicted_num) <= tolerance

def evaluate_math_solver(solver, test_examples):
    """Evaluate a math solver on test examples."""
    correct = 0
    total = len(test_examples)
    results = []
    
    for example in test_examples:
        try:
            prediction = solver(problem=example.problem)
            is_correct = math_accuracy_metric(example, prediction)
            
            if is_correct:
                correct += 1
            
            results.append({
                'problem': example.problem,
                'expected': example.answer,
                'predicted': prediction.answer,
                'correct': is_correct,
                'reasoning': prediction.reasoning
            })
            
        except Exception as e:
            print(f"Error evaluating problem '{example.problem}': {e}")
            results.append({
                'problem': example.problem,
                'expected': example.answer,
                'predicted': 'ERROR',
                'correct': False,
                'reasoning': f'Error: {e}'
            })
    
    accuracy = correct / total if total > 0 else 0
    return accuracy, results

# Test our evaluation metric
print("Testing evaluation metric:")
test_cases = [
    ("x = 5", "x = 5", True),
    ("$51", "$51.00", True),
    ("135 miles", "135 miles", True),
    ("42", "43", False),
    ("7.5 mph", "7.50 mph", True)
]

for expected, predicted, should_match in test_cases:
    # Create mock objects
    mock_example = type('obj', (object,), {'answer': expected})
    mock_pred = type('obj', (object,), {'answer': predicted})
    
    result = math_accuracy_metric(mock_example, mock_pred)
    status = "✓" if result == should_match else "✗"
    print(f"{status} Expected: '{expected}', Predicted: '{predicted}', Match: {result}")

print_result("Evaluation metric created and tested!")
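
The metric above accepts answers within an absolute tolerance of 0.01. That is strict for large values (a dollar amount in the thousands rounded differently) and loose for small ones. A minimal sketch of combining a relative and an absolute tolerance with the standard library's `math.isclose` follows; the function name `numbers_match` and the default tolerances are assumptions chosen for illustration.

```python
import math

def numbers_match(expected: float, predicted: float,
                  rel_tol: float = 1e-3, abs_tol: float = 0.01) -> bool:
    """Compare two values with a relative and an absolute tolerance.

    math.isclose passes when the difference is within either tolerance.
    """
    return math.isclose(expected, predicted, rel_tol=rel_tol, abs_tol=abs_tol)

# Differ by 0.0125: fails a 0.01 absolute check but passes the relative one
print(numbers_match(12282.50, 12282.5125))

# Differ by 0.008 (a 16% error): still passes on the absolute tolerance,
# which shows why abs_tol should be chosen with the problem scale in mind
print(numbers_match(0.05, 0.058))

# Differ by 1: fails both tolerances
print(numbers_match(42, 43))
```

If this replaced the final comparison in `math_accuracy_metric`, the tolerances would need tuning against the kinds of answers in the dataset.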

## Baseline Evaluation

Let's evaluate our basic solver before optimization.

In [None]:
print_step("Baseline Evaluation", "Testing unoptimized solver performance")

# Evaluate basic solver
baseline_accuracy, baseline_results = evaluate_math_solver(basic_solver, test_data)

print(f"Baseline Accuracy: {baseline_accuracy:.3f} ({sum(r['correct'] for r in baseline_results)}/{len(baseline_results)})")

print("\nDetailed Results:")
print("=" * 80)

for i, result in enumerate(baseline_results, 1):
    status = "✓" if result['correct'] else "✗"
    print(f"\n{i}. {status} Problem: {result['problem']}")
    print(f"   Expected: {result['expected']}")
    print(f"   Predicted: {result['predicted']}")
    if not result['correct']:
        print(f"   Reasoning: {result['reasoning'][:100]}...")
    print("-" * 60)

print("\n" + "=" * 80)

## Optimize with DSPy

Now let's use DSPy optimizers to improve our math solver.

In [None]:
print_step("DSPy Optimization", "Using BootstrapFewShot to optimize the math solver")

# Create an optimized version of our math solver
class OptimizedMathSolver(dspy.Module):
    """Optimized mathematical problem solver."""
    
    def __init__(self):
        super().__init__()
        self.solve = dspy.ChainOfThought(MathReasoning)
    
    def forward(self, problem):
        result = self.solve(problem=problem)
        return dspy.Prediction(
            problem=problem,
            reasoning=result.reasoning,
            answer=result.answer
        )

# Create optimizer
from dspy.teleprompt import BootstrapFewShot

# Configure the optimizer
config = dict(max_bootstrapped_demos=8, max_labeled_demos=4)

# Create and run optimizer
print("Setting up optimizer...")
teleprompter = BootstrapFewShot(metric=math_accuracy_metric, **config)

print("Optimizing math solver (this may take a few minutes)...")
optimized_solver = teleprompter.compile(OptimizedMathSolver(), trainset=train_data)

print_result("Optimization completed!")

# Show what the optimizer learned
print("\nOptimizer learned the following examples:")
# predictors() lists every Predict module inside the program, so this works
# whether or not ChainOfThought exposes `demos` directly in your DSPy version
demos = [demo for predictor in optimized_solver.predictors() for demo in predictor.demos]
if demos:
    for i, demo in enumerate(demos[:3], 1):
        print(f"\nDemo {i}:")
        print(f"Problem: {demo.get('problem', '')}")
        print(f"Reasoning: {str(demo.get('reasoning', ''))[:100]}...")
        print(f"Answer: {demo.get('answer', '')}")
else:
    print("No demos found in optimized solver.")
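
The idea behind bootstrapping few-shot demonstrations can be sketched without DSPy: run a teacher over the training examples and keep only the outputs that pass the metric, up to a limit comparable to `max_bootstrapped_demos`. The code below is a conceptual illustration under that assumption, not DSPy's implementation; `bootstrap_demos`, `toy_teacher`, and `exact_match` are hypothetical names, and the teacher is a toy function rather than a language model.

```python
from typing import Callable

def bootstrap_demos(trainset: list[dict], teacher: Callable[[str], str],
                    metric: Callable[[dict, str], bool], max_demos: int) -> list[dict]:
    """Keep teacher outputs that pass the metric, up to max_demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        predicted = teacher(example["problem"])
        if metric(example, predicted):
            demos.append({"problem": example["problem"], "answer": predicted})
    return demos

# Toy stand-in for a model: it always adds, whatever the operator is
def toy_teacher(problem: str) -> str:
    a, _op, b = problem.split()
    return str(int(a) + int(b))

def exact_match(example: dict, predicted: str) -> bool:
    return example["answer"] == predicted

trainset = [
    {"problem": "15 + 27", "answer": "42"},
    {"problem": "9 * 3", "answer": "27"},  # teacher answers 12, so this is rejected
    {"problem": "2 + 2", "answer": "4"},
]

demos = bootstrap_demos(trainset, toy_teacher, exact_match, max_demos=8)
print(demos)
```

Only the examples the teacher solves correctly become demonstrations, which is why the quality of the metric matters as much as the quality of the training data.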

## Evaluate Optimized Solver

Let's compare the optimized solver against the baseline on the same test set.

In [None]:
print_step("Optimized Evaluation", "Testing optimized solver performance")

# Evaluate optimized solver
optimized_accuracy, optimized_results = evaluate_math_solver(optimized_solver, test_data)

print(f"Optimized Accuracy: {optimized_accuracy:.3f} ({sum(r['correct'] for r in optimized_results)}/{len(optimized_results)})")
print(f"Baseline Accuracy:  {baseline_accuracy:.3f} ({sum(r['correct'] for r in baseline_results)}/{len(baseline_results)})")

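improvement = optimized_accuracy - baseline_accuracy
if baseline_accuracy > 0:
    # Absolute change in accuracy, followed by the change relative to the baseline
    print(f"Improvement: {improvement:+.3f} ({improvement/baseline_accuracy*100:+.1f}% relative)")
else:
    print(f"Improvement: {improvement:+.3f} (relative change undefined: baseline accuracy is 0)")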

print("\nComparison Results:")
print("=" * 90)
print(f"{'Problem':<50} {'Baseline':<15} {'Optimized':<15} {'Change'}")
print("-" * 90)

for baseline, optimized in zip(baseline_results, optimized_results):
    baseline_status = "✓" if baseline['correct'] else "✗"
    optimized_status = "✓" if optimized['correct'] else "✗"
    if optimized['correct'] and not baseline['correct']:
        change = "Improved"
    elif baseline['correct'] and not optimized['correct']:
        change = "Regressed"
    else:
        change = "-"

    problem_short = baseline['problem'][:47] + "..." if len(baseline['problem']) > 50 else baseline['problem']
    print(f"{problem_short:<50} {baseline_status:<15} {optimized_status:<15} {change}")

print("\n" + "=" * 90)

# Show detailed comparison for any improved cases
improved_cases = [(b, o) for b, o in zip(baseline_results, optimized_results) 
                  if o['correct'] and not b['correct']]

if improved_cases:
    print(f"\nDetailed Analysis of {len(improved_cases)} Improved Cases:")
    print("=" * 80)
    
    for i, (baseline, optimized) in enumerate(improved_cases, 1):
        print(f"\n{i}. Problem: {baseline['problem']}")
        print(f"   Expected: {baseline['expected']}")
        print(f"   Baseline Answer: {baseline['predicted']}")
        print(f"   Optimized Answer: {optimized['predicted']}")
        print(f"   Optimized Reasoning: {optimized['reasoning'][:150]}...")
        print("-" * 60)
else:
    print("\nNo improved cases found.")
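
With only three test examples, an accuracy figure carries a lot of uncertainty. One way to make that visible is a Wilson score interval for the proportion of correct answers. The sketch below uses the standard formula; the function name `wilson_interval` and the choice of `z = 1.96` (roughly a 95% interval) are assumptions of this illustration.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# Every possible outcome on a 3-example test set
for correct in range(4):
    low, high = wilson_interval(correct, 3)
    print(f"{correct}/3 correct -> plausible accuracy range {low:.2f} to {high:.2f}")
```

The intervals overlap heavily for neighboring scores, so a one-problem difference between the baseline and the optimized solver is weak evidence on a test set this small.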

## Test on More Complex Problems

Let's test both solvers on more challenging mathematical problems.

In [None]:
print_step("Advanced Problem Testing", "Testing on complex mathematical problems")

# Create more challenging problems
advanced_problems = [
    "A car's value depreciates by 15% each year. If it's worth $20,000 today, what will it be worth in 3 years?",
    "The sum of three consecutive integers is 72. What are the three integers?",
    "A rectangular garden is 4 meters longer than it is wide. If the area is 60 square meters, what are the dimensions?",
    "If compound interest is calculated annually at 5% rate, how much will $1000 grow to in 4 years?",
    "A triangle has sides of length 3, 4, and 5. What is its area using Heron's formula?"
]

print("Solving Advanced Problems:")
print("=" * 80)

for i, problem in enumerate(advanced_problems, 1):
    print(f"\n{i}. Problem: {problem}")
    print("-" * 60)
    
    # Solve with baseline
    try:
        baseline_result = basic_solver(problem=problem)
        print(f"Baseline Answer: {baseline_result.answer}")
        print(f"Baseline Reasoning: {baseline_result.reasoning[:100]}...")
    except Exception as e:
        print(f"Baseline Error: {e}")
    
    print()
    
    # Solve with optimized
    try:
        optimized_result = optimized_solver(problem=problem)
        print(f"Optimized Answer: {optimized_result.answer}")
        print(f"Optimized Reasoning: {optimized_result.reasoning[:100]}...")
    except Exception as e:
        print(f"Optimized Error: {e}")
    
    print("\n" + "=" * 60)
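
The advanced problems above have no labeled answers, so the printed outputs cannot be scored automatically. A minimal sketch of computing reference values with plain Python arithmetic is shown below, so each solver's answer can be checked by eye; the variable names are chosen for this illustration.

```python
import math

# 1. 15% depreciation per year for 3 years
car_value = 20_000 * 0.85 ** 3

# 2. Three consecutive integers n-1, n, n+1 sum to 3n = 72
middle = 72 // 3
integers = (middle - 1, middle, middle + 1)

# 3. width * (width + 4) = 60  ->  width^2 + 4*width - 60 = 0 (positive root)
width = (-4 + math.sqrt(4 ** 2 + 4 * 60)) / 2
length = width + 4

# 4. 5% annual compound interest for 4 years
balance = 1000 * 1.05 ** 4

# 5. Heron's formula for sides 3, 4, 5
a, b, c = 3, 4, 5
s = (a + b + c) / 2
area = math.sqrt(s * (s - a) * (s - b) * (s - c))

print(f"1. ${car_value:,.2f}")
print(f"2. {integers}")
print(f"3. {width:g} m by {length:g} m")
print(f"4. ${balance:,.2f}")
print(f"5. {area:g} square units")
```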

## Try Different Optimizers

Let's experiment with different DSPy optimizers.

In [None]:
print_step("Different Optimizers", "Comparing various DSPy optimization strategies")

# Try LabeledFewShot optimizer
from dspy.teleprompt import LabeledFewShot

print("Testing LabeledFewShot optimizer...")
labeled_teleprompter = LabeledFewShot(k=4)  # Use 4 examples
labeled_solver = labeled_teleprompter.compile(OptimizedMathSolver(), trainset=train_data)

# Evaluate labeled few-shot solver
labeled_accuracy, labeled_results = evaluate_math_solver(labeled_solver, test_data)

print("\nOptimizer Comparison:")
print("=" * 50)
print(f"Baseline (no optimization): {baseline_accuracy:.3f}")
print(f"BootstrapFewShot:          {optimized_accuracy:.3f}")
print(f"LabeledFewShot:            {labeled_accuracy:.3f}")

# Find the best optimizer
optimizers = [
    ("Baseline", baseline_accuracy),
    ("BootstrapFewShot", optimized_accuracy),
    ("LabeledFewShot", labeled_accuracy)
]

best_optimizer, best_accuracy = max(optimizers, key=lambda x: x[1])
print(f"\nBest Optimizer: {best_optimizer} (Accuracy: {best_accuracy:.3f})")

# Show reasoning differences
print("\nReasoning Quality Comparison:")
print("=" * 70)

sample_problem = test_data[0].problem
print(f"Problem: {sample_problem}")
print(f"Expected: {test_data[0].answer}")
print()

# Compare reasoning from different approaches
baseline_demo = basic_solver(problem=sample_problem)
print(f"Baseline Reasoning:\n{baseline_demo.reasoning}\nAnswer: {baseline_demo.answer}")
print()

optimized_demo = optimized_solver(problem=sample_problem)
print(f"BootstrapFewShot Reasoning:\n{optimized_demo.reasoning}\nAnswer: {optimized_demo.answer}")
print()

labeled_demo = labeled_solver(problem=sample_problem)
print(f"LabeledFewShot Reasoning:\n{labeled_demo.reasoning}\nAnswer: {labeled_demo.answer}")

## Error Analysis

Let's analyze common error patterns and potential improvements.

In [None]:
print_step("Error Analysis", "Analyzing failure modes and potential improvements")

def analyze_errors(results, solver_name):
    """Analyze error patterns in solver results."""
    errors = [r for r in results if not r['correct']]
    
    if not errors:
        print(f"\n{solver_name}: No errors to analyze! 🎉")
        return
    
    print(f"\n{solver_name} Error Analysis:")
    print(f"Total Errors: {len(errors)}/{len(results)} ({len(errors)/len(results)*100:.1f}%)")
    print("-" * 60)
    
    for i, error in enumerate(errors, 1):
        print(f"\nError {i}:")
        print(f"Problem: {error['problem']}")
        print(f"Expected: {error['expected']}")
        print(f"Predicted: {error['predicted']}")
        
        # Try to categorize the error
        expected_num = extract_number_from_answer(error['expected'])
        predicted_num = extract_number_from_answer(error['predicted'])
        
        if expected_num is not None and predicted_num is not None:
            if abs(expected_num - predicted_num) < 1:
                error_type = "Minor numerical error"
            elif predicted_num == 0:
                error_type = "Calculation failure"
            else:
                error_type = "Significant numerical error"
        else:
            error_type = "Format or parsing error"
        
        print(f"Error Type: {error_type}")
        print(f"Reasoning: {error['reasoning'][:100]}...")

# Analyze errors for each solver
analyze_errors(baseline_results, "Baseline Solver")
analyze_errors(optimized_results, "Optimized Solver")
analyze_errors(labeled_results, "LabeledFewShot Solver")

# Suggestions for improvement
print("\n" + "=" * 70)
print("Suggestions for Further Improvement:")
print("=" * 70)

suggestions = [
    "1. Add more diverse training examples covering edge cases",
    "2. Implement explicit format validation for numerical answers",
    "3. Use multiple solution approaches and select the most consistent",
    "4. Add verification steps to check answer reasonableness",
    "5. Fine-tune the language model on mathematical reasoning tasks",
    "6. Implement symbolic math computation for exact calculations",
    "7. Use ensemble methods combining multiple optimized solvers",
    "8. Add domain-specific optimizers for different math areas"
]

for suggestion in suggestions:
    print(suggestion)

print("\n" + "=" * 70)
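
Suggestion 3 (use multiple solution approaches and select the most consistent) can be sketched as a majority vote over several sampled answers. The code below assumes the numeric answers have already been extracted, for example by running a solver several times; `majority_answer` and the sample values are hypothetical.

```python
from collections import Counter

def majority_answer(candidates: list[float], ndigits: int = 2):
    """Return the most common answer (after rounding) and its vote share."""
    if not candidates:
        return None, 0.0
    votes = Counter(round(value, ndigits) for value in candidates)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(candidates)

# Five hypothetical sampled answers to "20% off an $80 item"
samples = [64.0, 64.0, 16.0, 64.004, 60.0]
answer, share = majority_answer(samples)
print(answer, share)
```

A low vote share is itself a useful signal: it flags problems where the solver is inconsistent and a verification step may be worth the extra calls.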

## Summary

In this notebook, we demonstrated how to optimize mathematical reasoning using DSPy:

### Key Components:

1. **Basic Math Solver** - Unoptimized baseline using Chain of Thought reasoning
2. **Training Data Creation** - Hand-written examples across several math domains
3. **Evaluation Metrics** - Numeric answer extraction compared within an absolute tolerance, with a string-match fallback
4. **DSPy Optimization** - BootstrapFewShot and LabeledFewShot optimizers
5. **Performance Comparison** - Detailed analysis of optimization benefits
6. **Error Analysis** - Understanding failure modes and improvement opportunities

### Optimization Strategies Tested:

- **BootstrapFewShot** - Runs the program on training examples and keeps the traces that pass the metric as few-shot demonstrations
- **LabeledFewShot** - Uses provided examples as demonstrations
- **Baseline Comparison** - Quantifies optimization benefits

### Math Problem Types Covered:

- **Basic Arithmetic** - Addition and division
- **Word Problems** - Distance/speed/time, ratios, proportions
- **Geometry** - Area, perimeter, circumference calculations
- **Algebra** - Solving equations, substitution
- **Percentages** - Percent-of and discount calculations
- **Multi-step Problems** - Complex reasoning chains

### Key Findings:

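- **Optimization Impact** - Few-shot demonstrations selected by DSPy optimizers can improve answer accuracy over the zero-shot baseline; with only 3 test examples, each problem shifts accuracy by a third, so small differences are not conclusive
- **Reasoning Quality** - Worked demonstrations tend to steer the model toward the same step-by-step solution format used in the training examples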
- **Error Patterns** - Common issues include numerical precision and format consistency
- **Training Data Quality** - High-quality examples with clear reasoning steps are crucial

### Practical Applications:

- **Educational Tools** - Automated homework help and tutoring systems
- **Problem Solving** - Engineering and scientific calculation assistance
- **Financial Analysis** - Automated calculation and verification systems
- **Quality Assurance** - Verification of mathematical computations

### Next Steps:

- **Advanced Optimizers** - Experiment with MIPRO, ensemble methods
- **Domain Specialization** - Create optimizers for specific math areas
- **Symbolic Integration** - Combine with symbolic math libraries
- **Continuous Learning** - Implement feedback loops for ongoing improvement
- **Production Deployment** - Scale optimized solvers for real-world applications

This demonstration shows how DSPy's optimizers can turn a basic chain-of-thought math solver into a few-shot solver whose accuracy is measured, compared against a baseline, and improved systematically.