<a href="https://colab.research.google.com/github/quantexolution/aimo/blob/main/AI_Mathematical_Olympiad_Progress_Prize_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

ai_mathematical_olympiad_progress_prize_3_path = kagglehub.competition_download('ai-mathematical-olympiad-progress-prize-3')

print('Data source import complete.')


> # `AI Mathematical Olympiad - Progress Prize 3`

## Competition Overview

Welcome to AIMO3! This competition challenges us to build AI systems capable of solving **International Mathematical Olympiad (IMO) level problems**. With a prize pool of **$2.2M+**, this is one of the most exciting AI reasoning competitions on Kaggle.

### Key Competition Facts:
- **110 problems total**: 50 public, 50 private, 10 reference
- **Answer format**: Integer between 0 and 99,999 (inclusive)
- **Hardware**: CPU ≤9h or GPU ≤5h runtime, no internet
- **Evaluation**: Penalized accuracy (both runs must be correct for full score)
- **Prize**: 1st place gets $262,144 + potential $1.5M+ for solving ≥47/50
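
The "penalized accuracy" rule above can be read as a per-problem score over the two private runs. The sketch below assumes full credit when both runs are correct, partial credit when exactly one is, and zero otherwise. The half-credit value and the function name `penalized_score` are assumptions for illustration, so check the competition's evaluation page for the exact rule.

```python
def penalized_score(run_a, run_b, truth, partial_credit=0.5):
    """Score two private runs against the ground truth (illustrative only).

    partial_credit is an ASSUMED value for problems solved in only one run.
    """
    total = 0.0
    for a, b, t in zip(run_a, run_b, truth):
        hits = (a == t) + (b == t)  # number of runs that got this problem right
        if hits == 2:
            total += 1.0
        elif hits == 1:
            total += partial_credit
    return total


truth = [42, 7, 100]
print(penalized_score([42, 7, 1], [42, 8, 2], truth))  # 1.5
```

Under this reading, an answer that flips between runs is worth less than a stable one, which is why the rest of the notebook stresses consistency.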

### What Makes This Competition Special:
1. **Zero train-test contamination** - All problems are original
2. **IMO-level difficulty** - Algebra, Combinatorics, Geometry, Number Theory
3. **H100 GPUs available** - Industry-leading compute resources
4. **Two-run scoring** - Both private runs must be correct for full points


# Table of Contents

1. **[Setup & Configuration](#setup)** - Import libraries, set paths
2. **[Understanding the Data](#data)** - Explore problem structure and LaTeX format
3. **[Reference Problems Analysis](#reference)** - Deep dive into 10 sample problems
4. **[Solution Strategy](#strategy)** - Grandmaster approach to mathematical reasoning
5. **[Model Architecture](#model)** - LLM setup with reasoning chains
6. **[Inference Pipeline](#inference)** - Production-ready prediction system
7. **[Submission](#submission)** - Final submission code
8. **[Tips & Tricks](#tips)** - Competition insights from top competitors

<a id="setup"></a>
# 1. Setup & Configuration

First, let's import all necessary libraries and configure our environment. We'll use a modular approach that works both locally and on Kaggle.

In [None]:
# ============================================================================
# IMPORTS & CONFIGURATION
# ============================================================================
import os
import re
import sys
import time
import warnings
from pathlib import Path
from typing import Optional, Union

import numpy as np
import pandas as pd
import polars as pl

warnings.filterwarnings('ignore')

# ============================================================================
# PATH CONFIGURATION - Works on both Kaggle and Local
# ============================================================================
IS_KAGGLE = os.path.exists('/kaggle/input')

if IS_KAGGLE:
    INPUT_PATH = Path('/kaggle/input/ai-mathematical-olympiad-progress-prize-3')
    OUTPUT_PATH = Path('/kaggle/working')
else:
    # Local development paths - adjust as needed
    INPUT_PATH = Path('./input')
    OUTPUT_PATH = Path('./output')
    OUTPUT_PATH.mkdir(exist_ok=True)

# List available files
print("ENVIRONMENT CHECK")
print("=" * 60)
print(f"Running on: {'Kaggle' if IS_KAGGLE else 'Local'}")
print(f"Input path: {INPUT_PATH}")
print(f"Output path: {OUTPUT_PATH}")
print()

if IS_KAGGLE:
    for dirname, _, filenames in os.walk('/kaggle/input'):
        for filename in filenames:
            print(f"{os.path.join(dirname, filename)}")

<a id="data"></a>
# 2. Understanding the Data

Let's explore the competition data structure. The problems are formatted in **LaTeX** with specific mathematical notation conventions.

### Data Files:
| File | Description |
|------|-------------|
| `reference.csv` | 10 practice problems with solutions |
| `test.csv` | 50 problems (placeholder in public, real in scoring) |
| `sample_submission.csv` | Submission format template |

### Answer Format:
- All answers are integers in range **[0, 99999]**
- Modular arithmetic is **explicit** (e.g., "remainder when divided by 1000")
- No implicit modulo like in AIMO1/AIMO2!
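
As a small illustration of the format rules above, the sketch below checks that a value is a valid answer and reduces a large result only when the problem states a modulus. The helper names `is_valid_answer` and `finalize` are hypothetical and not part of any competition API.

```python
from typing import Optional

ANSWER_MIN, ANSWER_MAX = 0, 99_999


def is_valid_answer(value) -> bool:
    """True if value is a plain int inside the allowed answer range."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return ANSWER_MIN <= value <= ANSWER_MAX


def finalize(value: int, modulus: Optional[int] = None) -> int:
    """Reduce by the modulus the problem states explicitly, if any."""
    return value % modulus if modulus else value


# "Find the remainder when 2^100 is divided by 1000."
answer = finalize(2 ** 100, modulus=1000)
print(answer, is_valid_answer(answer))  # 376 True
print(is_valid_answer(2 ** 100))        # False: no modulus was applied
```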

In [None]:
# ============================================================================
# LOAD AND EXPLORE DATA
# ============================================================================

# Load reference problems (10 practice problems with known answers)
if IS_KAGGLE:
    reference_df = pd.read_csv(INPUT_PATH / 'reference.csv')
    test_df = pd.read_csv(INPUT_PATH / 'test.csv')
    sample_submission = pd.read_csv(INPUT_PATH / 'sample_submission.csv')
else:
    # Create sample data for local testing
    reference_df = pd.DataFrame({
        'id': ['ref_1', 'ref_2', 'ref_3'],
        'problem': [
            r'Find the smallest positive integer $n$ such that $n^2 + 1$ is divisible by $5$.',
            r'Let $a, b, c$ be positive real numbers with $abc = 1$. Find the minimum value of $a + b + c$.',
            r'How many 4-digit palindromes are divisible by 11?'
        ],
        # Every 4-digit palindrome abba equals 11 * (91a + 10b), so all 90 are divisible by 11
        'answer': [2, 3, 90]
    })
    test_df = reference_df.copy()
    sample_submission = pd.DataFrame({'id': reference_df['id'], 'answer': 0})

print("DATA OVERVIEW")
print("=" * 60)
print(f"\n Reference Problems: {len(reference_df)}")
print(f" Test Problems: {len(test_df)}")
print(f" Sample Submission: {len(sample_submission)}")
print("\n" + "=" * 60)
print("REFERENCE DATA STRUCTURE")
print("=" * 60)
print(reference_df.info())
print("\n" + "=" * 60)
print("FIRST 3 REFERENCE PROBLEMS")
print("=" * 60)
for idx, row in reference_df.head(3).iterrows():
    print(f"\n Problem {idx + 1} (ID: {row['id']})")
    print("-" * 50)
    print(f"Problem: {row['problem'][:200]}...")
    if 'answer' in reference_df.columns:
        print(f"Answer: {row['answer']}")

<a id="reference"></a>
# 3. Reference Problems Analysis

The competition provides **10 reference problems** with full solutions. These are crucial for understanding:
1. **Problem difficulty range** (National Olympiad → IMO level)
2. **LaTeX notation conventions**
3. **Answer format expectations**
4. **Mathematical domains**: Algebra, Combinatorics, Geometry, Number Theory

### Key Insights from Reference Problems:

| Domain | Typical Techniques |
|--------|-------------------|
| **Algebra** | Polynomial manipulation, inequalities, functional equations |
| **Combinatorics** | Counting, pigeonhole, graph theory, recursion |
| **Geometry** | Coordinate geometry, trigonometry, circle theorems |
| **Number Theory** | Modular arithmetic, divisibility, Diophantine equations |
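
One rough way to turn the table above into a problem tagger is to count whole-word keyword hits per domain and pick the best-scoring one. The keyword lists and the `tag_domain` name below are assumptions chosen for illustration; they are not derived from the reference problems.

```python
import re
from collections import Counter

# Hypothetical keyword lists, loosely following the table above.
DOMAIN_KEYWORDS = {
    "algebra": ["polynomial", "inequality", "equation", "function", "root"],
    "combinatorics": ["ways", "arrange", "subset", "graph", "permutation"],
    "geometry": ["triangle", "circle", "angle", "perpendicular", "tangent"],
    "number_theory": ["prime", "divisible", "remainder", "gcd", "modulo"],
}


def tag_domain(problem: str) -> str:
    """Return the domain with the most whole-word keyword hits."""
    words = Counter(re.findall(r"[a-z]+", problem.lower()))
    scores = {
        domain: sum(words[kw] for kw in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(tag_domain("How many ways can we arrange 5 books?"))       # combinatorics
print(tag_domain("In triangle $ABC$, the angle at $A$ is ..."))  # geometry
print(tag_domain("Compute $1 + 1$."))                            # unknown
```

Counting whole words avoids false hits such as "ways" inside "always", at the cost of missing plurals like "triangles" unless they are listed explicitly.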

In [None]:
# ============================================================================
# LATEX PARSING UTILITIES
# ============================================================================

def parse_latex_problem(problem_text: str) -> dict:
    """
    Parse a LaTeX math problem and extract key components.

    Returns:
        dict with parsed information about the problem
    """
    info = {
        'raw_text': problem_text,
        'has_modulo': False,
        'modulo_value': None,
        'problem_type': 'unknown',
        'math_expressions': [],
        'length': len(problem_text)
    }

    # Check for modulo/remainder questions
    modulo_patterns = [
        r'remainder when.*divided by\s*\$?(\d+)\$?',
        r'modulo\s*\$?(\d+)\$?',
        r'mod\s*\$?(\d+)\$?'
    ]

    for pattern in modulo_patterns:
        match = re.search(pattern, problem_text, re.IGNORECASE)
        if match:
            info['has_modulo'] = True
            info['modulo_value'] = int(match.group(1))
            break

    # Detect problem type based on keywords
    type_keywords = {
        'geometry': ['triangle', 'circle', 'angle', 'polygon', 'perpendicular', 'parallel'],
        'number_theory': ['divisible', 'prime', 'gcd', 'lcm', 'integer', 'modulo', 'remainder'],
        'algebra': ['equation', 'polynomial', 'root', 'sum', 'product', 'function'],
        'combinatorics': ['count', 'ways', 'permutation', 'combination', 'arrange', 'subset']
    }

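    # Match keywords only at the start of a word, so 'sum' does not fire on
    # 'assume' and 'ways' does not fire on 'always'. First matching type wins.
    problem_lower = problem_text.lower()
    for ptype, keywords in type_keywords.items():
        if any(re.search(rf'\b{kw}', problem_lower) for kw in keywords):
            info['problem_type'] = ptype
            break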

    # Extract math expressions
    math_pattern = r'\$([^$]+)\$'
    info['math_expressions'] = re.findall(math_pattern, problem_text)

    return info

# Analyze reference problems
print("REFERENCE PROBLEMS ANALYSIS")
print("=" * 60)

for idx, row in reference_df.iterrows():
    parsed = parse_latex_problem(row['problem'])
    print(f"\n Problem {idx + 1}")
    print(f"   Type: {parsed['problem_type'].upper()}")
    print(f"   Has Modulo: {parsed['has_modulo']}", end="")
    if parsed['has_modulo']:
        print(f" (mod {parsed['modulo_value']})")
    else:
        print()
    print(f"   Length: {parsed['length']} chars")
    print(f"   Math expressions: {len(parsed['math_expressions'])}")

<a id="strategy"></a>
# 4. Solution Strategy

## Grandmaster Approach to Mathematical Olympiad Problems

### The Challenge:
- IMO-level problems require **deep mathematical reasoning**
- Simple pattern matching won't work - problems are **original**
- Need **reliable** answers (both runs must match for full score)

### Winning Strategies from AIMO1 & AIMO2:

| Winner | Key Technique |
|--------|---------------|
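| **AIMO1 (Numina)** | Fine-tuned DeepSeekMath-7B + tool-integrated reasoning (code execution) with self-consistency voting |
| **AIMO2 (NemoSkills)** | Fine-tuned Qwen2.5-14B + tool-integrated reasoning with optimized inference |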

### Our Multi-Stage Approach:

```
┌─────────────────────────────────────────────────────────────┐
│                    PROBLEM INPUT                            │
│                  (LaTeX formatted)                          │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 1: PROBLEM ANALYSIS                      │
│  • Parse LaTeX notation                                     │
│  • Identify problem type (algebra/geometry/etc)             │
│  • Extract key constraints and values                       │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│            STAGE 2: REASONING ENGINE                        │
│  • Chain-of-Thought prompting                               │
│  • Multiple solution attempts                               │
│  • Self-verification steps                                  │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│            STAGE 3: ANSWER EXTRACTION                       │
│  • Parse numerical answer                                   │
│  • Apply modular arithmetic if needed                       │
│  • Validate answer range [0, 99999]                         │
└──────────────────────┬──────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              STAGE 4: CONSISTENCY CHECK                     │
│  • Multiple runs for verification                           │
│  • Majority voting if resources allow                       │
│  • Final answer selection                                   │
└─────────────────────────────────────────────────────────────┘
```
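
Stage 4 of the diagram can be sketched as a plain majority vote over several sampled answers. This is a minimal version under stated assumptions: invalid or missing answers are dropped, ties go to the answer seen first, and the name `majority_vote` is illustrative rather than taken from any library.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[int]], default: int = 0) -> int:
    """Pick the most frequent valid answer; ties go to the earliest one seen."""
    valid = [a for a in answers if a is not None and 0 <= a <= 99_999]
    if not valid:
        return default
    counts = Counter(valid)
    top = max(counts.values())
    # Iterating over `valid` preserves sampling order for the tie-break.
    return next(a for a in valid if counts[a] == top)


print(majority_vote([42, 17, 42, None, 100_000]))  # 42
print(majority_vote([7, 9]))                       # 7 (tie, first seen wins)
print(majority_vote([None, None]))                 # 0
```

A deterministic tie-break matters here: a random one could make the two private runs disagree even when the sampled answers are identical.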

<a id="model"></a>
# 5. Model Architecture

## LLM-Based Mathematical Reasoning

For this competition, we'll implement a **flexible model class** that supports multiple approaches:

### Approach Options:

1. **DeepSeek-Math** (Recommended for high scores)
   - Specifically trained for mathematical reasoning
   - Strong performance on competition math
   - Available as open-weight model

2. **Qwen2-Math**
   - Excellent math capabilities
   - Good LaTeX understanding
   - Multiple sizes available

3. **Baseline (Symbolic)**
   - Pattern matching + heuristics
   - Fast but limited accuracy
   - Good for testing pipeline

### Key Implementation Features:
- **Lazy loading**: Model loads on first inference
- **Error handling**: Graceful degradation to the baseline solver or a default answer
- **Answer validation**: Ensures output in valid range
- **Chain-of-Thought**: Step-by-step reasoning

In [None]:
# ============================================================================
# MATHEMATICAL REASONING PROMPTS
# ============================================================================

# System prompt for mathematical reasoning
MATH_SYSTEM_PROMPT = """You are an expert mathematician specializing in International Mathematical Olympiad (IMO) problems.

Your task is to solve mathematical problems step by step with rigorous reasoning.

IMPORTANT RULES:
1. Show your complete reasoning process
2. Check your work before giving the final answer
3. The final answer must be a non-negative integer between 0 and 99999
4. If the problem asks for a remainder, compute the modular arithmetic correctly
5. Always end your response with: FINAL ANSWER: [your integer answer]

For geometry problems: Use coordinate geometry or trigonometry when helpful.
For number theory: Check small cases, look for patterns, use modular arithmetic.
For combinatorics: Use counting principles carefully, verify with small cases.
For algebra: Simplify systematically, check boundary conditions."""

# Chain-of-thought prompt template
COT_PROMPT_TEMPLATE = """Solve the following mathematical olympiad problem. Show your complete reasoning.

PROBLEM:
{problem}

SOLUTION:
Let me solve this step by step.

Step 1: Understand the problem
"""

# Answer extraction prompt
ANSWER_EXTRACTION_PROMPT = """Based on your solution, extract ONLY the final numerical answer.
The answer must be a non-negative integer between 0 and 99999 (inclusive).

Your solution concluded with this reasoning:
{reasoning}

The FINAL ANSWER is: """

print("Prompts configured successfully!")
print(f"   System prompt length: {len(MATH_SYSTEM_PROMPT)} chars")
print(f"   CoT template length: {len(COT_PROMPT_TEMPLATE)} chars")

In [None]:
# ============================================================================
# ANSWER EXTRACTION UTILITIES
# ============================================================================

def extract_numerical_answer(text: str) -> Optional[int]:
    """
    Extract the final numerical answer from model output.

    Handles various formats:
    - "FINAL ANSWER: 42"
    - "The answer is 42"
    - "Therefore, the answer is 42."
    - Just "42" at the end

    Returns:
        Integer answer or None if not found
    """
    if not text:
        return None

    # Clean the text
    text = text.strip()

    def last_valid(patterns):
        """Return the in-range match that starts latest in the text, if any."""
        hits = [
            (m.start(), int(m.group(1)))
            for p in patterns
            for m in re.finditer(p, text, re.IGNORECASE)
        ]
        in_range = [(pos, val) for pos, val in hits if 0 <= val <= 99999]
        return max(in_range)[1] if in_range else None

    # Pattern groups in priority order. Within a group the LAST in-range
    # match wins, since models often revise an earlier answer later on.
    pattern_groups = [
        # Priority 1: explicit "FINAL ANSWER:" marker (case-insensitive)
        [r'final\s*answer\s*[:=]\s*(\d+)'],
        # Priority 2: "answer is X" style phrases
        [
            r'(?:the\s+)?answer\s+is\s+(\d+)',
            r'(?:the\s+)?answer\s*[:=]\s*(\d+)',
            r'therefore[,\s]+(\d+)',
            r'thus[,\s]+(\d+)',
            r'hence[,\s]+(\d+)',
            r'=\s*(\d+)\s*$',  # Ends with = number
        ],
        # Priority 3: boxed answers (common in LaTeX)
        [r'\\boxed\{(\d+)\}', r'\\fbox\{(\d+)\}'],
        # Priority 4: any standalone number in the text (fallback)
        [r'\b(\d+)\b'],
    ]

    for patterns in pattern_groups:
        answer = last_valid(patterns)
        if answer is not None:
            return answer

    return None


def validate_answer(answer: Optional[int], default: int = 0) -> int:
    """
    Validate and sanitize the answer.

    Args:
        answer: The extracted answer
        default: Default value if answer is invalid

    Returns:
        Valid integer in range [0, 99999]
    """
    if answer is None:
        return default

    # Ensure integer
    answer = int(answer)

    # Force into the valid range: negatives clamp to 0,
    # values above 99999 keep only their last 5 digits
    if answer < 0:
        return 0
    if answer > 99999:
        return answer % 100000  # Take last 5 digits as fallback

    return answer


# Test the extraction
test_cases = [
    "After careful calculation, FINAL ANSWER: 42",
    "Therefore, the answer is 123.",
    "\\boxed{999}",
    "The result equals 54321 after simplification.",
    "No answer here",
]

print("ANSWER EXTRACTION TESTS")
print("=" * 60)
for test in test_cases:
    result = extract_numerical_answer(test)
    validated = validate_answer(result)
    print(f"Input: '{test[:50]}...' → {result} (validated: {validated})")

In [None]:
# ============================================================================
# MODEL CLASS - PRODUCTION READY
# ============================================================================

class MathOlympiadModel:
    """
    A production-ready model for solving Mathematical Olympiad problems.

    Features:
    - Lazy loading (model loads on first predict call)
    - Hugging Face transformers backend with a rule-based baseline fallback
    - Robust error handling
    - Answer validation
    """

    def __init__(self,
                 model_name: str = "baseline",
                 device: str = "auto",
                 max_new_tokens: int = 4096,
                 temperature: float = 0.1,
                 num_attempts: int = 1):
        """
        Initialize the model.

        Args:
            model_name: Name of the model to use
            device: Device to run on ('auto', 'cuda', 'cpu')
            max_new_tokens: Maximum tokens to generate
            temperature: Sampling temperature (lower = more deterministic)
            num_attempts: Number of solution attempts for voting
        """
        self.model_name = model_name
        self.device = device
        self.max_new_tokens = max_new_tokens
        self.temperature = temperature
        self.num_attempts = num_attempts

        self._model = None
        self._tokenizer = None
        self._is_loaded = False

    def load(self):
        """Load the model and tokenizer."""
        print(f"Loading model: {self.model_name}")
        start_time = time.time()

        if self.model_name == "baseline":
            # Baseline: rule-based solver (no ML model needed)
            self._model = self._create_baseline_solver()
            self._is_loaded = True

        elif "deepseek" in self.model_name.lower():
            # DeepSeek-Math model
            try:
                from transformers import AutoModelForCausalLM, AutoTokenizer
                import torch

                self._tokenizer = AutoTokenizer.from_pretrained(
                    self.model_name,
                    trust_remote_code=True
                )
                self._model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    torch_dtype=torch.bfloat16,
                    device_map=self.device,
                    trust_remote_code=True
                )
                self._is_loaded = True
            except Exception as e:
                print(f"Failed to load {self.model_name}: {e}")
                print("   Falling back to baseline solver")
                self._model = self._create_baseline_solver()
                self._is_loaded = True

        elif "qwen" in self.model_name.lower():
            # Qwen2-Math model
            try:
                from transformers import AutoModelForCausalLM, AutoTokenizer
                import torch

                self._tokenizer = AutoTokenizer.from_pretrained(self.model_name)
                self._model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    torch_dtype=torch.bfloat16,
                    device_map=self.device
                )
                self._is_loaded = True
            except Exception as e:
                print(f"Failed to load {self.model_name}: {e}")
                self._model = self._create_baseline_solver()
                self._is_loaded = True
        else:
            # Default to baseline
            self._model = self._create_baseline_solver()
            self._is_loaded = True

        elapsed = time.time() - start_time
        print(f"Model loaded in {elapsed:.2f}s")

    def _create_baseline_solver(self):
        """Create a baseline rule-based solver."""
        def baseline_solve(problem: str) -> int:
            """
            Baseline solver using pattern matching and heuristics.

            This is a fallback that provides reasonable guesses based on
            problem analysis. Not expected to achieve high accuracy.
            """
            parsed = parse_latex_problem(problem)

            # If there's a modulo question, return a small number
            if parsed['has_modulo'] and parsed['modulo_value']:
                # Common answers for modulo problems
                return parsed['modulo_value'] // 2

            # Look for specific number patterns in the problem
            numbers = re.findall(r'\b(\d+)\b', problem)
            if numbers:
                # Use the most common number or first significant one
                significant = [int(n) for n in numbers if int(n) > 1 and int(n) <= 99999]
                if significant:
                    return significant[0]

            # Default fallback based on problem type
            type_defaults = {
                'geometry': 180,  # Common angle-related
                'number_theory': 1,
                'algebra': 0,
                'combinatorics': 1,
            }

            return type_defaults.get(parsed['problem_type'], 0)

        return baseline_solve

    def predict(self, problem: str) -> int:
        """
        Predict the answer for a mathematical problem.

        Args:
            problem: The LaTeX-formatted problem text

        Returns:
            Integer answer in range [0, 99999]
        """
        # Lazy loading
        if not self._is_loaded:
            self.load()

        try:
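            # Note: callable() cannot tell the two apart, because PyTorch
            # models are callable too. Check for a generate() method instead.
            if hasattr(self._model, 'generate'):
                # LLM-based solver
                answer = self._solve_with_llm(problem)
            else:
                # Baseline solver (a plain function)
                answer = self._model(problem)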

            return validate_answer(answer)

        except Exception as e:
            print(f"Prediction error: {e}")
            return 0

    def _solve_with_llm(self, problem: str) -> int:
        """Solve using the loaded LLM."""
        import torch

        # Format the prompt
        prompt = COT_PROMPT_TEMPLATE.format(problem=problem)

        # Generate response
        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)

        # Only pass temperature when sampling; greedy decoding does not use it
        gen_kwargs = {
            'max_new_tokens': self.max_new_tokens,
            'do_sample': self.temperature > 0,
            'pad_token_id': self._tokenizer.eos_token_id,
        }
        if self.temperature > 0:
            gen_kwargs['temperature'] = self.temperature

        with torch.no_grad():
            outputs = self._model.generate(**inputs, **gen_kwargs)

        # Decode only the newly generated tokens, so numbers in the
        # prompt itself cannot be mistaken for the answer
        prompt_length = inputs['input_ids'].shape[1]
        response = self._tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        )

        # Extract answer
        answer = extract_numerical_answer(response)
        return answer if answer is not None else 0


# Test the model class
print("MODEL CLASS TEST")
print("=" * 60)
model = MathOlympiadModel(model_name="baseline")

test_problem = r"Find the smallest positive integer $n$ such that $n^2 + 1$ is divisible by $5$."
result = model.predict(test_problem)
print(f"Test problem: {test_problem}")
print(f"Predicted answer: {result}")

<a id="inference"></a>
# 6. Inference Pipeline

## Production-Ready Prediction System

Now let's build the complete inference pipeline that:
1. Integrates with Kaggle's evaluation API
2. Handles edge cases gracefully
3. Provides logging and monitoring
4. Ensures consistent predictions across both private set runs
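
For the last goal in the list, one common first step is to seed every random number generator the pipeline touches. The sketch below is a minimal version with an assumed helper name `set_seed`; it seeds NumPy and PyTorch only if they are installed. Seeding alone may not make GPU sampling bit-identical across runs, so greedy decoding (temperature 0) is the more reliable lever when strict repeatability matters.

```python
import random


def set_seed(seed: int = 0) -> None:
    """Seed the RNGs that are installed; skip the ones that are not."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(123)
first = [random.random() for _ in range(3)]
set_seed(123)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```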

### Key Design Decisions:

| Aspect | Choice | Reason |
|--------|--------|--------|
| **Loading** | Lazy | Reduce startup time |
| **Timeout** | Graceful | Return default answer vs crash |
| **Logging** | Verbose | Debug production issues |
| **Validation** | Strict | Ensure valid range [0, 99999] |
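
The timeout row above implies that each problem needs a time allowance before a default answer is returned. A simple way to derive one is to split the remaining wall-clock time evenly over the remaining problems. The `TimeBudget` class below is a hypothetical sketch, and the 15-minute reserve is an assumed safety margin, not a competition rule.

```python
import time


class TimeBudget:
    """Split the remaining wall-clock time evenly over the remaining problems."""

    def __init__(self, total_seconds: float, num_problems: int, reserve: float = 0.0):
        self.deadline = time.monotonic() + total_seconds - reserve
        self.remaining_problems = num_problems

    def next_allowance(self) -> float:
        """Seconds the next problem may use; 0.0 once the budget is spent."""
        left = max(self.deadline - time.monotonic(), 0.0)
        share = left / max(self.remaining_problems, 1)
        self.remaining_problems = max(self.remaining_problems - 1, 0)
        return share


# Assumed numbers: a 5-hour GPU limit, 50 problems, 15 minutes held in reserve.
budget = TimeBudget(total_seconds=5 * 3600, num_problems=50, reserve=15 * 60)
print(f"{budget.next_allowance():.0f}s for the first problem")
```

With these numbers the first allowance is (18000 - 900) / 50 = 342 seconds. Because the share is recomputed on every call, time saved on easy problems is automatically redistributed to later ones.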

In [None]:
# ============================================================================
# ADVANCED MODEL CONFIGURATION
# ============================================================================
#
# Uncomment and modify the appropriate section based on your hardware:
#
# ----------------------
# OPTION 1: CPU ONLY (9 hours max)
# ----------------------
# MODEL_CONFIG = {
#     'model_name': 'baseline',  # Use baseline for CPU
#     'device': 'cpu',
#     'max_new_tokens': 2048,
#     'temperature': 0.0,
# }
#
# ----------------------
# OPTION 2: GPU - DeepSeek-Math (Recommended)
# ----------------------
# MODEL_CONFIG = {
#     'model_name': 'deepseek-ai/deepseek-math-7b-instruct',
#     'device': 'auto',
#     'max_new_tokens': 4096,
#     'temperature': 0.1,
# }
#
# ----------------------
# OPTION 3: GPU - Qwen2-Math
# ----------------------
# MODEL_CONFIG = {
#     'model_name': 'Qwen/Qwen2-Math-7B-Instruct',
#     'device': 'auto',
#     'max_new_tokens': 4096,
#     'temperature': 0.1,
# }

# Default configuration (baseline for testing)
MODEL_CONFIG = {
    'model_name': 'baseline',
    'device': 'auto',
    'max_new_tokens': 4096,
    'temperature': 0.1,
    'num_attempts': 1,
}

print(" MODEL CONFIGURATION")
print("=" * 60)
for key, value in MODEL_CONFIG.items():
    print(f"   {key}: {value}")

In [None]:
# ============================================================================
# INFERENCE PIPELINE
# ============================================================================

class AIMO3InferencePipeline:
    """
    Complete inference pipeline for AIMO3 competition.

    Features:
    - Lazy model loading
    - Robust error handling
    - Detailed logging
    - Answer validation
    - Statistics tracking
    """

    def __init__(self, config: dict):
        """Initialize the pipeline with configuration."""
        self.config = config
        self.model = None
        self.stats = {
            'total_problems': 0,
            'successful_predictions': 0,
            'failed_predictions': 0,
            'avg_time_per_problem': 0,
            'predictions': []
        }
        self.start_time = None

    def initialize(self):
        """Initialize the model (lazy loading wrapper)."""
        if self.model is None:
            print("=" * 60)
            print(" INITIALIZING INFERENCE PIPELINE")
            print("=" * 60)
            self.model = MathOlympiadModel(**self.config)
            self.start_time = time.time()

    def predict(self, problem_id: str, problem_text: str) -> int:
        """
        Make a prediction for a single problem.

        Args:
            problem_id: Unique problem identifier
            problem_text: LaTeX-formatted problem

        Returns:
            Integer answer in [0, 99999]
        """
        # Initialize on first call
        self.initialize()

        problem_start = time.time()
        self.stats['total_problems'] += 1

        try:
            # Make prediction
            answer = self.model.predict(problem_text)

            # Validate
            answer = validate_answer(answer)

            # Log
            elapsed = time.time() - problem_start
            print(f"   Problem {problem_id}: Answer = {answer} ({elapsed:.2f}s)")

            self.stats['successful_predictions'] += 1
            self.stats['predictions'].append({
                'id': problem_id,
                'answer': answer,
                'time': elapsed
            })

            return answer

        except Exception as e:
            print(f"    Problem {problem_id}: Error - {e}")
            self.stats['failed_predictions'] += 1
            return 0

    def get_stats(self) -> dict:
        """Get pipeline statistics."""
        timed = self.stats['predictions']
        if timed:
            # Average over timed (successful) predictions only;
            # failed predictions record no time
            total_time = sum(p['time'] for p in timed)
            self.stats['avg_time_per_problem'] = total_time / len(timed)
        return self.stats

    def print_summary(self):
        """Print a summary of the inference run."""
        stats = self.get_stats()
        print("\n" + "=" * 60)
        print(" INFERENCE SUMMARY")
        print("=" * 60)
        print(f"   Total problems: {stats['total_problems']}")
        print(f"   Successful: {stats['successful_predictions']}")
        print(f"   Failed: {stats['failed_predictions']}")
        print(f"   Avg time/problem: {stats['avg_time_per_problem']:.2f}s")
        if self.start_time:
            total_elapsed = time.time() - self.start_time
            print(f"   Total elapsed: {total_elapsed:.2f}s")


# Create pipeline instance
pipeline = AIMO3InferencePipeline(MODEL_CONFIG)
print(" Pipeline created successfully!")

In [None]:
# ============================================================================
# TEST ON REFERENCE PROBLEMS
# ============================================================================

print(" TESTING ON REFERENCE PROBLEMS")
print("=" * 60)

# Test pipeline on reference data
test_results = []

for idx, row in reference_df.iterrows():
    problem_id = row['id']
    problem_text = row['problem']
    true_answer = row.get('answer', None)

    # Get prediction
    predicted = pipeline.predict(problem_id, problem_text)

    # Check correctness
    is_correct = (predicted == true_answer) if true_answer is not None else None

    test_results.append({
        'id': problem_id,
        'predicted': predicted,
        'true_answer': true_answer,
        'correct': is_correct
    })

# Summary
pipeline.print_summary()

# Accuracy on reference
if any(r['correct'] is not None for r in test_results):
    correct_count = sum(1 for r in test_results if r['correct'])
    total_count = sum(1 for r in test_results if r['correct'] is not None)
    accuracy = correct_count / total_count if total_count > 0 else 0
    print(f"\n Reference Accuracy: {correct_count}/{total_count} = {accuracy:.1%}")

<a id="submission"></a>
# 7. Final Submission

## Production Submission Code

This is the **final submission code** that integrates with Kaggle's evaluation API.

### Important Notes:
1. **Must call `inference_server.serve()`** within 15 minutes of script start
2. **No internet access** during evaluation
3. **Runtime limits**: 9 hours for CPU notebooks, 5 hours for GPU notebooks
4. Submission runs **twice** on private set - both must succeed!

### Submission Checklist:
- ✅ Model loads successfully
- ✅ Predictions are integers in [0, 99999]
- ✅ Graceful error handling
- ✅ Consistent across runs (deterministic if possible)
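
Most of the checklist can be exercised locally before submitting. The sketch below runs a predictor twice per problem and reports anything that violates the items above. The `preflight` helper and the stand-in predictor are hypothetical; swap in the notebook's real predict function.

```python
from typing import Callable, List, Sequence


def preflight(predict_fn: Callable[[str], int], problems: Sequence[str]) -> List[str]:
    """Run each problem twice and report checklist violations as strings."""
    issues = []
    for i, problem in enumerate(problems):
        try:
            first, second = predict_fn(problem), predict_fn(problem)
        except Exception as exc:  # the checklist asks for graceful handling
            issues.append(f"problem {i}: raised {type(exc).__name__}")
            continue
        for value in (first, second):
            if isinstance(value, bool) or not isinstance(value, int):
                issues.append(f"problem {i}: non-integer output {value!r}")
            elif not 0 <= value <= 99_999:
                issues.append(f"problem {i}: {value} outside [0, 99999]")
        if first != second:
            issues.append(f"problem {i}: runs disagree ({first} vs {second})")
    return issues


# Stand-in predictor for illustration only.
print(preflight(lambda p: len(p) % 100_000, ["What is $1+1$?", "Find $n$."]))  # []
```

The integer check is deliberately strict: NumPy integer types are reported as non-integer, so convert with `int()` before returning an answer.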

In [None]:
# ============================================================================
# FINAL SUBMISSION CODE
# ============================================================================
# This cell contains the complete submission code for the competition.
# It's designed to work with Kaggle's evaluation API.
# ============================================================================

import os
import re
import time
from typing import Optional

import pandas as pd
import polars as pl

# Try to import the competition API (only available on Kaggle)
try:
    import kaggle_evaluation.aimo_3_inference_server
    KAGGLE_API_AVAILABLE = True
except ImportError:
    KAGGLE_API_AVAILABLE = False
    print("Kaggle API not available (running locally)")


# ============================================================================
# PRODUCTION MODEL CLASS
# ============================================================================

class ProductionModel:
    """
    Production-ready model for AIMO3 submission.

    This class is optimized for the competition environment:
    - Lazy loading to meet 15-minute startup requirement
    - Robust error handling
    - Deterministic predictions for consistency across runs
    """

    def __init__(self):
        self._model = None
        self._is_loaded = False

    def load(self):
        """
        Load the model. Called lazily on first prediction.

        CUSTOMIZE THIS METHOD:
        - For baseline: Uses pattern matching (fast, low accuracy)
        - For LLM: Load your fine-tuned model here
        """
        print("Loading production model...")
        start_time = time.time()

        # ============================================================
        # OPTION 1: BASELINE (Fast, works everywhere)
        # ============================================================
        self._model = self._create_baseline_solver()

        # ============================================================
        # OPTION 2: DEEPSEEK-MATH (Uncomment for GPU, and comment out OPTION 1)
        # ============================================================
        # NOTE: There is no internet access during evaluation, so `model_name`
        # must resolve to model files already attached to the notebook
        # (a local path) rather than a Hub ID that needs downloading.
        #
        # from transformers import AutoModelForCausalLM, AutoTokenizer
        # import torch
        #
        # model_name = "deepseek-ai/deepseek-math-7b-instruct"
        # self._tokenizer = AutoTokenizer.from_pretrained(
        #     model_name, trust_remote_code=True
        # )
        # self._model = AutoModelForCausalLM.from_pretrained(
        #     model_name,
        #     torch_dtype=torch.bfloat16,
        #     device_map="auto",
        #     trust_remote_code=True
        # )

        self._is_loaded = True
        elapsed = time.time() - start_time
        print(f"Model loaded in {elapsed:.2f}s")

    def _create_baseline_solver(self):
        """Create baseline rule-based solver."""
        def solve(problem: str) -> int:
            # Parse problem
            problem_lower = problem.lower()

            # Look for modulo patterns
            mod_match = re.search(
                r'remainder when.*divided by\s*\$?(\d+)\$?',
                problem, re.IGNORECASE
            )
            if mod_match:
                mod_val = int(mod_match.group(1))
                return mod_val // 2

            # Extract numbers from problem
            numbers = re.findall(r'\b(\d+)\b', problem)
            significant = [int(n) for n in numbers if 1 < int(n) <= 99999]

            if significant:
                # Return first significant number as heuristic
                return significant[0] % 100000

            # Default based on problem type
            if 'triangle' in problem_lower or 'angle' in problem_lower:
                return 180
            elif 'prime' in problem_lower:
                return 2
            elif 'sum' in problem_lower:
                return 0

            return 0

        return solve

    def predict(self, problem: str) -> int:
        """
        Generate prediction for a problem.

        Args:
            problem: LaTeX-formatted problem text

        Returns:
            Integer in [0, 99999]
        """
        # Lazy load
        if not self._is_loaded:
            self.load()

        try:
            if callable(self._model):
                answer = self._model(problem)
            else:
                answer = self._llm_predict(problem)

            # Validate answer
            answer = int(answer)
            answer = max(0, min(99999, answer))
            return answer

        except Exception as e:
            print(f"Prediction error: {e}")
            return 0

    def _llm_predict(self, problem: str) -> int:
        """LLM-based prediction (when using transformers model)."""
        import torch

        prompt = f"""Solve this mathematical olympiad problem step by step.
At the end, provide your answer as: FINAL ANSWER: [integer]

Problem:
{problem}

Solution:
"""

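        inputs = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)

        with torch.no_grad():
            outputs = self._model.generate(
                **inputs,
                max_new_tokens=4096,
                do_sample=False,  # greedy decoding keeps both scoring runs consistent
                pad_token_id=self._tokenizer.eos_token_id
            )

        # Decode only the newly generated tokens so that numbers in the prompt
        # cannot be picked up by the answer extraction below
        prompt_length = inputs["input_ids"].shape[1]
        response = self._tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)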

        # Extract answer
        match = re.search(r'FINAL\s*ANSWER\s*[:=]\s*(\d+)', response, re.IGNORECASE)
        if match:
            return int(match.group(1))

        # Fallback: last number
        numbers = re.findall(r'\b(\d+)\b', response)
        if numbers:
            return int(numbers[-1])

        return 0


# ============================================================================
# PREDICTION FUNCTION (Required by Kaggle API)
# ============================================================================

# Global model instance
model = ProductionModel()


def predict(id_: pl.Series, problem: pl.Series) -> pl.DataFrame:
    """
    Prediction function for Kaggle API.

    Args:
        id_: Problem ID (Polars Series with single element)
        problem: Problem text (Polars Series with single element)

    Returns:
        DataFrame with 'id' and 'answer' columns
    """
    # Extract values from Series
    problem_id = id_.item(0)
    problem_text: str = problem.item(0)

    # Make prediction
    prediction = model.predict(problem_text)

    # Return as DataFrame
    return pl.DataFrame({
        'id': problem_id,
        'answer': prediction
    })


print("Submission code ready!")
print(f"   Kaggle API available: {KAGGLE_API_AVAILABLE}")

In [None]:
# ============================================================================
# RUN INFERENCE SERVER
# ============================================================================
# This cell starts the inference server for Kaggle evaluation.
# Locally, it runs against the test.csv file.
# ============================================================================

if KAGGLE_API_AVAILABLE:
    # Create inference server
    inference_server = kaggle_evaluation.aimo_3_inference_server.AIMO3InferenceServer(
        predict
    )

    if os.getenv('KAGGLE_IS_COMPETITION_RERUN'):
        # PRODUCTION MODE: Serve predictions to the evaluation system
        # IMPORTANT: Must be called within 15 minutes of script start!
        print("Starting inference server (PRODUCTION MODE)...")
        inference_server.serve()
    else:
        # DEVELOPMENT MODE: Test locally with test.csv
        print("Running local gateway (DEVELOPMENT MODE)...")
        inference_server.run_local_gateway(
            ('/kaggle/input/ai-mathematical-olympiad-progress-prize-3/test.csv',)
        )
else:
    # Local testing without Kaggle API
    print("LOCAL TESTING MODE")
    print("=" * 60)

    # Test on reference or test data
    for idx, row in reference_df.head(5).iterrows():
        problem_id = row['id']
        problem_text = row['problem']

        # Create Polars Series for API compatibility
        id_series = pl.Series([problem_id])
        problem_series = pl.Series([problem_text])

        # Get prediction
        result = predict(id_series, problem_series)
        print(f"Problem {problem_id}: Answer = {result['answer'].item()}")

<a id="tips"></a>
# 8. Grandmaster Tips & Tricks

## Insights from Top Competitors

### 1. Model Selection

| Model | Pros | Cons | Best For |
|-------|------|------|----------|
| **DeepSeek-Math-7B** | Strong math reasoning | 7B params, GPU needed | GPU submissions |
| **Qwen2-Math-7B** | Fast, accurate | Similar requirements | Alternative GPU |
| **Numina-Math** | AIMO1 winner base | May be overfit | Fine-tuning |
| **Baseline** | Fast, no GPU | Low accuracy | Testing pipeline |

### 2. Prompt Engineering Tips

```
✅ DO:
- Use step-by-step reasoning (Chain-of-Thought)
- Ask model to verify its answer
- Include domain-specific instructions
- Request explicit "FINAL ANSWER:" format

❌ DON'T:
- Expect single-shot accuracy
- Trust answers without validation
- Ignore edge cases (modular arithmetic!)
```

### 3. Runtime Optimization

```python
# Time budget (GPU mode: 5 hours for 50 problems)
TIME_PER_PROBLEM = 5 * 60 * 60 / 50  # = 360 seconds

# With safety margin (handle unexpected delays)
SAFE_TIME_PER_PROBLEM = 300  # 5 minutes per problem
```
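
The fixed per-problem figure above does not account for model loading or for problems that finish early. A minimal sketch of a running budget follows; `TimeBudget` and its `reserve_seconds` parameter are hypothetical names introduced here for illustration, not part of the competition API:

```python
import time


class TimeBudget:
    """Hypothetical helper: splits the remaining wall-clock time
    evenly over the problems that are still to be solved."""

    def __init__(self, total_seconds, n_problems, reserve_seconds=0.0,
                 clock=time.monotonic):
        self._clock = clock
        # Hold back a reserve (e.g. for model loading) from the overall limit
        self._deadline = clock() + total_seconds - reserve_seconds
        self._remaining_problems = n_problems

    def next_allowance(self) -> float:
        """Seconds available for the next problem (never negative)."""
        remaining = max(0.0, self._deadline - self._clock())
        share = remaining / max(1, self._remaining_problems)
        self._remaining_problems = max(0, self._remaining_problems - 1)
        return share


# GPU mode: 5 hours, 50 problems, 30 minutes held in reserve (assumed value)
budget = TimeBudget(5 * 60 * 60, 50, reserve_seconds=30 * 60)
```

Calling `budget.next_allowance()` before each problem means that time saved on easy problems is automatically passed on to later ones.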

### 4. Consistency for Double-Run Scoring

Since both private runs must match for full score:
- **Use deterministic decoding** (greedy decoding, e.g. `do_sample=False` in Hugging Face `generate`, or fixed seeds when sampling)
- **Cache intermediate results** if possible
- **Handle edge cases identically** across runs
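
When sampling is used, seeding is one way to make runs repeatable. The sketch below assumes a hypothetical `seed_everything` helper; it seeds Python's `random` and, if they are installed, NumPy and PyTorch. Note that seeding alone may not make GPU output bit-identical, since some kernels are non-deterministic.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Hypothetical helper: seed every RNG that is available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional here
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is optional here


# Same seed, same draw
seed_everything(7)
first_draw = random.random()
seed_everything(7)
second_draw = random.random()
```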

### 5. Answer Extraction Best Practices

```python
# Priority order for answer extraction:
# 1. Look for "FINAL ANSWER: XXX" pattern
# 2. Look for \boxed{XXX} LaTeX notation
# 3. Look for "the answer is XXX"
# 4. Fall back to last number in response
# 5. Return 0 as ultimate fallback
```
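
The priority order above could be implemented roughly as follows. This is a sketch, not the notebook's own extractor: `extract_answer` is a hypothetical name, the regular expressions are assumptions about how the model phrases its answer, and taking the last match of each pattern is a design choice (later statements are usually the final ones).

```python
import re

# Patterns in priority order; the first pattern with any match wins.
_ANSWER_PATTERNS = [
    r"FINAL\s*ANSWER\s*[:=]\s*(\d+)",   # 1. explicit marker
    r"\\boxed\{\s*(\d+)\s*\}",          # 2. LaTeX \boxed{...}
    r"the answer is\s*\$?(\d+)",        # 3. natural-language phrasing
]


def extract_answer(response: str) -> int:
    """Hypothetical extractor following the priority list above."""
    for pattern in _ANSWER_PATTERNS:
        matches = re.findall(pattern, response, flags=re.IGNORECASE)
        if matches:
            return int(matches[-1])  # last occurrence of this pattern
    # 4. Fall back to the last standalone number in the response
    numbers = re.findall(r"\b(\d+)\b", response)
    if numbers:
        return int(numbers[-1])
    # 5. Ultimate fallback
    return 0
```

The result still needs range validation (clamping to [0, 99999]), which `ProductionModel.predict` already applies.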

### 6. Common Mistakes to Avoid

| Mistake | Impact | Solution |
|---------|--------|----------|
| Answer > 99999 | Invalid | Take modulo or clamp |
| Negative answers | Invalid | Take absolute value |
| Float answers | Invalid | Round to integer |
| Missing modulo | Wrong | Parse problem for mod requirements |
| Non-deterministic | 50% score | Set random seeds |
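
The fixes in the table can be collected in one place. A minimal sketch, assuming a hypothetical `normalize_answer` helper and an optional `modulus` that the caller has parsed from the problem statement; the policies (absolute value, rounding, clamping) simply mirror the table and are not the only reasonable choices:

```python
import math
from typing import Optional, Union


def normalize_answer(value: Union[int, float, str],
                     modulus: Optional[int] = None) -> int:
    """Hypothetical helper: coerce a raw answer into an int in [0, 99999]."""
    if isinstance(value, int):
        number = value
    else:
        try:
            as_float = float(value)
        except (TypeError, ValueError):
            return 0  # unparseable input
        if not math.isfinite(as_float):
            return 0  # NaN or infinity
        number = round(as_float)  # float answers: round to nearest integer
    answer = abs(number)          # negative answers: absolute value
    if modulus:
        answer %= modulus         # apply the problem's modulus, if known
    return max(0, min(99999, answer))  # clamp into the valid range
```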

## Advanced Strategies for Higher Scores

### Strategy 1: Ensemble Methods

Combine multiple approaches for more robust predictions:

```
Model A (DeepSeek) ─┐
                    ├──→ Majority Vote ──→ Final Answer
Model B (Qwen)    ─┤
                    │
Symbolic Solver   ─┘
```
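
One possible shape for the voting step in the diagram is a weighted vote. In this sketch, `ensemble_vote`, the solver tuples, and the weights are all hypothetical; each solver is assumed to be a callable that maps problem text to an integer.

```python
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple

Solver = Callable[[str], int]


def ensemble_vote(problem: str,
                  solvers: Sequence[Tuple[str, Solver, float]]) -> int:
    """Hypothetical weighted vote over (name, solver, weight) entries."""
    scores: Dict[int, float] = defaultdict(float)
    first_seen: Dict[int, int] = {}
    for order, (_name, solve, weight) in enumerate(solvers):
        try:
            answer = int(solve(problem))
        except Exception:
            continue  # a failing solver simply casts no vote
        scores[answer] += weight
        first_seen.setdefault(answer, order)
    if not scores:
        return 0  # every solver failed
    # Highest total weight wins; ties go to the earliest solver's answer,
    # which keeps the result deterministic.
    return max(scores, key=lambda a: (scores[a], -first_seen[a]))
```

Listing the most trusted solver first makes it the tie-breaker when the votes are split evenly.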

### Strategy 2: Self-Consistency (SC)

Generate multiple solutions and pick the most common answer:

```python
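from collections import Counter

# Generate 5 solutions with temperature > 0
# (assumes a predict() variant that accepts a sampling temperature;
# ProductionModel.predict above does not take one)
solutions = [model.predict(problem, temp=0.7) for _ in range(5)]
# Take majority vote
answer = Counter(solutions).most_common(1)[0][0]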
```
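
Sampling conflicts with the double-run consistency requirement unless each sample is seeded. The sketch below shows one way to combine the two; `self_consistent_answer` and the `sample_fn(problem, seed)` signature are hypothetical, and `noisy_stub` is a stand-in for a real sampled LLM call.

```python
import random
from collections import Counter
from typing import Callable


def self_consistent_answer(problem: str,
                           sample_fn: Callable[[str, int], int],
                           n_samples: int = 5,
                           base_seed: int = 0) -> int:
    """Hypothetical self-consistency vote with one fixed seed per sample."""
    answers = [sample_fn(problem, base_seed + i) for i in range(n_samples)]
    if not answers:
        return 0
    counts = Counter(answers)
    top = max(counts.values())
    # Ties are resolved in sampling order so repeated runs agree
    return next(a for a in answers if counts[a] == top)


def noisy_stub(problem: str, seed: int) -> int:
    """Stand-in for a sampled model call: usually 42, sometimes noise."""
    rng = random.Random(seed)
    return 42 if rng.random() < 0.7 else rng.randrange(100)


run_1 = self_consistent_answer("toy problem", noisy_stub)
run_2 = self_consistent_answer("toy problem", noisy_stub)
```

Because every sample uses a fixed seed, `run_1` and `run_2` are equal even though individual samples differ from each other.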

### Strategy 3: Problem-Type Routing

Use different strategies for different problem types:

```
┌─────────────────┐
│ Detect Problem  │
│     Type        │
└────────┬────────┘
         │
    ┌────┴────┬────────────┬────────────┐
    ▼         ▼            ▼            ▼
┌───────┐ ┌───────┐ ┌──────────┐ ┌───────────┐
│Geometry│ │Number │ │Combinat- │ │  Algebra  │
│Solver  │ │Theory │ │orics     │ │  Solver   │
└────────┘ └───────┘ └──────────┘ └───────────┘
```
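
A very rough version of the routing step can be built from keyword counts. Everything in this sketch is an assumption made for illustration: the `KEYWORDS` lists, the `detect_problem_type` and `route` names, and the choice of algebra as the default category. A real detector would likely need something stronger than substring matching.

```python
from typing import Callable, Dict

# Hypothetical keyword lists; extend or replace for real use
KEYWORDS = {
    "geometry": ("triangle", "circle", "angle", "polygon", "perimeter"),
    "number_theory": ("prime", "divisible", "remainder", "gcd", "modulo"),
    "combinatorics": ("how many ways", "arrangements", "permutation",
                      "choose", "subsets"),
    "algebra": ("polynomial", "equation", "roots", "function", "inequality"),
}


def detect_problem_type(problem: str) -> str:
    """Return the category with the most keyword hits (algebra if none)."""
    text = problem.lower()
    hits = {topic: sum(kw in text for kw in kws)
            for topic, kws in KEYWORDS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "algebra"


def route(problem: str,
          solvers: Dict[str, Callable[[str], int]],
          fallback: Callable[[str], int]) -> int:
    """Dispatch to the solver registered for the detected type."""
    solver = solvers.get(detect_problem_type(problem), fallback)
    return solver(problem)
```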

### Strategy 4: Two-Phase Approach

1. **Phase 1**: Quick solve with short timeout
2. **Phase 2**: Deeper reasoning if time permits

```python
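# Phase 1: Quick attempt with a small token budget
# (max_tokens, time_remaining, threshold and verify_and_select are
# placeholders to be supplied by your own pipeline)
quick_answer = model.predict(problem, max_tokens=512)
answer = quick_answer  # default if Phase 2 is skipped or agrees

# Phase 2: Deep reasoning if time available
if time_remaining > threshold:
    deep_answer = model.predict(problem, max_tokens=4096)
    if deep_answer != quick_answer:
        # Verify with additional attempt
        answer = verify_and_select(quick_answer, deep_answer)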
```

## Expected Performance by Approach

| Approach | Expected Public Score | Time per Problem | GPU Required |
|----------|----------------------|------------------|--------------|
| **Random Baseline** | ~0.001% (1 in 100,000) | <1s | ❌ |
| **Pattern Matching** | 5-10% | <1s | ❌ |
| **Small LLM (7B)** | 20-35% | 30-60s | ✅ |
| **Large LLM (70B)** | 35-50% | 2-5min | ✅✅ |
| **Fine-tuned + Ensemble** | 50-70%+ | 3-5min | ✅✅ |
| **AIMO2 Winner Approach** | 68% | - | ✅✅ |

### Resources for Improvement

1. **Official Resources**:
   - [Reference Problems PDF](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-3/data) - Full solutions
   - [Fields Model Initiative](https://fieldsmodelinitiative.org/) - Free compute (128 H100s!)
   - [Tinker API](https://thinkingmachines.ai/) - Up to $400 credits

2. **Past Winner Solutions**:
   - [AIMO1 Winner (Numina)](https://www.kaggle.com/competitions/ai-mathematical-olympiad-progress-prize-1/discussion/495373)
   - [AIMO2 Winner (NemoSkills)](https://arxiv.org/abs/2504.16891)

3. **Useful Datasets**:
   - MATH dataset
   - GSM8K
   - Numina-Math-CoT
   - PRM800K (process reward model data)

## Quick Start Guide

### For Beginners:
1. **Run this notebook as-is** - You'll get a working baseline submission
2. **Understand the pipeline** - Study how predict() works
3. **Test on reference problems** - Verify your understanding

### For Intermediate:
1. **Add a real LLM** - Uncomment the DeepSeek/Qwen sections
2. **Tune prompts** - Improve the Chain-of-Thought template
3. **Add validation** - Verify answers make mathematical sense

### For Advanced:
1. **Fine-tune a model** - Use Numina-Math or similar datasets
2. **Implement ensembling** - Combine multiple models
3. **Add symbolic solvers** - For specific problem types
4. **Optimize for runtime** - Use vLLM or TensorRT

---

### Submission Checklist

Before submitting, verify:

- [ ] Notebook runs completely without errors
- [ ] `predict()` function returns valid integers [0, 99999]
- [ ] No internet access required during runtime
- [ ] Runtime is under limit (GPU: 5h, CPU: 9h)
- [ ] Model loads within 15 minutes
- [ ] Predictions are deterministic (same input → same output)
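
Two of the checklist items (valid range and determinism) can be checked automatically before submitting. A minimal sketch, assuming a hypothetical `presubmission_check` helper and a `predict_fn` that takes problem text and returns a plain Python `int`:

```python
from typing import Callable, Iterable, List


def presubmission_check(predict_fn: Callable[[str], int],
                        problems: Iterable[str]) -> List[str]:
    """Hypothetical smoke test: returns a list of failure messages."""
    failures = []
    for problem in problems:
        # Predict twice to detect non-deterministic behaviour
        first, second = predict_fn(problem), predict_fn(problem)
        if isinstance(first, bool) or not isinstance(first, int):
            failures.append(f"non-integer answer {first!r}")
        elif not 0 <= first <= 99999:
            failures.append(f"out-of-range answer {first}")
        if first != second:
            failures.append(f"non-deterministic: {first!r} vs {second!r}")
    return failures
```

For example, `presubmission_check(model.predict, reference_df['problem'])` would exercise the production model on the reference problems; an empty list means both checks passed.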

---

### Conclusion

This notebook provides a **complete framework** for the AIMO3 competition:

1. **Data exploration** - Understanding problem format
2. **Model architecture** - Flexible LLM-based solver
3. **Inference pipeline** - Production-ready code
4. **Submission code** - Kaggle API integration
5. **Tips & tricks** - Grandmaster insights

**Next steps:**
- Submit this baseline to get on the leaderboard
- Iterate on model selection and prompting
- Fine-tune for better performance
- Apply for compute resources from Fields Model Initiative!

### Connect with Me  

Feel free to follow me on these platforms:

[![GitHub](https://img.shields.io/badge/GitHub-181717?style=for-the-badge&logo=github&logoColor=white)](https://github.com/AdilShamim8)  
[![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/adilshamim8)  
[![Twitter](https://img.shields.io/badge/Twitter-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white)](https://x.com/adil_shamim8)  