# Tutorial: GEPA for AIME (Math)
In this tutorial, we optimize GPT-4.1 Mini's Chain of Thought (`dspy.ChainOfThought`) for solving math problems (AIME) using the `dspy.GEPA` optimizer!

In [1]:
api_key = input("Enter your OpenAI API key: ")
import dspy
lm = dspy.LM("openai/gpt-4.1-mini", temperature=1, api_key=api_key, max_tokens=32000)
dspy.configure(lm=lm)

### Loading the AIME dataset

Each year, the AIME exam consists of 2 problem sets of 15 problems each. For this tutorial, we use the AIME problem sets from previous years (2022-2024) for optimization (3 years x 2 sets x 15 problems = 90 problems in total, split equally between the train and validation sets), and test performance on AIME 2025 (2 sets x 15 problems = 30 problems). Since AIME 2025 is a small set, we repeat it 5 times for statistical stability in evaluation.

In [2]:
import dspy
from datasets import load_dataset

def init_dataset():
    train_split = load_dataset("AI-MO/aimo-validation-aime")['train']
    train_split = [
        dspy.Example({
            "problem": x['problem'],
            'solution': x['solution'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in train_split
    ]
    import random
    random.Random(0).shuffle(train_split)
    tot_num = len(train_split)

    test_split = load_dataset("MathArena/aime_2025")['train']
    test_split = [
        dspy.Example({
            "problem": x['problem'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in test_split
    ]

    train_set = train_split[:int(0.5 * tot_num)]
    val_set = train_split[int(0.5 * tot_num):]
    test_set = test_split * 5

    return train_set, val_set, test_set

In [3]:
train_set, val_set, test_set = init_dataset()

len(train_set), len(val_set), len(test_set)

(45, 45, 150)
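
The sizes printed above follow directly from the split logic in `init_dataset()`. As a minimal sketch, here is the same seeded shuffle, 50/50 split, and 5x repetition applied to placeholder strings instead of the real datasets (the `toy_*` names and the placeholder items are assumptions made for illustration):

```python
import random

# Placeholder stand-ins for the real data: 90 past problems and 30 AIME 2025 problems.
toy_past = [f"past-{i}" for i in range(3 * 2 * 15)]
toy_2025 = [f"2025-{i}" for i in range(2 * 15)]

# Same seeded shuffle and 50/50 split as in init_dataset().
random.Random(0).shuffle(toy_past)
half = int(0.5 * len(toy_past))
toy_train, toy_val = toy_past[:half], toy_past[half:]

# List repetition multiplies the number of evaluations, not the number of distinct problems.
toy_test = toy_2025 * 5

print(len(toy_train), len(toy_val), len(toy_test))  # 45 45 150
print(len(set(toy_test)))  # 30
```

Note that the repeated test set still contains only 30 distinct problems; the repetition averages out sampling noise from the LM (which runs at `temperature=1`) rather than adding new problems.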

Let's view an example task, with its problem, reference solution, and answer:

In [4]:
print("Problem:")
print(train_set[0]['problem'])
print("\n\nSolution:")
print(train_set[0]['solution'])
print("\n\nAnswer:")
print(train_set[0]['answer'])

Problem:
In isosceles trapezoid $ABCD$, parallel bases $\overline{AB}$ and $\overline{CD}$ have lengths $500$ and $650$, respectively, and $AD=BC=333$. The angle bisectors of $\angle{A}$ and $\angle{D}$ meet at $P$, and the angle bisectors of $\angle{B}$ and $\angle{C}$ meet at $Q$. Find $PQ$.


Solution:
We have the following diagram:

Let $X$ and $W$ be the points where $AP$ and $BQ$ extend to meet $CD$, and $YZ$ be the height of $\triangle AZB$. As proven in Solution 2, triangles $APD$ and $DPW$ are congruent right triangles. Therefore, $AD = DW = 333$. We can apply this logic to triangles $BCQ$ and $XCQ$ as well, giving us $BC = CX = 333$. Since $CD = 650$, $XW = DW + CX - CD = 16$.
Additionally, we can see that $\triangle XZW$ is similar to $\triangle PQZ$ and $\triangle AZB$. We know that $\frac{XW}{AB} = \frac{16}{500}$. So, we can say that the height of the triangle $AZB$ is $500u$ while the height of the triangle $XZW$ is $16u$. After that, we can figure out the distance from 

### Let's define the program: A simple `dspy.ChainOfThought`

In [5]:
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(GenerateResponse)

### Defining the evaluation metric
We simply check for an exact match between the predicted answer and the correct answer, after parsing both as integers. A prediction that cannot be parsed as an integer scores 0.

In [6]:
def metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    correct_answer = int(example['answer'])
    try:
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        return 0
    return int(correct_answer == llm_answer)
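
To get a feel for what "exact match" means here, the sketch below runs the same parsing logic on a few hand-written answer strings. It uses `types.SimpleNamespace` as a stand-in for a DSPy prediction so that it runs without an LM; the name `exact_match` and the sample strings are assumptions made for illustration.

```python
from types import SimpleNamespace

def exact_match(example, prediction):
    """Mirrors the metric above: parse both answers as integers and compare."""
    correct_answer = int(example["answer"])
    try:
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        return 0
    return int(correct_answer == llm_answer)

example = {"answer": "70"}
cases = ["70", " 070 ", "70.0", "The answer is 70", "71"]
scores = {raw: exact_match(example, SimpleNamespace(answer=raw)) for raw in cases}
print(scores)
# {'70': 1, ' 070 ': 1, '70.0': 0, 'The answer is 70': 0, '71': 0}
```

Python's `int()` tolerates surrounding whitespace and leading zeros, but rejects decimals and extra words, so the answer field has to contain the bare integer to score.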

### Evaluating unoptimized Chain Of Thought

In [7]:
import dspy
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=32,
    display_table=True,
    display_progress=True
)

evaluate(program)

Average Metric: 70.00 / 150 (46.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [00:01<00:00, 119.75it/s]
2025/08/12 21:49:36 INFO dspy.evaluate.evaluate: Average Metric: 70 / 150 (46.7%)








| | problem | example_answer | reasoning | pred_answer | metric |
|---|---|---|---|---|---|
| 0 | Find the sum of all integer bases $b>9$ for which $17_b$ is a divi... | 70 | We are looking for integer bases \( b > 9 \) such that \( 17_b \) ... | 70 | ✔️ [1] |
| 1 | On $\triangle ABC$ points $A, D, E$, and $B$ lie in that order on ... | 588 | Let's analyze the problem step-by-step. We have triangle \( ABC \)... | 588 | ✔️ [1] |
| 2 | The 9 members of a baseball team went to an ice-cream parlor after... | 16 | We have 9 players, each choosing one of three flavors: chocolate (... | 16 | ✔️ [1] |
| 3 | Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ a... | 117 | We start with the given equation: \[12x^2 - xy - 6y^2 = 0.\] Our g... | 117 | ✔️ [1] |
| 4 | There are $8!= 40320$ eight-digit positive integers that use each ... | 279 | We are given that there are \(8! = 40320\) eight-digit numbers tha... | 279 | ✔️ [1] |
| ... | ... | ... | ... | ... | ... |
| 145 | Let $S$ be the set of vertices of a regular $24$-gon. Find the num... | 113 | We have a regular 24-gon with vertex set \( S \) of size 24. We wa... | 1666 | |
| 146 | Let $A_1 A_2 A_3 \ldots A_{11}$ be an $11$-sided non-convex simple... | 19 | Let's analyze the problem step-by-step. We are given an 11-sided p... | 19 | ✔️ [1] |
| 147 | Let $x_1, x_2, x_3, \ldots$ be a sequence of rational numbers defi... | 248 | We have the sequence: \[ x_1 = \frac{25}{11}, \quad x_{k+1} = \fra... | 589 | |
| 148 | Let $\triangle ABC$ be a right triangle with $\angle A = 90^\circ$... | 104 | Given a right triangle \(\triangle ABC\) with \(\angle A = 90^\cir... | 98 | |


EvaluationResult(score=46.67, results=<list of 150 results>)

### Optimize the program with `dspy.GEPA`

GEPA is a _reflective_ prompt optimizer. Its strength lies in leveraging additional sources of information, such as the DSPy program's execution and evaluation pipelines, which give GEPA more visibility into why the system got the score that it did. GEPA can then introspect to identify how to improve the score. GEPA can also leverage additional supervision provided through the metric. For example, during optimization, we can return the correct solutions to the problems the program failed to solve.

We note that while such explicit supervision is not available in all scenarios, GEPA can work very flexibly with different forms of feedback (for example, using LLM-as-a-judge feedback shown in the PAPILLON tutorial, or just using answer labels, as shown in the facility-support tutorial).

Let's quickly turn the evaluation metric into an optimization metric for GEPA that can provide this additional supervision!

In [8]:
def metric_with_feedback(example, prediction, trace=None, pred_name=None, pred_trace=None):
    correct_answer = int(example['answer'])
    written_solution = example.get('solution', '')
    try:
        llm_answer = int(prediction.answer)
    except (ValueError, TypeError):
        feedback_text = f"The final answer must be a valid integer and nothing else. You responded with '{prediction.answer}', which couldn't be parsed as a python integer. Please ensure your answer is a valid integer without any additional text or formatting."
        feedback_text += f" The correct answer is '{correct_answer}'."
        if written_solution:
            feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems and ensure your final answer is a valid integer."
        return dspy.Prediction(score=0, feedback=feedback_text)

    score = int(correct_answer == llm_answer)

    feedback_text = ""
    if score == 1:
        feedback_text = f"Your answer is correct. The correct answer is '{correct_answer}'."
    else:
        feedback_text = f"Your answer is incorrect. The correct answer is '{correct_answer}'."
    
    if written_solution:
        feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems."

    return dspy.Prediction(score=score, feedback=feedback_text)
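
As noted earlier, written solutions are not always available, and GEPA can also work with answer labels alone. The sketch below shows what such label-only feedback might look like as a plain Python function. The helper name `label_only_feedback` and its return shape are assumptions made for illustration; inside a real GEPA metric, the pair would be wrapped as `dspy.Prediction(score=..., feedback=...)`, as in the cell above.

```python
def label_only_feedback(correct_answer, raw_answer):
    """Hypothetical helper: build (score, feedback) from the answer label only."""
    try:
        parsed = int(raw_answer)
    except (ValueError, TypeError):
        # Unparseable output: explain the format problem and reveal the label.
        return 0, (
            f"The final answer must be a valid integer. You responded with "
            f"'{raw_answer}'. The correct answer is '{correct_answer}'."
        )
    if parsed == int(correct_answer):
        return 1, f"Your answer is correct. The correct answer is '{correct_answer}'."
    return 0, f"Your answer is incorrect. The correct answer is '{correct_answer}'."

print(label_only_feedback(70, "70"))
print(label_only_feedback(70, "seventy"))
```

The trade-off is that label-only feedback tells the reflection LM whether an answer was right, but not how the correct solution proceeds, so it carries less signal than the solution-augmented metric used in this tutorial.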

In [9]:
from dspy import GEPA

optimizer = GEPA(
    metric=metric_with_feedback,
    auto="light",
    num_threads=32,
    track_stats=True,
    reflection_minibatch_size=3,
    reflection_lm=dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000, api_key=api_key)
)

optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)

2025/08/12 21:49:36 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 560 metric calls of the program. This amounts to 6.22 full evals on the train+val set.
2025/08/12 21:49:36 INFO dspy.teleprompt.gepa.gepa: Using 45 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.
2025/08/12 21:52:15 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 45 (37.8%)
2025/08/12 21:52:15 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.37777777777777777
2025/08/12 21:52:15 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.37777777777777777
Average Metric: 2.00 / 3 (66.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:10<00:00, 63.51s/it]
2025/08/12 21:55:26 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)

2025/08/12 21:56:50 INFO dspy.teleprompt.g

### Let's see the prompt generated

In [10]:
print(optimized_program.predict.signature.instructions)

You will be given one math problem as plain text under a key like “problem.” Your job is to solve it correctly and return:

- reasoning: a concise, logically ordered solution that uses identities/structure to avoid brute force, ends with a quick verification.
- answer: the final requested number/expression only (no extra words).

Formatting:
- Use exactly two top-level fields named “reasoning” and “answer.”
- Keep reasoning succinct but complete. Bullet points are fine.
- The answer field must contain only the final value requested (e.g., 227, 585, 601).

General problem-solving guidance:
- Parse the problem type (e.g., base representation, intersecting families of subsets, avoiding arithmetic progressions, symmetric sums with constraints, ordered tuples counting).
- Always enforce domain constraints (e.g., base-b digits in 0..b−1; no leading zero for base-10 “three-digit”; ordered vs unordered families; strict increase conditions in sequences).
- Use algebraic identities and modular a

What GEPA is doing here is precomputing some reasoning to come up with a good plan for future task instances. Given the improved performance on the unseen validation set, we expect this prompt to generalize!

### Evaluating the Chain Of Thought optimized with GEPA

In [11]:
evaluate(optimized_program)

Average Metric: 85.00 / 150 (56.7%): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [00:00<00:00, 476.89it/s]

2025/08/12 23:53:14 INFO dspy.evaluate.evaluate: Average Metric: 85 / 150 (56.7%)





| | problem | example_answer | reasoning | pred_answer | metric |
|---|---|---|---|---|---|
| 0 | Find the sum of all integer bases $b>9$ for which $17_b$ is a divi... | 70 | - Interpret the numbers in base \( b \): \[ 17_b = 1 \cdot b + 7 =... | 70 | ✔️ [1] |
| 1 | On $\triangle ABC$ points $A, D, E$, and $B$ lie in that order on ... | 588 | 1. Set up coordinate system: - Place \( A \) at the origin \((0,0)... | 588 | ✔️ [1] |
| 2 | The 9 members of a baseball team went to an ice-cream parlor after... | 16 | - We have 9 players and 3 flavors: Chocolate (C), Vanilla (V), Str... | 16 | ✔️ [1] |
| 3 | Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ a... | 117 | We need to find all integer pairs \((x,y)\) with \(x,y \in [-100, ... | 117 | ✔️ [1] |
| 4 | There are $8!= 40320$ eight-digit positive integers that use each ... | 279 | - We consider all 8-digit numbers using each of the digits 1 throu... | 567 | |
| ... | ... | ... | ... | ... | ... |
| 145 | Let $S$ be the set of vertices of a regular $24$-gon. Find the num... | 113 | - The problem is to find the number of ways to pair up the 24 vert... | 11 | |
| 146 | Let $A_1 A_2 A_3 \ldots A_{11}$ be an $11$-sided non-convex simple... | 19 | We are given a simple polygon \(A_1 A_2 \dots A_{11}\) with vertic... | 19 | ✔️ [1] |
| 147 | Let $x_1, x_2, x_3, \ldots$ be a sequence of rational numbers defi... | 248 | Given the recurrence: \[ x_1 = \frac{25}{11}, \quad x_{k+1} = \fra... | 728 | |
| 148 | Let $\triangle ABC$ be a right triangle with $\angle A = 90^\circ$... | 104 | - Given the right triangle \( \triangle ABC \) with right angle at... | 104 | ✔️ [1] |


EvaluationResult(score=56.67, results=<list of 150 results>)

GEPA improved GPT-4.1 Mini's score on AIME 2025 **from 46.7% to 56.7%**, a gain of 10 percentage points, with just a budget of `auto="light"`!
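
As a quick sanity check on these numbers, the sketch below recomputes the two scores from the raw counts reported in the evaluation logs (70 and 85 correct out of 150) and separates the absolute gain from the relative one. The variable names are chosen here for illustration.

```python
baseline_correct, optimized_correct, total = 70, 85, 150

baseline = 100 * baseline_correct / total    # score before optimization, in percent
optimized = 100 * optimized_correct / total  # score after optimization, in percent

absolute_gain = optimized - baseline  # difference in percentage points
relative_gain = 100 * (optimized_correct - baseline_correct) / baseline_correct

print(f"{baseline:.1f}% -> {optimized:.1f}%")
print(f"+{absolute_gain:.1f} points absolute, +{relative_gain:.1f}% relative")
# 46.7% -> 56.7%
# +10.0 points absolute, +21.4% relative
```

Keep in mind that the 150 evaluations cover 30 distinct problems repeated 5 times, so the 15 additional correct answers are spread across repeated attempts rather than 15 new problems.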