In this example, GEPA evolves an entire DSPy program rather than just its instructions, which includes modifying the program's structure and dataflow. Starting from a simple dspy.ChainOfThought module for MATH questions, we will use GEPA to turn it into a full DSPy program.

In [1]:
import os

os.environ["OPENAI_API_KEY"] = input("OPENAI_API_KEY: ")

In [2]:
import dspy

In [None]:
import random

from dspy.datasets import MATH

dataset = MATH(subset="algebra")

# Shuffle the train and dev sets
random.Random(0).shuffle(dataset.train)
random.Random(0).shuffle(dataset.dev)

print(len(dataset.train), len(dataset.dev), len(dataset.test))

350 350 487


Let's inspect an example from the training set.

In [4]:
example = dataset.train[0]
print("Question:", example.question)
print("Answer:", example.answer)

Question: The doctor has told Cal O'Ree that during his ten weeks of working out at the gym, he can expect each week's weight loss to be $1\%$ of his weight at the end of the previous week. His weight at the beginning of the workouts is $244$ pounds. How many pounds does he expect to weigh at the end of the ten weeks? Express your answer to the nearest whole number.
Answer: 221
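As a quick sanity check on this example, the expected answer can be reproduced with a few lines of arithmetic: losing 1% per week means keeping 99% of the previous week's weight, compounded over ten weeks. The variable names below are illustrative only and are not part of the dataset.

```python
# Each week Cal keeps 99% of his weight from the end of the previous week.
start_weight = 244.0
weekly_factor = 0.99
weeks = 10

# Compound the weekly factor over all ten weeks.
final_weight = start_weight * weekly_factor ** weeks

# Round to the nearest whole number, as the question asks.
print(round(final_weight))
```

The unrounded value is roughly 220.67 pounds, which rounds to the dataset's answer of 221.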


Let's define a simple DSPy program to solve this task.

Unlike dspy.GEPA, which takes an instantiated DSPy module as input, here we want to evolve the full DSPy program. A candidate is therefore the program's source code as a string. The seed program does not need to be sophisticated; it only needs to demonstrate the expected input/output interface and, if applicable, the available tools. You can also include additional information about the environment as a comment.

In [5]:
program_src = """import dspy
program = dspy.ChainOfThought("question -> answer")"""
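To make the idea of "a candidate is source code as a string" concrete, here is a minimal sketch of how such a string could be turned into a callable object. It is an assumption for illustration: this notebook does not show how DspyAdapter loads candidates, `load_program` is a hypothetical helper, and the candidate below uses a plain function instead of a DSPy module so that the sketch runs without any dependencies.

```python
# A stand-in candidate: like the seed program, it binds a variable named `program`.
candidate_src = """
def solve(question):
    return "42"

program = solve
"""


def load_program(src):
    """Hypothetical loader: execute the source and return its `program` object."""
    namespace = {}
    exec(src, namespace)  # only run source you trust
    if "program" not in namespace:
        raise ValueError("candidate source must define a variable named 'program'")
    return namespace["program"]


program = load_program(candidate_src)
print(program("What is 6 * 7?"))
```

Because the whole source string is the unit being optimized, a rewritten candidate is free to add classes, signatures, or extra modules, as long as it still defines `program`.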

GEPA interfaces with external frameworks through an adapter. In this case, we integrate GEPA with a DspyAdapter.

In [6]:
from gepa.adapters.dspy_full_program_adapter.full_program_adapter import DspyAdapter

In [7]:
def metric_fn(example, pred, trace=None):
    score = dataset.metric(example, pred)
    if score:
        feedback_text = f"The provided answer '{pred.answer}' is correct."
    else:
        feedback_text = f"The provided answer '{pred.answer}' is incorrect. The correct answer is '{example.answer}'. Here's the step by step solution:\n{example.reasoning}"
    return dspy.Prediction(score=score, feedback=feedback_text)
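To see what this feedback looks like, the sketch below mirrors the logic of `metric_fn` with dependency-free stand-ins. These are assumptions for illustration: `exact_match` is a hypothetical replacement for `dataset.metric` (whose real comparison may be more lenient), `feedback_for` is a hypothetical name, and a plain dict is returned in place of `dspy.Prediction`.

```python
from types import SimpleNamespace


def exact_match(example, pred):
    # Hypothetical stand-in for dataset.metric: plain string equality.
    return example.answer.strip() == pred.answer.strip()


def feedback_for(example, pred):
    # Same branching as metric_fn: a short confirmation when correct,
    # otherwise the gold answer plus the reference solution.
    score = exact_match(example, pred)
    if score:
        feedback = f"The provided answer '{pred.answer}' is correct."
    else:
        feedback = (
            f"The provided answer '{pred.answer}' is incorrect. "
            f"The correct answer is '{example.answer}'. "
            f"Here's the step by step solution:\n{example.reasoning}"
        )
    return {"score": score, "feedback": feedback}


example = SimpleNamespace(
    answer="221",
    reasoning="244 * 0.99**10 is about 220.67, which rounds to 221.",
)
right = feedback_for(example, SimpleNamespace(answer="221"))
wrong = feedback_for(example, SimpleNamespace(answer="220"))
print(wrong["feedback"])
```

Including the reference solution in the failure case gives the reflection model something concrete to learn from when it proposes the next version of the program.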

In [16]:
reflection_lm = dspy.LM(model="openai/gpt-4.1", max_tokens=32000)  # temperature=1
adapter = DspyAdapter(
    task_lm=dspy.LM(model="openai/gpt-4.1-nano", max_tokens=32000),
    metric_fn=metric_fn,
    num_threads=80,
    reflection_lm=lambda x: reflection_lm(x)[0],
)

Let's evaluate the base program on the test set.

In [17]:
o = adapter.evaluate(dataset.test, {"program": program_src})

2025/08/27 19:21:30 INFO dspy.evaluate.evaluate: Average Metric: 327.0 / 487 (67.1%)


The base program scores 67.1% on the test set (327 of 487 problems).

Let's launch the GEPA optimization.

In [19]:
from gepa import optimize

o = optimize(
    seed_candidate={"program": program_src},
    trainset=dataset.train,
    valset=dataset.dev[:200],
    adapter=adapter,
    reflection_lm=lambda x: reflection_lm(x)[0],
    max_metric_calls=2000,
    display_progress_bar=True,
)

Let's look at the DSPy program found by GEPA.

In [20]:
print(o.best_candidate["program"])

import dspy
from typing import Optional

class MathQAReasoningSignature(dspy.Signature):
    """
    Solve the given math word problem step by step, showing all necessary reasoning and calculations.
    - First, provide a clear, detailed, and logically ordered reasoning chain, using equations and algebraic steps as needed.
    - Then, extract the final answer in the required format, strictly following these rules:
        * If the answer should be a number, output only the number (no units, unless explicitly requested).
        * If the answer should be an algebraic expression, output it in LaTeX math mode (e.g., \frac{h^2}{m}).
        * Do not include explanatory text, units, or extra formatting in the answer field unless the question explicitly requests it.
    Common pitfalls:
        - Including units when not required.
        - Restating the answer with extra words or formatting.
        - Failing to simplify expressions or extract the final answer.
    Edge cases:
        - If 

Now let's evaluate the optimized program on the same test set.

In [21]:
_ = adapter.evaluate(dataset.test, o.best_candidate)

2025/08/27 20:00:35 INFO dspy.evaluate.evaluate: Average Metric: 454.0 / 487 (93.2%)


Test accuracy rises from **67.1% to 93.2%** in just a few rounds of optimization!
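The two evaluation logs above can be compared directly. This short calculation uses only the counts reported in those logs (327 and 454 correct out of 487); the variable names are illustrative.

```python
# Counts taken from the two evaluation logs in this notebook.
total = 487
base_correct = 327
optimized_correct = 454

base_acc = 100 * base_correct / total
optimized_acc = 100 * optimized_correct / total

print(f"base: {base_acc:.1f}%")
print(f"optimized: {optimized_acc:.1f}%")
print(f"gain: {optimized_acc - base_acc:.1f} points")
print(f"errors: {total - base_correct} -> {total - optimized_correct}")
```

In other words, the optimized program solves 127 more test problems than the seed program, and the number of wrong answers drops from 160 to 33.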