# Tutorial: Math Reasoning

Let's walk through a quick example of setting up a `dspy.ChainOfThought` module and optimizing it for answering algebra questions.

Install the latest DSPy via `pip install -U dspy` and follow along.

Let's tell DSPy that we will use OpenAI's `gpt-4o-mini` in our modules. To authenticate, DSPy reads the `OPENAI_API_KEY` environment variable. You can easily swap this out for [other providers or local models](https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb).
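Before running any cell that calls a model, it can help to confirm that the key is actually visible to the Python process. The helper below is a small sketch that uses only the standard library; `has_api_key` is a hypothetical name introduced here for illustration, not part of DSPy.

```python
import os


def has_api_key(env=None, name="OPENAI_API_KEY"):
    """Return True if `name` is set to a non-empty value in `env`.

    `env` defaults to the process environment; passing a plain dict makes
    the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; export it or load it from a .env file first.")
```

If you keep the key in a `.env` file, run this check after the `dotenv.load_dotenv()` cell further down.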

In [1]:
%load_ext autoreload
%autoreload 2


In [2]:
import dspy


gpt4o_mini = dspy.LM('openai/gpt-4o-mini', max_tokens=2000)
dspy.configure(lm=gpt4o_mini)  # we'll use gpt-4o-mini as the default LM, unless otherwise specified



Next, let's load some data examples from the [MATH](https://arxiv.org/abs/2103.03874) benchmark. We'll use the training split for optimization and evaluate the module on a held-out dev set.

Please note that the following step requires this package (inside a notebook cell, use `%pip install` instead):
```bash
pip install git+https://github.com/hendrycks/math.git
```

In [3]:
import dotenv
dotenv.load_dotenv()

True

In [4]:
from dspy.datasets import MATH

dataset = MATH(subset='algebra')
print(len(dataset.train), len(dataset.dev))

350 350


Let's inspect one example from the training set.

In [5]:
example = dataset.train[0]
print("Question:", example.question)
print("Answer:", example.answer)

Question: The doctor has told Cal O'Ree that during his ten weeks of working out at the gym, he can expect each week's weight loss to be $1\%$ of his weight at the end of the previous week. His weight at the beginning of the workouts is $244$ pounds. How many pounds does he expect to weigh at the end of the ten weeks? Express your answer to the nearest whole number.
Answer: 221
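As a quick sanity check on this example, the expected answer can be reproduced with a few lines of plain Python. This sketch applies a 1% loss to the running weight once per week for ten weeks; `weight_after` is a hypothetical helper name used only for illustration.

```python
def weight_after(start, weekly_loss, weeks):
    """Apply a proportional weekly loss to `start` for `weeks` weeks."""
    weight = start
    for _ in range(weeks):
        weight -= weight * weekly_loss  # lose 1% of the previous week's weight
    return weight


final = weight_after(244, 0.01, 10)
print(round(final))  # 221
```

The loop is equivalent to the closed form `244 * 0.99 ** 10`, which is about 220.67 and rounds to the dataset's answer of 221.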


Now let's define our module. It's extremely simple: just a chain-of-thought step that takes a `question` and produces an `answer`.

In [9]:
module = dspy.ChainOfThought("question -> answer")
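To make the `"question -> answer"` string less opaque, here is a toy sketch of how such a signature can be read: input field names on the left of the arrow, output field names on the right. The `parse_signature` function is hypothetical and written only for illustration; it is not DSPy's own parser.

```python
def parse_signature(signature):
    """Split an 'inputs -> outputs' string into two lists of field names."""
    inputs, arrow, outputs = signature.partition("->")
    if not arrow:
        raise ValueError("signature must contain '->'")

    def split_fields(part):
        # Fields are comma-separated; surrounding whitespace is ignored.
        return [name.strip() for name in part.split(",") if name.strip()]

    return split_fields(inputs), split_fields(outputs)


inputs, outputs = parse_signature("question -> answer")
print(inputs, outputs)  # ['question'] ['answer']
```

Note that the prompt captured later in this tutorial lists a `reasoning` output field ahead of `answer`, which is what the chain-of-thought step contributes on top of the declared fields.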

Next, let's set up an evaluator for the zero-shot module above, before prompt optimization.

In [10]:
THREADS = 35
kwargs = dict(num_threads=THREADS, display_progress=True)
evaluate = dspy.Evaluate(devset=dataset.dev, metric=dataset.metric, **kwargs)

gpt4o_mini = dspy.LM('openai/gpt-4o-mini', max_tokens=2000)
gpt4o = dspy.LM('openai/gpt-4o', max_tokens=2000)
r1 = dspy.LM('fireworks_ai/accounts/fireworks/models/deepseek-r1', max_tokens=4000)

with dspy.settings.context(lm=gpt4o):
    gpt4o_uncompiled = evaluate(module)

with dspy.settings.context(lm=r1):
    r1_uncompiled = evaluate(module)


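Conceptually, an evaluator of this kind runs the program on every dev example, scores each prediction with the metric, and reports the average as a percentage. The sketch below shows that idea with a thread pool from the standard library. It is an illustration under simplifying assumptions, not the implementation of `dspy.Evaluate`; `evaluate_program` and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_program(program, devset, metric, num_threads=4):
    """Score `program` on each example in parallel and return the mean as a percentage."""
    if not devset:
        raise ValueError("devset must not be empty")

    def score(example):
        prediction = program(example["question"])
        return float(metric(example, prediction))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score, devset))
    return 100.0 * sum(scores) / len(scores)


# Toy stand-ins: a lookup-table "program" that gets one of four answers wrong.
toy_devset = [
    {"question": "2+2", "answer": "4"},
    {"question": "3*3", "answer": "9"},
    {"question": "10-7", "answer": "3"},
    {"question": "6/2", "answer": "3"},
]
toy_answers = {"2+2": "4", "3*3": "9", "10-7": "3", "6/2": "2"}


def toy_program(question):
    return toy_answers[question]


def exact_match(example, prediction):
    return example["answer"] == prediction


print(evaluate_program(toy_program, toy_devset, exact_match))  # 75.0
```

Threads suit this workload because each call mostly waits on a remote model, which is presumably why the cell above sets `num_threads` as high as 35.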

In [21]:
r1.history[-1]

{'prompt': None,
 'messages': [{'role': 'system',
   'content': 'Your input fields are:\n1. `question` (str)\n\nYour output fields are:\n1. `reasoning` (str)\n2. `answer` (str)\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "reasoning": "{reasoning}",\n  "answer": "{answer}"\n}\n\nIn adhering to this structure, your objective is: \n        Given the fields `question`, produce the fields `answer`.'},
  {'role': 'user',
   'content': '[[ ## question ## ]]\nTwo numbers are independently selected from the set of positive integers less than or equal to 5. What is the probability that the sum of the two numbers is greater than their product? Express your answer as a common fraction.\n\nRespond with a JSON object in the following order of fields: `reasoning`, then `answer`.'}],
 'kwargs': {'tem
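The system message above asks the model for a JSON object with `reasoning` and `answer` fields. The sketch below shows what reading such a reply could look like, using a hand-written reply for the question in the user message; the answer is computed by brute force over all 25 ordered pairs. The `parse_reply` helper and the reply text are hypothetical, and DSPy's own parsing may differ.

```python
import json
from fractions import Fraction
from itertools import product


def parse_reply(raw, fields=("reasoning", "answer")):
    """Parse a JSON reply and check that every expected field is present."""
    data = json.loads(raw)
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError(f"reply is missing fields: {missing}")
    return {name: str(data[name]) for name in fields}


# Brute-force the question: pairs (a, b) from 1..5 whose sum exceeds their product.
pairs = list(product(range(1, 6), repeat=2))
favorable = sum(1 for a, b in pairs if a + b > a * b)
probability = Fraction(favorable, len(pairs))

# A made-up reply in the requested shape (not actual model output).
raw_reply = json.dumps({
    "reasoning": "a + b > a * b holds only when at least one number is 1.",
    "answer": str(probability),
})

parsed = parse_reply(raw_reply)
print(parsed["answer"])  # 9/25
```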

And lastly let's optimize our module with `dspy.MIPROv2`. We run the optimizer twice, once with GPT-4o and once with DeepSeek-R1. In each run, the same model serves as the task model (the LM whose prompt is being optimized), the teacher model (used to bootstrap reasoning demonstrations at optimization time), and the prompt model (used to craft instructions).

We will also specify `max_bootstrapped_demos=4`, which means we want at most four bootstrapped examples in the prompt, and `max_labeled_demos=4`, which means that, in total between bootstrapped and pre-labeled examples, we want at most four.
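One way to picture how the two caps interact, following the description above: bootstrapped demonstrations are taken first, up to their own cap, and labeled examples then fill whatever room remains under the overall cap. The sketch below encodes that reading. `select_demos` is a hypothetical helper for illustration and is not DSPy's selection code.

```python
def select_demos(bootstrapped, labeled, max_bootstrapped_demos=4, max_labeled_demos=4):
    """Combine demos under the two caps described in the text (illustrative only)."""
    # Take bootstrapped demos first, up to their own cap.
    demos = list(bootstrapped[:max_bootstrapped_demos])
    # Fill any remaining room, up to the overall cap, with labeled examples.
    remaining = max(0, max_labeled_demos - len(demos))
    demos.extend(labeled[:remaining])
    return demos


print(select_demos(["b1", "b2", "b3"], ["l1", "l2", "l3", "l4", "l5"]))
# ['b1', 'b2', 'b3', 'l1']
```

With both caps set to four, a run that bootstraps four full traces leaves no room for labeled examples, while a run that bootstraps only one leaves room for three.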

In [29]:
kwargs_4o = dict(num_threads=THREADS, task_model=gpt4o, teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o)
kwargs_r1 = dict(num_threads=THREADS, task_model=r1, teacher_settings=dict(lm=r1), prompt_model=r1)
optimizer_4o = dspy.MIPROv2(metric=dataset.metric, auto="medium", **kwargs_4o)
optimizer_r1 = dspy.MIPROv2(metric=dataset.metric, auto="medium", **kwargs_r1)


kwargs = dict(requires_permission_to_run=False, max_bootstrapped_demos=4, max_labeled_demos=4)
optimized_module_4o = optimizer_4o.compile(module, trainset=dataset.train, **kwargs)
optimized_module_r1 = optimizer_r1.compile(module, trainset=dataset.train, **kwargs)

with dspy.settings.context(lm=gpt4o):
    gpt4o_compiled = evaluate(optimized_module_4o)
with dspy.settings.context(lm=r1):
    r1_compiled = evaluate(optimized_module_r1)

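# Print both scores; a bare expression only displays the last value in a cell.
print("GPT-4o compiled:", gpt4o_compiled)
print("R1 compiled:", r1_compiled)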

2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 19
valset size: 280

2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of demonstrations...


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


  6%|▌         | 4/70 [00:00<00:00, 723.03it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/19


  6%|▌         | 4/70 [00:00<00:00, 731.26it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 5/19


  6%|▌         | 4/70 [00:00<00:00, 934.09it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/19


  6%|▌         | 4/70 [00:00<00:00, 974.68it/s]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 7/19


  7%|▋         | 5/70 [00:00<00:00, 1005.39it/s]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 8/19


  1%|▏         | 1/70 [00:00<00:00, 819.68it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/19


  3%|▎         | 2/70 [00:00<00:00, 1050.55it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/19


  6%|▌         | 4/70 [00:00<00:00, 1108.36it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 11/19


  1%|▏         | 1/70 [00:00<00:00, 946.15it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/19


  3%|▎         | 2/70 [00:00<00:00, 1037.42it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 13/19


  1%|▏         | 1/70 [00:00<00:00, 190.59it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 14/19


  7%|▋         | 5/70 [00:00<00:00, 500.25it/s]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 15/19


  1%|▏         | 1/70 [00:00<00:00, 497.49it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 16/19


  3%|▎         | 2/70 [00:00<00:00, 990.51it/s]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 17/19


  1%|▏         | 1/70 [00:00<00:00, 874.00it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 18/19


  1%|▏         | 1/70 [00:00<00:00, 447.87it/s]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 19/19


  3%|▎         | 2/70 [00:00<00:00, 870.28it/s]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are a skilled mathematician tasked with solving complex mathematical problems. For each given "question", produce a detailed "reasoning" that logically breaks down the problem step by step, and then provide the final "an

Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 209.00 / 280 (74.6%): 100%|██████████| 280/280 [00:00<00:00, 4967.79it/s]

2025/02/02 01:23:56 INFO dspy.evaluate.evaluate: Average Metric: 209 / 280 (74.6%)
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 74.64


2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 28 - Minibatch ==


Average Metric: 20.00 / 25 (80.0%): 100%|██████████| 25/25 [00:00<00:00, 4411.34it/s]

2025/02/02 01:23:56 INFO dspy.evaluate.evaluate: Average Metric: 20 / 25 (80.0%)
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 28 - Minibatch ==



Average Metric: 23.00 / 25 (92.0%): 100%|██████████| 25/25 [00:00<00:00, 4855.19it/s]

2025/02/02 01:23:56 INFO dspy.evaluate.evaluate: Average Metric: 23 / 25 (92.0%)
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 28 - Minibatch ==



Average Metric: 17.00 / 25 (68.0%): 100%|██████████| 25/25 [00:00<00:00, 4352.38it/s]

2025/02/02 01:23:56 INFO dspy.evaluate.evaluate: Average Metric: 17 / 25 (68.0%)
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:56 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 28 - Minibatch ==



Average Metric: 17.00 / 25 (68.0%): 100%|██████████| 25/25 [00:00<00:00, 4755.66it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 17 / 25 (68.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:00<00:00, 4850.48it/s]


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 28 - Minibatch ==


Average Metric: 20.00 / 25 (80.0%): 100%|██████████| 25/25 [00:00<00:00, 4843.98it/s]


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 20 / 25 (80.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 28 - Minibatch ==


Average Metric: 19.00 / 25 (76.0%): 100%|██████████| 25/25 [00:00<00:00, 644.30it/s] 

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 19 / 25 (76.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 12'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 28 - Minibatch ==



Average Metric: 17.00 / 25 (68.0%): 100%|██████████| 25/25 [00:00<00:00, 4200.18it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 17 / 25 (68.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 13'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:00<00:00, 4271.36it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 /28 - Full Evaluation =====
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 92.0) from minibatch trials...



Average Metric: 251.00 / 280 (89.6%): 100%|██████████| 280/280 [00:00<00:00, 3028.83it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 251 / 280 (89.6%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 89.64
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 28 - Minibatch ==



Average Metric: 20.00 / 25 (80.0%): 100%|██████████| 25/25 [00:00<00:00, 3730.26it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 20 / 25 (80.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 8'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 13 / 28 - Minibatch ==



Average Metric: 23.00 / 25 (92.0%): 100%|██████████| 25/25 [00:00<00:00, 3726.42it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 23 / 25 (92.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 10'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 28 - Minibatch ==



Average Metric: 21.00 / 25 (84.0%): 100%|██████████| 25/25 [00:00<00:00, 3479.83it/s]


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 21 / 25 (84.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 84.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 2'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 28 - Minibatch ==


Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:00<00:00, 3444.39it/s]


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 28 - Minibatch ==


Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:00<00:00, 3069.78it/s] 


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 28 - Minibatch ==


Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:00<00:00, 2977.64it/s] 


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 28 - Minibatch ==


Average Metric: 23.00 / 25 (92.0%): 100%|██████████| 25/25 [00:00<00:00, 2736.51it/s] 

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 23 / 25 (92.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 9', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 19 / 28 - Minibatch ==



Average Metric: 19.00 / 25 (76.0%): 100%|██████████| 25/25 [00:00<00:00, 3829.58it/s]

2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 19 / 25 (76.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 6'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 28 - Minibatch ==



Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:00<00:00, 3877.15it/s] 


2025/02/02 01:23:57 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64]
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 21 /28 - Full Evaluation =====
2025/02/02 01:23:57 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 96.0) from minibatch trials...


Average Metric: 250.00 / 280 (89.3%): 100%|██████████| 280/280 [00:00<00:00, 3876.82it/s]

2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 250 / 280 (89.3%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:00<00:00, 3661.87it/s]

2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0, 88.0]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 28 - Minibatch ==



Average Metric: 21.00 / 25 (84.0%): 100%|██████████| 25/25 [00:00<00:00, 3813.97it/s]

2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 21 / 25 (84.0%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 84.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0, 88.0, 84.0]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 28 - Minibatch ==



Average Metric: 25.00 / 25 (100.0%): 100%|██████████| 25/25 [00:00<00:00, 3199.32it/s]

2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 25 / 25 (100.0%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0, 88.0, 84.0, 100.0]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 25 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:00<00:00, 4037.33it/s] 


2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 9'].
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0, 88.0, 84.0, 100.0, 88.0]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 28 - Minibatch ==


Average Metric: 23.00 / 25 (92.0%): 100%|██████████| 25/25 [00:00<00:00, 3603.48it/s]


2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 23 / 25 (92.0%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 16', 'Predictor 0: Few-Shot Set 1'].
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0, 88.0, 84.0, 100.0, 88.0, 92.0]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 28 - Minibatch ==


Average Metric: 19.00 / 25 (76.0%): 100%|██████████| 25/25 [00:00<00:00, 3463.05it/s]

2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 19 / 25 (76.0%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 11'].
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [80.0, 92.0, 68.0, 68.0, 88.0, 80.0, 76.0, 68.0, 88.0, 80.0, 92.0, 84.0, 96.0, 88.0, 96.0, 92.0, 76.0, 96.0, 88.0, 84.0, 100.0, 88.0, 92.0, 76.0]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64


2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 28 /28 - Full Evaluation =====
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 100.0) from minibatch trials...


Average Metric: 249.00 / 280 (88.9%): 100%|██████████| 280/280 [00:00<00:00, 4097.99it/s]

2025/02/02 01:23:58 INFO dspy.evaluate.evaluate: Average Metric: 249 / 280 (88.9%)
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.64, 89.29, 88.93]
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.64
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 89.64!
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 19
valset size: 280

2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/02/02 01:23:58 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=19 sets of 


Bootstrapping set 1/19
Bootstrapping set 2/19
Bootstrapping set 3/19


  6%|▌         | 4/70 [02:18<38:05, 34.63s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/19


  6%|▌         | 4/70 [01:10<19:20, 17.58s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 5/19


  6%|▌         | 4/70 [01:15<20:47, 18.91s/it]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/19


  6%|▌         | 4/70 [01:42<28:17, 25.73s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 7/19


  7%|▋         | 5/70 [02:01<26:16, 24.25s/it]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 8/19


  1%|▏         | 1/70 [00:11<12:56, 11.26s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/19


  6%|▌         | 4/70 [01:49<30:07, 27.38s/it]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 10/19


  6%|▌         | 4/70 [01:37<26:49, 24.39s/it]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 11/19


  1%|▏         | 1/70 [00:12<14:33, 12.65s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 12/19


  4%|▍         | 3/70 [00:46<17:14, 15.45s/it]


Bootstrapped 2 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 13/19


  1%|▏         | 1/70 [00:28<32:40, 28.41s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 14/19


  7%|▋         | 5/70 [01:38<21:17, 19.65s/it]


Bootstrapped 4 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.
Bootstrapping set 15/19


  1%|▏         | 1/70 [00:12<14:44, 12.82s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 16/19


  3%|▎         | 2/70 [00:33<18:48, 16.59s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 17/19


  1%|▏         | 1/70 [00:23<27:34, 23.98s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 18/19


  3%|▎         | 2/70 [01:12<40:50, 36.03s/it]  


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 19/19


  1%|▏         | 1/70 [00:50<57:57, 50.40s/it]
2025/02/02 01:42:23 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/02/02 01:42:23 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/02/02 01:42:23 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...



Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


2025/02/02 01:54:33 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/02/02 01:54:33 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

2025/02/02 01:54:33 INFO dspy.teleprompt.mipro_optimizer_v2: 1: You are a mathematics tutor specializing in structured problem solving. For each question: 1) Analyze the problem type and required mathematical domains (algebra, number theory, etc.) 2) Break down the solution using step-by-step logic with proper formula application 3) Explicitly show intermediate calculations while maintaining dimensional consistency 4) Verify solutions meet any stated constraints (coprime requirements, integer results etc.) 5) Present your final answer in boxed notation following all problem-specific formatting rules. Ensure reasoning demonstrates conceptual understanding rather than just procedural execution.

2025/02/02 01:54:33 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Solve the

Average Metric: 209.00 / 280 (74.6%): 100%|██████████| 280/280 [00:00<00:00, 4584.08it/s]

2025/02/02 01:54:34 INFO dspy.evaluate.evaluate: Average Metric: 209 / 280 (74.6%)
2025/02/02 01:54:34 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 74.64

2025/02/02 01:54:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 28 - Minibatch ==



Average Metric: 23.00 / 25 (92.0%): 100%|██████████| 25/25 [00:39<00:00,  1.58s/it] 

2025/02/02 01:55:13 INFO dspy.evaluate.evaluate: Average Metric: 23 / 25 (92.0%)
2025/02/02 01:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 92.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0]
2025/02/02 01:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:55:13 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:10<00:00,  2.34it/s]

2025/02/02 01:55:24 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:55:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 10', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:55:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0]
2025/02/02 01:55:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:55:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:55:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:10<00:00,  2.45it/s] 

2025/02/02 01:55:34 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:55:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 18'].
2025/02/02 01:55:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0]
2025/02/02 01:55:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:55:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:55:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:07<00:00,  3.47it/s]

2025/02/02 01:55:41 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:55:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 15', 'Predictor 0: Few-Shot Set 2'].
2025/02/02 01:55:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0]
2025/02/02 01:55:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:55:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:55:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 28 - Minibatch ==



Average Metric: 25.00 / 25 (100.0%): 100%|██████████| 25/25 [00:11<00:00,  2.16it/s]

2025/02/02 01:55:53 INFO dspy.evaluate.evaluate: Average Metric: 25 / 25 (100.0%)
2025/02/02 01:55:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 100.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/02/02 01:55:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0]
2025/02/02 01:55:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:55:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:55:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 28 - Minibatch ==



Average Metric: 19.00 / 25 (76.0%): 100%|██████████| 25/25 [00:11<00:00,  2.12it/s]

2025/02/02 01:56:05 INFO dspy.evaluate.evaluate: Average Metric: 19 / 25 (76.0%)
2025/02/02 01:56:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1'].
2025/02/02 01:56:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0]
2025/02/02 01:56:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:56:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:56:05 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 28 - Minibatch ==



Average Metric: 21.00 / 25 (84.0%): 100%|██████████| 25/25 [00:28<00:00,  1.16s/it]

2025/02/02 01:56:34 INFO dspy.evaluate.evaluate: Average Metric: 21 / 25 (84.0%)
2025/02/02 01:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 84.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 12'].
2025/02/02 01:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0]
2025/02/02 01:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:56:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 28 - Minibatch ==



Average Metric: 17.00 / 25 (68.0%): 100%|██████████| 25/25 [00:12<00:00,  2.04it/s]

2025/02/02 01:56:46 INFO dspy.evaluate.evaluate: Average Metric: 17 / 25 (68.0%)
2025/02/02 01:56:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 11', 'Predictor 0: Few-Shot Set 13'].
2025/02/02 01:56:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0]
2025/02/02 01:56:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:56:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:56:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:10<00:00,  2.31it/s] 

2025/02/02 01:56:57 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:56:57 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4'].
2025/02/02 01:56:57 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0]
2025/02/02 01:56:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64]
2025/02/02 01:56:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 74.64


2025/02/02 01:56:57 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 /28 - Full Evaluation =====
2025/02/02 01:56:57 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 100.0) from minibatch trials...



Average Metric: 250.00 / 279 (89.6%): 100%|█████████▉| 279/280 [00:45<00:01,  1.42s/it]

2025/02/02 01:58:10 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'If $x$, $y$, and $z$ are positive integers such that $6xyz+30xy+21xz+2yz+105x+10y+7z=812$, find $x+y+z$.', 'reasoning': "Usually when we apply Simon's Favorite Factoring Trick, we have two variables. Maybe we can find an adaptation for three variables. We notice that four of the terms on the left hand side have a factor of $z$ in them, so we can factor it out as: $$z(6xy+21x+2y+7)+30xy+105x+10y=812.$$This looks promising! Add $35$ to each side and continue factoring: \\begin{align*}\nz(6xy+21x+2y+7)+30xy+105x+10y+35&=812+35 \\quad \\Rightarrow \\\\\nz(6xy+21x+2y+7)+5(6xy+21x+2y+7)&=812+35 \\quad \\Rightarrow \\\\\n(z+5)(6xy+21x+2y+7)&=847.\n\\end{align*}Now we can proceed with the two-variable version of Simon's Favorite Factoring Trick on the remaining four-term factor: \\begin{align*}\n(z+5)(3x(2y+7)+2y+7)&=847 \\quad \\Rightarrow \\\\\n(z+5)(3x+1)(2y+7)&=847.\n\\end{align*}The prime factor

Average Metric: 250.00 / 279 (89.6%): 100%|██████████| 280/280 [01:13<00:00,  3.83it/s]

2025/02/02 01:58:10 INFO dspy.evaluate.evaluate: Average Metric: 250.0 / 280 (89.3%)
2025/02/02 01:58:10 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 89.29
2025/02/02 01:58:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:58:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29
2025/02/02 01:58:10 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/02/02 01:58:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:07<00:00,  3.29it/s]

2025/02/02 01:58:18 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:58:18 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 3'].
2025/02/02 01:58:18 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0]
2025/02/02 01:58:18 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:58:18 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:58:18 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 13 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:08<00:00,  2.86it/s]

2025/02/02 01:58:27 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:58:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 12', 'Predictor 0: Few-Shot Set 7'].
2025/02/02 01:58:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0]
2025/02/02 01:58:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:58:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:58:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 28 - Minibatch ==



Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:09<00:00,  2.53it/s] 

2025/02/02 01:58:37 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 01:58:37 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 01:58:37 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0]
2025/02/02 01:58:37 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:58:37 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:58:37 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 28 - Minibatch ==



Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:40<00:00,  1.63s/it] 

2025/02/02 01:59:17 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 01:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 01:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0]
2025/02/02 01:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 28 - Minibatch ==



Average Metric: 20.00 / 25 (80.0%): 100%|██████████| 25/25 [00:12<00:00,  1.98it/s]

2025/02/02 01:59:30 INFO dspy.evaluate.evaluate: Average Metric: 20 / 25 (80.0%)
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 80.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 6'].
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0]
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 28 - Minibatch ==



Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:00<00:00, 4504.58it/s] 

2025/02/02 01:59:30 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 18'].
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0]
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:59:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:10<00:00,  2.44it/s] 

2025/02/02 01:59:40 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:59:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 18', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 01:59:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0]
2025/02/02 01:59:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:59:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:59:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 19 / 28 - Minibatch ==



Average Metric: 19.00 / 25 (76.0%): 100%|██████████| 25/25 [00:09<00:00,  2.64it/s]

2025/02/02 01:59:50 INFO dspy.evaluate.evaluate: Average Metric: 19 / 25 (76.0%)
2025/02/02 01:59:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 76.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 14'].
2025/02/02 01:59:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0]
2025/02/02 01:59:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:59:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:59:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:09<00:00,  2.70it/s] 

2025/02/02 01:59:59 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 01:59:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 15'].
2025/02/02 01:59:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0]
2025/02/02 01:59:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29]
2025/02/02 01:59:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 01:59:59 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 21 /28 - Full Evaluation =====
2025/02/02 01:59:59 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 96.0) from minibatch trials...



Average Metric: 250.00 / 279 (89.6%): 100%|█████████▉| 279/280 [00:51<00:02,  2.71s/it]

2025/02/02 02:00:59 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'If $x$, $y$, and $z$ are positive integers such that $6xyz+30xy+21xz+2yz+105x+10y+7z=812$, find $x+y+z$.', 'reasoning': "Usually when we apply Simon's Favorite Factoring Trick, we have two variables. Maybe we can find an adaptation for three variables. We notice that four of the terms on the left hand side have a factor of $z$ in them, so we can factor it out as: $$z(6xy+21x+2y+7)+30xy+105x+10y=812.$$This looks promising! Add $35$ to each side and continue factoring: \\begin{align*}\nz(6xy+21x+2y+7)+30xy+105x+10y+35&=812+35 \\quad \\Rightarrow \\\\\nz(6xy+21x+2y+7)+5(6xy+21x+2y+7)&=812+35 \\quad \\Rightarrow \\\\\n(z+5)(6xy+21x+2y+7)&=847.\n\\end{align*}Now we can proceed with the two-variable version of Simon's Favorite Factoring Trick on the remaining four-term factor: \\begin{align*}\n(z+5)(3x(2y+7)+2y+7)&=847 \\quad \\Rightarrow \\\\\n(z+5)(3x+1)(2y+7)&=847.\n\\end{align*}The prime factor

Average Metric: 250.00 / 279 (89.6%): 100%|██████████| 280/280 [00:59<00:00,  4.73it/s]

2025/02/02 02:00:59 INFO dspy.evaluate.evaluate: Average Metric: 250.0 / 280 (89.3%)
2025/02/02 02:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29
2025/02/02 02:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/02/02 02:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:15<00:00,  1.59it/s] 

2025/02/02 02:01:14 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 02:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 14', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 02:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0, 88.0]
2025/02/02 02:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 02:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:10<00:00,  2.32it/s] 

2025/02/02 02:01:25 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 02:01:25 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5'].
2025/02/02 02:01:25 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0, 88.0, 88.0]
2025/02/02 02:01:25 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:01:25 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 02:01:25 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 28 - Minibatch ==



Average Metric: 21.00 / 25 (84.0%): 100%|██████████| 25/25 [00:09<00:00,  2.56it/s]

2025/02/02 02:01:35 INFO dspy.evaluate.evaluate: Average Metric: 21 / 25 (84.0%)
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 84.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 9'].
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0, 88.0, 88.0, 84.0]
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 25 / 28 - Minibatch ==



Average Metric: 24.00 / 25 (96.0%): 100%|██████████| 25/25 [00:00<00:00, 2906.25it/s]

2025/02/02 02:01:35 INFO dspy.evaluate.evaluate: Average Metric: 24 / 25 (96.0%)
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 96.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0, 88.0, 88.0, 84.0, 96.0]
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 02:01:35 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:10<00:00,  2.33it/s] 

2025/02/02 02:01:46 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 02:01:46 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 13', 'Predictor 0: Few-Shot Set 16'].
2025/02/02 02:01:46 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0, 88.0, 88.0, 84.0, 96.0, 88.0]
2025/02/02 02:01:46 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:01:46 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 02:01:46 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 28 - Minibatch ==



Average Metric: 22.00 / 25 (88.0%): 100%|██████████| 25/25 [00:16<00:00,  1.47it/s] 

2025/02/02 02:02:03 INFO dspy.evaluate.evaluate: Average Metric: 22 / 25 (88.0%)
2025/02/02 02:02:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 88.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 17', 'Predictor 0: Few-Shot Set 17'].
2025/02/02 02:02:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [92.0, 88.0, 88.0, 88.0, 100.0, 76.0, 84.0, 68.0, 88.0, 88.0, 88.0, 96.0, 96.0, 80.0, 96.0, 88.0, 76.0, 88.0, 88.0, 88.0, 84.0, 96.0, 88.0, 88.0]
2025/02/02 02:02:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29]
2025/02/02 02:02:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 89.29


2025/02/02 02:02:03 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 28 /28 - Full Evaluation =====
2025/02/02 02:02:03 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 90.0) from minibatch trials...



Average Metric: 254.00 / 280 (90.7%): 100%|██████████| 280/280 [00:53<00:00,  5.20it/s]

2025/02/02 02:02:57 INFO dspy.evaluate.evaluate: Average Metric: 254 / 280 (90.7%)
2025/02/02 02:02:57 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 90.71
2025/02/02 02:02:57 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [74.64, 89.29, 89.29, 90.71]
2025/02/02 02:02:57 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 90.71
2025/02/02 02:02:57 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/02/02 02:02:57 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 90.71!



Average Metric: 317.00 / 350 (90.6%): 100%|██████████| 350/350 [00:57<00:00,  6.04it/s]

2025/02/02 02:03:55 INFO dspy.evaluate.evaluate: Average Metric: 317 / 350 (90.6%)



Average Metric: 224.00 / 349 (64.2%): 100%|█████████▉| 349/350 [05:50<00:05,  5.09s/it]

2025/02/02 02:12:13 ERROR dspy.utils.parallelizer: Error processing item Example({'question': 'Find the smallest possible value of $$\\frac{(y-x)^2}{(y-z)(z-x)} + \\frac{(z-y)^2}{(z-x)(x-y)} + \\frac{(x-z)^2}{(x-y)(y-z)},$$ where $x,y,$ and $z$ are distinct real numbers.', 'reasoning': 'Combining all three fractions under a single denominator, the given expression is equal to $$\\frac{(x-y)^3 + (y-z)^3 + (z-x)^3}{(x-y)(y-z)(z-x)}.$$ Consider the numerator as a polynomial in $x$, so that $P(x) = (x-y)^3 + (y-z)^3 + (z-x)^3$ (where we treat $y$ and $z$ as fixed values). It follows that $P(y) = (y-y)^3 + (y-z)^3 + (z-y)^3 = 0$, so $y$ is a root of $P(x) = 0$ and $x-y$ divides into $P(x)$. By symmetry, it follows that $y-z$ and $z-x$ divide into $P(x)$. Since $P$ is a cubic in its variables, it follows that $P = k(x-y)(y-z)(z-x)$, where $k$ is a constant. By either expanding the definition of $P$, or by trying test values (if we take $x = 0, y = -1, z = 1$, we obtain $P = -6 = k \\cdot (-2

Average Metric: 224.00 / 349 (64.2%): 100%|██████████| 350/350 [08:18<00:00,  1.42s/it]

2025/02/02 02:12:13 INFO dspy.evaluate.evaluate: Average Metric: 224.0 / 350 (64.0%)





64.0

In [7]:
# optimized_module_4o.save('gpt4o_compiled.json')
# optimized_module_r1.save('r1_compiled.json')


In [8]:
import matplotlib.pyplot as plt
import numpy as np

plt.clf()

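# Data for the bars
# NOTE: the four score variables below must be assigned from earlier evaluation runs before this cell is executed.
models = ['GPT-4o', 'R1']
uncompiled_scores = [gpt4o_uncompiled, r1_uncompiled]
compiled_scores = [gpt4o_compiled, r1_compiled]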

x = np.arange(len(models))  # Label locations
width = 0.35  # Width of the bars

# Create bars
plt.bar(x - width/2, uncompiled_scores, width, label='Uncompiled')
plt.bar(x + width/2, compiled_scores, width, label='Compiled')

# Customize the plot
plt.xlabel('Models')
plt.ylabel('Score')
plt.title('R1 vs GPT-4o on MATH with MIPROv2')
plt.xticks(x, models)
plt.legend()

plt.show()

NameError: name 'gpt4o_uncompiled' is not defined

<Figure size 640x480 with 0 Axes>

In [51]:
import dspy

chat_adapter = dspy.ChatAdapter()
with dspy.settings.context(lm=r1, adapter=chat_adapter):
    module = dspy.ChainOfThought("question -> answer")
    module(question=example.question)

[autoreload of dspy.adapters.chat_adapter failed: Traceback (most recent call last):
  File "/Users/isaac/projects/dspy/.venv/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 276, in check
    superreload(m, reload, self.old_objects)
  File "/Users/isaac/projects/dspy/.venv/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 500, in superreload
    update_generic(old_obj, new_obj)
  File "/Users/isaac/projects/dspy/.venv/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 397, in update_generic
    update(a, b)
  File "/Users/isaac/projects/dspy/.venv/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 349, in update_class
    if update_generic(old_obj, new_obj):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/isaac/projects/dspy/.venv/lib/python3.12/site-packages/IPython/extensions/autoreload.py", line 397, in update_generic
    update(a, b)
  File "/Users/isaac/projects/dspy/.venv/lib/python3.12/site-packag

TypeError: super(type, obj): obj must be an instance or subtype of type

In [7]:
evaluate(optimized_module)

Average Metric: 310.00 / 350 (88.6%): 100%|██████████| 350/350 [01:31<00:00,  3.84it/s]

2024/11/28 18:59:19 INFO dspy.evaluate.evaluate: Average Metric: 310 / 350 (88.6%)





| | question | example_reasoning | example_answer | pred_reasoning | pred_answer | method |
|---|---|---|---|---|---|---|
| 0 | What is the smallest integer value of $c$ such that the function $... | The given function has a domain of all real numbers if and only if... | 1 | The function \( f(x) = \frac{x^2 + 1}{x^2 - x + c} \) will have a ... | 1 | ✔️ [True] |
| 1 | What is the least value of $x$ that is a solution of $\|{-x+3}\|=7$? | In order to have $\|{-x+3}\| = 7$, we must have $-x + 3 = 7$ or $-x ... | -4 | The equation \( \|{-x+3}\|=7 \) implies two possible cases: 1. \(-x ... | -4 | ✔️ [True] |
| 2 | Evaluate $\left\lceil -\frac{7}{4}\right\rceil$. | $-\frac{7}{4}$ is between $-1$ and $-2$, so $\left\lceil -\frac{7}... | -1 | To evaluate \(\left\lceil -\frac{7}{4}\right\rceil\), we first nee... | -1 | ✔️ [True] |
| 3 | A triangle has vertices at coordinates $(11,1)$, $(2,3)$ and $(3,7... | We must find the distance between each pair of points by using the... | 10 | To find the length of the sides of the triangle formed by the vert... | 10 | ✔️ [True] |
| 4 | Let $f(x) = x + 2$ and $g(x) = 1/f(x)$. What is $g(f(-3))$? | First, we find that $f(-3) = (-3) + 2 = -1$. Then, $$g(f(-3)) = g(... | 1 | To find \( g(f(-3)) \), we first need to evaluate \( f(-3) \). Usi... | 1 | ✔️ [True] |


88.57

Neat. It was pretty straightforward to improve quality from 74% to over 88% on a held-out set here.

That said, for reasoning tasks like this, you will often want to consider more advanced strategies, like:

- A `dspy.ReAct` module with access to a calculator function or `dspy.PythonInterpreter`
- Ensembling multiple optimized prompts with a majority vote (or an Aggregator module) on top
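
To make the majority-vote idea concrete, here is a minimal sketch in plain Python. It is not a DSPy API: `normalize` and `majority_vote` are hypothetical helper names introduced for illustration, and the candidate answers are made up rather than taken from a real run. The assumption is that each optimized program has already produced a string answer for the same question.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Canonicalize an answer so trivially different spellings compare equal."""
    # Drop surrounding whitespace, surrounding `$` math delimiters, and inner spaces.
    return answer.strip().strip("$").replace(" ", "")


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent normalized answer; ties go to the earliest one."""
    if not answers:
        raise ValueError("need at least one candidate answer")
    counts = Counter(normalize(a) for a in answers)
    # Counter.most_common keeps first-seen order among equal counts,
    # so a tie resolves to the answer that appeared first.
    return counts.most_common(1)[0][0]


# Hypothetical answers from three differently optimized programs.
candidates = ["-3", " -3 ", "2"]
print(majority_vote(candidates))
```

Normalizing before counting matters for math answers, where the same value can come back with different spacing or delimiters. A stricter equivalence check (for example, comparing parsed numeric values) would be a natural next step.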


Just to understand what changed, let's view the prompt after optimization.

In [8]:
dspy.inspect_history()





[2024-11-28T18:59:19.176586]

System message:

Your input fields are:
1. `question` (str)

Your output fields are:
1. `reasoning` (str)
2. `answer` (str)

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## answer ## ]]
{answer}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Analyze the `question` provided, and systematically apply mathematical reasoning to derive the `answer`. Ensure to articulate each step of your thought process in the `reasoning` field, detailing how you identify relationships and formulate equations to arrive at the solution.


User message:

[[ ## question ## ]]
If $|4x+2|=10$ and $x<0$, what is the value of $x$?

Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## com