# L4: Optimize a DSPy Agent with a DSPy Optimizer


In [1]:
import os

from helper import get_openai_api_key

# Fetch the key once and expose it to the OpenAI client through the environment.
openai_api_key = get_openai_api_key()
os.environ["OPENAI_API_KEY"] = openai_api_key


In [2]:
import mlflow

In [3]:
from helper import get_mlflow_tracking_uri

mlflow_tracking_uri = get_mlflow_tracking_uri()
mlflow.set_tracking_uri(mlflow_tracking_uri)

In [4]:
mlflow.set_experiment("dspy_course_4")

2025/10/29 10:26:38 INFO mlflow.tracking.fluent: Experiment with name 'dspy_course_4' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/714844631808117306', creation_time=1761733598258, experiment_id='714844631808117306', last_update_time=1761733598258, lifecycle_stage='active', name='dspy_course_4', tags={}>

In [5]:
mlflow.dspy.autolog(log_evals=True, log_compiles=True, log_traces_from_compile=True)

In [6]:
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

## Build a RAG Agent

In [7]:
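def search_wikipedia(query: str) -> list[str]:
    # Retrieve the top-3 passages for the query from a hosted ColBERTv2 index.
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=3)
    return [x["text"] for x in results]

# ReAct agent that answers a question, with search_wikipedia as its only tool.
react = dspy.ReAct("question -> answer", tools=[search_wikipedia])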

In [8]:
import json

def load_examples(path):
    # Each line is one JSON object; "question" is marked as the only input field.
    with open(path, "r") as f:
        return [dspy.Example(**json.loads(line)).with_inputs("question") for line in f]

# Load trainset
trainset = load_examples("trainset.jsonl")

# Load valset
valset = load_examples("valset.jsonl")

In [9]:
# Inspect the first training example.
print(trainset[0])

Example({'question': 'Are Smyrnium and Nymania both types of plant?', 'answer': 'yes'}) (input_keys={'question'})
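
The loading cell assumes that each line of `trainset.jsonl` is a standalone JSON object with `question` and `answer` keys, as the printed example suggests. The sketch below shows just the parsing step using the standard library. The first record is the example printed above; the second is a hypothetical record added only for illustration, and `load_jsonl` is a hypothetical helper name. In the notebook, each parsed dict is then wrapped in `dspy.Example(...)` with `question` marked as the input.

```python
import io
import json

# Illustrative JSONL content. The first record is the example printed above;
# the second is a hypothetical record added only for this sketch.
SAMPLE_JSONL = (
    '{"question": "Are Smyrnium and Nymania both types of plant?", "answer": "yes"}\n'
    '{"question": "What is the capital of France?", "answer": "Paris"}\n'
)

def load_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

records = load_jsonl(io.StringIO(SAMPLE_JSONL))
print(len(records), records[0]["answer"])
```

Skipping blank lines is a small defensive choice: a trailing newline at the end of the file would otherwise raise a JSON decoding error.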


In [10]:
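# MIPROv2 searches over instructions and few-shot demos;
# auto="light" selects the smallest preset search budget.
tp = dspy.MIPROv2(
    metric=dspy.evaluate.answer_exact_match,
    auto="light",
    num_threads=16
)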
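
The optimizer is driven by `dspy.evaluate.answer_exact_match`, which counts a prediction as correct only when its answer matches the gold answer after text normalization. The sketch below is a standalone approximation, assuming SQuAD-style normalization (lowercasing, dropping punctuation and English articles, collapsing whitespace). The helper names `normalize` and `exact_match` are hypothetical, and DSPy's actual implementation may differ in its details.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """Return True only if the normalized strings are identical."""
    return normalize(prediction) == normalize(gold)

print(exact_match("Yes.", "yes"))                  # differs only in case and punctuation
print(exact_match("Yes, both are plants", "yes"))  # extra words break the match
```

A metric this strict penalizes verbose answers, which may be part of why the unoptimized agent scores low on short gold answers such as `yes`.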

In [11]:
# Load a saved in-memory LM cache so previously seen calls can be served without new API requests.
dspy.cache.load_memory_cache("./memory_cache.pkl")

In [12]:
optimized_react = tp.compile(
    react,
    trainset=trainset,
    valset=valset,
    requires_permission_to_run=False,
)

2025/10/29 10:26:41 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID 'abda392434b841e1ba73606aeb7c5de9', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current dspy workflow
2025/10/29 10:26:41 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 20
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2025/10/29 10:26:41 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/10/29 10:26:41 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/10/29 10:26:41 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6



 18%|█▊        | 18/100 [00:01<00:05, 14.00it/s]

Bootstrapped 4 full traces after 18 examples for up to 1 rounds, amounting to 18 attempts.






Bootstrapping set 4/6



  1%|          | 1/100 [00:00<00:04, 23.43it/s]

Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.






Bootstrapping set 5/6



 10%|█         | 10/100 [00:00<00:03, 25.59it/s]

Bootstrapped 4 full traces after 10 examples for up to 1 rounds, amounting to 10 attempts.






Bootstrapping set 6/6



  2%|▏         | 2/100 [00:00<00:03, 24.98it/s]

Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.






2025/10/29 10:26:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/10/29 10:26:43 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/10/29 10:26:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/10/29 10:26:44 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/10/29 10:26:44 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

You are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far.
Your goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`.

To do this, you will interleave next_thought, next_tool_name, and next_tool_args in ea

Average Metric: 31.00 / 100 (31.0%): 100%|██████████| 100/100 [00:03<00:00, 30.64it/s]

2025/10/29 10:26:47 INFO dspy.evaluate.evaluate: Average Metric: 31 / 100 (31.0%)





2025/10/29 10:26:47 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 31.0

2025/10/29 10:26:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 25 - Minibatch ==


🏃 View run eval_full_0 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/d3a84ae93a7f434b91cf58abf19cf366
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 3.00 / 35 (8.6%): 100%|██████████| 35/35 [00:01<00:00, 23.78it/s] 

2025/10/29 10:26:48 INFO dspy.evaluate.evaluate: Average Metric: 3 / 35 (8.6%)





2025/10/29 10:26:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 8.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/10/29 10:26:48 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57]
2025/10/29 10:26:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0]
2025/10/29 10:26:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 31.0


2025/10/29 10:26:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 25 - Minibatch ==


🏃 View run eval_minibatch_0 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/82b39e9f6d95493783b9d3d0345117f4
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:01<00:00, 25.43it/s]

2025/10/29 10:26:50 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/10/29 10:26:50 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/10/29 10:26:50 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43]
2025/10/29 10:26:50 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0]
2025/10/29 10:26:50 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 31.0


2025/10/29 10:26:50 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 25 - Minibatch ==



🏃 View run eval_minibatch_1 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/f7fa36a0cebb43c79e1a60a3daf6387b
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 5.00 / 35 (14.3%): 100%|██████████| 35/35 [00:01<00:00, 26.39it/s]

2025/10/29 10:26:51 INFO dspy.evaluate.evaluate: Average Metric: 5 / 35 (14.3%)





2025/10/29 10:26:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 14.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/10/29 10:26:51 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29]
2025/10/29 10:26:51 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0]
2025/10/29 10:26:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 31.0


2025/10/29 10:26:51 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 25 - Minibatch ==


🏃 View run eval_minibatch_2 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/126ea98329a74daaba08b0e06e9bcb90
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:01<00:00, 26.74it/s]

2025/10/29 10:26:53 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)
2025/10/29 10:26:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/10/29 10:26:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29]
2025/10/29 10:26:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0]
2025/10/29 10:26:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 31.0


2025/10/29 10:26:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 25 - Minibatch ==



🏃 View run eval_minibatch_3 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/63b428a486ec4fae8d37878a59cdb72e
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:01<00:00, 24.38it/s]

2025/10/29 10:26:54 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)
2025/10/29 10:26:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/10/29 10:26:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57]
2025/10/29 10:26:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0]
2025/10/29 10:26:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 31.0


2025/10/29 10:26:54 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 25 - Full Evaluation =====
2025/10/29 10:26:54 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...



🏃 View run eval_minibatch_4 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/cf10d8a473264fb2871490b0ad188777
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 50.00 / 100 (50.0%): 100%|██████████| 100/100 [00:04<00:00, 24.69it/s]

2025/10/29 10:26:58 INFO dspy.evaluate.evaluate: Average Metric: 50 / 100 (50.0%)
2025/10/29 10:26:58 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 50.0
2025/10/29 10:26:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0]
2025/10/29 10:26:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0
2025/10/29 10:26:58 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/10/29 10:26:58 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 25 - Minibatch ==



🏃 View run eval_full_1 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/093015ae369a4ec99d5529f708fb3820
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:01<00:00, 28.71it/s]

2025/10/29 10:26:59 INFO dspy.evaluate.evaluate: Average Metric: 15 / 35 (42.9%)
2025/10/29 10:26:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/10/29 10:26:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86]
2025/10/29 10:26:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0]
2025/10/29 10:26:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:26:59 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 25 - Minibatch ==



🏃 View run eval_minibatch_5 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/ae7cf6de1a074b1bb44bc3b188054ef0
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:01<00:00, 24.53it/s]

2025/10/29 10:27:01 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)
2025/10/29 10:27:01 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/10/29 10:27:01 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29]
2025/10/29 10:27:01 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0]
2025/10/29 10:27:01 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:01 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 25 - Minibatch ==



🏃 View run eval_minibatch_6 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/c0bdbcdece6547358908decd0ac6d04d
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 6.00 / 35 (17.1%): 100%|██████████| 35/35 [00:01<00:00, 30.79it/s]

2025/10/29 10:27:02 INFO dspy.evaluate.evaluate: Average Metric: 6 / 35 (17.1%)
2025/10/29 10:27:02 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 17.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/10/29 10:27:02 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14]
2025/10/29 10:27:02 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0]
2025/10/29 10:27:02 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:02 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 25 - Minibatch ==



🏃 View run eval_minibatch_7 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/fd8aa489abc94392b7c0ea08665d8b8f
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 13.00 / 35 (37.1%): 100%|██████████| 35/35 [00:01<00:00, 20.83it/s]

2025/10/29 10:27:04 INFO dspy.evaluate.evaluate: Average Metric: 13 / 35 (37.1%)
2025/10/29 10:27:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 37.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/10/29 10:27:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14]
2025/10/29 10:27:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0]
2025/10/29 10:27:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 25 - Minibatch ==



🏃 View run eval_minibatch_8 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/e5eca202428741739638a160f9a70952
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:01<00:00, 26.79it/s]

2025/10/29 10:27:05 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)



🏃 View run eval_minibatch_9 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/a3429ced9ddc41a7b77bdfc248def2b1
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306


2025/10/29 10:27:05 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/10/29 10:27:05 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29]
2025/10/29 10:27:05 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0]
2025/10/29 10:27:05 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:05 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 25 - Full Evaluation =====
2025/10/29 10:27:05 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...


Average Metric: 49.00 / 100 (49.0%): 100%|██████████| 100/100 [00:03<00:00, 25.14it/s]

2025/10/29 10:27:09 INFO dspy.evaluate.evaluate: Average Metric: 49 / 100 (49.0%)
2025/10/29 10:27:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0]
2025/10/29 10:27:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0
2025/10/29 10:27:09 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/10/29 10:27:09 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 25 - Minibatch ==



🏃 View run eval_full_2 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/a9d04ffef9064d9d9bbcee0703786d20
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:01<00:00, 25.37it/s]

2025/10/29 10:27:11 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/10/29 10:27:11 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].



🏃 View run eval_minibatch_10 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/673030843e7a46a6ae9c181ce58c29ed
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306


2025/10/29 10:27:11 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43]
2025/10/29 10:27:11 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0]
2025/10/29 10:27:11 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:11 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 25 - Minibatch ==


Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:01<00:00, 24.43it/s]

2025/10/29 10:27:12 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)



🏃 View run eval_minibatch_11 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/9c941b5dece9443dbee1b1be4ee3c252
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306


2025/10/29 10:27:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 1'].
2025/10/29 10:27:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43]
2025/10/29 10:27:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0]
2025/10/29 10:27:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 25 - Minibatch ==


Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:01<00:00, 19.58it/s]

2025/10/29 10:27:14 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)





2025/10/29 10:27:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 4'].
2025/10/29 10:27:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29]
2025/10/29 10:27:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0]
2025/10/29 10:27:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 25 - Minibatch ==


🏃 View run eval_minibatch_12 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/6b90e3a2f93743b09eddb80ed2e000ae
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:01<00:00, 26.05it/s]

2025/10/29 10:27:15 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)





2025/10/29 10:27:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].
2025/10/29 10:27:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57]
2025/10/29 10:27:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0]
2025/10/29 10:27:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 25 - Minibatch ==


🏃 View run eval_minibatch_13 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/a5e80a907e954724a38bd0695655a6e2
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [00:01<00:00, 26.75it/s]

2025/10/29 10:27:17 INFO dspy.evaluate.evaluate: Average Metric: 20 / 35 (57.1%)





2025/10/29 10:27:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/10/29 10:27:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57, 57.14]
2025/10/29 10:27:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0]
2025/10/29 10:27:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:17 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 25 - Full Evaluation =====
2025/10/29 10:27:17 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 57.14) from minibatch trials...


🏃 View run eval_minibatch_14 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/4a8c44d38d1e4b3fadcfb7f3610ca67a
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 49.00 / 100 (49.0%): 100%|██████████| 100/100 [00:04<00:00, 24.75it/s]

2025/10/29 10:27:21 INFO dspy.evaluate.evaluate: Average Metric: 49 / 100 (49.0%)





2025/10/29 10:27:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0
2025/10/29 10:27:21 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/10/29 10:27:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 25 - Minibatch ==


🏃 View run eval_full_3 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/ec14b4b01c394b27841db418838cdbf0
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:01<00:00, 28.70it/s]

2025/10/29 10:27:22 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)
2025/10/29 10:27:22 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/10/29 10:27:22 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57, 57.14, 48.57]
2025/10/29 10:27:22 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:22 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:22 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 25 - Minibatch ==



🏃 View run eval_minibatch_15 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/3339eb64fd96432e98e0152682a7b827
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 21.00 / 35 (60.0%): 100%|██████████| 35/35 [00:01<00:00, 24.07it/s]

2025/10/29 10:27:24 INFO dspy.evaluate.evaluate: Average Metric: 21 / 35 (60.0%)
2025/10/29 10:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 60.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].
2025/10/29 10:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57, 57.14, 48.57, 60.0]
2025/10/29 10:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 25 - Minibatch ==



🏃 View run eval_minibatch_16 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/d167d158815448cdbb0cd7bd80d65431
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:01<00:00, 19.06it/s]

2025/10/29 10:27:26 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)
2025/10/29 10:27:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 5'].
2025/10/29 10:27:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57, 57.14, 48.57, 60.0, 51.43]
2025/10/29 10:27:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 25 - Minibatch ==



🏃 View run eval_minibatch_17 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/d6560073f1ba41e89e0f5caebbd73017
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:01<00:00, 24.34it/s]

2025/10/29 10:27:27 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)



🏃 View run eval_minibatch_18 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/b8e39d63f6dd4e53bdb14304fa32643a
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306


2025/10/29 10:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].
2025/10/29 10:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57, 57.14, 48.57, 60.0, 51.43, 51.43]
2025/10/29 10:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:27 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 25 - Minibatch ==


Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:01<00:00, 25.28it/s]

2025/10/29 10:27:28 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)
2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].
2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [8.57, 51.43, 14.29, 54.29, 48.57, 42.86, 54.29, 17.14, 37.14, 54.29, 51.43, 51.43, 54.29, 48.57, 57.14, 48.57, 60.0, 51.43, 51.43, 48.57]
2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 50.0


2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 25 - Full Evaluation =====
2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 60.0) from minibatch trials..


🏃 View run eval_minibatch_19 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/8e6883a6ca6a422fb24b737370786d87
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Average Metric: 54.00 / 100 (54.0%): 100%|██████████| 100/100 [00:04<00:00, 24.88it/s]

2025/10/29 10:27:33 INFO dspy.evaluate.evaluate: Average Metric: 54 / 100 (54.0%)





2025/10/29 10:27:33 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 54.0
2025/10/29 10:27:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0, 54.0]
2025/10/29 10:27:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 54.0
2025/10/29 10:27:33 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/10/29 10:27:33 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 54.0!


🏃 View run eval_full_4 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/360a759d35914c37b473aebff49f3d14
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306


Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

🏃 View run selective-goose-236 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/abda392434b841e1ba73606aeb7c5de9
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
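
The optimizer log above keeps a running list of full-evaluation scores alongside the noisier minibatch scores. As a minimal sketch, assuming the log output has been captured into a string, the list can be pulled out with a regular expression. The helper name `full_eval_scores` and the `sample_log` text are illustrative only and are not part of DSPy or MLflow.

```python
import ast
import re

# Hypothetical helper (not a DSPy API): matches the list printed after
# "Full eval scores so far:" in MIPROv2 log lines like the ones above.
_FULL_EVAL = re.compile(r"Full eval scores so far: (\[[^\]]*\])")


def full_eval_scores(log_text: str) -> list[float]:
    """Return the most recent full-eval score list found in log_text."""
    matches = _FULL_EVAL.findall(log_text)
    if not matches:
        return []
    # The last match is the most up-to-date list in the log.
    return [float(x) for x in ast.literal_eval(matches[-1])]


# Two lines copied from the log output above.
sample_log = """\
2025/10/29 10:27:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0]
2025/10/29 10:27:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [31.0, 50.0, 49.0, 49.0, 54.0]
"""

scores = full_eval_scores(sample_log)
best_idx = max(range(len(scores)), key=scores.__getitem__)
print(f"Full eval scores: {scores}")
print(f"Best full eval: #{best_idx} with score {scores[best_idx]}")
```

In this run the first entry (31.0) matches the score the unoptimized `react` agent gets on the same 100 validation examples, and the last entry (54.0) is the program the optimizer returns.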


In [13]:
optimized_react.react.signature

StringSignature(question, trajectory -> next_thought, next_tool_name, next_tool_args
    instructions="Given the fields `question`, produce the fields `answer`.\n\nYou are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far.\nYour goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`.\n\nTo do this, you will interleave next_thought, next_tool_name, and next_tool_args in each turn, and also when finishing the task.\nAfter each tool call, you receive a resulting observation, which gets appended to your trajectory.\n\nWhen writing next_thought, you may reason about the current situation and plan for future steps.\nWhen selecting the next_tool_name and its next_tool_args, the tool must be one of:\n\n(1) search_wikipedia. It takes arguments {'query': {'type': 'string'}} in JSON format.\n(2) finish, whose description is <desc>Marks the task as complete. That is, signals

In [14]:
optimized_react.react.demos

[Example({'augmented': True, 'question': 'That Darn Cat! and Never a Dull Moment were both produced by what studio?', 'trajectory': '[[ ## thought_0 ## ]]\nI need to find out which studio produced both "That Darn Cat!" and "Never a Dull Moment." This information is likely available on Wikipedia, so I will search for it there.\n\n[[ ## tool_name_0 ## ]]\nsearch_wikipedia\n\n[[ ## tool_args_0 ## ]]\n{"query": "That Darn Cat! and Never a Dull Moment studio production"}\n\n[[ ## observation_0 ## ]]\n[1] «That Darn Cat! | That Darn Cat! is a 1965 American Walt Disney Productions thriller comedy film starring Hayley Mills (in her last of the six films she made for the Walt Disney Studios) and Dean Jones (starring in his first film for Disney) in a story about bank robbers, a kidnapping and a mischievous cat. The film was based on the 1963 novel "Undercover Cat" by Gordon and Mildred Gordon and was directed by Robert Stevenson. The title song was written by the Sherman Brothers and sung by Bo

In [15]:
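# Reusable evaluator: scores a program on the validation set with exact match.
evaluator = dspy.Evaluate(
    metric=dspy.evaluate.answer_exact_match,
    devset=valset,
    display_table=True,     # show a per-example results table
    display_progress=True,  # show a progress bar while evaluating
    num_threads=24,         # evaluate examples in parallel
)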

In [16]:
original_score = evaluator(react)
print(f"Original score: {original_score}")

Average Metric: 31.00 / 100 (31.0%): 100%|██████████| 100/100 [00:03<00:00, 30.17it/s]

2025/10/29 10:27:36 INFO dspy.evaluate.evaluate: Average Metric: 31 / 100 (31.0%)





,question,example_answer,trajectory,reasoning,pred_answer,answer_exact_match
0,"What movie did ""the king of cool"" play in with Bud Ekins as his st...","""The Great Escape""","{'thought_0': 'I need to find out which movie ""the king of cool"" s...","Steve McQueen, known as ""the king of cool,"" starred in the movie ""...","The movie is ""The Great Escape.""",
1,whos family had their own reality tv show. Robert Kardashian or Ma...,their family reality television series,"{'thought_0': 'I need to determine which individual, Robert Kardas...",Robert Kardashian's family is well-known for their reality TV show...,Robert Kardashian's family had their own reality TV show.,
2,Which star in Shadows in Paradise is a Russian ballerina?,Sofya Skya,"{'thought_0': 'I need to find out which star in the film ""Shadows ...","I searched for information about the cast of the 1986 film ""Shadow...",There is no information available about a Russian ballerina in the...,
3,What was the meaning of the name of the man who appointed Amashsai?,comforter,"{'thought_0': ""I need to find out who appointed Amashsai and the m...",Nehemiah appointed Amashsai to work at the temple in Jerusalem. Th...,"The meaning of the name of the man who appointed Amashsai, Nehemia...",
4,"In addition to the Austrian passport, what is needed to gain acces...",national identity card,{'thought_0': 'I need to find out what additional requirements or ...,To gain access to 173 countries and territories with an Austrian p...,"In addition to the Austrian passport, travelers may need to obtain...",
...,...,...,...,...,...,...
95,"What date did the American actress and singer-songwriter, known fo...","April 19, 1994",{'thought_0': 'I need to find out the name of the American actress...,The American actress and singer-songwriter known for her role as P...,2007,
96,What animated creatures were the title characters of the film whic...,seals,{'thought_0': 'I need to identify the animated creatures that were...,The animated creatures that are the title characters of the film b...,The animated creatures that are the title characters of the film b...,
97,The 1925 Saint Mary's Gaels football team represented what private...,Saint Mary's College of California,"{'thought_0': ""I need to find out which private, coeducational col...",The 1925 Saint Mary's Gaels football team represented Saint Mary's...,Saint Mary's College of California,✔️ [True]
98,Were Dorothy Arzner and Richard Wallace both French film directors?,no,{'thought_0': 'I need to verify the nationalities and contribution...,Both Dorothy Arzner and Richard Wallace were confirmed to be Ameri...,"No, neither Dorothy Arzner nor Richard Wallace were French film di...",


🏃 View run sneaky-skink-824 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/10bef02e70f945d8ae901bbc832769ef
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Original score: 31.0
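
Several rows in the table above contain the right fact but still score as a miss, for example `The movie is "The Great Escape."` against the gold answer `"The Great Escape"`. The sketch below is an approximation of a normalized exact-match check, written to show why a verbose answer fails. It is an assumption about the general technique (lowercase, strip punctuation and articles, compare), not a copy of `dspy.evaluate.answer_exact_match`, and the names `normalize_text` and `exact_match_sketch` are illustrative.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_sketch(prediction: str, gold: str) -> bool:
    """Illustrative stand-in for an exact-match metric (not the DSPy code)."""
    return normalize_text(prediction) == normalize_text(gold)


gold = '"The Great Escape"'
verbose = 'The movie is "The Great Escape."'  # unoptimized agent, row 0
concise = "The Great Escape"                  # a short answer to the same question

print(exact_match_sketch(verbose, gold))  # extra words break the match
print(exact_match_sketch(concise, gold))  # quotes and casing do not
```

Under a metric like this, much of the gain from optimization can come from the agent learning to answer with a short span instead of a full sentence.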


In [17]:
optimized_score = evaluator(optimized_react)
print(f"Optimized score: {optimized_score}")

Average Metric: 54.00 / 100 (54.0%): 100%|██████████| 100/100 [00:04<00:00, 21.86it/s]

2025/10/29 10:27:41 INFO dspy.evaluate.evaluate: Average Metric: 54 / 100 (54.0%)





,question,example_answer,trajectory,reasoning,pred_answer,answer_exact_match
0,"What movie did ""the king of cool"" play in with Bud Ekins as his st...","""The Great Escape""","{'thought_0': 'I need to find out which movie ""the king of cool"" s...",I found that Bud Ekins was Steve McQueen's stunt double in the fil...,The Great Escape,✔️ [True]
1,whos family had their own reality tv show. Robert Kardashian or Ma...,their family reality television series,{'thought_0': 'I need to find out which family had their own reali...,"The Kardashian family, associated with Robert Kardashian, has thei...",Robert Kardashian,
2,Which star in Shadows in Paradise is a Russian ballerina?,Sofya Skya,"{'thought_0': 'I need to find out which star in ""Shadows in Paradi...","In my search for the cast of ""Shadows in Paradise,"" I found that t...",Sofya Skya,✔️ [True]
3,What was the meaning of the name of the man who appointed Amashsai?,comforter,"{'thought_0': ""I need to find out who appointed Amashsai and the m...","Amashsai was appointed by Nehemiah, and the name Amasai, which is ...","""Burdensome""",
4,"In addition to the Austrian passport, what is needed to gain acces...",national identity card,{'thought_0': 'I need to find out what additional requirements are...,The search results indicate that Austrian citizens have visa-free ...,"A valid Austrian passport, and potentially a visa or health docume...",
...,...,...,...,...,...,...
95,"What date did the American actress and singer-songwriter, known fo...","April 19, 1994",{'thought_0': 'I need to find out the release date of the first al...,I found that the American actress and singer-songwriter Katey Saga...,"April 19, 1994",✔️ [True]
96,What animated creatures were the title characters of the film whic...,seals,{'thought_0': 'I need to identify the animated creatures that were...,The question pertains to animated creatures that are the title cha...,"Fairies (specifically Puck, Titania, and Oberon)",
97,The 1925 Saint Mary's Gaels football team represented what private...,Saint Mary's College of California,"{'thought_0': ""I need to find out which private, coeducational col...",The 1925 Saint Mary's Gaels football team represented Saint Mary's...,Saint Mary's College of California,✔️ [True]
98,Were Dorothy Arzner and Richard Wallace both French film directors?,no,"{'thought_0': ""I need to determine if both Dorothy Arzner and Rich...","I found that Dorothy Arzner was an American film director, and Ric...","No, neither Dorothy Arzner nor Richard Wallace were French film di...",


🏃 View run victorious-ox-919 at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306/runs/7817091b4e9e49c0981f05d46d33dc04
🧪 View experiment at: https://s172-29-112-60p8080.lab-aws-production.deeplearning.ai//#/experiments/714844631808117306
Optimized score: 54.0
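
To put the two evaluation runs side by side, a small comparison can be computed from the printed scores. This is a sketch that hardcodes the values shown in the outputs above (31.0 and 54.0) rather than reading them from `original_score` and `optimized_score`, because the return type of `dspy.Evaluate` may differ across DSPy versions.

```python
# Scores copied from the evaluation outputs above (percent of 100 examples).
original = 31.0
optimized = 54.0

absolute_gain = optimized - original             # percentage points
relative_gain = absolute_gain / original * 100   # percent over the baseline

print(f"Absolute gain: {absolute_gain:.1f} points")
print(f"Relative gain: {relative_gain:.1f}%")
```

Both numbers come from the same 100 validation examples that the optimizer used for its full evaluations, so a separate held-out set would give a less optimistic estimate of the improvement.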
