# L4: Optimize DSPy Agent with DSPy Optimizer

In [1]:
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()

# Access your API keys
google_api_key = os.getenv('GOOGLE_API_KEY')
openai_api_key = os.getenv('OPENAI_API_KEY')
# Use the keys
print(f"Google API Key loaded: {bool(google_api_key)}")
print(f"OpenAI API Key loaded: {bool(openai_api_key)}")

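# Re-export the key only if it was found; assigning None to os.environ raises a TypeError.
if openai_api_key:
    os.environ["OPENAI_API_KEY"] = openai_api_key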

Google API Key loaded: True
OpenAI API Key loaded: True


In [2]:
import mlflow

In [14]:
from pathlib import Path
notebook_path = Path.cwd()
root_path = notebook_path.parent
data_path = root_path / "data"
print(notebook_path)

e:\GIT_ROOT\Learning\DLAI-shortcourse_notebooks\courses\dspy\notebooks


In [3]:
def get_mlflow_tracking_uri():
    return "http://localhost:8080"
    #return os.environ.get('DLAI_LOCAL_URL').format(port=8080)


mlflow_tracking_uri = get_mlflow_tracking_uri()
mlflow.set_tracking_uri(mlflow_tracking_uri)

In [None]:
# !mlflow server --host 127.0.0.1 --port 8080

In [4]:
mlflow.set_experiment("dspy_course_4")

2025/07/06 18:47:22 INFO mlflow.tracking.fluent: Experiment with name 'dspy_course_4' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/840480777417013393', creation_time=1751807842407, experiment_id='840480777417013393', last_update_time=1751807842407, lifecycle_stage='active', name='dspy_course_4', tags={}>

In [5]:
mlflow.dspy.autolog(log_evals=True, log_compiles=True, log_traces_from_compile=True)

In [6]:
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

## Build a RAG Agent

In [7]:
def search_wikipedia(query: str) -> list[str]:
    """Return the text of the top 3 Wikipedia abstracts matching `query`."""
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=3)
    return [x["text"] for x in results]

react = dspy.ReAct("question -> answer", tools=[search_wikipedia])
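The agent built above alternates between a thought, a tool call, and an observation. The toy loop below is only a sketch of that pattern and is not DSPy's implementation: `fake_search`, `SCRIPT`, and `run_agent` are hypothetical names, the language model is replaced by a fixed script, and the `finish` step is an assumption about how such a loop ends. The key names (`thought_0`, `tool_name_0`, `tool_args_0`, `observation_0`) mirror the trajectories printed later in this notebook.

```python
from typing import Callable


def fake_search(query: str) -> list[str]:
    """Stand-in for a retrieval tool; returns canned text (hypothetical data)."""
    return [f"passage about {query}"]


# A scripted "policy" standing in for the language model. Each step is
# (thought, tool_name, tool_args); a real agent would generate these.
SCRIPT = [
    ("I should look this up.", "fake_search", {"query": "Smyrnium"}),
    ("I have enough information.", "finish", {}),
]


def run_agent(tools: dict[str, Callable[..., list[str]]], max_iters: int = 5) -> dict:
    """Accumulate a trajectory dict by interleaving thoughts, tool calls, and observations."""
    trajectory = {}
    for i, (thought, tool_name, tool_args) in enumerate(SCRIPT[:max_iters]):
        trajectory[f"thought_{i}"] = thought
        trajectory[f"tool_name_{i}"] = tool_name
        trajectory[f"tool_args_{i}"] = tool_args
        if tool_name == "finish":  # assumed stop signal for this sketch
            trajectory[f"observation_{i}"] = "Completed."
            break
        trajectory[f"observation_{i}"] = tools[tool_name](**tool_args)
    return trajectory


trajectory = run_agent({"fake_search": fake_search})
for key, value in trajectory.items():
    print(key, "->", value)
```

In the real agent, the model reads the trajectory so far and the question at every step, which is why the optimizer can improve results by changing only the instructions and few-shot demos.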



In [15]:
import json

def load_examples(path):
    """Read a JSONL file into dspy.Example objects with `question` as the input field."""
    with open(path, "r", encoding="utf-8") as f:
        return [dspy.Example(**json.loads(line)).with_inputs("question") for line in f if line.strip()]

# Load trainset and valset
trainset = load_examples(data_path / "trainset.jsonl")
valset = load_examples(data_path / "valset.jsonl")

In [16]:
# Overview of the dataset.
print(trainset[0])

Example({'question': 'Are Smyrnium and Nymania both types of plant?', 'answer': 'yes'}) (input_keys={'question'})


In [19]:
tp = dspy.MIPROv2(
    metric=dspy.evaluate.answer_exact_match,
    auto="light",
    num_threads=16
)
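The optimizer scores each candidate program with an exact-match metric, so a prediction only counts when it matches the gold answer after light normalization. The sketch below illustrates the idea with a SQuAD-style normalizer. The helper names `normalize` and `exact_match` are hypothetical, and the normalization rules are an assumption that may differ from what `dspy.evaluate.answer_exact_match` does internally. The two test strings are taken from the evaluation table in this notebook.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(gold: str, predicted: str) -> bool:
    """Return True only when the normalized strings are identical."""
    return normalize(gold) == normalize(predicted)


short_answer = exact_match('"The Great Escape"', "The Great Escape")
long_answer = exact_match(
    "their family reality television series",
    "Robert Kardashian's family had their own reality TV show.",
)
print(short_answer)  # True: quotes and the article are normalized away
print(long_answer)   # False: a full-sentence answer does not match the gold string
```

A metric this strict penalizes verbose answers, which is one reason instructions and demos that steer the agent toward short answers can raise the score substantially.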

In [17]:
dspy.cache.load_memory_cache(data_path/"memory_cache.pkl")

In [20]:
optimized_react = tp.compile(
    react,
    trainset=trainset,
    valset=valset,
    requires_permission_to_run=False,
)

2025/07/06 18:51:57 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '4fb3da48f6a84886bb2d96948876c41e', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current dspy workflow
2025/07/06 18:51:57 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 20
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2025/07/06 18:51:57 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/07/06 18:51:57 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/07/06 18:51:57 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


 15%|█▌        | 15/100 [02:03<11:37,  8.21s/it]

Bootstrapped 4 full traces after 15 examples for up to 1 rounds, amounting to 15 attempts.





Bootstrapping set 4/6


  1%|          | 1/100 [00:11<19:09, 11.61s/it]

Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.





Bootstrapping set 5/6


 10%|█         | 10/100 [01:14<11:14,  7.50s/it]

Bootstrapped 4 full traces after 10 examples for up to 1 rounds, amounting to 10 attempts.





Bootstrapping set 6/6


  2%|▏         | 2/100 [00:10<08:17,  5.07s/it]

Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.





2025/07/06 18:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/07/06 18:55:55 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/07/06 18:56:31 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...

2025/07/06 18:57:28 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/07/06 18:57:28 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `question`, produce the fields `answer`.

You are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far.
Your goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`.

To do this, you will interleave next_thought, next_tool_name, and next_tool_args in ea

Average Metric: 29.00 / 100 (29.0%): 100%|██████████| 100/100 [00:57<00:00,  1.74it/s]

2025/07/06 18:58:27 INFO dspy.evaluate.evaluate: Average Metric: 29 / 100 (29.0%)



🏃 View run eval_full_0 at: http://localhost:8080/#/experiments/840480777417013393/runs/a57e9dd5e91c447c851bb101534e8384
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 18:58:27 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 29.0

2025/07/06 18:58:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 25 - Minibatch ==


Average Metric: 8.00 / 35 (22.9%): 100%|██████████| 35/35 [00:26<00:00,  1.31it/s]

2025/07/06 18:58:55 INFO dspy.evaluate.evaluate: Average Metric: 8 / 35 (22.9%)





2025/07/06 18:58:55 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 22.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/07/06 18:58:55 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86]
2025/07/06 18:58:55 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/07/06 18:58:55 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/07/06 18:58:55 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 25 - Minibatch ==


🏃 View run eval_minibatch_0 at: http://localhost:8080/#/experiments/840480777417013393/runs/d4c30b0f9b874d6a81377fe9d39579c4
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:21<00:00,  1.63it/s]

2025/07/06 18:59:17 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)





2025/07/06 18:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/07/06 18:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57]
2025/07/06 18:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/07/06 18:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/07/06 18:59:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 25 - Minibatch ==


🏃 View run eval_minibatch_1 at: http://localhost:8080/#/experiments/840480777417013393/runs/f22336a914af4da0a5e2112d03f49934
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Average Metric: 8.00 / 35 (22.9%): 100%|██████████| 35/35 [00:25<00:00,  1.37it/s]

2025/07/06 18:59:43 INFO dspy.evaluate.evaluate: Average Metric: 8 / 35 (22.9%)



🏃 View run eval_minibatch_2 at: http://localhost:8080/#/experiments/840480777417013393/runs/2dba3c5ca8d5421c8b23f4c822f9f9fc
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 18:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 22.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 0'].
2025/07/06 18:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86]
2025/07/06 18:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/07/06 18:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/07/06 18:59:43 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 25 - Minibatch ==


Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:20<00:00,  1.72it/s]

2025/07/06 19:00:04 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)





2025/07/06 19:00:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/07/06 19:00:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29]
2025/07/06 19:00:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/07/06 19:00:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/07/06 19:00:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 25 - Minibatch ==


🏃 View run eval_minibatch_3 at: http://localhost:8080/#/experiments/840480777417013393/runs/fa482e5db3024ad8a3978b3dc070c9fc
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:27<00:00,  1.29it/s]

2025/07/06 19:00:32 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)



🏃 View run eval_minibatch_4 at: http://localhost:8080/#/experiments/840480777417013393/runs/f8fc846c9c42431eb1fd293d942ee74f
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:00:32 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 2'].
2025/07/06 19:00:32 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43]
2025/07/06 19:00:32 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0]
2025/07/06 19:00:32 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 29.0


2025/07/06 19:00:32 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 25 - Full Evaluation =====
2025/07/06 19:00:32 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...


Average Metric: 51.00 / 100 (51.0%): 100%|██████████| 100/100 [00:30<00:00,  3.24it/s]

2025/07/06 19:01:03 INFO dspy.evaluate.evaluate: Average Metric: 51 / 100 (51.0%)





2025/07/06 19:01:03 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 51.0


🏃 View run eval_full_1 at: http://localhost:8080/#/experiments/840480777417013393/runs/8cef0585ab93424aae8d8ba7d1223e8e
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:01:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0]
2025/07/06 19:01:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/07/06 19:01:03 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/06 19:01:03 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 25 - Minibatch ==


Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [00:23<00:00,  1.48it/s]

2025/07/06 19:01:27 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)



🏃 View run eval_minibatch_5 at: http://localhost:8080/#/experiments/840480777417013393/runs/9b45efcbd0d044b38865a271f2855736
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:01:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/07/06 19:01:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0]
2025/07/06 19:01:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0]
2025/07/06 19:01:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:01:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 25 - Minibatch ==


Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:25<00:00,  1.37it/s]

2025/07/06 19:01:54 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)





2025/07/06 19:01:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
2025/07/06 19:01:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43]
2025/07/06 19:01:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0]
2025/07/06 19:01:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:01:54 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 25 - Minibatch ==


🏃 View run eval_minibatch_6 at: http://localhost:8080/#/experiments/840480777417013393/runs/a1dd611aa10d446fb617b6b31e35697f
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Average Metric: 8.00 / 35 (22.9%): 100%|██████████| 35/35 [00:29<00:00,  1.18it/s]

2025/07/06 19:02:24 INFO dspy.evaluate.evaluate: Average Metric: 8 / 35 (22.9%)



🏃 View run eval_minibatch_7 at: http://localhost:8080/#/experiments/840480777417013393/runs/941843b2c6244dccbb34fbef17320708
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:02:24 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 22.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 0'].
2025/07/06 19:02:24 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86]
2025/07/06 19:02:24 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0]
2025/07/06 19:02:24 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:02:24 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 25 - Minibatch ==


Average Metric: 8.00 / 35 (22.9%): 100%|██████████| 35/35 [00:29<00:00,  1.20it/s]

2025/07/06 19:02:54 INFO dspy.evaluate.evaluate: Average Metric: 8 / 35 (22.9%)



🏃 View run eval_minibatch_8 at: http://localhost:8080/#/experiments/840480777417013393/runs/f46fd197979d420da2a9719546902a98
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:02:54 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 22.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/07/06 19:02:54 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86]
2025/07/06 19:02:54 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0]
2025/07/06 19:02:54 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:02:54 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 25 - Minibatch ==


Average Metric: 19.00 / 35 (54.3%): 100%|██████████| 35/35 [00:08<00:00,  4.19it/s]

2025/07/06 19:03:03 INFO dspy.evaluate.evaluate: Average Metric: 19 / 35 (54.3%)



🏃 View run eval_minibatch_9 at: http://localhost:8080/#/experiments/840480777417013393/runs/769b5449e3974ef39d79d45a3b9f6c58
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:03:03 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.29 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 2'].
2025/07/06 19:03:03 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29]
2025/07/06 19:03:03 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0]
2025/07/06 19:03:03 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:03:03 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 25 - Full Evaluation =====
2025/07/06 19:03:03 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 54.29) from minibatch trials...


Average Metric: 48.00 / 100 (48.0%): 100%|██████████| 100/100 [00:16<00:00,  6.03it/s]

2025/07/06 19:03:20 INFO dspy.evaluate.evaluate: Average Metric: 48 / 100 (48.0%)



🏃 View run eval_full_2 at: http://localhost:8080/#/experiments/840480777417013393/runs/d45f2c98db164a8388166fc7cbba6bd3
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:03:20 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0]
2025/07/06 19:03:20 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/07/06 19:03:20 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/06 19:03:20 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 25 - Minibatch ==


Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [00:09<00:00,  3.61it/s]

2025/07/06 19:03:30 INFO dspy.evaluate.evaluate: Average Metric: 20 / 35 (57.1%)





2025/07/06 19:03:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/07/06 19:03:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14]
2025/07/06 19:03:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0]
2025/07/06 19:03:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:03:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 25 - Minibatch ==


🏃 View run eval_minibatch_10 at: http://localhost:8080/#/experiments/840480777417013393/runs/672481b6dca74ac980bd9c6cb7d935c7
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:19<00:00,  1.77it/s]

2025/07/06 19:03:51 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)



🏃 View run eval_minibatch_11 at: http://localhost:8080/#/experiments/840480777417013393/runs/651c25dfe4e4427195ff18178c45a09a
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:03:51 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/07/06 19:03:51 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71]
2025/07/06 19:03:51 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0]
2025/07/06 19:03:51 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:03:51 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 25 - Minibatch ==


Average Metric: 16.00 / 35 (45.7%): 100%|██████████| 35/35 [00:20<00:00,  1.73it/s]

2025/07/06 19:04:12 INFO dspy.evaluate.evaluate: Average Metric: 16 / 35 (45.7%)



🏃 View run eval_minibatch_12 at: http://localhost:8080/#/experiments/840480777417013393/runs/cc356dd22a1c49edb409ef340ad946ad
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:04:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.71 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/07/06 19:04:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71]
2025/07/06 19:04:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0]
2025/07/06 19:04:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:04:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 25 - Minibatch ==


Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:21<00:00,  1.63it/s]

2025/07/06 19:04:34 INFO dspy.evaluate.evaluate: Average Metric: 15 / 35 (42.9%)





2025/07/06 19:04:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/07/06 19:04:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86]
2025/07/06 19:04:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0]
2025/07/06 19:04:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:04:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 25 - Minibatch ==


🏃 View run eval_minibatch_13 at: http://localhost:8080/#/experiments/840480777417013393/runs/c085aac4ac7545788eb7df1018bf9227
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:24<00:00,  1.45it/s]

2025/07/06 19:04:58 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)



🏃 View run eval_minibatch_14 at: http://localhost:8080/#/experiments/840480777417013393/runs/4fa7bfc7f9434ed3a8dfb5cee1272020
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:04:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].
2025/07/06 19:04:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86, 48.57]
2025/07/06 19:04:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0]
2025/07/06 19:04:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:04:59 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 19 / 25 - Full Evaluation =====
2025/07/06 19:04:59 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 57.14) from minibatch trials...


Average Metric: 49.00 / 100 (49.0%): 100%|██████████| 100/100 [00:19<00:00,  5.09it/s]

2025/07/06 19:05:19 INFO dspy.evaluate.evaluate: Average Metric: 49 / 100 (49.0%)



🏃 View run eval_full_3 at: http://localhost:8080/#/experiments/840480777417013393/runs/e3da6038c70a49e2a663c09da71de29d
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:05:19 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0]
2025/07/06 19:05:19 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/07/06 19:05:19 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/06 19:05:19 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 25 - Minibatch ==


Average Metric: 18.00 / 35 (51.4%): 100%|██████████| 35/35 [00:08<00:00,  4.13it/s]

2025/07/06 19:05:28 INFO dspy.evaluate.evaluate: Average Metric: 18 / 35 (51.4%)



🏃 View run eval_minibatch_15 at: http://localhost:8080/#/experiments/840480777417013393/runs/afa410074677472a935d6552b7fe8895
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:05:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.43 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 4'].
2025/07/06 19:05:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86, 48.57, 51.43]
2025/07/06 19:05:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0]
2025/07/06 19:05:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:05:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 21 / 25 - Minibatch ==


Average Metric: 20.00 / 35 (57.1%): 100%|██████████| 35/35 [00:05<00:00,  6.88it/s]

2025/07/06 19:05:33 INFO dspy.evaluate.evaluate: Average Metric: 20 / 35 (57.1%)



🏃 View run eval_minibatch_16 at: http://localhost:8080/#/experiments/840480777417013393/runs/ffb9edd3eb5a440b9d46f8ebeb2433c5
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:05:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.14 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/07/06 19:05:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86, 48.57, 51.43, 57.14]
2025/07/06 19:05:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0]
2025/07/06 19:05:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:05:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 25 - Minibatch ==


Average Metric: 17.00 / 35 (48.6%): 100%|██████████| 35/35 [00:05<00:00,  6.89it/s]

2025/07/06 19:05:39 INFO dspy.evaluate.evaluate: Average Metric: 17 / 35 (48.6%)



🏃 View run eval_minibatch_17 at: http://localhost:8080/#/experiments/840480777417013393/runs/8e94dbbb6efd42f99806d8633b360686
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:05:39 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 48.57 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/07/06 19:05:39 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86, 48.57, 51.43, 57.14, 48.57]
2025/07/06 19:05:39 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0]
2025/07/06 19:05:39 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:05:39 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 25 - Minibatch ==


Average Metric: 15.00 / 35 (42.9%): 100%|██████████| 35/35 [00:26<00:00,  1.33it/s]

2025/07/06 19:06:06 INFO dspy.evaluate.evaluate: Average Metric: 15 / 35 (42.9%)



🏃 View run eval_minibatch_18 at: http://localhost:8080/#/experiments/840480777417013393/runs/875e06934ade4989b348911f237573a8
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:06:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.86 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].
2025/07/06 19:06:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86, 48.57, 51.43, 57.14, 48.57, 42.86]
2025/07/06 19:06:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0]
2025/07/06 19:06:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:06:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 25 - Minibatch ==


Average Metric: 14.00 / 35 (40.0%): 100%|██████████| 35/35 [00:05<00:00,  6.70it/s]

2025/07/06 19:06:12 INFO dspy.evaluate.evaluate: Average Metric: 14 / 35 (40.0%)



🏃 View run eval_minibatch_19 at: http://localhost:8080/#/experiments/840480777417013393/runs/4dc955f5f819419bba32af651e5808ef
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:06:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 40.0 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
2025/07/06 19:06:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [22.86, 48.57, 22.86, 54.29, 51.43, 40.0, 51.43, 22.86, 22.86, 54.29, 57.14, 45.71, 45.71, 42.86, 48.57, 51.43, 57.14, 48.57, 42.86, 40.0]
2025/07/06 19:06:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0]
2025/07/06 19:06:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0


2025/07/06 19:06:12 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 25 / 25 - Full Evaluation =====
2025/07/06 19:06:12 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 51.43) from minibatch trials...


Average Metric: 47.00 / 100 (47.0%): 100%|██████████| 100/100 [00:18<00:00,  5.51it/s]

2025/07/06 19:06:31 INFO dspy.evaluate.evaluate: Average Metric: 47 / 100 (47.0%)



🏃 View run eval_full_4 at: http://localhost:8080/#/experiments/840480777417013393/runs/184e5b87215b48489d82e6e2b0f03df3
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


2025/07/06 19:06:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [29.0, 51.0, 48.0, 49.0, 47.0]
2025/07/06 19:06:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 51.0
2025/07/06 19:06:31 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/07/06 19:06:31 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 51.0!


🏃 View run rogue-calf-148 at: http://localhost:8080/#/experiments/840480777417013393/runs/4fb3da48f6a84886bb2d96948876c41e
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393


In [None]:
optimized_react.react.signature

In [21]:
optimized_react.react.demos

[Example({'augmented': True, 'question': 'Konnichiwa was launched in a broadcast that was streamed by the platform that was founded in what city in 2010?', 'trajectory': '[[ ## thought_0 ## ]]\nI need to find out which platform was founded in 2010 and in what city. This information will help me answer the question about the broadcast of Konnichiwa. I will search for platforms that were established in 2010.\n\n[[ ## tool_name_0 ## ]]\nsearch_wikipedia\n\n[[ ## tool_args_0 ## ]]\n{"query": "platform founded in 2010"}\n\n[[ ## observation_0 ## ]]\n[1] «Genius Crowds | Genius Crowds was a crowdsourcing platform founded in 2010 on the principles of collaboration and sharing. It served as a platform for everyday people – students, women, men, moms, dads, DIY\'ers, cooks, and tinkerers - who have a product idea that they don\'t know how to bring to market.»\n[2] «Seedups | SeedUps is an equity crowdfunding platform for seed investment to launch and focuses on startup companies. It was founded

In [22]:
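# Score programs on the validation set with exact-match answers.
evaluator = dspy.Evaluate(
    metric=dspy.evaluate.answer_exact_match,
    devset=valset,
    display_table=True,     # show per-example results
    display_progress=True,  # show a progress bar
    num_threads=24,         # evaluate examples in parallel
)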

In [23]:
original_score = evaluator(react)
print(f"Original score: {original_score}")

Average Metric: 29.00 / 100 (29.0%): 100%|██████████| 100/100 [00:12<00:00,  7.95it/s]

2025/07/06 19:06:47 INFO dspy.evaluate.evaluate: Average Metric: 29 / 100 (29.0%)





index,question,example_answer,trajectory,reasoning,pred_answer,answer_exact_match
0,"What movie did ""the king of cool"" play in with Bud Ekins as his st...","""The Great Escape""","{'thought_0': 'I need to find out which movie ""the king of cool"" s...",Bud Ekins served as Steve McQueen's stunt double in the classic fi...,The Great Escape,✔️ [True]
1,whos family had their own reality tv show. Robert Kardashian or Ma...,their family reality television series,"{'thought_0': 'I need to determine which individual, Robert Kardas...",Robert Kardashian's family is well-known for their reality TV show...,Robert Kardashian's family had their own reality TV show.,
2,Which star in Shadows in Paradise is a Russian ballerina?,Sofya Skya,"{'thought_0': 'I need to find out which star in the film ""Shadows ...","I searched for information about the cast of the film ""Shadows in ...","There is no Russian ballerina in ""Shadows in Paradise.""",
3,What was the meaning of the name of the man who appointed Amashsai?,comforter,{'thought_0': 'I need to gather information about the man who appo...,"The man who appointed Amashsai was Nehemiah, a biblical figure kno...","The meaning of the name of the man who appointed Amashsai, Nehemia...",
4,"In addition to the Austrian passport, what is needed to gain acces...",national identity card,{'thought_0': 'To answer the question about what is needed in addi...,To gain access to 173 countries and territories with an Austrian p...,"In addition to the Austrian passport, travelers may need to obtain...",
...,...,...,...,...,...,...
95,"What date did the American actress and singer-songwriter, known fo...","April 19, 1994",{'thought_0': 'I need to find out the name of the American actress...,The American actress and singer-songwriter known for her role as P...,2007,
96,What animated creatures were the title characters of the film whic...,seals,{'thought_0': 'I need to identify the animated creatures that were...,The question asks for animated creatures that were the title chara...,The animated creatures that were the title characters of the film ...,
97,The 1925 Saint Mary's Gaels football team represented what private...,Saint Mary's College of California,"{'thought_0': ""I need to find out which private, coeducational col...",The 1925 Saint Mary's Gaels football team represented Saint Mary's...,Saint Mary's College of California,✔️ [True]
98,Were Dorothy Arzner and Richard Wallace both French film directors?,no,{'thought_0': 'I need to verify the nationalities and contribution...,Both Dorothy Arzner and Richard Wallace were confirmed to be Ameri...,"No, neither Dorothy Arzner nor Richard Wallace were French film di...",


🏃 View run eval at: http://localhost:8080/#/experiments/840480777417013393/runs/c0848e006ff946a9b8f5425fc95d2f0e
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Original score: 29.0
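Several rows in the table above fail even though the prediction mentions the right entity, because exact match compares whole answers after normalization. The sketch below is a simplified stand-in that assumes SQuAD-style normalization (lowercasing, dropping punctuation and articles). `normalize_text` and `simple_exact_match` are hypothetical helpers, not DSPy's actual implementation of `answer_exact_match`.

```python
import re
import string


def normalize_text(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def simple_exact_match(prediction: str, reference: str) -> bool:
    """Return True only if the normalized strings are identical."""
    return normalize_text(prediction) == normalize_text(reference)


# Row 0: quotes and the article are normalized away, so the answers match.
row0 = simple_exact_match("The Great Escape", '"The Great Escape"')

# Row 1: a full-sentence prediction does not equal the reference string.
row1 = simple_exact_match(
    "Robert Kardashian's family had their own reality TV show.",
    "their family reality television series",
)

print(row0, row1)
```

Under this strict comparison, a program that answers with a short span tends to score better than one that answers in full sentences.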


In [24]:
optimized_score = evaluator(optimized_react)
print(f"Optimized score: {optimized_score}")

Average Metric: 51.00 / 100 (51.0%): 100%|██████████| 100/100 [00:15<00:00,  6.59it/s]

2025/07/06 19:07:03 INFO dspy.evaluate.evaluate: Average Metric: 51 / 100 (51.0%)





index,question,example_answer,trajectory,reasoning,pred_answer,answer_exact_match
0,"What movie did ""the king of cool"" play in with Bud Ekins as his st...","""The Great Escape""","{'thought_0': 'I need to find out which movie ""the king of cool"" s...","Steve McQueen, referred to as ""the king of cool,"" starred in the f...",The Great Escape,✔️ [True]
1,whos family had their own reality tv show. Robert Kardashian or Ma...,their family reality television series,"{'thought_0': 'I need to determine which of the two individuals, R...","Robert Kardashian is part of the Kardashian family, who are famous...",Robert Kardashian,
2,Which star in Shadows in Paradise is a Russian ballerina?,Sofya Skya,"{'thought_0': 'I need to find out which star in the film ""Shadows ...","In the 2010 film ""Shadows in Paradise,"" Sofya Skya is a prominent ...",Sofya Skya,✔️ [True]
3,What was the meaning of the name of the man who appointed Amashsai?,comforter,"{'thought_0': ""I need to find out who appointed Amashsai and the m...","Amashsai was appointed by Nehemiah, and the name ""Nehemiah"" means ...",Yahweh comforts,
4,"In addition to the Austrian passport, what is needed to gain acces...",national identity card,{'thought_0': 'To answer the question about what is needed in addi...,The information gathered indicates that Austrian citizens have vis...,Austrian passport (and possibly a visa for specific countries),
...,...,...,...,...,...,...
95,"What date did the American actress and singer-songwriter, known fo...","April 19, 1994",{'thought_0': 'I need to find out the name of the American actress...,The American actress and singer-songwriter known for her role as P...,2007,
96,What animated creatures were the title characters of the film whic...,seals,{'thought_0': 'I need to identify the animated creatures that were...,The animated creatures that are the title characters of the film b...,"Fairies (Puck, Titania, Oberon)",
97,The 1925 Saint Mary's Gaels football team represented what private...,Saint Mary's College of California,"{'thought_0': ""I need to find out which private, coeducational col...",The 1925 Saint Mary's Gaels football team represented Saint Mary's...,Saint Mary's College of California,✔️ [True]
98,Were Dorothy Arzner and Richard Wallace both French film directors?,no,{'thought_0': 'I need to determine the nationalities of both Dorot...,"Dorothy Arzner was an American film director, and Richard Wallace ...","No, they were not both French film directors.",


🏃 View run eval at: http://localhost:8080/#/experiments/840480777417013393/runs/9abf7ebe6ed748f4b1cbb6772eafa5f9
🧪 View experiment at: http://localhost:8080/#/experiments/840480777417013393
Optimized score: 51.0
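To put the two validation scores side by side, the sketch below computes the absolute and relative gain. The values are copied from the outputs above, and the variable names are illustrative so they do not overwrite `original_score` or `optimized_score`.

```python
# Scores copied from the two evaluation runs above (percent exact match).
baseline_pct = 29.0
optimized_pct = 51.0

# Absolute gain in percentage points, and gain relative to the baseline.
absolute_gain = optimized_pct - baseline_pct
relative_gain = absolute_gain / baseline_pct * 100

print(f"Absolute gain: {absolute_gain:.1f} points")
print(f"Relative gain: {relative_gain:.1f}%")
```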
