# DSPy Agents with Together AI

Let's create a DSPy agent that can search through multiple steps to find information. We will see how well the agent performs when powered by an 8B LLM and then improve it by using a 70B LLM.

Adapted from the [DSPy agents tutorial](https://dspy.ai/tutorials/agents/).

In this tutorial, we'll use Meta's Llama-3.1-8B-Instruct-Turbo and Llama-3.3-70B-Instruct-Turbo models from Together AI.

The 8B model is relatively small and fast, making it a good choice for initial development and testing.
While it may not be as capable as larger models for complex tasks, it provides a good balance of speed and capability.

The 70B model is much more powerful and will serve as our "teacher" model to help improve the performance
of the smaller 8B model through DSPy's optimization techniques.

In the code below, we'll configure the 8B model as our main LM and use the 70B model sparingly as a teacher
to help optimize our agent's performance.

We'll see that we can improve the agent's performance by over 20% using DSPy's MIPRO prompt optimizer.

In [34]:
!pip install -qU dspy

In [18]:
import dspy
#ignore warnings
import warnings, os
warnings.filterwarnings('ignore')
warnings.filterwarnings(action="ignore", category=UserWarning, module="litellm")  # Ignore litellm warnings

TOGETHER_API_KEY = os.getenv('TOGETHER_API_KEY')

llama8b = dspy.LM('together_ai/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo', api_key=TOGETHER_API_KEY, api_base='https://api.together.xyz/v1', temperature=0.7)

llama3_70b = dspy.LM('together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo', api_key=TOGETHER_API_KEY, api_base='https://api.together.xyz/v1', temperature=0.7)

dspy.configure(lm=llama8b)
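
If `TOGETHER_API_KEY` is not set, `os.getenv` returns `None` and the problem only shows up on the first model call. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper introduced here for illustration and is not part of DSPy.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable (None) and an empty string.
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value

# Hypothetical usage in the cell above:
# TOGETHER_API_KEY = require_env("TOGETHER_API_KEY")
```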

In [19]:
llama8b("Say this is a test!", temperature=0.7)  # returns a list with one completion string
#llama8b(messages=[{"role": "user", "content": "Say this is a test!"}])  # equivalent chat-style call

["It looks like you're ready for a test. What kind of test is it? A language proficiency test, a quiz, or something else?"]

We'll load examples from the HoVer (HOppy VERification) dataset, which contains complex multi-hop claims. For each claim, we need to identify the set of Wikipedia pages required to verify or refute it. The claims are intentionally complex, often requiring information from multiple pages to fact-check.

In [20]:
import random
from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
hover = [
    dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
    for x in hover
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]

random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:100], hover[100:200], hover[650:]
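
The list comprehension above deduplicates by `hpqa_id` with a compact idiom: `x not in seen and not seen.add(x)`. The sketch below isolates that idiom on plain dictionaries; `keep_first_per_key` and the sample rows are made up for illustration.

```python
def keep_first_per_key(rows, key):
    """Keep only the first row seen for each value of row[key], preserving order."""
    seen = set()
    # set.add() returns None, so `not seen.add(...)` is always True. It is only
    # evaluated (recording the key) when the membership test before it passes.
    return [r for r in rows if r[key] not in seen and not seen.add(r[key])]

rows = [
    {"hpqa_id": "a", "num_hops": 3},
    {"hpqa_id": "a", "num_hops": 3},  # duplicate id, dropped
    {"hpqa_id": "b", "num_hops": 3},
]
unique_rows = keep_first_per_key(rows, "hpqa_id")
print([r["hpqa_id"] for r in unique_rows])  # ['a', 'b']
```

Because `and` short-circuits, an id is recorded only for rows that pass every earlier condition. In the cell above, that means only 3-hop examples mark their `hpqa_id` as seen.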

In [21]:
# View an example
example = trainset[0]

print("Claim:", example.claim)
print("Correct pages that must be retrieved:", example.titles)

Claim: This director is known for his work on Miss Potter. The Academy of Motion Picture Arts and Sciences presents the award in which he was nominated for his work in "Babe".
Correct pages that must be retrieved: ['Chris Noonan', 'Academy Award for Best Director', 'Miss Potter']


Now we'll set up Wikipedia search functionality using ColBERTv2. This server indexes article abstracts (first paragraphs) from Wikipedia's 2017 dataset, which matches the data used in HoVer. The search will help us find relevant articles to verify claims.

In [22]:
DOCS = {}

def search(query: str, k: int) -> list[str]:
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=k)
    results = [x['text'] for x in results]

    for result in results:
        title, text = result.split(" | ", 1)
        DOCS[title] = text

    return results
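
Each passage returned by the server has the form `Title | text`, and `search` relies on splitting at the first separator only. A small standalone sketch of that parsing step follows; `split_passage` is a hypothetical helper name.

```python
def split_passage(passage: str) -> tuple[str, str]:
    """Split a 'Title | text' passage into (title, text) at the first separator only."""
    title, text = passage.split(" | ", 1)
    return title, text

title, text = split_passage("Fran's Restaurant | Fran's Restaurant is a small chain of restaurants.")
print(title)  # Fran's Restaurant

# maxsplit=1 keeps any later " | " inside the text intact.
print(split_passage("A | b | c"))  # ('A', 'b | c')
```

A passage without the separator would raise `ValueError` during unpacking, so this assumes every result follows the `Title | text` convention.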

In [23]:
search(query='toronto restaurants', k = 3)

["Cuisine in Toronto | The cuisine of Toronto reflects Toronto's size and multicultural diversity. Different ethnic neighbourhoods throughout the city focus on specific cuisines, such as authentic Chinese and Vietnamese found in the city's six Chinatowns, Korean in Koreatown, Greek on The Danforth, Italian cuisine in Little Italy and Corso Italia, and Indian in Little India. Numerous other world cuisines are available throughout the city, including Portuguese, Hungarian, Japanese, and Caribbean. Toronto's large Jewish population has also ensured a variety of Jewish restaurants and delis, with varying adherence to kosher rules. In addition to ethnic cuisines, Toronto is also home to many fine dining establishments and chain restaurants ranging from fast food to casual or upscale dining.",
 "Fran's Restaurant | Fran's Restaurant is a small chain of restaurants based in Toronto, Ontario, Canada. Its first restaurant was a haunt of pianist Glenn Gould.",
 'Bistro 990 | Bistro 990 was a res

In [24]:
# Now, let's use the search function to define two tools for our ReAct agent:

def search_wikipedia(query: str) -> list[str]:
    """Returns top-5 results and then the titles of the top-5 to top-30 results."""

    topK = search(query, 30)
    titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
    return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]

def lookup_wikipedia(title: str) -> str:
    """Returns the text of the Wikipedia page, if it exists."""

    if title in DOCS:
        return DOCS[title]

    results = [x for x in search(title, 10) if x.startswith(title + " | ")]
    if not results:
        return f"No Wikipedia page found for title: {title}"
    return results[0]
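
To see what `search_wikipedia` hands back to the agent, the sketch below reproduces its result shaping on fake passages, so it runs without the ColBERTv2 server. `shape_results` and the fake data are assumptions made for illustration.

```python
def shape_results(passages: list[str], n_full: int = 5) -> list[str]:
    """Keep the first n_full passages whole and summarize the rest by title."""
    full, rest = passages[:n_full], passages[n_full:]
    titles = [f"`{p.split(' | ')[0]}`" for p in rest]
    return full + [f"Other retrieved pages have titles: {', '.join(titles)}."]

# Fake retrieval results standing in for the ColBERTv2 server.
fake = [f"Page {i} | Text of page {i}." for i in range(1, 8)]
shaped = shape_results(fake)

print(len(shaped))   # 6
print(shaped[-1])    # Other retrieved pages have titles: `Page 6`, `Page 7`.
```

The agent sees five full passages plus a single line of extra titles. Any of those titles can then be passed to `lookup_wikipedia`, which first checks the `DOCS` cache filled by `search`.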

## Setting up a Fact-Checking Agent

In this section, we'll create a ReAct agent using DSPy to help with fact verification.
The agent will have a simple but powerful task: given a claim, it will return a list
of relevant Wikipedia article titles that can be used to verify or refute that claim.

The agent will:
1. Take a claim as input
2. Search through Wikipedia articles
3. Return a list[str] of article titles needed for fact-checking

In [25]:
instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)
react = dspy.ReAct(signature, tools=[search_wikipedia, lookup_wikipedia], max_iters=20)

In [26]:
# try our agent out
react(claim="David Gregory was born in 1625.").titles[:3]

['David Gregory (physician)',
 'David Gregory (journalist)',
 'David Gregory (historian)']
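
To make the Thought / Tool / Observation cycle concrete, here is a toy sketch of a ReAct-style loop. It is not DSPy's implementation: `run_agent` is a hypothetical function, `scripted_policy` stands in for the language model, and the tool is a stub.

```python
from typing import Callable

def run_agent(claim: str, policy: Callable, tools: dict, max_iters: int = 20):
    """Alternate between choosing a tool and observing its result until 'finish'."""
    trajectory = []  # list of (tool_name, args, observation)
    for _ in range(max_iters):
        tool_name, args = policy(claim, trajectory)
        if tool_name == "finish":
            break
        observation = tools[tool_name](**args)
        trajectory.append((tool_name, args, observation))
    return trajectory

# Stub tool and scripted policy, used only for illustration.
def fake_search(query: str) -> list[str]:
    return [f"{query} | stub passage"]

def scripted_policy(claim, trajectory):
    # Search once, then finish.
    if not trajectory:
        return "search_wikipedia", {"query": claim}
    return "finish", {}

steps = run_agent(
    "David Gregory was born in 1625.",
    scripted_policy,
    {"search_wikipedia": fake_search},
)
print(len(steps), steps[0][0])  # 1 search_wikipedia
```

In the real agent the policy is the LM, so `max_iters` is what bounds the number of tool calls when the model never chooses to finish.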

Let's create an evaluation metric called top5_recall.

This metric compares the agent's output against a gold standard set of 3 relevant Wikipedia titles.
It calculates what fraction of those gold titles appear in the agent's top 5 predicted titles.
For example, if 2 out of 3 gold titles are found in the top 5, the recall would be 0.67.

In [27]:
def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0
    
    # If we're just doing inference, just measure the recall.
    return recall

evaluate = dspy.Evaluate(devset=devset, metric=top5_recall, num_threads=16, display_progress=True, display_table=5)
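
As a worked example of the metric, the sketch below applies the same computation to plain lists, with no DSPy objects involved. `recall_at_k` is a hypothetical standalone version of the logic in `top5_recall`, and the predicted titles are invented.

```python
def recall_at_k(gold: list[str], predicted: list[str], k: int = 5) -> float:
    """Fraction of gold titles that appear among the first k predicted titles."""
    top_k = predicted[:k]
    return sum(title in top_k for title in gold) / len(gold)

gold = ["Chris Noonan", "Academy Award for Best Director", "Miss Potter"]
predicted = [
    "Miss Potter",
    "Babe (film)",
    "Chris Noonan",
    "Beatrix Potter",
    "Academy Awards",
    "Academy Award for Best Director",  # sixth position, outside the top 5
]

score = recall_at_k(gold, predicted)
print(round(score, 2))  # 0.67
print(score >= 1.0)     # False: this trace would be rejected during bootstrapping
```

Matching is by exact string equality, so a title that differs only in spelling or escaping counts as a miss.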

Let's evaluate our basic ReAct agent using Llama-3.1-8B as the language model.
This will give us a baseline performance to improve upon.

Since this is a relatively small model (8B parameters), it can sometimes fail or give errors.
We'll add error handling to gracefully handle any failures.

In [28]:
warnings.filterwarnings('ignore')
warnings.filterwarnings(action="ignore", category=UserWarning, module="litellm")  # Ignore litellm warnings

def safe_react(claim: str):
    try:
        return react(claim=claim)
    except Exception:
        # Score a failed run as an empty prediction (recall 0) instead of aborting the evaluation.
        return dspy.Prediction(titles=[])

evaluate(safe_react)

Average Metric: 49.33 / 100 (49.3%): 100%|██████████| 100/100 [02:28<00:00,  1.48s/it]

2025/03/09 00:40:21 INFO dspy.evaluate.evaluate: Average Metric: 49.33333333333333 / 100 (49.3%)





| | claim | example_titles | trajectory | reasoning | pred_titles | top5_recall |
|---|---|---|---|---|---|---|
| 0 | The Church of England's movement that inspired the Trinity Episcop... | [Samuel Rickards, Trinity Episcopal Church (Houghton, Michigan), O... | {'thought_0': "We need to find the Wikipedia title related to the ... | To find all Wikipedia titles relevant to verifying (or refuting) t... | ['John Keble', 'John Keble Church, Mill Hill', 'National Apostasy'... | |
| 1 | Red, White & Crüe and this athlete both fight. The french fighter ... | [Bobby Stewart, Red, White &amp; Crüe, Mike Tyson] | {'thought_0': 'We need to find Wikipedia titles related to Red, Wh... | To verify or refute the claim, we need to find Wikipedia titles re... | ['Mötley Crüe', 'Mike Tyson', 'Boxing', 'Red, White & Crüe', 'Bobb... | ✔️ [0.667] |
| 2 | The writer/director/actor from Glen or Glenda and Fernand Rivers s... | [Glen or Glenda, Fernand Rivers, Ed Wood] | {'thought_0': 'We need to find a Wikipedia title that shares the c... | The claim is verified by finding all Wikipedia titles that share t... | ['Joel Hershman', 'Sid Bennett (director)', 'David H. Steinberg', ... | |
| 3 | The film by Sandi Sissel was released before The End of Suburbia. | [Chicken Ranch (film), The End of Suburbia, Sandi Sissel] | {'thought_0': 'We need to find the release dates of the films "The... | To verify the claim that the film by Sandi Sissel was released bef... | ['The End of Suburbia', 'Kailashey Kelenkari', 'The End of Suburbi... | ✔️ [0.333] |
| 4 | The actor who played captain hook in the live production with Tayl... | [Taylor Louderman, Christopher Walken, Peter Pan Live!] | {'thought_0': 'We need to verify the claim, so the first step is t... | The claim made is that the actor who played Captain Hook in the li... | [Bruce Dern] | |


49.33

We get a baseline top-5 recall of about 0.49 (49.3%).
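
One caveat when reading this number: `safe_react` turns every exception into an empty prediction, so failed runs score 0 and look the same as wrong answers in the average. The sketch below shows one way to count such failures. `with_fallback` and `flaky_agent` are hypothetical names, and the agent is a stub rather than the ReAct module.

```python
from functools import wraps

def with_fallback(fn, fallback):
    """Wrap fn so exceptions return fallback() instead of propagating, and count failures."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            wrapper.failures += 1
            return fallback()
    wrapper.failures = 0
    return wrapper

# Stub agent that fails on empty claims (illustration only).
def flaky_agent(claim: str) -> list[str]:
    if not claim:
        raise ValueError("empty claim")
    return [claim.title()]

safe_agent = with_fallback(flaky_agent, fallback=list)
outputs = [safe_agent(claim=c) for c in ["miss potter", ""]]
print(outputs, safe_agent.failures)  # [['Miss Potter'], []] 1
```

With a multi-threaded evaluation the counter is only approximate unless it is guarded by a lock, but it is usually enough to tell whether a low score comes from errors or from poor retrieval.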

Let's try to improve this performance using DSPy's MIPRO optimizer.
MIPRO will automatically optimize the prompts and few-shot examples used by our ReAct agent.
We'll use a more powerful model (Llama-3.3-70B) as the teacher model to help guide the optimization.

Optimizing Llama-3.1-8B this way may take around 20 minutes and make roughly $5 worth of calls to Together AI.

This will give us new prompts and few-shot examples that will improve the agent's performance.
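
Conceptually, the optimizer searches over combinations of candidate instructions and few-shot demo sets. It scores each combination on small minibatches and periodically re-scores the most promising one on the full validation set. The toy sketch below mimics that loop under strong simplifications: it samples candidates uniformly at random where MIPROv2 uses Bayesian optimization, and `toy_prompt_search`, `score_fn`, and the data are all hypothetical.

```python
import random
from statistics import mean

def toy_prompt_search(n_instructions, n_demo_sets, score_fn, valset,
                      num_trials=12, minibatch_size=5, full_eval_every=4, seed=0):
    """Random search over (instruction, demo set) pairs with minibatch scoring."""
    rng = random.Random(seed)
    minibatch_scores = {}  # candidate -> list of minibatch scores
    best, best_full = None, float("-inf")
    for trial in range(1, num_trials + 1):
        cand = (rng.randrange(n_instructions), rng.randrange(n_demo_sets))
        batch = rng.sample(valset, minibatch_size)
        score = mean(score_fn(cand, x) for x in batch)
        minibatch_scores.setdefault(cand, []).append(score)
        if trial % full_eval_every == 0:
            # Fully evaluate the candidate with the best average minibatch score.
            top = max(minibatch_scores, key=lambda c: mean(minibatch_scores[c]))
            full = mean(score_fn(top, x) for x in valset)
            if full > best_full:
                best, best_full = top, full
    return best, best_full

# Hypothetical scoring function: candidate (2, 1) is the best on every example.
def score_fn(cand, example):
    return 1.0 if cand == (2, 1) else 0.5

best, best_full = toy_prompt_search(3, 2, score_fn, valset=list(range(20)))
print(best, best_full)
```

Minibatch scores are cheap but noisy, which is why a combination is only accepted as the new best after a full evaluation.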

In [29]:
llama3_70b = dspy.LM('together_ai/meta-llama/Llama-3.3-70B-Instruct-Turbo', api_key=TOGETHER_API_KEY, api_base='https://api.together.xyz/v1', temperature=0.7)

kwargs = dict(teacher_settings=dict(lm=llama3_70b), prompt_model=llama3_70b, max_errors=999)

tp = dspy.MIPROv2(metric=top5_recall, auto="medium", num_threads=16, **kwargs)

optimized_react = tp.compile(react, trainset=trainset, max_bootstrapped_demos=3, max_labeled_demos=0)

2025/03/09 00:40:54 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING MEDIUM AUTO RUN SETTINGS:
num_trials: 25
minibatch: True
num_candidates: 9
valset size: 80

2025/03/09 00:44:54 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/03/09 00:44:54 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/03/09 00:44:54 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=9 sets of demonstrations...


Bootstrapping set 1/9
Bootstrapping set 2/9


 45%|████▌     | 9/20 [01:34<01:55, 10.49s/it]


Bootstrapped 3 full traces after 9 examples for up to 1 rounds, amounting to 9 attempts.
Bootstrapping set 3/9


 55%|█████▌    | 11/20 [00:57<00:47,  5.26s/it]


Bootstrapped 3 full traces after 11 examples for up to 1 rounds, amounting to 11 attempts.
Bootstrapping set 4/9


 10%|█         | 2/20 [00:07<01:09,  3.85s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/9


 55%|█████▌    | 11/20 [00:20<00:16,  1.88s/it]


Bootstrapped 3 full traces after 11 examples for up to 1 rounds, amounting to 11 attempts.
Bootstrapping set 6/9


 45%|████▌     | 9/20 [00:07<00:09,  1.16it/s]


Bootstrapped 3 full traces after 9 examples for up to 1 rounds, amounting to 9 attempts.
Bootstrapping set 7/9


 20%|██        | 4/20 [00:00<00:00, 98.97it/s]


Bootstrapped 3 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 8/9


 10%|█         | 2/20 [00:00<00:00, 90.18it/s]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 9/9


 45%|████▌     | 9/20 [00:26<00:32,  2.92s/it]
2025/03/09 00:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/03/09 00:48:29 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.


Bootstrapped 3 full traces after 9 examples for up to 1 rounds, amounting to 9 attempts.


2025/03/09 00:48:36 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing instructions...

2025/03/09 00:51:50 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/03/09 00:51:50 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Find all Wikipedia titles relevant to verifying (or refuting) the claim.

You will be given `claim` and your goal is to finish with `titles`.

To do this, you will interleave Thought, Tool Name, and Tool Args, and receive a resulting Observation.

Thought can reason about the current situation, and Tool Name can be the following types:

(1) search_wikipedia, whose description is <desc>Returns top-5 results and then the titles of the top-5 to top-30 results.</desc>. It takes arguments {'query': {'type': 'string'}} in JSON format.
(2) lookup_wikipedia, whose description is <desc>Returns the text of the Wikipedia page, if it exists.</desc>. It takes arguments {'title': {'type': 'string'}} in JSON format.
(3) finish, whose description is <d

Average Metric: 35.33 / 80 (44.2%): 100%|██████████| 80/80 [02:01<00:00,  1.52s/it]

2025/03/09 00:53:52 INFO dspy.evaluate.evaluate: Average Metric: 35.33333333333333 / 80 (44.2%)
2025/03/09 00:53:52 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 44.17

2025/03/09 00:53:52 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 28 - Minibatch ==



Average Metric: 14.67 / 25 (58.7%): 100%|██████████| 25/25 [00:23<00:00,  1.06it/s]

2025/03/09 00:54:15 INFO dspy.evaluate.evaluate: Average Metric: 14.666666666666666 / 25 (58.7%)
2025/03/09 00:54:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 5'].
2025/03/09 00:54:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67]
2025/03/09 00:54:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:54:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:54:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 28 - Minibatch ==



Average Metric: 12.67 / 25 (50.7%): 100%|██████████| 25/25 [00:25<00:00,  1.01s/it]

2025/03/09 00:54:41 INFO dspy.evaluate.evaluate: Average Metric: 12.666666666666666 / 25 (50.7%)
2025/03/09 00:54:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 50.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 1'].
2025/03/09 00:54:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67]
2025/03/09 00:54:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:54:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:54:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 28 - Minibatch ==



Average Metric: 10.67 / 25 (42.7%): 100%|██████████| 25/25 [00:29<00:00,  1.16s/it]

2025/03/09 00:55:10 INFO dspy.evaluate.evaluate: Average Metric: 10.666666666666666 / 25 (42.7%)
2025/03/09 00:55:10 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 42.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 2', 'Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 3'].
2025/03/09 00:55:10 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67]
2025/03/09 00:55:10 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:55:10 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:55:10 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 28 - Minibatch ==



Average Metric: 15.33 / 25 (61.3%): 100%|██████████| 25/25 [00:16<00:00,  1.52it/s]

2025/03/09 00:55:26 INFO dspy.evaluate.evaluate: Average Metric: 15.333333333333332 / 25 (61.3%)
2025/03/09 00:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 61.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 5', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 0'].
2025/03/09 00:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33]
2025/03/09 00:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:55:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 28 - Minibatch ==



Average Metric: 15.67 / 25 (62.7%): 100%|██████████| 25/25 [00:15<00:00,  1.60it/s]

2025/03/09 00:55:42 INFO dspy.evaluate.evaluate: Average Metric: 15.666666666666666 / 25 (62.7%)
2025/03/09 00:55:42 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 6'].
2025/03/09 00:55:42 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67]
2025/03/09 00:55:42 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:55:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:55:42 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 7 / 28 - Minibatch ==



Average Metric: 13.67 / 25 (54.7%): 100%|██████████| 25/25 [00:40<00:00,  1.63s/it]

2025/03/09 00:56:23 INFO dspy.evaluate.evaluate: Average Metric: 13.666666666666666 / 25 (54.7%)
2025/03/09 00:56:23 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 4', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 3'].
2025/03/09 00:56:23 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67]
2025/03/09 00:56:23 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:56:23 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:56:23 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 28 - Minibatch ==



Average Metric: 14.00 / 25 (56.0%): 100%|██████████| 25/25 [00:24<00:00,  1.01it/s]

2025/03/09 00:56:48 INFO dspy.evaluate.evaluate: Average Metric: 14.0 / 25 (56.0%)
2025/03/09 00:56:48 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 56.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 8', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
2025/03/09 00:56:48 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0]
2025/03/09 00:56:48 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:56:48 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:56:48 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 28 - Minibatch ==



Average Metric: 14.67 / 25 (58.7%): 100%|██████████| 25/25 [00:29<00:00,  1.19s/it]

2025/03/09 00:57:17 INFO dspy.evaluate.evaluate: Average Metric: 14.666666666666666 / 25 (58.7%)
2025/03/09 00:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 3', 'Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 0'].
2025/03/09 00:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67]
2025/03/09 00:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:57:17 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 28 - Minibatch ==



Average Metric: 16.33 / 25 (65.3%): 100%|██████████| 25/25 [00:17<00:00,  1.45it/s]

2025/03/09 00:57:35 INFO dspy.evaluate.evaluate: Average Metric: 16.333333333333332 / 25 (65.3%)
2025/03/09 00:57:35 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 00:57:35 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33]
2025/03/09 00:57:35 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17]
2025/03/09 00:57:35 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 44.17


2025/03/09 00:57:35 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 11 / 28 - Full Evaluation =====
2025/03/09 00:57:35 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 65.33) from minibatch trials...



Average Metric: 49.67 / 80 (62.1%): 100%|██████████| 80/80 [00:39<00:00,  2.00it/s]

2025/03/09 00:58:15 INFO dspy.evaluate.evaluate: Average Metric: 49.666666666666664 / 80 (62.1%)
2025/03/09 00:58:15 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 62.08
2025/03/09 00:58:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 00:58:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08
2025/03/09 00:58:15 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/09 00:58:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 28 - Minibatch ==



Average Metric: 13.67 / 25 (54.7%): 100%|██████████| 25/25 [00:24<00:00,  1.00it/s]

2025/03/09 00:58:40 INFO dspy.evaluate.evaluate: Average Metric: 13.666666666666666 / 25 (54.7%)
2025/03/09 00:58:40 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 54.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 6', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 4'].
2025/03/09 00:58:40 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67]
2025/03/09 00:58:40 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 00:58:40 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 00:58:40 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 13 / 28 - Minibatch ==



Average Metric: 13.33 / 25 (53.3%): 100%|██████████| 25/25 [00:09<00:00,  2.67it/s]

2025/03/09 00:58:49 INFO dspy.evaluate.evaluate: Average Metric: 13.333333333333332 / 25 (53.3%)
2025/03/09 00:58:49 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 53.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 00:58:49 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33]
2025/03/09 00:58:49 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 00:58:49 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 00:58:49 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 14 / 28 - Minibatch ==



Average Metric: 11.67 / 25 (46.7%): 100%|██████████| 25/25 [00:19<00:00,  1.27it/s]

2025/03/09 00:59:09 INFO dspy.evaluate.evaluate: Average Metric: 11.666666666666666 / 25 (46.7%)
2025/03/09 00:59:09 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 46.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 3', 'Predictor 0: Few-Shot Set 5', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 6'].
2025/03/09 00:59:09 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67]
2025/03/09 00:59:09 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 00:59:09 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 00:59:09 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 15 / 28 - Minibatch ==



Average Metric: 15.67 / 25 (62.7%): 100%|██████████| 25/25 [00:19<00:00,  1.29it/s]

2025/03/09 00:59:28 INFO dspy.evaluate.evaluate: Average Metric: 15.666666666666666 / 25 (62.7%)
2025/03/09 00:59:28 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 62.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 00:59:28 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67]
2025/03/09 00:59:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 00:59:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 00:59:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 16 / 28 - Minibatch ==



Average Metric: 12.33 / 25 (49.3%): 100%|██████████| 25/25 [00:37<00:00,  1.49s/it]

2025/03/09 01:00:06 INFO dspy.evaluate.evaluate: Average Metric: 12.333333333333332 / 25 (49.3%)
2025/03/09 01:00:06 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 49.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 6'].
2025/03/09 01:00:06 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33]
2025/03/09 01:00:06 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 01:00:06 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:00:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 17 / 28 - Minibatch ==



Average Metric: 14.67 / 25 (58.7%): 100%|██████████| 25/25 [00:24<00:00,  1.03it/s]

2025/03/09 01:00:30 INFO dspy.evaluate.evaluate: Average Metric: 14.666666666666666 / 25 (58.7%)
2025/03/09 01:00:30 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 4', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 01:00:30 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67]
2025/03/09 01:00:30 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 01:00:30 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:00:30 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 18 / 28 - Minibatch ==



Average Metric: 16.33 / 25 (65.3%): 100%|██████████| 25/25 [00:03<00:00,  6.78it/s]

2025/03/09 01:00:34 INFO dspy.evaluate.evaluate: Average Metric: 16.333333333333332 / 25 (65.3%)
2025/03/09 01:00:34 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 5'].
2025/03/09 01:00:34 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33]
2025/03/09 01:00:34 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 01:00:34 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:00:34 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 19 / 28 - Minibatch ==



Average Metric: 11.33 / 25 (45.3%): 100%|██████████| 25/25 [00:25<00:00,  1.00s/it]

2025/03/09 01:00:59 INFO dspy.evaluate.evaluate: Average Metric: 11.333333333333332 / 25 (45.3%)
2025/03/09 01:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 45.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 6', 'Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 5'].
2025/03/09 01:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33]
2025/03/09 01:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 01:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:00:59 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 20 / 28 - Minibatch ==



Average Metric: 8.67 / 16 (54.2%):  64%|██████▍   | 16/25 [00:10<00:03,  2.82it/s]

2025/03/09 01:01:11 ERROR dspy.utils.parallelizer: Error processing item Example({'claim': 'Grand Forks International Airport is closer to the town it is near, than the airport close by to The Texas Air & Space Museum.', 'titles': ['Rick Husband Amarillo International Airport', 'Grand Forks International Airport', 'Texas Air &amp; Space Museum']}) (input_keys={'claim'}): 1 validation error for literal['search_wikipedia','lookup_wikipedia','finish']
  Input should be 'search_wikipedia', 'lookup_wikipedia' or 'finish' [type=literal_error, input_value='modify_tool', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/literal_error. Set `provide_traceback=True` to see the stack trace.


Average Metric: 13.33 / 24 (55.6%): 100%|██████████| 25/25 [00:14<00:00,  1.74it/s]

2025/03/09 01:01:14 INFO dspy.evaluate.evaluate: Average Metric: 13.333333333333332 / 25 (53.3%)
2025/03/09 01:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 53.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 6', 'Predictor 0: Few-Shot Set 8', 'Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 5'].
2025/03/09 01:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33]
2025/03/09 01:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08]
2025/03/09 01:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 21 / 28 - Full Evaluation =====
2025/03/09 01:01:14 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 65.33) from minibatch trials...



Average Metric: 46.67 / 80 (58.3%): 100%|██████████| 80/80 [00:06<00:00, 12.04it/s]

2025/03/09 01:01:21 INFO dspy.evaluate.evaluate: Average Metric: 46.666666666666664 / 80 (58.3%)
2025/03/09 01:01:21 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:01:21 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08
2025/03/09 01:01:21 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/09 01:01:21 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 22 / 28 - Minibatch ==



Average Metric: 16.00 / 25 (64.0%): 100%|██████████| 25/25 [00:20<00:00,  1.19it/s]

2025/03/09 01:01:42 INFO dspy.evaluate.evaluate: Average Metric: 16.0 / 25 (64.0%)
2025/03/09 01:01:42 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 64.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 1'].
2025/03/09 01:01:42 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33, 64.0]
2025/03/09 01:01:42 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:01:42 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:01:42 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 23 / 28 - Minibatch ==



Average Metric: 16.67 / 25 (66.7%): 100%|██████████| 25/25 [00:17<00:00,  1.41it/s]

2025/03/09 01:02:00 INFO dspy.evaluate.evaluate: Average Metric: 16.666666666666668 / 25 (66.7%)
2025/03/09 01:02:00 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 66.67 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 01:02:00 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33, 64.0, 66.67]
2025/03/09 01:02:00 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:02:00 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:02:00 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 24 / 28 - Minibatch ==



Average Metric: 16.33 / 25 (65.3%): 100%|██████████| 25/25 [00:14<00:00,  1.70it/s]

2025/03/09 01:02:15 INFO dspy.evaluate.evaluate: Average Metric: 16.333333333333332 / 25 (65.3%)
2025/03/09 01:02:15 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 65.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 7', 'Predictor 0: Few-Shot Set 1', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 01:02:15 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33, 64.0, 66.67, 65.33]
2025/03/09 01:02:15 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:02:15 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:02:15 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 25 / 28 - Minibatch ==



Average Metric: 17.00 / 25 (68.0%): 100%|██████████| 25/25 [00:15<00:00,  1.56it/s]

2025/03/09 01:02:31 INFO dspy.evaluate.evaluate: Average Metric: 17.0 / 25 (68.0%)
2025/03/09 01:02:31 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 68.0 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 01:02:31 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33, 64.0, 66.67, 65.33, 68.0]
2025/03/09 01:02:31 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:02:31 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:02:31 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 26 / 28 - Minibatch ==



Average Metric: 18.33 / 25 (73.3%): 100%|██████████| 25/25 [00:15<00:00,  1.58it/s]

2025/03/09 01:02:47 INFO dspy.evaluate.evaluate: Average Metric: 18.333333333333332 / 25 (73.3%)
2025/03/09 01:02:47 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 73.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 7', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 01:02:47 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33, 64.0, 66.67, 65.33, 68.0, 73.33]
2025/03/09 01:02:47 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:02:47 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:02:47 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 27 / 28 - Minibatch ==



Average Metric: 17.33 / 25 (69.3%): 100%|██████████| 25/25 [00:26<00:00,  1.07s/it]

2025/03/09 01:03:14 INFO dspy.evaluate.evaluate: Average Metric: 17.333333333333332 / 25 (69.3%)
2025/03/09 01:03:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 69.33 on minibatch of size 25 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 0', 'Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 7'].
2025/03/09 01:03:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [58.67, 50.67, 42.67, 61.33, 62.67, 54.67, 56.0, 58.67, 65.33, 54.67, 53.33, 46.67, 62.67, 49.33, 58.67, 65.33, 45.33, 53.33, 64.0, 66.67, 65.33, 68.0, 73.33, 69.33]
2025/03/09 01:03:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33]
2025/03/09 01:03:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 62.08


2025/03/09 01:03:14 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 28 / 28 - Full Evaluation =====
2025/03/09 01:03:14 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program 


Average Metric: 51.00 / 80 (63.8%): 100%|██████████| 80/80 [00:18<00:00,  4.37it/s]

2025/03/09 01:03:33 INFO dspy.evaluate.evaluate: Average Metric: 51.0 / 80 (63.8%)
2025/03/09 01:03:33 INFO dspy.teleprompt.mipro_optimizer_v2: New best full eval score! Score: 63.75
2025/03/09 01:03:33 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [44.17, 62.08, 58.33, 63.75]
2025/03/09 01:03:33 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 63.75
2025/03/09 01:03:33 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/03/09 01:03:33 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 63.75!
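
The log above alternates minibatch trials with periodic full evaluations on the candidate whose minibatch scores average highest. The sketch below illustrates that selection rule in plain Python, using the scores reported for trials 22 to 27. It is not MIPROv2's actual implementation: `top_averaging_candidate` and the tuple encoding of each configuration are hypothetical names introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean


def top_averaging_candidate(trials, already_evaluated=frozenset()):
    """Return (config, mean_score) for the config with the best mean minibatch score.

    `trials` is a list of (config, score) pairs. Configs listed in
    `already_evaluated` are skipped, mimicking "next top averaging program".
    """
    scores_by_config = defaultdict(list)
    for config, score in trials:
        scores_by_config[config].append(score)

    averages = {
        config: mean(scores)
        for config, scores in scores_by_config.items()
        if config not in already_evaluated
    }
    best = max(averages, key=averages.get)
    return best, averages[best]


# Each config is encoded (hypothetically) as
# (predictor 0 instruction, predictor 0 few-shot set,
#  predictor 1 instruction, predictor 1 few-shot set).
# Scores are copied from minibatch trials 22-27 in the log.
trials = [
    ((7, 7, 0, 1), 64.0),
    ((7, 7, 4, 7), 66.67),
    ((7, 1, 4, 7), 65.33),
    ((1, 7, 4, 7), 68.0),
    ((1, 7, 4, 7), 73.33),
    ((1, 0, 4, 7), 69.33),
]

best_config, best_average = top_averaging_candidate(trials)
print(best_config, best_average)
```

Averaging over repeated minibatch trials smooths out the noise of small samples: a single 25-example minibatch can swing by several points, as the two scores for the same configuration in trials 25 and 26 show.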





In [30]:
evaluate(optimized_react)

Average Metric: 68.33 / 100 (68.3%): 100%|██████████| 100/100 [00:44<00:00,  2.23it/s]

2025/03/09 01:05:56 INFO dspy.evaluate.evaluate: Average Metric: 68.33333333333333 / 100 (68.3%)





| | claim | example_titles | trajectory | reasoning | pred_titles | top5_recall |
|---|---|---|---|---|---|---|
| 0 | The Church of England's movement that inspired the Trinity Episcop... | [Samuel Rickards, Trinity Episcopal Church (Houghton, Michigan), O... | {'thought_0': 'To verify the claim, we need to find the movement t... | The claim states that the Church of England's movement that inspir... | [Oxford Movement, Samuel Rickards, Trinity Episcopal Church (Hough... | ✔️ [1.000] |
| 1 | Red, White &amp; Crüe and this athlete both fight. The french fighter ... | [Bobby Stewart, Red, White &amp; Crüe, Mike Tyson] | {'thought_0': 'The claim mentions two entities that fight: Red, Wh... | The claim states that a director known for his work on Miss Potter... | [Chris Noonan, Babe (film), Miss Potter, Academy Award for Best Di... | |
| 2 | The writer/director/actor from Glen or Glenda and Fernand Rivers s... | [Glen or Glenda, Fernand Rivers, Ed Wood] | {'thought_0': 'To verify the claim, we need to find the writer/dir... | The claim states that the writer/director/actor from Glen or Glend... | [Ed Wood, Fernand Rivers, Glen or Glenda, Fernand Rivers (film dir... | ✔️ [1.000] |
| 3 | The film by Sandi Sissel was released before The End of Suburbia. | [Chicken Ranch (film), The End of Suburbia, Sandi Sissel] | {'thought_0': 'We need to find the film by Sandi Sissel and compar... | The claim states that the film by Sandi Sissel was released before... | [Chicken Ranch, The End of Suburbia, Sandi Sissel] | ✔️ [0.667] |
| 4 | The actor who played captain hook in the live production with Tayl... | [Taylor Louderman, Christopher Walken, Peter Pan Live!] | {'thought_0': 'To verify the claim, we need to find the actor who ... | The claim states that the actor who played captain hook in the liv... | [Christopher Walken, The Deer Hunter, Taylor Louderman, Peter Pan ... | ✔️ [1.000] |


68.33

MIPRO optimization significantly improved the agent's top-5 recall, from around 49% to 68.3% on the 100 evaluation examples.
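
To make these numbers concrete, here is a minimal sketch of the metric, assuming `top5_recall` is the fraction of gold titles that appear among the first five predicted titles. That assumption is consistent with the 0.667 reported for row 3 of the table above. The standalone function signature is a simplification for illustration, and the 49% baseline is the approximate figure quoted in the text.

```python
def top5_recall(gold_titles, pred_titles):
    """Fraction of gold titles found among the first five predicted titles."""
    top5 = pred_titles[:5]
    return sum(title in top5 for title in gold_titles) / len(gold_titles)


# Row 3 of the results table: two of the three gold titles match exactly.
gold = ["Chicken Ranch (film)", "The End of Suburbia", "Sandi Sissel"]
pred = ["Chicken Ranch", "The End of Suburbia", "Sandi Sissel"]
recall = top5_recall(gold, pred)
print(f"row 3 recall: {recall:.3f}")

# Improvement from the (approximate) baseline to the optimized agent.
baseline, optimized = 49.0, 68.33  # percent; the baseline is approximate
absolute_gain = optimized - baseline      # percentage points
relative_gain = absolute_gain / baseline  # fraction of the baseline
print(f"gain: {absolute_gain:.2f} points ({relative_gain:.0%} relative)")
```

Under this assumption the comparison is an exact string match, which explains the partial score: the predicted title "Chicken Ranch" does not equal the gold title "Chicken Ranch (film)".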

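To understand how MIPRO achieved this improvement, let's examine the optimized prompts it generated. We'll run an example query and then use `dspy.inspect_history()` to view the prompts sent to the model. Our ReAct agent has two components: the reasoning loop that breaks down the task, and the final prediction module that produces the output titles. With `n=1`, only the most recent call is shown, which here is the final prediction module; pass a larger `n` to see the reasoning-loop prompts as well.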

In [31]:
optimized_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles

['Joe Orton',
 'Bernard-Marie Koltès',
 'Up Against It',
 'The Beatles',
 'Playwrights']

In [32]:
# view the optimized prompts
dspy.inspect_history(n=1)





[2025-03-09T01:06:24.362989]

System message:

Your input fields are:
1. `claim` (str)
2. `trajectory` (str)

Your output fields are:
1. `reasoning` (str)
2. `titles` (list[str])

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## claim ## ]]
{claim}

[[ ## trajectory ## ]]
{trajectory}

[[ ## reasoning ## ]]
{reasoning}

[[ ## titles ## ]]
{titles}        # note: the value you produce must adhere to the JSON schema: {"type": "array", "items": {"type": "string"}}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        You are a fact-checker tasked with verifying the accuracy of claims. To do this, you will need to find all relevant Wikipedia titles that can help verify or refute the claim. Your goal is to provide a list of titles that are directly related to the claim, which will aid in determining its validity. Please use your knowledge of Wikipedia and its search functionality to ide

Save the optimized prompts!

In [33]:
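# Persist the optimized program's state to disk
optimized_react.save("optimized_react.json")

# Rebuild an agent with the same signature and tools, then load the saved state into it
loaded_react = dspy.ReAct("claim -> titles: list[str]", tools=[search_wikipedia, lookup_wikipedia], max_iters=20)
loaded_react.load("optimized_react.json")

# The reloaded agent can be called exactly like the optimized one
loaded_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles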

['Joe Orton',
 'Bernard-Marie Koltès',
 'Up Against It',
 'The Beatles',
 'Playwrights']
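
Because the program was saved with a `.json` extension, the file can be opened with Python's standard `json` module to check what was stored. The helper below is a minimal sketch: `top_level_keys` is a hypothetical name, and the exact layout of the saved state depends on the DSPy version, so it only lists the top-level keys rather than assuming any particular structure.

```python
import json
from pathlib import Path


def top_level_keys(path):
    """Return the sorted top-level keys of a JSON file, or [] if it is not an object."""
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    return sorted(state) if isinstance(state, dict) else []


# Only inspect the file if the save step above has actually been run.
saved_path = Path("optimized_react.json")
if saved_path.exists():
    print(top_level_keys(saved_path))
```

Keeping the saved file under version control alongside the notebook makes it easy to compare prompts across optimization runs.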