# Tutorial: Agents

Let's walk through a quick example of setting up a `dspy.ReAct` agent with a couple of tools and optimizing it to conduct advanced browsing for multi-hop search.

Install the latest DSPy via `pip install -U dspy` and follow along.

In this tutorial, we'll use an extremely small LM: Meta's `Llama-3.2-3B-Instruct`, which has 3 billion parameters.

A model like this is not very reliable out of the box for long or complex agent loops. However, it's extremely fast and cheap to host, as it needs very little RAM.

You might be able to host the 3B model on your laptop with Ollama, on your GPU server with SGLang, or via a provider that hosts it for you like Databricks or Together.

In the snippet below, we'll configure our main LM as `Llama-3.2-3B`. We'll also set up a larger LM, i.e. `GPT-4o`, as a teacher that we'll invoke a very small number of times to help teach the small LM.

In [1]:
import dspy

llama3b = dspy.LM('<provider>/Llama-3.2-3B-Instruct', temperature=0.7)
gpt4o = dspy.LM('openai/gpt-4o', temperature=0.7)

dspy.configure(lm=llama3b)

Let's load a dataset for our task. We'll load examples from the HoVer multi-hop task, where the input is a (really!) complex claim and the output we're seeking is the set of Wikipedia pages that are required to fact-check that claim.

In [2]:
import random
from dspy.datasets import DataLoader

kwargs = dict(fields=("claim", "supporting_facts", "hpqa_id", "num_hops"), input_keys=("claim",))
hover = DataLoader().from_huggingface(dataset_name="hover-nlp/hover", split="train", trust_remote_code=True, **kwargs)

hpqa_ids = set()
hover = [
    dspy.Example(claim=x.claim, titles=list(set([y["key"] for y in x.supporting_facts]))).with_inputs("claim")
    for x in hover
    if x["num_hops"] == 3 and x["hpqa_id"] not in hpqa_ids and not hpqa_ids.add(x["hpqa_id"])
]

random.Random(0).shuffle(hover)
trainset, devset, testset = hover[:100], hover[100:200], hover[650:]
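
The filter in the list comprehension above keeps only the first 3-hop example for each `hpqa_id`. It relies on `set.add` returning `None`, so `not hpqa_ids.add(...)` is always true and only records the id as a side effect. Here is a minimal sketch of the same idiom on made-up rows (the `rows` data and the `seen` and `kept` names are hypothetical, not taken from HoVer):

```python
# Hypothetical rows standing in for HoVer examples.
rows = [
    {"hpqa_id": "a", "num_hops": 3, "claim": "first a"},
    {"hpqa_id": "a", "num_hops": 3, "claim": "second a"},
    {"hpqa_id": "b", "num_hops": 2, "claim": "two hops"},
    {"hpqa_id": "c", "num_hops": 3, "claim": "first c"},
]

seen = set()
kept = [
    row["claim"]
    for row in rows
    # `seen.add(...)` returns None, so `not seen.add(...)` is always True.
    # It only runs when the first two conditions hold, recording the id.
    if row["num_hops"] == 3 and row["hpqa_id"] not in seen and not seen.add(row["hpqa_id"])
]

print(kept)  # ['first a', 'first c']
```

Because `and` short-circuits, ids of rows that fail the `num_hops` check are never recorded.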

Let's view an example of this task:

In [3]:
example = trainset[0]

print("Claim:", example.claim)
print("Pages that must be retrieved:", example.titles)

Claim: This director is known for his work on Miss Potter. The Academy of Motion Picture Arts and Sciences presents the award in which he was nominated for his work in "Babe".
Pages that must be retrieved: ['Miss Potter', 'Chris Noonan', 'Academy Award for Best Director']


Now, let's define a function to do the search in Wikipedia. We'll rely on a ColBERTv2 server that can search the "abstracts" (i.e., first paragraphs) of every article that existed in Wikipedia in 2017, which is the data used in HoVer.

In [4]:
DOCS = {}

def search(query: str, k: int) -> list[str]:
    results = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')(query, k=k)
    results = [x['text'] for x in results]

    for result in results:
        title, text = result.split(" | ", 1)
        DOCS[title] = text

    return results
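
The `search` function assumes each passage has the form `title | text`. The `1` passed to `split` limits it to the first separator, which matters if the body itself happens to contain ` | `. A small sketch with a made-up passage (the string below is illustrative, not real server output):

```python
# A made-up passage in the assumed "title | text" format.
passage = "Example Page | First sentence. A | B appears in the body."

# Split only at the first separator, so the body stays in one piece.
title, text = passage.split(" | ", 1)
print(title)  # Example Page
print(text)   # First sentence. A | B appears in the body.

# Without the limit, the body would be broken into extra pieces
# and the two-variable unpacking above would fail.
pieces = passage.split(" | ")
print(len(pieces))  # 3
```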

Now, let's use the `search` function to define two tools for our ReAct agent:

In [5]:
def search_wikipedia(query: str) -> list[str]:
    """Returns top-5 results and then the titles of the top-5 to top-30 results."""

    topK = search(query, 30)
    titles, topK = [f"`{x.split(' | ')[0]}`" for x in topK[5:30]], topK[:5]
    return topK + [f"Other retrieved pages have titles: {', '.join(titles)}."]

def lookup_wikipedia(title: str) -> str:
    """Returns the text of the Wikipedia page, if it exists."""

    if title in DOCS:
        return DOCS[title]

    results = [x for x in search(title, 10) if x.startswith(title + " | ")]
    if not results:
        return f"No Wikipedia page found for title: {title}"
    return results[0]
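
To see what the agent receives from `search_wikipedia` without contacting the ColBERTv2 server, the formatting logic can be run against a stub. This is only a sketch: `fake_search` and the `search_fn` parameter are hypothetical additions for illustration, and the real tool should keep its single `query` argument so that its signature stays simple for the agent.

```python
# Hypothetical stand-in for `search`: returns fake "title | text" passages, no network needed.
def fake_search(query: str, k: int) -> list[str]:
    return [f"Page {i} | Text about {query} ({i})." for i in range(k)]


def search_wikipedia(query: str, search_fn=fake_search) -> list[str]:
    """Same formatting as the tutorial's tool, with the search function injectable."""
    top_k = search_fn(query, 30)
    # Results 6 through 30 are reduced to their titles only.
    titles = [f"`{x.split(' | ')[0]}`" for x in top_k[5:30]]
    return top_k[:5] + [f"Other retrieved pages have titles: {', '.join(titles)}."]


out = search_wikipedia("David Gregory")
print(len(out))  # 6: five full passages plus one summary line
print(out[0])    # Page 0 | Text about David Gregory (0).
```

The last element is a single sentence listing the remaining 25 titles, which the agent can then pass to `lookup_wikipedia` one at a time.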

Now, let's define the ReAct agent in DSPy. It's going to be super simple: it'll take a `claim` and produce a list `titles: list[str]`.

We'll instruct it to find all Wikipedia titles that are needed to fact-check the claim.

In [6]:
instructions = "Find all Wikipedia titles relevant to verifying (or refuting) the claim."
signature = dspy.Signature("claim -> titles: list[str]", instructions)
react = dspy.ReAct(signature, tools=[search_wikipedia, lookup_wikipedia], max_iters=20)

Let's try it with a really simple claim to see if our tiny 3B model can do it!

In [7]:
react(claim="David Gregory was born in 1625.").titles[:3]

['David Gregory (physician)', 'David A. Gregory', 'David Harry Gregory']

Great. Now let's set up an evaluation metric, `top5_recall`.

It will return the fraction of the gold pages (there are always 3) that appear among the top-5 titles returned by the agent.

In [9]:
def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)

    # If we're "bootstrapping" for optimization, return True if and only if the recall is perfect.
    if trace is not None:
        return recall >= 1.0
    
    # If we're just doing inference, just measure the recall.
    return recall

evaluate = dspy.Evaluate(devset=devset, metric=top5_recall, num_threads=16, display_progress=True, display_table=5)
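
To make the two modes of the metric concrete, here is a small worked example. It repeats the metric so the block is self-contained, and uses `SimpleNamespace` objects as stand-ins for `dspy.Example` and `dspy.Prediction`. The gold titles come from the training example shown earlier, while the predicted titles are made up.

```python
from types import SimpleNamespace


def top5_recall(example, pred, trace=None):
    gold_titles = example.titles
    recall = sum(x in pred.titles[:5] for x in gold_titles) / len(gold_titles)
    if trace is not None:
        return recall >= 1.0
    return recall


# Stand-ins for dspy.Example / dspy.Prediction; the prediction is made up.
example = SimpleNamespace(titles=["Miss Potter", "Chris Noonan", "Academy Award for Best Director"])
pred = SimpleNamespace(titles=["Chris Noonan", "Babe (film)", "Miss Potter"])

score = top5_recall(example, pred)
print(round(score, 3))                       # 0.667 (2 of the 3 gold titles were found)
print(top5_recall(example, pred, trace=[]))  # False (bootstrapping requires perfect recall)
```

Matching is by exact string equality, so a predicted title that differs only slightly from the gold title does not count.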

Let's evaluate our off-the-shelf agent, with `Llama-3.2-3B`, to see how far we can go already.

This model is tiny, so the agent can raise errors fairly often. Let's wrap it in a try/except block that returns an empty prediction whenever that happens.

In [10]:
def safe_react(claim: str):
    try:
        return react(claim=claim)
    except Exception:
        # On any failure, fall back to an empty prediction, which scores 0 recall.
        return dspy.Prediction(titles=[])

evaluate(safe_react)

Average Metric: 8.00 / 100 (8.0%): 100%|██████████| 100/100 [05:22<00:00,  3.22s/it]

2024/12/17 14:09:47 INFO dspy.evaluate.evaluate: Average Metric: 7.999999999999997 / 100 (8.0%)





| | claim | example_titles | trajectory | reasoning | pred_titles | top5_success |
|---|---|---|---|---|---|---|
| 0 | The Church of England's movement that inspired the Trinity Episcop... | [Oxford Movement, Trinity Episcopal Church (Houghton, Michigan), S... | {'thought_0': 'The claim suggests that there is a specific movemen... | The search results seem to be a mix of different churches with sim... | ['Trinity Episcopal Church (Houghton, Michigan)', 'Trinity Episcop... | ✔️ [0.333] |
| 1 | Red, White & Crüe and this athlete both fight. The french fighter ... | [Red, White &amp; Crüe, Mike Tyson, Bobby Stewart] | | | [] | |
| 2 | The writer/director/actor from Glen or Glenda and Fernand Rivers s... | [Ed Wood, Glen or Glenda, Fernand Rivers] | | | [] | |
| 3 | The film by Sandi Sissel was released before The End of Suburbia. | [Chicken Ranch (film), Sandi Sissel, The End of Suburbia] | | | [] | |
| 4 | The actor who played captain hook in the live production with Tayl... | [Christopher Walken, Taylor Louderman, Peter Pan Live!] | | | [] | |


8.0

Wow. It only scores 8% in terms of recall. Not that good!
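
Swallowing every exception makes it hard to tell how much of a low score comes from crashes and how much from wrong answers. One option is to count failures by exception type while still returning a fallback. The sketch below uses a hypothetical `make_safe` helper and a made-up `flaky` function in place of the agent. With the tutorial's objects, the fallback could be `lambda: dspy.Prediction(titles=[])`.

```python
from collections import Counter


def make_safe(fn, fallback, errors: Counter):
    """Wrap `fn` so exceptions are counted by type and `fallback()` is returned instead."""
    def wrapped(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            errors[type(e).__name__] += 1
            return fallback()
    return wrapped


# Hypothetical flaky function standing in for the agent call.
def flaky(claim: str):
    if "bad" in claim:
        raise ValueError("could not parse output")
    return [claim.upper()]


errors = Counter()
safe = make_safe(flaky, fallback=list, errors=errors)

print(safe("ok claim"))   # ['OK CLAIM']
print(safe("bad claim"))  # []
print(dict(errors))       # {'ValueError': 1}
```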

Let's now optimize the two prompts inside `dspy.ReAct` jointly to maximize the recall of our agent. This may take around 30 minutes and make some $5 worth of calls to GPT-4o to optimize Llama-3.2-3B.

In [12]:
kwargs = dict(teacher_settings=dict(lm=gpt4o), prompt_model=gpt4o, max_errors=999)

tp = dspy.MIPROv2(metric=top5_recall, auto="medium", num_threads=16, **kwargs)
optimized_react = tp.compile(react, trainset=trainset, max_bootstrapped_demos=3, max_labeled_demos=0)

Let's now evaluate again, after optimization.

In [13]:
evaluate(optimized_react)

Average Metric: 41.67 / 100 (41.7%): 100%|██████████| 100/100 [03:00<00:00,  1.81s/it]

2024/12/17 15:12:06 INFO dspy.evaluate.evaluate: Average Metric: 41.66666666666667 / 100 (41.7%)





| | claim | example_titles | trajectory | reasoning | pred_titles | top5_success |
|---|---|---|---|---|---|---|
| 0 | The Church of England's movement that inspired the Trinity Episcop... | [Oxford Movement, Trinity Episcopal Church (Houghton, Michigan), S... | {'thought_0': 'To verify the claim, I need to identify the Church ... | The claim states that the Church of England's movement that inspir... | ['Trinity Episcopal Church (Houghton, Michigan)', 'Church of All S... | ✔️ [0.667] |
| 1 | Red, White & Crüe and this athlete both fight. The french fighter ... | [Red, White &amp; Crüe, Mike Tyson, Bobby Stewart] | {'thought_0': 'To verify the claim, I need to identify the French ... | The claim states that Red, White & Crüe is a term applied to sport... | [Bobby Stewart, Bernardin Ledoux Kingue Matam, Mötley Crüe, Milan ... | ✔️ [0.333] |
| 2 | The writer/director/actor from Glen or Glenda and Fernand Rivers s... | [Ed Wood, Glen or Glenda, Fernand Rivers] | {'thought_0': 'To verify the claim, I need to identify the writer/... | The claim states that Glen or Glenda and Fernand Rivers share the ... | [Ed Wood, Bela Lugosi, Dolores Fuller] | ✔️ [0.333] |
| 3 | The film by Sandi Sissel was released before The End of Suburbia. | [Chicken Ranch (film), Sandi Sissel, The End of Suburbia] | {'thought_0': 'To verify the claim, I need to find the release dat... | The claim states that the film by Sandi Sissel was released before... | [Sandi Sissel, The End of Suburbia (film)] | ✔️ [0.333] |
| 4 | The actor who played captain hook in the live production with Tayl... | [Christopher Walken, Taylor Louderman, Peter Pan Live!] | {'thought_0': 'To verify the claim, I need to find the actor who p... | The claim suggests that the actor who played Captain Hook in the l... | [Cyril Ritchard, Ruth Connell] | |


41.67

Awesome. It looks like the system improved drastically, from 8% recall to around 42% recall. That was a pretty straightforward approach, but DSPy gives you many tools to continue iterating on this from here.

Next, let's inspect the optimized prompts to understand what the agent has learned. We'll run one query and then inspect the last two prompts, which will show us the prompts used for both ReAct sub-modules: the one that runs the agentic loop and the one that prepares the final results.

In [15]:
optimized_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles

['Bernard-Marie Koltès', 'Joe Orton']

In [17]:
dspy.inspect_history(n=2)





[2024-12-17T15:13:25.420335]

System message:

Your input fields are:
1. `claim` (str)
2. `trajectory` (str)

Your output fields are:
1. `next_thought` (str)
2. `next_tool_name` (Literal[search_wikipedia, lookup_wikipedia, finish])
3. `next_tool_args` (dict[str, Any])

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## claim ## ]]
{claim}

[[ ## trajectory ## ]]
{trajectory}

[[ ## next_thought ## ]]
{next_thought}

[[ ## next_tool_name ## ]]
{next_tool_name}        # note: the value you produce must be one of: search_wikipedia; lookup_wikipedia; finish

[[ ## next_tool_args ## ]]
{next_tool_args}        # note: the value you produce must be pareseable according to the following JSON schema: {"type": "object"}

[[ ## completed ## ]]

In adhering to this structure, your objective is: 
        Find all Wikipedia titles relevant to verifying (or refuting) the claim.
        
        You will be given `claim` and yo

Finally, let's save our optimized program so we can use it again later.

In [18]:
optimized_react.save("optimized_react.json")

loaded_react = dspy.ReAct("claim -> titles: list[str]", tools=[search_wikipedia, lookup_wikipedia], max_iters=20)
loaded_react.load("optimized_react.json")

loaded_react(claim="The author of the 1960s unproduced script written for The Beatles, Up Against It, and Bernard-Marie Koltès are both playwrights.").titles

['Bernard-Marie Koltès', 'Joe Orton']