<img src="../docs/docs/static/img/dspy_logo.png" alt="DSPy7 Image" height="150"/>

# DSPy: Tutorial @ SkyCamp

This notebook contains the **DSPy tutorial** for **SkyCamp 2023**.

Let's begin by setting things up. The snippet below will also install **DSPy** if it's not there already.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os

try:  # When on Google Colab, clone the repository so that the cache is downloaded too.
    import google.colab  # noqa: F401
    repo_path = 'dspy'
    !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path
except ImportError:  # not running on Colab
    repo_path = '.'

if repo_path not in sys.path:
    sys.path.append(repo_path)

# Set up the cache for this notebook
os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(repo_path, 'cache')

# import pkg_resources # Install the package if it's not installed
# if not "dspy-ai" in {pkg.key for pkg in pkg_resources.working_set}:
#     !pip install -U pip
#     # !pip install dspy-ai
#     !pip install -e $repo_path

import dspy

In [2]:
%load_ext autoreload
%autoreload 2
import sys; sys.path.append('/future/u/okhattab/repos/public/stanfordnlp/dspy')

from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShot, BootstrapFewShotWithRandomSearch

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
./cache/compiler


### 1) Configure the default LM and retriever

We'll start by setting up the language model (LM) and retrieval model (RM). **DSPy** supports multiple API and local models.

In this notebook, we will use `Llama2-13b-chat`, served with Hugging Face TGI (Text Generation Inference). In principle you can run this on your own local GPUs, but for this tutorial all examples are pre-cached so you don't need to worry about cost.

We will use the retriever `ColBERTv2`. To make things easy, we've set up a ColBERTv2 server hosting a Wikipedia 2017 "abstracts" search index (i.e., containing the first paragraph of each article from this [2017 dump](https://hotpotqa.github.io/wiki-readme.html)), so you don't need to worry about setting one up! It's free.

**Note:** _If you run this notebook as instructed, you don't need an API key. All examples are already cached internally so you can inspect them!_

In [3]:
llama = dspy.HFClientTGI(model="meta-llama/Llama-2-13b-chat-hf", port=[7140, 7141, 7142, 7143], max_tokens=150)
colbertv2 = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

# # NOTE: After you finish this notebook, you can use GPT-3.5 like this if you like.
# turbo = dspy.OpenAI(model='gpt-3.5-turbo-instruct')
# # In that case, make sure to configure lm=turbo below if you choose to do that.

dspy.settings.configure(rm=colbertv2, lm=llama)

### 2) Create a few question–answer pairs for our task

In [4]:
train = [('Who was the director of the 2009 movie featuring Peter Outerbridge as William Easton?', 'Kevin Greutert'),
         ('The heir to the Du Pont family fortune sponsored what wrestling team?', 'Foxcatcher'),
         ('In what year was the star of To Hell and Back born?', '1925'),
         ('Which award did the first book of Gary Zukav receive?', 'U.S. National Book Award'),
         ('What documentary about the Gilgo Beach Killer debuted on A&E?', 'The Killing Season'),
         ('Which author is English: John Braine or Studs Terkel?', 'John Braine'),
         ('Who produced the album that included a re-recording of "Lithium"?', 'Butch Vig')]

train = [dspy.Example(question=question, answer=answer).with_inputs('question') for question, answer in train]

In [5]:
dev = [('Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?', 'E. L. Doctorow'),
       ('Right Back At It Again contains lyrics co-written by the singer born in what city?', 'Gainesville, Florida'),
       ('What year was the party of the winner of the 1971 San Francisco mayoral election founded?', '1828'),
       ('Anthony Dirrell is the brother of which super middleweight title holder?', 'Andre Dirrell'),
       ('The sports nutrition business established by Oliver Cookson is based in which county in the UK?', 'Cheshire'),
       ('Find the birth date of the actor who played roles in First Wives Club and Searching for the Elephant.', 'February 13, 1980'),
       ('Kyle Moran was born in the town on what river?', 'Castletown River'),
       ("The actress who played the niece in the Priest film was born in what city, country?", 'Surrey, England'),
       ('Name the movie in which the daughter of Noel Harrison plays Violet Trefusis.', 'Portrait of a Marriage'),
       ('What year was the father of the Princes in the Tower born?', '1442'),
       ('What river is near the Crichton Collegiate Church?', 'the River Tyne'),
       ('Who purchased the team Michael Schumacher raced for in the 1995 Monaco Grand Prix in 2000?', 'Renault'),
       ('André Zucca was a French photographer who worked with a German propaganda magazine published by what Nazi organization?', 'the Wehrmacht')]

dev = [dspy.Example(question=question, answer=answer).with_inputs('question') for question, answer in dev]

### 3) Key Concepts: Signatures & Modules

In [6]:
# Define a dspy.Predict module with the signature `question -> answer` (i.e., takes a question and outputs an answer).
predict = dspy.Predict('question -> answer')

# Use the module!
predict(question="What is the capital of Germany?")

Prediction(
    answer='Berlin'
)

In the example above, we used the `dspy.Predict` module **zero-shot**, i.e. without compiling it on any examples.
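To make the signature notation concrete, here is a minimal sketch of how a string such as `question -> answer` splits into input and output field names. The helper `parse_signature` is hypothetical and written only for illustration; it is not DSPy's own parser.

```python
def parse_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (['a', 'b'], ['c']).

    Hypothetical helper for illustration only, not part of DSPy.
    """
    if signature.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {signature!r}")
    left, right = signature.split("->")
    # Field names are comma-separated; surrounding whitespace is ignored.
    inputs = [name.strip() for name in left.split(",") if name.strip()]
    outputs = [name.strip() for name in right.split(",") if name.strip()]
    if not inputs or not outputs:
        raise ValueError("a signature needs at least one input and one output field")
    return inputs, outputs


print(parse_signature("question -> answer"))
print(parse_signature("context, question -> answer"))
```

Everything left of the arrow is something we pass in when calling the module, and everything right of it is something the LM is asked to produce.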

Let's now build a slightly more advanced program. Our program will use the `dspy.ChainOfThought` module, which asks the LM to think step by step.

We will call this program `CoT`.

In [7]:
class CoT(dspy.Module):  # let's define a new module
    def __init__(self):
        super().__init__()

        # here we declare the chain of thought sub-module, so we can later compile it (e.g., teach it a prompt)
        self.generate_answer = dspy.ChainOfThought('question -> answer')
    
    def forward(self, question):
        return self.generate_answer(question=question)  # here we use the module

Now let's compile this using our seven `train` examples. We will use the very simple `BootstrapFewShot` in DSPy.

In [8]:
metric_EM = dspy.evaluate.answer_exact_match

teleprompter = BootstrapFewShot(metric=metric_EM, max_bootstrapped_demos=2)
cot_compiled = teleprompter.compile(CoT(), trainset=train)

100%|██████████| 7/7 [00:00<00:00, 29.36it/s]

Bootstrapped 1 full traces after 7 examples in round 0.
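The log line above reports one bootstrapped trace after seven examples. As a rough mental model (a simplified sketch, not DSPy's implementation), bootstrapping runs the program on each training example and keeps only the runs whose answer passes the metric, up to a maximum number of demos. The names `bootstrap_demos` and `toy_program` are hypothetical, and the stand-in "LM" is a lookup table.

```python
def exact_match(gold: str, predicted: str) -> bool:
    # Simplified metric for this sketch: case-insensitive string equality.
    return gold.strip().lower() == predicted.strip().lower()


def bootstrap_demos(program, trainset, metric, max_demos):
    """Run `program` on each training example and keep the runs that pass `metric`."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = program(example["question"])
        if metric(example["answer"], prediction["answer"]):
            # A successful run becomes a demonstration, rationale included.
            demos.append({**example, **prediction})
    return demos


def toy_program(question: str) -> dict:
    # Stand-in for an LM call: it only "knows" one fact.
    known = {"Which author is English: John Braine or Studs Terkel?": "John Braine"}
    answer = known.get(question, "unknown")
    return {"rationale": f"recall what we know. The answer is {answer}.", "answer": answer}


trainset = [
    {"question": "In what year was the star of To Hell and Back born?", "answer": "1925"},
    {"question": "Which author is English: John Braine or Studs Terkel?", "answer": "John Braine"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match, max_demos=2)
print(len(demos), demos[0]["question"])
```

Only the second example survives the filter here, which mirrors how a weak zero-shot program may yield just one usable trace from a small training set.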





Let's ask the compiled program, `cot_compiled`, a question.

In [9]:
cot_compiled("What is the capital of Germany?")

Prediction(
    rationale='determine the capital of Germany. We know that the capital of Germany is Berlin, so the answer is Berlin.',
    answer='Berlin'
)

You might be curious what's happening under the hood. Let's inspect the last call to our Llama LM to see the prompt and the output.

In [10]:
llama.inspect_history(n=1)





Given the fields `question`, produce the fields `answer`.

---

Question: Who was the director of the 2009 movie featuring Peter Outerbridge as William Easton?
Answer: Kevin Greutert

Question: Which award did the first book of Gary Zukav receive?
Answer: U.S. National Book Award

Question: What documentary about the Gilgo Beach Killer debuted on A&E?
Answer: The Killing Season

Question: In what year was the star of To Hell and Back born?
Answer: 1925

Question: The heir to the Du Pont family fortune sponsored what wrestling team?
Answer: Foxcatcher

Question: Who produced the album that included a re-recording of "Lithium"?
Answer: Butch Vig

---

Follow the following format.

Question: ${question}
Reasoning: Let's think step by step in order to ${produce the answer}. We ...
Answer: ${answer}

---

Question: Which author is English: John Braine or Studs Terkel?
Reasoning: Let's think step by step in order to determine which author is English. We know that John Braine is English, 

Notice how the prompt ends with the question we asked ("What is the capital of Germany?"), but before that it includes few-shot examples.

The final example in the prompt contains a rationale (step-by-step reasoning) that the LM generated itself, for use as a demonstration, for the training question "Which author is English: John Braine or Studs Terkel?".
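To see how these pieces fit together, here is a minimal sketch that assembles a prompt with the same sections as the output above: instructions, labeled demos, a format description, bootstrapped demos with reasoning, and the new question. The function `build_prompt` is hypothetical and the layout only approximates what is visible above; DSPy builds its prompts internally.

```python
PREFIX = "Reasoning: Let's think step by step in order to"


def build_prompt(instructions, labeled_demos, bootstrapped_demos, question):
    """Assemble a few-shot chain-of-thought prompt (illustrative layout only)."""
    # The format section tells the LM which fields to produce, and in what order.
    format_section = "\n".join([
        "Follow the following format.",
        "",
        "Question: ${question}",
        PREFIX + " ${produce the answer}. We ...",
        "Answer: ${answer}",
    ])
    sections = [
        instructions,
        # Labeled demos carry only a question and its gold answer.
        "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in labeled_demos),
        format_section,
    ]
    # Bootstrapped demos also include the rationale the LM produced.
    for q, rationale, a in bootstrapped_demos:
        sections.append(f"Question: {q}\n{PREFIX} {rationale}\nAnswer: {a}")
    # The prompt ends with the new question and an open reasoning prefix.
    sections.append(f"Question: {question}\n{PREFIX}")
    return "\n\n---\n\n".join(sections)


prompt = build_prompt(
    instructions="Given the fields `question`, produce the fields `answer`.",
    labeled_demos=[("In what year was the star of To Hell and Back born?", "1925")],
    bootstrapped_demos=[(
        "Which author is English: John Braine or Studs Terkel?",
        # Rationale left truncated, as in the output above.
        "determine which author is English. We know that John Braine is English, ...",
        "John Braine",
    )],
    question="What is the capital of Germany?",
)
print(prompt)
```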

Now, let's evaluate on our development set.

In [11]:
NUM_THREADS = 32
evaluate_hotpot = Evaluate(devset=dev, metric=metric_EM, num_threads=NUM_THREADS, display_progress=True, display_table=15)

First, let's evaluate the compiled `CoT` program with Llama. Feel free to replace `cot_compiled` below with `CoT()` (notice the parentheses) to test the zero-shot version of CoT.

In [12]:
evaluate_hotpot(cot_compiled)

Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 117.05it/s]


Average Metric: 3 / 13  (23.1%)


Unnamed: 0,question,example_answer,rationale,pred_answer,answer_exact_match
0,Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?,E. L. Doctorow,"determine who has a broader scope of profession. We know that E. L. Doctorow was a novelist, but Julia Peterkin was a journalist and a...",Julia Peterkin,❌ [False]
1,Right Back At It Again contains lyrics co-written by the singer born in what city?,"Gainesville, Florida","determine the answer. We know that the singer was born in Minneapolis, so the answer is Minneapolis.",Minneapolis,❌ [False]
2,What year was the party of the winner of the 1971 San Francisco mayoral election founded?,1828,determine the year the party of the winner of the 1971 San Francisco mayoral election was founded. We know that the party was founded in...,1971,❌ [False]
3,Anthony Dirrell is the brother of which super middleweight title holder?,Andre Dirrell,"determine which super middleweight title holder Anthony Dirrell is the brother of. We know that Anthony Dirrell is a professional boxer, and after researching, we...",Andre Dirrell,✔️ [True]
4,The sports nutrition business established by Oliver Cookson is based in which county in the UK?,Cheshire,"determine the county in the UK where Oliver Cookson's sports nutrition business is based. We know that Oliver Cookson is a British entrepreneur, so we...",Surrey,❌ [False]
5,Find the birth date of the actor who played roles in First Wives Club and Searching for the Elephant.,"February 13, 1980","determine the birth date of the actor. We know that the actor was born in the 1950s, so we need to narrow down the possible...","August 12, 1955",❌ [False]
6,Kyle Moran was born in the town on what river?,Castletown River,determine where Kyle Moran was born. We know that Kyle Moran was born in the town on the Delaware River.,Delaware River,❌ [False]
7,"The actress who played the niece in the Priest film was born in what city, country?","Surrey, England","determine the answer. We know that the actress was born in a city, so we need to determine the country. After researching, we found that...","Los Angeles, California",❌ [False]
8,Name the movie in which the daughter of Noel Harrison plays Violet Trefusis.,Portrait of a Marriage,"determine the name of the movie. We know that the daughter of Noel Harrison plays Violet Trefusis, so we need to determine the name of...",The Remains of the Day,❌ [False]
9,What year was the father of the Princes in the Tower born?,1442,"determine the answer. We know that the father of the Princes in the Tower was born before 1483, so we need to find the correct...",1452,❌ [False]


23.08
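The reported score is the number of exact matches divided by the size of the development set, as a percentage. The following sketch reproduces that arithmetic with a stand-in program; `evaluate` and `toy_program` are hypothetical names, and the real `Evaluate` class additionally handles threading, progress bars, and the results table.

```python
def evaluate(program, devset, metric):
    """Return (number of correct answers, accuracy in percent) on `devset`."""
    correct = sum(1 for question, gold in devset if metric(gold, program(question)))
    return correct, round(100 * correct / len(devset), 2)


# Toy data: 13 questions, of which the stand-in program answers the first 3 correctly.
devset = [(f"question {i}", f"answer {i}") for i in range(13)]


def toy_program(question):
    index = int(question.split()[-1])
    return f"answer {index}" if index < 3 else "wrong"


print(evaluate(toy_program, devset, lambda gold, pred: gold == pred))  # (3, 23.08)
```

With only 13 development questions, each question is worth about 7.7 points, so small differences between programs should be read with caution.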

### 4) Bonus 1: RAG with query generation

As a bonus, let's define a more sophisticated program called `RAG`. This program will:

- Use the LM to generate a search query based on the input question
- Retrieve three passages using our retriever
- Use the LM to generate a final answer using these passages

In [13]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        # declare three modules: the retriever, a query generator, and an answer generator
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        # generate a search query from the question, and use it to retrieve passages
        search_query = self.generate_query(question=question).search_query
        passages = self.retrieve(search_query).passages

        # generate an answer from the passages and the question
        return self.generate_answer(context=passages, question=question)

Out of curiosity, we can evaluate the **uncompiled** (or **zero-shot**) version of this program.

In [14]:
evaluate_hotpot(RAG(), display_table=0)

Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 45.09it/s]

Average Metric: 3 / 13  (23.1%)





23.08

Let's now compile this RAG program. We'll use a slightly more advanced teleprompter (automatic prompt optimizer) this time, which relies on random search.

In [15]:
teleprompter2 = BootstrapFewShotWithRandomSearch(metric=metric_EM, max_bootstrapped_demos=2, num_candidate_programs=8, num_threads=NUM_THREADS)
rag_compiled = teleprompter2.compile(RAG(), trainset=train, valset=dev)

Going to sample between 1 and 2 traces per predictor.
Will attempt to train 8 candidate sets.


Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 155.65it/s]


Average Metric: 3 / 13  (23.1%)
Score: 23.08 for set: [0, 0]
New best score: 23.08 for seed -3
Scores so far: [23.08]
Best score: 23.08


Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 72.77it/s]


Average Metric: 3 / 13  (23.1%)
Score: 23.08 for set: [7, 7]
Scores so far: [23.08, 23.08]
Best score: 23.08


 86%|████████▌ | 6/7 [00:00<00:00, 13.07it/s]


Bootstrapped 2 full traces after 7 examples in round 0.


Average Metric: 5 / 13  (38.5): 100%|██████████| 13/13 [00:00<00:00, 45.43it/s]


Average Metric: 5 / 13  (38.5%)
Score: 38.46 for set: [7, 7]
New best score: 38.46 for seed -1
Scores so far: [23.08, 23.08, 38.46]
Best score: 38.46
Average of max per entry across top 1 scores: 0.38461538461538464
Average of max per entry across top 2 scores: 0.46153846153846156
Average of max per entry across top 3 scores: 0.46153846153846156
Average of max per entry across top 5 scores: 0.46153846153846156
Average of max per entry across top 8 scores: 0.46153846153846156
Average of max per entry across top 9999 scores: 0.46153846153846156


100%|██████████| 7/7 [00:00<00:00, 19.01it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


Average Metric: 6 / 13  (46.2): 100%|██████████| 13/13 [00:00<00:00, 42.01it/s]


Average Metric: 6 / 13  (46.2%)
Score: 46.15 for set: [7, 7]
New best score: 46.15 for seed 0
Scores so far: [23.08, 23.08, 38.46, 46.15]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


 14%|█▍        | 1/7 [00:00<00:00, 21.10it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 4 / 13  (30.8): 100%|██████████| 13/13 [00:00<00:00, 68.72it/s]


Average Metric: 4 / 13  (30.8%)
Score: 30.77 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


 29%|██▊       | 2/7 [00:00<00:00, 21.89it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 4 / 13  (30.8): 100%|██████████| 13/13 [00:00<00:00, 67.99it/s]


Average Metric: 4 / 13  (30.8%)
Score: 30.77 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77, 30.77]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


 43%|████▎     | 3/7 [00:00<00:00, 21.61it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 61.97it/s]


Average Metric: 3 / 13  (23.1%)
Score: 23.08 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77, 30.77, 23.08]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


 14%|█▍        | 1/7 [00:00<00:00, 21.15it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 5 / 13  (38.5): 100%|██████████| 13/13 [00:00<00:00, 44.95it/s]


Average Metric: 5 / 13  (38.5%)
Score: 38.46 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77, 30.77, 23.08, 38.46]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


100%|██████████| 7/7 [00:00<00:00, 22.62it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


Average Metric: 4 / 13  (30.8): 100%|██████████| 13/13 [00:00<00:00, 66.59it/s]


Average Metric: 4 / 13  (30.8%)
Score: 30.77 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77, 30.77, 23.08, 38.46, 30.77]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


 57%|█████▋    | 4/7 [00:00<00:00, 23.46it/s]


Bootstrapped 1 full traces after 5 examples in round 0.


Average Metric: 4 / 13  (30.8): 100%|██████████| 13/13 [00:00<00:00, 68.29it/s]


Average Metric: 4 / 13  (30.8%)
Score: 30.77 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77, 30.77, 23.08, 38.46, 30.77, 30.77]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384


100%|██████████| 7/7 [00:00<00:00, 20.87it/s]


Bootstrapped 1 full traces after 7 examples in round 0.


Average Metric: 4 / 13  (30.8): 100%|██████████| 13/13 [00:00<00:00, 70.76it/s]

Average Metric: 4 / 13  (30.8%)
Score: 30.77 for set: [7, 7]
Scores so far: [23.08, 23.08, 38.46, 46.15, 30.77, 30.77, 23.08, 38.46, 30.77, 30.77, 30.77]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.5384615384615384
Average of max per entry across top 3 scores: 0.5384615384615384
Average of max per entry across top 5 scores: 0.5384615384615384
Average of max per entry across top 8 scores: 0.5384615384615384
Average of max per entry across top 9999 scores: 0.5384615384615384
11 candidate programs found.
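The log above boils down to a simple selection rule: score every candidate program on the validation set and keep the best one. The sketch below replays the eleven scores from the final `Scores so far` line. It assumes candidates are numbered consecutively starting at seed -3, which is consistent with the `New best score` lines in the log; the function `select_best` is hypothetical.

```python
# Validation scores copied from the final "Scores so far" line of the log.
scores = [23.08, 23.08, 38.46, 46.15, 30.77, 30.77, 23.08, 38.46, 30.77, 30.77, 30.77]


def select_best(scores, first_seed=-3):
    """Return (best seed, best score, seeds at which the best score improved)."""
    best_seed, best_score, improvements = None, float("-inf"), []
    for seed, score in enumerate(scores, start=first_seed):
        if score > best_score:  # ties keep the earlier candidate
            best_seed, best_score = seed, score
            improvements.append(seed)
    return best_seed, best_score, improvements


print(select_best(scores))  # (0, 46.15, [-3, -1, 0])
```

Because the same 13 questions are used both to select the candidate and to report results, the selected score is likely an optimistic estimate of performance on unseen questions.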





Let's now evaluate this compiled version of RAG.

In [16]:
evaluate_hotpot(rag_compiled)

Average Metric: 6 / 13  (46.2): 100%|██████████| 13/13 [00:00<00:00, 137.18it/s]

Average Metric: 6 / 13  (46.2%)





Unnamed: 0,question,example_answer,rationale,pred_answer,answer_exact_match
0,Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin?,E. L. Doctorow,"answer this question. We know that E. L. Doctorow and Julia Peterkin are both authors, but we also know that E. L. Doctorow is known...",E. L. Doctorow.,✔️ [True]
1,Right Back At It Again contains lyrics co-written by the singer born in what city?,"Gainesville, Florida","answer this question. We know that Beyoncé is the singer who co-wrote the lyrics of ""Right Back At It Again"". We also know that Beyoncé...",Houston.,❌ [False]
2,What year was the party of the winner of the 1971 San Francisco mayoral election founded?,1828,answer this question. We know that the winner of the 1971 San Francisco mayoral election was a member of the Democratic Party. We also know...,1828.,✔️ [True]
3,Anthony Dirrell is the brother of which super middleweight title holder?,Andre Dirrell,answer this question. We know that Anthony Dirrell is a professional boxer. We also know that he held the WBC super middleweight title from 2014...,Andre Dirrell.,✔️ [True]
4,The sports nutrition business established by Oliver Cookson is based in which county in the UK?,Cheshire,"answer this question. We know that Oliver Cookson established Myprotein, a sports nutrition business. We also know that Myprotein was sold for a reported £58...",Cheshire.,✔️ [True]
5,Find the birth date of the actor who played roles in First Wives Club and Searching for the Elephant.,"February 13, 1980","answer this question. We know that the actor played roles in ""First Wives Club"" and ""Searching for the Elephant"". We also know that the actor's...","October 17, 1976.",❌ [False]
6,Kyle Moran was born in the town on what river?,Castletown River,answer this question. We know that Kyle Moran is an actor who was born in Livingston. We also know that Livingston is a town in...,River Forth.,❌ [False]
7,"The actress who played the niece in the Priest film was born in what city, country?","Surrey, England",answer this question. We know that Lily Collins is an actress and the daughter of Phil Collins. We also know that she was born in...,"Surrey, England.",✔️ [True]
8,Name the movie in which the daughter of Noel Harrison plays Violet Trefusis.,Portrait of a Marriage,answer this question. We know that Noel Harrison is the father of Dhani Harrison. We also know that Dhani Harrison is a member of the...,"The daughter of Noel Harrison plays Violet Trefusis in the movie ""The Killing Season"".",❌ [False]
9,What year was the father of the Princes in the Tower born?,1442,answer this question. We know that the father of the Princes in the Tower was King Richard III of England. We also know that he...,1452.,❌ [False]


46.15
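The table above marks predictions such as `E. L. Doctorow.` and `1828.` as matches even though they end with a period, which suggests that answers are normalized before comparison. Here is a minimal sketch of one common normalization scheme (lowercasing, then stripping punctuation, articles, and extra whitespace). It is an assumption about the general approach, not a copy of `dspy.evaluate.answer_exact_match`.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)


print(exact_match("E. L. Doctorow.", "E. L. Doctorow"))  # True
print(exact_match("1452.", "1442"))                      # False
```

Note that exact match stays strict about content: a full-sentence answer that merely contains the right title would still count as a miss.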

Let's inspect one of the LM calls for this. Focus in particular on the structure of the last few input/output examples in the prompt.

In [17]:
rag_compiled("What year was the party of the winner of the 1971 San Francisco mayoral election founded?")
llama.inspect_history(n=1)





Given the fields `context`, `question`, produce the fields `answer`.

---

Question: Which author is English: John Braine or Studs Terkel?
Answer: John Braine

Question: The heir to the Du Pont family fortune sponsored what wrestling team?
Answer: Foxcatcher

Question: Who produced the album that included a re-recording of "Lithium"?
Answer: Butch Vig

Question: In what year was the star of To Hell and Back born?
Answer: 1925

Question: What documentary about the Gilgo Beach Killer debuted on A&E?
Answer: The Killing Season

Question: Who was the director of the 2009 movie featuring Peter Outerbridge as William Easton?
Answer: Kevin Greutert

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: ${answer}

---

Context:
[1] «The Dancing Wu Li Masters | The Dancing Wu Li Masters is a 1979 book by Gary Zukav, a popular science work exploring modern physics, and quantum phen

### 5) Bonus 2: Multi-Hop Retrieval and Reasoning

Let's now build a simple multi-hop program, which will interleave multiple calls to the LM and the retriever.

Please follow the **TODO** instructions below to implement this.

In [18]:
from dsp.utils.utils import deduplicate

class MultiHop(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_query = dspy.ChainOfThought("question -> search_query")

        # TODO: Define a dspy.ChainOfThought module with the signature 'context, question -> search_query'.
        self.generate_query_from_context = dspy.ChainOfThought("context, question -> search_query")

        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        passages = []
        
        search_query = self.generate_query(question=question).search_query
        passages += self.retrieve(search_query).passages

        # TODO: Use self.generate_query_from_context to generate a search query.
        # Note: Modules require named keyword arguments (e.g., context=..., question=...).
        search_query = self.generate_query_from_context(context=passages, question=question).search_query

        # TODO: Use self.retrieve to retrieve passages. Append them to the list `passages`.
        passages += self.retrieve(search_query).passages

        return self.generate_answer(context=deduplicate(passages), question=question)
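Since both hops query the same index, the second search can return passages that the first one already found. The sketch below shows an order-preserving way to drop such repeats. The function `deduplicate_keep_order` is a hypothetical stand-in; we assume, without having verified it here, that the imported `deduplicate` behaves similarly.

```python
def deduplicate_keep_order(items):
    """Remove repeated items while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


hop1 = ["passage A", "passage B", "passage C"]
hop2 = ["passage B", "passage D", "passage A"]
print(deduplicate_keep_order(hop1 + hop2))
# ['passage A', 'passage B', 'passage C', 'passage D']
```

Keeping the original order means first-hop passages stay ahead of second-hop ones in the context given to the LM.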

In [19]:
multihop_compiled = teleprompter2.compile(MultiHop(), trainset=train, valset=dev)

Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 40.91it/s]


Average Metric: 3 / 13  (23.1%)
Score: 23.08 for set: [0, 0, 0]
New best score: 23.08 for seed -3
Scores so far: [23.08]
Best score: 23.08


Average Metric: 3 / 13  (23.1): 100%|██████████| 13/13 [00:00<00:00, 53.59it/s]


Average Metric: 3 / 13  (23.1%)
Score: 23.08 for set: [7, 7, 7]
Scores so far: [23.08, 23.08]
Best score: 23.08


 57%|█████▋    | 4/7 [00:00<00:00, 11.10it/s]


Bootstrapped 2 full traces after 5 examples in round 0.


Average Metric: 6 / 13  (46.2): 100%|██████████| 13/13 [00:00<00:00, 27.22it/s]


Average Metric: 6 / 13  (46.2%)
Score: 46.15 for set: [7, 7, 7]
New best score: 46.15 for seed -1
Scores so far: [23.08, 23.08, 46.15]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.46153846153846156
Average of max per entry across top 3 scores: 0.46153846153846156
Average of max per entry across top 5 scores: 0.46153846153846156
Average of max per entry across top 8 scores: 0.46153846153846156
Average of max per entry across top 9999 scores: 0.46153846153846156


 43%|████▎     | 3/7 [00:00<00:00, 15.41it/s]

Bootstrapped 2 full traces after 4 examples in round 0.



Average Metric: 5 / 13  (38.5): 100%|██████████| 13/13 [00:00<00:00, 27.45it/s]


Average Metric: 5 / 13  (38.5%)
Score: 38.46 for set: [7, 7, 7]
Scores so far: [23.08, 23.08, 46.15, 38.46]
Best score: 46.15
Average of max per entry across top 1 scores: 0.46153846153846156
Average of max per entry across top 2 scores: 0.46153846153846156
Average of max per entry across top 3 scores: 0.46153846153846156
Average of max per entry across top 5 scores: 0.46153846153846156
Average of max per entry across top 8 scores: 0.46153846153846156
Average of max per entry across top 9999 scores: 0.46153846153846156


 14%|█▍        | 1/7 [00:00<00:00, 16.50it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 7 / 13  (53.8): 100%|██████████| 13/13 [00:00<00:00, 37.79it/s]


Average Metric: 7 / 13  (53.8%)
Score: 53.85 for set: [7, 7, 7]
New best score: 53.85 for seed 1
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85]
Best score: 53.85
Average of max per entry across top 1 scores: 0.5384615384615384
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6153846153846154
Average of max per entry across top 5 scores: 0.6153846153846154
Average of max per entry across top 8 scores: 0.6153846153846154
Average of max per entry across top 9999 scores: 0.6153846153846154


 29%|██▊       | 2/7 [00:00<00:00, 17.20it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 8 / 13  (61.5): 100%|██████████| 13/13 [00:00<00:00, 55.08it/s]


Average Metric: 8 / 13  (61.5%)
Score: 61.54 for set: [7, 7, 7]
New best score: 61.54 for seed 2
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85, 61.54]
Best score: 61.54
Average of max per entry across top 1 scores: 0.6153846153846154
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6923076923076923
Average of max per entry across top 5 scores: 0.6923076923076923
Average of max per entry across top 8 scores: 0.6923076923076923
Average of max per entry across top 9999 scores: 0.6923076923076923


 43%|████▎     | 3/7 [00:00<00:00, 17.15it/s]


Bootstrapped 1 full traces after 4 examples in round 0.


Average Metric: 8 / 13  (61.5): 100%|██████████| 13/13 [00:00<00:00, 50.97it/s]


Average Metric: 8 / 13  (61.5%)
Score: 61.54 for set: [7, 7, 7]
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85, 61.54, 61.54]
Best score: 61.54
Average of max per entry across top 1 scores: 0.6153846153846154
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6153846153846154
Average of max per entry across top 5 scores: 0.6923076923076923
Average of max per entry across top 8 scores: 0.6923076923076923
Average of max per entry across top 9999 scores: 0.6923076923076923


 14%|█▍        | 1/7 [00:00<00:00, 11.73it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 5 / 13  (38.5): 100%|██████████| 13/13 [00:00<00:00, 38.16it/s]


Average Metric: 5 / 13  (38.5%)
Score: 38.46 for set: [7, 7, 7]
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85, 61.54, 61.54, 38.46]
Best score: 61.54
Average of max per entry across top 1 scores: 0.6153846153846154
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6153846153846154
Average of max per entry across top 5 scores: 0.6923076923076923
Average of max per entry across top 8 scores: 0.6923076923076923
Average of max per entry across top 9999 scores: 0.6923076923076923


 71%|███████▏  | 5/7 [00:00<00:00, 17.44it/s]


Bootstrapped 2 full traces after 6 examples in round 0.


Average Metric: 5 / 13  (38.5): 100%|██████████| 13/13 [00:00<00:00, 34.45it/s]


Average Metric: 5 / 13  (38.5%)
Score: 38.46 for set: [7, 7, 7]
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85, 61.54, 61.54, 38.46, 38.46]
Best score: 61.54
Average of max per entry across top 1 scores: 0.6153846153846154
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6153846153846154
Average of max per entry across top 5 scores: 0.6923076923076923
Average of max per entry across top 8 scores: 0.6923076923076923
Average of max per entry across top 9999 scores: 0.6923076923076923


 29%|██▊       | 2/7 [00:00<00:00, 20.30it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 5 / 13  (38.5): 100%|██████████| 13/13 [00:00<00:00, 56.06it/s]


Average Metric: 5 / 13  (38.5%)
Score: 38.46 for set: [7, 7, 7]
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85, 61.54, 61.54, 38.46, 38.46, 38.46]
Best score: 61.54
Average of max per entry across top 1 scores: 0.6153846153846154
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6153846153846154
Average of max per entry across top 5 scores: 0.6923076923076923
Average of max per entry across top 8 scores: 0.6923076923076923
Average of max per entry across top 9999 scores: 0.6923076923076923


 43%|████▎     | 3/7 [00:00<00:00, 20.01it/s]


Bootstrapped 2 full traces after 4 examples in round 0.


Average Metric: 7 / 13  (53.8): 100%|██████████| 13/13 [00:00<00:00, 26.05it/s]

Average Metric: 7 / 13  (53.8%)
Score: 53.85 for set: [7, 7, 7]
Scores so far: [23.08, 23.08, 46.15, 38.46, 53.85, 61.54, 61.54, 38.46, 38.46, 38.46, 53.85]
Best score: 61.54
Average of max per entry across top 1 scores: 0.6153846153846154
Average of max per entry across top 2 scores: 0.6153846153846154
Average of max per entry across top 3 scores: 0.6153846153846154
Average of max per entry across top 5 scores: 0.8461538461538461
Average of max per entry across top 8 scores: 0.8461538461538461
Average of max per entry across top 9999 scores: 0.8461538461538461
11 candidate programs found.
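
The "Average of max per entry across top k scores" lines in the log above can be read as an oracle-style statistic: among the k best-scoring candidate programs, take the best result on each dev example, then average over examples. The sketch below illustrates that reading only and is not DSPy's actual code. The helper `avg_of_max_per_entry` and the toy per-example scores are hypothetical.

```python
def avg_of_max_per_entry(candidates, k):
    """candidates: list of (overall_score, per_example_scores) pairs (hypothetical layout)."""
    # Keep the k candidates with the highest overall score.
    top_k = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    # Regroup so each tuple holds one example's scores across the kept candidates.
    per_example = zip(*(subscores for _, subscores in top_k))
    num_examples = len(top_k[0][1])
    # For each example, count it as solved if any kept candidate solved it.
    return sum(max(entry) for entry in per_example) / num_examples


# Toy data (made up for illustration): three candidates scored on four examples.
toy = [
    (50.0, [1, 1, 0, 0]),
    (25.0, [0, 0, 1, 0]),
    (0.0, [0, 0, 0, 0]),
]

for k in [1, 2, 9999]:
    print(k, avg_of_max_per_entry(toy, k))
```

Under this reading, the values in the log correspond to fractions of the 13 dev examples: 0.6153... is 8/13, 0.6923... is 9/13, and 0.8461... is 11/13. The final jump in the top-5 line would then mean that the last candidate answers some examples that the earlier top programs missed.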





In [20]:
evaluate_hotpot(multihop_compiled, devset=dev)

Average Metric: 8 / 13  (61.5): 100%|██████████| 13/13 [00:00<00:00, 92.27it/s]

Average Metric: 8 / 13  (61.5%)





| | question | example_answer | rationale | pred_answer | answer_exact_match |
|---|---|---|---|---|---|
| 0 | Who has a broader scope of profession: E. L. Doctorow or Julia Peterkin? | E. L. Doctorow | answer this question. We know that E. L. Doctorow is an American novelist, editor, and professor, and he has been described as one of the... | E. L. Doctorow. | ✔️ [True] |
| 1 | Right Back At It Again contains lyrics co-written by the singer born in what city? | Gainesville, Florida | answer this question. We know that Beyoncé is an American singer, songwriter, dancer, and actress, and she was born in Houston, Texas. Her album "Beyoncé"... | Houston. | ❌ [False] |
| 2 | What year was the party of the winner of the 1971 San Francisco mayoral election founded? | 1828 | answer this question. We know that the Democratic Party is one of the two major contemporary political parties in the United States, and it was... | 1828. | ✔️ [True] |
| 3 | Anthony Dirrell is the brother of which super middleweight title holder? | Andre Dirrell | answer this question. We know that Anthony Dirrell is a professional boxer who held the WBC super middleweight title from 2014 to 2015. We also... | Andre Dirrell. | ✔️ [True] |
| 4 | The sports nutrition business established by Oliver Cookson is based in which county in the UK? | Cheshire | answer this question. We know that Oliver Cookson is a UK entrepreneur who established the sports nutrition business Myprotein. We also know that Myprotein was... | Cheshire. | ✔️ [True] |
| 5 | Find the birth date of the actor who played roles in First Wives Club and Searching for the Elephant. | February 13, 1980 | answer this question. We know that the actor's name is Jo Dong-hyuk, and he was born on December 11, 1977. Therefore, the answer is December... | December 11, 1977. | ❌ [False] |
| 6 | Kyle Moran was born in the town on what river? | Castletown River | answer this question. We know that Kyle Moran is an Irish footballer who plays as a forward for Perth SC in the NPL Western Australia.... | River Dundalk. | ❌ [False] |
| 7 | The actress who played the niece in the Priest film was born in what city, country? | Surrey, England | answer this question. We know that Lily Collins is an actress, and she was born in Surrey, England. Therefore, the answer is Surrey, England. | Surrey, England. | ✔️ [True] |
| 8 | Name the movie in which the daughter of Noel Harrison plays Violet Trefusis. | Portrait of a Marriage | answer this question. We know that Cathryn Harrison is the daughter of Noel Harrison, and she is an English actress. One of her roles was... | First Daughter. | ❌ [False] |
| 9 | What year was the father of the Princes in the Tower born? | 1442 | answer this question. We know that the father of the Princes in the Tower was King Richard III of England, and he was born on... | 1452. | ❌ [False] |


61.54
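
In the table above, predictions such as `E. L. Doctorow.` are marked correct against `E. L. Doctorow` despite the trailing period, which suggests that the metric normalizes answers before comparing them. The sketch below shows a SQuAD-style normalized exact match as one way this could work. `normalize_text` and `exact_match` are hypothetical helpers, and the exact normalization steps DSPy applies are an assumption here.

```python
import re
import string


def normalize_text(s: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the two answers are identical after normalization."""
    return normalize_text(prediction) == normalize_text(gold)


# Answer pairs taken from the evaluation table above.
print(exact_match("E. L. Doctorow.", "E. L. Doctorow"))
print(exact_match("Houston.", "Gainesville, Florida"))
print(exact_match("1452.", "1442"))
```

A strict metric like this explains why near misses such as `River Dundalk.` versus `Castletown River` count as failures: only the normalized strings are compared, with no partial credit.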

Let's now inspect the prompt for the second-hop search query for one of the questions.

In [21]:
multihop_compiled(question="Who purchased the team Michael Schumacher raced for in the 1995 Monaco Grand Prix in 2000?")
llama.inspect_history(n=1, skip=2)





Given the fields `context`, `question`, produce the fields `search_query`.

---

Follow the following format.

Context: ${context}

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the search_query}. We ...

Search Query: ${search_query}

---

Context:
[1] «The Dancing Wu Li Masters | The Dancing Wu Li Masters is a 1979 book by Gary Zukav, a popular science work exploring modern physics, and quantum phenomena in particular. It was awarded a 1980 U.S. National Book Award in category of Science. Although it explores empirical topics in modern physics research, "The Dancing Wu Li Masters" gained attention for leveraging metaphors taken from eastern spiritual movements, in particular the Huayen school of Buddhism with the monk Fazang's treatise on The Golden Lion, to explain quantum phenomena and has been regarded by some reviewers as a New Age work, although the book is mostly concerned with the work of pioneers in western physics down through the ages.