# Preparation

This tutorial uses OpenRouter, through which we can access GPT-3.5 in regions where it is otherwise blocked.

In [2]:
from dotenv import load_dotenv
import os

load_dotenv()
OPENROUTER_API_KEY = os.environ.get('OPENROUTER_API_KEY')

In the cell below, you can use the `dspy.OpenAI` class directly if it works for you.

In [3]:
import dspy

# Configure the LM. This example uses OpenAI's GPT-3.5-turbo, served through OpenRouter
lm = dspy.Databricks(api_key=OPENROUTER_API_KEY,
                     api_base="https://openrouter.ai/api/v1",
                     model="openai/gpt-3.5-turbo")

# Configure DSPy to use this language model by default
dspy.settings.configure(lm=lm)



# Basic Concept of DSPy: Signatures and Modules

Signatures and modules are the building blocks of prompt programming in DSPy. Let's dive in to see what they are about!

## Signatures: Specification of input/output

A signature is the most fundamental building block in DSPy's prompt programming: a declarative specification of the input/output behavior of a DSPy module. Signatures let you tell the LM **what** it needs to do, rather than specifying **how** we should ask the LM to do it.

In its most basic form, a signature is a single string that separates the inputs from the outputs with `->`.

In [4]:
# Define signature
signature = 'sentence -> sentiment'
classify = dspy.Predict(signature)

# Run
sentence = "it's a charming and often affecting journey."
classify(sentence=sentence).sentiment


"I'm sorry, but I am unable to determine the sentiment of the sentence without additional context or information. If you provide me with more details or specific criteria for determining sentiment, I would be happy to assist you further."

The prediction is not a good one, but for instructional purposes let's inspect the prompt that was issued.

In [5]:
lm.inspect_history(n=1)





Given the fields `sentence`, produce the fields `sentiment`.

---

Follow the following format.

Sentence: ${sentence}
Sentiment: ${sentiment}

---

Sentence: it's a charming and often affecting journey.
Sentiment: I'm sorry, but I am unable to determine the sentiment of the sentence without additional context or information. If you provide me with more details or specific criteria for determining sentiment, I would be happy to assist you further.





We can see this prompt is assembled from the `sentence -> sentiment` signature.

As the output below shows, when we pass the signature to `dspy.Predict()`, it is parsed into the `signature` attribute of the `classify` object and subsequently assembled into a prompt. The `instructions` value is DSPy's default.

In [6]:
vars(classify)

{'stage': 'f64b0ae92bc37733',
 'signature': StringSignature(sentence -> sentiment
     instructions='Given the fields `sentence`, produce the fields `sentiment`.'
     sentence = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Sentence:', 'desc': '${sentence}'})
     sentiment = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'output', 'prefix': 'Sentiment:', 'desc': '${sentiment}'})
 ),
 'config': {},
 'lm': None,
 'traces': [],
 'train': [],
 'demos': []}
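
To make the parsing and assembly steps concrete, here is a minimal standard-library sketch that rebuilds the prompt shown earlier from the string signature. It illustrates the idea only and is not DSPy's actual implementation; `parse_signature` and `build_prompt` are hypothetical helper names.

```python
def parse_signature(signature: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (['a', 'b'], ['c'])."""
    inputs, outputs = signature.split("->")

    def fields(part: str) -> list[str]:
        return [name.strip() for name in part.split(",") if name.strip()]

    return fields(inputs), fields(outputs)


def build_prompt(signature: str, **values: str) -> str:
    """Assemble instructions, a format template, and the filled-in query."""
    inputs, outputs = parse_signature(signature)
    instructions = (
        "Given the fields "
        + ", ".join(f"`{n}`" for n in inputs)
        + ", produce the fields "
        + ", ".join(f"`{n}`" for n in outputs)
        + "."
    )
    # Format section: one "Prefix: ${field}" placeholder line per field.
    template = "\n".join(f"{n.capitalize()}: ${{{n}}}" for n in inputs + outputs)
    # Query section: filled-in inputs, then the first output prefix left open.
    filled = "\n".join(f"{n.capitalize()}: {values[n]}" for n in inputs)
    query = f"{filled}\n{outputs[0].capitalize()}:"
    return "\n\n---\n\n".join(
        [instructions, "Follow the following format.\n\n" + template, query]
    )


prompt = build_prompt(
    "sentence -> sentiment",
    sentence="it's a charming and often affecting journey.",
)
print(prompt)
```

The printed text has the same three sections as the inspected prompt: the default instruction, the format template, and the query with the output prefix left open for the LM to complete.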

What if we want to give the LLM a more detailed description of our objective than the basic `sentence -> sentiment` signature allows? To do so, we write a more verbose signature in the form of a **class-based DSPy signature**.

Notice that we give no explicit instruction on how the LLM should obtain the sentiment. We only describe the task at hand and the expected output.

In [7]:
# Define signature in Class-based form
class Emotion(dspy.Signature):
    # Describe the task
    """Classify emotions in a sentence."""
    
    sentence = dspy.InputField()
    # Adding description to the output field
    sentiment = dspy.OutputField(desc="Possible choices: sadness, joy, love, anger, fear, surprise.")

classify_class_based = dspy.Predict(Emotion)

# Issue prediction
classify_class_based(sentence=sentence).sentiment

"Sentence: It's a charming and often affecting journey.\nSentiment: joy"

The prediction is now much better! Again, we see that the descriptions we wrote when defining the class-based DSPy signature are assembled into the prompt.

In [8]:
lm.inspect_history(n=1)





Classify emotions in a sentence.

---

Follow the following format.

Sentence: ${sentence}
Sentiment: Possible choices: sadness, joy, love, anger, fear, surprise.

---

Sentence: it's a charming and often affecting journey.
Sentiment: Sentence: It's a charming and often affecting journey.
Sentiment: joy
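
Note that the raw completion repeats the `Sentence:` line before giving the label. If you need a clean label downstream, one option is to post-process the text yourself. The helper below is a hypothetical standard-library sketch (`extract_labels` is not part of DSPy) that keeps only the text after the last `Sentiment:` prefix and filters it against the choices listed in the output field description.

```python
ALLOWED = {"sadness", "joy", "love", "anger", "fear", "surprise"}


def extract_labels(raw: str, prefix: str = "Sentiment:") -> list[str]:
    """Return the allowed labels found after the last `prefix` in `raw`."""
    # If the prefix is absent, rsplit returns the whole string unchanged.
    tail = raw.rsplit(prefix, 1)[-1]
    candidates = [part.strip().lower().rstrip(".") for part in tail.split(",")]
    return [label for label in candidates if label in ALLOWED]


raw = "Sentence: It's a charming and often affecting journey.\nSentiment: joy"
print(extract_labels(raw))          # ['joy']
print(extract_labels("Joy, love"))  # ['joy', 'love']
```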





This might do for simple tasks, but advanced applications may require more sophisticated prompting techniques such as Chain of Thought or ReAct. In DSPy these are implemented as `Modules`.

## Modules: Abstracting prompting techniques

We are used to hardcoding phrases like `let's think step by step` in our prompts. In DSPy these prompting techniques are abstracted as **Modules**. The example below applies our class-based signature to the `dspy.ChainOfThought` module.


In [9]:
# Apply the class-based `Emotion` signature to Chain of Thought
classify_cot = dspy.ChainOfThought(Emotion)

# Run
classify_cot(sentence=sentence).sentiment

# Inspect prompt
lm.inspect_history(n=1)






Classify emotions in a sentence.

---

Follow the following format.

Sentence: ${sentence}
Reasoning: Let's think step by step in order to ${produce the sentiment}. We ...
Sentiment: Possible choices: sadness, joy, love, anger, fear, surprise.

---

Sentence: it's a charming and often affecting journey.
Reasoning: Let's think step by step in order to Sentence: It's a charming and often affecting journey.
Reasoning: Let's think step by step in order to determine the sentiment. The use of the words "charming" and "affecting" suggests positive emotions associated with enjoyment and emotional impact. We can infer that the overall tone is positive and heartwarming, evoking feelings of joy and possibly love.
Sentiment: Joy, love





Notice how the "Reasoning: Let's think step by step..." phrase is added to our prompt, and the quality of our prediction is even better now.
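
Conceptually, the module changes the format section of the prompt by slotting a reasoning line in front of the output field. The sketch below mimics that transformation on plain strings. It is an illustration under that assumption, not the code `dspy.ChainOfThought` runs, and `add_reasoning` is a hypothetical name.

```python
def add_reasoning(format_lines: list[str], output_name: str) -> list[str]:
    """Insert a step-by-step reasoning line before the last (output) line."""
    reasoning = (
        "Reasoning: Let's think step by step in order to "
        f"${{produce the {output_name}}}. We ..."
    )
    return format_lines[:-1] + [reasoning] + format_lines[-1:]


# Format section produced for the plain Predict module
predict_format = [
    "Sentence: ${sentence}",
    "Sentiment: Possible choices: sadness, joy, love, anger, fear, surprise.",
]

cot_format = add_reasoning(predict_format, "sentiment")
print("\n".join(cot_format))
```

The signature itself stays untouched, which is why the same `Emotion` class can be reused with either module.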

At the time of writing, DSPy provides the following prompting techniques in the form of modules. Notice that the `dspy.Predict` we used in the first example is also a module, one that applies no prompting technique!

1. `dspy.Predict`: Basic predictor. Does not modify the signature. Handles the key forms of learning (i.e., storing the instructions and demonstrations and updates to the LM).
2. `dspy.ChainOfThought`: Teaches the LM to think step-by-step before committing to the signature's response.
3. `dspy.ProgramOfThought`: Teaches the LM to output code, whose execution results will dictate the response.
4. `dspy.ReAct`: An agent that can use tools to implement the given signature.
5. `dspy.MultiChainComparison`: Can compare multiple outputs from ChainOfThought to produce a final prediction.

It also has some function-style modules:

6. `dspy.majority`: Can do basic voting to return the most popular response from a set of predictions.
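
The voting idea behind the last item can be sketched with the standard library alone. This is not `dspy.majority` itself: `majority_vote` is a hypothetical helper, and the lowercase/strip normalization is an assumption made for the example.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer after light normalization."""
    normalized = [answer.strip().lower() for answer in answers]
    # most_common(1) returns a list holding the single top (answer, count) pair.
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner


print(majority_vote(["Joy", "joy ", "love", "Joy", "surprise"]))  # joy
```

On a tie, `Counter.most_common` keeps the answer that appeared first, so the order of the candidate predictions matters.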

You can check out further examples in [each module's respective guide](https://dspy-docs.vercel.app/api/category/modules).

### Chaining the modules
What about RAG? We can chain modules together to deal with bigger problems!

First we define a retriever. For our example we use a ColBERT retriever that fetches information from Wikipedia Abstracts 2017.

In [12]:
# Configure retriever
rm = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')
dspy.settings.configure(rm = rm)

Then we define the `RAG` class, which inherits from `dspy.Module`. It needs two methods:
- The `__init__` method will simply declare the sub-modules it needs: `dspy.Retrieve` and `dspy.ChainOfThought`. The latter is defined to implement our `context, question -> answer` signature.
- The `forward` method will describe the control flow of answering the question using the modules we have.

Note: Code and description borrowed from [DSPy's introduction notebook](https://github.com/stanfordnlp/dspy/blob/main/intro.ipynb)

In [13]:
# Define a class-based signature
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

# Chain different modules together to retrieve information from Wikipedia Abstracts 2017, then pass it as context for Chain of Thought to generate an answer
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question)
        return answer
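
The same retrieve-then-generate control flow can be traced without an LM or a remote retriever. The sketch below is a toy stand-in: `CORPUS`, `retrieve`, `render_prompt` and `ToyRAG` are all hypothetical, the word-overlap ranking is nothing like ColBERT, and the "generator" only renders the text an LM would receive.

```python
from typing import Callable

# Toy corpus standing in for the Wikipedia abstracts index (illustrative only).
CORPUS = [
    "The FIFA World Cup was first held in 1930.",
    "The 1950 FIFA World Cup was held in Brazil.",
    "ColBERT is a neural retrieval model.",
]


def retrieve(question: str, k: int = 3) -> list[str]:
    """Rank passages by the number of words they share with the question."""
    q_words = set(question.lower().split())
    return sorted(
        CORPUS,
        key=lambda passage: len(q_words & set(passage.lower().split())),
        reverse=True,
    )[:k]


def render_prompt(context: list[str], question: str) -> str:
    """Stand-in for the LM call: show what the generator would be given."""
    numbered = "\n".join(f"[{i}] «{p}»" for i, p in enumerate(context, start=1))
    return f"Context:\n{numbered}\n\nQuestion: {question}"


class ToyRAG:
    def __init__(self, generate: Callable[[list[str], str], str], num_passages: int = 3):
        self.generate = generate
        self.num_passages = num_passages

    def forward(self, question: str) -> str:
        # Step 1: fetch passages. Step 2: hand them to the generator as context.
        context = retrieve(question, k=self.num_passages)
        return self.generate(context, question)


toy_rag = ToyRAG(generate=render_prompt, num_passages=2)
toy_prompt = toy_rag.forward("When was the first FIFA World Cup held?")
print(toy_prompt)
```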

Then we use the class to perform RAG.

In [14]:
# Initialize our RAG class
rag = RAG()

# Define a question and pass it into the RAG class
my_question = "When was the first FIFA World Cup held?"
rag(question=my_question).answer

'1930'

Inspecting the prompt, we see that the 3 passages retrieved from Wikipedia Abstracts 2017 are interspersed as context for Chain of Thought generation.

In [15]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the answer}. We ...

Answer: often between 1 and 5 words

---

Context:
[1] «History of the FIFA World Cup | The FIFA World Cup was first held in 1930, when FIFA president Jules Rimet decided to stage an international football tournament. The inaugural edition, held in 1930, was contested as a final tournament of only thirteen teams invited by the organization. Since then, the World Cup has experienced successive expansions and format remodeling to its current 32-team final tournament preceded by a two-year qualifying process, involving over 200 teams from around the world.»
[2] «1950 FIFA World Cup | The 1950 FIFA World Cup, held in Brazil from 24 June to 16 July 1950, was the fourth FIFA World Cup. It was the first World Cup since 1938, the planned 1942 and 1946 competitions having be

The above examples might not seem like much. In its most basic use, DSPy seems to do nothing that can't be done with an f-string, but it actually presents a paradigm shift in prompt writing, because it brings **modularity** to prompt composition!

First we describe our objective with `Signature`, then we apply different prompting techniques with `Modules`. To test different prompt techniques for a given problem, we can simply switch the modules used and compare their results, rather than hardcoding the "let's think step by step..." (for Chain of Thought) or "you will interleave Thought, Action, and Observation steps" (for ReAct) phrases.

The power of DSPy is not limited to modularity: it can also optimize our prompt based on training samples and test it systematically. We will explore this in the next section!

# Optimizer: Train our prompt as with machine learning
In this section we attempt to optimize our prompt for a RAG application.

Taking Chain of Thought as an example, beyond just adding the "let's think step by step" phrase, we can boost its performance with a few tweaks:
1. Adding suitable examples (aka **few-shot learning**).
2. Furthermore, we can **bootstrap demonstrations of reasoning** to teach the LMs to apply proper reasoning to deal with the task at hand. 

Doing this manually would be highly time-consuming and can't generalize to different problems, but with DSPy this can be done automatically. Let's dive in!

## Preparation
As in machine learning, to train our prompt we need to prepare training and test datasets. The first time it runs, this cell takes around 20 minutes.

In [10]:
from dspy.datasets.hotpotqa import HotPotQA

# For demonstration purposes we use a small subset of the HotPotQA dataset: 20 examples each for training and testing
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=20, test_size=0)
trainset = [x.with_inputs('question') for x in dataset.train]
testset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(testset)



(20, 20)

Let's inspect our dataset, which is basically a set of question-and-answer pairs.

In [17]:
trainset[0]

Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})

To make the optimization process easier to follow, we launch **Phoenix** to observe our DSPy application. It is a great tool for LLM observability in general!

Note: If you are on Windows, please also install the Windows C++ Build Tools [here](https://visualstudio.microsoft.com/visual-cpp-build-tools/), which are required by Phoenix.

In [18]:
# By default, Phoenix serves its UI on port 6006.
# If you have a port conflict, you can free the port by uncommenting the following code, which terminates the process using it

import phoenix as px
# import psutil
# 
# def close_port(port):
#     for conn in psutil.net_connections(kind='inet'):
#         if conn.laddr.port == port:
#             print(f"Closing port {port} by terminating PID {conn.pid}")
#             process = psutil.Process(conn.pid)
#             process.terminate()

# close_port(6006)

phoenix_session = px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


Configure our OpenTelemetry exporter, which will export spans and traces to Phoenix, and run the DSPy instrumentor to wrap calls to the relevant DSPy components.

In [19]:
from openinference.instrumentation.dspy import DSPyInstrumentor
from opentelemetry import trace as trace_api
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk import trace as trace_sdk
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

endpoint = "http://127.0.0.1:6006/v1/traces"
resource = Resource(attributes={})
tracer_provider = trace_sdk.TracerProvider(resource=resource)
span_otlp_exporter = OTLPSpanExporter(endpoint=endpoint)
tracer_provider.add_span_processor(SimpleSpanProcessor(span_exporter=span_otlp_exporter))

trace_api.set_tracer_provider(tracer_provider=tracer_provider)
DSPyInstrumentor().instrument()

## Prompt Optimization
Now we are ready to see what this optimization is about! To "train" our prompt, we need 3 things:

1. A training set. We'll just use our 20 question–answer examples from `trainset`.
2. A metric for validation. Here we use the built-in `dspy.evaluate.answer_exact_match`, which checks whether the predicted answer exactly matches the correct answer (a questionable criterion, but sufficient for demonstration). For real-life applications you can define your own evaluation criteria.
3. A specific **Optimizer** (formerly teleprompter). The DSPy library includes a number of optimization strategies, which you can check out [here](https://dspy-docs.vercel.app/docs/building-blocks/optimizers). For our example we use `BootstrapFewShot`.
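
As a sketch of what a custom evaluation criterion could look like, the hypothetical metric below follows the `(example, pred, trace=None)` calling convention used for DSPy metrics, but relaxes the comparison by ignoring case, punctuation, articles, and a leading `Answer:` prefix. `SimpleNamespace` objects stand in for DSPy's example and prediction objects so the snippet runs on its own.

```python
import re
import string
from types import SimpleNamespace


def normalize(text: str) -> str:
    """Lowercase, drop an 'answer:' prefix, punctuation, articles, extra spaces."""
    text = text.lower()
    text = re.sub(r"^\s*answer:\s*", "", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def relaxed_exact_match(example, pred, trace=None) -> bool:
    """Hypothetical metric: compare normalized gold and predicted answers."""
    return normalize(example.answer) == normalize(pred.answer)


gold = SimpleNamespace(answer="The Empire State Building")
print(relaxed_exact_match(gold, SimpleNamespace(answer="Answer: the Empire State Building")))  # True
print(relaxed_exact_match(gold, SimpleNamespace(answer="Bank of America Tower")))  # False
```

How much to relax is a design choice: a looser metric accepts more bootstrapped demonstrations, at the risk of accepting wrong ones.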

Now we train our prompt. Successful execution of the cell below should show "Bootstrapped 4 full traces after n examples in round 0".

In [20]:
from dspy.teleprompt import BootstrapFewShot

# Simple optimizer example. I am explicitly stating the default values for max_bootstrapped_demos and max_labeled_demos for demonstration purposes
optimizer = BootstrapFewShot(metric=dspy.evaluate.answer_exact_match, max_bootstrapped_demos=4, max_labeled_demos=16)

# Compile!
compiled_rag = optimizer.compile(RAG(), trainset=trainset)

 70%|███████   | 14/20 [00:00<00:00, 41.20it/s]

Bootstrapped 4 full traces after 15 examples in round 0.





Before using `compiled_rag` to answer a question, let's see what went on behind the scenes during the training process (aka compilation). We open the Phoenix console by visiting `http://localhost:6006/` in a browser.

In my run, 14 calls were made using the `RAG` class. In each of those calls we post a question to the LM to obtain a prediction.

Below is a summary of my run. It shows 4 correct answers out of 14 samples, which reaches our `max_bootstrapped_demos` limit and stops the calls.

|#|Question|Predicted Answer|Real Answer|Correct|
|---|---|---|---|---|
|1|At My Window was released by which American singer-songwriter?|Townes Van Zandt|John Townes Van Zandt|❌|
|2|which  American actor was Candace Kita  guest starred with|Context: Candace Kita has appeared in various television shows and films throughout her career, including guest-starring roles.\n\nQuestion: Which American actor was Candace Kita guest-starred with?\n\nReasoning: Let's think step by step in order to identify the American actor who appeared alongside Candace Kita in one of her guest-starring roles.\n\nAnswer: Bill Murray|Bill Murray|❌|
|3|Which of these publications was most recently published, Who Put the Bomp or Self?|Context: The publication \"Who Put the Bomp\" was a rock music fanzine edited and published by Greg Shaw from 1970 to 1979.\n\nQuestion: Which of these publications was most recently published, Who Put the Bomp or Self?\n\nReasoning: \"Who Put the Bomp\" was last published in 1979.\n\nAnswer: Who Put the|Self|❌|
|4|The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?|Answer: 1950|1950|❌|
|5|Which magazine has published articles by Scott Shaw, Tae Kwon Do Times or Southwest Art?|Tae Kwon Do Times|Tae Kwon Do Times|✔️|
|6|In what year was the club founded that played Manchester City in the 1972 FA Charity Shield|Context: The 1972 FA Charity Shield was contested between Manchester City and Aston Villa.\n\nQuestion: In what year was the club founded that played Manchester City in the 1972 FA Charity Shield?\n\nReasoning: Let's think step by step in order to find the answer. The club that played Manchester City in the 1972 FA Charity Shield was Aston Villa, and|1874|❌|
|7|Which is taller, the Empire State Building or the Bank of America Tower?|Answer: the Empire State Building|The Empire State Building|❌|
|8|Which American actress who made their film debut in the 1995 teen drama \"Kids\" was the co-founder of Voto Latino?|Context: Rosario Dawson is an American actress, producer, singer, comic book writer, and political activist who made her film debut in the 1995 teen drama \"Kids.\"\n\nQuestion: Which American actress who made their film debut in the 1995 teen drama \"Kids\" was the co-founder of Voto Latino?\n\nReasoning: Let's think step by step in|Rosario Dawson|❌|
|9|Tombstone stared an actor born May 17, 1955 known as who?|Bill Paxton|Bill Paxton|✔️|
|10|What is the code name for the German offensive that started this Second World War engagement on the Eastern Front (a few hundred kilometers from Moscow) between Soviet and German forces, which included 102nd Infantry Division?|Operation Citadel|Operation Citadel|✔️|
|11|Who acted in the shot film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet."?|Question: Who acted in the short film The Shore and is also the youngest actress ever to play Ophelia in a Royal Shakespeare Company production of "Hamlet." ?\n\nReasoning: Let's think step by step in order to find the answer. We know that this actress played Octavia of the Julii in the HBO/BBC series "Rome" and voiced F|Kerry Condon|❌|
|12|Which company distributed this 1977 American animated film produced by Walt Disney Productions for which Sherman Brothers wrote songs?|Buena Vista Distribution Company|Buena Vista Distribution|❌|
|13|Samantha Cristoforetti and Mark Shuttleworth are both best known for being first in their field to go where?|Context:\n[1] Samantha Cristoforetti and Mark Shuttleworth are both known for their achievements in space exploration.\n\nQuestion: Samantha Cristoforetti and Mark Shuttleworth are both best known for being first in their field to go where?\n\nReasoning: Let's think step by step in order to Answer: Space\n\nAnswer: Space|space|❌|
|14|Having the combination of excellent foot speed and bat speed helped Eric Davis, create what kind of outfield for the Los Angeles Dodgers?|Outfield of Dreams|Outfield of Dreams|✔️|
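
The early-stopping behavior in this table can be reproduced with a toy loop. This is a simplified sketch of the bootstrapping idea, not `BootstrapFewShot`'s implementation: it keeps only question/answer pairs rather than full traces, and `bootstrap_demos`, `toy_program` and the placeholder data are hypothetical. The positions in `CORRECT` mirror rows 5, 9, 10 and 14 above.

```python
def bootstrap_demos(program, trainset, metric, max_bootstrapped_demos=4):
    """Run `program` over the trainset, keep passing predictions, stop early."""
    demos = []
    attempts = 0
    for example in trainset:
        attempts += 1
        prediction = program(example["question"])
        if metric(example["answer"], prediction):
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) >= max_bootstrapped_demos:
            break  # enough demonstrations collected, skip the remaining examples
    return demos, attempts


# Zero-based positions of the examples the toy program answers correctly.
CORRECT = {4, 8, 9, 13}

# Placeholder training set: questions "q0".."q19" with gold answers "a0".."a19".
toy_trainset = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(20)]


def toy_program(question: str) -> str:
    idx = int(question[1:])
    return f"a{idx}" if idx in CORRECT else "wrong"


demos, attempts = bootstrap_demos(toy_program, toy_trainset, lambda gold, pred: gold == pred)
print(len(demos), attempts)  # 4 14
```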

But what prompts did DSPy issue to obtain the bootstrapped demos? Here's the prompt for question #14. We can see that when DSPy tries to generate a bootstrapped demo, it randomly adds samples from our `trainset` for few-shot learning.

In [21]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: Samantha Cristoforetti and Mark Shuttleworth are both best known for being first in their field to go where?
Answer: space

Question: which American actor was Candace Kita guest starred with
Answer: Bill Murray

Question: Tombstone stared an actor born May 17, 1955 known as who?
Answer: Bill Paxton

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas founded in what year?
Answer: 2010

Question: Which is taller, the Empire State Building or the Bank of America Tower?
Answer: The Empire State Building

Question: This American guitarist best known for her work with the Iron Maidens is an ancestor of a composer who was known as what?
Answer: The Waltz King

Question: Which magazine has published articles by Scott Shaw, Tae Kwon Do Times or Southwest Art?
Answer: Tae Kwon Do Times

Question: Which American actress who made their film debut in the 1995 tee

Time to put `compiled_rag` to the test! Here we ask a question that was answered wrongly in our summary table and see if we get the right answer this time.

In [22]:
compiled_rag(question="Which of these publications was most recently published, Who Put the Bomp or Self?")

Prediction(
    rationale='Answer: Self',
    answer='Self'
)

We now get the right answer!

Again let's inspect the prompt issued. Notice how the compiled prompt is different from the ones that were used during bootstrapping. Apart from the few-shot examples, **bootstrapped Context-Question-Reasoning-Answer demonstrations are added to the prompt**, improving the LM's capability.

In [23]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: "Everything Has Changed" is a song from an album released under which record label ?
Answer: Big Machine Records

Question: The Victorians - Their Story In Pictures is a documentary series written by an author born in what year?
Answer: 1950

Question: Which Pakistani cricket umpire who won 3 consecutive ICC umpire of the year awards in 2009, 2010, and 2011 will be in the ICC World Twenty20?
Answer: Aleem Sarwar Dar

Question: Samantha Cristoforetti and Mark Shuttleworth are both best known for being first in their field to go where?
Answer: space

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Answer: Aleksandr Danilovich Aleksandrov

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas founded in what year?
Answer: 201

The above example still falls short of what we typically do in machine learning, where we define a couple of candidate models, see how they perform against the test set, and select the one achieving the highest score. This is what we will do next!

# Full-fledged example: "Models" comparison with LLM

## The aim of this example

Typically, when comparing LMs, we raise underspecified questions like “how do different LMs compare on a certain problem”. With DSPy's modular, composable programs and optimizers, we can instead ask “how do they compare on a certain problem with module X when compiled with optimizer Y”. That is a well-defined and reproducible run, which reduces the role of artful prompt construction in modern AI.

In this section, we address the question: "Given the LM we use (GPT-3.5 Turbo), which combination of module and optimizer performs RAG best at getting the right answer?"

The modules under evaluation are:
- **Vanilla**: Single-hop RAG to answer a question based on the retrieved context, without key phrases like "let's think step by step"
- **COT**: Single-hop RAG with Chain of Thought
- **ReAct**: Single-hop RAG with ReAct prompting
- **BasicMultiHop**: 2-hop RAG with Chain of Thought

And the optimizer candidates are:
- **None**: No additional instructions apart from the signature
- **Labeled few-shot**: Simply constructs few-shot examples from provided labeled Q/A pairs
- **Bootstrap few-shot**: As we demonstrated, self-generates complete demonstrations for every stage of our module. It simply uses the generated demonstrations (if they pass the metric) without any further optimization. For `Vanilla` it is equivalent to "Labeled few-shot"

As the evaluation metric, we use exact match (`dspy.evaluate.metrics.answer_exact_match`) against the test set.

*Note: exact match is a very questionable evaluation criterion, but it suffices to illustrate the idea. Feel free to explore other criteria.*

Let's begin! First, we define our modules

In [11]:
# Vanilla
class Vanilla(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.Predict("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question)
        return answer
    
vanilla = Vanilla()

# COT
class COT(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
    
    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question)
        return answer
    
cot = COT()

# ReAct
react = dspy.ReAct("question-> answer", tools=[dspy.Retrieve(k=3)], max_iters=5)

# BasicMultiHop
class BasicMultiHop(dspy.Module):
    def __init__(self, passages_per_hop=3):
        # Initialize the dspy.Module base class, as the other modules do
        super().__init__()
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = dspy.ChainOfThought("context, question-> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question-> answer")

    def forward(self, question):
        context = []

        for hop in range(2):
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages

        return self.generate_answer(context=context, question=question)
    
multihop = BasicMultiHop(passages_per_hop=3)
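
To see what the two-hop loop in `BasicMultiHop` accumulates, here is a toy version with the LM and retriever replaced by lookups. Everything in it is hypothetical (`multihop_context`, `TOY_INDEX`, the placeholder passages), and unlike the class above it skips passages that were already collected, which is an addition made for this sketch.

```python
def multihop_context(question, generate_query, retrieve, hops=2):
    """Build up context over several hops, querying with what is known so far."""
    context = []
    for _ in range(hops):
        query = generate_query(context, question)
        for passage in retrieve(query):
            if passage not in context:  # skip passages already collected
                context.append(passage)
    return context


# Placeholder index: query string -> passages (illustrative data only).
TOY_INDEX = {
    "where was X born": ["X was born in city Y."],
    "population of city Y": ["City Y has 1,000 inhabitants.", "X was born in city Y."],
}


def toy_retrieve(query):
    return TOY_INDEX.get(query, [])


def toy_generate_query(context, question):
    # First hop: look up the entity. Second hop: follow the fact just retrieved.
    return "where was X born" if not context else "population of city Y"


hop_context = multihop_context(
    "X was born in a city with how many inhabitants?",
    toy_generate_query,
    toy_retrieve,
)
print(hop_context)
```

The second query only makes sense because of what the first hop returned, which is why the quality of the generated `search_query` matters so much for this module.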

Then we define the module and optimizer combinations that make up our model candidates.

In [12]:
from dspy.teleprompt import LabeledFewShot, BootstrapFewShot

metric = dspy.evaluate.metrics.answer_exact_match

modules = {
    'vanilla': vanilla,
    'cot': cot,
    'react': react,
    'multihop': multihop,
}

optimizers = {
    'none': None,
    'labeled_few_shot': LabeledFewShot(),
    'bootstrap_few_shot': BootstrapFewShot(metric=metric, max_errors=20),
}

Here we define a helper class to facilitate the evaluation.

In [29]:
from dspy.evaluate.evaluate import Evaluate
import pandas as pd

class ModelSelection():

    # Compile our models
    def __init__(self, modules, optimizers, metric, trainset):
        self.models = []
        self.metric = metric
        
        for module_name, module in modules.items():
            print(f'Compiling models for {module_name}...')
            models_for_a_program = {'module_name': module_name, 'optimizers': []}

            for optimizer_name, optimizer in optimizers.items():
                print(f'...{optimizer_name}')
                if optimizer is None:
                    compiled_model = module
                else:
                    compiled_model = optimizer.compile(student=module, trainset=trainset)

                optimizer = {
                        'name': optimizer_name,
                        'compiled_model': compiled_model
                }

                models_for_a_program['optimizers'].append(optimizer)

            self.models.append(models_for_a_program)

    # Evaluate our models against the testset. After evaluation, we will have a matrix of models and their scores under the evaluation_matrix attribute
    def evaluate(self, testset):
        evaluator = Evaluate(devset=testset, metric=self.metric, num_threads=3, return_outputs=True)
        for module in self.models:
            print(f"""Evaluating models for {module['module_name']}...""")
            for optimizer in module['optimizers']:
                compiled_model = optimizer['compiled_model']
                evaluation_score, outputs = evaluator(compiled_model)
                optimizer['score'] = evaluation_score

        # read dict into a dataframe
        df = pd.DataFrame(self.models)

        # unnest optimizers column
        df = df.explode('optimizers')

        # extract name/score column from optimizers
        df['optimizer'] = df['optimizers'].apply(lambda x: x['name'])
        df['score'] = df['optimizers'].apply(lambda x: x['score'])

        df.drop(columns=['optimizers'], inplace=True)
        self.evaluation_matrix = df

    # Raise a question against the compiled model
    def question_for_model(self, module_name, optimizer_name, question):
        for model in self.models:
            if model['module_name'] == module_name:
                for s in model['optimizers']:
                    if s['name'] == optimizer_name:
                        return s['compiled_model'](question=question)

We are now ready to start the evaluation. It takes around 20 minutes to complete.

In [30]:
# Compile the models
ms = ModelSelection(modules=modules, optimizers=optimizers, metric=metric, trainset=trainset)

# Evaluate them
ms.evaluate(testset=testset)

Compiling models for vanilla...
...none
...labeled_few_shot
...bootstrap_few_shot


100%|██████████| 20/20 [01:26<00:00,  4.32s/it]


Bootstrapped 0 full traces after 20 examples in round 0.
Compiling models for cot...
...none
...labeled_few_shot
...bootstrap_few_shot


 65%|██████▌   | 13/20 [00:59<00:31,  4.55s/it]


Bootstrapped 4 full traces after 14 examples in round 0.
Compiling models for react...
...none
...labeled_few_shot
...bootstrap_few_shot


 40%|████      | 8/20 [01:38<02:24, 12.07s/it]

Failed to run or to evaluate example Example({'question': 'Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?', 'answer': 'Rosario Dawson'}) (input_keys={'question'}) with <function answer_exact_match at 0x0000015CBFCA4A60> due to 'NoneType' object is not iterable.


100%|██████████| 20/20 [04:17<00:00, 12.86s/it]


Bootstrapped 3 full traces after 20 examples in round 0.
Compiling models for multihop...
...none
...labeled_few_shot
...bootstrap_few_shot


 40%|████      | 8/20 [01:57<02:56, 14.74s/it]


Bootstrapped 4 full traces after 9 examples in round 0.
Evaluating models for vanilla...
Average Metric: 0 / 20  (0.0%)
Average Metric: 0 / 20  (0.0%)
Average Metric: 0 / 20  (0.0%)
Evaluating models for cot...
Average Metric: 0 / 20  (0.0%)
Average Metric: 7 / 20  (35.0%)
Average Metric: 10 / 20  (50.0%)
Evaluating models for react...
Average Metric: 5 / 20  (25.0%)
Average Metric: 5 / 20  (25.0%)
Average Metric: 7 / 20  (35.0%)
Evaluating models for multihop...
Error for example in dev set: 		 'NoneType' object is not iterable
Average Metric: 0.0 / 20  (0.0%)
Error for example in dev set: 		 'NoneType' object is not iterable
Average Metric: 6.0 / 20  (30.0%)
Average Metric: 7 / 20  (35.0%)


Here's the evaluation result. We can see that the `COT` module with the `BootstrapFewShot` optimizer yields the best performance, at 50%.

In [31]:
ms.evaluation_matrix

Unnamed: 0,module_name,optimizer,score
0,vanilla,none,0.0
0,vanilla,labeled_few_shot,0.0
0,vanilla,bootstrap_few_shot,0.0
1,cot,none,0.0
1,cot,labeled_few_shot,35.0
1,cot,bootstrap_few_shot,50.0
2,react,none,25.0
2,react,labeled_few_shot,25.0
2,react,bootstrap_few_shot,35.0
3,multihop,none,0.0


But before we conclude the exercise, it is worth inspecting the result more closely: `Multihop with BootstrapFewShot`, which is supposedly equipped with more relevant context than `COT with BootstrapFewShot`, performs worse (35% versus 50%). That is strange!

## Debug and fine-tune our prompt

Now head to the Phoenix Console to see what's going on. We pick a random question, `William Hughes Miller was born in a city with how many inhabitants ?`, and inspect how COT, ReAct, and BasicMultiHop with the BootstrapFewShot optimizer came up with their answers. To filter for it, type this in the search bar: `"""William Hughes Miller was born in a city with how many inhabitants ?""" in input.value`

The table below shows the answers provided by the 3 models:

|Model|Predicted answer|
|---|---|
|Multihop with BootstrapFewShot|The answer will vary based on the specific city of William Hughes Miller's birthplace.|
|ReAct with BootstrapFewShot|Kosciusko, Mississippi|
|COT with BootstrapFewShot|The city of Kosciusko, Mississippi, has a population of approximately 7,402 inhabitants.|

The correct answer is `7,402 at the 2010 census`. Both `ReAct with BootstrapFewShot` and `COT with BootstrapFewShot` provided relevant answers, but `Multihop with BootstrapFewShot` simply failed to provide one. Checking the execution trace in Phoenix, it looks like the LM fails to understand what is expected for the `search_query` field specified in the signature.
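To see why a relevant but wordy answer still scores zero, here is a minimal sketch of an exact-match check. It is a simplified stand-in and not DSPy's `answer_exact_match` implementation: the `normalize` and `exact_match` helpers are hypothetical, and the gold label `7,402` is an assumption taken from the comment in a later cell.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Hypothetical helper: True only if both strings normalize to the same text."""
    return normalize(prediction) == normalize(gold)


gold = "7,402"  # assumed gold label for the question

# Predicted answers copied from the table above
predictions = {
    "multihop": "The answer will vary based on the specific city of William Hughes Miller's birthplace.",
    "react": "Kosciusko, Mississippi",
    "cot": "The city of Kosciusko, Mississippi, has a population of approximately 7,402 inhabitants.",
}

results = {name: exact_match(pred, gold) for name, pred in predictions.items()}
print(results)  # every entry is False, even the COT answer that contains 7,402
print(exact_match("7,402", gold))  # True
```

Under a metric of this kind, only a short answer that reduces to the gold string can score, which is one reason the answer field benefits from an explicit length hint.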

Execute the cells below to revise the signatures and re-run our model comparison.

In [73]:
# Define class-based signatures
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""
    
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

class FollowupQuery(dspy.Signature):
    """Generate a query which is conducive to answering the question"""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    search_query = dspy.OutputField(desc="Judge if the context is adequate to answer the question, if not adequate or if it is blank, generate a search query that would help you answer the question.")

In [79]:
# Revise the modules with the class-based signatures.
## Vanilla
class VanillaRevised(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.Predict(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question)
        return answer
    
vanilla_revised = VanillaRevised()

## COT
class COTRevised(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        answer = self.generate_answer(context=context, question=question)
        return answer
    
cot_revised = COTRevised()

## ReAct
react_revised = dspy.ReAct(BasicQA, tools=[dspy.Retrieve(k=3)], max_iters=5)

## BasicMultiHop
class BasicMultiHopRevised(dspy.Module):
    def __init__(self, passages_per_hop=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_query = dspy.ChainOfThought(FollowupQuery)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        context = []

        for hop in range(2):
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages

        return self.generate_answer(context=context, question=question)
    
multihop_revised = BasicMultiHopRevised(passages_per_hop=3)
    
modules_revised = {
    'vanilla': vanilla_revised,
    'cot': cot_revised,
    'react': react_revised,
    'multihop': multihop_revised,
}

# Re-compile and evaluate
ms_revised = ModelSelection(modules=modules_revised, optimizers=optimizers, metric=metric, trainset=trainset)
ms_revised.evaluate(testset=testset)
ms_revised.evaluation_matrix

Compiling models for vanilla...
...none
...labeled_few_shot
...bootstrap_few_shot


  0%|          | 0/20 [00:00<?, ?it/s]

 40%|████      | 8/20 [00:00<00:00, 45.18it/s]


Bootstrapped 4 full traces after 9 examples in round 0.
Compiling models for cot...
...none
...labeled_few_shot
...bootstrap_few_shot


 70%|███████   | 14/20 [00:00<00:00, 42.46it/s]


Bootstrapped 4 full traces after 15 examples in round 0.
Compiling models for react...
...none
...labeled_few_shot
...bootstrap_few_shot


 60%|██████    | 12/20 [00:00<00:00, 23.12it/s]

Failed to run or to evaluate example Example({'question': 'Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?', 'answer': 'Rosario Dawson'}) (input_keys={'question'}) with <function answer_exact_match at 0x0000015CBFCA4A60> due to 'NoneType' object is not iterable.


100%|██████████| 20/20 [00:00<00:00, 22.71it/s]


Bootstrapped 3 full traces after 20 examples in round 0.
Compiling models for multihop...
...none
...labeled_few_shot
...bootstrap_few_shot


 25%|██▌       | 5/20 [00:00<00:00, 33.57it/s]


Bootstrapped 4 full traces after 6 examples in round 0.
Evaluating models for vanilla...
Average Metric: 5 / 20  (25.0%)
Average Metric: 8 / 20  (40.0%)
Average Metric: 8 / 20  (40.0%)
Evaluating models for cot...
Average Metric: 10 / 20  (50.0%)
Average Metric: 9 / 20  (45.0%)
Average Metric: 10 / 20  (50.0%)
Evaluating models for react...
Average Metric: 5 / 20  (25.0%)
Average Metric: 5 / 20  (25.0%)
Average Metric: 7 / 20  (35.0%)
Evaluating models for multihop...
Average Metric: 13 / 20  (65.0%)
Average Metric: 13 / 20  (65.0%)
Average Metric: 11 / 20  (55.0%)


Unnamed: 0,module_name,optimizer,score
0,vanilla,none,25.0
0,vanilla,labeled_few_shot,40.0
0,vanilla,bootstrap_few_shot,40.0
1,cot,none,50.0
1,cot,labeled_few_shot,45.0
1,cot,bootstrap_few_shot,50.0
2,react,none,25.0
2,react,labeled_few_shot,25.0
2,react,bootstrap_few_shot,35.0
3,multihop,none,65.0


We now see that the scores improved for Vanilla, COT, and Multihop, while ReAct is unchanged. Multihop now has the best performance, scoring 65% both without an optimizer and with LabeledFewShot! This indicates that although DSPy tries to optimize the prompt, **there is still some prompt engineering involved in specifying your objective in the signature**.

The best model now produces an exact match for our question!
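A quick way to check which configurations moved is to subtract the scores of the two runs. The sketch below hard-codes the percentages from the two "Average Metric" logs above, assuming each module's three lines follow the order `none`, `labeled_few_shot`, `bootstrap_few_shot` (the order in which the optimizers are compiled). The variable names are illustrative only.

```python
OPTIMIZERS = ("none", "labeled_few_shot", "bootstrap_few_shot")

# Scores (%) copied from the evaluation logs of the original and revised runs
before = {
    "vanilla": (0.0, 0.0, 0.0),
    "cot": (0.0, 35.0, 50.0),
    "react": (25.0, 25.0, 35.0),
    "multihop": (0.0, 30.0, 35.0),
}
after = {
    "vanilla": (25.0, 40.0, 40.0),
    "cot": (50.0, 45.0, 50.0),
    "react": (25.0, 25.0, 35.0),
    "multihop": (65.0, 65.0, 55.0),
}

# Change in score for every (module, optimizer) pair
deltas = {
    (module, opt): after[module][i] - before[module][i]
    for module in before
    for i, opt in enumerate(OPTIMIZERS)
}

# Modules whose three scores did not change at all
unchanged = sorted(m for m in before if before[m] == after[m])

# Best configuration(s) in the revised run, keeping ties
best_score = max(max(scores) for scores in after.values())
best = [
    (module, opt)
    for module in after
    for i, opt in enumerate(OPTIMIZERS)
    if after[module][i] == best_score
]

print(unchanged)   # ['react']
print(best_score)  # 65.0
print(best)        # [('multihop', 'none'), ('multihop', 'labeled_few_shot')]
```

With only 20 test examples, a single question is worth 5 percentage points, so small differences between configurations should be read with care.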

In [80]:
# The correct answer is 7,402
question = """William Hughes Miller was born in a city with how many inhabitants ?"""
ms_revised.question_for_model('multihop','labeled_few_shot',question)

Prediction(
    rationale='Answer: 7,402',
    answer='7,402'
)

As expected, the best prompt contains only few-shot examples, and not the bootstrapped Context-Question-Reasoning-Answer demonstrations.

In [81]:
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: This American guitarist best known for her work with the Iron Maidens is an ancestor of a composer who was known as what?
Answer: The Waltz King

Question: Tombstone stared an actor born May 17, 1955 known as who?
Answer: Bill Paxton

Question: Who is older, Aleksandr Danilovich Aleksandrov or Anatoly Fomenko?
Answer: Aleksandr Danilovich Aleksandrov

Question: Which magazine has published articles by Scott Shaw, Tae Kwon Do Times or Southwest Art?
Answer: Tae Kwon Do Times

Question: What is the code name for the German offensive that started this Second World War engagement on the Eastern Front (a few hundred kilometers from Moscow) between Soviet and German forces, which included 102nd Infantry Division?
Answer: Operation Citadel

Question: which American actor was Candace Kita guest starred with
Answer: Bill Murray

Question: Which company distributed this 1977 American animated film produced by Walt Disney Production

This does not mean that Multihop with BootstrapFewShot performs worse **in general**, however. It only means that for our task, if we use GPT 3.5 Turbo both to bootstrap demonstrations (which might be of questionable quality) and to output predictions, we might be better off without bootstrapping altogether, keeping only the few-shot examples.

This leads to the question: is it possible to use a more powerful LM, say GPT 4 Turbo (the `teacher`), to generate demonstrations, while keeping GPT 3.5 Turbo (the `student`) for prediction?
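To make the idea of bootstrapping concrete, here is a conceptual sketch of the loop, written in plain Python. It is not DSPy's `BootstrapFewShot` implementation: `bootstrap_demos`, the toy programs, and the `max_demos` limit are all hypothetical stand-ins. The sketch runs a program over the training examples, keeps only the runs that pass the metric, and stops once enough demonstrations are collected.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect up to max_demos examples on which `program` passes `metric`.

    Returns the kept demonstrations and the number of examples consumed.
    """
    demos = []
    seen = 0
    for example in trainset:
        seen += 1
        prediction = program(example["question"])
        if metric(example, prediction):
            # Keep the successful run as a demonstration for the final prompt
            demos.append({**example, "prediction": prediction})
        if len(demos) >= max_demos:
            break
    return demos, seen


# Toy data and programs, purely for illustration
trainset = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(1, 7)]


def metric(example, prediction):
    return example["answer"] == prediction


def weak_program(question):
    # Only answers two of the six questions correctly
    return "a" + question[1:] if question in {"q2", "q5"} else "unknown"


def strong_program(question):
    # Answers every question correctly
    return "a" + question[1:]


weak_demos, weak_seen = bootstrap_demos(weak_program, trainset, metric, max_demos=2)
strong_demos, strong_seen = bootstrap_demos(strong_program, trainset, metric, max_demos=2)

print(len(weak_demos), weak_seen)      # 2 5
print(len(strong_demos), strong_seen)  # 2 2
```

A program that answers correctly more often fills its quota after fewer examples, which is the kind of behaviour reported by log lines such as "Bootstrapped 4 full traces after 9 examples". Passing the metric says nothing about the quality of the intermediate reasoning that is kept alongside the answer, though.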

## "Teacher" to power-up bootstrapping capability

The answer is **YES**, as the following cell demonstrates. We will use GPT 4 Turbo as the teacher.

In [82]:
# Define the GPT-4 Turbo model
gpt4_turbo = dspy.Databricks(api_key=OPENROUTER_API_KEY,
		api_base="https://openrouter.ai/api/v1",
		model="openai/gpt-4-turbo")

# Define new Optimizer which uses GPT-4 Turbo as a teacher
optimizers_gpt4_teacher = {
    'bootstrap_few_shot': BootstrapFewShot(metric=metric, max_errors=20, teacher_settings=dict(lm=gpt4_turbo)),
}

# Compile the models and evaluate them as before
ms_gpt4_teacher = ModelSelection(modules=modules_revised, optimizers=optimizers_gpt4_teacher, metric=metric, trainset=trainset)
ms_gpt4_teacher.evaluate(testset=testset)
ms_gpt4_teacher.evaluation_matrix

Compiling models for vanilla...
...bootstrap_few_shot


 40%|████      | 8/20 [00:35<00:53,  4.44s/it]


Bootstrapped 4 full traces after 9 examples in round 0.
Compiling models for cot...
...bootstrap_few_shot


 30%|███       | 6/20 [00:36<01:26,  6.15s/it]


Bootstrapped 4 full traces after 7 examples in round 0.
Compiling models for react...
...bootstrap_few_shot


 40%|████      | 8/20 [02:32<04:06, 20.54s/it]

Failed to run or to evaluate example Example({'question': 'Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?', 'answer': 'Rosario Dawson'}) (input_keys={'question'}) with <function answer_exact_match at 0x0000015CBFCA4A60> due to 'NoneType' object is not iterable.


100%|██████████| 20/20 [08:05<00:00, 24.30s/it]


Bootstrapped 3 full traces after 20 examples in round 0.
Compiling models for multihop...
...bootstrap_few_shot


 30%|███       | 6/20 [02:17<05:21, 22.99s/it]


Bootstrapped 4 full traces after 7 examples in round 0.
Evaluating models for vanilla...
Average Metric: 9 / 20  (45.0%)
Evaluating models for cot...
Average Metric: 10 / 20  (50.0%)
Evaluating models for react...
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Error for example in dev set: 		 not enough values to unpack (expected 2, got 1)
Average Metric: 8.0 / 20  (40.0%)
Evaluating models for multihop...
Average Metric: 11 / 20  (55.0%)


Unnamed: 0,module_name,optimizer,score
0,vanilla,bootstrap_few_shot,45.0
1,cot,bootstrap_few_shot,50.0
2,react,bootstrap_few_shot,40.0
3,multihop,bootstrap_few_shot,55.0


Using GPT-4 Turbo as the `teacher` does not significantly boost our models' performance, however: Vanilla and ReAct each gain 5 points, while COT and Multihop are unchanged. It is still worthwhile to see its effect on our prompt. Below is the prompt generated using only GPT 3.5.

In [84]:
ms_revised.question_for_model('multihop','bootstrap_few_shot',question)
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: Which Pakistani cricket umpire who won 3 consecutive ICC umpire of the year awards in 2009, 2010, and 2011 will be in the ICC World Twenty20?
Answer: Aleem Sarwar Dar

Question: Tombstone stared an actor born May 17, 1955 known as who?
Answer: Bill Paxton

Question: Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?
Answer: Rosario Dawson

Question: Having the combination of excellent foot speed and bat speed helped Eric Davis, create what kind of outfield for the Los Angeles Dodgers?
Answer: "Outfield of Dreams"

Question: In what year was the club founded that played Manchester City in the 1972 FA Charity Shield
Answer: 1874

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas founded in what year?
Answer: 2010

Question: At My Window was released by which American singer-songwriter?
Ans

And here's the prompt generated using GPT-4 as `teacher`. Notice how the "Reasoning" is much better articulated here!

In [85]:
ms_gpt4_teacher.question_for_model('multihop','bootstrap_few_shot',question)
lm.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: Which Pakistani cricket umpire who won 3 consecutive ICC umpire of the year awards in 2009, 2010, and 2011 will be in the ICC World Twenty20?
Answer: Aleem Sarwar Dar

Question: Tombstone stared an actor born May 17, 1955 known as who?
Answer: Bill Paxton

Question: Which American actress who made their film debut in the 1995 teen drama "Kids" was the co-founder of Voto Latino?
Answer: Rosario Dawson

Question: Having the combination of excellent foot speed and bat speed helped Eric Davis, create what kind of outfield for the Los Angeles Dodgers?
Answer: "Outfield of Dreams"

Question: which American actor was Candace Kita guest starred with
Answer: Bill Murray

Question: The Organisation that allows a community to influence their operation or use and to enjoy the benefits arisingwas founded in what year?
Answer: 2010

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

