<img src="docs/images/DSPy8.png" alt="DSPy7 Image" height="150"/>

## **DSPy**: Programming with Foundation Models

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/intro.ipynb)

This notebook introduces the **DSPy** framework for **Programming with Foundation Models**, i.e., language models (LMs) and retrieval models (RMs).

**DSPy** emphasizes programming over prompting. It unifies techniques for **prompting** and **fine-tuning** LMs as well as improving them with **reasoning** and **tool/retrieval augmentation**, all expressed through a _minimalistic set of Pythonic operations that compose and learn_.

**DSPy** provides **composable and declarative modules** for instructing LMs in a familiar Pythonic syntax. On top of that, **DSPy** introduces an **automatic compiler that teaches LMs** how to conduct the declarative steps in your program. The **DSPy compiler** will internally _trace_ your program and then **craft high-quality prompts for large LMs (or train automatic finetunes for small LMs)** to teach them the steps of your task.

### 0] Setting Up

As we'll start to see below, **DSPy** can routinely teach powerful models like `GPT-3.5` and local models like `T5-base` or `Llama2-13b` to be much more reliable at complex tasks. **DSPy** will compile the _same program_ into different few-shot prompts and/or finetunes for each LM.

Let's begin by setting things up. The installation snippet below is commented out in this copy; uncomment it to clone the repository and install **DSPy** if it's not there already.

In [1]:
import openai
import os

openai_api_key = os.environ.get('OPENAI_API_KEY')
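
This notebook reads `OPENAI_API_KEY` here and `PINECONE_API_KEY` further down. As a minimal sketch (the helper name `missing_env_vars` is hypothetical, not part of DSPy), a quick check like the following can report unset keys up front rather than failing later inside a client call:

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_KEYS = ["OPENAI_API_KEY", "PINECONE_API_KEY"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Report the problem early; this only prints, it does not stop the notebook.
missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```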

In [2]:
%load_ext autoreload
%autoreload 2

import sys
import os

# try: # When on google Colab, let's clone the notebook so we download the cache.
#     import google.colab
#     repo_path = 'dspy'
#     !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path
# except:
#     repo_path = '.'
# 
# if repo_path not in sys.path:
#     sys.path.append(repo_path)
# 
# # Set up the cache for this notebook
# os.environ["DSP_NOTEBOOK_CACHEDIR"] = os.path.join(repo_path, 'cache')
# 
# import pkg_resources # Install the package if it's not installed
# if not "dspy-ai" in {pkg.key for pkg in pkg_resources.working_set}:
#     !pip install -U pip
#     !pip install dspy-ai
#     !pip install openai~=0.28.1
#     # !pip install -e $repo_path

from dspy import dspy
from dspy.retrieve.pinecone_rm import PineconeRM

### 1] Getting Started

We'll start by setting up the language model (LM) and retrieval model (RM). **DSPy** supports multiple API and local models. In this notebook, we'll work with GPT-3.5 (`gpt-3.5-turbo`) and a Pinecone-backed retriever (`PineconeRM`).

The original version of this notebook queries a hosted ColBERTv2 server with a Wikipedia 2017 "abstracts" search index (i.e., containing the first paragraph of each article from this [2017 dump](https://hotpotqa.github.io/wiki-readme.html)); that line is kept below, commented out. Here we instead query a Pinecone index named `apple-index`, whose passages (as the outputs further down show) come from Apple's 2023 Form 10-K.

**Note:** _The cells below read `OPENAI_API_KEY` and `PINECONE_API_KEY` from the environment. The example cache used by the original notebook is not set up here (that code is commented out above), so expect live API calls._

In [3]:
from dspy.retrieve.pinecone_rm import OpenAIEmbed

turbo = dspy.OpenAI(model='gpt-3.5-turbo')
# colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

pinecone_index_name = "apple-index"
pinecone_api_key = os.environ.get('PINECONE_API_KEY')
pinecone_env = "us-east-1-aws"
retriever_model = PineconeRM(pinecone_index_name, 
                             pinecone_api_key, 
                             None, 
                             OpenAIEmbed(api_key=openai_api_key),
                             3
                             )
dspy.settings.configure(lm=turbo, rm=retriever_model)

In the last line above, we configure **DSPy** to use the turbo LM and the Pinecone retriever by default. This will be easy to override for local parts of our programs if needed.

##### A word on the workflow

You can build your own **DSPy programs** for various tasks, e.g., question answering, information extraction, or text-to-SQL.

Whatever the task, the general workflow is:

1. **Collect a little bit of data.** Define examples of the inputs and outputs of your program (e.g., questions and their answers). This could just be a handful of quick examples you wrote down. If large datasets exist, the more the merrier!
1. **Write your program.** Define the modules (i.e., sub-tasks) of your program and the way they should interact together to solve your task.
1. **Define some validation logic.** What makes for a good run of your program? Maybe the answers need to have a certain length or stick to a particular format? Specify the logic that checks that.
1. **Compile!** Ask **DSPy** to _compile_ your program using your data. The compiler will use your data and validation logic to optimize your program (e.g., prompts and modules) so it's efficient and effective!
1. **Iterate.** Repeat the process by improving your data, program, validation, or by using more advanced features of the **DSPy** compiler.

Let's now see this in action.

### 2] Task Examples

**DSPy** accommodates a wide variety of applications and tasks. **In this intro notebook, we will work on the example task of question answering (QA) over retrieved passages.**

Other notebooks and tutorials will present different tasks. The original notebook loads a tiny sample from the HotPotQA multi-hop dataset (kept below, commented out); here we load question-answer pairs about Apple from a local text file instead.

In [4]:
# from dspy.datasets import HotPotQA

# # Load the dataset.
# dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# # Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
# trainset = [x.with_inputs('question') for x in dataset.train]
# devset = [x.with_inputs('question') for x in dataset.dev]

# len(trainset), len(devset)

In [5]:
def load_data(filename: str) -> list[dspy.Example]:
    """Parse a text file of 'Q:' / 'A:' lines into DSPy examples."""
    data_set = []
    current_question = None  # Question waiting for its answer, if any
    with open(filename, 'r') as file:
        for line in file:
            line = line.rstrip('\n')  # Remove trailing newline

            if line.startswith("Q:"):
                # Start a new question
                current_question = line[len("Q:"):].strip()
            elif line.startswith("A:"):
                if current_question is not None:
                    # Add the completed question-answer pair
                    data_set.append(dspy.Example(
                        question=current_question,
                        answer=line[len("A:"):].strip()
                    ).with_inputs('question'))
                    current_question = None  # Reset for next question
            elif current_question is not None:
                # Treat any other line as a continuation of the open question
                current_question += " " + line

    return data_set

In [6]:
apple_example_set = load_data('apple-test.txt')
print(apple_example_set[:3])

[Example({'question': 'What is the primary business of Apple Inc.?', 'answer': 'Designing, manufacturing, and marketing electronic devices, including smartphones, computers, and tablets.'}) (input_keys={'question'}), Example({'question': 'How does Apple manage its business segments?', 'answer': 'Geographically, with segments including the Americas, Europe, Greater China, Japan, and Rest of Asia Pacific.'}) (input_keys={'question'}), Example({'question': "What are the key components of Apple's competitive strategy?", 'answer': 'Innovation, product quality, strong ecosystem, marketing, distribution, and service and support offerings.'}) (input_keys={'question'})]
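
The loader assumes a plain-text file in which each question line starts with `Q:` and each answer line starts with `A:`. The contents of `apple-test.txt` are not shown in this notebook, so the sample lines below are made up for illustration. This stdlib-only sketch mirrors the same pairing logic with plain dicts in place of `dspy.Example`. The name `parse_qa_lines` is hypothetical, and unlike the loader it skips blank continuation lines.

```python
def parse_qa_lines(lines):
    """Pair each 'Q:' line with the next 'A:' line; other lines extend the open question."""
    pairs = []
    current_question = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("Q:"):
            current_question = line[len("Q:"):].strip()
        elif line.startswith("A:"):
            if current_question is not None:
                pairs.append({"question": current_question,
                              "answer": line[len("A:"):].strip()})
                current_question = None
        elif current_question is not None and line.strip():
            # Continuation of a question that wraps onto a second line.
            current_question += " " + line.strip()
    return pairs

# Made-up sample: a wrapped question, its answer, and an orphan answer that is skipped.
sample = [
    "Q: What is the primary business\n",
    "of Apple Inc.?\n",
    "A: Designing, manufacturing, and marketing electronic devices.\n",
    "A: An answer with no open question is skipped.\n",
]
print(parse_qa_lines(sample))
```

An `A:` line that arrives without a pending question is dropped silently, which is also how the loader behaves.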


We just loaded `apple_example_set` from `apple-test.txt`. Each example contains just a **question** and its **answer**.

**DSPy** typically requires very minimal labeling. Whereas your pipeline may involve six or seven complex steps, you only need labels for the initial question and the final answer. **DSPy** will bootstrap any intermediate labels needed to support your pipeline. If you change your pipeline in any way, the data bootstrapped will change accordingly!

Now, let's look at some data examples.

In the original HotPotQA version of this notebook, examples in the **dev set** contain a third field, namely, **titles** of relevant Wikipedia articles. The examples loaded here do not have that field, so the corresponding line below is commented out.

In [7]:
dev_example = apple_example_set[1]
print(f"Question: {dev_example.question}")
print(f"Answer: {dev_example.answer}")
# print(f"Relevant Wikipedia Titles: {dev_example.gold_titles}")

Question: How does Apple manage its business segments?
Answer: Geographically, with segments including the Americas, Europe, Greater China, Japan, and Rest of Asia Pacific.


While loading the data, we applied `.with_inputs('question')` to each example to tell **DSPy** that the input field in each example is just `question`. Any other fields are labels or metadata that are not given to the system.

In [8]:
# print(f"For this dataset, training examples have input keys {apple_example_set.inputs().keys()} and label keys {apple_example_set.labels().keys()}")
# # print(f"For this dataset, dev examples have input keys {dev_example.inputs().keys()} and label keys {dev_example.labels().keys()}")
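
The cell above is commented out, but the split it would report is easy to picture. As a rough sketch with plain dicts (the helper `split_fields` is hypothetical and only imitates the idea behind marking `question` as the input), the program sees the input fields while everything else is held back as labels:

```python
def split_fields(example, input_keys):
    """Split a record into the fields the program sees and the fields held back as labels."""
    inputs = {k: v for k, v in example.items() if k in input_keys}
    labels = {k: v for k, v in example.items() if k not in input_keys}
    return inputs, labels

# Question and answer copied from the example printed earlier in this notebook.
example = {
    "question": "How does Apple manage its business segments?",
    "answer": "Geographically, with segments including the Americas, Europe, "
              "Greater China, Japan, and Rest of Asia Pacific.",
}
inputs, labels = split_fields(example, input_keys={"question"})
print(sorted(inputs))  # ['question']
print(sorted(labels))  # ['answer']
```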

Note that there's nothing special about this dataset: it's just a list of examples.

You can define your own examples as below. A future notebook will guide you through creating your own data in unusual or data-scarce settings, which is a context where **DSPy** excels.

```
dspy.Example(field1=value, field2=value2, ...)
```

### 3] Building Blocks

In **DSPy**, we will maintain a clean separation between **defining your modules in a declarative way** and **calling them in a pipeline to solve the task**.

This allows you to focus on the information flow of your pipeline. **DSPy** will then take your program and automatically optimize **how to prompt** (or finetune) LMs **for your particular pipeline** so it works well.

If you have experience with PyTorch, you can think of DSPy as the PyTorch of the foundation model space. Before we see this in action, let's first understand some key pieces.

##### Using the Language Model: **Signatures** & **Predictors**

Every call to the LM in a **DSPy** program needs to have a **Signature**.

A signature consists of three simple elements:

- A minimal description of the sub-task the LM is supposed to solve.
- A description of one or more input fields (e.g., input question) that we will give to the LM.
- A description of one or more output fields (e.g., the question's answer) that we will expect from the LM.

Let's define a simple signature for basic question answering.

In [9]:
class BasicQA(dspy.Signature):
    """Answer questions with short factoid answers."""

    question = dspy.InputField()
    answer = dspy.OutputField()

In `BasicQA`, the docstring describes the sub-task here (i.e., answering questions). Each `InputField` or `OutputField` can optionally contain a description `desc` too. When it's not given, it's inferred from the field's name (e.g., `question`).

Notice that there isn't anything special about this signature in **DSPy**. We can just as easily define a signature that takes a long snippet from a PDF and outputs structured information, for instance.

Anyway, now that we have a signature, let's define and use a **Predictor**. A predictor is a module that knows how to use the LM to implement a signature. Importantly, predictors can **learn** to fit their behavior to the task!

In [10]:
# Define the predictor.
generate_answer = dspy.Predict(BasicQA)

# Call the predictor on a particular input.
pred = generate_answer(question=dev_example.question)

# Print the input and the prediction.
print(f"Question: {dev_example.question}")
print(f"Predicted Answer: {pred.answer}")

Question: How does Apple manage its business segments?
Predicted Answer: Apple manages its business segments by dividing them into five categories: iPhone, Mac, iPad, Services, and Wearables/Home/Accessories.


In the example above, we asked the predictor how Apple manages its business segments. The model answers with a list of product categories (iPhone, Mac, iPad, Services, and Wearables/Home/Accessories).

For visibility, we can inspect how this extremely basic predictor implemented our signature. Let's inspect the history of our LM (**turbo**).

In [11]:
turbo.inspect_history(n=1)





Answer questions with short factoid answers.

---

Follow the following format.

Question: ${question}
Answer: ${answer}

---

Question: How does Apple manage its business segments?
Answer: Apple manages its business segments by dividing them into five categories: iPhone, Mac, iPad, Services, and Wearables/Home/Accessories.
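
The prompt printed above has three parts separated by `---`: the signature's docstring, a format block listing each field, and the filled-in input followed by an open `Answer:` label. The sketch below reproduces that layout with plain string formatting. It is an illustration of the structure only, not DSPy's actual implementation, and `render_prompt` is a hypothetical name.

```python
def render_prompt(instructions, input_fields, output_fields, values):
    """Lay out instructions, a format block, and the filled-in inputs as one prompt string."""
    fields = input_fields + output_fields
    # One "Name: ${name}" placeholder line per field, inputs first.
    format_block = "\n".join(f"{name.capitalize()}: ${{{name}}}" for name in fields)
    filled = "\n".join(f"{name.capitalize()}: {values[name]}" for name in input_fields)
    # The prompt ends with the first output field's label so the LM continues from there.
    trailer = f"{output_fields[0].capitalize()}:"
    return "\n\n---\n\n".join([
        instructions,
        "Follow the following format.\n\n" + format_block,
        filled + "\n" + trailer,
    ])

prompt = render_prompt(
    "Answer questions with short factoid answers.",
    input_fields=["question"],
    output_fields=["answer"],
    values={"question": "How does Apple manage its business segments?"},
)
print(prompt)
```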





The reference answer in our data says the segments are managed geographically, so the model's product-category answer does not match it. Without access to the source document, the LM has little to go on. In general, adding **retrieval** and **learning** will help the LM be more factual, and we'll explore this in a minute!

But before we do that, how about we _just_ change the predictor? It would be nice to have the model produce a chain of thought along with the prediction.

In [12]:
# Define the predictor. Notice we're just changing the class. The signature BasicQA is unchanged.
generate_answer_with_chain_of_thought = dspy.ChainOfThought(BasicQA)

# Call the predictor on the same input.
pred = generate_answer_with_chain_of_thought(question=dev_example.question)

# Print the input, the chain of thought, and the prediction.
print(f"Question: {dev_example.question}")
print(f"Thought: {pred.rationale.split('.', 1)[1].strip()}")
print(f"Predicted Answer: {pred.answer}")

Question: How does Apple manage its business segments?
Thought: 
Predicted Answer: Apple manages its business segments by dividing them into products such as iPhone, Mac, iPad, Services, and Wearables.


The answer is largely unchanged: the model still lists product categories rather than geographic segments, and the `Thought:` line printed above is empty. Without retrieved context, a chain of thought alone does not fix the answer here.
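
The `Thought:` line in the output above is empty. The cell prints `pred.rationale.split('.', 1)[1].strip()`, which keeps only the text after the first period: if the rationale is a single sentence that ends with its only period, nothing is left, and if it contains no period at all the indexing raises `IndexError`. A more forgiving variant, sketched here with a hypothetical helper name, falls back to the full text in both cases:

```python
def drop_first_sentence(rationale):
    """Return the text after the first period, or the whole text if nothing follows it."""
    head, sep, tail = rationale.partition(".")
    tail = tail.strip()
    return tail if tail else rationale.strip()

# The original expression yields an empty string for a one-sentence rationale.
print(repr("Only one sentence.".split(".", 1)[1].strip()))
# The helper keeps the sentence instead.
print(repr(drop_first_sentence("Only one sentence.")))
```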

These predictors (`dspy.Predict` and `dspy.ChainOfThought`) can be applied to _any_ signature. As we'll see below, they can also be optimized to learn from your data and validation logic.

##### Using the Retrieval Model

Using the retriever is pretty simple. A module `dspy.Retrieve(k)` will search for the top-`k` passages that match a given query.

By default, this will use the retriever we configured at the top of this notebook, namely, the Pinecone retriever over `apple-index`.

In [13]:
retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(dev_example.question)  # Query the retriever once and reuse the result

print(topK_passages)
print(f"Top {retrieve.k} passages for question: {dev_example.question} \n", '-' * 30, '\n')
for idx, passage in enumerate(topK_passages.passages):
    print(f'{idx+1}]', passage, '\n')

Prediction(
    passages=['Payment Services\n\nThe Company offers payment services, including Apple Card®, a co-branded credit card, and Apple Pay®, a cashless payment service.\n\nSegments\n\nThe Company manages its business primarily on a geographic basis. The Company’s reportable segments consist of the Americas, Europe, Greater China,\nJapan and Rest of Asia Pacific. Americas includes both North and South America. Europe includes European countries, as well as India, the Middle East and\nAfrica. Greater China includes China mainland, Hong Kong and Taiwan. Rest of Asia Pacific includes Australia and those Asian countries not included in the\nCompany’s other reportable segments. Although the reportable segments provide similar hardware and software products and similar services, each one is\nmanaged separately to better align with the location of the Company’s customers and distribution partners and the unique market dynamics of each geographic\nregion.\n\nMarkets and Distribution', '

Feel free to try any other queries you like.

In [None]:
retrieve("When was the first FIFA World Cup held?").passages[0]

### 4] Program 1: Basic Retrieval-Augmented Generation (“RAG”)

Let's define our first complete program for this task. We'll build a retrieval-augmented pipeline for answer generation.

Given a question, we'll search for the top-3 passages in our index and then feed them as context for answer generation.

Let's start by defining this signature: `context, question --> answer`.

In [16]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

Great. Now let's define the actual program. This is a class that inherits from `dspy.Module`.

It needs two methods:

- The `__init__` method will simply declare the sub-modules it needs: `dspy.Retrieve` and `dspy.ChainOfThought`. The latter is defined to implement our `GenerateAnswer` signature.
- The `forward` method will describe the control flow of answering the question using the modules we have.

In [17]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
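
To make the control flow of `forward` concrete without any API calls, here is a self-contained toy that follows the same two steps: retrieve, then generate. Everything in it is a stand-in. `keyword_retrieve` ranks passages by word overlap instead of querying a real retriever, `stub_generate` simply echoes the top passage instead of calling an LM, and the corpus is a few short sentences adapted from the passages shown in this notebook.

```python
def keyword_retrieve(question, corpus, k=3):
    """Rank passages by how many question words they share (a toy stand-in for a retriever)."""
    words = set(question.lower().split())
    scored = sorted(corpus, key=lambda p: len(words & set(p.lower().split())), reverse=True)
    return scored[:k]

def stub_generate(context, question):
    """Placeholder for the LM call: echo the top passage instead of generating text."""
    return context[0] if context else ""

def toy_rag(question, corpus, k=3):
    context = keyword_retrieve(question, corpus, k=k)           # Step 1: retrieve
    answer = stub_generate(context=context, question=question)  # Step 2: generate
    return {"context": context, "answer": answer}

corpus = [
    "The Company manages its business primarily on a geographic basis.",
    "The Company offers payment services, including Apple Card and Apple Pay.",
    "Net sales can be affected by seasonal holiday demand.",
]
result = toy_rag("How does the Company manage its business?", corpus, k=2)
print(result["answer"])
```

Returning the context alongside the answer, as `RAG.forward` does, lets a validation metric later check both what was answered and what was retrieved.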

##### Compiling the RAG program

Having defined this program, let's now **compile** it. Compiling a program will update the parameters stored in each module. In our setting, this is primarily in the form of collecting and selecting good demonstrations for inclusion in your prompt(s).

Compiling depends on three things:

1. **A training set.** We'll just use the question-answer examples in `apple_example_set` from above.
1. **A metric for validation.** We'll define a quick `validate_context_and_answer` that checks that the predicted answer is correct. It'll also check that the retrieved context does actually contain that answer.
1. **A specific teleprompter.** The **DSPy** compiler includes a number of **teleprompters** that can optimize your programs.

**Teleprompters:** Teleprompters are powerful optimizers that can take any program and learn to bootstrap and select effective prompts for its modules. Hence the name, which means "prompting at a distance".

Different teleprompters offer various tradeoffs in terms of how much they optimize cost versus quality, etc. We will use a simple default `BootstrapFewShot` in this notebook.
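
The core idea behind bootstrapping demonstrations can be sketched in a few lines: run the program on training examples and keep only the runs that pass the metric. The loop below is a deliberately simplified illustration under that assumption, not the actual `BootstrapFewShot` implementation. The function name, the `max_demos` parameter, and the toy program and metric are all hypothetical.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Run `program` on each training example and keep the runs that pass `metric`."""
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            # A passing run becomes a demonstration for future prompts.
            demos.append({"question": example["question"], **prediction})
        if len(demos) >= max_demos:
            break
    return demos

# Toy program and metric, only to exercise the loop.
def toy_program(question):
    return {"answer": "4" if question == "2 + 2?" else "unknown"}

def toy_metric(example, prediction):
    return example["answer"] == prediction["answer"]

trainset = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]
print(bootstrap_demos(toy_program, trainset, toy_metric))
```

Only the first example survives, because the toy program gets the second one wrong. This is why the choice of metric matters: it decides which runs are trusted as demonstrations.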


_If you're into analogies, you could think of this as your training data, your loss function, and your optimizer in a standard DNN supervised learning setup. Whereas SGD is a basic optimizer, there are more sophisticated (and more expensive!) ones like Adam or RMSProp._

In [18]:
from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

# Compile!
compiled_rag = teleprompter.compile(RAG(), trainset=apple_example_set)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00,  3.37it/s]
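
The metric used in the cell above combines two checks: the predicted answer must match the reference answer, and the reference answer must appear in the retrieved context. The stdlib-only sketch below shows that logic with hypothetical helpers. DSPy's own `answer_exact_match` and `answer_passage_match` may normalize text differently, so treat this as an approximation of the idea. The passage is shortened from a retrieved context shown in this notebook.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(gold, predicted):
    return normalize(gold) == normalize(predicted)

def passage_match(gold, passages):
    return any(normalize(gold) in normalize(p) for p in passages)

def validate(gold, predicted, passages):
    # Both checks must pass, mirroring `answer_EM and answer_PM` in the cell above.
    return exact_match(gold, predicted) and passage_match(gold, passages)

passages = ["During 2023, the Company repurchased 471 million shares of its common stock for $76.6 billion."]
print(validate("$76.6 billion", "$76.6 billion", passages))
print(validate("$76.6 billion", "About $76.6 billion in total", passages))
```

Exact match is strict: a prediction that contains the right figure but adds extra words fails. With long free-text reference answers like those in `apple_example_set`, few runs are likely to pass such a metric, which limits how many demonstrations can be bootstrapped.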





Now that we've compiled our RAG program, let's try it out.

In [19]:
# Ask any question you like to this simple RAG program.
my_question = "Please provide the total value of shares repurchased."

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: Please provide the total value of shares repurchased.
Predicted Answer: $76.6 billion
Retrieved Contexts (truncated): ['(1) As of September 30, 2023, the Company was authorized by the Board of Directors to purchase up to $90 billion of the Company’s common stock under a share\nrepurchase program announced on May 4, 2023...', 'Under the programs, shares may be repurchased in privately negotiated or open market transactions, including under plans\ncomplying with Rule 10b5-1 under the Exchange Act. (2) In August 2023, the Comp...', 'Note 10 – Shareholders’ Equity\n\nShare Repurchase Program\n\nDuring 2023, the Company repurchased 471 million shares of its common stock for $76.6 billion, excluding excise tax due under the Inflation Re...']


Excellent. How about we inspect the last prompt for the LM?

In [20]:
turbo.inspect_history(n=1)





Answer questions with short factoid answers.

---

Question: What are the key components of Apple's competitive strategy?
Answer: Innovation, product quality, strong ecosystem, marketing, distribution, and service and support offerings.

Question: What is a key challenge the Company faces regarding its intellectual property?
Answer: Competitors imitate products and infringe on intellectual property, affecting the Company's competitive advantage.

Question: What risks does the Company face in terms of product development and market competition?
Answer: Intense competition, price pressure, and uncertain market growth impact product innovation and market share.

Question: How does Apple manage its business segments?
Answer: Geographically, with segments including the Americas, Europe, Greater China, Japan, and Rest of Asia Pacific.

Question: How does seasonality affect Apple's net sales?
Answer: Higher in the first quarter due to seasonal holiday demand.

Question: What is Apple's ap

Even though we haven't written any of these detailed demonstrations, we see that **DSPy** was able to bootstrap this 3,000-token prompt for **3-shot retrieval-augmented generation with hard negative passages and chain of thought** from our extremely simple program.

This illustrates the power of composition and learning. Of course, this was just generated by a particular teleprompter, which may or may not be perfect in each setting. As you'll see in **DSPy**, there is a large but systematic space of options you have to optimize and validate the quality and cost of your programs.

If you're so inclined, you can easily inspect the learned objects themselves.

In [21]:
for name, parameter in compiled_rag.named_predictors():
    print(name)
    print(parameter.demos[0])
    print()

generate_answer
Example({'question': "What are the key components of Apple's competitive strategy?", 'answer': 'Innovation, product quality, strong ecosystem, marketing, distribution, and service and support offerings.'}) (input_keys={'question'})



##### Evaluating the Answers

We can now evaluate our `compiled_rag` program on the dev set. Of course, this tiny set is _not_ meant to be a reliable benchmark, but it'll be instructive to use it for illustration.

For a start, let's evaluate the accuracy (exact match) of the predicted answer.
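To make "exact match" concrete, here is a minimal sketch of what such a metric typically does. The helpers `normalize` and `exact_match` are hypothetical names written for illustration; they are not DSPy's `answer_exact_match`, whose normalization rules may differ.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True only if the two strings are identical after normalization."""
    return normalize(prediction) == normalize(gold)


gold = "Higher in the first quarter due to seasonal holiday demand."

# Differences in case and punctuation are forgiven.
print(exact_match("higher in the first quarter due to seasonal holiday demand", gold))

# A shorter answer is not, even if it is arguably correct.
print(exact_match("Higher in the first quarter.", gold))
```

Under this kind of metric, a prediction that drops part of the gold answer is scored as wrong, which is worth keeping in mind when reading the results table.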

In [22]:
from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_apple` function. We'll use this many times below.
evaluate_on_apple = Evaluate(devset=apple_example_set, num_threads=1, display_progress=True, display_table=5)

# Evaluate the `compiled_rag` program with the `answer_exact_match` metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_apple(compiled_rag, metric=metric)

Average Metric: 9 / 10  (90.0): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:19<00:00,  1.94s/it]


Average Metric: 9 / 10  (90.0%)


Unnamed: 0,question,example_answer,context,pred_answer,answer_exact_match
0,What is the primary business of Apple Inc.?,"Designing, manufacturing, and marketing electronic devices, including smartphones, computers, and tablets.","['Many of the Company’s competitors seek to compete primarily through aggressive pricing and very low cost structures, and by imitating the\nCompany’s products and infringing on...","Designing, manufacturing, and marketing electronic devices.",False
1,How does Apple manage its business segments?,"Geographically, with segments including the Americas, Europe, Greater China, Japan, and Rest of Asia Pacific.","['Payment Services\n\nThe Company offers payment services, including Apple Card®, a co-branded credit card, and Apple Pay®, a cashless payment service.\n\nSegments\n\nThe Company manages its business primarily...","Geographically, with segments including the Americas, Europe, Greater China, Japan, and Rest of Asia Pacific.",✔️ [True]
2,What are the key components of Apple's competitive strategy?,"Innovation, product quality, strong ecosystem, marketing, distribution, and service and support offerings.","['Many of the Company’s competitors seek to compete primarily through aggressive pricing and very low cost structures, and by imitating the\nCompany’s products and infringing on...","Innovation, product quality, strong ecosystem, marketing, distribution, and service and support offerings.",✔️ [True]
3,How does seasonality affect Apple's net sales?,Higher in the first quarter due to seasonal holiday demand.,['Apple Inc. | 2023 Form 10-K | 3\n\nBusiness Seasonality and Product Introductions\n\nThe Company has historically experienced higher net sales in its first quarter compared to...,Higher in the first quarter due to seasonal holiday demand.,✔️ [True]
4,How many employees does Apple have?,"Approximately 161,000 full-time equivalent employees as of September 30, 2023.","['Therefore, the Company accounts for all third-party application–related sales on a net basis by recognizing in Services net sales only the commission it retains.\n\nApple Inc....","Approximately 161,000 full-time equivalent employees as of September 30, 2023.",✔️ [True]


90.0

##### Evaluating the Retrieval

It may also be instructive to look at the accuracy of retrieval. There are multiple ways to do this. Often, we can just check whether the retrieved passages contain the answer.
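As a rough sketch of that first option, the check below tests whether any retrieved passage contains the answer string. The function names are hypothetical and the normalization (lowercasing and whitespace collapsing only) is an assumption chosen for simplicity, not the logic DSPy uses.

```python
from typing import List


def simple_normalize(text: str) -> str:
    """Lowercase and collapse all whitespace (including newlines) to single spaces."""
    return " ".join(text.lower().split())


def passages_contain_answer(passages: List[str], answer: str) -> bool:
    """True if the normalized answer occurs as a substring of any normalized passage."""
    target = simple_normalize(answer)
    return any(target in simple_normalize(passage) for passage in passages)


# Passages shortened from the retrieved contexts shown earlier in this notebook.
passages = [
    "Under the programs, shares may be repurchased in privately negotiated or open market transactions",
    "During 2023, the Company repurchased 471 million shares of its common\nstock for $76.6 billion",
]

print(passages_contain_answer(passages, "$76.6 billion"))
print(passages_contain_answer(passages, "$90 billion"))
```

A substring check like this is cheap, but it can miss answers that the passage phrases differently.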

That said, since our dev set includes the gold titles that should be retrieved, we can just use these here.

In [None]:
def gold_passages_retrieved(example, pred, trace=None):
    # Compare the normalized gold titles against the titles of the retrieved passages.
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)

compiled_rag_retrieval_score = evaluate_on_apple(compiled_rag, metric=gold_passages_retrieved)

Although this simple `compiled_rag` program answers most of the questions correctly (9 of 10 on this tiny set), the quality of retrieval can be much lower.

This potentially suggests that the LM is often relying on the knowledge it memorized during training to answer questions. To address this weak retrieval, let's explore a second program that involves more advanced search behavior.

### 5] Program 2: Multi-Hop Search (“Baleen”)

From exploring the harder questions in the training/dev sets, it becomes clear that a single search query is often not enough for this task. For instance, this can be seen when a question asks about, say, the birth city of the writer of "Right Back At It Again". A search query identifies the author correctly as "Jeremy McKinnon", but it wouldn't reveal where he was born.

The standard approach for this challenge in the retrieval-augmented NLP literature is to build multi-hop search systems, like GoldEn (Qi et al., 2019) and Baleen (Khattab et al., 2021). These systems read the retrieved results and then generate additional queries to gather additional information if necessary. Using **DSPy**, we can easily simulate such systems in a few lines of code.


We'll still use the `GenerateAnswer` signature from the RAG implementation above. All we need now is a **signature** for the "hop" behavior: taking some partial context and a question, generate a search query to find missing information.

In [23]:
class GenerateSearchQuery(dspy.Signature):
    """Write a simple search query that will help answer a complex question."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    query = dspy.OutputField()

Note: We could have written `context = GenerateAnswer.signature.context` to avoid duplicating the description of the `context` field.

Now, let's define the program itself, `SimplifiedBaleen`. There are many possible ways to implement this, but we'll keep this version down to the key elements for simplicity.

In [24]:
from dsp.utils import deduplicate

class SimplifiedBaleen(dspy.Module):
    def __init__(self, passages_per_hop=3, max_hops=2):
        super().__init__()

        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(max_hops)]
        self.retrieve = dspy.Retrieve(k=passages_per_hop)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
        self.max_hops = max_hops
    
    def forward(self, question):
        context = []
        
        for hop in range(self.max_hops):
            query = self.generate_query[hop](context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        pred = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=pred.answer)

As we can see, the `__init__` method defines a few key sub-modules:

- **generate_query**: For each hop, we will have one `dspy.ChainOfThought` predictor with the `GenerateSearchQuery` signature.
- **retrieve**: This module will do the actual search, using the generated queries.
- **generate_answer**: This `dspy.ChainOfThought` module will be used after all the search steps. It uses the `GenerateAnswer` signature to actually produce an answer.

The `forward` method uses these sub-modules in simple control flow.

1. First, we'll loop up to `self.max_hops` times.
1. In each iteration, we'll generate a search query using the predictor at `self.generate_query[hop]`.
1. We'll retrieve the top-k passages using that query.
1. We'll add the (deduplicated) passages to our accumulator of `context`.
1. After the loop, we'll use `self.generate_answer` to produce an answer.
1. We'll return a prediction with the retrieved `context` and predicted `answer`.
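The control flow in the steps above can be sketched without any LM or retriever. In the snippet below, `multi_hop`, `dedupe`, `toy_generate_query`, `toy_retrieve`, and `TOY_INDEX` are all hypothetical stand-ins invented for illustration; the real program uses `dspy.ChainOfThought`, `dspy.Retrieve`, and `deduplicate` instead.

```python
from typing import Callable, Dict, List


def dedupe(items: List[str]) -> List[str]:
    """Remove duplicates while preserving first-seen order."""
    seen = set()
    unique = []
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


def multi_hop(
    question: str,
    generate_query: Callable[[List[str], str], str],
    retrieve: Callable[[str], List[str]],
    max_hops: int = 2,
) -> List[str]:
    """Accumulate context over several hops: query, retrieve, deduplicate."""
    context: List[str] = []
    for _ in range(max_hops):
        query = generate_query(context, question)
        context = dedupe(context + retrieve(query))
    return context


# Toy stand-ins: the second hop returns one passage already seen in the first.
TOY_INDEX: Dict[str, List[str]] = {
    "hop-0": ["passage A", "passage B"],
    "hop-1": ["passage B", "passage C"],
}


def toy_generate_query(context: List[str], question: str) -> str:
    return "hop-0" if not context else "hop-1"


def toy_retrieve(query: str) -> List[str]:
    return TOY_INDEX.get(query, [])


print(multi_hop("any question", toy_generate_query, toy_retrieve))
```

Because the query generator sees the accumulated context, later hops can ask for what is still missing rather than repeating the first search.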

##### Inspect the zero-shot version of the Baleen program

We will also compile this program shortly. But, before that, we can try it out in a "zero-shot" setting (i.e., without any compilation).

Using a program in a zero-shot (uncompiled) setting doesn't mean that quality will be bad. It just means that we're bottlenecked directly by the reliability of the underlying LM to understand our sub-tasks from minimal instructions.

This is often just fine when using the most expensive/powerful models (e.g., GPT-4) on the easiest and most standard tasks (e.g., answering simple questions about popular entities).

However, a zero-shot approach quickly falls short for more specialized tasks, for novel domains/settings, and for more efficient (or open) models. **DSPy** can help you in all of these settings.

In [25]:
# Ask any question you like to this multi-hop program.
my_question = "* What new products did apple release in each of the quarter this year? List the result as bullet points"

# Get the prediction. This contains `pred.context` and `pred.answer`.
uncompiled_baleen = SimplifiedBaleen()  # uncompiled (i.e., zero-shot) program
pred = uncompiled_baleen(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Question: * What new products did apple release in each of the quarter this year? List the result as bullet points
Predicted Answer: - First Quarter 2023: iPad, iPad Pro, Next-generation Apple TV 4K, MLS Season Pass
- Second Quarter 2023: MacBook Pro 14”, MacBook Pro 16”, Mac mini, Second-generation HomePod
- Third Quarter 2023: MacBook Air 15”, Mac Studio, Mac Pro, Apple Vision Pro™, iOS 17
Retrieved Contexts (truncated): ['Significant announcements during fiscal year 2023\nincluded the following:\n\nFirst Quarter 2023:\n\n• iPad and iPad Pro;\n• Next-generation Apple TV 4K; and\n• MLS Season Pass, a Major League Soccer subscri...', 'Third Quarter 2023:\n\n• MacBook Air 15”, Mac Studio and Mac Pro;\n• Apple Vision Pro™, the Company’s first spatial computer featuring its new visionOS™, expected to be available in early calendar year 2...', 'The Company’s total net sales decreased 3% or $11.0 billion during 2023 compared to 2022.\n\nThe weakness in foreign currencies relative to the U.S

Let's inspect the last **three** calls to the LM (i.e., generating the first hop's query, generating the second hop's query, and generating the answer).

In [26]:
turbo.inspect_history(n=3)





Write a simple search query that will help answer a complex question.

---

Follow the following format.

Context: may contain relevant facts

Question: ${question}

Reasoning: Let's think step by step in order to ${produce the query}. We ...

Query: ${query}

---

Context:
[1] «Significant announcements during fiscal year 2023
included the following:

First Quarter 2023:

• iPad and iPad Pro;
• Next-generation Apple TV 4K; and
• MLS Season Pass, a Major League Soccer subscription streaming service.

Second Quarter 2023:

• MacBook Pro 14”, MacBook Pro 16” and Mac mini; and
• Second-generation HomePod.

Third Quarter 2023:

• MacBook Air 15”, Mac Studio and Mac Pro;
• Apple Vision Pro™, the Company’s first spatial computer featuring its new visionOS™, expected to be available in early calendar year 2024; and
• iOS 17, macOS Sonoma, iPadOS 17, tvOS 17 and watchOS 10, updates to the Company’s operating systems.

Fourth Quarter 2023:»
[2] «Third Quarter 2023:

• MacBook Air 15”, Mac S

##### Compiling the Baleen program

Now is the time to compile our multi-hop (`SimplifiedBaleen`) program.

We will first define our validation logic, which will simply require that:

- The predicted answer matches the gold answer.
- The retrieved context contains the gold answer.
- None of the generated queries is rambling (i.e., none exceeds 100 characters in length).
- None of the generated queries is roughly repeated (i.e., none is within 0.8 or higher F1 score of earlier queries).
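The last two conditions can be illustrated in plain Python. This is only a sketch: `token_f1` and `queries_are_valid` are hypothetical helpers, and the whitespace tokenization is an assumption; the actual metric relies on DSPy's own matching utilities.

```python
from collections import Counter
from typing import List


def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def queries_are_valid(question: str, queries: List[str],
                      max_len: int = 100, max_f1: float = 0.8) -> bool:
    """Reject rambling hops (too long) and near-repeats of earlier hops."""
    hops = [question] + list(queries)
    if any(len(hop) > max_len for hop in hops):
        return False
    # From the second query onward, compare against everything that came before.
    for idx in range(2, len(hops)):
        if any(token_f1(hops[idx], earlier) >= max_f1 for earlier in hops[:idx]):
            return False
    return True


question = "What new products did Apple release in each quarter?"
distinct = ["Apple fiscal 2023 product announcements", "Apple fourth quarter 2023 new products"]
repeated = ["Apple fiscal 2023 product announcements", "apple fiscal 2023 product announcements"]

print(queries_are_valid(question, distinct))
print(queries_are_valid(question, repeated))
print(queries_are_valid(question, ["x" * 101]))
```

Checks like these keep the bootstrapped demonstrations clean, since only traces that pass the metric are kept as examples.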

In [None]:
def validate_context_and_answer_and_hops(example, pred, trace=None):
    if not dspy.evaluate.answer_exact_match(example, pred): return False
    if not dspy.evaluate.answer_passage_match(example, pred): return False

    hops = [example.question] + [outputs.query for *_, outputs in trace if 'query' in outputs]

    if max([len(h) for h in hops]) > 100: return False
    if any(dspy.evaluate.answer_exact_match_str(hops[idx], hops[:idx], frac=0.8) for idx in range(2, len(hops))): return False

    return True

Like we did for RAG, we'll use one of the most basic teleprompters in **DSPy**, namely, `BootstrapFewShot`.

In [None]:
teleprompter = BootstrapFewShot(metric=validate_context_and_answer_and_hops)
compiled_baleen = teleprompter.compile(SimplifiedBaleen(), teacher=SimplifiedBaleen(passages_per_hop=2), trainset=trainset)

##### Evaluating the Retrieval

Earlier, it appeared that our simple RAG program was not very effective at finding all the evidence required for answering each question. Is this resolved by adding some extra steps in the `forward` function of `SimplifiedBaleen`? And does compiling help with that?

The answer to these questions is not always going to be obvious. However, **DSPy** makes it extremely easy to try many diverse approaches with minimal effort.

Let's evaluate the quality of retrieval of our compiled and uncompiled Baleen pipelines!

In [None]:
uncompiled_baleen_retrieval_score = evaluate_on_apple(uncompiled_baleen, metric=gold_passages_retrieved, display=False)

In [None]:
compiled_baleen_retrieval_score = evaluate_on_apple(compiled_baleen, metric=gold_passages_retrieved)

In [None]:
print(f"## Retrieval Score for RAG: {compiled_rag_retrieval_score}")  # note that for RAG, compilation has no effect on the retrieval step
print(f"## Retrieval Score for uncompiled Baleen: {uncompiled_baleen_retrieval_score}")
print(f"## Retrieval Score for compiled Baleen: {compiled_baleen_retrieval_score}")

Excellent! There might be something to this compiled, multi-hop program then. But this is far from all you can do: **DSPy** gives you a clean space of composable operators to deal with any shortcomings you see.

We can inspect a few concrete examples. If we see failure causes, we can:

1. Expand our pipeline by using additional sub-modules (e.g., maybe summarize after retrieval?)
1. Modify our pipeline by using more complex logic (e.g., maybe we need to break out of the multi-hop loop if we found all information we need?) 
1. Refine our validation logic (e.g., maybe use a metric that uses a second **DSPy** program to do the answer evaluation, instead of relying on strict string matching)
1. Use a different teleprompter to optimize your pipeline more aggressively.
1. Add more or better training examples!


Or, if we really want, we can tweak the descriptions in the Signatures we use in our program to make them more precisely suited for their sub-tasks. This is akin to prompt engineering and should be a last resort, given the other powerful options that **DSPy** gives us!

In [None]:
compiled_baleen("How many storeys are in the castle that David Gregory inherited?")
turbo.inspect_history(n=3)