# Tutorial: Retrieval-Augmented Generation (RAG)

Let's walk through a quick example of **basic question answering** with and without **retrieval-augmented generation** (RAG) in DSPy. Specifically, let's build **a system for answering Tech questions**, e.g. about Linux or iPhone apps.

Install the latest DSPy via `pip install -U dspy` and follow along. If you're looking instead for a conceptual overview of DSPy, this [recent lecture](https://www.youtube.com/live/JEMYuzrKLUw) is a good place to start.


## Configuring the DSPy Environment

Let's tell DSPy that we will use OpenAI's `gpt-4o-mini` in our modules. To authenticate, DSPy reads the `OPENAI_API_KEY` environment variable. You can easily swap this out for [other providers or local models](https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb).


<details>
<summary>Recommended: Set up MLflow Tracing to understand what's happening under the hood.</summary>

### MLflow DSPy Integration

<a href="https://mlflow.org/">MLflow</a> is an LLMOps tool that natively integrates with DSPy and offers explainability and experiment tracking. In this tutorial, you can use MLflow to visualize prompts and optimization progress as traces to better understand DSPy's behavior. You can set up MLflow easily by following the four steps below.

![MLflow Trace](./mlflow-tracing-rag.png)

1. Install MLflow

```bash
pip install "mlflow>=2.20"
```

2. Start MLflow UI in a separate terminal
```bash
mlflow ui --port 5000
```

3. Connect the notebook to MLflow
```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("DSPy")
```

4. Enable tracing.
```python
mlflow.dspy.autolog()
```

Once you have completed the steps above, you can see traces for each program execution in the notebook. They provide great visibility into the model's behavior and help you understand DSPy's concepts better throughout the tutorial.

To learn more about the integration, visit the [MLflow DSPy Documentation](https://mlflow.org/docs/latest/llms/dspy/index.html) as well.

</details>

In [2]:
import dspy
import os

os.environ["OPENAI_API_KEY"] = "{your_openai_api_key}"

lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)

## Basic Retrieval-Augmented Generation (RAG)

First, let's download the corpus data that we will use for RAG search. An older version of this tutorial used the full (650,000 document) corpus. To make this very fast and cheap to run, we've downsampled the corpus to just 28,000 documents.

In [3]:
dspy.utils.download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_corpus.jsonl")

Downloading 'ragqa_arena_tech_corpus.jsonl'...
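
The downloaded corpus is a JSON Lines file: one JSON object per line, each with a `text` field (the loading code in the next section relies on this). As a minimal sketch of how such a file is read and truncated, the snippet below parses two made-up records from an in-memory buffer. The helper `load_texts` and the sample records are hypothetical and only illustrate the format.

```python
import io
import json


def load_texts(lines, max_characters):
    """Parse JSON Lines records and keep a truncated copy of each `text` field."""
    return [json.loads(line)["text"][:max_characters] for line in lines if line.strip()]


# Two made-up records standing in for the real corpus file.
toy_file = io.StringIO(
    '{"text": "short doc"}\n'
    '{"text": "a much longer document body"}\n'
)

texts = load_texts(toy_file, max_characters=12)
print(texts)  # → ['short doc', 'a much longe']
```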


## Setting Up Your Retriever

As far as DSPy is concerned, you can plug in any Python code for calling tools or retrievers. Here, we'll just use OpenAI Embeddings and do top-K search locally for convenience. We will use a vector index based on [FAISS](https://github.com/facebookresearch/faiss), so please install it via:

```bash
pip install -U faiss-cpu
```

Or, if you have a GPU in your runtime:

```bash
pip install -U faiss-gpu
```

Now let's create the `retriever` (vector index) for our RAG application. We will use the `dspy.retrievers.Embeddings` API, which indexes data via an embedding model and FAISS, and at query time, it fetches the closest data to the query in the embedding space.

In [None]:
import ujson

max_characters = 6000  # for truncating >99th percentile of documents
topk_docs_to_retrieve = 5  # number of documents to retrieve per search query

with open("ragqa_arena_tech_corpus.jsonl") as f:
    corpus = [ujson.loads(line)["text"][:max_characters] for line in f]
    print(f"Loaded {len(corpus)} documents. Will encode them below.")

embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
retriever = dspy.retrievers.Embeddings(embedder=embedder, corpus=corpus, k=topk_docs_to_retrieve)

Loaded 28436 documents. Will encode them below.
Training a 32-byte FAISS index with 337 partitions, based on 28436 x 512-dim embeddings
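
To make "fetches the closest data to the query in the embedding space" concrete, here is a minimal sketch of exact top-K search by cosine similarity over toy vectors. It is not how `dspy.retrievers.Embeddings` is implemented (the log above shows it training a FAISS index); the function names and the 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k):
    """Return the indices of the k corpus vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(corpus_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Made-up "embeddings" for four documents and one query.
toy_corpus_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [0.2, 1.0, 0.0]

print(top_k(toy_query_vec, toy_corpus_vecs, k=2))  # → [2, 1]
```

Brute-force search like this scans every document per query, which is why larger corpora typically use an approximate index instead.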



## Building Your RAG Application

Now it's time to build the RAG application! As covered in the tutorial [Building AI Applications by Customizing DSPy Modules](https://dspy.ai/tutorials/custom_module/), we will put our RAG's logic into a custom DSPy module.

Our RAG has a very simple structure: we query the retriever to get relevant context, then generate an answer based on the question and the fetched context.

In [21]:
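class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Chain-of-thought predictor: reads the context and question, writes a response.
        self.respond = dspy.ChainOfThought("context, question -> response")

    def forward(self, question, **kwargs):
        # Retrieve the top-k passages for the question, then answer from them.
        context = retriever(question).passages
        return self.respond(context=context, question=question)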

Let's create a RAG instance and call it with some random technical question.


In [23]:
rag = RAG()
rag(question="what are high memory and low memory on linux?")

Prediction(
    reasoning="High memory and low memory in Linux refer to two distinct segments of the kernel's memory space. Low memory is the portion of memory that the kernel can access directly and is statically mapped at boot time, allowing for efficient access. High memory, on the other hand, is not permanently mapped in the kernel's address space, meaning that the kernel must map it temporarily when it needs to access it. This distinction is crucial for managing memory in a 32-bit architecture, where the kernel needs to handle more memory than it can directly address. High memory is typically used for temporary data buffers, while low memory is used for kernel operations.",
    response="In Linux, high memory refers to the segment of memory that is not permanently mapped in the kernel's address space, requiring the kernel to map it temporarily for access. Low memory, conversely, is the portion that the kernel can access directly and is statically mapped at boot time. This separati

In [12]:
dspy.inspect_history()





[2025-06-17T16:19:13.986128]

System message:

Your input fields are:
1. `context` (str): 
2. `question` (str):
Your output fields are:
1. `reasoning` (str): 
2. `response` (str):
All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## context ## ]]
{context}

[[ ## question ## ]]
{question}

[[ ## reasoning ## ]]
{reasoning}

[[ ## response ## ]]
{response}

[[ ## completed ## ]]
In adhering to this structure, your objective is: 
        Given the fields `context`, `question`, produce the fields `response`.


User message:

[[ ## context ## ]]
[1] «As far as I remember, High Memory is used for application space and Low Memory for the kernel. Advantage is that (user-space) applications cant access kernel-space memory.»
[2] «HIGHMEM is a range of kernels memory space, but it is NOT memory you access but its a place where you put what you want to access. A typical 32bit Linux virtual memory map is like: 0x0000
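
The prompt above delimits each field with a `[[ ## name ## ]]` header. As a rough illustration of how a completion in that layout can be read back into named fields, here is a small regex-based parser. `parse_fields` and the sample text are hypothetical; this is a sketch of the idea, not DSPy's own adapter code.

```python
import re

# Matches headers such as "[[ ## response ## ]]" and captures the field name.
FIELD_HEADER = re.compile(r"\[\[ ## (\w+) ## \]\]")


def parse_fields(text):
    """Split a completion into {field_name: value}, ignoring the `completed` marker."""
    parts = FIELD_HEADER.split(text)
    # parts looks like [preamble, name1, body1, name2, body2, ...]
    fields = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        if name != "completed":
            fields[name] = body.strip()
    return fields


sample_completion = (
    "[[ ## reasoning ## ]]\nthink\n\n"
    "[[ ## response ## ]]\nanswer\n\n"
    "[[ ## completed ## ]]"
)
print(parse_fields(sample_completion))  # → {'reasoning': 'think', 'response': 'answer'}
```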

## Using a DSPy Optimizer to Improve Your RAG Prompt

You can stop reading here if you just want to build a RAG application with DSPy. In this section, we show how to use a DSPy optimizer to improve the program's quality.

Let's first evaluate the RAG we built. In order to evaluate a RAG application, we need a dataset along with an evaluation metric. For this purpose, we will use the [RAG-QA dataset](https://arxiv.org/abs/2407.13998) as the evaluation dataset, and [SemanticF1](https://karmake2.github.io/files/Publications/2022/SEM_F1.pdf) as the evaluation metric. Let's download the dataset and import the evaluation metric.

In [13]:
import ujson
from dspy.utils import download

# Download question--answer pairs from the RAG-QA Arena "Tech" dataset.
download("https://huggingface.co/dspy/cache/resolve/main/ragqa_arena_tech_examples.jsonl")

with open("ragqa_arena_tech_examples.jsonl") as f:
    data = [ujson.loads(line) for line in f]

# Convert the data into a list of `dspy.Example` objects so that we can use them for `dspy.Evaluate`.
data = [dspy.Example(**d).with_inputs("question") for d in data]

Downloading 'ragqa_arena_tech_examples.jsonl'...
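
For intuition about the metric, an F1 score is the harmonic mean of precision and recall. As we understand it, `SemanticF1` obtains precision and recall from LM judgments about the content of the two responses. The sketch below uses plain token overlap as a crude stand-in for those judgments, so its numbers are not comparable to `SemanticF1` scores. Both function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_overlap_f1(prediction, reference):
    """F1 over unique lowercase tokens; a rough proxy, not a semantic comparison."""
    pred_tokens = set(prediction.lower().split())
    ref_tokens = set(reference.lower().split())
    if not pred_tokens or not ref_tokens:
        return 0.0
    common = pred_tokens & ref_tokens
    precision = len(common) / len(pred_tokens)  # share of predicted tokens that are correct
    recall = len(common) / len(ref_tokens)      # share of reference tokens that were covered
    return f1_score(precision, recall)


score = token_overlap_f1(
    "use the mount command",
    "the mount command shows the filesystem type",
)
print(round(score, 3))  # → 0.6
```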


Let's take a look at a sample data record and some metadata of the dataset.

In [17]:
print(f"Data size: {len(data)}")
print(data[0])

Data size: 2064
Example({'question': 'why igp is used in mpls?', 'response': "An IGP exchanges routing prefixes between gateways/routers.  \nWithout a routing protocol, you'd have to configure each route on every router and you'd have no dynamic updates when routes change because of link failures. \nFuthermore, within an MPLS network, an IGP is vital for advertising the internal topology and ensuring connectivity for MP-BGP inside the network.", 'gold_doc_ids': [2822, 2823]}) (input_keys={'question'})


Each record has three fields:

- question
- response
- gold_doc_ids

For evaluation, we only use the `question` and `response` fields.

Let's take some slices of the dataset to serve as the training and validation sets.

In [18]:
import random

random.Random(0).shuffle(data)
trainset, devset = data[:200], data[200:500]
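
A short sketch of why the seeded shuffle matters: `random.Random(0)` makes the ordering reproducible across runs, and slicing after the shuffle yields disjoint training and validation sets. The integers below are stand-ins for the 2064 examples.

```python
import random

records = list(range(2064))  # stand-ins for the 2064 dataset examples
random.Random(0).shuffle(records)
toy_trainset, toy_devset = records[:200], records[200:500]

# Re-running with the same seed reproduces exactly the same order.
rerun = list(range(2064))
random.Random(0).shuffle(rerun)

print(len(toy_trainset), len(toy_devset))        # → 200 300
print(rerun == records)                          # → True
print(set(toy_trainset).isdisjoint(toy_devset))  # → True
```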

Now let's evaluate our RAG application to get a sense of how it performs. For a complete guide on how to evaluate a DSPy program, please refer to the [evaluation guide](https://dspy.ai/learn/evaluation/overview/).

In [19]:
from dspy.evaluate import SemanticF1

# Instantiate the metric.
metric = SemanticF1(decompositional=True)

# Define an evaluator that we can re-use.
evaluate = dspy.Evaluate(
    devset=devset,
    metric=metric,
    num_threads=24,
    display_progress=True,
    display_table=10,
)

# Evaluate the rag application we just built.
evaluate(rag)

Average Metric: 163.55 / 300 (54.5%): 100%|██████████| 300/300 [03:17<00:00,  1.52it/s]

2025/06/17 19:43:05 INFO dspy.evaluate.evaluate: Average Metric: 163.5533565284151 / 300 (54.5%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 |
|---|---|---|---|---|---|---|
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | C should be used over C++ primarily in scenarios where simplicity,... | Use C over C++ when working on embedded systems, requiring low-lev... | ✔️ [0.364] |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | Storing images in a Git repository can be problematic due to Git's... | While it is technically possible to store images in a Git reposito... | ✔️ [0.444] |
| 2 | how to chmod without /usr/bin/chmod? | Run the loader directly, and pass it the command you want to run: ... | [3252, 3254, 3270, 1964, 4222, 8050] | To change file permissions without using the `/usr/bin/chmod` comm... | You can change file permissions without using `/usr/bin/chmod` by ... | ✔️ [0.500] |
| 3 | how to show the filesystem type via the terminal? | To determine the type of file system used, you can employ the moun... | [8248] | To show the filesystem type via the terminal, you can use several ... | You can show the filesystem type via the terminal using the follow... | |
| 4 | can i get the cpu temperature and fan speed from the command line ... | The iStats ruby gem allows you to monitor the CPU temperature with... | [4354, 4784] | Yes, you can get the CPU temperature and fan speed from the comman... | Yes, you can get the CPU temperature using the command `sudo power... | ✔️ [0.667] |
| 5 | how can i see my home folder in the finder? | To navigate to your home folder on a Mac, you can use the keyboard... | [6416, 6423, 6424, 2591] | To see your home folder in Finder, you can use several methods. On... | You can see your home folder in Finder by looking for it in the si... | ✔️ [0.500] |
| 6 | can you see when someone is looking at your location? | When using Find My Friends, your friends will not be aware if you ... | [1951, 1954] | Based on the provided context, when someone is using location-shar... | No, you cannot see when someone is looking at your location. When ... | ✔️ [0.797] |
| 7 | is a zip code considered pii? | On one hand, some argue that by itself, a zip code cannot be class... | [4775, 4777, 4779, 5541] | A zip code can be considered personally identifiable information (... | Yes, a zip code can be considered PII, especially when combined wi... | ✔️ [0.600] |
| 8 | why is my message not delivered imessage? | "Not Delivered" implies an error of some kind, perhaps because the... | [7645, 4712] | Your message may not be delivered in iMessage for several reasons.... | Your message may not be delivered due to several factors, such as ... | ✔️ [0.381] |
| 9 | how to buy an app for a friend? | Google Play Gift Cards have become widely accessible in numerous c... | [383] | To buy an app for a friend, you can use the gifting feature availa... | To buy an app for a friend, you can gift the app directly through ... | |


54.52
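
The `Average Metric` line above is the sum of per-example scores divided by the number of examples (163.55 / 300 ≈ 54.5%). The sketch below shows that arithmetic with a thread pool, loosely mirroring what `num_threads` suggests. It is an illustration under our own assumptions, not the implementation of `dspy.Evaluate`, and the toy program, metric, and examples are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_program(program, metric, examples, num_threads=4):
    """Score every example in parallel; return (sum of scores, percentage)."""

    def score_one(example):
        prediction = program(example["question"])
        return metric(example, prediction)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, examples))  # map preserves input order

    total = sum(scores)
    return total, 100 * total / len(scores)


# Hypothetical stand-ins: the "program" upper-cases the question,
# and the "metric" checks for an exact match with the reference.
def toy_program(question):
    return question.upper()


def toy_metric(example, prediction):
    return 1.0 if prediction == example["response"] else 0.0


toy_examples = [
    {"question": "a", "response": "A"},
    {"question": "b", "response": "B"},
    {"question": "c", "response": "x"},
    {"question": "d", "response": "D"},
]

total, percent = evaluate_program(toy_program, toy_metric, toy_examples)
print(f"Average Metric: {total:.2f} / {len(toy_examples)} ({percent:.1f}%)")
# → Average Metric: 3.00 / 4 (75.0%)
```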

Off the shelf, our `RAG` module scores ~54.5%. What are our options to make it stronger? You could try manual prompt engineering, such as making the instruction more comprehensive or providing a few examples. DSPy offers another option: using DSPy optimizers to automatically optimize our RAG's prompt.

Let's set up and use DSPy's MIPRO (v2) optimizer. The run below costs around $0.50 (for the `light` auto setting) and may take 10-20 minutes, depending on your number of threads.

In [25]:
tp = dspy.MIPROv2(metric=metric, auto="light", num_threads=24)  # use fewer threads if your rate limit is small

optimized_rag = tp.compile(
    rag,
    trainset=trainset,
    max_bootstrapped_demos=2,
    max_labeled_demos=2,
    requires_permission_to_run=False,
)

# Evaluate the optimized RAG.
evaluate(optimized_rag)

2025/06/17 20:17:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 10
minibatch: True
num_fewshot_candidates: 6
num_instruct_candidates: 3
valset size: 100

2025/06/17 20:17:43 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
2025/06/17 20:17:43 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.

2025/06/17 20:17:43 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=6 sets of demonstrations...


Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6


 10%|█         | 4/40 [00:00<00:05,  6.29it/s]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 4/6


  5%|▌         | 2/40 [00:26<08:19, 13.14s/it]


Bootstrapped 1 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 5/6


 10%|█         | 4/40 [01:11<10:45, 17.93s/it]


Bootstrapped 2 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.
Bootstrapping set 6/6


  5%|▌         | 2/40 [00:36<11:28, 18.13s/it]
2025/06/17 20:19:58 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
2025/06/17 20:19:58 INFO dspy.teleprompt.mipro_optimizer_v2: We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.
2025/06/17 20:19:58 INFO dspy.teleprompt.mipro_optimizer_v2: 
Proposing N=3 instructions...



Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Error getting source code: unhashable type: 'dict'.

Running without program aware proposer.


2025/06/17 20:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: Proposed Instructions for Predictor 0:

2025/06/17 20:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: 0: Given the fields `context`, `question`, produce the fields `response`.

2025/06/17 20:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: 1: Imagine you are a technical support specialist tasked with resolving urgent user issues related to macOS and command line environments. A user has reached out in a panic because they cannot find their account pictures and need them for an important presentation in just a few hours. Your job is to provide a clear, step-by-step response to the following question: "Where does macOS store account pictures?" Ensure your response is thorough, includes all relevant locations, and is accessible to users of varying skill levels.

2025/06/17 20:20:03 INFO dspy.teleprompt.mipro_optimizer_v2: 2: Given the context provided, answer the following question with a detailed and clear response: {question}

20

Average Metric: 52.24 / 100 (52.2%): 100%|██████████| 100/100 [00:03<00:00, 32.08it/s]

2025/06/17 20:20:06 INFO dspy.evaluate.evaluate: Average Metric: 52.240583970223014 / 100 (52.2%)
2025/06/17 20:20:06 INFO dspy.teleprompt.mipro_optimizer_v2: Default program score: 52.24

2025/06/17 20:20:06 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 2 / 13 - Minibatch ==



Average Metric: 20.92 / 35 (59.8%): 100%|██████████| 35/35 [00:38<00:00,  1.11s/it]

2025/06/17 20:20:45 INFO dspy.evaluate.evaluate: Average Metric: 20.924469148217348 / 35 (59.8%)
2025/06/17 20:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.78 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/06/17 20:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78]
2025/06/17 20:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24]
2025/06/17 20:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 52.24


2025/06/17 20:20:45 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 3 / 13 - Minibatch ==



Average Metric: 20.07 / 35 (57.4%): 100%|██████████| 35/35 [00:40<00:00,  1.17s/it]

2025/06/17 20:21:26 INFO dspy.evaluate.evaluate: Average Metric: 20.074049254214792 / 35 (57.4%)
2025/06/17 20:21:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.35 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/06/17 20:21:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35]
2025/06/17 20:21:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24]
2025/06/17 20:21:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 52.24


2025/06/17 20:21:26 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 4 / 13 - Minibatch ==



Average Metric: 20.45 / 35 (58.4%): 100%|██████████| 35/35 [00:37<00:00,  1.08s/it]

2025/06/17 20:22:04 INFO dspy.evaluate.evaluate: Average Metric: 20.452614424867754 / 35 (58.4%)
2025/06/17 20:22:04 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.44 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 5'].
2025/06/17 20:22:04 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44]
2025/06/17 20:22:04 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24]
2025/06/17 20:22:04 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 52.24


2025/06/17 20:22:04 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 5 / 13 - Minibatch ==



Average Metric: 20.86 / 35 (59.6%): 100%|██████████| 35/35 [00:39<00:00,  1.13s/it]

2025/06/17 20:22:43 INFO dspy.evaluate.evaluate: Average Metric: 20.863533031592333 / 35 (59.6%)
2025/06/17 20:22:43 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 59.61 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 2'].
2025/06/17 20:22:43 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61]
2025/06/17 20:22:43 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24]
2025/06/17 20:22:43 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 52.24


2025/06/17 20:22:43 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 6 / 13 - Minibatch ==



Average Metric: 22.50 / 35 (64.3%): 100%|██████████| 35/35 [00:42<00:00,  1.22s/it]

2025/06/17 20:23:26 INFO dspy.evaluate.evaluate: Average Metric: 22.504307654109333 / 35 (64.3%)
2025/06/17 20:23:26 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 64.3 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2025/06/17 20:23:26 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61, 64.3]
2025/06/17 20:23:26 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24]
2025/06/17 20:23:26 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 52.24


2025/06/17 20:23:26 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 7 / 13 - Full Evaluation =====
2025/06/17 20:23:26 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 64.3) from minibatch trials...



Average Metric: 58.65 / 100 (58.7%): 100%|██████████| 100/100 [01:02<00:00,  1.61it/s]

2025/06/17 20:24:28 INFO dspy.evaluate.evaluate: Average Metric: 58.65054658356367 / 100 (58.7%)
2025/06/17 20:24:28 INFO dspy.teleprompt.mipro_optimizer_v2: [92mNew best full eval score![0m Score: 58.65
2025/06/17 20:24:28 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65]
2025/06/17 20:24:28 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65
2025/06/17 20:24:28 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/17 20:24:28 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 8 / 13 - Minibatch ==



Average Metric: 19.34 / 35 (55.3%): 100%|██████████| 35/35 [00:24<00:00,  1.44it/s]

2025/06/17 20:24:53 INFO dspy.evaluate.evaluate: Average Metric: 19.338979262327037 / 35 (55.3%)
2025/06/17 20:24:53 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 55.25 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 0'].
2025/06/17 20:24:53 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61, 64.3, 55.25]
2025/06/17 20:24:53 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65]
2025/06/17 20:24:53 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65


2025/06/17 20:24:53 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 9 / 13 - Minibatch ==



Average Metric: 17.95 / 35 (51.3%): 100%|██████████| 35/35 [00:47<00:00,  1.37s/it]

2025/06/17 20:25:41 INFO dspy.evaluate.evaluate: Average Metric: 17.95395292302051 / 35 (51.3%)
2025/06/17 20:25:41 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 51.3 on minibatch of size 35 with parameters ['Predictor 0: Instruction 2', 'Predictor 0: Few-Shot Set 5'].
2025/06/17 20:25:41 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61, 64.3, 55.25, 51.3]
2025/06/17 20:25:41 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65]
2025/06/17 20:25:41 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65


2025/06/17 20:25:41 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 10 / 13 - Minibatch ==



Average Metric: 20.54 / 35 (58.7%): 100%|██████████| 35/35 [00:31<00:00,  1.12it/s]

2025/06/17 20:26:12 INFO dspy.evaluate.evaluate: Average Metric: 20.543559645193525 / 35 (58.7%)
2025/06/17 20:26:12 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 58.7 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 4'].
2025/06/17 20:26:12 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61, 64.3, 55.25, 51.3, 58.7]
2025/06/17 20:26:12 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65]
2025/06/17 20:26:12 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65


2025/06/17 20:26:12 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 11 / 13 - Minibatch ==



Average Metric: 20.26 / 35 (57.9%): 100%|██████████| 35/35 [00:01<00:00, 24.41it/s]

2025/06/17 20:26:14 INFO dspy.evaluate.evaluate: Average Metric: 20.255674014148813 / 35 (57.9%)
2025/06/17 20:26:14 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 57.87 on minibatch of size 35 with parameters ['Predictor 0: Instruction 0', 'Predictor 0: Few-Shot Set 5'].
2025/06/17 20:26:14 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61, 64.3, 55.25, 51.3, 58.7, 57.87]
2025/06/17 20:26:14 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65]
2025/06/17 20:26:14 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65


2025/06/17 20:26:14 INFO dspy.teleprompt.mipro_optimizer_v2: == Trial 12 / 13 - Minibatch ==



Average Metric: 18.27 / 35 (52.2%): 100%|██████████| 35/35 [00:42<00:00,  1.20s/it]

2025/06/17 20:26:56 INFO dspy.evaluate.evaluate: Average Metric: 18.266113519144685 / 35 (52.2%)
2025/06/17 20:26:56 INFO dspy.teleprompt.mipro_optimizer_v2: Score: 52.19 on minibatch of size 35 with parameters ['Predictor 0: Instruction 1', 'Predictor 0: Few-Shot Set 3'].
2025/06/17 20:26:56 INFO dspy.teleprompt.mipro_optimizer_v2: Minibatch scores so far: [59.78, 57.35, 58.44, 59.61, 64.3, 55.25, 51.3, 58.7, 57.87, 52.19]
2025/06/17 20:26:56 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65]
2025/06/17 20:26:56 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65


2025/06/17 20:26:56 INFO dspy.teleprompt.mipro_optimizer_v2: ===== Trial 13 / 13 - Full Evaluation =====
2025/06/17 20:26:56 INFO dspy.teleprompt.mipro_optimizer_v2: Doing full eval on next top averaging program (Avg Score: 59.61) from minibatch trials...



Average Metric: 57.01 / 100 (57.0%): 100%|██████████| 100/100 [01:02<00:00,  1.61it/s]

2025/06/17 20:27:58 INFO dspy.evaluate.evaluate: Average Metric: 57.006762023295245 / 100 (57.0%)
2025/06/17 20:27:58 INFO dspy.teleprompt.mipro_optimizer_v2: Full eval scores so far: [52.24, 58.65, 57.01]
2025/06/17 20:27:58 INFO dspy.teleprompt.mipro_optimizer_v2: Best full score so far: 58.65
2025/06/17 20:27:58 INFO dspy.teleprompt.mipro_optimizer_v2: 

2025/06/17 20:27:58 INFO dspy.teleprompt.mipro_optimizer_v2: Returning best identified program with score 58.65!



Average Metric: 180.78 / 300 (60.3%): 100%|██████████| 300/300 [03:13<00:00,  1.55it/s]

2025/06/17 20:31:12 INFO dspy.evaluate.evaluate: Average Metric: 180.78224801513014 / 300 (60.3%)





| | question | example_response | gold_doc_ids | reasoning | pred_response | SemanticF1 |
|---|---|---|---|---|---|---|
| 0 | when to use c over c++, and c++ over c? | If you are equally familiar with both C++ and C, it's advisable to... | [733] | The context outlines several scenarios where one might choose C ov... | You should consider using C over C++ in the following scenarios: 1... | ✔️ [0.400] |
| 1 | should images be stored in a git repository? | One viewpoint expresses that there is no significant downside, esp... | [6253, 6254, 6275, 6278, 8215] | Storing images in a Git repository can be problematic due to Git's... | Storing images in a Git repository is generally not recommended du... | ✔️ [0.444] |
| 2 | how to chmod without /usr/bin/chmod? | Run the loader directly, and pass it the command you want to run: ... | [3252, 3254, 3270, 1964, 4222, 8050] | The context provides several alternative methods to change file pe... | To change file permissions without using `/usr/bin/chmod`, you can... | ✔️ [0.500] |
| 3 | how to show the filesystem type via the terminal? | To determine the type of file system used, you can employ the moun... | [8248] | To show the filesystem type via the terminal, there are several co... | To show the filesystem type via the terminal, you can use the foll... | ✔️ [0.400] |
| 4 | can i get the cpu temperature and fan speed from the command line ... | The iStats ruby gem allows you to monitor the CPU temperature with... | [4354, 4784] | Yes, you can get the CPU temperature and fan speed from the comman... | Yes, you can get the CPU temperature and fan speed from the comman... | ✔️ [0.769] |
| 5 | how can i see my home folder in the finder? | To navigate to your home folder on a Mac, you can use the keyboard... | [6416, 6423, 6424, 2591] | To see your home folder in Finder, there are several methods you c... | To see your home folder in Finder, you can use one of the followin... | ✔️ [0.600] |
| 6 | can you see when someone is looking at your location? | When using Find My Friends, your friends will not be aware if you ... | [1951, 1954] | Based on the context provided, when someone is using location-shar... | No, you cannot see when someone is looking at your location throug... | ✔️ [0.774] |
| 7 | is a zip code considered pii? | On one hand, some argue that by itself, a zip code cannot be class... | [4775, 4777, 4779, 5541] | The classification of a zip code as Personally Identifiable Inform... | A zip code is not considered PII by itself, as it does not uniquel... | ✔️ [0.750] |
| 8 | why is my message not delivered imessage? | "Not Delivered" implies an error of some kind, perhaps because the... | [7645, 4712] | The context explains several reasons why an iMessage may not be ma... | Your iMessage may not be delivered for several reasons. The recipi... | ✔️ [0.708] |
| 9 | how to buy an app for a friend? | Google Play Gift Cards have become widely accessible in numerous c... | [383] | To buy an app for a friend, the most straightforward method is to ... | To buy an app for a friend, you can use the following methods: 1. ... | ✔️ [0.284] |


60.26

We can see that the score rose from `54.52` to `60.26` using only a small amount of data in `light` mode. With more data and more iterations, you can likely achieve a better result. Please explore on your own!
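
The optimizer log above alternates cheap minibatch trials with occasional full evaluations of the "next top averaging program". The sketch below replays that selection rule using the minibatch scores copied from the log: average the minibatch scores per (instruction, few-shot set) pair and pick the best pair that has not yet had a full evaluation. This only models the selection step as the log reports it; how MIPROv2 proposes each trial's candidate is not modeled, and `next_full_eval_candidate` is a hypothetical helper.

```python
from collections import defaultdict

# (instruction index, few-shot set index, minibatch score), in log order.
minibatch_trials = [
    (1, 3, 59.78), (2, 0, 57.35), (1, 5, 58.44), (2, 2, 59.61), (0, 5, 64.30),
    (2, 0, 55.25), (2, 5, 51.30), (1, 4, 58.70), (0, 5, 57.87), (1, 3, 52.19),
]


def next_full_eval_candidate(trials, already_evaluated):
    """Pick the configuration with the highest mean minibatch score,
    skipping configurations that already received a full evaluation."""
    scores = defaultdict(list)
    for instruction, fewshot_set, score in trials:
        scores[(instruction, fewshot_set)].append(score)
    averages = {
        config: sum(values) / len(values)
        for config, values in scores.items()
        if config not in already_evaluated
    }
    best = max(averages, key=averages.get)
    return best, averages[best]


# First full evaluation: after the first five minibatch trials.
first, first_avg = next_full_eval_candidate(minibatch_trials[:5], set())
# Second full evaluation: after all ten, excluding the one already evaluated.
second, second_avg = next_full_eval_candidate(minibatch_trials, {first})

print(first, round(first_avg, 2))    # → (0, 5) 64.3
print(second, round(second_avg, 2))  # → (2, 2) 59.61
```

These match the averages the log reports before each full evaluation (64.3 and 59.61). Note that a strong minibatch score does not guarantee a strong full score: the first candidate dropped to 58.65 on the full 100-example validation set.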

Let's check on an example here, asking the same question to the baseline `rag = RAG()` program, which was not optimized, and to the `optimized_rag = MIPROv2(..)(..)` program, after prompt optimization.

In [26]:
baseline = rag(question="cmd+tab does not work on hidden or minimized windows")
print(baseline.response)

You are correct that cmd+tab does not work on hidden or minimized windows. The Command + Tab shortcut is designed to switch between active applications, and minimized windows do not count as active. To access a minimized window, you would need to restore it first, either by clicking on its icon in the Dock or using other methods to unminimize it.


In [27]:
pred = optimized_rag(question="cmd+tab does not work on hidden or minimized windows")
print(pred.response)

The Command + Tab shortcut does not allow you to switch directly to hidden or minimized windows. To regain focus on a minimized application, you first need to switch to another application using Command + Tab and let it take focus. If you want to manage minimized windows more effectively, you can adjust settings in System Preferences under Mission Control or use other keyboard shortcuts like Command + Option + H + M to hide all other windows while minimizing the most recent one.


You can use `dspy.inspect_history(n=2)` to view the RAG prompt before and after the optimization. 

A sample comparison from our run can be found here:

- [before optimization](https://gist.github.com/okhat/5d04648f2226e72e66e26a8cb1456ee4)
- [after optimization](https://gist.github.com/okhat/79405b8889b4b07da577ee19f1a3479a).

Concretely, in one of the runs of this notebook, the optimized prompt does the following (note that it may be different on a later rerun).

1. Constructs the following instruction,
```text
Using the provided `context` and `question`, analyze the information step by step to generate a comprehensive and informative `response`. Ensure that the response clearly explains the concepts involved, highlights key distinctions, and addresses any complexities noted in the context.
```

2. And includes two fully worked out RAG examples with synthetic reasoning and answers, e.g. `how to transfer whatsapp voice message to computer?`.

## Keeping an eye on cost

DSPy lets you track the cost of your programs so you can monitor how much your LM calls cost. Here is how to compute the total cost so far from the LM's call history.

In [28]:
cost = sum([x["cost"] for x in lm.history if x["cost"] is not None])  # in USD, as calculated by LiteLLM for certain providers
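
The same aggregation can be tried without making any LM calls. The history entries below are made up; the only property carried over from the line above is that each entry has a `cost` key that may be `None` (for example, when no price is reported for a call).

```python
def total_cost(history):
    """Sum the `cost` of every history entry, skipping entries without a cost."""
    return sum(entry["cost"] for entry in history if entry.get("cost") is not None)


# Hypothetical history entries; real entries contain many more keys.
toy_history = [
    {"cost": 0.00012},
    {"cost": None},  # skipped: no cost reported for this call
    {"cost": 0.00030},
]

print(f"Total cost: ${total_cost(toy_history):.5f}")  # → Total cost: $0.00042
```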


## What's next?

Improving from around 54% to approximately 60% on this task, in terms of `SemanticF1`, was pretty easy with a DSPy optimizer.

But DSPy gives you paths to continue iterating on the quality of your system and we have barely scratched the surface.

In general, you have the following tools:

1. Explore better system architectures for your program, e.g. what if we ask the LM to generate search queries for the retriever? See, e.g., the [STORM pipeline](https://arxiv.org/abs/2402.14207) built in DSPy.
2. Explore different [prompt optimizers](https://arxiv.org/abs/2406.11695) or [weight optimizers](https://arxiv.org/abs/2407.10930). See the Optimizers Docs.
3. Scale inference-time compute using DSPy Optimizers, e.g. via ensembling multiple post-optimization programs.
4. Cut cost by distilling to a smaller LM, via prompt or weight optimization.
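
As a rough illustration of the ensembling idea in item 3, the sketch below runs several programs on the same question and returns the most common answer. The three "programs" are hypothetical stand-ins that return fixed strings. Exact-match voting like this suits short, canonical answers; long free-text responses such as the ones in this tutorial would need a different way to combine outputs.

```python
from collections import Counter


def majority_vote(programs, question):
    """Run every program on the question and return the most frequent answer."""
    answers = [program(question) for program in programs]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Hypothetical stand-ins for three separately optimized programs.
toy_programs = [
    lambda question: "ext4",
    lambda question: "ext4",
    lambda question: "btrfs",
]

print(majority_vote(toy_programs, "which filesystem is this?"))  # → ext4
```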

How do you decide which ones to proceed with first?

The first step is to look at your system outputs, which will allow you to identify the sources of lower performance if any. While doing all of this, make sure you continue to refine your metric, e.g. by optimizing against your judgments, and to collect more (or more realistic) data, e.g. from related domains or from putting a demo of your system in front of users.