<img src="docs/images/DSPy8.png" alt="DSPy7 Image" height="150"/>

## **DSPy**: Programming with Foundation Models

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/intro.ipynb)

This notebook introduces the **DSPy** framework for **Programming with Foundation Models**, i.e., language models (LMs) and retrieval models (RMs).

**DSPy** emphasizes programming over prompting. It unifies techniques for **prompting** and **fine-tuning** LMs as well as improving them with **reasoning** and **tool/retrieval augmentation**, all expressed through a _minimalistic set of Pythonic operations that compose and learn_.

**DSPy** provides **composable and declarative modules** for instructing LMs in a familiar Pythonic syntax. On top of that, **DSPy** introduces an **automatic compiler that teaches LMs** how to conduct the declarative steps in your program. The **DSPy compiler** will internally _trace_ your program and then **craft high-quality prompts for large LMs (or train automatic finetunes for small LMs)** to teach them the steps of your task.

### 0] Setting Up

As we'll start to see below, **DSPy** can routinely teach powerful models like `GPT-3.5` and local models like `T5-base` or `Llama2-13b` to be much more reliable at complex tasks. **DSPy** will compile the _same program_ into different few-shot prompts and/or finetunes for each LM.

In [1]:
import sys
import os
import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch
from rouge_score import rouge_scorer

In [2]:
model_name = 'gemma2:27b'
#model_name = 'qwen2.5:72b'

In [4]:
#turbo = dspy.OpenAI(model='gpt-3.5-turbo')
lm = dspy.OllamaLocal(model=model_name)
dspy.settings.configure(lm=lm)

You can build your own **DSPy programs** for various tasks, e.g., question answering, information extraction, or text-to-SQL.

Whatever the task, the general workflow is:

1. **Collect a little bit of data.** Define examples of the inputs and outputs of your program (e.g., questions and their answers). This could just be a handful of quick examples you wrote down. If large datasets exist, the more the merrier!
1. **Write your program.** Define the modules (i.e., sub-tasks) of your program and the way they should interact together to solve your task.
1. **Define some validation logic.** What makes for a good run of your program? Maybe the answers need to have a certain length or stick to a particular format? Specify the logic that checks that.
1. **Compile!** Ask **DSPy** to _compile_ your program using your data. The compiler will use your data and validation logic to optimize your program (e.g., prompts and modules) so it's efficient and effective!
1. **Iterate.** Repeat the process by improving your data, program, validation, or by using more advanced features of the **DSPy** compiler.

### Data

In [5]:
from datasets import load_dataset

multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")
print(multi_lexsum)

DatasetDict({
    train: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 3177
    })
    validation: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 454
    })
    test: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 908
    })
})


The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set. DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special sub-class of Example. When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual datapoints will be of type Example.

Select 100 train examples and 20 dev examples where all 3 summaries are present

In [6]:
trainset = multi_lexsum['train'].filter(
    lambda x: x['summary/tiny'] is not None and x['summary/short'] is not None and x['summary/long'] is not None
).select(range(100))

devset = multi_lexsum['validation'].filter(
    lambda x: x['summary/tiny'] is not None and x['summary/short'] is not None and x['summary/long'] is not None
).select(range(20))

#### Create Examples

In [7]:
def join_sources(x):
    x['sources'] = ' '.join(x['sources'])
    return x

trainset = trainset.map(join_sources, batched=False)
devset = devset.map(join_sources, batched=False)

In [8]:
# Tell DSPy that the joined 'sources' field is the input. Any other fields are labels and/or metadata
trainset = [dspy.Example(
    doc=x['sources'],
    long=x['summary/long'],
    short=x['summary/short'],
    tiny=x['summary/tiny']
).with_inputs('doc') for x in trainset]

devset = [dspy.Example(
    doc=' '.join(x['sources']),
    long=x['summary/long'],
    short=x['summary/short'],
    tiny=x['summary/tiny']
).with_inputs('doc') for x in devset]

# ['doc', 'long', 'short', 'tiny']
len(trainset), len(devset)

(100, 20)

**DSPy** typically requires very minimal labeling. Whereas your pipeline may involve six or seven complex steps, you only need labels for the initial question and the final answer. **DSPy** will bootstrap any intermediate labels needed to support your pipeline. If you change your pipeline in any way, the data bootstrapped will change accordingly!

Now, let's look at some data examples.

In [9]:
train_example = trainset[1]
print(f"short: {train_example.short}")
print(f"tiny: {train_example.tiny}")

short: Two men who were arrested for trespassing on property of businesses open to the public filed a lawsuit in the U.S. District Court for the Western District of Michigan against the city of Grand Rapids, its chief of police, and two individual officers. The plaintiffs claimed that the Grand Rapids Police Department's policy and practice of arresting individuals for trespass -- without probable cause and based on general Letters of Intent to Prosecute signed by Grand Rapids businesses -- results in unreasonable searches and seizures in violation of the Fourth Amendment. The parties came to a private settlement agreement for damages and attorney's fees in late 2019. The Judge dismissed the case in early 2020.


In [10]:
dev_example = devset[2]
print(f"short: {dev_example.short}")
print(f"tiny: {dev_example.tiny}")

short: Pretrial detainees file lawsuit against Middlesex County in November 2015 to ameliorate the unconstitutional conditions of solitary confinement in the Middlesex County Jail. In September 2018, the parties reached a settlement agreement that restricted the maximum amount of time allowed in isolation and provides those in isolation with opportunities to interact with others.
tiny: Pretrial detainees settled this class action against Middlesex County to provide 28 hours per week of out-of-cell time and mental health screenings to people held in solitary confinement.


After loading the raw data, we'd applied `with_inputs(' '.join(x['sources']))` to each example to tell **DSPy** that our input field in each example will be just `doc`. Any other fields are labels or metadata that are not given to the system.

### Basic zero shot prompt

tiny ~ 25 words

short ~ 130 words

long ~ 650 words

Generate a {summary_type} summary of maximum {max_tokens[summary_type]} tokens of the following text:

#### Short Summary

##### Signature

In [11]:
class ShortSummSig(dspy.Signature):
    """Generate short summaries of about 130 words."""
    # input
    doc = dspy.InputField()
    # output
    short = dspy.OutputField()

In `ShortSumm`, the docstring describes the sub-task. Each `InputField` or `OutputField` can optionally contain a description `desc` too. When it's not given, it's inferred from the field's name (e.g., `question`).

Notice that there isn't anything special about this signature in **DSPy**. We can just as easily define a signature that takes a long snippet from a PDF and outputs structured information, for instance.

Anyway, now that we have a signature, let's define and use a **Predictor**. A predictor is a module that knows how to use the LM to implement a signature. Importantly, predictors can **learn** to fit their behavior to the task!

```
dspy.Example(field1=value, field2=value2, ...)
```

In [12]:
# Define the predictor.
generate_short = dspy.Predict(ShortSummSig)

# Call the predictor on a particular input
pred = generate_short(doc=dev_example.doc)

# Print the prediction
#print(f"Doc: {dev_example.doc}")
print(f"Generated short: {pred.short}")
print('')
print(f"Ground truth short: {dev_example.short}")

Generated short: This appears to be a legal document excerpt detailing court activity. Here's a breakdown:

* **Dismissal:** The case (likely a civil lawsuit, given the "cv" designation) was dismissed with prejudice against Middlesex County. This means the plaintiff cannot refile the same claim.
* **Stipulation:**  The dismissal was agreed upon by both parties (the plaintiff and Middlesex County).
* **Dates:** Key dates include:
    * 10/24/2018: Judge Peter G. Sheridan signed the dismissal order.
    * 10/25/2018: The dismissal was officially entered into the court record.
* **PACER Transaction:** This section indicates

Ground truth short: Pretrial detainees file lawsuit against Middlesex County in November 2015 to ameliorate the unconstitutional conditions of solitary confinement in the Middlesex County Jail. In September 2018, the parties reached a settlement agreement that restricted the maximum amount of time allowed in isolation and provides those in isolation with opportunities

#### Evaluation

##### Metric

In [14]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
def eval_summ(example, pred, trace=None):
    return scorer.score(example.short.lower(), pred.short.lower())['rouge1'][2]

r1 = eval_summ

##### Program

In [15]:
class ShortSummProg(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_summ = dspy.Predict(ShortSummSig)

    def forward(self, doc):
        return self.generate_summ(doc=doc)

shortsumm = ShortSummProg()

##### Evaluate

In [16]:
evaluator = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=0)

In [17]:
evaluator(shortsumm, metric=r1)

Average Metric: 4.149412227978533 / 20 


20.75

### Optimized few-shot with bootstrapped demonstrations

In [19]:
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8,
    max_labeled_demos=8,
    #num_candidate_programs=10,
    num_threads=8,
    metric=r1,
    #teacher_settings=dict(lm=gpt4T)
)

Going to sample between 1 and 8 traces per predictor.
Will attempt to bootstrap 16 candidate sets.


In [21]:
shortsumm_fewshot = bootstrap_optimizer.compile(shortsumm, trainset=trainset, valset=devset)

Average Metric: 4.351450206532021 / 20 


Score: 21.76 for set: [0]
New best sscore: 21.76 for seed -3
Scores so far: [21.76]
Best score: 21.76


Average Metric: 4.357405195605286 / 20 


Score: 21.79 for set: [8]
New best sscore: 21.79 for seed -2
Scores so far: [21.76, 21.79]
Best score: 21.79


  8%|▏ | 8/100 [01:18<14:57,  9.75s/it]


Bootstrapped 8 full traces after 9 examples in round 0.


Average Metric: 4.211767554417294 / 20 


Score: 21.06 for set: [8]
Scores so far: [21.76, 21.79, 21.06]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.2276850753525236
Average of max per entry across top 5 scores: 0.2276850753525236
Average of max per entry across top 8 scores: 0.2276850753525236
Average of max per entry across top 9999 scores: 0.2276850753525236


  7%|▏ | 7/100 [02:05<27:43, 17.89s/it]


Bootstrapped 7 full traces after 8 examples in round 0.


Average Metric: 4.1334237528685955 / 20


Score: 20.67 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.2276850753525236
Average of max per entry across top 5 scores: 0.23105284627904102
Average of max per entry across top 8 scores: 0.23105284627904102
Average of max per entry across top 9999 scores: 0.23105284627904102


  3%|  | 3/100 [00:26<14:29,  8.97s/it]


Bootstrapped 3 full traces after 4 examples in round 0.


Average Metric: 4.140660441558267 / 20 


Score: 20.7 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.2276850753525236
Average of max per entry across top 5 scores: 0.23151215027232067
Average of max per entry across top 8 scores: 0.23151215027232067
Average of max per entry across top 9999 scores: 0.23151215027232067


  1%|  | 1/100 [00:08<14:23,  8.72s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 4.087983463840106 / 20 


Score: 20.44 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.2276850753525236
Average of max per entry across top 5 scores: 0.23151215027232067
Average of max per entry across top 8 scores: 0.23381218520546043
Average of max per entry across top 9999 scores: 0.23381218520546043


  4%|  | 4/100 [01:20<32:13, 20.14s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 4.215519643330277 / 20 


Score: 21.08 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.22954593618174238
Average of max per entry across top 5 scores: 0.23056465891200356
Average of max per entry across top 8 scores: 0.23454511405108094
Average of max per entry across top 9999 scores: 0.23454511405108094


  4%|  | 4/100 [00:27<11:00,  6.88s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 4.167550434041505 / 20 


Score: 20.84 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.22954593618174238
Average of max per entry across top 5 scores: 0.23042237382862382
Average of max per entry across top 8 scores: 0.2348621329609808
Average of max per entry across top 9999 scores: 0.2348621329609808


  5%|  | 5/100 [00:31<10:00,  6.32s/it]


Bootstrapped 5 full traces after 6 examples in round 0.


Average Metric: 4.212115835341751 / 20 


Score: 21.06 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.22954593618174238
Average of max per entry across top 5 scores: 0.23529650244822636
Average of max per entry across top 8 scores: 0.23856549397310672
Average of max per entry across top 9999 scores: 0.23896254448361454


  2%|  | 2/100 [00:18<15:18,  9.38s/it]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 4.183474031537362 / 20 


Score: 20.92 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.22954593618174238
Average of max per entry across top 5 scores: 0.23529650244822636
Average of max per entry across top 8 scores: 0.23636757721496565
Average of max per entry across top 9999 scores: 0.23922480771819443


  6%|  | 6/100 [01:53<29:39, 18.93s/it]


Bootstrapped 6 full traces after 7 examples in round 0.


Average Metric: 4.271972054650577 / 20 


Score: 21.36 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22436364365796096
Average of max per entry across top 3 scores: 0.22848029401701817
Average of max per entry across top 5 scores: 0.23366258654079958
Average of max per entry across top 8 scores: 0.23937957573527607
Average of max per entry across top 9999 scores: 0.2422368062385049


  4%|  | 4/100 [00:59<23:40, 14.79s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 4.35365673158565 / 20  


Score: 21.77 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.235972886522418
Average of max per entry across top 8 scores: 0.23960370791644098
Average of max per entry across top 9999 scores: 0.24227274338117538


  8%|▏ | 8/100 [01:15<14:27,  9.43s/it]


Bootstrapped 8 full traces after 9 examples in round 0.


Average Metric: 4.138558549737024 / 20 


Score: 20.69 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.235972886522418
Average of max per entry across top 8 scores: 0.23960370791644098
Average of max per entry across top 9999 scores: 0.24227274338117538


  1%|  | 1/100 [00:05<08:22,  5.08s/it]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 4.221797369148664 / 20 


Score: 21.11 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69, 21.11]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.23397721499320318
Average of max per entry across top 8 scores: 0.23932193696314497
Average of max per entry across top 9999 scores: 0.2423938877293256


  8%|▏ | 8/100 [00:58<11:14,  7.34s/it]


Bootstrapped 8 full traces after 9 examples in round 0.


Average Metric: 4.266788700363167 / 20 


Score: 21.33 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69, 21.11, 21.33]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.23457531776551419
Average of max per entry across top 8 scores: 0.23871939409538467
Average of max per entry across top 9999 scores: 0.24389293216315266


  8%|▏ | 8/100 [02:13<25:36, 16.70s/it]


Bootstrapped 8 full traces after 9 examples in round 0.


Average Metric: 4.1540958333439635 / 20


Score: 20.77 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69, 21.11, 21.33, 20.77]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.23457531776551419
Average of max per entry across top 8 scores: 0.23871939409538467
Average of max per entry across top 9999 scores: 0.24408090208796468


  5%|  | 5/100 [00:42<13:22,  8.45s/it]


Bootstrapped 5 full traces after 6 examples in round 0.


Average Metric: 4.174172664920959 / 20 


Score: 20.87 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69, 21.11, 21.33, 20.77, 20.87]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.23457531776551419
Average of max per entry across top 8 scores: 0.23871939409538467
Average of max per entry across top 9999 scores: 0.24408090208796468


  2%|  | 2/100 [00:18<15:08,  9.27s/it]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 4.188098941633238 / 20 


Score: 20.94 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69, 21.11, 21.33, 20.77, 20.87, 20.94]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.23457531776551419
Average of max per entry across top 8 scores: 0.23871939409538467
Average of max per entry across top 9999 scores: 0.2468081748152374


  4%|  | 4/100 [01:08<27:14, 17.03s/it]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 4.211498499519415 / 20 

Score: 21.06 for set: [8]
Scores so far: [21.76, 21.79, 21.06, 20.67, 20.7, 20.44, 21.08, 20.84, 21.06, 20.92, 21.36, 21.77, 20.69, 21.11, 21.33, 20.77, 20.87, 20.94, 21.06]
Best score: 21.79
Average of max per entry across top 1 scores: 0.21787025978026428
Average of max per entry across top 2 scores: 0.22722673687725586
Average of max per entry across top 3 scores: 0.22958757892814052
Average of max per entry across top 5 scores: 0.23457531776551419
Average of max per entry across top 8 scores: 0.23871939409538467
Average of max per entry across top 9999 scores: 0.2468081748152374
19 candidate programs found.



