<img src="docs/images/DSPy8.png" alt="DSPy7 Image" height="150"/>

## **DSPy**: Programming with Foundation Models

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/intro.ipynb)

This notebook introduces the **DSPy** framework for **Programming with Foundation Models**, i.e., language models (LMs) and retrieval models (RMs).

**DSPy** emphasizes programming over prompting. It unifies techniques for **prompting** and **fine-tuning** LMs as well as improving them with **reasoning** and **tool/retrieval augmentation**, all expressed through a _minimalistic set of Pythonic operations that compose and learn_.

**DSPy** provides **composable and declarative modules** for instructing LMs in a familiar Pythonic syntax. On top of that, **DSPy** introduces an **automatic compiler that teaches LMs** how to conduct the declarative steps in your program. The **DSPy compiler** will internally _trace_ your program and then **craft high-quality prompts for large LMs (or train automatic finetunes for small LMs)** to teach them the steps of your task.

### 0] Setting Up

As we'll start to see below, **DSPy** can routinely teach powerful models like `GPT-3.5` and local models like `T5-base` or `Llama2-13b` to be much more reliable at complex tasks. **DSPy** will compile the _same program_ into different few-shot prompts and/or finetunes for each LM.

In [1]:
import sys
import os
import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, MIPROv2
from rouge_score import rouge_scorer

In [3]:
model_name = 'gemma2:27b'
#model_name = 'qwen2.5:72b'

In [13]:
# does not work right now
#ollama_port = 11434 
#ollama_url = f"http://localhost:{ollama_port}"
#lm = dspy.LM(model=model_name, api_base=ollama_url)
#dspy.settings.configure(lm=lm)

In [3]:
lm = dspy.OllamaLocal(model=model_name)
dspy.settings.configure(lm=lm)

You can build your own **DSPy programs** for various tasks, e.g., question answering, information extraction, or text-to-SQL.

Whatever the task, the general workflow is:

1. **Collect a little bit of data.** Define examples of the inputs and outputs of your program (e.g., questions and their answers). This could just be a handful of quick examples you wrote down. If large datasets exist, the more the merrier!
1. **Write your program.** Define the modules (i.e., sub-tasks) of your program and the way they should interact together to solve your task.
1. **Define some validation logic.** What makes for a good run of your program? Maybe the answers need to have a certain length or stick to a particular format? Specify the logic that checks that.
1. **Compile!** Ask **DSPy** to _compile_ your program using your data. The compiler will use your data and validation logic to optimize your program (e.g., prompts and modules) so it's efficient and effective!
1. **Iterate.** Repeat the process by improving your data, program, validation, or by using more advanced features of the **DSPy** compiler.

### Data

In [5]:
from datasets import load_dataset

multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")
print(multi_lexsum)

DatasetDict({
    train: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 3177
    })
    validation: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 454
    })
    test: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 908
    })
})


The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set. DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special sub-class of Example. When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual datapoints will be of type Example.

Select 100 train examples and 20 dev examples where all 3 summaries are present

In [6]:
trainset = multi_lexsum['train'].filter(
    lambda x: x['summary/tiny'] is not None and x['summary/short'] is not None and x['summary/long'] is not None
).select(range(100))

devset = multi_lexsum['validation'].filter(
    lambda x: x['summary/tiny'] is not None and x['summary/short'] is not None and x['summary/long'] is not None
).select(range(20))

#### Create Examples

In [7]:
def join_sources(x):
    x['sources'] = ' '.join(x['sources'])
    return x

trainset = trainset.map(join_sources, batched=False)
devset = devset.map(join_sources, batched=False)

In [8]:
# Tell DSPy that the joined 'sources' field is the input. Any other fields are labels and/or metadata
trainset = [dspy.Example(
    doc=x['sources'],
    long=x['summary/long'],
    short=x['summary/short'],
    tiny=x['summary/tiny']
).with_inputs('doc') for x in trainset]

devset = [dspy.Example(
    doc=' '.join(x['sources']),
    long=x['summary/long'],
    short=x['summary/short'],
    tiny=x['summary/tiny']
).with_inputs('doc') for x in devset]

# ['doc', 'long', 'short', 'tiny']
len(trainset), len(devset)

(100, 20)

**DSPy** typically requires very minimal labeling. Whereas your pipeline may involve six or seven complex steps, you only need labels for the initial question and the final answer. **DSPy** will bootstrap any intermediate labels needed to support your pipeline. If you change your pipeline in any way, the data bootstrapped will change accordingly!

Now, let's look at some data examples.

In [9]:
train_example = trainset[1]
print(f"short: {train_example.short}")
print(f"tiny: {train_example.tiny}")

short: Two men who were arrested for trespassing on property of businesses open to the public filed a lawsuit in the U.S. District Court for the Western District of Michigan against the city of Grand Rapids, its chief of police, and two individual officers. The plaintiffs claimed that the Grand Rapids Police Department's policy and practice of arresting individuals for trespass -- without probable cause and based on general Letters of Intent to Prosecute signed by Grand Rapids businesses -- results in unreasonable searches and seizures in violation of the Fourth Amendment. The parties came to a private settlement agreement for damages and attorney's fees in late 2019. The Judge dismissed the case in early 2020.


In [14]:
dev_example = devset[2]
print(f"short: {dev_example.short}")
print(f"tiny: {dev_example.tiny}")

short: Pretrial detainees file lawsuit against Middlesex County in November 2015 to ameliorate the unconstitutional conditions of solitary confinement in the Middlesex County Jail. In September 2018, the parties reached a settlement agreement that restricted the maximum amount of time allowed in isolation and provides those in isolation with opportunities to interact with others.
tiny: Pretrial detainees settled this class action against Middlesex County to provide 28 hours per week of out-of-cell time and mental health screenings to people held in solitary confinement.


After loading the raw data, we'd applied `with_inputs(' '.join(x['sources']))` to each example to tell **DSPy** that our input field in each example will be just `doc`. Any other fields are labels or metadata that are not given to the system.

### Basic zero shot prompt

tiny ~ 25 words

short ~ 130 words

long ~ 650 words

Generate a {summary_type} summary of maximum {max_tokens[summary_type]} tokens of the following text:

#### Short Summary

##### Signature

In [15]:
class ShortSummSig(dspy.Signature):
    """Generate short summaries of about 130 words."""
    # input
    doc = dspy.InputField()
    # output
    short = dspy.OutputField()

In `ShortSumm`, the docstring describes the sub-task. Each `InputField` or `OutputField` can optionally contain a description `desc` too. When it's not given, it's inferred from the field's name (e.g., `question`).

Notice that there isn't anything special about this signature in **DSPy**. We can just as easily define a signature that takes a long snippet from a PDF and outputs structured information, for instance.

Anyway, now that we have a signature, let's define and use a **Predictor**. A predictor is a module that knows how to use the LM to implement a signature. Importantly, predictors can **learn** to fit their behavior to the task!

```
dspy.Example(field1=value, field2=value2, ...)
```

In [16]:
# Define the predictor.
generate_short = dspy.Predict(ShortSummSig)

# Call the predictor on a particular input
pred = generate_short(doc=dev_example.doc)

# Print the prediction
#print(f"Doc: {dev_example.doc}")
print(f"Generated short: {pred.short}")
print('')
print(f"Ground truth short: {dev_example.short}")


[1;31mProvider List: https://docs.litellm.ai/docs/providers[0m


[1;31mProvider List: https://docs.litellm.ai/docs/providers[0m


[1;31mProvider List: https://docs.litellm.ai/docs/providers[0m



BadRequestError: litellm.BadRequestError: LLM Provider NOT provided. Pass in the LLM provider you are trying to call. You passed model=gemma2:27b
 Pass model as E.g. For 'Huggingface' inference endpoints pass in `completion(model='huggingface/starcoder',..)` Learn more: https://docs.litellm.ai/docs/providers

#### Evaluation

##### Metric

In [12]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
def eval_summ(example, pred, trace=None):
    return scorer.score(example.short.lower(), pred.short.lower())['rouge1'][2]

r1 = eval_summ

##### Program

In [13]:
class ShortSummProg(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_summ = dspy.Predict(ShortSummSig)

    def forward(self, doc):
        return self.generate_summ(doc=doc)

shortsumm = ShortSummProg()

##### Evaluate

In [14]:
evaluator = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=0)

In [15]:
evaluator(shortsumm, metric=r1)

Average Metric: 4.149412227978533 / 20  (20.7): 100%|████| 20/20 [01:06<00:00,  3.34s/it]


20.75

### Optimized few-shot with bootstrapped demonstrations

In [48]:
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8, #8
    max_labeled_demos=8, #8
    #num_candidate_programs=10,
    num_threads=8, #8
    metric=r1,
    #teacher_settings=dict(lm=gpt4T)
)

Going to sample between 1 and 8 traces per predictor.
Will attempt to bootstrap 16 candidate sets.


In [None]:
shortsumm_fewshot = bootstrap_optimizer.compile(shortsumm, trainset=trainset, valset=devset)

In [18]:
#max_bootstrapped_demos=4
#max_labeled_demos=4
21.16

#max_bootstrapped_demos=8
#max_labeled_demos=8
23.4

### MIPROv2

In [16]:
# Initialize optimizer
teleprompter = MIPROv2(
    metric=r1,
    #auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    shortsumm.deepcopy(),
    trainset=trainset,
    valset=devset,
    minibatch_size=5,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

Optimizing program with MIPRO...

==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
These will be used as few-shot example candidates for our program and for creating instructions.

Bootstrapping N=10 sets of demonstrations...
Bootstrapping set 1/10
Bootstrapping set 2/10
Bootstrapping set 3/10


  3%|█▌                                                  | 3/100 [00:15<08:19,  5.15s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 4/10


  3%|█▌                                                  | 3/100 [00:23<12:30,  7.74s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 5/10


  3%|█▌                                                  | 3/100 [00:19<10:41,  6.62s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 6/10


  1%|▌                                                   | 1/100 [00:15<24:59, 15.15s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 7/10


  1%|▌                                                   | 1/100 [00:06<10:42,  6.49s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/10


  1%|▌                                                   | 1/100 [00:08<14:22,  8.71s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/10


  2%|█                                                   | 2/100 [00:13<11:02,  6.76s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/10


  1%|▌                                                   | 1/100 [00:11<18:19, 11.11s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.

==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.

Proposing instructions...

Proposed Instructions for Predictor 0:

0: Generate short summaries of about 130 words.

1: ## PROPOSED INSTRUCTION:

You are a legal expert assistant tasked with summarizing complex legal documents for a general audience. Your goal is to create concise and engaging summaries of approximately 130 words.  

**Here's what I need from you:**

* **Clarity and Simplicity:** Use clear, concise language and avoid unnecessary legal jargon.
* **Engaging Hook:** Start with a sentence or two that grabs the reader's attention and briefly explains the document's topic. Consider using an "@" mention format to personalize the summary (e.g., "@individuals seekin

Average Metric: 4.149165274748174 / 20  (20.7): 100%|████| 20/20 [00:30<00:00,  1.50s/it]


Default program score: 20.75

==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.

== Minibatch Trial 1 / 30 ==


Average Metric: 1.401343064094742 / 5  (28.0): 100%|███████| 5/5 [00:18<00:00,  3.65s/it]


Score: 28.03 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [28.03]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 2 / 30 ==


Average Metric: 1.0618074590430882 / 5  (21.2): 100%|██████| 5/5 [00:20<00:00,  4.06s/it]


Score: 21.24 on minibatch of size 5 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [28.03, 21.24]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 3 / 30 ==


Average Metric: 1.2204817001949637 / 5  (24.4): 100%|██████| 5/5 [00:24<00:00,  4.86s/it]


Score: 24.41 on minibatch of size 5 with parameters ['Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 6'].
Minibatch scores so far: [28.03, 21.24, 24.41]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 4 / 30 ==


Average Metric: 0.9273677864618655 / 5  (18.5): 100%|██████| 5/5 [00:55<00:00, 11.20s/it]


Score: 18.55 on minibatch of size 5 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 5'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 5 / 30 ==


Average Metric: 0.8120506925572133 / 5  (16.2): 100%|██████| 5/5 [00:18<00:00,  3.69s/it]


Score: 16.24 on minibatch of size 5 with parameters ['Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 8'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 6 / 30 ==


Average Metric: 1.1770990112609483 / 5  (23.5): 100%|██████| 5/5 [00:13<00:00,  2.60s/it]


Score: 23.54 on minibatch of size 5 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 7 / 30 ==


Average Metric: 1.036908424788845 / 5  (20.7): 100%|███████| 5/5 [00:52<00:00, 10.46s/it]


Score: 20.74 on minibatch of size 5 with parameters ['Predictor 1: Instruction 9', 'Predictor 1: Few-Shot Set 5'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 8 / 30 ==


Average Metric: 1.4303765880779884 / 5  (28.6): 100%|██████| 5/5 [00:14<00:00,  2.88s/it]


Score: 28.61 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 9 / 30 ==


Average Metric: 1.2710077590105233 / 5  (25.4): 100%|██████| 5/5 [00:33<00:00,  6.78s/it]


Score: 25.42 on minibatch of size 5 with parameters ['Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 7'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42]
Full eval scores so far: [20.75]
Best full score so far: 20.75


== Minibatch Trial 10 / 30 ==


Average Metric: 0.9475080447568882 / 5  (19.0): 100%|██████| 5/5 [00:34<00:00,  6.97s/it]


Score: 18.95 on minibatch of size 5 with parameters ['Predictor 1: Instruction 9', 'Predictor 1: Few-Shot Set 7'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95]
Full eval scores so far: [20.75]
Best full score so far: 20.75


===== Full Eval 1 =====
Doing full eval on next top averaging program (Avg Score: 28.61) from minibatch trials...


Average Metric: 4.3351003110630595 / 20  (21.7): 100%|███| 20/20 [00:48<00:00,  2.41s/it]


[92mNew best full eval score![0m Score: 21.68
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 11 / 30 ==


Average Metric: 1.078304408517299 / 5  (21.6): 100%|███████| 5/5 [00:17<00:00,  3.54s/it]


Score: 21.57 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 9'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 12 / 30 ==


Average Metric: 0.9603992666113574 / 5  (19.2): 100%|██████| 5/5 [00:13<00:00,  2.66s/it]


Score: 19.21 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 13 / 30 ==


Average Metric: 1.3315784517141984 / 5  (26.6): 100%|██████| 5/5 [00:11<00:00,  2.30s/it]


Score: 26.63 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 1'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 14 / 30 ==


Average Metric: 0.8009034851804644 / 5  (16.0): 100%|██████| 5/5 [00:18<00:00,  3.77s/it]


Score: 16.02 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 15 / 30 ==


Average Metric: 1.1713741409924427 / 5  (23.4): 100%|██████| 5/5 [00:12<00:00,  2.52s/it]


Score: 23.43 on minibatch of size 5 with parameters ['Predictor 1: Instruction 5', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 16 / 30 ==


Average Metric: 1.1144951490546016 / 5  (22.3): 100%|██████| 5/5 [00:14<00:00,  2.87s/it]


Score: 22.29 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 17 / 30 ==


Average Metric: 1.1445684750032576 / 5  (22.9): 100%|██████| 5/5 [00:06<00:00,  1.25s/it]


Score: 22.89 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 0'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 18 / 30 ==


Average Metric: 1.0020107023441176 / 5  (20.0): 100%|██████| 5/5 [00:16<00:00,  3.27s/it]


Score: 20.04 on minibatch of size 5 with parameters ['Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 19 / 30 ==


Average Metric: 0.9019322857176402 / 5  (18.0): 100%|██████| 5/5 [00:16<00:00,  3.39s/it]


Score: 18.04 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


== Minibatch Trial 20 / 30 ==


Average Metric: 0.8792427930270362 / 5  (17.6): 100%|██████| 5/5 [00:12<00:00,  2.58s/it]


Score: 17.58 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 3'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58]
Full eval scores so far: [20.75, 21.68]
Best full score so far: 21.68


===== Full Eval 2 =====
Doing full eval on next top averaging program (Avg Score: 26.63) from minibatch trials...


Average Metric: 4.171095713200966 / 20  (20.9): 100%|████| 20/20 [00:40<00:00,  2.00s/it]


Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 21 / 30 ==


Average Metric: 0.9461898125904971 / 5  (18.9): 100%|██████| 5/5 [00:16<00:00,  3.24s/it]


Score: 18.92 on minibatch of size 5 with parameters ['Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 22 / 30 ==


Average Metric: 1.033247283900634 / 5  (20.7): 100%|███████| 5/5 [00:14<00:00,  2.82s/it]


Score: 20.66 on minibatch of size 5 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 1'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 23 / 30 ==


Average Metric: 1.1217776067604932 / 5  (22.4): 100%|██████| 5/5 [00:19<00:00,  3.95s/it]


Score: 22.44 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 8'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 24 / 30 ==


Average Metric: 1.325583187498081 / 5  (26.5): 100%|███████| 5/5 [00:10<00:00,  2.16s/it]


Score: 26.51 on minibatch of size 5 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 1'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 25 / 30 ==


Average Metric: 0.8971821927711912 / 5  (17.9): 100%|██████| 5/5 [00:09<00:00,  1.94s/it]


Score: 17.94 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 1'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51, 17.94]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 26 / 30 ==


Average Metric: 0.8583115775170115 / 5  (17.2): 100%|██████| 5/5 [00:11<00:00,  2.24s/it]


Score: 17.17 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 1'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51, 17.94, 17.17]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 27 / 30 ==


Average Metric: 1.18448702135853 / 5  (23.7): 100%|████████| 5/5 [00:50<00:00, 10.09s/it]


Score: 23.69 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 5'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51, 17.94, 17.17, 23.69]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 28 / 30 ==


Average Metric: 1.1725157937100747 / 5  (23.5): 100%|██████| 5/5 [00:20<00:00,  4.02s/it]


Score: 23.45 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 6'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51, 17.94, 17.17, 23.69, 23.45]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 29 / 30 ==


Average Metric: 0.7327754782566983 / 5  (14.7): 100%|██████| 5/5 [00:14<00:00,  2.88s/it]


Score: 14.66 on minibatch of size 5 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 1'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51, 17.94, 17.17, 23.69, 23.45, 14.66]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


== Minibatch Trial 30 / 30 ==


Average Metric: 1.0139890543500827 / 5  (20.3): 100%|██████| 5/5 [00:16<00:00,  3.33s/it]


Score: 20.28 on minibatch of size 5 with parameters ['Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [28.03, 21.24, 24.41, 18.55, 16.24, 23.54, 20.74, 28.61, 25.42, 18.95, 21.57, 19.21, 26.63, 16.02, 23.43, 22.29, 22.89, 20.04, 18.04, 17.58, 18.92, 20.66, 22.44, 26.51, 17.94, 17.17, 23.69, 23.45, 14.66, 20.28]
Full eval scores so far: [20.75, 21.68, 20.86]
Best full score so far: 21.68


===== Full Eval 3 =====
Doing full eval on next top averaging program (Avg Score: 26.51) from minibatch trials...


Average Metric: 4.319258893787406 / 20  (21.6): 100%|████| 20/20 [00:40<00:00,  2.00s/it]

Full eval scores so far: [20.75, 21.68, 20.86, 21.6]
Best full score so far: 21.68


Returning best identified program with score 21.68!





In [None]:
# Save optimize program for future use
optimized_program.save(f"mipro_optimized")