<img src="docs/images/DSPy8.png" alt="DSPy7 Image" height="150"/>

## **DSPy**: Programming with Foundation Models

[<img align="center" src="https://colab.research.google.com/assets/colab-badge.svg" />](https://colab.research.google.com/github/stanfordnlp/dspy/blob/main/intro.ipynb)

This notebook introduces the **DSPy** framework for **Programming with Foundation Models**, i.e., language models (LMs) and retrieval models (RMs).

**DSPy** emphasizes programming over prompting. It unifies techniques for **prompting** and **fine-tuning** LMs as well as improving them with **reasoning** and **tool/retrieval augmentation**, all expressed through a _minimalistic set of Pythonic operations that compose and learn_.

**DSPy** provides **composable and declarative modules** for instructing LMs in a familiar Pythonic syntax. On top of that, **DSPy** introduces an **automatic compiler that teaches LMs** how to conduct the declarative steps in your program. The **DSPy compiler** will internally _trace_ your program and then **craft high-quality prompts for large LMs (or train automatic finetunes for small LMs)** to teach them the steps of your task.

### 0] Setting Up

As we'll start to see below, **DSPy** can routinely teach powerful models like `GPT-3.5` and local models like `T5-base` or `Llama2-13b` to be much more reliable at complex tasks. **DSPy** will compile the _same program_ into different few-shot prompts and/or finetunes for each LM.

In [2]:
import sys
import os
import dspy
from dspy.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, MIPROv2
from rouge_score import rouge_scorer

In [3]:
model_name = 'gemma2:27b'
#model_name = 'qwen2.5:72b'

In [4]:
# does not work right now
#ollama_port = 11434 
#ollama_url = f"http://localhost:{ollama_port}"
#lm = dspy.LM(model=model_name, api_base=ollama_url)
#dspy.settings.configure(lm=lm)

In [5]:
lm = dspy.OllamaLocal(model=model_name)
dspy.settings.configure(lm=lm)

You can build your own **DSPy programs** for various tasks, e.g., question answering, information extraction, or text-to-SQL.

Whatever the task, the general workflow is:

1. **Collect a little bit of data.** Define examples of the inputs and outputs of your program (e.g., questions and their answers). This could just be a handful of quick examples you wrote down. If large datasets exist, the more the merrier!
1. **Write your program.** Define the modules (i.e., sub-tasks) of your program and the way they should interact together to solve your task.
1. **Define some validation logic.** What makes for a good run of your program? Maybe the answers need to have a certain length or stick to a particular format? Specify the logic that checks that.
1. **Compile!** Ask **DSPy** to _compile_ your program using your data. The compiler will use your data and validation logic to optimize your program (e.g., prompts and modules) so it's efficient and effective!
1. **Iterate.** Repeat the process by improving your data, program, validation, or by using more advanced features of the **DSPy** compiler.

### Data

In [6]:
from datasets import load_dataset

multi_lexsum = load_dataset("allenai/multi_lexsum", name="v20230518")
print(multi_lexsum)

DatasetDict({
    train: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 3177
    })
    validation: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 454
    })
    test: Dataset({
        features: ['id', 'sources', 'sources_metadata', 'summary/long', 'summary/short', 'summary/tiny', 'case_metadata'],
        num_rows: 908
    })
})


The core data type for data in DSPy is Example. You will use Examples to represent items in your training set and test set. DSPy Examples are similar to Python dicts but have a few useful utilities. Your DSPy modules will return values of the type Prediction, which is a special sub-class of Example. When you use DSPy, you will do a lot of evaluation and optimization runs. Your individual datapoints will be of type Example.

Select 100 train examples and 20 dev examples where all 3 summaries are present

In [7]:
trainset = multi_lexsum['train'].filter(
    lambda x: x['summary/tiny'] is not None and x['summary/short'] is not None and x['summary/long'] is not None
).select(range(100))

devset = multi_lexsum['validation'].filter(
    lambda x: x['summary/tiny'] is not None and x['summary/short'] is not None and x['summary/long'] is not None
).select(range(20))

#### Create Examples

In [8]:
def join_sources(x):
    x['sources'] = ' '.join(x['sources'])
    return x

trainset = trainset.map(join_sources, batched=False)
devset = devset.map(join_sources, batched=False)

In [9]:
# Tell DSPy that the joined 'sources' field is the input. Any other fields are labels and/or metadata
trainset = [dspy.Example(
    doc=x['sources'],
    long=x['summary/long'],
    short=x['summary/short'],
    tiny=x['summary/tiny']
).with_inputs('doc') for x in trainset]

devset = [dspy.Example(
    doc=' '.join(x['sources']),
    long=x['summary/long'],
    short=x['summary/short'],
    tiny=x['summary/tiny']
).with_inputs('doc') for x in devset]

# ['doc', 'long', 'short', 'tiny']
len(trainset), len(devset)

(100, 20)

**DSPy** typically requires very minimal labeling. Whereas your pipeline may involve six or seven complex steps, you only need labels for the initial question and the final answer. **DSPy** will bootstrap any intermediate labels needed to support your pipeline. If you change your pipeline in any way, the data bootstrapped will change accordingly!

Now, let's look at some data examples.

In [10]:
train_example = trainset[1]
print(f"short: {train_example.short}")
print(f"tiny: {train_example.tiny}")

short: Two men who were arrested for trespassing on property of businesses open to the public filed a lawsuit in the U.S. District Court for the Western District of Michigan against the city of Grand Rapids, its chief of police, and two individual officers. The plaintiffs claimed that the Grand Rapids Police Department's policy and practice of arresting individuals for trespass -- without probable cause and based on general Letters of Intent to Prosecute signed by Grand Rapids businesses -- results in unreasonable searches and seizures in violation of the Fourth Amendment. The parties came to a private settlement agreement for damages and attorney's fees in late 2019. The Judge dismissed the case in early 2020.


In [11]:
dev_example = devset[2]
print(f"short: {dev_example.short}")
print(f"tiny: {dev_example.tiny}")

short: Pretrial detainees file lawsuit against Middlesex County in November 2015 to ameliorate the unconstitutional conditions of solitary confinement in the Middlesex County Jail. In September 2018, the parties reached a settlement agreement that restricted the maximum amount of time allowed in isolation and provides those in isolation with opportunities to interact with others.
tiny: Pretrial detainees settled this class action against Middlesex County to provide 28 hours per week of out-of-cell time and mental health screenings to people held in solitary confinement.


After loading the raw data, we'd applied `with_inputs(' '.join(x['sources']))` to each example to tell **DSPy** that our input field in each example will be just `doc`. Any other fields are labels or metadata that are not given to the system.

### Basic zero shot prompt

tiny ~ 25 words

short ~ 130 words

long ~ 650 words

Generate a {summary_type} summary of maximum {max_tokens[summary_type]} tokens of the following text:

#### Short Summary

##### Signature

In [12]:
class ShortSummSig(dspy.Signature):
    """Generate short summaries of about 130 words."""
    # input
    doc = dspy.InputField()
    # output
    short = dspy.OutputField()

In `ShortSumm`, the docstring describes the sub-task. Each `InputField` or `OutputField` can optionally contain a description `desc` too. When it's not given, it's inferred from the field's name (e.g., `question`).

Notice that there isn't anything special about this signature in **DSPy**. We can just as easily define a signature that takes a long snippet from a PDF and outputs structured information, for instance.

Anyway, now that we have a signature, let's define and use a **Predictor**. A predictor is a module that knows how to use the LM to implement a signature. Importantly, predictors can **learn** to fit their behavior to the task!

```
dspy.Example(field1=value, field2=value2, ...)
```

In [13]:
# Define the predictor.
generate_short = dspy.Predict(ShortSummSig)

# Call the predictor on a particular input
pred = generate_short(doc=dev_example.doc)

# Print the prediction
#print(f"Doc: {dev_example.doc}")
print(f"Generated short: {pred.short}")
print('')
print(f"Ground truth short: {dev_example.short}")

 		You are using the client OllamaLocal, which will be removed in DSPy 2.6.
 		Changing the client is straightforward and will let you use new features (Adapters) that improve the consistency of LM outputs, especially when using chat LMs. 

 		Learn more about the changes and how to migrate at
 		https://github.com/stanfordnlp/dspy/blob/main/examples/migration.ipynb


Generated short: This appears to be a legal document excerpt detailing court activity. Here's a breakdown:

* **Dismissal:** The case (likely a civil lawsuit, given the "cv" designation) was dismissed with prejudice against Middlesex County. This means the plaintiff cannot refile the same claim.
* **Stipulation:**  The dismissal was agreed upon by both parties (the plaintiff and Middlesex County).
* **Dates:** Key dates include:
    * 10/24/2018: Judge Peter G. Sheridan signed the dismissal order.
    * 10/25/2018: The dismissal was officially entered into the court record.
* **PACER Transaction:** This section indicates

Ground truth short: Pretrial detainees file lawsuit against Middlesex County in November 2015 to ameliorate the unconstitutional conditions of solitary confinement in the Middlesex County Jail. In September 2018, the parties reached a settlement agreement that restricted the maximum amount of time allowed in isolation and provides those in isolation with opportunities

#### Long Summary

In [29]:
class LongSummSig(dspy.Signature):
    """Generate long summaries."""
    # input
    doc = dspy.InputField()
    # output
    long = dspy.OutputField()

In [30]:
# Define the predictor.
generate_long = dspy.Predict(LongSummSig)

# Call the predictor on a particular input
pred = generate_long(doc=dev_example.doc)

# Print the prediction
#print(f"Doc: {dev_example.doc}")
print(f"Generated long: {pred.long}")
print('')
print(f"Ground truth long: {dev_example.long}")

Generated long: This text appears to be a legal document receipt from the PACER (Public Access to Court Electronic Records) system. Here's a breakdown of what it likely means:

* **Case Information:** The document refers to a case with docket number "3:15-cv-07920". This suggests a civil case filed in the United States District Court for the District of New Jersey (the "3" indicates the judicial district).
* **Stipulation of Dismissal:**  The text mentions a "STIPULATION of Dismissal with Prejudice as to Defendant, Middlesex County". This means that the plaintiff(s) and defendant(s), including Middlesex County, agreed to dismiss the case permanently. "With prejudice"

Ground truth long: On November 5, 2015, the plaintiffs, nine pretrial detainees, filed this class action in the United States District Court of New Jersey. The plaintiffs sued Middlesex County under 42 U.S.C. § 1983 for the deprivation of rights secured by the Eighth and Fourteenth Amendments to the United States Constitu

#### Evaluation

##### Metric

In [31]:
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

def eval_summ_short(example, pred, trace=None):
    return scorer.score(example.short.lower(), pred.short.lower())['rouge1'][2]

def eval_summ_long(example, pred, trace=None):
    return scorer.score(example.long.lower(), pred.long.lower())['rouge1'][2]

r1_short = eval_summ_short
r1_long = eval_summ_long

##### Program

In [32]:
class ShortSummProg(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_summ = dspy.Predict(ShortSummSig)

    def forward(self, doc):
        return self.generate_summ(doc=doc)

class LongSummProg(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_summ = dspy.Predict(LongSummSig)

    def forward(self, doc):
        return self.generate_summ(doc=doc)
        
shortsumm = ShortSummProg()
longsumm = LongSummProg()

##### Evaluate

In [20]:
evaluator = Evaluate(devset=devset, num_threads=1, display_progress=True, display_table=0)

In [22]:
evaluator(shortsumm, metric=r1_short)

Average Metric: 4.149412227978533 / 20  (20.7): 100%|█| 20/20 [01:07<00:00,  3


20.75

In [26]:
evaluator(longsumm, metric=r1_long)

Average Metric: 3.0440304603629977 / 20  (15.2): 100%|█| 20/20 [01:24<00:00,  


15.22

### Optimized few-shot with bootstrapped demonstrations

In [33]:
bootstrap_optimizer = BootstrapFewShotWithRandomSearch(
    max_bootstrapped_demos=8, #8
    max_labeled_demos=8, #8
    num_candidate_programs=10,
    num_threads=8, #8
    metric=r1_long,
    #teacher_settings=dict(lm=gpt4T)
)

Going to sample between 1 and 8 traces per predictor.
Will attempt to bootstrap 10 candidate sets.


In [34]:
longsumm_fewshot = bootstrap_optimizer.compile(longsumm, trainset=trainset, valset=devset)

Average Metric: 2.9462854710706337 / 20  (14.7): 100%|█| 20/20 [00:34<00:00,  


New best score: 14.73 for seed -3
Scores so far: [14.73]
Best score so far: 14.73


Average Metric: 3.0062489253735536 / 20  (15.0): 100%|█| 20/20 [01:42<00:00,  


New best score: 15.03 for seed -2
Scores so far: [14.73, 15.03]
Best score so far: 15.03


  8%|███▎                                     | 8/100 [01:17<14:56,  9.75s/it]


Bootstrapped 8 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.


Average Metric: 2.939228360813264 / 20  (14.7): 100%|█| 20/20 [01:09<00:00,  3


Scores so far: [14.73, 15.03, 14.7]
Best score so far: 15.03


  7%|██▊                                      | 7/100 [02:06<27:59, 18.06s/it]


Bootstrapped 7 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.


Average Metric: 2.9161932791524565 / 20  (14.6): 100%|█| 20/20 [01:30<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58]
Best score so far: 15.03


  3%|█▏                                       | 3/100 [00:27<14:51,  9.19s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.


Average Metric: 2.984430772350161 / 20  (14.9): 100%|█| 20/20 [01:14<00:00,  3


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92]
Best score so far: 15.03


  1%|▍                                        | 1/100 [00:08<13:51,  8.40s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.


Average Metric: 3.0069943859958026 / 20  (15.0): 100%|█| 20/20 [01:18<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03]
Best score so far: 15.03


  4%|█▋                                       | 4/100 [01:19<31:56, 19.97s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


Average Metric: 2.9629981400790464 / 20  (14.8): 100%|█| 20/20 [01:48<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81]
Best score so far: 15.03


  4%|█▋                                       | 4/100 [00:29<11:44,  7.34s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


Average Metric: 2.9599011735617404 / 20  (14.8): 100%|█| 20/20 [01:19<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81, 14.8]
Best score so far: 15.03


  5%|██                                       | 5/100 [00:33<10:34,  6.68s/it]


Bootstrapped 5 full traces after 5 examples for up to 1 rounds, amounting to 5 attempts.


Average Metric: 2.9637034399822224 / 20  (14.8): 100%|█| 20/20 [01:43<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81, 14.8, 14.82]
Best score so far: 15.03


  2%|▊                                        | 2/100 [00:19<15:36,  9.55s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.


Average Metric: 2.9963425914776947 / 20  (15.0): 100%|█| 20/20 [01:11<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81, 14.8, 14.82, 14.98]
Best score so far: 15.03


  6%|██▍                                      | 6/100 [01:54<29:47, 19.02s/it]


Bootstrapped 6 full traces after 6 examples for up to 1 rounds, amounting to 6 attempts.


Average Metric: 2.955964425958598 / 20  (14.8): 100%|█| 20/20 [01:53<00:00,  5


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81, 14.8, 14.82, 14.98, 14.78]
Best score so far: 15.03


  4%|█▋                                       | 4/100 [00:59<23:57, 14.97s/it]


Bootstrapped 4 full traces after 4 examples for up to 1 rounds, amounting to 4 attempts.


Average Metric: 2.9649016961062844 / 20  (14.8): 100%|█| 20/20 [01:29<00:00,  


Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81, 14.8, 14.82, 14.98, 14.78, 14.82]
Best score so far: 15.03


  8%|███▎                                     | 8/100 [01:18<14:58,  9.76s/it]


Bootstrapped 8 full traces after 8 examples for up to 1 rounds, amounting to 8 attempts.


Average Metric: 2.979916205066145 / 20  (14.9): 100%|█| 20/20 [01:44<00:00,  5

Scores so far: [14.73, 15.03, 14.7, 14.58, 14.92, 15.03, 14.81, 14.8, 14.82, 14.98, 14.78, 14.82, 14.9]
Best score so far: 15.03
13 candidate programs found.





In [18]:
# short
#max_bootstrapped_demos=4
#max_labeled_demos=4
21.16

#max_bootstrapped_demos=8
#max_labeled_demos=8
23.4

# long


### MIPROv2

In [None]:
# Initialize optimizer
teleprompter = MIPROv2(
    metric=r1_long,
    #auto="light", # Can choose between light, medium, and heavy optimization runs
)

# Optimize program
print(f"Optimizing program with MIPRO...")
optimized_program = teleprompter.compile(
    longsumm.deepcopy(),
    trainset=trainset,
    valset=devset,
    minibatch_size=5,
    max_bootstrapped_demos=3,
    max_labeled_demos=4,
    requires_permission_to_run=False,
)

Optimizing program with MIPRO...

==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==
These will be used as few-shot example candidates for our program and for creating instructions.

Bootstrapping N=10 sets of demonstrations...
Bootstrapping set 1/10
Bootstrapping set 2/10
Bootstrapping set 3/10


  3%|█▏                                       | 3/100 [00:15<08:20,  5.16s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 4/10


  3%|█▏                                       | 3/100 [00:25<13:34,  8.40s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 5/10


  3%|█▏                                       | 3/100 [00:20<11:15,  6.96s/it]


Bootstrapped 3 full traces after 3 examples for up to 1 rounds, amounting to 3 attempts.
Bootstrapping set 6/10


  1%|▍                                        | 1/100 [00:15<25:21, 15.37s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 7/10


  1%|▍                                        | 1/100 [00:06<11:10,  6.78s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 8/10


  1%|▍                                        | 1/100 [00:10<16:43, 10.13s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 9/10


  2%|▊                                        | 2/100 [00:14<11:27,  7.02s/it]


Bootstrapped 2 full traces after 2 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 10/10


  1%|▍                                        | 1/100 [00:11<18:55, 11.47s/it]


Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.

==> STEP 2: PROPOSE INSTRUCTION CANDIDATES <==
We will use the few-shot examples from the previous step, a generated dataset summary, a summary of the program code, and a randomly selected prompting tip to propose instructions.

Proposing instructions...

Proposed Instructions for Predictor 0:

0: Generate long summaries.

1: ## PROPOSED INSTRUCTION:

You are a legal expert AI assistant tasked with summarizing complex legal cases for easy understanding.  Given a text document describing a legal case, provide a comprehensive long summary that includes the following key elements:

* **Parties involved:** Clearly identify the plaintiff(s) and defendant(s), including their roles and affiliations (e.g., incarcerated individual, correctional authorities, legal representation).
* **Central issue:** Concisely state the main legal point of contention in the case. For example, is it about housing rights for

Average Metric: 3.0193271000668607 / 20  (15.1): 100%|█| 20/20 [00:33<00:00,  


Default program score: 15.1

==> STEP 3: FINDING OPTIMAL PROMPT PARAMETERS <==
We will evaluate the program over a series of trials with different combinations of instructions and few-shot examples to find the optimal combination using Bayesian Optimization.

== Minibatch Trial 1 / 30 ==


Average Metric: 1.0443032200959197 / 5  (20.9): 100%|█| 5/5 [00:19<00:00,  3.8


Score: 20.89 on minibatch of size 5 with parameters ['Predictor 1: Instruction 1', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [20.89]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 2 / 30 ==


Average Metric: 0.612327645044234 / 5  (12.2): 100%|█| 5/5 [00:21<00:00,  4.29


Score: 12.25 on minibatch of size 5 with parameters ['Predictor 1: Instruction 6', 'Predictor 1: Few-Shot Set 2'].
Minibatch scores so far: [20.89, 12.25]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 3 / 30 ==


Average Metric: 0.9294036364596564 / 5  (18.6): 100%|█| 5/5 [00:24<00:00,  4.9


Score: 18.59 on minibatch of size 5 with parameters ['Predictor 1: Instruction 8', 'Predictor 1: Few-Shot Set 6'].
Minibatch scores so far: [20.89, 12.25, 18.59]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 4 / 30 ==


Average Metric: 0.6504209438700116 / 5  (13.0): 100%|█| 5/5 [00:56<00:00, 11.2


Score: 13.01 on minibatch of size 5 with parameters ['Predictor 1: Instruction 4', 'Predictor 1: Few-Shot Set 5'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 5 / 30 ==


Average Metric: 0.650393962959553 / 5  (13.0): 100%|█| 5/5 [00:21<00:00,  4.22


Score: 13.01 on minibatch of size 5 with parameters ['Predictor 1: Instruction 3', 'Predictor 1: Few-Shot Set 8'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01, 13.01]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 6 / 30 ==


Average Metric: 0.7948498106773054 / 5  (15.9): 100%|█| 5/5 [00:15<00:00,  3.1


Score: 15.9 on minibatch of size 5 with parameters ['Predictor 1: Instruction 2', 'Predictor 1: Few-Shot Set 3'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01, 13.01, 15.9]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 7 / 30 ==


Average Metric: 0.8350328634684057 / 5  (16.7): 100%|█| 5/5 [00:54<00:00, 10.9


Score: 16.7 on minibatch of size 5 with parameters ['Predictor 1: Instruction 9', 'Predictor 1: Few-Shot Set 5'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01, 13.01, 15.9, 16.7]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 8 / 30 ==


Average Metric: 0.7558863745657208 / 5  (15.1): 100%|█| 5/5 [00:17<00:00,  3.4


Score: 15.12 on minibatch of size 5 with parameters ['Predictor 1: Instruction 7', 'Predictor 1: Few-Shot Set 4'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01, 13.01, 15.9, 16.7, 15.12]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 9 / 30 ==


Average Metric: 0.7378720097512407 / 5  (14.8): 100%|█| 5/5 [00:36<00:00,  7.3


Score: 14.76 on minibatch of size 5 with parameters ['Predictor 1: Instruction 0', 'Predictor 1: Few-Shot Set 7'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01, 13.01, 15.9, 16.7, 15.12, 14.76]
Full eval scores so far: [15.1]
Best full score so far: 15.1


== Minibatch Trial 10 / 30 ==


Average Metric: 0.6944680231977017 / 5  (13.9): 100%|█| 5/5 [00:37<00:00,  7.5


Score: 13.89 on minibatch of size 5 with parameters ['Predictor 1: Instruction 9', 'Predictor 1: Few-Shot Set 7'].
Minibatch scores so far: [20.89, 12.25, 18.59, 13.01, 13.01, 15.9, 16.7, 15.12, 14.76, 13.89]
Full eval scores so far: [15.1]
Best full score so far: 15.1


===== Full Eval 1 =====
Doing full eval on next top averaging program (Avg Score: 20.89) from minibatch trials...


Average Metric: 2.5339986207297494 / 16  (15.8):  80%|▊| 16/20 [00:50<00:08,  

In [None]:
# Save optimize program for future use
#optimized_program.save(f"mipro_optimized")