<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/DSPy_Prompt_and_Few_Shots_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# DSPy Optimization

This Colab is the companion to the DSPy Optimization Playground.  [DSPy](https://dspy.ai/) is a toolkit for optimizing an application's system prompt.  It does this by evaluating how different prompts measure up against a metric you choose.

This notebook sets you up with a **Pi Scoring System** connected to DSPy, giving you an improved prompt to play with.

This Colab continues with the `Aesop AI` example and its set of example inputs, but any scoring spec and dataset will do.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a `WITHPI_API_KEY` from https://play.withpi.ai.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install packages and load the SDK.

In [2]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()


## Load Scorer and Dataset

Load the `Aesop AI` scoring spec and example set from the Pi Labs cookbooks, or edit the cells below to load different ones.

This is using Hugging Face datasets, so any published dataset you can access should work.


In [None]:
# @title Load Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

aesop_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/aesop_ai.json"
)

display_scoring_spec(aesop_scoring_spec)

In [4]:
# @title Load dataset
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")

print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})
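
The dataset has a single `train` split of 23 rows. If you would like some rows the optimizer never sees, one option is to set a few aside before starting the job. The following is a minimal sketch using plain Python lists as stand-ins for the dataset rows; `split_rows` is a hypothetical helper, not part of the Pi SDK, and the holdout size of 5 is an arbitrary choice.

```python
import random


def split_rows(rows, holdout=5, seed=0):
    """Shuffle a copy of rows and set aside `holdout` of them for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[holdout:], rows[:holdout]


# Stand-in rows with the same 'input'/'output' keys as the dataset above.
rows = [{"input": f"prompt {i}", "output": f"story {i}"} for i in range(23)]
train_rows, eval_rows = split_rows(rows)

# Same payload shape that the optimization call in this notebook builds
# for its `examples` argument.
examples = [{"llm_input": r["input"], "llm_output": r["output"]} for r in train_rows]
print(len(examples), len(eval_rows))
```

With the real dataset you would pass `aesop_dataset` in place of `rows`, since iterating it yields dictionaries with the same keys.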


## Optimize your prompt

Kick off a prompt optimization run.  It runs in the background and takes on the order of **10 minutes**.

In [10]:
prompt_optimization_status = client.prompt.optimize.start_job(
    scoring_spec=aesop_scoring_spec,
    initial_system_instruction=aesop_scoring_spec.description,
    examples=[{"llm_input": row["input"], "llm_output": row["output"]} for row in aesop_dataset],
    model_id="gpt-4o-mini",
    tuning_algorithm="DSPY",
    dspy_optimization_type="MIPROv2",
)


## Stream the messages as the prompt is optimized

The messages describe what each optimization step is doing while the job runs.

In [11]:
from withpi_utils import stream

for line in stream(client.prompt.optimize, prompt_optimization_status):
  print(line)

LAUNCHING
RUNNING
2025/03/12 00:03:04 INFO dspy.teleprompt.mipro_optimizer_v2: 
RUNNING WITH THE FOLLOWING LIGHT AUTO RUN SETTINGS:
num_trials: 7
minibatch: False
num_candidates: 5
valset size: 18


2025/03/12 00:03:04 INFO dspy.teleprompt.mipro_optimizer_v2: 
==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==

2025/03/12 00:03:04 INFO dspy.teleprompt.mipro_optimizer_v2: These will be used as few-shot example candidates for our program and for creating instructions.


2025/03/12 00:03:04 INFO dspy.teleprompt.mipro_optimizer_v2: Bootstrapping N=5 sets of demonstrations...

Bootstrapping set 1/5
Bootstrapping set 2/5
Bootstrapping set 3/5

  0%|          | 0/5 [00:00<?, ?it/s]
2025/03/12 00:03:14 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example Example({'input': "Dream up a story involving a hummingbird and a stagnant pond teaching 'Every effort counts, no matter how small.'", 'response': 'The hummingbird, Pip, zipped and zoomed through the sun-drenched meadow.  His tiny win
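
The log is verbose, so it can help to print only the job status, the step banners, and any errors. Here is a small sketch of such a filter. It assumes the stream yields text (one or more lines per item), and `highlights` is a hypothetical helper; the sample lines are copied from the log above, with the error record shortened.

```python
import re

# Matches step banners such as "==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==".
STEP_BANNER = re.compile(r"==> STEP \d+: .* <==")
STATUS_WORDS = {"LAUNCHING", "RUNNING", "DONE"}


def highlights(chunks):
    """Yield only status words, step banners, and ERROR records."""
    for chunk in chunks:
        # A chunk may hold several lines, so split before filtering.
        for line in str(chunk).splitlines():
            text = line.strip()
            if text in STATUS_WORDS or STEP_BANNER.search(text) or " ERROR " in text:
                yield text


sample = [
    "LAUNCHING",
    "RUNNING",
    "num_trials: 7",
    "==> STEP 1: BOOTSTRAP FEWSHOT EXAMPLES <==",
    "Bootstrapping set 1/5",
    "2025/03/12 00:03:14 ERROR dspy.teleprompt.bootstrap: Failed to run or to evaluate example",
]
for line in highlights(sample):
    print(line)
```

In the notebook you could wrap the streaming call instead of the sample list, for example `highlights(stream(client.prompt.optimize, prompt_optimization_status))`.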

## Check out the optimized prompt

In [12]:
import json

prompt_optimization_status = client.prompt.optimize.retrieve(prompt_optimization_status.job_id)
optimized_prompt = json.dumps(prompt_optimization_status.optimized_prompt_messages, indent=2)
print(optimized_prompt)


[
  {
    "content": "Your input fields are:\n1. `input` (str): The input to the AI application\n\nYour output fields are:\n1. `response` (str): The response from the AI application\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\n[[ ## input ## ]]\n{input}\n\n[[ ## response ## ]]\n{response}\n\n[[ ## completed ## ]]\n\nIn adhering to this structure, your objective is: \n        Write a children's story in the style of Aesop's Fables teaching a life lesson specified by the user.  Provide just the story with no extra content.",
    "role": "system"
  },
  {
    "content": "[[ ## input ## ]]\nDream up a story involving a hummingbird and a stagnant pond teaching 'Every effort counts, no matter how small.'\n\nRespond with the corresponding output fields, starting with the field `[[ ## response ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.",
    "role": "user"
  },
  {
    "content": "[[ ## response ## ]]\nPip the 

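The optimized prompt is an ordinary list of chat messages: a system message followed by few-shot demonstrations. A quick way to see what the optimizer chose is to count the demonstrations and pull out the system prompt. This sketch uses a shortened stand-in for the list printed above. It assumes demonstration answers carry the `assistant` role (the output above is cut off before that field appears) and that the last user message holds a `{{ input }}` placeholder; `summarize_messages` is a hypothetical helper.

```python
def summarize_messages(messages):
    """Return the system prompt and the number of few-shot demonstrations."""
    system = next((m["content"] for m in messages if m["role"] == "system"), "")
    # Assumption: each demonstration is a user turn answered by an assistant turn.
    demos = sum(1 for m in messages if m["role"] == "assistant")
    return system, demos


# Shortened stand-in; the "..." marks text left out of this example.
messages = [
    {"role": "system", "content": "Your input fields are: ..."},
    {"role": "user", "content": "[[ ## input ## ]]\nDream up a story ..."},
    {"role": "assistant", "content": "[[ ## response ## ]]\nPip the ..."},
    {"role": "user", "content": "[[ ## input ## ]]\n{{ input }}"},
]

system, demos = summarize_messages(messages)
print(f"{demos} demonstration(s); system prompt has {len(system)} characters")
```

In the notebook you would pass `prompt_optimization_status.optimized_prompt_messages` instead of the stand-in list.
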
## Save the new system prompt template

It's convenient to stash this template for use later.

In [13]:
from google.colab import files
from pathlib import Path

filename = 'aesop_ai_dspy_prompt.json.jinja'
Path(filename).write_text(optimized_prompt)
files.download(filename)


## (if resuming) Load system prompt

If you don't want to wait for the optimization job, load the pre-optimized prompt from the cookbook repository.

In [14]:
import httpx

optimized_prompt = httpx.get("https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/prompts/aesop_ai_dspy_prompt.json.jinja").text

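## Run inference on sample rows

DSPy emits a Jinja2-style template, so inference requires some template substitution.  Let's compare results before and after on the first five rows of the dataset.  The cell calls Gemini through LiteLLM, so add a `GOOGLE_API_KEY` to your notebook secrets first.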
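
Two details of that substitution are easy to trip over: the rendered template must still parse as JSON, and the model may not echo both output markers. The sketch below shows one way to handle both with the standard library only. It is an illustration under assumptions: the `{{ input }}` placeholder and the one-message template are made up for the example, `str.replace` stands in for Jinja2 rendering, and `json_escape` and `extract_response` are hypothetical helpers.

```python
import json
import re


def json_escape(text):
    """Escape text so it can sit inside a JSON string literal."""
    return json.dumps(text)[1:-1]


# Capture everything after the response marker, up to the completed
# marker if the model emitted one, otherwise up to the end of the text.
RESPONSE = re.compile(
    r"\[\[ ## response ## \]\](.*?)(?:\[\[ ## completed ## \]\]|\Z)", re.DOTALL
)


def extract_response(raw):
    """Return the text after the response marker, or the raw text if absent."""
    match = RESPONSE.search(raw)
    return (match.group(1) if match else raw).strip()


# Hypothetical one-message template; the real one comes from the optimizer.
template = '[{"role": "user", "content": "[[ ## input ## ]]\\n{{ input }}"}]'
user_input = 'Tell a fable about "honesty"\nin two lines.'

# Without escaping, the quotes and newline would make json.loads fail.
messages = json.loads(template.replace("{{ input }}", json_escape(user_input)))

raw = "[[ ## response ## ]]\nA short fable.\n\n[[ ## completed ## ]]"
print(messages[0]["content"])
print(extract_response(raw))
```

With Jinja2 the same idea would be to pass `json_escape(row['input'])` as the `input` variable when rendering.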

In [18]:
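import json  # needed here if you skipped the optimization cells and resumed
import os
import re
import litellm
import jinja2
from withpi_utils.colab import pretty_print_responses

# LiteLLM reads GEMINI_API_KEY; this expects a GOOGLE_API_KEY notebook secret.
os.environ["GEMINI_API_KEY"] = userdata.get('GOOGLE_API_KEY')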

def generate(system: str, user: str, model: str) -> str:
    """generate passes the provided system and user prompts into the given model
    via LiteLLM"""
    messages = [
        {"content": system, "role": "system"},
        {"content": user, "role": "user"},
    ]
    return litellm.completion(model=model, messages=messages).choices[0].message.content

prompt_template = jinja2.Template(optimized_prompt)
result_extractor = re.compile(
    r".*\[\[ ## response ## \]\](.*)\[\[ ## completed ## \]\]", re.DOTALL
)

for i in range(5):
  row = aesop_dataset[i]

  original_output = generate(system=aesop_scoring_spec.description,
                             user=row['input'],
                             model="gemini/gemini-2.0-flash")
  raw_dspy_output = litellm.completion(
      model="gemini/gemini-2.0-flash",
      messages=json.loads(prompt_template.render(input=row['input']))).choices[0].message.content
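  # Fall back to the raw text if the model did not echo both DSPy markers.
  match = result_extractor.match(raw_dspy_output)
  dspy_output = match.group(1).strip() if match else raw_dspy_output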

  original_score = client.scoring_system.score(
      llm_input=row["input"],
      llm_output=original_output,
      scoring_spec=aesop_scoring_spec,
  )
  dspy_score = client.scoring_system.score(
      llm_input=row["input"],
      llm_output=dspy_output,
      scoring_spec=aesop_scoring_spec,
  )

  pretty_print_responses(
      header="#### Input:\n" + row["input"],
      response1="#### Output:\n" + original_output,
      response2="#### Output:\n" + dspy_output,
      left_label="Original",
      right_label="DSPy",
      scores_left=original_score,
      scores_right=dspy_score,
  )
  print("\n\n")

| Dimension | Sub-dimension | Original | DSPy |
|---|---|---|---|
| Story Structure | | 0.857 | 0.833 |
| | Plot Structure | 1.0 | 1.0 |
| | Conflict Introduction | 0.805 | 0.742 |
| | Resolution Clarity | 0.766 | 0.758 |
| Character Development | | 0.848 | 0.65 |
| | Character Presence | 0.992 | 0.75 |
| | Character Development | 0.77 | 0.439 |
| | Dialogue Quality | 0.781 | 0.762 |
| Narrative Engagement | | 0.792 | 0.677 |
| | Engaging Narrative | 0.758 | 0.754 |

| Dimension | Sub-dimension | Original | DSPy |
|---|---|---|---|
| Story Structure | | 0.686 | 0.585 |
| | Plot Structure | 1.0 | 1.0 |
| | Conflict Introduction | 0.316 | 0.014 |
| | Resolution Clarity | 0.742 | 0.742 |
| Character Development | | 0.749 | 0.598 |
| | Character Presence | 0.75 | 0.754 |
| | Character Development | 0.742 | 0.199 |
| | Dialogue Quality | 0.754 | 0.84 |
| Narrative Engagement | | 0.759 | 0.661 |
| | Engaging Narrative | 0.758 | 0.738 |

| Dimension | Sub-dimension | Original | DSPy |
|---|---|---|---|
| Story Structure | | 0.928 | 0.921 |
| | Plot Structure | 1.0 | 1.0 |
| | Conflict Introduction | 1.0 | 0.988 |
| | Resolution Clarity | 0.785 | 0.773 |
| Character Development | | 0.923 | 0.771 |
| | Character Presence | 0.77 | 0.773 |
| | Character Development | 1.0 | 0.609 |
| | Dialogue Quality | 1.0 | 0.93 |
| Narrative Engagement | | 0.865 | 0.76 |
| | Engaging Narrative | 0.832 | 0.766 |

| Dimension | Sub-dimension | Original | DSPy |
|---|---|---|---|
| Story Structure | | 0.842 | 0.921 |
| | Plot Structure | 1.0 | 1.0 |
| | Conflict Introduction | 0.762 | 0.984 |
| | Resolution Clarity | 0.766 | 0.777 |
| Character Development | | 0.844 | 0.841 |
| | Character Presence | 0.766 | 0.758 |
| | Character Development | 0.773 | 0.766 |
| | Dialogue Quality | 0.992 | 1.0 |
| Narrative Engagement | | 0.764 | 0.697 |
| | Engaging Narrative | 0.789 | 0.758 |

| Dimension | Sub-dimension | Original | DSPy |
|---|---|---|---|
| Story Structure | | 0.835 | 0.581 |
| | Plot Structure | 1.0 | 0.777 |
| | Conflict Introduction | 0.816 | 0.477 |
| | Resolution Clarity | 0.688 | 0.488 |
| Character Development | | 0.7 | 0.517 |
| | Character Presence | 0.77 | 0.547 |
| | Character Development | 0.486 | 0.249 |
| | Dialogue Quality | 0.844 | 0.754 |
| Narrative Engagement | | 0.785 | 0.522 |
| | Engaging Narrative | 0.82 | 0.543 |

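Five side-by-side comparisons are hard to judge by eye, so a small aggregate can help. The sketch below averages the change in one dimension and counts how many examples improved. It assumes that in each pair of score tables above the first belongs to the Original prompt and the second to the DSPy prompt, matching the left and right labels in the call; the numbers are the `Story Structure` scores copied from that output, and `compare` is a hypothetical helper.

```python
from statistics import mean


def compare(before, after):
    """Return the mean change and how many examples improved."""
    deltas = [a - b for b, a in zip(before, after)]
    return mean(deltas), sum(d > 0 for d in deltas)


# Story Structure scores for the five examples, in the order printed above.
original = [0.857, 0.686, 0.928, 0.842, 0.835]
dspy = [0.833, 0.585, 0.921, 0.921, 0.581]

avg_change, wins = compare(original, dspy)
print(f"mean change: {avg_change:+.3f}, improved on {wins} of {len(dspy)} examples")
```

Five rows is a very small sample, so treat the result as a sanity check rather than a verdict on the optimized prompt.
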
# Optimizing Few Shots

Few-shot optimization works much the same way, this time with DSPy's `BOOTSTRAP_FEW_SHOT` algorithm.

First kick off a job, then stream the messages while the few-shot examples are optimized, and finally print the prompt with the optimized few-shot examples in it.

In [19]:
shot_optimization_status = client.prompt.optimize.start_job(
    scoring_spec=aesop_scoring_spec,
    initial_system_instruction=aesop_scoring_spec.description,
    examples=[{"llm_input": row["input"], "llm_output": row["output"]} for row in aesop_dataset],
    model_id="gpt-4o-mini",
    tuning_algorithm="DSPY",
    dspy_optimization_type="BOOTSTRAP_FEW_SHOT",
    use_chain_of_thought=True,
)

In [20]:
from withpi_utils import stream

for line in stream(client.prompt.optimize, shot_optimization_status):
  print(line)

LAUNCHING
RUNNING

  0%|          | 0/23 [00:00<?, ?it/s]

  4%|4         | 1/23 [00:05<01:53,  5.16s/it]

  9%|8         | 2/23 [00:12<02:14,  6.39s/it]

  9%|8         | 2/23 [00:12<02:10,  6.20s/it]
Bootstrapped 2 full traces after 2 examples for up to 10 rounds, amounting to 2 attempts.
DONE


In [22]:
shot_optimization_status = client.prompt.optimize.retrieve(shot_optimization_status.job_id)
optimized_shots = json.dumps(shot_optimization_status.optimized_prompt_messages, indent=2)
print(optimized_shots)


[
  {
    "content": "Your input fields are:\n1. `input` (str): The input to the AI application\n\nYour output fields are:\n1. `reasoning` (str)\n2. `response` (str): The response from the AI application\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## input ## ]]\n{input}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  \"reasoning\": \"{reasoning}\",\n  \"response\": \"{response}\"\n}\n\nIn adhering to this structure, your objective is: \n        Write a children's story in the style of Aesop's Fables teaching a life lesson specified by the user.  Provide just the story with no extra content.",
    "role": "system"
  },
  {
    "content": "[[ ## input ## ]]\nWrite a fable involving a tortoise and a hare that emphasizes the value of perseverance and determination.\n\nRespond with a JSON object in the following order of fields: `reasoning`, then `response`.",
    

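With `use_chain_of_thought=True`, the system message above asks the model for a JSON object with `reasoning` and `response` fields rather than the `[[ ## response ## ]]` markers, so inference against this prompt needs a different extraction step. A minimal sketch follows, assuming the model replies with that JSON object, possibly wrapped in a Markdown fence; `parse_cot_output` is a hypothetical helper and the sample reply is invented for illustration.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_cot_output(raw):
    """Return the `response` field of a reasoning/response JSON reply."""
    text = raw.strip()
    # Some models wrap JSON in a Markdown fence; drop it if present.
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        return json.loads(text)["response"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not the expected JSON shape: hand back the reply unchanged.
        return raw


reply = '{"reasoning": "Slow and steady wins.", "response": "Once upon a time..."}'
print(parse_cot_output(reply))
```

Falling back to the raw reply keeps a comparison loop running when a single response is malformed, at the cost of scoring some unparsed text.
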
## Next Steps

You now have a prompt optimized against a small sample set.  You could deploy it as is, but improving the training set or the scorer should give you better performance.  Check out the rest of the playgrounds to proceed from here.