<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Synthetic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Synthetic Data Generation

Many techniques need inputs paired with LLM responses to drive evaluation and training, but collecting high-quality data can be painful and expensive.

Generating this data with AI assistance can give you a higher-quality set with much less effort, and it uses the same ScoringSpec that drives other techniques in Pi!

We will walk through the same `Aesop AI` example, but you can load any [ScoringSpec](https://build.withpi.ai/techniques/scoring_system/scoring_manual_calibration) here. Let's dig in!

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()

## Load a Scoring Spec

Load the `Aesop AI` example from Pi Labs cookbooks, or edit below to load a different one.


In [2]:
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

aesop_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/aesop_ai.json"
)

display_scoring_spec(aesop_scoring_spec)

## Generate an Example Set

Given this structured description, let's generate a set of plausible moral-lesson prompts and responses that can be used to exercise the ScoringSpec.  The job below requests 12 pairs and takes about 50 seconds to run.

In [3]:
synthetic_data_generation_status = client.data.generate_input_response_pairs.start_job(
    system_prompt=aesop_scoring_spec.description,
    num_pairs_to_generate=12,
    seeds=[],
    batch_size=3,
    num_shots=3,
)

## Stream Results

The `stream` utility yields data as it is generated and prints status messages along the way, so the output of the snippet below shows the two interleaved.

In [4]:
from withpi_utils.jobs import stream

for data in stream(client.data.generate_input_response_pairs, synthetic_data_generation_status):
    print(f"[OUTPUT] - {data}")

LAUNCHING
RUNNING
[INFO] Generating 10 seeds as they are not provided.
[INFO] Yielding generated 10 seeds
[OUTPUT] - {'llm_input': 'Write a story teaching the importance of honesty.', 'llm_output': 'In a small village, there lived a young rabbit named Rosie. Rosie loved to play with her friends and go on adventures in the forest. One day, while playing near the village market, Rosie accidentally broke a beautiful vase belonging to the village merchant, Mr. Squirrel.\n\nAfraid of getting in trouble, Rosie didn\'t tell anyone about the broken vase. Instead, she ran away and hid behind a nearby bush. But as the day went on, Rosie felt terrible about not telling the truth. She couldn\'t stop thinking about the broken vase and how Mr. Squirrel would be sad when he found out.\n\nThe next day, Mr. Squirrel discovered the broken vase and asked all the villagers if they knew who had broken it. Many of the villagers pointed fingers at each other, but no one told the truth. Rosie felt even worse,

## Take a look at the generated examples

Retrieve the job status and, once the job has finished, print each returned example (input and output).

In [5]:
synthetic_data_generation_status = client.data.generate_input_response_pairs.retrieve(
    job_id=synthetic_data_generation_status.job_id
)

state = synthetic_data_generation_status.state
if state == "DONE":
    print("Printing all the generated examples below...")
    assert synthetic_data_generation_status.data is not None
    for example in synthetic_data_generation_status.data:
        print(example)
elif state == "ERROR":
    print("Job ended in error")
else:
    # Any other state means the job is still in progress.
    print("Please wait for the job to finish and then run this cell again...")

Printing all the generated examples below...
SDKExample(llm_input='Write a story teaching the importance of honesty.', llm_output='In a small village, there lived a young rabbit named Rosie. Rosie loved to play with her friends and go on adventures in the forest. One day, while playing near the village market, Rosie accidentally broke a beautiful vase belonging to the village merchant, Mr. Squirrel.\n\nAfraid of getting in trouble, Rosie didn\'t tell anyone about the broken vase. Instead, she ran away and hid behind a nearby bush. But as the day went on, Rosie felt terrible about not telling the truth. She couldn\'t stop thinking about the broken vase and how Mr. Squirrel would be sad when he found out.\n\nThe next day, Mr. Squirrel discovered the broken vase and asked all the villagers if they knew who had broken it. Many of the villagers pointed fingers at each other, but no one told the truth. Rosie felt even worse, knowing that she was the one responsible.\n\nJust then, a wise old 
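
The cell above asks you to rerun it by hand until the job finishes. As a rough sketch, the same check can be wrapped in a polling loop. `wait_for_job` is a hypothetical helper, not part of the Pi SDK; it only assumes that the object returned by `retrieve` has a `state` attribute that ends up as `"DONE"` or `"ERROR"`, as in the cell above.

```python
import time
from typing import Any, Callable

# Terminal states taken from the retrieval cell above.
TERMINAL_STATES = {"DONE", "ERROR"}


def wait_for_job(
    retrieve: Callable[[], Any],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
) -> Any:
    """Call `retrieve()` repeatedly until the status reaches a terminal state.

    `retrieve` is any zero-argument callable returning an object with a
    `state` attribute. Raises TimeoutError if the job is still running
    once `timeout_seconds` have elapsed.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = retrieve()
        if status.state in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still in state {status.state!r} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Possible use in this notebook (left commented so the sketch runs on its own):
# synthetic_data_generation_status = wait_for_job(
#     lambda: client.data.generate_input_response_pairs.retrieve(
#         job_id=synthetic_data_generation_status.job_id
#     )
# )
```

The default interval and timeout are arbitrary choices; the generation job in this notebook is reported to take about 50 seconds.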

In [None]:
# @title Let's score and manually inspect the data
from withpi_utils.colab import pretty_print_responses

for example in synthetic_data_generation_status.data:
    score = client.scoring_system.score(
        llm_input=example.llm_input,
        llm_output=example.llm_output,
        scoring_spec=aesop_scoring_spec,
    )

    pretty_print_responses(
        header="#### Input:\n" + example.llm_input,
        response1="#### Output:\n" + example.llm_output,
        left_label="PI Synthetic Data",
        scores_left=score,
    )
    print("\n\n")

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.922 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.766 |
| | Resolution Clarity | 1.0 |
| Character Development | | 1.0 |
| | Character Presence | 1.0 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.853 |
| | Engaging Narrative | 1.0 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.979 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.938 |
| Character Development | | 1.0 |
| | Character Presence | 1.0 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.859 |
| | Engaging Narrative | 1.0 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 1.0 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 1.0 |
| Character Development | | 0.852 |
| | Character Presence | 0.801 |
| | Character Development | 0.754 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.922 |
| | Engaging Narrative | 1.0 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 1.0 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 1.0 |
| Character Development | | 1.0 |
| | Character Presence | 1.0 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.889 |
| | Engaging Narrative | 1.0 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.941 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.824 |
| Character Development | | 0.883 |
| | Character Presence | 0.766 |
| | Character Development | 0.883 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.822 |
| | Engaging Narrative | 0.938 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.862 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.816 |
| | Resolution Clarity | 0.77 |
| Character Development | | 0.84 |
| | Character Presence | 0.754 |
| | Character Development | 0.766 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.796 |
| | Engaging Narrative | 0.859 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.911 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.93 |
| | Resolution Clarity | 0.805 |
| Character Development | | 0.923 |
| | Character Presence | 0.77 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.768 |
| | Engaging Narrative | 0.77 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.995 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.984 |
| | Resolution Clarity | 1.0 |
| Character Development | | 0.997 |
| | Character Presence | 0.992 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.84 |
| | Engaging Narrative | 0.914 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.598 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.0 |
| | Resolution Clarity | 0.793 |
| Character Development | | 0.923 |
| | Character Presence | 1.0 |
| | Character Development | 0.77 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.953 |
| | Engaging Narrative | 1.0 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.971 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.914 |
| Character Development | | 0.917 |
| | Character Presence | 0.75 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.906 |
| | Engaging Narrative | 1.0 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.931 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.793 |
| Character Development | | 0.923 |
| | Character Presence | 0.77 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.699 |
| | Engaging Narrative | 0.781 |







| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 1.0 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 1.0 |
| Character Development | | 0.919 |
| | Character Presence | 0.758 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.764 |
| | Engaging Narrative | 0.754 |
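
When inspecting many score tables by eye, it can help to surface only the weak spots. The sketch below works on a plain dictionary of sub-dimension scores typed in from one of the printed tables above (the one where Conflict Introduction is 0.0). `low_scoring_dimensions` is a hypothetical helper, and how to pull these values out of the Pi score object is not shown here, so treat the dictionary shape as an assumption.

```python
from typing import Dict, List, Tuple


def low_scoring_dimensions(
    scores: Dict[str, float], threshold: float = 0.7
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs strictly below `threshold`, lowest first."""
    low = [(name, value) for name, value in scores.items() if value < threshold]
    return sorted(low, key=lambda item: item[1])


# Sub-dimension scores copied by hand from one printed table above.
example_scores = {
    "Plot Structure": 1.0,
    "Conflict Introduction": 0.0,
    "Resolution Clarity": 0.793,
    "Character Presence": 1.0,
    "Character Development": 0.77,
    "Dialogue Quality": 1.0,
    "Engaging Narrative": 1.0,
}

print(low_scoring_dimensions(example_scores))
# prints [('Conflict Introduction', 0.0)]
```

The 0.7 threshold is an arbitrary illustration; pick whatever cut-off matches how strict you want the manual review to be.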







## Save the set

We will come back to this set in a future colab, so it is worth saving.  Store it as a Parquet table or in another format supported by [Datasets](https://huggingface.co/docs/datasets/en/process#save).

Alternatively, upload it to the Hugging Face Hub.

In [None]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'input': [example.llm_input for example in synthetic_data_generation_status.data],
    'output': [example.llm_output for example in synthetic_data_generation_status.data]
})

print(dataset)
# dataset.push_to_hub("...")

Dataset({
    features: ['input', 'output'],
    num_rows: 12
})
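
If you also want a plain-text copy that is easy to diff or inspect, one option is a JSON Lines file written with the standard library. This is a swapped-in alternative rather than a Datasets feature: `save_jsonl` and `load_jsonl` are hypothetical helper names, and the file name in the usage comment is only a placeholder.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def save_jsonl(rows: Iterable[Dict[str, str]], path: str) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for row in rows:
            # json.dumps escapes embedded newlines, so each row stays on one line.
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def load_jsonl(path: str) -> List[Dict[str, str]]:
    """Read the rows back, skipping any blank lines."""
    with Path(path).open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Possible use in this notebook (left commented so the sketch runs on its own).
# Iterating a Dataset yields one dict per row.
# save_jsonl(dataset, "synthetic_pairs.jsonl")
```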


## Next Steps

This input set can drive many other techniques in Pi.  You can adjust the calls above, for example by adding seeds, to steer the generation in different ways.
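
As a rough illustration of adding seeds, the sketch below prepares a small seed list to pass in place of the empty `seeds=[]` used earlier. It assumes, without confirmation from this notebook, that seeds take the same `llm_input` / `llm_output` shape as the generated examples; check the Pi documentation for the exact format. `make_seeds` is a hypothetical helper and the seed text is hand-written for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def make_seeds(pairs: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (input, output) tuples into seed dictionaries, rejecting empty text.

    The llm_input / llm_output keys mirror the generated examples printed
    earlier; whether the API expects exactly this shape is an assumption.
    """
    seeds = []
    for llm_input, llm_output in pairs:
        if not llm_input.strip() or not llm_output.strip():
            raise ValueError("Each seed needs a non-empty input and output")
        seeds.append(
            {"llm_input": llm_input.strip(), "llm_output": llm_output.strip()}
        )
    return seeds


# A hand-written seed, purely illustrative.
seeds = make_seeds(
    [
        (
            "Write a story teaching the value of patience.",
            "A young fox dug up his seeds each morning to see if they had grown, "
            "and so nothing grew. The old tortoise planted once and waited.",
        )
    ]
)

# Assumed call, mirroring the generation cell above with seeds filled in
# (left commented so the sketch runs on its own):
# client.data.generate_input_response_pairs.start_job(
#     system_prompt=aesop_scoring_spec.description,
#     num_pairs_to_generate=12,
#     seeds=seeds,
#     batch_size=3,
#     num_shots=3,
# )
```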