<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Synthetic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Synthetic Data Generation

Many techniques need paired inputs and LLM responses to drive evaluation and training, but collecting high-quality data can be painful and expensive.

Generating this data with AI support can give you a higher-quality set with much less effort. And it can be done with the same ScoringSpec that drives other techniques in Pi!

We will walk through the same `Aesop AI` example, but you can load any [ScoringSpec](https://build.withpi.ai/techniques/scoring_system/scoring_manual_calibration) here. Let's dig in!

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai.  Add it to your notebook secrets (the key symbol) on the left.

Run the cell below to install packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()
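
If you run this notebook outside Colab, `google.colab` is not importable and the cell above fails. A minimal sketch of a fallback, assuming the key may already be set as an environment variable; `resolve_api_key` is a hypothetical helper, not part of the Pi SDK:

```python
import os


def resolve_api_key(name="WITHPI_API_KEY"):
    """Return the API key from the environment, falling back to Colab secrets.

    Hypothetical helper for illustration only.
    """
    key = os.environ.get(name)
    if key:
        return key
    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError as exc:
        raise RuntimeError(
            f"{name} is not set and Colab secrets are unavailable"
        ) from exc
    key = userdata.get(name)
    if not key:
        raise RuntimeError(f"{name} is empty in Colab secrets")
    return key
```

You could then set `os.environ["WITHPI_API_KEY"] = resolve_api_key()` before constructing `PiClient()`.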

## Load a Scoring Spec

Load the `Aesop AI` example from Pi Labs cookbooks, or edit below to load a different one.


In [2]:
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

aesop_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/aesop_ai.json"
)

display_scoring_spec(aesop_scoring_spec)

## Generate an Example Set

Given this structured description, let's generate a set of plausible input/response pairs (story prompts built around moral lessons, plus the stories written for them) that can be used to exercise the ScoringSpec. The job below requests 12 pairs and takes about 50 seconds to run.

In [3]:
synthetic_data_generation_status = client.data.generate_input_response_pairs.start_job(
    system_prompt=aesop_scoring_spec.description,
    num_pairs_to_generate=12,
    seeds=[],
    batch_size=3,
    num_shots=3,
)
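
The call above passes `seeds=[]`, so the service generates its own seeds. A minimal sketch of preparing hand-written seeds instead is shown here. The seed shape (a dict with `llm_input` and `llm_output`, mirroring the examples the job returns) is an assumption that this notebook does not confirm, so check the SDK reference before relying on it. `make_seed` and `build_generation_kwargs` are hypothetical helpers, and the seed story is a hand-written placeholder.

```python
def make_seed(llm_input, llm_output):
    """Build one seed pair.

    ASSUMPTION: seeds use the same two fields as the returned examples.
    """
    if not llm_input.strip() or not llm_output.strip():
        raise ValueError("seed input and output must be non-empty")
    return {"llm_input": llm_input, "llm_output": llm_output}


def build_generation_kwargs(system_prompt, seeds, num_pairs=12,
                            batch_size=3, num_shots=3):
    """Collect the keyword arguments used by the start_job call above."""
    return {
        "system_prompt": system_prompt,
        "num_pairs_to_generate": num_pairs,
        "seeds": list(seeds),
        "batch_size": batch_size,
        "num_shots": num_shots,
    }


seeds = [
    make_seed(
        "Write a story teaching the lesson 'slow and steady wins the race'.",
        "A tortoise challenged a boastful hare to a race and won by never stopping.",
    )
]
kwargs = build_generation_kwargs("placeholder system prompt", seeds)
# In the notebook, pass system_prompt=aesop_scoring_spec.description and call:
# client.data.generate_input_response_pairs.start_job(**kwargs)
```

Keeping the arguments in a dict makes it easy to rerun the job with different seeds and compare the results.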

## Stream Results

The stream utility yields data as it is generated and prints status messages along the way, so the snippet below interleaves `[OUTPUT]` lines with status lines.

In [4]:
from withpi_utils.jobs import stream

for data in stream(client.data.generate_input_response_pairs, synthetic_data_generation_status):
    print(f"[OUTPUT] - {data}")

LAUNCHING
RUNNING
[INFO] Generating 10 seeds as they are not provided.
[INFO] Yielding generated 10 seeds
[INFO] Synthetic Data Generation Ongoing => Good Pairs: 10/12. Bad Pairs: 0. Similar Pairs: 0
[INFO] Progress=> Good: 10/12 Bad: 0 Similar: 0
[OUTPUT] - {'llm_input': "Write a story teaching the lesson 'honesty is the best policy'.", 'llm_output': "In a small forest, there lived a rabbit named Rosie. Rosie loved to explore and play with her friends, but she had a habit of telling lies to get out of trouble. One day, while playing near the forest, Rosie accidentally broke a beehive, causing the bees to swarm out in anger.\n\nAfraid of being scolded by the other animals, Rosie told them that a strong wind had blown the hive down. The animals believed her and helped to clean up the mess. However, the bees were still angry and wouldn't leave Rosie alone.\n\nThe next day, Rosie broke a bird's nest, and again, she lied, saying that a sly fox had done it. But the birds didn't believe her 

## Take a look at the generated examples

Retrieve the finished job and print every returned example (input plus output). If the job is still running, wait and rerun the cell.

In [5]:
synthetic_data_generation_status = client.data.generate_input_response_pairs.retrieve(
    job_id=synthetic_data_generation_status.job_id
)

state = synthetic_data_generation_status.state

if state == "DONE":
    # The job finished successfully, so the generated pairs are available.
    print("Printing all the generated examples below...")
    assert synthetic_data_generation_status.data is not None
    for example in synthetic_data_generation_status.data:
        print(example)
elif state == "ERROR":
    print("Job ended in error")
else:
    print("Please wait for the job to finish and then run this cell again...")

Printing all the generated examples below...
Example(llm_input="Write a story teaching the lesson 'honesty is the best policy'.", llm_output="In a small forest, there lived a rabbit named Rosie. Rosie loved to explore and play with her friends, but she had a habit of telling lies to get out of trouble. One day, while playing near the forest, Rosie accidentally broke a beehive, causing the bees to swarm out in anger.\n\nAfraid of being scolded by the other animals, Rosie told them that a strong wind had blown the hive down. The animals believed her and helped to clean up the mess. However, the bees were still angry and wouldn't leave Rosie alone.\n\nThe next day, Rosie broke a bird's nest, and again, she lied, saying that a sly fox had done it. But the birds didn't believe her this time, and they scolded her for lying.\n\nRosie realized that her lies were causing more trouble than the truth would have. She decided to confess to the animals about breaking the beehive and the bird's nest.
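
The cell above asks you to rerun it until the job finishes. A minimal polling sketch can automate that wait. `wait_for_job` is a hypothetical helper, not part of the SDK; it assumes only what the cell above shows: a `retrieve(job_id=...)` callable that returns an object with a `state` attribute, with `"DONE"` and `"ERROR"` as the terminal states. The poll interval and timeout are arbitrary defaults.

```python
import time

TERMINAL_STATES = {"DONE", "ERROR"}


def wait_for_job(retrieve, job_id, poll_seconds=5.0, timeout_seconds=300.0,
                 sleep=time.sleep):
    """Poll retrieve(job_id=...) until the job reaches a terminal state.

    Raises TimeoutError if the job is still running after timeout_seconds.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = retrieve(job_id=job_id)
        if status.state in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish in time")
        sleep(poll_seconds)
```

In this notebook the call would look like `wait_for_job(client.data.generate_input_response_pairs.retrieve, synthetic_data_generation_status.job_id)`.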

In [6]:
# @title Score and manually inspect the data
from withpi_utils.colab import pretty_print_responses

for example in synthetic_data_generation_status.data:
    score = client.scoring_system.score(
        llm_input=example.llm_input,
        llm_output=example.llm_output,
        scoring_spec=aesop_scoring_spec,
    )

    pretty_print_responses(
        header="#### Input:\n" + example.llm_input,
        response1="#### Output:\n" + example.llm_output,
        left_label="PI Synthetic Data",
        scores_left=score,
    )
    print("\n\n")

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.923 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.77 |
| Character Development | | 0.844 |
| | Character Presence | 0.773 |
| | Character Development | 0.758 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.66 |
| | Engaging Narrative | 0.754 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.854 |
| | Plot Structure | 0.996 |
| | Conflict Introduction | 0.793 |
| | Resolution Clarity | 0.773 |
| Character Development | | 0.832 |
| | Character Presence | 0.75 |
| | Character Development | 0.746 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.678 |
| | Engaging Narrative | 0.742 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.91 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.941 |
| | Resolution Clarity | 0.789 |
| Character Development | | 0.979 |
| | Character Presence | 1.0 |
| | Character Development | 0.938 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.842 |
| | Engaging Narrative | 0.805 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.857 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.789 |
| | Resolution Clarity | 0.781 |
| Character Development | | 0.923 |
| | Character Presence | 0.77 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.789 |
| | Engaging Narrative | 0.84 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.922 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.996 |
| | Resolution Clarity | 0.77 |
| Character Development | | 0.97 |
| | Character Presence | 1.0 |
| | Character Development | 0.91 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.91 |
| | Engaging Narrative | 0.984 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 1.0 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 1.0 |
| Character Development | | 0.911 |
| | Character Presence | 0.992 |
| | Character Development | 0.742 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.69 |
| | Engaging Narrative | 0.734 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.948 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.844 |
| Character Development | | 0.996 |
| | Character Presence | 0.988 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.784 |
| | Engaging Narrative | 0.75 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.927 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.781 |
| Character Development | | 0.922 |
| | Character Presence | 1.0 |
| | Character Development | 0.766 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.832 |
| | Engaging Narrative | 0.777 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.764 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.531 |
| | Resolution Clarity | 0.762 |
| Character Development | | 1.0 |
| | Character Presence | 1.0 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.915 |
| | Engaging Narrative | 1.0 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.773 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 0.543 |
| | Resolution Clarity | 0.777 |
| Character Development | | 0.822 |
| | Character Presence | 0.766 |
| | Character Development | 0.758 |
| | Dialogue Quality | 0.941 |
| Narrative Engagement | | 0.716 |
| | Engaging Narrative | 0.754 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 1.0 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 1.0 |
| Character Development | | 0.837 |
| | Character Presence | 0.75 |
| | Character Development | 0.762 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.824 |
| | Engaging Narrative | 0.75 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 1.0 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 1.0 |
| Character Development | | 0.921 |
| | Character Presence | 0.762 |
| | Character Development | 1.0 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.835 |
| | Engaging Narrative | 0.996 |

| Dimension | Sub-dimension | Score |
|---|---|---|
| Story Structure | | 0.927 |
| | Plot Structure | 1.0 |
| | Conflict Introduction | 1.0 |
| | Resolution Clarity | 0.781 |
| Character Development | | 0.857 |
| | Character Presence | 0.766 |
| | Character Development | 0.805 |
| | Dialogue Quality | 1.0 |
| Narrative Engagement | | 0.85 |
| | Engaging Narrative | 0.789 |
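
Reading thirteen score tables by eye is slow, so a small summary can help with manual inspection. The sketch below uses plain dicts holding the top-level dimension scores copied from the first three tables above; it does not assume any particular shape for the object returned by `client.scoring_system.score`. The helper names and the 0.7 threshold are illustrative choices, not part of Pi.

```python
from statistics import mean

# Top-level dimension scores copied from the first three score tables above.
scored_examples = [
    {"Story Structure": 0.923, "Character Development": 0.844, "Narrative Engagement": 0.66},
    {"Story Structure": 0.854, "Character Development": 0.832, "Narrative Engagement": 0.678},
    {"Story Structure": 0.91, "Character Development": 0.979, "Narrative Engagement": 0.842},
]


def dimension_means(examples):
    """Average each dimension's score across all examples."""
    names = examples[0].keys()
    return {name: mean(ex[name] for ex in examples) for name in names}


def flag_weak_examples(examples, threshold):
    """Return indices of examples whose lowest dimension score is below threshold."""
    return [i for i, ex in enumerate(examples) if min(ex.values()) < threshold]


means = dimension_means(scored_examples)
weakest = min(means, key=means.get)
flagged = flag_weak_examples(scored_examples, threshold=0.7)

print(f"Weakest dimension on average: {weakest}")
print(f"Examples to inspect first: {flagged}")
```

Flagged examples are good candidates to read first, or to regenerate before saving the set.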







## Save the set

We will come back to this set in a future colab, so it is worth saving. Store it as a Parquet table or another format supported by [Datasets](https://huggingface.co/docs/datasets/en/process#save).

Alternatively, upload to Hugging Face.

In [7]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'input': [example.llm_input for example in synthetic_data_generation_status.data],
    'output': [example.llm_output for example in synthetic_data_generation_status.data]
})

print(dataset)
# dataset.push_to_hub("...")

Dataset({
    features: ['input', 'output'],
    num_rows: 13
})
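
If you want a copy that does not depend on the `datasets` package, JSON Lines is a simple alternative that `datasets` can also read back through its `json` loader. This is a minimal standard-library sketch; `save_pairs_jsonl` and `load_pairs_jsonl` are hypothetical helpers, and the records are assumed to be dicts with the same `input` and `output` keys used in the cell above.

```python
import json
from pathlib import Path


def save_pairs_jsonl(pairs, path):
    """Write one JSON object per line, keeping only the input and output fields."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for pair in pairs:
            record = {"input": pair["input"], "output": pair["output"]}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def load_pairs_jsonl(path):
    """Read the records back as a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

In the notebook, the records could be built with `[{"input": e.llm_input, "output": e.llm_output} for e in synthetic_data_generation_status.data]`.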


## Next Steps

This set of input/response pairs can drive many other techniques in Pi. To steer generation in a different direction, adjust the calls above, for example by passing your own `seeds` instead of an empty list.