<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Synthetic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Synthetic Data Generation

Many techniques need inputs paired with LLM responses to drive evaluation and training, but collecting high-quality data can be painful and expensive.

Generating this data with AI support can give you a higher-quality set with much less effort. And it can be done with the same ScoringSpec that drives other techniques in Pi!

We will walk through the same `Aesop AI` example, but you can load any [ScoringSpec](https://build.withpi.ai/techniques/scoring_system/scoring_manual_calibration) here. Let's dig in!

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai.  Add it to your notebook secrets (the key symbol) on the left.

Run the cell below to install the packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()
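Outside Colab, `google.colab.userdata` is not available, so the key has to come from somewhere else. The following is a minimal sketch, assuming you export `WITHPI_API_KEY` as an ordinary environment variable; `resolve_api_key` is a hypothetical helper written for this illustration and is not part of the `withpi` SDK.

```python
import os


def resolve_api_key(env=None, name="WITHPI_API_KEY"):
    """Return the API key from an environment mapping, or raise a clear error.

    Hypothetical helper for illustration only; not part of the withpi SDK.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or notebook secrets."
        )
    return key
```

Failing early with a readable message tends to be easier to debug than an authentication error surfacing from the first API call.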

## Load a Scoring Spec

Load the `Aesop AI` example from Pi Labs cookbooks, or edit below to load a different one.


In [2]:
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

aesop_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/aesop_ai.json"
)

display_scoring_spec(aesop_scoring_spec)

## Generate an Example Set

Given this structured description, let's build a Dataset containing a bunch of plausible moral lessons that could be used to exercise the ScoringSpec.  This will take about 50 seconds to generate.

In [3]:
synthetic_data_generation_status = client.data.generate_input_response_pairs.start_job(
    system_prompt=aesop_scoring_spec.description,
    num_pairs_to_generate=10,
    seeds=[],
    batch_size=3,
    num_shots=3,
)
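Note that `batch_size=3` does not divide `num_pairs_to_generate=10` evenly, so the job may return more pairs than requested. The sketch below is a back-of-the-envelope check under the assumption (not confirmed by the Pi documentation) that generation proceeds in whole batches until the target is met. That reading would be consistent with the 12-row dataset saved at the end of this notebook.

```python
import math

num_pairs_to_generate = 10
batch_size = 3

# Assumption: the job keeps launching full batches until the target is reached.
batches_needed = math.ceil(num_pairs_to_generate / batch_size)
max_pairs = batches_needed * batch_size

print(f"batches: {batches_needed}, pairs if every batch is fully accepted: {max_pairs}")
```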

## Stream the messages as the data is generated

The messages provide detail about what is being done to generate the data.

In [4]:
from withpi_utils.colab import stream_response

synthetic_data_generation_status = stream_response(
    synthetic_data_generation_status.job_id, client.data.generate_input_response_pairs
)

Detailed Status for synthetic_data_jobs:09ac227c912792130a876568ea872593308c0d4b3d7c896ec7991f041cbeedd8:c7bfa8b3-d94c-46b2-a62b-921e7751983a
LAUNCHING
RUNNING
[INFO] Generating 3 seeds as they are not provided.
[INFO] Progress=> Good: 0/10 Bad: 0 Similar: 0
[INFO] Generated themes: ['Kindness and Empathy', 'Honesty and Integrity', 'Perseverance and Hard Work', 'Responsibility and Accountability', 'Courage and Bravery', 'Teamwork and Collaboration', 'Gratitude and Appreciation', 'Respect and Tolerance', 'Generosity and Selflessness', 'Wisdom and Prudence']
[INFO] Using selected theme: 'Generosity and Selflessness' for this batch of generation
[INFO] Generation LLM temperature fixed or updated to 1.0
[INFO] Synthetic Data Generation Ongoing => Good Pairs: 3/10. Bad Pairs: 0. Similar Pairs: 0
[INFO] Progress=> Good: 3/10 Bad: 0 Similar: 0
[INFO] Using selected theme: 'Responsibility and Accountability' for this batch of generation
[INFO] Generation LLM temperature fixed or updated to 1.0
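The log above moves through `LAUNCHING` and `RUNNING` before the job reaches a terminal state. How `stream_response` is implemented is not shown here; the sketch below is only a generic polling loop of the kind such a helper might wrap. `poll_until_finished` and `fetch_status` are hypothetical names, and the fake job exists purely so the example runs on its own.

```python
import time


def poll_until_finished(fetch_status, interval_s=1.0, max_polls=120, sleep=time.sleep):
    """Call `fetch_status()` until it reports a terminal state.

    `fetch_status` is a hypothetical zero-argument callable returning a dict
    with a "state" key; swap in whatever retrieval call your client exposes.
    """
    seen_states = []
    for _ in range(max_polls):
        status = fetch_status()
        state = status["state"]
        # Print each state transition once, similar to the log above.
        if not seen_states or seen_states[-1] != state:
            print(state)
            seen_states.append(state)
        if state in ("DONE", "ERROR"):
            return status
        sleep(interval_s)
    raise TimeoutError("job did not finish within the polling budget")


# Usage with a fake job that finishes on the fourth poll.
_fake_states = iter(["LAUNCHING", "RUNNING", "RUNNING", "DONE"])
final_status = poll_until_finished(
    lambda: {"state": next(_fake_states)}, sleep=lambda _: None
)
```

Capping the number of polls keeps a stuck job from blocking the notebook indefinitely.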

You can also stream the generated data itself, as shown in the cell below.

In [5]:
from withpi_utils.colab import stream_data

synthetic_data_generation_status = stream_data(
    synthetic_data_generation_status.job_id, client.data.generate_input_response_pairs
)

{
  "llm_input": "Tell a story that shows how kindness can make a difference in a community.",
  "llm_output": "In a small village, there lived a young rabbit named Rosie. Rosie loved to help others and always looked for ways to make her community a better place. One day, she noticed that Mrs. Hedgehog, a elderly resident, was struggling to carry her groceries home. Rosie quickly ran to help her, and together they carried the bags to Mrs. Hedgehog's cozy little house.\n\nAs they walked, Rosie learned that Mrs. Hedgehog was not only struggling with groceries but also with feeling lonely. Her family had moved away, and she had no one to talk to. Rosie decided to visit Mrs. Hedgehog every day, bringing her fresh vegetables from her garden and listening to her stories.\n\nSoon, other animals in the village noticed Rosie's kind actions and wanted to help too. A group of birds started singing for Mrs. Hedgehog, a squirrel offered to help with her chores, and a family of deer invited her to j

## Take a look at the generated examples

The cell below prints the returned examples (inputs and outputs) once the job has reached a terminal state.

In [6]:
if synthetic_data_generation_status.state not in ["ERROR", "DONE"]:
    print("Please wait for the job to finish and then run this cell again...")
else:
    if synthetic_data_generation_status.state == "DONE":
        print("Printing all the generated examples below...")
        assert synthetic_data_generation_status.data is not None
        for example in synthetic_data_generation_status.data:
            print(example)
    else:
        print("Job ended in error")

Printing all the generated examples below...
SDKExample(llm_input='Tell a story that shows how kindness can make a difference in a community.', llm_output="In a small village, there lived a young rabbit named Rosie. Rosie loved to help others and always looked for ways to make her community a better place. One day, she noticed that Mrs. Hedgehog, a elderly resident, was struggling to carry her groceries home. Rosie quickly ran to help her, and together they carried the bags to Mrs. Hedgehog's cozy little house.\n\nAs they walked, Rosie learned that Mrs. Hedgehog was not only struggling with groceries but also with feeling lonely. Her family had moved away, and she had no one to talk to. Rosie decided to visit Mrs. Hedgehog every day, bringing her fresh vegetables from her garden and listening to her stories.\n\nSoon, other animals in the village noticed Rosie's kind actions and wanted to help too. A group of birds started singing for Mrs. Hedgehog, a squirrel offered to help with her c

In [7]:
# @title Let's Score and manually inspect the data
from withpi_utils.colab import pretty_print_responses

for example in synthetic_data_generation_status.data:
    score = client.scoring_system.score(
        llm_input=example.llm_input,
        llm_output=example.llm_output,
        scoring_spec=aesop_scoring_spec,
    )

    pretty_print_responses(
        header="#### Input:\n" + example.llm_input,
        response1="#### Output:\n" + example.llm_output,
        left_label="PI Synthetic Data",
        scores_left=score,
    )
    print("\n\n")

Dimension,Sub-dimension,Score
Story Structure,,0.922
,Plot Structure,1.0
,Conflict Introduction,0.766
,Resolution Clarity,1.0
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.853
,Engaging Narrative,1.0

Dimension,Sub-dimension,Score
Story Structure,,0.979
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,0.938
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.859
,Engaging Narrative,1.0

Dimension,Sub-dimension,Score
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,0.852
,Character Presence,0.801
,Character Development,0.754
,Dialogue Quality,1.0
Narrative Engagement,,0.922
,Engaging Narrative,1.0

Dimension,Sub-dimension,Score
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.889
,Engaging Narrative,1.0

Dimension,Sub-dimension,Score
Story Structure,,0.941
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,0.824
Character Development,,0.883
,Character Presence,0.766
,Character Development,0.883
,Dialogue Quality,1.0
Narrative Engagement,,0.822
,Engaging Narrative,0.938

Dimension,Sub-dimension,Score
Story Structure,,0.862
,Plot Structure,1.0
,Conflict Introduction,0.816
,Resolution Clarity,0.77
Character Development,,0.84
,Character Presence,0.754
,Character Development,0.766
,Dialogue Quality,1.0
Narrative Engagement,,0.796
,Engaging Narrative,0.859

Dimension,Sub-dimension,Score
Story Structure,,0.911
,Plot Structure,1.0
,Conflict Introduction,0.93
,Resolution Clarity,0.805
Character Development,,0.923
,Character Presence,0.77
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.768
,Engaging Narrative,0.77

Dimension,Sub-dimension,Score
Story Structure,,0.995
,Plot Structure,1.0
,Conflict Introduction,0.984
,Resolution Clarity,1.0
Character Development,,0.997
,Character Presence,0.992
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.84
,Engaging Narrative,0.914

Dimension,Sub-dimension,Score
Story Structure,,0.598
,Plot Structure,1.0
,Conflict Introduction,0.0
,Resolution Clarity,0.793
Character Development,,0.923
,Character Presence,1.0
,Character Development,0.77
,Dialogue Quality,1.0
Narrative Engagement,,0.953
,Engaging Narrative,1.0

Dimension,Sub-dimension,Score
Story Structure,,0.971
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,0.914
Character Development,,0.917
,Character Presence,0.75
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.906
,Engaging Narrative,1.0

Dimension,Sub-dimension,Score
Story Structure,,0.931
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,0.793
Character Development,,0.923
,Character Presence,0.77
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.699
,Engaging Narrative,0.781

Dimension,Sub-dimension,Score
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,0.919
,Character Presence,0.758
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.764
,Engaging Narrative,0.754

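Reading twelve score tables by eye gets tedious, so it can help to flag the weakest examples automatically. The sketch below copies the three top-level scores from each table above into plain dictionaries and reports any example whose lowest score falls under a cutoff. The `THRESHOLD` value of 0.7 is an arbitrary choice for illustration, not a Pi recommendation.

```python
# Top-level scores copied from the twelve tables above,
# in the order the examples were printed.
scores = [
    {"Story Structure": 0.922, "Character Development": 1.0, "Narrative Engagement": 0.853},
    {"Story Structure": 0.979, "Character Development": 1.0, "Narrative Engagement": 0.859},
    {"Story Structure": 1.0, "Character Development": 0.852, "Narrative Engagement": 0.922},
    {"Story Structure": 1.0, "Character Development": 1.0, "Narrative Engagement": 0.889},
    {"Story Structure": 0.941, "Character Development": 0.883, "Narrative Engagement": 0.822},
    {"Story Structure": 0.862, "Character Development": 0.84, "Narrative Engagement": 0.796},
    {"Story Structure": 0.911, "Character Development": 0.923, "Narrative Engagement": 0.768},
    {"Story Structure": 0.995, "Character Development": 0.997, "Narrative Engagement": 0.84},
    {"Story Structure": 0.598, "Character Development": 0.923, "Narrative Engagement": 0.953},
    {"Story Structure": 0.971, "Character Development": 0.917, "Narrative Engagement": 0.906},
    {"Story Structure": 0.931, "Character Development": 0.923, "Narrative Engagement": 0.699},
    {"Story Structure": 1.0, "Character Development": 0.919, "Narrative Engagement": 0.764},
]

THRESHOLD = 0.7  # Arbitrary cutoff chosen for illustration.

flagged = []
for position, dims in enumerate(scores, start=1):
    # Find the lowest-scoring dimension for this example.
    weakest = min(dims, key=dims.get)
    if dims[weakest] < THRESHOLD:
        flagged.append((position, weakest, dims[weakest]))

for position, name, value in flagged:
    print(f"example {position}: {name} = {value}")
```

With this cutoff, examples 9 and 11 are flagged. Example 9 is the one whose Conflict Introduction score is 0.0, which makes it a good candidate for manual inspection.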
## Save the set

We will come back to this in a future colab, so it's useful to capture.  Store it as a Parquet table, which you can download.

Alternatively, upload to Hugging Face.

In [9]:
from datasets import Dataset

dataset = Dataset.from_dict({
    'input': [example.llm_input for example in synthetic_data_generation_status.data],
    'output': [example.llm_output for example in synthetic_data_generation_status.data]
})

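# Write the Parquet file mentioned above; the filename is an arbitrary choice.
dataset.to_parquet("aesop_synthetic_data.parquet")

print(dataset)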
# dataset.push_to_hub("...")

Dataset({
    features: ['input', 'output'],
    num_rows: 12
})


## Next Steps

This input set can drive many other techniques in Pi.  You can adjust the above methods to add seeds and steer the AI in different ways.
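As a starting point for adding seeds, the sketch below builds a small seed list in plain Python. It assumes each seed mirrors the generated examples, that is, one `llm_input` / `llm_output` pair; the exact type `start_job` expects is not shown in this notebook, so check the SDK reference before relying on this shape. The prompts and placeholder stories are hypothetical.

```python
# Assumption: each seed mirrors the generated examples, i.e. one
# llm_input / llm_output pair. Verify the exact type in the SDK reference.
seed_pairs = [
    ("Tell a short fable about honesty.", "<your reference story here>"),
    ("Tell a short fable about patience.", "<your reference story here>"),
]

seeds = [{"llm_input": prompt, "llm_output": story} for prompt, story in seed_pairs]

# These would then replace the empty list passed earlier, for example:
# client.data.generate_input_response_pairs.start_job(..., seeds=seeds, ...)
```

Hand-written seeds that reflect the tone and length you want are one way to steer generation toward your use case.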