<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Synthetic_Data_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Synthetic Data Generation

Many techniques need inputs and LLM responses to drive evaluation and training, but collecting high-quality data can be painful and expensive.

Generating this data with AI support can give you a higher-quality set with much less effort.  This notebook walks you through generating full Examples (input and output pairs).

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account.  Add it to your notebook secrets (the key symbol on the left).

Run the cell below to install the packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm pandas numpy

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Generate an Example Set

Let's say we want to build an AI that generates stories in the style of Aesop's Fables around given moral lessons.  To start with, we'll need some training examples containing plausible moral lessons and reasonable stories.

In [2]:
from withpi_utils import stream
from datasets import Dataset

synthetic_data_generation_status = pi.data.generate_input_response_pairs.start_job(
    system_prompt="""
Write a children's story in the style of Aesop's Fables teaching a life lesson
specified by the user. Provide just the story with no extra content.
""",
    num_pairs_to_generate=9,
    seeds=[],
    batch_size=3,
    num_shots=3,
)

examples = []

for data in stream(pi.data.generate_input_response_pairs, synthetic_data_generation_status):
    examples.append(data)
    print(f"[OUTPUT] - {data}")

examples = Dataset.from_list(examples)

LAUNCHING
[INFO] Generating 9 seeds as they are not provided.
[INFO] Yielding generated 9 seeds
[INFO] Synthetic Data Generation Complete => Good Pairs: 9. Bad Pairs: 0. Similar Pairs: 0
DONE
[OUTPUT] - {'llm_input': 'Write a story teaching children the importance of sharing with others.', 'llm_output': "Once upon a time in a lush, green forest, there lived a cheerful little rabbit named Ruby. Ruby was known for her beautiful, shiny carrots, which she grew in her own secret garden. She loved her carrots so much that she always kept them hidden, fearing that if she shared them, there would be none left for her.\n\nOne sunny morning, while Ruby was hopping through the forest, she stumbled upon her good friend Benny the Badger. Benny looked sad and tired as he sat by a tree, his tummy grumbling loudly. Ruby's heart sank when she saw how hungry he was.\n\n“Benny! What’s wrong?” Ruby asked, her ears perked in concern.\n\n“Oh Ruby,” Benny said with a sigh, “I haven’t eaten all day. I just ca
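
Before scoring, it can help to sanity-check the generated pairs. The sketch below is a minimal, plain-Python check that assumes each record is a dict with the `llm_input` and `llm_output` keys seen in the output above. The helper `find_problem_records` and the sample records are hypothetical stand-ins for illustration, not part of the Pi SDK and not real generator output.

```python
# Stand-in records shaped like the generated pairs (keys taken from the output above).
# The second and third records are hypothetical, made up to trigger each check.
sample_examples = [
    {
        "llm_input": "Write a story teaching children the importance of sharing with others.",
        "llm_output": "Once upon a time in a lush, green forest, there lived a cheerful little rabbit named Ruby.",
    },
    {
        "llm_input": "Write a story teaching children why honesty matters.",
        "llm_output": "",  # empty response
    },
    {
        "llm_input": "Write a story teaching children the importance of sharing with others.",
        "llm_output": "A different story for the same prompt.",  # repeated input
    },
]


def find_problem_records(records):
    """Return (index, reason) tuples for records with missing or repeated fields."""
    problems = []
    seen_inputs = set()
    for index, record in enumerate(records):
        llm_input = record.get("llm_input", "").strip()
        llm_output = record.get("llm_output", "").strip()
        if not llm_input:
            problems.append((index, "missing llm_input"))
        elif llm_input in seen_inputs:
            problems.append((index, "duplicate llm_input"))
        if not llm_output:
            problems.append((index, "missing llm_output"))
        seen_inputs.add(llm_input)
    return problems


problems = find_problem_records(sample_examples)
for index, reason in problems:
    print(f"Record {index}: {reason}")
```

In the notebook, the same helper could be called on `examples` directly, since iterating over a `Dataset` yields one dict per row.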

## Score the data

Let's score the data to see how the generator did.  This lets us focus on the "good" responses for use later.

In [9]:
# @title Let's Score and manually inspect the data
from withpi_utils.colab import pretty_print_responses
from tqdm.notebook import tqdm

scoring_spec = [{'question': q} for q in [
    "Does the response contain a clear beginning, middle, and end?",
    "Does the story follow a logical progression of events?",
    "Does the story resolve the conflict in a satisfying manner?",
    "Is the life lesson clearly conveyed in the story?",
    "Is the life lesson relevant to the input provided by the user?"
]]
for example in tqdm(examples):
    score = pi.scoring_system.score(
        llm_input=example['llm_input'],
        llm_output=example['llm_output'],
        scoring_spec=scoring_spec,
    )
    pretty_print_responses(
        header="#### Input:\n" + example['llm_input'],
        response1="#### Output:\n" + example['llm_output'],
        left_label="Pi Synthetic Data",
        scores_left=score,
    )
    print("\n\n")

Scores for the 9 generated examples, in the order they were scored:

| Question | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.828 | 0.863 | 0.918 | 0.902 | 1.0 | 1.0 | 0.789 | 0.914 | 0.793 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.992 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Total score | 0.966 | 0.973 | 0.984 | 0.98 | 1.0 | 1.0 | 0.958 | 0.983 | 0.957 |

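To focus on the "good" responses, one option is to keep only the examples whose total score clears a cutoff. The sketch below works on plain floats copied from the total scores in the output above. The cutoff `MIN_TOTAL_SCORE` and the helper `select_good_indices` are hypothetical, and how you read the total from the returned `score` object depends on the SDK, so that step is left out.

```python
# Total scores copied from the output above, in example order.
total_scores = [0.966, 0.973, 0.984, 0.98, 1.0, 1.0, 0.958, 0.983, 0.957]

# Hypothetical cutoff; tune it for your own data.
MIN_TOTAL_SCORE = 0.97


def select_good_indices(scores, min_score):
    """Return the positions of the scores that are at least min_score."""
    return [index for index, score in enumerate(scores) if score >= min_score]


good_indices = select_good_indices(total_scores, MIN_TOTAL_SCORE)
print(f"Keeping {len(good_indices)} of {len(total_scores)} examples: {good_indices}")
```

With a `datasets.Dataset`, `examples.select(good_indices)` returns the matching subset for later use.
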
## Next Steps

This example set can drive many other techniques in Pi.  To steer the AI in different ways, adjust the calls above, for example by supplying your own `seeds` or changing the system prompt.