<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Generate_Synthetic_Training_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Generate Synthetic Training Data

Many evaluation and training techniques require pairs of inputs and LLM responses, but collecting high-quality data can be painful and expensive.

Generating this data with AI support can give you a higher-quality set with much less effort.  This notebook walks you through generating full Examples (input and output pairs).

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets tqdm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Set up the scoring system

Let's say we're building an AI to generate stories in the style of Aesop's Fables.  As in test-driven development, we first decide what we want from the system.  Initialize a Scoring System and a score function:

In [None]:
scoring_spec = [{'question': q} for q in [
    "Does the response contain a clear beginning, middle, and end?",
    "Does the story follow a logical progression of events?",
    "Does the story resolve the conflict in a satisfying manner?",
    "Is the life lesson clearly conveyed in the story?",
    "Is the life lesson relevant to the input provided by the user?"
]]

def score(llm_input, llm_output):
    # Score one input/output pair against every question in scoring_spec.
    return pi.scoring_system.score(
        scoring_spec=scoring_spec,
        llm_input=llm_input,
        llm_output=llm_output,
    )

## Generate an Example Set

To start with, we'll need some training examples containing plausible moral lessons and reasonable stories.

In [None]:
from withpi_utils import stream
from datasets import Dataset

synthetic_data_generation_status = pi.data.generate_input_response_pairs.start_job(
    system_prompt="""
Write a children's story in the style of Aesop's Fables teaching a life lesson
specified by the user. Provide just the story with no extra content.
""",
    num_pairs_to_generate=9,
    seeds=[],
    batch_size=3,
    num_shots=3,
)

examples = []

for data in stream(pi.data.generate_input_response_pairs, synthetic_data_generation_status):
    examples.append(data)
    print(f"[OUTPUT] - {data}")

examples = Dataset.from_list(examples)

LAUNCHING
Still waiting...
[INFO] Generating 9 seeds as they are not provided.
[INFO] Yielding generated 9 seeds
[INFO] Synthetic Data Generation Complete => Good Pairs: 9. Bad Pairs: 0. Similar Pairs: 0
DONE
[OUTPUT] - {'llm_input': "Write a children's story teaching the importance of sharing with others.", 'llm_output': 'Once upon a time in a lush green forest, there lived a clever little squirrel named Sammy. Sammy was known far and wide for his incredible acorn stash. Every fall, while other animals toiled to collect their food, Sammy would gather the most acorns anyone had ever seen, storing them in his cozy tree hollow.\n\nOne crisp autumn day, as Sammy admired his mountain of acorns, he heard a soft rustling nearby. Curiosity piqued, he peeked out and saw Lily the rabbit, her nose twitching with hunger. She looked thin and weary, having struggled to find enough food for the coming winter.\n\n"Hello, Sammy," Lily called, trying to sound cheerful. "Your acorns are so beautiful! Do

## Score the data

Let's score the data to see how the generator did.  This lets us focus on the "good" responses for use later.

In [None]:
from withpi_utils.colab import pretty_print_responses
from tqdm.notebook import tqdm

for example in tqdm(examples):
    pretty_print_responses(
        header="#### Input:\n" + example['llm_input'],
        response1="#### Output:\n" + example['llm_output'],
        left_label="Pi Synthetic Data",
        scores_left=score(example['llm_input'], example['llm_output']),
    )
    print("\n\n")

  0%|          | 0/9 [00:00<?, ?it/s]

0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,0.754,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.951







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,1.0,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,1.0







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,0.996,
Does the story resolve the conflict in a satisfying manner?,0.809,
Is the life lesson clearly conveyed in the story?,0.941,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.949







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,0.754,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.951







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,0.824,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.965







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,0.715,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.943







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,0.98,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.996







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,0.801,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,0.96







0,1,2
"Does the response contain a clear beginning, middle, and end?",1.0,
Does the story follow a logical progression of events?,1.0,
Does the story resolve the conflict in a satisfying manner?,1.0,
Is the life lesson clearly conveyed in the story?,1.0,
Is the life lesson relevant to the input provided by the user?,1.0,
Total score,,1.0






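To focus on the "good" responses, one option is to keep only the examples whose total score clears a cutoff. The sketch below works on the numbers printed in the tables above rather than on live SDK results. In this output, each displayed total matches the unweighted mean of the five question scores; that is an observation about these nine tables, not a documented guarantee of the scoring system. The `threshold` value of 0.95 and the helper names are hypothetical.

```python
from statistics import mean

# Per-question scores copied from the nine tables printed above, in order.
question_scores = [
    [1.0, 1.0, 0.754, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [1.0, 0.996, 0.809, 0.941, 1.0],
    [1.0, 1.0, 0.754, 1.0, 1.0],
    [1.0, 1.0, 0.824, 1.0, 1.0],
    [1.0, 1.0, 0.715, 1.0, 1.0],
    [1.0, 1.0, 0.98, 1.0, 1.0],
    [1.0, 1.0, 0.801, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 1.0],
]

# "Total score" rows copied from the same tables.
displayed_totals = [0.951, 1.0, 0.949, 0.951, 0.965, 0.943, 0.996, 0.96, 1.0]


def total_score(scores):
    """Unweighted mean of the question scores, rounded to three places."""
    return round(mean(scores), 3)


def keep_good(totals, threshold):
    """Return the indices of examples whose total is at or above threshold."""
    return [i for i, total in enumerate(totals) if total >= threshold]


totals = [total_score(scores) for scores in question_scores]

# Check the observation: the recomputed means agree with the displayed totals.
matches = all(abs(a - b) < 1e-9 for a, b in zip(totals, displayed_totals))
print("means match displayed totals:", matches)

# Hypothetical cutoff; tune it to how strict the downstream use needs to be.
good_indices = keep_good(totals, threshold=0.95)
print("kept examples:", good_indices)
```

With a live run, the same filter could be applied to whatever total each `score(...)` call reports, and the surviving indices used to select rows from `examples`.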

## Next Steps

This set can drive training or evaluation workflows.  You can adjust the calls above, for example by passing `seeds`, to steer the generator in different directions.
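If the same set feeds both training and evaluation, it helps to hold some examples out. Below is a minimal sketch using only the standard library; `split_examples`, `eval_fraction`, and `seed` are hypothetical names, and the records are placeholders that only mimic the `llm_input` / `llm_output` keys of the generated pairs.

```python
import random


def split_examples(examples, eval_fraction=0.33, seed=0):
    """Shuffle a copy of the examples and return (train, held_out) lists."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder records, not real generator output.
placeholder_examples = [
    {"llm_input": f"lesson {i}", "llm_output": f"story {i}"} for i in range(9)
]

train, held_out = split_examples(placeholder_examples)
print(len(train), "training examples,", len(held_out), "held out")
```

Since `examples` in this notebook is a `datasets.Dataset`, its built-in `train_test_split` method is an alternative to the hand-rolled helper.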