<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Calibrate_with_User_Preferences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Calibrate with User Preferences

This Colab is the companion to the Preference Collection Playground, showing how you can apply preference data to your training pipeline.

It's easier to collect training data from the UI, but this Colab will have you rate a small number of examples in-line.

We will walk through the same `Aesop AI` example, but any contract with feedback data should work.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai.  Add it to your notebook secrets (the key symbol) on the left.

Run the cell below to install packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()

## Load a Pi-Scorer and a Dataset

In [2]:
# @title Load Scorer
from withpi_utils.colab import load_scorer_from_web, display_scorer

aesop_scorer = load_scorer_from_web("https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/contracts/aesop_ai.json")

display_scorer(aesop_scorer)

In [5]:
# @title Load dataset
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop")

print(aesop_dataset)


DatasetDict({
    train: Dataset({
        features: ['input', 'output'],
        num_rows: 23
    })
})


## Cluster Inputs

We're going to label some examples as good or bad. To keep the labelling effort small, it helps to focus on a few different types of input, so we'll cluster the inputs and review only a couple of examples per cluster.

In [24]:
import pandas as pd

input_topic_clusters = client.data.cluster_inputs(
    inputs=[
        {"identifier": str(index), "llm_input": row["input"]}
        for index, row in enumerate(aesop_dataset["train"])
    ],
)

cluster_data = []
topics = [None] * len(aesop_dataset["train"])
for cluster in input_topic_clusters:
    cluster_data.append([cluster.topic, cluster.inputs, len(cluster.inputs)])
    for item in cluster.inputs:
        topics[int(item)] = cluster.topic

cluster_df = pd.DataFrame(cluster_data, columns=["Topic", "Items", "Size"])
cluster_df

| | Topic | Items | Size |
|---|---|---|---|
| 0 | Animal-Based Moral Story Requests | [1, 4, 5, 7, 8, 9, 11, 14, 16, 17, 18, 20] | 12 |
| 1 | "Storytelling with Moral Lessons" | [2, 6, 12, 13, 15, 21, 22] | 7 |
| 2 | "Tortoise Fables with Moral Lessons" | [0, 3, 10, 19] | 4 |
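The loop that fills `topics` assumes every row lands in exactly one cluster. As a minimal sketch of that inversion with a sanity check, the snippet below reuses the cluster contents from the table above; `invert_clusters` is a hypothetical helper, not part of the Pi SDK.

```python
# Cluster contents copied from the table above (topic -> row indices).
clusters = {
    "Animal-Based Moral Story Requests": [1, 4, 5, 7, 8, 9, 11, 14, 16, 17, 18, 20],
    '"Storytelling with Moral Lessons"': [2, 6, 12, 13, 15, 21, 22],
    '"Tortoise Fables with Moral Lessons"': [0, 3, 10, 19],
}


def invert_clusters(clusters, n_rows):
    """Return a list mapping each row index to its topic.

    Raises ValueError if a row appears in two clusters or in none.
    """
    topics = [None] * n_rows
    for topic, items in clusters.items():
        for item in items:
            if topics[item] is not None:
                raise ValueError(f"Row {item} is assigned to more than one cluster")
            topics[item] = topic
    missing = [i for i, topic in enumerate(topics) if topic is None]
    if missing:
        raise ValueError(f"Rows without a cluster: {missing}")
    return topics


topics_by_row = invert_clusters(clusters, n_rows=23)
print(topics_by_row[0])
```

A check like this is cheap and catches the case where the clustering call skips an input, which would otherwise leave `None` in the topic column.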


## Identify outliers

Let's first score every example (each input and output pair) with the scorer and add the result as a column, so the highest- and lowest-scoring examples stand out.  Pi scoring is fast enough that processing the dataset serially is fine, though issuing requests in parallel would be faster.

In [27]:
from tqdm import tqdm

scores = []
for example in tqdm(aesop_dataset["train"]):
  scores.append(client.scoring_system.score(
       scorer=aesop_scorer,
       llm_input=example["input"],
       llm_output=example["output"])
  )

df = pd.DataFrame({
    "input": aesop_dataset["train"]["input"],
    "output": aesop_dataset["train"]["output"],
    "cluster topic": topics,
    "score": [score.total_score for score in scores]}
)

df

100%|██████████| 23/23 [00:06<00:00,  3.43it/s]


| | input | output | cluster topic | score |
|---|---|---|---|---|
| 0 | Write a children's story in the style of Aesop... | Barnaby the hare was a blur of twitching whisk... | "Tortoise Fables with Moral Lessons" | 0.831055 |
| 1 | Tell a fable about a crow and a fox that illus... | Once upon a time, in a sun-drenched forest, li... | Animal-Based Moral Story Requests | 0.88954 |
| 2 | Create a story featuring a lion and a mouse th... | Leo the lion, king of the sprawling savanna, w... | "Storytelling with Moral Lessons" | 0.808051 |
| 3 | Write a fable involving a tortoise and a hare ... | The Tortoise and the Determined Hare\n\nIn the... | "Tortoise Fables with Moral Lessons" | 0.974392 |
| 4 | Tell a story about a greedy dog who loses his ... | Barnaby the Beagle was a dog of magnificent ap... | Animal-Based Moral Story Requests | 0.972222 |
| 5 | Spin a tale with a squirrel and an owl teachin... | Barnaby the squirrel was renowned throughout t... | Animal-Based Moral Story Requests | 0.980035 |
| 6 | Compose a fable with a feuding sun and wind th... | The Sun and the Wind\n\nThe Sun, a fiery ball ... | "Storytelling with Moral Lessons" | 0.876519 |
| 7 | Dream up a story involving a hummingbird and a... | Pip the hummingbird, a flash of emerald and ru... | Animal-Based Moral Story Requests | 0.987413 |
| 8 | Tell a saga with a rabbit and a cunning crow i... | Barnaby the rabbit was a champion hopper. He'... | Animal-Based Moral Story Requests | 0.999783 |
| 9 | Craft a fable about a young rabbit needing hel... | Barnaby Bunson was a young rabbit with a very ... | Animal-Based Moral Story Requests | 0.987847 |
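The scoring loop above runs serially. If you want more throughput, one option is a thread pool, since each scoring call spends most of its time waiting on the network. The sketch below is an illustration under assumptions: `score_all` is a hypothetical helper, and `fake_score` is a stand-in so the snippet runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor


def score_all(examples, score_fn, max_workers=4):
    """Apply score_fn to every example concurrently.

    pool.map preserves input order, so results line up with the dataset rows.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_fn, examples))


# Hypothetical stand-in for the real scoring call, so this runs offline.
def fake_score(example):
    return len(example["output"]) / 100.0


examples = [
    {"input": "a", "output": "x" * 10},
    {"input": "b", "output": "x" * 50},
]
parallel_scores = score_all(examples, fake_score)
print(parallel_scores)
```

In the notebook you would pass a function that wraps the same `client.scoring_system.score(...)` call used in the cell above. Keep `max_workers` modest, since the service's rate limits are not covered here.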


## Label data

Now it's time to label examples against a simple statement: **The response fully satisfies the input according to the scoring system**.  Valid responses are **Strongly Agree**, **Agree**, **Neutral**, **Disagree**, and **Strongly Disagree**, or simply **5** down to **1**.

The cell below selects the lowest- and highest-scoring example from each cluster and asks you to rate each one from **5** down to **1**.

In [None]:
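from datasets import Dataset

def get_label(row):
  """Display one example and prompt until the user enters a rating from 1 to 5."""
  display("Input Prompt:")
  display(row["input"])
  display("Output Response:")
  display(row["output"])
  while True:
    resp = input("Your rating from 1 to 5: ")
    try:
      rating = int(resp)
      if rating not in [1, 2, 3, 4, 5]:
        raise ValueError("Invalid")
    except ValueError:
      display("Invalid input. Try again")
      continue
    break
  # Store the normalized rating as a string, so ' 3' becomes '3'.
  row['label'] = str(rating)
  return row

# Build the scored dataset from the DataFrame above, using the column names this cell expects.
aesop_scored = Dataset.from_pandas(
    df.rename(columns={"cluster topic": "cluster", "score": "uncalibrated_scores"}))

cluster_labels = set(aesop_scored["cluster"])
labelled = []
for cluster in cluster_labels:
  # Sort ascending so index 0 is the lowest score and index -1 the highest.
  # The name `ranked` avoids shadowing the built-in `sorted`.
  ranked = aesop_scored.filter(lambda e: e['cluster'] == cluster).sort("uncalibrated_scores")
  labelled.append(get_label(ranked[0]))
  labelled.append(get_label(ranked[-1]))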

Filter:   0%|          | 0/20 [00:00<?, ? examples/s]

'Input Prompt:'

'Tell a fable about a crow and a fox that illustrates the importance of being cautious with flattery.'

'Output Response:'

'Once upon a time, in a sun-drenched forest, lived a clever crow named Corvus.  Corvus was known throughout the woods for his magnificent, iridescent feathers, which shimmered like captured sunlight.  A sly fox, named Reynard, prowled the forest, his eyes gleaming with cunning.\n\nReynard, always on the lookout for an easy meal, spied Corvus perched on a branch.  He padded silently towards him, a mischievous glint in his eye.\n\n"Oh, Corvus," Reynard purred, his voice smooth as honey. "What a magnificent creature you are!  Your plumage is the most exquisite I\'ve ever beheld.  No other bird in the forest can compare."\n\nCorvus, flattered beyond measure, puffed up his feathers. "Indeed," he crowed, feeling quite important. "My beauty is unparalleled."\n\nReynard continued his flattery. "And your voice! It is like the sweet chime of a thousand bells. I\'ve never heard such melodious notes."\n\nCorvus\'s tail feathers trembled with pride.  "My song is a gift from the gods," he boasted.\n

Your rating from 1 to 5: 1


'Input Prompt:'

"Tell a story about a greedy dog who loses his bone to teach the lesson 'Don't be greedy; appreciate what you have.'"

'Output Response:'

'Barnaby the Beagle was a dog of magnificent appetites.  His nose twitched at the scent of a crumb, his eyes widened at the sight of a juicy bone.  He wasn\'t just greedy, he was *extraordinarily* greedy.  He\'d sniff at a scrap on the ground, then whine and whine until he got it, even if it was barely a nibble.  Barnaby considered every bone, every bit of meat, a personal conquest.\n\nOne sunny afternoon, Barnaby discovered a magnificent bone. It wasn\'t just any bone; it was a bone of dazzling white, smooth as polished marble, and plump with marrow.  It was the most beautiful bone Barnaby had ever laid eyes on.  He practically drooled with delight.\n\n"Mine!" he barked, wagging his tail with exaggerated fervor.  He snatched the bone, ignoring the whimpers of a smaller, less fortunate dog named Pip who\'d been eyeing it wistfully.  Barnaby happily gnawed, his eyes sparkling with satisfaction.  But the bone, though beautiful, tasted rather bland.\n\nHe spent the next hour circling the 

Your rating from 1 to 5: 3


Filter:   0%|          | 0/20 [00:00<?, ? examples/s]

'Input Prompt:'

"Create a story featuring a lion and a mouse that conveys the message 'Even the smallest friends can be the most helpful.'"

'Output Response:'

'Leo the lion, king of the sprawling savanna, was a creature of immense pride. He ruled with a roar that echoed through the tall grass, and his mane, a golden crown, shimmered in the sun.  He had everything a lion could want – plenty of meat, a comfortable den, and, of course, the unwavering respect of all the other animals.\n\nBut Leo was also very, very lonely.  He spent his days patrolling his territory, his powerful paws barely touching the ground as he stalked imaginary prey.  He was so focused on his own importance that he hardly noticed the smaller creatures around him.\n\nOne scorching afternoon, Leo was napping under a baobab tree, his magnificent roar reduced to a gentle snore.  Suddenly, a tiny squeak shattered the silence.  A tiny mouse, no bigger than Leo\'s paw, had somehow gotten tangled in a thick vine near his mighty foot.\n\nThe mouse, frantic with fear, squeaked and squirmed, unable to free itself.  Leo, roused from his sleep, grumbled.  "Another pest," he muttered, 

Your rating from 1 to 5: 2


'Input Prompt:'

"Think up a scenario involving a butterfly embarrassed by its wings to highlight 'Embrace your uniqueness.'"

'Output Response:'

'Pip the butterfly was a beautiful creature, but Pip had a secret. He wasn\'t happy with his wings.  All the other butterflies had wings so perfectly patterned, so vibrant and shimmery.  Pip\'s wings were… well, they were speckled.  Tiny, muted spots of brown, grey, and pale yellow, scattered across a background of almost invisible, dusty rose.  They weren\'t flashy, they weren\'t striking.  They were, in Pip\'s opinion, plain.\n\nPip watched the other butterflies flitting from flower to flower, their wings catching the sunlight in dazzling displays.  He would flutter close, but only briefly, then retreat to the shadows of the willow tree.  He was ashamed.  He wished his wings were like the vibrant monarchs, the dazzling swallowtails, the graceful painted ladies.  He longed to be admired for his beauty, not pitied for his…unremarkable wings.\n\nOne day, a wise old ladybug, named Dot, spotted Pip hiding under a leaf.  "Why the long face, little one?" she asked gently.\n\nPip, blushing a

Your rating from 1 to 5: 3


Filter:   0%|          | 0/20 [00:00<?, ? examples/s]

'Input Prompt:'

"Create a story with a novice ant who refuses guidance from elders and learns that 'Listening gives you endless wisdom.'"

'Output Response:'

'Tiny, a brand-new ant, emerged from her chrysalis, her tiny legs trembling.  The sun warmed her tiny back, and a delicious scent of sweet dew filled the air.  She scurried about, her antennae twitching with curiosity.  Nearby, old Anthea, with her wise, weathered wings, and Elder Ant, with his glistening, experience-marked mandibles, were diligently collecting crumbs.\n\n"Tiny, dear," Anthea whispered, her voice like rustling leaves. "The best path to the meadow is through the whispering willows."\n\n"Whispering willows?" Tiny scoffed. "I\'m Tiny!  I\'m an amazing ant! I\'ll find the best path myself!" And with a determined flick of her head, she darted off, heading straight for the towering, prickly thistles.\n\nThe thistles pricked her, the path was rough, and the sun beat down.  She struggled, her tiny body aching.  She missed the juicy berries Elder Ant had pointed out, and the delicious nectar from the fragrant clover blossoms,  because her path led her past them.\n\nElder Ant, m

Your rating from 1 to 5: 1


'Input Prompt:'

"Create a story with a novice ant who refuses guidance from elders and learns that 'Listening gives you endless wisdom.'"

'Output Response:'



Your rating from 1 to 5: 3
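The selection step (one low and one high scorer per cluster) can also be expressed without any dataset library. The following is a minimal sketch on made-up rows; `pick_extremes` is a hypothetical helper and the scores are illustrative, not results from this notebook. It also skips the second pick when a cluster holds a single example, so the same row is never rated twice.

```python
from collections import defaultdict


def pick_extremes(rows, group_key, score_key):
    """Return the lowest- and highest-scoring row from each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    picked = []
    for name in sorted(groups):  # sorted() keeps the output order deterministic
        ranked = sorted(groups[name], key=lambda r: r[score_key])
        picked.append(ranked[0])
        if len(ranked) > 1:  # a single-row group would otherwise be picked twice
            picked.append(ranked[-1])
    return picked


# Illustrative rows only.
rows = [
    {"id": 0, "cluster": "A", "score": 0.83},
    {"id": 1, "cluster": "A", "score": 0.97},
    {"id": 2, "cluster": "A", "score": 0.90},
    {"id": 3, "cluster": "B", "score": 0.81},
]
to_label = pick_extremes(rows, group_key="cluster", score_key="score")
print([row["id"] for row in to_label])
```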


## Calibrate

Now it's time to calibrate the scoring system with the labelled examples.  The following cell launches a calibration job and monitors it until completion.

In [None]:
def to_rating(label):
  match label:
    case '1':
      return "Strongly Disagree"
    case '2':
      return "Disagree"
    case '3':
      return "Neutral"
    case '4':
      return "Agree"
    case '5':
      return "Strongly Agree"

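# NOTE: `aesop_scoring_system` and `stream_response` are not defined in the cells above.
# This cell assumes `aesop_scoring_system` holds the scoring system to calibrate, and that
# `stream_response` is a helper that polls the job until it finishes and returns the result.
scoring_system_calibration_status = client.pi_scoring_system.calibrate.start_job(
    scoring_system=aesop_scoring_system,
    examples=[
        {"llm_input": row['input'],
         "llm_output": row['output'],
         "rating": to_rating(row['label'])}
        for row in labelled]
)
aesop_scoring_system_calibrated = stream_response(
    scoring_system_calibration_status.job_id,
    client.pi_scoring_system.calibrate).calibrated_scoring_system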

Detailed Status for contract_calibration_jobs:09ac227c912792130a876568ea872593308c0d4b3d7c896ec7991f041cbeedd8:012e635a-ff1f-4f45-9c3d-99a6d7993000
LAUNCHING
RUNNING
Training the AST...
Overall initial loss = 0.4827003761574074
Optimizing ROOT + dim:step_c4cd484b-e4c0-416c-823a-c92eac1e826b ...
Initial loss = 0.4827003761574074
Best trial = Measurement(metrics={'a4709c07-218a-4375-a483-74f0215576e8_loss': Metric(value=0.39789842966806854, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_76333b77-bfd1-4992-af48-50a96eec371b ...
Initial loss = 0.39789842966806854
Best trial = Measurement(metrics={'a4709c07-218a-4375-a483-74f0215576e8_loss': Metric(value=0.38478766358371086, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_e3ffabc6-12b4-4ecc-baef-a6ec54f45341 ...
Initial loss = 0.38478766358371086
Best trial = Measurement(metrics={'a4709c07-218a-4375-a483-7
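The `to_rating` function above returns `None` for any label outside `'1'` to `'5'`, which would send a missing rating to the calibration job. A stricter variant, sketched here with hypothetical names (`to_rating_strict`, `build_examples`), fails early instead. The example dictionary keys mirror the ones used in the cell above.

```python
RATINGS = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}


def to_rating_strict(label):
    """Map a 1-5 label (int or numeric string) to its rating text, or raise."""
    try:
        return RATINGS[int(label)]
    except (KeyError, ValueError, TypeError):
        raise ValueError(f"Expected a rating from 1 to 5, got {label!r}") from None


def build_examples(rows):
    """Convert labelled rows into the example dictionaries used for calibration."""
    return [
        {
            "llm_input": row["input"],
            "llm_output": row["output"],
            "rating": to_rating_strict(row["label"]),
        }
        for row in rows
    ]


sample = build_examples([{"input": "prompt", "output": "story", "label": "2"}])
print(sample[0]["rating"])
```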

## Rescore after calibration

Now add a new column with calibrated scores. You can examine these to see if they more closely align with the examples you labelled.  Ideally the score starts separating good responses from bad.

If it does not, that suggests the properties you **really** care about aren't captured in your scoring dimensions and will need to be added.  Proceed to the playgrounds at http://play.withpi.ai to experiment with this.

If this is looking good, you have a powerful function for improving your system.

In [None]:
for row in labelled:
   row['calibrated'] = client.pi_scoring_system.score(
       scoring_system=aesop_scoring_system_calibrated,
       llm_input=row["input"],
       llm_output=row["output"]).total_score
   print(f"Label: {row['label']}, Original Score: {row['uncalibrated_scores']}, Calibrated: {row['calibrated']}")
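Reading the printed scores by eye works for six examples. For larger label sets, one simple check is pairwise agreement: among pairs of examples with different labels, how often does the score order match the label order? The sketch below is an assumption about how you might measure this, not a Pi feature, and the scores are made up for illustration.

```python
from itertools import combinations


def pairwise_agreement(labels, scores):
    """Fraction of differently-labelled pairs whose scores are ordered the same way."""
    agree = 0
    total = 0
    for (l1, s1), (l2, s2) in combinations(zip(labels, scores), 2):
        if l1 == l2:
            continue  # equal labels say nothing about which score should be higher
        total += 1
        if (l1 - l2) * (s1 - s2) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Illustrative numbers only, not output from this notebook.
labels = [1, 3, 5]
before = [0.90, 0.80, 0.95]
after = [0.20, 0.60, 0.90]

agreement_before = pairwise_agreement(labels, before)
agreement_after = pairwise_agreement(labels, after)
print(agreement_before, agreement_after)
```

If agreement does not improve after calibration, that points to the same conclusion as above: the dimensions may not capture what you care about.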

## Save calibrated scoring system

The updated scoring system now has different weights assigned to its dimensions.  Save those for later.

In [None]:
# Write the calibrated scoring system to disk as JSON.
with open('aesop_ai_calibrated.json', 'w') as f:
    f.write(aesop_scoring_system_calibrated.model_dump_json(indent=2))
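To reuse the saved file in a later session, you can read the JSON back and confirm it round-trips. This is a minimal sketch using only the standard library. The `calibrated_stub` dictionary is a hypothetical placeholder and does not reflect the real scoring system schema.

```python
import json
import tempfile
from pathlib import Path


def save_json_text(path, text):
    """Write an already-serialized JSON string to disk."""
    Path(path).write_text(text, encoding="utf-8")


def load_json(path):
    """Read a JSON file back into plain Python objects."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical placeholder content; the real schema may differ.
calibrated_stub = {"name": "aesop_ai", "dimensions": [{"label": "moral", "weight": 0.7}]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "aesop_ai_calibrated.json"
    save_json_text(target, json.dumps(calibrated_stub, indent=2))
    restored = load_json(target)

print(restored == calibrated_stub)
```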

## Next Steps

Now that you have a calibrated scoring system, other parts of Pi should work better.  This Colab used a limited amount of hand-labeled data, but scaling up this feedback loop will pay dividends.