<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Calibrate_with_User_Preferences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Calibrate with User Preferences

This Colab is the companion to the Preference Collection Playground, showing how you can apply preference data to your training pipeline.

It's easier to collect training data from the UI, but this Colab will have you rate a small number of examples in-line.

We will walk through the same `Aesop AI` example, but any contract with feedback data should work.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account. Add it to your notebook secrets (the key symbol in the left sidebar).

Run the cell below to install packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm pandas numpy

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

# The rest of this notebook refers to the SDK client as `client`.
client = PiClient()


## Load a scoring spec and a dataset

In [None]:
# @title Load Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

aesop_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/aesop_ai.json"
)

display_scoring_spec(aesop_scoring_spec)

In [None]:
# @title Load dataset
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")

print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})


## Cluster Inputs

We're going to rate a handful of responses, and it helps to cover each distinct type of input. Clustering the inputs by topic lets us sample a few examples per cluster instead of reviewing every row.

In [None]:
import pandas as pd

input_topic_clusters = client.data.cluster_inputs(
    inputs=[
        {"identifier": str(index), "llm_input": row["input"]}
        for index, row in enumerate(aesop_dataset)
    ],
)

cluster_data = []
topics = [None] * len(aesop_dataset)
for cluster in input_topic_clusters:
    cluster_data.append([cluster.topic, cluster.inputs, len(cluster.inputs)])
    for item in cluster.inputs:
        topics[int(item)] = cluster.topic

cluster_df = pd.DataFrame(cluster_data, columns=["Topic", "Items", "Size"])
cluster_df

Unnamed: 0,Topic,Items,Size
0,"""Animal Fables with Moral Lessons""","[1, 4, 5, 7, 8, 9, 11, 14, 16, 17, 18, 20]",12
1,"""Fables Teaching Moral Lessons""","[2, 6, 12, 13, 15, 21, 22]",7
2,"""Tortoise-Themed Moral Story Requests""","[0, 3, 10, 19]",4


## Identify outliers

Let's first score every example (input and output pair) with the scoring spec and add the total score as a column. Pi scoring is fast enough that processing the dataset serially is fine, though we could increase parallelism for more speed.

In [None]:
# @title Score all examples
from tqdm import tqdm

scores = []
for example in tqdm(aesop_dataset):
    scores.append(
        client.scoring_system.score(
            scoring_spec=aesop_scoring_spec,
            llm_input=example["input"],
            llm_output=example["output"],
        )
    )

df = pd.DataFrame(
    {
        "input": aesop_dataset["input"],
        "output": aesop_dataset["output"],
        "cluster topic": topics,
        "score": [score.total_score for score in scores],
    }
)

df

100%|██████████| 23/23 [00:05<00:00,  4.00it/s]


Unnamed: 0,input,output,cluster topic,score
0,Write a children's story in the style of Aesop...,Barnaby the hare was a blur of twitching whisk...,"""Tortoise-Themed Moral Story Requests""",0.831055
1,Tell a fable about a crow and a fox that illus...,"Once upon a time, in a sun-drenched forest, li...","""Animal Fables with Moral Lessons""",0.88954
2,Create a story featuring a lion and a mouse th...,"Leo the lion, king of the sprawling savanna, w...","""Fables Teaching Moral Lessons""",0.808051
3,Write a fable involving a tortoise and a hare ...,The Tortoise and the Determined Hare\n\nIn the...,"""Tortoise-Themed Moral Story Requests""",0.974392
4,Tell a story about a greedy dog who loses his ...,Barnaby the Beagle was a dog of magnificent ap...,"""Animal Fables with Moral Lessons""",0.972222
5,Spin a tale with a squirrel and an owl teachin...,Barnaby the squirrel was renowned throughout t...,"""Animal Fables with Moral Lessons""",0.980035
6,Compose a fable with a feuding sun and wind th...,"The Sun and the Wind\n\nThe Sun, a fiery ball ...","""Fables Teaching Moral Lessons""",0.876519
7,Dream up a story involving a hummingbird and a...,"Pip the hummingbird, a flash of emerald and ru...","""Animal Fables with Moral Lessons""",0.987413
8,Tell a saga with a rabbit and a cunning crow i...,Barnaby the rabbit was a champion hopper. He'...,"""Animal Fables with Moral Lessons""",0.999783
9,Craft a fable about a young rabbit needing hel...,Barnaby Bunson was a young rabbit with a very ...,"""Animal Fables with Moral Lessons""",0.987847
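
The scoring loop above runs serially. If more speed is needed, one option is a thread pool, since each scoring call mostly waits on the network. The following is a minimal sketch and not part of the Pi SDK: `score_in_parallel` and `fake_score` are hypothetical names, and `fake_score` stands in for the real scoring call so the example runs offline. It also assumes the client can be shared safely across threads, which this notebook does not confirm.

```python
from concurrent.futures import ThreadPoolExecutor


def score_in_parallel(examples, score_fn, max_workers=4):
    """Apply score_fn to every example concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the inputs.
        return list(pool.map(score_fn, examples))


def fake_score(example):
    """Hypothetical stand-in for the real scoring call (no network access)."""
    return len(example["output"]) / 100.0


demo_examples = [
    {"input": "a", "output": "x" * 10},
    {"input": "b", "output": "x" * 50},
]
demo_scores = score_in_parallel(demo_examples, fake_score)
print(demo_scores)
```

In the notebook, `score_fn` could wrap the existing call, for example `lambda ex: client.scoring_system.score(scoring_spec=aesop_scoring_spec, llm_input=ex["input"], llm_output=ex["output"])`. Because results come back in input order, the scores still line up with the dataset rows.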


In [None]:
# @title Manually inspect the scores
from withpi_utils.colab import pretty_print_responses


for i in tqdm(range(10)):
    row = aesop_dataset[i]
    pretty_print_responses(
        header="#### Input:\n" + row["input"],
        response1="#### Output:\n" + row["output"],
        scores_left=scores[i],
    )
    print("\n\n")

  0%|          | 0/10 [00:00<?, ?it/s]

0,1,2
Story Structure,,0.839
,Plot Structure,1.0
,Conflict Introduction,0.758
,Resolution Clarity,0.758
Character Development,,0.862
,Character Presence,1.0
,Character Development,0.758
,Dialogue Quality,0.828
Narrative Engagement,,0.755
,Engaging Narrative,0.754







0,1,2
Story Structure,,0.849
,Plot Structure,1.0
,Conflict Introduction,0.789
,Resolution Clarity,0.758
Character Development,,0.964
,Character Presence,0.891
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.81
,Engaging Narrative,0.754







0,1,2
Story Structure,,0.839
,Plot Structure,1.0
,Conflict Introduction,0.762
,Resolution Clarity,0.754
Character Development,,0.767
,Character Presence,0.762
,Character Development,0.773
,Dialogue Quality,0.766
Narrative Engagement,,0.754
,Engaging Narrative,0.754







0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,0.923
,Character Presence,0.77
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0







0,1,2
Story Structure,,0.91
,Plot Structure,1.0
,Conflict Introduction,0.75
,Resolution Clarity,0.98
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0







0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.88
,Engaging Narrative,1.0


 60%|██████    | 6/10 [00:00<00:00, 59.42it/s]






0,1,2
Story Structure,,0.922
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,0.766
Character Development,,0.849
,Character Presence,0.777
,Character Development,0.77
,Dialogue Quality,1.0
Narrative Engagement,,0.837
,Engaging Narrative,0.754







0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.924
,Engaging Narrative,1.0







0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.999
,Engaging Narrative,1.0







0,1,2
Story Structure,,0.999
,Plot Structure,1.0
,Conflict Introduction,0.996
,Resolution Clarity,1.0
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.928
,Engaging Narrative,1.0


100%|██████████| 10/10 [00:00<00:00, 61.30it/s]
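
To narrow manual inspection to likely outliers, one option is to pull the lowest-scoring rows from each cluster. This is a minimal sketch using plain lists shaped like the `topics` list and the total scores built earlier. `lowest_scoring_per_topic` is a hypothetical helper, and the demo topic labels are shortened placeholders rather than the real cluster names; the demo scores are the first five totals from the table above.

```python
from collections import defaultdict


def lowest_scoring_per_topic(topics, total_scores, k=1):
    """Return {topic: [(row_index, score), ...]} with the k lowest scores per topic."""
    by_topic = defaultdict(list)
    for index, (topic, score) in enumerate(zip(topics, total_scores)):
        by_topic[topic].append((index, score))
    return {
        topic: sorted(rows, key=lambda row: row[1])[:k]
        for topic, rows in by_topic.items()
    }


# Placeholder topic labels paired with the first five total scores shown above.
demo_topics = ["tortoise", "animal", "moral", "tortoise", "animal"]
demo_totals = [0.831055, 0.88954, 0.808051, 0.974392, 0.972222]

outliers = lowest_scoring_per_topic(demo_topics, demo_totals)
for topic, rows in outliers.items():
    print(topic, rows)
```

In the notebook, the same helper could be called as `lowest_scoring_per_topic(topics, [score.total_score for score in scores])`.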









## Label data

Now it's time to label examples against a simple criterion:

**Does the response fully satisfy the input based on the scoring spec above?**

Valid responses are:

* 5: **Strongly Agree**
* 4: **Agree**
* 3: **Neutral**
* 2: **Disagree**
* 1: **Strongly Disagree**



In [None]:
from withpi_utils.colab import pretty_print_responses


def to_rating(label):
    match label:
        case "1":
            return "Strongly Disagree"
        case "2":
            return "Disagree"
        case "3":
            return "Neutral"
        case "4":
            return "Agree"
        case "5":
            return "Strongly Agree"


def get_rating(row, score):
    pretty_print_responses(
        header="#### Input:\n" + row["input"],
        response1="#### Output:\n" + row["output"],
        scores_left=score,
    )
    print("\n\n")

    # Keep asking until the user enters one of the five valid labels.
    while True:
        user_rating = input("Your rating (1 to 5): ").strip()
        if user_rating in {"1", "2", "3", "4", "5"}:
            break
        print("Invalid input. Try again")
    return to_rating(user_rating)


# Take 2 examples from each cluster.
examples = []
for cluster in input_topic_clusters:
    for item in cluster.inputs[:2]:
        row = aesop_dataset[int(item)]

        examples.append(
            {
                "llm_input": row["input"],
                "llm_output": row["output"],
                "rating": get_rating(row, scores[int(item)]),
            }
        )

0,1,2
Story Structure,,0.849
,Plot Structure,1.0
,Conflict Introduction,0.789
,Resolution Clarity,0.758
Character Development,,0.964
,Character Presence,0.891
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.81
,Engaging Narrative,0.754





Your rating (1 to 5): 1


0,1,2
Story Structure,,0.91
,Plot Structure,1.0
,Conflict Introduction,0.75
,Resolution Clarity,0.98
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0





Your rating (1 to 5): 2


0,1,2
Story Structure,,0.839
,Plot Structure,1.0
,Conflict Introduction,0.762
,Resolution Clarity,0.754
Character Development,,0.767
,Character Presence,0.762
,Character Development,0.773
,Dialogue Quality,0.766
Narrative Engagement,,0.754
,Engaging Narrative,0.754





Your rating (1 to 5): 3


0,1,2
Story Structure,,0.922
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,0.766
Character Development,,0.849
,Character Presence,0.777
,Character Development,0.77
,Dialogue Quality,1.0
Narrative Engagement,,0.837
,Engaging Narrative,0.754





Your rating (1 to 5): 4


0,1,2
Story Structure,,0.839
,Plot Structure,1.0
,Conflict Introduction,0.758
,Resolution Clarity,0.758
Character Development,,0.862
,Character Presence,1.0
,Character Development,0.758
,Dialogue Quality,0.828
Narrative Engagement,,0.755
,Engaging Narrative,0.754





Your rating (1 to 5): 5


0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,0.923
,Character Presence,0.77
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0





Your rating (1 to 5): 2
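
Before calibrating, it can help to check how the collected ratings are spread across the scale. The sketch below assumes `examples` is a list of dicts with a `rating` key, as built in the cell above; `summarize_ratings` and `RATING_SCALE` are hypothetical names, and the demo data mirrors the six ratings entered in this run.

```python
from collections import Counter

RATING_SCALE = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]


def summarize_ratings(examples):
    """Count each rating and list the scale points that were never used."""
    counts = Counter(example["rating"] for example in examples)
    missing = [rating for rating in RATING_SCALE if counts[rating] == 0]
    return counts, missing


# The six ratings entered above: 1, 2, 3, 4, 5, 2.
demo_examples = [
    {"rating": rating}
    for rating in [
        "Strongly Disagree",
        "Disagree",
        "Neutral",
        "Agree",
        "Strongly Agree",
        "Disagree",
    ]
]
rating_counts, missing_ratings = summarize_ratings(demo_examples)
print(rating_counts, missing_ratings)
```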


## Calibrate

Now it's time to calibrate the scoring spec with the labeled examples. The following cell launches a calibration job and streams its status until the job completes.

In [None]:
from withpi_utils.colab import stream_response

scoring_system_calibration_status = client.scoring_system.calibrate.start_job(
    scoring_spec=aesop_scoring_spec, examples=examples
)

aesop_scoring_spec_calibrated = stream_response(
    scoring_system_calibration_status.job_id, client.scoring_system.calibrate
).calibrated_scoring_spec

Detailed Status for contract_calibration_jobs:09ac227c912792130a876568ea872593308c0d4b3d7c896ec7991f041cbeedd8:dc25dcec-e058-4f7e-8603-96af9c76a105
LAUNCHING
RUNNING
Training the AST...
Overall initial loss = 0.3816116898148149
Optimizing ROOT + dim:step_3ec959a1-c01e-4ae7-9edb-a9a05755dc15 ...
Initial loss = 0.3816116898148149
Best trial = Measurement(metrics={'b4f54f72-a5e0-46ec-8021-7901f422e8ef_loss': Metric(value=0.34727027628021784, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_ef3775fa-86f7-450b-a682-768c950d5091 ...
Initial loss = 0.34727027628021784
Best trial = Measurement(metrics={'b4f54f72-a5e0-46ec-8021-7901f422e8ef_loss': Metric(value=0.3224004171431197, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_bd653e13-3137-4286-9b8b-38d614be373f ...
Initial loss = 0.3224004171431197
Best trial = Measurement(metrics={'b4f54f72-a5e0-46ec-8021-790

## Rescore after calibration

Now rescore a few examples with the calibrated spec and compare the original scores (left) with the calibrated ones (right). You can examine these to see if they more closely align with the examples you labeled. Ideally the score starts separating good responses from bad.

If it does not, that suggests the properties you **really** care about aren't captured in your scoring dimensions and will need to be added.  Proceed to the playgrounds at http://build.withpi.ai to experiment with this.

If this is looking good, you have a powerful function for improving your system.

In [None]:
from withpi_utils.colab import pretty_print_responses

for i in tqdm(range(5)):
    example = aesop_dataset[i]
    old_score = client.scoring_system.score(
        scoring_spec=aesop_scoring_spec,
        llm_input=example["input"],
        llm_output=example["output"],
    )
    new_score = client.scoring_system.score(
        scoring_spec=aesop_scoring_spec_calibrated,
        llm_input=example["input"],
        llm_output=example["output"],
    )
    pretty_print_responses(
        header="#### Input:\n" + example["input"],
        response1="#### Output:\n" + example["output"],
        response2="#### Output:\n" + example["output"],
        scores_left=old_score,
        scores_right=new_score,
    )

  0%|          | 0/5 [00:00<?, ?it/s]

0,1,2
Story Structure,,0.839
,Plot Structure,1.0
,Conflict Introduction,0.758
,Resolution Clarity,0.758
Character Development,,0.862
,Character Presence,1.0
,Character Development,0.758
,Dialogue Quality,0.828
Narrative Engagement,,0.755
,Engaging Narrative,0.754

0,1,2
Story Structure,,0.713
,Plot Structure,1.0
,Conflict Introduction,0.758
,Resolution Clarity,0.758
Character Development,,0.881
,Character Presence,1.0
,Character Development,0.758
,Dialogue Quality,0.828
Narrative Engagement,,0.755
,Engaging Narrative,0.754


 20%|██        | 1/5 [00:00<00:02,  1.76it/s]

0,1,2
Story Structure,,0.849
,Plot Structure,1.0
,Conflict Introduction,0.789
,Resolution Clarity,0.758
Character Development,,0.964
,Character Presence,0.891
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.81
,Engaging Narrative,0.754

0,1,2
Story Structure,,0.757
,Plot Structure,1.0
,Conflict Introduction,0.789
,Resolution Clarity,0.758
Character Development,,0.977
,Character Presence,0.891
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.81
,Engaging Narrative,0.754


 40%|████      | 2/5 [00:01<00:01,  1.91it/s]

0,1,2
Story Structure,,0.839
,Plot Structure,1.0
,Conflict Introduction,0.762
,Resolution Clarity,0.754
Character Development,,0.767
,Character Presence,0.762
,Character Development,0.773
,Dialogue Quality,0.766
Narrative Engagement,,0.754
,Engaging Narrative,0.754

0,1,2
Story Structure,,0.718
,Plot Structure,1.0
,Conflict Introduction,0.762
,Resolution Clarity,0.754
Character Development,,0.803
,Character Presence,0.762
,Character Development,0.773
,Dialogue Quality,0.766
Narrative Engagement,,0.754
,Engaging Narrative,0.754


 60%|██████    | 3/5 [00:01<00:01,  1.95it/s]

0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,0.923
,Character Presence,0.77
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0

0,1,2
Story Structure,,1.0
,Plot Structure,1.0
,Conflict Introduction,1.0
,Resolution Clarity,1.0
Character Development,,0.93
,Character Presence,0.77
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0


 80%|████████  | 4/5 [00:02<00:00,  1.99it/s]

0,1,2
Story Structure,,0.91
,Plot Structure,1.0
,Conflict Introduction,0.75
,Resolution Clarity,0.98
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0

0,1,2
Story Structure,,0.798
,Plot Structure,1.0
,Conflict Introduction,0.75
,Resolution Clarity,0.98
Character Development,,1.0
,Character Presence,1.0
,Character Development,1.0
,Dialogue Quality,1.0
Narrative Engagement,,0.923
,Engaging Narrative,1.0


100%|██████████| 5/5 [00:02<00:00,  1.95it/s]
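
Eyeballing the side-by-side tables works for a few rows. A rough numeric check is to count how often the score order of two examples matches the order of their ratings. The sketch below is one possible way to do this, not a Pi feature: `pairwise_agreement` and `RATING_VALUE` are hypothetical names, and the demo scores are made up for illustration.

```python
from itertools import combinations

RATING_VALUE = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly Agree": 5,
}


def pairwise_agreement(ratings, scores):
    """Fraction of differently-rated pairs whose score order matches the rating order."""
    agree = 0
    total = 0
    for (r1, s1), (r2, s2) in combinations(zip(ratings, scores), 2):
        v1, v2 = RATING_VALUE[r1], RATING_VALUE[r2]
        if v1 == v2:
            continue  # Equal ratings carry no ordering information.
        total += 1
        if (v1 - v2) * (s1 - s2) > 0:
            agree += 1
    return agree / total if total else float("nan")


demo_ratings = ["Strongly Disagree", "Neutral", "Strongly Agree"]
demo_scores = [0.2, 0.9, 0.6]  # Made-up scores, for illustration only.
agreement = pairwise_agreement(demo_ratings, demo_scores)
print(agreement)
```

Computing this once with scores from the original spec and once with scores from the calibrated spec, on the labeled examples, gives a quick signal of whether calibration moved the scores toward your ratings.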


## Save calibrated scoring system

The updated scoring system now has different weights assigned to its dimensions.  Save those for later.

In [None]:
with open("aesop_ai_calibrated.json", "w") as file:
    file.write(aesop_scoring_spec_calibrated.model_dump_json(indent=2))
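
To inspect the saved file later, the JSON can be read back as a plain dictionary with the standard library. This is a minimal sketch: `load_spec_dict` is a hypothetical helper, and the demo spec below is an invented stand-in whose field names are not the real scoring spec schema. Turning the dictionary back into an SDK object is not covered here.

```python
import json
import os
import tempfile


def load_spec_dict(path):
    """Read a saved scoring spec JSON file into a plain Python dict."""
    with open(path) as file:
        return json.load(file)


# Invented stand-in data; the real spec's fields may differ.
demo_spec = {"name": "demo", "dimensions": [{"label": "Story Structure", "weight": 0.5}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_spec.json")
    with open(demo_path, "w") as file:
        file.write(json.dumps(demo_spec, indent=2))
    reloaded = load_spec_dict(demo_path)

print(reloaded)
```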

## Next Steps

Now that you have a calibrated scoring system, other parts of Pi should work better.  This Colab used a limited amount of hand-labeled data, but scaling up this feedback loop will pay dividends.