<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Calibrate_with_User_Preferences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Calibrate with User Preferences

After you have a set of questions in your Scoring System, the next step is to work out which ones matter and which do not.  **Calibration** lets your Scoring System learn this from a few labelled examples.

This Colab walks through the same `Aesop AI` example with some mock feedback, but you can use this recipe for your own application.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account.  Add it to your notebook secrets (the key symbol) on the left.

Run the cell below to install packages and load the SDK.

In [1]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm pandas numpy

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()


## Load a scoring spec and a dataset

The cells below load a sample Scoring Spec from our repo and a dataset from Hugging Face.  You'll need an `HF_TOKEN` from https://huggingface.co/settings/tokens set in your notebook secrets.

In [2]:
# @title Load Scoring Spec
from withpi_utils.colab import load_scoring_spec_from_web, display_scoring_spec

aesop_scoring_spec = load_scoring_spec_from_web(
    "https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/scoring_specs/aesop_ai.json"
)

display_scoring_spec(aesop_scoring_spec)

In [3]:
# @title Load dataset
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")

print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})


## Identify outliers

Let's first score every input against the scoring system, adding that as a column.  Pi scoring is fast enough that serially processing the dataset is fine, though we could increase parallelism for more speed.

In [4]:
# @title Score all examples
from tqdm.notebook import tqdm
import pandas as pd

scores = []
for example in tqdm(aesop_dataset):
    scores.append(
        pi.scoring_system.score(
            scoring_spec=aesop_scoring_spec,
            llm_input=example["input"],
            llm_output=example["output"],
        )
    )

df = pd.DataFrame(
    {
        "input": aesop_dataset["input"],
        "output": aesop_dataset["output"],
        "score": [score.total_score for score in scores],
    }
)

  0%|          | 0/23 [00:00<?, ?it/s]
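
If the serial loop becomes a bottleneck on a larger dataset, one way to add parallelism is a thread pool. The sketch below is a minimal illustration, not part of the Pi SDK: `score_all` and `fake_score` are hypothetical names, and the stand-in scorer exists only so the example runs offline. Whether `PiClient` is safe to share across threads, and what rate limits apply, is not covered here, so treat the worker count as an assumption to tune.

```python
from concurrent.futures import ThreadPoolExecutor


def score_all(examples, score_fn, max_workers=4):
    """Score examples concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # so the output still lines up with the dataset rows.
        return list(pool.map(score_fn, examples))


# Hypothetical stand-in scorer so the sketch runs without network access.
def fake_score(example):
    return len(example["output"]) / 100


demo = [
    {"input": "a", "output": "x" * 10},
    {"input": "b", "output": "x" * 50},
]
demo_scores = score_all(demo, fake_score)
```

In the notebook, `score_fn` would be a small function that wraps the `pi.scoring_system.score(...)` call from the cell above for a single example.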

In [5]:
# @title Manually inspect the scores
from withpi_utils.colab import pretty_print_responses


for i in range(10):
    row = aesop_dataset[i]
    pretty_print_responses(
        header="#### Input:\n" + row["input"],
        response1="#### Output:\n" + row["output"],
        scores_left=scores[i],
    )
    print("\n\n")

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.318,
Is the resolution of the conflict clear and satisfying?,0.773,
Does the story include characters that are relatable for children?,0.824,
Do the characters demonstrate growth or change by the end of the story?,0.75,
Is the dialogue between characters natural and age-appropriate?,0.773,
What makes a story engaging for children?,0.82,
Is the story engaging and likely to hold a child's interest?,0.781,
Does the story use vivid imagery to help children visualize the scenes?,0.754,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.73,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.742,
Is the resolution of the conflict clear and satisfying?,0.879,
Does the story include characters that are relatable for children?,0.719,
Do the characters demonstrate growth or change by the end of the story?,0.738,
Is the dialogue between characters natural and age-appropriate?,0.996,
What makes a story engaging for children?,0.781,
Is the story engaging and likely to hold a child's interest?,0.75,
Does the story use vivid imagery to help children visualize the scenes?,0.758,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.77,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.859,
Is the resolution of the conflict clear and satisfying?,0.809,
Does the story include characters that are relatable for children?,0.676,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.746,
What makes a story engaging for children?,0.777,
Is the story engaging and likely to hold a child's interest?,0.77,
Does the story use vivid imagery to help children visualize the scenes?,0.734,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.664,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.715,
Is the resolution of the conflict clear and satisfying?,0.875,
Does the story include characters that are relatable for children?,0.707,
Do the characters demonstrate growth or change by the end of the story?,0.812,
Is the dialogue between characters natural and age-appropriate?,0.691,
What makes a story engaging for children?,0.824,
Is the story engaging and likely to hold a child's interest?,0.777,
Does the story use vivid imagery to help children visualize the scenes?,0.738,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.742,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.617,
Is the resolution of the conflict clear and satisfying?,0.785,
Does the story include characters that are relatable for children?,0.711,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.77,
What makes a story engaging for children?,0.809,
Is the story engaging and likely to hold a child's interest?,0.758,
Does the story use vivid imagery to help children visualize the scenes?,0.766,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.703,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.699,
Is the resolution of the conflict clear and satisfying?,0.797,
Does the story include characters that are relatable for children?,0.723,
Do the characters demonstrate growth or change by the end of the story?,0.98,
Is the dialogue between characters natural and age-appropriate?,0.758,
What makes a story engaging for children?,0.781,
Is the story engaging and likely to hold a child's interest?,0.762,
Does the story use vivid imagery to help children visualize the scenes?,0.75,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.703,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.996,
Is the resolution of the conflict clear and satisfying?,1.0,
Does the story include characters that are relatable for children?,0.723,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.996,
What makes a story engaging for children?,0.754,
Is the story engaging and likely to hold a child's interest?,0.746,
Does the story use vivid imagery to help children visualize the scenes?,0.738,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.785,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.746,
Is the resolution of the conflict clear and satisfying?,0.84,
Does the story include characters that are relatable for children?,1.0,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,1.0,
What makes a story engaging for children?,1.0,
Is the story engaging and likely to hold a child's interest?,1.0,
Does the story use vivid imagery to help children visualize the scenes?,1.0,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.965,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.922,
Is the resolution of the conflict clear and satisfying?,0.766,
Does the story include characters that are relatable for children?,0.719,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.812,
What makes a story engaging for children?,0.766,
Is the story engaging and likely to hold a child's interest?,0.746,
Does the story use vivid imagery to help children visualize the scenes?,0.77,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.758,







0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.789,
Is the resolution of the conflict clear and satisfying?,0.992,
Does the story include characters that are relatable for children?,0.812,
Do the characters demonstrate growth or change by the end of the story?,0.957,
Is the dialogue between characters natural and age-appropriate?,1.0,
What makes a story engaging for children?,0.914,
Is the story engaging and likely to hold a child's interest?,0.812,
Does the story use vivid imagery to help children visualize the scenes?,0.781,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.773,
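
Manual inspection works for a handful of rows. To flag candidates programmatically, one simple option is a z-score over the total scores. This is a sketch under stated assumptions: `flag_outliers` and the 1.5 threshold are illustrative choices, and the demo totals are made up rather than real Pi scores.

```python
from statistics import mean, pstdev


def flag_outliers(total_scores, z_threshold=1.5):
    """Return indices whose score lies more than z_threshold std devs from the mean."""
    mu = mean(total_scores)
    sigma = pstdev(total_scores)
    if sigma == 0:
        return []  # all scores identical, so nothing stands out
    return [
        i
        for i, s in enumerate(total_scores)
        if abs(s - mu) / sigma > z_threshold
    ]


# Made-up totals for illustration only, not real Pi scores.
demo_totals = [0.80, 0.82, 0.79, 0.81, 0.30]
outlier_rows = flag_outliers(demo_totals)
```

In the notebook this could be called as `flag_outliers(df["score"].tolist())`, and the returned indices used to pick which rows to label first.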







## Label data

Now it's time to label examples against a simple criterion:

**Does the response fully satisfy the input based on the scoring spec above?**

Valid responses are:

* 5: **Strongly Agree**
* 4: **Agree**
* 3: **Neutral**
* 2: **Disagree**
* 1: **Strongly Disagree**



In [6]:
from withpi_utils.colab import pretty_print_responses


def to_rating(label):
    match label:
        case "1":
            return "Strongly Disagree"
        case "2":
            return "Disagree"
        case "3":
            return "Neutral"
        case "4":
            return "Agree"
        case "5":
            return "Strongly Agree"


def get_rating(row, score):
    pretty_print_responses(
        header="#### Input:\n" + row["input"],
        response1="#### Output:\n" + row["output"],
        scores_left=score,
    )
    print("\n\n")

    while True:
        user_rating = input("Your rating (1 to 5): ")
        try:
            if int(user_rating) not in [1, 2, 3, 4, 5]:
                raise ValueError("Invalid")
        except ValueError:
            display("Invalid input. Try again")
            continue
        break
    return to_rating(user_rating)

examples = []
for idx, row in df.head(5).iterrows():
    examples.append(
        {
            "llm_input": row["input"],
            "llm_output": row["output"],
            "rating": get_rating(row, scores[idx]),
        }
    )

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.318,
Is the resolution of the conflict clear and satisfying?,0.773,
Does the story include characters that are relatable for children?,0.824,
Do the characters demonstrate growth or change by the end of the story?,0.75,
Is the dialogue between characters natural and age-appropriate?,0.773,
What makes a story engaging for children?,0.82,
Is the story engaging and likely to hold a child's interest?,0.781,
Does the story use vivid imagery to help children visualize the scenes?,0.754,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.73,





Your rating (1 to 5): 5


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.742,
Is the resolution of the conflict clear and satisfying?,0.879,
Does the story include characters that are relatable for children?,0.719,
Do the characters demonstrate growth or change by the end of the story?,0.738,
Is the dialogue between characters natural and age-appropriate?,0.996,
What makes a story engaging for children?,0.781,
Is the story engaging and likely to hold a child's interest?,0.75,
Does the story use vivid imagery to help children visualize the scenes?,0.758,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.77,





Your rating (1 to 5): 4


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.859,
Is the resolution of the conflict clear and satisfying?,0.809,
Does the story include characters that are relatable for children?,0.676,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.746,
What makes a story engaging for children?,0.777,
Is the story engaging and likely to hold a child's interest?,0.77,
Does the story use vivid imagery to help children visualize the scenes?,0.734,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.664,





Your rating (1 to 5): 1


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.715,
Is the resolution of the conflict clear and satisfying?,0.875,
Does the story include characters that are relatable for children?,0.707,
Do the characters demonstrate growth or change by the end of the story?,0.812,
Is the dialogue between characters natural and age-appropriate?,0.691,
What makes a story engaging for children?,0.824,
Is the story engaging and likely to hold a child's interest?,0.777,
Does the story use vivid imagery to help children visualize the scenes?,0.738,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.742,





Your rating (1 to 5): 5


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.617,
Is the resolution of the conflict clear and satisfying?,0.785,
Does the story include characters that are relatable for children?,0.711,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.77,
What makes a story engaging for children?,0.809,
Is the story engaging and likely to hold a child's interest?,0.758,
Does the story use vivid imagery to help children visualize the scenes?,0.766,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.703,





Your rating (1 to 5): 4
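
If the ratings already exist (for example, collected from users rather than typed in interactively), the same example records can be built without the `input()` loop. The sketch below assumes one integer label per row; `RATINGS` and `build_examples` are hypothetical helper names, and the demo rows are placeholders rather than dataset entries. The rating strings and dictionary keys mirror the cell above.

```python
RATINGS = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}


def build_examples(rows, labels):
    """Pair each row with a 1-5 label and emit calibration-style examples."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per row")
    built = []
    for row, label in zip(rows, labels):
        if label not in RATINGS:
            raise ValueError(f"label must be an integer from 1 to 5, got {label!r}")
        built.append(
            {
                "llm_input": row["input"],
                "llm_output": row["output"],
                "rating": RATINGS[label],
            }
        )
    return built


# Placeholder rows for illustration; real rows come from the dataset.
demo_rows = [
    {"input": "Tell a story about a fox.", "output": "Once upon a time..."},
    {"input": "Tell a story about a crow.", "output": "A crow found a pitcher..."},
]
demo_examples = build_examples(demo_rows, [5, 2])
```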


## Calibrate

Now it's time to calibrate with the labelled examples.  The following cell launches a calibration job and monitors it until completion.

In [7]:
from withpi_utils import stream

scoring_system_calibration_status = pi.scoring_system.calibrate.start_job(
    scoring_spec=aesop_scoring_spec, examples=examples
)

next(stream(pi.scoring_system.calibrate, scoring_system_calibration_status), None)

aesop_scoring_spec_calibrated = pi.scoring_system.calibrate.retrieve(scoring_system_calibration_status.job_id).calibrated_scoring_spec

LAUNCHING
RUNNING
Training the AST...
Overall initial loss = 0.27275904605263157
Optimizing ROOT + dim:step_62ede5c4-db5e-48a9-aec0-9bd96be36daa ...
Initial loss = 0.27275904605263157
Best trial = Measurement(metrics={'4b1fe1a6-5d7f-492b-b869-c056bc05e9f2_loss': Metric(value=0.2632608599154835, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_f0c7c334-8235-49ff-8c68-915fe82940d2 ...
Initial loss = 0.2632608599154835
Best trial = Measurement(metrics={'4b1fe1a6-5d7f-492b-b869-c056bc05e9f2_loss': Metric(value=0.25622007728416074, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_6073c885-79a1-4de2-98ae-00c29714158c ...
Initial loss = 0.25622007728416074
Best trial = Measurement(metrics={'4b1fe1a6-5d7f-492b-b869-c056bc05e9f2_loss': Metric(value=0.24864596659019994, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimi

## Rescore after calibration

Now rescore the labelled examples with the calibrated spec and display the original and calibrated scores side by side. You can examine these to see if they more closely align with the examples you labelled.  Ideally the score starts separating good responses from bad.

If it does not, that suggests the properties you **really** care about aren't captured in your scoring dimensions and will need to be added.  Proceed to the Copilot at https://build.withpi.ai to experiment with this.

If this is looking good, you have a powerful function for improving your system.

In [8]:
from withpi_utils.colab import pretty_print_responses

for i in tqdm(range(5)):
    example = aesop_dataset[i]
    old_score = pi.scoring_system.score(
        scoring_spec=aesop_scoring_spec,
        llm_input=example["input"],
        llm_output=example["output"],
    )
    new_score = pi.scoring_system.score(
        scoring_spec=aesop_scoring_spec_calibrated,
        llm_input=example["input"],
        llm_output=example["output"],
    )
    pretty_print_responses(
        header="#### Input:\n" + example["input"],
        response1="#### Output:\n" + example["output"],
        response2="#### Output:\n" + example["output"],
        scores_left=old_score,
        scores_right=new_score,
    )

  0%|          | 0/5 [00:00<?, ?it/s]

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.318,
Is the resolution of the conflict clear and satisfying?,0.773,
Does the story include characters that are relatable for children?,0.824,
Do the characters demonstrate growth or change by the end of the story?,0.75,
Is the dialogue between characters natural and age-appropriate?,0.773,
What makes a story engaging for children?,0.82,
Is the story engaging and likely to hold a child's interest?,0.781,
Does the story use vivid imagery to help children visualize the scenes?,0.754,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.73,

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.318,
Is the resolution of the conflict clear and satisfying?,0.566,
Does the story include characters that are relatable for children?,0.856,
Do the characters demonstrate growth or change by the end of the story?,0.675,
Is the dialogue between characters natural and age-appropriate?,0.773,
What makes a story engaging for children?,0.82,
Is the story engaging and likely to hold a child's interest?,0.781,
Does the story use vivid imagery to help children visualize the scenes?,0.754,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.73,


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.742,
Is the resolution of the conflict clear and satisfying?,0.879,
Does the story include characters that are relatable for children?,0.719,
Do the characters demonstrate growth or change by the end of the story?,0.738,
Is the dialogue between characters natural and age-appropriate?,0.996,
What makes a story engaging for children?,0.781,
Is the story engaging and likely to hold a child's interest?,0.75,
Does the story use vivid imagery to help children visualize the scenes?,0.758,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.77,

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.742,
Is the resolution of the conflict clear and satisfying?,0.834,
Does the story include characters that are relatable for children?,0.405,
Do the characters demonstrate growth or change by the end of the story?,0.659,
Is the dialogue between characters natural and age-appropriate?,0.996,
What makes a story engaging for children?,0.781,
Is the story engaging and likely to hold a child's interest?,0.75,
Does the story use vivid imagery to help children visualize the scenes?,0.758,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.77,


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.859,
Is the resolution of the conflict clear and satisfying?,0.809,
Does the story include characters that are relatable for children?,0.676,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.746,
What makes a story engaging for children?,0.777,
Is the story engaging and likely to hold a child's interest?,0.77,
Does the story use vivid imagery to help children visualize the scenes?,0.734,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.664,

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.859,
Is the resolution of the conflict clear and satisfying?,0.666,
Does the story include characters that are relatable for children?,0.35,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.746,
What makes a story engaging for children?,0.777,
Is the story engaging and likely to hold a child's interest?,0.77,
Does the story use vivid imagery to help children visualize the scenes?,0.734,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.664,


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.715,
Is the resolution of the conflict clear and satisfying?,0.875,
Does the story include characters that are relatable for children?,0.707,
Do the characters demonstrate growth or change by the end of the story?,0.812,
Is the dialogue between characters natural and age-appropriate?,0.691,
What makes a story engaging for children?,0.824,
Is the story engaging and likely to hold a child's interest?,0.777,
Does the story use vivid imagery to help children visualize the scenes?,0.738,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.742,

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.715,
Is the resolution of the conflict clear and satisfying?,0.828,
Does the story include characters that are relatable for children?,0.379,
Do the characters demonstrate growth or change by the end of the story?,0.761,
Is the dialogue between characters natural and age-appropriate?,0.691,
What makes a story engaging for children?,0.824,
Is the story engaging and likely to hold a child's interest?,0.777,
Does the story use vivid imagery to help children visualize the scenes?,0.738,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.742,


0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.617,
Is the resolution of the conflict clear and satisfying?,0.785,
Does the story include characters that are relatable for children?,0.711,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.77,
What makes a story engaging for children?,0.809,
Is the story engaging and likely to hold a child's interest?,0.758,
Does the story use vivid imagery to help children visualize the scenes?,0.766,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.703,

0,1,2
"Does the story have a clear beginning, middle, and end?",1.0,
Is there a conflict introduced early in the story that drives the plot?,0.617,
Is the resolution of the conflict clear and satisfying?,0.599,
Does the story include characters that are relatable for children?,0.383,
Do the characters demonstrate growth or change by the end of the story?,1.0,
Is the dialogue between characters natural and age-appropriate?,0.77,
What makes a story engaging for children?,0.809,
Is the story engaging and likely to hold a child's interest?,0.758,
Does the story use vivid imagery to help children visualize the scenes?,0.766,
Does the story incorporate repetitive elements that aid in comprehension and retention?,0.703,
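
To go beyond eyeballing the tables, one rough check is to compare total scores against the ratings numerically. The sketch below assumes a linear mapping from the five ratings onto the range 0 to 1 and uses mean absolute error; both are illustrative choices, not the loss Pi's calibration optimizes. The ratings are the ones entered in the labelling session above (5, 4, 1, 5, 4), while the two lists of totals are made-up placeholders.

```python
# Assumed linear mapping from rating text to a target score in [0, 1].
RATING_TO_TARGET = {
    "Strongly Disagree": 0.0,
    "Disagree": 0.25,
    "Neutral": 0.5,
    "Agree": 0.75,
    "Strongly Agree": 1.0,
}


def mean_abs_error(ratings, total_scores):
    """Average gap between each rating's target and the matching total score."""
    if len(ratings) != len(total_scores):
        raise ValueError("ratings and total_scores must have the same length")
    targets = [RATING_TO_TARGET[r] for r in ratings]
    return sum(abs(t - s) for t, s in zip(targets, total_scores)) / len(targets)


demo_ratings = [
    "Strongly Agree",
    "Agree",
    "Strongly Disagree",
    "Strongly Agree",
    "Agree",
]
# Placeholder totals for illustration only, not real Pi scores.
demo_before = [0.75, 0.80, 0.78, 0.77, 0.79]
demo_after = [0.85, 0.75, 0.40, 0.90, 0.72]

error_before = mean_abs_error(demo_ratings, demo_before)
error_after = mean_abs_error(demo_ratings, demo_after)
```

A lower error after calibration than before would suggest the calibrated spec tracks the labels more closely. With only five examples the number is noisy, so it is best read alongside the side-by-side tables.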


## Save calibrated scoring system

The updated scoring system now has different weights assigned to its dimensions.  Save those for later.

In [10]:
from withpi_utils.colab import dump_scoring_spec
from google.colab import files

with open("aesop_ai_calibrated.json", "w") as file:
    file.write(dump_scoring_spec(aesop_scoring_spec_calibrated))
files.download('aesop_ai_calibrated.json')
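
To reuse the saved file in a later session, it can be read back from disk. The sketch below assumes the output of `dump_scoring_spec` is plain JSON, which the `.json` filename suggests but this notebook does not confirm. `load_spec_file` is a hypothetical helper, and the stand-in payload (including its `weight` field) is invented for the round trip and says nothing about the real schema.

```python
import json
import os
import tempfile


def load_spec_file(path):
    """Read a saved spec file and parse it as JSON."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


# Round-trip a stand-in payload; the real file's schema is not shown here.
stand_in = [{"question": "Is the story engaging?", "weight": 0.5}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_spec.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump(stand_in, fh)
    reloaded = load_spec_file(demo_path)
```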



## Next Steps

Now that you have a calibrated scoring system, other parts of Pi should work better.  This Colab used a limited amount of hand-labeled data, but scaling up this feedback loop will pay dividends.