<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Calibration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Calibration

Calibration lets you alter how a Pi Scoring System evaluates a question or a set of questions by providing a few updated score labels.  This notebook walks through tuning a single question so you can see what's happening, but the same API works on full Scoring Systems too.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Set up scoring system

Let's say we're building an AI to generate stories in the style of Aesop's Fables.  Consider the following Scoring System:

In [None]:
scoring_spec = [{'question': q} for q in [
    "Is the life lesson clearly conveyed in the story?",
]]

def score(example):
    example["score"] = pi.scoring_system.score(
        llm_input=example["input"],
        llm_output=example["output"],
        scoring_spec=scoring_spec,
    ).total_score
    return example

## Load a dataset

Load our example training dataset, and let's dig in. You'll need an `HF_TOKEN` from https://huggingface.co/settings/tokens set in your notebook secrets.

In [None]:
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")

print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})


## Examine scores

Let's score these and see how they're behaving.

In [None]:
aesop_dataset = aesop_dataset.map(score)

print(aesop_dataset["score"][:])

[0.80078125, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
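
To see at a glance how saturated the question is, a quick summary of the score list helps. The sketch below hard-codes the scores printed above so that it runs on its own; in the notebook you would pass `aesop_dataset["score"]` instead. The helper name `summarize_scores` is made up for illustration and is not part of the Pi SDK.

```python
def summarize_scores(scores, ceiling=1.0):
    """Return (count of scores at the ceiling, index of the lowest score, lowest score)."""
    saturated = sum(1 for s in scores if s >= ceiling)
    lowest_index = min(range(len(scores)), key=scores.__getitem__)
    return saturated, lowest_index, scores[lowest_index]

# Scores copied from the output above: one row near 0.8, the other 22 at 1.0.
scores = [0.80078125] + [1.0] * 22

saturated, lowest_index, lowest = summarize_scores(scores)
print(f"{saturated} of {len(scores)} scores are at the ceiling")
print(f"lowest score: row {lowest_index} at {lowest}")
```

With almost every row at the ceiling, the question has little room to separate good stories from weak ones, which is what calibration is meant to address.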


## Examine the responses

The first row was a little worse on this question, but 0.8 is still pretty good.  What if it should have been harsher?  Let's take a look and see what it was, along with an example from the rest of the set.

In [None]:
from withpi_utils.colab import pretty_print_responses

pretty_print_responses(
    response1="#### Input:\n" + aesop_dataset[0]["input"] + "\n#### Output:\n" + aesop_dataset[0]["output"],
    response2="#### Input:\n" + aesop_dataset[1]["input"] + "\n#### Output:\n" + aesop_dataset[1]["output"],
    left_label="Bad", right_label="Good")

## Make it harsher

The first story fails to set up the race, relying on the reader to fill in that context from their knowledge of the original story.  You might want to penalize this story harshly because it fails to convey the moral.

In reality you'd label more examples, but two labels should show off the effect of calibration.

In [None]:
examples = [
    {
        "llm_input": aesop_dataset[0]["input"],
        "llm_output": aesop_dataset[0]["output"],
        "rating": "Strongly Disagree",
    },
    {
        "llm_input": aesop_dataset[1]["input"],
        "llm_output": aesop_dataset[1]["output"],
        "rating": "Strongly Agree",
    },
]
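
With more labels, writing each dictionary by hand gets tedious. A small helper can build the same structure from (row index, rating) pairs. This is only a sketch: `build_examples` is a hypothetical helper, not part of the Pi SDK, and the stand-in rows replace `aesop_dataset` so that the snippet runs on its own. Only the two rating strings used above appear here.

```python
def build_examples(dataset, labels):
    """Build calibration examples from (row_index, rating) pairs.

    `dataset` is anything indexable by row that yields dicts with
    "input" and "output" keys, such as the Hugging Face dataset above.
    """
    return [
        {
            "llm_input": dataset[i]["input"],
            "llm_output": dataset[i]["output"],
            "rating": rating,
        }
        for i, rating in labels
    ]

# Stand-in rows so the sketch is self-contained; use aesop_dataset in the notebook.
rows = [
    {"input": "prompt 0", "output": "story 0"},
    {"input": "prompt 1", "output": "story 1"},
]
examples_sketch = build_examples(rows, [(0, "Strongly Disagree"), (1, "Strongly Agree")])
print(examples_sketch[0]["rating"])
```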


## Calibrate

Now it's time to calibrate with the labeled examples.  The following cell will launch a job and monitor it until completion.

In [None]:
from withpi_utils import stream

scoring_system_calibration_status = pi.scoring_system.calibrate.start_job(
    scoring_spec=scoring_spec, examples=examples
)

next(stream(pi.scoring_system.calibrate, scoring_system_calibration_status), None)

scoring_spec_calibrated = pi.scoring_system.calibrate.retrieve(scoring_system_calibration_status.job_id).calibrated_scoring_spec

LAUNCHING
RUNNING
Training the AST...
Overall initial loss = 0.400390625
Optimizing ROOT + dim:step_9f6278f4-0135-40af-98e9-24ede97410dc ...
Initial loss = 0.400390625
Best trial = Measurement(metrics={'acf2de4c-1541-4a0c-98f8-1b1fc5d3ae25_loss': Metric(value=0.19981745737270945, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing AST ROOT ...
Initial loss = 0.19981745737270945
Best trial = Measurement(metrics={'acf2de4c-1541-4a0c-98f8-1b1fc5d3ae25_loss': Metric(value=0.19981745737270942, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Keep the initial learned params!
DONE
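
The loss values in this log are consistent with a mean absolute error over the two labeled examples, if "Strongly Disagree" maps to a target of 0.0 and "Strongly Agree" to 1.0. Both that mapping and the form of the loss are assumptions inferred from the numbers in this run, not documented behavior. The check below uses the initial scores from earlier and the calibrated first-row score reported in the next section.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute gap between predicted scores and label targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

# Assumed targets: "Strongly Disagree" -> 0.0, "Strongly Agree" -> 1.0.
targets = [0.0, 1.0]

loss_before = mean_absolute_error([0.80078125, 1.0], targets)
loss_after = mean_absolute_error([0.3996349147454189, 1.0], targets)

print(loss_before)  # compare with "Overall initial loss" in the log
print(loss_after)   # compare with the first "Best trial" loss in the log
```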


## Score after calibration

Now add a new column with calibrated scores so we can compare.

In [None]:
def score_calibrated(example):
    example["score_calibrated"] = pi.scoring_system.score(
        llm_input=example["input"],
        llm_output=example["output"],
        scoring_spec=scoring_spec_calibrated,
    ).total_score
    return example

aesop_dataset = aesop_dataset.map(score_calibrated)

print("Original Scores:")
print(aesop_dataset["score"][:])
print("\nCalibrated Scores:")
print(aesop_dataset["score_calibrated"][:])

Original Scores:
[0.80078125, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

Calibrated Scores:
[0.3996349147454189, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]


## What have we achieved?

Now this question is much harsher.  It still saturates at 1.0 on most of the inputs, but calibration has pulled the "bad" response well away from the "good" ones, giving the question more power to filter out bad examples.
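
One way to see the added filtering power is to apply a pass/fail cutoff to both score lists. The sketch below hard-codes the scores printed above, and the cutoff of 0.5 is an arbitrary value chosen for illustration, not a recommended setting. `rows_below` is a hypothetical helper.

```python
def rows_below(scores, cutoff):
    """Return the indices of rows whose score falls below the cutoff."""
    return [i for i, s in enumerate(scores) if s < cutoff]

# Scores copied from the output above.
original = [0.80078125] + [1.0] * 22
calibrated = [0.3996349147454189] + [1.0] * 22

cutoff = 0.5  # illustrative threshold only
print("filtered before calibration:", rows_below(original, cutoff))
print("filtered after calibration:", rows_below(calibrated, cutoff))

# Gap between the labeled "bad" row and the top-scoring rows.
print("gap before:", max(original) - original[0])
print("gap after:", max(calibrated) - calibrated[0])
```

Before calibration no row falls under the cutoff; afterwards only the first row does, while the other 22 rows are unaffected.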

## Save calibrated scoring system

Save the updated scoring spec to a file, which can be loaded in the future with `load_scoring_spec`.

In [None]:
from withpi_utils.colab import dump_scoring_spec
from google.colab import files

with open("aesop_ai_calibrated.json", "w") as file:
    file.write(dump_scoring_spec(scoring_spec_calibrated))
files.download('aesop_ai_calibrated.json')

## Next Steps

This Colab used an (extremely!) limited amount of labeled data, but scaling up this feedback loop will pay dividends, allowing you to tune the range of your questions to match your notion of "goodness".

You can also try calibrating a set of questions rather than just one.  The API is exactly the same.