<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Calibration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Calibration

Once your Scoring System has a set of questions, the next step is to determine which of them matter most.  **Calibration** lets your Scoring System learn this from a few labelled examples by adjusting the weight assigned to each question.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [7]:
%%capture

%pip install withpi withpi-utils datasets

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Load a dataset

Let's say you're building an AI to generate stories in the style of Aesop's Fables.  Load our example training dataset, and let's dig in. You'll need an `HF_TOKEN` from https://huggingface.co/settings/tokens set in your notebook secrets.

In [8]:
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")

print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})


## Label data

Now it's time to label examples against a simple criterion:

**Does the response fully satisfy the input based on the scoring spec in the cell below?**

Valid responses are:

* 5: **Strongly Agree**
* 4: **Agree**
* 3: **Neutral**
* 2: **Disagree**
* 1: **Strongly Disagree**



In [10]:
from withpi_utils.colab import pretty_print_responses

scoring_spec = [{'question': q} for q in [
    "Does the response contain a clear beginning, middle, and end?",
    "Does the story follow a logical progression of events?",
    "Does the story resolve the conflict in a satisfying manner?",
    "Is the life lesson clearly conveyed in the story?",
    "Is the life lesson relevant to the input provided by the user?"
]]

def to_rating(label):
    match label:
        case "1":
            return "Strongly Disagree"
        case "2":
            return "Disagree"
        case "3":
            return "Neutral"
        case "4":
            return "Agree"
        case "5":
            return "Strongly Agree"


def get_rating(row):
    pretty_print_responses(
        header="#### Input:\n" + row["input"],
        response1="#### Output:\n" + row["output"],
    )
    print("\n\n")

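    while True:
        user_rating = input("Your rating (1 to 5): ").strip()
        # Accept only the exact strings "1".."5", which are the keys to_rating expects.
        # (Input such as " 3" or "03" parses as an int but would not match in to_rating.)
        if user_rating in {"1", "2", "3", "4", "5"}:
            break
        print("Invalid input. Try again")
    return to_rating(user_rating)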

examples = []
for example in aesop_dataset.take(5):
    examples.append(
        {
            "llm_input": example["input"],
            "llm_output": example["output"],
            "rating": get_rating(example),
        }
    )




Your rating (1 to 5): 1
Your rating (1 to 5): 5
Your rating (1 to 5): 3
Your rating (1 to 5): 2
Your rating (1 to 5): 4
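
Before calibrating, it can help to confirm that the labels cover the rating scale, since a label set that uses only one rating likely gives calibration little to distinguish. A minimal sketch, assuming `examples` has the shape built above; the `label_distribution` helper and the placeholder inputs are hypothetical, and the ratings mirror the five entered in this run:

```python
from collections import Counter

RATING_SCALE = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

def label_distribution(examples):
    """Count how many examples carry each rating, in scale order."""
    counts = Counter(example["rating"] for example in examples)
    return {rating: counts.get(rating, 0) for rating in RATING_SCALE}

# Placeholder rows standing in for the labelled `examples` list (hypothetical content).
ratings_entered = ["Strongly Disagree", "Strongly Agree", "Neutral", "Disagree", "Agree"]
sample_examples = [
    {"llm_input": f"input {i}", "llm_output": f"output {i}", "rating": rating}
    for i, rating in enumerate(ratings_entered)
]

distribution = label_distribution(sample_examples)
for rating, count in distribution.items():
    print(f"{rating:>17}: {count}")
```

In the notebook, calling `label_distribution(examples)` on the real list would give the same kind of summary.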


## Score all examples

First, score every example against the uncalibrated scoring system and store the result as a new column.  Pi scoring is fast enough that processing the dataset serially is fine, though requests can be issued in parallel for more speed.
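
The scoring step described here could be sketched as follows. This is an illustration under stated assumptions, not the notebook's original cell: `score_all` and `fake_score` are hypothetical names, and `fake_score` is a stand-in scorer so the block runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor

def score_all(rows, score_fn, max_workers=1):
    """Return one score per row, in row order.

    max_workers=1 processes rows serially; larger values issue calls from a thread pool.
    """
    if max_workers <= 1:
        return [score_fn(row) for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(score_fn, rows))

def fake_score(row):
    # Hypothetical stand-in for a real scoring call: fraction of alphabetic characters.
    text = row["output"]
    return sum(ch.isalpha() for ch in text) / max(len(text), 1)

rows = [
    {"input": "Write a fable about patience.", "output": "The tortoise kept walking."},
    {"input": "Write a fable about greed.", "output": "The dog dropped his bone!!"},
]

serial = score_all(rows, fake_score)
parallel = score_all(rows, fake_score, max_workers=4)

# Attach each score to its row, mimicking a new "score" column.
scored_rows = [{**row, "score": score} for row, score in zip(rows, serial)]
print(scored_rows)
```

With the real client, `score_fn` might wrap `pi.scoring_system.score(scoring_spec=scoring_spec, llm_input=row["input"], llm_output=row["output"])`, the call used later in this notebook. Whether the client can safely be shared across threads is an assumption to verify. For a Hugging Face dataset, the resulting list can be attached with `aesop_dataset.add_column("score", scores)`.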

## Calibrate

Now it's time to calibrate with the labelled examples.  The following cell launches a calibration job and monitors it until completion.

In [11]:
from withpi_utils import stream

scoring_system_calibration_status = pi.scoring_system.calibrate.start_job(
    scoring_spec=scoring_spec, examples=examples
)

next(stream(pi.scoring_system.calibrate, scoring_system_calibration_status), None)

scoring_spec_calibrated = pi.scoring_system.calibrate.retrieve(scoring_system_calibration_status.job_id).calibrated_scoring_spec

LAUNCHING
RUNNING
Training the AST...
Overall initial loss = 0.4646875
Optimizing ROOT + dim:step_8a8b0f28-0763-48a5-9998-3b7bf2dbf94d ...
Initial loss = 0.4646875
Best trial = Measurement(metrics={'68f2b405-1aa4-493a-89ad-5da5ef39d173_loss': Metric(value=0.44115863199369265, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_5d4fa5a5-0226-4974-b4a5-fea7d0fd2265 ...
Initial loss = 0.44115863199369265
Best trial = Measurement(metrics={'68f2b405-1aa4-493a-89ad-5da5ef39d173_loss': Metric(value=0.44047694212332383, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_96a0ee03-9a90-44cf-9322-427e2202d9e2 ...
Initial loss = 0.44047694212332383
Best trial = Measurement(metrics={'68f2b405-1aa4-493a-89ad-5da5ef39d173_loss': Metric(value=0.4324078038846185, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:ste
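
To quantify how much calibration reduced the loss, one option is to pull the loss values out of the log. A rough sketch, assuming the log keeps the `Overall initial loss = ...` and `Metric(value=...)` formats shown above (these are job output, not a documented interface, and may change). `LOG_EXCERPT` copies the first three optimization steps and shortens the `Measurement(...)` wrapper; `loss_trajectory` is a hypothetical helper.

```python
import re

LOG_EXCERPT = """\
Overall initial loss = 0.4646875
Initial loss = 0.4646875
Best trial = Measurement(metrics={'..._loss': Metric(value=0.44115863199369265, std=None)})
Initial loss = 0.44115863199369265
Best trial = Measurement(metrics={'..._loss': Metric(value=0.44047694212332383, std=None)})
Initial loss = 0.44047694212332383
Best trial = Measurement(metrics={'..._loss': Metric(value=0.4324078038846185, std=None)})
"""

def loss_trajectory(log_text):
    """Return the overall initial loss followed by the best loss of each step."""
    initial = re.search(r"Overall initial loss = ([0-9.]+)", log_text)
    if initial is None:
        raise ValueError("no 'Overall initial loss' line found")
    best = re.findall(r"Metric\(value=([0-9.eE+-]+),", log_text)
    return [float(initial.group(1))] + [float(value) for value in best]

losses = loss_trajectory(LOG_EXCERPT)
# Relative improvement from the starting loss to the last step in the excerpt.
reduction = (losses[0] - losses[-1]) / losses[0]
print(losses)
print(f"Relative loss reduction: {reduction:.1%}")
```

For the three steps in the excerpt, this reports a reduction of about 6.9%. The log above is truncated, so the final loss of the full run is not known from this output.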

## Score before and after calibration

Now score the same examples with both the original and the calibrated scoring spec and compare the results side by side. Check whether the calibrated scores align more closely with the examples you labelled.  Ideally, the score starts separating good responses from bad.

If it does not, that suggests the properties you **really** care about aren't captured in your scoring dimensions and need to be added.  Proceed to the Copilot at https://build.withpi.ai to experiment with this.

If this is looking good, you have a powerful function for improving your system.

In [12]:
from withpi_utils.colab import pretty_print_responses

for example in aesop_dataset.take(5):
    old_score = pi.scoring_system.score(
        scoring_spec=scoring_spec,
        llm_input=example["input"],
        llm_output=example["output"],
    )
    new_score = pi.scoring_system.score(
        scoring_spec=scoring_spec_calibrated,
        llm_input=example["input"],
        llm_output=example["output"],
    )
    pretty_print_responses(
        header="#### Input:\n" + example["input"],
        response1="#### Output:\n" + example["output"],
        response2="#### Output:\n" + example["output"],
        scores_left=old_score,
        scores_right=new_score,
    )

| Question | Original spec (left) | Calibrated spec (right) |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 0.832 | 0.745 |
| Does the story resolve the conflict in a satisfying manner? | 0.781 | 0.454 |
| Is the life lesson clearly conveyed in the story? | 0.801 | 0.701 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.883 | 0.594 |


| Question | Original spec (left) | Calibrated spec (right) |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.836 | 0.464 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.967 | 0.686 |


| Question | Original spec (left) | Calibrated spec (right) |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.922 | 0.709 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.984 | 0.829 |


| Question | Original spec (left) | Calibrated spec (right) |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.852 | 0.466 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.97 | 0.688 |


| Question | Original spec (left) | Calibrated spec (right) |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 0.988 | 0.982 |
| Does the story resolve the conflict in a satisfying manner? | 0.777 | 0.454 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.953 | 0.68 |
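
One way to check whether the scores track your labels is a rank correlation between the human ratings and the total scores. The sketch below is not part of the Pi SDK: `ranks` and `spearman` are hypothetical helpers written with the standard library, and they assume no tied values. It also assumes that the ratings entered earlier (1, 5, 3, 2, 4) and the total scores printed above are in the same example order, with the first set of scores for each example coming from the original spec (`scores_left`) and the second from the calibrated spec (`scores_right`).

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for two equal-length sequences without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

human_ratings = [1, 5, 3, 2, 4]
totals_original = [0.883, 0.967, 0.984, 0.97, 0.953]
totals_calibrated = [0.594, 0.686, 0.829, 0.688, 0.68]

rho_original = spearman(human_ratings, totals_original)
rho_calibrated = spearman(human_ratings, totals_calibrated)
# Spread = gap between the highest and lowest total score.
spread_original = max(totals_original) - min(totals_original)
spread_calibrated = max(totals_calibrated) - min(totals_calibrated)
print(f"rho: {rho_original:.2f} -> {rho_calibrated:.2f}")
print(f"spread: {spread_original:.3f} -> {spread_calibrated:.3f}")
```

On these five examples the ordering of the totals is unchanged by calibration (both correlations come out at 0.20), while the spread between the lowest and highest total widens from about 0.10 to about 0.24. Five labels are too few to draw conclusions, so treat this as a template for a larger labelled set.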


## Save calibrated scoring system

The updated scoring system now has different weights assigned to its questions.  Save those for later.

In [13]:
from withpi_utils.colab import dump_scoring_spec
from google.colab import files

with open("aesop_ai_calibrated.json", "w") as file:
    file.write(dump_scoring_spec(scoring_spec_calibrated))
files.download('aesop_ai_calibrated.json')
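
To reuse the calibrated spec in a later session, the saved file can be read back. A minimal sketch, assuming `dump_scoring_spec` writes JSON (the `.json` filename suggests this, but the notebook does not confirm it). `read_spec_file` is a hypothetical local helper, and the sample spec written here is a placeholder containing only the `question` key used earlier, so the block runs on its own:

```python
import json
import tempfile
from pathlib import Path

def read_spec_file(path):
    """Read a scoring spec that was saved as JSON and return the parsed object."""
    with open(path, "r", encoding="utf-8") as file:
        return json.load(file)

# Placeholder spec (hypothetical); a real calibrated spec would carry more fields.
sample_spec = [{"question": "Is the life lesson clearly conveyed in the story?"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    spec_path = Path(tmp_dir) / "aesop_ai_calibrated.json"
    spec_path.write_text(json.dumps(sample_spec), encoding="utf-8")
    reloaded = read_spec_file(spec_path)

print(reloaded)
```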



## Next Steps

Now that you have a calibrated scoring system, other parts of Pi that rely on it should work better.  This Colab used only five hand-labelled examples, but scaling up this feedback loop with more labels will pay dividends.