<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/Calibration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Calibration

Calibration lets you adjust how a Pi Scoring System evaluates a question or a set of questions by providing a few updated score labels. This notebook calibrates a small six-question Scoring System against a handful of labeled emails so you can see what's happening; the same API works on larger Scoring Systems too.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()
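
The cell above only works inside Colab, because `google.colab` is not importable elsewhere. If you run this notebook locally, a fallback along the following lines may help. This is a minimal sketch: `resolve_api_key` is a helper name introduced here, not part of the Pi SDK, and it assumes the key is otherwise available in an environment-style mapping.

```python
import os


def resolve_api_key(env=None, name="WITHPI_API_KEY"):
    """Return the API key from Colab secrets when available, else from `env`."""
    env = os.environ if env is None else env

    try:
        # Only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            key = userdata.get(name)
            if key:
                return key
        except Exception:
            # Secret missing or notebook access not granted: fall through to env.
            pass

    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set in Colab secrets or the environment")
    return key


# Demo with an explicit mapping so nothing real is required to run this cell.
key = resolve_api_key({"WITHPI_API_KEY": "demo-key"})
print("key loaded:", bool(key))
```

In a real session you would assign the result to `os.environ["WITHPI_API_KEY"]` before constructing `PiClient()`, as the cell above does.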

## Setup scoring system

Say we want to score short cold sales emails for pithiness. The Scoring System below combines five natural-language questions with one `PYTHON_CODE` question that scores the email by word count:

In [None]:
import textwrap

python_code = textwrap.dedent("""
    from typing import Any
    import json

    def score(**kwargs) -> dict:
        def evaluate_response(response):
            words = response.split()
            word_count = len(words)

            if word_count < 50:
                return 1.0
            elif word_count < 100:
                return 0.7
            elif word_count < 150:
                return 0.4
            return 0.0

        email_text = None

        if 'email' in kwargs:
            email_text = kwargs['email']
        elif 'response_text' in kwargs:
            try:
                parsed = json.loads(kwargs['response_text'])
                if isinstance(parsed, dict) and 'email' in parsed:
                    email_text = parsed['email']
                else:
                    email_text = kwargs['response_text']
            except (json.JSONDecodeError, TypeError):
                email_text = kwargs['response_text']

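        # Neither key was supplied: return 0.0 instead of crashing on None.
        if email_text is None:
            return {"score": 0.0, "explanation": "No email text provided."}

        final_score = evaluate_response(email_text)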
        return {"score": final_score, "explanation": ""}
""").strip()

scoring_spec = [
    {
      "label": "Call To Action",
      "question": "Does the email include a clear and actionable call to action?",
      "weight": 0.5
    },
    {
      "label": "Pithy Content",
      "question": "Is the email concise and to the point without unnecessary details?",
      "weight": 0.5
    },
    {
      "label": "Professional Tone",
      "question": "Does the email maintain a tone that is professional yet approachable?",
      "weight": 0.5
    },
    {
      "label": "Word Limit",
      "question": "Is the email within an appropriate word count for a cold sales email (e.g., less than 150 words)?",
      "python_code": python_code,
      "scoring_type": "PYTHON_CODE",
      "weight": 0.5
    },
    {
      "label": "Recipient Focus",
      "question": "Does the email focus on the recipient's needs or interests rather than solely on the sender's product or service?",
      "weight": 0.3
    },
    {
      "label": "Clarity Of Purpose",
      "question": "Is the purpose of the email immediately clear to the recipient?",
      "weight": 0.3
    }
  ]

In [None]:
print(scoring_spec)

[{'label': 'Call To Action', 'question': 'Does the email include a clear and actionable call to action?', 'weight': 0.5}, {'label': 'Pithy Content', 'question': 'Is the email concise and to the point without unnecessary details?', 'weight': 0.5}, {'label': 'Professional Tone', 'question': 'Does the email maintain a tone that is professional yet approachable?', 'weight': 0.5}, {'label': 'Word Limit', 'question': 'Is the email within an appropriate word count for a cold sales email (e.g., less than 150 words)?', 'python_code': 'from typing import Any\nimport json\n\ndef score(**kwargs) -> dict:\n    def evaluate_response(response):\n        words = response.split()\n        word_count = len(words)\n\n        if word_count < 50:\n            return 1.0\n        elif word_count < 100:\n            return 0.7\n        elif word_count < 150:\n            return 0.4\n        return 0.0\n\n    email_text = None\n\n    if \'email\' in kwargs:\n        email_text = kwargs[\'email\']\n    elif \'
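
Because the `Word Limit` question is plain Python, its tiers can be checked locally before anything is sent to Pi. The sketch below re-implements only the word-count tiers from `python_code` as a standalone function and probes the boundaries. `word_limit_score` is a name introduced here for illustration; it is not part of the Pi SDK.

```python
def word_limit_score(text: str) -> float:
    """Mirror the word-count tiers used by the Word Limit question."""
    word_count = len(text.split())
    if word_count < 50:
        return 1.0
    if word_count < 100:
        return 0.7
    if word_count < 150:
        return 0.4
    return 0.0


# Probe each tier boundary with synthetic text of a known length.
for n in (49, 50, 99, 100, 149, 150):
    synthetic_email = " ".join(["word"] * n)
    print(n, word_limit_score(synthetic_email))
```

The comparisons are strict, so an email of exactly 50 words already falls into the 0.7 tier, and one of exactly 150 words scores 0.0.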

In [None]:
import json
def score(email):
    email_score = pi.scoring_system.score(
        llm_input="",
        llm_output=json.dumps({"email": email}),
        scoring_spec=scoring_spec,
    )
    return email_score

## Load a dataset

The cell below defines 20 example cold sales emails that we will score and then calibrate on.

In [None]:
emails = [
    "Saw your team just rolled out a new logistics dashboard—congrats! We help companies like yours cut processing time by 30% with real-time anomaly detection. Worth a quick chat to see if we can complement your stack?",

    "If I had a dollar for every ops leader who hates manual reconciliations, I'd have…$37. Or you could try our automation tool and stop hating Tuesdays. Want the 3-min pitch?",

    "Your recent expansion into Illinois triggered a compliance alert on our end—we specialize in multi-state labor law navigation for fast-scaling companies. Would love to share how we've helped similar orgs stay litigation-free while growing fast.",

    "Just read about your Series B—congrats! We work with scaling SaaS teams to improve onboarding retention by 22% using adaptive tutorials. Happy to show you how we could slot into your flow.",

    "I noticed you just posted an opening for 3 new sales reps. We built an AI email coach that gives your BDRs live feedback while they write. Think Grammarly for prospecting—except it actually cares about your quota. Got 15 mins for a demo?",

    "In light of your new data-sharing partnerships, I thought it worth reaching out. Our firm specializes in proactive data transfer compliance audits and helped firms like yours avoid costly remediation post-implementation. Could we schedule a quick call next week?",

    "Hey—saw your CTO on the TechCrunch panel this week (sharp takes!). We've helped companies like yours simplify SOC 2 readiness using automated control mapping. Mind if I send over a walkthrough?",

    "Congrats on closing your A round. Most teams your stage are trying to balance velocity with sanity—and that's where we come in. We built a dashboard that connects product usage data to GTM signals, so your teams know which users are actually ready to expand or upsell.\n\nCompanies like Spline and Beam have used us to grow faster without adding headcount. We'd love to walk you through how it works and see if it's worth a trial.",

    "Did you know that 3 out of 4 startups violate wage classification rules by accident? (The 4th is lying.) We built a compliance scanner that spots misclassified roles before the DOL does. Want to see what it says about your job board?",

    "After seeing your team highlighted in FastCo for customer-led growth, I had to reach out. We're building AI-driven customer advisory boards—think Slack meets Gong—for surfacing what your best customers actually want next. Would love to hear what you're seeing internally and compare notes.",

    "Last year, teams like yours spent 100+ hours manually triaging support tickets that could've been deflected. Our NLP engine pre-sorts inbound messages by sentiment and urgency—helping ops teams reduce load by 40%. Want to test it on your last month's inbox?",

    "Noticed your CFO just came from a public company—likely means your reporting bar just got higher. We help mid-stage companies prep clean GAAP audit trails from day one. Happy to chat if that's on your roadmap.",

    "As your team continues global expansion, new jurisdictions bring new employment law exposure—especially with hybrid work policies. We provide scalable legal frameworks that help fast-growing orgs like yours remain compliant while staying flexible. If international HR is becoming a concern, would love to set up a call.",

    "Saw your CEO at the SaaStr panel on usage-based billing—awesome session. We built a forecasting tool for exactly that model and help teams cut revenue leakage by 18%. Want to try it?",

    "When Clara at Orb told us about the mess that was their pricing ops workflow, we assumed it was unique—until we talked to 20 more teams with the same story. That's why we built SplitRate: an API-first way to test new pricing tiers without engineering overhead.\n\nWe'd love to show you how it works and see if it could help your GTM team iterate faster. Want to sync next week?",

    "New SEC rules on vendor disclosures go into effect Q4. If you rely on third-party data processors, this could get messy. We've helped teams like yours prep ahead. Free next Tuesday for a quick overview?",

    "Hey, I'm a founder too—noticed you just launched your beta. We built a tool to help early-stage teams convert waitlists into learning loops (and users). Can I send you a few examples?",

    "When teams hit 50+ people, benefits compliance becomes a mess. We built a platform that lets HR see COBRA, FMLA, and state laws in one pane—plus we alert for risk zones before they become liabilities. Legal teams love us. HR teams love us more. Want a quick look?",

    "I bet you $5 your CRM is full of dead leads. We resurrect them. Our reactivation engine has helped companies like yours turn junk contacts into pipeline again. If I'm wrong, I owe you coffee. If I'm right, you'll owe me a call.",

    "We're a bunch of nerds who hate broken analytics dashboards. So we built one that auto-diagnoses when your metrics go weird (like when traffic drops 20% at midnight with no reason). Want to see it catch gremlins in your GA4?"
]

## Examine scores

Let's score these and see how they're behaving.

In [None]:

import pandas as pd

data = []

for email in emails:
    result = score(email)
    question_scores = result.question_scores

    row = {'email': email, 'Total Score': result.total_score}
    row.update(question_scores)
    data.append(row)

# Create DataFrame
pd.set_option('display.max_colwidth', None)
df = pd.DataFrame(data)
# display() renders the table itself and returns None, so no print() is needed.
display(df)

Unnamed: 0,email,Total Score,Call To Action,Pithy Content,Professional Tone,Word Limit,Recipient Focus,Clarity Of Purpose
0,Saw your team just rolled out a new logistics dashboard—congrats! We help companies like yours cut processing time by 30% with real-time anomaly detection. Worth a quick chat to see if we can complement your stack?,0.9017,0.7656,0.9883,0.9062,1.0,0.8438,0.9062
1,"If I had a dollar for every ops leader who hates manual reconciliations, I'd have…$37. Or you could try our automation tool and stop hating Tuesdays. Want the 3-min pitch?",0.6043,0.8789,0.4277,0.5,1.0,0.334,0.6172
2,Your recent expansion into Illinois triggered a compliance alert on our end—we specialize in multi-state labor law navigation for fast-scaling companies. Would love to share how we've helped similar orgs stay litigation-free while growing fast.,0.5344,0.051,0.9922,0.9297,1.0,0.7578,0.9414
3,Just read about your Series B—congrats! We work with scaling SaaS teams to improve onboarding retention by 22% using adaptive tutorials. Happy to show you how we could slot into your flow.,0.5824,0.0967,0.9648,0.9258,1.0,0.875,0.625
4,I noticed you just posted an opening for 3 new sales reps. We built an AI email coach that gives your BDRs live feedback while they write. Think Grammarly for prospecting—except it actually cares about your quota. Got 15 mins for a demo?,0.8482,0.7305,0.9609,0.8281,1.0,0.6406,0.9258
5,"In light of your new data-sharing partnerships, I thought it worth reaching out. Our firm specializes in proactive data transfer compliance audits and helped firms like yours avoid costly remediation post-implementation. Could we schedule a quick call next week?",0.8436,0.7891,0.9844,0.8867,1.0,0.4766,0.8945
6,Hey—saw your CTO on the TechCrunch panel this week (sharp takes!). We've helped companies like yours simplify SOC 2 readiness using automated control mapping. Mind if I send over a walkthrough?,0.8297,0.5039,0.9844,0.8867,1.0,0.8789,0.8867
7,"Congrats on closing your A round. Most teams your stage are trying to balance velocity with sanity—and that's where we come in. We built a dashboard that connects product usage data to GTM signals, so your teams know which users are actually ready to expand or upsell.\n\nCompanies like Spline and Beam have used us to grow faster without adding headcount. We'd love to walk you through how it works and see if it's worth a trial.",0.6972,0.4238,0.6758,0.9023,0.7,0.9609,0.7891
8,Did you know that 3 out of 4 startups violate wage classification rules by accident? (The 4th is lying.) We built a compliance scanner that spots misclassified roles before the DOL does. Want to see what it says about your job board?,0.8299,0.8125,0.7383,0.8203,1.0,0.7148,0.9062
9,"After seeing your team highlighted in FastCo for customer-led growth, I had to reach out. We're building AI-driven customer advisory boards—think Slack meets Gong—for surfacing what your best customers actually want next. Would love to hear what you're seeing internally and compare notes.",0.7027,0.2031,0.9844,0.8633,1.0,0.9141,0.9609
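
One way to read a table like this is to look for the question that drags each email down. The sketch below does that for three rows copied from the output above. Plain lists stand in for the DataFrame rows, and `weakest_question` is a helper name introduced here.

```python
QUESTIONS = [
    "Call To Action",
    "Pithy Content",
    "Professional Tone",
    "Word Limit",
    "Recipient Focus",
    "Clarity Of Purpose",
]

# Per-question scores for emails 0, 1, and 2, copied from the table above.
rows = {
    0: [0.7656, 0.9883, 0.9062, 1.0, 0.8438, 0.9062],
    1: [0.8789, 0.4277, 0.5, 1.0, 0.334, 0.6172],
    2: [0.051, 0.9922, 0.9297, 1.0, 0.7578, 0.9414],
}


def weakest_question(scores):
    """Return the (label, score) pair with the lowest score."""
    return min(zip(QUESTIONS, scores), key=lambda pair: pair[1])


for idx, scores in rows.items():
    label, value = weakest_question(scores)
    print(f"email {idx}: weakest question is {label!r} ({value})")
```

On the live DataFrame, `df[QUESTIONS].idxmin(axis=1)` should give the same per-row answer, assuming the question columns carry these labels.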




## Set up data for calibration

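Calibration takes labeled examples: each one pairs an input and output with the score you think it deserves. Here every email gets a target `score` of 1.0, and its current total score is kept as `old_score` so we can compare later.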

In [None]:
examples = []
for index, row in df.iterrows():
    example = {}
    example["llm_input"] = ''
    example["llm_output"] = json.dumps({"email": row["email"]})
    example["score"] = 1.0
    example["old_score"] = row["Total Score"]
    examples.append(example)
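
The cell above labels every email with the same target of 1.0. If you have a different target for each email, the same example shape can be built from (email, label) pairs. This is a minimal sketch: `build_examples` and the labels below are made up for illustration, only the `llm_input`, `llm_output`, and `score` keys come from the cell above, and the 0-to-1 range check is an assumption based on the scores seen in this notebook.

```python
import json


def build_examples(labeled_emails):
    """Turn (email_text, target_score) pairs into calibration example dicts."""
    examples = []
    for email_text, target in labeled_emails:
        # Assumption: labels live on the same 0-to-1 scale as the scores above.
        if not 0.0 <= target <= 1.0:
            raise ValueError(f"target score out of range: {target}")
        examples.append(
            {
                "llm_input": "",
                "llm_output": json.dumps({"email": email_text}),
                "score": float(target),
            }
        )
    return examples


# Hypothetical labels, for illustration only.
labeled = [
    ("Want the 3-min pitch?", 0.9),
    ("We'd love to walk you through how it works.", 0.4),
]

for example in build_examples(labeled):
    print(example)
```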


## Calibrate

Now calibrate with the labeled examples. The following cell launches a calibration job and monitors it until completion.

In [None]:
from withpi_utils import stream

scoring_system_calibration_status = pi.scoring_system.calibrate.start_job(
    scoring_spec=scoring_spec, examples=examples, preference_examples=[]
)

next(stream(pi.scoring_system.calibrate, scoring_system_calibration_status), None)

scoring_spec_calibrated = pi.scoring_system.calibrate.retrieve(scoring_system_calibration_status.job_id).calibrated_scoring_spec

LAUNCHING
RUNNING
Training the AST...
Overall initial loss = 0.31988312592647217
Optimizing ROOT + dim:step_9baa1c0a-60d6-4b89-8a5b-1cb99b52ac06 ...
Initial loss = 0.31988312592647217
Best trial = Measurement(metrics={'8c569e43-07aa-49bc-8378-8ab530f38b5b_loss': Metric(value=0.18571116416868247, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_0490f7a9-e840-4ab9-b365-43dfb62dd2f0 ...
Initial loss = 0.18571116416868247
Best trial = Measurement(metrics={'8c569e43-07aa-49bc-8378-8ab530f38b5b_loss': Metric(value=0.15463767273563342, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Apply the new learned params!
Optimizing ROOT + dim:step_12a06d26-7cc9-4b48-9bde-17dc9bed2799 ...
Initial loss = 0.15463767273563342
Best trial = Measurement(metrics={'8c569e43-07aa-49bc-8378-8ab530f38b5b_loss': Metric(value=0.2027632561704606, std=None)}, elapsed_secs=0.0, steps=0, checkpoint_path='')
Keep the initial learned params!
Op
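
Reading the log, each optimization step appears to adopt its best trial only when that trial's loss beats the loss going into the step. The first two steps print "Apply the new learned params!", while the third, whose best trial (about 0.2028) is worse than its starting loss (about 0.1546), prints "Keep the initial learned params!". The sketch below replays that accept-if-improved rule on the three losses from the log. It is an interpretation of the log output, not Pi's actual optimizer, and `replay_steps` is a name introduced here.

```python
def replay_steps(initial_loss, best_trial_losses):
    """Replay a greedy accept-if-improved rule over a sequence of trial losses."""
    current = initial_loss
    decisions = []
    for trial_loss in best_trial_losses:
        if trial_loss < current:
            # The trial improved on the running loss: adopt its parameters.
            current = trial_loss
            decisions.append("apply")
        else:
            # The trial was worse: keep the parameters we already had.
            decisions.append("keep")
    return current, decisions


# Loss values copied from the first three steps of the log above.
final_loss, decisions = replay_steps(
    0.31988312592647217,
    [0.18571116416868247, 0.15463767273563342, 0.2027632561704606],
)
print(decisions, final_loss)
```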

## Score after calibration

Now add a new column with calibrated scores so we can compare.

In [None]:
def score_calibrated(example):
    example["score_calibrated"] = pi.scoring_system.score(
        llm_input=example["llm_input"],
        llm_output=example["llm_output"],
        scoring_spec=scoring_spec_calibrated,
    ).total_score
    return example

# Apply calibrated scoring to your examples
examples_with_calibrated = [score_calibrated(example.copy()) for example in examples]

# Create comparison DataFrame
comparison_df = df.copy()
comparison_df["original_score"] = [ex["old_score"] for ex in examples]
comparison_df["calibrated_score"] = [ex["score_calibrated"] for ex in examples_with_calibrated]

print("Original vs Calibrated Scores:")
print(comparison_df[["original_score", "calibrated_score"]].head(10))

Original vs Calibrated Scores:
   original_score  calibrated_score
0          0.9017            0.9893
1          0.6043            0.6630
2          0.5344            0.9835
3          0.5824            0.9748
4          0.8482            0.9730
5          0.8436            0.9707
6          0.8297            0.9875
7          0.6972            0.7671
8          0.8299            0.9568
9          0.7027            0.9880
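
Since every example was labeled 1.0, one simple summary is the mean absolute gap to that target before and after calibration. The sketch below computes it from the ten rows printed above; `mean_abs_gap` is a helper introduced here.

```python
# Scores copied from the "Original vs Calibrated Scores" output above.
original = [0.9017, 0.6043, 0.5344, 0.5824, 0.8482,
            0.8436, 0.8297, 0.6972, 0.8299, 0.7027]
calibrated = [0.9893, 0.6630, 0.9835, 0.9748, 0.9730,
              0.9707, 0.9875, 0.7671, 0.9568, 0.9880]

TARGET = 1.0  # every example was labeled with a target score of 1.0


def mean_abs_gap(scores, target=TARGET):
    """Average absolute distance between each score and the target label."""
    return sum(abs(target - s) for s in scores) / len(scores)


print(f"mean gap before calibration: {mean_abs_gap(original):.4f}")
print(f"mean gap after calibration:  {mean_abs_gap(calibrated):.4f}")
```

With mixed labels you would pass each example's own target instead of a single constant.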


## What have we achieved?

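Calibration pulled the total scores toward the labeled target of 1.0. In the ten rows above, email 2 rose from 0.5344 to 0.9835 and email 3 from 0.5824 to 0.9748, while emails 1 and 7 moved less (0.6043 to 0.6630 and 0.6972 to 0.7671).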

## Save calibrated scoring system

Save the updated scoring spec to a file, which can be loaded in the future with `load_scoring_spec`.

In [None]:
from withpi_utils.colab import dump_scoring_spec
from google.colab import files

with open("email_ai_calibrated.json", "w") as file:
    file.write(dump_scoring_spec(scoring_spec_calibrated))
files.download('email_ai_calibrated.json')
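
Outside Colab, `files.download` is not available. If your scoring spec is a plain list of dictionaries, like the uncalibrated `scoring_spec` defined earlier, a standard-library JSON round trip is enough to persist and reload it. The sketch below makes that assumption. It does not use `dump_scoring_spec` or `load_scoring_spec`, whose output format is not shown in this notebook, and `save_spec` / `load_spec` are names introduced here.

```python
import json
import os
import tempfile

# Two questions copied from the scoring spec above, as plain dictionaries.
example_spec = [
    {
        "label": "Call To Action",
        "question": "Does the email include a clear and actionable call to action?",
        "weight": 0.5,
    },
    {
        "label": "Pithy Content",
        "question": "Is the email concise and to the point without unnecessary details?",
        "weight": 0.5,
    },
]


def save_spec(spec, path):
    """Write a list-of-dicts scoring spec to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(spec, handle, indent=2)


def load_spec(path):
    """Read a scoring spec previously written by save_spec."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip through a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    spec_path = os.path.join(tmp_dir, "email_spec.json")
    save_spec(example_spec, spec_path)
    restored = load_spec(spec_path)

print("round trip preserved the spec:", restored == example_spec)
```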


## Next Steps

This Colab used an (extremely!) limited amount of labeled data, but scaling up this feedback loop will pay dividends, allowing you to tune the range of your questions to match your notion of "goodness".