<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Preference_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://withpi.ai/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# WithPi Contract Calibration

This colab assumes that you already went through [Input Generation](https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Input_Generation.ipynb), and now wish to calibrate your prompt.

We will walk through the same `Aesop AI` example, but you can load any contract here. Let's dig in!

This should take about **15 minutes**, even if you're unfamiliar with Colab.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai.  Add it to your notebook secrets (the key symbol) on the left.

Run the cell below to install packages and load the SDK.

In [None]:
%%capture

%pip install withpi litellm

import os
from google.colab import userdata
from litellm import completion
from withpi import PiClient

os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()

def print_contract(contract):
  # Print each dimension label, followed by its indented sub-dimension descriptions.
  for dimension in contract.dimensions:
    print(dimension.label)
    for sub_dimension in dimension.sub_dimensions:
      print(f"\t{sub_dimension.description}")

def generate(system: str, user: str, model: str) -> str:
  # Send a system + user message pair to the model and return the text of the reply.
  messages = [
    {
      "content": system,
      "role": "system"
    },
    {
      "content": user,
      "role": "user"
    }
  ]
  return completion(model=model,
                    messages=messages).choices[0].message.content

class printer(str):
  def __repr__(self):
    return self
def prettyprint(response: str):
  display(printer(response))

def print_scores(pi_scores):
  for dimension_name, dimension_scores in pi_scores.dimension_scores.items():
    print(f"{dimension_name}: {dimension_scores.total_score}")
    for subdimension_name, subdimension_score in dimension_scores.subdimension_scores.items():
      print(f"\t{subdimension_name}: {subdimension_score}")
    print("\n")
  print("---------------------")
  print(f"Total score: {pi_scores.total_score}")

In [None]:
import httpx
import pandas as pd
from google.colab import data_table
from withpi.types import Contract

resp = httpx.get("https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/contracts/aesop_ai.json")

aesop_contract = Contract.model_validate_json(resp.content)

for dimension in aesop_contract.dimensions:
  print(dimension.label)
  for sub_dimension in dimension.sub_dimensions:
    print(f"\t{sub_dimension.description}")

df = pd.read_parquet("https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/datasets/aesop_ai_examples.parquet")
data_table.enable_dataframe_formatter()
df


## Cluster Inputs

We're going to label some inputs as "good" and "bad", but to do this it is helpful to focus on a few different types of input.  We'll use clustering to make sure we don't have to look at too many examples.

In [None]:
input_topic_clusters = client.data.inputs.cluster(
    inputs=[{"identifier": str(index), "llm_input": row["input"]} for index, row in df.iterrows()],
)

df['cluster'] = ['']*len(df)
for cluster in input_topic_clusters:
  for identifier in cluster.inputs:
    df.loc[int(identifier),'cluster'] = cluster.topic
df

## Identify outliers

Let's first score every input against the contract, adding that as a column.  Pi scoring is fast enough that serially processing the dataset is fine, though we could increase parallelism for more speed.
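
If the dataset grows, the parallel variant mentioned above could look roughly like the sketch below. It is a minimal illustration using a thread pool from the standard library; `score_in_parallel` and `fake_score` are hypothetical names introduced here, and `fake_score` only stands in for a network-bound scoring call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def score_in_parallel(
    rows: Sequence[dict],
    score_fn: Callable[[dict], float],
    max_workers: int = 8,
) -> list:
    """Apply score_fn to every row concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs,
        # so the list can be assigned directly to a DataFrame column.
        return list(pool.map(score_fn, rows))


def fake_score(row: dict) -> float:
    # Hypothetical stand-in for a remote scoring call (assumption, not the Pi API).
    return len(row["output"]) / 10


rows = [
    {"input": "a", "output": "x" * 5},
    {"input": "b", "output": "x" * 10},
]
scores = score_in_parallel(rows, fake_score)
print(scores)
```

In this notebook, `score_fn` could wrap the scoring call used in the next cell and `rows` could come from `df.to_dict("records")`. Whether the SDK client is safe to share across threads is an assumption you should verify before relying on this.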

In [None]:
df["uncalibrated_scores"] = [client.contracts.score(contract=aesop_contract, llm_input=row["input"], llm_output=row["output"]).total_score for idx, row in df.iterrows()]
df

## Label data

Now it's time to label examples against a simple statement: **The response fully satisfies the input according to the contract**. Valid responses are **Strongly Agree**, **Agree**, **Neutral**, **Disagree**, and **Strongly Disagree**, or simply **5** down to **1**.

The cell below selects one high-scoring and one low-scoring exemplar from each cluster and asks you to rate each from **5** down to **1**.
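
To make the selection rule concrete, here is a small sketch in plain Python. The helper name `pick_exemplars` and the toy rows are made up for illustration; only the `cluster` and `uncalibrated_scores` column names come from this notebook.

```python
from collections import defaultdict


def pick_exemplars(rows, cluster_key="cluster", score_key="uncalibrated_scores"):
    """Return {cluster: (lowest_scoring_row, highest_scoring_row)}."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row[cluster_key]].append(row)

    exemplars = {}
    for topic, members in by_cluster.items():
        ordered = sorted(members, key=lambda r: r[score_key])
        # First element has the lowest score, last element the highest.
        exemplars[topic] = (ordered[0], ordered[-1])
    return exemplars


# Toy data; the scores are invented for the example.
rows = [
    {"id": 0, "cluster": "animals", "uncalibrated_scores": 0.2},
    {"id": 1, "cluster": "animals", "uncalibrated_scores": 0.9},
    {"id": 2, "cluster": "animals", "uncalibrated_scores": 0.5},
    {"id": 3, "cluster": "morals", "uncalibrated_scores": 0.7},
]
exemplars = pick_exemplars(rows)
low, high = exemplars["animals"]
print(low["id"], high["id"])
```

Note that a cluster with a single member yields the same row as both its low and high exemplar, so it only needs to be labelled once.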

In [None]:
df["label"] = [''] * len(df)

def get_label(row):
  """Show one input/output pair and store the user's 1-5 rating in df."""
  display("Input Prompt:")
  display(row.loc["input"])
  display("Output Response:")
  display(row.loc["output"])
  while True:
    resp = input("Your rating from 1 to 5: ").strip()
    # Accept only the digits '1'-'5' so the stored label maps cleanly to a rating.
    if resp in {'1', '2', '3', '4', '5'}:
      break
    display("Invalid input. Try again")
  df.loc[row.name, 'label'] = resp

# Label the lowest- and highest-scoring example in each cluster.
for _, cluster in df.groupby("cluster"):
  ranked = cluster.sort_values(by=['uncalibrated_scores'])
  get_label(ranked.iloc[0])
  if len(ranked) > 1:  # avoid asking twice about a single-example cluster
    get_label(ranked.iloc[-1])

## Calibrate

Now it's time to calibrate with the labelled sets.  The following cell will launch a job.
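
The job expects each labelled example as an input, an output, and a textual rating. As a hedged sketch of that mapping, the snippet below uses a lookup table that fails loudly on unexpected labels; `RATINGS`, `label_to_rating`, and `build_example` are hypothetical names, while the rating strings and dictionary keys mirror the ones used in this notebook.

```python
RATINGS = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}


def label_to_rating(label) -> str:
    """Map a 1-5 label (int or numeric string) to its rating text."""
    try:
        return RATINGS[int(label)]
    except (KeyError, ValueError, TypeError) as err:
        # Raise instead of silently returning None for an unknown label.
        raise ValueError(f"Unsupported label: {label!r}") from err


def build_example(llm_input: str, llm_output: str, label) -> dict:
    """Assemble one calibration example in the shape used by this notebook."""
    return {
        "llm_input": llm_input,
        "llm_output": llm_output,
        "rating": label_to_rating(label),
    }


example = build_example("Tell a fable about patience.", "Once upon a time...", "4")
print(example["rating"])
```

Raising on an unknown label is a design choice: it surfaces a stray empty or malformed label before the job is submitted.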

In [None]:
def to_rating(label):
  match label:
    case '1':
      return "Strongly Disagree"
    case '2':
      return "Disagree"
    case '3':
      return "Neutral"
    case '4':
      return "Agree"
    case '5':
      return "Strongly Agree"

labelled = df[df['label'] != '']
contract_calibration_status = client.contracts.calibrate.start_job(
    contract=aesop_contract,
    examples=[{"llm_input": row['input'], "llm_output": row['output'], "rating": to_rating(row['label'])} for _, row in labelled.iterrows()]
)

## Monitor for completion

The next cell will monitor logs from the calibration job, ending when it's complete.
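
If you only need the final state and not the logs, a generic polling loop with a timeout is one alternative. The sketch below is not part of the SDK: `wait_for_job` and its parameters are hypothetical, `get_state` is any callable that returns the current state string, and `"DONE"` is a made-up terminal state used only to drive the fake job.

```python
import time
from typing import Callable

# States this notebook treats as "not finished yet".
ACTIVE_STATES = {"QUEUED", "RUNNING"}


def wait_for_job(
    get_state: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 1800.0,
) -> str:
    """Poll get_state() until it leaves ACTIVE_STATES or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state not in ACTIVE_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {state} after {timeout} seconds")
        time.sleep(poll_interval)


# Fake job that finishes on the fourth poll ("DONE" is a placeholder state name).
fake_states = iter(["QUEUED", "RUNNING", "RUNNING", "DONE"])
final_state = wait_for_job(lambda: next(fake_states), poll_interval=0.0)
print(final_state)
```

With the real client, `get_state` could be a small lambda that calls the retrieve method used in the next cell and returns its `state` attribute.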

In [None]:
# Poll the job; while it is still queued or running, stream its log messages.
while True:
  calibrated_response = client.contracts.calibrate.retrieve(job_id=contract_calibration_status.job_id)
  if calibrated_response.state not in ('QUEUED', 'RUNNING'):
    break

  with client.contracts.calibrate.with_streaming_response.stream_messages(
      job_id=contract_calibration_status.job_id, timeout=None) as response:
    for line in response.iter_lines():
      print(line)

aesop_contract_calibrated = calibrated_response.calibrated_contract

## Rescore after calibration

Now add a new column with calibrated scores. You can examine these to see if they more closely align with the examples you labelled.  Ideally the score starts separating good responses from bad.

If it does not, that suggests the properties you **really** care about aren't captured in your scoring dimensions and will need to be added.  Proceed to the playgrounds at http://play.withpi.ai to experiment with this.

If this is looking good, you have a powerful function for improving your system.
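
One rough way to quantify whether scores "separate good responses from bad" is pairwise agreement: among labelled pairs with different ratings, the fraction where the higher-rated response also has the higher score. The sketch below is an illustrative metric chosen here, not something the Pi SDK provides, and the numbers are invented.

```python
from itertools import combinations


def pairwise_agreement(labels, scores) -> float:
    """Fraction of differently-labelled pairs that the scores rank the same way."""
    concordant = 0
    comparable = 0
    for (l1, s1), (l2, s2) in combinations(zip(labels, scores), 2):
        if l1 == l2:
            continue  # equal labels say nothing about ordering
        comparable += 1
        if (l1 - l2) * (s1 - s2) > 0:
            concordant += 1
    return concordant / comparable if comparable else float("nan")


# Invented example: four labelled rows scored before and after calibration.
labels = [1, 5, 2, 4]
before = [0.6, 0.5, 0.4, 0.7]
after = [0.2, 0.9, 0.3, 0.8]

print(pairwise_agreement(labels, before))
print(pairwise_agreement(labels, after))
```

A value near 1.0 means the scores order the labelled examples the way you did; a value near 0.5 means the ordering is close to random. In this notebook the labels would be the integer form of `df["label"]` for the labelled rows, compared against the score columns.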

In [None]:
df["calibrated_scores"] = [client.contracts.score(contract=aesop_contract_calibrated, llm_input=row["input"], llm_output=row["output"]).total_score for idx, row in df.iterrows()]
df

## Save calibrated contract

The updated contract now has different weights assigned to its dimensions.  Save those for later.

In [None]:
from pathlib import Path
from google.colab import files

filename = 'aesop_ai_calibrated.json'
Path(filename).write_text(aesop_contract_calibrated.model_dump_json(indent=2))
files.download(filename)

## Next Steps

Now that you have a calibrated contract, you can start incorporating feedback to improve your system. Proceed to the [Feedback Clustering](https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Feedback_Clustering.ipynb) colab to do this.