<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Preference_Collection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://withpi.ai/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Preference Collection

This Colab is the companion to the Preference Collection Playground. It shows how to apply preference data to your training pipeline.

It's easier to collect training data from the UI, but this Colab will have you rate a small number of examples in-line.

We will walk through the same `Aesop AI` example, but any contract with feedback data should work.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a WITHPI_API_KEY from https://play.withpi.ai.  Add it to your notebook secrets (the key symbol) on the left.

Run the cell below to install packages and load the SDK.

In [None]:
%%capture

import os
from google.colab import files, userdata

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

%pip install withpi litellm httpx datasets jinja2 tqdm

# Import a bunch of useful libraries for later.
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict
import time
import json
from pathlib import Path
import re

import datasets
import httpx
import litellm
import jinja2
from tqdm.notebook import tqdm
from withpi import PiClient
from withpi.types import Contract

from rich.console import Console
from rich.table import Table
from rich.live import Live

console = Console()

client = PiClient()

def print_contract(contract: Contract):
  """print_contract pretty-prints a contract"""
  for dimension in contract.dimensions:
    print(dimension.label)
    for sub_dimension in dimension.sub_dimensions:
      print(f"\t{sub_dimension.description}")

def generate(system: str, user: str, model: str) -> str:
  """generate passes the provided system and user prompts into the given model
  via LiteLLM"""
  messages = [
    {
      "content": system,
      "role": "system"
    },
    {
      "content": user,
      "role": "user"
    }
  ]
  return litellm.completion(model=model,
                            messages=messages).choices[0].message.content

class printer(str):
  """printer makes strings with embedded newlines print more nicely"""
  def __repr__(self):
    return self
def print_response(response: str):
  """print_response pretty-prints an LLM response, respecting newlines"""
  display(printer(response))

def print_scores(pi_scores):
  """print_scores pretty-prints a Pi Score response as a table."""
  for dimension_name, dimension_scores in pi_scores.dimension_scores.items():
    print(f"{dimension_name}: {dimension_scores.total_score}")
    for subdimension_name, subdimension_score in dimension_scores.subdimension_scores.items():
      print(f"\t{subdimension_name}: {subdimension_score}")
    print("\n")
  print("---------------------")
  print(f"Total score: {pi_scores.total_score}")

def save_file(filename: str, model: str):
  """save_file offers to download the model with the given filename"""
  Path(filename).write_text(model)
  files.download(filename)

def load_contract(url: str) -> Contract:
  """load_contract pulls a Contract JSON blob locally with validation."""
  resp = httpx.get(url)
  return Contract.model_validate_json(resp.content)

def load_and_split_dataset(url: str) -> datasets.DatasetDict:
  """load_and_split_dataset pulls in the Parquet file at url and does a 90/10 split"""
  return datasets.load_dataset('parquet', data_files=url, split="train").train_test_split(test_size=0.1)

def do_bulk_inference(dataset, system, model):
  """do_bulk_inference performs inference on the 'input' column of dataset, using
  the provided system prompt.  The model identified will be used via LiteLLM"""

  def do_generate(user, pbar):
    result = generate(system, user, model)
    pbar.update(1)
    return result

  futures = []
  pbar = tqdm(total=len(dataset))
  with ThreadPoolExecutor(max_workers=4) as executor:
    for row in dataset:
      futures.append(executor.submit(do_generate, row["input"], pbar))
  return [future.result() for future in futures]

def do_bulk_templated_inference(dataset, optimized, model):
  """do_bulk_templated_inference performs inference on the 'input' column of dataset,
  using the provided optimized prompt.  It should be a Jinja2 template as returned
  by DSPy"""
  prompt_template = jinja2.Template(optimized)
  result_extractor = re.compile(r".*\[\[ ## response ## \]\](.*)\[\[ ## completed ## \]\]", re.DOTALL)

  def do_generate(prompt: str, pbar) -> str:
    messages = json.loads(prompt_template.render(input=prompt))
    result = litellm.completion(model=model,
                                messages=messages).choices[0].message.content

    pbar.update(1)
    return result_extractor.match(result).group(1)

  futures = []
  pbar = tqdm(total=len(dataset))
  with ThreadPoolExecutor(max_workers=4) as executor:
    for row in dataset:
      futures.append(executor.submit(do_generate, row["input"], pbar))
  return [future.result() for future in futures]

def generate_table(
    job_id: str, training_data: dict, is_done: bool, additional_columns: dict[str, str]
):
    """Generate a training progress table dynamically."""
    table = Table(title=f"Training Status for {job_id}")

    # Define columns
    table.add_column("Step", justify="right", style="cyan")
    table.add_column("Epoch", justify="right", style="cyan")
    table.add_column("Learning Rate", justify="right", style="cyan")
    table.add_column("Train Loss", justify="right", style="magenta")
    table.add_column("Eval Loss", justify="right", style="green")
    for header in additional_columns.keys():
        table.add_column(header, justify="right", style="black")

    def format_num(num: float | None, digits: int = 4) -> str:
        if num is None:
            return "X"
        return format(num, f".{digits}f")

    for step, data in training_data.items():
        additional_columns_data = [
            format_num(data.get(column_name, None))
            for column_name in additional_columns.values()
        ]
        table.add_row(
            str(step),
            format_num(data.get("epoch", None)),
            format_num(data.get("learning_rate", None), digits=10),
            format_num(data.get("loss", None)),
            format_num(data.get("eval_loss", None)),
            *additional_columns_data,
        )

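    if not is_done:
        # Placeholder row: one empty cell for every column after "Step",
        # so the cell count always matches the number of columns.
        table.add_row("...", *[""] * (4 + len(additional_columns)))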

    return table


def stream_response(job_id: str, method, additional_columns: dict[str, str]):
    """stream_response streams messages from the provided method

    method should be a Pi client object with `retrieve` and `stream_messages`
    endpoints.  This is primarily for convenience."""

    training_data = defaultdict(dict)
    is_log_console = False

    while True:
        response = method.retrieve(job_id=job_id)
        if (response.state != "QUEUED") and (response.state != "RUNNING"):
            if response.state == "DONE" and not is_log_console:
                for line in response.detailed_status:
                    try:
                        data_dict = json.loads(line)
                        training_data[data_dict["step"]].update(data_dict)
                    except Exception:
                        pass
                console.print(
                    generate_table(
                        job_id,
                        training_data,
                        is_done=True,
                        additional_columns=additional_columns,
                    )
                )
            return response

        with method.with_streaming_response.stream_messages(
            job_id=job_id, timeout=None
        ) as response:
            with Live(auto_refresh=True, console=console, refresh_per_second=4) as live:
                is_done = False
                for line in response.iter_lines():
                    if line == "DONE":
                        is_done = True
                    try:
                        data_dict = json.loads(line)
                        training_data[data_dict["step"]].update(data_dict)
                    except Exception:
                        pass
                    live.update(
                        generate_table(
                            job_id,
                            training_data,
                            is_done,
                            additional_columns=additional_columns,
                        )
                    )
                    is_log_console = True


## Load contract and dataset

Load the contract and example data. The examples are stored in Pi's repository, but you can just as easily load any Hugging Face dataset with the `datasets` library.

In [None]:
aesop_contract = load_contract("https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/contracts/aesop_ai.json")
aesop = load_and_split_dataset("https://raw.githubusercontent.com/withpi/cookbook-withpi/refs/heads/main/datasets/aesop_ai_examples.parquet")
aesop

## Cluster Inputs

We're going to label some inputs as "good" and "bad", but to do this it is helpful to focus on a few different types of input.  We'll use clustering to make sure we don't have to look at too many examples.

In [None]:
input_topic_clusters = client.data.inputs.cluster(
    inputs=[{"identifier": str(index), "llm_input": row["input"]} for index, row in enumerate(aesop['train'])],
)

cluster_column = ['']*len(aesop['train'])
for cluster in input_topic_clusters:
  print(f"{cluster.topic}: {len(cluster.inputs)}")
  for identifier in cluster.inputs:
    cluster_column[int(identifier)] = cluster.topic

clustered_aesop = aesop['train'].add_column('cluster', cluster_column)

## Identify outliers

Let's first score every input/output pair against the contract and add the result as a column. Pi scoring is fast enough that processing the dataset serially is fine, though we could increase parallelism for more speed.
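If serial scoring ever becomes a bottleneck, the calls can be fanned out over a thread pool. The sketch below is only an illustration: `score_row` is a hypothetical stand-in for the `client.contracts.score(...)` call, and its scoring rule is a placeholder. Because `executor.map` returns results in input order, the list can be passed straight to `add_column`.

```python
from concurrent.futures import ThreadPoolExecutor


def score_row(row: dict) -> float:
    """Hypothetical stand-in for client.contracts.score(...).total_score."""
    # Placeholder rule for illustration only: longer outputs score higher.
    return min(len(row["output"]) / 100.0, 1.0)


def score_in_parallel(rows: list[dict], max_workers: int = 4) -> list[float]:
    """Score rows concurrently; executor.map keeps results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(score_row, rows))


rows = [
    {"input": "a", "output": "x" * 10},
    {"input": "b", "output": "x" * 50},
    {"input": "c", "output": "x" * 200},
]
scores = score_in_parallel(rows)
```

Threads suit this workload because each call spends most of its time waiting on the network rather than on the CPU.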

In [None]:
aesop_scored = clustered_aesop.add_column("uncalibrated_scores",
   [client.contracts.score(contract=aesop_contract, llm_input=row["input"], llm_output=row["output"]).total_score for row in clustered_aesop]
)

## Label data

Now it's time to label examples against a simple statement: **The response fully satisfies the input according to the contract**. Valid responses are **Strongly Agree**, **Agree**, **Neutral**, **Disagree**, and **Strongly Disagree**, or simply **5** down to **1**.

The cell below selects one high-scoring and one low-scoring exemplar from each cluster and asks you to rate each from **5** down to **1**.
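The selection step can be sketched in plain Python, independent of the `datasets` API. The helper name `pick_exemplars` and the toy rows are assumptions for illustration; the column names `cluster` and `uncalibrated_scores` match the ones used in this notebook. Unlike a naive first/last pick, this sketch returns a single row for a cluster that contains only one example.

```python
from collections import defaultdict


def pick_exemplars(rows: list[dict], score_key: str = "uncalibrated_scores") -> list[dict]:
    """Return the lowest- and highest-scoring row of every cluster."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)

    exemplars = []
    for cluster_rows in by_cluster.values():
        ordered = sorted(cluster_rows, key=lambda r: r[score_key])
        exemplars.append(ordered[0])
        # Skip the high exemplar when the cluster has a single row.
        if len(ordered) > 1:
            exemplars.append(ordered[-1])
    return exemplars


toy_rows = [
    {"cluster": "animals", "uncalibrated_scores": 0.5},
    {"cluster": "animals", "uncalibrated_scores": 0.2},
    {"cluster": "animals", "uncalibrated_scores": 0.9},
    {"cluster": "morals", "uncalibrated_scores": 0.7},
]
picked = pick_exemplars(toy_rows)
```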

In [None]:
def get_label(row):
  display("Input Prompt:")
  display(row["input"])
  display("Output Response:")
  display(row["output"])
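  while True:
    resp = input("Your rating from 1 to 5: ")
    try:
      rating = int(resp)
      if rating not in [1,2,3,4,5]:
        raise ValueError("Invalid")
    except ValueError:
      display("Invalid input. Try again")
      continue
    break
  # Store the normalized digit so stray whitespace can't break to_rating later.
  row['label'] = str(rating)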
  return row

cluster_labels = set(aesop_scored["cluster"])
labelled = []
for cluster in cluster_labels:
  # Named cluster_rows to avoid shadowing the built-in sorted().
  cluster_rows = aesop_scored.filter(lambda e: e['cluster'] == cluster).sort("uncalibrated_scores")
  labelled.append(get_label(cluster_rows[0]))
  labelled.append(get_label(cluster_rows[-1]))

## Calibrate

Now it's time to calibrate the contract with the labelled examples. The following cell launches a calibration job and monitors it until completion.
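Calibration examples carry a textual rating rather than a number, so each label has to be mapped first. A minimal sketch of that mapping using a dictionary lookup with validation; `build_example` is a hypothetical helper, and the field names `llm_input`, `llm_output` and `rating` mirror those used in this notebook.

```python
RATINGS = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}


def to_rating(label) -> str:
    """Map a 1-5 label (int or numeric string) to its textual rating."""
    value = int(str(label).strip())
    if value not in RATINGS:
        raise ValueError(f"Label must be between 1 and 5, got {label!r}")
    return RATINGS[value]


def build_example(row: dict) -> dict:
    """Hypothetical helper: shape one labelled row as a calibration example."""
    return {
        "llm_input": row["input"],
        "llm_output": row["output"],
        "rating": to_rating(row["label"]),
    }


example = build_example({"input": "Tell a fable.", "output": "Once upon a time", "label": " 4 "})
```

Raising on an unexpected label keeps a bad rating from silently reaching the calibration job as `None`.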

In [None]:
def to_rating(label):
  match label:
    case '1':
      return "Strongly Disagree"
    case '2':
      return "Disagree"
    case '3':
      return "Neutral"
    case '4':
      return "Agree"
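    case '5':
      return "Strongly Agree"
    case _:
      # Fail loudly instead of silently returning None for an unknown label.
      raise ValueError(f"Unexpected label: {label!r}")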

contract_calibration_status = client.contracts.calibrate.start_job(
    contract=aesop_contract,
    examples=[{"llm_input": row['input'], "llm_output": row['output'], "rating": to_rating(row['label'])} for row in labelled]
)
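# stream_response requires additional_columns; pass an empty dict since no extra metrics are shown.
aesop_contract_calibrated = stream_response(contract_calibration_status.job_id, client.contracts.calibrate, additional_columns={}).calibrated_contract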

## Rescore after calibration

Now score the labelled examples again with the calibrated contract. You can examine the new scores to see if they align more closely with the labels you assigned. Ideally the score starts separating good responses from bad.

If it does not, that suggests the properties you **really** care about aren't captured in your scoring dimensions and will need to be added.  Proceed to the playgrounds at http://play.withpi.ai to experiment with this.

If the scores look good, you now have a scoring function you can use to improve your system.
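One rough way to quantify "separating good responses from bad" is pairwise agreement: among all pairs of examples with different labels, the fraction whose scores are ordered the same way as their labels. This is a sketch with made-up numbers, not a Pi feature; `pairwise_agreement` is a hypothetical helper, and labels stored as strings should be converted with `int` first.

```python
from itertools import combinations


def pairwise_agreement(labels: list[int], scores: list[float]) -> float:
    """Fraction of differently-labelled pairs ranked the same way by score."""
    agree = 0
    total = 0
    for (l1, s1), (l2, s2) in combinations(zip(labels, scores), 2):
        if l1 == l2:
            continue  # Equal labels carry no ordering information.
        total += 1
        if (l1 - l2) * (s1 - s2) > 0:
            agree += 1
    return agree / total if total else float("nan")


# Illustrative values only, not real results.
labels = [5, 1, 4, 2]
before = pairwise_agreement(labels, [0.6, 0.5, 0.4, 0.7])
after = pairwise_agreement(labels, [0.9, 0.2, 0.7, 0.3])
```

A value near 1.0 means the score ranks examples the way you did; a value near 0.5 means the score is no better than chance at ordering them.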

In [None]:
for row in labelled:
   row['calibrated'] = client.contracts.score(contract=aesop_contract_calibrated, llm_input=row["input"], llm_output=row["output"]).total_score
   print(f"Label: {row['label']}, Original Score: {row['uncalibrated_scores']}, Calibrated: {row['calibrated']}")

## Save calibrated contract

The updated contract now has different weights assigned to its dimensions.  Save those for later.
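Outside Colab, where `files.download` is not available, the same JSON can simply be written to disk and read back. The sketch below uses a hypothetical dictionary in place of the real `model_dump_json` output; the `name` and `weight` fields and their values are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for aesop_contract_calibrated.model_dump_json(indent=2).
contract_json = json.dumps(
    {
        "name": "Aesop AI",
        "dimensions": [
            {"label": "Story quality", "weight": 0.7, "sub_dimensions": []},
            {"label": "Moral clarity", "weight": 0.3, "sub_dimensions": []},
        ],
    },
    indent=2,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "aesop_ai_calibrated.json"
    path.write_text(contract_json)
    # Reading the file back should reproduce the same structure.
    restored = json.loads(path.read_text())

weights = {d["label"]: d["weight"] for d in restored["dimensions"]}
```

With the real SDK object, `Contract.model_validate_json(path.read_text())`, the same call `load_contract` uses above, would turn the saved file back into a `Contract`.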

In [None]:
save_file('aesop_ai_calibrated.json', aesop_contract_calibrated.model_dump_json(indent=2))

## Next Steps

Now that you have a calibrated contract, other parts of Pi that rely on its scores should work better. This Colab used only a small amount of hand-labelled data; scaling up this feedback loop with more labels should improve calibration further.