<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/workshop/Building_Evals_that_actually_work.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://withpi.ai"><font size="4">Copilot</font></a>

# [Workshop] - Building Evals that actually work

This Colab walks you through scoring data programmatically and using the scores in a few different workflows.

It uses similar data to the Google Sheets workflow [Building Evals that actually work](https://docs.google.com/spreadsheets/d/1AfBIWpIr0wUIpHnPEMXV7Rh_qGEsx-dxyzHhavX_xgU/edit?gid=2058777572#gid=2058777572), but it uses Python code snippets to go one level deeper.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://withpi.ai/account/keys. Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils litellm
# See https://github.com/huggingface/datasets/issues/7570
%pip install -U datasets huggingface_hub fsspec

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()
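
Outside Colab, `google.colab` is not importable, so the cell above stops at the import. A minimal sketch of a loader that works in both settings, assuming the key may already be set in the environment. The helper name `load_secret` is hypothetical and not part of the Pi SDK:

```python
import os


def load_secret(name: str) -> str:
    """Return a secret from the environment, falling back to Colab secrets."""
    value = os.environ.get(name)
    if value:
        return value
    try:
        # Only importable inside a Google Colab runtime.
        from google.colab import userdata
    except ImportError as err:
        raise RuntimeError(
            f"{name} is not set and Colab secrets are unavailable"
        ) from err
    value = userdata.get(name)
    if not value:
        raise RuntimeError(f"{name} is empty in Colab secrets")
    return value
```

With this in place, `os.environ["WITHPI_API_KEY"] = load_secret("WITHPI_API_KEY")` could replace the direct `userdata.get` call before `PiClient()` is created.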

## Setup scoring system

In your Copilot session for your scoring system, click `Code` and copy the contents of that cell. Paste it below, overwriting the example spec. This defines a `score()` function that the rest of the Colab uses.

In [None]:
from withpi import PiClient

# Initialize Pi client
pi = PiClient()

def score(llm_input: str, llm_output: str) -> float:
  """Score one input/output pair against the spec and return the total score."""
  # Each entry pairs a short label with the question used to judge the output.
  scoring_spec = [
    {
      "label": "Value Proposition",
      "question": "Is the value proposition clearly articulated?"
    },
    {
      "label": "Hook",
      "question": "Does the tweet begin with an attention-grabbing hook?"
    },
    {
      "label": "Customer Focus",
      "question": "Is the tweet focused on customer benefits rather than just product features?"
    }
  ]

  return pi.scoring_system.score(
    llm_input=llm_input, llm_output=llm_output, scoring_spec=scoring_spec
  ).total_score


## Load a dataset

Load a sample dataset from Hugging Face to play around with.

In [None]:
from datasets import load_dataset

ds = load_dataset("withpi/aiewf_workshop_data_extra_long_transcripts")

display(ds)

DatasetDict({
    train: Dataset({
        features: ['Length', 'Manual Rating', 'Prompt', 'Generated Tweet'],
        num_rows: 151
    })
})

In [None]:
import pandas as pd

def score_example(example):
  example["Score"] = score(example["Prompt"], example["Generated Tweet"])
  return example

scored = ds["train"].map(score_example)
display(pd.DataFrame(scored))


| | Length | Manual Rating | Prompt | Generated Tweet | Score |
|---|---|---|---|---|---|
| 0 | 228 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Plan your dream getaway in minutes, not hours.... | 0.826823 |
| 1 | 227 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | From wishlist to itinerary in 60 seconds. Our ... | 0.790365 |
| 2 | 184 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Introducing a smarter way to travel! Our app c... | 0.742188 |
| 3 | 252 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Travelers are saving 5+ hours of planning time... | 0.815104 |
| 4 | 236 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Travel planning shouldn't be a full-time job. ... | 0.854167 |
| ... | ... | ... | ... | ... | ... |
| 146 | 207 | | Draft a tweet highlighting our trip planning a... | Introducing Local Transport Integration: Our t... | 0.601562 |
| 147 | 199 | | Write me a product marketing tweet for an AI a... | Bringing spontaneity back to travel. Our last-... | 0.656250 |
| 148 | 232 | | Write me a product marketing tweet for an AI a... | Let data guide your next adventure. Our AI-pow... | 0.723958 |
| 149 | 111 | | Write me a product marketing tweet for an AI a... | We have a new app for planning trips. You can ... | 0.164632 |
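
One way to check that the scores track human judgment is to group them by the label at the start of each `Manual Rating`. A minimal sketch with pandas, using a handful of rows copied from the outputs in this notebook rather than the full dataset. The `Rating Label` column is a name introduced here for illustration:

```python
import pandas as pd

# A few (Manual Rating, Score) pairs copied from the tables in this notebook.
rows = pd.DataFrame({
    "Manual Rating": [
        "Excellent: Clear value proposition, specific b...",
        "Excellent: Clear value proposition, specific b...",
        "Excellent: Clear value proposition, specific b...",
        "Excellent: Clear value proposition, specific b...",
        "Excellent: Clear value proposition, specific b...",
        "Average: Basic information but generic, lacks ...",
    ],
    "Score": [0.826823, 0.790365, 0.742188, 0.815104, 0.854167, 0.765625],
})

# The label is the text before the first colon, e.g. "Excellent".
rows["Rating Label"] = rows["Manual Rating"].str.split(":", n=1).str[0]

# Count and mean score per human rating label.
summary = rows.groupby("Rating Label")["Score"].agg(["count", "mean"])
print(summary)
```

Applied to `pd.DataFrame(scored)`, rows without a manual rating would get a missing label, which `groupby` drops by default. If the means do not rise with the human rating, that is a hint the scoring spec needs another look.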


## Rejection Sampling

Now that you have a measure of quality, it's easy to discard low-scoring results, for example to build a training set from only the best outputs.

In [None]:
filtered = scored.filter(lambda example: example["Score"] > 0.7)
display(pd.DataFrame(filtered))


| | Length | Manual Rating | Prompt | Generated Tweet | Score |
|---|---|---|---|---|---|
| 0 | 228 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Plan your dream getaway in minutes, not hours.... | 0.826823 |
| 1 | 227 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | From wishlist to itinerary in 60 seconds. Our ... | 0.790365 |
| 2 | 184 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Introducing a smarter way to travel! Our app c... | 0.742188 |
| 3 | 252 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Travelers are saving 5+ hours of planning time... | 0.815104 |
| 4 | 236 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Travel planning shouldn't be a full-time job. ... | 0.854167 |
| 5 | 167 | Average: Basic information but generic, lacks ... | Write me a product marketing tweet for an AI a... | Planning a trip? Our software makes it easy! C... | 0.765625 |
| 6 | 178 | | Write me a product marketing tweet for an AI a... | The trip planning revolution has arrived! Our ... | 0.721354 |
| 7 | 212 | | Write me a product marketing tweet for an AI a... | Ever spent hours planning a trip only to wonde... | 0.779948 |
| 8 | 228 | | Write me a product marketing tweet for an AI a... | I planned my entire European vacation in under... | 0.752604 |
| 9 | 211 | | Write me a product marketing tweet for an AI a... | The only trip planner with a "Serendipity Scor... | 0.808594 |
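
The 0.7 cutoff is a judgment call. A minimal sketch of how one might see what different cutoffs keep, using a few scores copied from the outputs above as stand-in data. `kept_fraction` is a hypothetical helper, not part of any library:

```python
import pandas as pd

# Scores copied from the sample outputs in this notebook (not the full dataset).
scores = pd.Series([0.826823, 0.790365, 0.742188, 0.815104, 0.854167,
                    0.601562, 0.656250, 0.723958, 0.164632])


def kept_fraction(scores: pd.Series, threshold: float) -> float:
    """Fraction of examples with a score strictly above the threshold."""
    return float((scores > threshold).mean())


for threshold in (0.5, 0.7, 0.8):
    print(f"threshold={threshold}: keeps {kept_fraction(scores, threshold):.0%}")
```

A stricter threshold gives a cleaner but smaller training set, so it is worth running this kind of sweep on the full `scored` data before settling on a value.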


## Model Comparison

With a scoring system in place, you can compare how different models perform more objectively.

We'll rerun a few of the prompts from the dataset above with a couple of different models.

You can import a Google Gemini key from AI Studio using the left pane, which populates a `GOOGLE_API_KEY` secret; at low request rates it's free. Alternatively, switch to a model of your choice, with its own key, by following the docs at https://docs.litellm.ai/docs/.

This task is easy, so expect almost any model to handle it well, but the same technique extends to more rigorous model evaluation.

In [None]:
import litellm

os.environ["GEMINI_API_KEY"] = userdata.get('GOOGLE_API_KEY')

def generate_and_score(model, example):
  example[f"{model} tweet"] = litellm.completion(
    model=model,
    messages=[
        {"content": "Answer the prompt directly with no extra text or explanation", "role": "system"},
        {"content": example["Prompt"], "role": "user"}
    ]).choices[0].message.content
  example[f"{model} Score"] = score(example["Prompt"], example[f"{model} tweet"])
  return example

more_models = scored.select(range(5)).map(lambda ex: generate_and_score("gemini/gemini-2.5-flash-preview-05-20", ex))
more_models = more_models.select(range(5)).map(lambda ex: generate_and_score("gemini/gemini-1.5-flash", ex))

display(pd.DataFrame(more_models))


| | Length | Manual Rating | Prompt | Generated Tweet | Score | gemini/gemini-2.5-flash-preview-05-20 tweet | gemini/gemini-2.5-flash-preview-05-20 Score | gemini/gemini-1.5-flash tweet | gemini/gemini-1.5-flash Score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 228 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Plan your dream getaway in minutes, not hours.... | 0.826823 | ✈️ Ditch the stress, not the adventure! Our AI... | 0.828125 | Planning a trip is stressful.  Let our AI trav... | 0.841146 |
| 1 | 227 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | From wishlist to itinerary in 60 seconds. Our ... | 0.790365 | Stress-free travel planning? Yes, please! ✈️ O... | 0.878906 | Planning a trip is stressful.  Let our AI trav... | 0.841146 |
| 2 | 184 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Introducing a smarter way to travel! Our app c... | 0.742188 | Planning your dream trip just got easier. Our ... | 0.838542 | Planning a trip is stressful?  Let our AI agen... | 0.829427 |
| 3 | 252 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Travelers are saving 5+ hours of planning time... | 0.815104 | Say goodbye to trip planning stress! 👋 Our AI ... | 0.813802 | Planning a trip is stressful?  Let our AI trav... | 0.841146 |
| 4 | 236 | Excellent: Clear value proposition, specific b... | Write me a product marketing tweet for an AI a... | Travel planning shouldn't be a full-time job. ... | 0.854167 | Dreaming of a getaway but dreading the plannin... | 0.841146 | Planning a trip is stressful?  Let our AI trav... | 0.835938 |
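
Row-by-row scores are hard to compare by eye, so a per-model summary helps. A minimal sketch with pandas, using the scores copied from the comparison output above. The short column names are chosen here for readability and are not what the notebook produces:

```python
import pandas as pd

# Scores copied from the comparison output above, one column per source.
comparison = pd.DataFrame({
    "original": [0.826823, 0.790365, 0.742188, 0.815104, 0.854167],
    "gemini-2.5-flash-preview-05-20": [0.828125, 0.878906, 0.838542, 0.813802, 0.841146],
    "gemini-1.5-flash": [0.841146, 0.841146, 0.829427, 0.841146, 0.835938],
})

# Mean score per source, best first, plus the spread across prompts.
means = comparison.mean().sort_values(ascending=False)
spread = comparison.std()
print(pd.DataFrame({"mean": means, "std": spread[means.index]}))
```

With only five prompts, the small gap between the two Gemini models here is unlikely to be meaningful. A larger sample would be needed before drawing conclusions.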


## Test system prompts

The examples so far embed the task in the user prompt and use a simple instruction-following system prompt.

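You can also try varying the system prompt to see how that influences scoring. The sample output below reports question scores such as `Title Accuracy` and `Action Items`, so it appears to come from a scoring system built for meeting summaries rather than the example tweet spec; use a matching `score()` for this section.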

In [None]:
import litellm
from IPython.display import display

os.environ["GEMINI_API_KEY"] = userdata.get('GOOGLE_API_KEY')

bad_prompt = """
Your job is to summarize meetings. Include title, key insights, and action items. Output as json. Do not surround the JSON in triple ticks or any other decoration.
"""

good_prompt = """
You are an AI assistant designed to analyze meeting or conversation transcripts. Given a transcript, your task is to extract the following elements and return them in JSON format:

Title: A concise and informative title summarizing the main topic or purpose of the conversation.
Key Insights: A list of the most important insights, takeaways, or conclusions drawn from the transcript. Each insight should be clear, standalone, and phrased in natural language.
Action Items: A list of actionable tasks or follow-ups mentioned or implied in the conversation. Include who is responsible (if mentioned), what the action is, and any relevant deadlines or context.

Guidelines:
Do not include any information not found in or inferred directly from the transcript. Be concise and clear in phrasing each output field.
Output only valid JSON. Do not include any additional text or markdown formatting. Output only a valid JSON object.

JSON Output Format:
{
  "title": "string",
  "key_insights": [ "string", "string", ... ],
  "action_items": [
    { "description": "string", "owner": "string (optional)", "due_date": "string (optional)" }
  ]
}
"""

inputs = pd.DataFrame(load_dataset('withpi/aiewf_workshop_data')['train'].filter(lambda ex: ex["Rating"] == "thumbs_down"))
inputs = inputs['Raw meeting transcript'].iloc[:5]

def generate_and_score(system_prompt, transcript):
  # Generate a summary of the transcript under the given system prompt.
  response = litellm.completion(
    model="gemini/gemini-2.5-flash-preview-05-20",
    messages=[
      {"content": system_prompt, "role": "system"},
      {"content": transcript, "role": "user"}
    ]).choices[0].message.content
  score_ = score(transcript, response)
  print(f'Response ({score_}):', response)
  return { 'response': response, 'score': score_ }

# Score every transcript under both prompts so the results are directly comparable.
results = { 'Bad Prompt': [], 'Good Prompt': [] }
for transcript in inputs:
  bad = generate_and_score(bad_prompt, transcript)
  good = generate_and_score(good_prompt, transcript)
  results['Bad Prompt'].append(bad['score'])
  results['Good Prompt'].append(good['score'])
results = pd.DataFrame({ 'Inputs': inputs, **results })
display(results)
# Average score per prompt across all transcripts.
display(results[['Bad Prompt', 'Good Prompt']].mean())

Question scores {'Title Accuracy': 0.9961, 'Key Insights': 0.957, 'Action Items': 0.9688, 'Action Details': 0.2256, 'Transcript Coverage': 0.9531, 'Insight Clarity': 0.8711, 'Action Completeness': 0.9648, 'Owner Identification': 0.9961, 'Due Date Inclusion': 0.0664, 'Redundancy Avoidance': 0.9688, 'Abusive Language': 1.0}
Response (0.3795): ```json
{
  "title": "Community Facility District (CFD) Annual Special Tax Increase Approval",
  "key_insights": [
    "A resolution was proposed to adopt an annual special tax for Community Facility District (CFD) Number 2007-24, applicable to commercial properties in Belmont Shore for fiscal year 2019.",
    "The special tax rate, which has been $0.12 per commercial square foot since 2006, is proposed to be increased. This is the first increase in approximately 12 years.",
    "The primary reasons for the proposed increase are the rising costs associated with administering the various payments, servicing the bond, and increased annual debt payment

| | Inputs | Bad Prompt | Good Prompt |
|---|---|---|---|
| 0 | Speaker 0: Next item is the one that you're an... | 0.3795 | 0.1336 |
| 1 | Speaker 0: Next item is the one that you're an... | 0.1136 | 0.9783 |
| 2 | Speaker 2: Thank you. Item 15.\nSpeaker 0: Rep... | 0.0281 | 0.8413 |
| 3 | Speaker 1: Madam Clerk, could you please read ... | 0.0359 | 0.047 |
| 4 | Speaker 1: Madam Clerk, could you please read ... | 0.0415 | 0.045 |


| | Mean score |
|---|---|
| Bad Prompt | 0.11972 |
| Good Prompt | 0.40904 |
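
Averages can hide how often one prompt actually beats the other. A minimal sketch of a paired comparison, using the per-transcript scores copied from the output above. The variable names are introduced here for illustration:

```python
import pandas as pd

# Per-transcript scores copied from the results output above.
prompt_scores = pd.DataFrame({
    "Bad Prompt": [0.3795, 0.1136, 0.0281, 0.0359, 0.0415],
    "Good Prompt": [0.1336, 0.9783, 0.8413, 0.0470, 0.0450],
})

# Positive delta means the good prompt scored higher on that transcript.
delta = prompt_scores["Good Prompt"] - prompt_scores["Bad Prompt"]
wins = int((delta > 0).sum())

print(f"mean improvement: {delta.mean():.4f}")
print(f"good prompt wins on {wins} of {len(delta)} transcripts")
```

In this sample most of the average gain comes from two transcripts, while the first transcript scores lower with the good prompt, so inspecting individual rows is worthwhile before concluding that one prompt is better.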
