<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Model_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Model Comparison

Pi lets you objectively compare two models by scoring their responses against questions you care about.  This notebook walks through using a Scoring System to run that evaluation.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [2]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()

## Load Dataset

We've published a small dataset on Hugging Face for this example.  Get a token from https://huggingface.co/settings/tokens and set it as a notebook secret called `HF_TOKEN`.

The dataset is a set of examples for an AI that generates stories in the style of Aesop's Fables around a given moral lesson.

In [3]:
from datasets import load_dataset

aesop_dataset = load_dataset("withpi/aesop", split="train")

print(aesop_dataset)

Dataset({
    features: ['input', 'output'],
    num_rows: 23
})


## Evaluate the scoring spec on different models

Let's generate responses from a "big" model and a "small" one and compare their scores.

The cell below uses a simple system prompt and a list of scoring questions.  It uses Gemini because you can get a free key from the left pane (select Gemini API keys); the cell reads that key from the `GOOGLE_API_KEY` notebook secret.  You can swap in a different model (see https://docs.litellm.ai/docs/) or your own questions as you see fit.

In [5]:
from withpi_utils.colab import pretty_print_responses
from tqdm.notebook import tqdm
import litellm

os.environ["GEMINI_API_KEY"] = userdata.get("GOOGLE_API_KEY")

system_prompt = """
Write a children's story in the style of Aesop's Fables teaching a life lesson
specified by the user. Provide just the story with no extra content.
"""

scoring_spec = [{'question': q} for q in [
    "Does the response contain a clear beginning, middle, and end?",
    "Does the story follow a logical progression of events?",
    "Does the story resolve the conflict in a satisfying manner?",
    "Is the life lesson clearly conveyed in the story?",
    "Is the life lesson relevant to the input provided by the user?"
]]

def generate(user: str, model: str) -> str:
    """Send the system prompt and the given user prompt to `model` via LiteLLM
    and return the text of the first completion."""
    messages = [
        {"content": system_prompt, "role": "system"},
        {"content": user, "role": "user"},
    ]
    return litellm.completion(model=model, messages=messages).choices[0].message.content


for i in tqdm(range(5)):
    row = aesop_dataset[i]
    small_model_output = generate(
        user=row["input"],
        model="gemini/gemini-2.0-flash-lite",
    )
    big_model_output = generate(
        user=row["input"],
        model="gemini/gemini-2.0-flash",
    )

    small_score = pi.scoring_system.score(
        llm_input=row["input"],
        llm_output=small_model_output,
        scoring_spec=scoring_spec,
    )
    big_score = pi.scoring_system.score(
        llm_input=row["input"],
        llm_output=big_model_output,
        scoring_spec=scoring_spec,
    )

    pretty_print_responses(
        header="#### Input:\n" + row["input"],
        response1="#### Output:\n" + small_model_output,
        response2="#### Output:\n" + big_model_output,
        left_label="gemini/gemini-2.0-flash-lite",
        right_label="gemini/gemini-2.0-flash",
        scores_left=small_score,
        scores_right=big_score,
    )
    print("\n\n")

Scores for `aesop_dataset[0]`:

| Question | gemini/gemini-2.0-flash-lite | gemini/gemini-2.0-flash |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 0.902 | 0.73 |
| Does the story resolve the conflict in a satisfying manner? | 0.688 | 0.738 |
| Is the life lesson clearly conveyed in the story? | 0.832 | 0.703 |
| Is the life lesson relevant to the input provided by the user? | 0.996 | 0.887 |
| Total score | 0.884 | 0.812 |
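
A quick arithmetic check on the two sets of scores shown above for the first row suggests that the total score is the unweighted mean of the five per-question scores. This is an observation from the displayed (rounded) values, not documented behavior of the scoring system; the variable names are illustrative.

```python
# Per-question scores copied from the first row's output above.
first_set = [1.0, 0.902, 0.688, 0.832, 0.996]   # reported total: 0.884
second_set = [1.0, 0.73, 0.738, 0.703, 0.887]   # reported total: 0.812


def mean(scores):
    """Unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


print(round(mean(first_set), 3))   # 0.884
print(round(mean(second_set), 3))  # 0.812
```

Because the displayed scores are already rounded, a recomputed mean can differ from the reported total in the last digit for other rows.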







Scores for `aesop_dataset[1]`:

| Question | gemini/gemini-2.0-flash-lite | gemini/gemini-2.0-flash |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 0.805 | 0.961 |
| Does the story resolve the conflict in a satisfying manner? | 0.436 | 0.68 |
| Is the life lesson clearly conveyed in the story? | 0.984 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.845 | 0.928 |







Scores for `aesop_dataset[2]`:

| Question | gemini/gemini-2.0-flash-lite | gemini/gemini-2.0-flash |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.824 | 1.0 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.965 | 1.0 |







Scores for `aesop_dataset[3]`:

| Question | gemini/gemini-2.0-flash-lite | gemini/gemini-2.0-flash |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.828 | 1.0 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.966 | 1.0 |







Scores for `aesop_dataset[4]`:

| Question | gemini/gemini-2.0-flash-lite | gemini/gemini-2.0-flash |
|---|---|---|
| Does the response contain a clear beginning, middle, and end? | 1.0 | 1.0 |
| Does the story follow a logical progression of events? | 1.0 | 0.816 |
| Does the story resolve the conflict in a satisfying manner? | 0.598 | 0.295 |
| Is the life lesson clearly conveyed in the story? | 1.0 | 0.656 |
| Is the life lesson relevant to the input provided by the user? | 1.0 | 1.0 |
| Total score | 0.92 | 0.754 |
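
To turn the per-row output above into a single comparison, one option is to average the total scores per model and count how many rows each model wins. The sketch below hard-codes the totals from the output above and assumes that the first set of scores in each pair belongs to the left-hand model (`gemini/gemini-2.0-flash-lite`); `summarize` is a hypothetical helper, not part of the Pi SDK.

```python
from statistics import mean

# Total scores copied from the five rows of output above, in row order.
# Assumption: the first set of scores in each pair is the left-hand model.
totals = {
    "gemini/gemini-2.0-flash-lite": [0.884, 0.845, 0.965, 0.966, 0.92],
    "gemini/gemini-2.0-flash": [0.812, 0.928, 1.0, 1.0, 0.754],
}


def summarize(totals):
    """Return each model's mean total score and the number of rows it wins outright."""
    names = list(totals)
    wins = {name: 0 for name in names}
    # zip(*...) regroups the per-model lists into one tuple of scores per row.
    for row in zip(*totals.values()):
        best = max(row)
        winners = [name for name, score in zip(names, row) if score == best]
        if len(winners) == 1:  # ties count for nobody
            wins[winners[0]] += 1
    return {
        name: {"mean": round(mean(totals[name]), 3), "wins": wins[name]}
        for name in names
    }


summary = summarize(totals)
for name, stats in summary.items():
    print(name, stats)
```

Under that assumption, the smaller model has the higher mean (0.916 versus 0.899) while the larger model wins three of the five rows, so five rows are too few to separate the two models.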







## Next Steps

Try different models (Gemini Flash and Flash-Lite are quite similar on such an easy task), pick a different task, or use different scoring questions.
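
When trying more than two models, it can help to separate the comparison loop from the generation and scoring calls. The following is a minimal sketch under stated assumptions: `compare_models`, `generate_fn`, and `score_fn` are hypothetical names, and the stand-in functions exist only so the example runs without API keys. In this notebook, `generate_fn` could wrap `generate`, and `score_fn` could wrap `pi.scoring_system.score` and return a single number taken from its result.

```python
from typing import Callable, Dict, List


def compare_models(
    inputs: List[str],
    models: List[str],
    generate_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean score per model over all inputs (assumes inputs is non-empty)."""
    scores: Dict[str, List[float]] = {model: [] for model in models}
    for user_input in inputs:
        for model in models:
            output = generate_fn(user_input, model)
            scores[model].append(score_fn(user_input, output))
    return {model: sum(vals) / len(vals) for model, vals in scores.items()}


# Stand-in callables (hypothetical) so the sketch runs offline.
def fake_generate(user_input: str, model: str) -> str:
    return f"[{model}] a story about {user_input}"


def fake_score(user_input: str, output: str) -> float:
    # Scores 1.0 when the output mentions the requested lesson, else 0.0.
    return 1.0 if user_input in output else 0.0


means = compare_models(
    inputs=["honesty", "patience"],
    models=["model-a", "model-b"],
    generate_fn=fake_generate,
    score_fn=fake_score,
)
print(means)
```

Passing the callables in as arguments keeps the loop independent of any one provider, so swapping models or scoring questions does not require touching the comparison code.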