<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Model_Comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

# Model Comparison

Pi lets you objectively compare two models based on how they perform against questions you care about.  This notebook walks through using a Scoring System to do this evaluation.

## Install and initialize SDK

You'll need a `WITHPI_API_KEY` from https://build.withpi.ai/account/keys.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [8]:
%%capture

%pip install withpi withpi-utils litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

pi = PiClient()
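
The cell above imports `google.colab`, which is only available inside Colab. If you run the notebook elsewhere, a fallback along these lines may help. This is a minimal sketch: `resolve_secret` is a hypothetical helper, not part of the Pi SDK, and it assumes the key is exported as an environment variable outside Colab.

```python
import os


def resolve_secret(name: str) -> str:
    """Read a secret from Colab's userdata when available, else from os.environ.

    Hypothetical helper for illustration; not part of the Pi SDK.
    """
    try:
        # This import only succeeds inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        value = os.environ.get(name)
    else:
        value = userdata.get(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value
```

With this in place, `os.environ["WITHPI_API_KEY"] = resolve_secret("WITHPI_API_KEY")` would behave the same in Colab and in a local Jupyter session.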

## Setup scoring system

Let's say we're building an AI to generate stories in the style of Aesop's Fables.  As in test-driven development, we first decide what we want from the system, then build against it.  Initialize a Scoring System and a score function:

In [10]:
scoring_spec = [{'question': q} for q in [
    "Does the response contain a clear beginning, middle, and end?",
    "Does the story follow a logical progression of events?",
    "Does the story resolve the conflict in a satisfying manner?",
    "Is the life lesson clearly conveyed in the story?",
    "Is the life lesson relevant to the input provided by the user?"
]]

def score(llm_input, llm_output):
    """Score one model response against every question in scoring_spec."""
    return pi.scoring_system.score(
        scoring_spec=scoring_spec,
        llm_input=llm_input,
        llm_output=llm_output,
    )

## Evaluate the scoring spec on different models

Let's try generating responses from a "big" model and a "small" one to compare scores.

The cell below uses a simple system prompt and a short list of life lessons as user inputs.  It uses Gemini because you can get a free key from the left pane (select Gemini API keys); the cell reads it from the `GOOGLE_API_KEY` secret.  You can swap in a different model (see https://docs.litellm.ai/docs/) or your own inputs as you see fit.

In [11]:
from withpi_utils.colab import pretty_print_responses
import litellm

os.environ["GEMINI_API_KEY"] = userdata.get("GOOGLE_API_KEY")

system_prompt = """
Write a children's story in the style of Aesop's Fables teaching a life lesson
specified by the user. Provide just the story with no extra content.
"""

aesop_prompts = [
    "Slow and steady wins the race",
    "Be cautious with flattery",
    "Even the smallest friends can be the most helpful",
]

def generate(user: str, model: str) -> str:
    """generate passes the provided system and user prompts into the given model
    via LiteLLM"""
    messages = [
        {"content": system_prompt, "role": "system"},
        {"content": user, "role": "user"},
    ]
    return litellm.completion(model=model, messages=messages).choices[0].message.content


for prompt in aesop_prompts:
    small_model_output = generate(
        user=prompt,
        model="gemini/gemini-2.0-flash-lite",
    )
    big_model_output = generate(
        user=prompt,
        model="gemini/gemini-2.0-flash",
    )

    small_score = score(prompt, small_model_output)
    big_score = score(prompt, big_model_output)

    pretty_print_responses(
        header="#### Input:\n" + prompt,
        response1="#### Output:\n" + small_model_output,
        response2="#### Output:\n" + big_model_output,
        left_label="gemini/gemini-2.0-flash-lite",
        right_label="gemini/gemini-2.0-flash",
        scores_left=small_score,
        scores_right=big_score,
    )
    print("\n\n")

| Question | Score |
| --- | --- |
| Does the response contain a clear beginning, middle, and end? | 1.0 |
| Does the story follow a logical progression of events? | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.902 |
| Is the life lesson clearly conveyed in the story? | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 |
| Total score | 0.98 |

| Question | Score |
| --- | --- |
| Does the response contain a clear beginning, middle, and end? | 1.0 |
| Does the story follow a logical progression of events? | 0.961 |
| Does the story resolve the conflict in a satisfying manner? | 0.98 |
| Is the life lesson clearly conveyed in the story? | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 |
| Total score | 0.988 |

| Question | Score |
| --- | --- |
| Does the response contain a clear beginning, middle, and end? | 1.0 |
| Does the story follow a logical progression of events? | 0.93 |
| Does the story resolve the conflict in a satisfying manner? | 0.208 |
| Is the life lesson clearly conveyed in the story? | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 |
| Total score | 0.828 |

| Question | Score |
| --- | --- |
| Does the response contain a clear beginning, middle, and end? | 1.0 |
| Does the story follow a logical progression of events? | 0.824 |
| Does the story resolve the conflict in a satisfying manner? | 0.387 |
| Is the life lesson clearly conveyed in the story? | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 |
| Total score | 0.842 |

| Question | Score |
| --- | --- |
| Does the response contain a clear beginning, middle, and end? | 1.0 |
| Does the story follow a logical progression of events? | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 0.789 |
| Is the life lesson clearly conveyed in the story? | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 |
| Total score | 0.958 |

| Question | Score |
| --- | --- |
| Does the response contain a clear beginning, middle, and end? | 1.0 |
| Does the story follow a logical progression of events? | 1.0 |
| Does the story resolve the conflict in a satisfying manner? | 1.0 |
| Is the life lesson clearly conveyed in the story? | 1.0 |
| Is the life lesson relevant to the input provided by the user? | 1.0 |
| Total score | 1.0 |
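
The total scores in the output tables above are consistent with an unweighted mean of the five question scores. That is an observation about these outputs, not a statement about how Pi aggregates scores in general. The sketch below checks the arithmetic and averages the totals per model. It assumes the first table in each pair belongs to `gemini/gemini-2.0-flash-lite` (the left response) and the second to `gemini/gemini-2.0-flash`; the variable and function names are illustrative, and the numbers are copied from the tables.

```python
from statistics import mean

# Per-question scores copied from the output tables, in prompt order.
# ASSUMPTION: first table of each pair = flash-lite, second = flash.
question_scores = {
    "gemini/gemini-2.0-flash-lite": [
        [1.0, 1.0, 0.902, 1.0, 1.0],
        [1.0, 0.93, 0.208, 1.0, 1.0],
        [1.0, 1.0, 0.789, 1.0, 1.0],
    ],
    "gemini/gemini-2.0-flash": [
        [1.0, 0.961, 0.98, 1.0, 1.0],
        [1.0, 0.824, 0.387, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0],
    ],
}

# Total scores as reported in the same tables.
reported_totals = {
    "gemini/gemini-2.0-flash-lite": [0.98, 0.828, 0.958],
    "gemini/gemini-2.0-flash": [0.988, 0.842, 1.0],
}


def summarize(model):
    """Return (totals recomputed as unweighted means, mean reported total)."""
    recomputed = [round(mean(scores), 3) for scores in question_scores[model]]
    return recomputed, round(mean(reported_totals[model]), 3)


for model in question_scores:
    recomputed, average = summarize(model)
    print(model, "recomputed:", recomputed, "mean total:", average)
```

With only three prompts, the gap between the two averages is small, so treat it as a hint rather than a conclusion. Most of the difference comes from the conflict-resolution question.
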







## Next Steps

Try comparing models that differ more (Gemini Flash and Flash-Lite score similarly on such an easy creative task), pick a different task, or change the scoring questions.
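
One way to explore these variations is to wrap the comparison loop in a function that accepts any list of models. The sketch below is a hypothetical helper, not part of `withpi` or `withpi_utils`. It takes the generation and scoring steps as callables so it can be exercised with stubs and no API keys. `score_fn` is assumed to return a single number; how to extract one from the Pi response is left out because the response shape is not shown in this notebook.

```python
from statistics import mean
from typing import Callable, Dict, List


def compare_models(
    prompts: List[str],
    models: List[str],
    generate_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the mean score per model over all prompts."""
    scores: Dict[str, List[float]] = {model: [] for model in models}
    for prompt in prompts:
        for model in models:
            output = generate_fn(prompt, model)
            scores[model].append(score_fn(prompt, output))
    return {model: mean(values) for model, values in scores.items()}


# Stub functions so the sketch runs offline; swap in real ones in the notebook.
def fake_generate(user: str, model: str) -> str:
    return user.upper() if model == "stub-loud" else user


def fake_score(llm_input: str, llm_output: str) -> float:
    return 1.0 if llm_output == llm_input else 0.5


print(compare_models(
    prompts=["slow and steady", "be cautious"],
    models=["stub-plain", "stub-loud"],
    generate_fn=fake_generate,
    score_fn=fake_score,
))
```

In the notebook, `generate_fn` could be `lambda user, model: generate(user=user, model=model)`, with `models` set to any LiteLLM model strings you have keys for.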