<a href="https://colab.research.google.com/github/zach-2pir/docs/blob/main/colabs/Quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://play.withpi.ai"><font size="4">Technique Catalog</font></a>

# Prompt Optimization with ScoringSpec

This Colab is the companion to the "Prompt iteration with scoring" Playground, which introduces the core concept of Pi, the **ScoringSpec**.

A **ScoringSpec** is a **human- and machine-readable** description of what **goodness** means to you. It is the cornerstone of our approach because it lets you measure improvements mechanically while remaining explainable.

See https://build.withpi.ai/techniques/scoring_system/scoring_manual_calibration for more information.

This Colab walks you through generating a **ScoringSpec**, scoring some responses with it, and tinkering with your application description to improve the results.

## Install and initialize SDK

Connect to a regular CPU Python 3 runtime.  You won't need GPUs for this notebook.

You'll need a `WITHPI_API_KEY` from https://play.withpi.ai.  Add it to your notebook secrets (the key icon in the left sidebar).

Run the cell below to install the packages and load the SDK.

In [None]:
%%capture

%pip install withpi withpi-utils datasets tqdm litellm

import os
from google.colab import userdata
from withpi import PiClient

# Load the notebook secret into the environment so the Pi Client can access it.
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')

client = PiClient()

## Make a Scoring Spec

Let's say you want to build an application that generates children's stories teaching a life lesson.  Call it `AesopAI`.

Start by creating a first-cut scoring spec from that general description, as in the following cell:

In [None]:
from withpi_utils.colab import display_scoring_spec

aesop_scoring_spec = client.scoring_system.generate(
    application_description=(
        "Write a children's story in the style of Aesop's Fables "
        "teaching a life lesson specified by the user. Provide just the "
        "story with no extra content."
    ),
)

display_scoring_spec(aesop_scoring_spec)

A scoring spec is essentially a hierarchical rubric for grading a response.  A bunch of "simple" questions add up to broader categories, which yield a final score.  Output will vary somewhat, but the table above should have reasonable grading questions for the application.
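To make the "hierarchical rubric" idea concrete, here is a minimal sketch of questions rolling up into categories and then into a final score. The `Question` and `Category` classes and the unweighted averaging are illustrative assumptions, not the Pi SDK's actual data model or aggregation rule.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Question:
    """Hypothetical rubric leaf: one simple grading question, scored in [0, 1]."""
    label: str
    score: float


@dataclass
class Category:
    """Hypothetical broader category that groups several questions."""
    label: str
    questions: list[Question] = field(default_factory=list)

    def score(self) -> float:
        # Assumption: every question in a category counts equally.
        return mean(q.score for q in self.questions)


def total_score(categories: list[Category]) -> float:
    """Roll category scores up into one final score (unweighted mean)."""
    return mean(c.score() for c in categories)


# Toy rubric with made-up labels and scores.
rubric = [
    Category("Story Structure", [Question("Has a clear ending", 1.0),
                                 Question("Is an appropriate length", 0.5)]),
    Category("Moral and Lesson", [Question("States the moral", 1.0)]),
]
print(total_score(rubric))  # 0.875
```

The real spec returned by `client.scoring_system.generate` carries more than labels and scores, so treat this only as a mental model of the tree shape.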

## Generate a response
Let's see how it performs! The cell below uses Gemini to generate a response, but any suitable model will work fine.

To use a different model, change the `model` argument and supply the matching API key; see the LiteLLM docs at https://docs.litellm.ai/docs/.

You can import a Google Gemini key from AI Studio in the left pane, which populates a `GOOGLE_API_KEY` secret.  At low request rates it's free.

In [None]:
import litellm

os.environ["GEMINI_API_KEY"] = userdata.get('GOOGLE_API_KEY')

def generate(system: str, user: str, model: str) -> str:
    """generate passes the provided system and user prompts into the given model
    via LiteLLM"""
    messages = [
        {"content": system, "role": "system"},
        {"content": user, "role": "user"},
    ]
    return litellm.completion(model=model, messages=messages).choices[0].message.content

prompt = "The importance of sharing"
response = generate(
    system=aesop_scoring_spec.description,
    user=prompt,
    model="gemini/gemini-1.5-flash-8b")

print(response)

Barnaby the badger had a magnificent hoard of acorns.  Every sunny afternoon, he'd scavenge, burying treasures deeper and deeper in his cozy burrow.  He'd boast to the other woodland creatures about the size of his pile, his voice rumbling with pride.  The squirrels, the rabbits, and the chipmunks watched, their bellies rumbling with hunger.  The tiny field mice, too small to reach the plentiful acorns, often went without.

One day, a fierce storm swept through the forest.  Trees crashed, branches snapped, and the ground was covered in a blanket of fallen leaves.  The acorns, once so carefully buried, were scattered, some lost forever. Barnaby, frantic, searched high and low, his pride turning to despair.  He had nothing left.

The squirrels, though their own stores were depleted, shared the few acorns they'd managed to salvage. The rabbits offered their carrots, and the chipmunks their berries.  Even the tiny field mice, grateful for the gesture, shared their roots.  Barnaby, humbled 

## Score it!

Take the generated response and see how it scores with Pi.

The cell below runs Pi Scoring, which evaluates each dimension in the scoring spec and assigns a score from 0 (terrible!) to 1 (excellent!).  The current scoring spec is **uncalibrated**, meaning that all the dimensions are treated as equally important, but it's a starting point for learning which are **actually** important based on your preferences.

In [None]:
from withpi_utils.colab import pretty_print_responses

score = client.scoring_system.score(
    scoring_spec=aesop_scoring_spec,
    llm_input=prompt,
    llm_output=response,
)

pretty_print_responses(
    header="#### Prompt:\n" + prompt,
    response1="#### Response:\n" + response,
    left_label="gemini/gemini-1.5-flash-8b",
    scores_left=score,
)

| Category | Dimension | Score |
|---|---|---|
| Story Structure | | 0.943 |
| | Story Completeness | 1.0 |
| | Conflict Resolution | 0.996 |
| | Narrative Flow | 1.0 |
| | Appropriate Length | 0.777 |
| Moral and Lesson | | 1.0 |
| | Life Lesson Inclusion | 1.0 |
| | Lesson Clarity | 1.0 |
| | Moral Statement Presence | 1.0 |
| | Lesson Integration | 1.0 |
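The 0.943 shown for Story Structure matches an unweighted mean of its four sub-scores, which is what "uncalibrated" suggests. The sketch below checks that arithmetic and shows how unequal weights would shift the result. The `weighted_mean` helper and the "length-heavy" weights are made up for illustration; how Pi actually aggregates or calibrates scores is not assumed here.

```python
# Sub-scores copied from the "Story Structure" rows of the output above.
story_structure = {
    "Story Completeness": 1.0,
    "Conflict Resolution": 0.996,
    "Narrative Flow": 1.0,
    "Appropriate Length": 0.777,
}


def weighted_mean(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of scores; the weights need not sum to 1."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Uncalibrated: every dimension gets the same weight.
equal = {name: 1.0 for name in story_structure}
print(round(weighted_mean(story_structure, equal), 3))  # 0.943

# Hypothetical calibration that cares three times as much about length.
length_heavy = {**equal, "Appropriate Length": 3.0}
print(round(weighted_mean(story_structure, length_heavy), 3))  # 0.888
```

With equal weights the weak length score barely dents the category; once length is weighted more heavily, the same response drops noticeably.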


## Save it!

Finally, save the ScoringSpec so you can come back to it later.

A scoring spec is a simple Pydantic model, which can be serialized to JSON and stored locally.

The cell below writes the scoring spec to `aesop_ai.json` in the runtime's working directory. You can download it from the Files pane on the left.

In [None]:
with open("aesop_ai.json", "w") as file:
    file.write(aesop_scoring_spec.model_dump_json(indent=2))
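To come back to the spec later, you can read the file back in. The sketch below loads it as a plain dictionary using only the standard library; `load_spec_dict` is a hypothetical helper name. Rebuilding the SDK object itself would need the SDK's model class (Pydantic v2 models provide a `model_validate_json` classmethod), whose import path is not assumed here.

```python
import json
from pathlib import Path


def load_spec_dict(path) -> dict:
    """Read a saved scoring spec back in as a plain dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Only inspect the file if the save cell above has already been run.
spec_path = Path("aesop_ai.json")
if spec_path.exists():
    spec = load_spec_dict(spec_path)
    print(sorted(spec.keys()))
```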

## Next Steps

Go back and try different system prompts to see how the outputs and their scores respond.  Try a different model.  Manually tweak the dimensions.  Get a feel for what's happening.
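One way to organize that tinkering is a small loop that generates a response per system prompt, scores it, and ranks the prompts. The sketch below is a hypothetical helper, not part of the Pi SDK: `generate_fn` and `score_fn` are placeholders you would wire to this notebook's `generate` function and to `client.scoring_system.score`. How to reduce the Pi score object to a single number depends on the SDK's response type, which is not assumed here, so stand-in functions are used to keep the example runnable without API keys.

```python
from typing import Callable


def compare_system_prompts(
    system_prompts: list[str],
    user_prompt: str,
    generate_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str], float],
) -> list[tuple[float, str]]:
    """Generate one response per system prompt, score it, and rank best-first."""
    results = []
    for system in system_prompts:
        response = generate_fn(system, user_prompt)
        results.append((score_fn(user_prompt, response), system))
    # Highest score first; ties keep their original order.
    return sorted(results, key=lambda pair: pair[0], reverse=True)


# Stand-in callables so the sketch runs offline.
def fake_generate(system: str, user: str) -> str:
    return f"{system} {user}"


def fake_score(user: str, response: str) -> float:
    # Toy score: longer responses score higher, capped at 1.0.
    return min(len(response) / 100, 1.0)


ranked = compare_system_prompts(
    ["Write a fable.", "Write a short fable that ends with an explicit moral."],
    "The importance of sharing",
    fake_generate,
    fake_score,
)
for score, system in ranked:
    print(f"{score:.2f}  {system}")
```

A single response per prompt is a noisy signal, which is one reason the systematic approach described next relies on a representative input set.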

When you're ready to move beyond basic vibe checking, you'll need to take a systematic approach.  To do that, you'll need input data.  Fortunately, we have tools to help build a representative set.  Head over to the input data playground for this.