<a href="https://colab.research.google.com/github/withpi/cookbook-withpi/blob/main/colabs/Braintrust_PI_Integration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://withpi.ai"><img src="https://play.withpi.ai/logo/logoFullBlack.svg" width="240"></a>

<a href="https://code.withpi.ai"><font size="4">Documentation</font></a>

<a href="https://build.withpi.ai"><font size="4">Copilot</font></a>

[Pi-Scorer](https://build.withpi.ai) offers an alternative to LLM-as-a-judge with several advantages:

* Significantly faster

* Highly consistent — always returns the same score for the same inputs

* Eliminates the need for prompt tuning or adjustments

In [1]:
%%capture
%pip install -U braintrust openai datasets autoevals

In [2]:
# @title Setup API Keys

from google.colab import userdata
import os

os.environ["BRAINTRUST_API_KEY"] = userdata.get("BRAINTRUST_API_KEY")

# Get PI API key: https://build.withpi.ai/account/keys
os.environ["WITHPI_API_KEY"] = userdata.get('WITHPI_API_KEY')
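The cell above relies on `google.colab.userdata`, which is only importable inside Colab. For running the notebook elsewhere, a minimal sketch of a fallback is shown below. The helper name `get_secret` is hypothetical and not part of the original notebook; it assumes the keys may already be set as environment variables and otherwise prompts for them.

```python
import os
from getpass import getpass


def get_secret(name: str) -> str:
    """Return a secret from the environment, Colab userdata, or an interactive prompt."""
    value = os.environ.get(name)
    if value:
        return value

    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            value = userdata.get(name)
        except Exception:
            # The secret is missing or notebook access was not granted.
            value = None

    if not value:
        # Last resort: ask the user without echoing the key.
        value = getpass(f"{name}: ")

    os.environ[name] = value
    return value
```

With this helper, the two assignments above could be replaced by `get_secret("BRAINTRUST_API_KEY")` and `get_secret("WITHPI_API_KEY")`.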

In [3]:
# @title Load a sample dataset

from datasets import load_dataset

ds = load_dataset("withpi/mlmastery_com_blogs_condensed", split="train")

topics = ds["topic"][:5]

data = [{"input": topic} for topic in topics]

display(data)

[{'input': 'Tips for beginners to get started with deep learning, including mastering machine learning fundamentals, choosing a framework, understanding neural network architectures, starting with simple projects, and practicing regularly while engaging with the community.'},
 {'input': 'The topic of this blog post is: "Understanding the Data Science Mind Map: A comprehensive guide to essential Python packages for data preparation, visualization, and statistical analysis, with an emphasis on storytelling in data science."'},
 {'input': 'The 5 Most Influential Machine Learning Research Papers of 2024 and Their Contributions to AI Advancement'},
 {'input': 'Cross-validation techniques for comprehensive model evaluation beyond simple train-test splits'},
 {'input': 'Creating an effective machine learning portfolio that demonstrates practical skills and helps land job opportunities in the competitive ML industry.'}]

In [4]:
# @title Braintrust tracing setup

import braintrust
from openai import OpenAI

MODEL = "gpt-4o-mini"

client = braintrust.wrap_openai(
    OpenAI(
        base_url="https://api.braintrust.dev/v1/proxy",
        api_key=os.environ["BRAINTRUST_API_KEY"],
    )
)

@braintrust.traced
def generate_blog_post(input):
    messages = [
        {
            "role": "system",
            "content": """You are a specialized blog post writer. Given a topic, write a technical blog post. Here are specific instructions:
- Make sure that the blog is under 500 words
- The blog should be technical in nature with clear instructions
""",
        },
        {
            "role": "user",
            "content": input,
        },
    ]
    result = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        max_tokens=4096,
    )
    return result.choices[0].message.content

In [5]:
# @title Pi-Scorer Setup

import os
import requests
from autoevals import ScorerWithPartial
from braintrust_core.score import Score

PI_API_URL = "https://api.withpi.ai/v1/scoring_system/score"
HEADERS = {
    "Content-Type": "application/json",
    "x-api-key": os.environ.get("WITHPI_API_KEY"),
}

# Paste in the scoring_spec array from build.withpi.ai here

scoring_spec = [
    {
      "label": "Relevance to Life",
      "question": "Does the blog post focus on topics directly related to life experiences or themes?",
      "weight": 0.5
    },
    {
      "label": "Word Limit Adherence",
      "question": "Is the blog post less than 300 words?",
      "python_code": "from typing import Any\n\ndef score(\n    response_text: str,\n    input_text: str,\n    kwargs: dict[str, Any],\n) -> dict:\n\n    def evaluate_response(response):\n        \"\"\"\n        Check if the blog post is less than 300 words.\n        \n        Args:\n            response (str): The LLM's response text\n        \n        Returns:\n            float: 1.0 if the blog post is less than 300 words, 0.0 otherwise\n        \"\"\"\n        # Handle edge case for empty response\n        if not response or not isinstance(response, str):\n            return 0.0\n        \n        # Split the response into words\n        words = response.split()\n        \n        # Count the number of words\n        word_count = len(words)\n        \n        # Check if the word count is less than 300\n        if word_count < 300:\n            return 1.0\n        else:\n            return 0.0\n\n    final_score = evaluate_response(response_text)\n    return {\"score\": final_score, \"explanation\": \"\"}\n",
      "scoring_type": "PYTHON_CODE",
      "weight": 0.5
    },
    {
      "label": "Coherence",
      "question": "Is the blog post logically structured and easy to follow?",
      "weight": 0.3
    },
    {
      "label": "Emotional Resonance",
      "question": "Does the blog post evoke an emotional response or connection?",
      "weight": 0.3
    }
  ]

#########

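class PiScorerBase(ScorerWithPartial):
    """Scores a single scoring_spec question with the Pi scoring API."""

    question: str = ""
    label: str = ""

    def __init__(self, question: str, label: str):
        self.question = question
        self.label = label

    def _run_eval_sync(self, output, expected=None, **kwargs):
        # Braintrust passes the dataset row's `input` through kwargs.
        assert "input" in kwargs, "Missing 'input' in kwargs"
        payload = {
            "llm_input": kwargs["input"],
            "llm_output": output,
            # Only the question text is sent; label, weight, and python_code are not.
            "scoring_spec": [{"question": self.question}],
        }
        response = requests.post(PI_API_URL, headers=HEADERS, json=payload, timeout=60)
        # Fail loudly on auth or quota errors instead of a KeyError further down.
        response.raise_for_status()
        pi_score = response.json()
        return Score(name=self.label or self._name(), score=pi_score["total_score"])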
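The `python_code` value in the "Word Limit Adherence" entry is hard to read as an escaped string. Unescaped and lightly condensed, it is equivalent to the function below. Note that `PiScorerBase` forwards only the `question` field, so in this notebook the code itself is not sent to the API; the block is here only to make the check readable and testable locally.

```python
from typing import Any


def score(response_text: str, input_text: str, kwargs: dict[str, Any]) -> dict:
    """Return score 1.0 when the response has fewer than 300 words, else 0.0."""
    # Empty or non-string responses score 0.0, as in the embedded snippet.
    if not response_text or not isinstance(response_text, str):
        return {"score": 0.0, "explanation": ""}

    # Whitespace-separated tokens are counted as words.
    word_count = len(response_text.split())
    final_score = 1.0 if word_count < 300 else 0.0
    return {"score": final_score, "explanation": ""}


print(score("word " * 299, "", {}))  # 299 words, under the limit
print(score("word " * 300, "", {}))  # 300 words, at the limit
```

The comparison is strict, so a post of exactly 300 words fails the check.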


In [6]:
# @title Run eval

await braintrust.EvalAsync(
    "Blog Post Generator",
    data=data,
    task=generate_blog_post,
    scores=[
        PiScorerBase(question=spec["question"], label=spec["label"])
        for spec in scoring_spec
    ],
    experiment_name="Pi Blog Post",
)

Experiment Pi Blog Post-5e80589f is running at https://www.braintrust.dev/app/Pi%20Labs/p/Blog%20Post%20Generator/experiments/Pi%20Blog%20Post-5e80589f

Pi Blog Post-5e80589f compared to Pi Blog Post-47e9fc56:
99.06% 'Coherence'            score
56.88% 'Emotional Resonance'  score
20.35% 'Relevance to Life'    score
67.19% 'Word Limit Adherence' score

1745958741.35s start
1745958771.42s end
29.31s (+724.10%) 'duration'         	(0 improvements, 5 regressions)
16.25s (+383.42%) 'llm_duration'     	(0 improvements, 5 regressions)
85.80tok (-) 'prompt_tokens'    	(0 improvements, 0 regressions)
844tok (+2420.00%) 'completion_tokens'	(2 improvements, 3 regressions)
929.80tok (+2420.00%) 'total_tokens'     	(2 improvements, 3 regressions)
0.00$ (+00.00%) 'estimated_cost'   	(0 improvements, 1 regressions)

See results for Pi Blog Post-5e80589f at https://www.braintrust.dev/app/Pi%20Labs/p/Blog%20Post%20Generator/experiments/Pi%20Blog%20Post-5e80589f


EvalResultWithSummary(summary="...", results=[...])
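The experiment reports each question separately, and the `weight` values in `scoring_spec` are never combined in this notebook. If a single overall number is wanted, one option is a weighted average computed locally. The sketch below assumes that is what the weights are meant for, which the notebook does not confirm; `weighted_average` is a hypothetical helper, and the scores are the per-question averages copied from the summary above.

```python
# Weights copied from scoring_spec.
weights = {
    "Relevance to Life": 0.5,
    "Word Limit Adherence": 0.5,
    "Coherence": 0.3,
    "Emotional Resonance": 0.3,
}

# Per-question averages copied from the experiment summary (percent / 100).
scores = {
    "Relevance to Life": 0.2035,
    "Word Limit Adherence": 0.6719,
    "Coherence": 0.9906,
    "Emotional Resonance": 0.5688,
}


def weighted_average(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-question scores into one number, normalizing by total weight."""
    total_weight = sum(weights[label] for label in scores)
    return sum(scores[label] * weights[label] for label in scores) / total_weight


overall = weighted_average(scores, weights)
print(f"Overall weighted score: {overall:.3f}")
```

Normalizing by the total weight keeps the result in the 0 to 1 range even though the weights here sum to 1.6.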
