# Pydantic Evals

> **Prerequisites**: Make sure that you've signed up for an account with [Logfire](http://logfire.pydantic.dev) and created your Read and Write tokens.
> 1. [Creating your Write Tokens](https://logfire.pydantic.dev/docs/how-to-guides/create-write-tokens/)
> 2. [Creating your Read Tokens](https://logfire.pydantic.dev/docs/how-to-guides/query-api/)
> 
> Once you've done so, set the write token as an environment variable called `LOGFIRE_TOKEN` and the read token as `LOGFIRE_READ_TOKEN`.

In this notebook, we'll learn how to use Pydantic Evals to run evaluation test cases and track the results with Logfire.

You'll need a Read token to export your data and a Write token to be able to save the generated logs to Logfire.

## Why This Matters

Using the same tools for evaluations, production systems, and monitoring dashboards creates a powerful unified workflow. Logfire provides native integration with various tracing libraries, allowing you to keep all your metrics in one place with minimal code changes. This integration between Pydantic Evals and Logfire means you can:

1. Track model performance consistently across development and production
2. Build dashboards that monitor critical aspects of your RAG system
3. Identify performance regressions immediately when they occur
4. Share evaluation results easily with your entire team

Instead of maintaining separate evaluation scripts, production monitoring, and reporting systems, this unified approach streamlines your workflow and ensures nothing falls through the cracks.

## What You'll Learn

Through this hands-on tutorial, you'll discover how to:

1. Set Up Evaluation Infrastructure
   - Configure Logfire for tracking results
   - Create test cases and datasets
   - Define custom evaluators

2. Run Basic Evaluations
   - Test model outputs against expected results
   - Calculate performance scores
   - Track results in Logfire

3. Build Custom Evaluators
   - Create specialized evaluation metrics
   - Customize scoring logic
   - Combine multiple evaluators


By the end of this notebook, you'll have a foundation for systematically evaluating model performance and tracking results. We'll be using Pydantic Evals heavily in this course, so make sure that you can run the code in this notebook.

## Configuring Logfire

Before running this notebook, you'll need to configure your Logfire environment variables. If you run into issues and are asked to authenticate, you can set the environment variables manually:

```python
import os

# Replace the placeholder strings with your own tokens
os.environ["LOGFIRE_TOKEN"] = "<your Logfire write token>"
os.environ["LOGFIRE_READ_TOKEN"] = "<your Logfire read token>"
```

We recommend making these variables available in your shell, or loading them with `python-dotenv` as outlined in the README, rather than hard-coding them in the notebook.

Alternatively, you can run the following command:

```
logfire auth
```

Follow the instructions and link your project to Logfire to get it set up.
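
If you went the environment-variable route, a quick check like the one below can confirm that both tokens are visible to the notebook before you call `logfire.configure`. This is a minimal sketch: the helper name `missing_logfire_vars` is our own and not part of Logfire, and the check does not apply if you authenticated with `logfire auth` instead, since that flow does not rely on these variables.

```python
import os

# The two variable names used throughout this notebook
REQUIRED_VARS = ("LOGFIRE_TOKEN", "LOGFIRE_READ_TOKEN")


def missing_logfire_vars(env=None):
    """Return the names of required variables that are unset or empty.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_logfire_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both Logfire tokens are set.")
```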

```python
import logfire

logfire.configure(
    send_to_logfire=True,
    environment='experimentation',
    service_name='evals',
)
```

Logfire project URL: https://logfire-us.pydantic.dev/ivanleomk/logfire-demo


With Logfire configured, we're ready to build our first evaluation using Pydantic Evals.

## Creating Your First Evaluation

When running evaluations on AI systems, we typically need three key elements: test cases to evaluate, a way to run those test cases against our model, and metrics to grade the outputs. Pydantic Evals formalizes this process with three main components that work together to create a complete evaluation pipeline.


1. Cases: Individual test scenarios with specific inputs and expected outputs
2. Datasets: Collections of test cases that can be run together
3. Evaluators: Functions that assess model outputs and calculate performance metrics

This structure allows you to organize your test cases logically, run them efficiently, and apply consistent evaluation metrics across different models or versions. Let's build a simple example to see how these components work together.



```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# Create a single test case
case1 = Case(
    name='simple_case',
    inputs='What is the capital of France?',
    expected_output='paris',
    metadata={'difficulty': 'easy'},
)

# Create a dataset from our case
dataset = Dataset(cases=[case1])


# Create a custom evaluator that checks for exact matches
@dataclass
class IsExactMatch(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0  # Perfect match
        return 0.0  # Any difference, including capitalization


# Add our custom evaluator to the dataset
dataset.add_evaluator(IsExactMatch())


# Define the function that we want to evaluate
async def guess_city(question: str) -> str:
    return 'Paris'


# Top-level await works inside a notebook cell
report = await dataset.evaluate(guess_city)
print(report)
```

04:19:54.438 evaluate guess_city
04:19:54.439   case: simple_case
04:19:54.439     execute guess_city





The evaluation summary above shows the results of our first test case. Our function `guess_city` returned "Paris" (with a capital P) while our expected output was "paris" (lowercase). The `IsExactMatch` evaluator returned a score of 0.0 because the two strings differ in capitalization. This shows how strict `IsExactMatch` is: even a difference in capitalization results in a failed test.

If we change the `guess_city` function above to return the string `paris` instead, the score changes to `1.0`.
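
If strict matching is too harsh for your use case, one option is to normalize both strings before comparing them. The sketch below is plain Python and independent of Pydantic Evals; the function name `match_score` and its `normalize` flag are our own choices for illustration.

```python
def match_score(output: str, expected: str, normalize: bool = False) -> float:
    """Return 1.0 when the strings match and 0.0 otherwise.

    With normalize=True, surrounding whitespace and letter case are ignored.
    """
    if normalize:
        output = output.strip().casefold()
        expected = expected.strip().casefold()
    return 1.0 if output == expected else 0.0


print(match_score("Paris", "paris"))                  # 0.0 (strict comparison)
print(match_score("Paris", "paris", normalize=True))  # 1.0 (case ignored)
```

The same comparison could stand in for the `==` check inside an evaluator's `evaluate` method if you want a more forgiving metric.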


And with that, we've just run our first set of evaluations with Logfire. If you navigate to [the Logfire website](https://logfire.pydantic.dev), you should be able to see your trace show up.

## Evaluating LLM Classification

Moving beyond simple examples, let's explore how Pydantic Evals can be used to evaluate more complex tasks like classification. We'll evaluate a customer support function that classifies common customer questions, such as those about product returns and refunds, a scenario directly relevant to e-commerce RAG systems.

We'll first define a function that takes in a user question and classifies it as one of four query types: Refunds, Informational, Shipping, or Account. We'll then run an evaluation on our test cases to see whether the model predicts the right label most of the time.

```python
from typing import Literal

import instructor
from openai import AsyncOpenAI
from pydantic import BaseModel


class QueryType(BaseModel):
    query_type: Literal['Refunds', 'Informational', 'Shipping', 'Account']


# Wrap the OpenAI client so responses are parsed into Pydantic models
client = instructor.from_openai(AsyncOpenAI())


async def classify_query(question: str) -> str:
    resp = await client.chat.completions.create(
        model='gpt-4.1-nano',
        messages=[
            {'role': 'system', 'content': 'You are a customer support agent that can classify user queries into one of the following categories: Refunds, Informational, Shipping, Account.'},
            {'role': 'user', 'content': question},
        ],
        response_model=QueryType,
    )
    return resp.query_type


await classify_query('How do I return a product?')
```

'Refunds'

Now that we've defined our task with instructor, let's define some simple test cases.

```python
cases = [
    Case(
        name='refund_query',
        inputs='How do I return a product?',
        expected_output='Refunds',
    ),
    Case(
        name='informational_query',
        inputs='What is the return policy?',
        expected_output='Informational',
    ),
    Case(
        name='shipping_query',
        inputs='How do I track my order?',
        expected_output='Shipping',
    ),
    Case(
        name='account_query',
        inputs='How do I change my password?',
        expected_output='Account',
    ),
]

dataset = Dataset(cases=cases)

# Add our evaluator to the dataset
dataset.add_evaluator(IsExactMatch())

# Run the evaluation
report = await dataset.evaluate(classify_query)
print(report)
```

04:07:27.840 evaluate classify_query
04:07:27.841   case: refund_query
04:07:27.841     execute classify_query
04:07:27.846   case: informational_query
04:07:27.846     execute classify_query
04:07:27.851   case: shipping_query
04:07:27.851     execute classify_query
04:07:27.856   case: account_query
04:07:27.856     execute classify_query
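
To judge whether the classifier is right "most of the time", it helps to reduce per-case results to an accuracy figure, both overall and per label. The sketch below is plain Python and does not read from the report object; the helper name `accuracy_by_label` is our own, and the predictions in `pairs` are made up for illustration rather than taken from the run above.

```python
from collections import defaultdict


def accuracy_by_label(pairs):
    """Compute overall and per-label accuracy from (expected, predicted) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, predicted in pairs:
        totals[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    if not totals:
        raise ValueError("at least one (expected, predicted) pair is required")
    overall = sum(correct.values()) / sum(totals.values())
    per_label = {label: correct[label] / totals[label] for label in totals}
    return overall, per_label


# Hypothetical predictions: one of the four queries is misclassified
pairs = [
    ("Refunds", "Refunds"),
    ("Informational", "Refunds"),
    ("Shipping", "Shipping"),
    ("Account", "Account"),
]

overall, per_label = accuracy_by_label(pairs)
print(overall)    # 0.75
print(per_label)
```

A per-label breakdown like this makes it easier to see which category the classifier confuses, which a single overall score would hide.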





We can also save this dataset to a `.yaml` file for versioning and tracking. Note that we need to pass in the types of any custom evaluators we used, since they don't ship with Pydantic Evals.

```python
dataset.to_file("dataset.yaml", custom_evaluator_types=[IsExactMatch])
```

We can then load it back from the YAML file and run our evaluations again using the same function.

```python
# from_file is a classmethod that returns a new Dataset, so keep its result
dataset = Dataset.from_file("dataset.yaml", custom_evaluator_types=[IsExactMatch])

report = await dataset.evaluate(classify_query)
print(report)
```

04:07:32.216 evaluate classify_query
04:07:32.217   case: refund_query
04:07:32.217     execute classify_query
04:07:32.222   case: informational_query
04:07:32.222     execute classify_query
04:07:32.227   case: shipping_query
04:07:32.228     execute classify_query
04:07:32.232   case: account_query
04:07:32.233     execute classify_query
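
The `custom_evaluator_types` argument used when saving and loading exists because a file can only refer to an evaluator by name, so the loader needs a way to map that name back to a class. The sketch below illustrates the idea in plain Python. It is not how Pydantic Evals is implemented; `ExactMatch` and `build_evaluators` are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class ExactMatch:
    """Hypothetical stand-in for a custom evaluator class."""

    def score(self, output: str, expected: str) -> float:
        return 1.0 if output == expected else 0.0


def build_evaluators(names, custom_types=()):
    """Turn serialized evaluator names back into evaluator instances."""
    # Map each class name to its class so names from the file can be resolved
    registry = {cls.__name__: cls for cls in custom_types}
    evaluators = []
    for name in names:
        if name not in registry:
            raise ValueError(f"Unknown evaluator {name!r}; pass its class in custom_types")
        evaluators.append(registry[name]())
    return evaluators


loaded = build_evaluators(["ExactMatch"], custom_types=[ExactMatch])
print(loaded[0].score("Refunds", "Refunds"))  # 1.0
```

Without the class in `custom_types`, the name found in the file cannot be resolved, which is why this toy loader raises an error in that case.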





## Exporting your Logfire spans

Logfire lets you query your spans with plain SQL. This gives you a lot of flexibility to create custom views and `GROUP BY` aggregations that suit your specific needs.

```python
import os

from logfire.experimental.query_client import LogfireQueryClient
from rich import print

# Use a separate name so we don't overwrite the instructor `client` defined earlier
query_client = LogfireQueryClient(os.environ["LOGFIRE_READ_TOKEN"])

results = query_client.query_json_rows(
    """
    SELECT * FROM records
    WHERE service_name = 'evals'
    LIMIT 1;
    """
)

for row in results['rows']:
    print(row['attributes'])
```



Logfire exposes the full span and information that was captured during the course of the eval run, making it easy to customise and compute custom metrics for your use case.
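
As one example of a custom view, the sketch below builds a query that counts spans per span name for a given service. The helper `build_span_summary_query` is our own, and the `span_name` column is an assumption about the `records` schema, so check it against your project before relying on it.

```python
def build_span_summary_query(service_name: str, limit: int = 20) -> str:
    """Build a SQL query that counts spans per span name for one service."""
    # Reject quotes rather than attempt escaping; this is a sketch, not a SQL sanitizer
    if "'" in service_name:
        raise ValueError("service_name must not contain single quotes")
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT span_name, COUNT(*) AS span_count\n"
        "FROM records\n"
        f"WHERE service_name = '{service_name}'\n"
        "GROUP BY span_name\n"
        "ORDER BY span_count DESC\n"
        f"LIMIT {int(limit)};"
    )


query = build_span_summary_query("evals")
print(query)
```

The resulting string could be passed to `query_json_rows` in place of the literal query used earlier.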

## Conclusion

In this notebook, we've explored how to use Pydantic Evals to create systematic evaluation pipelines for AI systems. By combining Pydantic Evals with Logfire, we've built a foundation for tracking and improving model performance over time.

Key takeaways from this introduction include:

1. **Structured Evaluation Framework**: Pydantic Evals provides a clear structure for organizing test cases, running evaluations, and measuring performance with custom metrics.

2. **Integration with Production Tools**: The same evaluation system can connect directly to your monitoring infrastructure, creating a unified workflow from development to production.

3. **Scalable Approach**: This framework scales from simple exact-match evaluations to more complex assessments of model outputs.

In the following notebooks, we'll build on these concepts to evaluate more complex RAG behaviors, like retrieval quality, answer correctness, and citation accuracy. You'll see how these evaluation techniques integrate with the embedding fine-tuning from Week 2, the query understanding from Week 4, and the structured data handling from Week 5.

Make sure you have Logfire properly configured before proceeding, as we'll continue to use this evaluation framework to measure our progress throughout the rest of the course.