In [None]:
---
date: "2025-05-06T15:00:00.00Z"
description: "A Principled Approach for Evaluating Summarisation Tasks"
published: true
tags:
  - python
  - llm
  - ragas
  - pydantic-ai
  - evals
time_to_read: 10
title: "📈 Evaluation Driven Development (EDD) with PydanticAI"
type: post
---


When embarking on any AI implementation, I prioritise establishing a robust evaluation framework. The initial development of this evaluation process itself provides valuable insights into effective assessment criteria. Here are the critical aspects I consider when selecting evaluation tools:

### Developer Experience
The evaluation environment must be optimised for rapid iteration and reliable feedback. An effective framework:

- Enables quick, dependable, and observable experiment execution
- Helps troubleshoot issues rather than creating additional friction
- Simplifies the addition of new test scenarios

### Composability
Identifying meaningful performance metrics requires deliberate consideration and experimentation. Optimal evaluation might be composed of several metrics including:

- Rule-based metrics (e.g., BLEU score for NLP tasks)
- Reference-free assessments (like LLM-based judges)
- Human annotated labels
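
To make the rule-based end of that spectrum concrete, here is a minimal sketch of clipped unigram precision, the 1-gram building block of BLEU. It is not a full BLEU score (no brevity penalty, no higher-order n-grams), the function name `unigram_precision` is hypothetical, and the example sentences are made up for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: share of candidate tokens also found in the reference.

    Only the 1-gram building block of BLEU; there is no brevity penalty and
    no higher-order n-grams, so this is not a full BLEU score.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Clip each token's count so repeated words cannot inflate the score
    overlap = sum(min(count, ref_counts[token]) for token, count in cand_counts.items())
    return overlap / len(cand_tokens)


reference = "revenue rose 12% in the third quarter"  # made-up reference text
print(unigram_precision("revenue rose 12% last quarter", reference))  # 0.8
print(unigram_precision("the the the the", reference))  # 0.25
```

A metric like this is cheap and deterministic, but it needs a reference text, which is why it usually sits alongside reference-free judges rather than replacing them.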

PydanticAI has recently released Evals, which offers a simple and interesting approach that may tick several of the boxes above. Let's try it out with a simple financial markets text summarisation use case. We'll build up the evaluation suite as we go, following the principles of Evaluation Driven Development (EDD).

## Setup

In [None]:
%%capture
!uv pip install --upgrade pydantic-evals 'pydantic-ai-slim[bedrock]' pydantic-graph boto3 logfire

In [None]:
%%capture
!logfire auth
!logfire projects use stephenhib-blog

import logfire
from pydantic_ai import Agent

logfire.configure(send_to_logfire='if-token-present', scrubbing=False)
Agent.instrument_all()

In [2]:
from pydantic_evals import Case, Dataset
from datasets import load_dataset

import nest_asyncio
nest_asyncio.apply()

def convert_hf_to_pydantic_dataset(
    hf_dataset_name,
    input_column="user_input",
    split="train",
    subset=None,
):
    """
    Convert a Hugging Face dataset to a Pydantic Evals Dataset

    Args:
        hf_dataset_name: Name of the Hugging Face dataset
        input_column: Column to use as Case inputs
        split: Dataset split to use
        subset: Optional dataset configuration (subset) name

    Returns:
        A pydantic_evals Dataset object
    """
    # Load the Hugging Face dataset (subset is passed as the configuration name)
    hf_dataset = load_dataset(hf_dataset_name, name=subset, split=split)
    
    # Convert each row to a Pydantic Case
    cases = []
    for i, item in enumerate(hf_dataset):
        # Create a case name using the index
        case_name = f"case_{i}"
        
        # Extract the required fields
        case_input = item.get(input_column)
        
        # Create the case
        case = Case(
            name=case_name,
            inputs=case_input,
            expected_output=None, # No expected output, we'll let the LLM judge the quality.
        )
        cases.append(case)
    
    # Create and return the Dataset
    return Dataset(cases=cases)

We'll use the small `explodinggradients/earning_report_summary` dataset downloaded from HuggingFace to get started. We use the helper function above to convert it into the input format PydanticAI expects.

In [3]:
from __future__ import annotations

from dataclasses import dataclass
from typing import Any
from pydantic import BaseModel

import asyncio
from pydantic_ai import Agent, format_as_xml
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge, Evaluator, EvaluatorContext

dataset = convert_hf_to_pydantic_dataset(
    "explodinggradients/earning_report_summary", 
    input_column="user_input",
    split="train[0:10]", # Load the first 10 rows from the train split
)

print(f"Created Pydantic Dataset with {len(dataset.cases)} cases")

Created Pydantic Dataset with 10 cases


Now, let's implement our first evaluator. This uses the built-in `LLMJudge` with a rubric to check for hallucination by verifying that every fact in the output is explicitly mentioned in the input.

In [4]:
dataset.add_evaluator(
    LLMJudge(
        rubric='All facts in the output are correct and explicitly present in the input',
        include_input=True,
        model='bedrock:us.anthropic.claude-3-7-sonnet-20250219-v1:0', # It's good practice to use a large, capable model for LLM Judge
    )
)
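
As a cheap complement to the LLM judge, a rule-based check can catch one narrow class of hallucination: numbers in the summary that never appear in the source. The sketch below is an assumption-laden illustration, not part of Pydantic Evals; the helper `ungrounded_numbers`, the regular expression, and the example source text are all hypothetical.

```python
import re

# Matches integers, decimals and thousands separators, optionally followed by "%"
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*%?")


def ungrounded_numbers(source: str, summary: str) -> set[str]:
    """Return numeric tokens that appear in the summary but not in the source."""
    source_numbers = set(NUMBER_PATTERN.findall(source))
    summary_numbers = set(NUMBER_PATTERN.findall(summary))
    return summary_numbers - source_numbers


source = "Quarterly revenue grew 12% to $4.5 billion."  # made-up example input
print(ungrounded_numbers(source, "Revenue grew 12% to $4.5 billion."))  # set()
print(ungrounded_numbers(source, "The bond market rose by 33%"))  # {'33%'}
```

This only compares literal number strings, so it would miss paraphrased figures ("a third" versus "33%") and says nothing about non-numeric claims. That gap is what the rubric-based judge is meant to cover.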

When implementing evals, I like to do an initial run with a deliberately bad output to confirm that the test fails and my framework works before implementing anything more sophisticated. This is very similar to Test Driven Development (TDD) practice, but here we are doing Eval Driven Development (EDD). It helps iron out any testing issues before working on the implementation. Let's start by simply predicting the same answer for each row in our evaluation dataset:

In [5]:
async def generate_bad_summary(news_inputs: str) -> str:
    await asyncio.sleep(5) # Trying hard to avoid throttle limits
    return 'The bond market rose by 33%'
    
report = dataset.evaluate_sync(generate_bad_summary, max_concurrency=1) # Limit concurrency to avoid throttle limits
print(report)

06:00:05.476 evaluate generate_bad_summary
06:00:05.478   case: case_0
06:00:05.479     execute generate_bad_summary
06:00:12.098     judge_input_output run
06:00:12.098       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:00:15.421   case: case_1
06:00:15.422     execute generate_bad_summary
06:00:20.429     judge_input_output run
06:00:20.430       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:00:28.539   case: case_2
06:00:28.539     execute generate_bad_summary
06:00:33.546     judge_input_output run
06:00:33.546       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:00:38.097   case: case_3
06:00:38.098     execute generate_bad_summary
06:00:43.107     judge_input_output run
06:00:43.108       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:00:46.327   case: case_4
06:00:46.327     execute generate_bad_summary
06:00:51.334     judge_input_output run
06:00:51.335       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:01:00.734   case: case_5
06:01:00.734  




As expected, performance is poor. Next, we'll implement something more sophisticated: an AI summariser that uses an LLM to create a structured output defined by the `Summary` Pydantic model, following the system prompt's instruction to `Create a short, concise summary of the news article`.

In [7]:
from pydantic_ai import Agent

class Summary(BaseModel):
    title: str
    facts: list[str]
    summary: str

summary_agent = Agent(
    'bedrock:us.anthropic.claude-3-5-haiku-20241022-v1:0', # Using a smaller, faster, cheaper but also less capable model
    output_type=Summary,
    system_prompt = (f"""Create a short, concise summary of the news article that prioritizes factual accuracy.
Follow these guidelines:
- Present only verifiable facts from the original text
- Maintain the original meaning without distortion

Respond with a structured output according to {Summary.model_json_schema()}"""
    )
)

async def generate_better_summary(news_inputs: str) -> str:
    await asyncio.sleep(10) # Trying even harder to avoid throttle limits
    r = await summary_agent.run(format_as_xml(news_inputs))
    return r.output.summary # Only the summary text is passed to the evaluators

report = dataset.evaluate_sync(generate_better_summary, max_concurrency=1)
print(report)

06:03:02.414 evaluate generate_better_summary
06:03:02.415   case: case_0
06:03:02.415     execute generate_better_summary
             evaluate generate_better_summary
               case: case_0
                 execute generate_better_summary
06:03:07.419       summary_agent run
06:03:07.421         chat us.anthropic.claude-3-5-haiku-20241022-v1:0
06:03:10.283     judge_input_output run
06:03:10.283       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:03:13.322   case: case_1
06:03:13.322     execute generate_better_summary
             evaluate generate_better_summary
               case: case_1
                 execute generate_better_summary
06:03:18.330       summary_agent run
06:03:18.332         chat us.anthropic.claude-3-5-haiku-20241022-v1:0
             evaluate generate_better_summary
               case: case_1
                 execute generate_better_summary
                   summary_agent run
06:03:22.858         chat us.anthropic.claude-3-5-haiku-20241022-v1:0
 




Now we have better performance on the fact rubric, at least according to our evaluation dataset. Let's move on to evaluating and optimising another characteristic of our system. Imagine we are building a mobile app page where the summary field can show a maximum of 200 characters. Following our EDD principles, we can add another `Evaluator` that tests the summary length:

In [8]:
@dataclass
class SummaryLengthEvaluator(Evaluator):
    max_num_chars: int
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        # Pass when the summary fits within the character limit
        return len(ctx.output) <= self.max_num_chars

dataset.add_evaluator(SummaryLengthEvaluator(200)) # Summary should be 200 characters or less

Ideally, we add this evaluation **before** changing our prompt to provide instructions for this part of the task. This approach creates the baseline first, and makes it easier to identify any areas of performance trade-offs. Eventually, our prompt may be trying to achieve many things, some of which may interact. The evaluations will help spot regressions that could creep in by optimising for one area over another.
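
One way to make those trade-offs visible is to compare per-evaluator pass rates between a baseline run and a candidate run. The sketch below uses hand-written dictionaries as stand-ins for results; they are not the structure of the Pydantic Evals report object, and the function names `pass_rates` and `find_regressions` plus all the data are hypothetical.

```python
def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Compute the share of cases that passed each evaluator."""
    if not results:
        return {}
    names = results[0].keys()
    return {name: sum(case[name] for case in results) / len(results) for name in names}


def find_regressions(
    baseline: dict[str, float], candidate: dict[str, float], tolerance: float = 0.0
) -> dict[str, float]:
    """Return evaluators whose pass rate dropped by more than `tolerance`, with the drop size."""
    return {
        name: baseline[name] - candidate[name]
        for name in baseline
        if baseline[name] - candidate[name] > tolerance
    }


# Made-up per-case results for two prompt versions, keyed by evaluator name
baseline_run = [
    {"facts": True, "length": False},
    {"facts": True, "length": False},
    {"facts": True, "length": True},
    {"facts": False, "length": False},
]
candidate_run = [
    {"facts": True, "length": True},
    {"facts": False, "length": True},
    {"facts": True, "length": True},
    {"facts": False, "length": True},
]

baseline = pass_rates(baseline_run)
candidate = pass_rates(candidate_run)
print(baseline)  # {'facts': 0.75, 'length': 0.25}
print(candidate)  # {'facts': 0.5, 'length': 1.0}
print(find_regressions(baseline, candidate))  # {'facts': 0.25}
```

In this invented example the new prompt fixes length but costs factual accuracy, which is exactly the kind of interaction that is easy to miss when looking at a single metric.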

In [9]:
# Simply re-running the same generation, but now with more evals
report = dataset.evaluate_sync(generate_better_summary, max_concurrency=1)
print(report)

06:06:26.803 evaluate generate_better_summary
06:06:26.804   case: case_0
06:06:26.804     execute generate_better_summary
06:06:31.808       summary_agent run
06:06:31.810         chat us.anthropic.claude-3-5-haiku-20241022-v1:0
06:06:35.281     judge_input_output run
06:06:35.282       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:06:38.689   case: case_1
06:06:38.690     execute generate_better_summary
06:06:43.694       summary_agent run
06:06:43.696         chat us.anthropic.claude-3-5-haiku-20241022-v1:0
06:06:47.397     judge_input_output run
06:06:47.398       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:06:50.896   case: case_2
06:06:50.896     execute generate_better_summary
06:06:55.901       summary_agent run
06:06:55.902         chat us.anthropic.claude-3-5-haiku-20241022-v1:0
06:07:00.094     judge_input_output run
06:07:00.095       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:07:03.850   case: case_3
06:07:03.851     execute generate_better_summar




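Now, we can rewrite our prompt to keep the summaries within the 200-character limit. Note that this version also switches the model from Claude 3.5 Haiku to Claude 3.5 Sonnet, so any change in results reflects both the new prompt and the new model: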

In [12]:
concise_summary_agent = Agent(
    'bedrock:us.anthropic.claude-3-5-sonnet-20241022-v2:0', # Sonnet 3.5 is a very capable mid-sized model
    output_type=Summary,
    system_prompt = (f"""Create a short, concise summary of the news article that prioritizes factual accuracy.
Follow these guidelines:
- Present only verifiable facts from the original text
- Maintain the original meaning without distortion

<IMPORTANT>
It's critical to generate one or two or three short sentences - but it must be less than 200 characters (about 30 words)!
</IMPORTANT>

Respond with a structured output according to {Summary.model_json_schema()}"""
    )
)

async def generate_concise_summary(news_inputs: str) -> str:
    await asyncio.sleep(15) # Trying even harder to avoid throttle limits, longer sleep because of the shorter generation
    r = await concise_summary_agent.run(format_as_xml(news_inputs), instrument=True)
    return r.output.summary # Only the summary text is passed to the evaluators

report = dataset.evaluate_sync(generate_concise_summary, max_concurrency=1)
print(report)

06:37:10.287 evaluate generate_concise_summary
06:37:10.288   case: case_0
06:37:10.288     execute generate_concise_summary
06:37:25.291       concise_summary_agent run
06:37:25.294         chat us.anthropic.claude-3-5-sonnet-20241022-v2:0
06:37:29.216     judge_input_output run
06:37:29.217       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:37:33.729   case: case_1
06:37:33.730     execute generate_concise_summary
06:37:48.732       concise_summary_agent run
06:37:48.732         chat us.anthropic.claude-3-5-sonnet-20241022-v2:0
06:37:52.695     judge_input_output run
06:37:52.695       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:38:00.476   case: case_2
06:38:00.477     execute generate_concise_summary
06:38:15.479       concise_summary_agent run
06:38:15.480         chat us.anthropic.claude-3-5-sonnet-20241022-v2:0
06:38:18.371     judge_input_output run
06:38:18.372       chat us.anthropic.claude-3-7-sonnet-20250219-v1:0
06:38:21.286   case: case_3
06:38:21.286    




Now the tests pass! But our application certainly isn't perfect. We could go on to ask whether important information is now missing from the summary because it's too short, or whether the tone of the summary is in line with brand guidelines. The process of engineering the evals goes on!

## Summary

Today you've seen a practical example of how to practise Eval Driven Development (EDD) for a financial news summarisation use case using the simple PydanticAI Evals framework. The key takeaway is that you can take a principled, structured approach to building and evaluating with LLMs. I hope it's helped demystify this area and given you the confidence to implement evals earlier in your AI application dev cycle. Happy building!