# Testing Multimodal AI Systems with Pydantic Evals - Part 1

## Overview

When you deploy AI systems to production, you need confidence that they work correctly. Unlike traditional software where you can write exact assertions (`assert result == 42`), AI outputs are more complex to evaluate. You need to check:

- **Semantic correctness**: Does the answer actually address the question?
- **Factual accuracy**: Are the claims true?
- **Safety**: Is the content free of harmful or biased material?
- **Consistency**: Does the system behave reliably across multiple runs?

This notebook demonstrates how to build a systematic testing framework for a multimodal AI system using Pydantic Evals.

## What We'll Build

~~~
┌─────────────────────────────────────────────────────────────┐
│                    Testing Pipeline                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Image Input + Question                                  │
│          ↓                                                  │
│  2. Multimodal Q&A System (our application)                 │
│          ↓                                                  │
│  3. Structured Output (MountainQAResponse)                  │
│          ↓                                                  │
│  4. Automated Evaluators                                    │
│     • Type validation                                       │
│     • Relevance checks                                      │
│     • Content safety                                        │
│     • Domain-specific validation                            │
│          ↓                                                  │
│  5. Detailed Test Report                                    │
│                                                             │
└─────────────────────────────────────────────────────────────┘
~~~

## Setup

First, configure your environment with the necessary API keys.

In [None]:
from dotenv import load_dotenv
load_dotenv()

# This loads GOOGLE_API_KEY from your .env file (it is read later with os.getenv)
# Gemini will be used as the LLM for the LLM-as-a-judge evaluator

# Only needed on the Udacity workspace. Comment this out if running on another system.
import os
os.environ['HF_HOME'] = '/voc/data/huggingface'
os.environ['OLLAMA_MODELS'] = '/voc/data/ollama/cache'
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['PATH'] = f"/voc/data/ollama/bin:/voc/data/ffmpeg/bin:{os.environ.get('PATH', '')}"
os.environ['LD_LIBRARY_PATH'] = f"/voc/data/ollama/lib:/voc/data/ffmpeg/lib:{os.environ.get('LD_LIBRARY_PATH', '')}"

True

### Model Configuration

We'll use a local vision model (Qwen 2.5 VL) running through Ollama. This gives us:
- **Cost efficiency**: No API charges during development
- **Privacy**: Images stay on your machine
- **Speed**: Fast iteration during testing

In [32]:
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.ollama import OllamaProvider

qwen_2_5vl_3b = OpenAIChatModel(
    model_name="qwen2.5vl:3b",
    provider=OllamaProvider(base_url="http://localhost:11434/v1"),
)

## Part 1: Building the Multimodal Q&A System

### Architecture

Our system follows this flow:

~~~
┌──────────────┐      ┌──────────────────────┐      ┌─────────────────┐
│              │      │                      │      │                 │
│  Image +     │─────▶│  Pydantic AI Agent   │─────▶│  Structured     │
│  Question    │      │  (Vision Model)      │      │  Response       │
│              │      │                      │      │                 │
└──────────────┘      └──────────────────────┘      └─────────────────┘
                                                     {
                                                       "answer": "...",
                                                       "mountain": "..."
                                                     }
~~~

### Why Structured Output?

Structured output makes testing dramatically easier:
- **Type safety**: We know exactly what fields to expect
- **Extractability**: We can pull out specific information (like the mountain name) for validation
- **Consistency**: The model always returns the same format

In [33]:
import os
import textwrap
from pydantic_ai import Agent, PromptedOutput
from pydantic_ai.messages import BinaryContent
from PIL import Image
from io import BytesIO
from pydantic import BaseModel, Field
from pydantic_ai.models.google import GoogleModel, GoogleModelSettings
from pydantic_ai.providers.google import GoogleProvider
import tenacity


# Configure Google models for evaluation
provider = GoogleProvider(api_key=os.getenv('GOOGLE_API_KEY'))
gemini_2_5_flash_lite = GoogleModel('gemini-2.5-flash-lite', provider=provider)

# Disable thinking mode to conserve API quota
model_settings = GoogleModelSettings(google_thinking_config={"thinking_budget": 0})

### Define the Response Schema

This Pydantic model defines what we expect from our Q&A system.

In [34]:
class MountainQAResponse(BaseModel):
    """Structured response from our Q&A system"""
    
    answer: str = Field(description="The answer to the question")
    mountain: str = Field(
        description="The mountain identified in the image. If no mountain, answer 'unknown'"
    )

### Build the Q&A System

The `MountainQA` class wraps our agent and provides a clean interface for answering questions about mountain images.

**Key components:**

1. **System Prompt**: Instructs the model to analyze mountain images and format responses correctly
2. **PromptedOutput**: Ensures the model returns valid JSON matching our schema. Adding `PromptedOutput` is necessary for Qwen 2.5 because it does not support structured output natively.
3. **Retries**: If the model's output doesn't match the schema, it gets 3 attempts to correct it

**Important note about retries**: The `retries=3` parameter applies *only* to output formatting. If the model returns invalid JSON or misses a field, Pydantic AI will automatically retry. API failures require separate retry logic.
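To make the idea of output-format retries concrete, here is a minimal standard-library sketch of a validate-and-retry loop. It is not how Pydantic AI implements retries internally; `parse_response`, `run_with_output_retries`, and `fake_model` are hypothetical names introduced only for this illustration, and we assume that `retries=3` means one initial attempt plus up to three corrections.

```python
import json

REQUIRED_FIELDS = ("answer", "mountain")


def parse_response(raw: str) -> dict:
    """Parse the model's text and check that both required fields are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


def run_with_output_retries(model, question: str, retries: int = 3) -> dict:
    """Call `model` again, with the validation error as feedback, until the output parses."""
    feedback = None
    for _ in range(1 + retries):  # assumption: one initial attempt plus `retries` corrections
        raw = model(question, feedback)  # errors raised by the call itself are NOT caught
        try:
            return parse_response(raw)
        except ValueError as exc:
            feedback = f"Your last reply was invalid: {exc}. Answer with valid JSON."
    raise RuntimeError("model never produced a valid response")


# Hypothetical stand-in for the vision model: wrong twice, then correct.
replies = iter([
    "The Matterhorn!",
    '{"answer": "It is the Matterhorn."}',
    '{"answer": "It is the Matterhorn.", "mountain": "Matterhorn"}',
])


def fake_model(question: str, feedback):
    return next(replies)


result = run_with_output_retries(fake_model, "What mountain is shown in this image?")
print(result["mountain"])
```

Only validation errors trigger another attempt in this sketch. An exception raised by the model call itself (a timeout, for instance) propagates immediately, which mirrors the distinction made in the note above.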

In [35]:
class MountainQA:
    """A simple multimodal question answering system about mountains"""
    
    def __init__(self):
        self.agent = Agent(
            qwen_2_5vl_3b,
            # PromptedOutput is necessary because qwen_2_5vl_3b doesn't support
            # structured outputs natively
            output_type=PromptedOutput(MountainQAResponse),
            # For the same reason, we include examples in the prompt to help the
            # model understand the task and, importantly, the expected output format
            system_prompt=textwrap.dedent(
                """
                You are an expert at analyzing images of mountains and answering questions about them.
                Recognize the mountain in the image and provide relevant information.
                
                Answer with a valid JSON object with the following fields:
                 - answer: The answer to the question
                 - mountain: The mountain identified in the image. If no mountain, answer 'unknown'

                 Example 1:

                 Question: What mountain is shown in this image?
                 Image: ![image](image1.jpg)

                Answer:
                {
                    "answer": "The mountain shown in the image is Mount Blanc, located in the Alps...",
                    "mountain": "Mount Blanc"
                }
                
                Example 2:
                Question: What mountain is shown in this image?
                Image: ![image](image2.jpg)

                Answer:
                {
                    "answer": "The mountain shown in the image is Kilimanjaro, a dormant volcano in Tanzania...",
                    "mountain": "Kilimanjaro"
                }
                """
            ),
            retries=3,  # Retries apply to output formatting only
        )
    
    async def answer_question(self, image: Image.Image, question: str) -> MountainQAResponse:
        """Answer a question about an image"""
        
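        # Optimization: resize to reduce token count. thumbnail() resizes in place,
        # so work on a copy to leave the caller's image untouched.
        image = image.copy()
        image.thumbnail((300, 300))
        
        # Convert PIL Image to JPEG bytes (JPEG has no alpha channel, so force RGB)
        image_bytes = BytesIO()
        image.convert("RGB").save(image_bytes, format="JPEG")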
        
        # Run the agent with both text and image
        result = await self.agent.run(
            [
                question,
                BinaryContent(data=image_bytes.getvalue(), media_type="image/jpeg"),
            ]
        )
        
        return result.output

### Test the System

Before building tests, verify the system works as expected. We will use an image of the Matterhorn, a mountain in the Alps:

![The Matterhorn](./matterhorn.png)

In [36]:
# Create an instance
qa_system = MountainQA()

# Test with the Matterhorn
image = Image.open("matterhorn.png")
question = "What mountain is shown in this image?"

answer = await qa_system.answer_question(image, question)

print(f"Question: {question}")
print(f"\nAnswer: {answer.answer}")
print(f"\nMountain identified: {answer.mountain}")

Question: What mountain is shown in this image?

Answer: The mountain shown in the image is the Matterhorn, a prominent and iconic peak in the Swiss Alps...

Mountain identified: Matterhorn


## Part 2: Introduction to Pydantic Evals

### The Testing Challenge

Traditional software testing uses exact comparisons:
```python
assert calculate_sum(2, 3) == 5  # Either passes or fails
```

AI testing requires semantic evaluation:
```python
# These answers are semantically equivalent:
"The Matterhorn is 4,478 meters tall"
"This mountain has an elevation of 4478m"
"The peak reaches 4,478 meters above sea level"
```
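As a small illustration of the gap between exact and semantic comparison, the sketch below shows that the three answers above are all different strings, yet agree once the elevation is extracted. The helper `extract_elevation_m` is a hypothetical name, and the regular expression is an assumption that only covers phrasings like these three; it is not a general semantic evaluator.

```python
import re
from typing import Optional

answers = [
    "The Matterhorn is 4,478 meters tall",
    "This mountain has an elevation of 4478m",
    "The peak reaches 4,478 meters above sea level",
]


def extract_elevation_m(text: str) -> Optional[int]:
    """Return the first number followed by 'm' or 'meters', ignoring thousands separators."""
    match = re.search(r"(\d[\d,]*)\s*(?:m\b|meters?\b)", text)
    return int(match.group(1).replace(",", "")) if match else None


# Exact string comparison sees three different answers...
print(len(set(answers)))
# ...while comparing the extracted value sees a single one.
print({extract_elevation_m(answer) for answer in answers})
```

Hand-written rules like this break as soon as the phrasing changes, which is why the rest of the notebook relies on LLM-based and custom evaluators instead.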

### What is Pydantic Evals?

Pydantic Evals is a testing framework designed specifically for LLM applications. Think of it as pytest for AI systems.

**Key concepts:**

```
Dataset
├─ Case 1
│  ├─ Input: (image, question)
│  ├─ Expected Output: (optional)
│  └─ Evaluators: [eval1, eval2, ...]
├─ Case 2
│  └─ ...
└─ Global Evaluators (apply to all cases)
```
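The relationship between cases, case-level evaluators, and global evaluators can be sketched with a toy model in plain Python. This is only an illustration of the structure in the tree above, not the Pydantic Evals implementation; `ToyCase`, `ToyDataset`, and `toy_task` are made-up names, and evaluators are reduced to simple functions that return pass/fail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# In this toy model an evaluator is a function from (inputs, output) to pass/fail.
Evaluator = Callable[[str, str], bool]


@dataclass
class ToyCase:
    name: str
    inputs: str
    evaluators: Dict[str, Evaluator] = field(default_factory=dict)


@dataclass
class ToyDataset:
    cases: List[ToyCase]
    evaluators: Dict[str, Evaluator] = field(default_factory=dict)  # global evaluators

    def evaluate(self, task: Callable[[str], str]) -> Dict[str, Dict[str, bool]]:
        report = {}
        for case in self.cases:
            output = task(case.inputs)
            # Global evaluators run on every case; case-level ones are added on top.
            checks = {**self.evaluators, **case.evaluators}
            report[case.name] = {
                name: check(case.inputs, output) for name, check in checks.items()
            }
        return report


def toy_task(question: str) -> str:
    """Stand-in for the Q&A system: always returns the same sentence."""
    return "The Matterhorn is a peak in the Alps."


dataset = ToyDataset(
    cases=[
        ToyCase(
            "identification",
            "What mountain is this?",
            evaluators={"names_matterhorn": lambda q, out: "Matterhorn" in out},
        ),
        ToyCase(
            "height",
            "How tall is it?",
            evaluators={"mentions_height": lambda q, out: "4478" in out.replace(",", "")},
        ),
    ],
    evaluators={"is_str": lambda q, out: isinstance(out, str)},
)

report = dataset.evaluate(toy_task)
print(report)
```

Each case ends up with the global `is_str` check plus its own check, which is the same layering the real `Dataset` provides.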

### Types of Evaluators

The framework provides several evaluators. This notebook uses three:

1. **IsInstance**: Validates output type
2. **LLMJudge**: Uses another LLM to evaluate quality
3. **Custom**: Your own custom validation logic (we'll build some!)

## Part 3: Building Test Cases

### Test Case Structure

Each test case needs:
- **Inputs**: What we feed to the system
- **Evaluators**: How we judge the output
- **Metadata**: Tags for organizing results
- **Expected output** (optional): For exact matching scenarios

In [37]:
import json
from typing import Any, List
from pydantic_evals import Case, Dataset
from pydantic_ai.retries import RetryConfig
from pydantic_evals.evaluators import IsInstance, LLMJudge
import pprint

# Define input schema for clarity
class MountainQAInput(BaseModel):
    """Input to our Q&A system"""
    image: str = Field(description="Path to an image of a mountain")
    question: str = Field(description="Question about the image")

### Wrapper Function

Pydantic Evals needs a function that takes inputs and returns outputs. We create a wrapper that:
1. Loads the image from disk
2. Calls our Q&A system
3. Returns the structured response

In [38]:
async def run_agent(inputs: List[MountainQAInput]) -> MountainQAResponse:
    """
    Wrapper around the QA system that reads the image from disk
    and passes it to the agent.
    """
    assert len(inputs) == 1, "Only one input at a time is supported"
    
    image = Image.open(inputs[0].image)
    return await qa_system.answer_question(image, inputs[0].question)

### Case 1: Simple Identification

This test verifies the system can identify the Matterhorn.

**LLMJudge parameters:**
- `model`: Which LLM evaluates the output
- `rubric`: Instructions for grading
- `include_input`: Whether to show the evaluator the original question

In [39]:
case_1 = Case(
    name="matterhorn_identification",
    inputs=[
        MountainQAInput(
            image="matterhorn.png",
            question="What mountain is shown in this image?"
        )
    ],
    expected_output=None,  # We don't expect exact text
    metadata={"focus": "matterhorn"},
    evaluators=(
        LLMJudge(
            model=gemini_2_5_flash_lite,
            rubric="The answer should correctly identify the mountain as the Matterhorn",
            # In this case the model doesn't need the input, because it can evaluate
            # the answer on its own merit (does it contain 'Matterhorn'?). So we spare some tokens
            # and avoid providing potentially confusing context.
            include_input=False,
        ),
    ),
)

### Case 2: Factual Accuracy

This test verifies the system provides accurate information about the Matterhorn.

For this demo we load verified facts from a JSON file and include them in the rubric. In a real case, these facts could come from a retrieval pipeline, a database, or some other source. 

The evaluator checks if the response:
1. Identifies the mountain correctly
2. Provides facts from our verified list
3. Doesn't include false information

In [40]:
# Load verified facts about the Matterhorn
with open("matterhorn_facts.json") as f:
    matterhorn_facts: list[str] = json.load(f)

# Join the facts up front: a backslash inside an f-string expression
# is a SyntaxError before Python 3.12
facts_text = "\n".join(matterhorn_facts)

case_2 = Case(
    name="matterhorn_facts",
    inputs=[
        MountainQAInput(
            image="matterhorn.png",
            question="Tell me everything you know about this mountain",
        )
    ],
    expected_output=None,
    metadata={"focus": "matterhorn"},
    evaluators=(
        LLMJudge(
            model=gemini_2_5_flash_lite,
            rubric=f"""
            The answer should correctly identify the mountain as the Matterhorn, and then provide
            relevant and correct facts about it from the following information:

            {facts_text}

            Note that the answer does not need to include all the facts, but it should not contain 
            any incorrect information.
            """,
            include_input=True,
        ),
    ),
)

### Create the Dataset

A Dataset bundles test cases with global evaluators that apply to every case.

**Our global checks:**
1. **Type validation**: Is the output a proper MountainQAResponse?
2. **Relevance**: Does the answer relate to the question and image?
3. **Safety**: Is the content free of harmful material?

In [99]:
mountain_dataset = Dataset[str, MountainQAResponse, Any](
    # Test cases, which contain their own specific evaluators
    cases=[case_1, case_2],
    # Global evaluators applied to all cases
    evaluators=[
        IsInstance(type_name="MountainQAResponse"),
        LLMJudge(
            rubric="The answer is relevant to the question and the image",
            include_input=True,
            model=gemini_2_5_flash_lite,
        ),
        LLMJudge(
            rubric="The answer does not contain any harmful or biased content",
            include_input=True,
            model=gemini_2_5_flash_lite,
        ),
    ],
)

### Run the Evaluation

Now we execute all test cases and collect results.

**Retry configuration**: Sometimes API calls fail due to transient issues (rate limits, timeouts). The retry config handles this gracefully using exponential backoff.
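To get a feel for what exponential backoff with full jitter means, the sketch below computes the waiting-time bounds for the parameters used in this notebook (`multiplier=0.5`, `max=15`, up to 10 attempts). It assumes the upper bound starts at `multiplier`, doubles on each attempt, and is capped at `max`; tenacity's exact attempt indexing may differ. `backoff_upper_bound` and `full_jitter_wait` are hypothetical helpers, not tenacity functions.

```python
import random


def backoff_upper_bound(attempt: int, multiplier: float = 0.5, max_wait: float = 15.0) -> float:
    """Upper bound of the wait after failed attempt `attempt` (1-based): doubles, then caps."""
    return min(max_wait, multiplier * 2 ** (attempt - 1))


def full_jitter_wait(attempt: int, rng: random.Random) -> float:
    """Pick a uniformly random wait between 0 and the bound, so clients don't retry in lockstep."""
    return rng.uniform(0, backoff_upper_bound(attempt))


bounds = [backoff_upper_bound(n) for n in range(1, 11)]
print(bounds)  # [0.5, 1.0, 2.0, 4.0, 8.0, 15.0, 15.0, 15.0, 15.0, 15.0]

rng = random.Random(0)
waits = [full_jitter_wait(n, rng) for n in range(1, 11)]
print([round(w, 2) for w in waits])
```

The randomization matters when many evaluators hit the same rate-limited API at once: without it, every failed call would retry at the same moment and fail again together.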

In [100]:
# Configure retries for API failures
retry_config = RetryConfig(
    stop=tenacity.stop_after_attempt(10),
    wait=tenacity.wait_full_jitter(multiplier=0.5, max=15),
)

# Run the evaluation
report = await mountain_dataset.evaluate(
    run_agent,
    retry_task=retry_config,
    retry_evaluators=retry_config
)

print(report)


### Understanding the Report

The report shows:
- **Assertions**: How many evaluators passed/failed per case
- **Duration**: Execution time
- **Averages**: Overall performance metrics

Let's examine a specific case in detail:

In [101]:
# Deep dive into case 2 results
pprint.pprint(report.cases[1])

ReportCase(name='matterhorn_facts',
           inputs=[MountainQAInput(image='matterhorn.png', question='Tell me everything you know about this mountain')],
           metadata={'focus': 'matterhorn'},
           expected_output=None,
           output=MountainQAResponse(answer='The mountain shown in the image is the Matterhorn, a peak located in the Swiss Alps between Valais and Ticino provinces in Switzerland and in the canton of Valais, Switzerland.', mountain='Matterhorn'),
           metrics={},
           attributes={},
           scores={},
           labels={},
           assertions={'IsInstance': EvaluationResult(name='IsInstance',
                                                      value=True,
                                                      reason=None,
                                                      source=EvaluatorSpec(name='IsInstance', arguments=('MountainQAResponse',))),
                       'LLMJudge': EvaluationResult(name='LLMJudge',
                  

The detailed output includes:
- **Inputs**: What we sent to the system
- **Output**: The system's response
- **Evaluations**: Each evaluator's verdict with reasoning
- **Metadata**: Tags and execution info

## Part 4: Custom Evaluators

### Why Custom Evaluators?

Built-in evaluators are powerful, but production systems need domain-specific checks. For example:
- Does the entity exist in our database?
- Does the output match our business rules?
- Are there regulatory compliance issues?

### Building an Entity Validator

Let's create an evaluator that checks if the identified mountain exists in our knowledge base.

```
┌──────────────────────────────────────────────┐
│         EntityCheck Evaluator                │
├──────────────────────────────────────────────┤
│                                              │
│  Input: MountainQAResponse                   │
│         {                                    │
│           answer: "...",                     │
│           mountain: "Matterhorn"             │
│         }                                    │
│         ↓                                    │
│  Check: Is "matterhorn" in known_mountains?  │
│         ↓                                    │
│  Output: True/False                          │
│                                              │
└──────────────────────────────────────────────┘
```
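The check in the diagram can be tried on a few names before it is wrapped in an evaluator. This is a minimal sketch using the same case-insensitive substring idea; `KNOWN_MOUNTAINS` and `is_known` are illustrative names, and the candidate strings are examples chosen for this illustration.

```python
KNOWN_MOUNTAINS = ["matterhorn", "everest", "mont blanc", "kilimanjaro"]


def is_known(mountain: str) -> bool:
    """Case-insensitive substring match against the known names."""
    name = mountain.lower()
    return any(known in name for known in KNOWN_MOUNTAINS)


for candidate in ["Matterhorn", "Mount Matterhorn", "Mount Everest", "Mount Blanc", "unknown"]:
    print(f"{candidate!r}: {is_known(candidate)}")
```

Substring matching tolerates prefixes such as "Mount", but not spelling variants: "Mount Blanc" is rejected because the list spells the name "mont blanc". A production check would probably normalize aliases and misspellings before the lookup.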

In [102]:
from dataclasses import dataclass
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

@dataclass
class EntityCheck(Evaluator):
    """Validates that the identified mountain exists in our knowledge base"""
    
    def __init__(self):
        # In production, this would query a database or API
        # For now, we use a hardcoded list
        self.__known_mountains = ["matterhorn", "everest", "mont blanc", "kilimanjaro"]
        super().__init__()
    
    async def evaluate(self, ctx: EvaluatorContext[str, MountainQAResponse]) -> bool:
        """
        Evaluate whether the identified mountain is in our knowledge base.
        
        Args:
            ctx: Contains the system's output
        
        Returns:
            True if the mountain is known, False otherwise
        """
        # Case-insensitive check if any of the mountains in our list is
        # mentioned in the output (this is so that we match Mount Matterhorn 
        # as well as just Matterhorn)
        return any(mountain in ctx.output.mountain.lower() for mountain in self.__known_mountains)

### Add the Evaluator and Re-run

Custom evaluators integrate seamlessly with built-in ones.

In [None]:
# Add the new global evaluator to our dataset
mountain_dataset.add_evaluator(EntityCheck())

report = await mountain_dataset.evaluate(
    run_agent, retry_evaluators=retry_config, progress=False
)

print(report)



### Analyzing Updated Results

Now each case shows one additional evaluator checking entity validity.

Here the first case failed two evaluators. But why? It passed on the previous run!

This is _exactly_ the problem with LLMs: their output is often non-deterministic, so the same input can produce different outputs. This is why, as we will see, you need to run each test multiple times to estimate the average performance of your models and prompts.

Let's look at what happened in that case:

In [115]:
pprint.pprint(report.cases[0])

ReportCase(name='matterhorn_identification',
           inputs=[MountainQAInput(image='matterhorn.png', question='What mountain is shown in this image?')],
           metadata={'focus': 'matterhorn'},
           expected_output=None,
           output=MountainQAResponse(answer='The mountain shown in the image is Mount Blanc, located in the Alps...', mountain='Mount Blanc'),
           metrics={},
           attributes={},
           scores={},
           labels={},
           assertions={'EntityCheck': EvaluationResult(name='EntityCheck',
                                                       value=False,
                                                       reason=None,
                                                       source=EvaluatorSpec(name='EntityCheck', arguments=None)),
                       'IsInstance': EvaluationResult(name='IsInstance',
                                                      value=True,
                                                      reason=None,

We can see that the model answered:


"The mountain shown in the image is Mount Blanc, located in the Alps..."

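So it really did fail to recognize the Matterhorn, even though it had recognized it before! Notice that the answer repeats Example 1 from our system prompt word for word, so the model appears to have echoed the example instead of describing the image. We need to account for this flakiness in our testing, as we will see later.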
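One simple way to account for this kind of flakiness is to run the same input several times and report a pass rate instead of a single pass/fail. The sketch below uses a simulated system so that it runs without a model; `flaky_system` and `pass_rate` are hypothetical names, and the "three correct answers, then one wrong" pattern is invented purely for illustration.

```python
from itertools import cycle

# Hypothetical stand-in for the vision model: three correct answers, then one wrong.
_answers = cycle(["Matterhorn", "Matterhorn", "Matterhorn", "Mount Blanc"])


def flaky_system(question: str) -> str:
    return next(_answers)


def pass_rate(system, question: str, check, runs: int) -> float:
    """Run the same input `runs` times and return the fraction of runs that pass."""
    passed = sum(check(system(question)) for _ in range(runs))
    return passed / runs


rate = pass_rate(
    flaky_system,
    "What mountain is shown in this image?",
    check=lambda mountain: "matterhorn" in mountain.lower(),
    runs=8,
)
print(f"pass rate over 8 runs: {rate:.2f}")  # pass rate over 8 runs: 0.75
```

With a real model the rate will vary between experiments, so more runs give a more stable estimate at the cost of more model calls.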

## Key Takeaways from Part 1

### What We've Built

1. **Multimodal Q&A System**: Processes images and questions, returns structured data
2. **Test Framework**: Systematic evaluation using Pydantic Evals
3. **Multiple Evaluator Types**: Built-in (IsInstance, LLMJudge) and custom (EntityCheck)
4. **Reproducible Pipeline**: Automated testing with detailed reporting

### Testing Best Practices

```
┌─────────────────────────────────────────────────────────────┐
│                   Testing Pyramid for AI                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│                    ▲                                        │
│                   ╱ ╲        Manual Review                  │
│                  ╱   ╲       (Edge cases, UX)               │
│                 ╱─────╲                                     │
│                ╱       ╲     Domain-Specific                │
│               ╱  Custom ╲    (EntityCheck, Factuality)      │
│              ╱───────────╲                                  │
│             ╱             ╲  Semantic Quality               │
│            ╱   LLM Judge   ╲ (Relevance, Safety)            │
│           ╱─────────────────╲                               │
│          ╱                   ╲                              │
│         ╱   Type Validation   ╲                             │
│        ╱    (IsInstance, etc)  ╲                            │
│       ╱─────────────────────────╲                           │
│                                                             │
│  Foundation → More tests, faster, cheaper                   │
│  Top → Fewer tests, slower, more expensive                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Current Limitations

Our testing so far has some gaps:

1. **Binary evaluation**: Evaluators return pass/fail, not graded scores
2. **No hallucination detection**: We don't catch factual errors systematically
3. **Limited fact-checking**: LLMJudge checks overall quality but not individual claims
4. **Single runs**: We haven't tested consistency across multiple executions, and we have seen how the model can unexpectedly fail sometimes.

**In Part 2**, we'll address these by building:
- A sophisticated hallucination detector that scores factual accuracy
- Agents that extract and verify individual facts
- Comprehensive test suites with repeated runs
- Detailed metrics for production monitoring

# Part 2




## Advanced Evaluation: Hallucination Detection

In Part 1, we built evaluators that check overall quality. Now we'll build a system that:
- Extracts individual facts from responses
- Verifies each fact against ground truth
- Assigns numerical scores (not just pass/fail)

### The Hallucination Problem

```
Model Output: "The Matterhorn is 3500m tall and was first climbed in 1865."
                              ^^^^^^^^^ WRONG!              ^^^^ CORRECT

Ground Truth: "The Matterhorn is 4478m tall and was first climbed in 1865."
```

LLMJudge might pass this despite the error. We need fact-level verification.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────┐
│              Truth Scoring Pipeline                         │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Model Output                                               │
│  "The Matterhorn is 3500m tall in the Alps on the           │
│   Switzerland-Italy border. First climbed in 1865."         │
│          ↓                                                  │
│  Facts Divider Agent                                        │
│  Extracts: ["3500m tall", "in the Alps",                    │
│             "Switzerland-Italy border", "climbed 1865"]     │
│          ↓                                                  │
│  Truth Scorer Agent (with Ground Truth)                     │
│  Grades: [False, True, True, True]                          │
│          ↓                                                  │
│  Score: 3/4 = 0.75                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
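The last step of the pipeline is plain arithmetic: the score is the fraction of extracted facts that were graded as true. The sketch below reproduces the 3/4 example from the diagram; `truth_score` is an illustrative helper, and returning 0.0 when no facts were extracted is an assumption made here to avoid a division by zero.

```python
from typing import List


def truth_score(grades: List[bool]) -> float:
    """Fraction of extracted facts graded as true; 0.0 when no facts were extracted."""
    if not grades:
        return 0.0  # assumption: an answer with no verifiable facts scores zero
    return sum(grades) / len(grades)


facts = ["3500m tall", "in the Alps", "Switzerland-Italy border", "climbed 1865"]
grades = [False, True, True, True]

for fact, is_true in zip(facts, grades):
    print("✓" if is_true else "✗", fact)
print(f"score = {truth_score(grades):.2f}")  # score = 0.75
```

Because every fact has equal weight, one wrong elevation costs as much as one wrong date. Weighting facts by importance would be a possible refinement.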

### Step 1: Facts Extraction Agent

This agent takes a paragraph and breaks it into atomic facts.

In [None]:
# Import necessary modules (assumes Part 1 setup is complete)
import json
import textwrap
from typing import List
from pydantic import BaseModel, Field
from pydantic_ai import Agent, RunContext
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, EvaluationReason
from dataclasses import dataclass
import pprint
from PIL import Image


# Define a Pydantic model for the facts output
class FactsOutput(BaseModel):
    facts: List[str] = Field(description="List of facts about the mountain")


facts_divider_agent = Agent(
    model=gemini_2_5_flash_lite,
    output_type=FactsOutput,
    system_prompt=(
        """
        Extract individual facts about mountains from text.
        Each fact should be a concise, standalone statement.
        
        Example Input:
        "The Matterhorn is 4478m tall in the Alps on the Switzerland-Italy border. 
        First ascended in 1865. Known in Italy as Monte Cervino."
        
        Example Output:
        - The Matterhorn is 4478 meters tall
        - The Matterhorn is located in the Alps
        - The Matterhorn is on the Switzerland-Italy border
        - The Matterhorn was first ascended in 1865
        - The Matterhorn is known in Italy as Monte Cervino
        """
    ),
    retries=3,
)

# Test it
text = (
    "The Matterhorn is a mountain in the Alps on the border between Switzerland and Italy. "
    "It has an elevation of 4,478 meters (14,692 feet) and is one of the highest peaks in the Alps. "
    "The Matterhorn was first ascended in 1865 by a team led by Edward Whymper. "
    "It is known as Monte Cervino in Italian and Mont Cervin in French."
)
facts_output = await facts_divider_agent.run(text)
print("Extracted Facts:")
pprint.pprint(facts_output.output.model_dump())

Extracted Facts:
{'facts': ['The Matterhorn is a mountain in the Alps.',
           'The Matterhorn is on the border between Switzerland and Italy.',
           'The Matterhorn has an elevation of 4,478 meters (14,692 feet).',
           'The Matterhorn is one of the highest peaks in the Alps.',
           'The Matterhorn was first ascended in 1865 by a team led by Edward '
           'Whymper.',
           'The Matterhorn is known as Monte Cervino in Italian.',
           'The Matterhorn is known as Mont Cervin in French.']}


### Step 2: Truth Scoring Agent

This agent grades the extracted facts against ground truth using dynamic prompting, a Pydantic AI capability that lets us customize parts of the agent for a specific run. Here we tailor the system prompt and the instructions to a specific mountain and to the facts we know about it, which come from our hypothetical database.

In [None]:
class TruthOutput(BaseModel):
    is_true: List[bool] = Field(
        description="Indicates whether each provided statement is true or false"
    )
    rationale: List[str] = Field(
        description="Rationale for why each statement is true or false"
    )


@dataclass
class MountainDeps:
    """Dependencies injected into the truth scorer"""

    mountain_name: str
    mountain_facts: List[str]  # Ground truth
    model_facts: FactsOutput  # Facts to verify


truth_scorer_agent = Agent(
    model=gemini_2_5_flash_lite,
    output_type=TruthOutput,
    deps_type=MountainDeps,
    retries=3,
)


# Dynamic system prompt based on the mountain
@truth_scorer_agent.system_prompt
def dynamic_prompt(ctx: RunContext[MountainDeps]) -> str:
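    # Prefix every fact with "- " (a plain "\n- ".join would leave the first one bare)
    formatted_facts = "\n".join(f"- {fact}" for fact in ctx.deps.mountain_facts)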

    return f"""
    You are an expert on the {ctx.deps.mountain_name} mountain.
    
    Known facts:
    {formatted_facts}
    
    Grade statements as true or false based on these facts.
    """


# Inject the facts to be verified
@truth_scorer_agent.instructions
def inject_facts(ctx: RunContext[MountainDeps]) -> str:
    facts = "\n- ".join(ctx.deps.model_facts.facts)
    return f"Grade these statements:\n- {facts}"


# Test the truth scorer
mountain_name = "Matterhorn"
ground_truth_facts = [
    "The Matterhorn is 4478 meters tall",
    "The Matterhorn is located in the Alps",
    "The Matterhorn is on the Switzerland-Italy border",
    "The Matterhorn was first ascended in 1865",
    "The Matterhorn is known in Italy as Monte Cervino",
]
# Introduce an error for testing
test_facts = FactsOutput(
    facts=[
        "The Matterhorn has an elevation of 3500 meters",  # false!
        "The Matterhorn is situated in the Alpine mountain range",  # true
        "The Matterhorn straddles the border between Switzerland and Italy",  # true
    ]
)
truth_output = await truth_scorer_agent.run(
    deps=MountainDeps(
        mountain_name=mountain_name,
        mountain_facts=ground_truth_facts,
        model_facts=test_facts,
    )
)
print("Truth Scoring Results:")
pprint.pprint(truth_output.output.model_dump())


Truth Scoring Results:
{'is_true': [False, True, True],
 'rationale': ['The Matterhorn has an elevation of 4478 meters, not 3500 '
               'meters.',
               'The Matterhorn is indeed situated in the Alpine mountain '
               'range.',
               'The Matterhorn straddles the border between Switzerland and '
               'Italy.']}


### Step 3: Test the Pipeline

Before building the evaluator, verify the agents work correctly.

In [120]:
# Load ground truth
with open("matterhorn_facts.json") as f:
    matterhorn_facts: list[str] = json.load(f)

# Test input with one error (wrong altitude)
test_answer = """The mountain is the Matterhorn. It is 3500 meters tall, 
located in the Alps, and was first ascended in 1865."""

# Extract facts
facts_result = await facts_divider_agent.run([test_answer])
print("Extracted facts:")
for i, fact in enumerate(facts_result.output.facts, 1):
    print(f"{i}. {fact}")

# Verify facts
deps = MountainDeps(
    mountain_name="Matterhorn",
    mountain_facts=matterhorn_facts,
    model_facts=facts_result.output,
)

truth_result = await truth_scorer_agent.run(deps=deps)
print("\nVerification results:")
for i, (fact, is_true, reason) in enumerate(zip(
    facts_result.output.facts,
    truth_result.output.is_true,
    truth_result.output.rationale
), 1):
    status = "✓" if is_true else "✗"
    print(f"{i}. {status} {fact}")
    print(f"   → {reason}\n")

# Calculate score
score = sum(truth_result.output.is_true) / len(truth_result.output.is_true)
print(f"Truth Score: {score:.2f}")

Extracted facts:
1. The Matterhorn is 3500 meters tall
2. The Matterhorn is located in the Alps
3. The Matterhorn was first ascended in 1865

Verification results:
1. ✗ The Matterhorn is 3500 meters tall
   → The Matterhorn is 4,478 meters tall, not 3500 meters.

2. ✓ The Matterhorn is located in the Alps
   → The Matterhorn is located in the Alps, straddling the border between Switzerland and Italy.

3. ✓ The Matterhorn was first ascended in 1865
   → The Matterhorn was first ascended on July 14, 1865.

Truth Score: 0.67
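
The score computed above assumes the scorer returns exactly one verdict per extracted fact, which an LLM does not guarantee. A minimal sketch of a guard follows; `truth_score` is a hypothetical helper name that does not appear elsewhere in this notebook, and returning 0.0 for an empty list is an assumption chosen to mirror the "No facts extracted" case.

```python
def truth_score(facts: list[str], verdicts: list[bool]) -> float:
    """Fraction of facts judged true, with basic sanity checks."""
    # zip() would silently drop extra items, so compare lengths explicitly.
    if len(facts) != len(verdicts):
        raise ValueError(
            f"expected {len(facts)} verdicts, got {len(verdicts)}"
        )
    # Avoid dividing by zero when nothing was extracted.
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


example_facts = [
    "The Matterhorn is 3500 meters tall",
    "The Matterhorn is located in the Alps",
    "The Matterhorn was first ascended in 1865",
]
example_score = truth_score(example_facts, [False, True, True])
print(f"Truth Score: {example_score:.2f}")
```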


### Step 4: Build the TruthScoring Evaluator

Now wrap everything into a reusable evaluator.

In [121]:
@dataclass
class TruthScoring(Evaluator):
    """Evaluates factual accuracy by extracting and verifying individual claims"""
    
    def __init__(self):
        self.__known_mountains = ["matterhorn", "everest", "mont blanc", "kilimanjaro"]
        self._truth_result: TruthOutput | None = None
        super().__init__()
    
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> EvaluationReason:
        # Return 0 if mountain is unknown
        if not any(
            mountain in ctx.output.mountain.lower()
            for mountain in self.__known_mountains
        ):
            return EvaluationReason(value=0, reason="Unknown mountain")
        
        # Load ground truth (in production, this would be a database query).
        # Note: this demo only ships facts for the Matterhorn.
        with open("matterhorn_facts.json") as f:
            known_facts = json.load(f)
        
        # Extract facts from model output
        facts_result = await facts_divider_agent.run([ctx.output.answer])
        
        if len(facts_result.output.facts) == 0:
            return EvaluationReason(value=0, reason="No facts extracted")
        
        # Verify each fact
        deps = MountainDeps(
            mountain_name=ctx.output.mountain,
            mountain_facts=known_facts,
            model_facts=facts_result.output,
        )
        
        truth_result = await truth_scorer_agent.run(deps=deps)
        self._truth_result = truth_result.output
        
        # Calculate score
        true_statements = sum(truth_result.output.is_true)
        score = true_statements / len(truth_result.output.is_true)
        
        # Build detailed reason
        reason = "\n".join([
            f"{fact} is "
            + ("true" if truth_result.output.is_true[i] else "false")
            + f" because {truth_result.output.rationale[i]}"
            for i, fact in enumerate(facts_result.output.facts)
        ])
        
        return EvaluationReason(
            value=score,
            reason=textwrap.dedent(
                f"""
                {true_statements} out of {len(truth_result.output.is_true)} statements are true.
                Details:
                {reason}
                """
            ),
        )
    
    @property
    def truth_result(self) -> TruthOutput | None:
        return self._truth_result
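
One detail worth knowing about the `reason` string: `textwrap.dedent` only strips whitespace that is common to every non-blank line. When a multi-line value is interpolated into an indented f-string before dedenting, its second and later lines start at column zero, so there is no common margin and nothing is removed. The sketch below illustrates the behaviour with hypothetical strings; dedenting the template first and formatting afterwards is one possible alternative, not necessarily what this notebook needs.

```python
import textwrap

# Hypothetical multi-line details, similar in shape to the joined reasons.
details = "fact A is true\nfact B is false"

# Interpolate first, dedent second: the line "fact B is false" has no
# indentation, so dedent finds no common margin and keeps the indentation.
interpolated_first = textwrap.dedent(
    f"""
    1 out of 2 statements are true.
    Details:
    {details}
    """
)

# Dedent the template first, then fill in the values.
template = textwrap.dedent(
    """
    {count} out of {total} statements are true.
    Details:
    {details}
    """
)
dedented_first = template.format(count=1, total=2, details=details)

print(repr(interpolated_first))
print(repr(dedented_first))
```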

## Putting it all together

### Dealing with variance

AI systems can be inconsistent. The same input might produce different outputs due to:
- Temperature settings (randomness)
- Model non-determinism
- API variations

Running tests multiple times helps identify:
- **Variance**: How much does an answer change from run to run?
- **Consistency**: How stable is the quality of the answers?
- **Edge cases**: Rare failure modes

Measuring the third (edge cases) requires large datasets.

For the first two, however, a simple approach works: repeat the exact same test several times and measure how much the answers vary. Repeated runs are especially useful for small eval datasets. Once the dataset grows large enough, averaging over many different test cases already accounts for this variability, so the number of repeats can be reduced.
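
As a rough illustration of measuring that variability, the sketch below computes the mean and sample standard deviation of a list of scores from repeated runs. The scores and the `summarize` helper are hypothetical, made up for this example; they are not produced by the pipeline in this notebook.

```python
from statistics import mean, stdev

# Hypothetical scores from five repeats of the same test case.
run_scores = [0.67, 1.0, 0.67, 1.0, 1.0]


def summarize(scores: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of the scores."""
    if not scores:
        raise ValueError("need at least one score")
    # stdev() needs at least two data points; treat a single run as zero spread.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


avg, spread = summarize(run_scores)
print(f"mean={avg:.2f} stdev={spread:.2f}")
```

A high mean with a low spread suggests stable behaviour; a large spread on a single case is a hint to inspect the individual runs.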

In [122]:
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import IsInstance, LLMJudge
from typing import Any
from pydantic_ai.retries import RetryConfig
import tenacity

def create_repeated_cases(base_cases, num_repeats=5):
    """Clone each test case multiple times"""
    repeated_cases = []
    for base_case in base_cases:
        for j in range(num_repeats):
            repeated_cases.append(
                Case(
                    name=f"{base_case.name}_run_{j+1}",
                    inputs=base_case.inputs,
                    expected_output=base_case.expected_output,
                    metadata={**(base_case.metadata or {}), "run_number": j + 1},
                    evaluators=base_case.evaluators,
                )
            )
    return repeated_cases
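
Because each clone is named `<base name>_run_<n>`, results can later be grouped back by base case by stripping that suffix. The following is a minimal sketch under that naming assumption; `group_by_base_case` and the example scores are hypothetical and work on plain `(name, score)` pairs rather than on a Pydantic Evals report object.

```python
import re
from collections import defaultdict

# Matches the "_run_<n>" suffix appended to each cloned case name.
_RUN_SUFFIX = re.compile(r"_run_\d+$")


def group_by_base_case(
    named_scores: list[tuple[str, float]],
) -> dict[str, list[float]]:
    """Collect the scores of all repeats under their base case name."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for case_name, score in named_scores:
        base_name = _RUN_SUFFIX.sub("", case_name)
        grouped[base_name].append(score)
    return dict(grouped)


# Hypothetical scores, for illustration only.
example_scores = [
    ("matterhorn_facts_run_1", 0.67),
    ("matterhorn_facts_run_2", 1.0),
    ("matterhorn_identification_run_1", 1.0),
]
grouped_scores = group_by_base_case(example_scores)
print(grouped_scores)
```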

### Build Comprehensive Test Suite

In [123]:
# Reuse MountainQAInput and EntityCheck from Part 1
class MountainQAInput(BaseModel):
    image: str = Field(description="Path to an image of a mountain")
    question: str = Field(description="Question about the image")

@dataclass
class EntityCheck(Evaluator):
    def __init__(self):
        self.__known_mountains = ["matterhorn", "everest", "mont blanc", "kilimanjaro"]
        super().__init__()
    
    async def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return ctx.output.mountain.lower() in self.__known_mountains

# Define base cases
base_cases = [
    Case(
        name="matterhorn_identification",
        inputs=[
            MountainQAInput(
                image="matterhorn.png",
                question="What mountain is shown in this image?",
            )
        ],
        expected_output=None,
        metadata={"focus": "matterhorn"},
        evaluators=(
            LLMJudge(
                model=gemini_2_5_flash_lite,
                rubric="The answer should correctly identify the mountain as the Matterhorn",
                include_input=False,
            ),
            EntityCheck(),
        ),
    ),
    Case(
        name="matterhorn_facts",
        inputs=[
            MountainQAInput(
                image="matterhorn.png",
                question="Tell me everything you know about this mountain.",
            )
        ],
        expected_output=None,
        metadata={"focus": "matterhorn"},
        evaluators=(
            LLMJudge(
                model=gemini_2_5_flash_lite,
                # Note: the backslash inside the f-string expression below requires Python 3.12+ (PEP 701)
                rubric=textwrap.dedent(
                    f"""
                    The answer should correctly identify the mountain as the Matterhorn, and provide 
                    relevant and correct facts from:

                    - { '\n- '.join(matterhorn_facts) }

                    The answer doesn't need all facts, but must not contain false information.
                    """
                ),
            ),
            EntityCheck(),
            TruthScoring(),  # Our new evaluator!
        ),
    ),
]

# Create dataset with 10 repeats of each case
mountain_dataset = Dataset[str, MountainQAResponse, Any](
    cases=create_repeated_cases(base_cases, num_repeats=10),
    evaluators=[
        IsInstance(type_name="MountainQAResponse"),
        LLMJudge(
            rubric="The answer is relevant to the question and the image",
            include_input=True,
            model=gemini_2_5_flash_lite,
        ),
        LLMJudge(
            rubric="The answer does not contain any harmful or biased content",
            include_input=True,
            model=gemini_2_5_flash_lite,
        ),
    ],
)

### Run Full Evaluation

This will execute 20 test cases (2 base cases × 10 repeats).

In [124]:
retry_config = RetryConfig(
    stop=tenacity.stop_after_attempt(10),
    wait=tenacity.wait_full_jitter(multiplier=0.5, max=15),
)

# Reuse run_agent from Part 1
async def run_agent(inputs: list[MountainQAInput]) -> MountainQAResponse:
    assert len(inputs) == 1
    image = Image.open(inputs[0].image)
    return await qa_system.answer_question(image, inputs[0].question)

report = await mountain_dataset.evaluate(
    run_agent,
    max_concurrency=50,  # Run many tests in parallel
    retry_task=retry_config,
    retry_evaluators=retry_config,
)

print(report)


### Analyze Results

In [131]:
# Look at a specific case with TruthScoring results
pprint.pprint(report.cases[12].scores['TruthScoring'])

EvaluationResult(name='TruthScoring',
                 value=0.6666666666666666,
                 reason='\n'
                        '                2 out of 3 statements are true.\n'
                        '                Details:\n'
                        '                The Matterhorn is also known as Mount '
                        "Cervino. is true because The name 'Matterhorn' "
                        "derives from German words meaning 'peak of the "
                        "meadows'. 'Cervino' is the Italian name for the "
                        'mountain, reflecting its location on the border '
                        'between Switzerland and Italy.\n'
                        'The Matterhorn is the highest peak of the Pennine '
                        'Alps. is false because While the Matterhorn is a '
                        'prominent peak in the Alps, it is not the highest '
                        'peak of the Pennine Alps. Monte Rosa is the highest '
              

## Beyond this demo

We've introduced some basic concepts about testing multimodal AI applications. 

For the sake of the demo, we mixed different cases with different tests into a single dataset. Some people prefer to keep several datasets instead, each focused on one aspect (say, hallucinations) and containing several test cases for that aspect.

Pydantic Evals also provides integrations with tracing platforms such as Arize Phoenix and others. Those integrations are invaluable for digging into failures and debugging them.

Finally, creating good evals with LLM-as-a-judge means improving two sets of prompts concurrently: the prompts that generate the answers for the system under test, and the prompts of the LLM acting as judge. It is normal for this process to take several iterations before you land on a decent spot, so do not get discouraged! For this demo we kept things simple, but you'll find that the prompt or rubric for your LLM-as-a-judge tends to become quite detailed over time.