# LLM Judges for Agent Testing



In this lesson we will see how we can use LLMs to test other LLMs.

I tried asking our agent this question: "what is llm as a judge evaluation".

The result:

```bash
User: what is llm as a judge evaluation
TOOL CALL (search): search({"query": "LLM judge evaluation in legal settings"})
TOOL CALL (search): search({"query": "LLM in judicial decision making"})
TOOL CALL (search): search({"query": "LLM legal profession evaluation use cases"})
```

This "legal" and "judicial" interpretation doesn't make any sense for our documentation use case.

Also, the agent sometimes goes overboard with searches:

```bash
TOOL CALL (search): search({"query": "LLM as a judge evaluation"})
TOOL CALL (search): search({"query": "Evaluating language models as judges"})
TOOL CALL (search): search({"query": "LLM legal evaluation performance"})
TOOL CALL (search): search({"query": "LLM evaluator use cases"})
TOOL CALL (search): search({"query": "how to use LLM for evaluation"})
TOOL CALL (search): search({"query": "language model judge effectiveness"})
TOOL CALL (search): search({"query": "LLM evaluation methodology"})
TOOL CALL (search): search({"query": "training LLM for evaluations"})
TOOL CALL (search): search({"query": "applications of LLM in assessments"})
TOOL CALL (search): search({"query": "LLM effectiveness in judicial evaluations"})
TOOL CALL (search): search({"query": "LLM evaluation framework best practices"})
TOOL CALL (search): search({"query": "using language models for accuracy evaluation"})
TOOL CALL (search): search({"query": "custom language model evaluations"})
```

Probably because it can't find what it's looking for (judicial stuff), it keeps searching for more and more variations.

# Simple Keyword-Based Testing

We already know how to write a basic test that checks that words like "judicial" and "legal" don't appear:

```python
def test_no_judicial_terms_in_search_queries():
    result = run_agent_sync("what is llm as a judge evaluation")

    tool_calls = get_tool_calls(result)

    forbidden_terms = ['judicial', 'court', 'law', 'legal']

    for call in tool_calls:
        query = call.args['query'].lower()

        for term in forbidden_terms:
            assert term not in query, f"Forbidden term '{term}' found in search query: {query}"
```


But we can't anticipate all possible legal terminology in advance, so some words like "jurisprudence" can slip through.
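To see the gap concretely, here is a minimal sketch that applies the same substring check to a few queries. The helper name `find_forbidden_terms` and the sample queries are made up for this illustration and are not part of the project code.

```python
FORBIDDEN_TERMS = ["judicial", "court", "law", "legal"]


def find_forbidden_terms(query: str, terms: list[str] = FORBIDDEN_TERMS) -> list[str]:
    """Return the forbidden terms that appear as substrings of the query."""
    lowered = query.lower()
    return [term for term in terms if term in lowered]


# Hypothetical queries, made up for this illustration
caught = find_forbidden_terms("LLM legal evaluation performance")
missed = find_forbidden_terms("LLM jurisprudence evaluation")
false_alarm = find_forbidden_terms("known flaws of LLM evaluators")

print(caught)       # the keyword list catches 'legal'
print(missed)       # nothing is flagged, although the query is off-topic
print(false_alarm)  # 'law' matches inside 'flaws', a false positive
```

So a plain keyword list can fail in both directions: it misses unlisted terms and it can flag harmless words that happen to contain a listed one.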

# Creating an LLM Judge

Instead, we can create a Judge - an LLM that evaluates the output of another LLM according to specific criteria. It will be smart enough to recognize if "jurisprudence" sneaks in.

Let's create another test file: `test_agent_judge.py` (`test/test_agent_judge.py`). It will contain our judge as well as the tests.

First, define the judge structure:

```python
from pydantic import BaseModel
from pydantic_ai import Agent

judge_instructions = """
You are an expert judge evaluating the performance of an AI search agent.
"""

class JudgeCriterion(BaseModel):
    criterion_description: str
    passed: bool
    judgement: str


class JudgeFeedback(BaseModel):
    criteria: list[JudgeCriterion]
    feedback: str


def create_judge():
    judge = Agent(
        name="judge",
        instructions=judge_instructions,
        model="openai:gpt-4o-mini",
        output_type=JudgeFeedback,
    )
    return judge
```


This creates an agent that focuses solely on evaluation. (Technically, it's not an agent because it doesn't have any tools, but whatever.)

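The structured output (`JudgeFeedback`) gives us a pass/fail verdict and a written judgement for every criterion, plus overall feedback, so the tests can rely on the same fields each time.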

# Agent Evaluation Function

Create this evaluation function:

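```python
from typing import Any, Callable, Optional

from pydantic_ai.agent import AgentRunResult  # import path may vary between Pydantic AI versions


async def evaluate_agent_performance(
        criteria: list[str],
        result: AgentRunResult,
        output_transformer: Optional[Callable[[Any], str]] = None
) -> JudgeFeedback:
    judge = create_judge()

    # get_tool_calls is the helper from utils.py
    tool_calls = get_tool_calls(result)

    output = result.output
    if output_transformer is not None:
        output = output_transformer(output)

    # Join outside the f-string: backslashes inside f-string expressions
    # are a syntax error before Python 3.12
    criteria_text = "\n".join(criteria)
    tool_calls_text = "\n".join(str(c) for c in tool_calls)

    user_prompt = f"""
    Evaluate the agent's performance based on the following criteria:
    <CRITERIA>
    {criteria_text}
    </CRITERIA>

    The agent's final output was:
    <AGENT_OUTPUT>
    {output}
    </AGENT_OUTPUT>

    Tool calls:
    <TOOL_CALLS>
    {tool_calls_text}
    </TOOL_CALLS>
    """

    print("Judge evaluating with prompt:")
    print("-----")
    print(user_prompt)
    print("-----")

    eval_results = await judge.run(
        user_prompt=user_prompt
    )

    return eval_results.output
```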


This function takes the evaluation criteria as input and builds a prompt for the judge that contains the criteria, the agent's final output, and its tool calls.

The `output_transformer` allows us to format the agent's output (like converting structured data to readable text) before evaluation.
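As an illustration of what such a transformer might do, here is a minimal sketch with stand-in dataclasses. The `Section` fields (`heading`, `content`, `references`), the sample article, and the body of `format_article` are assumptions for this example; the real article model may look different.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    # Hypothetical fields, assumed for this sketch
    heading: str
    content: str
    references: list[str] = field(default_factory=list)


@dataclass
class Article:
    title: str
    sections: list[Section]

    def format_article(self) -> str:
        """Render the structured article as plain text that a judge can read."""
        lines = [f"# {self.title}", ""]
        for section in self.sections:
            lines.append(f"## {section.heading}")
            lines.append(section.content)
            if section.references:
                lines.append("References:")
                lines.extend(f"- {ref}" for ref in section.references)
            lines.append("")
        return "\n".join(lines).strip()


# Made-up article for illustration
article = Article(
    title="LLM as a Judge",
    sections=[
        Section(
            heading="Overview",
            content="An LLM judge scores the output of another model.",
            references=["docs/llm-judge.md"],  # made-up path
        )
    ],
)

formatted = article.format_article()
print(formatted)
```

Passing `output_transformer=lambda x: x.format_article()` would then hand the judge this text instead of the raw object's `repr`.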

# Using the Judge in Tests

Use it in your tests like this:

```python
import pytest


@pytest.mark.asyncio
async def test_no_judicial_terms_in_search_queries():
    criteria = [
        "agent makes 3 search calls",
        "the references are relevant to the topic",
        "each section has references",
        "all search queries are free of judicial terms (except 'judge')",
        "the article contains properly formatted python code examples",
    ]

    result = await run_agent("what is llm as a judge evaluation")

    eval_results = await evaluate_agent_performance(
        criteria,
        result,
        output_transformer=lambda x: x.format_article()
    )

    print(eval_results)

    for criterion in eval_results.criteria:
        print(criterion)
        assert criterion.passed, f"Criterion failed: {criterion.criterion_description}"
```


This approach allows you to define complex evaluation criteria in natural language rather than trying to code every possible edge case.
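One caveat of asserting inside the loop is that the test stops at the first failed criterion. A possible variation, sketched here with plain dataclasses standing in for `JudgeCriterion` and `JudgeFeedback`, collects every failure first so that a single assertion message lists them all. The helper `failed_criteria` and the sample verdicts are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Criterion:  # stand-in for JudgeCriterion
    criterion_description: str
    passed: bool
    judgement: str


@dataclass
class Feedback:  # stand-in for JudgeFeedback
    criteria: list[Criterion]
    feedback: str


def failed_criteria(feedback: Feedback) -> list[str]:
    """Return one 'description: judgement' line per failed criterion."""
    return [
        f"{c.criterion_description}: {c.judgement}"
        for c in feedback.criteria
        if not c.passed
    ]


# Made-up judge output for illustration
sample = Feedback(
    criteria=[
        Criterion("agent makes 3 search calls", True, "3 calls were made"),
        Criterion("search queries are free of judicial terms", False, "one query mentions 'legal'"),
    ],
    feedback="Mostly fine, but one query drifted off-topic.",
)

failures = failed_criteria(sample)
print(failures)

# In a test, a single assertion would then report every failure at once:
# assert not failures, "Criteria failed:\n" + "\n".join(failures)
```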

I'd also move the judge code (`create_judge` and `evaluate_agent_performance`) out into a separate file, like `judge.py`.

To run async tests with pytest, we need to install the `pytest-asyncio` plugin:

```bash
uv add --dev pytest-asyncio
```

# Filtering Internal Tool Calls

When running this test, I noticed that the prompt includes a `final_result` tool call:

```bash
Tool calls:
<TOOL_CALLS>
ToolCall(name='search', args={'query': 'LLM as a judge evaluation'})
ToolCall(name='search', args={'query': 'judge evaluation using LLM'})
ToolCall(name='search', args={'query': 'legal evaluation LLM applications'})
ToolCall(name='final_result', args={'title': 'LLM as a Judge in Evaluation Processes', 'sections' ...})
```

This is an internal virtual tool that Pydantic AI uses for structured output. From the [Pydantic AI documentation](https://ai.pydantic.dev/output/):

> By default, Pydantic AI leverages the model's tool calling capability to make it return structured data.

Other libraries (like LangChain) also use this pattern.

We don't need it in our evaluation: it consumes input tokens and isn't really useful. So let's filter it out in `get_tool_calls` (this is in `utils.py`):

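```python
import json

# Note: _output is a private module, so this import may change between versions
from pydantic_ai._output import DEFAULT_OUTPUT_TOOL_NAME

# ToolCall is assumed to be defined alongside this function in utils.py


def get_tool_calls(result) -> list[ToolCall]:
    """Extract tool-call parts from an agent result and return them as ToolCall objects."""
    calls: list[ToolCall] = []

    for m in result.new_messages():
        for p in m.parts:
            kind = p.part_kind

            if kind == 'tool-call':
                # Skip the internal tool used for structured output
                if p.tool_name == DEFAULT_OUTPUT_TOOL_NAME:
                    continue

                # args may arrive as a JSON string or already as a dict
                args = p.args
                if isinstance(args, str):
                    args = json.loads(args)

                call = ToolCall(
                    name=p.tool_name,
                    args=args
                )

                calls.append(call)

    return calls
```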
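Because `pydantic_ai._output` is a private module (note the leading underscore), this import may break between library versions. A more defensive variation, sketched here under that assumption, filters by the literal tool name `final_result` seen in the log above. The constant `INTERNAL_TOOL_NAMES`, the helper `drop_internal_calls`, and the `ToolCall` stand-in are illustrative and not part of Pydantic AI.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:  # stand-in for the project's ToolCall class
    name: str
    args: dict[str, Any]


# Assumption: the structured-output tool is called 'final_result', as in the log
INTERNAL_TOOL_NAMES = {"final_result"}


def drop_internal_calls(calls: list[ToolCall]) -> list[ToolCall]:
    """Keep only the tool calls the agent made on purpose, such as searches."""
    return [call for call in calls if call.name not in INTERNAL_TOOL_NAMES]


calls = [
    ToolCall("search", {"query": "LLM as a judge evaluation"}),
    ToolCall("final_result", {"title": "LLM as a Judge in Evaluation Processes"}),
]

visible = drop_internal_calls(calls)
print([call.name for call in visible])
```

The trade-off is that a hardcoded name silently stops filtering if the library ever renames the tool, so it is worth keeping a test around it.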

# Improving Agent Instructions

The initial test fails with feedback like:

"The agent's performance was close to meeting the criteria but fell short in a few key areas, including the number of search calls, the use of judicial terms in search queries, and the lack of Python code examples."

Let's fix this by updating the instructions:

```python
search_instructions = """
You are a search assistant for the Evidently documentation.

Evidently is an open-source Python library and cloud platform for evaluating, testing, and monitoring data and AI systems.
It provides evaluation metrics, testing APIs, and visual reports for model and data quality.

Your task is to help users find accurate, relevant information about Evidently's features, usage, and integrations.

Requirements:

- For every user query, you must perform at least 3 separate searches
    to gather enough context and verify accuracy.
- Each search should use a different angle, phrasing, or keyword
    variation of the user's query.
- Keep all searches relevant to Evidently and centered on technical
    or conceptual details from its documentation.
- After performing all searches, write a concise, accurate answer
    that synthesizes the findings.
- For each section, include references listing all the sources
    you used to write that section.
"""
```


We added more context about the Evidently library so the agent knows what kind of search terms to formulate.

Point 3 specifically guides the agent to stay relevant to the documentation context.

# Test Results and Iteration

Let's test the updated agent.

The original test (without a judge) now passes.

However, the judge test fails on the Python code requirement:

```bash
E           AssertionError: Criterion failed: The article contains properly formatted python code examples
E           assert False
```

In this particular case, we might not actually need Python code. We could force it through the instructions, but that might lead to hallucinated code or excessive searching. So let's drop this criterion from this particular test.

The test now passes.

# Cost Monitoring

Since we use Pydantic AI for both the agent and the judge, our cost report includes the total expenses:

```bash
Model: gpt-4o-mini
Cost: CostInfo(input_cost=0.0052, output_cost=0.0014, total_cost=0.0066)
```
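The total is simply the sum of the input and output costs. A minimal sketch of how such a report could be computed from token counts is below. The `CostInfo` stand-in mirrors the fields in the report above, while `estimate_cost`, the token counts, and the per-million-token prices are assumptions for illustration; check the provider's current pricing before relying on them.

```python
from dataclasses import dataclass


@dataclass
class CostInfo:  # stand-in with the same fields as the report above
    input_cost: float
    output_cost: float
    total_cost: float


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> CostInfo:
    """Convert token counts into dollar costs using per-million-token prices."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return CostInfo(input_cost, output_cost, input_cost + output_cost)


# Hypothetical token counts and prices, not taken from the actual run
cost = estimate_cost(
    input_tokens=20_000,
    output_tokens=1_000,
    input_price_per_million=0.15,
    output_price_per_million=0.60,
)
print(cost)
```

Note that the judge's prompt repeats the agent's full output and tool calls, so input tokens tend to dominate the bill.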

# Future Improvements

I noticed that the agent now includes "Evidently" in all search queries, which isn't always necessary. We can create additional tests to catch this behavior, see them fail, and then refine our prompts accordingly.
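A deterministic check is enough for this one. As a sketch, the helper below (hypothetical, as are the sample queries) measures how many queries mention the product name, so that a test can fail when the agent adds it everywhere.

```python
def share_of_queries_with_term(queries: list[str], term: str) -> float:
    """Return the fraction of queries that contain the term (case-insensitive)."""
    if not queries:
        return 0.0
    hits = sum(1 for query in queries if term.lower() in query.lower())
    return hits / len(queries)


# Made-up queries for illustration
queries = [
    "Evidently LLM as a judge evaluation",
    "Evidently LLM judge templates",
    "LLM evaluation metrics",
    "Evidently custom LLM evaluator",
]

share = share_of_queries_with_term(queries, "evidently")
print(share)

# A test could then require that not every query carries the product name,
# for example: assert share < 1.0
```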

Other evaluation criteria we could implement:

- Reference Quality: Check that references are relevant to their sections

- Content Completeness: Ensure each section has sufficient content

- Search Efficiency: Verify the agent doesn't make redundant searches
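Search efficiency, for instance, can be approximated without a judge. The sketch below flags pairs of queries whose word sets overlap heavily (Jaccard similarity). The function names, the second sample query, and the 0.6 threshold are arbitrary choices for illustration. Word overlap is a crude proxy that will miss paraphrases, which is where a judge would help.

```python
from itertools import combinations


def tokenize(query: str) -> set[str]:
    """Lowercase the query and split it into a set of words."""
    return set(query.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two word sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def find_redundant_pairs(queries: list[str], threshold: float = 0.6) -> list[tuple[str, str]]:
    """Return pairs of queries whose word overlap reaches the threshold."""
    return [
        (q1, q2)
        for q1, q2 in combinations(queries, 2)
        if jaccard(tokenize(q1), tokenize(q2)) >= threshold
    ]


queries = [
    "LLM as a judge evaluation",
    "evaluation LLM as a judge",  # made-up reordering of the first query
    "LLM evaluation framework best practices",
]

pairs = find_redundant_pairs(queries)
print(pairs)
```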

For some of the checks, we might even give the judge agent additional tools to perform more sophisticated evaluations. Then it will become a real agent.

This gives us a more flexible and powerful mechanism for testing our agents.

Note that judges are not a replacement for "usual" testing, but an addition. For many cases you probably don't need a judge.
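For example, the "agent makes 3 search calls" criterion doesn't need an LLM at all. A minimal sketch of a plain check is below; the helper `count_calls`, the upper bound, the sample calls, and the `ToolCall` stand-in are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:  # stand-in for the project's ToolCall class
    name: str
    args: dict[str, Any]


def count_calls(calls: list[ToolCall], tool_name: str) -> int:
    """Count how many times a given tool was called."""
    return sum(1 for call in calls if call.name == tool_name)


# Made-up tool calls for illustration
calls = [
    ToolCall("search", {"query": "LLM as a judge evaluation"}),
    ToolCall("search", {"query": "LLM evaluator use cases"}),
    ToolCall("search", {"query": "LLM evaluation methodology"}),
]

n_searches = count_calls(calls, "search")
print(n_searches)

# In a test: at least 3 searches, with an upper bound against runaway searching
assert 3 <= n_searches <= 6
```

A check like this is free, fast, and deterministic, so it makes sense to keep the judge for criteria that genuinely need language understanding.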