# Agent Evaluation

This notebook evaluates agent performance with the LLM-as-a-judge approach:
a judge model scores logged agent interactions against a quality checklist.

## Setup

In [1]:
# Install dependencies (tqdm is used for the progress bar in the evaluation loop)
%pip install pydantic-ai
%pip install python-dotenv
%pip install pandas
%pip install tqdm

Note: you may need to restart the kernel to use updated packages.


In [1]:
from dotenv import load_dotenv
import os
import json
from pathlib import Path
from pydantic import BaseModel
from pydantic_ai import Agent
import pandas as pd

# Load environment variables
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY not found. Set it in .env file.")

print("API Key loaded:", OPENAI_API_KEY[:6] + "...")

API Key loaded: sk-pro...


## Define Evaluation Schema

We define the structure for evaluation results using Pydantic models.

In [2]:
class EvaluationCheck(BaseModel):
    check_name: str
    justification: str
    check_pass: bool

class EvaluationChecklist(BaseModel):
    checklist: list[EvaluationCheck]
    summary: str

## Create Evaluation Agent (LLM as Judge)

The evaluation agent uses a detailed prompt to assess agent responses.

In [11]:
evaluation_prompt = """
Use this checklist to evaluate the quality of an AI agent's answer (<ANSWER>) to a user question (<QUESTION>).
We also include the entire log (<LOG>) for analysis.

For each item, check if the condition is met.

Checklist:
- instructions_follow: The agent followed the user's instructions (in <INSTRUCTIONS>)
- instructions_avoid: The agent avoided doing things it was told not to do
- answer_relevant: The response directly addresses the user's question
- answer_clear: The answer is clear and correct
- answer_citations: The response includes proper citations or sources when required
- completeness: The response is complete and covers all key aspects of the request
- tool_call_search: The search tool was invoked

Output true/false for each check and provide a short explanation for your judgment.
""".strip()

eval_agent = Agent(
    name='eval_agent',
    model='gpt-4o-mini',
    instructions=evaluation_prompt,
    output_type=EvaluationChecklist
)

## Load Log Files

Load the agent interaction logs that we want to evaluate.

In [20]:
LOG_DIR = Path('../logs')

def load_log_file(log_file):
    """Load a JSON log file and add the filename to the record."""
    with open(log_file, 'r', encoding='utf-8') as f:
        log_data = json.load(f)
        log_data['log_file'] = str(log_file)
        return log_data
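
The loader and the cells that follow read a handful of fields from each record. The sketch below spells out that implied shape with a hypothetical record and a small checker. The field names come from the code in this notebook, but the sample values and the `missing_fields` helper are illustrative assumptions, not a specification of the logging format.

```python
# Fields this notebook reads from each log record (inferred from the code here).
REQUIRED_TOP_LEVEL = ("system_prompt", "source", "messages")


def missing_fields(record):
    """Return the names of expected fields that are absent from a record."""
    problems = [key for key in REQUIRED_TOP_LEVEL if key not in record]
    for i, message in enumerate(record.get("messages", [])):
        if "kind" not in message:
            problems.append(f"messages[{i}].kind")
        for j, part in enumerate(message.get("parts", [])):
            if "part_kind" not in part:
                problems.append(f"messages[{i}].parts[{j}].part_kind")
    return problems


# Hypothetical record: all values are made up for illustration.
sample_record = {
    "system_prompt": "Answer questions using the search tool.",
    "source": "user",
    "messages": [
        {"kind": "request", "parts": [{"part_kind": "user-prompt", "content": "hi"}]},
        {"kind": "response", "parts": [{"part_kind": "text", "content": "Hello!"}]},
    ],
}

print(missing_fields(sample_record))
```

An empty list means every field the later cells index into is present. Running a check like this before evaluation can surface malformed logs early instead of as a `KeyError` mid-run.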

In [22]:
# Collect logs for a specific agent version
agent_name = 'gh_agent'  # or 'faq_agent_v2', etc.
eval_set = []

for log_file in LOG_DIR.glob('*.json'):
    if agent_name not in log_file.name:
        continue

    log_record = load_log_file(log_file)
    
    # Only evaluate real user interactions (change the value to select another source, such as AI-generated test data)
    if log_record.get('source') == 'user':
        eval_set.append(log_record)

print(f"Loaded {len(eval_set)} log files for evaluation")

Loaded 9 log files for evaluation


## Simplify Log Messages

Remove unnecessary details from logs to reduce token usage during evaluation.

In [23]:
def simplify_log_messages(messages):
    """Simplify log messages by removing unnecessary metadata."""
    log_simplified = []

    for m in messages:
        parts = []

        for original_part in m['parts']:
            # Shallow copy so the original log record is left untouched
            part = original_part.copy()
            kind = part['part_kind']

            # pop(..., None) tolerates parts that lack a given key
            if kind == 'user-prompt':
                part.pop('timestamp', None)
            elif kind == 'tool-call':
                part.pop('tool_call_id', None)
            elif kind == 'tool-return':
                part.pop('tool_call_id', None)
                part.pop('metadata', None)
                part.pop('timestamp', None)
                # Replace actual search results with placeholder to save tokens
                part['content'] = 'RETURN_RESULTS_REDACTED'
            elif kind == 'text':
                part.pop('id', None)

            parts.append(part)

        message = {
            'kind': m['kind'],
            'parts': parts
        }

        log_simplified.append(message)
    return log_simplified
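
To see the effect on a single part, the sketch below applies the same idea to a hypothetical tool-return part and compares the serialized size before and after. It uses a compact, table-driven stand-in (`strip_part` and `DROP_KEYS` are assumed names) rather than the function above, and the sample values are invented for illustration.

```python
import json

# Keys dropped per part kind, mirroring the function above.
DROP_KEYS = {
    "user-prompt": ("timestamp",),
    "tool-call": ("tool_call_id",),
    "tool-return": ("tool_call_id", "metadata", "timestamp"),
    "text": ("id",),
}


def strip_part(part):
    """Return a copy of one message part without its bulky metadata."""
    slim = dict(part)
    for key in DROP_KEYS.get(slim.get("part_kind"), ()):
        slim.pop(key, None)  # tolerate parts that lack the key
    if slim.get("part_kind") == "tool-return":
        slim["content"] = "RETURN_RESULTS_REDACTED"
    return slim


# Hypothetical tool-return part; all values are invented.
part = {
    "part_kind": "tool-return",
    "tool_name": "search",
    "tool_call_id": "call_123",
    "metadata": None,
    "timestamp": "2025-10-03T22:32:37Z",
    "content": [{"filename": "faq.md", "text": "long document text " * 50}],
}

before = len(json.dumps(part))
after = len(json.dumps(strip_part(part)))
print(f"{before} -> {after} characters")
```

Character count is only a rough proxy for tokens, but it shows that most of the savings come from redacting the tool-return content rather than from the dropped metadata keys.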

## Format Evaluation Prompt

Format the log records into the evaluation prompt structure.

In [24]:
user_prompt_format = """
<INSTRUCTIONS>{instructions}</INSTRUCTIONS>
<QUESTION>{question}</QUESTION>
<ANSWER>{answer}</ANSWER>
<LOG>{log}</LOG>
""".strip()

def format_evaluation_prompt(log_record):
    """Format a log record into an evaluation prompt."""
    instructions = log_record['system_prompt']
    
    # Extract question from first message
    question = log_record['messages'][0]['parts'][0]['content']
    
    # Extract answer from last message
    answer = log_record['messages'][-1]['parts'][0]['content']
    
    # Simplify and serialize log
    simplified_messages = simplify_log_messages(log_record['messages'])
    log = json.dumps(simplified_messages)
    
    return user_prompt_format.format(
        instructions=instructions,
        question=question,
        answer=answer,
        log=log
    )
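
`format_evaluation_prompt` assumes the answer is the first part of the last message. If a final response ever starts with a different part kind (for example a tool call), that lookup could pick up the wrong part or fail. A more defensive lookup might scan backwards for the last `text` part. `extract_final_text` below is a hypothetical helper, not part of the original notebook, and the sample messages are invented.

```python
def extract_final_text(messages):
    """Return the content of the last part with part_kind == 'text', or None."""
    for message in reversed(messages):
        for part in reversed(message.get("parts", [])):
            if part.get("part_kind") == "text":
                return part.get("content")
    return None


# Hypothetical messages: the final response holds a tool call followed by text.
messages = [
    {"kind": "request", "parts": [{"part_kind": "user-prompt", "content": "hi"}]},
    {
        "kind": "response",
        "parts": [
            {"part_kind": "tool-call", "tool_name": "search", "args": "{}"},
            {"part_kind": "text", "content": "Hello! How can I assist you today?"},
        ],
    },
]

print(extract_final_text(messages))
```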

## Run Evaluation

Evaluate each log record using the evaluation agent.

In [25]:
async def evaluate_log_record(log_record):
    """Evaluate a single log record."""
    user_prompt = format_evaluation_prompt(log_record)
    result = await eval_agent.run(user_prompt, output_type=EvaluationChecklist)
    return result

In [26]:
from tqdm.auto import tqdm

eval_results = []

for log_record in tqdm(eval_set, desc="Evaluating logs"):
    eval_result = await evaluate_log_record(log_record)
    eval_results.append((log_record, eval_result))

print(f"\nCompleted evaluation of {len(eval_results)} interactions")

Evaluating logs:   0%|          | 0/9 [00:00<?, ?it/s]


Completed evaluation of 9 interactions
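
The loop above awaits one evaluation at a time. If rate limits allow, the calls could run concurrently with a cap on in-flight requests. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. `evaluate_all`, `max_concurrency`, and the `fake_evaluate` stand-in are assumptions for illustration; in this notebook the `evaluate` argument would be `evaluate_log_record`.

```python
import asyncio


async def evaluate_all(records, evaluate, max_concurrency=3):
    """Run `evaluate` over records with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(record):
        async with semaphore:
            return record, await evaluate(record)

    # gather preserves input order, so results line up with records.
    return await asyncio.gather(*(run_one(r) for r in records))


async def fake_evaluate(record):
    """Stand-in for the real judge call; just echoes the record id."""
    await asyncio.sleep(0)
    return f"evaluated {record['id']}"


async def demo():
    records = [{"id": i} for i in range(5)]
    return await evaluate_all(records, fake_evaluate, max_concurrency=2)

# In a notebook: results = await demo()
# In a script:   results = asyncio.run(demo())
```

The result is a list of `(record, result)` pairs in input order, which matches the shape of `eval_results` built by the sequential loop.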


## Process Results into DataFrame

Convert evaluation results into a pandas DataFrame for analysis.

In [27]:
rows = []

for log_record, eval_result in eval_results:
    messages = log_record['messages']
    
    row = {
        'file': Path(log_record['log_file']).name,
        'question': messages[0]['parts'][0]['content'],
        'answer': messages[-1]['parts'][0]['content'],
    }
    
    # Extract check results
    checklist = eval_result.output
    checks = {c.check_name: c.check_pass for c in checklist.checklist}
    row.update(checks)
    
    # Add summary
    row['summary'] = checklist.summary
    
    rows.append(row)

df_evals = pd.DataFrame(rows)
df_evals.head()

| | file | question | answer | instructions_follow | instructions_avoid | answer_relevant | answer_clear | answer_citations | completeness | tool_call_search | summary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | gh_agent_20251003_223237_2efc1f.json | what requirements i need to join? | It seems there isn't a direct answer regarding... | False | True | False | False | False | False | True | The agent's response failed to follow the inst... |
| 1 | gh_agent_20251003_223314_864384.json | what course you offer? | The course offered is the **Data Engineering Z... | True | True | True | True | True | True | True | The agent's answer is comprehensive, relevant,... |
| 2 | gh_agent_20251003_233528_7c5fd3.json | hi | Hello! How can I assist you today? | False | True | False | True | False | False | False | The agent did not follow the instructions to u... |
| 3 | gh_agent_20251004_004211_2ac759.json | what due date to register? | I couldn't find specific information about the... | True | True | True | True | False | False | True | The agent followed the instructions and made a... |
| 4 | gh_agent_20251004_025044_c9ba1b.json | hello | Hello! How can I assist you today? | False | True | False | True | False | False | False | The agent's response fails to follow several k... |


## Calculate Metrics

Calculate overall pass rates for each evaluation criterion.

In [28]:
# Calculate mean pass rate for each check
check_columns = [
    'instructions_follow',
    'instructions_avoid',
    'answer_relevant',
    'answer_clear',
    'answer_citations',
    'completeness',
    'tool_call_search'
]

print("Evaluation Results:")
print("=" * 50)
if df_evals.empty:
    print("No evaluation results available.")
else:
    for col in check_columns:
        if col in df_evals.columns:
            pass_rate = df_evals[col].mean() * 100
            print(f"{col}: {pass_rate:.1f}% pass")
    # Overall pass rate
    existing_cols = [col for col in check_columns if col in df_evals.columns]
    if existing_cols:
        overall_pass_rate = df_evals[existing_cols].mean().mean() * 100
        print("=" * 50)
        print(f"Overall Pass Rate: {overall_pass_rate:.1f}%")
    else:
        print("No check columns found in evaluation results.")

Evaluation Results:
instructions_follow: 44.4% pass
instructions_avoid: 100.0% pass
answer_relevant: 44.4% pass
answer_clear: 77.8% pass
answer_citations: 22.2% pass
completeness: 22.2% pass
tool_call_search: 55.6% pass
Overall Pass Rate: 52.4%
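
The overall figure is the unweighted mean of the seven per-check rates, not the share of interactions that pass every check. As a check on the arithmetic, the pass counts below are inferred from the printed percentages and the 9 evaluated logs (for example, 44.4% of 9 is 4). The variable names are illustrative.

```python
# Pass counts inferred from the printed percentages (n = 9 logs).
n_logs = 9
pass_counts = {
    "instructions_follow": 4,
    "instructions_avoid": 9,
    "answer_relevant": 4,
    "answer_clear": 7,
    "answer_citations": 2,
    "completeness": 2,
    "tool_call_search": 5,
}

# Per-check pass rate, then the unweighted mean across checks.
per_check = {name: count / n_logs for name, count in pass_counts.items()}
overall = sum(per_check.values()) / len(per_check)

for name, rate in per_check.items():
    print(f"{name}: {rate * 100:.1f}% pass")
print(f"Overall Pass Rate: {overall * 100:.1f}%")
```

Because every check is weighted equally, a check that almost always passes (such as `instructions_avoid`) lifts the overall number even when stricter checks fail often.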


## Save Results

Save the evaluation results to a CSV file for further analysis.

In [29]:
output_path = Path('../logs/evaluation_results.csv')
df_evals.to_csv(output_path, index=False)
print(f"Saved evaluation results to {output_path}")

Saved evaluation results to ../logs/evaluation_results.csv


## Analyze Failed Checks

Examine interactions that failed specific checks.

In [30]:
# Show interactions that failed answer_citations check
if 'answer_citations' in df_evals.columns:
    failed_citations = df_evals[df_evals['answer_citations'] == False]
    print(f"\nFound {len(failed_citations)} interactions without proper citations:")
    print(failed_citations[['question', 'answer']].head())
else:
    print("No 'answer_citations' column found in evaluation results.")


Found 7 interactions without proper citations:
                            question  \
0  what requirements i need to join?   
2                                 hi   
3         what due date to register?   
4                              hello   
5                                 yo   

                                              answer  
0  It seems there isn't a direct answer regarding...  
2                 Hello! How can I assist you today?  
3  I couldn't find specific information about the...  
4                 Hello! How can I assist you today?  
5                 Hello! How can I assist you today?  


## Search Quality Evaluation (Optional)

Evaluate the search function using information retrieval metrics.

In [31]:
def evaluate_search_quality(search_function, test_queries):
    """
    Evaluate search function using Hit Rate and MRR (Mean Reciprocal Rank).
    
    Args:
        search_function: Function that takes a query and returns search results
        test_queries: List of (query, expected_docs) tuples
    
    Returns:
        List of per-query dicts with 'query', 'hit', and 'mrr' keys
    """
    results = []
    
    for query, expected_docs in test_queries:
        search_results = search_function(query, num_results=5)
        
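        # Hit: at least one expected document appears in the results
        relevant_found = any(doc['filename'] in expected_docs for doc in search_results)
        
        # Reciprocal rank of the first relevant result (0 if none is found);
        # averaging the 'mrr' values over all queries gives the MRR
        mrr = 0
        for i, doc in enumerate(search_results):
            if doc['filename'] in expected_docs:
                mrr = 1 / (i + 1)
                break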
            
        results.append({
            'query': query,
            'hit': relevant_found,
            'mrr': mrr
        })
    
    return results

# Example usage (uncomment to use):
# test_queries = [
#     ("Can I join the course now?", ["003_3f1424af17_course-can-i-still-join-the-course-after-the-start.md"]),
#     # Add more test queries...
# ]
# search_results = evaluate_search_quality(faq_index.search, test_queries)
# df_search = pd.DataFrame(search_results)
# print(f"Hit Rate: {df_search['hit'].mean():.2%}")
# print(f"Mean MRR: {df_search['mrr'].mean():.3f}")
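
As a small worked example of the two metrics, the sketch below scores hypothetical ranked results by hand. `reciprocal_rank` is an assumed helper and the filenames are invented. It mirrors the per-query logic in `evaluate_search_quality` without needing a search index.

```python
def reciprocal_rank(ranked_filenames, expected_docs):
    """Return 1 / rank of the first expected document, or 0.0 if none appears."""
    for position, filename in enumerate(ranked_filenames, start=1):
        if filename in expected_docs:
            return 1 / position
    return 0.0


# Hypothetical ranked results for three queries, with the expected documents.
cases = [
    (["a.md", "b.md", "c.md"], {"a.md"}),  # relevant doc ranked first
    (["a.md", "b.md", "c.md"], {"c.md"}),  # relevant doc ranked third
    (["a.md", "b.md", "c.md"], {"z.md"}),  # relevant doc not retrieved
]

ranks = [reciprocal_rank(ranked, expected) for ranked, expected in cases]
hit_rate = sum(rank > 0 for rank in ranks) / len(ranks)
mrr = sum(ranks) / len(ranks)

print(f"Hit Rate: {hit_rate:.2%}")
print(f"MRR: {mrr:.3f}")
```

Hit rate only records whether a relevant document was retrieved at all, while MRR also rewards ranking it near the top, so the two can diverge when relevant documents are found but ranked low.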