# Agent Evaluation

This notebook demonstrates three agent evaluation approaches (black box, single step, and trajectory) by defining test cases and scoring an agent's runs against them.

## What we'll learn:

- Metrics
- Test Case
- Evaluator
- Approaches
    - Black box
    - Single step
    - Trajectory
- Evaluation Report

## Setup
First, let's import the necessary libraries:

In [1]:
# Only needed in the Udacity workspace: use pysqlite3 in place of the standard sqlite3 module

import importlib.util
import sys

# Check if 'pysqlite3' is available before importing
if importlib.util.find_spec("pysqlite3") is not None:
    import pysqlite3
    sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import time
from typing import List, Dict
from dotenv import load_dotenv

from lib.agents import Agent, AgentState
from lib.state_machine import Run
from lib.messages import AIMessage
from lib.tooling import tool
from lib.evaluation import TestCase, AgentEvaluator, EvaluationResult

In [3]:
load_dotenv()

True

## Your Custom Tools

Design and implement tools that will showcase different aspects of agent behavior. Consider creating tools that:

- Require multi-step reasoning
- Handle different data types and formats
- Interact with external services or APIs
- Process complex inputs
- Have potential failure modes to test error handling


In [4]:
@tool
def get_games(num_games: int = 1, top: bool = True) -> List[Dict]:
    """
    Return the N highest- or lowest-scoring games.

    Args:
        num_games (int): Number of games to return (default is 1)
        top (bool): If True, return the top games; otherwise return the bottom games (default is True)
    """
    data = [
        {"Game": "The Legend of Zelda: Breath of the Wild", "Platform": "Switch", "Score": 98},
        {"Game": "Super Mario Odyssey", "Platform": "Switch", "Score": 97},
        {"Game": "Metroid Prime", "Platform": "GameCube", "Score": 97},
        {"Game": "Super Smash Bros. Brawl", "Platform": "Wii", "Score": 93},
        {"Game": "Mario Kart 8 Deluxe", "Platform": "Switch", "Score": 92},
        {"Game": "Fire Emblem: Awakening", "Platform": "3DS", "Score": 92},
        {"Game": "Donkey Kong Country Returns", "Platform": "Wii", "Score": 87},
        {"Game": "Luigi's Mansion 3", "Platform": "Switch", "Score": 86},
        {"Game": "Pikmin 3", "Platform": "Wii U", "Score": 85},
        {"Game": "Animal Crossing: New Leaf", "Platform": "3DS", "Score": 88}
    ]
    # Sort by Score: descending when top=True, ascending otherwise
    sorted_games = sorted(data, key=lambda x: x['Score'], reverse=top)
    return sorted_games[:num_games]
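
The checklist above mentions tools with potential failure modes. As a minimal sketch of that idea, the hypothetical `get_games_by_platform` function below validates its inputs and raises `ValueError` on bad ones. The function name, the `min_score` parameter, and the three-row `GAMES` subset are assumptions made for illustration. The `@tool` decorator is left off so the snippet runs on its own, and how the agent library surfaces a tool exception to the model is not shown in this notebook.

```python
from typing import Dict, List

# Hypothetical subset of the dataset used by get_games, kept small for illustration.
GAMES: List[Dict] = [
    {"Game": "The Legend of Zelda: Breath of the Wild", "Platform": "Switch", "Score": 98},
    {"Game": "Metroid Prime", "Platform": "GameCube", "Score": 97},
    {"Game": "Fire Emblem: Awakening", "Platform": "3DS", "Score": 92},
]


def get_games_by_platform(platform: str, min_score: int = 0) -> List[Dict]:
    """Return games for one platform, raising on inputs the data cannot satisfy."""
    # Failure mode 1: a score threshold outside the 0-100 range.
    if not 0 <= min_score <= 100:
        raise ValueError(f"min_score must be between 0 and 100, got {min_score}")

    # Failure mode 2: a platform that does not appear in the dataset.
    known = {game["Platform"].lower() for game in GAMES}
    key = platform.strip().lower()
    if key not in known:
        raise ValueError(f"Unknown platform {platform!r}; expected one of {sorted(known)}")

    return [
        game for game in GAMES
        if game["Platform"].lower() == key and game["Score"] >= min_score
    ]
```

A test case could then ask about a platform that is missing from the data and check whether the agent reports the problem instead of inventing an answer.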

## Develop your Agent

In [5]:
agent = Agent(
    model_name="gpt-4o-mini",
    tools=[get_games],
    instructions="You can bring insights about a game dataset based on users' questions",
)

## Design Your Test Cases

In [6]:
evaluator = AgentEvaluator()

In [7]:
test_cases = [
    TestCase(
        id="game_query_1",
        description="Find the best game in the dataset",
        user_query="What's the best game in the dataset?",
        expected_tools=["get_games"],
        reference_answer="The Legend of Zelda: Breath of the Wild with score 98",
        max_steps=4
    ),
]
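
A single test case only exercises the default arguments of `get_games`. The sketch below shows two more cases that would cover `top=False` and `num_games > 1`. `StubTestCase` is a stand-in dataclass that mirrors only the fields used in the cell above, so the snippet runs without `lib.evaluation`; in the notebook you would build real `TestCase` objects instead. The ids and queries are hypothetical, while the reference answers follow from the dataset and sort order in `get_games`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StubTestCase:
    """Stand-in for lib.evaluation.TestCase; mirrors only the fields used in this notebook."""
    id: str
    description: str
    user_query: str
    expected_tools: List[str]
    reference_answer: str
    max_steps: int


extra_test_cases = [
    # Exercises top=False: the lowest score in the dataset is 85.
    StubTestCase(
        id="game_query_2",
        description="Find the lowest-scoring game in the dataset",
        user_query="Which game has the lowest score?",
        expected_tools=["get_games"],
        reference_answer="Pikmin 3 with score 85",
        max_steps=4,
    ),
    # Exercises num_games=3, including a tie at 97.
    StubTestCase(
        id="game_query_3",
        description="List the three best games in the dataset",
        user_query="What are the top 3 games?",
        expected_tools=["get_games"],
        reference_answer=(
            "The Legend of Zelda: Breath of the Wild (98), "
            "Super Mario Odyssey (97), Metroid Prime (97)"
        ),
        max_steps=4,
    ),
]
```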

## Run Your Evaluation Experiments

In [8]:
for test_case in test_cases:
    print(f"\n=== Evaluating Test Case: {test_case.id} ===")
    print(f"Description: {test_case.description}")
    print(f"Query: {test_case.user_query}")
    
    # Run the agent
    start_time = time.time()
    print("\nWorkflow:")
    agent.memory.reset()
    run_object: Run = agent.invoke(test_case.user_query)
    execution_time = time.time() - start_time
    
    # Get final state and response
    final_state: AgentState = run_object.get_final_state()
    if final_state and final_state.get("messages"):
        # Find the last AI message as the final response
        final_response = ""
        for msg in reversed(final_state["messages"]):
            if isinstance(msg, AIMessage) and msg.content:
                final_response = msg.content
                break
        
        total_tokens = final_state.get("total_tokens", 0)
        
        # Evaluate using all three methods
        print("\n--- Black Box (Final Response) Evaluation ---")
        black_box_eval: EvaluationResult = evaluator.evaluate_final_response(
            test_case, final_response, execution_time, total_tokens
        )
        print(f"Overall Score: {black_box_eval.overall_score:.2f}")
        print(f"Task Completed: {black_box_eval.task_completion.task_completed}")
        print(f"Feedback: {black_box_eval.feedback}")
        
        print("\n--- Single Step Evaluation ---")
        step_eval: EvaluationResult = evaluator.evaluate_single_step(
            final_state["messages"], test_case.expected_tools
        )
        print(f"Overall Score: {step_eval.overall_score:.2f}")
        print(f"Correct Tool Selected: {step_eval.tool_interaction.correct_tool_selected}")
        print(f"Feedback: {step_eval.feedback}")
        
        print("\n--- Trajectory Evaluation ---")
        traj_eval: EvaluationResult = evaluator.evaluate_trajectory(test_case, run_object)
        print(f"Overall Score: {traj_eval.overall_score:.2f}")
        print(f"Steps Taken: {traj_eval.task_completion.steps_taken}")
        print(f"Total Tokens: {traj_eval.system_metrics.total_tokens}")
        print(f"Execution Time: {traj_eval.system_metrics.execution_time:.2f}s")
        print(f"Estimated Cost: ${traj_eval.system_metrics.cost_estimate:.6f}")
        print(f"Feedback: {traj_eval.feedback}")
        
    else:
        print("ERROR: No final state or messages found")


=== Evaluating Test Case: game_query_1 ===
Description: Find the best game in the dataset
Query: What's the best game in the dataset?

Workflow:
[StateMachine] Starting: __entry__
[StateMachine] Executing step: message_prep
[StateMachine] Executing step: llm_processor
[StateMachine] Executing step: tool_executor
[StateMachine] Executing step: llm_processor
[StateMachine] Terminating: __termination__

--- Black Box (Final Response) Evaluation ---
Overall Score: 1.00
Task Completed: True
Feedback: The agent response successfully identifies the best game in the dataset, providing the correct title and score, which fully answers the user's query. The format is appropriate as it clearly states the game title and score. Additionally, the response follows the implicit instruction to provide a definitive answer regarding the best game.

--- Single Step Evaluation ---
Overall Score: 1.00
Correct Tool Selected: True
Feedback: Selected tools: ['get_games'], Expected: ['get_games']

--- Trajector

## Reflection and Iteration

Based on your evaluation results:

1. **Identify Patterns**: What trends do you see in your agent's performance?
2. **Spot Weaknesses**: Where does your agent struggle most?
3. **Recognize Strengths**: What does your agent do particularly well?
4. **Iterate**: Modify your tools, test cases, or agent configuration and re-evaluate
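
Once you have several test cases, spotting patterns and weaknesses is easier with a small summary than by reading each printout. The sketch below assumes you have collected each `overall_score` into a plain dictionary keyed by test case id and evaluation method. The `summarize` helper, the method labels, and the `0.8` threshold are all hypothetical, and the second row of scores is a placeholder, not a measured result.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def summarize(
    scores: Dict[str, Dict[str, float]], threshold: float = 0.8
) -> Tuple[Dict[str, float], List[Tuple[str, str, float]]]:
    """Average scores per evaluation method and list results below the threshold."""
    per_method: Dict[str, List[float]] = defaultdict(list)
    weak: List[Tuple[str, str, float]] = []
    for case_id, by_method in scores.items():
        for method, score in by_method.items():
            per_method[method].append(score)
            if score < threshold:
                weak.append((case_id, method, score))
    averages = {method: mean(values) for method, values in per_method.items()}
    # Lowest scores first, so the weakest spots are at the top of the list.
    return averages, sorted(weak, key=lambda item: item[2])


scores = {
    # Scores printed for game_query_1 in the run above.
    "game_query_1": {"black_box": 1.0, "single_step": 1.0},
    # Placeholder values for a hypothetical second test case.
    "game_query_2": {"black_box": 0.5, "single_step": 1.0},
}
averages, weak_spots = summarize(scores)
print(averages)
print(weak_spots)
```

Comparing the averages before and after each change to your tools or instructions gives you a simple way to tell whether an iteration helped.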
