# Testing ADK Agents: From Simple LLMs to Agentic Behavior


By the end of this notebook, you will be able to:

1. Build testable agents using Google's ADK (Agent Development Kit)
2. Test agent tool selection (which tools the agent chooses to call)
3. Test tool parameters (accuracy of extracted information)
4. Test multi-step reasoning (tool call sequences)
5. Test edge cases (invalid inputs, ambiguous requests)

---

## 🎯 Context: Why Agent Testing is Different

### Quick Recap: Notebook 01

In the previous lesson, you learned to:
- ✅ Write automated tests with pytest
- ✅ Test LLM text outputs for factual correctness
- ✅ Validate structured outputs with Pydantic
- ✅ Use parameterized tests

### What's Different with Agents?

**Simple LLM (Notebook 01):**
```
User: "What port does SSH use?"
LLM: "Port 22"
Test: Check if response contains "22" ✅
```

**Agent with Tools (This Notebook):**
```
User: "What's the status of ticket #5678?"
Agent: Thinks... I need to look up this ticket
       Calls: lookup_ticket("5678")
       Tool returns: {ticket_id: 5678, status: "In Progress", ...}
       Agent: "Ticket #5678 is currently In Progress..."
       
Test: Did agent call the right tool? ✅
Test: Did agent extract correct ticket ID? ✅
Test: Did agent use the tool results properly? ✅
```

### Key Differences

| Simple LLM Testing | Agent Testing |
|-------------------|---------------|
| Test final text output | Test tool selection & parameters |
| Single-step response | Multi-step reasoning |
| Stateless | Stateful (tool results affect next steps) |
| Straightforward assertions | Test tool call sequences |
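
As a minimal sketch of the contrast in this table, the snippet below asserts first on a plain text answer and then on a recorded tool call. Both `llm_text` and `agent_result` are hand-written stand-ins (assumptions for illustration; no model is called). `agent_result` mimics the shape of the result dictionary used by the tests later in this notebook.

```python
# Hand-written stand-ins for illustration; nothing here calls a model.
llm_text = "SSH uses port 22 by default."
agent_result = {
    "tool_calls": [
        {"name": "lookup_ticket", "parameters": {"ticket_id": "5678"}}
    ],
    "response": "Ticket #5678 is currently In Progress.",
}

# Simple LLM testing: assert on the final text only.
assert "22" in llm_text

# Agent testing: assert on behavior (which tool, with which arguments).
first_call = agent_result["tool_calls"][0]
assert first_call["name"] == "lookup_ticket"
assert first_call["parameters"]["ticket_id"] == "5678"
```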

### What We're Building Today

An **IT Support Agent** with these tools:
- 🎫 `lookup_ticket(ticket_id)` - Retrieve ticket details
- 📚 `search_knowledge_base(query)` - Find help articles
- 🔍 `check_system_status(service_name)` - Check if systems are up

**Testing scenarios:**
- User asks about a ticket → Agent calls `lookup_ticket` with correct ID
- User has a problem → Agent searches KB for solution
- User asks multi-step question → Agent calls multiple tools in order

---

Let's get started! 🚀

## 1. Environment Setup

First, we'll install the required packages.

In [None]:
# Install required packages
!pip install -q \
  google-adk==1.17.0 \
  litellm==1.79.0 \
  openai==1.109.1 \
  python-dotenv==1.1.1 \
  nest-asyncio==1.6.0 \
  deprecated==1.3.1 \
  google-genai==1.46.0 \
  pydantic==2.11.10

In [None]:
# Import required libraries
import os
from openai import OpenAI
from pydantic import BaseModel, Field
import json
import asyncio
from typing import List, Optional, Dict, Any
import nest_asyncio

# Enable nested event loops (required for Colab)
nest_asyncio.apply()

# Core ADK imports
from google.adk.agents import LlmAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.models.lite_llm import LiteLlm
from google.genai import types

print("✅ All imports successful!")

### API Key Setup

In [None]:
# Configure OpenAI API key
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:
    # Not in Colab, or the secret is missing or inaccessible: fall back to a prompt
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate before exporting: os.environ rejects None values
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Model configuration
OPENAI_MODEL = "gpt-5-nano"
print(f'🤖 Selected Model: {OPENAI_MODEL}')

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

## 2. Creating a Testable IT Support Agent

Let's build a simple IT support agent with three tools that handle common helpdesk tasks:

  1. `lookup_ticket(ticket_id: str)` - **Ticket Database Lookup**
  - Purpose: Retrieves ticket information from a mock database
  - What it returns: Ticket details including ID, status, priority, user,
  issue description, and assigned team
  - Mock data: Contains 3 pre-defined tickets (5678, 1234, 9999) with
  different statuses and priorities; unknown IDs return an error dictionary
  - ✅ We will verify that the agent correctly extracts ticket IDs from
  user queries and calls the right tool

  2. `search_knowledge_base(query: str)` - **Knowledge Base Search**
  - Purpose: Searches IT documentation for help articles
  - What it returns: Articles whose topic keyword appears in the query
  - Mock data: Contains articles on 4 topics: passwords, email, VPN, and
  printers
  - ✅ We will check that the agent identifies the user's problem type
  and searches with appropriate keywords

  3. `check_system_status(service_name: str)` - **System Status Check**
  - Purpose: Reports the operational status of a service
  - What it returns: Service name, status (operational, degraded, or down),
  and uptime; unknown services return the status "unknown"
  - Mock data: Contains 5 services (email, VPN, database, file server, web
  portal)
  - ✅ We will test that the agent recognizes service availability questions
  and passes the right service name

⭐ Key point: These are mock implementations - they return hardcoded data
  instead of querying real systems. This makes our tests fast, reliable, and
  safe to run repeatedly without worrying about costs or side effects.


Let's implement them:




In [None]:
# Tool 1: Lookup Ticket
def lookup_ticket(ticket_id: str) -> dict:
    """
    Look up details for a support ticket.

    Args:
        ticket_id: The ticket ID to look up (e.g., "5678")

    Returns:
        Dictionary with ticket details
    """
    mock_tickets = {
        "5678": {
            "ticket_id": "5678",
            "status": "In Progress",
            "priority": "High",
            "user": "Alice Johnson",
            "issue": "Cannot access email",
            "assigned_to": "Tech Support Team"
        },
        "1234": {
            "ticket_id": "1234",
            "status": "Resolved",
            "priority": "Medium",
            "user": "Bob Smith",
            "issue": "Printer not working",
            "assigned_to": "Hardware Team"
        },
        "9999": {
            "ticket_id": "9999",
            "status": "Open",
            "priority": "Critical",
            "user": "Charlie Brown",
            "issue": "Server down",
            "assigned_to": "Infrastructure Team"
        }
    }

    if ticket_id in mock_tickets:
        return mock_tickets[ticket_id]
    else:
        return {"error": f"Ticket {ticket_id} not found"}


# Tool 2: Search Knowledge Base
def search_knowledge_base(query: str) -> dict:
    """
    Search the IT knowledge base for help articles.

    Args:
        query: Search query (e.g., "how to reset password")

    Returns:
        Dictionary with search results
    """
    mock_kb = {
        "password": [
            {"title": "How to Reset Your Password", "article_id": "KB001"},
            {"title": "Password Requirements", "article_id": "KB002"}
        ],
        "email": [
            {"title": "Troubleshooting Email Access", "article_id": "KB010"},
            {"title": "Email Configuration Guide", "article_id": "KB011"}
        ],
        "vpn": [
            {"title": "VPN Setup Instructions", "article_id": "KB020"},
            {"title": "VPN Connection Issues", "article_id": "KB021"}
        ],
        "printer": [
            {"title": "Printer Offline Solutions", "article_id": "KB030"},
            {"title": "How to Install Printer Drivers", "article_id": "KB031"}
        ]
    }

    query_lower = query.lower()
    results = []

    for keyword, articles in mock_kb.items():
        if keyword in query_lower:
            results.extend(articles)

    if results:
        return {"query": query, "results": results, "count": len(results)}
    else:
        return {"query": query, "results": [], "count": 0}


# Tool 3: Check System Status
def check_system_status(service_name: str) -> dict:
    """
    Check the operational status of a service or system.

    Args:
        service_name: Name of the service (e.g., "email", "vpn", "database")

    Returns:
        Dictionary with service status
    """
    mock_status = {
        "email": {"service": "email", "status": "operational", "uptime": "99.9%"},
        "vpn": {"service": "vpn", "status": "operational", "uptime": "100%"},
        "database": {"service": "database", "status": "degraded", "uptime": "95.2%"},
        "file_server": {"service": "file_server", "status": "down", "uptime": "0%"},
        "web_portal": {"service": "web_portal", "status": "operational", "uptime": "99.5%"}
    }

    service_lower = service_name.lower().replace(" ", "_")

    if service_lower in mock_status:
        return mock_status[service_lower]
    else:
        return {"service": service_name, "status": "unknown"}


print("✅ Tools defined successfully!")

### Creating the ADK Agent
The code below creates and configures an ADK agent with three key parts:

  1. The LLM Model Configuration: connects ADK to OpenAI's GPT model through LiteLLM
  2. The Agent Definition
  3. The Instruction Prompt: here we teach the agent when and how to use tools

- The instructions are detailed:
  - Agent behavior is controlled primarily by the instruction prompt
  - Clear instructions = predictable behavior = easier testing
  - In testing, we verify the agent follows these rules

- We include negative examples as well (when NOT to use tools):
  - Without them, agents tend to over-use tools (calling tools for every
  question)
  - Clear boundaries make testing easier: "Did the agent correctly decide NOT
  to use a tool?"



In [None]:
# Create the LiteLlm model instance for OpenAI
llm_model = LiteLlm(
    model=f"openai/{OPENAI_MODEL}",
    api_key=OPENAI_API_KEY
)

# Create the IT Support Agent
it_support_agent = LlmAgent(
    name="it_support_agent",
    model=llm_model,
    description="An IT support agent that helps users with tickets, knowledge base searches, and system status checks",
    instruction="""You are an IT support agent. You help users with IT issues by using the appropriate tools.

WHEN TO USE TOOLS:
1. Ticket questions (e.g., "ticket 5678", "check ticket status"): call lookup_ticket tool
2. HOW-TO questions or help with specific problems: call search_knowledge_base tool
3. Service status questions (e.g., "is email working?"): call check_system_status tool

WHEN NOT TO USE TOOLS:
- General questions about concepts (e.g., "What is IT support?", "What is a firewall?")
- Greetings or small talk
- Questions that don't require looking up specific information

EXAMPLES OF TOOL USAGE:
- "What's ticket 5678 status?" -> call lookup_ticket("5678")
- "How do I fix VPN issues?" -> call search_knowledge_base("VPN issues")
- "Is email working?" -> call check_system_status("email")

EXAMPLES OF NO TOOL NEEDED:
- "What is IT support?" -> Answer directly without tools
- "Hello" -> Respond without tools
- "What does a firewall do?" -> Answer directly without tools

For questions requiring specific current information (tickets, KB articles, system status), ALWAYS use the appropriate tool.""",
    tools=[lookup_ticket, search_knowledge_base, check_system_status]
)

print("✅ IT Support Agent created!")
print(f"Agent has {len(it_support_agent.tools)} tools available")

## 3. Helper Function for Testing
To test our agent, we need to run it with different user messages and inspect which tools it called. The `run_agent_and_get_tools()` helper function handles all the ADK setup (creating sessions, runners, formatting messages) and extracts the key information we care about for testing:
- which tools were called
- what parameters were used
- what the final response was


In [None]:
async def run_agent_and_get_tools(user_message: str, session_id: str = "test_session"):
    """
    Run the agent and capture tool calls.

    Returns:
        dict: {
            'tool_calls': list of {name, parameters},
            'tool_count': int,
            'response': str
        }
    """
    # Create session
    session_service = InMemorySessionService()
    user_id = "test_user"

    await session_service.create_session(
        app_name="it_support_test",
        user_id=user_id,
        session_id=session_id,
        state={}
    )

    # Create runner
    runner = Runner(
        app_name="it_support_test",
        agent=it_support_agent,
        session_service=session_service
    )

    # Format message and run
    content = types.Content(role='user', parts=[types.Part(text=user_message)])
    events = runner.run_async(user_id=user_id, session_id=session_id, new_message=content)

    # Collect tool calls and response
    tool_calls = []
    final_response = ""

    async for event in events:
        # Use get_function_calls() to extract tool calls
        if hasattr(event, 'get_function_calls'):
            function_calls = event.get_function_calls()
            if function_calls:
                for func_call in function_calls:
                    tool_calls.append({
                        'name': func_call.name,
                        'parameters': func_call.args
                    })

        # Get final response (skip events that carry no text content)
        if event.is_final_response() and event.content and event.content.parts:
            final_response = event.content.parts[0].text or ""

    return {
        'tool_calls': tool_calls,
        'tool_count': len(tool_calls),
        'response': final_response
    }

print("✅ Helper function defined!")

## 4. Testing Tool Selection
  
  The first and most fundamental test: **Does the agent choose the right tool?**

  This is critical because everything else depends on it. If the agent picks
  the wrong tool (searching the knowledge base when it should look up a
  ticket), nothing else matters—the user gets the wrong help, even if the
  parameters are perfect.

### Pattern: Test Tool Selection

```
1. Give agent a task requiring a specific tool (e.g., "Check ticket 5678")
2. Run the agent with that message
3. Assert that the expected tool was called
```
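
The three steps above could be wrapped in a small reusable assertion. This is only a sketch: `assert_tool_called` and `sample_result` are hypothetical names, and `sample_result` is a hand-written stand-in for the dictionary that the `run_agent_and_get_tools()` helper returns.

```python
def assert_tool_called(result: dict, expected_tool: str) -> None:
    """Fail with a readable message if expected_tool was never called."""
    tool_names = [call["name"] for call in result["tool_calls"]]
    assert expected_tool in tool_names, (
        f"Expected '{expected_tool}' to be called, got: {tool_names}"
    )


# Stand-in for a real agent run (assumption for illustration).
sample_result = {
    "tool_calls": [
        {"name": "lookup_ticket", "parameters": {"ticket_id": "5678"}}
    ],
    "tool_count": 1,
    "response": "Ticket 5678 is In Progress.",
}

assert_tool_called(sample_result, "lookup_ticket")
```

Keeping the assertion in one place means every tool selection test reports failures in the same format.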

### Test 1: Agent calls lookup_ticket tool

In [None]:
# Test: When user asks about a ticket, agent should call lookup_ticket
user_message = "Can you check the status of ticket 5678?"
result = await run_agent_and_get_tools(user_message, session_id="test_1")

print(f"User: {user_message}")
print(f"\nTools called: {result['tool_count']}")
print(f"Tool names: {[tc['name'] for tc in result['tool_calls']]}")
print(f"\nAgent response: {result['response'][:200]}...\n")

# Assertions
assert result['tool_count'] > 0, "❌ Agent should have called at least one tool"
tool_names = [tc['name'] for tc in result['tool_calls']]
assert 'lookup_ticket' in tool_names, f"❌ Expected 'lookup_ticket', got: {tool_names}"

print("✅ TEST PASSED: Agent correctly called lookup_ticket")

### Test 2: Agent calls knowledge base search

In [None]:
# Test: When user has a problem, agent should search knowledge base
user_message = "How do I reset my password?"
result = await run_agent_and_get_tools(user_message, session_id="test_2")

print(f"User: {user_message}")
print(f"\nTools called: {result['tool_count']}")
print(f"Tool names: {[tc['name'] for tc in result['tool_calls']]}")
print(f"\nAgent response: {result['response'][:200]}...\n")

# Assertions
tool_names = [tc['name'] for tc in result['tool_calls']]
assert 'search_knowledge_base' in tool_names, f"❌ Expected 'search_knowledge_base', got: {tool_names}"

print("✅ TEST PASSED: Agent correctly called search_knowledge_base")

### Test 3: Your Turn - Test System Status Check

Now it's your turn to write a test!

Your task:
  - Write a test that verifies the agent calls `check_system_status` when the
  user asks about service availability
  - Use the pattern from Tests 1 and 2 above
  - Test with a message like: "Is the email service working?"

  What to assert:
  1. The agent called at least one tool
  2. The tool called was `check_system_status`

In [None]:
# YOUR CODE HERE

### Key Insight: Testing Behavior, Not Text

Notice what we're testing:
- ✅ **We test**: Which tool was called
- ❌ **We don't test**: The exact text of the response

Why? Because:
1. Tool calls are **structured** (a tool name plus arguments you can compare exactly)
2. Text responses are **variable** (the same answer can be worded many ways)
3. Tool calls show the agent **understood** the task
4. Tool calls are what **actually matter** for functionality

## 5. Testing Tool Parameters

Now let's test whether the agent extracts the correct parameters from the user's message.

**Example:**
- User: "Check ticket 5678"
- Agent calls: `lookup_ticket("5678")` ✅
- Agent calls: `lookup_ticket("1234")` ❌ Wrong ticket!

Both scenarios call the right tool (lookup_ticket), but only the first one is actually useful. Parameter accuracy is essential for correct functionality.

  Agents must handle:
  - 🔢 Extracting IDs from natural language ("ticket 5678" → "5678")
  - 🔍 Formulating search queries ("How to reset password?" → "reset password"
  or "password reset")
  - 🏷️ Identifying service names ("Is email working?" → "email")
  - 🗣️ Noisy input ("Hey so like check ticket 5678 please?" → "5678")
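
Exact string comparison can be stricter than intended: a model might pass `5678` as an integer or keep a leading `#`. One option, sketched below, is to normalize the parameter before comparing. `normalize_ticket_id` is a hypothetical helper, and whether to normalize or to fail strictly is a design decision for your own test suite.

```python
def normalize_ticket_id(raw) -> str:
    """Return the ticket ID as a bare string, e.g. '#5678' or 5678 -> '5678'."""
    # str() covers integer arguments; strip() and lstrip('#') remove
    # surrounding whitespace and a leading hash sign.
    return str(raw).strip().lstrip("#").strip()


# Each of these variants would compare equal to "5678" after normalization.
for variant in ["5678", 5678, "#5678", " 5678 "]:
    assert normalize_ticket_id(variant) == "5678"
```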



### Test 4: Agent extracts ticket ID correctly

In [None]:
# Test: Agent correctly extracts ticket ID from user message
user_message = "What's the status of ticket 1234?"
result = await run_agent_and_get_tools(user_message, session_id="test_4")

print(f"User: {user_message}")
print(f"\nTools called: {[tc['name'] for tc in result['tool_calls']]}")

# Find the lookup_ticket call
ticket_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'lookup_ticket']
assert len(ticket_calls) > 0, "❌ Agent should have called lookup_ticket"

# Check the ticket_id parameter
ticket_id = ticket_calls[0]['parameters'].get('ticket_id')
print(f"Extracted ticket_id: {ticket_id}\n")

assert ticket_id == "1234", f"❌ Expected ticket_id '1234', but got: {ticket_id}"

print("✅ TEST PASSED: Agent correctly extracted ticket ID")

### Test 5: Agent extracts search query correctly

In [None]:
# Test: Agent correctly extracts and formats search query
user_message = "How do I fix VPN connection issues?"
result = await run_agent_and_get_tools(user_message, session_id="test_5")

print(f"User: {user_message}")
print(f"\nTools called: {[tc['name'] for tc in result['tool_calls']]}")

# Find the KB search call
kb_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'search_knowledge_base']
assert len(kb_calls) > 0, "❌ Agent should have called search_knowledge_base"

# Check that query contains relevant keywords
query = kb_calls[0]['parameters'].get('query', '').lower()
print(f"Search query: {query}\n")

assert 'vpn' in query, f"❌ Query should contain 'vpn', but got: {query}"

print("✅ TEST PASSED: Agent correctly extracted search query")

### Test 6: Your Turn - Test Parameter Extraction for Service Names

Now test if the agent can correctly extract service names from user queries!

  Your task:
  - Write a test that verifies the agent extracts the correct service name when
   the user asks about system status
  - Use a message like: "Can you tell me if the VPN is up and running?"

  What to assert:
  1. The agent called check_system_status
  2. The service_name parameter contains "vpn"

In [None]:
# YOUR CODE HERE

## 6. Testing Multi-Step Reasoning

So far, we've tested agents that use one tool at a time. But real-world tasks often require multiple steps where the agent must:
  1. Break down a complex request into smaller steps
  2. Call multiple tools in a logical sequence
  3. Use information from one tool call to inform the next

In production, users don't always ask simple, single-step questions. They ask things like:
  - "Is the database down? If so, find affected tickets" (status check → ticket
   search)
  - "Help me troubleshoot my VPN, starting with my open tickets" (lookup →
  search → status)


 What makes multi-step reasoning challenging:

| Challenge            | Why It's Hard |
|----------------------|-------------------------------------------------------------|
| Task decomposition   | Agent must understand that one question requires multiple actions |
| Correct sequencing   | Tools must be called in the right order (can't search before knowing what to search for) |
| Context maintenance  | Agent must remember results from step 1 when executing step 2 |
| Knowing when to stop | Agent needs to recognize when enough tools have been called |

**Example:**
Imagine a user asks "Check ticket 5678 and find solutions for the issue"

```
User: "Check ticket 5678 and find solutions for the issue"

Step 1: Agent calls lookup_ticket("5678")
        Returns: {issue: "Cannot access email"}
        
Step 2: Agent calls search_knowledge_base("email access")
        Returns: [KB articles about email]
        
Step 3: Agent synthesizes both results into helpful response
```
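
Beyond call order, you may also want to check that step 2 actually used what step 1 returned. The sketch below looks for a shared keyword between the ticket's issue and the knowledge base query. `query_uses_ticket_context` and `recorded_calls` are hypothetical names, the recorded calls are hand-written, and the word matching is deliberately naive (whitespace splitting, no punctuation handling).

```python
def query_uses_ticket_context(calls: list, issue: str, min_word_len: int = 4) -> bool:
    """True if any KB query shares a meaningful word with the ticket issue."""
    # Ignore very short words such as "a" or "to".
    issue_words = {w.lower() for w in issue.split() if len(w) >= min_word_len}
    for call in calls:
        if call["name"] != "search_knowledge_base":
            continue
        query_words = set(call["parameters"].get("query", "").lower().split())
        if issue_words & query_words:
            return True
    return False


# Hand-written stand-in for the calls recorded during an agent run.
recorded_calls = [
    {"name": "lookup_ticket", "parameters": {"ticket_id": "5678"}},
    {"name": "search_knowledge_base", "parameters": {"query": "email access"}},
]

assert query_uses_ticket_context(recorded_calls, "Cannot access email")
```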



### Test 7: Multi-step reasoning (ticket then KB)


Let's create a test that verifies whether the agent can break down a complex request into a logical sequence:
- first looking up the ticket to understand the problem
- then searching for solutions based on that information

Before we run the test, let's see what ticket 5678 contains so we understand
  what the agent is working with:

```
 "5678": {
      "ticket_id": "5678",
      "status": "In Progress",
      "priority": "High",
      "user": "Alice Johnson",
      "issue": "Cannot access email",
      "assigned_to": "Tech Support Team"
  }
```

Now let's test the agent's multi-step reasoning:

In [None]:
# Test: Agent performs multi-step reasoning
# 1. Looks up ticket to understand the issue
# 2. Searches KB for solutions
user_message = "Check ticket 5678 and help me find solutions for the issue"
result = await run_agent_and_get_tools(user_message, session_id="test_7")

print(f"User: {user_message}")
print(f"\nTools called: {result['tool_count']}")
tool_sequence = [tc['name'] for tc in result['tool_calls']]
print(f"Tool sequence: {tool_sequence}\n")

# Assert multiple tools were called
assert result['tool_count'] >= 2, f"❌ Expected at least 2 tool calls, but got: {result['tool_count']}"

# Assert that both tools were called
assert 'lookup_ticket' in tool_sequence, "❌ Agent should look up the ticket"
assert 'search_knowledge_base' in tool_sequence, "❌ Agent should search KB for solutions"

# Assert correct order: ticket lookup should come before KB search
ticket_index = tool_sequence.index('lookup_ticket')
kb_index = tool_sequence.index('search_knowledge_base')
assert ticket_index < kb_index, f"❌ Agent should look up ticket before searching KB, but order was: {tool_sequence}"

print("✅ TEST PASSED: Agent correctly performed multi-step reasoning")

Let's observe the result. In our run, the agent called 3 tools in this sequence: `lookup_ticket` → `search_knowledge_base` → `check_system_status`. Your run may differ, since the model's choices can vary.

The agent showed **proactive reasoning**! After discovering the ticket is about email access issues, it went a step further to check whether the email service itself is down. This is sensible troubleshooting: it determines whether the issue is user-specific or system-wide.


  Agents can exhibit **emergent behavior** beyond the minimum required steps.
  This can be:
  - ✅ **Beneficial:** Smarter, more thorough assistance
  - ⚠️ **Unexpected:** More API calls, higher costs, potential for errors

  As a tester, you need to:
  - ✅ Test for **minimum expected behavior** (the 2 core tools)
  - ✅ Be aware of **additional tool calls** and understand why they happen
  - ✅ Decide if extra steps are desirable or need to be constrained
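
One way to test for the minimum expected behavior while tolerating extra calls is an ordered-subsequence check, optionally paired with an upper bound on the number of calls. This is a sketch: `contains_in_order` is a hypothetical helper and `MAX_TOOL_CALLS` is an arbitrary illustrative limit, not a value from ADK.

```python
def contains_in_order(actual: list[str], expected: list[str]) -> bool:
    """True if `expected` appears within `actual` in order, extras allowed."""
    remaining = iter(actual)
    # `name in remaining` advances the iterator, so each expected name
    # must be found after the previous one.
    return all(name in remaining for name in expected)


MAX_TOOL_CALLS = 4  # assumption: tune to your own cost and latency budget

tool_sequence = ["lookup_ticket", "search_knowledge_base", "check_system_status"]

# Core behavior: ticket lookup happens before the KB search.
assert contains_in_order(tool_sequence, ["lookup_ticket", "search_knowledge_base"])
# Constraint: extra steps are allowed, but only up to a limit.
assert len(tool_sequence) <= MAX_TOOL_CALLS
```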



### Test 8: Your Turn - Test Multi-Step Reasoning with Status Check

Now test if the agent can chain together a status check with knowledge base
  search!

  Your task:
  - Write a test where the user says: "The database seems slow. Check its
  status and find troubleshooting guides."
  - The agent should perform two steps:
    1. Call check_system_status to check database status
    2. Call search_knowledge_base to find troubleshooting articles

  What to assert:
  1. At least 2 tools were called
  2. Both check_system_status and search_knowledge_base were called
  3. The status check happens before the KB search (logical order)

In [None]:
# YOUR CODE HERE



## 7. Testing Edge Cases

  So far, we've tested the happy path: scenarios where users provide clear,
  well-formed requests with all necessary information. In production, however, users will **provide incomplete information, use ambiguous language, make typos, ask questions that don't require any tools, and reference IDs that don't exist**.

We therefore need to test **edge cases**: scenarios that probe the boundaries and limits of your system, such as:


  - ❓ Ambiguous requests - "Help me with my problem" (what problem?)
  - ❌ Invalid data - "Check ticket ABC123" (malformed ticket ID)
  - 🤷 No tool needed - "What is IT support?" (general knowledge question)
  - 🔀 Multiple interpretations - "Check the email" (ticket about email? or
  email service status?)
  - 🗣️ Noisy input - "Hey so like, umm, could you maybe check ticket 5678?"


**Your agent needs to handle these gracefully!**
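
For ambiguous requests there may be more than one acceptable behavior, so a test can assert that the agent stayed within a set of allowed tools instead of demanding one exact call. A minimal sketch, where `called_only_allowed_tools` is a hypothetical helper and the result dictionaries are hand-written stand-ins:

```python
def called_only_allowed_tools(result: dict, allowed: set[str]) -> bool:
    """True if every called tool is in `allowed` (zero calls also passes)."""
    called = {call["name"] for call in result["tool_calls"]}
    return called <= allowed


# "Check the email" could reasonably mean a ticket or a service status.
allowed_for_email = {"lookup_ticket", "check_system_status"}

status_run = {"tool_calls": [{"name": "check_system_status",
                              "parameters": {"service_name": "email"}}]}
clarifying_run = {"tool_calls": []}  # agent asked a follow-up question instead

assert called_only_allowed_tools(status_run, allowed_for_email)
assert called_only_allowed_tools(clarifying_run, allowed_for_email)
```

Whether an agent that asks a clarifying question (no tool calls) should pass is a judgment call; this sketch accepts it.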

### Test 9: General question (no tool needed)

  This test verifies that the agent knows when NOT to use tools. General
  conceptual questions like "What is IT support?" don't require looking up data
   or checking systems—they just need a direct answer.

In [None]:
# Test: Agent handles general questions without calling tools
user_message = "What is IT support?"
result = await run_agent_and_get_tools(user_message, session_id="test_9")

print(f"User: {user_message}")
print(f"\nTools called: {result['tool_count']}")
print(f"\nAgent response: {result['response'][:200]}...\n")

# For a general question, agent shouldn't need tools
assert result['tool_count'] == 0, f"❌ General question shouldn't require tools, but {result['tool_count']} tools were called"

# Should still provide a response
assert len(result['response']) > 0, "❌ Agent should provide a response"

print("✅ TEST PASSED: Agent handled general question without tools")

### Test 10: Invalid ticket ID


Test 10 will check how the agent handles requests for data that doesn't exist. The agent should still attempt the lookup (because the request is valid), but then handle the error response gracefully.

In [None]:
# Test: Agent attempts to look up non-existent ticket
# The tool will return an error, agent should handle it
user_message = "Check ticket 99999999"
result = await run_agent_and_get_tools(user_message, session_id="test_10")

print(f"User: {user_message}")
print(f"\nTools called: {[tc['name'] for tc in result['tool_calls']]}")
print(f"\nAgent response: {result['response'][:200]}...\n")

# Agent should still try to look up the ticket
tool_names = [tc['name'] for tc in result['tool_calls']]
assert 'lookup_ticket' in tool_names, "❌ Agent should attempt ticket lookup"

# The final response should indicate the ticket wasn't found
response_lower = result['response'].lower()
assert ('not found' in response_lower or "isn't found" in response_lower or
        "doesn't exist" in response_lower or 'invalid' in response_lower or
        'error' in response_lower), \
       f"❌ Response should indicate ticket not found, but got: {result['response'][:200]}"

print("✅ TEST PASSED: Agent handled invalid ticket ID")

### Test 11: Noisy input

  Real users don't speak like documentation examples. They add filler words,
  pleasantries, typos, and extra context such as:

  - "umm could u maybe check ticket 5678 thx"
  - "Hey there! So I was wondering if you could help me with ticket number
  5678? That would be great!"
  - "ticket 5678 please asap!!!"
  
  
This test verifies the agent can extract key information from messy, conversational input.

In [None]:
# Test: Agent extracts ticket ID from messy/noisy user input
user_message = "Hey, so like, I was wondering, could you maybe check ticket 5678 for me? Thanks!"
result = await run_agent_and_get_tools(user_message, session_id="test_11")

print(f"User: {user_message}")
print(f"\nTools called: {[tc['name'] for tc in result['tool_calls']]}")

# Agent should extract the ticket ID despite the noise
ticket_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'lookup_ticket']
assert len(ticket_calls) > 0, "❌ Agent should extract ticket ID from noisy input"

ticket_id = ticket_calls[0]['parameters'].get('ticket_id')
print(f"Extracted ticket_id: {ticket_id}\n")

assert ticket_id == "5678", f"❌ Expected ticket '5678', but got: {ticket_id}"

print("✅ TEST PASSED: Agent extracted info from noisy input")

### Edge Case Testing Strategy

  When testing edge cases, think about what could go wrong in production. Ask
  yourself: "How might users actually use this?"

**Categories of edge cases to test:**

  1. **❌ Invalid inputs** - What if data is malformed or doesn't exist?
     - Non-existent ticket IDs
     - Misspelled service names
     - Ticket IDs in wrong format (e.g., "#5678" vs "5678")

  2. **❓ Missing information** - What if user doesn't provide required
  details?
     - "Check my ticket" (which ticket?)
     - "Is it working?" (what service?)
     - "Find me some help" (with what?)

  3. **🔀 Ambiguity** - What if request could mean multiple things?
     - "Check the email" (ticket about email? or email service status?)
     - "Database issue" (lookup ticket? check status? search KB?)
     - "The printer" (which printer? printer service?)

  4. **⚠️ Error conditions** - What if tools fail or return errors?
     - Tool times out
     - API returns 500 error
     - Network connection lost

  5. **❗ Boundary conditions** - What about extreme values?
     - Very long ticket IDs
     - Special characters in queries
     - Empty input strings
     - Extremely long user messages

  6. **💬 Noisy or conversational input** - Real users aren't robots!
     - Extra filler words ("um", "like", "maybe")
     - Politeness phrases ("please", "thank you", "could you")
     - Typos and grammatical errors
     - Multiple languages mixed together
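
Error conditions (category 4) are hard to trigger with well-behaved mocks, so one approach is to swap in a mock that always fails. The sketch below builds such a stand-in; `make_failing_lookup` is a hypothetical factory. It follows this notebook's convention of returning an `{"error": ...}` dictionary. It also assumes the agent framework derives the tool name from the function name, which is why the inner function is still called `lookup_ticket`.

```python
from typing import Callable


def make_failing_lookup(error_message: str) -> Callable[[str], dict]:
    """Build a stand-in for lookup_ticket that always reports a failure."""

    def lookup_ticket(ticket_id: str) -> dict:
        """Look up details for a support ticket."""
        # Same error shape the notebook's mock uses for unknown tickets.
        return {"error": error_message}

    return lookup_ticket


timeout_lookup = make_failing_lookup("Ticket service timed out")
print(timeout_lookup("5678"))
```

You could then build a separate agent for the test with `timeout_lookup` in its `tools` list and assert that the final response acknowledges the failure and does not invent ticket details.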



## 8. Exercises 🎓

Now it's your turn! Apply what you've learned about testing agents.

### Exercise 1: Test Different Ticket IDs

Write tests for ticket IDs: 1234, 9999. Verify the agent:
1. Calls lookup_ticket
2. Extracts the correct ticket ID
3. Provides appropriate response

In [None]:
# Exercise 1: Your code here
# Test ticket 1234

# Test ticket 9999


### Exercise 2: Test Tool Disambiguation

Write a test where the user says "Check the database". The agent could:
- Call `lookup_ticket` if it interprets the request as being about a ticket
- Call `check_system_status` if it interprets the request as being about the service

Which one should the agent choose? Test your hypothesis!

In [None]:
# Exercise 2: Your code here


### Exercise 3: Test Multiple Tickets

Write a test where the user says "Compare tickets 5678 and 1234".

**Assert:**
1. Agent looks up both tickets
2. Both ticket IDs are correctly extracted

In [None]:
# Exercise 3: Your code here


### Exercise 4: Create a Failing Test

Write a test that you EXPECT to fail. Then explain:
1. Why it fails
2. Is it a bug in the agent or a test problem?
3. How would you fix it?

**Example failing scenarios:**
- Agent calls wrong tool
- Agent extracts wrong parameter
- Agent doesn't handle edge case
- Test is too strict

In [None]:
# Exercise 4: Your failing test here


## 9. Best Practices for Agent Testing

### ✅ DO:

1. **Test behavior, not text** - Focus on tool calls and parameters
2. **Test tool selection first** - Ensure agent picks the right tool
3. **Test parameter accuracy** - Verify extracted information is correct
4. **Test tool call sequences** - Multi-step reasoning matters
5. **Test edge cases** - Invalid input, missing data, ambiguity
6. **Use mock data** - Fast tests with fake/mock tool responses
7. **Test happy path AND failures** - Both success and error cases
8. **Write descriptive test names** - Make it clear what you're testing

### ❌ DON'T:

1. **Don't test only final text** - Tool calls are more important
2. **Don't expect exact tool sequences** - Some variation is OK
3. **Don't skip edge cases** - That's where bugs hide
4. **Don't use real external services** - Slow and unreliable
5. **Don't test too many things in one test** - Keep tests focused

### Testing Hierarchy

**Priority 1: Critical Functionality**
- Does agent call the right tool?
- Does agent extract correct parameters?

**Priority 2: Complex Behavior**
- Multi-step reasoning
- Tool call ordering

**Priority 3: Edge Cases**
- Invalid inputs
- Error handling
- Ambiguous requests

**Priority 4: Output Quality**
- Response helpfulness
- Text clarity
- (This is covered in the upcoming notebook: LLM-as-judge)



