# Testing ADK Agents: From Simple LLMs to Agentic Behavior

**Course:** LLM and Agent Testing - Lesson 2  
**Domain:** IT Support/Services Company  
**Environment:** Google Colab

---

## 📚 Learning Objectives

By the end of this notebook, you will be able to:

1. Build testable agents using Google's ADK (Agent Development Kit)
2. Test agent tool selection (which tools the agent chooses to call)
3. Test tool parameters (accuracy of extracted information)
4. Test multi-step reasoning (tool call sequences)
5. Test edge cases (invalid inputs, ambiguous requests)
6. Apply pytest patterns to agent behavior testing

---

## 🎯 Context: Why Agent Testing is Different

### Quick Recap: Notebook 01

In the previous lesson, you learned to:
- ✅ Write automated tests with pytest
- ✅ Test LLM text outputs for factual correctness
- ✅ Validate structured outputs with Pydantic
- ✅ Use parameterized tests

### What's Different with Agents?

**Simple LLM (Notebook 01):**
```
User: "What port does SSH use?"
LLM: "Port 22"
Test: Check if response contains "22" ✅
```

**Agent with Tools (This Notebook):**
```
User: "What's the status of ticket #5678?"
Agent: Thinks... I need to look up this ticket
       Calls: lookup_ticket("5678")
       Tool returns: {ticket_id: 5678, status: "In Progress", ...}
       Agent: "Ticket #5678 is currently In Progress..."
       
Test: Did agent call the right tool? ✅
Test: Did agent extract correct ticket ID? ✅
Test: Did agent use the tool results properly? ✅
```
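
The three checks above can be sketched against a recorded run. The `recorded_run` dictionary below is a hand-written stand-in, not real agent output. Its `tool_calls`, `name`, `parameters`, and `final_response` keys mirror the result shape used by the tests later in this notebook; the `tool_results` key is an assumption added for illustration.

```python
# Hand-written stand-in for one agent run (hypothetical data, no LLM involved).
recorded_run = {
    "tool_calls": [
        {"name": "lookup_ticket", "parameters": {"ticket_id": "5678"}},
    ],
    # Assumed key: the notebook's own helpers may not record tool results.
    "tool_results": [
        {"ticket_id": "5678", "status": "In Progress"},
    ],
    "final_response": "Ticket #5678 is currently In Progress.",
}


def check_run(run: dict) -> list:
    """Return the names of the checks that passed for one recorded run."""
    passed = []
    calls = run.get("tool_calls", [])
    results = run.get("tool_results", [])
    if calls and calls[0]["name"] == "lookup_ticket":
        passed.append("right_tool")
    if calls and calls[0]["parameters"].get("ticket_id") == "5678":
        passed.append("right_ticket_id")
    if results and results[0]["status"].lower() in run["final_response"].lower():
        passed.append("used_tool_result")
    return passed


print(check_run(recorded_run))
```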

### Key Differences

| Simple LLM Testing | Agent Testing |
|-------------------|---------------|
| Test final text output | Test tool selection & parameters |
| Single-step response | Multi-step reasoning |
| Stateless | Stateful (tool results affect next steps) |
| Straightforward assertions | Test tool call sequences |

### What We're Building Today

An **IT Support Agent** with these tools:
- 🎫 `lookup_ticket(ticket_id)` - Retrieve ticket details
- 📚 `search_knowledge_base(query)` - Find help articles
- 🔍 `check_system_status(service_name)` - Check if systems are up

**Testing scenarios:**
- User asks about a ticket → Agent calls `lookup_ticket` with correct ID
- User has a problem → Agent searches KB for solution
- User asks multi-step question → Agent calls multiple tools in order

---

Let's get started! 🚀

## 1. Environment Setup

First, we'll install the required packages.

In [None]:
# Install required packages
!pip install -q google-adk litellm openai python-dotenv nest-asyncio deprecated google-genai pytest pytest-html pydantic

In [None]:
# Import required libraries
import os
from openai import OpenAI
import pytest
from pydantic import BaseModel, Field
import json
from typing import List, Optional, Dict, Any
import nest_asyncio

# Enable nested event loops (required for Colab)
nest_asyncio.apply()

# Core ADK imports
from google.adk.agents import LlmAgent  # For creating AI agents
from google.adk.runners import Runner  # For executing agent interactions
from google.adk.sessions import InMemorySessionService  # For managing conversation sessions
from google.adk.models.lite_llm import LiteLlm  # For using OpenAI and other LLM providers

# Google AI SDK for content formatting
from google.genai import types

print("✅ All imports successful!")

### API Key Setup

In [None]:
# Configure OpenAI API key
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:  # Not running in Colab, or the secret is missing or not shared with this notebook
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate before exporting: os.environ only accepts strings, so a missing key must fail here
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

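# Model configuration
OPENAI_MODEL = "gpt-5-nano"
os.environ["OPENAI_MODEL"] = OPENAI_MODEL  # helpers.py reads this variable when it builds the agent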

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

## 2. Creating a Testable IT Support Agent

Let's build a simple IT support agent with three tools. These tools will return mock data for fast testing.

### Understanding ADK Tool Structure

An ADK tool consists of:
1. **Function** - The actual code that runs
2. **Tool definition** - Describes the function to the agent
3. **Registration** - Adding the tool to the agent
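
To make the "tool definition" idea concrete, here is a minimal sketch of how a description can be derived from a plain Python function using only the standard library. It illustrates the general idea and is not ADK's actual implementation; `describe_tool` is a hypothetical helper, and the `lookup_ticket` stub is a shortened stand-in for the real tool.

```python
import inspect


def describe_tool(fn) -> dict:
    """Build a simple description of a function from its signature and docstring."""
    signature = inspect.signature(fn)
    doc = inspect.getdoc(fn) or ""
    return {
        "name": fn.__name__,
        # Use the first docstring line as a short description.
        "description": doc.splitlines()[0] if doc else "",
        # Map each parameter name to the name of its type hint.
        "parameters": {
            name: getattr(param.annotation, "__name__", str(param.annotation))
            for name, param in signature.parameters.items()
        },
    }


def lookup_ticket(ticket_id: str) -> dict:
    """Look up details for a support ticket."""
    return {"ticket_id": ticket_id}


print(describe_tool(lookup_ticket))
```

This is why clear docstrings and type hints matter: they are the raw material for whatever description the agent ends up seeing.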

In [None]:
# Tool 1: Lookup Ticket
def lookup_ticket(ticket_id: str) -> dict:
    """
    Look up details for a support ticket.
    
    Args:
        ticket_id: The ticket ID to look up (e.g., "5678")
    
    Returns:
        Dictionary with ticket details
    """
    # Mock data for testing - in real system, would query database
    mock_tickets = {
        "5678": {
            "ticket_id": "5678",
            "status": "In Progress",
            "priority": "High",
            "user": "Alice Johnson",
            "issue": "Cannot access email",
            "assigned_to": "Tech Support Team"
        },
        "1234": {
            "ticket_id": "1234",
            "status": "Resolved",
            "priority": "Medium",
            "user": "Bob Smith",
            "issue": "Printer not working",
            "assigned_to": "Hardware Team"
        },
        "9999": {
            "ticket_id": "9999",
            "status": "Open",
            "priority": "Critical",
            "user": "Charlie Brown",
            "issue": "Server down",
            "assigned_to": "Infrastructure Team"
        }
    }
    
    if ticket_id in mock_tickets:
        return mock_tickets[ticket_id]
    else:
        return {"error": f"Ticket {ticket_id} not found"}


# Tool 2: Search Knowledge Base
def search_knowledge_base(query: str) -> dict:
    """
    Search the IT knowledge base for help articles.
    
    Args:
        query: Search query (e.g., "how to reset password")
    
    Returns:
        Dictionary with search results
    """
    # Mock knowledge base articles
    mock_kb = {
        "password": [
            {"title": "How to Reset Your Password", "article_id": "KB001"},
            {"title": "Password Requirements", "article_id": "KB002"}
        ],
        "email": [
            {"title": "Troubleshooting Email Access", "article_id": "KB010"},
            {"title": "Email Configuration Guide", "article_id": "KB011"}
        ],
        "vpn": [
            {"title": "VPN Setup Instructions", "article_id": "KB020"},
            {"title": "VPN Connection Issues", "article_id": "KB021"}
        ],
        "printer": [
            {"title": "Printer Offline Solutions", "article_id": "KB030"},
            {"title": "How to Install Printer Drivers", "article_id": "KB031"}
        ]
    }
    
    query_lower = query.lower()
    results = []
    
    for keyword, articles in mock_kb.items():
        if keyword in query_lower:
            results.extend(articles)
    
    if results:
        return {"query": query, "results": results, "count": len(results)}
    else:
        return {"query": query, "results": [], "count": 0}


# Tool 3: Check System Status
def check_system_status(service_name: str) -> dict:
    """
    Check the operational status of a service or system.
    
    Args:
        service_name: Name of the service (e.g., "email", "vpn", "database")
    
    Returns:
        Dictionary with service status
    """
    # Mock system status
    mock_status = {
        "email": {"service": "email", "status": "operational", "uptime": "99.9%"},
        "vpn": {"service": "vpn", "status": "operational", "uptime": "100%"},
        "database": {"service": "database", "status": "degraded", "uptime": "95.2%"},
        "file_server": {"service": "file_server", "status": "down", "uptime": "0%"},
        "web_portal": {"service": "web_portal", "status": "operational", "uptime": "99.5%"}
    }
    
    service_lower = service_name.lower().replace(" ", "_")
    
    if service_lower in mock_status:
        return mock_status[service_lower]
    else:
        return {"service": service_name, "status": "unknown"}


print("✅ Tools defined successfully!")
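
Because the tools are plain Python functions, they can be unit-tested directly, with no LLM call involved. The sketch below uses a trimmed copy of `check_system_status` (two services only, so the block stands on its own); the two test function names are illustrative.

```python
def check_system_status(service_name: str) -> dict:
    """Trimmed copy of the notebook's tool, kept to two services for brevity."""
    mock_status = {
        "email": {"service": "email", "status": "operational", "uptime": "99.9%"},
        "file_server": {"service": "file_server", "status": "down", "uptime": "0%"},
    }
    key = service_name.lower().replace(" ", "_")
    return mock_status.get(key, {"service": service_name, "status": "unknown"})


def test_status_normalizes_spaces_and_case():
    # "File Server" should resolve to the "file_server" entry.
    assert check_system_status("File Server")["status"] == "down"


def test_status_unknown_service():
    # Unknown services fall back to an "unknown" status instead of raising.
    assert check_system_status("fax") == {"service": "fax", "status": "unknown"}


# Called directly here so the cell runs without pytest.
test_status_normalizes_spaces_and_case()
test_status_unknown_service()
```

Fast tests like these catch tool bugs early, so a failing agent test later points at the agent rather than at the tool.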

### Creating the ADK Agent

Now we'll create an ADK agent and register our tools.

**Note:** In ADK, you can pass Python functions directly as tools. ADK derives each tool's name, parameters, and description from the function's signature, type hints, and docstring, so keep those accurate!

In [None]:
# Create the LiteLlm model instance for OpenAI
# This allows ADK to work with OpenAI's API
llm_model = LiteLlm(
    model=f"openai/{OPENAI_MODEL}",  # Format: provider/model
    api_key=OPENAI_API_KEY
)

# Create the IT Support Agent using LlmAgent
# In ADK, custom functions can be passed directly as tools
# NOTE: Agent name must be a valid Python identifier (no spaces, starts with letter/underscore)
it_support_agent = LlmAgent(
    name="it_support_agent",  # Valid identifier: lowercase with underscores
    model=llm_model,
    description="An IT support agent that helps users with tickets, knowledge base searches, and system status checks",
    instruction="""You are an IT support agent. Help users with their IT issues by:
    1. Looking up ticket information when asked about specific tickets
    2. Searching the knowledge base for solutions to problems
    3. Checking system status when users report service issues
    
    Always use the appropriate tools to get accurate information.""",
    tools=[lookup_ticket, search_knowledge_base, check_system_status]
)

print("✅ IT Support Agent created!")
print(f"Agent has {len(it_support_agent.tools)} tools available")

### Helper Function for Testing

We need a way to run the agent and capture tool calls for testing.

In [None]:
# Let's first test if our agent can successfully call tools
# We'll add some debug output to see what's happening

import asyncio
import nest_asyncio

nest_asyncio.apply()

async def test_agent_with_debug(user_message: str):
    """Test the agent and show detailed debug information."""
    print(f"\n{'='*60}")
    print(f"Testing: {user_message}")
    print(f"{'='*60}")
    
    # Create session service
    session_service = InMemorySessionService()
    
    # Create session
    session_id = "debug_session"
    user_id = "debug_user"
    
    await session_service.create_session(
        app_name="it_support_test",
        user_id=user_id,
        session_id=session_id,
        state={}
    )
    
    # Create runner
    runner = Runner(
        app_name="it_support_test",
        agent=it_support_agent,
        session_service=session_service
    )
    
    # Format message
    content = types.Content(
        role='user',
        parts=[types.Part(text=user_message)]
    )
    
    # Run and collect all events
    events = runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content
    )
    
    print("\n📊 Events received:")
    event_count = 0
    tool_uses = []
    final_response = None
    
    async for event in events:
        event_count += 1
        print(f"\nEvent {event_count}:")
        print(f"  Type: {type(event).__name__}")

        # ADK events expose the tool calls requested by the model via get_function_calls()
        function_calls = event.get_function_calls()
        print(f"  Has tool calls: {bool(function_calls)}")
        print(f"  Is final: {event.is_final_response()}")

        # Record every tool call (name and arguments) carried by this event
        for call in function_calls:
            print(f"  ✅ Tool call detected: {call.name}({dict(call.args or {})})")
            tool_uses.append(call)

        # Get final response (some events carry no content, so guard before indexing)
        if event.is_final_response() and event.content and event.content.parts:
            final_response = event.content.parts[0].text
            print(f"  ✅ Final response: {(final_response or '')[:100]}...")
    
    print(f"\n📈 Summary:")
    print(f"  Total events: {event_count}")
    print(f"  Tools called: {len(tool_uses)}")
    print(f"  Final response length: {len(final_response) if final_response else 0} chars")
    
    return {
        'tool_count': len(tool_uses),
        'tools': tool_uses,
        'response': final_response
    }

# Test with a ticket lookup query
# nest_asyncio.apply() above lets asyncio.run() work inside Colab's already-running event loop
result = asyncio.run(test_agent_with_debug("What's the status of ticket 5678?"))

print(f"\n🎯 Result:")
print(f"  Tool count: {result['tool_count']}")
print(f"  Response: {result['response']}")
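
The capture logic can also be separated from the runner so it is testable on its own. The sketch below collapses a list of events into the result dictionary the later tests rely on (`tool_calls`, `tool_count`, `final_response`). It assumes each event offers `get_function_calls()`, `is_final_response()`, and `content.parts`; `summarize_events` and `fake_event` are hypothetical names, and the fake events only mimic those attributes.

```python
from types import SimpleNamespace


def summarize_events(events) -> dict:
    """Collapse a list of event-like objects into a test-friendly result dict."""
    tool_calls = []
    final_response = ""
    for event in events:
        for call in event.get_function_calls():
            tool_calls.append({"name": call.name, "parameters": dict(call.args or {})})
        if event.is_final_response() and event.content and event.content.parts:
            final_response = event.content.parts[0].text or ""
    return {
        "tool_calls": tool_calls,
        "tool_count": len(tool_calls),
        "final_response": final_response,
    }


def fake_event(calls=(), text=None, final=False):
    """Hypothetical stand-in that mimics only the attributes used above."""
    content = SimpleNamespace(parts=[SimpleNamespace(text=text)]) if text else None
    return SimpleNamespace(
        get_function_calls=lambda: list(calls),
        is_final_response=lambda: final,
        content=content,
    )


events = [
    fake_event(calls=[SimpleNamespace(name="lookup_ticket", args={"ticket_id": "5678"})]),
    fake_event(text="Ticket 5678 is In Progress.", final=True),
]
result = summarize_events(events)
print(result["tool_count"], result["tool_calls"][0]["name"])
```

Keeping this step as a pure function means the result shape can be verified without an API key.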

## 3. Testing Tool Selection

The first thing to test: **Does the agent choose the right tool?**

This is fundamental - if the agent picks the wrong tool, nothing else matters!

### Pattern: Test Tool Selection

```
1. Give agent a task that requires a specific tool
2. Run the agent
3. Assert that the expected tool was called
```
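
The assertion in step 3 repeats in every tool selection test, so it can be wrapped in a small helper. This is a sketch: `called_tool_names` and `assert_tool_called` are hypothetical names, and `sample_result` is hand-written data in the same shape the tests below expect.

```python
def called_tool_names(result: dict) -> list:
    """Return the names of all tools called during one run, in order."""
    return [call["name"] for call in result["tool_calls"]]


def assert_tool_called(result: dict, expected: str) -> None:
    """Fail with a readable message if the expected tool was never called."""
    names = called_tool_names(result)
    assert expected in names, f"Expected '{expected}' to be called, but got: {names}"


# Hand-written result (hypothetical data, no LLM involved).
sample_result = {
    "tool_calls": [{"name": "lookup_ticket", "parameters": {"ticket_id": "5678"}}],
    "tool_count": 1,
    "final_response": "Ticket 5678 is In Progress.",
}

assert_tool_called(sample_result, "lookup_ticket")
```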

In [None]:
%%writefile test_agent_tool_selection.py
# test_agent_tool_selection.py
# Tests for agent tool selection

import os
from openai import OpenAI
import pytest
import nest_asyncio

# Enable nested event loops
nest_asyncio.apply()

# ADK imports
from google.adk.agents import LlmAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.models.lite_llm import LiteLlm

# Import our tool functions
from tools import lookup_ticket, search_knowledge_base, check_system_status
from helpers import run_agent_and_capture_tools, create_it_support_agent

# Initialize agent (we'll use a fixture in real tests)
agent = create_it_support_agent()


def test_agent_calls_ticket_lookup_tool():
    """
    Test: When user asks about a ticket, agent should call lookup_ticket tool.
    """
    user_message = "Can you check the status of ticket 5678?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Assert that at least one tool was called
    assert result['tool_count'] > 0, "Agent should have called at least one tool"
    
    # Assert that lookup_ticket was called
    tool_names = [tc['name'] for tc in result['tool_calls']]
    assert 'lookup_ticket' in tool_names, f"Expected 'lookup_ticket' to be called, but got: {tool_names}"


def test_agent_calls_knowledge_base_tool():
    """
    Test: When user has a problem, agent should search knowledge base.
    """
    user_message = "How do I reset my password?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Assert that search_knowledge_base was called
    tool_names = [tc['name'] for tc in result['tool_calls']]
    assert 'search_knowledge_base' in tool_names, \
        f"Expected 'search_knowledge_base' to be called, but got: {tool_names}"


def test_agent_calls_system_status_tool():
    """
    Test: When user asks about system status, agent should check status.
    """
    user_message = "Is the email service working?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Assert that check_system_status was called
    tool_names = [tc['name'] for tc in result['tool_calls']]
    assert 'check_system_status' in tool_names, \
        f"Expected 'check_system_status' to be called, but got: {tool_names}"


def test_agent_chooses_correct_tool_for_ticket_vs_kb():
    """
    Test: Agent distinguishes between ticket lookup and KB search.
    """
    # Scenario 1: Ticket lookup
    result1 = run_agent_and_capture_tools(agent, "Show me ticket 1234")
    tool_names1 = [tc['name'] for tc in result1['tool_calls']]
    
    # Scenario 2: KB search
    result2 = run_agent_and_capture_tools(agent, "How to fix printer offline error?")
    tool_names2 = [tc['name'] for tc in result2['tool_calls']]
    
    # Assert correct tool selection
    assert 'lookup_ticket' in tool_names1, "Should use lookup_ticket for ticket queries"
    assert 'search_knowledge_base' in tool_names2, "Should use KB search for how-to questions"


# %%writefile reports the file it wrote, so no print statement is needed inside the test module

### Key Insight: Testing Behavior, Not Text

Notice what we're testing:
- ✅ **We test**: Which tool was called
- ❌ **We don't test**: The exact text of the response

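Why? Because:
1. Tool calls are **structured** (a tool name plus arguments), so they can be compared exactly
2. Text responses are **variable** (natural language), so the same correct answer can be worded many ways
3. Tool calls show the agent **understood** the task
4. Tool calls are what **actually matter** for functionality

Tool selection still comes from the LLM, so it can occasionally vary between runs; it is just far more stable than free-form text.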
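
One way to check how stable tool selection is in practice is to repeat a scenario several times and look at the pass rate. The sketch below uses a scripted fake in place of the real agent, so it runs without an API key; `pass_rate`, `called_lookup`, and `fake_run` are hypothetical names.

```python
from itertools import cycle


def pass_rate(run_once, check, attempts: int = 5) -> float:
    """Run a scenario several times and return the fraction of runs that pass."""
    passed = sum(1 for _ in range(attempts) if check(run_once()))
    return passed / attempts


def called_lookup(result: dict) -> bool:
    """Check that lookup_ticket appears among the tool calls."""
    return any(call["name"] == "lookup_ticket" for call in result["tool_calls"])


# Scripted stand-in for the agent: four correct runs, then one wrong tool choice.
_scripted = cycle(["lookup_ticket"] * 4 + ["search_knowledge_base"])


def fake_run() -> dict:
    return {"tool_calls": [{"name": next(_scripted), "parameters": {}}]}


rate = pass_rate(fake_run, called_lookup, attempts=5)
print(f"pass rate: {rate:.0%}")
```

With a real agent, each attempt costs an API call, so keep `attempts` small and decide up front what pass rate is acceptable for your use case.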

## 4. Testing Tool Parameters

Choosing the right tool is good. But did the agent extract the correct parameters?

**Example:**
- User: "Check ticket 5678"
- Agent calls: `lookup_ticket("5678")` ✅
- Agent calls: `lookup_ticket("1234")` ❌ Wrong ticket!

### Pattern: Test Parameter Extraction

```
1. Give agent a task with specific information
2. Run the agent
3. Assert the tool was called with correct parameters
```
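
Step 3 usually means "find the call, then compare its arguments". The sketch below shows one way to do that, plus an optional normalizer for cases where you decide that `"#5678"` or `5678` should count as the same ticket. Both helpers are hypothetical, and tolerating such variants is a design choice; the tests below use strict equality instead.

```python
import re


def first_call_args(result: dict, tool_name: str) -> dict:
    """Return the parameters of the first call to tool_name, or fail clearly."""
    for call in result["tool_calls"]:
        if call["name"] == tool_name:
            return call["parameters"]
    raise AssertionError(f"{tool_name} was not called")


def normalize_ticket_id(value) -> str:
    """Reduce '#5678', ' 5678 ', or the integer 5678 to the digit string '5678'."""
    return re.sub(r"\D", "", str(value))


# Hand-written result (hypothetical data): the agent kept the '#' prefix.
sample_result = {
    "tool_calls": [{"name": "lookup_ticket", "parameters": {"ticket_id": "#5678"}}],
}

args = first_call_args(sample_result, "lookup_ticket")
assert normalize_ticket_id(args["ticket_id"]) == "5678"
```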

In [None]:
%%writefile test_agent_parameters.py
# test_agent_parameters.py
# Tests for agent parameter extraction

import os
import pytest
from helpers import run_agent_and_capture_tools, create_it_support_agent

agent = create_it_support_agent()


def test_agent_extracts_ticket_id():
    """
    Test: Agent correctly extracts ticket ID from user message.
    """
    user_message = "What's the status of ticket 5678?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Find the lookup_ticket call
    ticket_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'lookup_ticket']
    assert len(ticket_calls) > 0, "Agent should have called lookup_ticket"
    
    # Check the ticket_id parameter
    ticket_id = ticket_calls[0]['parameters'].get('ticket_id')
    assert ticket_id == "5678", f"Expected ticket_id '5678', but got: {ticket_id}"


def test_agent_extracts_search_query():
    """
    Test: Agent correctly extracts and formats search query.
    """
    user_message = "I need help with VPN connection issues"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Find the KB search call
    kb_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'search_knowledge_base']
    assert len(kb_calls) > 0, "Agent should have called search_knowledge_base"
    
    # Check that query contains relevant keywords
    query = kb_calls[0]['parameters'].get('query', '').lower()
    assert 'vpn' in query, f"Query should contain 'vpn', but got: {query}"


def test_agent_extracts_service_name():
    """
    Test: Agent correctly identifies service name from user query.
    """
    user_message = "Is the email service operational right now?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Find the status check call
    status_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'check_system_status']
    assert len(status_calls) > 0, "Agent should have called check_system_status"
    
    # Check the service name
    service = status_calls[0]['parameters'].get('service_name', '').lower()
    assert 'email' in service, f"Service name should contain 'email', but got: {service}"


@pytest.mark.parametrize("ticket_id", ["5678", "1234", "9999"])
def test_agent_extracts_various_ticket_ids(ticket_id):
    """
    Test: Agent correctly extracts different ticket IDs.
    This is a parameterized test - runs once for each ticket_id!
    """
    user_message = f"Look up ticket {ticket_id}"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Find lookup_ticket call
    ticket_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'lookup_ticket']
    assert len(ticket_calls) > 0, "Agent should have called lookup_ticket"
    
    # Verify correct ticket ID
    extracted_id = ticket_calls[0]['parameters'].get('ticket_id')
    assert extracted_id == ticket_id, \
        f"Expected ticket_id '{ticket_id}', but got: {extracted_id}"


# %%writefile reports the file it wrote, so no print statement is needed inside the test module

### Why Parameter Testing Matters

Imagine these scenarios:

**Scenario 1: Correct Parameters** ✅
```
User: "Check ticket 5678"
Agent: lookup_ticket("5678") → Returns correct ticket
User: Happy! Gets the right information
```

**Scenario 2: Wrong Parameters** ❌
```
User: "Check ticket 5678"
Agent: lookup_ticket("1234") → Returns wrong ticket
User: Confused! Gets incorrect information
```

**Testing parameters ensures data integrity!**

## 5. Testing Multi-Step Reasoning

Real-world agent tasks often require multiple steps:

**Example:**
```
User: "Check ticket 5678 and find solutions for the issue"

Step 1: Agent calls lookup_ticket("5678")
        Returns: {issue: "Cannot access email"}
        
Step 2: Agent calls search_knowledge_base("email access")
        Returns: [KB articles about email]
        
Step 3: Agent synthesizes information and responds
```

### Pattern: Test Tool Call Sequences

```
1. Give agent a multi-step task
2. Run the agent
3. Assert on tool count
4. Assert on tool call order
5. Assert on parameter correctness across calls
```
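
For step 4, a compact way to assert order is to check that the expected tools appear as a subsequence of the actual calls, which allows extra calls in between. `appears_in_order` is a hypothetical helper name.

```python
def appears_in_order(expected: list, actual: list) -> bool:
    """True if every expected name occurs in actual, in the same relative order."""
    remaining = iter(actual)
    # 'name in remaining' consumes the iterator up to the match,
    # so later names must be found after earlier ones.
    return all(name in remaining for name in expected)


good = ["lookup_ticket", "check_system_status", "search_knowledge_base"]
bad = ["search_knowledge_base", "lookup_ticket"]

print(appears_in_order(["lookup_ticket", "search_knowledge_base"], good))
print(appears_in_order(["lookup_ticket", "search_knowledge_base"], bad))
```

Compared with two `index()` lookups, this also handles sequences of three or more tools without extra code.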

In [None]:
%%writefile test_agent_multistep.py
# test_agent_multistep.py
# Tests for multi-step agent reasoning

import os
import pytest
from helpers import run_agent_and_capture_tools, create_it_support_agent

agent = create_it_support_agent()


def test_agent_multistep_ticket_then_kb():
    """
    Test: Agent performs multi-step reasoning.
    1. Looks up ticket to understand the issue
    2. Searches KB for solutions
    """
    user_message = "Check ticket 5678 and help me find solutions for the issue"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Assert multiple tools were called
    assert result['tool_count'] >= 2, \
        f"Expected at least 2 tool calls, but got: {result['tool_count']}"
    
    # Get tool names in order
    tool_sequence = [tc['name'] for tc in result['tool_calls']]
    
    # Assert that lookup_ticket was called
    assert 'lookup_ticket' in tool_sequence, "Agent should look up the ticket"
    
    # Assert that search_knowledge_base was called
    assert 'search_knowledge_base' in tool_sequence, "Agent should search KB for solutions"
    
    # Assert correct order: ticket lookup should come before KB search
    ticket_index = tool_sequence.index('lookup_ticket')
    kb_index = tool_sequence.index('search_knowledge_base')
    assert ticket_index < kb_index, \
        f"Agent should look up ticket before searching KB, but order was: {tool_sequence}"


def test_agent_multistep_status_then_kb():
    """
    Test: Agent checks system status, then searches for known issues.
    """
    user_message = "The database seems slow. Check its status and find troubleshooting guides."
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Assert multiple tools were called
    assert result['tool_count'] >= 2, \
        f"Expected at least 2 tool calls for this complex request, but got: {result['tool_count']}"
    
    tool_names = [tc['name'] for tc in result['tool_calls']]
    
    # Should check status and search KB
    assert 'check_system_status' in tool_names, "Should check database status"
    assert 'search_knowledge_base' in tool_names, "Should search for troubleshooting guides"


def test_agent_single_step_when_appropriate():
    """
    Test: Agent uses single tool when that's all that's needed.
    Not every query requires multiple steps!
    """
    user_message = "What's the status of ticket 1234?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # For this simple query, should only need lookup_ticket
    assert result['tool_count'] == 1, \
        f"Simple ticket lookup should use 1 tool, but used: {result['tool_count']}"
    
    assert result['tool_calls'][0]['name'] == 'lookup_ticket', \
        "Should use lookup_ticket for ticket status query"


# %%writefile reports the file it wrote, so no print statement is needed inside the test module

### Understanding Tool Call Order

Why does order matter?

**Good Order:**
```
1. lookup_ticket("5678") → Get issue: "email access"
2. search_knowledge_base("email access") → Find relevant articles
3. Provide informed response
```

**Bad Order:**
```
1. search_knowledge_base("unknown") → Generic results
2. lookup_ticket("5678") → Too late, already gave bad advice
```

**Testing ensures logical reasoning flow!**
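
Order matters because the second call should build on the first call's output. A rough way to test that dependency is to check that the KB query shares at least one keyword with the ticket's issue. The keyword overlap below is a deliberately crude heuristic, and `keywords` and `query_reflects_issue` are hypothetical helpers.

```python
import re


def keywords(text: str, min_length: int = 3) -> set:
    """Lowercased words of at least min_length characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) >= min_length}


def query_reflects_issue(issue: str, query: str) -> bool:
    """True if the search query reuses at least one keyword from the ticket issue."""
    return bool(keywords(issue) & keywords(query))


issue = "Cannot access email"  # the issue text of mock ticket 5678
print(query_reflects_issue(issue, "email access"))
print(query_reflects_issue(issue, "unknown"))
```

An agent may legitimately rephrase the issue with synonyms, so treat a failure here as a prompt to inspect the run rather than as proof of a bug.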

## 6. Testing Edge Cases

What happens when things go wrong or are unclear?

**Edge cases to test:**
1. ❓ **Ambiguous requests** - "Help me with my problem" (what problem?)
2. ❌ **Invalid data** - "Check ticket XYZ" (invalid ticket ID)
3. 🤷 **No tool needed** - "What is IT support?" (general question)
4. 🔀 **Multiple interpretations** - "Check the email" (ticket or system status?)

### Why Test Edge Cases?

In production:
- Users won't always provide perfect input
- Systems may return errors
- Requests may be vague or ambiguous

**Your agent needs to handle these gracefully!**

In [None]:
%%writefile test_agent_edge_cases.py
# test_agent_edge_cases.py
# Tests for edge cases and error handling

import os
import pytest
from helpers import run_agent_and_capture_tools, create_it_support_agent

agent = create_it_support_agent()


def test_agent_no_tool_needed():
    """
    Test: Agent handles general questions without calling tools.
    """
    user_message = "What is IT support?"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # For a general question, agent shouldn't need tools
    assert result['tool_count'] == 0, \
        f"General question shouldn't require tools, but {result['tool_count']} tools were called"
    
    # Should still provide a response
    assert len(result['final_response']) > 0, "Agent should provide a response"


def test_agent_handles_invalid_ticket_id():
    """
    Test: Agent attempts to look up non-existent ticket.
    The tool will return an error, agent should handle it.
    """
    user_message = "Check ticket 99999999"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Agent should still try to look up the ticket
    tool_names = [tc['name'] for tc in result['tool_calls']]
    assert 'lookup_ticket' in tool_names, "Agent should attempt ticket lookup"
    
    # The final response should indicate the ticket wasn't found
    response_lower = result['final_response'].lower()
    assert 'not found' in response_lower or 'doesn\'t exist' in response_lower or 'invalid' in response_lower, \
        f"Response should indicate ticket not found, but got: {result['final_response']}"


def test_agent_handles_ambiguous_request():
    """
    Test: Agent handles vague requests.
    """
    user_message = "I need help"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Agent might call KB search to find general help
    # OR might not call any tools and ask for clarification
    # Both are acceptable behaviors
    
    # Check that agent provides SOME response
    assert len(result['final_response']) > 0, "Agent should respond even to vague requests"
    
    # If tools were called, they should be relevant
    if result['tool_count'] > 0:
        tool_names = [tc['name'] for tc in result['tool_calls']]
        # Should use KB search (not ticket lookup with no ID)
        if 'lookup_ticket' in tool_names:
            pytest.fail("Agent shouldn't call lookup_ticket without a ticket ID")


def test_agent_extracts_from_noisy_input():
    """
    Test: Agent extracts ticket ID from messy/noisy user input.
    """
    user_message = "Hey, so like, I was wondering, could you maybe check ticket 5678 for me? Thanks!"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Agent should extract the ticket ID despite the noise
    ticket_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'lookup_ticket']
    assert len(ticket_calls) > 0, "Agent should extract ticket ID from noisy input"
    
    ticket_id = ticket_calls[0]['parameters'].get('ticket_id')
    assert ticket_id == "5678", f"Expected ticket '5678', but got: {ticket_id}"


def test_agent_handles_multiple_ticket_mentions():
    """
    Test: User mentions multiple tickets - agent should handle appropriately.
    """
    user_message = "Compare tickets 5678 and 1234"
    
    result = run_agent_and_capture_tools(agent, user_message)
    
    # Agent should look up both tickets
    ticket_calls = [tc for tc in result['tool_calls'] if tc['name'] == 'lookup_ticket']
    assert len(ticket_calls) >= 2, \
        f"Agent should look up both tickets, but made {len(ticket_calls)} lookup calls"
    
    # Check that both ticket IDs were used
    ticket_ids = [tc['parameters'].get('ticket_id') for tc in ticket_calls]
    assert '5678' in ticket_ids and '1234' in ticket_ids, \
        f"Agent should look up both ticket IDs, but looked up: {ticket_ids}"


# %%writefile reports the file it wrote, so no print statement is needed inside the test module

### Edge Case Testing Strategy

When testing edge cases, consider:

1. **Invalid inputs** - What if data is malformed?
2. **Missing information** - What if user doesn't provide required details?
3. **Ambiguity** - What if request could mean multiple things?
4. **Error conditions** - What if tools fail or return errors?
5. **Boundary conditions** - What about extreme values or edge values?

**Good agents degrade gracefully, not catastrophically!**
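
Error-condition tests often need to assert on wording, which varies between runs. One option is to keep the accepted phrasings in a single list instead of a long `or` chain. This is a sketch: `indicates_not_found` is a hypothetical helper, and the phrase list is an assumption you would tune against real responses.

```python
# Assumed phrasings; extend this tuple as you observe real agent responses.
NOT_FOUND_PHRASES = (
    "not found",
    "doesn't exist",
    "does not exist",
    "couldn't find",
    "could not find",
    "no record",
    "invalid",
)


def indicates_not_found(response: str) -> bool:
    """True if the response uses any accepted 'ticket not found' phrasing."""
    # Normalize curly apostrophes so "couldn’t" matches "couldn't".
    text = response.lower().replace("\u2019", "'")
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)


print(indicates_not_found("I couldn\u2019t find ticket 99999999."))
print(indicates_not_found("Ticket 5678 is In Progress."))
```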

## 7. Running All Agent Tests

Let's see all our tests in action!

In [None]:
# First, let's create the helper modules that our test files import
# This organizes our code better

# Create tools.py with our tool functions
tools_code = '''"""
IT Support Tools for ADK Agent Testing
Contains mock tool functions for ticket lookup, KB search, and system status checks.
"""

def lookup_ticket(ticket_id: str) -> dict:
    """
    Look up details for a support ticket.
    
    Args:
        ticket_id: The ticket ID to look up (e.g., "5678")
    
    Returns:
        Dictionary with ticket details
    """
    # Mock data for testing - in real system, would query database
    mock_tickets = {
        "5678": {
            "ticket_id": "5678",
            "status": "In Progress",
            "priority": "High",
            "user": "Alice Johnson",
            "issue": "Cannot access email",
            "assigned_to": "Tech Support Team"
        },
        "1234": {
            "ticket_id": "1234",
            "status": "Resolved",
            "priority": "Medium",
            "user": "Bob Smith",
            "issue": "Printer not working",
            "assigned_to": "Hardware Team"
        },
        "9999": {
            "ticket_id": "9999",
            "status": "Open",
            "priority": "Critical",
            "user": "Charlie Brown",
            "issue": "Server down",
            "assigned_to": "Infrastructure Team"
        }
    }
    
    if ticket_id in mock_tickets:
        return mock_tickets[ticket_id]
    else:
        return {"error": f"Ticket {ticket_id} not found"}


def search_knowledge_base(query: str) -> dict:
    """
    Search the IT knowledge base for help articles.
    
    Args:
        query: Search query (e.g., "how to reset password")
    
    Returns:
        Dictionary with search results
    """
    # Mock knowledge base articles
    mock_kb = {
        "password": [
            {"title": "How to Reset Your Password", "article_id": "KB001"},
            {"title": "Password Requirements", "article_id": "KB002"}
        ],
        "email": [
            {"title": "Troubleshooting Email Access", "article_id": "KB010"},
            {"title": "Email Configuration Guide", "article_id": "KB011"}
        ],
        "vpn": [
            {"title": "VPN Setup Instructions", "article_id": "KB020"},
            {"title": "VPN Connection Issues", "article_id": "KB021"}
        ],
        "printer": [
            {"title": "Printer Offline Solutions", "article_id": "KB030"},
            {"title": "How to Install Printer Drivers", "article_id": "KB031"}
        ]
    }
    
    query_lower = query.lower()
    results = []
    
    for keyword, articles in mock_kb.items():
        if keyword in query_lower:
            results.extend(articles)
    
    # An empty result list yields count 0, so no special case is needed
    return {"query": query, "results": results, "count": len(results)}


def check_system_status(service_name: str) -> dict:
    """
    Check the operational status of a service or system.
    
    Args:
        service_name: Name of the service (e.g., "email", "vpn", "database")
    
    Returns:
        Dictionary with service status
    """
    # Mock system status
    mock_status = {
        "email": {"service": "email", "status": "operational", "uptime": "99.9%"},
        "vpn": {"service": "vpn", "status": "operational", "uptime": "100%"},
        "database": {"service": "database", "status": "degraded", "uptime": "95.2%"},
        "file_server": {"service": "file_server", "status": "down", "uptime": "0%"},
        "web_portal": {"service": "web_portal", "status": "operational", "uptime": "99.5%"}
    }
    
    service_lower = service_name.lower().replace(" ", "_")
    
    if service_lower in mock_status:
        return mock_status[service_lower]
    else:
        return {"service": service_name, "status": "unknown"}
'''

# Write tools.py
with open('tools.py', 'w') as f:
    f.write(tools_code)

print("✅ Created tools.py")

# Create helpers.py with helper functions
helpers_code = '''"""
Helper functions for ADK agent testing
"""
import os
import asyncio
import nest_asyncio
from typing import Dict, Any

from google.adk.agents import LlmAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.adk.models.lite_llm import LiteLlm
from google.genai import types

from tools import lookup_ticket, search_knowledge_base, check_system_status

# Enable nested event loops for Colab
nest_asyncio.apply()


def create_it_support_agent():
    """Create an IT support agent for testing."""
    llm_model = LiteLlm(
        model=f"openai/{os.environ.get('OPENAI_MODEL', 'gpt-5-nano')}",
        api_key=os.environ.get('OPENAI_API_KEY')
    )
    
    agent = LlmAgent(
        name="it_support_agent",
        model=llm_model,
        description="An IT support agent that helps users with tickets, knowledge base searches, and system status checks",
        instruction="""You are an IT support agent. Help users with their IT issues by:
        1. Looking up ticket information when asked about specific tickets
        2. Searching the knowledge base for solutions to problems
        3. Checking system status when users report service issues
        
        Always use the appropriate tools to get accurate information.""",
        tools=[lookup_ticket, search_knowledge_base, check_system_status]
    )
    
    return agent


def run_agent_and_capture_tools(agent: LlmAgent, user_message: str, session_id: str = "test_session", user_id: str = "test_user") -> Dict[str, Any]:
    """
    Run the agent and capture tool calls for testing.
    
    Args:
        agent: The ADK agent to run
        user_message: User's message to the agent
        session_id: Session identifier for this conversation
        user_id: User identifier
    
    Returns:
        Dictionary with:
        - final_response: Agent's final text response
        - tool_calls: List of tool calls made
        - tool_count: Number of tools called
    """
    
    async def run_async():
        # Create a session service for managing conversation state
        session_service = InMemorySessionService()
        
        # Create the session
        await session_service.create_session(
            app_name="it_support_test",
            user_id=user_id,
            session_id=session_id,
            state={}
        )
        
        # Create a runner to execute the agent
        runner = Runner(
            app_name="it_support_test",
            agent=agent,
            session_service=session_service
        )
        
        # Format the user message in the required Content structure
        content = types.Content(
            role='user',
            parts=[types.Part(text=user_message)]
        )
        
        # Run the agent and collect events
        events = runner.run_async(
            user_id=user_id,
            session_id=session_id,
            new_message=content
        )
        
        # Extract tool calls and final response from events
        tool_calls = []
        final_response = ""
        
        async for event in events:
            # ADK events expose requested tool calls via get_function_calls(),
            # which returns google.genai FunctionCall objects (name + args)
            for function_call in event.get_function_calls():
                tool_calls.append({
                    'name': function_call.name,
                    'parameters': dict(function_call.args or {})
                })
            
            # Get the final response (guard against events without text content)
            if event.is_final_response() and event.content and event.content.parts:
                final_response = event.content.parts[0].text or ""
        
        return {
            'final_response': final_response,
            'tool_calls': tool_calls,
            'tool_count': len(tool_calls)
        }
    
    # nest_asyncio.apply() above lets asyncio.run() work even when Colab's
    # event loop is already running, so no separate code path is needed
    return asyncio.run(run_async())
'''

# Write helpers.py
with open('helpers.py', 'w') as f:
    f.write(helpers_code)

print("✅ Created helpers.py")
print("\n✅ Helper modules created successfully!")

In [None]:
# Run all agent tests
!pytest test_agent_tool_selection.py test_agent_parameters.py test_agent_multistep.py test_agent_edge_cases.py -v

## 8. Student Exercises 🎓

Now it's your turn! Apply what you've learned about testing agents.

### Exercise 1: Add a New Tool and Test It

Create a new tool `restart_service(service_name: str)` that simulates restarting an IT service.

**Requirements:**
1. Write the tool function
2. Add it to the agent
3. Write 2 tests:
   - Test that agent calls restart_service when asked to restart
   - Test that agent extracts correct service name

In [None]:
# Exercise 1: Your code here

def restart_service(service_name: str) -> dict:
    """
    TODO: Implement this function
    Should return a dict with restart status
    """
    pass

# TODO: Add to agent and write tests

### Exercise 2: Test Tool Call Order

Write a test for this scenario:
"Check ticket #1234 and restart the affected service"

**Assert:**
1. Agent calls `lookup_ticket` first
2. Agent calls `restart_service` second
3. Service name matches the ticket's issue

In [None]:
# Exercise 2: Your test here

def test_ticket_lookup_then_restart():
    """
    TODO: Test multi-step flow
    """
    pass

### Exercise 3: Test Unclear Ticket Information

Users don't always provide clear ticket IDs. Test these cases:
- "Check my ticket" (no ID provided)
- "Ticket ABC123" (invalid format)
- "Check tickets 1, 2, and 3" (multiple tickets)

What should the agent do in each case?

In [None]:
# Exercise 3: Your tests here

def test_no_ticket_id_provided():
    """TODO: Test agent behavior when no ticket ID given"""
    pass

def test_invalid_ticket_format():
    """TODO: Test agent behavior with invalid ticket format"""
    pass

def test_multiple_tickets():
    """TODO: Test agent handling multiple ticket IDs"""
    pass

### Exercise 4: Parameterized Parameter Testing

Use `@pytest.mark.parametrize` to test parameter extraction for 5 different ticket IDs in various formats:
- "5678"
- "#1234"
- "ticket 9999"
- "TICKET-5555"
- "id: 7777"

In [None]:
%%writefile exercise_4.py
# Exercise 4: Parameterized parameter extraction test

import pytest
from helpers import run_agent_and_capture_tools, create_it_support_agent

agent = create_it_support_agent()

# TODO: Write parameterized test
# @pytest.mark.parametrize(...)
# def test_extract_ticket_id_formats(...):
#     pass

### Exercise 5: Create a Failing Test

Write a test that you EXPECT to fail. Then explain:
1. Why it fails
2. Is it a bug in the agent or a test problem?
3. How would you fix it?

**Example failing scenarios:**
- Agent calls wrong tool
- Agent extracts wrong parameter
- Agent doesn't handle edge case
- Test is too strict

In [None]:
# Exercise 5: Your failing test and explanation

def test_that_will_fail():
    """
    TODO: Write a test that fails
    """
    pass

# TODO: Write explanation in markdown cell below

**Your explanation here:**

<!-- 
TODO: Explain:
- What test did you write?
- Why does it fail?
- Is it agent bug or test bug?
- How to fix?
-->

### Exercise 6: Build a Mini Test Suite

Create a comprehensive test suite (10+ tests) for your IT support agent.

**Must include:**
- 3 tool selection tests (one per tool)
- 3 parameter extraction tests
- 2 multi-step tests
- 2 edge case tests
- At least 1 parameterized test

**Bonus points for:**
- Clear test names
- Good assertions with helpful messages
- Testing realistic scenarios
- Well-organized test file

In [None]:
%%writefile exercise_6_test_suite.py
# Exercise 6: Comprehensive test suite

import os
import pytest
from helpers import run_agent_and_capture_tools, create_it_support_agent

agent = create_it_support_agent()

# TODO: Write your 10+ tests here
# Organize them into sections:
# - Tool Selection Tests
# - Parameter Extraction Tests
# - Multi-Step Tests
# - Edge Case Tests

# Example structure:
# def test_1_tool_selection_ticket():
#     pass
#
# def test_2_tool_selection_kb():
#     pass
# 
# ... etc

In [None]:
# Run your test suite
!pytest exercise_6_test_suite.py -v

## 9. Best Practices for Agent Testing

### ✅ DO:

1. **Test behavior, not text** - Focus on tool calls and parameters
2. **Test tool selection first** - Ensure agent picks the right tool
3. **Test parameter accuracy** - Verify extracted information is correct
4. **Test tool call sequences** - Multi-step reasoning matters
5. **Test edge cases** - Invalid input, missing data, ambiguity
6. **Use mock data** - Fast tests with fake/mock tool responses
7. **Test happy path AND failures** - Both success and error cases
8. **Write descriptive test names** - `test_agent_extracts_ticket_id()` not `test_1()`
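
The first three points can be sketched against the dictionary that `run_agent_and_capture_tools` returns. The `result` below is hand-written sample data in that shape (no agent or model is called), and `get_tool_call` is a hypothetical helper introduced for illustration, not part of ADK.

```python
from typing import Any, Dict, Optional

# Hand-written sample in the shape returned by run_agent_and_capture_tools.
# Illustrative data only; no agent or model is invoked here.
result: Dict[str, Any] = {
    "final_response": "Ticket #5678 is currently In Progress.",
    "tool_calls": [{"name": "lookup_ticket", "parameters": {"ticket_id": "5678"}}],
    "tool_count": 1,
}


def get_tool_call(result: Dict[str, Any], name: str) -> Optional[Dict[str, Any]]:
    """Return the first captured call to `name`, or None (hypothetical helper)."""
    for call in result["tool_calls"]:
        if call["name"] == name:
            return call
    return None


call = get_tool_call(result, "lookup_ticket")
# Tool selection: the agent picked the expected tool.
assert call is not None, f"Expected lookup_ticket, got {result['tool_calls']}"
# Parameter accuracy: the extracted ticket ID matches the request.
assert call["parameters"]["ticket_id"] == "5678"
```

Neither assertion inspects `final_response`, so the test keeps passing if the model rephrases its answer.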

### ❌ DON'T:

1. **Don't test only final text** - Tool calls are more important
2. **Don't demand exact tool sequences by default** - Assert order only where it matters; some variation is OK
3. **Don't skip edge cases** - That's where bugs hide
4. **Don't use real external services** - Slow and unreliable
5. **Don't test too many things in one test** - Keep tests focused

### Testing Hierarchy

**Priority 1: Critical Functionality**
- Does agent call the right tool?
- Does agent extract correct parameters?

**Priority 2: Complex Behavior**
- Multi-step reasoning
- Tool call ordering

**Priority 3: Edge Cases**
- Invalid inputs
- Error handling
- Ambiguous requests

**Priority 4: Output Quality**
- Response helpfulness
- Text clarity
- (This is for next lesson: LLM-as-judge!)

### Agent Testing vs LLM Testing

| Aspect | LLM Testing (Notebook 01) | Agent Testing (This Notebook) |
|--------|---------------------------|-------------------------------|
| **What to test** | Text outputs | Tool calls & parameters |
| **Assertions** | String contains, JSON structure | Tool names, parameter values |
| **Complexity** | Single response | Multi-step sequences |
| **Determinism** | Medium (LLM variability) | Mixed (tool logic is deterministic; tool selection still varies with the LLM) |
| **Focus** | Content correctness | Behavior correctness |

### Performance Tips

1. **Use mock data** - Don't hit real APIs in tests
2. **Parallel test execution** - pytest can run tests in parallel with the `pytest-xdist` plugin (`pytest -n auto`)
3. **Cache agent initialization** - Use pytest fixtures
4. **Test at the right level** - Don't test LLM internals

### Common Pitfalls

**Pitfall 1: Over-testing text**
```python
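# ❌ Brittle: exact match on generated text
assert response == "The ticket status is In Progress"
# ✅ Robust: assert on the tool the agent called
assert 'lookup_ticket' in [call['name'] for call in tool_calls]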
```

**Pitfall 2: Brittle sequence assertions**
```python
# ❌ Too strict: fails if the agent adds or reorders a call
assert tool_sequence == ['tool1', 'tool2', 'tool3']
# ✅ Flexible: only requires that both tools were called
assert 'tool1' in tool_sequence and 'tool2' in tool_sequence
```
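
When relative order does matter but extra calls are acceptable, a middle ground is to check that the expected tools appear as an ordered subsequence. `appears_in_order` below is a hypothetical helper written for this example, and the sample sequence is illustrative rather than captured from a real run.

```python
from typing import Iterable, List


def appears_in_order(expected: Iterable[str], actual: List[str]) -> bool:
    """True if `expected` names occur in `actual` in that relative order.

    Extra calls between or around the expected ones are tolerated.
    """
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first match,
    # so each expected name must be found after the previous one.
    return all(name in remaining for name in expected)


# Sample tool-name sequence (illustrative, not captured from a real run).
tool_sequence = ["lookup_ticket", "search_knowledge_base", "check_system_status"]

assert appears_in_order(["lookup_ticket", "check_system_status"], tool_sequence)
assert not appears_in_order(["check_system_status", "lookup_ticket"], tool_sequence)
```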

**Pitfall 3: Not testing edge cases**
```python
# ❌ Only test: "Check ticket 5678"
# ✅ Also test: "Check ticket XYZ", "Check ticket", "Check tickets 1, 2, 3"
```
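
Edge cases in the tools themselves can be checked without any model call, which keeps those tests fast and deterministic. The sketch below uses a condensed stand-in for `lookup_ticket` (two tickets only, same return shape as in `tools.py`); in the notebook you would import the real function from `tools` instead.

```python
def lookup_ticket(ticket_id: str) -> dict:
    """Condensed stand-in mirroring the error behaviour of tools.lookup_ticket."""
    mock_tickets = {
        "1234": {"ticket_id": "1234", "status": "Resolved"},
        "9999": {"ticket_id": "9999", "status": "Open"},
    }
    if ticket_id in mock_tickets:
        return mock_tickets[ticket_id]
    return {"error": f"Ticket {ticket_id} not found"}


# Unknown, malformed, and empty IDs should all yield an error dict, not raise.
for bad_id in ["XYZ", "#1234", ""]:
    response = lookup_ticket(bad_id)
    assert "error" in response, f"Expected error for {bad_id!r}, got {response}"

# A known ID still returns the ticket record.
assert lookup_ticket("1234")["status"] == "Resolved"
```

Note that `"#1234"` is rejected by the tool, so the agent is responsible for stripping the `#` before calling it. That is exactly the kind of behavior the agent-level edge case tests should pin down.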

## 10. Key Takeaways & Next Steps

### 🎉 What You've Learned

1. **Agents are different** - Test behavior (tool calls) not just text
2. **Tool selection matters** - Right tool = right functionality
3. **Parameters must be accurate** - Wrong parameters = wrong results
4. **Multi-step reasoning is testable** - Assert on sequences and order
5. **Edge cases reveal bugs** - Test invalid, ambiguous, and error cases
6. **ADK makes testing easier** - Structured tool calls are inspectable

### 🚀 What You Can Do Now

- ✅ Build testable agents with ADK
- ✅ Test tool selection and parameters
- ✅ Test multi-step agent reasoning
- ✅ Test edge cases and error handling
- ✅ Apply pytest patterns to agent testing
- ✅ Distinguish between agent testing and LLM testing

### 📚 Testing Levels Completed

**✅ Lesson 1:** Simple LLM Testing
- Text outputs
- Factual correctness
- Structured output validation

**✅ Lesson 2:** Agent Testing (This Lesson)
- Tool selection
- Parameter extraction
- Multi-step reasoning
- Edge cases

**🔜 Lesson 3:** Advanced Testing (Coming Next)
- LLM-as-judge for subjective quality
- Testing response helpfulness
- Testing conversation flow
- Integration testing

### 💡 Real-World Application

You've learned skills applicable to:
- 🎫 Customer support agents
- 📊 Data analysis agents
- 🔧 DevOps automation agents
- 📝 Content generation agents
- 🔍 Research and retrieval agents

### 🤔 Discussion Questions

1. When should you test tool calls vs final text output?
2. How do you decide if agent behavior is "correct" for ambiguous requests?
3. What's the tradeoff between strict and flexible test assertions?
4. How would you test an agent that uses 10+ tools?
5. Should you test tool call order strictly or flexibly? When does each make sense?

### 🛠️ Next Steps

1. **Complete all exercises** - Hands-on practice is essential
2. **Build your own agent** - Apply these patterns to a new domain
3. **Expand your test suite** - Aim for 80%+ coverage of behaviors
4. **Test in production scenarios** - Use real user queries as test cases

---

## 📝 Additional Resources

- [Google ADK Documentation](https://ai.google.dev/adk)
- [Pytest Documentation](https://docs.pytest.org/)
- [OpenAI API Documentation](https://platform.openai.com/docs/)
- [Agent Testing Best Practices](https://example.com/agent-testing)

---

**Excellent work! You've completed Lesson 2: Testing ADK Agents** 🎓

You can now test both simple LLMs and complex agents with tools. Next, we'll learn advanced techniques for testing subjective quality using LLM-as-judge!