# Building a Complete Evaluation Pipeline

By the end of this notebook, you will be able to:

1. Design structured evaluation datasets for IT support agents
2. Build a reusable `AgentEvaluator` class
3. Combine multiple test types (assertions, tool calls, LLM judges)
4. Generate comprehensive evaluation reports
5. Visualize evaluation results
6. Use evaluation for iterative agent improvement
7. Apply production-ready evaluation patterns

---

## 🎯 Context: Why Build an Evaluation Pipeline?

### Quick Recap: What You've Learned

**Notebook 01:** Pytest basics
- ✅ Automated testing with pytest
- ✅ Assertion-based tests for factual correctness
- ✅ Parameterized tests

**Notebook 02:** Agent testing
- ✅ Testing tool selection and parameters
- ✅ Multi-step reasoning validation
- ✅ Edge case handling

**Notebook 03:** LLM-as-judge
- ✅ Evaluating subjective quality
- ✅ Multi-dimension evaluation
- ✅ Batch evaluation

### The Problem: Testing at Scale

You now know THREE testing approaches... but:

❓ **How do you test 100+ different IT support scenarios?**  
❓ **How do you track quality over time as you improve your agent?**  
❓ **How do you prove your agent got better after changes?**  
❓ **How do you organize different test types efficiently?**

### The Solution: Evaluation Pipeline

**An evaluation pipeline is a systematic, reusable framework for testing at scale.**

```
Evaluation Dataset → AgentEvaluator → Report + Visualizations
     (Test Cases)      (Runs Tests)    (Shows Results)
```

**Key components:**
1. **Structured dataset** - Test cases in consistent format
2. **Unified evaluator** - Runs all test types
3. **Comprehensive reports** - Metrics, failures, insights
4. **Visualizations** - Charts to understand performance
5. **Iteration workflow** - Compare before/after improvements

### Today's Focus: IT Support Agent Evaluation

We'll build a complete evaluation pipeline for an **IT Support Helpdesk Agent**:

**What we'll test:**
- 📚 Technical knowledge (assertions)
- 🔧 Tool usage (tool call validation)
- 💬 Response quality (LLM judges)
- 🎯 Helpfulness and professionalism

**What we'll build:**
- Reusable `AgentEvaluator` class
- 10-case evaluation dataset
- Automated report generation
- Visualization dashboard
- Improvement tracking system

---

Let's get started! 🚀

## 1. Environment Setup

First, we'll install the required packages and set up our environment.

In [None]:
# Install required packages
!pip install openai "pydantic>=2.11.0" pandas matplotlib -q

In [None]:
# Import required libraries
import os
import time
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass, asdict
from datetime import datetime

# OpenAI and Pydantic
from openai import OpenAI
from pydantic import BaseModel, Field

# Data analysis and visualization
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

print("✅ All imports successful!")

### API Key Setup

Configure your OpenAI API key for the evaluation pipeline.

**💰 Cost Note:** We'll use the `gpt-5-nano` model. Running the full evaluation pipeline will cost approximately €0.10-0.20.

In [None]:
# Configure OpenAI API key
# Method 1: Try to get API key from Colab secrets (recommended)
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    print("✅ API key loaded from Colab secrets")
except Exception:  # Not running in Colab, or the secret is unavailable
    # Method 2: Manual input (fallback)
    from getpass import getpass
    print("💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY")
    OPENAI_API_KEY = getpass("Enter your OpenAI API Key: ")

# Validate the API key before using it (os.environ rejects None values)
if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == "":
    raise ValueError("❌ ERROR: No API key provided!")

# Set the API key as an environment variable
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

print("✅ Authentication configured!")

# Configure which OpenAI model to use
OPENAI_MODEL = "gpt-5-nano"  # Using gpt-5-nano for cost efficiency
print(f"🤖 Selected Model: {OPENAI_MODEL}")

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

## 2. Evaluation Dataset Structure

An evaluation dataset is a collection of test cases in a structured format.

### Test Case Format

Each test case is a dictionary with:
- **id**: Unique identifier (e.g., "TC001")
- **input**: User query or scenario
- **test_type**: One of `assertion`, `tool_call`, or `llm_judge`
- **expected**: Expected outcome (format depends on test_type)
- **criteria**: Evaluation criteria (mainly for llm_judge)
- **description**: Human-readable description
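
Because every test case shares the same keys, the format can be checked before any API calls are made. The following is a minimal sketch, assuming a hypothetical helper named `validate_test_case` (it is not part of the notebook's pipeline) and the field names listed above:

```python
from typing import Any, Dict, List

REQUIRED_KEYS = {"id", "input", "test_type", "expected", "criteria", "description"}
VALID_TEST_TYPES = {"assertion", "tool_call", "llm_judge"}


def validate_test_case(test_case: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one test case (empty list means valid)."""
    problems = []

    missing = REQUIRED_KEYS - set(test_case)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    test_type = test_case.get("test_type")
    if test_type not in VALID_TEST_TYPES:
        problems.append(f"unknown test_type: {test_type!r}")

    # Judge tests need criteria text, since it is inserted into the judge prompt
    if test_type == "llm_judge" and not test_case.get("criteria"):
        problems.append("llm_judge test requires non-empty criteria")

    return problems


good_case = {
    "id": "TC001",
    "description": "Technical knowledge: SSH port",
    "input": "What port does SSH use? Answer with just the number.",
    "test_type": "assertion",
    "expected": {"contains": "22"},
    "criteria": None,
}
bad_case = {"id": "TC999", "input": "Hello", "test_type": "regex"}

print(validate_test_case(good_case))
print(validate_test_case(bad_case))
```

Running a check like this over the whole dataset catches typos in `test_type` early, before they surface as a `ValueError` halfway through an evaluation run.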

### Why This Structure?

✅ **Consistent format** - Easy to process programmatically  
✅ **Flexible** - Supports different test types  
✅ **Maintainable** - Easy to add/modify test cases  
✅ **Version control friendly** - Can track changes over time  
✅ **Scalable** - Grows from 10 to 1000+ test cases
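
To make the "version control friendly" point concrete, the dataset can live in a JSON file next to the agent code. This is a sketch under assumptions: the file name `evaluation_dataset.json` is hypothetical, and a temporary directory is used here only so the example cleans up after itself.

```python
import json
import tempfile
from pathlib import Path

dataset = [
    {
        "id": "TC001",
        "description": "Technical knowledge: SSH port",
        "input": "What port does SSH use? Answer with just the number.",
        "test_type": "assertion",
        "expected": {"contains": "22"},
        "criteria": None,
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "evaluation_dataset.json"  # hypothetical file name

    # indent=2 keeps one field per line, so diffs stay small and readable
    path.write_text(json.dumps(dataset, indent=2, ensure_ascii=False), encoding="utf-8")

    # Python's None is stored as JSON null and restored as None on load
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == dataset)
```

Since the test cases are plain dictionaries of strings, lists, and `None`, they survive the JSON round trip unchanged.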

In [None]:
# Create evaluation dataset for IT support agent
EVALUATION_DATASET = [
    # ========================================
    # ASSERTION-BASED TESTS (Technical Knowledge)
    # ========================================
    {
        "id": "TC001",
        "description": "Technical knowledge: SSH port",
        "input": "What port does SSH use? Answer with just the number.",
        "test_type": "assertion",
        "expected": {"contains": "22"},
        "criteria": None
    },
    {
        "id": "TC002",
        "description": "Technical knowledge: HTTPS port",
        "input": "What port does HTTPS use? Answer with just the number.",
        "test_type": "assertion",
        "expected": {"contains": "443"},
        "criteria": None
    },
    {
        "id": "TC003",
        "description": "Technical knowledge: DNS definition",
        "input": "What does DNS stand for? Answer in one sentence.",
        "test_type": "assertion",
        "expected": {"contains": ["domain", "name", "system"]},
        "criteria": None
    },

    # ========================================
    # TOOL CALL TESTS (Agent Behavior)
    # ========================================
    {
        "id": "TC004",
        "description": "Tool call: Ticket lookup",
        "input": "Can you check the status of ticket #5678?",
        "test_type": "tool_call",
        "expected": {
            "tool_name": "lookup_ticket",
            "parameter": {"key": "ticket_id", "value": "5678"}
        },
        "criteria": None
    },
    {
        "id": "TC005",
        "description": "Tool call: Knowledge base search",
        "input": "How do I reset my password?",
        "test_type": "tool_call",
        "expected": {
            "tool_name": "search_knowledge_base",
            "parameter": {"key": "query", "contains": "password"}
        },
        "criteria": None
    },
    {
        "id": "TC006",
        "description": "Tool call: System status check",
        "input": "Is the email service working?",
        "test_type": "tool_call",
        "expected": {
            "tool_name": "check_system_status",
            "parameter": {"key": "service_name", "contains": "email"}
        },
        "criteria": None
    },

    # ========================================
    # LLM-AS-JUDGE TESTS (Response Quality)
    # ========================================
    {
        "id": "TC007",
        "description": "Judge: Helpfulness of troubleshooting advice",
        "input": "My computer won't start. What should I do?",
        "test_type": "llm_judge",
        "expected": {"min_score": 3},
        "criteria": "Helpfulness: Does the response provide actionable troubleshooting steps?"
    },
    {
        "id": "TC008",
        "description": "Judge: Clarity of technical explanation",
        "input": "What's the difference between RAM and storage?",
        "test_type": "llm_judge",
        "expected": {"min_score": 3},
        "criteria": "Clarity: Is the explanation clear and easy to understand for a non-technical user?"
    },
    {
        "id": "TC009",
        "description": "Judge: Professionalism and tone",
        "input": "I've been waiting for 2 hours and nobody helped me!",
        "test_type": "llm_judge",
        "expected": {"min_score": 4},
        "criteria": "Professionalism: Does the response handle the frustrated customer professionally and empathetically?"
    },
    {
        "id": "TC010",
        "description": "Judge: Completeness of response",
        "input": "How do I set up VPN on my work laptop?",
        "test_type": "llm_judge",
        "expected": {"min_score": 3},
        "criteria": "Completeness: Does the response provide comprehensive instructions or indicate where to find them?"
    }
]

print(f"✅ Evaluation dataset created with {len(EVALUATION_DATASET)} test cases")
print(f"\n📊 Test type breakdown:")
print(f"  - Assertion tests: {sum(1 for tc in EVALUATION_DATASET if tc['test_type'] == 'assertion')}")
print(f"  - Tool call tests: {sum(1 for tc in EVALUATION_DATASET if tc['test_type'] == 'tool_call')}")
print(f"  - LLM judge tests: {sum(1 for tc in EVALUATION_DATASET if tc['test_type'] == 'llm_judge')}")

### View Sample Test Cases

In [None]:
# Display one example of each test type
print("📋 EXAMPLE TEST CASES:\n")

for test_type in ["assertion", "tool_call", "llm_judge"]:
    example = next(tc for tc in EVALUATION_DATASET if tc['test_type'] == test_type)
    print(f"{'='*60}")
    print(f"Test Type: {test_type.upper()}")
    print(f"{'='*60}")
    print(f"ID: {example['id']}")
    print(f"Description: {example['description']}")
    print(f"Input: {example['input']}")
    print(f"Expected: {example['expected']}")
    if example['criteria']:
        print(f"Criteria: {example['criteria']}")
    print()

## 3. Building the AgentEvaluator Class

The `AgentEvaluator` class is the heart of our evaluation pipeline.

### Class Design

**Core methods:**
- `__init__(self, agent_function, tools=None)` - Initialize with agent to test
- `evaluate_case(self, test_case)` - Evaluate single test case
- `evaluate_all(self, dataset)` - Evaluate entire dataset
- `generate_report(self)` - Create comprehensive report

**Handles three test types:**
1. **Assertion** - String containment checks
2. **Tool call** - Mock tool validation (simplified for this notebook)
3. **LLM judge** - Quality evaluation
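
One caveat with plain string containment: a substring check for `22` also matches `2222`. If that matters for your test cases, a stricter variant could match whole tokens only. The sketch below assumes a hypothetical helper named `contains_token`; the evaluator in this notebook uses simple substring matching instead.

```python
import re


def contains_token(text: str, token: str) -> bool:
    """Case-insensitive match that rejects hits embedded in a longer word or number."""
    # (?<!\w) and (?!\w) require a non-word character (or string edge) on both sides
    pattern = rf"(?<!\w){re.escape(token)}(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print("22" in "Try port 2222 instead")            # plain substring check
print(contains_token("Try port 2222 instead", "22"))  # token-aware check
print(contains_token("SSH uses port 22.", "22"))
```

The trade-off is strictness: token matching avoids false passes on longer numbers, but it would also reject an answer like `port22` that a substring check accepts.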

### Result Format

Each evaluation produces:
```python
{
    'test_id': 'TC001',
    'passed': True/False,
    'score': 1-5 (for judge tests),
    'reasoning': 'Why it passed/failed',
    'duration': 1.23,  # seconds
    'agent_output': 'The actual response...'
}
```
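
The dictionary above is schematic. In code, a result like this is convenient to hold in a dataclass and convert with `dataclasses.asdict` when a plain dictionary is needed. A minimal sketch, using a hypothetical `ExampleResult` class that mirrors only the fields shown above:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExampleResult:
    """Hypothetical stand-in with the fields from the result format above."""
    test_id: str
    passed: bool
    score: Optional[int] = None  # stays None for assertion and tool call tests
    reasoning: str = ""
    duration: float = 0.0  # seconds
    agent_output: str = ""


result = ExampleResult(
    test_id="TC001",
    passed=True,
    reasoning="Expected '22' in response. Found.",
    duration=1.23,
    agent_output="22",
)

result_dict = asdict(result)
print(result_dict)
```

Note that `score` is `None` rather than `0` for tests without a judge, which keeps those tests out of any average score calculation.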

In [None]:
# Pydantic model for LLM judge evaluation
class JudgeEvaluation(BaseModel):
    """Structured evaluation from LLM judge."""
    score: int = Field(..., ge=1, le=5, description="Score from 1 (worst) to 5 (best)")
    reasoning: str = Field(..., description="Explanation for the score")


@dataclass
class EvaluationResult:
    """Result from evaluating a single test case."""
    test_id: str
    test_type: str
    description: str
    passed: bool
    score: Optional[int] = None  # Only for llm_judge tests
    reasoning: str = ""
    duration: float = 0.0
    agent_output: str = ""


class AgentEvaluator:
    """
    Unified evaluation class for IT support agents.

    Supports three test types:
    - assertion: String containment checks
    - tool_call: Tool selection and parameter validation
    - llm_judge: Quality evaluation using LLM-as-judge
    """

    def __init__(self, agent_function, tools=None):
        """
        Initialize evaluator.

        Args:
            agent_function: Function that takes user input and returns response
            tools: Optional dict of tool functions (for tool_call tests)
        """
        self.agent_function = agent_function
        self.tools = tools or {}
        self.results = []
        self.client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    def evaluate_case(self, test_case: Dict) -> EvaluationResult:
        """
        Evaluate a single test case.

        Args:
            test_case: Test case dictionary from evaluation dataset

        Returns:
            EvaluationResult with pass/fail status and details
        """
        start_time = time.time()

        # Get agent response
        agent_output = self.agent_function(test_case['input'])

        duration = time.time() - start_time

        # Route to appropriate evaluator based on test type
        if test_case['test_type'] == 'assertion':
            result = self._evaluate_assertion(test_case, agent_output)
        elif test_case['test_type'] == 'tool_call':
            result = self._evaluate_tool_call(test_case, agent_output)
        elif test_case['test_type'] == 'llm_judge':
            result = self._evaluate_with_judge(test_case, agent_output)
        else:
            raise ValueError(f"Unknown test type: {test_case['test_type']}")

        # Set common fields
        result.test_id = test_case['id']
        result.test_type = test_case['test_type']
        result.description = test_case['description']
        result.duration = duration
        result.agent_output = agent_output

        return result

    def _evaluate_assertion(self, test_case: Dict, agent_output: str) -> EvaluationResult:
        """Evaluate assertion-based test."""
        expected = test_case['expected']
        output_lower = agent_output.lower()

        if 'contains' in expected:
            contains_value = expected['contains']

            # Handle list of required strings
            if isinstance(contains_value, list):
                missing = [s for s in contains_value if s.lower() not in output_lower]
                passed = len(missing) == 0
                reasoning = f"Expected all of {contains_value}. " + \
                           (f"Missing: {missing}" if not passed else "All found.")
            else:
                # Single string
                passed = str(contains_value).lower() in output_lower
                reasoning = f"Expected '{contains_value}' in response. " + \
                           ("Found." if passed else "Not found.")
        else:
            passed = False
            reasoning = "Unknown assertion format"

        return EvaluationResult(
            test_id="",  # Will be set by evaluate_case
            test_type="assertion",
            description="",
            passed=passed,
            reasoning=reasoning
        )

    def _evaluate_tool_call(self, test_case: Dict, agent_output: str) -> EvaluationResult:
        """
        Evaluate tool call test.

        NOTE: This is a simplified version that checks if the agent's response
        mentions the expected tool or action. In a full ADK implementation,
        you would check actual tool calls from the agent.
        """
        expected = test_case['expected']
        tool_name = expected['tool_name']
        output_lower = agent_output.lower()

        # Check if response indicates the tool would be used
        # This is a proxy for actual tool call checking
        tool_keywords = {
            'lookup_ticket': ['ticket', 'check', 'status', 'lookup'],
            'search_knowledge_base': ['search', 'knowledge', 'article', 'guide'],
            'check_system_status': ['status', 'service', 'check', 'working']
        }

        keywords = tool_keywords.get(tool_name, [])
        mentions_tool = any(kw in output_lower for kw in keywords)

        # Check parameter if specified
        param_check = True
        param_reasoning = ""
        if 'parameter' in expected:
            param = expected['parameter']
            if 'value' in param:
                param_check = str(param['value']) in agent_output
                param_reasoning = f" Parameter '{param['key']}={param['value']}' " + \
                                 ("found" if param_check else "not found")
            elif 'contains' in param:
                param_check = param['contains'].lower() in output_lower
                param_reasoning = f" Parameter contains '{param['contains']}': " + \
                                 ("yes" if param_check else "no")

        passed = mentions_tool and param_check
        reasoning = f"Expected tool '{tool_name}' to be used. " + \
                   ("Indicators found." if mentions_tool else "No indicators found.") + \
                   param_reasoning

        return EvaluationResult(
            test_id="",
            test_type="tool_call",
            description="",
            passed=passed,
            reasoning=reasoning
        )

    def _evaluate_with_judge(self, test_case: Dict, agent_output: str) -> EvaluationResult:
        """Evaluate using LLM-as-judge."""
        judge_prompt = f"""
You are an expert evaluator of IT support responses.

USER QUERY:
{test_case['input']}

AGENT RESPONSE:
{agent_output}

EVALUATION CRITERIA:
{test_case['criteria']}

Evaluate the agent's response based on the criteria above.
Provide your evaluation as JSON:
{{
  "score": <integer from 1-5, where 1=very poor, 5=excellent>,
  "reasoning": "<detailed explanation>"
}}
"""

        try:
            # Call judge LLM
            response = self.client.responses.create(
                model=OPENAI_MODEL,
                input=judge_prompt
            )

            # Parse and validate
            result_json = json.loads(response.output_text)
            evaluation = JudgeEvaluation(**result_json)

            # Check if score meets threshold
            min_score = test_case['expected'].get('min_score', 3)
            passed = evaluation.score >= min_score

            return EvaluationResult(
                test_id="",
                test_type="llm_judge",
                description="",
                passed=passed,
                score=evaluation.score,
                reasoning=f"Score: {evaluation.score}/5 (threshold: {min_score}). {evaluation.reasoning}"
            )

        except Exception as e:
            return EvaluationResult(
                test_id="",
                test_type="llm_judge",
                description="",
                passed=False,
                reasoning=f"Judge evaluation failed: {str(e)}"
            )

    def evaluate_all(self, dataset: List[Dict]) -> List[EvaluationResult]:
        """
        Evaluate all test cases in dataset.

        Args:
            dataset: List of test case dictionaries

        Returns:
            List of EvaluationResult objects
        """
        self.results = []

        print(f"🚀 Starting evaluation of {len(dataset)} test cases...\n")

        for i, test_case in enumerate(dataset, 1):
            print(f"[{i}/{len(dataset)}] Evaluating {test_case['id']}: {test_case['description'][:50]}...")

            result = self.evaluate_case(test_case)
            self.results.append(result)

            # Show result
            status = "✅ PASS" if result.passed else "❌ FAIL"
            print(f"        {status} ({result.duration:.2f}s)")
            if not result.passed:
                print(f"        Reason: {result.reasoning[:80]}...")
            print()

        print(f"✅ Evaluation complete!\n")
        return self.results

    def generate_report(self) -> Dict[str, Any]:
        """
        Generate comprehensive evaluation report.

        Returns:
            Dictionary with metrics and analysis
        """
        if not self.results:
            return {"error": "No evaluation results available"}

        total_cases = len(self.results)
        passed_cases = sum(1 for r in self.results if r.passed)
        failed_cases = total_cases - passed_cases
        pass_rate = (passed_cases / total_cases) * 100

        # Calculate average score for judge tests
        judge_results = [r for r in self.results if r.score is not None]
        avg_score = sum(r.score for r in judge_results) / len(judge_results) if judge_results else None

        # Average duration
        avg_duration = sum(r.duration for r in self.results) / total_cases

        # Group by test type
        by_type = {}
        for result in self.results:
            if result.test_type not in by_type:
                by_type[result.test_type] = {'total': 0, 'passed': 0}
            by_type[result.test_type]['total'] += 1
            if result.passed:
                by_type[result.test_type]['passed'] += 1

        # Failed test details
        failures = [
            {
                'test_id': r.test_id,
                'description': r.description,
                'reasoning': r.reasoning
            }
            for r in self.results if not r.passed
        ]

        return {
            'timestamp': datetime.now().isoformat(),
            'total_cases': total_cases,
            'passed': passed_cases,
            'failed': failed_cases,
            'pass_rate': pass_rate,
            'average_score': avg_score,
            'average_duration': avg_duration,
            'by_test_type': by_type,
            'failures': failures
        }

print("✅ AgentEvaluator class defined!")

## 4. Creating the IT Support Agent

Let's create a simple IT support agent that we'll evaluate.
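
The evaluator only needs a callable that takes the user's query as a string and returns the response as a string. That makes it possible to dry-run evaluation logic offline before spending API credits. The sketch below assumes a hypothetical `stub_agent` with canned answers; the names `CANNED_RESPONSES`, `FALLBACK`, and `check_contains` are illustrative and not part of the notebook.

```python
from typing import Callable, Dict

# Hypothetical canned answers keyed by the exact query text
CANNED_RESPONSES: Dict[str, str] = {
    "What port does SSH use?": "SSH uses port 22 by default.",
    "What port does HTTPS use?": "HTTPS uses port 443.",
}
FALLBACK = "I'm not sure. Let me escalate this to a technician."


def stub_agent(user_query: str) -> str:
    """Offline stand-in for an agent: same str -> str contract, no API calls."""
    return CANNED_RESPONSES.get(user_query, FALLBACK)


def check_contains(agent: Callable[[str], str], query: str, expected: str) -> bool:
    """Run any agent callable and apply a case-insensitive containment check."""
    return expected.lower() in agent(query).lower()


print(check_contains(stub_agent, "What port does SSH use?", "22"))
print(check_contains(stub_agent, "What does DNS stand for?", "domain"))
```

Any function with this signature can be swapped in, which is what allows the same dataset to be run against different agent versions later.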

In [None]:
def it_support_agent(user_query: str) -> str:
    """
    IT Support agent - provides technical support assistance.

    Args:
        user_query: User's question or issue

    Returns:
        Agent's response
    """
    system_prompt = """
You are a helpful IT support agent for a company helpdesk.

Your role:
- Answer technical questions clearly
- Provide troubleshooting steps
- Be professional and empathetic
- Check ticket status when asked
- Search knowledge base for solutions
- Check system status when relevant

Always be concise but helpful.
"""

    response = client.responses.create(
        model=OPENAI_MODEL,
        input=f"System: {system_prompt}\n\nUser: {user_query}\n\nAssistant:"
    )

    return response.output_text

# Test the agent
test_response = it_support_agent("What port does SSH use?")
print("🤖 Agent test:")
print(f"Q: What port does SSH use?")
print(f"A: {test_response}")

## 5. Running Your First Evaluation

Now let's evaluate our IT support agent against the full dataset!

In [None]:
# Create evaluator
evaluator = AgentEvaluator(it_support_agent)

# Run evaluation
results = evaluator.evaluate_all(EVALUATION_DATASET)

## 6. Evaluation Report Generation

Let's generate and display a comprehensive report.
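
Since a report is a plain dictionary with a timestamp, one way to track quality over time is to append each report to a history file. The following is a sketch, assuming hypothetical helpers `append_report` and `load_history` and a JSON Lines file; the two sample reports contain made-up illustrative numbers, not real evaluation results.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def append_report(report: Dict[str, Any], history_path: Path) -> None:
    """Append one report as a single JSON line."""
    with history_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(report) + "\n")


def load_history(history_path: Path) -> List[Dict[str, Any]]:
    """Read all stored reports, oldest first."""
    with history_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative placeholder reports (values invented for the example)
run_1 = {"timestamp": "2025-01-01T10:00:00", "total_cases": 10, "passed": 7, "pass_rate": 70.0}
run_2 = {"timestamp": "2025-01-02T10:00:00", "total_cases": 10, "passed": 9, "pass_rate": 90.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    history_file = Path(tmp_dir) / "eval_history.jsonl"  # hypothetical file name
    append_report(run_1, history_file)
    append_report(run_2, history_file)
    history = load_history(history_file)

for entry in history:
    print(entry["timestamp"], f"{entry['pass_rate']:.1f}%")
```

An append-only file keeps earlier runs intact, so a later regression can be traced back to the run where it first appeared.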

In [None]:
# Generate report
report = evaluator.generate_report()

# Display report
print("="*60)
print("📊 EVALUATION REPORT")
print("="*60)
print(f"\n📅 Timestamp: {report['timestamp']}")
print(f"\n📈 Overall Metrics:")
print(f"  Total Test Cases: {report['total_cases']}")
print(f"  ✅ Passed: {report['passed']}")
print(f"  ❌ Failed: {report['failed']}")
print(f"  📊 Pass Rate: {report['pass_rate']:.1f}%")

if report['average_score'] is not None:
    print(f"  ⭐ Average Score (Judge tests): {report['average_score']:.2f}/5")

print(f"  ⏱️  Average Duration: {report['average_duration']:.2f}s")

print(f"\n📋 Results by Test Type:")
for test_type, stats in report['by_test_type'].items():
    pass_rate = (stats['passed'] / stats['total']) * 100
    print(f"  {test_type:15s}: {stats['passed']}/{stats['total']} passed ({pass_rate:.1f}%)")

if report['failures']:
    print(f"\n❌ Failed Tests:")
    for failure in report['failures']:
        print(f"\n  🚩 {failure['test_id']}: {failure['description']}")
        print(f"     Reason: {failure['reasoning'][:100]}...")
else:
    print(f"\n✅ All tests passed!")

print("\n" + "="*60)

### Convert Results to DataFrame for Analysis

In [None]:
# Convert results to pandas DataFrame
results_df = pd.DataFrame([asdict(r) for r in evaluator.results])

# Display DataFrame (rendered as table in Jupyter/Colab)
print("📊 Results DataFrame:")
display(results_df[['test_id', 'test_type', 'passed', 'score', 'duration']])

print(f"\n📈 Summary Statistics:")
display(results_df.groupby('test_type')['passed'].agg(['count', 'sum', 'mean']))

## 7. Visualizing Results

Let's create visualizations to better understand our evaluation results.

In [None]:
def visualize_results(evaluator: AgentEvaluator):
    """
    Create comprehensive visualization dashboard.

    Generates 4 charts:
    1. Score distribution (for judge tests)
    2. Pass/fail rate (pie chart)
    3. Response duration by test
    4. Pass rate by test type
    """
    results_df = pd.DataFrame([asdict(r) for r in evaluator.results])

    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    fig.suptitle('IT Support Agent Evaluation Dashboard', fontsize=16, fontweight='bold')

    # 1. Score Distribution (for judge tests)
    ax1 = axes[0, 0]
    judge_scores = results_df[results_df['score'].notna()]['score']
    if len(judge_scores) > 0:
        ax1.hist(judge_scores, bins=[0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
                edgecolor='black', color='skyblue', alpha=0.7)
        ax1.set_xlabel('Score', fontsize=12)
        ax1.set_ylabel('Frequency', fontsize=12)
        ax1.set_title('Score Distribution (LLM Judge Tests)', fontsize=12, fontweight='bold')
        ax1.set_xticks([1, 2, 3, 4, 5])
        ax1.grid(axis='y', alpha=0.3)
    else:
        ax1.text(0.5, 0.5, 'No judge test scores', ha='center', va='center', fontsize=12)
        ax1.set_title('Score Distribution', fontsize=12, fontweight='bold')

    # 2. Pass/Fail Rate (pie chart)
    ax2 = axes[0, 1]
    passed_count = results_df['passed'].sum()
    failed_count = len(results_df) - passed_count
    colors = ['#90EE90', '#FFB6C6']
    ax2.pie([passed_count, failed_count], labels=['Passed', 'Failed'],
           autopct='%1.1f%%', colors=colors, startangle=90)
    ax2.set_title('Overall Pass/Fail Rate', fontsize=12, fontweight='bold')

    # 3. Response Duration by Test
    ax3 = axes[1, 0]
    test_ids = results_df['test_id']
    durations = results_df['duration']
    colors_duration = ['green' if p else 'red' for p in results_df['passed']]
    ax3.bar(range(len(test_ids)), durations, color=colors_duration, alpha=0.6, edgecolor='black')
    ax3.set_xlabel('Test Case', fontsize=12)
    ax3.set_ylabel('Duration (seconds)', fontsize=12)
    ax3.set_title('Response Duration by Test', fontsize=12, fontweight='bold')
    ax3.set_xticks(range(len(test_ids)))
    ax3.set_xticklabels(test_ids, rotation=45, ha='right')
    ax3.grid(axis='y', alpha=0.3)

    # Legend
    green_patch = mpatches.Patch(color='green', alpha=0.6, label='Passed')
    red_patch = mpatches.Patch(color='red', alpha=0.6, label='Failed')
    ax3.legend(handles=[green_patch, red_patch], loc='upper right')

    # 4. Pass Rate by Test Type
    ax4 = axes[1, 1]
    type_stats = results_df.groupby('test_type')['passed'].agg(['sum', 'count'])
    type_stats['pass_rate'] = (type_stats['sum'] / type_stats['count']) * 100

    test_types = type_stats.index
    pass_rates = type_stats['pass_rate']
    colors_type = ['green' if pr >= 70 else 'orange' if pr >= 50 else 'red'
                   for pr in pass_rates]

    ax4.bar(test_types, pass_rates, color=colors_type, alpha=0.6, edgecolor='black')
    ax4.set_xlabel('Test Type', fontsize=12)
    ax4.set_ylabel('Pass Rate (%)', fontsize=12)
    ax4.set_title('Pass Rate by Test Type', fontsize=12, fontweight='bold', pad=20)
    ax4.set_ylim(0, 110)  # Extended to 110 to give space for percentage labels
    ax4.axhline(y=70, color='gray', linestyle='--', alpha=0.5, label='70% threshold')
    ax4.grid(axis='y', alpha=0.3)
    ax4.legend()

    # Add value labels on bars
    for i, (test_type, rate) in enumerate(zip(test_types, pass_rates)):
        ax4.text(i, rate + 2, f'{rate:.1f}%', ha='center', fontweight='bold')

    plt.tight_layout()
    plt.show()

# Generate visualizations
visualize_results(evaluator)

## 8. Iterative Improvement Workflow

One of the most powerful uses of an evaluation pipeline is **iterative improvement**.

### The Improvement Loop

```
1. Run evaluation → Get baseline metrics
2. Analyze failures → Identify patterns
3. Improve agent → Modify prompt/logic
4. Re-run evaluation → Get new metrics
5. Compare → Quantify improvement
6. Repeat
```

Let's demonstrate this workflow!

### Step 1: Baseline (Already Done)

We've already run our baseline evaluation above. Let's save those results.

In [None]:
# Save baseline results
baseline_report = evaluator.generate_report()

print("📊 Baseline Results:")
print(f"  Pass Rate: {baseline_report['pass_rate']:.1f}%")
print(f"  Passed: {baseline_report['passed']}/{baseline_report['total_cases']}")
if baseline_report['average_score'] is not None:
    print(f"  Avg Score: {baseline_report['average_score']:.2f}/5")

### Step 2: Analyze Failures

Look at what failed and why.
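
When there are many failures, counting them by test type is a quick way to spot patterns before reading individual reasons. A minimal sketch with `collections.Counter`; the `sample_results` list is invented for illustration and only mimics the fields of the evaluator's results.

```python
from collections import Counter

# Illustrative placeholder results (not real evaluation output)
sample_results = [
    {"test_id": "TC001", "test_type": "assertion", "passed": True},
    {"test_id": "TC004", "test_type": "tool_call", "passed": False},
    {"test_id": "TC005", "test_type": "tool_call", "passed": False},
    {"test_id": "TC007", "test_type": "llm_judge", "passed": True},
    {"test_id": "TC009", "test_type": "llm_judge", "passed": False},
]

# Count only the failed tests, grouped by their test type
failures_by_type = Counter(r["test_type"] for r in sample_results if not r["passed"])

for test_type, count in failures_by_type.most_common():
    print(f"{test_type}: {count} failure(s)")
```

A cluster of failures in one test type usually points at one cause, for example a prompt that never mentions looking up tickets, rather than several unrelated problems.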

In [None]:
if baseline_report['failures']:
    print("🔍 Analyzing Failures:\n")
    for failure in baseline_report['failures']:
        print(f"❌ {failure['test_id']}: {failure['description']}")
        print(f"   Reason: {failure['reasoning'][:150]}...")
        print()
else:
    print("✅ No failures to analyze - agent performing perfectly!")

### Step 3: Improve the Agent

Based on failures, let's create an improved version of our agent.

In [None]:
def improved_it_support_agent(user_query: str) -> str:
    """
    Improved IT Support agent with better instructions.

    Improvements:
    - More explicit about being helpful and thorough
    - Better handling of frustrated customers
    - Clearer troubleshooting steps
    - More professional tone guidance
    """
    system_prompt = """
You are a helpful IT support agent for a company helpdesk.

Your role:
- Answer technical questions with clear, accurate information
- Provide step-by-step troubleshooting instructions
- Be professional, empathetic, and patient - especially with frustrated users
- When checking tickets, mention you're looking it up
- When searching for solutions, indicate you're searching the knowledge base
- When checking system status, mention you're verifying service status

Guidelines:
- Always acknowledge the user's concern
- Provide actionable steps
- Be concise but thorough
- Use simple language for technical concepts
- Show empathy for frustrated customers

Examples:
- "Let me check the status of ticket #5678 for you..."
- "I'll search our knowledge base for password reset instructions..."
- "Let me verify the email service status..."
"""

    response = client.responses.create(
        model=OPENAI_MODEL,
        input=f"System: {system_prompt}\n\nUser: {user_query}\n\nAssistant:"
    )

    return response.output_text

print("✅ Improved agent created!")

### Step 4: Re-run Evaluation

Evaluate the improved agent on the same dataset.

In [None]:
# Create new evaluator with improved agent
improved_evaluator = AgentEvaluator(improved_it_support_agent)

# Run evaluation
improved_results = improved_evaluator.evaluate_all(EVALUATION_DATASET)

### Step 5: Compare Results

Compare baseline vs improved metrics.
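
Aggregate metrics can hide individual changes: one test may start passing while another starts failing, leaving the pass rate unchanged. A per-test comparison makes this visible. The sketch below assumes a hypothetical helper `diff_outcomes` that takes two mappings from test ID to pass/fail; the sample mappings are invented for illustration.

```python
from typing import Dict, List


def diff_outcomes(baseline: Dict[str, bool], improved: Dict[str, bool]) -> Dict[str, List[str]]:
    """List test IDs that flipped between two runs (only IDs present in both)."""
    shared = baseline.keys() & improved.keys()
    fixed = sorted(t for t in shared if not baseline[t] and improved[t])
    regressed = sorted(t for t in shared if baseline[t] and not improved[t])
    return {"fixed": fixed, "regressed": regressed}


# Illustrative placeholder outcomes (not real evaluation output)
baseline_outcomes = {"TC001": True, "TC004": False, "TC005": False, "TC009": True}
improved_outcomes = {"TC001": True, "TC004": True, "TC005": False, "TC009": False}

changes = diff_outcomes(baseline_outcomes, improved_outcomes)
print("Fixed:", changes["fixed"])
print("Regressed:", changes["regressed"])
```

With the evaluators from this notebook, the input mappings could be built as `{r.test_id: r.passed for r in evaluator.results}` for each run.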

In [None]:
# Generate improved report
improved_report = improved_evaluator.generate_report()

# Comparison
print("="*70)
print("📊 BEFORE vs AFTER COMPARISON")
print("="*70)

print(f"\n{'Metric':<30} {'Baseline':<20} {'Improved':<20} {'Change'}")
print("-" * 70)

# Pass rate
baseline_pass_rate = baseline_report['pass_rate']
improved_pass_rate = improved_report['pass_rate']
pass_rate_change = improved_pass_rate - baseline_pass_rate
print(f"{'Pass Rate':<30} {baseline_pass_rate:>6.1f}%{'':<13} {improved_pass_rate:>6.1f}%{'':<13} {pass_rate_change:+.1f}%")

# Passed tests
print(f"{'Tests Passed':<30} {baseline_report['passed']}/{baseline_report['total_cases']}{'':<16} "
      f"{improved_report['passed']}/{improved_report['total_cases']}{'':<16} "
      f"{improved_report['passed'] - baseline_report['passed']:+d}")

# Average score (if available)
if baseline_report['average_score'] is not None and improved_report['average_score'] is not None:
    score_change = improved_report['average_score'] - baseline_report['average_score']
    print(f"{'Avg Score (Judge)':<30} {baseline_report['average_score']:>6.2f}/5{'':<12} "
          f"{improved_report['average_score']:>6.2f}/5{'':<12} {score_change:+.2f}")

# Duration
duration_change = improved_report['average_duration'] - baseline_report['average_duration']
print(f"{'Avg Duration':<30} {baseline_report['average_duration']:>6.2f}s{'':<12} "
      f"{improved_report['average_duration']:>6.2f}s{'':<12} {duration_change:+.2f}s")

print("\n" + "="*70)

# Summary (pass rates are percentages, so their difference is in percentage points)
if improved_pass_rate > baseline_pass_rate:
    print(f"\n✅ IMPROVEMENT! Pass rate increased by {pass_rate_change:.1f} percentage points")
    print(f"   {improved_report['passed'] - baseline_report['passed']} more tests passing")
elif improved_pass_rate < baseline_pass_rate:
    print(f"\n⚠️  REGRESSION! Pass rate decreased by {abs(pass_rate_change):.1f} percentage points")
else:
    print("\n➡️  NO CHANGE in pass rate")

print("\n" + "="*70)

### Visualize the Improvement

In [None]:
# Side-by-side comparison chart
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
fig.suptitle('Baseline vs Improved Agent Comparison', fontsize=16, fontweight='bold')

# Chart 1: Pass Rate Comparison
ax1 = axes[0]
versions = ['Baseline', 'Improved']
pass_rates = [baseline_report['pass_rate'], improved_report['pass_rate']]
colors = ['#FFB6C6', '#90EE90']

bars = ax1.bar(versions, pass_rates, color=colors, alpha=0.7, edgecolor='black')
ax1.set_ylabel('Pass Rate (%)', fontsize=12)
ax1.set_title('Pass Rate Comparison', fontsize=12, fontweight='bold')
ax1.set_ylim(0, 100)
ax1.axhline(y=70, color='gray', linestyle='--', alpha=0.5, label='70% target')
ax1.grid(axis='y', alpha=0.3)
ax1.legend()

# Add value labels just above each bar
for i, rate in enumerate(pass_rates):
    ax1.text(i, rate + 2, f'{rate:.1f}%', ha='center', fontweight='bold', fontsize=12)

# Chart 2: Test Results Breakdown
ax2 = axes[1]
x = [0, 1]
width = 0.35

passed_baseline = baseline_report['passed']
failed_baseline = baseline_report['failed']
passed_improved = improved_report['passed']
failed_improved = improved_report['failed']

ax2.bar([i - width/2 for i in x], [passed_baseline, passed_improved],
        width, label='Passed', color='#90EE90', alpha=0.7, edgecolor='black')
ax2.bar([i + width/2 for i in x], [failed_baseline, failed_improved],
        width, label='Failed', color='#FFB6C6', alpha=0.7, edgecolor='black')

ax2.set_ylabel('Number of Tests', fontsize=12)
ax2.set_title('Test Results Breakdown', fontsize=12, fontweight='bold')
ax2.set_xticks(x)
ax2.set_xticklabels(versions)
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 9. Advanced: Exporting and Tracking Results

In production, you'll want to track evaluation results over time. Tracking lets you:

- 📈 **Monitor trends** - Is quality improving or degrading?
- 🐛 **Detect regressions** - Did a change break something?
- 📊 **Prove ROI** - Show quantitative improvements
- 🎯 **Set benchmarks** - Track progress toward goals

### Best Practices

1. **Save results to CSV** - Easy to analyze in spreadsheets
2. **Version control** - Track evaluation dataset changes
3. **Timestamp everything** - Know when each evaluation ran
4. **Tag versions** - Link to agent/model versions
5. **Automate** - Run evals in CI/CD pipeline

### Example: Export to CSV (Theoretical)

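```python
# In a real project, you would do something like this:
from dataclasses import asdict
from datetime import datetime
import pandas as pd

# Capture the timestamp once so the column and the filename agree
run_time = datetime.now()

# Assumes evaluator.results is a list of dataclass instances
results_df = pd.DataFrame([asdict(r) for r in evaluator.results])
results_df['timestamp'] = run_time
results_df['agent_version'] = 'v1.2.3'
results_df.to_csv(f'eval_results_{run_time:%Y%m%d_%H%M%S}.csv', index=False)

# Load and compare two runs (placeholder filenames in the same format):
df1 = pd.read_csv('eval_results_20250101_090000.csv')
df2 = pd.read_csv('eval_results_20250201_090000.csv')
comparison = pd.merge(df1, df2, on='test_id', suffixes=('_jan', '_feb'))
```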
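
Once two runs are loaded, the most useful question is usually which tests changed outcome. The following is a minimal, dependency-free sketch that uses the standard `csv` module instead of pandas. It assumes each exported file has `test_id` and `passed` columns; the `passed` column name, the helper names, and the sample rows are all hypothetical.

```python
import csv
import io

def load_outcomes(csv_text):
    """Map test_id -> passed (bool) from CSV text with 'test_id' and 'passed' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["test_id"]: row["passed"].strip().lower() == "true" for row in reader}

def diff_runs(old, new):
    """Classify tests present in both runs by how their outcome changed."""
    shared = sorted(old.keys() & new.keys())
    return {
        "regressions": [t for t in shared if old[t] and not new[t]],
        "fixes": [t for t in shared if not old[t] and new[t]],
    }

# Hypothetical data standing in for two exported runs
january = "test_id,passed\nkb_001,True\ntool_002,False\njudge_003,True\n"
february = "test_id,passed\nkb_001,True\ntool_002,True\njudge_003,False\n"

changes = diff_runs(load_outcomes(january), load_outcomes(february))
print(changes)
```

A test listed under `regressions` deserves attention even when the overall pass rate went up, because an aggregate gain can hide an individual break.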

### Continuous Monitoring Dashboard (Conceptual)

In production, you might build:
- **Daily automated evals** - Run tests nightly
- **Slack notifications** - Alert on regressions
- **Grafana dashboard** - Visualize trends
- **Threshold alerts** - Warn if pass rate drops below 80%
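
The alerting logic behind such a setup can be quite small. Here is one possible sketch: it only builds the alert messages and leaves delivery (Slack, email, and so on) to whatever your team uses. The function name and the `max_drop` default of 5 points are assumptions for illustration; the 80% floor comes from the list above.

```python
def check_for_alerts(current_pass_rate, previous_pass_rate,
                     min_pass_rate=80.0, max_drop=5.0):
    """Return human-readable alert messages; an empty list means no alert."""
    alerts = []
    # Absolute floor: warn whenever the pass rate is below the threshold
    if current_pass_rate < min_pass_rate:
        alerts.append(
            f"Pass rate {current_pass_rate:.1f}% is below the {min_pass_rate:.1f}% threshold"
        )
    # Relative check: warn on a sharp drop compared with the previous run
    drop = previous_pass_rate - current_pass_rate
    if drop > max_drop:
        alerts.append(f"Pass rate dropped {drop:.1f} points since the previous run")
    return alerts

# Hypothetical nightly run: 85% yesterday, 72% today
for message in check_for_alerts(current_pass_rate=72.0, previous_pass_rate=85.0):
    print(message)
```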

## 10. Best Practices for Evaluation Pipelines

### 1. Dataset Management

✅ **DO:**
- Keep evaluation dataset in version control
- Start small (10-20 cases), grow over time
- Add new cases when you find edge cases
- Balance different test types
- Document why each test exists

❌ **DON'T:**
- Hardcode test data in code
- Remove failing tests just to boost metrics
- Only test happy paths
- Create redundant test cases
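
As an illustration of keeping test data out of code, the dataset can live in a JSON file under version control and be validated on load. This is a sketch only: the field names (`test_id`, `query`, `test_type`, `rationale`) are a hypothetical schema and may not match the `EVALUATION_DATASET` format used earlier in this notebook.

```python
import json

REQUIRED_FIELDS = {"test_id", "query", "test_type"}  # hypothetical schema

def load_dataset(json_text):
    """Parse test cases from JSON text and reject malformed or duplicate entries."""
    cases = json.loads(json_text)
    seen = set()
    for case in cases:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{case.get('test_id', '<no id>')}: missing fields {sorted(missing)}")
        if case["test_id"] in seen:
            raise ValueError(f"duplicate test_id: {case['test_id']}")
        seen.add(case["test_id"])
    return cases

# In practice this text would come from a file such as open(path).read()
raw = """[
  {"test_id": "kb_001", "query": "How do I reset my password?",
   "test_type": "assertion", "rationale": "Most common helpdesk request"},
  {"test_id": "edge_001", "query": "", "test_type": "judge",
   "rationale": "Empty input should get a polite clarifying question"}
]"""

dataset = load_dataset(raw)
print(f"Loaded {len(dataset)} test cases")
```

The optional `rationale` field is one lightweight way to document why each test exists.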

### 2. Evaluation Frequency

**When to run evaluations:**
- ✅ Before every deployment
- ✅ After every prompt change
- ✅ After model upgrades
- ✅ Nightly (automated)
- ✅ When adding new features

### 3. Thresholds and Goals

**Set clear targets:**
- Overall pass rate: 80%+
- Critical tests: 100%
- Judge tests: Average score of 4+ out of 5
- Response time: <2s per test
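
These targets can be turned into an automated gate. The sketch below reuses the report keys seen earlier (`pass_rate`, `average_score`, `average_duration`); the `critical_failures` argument is hypothetical, since it assumes you tag some tests as critical and count their failures separately. Average duration is used here as a rough proxy for the per-test response time target.

```python
def check_targets(report, critical_failures=0):
    """Return a list of unmet targets; an empty list means the run passes the gate."""
    problems = []
    if report["pass_rate"] < 80.0:
        problems.append("overall pass rate below 80%")
    if critical_failures > 0:
        problems.append(f"{critical_failures} critical test(s) failed")
    # The average score may be missing when a run contains no judge tests
    score = report.get("average_score")
    if score is not None and score < 4.0:
        problems.append("average judge score below 4")
    if report["average_duration"] >= 2.0:
        problems.append("average response time is 2s or more")
    return problems

# Hypothetical report values for illustration
sample_report = {"pass_rate": 85.0, "average_score": 3.6, "average_duration": 1.4}
problems = check_targets(sample_report, critical_failures=1)
print(problems)
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so the deployment stops.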

### 4. Test Type Balance

**Recommended mix:**
- 30% Assertion tests (fast, deterministic)
- 30% Tool/behavior tests (validate agent logic)
- 40% Judge tests (quality and subjective measures)
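
To see how a dataset compares with this mix, you can count test cases by type. A minimal sketch, assuming each case is a dict with a `test_type` key (the key name and the type labels are hypothetical):

```python
from collections import Counter

def type_mix(cases):
    """Return the percentage share of each test type in the dataset."""
    counts = Counter(case["test_type"] for case in cases)
    total = sum(counts.values())
    return {test_type: round(100 * n / total, 1) for test_type, n in counts.items()}

# Hypothetical 10-case dataset: 3 assertion, 3 tool, 4 judge
sample_cases = (
    [{"test_type": "assertion"}] * 3
    + [{"test_type": "tool"}] * 3
    + [{"test_type": "judge"}] * 4
)

mix = type_mix(sample_cases)
print(mix)
```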

### 5. Iteration Strategy

**Improvement workflow:**
1. Run baseline evaluation
2. Fix the WORST failures first
3. Don't try to fix everything at once
4. Re-evaluate after each change
5. Watch for regressions
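
One way to decide which failures count as the "worst" is to group them by category and start with the largest group. The sketch below assumes each result is a dict with `category` and `passed` keys; both names and the sample data are hypothetical.

```python
from collections import Counter

def rank_failures(results):
    """Return (category, failure_count) pairs, largest group first."""
    failed = Counter(r["category"] for r in results if not r["passed"])
    return failed.most_common()

# Hypothetical results from one evaluation run
sample_results = [
    {"category": "tool_selection", "passed": False},
    {"category": "tool_selection", "passed": False},
    {"category": "tone", "passed": False},
    {"category": "tool_selection", "passed": False},
    {"category": "knowledge", "passed": True},
]

ranking = rank_failures(sample_results)
print(ranking)
```

Here the ranking suggests working on tool selection first, then re-running the evaluation before touching tone.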

### 6. Cost Management

**LLM-as-judge is expensive:**
- Each judge test = 2 API calls (one for the agent, one for the judge)
- 100 tests, 40 of them judge tests = ~140 API calls (100 agent calls + 40 judge calls)
- Use gpt-5-nano for cost efficiency
- Cache agent responses when possible
- Run the full suite less often, and a smaller subset on each change
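
The call-count arithmetic and the caching idea can be sketched as follows. `fake_agent` is a stand-in for a real agent function, and `functools.lru_cache` is just one simple way to cache; it only helps when the same query is sent more than once with an unchanged agent, so clear the cache after any prompt or model change.

```python
from functools import lru_cache

def estimate_api_calls(total_tests, judge_tests):
    """Every test calls the agent once; each judge test adds one judge call."""
    return total_tests + judge_tests

calls = []  # records every real (uncached) agent invocation

def fake_agent(query):
    """Stand-in for a real agent call that would hit the API."""
    calls.append(query)
    return f"answer to: {query}"

# Wrap the agent so repeated identical queries reuse the stored response
cached_agent = lru_cache(maxsize=None)(fake_agent)

first = cached_agent("How do I reset my password?")
second = cached_agent("How do I reset my password?")

print(estimate_api_calls(total_tests=100, judge_tests=40))
print(f"Real agent calls made: {len(calls)}")

# After changing the agent or its prompt, drop stale responses:
cached_agent.cache_clear()
```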

### 7. Documentation

**Document everything:**
- What each test validates
- Why thresholds are set at specific values
- How to interpret results
- Who to notify on failures
- How to add new tests

### 8. Team Integration

**Make evals part of your workflow:**
- Run in CI/CD pipeline
- Review results in PR reviews
- Share reports with stakeholders
- Celebrate improvements!

### 9. Common Pitfalls to Avoid

⚠️ **"Gaming" the metrics:**
- Don't remove failing tests to boost pass rate
- Don't lower thresholds without good reason
- Don't cherry-pick which tests to include

⚠️ **Overfitting to the eval set:**
- Don't optimize only for the eval dataset
- Real users will have different queries
- Keep adding new test cases

⚠️ **Ignoring edge cases:**
- Test error conditions
- Test unusual inputs
- Test adversarial queries

