# Lab 07: Evaluations

## Overview

In this final notebook, we run comprehensive evaluations across all deployed agent versions to validate that optimizations improved cost and latency **without degrading quality**. Quality is measured here as the success rate across the test scenarios.

**What you'll learn:**
- How to run systematic evaluations across agent versions
- How to compare metrics side-by-side
- How to validate quality hasn't degraded
- How to generate a final optimization report

## Prerequisites

- Completed Labs 01-06 (all agent versions deployed)

## Workshop Journey

```
01 Baseline → 02 Quick Wins → 03 Caching → 04 Routing → 05 Guardrails → 06 Gateway → [07 Evaluations]
                                                                                          ↑
                                                                                     You are here
```

## Step 1: Setup

In [None]:
from __future__ import annotations

import json
import os
import time
import uuid

from dotenv import load_dotenv

load_dotenv(override=True)

import boto3
import pandas as pd

region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
control_client = boto3.client("bedrock-agentcore-control", region_name=region)
data_client = boto3.client("bedrock-agentcore", region_name=region)

print(f"Region: {region}")
print(f"Langfuse Host: {os.environ.get('LANGFUSE_BASE_URL', 'https://cloud.langfuse.com')}")

## Step 2: Find All Deployed Agent Versions

In [None]:
def find_agent_by_name(name_pattern):
    """Find agent ARN by name pattern (handles both hyphen and underscore naming)."""
    kwargs = {}
    while True:
        # list_agent_runtimes is paginated; follow nextToken so no runtime is missed
        response = control_client.list_agent_runtimes(**kwargs)
        for agent in response.get("agentRuntimes", []):
            agent_name = agent["agentRuntimeName"]
            # Handle both hyphen and underscore naming conventions
            if name_pattern.replace("-", "_") in agent_name or name_pattern in agent_name:
                return agent["agentRuntimeArn"]
        next_token = response.get("nextToken")
        if not next_token:
            return None
        kwargs["nextToken"] = next_token


# Find all agent versions
versions = {
    "v1-baseline": find_agent_by_name("v1-baseline"),
    "v2-quick-wins": find_agent_by_name("v2-quick-wins"),
    "v3-caching": find_agent_by_name("v3-caching"),
    "v4-routing": find_agent_by_name("v4-routing"),
    "v5-guardrails": find_agent_by_name("v5-guardrails"),
    "v6-gateway": find_agent_by_name("v6-gateway"),
}

print("Found agent versions:")
for name, arn in versions.items():
    status = "Found" if arn else "Not found"
    print(f"  {name}: {status}")
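
The matching rule used by `find_agent_by_name` can be exercised offline, without calling AWS. The sketch below factors it into a hypothetical pure helper, `matches_agent_name`, which is not part of the workshop code; the runtime names are assumed examples only.

```python
def matches_agent_name(name_pattern: str, agent_name: str) -> bool:
    """Return True if the pattern occurs in the runtime name in hyphen or underscore form."""
    return name_pattern.replace("-", "_") in agent_name or name_pattern in agent_name


# Hypothetical runtime names, used only for illustration
print(matches_agent_name("v1-baseline", "customer_support_v1_baseline"))
print(matches_agent_name("v1-baseline", "customer-support-v1-baseline"))
print(matches_agent_name("v3-caching", "customer_support_v1_baseline"))
```

Because this is a substring match, a pattern such as `v1-baseline` would also match a runtime named `customer_support_v1_baseline_old`, so keep runtime names distinct if you redeploy a lab.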

## Step 3: Load Test Scenarios

In [None]:
# Standard test prompts - each demonstrates a specific tool usage pattern
TEST_PROMPTS = [
    # Single tool: get_return_policy
    {"id": "return-policy", "query": "What is your return policy for laptops?"},
    # Single tool: get_product_info
    {"id": "product-info", "query": "Tell me about your smartphone options"},
    # Single tool: get_technical_support (Bedrock KB)
    {"id": "technical-support", "query": "My laptop won't turn on, can you help me troubleshoot?"},
    # Multi-tool: get_product_info + get_return_policy
    {"id": "multi-part", "query": "I want to buy a laptop. What are the specs and what's the return policy?"},
    # No tool: General greeting
    {"id": "general", "query": "Hello! What can you help me with today?"},
]

test_scenarios = TEST_PROMPTS

print(f"Loaded {len(test_scenarios)} test scenarios:")
for scenario in test_scenarios:
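    # Only append an ellipsis when the query is actually truncated
    preview = scenario["query"] if len(scenario["query"]) <= 50 else scenario["query"][:50] + "..."
    print(f"  - {scenario['id']}: {preview}")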

In [None]:
# Import Langfuse metrics helper
from utils.langfuse_metrics import clear_metrics, get_latest_trace_metrics

print("Langfuse metrics helper imported")
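
The comments in `TEST_PROMPTS` name the tool or tools each prompt is meant to exercise. One way to turn those comments into a quality check is to record them as data and compare them with the tools an agent actually called. This is only a sketch: `EXPECTED_TOOLS`, `check_tool_usage`, and the assumption that observed tool names are available (for example, from your trace data) are hypothetical and not part of the workshop helpers.

```python
# Expected tools per scenario id, transcribed from the comments in TEST_PROMPTS
EXPECTED_TOOLS = {
    "return-policy": {"get_return_policy"},
    "product-info": {"get_product_info"},
    "technical-support": {"get_technical_support"},
    "multi-part": {"get_product_info", "get_return_policy"},
    "general": set(),
}


def check_tool_usage(scenario_id, observed_tools):
    """Compare the tools an agent called with the tools the scenario expects."""
    expected = EXPECTED_TOOLS[scenario_id]
    observed = set(observed_tools)
    return {
        "scenario_id": scenario_id,
        "passed": observed == expected,
        "missing": sorted(expected - observed),
        "unexpected": sorted(observed - expected),
    }


# Illustrative input only; real tool names would come from your own trace data
print(check_tool_usage("multi-part", ["get_product_info"]))
```

A check like this catches a version that still returns a response but silently skips a tool, which a success-rate metric alone would not reveal.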

## Step 4: Run Evaluations

In [None]:
# Map version name to agent trace name (uses hyphens)
TRACE_NAME_MAP = {
    "v1-baseline": "customer-support-v1-baseline",
    "v2-quick-wins": "customer-support-v2-quick-wins",
    "v3-caching": "customer-support-v3-caching",
    "v4-routing": "customer-support-v4-routing",
    "v5-guardrails": "customer-support-v5-guardrails",
    "v6-gateway": "customer-support-v6-gateway",
}


def invoke_agent(agent_arn, prompt):
    """Invoke agent and measure client-side end-to-end latency in milliseconds."""
    start_time = time.perf_counter()  # Monotonic clock, unaffected by system clock changes
    response = data_client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        runtimeSessionId=str(uuid.uuid4()),
        payload=json.dumps({"prompt": prompt}).encode(),
    )
    # Read the body before stopping the timer so latency covers the full response
    result = json.loads(response["response"].read().decode("utf-8"))
    latency_ms = (time.perf_counter() - start_time) * 1000
    return result, latency_ms


def run_evaluation(version_name, agent_arn, scenarios):
    """Run all scenarios against an agent version and collect Langfuse metrics."""
    results = []
    total_latency = 0
    successful = 0
    langfuse_metrics_list = []

    agent_trace_name = TRACE_NAME_MAP.get(version_name, version_name)

    print(f"\nEvaluating {version_name}...")
    clear_metrics()  # Clear for this version

    for scenario in scenarios:
        try:
            _result, latency = invoke_agent(agent_arn, scenario["query"])

            # Fetch Langfuse metrics for this trace
            metrics = get_latest_trace_metrics(
                agent_name=agent_trace_name,
                wait_seconds=3,
                max_retries=3,
                timeout_seconds=60,
            )
            langfuse_metrics_list.append(metrics)

            results.append(
                {
                    "scenario_id": scenario["id"],
                    "success": True,
                    "latency_ms": latency,
                }
            )
            total_latency += latency
            successful += 1
            print(f"  [{scenario['id']}] {latency:.0f}ms")
        except Exception as e:
            results.append(
                {
                    "scenario_id": scenario["id"],
                    "success": False,
                    "error": str(e),
                }
            )
            langfuse_metrics_list.append({"error": str(e)})
            print(f"  [{scenario['id']}] FAILED: {e}")

    return {
        "version": version_name,
        "results": results,
        "total_scenarios": len(scenarios),
        "successful": successful,
        "avg_latency_ms": total_latency / successful if successful > 0 else 0,
        "langfuse_metrics": langfuse_metrics_list,
    }

In [None]:
# Run evaluations for all available versions
all_results = {}

for version_name, agent_arn in versions.items():
    if agent_arn:
        all_results[version_name] = run_evaluation(version_name, agent_arn, test_scenarios)
    else:
        print(f"\nSkipping {version_name} (not deployed)")

## Step 5: Generate Comparison Report

**A note on sample size:** This evaluation runs 5 test queries per version. With such a small sample, individual runs can vary: one version might appear slightly slower or costlier than expected because of network jitter, cold starts, or cache timing. Don't over-index on small differences between adjacent versions (e.g., v4 vs v5).

What to look for:
- **Overall trend** from v1 → v6: cost and latency should generally decrease
- **Large, consistent gains** (e.g., caching in v3, routing in v4) will be visible even at small scale
- **Small or marginal differences** between versions may not be statistically significant

In production, you would run a larger evaluation set (50-100+ queries) to get stable, reliable comparisons with tighter confidence intervals.
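
To see why five samples give loose comparisons, it can help to put a rough interval around each mean. The sketch below uses only the standard library and a normal approximation (mean ± 1.96 standard errors), which is crude at n = 5. The latency values are made up for illustration, and `mean_with_interval` is a hypothetical helper, not part of the workshop utilities.

```python
import math
import statistics


def mean_with_interval(samples, z=1.96):
    """Return (mean, low, high) using a normal-approximation interval around the mean."""
    mean = statistics.mean(samples)
    if len(samples) < 2:
        return mean, mean, mean  # No spread estimate is possible from one sample
    std_err = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, mean - z * std_err, mean + z * std_err


# Made-up latencies (seconds) for two adjacent versions
v4_latencies = [2.1, 2.4, 1.9, 3.0, 2.2]
v5_latencies = [2.0, 2.6, 2.1, 2.3, 2.5]

for name, samples in [("v4", v4_latencies), ("v5", v5_latencies)]:
    mean, low, high = mean_with_interval(samples)
    print(f"{name}: mean={mean:.2f}s, approx. 95% interval=({low:.2f}s, {high:.2f}s)")
```

When the intervals for two versions overlap heavily, as they do for these sample values, the difference between their means is likely noise rather than a real effect.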

In [None]:
# Create comparison DataFrame
rows = []
for version_name, eval_result in all_results.items():
    success_rate = (eval_result["successful"] / eval_result["total_scenarios"]) * 100

    # Use Langfuse latency (server-side) instead of client-side E2E latency
    lf_metrics = eval_result.get("langfuse_metrics", [])
    valid = [m for m in lf_metrics if "error" not in m]
    avg_latency_s = sum(m.get("latency_seconds", 0) or 0 for m in valid) / len(valid) if valid else 0

    rows.append(
        {
            "Version": version_name,
            "Success Rate": f"{success_rate:.0f}%",
            "Avg Latency (s)": f"{avg_latency_s:.2f}",
            "Successful": eval_result["successful"],
            "Failed": eval_result["total_scenarios"] - eval_result["successful"],
        }
    )

df = pd.DataFrame(rows)
print("\n" + "=" * 70)
print("EVALUATION RESULTS")
print("=" * 70)
print(df.to_string(index=False))

In [None]:
# Show Langfuse metrics for each version
print("\n" + "=" * 100)
print("LANGFUSE METRICS BY VERSION")
print("=" * 100)

version_summaries = []
for version_name, eval_result in all_results.items():
    lf_metrics = eval_result.get("langfuse_metrics", [])
    valid = [m for m in lf_metrics if "error" not in m]

    if valid:
        # "or 0" guards against metrics that are present but None, matching the latency handling
        total_input = sum(m.get("input_tokens", 0) or 0 for m in valid)
        total_output = sum(m.get("output_tokens", 0) or 0 for m in valid)
        total_cache_read = sum(m.get("cache_read_tokens", 0) or 0 for m in valid)
        total_cache_write = sum(m.get("cache_write_tokens", 0) or 0 for m in valid)
        total_cost = sum(m.get("cost_usd", 0) or 0 for m in valid)
        avg_latency = sum(m.get("latency_seconds", 0) or 0 for m in valid) / len(valid)
        avg_input = total_input / len(valid)

        version_summaries.append(
            {
                "Version": version_name,
                "Avg Input Tokens": f"{avg_input:,.0f}",
                "Avg Latency (s)": f"{avg_latency:.2f}",
                "Total Cost": f"${total_cost:.4f}",
                "Cache Read": f"{total_cache_read:,}",
                "Cache Write": f"{total_cache_write:,}",
            }
        )

if version_summaries:
    df_langfuse = pd.DataFrame(version_summaries)
    print(df_langfuse.to_string(index=False))

In [None]:
# Calculate comprehensive improvement from baseline (including token metrics)
if "v1-baseline" in all_results and len(all_results) > 1:
    baseline = all_results["v1-baseline"]

    # Get baseline Langfuse metrics as per-query averages, so versions with
    # a different number of successful runs remain comparable
    baseline_lf = baseline.get("langfuse_metrics", [])
    baseline_valid = [m for m in baseline_lf if "error" not in m]
    n_baseline = len(baseline_valid)
    baseline_input = sum(m.get("input_tokens", 0) or 0 for m in baseline_valid) / n_baseline if n_baseline else 0
    baseline_cost = sum(m.get("cost_usd", 0) or 0 for m in baseline_valid) / n_baseline if n_baseline else 0
    baseline_latency = (
        sum(m.get("latency_seconds", 0) or 0 for m in baseline_valid) / n_baseline if n_baseline else 0
    )

    print("\n" + "=" * 80)
    print("IMPROVEMENT VS BASELINE (v1)")
    print("=" * 80)
    print(f"{'Version':<20} {'Input Tokens':<18} {'Latency':<15} {'Cost/Query':<15}")
    print("-" * 80)
    print(f"{'v1-baseline':<20} {'(baseline)':<18} {'(baseline)':<15} {'(baseline)':<15}")

    for version_name, eval_result in all_results.items():
        if version_name == "v1-baseline":
            continue

        lf_metrics = eval_result.get("langfuse_metrics", [])
        valid = [m for m in lf_metrics if "error" not in m]

        if valid and baseline_input > 0:
            avg_input = sum(m.get("input_tokens", 0) or 0 for m in valid) / len(valid)
            avg_cost = sum(m.get("cost_usd", 0) or 0 for m in valid) / len(valid)
            avg_latency = sum(m.get("latency_seconds", 0) or 0 for m in valid) / len(valid)

            token_change = ((baseline_input - avg_input) / baseline_input) * 100
            latency_change = (
                ((baseline_latency - avg_latency) / baseline_latency) * 100 if baseline_latency > 0 else 0
            )
            cost_change = ((baseline_cost - avg_cost) / baseline_cost) * 100 if baseline_cost > 0 else 0

            # Format each percentage first so the columns line up with the header
            token_col = f"{token_change:+.1f}%"
            latency_col = f"{latency_change:+.1f}%"
            cost_col = f"{cost_change:+.1f}%"
            print(f"{version_name:<20} {token_col:<18} {latency_col:<15} {cost_col:<15}")
        else:
            print(f"{version_name:<20} {'N/A':<18} {'N/A':<15} {'N/A':<15}")

    print("=" * 80)
    print("\nPositive % = improvement (reduction), Negative % = regression (increase)")
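
The sign convention in the report can be checked with a small worked example. The sketch below isolates the percentage formula in a hypothetical helper, `pct_reduction`; the token counts are made-up values, not measured results.

```python
def pct_reduction(baseline_value, value):
    """Percentage reduction relative to the baseline; positive means the metric went down."""
    if baseline_value <= 0:
        return None  # Undefined without a positive baseline
    return (baseline_value - value) / baseline_value * 100


# 2,000 -> 800 average input tokens is a reduction, so the sign is positive
print(f"{pct_reduction(2000, 800):+.1f}%")
# 2,000 -> 2,200 average input tokens is an increase, so the sign is negative
print(f"{pct_reduction(2000, 2200):+.1f}%")
```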

## Step 6: View Langfuse Dashboard

In [None]:
langfuse_base_url = os.environ.get("LANGFUSE_BASE_URL", "https://cloud.langfuse.com")
print(f"View comprehensive metrics at: {langfuse_base_url}")
print("\nFor detailed comparison:")
print("1. Filter by version tags (v1-baseline, v2-quick-wins, etc.)")
print("2. Compare token usage across versions")
print("3. Check cache hit rates (v3+)")
print("4. Verify model routing (v4+)")
print("5. Review guardrail interventions (v5+)")

## Step 7: Final Summary

In [None]:
print("\n" + "=" * 70)
print("WORKSHOP SUMMARY")
print("=" * 70)
print("""
Optimizations Applied:

1. Quick Wins (v2)
   - Concise system prompt: ~60% token reduction
   - max_tokens limit: Bounded output
   - stop_sequences: Early termination

2. Prompt Caching (v3)
   - System prompt caching: ~90% discount on cached input tokens for repeated requests
   - Tool definition caching: Additional savings

3. Model Routing (v4)
   - Haiku for simple queries: up to ~12x cheaper input tokens (depends on the model pair)
   - Sonnet for complex queries: Maintained quality

4. Guardrails (v5)
   - Topic filtering: Block off-topic queries
   - Content filtering: Improve safety
   - Zero LLM tokens for blocked queries

5. Gateway (v6)
   - Semantic tool search: Load only relevant tools
   - Reduced context size: Up to 75% fewer tool tokens

Success Criteria:
- Cost should decrease from v1 to v6
- Latency should improve from v1 to v6
- Quality (success rate) should NOT degrade
""")

## Cleanup (Optional)

Delete all deployed agents if you're done with the workshop.

In [None]:
# # Uncomment to delete all customer-support agents and their ECR repositories
# from utils.runtime_helpers import cleanup_agents
# cleanup_agents(control_client, name_prefix="customer_support", region=region)

## Congratulations!

You've completed the Prompt Optimization Workshop!

**Key Takeaways:**
- Start with measurement (baseline metrics)
- Apply optimizations incrementally
- Validate quality at each step
- Use observability (Langfuse) to verify improvements

**Next Steps:**
- Apply these techniques to your own agents
- Explore additional Bedrock features
- Set up automated evaluation pipelines
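
As a starting point for an automated pipeline, the workshop's success criteria (lower cost, lower latency, no drop in success rate) can be expressed as a simple check. This is a minimal sketch, assuming a hypothetical per-version summary dict with `avg_cost_usd`, `avg_latency_s`, and `success_rate` keys; those names and the sample numbers are illustrative and do not come from the workshop helpers.

```python
def check_success_criteria(baseline, candidate):
    """Compare a candidate version with the baseline against the three workshop criteria."""
    return {
        "cost_decreased": candidate["avg_cost_usd"] < baseline["avg_cost_usd"],
        "latency_improved": candidate["avg_latency_s"] < baseline["avg_latency_s"],
        "quality_maintained": candidate["success_rate"] >= baseline["success_rate"],
    }


# Illustrative numbers only, not measured results
v1_summary = {"avg_cost_usd": 0.0120, "avg_latency_s": 4.8, "success_rate": 1.0}
v6_summary = {"avg_cost_usd": 0.0045, "avg_latency_s": 3.1, "success_rate": 1.0}

checks = check_success_criteria(v1_summary, v6_summary)
for criterion, passed in checks.items():
    print(f"{criterion}: {'PASS' if passed else 'FAIL'}")
```

In a scheduled job, a failed criterion could block promotion of a new agent version. Given the sample-size caveat from Step 5, you may want a tolerance on the cost and latency comparisons rather than a strict inequality.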