# Lab 5: Online Evaluations with AgentCore

In this lab, you'll set up **continuous online evaluation** to monitor the deployed agent's quality in real time.

## What you'll learn
- How to configure online evaluations with built-in evaluators
- How to generate test interactions for evaluation
- How to interpret evaluation metrics in CloudWatch

## Architecture

<div style="text-align:left">
    <img src="images/architecture_lab5_evaluation.png" width="75%"/>
</div>


## Step 1: Setup

In [None]:
import os
import json
import uuid
import time

os.environ["CLAUDE_CODE_USE_BEDROCK"] = "1"
os.environ.pop("CLAUDECODE", None)

import boto3
from boto3.session import Session

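boto_session = Session()
REGION = boto_session.region_name
if REGION is None:
    # Fail early: every later call in this lab needs a region
    raise RuntimeError("No AWS region configured. Set AWS_DEFAULT_REGION or run `aws configure`.")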

from utils.aws_helpers import get_ssm_parameter, get_or_create_cognito_pool

print(f"Region: {REGION}")

## Step 2: Initialize Evaluation Client

In [None]:
from bedrock_agentcore_starter_toolkit import Evaluation, Runtime

eval_client = Evaluation(region=REGION)

# Get the runtime ARN from Lab 4 and extract agent ID
runtime_arn = get_ssm_parameter("/app/customersupport/agentcore/runtime_arn")
agent_id = runtime_arn.split("/")[-1]

print(f"Runtime ARN: {runtime_arn}")
print(f"Agent ID: {agent_id}")
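
The agent ID is the final path segment of the runtime ARN. As a minimal sketch, a slightly more defensive version of the `split("/")` step might look like the following. `agent_id_from_arn` is a hypothetical helper, and the sample ARN is made up for illustration.

```python
def agent_id_from_arn(runtime_arn: str) -> str:
    """Return the last path segment of an ARN, which this lab treats as the agent ID."""
    if not runtime_arn.startswith("arn:") or "/" not in runtime_arn:
        raise ValueError(f"Unexpected runtime ARN format: {runtime_arn!r}")
    agent_id = runtime_arn.rsplit("/", 1)[-1]
    if not agent_id:
        raise ValueError(f"Runtime ARN has an empty resource ID: {runtime_arn!r}")
    return agent_id


# Made-up ARN for illustration only; the real value comes from SSM.
example_arn = "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/customer_support_agent-abc123"
print(agent_id_from_arn(example_arn))
```

Failing loudly on a malformed value makes a missing or stale SSM parameter easier to spot than a silently wrong agent ID.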

## Step 3: Create Online Evaluation Configuration

We configure three built-in evaluators:
- **GoalSuccessRate**: Did the agent achieve the customer's goal?
- **Correctness**: Was the information provided accurate?
- **ToolSelectionAccuracy**: Did the agent select the right tools?

In [None]:
# Create online evaluation configuration
try:
    eval_config = eval_client.create_online_config(
        config_name="customer_support_agent_eval",
        agent_id=agent_id,
        evaluator_list=[
            "Builtin.GoalSuccessRate",
            "Builtin.Correctness",
            "Builtin.ToolSelectionAccuracy",
        ],
        sampling_rate=1.0,  # 100% for demo; use lower in production
        config_description="Customer support agent online evaluation",
        auto_create_execution_role=True,
    )
    print(f"Evaluation config created: {eval_config}")
except Exception as e:
    if "conflict" in str(e).lower() or "already exists" in str(e).lower():
        print("Evaluation config already exists - continuing with existing config.")
    else:
        print(f"Error: {e}")

# List existing configs
try:
    configs = eval_client.list_online_configs()
    print(f"\nOnline evaluation configs: {len(configs)} found")
    for c in configs:
        print(f"  {c}")
except Exception as e:
    print(f"Could not list configs: {e}")
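
`sampling_rate` sets the fraction of agent sessions that get evaluated. The sketch below only illustrates that idea with a deterministic hash-based sampler. It is not how AgentCore implements sampling, and `should_evaluate` is a hypothetical name.

```python
import hashlib


def should_evaluate(session_id: str, sampling_rate: float) -> bool:
    """Deterministically pick roughly `sampling_rate` of sessions for evaluation."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    # Map the session ID to a stable integer bucket in [0, 2**64).
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sampling_rate * 2**64


# Illustrative session IDs; real sessions use UUIDs.
session_ids = [f"session-{n}" for n in range(10_000)]
sampled = sum(should_evaluate(s, 0.1) for s in session_ids)
print(f"Sampled {sampled} of {len(session_ids)} sessions at rate 0.1")
```

At `1.0` every session is selected, which suits a short demo. A lower rate trades coverage for fewer evaluator invocations.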

## Step 4: Generate Test Interactions

Send five test queries to the deployed agent, each in its own session, so the online evaluators have interactions to score.

In [None]:
agentcore_runtime = Runtime()

# Reconfigure to load existing deployment (same params as Lab 4)
execution_role_arn = get_ssm_parameter("/app/customersupport/agentcore/runtime_execution_role_arn")
cognito_config = get_or_create_cognito_pool(refresh_token=True)

agentcore_runtime.configure(
    entrypoint="runtime/app.py",
    execution_role=execution_role_arn,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region=REGION,
    agent_name="customer_support_agent",
    authorizer_configuration={
        "customJWTAuthorizer": {
            "allowedClients": [cognito_config["client_id"]],
            "discoveryUrl": cognito_config["discovery_url"],
        }
    },
    request_header_configuration={
        "requestHeaderAllowlist": ["Authorization"]
    },
)

# Verify runtime is ready
status = agentcore_runtime.status()
print(f"Runtime status: {status.endpoint.get('status', 'unknown')}")

# Use the bearer token refreshed by get_or_create_cognito_pool(refresh_token=True) above
access_token = cognito_config["bearer_token"]

# Test scenarios
test_queries = [
    "What are the specifications of your latest laptops?",
    "My headphones aren't connecting via Bluetooth. Can you help?",
    "I want to return a smartphone I bought 2 weeks ago. What's the process?",
    "Can you tell me about your monitor warranty and also search for the best 4K monitors in 2025?",
    "What payment methods do you accept?",
]

for i, query_text in enumerate(test_queries, 1):
    session_id = str(uuid.uuid4())
    print(f"\n[{i}/{len(test_queries)}] {query_text}")
    
    try:
        response = agentcore_runtime.invoke(
            {"prompt": query_text, "actor_id": "eval_customer"},
            bearer_token=access_token,
            session_id=session_id,
        )
        print(f"  Response: {str(response)[:100]}...")
    except Exception as e:
        print(f"  Error: {e}")
    
    time.sleep(2)  # Small delay between requests

print("\nTest interactions complete!")
print("Evaluation results will appear in CloudWatch within a few minutes.")
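
To review a run afterwards, it can help to record each outcome rather than only printing it. A minimal sketch, assuming an `invoke` callable that wraps `agentcore_runtime.invoke(...)`. `run_queries` and `fake_invoke` are hypothetical names, and the fake stands in for the deployed agent so the example runs offline.

```python
import uuid
from typing import Callable


def run_queries(queries: list[str], invoke: Callable[[str, str], object]) -> list[dict]:
    """Send each query in its own session and record whether the call succeeded."""
    results = []
    for query_text in queries:
        session_id = str(uuid.uuid4())  # one fresh session per query
        try:
            response = invoke(query_text, session_id)
            results.append({"query": query_text, "session_id": session_id,
                            "ok": True, "preview": str(response)[:100]})
        except Exception as exc:  # keep going so one failure does not stop the run
            results.append({"query": query_text, "session_id": session_id,
                            "ok": False, "preview": str(exc)[:100]})
    return results


def fake_invoke(prompt: str, session_id: str) -> dict:
    """Stand-in for the real agent call; fails on an empty prompt."""
    if not prompt:
        raise ValueError("empty prompt")
    return {"response": f"echo: {prompt}"}


results = run_queries(["What payment methods do you accept?", ""], fake_invoke)
failed = [r for r in results if not r["ok"]]
print(f"{len(results) - len(failed)} succeeded, {len(failed)} failed")
```

Keeping the session IDs makes it possible to match a failed or low-scoring interaction to its trace later.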

## Step 5: View Evaluation Results

Evaluation metrics are published to CloudWatch. To view them:

1. Open the **CloudWatch Console** and go to Metrics > `bedrock-agentcore` namespace.
2. Look for these metrics:
   - `GoalSuccessRate` - Percentage of successful goal completions
   - `Correctness` - Accuracy of provided information
   - `ToolSelectionAccuracy` - How well the agent selects appropriate tools
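
The same metrics can be listed programmatically. A minimal sketch, assuming the metrics land in the `bedrock-agentcore` namespace named above. `metric_names_in_namespace` is a hypothetical helper that accepts any object exposing CloudWatch's `list_metrics` paginator, and `FakeCloudWatch` replaces the real client so the example runs without AWS credentials.

```python
def metric_names_in_namespace(cloudwatch, namespace: str) -> list[str]:
    """Return the sorted, de-duplicated metric names published under `namespace`."""
    names = set()
    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(Namespace=namespace):
        for metric in page.get("Metrics", []):
            names.add(metric["MetricName"])
    return sorted(names)


class FakeCloudWatch:
    """Offline stand-in that mimics the shape of list_metrics pages."""

    def get_paginator(self, operation_name):
        assert operation_name == "list_metrics"
        return self

    def paginate(self, Namespace):
        yield {"Metrics": [{"Namespace": Namespace, "MetricName": "GoalSuccessRate"},
                           {"Namespace": Namespace, "MetricName": "Correctness"}]}
        yield {"Metrics": [{"Namespace": Namespace, "MetricName": "Correctness"}]}


print(metric_names_in_namespace(FakeCloudWatch(), "bedrock-agentcore"))
```

With real credentials, pass `boto3.client("cloudwatch", region_name=REGION)` in place of the fake. An empty result usually means the evaluators have not published anything yet.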

### Interpreting Results
- **GoalSuccessRate > 80%**: Agent effectively resolves customer queries
- **Correctness > 90%**: Information provided is reliable
- **ToolSelectionAccuracy > 85%**: Agent consistently picks the right tool
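
These thresholds can be checked in code once scores are available. A minimal sketch, assuming scores are expressed as fractions between 0 and 1. `below_threshold` is a hypothetical helper, and the sample scores are invented for illustration, not real evaluation output.

```python
# Thresholds from the list above, expressed as fractions.
THRESHOLDS = {
    "GoalSuccessRate": 0.80,
    "Correctness": 0.90,
    "ToolSelectionAccuracy": 0.85,
}


def below_threshold(scores: dict[str, float]) -> list[str]:
    """Return the metrics whose score does not exceed its threshold.

    A metric missing from `scores` is treated as 0.0 and therefore flagged.
    """
    return [name for name, minimum in THRESHOLDS.items()
            if scores.get(name, 0.0) <= minimum]


# Invented scores for illustration only.
sample_scores = {"GoalSuccessRate": 0.86, "Correctness": 0.88, "ToolSelectionAccuracy": 0.91}
print(below_threshold(sample_scores))
```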

![Online Evaluation Dashboard](images/online_evaluations_dashboard.png)


In [None]:
# List available evaluators
print("Available Evaluators:")
print("-" * 40)
try:
    evaluators = eval_client.list_evaluators()
    for ev in evaluators:
        print(f"  {ev}")
except Exception as e:
    print(f"  Error listing evaluators: {e}")

# Check online evaluation configs
print("\nOnline Evaluation Configs:")
print("-" * 40)
try:
    configs = eval_client.list_online_configs(agent_id=agent_id)
    for config in configs:
        print(f"  {config}")
except Exception as e:
    print(f"  Error: {e}")

print("\nNote: Evaluation results appear in the AgentCore Observability Dashboard.")
print("Check CloudWatch > Metrics > bedrock-agentcore namespace for detailed metrics.")

## Summary

In this lab, you set up online evaluations:

1. **Created evaluation config** with 3 built-in evaluators
2. **Generated test interactions** across 5 scenarios
3. **Learned to interpret** evaluation metrics in CloudWatch

In **Lab 6**, we'll build a Streamlit frontend for the agent.