# Lab 04: LLM Routing

## Overview

In this notebook, we implement **model routing**: simple queries go to a cheaper model, and complex queries stay on the more capable one. Because the cheaper model costs one twelfth as much per token, routing can cut costs substantially while preserving answer quality where it matters.

**What you'll learn:**
- How to classify query complexity
- How to route queries to appropriate models
- How to verify routing decisions in Langfuse
- How to estimate the cost savings from routing

**Routing Strategy:**
- Simple queries → Claude Haiku ($0.25 per 1M input tokens, 12x cheaper than Sonnet)
- Complex queries → Claude Sonnet ($3.00 per 1M input tokens)

## Prerequisites

- Completed Labs 01-03

## Workshop Journey

```
01 Baseline → 02 Quick Wins → 03 Caching → [04 Routing] → 05 Guardrails → 06 Gateway → 07 Evaluations
                                               ↑
                                          You are here
```

## Step 1: Setup

In [None]:
import os
import json
import uuid
from pathlib import Path
from dotenv import load_dotenv

load_dotenv(override=True)

import boto3
from bedrock_agentcore_starter_toolkit import Runtime

region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
control_client = boto3.client("bedrock-agentcore-control", region_name=region)
data_client = boto3.client("bedrock-agentcore", region_name=region)
agentcore_runtime = Runtime()

print(f"Region: {region}")
print(f"Langfuse Host: {os.environ.get('LANGFUSE_HOST', 'Not set')}")

## Step 2: Understanding Model Routing

### Model Pricing Comparison

| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Best For |
|-------|----------------|-----------------|----------|
| Claude Sonnet | $3.00 | $15.00 | Complex reasoning, analysis |
| Claude Haiku | $0.25 | $1.25 | Simple Q&A, lookups |

**Haiku is 12x cheaper for both input tokens ($3.00 / $0.25) and output tokens ($15.00 / $1.25).**
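
To make the table concrete, the per-request cost can be computed directly from the listed prices. The sketch below is illustrative only: the `PRICES` dictionary and the `request_cost` helper are hypothetical names, and the token counts are made-up inputs, not measurements from this workshop.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "sonnet": {"input": 3.00, "output": 15.00},
    "haiku": {"input": 0.25, "output": 1.25},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical request: 4,000 input tokens and 500 output tokens.
sonnet_cost = request_cost("sonnet", 4_000, 500)
haiku_cost = request_cost("haiku", 4_000, 500)
print(f"Sonnet: ${sonnet_cost:.6f}")
print(f"Haiku:  ${haiku_cost:.6f}")
print(f"Ratio:  {sonnet_cost / haiku_cost:.1f}x")
```

Because both price columns differ by the same factor, the ratio stays at 12x regardless of the input/output mix, as long as both models consume the same number of tokens.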

### Routing Logic

Simple queries (use Haiku):
- "What is your return policy?"
- "Hello, what can you help me with?"
- "Do you have X product?"

Complex queries (use Sonnet):
- "Compare these products and recommend..."
- "Troubleshoot my device that..."
- Any request that requires multi-step reasoning
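
The workshop's actual classification lives in `agents/v4_routing.py`. As a rough illustration of what pattern-based routing can look like, here is a minimal sketch. The keyword patterns, the word-count cutoff, the `classify_complexity` and `select_model` names, and the model labels are all assumptions made for this example, not the workshop agent's code.

```python
import re

# Hypothetical patterns that hint at multi-step reasoning.
COMPLEX_PATTERNS = [
    r"\bcompare\b",
    r"\brecommend\b",
    r"\btroubleshoot\b",
    r"\bwhy\b",
    r"\bstep[- ]by[- ]step\b",
]
MAX_SIMPLE_WORDS = 25  # assumed cutoff; longer queries are treated as complex


def classify_complexity(query: str) -> str:
    """Return "complex" if any pattern matches or the query is long, else "simple"."""
    text = query.lower()
    if any(re.search(pattern, text) for pattern in COMPLEX_PATTERNS):
        return "complex"
    if len(text.split()) > MAX_SIMPLE_WORDS:
        return "complex"
    return "simple"


def select_model(query: str) -> str:
    """Map the complexity label to a model label (placeholders, not real model IDs)."""
    return "sonnet" if classify_complexity(query) == "complex" else "haiku"


for q in [
    "What is your return policy?",
    "Compare these products and recommend one for video editing",
]:
    print(f"{select_model(q):>6} <- {q}")
```

A keyword heuristic like this is cheap and adds no extra model call, but it will misroute queries that phrase a hard question in simple words, so the patterns would need tuning against real traffic.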

In [None]:
# Review the routing logic in v4 agent
agent_file = Path("agents/v4_routing.py")
print(agent_file.read_text())

## Step 3: Deploy the Routing Agent

In [None]:
agent_name = "customer_support_v4_routing"
agent_file = str(Path("agents/v4_routing.py").absolute())
requirements_file = str(Path("requirements-for-agentcore.txt").absolute())

print(f"Agent name: {agent_name}")
print(f"Agent file: {agent_file}")
print(f"Requirements: {requirements_file}")

print(f"Configuring agent: {agent_name}")
agentcore_runtime.configure(
    entrypoint=agent_file,
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file=requirements_file,
    region=region,
    agent_name=agent_name,
)

In [None]:
import re

# Replace the `opentelemetry-instrument` wrapper in the generated Dockerfile CMD
# with a plain `python -m` call for the Langfuse setup used in this workshop
dockerfile_path = Path("Dockerfile")
if dockerfile_path.exists():
    content = dockerfile_path.read_text()
    # subn also returns the substitution count, so success is reported only on a real match
    content, replaced = re.subn(
        r'CMD \["opentelemetry-instrument", "python", "-m", "([^"]+)"\]',
        r'CMD ["python", "-m", "\1"]',
        content,
    )
    if replaced:
        dockerfile_path.write_text(content)
        print("Dockerfile modified for Langfuse")
    else:
        print("Dockerfile already configured or using different format")
else:
    print("Dockerfile not found - will be created during deployment")
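
The deployment cell passes the Langfuse settings to the runtime with `os.environ.get`, which returns `None` for any variable that is not set. A small check beforehand can surface a missing key early. This is only a sketch: `missing_env_vars` is a hypothetical helper, not part of the AgentCore toolkit.

```python
import os

REQUIRED_VARS = ["LANGFUSE_HOST", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY"]


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All Langfuse environment variables are set")
```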

In [None]:
env_vars = {
    "LANGFUSE_HOST": os.environ.get("LANGFUSE_HOST"),
    "LANGFUSE_PUBLIC_KEY": os.environ.get("LANGFUSE_PUBLIC_KEY"),
    "LANGFUSE_SECRET_KEY": os.environ.get("LANGFUSE_SECRET_KEY"),
    "PYTHONUNBUFFERED": "1",
}

print("Deploying to AgentCore Runtime...")
launch_result = agentcore_runtime.launch(env_vars=env_vars, auto_update_on_conflict=True)
agent_arn = launch_result.agent_arn
print(f"Agent deployed: {agent_arn}")

## Step 4: Test Model Routing

In [None]:
def invoke_agent(prompt):
    """Invoke the agent via AgentCore API."""
    response = data_client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        runtimeSessionId=str(uuid.uuid4()),
        payload=json.dumps({"prompt": prompt}).encode(),
    )
    return json.loads(response["response"].read().decode("utf-8"))

In [None]:
# Import Langfuse metrics helper
from utils.langfuse_metrics import (
    get_latest_trace_metrics,
    print_metrics,
    clear_metrics,
    collect_metric,
    print_metrics_table,
    get_collected_metrics
)

# Clear any previously collected metrics
clear_metrics()

# Standard test prompts - same across all notebooks for consistent comparison
TEST_PROMPTS = [
    # Single tool: get_return_policy
    ("Return Policy", "What is your return policy for laptops?"),

    # Single tool: get_product_info
    ("Product Info", "Tell me about your smartphone options"),

    # Single tool: get_technical_support (Bedrock KB)
    ("Technical Support", "My laptop won't turn on, can you help me troubleshoot?"),

    # Multi-tool: get_product_info + get_return_policy
    ("Multi-part Question", "I want to buy a laptop. What are the specs and what's the return policy?"),

    # No tool: General greeting
    ("General Question", "Hello! What can you help me with today?"),
]

# Run all tests and collect metrics
for test_name, prompt in TEST_PROMPTS:
    print("=" * 60)
    print(f"Test: {test_name}")
    print("=" * 60)

    result = invoke_agent(prompt)
    
    # Show routing decision if available
    if isinstance(result, dict):
        print(f"Model used: {result.get('model_used', 'N/A')}")
        print(f"Complexity: {result.get('complexity', 'N/A')}")
        print(f"Response: {str(result.get('response', result))[:200]}...")
    else:
        print(result)

    # Fetch and collect metrics
    metrics = get_latest_trace_metrics(
        agent_name="customer-support-v4-routing",
        wait_seconds=5,
        max_retries=5,
        timeout_seconds=120,
    )
    print_metrics(metrics, test_name)
    collect_metric(metrics, test_name)
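
To see at a glance how the test prompts were routed, the per-request results can be tallied. The sketch below assumes each result is a dict carrying the `model_used` and `complexity` keys read in the loop above. The sample data and the `summarize_routing` name are hypothetical.

```python
from collections import Counter


def summarize_routing(results):
    """Count (complexity, model) pairs across a list of agent result dicts."""
    return Counter(
        (r.get("complexity", "N/A"), r.get("model_used", "N/A"))
        for r in results
        if isinstance(r, dict)  # skip non-dict responses
    )


# Made-up results for illustration only; real values come from invoke_agent().
sample_results = [
    {"complexity": "simple", "model_used": "haiku"},
    {"complexity": "complex", "model_used": "sonnet"},
    {"complexity": "simple", "model_used": "haiku"},
]

for (complexity, model), count in summarize_routing(sample_results).most_common():
    print(f"{complexity:<8} -> {model:<8} {count}")
```

In the notebook, appending each `result` to a list inside the test loop and passing that list to the helper would give the same summary for real traffic.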

In [None]:
# Print summary table
print_metrics_table()

# Compare with baseline metrics (from notebook 01)
BASELINE_AVG_INPUT_TOKENS = 4251  # From v1-baseline
BASELINE_AVG_LATENCY = 8.0  # From v1-baseline (seconds)

# Calculate improvements
collected = get_collected_metrics()
if collected:
    valid_metrics = [m for m in collected if "error" not in m]
    if valid_metrics:
        # `or 0` guards against metrics that are present but None
        avg_input = sum(m.get('input_tokens') or 0 for m in valid_metrics) / len(valid_metrics)
        avg_latency = sum(m.get('latency_seconds') or 0 for m in valid_metrics) / len(valid_metrics)
        total_cost = sum(m.get('cost_usd') or 0 for m in valid_metrics)
        total_cache_read = sum(m.get('cache_read_tokens') or 0 for m in valid_metrics)

        token_reduction = ((BASELINE_AVG_INPUT_TOKENS - avg_input) / BASELINE_AVG_INPUT_TOKENS) * 100
        latency_change = ((BASELINE_AVG_LATENCY - avg_latency) / BASELINE_AVG_LATENCY) * 100

        print("\n" + "=" * 60)
        print("           COMPARISON VS BASELINE (v1)")
        print("=" * 60)
        print(f"  Avg Input Tokens:  {avg_input:,.0f} (Baseline: {BASELINE_AVG_INPUT_TOKENS:,})")
        print(f"  Token Reduction:   {token_reduction:+.1f}%")
        print(f"  Avg Latency:       {avg_latency:.2f}s (Baseline: {BASELINE_AVG_LATENCY:.2f}s)")
        print(f"  Latency Reduction: {latency_change:+.1f}%")
        print(f"  Total Cost:        ${total_cost:.4f}")
        print(f"  Cache Read Tokens: {total_cache_read:,}")
        print("=" * 60)
        print("\nNote: With routing, simple queries use Haiku (12x cheaper)!")

## Step 5: Verify Routing in Langfuse

In [None]:
langfuse_host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")
print(f"View your traces at: {langfuse_host}")
print("\nFilter by tags: 'routing', 'simple' or 'complex'")
print("\nCheck the 'query_complexity' and 'model_used' attributes in each trace.")
print("\n" + "=" * 60)
print("Expected Routing:")
print("=" * 60)
print("")
print("| Query Type | Model  | Cost per 1M input |")
print("|------------|--------|-------------------|")
print("| Simple     | Haiku  | $0.25             |")
print("| Complex    | Sonnet | $3.00             |")

## Step 6: Calculate Routing Savings

In [None]:
def calculate_routing_savings(total_requests, simple_pct, tokens_per_request):
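    """Estimate input-token cost savings from model routing.

    simple_pct is a percentage (0-100). Every request is assumed to use
    tokens_per_request input tokens; output tokens are not included.
    """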
    sonnet_input = 3.00  # per 1M
    haiku_input = 0.25   # per 1M
    
    simple_requests = total_requests * (simple_pct / 100)
    complex_requests = total_requests * (1 - simple_pct / 100)
    
    # Without routing (all Sonnet)
    cost_no_routing = (tokens_per_request * total_requests / 1_000_000) * sonnet_input
    
    # With routing
    cost_simple = (tokens_per_request * simple_requests / 1_000_000) * haiku_input
    cost_complex = (tokens_per_request * complex_requests / 1_000_000) * sonnet_input
    cost_with_routing = cost_simple + cost_complex
    
    savings = cost_no_routing - cost_with_routing
    savings_pct = (savings / cost_no_routing) * 100
    
    return {
        "without_routing": cost_no_routing,
        "with_routing": cost_with_routing,
        "savings": savings,
        "savings_pct": savings_pct,
    }

# Example: 70% simple queries
result = calculate_routing_savings(
    total_requests=1000,
    simple_pct=70,
    tokens_per_request=1000
)

print("Routing Savings (1000 requests, 70% simple, 1000 tokens each):")
print(f"  Without routing (all Sonnet): ${result['without_routing']:.4f}")
print(f"  With routing:                 ${result['with_routing']:.4f}")
print(f"  Savings:                      ${result['savings']:.4f} ({result['savings_pct']:.1f}%)")
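
Because every request is assumed to use the same number of tokens, the savings percentage does not depend on request volume or token count. It reduces to `simple_pct * (1 - haiku_price / sonnet_price)`. The sketch below sweeps the simple-query share using the input prices from this notebook; `savings_pct` is a hypothetical helper name.

```python
SONNET_INPUT = 3.00  # USD per 1M input tokens
HAIKU_INPUT = 0.25   # USD per 1M input tokens


def savings_pct(simple_pct: float) -> float:
    """Percent saved versus all-Sonnet when simple_pct percent of requests use Haiku."""
    return simple_pct * (1 - HAIKU_INPUT / SONNET_INPUT)


for pct in (30, 50, 70, 90):
    print(f"{pct}% simple -> {savings_pct(pct):.1f}% saved")
```

Since the output prices share the same 12x ratio, including output tokens would leave the percentage unchanged under the same equal-tokens assumption.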

## Summary

In this notebook, we implemented model routing:

1. **Query classification**: Pattern-based complexity detection
2. **Model selection**: Haiku for simple, Sonnet for complex
3. **Significant cost savings**: About 64% of input-token cost when 70% of queries are simple, as in the Step 6 example

**Key Insight:** In many customer support workloads, a large share of queries are simple lookups that don't need the full power of Sonnet.

**Next Steps:** In the next notebook, we'll add Bedrock Guardrails to filter out off-topic queries before they reach the LLM.

**Next notebook:** [05-guardrails.ipynb](./05-guardrails.ipynb)