# Lab 04: LLM Routing

## Overview

In this notebook, we implement **model routing** to use cheaper models for simple queries while preserving quality for complex ones. The key insight is that **an LLM can classify queries more accurately than keyword matching**.

**What you'll learn:**
- How to classify query complexity using an LLM
- How to route queries to appropriate models (Haiku vs Sonnet)
- How to verify routing decisions in Langfuse
- Cost savings from intelligent routing

**Routing Strategy:**
- Simple queries → Claude Haiku (smaller, faster, cheaper)
- Complex queries → Claude Sonnet (more capable, higher quality)

## Prerequisites

- Completed Labs 01-03

## Workshop Journey

```
01 Baseline → 02 Quick Wins → 03 Caching → [04 Routing] → 05 Guardrails → 06 Gateway → 07 Evaluations
                                                ↑
                                           You are here
```

## Step 1: Setup

In [1]:
import os
import json
import uuid
from pathlib import Path
from dotenv import load_dotenv

load_dotenv(override=True)

import boto3
from bedrock_agentcore_starter_toolkit import Runtime

region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
control_client = boto3.client("bedrock-agentcore-control", region_name=region)
data_client = boto3.client("bedrock-agentcore", region_name=region)
agentcore_runtime = Runtime()

print(f"Region: {region}")
print(f"Langfuse Host: {os.environ.get('LANGFUSE_HOST', 'Not set')}")

Region: us-east-1
Langfuse Host: https://d2rhlwziq3nnbf.cloudfront.net


## Step 2: Understanding Model Routing

### Why Route Between Models?

Not all queries require the same level of reasoning. A simple "What's your return policy?" doesn't need the full power of Sonnet; Haiku can handle it at a fraction of the cost.

**Claude Haiku:** ~4x cheaper than Sonnet, best for simple Q&A, lookups, and greetings. Requires 4,096 tokens minimum for prompt caching.

**Claude Sonnet:** More capable model for complex reasoning and troubleshooting. Requires 1,024 tokens minimum for prompt caching.

Our system prompt (~1,030 tokens) only meets Sonnet's caching threshold. Haiku requests won't benefit from prompt caching, but the ~4x cost savings still make it worthwhile for simple queries.
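To make the trade-off concrete, here is a rough sketch of the two checks described above. It assumes only the approximate 4x price ratio and the caching minimums quoted in this section, and it ignores classification overhead and caching discounts. `blended_cost_ratio` and `meets_cache_minimum` are hypothetical helpers, not part of the workshop code.

```python
def blended_cost_ratio(haiku_fraction: float, price_ratio: float = 4.0) -> float:
    """Cost of routed traffic relative to sending every query to Sonnet.

    Assumption: every query costs the same on a given model, and Haiku is
    `price_ratio` times cheaper (the ~4x figure is an approximation).
    """
    if not 0.0 <= haiku_fraction <= 1.0:
        raise ValueError("haiku_fraction must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    # Haiku share is discounted by the price ratio; Sonnet share costs 1.0.
    return haiku_fraction / price_ratio + (1.0 - haiku_fraction)


def meets_cache_minimum(prompt_tokens: int, minimum_tokens: int) -> bool:
    """True when a prompt is long enough to be eligible for prompt caching."""
    return prompt_tokens >= minimum_tokens


# The ~1,030-token system prompt against the two minimums quoted above.
sonnet_cacheable = meets_cache_minimum(1030, 1024)
haiku_cacheable = meets_cache_minimum(1030, 4096)
```

Under these simplified assumptions, routing 60% of queries to Haiku gives a ratio of about 0.55. Measured savings will usually be smaller, because the classifier call adds tokens and Haiku requests run without cached prompts.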

### Routing Approaches

**Keyword matching:** Fast with no LLM cost, but brittle and misses semantic variations.

**LLM-based classification:** Accurate and handles edge cases, with negligible overhead.

**Embedding similarity:** No LLM call needed, but requires training data and more complexity.

We'll use **LLM-based classification** with Haiku because the classification cost is negligible compared to the savings from routing simple queries away from Sonnet.
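To illustrate why keyword matching is brittle, here is a minimal sketch of a keyword router. The keyword set and the `keyword_route` function are made up for illustration and are not part of the workshop code.

```python
import re

# Hypothetical trigger words; a real list would need constant maintenance.
COMPLEX_KEYWORDS = {"troubleshoot", "compare", "diagnose"}


def keyword_route(query: str) -> str:
    """Label a query 'complex' if it contains any trigger word, else 'simple'."""
    words = set(re.findall(r"[a-z']+", query.lower()))
    return "complex" if words & COMPLEX_KEYWORDS else "simple"


# Contains the trigger word "troubleshoot", so it is labeled complex.
matched = keyword_route("My laptop won't turn on, can you help me troubleshoot?")

# Same intent, different wording: no trigger word, so it is labeled simple.
missed = keyword_route("My laptop won't power up and I've tried everything")
```

The second query needs the same multi-step troubleshooting as the first, but the keyword router sends it to the cheaper model because the wording changed. This is the kind of semantic variation an LLM classifier is meant to absorb.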

### Query Classification Examples

These examples match our test prompts:

**Simple → Haiku:**
- "What is your return policy for laptops?" (single factual lookup)
- "Tell me about your smartphone options" (direct product question)
- "Hello! What can you help me with today?" (greeting)

**Complex → Sonnet:**
- "My laptop won't turn on, can you help me troubleshoot?" (multi-step troubleshooting)
- "I want to buy a laptop. What are the specs and what's the return policy?" (multiple questions)

## Step 3: Review the Routing Logic

The v4 agent uses Haiku to classify queries before routing to the appropriate model.

In [None]:
from agents.v4_routing import CLASSIFIER_PROMPT

print("=== CLASSIFIER PROMPT ===")
print(CLASSIFIER_PROMPT)

### How the Classifier Works

1. **Haiku receives the query** with the classifier prompt
2. **Haiku responds** with a single word: "simple" or "complex"
3. **Router parses the response** and selects the model:
   - "simple" → Haiku handles the full request
   - "complex" → Sonnet handles the full request

**Classification overhead:** ~400 input tokens + ~5 output tokens per query

This overhead is negligible compared to the savings from routing 60-70% of queries to Haiku.
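The parse-and-select step can be sketched as follows. This is a minimal illustration rather than the code in `agents/v4_routing.py`: the model identifiers are placeholders, and falling back to the more capable model on an unexpected classifier reply is an assumption made here for safety.

```python
from enum import Enum


class Complexity(str, Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


# Placeholder identifiers, NOT real Bedrock model IDs.
MODEL_FOR = {
    Complexity.SIMPLE: "haiku-model-id",
    Complexity.COMPLEX: "sonnet-model-id",
}


def parse_complexity(raw_reply: str) -> Complexity:
    """Map the classifier's one-word reply to a Complexity value."""
    text = raw_reply.strip().lower()
    if text.startswith("simple"):
        return Complexity.SIMPLE
    # Assumption: anything else (including garbled replies) is treated as
    # complex, so quality is preserved when the classifier misbehaves.
    return Complexity.COMPLEX


def select_model(raw_reply: str) -> str:
    """Return the model identifier that should handle the full request."""
    return MODEL_FOR[parse_complexity(raw_reply)]
```

Tolerating whitespace, capitalization, and trailing punctuation keeps the parser robust to small formatting differences in the classifier's reply.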

In [3]:
# Review the v4 agent code
agent_file = Path("agents/v4_routing.py")
print(agent_file.read_text())

"""
V4 Routing Agent - Model routing based on query complexity.
- Same system prompt as v3 with prompt caching
- LLM-based classification using Haiku
- Simple queries -> Haiku (cheaper)
- Complex queries -> Sonnet (better quality)
"""

import base64
import os
from enum import Enum
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from strands import Agent
from strands.models import BedrockModel
from strands.telemetry import StrandsTelemetry
from strands.types.content import SystemContentBlock

import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from utils.tools import get_return_policy, get_product_info, web_search, get_technical_support

load_dotenv()

# Langfuse configuration
langfuse_public_key = os.environ.get("LANGFUSE_PUBLIC_KEY")
langfuse_secret_key = os.environ.get("LANGFUSE_SECRET_KEY")
langfuse_host = os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com")

## Step 4: Deploy the Routing Agent

In [4]:
agent_name = "customer_support_v4_routing"
agent_file = str(Path("agents/v4_routing.py").absolute())
requirements_file = str(Path("requirements-for-agentcore.txt").absolute())

print(f"Agent name: {agent_name}")
print(f"Agent file: {agent_file}")
print(f"Requirements: {requirements_file}")

print(f"Configuring agent: {agent_name}")
agentcore_runtime.configure(
    entrypoint=agent_file,
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file=requirements_file,
    region=region,
    agent_name=agent_name,
)

Entrypoint parsed: file=/Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/agents/v4_routing.py, bedrock_agentcore_name=v4_routing
Memory disabled - agent will be stateless
Configuring BedrockAgentCore agent: customer_support_v4_routing


Agent name: customer_support_v4_routing
Agent file: /Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/agents/v4_routing.py
Requirements: /Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/requirements-for-agentcore.txt
Configuring agent: customer_support_v4_routing


Memory disabled
Network mode: PUBLIC
Generated Dockerfile: Dockerfile
Generated .dockerignore: /Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/.dockerignore
Keeping 'customer_support_v4_routing' as default agent
Bedrock AgentCore configured: /Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/.bedrock_agentcore.yaml


ConfigureResult(config_path=PosixPath('/Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/.bedrock_agentcore.yaml'), dockerfile_path=PosixPath('/Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/Dockerfile'), dockerignore_path=PosixPath('/Users/tracilim/Projects/aws-bedrock-prompt-optimization-workshop/03-developer-journey/.dockerignore'), runtime='None', runtime_type=None, region='us-east-1', account_id='739907928487', execution_role=None, ecr_repository=None, auto_create_ecr=True, s3_path=None, auto_create_s3=False, memory_id=None, network_mode='PUBLIC', network_subnets=None, network_security_groups=None, network_vpc_id=None)

In [5]:
# Modify Dockerfile for Langfuse
dockerfile_path = Path("Dockerfile")
if dockerfile_path.exists():
    content = dockerfile_path.read_text()
    if "opentelemetry-instrument" in content:
        import re
        content = re.sub(
            r'CMD \["opentelemetry-instrument", "python", "-m", "([^"]+)"\]',
            r'CMD ["python", "-m", "\1"]',
            content
        )
        dockerfile_path.write_text(content)
        print("Dockerfile modified for Langfuse")
    else:
        print("Dockerfile already configured or using different format")
else:
    print("Dockerfile not found - will be created during deployment")

Dockerfile modified for Langfuse


In [6]:
env_vars = {
    "LANGFUSE_HOST": os.environ.get("LANGFUSE_HOST"),
    "LANGFUSE_PUBLIC_KEY": os.environ.get("LANGFUSE_PUBLIC_KEY"),
    "LANGFUSE_SECRET_KEY": os.environ.get("LANGFUSE_SECRET_KEY"),
    "PYTHONUNBUFFERED": "1",
}

print("Deploying to AgentCore Runtime...")
launch_result = agentcore_runtime.launch(env_vars=env_vars, auto_update_on_conflict=True)
agent_arn = launch_result.agent_arn
print(f"Agent deployed: {agent_arn}")

🚀 Launching Bedrock AgentCore (cloud mode - RECOMMENDED)...
   • Deploy Python code directly to runtime
   • No Docker required (DEFAULT behavior)
   • Production-ready deployment

💡 Deployment options:
   • runtime.launch()                → Cloud (current)
   • runtime.launch(local=True)      → Local development
Memory disabled - skipping memory creation
Starting CodeBuild ARM64 deployment for agent 'customer_support_v4_routing' to account 739907928487 (us-east-1)
Setting up AWS resources (ECR repository, execution roles)...
Getting or creating ECR repository for agent: customer_support_v4_routing


Deploying to AgentCore Runtime...


ECR repository available: 739907928487.dkr.ecr.us-east-1.amazonaws.com/bedrock-agentcore-customer_support_v4_routing
Getting or creating execution role for agent: customer_support_v4_routing
Using AWS region: us-east-1, account ID: 739907928487
Role name: AmazonBedrockAgentCoreSDKRuntime-us-east-1-0fb396fd48


✅ Reusing existing ECR repository: 739907928487.dkr.ecr.us-east-1.amazonaws.com/bedrock-agentcore-customer_support_v4_routing


✅ Reusing existing execution role: arn:aws:iam::739907928487:role/AmazonBedrockAgentCoreSDKRuntime-us-east-1-0fb396fd48
Execution role available: arn:aws:iam::739907928487:role/AmazonBedrockAgentCoreSDKRuntime-us-east-1-0fb396fd48
Preparing CodeBuild project and uploading source...
Getting or creating CodeBuild execution role for agent: customer_support_v4_routing
Role name: AmazonBedrockAgentCoreSDKCodeBuild-us-east-1-0fb396fd48
Reusing existing CodeBuild execution role: arn:aws:iam::739907928487:role/AmazonBedrockAgentCoreSDKCodeBuild-us-east-1-0fb396fd48
Using dockerignore.template with 46 patterns for zip filtering
Uploaded source to S3: customer_support_v4_routing/source.zip
Updated CodeBuild project: bedrock-agentcore-customer_support_v4_routing-builder
Starting CodeBuild build (this may take several minutes)...
Starting CodeBuild monitoring...
🔄 QUEUED started (total: 0s)
✅ QUEUED completed in 1.3s
🔄 PROVISIONING started (total: 2s)
✅ PROVISIONING completed in 7.7s


Agent deployed: arn:aws:bedrock-agentcore:us-east-1:739907928487:runtime/customer_support_v4_routing-3LNaMtHxM3


In [13]:
agent_arn = "arn:aws:bedrock-agentcore:us-east-1:739907928487:runtime/customer_support_v4_routing-3LNaMtHxM3"

## Step 5: Test Model Routing

Let's run the same test prompts and observe which model handles each query.

In [14]:
def invoke_agent(prompt):
    """Invoke the agent via AgentCore API."""
    response = data_client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        runtimeSessionId=str(uuid.uuid4()),
        payload=json.dumps({"prompt": prompt}).encode(),
    )
    return json.loads(response["response"].read().decode("utf-8"))

In [None]:
from utils.langfuse_metrics import (
    get_latest_trace_metrics,
    print_metrics,
    clear_metrics,
    collect_metric,
    print_metrics_table,
    get_collected_metrics
)

clear_metrics()

TEST_PROMPTS = [
    ("Return Policy", "What is your return policy for laptops?"),
    ("Product Info", "Tell me about your smartphone options"),
    ("Technical Support", "My laptop won't turn on, can you help me troubleshoot?"),
    ("Multi-part Question", "I want to buy a laptop. What are the specs and what's the return policy?"),
    ("General Question", "Hello! What can you help me with today?"),
]

for test_name, prompt in TEST_PROMPTS:
    print("=" * 60)
    print(f"Test: {test_name}")
    print("=" * 60)

    result = invoke_agent(prompt)

    if isinstance(result, dict):
        print(f"Model used: {result.get('model_used', 'N/A')}")
        print(f"Complexity: {result.get('complexity', 'N/A')}")
        print(f"Response: {str(result.get('response', result))[:200]}...")
    else:
        print(result)

    # Get metrics from parent trace (includes classifier + main agent)
    metrics = get_latest_trace_metrics(
        agent_name="customer-support-v4-routing",
        wait_seconds=5,
        max_retries=5,
        timeout_seconds=120,
    )
    print_metrics(metrics, test_name)
    collect_metric(metrics, test_name)

In [9]:
print_metrics_table()


                                  METRICS SUMMARY
               Test Latency    Cost Input Output Cache Read Tokens Cache Write Tokens
      Return Policy   3.89s $0.0060 4,349    331                 0                  0
       Product Info   3.86s $0.0062 4,396    360                 0                  0
  Technical Support   9.35s $0.0095   917    418             1,743              1,743
Multi-part Question   9.09s $0.0123 1,207    510             3,486                  0
   General Question   1.78s $0.0028 2,075    149                 0                  0
---------------------------------------------------------------------------------------------------------
  TOTALS: Latency(avg): 5.59s | Cost: $0.0369 | Input: 12,944 | Output: 1,768
          Cache Read Tokens: 5,229 | Cache Write Tokens: 1,743





### Expected Routing Results

**Routed to Haiku (3 queries):**
- Return Policy: Single factual lookup
- Product Info: Direct product question
- General Question: Simple greeting

**Routed to Sonnet (2 queries):**
- Technical Support: Multi-step troubleshooting
- Multi-part Question: Multiple questions requiring reasoning

**Result:** 60% of queries routed to the cheaper model.
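The routed share can be recomputed from the agent's responses. A minimal sketch, assuming each result is a dict with a `complexity` key whose value is "simple" or "complex" (the test loop reads this key, but its exact values are an assumption); `routing_share` is a hypothetical helper.

```python
from collections import Counter


def routing_share(results: list[dict]) -> dict[str, float]:
    """Fraction of results per complexity label; missing labels count as 'unknown'."""
    counts = Counter(r.get("complexity", "unknown") for r in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Illustrative results mirroring the expected routing of the five test prompts.
example_results = [
    {"complexity": "simple"},   # Return Policy
    {"complexity": "simple"},   # Product Info
    {"complexity": "complex"},  # Technical Support
    {"complexity": "complex"},  # Multi-part Question
    {"complexity": "simple"},   # General Question
]
share = routing_share(example_results)
```

If your own run routes a prompt differently, the shares will shift accordingly, which is a quick way to spot classifier drift across runs.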

## Step 6: Compare with v3 (Caching)

Enter your metrics from Lab 03 (v3 caching) to compare with v4 routing results.

In [10]:
# ============================================================
# INPUT YOUR V3 METRICS FROM LAB 03 HERE
# (Copy the totals from your v3 metrics table)
# ============================================================
v3_total_cost = 0.0438       # e.g., 0.0430
v3_avg_latency = 8.10        # e.g., 7.55 (seconds)
v3_total_input_tokens = 4228    # e.g., 4229
v3_total_output_tokens = 1795   # e.g., 1739

# ============================================================
# V4 metrics (calculated from above)
# ============================================================
v4_metrics = get_collected_metrics()
v4_total_cost = sum(m.get('cost_usd', 0) for m in v4_metrics if 'error' not in m)
v4_latencies = [m.get('latency_seconds', 0) or 0 for m in v4_metrics if 'error' not in m]
v4_avg_latency = sum(v4_latencies) / len(v4_latencies) if v4_latencies else 0
v4_total_input_tokens = sum(m.get('input_tokens', 0) for m in v4_metrics if 'error' not in m)
v4_total_output_tokens = sum(m.get('output_tokens', 0) for m in v4_metrics if 'error' not in m)

# ============================================================
# Comparison
# ============================================================
print("=" * 70)
print("  V3 (CACHING) vs V4 (ROUTING) COMPARISON")
print("=" * 70)
print(f"{'Metric':<20} {'v3 (Caching)':>18} {'v4 (Routing)':>18} {'Change':>12}")
print("-" * 70)

has_v3_metrics = v3_total_cost > 0 and v3_avg_latency > 0

# Cost comparison
if v3_total_cost > 0:
    cost_change = ((v4_total_cost - v3_total_cost) / v3_total_cost) * 100
    cost_str = f"{cost_change:+.1f}%"
else:
    cost_str = "N/A"
print(f"{'Total Cost':<20} ${v3_total_cost:>17.4f} ${v4_total_cost:>17.4f} {cost_str:>12}")

# Latency comparison
if v3_avg_latency > 0:
    latency_change = ((v4_avg_latency - v3_avg_latency) / v3_avg_latency) * 100
    latency_str = f"{latency_change:+.1f}%"
else:
    latency_str = "N/A"
print(f"{'Avg Latency (s)':<20} {v3_avg_latency:>18.2f} {v4_avg_latency:>18.2f} {latency_str:>12}")

# Input tokens comparison
if v3_total_input_tokens > 0:
    input_change = ((v4_total_input_tokens - v3_total_input_tokens) / v3_total_input_tokens) * 100
    input_str = f"{input_change:+.1f}%"
else:
    input_str = "N/A"
print(f"{'Input Tokens':<20} {v3_total_input_tokens:>18,} {v4_total_input_tokens:>18,} {input_str:>12}")

# Output tokens comparison
if v3_total_output_tokens > 0:
    output_change = ((v4_total_output_tokens - v3_total_output_tokens) / v3_total_output_tokens) * 100
    output_str = f"{output_change:+.1f}%"
else:
    output_str = "N/A"
print(f"{'Output Tokens':<20} {v3_total_output_tokens:>18,} {v4_total_output_tokens:>18,} {output_str:>12}")

print("=" * 70)
if has_v3_metrics:
    print(f"\nRouting: {-cost_change:.1f}% cost reduction, {-latency_change:.1f}% latency change")
    print("Note: Latency may increase due to classification overhead, but cost savings are significant.")
else:
    print("\n⚠️  Enter your v3 metrics above to see the comparison")

  V3 (CACHING) vs V4 (ROUTING) COMPARISON
Metric                     v3 (Caching)       v4 (Routing)       Change
----------------------------------------------------------------------
Total Cost           $           0.0438 $           0.0369       -15.8%
Avg Latency (s)                    8.10               5.59       -30.9%
Input Tokens                      4,228             12,944      +206.1%
Output Tokens                     1,795              1,768        -1.5%

Routing: 15.8% cost reduction, 30.9% latency change
Note: Latency may increase due to classification overhead, but cost savings are significant.


## Summary

In this notebook, we implemented intelligent model routing:

1. **LLM-based classification:** Haiku classifies queries as "simple" or "complex" using a single-word response
2. **Cost-effective routing:** Simple queries go to Haiku (~4x cheaper), complex to Sonnet
3. **Prompt caching for Sonnet only:** Haiku requires 4,096 tokens minimum (our prompt is ~1,030), so only Sonnet requests benefit from caching

**Key insights:**

- **LLM classification beats keyword matching:** Handles semantic variations and edge cases
- **Simple text parsing is reliable:** Few-shot examples in the prompt ensure consistent "simple" or "complex" responses
- **Real cost savings:** Compare the v3 vs v4 metrics above to see actual savings from routing

**Next:** In Lab 05, we'll add Bedrock Guardrails to filter off-topic queries before they reach the LLM.

---

## Cleanup

To delete the agent deployed in this notebook, uncomment and run the following code.

In [11]:
# # Delete the agent
# control_client.delete_agent_runtime(agentRuntimeId=agent_arn.split("/")[-1])
# print(f"Agent deleted: {agent_arn}")