# Lab 04: LLM Routing

## Overview

In this notebook, we implement **model routing** to use cheaper models for simple queries while preserving quality for complex ones. The key insight is that **an LLM can classify queries more accurately than keyword matching**.

**What you'll learn:**
- How to classify query complexity using an LLM
- How to route queries to appropriate models (Haiku vs Sonnet)
- How to verify routing decisions in Langfuse
- How to measure the cost savings from routing

**Routing Strategy:**
- Simple queries → Claude Haiku (smaller, faster, cheaper)
- Complex queries → Claude Sonnet (more capable, higher quality)

## Prerequisites

- Completed Labs 01-03

## Workshop Journey

```
01 Baseline → 02 Quick Wins → 03 Caching → [04 Routing] → 05 Guardrails → 06 Gateway → 07 Evaluations
                                               ↑
                                          You are here
```

## Step 1: Setup

In [None]:
from __future__ import annotations

import json
import os
import uuid
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(override=True)

import boto3
from bedrock_agentcore_starter_toolkit import Runtime

region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
control_client = boto3.client("bedrock-agentcore-control", region_name=region)
data_client = boto3.client("bedrock-agentcore", region_name=region)
agentcore_runtime = Runtime()

print(f"Region: {region}")
print(f"Langfuse Host: {os.environ.get('LANGFUSE_BASE_URL', 'Not set')}")

## Step 2: Understanding Model Routing

### Why Route Between Models?

Not all queries require the same level of reasoning. A simple "What's your return policy?" doesn't need the full power of Sonnet—Haiku can handle it at a fraction of the cost.

**Claude Haiku** — ~4x cheaper than Sonnet, best for simple Q&A, lookups, and greetings. Requires 4,096 tokens minimum for prompt caching.

**Claude Sonnet** — More capable model for complex reasoning and troubleshooting. Requires 1,024 tokens minimum for prompt caching.

Our system prompt (~1,030 tokens) meets only Sonnet's caching threshold, so Haiku requests won't benefit from prompt caching. The ~4x cost difference still makes Haiku worthwhile for simple queries.
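The caching thresholds can be checked with a few lines of Python. This is a minimal sketch that uses only the approximate figures quoted in this section; `CACHE_MIN_TOKENS` and `cache_eligible` are illustrative names, not part of the workshop code.

```python
# Minimum prompt lengths for prompt caching, as quoted in this section (tokens).
CACHE_MIN_TOKENS = {"haiku": 4096, "sonnet": 1024}

# Approximate size of the workshop system prompt, as quoted in this section.
SYSTEM_PROMPT_TOKENS = 1030


def cache_eligible(model: str, prompt_tokens: int) -> bool:
    """Return True if a prompt of this size reaches the model's caching minimum."""
    return prompt_tokens >= CACHE_MIN_TOKENS[model]


for model in CACHE_MIN_TOKENS:
    print(f"{model}: cache eligible = {cache_eligible(model, SYSTEM_PROMPT_TOKENS)}")
```

With these numbers only Sonnet qualifies, which matches the statement above.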

### Routing Approaches

**Keyword matching** — Fast with no LLM cost, but brittle and misses semantic variations.

**LLM-based classification** — Accurate and handles edge cases, with negligible overhead.

**Embedding similarity** — No LLM call needed, but requires training data and more complexity.

We'll use **LLM-based classification** with Haiku—the classification cost is negligible compared to the savings from routing simple queries away from Sonnet.
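To see why keyword matching is brittle, consider the sketch below. The keyword list and the `keyword_route` function are hypothetical and exist only for this illustration; they are not part of the v4 agent.

```python
# Hypothetical keyword list, chosen only to illustrate the failure mode.
COMPLEX_KEYWORDS = ("troubleshoot", "compare", "error")


def keyword_route(query: str) -> str:
    """Label a query 'complex' if it contains any keyword, otherwise 'simple'."""
    text = query.lower()
    return "complex" if any(keyword in text for keyword in COMPLEX_KEYWORDS) else "simple"


# Contains the word "troubleshoot", so the keyword router catches it.
print(keyword_route("My laptop won't turn on, can you help me troubleshoot?"))

# Same intent, different wording: no keyword matches, so it is labeled simple.
print(keyword_route("My laptop is dead and nothing I try brings it back"))
```

The second query describes the same troubleshooting problem but is labeled "simple" because none of the keywords appear. An LLM classifier works from meaning rather than surface wording, which is the motivation for the approach used in this lab.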

### Query Classification Examples

These examples match our test prompts:

**Simple → Haiku:**
- "What is your return policy for laptops?" — Single factual lookup
- "Tell me about your smartphone options" — Direct product question
- "Hello! What can you help me with today?" — Greeting

**Complex → Sonnet:**
- "My laptop won't turn on, can you help me troubleshoot?" — Multi-step troubleshooting
- "I want to buy a laptop. What are the specs and what's the return policy?" — Multiple questions

## Step 3: Review the Routing Logic

The v4 agent uses Haiku to classify queries before routing to the appropriate model.

In [None]:
from agents.v4_routing import CLASSIFIER_PROMPT

print("=== CLASSIFIER PROMPT ===")
print(CLASSIFIER_PROMPT)

### How the Classifier Works

1. **Haiku receives the query** with the classifier prompt
2. **Haiku responds** with a single word: "simple" or "complex"
3. **Router parses the response** and selects the model:
   - "simple" → Haiku handles the full request
   - "complex" → Sonnet handles the full request
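The parsing and selection steps above can be sketched as follows. This is not the code in `agents/v4_routing.py`: the model identifiers are placeholders, and falling back to "complex" on an unexpected reply is an assumption made here to favor quality over cost.

```python
# Placeholder identifiers; the real model IDs are defined in the agent code.
MODEL_FOR = {"simple": "haiku-model-id", "complex": "sonnet-model-id"}


def parse_complexity(raw: str) -> str:
    """Normalize the classifier reply to 'simple' or 'complex'.

    Assumption: any reply that is not clearly one of the two labels
    is treated as 'complex', so unclear cases go to the stronger model.
    """
    words = raw.strip().lower().split()
    first = words[0].strip(".,\"'") if words else ""
    return first if first in MODEL_FOR else "complex"


def select_model(raw: str) -> str:
    """Map the classifier reply to the model that handles the full request."""
    return MODEL_FOR[parse_complexity(raw)]


print(select_model(" Simple.\n"))
print(select_model("not sure"))
```

Normalizing case, whitespace, and trailing punctuation keeps the router tolerant of small formatting differences in the single-word reply.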

**Classification overhead:** ~400 input tokens + ~5 output tokens per query

This overhead is negligible compared to the savings from routing 60-70% of queries to Haiku.
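A rough back-of-the-envelope model shows how the routing share, the price ratio, and the classifier overhead combine. The 60% share and the ~4x ratio come from this lab; the classifier cost fraction is a placeholder assumption, so substitute values measured in Langfuse before drawing conclusions.

```python
HAIKU_SHARE = 0.6          # fraction of queries routed to Haiku (3 of the 5 test prompts)
PRICE_RATIO = 4.0          # Sonnet cost / Haiku cost, from the "~4x" figure in this lab
CLASSIFIER_FRACTION = 0.2  # ASSUMPTION: classifier call costs 20% of one Haiku request


def relative_cost(haiku_share: float, price_ratio: float, classifier_fraction: float) -> float:
    """Average cost per query, relative to sending every query to Sonnet (= 1.0)."""
    haiku_cost = 1.0 / price_ratio
    routed = haiku_share * haiku_cost + (1.0 - haiku_share) * 1.0
    overhead = classifier_fraction * haiku_cost  # every query pays for one classifier call
    return routed + overhead


cost = relative_cost(HAIKU_SHARE, PRICE_RATIO, CLASSIFIER_FRACTION)
print(f"Relative cost vs. all-Sonnet: {cost:.2f}")
```

Under these assumptions the routed setup costs about 0.60 of the all-Sonnet baseline. The model ignores prompt caching and differences in response length between the two models.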

In [None]:
# Review the v4 agent code
agent_file = Path("agents/v4_routing.py")
print(agent_file.read_text())

## Step 4: Deploy the Routing Agent

In [None]:
agent_name = "customer_support_v4_routing"
agent_file = str(Path("agents/v4_routing.py").absolute())
requirements_file = str(Path("requirements-for-agentcore.txt").absolute())

print(f"Agent name: {agent_name}")
print(f"Agent file: {agent_file}")
print(f"Requirements: {requirements_file}")

# Clean up any existing build files from previous labs
for f in ["Dockerfile", ".dockerignore", ".bedrock_agentcore.yaml"]:
    p = Path(f)
    if p.exists():
        p.unlink()
        print(f"Removed existing: {f}")

print(f"Configuring agent: {agent_name}")
agentcore_runtime.configure(
    entrypoint=agent_file,
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file=requirements_file,
    region=region,
    agent_name=agent_name,
)

In [None]:
# Modify Dockerfile for Langfuse: strip the opentelemetry-instrument wrapper from CMD
import re

dockerfile_path = Path("Dockerfile")
if dockerfile_path.exists():
    content = dockerfile_path.read_text()
    if "opentelemetry-instrument" in content:
        # Rewrite CMD ["opentelemetry-instrument", "python", "-m", "<module>"]
        # to      CMD ["python", "-m", "<module>"]
        content = re.sub(
            r'CMD \["opentelemetry-instrument", "python", "-m", "([^"]+)"\]',
            r'CMD ["python", "-m", "\1"]',
            content,
        )
        dockerfile_path.write_text(content)
        print("Dockerfile modified for Langfuse")
    else:
        print("Dockerfile already configured or using different format")
else:
    print("Dockerfile not found - will be created during deployment")
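To see what the substitution in the cell above does, it can be applied to a single sample line. The module name `v4_routing` in the sample is hypothetical; the generated Dockerfile may use a different one.

```python
import re

# Same pattern and replacement as the Dockerfile cell in this step.
PATTERN = r'CMD \["opentelemetry-instrument", "python", "-m", "([^"]+)"\]'
REPLACEMENT = r'CMD ["python", "-m", "\1"]'

# Hypothetical CMD line, used only to demonstrate the rewrite.
sample = 'CMD ["opentelemetry-instrument", "python", "-m", "v4_routing"]'

rewritten = re.sub(PATTERN, REPLACEMENT, sample)
print(rewritten)
```

The capture group `([^"]+)` preserves the module name, so only the `opentelemetry-instrument` wrapper is removed.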

In [None]:
env_vars = {
    "LANGFUSE_BASE_URL": os.environ.get("LANGFUSE_BASE_URL"),
    "LANGFUSE_PUBLIC_KEY": os.environ.get("LANGFUSE_PUBLIC_KEY"),
    "LANGFUSE_SECRET_KEY": os.environ.get("LANGFUSE_SECRET_KEY"),
    "PYTHONUNBUFFERED": "1",
}

print("Deploying to AgentCore Runtime...")
launch_result = agentcore_runtime.launch(env_vars=env_vars, auto_update_on_conflict=True)
agent_arn = launch_result.agent_arn
print(f"Agent deployed: {agent_arn}")
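Because `os.environ.get` returns `None` for unset variables, it can help to check the dictionary before calling `launch`. The helper below is a minimal sketch; `missing_keys` is a hypothetical name and the sample values are made up.

```python
def missing_keys(env: dict) -> list:
    """Return the names of variables whose value is None or empty, sorted."""
    return sorted(name for name, value in env.items() if not value)


# Made-up sample that mimics an incomplete .env file.
sample_env = {
    "LANGFUSE_BASE_URL": "https://langfuse.example.com",
    "LANGFUSE_PUBLIC_KEY": None,
    "LANGFUSE_SECRET_KEY": "",
    "PYTHONUNBUFFERED": "1",
}

print(missing_keys(sample_env))
```

Running `missing_keys(env_vars)` before deployment gives an early, readable error instead of traces silently failing to appear in Langfuse.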

In [None]:
# Save the agent ARN for later use
agent_arn = launch_result.agent_arn
print(f"Agent ARN: {agent_arn}")

## Step 5: Test Model Routing

Let's run the same test prompts and observe which model handles each query.

In [None]:
def invoke_agent(prompt):
    """Invoke the agent via AgentCore API."""
    response = data_client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        runtimeSessionId=str(uuid.uuid4()),
        payload=json.dumps({"prompt": prompt}).encode(),
    )
    return json.loads(response["response"].read().decode("utf-8"))

In [None]:
from utils.langfuse_metrics import (
    clear_metrics,
    collect_metric,
    get_latest_trace_metrics,
    print_metrics,
    print_metrics_table,
)

clear_metrics()

TEST_PROMPTS = [
    ("Return Policy", "What is your return policy for laptops?"),
    ("Product Info", "Tell me about your smartphone options"),
    ("Technical Support", "My laptop won't turn on, can you help me troubleshoot?"),
    ("Multi-part Question", "I want to buy a laptop. What are the specs and what's the return policy?"),
    ("General Question", "Hello! What can you help me with today?"),
]

for test_name, prompt in TEST_PROMPTS:
    print("=" * 60)
    print(f"Test: {test_name}")
    print("=" * 60)

    result = invoke_agent(prompt)

    if isinstance(result, dict):
        print(f"Model used: {result.get('model_used', 'N/A')}")
        print(f"Complexity: {result.get('complexity', 'N/A')}")
        print(f"Response: {str(result.get('response', result))[:200]}...")
    else:
        print(result)

    # Get metrics from parent trace (includes classifier + main agent)
    metrics = get_latest_trace_metrics(
        agent_name="customer-support-v4-routing",
        wait_seconds=5,
        max_retries=5,
        timeout_seconds=120,
    )
    print_metrics(metrics, test_name)
    collect_metric(metrics, test_name)

In [None]:
print_metrics_table()

# Save metrics for comparison in later notebooks
from utils.langfuse_metrics import save_metrics
save_metrics("v4")

### Expected Routing Results

**Routed to Haiku (3 queries):**
- Return Policy — Single factual lookup
- Product Info — Direct product question  
- General Question — Simple greeting

**Routed to Sonnet (2 queries):**
- Technical Support — Multi-step troubleshooting
- Multi-part Question — Multiple questions requiring reasoning

**Result:** 60% of queries routed to the cheaper model.
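The routing share can also be computed from the agent responses rather than counted by hand. This sketch assumes each response is a dict with a `complexity` field holding "simple" or "complex", as printed in the test loop; the sample list is illustrative and not real output.

```python
from collections import Counter


def routing_share(results: list) -> dict:
    """Return the fraction of responses per complexity label."""
    counts = Counter(result.get("complexity", "unknown") for result in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Illustrative responses matching the expected routing above (not real output).
sample_results = [
    {"complexity": "simple"},
    {"complexity": "simple"},
    {"complexity": "complex"},
    {"complexity": "complex"},
    {"complexity": "simple"},
]

print(routing_share(sample_results))
```

To use it with live data, append each `result` from the test loop to a list and pass that list to `routing_share`.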

## Step 6: Compare with v3 (Caching)

Load your saved metrics from Lab 03 (v3 caching) to compare with the v4 routing results. If they were not saved, enter them manually in the cell below.

In [None]:
from utils.langfuse_metrics import load_metrics, print_comparison

# Load the metrics saved at the end of Lab 03
v3 = load_metrics("v3")

# Or enter manually if Lab 03 metrics weren't saved:
# v3 = {"total_cost": 0.0438, "avg_latency": 8.10, "total_input_tokens": 4228, "total_output_tokens": 1795}

# Print comparison (current metrics auto-calculated from collected)
print_comparison(
    prev_name="v3 (Caching)",
    curr_name="v4 (Routing)",
    prev_cost=v3["total_cost"],
    prev_latency=v3["avg_latency"],
    prev_input_tokens=v3["total_input_tokens"],
    prev_output_tokens=v3["total_output_tokens"],
)
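The comparison boils down to a relative change per metric. The helper below is a minimal sketch of that arithmetic; it is not the implementation of `print_comparison`, and the numbers in the usage example are invented.

```python
def pct_change(prev: float, curr: float) -> float:
    """Relative change from prev to curr in percent; negative means a reduction."""
    if prev == 0:
        raise ValueError("prev must be non-zero to compute a relative change")
    return (curr - prev) / prev * 100.0


# Invented example values: a metric that drops from 100 to 75.
print(f"{pct_change(100, 75):+.1f}%")
```

A negative value for cost or latency means v4 improved on v3 for that metric.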

## Summary

In this notebook, we implemented intelligent model routing:

1. **LLM-based classification** — Haiku classifies queries as "simple" or "complex" using a single-word response
2. **Cost-effective routing** — Simple queries go to Haiku (~4x cheaper), complex to Sonnet
3. **Prompt caching for Sonnet only** — Haiku requires 4,096 tokens minimum (our prompt is ~1,030), so only Sonnet requests benefit from caching

**Key insights:**

- **LLM classification beats keyword matching** — Handles semantic variations and edge cases
- **Simple text parsing is reliable** — Few-shot examples in the prompt ensure consistent "simple" or "complex" responses
- **Real cost savings** — Compare the v3 vs v4 metrics above to see actual savings from routing

**Next:** In Lab 05, we'll add Bedrock Guardrails to filter off-topic queries before they reach the LLM.

---

## Cleanup

To delete the agent deployed in this notebook, uncomment and run the following code.

In [None]:
# # Uncomment to delete resources created in this lab
# agentcore_runtime.destroy(delete_ecr_repo=True)
# print(f"Deleted agent and ECR repository: {agent_name}")