# Lab 03: Prompt Caching

## Overview

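In this notebook, we enable **prompt caching** to reduce costs on repeated requests. Prompt caching stores the processed system prompt and tool definitions so that subsequent requests within the cache TTL can reuse them at a discounted cache-read rate (up to about 90% off the standard input-token price for some models; exact rates vary by model).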

**What you'll learn:**
- How to configure system prompt caching
- How to enable tool definition caching
- How to verify cache hits in Langfuse
- How to calculate cost savings from caching

**Optimizations in this notebook:**
- `SystemContentBlock` with `cachePoint` (system prompt caching - provider-agnostic)
- `cache_tools="default"` on BedrockModel (tool definition caching)

## Prerequisites

- Completed Labs 01-02

## Workshop Journey

```
01 Baseline → 02 Quick Wins → [03 Caching] → 04 Routing → 05 Guardrails → 06 Gateway → 07 Evaluations
                                   ↑
                              You are here
```

## Step 1: Setup

In [None]:
from __future__ import annotations

import json
import os
import uuid
from pathlib import Path

from dotenv import load_dotenv

load_dotenv(override=True)

import boto3
from bedrock_agentcore_starter_toolkit import Runtime

region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
control_client = boto3.client("bedrock-agentcore-control", region_name=region)
data_client = boto3.client("bedrock-agentcore", region_name=region)
agentcore_runtime = Runtime()

print(f"Region: {region}")
print(f"Langfuse Host: {os.environ.get('LANGFUSE_BASE_URL', 'Not set')}")

## Step 2: Understanding Prompt Caching

### What is Prompt Caching?

Prompt caching stores frequently reused content so subsequent requests can skip reprocessing it. This notebook enables **two types of caching**:

| Cache Type | What's Cached | Configuration |
|------------|---------------|---------------|
| **System Prompt** | The entire system prompt (~1,100 tokens) | `SystemContentBlock` with `cachePoint` |
| **Tool Definitions** | All tool schemas and descriptions | `cache_tools="default"` on BedrockModel |

### Minimum Token Requirement

> **Important:** Claude Sonnet 4.5 requires **at least 1,024 tokens** before a cache checkpoint for caching to activate.
>
> If your system prompt is below this threshold:
> - Inference still succeeds normally (no error)
> - No cache checkpoint is created (silently skipped)
> - `cacheWriteInputTokens` and `cacheReadInputTokens` remain 0
>
> **Solution:** Expand short system prompts with few-shot examples to meet the minimum.
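
Because a prompt below the minimum is skipped silently, a quick pre-deployment check can help. The sketch below is only a heuristic: it assumes roughly 4 characters per token for English text, which is not how the model's tokenizer actually counts, and the names `estimate_tokens` and `likely_cacheable` are hypothetical helpers, not part of any SDK.

```python
MIN_CACHE_TOKENS = 1024  # minimum stated above for Claude Sonnet 4.5
CHARS_PER_TOKEN = 4      # assumption: rough average for English text


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate; this is not a real tokenizer."""
    return len(text) // CHARS_PER_TOKEN


def likely_cacheable(text: str, safety_margin: float = 1.1) -> bool:
    """Return True if the estimate clears the minimum with some headroom."""
    return estimate_tokens(text) >= MIN_CACHE_TOKENS * safety_margin


if __name__ == "__main__":
    sample_prompt = "You are a helpful customer support agent. " * 120
    print(estimate_tokens(sample_prompt), likely_cacheable(sample_prompt))
```

The safety margin exists because the estimate can be off in either direction; the authoritative check is still whether `cacheWriteInputTokens` is non-zero on the first request.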

### System Prompt Caching

Use `SystemContentBlock` with a cache point at the end:

```python
from strands import Agent
from strands.types.content import SystemContentBlock

# SYSTEM_PROMPT_TEXT is the agent's full system prompt string, defined elsewhere.
# Create the system prompt with a cache point after the text block.
system_prompt = [
    SystemContentBlock(text=SYSTEM_PROMPT_TEXT),  # Must be 1,024+ tokens
    SystemContentBlock(cachePoint={"type": "default"})  # Cache checkpoint
]

agent = Agent(system_prompt=system_prompt)
```

### Tool Definition Caching

Enable tool caching via `BedrockModel`:

```python
from strands.models import BedrockModel

model = BedrockModel(
    model_id="us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    cache_tools="default"  # Cache all tool definitions
)
```

### Pricing

Cache pricing varies by model. See [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/) for current rates.
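
To make the cost effect concrete, here is a minimal sketch of an input-cost estimate. The multipliers (`read_multiplier=0.1`, `write_multiplier=1.25`) and the base price used in the example are illustrative placeholders, not quoted rates; substitute the values from the pricing page for your model. The function name and parameters are hypothetical.

```python
def estimate_input_cost(
    uncached_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    base_price_per_1k: float,
    read_multiplier: float = 0.1,    # assumption: cache reads billed at 10% of base
    write_multiplier: float = 1.25,  # assumption: cache writes billed at 125% of base
) -> float:
    """Estimate input-token cost when some tokens are cached."""
    weighted_tokens = (
        uncached_tokens
        + cache_read_tokens * read_multiplier
        + cache_write_tokens * write_multiplier
    )
    return weighted_tokens / 1000 * base_price_per_1k


def estimate_savings(uncached_tokens, cache_read_tokens, cache_write_tokens,
                     base_price_per_1k, **multipliers) -> float:
    """Cost without caching minus cost with caching (positive means savings)."""
    total_tokens = uncached_tokens + cache_read_tokens + cache_write_tokens
    no_cache_cost = total_tokens / 1000 * base_price_per_1k
    cached_cost = estimate_input_cost(
        uncached_tokens, cache_read_tokens, cache_write_tokens,
        base_price_per_1k, **multipliers,
    )
    return no_cache_cost - cached_cost


if __name__ == "__main__":
    # Placeholder base price; not a real quoted rate.
    print(estimate_savings(500, 8800, 2200, base_price_per_1k=0.003))
```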

### Cache Metrics in Langfuse

- `cacheWriteInputTokens` - Tokens written to the cache (the first request, or the first request after the cache expires)
- `cacheReadInputTokens` - Tokens read from the cache (subsequent requests within the 5-minute TTL)
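
The following sketch simulates which requests would be cache writes and which would be cache reads for a given sequence of request times. It assumes a 5-minute TTL that restarts on every write and every hit, and it treats a request arriving exactly at the expiry time as a miss; `simulate_cache` is a hypothetical helper for illustration and does not call any service.

```python
TTL_SECONDS = 300  # 5-minute TTL described above


def simulate_cache(request_times: list[float]) -> list[str]:
    """Label each request as 'write' or 'read' under a sliding TTL."""
    labels = []
    expires_at = None  # no cache entry exists before the first request
    for t in request_times:
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        # Both a fresh write and a cache hit restart the TTL window.
        expires_at = t + TTL_SECONDS
    return labels


if __name__ == "__main__":
    # Requests at 0s, 60s, 350s, and 700s after a cold start.
    print(simulate_cache([0, 60, 350, 700]))
```

In this example the request at 350s is still a read because the hit at 60s extended the window to 360s, while the request at 700s arrives after the window closed at 650s and triggers a new write.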

In [None]:
# Review the caching configuration in v3 agent
agent_file = Path("agents/v3_caching.py")
print(agent_file.read_text())

## Step 3: Deploy the Caching Agent

In [None]:
agent_name = "customer_support_v3_caching"
agent_file = str(Path("agents/v3_caching.py").absolute())
requirements_file = str(Path("requirements-for-agentcore.txt").absolute())

# Clean up any existing build files from previous labs
for f in ["Dockerfile", ".dockerignore", ".bedrock_agentcore.yaml"]:
    p = Path(f)
    if p.exists():
        p.unlink()
        print(f"Removed existing: {f}")

print(f"Configuring agent: {agent_name}")
agentcore_runtime.configure(
    entrypoint=agent_file,
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file=requirements_file,
    region=region,
    agent_name=agent_name,
)

In [None]:
# Modify Dockerfile for Langfuse
dockerfile_path = Path("Dockerfile")
if dockerfile_path.exists():
    content = dockerfile_path.read_text()
    # Replace opentelemetry-instrument wrapper with direct python call
    # Keep the correct module path using regex
    if "opentelemetry-instrument" in content:
        import re

        content = re.sub(
            r'CMD \["opentelemetry-instrument", "python", "-m", "([^"]+)"\]', r'CMD ["python", "-m", "\1"]', content
        )
        dockerfile_path.write_text(content)
        print("Dockerfile modified for Langfuse")
    else:
        print("Dockerfile already configured or using different format")
else:
    print("Dockerfile not found - will be created during deployment")

In [None]:
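env_vars = {
    "LANGFUSE_BASE_URL": os.environ.get("LANGFUSE_BASE_URL"),
    "LANGFUSE_PUBLIC_KEY": os.environ.get("LANGFUSE_PUBLIC_KEY"),
    "LANGFUSE_SECRET_KEY": os.environ.get("LANGFUSE_SECRET_KEY"),
    "PYTHONUNBUFFERED": "1",
}

# Fail fast if any Langfuse setting is missing from the environment
missing = [name for name, value in env_vars.items() if value is None]
if missing:
    raise ValueError(f"Missing environment variables: {', '.join(missing)}")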

print("Deploying to AgentCore Runtime...")
launch_result = agentcore_runtime.launch(env_vars=env_vars, auto_update_on_conflict=True)
agent_arn = launch_result.agent_arn
print(f"Agent deployed: {agent_arn}")

In [None]:
# Save the agent ARN for later use
agent_arn = launch_result.agent_arn
print(f"Agent ARN: {agent_arn}")

## Step 4: Test Caching Behavior

Run the standard test prompts and observe cache metrics in Langfuse.

**What to look for:**
- `Cache Read Tokens` > 0 indicates the system prompt and tool definitions are being read from cache
- `Cache Write Tokens` > 0 indicates tokens were written to cache (only happens when cache is cold/expired)
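
As a rough aid for reading these numbers, the sketch below classifies a single request from its usage counters. It assumes a plain dict keyed by `inputTokens`, `cacheReadInputTokens`, and `cacheWriteInputTokens`, and that `inputTokens` counts only uncached input; the dict that the workshop's Langfuse helper returns may be shaped differently, and `cache_state` and `cache_hit_ratio` are hypothetical names.

```python
def cache_state(usage: dict) -> str:
    """Classify one request as 'hit', 'write', 'partial', or 'none'."""
    read = usage.get("cacheReadInputTokens") or 0
    write = usage.get("cacheWriteInputTokens") or 0
    if read > 0 and write > 0:
        return "partial"  # some content read from cache, some newly written
    if read > 0:
        return "hit"
    if write > 0:
        return "write"    # cold or expired cache
    return "none"         # caching inactive, e.g. prompt below the minimum


def cache_hit_ratio(usage: dict) -> float:
    """Share of all input tokens that were served from the cache."""
    read = usage.get("cacheReadInputTokens") or 0
    write = usage.get("cacheWriteInputTokens") or 0
    uncached = usage.get("inputTokens") or 0
    total = read + write + uncached
    return read / total if total else 0.0


if __name__ == "__main__":
    example = {"inputTokens": 200, "cacheReadInputTokens": 1800, "cacheWriteInputTokens": 0}
    print(cache_state(example), cache_hit_ratio(example))
```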

In [None]:
def invoke_agent(prompt):
    """Invoke the agent via AgentCore API."""
    response = data_client.invoke_agent_runtime(
        agentRuntimeArn=agent_arn,
        runtimeSessionId=str(uuid.uuid4()),
        payload=json.dumps({"prompt": prompt}).encode(),
    )
    return json.loads(response["response"].read().decode("utf-8"))

In [None]:
# Import Langfuse metrics helper
from utils.langfuse_metrics import (
    clear_metrics,
    collect_metric,
    get_latest_trace_metrics,
    print_metrics,
    print_metrics_table,
)

# Clear any previously collected metrics
clear_metrics()

# Standard test prompts - same across all notebooks for consistent comparison
TEST_PROMPTS = [
    # Single tool: get_return_policy
    ("Return Policy", "What is your return policy for laptops?"),
    # Single tool: get_product_info
    ("Product Info", "Tell me about your smartphone options"),
    # Single tool: get_technical_support (Bedrock KB)
    ("Technical Support", "My laptop won't turn on, can you help me troubleshoot?"),
    # Multi-tool: get_product_info + get_return_policy
    ("Multi-part Question", "I want to buy a laptop. What are the specs and what's the return policy?"),
    # No tool: General greeting
    ("General Question", "Hello! What can you help me with today?"),
]

# Run all tests and collect metrics
# Cache behavior: First request after deployment writes to cache, subsequent requests read from cache
# If cache is already warm (within 5-min TTL), all requests will show cache reads
for i, (test_name, prompt) in enumerate(TEST_PROMPTS):
    print("=" * 60)
    print(f"Test {i + 1}: {test_name}")
    print("=" * 60)

    response = invoke_agent(prompt)
    print(response)

    # Fetch and collect metrics
    metrics = get_latest_trace_metrics(
        agent_name="customer-support-v3-caching",
        wait_seconds=5,
        max_retries=5,
        timeout_seconds=120,
    )
    print_metrics(metrics, test_name)
    collect_metric(metrics, test_name)

In [None]:
from utils.langfuse_metrics import save_metrics

# Print summary table
print_metrics_table()

# Save metrics for comparison in later notebooks
save_metrics("v3")

## Step 5: Compare with v2 (Quick Wins)

Load your saved metrics from Lab 02 (v2 quick wins), or enter them manually, to compare cost, latency, and token usage.

In [None]:
from utils.langfuse_metrics import calculate_totals_from_collected, load_metrics, print_comparison

# Load metrics from Lab 02 (assumes they were saved there with save_metrics("v2"))
v2 = load_metrics("v2")

# Or enter manually if Lab 02 metrics weren't saved:
# v2 = {"total_cost": 0.0858, "avg_latency": 7.30, "total_input_tokens": 19920, "total_output_tokens": 1737}

# Print comparison (current metrics auto-calculated from collected)
print_comparison(
    prev_name="v2 (Quick Wins)",
    curr_name="v3 (Caching)",
    prev_cost=v2["total_cost"],
    prev_latency=v2["avg_latency"],
    prev_input_tokens=v2["total_input_tokens"],
    prev_output_tokens=v2["total_output_tokens"],
)

# Show cache-specific metrics (unique to v3)
totals = calculate_totals_from_collected()
print("\nCache Metrics (v3 only):")
print(f"  Cache Read Tokens:  {totals['total_cache_read_tokens']:,}")
print(f"  Cache Write Tokens: {totals['total_cache_write_tokens']:,}")

## Summary

In this notebook, we enabled **two types of caching** on top of the v2 optimizations:

| Cache Type | Configuration | Tokens Cached |
|------------|---------------|---------------|
| **System Prompt** | `SystemContentBlock` + `cachePoint` | ~1,100 tokens |
| **Tool Definitions** | `cache_tools="default"` | ~1,100 tokens |

```python
from strands import Agent
from strands.models import BedrockModel
from strands.types.content import SystemContentBlock

# 1. System prompt caching
system_prompt = [
    SystemContentBlock(text=SYSTEM_PROMPT_TEXT),  # Must be 1,024+ tokens
    SystemContentBlock(cachePoint={"type": "default"})
]

# 2. Tool definition caching
model = BedrockModel(
    cache_tools="default",
    # ... remaining model settings
)

agent = Agent(model=model, system_prompt=system_prompt)
```

**Key Observations:**
- First request after cache expires writes to cache (premium pricing)
- Subsequent requests read from cache (discounted pricing)
- The 5-minute TTL refreshes on each cache hit
- System prompt must be **at least 1,024 tokens** for caching

**When to use caching:**
- Agents with consistent traffic (requests within 5-minute windows)
- Static system prompts and tool definitions
- High-volume production workloads
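
Whether traffic is "consistent enough" can be framed as a break-even question: how many requests must share one cache entry before the write premium is recovered? The sketch below works this out for the cached portion of the prompt only. The default multipliers are illustrative placeholders rather than quoted rates, and `break_even_requests` is a hypothetical helper.

```python
import math


def break_even_requests(write_multiplier: float = 1.25,
                        read_multiplier: float = 0.1) -> int:
    """Smallest number of requests sharing one cache entry where caching is cheaper.

    With caching, n requests cost: write + (n - 1) * read (in units of base price).
    Without caching, they cost: n.
    Caching wins when n > (write - read) / (1 - read).
    """
    if read_multiplier >= 1:
        raise ValueError("Cache reads must be cheaper than uncached input.")
    threshold = (write_multiplier - read_multiplier) / (1 - read_multiplier)
    return math.floor(threshold) + 1


if __name__ == "__main__":
    print(break_even_requests())          # placeholder multipliers
    print(break_even_requests(2.0, 0.5))  # a less favorable hypothetical
```

Under the placeholder multipliers, a single follow-up request within the TTL is already enough to come out ahead; with less favorable multipliers the required number of requests grows.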

**Pricing:** See [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/) for current cache read/write rates.

---

**Next:** In Lab 04, we'll explore **model routing** to use cheaper models for simple queries.

**Next notebook:** [04-llm-routing.ipynb](./04-llm-routing.ipynb)

---

## Cleanup

To delete the agent deployed in this notebook, uncomment and run the following code.

In [None]:
# # Uncomment to delete resources created in this lab
# agentcore_runtime.destroy(delete_ecr_repo=True)
# print(f"Deleted agent and ECR repository: {agent_name}")