# Advanced Prompt Caching

This notebook covers advanced prompt caching strategies for Amazon Bedrock. You'll learn what can be cached, how caching works under the hood, and best practices to maximize cache efficiency for cost savings.

## Learning Objectives

By the end of this notebook, you will be able to:
- Understand the three types of cacheable content (tools, system, messages)
- Implement multi-checkpoint caching patterns
- Apply static/dynamic separation principles
- Choose appropriate TTL strategies (5-minute vs 1-hour)
- Monitor and optimize cache hit rates

## Why This Matters

At production scale, caching decisions significantly impact costs:
- **Cache reads cost up to 90% less than regular input** - each cache hit saves substantial money
- **Wrong checkpoint placement causes cache thrashing** - repeated cache misses waste the 1.25x write investment
- **Proper TTL selection prevents unnecessary rewrites** - 1-hour TTL saves money on stable content

**Duration**: ~60 minutes

## Prerequisites

Before running this notebook, ensure you have:
1. An AWS account with Amazon Bedrock access enabled
2. AWS credentials configured (via `.env` file, AWS CLI, or IAM role)

## Setup

In [1]:
# Standard library imports
from __future__ import annotations

import json
import os
from datetime import datetime
from pathlib import Path
from uuid import uuid4

# Third-party imports
import boto3
from dotenv import load_dotenv

# Local imports - utility functions for cache metrics
from utils import (
    analyze_caching_roi,
    calculate_savings,
    extract_cache_metrics,
)

# Load environment variables
load_dotenv()

# Initialize AWS client
REGION = os.getenv("AWS_DEFAULT_REGION", "us-east-1")
bedrock_runtime = boto3.client("bedrock-runtime", region_name=REGION)

# Model configuration
MODEL_ID = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"

# Data directory path
DATA_DIR = Path("data")

print(f"Region: {REGION}")
print(f"Model: {MODEL_ID}")

Region: us-east-1
Model: global.anthropic.claude-sonnet-4-5-20250929-v1:0


<div class="alert alert-block alert-info">
<b>Note:</b> This notebook uses Claude Sonnet 4.5 with the <b>global</b> cross-Region inference (CRIS) profile. The global profile offers ~10% cost savings and higher availability by automatically routing requests to Regions with available capacity.
</div>

In [2]:
# Load sample data files for demos
# These files contain realistic content to demonstrate caching patterns

# Technical documentation for document Q&A demos
TECHNICAL_DOCS = (DATA_DIR / "technical_documentation.txt").read_text()

# Company policies for system prompt demos
COMPANY_POLICIES = (DATA_DIR / "company_policies.txt").read_text()

# Data analyst system prompts (two versions to demo invalidation)
ANALYST_SYSTEM_V1 = (DATA_DIR / "analyst_system_prompt_v1.txt").read_text()
ANALYST_SYSTEM_V2 = (DATA_DIR / "analyst_system_prompt_v2.txt").read_text()

# Tool definitions
WEATHER_TOOLS = json.loads((DATA_DIR / "weather_tools.json").read_text())
ANALYST_TOOLS = json.loads((DATA_DIR / "analyst_tools.json").read_text())

print("Loaded sample data:")
print(f"  - Technical docs: {len(TECHNICAL_DOCS):,} chars")
print(f"  - Company policies: {len(COMPANY_POLICIES):,} chars")
print(f"  - Analyst system v1: {len(ANALYST_SYSTEM_V1):,} chars")
print(f"  - Analyst system v2: {len(ANALYST_SYSTEM_V2):,} chars")
print(f"  - Weather tools: {len(WEATHER_TOOLS)} tools")
print(f"  - Analyst tools: {len(ANALYST_TOOLS)} tools")

Loaded sample data:
  - Technical docs: 9,696 chars
  - Company policies: 8,750 chars
  - Analyst system v1: 4,701 chars
  - Analyst system v2: 4,387 chars
  - Weather tools: 2 tools
  - Analyst tools: 5 tools


---

# Part 1: Key Concepts

## 1.1 What Can Be Cached

Amazon Bedrock supports caching **three types of content** in your requests:

| Content Type | Description | Change Frequency |
|-------------|-------------|------------------|
| **Tools** | Function definitions for tool use | Rarely (schema changes) |
| **System** | System instructions and context | Occasionally (policy updates) |
| **Messages** | Conversation history | Frequently (every turn) |

## 1.2 Cache Checkpoints

Amazon Bedrock supports **up to 4 cache checkpoints per request**. A checkpoint marks: "Cache everything from the beginning up to this point."

### Checkpoint Syntax

```python
# Converse API (5-minute TTL only)
{"cachePoint": {"type": "default"}}

# InvokeModel API (Anthropic models - supports both TTLs)
{"cache_control": {"type": "ephemeral"}}                # 5-minute TTL, the default (1.25x write)
{"cache_control": {"type": "ephemeral", "ttl": "1h"}}   # 1-hour TTL (2x write)
```

### Supported Models and Limits

| Model | Min Tokens | Max Checkpoints | Cacheable Fields |
|-------|-----------|-----------------|------------------|
| Claude Sonnet 4.5 | 1,024 | 4 | system, messages, tools |
| Claude Opus 4.1 | 1,024 | 4 | system, messages, tools |
| Claude Haiku 4.5 | 4,096 | 4 | system, messages, tools |
| Amazon Nova Pro | 1,000 | 4 | system, messages |

> **Reference**: See [AWS Bedrock Prompt Caching Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html) for the latest supported models and limits.
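
The per-request checkpoint limit can be checked locally before a request is sent. The following is a minimal sketch, assuming a Converse-style request dict; `count_cache_points` and `MAX_CHECKPOINTS` are hypothetical names for illustration and are not part of boto3, and the tool definition is abbreviated.

```python
MAX_CHECKPOINTS = 4  # per-request limit from the table above


def count_cache_points(node) -> int:
    """Recursively count {"cachePoint": ...} blocks in a Converse-style request."""
    if isinstance(node, dict):
        own = 1 if "cachePoint" in node else 0
        nested = sum(count_cache_points(v) for k, v in node.items() if k != "cachePoint")
        return own + nested
    if isinstance(node, list):
        return sum(count_cache_points(item) for item in node)
    return 0


# Illustrative request: one checkpoint each in tools, system, and messages
request = {
    "toolConfig": {"tools": [{"toolSpec": {"name": "get_weather"}}, {"cachePoint": {"type": "default"}}]},
    "system": [{"text": "You are a helpful assistant."}, {"cachePoint": {"type": "default"}}],
    "messages": [
        {
            "role": "user",
            "content": [
                {"text": "Document: ..."},
                {"cachePoint": {"type": "default"}},
                {"text": "Question: ..."},
            ],
        }
    ],
}

n = count_cache_points(request)
print(f"{n} checkpoints, within limit: {n <= MAX_CHECKPOINTS}")
```

A check like this is most useful when several code paths can each append their own checkpoint to the same request.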

## 1.3 Bedrock Prompt Assembly Order

**Critical**: When calling Bedrock for inference, prompts are assembled in this specific order:

```
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Tools     │  →   │   System    │  →   │  Messages   │
│(if provided)│      │(if provided)│      │  (history + │
│             │      │             │      │   current)  │
└─────────────┘      └─────────────┘      └─────────────┘
```

This sequential assembly determines cache behavior:
- Caches are matched from the **beginning** of the assembled prompt
- Each checkpoint caches **everything from the start up to that checkpoint** (cumulative)
- If content at any position changes, **that cache + all subsequent caches** are invalidated

<div class="alert alert-block alert-warning">
<b>Key Insight:</b> Tools are first in assembly order - changing tools invalidates the <b>entire</b> cache chain (tools → system → messages).
</div>
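
The effect of assembly order on invalidation can be illustrated with a toy model. This sketch is not how Bedrock computes cache keys internally; it only assumes that each checkpoint is keyed on the full prefix up to that point. `prefix_fingerprints` and `simulate_hits` are hypothetical helpers, and the request content is made up.

```python
import hashlib
import json

SECTION_ORDER = ("tools", "system", "messages")  # assembly order


def prefix_fingerprints(request: dict) -> dict:
    """Hash the cumulative prefix ending at each section, in assembly order."""
    digest = hashlib.sha256()
    fingerprints = {}
    for section in SECTION_ORDER:
        digest.update(json.dumps(request.get(section), sort_keys=True).encode())
        fingerprints[section] = digest.copy().hexdigest()
    return fingerprints


def simulate_hits(previous: dict, current: dict) -> dict:
    """Compare two requests and report HIT/MISS for the prefix ending at each section."""
    prev_fp, curr_fp = prefix_fingerprints(previous), prefix_fingerprints(current)
    return {s: "HIT" if prev_fp[s] == curr_fp[s] else "MISS" for s in SECTION_ORDER}


base = {"tools": ["get_weather"], "system": "Policies v1", "messages": ["Hi"]}

print(simulate_hits(base, {**base, "messages": ["Bye"]}))
print(simulate_hits(base, {**base, "system": "Policies v2"}))
print(simulate_hits(base, {**base, "tools": ["get_weather", "send_notification"]}))
```

Because each fingerprint covers everything before it, a change in `tools` changes all three fingerprints, while a change in `messages` leaves the first two untouched.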

### Common Caching Patterns

| Pattern | Structure | Best For |
|---------|-----------|----------|
| **Single Checkpoint** | `[Document] → [CP] → [Query]` | FAQ/support, document Q&A |
| **Two Checkpoints** | `[Tools] → [CP1] → [System] → [CP2] → [Messages]` | Stable agent with fixed tools |
| **Three or More Checkpoints** | `[Tools] → [CP1] → [System] → [CP2] → [History] → [CP3] → [Current]` | Conversational agent |

<div class="alert alert-block alert-info">
<b>Note:</b> You can place multiple checkpoints within a single section. The 4-checkpoint limit is per request, not per section.
</div>
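
As a rough sketch of the three-checkpoint conversational pattern, the helper below builds the Converse arguments with a checkpoint after tools, after the system prompt, and after the last history message. `build_conversational_request` is a hypothetical helper, the tool definition is abbreviated, and the sketch assumes a non-empty tool list and a history that ends with an assistant turn.

```python
import copy


def checkpoint() -> dict:
    """Return a fresh Converse cache checkpoint block."""
    return {"cachePoint": {"type": "default"}}


def build_conversational_request(tools, system_text, history, current_text):
    """Place CP1 after tools, CP2 after system, CP3 after the last history message."""
    history = copy.deepcopy(history)  # avoid mutating the caller's messages
    if history:
        history[-1]["content"].append(checkpoint())  # CP3: end of stable history
    return {
        "toolConfig": {"tools": [*tools, checkpoint()]},  # CP1
        "system": [{"text": system_text}, checkpoint()],  # CP2
        "messages": [*history, {"role": "user", "content": [{"text": current_text}]}],
    }


tools = [{"toolSpec": {"name": "get_weather"}}]  # abbreviated for illustration
history = [
    {"role": "user", "content": [{"text": "What's the weather in Seattle?"}]},
    {"role": "assistant", "content": [{"text": "It is raining."}]},
]
request = build_conversational_request(tools, "You are a helpful assistant.", history, "And in New York?")
print([m["role"] for m in request["messages"]])
```

The returned dict could then be unpacked into a call such as `bedrock_runtime.converse(modelId=MODEL_ID, **request)`. Only the current user turn sits after the last checkpoint, so it is the only part that is never cached.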

## 1.4 Cumulative Token Counting

**Important**: The minimum token requirement is **cumulative**, not per-section.

This means:
- You DON'T need 1,024 tokens in tools AND 1,024 in system AND 1,024 in messages
- You need 1,024 tokens **total** from the start to your first checkpoint
- Each subsequent checkpoint also uses cumulative counting from the start

<div class="alert alert-block alert-warning">
<b>Key Insight:</b> If your tools/system individually don't meet the minimum, that's okay! The cumulative count from the start is what matters.
</div>
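
The cumulative rule can be worked through with plain arithmetic. The sketch below takes per-section token counts as input (the numbers are illustrative, not measured) and reports the running total at each checkpoint. `checkpoint_eligibility` is a hypothetical helper, and the 1,024 default is the Claude Sonnet 4.5 minimum from the table in 1.2.

```python
def checkpoint_eligibility(section_tokens, min_tokens=1024):
    """Return (section, cumulative_tokens, meets_minimum) for a checkpoint after each section."""
    running = 0
    result = []
    for name, tokens in section_tokens:
        running += tokens  # counting always starts at the beginning of the prompt
        result.append((name, running, running >= min_tokens))
    return result


# Illustrative sizes: small tools, large system prompt, short user message
sections = [("tools", 300), ("system", 1900), ("messages", 40)]

for name, cumulative, ok in checkpoint_eligibility(sections):
    print(f"checkpoint after {name:<8} cumulative={cumulative:>5,}  cacheable={ok}")
```

In this example a checkpoint placed after tools alone falls short at 300 tokens, but the checkpoint after the system prompt sees 2,200 cumulative tokens and qualifies.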

In [None]:
# Demo: Cumulative Token Counting
# Uses weather tools (small) + company policies as system (large)
# Together they exceed the 1,024 token minimum

# Weather tools - below minimum token count alone
tools_with_checkpoint = [*WEATHER_TOOLS, {"cachePoint": {"type": "default"}}]

# Company policies as system prompt - combined with tools exceeds minimum
system_with_checkpoint = [{"text": COMPANY_POLICIES}, {"cachePoint": {"type": "default"}}]

# Request 1: Cache WRITE
r1 = bedrock_runtime.converse(
    modelId=MODEL_ID,
    toolConfig={"tools": tools_with_checkpoint},
    system=system_with_checkpoint,
    messages=[{"role": "user", "content": [{"text": "What's the weather in Seattle?"}]}],
    inferenceConfig={"maxTokens": 100},
)
m1 = extract_cache_metrics(r1)

# Request 2: Cache READ
r2 = bedrock_runtime.converse(
    modelId=MODEL_ID,
    toolConfig={"tools": tools_with_checkpoint},
    system=system_with_checkpoint,
    messages=[{"role": "user", "content": [{"text": "What's the weather in New York?"}]}],
    inferenceConfig={"maxTokens": 100},
)
m2 = extract_cache_metrics(r2)

print("DEMO: Cumulative Token Counting")
print("Request 1: ")
print(m1)
print("---")
print("Request 2: ")
print(m2)
print("\nKey: Even though tools alone < 1,024, cumulative total met the requirement.")

## 1.5 Cache Invalidation Scenarios

Understanding what invalidates the cache is critical for maintaining high hit rates.

### Invalidation Impact by Change Type

| What Changes | Tools Cache | System Cache | Messages Cache |
|--------------|-------------|--------------|----------------|
| Only messages | ✅ HIT | ✅ HIT | Fresh (not cached) |
| System prompt | ✅ HIT | ❌ MISS (rewrite) | ❌ MISS |
| Tools | ❌ MISS | ❌ MISS | ❌ MISS |

<div class="alert alert-block alert-info">
<b>Note:</b> This assumes cache checkpoints are explicitly defined at each section and meet the minimum token requirement.
</div>

### Common Invalidation Triggers

| Trigger | Impact | Notes |
|---------|--------|-------|
| Tool definition changed | Full invalidation | Tools are assembled first |
| System prompt changed | System + Messages miss | Tools cache preserved |
| Message history changed | Messages only | Tools + System preserved |
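| Image added/removed | Messages miss | Applies wherever the image appears; per Anthropic's documentation, tools + system caches are preserved |
| `tool_choice` changed | Messages miss | Per Anthropic's documentation, tools + system caches are preserved |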

In [None]:
# Demo: Cache Invalidation Scenarios
# Uses analyst tools and two versions of system prompt to show invalidation patterns


def run_analyst(user_msg, system_text, tools_list):
    """Helper function to run analyst queries with caching."""
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        toolConfig={"tools": [*tools_list, {"cachePoint": {"type": "default"}}]},
        system=[{"text": system_text}, {"cachePoint": {"type": "default"}}],
        messages=[{"role": "user", "content": [{"text": user_msg}]}],
        inferenceConfig={"maxTokens": 100},
    )
    return extract_cache_metrics(response)


print("DEMO: Cache Invalidation Scenarios\n")

# Scenario 1: Only message changes (IDEAL)
print("--- Scenario 1: Only Message Changes ---")
m1 = run_analyst("Show sales Q1", ANALYST_SYSTEM_V1, ANALYST_TOOLS)
print(f"R1: write={m1['cache_write']:,}, read={m1['cache_read']:,}")
m2 = run_analyst("Show sales Q2", ANALYST_SYSTEM_V1, ANALYST_TOOLS)
print(f"R2: write={m2['cache_write']:,}, read={m2['cache_read']:,} ← Cache HIT!")

# Scenario 2: System changes (PARTIAL MISS)
print("\n--- Scenario 2: System Prompt Changes ---")
m3 = run_analyst("Show sales Q3", ANALYST_SYSTEM_V2, ANALYST_TOOLS)
print(f"R3: write={m3['cache_write']:,}, read={m3['cache_read']:,} ← System changed, partial miss")

# Scenario 3: Tools change (FULL INVALIDATION)
print("\n--- Scenario 3: Tools Change ---")
# Add a new tool to the existing tools
NEW_TOOL = {
    "toolSpec": {
        "name": "send_notification",
        "description": "Send notifications to users via email or Slack.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {"channel": {"type": "string"}, "message": {"type": "string"}},
                "required": ["channel", "message"],
            }
        },
    }
}
m4 = run_analyst("Show sales Q4", ANALYST_SYSTEM_V2, [NEW_TOOL, *ANALYST_TOOLS])
print(f"R4: write={m4['cache_write']:,}, read={m4['cache_read']:,} ← Tools changed, FULL invalidation!")

---

# Part 2: Best Practices

## 2.1 Static First, Dynamic Last

**Principle**: Place static (cacheable) content before your cache checkpoint, dynamic content after.

### Good Pattern: Static Document with Checkpoint

```python
messages = [{
    "role": "user",
    "content": [
        {"text": f"Document:\n{STATIC_DOCUMENT}"},  # Static (cached)
        {"cachePoint": {"type": "default"}},
        {"text": f"Question: {user_query}"}  # Dynamic (not cached)
    ]
}]
```

### Good Pattern: System Prompt with Static/Dynamic Split

```python
system = [
    {"text": f"You are a helpful assistant.\n\n{COMPANY_POLICIES}"},  # Static
    {"cachePoint": {"type": "default"}},
    {"text": f"Session: {uuid4()}\nDate: {datetime.now()}"}  # Dynamic
]
```

<div class="alert alert-block alert-warning">
<b>Common Mistakes:</b>
<ul>
<li>Putting user queries inside the cached section (before checkpoint)</li>
<li>Including session IDs or timestamps before the checkpoint</li>
<li>Adding dynamic data anywhere in the cached content</li>
</ul>
</div>
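
One way to catch these mistakes before they reach production is to build the same content twice and compare the part that sits before the checkpoint. This is a minimal sketch: `cached_prefix` and `is_prefix_stable` are hypothetical helpers, and the policy text is a placeholder.

```python
import json
from uuid import uuid4

CHECKPOINT = {"cachePoint": {"type": "default"}}


def cached_prefix(blocks):
    """Return the blocks that precede the last cachePoint (the part that gets cached)."""
    last = max((i for i, b in enumerate(blocks) if "cachePoint" in b), default=-1)
    return blocks[:last] if last >= 0 else []


def is_prefix_stable(build_blocks) -> bool:
    """Build the blocks twice and check that the cached prefix is identical both times."""
    first, second = cached_prefix(build_blocks()), cached_prefix(build_blocks())
    return json.dumps(first, sort_keys=True) == json.dumps(second, sort_keys=True)


def bad_system():
    # Session ID is inside the cached section, so the prefix changes on every call
    return [{"text": f"Session: {uuid4()}\nCompany policies ..."}, CHECKPOINT]


def good_system():
    # Static text first, checkpoint, then the dynamic session data
    return [{"text": "Company policies ..."}, CHECKPOINT, {"text": f"Session: {uuid4()}"}]


print("bad pattern stable: ", is_prefix_stable(bad_system))
print("good pattern stable:", is_prefix_stable(good_system))
```

A check like this could run in a unit test, so that a timestamp or request ID slipping into the cached section fails fast instead of showing up later as a 0% hit rate.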

In [5]:
# Demo: Static/Dynamic Separation - GOOD Pattern
# Uses technical documentation as static content

print("GOOD Pattern: Static before checkpoint, dynamic after\n")
queries = ["What is this about?", "Summarize the content", "Any key points?"]
metrics_good = []

for i, query in enumerate(queries, 1):
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[
            {
                "role": "user",
                "content": [
                    {"text": f"Document:\n{TECHNICAL_DOCS}"},
                    {"cachePoint": {"type": "default"}},  # Checkpoint AFTER static
                    {"text": f"Question: {query}"},  # Dynamic AFTER checkpoint
                ],
            }
        ],
        inferenceConfig={"maxTokens": 50},
    )
    metrics_good.append(extract_cache_metrics(response))
    print(f"Query {i}: cache_read={metrics_good[-1]['cache_read']:,}, cache_write={metrics_good[-1]['cache_write']:,}")

savings = calculate_savings(metrics_good)
print(f"\nCache hit rate: {savings['hit_rate']:.1f}%")

GOOD Pattern: Static before checkpoint, dynamic after

Query 1: cache_read=0, cache_write=2,627
Query 2: cache_read=2,627, cache_write=0
Query 3: cache_read=2,627, cache_write=0

Cache hit rate: 66.7%


In [6]:
# Demo: BAD Pattern - Dynamic content inside cached section

print("BAD Pattern: Dynamic content inside cached section (cache thrashing)\n")
queries = ["What is this about?", "Summarize the content", "Any key points?"]
metrics_bad = []

for i, query in enumerate(queries, 1):
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[
            {
                "role": "user",
                "content": [
                    # BAD: Query is INSIDE cached content!
                    {"text": f"Document:\n{TECHNICAL_DOCS}\n\nQuestion: {query}"},
                    {"cachePoint": {"type": "default"}},
                ],
            }
        ],
        inferenceConfig={"maxTokens": 50},
    )
    metrics_bad.append(extract_cache_metrics(response))
    print(f"Query {i}: cache_read={metrics_bad[-1]['cache_read']:,}, cache_write={metrics_bad[-1]['cache_write']:,}")

savings = calculate_savings(metrics_bad)
print(f"\nCache hit rate: {savings['hit_rate']:.1f}%")

BAD Pattern: Dynamic content inside cached section (cache thrashing)

Query 1: cache_read=0, cache_write=2,634
Query 2: cache_read=0, cache_write=2,634
Query 3: cache_read=0, cache_write=2,633

Cache hit rate: 0.0%


In [7]:
# Demo: BAD Pattern - Session IDs in system prompt before cache checkpoint

print("BAD Pattern: Dynamic session data in system prompt\n")
metrics_session_bad = []
for i in range(3):
    # BAD: New session ID on every request invalidates cache!
    system_bad = [
        {
            "text": f"You are a helpful assistant.\nSession ID: {uuid4()}\nDate: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n{COMPANY_POLICIES}"
        },
        {"cachePoint": {"type": "default"}},
    ]

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        system=system_bad,
        messages=[{"role": "user", "content": [{"text": "Hello"}]}],
        inferenceConfig={"maxTokens": 20},
    )
    metrics_session_bad.append(extract_cache_metrics(response))
    print(
        f"Request {i + 1}: cache_read={metrics_session_bad[-1]['cache_read']:,}, cache_write={metrics_session_bad[-1]['cache_write']:,}"
    )

print(f"\nCache hit rate: {calculate_savings(metrics_session_bad)['hit_rate']:.1f}%")

BAD Pattern: Dynamic session data in system prompt

Request 1: cache_read=0, cache_write=1,930
Request 2: cache_read=0, cache_write=1,930
Request 3: cache_read=0, cache_write=1,932

Cache hit rate: 0.0%


In [8]:
# Demo: GOOD Pattern - Static system with checkpoint, then dynamic context

print("GOOD Pattern: Static system with checkpoint, dynamic context after\n")
metrics_session_good = []
for i in range(3):
    # GOOD: Checkpoint BETWEEN static and dynamic parts
    system_good = [
        {"text": f"You are a helpful assistant.\n\n{COMPANY_POLICIES}"},  # Static (cached)
        {"cachePoint": {"type": "default"}},
        {"text": f"Session ID: {uuid4()}\nDate: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"},  # Dynamic
    ]

    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        system=system_good,
        messages=[{"role": "user", "content": [{"text": "Hello"}]}],
        inferenceConfig={"maxTokens": 20},
    )
    metrics_session_good.append(extract_cache_metrics(response))
    print(
        f"Request {i + 1}: cache_read={metrics_session_good[-1]['cache_read']:,}, cache_write={metrics_session_good[-1]['cache_write']:,}"
    )

print(f"\nCache hit rate: {calculate_savings(metrics_session_good)['hit_rate']:.1f}%")

GOOD Pattern: Static system with checkpoint, dynamic context after

Request 1: cache_read=0, cache_write=1,887
Request 2: cache_read=1,887, cache_write=0
Request 3: cache_read=1,887, cache_write=0

Cache hit rate: 66.7%


## 2.2 Cache TTL Strategy: 5-Minute vs 1-Hour

### TTL Syntax (InvokeModel API)

```python
# 5-minute TTL (default)
"cache_control": {"type": "ephemeral", "ttl": "5m"}

# 1-hour TTL (extended)
"cache_control": {"type": "ephemeral", "ttl": "1h"}
```

| TTL | Syntax | Duration | Write Cost | Read Cost |
|-----|--------|----------|------------|-----------|
| **5-minute** | `{"type": "ephemeral", "ttl": "5m"}` | 5 min | 1.25x | 0.1x |
| **1-hour** | `{"type": "ephemeral", "ttl": "1h"}` | 1 hour | 2x | 0.1x |

### When to Use Each TTL

**Keep using the 5-minute TTL when:**
- Prompts are used at a regular cadence (more often than once every 5 minutes)
- System prompts are reused often enough that each cache hit keeps refreshing the entry at no additional charge

**Use 1-hour TTL when:**
- Prompts are used less often than once every 5 minutes but more often than once an hour
- Agentic side-agents that take longer than 5 minutes to complete
- Long chat conversations where user may not respond within 5 minutes
- Latency is important and follow-up prompts may be sent beyond 5 minutes
- You want to improve rate limit utilization (cache hits don't count against rate limits)
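
The trade-off can be estimated with a simplified cost model. The sketch below assumes evenly spaced requests, uses the write and read multipliers from the table above, and assumes each cache hit refreshes the TTL. `relative_cost` and `cheapest_ttl` are hypothetical helpers, and costs are expressed in units of one uncached read of the cached prefix.

```python
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
TTL_MINUTES = {"5m": 5, "1h": 60}
READ_MULTIPLIER = 0.1


def relative_cost(ttl: str, interval_minutes: float, n_requests: int) -> float:
    """Cost of the cached prefix over n evenly spaced requests."""
    write = WRITE_MULTIPLIER[ttl]
    if interval_minutes < TTL_MINUTES[ttl]:
        # Each hit refreshes the entry, so only the first request pays for a write
        return write + (n_requests - 1) * READ_MULTIPLIER
    # The entry expires between requests, so every request is a write
    return n_requests * write


def cheapest_ttl(interval_minutes: float, n_requests: int) -> str:
    """Pick the TTL with the lower modeled cost for this access pattern."""
    return min(("5m", "1h"), key=lambda ttl: relative_cost(ttl, interval_minutes, n_requests))


for interval in (2, 15):
    costs = {ttl: relative_cost(ttl, interval, 10) for ttl in ("5m", "1h")}
    summary = ", ".join(f"{ttl}={cost:.2f}x" for ttl, cost in costs.items())
    print(f"10 requests every {interval} min: {summary} -> {cheapest_ttl(interval, 10)}")
```

Under this model the 5-minute TTL wins whenever requests arrive faster than every 5 minutes, and the 1-hour TTL wins once the gap exceeds 5 minutes but stays under an hour. Beyond an hour, neither TTL keeps the entry alive and every request pays the write premium.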

### Mixing TTLs

You can use both TTLs in the same request with an **important constraint**:

<div class="alert alert-block alert-warning">
<b>Important:</b> Cache entries with longer TTL must appear <b>before</b> shorter TTLs (1-hour cache must appear before any 5-minute cache entries).
</div>
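
The ordering constraint can be checked before sending a request. This is a minimal sketch, assuming an InvokeModel body in the Anthropic format with `cache_control` set on individual blocks; `iter_ttls` and `ttl_order_is_valid` are hypothetical helpers, and a missing `ttl` is treated as the 5-minute default.

```python
def iter_ttls(body: dict):
    """Yield cache TTLs in assembly order (tools, system, messages)."""
    blocks = list(body.get("tools", []))
    system = body.get("system", [])
    if isinstance(system, list):  # a plain-string system prompt has no cache_control
        blocks += system
    for message in body.get("messages", []):
        content = message.get("content", [])
        if isinstance(content, list):
            blocks += content
    for block in blocks:
        control = block.get("cache_control")
        if control:
            yield control.get("ttl", "5m")


def ttl_order_is_valid(body: dict) -> bool:
    """Return True if no 1-hour entry appears after a 5-minute entry."""
    seen_5m = False
    for ttl in iter_ttls(body):
        if ttl == "5m":
            seen_5m = True
        elif ttl == "1h" and seen_5m:
            return False
    return True


def make_body(system_ttl: str, message_ttl: str) -> dict:
    """Build a small illustrative body with one cached system block and one cached message block."""
    return {
        "system": [{"type": "text", "text": "Policies ...", "cache_control": {"type": "ephemeral", "ttl": system_ttl}}],
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Docs ...", "cache_control": {"type": "ephemeral", "ttl": message_ttl}},
                    {"type": "text", "text": "Question ..."},
                ],
            }
        ],
    }


print("1h then 5m:", ttl_order_is_valid(make_body("1h", "5m")))
print("5m then 1h:", ttl_order_is_valid(make_body("5m", "1h")))
```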

> **API Support**: 1-hour TTL is supported via **InvokeModel API** only. Converse API supports 5-minute TTL (`cachePoint`) only.
>
> **Reference**: [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/) for supported models.

In [9]:
# Demo: Mixed TTL Strategy with InvokeModel API
# Uses company policies (stable) with 1h TTL and technical docs as context with 5m TTL


def test_mixed_ttl(query: str) -> dict:
    """Test mixed TTL strategy using InvokeModel API."""
    request_body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "system": [
            {
                "type": "text",
                "text": f"You are an assistant for the company. You have to follow exactly with the company policies when the employee is asking any questions.\n\n{COMPANY_POLICIES}",
                "cache_control": {"type": "ephemeral", "ttl": "1h"},  # 1-hour TTL (must come first)
            }
        ],
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Reference documentation:\n{TECHNICAL_DOCS}",
                        "cache_control": {"type": "ephemeral", "ttl": "5m"},  # 5-minute TTL (after 1h)
                    },
                    {"type": "text", "text": f"Current question: {query}"},
                ],
            }
        ],
    }

    response = bedrock_runtime.invoke_model(
        modelId=MODEL_ID, contentType="application/json", accept="application/json", body=json.dumps(request_body)
    )
    return json.loads(response["body"].read())


print("DEMO: Mixed TTL Strategy (InvokeModel API)\n")

# Request 1: WRITE to both caches
r1 = test_mixed_ttl("When will my order arrive?")
print("Request 1: ")
print(r1["usage"])

print("---")

# Request 2: READ from both caches
r2 = test_mixed_ttl("Can I change the shipping address?")
print("Request 2: ")
print(r2["usage"])

DEMO: Mixed TTL Strategy (InvokeModel API)

Request 1: 
{'input_tokens': 11, 'cache_creation_input_tokens': 4533, 'cache_read_input_tokens': 0, 'cache_creation': {'ephemeral_5m_input_tokens': 2627, 'ephemeral_1h_input_tokens': 1906}, 'output_tokens': 100}
---
Request 2: 
{'input_tokens': 12, 'cache_creation_input_tokens': 0, 'cache_read_input_tokens': 4533, 'cache_creation': {'ephemeral_5m_input_tokens': 0, 'ephemeral_1h_input_tokens': 0}, 'output_tokens': 100}


## 2.3 Monitor Cache Performance

### Key Metrics from API Response

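Cache metrics are returned in every API response under the `usage` object. The Converse API uses the camelCase names below; the InvokeModel API reports the same quantities as `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens`: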

| Metric | Description | Cost Impact |
|--------|-------------|-------------|
| `inputTokens` | Fresh input tokens (not cached) | 1x base rate |
| `cacheWriteInputTokens` | Tokens written to cache | 1.25x (5m) or 2x (1h) |
| `cacheReadInputTokens` | Tokens read from cache | 0.1x base rate |
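
Because the Converse and InvokeModel responses name these fields differently, it can help to map both onto one shape before aggregating. The sketch below is a hypothetical `normalize_usage` helper, not the notebook's `extract_cache_metrics`; the sample dicts mirror the field names seen in this notebook, with illustrative values.

```python
def normalize_usage(usage: dict) -> dict:
    """Map Converse (camelCase) or InvokeModel (snake_case) usage fields to common names."""
    if "inputTokens" in usage:  # Converse API response["usage"]
        return {
            "input": usage.get("inputTokens", 0),
            "cache_write": usage.get("cacheWriteInputTokens", 0),
            "cache_read": usage.get("cacheReadInputTokens", 0),
        }
    # InvokeModel API (Anthropic models): parsed body["usage"]
    return {
        "input": usage.get("input_tokens", 0),
        "cache_write": usage.get("cache_creation_input_tokens", 0),
        "cache_read": usage.get("cache_read_input_tokens", 0),
    }


converse_usage = {"inputTokens": 12, "outputTokens": 50, "cacheReadInputTokens": 2627, "cacheWriteInputTokens": 0}
invoke_usage = {"input_tokens": 11, "cache_creation_input_tokens": 4533, "cache_read_input_tokens": 0, "output_tokens": 100}

print(normalize_usage(converse_usage))
print(normalize_usage(invoke_usage))
```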

### Cache Hit Rate Formula

```python
cache_hit_rate = cacheReadInputTokens / (cacheReadInputTokens + cacheWriteInputTokens)
```

**Target**: > 70% cache hit rate for optimal ROI
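
The hit rate formula and the cost multipliers can be combined into a small report. This is a sketch under stated assumptions: `cache_report` is a hypothetical helper (it does not reproduce `analyze_caching_roi`), the token counts are illustrative, output tokens are ignored, and the default multipliers are the 5-minute TTL rates.

```python
def cache_report(per_request, write_multiplier=1.25, read_multiplier=0.1):
    """Compute token-based hit rate and input-cost savings relative to no caching."""
    reads = sum(m["cache_read"] for m in per_request)
    writes = sum(m["cache_write"] for m in per_request)
    fresh = sum(m.get("input", 0) for m in per_request)

    hit_rate = reads / (reads + writes) if reads + writes else 0.0
    cached_cost = fresh + writes * write_multiplier + reads * read_multiplier
    uncached_cost = fresh + writes + reads  # every token billed at the 1x base rate
    savings = 1 - cached_cost / uncached_cost if uncached_cost else 0.0
    return {"hit_rate": hit_rate, "savings": savings}


# One cache write followed by four cache reads of the same 2,000-token prefix
session = [{"cache_write": 2000, "cache_read": 0}] + [{"cache_write": 0, "cache_read": 2000}] * 4

report = cache_report(session)
print(f"hit rate: {report['hit_rate']:.0%}, input cost savings: {report['savings']:.0%}")
```

With one write and four reads, the cached cost is 2,500 + 800 = 3,300 token-equivalents against 10,000 uncached, which is where the savings figure comes from.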

### When Caching Makes Sense

| Scenario | Cache? | Reason |
|----------|--------|--------|
| Same document, multiple queries | ✅ Yes | High reuse potential |
| Stable system prompt | ✅ Yes | Rarely changes |
| Multi-turn conversations | ✅ Yes | History reused each turn |
| One-off document analysis | ❌ No | No reuse opportunity |
| Highly dynamic prompts | ❌ No | Cache thrashing |
| Content < 1,024 tokens | ❌ No | Won't cache anyway |
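
The reuse argument in this table can be made concrete with a break-even calculation. This sketch assumes the write and read multipliers listed earlier (1.25x or 2x to write, 0.1x to read); `break_even_reads` is a hypothetical helper.

```python
import math


def break_even_reads(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of cache reads after which caching is cheaper than not caching."""
    premium = write_multiplier - 1.0          # extra cost of the write vs. plain input
    saving_per_read = 1.0 - read_multiplier   # saved on each later cache hit
    return math.floor(premium / saving_per_read) + 1


print("5-minute TTL break-even reads:", break_even_reads(1.25))
print("1-hour TTL break-even reads:  ", break_even_reads(2.0))
```

A one-off analysis never reaches break-even because there are zero reads, so the write premium is simply lost. With the 1-hour TTL, a single read is not enough: the write plus one read costs 2.1x, against 2.0x for two uncached requests.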


In [10]:
# Demo: Cache Performance Monitoring
# Uses technical documentation for multiple queries to show cache hit rate

session_metrics = []

print("DEMO: Cache Performance Monitoring\n")
queries = ["What is the main topic?", "List the key points.", "Summarize briefly.", "Any action items?", "Conclusion?"]

for i, query in enumerate(queries, 1):
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        messages=[
            {
                "role": "user",
                "content": [
                    {"text": f"Documents:\n{TECHNICAL_DOCS}"},
                    {"cachePoint": {"type": "default"}},
                    {"text": f"Question: {query}"},
                ],
            }
        ],
        inferenceConfig={"maxTokens": 50},
    )
    metrics = extract_cache_metrics(response)
    session_metrics.append(metrics)
    status = "WRITE" if metrics["cache_write"] > 0 else "READ"
    print(f"Query {i}: {status} - write={metrics['cache_write']:,}, read={metrics['cache_read']:,}")

# Analyze ROI
roi = analyze_caching_roi(session_metrics)

print(f"\n{'=' * 50}")
print("CACHE PERFORMANCE REPORT")
print(f"{'=' * 50}")
print(f"Requests:        {roi['total_requests']}")
print(f"Cache hit rate:  {roi['hit_rate']:.1f}% {'✅' if roi['hit_rate'] >= 70 else '⚠️'}")
print(f"Cost savings:    {roi['roi_pct']:.1f}%")
print(f"{'=' * 50}")

DEMO: Cache Performance Monitoring

Query 1: WRITE - write=2,627, read=0
Query 2: READ - write=0, read=2,627
Query 3: READ - write=0, read=2,627
Query 4: READ - write=0, read=2,627
Query 5: READ - write=0, read=2,627

CACHE PERFORMANCE REPORT
Requests:        5
Cache hit rate:  80.0% ✅
Cost savings:    66.8%


---

## Summary

In this notebook, you learned advanced prompt caching strategies for Amazon Bedrock:

### Part 1: Key Concepts

| Concept | Key Takeaway |
|---------|-------------|
| **Cacheable Content** | Tools, System, Messages (in assembly order) |
| **Checkpoints** | Up to 4 per request, can be distributed within sections |
| **Assembly Order** | Tools → System → Messages |
| **Token Counting** | **Cumulative** from start, not per-section |
| **Invalidation** | Changes to early content invalidate downstream caches |

### Part 2: Best Practices

| Practice | Key Takeaway |
|----------|-------------|
| **Static/Dynamic** | Static content first, checkpoint, then dynamic content |
| **TTL Strategy** | 5m for regular cadence (>1x per 5 min), 1h for infrequent access |
| **Mixing TTLs** | 1-hour cache MUST appear before 5-minute cache |
| **Selective Caching** | Only cache content with high reuse potential |
| **Monitoring** | Track hit rate (>70% target), calculate break-even |

### TTL Quick Reference (InvokeModel API)

| When to Use | TTL | Syntax | Write Cost |
|-------------|-----|--------|------------|
| Regular cadence (>1x per 5 min) | 5 min | `{"type": "ephemeral", "ttl": "5m"}` | 1.25x |
| Infrequent (5 min - 1 hour) | 1 hour | `{"type": "ephemeral", "ttl": "1h"}` | 2x |

*Read cost is 0.1x for both TTL types*

---

## Additional Resources

- [AWS Bedrock Prompt Caching Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html)
- [AWS Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/)
- [AWS Bedrock Monitoring Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/monitoring.html)
- [Anthropic Prompt Caching Documentation](https://platform.claude.com/docs/en/build-with-claude/prompt-caching)