# Optimization Strategy

This notebook covers practical techniques to reduce costs and improve quality when working with Amazon Bedrock foundation models. It covers model selection, prompt design best practices, and parameter tuning, and introduces prompt caching fundamentals.

## Learning Objectives

By the end of this notebook, you will be able to:
- Choose the right model for your task (don't always go for the largest)
- Write effective prompts with clear instructions and few-shot examples
- Get reliable structured output with tool use or native structured output
- Optimize `max_tokens` and temperature parameters
- Implement basic prompt caching for cost savings

## Why This Matters

At production scale, optimization choices compound significantly:
- **Wrong model selection** can cost 3-10x more than necessary
- **Verbose prompts** waste input tokens on every request
- **High max_tokens** reserves unnecessary quota, reducing concurrency
- **Missing cache opportunities** means paying full price for repeated content

**Duration**: ~60 minutes

## Prerequisites

Before running this notebook, ensure you have:
1. An AWS account with Amazon Bedrock access enabled
2. AWS credentials configured (via `.env` file, AWS CLI, or IAM role)
3. Completed the previous notebook (01-prompts-101.ipynb) or have equivalent knowledge

## Setup

In [1]:
from __future__ import annotations

from dotenv import load_dotenv

load_dotenv()

import json
import os
import time

import boto3

# Initialize Bedrock client
bedrock_runtime = boto3.client(
    service_name="bedrock-runtime", region_name=os.getenv("AWS_DEFAULT_REGION", "us-east-1")
)

# Model IDs - using model categories rather than specific variants
# Small/Fast model: Good for simple extraction, classification
# Medium/Balanced model: Good for most production workloads
SMALL_MODEL = "global.anthropic.claude-haiku-4-5-20251001-v1:0"  # Small/Fast
MEDIUM_MODEL = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"  # Medium/Balanced

print(f"Region: {os.getenv('AWS_DEFAULT_REGION', 'us-east-1')}")
print(f"Small/Fast model:     {SMALL_MODEL}")
print(f"Medium/Balanced model: {MEDIUM_MODEL}")
print("\nSetup complete!")

Region: us-east-1
Small/Fast model:     global.anthropic.claude-haiku-4-5-20251001-v1:0
Medium/Balanced model: global.anthropic.claude-sonnet-4-5-20250929-v1:0

Setup complete!


<div class="alert alert-block alert-info">
<b>Note:</b> This notebook uses Claude models through the <b>global</b> cross-Region inference (CRIS) profile for higher availability.
</div>

---

## 1. Model Selection

One of the most impactful optimization decisions is choosing the right model for your task. A common mistake is to always use the largest, most capable model, which wastes money and increases latency.

### Key Principle: Start Small, Scale Up

| Category | Typical Models | Best For | Cost |
|----------|----------------|----------|------|
| **Small/Fast** | Claude Haiku 4.5, Nova Lite, Nova Micro | Extraction, classification, simple Q&A | $ (lowest) |
| **Medium/Balanced** | Claude Sonnet 4.5, Nova Pro | General reasoning, summarization, code | $$ |
| **Large/Flagship** | Claude Opus 4.5, Nova Premier | Complex reasoning, research, analysis | $$$ (highest) |

<div class="alert alert-warning">
<strong>Key Insight:</strong> The best model is the <strong>smallest model that meets your quality requirements</strong> - not the most capable one available.
</div>

Let's compare a smaller and larger model on a simple extraction task.

In [2]:
# Compare cost: Small vs Medium model on extraction task

task = (
    "Extract the person's name and email from this text: 'Contact John Smith at john.smith@example.com for details.'"
)

# Current pricing per 1,000 tokens (check https://aws.amazon.com/bedrock/pricing/ for updates)
# Using Claude model pricing as of Jan 2026
PRICING = {
    "small": {"input": 0.001, "output": 0.005},  # Haiku 4.5
    "medium": {"input": 0.003, "output": 0.015},  # Sonnet 4.5
}

results = {}

for model_id, label, pricing_key in [
    (SMALL_MODEL, "Small (Haiku)", "small"),
    (MEDIUM_MODEL, "Medium (Sonnet)", "medium"),
]:
    start = time.time()
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": task}]}],
        inferenceConfig={"maxTokens": 100, "temperature": 0},
    )
    latency = time.time() - start

    usage = response["usage"]
    output = response["output"]["message"]["content"][0]["text"]

    # Request cost in USD, computed from per-1,000-token prices
    price = PRICING[pricing_key]
    cost = (usage["inputTokens"] / 1_000) * price["input"] + (usage["outputTokens"] / 1_000) * price["output"]

    results[label] = {
        "output": output,
        "latency": latency,
        "input_tokens": usage["inputTokens"],
        "output_tokens": usage["outputTokens"],
        "cost": cost,
    }

# Display comparison
print(f"Task: {task}\n")
print(f"{'Model':<18} {'Latency':<10} {'In Tok':<10} {'Out Tok':<10} {'Cost':<12}")
print("-" * 65)
for name, data in results.items():
    print(
        f"{name:<18} {data['latency']:.2f}s{'':<4} {data['input_tokens']:<10} {data['output_tokens']:<10} ${data['cost']:.6f}"
    )

print(
    f"\nCost ratio: Medium is {results['Medium (Sonnet)']['cost'] / results['Small (Haiku)']['cost']:.1f}x more expensive"
)

print("\n--- Small Model Output ---")
print(results["Small (Haiku)"]["output"][:200])
print("\n--- Medium Model Output ---")
print(results["Medium (Sonnet)"]["output"][:200])

Task: Extract the person's name and email from this text: 'Contact John Smith at john.smith@example.com for details.'

Model              Latency    In Tok     Out Tok    Cost        
-----------------------------------------------------------------
Small (Haiku)      1.52s     34         24         $0.000154
Medium (Sonnet)    1.85s     34         19         $0.000387

Cost ratio: Medium is 2.5x more expensive

--- Small Model Output ---
# Extracted Information

**Name:** John Smith

**Email:** john.smith@example.com

--- Medium Model Output ---
**Name:** John Smith

**Email:** john.smith@example.com


### Key Insight: Model Selection

For simple extraction tasks, both models produce correct results. The key differences are **cost and latency**: in the run above, the medium model cost 2.5x more and took slightly longer.

<div class="alert alert-success">
<strong>Recommendation:</strong> For extraction, classification, and simple pattern matching, use <strong>smaller models</strong>. Reserve larger models for tasks requiring deeper reasoning.
</div>

| Task Type | Recommended | Rationale |
|-----------|-------------|----------|
| Entity extraction | Small model | Simple pattern matching |
| Classification | Small model | Follows clear rules |
| Summarization | Medium model | Needs context understanding |
| Complex Q&A | Medium/Large model | Requires inference |

---

## 2. Prompt Design Best Practices

Good prompt design reduces token usage while improving output quality. This section covers key techniques:

1. **Clear, specific instructions**
2. **Few-shot examples**
3. **Structured output**

---

### 2.1 Clear, Specific Instructions

Vague prompts lead to verbose, unfocused responses that waste output tokens. Specific prompts produce concise, targeted responses.

<div class="alert alert-info">
<strong>Tip:</strong> When you need a specific format, length, or focus, specify it in your prompt. For open-ended creative tasks, you may want to leave more freedom for the model.
</div>

In [3]:
# Compare vague vs specific prompts

vague_prompt = "Tell me about Paris."

specific_prompt = """List 3 must-see attractions in Paris for first-time visitors.
Format: numbered list with attraction name only."""

for label, prompt in [("VAGUE", vague_prompt), ("SPECIFIC", specific_prompt)]:
    response = bedrock_runtime.converse(
        modelId=SMALL_MODEL,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0},
    )

    output = response["output"]["message"]["content"][0]["text"]
    usage = response["usage"]

    print(f"{'=' * 60}")
    print(f"{label} PROMPT")
    print(f"{'=' * 60}")
    print(f"Prompt: {prompt}\n")
    print(f"Output tokens: {usage['outputTokens']}")
    print(f"\nResponse:\n{output}\n")

VAGUE PROMPT
Prompt: Tell me about Paris.

Output tokens: 200

Response:
# Paris

Paris is the capital and largest city of France, located in north-central France along the Seine River. Here are some key highlights:

## History & Culture
- One of the world's most historically significant cities, with roots dating back over 2,000 years
- Center of art, philosophy, and intellectual movements throughout European history
- Known for its classical architecture, museums, and cultural institutions

## Famous Landmarks
- **Eiffel Tower** – iconic iron lattice monument built in 1889
- **Notre-Dame Cathedral** – medieval Gothic masterpiece (currently under restoration)
- **Louvre Museum** – world's largest art museum, home to the Mona Lisa
- **Arc de Triomphe** – monumental arch honoring military victories
- **Sacré-Cœur** – white basilica overlooking the city

## Characteristics
- Population of about 2.

SPECIFIC PROMPT
Prompt: List 3 must-see attractions in Paris for first-time visitors.
Forma

### 2.2 Few-Shot Examples

Few-shot learning uses examples to teach the model the exact format and style you want. While this adds input tokens, it dramatically improves output consistency.

**When to use few-shot:**
- When you need consistent output format across many requests
- For domain-specific terminology or style
- When zero-shot produces inconsistent results

**When NOT to use few-shot:**
- Simple tasks where the model already performs well (unnecessary token cost)
- Tasks with highly variable output (examples may constrain creativity)
- When input context is already long (token limits)

Let's compare zero-shot vs few-shot for **extracting product info into a specific format**:

In [4]:
# Zero-shot vs few-shot for data extraction with specific format

# Test with multiple product descriptions to see consistency
products = [
    "Apple MacBook Pro 16-inch with M3 chip, 32GB RAM, Space Black - $2,499",
    "Sony WH-1000XM5 wireless noise cancelling headphones in silver for $348",
    "Samsung 65 inch OLED 4K Smart TV (2024 model) priced at $1,799.99",
]

zero_shot = """Extract product info and output in this exact format:
PRODUCT|BRAND|PRICE|CATEGORY

Product: {product}"""

few_shot = """Extract product info in this exact format:
PRODUCT|BRAND|PRICE|CATEGORY

Examples:
Product: "Dell XPS 15 laptop with Intel i7, 16GB memory - $1,299"
Output: XPS 15|Dell|1299|laptop

Product: "Bose QuietComfort earbuds, noise cancelling, $279 retail"
Output: QuietComfort Earbuds|Bose|279|audio

Product: "LG 55" C3 OLED TV (2023) on sale for $1,196.99"
Output: C3 OLED 55"|LG|1196.99|tv

Now extract:
Product: "{product}"
Output:"""

print("=" * 70)
print("ZERO-SHOT vs FEW-SHOT: Product Data Extraction")
print("=" * 70)

for product in products:
    print(f"\nProduct: {product}")
    print("-" * 60)

    for label, template in [("ZERO-SHOT", zero_shot), ("FEW-SHOT", few_shot)]:
        prompt = template.format(product=product)
        response = bedrock_runtime.converse(
            modelId=SMALL_MODEL,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 100, "temperature": 0},
        )

        output = response["output"]["message"]["content"][0]["text"].strip()
        # Show first line only for cleaner comparison
        first_line = output.split("\n")[0]
        print(f"  {label}: {first_line}")

ZERO-SHOT vs FEW-SHOT: Product Data Extraction

Product: Apple MacBook Pro 16-inch with M3 chip, 32GB RAM, Space Black - $2,499
------------------------------------------------------------
  ZERO-SHOT: Apple MacBook Pro 16-inch with M3 chip, 32GB RAM, Space Black|Apple|$2,499|Laptops
  FEW-SHOT: MacBook Pro 16-inch|Apple|2499|laptop

Product: Sony WH-1000XM5 wireless noise cancelling headphones in silver for $348
------------------------------------------------------------
  ZERO-SHOT: Sony WH-1000XM5|Sony|$348|Wireless Noise Cancelling Headphones
  FEW-SHOT: WH-1000XM5|Sony|348|headphones

Product: Samsung 65 inch OLED 4K Smart TV (2024 model) priced at $1,799.99
------------------------------------------------------------
  ZERO-SHOT: Samsung|Samsung|$1,799.99|Television
  FEW-SHOT: 65" OLED 4K Smart TV|Samsung|1799.99|tv
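
Format consistency can be checked programmatically rather than by eye. The sketch below validates rows against the conventions set by the few-shot examples (four fields, a bare numeric price, a lowercase category). The function `is_well_formed` and its rules are assumptions made for illustration, not part of the original notebook; the rows are copied from the output above.

```python
import re

EXPECTED_FIELDS = 4  # PRODUCT|BRAND|PRICE|CATEGORY
PRICE_PATTERN = re.compile(r"\d+(\.\d{1,2})?")  # bare number, no "$" or thousands separator


def is_well_formed(row: str) -> bool:
    """Return True if the row has four non-empty fields, a bare numeric price,
    and a lowercase category (the conventions set by the few-shot examples)."""
    fields = [f.strip() for f in row.split("|")]
    if len(fields) != EXPECTED_FIELDS or not all(fields):
        return False
    _product, _brand, price, category = fields
    return PRICE_PATTERN.fullmatch(price) is not None and category == category.lower()


# Rows copied from the output above.
zero_shot_rows = [
    "Apple MacBook Pro 16-inch with M3 chip, 32GB RAM, Space Black|Apple|$2,499|Laptops",
    "Sony WH-1000XM5|Sony|$348|Wireless Noise Cancelling Headphones",
    "Samsung|Samsung|$1,799.99|Television",
]
few_shot_rows = [
    "MacBook Pro 16-inch|Apple|2499|laptop",
    "WH-1000XM5|Sony|348|headphones",
    '65" OLED 4K Smart TV|Samsung|1799.99|tv',
]

for label, rows in [("ZERO-SHOT", zero_shot_rows), ("FEW-SHOT", few_shot_rows)]:
    passed = sum(is_well_formed(r) for r in rows)
    print(f"{label}: {passed}/{len(rows)} rows well-formed")
```

Under these rules, none of the three zero-shot rows pass (each price carries a `$`), while all three few-shot rows do.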


### 2.3 Structured Output

When you need structured output, there are three approaches: **prompt-based**, **tool use**, and **native structured output**.

#### Prompt-Based Structured Output

You can ask the model to return structured data directly in the prompt. This works for JSON, XML, or any format you specify:

In [5]:
# Prompt-based structured output: JSON and XML

text_to_extract = "John Smith, 35 years old, works as a Software Engineer at TechCorp in San Francisco."

# JSON format
json_prompt = f"""Extract information from this text and return ONLY valid JSON (no markdown):

Text: {text_to_extract}

Return JSON with: name, age, job_title, company, location"""

# XML format
xml_prompt = f"""Extract information from this text and return ONLY valid XML (no markdown):

Text: {text_to_extract}

Return XML with tags: <person><name>, <age>, <job_title>, <company>, <location></person>"""

print("=" * 60)
print("PROMPT-BASED STRUCTURED OUTPUT")
print("=" * 60)

for label, prompt in [("JSON", json_prompt), ("XML", xml_prompt)]:
    response = bedrock_runtime.converse(
        modelId=SMALL_MODEL,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 200, "temperature": 0},
    )

    output = response["output"]["message"]["content"][0]["text"]
    usage = response["usage"]

    print(f"\n{label} Format:")
    print("-" * 40)
    print(output)
    print(f"\nTokens - Input: {usage['inputTokens']}, Output: {usage['outputTokens']}")

PROMPT-BASED STRUCTURED OUTPUT

JSON Format:
----------------------------------------
```json
{
  "name": "John Smith",
  "age": 35,
  "job_title": "Software Engineer",
  "company": "TechCorp",
  "location": "San Francisco"
}
```

Tokens - Input: 64, Output: 59

XML Format:
----------------------------------------
```xml
<?xml version="1.0" encoding="UTF-8"?>
<person>
  <name>John Smith</name>
  <age>35</age>
  <job_title>Software Engineer</job_title>
  <company>TechCorp</company>
  <location>San Francisco</location>
</person>
```

Tokens - Input: 75, Output: 86
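
Both responses above arrived wrapped in a Markdown code fence even though the prompts asked for no markdown, so prompt-based output usually needs a cleanup step before parsing. A minimal sketch of that step for the JSON case follows; `parse_json_response` is a hypothetical helper, and it only handles a single fence around the whole response.

```python
import json
import re

# Matches an optional language-tagged opening fence and a closing fence
# around the whole response. `{3} means three consecutive backticks.
FENCE_PATTERN = re.compile(r"^\s*`{3}[a-zA-Z]*\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def parse_json_response(text: str) -> dict:
    """Strip a surrounding Markdown code fence, if present, then parse JSON."""
    match = FENCE_PATTERN.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


fence = "`" * 3
fenced = fence + 'json\n{"name": "John Smith", "age": 35}\n' + fence
plain = '{"name": "John Smith", "age": 35}'

print(parse_json_response(fenced))
print(parse_json_response(plain))
```

If the model adds prose before or after the fence, `json.loads` still raises an error, which is one reason to prefer schema-enforced output in production.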


#### Tool Use (Schema-Enforced)

**Tool use** (function calling) provides native schema enforcement. The model is constrained to output valid JSON matching your defined schema.

<div class="alert alert-warning">
<strong>Trade-off:</strong> Tool use adds significant input tokens (tool definitions) but guarantees valid, parseable output. Choose based on your reliability requirements vs cost sensitivity.
</div>

In [None]:
# Tool use: Schema-enforced JSON extraction

# Define tool with JSON schema
tools = [
    {
        "toolSpec": {
            "name": "extract_person_info",
            "strict": True,  # Enforce strict schema validation
            "description": "Extract structured person information from text",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "description": "Person's full name"},
                        "age": {"type": "integer", "description": "Person's age"},
                        "job_title": {"type": "string", "description": "Job title or occupation"},
                        "company": {"type": "string", "description": "Company name"},
                        "location": {"type": "string", "description": "City or location"},
                    },
                    "required": ["name", "age", "job_title"],
                }
            },
        }
    }
]

response = bedrock_runtime.converse(
    modelId=SMALL_MODEL,
    messages=[{"role": "user", "content": [{"text": f"Extract person information from: {text_to_extract}"}]}],
    toolConfig={"tools": tools},
    inferenceConfig={"maxTokens": 200, "temperature": 0},
)

# Extract tool use result
output_content = response["output"]["message"]["content"]
for block in output_content:
    if "toolUse" in block:
        extracted = block["toolUse"]["input"]
        print("=" * 60)
        print("TOOL USE (Schema-Enforced, strict)")
        print("=" * 60)
        print("\nExtracted:")
        print(json.dumps(extracted, indent=2))
        break

usage = response["usage"]
print(f"\nTokens - Input: {usage['inputTokens']}, Output: {usage['outputTokens']}")
print("\n⚠️  Notice: Tool use input tokens are significantly higher due to schema definition.")
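
By default the model decides whether to call a tool, so the loop above prints nothing if the model answers in plain text. The Converse API's `toolConfig` accepts a `toolChoice` field that can require a specific tool. The sketch below only builds the configuration and inspects a response. `build_forced_tool_config` and `find_tool_input` are hypothetical helpers, and `toolChoice` support varies by model, so treat its availability for the models used here as an assumption to verify.

```python
from typing import Optional


def build_forced_tool_config(tools: list, tool_name: str) -> dict:
    """Build a Converse `toolConfig` that requires the model to call `tool_name`."""
    known = {t["toolSpec"]["name"] for t in tools}
    if tool_name not in known:
        raise ValueError(f"Unknown tool: {tool_name}")
    return {"tools": tools, "toolChoice": {"tool": {"name": tool_name}}}


def find_tool_input(response: dict) -> Optional[dict]:
    """Return the first toolUse input in a Converse response, or None if absent."""
    for block in response["output"]["message"]["content"]:
        if "toolUse" in block:
            return block["toolUse"]["input"]
    return None


# Minimal stand-in for the `tools` list defined above.
example_tools = [{"toolSpec": {"name": "extract_person_info", "inputSchema": {"json": {"type": "object"}}}}]
tool_config = build_forced_tool_config(example_tools, "extract_person_info")
print(tool_config["toolChoice"])

# Hand-written response shape for illustration; not real model output.
fake_response = {"output": {"message": {"content": [{"text": "No tool was called."}]}}}
print(find_tool_input(fake_response))
```

Passing the result as `toolConfig=tool_config` to `converse` would replace the plain `{"tools": tools}` used above; checking for `None` makes the no-tool-call case explicit instead of silent.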

#### Structured Output (Native JSON Schema)

Amazon Bedrock supports **native structured output** via the `outputConfig` parameter. Instead of using tool use as a workaround for JSON extraction, you can pass a JSON schema directly and the model will return valid JSON conforming to that schema.

<div class="alert alert-success">
<strong>Best of both worlds:</strong> Native structured output provides guaranteed schema compliance like tool use, but without the extra token overhead of tool definitions. This is the recommended approach for production JSON extraction.
</div>

In [None]:
# Structured Output: Native JSON Schema via outputConfig

response = bedrock_runtime.converse(
    modelId=SMALL_MODEL,
    messages=[{"role": "user", "content": [{"text": f"Extract person information from: {text_to_extract}"}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0},
    outputConfig={
        "textFormat": {
            "type": "json_schema",
            "structure": {
                "jsonSchema": {
                    "schema": json.dumps({
                        "type": "object",
                        "properties": {
                            "name": {"type": "string", "description": "Person's full name"},
                            "age": {"type": "integer", "description": "Person's age"},
                            "job_title": {"type": "string", "description": "Job title or occupation"},
                            "company": {"type": "string", "description": "Company name"},
                            "location": {"type": "string", "description": "City or location"},
                        },
                        "required": ["name", "age", "job_title", "company", "location"],
                        "additionalProperties": False,
                    }),
                    "name": "person_info_extraction",
                    "description": "Extract structured person information from text",
                }
            },
        }
    },
)

# The response text is valid JSON conforming to the schema
extracted = json.loads(response["output"]["message"]["content"][0]["text"])

print("=" * 60)
print("STRUCTURED OUTPUT (Native JSON Schema)")
print("=" * 60)
print("\nExtracted:")
print(json.dumps(extracted, indent=2))

usage = response["usage"]
print(f"\nTokens - Input: {usage['inputTokens']}, Output: {usage['outputTokens']}")
print("\n✅ Lower input tokens than tool use, with the same schema guarantee!")

### Comparison: Three Approaches to Structured Output

| Approach | Format | Reliability | Input Tokens | Best For |
|----------|--------|-------------|--------------|----------|
| **Prompt-based** | JSON, XML, custom | Good (may need cleanup) | Lower | Prototyping, cost-sensitive apps |
| **Tool Use (strict)** | JSON only | Guaranteed valid | Higher (schema overhead) | Production APIs needing function calling |
| **Structured Output** | JSON only | Guaranteed valid | Lower than tool use | Production APIs needing direct JSON response |

<div class="alert alert-info">
<strong>When to use each:</strong>
<ul>
<li><strong>Prompt-based</strong>: Lower token cost, flexible formats (JSON/XML/custom), good for exploration</li>
<li><strong>Tool use (strict)</strong>: Guaranteed schema compliance with function calling semantics, use when you need tool orchestration</li>
<li><strong>Structured output</strong>: Guaranteed schema compliance with lowest token overhead, use when you need direct JSON responses without function calling</li>
</ul>
</div>

---

## 3. Parameter Tuning

### 3.1 max_tokens

The `max_tokens` parameter (`maxTokens` in the Converse API's `inferenceConfig`) serves two purposes:

1. **Limit output length** - Caps how many tokens the model generates
2. **Affects concurrency** - Bedrock reserves `input + max_tokens` from your TPM quota at request start

<div class="alert alert-warning">
<strong>Key Trade-off:</strong>
<ul>
<li><strong>Too small</strong> → Output gets truncated (incomplete response)</li>
<li><strong>Too large</strong> → Wastes quota reservation, reduces concurrent requests</li>
</ul>
Set <code>max_tokens</code> to your expected output + 10-15% buffer.
</div>
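
The sizing rule and the quota reservation can be written down directly. This is a minimal sketch under the assumptions stated in this section (reservation equals input tokens plus `max_tokens`; a 15% buffer over the longest observed output). The helper names and the sample output lengths are hypothetical.

```python
import math


def right_sized_max_tokens(observed_output_tokens: list, buffer: float = 0.15) -> int:
    """Largest observed output length plus a fractional buffer, rounded up."""
    if not observed_output_tokens:
        raise ValueError("Need at least one observed output length")
    return math.ceil(max(observed_output_tokens) * (1 + buffer))


def reserved_quota(input_tokens: int, max_tokens: int) -> int:
    """Tokens counted against TPM quota at request start (input + max_tokens)."""
    return input_tokens + max_tokens


# Hypothetical sample of output lengths from earlier runs of the same prompt.
samples = [172, 190, 181]
limit = right_sized_max_tokens(samples)
print(f"maxTokens: {limit}")
print(f"Reserved at request start: {reserved_quota(21, limit)} (vs {reserved_quota(21, 500)} with maxTokens=500)")
```

Sizing from the longest observed output rather than the average keeps the buffer meaningful for the responses most at risk of truncation.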

In [7]:
# Demonstrate max_tokens impact on output quality and quota

prompt = "List 5 benefits of exercise with a brief explanation for each."

print(f'Prompt: "{prompt}"\n')
print("=" * 80)

results = []
for max_tok in [50, 100, 200, 500]:
    response = bedrock_runtime.converse(
        modelId=SMALL_MODEL,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tok, "temperature": 0},
    )

    output = response["output"]["message"]["content"][0]["text"]
    usage = response["usage"]
    stop_reason = response["stopReason"]
    truncated = stop_reason == "max_tokens"

    results.append(
        {
            "max_tokens": max_tok,
            "output_tokens": usage["outputTokens"],
            "input_tokens": usage["inputTokens"],
            "truncated": truncated,
            "output": output,
        }
    )

# Show summary table
print(f"{'max_tokens':<12} {'Output Tok':<12} {'Truncated?':<12} {'Quota Reserved':<15}")
print("-" * 55)
for r in results:
    quota = r["input_tokens"] + r["max_tokens"]
    trunc = "⚠️ YES" if r["truncated"] else "No"
    print(f"{r['max_tokens']:<12} {r['output_tokens']:<12} {trunc:<12} ~{quota:<14}")

# Show actual outputs to demonstrate truncation
print("\n" + "=" * 80)
print("ACTUAL OUTPUTS (notice truncation with small max_tokens):")
print("=" * 80)

for r in results:
    status = "⚠️ TRUNCATED" if r["truncated"] else "✓ Complete"
    print(f"\nmax_tokens={r['max_tokens']} [{status}]:")
    print("-" * 40)
    # Show first 300 chars to keep output manageable
    preview = r["output"][:300] + "..." if len(r["output"]) > 300 else r["output"]
    print(preview)

Prompt: "List 5 benefits of exercise with a brief explanation for each."

max_tokens   Output Tok   Truncated?   Quota Reserved 
-------------------------------------------------------
50           50           ⚠️ YES       ~71            
100          100          ⚠️ YES       ~121           
200          190          No           ~221           
500          190          No           ~521           

ACTUAL OUTPUTS (notice truncation with small max_tokens):

max_tokens=50 [⚠️ TRUNCATED]:
----------------------------------------
# 5 Benefits of Exercise

1. **Improved Cardiovascular Health**
Regular exercise strengthens your heart and improves circulation, reducing blood pressure and lowering the risk of heart disease and stroke.

2. **Weight Management

max_tokens=100 [⚠️ TRUNCATED]:
----------------------------------------
# 5 Benefits of Exercise

1. **Improved Cardiovascular Health**
Regular exercise strengthens your heart and improves circulation, reducing blood pressure and lowe

### 3.2 Temperature

Temperature (range: **0.0 to 1.0**) controls randomness in token selection:
- **Lower** (0.0-0.3) → More deterministic, consistent outputs
- **Higher** (0.7-1.0) → More creative, varied outputs

| Temperature | Behavior | Use Cases |
|-------------|----------|----------|
| **0.0** | Most deterministic (same input usually gives the same output) | Extraction, classification, factual Q&A |
| **0.3-0.5** | Low randomness | Summarization, general Q&A |
| **0.7-1.0** | High randomness | Creative writing, brainstorming |
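
One way to apply the table consistently is to keep the settings in a single place and build `inferenceConfig` from it. The presets below are a hypothetical mapping drawn from the table, not an official recommendation, and `inference_config` is an illustrative helper.

```python
# Hypothetical task-to-temperature presets drawn from the table above.
TEMPERATURE_PRESETS = {
    "extraction": 0.0,
    "classification": 0.0,
    "summarization": 0.3,
    "brainstorming": 1.0,
}


def inference_config(task_type: str, max_tokens: int) -> dict:
    """Build a Converse `inferenceConfig` dict for a known task type."""
    if task_type not in TEMPERATURE_PRESETS:
        raise ValueError(f"No preset for task type: {task_type}")
    return {"maxTokens": max_tokens, "temperature": TEMPERATURE_PRESETS[task_type]}


print(inference_config("extraction", 100))
print(inference_config("brainstorming", 50))
```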

In [8]:
# Demonstrate temperature effect on output consistency

prompt = "Write a one-sentence tagline for a coffee shop."

print(f'Prompt: "{prompt}"\n')

for temp in [0.0, 0.7, 1.0]:
    print(f"Temperature: {temp}")
    print("-" * 40)

    # Run 3 times to show consistency/variation
    for i in range(3):
        response = bedrock_runtime.converse(
            modelId=SMALL_MODEL,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 50, "temperature": temp},
        )
        output = response["output"]["message"]["content"][0]["text"].strip()
        print(f"  Run {i + 1}: {output[:60]}..." if len(output) > 60 else f"  Run {i + 1}: {output}")
    print()

Prompt: "Write a one-sentence tagline for a coffee shop."

Temperature: 0.0
----------------------------------------
  Run 1: "Wake up to something extraordinary."
  Run 2: "Wake up to something extraordinary."
  Run 3: "Wake up to something extraordinary."

Temperature: 0.7
----------------------------------------
  Run 1: "Wake up to something extraordinary."
  Run 2: "Wake up to something extraordinary."
  Run 3: "Wake up to something extraordinary."

Temperature: 1.0
----------------------------------------
  Run 1: "Wake up to something wonderful."
  Run 2: "Wake up to what matters—great coffee, good people, and the ...
  Run 3: "Start your day the right way—where every cup tells a story....



---

## 4. Prompt Caching Basics

Prompt caching allows you to cache static portions of your prompt and reuse them across multiple requests.

### How Caching Works

| Step | What Happens | Cost |
|------|--------------|------|
| **First Request** | Static content is cached | 1.25x normal cost (5min cache) or 2x (1h cache) |
| **Subsequent Requests** | Cached content is reused | 0.1x normal cost (cache read) |
| **Cache Expiry** | After TTL with no hits | Cache is cleared |

<div class="alert alert-success">
<strong>Key Benefit:</strong> Cache reads cost only 10% of normal input cost - <strong>up to 90% savings</strong> on cached content with high hit rates.
</div>

### Caching Requirements by Model

| Model | Minimum Tokens per Checkpoint |
|-------|-------------------------------|
| Claude Sonnet 4.5 | 1,024 |
| Claude Haiku 4.5 | 2,048 |
| Claude Opus 4.5 | 1,024 |

<div class="alert alert-info">
<strong>Note:</strong> Minimum token requirements vary by model. Content below the minimum will not be cached. See <a href="https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html">Prompt Caching Documentation</a> for the latest requirements. Cache pricing varies by TTL option - verify at <a href="https://aws.amazon.com/bedrock/pricing/">Amazon Bedrock Pricing</a>.
</div>

### Load Helper Functions

In [9]:
# Import cache metrics utilities
from utils.cache_metrics import calculate_cache_savings, extract_cache_metrics

print("Cache metrics utilities loaded successfully!")

Cache metrics utilities loaded successfully!


### Load Sample Document

In [10]:
# Load product documentation from file
with open("data/product_manual.txt") as f:
    PRODUCT_MANUAL = f.read()

# Estimate token count
word_count = len(PRODUCT_MANUAL.split())
estimated_tokens = int(word_count * 1.3)  # Rough estimate

print("Product manual loaded")
print(f"  Word count: {word_count:,}")
print(f"  Estimated tokens: ~{estimated_tokens:,}")
print(f"  Meets caching minimum (1,024 for Sonnet): {'Yes' if estimated_tokens >= 1024 else 'No'}")

Product manual loaded
  Word count: 1,023
  Estimated tokens: ~1,329
  Meets caching minimum (1,024 for Sonnet): Yes


### Implementing Basic Caching

To enable caching with the Converse API, add a `cachePoint` after the static content:

```python
messages = [{
    "role": "user",
    "content": [
        {"text": "Your static content here..."},
        {"cachePoint": {"type": "default"}},  # <-- Cache checkpoint
        {"text": "Your dynamic query here..."}
    ]
}]
```

<div class="alert alert-warning">
<strong>Important:</strong> Place static content <em>before</em> the cache checkpoint and dynamic content <em>after</em> it.
</div>
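
Because content below the model's minimum is not cached, the checkpoint can be added conditionally. The sketch below builds the content blocks and includes a `cachePoint` only when the static prefix is estimated to be long enough. `build_cached_content` and `estimate_tokens` are hypothetical helpers; the 1.3 tokens-per-word heuristic is the rough estimate used earlier in this notebook, and the 1,024 default is the Sonnet minimum from the table above.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic used in this notebook: about 1.3 tokens per word."""
    return int(len(text.split()) * 1.3)


def build_cached_content(static_text: str, dynamic_text: str, min_tokens: int = 1024) -> list:
    """Build Converse content blocks, adding a cachePoint only when the static
    prefix is estimated to meet the model's caching minimum."""
    content = [{"text": static_text}]
    if estimate_tokens(static_text) >= min_tokens:
        # Everything before this marker is eligible for caching.
        content.append({"cachePoint": {"type": "default"}})
    content.append({"text": dynamic_text})
    return content


manual = "word " * 1000  # stand-in for a long static document
short_note = "A short static prefix."
question = "\n\nQUESTION: What shipping options are available?"

long_content = build_cached_content(manual, question)
short_content = build_cached_content(short_note, question)
print(len(long_content), len(short_content))
```

The long prefix yields three blocks (text, checkpoint, query) and the short one yields two. Since the word-count estimate is approximate, the cache write and read counts in the response remain the authoritative check.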

In [11]:
def query_document_with_cache(user_query, document=PRODUCT_MANUAL, model_id=MEDIUM_MODEL):
    """
    Query a document using single-checkpoint caching.

    The document is placed before the cache checkpoint (cached),
    and the user query is placed after (not cached).
    """
    messages = [
        {
            "role": "user",
            "content": [
                {
                    # Static content to cache
                    "text": f"""You are a helpful assistant. Use the following product manual to answer questions.

PRODUCT MANUAL:
{document}

Answer the user's question based on the information in the manual. If the answer is not in the manual, say so."""
                },
                {
                    # Cache checkpoint - everything before this gets cached
                    "cachePoint": {"type": "default"}
                },
                {
                    # Dynamic content (not cached)
                    "text": f"\n\nQUESTION: {user_query}"
                },
            ],
        }
    ]

    response = bedrock_runtime.converse(
        modelId=model_id, messages=messages, inferenceConfig={"maxTokens": 500, "temperature": 0.0}
    )

    response_text = response["output"]["message"]["content"][0]["text"]
    metrics = extract_cache_metrics(response)

    return response_text, metrics


print("Document query function with caching defined!")

Document query function with caching defined!


### Demo: Cache in Action

In [12]:
# Test multiple queries to see caching in action
queries = [
    "What is the return policy for electronics?",
    "How much does the Professional tier cost?",
    "What shipping options are available?",
    "What are the API rate limits for standard tier?",
]

all_metrics = []

print("=" * 80)
print("DEMO: Single-Checkpoint Caching")
print("=" * 80)

for i, query in enumerate(queries, 1):
    print(f"\nQuery {i}: {query}")

    response, metrics = query_document_with_cache(query)
    all_metrics.append(metrics)

    # Show abbreviated response
    response_preview = response[:150] + "..." if len(response) > 150 else response
    print(f"\nResponse: {response_preview}")

    # Show cache metrics
    print("\nCache metrics:")
    print(f"  Input tokens (fresh):     {metrics['input_tokens']:,}")
    print(f"  Cache write tokens:       {metrics['cache_write']:,}")
    print(f"  Cache read tokens:        {metrics['cache_read']:,}")
    print("-" * 60)

DEMO: Single-Checkpoint Caching

Query 1: What is the return policy for electronics?

Response: According to the product manual, **electronics have a 14-day return window** from the purchase date.

For the return to be accepted, items must be:
- ...

Cache metrics:
  Input tokens (fresh):     15
  Cache write tokens:       1,936
  Cache read tokens:        0
------------------------------------------------------------

Query 2: How much does the Professional tier cost?

Response: According to the product manual, the **Professional tier costs $299/month**.

This tier includes:
- Up to 10,000 orders/month
- 10 user accounts with ...

Cache metrics:
  Input tokens (fresh):     15
  Cache write tokens:       0
  Cache read tokens:        1,936
------------------------------------------------------------

Query 3: What shipping options are available?

Response: Based on the product manual, the following shipping options are available:

1. **Standard Shipping**: 5-7 business days, FREE on or

### Analyze Cache Performance

In [13]:
# Calculate overall cache savings
# Using Sonnet 4.5 pricing: input $3.00 per 1M tokens ($0.003/1K); cache reads are billed at 0.1x that rate
savings = calculate_cache_savings(all_metrics, input_price_per_million=3.0)

print("=" * 60)
print("CACHE PERFORMANCE SUMMARY")
print("=" * 60)
print(f"Total requests:      {savings['total_requests']}")
print(f"Cache hit rate:      {savings['cache_hit_rate']:.1f}%")
print("")
print(f"Cost WITH caching:    ${savings['cost_with_cache']:.6f}")
print(f"Cost WITHOUT caching: ${savings['cost_no_cache']:.6f}")
print("")
print(f"Savings:             ${savings['savings']:.6f} ({savings['savings_pct']:.1f}% reduction)")
print("=" * 60)

CACHE PERFORMANCE SUMMARY
Total requests:      4
Cache hit rate:      75.0%

Cost WITH caching:    $0.009182
Cost WITHOUT caching: $0.023412

Savings:             $0.014230 (60.8% reduction)
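
The summary figures can be reproduced by hand from the token counts. The sketch below is not the notebook's `utils.cache_metrics` module, whose source is not shown here; it is a hypothetical reimplementation that assumes the 1.25x write and 0.1x read multipliers from the table above and reads the `cacheWriteInputTokens` and `cacheReadInputTokens` fields of the Converse `usage` block. It also assumes requests 3 and 4 had the same counts as request 2, since their metrics are cut off in the output above.

```python
def sketch_extract_cache_metrics(response: dict) -> dict:
    """Pull token counts from a Converse response's `usage` block."""
    usage = response.get("usage", {})
    return {
        "input_tokens": usage.get("inputTokens", 0),
        "cache_write": usage.get("cacheWriteInputTokens", 0),
        "cache_read": usage.get("cacheReadInputTokens", 0),
    }


def sketch_cache_savings(metrics: list, input_price_per_million: float,
                         write_multiplier: float = 1.25, read_multiplier: float = 0.1) -> dict:
    """Compare input cost with caching against paying full price for every token."""
    price = input_price_per_million / 1_000_000
    fresh = sum(m["input_tokens"] for m in metrics)
    written = sum(m["cache_write"] for m in metrics)
    read = sum(m["cache_read"] for m in metrics)
    cost_with_cache = price * (fresh + written * write_multiplier + read * read_multiplier)
    cost_no_cache = price * (fresh + written + read)
    hits = sum(1 for m in metrics if m["cache_read"] > 0)
    return {
        "total_requests": len(metrics),
        "cache_hit_rate": 100 * hits / len(metrics),
        "cost_with_cache": cost_with_cache,
        "cost_no_cache": cost_no_cache,
        "savings_pct": 100 * (cost_no_cache - cost_with_cache) / cost_no_cache,
    }


def fake_response(fresh: int, write: int, read: int) -> dict:
    """Hand-written usage block for illustration; not real API output."""
    return {"usage": {"inputTokens": fresh, "cacheWriteInputTokens": write, "cacheReadInputTokens": read}}


# One cache write followed by three cache hits (requests 3-4 assumed equal to request 2).
responses = [fake_response(15, 1936, 0)] + [fake_response(15, 0, 1936)] * 3
summary = sketch_cache_savings([sketch_extract_cache_metrics(r) for r in responses], input_price_per_million=3.0)
print(f"{summary['cost_with_cache']:.6f} {summary['cost_no_cache']:.6f} {summary['savings_pct']:.1f}%")
```

Under those assumptions the sketch arrives at the same $0.009182, $0.023412, and 60.8% shown in the summary.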


### Understanding the Results

**Request 1 (Cache Write):**
- First occurrence of the document
- Document tokens written to cache (1.25x cost)
- Higher initial investment

**Requests 2-4 (Cache Hit):**
- Document retrieved from cache (0.1x cost - 90% savings!)
- Only the new user query processed fresh
- Significant cost reduction

<div class="alert alert-success">
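<strong>Key Insight:</strong> With the multipliers above, caching pays for itself from the second request on the 5-minute TTL (1.25x write) and from the third request on the 1-hour TTL (2x write), provided the later requests hit the cache. After that, every additional request saves ~90% on cached content!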
</div>
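
The break-even point follows from the cost multipliers in the "How Caching Works" table. As a worked sketch, assume every request after the first hits the cache before the TTL expires: n requests then cost `write + (n - 1) * read` with caching and `n` without, in units of the uncached prefix price. The function name `break_even_requests` is hypothetical.

```python
import math


def break_even_requests(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of requests at which caching costs less than no caching.

    With caching, n requests cost write + (n - 1) * read in units of the
    uncached prefix price; without caching they cost n.
    """
    # Solving write + (n - 1) * read < n gives n > (write - read) / (1 - read).
    threshold = (write_multiplier - read_multiplier) / (1 - read_multiplier)
    return math.floor(threshold) + 1


for label, write in [("5-minute TTL", 1.25), ("1-hour TTL", 2.0)]:
    print(f"{label}: caching is cheaper from request {break_even_requests(write)}")
```

This gives request 2 for the 5-minute TTL and request 3 for the 1-hour TTL. Cache misses caused by expiry push the real break-even point later.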

---

## Summary

In this notebook, you learned practical optimization strategies for Amazon Bedrock:

| Technique | Key Takeaway |
|-----------|-------------|
| **Model Selection** | Start small, scale up; smaller models are 3-10x cheaper and often equally effective for simple tasks |
| **Clear Instructions** | Specific prompts reduce output tokens; use imperative language ("Extract..." not "Could you...") |
| **Few-Shot Examples** | Improves format consistency; use when zero-shot produces inconsistent results |
| **Structured Output** | Prompt-based for flexibility (JSON/XML); strict tool use or native structured output for guaranteed schema compliance |
| **max_tokens** | Right-size to expected output + 10-15% buffer; over-setting wastes quota capacity |
| **Temperature** | Use 0.0 for deterministic tasks (extraction, classification); higher for creative tasks |
| **Prompt Caching** | Cache static content for up to 90% input cost savings; break-even at 2-3 requests |

### What's Next

In the next notebook, **03-langfuse-observability.ipynb**, you will learn:
- Setting up LangFuse for prompt tracing
- Tracking cost, latency, and token usage
- Building dashboards for production monitoring

---

## Additional Resources

- [Supported models in Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)
- [Prompt engineering concepts](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-engineering-guidelines.html)
- [Inference request parameters](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters.html)
- [Prompt caching](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html)
- [Tool use with Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/tool-use.html)
- [Monitoring Amazon Bedrock](https://docs.aws.amazon.com/bedrock/latest/userguide/monitoring.html)