# Prompts 101: Understanding the Basics

This notebook introduces the fundamental concepts you need to understand when working with Large Language Models (LLMs) on Amazon Bedrock. Before diving into advanced optimization techniques like prompt caching, it's essential to understand the building blocks: tokens, pricing models, API usage, and service quotas.

## Learning Objectives

By the end of this notebook, you will be able to:
- Count tokens using the Bedrock CountTokens API
- Calculate inference costs based on token usage
- Use the Converse API for model inference
- Understand and interpret usage metrics from API responses
- Navigate TPM (Tokens Per Minute) and RPM (Requests Per Minute) quotas

## Why This Matters

At production scale, small inefficiencies compound dramatically:
- **1,000 extra tokens per request x 1 million requests = 1 billion unnecessary tokens processed**
- Understanding token economics is the foundation for all cost optimization strategies
- Proper quota management prevents throttling and ensures reliable service

**Duration**: ~30 minutes

## Prerequisites

Before running this notebook, ensure you have:
1. An AWS account with Amazon Bedrock access enabled
2. AWS credentials configured (via `.env` file, AWS CLI, or IAM role)

In [1]:
# Install/upgrade required packages
from __future__ import annotations

!pip3 install --upgrade boto3 python-dotenv --quiet




In [2]:
# Standard library imports
import os
import time

# Third-party imports
import boto3
from dotenv import load_dotenv

# Local imports - utility functions for pricing calculations
from utils import (
    OUTPUT_BURNDOWN_RATE,
    calculate_actual_cost,
    calculate_tpm_actual,
    calculate_tpm_reservation,
    compare_optimization,
    print_pricing_table,
)

# Load environment variables from .env file
load_dotenv()

# Initialize AWS clients
REGION = os.getenv('AWS_DEFAULT_REGION', 'us-east-1')
bedrock_runtime = boto3.client('bedrock-runtime', region_name=REGION)
bedrock = boto3.client('bedrock', region_name=REGION)
service_quotas = boto3.client('service-quotas', region_name=REGION)

# Model configuration - using Claude Sonnet 4.5 with global CRIS profile
MODEL_ID = "global.anthropic.claude-sonnet-4-5-20250929-v1:0"

print(f"Region: {REGION}")
print(f"Model: {MODEL_ID}")
print(f"boto3 version: {boto3.__version__}")

Region: us-east-1
Model: global.anthropic.claude-sonnet-4-5-20250929-v1:0
boto3 version: 1.42.25


<div class="alert alert-block alert-info">
<b>Note:</b> This notebook uses Claude Sonnet 4.5 with the <b>global</b> CRIS (Cross-Region Inference Service) profile. The global profile offers ~10% cost savings and higher availability by automatically routing requests to regions with capacity.
</div>

---

## 1. Understanding Tokens

Tokens are the fundamental units that LLMs use to process text. Understanding tokenization is crucial because:
- **Pricing** is based on token count, not character count
- **Context limits** are measured in tokens
- **Quota limits** (TPM) are based on tokens processed

### What is a Token?

A token can be:
- A complete word ("hello" = 1 token)
- Part of a word ("tokenization" might be 2-3 tokens)
- Punctuation or whitespace
- Special characters

**Rule of thumb**: For English text, 1 token is approximately 4 characters or 0.75 words.
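
As a quick illustration of the rule of thumb, here is a minimal sketch of a character-based estimator. The function name `estimate_tokens` and its `chars_per_token` parameter are hypothetical and not part of this notebook's `utils.py`; the result is only a rough guess and does not replace a real token count.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English-text heuristic only)."""
    if not text:
        return 0
    # Round up so that any non-empty text counts as at least one token
    return math.ceil(len(text) / chars_per_token)


# 42 characters / 4 characters per token = 10.5, rounded up to 11
print(estimate_tokens("Explain quantum computing in simple terms."))  # 11
```

An estimate like this is useful for back-of-the-envelope sizing; message formatting overhead is not included, so real counts for short prompts are usually higher.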

### CountTokens API

Amazon Bedrock provides a `CountTokens` API that allows you to count tokens before making inference calls. This is useful for:
- Estimating costs before processing
- Ensuring prompts fit within context limits
- Validating caching strategies (minimum token requirements)

<div class="alert alert-block alert-info">
<b>Note:</b> The CountTokens API is free to use and does not count against your quotas.
</div>

In [3]:
def count_tokens(text, model_id="anthropic.claude-sonnet-4-20250514-v1:0"):
    """
    Count tokens for a given text using Bedrock's CountTokens API.
    
    Args:
        text: The text to count tokens for
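        model_id: The model ID to use for tokenization. The default is a Claude
            Sonnet 4 model ID, not the Sonnet 4.5 profile in MODEL_ID; pass a
            different ID to count tokens for another model.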
    
    Returns:
        dict with token count and character count
    """
    response = bedrock_runtime.count_tokens(
        modelId=model_id,
        input={
            "converse": {
                "messages": [
                    {
                        "role": "user",
                        "content": [{"text": text}]
                    }
                ]
            }
        }
    )
    
    return {
        'tokens': response['inputTokens'],
        'characters': len(text),
        'chars_per_token': len(text) / response['inputTokens'] if response['inputTokens'] > 0 else 0
    }

# Test with different prompt lengths
sample_prompts = [
    "Hello, world!",
    "Explain quantum computing in simple terms.",
    "Write a detailed analysis of the economic impact of artificial intelligence on the global workforce over the next decade, including specific sectors that will be most affected and potential mitigation strategies."
]

print("="*70)
print("TOKEN COUNT ANALYSIS")
print("="*70)
print(f"{'Prompt Preview':<45} {'Chars':>8} {'Tokens':>8} {'Ratio':>8}")
print("-"*70)

for prompt in sample_prompts:
    result = count_tokens(prompt)
    preview = prompt[:42] + "..." if len(prompt) > 45 else prompt
    print(f"{preview:<45} {result['characters']:>8} {result['tokens']:>8} {result['chars_per_token']:>7.1f}")

print("="*70)

TOKEN COUNT ANALYSIS
Prompt Preview                                   Chars   Tokens    Ratio
----------------------------------------------------------------------
Hello, world!                                       13       11     1.2
Explain quantum computing in simple terms.          42       15     2.8
Write a detailed analysis of the economic ...      212       41     5.2


<div class="alert alert-block alert-warning">
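<b>Key Insight:</b> English text averages approximately 4 characters per token, but the ratio varies with vocabulary, special characters, and formatting. In the results above it ranges from 1.2 for the 13-character prompt to 5.2 for the 212-character prompt, likely because short prompts are dominated by fixed message-formatting tokens. Always use the CountTokens API for accurate counts when estimating costs for production workloads.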
</div>

<div class="alert alert-block alert-info">
<b>Token Overhead:</b> The actual token count billed may be slightly higher than just your text content. The underlying API adds a small overhead for:
<ul>
<li>Message formatting (role markers, turn separators)</li>
<li>System prompt wrapper tokens</li>
<li>Tool definitions (if using tools)</li>
</ul>
</div>

---

## 2. Understanding Pricing

Amazon Bedrock uses a pay-per-token pricing model for on-demand mode. Understanding this model is essential for cost optimization.

### Token Pricing Components

| Token Type | Description | Typical Cost Ratio |
|------------|-------------|--------------------|
| Input Tokens | Tokens in your prompt | 1x (base rate) |
| Output Tokens | Tokens generated by model | 3-5x input rate |
| Cache Write | Tokens written to cache | 1.25x input rate |
| Cache Read | Tokens read from cache | 0.1x input rate |

**Key observation**: Output tokens cost significantly more than input tokens. This means:
1. Optimizing prompt length saves money
2. Controlling output length (via `max_tokens`) manages costs
3. Caching can provide substantial savings (up to 90% on cached reads)
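
The four token types combine into a per-request cost as a simple weighted sum. The sketch below is an illustration, not the implementation in `utils.py`: the `request_cost` function and the `PRICES_PER_1M` dictionary are hypothetical names, and the prices are the Claude Sonnet 4.5 (Global) rates that this notebook's pricing table reports. Always verify current prices before relying on them.

```python
# Illustrative prices in USD per 1M tokens (assumed from this notebook's pricing table)
PRICES_PER_1M = {
    "input": 3.00,
    "output": 15.00,
    "cache_write": 3.75,  # 1.25x the input rate
    "cache_read": 0.30,   # 0.1x the input rate
}


def request_cost(input_tokens, output_tokens, cache_write_tokens=0,
                 cache_read_tokens=0, prices=PRICES_PER_1M):
    """Return the estimated cost in USD of a single request."""
    total_per_million = (
        input_tokens * prices["input"]
        + output_tokens * prices["output"]
        + cache_write_tokens * prices["cache_write"]
        + cache_read_tokens * prices["cache_read"]
    )
    return total_per_million / 1_000_000


# 5,000 input tokens and 150 output tokens
print(f"${request_cost(5000, 150):.5f}")  # $0.01725
```

Note how the 150 output tokens contribute $0.00225, which is 15% of the input cost despite being only 3% of the input token count.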

In [4]:
# Display pricing table from utils.py
# Pricing is defined in utils.py for reuse across notebooks
# Always verify current pricing at: https://aws.amazon.com/bedrock/pricing/

print_pricing_table()

Model Pricing (per 1M tokens) - as of January 2026:
Model                               Input     Output    Cache Write     Cache Read
                                                        (5m, 1.25x)     (5m, 0.1x)
------------------------------------------------------------------------------------------
Claude Sonnet 4.5 (Global)     $    3.00 $   15.00 $        3.75 $        0.30
Claude Haiku 4.5 (Global)      $    1.00 $    5.00 $        1.25 $        0.10

Note: Cache pricing shown is for 5-minute TTL cache.


### Interactive Cost Calculator

Let's build a cost calculator to understand the economics of token optimization at scale.

In [5]:
# Scenario: Customer support chatbot
# - Unoptimized: 5,000 token prompt (verbose system instructions + policies)
# - Optimized: 3,000 token prompt (streamlined instructions)
# - Average output: 150 tokens per response

comparison = compare_optimization(
    original_tokens=5000,
    optimized_tokens=3000,
    output_tokens=150,
    num_requests=1_000_000,  # 1 million requests
    model_id=MODEL_ID
)

print("="*70)
print("COST COMPARISON: Token Optimization at Scale")
print("="*70)
print("Scenario: 1 million customer support requests")
print(f"Model: {comparison['original']['model']}")
print()

print("Unoptimized (5,000 input tokens per request):")
print(f"  Input cost:  ${comparison['original']['input_cost']:>12,.2f}")
print(f"  Output cost: ${comparison['original']['output_cost']:>12,.2f}")
print(f"  Total cost:  ${comparison['original']['total_cost']:>12,.2f}")

print()
print("Optimized (3,000 input tokens per request):")
print(f"  Input cost:  ${comparison['optimized']['input_cost']:>12,.2f}")
print(f"  Output cost: ${comparison['optimized']['output_cost']:>12,.2f}")
print(f"  Total cost:  ${comparison['optimized']['total_cost']:>12,.2f}")

print()
print(f"SAVINGS: ${comparison['savings']:,.2f} ({comparison['savings_pct']:.1f}% reduction)")
print("="*70)

COST COMPARISON: Token Optimization at Scale
Scenario: 1 million customer support requests
Model: Claude Sonnet 4.5 (Global)

Unoptimized (5,000 input tokens per request):
  Input cost:  $   15,000.00
  Output cost: $    2,250.00
  Total cost:  $   17,250.00

Optimized (3,000 input tokens per request):
  Input cost:  $    9,000.00
  Output cost: $    2,250.00
  Total cost:  $   11,250.00

SAVINGS: $6,000.00 (34.8% reduction)


### ROI at Different Scales

Let's see how optimization savings compound at different request volumes.

In [6]:
def roi_at_scale(original_tokens, optimized_tokens, output_tokens=150, model_id=MODEL_ID):
    """
    Calculate ROI of prompt optimization at different scales.
    """
    scales = [
        (1_000, "1K (pilot)"),
        (10_000, "10K (small)"),
        (100_000, "100K (medium)"),
        (1_000_000, "1M (large)"),
        (10_000_000, "10M (enterprise)"),
    ]

    print("="*80)
    print(f"ROI ANALYSIS: {original_tokens} -> {optimized_tokens} tokens")
    print("="*80)
    print(f"{'Scale':<20} {'Original':>15} {'Optimized':>15} {'Savings':>15} {'%':>10}")
    print("-"*80)

    for num_requests, label in scales:
        result = compare_optimization(original_tokens, optimized_tokens, output_tokens, num_requests, model_id)
        print(f"{label:<20} ${result['original']['total_cost']:>13,.2f} ${result['optimized']['total_cost']:>13,.2f} ${result['savings']:>13,.2f} {result['savings_pct']:>9.1f}%")

    print("="*80)

# Run ROI analysis
roi_at_scale(original_tokens=5000, optimized_tokens=3000)

ROI ANALYSIS: 5000 -> 3000 tokens
Scale                       Original       Optimized         Savings          %
--------------------------------------------------------------------------------
1K (pilot)           $        17.25 $        11.25 $         6.00      34.8%
10K (small)          $       172.50 $       112.50 $        60.00      34.8%
100K (medium)        $     1,725.00 $     1,125.00 $       600.00      34.8%
1M (large)           $    17,250.00 $    11,250.00 $     6,000.00      34.8%
10M (enterprise)     $   172,500.00 $   112,500.00 $    60,000.00      34.8%


<div class="alert alert-block alert-warning">
<b>Key Insight:</b> A 40% reduction in input tokens (5,000 to 3,000) yields approximately 35% total cost savings (34.8% in the table above), because the output cost is unchanged. At enterprise scale (10M requests), this translates to $60,000 in savings - and this is before applying prompt caching, which can save up to 90% on cached read tokens!
</div>
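
The arithmetic behind the ROI table can be reproduced in a few lines. This is a sketch under stated assumptions, not the actual `compare_optimization` code from `utils.py`: the function name `optimization_savings` is hypothetical, and the default prices are the per-1M-token Sonnet 4.5 (Global) rates shown earlier.

```python
def optimization_savings(original_in, optimized_in, output_tokens, num_requests,
                         input_price=3.00, output_price=15.00):
    """Compare total cost before and after shrinking the prompt (prices per 1M tokens)."""

    def total_cost(input_tokens):
        per_request = input_tokens * input_price + output_tokens * output_price
        return num_requests * per_request / 1_000_000

    before = total_cost(original_in)
    after = total_cost(optimized_in)
    savings = before - after
    savings_pct = 100 * savings / before if before else 0.0
    return {"before": before, "after": after,
            "savings": savings, "savings_pct": savings_pct}


result = optimization_savings(5000, 3000, 150, 1_000_000)
print(f"{result['savings']:,.2f} saved ({result['savings_pct']:.1f}%)")
```

The savings percentage is lower than the 40% input reduction because the output cost ($2,250 per million requests in this scenario) does not shrink with the prompt. The more output-heavy a workload is, the smaller the relative saving from prompt trimming alone.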

---

## 3. Latency Impact

Token count affects not just cost but also latency. Understanding this relationship helps you build responsive applications.

### Key Latency Metrics

| Metric | What It Measures | Why It Matters |
|--------|------------------|----------------|
| **TTFT** (Time to First Token) | How quickly the model starts responding | Critical for perceived speed - users see something happening |
| **TPS** (Tokens per Second) | How fast text appears after it starts | Affects how quickly the full response is generated |
| **TTLT** (Time to Last Token) | Total time until the last token is generated | The actual generation time (excludes app overhead) |
| **E2E Latency** | Total time from request to complete response | The full picture including your application overhead |

### Measuring Real API Latency

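Let's measure actual latency from the Bedrock API using the Converse API. The non-streaming response reports total latency only, so TTFT (and the TPS derived from it) is estimated here; measuring TTFT precisely requires the streaming API.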

In [7]:
def measure_latency(prompt, model_id=MODEL_ID, max_tokens=200):
    """
    Measure latency metrics using the Converse API.
    
    Returns E2E latency, TTFT (estimated), TTLT, and TPS.
    """
    start_time = time.time()
    
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens, "temperature": 0}
    )
    
    end_time = time.time()
    
    # Extract metrics
    usage = response.get('usage', {})
    metrics = response.get('metrics', {})
    
    e2e_ms = (end_time - start_time) * 1000
    output_tokens = usage.get('outputTokens', 0)
    input_tokens = usage.get('inputTokens', 0)
    
    # API-reported latency (TTLT - time to last token)
    ttlt_ms = metrics.get('latencyMs', e2e_ms)
    
    # Estimate TTFT with a rough heuristic: an assumed ~0.5 ms per input token
    # plus ~200 ms of fixed overhead. This is not a measurement; use the
    # streaming API when you need a measured TTFT.
    estimated_ttft_ms = (input_tokens * 0.5) + 200
    
    # TPS: tokens per second for output generation
    generation_time_ms = ttlt_ms - estimated_ttft_ms
    tps = (output_tokens / (generation_time_ms / 1000)) if generation_time_ms > 0 else 0
    
    return {
        'ttft_ms': estimated_ttft_ms,
        'ttlt_ms': ttlt_ms,
        'e2e_ms': e2e_ms,
        'tps': tps,
        'input_tokens': input_tokens,
        'output_tokens': output_tokens,
    }


# Measure real latency with prompts that elicit short, medium, and long responses
print("="*90)
print("REAL API LATENCY MEASUREMENT")
print("="*90)

test_prompts = [
    ("Short", "What is 2+2? Answer briefly."),
    ("Medium", "Explain photosynthesis in 2-3 sentences."),
    ("Long", "Explain machine learning training in one paragraph.")
]

print(f"\n{'Prompt':<10} {'TTFT':>12} {'TTLT':>12} {'E2E':>12} {'TPS':>10} {'In':>8} {'Out':>8}")
print("-"*90)

for label, prompt in test_prompts:
    metrics = measure_latency(prompt, max_tokens=150)
    print(f"{label:<10} {metrics['ttft_ms']:>10.0f}ms {metrics['ttlt_ms']:>10.0f}ms {metrics['e2e_ms']:>10.0f}ms {metrics['tps']:>8.1f}/s {metrics['input_tokens']:>8} {metrics['output_tokens']:>8}")

print("="*90)
print("\nKey metrics:")
print("- TTFT (Time to First Token): Estimated time until first token starts generating")
print("- TTLT (Time to Last Token): Total model processing time (from API metrics)")
print("- E2E (End-to-End): Total round-trip time including network overhead")
print("- TPS: Output tokens generated per second")

REAL API LATENCY MEASUREMENT

Prompt             TTFT         TTLT          E2E        TPS       In      Out
------------------------------------------------------------------------------------------
Short             208ms       2029ms       2296ms      6.0/s       17       11
Medium            210ms       2491ms       2759ms     39.0/s       21       89
Long              208ms       3882ms       4210ms     38.1/s       16      140

Key metrics:
- TTFT (Time to First Token): Estimated time until first token starts generating
- TTLT (Time to Last Token): Total model processing time (from API metrics)
- E2E (End-to-End): Total round-trip time including network overhead
- TPS: Output tokens generated per second


<div class="alert alert-block alert-info">
<b>Note:</b> TTFT (Time to First Token) is critical for user-perceived responsiveness. In interactive applications, use <b>streaming</b> to show tokens as they're generated - users perceive faster responses even if total generation time (TTLT) is the same. TPS (Tokens per Second) indicates how fast the model generates output after starting.
</div>
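
To measure TTFT rather than estimate it, you can time the arrival of the first text chunk from a streaming response. The sketch below is one way to do that, assuming the event shapes returned by the `converse_stream` call in boto3 (`contentBlockDelta` events carrying text, and a final `metadata` event carrying usage). The helper name `measure_stream` and its `open_stream` and `clock` parameters are hypothetical; the clock is injectable so that the logic can be checked without calling AWS.

```python
import time


def measure_stream(open_stream, clock=time.perf_counter):
    """Time a streamed response; open_stream() must start the request and return its events."""
    start = clock()
    first_text_at = None
    text_parts = []
    usage = {}

    for event in open_stream():
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"]
            if "text" in delta:
                if first_text_at is None:
                    first_text_at = clock()  # first visible token has arrived
                text_parts.append(delta["text"])
        elif "metadata" in event:
            usage = event["metadata"].get("usage", {})

    end = clock()
    ttft_s = first_text_at - start if first_text_at is not None else None
    ttlt_s = end - start
    generation_s = ttlt_s - ttft_s if ttft_s is not None else 0.0
    output_tokens = usage.get("outputTokens", 0)
    return {
        "text": "".join(text_parts),
        "ttft_ms": ttft_s * 1000 if ttft_s is not None else None,
        "ttlt_ms": ttlt_s * 1000,
        "tps": output_tokens / generation_s if generation_s > 0 else 0.0,
        "output_tokens": output_tokens,
    }


# Example with the notebook's client (requires AWS credentials, so it is commented out):
# stats = measure_stream(lambda: bedrock_runtime.converse_stream(
#     modelId=MODEL_ID,
#     messages=[{"role": "user", "content": [{"text": "Explain photosynthesis."}]}],
#     inferenceConfig={"maxTokens": 150},
# )["stream"])
```

Because the timings are taken on the client, both values include network time, so they will typically be somewhat higher than the model-side latency that the API reports.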

---

## 4. Converse API

Amazon Bedrock provides two main APIs for model inference:

### Converse API vs InvokeModel API

| Aspect | Converse API | InvokeModel API |
|--------|--------------|-----------------|
| **Interface** | Unified across all models | Model-specific request format |
| **Code portability** | Same code works for Claude, Nova, Mistral, etc. | Different code per model provider |
| **Multi-turn support** | Standard `messages` list for prior turns | Provider-specific history format |
| **Tool use** | Native support | Provider-specific implementation |
| **Prompt caching** | Supported with cache checkpoints | Provider-specific |
| **Best for** | Most applications | Model-specific features |

### Which API Should You Use?

| Scenario | Recommended |
|----------|-------------|
| Building a new application | **Converse API** - easier to switch models |
| Multi-turn conversations | **Converse API** - built-in support |
| Tool use / function calling | **Converse API** - unified interface |
| Need Anthropic-specific features | InvokeModel API |
| Existing InvokeModel code | Continue using it (both work fine) |

> **For this workshop**: We use the **Converse API** as it's the recommended approach for most applications. All subsequent notebooks will use the Converse API.

### Key Benefits of Converse API

- **Unified interface**: Same API structure for Claude, Nova, Mistral, and other models
- **Multi-turn support**: A standard `messages` structure for conversation history (the API is stateless, so you resend prior turns with each request)
- **Tool use**: Native support for function calling
- **Prompt caching**: Cache checkpoints for cost optimization (covered in next notebook)

In [8]:
def converse(prompt, model_id=MODEL_ID, max_tokens=100, temperature=0):
    """
    Make a simple inference call using the Converse API.
    
    Args:
        prompt: The user message
        model_id: Model to use for inference
        max_tokens: Maximum tokens in the response
        temperature: Sampling temperature (0 = deterministic)
    
    Returns:
        dict with response text and usage metrics
    """
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[
            {
                "role": "user",
                "content": [{"text": prompt}]
            }
        ],
        inferenceConfig={
            "maxTokens": max_tokens,
            "temperature": temperature
        }
    )
    
    return {
        "text": response['output']['message']['content'][0]['text'],
        "stop_reason": response['stopReason'],
        "usage": response['usage'],
        "latency_ms": response.get('metrics', {}).get('latencyMs', None)
    }

# Make a simple inference call
result = converse("What is the capital of France? Answer in one sentence.")

print("="*70)
print("CONVERSE API DEMO")
print("="*70)
print(f"Response: {result['text']}")
print(f"Stop Reason: {result['stop_reason']}")
print("="*70)

CONVERSE API DEMO
Response: The capital of France is Paris.
Stop Reason: end_turn


### Understanding Usage Metrics

Every Converse API response includes usage metrics that are essential for cost tracking and optimization.

In [9]:
# Display detailed usage metrics from the previous call
usage = result['usage']

print("="*70)
print("USAGE METRICS BREAKDOWN")
print("="*70)
print()
print("Token Counts:")
print(f"  inputTokens:           {usage.get('inputTokens', 0):>8}  (uncached input)")
print(f"  outputTokens:          {usage.get('outputTokens', 0):>8}  (generated response)")
print(f"  cacheReadInputTokens:  {usage.get('cacheReadInputTokens', 0):>8}  (read from 5m cache - 0.1x cost)")
print(f"  cacheWriteInputTokens: {usage.get('cacheWriteInputTokens', 0):>8}  (written to 5m cache - 1.25x cost)")
print(f"  totalTokens:           {usage.get('totalTokens', usage.get('inputTokens', 0) + usage.get('outputTokens', 0)):>8}  (total processed)")
print()

# Calculate actual cost using the utility function
actual_cost = calculate_actual_cost(
    input_tokens=usage.get('inputTokens', 0),
    output_tokens=usage.get('outputTokens', 0),
    model_id=MODEL_ID
)

print(f"Estimated Cost: ${actual_cost:.8f}")
if result.get('latency_ms'):
    print(f"Latency: {result['latency_ms']} ms")
print("="*70)

USAGE METRICS BREAKDOWN

Token Counts:
  inputTokens:                 19  (uncached input)
  outputTokens:                10  (generated response)
  cacheReadInputTokens:         0  (read from 5m cache - 0.1x cost)
  cacheWriteInputTokens:        0  (written to 5m cache - 1.25x cost)
  totalTokens:                 29  (total processed)

Estimated Cost: $0.00020700
Latency: 1552 ms


### Usage Metrics Reference

| Metric | Description | Cost Impact |
|--------|-------------|-------------|
| `inputTokens` | Tokens in prompt that were NOT cached | 1x base rate |
| `outputTokens` | Tokens generated by model | 3-5x base rate |
| `cacheWriteInputTokens` | Tokens written to 5m cache (first request) | 1.25x base rate |
| `cacheReadInputTokens` | Tokens read from 5m cache (subsequent requests) | 0.1x base rate (90% savings!) |
| `totalTokens` | Sum of all tokens processed | Varies by type |

---

## 5. TPM and RPM Quotas

Amazon Bedrock enforces throughput limits to ensure fair resource allocation. Understanding these quotas is essential for building reliable production systems.

### Quota Types

| Quota | Description |
|-------|-------------|
| **TPM** (Tokens Per Minute) | Maximum tokens processed per minute |
| **RPM** (Requests Per Minute) | Maximum API calls per minute |

Both limits apply simultaneously - you must stay within both to avoid throttling.

<div class="alert alert-block alert-warning">
<b>Important:</b> TPM quota consumption is calculated as: <code>InputTokenCount + CacheWriteInputTokens + (OutputTokenCount x output_burndown_rate)</code>. For newer Claude models, the output burndown rate is typically 5x, meaning each output token consumes 5 tokens of your TPM quota. This is because output generation is more compute-intensive than input processing.
</div>
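
The formula above translates directly into code. The sketch below is illustrative only: `tpm_consumed` is a hypothetical name (the notebook's own helper lives in `utils.py`), and the default burndown rate of 5 is the value quoted above for newer Claude models.

```python
def tpm_consumed(input_tokens, output_tokens, cache_write_tokens=0, burndown_rate=5):
    """TPM quota consumed by one request, per the formula above."""
    return input_tokens + cache_write_tokens + output_tokens * burndown_rate


# 1,000 input tokens and 200 output tokens: 1,000 + 200 x 5
print(tpm_consumed(1000, 200))  # 2000
```

In this example the 200 output tokens consume as much quota as the 1,000 input tokens, which is why output length matters so much for throughput planning.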

### TPM Quota: Before vs After Request

Let's see how TPM quota consumption works in practice:
1. **Before request**: Bedrock reserves quota based on `max_tokens` setting
2. **After request**: Actual consumption is calculated from real token usage

This is important because over-setting `max_tokens` reserves more quota than needed, reducing concurrency.

In [10]:
# TPM Calculation: Before vs After Request
# Using Claude Sonnet 4.5 with 5x output burndown rate
# Functions imported from utils.py: calculate_tpm_reservation, calculate_tpm_actual, calculate_actual_cost

# Make a real API call to demonstrate
print("="*80)
print(f"TPM QUOTA: BEFORE vs AFTER REQUEST (Claude Sonnet 4.5, {OUTPUT_BURNDOWN_RATE}x burndown)")
print("="*80)

# Set up request parameters
test_prompt = "What are three benefits of cloud computing? Be brief."
max_tokens_setting = 500  # What we set in the API call

# Count input tokens first
token_count = count_tokens(test_prompt)
input_tokens = token_count['tokens']

# BEFORE: Calculate reservation
tpm_reserved = calculate_tpm_reservation(input_tokens, max_tokens_setting)

print("\n>>> BEFORE REQUEST (Quota Reservation)")
print(f"   Input tokens:           {input_tokens:>8}")
print(f"   max_tokens setting:     {max_tokens_setting:>8}")
print(f"   Output quota reserved:  {max_tokens_setting * OUTPUT_BURNDOWN_RATE:>8}  ({max_tokens_setting} x {OUTPUT_BURNDOWN_RATE})")
print("   ------------------------------------")
print(f"   TOTAL TPM RESERVED:     {tpm_reserved:>8}")

# Make the actual API call
result = converse(test_prompt, max_tokens=max_tokens_setting)
usage = result['usage']
actual_output = usage['outputTokens']
actual_input = usage['inputTokens']

# AFTER: Calculate actual consumption
tpm_actual = calculate_tpm_actual(actual_input, actual_output)
actual_cost = calculate_actual_cost(actual_input, actual_output, MODEL_ID)

print("\n<<< AFTER REQUEST (Actual Usage)")
print(f"   Input tokens:           {actual_input:>8}")
print(f"   Output tokens:          {actual_output:>8}")
print(f"   Output TPM consumed:    {actual_output * OUTPUT_BURNDOWN_RATE:>8}  ({actual_output} x {OUTPUT_BURNDOWN_RATE})")
print("   ------------------------------------")
print(f"   ACTUAL TPM CONSUMED:    {tpm_actual:>8}")

# Show the difference
tpm_wasted = tpm_reserved - tpm_actual
efficiency = (tpm_actual / tpm_reserved) * 100 if tpm_reserved > 0 else 100

print("\n=== ANALYSIS")
print(f"   TPM reserved:           {tpm_reserved:>8}")
print(f"   TPM actually used:      {tpm_actual:>8}")
print(f"   TPM over-reserved:      {tpm_wasted:>8}  (wasted quota capacity)")
print(f"   Quota efficiency:       {efficiency:>7.1f}%")
print(f"\n$$$ ACTUAL COST: ${actual_cost:.6f}")
print("="*80)

TPM QUOTA: BEFORE vs AFTER REQUEST (Claude Sonnet 4.5, 5x burndown)

>>> BEFORE REQUEST (Quota Reservation)
   Input tokens:                 18
   max_tokens setting:          500
   Output quota reserved:      2500  (500 x 5)
   ------------------------------------
   TOTAL TPM RESERVED:         2518

<<< AFTER REQUEST (Actual Usage)
   Input tokens:                 18
   Output tokens:                80
   Output TPM consumed:         400  (80 x 5)
   ------------------------------------
   ACTUAL TPM CONSUMED:         418

=== ANALYSIS
   TPM reserved:               2518
   TPM actually used:           418
   TPM over-reserved:          2100  (wasted quota capacity)
   Quota efficiency:          16.6%

$$$ ACTUAL COST: $0.001254


<div class="alert alert-block alert-warning">
<b>Key Insight:</b> The output burndown rate (5x for Claude Sonnet 4.5) means each output token consumes 5 tokens of your TPM quota. <b>But remember:</b>
<ul>
<li><b>TPM affects concurrency, not cost</b> - you're billed for actual tokens, not reserved quota</li>
<li><b>Over-setting max_tokens wastes quota capacity</b> - the example above shows how setting max_tokens=500 but generating only 80 output tokens reserves about 6x more quota than was consumed (2,518 vs 418)</li>
<li><b>Right-size max_tokens</b> - set it to expected output + 10-15% buffer, not the model maximum</li>
</ul>
</div>
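
To see what right-sizing buys you, the sketch below compares how many requests fit into a minute of quota at two `max_tokens` settings. It assumes the reservation model used in the example above (input tokens plus `max_tokens` times the burndown rate). Both function names and the 200,000 TPM quota are hypothetical values chosen for illustration; check your own account's quota.

```python
import math


def right_sized_max_tokens(expected_output_tokens, buffer_pct=15):
    """Expected output plus a percentage buffer, rounded up."""
    return math.ceil(expected_output_tokens * (100 + buffer_pct) / 100)


def requests_per_minute(tpm_quota, input_tokens, max_tokens, burndown_rate=5):
    """How many requests can be reserved at once within the TPM quota (assumed model)."""
    reserved = input_tokens + max_tokens * burndown_rate
    return tpm_quota // reserved


TPM_QUOTA = 200_000  # hypothetical quota for illustration

sized = right_sized_max_tokens(80)                # 80 expected tokens + 15% buffer
print(sized)                                      # 92
print(requests_per_minute(TPM_QUOTA, 18, 500))    # 79
print(requests_per_minute(TPM_QUOTA, 18, sized))  # 418
```

Under these assumptions, lowering `max_tokens` from 500 to 92 lets roughly five times as many requests fit within the same quota. The RPM limit still applies independently, so the real ceiling is whichever limit is reached first.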

### Additional Quota Considerations

- **In-region vs CRIS**: Cross-Region Inference Service (CRIS) has separate quotas from in-region calls
- **max_tokens reservation**: Setting `max_tokens` reserves that capacity during request processing
- **Quota increases**: Request increases via the AWS Service Quotas console for production workloads
- **Concurrent requests**: Both TPM and RPM limits apply simultaneously

---

## Summary

In this notebook, you learned the fundamental concepts for working with LLMs on Amazon Bedrock:

| Concept | Key Takeaway |
|---------|-------------|
| **Tokens** | ~4 characters per token for English; use `CountTokens` API for accurate counts; watch for API overhead |
| **Pricing** | Output tokens cost 3-5x more than input; 5m cache reads save 90%; global CRIS saves ~10% |
| **Cost at Scale** | 40% input token reduction = ~35% cost savings in the example; $60K saved at 10M requests |
| **Latency** | TTFT critical for UX; TTLT = total generation time; use streaming for responsiveness |
| **APIs** | Use **Converse API** for most applications; unified interface across all models |
| **Usage Metrics** | Track `inputTokens`, `outputTokens`, and cache metrics for cost visibility |
| **Quotas** | TPM formula includes output burndown (5x); TPM affects concurrency, not cost |

### What's Next

In the next notebook, **02-optimization-strategy.ipynb**, you will learn:
- Model selection strategies (right-sizing for your use case)
- Prompt design best practices (clear instructions, few-shot examples)
- Parameter tuning (temperature, max_tokens)
- Structured output with tool use
- Prompt caching fundamentals
- CloudWatch monitoring for Bedrock

---

## Additional Resources

- [Amazon Bedrock Pricing](https://aws.amazon.com/bedrock/pricing/)
- [Amazon Bedrock Quotas](https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html)
- [Converse API Documentation](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_Converse.html)
- [Prompt Caching Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html)