# 1.1 The Economics of Prompt Optimization

**Duration**: 20 minutes

**Learning Objectives**:
- Understand why prompt optimization matters (cost, latency, scale)
- Calculate the ROI of optimization at scale
- Learn the paradigm shift from ad-hoc to systematic optimization

---

## Setup

First, let's load environment variables and import necessary libraries.

In [1]:
# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

import os
import json

# Verify AWS credentials are loaded (without exposing sensitive values)
# Credentials can come from .env, AWS CLI (~/.aws/credentials), or environment variables
if os.getenv('AWS_ACCESS_KEY_ID') and os.getenv('AWS_SECRET_ACCESS_KEY'):
    print("✅ AWS credentials loaded successfully")
    print(f"✅ Region: {os.getenv('AWS_DEFAULT_REGION', 'us-east-1')}")
else:
    print("✅ Environment loaded (AWS CLI credentials will be used)")
    print("ℹ️  boto3 will automatically use AWS CLI credentials from ~/.aws/credentials")

✅ Environment loaded (AWS CLI credentials will be used)
ℹ️  boto3 will automatically use AWS CLI credentials from ~/.aws/credentials


## Why Prompt Optimization Matters

### 1. Cost Efficiency

Input tokens typically represent **40-60% of total inference costs**, and the share climbs higher when prompts are long and responses are short. At scale (millions of requests), small optimizations compound dramatically.

**Example**: Reducing a 5,000-token prompt to 3,000 tokens = **40% cost savings** on input tokens.
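How much of the bill comes from input tokens depends on the shape of the workload. The sketch below is a rough illustration, assuming example prices of $3.00 and $15.00 per million input and output tokens (always check current pricing) and three hypothetical workload shapes that are not taken from any real system.

```python
def input_cost_share(input_tokens, output_tokens,
                     input_price_per_mtok=3.0, output_price_per_mtok=15.0):
    """Return the fraction of per-request cost attributable to input tokens."""
    input_cost = input_tokens * input_price_per_mtok
    output_cost = output_tokens * output_price_per_mtok
    return input_cost / (input_cost + output_cost)

# Hypothetical workload shapes: (input tokens, output tokens) per request
workloads = {
    "short prompt, long answer": (1_000, 300),
    "balanced": (2_000, 400),
    "long prompt, short answer": (5_000, 150),
}

for name, (inp, out) in workloads.items():
    print(f"{name:<28} input share: {input_cost_share(inp, out):.0%}")
```

Under these assumed prices, the input share ranges from 40% for the first shape to about 87% for the last, which is why long system prompts paired with short replies benefit most from trimming.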

### 2. Latency Impact

Each token adds processing time (time-to-first-token and overall latency). Optimized prompts = faster response times = better user experience.

Critical for real-time applications: chatbots, customer support, interactive agents.

### 3. Scale Considerations

Small inefficiencies become expensive at high volumes:
- **1,000 tokens × 1 million requests = 1 billion tokens processed**
- Total savings from optimization grow in direct proportion to usage

### 4. User Experience

- Faster responses improve user satisfaction
- Better quality outputs from well-crafted prompts
- Reduced timeout errors and failures

---

## Interactive Cost Calculator

Let's calculate the cost impact of token optimization at scale.

**Scenario**: You're building a customer support chatbot.

**Assumptions** (example pricing - always check current Bedrock pricing):
- Input tokens: $3.00 per million tokens
- Output tokens: $15.00 per million tokens
- Average output: 150 tokens per response

In [2]:
def calculate_cost(input_tokens, output_tokens, num_requests, input_price_per_mtok=3.0, output_price_per_mtok=15.0):
    """
    Calculate total cost for LLM inference.
    
    Args:
        input_tokens: Number of input tokens per request
        output_tokens: Number of output tokens per request
        num_requests: Total number of requests
        input_price_per_mtok: Price per million input tokens (default: $3.00)
        output_price_per_mtok: Price per million output tokens (default: $15.00)
    
    Returns:
        dict with cost breakdown
    """
    total_input_tokens = input_tokens * num_requests
    total_output_tokens = output_tokens * num_requests
    
    input_cost = (total_input_tokens / 1_000_000) * input_price_per_mtok
    output_cost = (total_output_tokens / 1_000_000) * output_price_per_mtok
    total_cost = input_cost + output_cost
    
    return {
        'total_input_tokens': total_input_tokens,
        'total_output_tokens': total_output_tokens,
        'input_cost': input_cost,
        'output_cost': output_cost,
        'total_cost': total_cost
    }

# Scenario: Unoptimized prompt (5,000 tokens)
unoptimized = calculate_cost(
    input_tokens=5000,
    output_tokens=150,
    num_requests=1_000_000  # 1 million requests
)

# Scenario: Optimized prompt (3,000 tokens) - 40% reduction
optimized = calculate_cost(
    input_tokens=3000,
    output_tokens=150,
    num_requests=1_000_000
)

# Calculate savings
savings = unoptimized['total_cost'] - optimized['total_cost']
savings_percentage = (savings / unoptimized['total_cost']) * 100

print("=" * 60)
print("COST COMPARISON: Unoptimized vs. Optimized Prompt")
print("=" * 60)
print(f"\n📊 Scenario: 1 million customer support requests\n")

print("Unoptimized (5,000 input tokens per request):")
print(f"  - Input cost:  ${unoptimized['input_cost']:,.2f}")
print(f"  - Output cost: ${unoptimized['output_cost']:,.2f}")
print(f"  - Total cost:  ${unoptimized['total_cost']:,.2f}")

print("\nOptimized (3,000 input tokens per request):")
print(f"  - Input cost:  ${optimized['input_cost']:,.2f}")
print(f"  - Output cost: ${optimized['output_cost']:,.2f}")
print(f"  - Total cost:  ${optimized['total_cost']:,.2f}")

print(f"\n💰 Savings: ${savings:,.2f} ({savings_percentage:.1f}% reduction)")
print("=" * 60)

COST COMPARISON: Unoptimized vs. Optimized Prompt

📊 Scenario: 1 million customer support requests

Unoptimized (5,000 input tokens per request):
  - Input cost:  $15,000.00
  - Output cost: $2,250.00
  - Total cost:  $17,250.00

Optimized (3,000 input tokens per request):
  - Input cost:  $9,000.00
  - Output cost: $2,250.00
  - Total cost:  $11,250.00

💰 Savings: $6,000.00 (34.8% reduction)


### Key Insight

By reducing prompt size from 5,000 to 3,000 tokens (40% reduction):
- **Input cost reduced from $15,000 to $9,000**, saving $6,000 at 1M requests
- **Input cost represents 87% of total cost** in the unoptimized scenario ($15,000 input vs. $2,250 output)
- **34.8% total cost reduction** through token optimization alone
- This is before applying prompt caching, which can save another 75-90% on cached input tokens

---

## Latency Impact Simulator

Let's simulate how token count affects response latency.

**Approximate processing rates**:
- Input tokens: ~0.5ms per token (model-dependent)
- Output tokens: ~50ms per token (generation is slower)

In [3]:
def calculate_latency(input_tokens, output_tokens, input_ms_per_token=0.5, output_ms_per_token=50):
    """
    Estimate latency for LLM inference.
    
    Args:
        input_tokens: Number of input tokens
        output_tokens: Number of output tokens
        input_ms_per_token: Processing time per input token in ms (default: 0.5ms)
        output_ms_per_token: Generation time per output token in ms (default: 50ms)
    
    Returns:
        dict with latency breakdown
    """
    input_latency_ms = input_tokens * input_ms_per_token
    output_latency_ms = output_tokens * output_ms_per_token
    total_latency_ms = input_latency_ms + output_latency_ms
    
    return {
        'input_latency_ms': input_latency_ms,
        'output_latency_ms': output_latency_ms,
        'total_latency_ms': total_latency_ms,
        'total_latency_sec': total_latency_ms / 1000
    }

# Compare latency: Unoptimized vs. Optimized
latency_unoptimized = calculate_latency(input_tokens=5000, output_tokens=150)
latency_optimized = calculate_latency(input_tokens=3000, output_tokens=150)

latency_savings_ms = latency_unoptimized['total_latency_ms'] - latency_optimized['total_latency_ms']
latency_savings_pct = (latency_savings_ms / latency_unoptimized['total_latency_ms']) * 100

print("=" * 60)
print("LATENCY COMPARISON: Unoptimized vs. Optimized Prompt")
print("=" * 60)
print(f"\n📊 Scenario: Customer support response generation\n")

print("Unoptimized (5,000 input tokens):")
print(f"  - Input processing:  {latency_unoptimized['input_latency_ms']:.0f}ms")
print(f"  - Output generation: {latency_unoptimized['output_latency_ms']:.0f}ms")
print(f"  - Total latency:     {latency_unoptimized['total_latency_sec']:.2f}s")

print("\nOptimized (3,000 input tokens):")
print(f"  - Input processing:  {latency_optimized['input_latency_ms']:.0f}ms")
print(f"  - Output generation: {latency_optimized['output_latency_ms']:.0f}ms")
print(f"  - Total latency:     {latency_optimized['total_latency_sec']:.2f}s")

print(f"\n⚡ Latency improvement: {latency_savings_ms:.0f}ms ({latency_savings_pct:.1f}% faster)")
print("=" * 60)

LATENCY COMPARISON: Unoptimized vs. Optimized Prompt

📊 Scenario: Customer support response generation

Unoptimized (5,000 input tokens):
  - Input processing:  2500ms
  - Output generation: 7500ms
  - Total latency:     10.00s

Optimized (3,000 input tokens):
  - Input processing:  1500ms
  - Output generation: 7500ms
  - Total latency:     9.00s

⚡ Latency improvement: 1000ms (10.0% faster)


### Key Insight

By reducing prompt size from 5,000 to 3,000 tokens:
- **Input processing time reduced by 1 second** (2,000 fewer tokens × 0.5ms/token)
- At scale, faster responses improve user experience and system throughput
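To see how the input-processing portion of latency grows with prompt size, the same linear model can be swept across several prompt lengths. This is a minimal sketch that reuses the approximate rates above (0.5ms per input token, 50ms per output token) as assumptions. Real latency depends on the model, hardware, and load, and the helper name `estimate_latency_ms` is introduced here purely for illustration.

```python
def estimate_latency_ms(input_tokens, output_tokens,
                        input_ms_per_token=0.5, output_ms_per_token=50):
    """Return (input_ms, output_ms) under a simple linear latency model."""
    return input_tokens * input_ms_per_token, output_tokens * output_ms_per_token

# Sweep prompt sizes while holding the response at 150 tokens
for prompt_tokens in (1_000, 3_000, 5_000, 10_000):
    input_ms, output_ms = estimate_latency_ms(prompt_tokens, output_tokens=150)
    total_ms = input_ms + output_ms
    print(f"{prompt_tokens:>6} input tokens: "
          f"input {input_ms:>6.0f}ms, total {total_ms / 1000:.2f}s, "
          f"input share {input_ms / total_ms:.0%}")
```

In this model the input-processing time is the delay before the first output token can appear, so trimming the prompt mainly improves time-to-first-token, while total latency remains dominated by output generation.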

---

## Scale Impact: ROI Calculator

Let's calculate the ROI of optimization across different usage scales.

In [4]:
def roi_at_scale(original_tokens, optimized_tokens, output_tokens=150):
    """
    Calculate ROI of prompt optimization at different scales.
    """
    scales = [
        (1_000, "1K requests (pilot)"),
        (10_000, "10K requests (small scale)"),
        (100_000, "100K requests (medium scale)"),
        (1_000_000, "1M requests (large scale)"),
        (10_000_000, "10M requests (enterprise scale)")
    ]
    
    print("=" * 80)
    print(f"ROI OF OPTIMIZATION: {original_tokens} → {optimized_tokens} tokens")
    print("=" * 80)
    print(f"\n{'Scale':<30} {'Original Cost':<15} {'Optimized Cost':<15} {'Savings':<15}")
    print("-" * 80)
    
    for num_requests, label in scales:
        original_cost = calculate_cost(original_tokens, output_tokens, num_requests)['total_cost']
        optimized_cost = calculate_cost(optimized_tokens, output_tokens, num_requests)['total_cost']
        savings = original_cost - optimized_cost
        
        print(f"{label:<30} ${original_cost:>12,.2f}  ${optimized_cost:>12,.2f}  ${savings:>12,.2f}")
    
    print("=" * 80)

# Run ROI analysis
roi_at_scale(original_tokens=5000, optimized_tokens=3000)

ROI OF OPTIMIZATION: 5000 → 3000 tokens

Scale                          Original Cost   Optimized Cost  Savings        
--------------------------------------------------------------------------------
1K requests (pilot)            $       17.25  $       11.25  $        6.00
10K requests (small scale)     $      172.50  $      112.50  $       60.00
100K requests (medium scale)   $    1,725.00  $    1,125.00  $      600.00
1M requests (large scale)      $   17,250.00  $   11,250.00  $    6,000.00
10M requests (enterprise scale) $  172,500.00  $  112,500.00  $   60,000.00


### Key Insight

By reducing prompt size from 5,000 to 3,000 tokens:
- At 1K requests: **$6 savings** (pilot testing)
- At 100K requests: **$600 savings** (hundreds of dollars)
- At 1M requests: **$6,000 savings** (thousands of dollars)
- At 10M requests: **$60,000 savings** (tens of thousands of dollars)

**Optimization ROI scales linearly with request volume. At production scale (1M+ requests), token optimization delivers substantial savings.**
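Savings alone are not a return on investment until they are weighed against the effort spent optimizing. The sketch below estimates the request volume at which cumulative savings cover a one-time optimization cost. The $3.00 per million input tokens price is the example figure used in this notebook, while the 16 hours at $100/hour effort estimate and the function name `break_even_requests` are hypothetical values chosen only for illustration.

```python
import math

def break_even_requests(optimization_cost_usd, original_tokens, optimized_tokens,
                        input_price_per_mtok=3.0):
    """Requests needed before cumulative input-token savings cover a one-time cost."""
    tokens_saved = original_tokens - optimized_tokens
    if tokens_saved <= 0:
        raise ValueError("optimized_tokens must be smaller than original_tokens")
    # Savings per million requests equals tokens saved per request times price per million tokens
    savings_per_million_requests = tokens_saved * input_price_per_mtok
    return math.ceil(optimization_cost_usd * 1_000_000 / savings_per_million_requests)

# Hypothetical effort: 16 engineering hours at $100/hour
effort_cost = 16 * 100
volume = break_even_requests(effort_cost, original_tokens=5000, optimized_tokens=3000)
print(f"Break-even volume: {volume:,} requests")
```

With these assumed numbers the effort pays for itself after roughly 267K requests, which is consistent with the table above: optimization is hard to justify at pilot scale and easy to justify at production scale.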

---

## The Optimization Paradigm Shift

Now that you understand the economics (cost, latency, ROI), let's examine how to approach optimization systematically.

### Traditional Approach ❌

```
Write Prompt → Test Manually → Deploy to Production → Hope for the Best
```

**Problems**:
- Ad-hoc testing ("it works for me")
- No systematic evaluation
- No visibility into costs or performance
- Difficult to iterate and improve
- Regressions go unnoticed
- **No strategy for caching or token optimization**

### Modern Optimized Approach ✅

```
Design Prompt → Evaluate Systematically → Apply Caching Strategy →
Monitor Metrics → Iterate Based on Data
```

**Benefits**:
- **Data-driven decisions**: Use evaluation datasets, not anecdotes
- **Continuous improvement loop**: Measure, optimize, measure again
- **Cost and performance visibility**: Track metrics in production
- **Systematic evaluation and testing**: Catch regressions before deployment
- **Version control**: Track prompt changes like code
- **Strategic caching**: Cache static content for 75-90% additional savings

---

### What This Workshop Teaches You

This paradigm shift requires three foundational pillars that are covered in subsequent sections:

**1. Understanding Caching Mechanics**
- How prompt caching works (cache hits, misses, TTL)
- When caching provides ROI (break-even analysis)
- Cost structure: cache write vs. cache read
- **Goal**: Amplify savings by another 75-90% through strategic caching

**2. Optimization Techniques**
- Manual & automated optimization
- Decision framework: Choosing the right approach
- **Goal**: Improve prompt quality while reducing tokens

**3. Production Integration**
- Caching patterns for different use cases
- Observability and monitoring
- Evaluation frameworks for systematic testing
- CI/CD for prompt lifecycle management
- **Goal**: Build production-grade GenAI systems
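The break-even analysis mentioned under the first pillar can be previewed with a small cost model. This is a sketch under stated assumptions: the first request writes the cached prefix at a premium (`write_multiplier=1.25`), later requests read it at a discount (`read_multiplier=0.10`), and every request arrives before the cache expires. Both multipliers and both function names are illustrative assumptions, not quoted pricing, so check your provider's current rates.

```python
def cached_prefix_cost(num_requests, prefix_tokens, input_price_per_mtok=3.0,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Cost of a shared prompt prefix when the first request writes the cache
    and every later request reads it (assumes no expiry between requests)."""
    if num_requests < 1:
        return 0.0
    base = prefix_tokens * input_price_per_mtok / 1_000_000
    return base * (write_multiplier + (num_requests - 1) * read_multiplier)

def uncached_prefix_cost(num_requests, prefix_tokens, input_price_per_mtok=3.0):
    """Cost of sending the same prefix at the full input price on every request."""
    return num_requests * prefix_tokens * input_price_per_mtok / 1_000_000

for n in (1, 2, 5, 100):
    cached = cached_prefix_cost(n, prefix_tokens=3000)
    uncached = uncached_prefix_cost(n, prefix_tokens=3000)
    print(f"{n:>3} requests within TTL: cached ${cached:.4f} vs uncached ${uncached:.4f}")
```

Under these assumed multipliers a single request costs more with caching than without, and caching becomes cheaper from the second request onward. A low hit rate or a short TTL shifts that break-even point.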

### The Compounding Effect

When you combine:
- **Token optimization** (30-40% cost savings) 
- **Strategic caching** (75-90% additional savings on cached tokens)
- **Systematic evaluation** (prevent regressions, improve quality)
- **Production monitoring** (catch issues early, optimize continuously)

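**Result**: Input-cost reductions of 80-95% are achievable when most of the prompt is cacheable, with improved quality and reliability. Total savings are somewhat lower because output-token costs are unaffected.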
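A rough way to check how the two cost levers combine is to bill the cached part of each prompt at a discounted read rate. The sketch below assumes the notebook's example prices, a hypothetical 90% cacheable share of the prompt, and a cached-read price of 10% of the normal input price. It ignores cache-write premiums and cache misses, so it is an optimistic estimate and not a pricing formula for any specific provider.

```python
def combined_cost(input_tokens, output_tokens, num_requests,
                  cached_fraction=0.0, read_multiplier=0.10,
                  input_price_per_mtok=3.0, output_price_per_mtok=15.0):
    """Total cost when a fraction of each prompt is billed at a cached-read rate."""
    cached_tokens = input_tokens * cached_fraction
    fresh_tokens = input_tokens - cached_tokens
    # Cached tokens are billed at a discount; fresh tokens at the full input price
    billable_input = fresh_tokens + cached_tokens * read_multiplier
    input_cost = billable_input * num_requests / 1_000_000 * input_price_per_mtok
    output_cost = output_tokens * num_requests / 1_000_000 * output_price_per_mtok
    return {"input_cost": input_cost, "output_cost": output_cost,
            "total_cost": input_cost + output_cost}

baseline = combined_cost(5000, 150, 1_000_000)
trimmed = combined_cost(3000, 150, 1_000_000)
trimmed_cached = combined_cost(3000, 150, 1_000_000, cached_fraction=0.9)

for label, result in [("Baseline", baseline), ("Trimmed", trimmed),
                      ("Trimmed + cached", trimmed_cached)]:
    input_cut = 1 - result["input_cost"] / baseline["input_cost"]
    total_cut = 1 - result["total_cost"] / baseline["total_cost"]
    print(f"{label:<18} total ${result['total_cost']:>10,.2f}  "
          f"input cost cut {input_cut:.1%}  total cost cut {total_cut:.1%}")
```

With these assumptions the input cost falls by roughly 89% while the total bill falls by about 77%, because the output-token cost is untouched by either technique.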

---

## Summary

In this notebook, you learned the economic fundamentals of prompt optimization:

### Key Takeaways

1. ✅ **Cost Efficiency**: At 1M requests, reducing input tokens from 5,000 → 3,000 saves **$6,000 (34.8% cost reduction)**. In this scenario, input tokens account for 87% of total inference cost.

2. ✅ **Latency Impact**: Removing 2,000 input tokens saves about **1 second of processing time** at the assumed 0.5ms per input token, improving user experience and system throughput.

3. ✅ **Scale Amplification**: Optimization savings scale linearly with volume:
   - 1K requests: $6 savings
   - 100K requests: $600 savings
   - 1M requests: $6,000 savings
   - 10M requests: $60,000 savings

4. ✅ **Paradigm Shift**: Move from ad-hoc testing to systematic, data-driven optimization with:
   - Strategic caching (75-90% additional savings on cached tokens)
   - Systematic evaluation (prevent regressions)
   - Production monitoring (continuous improvement)

---