# M2.2: Prompt Optimization & Model Selection

## Purpose

Learn to reduce RAG LLM costs by 30-50% through intelligent prompt engineering, token optimization, and model routing **without sacrificing quality**. This module teaches you when and how to optimize prompts, and critically, **when NOT to optimize**.

## Concepts Covered

- **RAG-specific prompt templates** (5 production-tested variants)
- **Token estimation and cost projection** across models
- **Intelligent model routing** based on query complexity
- **Context formatting** and smart document truncation
- **A/B testing framework** for prompt comparison
- **Cost/quality trade-offs** and decision frameworks
- **Common failure modes** and debugging strategies
- **ROI analysis** and break-even calculations

## After Completing

You will be able to:
- Design and test prompt variants that reduce token usage by 30-50%
- Route queries to appropriate models based on complexity and cost constraints
- Measure and project costs at different scales (100 to 100K queries/day)
- Identify when prompt optimization is counterproductive
- Debug the 5 most common prompt optimization failures
- Make data-driven decisions using ROI and decision frameworks

## Context in Track

This is **Module 2.2** in the RAG Production Engineering track:
- M1.x: Built foundational RAG system with vector search and generation
- M2.1: Implemented caching strategies for cost reduction
- **M2.2: Optimize prompts and route models intelligently** ← YOU ARE HERE
- M2.3: Build production monitoring dashboards
- M2.4: Implement error handling and reliability patterns

**Prerequisites:** M2.1 (Caching), working RAG system, OpenAI API access (optional for testing)  
**Estimated time:** 60-90 minutes for implementation + practice

---

**Reality Check:** Prompt optimization trades verbosity for cost. Not suitable for all use cases.

## 1. Prerequisite Check & Reality Check

In [None]:
# Verify installations and imports
import sys
import os

print("Checking prerequisites...\n")

# Check Python version
print(f"Python version: {sys.version.split()[0]}")
assert sys.version_info >= (3, 9), "Python 3.9+ required"

# Check required packages
try:
    import openai
    print(f"✓ OpenAI: {openai.__version__}")
except ImportError:
    print("✗ OpenAI not installed. Run: pip install -r requirements.txt")

try:
    import tiktoken
    print("✓ Tiktoken: OK")
except ImportError:
    print("✗ Tiktoken not installed")

try:
    import pandas as pd
    import numpy as np
    print("✓ Pandas & NumPy: OK")
except ImportError:
    print("✗ Data analysis packages missing")

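# Check API key (python-dotenv is optional; without it, only the process environment is read)
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("python-dotenv not installed; reading OPENAI_API_KEY from the environment only")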

api_key = os.getenv("OPENAI_API_KEY", "")
if api_key and api_key.startswith("sk-"):
    print(f"✓ API Key: {api_key[:10]}...")
    HAS_API_KEY = True
else:
    print("⚠️  No API key found. Will run in DRY RUN mode (estimates only)")
    HAS_API_KEY = False

print("\n" + "="*60)
print("Prerequisites check complete!")
if not HAS_API_KEY:
    print("⚠️  Running without API key - all tests will use estimates")
print("="*60)

### Reality Check: What Prompt Optimization Actually Does

**✅ What it DOES well:**
- Reduces token usage 30-50% (measured across production systems)
- Cuts API costs proportionally (if spending $500/mo → $250-350/mo)
- Improves response latency 10-20% (fewer tokens = faster generation)

**❌ What it DOESN'T do:**
- Cannot fix poor retrieval quality (garbage in = garbage out)
- Won't improve response quality beyond baseline (trades verbosity for conciseness)
- Doesn't solve scaling bottlenecks (DB queries, network latency unaffected)

**⚠️ The Trade-offs:**
- You gain **cost savings** but risk **response quality degradation**
- Works for **high-volume simple queries** but not **complex reasoning tasks**
- Saves money in **production** but adds **development/monitoring overhead**

**Cost structure honesty:**
- Initial: 4-8 hours implementation
- Ongoing: 2-4 hours/month monitoring
- Hidden: Need A/B testing infrastructure

**When NOT to use:**
- Query volume <100/day (overhead exceeds savings)
- Quality is non-negotiable (medical, legal, financial)
- Query diversity >90% (caching ineffective)

In [None]:
# Cost comparison example
print("Token Cost Reality Check")
print("="*60)

# Scenario parameters
queries_per_day = 10_000
price_per_1m_tokens = 3.00  # Average blended rate

# Bad prompt
bad_tokens = 550  # 350 in + 200 out
bad_cost_per_query = (bad_tokens / 1_000_000) * price_per_1m_tokens
bad_daily = bad_cost_per_query * queries_per_day
bad_monthly = bad_daily * 30

# Optimized prompt
opt_tokens = 330  # 180 in + 150 out
opt_cost_per_query = (opt_tokens / 1_000_000) * price_per_1m_tokens
opt_daily = opt_cost_per_query * queries_per_day
opt_monthly = opt_daily * 30

# Calculate savings
savings_monthly = bad_monthly - opt_monthly
savings_pct = (savings_monthly / bad_monthly) * 100

print(f"\nBad Prompt ({bad_tokens} tokens):")
print(f"  ${bad_cost_per_query:.6f} per query")
print(f"  ${bad_daily:.2f}/day")
print(f"  ${bad_monthly:.2f}/month")

print(f"\nOptimized Prompt ({opt_tokens} tokens):")
print(f"  ${opt_cost_per_query:.6f} per query")
print(f"  ${opt_daily:.2f}/day")
print(f"  ${opt_monthly:.2f}/month")

print(f"\n💰 Savings: ${savings_monthly:.2f}/month ({savings_pct:.1f}% reduction)")
print("\nJust from optimizing prompts!")

# Expected: ~$198/month savings, 40% reduction

## 2. RAG Prompt Library

Explore different prompt templates optimized for various use cases. Each template has different token counts and trade-offs.

In [None]:
# Import our prompt optimization module
import sys
import os
# Add project root to path
sys.path.insert(0, os.path.abspath('..'))

from src.m2_2_prompt_optimization import RAGPromptLibrary, TokenEstimator
import json

# Initialize token estimator
estimator = TokenEstimator()

print("RAG Prompt Template Library")
print("="*60)
print("\nAvailable Templates:\n")

# Get all templates
templates = [
    ("BASIC_RAG", RAGPromptLibrary.BASIC_RAG),
    ("CONCISE_RAG", RAGPromptLibrary.CONCISE_RAG),
    ("STRUCTURED_RAG", RAGPromptLibrary.STRUCTURED_RAG),
    ("JSON_RAG", RAGPromptLibrary.JSON_RAG),
    ("SUPPORT_RAG", RAGPromptLibrary.SUPPORT_RAG),
]

for name, template in templates:
    # Calculate actual token count for system prompt
    sys_tokens = estimator.count_tokens(template.system_prompt)
    
    print(f"{name}:")
    print(f"  Use case: {template.use_case}")
    print(f"  Estimated tokens: {template.tokens_estimate}")
    print(f"  System prompt tokens: {sys_tokens}")
    print(f"  System prompt: {template.system_prompt[:80]}...")
    print()

# Expected: Shows 5 templates with token estimates

In [None]:
# Compare token savings across templates
print("Token Savings Comparison")
print("="*60)

baseline = RAGPromptLibrary.BASIC_RAG.tokens_estimate

for name, template in templates:
    tokens = template.tokens_estimate
    savings = baseline - tokens
    savings_pct = (savings / baseline) * 100 if baseline > 0 else 0
    
    print(f"{name:20s} {tokens:4d} tokens  ", end="")
    if savings > 0:
        print(f"↓ {savings:3d} ({savings_pct:5.1f}% savings)")
    else:
        print("(baseline)")

print("\n💡 Key insight: Optimization can reduce tokens by up to 60%")
print("⚠️  But may reduce response quality - always A/B test!")

# Expected: Table showing token reduction from baseline

## 3. Context Formatting for Fewer Tokens

Learn how to format retrieved documents efficiently to minimize token usage while preserving critical information.

In [None]:
# Load example documents
data_path = "../data/example/example_data.json"
with open(data_path, "r") as f:
    data = json.load(f)

documents = data["documents"]

print("Document Context Formatting")
print("="*60)
print(f"\nOriginal documents: {len(documents)}")

# Calculate original token count
original_context = "\n\n".join([f"[{i+1}] {doc['content']}" for i, doc in enumerate(documents)])
original_tokens = estimator.count_tokens(original_context)

print(f"Original context tokens: {original_tokens}")
print(f"\nOriginal context preview:")
print(original_context[:200] + "...\n")

# Expected: Shows raw documents and token count

In [None]:
# Test format_context_optimally with different token limits
from src.m2_2_prompt_optimization import format_context_optimally

print("Optimized Context Formatting")
print("="*60)

# Test different max_tokens settings
test_limits = [500, 300, 150]

for max_tokens in test_limits:
    formatted = format_context_optimally(
        documents,
        max_tokens=max_tokens,
        include_metadata=False,
        estimator=estimator
    )
    
    actual_tokens = estimator.count_tokens(formatted)
    savings = original_tokens - actual_tokens
    savings_pct = (savings / original_tokens * 100) if original_tokens > 0 else 0
    
    print(f"\nMax tokens: {max_tokens}")
    print(f"  Actual tokens: {actual_tokens}")
    print(f"  Savings: {savings} tokens ({savings_pct:.1f}%)")
    print(f"  Preview: {formatted[:100]}...")

print("\n💡 Smart truncation preserves most relevant docs first")

# Expected: Shows different truncation levels and token savings

## 4. Model Routing

Intelligently route queries to appropriate models based on complexity. Simple queries use fast/cheap models, complex queries use premium models.

In [None]:
# Test model router with different query types
from src.m2_2_prompt_optimization import ModelRouter

router = ModelRouter()

print("Intelligent Model Routing")
print("="*60)

# Test queries with varying complexity
test_queries = data["test_queries"]

for query_data in test_queries:
    query = query_data["question"]
    expected_complexity = query_data.get("complexity", "unknown")
    
    # Analyze and route
    decision = router.select_model(query, context=original_context[:500])
    
    print(f"\nQuery: {query[:60]}...")
    print(f"  Expected complexity: {expected_complexity}")
    print(f"  Complexity score: {decision['complexity_score']}")
    print(f"  Selected model: {decision['model']}")
    print(f"  Tier: {decision['tier']}")
    print(f"  Reason: {decision['reason']}")
    if decision.get('complexity_factors'):
        print(f"  Factors: {list(decision['complexity_factors'].keys())}")

# Expected: Shows routing decisions for simple vs complex queries

In [None]:
# Cost implications of model routing
print("\nModel Routing Cost Analysis")
print("="*60)

# Simulate routing distribution
simple_queries_pct = 70  # 70% simple queries
complex_queries_pct = 30  # 30% complex queries

total_queries = 10_000

# Cost per query by model (rough estimates)
cost_fast = 0.0003  # gpt-3.5-turbo
cost_premium = 0.0020  # gpt-4o

# Scenario 1: All queries to premium model
all_premium_cost = total_queries * cost_premium * 30  # monthly

# Scenario 2: Smart routing
simple_cost = (total_queries * simple_queries_pct / 100) * cost_fast * 30
complex_cost = (total_queries * complex_queries_pct / 100) * cost_premium * 30
smart_routing_cost = simple_cost + complex_cost

savings = all_premium_cost - smart_routing_cost
savings_pct = (savings / all_premium_cost * 100)

print(f"\nScenario 1: All queries → Premium model")
print(f"  Monthly cost: ${all_premium_cost:.2f}")

print(f"\nScenario 2: Smart routing ({simple_queries_pct}% simple, {complex_queries_pct}% complex)")
print(f"  Simple queries: ${simple_cost:.2f}/month")
print(f"  Complex queries: ${complex_cost:.2f}/month")
print(f"  Total: ${smart_routing_cost:.2f}/month")

print(f"\n💰 Savings: ${savings:.2f}/month ({savings_pct:.1f}% reduction)")
print("\n💡 Routing matches complexity to model tier")

# Expected: Shows significant cost savings from routing

## 5. Prompt Testing Framework (A/B)

Run A/B tests to compare prompt variants scientifically. Measure tokens, cost, and latency for each template.

In [None]:
# Set up prompt tester (will auto-detect if API key available)
from src.m2_2_prompt_optimization import PromptTester

# Initialize client if API key available
openai_client = None
if HAS_API_KEY:
    try:
        from openai import OpenAI
        openai_client = OpenAI()
        print("✓ OpenAI client initialized - will run LIVE tests")
    except Exception as e:
        print(f"⚠️ Could not initialize client: {e}")
        print("⚠️ Falling back to DRY RUN mode (estimates only)")
else:
    print("⚠️ Skipping (no keys) - running in DRY RUN mode (estimates only)")

# Dry run whenever no client exists, including a failed init with a key present
DRY_RUN = openai_client is None

# Create tester
tester = PromptTester(
    openai_client=openai_client,
    model="gpt-3.5-turbo",
    dry_run=DRY_RUN
)

print("\n" + "="*60)
print("PromptTester initialized")
print(f"Mode: {'DRY RUN (estimates)' if DRY_RUN else 'LIVE API calls'}")
print("="*60)

In [None]:
# Compare 3 prompt templates
templates_to_test = [
    RAGPromptLibrary.BASIC_RAG,
    RAGPromptLibrary.CONCISE_RAG,
    RAGPromptLibrary.STRUCTURED_RAG,
]

# Use first 3 test queries
test_cases = data["test_queries"][:3]

print("\nRunning A/B comparison...")
print(f"Testing {len(templates_to_test)} templates on {len(test_cases)} queries")
print()

# Run comparison
results = tester.compare_templates(
    templates_to_test,
    test_cases,
    data["documents"]
)

# Expected: Table comparing tokens, cost, and latency
# In dry run: estimates only
# With API key: actual measurements

## 6. Cost & Latency Projections

Project monthly costs at different scales and see the impact of optimization decisions.

In [None]:
# Use results from A/B testing to project costs
import pandas as pd

print("Monthly Cost Projections")
print("="*60)

# Different scale scenarios
scales = [
    ("Startup", 100),
    ("Growth", 1_000),
    ("Production", 10_000),
    ("Enterprise", 100_000),
]

# Use the results from our comparison
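if results:
    # Rank by cost explicitly instead of relying on the order compare_templates returns
    ranked = sorted(results, key=lambda r: r.avg_cost_per_query)
    best_template = ranked[0]  # Cheapest
    baseline_template = ranked[-1]  # Most expensive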
    
    print(f"\nComparing:")
    print(f"  Baseline: {baseline_template.template_name}")
    print(f"    ${baseline_template.avg_cost_per_query:.6f}/query")
    print(f"  Optimized: {best_template.template_name}")
    print(f"    ${best_template.avg_cost_per_query:.6f}/query")
    print()
    
    projection_data = []
    
    for scale_name, queries_per_day in scales:
        baseline_monthly = baseline_template.avg_cost_per_query * queries_per_day * 30
        optimized_monthly = best_template.avg_cost_per_query * queries_per_day * 30
        savings = baseline_monthly - optimized_monthly
        savings_pct = (savings / baseline_monthly * 100) if baseline_monthly > 0 else 0
        
        projection_data.append({
            "Scale": scale_name,
            "Queries/Day": f"{queries_per_day:,}",
            "Baseline": f"${baseline_monthly:.2f}",
            "Optimized": f"${optimized_monthly:.2f}",
            "Savings": f"${savings:.2f}",
            "Savings %": f"{savings_pct:.1f}%"
        })
    
    df = pd.DataFrame(projection_data)
    print(df.to_string(index=False))
    
    print("\n💡 Savings scale linearly with query volume")
    print("⚠️  But implementation overhead is fixed ~8 hours")

# Expected: Table showing costs at different scales

In [None]:
# ROI calculation
print("\nROI Analysis")
print("="*60)

implementation_hours = 8
hourly_rate = 100  # Developer hourly rate
implementation_cost = implementation_hours * hourly_rate

print(f"\nImplementation cost: ${implementation_cost} ({implementation_hours} hours @ ${hourly_rate}/hr)")
print("\nBreak-even analysis:")

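if results:
    # Rank by cost explicitly instead of relying on the order compare_templates returns
    ranked = sorted(results, key=lambda r: r.avg_cost_per_query)
    best = ranked[0]
    baseline = ranked[-1]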
    savings_per_query = baseline.avg_cost_per_query - best.avg_cost_per_query
    
    for scale_name, queries_per_day in scales:
        daily_savings = savings_per_query * queries_per_day
        monthly_savings = daily_savings * 30
        
        if monthly_savings > 0:
            months_to_breakeven = implementation_cost / monthly_savings
            print(f"\n{scale_name} ({queries_per_day:,} q/day):")
            print(f"  Monthly savings: ${monthly_savings:.2f}")
            print(f"  Break-even: {months_to_breakeven:.1f} months")
            
            if months_to_breakeven < 1:
                print(f"  ✓ ROI: EXCELLENT - pays for itself in <1 month")
            elif months_to_breakeven < 3:
                print(f"  ✓ ROI: GOOD - pays for itself in {months_to_breakeven:.0f} months")
            elif months_to_breakeven < 12:
                print(f"  ⚠️  ROI: MARGINAL - takes {months_to_breakeven:.0f} months")
            else:
                print(f"  ❌ ROI: POOR - takes {months_to_breakeven:.0f} months")

print("\n💡 Optimization is worth it at 1K+ queries/day")

# Expected: ROI analysis showing when optimization makes sense

## 7. Common Failures & When NOT to Optimize

Learn what breaks with prompt optimization and when to avoid it entirely.

### Common Failure Modes

**Failure #1: Token Limit Exceeded Despite Optimization**
- **Cause:** Forgot to account for prompt overhead + safety margin
- **Fix:** Reserve tokens: `actual_limit = model_context - prompt_overhead - safety_margin`
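
The reservation formula can be sketched as a small helper. This is a minimal illustration, not part of the module: the function name `context_token_limit`, the 200-token default margin, and the example numbers are all assumptions. Here `prompt_overhead` is taken to include the system prompt, the user question, and the tokens reserved for the answer.

```python
def context_token_limit(model_context: int, prompt_overhead: int,
                        safety_margin: int = 200) -> int:
    """Tokens available for retrieved documents after reservations.

    Implements: actual_limit = model_context - prompt_overhead - safety_margin
    """
    limit = model_context - prompt_overhead - safety_margin
    if limit <= 0:
        # Failing loudly here beats a truncated or rejected API request later
        raise ValueError(
            f"No room for context: window={model_context}, "
            f"overhead={prompt_overhead}, margin={safety_margin}"
        )
    return limit


# Illustrative numbers only: a 16,385-token window with
# system prompt (120) + question (40) + reserved answer (500) as overhead
overhead = 120 + 40 + 500
print(context_token_limit(16_385, overhead))  # 15525
```

The resulting limit is what you would pass as `max_tokens` when formatting context, rather than the model's full window.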

**Failure #2: Model Router Selects Wrong Tier**
- **Cause:** Complexity scoring over-weights query length
- **Fix:** Combine length with reasoning keyword detection; manual override for known patterns
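
One way to combine the two signals is sketched below. This is not the `ModelRouter` implementation: the keyword list, the 0.3/0.7 weights, the 0.35 threshold, and the override pattern are all hypothetical values chosen for illustration. Length is capped at a low weight so that a long but simple query cannot reach the premium tier on length alone.

```python
import re

# Hypothetical reasoning cues; tune against your own query logs
REASONING_KEYWORDS = ("why", "compare", "explain", "analyze", "trade-off", "versus")

# Hypothetical manual overrides for known patterns: regex -> tier
OVERRIDES = {r"^\s*(reset|change) (my )?password": "fast"}


def score_complexity(query: str) -> float:
    """Blend a capped length signal with reasoning-keyword hits (0.0 to 1.0)."""
    q = query.lower()
    length_score = min(len(q.split()) / 50, 1.0)   # saturates at 50 words
    hits = sum(1 for kw in REASONING_KEYWORDS if kw in q)  # naive substring match
    keyword_score = min(hits / 2, 1.0)             # two hits max out the signal
    return 0.3 * length_score + 0.7 * keyword_score


def select_tier(query: str, threshold: float = 0.35) -> str:
    """Return 'fast' or 'premium', checking manual overrides first."""
    for pattern, tier in OVERRIDES.items():
        if re.search(pattern, query, flags=re.IGNORECASE):
            return tier
    return "premium" if score_complexity(query) >= threshold else "fast"


print(select_tier("What is the refund policy?"))                  # fast
print(select_tier("Compare plan A and plan B and explain why one is cheaper"))  # premium
print(select_tier("Reset my password and explain why it expired"))  # fast (override)
```

Because the length weight (0.3) sits below the threshold (0.35), length can only tip a query into the premium tier when at least one reasoning cue is also present.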

**Failure #3: Aggressive Truncation Loses Critical Context**
- **Cause:** Cutting mid-sentence, removing exceptions/caveats
- **Fix:** Truncate at sentence boundaries, add `[truncated]` indicators
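
A minimal sketch of sentence-boundary truncation follows. It is an illustration rather than the logic inside `format_context_optimally`: the function name is hypothetical, the sentence splitter is a simple regex that will mis-split abbreviations such as "e.g.", and whitespace word counts stand in for real token counts by default. The `[truncated]` marker itself is not charged against the budget.

```python
import re


def truncate_at_sentence(text: str, max_tokens: int,
                         count_tokens=lambda s: len(s.split())) -> str:
    """Keep whole sentences until the budget is spent; mark any cut."""
    cleaned = text.strip()
    # Split after ., ! or ? followed by whitespace (naive, but never cuts mid-sentence)
    sentences = re.split(r"(?<=[.!?])\s+", cleaned)
    kept, used = [], 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        if used + cost > max_tokens:
            break
        kept.append(sentence)
        used += cost
    if len(kept) == len(sentences):
        return cleaned  # nothing was dropped, so no marker
    return (" ".join(kept) + " [truncated]").strip()


doc = ("Refunds take 5 days. Exceptions apply to sale items. "
       "Contact support for details.")
print(truncate_at_sentence(doc, max_tokens=9))
# Refunds take 5 days. Exceptions apply to sale items. [truncated]
```

In the notebook, passing `count_tokens=estimator.count_tokens` would replace the word-count approximation with the tokenizer-based count used elsewhere in this module.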

**Failure #4: Cache Invalidation Causing Cost Spikes**
- **Cause:** Cache keys include prompt hash; template updates invalidate all caches
- **Fix:** Use semantic versioning (v1, v2) not exact hashes; implement fallback to previous versions
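
The versioned-key idea might look like the sketch below. The key layout, the `cache_key` and `lookup` helpers, and the plain `dict` standing in for the cache are all assumptions for illustration; the caching layer from M2.1 may be structured differently. The point is that the key depends on an explicit version label, so cosmetic template edits that keep the label leave existing entries valid.

```python
import hashlib


def cache_key(template_name: str, template_version: str, query: str) -> str:
    """Build a key from template name + explicit version + normalized query."""
    normalized = " ".join(query.lower().split())  # collapse case and whitespace
    query_hash = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"{template_name}:{template_version}:{query_hash}"


def lookup(cache: dict, template_name: str, versions: list, query: str):
    """Try versions in the given order (newest first), falling back to older ones."""
    for version in versions:
        key = cache_key(template_name, version, query)
        if key in cache:
            return cache[key], version
    return None, None


# A v1 entry still serves traffic right after the template is bumped to v2
cache = {cache_key("CONCISE_RAG", "v1", "What is the refund policy?"): "30 days"}
print(lookup(cache, "CONCISE_RAG", ["v2", "v1"], "what is  the refund policy?"))
# ('30 days', 'v1')
```

Falling back to an older version trades some answer freshness for avoiding a cold cache, so it suits template changes that do not alter the expected answer format.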

**Failure #5: JSON Output Format Breaking**
- **Cause:** Model ignores "return JSON only" instruction
- **Fix:** Use `response_format={"type": "json_object"}`, lower temperature to 0.0, validate + retry
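
The validate-and-retry step can be sketched independently of the API client. In this illustration `get_json_answer` and `call_model` are hypothetical names: `call_model` is any function that takes a prompt and returns the raw model text. In live use it would presumably wrap a chat completion request made with `response_format={"type": "json_object"}` and `temperature=0.0`; a fake is used here so the sketch runs without an API key.

```python
import json
from typing import Callable


def get_json_answer(call_model: Callable[[str], str], prompt: str,
                    max_attempts: int = 3) -> dict:
    """Call the model until it returns a parseable JSON object, or give up."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue  # malformed output: retry
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    raise ValueError(f"No valid JSON object after {max_attempts} attempts: {last_error}")


# Fake model: first reply ignores the JSON instruction, second one complies
replies = iter(["Sure! Here is the answer: 30 days", '{"answer": "30 days"}'])
print(get_json_answer(lambda prompt: next(replies), "Return JSON only."))
# {'answer': '30 days'}
```

Each retry is a billed request, so keep `max_attempts` small and log failures; a high retry rate is itself a signal that the template needs work.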

### When NOT to Use Prompt Optimization

**❌ Don't optimize when:**

1. **Response Quality is Non-Negotiable**
   - Medical advice, legal analysis, financial recommendations
   - Use instead: Best model (GPT-4) with full prompts + human review
   - Example: Medical diagnosis - patient safety >>> cost savings

2. **Query Volume Too Low (<100 queries/day)**
   - Implementation overhead (4-8 hours, plus 2-4 hours/month of monitoring) exceeds savings
   - At 100 q/day and $0.002/query, the entire bill is only $6/month, so even a 50% cut saves about $3/month
   - Use instead: Keep prompts simple and clear

3. **Query Diversity Extremely High (>90% unique)**
   - Caching ineffective, uniform optimization difficult
   - Use instead: Focus on infrastructure or consider fine-tuning
   - Example: Research assistant with novel academic queries

**🚩 Warning Signs:**
- Users report "answers feel rushed or incomplete" → too aggressive
- Cache hit rate <10% → query diversity too high
- Costs still >$500/month after optimization → consider fine-tuning
- Quality metrics declining → token cuts removing necessary context
- More time tuning than saving → volume too low

In [None]:
# Decision framework
print("DECISION CARD: Should You Optimize Prompts?")
print("="*60)

# User should fill these in for their use case
your_queries_per_day = 1000  # Change this
your_monthly_cost = 300      # Change this
quality_critical = False     # Change this

print(f"\nYour situation:")
print(f"  Queries per day: {your_queries_per_day:,}")
print(f"  Monthly LLM cost: ${your_monthly_cost:.2f}")
print(f"  Quality critical: {'Yes' if quality_critical else 'No'}")
print()

# Decision logic
recommendation = None

if quality_critical:
    recommendation = "❌ DON'T OPTIMIZE - Quality is non-negotiable"
elif your_queries_per_day < 100:
    recommendation = "❌ DON'T OPTIMIZE - Volume too low, overhead exceeds savings"
elif your_monthly_cost < 50:
    recommendation = "❌ DON'T OPTIMIZE - Cost too low to justify effort"
elif your_queries_per_day >= 10000:
    recommendation = "✓✓ STRONGLY RECOMMEND - High volume, significant savings potential"
elif your_queries_per_day >= 1000:
    recommendation = "✓ RECOMMEND - Good volume, ROI positive"
else:
    recommendation = "⚠️  MARGINAL - Consider if growth expected"

print(f"Recommendation: {recommendation}")
print()

# Projected savings
if your_monthly_cost > 0 and not quality_critical:
    estimated_savings = your_monthly_cost * 0.35  # Conservative 35%
    print(f"Estimated monthly savings: ${estimated_savings:.2f} (35% reduction)")
    
    implementation_cost = 800  # 8 hours @ $100/hr
    months_to_breakeven = implementation_cost / estimated_savings if estimated_savings > 0 else 999
    
    print(f"Break-even: {months_to_breakeven:.1f} months")

print("\n" + "="*60)
print("💡 Use this decision card for your specific use case")

# Expected: Personalized recommendation based on inputs

## Summary & Next Steps

**What we learned:**
1. ✓ RAG-specific prompt templates can reduce tokens by 30-60%
2. ✓ Model routing matches complexity to appropriate tier
3. ✓ Token optimization requires sentence-boundary truncation
4. ✓ A/B testing framework measures real impact
5. ✓ ROI is positive at 1K+ queries/day
6. ✓ Common failures are predictable and fixable
7. ✓ Optimization is NOT for everyone (quality-critical, low volume, high diversity)

**Key Takeaway:**
Prompt optimization is a **tool**, not a mandate. Let economics guide your engineering decisions.

**Action Items:**
1. [ ] Calculate your current token usage and costs
2. [ ] Test 2-3 prompt templates with your real data
3. [ ] Measure quality impact (not just cost)
4. [ ] Use decision card to determine if optimization is worth it
5. [ ] If proceeding: Implement monitoring before optimizing

**When in doubt:** Start conservative, measure everything, optimize incrementally.