# Bridge M2.1 → M2.2: Cache-Miss Cost Focus

**Duration:** 5-10 minutes

---

## Run Locally (Windows)

```powershell
$env:PYTHONPATH = "$PWD"; jupyter notebook
```

Linux/Mac:
```bash
PYTHONPATH="$PWD" jupyter notebook
```

---

## Purpose

You've completed M2.1 and built a multi-layer Redis cache reducing RAG costs by 30-70%. But 60% of queries still miss the cache, hitting the full pipeline at $0.0021 per query.

**The shift:** From optimizing cache hits (M2.1) to optimizing cache misses (M2.2). This bridge validates your M2.1 completion and prepares you to add prompt optimization, reducing cache-miss costs by another 30-50%.

**Why it matters:** A combined 50-70% cost reduction is what makes a RAG system production-viable, and without baseline metrics from M2.1 you can't prove the ROI of M2.2.

## Concepts Covered

**Delta from M2.1 to M2.2:**
- Cache hit rate thresholds (when caching overhead exceeds savings)
- Cost baseline establishment for compound optimization measurement
- Query diversity as a strategy selector (compression vs. semantic tuning)
- Production failure documentation as interview evidence

## After Completing This Notebook

You will have verified:
- ✓ Cache hit rate ≥30% (or documented mitigation strategy)
- ✓ Cost savings baseline recorded for ROI tracking
- ✓ All 5 production failures documented with fixes
- ✓ Query diversity metrics guide M2.2 optimization focus
- ✓ Readiness to add prompt optimization without breaking existing cache

## Context in Track

**Bridge:** M2.1 Caching Strategies → M2.2 Prompt Optimization & Model Selection

**Module:** M2 (Cost Optimization)

**Learning path:** Caching (passive cost reduction) → Prompt engineering (active cost reduction per query)

---

## Recap: M2.1 Accomplishments

### What You Built in M2.1

✓ **Multi-Layer Redis Cache System**  
Four layers (response, semantic, embedding, context) that reduce RAG costs by 30-70%.

✓ **Cache Invalidation Strategy**  
TTL, event-based, and LRU policies balancing freshness vs. cost savings.

✓ **Debugged 5 Production Failures**
1. Cache stampede on cold start
2. Stale data after updates
3. Redis memory overflow (OOM)
4. Hash collisions
5. Connection timeouts

✓ **Query Diversity Analysis**  
Learned when caching works (similarity >30%) and when to skip it (>90% diversity).

---

### The Cache Miss Problem

**Your cache hit rate:** 40% (typical)  
**Cache miss rate:** 60% of queries still hit full pipeline

**Cost calculation:**
- 10,000 queries/day
- 6,000 cache misses/day
- Cost per miss: $0.0021
- **Daily cost:** 6,000 × $0.0021 = $12.60, or about **$378/month** (30 days)

**M2.2 Goal:** Reduce cache miss costs by 30-50% through prompt optimization.

**Combined savings:** 40% (caching) plus a 30-50% cut on the remaining 60% of spend (another 18-30 points) = roughly **58-70% total cost reduction**, inside the 50-70% target.

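The combined figure compounds rather than adds, because prompt optimization only touches the spend that caching leaves behind. Below is a minimal sketch of that arithmetic using the numbers from this section. It assumes a cache hit costs nothing and every miss costs the same; the name `combined_savings` is illustrative. Under these assumptions the total lands between 58% and 70%, inside the 50-70% range quoted above.

```python
def combined_savings(hit_rate: float, miss_cost_reduction: float) -> float:
    """Total fractional savings vs. a no-cache, unoptimized baseline.

    Assumes cache hits are free and every miss costs the same amount.
    """
    remaining_after_cache = 1.0 - hit_rate  # share of spend still paid after caching
    remaining_after_prompts = remaining_after_cache * (1.0 - miss_cost_reduction)
    return 1.0 - remaining_after_prompts


QUERIES_PER_DAY = 10_000
COST_PER_MISS = 0.0021  # dollars, from the cost calculation above
HIT_RATE = 0.40

baseline_daily = QUERIES_PER_DAY * COST_PER_MISS  # no cache at all
cached_daily = baseline_daily * (1 - HIT_RATE)    # where M2.1 leaves you

for reduction in (0.30, 0.50):
    total = combined_savings(HIT_RATE, reduction)
    daily = baseline_daily * (1 - total)
    print(f"{reduction:.0%} miss-cost cut -> {total:.0%} total savings, ${daily:.2f}/day")
```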
---

## Check #1: Cache Hit Rate

**Requirement:** Redis cache operational with ≥30% hit rate

**Why:** Below roughly 30%, the overhead of running the cache (lookups, memory, invalidation) can outweigh what the hits save, so prompt optimization has to carry most of the cost reduction.

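One way to see where such a threshold comes from is a simple break-even model. The sketch below assumes each query pays a fixed cache overhead (expressed in dollars per query) and that a hit otherwise costs nothing. `ASSUMED_OVERHEAD` is a hypothetical value for illustration, not a figure from M2.1; you would need to measure your own.

```python
def break_even_hit_rate(overhead_per_query: float, cost_per_miss: float) -> float:
    """Hit rate above which caching is cheaper than running without a cache.

    With cache:    overhead + (1 - h) * cost_per_miss
    Without cache: cost_per_miss
    Caching wins when h > overhead / cost_per_miss.
    """
    if cost_per_miss <= 0:
        raise ValueError("cost_per_miss must be positive")
    return overhead_per_query / cost_per_miss


def cost_with_cache(hit_rate: float, overhead_per_query: float,
                    cost_per_miss: float) -> float:
    """Expected cost per query once the cache sits in front of the pipeline."""
    return overhead_per_query + (1.0 - hit_rate) * cost_per_miss


COST_PER_MISS = 0.0021     # from the cost calculation above
ASSUMED_OVERHEAD = 0.0005  # hypothetical, for illustration only

threshold = break_even_hit_rate(ASSUMED_OVERHEAD, COST_PER_MISS)
print(f"Break-even hit rate: {threshold:.1%}")
for h in (0.10, 0.40):
    cost = cost_with_cache(h, ASSUMED_OVERHEAD, COST_PER_MISS)
    verdict = "cheaper" if cost < COST_PER_MISS else "more expensive"
    print(f"hit rate {h:.0%}: ${cost:.5f}/query, {verdict} than no cache")
```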
---

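**What this cell does:** Attempts to read Redis stats via `redis-cli`. If Redis is unavailable, it falls back to a stub file, creating one with a default 40% hit rate if none exists. A manual-input helper is defined but not called, so the cell never blocks in automated runs. Set `SKIP_REDIS=true` to skip the live check and run fully offline.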

In [None]:
# Expected: cache_hit_rate >= 0.30
# Offline-friendly: Falls back to manual input or stub file

import json
import os

SKIP_REDIS = os.environ.get('SKIP_REDIS', 'false').lower() == 'true'

# Option 1: Try parsing Redis stats (skip guard for offline mode)
def get_redis_hit_rate():
    if SKIP_REDIS:
        print("⊘ Skipping Redis check (SKIP_REDIS=true)")
        return None
    try:
        import subprocess
        result = subprocess.run(['redis-cli', 'INFO', 'stats'], 
                              capture_output=True, text=True, timeout=2)
        if result.returncode == 0:
            lines = result.stdout.split('\n')
            hits = misses = 0
            for line in lines:
                if 'keyspace_hits:' in line:
                    hits = int(line.split(':')[1])
                if 'keyspace_misses:' in line:
                    misses = int(line.split(':')[1])
            if hits + misses > 0:
                return hits / (hits + misses)
    except Exception as e:
        print(f"⊘ Redis not available: {e}")
    return None

# Option 2: Manual input fallback
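def manual_input_hit_rate():
    print("Enter your cache hit rate (0.0-1.0) or press Enter to use stub:")
    try:
        user_input = input().strip()
        if user_input:
            value = float(user_input)
            if 0.0 <= value <= 1.0:
                return value
            print("⊘ Hit rate must be between 0.0 and 1.0; using stub")
    except (ValueError, EOFError):
        # Non-numeric input or no stdin available: fall through to the stub
        pass
    return None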

# Option 3: Stub from file (offline-safe)
def stub_hit_rate():
    stub_file = 'cache_metrics_stub.json'
    if os.path.exists(stub_file):
        with open(stub_file) as f:
            data = json.load(f)
            return data.get('cache_hit_rate', 0.40)
    # Create stub if missing
    stub_data = {'cache_hit_rate': 0.40, 'note': 'Replace with actual metrics'}
    with open(stub_file, 'w') as f:
        json.dump(stub_data, f, indent=2)
    print(f"✓ Created {stub_file} with default hit rate 0.40")
    return 0.40

# Try methods in order
hit_rate = get_redis_hit_rate()
if hit_rate is None:
    hit_rate = stub_hit_rate()  # Skip manual input in automated environments

# Validate
print(f"\n✓ Cache Hit Rate: {hit_rate:.1%}")
if hit_rate >= 0.30:
    print("✓ PASS: Hit rate meets minimum threshold (>=30%)")
else:
    print(f"⚠ WARNING: Hit rate {hit_rate:.1%} < 30%")
    print("  → Return to M2.1 Augmented [14:30] for semantic cache tuning")
    print("  → Prompt optimization will carry more load in M2.2")

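The parsing step can also be pulled out into a pure function so it is testable without a live Redis. This is a sketch, not part of the M2.1 code: `parse_hit_rate` is an illustrative name, and the sample text only mimics the `key:value` lines that `redis-cli INFO stats` prints.

```python
from typing import Optional


def parse_hit_rate(info_text: str) -> Optional[float]:
    """Return hits / (hits + misses) from `INFO stats` output, or None if unknown."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and section headers such as "# Stats"
        key, sep, value = line.partition(':')
        if sep:
            stats[key] = value
    try:
        hits = int(stats['keyspace_hits'])
        misses = int(stats['keyspace_misses'])
    except (KeyError, ValueError):
        return None  # fields missing or not numeric
    total = hits + misses
    return hits / total if total > 0 else None


sample = "# Stats\r\nkeyspace_hits:400\r\nkeyspace_misses:600\r\n"
print(parse_hit_rate(sample))
```

Matching on the exact key is slightly stricter than the substring test used in the cell, and returning `None` for an empty keyspace keeps the stub fallback path unchanged.
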
## Check #2: Cost Savings Baseline

**Requirement:** Analytics showing `cost_savings_vs_baseline` with a 30-70% reduction

**Why:** Baseline metrics let you measure compound savings when adding prompt optimization; no baseline = can't prove ROI.

---

**What this cell does:** Loads or creates a cost analytics JSON file with baseline and current costs. Validates the `cost_savings_vs_baseline` field exists for ROI tracking in M2.2.

In [None]:
# Expected: cost_savings_vs_baseline field exists showing 30-70% reduction

import json
import os

stub_file = 'cost_analytics_stub.json'

# Check if analytics file exists
if os.path.exists(stub_file):
    with open(stub_file) as f:
        analytics = json.load(f)
    print(f"✓ Found {stub_file}")
else:
    # Create stub template (offline-safe)
    analytics = {
        "baseline_cost_per_day": 21.00,
        "current_cost_per_day": 12.60,
        "cost_savings_vs_baseline": 0.40,
        "note": "Replace with actual analytics data"
    }
    with open(stub_file, 'w') as f:
        json.dump(analytics, f, indent=2)
    print(f"✓ Created {stub_file} with template data")

# Validate required field
if 'cost_savings_vs_baseline' in analytics:
    savings = analytics['cost_savings_vs_baseline']
    baseline = analytics.get('baseline_cost_per_day', 0)
    current = analytics.get('current_cost_per_day', 0)
    
    print(f"\n✓ Baseline cost/day: ${baseline:.2f}")
    print(f"✓ Current cost/day: ${current:.2f}")
    print(f"✓ Cost savings: {savings:.1%}")
    
    if 0.30 <= savings <= 0.70:
        print("✓ PASS: Savings within expected range (30-70%)")
    else:
        print(f"⚠ NOTE: Savings {savings:.1%} outside typical 30-70% range")
else:
    print("✗ FAIL: 'cost_savings_vs_baseline' field missing")
    print(f"  → Add this field to {stub_file}")

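The recorded `cost_savings_vs_baseline` is only useful if it agrees with the two cost fields, so it can help to recompute it. Here is a minimal sketch, assuming savings is defined as `1 - current / baseline`. That definition is inferred from the stub values; `derive_savings`, `savings_is_consistent`, and the one-point tolerance are illustrative choices.

```python
def derive_savings(baseline_cost: float, current_cost: float) -> float:
    """Fractional savings implied by the baseline and current daily costs."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return 1.0 - current_cost / baseline_cost


def savings_is_consistent(analytics: dict, tolerance: float = 0.01) -> bool:
    """True when the recorded savings matches the recomputed value within tolerance."""
    derived = derive_savings(analytics['baseline_cost_per_day'],
                             analytics['current_cost_per_day'])
    return abs(derived - analytics['cost_savings_vs_baseline']) <= tolerance


stub = {
    "baseline_cost_per_day": 21.00,
    "current_cost_per_day": 12.60,
    "cost_savings_vs_baseline": 0.40,
}
print(f"Derived savings: {derive_savings(21.00, 12.60):.1%}")
print("Consistent:", savings_is_consistent(stub))
```

A mismatch usually means one of the three fields was updated by hand without the others, which would undermine the ROI comparison in M2.2.
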
## Check #3: Failures Documented

**Requirement:** All 5 common failures documented with fixes

**Why:** Documented fixes save you from spending 4+ hours re-debugging the same issues in M2.2, when caching starts interacting with optimized prompts. The write-up also serves as portfolio evidence.

**The 5 Failures:**
1. Cache stampede on cold start
2. Stale data after updates
3. Redis memory overflow (OOM)
4. Hash collisions
5. Connection timeouts

---

**What this cell does:** Searches for failure documentation in README or dedicated files. Creates a template if missing. Validates all 5 failure types are documented.

In [None]:
# Expected: File or README documenting all 5 failures + fixes

import os

# Check for documentation files
doc_files = ['failures_documentation.md', 'README.md', 'FAILURES.md']
stub_file = 'failures_documentation.md'

required_failures = [
    'cache stampede',
    'stale data',
    'memory overflow',
    'hash collision',
    'connection timeout'
]

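found_file = None
content = ''
best_count = -1
# Prefer the candidate that covers the most failures, not just the first file that exists
for f in doc_files:
    if os.path.exists(f):
        with open(f, 'r', encoding='utf-8', errors='replace') as file:
            text = file.read().lower()
        count = sum(1 for failure in required_failures if failure in text)
        if count > best_count:
            found_file, content, best_count = f, text, count

if found_file:
    print(f"✓ Found documentation: {found_file}")
    
    # Check for all 5 failures
    found_count = best_count
    
    print(f"\n✓ Failures documented: {found_count}/5")
    for failure in required_failures:
        status = "✓" if failure in content else "✗"
        print(f"  {status} {failure.title()}")
    
    # Template placeholders mean the headings exist but the write-ups do not
    unfilled = '[describe the issue]' in content or '[your solution]' in content
    if found_count >= 5 and not unfilled:
        print("\n✓ PASS: All 5 failures documented")
    elif unfilled:
        print("\n⚠ ACTION REQUIRED: Template placeholders are still unfilled")
    else:
        print(f"\n⚠ NOTE: Only {found_count}/5 failures found in documentation")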
else:
    # Create stub template
    print(f"⊘ No documentation found. Creating {stub_file} template...")
    
    template = """# Cache Failures & Fixes

## 1. Cache Stampede on Cold Start
**Problem:** [Describe the issue]
**Fix:** [Your solution]

## 2. Stale Data After Updates
**Problem:** [Describe the issue]
**Fix:** [Your solution]

## 3. Redis Memory Overflow (OOM)
**Problem:** [Describe the issue]
**Fix:** [Your solution]

## 4. Hash Collisions
**Problem:** [Describe the issue]
**Fix:** [Your solution]

## 5. Connection Timeouts
**Problem:** [Describe the issue]
**Fix:** [Your solution]
"""
    
    with open(stub_file, 'w') as f:
        f.write(template)
    
    print(f"✓ Created {stub_file} template")
    print(f"⚠ ACTION REQUIRED: Fill in problem descriptions and fixes")

## Check #4: Query Diversity Metric

**Requirement:** Query diversity calculated from logs (CSV/JSON)

**Why:** Determines M2.2 optimization strategy:
- High diversity (>70%): Focus on prompt compression
- Medium diversity (30-70%): Balance caching and prompt optimization
- Low diversity (<30%): Focus on semantic cache tuning

---

**What this cell does:** Searches for query diversity metrics in JSON or CSV format. Creates a stub if missing. Uses diversity score to recommend M2.2 optimization focus.

In [None]:
# Expected: CSV or JSON with query diversity distribution

import json
import os
import csv

# Check for diversity metrics files
metric_files = ['query_diversity.json', 'query_diversity.csv', 'diversity_metrics.json']
stub_file = 'query_diversity.json'

found = False
diversity = 0.45  # default

for f in metric_files:
    if os.path.exists(f):
        print(f"✓ Found: {f}")
        
        if f.endswith('.json'):
            with open(f) as file:
                data = json.load(file)
                diversity = data.get('diversity_score', data.get('average_diversity', 0.45))
                print(f"\n✓ Diversity score: {diversity:.1%}")
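        elif f.endswith('.csv'):
            with open(f, newline='') as file:
                rows = list(csv.DictReader(file))
            print(f"\n✓ Found {len(rows)} diversity records")
            # Assumption: per-record scores live in a 'diversity_score' column
            scores = [float(r['diversity_score']) for r in rows if r.get('diversity_score')]
            if scores:
                diversity = sum(scores) / len(scores)
                print(f"✓ Mean diversity score: {diversity:.1%}")
            else:
                print(f"⚠ No 'diversity_score' values in CSV; using default {diversity:.1%}")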
        
        found = True
        break

if not found:
    # Create stub (offline-safe)
    print(f"⊘ No diversity metrics found. Creating {stub_file} template...")
    
    stub_data = {
        "diversity_score": 0.45,
        "total_queries": 1000,
        "unique_patterns": 450,
        "similarity_distribution": {
            "0-30%": 0.20,
            "30-70%": 0.50,
            "70-100%": 0.30
        },
        "note": "Replace with actual query diversity analysis"
    }
    
    with open(stub_file, 'w') as f:
        json.dump(stub_data, f, indent=2)
    
    print(f"✓ Created {stub_file} with template data")
    diversity = stub_data['diversity_score']

# Strategy recommendation
print("\n--- M2.2 Strategy Recommendation ---")
if diversity > 0.70:
    print("⚠ High diversity (>70%)")
    print("  → Focus: Prompt compression in M2.2")
    print("  → Reason: Low cache hit potential, optimize each query")
elif diversity < 0.30:
    print("✓ Low diversity (<30%)")
    print("  → Focus: Semantic cache tuning")
    print("  → Reason: High cache hit potential, maximize cache efficiency")
else:
    print("✓ Medium diversity (30-70%)")
    print("  → Focus: Balanced approach (caching + prompt optimization)")
    print("  → Reason: Both strategies contribute significantly")

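If you have raw query logs rather than a precomputed score, one rough way to get `diversity_score` is the ratio the stub implies: unique patterns divided by total queries (450 / 1,000 = 0.45). The sketch below assumes a "pattern" is a query after lowercasing and stripping punctuation. That normalization and the function names are assumptions, not the M2.1 definition, which may rely on embedding similarity instead.

```python
import re
from typing import Iterable


def normalize(query: str) -> str:
    """Collapse case, punctuation, and whitespace so near-identical queries match."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(words)


def diversity_score(queries: Iterable[str]) -> float:
    """Unique normalized queries divided by total queries (0.0 when there are none)."""
    normalized = [normalize(q) for q in queries]
    if not normalized:
        return 0.0
    return len(set(normalized)) / len(normalized)


logs = [
    "How do I reset my password?",
    "how do i reset my password",
    "What is the refund policy?",
    "Refund policy??",
]
print(f"Diversity: {diversity_score(logs):.0%}")
```

Exact matching after normalization tends to overstate diversity, since paraphrases count as distinct patterns, so treat the result as an upper bound when choosing a strategy.
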
## Call-Forward: M2.2 Preview

### What's Next in M2.2: Prompt Optimization & Model Selection

You'll add three capabilities to reduce cache miss costs by 30-50%:

---

#### 1. RAG-Specific Prompt Engineering
**Goal:** Reduce token usage 30-50% with 7 optimization templates  
**Trade-off:** Trading context/verbosity for cost savings  
**Technique:** A/B test templates to find quality-cost sweet spot

#### 2. Intelligent Model Routing
**Goal:** Route simple queries to cheap models (GPT-3.5) and complex queries to premium models (GPT-4)  
**Trade-off:** Balancing cost vs capability  
**Risk:** Wrong routing = overpaying or poor answers

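As a rough illustration of complexity-based routing, the sketch below scores a query with a few hand-picked heuristics and picks a model tier. Everything here is hypothetical: the scoring rules, the 0.5 threshold, and the function names are placeholders for whatever M2.2 actually builds, and the returned strings are just labels for the two tiers mentioned above.

```python
REASONING_HINTS = ("why", "compare", "explain", "trade-off", "analyze")


def complexity_score(query: str) -> float:
    """Heuristic score in [0, 1]; higher suggests the query needs a stronger model."""
    text = query.lower()
    words = text.split()
    length_part = min(len(words) / 40, 1.0) * 0.5  # longer queries score higher
    hint_part = 0.4 if any(h in text for h in REASONING_HINTS) else 0.0
    multi_part = 0.1 if text.count("?") > 1 else 0.0  # several questions at once
    return min(length_part + hint_part + multi_part, 1.0)


def route_model(query: str, threshold: float = 0.5) -> str:
    """Send simple queries to the cheap tier and complex ones to the premium tier."""
    return "gpt-4" if complexity_score(query) >= threshold else "gpt-3.5-turbo"


simple = "What are your opening hours?"
complex_query = (
    "Compare the refund policies for annual and monthly plans and explain "
    "why a customer who upgraded mid-cycle would be charged differently?"
)
print(route_model(simple))
print(route_model(complex_query))
```

Keyword heuristics like these misroute easily, which is the risk named above; logging the score next to the chosen model makes wrong routes visible when you tune the threshold.
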
#### 3. Token Optimization Techniques
**Goal:** Smart truncation, compression, summarization to stay under token limits  
**Trade-off:** Risking context loss if too aggressive  
**Challenge:** When are 140 tokens enough, and when do you need the full 350?

---

### Critical Heads-Up

**Aggressive prompt optimization can degrade answer quality.**

When you cut a prompt from 350 → 140 tokens (a 60% reduction), you're removing context and nuance.

**The sweet spot:** 30-40% token reduction with <5% quality degradation.

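These two limits are easy to encode as a guard for A/B tests. This is a minimal sketch: the thresholds come from the sweet spot above, while `token_reduction`, `within_sweet_spot`, and passing the quality drop as a fraction are illustrative assumptions. How quality is actually scored is left to M2.2.

```python
def token_reduction(original_tokens: int, optimized_tokens: int) -> float:
    """Fraction of tokens removed by the optimized prompt."""
    if original_tokens <= 0:
        raise ValueError("original_tokens must be positive")
    return 1.0 - optimized_tokens / original_tokens


def within_sweet_spot(original_tokens: int, optimized_tokens: int,
                      quality_drop: float) -> bool:
    """True for a 30-40% token cut with under 5% quality degradation."""
    reduction = token_reduction(original_tokens, optimized_tokens)
    return 0.30 <= reduction <= 0.40 and quality_drop < 0.05


# The 350 -> 140 example is a 60% cut, well past the sweet spot
print(f"{token_reduction(350, 140):.0%}")
print(within_sweet_spot(350, 140, quality_drop=0.03))
# A cut to 228 tokens is roughly 35%
print(within_sweet_spot(350, 228, quality_drop=0.03))
```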
M2.2 teaches you to optimize without breaking user trust by measuring quality alongside cost.

---

### The Question for M2.2

**"How do you optimize prompts to cut costs 30-50% without degrading answer quality?"**

**Technical preview:**
- Use `tiktoken` for token counting
- Implement template-based optimization with A/B testing
- Build model router using complexity scoring

**Combined result:** 50-70% total cost reduction (caching + prompt optimization)

---

**Estimated time for M2.2:** 40 min video + 60-90 min hands-on

**Ready for M2.2!** ✓