# Bridge M1.4 → M2.1: Cost Reality + Caching Trade-off

**Purpose:** Validate baseline costs before M2.1 caching implementation  
**Source:** bridge_M1_4_to_M2_1.md  
**Approach:** Local calculations only; no live API calls

## Section 1: Recap — M1 Shipped Baseline RAG + Metrics

### Module 1 Accomplishments

**M1.1 (Vector Databases):**
- ✓ Pinecone index configured (dimension 1536, cosine metric)
- ✓ Embeddings generated using OpenAI text-embedding-3-small
- ✓ Semantic search with similarity scoring
- ✓ 5 common vector DB failures debugged

**M1.2 (Advanced Indexing):**
- ✓ Hybrid search (dense + sparse vectors, 20-40% better recall)
- ✓ Metadata filtering for domain-specific queries
- ✓ Advanced index configurations tested
- ✓ Cost-performance trade-offs documented

**M1.3 (Document Processing):**
- ✓ Multi-format parser (PDF, TXT, DOCX, MD)
- ✓ Smart chunking strategies (size-based, semantic, sliding window)
- ✓ Batch processing with error handling
- ✓ Metadata extraction & enrichment

**M1.4 (Query Pipeline):**
- ✓ Complete 7-stage query pipeline
- ✓ Query classification system
- ✓ Cross-encoder reranking (20-40% quality improvement)
- ✓ Production error handling (5 common failures)
- ✓ Performance monitoring (latency, cost, quality)
- ✓ Source attribution & streaming responses

### Current State

**Working RAG system that:**
- Answers questions using indexed documents
- Provides debugging capabilities for production failures
- Establishes a performance measurement baseline

**The Problem:** Every query is processed from scratch. The system keeps no memory of previous answers, so a repeated question incurs the full pipeline cost each time.

## Section 2: Baseline — Approximate Per-Query Cost Calculator

**Baseline scale:** 10,000 queries/day  
**Approach:** Local math using constants from bridge_M1_4_to_M2_1.md

In [None]:
# Cost constants from bridge_M1_4_to_M2_1.md
COST_EMBEDDING_PER_QUERY = 0.0001    # OpenAI text-embedding-3-small
COST_LLM_PER_QUERY = 0.002            # LLM generation
COST_VECTOR_SEARCH_PER_QUERY = 0.007  # Pinecone vector search

# Baseline scale
QUERIES_PER_DAY = 10_000

# Calculate daily costs
daily_embedding_cost = QUERIES_PER_DAY * COST_EMBEDDING_PER_QUERY
daily_llm_cost = QUERIES_PER_DAY * COST_LLM_PER_QUERY
daily_vector_search_cost = QUERIES_PER_DAY * COST_VECTOR_SEARCH_PER_QUERY

daily_total = daily_embedding_cost + daily_llm_cost + daily_vector_search_cost
monthly_total = daily_total * 30  # Assumes a 30-day month

# Expected: Daily=$91.00, Monthly=$2,730.00
print("=" * 50)
print("BASELINE COST CALCULATOR")
print("=" * 50)
print(f"Scale: {QUERIES_PER_DAY:,} queries/day")
print()
print("Daily Costs:")
print(f"  Embeddings:      ${daily_embedding_cost:>7.2f}")
print(f"  LLM Calls:       ${daily_llm_cost:>7.2f}")
print(f"  Vector Searches: ${daily_vector_search_cost:>7.2f}")
print(f"  {'─' * 30}")
print(f"  DAILY TOTAL:     ${daily_total:>7.2f}")
print()
print(f"MONTHLY TOTAL:     ${monthly_total:>7,.2f}")
print()
print(f"Per-query cost: ${daily_total / QUERIES_PER_DAY:.4f}")
print("=" * 50)
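
The same arithmetic can be wrapped in a function to show which component dominates the bill. The sketch below is illustrative only: `COSTS_PER_QUERY` and `cost_breakdown` are hypothetical names, the constants are copied from the cell above, and the 30-day month is the same assumption the calculator makes.

```python
# Per-query cost constants copied from the Section 2 cell.
COSTS_PER_QUERY = {
    "embedding": 0.0001,
    "llm": 0.002,
    "vector_search": 0.007,
}


def cost_breakdown(queries_per_day, costs=COSTS_PER_QUERY, days_per_month=30):
    """Return daily/monthly totals plus each component's share of spend."""
    daily = {name: queries_per_day * unit for name, unit in costs.items()}
    daily_total = sum(daily.values())
    if daily_total == 0:
        raise ValueError("queries_per_day and costs must yield a non-zero total")
    shares = {name: value / daily_total for name, value in daily.items()}
    return {
        "daily": daily,
        "daily_total": daily_total,
        "monthly_total": daily_total * days_per_month,
        "shares": shares,
    }


breakdown = cost_breakdown(10_000)
for name, share in breakdown["shares"].items():
    print(f"{name:<14} ${breakdown['daily'][name]:>6.2f}/day  {share:6.1%}")
```

With these constants, vector search accounts for roughly 77% of spend ($70 of $91 per day), so a cache hit that skips retrieval saves more than one that only skips generation.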

## Section 3: Repeat-Rate Worksheet

**Source:** bridge_M1_4_to_M2_1.md query repetition analysis  
**Purpose:** Record assumed repeat percentages for caching estimation

In [None]:
import json

# Query repetition rates from bridge_M1_4_to_M2_1.md
repeat_rate_assumptions = {
    "semantically_similar_pct": {
        "min": 30,
        "max": 50,
        "assumed": 40,
        "description": "Queries similar to previous queries"
    },
    "exact_near_exact_repeats_pct": {
        "min": 15,
        "max": 25,
        "assumed": 20,
        "description": "Exact or near-exact repeat queries"
    },
    "truly_unique_pct": {
        "min": 25,
        "max": 40,
        "assumed": 32.5,
        "description": "Completely unique questions"
    },
    "notes": [
        "Source: Real RAG systems analysis from bridge_M1_4_to_M2_1.md",
        "Assumed values are range midpoints used for caching projections",
        "Assumed values sum to 92.5%, so they do not partition traffic exactly",
        "Edit 'assumed' values to test different scenarios"
    ]
}

# Save to JSON
with open('repeat_rate_assumptions.json', 'w') as f:
    json.dump(repeat_rate_assumptions, f, indent=2)

# Expected: JSON file created with repeat rate ranges
print("=" * 50)
print("QUERY REPEAT-RATE ASSUMPTIONS")
print("=" * 50)
print()
for key, value in repeat_rate_assumptions.items():
    if key != "notes":
        print(f"{value['description']}:")
        print(f"  Range: {value['min']}%-{value['max']}%")
        print(f"  Assumed: {value['assumed']}%")
        print()

print("Notes:")
for note in repeat_rate_assumptions["notes"]:
    print(f"  • {note}")
print()
print("✓ Saved to: repeat_rate_assumptions.json")
print("=" * 50)
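
A quick sanity check on the worksheet: the three assumed values are the midpoints of their ranges and add up to 92.5%, not 100%. The sketch below rescales them into a proper partition and derives an optimistic ceiling on the cache hit rate. Rescaling is an assumption made here for illustration; the source does not say whether the categories are meant to be mutually exclusive.

```python
# Assumed percentages copied from the repeat-rate worksheet above.
assumed = {
    "semantically_similar_pct": 40,
    "exact_near_exact_repeats_pct": 20,
    "truly_unique_pct": 32.5,
}

total_pct = sum(assumed.values())

# Assumption: treat the categories as mutually exclusive and rescale to 100%.
normalized = {name: value / total_pct * 100 for name, value in assumed.items()}

# Optimistic ceiling: every query that is not truly unique could be a hit.
cacheable_ceiling_pct = 100 - normalized["truly_unique_pct"]

print(f"Assumed values sum to {total_pct}%")
for name, value in normalized.items():
    print(f"  {name:<30} {value:5.1f}%")
print(f"Upper bound on cache hit rate: {cacheable_ceiling_pct:.1f}%")
```

The ceiling is only an upper bound. A real hit rate also depends on TTLs, cache size, and how strictly "similar" is defined.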

## Section 4: Projected Savings — Caching What-If Table

**Scenarios:** 0%, 30%, 50%, 70% cache hit rates  
**Approach:** Simple cost reduction calculation at different hit rates

In [None]:
# Cache hit rate scenarios
cache_scenarios = [0, 30, 50, 70]

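# Baseline costs, reused from the Section 2 cell (run that cell first)
baseline_daily_cost = daily_total      # $91.00 with the default constants
baseline_monthly_cost = monthly_total  # $2,730.00 with the default constants

# Cost of serving one cache hit (assumed Redis read cost per query)
CACHE_LOOKUP_COST = 0.0001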

# Expected: Table showing savings at 30%, 50%, 70% cache hit rates
print("=" * 70)
print("CACHING WHAT-IF TABLE")
print("=" * 70)
print()
print(f"{'Cache Hit Rate':<20} {'Daily Cost':<15} {'Monthly Cost':<15} {'Savings %':<15}")
print("─" * 70)

results = []
for hit_rate in cache_scenarios:
    hit_rate_decimal = hit_rate / 100
    
    # Cached queries pay only lookup cost
    cached_queries = QUERIES_PER_DAY * hit_rate_decimal
    cache_cost = cached_queries * CACHE_LOOKUP_COST
    
    # Full-cost queries
    full_cost_queries = QUERIES_PER_DAY * (1 - hit_rate_decimal)
    full_cost = full_cost_queries * (baseline_daily_cost / QUERIES_PER_DAY)
    
    daily_cost = cache_cost + full_cost
    monthly_cost = daily_cost * 30
    savings_pct = ((baseline_daily_cost - daily_cost) / baseline_daily_cost) * 100
    
    results.append({
        "hit_rate": hit_rate,
        "daily_cost": daily_cost,
        "monthly_cost": monthly_cost,
        "savings_pct": savings_pct
    })
    
    if hit_rate == 0:
        print(f"{hit_rate}% (baseline){'':<7} ${daily_cost:<14.2f} ${monthly_cost:<14,.2f} {'-':<15}")
    else:
        print(f"{hit_rate}%{'':<17} ${daily_cost:<14.2f} ${monthly_cost:<14,.2f} {savings_pct:.1f}%")

print("=" * 70)
print()
print("KEY INSIGHTS:")
print(f"  • 30% cache hit rate → Save ~${baseline_monthly_cost - results[1]['monthly_cost']:,.0f}/month")
print(f"  • 50% cache hit rate → Save ~${baseline_monthly_cost - results[2]['monthly_cost']:,.0f}/month")
print(f"  • 70% cache hit rate → Save ~${baseline_monthly_cost - results[3]['monthly_cost']:,.0f}/month")
print()
print("Source: bridge_M1_4_to_M2_1.md — Caching savings: 30-70% reduction")
print("=" * 70)
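
The what-if table can also be read in reverse: given a savings target, which hit rate is needed? Because a hit still pays the lookup cost, the savings fraction is `hit_rate * (full_cost - lookup_cost) / full_cost`, which is slightly below the hit rate itself. The helper below inverts that relation. `required_hit_rate` is a hypothetical name, and the default costs are the per-query figures used above.

```python
def required_hit_rate(target_savings, full_cost=0.0091, lookup_cost=0.0001):
    """Return the hit rate (0-1) needed to reach a fractional savings target (0-1)."""
    if not 0.0 <= target_savings <= 1.0:
        raise ValueError("target_savings must be between 0 and 1")
    per_hit_saving = full_cost - lookup_cost
    if per_hit_saving <= 0:
        raise ValueError("a cache hit must be cheaper than a full query")
    hit_rate = target_savings * full_cost / per_hit_saving
    if hit_rate > 1.0:
        raise ValueError("target is unreachable even at a 100% hit rate")
    return hit_rate


for target in (0.30, 0.50, 0.70):
    print(f"{target:.0%} savings needs a {required_hit_rate(target):.1%} hit rate")
```

With these constants the gap is small (about one percentage point), but it widens as the lookup cost approaches the full query cost.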

## Section 5: Risks — Freshness Trade-off

**Trade-off Assessment:** HIGH  
**Source:** bridge_M1_4_to_M2_1.md

### Critical Limitation: Data Freshness Loss

**The Trade-off:**  
"You gain speed and cost savings, but you lose data freshness."

**When caching is WRONG:**
- ❌ Knowledge base updates every 5 minutes
- ❌ Real-time data requirements (stock prices, live metrics)
- ❌ Rapidly changing content (news, social feeds)
- ❌ Compliance requires latest data (regulatory, legal)

**When caching makes sense:**
- ✓ Stable documentation (API docs, how-to guides)
- ✓ Historical data (archives, past reports)
- ✓ FAQ-style content (policies, procedures)
- ✓ Acceptable staleness window (hourly/daily updates OK)

### Implementation Risks

**1. Cache Invalidation (The Hard Problem)**
- Challenge: "Knowing when to refresh"
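- Strategies: TTL (time-based expiry) and event-based invalidation; LRU eviction caps cache size but does not by itself keep entries fresh
- Risk: Stale answers served after the underlying content changes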

**2. Cache Stampede**
- Challenge: Many simultaneous requests for the same expired cache key
- Impact: Sudden load spike on the full pipeline when a popular entry expires
- Mitigation: Request coalescing (covered in M2.1)

**3. Cache Warming**
- Challenge: Cold cache = no savings initially
- Reality: Takes time to build up hit rate
- Expectation: 40-60% hit rate after warm-up period
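
To make the stampede risk concrete, here is a minimal, thread-based sketch of request coalescing: the first caller for a missing key computes the value while concurrent callers for the same key wait and reuse it. `CoalescingLoader` and `slow_pipeline` are hypothetical names, and the in-process dictionary stands in for whatever cache M2.1 uses, so treat this as an illustration of the idea rather than the M2.1 implementation.

```python
import threading
import time


class CoalescingLoader:
    """Let one caller compute a missing key while the others wait for it."""

    def __init__(self, compute):
        self._compute = compute
        self._lock = threading.Lock()
        self._cache = {}
        self._in_flight = {}  # key -> Event set when the leader finishes

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            event = self._in_flight.get(key)
            is_leader = event is None
            if is_leader:
                event = threading.Event()
                self._in_flight[key] = event

        if is_leader:
            try:
                value = self._compute(key)
                with self._lock:
                    self._cache[key] = value
                return value
            finally:
                with self._lock:
                    del self._in_flight[key]
                event.set()  # Wake the waiting callers

        event.wait()
        with self._lock:
            if key in self._cache:
                return self._cache[key]
        return self.get(key)  # The leader failed; try again


calls = []


def slow_pipeline(query):
    calls.append(query)  # Record each real pipeline run
    time.sleep(0.05)     # Stand-in for embedding + search + LLM latency
    return f"answer to {query!r}"


loader = CoalescingLoader(slow_pipeline)
threads = [threading.Thread(target=loader.get, args=("refund policy",)) for _ in range(20)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(f"pipeline calls: {len(calls)}")  # prints "pipeline calls: 1"
```

Twenty concurrent requests trigger a single pipeline run. A production version would also need expiry and cross-process coordination, which this sketch leaves out.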

### Decision Framework

**Ask before implementing caching:**

1. What's our acceptable staleness window?
   - Minutes? Caching risky
   - Hours? Caching viable with short TTL
   - Days? Caching strong candidate

2. What's our update frequency?
   - Continuous? Caching problematic
   - Hourly/Daily? Caching works with proper TTL
   - Weekly/Monthly? Caching ideal

3. What's our cost/freshness priority?
   - Freshness > Cost: Skip caching, optimize prompts instead
   - Cost > Freshness: Caching appropriate with monitoring
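
The three questions can be encoded as a small helper so the decision is recorded alongside the cost numbers. The categories mirror the framework above. How the answers are combined (freshness priority wins first, then the riskiest answer) is an assumption of this sketch, and all names are hypothetical.

```python
from enum import Enum


class Staleness(Enum):
    MINUTES = "minutes"
    HOURS = "hours"
    DAYS = "days"


class UpdateFrequency(Enum):
    CONTINUOUS = "continuous"
    HOURLY_OR_DAILY = "hourly/daily"
    WEEKLY_OR_MONTHLY = "weekly/monthly"


class Priority(Enum):
    FRESHNESS = "freshness"
    COST = "cost"


def caching_recommendation(staleness, updates, priority):
    """Combine the three framework answers into a single recommendation."""
    # Assumption: a freshness-first priority overrides the other answers.
    if priority is Priority.FRESHNESS:
        return "Skip caching; optimize prompts instead"
    # Assumption: the riskiest remaining answer decides the outcome.
    if staleness is Staleness.MINUTES or updates is UpdateFrequency.CONTINUOUS:
        return "Caching risky"
    if staleness is Staleness.DAYS and updates is UpdateFrequency.WEEKLY_OR_MONTHLY:
        return "Strong caching candidate"
    return "Caching viable with a short TTL and monitoring"


print(caching_recommendation(Staleness.HOURS, UpdateFrequency.HOURLY_OR_DAILY, Priority.COST))
```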

### Bottom Line

**Caching is NOT free.** You trade data freshness for cost savings.  
M2.1 will show you HOW to implement caching.  
This section shows you WHEN to implement it.

## Section 6: Call-Forward to M2.1 — Caching Goals & Invalidation Checklist

**Next:** M2.1 Caching Strategies for Cost Reduction  
**Duration:** 38 min video + 60-90 min hands-on

### M2.1 Implementation Goals

**Multi-layer Cache Architecture:**
1. **Response Cache** — Complete answers (highest savings)
2. **Semantic Cache** — Similar query matching (40-60% hit rate target)
3. **Embedding Cache** — Reuse vector representations
4. **Context Cache** — Retrieved document chunks
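
The semantic cache is the least obvious of the four layers, so a rough sketch may help. It stores query embeddings with their answers and returns a stored answer when a new query's embedding is close enough. This version uses a pure-Python linear scan as a stand-in for a vector index such as Faiss. The `SemanticCache` class, the 0.9 threshold, and the toy 3-dimensional vectors are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached answer when a query embedding is close to a stored one."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # Assumed value; tune against real queries
        self._entries = []          # List of (embedding, answer) pairs

    def put(self, embedding, answer):
        self._entries.append((list(embedding), answer))

    def get(self, embedding):
        best_score, best_answer = 0.0, None
        for stored, answer in self._entries:  # Linear scan; an index would replace this
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.0], "Refunds are processed within 5 days.")
print(cache.get([0.95, 0.05, 0.0]))  # Near-duplicate query: returns the stored answer
print(cache.get([0.0, 1.0, 0.0]))    # Unrelated query: returns None
```

The threshold controls the trade-off: set it too low and users get answers to a different question, too high and the layer rarely hits.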

**Technologies:**
- Redis for the response cache, using SHA-256 hashes as cache keys
- Faiss for semantic similarity cache
- Request coalescing for cache stampede protection
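
As an illustration of the response-cache key, the sketch below normalizes a query (lowercase, collapsed whitespace) and hashes it with SHA-256 so that trivially different spellings map to the same entry. The normalization rules and the `rag:response` namespace prefix are assumptions; M2.1 may choose differently.

```python
import hashlib


def response_cache_key(query, namespace="rag:response"):
    """Build a fixed-length cache key from a normalized query string."""
    normalized = " ".join(query.lower().split())  # Lowercase, collapse whitespace
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{namespace}:{digest}"


key_a = response_cache_key("What is our refund policy?")
key_b = response_cache_key("  what is our   REFUND policy? ")
print(key_a)
print(key_a == key_b)  # True: both spellings share one cache entry
```

With redis-py, a key like this could be written with `set(key, value, ex=ttl_seconds)` so that the entry expires on its own. Hashing only catches exact and near-exact repeats; paraphrases are the job of the semantic cache.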

**Expected Performance:**
- 40-60% of queries served from cache
- 50-200ms cached response time (vs 300-500ms full pipeline)
- 30-70% cost reduction at scale
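
These figures combine into an expected average latency: hits pay the cached time, misses pay the full pipeline time. The sketch below uses the midpoints of the quoted ranges (125 ms and 400 ms) as assumed defaults and ignores the small lookup overhead a miss also pays.

```python
def blended_latency_ms(hit_rate, cached_ms=125.0, full_ms=400.0):
    """Expected average latency for a given cache hit rate (0-1)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_ms + (1.0 - hit_rate) * full_ms


for hit_rate in (0.4, 0.5, 0.6):
    print(f"{hit_rate:.0%} hits: ~{blended_latency_ms(hit_rate):.1f} ms average")
```

An average hides the bimodal shape: individual users still see either a fast hit or a full-pipeline miss, so tail latency is unchanged by caching.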

### Pre-M2.1 Checklist

☐ **Baseline metrics captured**  
   → Run Section 2 calculator to confirm $91/day baseline  
   → Document current per-query cost ($0.0091)

☐ **Repeat rate estimated**  
   → Review Section 3 assumptions (30-50% similar queries)  
   → Adjust JSON if your query patterns differ

☐ **Savings scenarios reviewed**  
   → Study Section 4 what-if table  
   → Identify target cache hit rate (30%, 50%, or 70%)

☐ **Trade-off decision made**  
   → Section 5 decision framework completed  
   → Staleness window defined (hours? days?)  
   → Cost vs freshness priority clarified

☐ **Ready for implementation**  
   → M1.4 query pipeline working reliably  
   → Performance monitoring in place  
   → Test environment prepared for caching layer

### Cache Invalidation Strategy Checklist

**Before implementing caching, define:**

☐ **TTL (Time-To-Live)**  
   - How long should cached answers remain valid?  
   - Recommendation: Start with 1 hour, adjust based on update frequency

☐ **Event-based Invalidation**  
   - What events require cache clearing?  
   - Examples: Document updates, index rebuilds, config changes

☐ **LRU (Least Recently Used) Eviction**  
   - What's max cache size?  
   - What happens when cache fills up?

☐ **Monitoring & Alerting**  
   - Track cache hit rate (target: 40-60%)  
   - Alert on a spike in cache misses (may indicate invalidation issues)  
   - Monitor staleness metrics (time since cache entry created)
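
The four checklist items can be seen working together in one small in-memory cache: TTL expiry on read, event-based invalidation by predicate, LRU eviction at a size cap, and hit-rate and staleness counters. This is a sketch under assumptions. `SketchCache` is a hypothetical class, the `doc_id:query` key format is invented for the example, and a `None` value is indistinguishable from a miss.

```python
import time
from collections import OrderedDict


class SketchCache:
    """In-memory cache with TTL expiry, LRU eviction, and basic metrics."""

    def __init__(self, max_size=1000, ttl_seconds=3600, clock=time.monotonic):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = OrderedDict()  # key -> (value, created_at), oldest use first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None:
            value, created_at = entry
            if self._clock() - created_at < self.ttl_seconds:
                self._entries.move_to_end(key)  # Mark as most recently used
                self.hits += 1
                return value
            del self._entries[key]  # TTL expired
        self.misses += 1
        return None

    def put(self, key, value):
        self._entries[key] = (value, self._clock())
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # Evict least recently used

    def invalidate(self, predicate):
        """Event-based invalidation: drop every key the predicate matches."""
        stale = [key for key in self._entries if predicate(key)]
        for key in stale:
            del self._entries[key]
        return len(stale)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def oldest_entry_age(self):
        """Staleness metric: seconds since the oldest live entry was created."""
        now = self._clock()
        return max((now - created for _, created in self._entries.values()), default=0.0)


now = [0.0]  # Fake clock so the demo does not need to sleep
cache = SketchCache(max_size=2, ttl_seconds=3600, clock=lambda: now[0])
cache.put("doc1:refunds", "answer 1")
cache.put("doc2:shipping", "answer 2")
cache.get("doc1:refunds")               # Hit; doc1 becomes most recently used
cache.put("doc3:returns", "answer 3")   # Cache is full: evicts doc2 (LRU)
cache.get("doc2:shipping")              # Miss
removed = cache.invalidate(lambda key: key.startswith("doc3:"))  # doc3 was updated
print(f"hit rate: {cache.hit_rate():.0%}, invalidated: {removed}")
```

Injecting the clock keeps expiry testable without waiting. Advancing `now[0]` past 3600 makes the remaining entry expire on its next read.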

### Ready?

Run all cells in this notebook to:
1. ✓ Validate baseline costs
2. ✓ Record repeat-rate assumptions  
3. ✓ Project caching savings  
4. ✓ Review freshness trade-offs  
5. ✓ Complete pre-M2.1 checklist

**Then proceed to M2.1: Caching Strategies for Cost Reduction**