# M2.1: Caching Strategies for Cost Reduction

**Learning Objectives:**
- Deploy a multi-layer Redis caching system reducing RAG costs by 30-70%
- Configure cache invalidation based on content freshness requirements
- Diagnose and resolve five common production failures
- Recognize scenarios where caching creates more problems than solutions

## 1. Objectives & Reality Check

### What Caching Accomplishes
- **Cost Reduction:** 30-70% for systems with repeating query patterns
- **Latency:** ~800ms → ~50ms for cache hits
- **Scalability:** Handle more traffic without proportional cost increases

### What Caching Cannot Do
- Help when nearly every question is different (query diversity above 90% caps the hit rate below 10%)
- Guarantee data freshness; invalidation remains difficult
- Eliminate LLM processing time on initial requests

### When NOT to Use Caching
- Query diversity exceeds 90%
- Content updates must be visible to users within 5 minutes
- Traffic below 500 daily queries
- Single-server deployments with low volumes
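
The thresholds in this section can be folded into a small pre-flight check. This is a minimal sketch, assuming you can estimate query diversity, the freshness window, and daily volume up front; the function name `caching_is_viable` and its parameters are hypothetical and not part of the course module.

```python
from __future__ import annotations


def caching_is_viable(query_diversity: float,
                      freshness_window_minutes: float,
                      daily_queries: int) -> tuple[bool, list[str]]:
    """Apply the rule-of-thumb thresholds listed above.

    query_diversity: fraction of queries that are unique (0.0 to 1.0).
    freshness_window_minutes: how quickly content updates must become visible.
    daily_queries: average number of queries per day.
    Returns (viable, reasons_against).
    """
    reasons = []
    if query_diversity > 0.90:
        reasons.append("query diversity exceeds 90%")
    if freshness_window_minutes <= 5:
        reasons.append("content must be fresh within 5 minutes")
    if daily_queries < 500:
        reasons.append("traffic is below 500 queries/day")
    return (not reasons, reasons)


viable, reasons = caching_is_viable(0.95, 3, 200)
print(viable, reasons)
```

The cut-offs are the rules of thumb from the list, not hard limits; treat a failed check as a prompt to measure before deploying.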

In [None]:
# Token/Cost Math: Reality Check

# Assumptions
avg_tokens_per_query = 1500  # input + output
cost_per_1k_tokens = 0.002  # GPT-3.5-turbo approximate
queries_per_day = 10000
cache_hit_rate = 0.50  # 50% of queries hit cache

# Without caching
daily_cost_no_cache = (queries_per_day * avg_tokens_per_query / 1000) * cost_per_1k_tokens

# With caching
cache_misses = queries_per_day * (1 - cache_hit_rate)
daily_cost_with_cache = (cache_misses * avg_tokens_per_query / 1000) * cost_per_1k_tokens

savings = daily_cost_no_cache - daily_cost_with_cache
savings_pct = (savings / daily_cost_no_cache) * 100

# Expected:
# Without cache: $30.00/day
# With cache (50% hit): $15.00/day
# Savings: 50%

print(f"Without cache: ${daily_cost_no_cache:.2f}/day")
print(f"With cache ({cache_hit_rate*100:.0f}% hit): ${daily_cost_with_cache:.2f}/day")
print(f"Savings: {savings_pct:.0f}% (${savings:.2f}/day)")

In [None]:
# When caching FAILS: High diversity scenario

# Simulate high-diversity workload
import random

unique_queries = set()
total_queries = 1000

# Generate mostly unique queries (90% diversity)
for i in range(total_queries):
    if random.random() < 0.90:  # 90% unique
        unique_queries.add(f"query_{i}")
    else:  # 10% repeats
        unique_queries.add(f"query_{random.randint(0, 100)}")

diversity = len(unique_queries) / total_queries
theoretical_hit_rate = 1 - diversity

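# Expected (approximate; the run is random and unseeded):
# Diversity: ~90%
# Theoretical max hit rate: ~10%
# Verdict: DON'T CACHE

# Build the verdict first: a backslash inside an f-string expression
# is a SyntaxError before Python 3.12.
verdict = "❌ DON'T CACHE" if theoretical_hit_rate < 0.2 else "✓ Cache viable"
print(f"Diversity: {diversity*100:.0f}%")
print(f"Theoretical max hit rate: {theoretical_hit_rate*100:.0f}%")
print(f"Verdict: {verdict}")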

## 2. Architecture: Multi-Layer Cache

### Three-Layer Design

**Layer 1: Query Cache (Exact + Semantic)**
- Exact match via SHA-256 hash
- Semantic match via fuzzy string similarity (rapidfuzz in this notebook; BM25 or MinHash are alternatives)
- Stores final LLM responses

**Layer 2: Embedding Cache**
- Caches vector embeddings (1536 dims = ~6KB each)
- Reduces OpenAI API calls for repeated text
- TTL: 2 hours (embeddings rarely change)

**Layer 3: Retrieved-Context Cache**
- Caches document snippets fetched from vector DB
- Keyed by sorted document IDs
- Multiple queries often retrieve same documents

### Request Flow
```
Query → Exact Cache? → Semantic Cache? → Embedding Cache? → Vector DB → Context Cache? → LLM → Cache Result
```
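
The request flow can be traced end to end with plain dictionaries standing in for Redis. This is an illustrative sketch only: the class `InMemoryRagCache` and the four callables passed to it (`embed`, `search`, `fetch`, `generate`) are hypothetical placeholders for the embedding API, the vector DB, the document store, and the LLM, and the semantic lookup is marked but not implemented.

```python
import hashlib


class InMemoryRagCache:
    """Dict-backed stand-in for the cache layers, for illustration only."""

    def __init__(self, embed, search, fetch, generate):
        # All four callables are placeholders for real services.
        self.embed, self.search = embed, search
        self.fetch, self.generate = fetch, generate
        self.responses = {}   # Layer 1: query hash -> final answer
        self.embeddings = {}  # Layer 2: query text -> embedding vector
        self.contexts = {}    # Layer 3: sorted doc IDs -> snippets
        self.llm_calls = 0

    def answer(self, query):
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self.responses:            # exact hit: skip all later steps
            return self.responses[key]
        # A semantic (fuzzy) lookup would run here before falling through.
        if query not in self.embeddings:     # embedding cache
            self.embeddings[query] = self.embed(query)
        doc_ids = self.search(self.embeddings[query])  # vector DB lookup
        ctx_key = tuple(sorted(doc_ids))     # order-independent key
        if ctx_key not in self.contexts:     # context cache
            self.contexts[ctx_key] = self.fetch(ctx_key)
        self.llm_calls += 1
        result = self.generate(query, self.contexts[ctx_key])
        self.responses[key] = result         # cache the final answer
        return result


rag = InMemoryRagCache(
    embed=lambda text: [float(len(text))],
    search=lambda vector: ["doc_2", "doc_1"],
    fetch=lambda doc_ids: [f"text of {d}" for d in doc_ids],
    generate=lambda query, snippets: f"answer built from {len(snippets)} snippets",
)
first = rag.answer("How do I reset my password?")
second = rag.answer("How do I reset my password?")  # served from Layer 1
print(first, "| LLM calls:", rag.llm_calls)
```

The second call returns before any embedding, retrieval, or generation work happens, which is where both the cost and latency savings come from.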

In [None]:
# Quick architecture verification
import config

print("=== Cache Layer Configuration ===")
print(f"Exact Cache: {'✓' if config.ENABLE_EXACT_CACHE else '✗'} (TTL: {config.TTL_EXACT_CACHE}s)")
print(f"Semantic Cache: {'✓' if config.ENABLE_SEMANTIC_CACHE else '✗'} (TTL: {config.TTL_SEMANTIC_CACHE}s)")
print(f"Embedding Cache: {'✓' if config.ENABLE_EMBEDDING_CACHE else '✗'} (TTL: {config.TTL_EMBEDDING_CACHE}s)")
print(f"Context Cache: {'✓' if config.ENABLE_CONTEXT_CACHE else '✗'} (TTL: {config.TTL_CONTEXT_CACHE}s)")
print(f"\nSemantic threshold: {config.SEMANTIC_THRESHOLD}")

# Expected:
# All layers enabled with default TTLs
# Semantic threshold: 0.85

## 3. Redis Setup & Connection

### Prerequisites
- **Docker:** `docker run -d -p 6379:6379 redis:7-alpine`
- **Redis Cloud:** Free tier at redis.com/try-free
- **Environment:** Copy `.env.example` to `.env` and set `REDIS_URL`

### Connection Testing
Test Redis connectivity and handle graceful fallback if unavailable.
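
One way a graceful fallback could look is sketched below. It assumes the `redis` Python package (redis-py) and a `REDIS_URL` environment variable; the helper name `connect_redis_or_none` is hypothetical, and the notebook's own `config.get_redis()` may be implemented differently.

```python
import os


def connect_redis_or_none(url=None, timeout_seconds=1.0):
    """Return a connected redis-py client, or None if Redis is unusable."""
    try:
        import redis  # third-party: pip install redis
    except ImportError:
        return None  # package missing: behave as if Redis were down

    url = url or os.environ.get("REDIS_URL", "redis://localhost:6379/0")
    try:
        client = redis.Redis.from_url(
            url,
            socket_connect_timeout=timeout_seconds,
            socket_timeout=timeout_seconds,
            decode_responses=True,
        )
        client.ping()  # forces a real round trip to the server
        return client
    except (redis.RedisError, OSError, ValueError):
        # Unreachable server, timeout, or malformed URL: run without a cache.
        return None


client = connect_redis_or_none("redis://127.0.0.1:1/0", timeout_seconds=0.2)
print("Redis available" if client else "Redis not available, continuing without cache")
```

Returning `None` instead of raising lets the calling code keep serving requests straight from the LLM when the cache is down.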

In [None]:
# Test Redis and OpenAI connections
import config
from m2_1_caching import MultiLayerCache

# Initialize clients
redis_client = config.get_redis()
openai_client = config.get_openai()

if redis_client:
    print("✓ Redis connected")
    info = redis_client.info("server")
    print(f"  Version: {info.get('redis_version', 'unknown')}")
else:
    print("⚠️ Redis not available - will skip network-dependent cells")

if openai_client:
    print("✓ OpenAI client initialized")
else:
    print("⚠️ OpenAI not configured - will skip API calls")

# Initialize cache system
cache = MultiLayerCache(redis_client, openai_client)

# Expected:
# ✓ Redis connected OR ⚠️ Redis not available
# ✓ OpenAI client initialized OR ⚠️ OpenAI not configured

## 4. Exact & Semantic Query Cache

### How It Works
1. **Exact Match:** SHA-256 hash of query string
2. **Semantic Match:** Fuzzy similarity (rapidfuzz) with threshold
3. **Response Storage:** Complete LLM response cached for reuse
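
The two lookup steps can be sketched with the standard library alone. The assumptions here: `difflib.SequenceMatcher` stands in for rapidfuzz so the example has no third-party dependency, the whitespace and case normalization before hashing is an optional addition rather than something the module is known to do, and the function names and the `exact:` prefix are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def exact_key(query, prefix="exact:"):
    """SHA-256 key for a query, after light normalization (an assumption)."""
    normalized = " ".join(query.lower().split())
    return prefix + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def best_fuzzy_match(query, cached_queries, threshold=0.85):
    """Return (cached_query, score) for the best match at or above threshold.

    Scores are in [0, 1]. Returns None when nothing clears the threshold.
    """
    best = None
    for candidate in cached_queries:
        score = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best


cached = ["How do I reset my password?", "What are your business hours?"]
near_duplicate = best_fuzzy_match("How do I reset my password", cached)
paraphrase = best_fuzzy_match("What time are you open?", cached)
print(near_duplicate)
print(paraphrase)
```

Character-level similarity catches near-duplicate wording such as a dropped question mark, but it scores a true paraphrase well below a 0.85 threshold. Matching paraphrases reliably generally needs embedding-based similarity or a lower threshold, at the cost of more false hits.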

### Demonstration
Simulate two similar queries to show cache hits.

In [None]:
# Exact cache demonstration
if redis_client:
    query1 = "How do I reset my password?"
    response1 = {"answer": "Visit settings > security > reset password", "source": "docs"}

    # First request - MISS
    cached = cache.get_exact(query1)
    if not cached:
        cache.set_exact(query1, response1)

    # Second request - HIT (exact match)
    cached = cache.get_exact(query1)
    print(f"Exact cache result: {cached}")

    # Expected:
    # ✗ Cache MISS [exact]
    # ✓ Cache HIT [exact]
    # Result: {"answer": "Visit settings...", "source": "docs"}
else:
    print("⚠️ Skipping (no Redis)")

In [None]:
# Semantic cache demonstration
if redis_client:
    query2 = "What are your business hours?"
    query2_similar = "What time are you open?"
    response2 = {"answer": "Mon-Fri 9am-5pm EST", "source": "contact"}

    # Store original
    cache.set_semantic(query2, response2)

    # Try similar query
    cached_semantic = cache.get_semantic(query2_similar, threshold=0.70)
    print(f"Semantic match found: {cached_semantic is not None}")
    if cached_semantic:
        print(f"Result: {cached_semantic}")

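    # Expected (depends on the similarity scorer; a purely lexical scorer
    # may rate this pair below 0.70 and report a miss instead):
    # ✗ Cache MISS [semantic] (scan finds match)
    # ✓ Cache HIT [semantic]
    # Semantic match found: True
else:
    print("⚠️ Skipping (no Redis)")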

## 5. Embedding & Context Caches

### Embedding Cache
Reduces OpenAI API calls by caching vector embeddings (1536 dims = ~6KB).
Includes stampede protection via per-key locks.
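
Per-key locking can be illustrated in a single process with `threading`. This is a sketch of the general technique, not the module's implementation: the class `SingleFlightCache` and the `fake_embed` function are hypothetical, and the cache is a plain dict rather than Redis.

```python
import threading


class SingleFlightCache:
    """In-process cache where only one thread computes a missing key."""

    def __init__(self, compute):
        self._compute = compute
        self._values = {}
        self._locks = {}
        self._locks_guard = threading.Lock()  # protects the lock table

    def _lock_for(self, key):
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        if key in self._values:             # fast path: no locking on a hit
            return self._values[key]
        with self._lock_for(key):
            if key not in self._values:     # re-check after acquiring the lock
                self._values[key] = self._compute(key)
            return self._values[key]


calls = []


def fake_embed(text):
    calls.append(text)  # record each real computation
    return [0.0, 1.0, 2.0]


demo = SingleFlightCache(fake_embed)
threads = [threading.Thread(target=demo.get, args=("hello",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("computations:", len(calls))
```

An in-process lock only protects one worker. With several application processes sharing one Redis, the usual approach is a short-lived lock key in Redis itself (for example `SET` with the `NX` and `EX` options), which this sketch does not cover.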

### Context Cache
Stores retrieved document snippets keyed by document IDs.
Multiple queries often fetch the same documents from vector DB.
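
An order-independent key is what lets two queries that retrieve the same documents share one cache entry. A minimal sketch, assuming string document IDs; the function name `context_cache_key`, the `ctx:` prefix, and the de-duplication step are hypothetical choices, not taken from the module.

```python
import hashlib
import json


def context_cache_key(doc_ids, prefix="ctx:"):
    """Build a cache key that ignores the order (and repeats) of doc IDs."""
    canonical = json.dumps(sorted(set(doc_ids)))  # JSON avoids delimiter clashes
    return prefix + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


key_a = context_cache_key(["doc_123", "doc_456", "doc_789"])
key_b = context_cache_key(["doc_789", "doc_123", "doc_456"])
print(key_a == key_b)
```

Hashing the canonical list keeps key length fixed even when a query retrieves many documents.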

In [None]:
# Embedding cache demonstration
if redis_client and openai_client:
    text = "machine learning embeddings"
    
    # First call - computes embedding
    embedding1 = cache.compute_or_get_embedding(text)
    
    # Second call - retrieves from cache
    embedding2 = cache.compute_or_get_embedding(text)
    
    if embedding1:
        print(f"Embedding cached: {len(embedding1)} dimensions")
        print(f"Match: {embedding1 == embedding2}")
    
    # Expected:
    # ✗ Cache MISS [embedding]
    # ✓ Cache HIT [embedding]
    # Embedding cached: 1536 dimensions
    # Match: True
else:
    print("⚠️ Skipping (no OpenAI/Redis)")

In [None]:
# Context cache demonstration
if redis_client:
    doc_ids = ["doc_123", "doc_456", "doc_789"]
    contexts = [
        {"id": "doc_123", "text": "Password reset instructions..."},
        {"id": "doc_456", "text": "Security best practices..."},
        {"id": "doc_789", "text": "Account recovery steps..."}
    ]
    
    # Cache contexts
    cache.set_context(doc_ids, contexts)
    
    # Retrieve (different order, same docs)
    doc_ids_reordered = ["doc_789", "doc_123", "doc_456"]
    cached_contexts = cache.get_context(doc_ids_reordered)
    
    print(f"Context cache hit: {cached_contexts is not None}")
    if cached_contexts:
        print(f"Retrieved {len(cached_contexts)} documents")
    
    # Expected:
    # ✓ Cache HIT [context]
    # Context cache hit: True
    # Retrieved 3 documents
else:
    print("⚠️ Skipping (no Redis)")

## 6. Invalidation Strategies

### Three Approaches
1. **TTL (Time-To-Live):** Automatic expiration via Redis
2. **Manual Bust:** Explicit invalidation when content updates
3. **Stale Detection:** Timestamp-based freshness policies

### Reality Check
Invalidation remains genuinely difficult. Choose conservative TTLs and monitor stale data incidents.
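
Timestamp-based stale detection reduces to storing a `cached_at` value with each entry and comparing it to a maximum age. The sketch below uses a plain dict so it runs without Redis; the helper names `make_entry`, `is_stale`, and `purge_stale` are hypothetical, and the optional `now` argument exists only to make the behavior reproducible.

```python
import time


def make_entry(value, now=None):
    """Wrap a value with the time it was cached."""
    return {"value": value, "cached_at": time.time() if now is None else now}


def is_stale(entry, max_age_seconds, now=None):
    """True when the entry is older than the allowed maximum age."""
    now = time.time() if now is None else now
    return (now - entry["cached_at"]) > max_age_seconds


def purge_stale(store, max_age_seconds, now=None):
    """Delete stale entries from a dict-like store; return how many were removed."""
    stale_keys = [k for k, e in store.items() if is_stale(e, max_age_seconds, now)]
    for k in stale_keys:
        del store[k]
    return len(stale_keys)


store = {
    "old": make_entry("cached an hour and a bit ago", now=0),
    "fresh": make_entry("cached recently", now=3000),
}
removed = purge_stale(store, max_age_seconds=3600, now=4000)
print(removed, sorted(store))
```

A freshness policy like this complements TTLs rather than replacing them: the TTL bounds how long an entry can live at all, while the timestamp lets different readers apply stricter limits to the same entry.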

In [None]:
# Invalidation demonstrations
if redis_client:
    # 1. Manual query invalidation
    query_to_bust = "How do I reset my password?"
    cache.invalidate_query(query_to_bust)
    print("✓ Invalidated specific query")
    
    # 2. Prefix-based invalidation (clear all semantic cache)
    count = cache.invalidate_by_prefix(config.PREFIX_SEMANTIC)
    print(f"✓ Cleared semantic cache: {count} keys")
    
    # 3. Stale data cleanup (entries older than 1 hour)
    cache.invalidate_stale(max_age_seconds=3600)
    
    # Expected:
    # ✓ Invalidated specific query
    # ✓ Cleared semantic cache: N keys
    # 🗑️ Invalidated N stale entries
else:
    print("⚠️ Skipping (no Redis)")

## 7. Common Failures & Fixes

### Five Production Issues
1. **Cache Stampede:** Concurrent requests overwhelm backend
2. **Stale Data:** Updates not reflected until TTL expires
3. **Memory Exhaustion:** Large embeddings fill Redis
4. **Hash Collisions:** Wrong results from weak hashes
5. **Low ROI:** <20% hit rate wastes infrastructure costs
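
For the memory exhaustion case, a quick estimate shows how many embeddings a given Redis budget can hold. This is back-of-the-envelope arithmetic under stated assumptions: vectors are stored as packed 32-bit floats (which gives the ~6KB figure for 1536 dimensions), and the 100 bytes of per-key overhead is a guess, not a measured value. The function names are hypothetical.

```python
def embedding_bytes(dims=1536, bytes_per_float=4):
    """Raw size of one embedding stored as packed floats."""
    return dims * bytes_per_float


def embeddings_that_fit(memory_budget_mb, dims=1536, bytes_per_float=4,
                        overhead_bytes=100):
    """Rough count of embeddings that fit in a memory budget.

    overhead_bytes is an assumed allowance for the key and Redis bookkeeping.
    """
    per_entry = embedding_bytes(dims, bytes_per_float) + overhead_bytes
    return (memory_budget_mb * 1024 * 1024) // per_entry


print(embedding_bytes())          # bytes per 1536-dim float32 vector
print(embeddings_that_fit(100))   # entries in a 100 MB budget
```

Embeddings serialized as JSON text take several times more space than packed floats, so the real capacity can be much lower. Setting Redis's `maxmemory` and an eviction policy such as `allkeys-lru` is the usual safety net against running out of memory.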

### Solutions Demonstrated Below

In [None]:
# Demonstrate cache metrics and ROI check
print("=== Cache Performance Metrics ===")
print(cache.metrics.summary())

# Check if caching is worthwhile
hit_rate = cache.metrics.get_hit_rate()
if hit_rate < 20:
    print(f"\n⚠️ WARNING: Hit rate {hit_rate:.1f}% too low - consider disabling cache")
else:
    print(f"\n✓ Hit rate {hit_rate:.1f}% - caching provides value")

# Expected output varies based on previous cells
# Example: Hits: 5, Misses: 3, Hit Rate: 62.5%

## 8. Decision Card & Cost Projection

### Decision Matrix

| Scenario | Daily Queries | Diversity | Freshness Need | Recommendation |
|----------|---------------|-----------|----------------|----------------|
| FAQ Bot | 5,000+ | <30% | >1 hour | ✓ Cache |
| News Search | 10,000+ | >90% | <5 min | ✗ Don't Cache |
| Support Docs | 2,000+ | 40% | >30 min | ✓ Cache |
| Research Q&A | 500 | 85% | Any | ✗ Don't Cache |
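
The Daily Queries column matters because Redis is a fixed cost that a small workload may never recover. The sketch below computes the hit rate at which LLM savings equal the Redis bill. It reuses this notebook's illustrative figures (1,500 tokens per query, $0.002 per 1K tokens, $10/month for Redis) as defaults; the function name `break_even_hit_rate` is hypothetical.

```python
def break_even_hit_rate(queries_per_day, avg_tokens=1500, cost_per_1k=0.002,
                        redis_monthly=10.0):
    """Hit rate (0 to 1) at which avoided LLM spend equals the Redis cost."""
    daily_llm_cost = queries_per_day * avg_tokens / 1000 * cost_per_1k
    redis_daily = redis_monthly / 30
    return redis_daily / daily_llm_cost


for volume in (500, 5000, 10000):
    rate = break_even_hit_rate(volume)
    print(f"{volume:>6} queries/day -> break-even hit rate {rate * 100:.1f}%")
```

Under these assumptions, 500 queries/day needs a hit rate above roughly 22% just to pay for Redis, while 10,000 queries/day breaks even at about 1%. A high-volume workload can therefore show positive savings even at a hit rate the diversity rule would reject, so weigh the dollar figure against the staleness and operational risks.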

### Cost Projection Calculator

In [None]:
# Interactive cost projection
def project_costs(queries_per_day, hit_rate_pct, avg_tokens=1500, cost_per_1k=0.002):
    """Project costs with and without caching."""
    hit_rate = hit_rate_pct / 100
    
    # Without cache
    cost_no_cache = (queries_per_day * avg_tokens / 1000) * cost_per_1k
    
    # With cache
    cache_misses = queries_per_day * (1 - hit_rate)
    cost_with_cache = (cache_misses * avg_tokens / 1000) * cost_per_1k
    
    # Redis costs (estimate $10/month for basic tier)
    redis_monthly = 10
    redis_daily = redis_monthly / 30
    
    total_with_cache = cost_with_cache + redis_daily
    savings = cost_no_cache - total_with_cache
    monthly_savings = savings * 30
    
    return {
        "no_cache": cost_no_cache,
        "with_cache_llm": cost_with_cache,
        "redis_daily": redis_daily,
        "total_with_cache": total_with_cache,
        "daily_savings": savings,
        "monthly_savings": monthly_savings,
        "roi": (savings / redis_daily * 100) if redis_daily > 0 else 0
    }

# Example scenarios
scenarios = [
    ("FAQ Bot (good fit)", 5000, 60),
    ("Support Docs (moderate)", 2000, 40),
    ("High Diversity (poor fit)", 10000, 10)
]

for name, queries, hit_rate in scenarios:
    result = project_costs(queries, hit_rate)
    print(f"\n{name}")
    print(f"  Queries/day: {queries:,}, Hit rate: {hit_rate}%")
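    print(f"  Daily: ${result['no_cache']:.2f} → ${result['total_with_cache']:.2f}")
    print(f"  Monthly savings: ${result['monthly_savings']:.2f}")
    print(f"  Verdict: {'✓ Deploy' if result['daily_savings'] > 0 else '✗ Skip'}")

# Expected (with the default assumptions above):
# FAQ Bot: about $260/month saved
# Support Docs: about $62/month saved
# High Diversity: about $80/month saved. The verdict still prints Deploy because
#   10,000 queries/day dwarfs the Redis cost, even though a 10% hit rate is
#   below the 20% threshold used in Section 7.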