# M2.4: Error Handling & Reliability

**Production-ready resilience patterns for RAG systems**

This notebook demonstrates:
- Retry strategies with exponential backoff
- Circuit breakers for cascading failure prevention
- Graceful degradation with fallbacks
- Request queueing and backpressure handling
- Real-world trade-offs and failure modes

**Build Protocol**: This notebook is built incrementally. Each section is saved separately.

## Purpose

**Make RAG systems production-ready with error handling that actually works.**

This module teaches you to handle the 2-5% base failure rate of external APIs (OpenAI, vector databases) automatically, reducing user-facing errors by 80-95%. You'll learn patterns used by companies like Netflix, AWS, and Google to keep services running even when dependencies fail.

**Real-world impact**: Transform "Service Unavailable 503" errors into slightly slower but functional responses.


## Concepts Covered

1. **Retry Strategy with Exponential Backoff**
   - Smart error classification: what to retry (5xx, 429) vs. what not to retry (other 4xx)
   - Exponential delays give a struggling service room to recover
   - Jitter spreads retry timing so clients do not retry in lockstep (thundering herd)

2. **Circuit Breaker Pattern**
   - Three-state machine: CLOSED → OPEN → HALF_OPEN
   - Prevents retry storms during outages
   - Automatic recovery testing

3. **Graceful Degradation**
   - Last-known-good caching with age annotations
   - Generic helpful messages > stack traces
   - Stale data vs no data trade-offs

4. **Request Queueing & Backpressure**
   - Bounded queues prevent memory exhaustion
   - Traffic spike protection
   - Reject some vs crash all

5. **Honest Trade-offs**
   - When NOT to use each pattern
   - Complexity vs reliability costs
   - Production tuning guidance
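
To make the first concept concrete, here is a minimal sketch of a backoff schedule. It does not reproduce the module's `RetryStrategy`: the function name `backoff_delays`, the doubling multiplier, the 60-second cap, and the "full jitter" choice are all assumptions made for illustration.

```python
import random

def backoff_delays(max_retries, initial_delay=1.0, multiplier=2.0,
                   max_delay=60.0, jitter=True, rng=None):
    """Return the wait (in seconds) before each retry attempt.

    Assumed schedule: initial_delay * multiplier**attempt, capped at max_delay.
    With jitter enabled, each wait is drawn uniformly from [0, capped delay]
    ("full jitter"), so clients that failed together do not retry together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        capped = min(max_delay, initial_delay * multiplier ** attempt)
        delays.append(rng.uniform(0.0, capped) if jitter else capped)
    return delays

# Without jitter the schedule is deterministic.
print(backoff_delays(4, initial_delay=0.5, jitter=False))  # [0.5, 1.0, 2.0, 4.0]
```

With `jitter=True` the same call returns a different list on every run, which is the point: the waits stay bounded by the exponential schedule but no longer line up across clients.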


## After Completing This Module

You will be able to:
- ✅ Distinguish retryable (5xx, 429) from non-retryable (other 4xx) errors
- ✅ Implement circuit breakers to prevent cascading failures
- ✅ Design graceful fallback strategies for degraded operation
- ✅ Handle traffic spikes with bounded queues
- ✅ Tune resilience thresholds based on monitoring data
- ✅ Make informed decisions about when NOT to add complexity
- ✅ Deploy RAG systems with 80-95% error reduction

**Production ready**: Copy patterns into your code, tune thresholds, deploy.


## Context in Track

**Prerequisites** (M1.x - M2.3):
- M1.1-M1.4: Basic RAG (embeddings, retrieval, generation)
- M2.1-M2.3: Chunking, vector search, evaluation

**This Module** (M2.4):
Making RAG systems **production-ready** with error handling.

**Next Steps**:
- M3.x: Advanced RAG (hybrid search, re-ranking)
- Production deployment with monitoring

**Why This Matters**:
- Without resilience: 2-5% base failure rate, cascading outages, crashes
- With resilience: roughly 0.1-1% error rate (an 80-95% reduction), graceful degradation, smooth traffic handling
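
As a quick sanity check on these numbers, the sketch below combines the 2-5% base failure rate with the 80-95% reduction quoted in this module. The helper name `residual_error_rate` is hypothetical, and the calculation assumes the reduction applies uniformly to all failures.

```python
def residual_error_rate(base_rate, reduction):
    """Error rate left after removing `reduction` (a fraction) of failures."""
    return base_rate * (1.0 - reduction)

best = residual_error_rate(0.02, 0.95)   # 2% base rate, 95% reduction
worst = residual_error_rate(0.05, 0.80)  # 5% base rate, 80% reduction
print(f"{best:.2%} to {worst:.2%}")  # 0.10% to 1.00%
```

So the quoted ranges imply a residual user-facing error rate between about 0.1% and 1%, depending on where your system falls in each range.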

---

## Section 1: Reality Check - What Resilience Solves (and Doesn't)

Before diving into implementation, let's be honest about what error handling **actually solves** and what it **doesn't**.

In [None]:
import sys
import time
import random
from datetime import datetime

# Add src to path for imports
sys.path.insert(0, '..')

# Import our resilience module from new location
from src.m2_4_error_handling import (
    RetryStrategy, with_retry, 
    CircuitBreaker, CircuitState,
    GracefulFallbacks,
    RequestQueue, QueueWorker,
    ResilientOpenAIClient
)

print("✓ Resilience patterns loaded from src.m2_4_error_handling")

### What Resilience Patterns SOLVE

✅ **Transient failures** - Network blips, temporary service outages  
✅ **Cascading failures** - One service down doesn't crash everything  
✅ **Rate limiting** - Automatic backoff when hitting API limits  
✅ **Load spikes** - Queue requests during traffic bursts  

**Impact**: Can reduce user-facing errors by 80-95% in production.

### What Resilience Patterns DON'T SOLVE

❌ **Bugs in your code** - Retrying a logic error won't fix it  
❌ **Data corruption** - Fallbacks can't restore bad data  
❌ **Permanent outages** - If the service is truly down, retries won't help  
❌ **Latency** - Retries ADD latency (50-200ms overhead per retry)  

**Reality**: These patterns add complexity. Use them when the trade-off makes sense.

In [None]:
# Trade-off visualization
print("RESILIENCE TRADE-OFFS")
print("=" * 60)
print("\n📊 Benefits:")
print("  • Error reduction: 80-95%")
print("  • User experience: Much smoother")
print("  • System stability: Prevents cascading failures")

print("\n⚠️  Costs:")
print("  • Code complexity: +20-30%")
print("  • Latency overhead: +50-200ms per retry")
print("  • Infrastructure cost: +10-20% (queue memory, retry traffic)")
print("  • Development time: 8-12 hours for full stack")

print("\n🎯 When to use:")
print("  ✓ User-facing apps with 10+ users")
print("  ✓ Production systems with external dependencies")
print("  ✓ Services where uptime > cost")

print("\n🚫 When NOT to use:")
print("  ✗ Simple internal tools (<10 users)")
print("  ✗ Real-time systems (<50ms SLA)")
print("  ✗ Batch processing (failures should be investigated)")

# Expected: Trade-off table printed

### Decision Framework

**Full Resilience Stack** (This Module):  
- Best for: User-facing apps, 10+ users
- Time: 8-12 hours implementation
- Cost: +10-20% infrastructure

**Fail Fast + Monitoring**:  
- Best for: Internal tools
- Time: 2-4 hours
- Cost: Requires on-call rotation

**External Orchestration** (Service Mesh):  
- Best for: Microservices
- Time: 0 application code
- Cost: Infrastructure complexity

**For this module**: We're implementing the full stack because you're building production RAG systems.

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:1")
print("Section 1 complete: Reality Check")
print("Next: Section 2 - Smart Retries")
print("=" * 60)

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:7")
print("Section 7 complete: Common Failures & Decision Card")
print("=" * 60)
print("\n🎉 ALL SECTIONS COMPLETE!")
print("\nNotebook built incrementally with 7 sections:")
print("  1. Reality Check")
print("  2. Smart Retries")
print("  3. Circuit Breaker")
print("  4. Graceful Degradation")
print("  5. Queueing & Backpressure")
print("  6. Putting It Together")
print("  7. Common Failures & Decision Card")
print("\nYou now have a production-ready resilience stack!")
print("=" * 60)

## Summary: Key Takeaways

1. **Resilience patterns solve transient failures**, not bugs or permanent outages
2. **Trade-offs are real**: Complexity vs reliability, latency vs availability
3. **Combine patterns**: Retry + Circuit Breaker + Fallback = robust system
4. **Tune for your context**: Conservative for critical systems, aggressive for cost optimization
5. **Monitor everything**: Track retry rates, circuit state, queue depth

### Next Steps

- Copy `src/m2_4_error_handling.py` (the module imported at the top of this notebook) into your RAG project
- Start with retry strategy (easiest, highest impact)
- Add circuit breaker for production deployment
- Tune thresholds based on your monitoring data

**Remember**: The best error handling is the kind users never notice.

In [None]:
# Production Configuration Reference
print("\n" + "=" * 60)
print("PRODUCTION CONFIGURATION REFERENCE")
print("=" * 60)

production_configs = {
    "Conservative (High Reliability)": {
        "retry_max_retries": 5,
        "retry_initial_delay": 2.0,
        "cb_failure_threshold": 3,
        "cb_recovery_timeout": 120.0,
        "queue_max_size": 500,
        "use_case": "Financial, healthcare, critical systems"
    },
    
    "Balanced (Recommended Default)": {
        "retry_max_retries": 3,
        "retry_initial_delay": 1.0,
        "cb_failure_threshold": 5,
        "cb_recovery_timeout": 60.0,
        "queue_max_size": 1000,
        "use_case": "Most production RAG systems"
    },
    
    "Aggressive (Fast Recovery)": {
        "retry_max_retries": 2,
        "retry_initial_delay": 0.5,
        "cb_failure_threshold": 10,
        "cb_recovery_timeout": 30.0,
        "queue_max_size": 2000,
        "use_case": "High-traffic, cost-sensitive, fast iteration"
    },
    
    "Cost-Optimized (Minimize API Costs)": {
        "retry_max_retries": 1,
        "retry_initial_delay": 2.0,
        "cb_failure_threshold": 5,
        "cb_recovery_timeout": 60.0,
        "queue_max_size": 500,
        "use_case": "Development, low-budget, internal tools"
    }
}

for profile, config in production_configs.items():
    print(f"\n{profile}:")
    print(f"  Use case: {config['use_case']}")
    print(f"  Configuration:")
    print(f"    • Retry max: {config['retry_max_retries']}")
    print(f"    • Retry delay: {config['retry_initial_delay']}s")
    print(f"    • CB threshold: {config['cb_failure_threshold']}")
    print(f"    • CB timeout: {config['cb_recovery_timeout']}s")
    print(f"    • Queue size: {config['queue_max_size']}")

print("\n💡 Start with 'Balanced', adjust based on monitoring data")
# Expected: Production config profiles

### Quick Reference: Production Thresholds

Copy these battle-tested values for your production deployment:
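
Before copying a profile, it can help to check how long a single request might spend backing off in the worst case. The sketch below sums the waits for each profile's `retry_max_retries` and `retry_initial_delay`. It assumes the delay doubles after every retry with no jitter and no cap; the helper `total_backoff` is hypothetical and the module's `RetryStrategy` may use a different schedule.

```python
def total_backoff(max_retries, initial_delay, multiplier=2.0):
    """Sum of the waits before all retries, assuming no jitter and no delay cap."""
    return sum(initial_delay * multiplier ** i for i in range(max_retries))

# (retry_max_retries, retry_initial_delay) taken from the profiles in this notebook
profiles = {
    "Conservative": (5, 2.0),
    "Balanced": (3, 1.0),
    "Aggressive": (2, 0.5),
    "Cost-Optimized": (1, 2.0),
}
for name, (retries, delay) in profiles.items():
    print(f"{name}: up to {total_backoff(retries, delay):.1f}s of backoff")
# Conservative: up to 62.0s of backoff
# Balanced: up to 7.0s of backoff
# Aggressive: up to 1.5s of backoff
# Cost-Optimized: up to 2.0s of backoff
```

Under these assumptions the Conservative profile can hold a request for over a minute before giving up, which is worth comparing against your client-side timeouts.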

In [None]:
# Decision Card
print("\n" + "=" * 60)
print("DECISION CARD: Choosing Resilience Patterns")
print("=" * 60)

decision_card = {
    "Retry Strategy": {
        "When to use": [
            "✓ External API calls (embeddings, completions)",
            "✓ Network operations",
            "✓ Transient failures expected (<5%)"
        ],
        "When NOT to use": [
            "✗ Database writes (idempotency issues)",
            "✗ Real-time systems (<50ms SLA)",
            "✗ Already using service mesh (duplication)"
        ],
        "Cost": "Low - 2-3 hours implementation",
        "Impact": "High - 80-95% error reduction"
    },
    
    "Circuit Breaker": {
        "When to use": [
            "✓ Protecting downstream services",
            "✓ Preventing cascading failures",
            "✓ Production systems with dependencies"
        ],
        "When NOT to use": [
            "✗ Single-service applications",
            "✗ Batch processing",
            "✗ When false positives unacceptable"
        ],
        "Cost": "Medium - 4-6 hours implementation + tuning",
        "Impact": "High - Prevents cascading failures"
    },
    
    "Graceful Degradation": {
        "When to use": [
            "✓ User-facing applications",
            "✓ When partial functionality acceptable",
            "✓ Cached data is useful"
        ],
        "When NOT to use": [
            "✗ Financial transactions (accuracy critical)",
            "✗ Real-time data requirements",
            "✗ When stale data is worse than no data"
        ],
        "Cost": "Low-Medium - 3-4 hours",
        "Impact": "Medium - Better UX during outages"
    },
    
    "Request Queue": {
        "When to use": [
            "✓ Traffic spikes expected",
            "✓ Rate-limited APIs",
            "✓ Background processing acceptable"
        ],
        "When NOT to use": [
            "✗ Latency-sensitive operations",
            "✗ Low traffic (<100 req/min)",
            "✗ When immediate response required"
        ],
        "Cost": "Medium - 4-5 hours",
        "Impact": "High - Prevents thundering herd"
    }
}

for pattern, details in decision_card.items():
    print(f"\n{pattern}")
    print(f"  When to use:")
    for item in details["When to use"]:
        print(f"    {item}")
    print(f"  When NOT to use:")
    for item in details["When NOT to use"]:
        print(f"    {item}")
    print(f"  Cost: {details['Cost']}")
    print(f"  Impact: {details['Impact']}")

# Expected: Complete decision matrix printed

### Decision Card: When to Use Each Pattern

Use this to decide which patterns to implement based on your constraints.

In [None]:
# Common Failures Documentation
print("COMMON FAILURE MODES & MITIGATIONS")
print("=" * 60)

failures = [
    {
        "name": "1. Retry Storm",
        "problem": "Aggressive retries amplify load during outages",
        "symptom": "Service goes from 90% down to 100% down due to retries",
        "mitigation": [
            "Use exponential backoff with jitter",
            "Limit max retries (3 is usually enough)",
            "Combine with circuit breaker to stop retry storm"
        ],
        "code_fix": "RetryStrategy(max_retries=3, jitter=True)"
    },
    {
        "name": "2. Circuit Breaker False Positives",
        "problem": "Over-sensitive thresholds reject valid requests",
        "symptom": "Circuit opens after 2-3 transient errors, blocks traffic",
        "mitigation": [
            "Tune failure_threshold higher (5-10 for production)",
            "Reduce recovery_timeout for faster recovery tests",
            "Monitor circuit state transitions"
        ],
        "code_fix": "CircuitBreaker(failure_threshold=10, recovery_timeout=30.0)"
    },
    {
        "name": "3. Queue Memory Exhaustion",
        "problem": "Unbounded queues consume all memory",
        "symptom": "System runs out of memory, crashes harder than without queue",
        "mitigation": [
            "ALWAYS use bounded queues",
            "Set max_size based on memory constraints",
            "Monitor queue depth and reject when full"
        ],
        "code_fix": "RequestQueue(max_size=1000)  # Bounded!"
    },
    {
        "name": "4. Graceful Degradation Stuck",
        "problem": "Fallbacks remain active after service recovers",
        "symptom": "Users get stale cached data even when service is healthy",
        "mitigation": [
            "Circuit breaker naturally handles this (HALF_OPEN tests recovery)",
            "Add cache TTL to expire old entries",
            "Monitor fallback usage rate"
        ],
        "code_fix": "Circuit breaker + retry automatically recover"
    },
    {
        "name": "5. Retrying Non-Retryable Errors",
        "problem": "Wasting time/money retrying 4xx errors that won't change",
        "symptom": "404 errors retried 3 times, quadrupling the calls made for each failure",
        "mitigation": [
            "Classify errors correctly (5xx = retry, 4xx = don't retry)",
            "Exception: 429 (rate limit) should retry",
            "Log non-retryable errors for debugging"
        ],
        "code_fix": "RetryStrategy with is_retryable() checks status codes"
    }
]

for failure in failures:
    print(f"\n{failure['name']}")
    print(f"  Problem: {failure['problem']}")
    print(f"  Symptom: {failure['symptom']}")
    print(f"  Mitigation:")
    for item in failure['mitigation']:
        print(f"    • {item}")
    print(f"  Code: {failure['code_fix']}")

# Expected: Full failure mode documentation

## Section 7: Common Failures & Decision Card

**Reality Check**: Even with all these patterns, things can still go wrong.

### 5 Common Failure Modes

1. **Retry Storm**: Aggressive retries amplify load during outages
2. **Circuit Breaker False Positives**: Over-sensitive thresholds reject valid requests
3. **Queue Memory Exhaustion**: Unbounded queues consume all memory
4. **Graceful Degradation Stuck**: Fallbacks remain active after recovery
5. **Retrying Non-Retryable Errors**: Wasting time and money on 4xx errors

Let's explore each one...
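
Failure mode 5 is the easiest to illustrate in isolation. The sketch below shows one way to classify errors before retrying. It is not the module's `RetryStrategy`: the `ApiError` class, the `call_with_retry` helper, and the doubling delay are hypothetical names and choices, and the classification rule (retry 5xx and 429 only) is the one stated in this notebook.

```python
import time

class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""
    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def is_retryable(status_code):
    """Retry server errors (5xx) and rate limits (429); never other 4xx."""
    return status_code == 429 or 500 <= status_code <= 599

def call_with_retry(func, max_retries=3, initial_delay=0.01, sleep=time.sleep):
    """Call func(); retry only retryable ApiErrors, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ApiError as err:
            # Give up at once on non-retryable errors or when retries run out.
            if not is_retryable(err.status_code) or attempt == max_retries:
                raise
            sleep(initial_delay * 2 ** attempt)

calls = {"n": 0}

def flaky():
    """Fails twice with 503, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ApiError(503)
    return "ok"

print(call_with_retry(flaky), calls["n"])  # ok 3
```

A function that raises `ApiError(404)` would be called exactly once and the error re-raised, so no time or money is spent on a request that cannot succeed.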

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:6")
print("Section 6 complete: Putting It Together")
print("Next: Section 7 - Common Failures & Decision Card")
print("=" * 60)

In [None]:
# Scenario 4: Traffic spike (queue protects system)
print("\n" + "=" * 60)
print("SCENARIO 4: Traffic Spike (Queue Provides Backpressure)")
print("=" * 60)

# Simulate burst
rag.api_failure_rate = 0.0  # API is fine, just traffic spike
burst_questions = [f"burst_query_{i}" for i in range(150)]

accepted = 0
rejected = 0

for q in burst_questions:
    result = rag.query(q)
    if result["source"] == "rejected":
        rejected += 1
    else:
        accepted += 1

print(f"\nTraffic burst: {len(burst_questions)} requests")
print(f"  ✓ Accepted: {accepted}")
print(f"  ✗ Rejected: {rejected}")
print(f"  Success rate: {(accepted/len(burst_questions)*100):.1f}%")

print("\n💡 Queue prevents system crash during traffic spikes")
print("   Trade-off: Some requests rejected vs complete system failure")

# Expected: all 150 requests accepted in this single-threaded demo, because query()
# dequeues each request before the next one arrives; rejections only appear when
# producers outpace the consumer (see the Section 5 burst demos)

In [None]:
# Scenario 3: Complete outage (fallback activates)
print("\n" + "=" * 60)
print("SCENARIO 3: Complete Outage (Fallbacks Activate)")
print("=" * 60)

# First, build cache
rag.api_failure_rate = 0.0
cached_question = "What is deep learning?"
rag.query(cached_question)  # Build cache
print(f"✓ Built cache for: {cached_question}")

# Now simulate total outage
rag.api_failure_rate = 1.0  # 100% failure
print("\n⚠️  API completely down, triggering failures...\n")

# Trigger circuit breaker to open
for i in range(6):
    try:
        rag.query("test query")
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        pass

print(f"Circuit breaker state: {rag.circuit_breaker.get_state().value.upper()}")

# Query with cached data available
result1 = rag.query(cached_question)
print(f"\n✓ Q: {cached_question}")
print(f"   A: {result1['answer'][:60]}...")
print(f"   Source: {result1['source']}")

# Query without cached data
result2 = rag.query("Brand new question")
print(f"\n⚠️  Q: Brand new question")
print(f"   A: {result2['answer'][:60]}...")
print(f"   Source: {result2['source']}")

print("\n💡 System degraded but functional - users see responses, not errors!")
# Expected: Cached response for known query, generic fallback for unknown

In [None]:
# Scenario 2: Transient failures (retries work)
print("=" * 60)
print("SCENARIO 2: Transient Failures (Retries Succeed)")
print("=" * 60)

rag.api_failure_rate = 0.6  # 60% failure rate

question = "What is machine learning?"
result = rag.query(question)
status = "✓" if not result["degraded_mode"] else "⚠️ "
print(f"{status} Q: {question}")
print(f"   A: {result['answer'][:60]}...")
print(f"   Source: {result['source']}, Degraded: {result['degraded_mode']}")

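print("\n💡 Retries handled transient failure automatically")
# Expected: Query usually succeeds after retries. At a 60% failure rate, all
# 4 attempts (1 call + 3 retries) fail about 13% of the time (0.6**4), in which
# case the fallback path answers instead.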

In [None]:
# Scenario 1: Normal operation
print("\n" + "=" * 60)
print("SCENARIO 1: Normal Operation (No Failures)")
print("=" * 60)

rag.api_failure_rate = 0.0

for i in range(3):
    question = f"What is concept_{i}?"
    result = rag.query(question)
    status = "✓" if not result["degraded_mode"] else "⚠️ "
    print(f"{status} Q: {question}")
    print(f"   A: {result['answer'][:60]}...")
    print(f"   Source: {result['source']}, Degraded: {result['degraded_mode']}")
    print()

print("💡 All queries succeed through normal path")
# Expected: 3 successful queries from live API

In [None]:
# Demo: Full resilience stack
print("DEMO: Complete Resilient RAG System")
print("=" * 60)

class ProductionRAG:
    """
    Production-ready RAG with full resilience stack:
    - Request queue for traffic spikes
    - Retry strategy for transient failures
    - Circuit breaker for cascading failure prevention
    - Graceful fallbacks for degraded mode
    """
    
    def __init__(self):
        self.queue = RequestQueue(max_size=100)
        self.retry_strategy = RetryStrategy(max_retries=3, initial_delay=0.5)
        self.circuit_breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=10.0)
        self.fallbacks = GracefulFallbacks()
        
        # Simulate external service state
        self.api_failure_rate = 0.0
    
    def query(self, question: str) -> dict:
        """
        Process query with full resilience.
        Returns: {answer, source, degraded_mode}
        """
        # Step 1: Queue (backpressure)
        if not self.queue.enqueue(question):
            return {
                "answer": "System is experiencing high load. Please try again shortly.",
                "source": "rejected",
                "degraded_mode": True
            }
        
        # Process from queue
        self.queue.dequeue()
        
        # Step 2: Try normal flow with retry + circuit breaker
        try:
            def _call_with_retry():
                def _call():
                    # Simulate API call
                    if random.random() < self.api_failure_rate:
                        raise ConnectionError("API temporarily unavailable")
                    
                    answer = f"AI-generated answer for: {question[:40]}..."
                    self.fallbacks.update_cache(question, answer)
                    return answer
                
                # Wrap in circuit breaker
                return self.circuit_breaker.call(_call)
            
            # Execute with retry
            answer = self.retry_strategy.execute(_call_with_retry)
            
            return {
                "answer": answer,
                "source": "live_api",
                "degraded_mode": False
            }
        
        except Exception as e:
            # Step 3: Fallback path
            cached = self.fallbacks.get_last_known_good(question)
            if cached:
                answer, age = cached
                return {
                    "answer": f"{answer} [Cached: {age:.0f}s old]",
                    "source": "cache",
                    "degraded_mode": True
                }
            
            return {
                "answer": self.fallbacks.get_generic_answer(question),
                "source": "generic_fallback",
                "degraded_mode": True
            }

# Initialize system
rag = ProductionRAG()
print("✓ Production RAG initialized with full resilience stack")
print()

## Section 6: Putting It All Together - Full Resilience Stack

**The Goal**: Combine all patterns into a production-ready RAG system.

### Architecture
```
User Query → Queue → Retry → Circuit Breaker → API Call
                ↓ (if fails)
            Fallback → Cached Response
```

This demonstrates a complete resilient system with all patterns working together.

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:5")
print("Section 5 complete: Queueing & Backpressure")
print("Next: Section 6 - Putting It Together")
print("=" * 60)

In [None]:
# Demo 3: Simulating traffic spike
print("\nDEMO: Handling Traffic Spike")
print("=" * 60)

# Simulate before/after queue implementation
print("\n1. WITHOUT queue (system crashes):")
print("  100 simultaneous requests → System overwhelmed")
print("  ✗ Memory exhausted")
print("  ✗ Service crashes")
print("  ✗ 0% requests succeed")

print("\n2. WITH bounded queue (graceful handling):")
spike_queue = RequestQueue(max_size=50)
burst_size = 100
accepted_during_burst = 0

for i in range(burst_size):
    if spike_queue.enqueue(f"burst_req_{i}"):
        accepted_during_burst += 1

success_rate = (accepted_during_burst / burst_size) * 100
print(f"  100 simultaneous requests → Queue absorbs spike")
print(f"  ✓ {accepted_during_burst} queued for processing")
print(f"  ⚠️  {burst_size - accepted_during_burst} rejected (backpressure)")
print(f"  ✓ Success rate: {success_rate:.0f}%")
print(f"  ✓ System stable")

print("\n  💡 Trade-off: 50% success with queue vs 0% without!")
print("     Users see 'Please wait' instead of crashes.")

# Expected: Queue handles spike gracefully, some requests rejected

In [None]:
# Demo 2: Queue worker pattern
print("\nDEMO: Queue Worker Processing")
print("=" * 60)

# Create queue and worker
work_queue = RequestQueue(max_size=20)
processed_items = []

def process_request(item):
    """Simulate processing a request."""
    time.sleep(0.1)  # Simulate work
    result = f"Processed: {item}"
    processed_items.append(result)
    print(f"  ✓ {result}")

worker = QueueWorker(work_queue, process_request)

# Enqueue some work
print("\n1. Adding work to queue:")
for i in range(5):
    work_queue.enqueue(f"task_{i}")
print(f"  Added 5 tasks (queue size: {work_queue.size()})")

# Start worker
print("\n2. Starting worker...")
worker.start()
time.sleep(1.0)  # Let it process

# Check results
print(f"\n3. Processing complete:")
print(f"  Items processed: {len(processed_items)}")
print(f"  Queue size: {work_queue.size()}")

# Clean up
worker.stop()
print("\n  💡 Worker processes queue in background!")

# Expected: 5 items processed, queue empty

In [None]:
# Demo 1: Request queue with bounded size
print("DEMO: Request Queue with Backpressure")
print("=" * 60)

# Create queue with small capacity for demo
queue = RequestQueue(max_size=10)

# Simulate burst of requests
print("\n1. Simulating traffic burst (15 requests, capacity 10):")
requests = [f"query_{i}" for i in range(15)]
accepted = 0
rejected = 0

for req in requests:
    if queue.enqueue(req):
        accepted += 1
    else:
        rejected += 1

print(f"\n  Results:")
print(f"  ✓ Accepted: {accepted}")
print(f"  ✗ Rejected: {rejected} (backpressure activated)")
print(f"  Queue size: {queue.size()}/{queue.max_size}")

# Show queue stats
print(f"\n2. Queue statistics:")
stats = queue.stats()
for key, value in stats.items():
    print(f"  {key}: {value}")

print("\n  💡 Bounded queue prevents memory exhaustion!")

# Expected: 10 accepted, 5 rejected

## Section 5: Queueing & Backpressure

**The Problem**: Traffic spikes overwhelm your system (thundering herd).  
**The Solution**: Queue requests and process at sustainable rate.

### Key Concepts
- **FIFO Queue**: Process requests in order
- **Bounded size**: Prevent memory exhaustion
- **Backpressure**: Reject requests when queue is full
- **Worker pattern**: Background processing

**Trade-off**: Added latency vs. system stability.
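
The key concepts above fit in a few lines. The sketch below is a hypothetical stand-in for the module's `RequestQueue`, written only to show how a bounded FIFO queue turns overload into explicit rejections. The class name `BoundedQueue` and its `rejected` counter are assumptions, and the sketch is not thread-safe.

```python
from collections import deque

class BoundedQueue:
    """Hypothetical FIFO queue that rejects new items once it is full."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = deque()
        self.rejected = 0

    def enqueue(self, item):
        if len(self._items) >= self.max_size:
            self.rejected += 1
            return False  # backpressure: the caller should ask the client to retry later
        self._items.append(item)
        return True

    def dequeue(self):
        """Return the oldest item, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def size(self):
        return len(self._items)

q = BoundedQueue(max_size=3)
results = [q.enqueue(f"req_{i}") for i in range(5)]
print(results)      # [True, True, True, False, False]
print(q.dequeue())  # req_0
```

Memory use is capped at `max_size` items no matter how large the burst is. For a multi-threaded worker, Python's standard `queue.Queue(maxsize=...)` with `put_nowait` provides the same bounded behavior with locking built in.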

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:4")
print("Section 4 complete: Graceful Degradation")
print("Next: Section 5 - Queueing & Backpressure")
print("=" * 60)

In [None]:
# Demo 3: Circuit breaker + fallback integration
print("\nDEMO: Circuit Breaker + Fallback Integration")
print("=" * 60)

class ResilientRAG:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=5.0)
        self.fallbacks = GracefulFallbacks()
        self.service_is_down = False
    
    def query(self, question: str) -> str:
        try:
            # Try normal flow through circuit breaker
            def _call():
                if self.service_is_down:
                    raise ConnectionError("RAG service unavailable")
                answer = f"Fresh answer: {question[:30]}..."
                self.fallbacks.update_cache(question, answer)
                return answer
            
            return self.circuit_breaker.call(_call)
        
        except Exception as e:
            # Fallback path
            print(f"  ⚠️  Error: {type(e).__name__}")
            return self.fallbacks.get_cached_or_fallback(
                question,
                self.fallbacks.get_generic_answer(question)
            )

rag = ResilientRAG()

# Build cache
print("\n1. Normal operation (building cache):")
question = "What is AI?"
answer = rag.query(question)
print(f"  ✓ {answer}")

# Simulate outage
print("\n2. Service goes down (using fallback):")
rag.service_is_down = True
for i in range(4):
    answer = rag.query(question)
    state = rag.circuit_breaker.get_state().value
    print(f"  Attempt {i+1} (CB: {state}): {answer[:50]}...")

print("\n  💡 Users get cached answers instead of errors!")

# Expected: Circuit opens, fallbacks activate, users see cached data

In [None]:
# Demo 2: Last-known-good with age indicator
print("\nDEMO: Last-Known-Good Pattern")
print("=" * 60)

# Simulate successful response
question = "What is deep learning?"
answer = "Deep learning uses neural networks with multiple layers..."
fallbacks.update_cache(question, answer)

print(f"\n1. Fresh cache entry created")
time.sleep(2)  # Wait 2 seconds

# Retrieve with age
result = fallbacks.get_last_known_good(question)
if result:
    cached_answer, age = result
    print(f"\n2. Retrieved cached response:")
    print(f"  Answer: {cached_answer[:50]}...")
    print(f"  Age: {age:.1f} seconds old")
    print(f"\n  💡 User sees: '{cached_answer[:40]}...'")
    print(f"     [Note: Using cached response from {age:.0f}s ago]")

# Unknown question
result = fallbacks.get_last_known_good("Unknown question?")
if result is None:
    print(f"\n3. No cache available:")
    print(f"  → Use generic fallback message")

# Expected: Cache with age indicator, graceful handling of misses

In [None]:
# Demo 1: Fallback with cached responses
print("DEMO: Graceful Degradation with Fallbacks")
print("=" * 60)

fallbacks = GracefulFallbacks()

# Simulate successful RAG queries
print("\n1. Building cache with successful responses:")
questions = [
    "What is machine learning?",
    "How does a neural network work?",
    "What is gradient descent?"
]

for q in questions:
    answer = f"[AI-generated answer about: {q}]"
    fallbacks.update_cache(q, answer)
    print(f"  ✓ Cached: {q[:40]}...")

# Simulate service failure - use cache
print("\n2. Service fails - using cached responses:")
cached_answer = fallbacks.get_cached_or_fallback(
    questions[0],
    "Service temporarily unavailable"
)
print(f"  📦 From cache: {cached_answer}")

# Try unknown question - use generic fallback
print("\n3. Unknown question - generic fallback:")
new_question = "What is quantum computing?"
fallback_answer = fallbacks.get_cached_or_fallback(
    new_question,
    fallbacks.get_generic_answer(new_question)
)
print(f"  ⚠️  Fallback: {fallback_answer[:80]}...")

# Expected: Cached answers served, generic fallback for unknown queries

## Section 4: Graceful Degradation - Fallback Strategies

**The Problem**: When services fail, users get cryptic error messages.  
**The Solution**: Provide degraded but functional responses.

### Fallback Patterns
1. **Cached responses** - Return last-known-good answer
2. **Generic helpful messages** - Better than a stack trace
3. **Partial functionality** - Some features work, others degraded

**Key trade-off**: Stale data vs. no data at all.
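
The first two fallback patterns can be sketched without the module. The class below is a hypothetical stand-in for `GracefulFallbacks`, not its actual implementation: the names `LastKnownGood` and `answer_or_fallback`, the optional `max_age` TTL, and the injectable clock are all assumptions chosen to keep the example small and testable.

```python
import time

class LastKnownGood:
    """Hypothetical last-known-good cache that reports the age of each entry."""

    def __init__(self, max_age=None, clock=time.monotonic):
        self.max_age = max_age  # optional TTL in seconds; None keeps entries forever
        self.clock = clock
        self._store = {}        # question -> (answer, stored_at)

    def update(self, question, answer):
        self._store[question] = (answer, self.clock())

    def get(self, question):
        """Return (answer, age_seconds), or None if missing or older than max_age."""
        entry = self._store.get(question)
        if entry is None:
            return None
        answer, stored_at = entry
        age = self.clock() - stored_at
        if self.max_age is not None and age > self.max_age:
            return None
        return answer, age

def answer_or_fallback(cache, question, generic="Service temporarily unavailable."):
    """Prefer a cached answer annotated with its age; otherwise a generic message."""
    hit = cache.get(question)
    if hit is None:
        return generic
    answer, age = hit
    return f"{answer} [cached {age:.0f}s ago]"

now = [100.0]  # fake clock so the example is deterministic
cache = LastKnownGood(max_age=300, clock=lambda: now[0])
cache.update("What is AI?", "AI is ...")
now[0] = 130.0
print(answer_or_fallback(cache, "What is AI?"))      # AI is ... [cached 30s ago]
print(answer_or_fallback(cache, "Unseen question"))  # Service temporarily unavailable.
```

The `max_age` check is where the stale-versus-none trade-off becomes a tunable: a short TTL favors "no data", a long one favors "stale data".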

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:3")
print("Section 3 complete: Circuit Breaker")
print("Next: Section 4 - Graceful Degradation")
print("=" * 60)

In [None]:
# Demo 2: Circuit breaker protecting RAG system
print("\nDEMO: Circuit-Protected RAG System")
print("=" * 60)

class MockOpenAIClient:
    def __init__(self):
        self.circuit_breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=5.0)
        self.is_down = False
    
    def get_embedding(self, text):
        def _call():
            if self.is_down:
                raise ConnectionError("OpenAI API unavailable")
            return [random.random() for _ in range(8)]  # Mock embedding
        
        return self.circuit_breaker.call(_call)

client = MockOpenAIClient()

# Normal operation
print("\n1. Normal operation (service UP):")
for i in range(3):
    try:
        emb = client.get_embedding(f"query_{i}")
        print(f"  ✓ Embedding {i+1}: {emb[:3]}... (state: {client.circuit_breaker.get_state().value})")
    except Exception as e:
        print(f"  ✗ {e}")

# Simulate service outage
print("\n2. Service goes DOWN (circuit should open):")
client.is_down = True
for i in range(5):
    try:
        emb = client.get_embedding(f"query_fail_{i}")
        print(f"  ✓ Embedding {i+1}")
    except Exception as e:
        state = client.circuit_breaker.get_state().value
        print(f"  ✗ Call {i+1} failed (state: {state})")

# Circuit is open - requests rejected immediately
print("\n3. Circuit OPEN - rejecting requests:")
print(f"  State: {client.circuit_breaker.get_state().value.upper()}")
print("  💡 No more API calls attempted - preventing cascade!")

# Expected: Circuit opens after 3 failures, then rejects immediately

In [None]:
# Demo 1: Circuit breaker lifecycle
print("DEMO: Circuit Breaker State Transitions")
print("=" * 60)

# Create circuit breaker with low threshold for demo
cb = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=5.0,
    expected_exception=ConnectionError
)

def unstable_service(should_fail=True):
    """Simulates an unstable service."""
    if should_fail:
        raise ConnectionError("Service is down")
    return "Service OK"

print(f"\nInitial state: {cb.get_state().value}")

# Trigger failures to open circuit
print("\n1. Causing failures to open circuit...")
for i in range(4):
    try:
        cb.call(unstable_service, should_fail=True)
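    except Exception as e:
        # Calls 1-3 fail with ConnectionError; call 4 may instead be rejected by the open circuit
        print(f"  Attempt {i+1}: Failed ({type(e).__name__}) - State: {cb.get_state().value}")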

print(f"\n✗ Circuit is now: {cb.get_state().value.upper()}")

# Try calling while circuit is open
print("\n2. Attempting calls while circuit is OPEN...")
try:
    cb.call(unstable_service, should_fail=False)
except Exception as e:
    print(f"  ✗ Rejected: {e}")

# Wait for recovery timeout
print(f"\n3. Waiting {cb.recovery_timeout}s for recovery timeout...")
time.sleep(cb.recovery_timeout + 0.5)

# Circuit should transition to HALF_OPEN on next call
print("\n4. Testing recovery (HALF_OPEN)...")
try:
    result = cb.call(unstable_service, should_fail=False)
    print(f"  ✓ {result}")
    print(f"  Circuit state: {cb.get_state().value}")
except Exception as e:
    print(f"  ✗ Recovery failed: {e}")

# Expected: CLOSED → OPEN → HALF_OPEN → CLOSED

## Section 3: Circuit Breaker - Preventing Cascading Failures

**The Problem**: When a service is down, every client retrying at once piles more load onto it (a retry storm).  
**The Solution**: Circuit breaker stops trying after N failures, then tests recovery.

### State Machine
```
CLOSED (normal) → OPEN (failing) → HALF_OPEN (testing) → CLOSED (recovered)
```

- **CLOSED**: Normal operation, tracking failures
- **OPEN**: Rejecting all requests (service is down)
- **HALF_OPEN**: Testing if service recovered

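**Key insight**: A circuit breaker prevents cascading failures, but a threshold set too low can trip on a brief blip (a false positive) and reject requests that a healthy service would have handled.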
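The three states and their transitions can be sketched in a few dozen lines. This is an illustrative toy under stated assumptions, not the `CircuitBreaker` class used in this notebook: `TinyBreaker`, `CircuitOpenError`, and the injectable `clock` argument are hypothetical names, and the sketch counts every exception as a failure.

```python
import time
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the service."""

class TinyBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so tests need not sleep
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: let one trial call through.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = State.OPEN
                self.opened_at = self.clock()
            raise
        # Any success resets the failure count and closes the circuit.
        self.failures = 0
        self.state = State.CLOSED
        return result
```

Rejected calls raise a distinct exception type here so callers can tell "the service failed" apart from "the breaker refused to try", which matters when choosing between a retry and a fallback.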

In [None]:
print("\n" + "=" * 60)
print("SAVED_SECTION:2")
print("Section 2 complete: Smart Retries")
print("Next: Section 3 - Circuit Breaker")
print("=" * 60)

In [None]:
# Demo 3: Retryable vs Non-retryable errors
print("\nDEMO: Smart Error Classification")
print("=" * 60)

class APIError(Exception):
    def __init__(self, status_code, message):
        self.status_code = status_code
        super().__init__(message)

def api_with_different_errors(error_type="500"):
    """Simulate API with different error types."""
    if error_type == "500":
        raise APIError(500, "Internal Server Error - Retryable")
    elif error_type == "429":
        raise APIError(429, "Rate Limit - Retryable")
    elif error_type == "404":
        raise APIError(404, "Not Found - Non-retryable")
    elif error_type == "401":
        raise APIError(401, "Unauthorized - Non-retryable")
    return "Success"

strategy = RetryStrategy(max_retries=2, initial_delay=0.3)

# Test retryable error (500)
print("\nTest 1: 500 Server Error (should retry)")
try:
    # Simulate recovery on 2nd attempt
    attempt = [0]
    def call_500():
        attempt[0] += 1
        if attempt[0] < 2:
            raise APIError(500, "Server error")
        return "Recovered!"
    result = strategy.execute(call_500)
    print(f"✓ {result}")
except Exception as e:
    print(f"✗ {e}")

# Test non-retryable error (404)
print("\nTest 2: 404 Not Found (should NOT retry)")
try:
    strategy.execute(api_with_different_errors, error_type="404")
except APIError as e:
    print(f"✗ Immediately failed (no retries): {e}")

print("\n💡 Key insight: Retrying 404s wastes time and money!")
# Expected: 500 retries, 404 fails immediately

In [None]:
# Demo 2: Retry decorator (cleaner syntax)
print("\nDEMO: Using @with_retry Decorator")
print("=" * 60)

@with_retry(max_retries=3, initial_delay=0.5, jitter=True)
def fetch_embeddings(text: str):
    """Simulate embedding API call."""
    if random.random() < 0.6:  # 60% failure rate
        raise ConnectionError("Embedding API temporarily unavailable")
    return f"Embedding[1536] for: {text[:30]}..."

# Use the decorated function
print("\nFetching embeddings...")
try:
    embedding = fetch_embeddings("What is machine learning?")
    print(f"✓ {embedding}")
except Exception as e:
    print(f"✗ Failed: {e}")

# Expected: Automatic retries, cleaner code

In [None]:
# Demo 1: Basic retry with exponential backoff
print("DEMO: Retry with Exponential Backoff")
print("=" * 60)

def flaky_api_call(failure_rate=0.7, call_id="test"):
    """Simulates a flaky API that fails randomly."""
    if random.random() < failure_rate:
        raise ConnectionError(f"API call {call_id} failed (simulated)")
    return f"Success: {call_id}"

# Create retry strategy
strategy = RetryStrategy(
    max_retries=3,
    initial_delay=0.5,  # Short for demo
    exponential_base=2.0,
    jitter=True
)

# Try calling the flaky API
print("\nAttempting flaky API call (70% failure rate)...")
try:
    result = strategy.execute(flaky_api_call, failure_rate=0.7, call_id="demo-001")
    print(f"\n✓ Final result: {result}")
except Exception as e:
    print(f"\n✗ All retries exhausted: {e}")

# Expected: Shows 1-4 attempts with exponential delays

## Section 2: Smart Retries with Exponential Backoff

**The Problem**: Network calls fail ~2-5% of the time due to transient issues.  
**The Solution**: Retry with increasing delays + jitter to prevent thundering herd.

### Key Concepts
- **Exponential backoff**: Each retry waits longer (1s → 2s → 4s)
- **Jitter**: Add randomness to prevent synchronized retries
- **Retryable vs non-retryable**: Don't retry 4xx errors (except 429)
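
The three concepts above reduce to two small functions. The sketch below is illustrative only and is not the notebook's `RetryStrategy`: `backoff_delay`, `is_retryable`, and the `max_delay` cap are hypothetical names and parameters, and the jitter variant shown (a uniform draw between zero and the computed delay, often called "full jitter") is one common choice among several.

```python
import random

def backoff_delay(attempt, initial_delay=1.0, exponential_base=2.0,
                  max_delay=60.0, jitter=False, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # The delay grows geometrically, capped so it cannot grow without bound.
    delay = min(initial_delay * exponential_base ** attempt, max_delay)
    if jitter:
        # Spread clients out so they do not all retry at the same instant.
        delay = rng.uniform(0.0, delay)
    return delay

def is_retryable(status_code):
    """Retry server errors (5xx) and rate limits (429); other 4xx fail fast."""
    return status_code == 429 or 500 <= status_code < 600

print([backoff_delay(n) for n in range(3)])  # [1.0, 2.0, 4.0]
```

Without jitter the delays are deterministic, which makes the schedule easy to reason about but lets many clients that failed together retry together; turning jitter on trades predictability for spread.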