# Bridge M2.3 → M2.4: Observability to Self-Healing

**Duration:** 8-10 min | **Module:** 2 - Cost Optimization  
**Transition:** M2.3 (Production Monitoring) → M2.4 (Error Handling & Reliability)

---

## Run Locally

**Windows:**
```powershell
$env:PYTHONPATH = "$PWD"; jupyter notebook
```

**Linux/Mac:**
```bash
PYTHONPATH=$PWD jupyter notebook
```

---

## Purpose

This bridge validates your M2.3 observability infrastructure before advancing to M2.4 self-healing patterns. You've built monitoring that shows WHEN failures occur; M2.4 will teach your system to recover automatically from those failures. Without working metrics, alerts, and correlation logging, you cannot measure whether retries and circuit breakers are actually reducing errors.

## Concepts Covered

**Validation Delta (M2.3 → M2.4):**
- Baseline error rate measurement (current state before implementing resilience)
- Alert infrastructure for circuit breaker state transitions
- Log correlation requirements for tracing retry attempts
- Metric endpoints as prerequisites for measuring error handling effectiveness

## After Completing

You will be able to:
- Verify Prometheus is scraping your RAG pipeline metrics at <30s intervals
- Confirm Grafana dashboards display p95 latency, error rate, cache hit rate, and cost metrics
- Validate 3+ alert rules are configured and can trigger notifications
- Trace full request lifecycles using correlation IDs in structured logs
- Document baseline error rate (current ~0.5-2%) for comparison after M2.4 (target <0.1%)

## Context in Track

**Bridge: Module 2.3 → Module 2.4**

**Previous (M2.3):** Production Monitoring & Observability  
**Current Bridge:** Readiness validation for self-healing capabilities  
**Next (M2.4):** Error Handling & Reliability (retries, circuit breakers, graceful degradation)

**Module 2 Progress:** 3 of 4 complete (Caching, Prompt Optimization, Monitoring) → Resilience next

---

## Section 1: M2.3 Accomplishments Recap

### What You Built in M2.3

Congratulations! You completed M2.3 and built production-grade observability that most systems take months to implement properly.

#### ✓ 1. Complete Observability Stack
- **Set up:** Prometheus + Grafana monitoring infrastructure from scratch
- **Tracking:** 8-10 production metrics including:
  - Latency percentiles (p50, p95, p99)
  - Cache hit rates
  - Cost per query
  - Error rates
- **Update frequency:** Real-time dashboards refreshing every 15 seconds

#### ✓ 2. Production-Grade Alerting
- **Alert rules configured:**
  - Rate limit headroom <20%
  - p95 latency >500ms sustained
  - Error rate >1% for 5 minutes
- **Delivery:** Slack/email notifications tested and tuned
- **Result:** Alerts fire on actual problems, avoiding alert fatigue

#### ✓ 3. Metric Cardinality Management
- **Problem solved:** Unbounded labels crash Prometheus
- **Debugged:** Cardinality explosion (2GB RAM usage)
- **Fixed:** Removed high-cardinality labels (user IDs), implemented proper label design
- **Learning:** Use categories, not IDs, in metric labels
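
To see why IDs in labels are dangerous, note that the number of time series for one metric is bounded by the product of the distinct values of each label. The sketch below illustrates that arithmetic; the label names and value counts are illustrative assumptions, not figures from your M2.3 setup.

```python
from math import prod

def max_series(label_value_counts):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())

# Hypothetical label sets for a single counter such as rag_requests_total.
with_user_id = {"endpoint": 5, "status": 4, "user_id": 50_000}
with_user_tier = {"endpoint": 5, "status": 4, "user_tier": 3}

print(max_series(with_user_id))    # 1000000
print(max_series(with_user_tier))  # 60
```

Swapping the ID for a small, fixed category keeps the series count bounded no matter how many users you have.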

#### ✓ 4. Structured Logging with Metrics
- **Implementation:** JSON logging with correlation via request_id and query_id
- **Benefit:** Jump from Grafana alert (symptom) to logs (root cause) in <60 seconds
- **Impact:** MTTR reduced from hours to minutes

### The Impact

**Before M2.3:** Production incidents took hours to diagnose  
**After M2.3:**
- Detect problems: 1-2 minutes
- Get debugging context: 5 minutes
- Find root cause: Minutes (was hours)

---

### Why M2.4 Matters: From Visibility to Resilience

**The reality:** Monitoring tells you THAT something is wrong. It doesn't prevent failures or auto-recover.

#### Real Incident Example

Perfect visibility, zero resilience:
- **2:47 AM:** Alert fires: "OpenAI rate limit at 95%"
- **2:48 AM:** Alert fires: "p95 latency >800ms"
- **2:49 AM:** Alert fires: "Error rate >5%"
- **Result:** 150 queries failed before manual intervention

#### Cost of Brittleness

**Revenue Impact:**
- 150 failed queries × $20 avg = $3,000 revenue risk
- 5% churn among the 150 affected users = 7-8 lost customers
- 1 negative review = 50-200 potential customers turned away

**Time Cost:**
- 3 AM wake-up + degraded performance
- Manual recovery: 15-30 min per incident
- Post-incident debugging: 2-4 hours
- **Total:** ~6 hours @ $100/hr = $600 per incident

**Reputation:**
- Users lose trust after 2-3 failures
- Support ticket spike ($25-50 each)
- Team morale drops

#### M2.4 Goal: Self-Healing Systems

Turn a system that *alerts when it breaks* into one that *fixes itself and only alerts when human intervention is truly needed*.

**M2.4 capabilities:**
- Retry transient failures (API timeouts, rate limits)
- Circuit breakers prevent cascading failures
- Graceful degradation when dependencies fail
- Request queuing during temporary overload

---

## Section 2: Validation Check 1 - Prometheus Scrape Target

### ✓ Check: Prometheus Scraping Your RAG Pipeline

**What to verify:**
- Prometheus targets show your app as "State: UP"
- Last scrape timestamp <30 seconds ago
- `/metrics` endpoint returns valid Prometheus metrics

**Why this matters for M2.4:**
- Need baseline error rate (current ~0.5-2%) to measure improvement
- Target error rate after M2.4: <0.1%
- Cannot measure retry effectiveness without working metrics

**Success criteria:**
- `http://localhost:9090/targets` shows app as UP
- Metrics endpoint accessible and returning data
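
If you would rather check target state programmatically than read the targets page, Prometheus also exposes the same information as JSON at `/api/v1/targets`. The following is a minimal sketch of how the "UP and scraped within 30 seconds" rule could be applied to that payload. The helper names and the sample payload are assumptions made for illustration; only the `activeTargets`, `health`, `lastScrape`, and `scrapeUrl` fields come from the Prometheus API.

```python
import re
from datetime import datetime, timezone

def parse_rfc3339(ts):
    """Parse an RFC 3339 timestamp, normalising the fraction to microseconds."""
    ts = ts.replace("Z", "+00:00")
    # Prometheus may emit nanoseconds; keep exactly six fractional digits.
    ts = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), ts)
    return datetime.fromisoformat(ts)

def stale_or_down_targets(payload, now, max_age_s=30):
    """Return scrape URLs of active targets that are not 'up' or were scraped too long ago."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        age = (now - parse_rfc3339(target["lastScrape"])).total_seconds()
        if target["health"] != "up" or age > max_age_s:
            problems.append(target["scrapeUrl"])
    return problems

# Hypothetical payload shaped like a /api/v1/targets response.
now = datetime(2024, 1, 1, 12, 0, 30, tzinfo=timezone.utc)
sample = {
    "status": "success",
    "data": {"activeTargets": [
        {"scrapeUrl": "http://localhost:8000/metrics", "health": "up",
         "lastScrape": "2024-01-01T12:00:20.123456789Z"},
        {"scrapeUrl": "http://localhost:8001/metrics", "health": "up",
         "lastScrape": "2024-01-01T11:58:00Z"},
        {"scrapeUrl": "http://localhost:8002/metrics", "health": "down",
         "lastScrape": "2024-01-01T12:00:25Z"},
    ]},
}

print(stale_or_down_targets(sample, now))
```

Against a live server, the payload would come from something like `requests.get("http://localhost:9090/api/v1/targets", timeout=5).json()`, with `now` set to `datetime.now(timezone.utc)`.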

---

**Run Check 1:** Attempts to contact Prometheus (localhost:9090) and your app metrics endpoint (localhost:8000). If services are offline, displays stub messages with validation steps instead of failing.

```python
import requests
from datetime import datetime

def check_metrics_endpoint():
    """Check that the app /metrics endpoint and the Prometheus targets page respond.

    Returns True only if both respond with HTTP 200. Offline services print
    stub guidance instead of raising.
    """
    metrics_url = "http://localhost:8000/metrics"
    prometheus_url = "http://localhost:9090/targets"
    app_ok = False
    prometheus_ok = False

    print("=" * 60)
    print("CHECK 1: Prometheus /metrics Endpoint")
    print("=" * 60)

    # Check app metrics endpoint (offline-friendly)
    try:
        response = requests.get(metrics_url, timeout=5)
        if response.status_code == 200:
            # Count sample lines, skipping blanks and '#' HELP/TYPE comments
            metrics_count = len([line for line in response.text.splitlines() if line and not line.startswith('#')])
            app_ok = True
            print("✓ App /metrics endpoint: ACCESSIBLE")
            print(f"  - URL: {metrics_url}")
            print(f"  - Status: {response.status_code}")
            print(f"  - Metrics found: {metrics_count} lines")
        else:
            print(f"⚠ App /metrics endpoint returned status: {response.status_code}")
    except requests.exceptions.ConnectionError:
        print("⚠ STUB: App /metrics endpoint not accessible")
        print("  - Service may not be running on localhost:8000")
        print("  - To validate: Start your RAG app, then check http://localhost:8000/metrics")
        print("  - Expected: Prometheus-format metrics (text/plain)")
    except Exception as e:
        print(f"⚠ Error checking metrics: {e}")

    print()

    # Check Prometheus targets page (offline-friendly)
    try:
        response = requests.get(prometheus_url, timeout=5)
        if response.status_code == 200:
            prometheus_ok = True
            print("✓ Prometheus targets page: ACCESSIBLE")
            print(f"  - URL: {prometheus_url}")
            print("  - Check manually for target state: UP")
        else:
            print(f"⚠ Prometheus returned status: {response.status_code}")
    except requests.exceptions.ConnectionError:
        print("⚠ STUB: Prometheus not accessible")
        print("  - Service may not be running on localhost:9090")
        print("  - To validate: Start Prometheus, then check http://localhost:9090/targets")
        print("  - Expected: Your app listed as 'UP' with last scrape <30s ago")
    except Exception as e:
        print(f"⚠ Error checking Prometheus: {e}")

    print()
    print(f"Timestamp: {datetime.now().isoformat()}")
    print("=" * 60)
    print()

    # Report reachability instead of always returning True
    return app_ok and prometheus_ok

# Run the check
check_metrics_endpoint()
```

## Section 3: Validation Check 2 - Grafana Dashboard Panels

### ✓ Check: Grafana Showing Key Metrics

**What to verify:**
- Dashboard displays: p95 latency, error rate, cache hit rate, cost accumulation
- All panels show data for last 30 minutes
- Graphs are not empty

**Why this matters for M2.4:**
- Will measure retry effectiveness (error rate 2% → <0.1%)
- Track latency impact of retries (+50-200ms per retry)
- Monitor circuit breaker state transitions
- Observe degradation patterns in real-time

**Success criteria:**
- All 4 key panels populated with data
- Time range: last 30 minutes minimum
- PromQL queries returning results
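
One way to confirm that a panel's PromQL query returns results without opening Grafana is to run the same expression against the Prometheus HTTP API (`/api/v1/query`) and inspect the response. The sketch below checks an instant-query response for at least one finite sample. The function name and the sample payloads are assumptions for illustration; the response shape (`status`, `data.resultType`, `data.result[].value`) follows the Prometheus API.

```python
import math

def query_has_data(payload):
    """True if an instant-query response holds at least one finite vector sample."""
    if payload.get("status") != "success":
        return False
    data = payload.get("data", {})
    if data.get("resultType") != "vector":
        return False
    for series in data.get("result", []):
        _timestamp, value = series["value"]  # [unix_timestamp, "value as string"]
        if math.isfinite(float(value)):
            return True
    return False

# Hypothetical responses: a populated panel, an empty one, and a 0/0 division.
populated = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {}, "value": [1704110400.0, "0.42"]}]}}
empty = {"status": "success", "data": {"resultType": "vector", "result": []}}
nan_only = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {}, "value": [1704110400.0, "NaN"]}]}}

print(query_has_data(populated), query_has_data(empty), query_has_data(nan_only))
```

A live payload could be fetched with `requests.get("http://localhost:9090/api/v1/query", params={"query": expr}, timeout=5).json()`. A `NaN` result for a ratio panel usually means the denominator saw no traffic in the window, which is worth distinguishing from a missing metric.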

---

**Generate Dashboard Spec:** Creates a local JSON file with 4 Grafana panel definitions (p95 latency, error rate, cache hit rate, cost). This runs offline and produces `grafana_panels_spec.json` that you can import into Grafana.

```python
import json

def create_grafana_panel_spec():
    """Create JSON specification for required Grafana panels"""

    grafana_panels = {
        "dashboard": {
            "title": "RAG Pipeline Observability - M2.3",
            "tags": ["rag", "m2.3", "observability"],
            "timezone": "browser",
            "panels": [
                {
                    "id": 1,
                    "title": "p95 Latency",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "histogram_quantile(0.95, rate(rag_query_duration_seconds_bucket[5m]))",
                            "legendFormat": "p95 latency",
                            "refId": "A"
                        }
                    ],
                    "yaxes": [
                        {"format": "s", "label": "Duration"}
                    ],
                    "alert": {
                        "name": "High p95 Latency",
                        "conditions": [
                            {
                                "evaluator": {"params": [0.5], "type": "gt"},
                                "query": {"params": ["A", "5m", "now"]},
                                "type": "query"
                            }
                        ],
                        "message": "p95 latency exceeded 500ms"
                    }
                },
                {
                    "id": 2,
                    "title": "Error Rate (%)",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "rate(rag_errors_total[5m]) / rate(rag_requests_total[5m]) * 100",
                            "legendFormat": "Error rate %",
                            "refId": "A"
                        }
                    ],
                    "yaxes": [
                        {"format": "percent", "label": "Error Rate"}
                    ],
                    "alert": {
                        "name": "High Error Rate",
                        "conditions": [
                            {
                                "evaluator": {"params": [1.0], "type": "gt"},
                                "query": {"params": ["A", "5m", "now"]},
                                "type": "query"
                            }
                        ],
                        "message": "Error rate exceeded 1% for 5 minutes"
                    }
                },
                {
                    "id": 3,
                    "title": "Cache Hit Rate (%)",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "rate(rag_cache_hits_total[5m]) / rate(rag_cache_requests_total[5m]) * 100",
                            "legendFormat": "Cache hit rate %",
                            "refId": "A"
                        }
                    ],
                    "yaxes": [
                        {"format": "percent", "label": "Hit Rate"}
                    ]
                },
                {
                    "id": 4,
                    "title": "Cost Accumulation",
                    "type": "graph",
                    "targets": [
                        {
                            "expr": "sum(increase(rag_cost_usd_total[1h]))",
                            "legendFormat": "Hourly cost (USD)",
                            "refId": "A"
                        }
                    ],
                    "yaxes": [
                        {"format": "currencyUSD", "label": "Cost"}
                    ]
                }
            ],
            "refresh": "15s",
            "time": {
                "from": "now-30m",
                "to": "now"
            }
        }
    }

    # Write to local file (no external calls)
    spec_file = "grafana_panels_spec.json"
    with open(spec_file, 'w') as f:
        json.dump(grafana_panels, f, indent=2)

    print("=" * 60)
    print("CHECK 2: Grafana Dashboard Panels")
    print("=" * 60)
    print()
    print(f"✓ Panel specification created: {spec_file}")
    print()
    print("Required Panels (4):")
    for panel in grafana_panels["dashboard"]["panels"]:
        print(f"  {panel['id']}. {panel['title']}")
    print()
    print(f"Specification saved to: {spec_file}")
    print("Import this JSON into Grafana to create the dashboard")
    print("=" * 60)
    print()

    return grafana_panels

# Create the spec
panels = create_grafana_panel_spec()
```

## Section 4: Validation Check 3 - Alert Rules Configuration

### ✓ Check: Alert Rules Configured and Tested

**What to verify:**
- Alert rules exist for critical conditions
- Test alert can be triggered
- Slack/email notifications arrive within 2 minutes
- Alert includes metric value and threshold

**Why this matters for M2.4:**
- Will configure circuit breaker state change alerts (OPEN → HALF-OPEN → CLOSED)
- Need to know when system is degraded and when it recovers
- Without working alerts, won't know if error handling is functioning
- M2.4 adds new alert types (retry exhaustion, circuit breaker trips, degradation mode)

**Success criteria:**
- 3+ alert rules configured (latency, error rate, rate limit)
- Test alert successfully triggered and received
- Alertmanager configured and routing notifications to Slack/email
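
When you test an alert, keep in mind that a rule with a `for` duration only fires after its condition has held continuously for that long; a short spike resets the timer. The sketch below approximates that behaviour over a list of samples. It is a simplified model written for illustration (the function name and sample data are assumptions), not Prometheus's actual evaluation loop.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which a 'value > threshold for N seconds' alert would fire.

    samples: list of (unix_ts, value) pairs ordered by time. Returns None if it never fires.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # condition just became true: alert is pending
            if ts - breach_start >= for_seconds:
                return ts  # condition held long enough: alert fires
        else:
            breach_start = None  # any recovery resets the pending timer
    return None

# Hypothetical error-rate samples (%), one per minute.
rates = [0.5, 1.5, 1.2, 0.8, 1.4, 1.6, 2.0, 1.9, 1.3, 1.1]
samples = [(i * 60, rate) for i, rate in enumerate(rates)]

print(first_firing_time(samples, threshold=1.0, for_seconds=300))  # 540
```

The two-minute breach at the start never fires because the rate dips to 0.8% at the three-minute mark; only the second, sustained breach does. This is why a test alert needs the condition held for the full `for` window.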

---

**Generate Alert Rules:** Creates a local YAML file with 4 Prometheus alert rules (3 for M2.3, 1 preview for M2.4 circuit breakers). Runs offline and produces `prometheus_alert_rules.yml` for Prometheus configuration.

```python
import yaml  # provided by the PyYAML package

def create_alert_rules_stub():
    """Create YAML stub for Prometheus alert rules"""

    alert_rules = {
        "groups": [
            {
                "name": "rag_pipeline_alerts",
                "interval": "30s",
                "rules": [
                    {
                        "alert": "HighP95Latency",
                        "expr": "histogram_quantile(0.95, rate(rag_query_duration_seconds_bucket[5m])) > 0.5",
                        "for": "5m",
                        "labels": {
                            "severity": "warning",
                            "component": "rag_pipeline"
                        },
                        "annotations": {
                            "summary": "High p95 latency detected",
                            "description": "p95 latency is {{ $value }}s (threshold: 500ms) for 5 minutes"
                        }
                    },
                    {
                        "alert": "HighErrorRate",
                        "expr": "(rate(rag_errors_total[5m]) / rate(rag_requests_total[5m])) * 100 > 1.0",
                        "for": "5m",
                        "labels": {
                            "severity": "critical",
                            "component": "rag_pipeline"
                        },
                        "annotations": {
                            "summary": "High error rate detected",
                            "description": "Error rate is {{ $value }}% (threshold: 1%) for 5 minutes"
                        }
                    },
                    {
                        "alert": "LowRateLimitHeadroom",
                        "expr": "rag_rate_limit_headroom_percent < 20",
                        "for": "2m",
                        "labels": {
                            "severity": "warning",
                            "component": "api_gateway"
                        },
                        "annotations": {
                            "summary": "Rate limit headroom low",
                            "description": "Rate limit headroom at {{ $value }}% (threshold: 20%)"
                        }
                    },
                    {
                        "alert": "CircuitBreakerOpen",
                        "expr": "rag_circuit_breaker_state{state=\"open\"} == 1",
                        "for": "1m",
                        "labels": {
                            "severity": "warning",
                            "component": "resilience",
                            "m2_4": "true"
                        },
                        "annotations": {
                            "summary": "Circuit breaker opened",
                            "description": "Circuit breaker for {{ $labels.service }} is OPEN - service degraded"
                        }
                    }
                ]
            }
        ]
    }

    # Write YAML to file (no external calls)
    rules_file = "prometheus_alert_rules.yml"
    with open(rules_file, 'w') as f:
        yaml.dump(alert_rules, f, default_flow_style=False, sort_keys=False)

    print("=" * 60)
    print("CHECK 3: Alert Rules Configuration")
    print("=" * 60)
    print()
    print(f"✓ Alert rules YAML created: {rules_file}")
    print()
    print("Configured Alert Rules:")
    for rule in alert_rules["groups"][0]["rules"]:
        print(f"  • {rule['alert']} (severity: {rule['labels']['severity']})")
    print()
    print(f"YAML saved to: {rules_file}")
    print("=" * 60)
    print()

    return alert_rules

# Create alert rules
rules = create_alert_rules_stub()
```

## Section 5: Validation Check 4 - Structured Logging with Correlation IDs

### ✓ Check: Correlation IDs in Structured Logs

**What to verify:**
- Logs are in JSON format
- Every log entry includes `correlation_id` or `request_id`
- Can trace full request lifecycle using correlation ID
- Log entries: query received → embedding → vector search → LLM call → response

**Why this matters for M2.4:**
- M2.4 error handling produces MORE logs (retry attempts, circuit breaker state changes, fallback triggers)
- Without correlation IDs: thousands of log lines with no way to trace a single failing request
- Retry debugging requires tracing: initial attempt → retry 1 → retry 2 → success/failure
- Circuit breaker state changes need correlation to specific requests that triggered the change

**Success criteria:**
- Logs in JSON format
- `correlation_id` (or `request_id`) field present in all log entries
- Can grep logs by correlation_id to see full request trace
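
As an alternative to `grep`, the same trace can be pulled out in Python, which also lets you skip malformed lines and sort by timestamp. This is a minimal sketch assuming one JSON object per log line with `correlation_id` and ISO-8601 `timestamp` fields; the function name and sample lines are hypothetical.

```python
import json

def trace_request(log_lines, correlation_id):
    """Return the parsed log entries for one correlation ID, ordered by timestamp."""
    entries = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack traces or startup banners
        if isinstance(entry, dict) and entry.get("correlation_id") == correlation_id:
            entries.append(entry)
    # ISO-8601 timestamps in a uniform format sort correctly as strings
    return sorted(entries, key=lambda e: e.get("timestamp", ""))

# Hypothetical log lines: two interleaved requests plus one malformed line.
log_lines = [
    '{"timestamp": "2024-01-01T12:00:00.300", "message": "LLM call completed", "correlation_id": "a1b2c3d4"}',
    '{"timestamp": "2024-01-01T12:00:00.000", "message": "Query received", "correlation_id": "a1b2c3d4"}',
    '{"timestamp": "2024-01-01T12:00:00.100", "message": "Query received", "correlation_id": "ffff0000"}',
    'Traceback (most recent call last):',
    '{"timestamp": "2024-01-01T12:00:00.400", "message": "Response returned", "correlation_id": "a1b2c3d4"}',
]

for entry in trace_request(log_lines, "a1b2c3d4"):
    print(entry["timestamp"], entry["message"])
```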

---

**Validate Log Structure:** Demonstrates proper structured logging with correlation IDs by generating sample log entries. Shows a complete request lifecycle trace and validates that each entry contains the required `correlation_id` field.

```python
import json
import uuid
from datetime import datetime

def validate_correlation_id_in_logs():
    """Validate that logs include correlation_id for request tracing"""

    # Sample log entries demonstrating proper structured logging (no external calls)
    sample_correlation_id = str(uuid.uuid4())[:8]

    sample_logs = [
        {
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": "Query received",
            "correlation_id": sample_correlation_id,
            "query": "What is RAG?",
            "user_id": "user_123"
        },
        {
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": "Embedding generated",
            "correlation_id": sample_correlation_id,
            "embedding_model": "text-embedding-ada-002",
            "latency_ms": 45
        },
        {
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": "Vector search completed",
            "correlation_id": sample_correlation_id,
            "results_count": 5,
            "search_latency_ms": 12
        },
        {
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": "LLM call completed",
            "correlation_id": sample_correlation_id,
            "model": "gpt-4",
            "tokens": 450,
            "latency_ms": 1200
        },
        {
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": "Response returned",
            "correlation_id": sample_correlation_id,
            "total_latency_ms": 1257,
            "status": "success"
        }
    ]

    # M2.4 preview: retry logs with correlation
    sample_logs_m24 = [
        {
            "timestamp": datetime.now().isoformat(),
            "level": "WARNING",
            "message": "LLM call failed, retrying",
            "correlation_id": sample_correlation_id,
            "error": "RateLimitError",
            "retry_attempt": 1,
            "backoff_ms": 1000
        },
        {
            "timestamp": datetime.now().isoformat(),
            "level": "INFO",
            "message": "Retry successful",
            "correlation_id": sample_correlation_id,
            "retry_attempt": 1,
            "final_status": "success"
        }
    ]

    print("=" * 60)
    print("CHECK 4: Structured Logging with Correlation IDs")
    print("=" * 60)
    print()
    print(f"✓ Sample correlation_id: {sample_correlation_id}")
    print()
    print("Example: Complete Request Trace (5 log entries)")
    for i, log in enumerate(sample_logs, 1):
        print(f"  {i}. {log['message']}")
    print()

    # Validate structure across both sample sets instead of printing fixed results
    all_logs = sample_logs + sample_logs_m24
    all_have_correlation = all('correlation_id' in log for log in all_logs)
    # Round-trip each entry through JSON to confirm it serializes cleanly
    all_valid_json = all(json.loads(json.dumps(log)) == log for log in all_logs)
    # A request is traceable when every entry shares the same correlation_id
    traceable = len({log.get('correlation_id') for log in all_logs}) == 1

    print("Validation Results:")
    if all_have_correlation:
        print("  ✓ All log entries contain 'correlation_id'")
    else:
        print("  ✗ MISSING: Some logs lack 'correlation_id'")
    if all_valid_json:
        print("  ✓ All logs are valid JSON")
    else:
        print("  ✗ Some logs do not round-trip through JSON")
    if all_have_correlation and traceable:
        print("  ✓ Request lifecycle traceable by correlation_id")
    else:
        print("  ✗ Request lifecycle NOT traceable by a single correlation_id")
    print()

    # M2.4 Preview
    print("M2.4 PREVIEW: Retry Logs with Correlation")
    for log in sample_logs_m24:
        print(f"  • {log['message']} (attempt: {log.get('retry_attempt', 'N/A')})")
    print()
    print("=" * 60)
    print()

    return sample_logs

# Run validation
logs = validate_correlation_id_in_logs()
```

## Section 6: M2.4 Call-Forward - Self-Healing Capabilities

### What You'll Learn in M2.4: Error Handling & Reliability

M2.4 teaches you the four resilience patterns that make production RAG systems bulletproof.

---

### 1. Retry Logic with Exponential Backoff

**What it does:**
- Automatically recover from transient API failures
- Handle rate limits, timeouts, network glitches
- Exponential backoff prevents overwhelming failing services

**Trade-offs:**
- Adds retry infrastructure complexity
- Increases latency: +50-200ms per retry attempt
- Must tune max retries (too many = wasted time, too few = missed recoveries)

**Example transformation:**
- **Before:** OpenAI timeout → user sees error → lost query
- **After:** OpenAI timeout → retry with backoff → success on retry 2 → user gets answer (slower, but successful)
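
The "after" flow can be sketched in a few lines. This is an illustrative sketch, not the M2.4 implementation: `TransientError`, the function name, and the default delays are assumptions chosen for the example. The wait doubles on each attempt and adds random jitter so that many clients do not retry in lockstep.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""

def retry_with_backoff(func, max_retries=3, base_delay=0.05, max_delay=2.0, sleep=time.sleep):
    """Call func(); on TransientError wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retry bursts

# Demo: a fake LLM call that times out twice, then succeeds.
calls = {"count": 0}

def flaky_llm_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated timeout")
    return "answer"

waits = []  # record waits instead of sleeping so the demo runs instantly
result = retry_with_backoff(flaky_llm_call, sleep=waits.append)
print(result, "after", len(waits), "retries")
```

Only errors marked as transient are retried here; retrying permanent failures (for example, invalid requests) just adds latency without improving the outcome.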

---

### 2. Circuit Breaker Pattern

**What it does:**
- Prevent cascading failures when dependencies are down
- Stop hammering failing services, fail fast instead
- Automatic recovery detection (half-open → closed states)

**Trade-offs:**
- Adds state machine complexity
- 5-15ms per-request overhead for state checking
- Risk of false positives (blocking healthy services after transient issues)

**Example:**
- Vector DB goes down for 30 seconds
- Circuit breaker opens after 5 failures
- New requests fail fast (no wasted retries)
- Circuit breaker periodically tests recovery
- Auto-closes when service returns
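
The sequence above maps onto a small state machine. Below is a minimal, single-threaded sketch under stated assumptions: the class and exception names, the 10-second recovery timeout, and the injectable clock are all hypothetical choices for illustration, and a production version would need locking and metrics.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=10.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("dependency marked down; failing fast")
            self.state = "half-open"  # timeout elapsed: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, opens the circuit
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count
        self.state = "closed"
        self.failures = 0
        return result

# Demo with a fake clock so no real waiting is needed.
fake_now = [0.0]
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=10.0, clock=lambda: fake_now[0])

def vector_db_down():
    raise ConnectionError("vector DB unreachable")

for _ in range(5):
    try:
        breaker.call(vector_db_down)
    except ConnectionError:
        pass
print(breaker.state)  # open

fake_now[0] = 11.0  # recovery timeout has elapsed
print(breaker.call(lambda: "search results"), breaker.state)
```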

---

### 3. Graceful Degradation Strategies

**What it does:**
- Continue serving *some* functionality when full features unavailable
- Return cached results when vector DB is down
- Provide simpler responses when LLM is rate-limited

**Trade-offs:**
- Requires fallback logic (20-30% more code)
- Must maintain alternate code paths
- Response quality may drop during degradation

**Example fallback chain:**
1. Full RAG (vector search + LLM): quality score 0.85
2. Cached results only: quality score 0.75
3. Static FAQ responses: quality score 0.60
4. Error message with retry suggestion: quality score 0.0
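
A fallback chain like this can be expressed as an ordered list of handlers tried in turn. The sketch below is illustrative only: the handler names, the in-memory cache and FAQ, and the simulated outage are assumptions, not part of the course codebase.

```python
def answer_with_fallback(query, handlers):
    """Try each (name, handler) in order; return (name, answer) from the first that succeeds."""
    for name, handler in handlers:
        try:
            return name, handler(query)
        except Exception:
            continue  # this tier is unavailable; fall through to the next one
    return "error", "Service temporarily unavailable. Please retry in a moment."

def full_rag(query):
    # Simulated outage of the top tier
    raise ConnectionError("vector DB unreachable")

CACHE = {"what is rag?": "RAG combines retrieval with generation."}

def cached_answer(query):
    return CACHE[query.lower()]  # raises KeyError on a cache miss

FAQ = {"pricing": "See the pricing page for current plans."}

def static_faq(query):
    for keyword, answer in FAQ.items():
        if keyword in query.lower():
            return answer
    raise LookupError("no FAQ entry matches")

handlers = [("full_rag", full_rag), ("cache", cached_answer), ("faq", static_faq)]

print(answer_with_fallback("What is RAG?", handlers))
print(answer_with_fallback("How does pricing work?", handlers))
print(answer_with_fallback("Something unrelated", handlers))
```

Returning the tier name alongside the answer makes it easy to log or count how often each degradation level is served.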

---

### 4. Request Queue with Rate Limiting

**What it does:**
- Handle traffic spikes without crashing
- Queue excess requests during temporary overload
- Prevents API rate limit violations

**Trade-offs:**
- Requires memory for queue (5KB per request)
- Adds queueing delay under load
- Must configure queue size limits (too small = rejected requests, too large = OOM)
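
The queue-size trade-off can be made concrete with a bounded queue that rejects work when full and reports its worst-case memory footprint. This sketch covers only the queueing half (a rate-limited consumer is not shown) and assumes the ~5KB-per-request estimate above; the class and method names are hypothetical.

```python
import queue

class BoundedRequestQueue:
    """Holds up to max_size pending requests; rejects new ones when full."""

    def __init__(self, max_size, bytes_per_request=5 * 1024):
        if max_size <= 0:
            # queue.Queue treats maxsize <= 0 as unbounded, which risks OOM
            raise ValueError("max_size must be positive")
        self._queue = queue.Queue(maxsize=max_size)
        self.max_memory_bytes = max_size * bytes_per_request  # worst-case footprint

    def submit(self, request):
        """Enqueue without blocking. Returns False if the queue is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            return False

    def next_request(self):
        """Dequeue the oldest request, or None if the queue is empty."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None

pending = BoundedRequestQueue(max_size=2)
accepted = [pending.submit(q) for q in ("query-1", "query-2", "query-3")]
print(accepted)                  # [True, True, False]
print(pending.max_memory_bytes)  # 10240
print(pending.next_request())    # query-1
```

At 5KB per request, a 10,000-entry queue would need roughly 50MB, so the size limit is effectively a memory budget.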

---

### Expected Impact After M2.4

**Error rate transformation:**
- **Current (M2.3):** ~0.5-2% error rate (up to 200 failures per 10,000 queries)
- **Target (M2.4):** <0.1% error rate (<10 failures per 10,000 queries)

**Why the improvement:**
- Transient failures (80% of errors) are retried automatically and usually succeed
- Circuit breakers prevent repeated failures on broken dependencies
- Graceful degradation keeps users productive during partial outages
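
It helps to work through the arithmetic behind these figures. The sketch below uses only the numbers quoted in this section (2% baseline, 80% transient share, 0.1% target) and a simple assumed model in which retries recover a fixed fraction of transient failures; the function name and model are illustrative, not a measured result.

```python
def residual_error_rate(baseline, transient_share, recovery_rate):
    """Error rate left after retries recover `recovery_rate` of the transient failures."""
    return baseline * (1 - transient_share * recovery_rate)

baseline = 0.02        # 2% of queries fail today
transient_share = 0.8  # 80% of those failures are transient
target = 0.001         # <0.1% goal after M2.4

# Best case for retries alone: every transient failure is recovered.
after_retries = residual_error_rate(baseline, transient_share, recovery_rate=1.0)
required_reduction = 1 - target / baseline

print(f"After retries alone: {after_retries:.2%} ({after_retries * 10_000:.0f} per 10,000)")
print(f"Reduction needed to reach target: {required_reduction:.0%}")
```

Under this model, retries alone bring a 2% error rate down to about 0.4%, while the 0.1% target needs roughly a 95% reduction. The remaining gap is what circuit breakers and graceful degradation have to cover, which is why your measured baseline matters for judging each pattern's contribution.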

---

### Honest Trade-offs Discussion

M2.4 is NOT "add resilience and everything is better." You'll learn:

**When TO use error handling:**
- Production systems with real users
- APIs with known transient failures (rate limits, timeouts)
- Dependencies with occasional downtime
- Systems where availability > latency

**When NOT to use:**
- MVP phase (add complexity only when needed)
- Internal tools with manual retry acceptable
- Systems with zero-tolerance for latency increases
- Single-user development environments

---

### Validation Time Investment

**Before starting M2.4:**
- Complete M2.3 validation checklist: 30-45 minutes
- Review monitoring dashboards: 10 minutes
- Optional: Screenshot current metrics for before/after comparison

**M2.4 implementation time:**
- Video duration: 32 minutes
- Implementation: 2-3 hours
- Testing and validation: 1 hour

---

### Module 2 Progress

**You've completed 3 of 4 videos:**
- ✓ M2.1: Caching for 60-80% cost reduction
- ✓ M2.2: Prompt optimization for quality + cost
- ✓ M2.3: Production monitoring and observability
- **→ M2.4: Error handling and resilience** ← Next

**After M2.4:**
- Complete production RAG system: cost-efficient, observable, resilient
- Module 3: Deploy to cloud (AWS/GCP/Azure)
- Module 4: Advanced retrieval techniques

---

### Ready for M2.4?

**Pre-flight checklist:**
- [ ] All 4 M2.3 validation checks passing
- [ ] Prometheus + Grafana accessible
- [ ] Alert rules tested and working
- [ ] Logs include correlation IDs
- [ ] 15-minute break taken (M2.4 is dense!)

**When ready, proceed to M2.4 video: Error Handling & Reliability**

---

**Notebook complete!** Run all cells to validate your M2.3 readiness for M2.4.