# L3 M14.2: Incident Management & Blast Radius

## Learning Arc

**Purpose:** Detect failing tenants within 60 seconds and automatically isolate them to prevent platform-wide outages

**Concepts Covered:**

1. **Blast Radius Definition & Cost Impact** - Understanding scope of damage from single failure (₹5.5 crore vs ₹15 lakh)
2. **Circuit Breaker Pattern** - Three-state isolation system (Closed/Open/Half-Open)
3. **Tenant Tier System** - Platinum/Gold/Silver/Bronze with different SLAs
4. **Incident Priority Framework** - P0/P1/P2 severity based on tenant tier and count
5. **Automated Detection** - Prometheus-based monitoring with 60-second detection SLA
6. **Notification System** - PagerDuty, Slack, email alerts with escalation paths
7. **Blameless Postmortems** - Five Whys analysis focused on system improvements
8. **Common Failure Scenarios** - Database pool exhaustion, query timeouts, shared resources
9. **Cost-Benefit Analysis** - ROI calculation and infrastructure costs
10. **Production Implementation** - Real-world deployment patterns and trade-offs

**After Completing This Notebook:**

- ✓ You will understand how to detect failing tenants within 60 seconds
- ✓ You can implement circuit breaker patterns for automatic isolation
- ✓ You will recognize when blast radius containment fails
- ✓ You can design blameless postmortems with actionable outcomes
- ✓ You will calculate cost impact and ROI for incident management systems
- ✓ You can configure Prometheus metrics for multi-tenant monitoring
- ✓ You will build production-grade notification and escalation workflows

**Context in Track L3.M14:**

This module builds on multi-tenant monitoring (M14.1) and prepares you for tenant lifecycle management (M14.3). It's part of the Operations & Governance track focusing on production stability.

## 1. Environment Setup

In [None]:
import os
import sys
from datetime import datetime

# Add src to path for imports
if './src' not in sys.path:
    sys.path.insert(0, './src')
if '.' not in sys.path:
    sys.path.insert(0, '.')

# OFFLINE mode for L3 consistency
OFFLINE = os.getenv("OFFLINE", "false").lower() == "true"

# Prometheus detection from script
PROMETHEUS_ENABLED = os.getenv("PROMETHEUS_ENABLED", "false").lower() == "true"

if OFFLINE or not PROMETHEUS_ENABLED:
    print("⚠️  Running in OFFLINE/PROMETHEUS_DISABLED mode")
    print("   → Prometheus queries will be skipped")
    print("   → Set PROMETHEUS_ENABLED=true in .env to enable")
    print("   → Ensure Prometheus is running at configured URL")
else:
    print("✓ Online mode - Prometheus enabled")
    print(f"  → Prometheus URL: {os.getenv('PROMETHEUS_URL', 'http://prometheus:9090')}")

print(f"\n📅 Notebook started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Import Core Modules

In [None]:
# Import all classes and functions from our module
from src.l3_m14_operations_governance import (
    # Enums
    TenantTier,
    IncidentPriority,
    CircuitBreakerState,
    
    # Data classes
    TenantMetrics,
    CircuitBreaker,
    Incident,
    
    # Main classes
    BlastRadiusDetector,
    
    # Functions
    calculate_incident_priority,
    create_incident,
    send_notifications,
    generate_postmortem_template,
    run_monitoring_loop,
)

print("✓ All modules imported successfully")
print(f"  → {len(TenantTier)} tenant tiers available")
print(f"  → {len(IncidentPriority)} priority levels defined")
print(f"  → {len(CircuitBreakerState)} circuit breaker states")

## 3. Blast Radius Concept

### What is Blast Radius?

**Definition:** The scope of damage from a single failure in a multi-tenant system.

**Example Scenario:**
- **Single-tenant system**: Bad query affects 1 company (blast radius = 1)
- **Multi-tenant system (no containment)**: Bad query affects 50 companies (blast radius = 50)
- **Multi-tenant system (with containment)**: Bad query affects 1 company (blast radius = 1)

### Cost Impact Analysis

**Without Containment:**
- Affected tenants: 50
- Downtime: 3 hours
- Cost: 50 × 3 × ₹3.66 lakh/hour ≈ **₹5.5 crore**

**With Containment:**
- Affected tenants: 1
- Downtime: 3 hours (for that tenant only)
- Cost: 1 × 3 × ₹5 lakh/hour = **₹15 lakh**

**Savings: ≈ ₹5.34 crore (more than 36x cost reduction)**

In [None]:
# Calculate blast radius cost impact
def calculate_blast_radius_cost(affected_tenants: int, downtime_hours: int, avg_cost_per_hour: float) -> float:
    """Calculate total cost impact of an incident."""
    return affected_tenants * downtime_hours * avg_cost_per_hour

# Without containment (platform-wide outage)
cost_without = calculate_blast_radius_cost(
    affected_tenants=50,
    downtime_hours=3,
    avg_cost_per_hour=366_000  # ₹3.66 lakh/hour, average across all tiers
)

# With containment (single tenant isolated)
cost_with = calculate_blast_radius_cost(
    affected_tenants=1,
    downtime_hours=3,
    avg_cost_per_hour=500_000  # ₹5 lakh/hour, Bronze tier
)

savings = cost_without - cost_with
savings_ratio = cost_without / cost_with

print("💰 Blast Radius Cost Impact:")
print(f"  Without containment: ₹{cost_without:,.0f} (₹{cost_without/10_000_000:.1f} crore)")
print(f"  With containment:    ₹{cost_with:,.0f} (₹{cost_with/100_000:.0f} lakh)")
print(f"  Savings:             ₹{savings:,.0f} (₹{savings/10_000_000:.2f} crore)")
print(f"  Savings ratio:       {savings_ratio:.1f}x cost reduction")

# Expected: about 36.6x cost reduction, ₹5.34 crore savings

## 4. Tenant Tier System

### Tier Definitions

| Tier | Contract Value | SLA | Incident Priority | Response SLA |
|------|----------------|-----|-------------------|-------------|
| **Platinum** | ₹2 crore+ | 99.99% | P0 (Critical) | 15 minutes |
| **Gold** | ₹50 lakh+ | 99.9% | P1 (High) | 60 minutes |
| **Silver** | ₹10 lakh+ | 99% | P2 (Medium) | 4 hours |
| **Bronze** | <₹10 lakh | Best-effort | P2 (Medium) | 8 hours |
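
To make the SLA column concrete, the availability percentages can be converted into a yearly downtime budget. The sketch below is illustrative only: the helper name `allowed_downtime_minutes` and the 365-day year are assumptions, not part of the course module.

```python
# Illustrative sketch: convert an availability SLA into a yearly downtime budget.
# `allowed_downtime_minutes` is a hypothetical helper, not part of the course module.
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


# SLA percentages from the tier table (Bronze is best-effort, so it has no budget)
tier_slas = {"platinum": 99.99, "gold": 99.9, "silver": 99.0}

for tier, sla in tier_slas.items():
    minutes = allowed_downtime_minutes(sla)
    print(f"{tier:8} {sla:6.2f}% -> {minutes:8.1f} min/year ({minutes / 60:.1f} hours)")
```

Under these assumptions a Platinum tenant tolerates under an hour of downtime per year, so a single 3-hour outage would exceed that budget more than three times over.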

In [None]:
# Explore tenant tiers
print("🏆 Tenant Tier System:\n")

tier_info = {
    TenantTier.PLATINUM: {"contract": "₹2 crore+", "sla": "99.99%", "response": "15 min"},
    TenantTier.GOLD: {"contract": "₹50 lakh+", "sla": "99.9%", "response": "60 min"},
    TenantTier.SILVER: {"contract": "₹10 lakh+", "sla": "99%", "response": "4 hours"},
    TenantTier.BRONZE: {"contract": "<₹10 lakh", "sla": "Best-effort", "response": "8 hours"},
}

for tier, info in tier_info.items():
    print(f"{tier.value.upper():8} | Contract: {info['contract']:12} | SLA: {info['sla']:11} | Response: {info['response']}")

# Expected: 4 tiers displayed with contract values and SLAs

## 5. Circuit Breaker Pattern

### Three States

1. **CLOSED (Normal Operation)**
   - All requests pass through
   - Monitoring for failures
   - Transitions to OPEN after 5 consecutive failures

2. **OPEN (Isolated)**
   - All requests blocked immediately
   - Tenant isolated from platform
   - Fast-fail response (no resource consumption)
   - After 60-second timeout → HALF_OPEN

3. **HALF_OPEN (Testing Recovery)**
   - Limited test queries allowed
   - Success → CLOSED (recovery complete)
   - Failure → OPEN (back to isolation)
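
The three states above can be sketched as a small state machine. This is a minimal illustration, not the module's `CircuitBreaker`: the class name `SketchBreaker`, the `allow_request` method, and the injectable `clock` are all assumptions made for the example, and it simplifies HALF_OPEN by not limiting the number of test requests.

```python
# Illustrative sketch of the three-state machine; not the course module's CircuitBreaker.
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SketchBreaker:
    """Hypothetical breaker: consecutive failures open it, a timeout half-opens it."""

    def __init__(self, failure_threshold=5, timeout_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the timeout can be exercised without sleeping
        self.state = State.CLOSED
        self.failure_count = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        """CLOSED and HALF_OPEN pass requests; OPEN fails fast until the timeout expires."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.timeout_seconds:
                self.state = State.HALF_OPEN  # timeout expired: let a test request through
                return True
            return False
        return True

    def record_failure(self) -> None:
        self.failure_count += 1
        # A failed test request in HALF_OPEN goes straight back to OPEN
        if self.state is State.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def record_success(self) -> None:
        # Any success resets the consecutive-failure count and closes the breaker
        self.failure_count = 0
        self.state = State.CLOSED
```

Injecting the clock is a design choice for testability: a fake clock lets the 60-second timeout be exercised instantly, which is the same effect the notebook achieves by passing `timeout_seconds=0`.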

In [None]:
# Demonstrate circuit breaker state transitions
breaker = CircuitBreaker(tenant_id="demo-tenant", failure_threshold=5, timeout_seconds=0)

print("🔌 Circuit Breaker State Transitions:\n")
print(f"Initial state: {breaker.state.value.upper()} (failure_count={breaker.failure_count})\n")

# Record 4 failures (below threshold)
print("Recording 4 failures (below threshold):")
for i in range(1, 5):
    tripped = breaker.record_failure()
    print(f"  Failure {i}: state={breaker.state.value}, count={breaker.failure_count}, tripped={tripped}")

# 5th failure trips the breaker
print("\nRecording 5th failure (trips breaker):")
tripped = breaker.record_failure()
print(f"  Failure 5: state={breaker.state.value.upper()}, count={breaker.failure_count}, tripped={tripped}")
print(f"  ⚠️  Circuit breaker OPENED - tenant isolated!")

# Attempt reset (immediate with 0-second timeout)
print("\nAttempting recovery (timeout expired):")
reset = breaker.attempt_reset()
print(f"  Reset successful: {reset}")
print(f"  New state: {breaker.state.value.upper()} (testing recovery)")

# Successful request closes breaker
print("\nSuccessful request:")
breaker.record_success()
print(f"  State: {breaker.state.value.upper()} - recovery complete!")

# Expected: CLOSED → OPEN → HALF_OPEN → CLOSED

## 6. Incident Priority Framework

### Priority Rules

**P0 (Critical):**
- Any Platinum tenant affected OR
- 10+ tenants affected
- Response SLA: 15 minutes
- Escalation: War room with CTO

**P1 (High):**
- Gold tenant affected OR
- 5-9 tenants affected
- Response SLA: 60 minutes
- Escalation: Platform lead + on-call

**P2 (Medium):**
- Silver/Bronze tenant
- <5 tenants affected
- Response SLA: 4 hours
- Escalation: On-call engineer only
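
The rules above can be read as a first-match-wins check from most to least severe. The following is a minimal sketch under that reading; `sketch_priority` and its use of lowercase tier names are hypothetical and independent of the module's `calculate_incident_priority`.

```python
# Illustrative sketch of the priority rules; names here are hypothetical and
# independent of the module's calculate_incident_priority.
from typing import List


def sketch_priority(affected_tiers: List[str]) -> str:
    """Return "P0", "P1", or "P2" for a list of affected tenant tiers (lowercase names)."""
    count = len(affected_tiers)
    # Check the most severe rule first so the first match wins
    if "platinum" in affected_tiers or count >= 10:
        return "P0"
    if "gold" in affected_tiers or count >= 5:
        return "P1"
    return "P2"


print(sketch_priority(["platinum"]))     # P0
print(sketch_priority(["bronze"] * 10))  # P0
print(sketch_priority(["silver"] * 7))   # P1
print(sketch_priority(["bronze"] * 2))   # P2
```

Ordering the checks by severity means a large Bronze-only incident is still escalated to P0, because tenant count alone can trigger the top rule.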

In [None]:
# Test incident priority calculation
test_scenarios = [
    {
        "name": "Single Platinum tenant",
        "tenants": [TenantMetrics("plat-1", 100, 60, 0.60, datetime.now(), TenantTier.PLATINUM)],
        "expected": IncidentPriority.P0
    },
    {
        "name": "10 Bronze tenants",
        "tenants": [TenantMetrics(f"bronze-{i}", 100, 60, 0.60, datetime.now(), TenantTier.BRONZE) for i in range(10)],
        "expected": IncidentPriority.P0
    },
    {
        "name": "Single Gold tenant",
        "tenants": [TenantMetrics("gold-1", 100, 60, 0.60, datetime.now(), TenantTier.GOLD)],
        "expected": IncidentPriority.P1
    },
    {
        "name": "7 Silver tenants",
        "tenants": [TenantMetrics(f"silver-{i}", 100, 60, 0.60, datetime.now(), TenantTier.SILVER) for i in range(7)],
        "expected": IncidentPriority.P1
    },
    {
        "name": "2 Bronze tenants",
        "tenants": [TenantMetrics(f"bronze-{i}", 100, 60, 0.60, datetime.now(), TenantTier.BRONZE) for i in range(2)],
        "expected": IncidentPriority.P2
    },
]

print("📊 Incident Priority Calculation:\n")

for scenario in test_scenarios:
    priority = calculate_incident_priority(scenario["tenants"])
    match = "✓" if priority == scenario["expected"] else "✗"
    print(f"{match} {scenario['name']:25} → {priority.value.upper():8} (expected: {scenario['expected'].value.upper()})")

# Expected: All 5 scenarios match expected priority

## 7. BlastRadiusDetector Implementation

### Architecture

```
Prometheus → BlastRadiusDetector → Circuit Breaker → Isolation
(metrics)       (10s polling)       (5 failures)   (60s timeout)
```

### Detection Algorithm

1. Query Prometheus for all active tenant IDs
2. For each tenant:
   - Query `rag_queries_total` (5-minute window)
   - Query `rag_queries_errors` (5-minute window)
   - Calculate error_rate = errors / total
3. If error_rate ≥ 50%, record failure in circuit breaker
4. After 5 consecutive failures, circuit breaker opens (tenant isolated)
5. After 60-second timeout, attempt recovery (HALF_OPEN state)
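
Steps 2 and 3 can be sketched without a live Prometheus server. The example below only builds the query strings and parses a hand-written response in the shape returned by Prometheus's `/api/v1/query` endpoint. The `tenant_id` label, the use of `increase()` (which assumes both metrics are counters), and all helper names are assumptions for illustration, not the detector's actual implementation.

```python
# Illustrative sketch of steps 2-3; the `tenant_id` label and the use of
# increase() over counters are assumptions, not confirmed by the module.
from typing import Any, Dict


def build_query(metric: str, tenant_id: str, window: str = "5m") -> str:
    """Build a PromQL expression for one tenant's metric growth over the window."""
    return f'sum(increase({metric}{{tenant_id="{tenant_id}"}}[{window}]))'


def parse_scalar(response: Dict[str, Any]) -> float:
    """Extract the first sample value from an instant-query response."""
    results = response.get("data", {}).get("result", [])
    # Instant vectors carry samples as [timestamp, "value-as-string"]
    return float(results[0]["value"][1]) if results else 0.0


def error_rate(total: float, errors: float) -> float:
    """errors / total, treating a tenant with no traffic as healthy."""
    return errors / total if total > 0 else 0.0


def is_failing(total: float, errors: float, threshold: float = 0.50) -> bool:
    """A tenant is failing when its error rate meets or exceeds the threshold."""
    return error_rate(total, errors) >= threshold


# Hand-written response in the shape returned by /api/v1/query
sample = {
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {}, "value": [1700000000, "850"]}]},
}
print(build_query("rag_queries_errors", "tenant-A"))
print(is_failing(total=1000, errors=parse_scalar(sample)))
```

Guarding the division in `error_rate` matters in practice: an idle tenant reports zero queries, and it should read as healthy rather than raise `ZeroDivisionError`.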

In [None]:
# Initialize BlastRadiusDetector
if PROMETHEUS_ENABLED and not OFFLINE:
    prometheus_url = os.getenv("PROMETHEUS_URL", "http://prometheus:9090")
    detector = BlastRadiusDetector(
        prometheus_url=prometheus_url,
        error_threshold=0.50,  # 50% error rate
        check_interval_seconds=10,
        check_window="5m"
    )
    
    print("✓ BlastRadiusDetector initialized")
    print(f"  → Prometheus URL: {prometheus_url}")
    print(f"  → Error threshold: 50%")
    print(f"  → Check interval: 10 seconds")
    print(f"  → Time window: 5 minutes")
    
    # Configure tenant tiers
    detector.set_tenant_tier("tenant-platinum-1", TenantTier.PLATINUM)
    detector.set_tenant_tier("tenant-gold-1", TenantTier.GOLD)
    detector.set_tenant_tier("tenant-silver-1", TenantTier.SILVER)
    detector.set_tenant_tier("tenant-bronze-1", TenantTier.BRONZE)
    
    print(f"\n✓ Configured {len(detector.tenant_tiers)} tenant tiers")
    
else:
    print("⚠️  Skipping detector initialization (Prometheus disabled)")
    print("   → Set PROMETHEUS_ENABLED=true to enable detection")
    detector = None

# Expected: Detector initialized with configuration (if Prometheus enabled)

## 8. Testing the Detector (Simulated)

Since Prometheus may not be available, we'll simulate tenant metrics to demonstrate the detection workflow.

In [None]:
# Simulate tenant metrics (without Prometheus)
simulated_tenants = [
    TenantMetrics("tenant-A", 1000, 850, 0.85, datetime.now(), TenantTier.PLATINUM),  # Failing (85%)
    TenantMetrics("tenant-B", 800, 500, 0.625, datetime.now(), TenantTier.GOLD),      # Failing (62%)
    TenantMetrics("tenant-C", 500, 100, 0.20, datetime.now(), TenantTier.SILVER),     # Healthy (20%)
    TenantMetrics("tenant-D", 300, 50, 0.167, datetime.now(), TenantTier.BRONZE),     # Healthy (17%)
]

print("🔍 Simulated Blast Radius Detection:\n")
print(f"{'Tenant ID':15} {'Queries':>8} {'Errors':>8} {'Error Rate':>11} {'Tier':>8} {'Status':>10}")
print("-" * 70)

failing_tenants = []
for tenant in simulated_tenants:
    is_failing = tenant.is_failing(threshold=0.50)
    status = "❌ FAILING" if is_failing else "✓ Healthy"
    
    print(f"{tenant.tenant_id:15} {tenant.total_queries:8} {tenant.error_queries:8} "
          f"{tenant.error_rate:10.1%} {tenant.tier.value:>8} {status:>10}")
    
    if is_failing:
        failing_tenants.append(tenant)

print(f"\n📊 Detection Summary:")
print(f"  Total tenants: {len(simulated_tenants)}")
print(f"  Failing tenants: {len(failing_tenants)}")
print(f"  Blast radius: {len(failing_tenants)}/{len(simulated_tenants)} ({len(failing_tenants)/len(simulated_tenants):.0%})")

# Expected: 2 failing tenants (tenant-A, tenant-B)

## 9. Creating Incidents

When failing tenants are detected, an incident is automatically created with:
- Priority calculation (P0/P1/P2)
- Cost impact estimation
- Affected tenant tracking
- Timestamp and incident ID

In [None]:
# Create incident from failing tenants
if failing_tenants:
    # Create a mock detector for incident creation
    mock_detector = BlastRadiusDetector()
    
    incident = create_incident(failing_tenants, mock_detector)
    
    print("🚨 Incident Created:\n")
    print(f"  Incident ID: {incident.incident_id}")
    print(f"  Priority: {incident.priority.value.upper()}")
    print(f"  Created at: {incident.created_at.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"  Affected tenants: {len(incident.tenant_ids)} ({', '.join(incident.tenant_ids)})")
    print(f"  Highest tier affected: {incident.affected_tier.value.upper() if incident.affected_tier else 'Unknown'}")
    print(f"  Estimated cost: ₹{incident.cost_impact_inr:,.0f}/hour")
    
    # Priority explanation
    if incident.priority == IncidentPriority.P0:
        print(f"\n  ⚠️  P0 CRITICAL: Platinum tenant affected or 10+ tenants")
        print(f"      Response SLA: 15 minutes | Escalation: CTO + VP Eng")
    elif incident.priority == IncidentPriority.P1:
        print(f"\n  ⚠️  P1 HIGH: Gold tenant affected or 5-9 tenants")
        print(f"      Response SLA: 60 minutes | Escalation: Platform Lead")
    else:
        print(f"\n  ℹ️  P2 MEDIUM: Silver/Bronze tenant, <5 tenants")
        print(f"      Response SLA: 4 hours | Escalation: On-call Engineer")
else:
    print("✓ No failing tenants detected - no incident created")

# Expected: P0 incident (Platinum tenant affected)

## 10. Notification System

### Notification Channels

1. **Logging** (always enabled)
   - Structured logs with incident details
   - Searchable via ELK/Splunk

2. **PagerDuty** (optional)
   - P0/P1 incidents trigger pages
   - Automatic escalation after timeout
   - On-call rotation integration

3. **Slack** (optional)
   - #incidents channel notifications
   - Real-time team alerts
   - War room channel creation

4. **Email** (optional)
   - Tenant admin notifications
   - Incident summary reports
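
The channel rules above amount to a small routing decision per incident. A minimal sketch follows, assuming priorities are passed as the strings "P0"/"P1"/"P2"; `channels_for` and its flags are hypothetical and do not reflect the signature of the module's `send_notifications`.

```python
# Illustrative routing sketch; the function name and flags are hypothetical.
from typing import List


def channels_for(priority: str, pagerduty_enabled: bool = False,
                 slack_enabled: bool = False, email_enabled: bool = False) -> List[str]:
    """Return the channels an incident of the given priority is sent to."""
    channels = ["logging"]  # logging is always enabled
    if pagerduty_enabled and priority in ("P0", "P1"):
        channels.append("pagerduty")  # only P0/P1 incidents trigger pages
    if slack_enabled:
        channels.append("slack")
    if email_enabled:
        channels.append("email")
    return channels


print(channels_for("P0", pagerduty_enabled=True, slack_enabled=True))
# ['logging', 'pagerduty', 'slack']
print(channels_for("P2", pagerduty_enabled=True, slack_enabled=True))
# ['logging', 'slack']
```

Keeping the routing decision separate from the code that actually sends messages makes it easy to test escalation rules without touching any external service.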

In [None]:
# Send notifications for the incident
if failing_tenants and 'incident' in locals():
    print("📢 Sending Notifications:\n")
    
    # Check notification configuration
    pagerduty_enabled = os.getenv("PAGERDUTY_ENABLED", "false").lower() == "true"
    slack_enabled = os.getenv("SLACK_ENABLED", "false").lower() == "true"
    
    results = send_notifications(
        incident,
        pagerduty_enabled=pagerduty_enabled,
        slack_enabled=slack_enabled
    )
    
    print("  Notification Results:")
    for channel, success in results.items():
        status = "✓" if success else "✗"
        print(f"    {status} {channel.upper()}: {'Sent' if success else 'Failed'}")
    
    if not pagerduty_enabled and not slack_enabled:
        print("\n  ℹ️  Only logging enabled. Configure PagerDuty/Slack in .env for external alerts.")
else:
    print("⚠️  No incident to notify")

# Expected: Logging notification sent (PagerDuty/Slack if configured)

## 11. Blameless Postmortem Generation

### Five Whys Analysis

Postmortems focus on **system improvements**, not individual blame:

1. **Why did the incident occur?** → Identify immediate cause
2. **Why did that happen?** → Dig deeper
3. **Why did that underlying issue exist?** → Find systemic issue
4. **Why wasn't it prevented?** → Identify missing safeguards
5. **Why wasn't that in place?** → Root cause identified

### Action Items (Blameless)

- Assign to teams, not individuals
- Set realistic deadlines
- Focus on preventing future incidents
- Track completion and effectiveness
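
One way to keep action items blameless is to make the owning team, rather than a person, a required field. The sketch below is a hypothetical structure (`ActionItem` is not part of the module) that renders items in a `[Team | deadline] description` form.

```python
# Illustrative sketch; `ActionItem` is a hypothetical structure, not the module's API.
from dataclasses import dataclass


@dataclass
class ActionItem:
    team: str         # owning team, never an individual
    deadline: str     # realistic time box, e.g. "2 weeks"
    description: str  # what will prevent a repeat of the incident

    def render(self) -> str:
        """Format the item as a single postmortem line."""
        return f"[{self.team} | {self.deadline}] {self.description}"


item = ActionItem("Platform Team", "2 weeks", "Implement query validation tool")
print(item.render())
# [Platform Team | 2 weeks] Implement query validation tool
```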

In [None]:
# Generate postmortem template
if failing_tenants and 'incident' in locals():
    # Mark incident as resolved for postmortem
    incident.resolved_at = datetime.now()
    incident.root_cause = "Bad query deployment with infinite loop"
    incident.action_items = [
        "[Platform Team | 2 weeks] Implement query validation tool",
        "[DevOps | 1 week] Add pre-deployment testing for queries",
        "[SRE | 3 days] Update runbook with detection steps",
        "[Engineering | 1 week] Review circuit breaker thresholds"
    ]
    
    postmortem = generate_postmortem_template(incident)
    
    print("📝 Postmortem Generated:\n")
    print(postmortem[:1000])  # Show first 1000 chars
    print("\n... (truncated for display)")
    print(f"\nFull postmortem: {len(postmortem)} characters")
else:
    print("⚠️  No incident to generate postmortem")

# Expected: Markdown-formatted postmortem with Five Whys and action items

## 12. Cost-Benefit Analysis

### Infrastructure Costs (50-tenant deployment)

| Component | Self-Hosted | Managed SaaS | Notes |
|-----------|-------------|--------------|-------|
| Prometheus | ₹15K/month | ₹80K/month | 3-node cluster, 90-day retention |
| Grafana | ₹5K/month | Included | Visualization dashboards |
| PagerDuty | - | ₹45K/month | 10-user plan |
| ELK Stack | ₹25K/month | ₹30K/month | Log correlation (optional) |
| Compute | ₹5K/month | - | Detector + API |
| **Total** | **₹50K/month** | **₹155K/month** | **Hybrid: ₹75K/month** |

### ROI Calculation

- **First prevented outage**: ₹5.5 crore saved
- **Infrastructure cost**: ₹75K/month (hybrid)
- **Payback period**: First incident covers about 733 months (roughly 61 years) of infrastructure

In [None]:
# Calculate ROI for blast radius containment
infrastructure_cost_monthly = 75_000  # Hybrid setup
prevented_outage_cost = 55_000_000    # ₹5.5 crore

payback_months = prevented_outage_cost / infrastructure_cost_monthly
payback_years = payback_months / 12

print("💰 ROI Analysis for Blast Radius Containment:\n")
print(f"  Infrastructure Cost: ₹{infrastructure_cost_monthly:,}/month")
print(f"  First Prevented Outage: ₹{prevented_outage_cost:,} (₹{prevented_outage_cost/10_000_000:.1f} crore)")
print(f"\n  Payback Period: {payback_months:.0f} months ({payback_years:.1f} years)")
print(f"  ROI: {(prevented_outage_cost / infrastructure_cost_monthly):.0f}x return on first incident")
print(f"\n  ✓ First prevented outage pays for {payback_years:.1f} years of infrastructure!")

# Calculate break-even
print(f"\n  Break-even: 1 prevented outage (vs platform-wide failure)")
print(f"  Expected incidents per year: 2-4 (industry average)")
print(f"  Annual savings: ₹{(prevented_outage_cost * 2):,} - ₹{(prevented_outage_cost * 4):,}")
print(f"                  (₹{(prevented_outage_cost * 2)/10_000_000:.0f}-{(prevented_outage_cost * 4)/10_000_000:.0f} crore)")

# Expected: 733-month payback (about 61 years); savings are 733x the monthly infrastructure cost

## 13. Common Failure Scenarios

Based on production experience with 50-tenant multi-tenant RAG systems:

In [None]:
# Load and display common failure scenarios
import json

try:
    with open('./example_data.json', 'r') as f:
        data = json.load(f)
    
    failures = data.get('failure_scenarios', [])
    
    print("⚠️  Common Failure Scenarios:\n")
    
    for i, failure in enumerate(failures, 1):
        print(f"{i}. {failure['scenario']}")
        print(f"   Cause: {failure['cause']}")
        print(f"   Impact: {failure['impact']}")
        print(f"   Detection: {failure['detection_time_seconds']}s")
        print(f"   Mitigation: {failure['mitigation']}")
        print()
    
    print(f"Total scenarios documented: {len(failures)}")
    
except FileNotFoundError:
    print("⚠️  example_data.json not found - run from notebook directory")

# Expected: 6+ failure scenarios with detection times and mitigations

## 14. Production Deployment Checklist

### Pre-Deployment

- [ ] Prometheus deployed with 90-day retention
- [ ] Grafana dashboards configured for blast radius visualization
- [ ] PagerDuty integration tested with P0/P1/P2 routing
- [ ] Slack webhook configured for #incidents channel
- [ ] Circuit breaker thresholds tuned for your traffic patterns
- [ ] Tenant tiers configured in detector
- [ ] Runbooks created for P0/P1/P2 incidents
- [ ] On-call rotation established

### Post-Deployment

- [ ] Monitor false positive rate (<5% expected)
- [ ] Verify detection time meets 60-second SLA
- [ ] Test circuit breaker recovery (HALF_OPEN → CLOSED)
- [ ] Conduct fire drill with simulated incident
- [ ] Review and update postmortem templates
- [ ] Measure MTTR (Mean Time To Recovery)
- [ ] Calculate actual cost savings
- [ ] Schedule quarterly runbook reviews

In [None]:
# Deployment readiness check
checklist = {
    "Prometheus configured": PROMETHEUS_ENABLED,
    "Detector initialized": detector is not None,
    "Circuit breakers available": True,  # Always available
    "Incident priority rules defined": True,
    # Parse the flags as booleans: raw getenv() values are strings or None,
    # which would break sum() below and treat "false" as enabled
    "Notification system configured": any(
        os.getenv(var, "false").lower() == "true"
        for var in ("PAGERDUTY_ENABLED", "SLACK_ENABLED")
    ),
}

print("✅ Deployment Readiness Check:\n")

ready_count = sum(checklist.values())
total_count = len(checklist)

for item, status in checklist.items():
    icon = "✓" if status else "✗"
    print(f"  {icon} {item}")

print(f"\nReadiness: {ready_count}/{total_count} ({ready_count/total_count:.0%})")

if ready_count == total_count:
    print("\n🚀 System ready for production deployment!")
else:
    print("\n⚠️  Complete remaining items before production deployment")

# Expected: Readiness score based on configuration

## 15. Summary & Key Takeaways

### What You Learned

1. **Blast Radius Containment** prevents ₹5.5 crore platform outages (36x cost reduction)
2. **Circuit Breaker Pattern** automatically isolates failing tenants within 60 seconds
3. **Incident Priority Framework** ensures appropriate response based on tier and impact
4. **Blameless Postmortems** drive system improvements without individual blame
5. **Production Monitoring** requires Prometheus, alerting, and runbooks

### Production Checklist

✓ Deploy Prometheus with tenant metrics  
✓ Configure circuit breakers per tenant  
✓ Set up PagerDuty/Slack notifications  
✓ Create P0/P1/P2 runbooks  
✓ Establish on-call rotation  
✓ Test with fire drills  
✓ Monitor MTTR and false positives  

### Next Steps

- **M14.3**: Tenant lifecycle management (onboarding, offboarding)
- **M15**: Advanced monitoring and observability
- **Practathon**: Implement full incident response system

### Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Circuit Breaker Pattern](https://martinfowler.com/bliki/CircuitBreaker.html)
- [Google SRE Book - Postmortems](https://sre.google/sre-book/postmortem-culture/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)

In [None]:
# Notebook completion summary
print("🎉 L3 M14.2 Notebook Complete!\n")
print("Key Achievements:")
print("  ✓ Understanding blast radius and cost impact")
print("  ✓ Implementing circuit breaker pattern")
print("  ✓ Calculating incident priorities")
print("  ✓ Creating blameless postmortems")
print("  ✓ Analyzing ROI for incident management")
print("\nNext Actions:")
print("  1. Deploy Prometheus in your environment")
print("  2. Configure tenant tiers and thresholds")
print("  3. Set up PagerDuty/Slack integrations")
print("  4. Create incident runbooks for your team")
print("  5. Conduct fire drill to test response")
print("\n📚 Continue to M14.3: Tenant Lifecycle Management")