# Reliability & Disaster Recovery

This guide covers how to build systems that remain operational despite failures: availability calculations, failure modes, redundancy strategies, and disaster recovery planning.

## Key Concepts

| Term | Definition |
|------|------------|
| **Availability** | Percentage of time a system is operational |
| **SLA** | Service Level Agreement - contractual uptime commitment |
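| **RTO** | Recovery Time Objective - maximum acceptable time to restore service after an outage |
| **RPO** | Recovery Point Objective - maximum acceptable data loss, measured as a window of time |
| **MTBF** | Mean Time Between Failures - average operating time between two failures |
| **MTTR** | Mean Time To Recovery - average time to restore service after a failure |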

## Availability & SLA Calculations

**Availability Formula:**
$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} = \frac{\text{Uptime}}{\text{Total Time}}$$

**The "Nines" of Availability:**

| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|--------------|---------------|----------------|---------------|
| 99% (two 9s) | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% (three 9s) | 8.77 hours | 43.83 min | 10.08 min |
| 99.99% (four 9s) | 52.60 min | 4.38 min | 1.01 min |
| 99.999% (five 9s) | 5.26 min | 26.30 sec | 6.05 sec |

**Composite Availability** (assuming components fail independently):
- **Series (all must work):** $A_{total} = A_1 \times A_2 \times \cdots \times A_n$
- **Parallel (any can work):** $A_{total} = 1 - (1-A_1) \times (1-A_2) \times \cdots \times (1-A_n)$

In [None]:
# Availability Calculations

def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Calculate availability from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def downtime_per_year(availability: float) -> dict:
    """Calculate allowed downtime for a given availability."""
    minutes_per_year = 365.25 * 24 * 60
    downtime_minutes = minutes_per_year * (1 - availability)
    return {
        "availability": f"{availability * 100:.4f}%",
        "downtime_per_year": f"{downtime_minutes:.2f} minutes ({downtime_minutes/60:.2f} hours)",
        "downtime_per_month": f"{downtime_minutes/12:.2f} minutes",
        "downtime_per_day": f"{downtime_minutes/365.25:.4f} minutes"
    }

def series_availability(*availabilities: float) -> float:
    """Components in series - ALL must work (multiply)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel_availability(*availabilities: float) -> float:
    """Components in parallel - ANY can work (redundancy)."""
    failure_prob = 1.0
    for a in availabilities:
        failure_prob *= (1 - a)
    return 1 - failure_prob

print("=== Availability Calculations ===")
print(f"\nMTBF 1000 h, MTTR 1 h: {availability_from_mtbf(1000, 1):.6f}")
print(f"Four 9s downtime: {downtime_per_year(0.9999)}")

# Example: web application with load balancer -> app server(s) -> database

# Single points of failure
lb_avail = 0.999
app_avail = 0.99
db_avail = 0.999

# Without redundancy (series)
single_path = series_availability(lb_avail, app_avail, db_avail)
print(f"\nSingle path availability: {single_path:.6f} ({single_path*100:.4f}%)")

# With 2 redundant app servers
redundant_apps = parallel_availability(app_avail, app_avail)
with_redundancy = series_availability(lb_avail, redundant_apps, db_avail)
print(f"With 2 app servers: {with_redundancy:.6f} ({with_redundancy*100:.4f}%)")

# With 3 redundant app servers
triple_apps = parallel_availability(app_avail, app_avail, app_avail)
with_triple = series_availability(lb_avail, triple_apps, db_avail)
print(f"With 3 app servers: {with_triple:.6f} ({with_triple*100:.4f}%)")

## Failure Modes

### Hardware Failures
| Component | Typical MTBF | Mitigation |
|-----------|--------------|------------|
| HDD | 300K-1M hours | RAID, replication |
| SSD | 1-2M hours | Redundancy, backups |
| Server | 50K-100K hours | Clustering, spare parts |
| Network Switch | 200K+ hours | Redundant paths |
| Power Supply | 100K+ hours | Dual PSU, UPS |
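
MTBF figures are easier to compare against a fleet size once they are converted to an annual failure rate. The sketch below assumes a constant failure rate (an exponential model), which ignores infant mortality and wear-out. The function names and the 1,000-unit fleet are illustrative assumptions, not part of any library.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

def annual_failure_rate(mtbf_hours: float) -> float:
    """Probability that one unit fails within a year (constant failure rate assumed)."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

def expected_failures_per_year(mtbf_hours: float, fleet_size: int) -> float:
    """Expected number of failures per year across a fleet of identical units."""
    return fleet_size * HOURS_PER_YEAR / mtbf_hours

# MTBF values taken from the upper end of the ranges in the table above
for name, mtbf in [("HDD", 1_000_000), ("Server", 100_000)]:
    afr = annual_failure_rate(mtbf)
    fleet = expected_failures_per_year(mtbf, 1000)
    print(f"{name}: AFR {afr:.2%}, ~{fleet:.1f} failures/year per 1000 units")
```

Even a component with a very high MTBF fails regularly at fleet scale, which is why the mitigations in the table rely on redundancy rather than on better parts alone.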

### Software Failures
- **Memory leaks** → Graceful restarts, memory limits
- **Deadlocks** → Timeouts, lock ordering
- **Resource exhaustion** → Rate limiting, circuit breakers
- **Configuration errors** → GitOps, validation, canary deployments
- **Dependency failures** → Fallbacks, graceful degradation
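
As one illustration of the last item, a fallback wrapper can return a degraded but still useful response when a dependency fails. This is a minimal sketch; `with_fallback`, `fetch_recommendations`, and `default_recommendations` are hypothetical names invented for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def with_fallback(primary: Callable[[], T], fallback: Callable[[], T],
                  handled: tuple = (Exception,)) -> T:
    """Run primary(); if it raises one of the handled exceptions, return fallback()."""
    try:
        return primary()
    except handled:
        return fallback()

# Hypothetical dependency that is currently failing
def fetch_recommendations() -> list:
    raise ConnectionError("recommendation service unavailable")

# Static response used while the dependency is down
def default_recommendations() -> list:
    return ["bestsellers"]

result = with_fallback(fetch_recommendations, default_recommendations,
                       handled=(ConnectionError, TimeoutError))
print(result)
```

Listing the handled exception types explicitly keeps programming errors (for example a `TypeError`) visible instead of silently masking them with the fallback.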

### Network Failures
- **Partition** → Decide per service whether to favor consistency or availability (CAP theorem)
- **Latency spikes** → Timeouts, async processing
- **DNS failures** → Multiple DNS providers, caching
- **BGP issues** → Multi-region, anycast

## Redundancy Strategies

### Active-Active
```
┌─────────────┐    ┌─────────────┐
│  Server A   │◄──►│  Server B   │   Both handle traffic
│  (Active)   │    │  (Active)   │   Load balanced
└─────────────┘    └─────────────┘
```
✅ Full capacity utilization, instant failover  
❌ Complex state synchronization, split-brain risk

### Active-Passive (Hot Standby)
```
┌─────────────┐    ┌─────────────┐
│  Server A   │───►│  Server B   │   B receives updates
│  (Active)   │    │  (Standby)  │   Ready to take over
└─────────────┘    └─────────────┘
```
✅ Simpler state management: a single writer, so no write conflicts  
❌ Idle standby capacity, failover delay, needs fencing to rule out split-brain
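
The failover delay in an active-passive pair comes largely from how long the standby waits before declaring the active node dead. The sketch below models that decision with explicit timestamps; `StandbyMonitor` and its 3-second timeout are assumptions made for illustration, and a real deployment would also need fencing so the old active node cannot keep writing.

```python
from dataclasses import dataclass

@dataclass
class StandbyMonitor:
    heartbeat_timeout_s: float = 3.0
    last_heartbeat: float = 0.0
    promoted: bool = False

    def on_heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the active node."""
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        """Promote the standby once the active node has been silent for too long."""
        if not self.promoted and now - self.last_heartbeat > self.heartbeat_timeout_s:
            self.promoted = True
        return self.promoted

monitor = StandbyMonitor()
monitor.on_heartbeat(now=10.0)
print(monitor.tick(now=12.0))  # still within the timeout
print(monitor.tick(now=13.5))  # silent for 3.5 s, standby takes over
```

A shorter timeout reduces failover delay but raises the chance of promoting the standby during a brief network hiccup.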

### N+1 / N+2 Redundancy
- **N+1**: One spare for N active (e.g., 4 servers, need 3 for load)
- **N+2**: Two spares (handles maintenance + failure simultaneously)
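
The sizing arithmetic can be sketched as follows, assuming load spreads evenly and every server has the same capacity. The function name and the request rates are illustrative.

```python
import math

def servers_required(peak_load_rps: float, capacity_per_server_rps: float,
                     spares: int = 1) -> int:
    """N servers to carry peak load, plus `spares` extra (N+1 when spares=1)."""
    n = math.ceil(peak_load_rps / capacity_per_server_rps)
    return n + spares

peak, per_server = 2700, 1000  # requests per second (made-up figures)
n_plus_1 = servers_required(peak, per_server, spares=1)
n_plus_2 = servers_required(peak, per_server, spares=2)
print(f"N+1: {n_plus_1} servers, N+2: {n_plus_2} servers")
print(f"Peak utilization with all N+1 servers healthy: {peak / (per_server * n_plus_1):.1%}")
```

With these numbers N is 3, so N+1 reproduces the "4 servers, need 3 for load" case from the list above.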

### Geographic Redundancy
| Pattern | Typical latency | Cost | Typical RPO |
|---------|-----------------|------|-------------|
| Same rack | <1ms | Low | 0 |
| Same datacenter | 1-5ms | Medium | 0 |
| Same region | 5-20ms | Medium | Seconds |
| Cross-region | 50-200ms | High | Minutes |

In [None]:
# RTO/RPO and Backup Strategy Calculator

def calculate_data_loss(rpo_hours: float, write_rate_mb_per_hour: float) -> float:
    """Calculate maximum data loss in MB based on RPO."""
    return rpo_hours * write_rate_mb_per_hour

def calculate_backup_frequency(rpo_hours: float, safety_margin: float = 0.8) -> float:
    """Calculate the backup interval in hours needed to meet the RPO."""
    return rpo_hours * safety_margin  # Back up more frequently than RPO

def estimate_recovery_time(data_size_gb: float, restore_speed_mbps: float, 
                           setup_time_min: float = 30) -> float:
    """Estimate recovery time in minutes (restore_speed_mbps is megabits per second)."""
    data_mb = data_size_gb * 1024
    transfer_time_sec = (data_mb * 8) / restore_speed_mbps
    return setup_time_min + (transfer_time_sec / 60)

def disaster_recovery_tiers():
    """Common DR tier definitions."""
    return {
        "Tier 1 - Mission Critical": {"RTO": "< 1 hour", "RPO": "< 15 min", "Strategy": "Active-Active multi-region"},
        "Tier 2 - Business Critical": {"RTO": "< 4 hours", "RPO": "< 1 hour", "Strategy": "Hot standby"},
        "Tier 3 - Important": {"RTO": "< 24 hours", "RPO": "< 4 hours", "Strategy": "Warm standby"},
        "Tier 4 - Non-Critical": {"RTO": "< 72 hours", "RPO": "< 24 hours", "Strategy": "Cold backup"}
    }

print("=== Disaster Recovery Planning ===")
print("\nDR Tiers:")
for tier, config in disaster_recovery_tiers().items():
    print(f"  {tier}: RTO {config['RTO']}, RPO {config['RPO']} -> {config['Strategy']}")

# Example: E-commerce database
print("\n=== E-commerce Database Example ===")
rpo = 1  # 1 hour RPO
write_rate = 500  # 500 MB/hour of transactions
db_size = 500  # 500 GB database
restore_speed = 1000  # 1 Gbps restore speed

max_loss = calculate_data_loss(rpo, write_rate)
backup_freq = calculate_backup_frequency(rpo)
recovery_time = estimate_recovery_time(db_size, restore_speed)

print(f"RPO: {rpo} hour -> Max data loss: {max_loss} MB")
print(f"Recommended backup frequency: Every {backup_freq:.1f} hours")
print(f"Estimated recovery time: {recovery_time:.1f} minutes ({recovery_time/60:.2f} hours)")
print(f"\n⚠️  If RTO < {recovery_time/60:.2f} hours, need hot standby or replication!")

## Backup Strategies

### 3-2-1 Rule
- **3** copies of data
- **2** different storage types (disk, tape, cloud)
- **1** offsite copy
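
A backup inventory can be checked against the rule mechanically. This is a minimal sketch that assumes the production copy counts as one of the three; `BackupCopy` and its fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    location: str   # e.g. "primary-dc", "cloud-bucket"
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool

def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    return (len(copies) >= 3
            and len({c.media for c in copies}) >= 2
            and any(c.offsite for c in copies))

inventory = [
    BackupCopy("primary-dc", "disk", offsite=False),   # production data
    BackupCopy("primary-dc", "tape", offsite=False),
    BackupCopy("cloud-bucket", "cloud", offsite=True),
]
print(satisfies_3_2_1(inventory))
```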

### Backup Types

| Type | Description | Restore Time | Storage |
|------|-------------|--------------|----------|
| **Full** | Complete copy | Fast | High |
| **Incremental** | Changes since the last backup of any type | Slow (full + every incremental since) | Low |
| **Differential** | Changes since last full | Medium | Medium |
| **Continuous (CDP)** | Real-time replication | Instant | High |
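
The storage column can be made concrete with a rough weekly estimate. The sketch assumes one full backup at the start of the week, a fixed amount of new changes each day that do not overlap with earlier days, and no compression or deduplication; the function name and sizes are illustrative.

```python
def weekly_storage_gb(full_gb: float, daily_change_gb: float, days: int = 7) -> dict:
    """Storage used over one cycle: a full on day 0, then one backup per remaining day."""
    later_days = days - 1
    return {
        "full_daily": full_gb * days,
        "full_plus_incremental": full_gb + daily_change_gb * later_days,
        # The differential on day d holds every change since the full: d * daily_change
        "full_plus_differential": full_gb + daily_change_gb * sum(range(1, later_days + 1)),
    }

for strategy, size in weekly_storage_gb(full_gb=500, daily_change_gb=10).items():
    print(f"{strategy}: {size:.0f} GB")
```

Differentials cost more storage than incrementals because each one repeats the changes of the previous days, which is also what makes them faster to restore.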

### Database Backup Strategies
```
┌──────────────────────────────────────────────────────────┐
│  Sun      Mon     Tue     Wed     Thu     Fri     Sat    │
│ [FULL] → [Inc] → [Inc] → [Inc] → [Inc] → [Inc] → [Inc]   │
│            ↑       ↑       ↑       ↑       ↑       ↑     │
│            └───────┴───────┴───────┴───────┴───────┘     │
│                Transaction Log Backups (hourly)          │
└──────────────────────────────────────────────────────────┘
```
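
To see what this schedule means at restore time, the sketch below lists the backups to apply, in order, for a failure at a given day and hour. It assumes the full and incremental backups run at 00:00 and that transaction log backups complete on the hour; `restore_chain` is a hypothetical helper.

```python
DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def restore_chain(failure_day: int, failure_hour: int) -> list[str]:
    """Backups to restore, in order, for a failure at day (0=Sun) and hour (0-23)."""
    chain = [f"FULL {DAYS[0]}"]
    # Every incremental from Monday up to and including the failure day
    chain += [f"INC {DAYS[d]}" for d in range(1, failure_day + 1)]
    # Hourly log backups taken on the failure day before the failure
    chain += [f"LOG {DAYS[failure_day]} {h:02d}:00" for h in range(1, failure_hour + 1)]
    return chain

# Failure on Wednesday at 02:30
for step in restore_chain(failure_day=3, failure_hour=2):
    print(step)
```

In this example the last restorable point is Wednesday 02:00, so up to 30 minutes of transactions are lost, which stays within a 1-hour RPO.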

## Failover Patterns

### DNS Failover
```
User → DNS (health check) → Primary OK? → Primary Server
                         → Primary DOWN? → Secondary Server
```
⚠️ **Limitation**: Resolvers and clients cache the old record until its TTL expires, so failover takes at least as long as the TTL to reach everyone
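
A rough upper bound on DNS failover time adds the detection delay to the record TTL. The sketch assumes resolvers honor the TTL (some do not) and that the health checker needs a fixed number of consecutive failed checks; the parameter names and values are illustrative.

```python
def dns_failover_seconds(check_interval_s: float, failures_to_trip: int,
                         ttl_s: float) -> float:
    """Estimate: time to detect the outage plus time for cached records to expire."""
    detection = check_interval_s * failures_to_trip
    return detection + ttl_s

for ttl in (60, 300):
    total = dns_failover_seconds(check_interval_s=30, failures_to_trip=3, ttl_s=ttl)
    print(f"TTL {ttl}s -> up to {total:.0f}s before all clients reach the secondary")
```

Lowering the TTL shortens failover but increases query volume against the DNS provider.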

### Load Balancer Failover
```
              ┌──► Server 1 (healthy) ✓
User → LB ───┼──► Server 2 (healthy) ✓
              └──► Server 3 (failed)  ✗ (removed from pool)
```
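
The pool behavior in the diagram can be sketched as a round-robin selector that skips servers after repeated failed health checks. `ServerPool` and the threshold of three failed checks are assumptions made for the example, not the behavior of any specific load balancer.

```python
class ServerPool:
    def __init__(self, servers: list[str], unhealthy_after: int = 3):
        self.servers = list(servers)
        self.unhealthy_after = unhealthy_after
        self.failed_checks = {s: 0 for s in self.servers}
        self._next = 0

    def report_check(self, server: str, ok: bool) -> None:
        """A passing check resets the counter; a failing check increments it."""
        self.failed_checks[server] = 0 if ok else self.failed_checks[server] + 1

    def healthy(self) -> list[str]:
        return [s for s in self.servers if self.failed_checks[s] < self.unhealthy_after]

    def pick(self) -> str:
        """Round-robin over the servers that are currently healthy."""
        pool = self.healthy()
        if not pool:
            raise RuntimeError("no healthy servers")
        server = pool[self._next % len(pool)]
        self._next += 1
        return server

pool = ServerPool(["server1", "server2", "server3"])
for _ in range(3):
    pool.report_check("server3", ok=False)  # server3 fails three checks in a row
picks = [pool.pick() for _ in range(4)]
print(picks)
```

A single passing check puts the server back in rotation here; real load balancers usually require several consecutive passes to avoid flapping.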

### Database Failover

| Pattern | Failover Time | Data Loss | Complexity |
|---------|---------------|-----------|------------|
| Manual | Minutes-Hours | Possible | Low |
| Automated (leader election) | Seconds | Minimal | Medium |
| Multi-master | Near zero | None with synchronous replication; conflicts possible with asynchronous | High |

### Circuit Breaker Pattern
```
CLOSED ──(failures > threshold)──► OPEN
   ▲                                 │
   │                          (timeout)
   │                                 ▼
   └────(success)──── HALF-OPEN ◄───┘
                      (test request)
```

In [None]:
# Circuit Breaker Implementation Example
import time
from enum import Enum
from dataclasses import dataclass, field

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 30.0
    success_threshold: int = 3
    
    state: CircuitState = field(default=CircuitState.CLOSED)
    failures: int = field(default=0)
    successes: int = field(default=0)
    last_failure_time: float = field(default=0)
    
    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        elif self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.successes = 0
                return True
            return False
        else:  # HALF_OPEN
            return True
    
    def record_success(self):
        if self.state == CircuitState.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = CircuitState.CLOSED
                self.failures = 0
        elif self.state == CircuitState.CLOSED:
            self.failures = 0
    
    def record_failure(self):
        self.failures += 1
        self.last_failure_time = time.time()
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
        elif self.failures >= self.failure_threshold:
            self.state = CircuitState.OPEN

# Simulate circuit breaker behavior
cb = CircuitBreaker(failure_threshold=3, recovery_timeout=5)

print("=== Circuit Breaker Simulation ===")
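# After the recovery wait, probe again and record enough successes to close the circuit
events = [("success", True), ("failure", False), ("failure", False),
          ("failure", False), ("attempt", None), ("wait_recovery", None),
          ("attempt", None), ("success", True), ("success", True), ("success", True)]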

for event, success in events:
    if event == "wait_recovery":
        print(f"  ... waiting {cb.recovery_timeout}s for recovery timeout ...")
        cb.last_failure_time = time.time() - cb.recovery_timeout - 1  # Simulate wait
    elif event == "attempt":
        can_exec = cb.can_execute()
        print(f"  Attempt: can_execute={can_exec}, state={cb.state.value}")
    else:
        if cb.can_execute():
            if success:
                cb.record_success()
            else:
                cb.record_failure()
        print(f"  {event}: state={cb.state.value}, failures={cb.failures}")

## Chaos Engineering Basics

**Principle**: Proactively inject failures to discover weaknesses before they cause outages.

### Chaos Engineering Process
1. **Hypothesize** steady-state behavior
2. **Introduce** realistic failure (smallest blast radius first)
3. **Observe** system behavior
4. **Learn** and improve resilience
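
The four steps can be sketched as a small harness that checks steady state, injects a fault, observes, and always rolls back. Everything here is a toy: `run_experiment` and the in-memory `replicas` dict are hypothetical stand-ins for real tooling and a real system.

```python
from typing import Callable

def run_experiment(steady_state: Callable[[], bool],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    """Run one chaos experiment and report the outcome."""
    if not steady_state():
        return "aborted: system not in steady state before injection"
    inject()
    try:
        held = steady_state()  # observe behavior while the fault is active
    finally:
        rollback()             # always remove the fault, even if observation raises
    return "hypothesis held" if held else "weakness found"

# Toy system: the service is "up" as long as at least one replica is alive
replicas = {"a": True, "b": True}
outcome = run_experiment(
    steady_state=lambda: any(replicas.values()),
    inject=lambda: replicas.update(a=False),    # kill one replica (small blast radius)
    rollback=lambda: replicas.update(a=True),
)
print(outcome)
```

Checking steady state before injecting avoids blaming the experiment for a problem that already existed.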

### Common Experiments

| Experiment | Simulates | Tools |
|------------|-----------|-------|
| Kill process/container | Crash | `kill -9`, Chaos Monkey |
| Network latency | Slow network | `tc netem`, Toxiproxy |
| Packet loss | Unreliable network | `tc netem`, Chaos Mesh |
| CPU stress | Resource contention | `stress-ng` |
| Disk full | Storage failure | `fallocate` |
| AZ/Region failure | Cloud outage | AWS FIS, Gremlin |
| DNS failure | Resolution issues | Block DNS, Chaos Mesh |

### Chaos Maturity Model
```
Level 0: No chaos testing
Level 1: Manual experiments in staging
Level 2: Automated experiments in staging
Level 3: Automated experiments in production (off-peak)
Level 4: Continuous chaos in production (GameDay ready)
```

### Key Metrics to Monitor
- Error rates and latency percentiles
- Service availability and SLO burn rate
- Cascading failures across services
- Recovery time after fault injection
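
SLO burn rate is the observed error rate divided by the error budget (1 minus the SLO target). The sketch below uses that common definition; the 30-day window and the sample numbers are assumptions for illustration.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed; 1.0 means exactly on budget."""
    return error_rate / (1 - slo)

def days_until_budget_exhausted(rate: float, window_days: float = 30) -> float:
    """Days until the window's error budget is gone at a constant burn rate."""
    return window_days / rate

rate = burn_rate(error_rate=0.005, slo=0.999)  # 0.5% errors against a 99.9% SLO
days = days_until_budget_exhausted(rate)
print(f"Burn rate: {rate:.1f}x, budget exhausted in {days:.1f} days")
```

During a chaos experiment, a burn rate well above 1 is a reasonable signal to stop the injection and roll back.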