# L3 M10.4: Disaster Recovery & Business Continuity

## Learning Arc (What → Why → How)

**What:** Disaster recovery and business continuity for financial RAG systems

This module implements production-grade DR capabilities including cross-region replication monitoring, automated failover orchestration, and FINRA Rule 4370 compliance reporting.

**Why:** Trading-critical systems require Hot DR

- **RTO = 15 minutes**: Maximum acceptable downtime during market hours (9:30 AM - 4:00 PM ET)
- **RPO = 1 hour**: Maximum acceptable data loss
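- **FINRA Rule 4370**: Requires a documented, regularly tested business continuity plan (at least annually; quarterly is common industry practice and is what this module follows)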
- **Cost of downtime**: ₹30K+ per hour during market hours
- **Break-even**: Hot DR (₹2.5L/month) pays for itself if it prevents ONE major outage every 4 years

**How:** Cross-region replication with automated failover

1. **Multi-region replication**: US-EAST-1 (primary) → US-WEST-2 (DR)
2. **Continuous monitoring**: PostgreSQL and Pinecone lag tracking
3. **DNS-based failover**: Route 53 with 60-second TTL
4. **Automated orchestration**: Lambda functions for pre-flight checks and failover
5. **Compliance reporting**: Quarterly FINRA test documentation

## Concepts Covered

1. **Recovery Time Objective (RTO)**: Maximum downtime (15 minutes for trading systems)
2. **Recovery Point Objective (RPO)**: Maximum data loss (1 hour for document-based RAG)
3. **DR Tiers**: Cold (24+ hrs), Warm (2-4 hrs), Hot (< 15 min)
4. **Cross-Region Replication**: PostgreSQL, Pinecone, Redis, S3 across AWS regions
5. **DNS-Based Failover**: Route 53 health checks and automatic traffic redirection
6. **FINRA Rule 4370**: DR testing requirements (at least annually; quarterly in this module)
7. **SOX Section 404**: 7-year document retention with audit trail
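
The DR tiers above can be turned into a simple selection rule: pick the cheapest tier whose recovery time still fits the RTO. The sketch below is illustrative only. The function name `choose_dr_tier` is hypothetical, the cost ordering (cold cheapest, hot most expensive) is an assumption, and because the cold tier is listed as "24+ hrs" with no upper bound, 24 hours is used as an optimistic stand-in.

```python
from typing import Optional

# Recovery time per tier in minutes, ordered cheapest first (assumed ordering).
# Cold has no stated upper bound ("24+ hrs"); 24 hours is an optimistic stand-in.
TIER_RECOVERY_MINUTES = [
    ("cold", 24 * 60),
    ("warm", 4 * 60),   # upper end of the 2-4 hour range
    ("hot", 15),
]


def choose_dr_tier(rto_minutes: float) -> Optional[str]:
    """Return the cheapest tier whose recovery time fits within the RTO."""
    for tier, recovery_minutes in TIER_RECOVERY_MINUTES:
        if recovery_minutes <= rto_minutes:
            return tier
    return None  # none of the listed tiers recovers fast enough


print(choose_dr_tier(15))   # trading-hours RTO from this module
print(choose_dr_tier(300))  # a 5-hour RTO
```

With the 15-minute RTO used in this module, only the hot tier qualifies, which is why the rest of the notebook assumes Hot DR.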

## Prerequisites

**Knowledge:**
- Generic CCC M1-M6 (RAG MVP foundations)
- Finance AI M10.1 (Security architecture)
- Finance AI M10.2 (Monitoring)
- Finance AI M10.3 (Cost optimization)

**Infrastructure:**
- AWS account with multi-region permissions
- Pinecone Production tier (cross-region replication support)
- PostgreSQL primary + DR databases
- Route 53 hosted zone

In [None]:
# OFFLINE Mode Guard
# This notebook can run without external service credentials for learning purposes

import os
import sys

AWS_ENABLED = os.getenv("AWS_ENABLED", "false").lower() == "true"
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY_ID", "")

if not AWS_ENABLED or not AWS_ACCESS_KEY:
    print("⚠️ AWS services not configured")
    print("This notebook will run in OFFLINE mode (skipping external API calls)")
    print("")
    print("To enable AWS services:")
    print("1. Copy .env.example to .env")
    print("2. Set AWS_ENABLED=true")
    print("3. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY")
    print("4. Configure PostgreSQL credentials")
    print("")
    print("For learning purposes, OFFLINE mode demonstrates the concepts without actual infrastructure.")
else:
    print("✅ AWS services enabled")
    print("Running in PRODUCTION mode with actual infrastructure")

# Expected: OFFLINE mode warning (unless credentials configured)

In [None]:
# Imports
from datetime import datetime, timedelta
from src.l3_m10_financial_rag_production import (
    ReplicationMonitor,
    DRVerifier,
    FailoverOrchestrator,
    ComplianceReporter,
    ReplicationStatus,
    FailoverResult,
    verify_dr_readiness,
    execute_failover,
    generate_compliance_report
)

print("✅ Imports successful")
print(f"Module loaded: {ReplicationMonitor.__module__}")

# Expected: ✅ Imports successful

In [None]:
print("SAVED_SECTION:1")

## Section 1: Understanding RTO and RPO

### Recovery Time Objective (RTO)

**RTO** is the maximum acceptable downtime for your system.

**Question it answers**: "How long can we be down before business impact becomes unacceptable?"

**For financial RAG during market hours**:
- **Target**: 15 minutes
- **Rationale**: Markets move fast, portfolio managers need real-time information
- **Cost of violation**: ₹30K+ per hour in opportunity cost + regulatory risk

### Recovery Point Objective (RPO)

**RPO** is the maximum acceptable data loss measured in time.

**Question it answers**: "How much recent data can we afford to lose?"

**For financial RAG systems**:
- **Target**: 1 hour
- **Rationale**: Financial documents don't change every second (unlike tick data)
- **Implementation**: Replication lag monitoring with CloudWatch alarms
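
As a rough illustration of the CloudWatch alarm mentioned above, the sketch below builds the keyword arguments for an RPO-breach alarm. The alarm name, namespace, and metric name are hypothetical placeholders, and the period and evaluation settings are assumptions rather than values taken from this module. The dictionary keys are standard parameters of CloudWatch's `put_metric_alarm`.

```python
def lag_alarm_kwargs(threshold_seconds: int, name_suffix: str) -> dict:
    """Build keyword arguments for a CloudWatch replication-lag alarm."""
    return {
        "AlarmName": f"rag-replication-lag-{name_suffix}",  # hypothetical name
        "Namespace": "FinancialRAG/DR",                      # hypothetical namespace
        "MetricName": "ReplicationLagSeconds",               # hypothetical metric
        "Statistic": "Maximum",
        "Period": 60,                 # assumed: evaluate one-minute datapoints
        "EvaluationPeriods": 3,       # assumed: three breaching periods in a row
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # a silent metric is treated as a problem
    }


RPO_SECONDS = 60 * 60  # 1-hour RPO from this section
rpo_alarm = lag_alarm_kwargs(RPO_SECONDS, "rpo-breach")
print(rpo_alarm["AlarmName"], rpo_alarm["Threshold"])

# With credentials configured, the dictionary could be passed to boto3:
#   boto3.client("cloudwatch").put_metric_alarm(**rpo_alarm)
```

Treating missing data as breaching is a deliberate choice here: if the lag metric stops arriving, the replication state is unknown, which is worth an alert in its own right.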

In [None]:
# RTO/RPO Constants
RTO_TARGET_MINUTES = 15  # Target for trading systems during market hours (set by this module; FINRA Rule 4370 does not prescribe a number)
RPO_TARGET_MINUTES = 60  # 1 hour maximum data loss
RPO_TARGET_SECONDS = RPO_TARGET_MINUTES * 60  # 3600 seconds

print(f"RTO Target: {RTO_TARGET_MINUTES} minutes")
print(f"RPO Target: {RPO_TARGET_MINUTES} minutes ({RPO_TARGET_SECONDS} seconds)")
print("")
print("Regulatory Basis: FINRA Rule 4370 - Business Continuity Planning")

# Expected: RTO Target: 15 minutes, RPO Target: 60 minutes

In [None]:
print("SAVED_SECTION:2")

## Section 2: Replication Monitoring

### ReplicationMonitor Class

The `ReplicationMonitor` continuously tracks replication lag between primary (US-EAST-1) and DR (US-WEST-2) regions.

**Key Metrics:**
- **lag_seconds**: How far behind DR is from primary
- **is_connected**: Replication link healthy?
- **data_consistency_ratio**: What % of primary data is in DR?
- **meets_rpo**: Is lag within 1-hour RPO?

**How It Works:**
1. Queries `pg_stat_replication` on primary PostgreSQL
2. Compares document counts between primary and DR
3. Publishes metrics to CloudWatch
4. Triggers alarms if lag exceeds thresholds
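
The first two steps above can be sketched without a live database. This is a minimal illustration, not the actual `ReplicationMonitor` implementation: `LagSnapshot` and `build_snapshot` are hypothetical names, and treating a NULL `replay_lag` as zero lag is an assumption. The query relies on the `replay_lag` interval column that `pg_stat_replication` exposes in PostgreSQL 10 and later.

```python
from dataclasses import dataclass

# Run on the primary. replay_lag can be NULL when the replica is idle;
# this sketch treats NULL as zero lag (an assumption).
LAG_QUERY = """
SELECT application_name,
       COALESCE(EXTRACT(EPOCH FROM replay_lag), 0) AS lag_seconds
FROM pg_stat_replication;
"""


@dataclass
class LagSnapshot:  # hypothetical stand-in for ReplicationStatus
    lag_seconds: float
    is_connected: bool
    data_consistency_ratio: float
    meets_rpo: bool


def build_snapshot(lag_rows, primary_docs, dr_docs, rpo_seconds=3600):
    """Combine lag rows and document counts into one status record."""
    is_connected = len(lag_rows) > 0  # no rows means no replica is streaming
    lag = max((row[1] for row in lag_rows), default=float("inf"))
    ratio = min(dr_docs / primary_docs, 1.0) if primary_docs else 1.0
    meets_rpo = is_connected and lag < rpo_seconds
    return LagSnapshot(lag, is_connected, ratio, meets_rpo)


# Rows shaped like the query result: (application_name, lag_seconds)
snap = build_snapshot([("dr-replica", 5.0)], primary_docs=10432, dr_docs=10411)
print(f"{snap.lag_seconds:.1f}s, {snap.data_consistency_ratio:.2%}, RPO ok: {snap.meets_rpo}")
```

Taking the maximum lag across replicas is the conservative choice: the RPO is only met if the slowest replica is within it.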

In [None]:
# Create ReplicationMonitor
mock_primary_config = {
    "host": "primary-db.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "database": "financial_rag",
    "user": "rag_user",
    "password": "*****"  # Hidden
}

mock_dr_config = {
    "host": "dr-db.us-west-2.rds.amazonaws.com",
    "port": 5432,
    "database": "financial_rag",
    "user": "rag_user",
    "password": "*****"  # Hidden
}

monitor = ReplicationMonitor(mock_primary_config, mock_dr_config)
print("✅ ReplicationMonitor initialized")

# Expected: ✅ ReplicationMonitor initialized

In [None]:
# Check replication lag
status = monitor.check_replication_lag()

print(f"Replication Status:")
print(f"  Lag: {status.lag_seconds:.1f} seconds")
print(f"  Connected: {status.is_connected}")
print(f"  Data Consistency: {status.data_consistency_ratio:.2%}")
print(f"  Meets RPO: {status.meets_rpo} (target: < {RPO_TARGET_SECONDS}s)")

# Expected: Lag ~5 seconds, Connected True, Consistency ~99.8%, Meets RPO True

In [None]:
print("SAVED_SECTION:3")

## Section 3: DR Readiness Verification

### DRVerifier Class

Before executing failover, we must verify that the DR region is ready.

**Pre-Flight Checks:**
1. ✅ Replication connected
2. ✅ Lag < 10 minutes (safe for failover)
3. ✅ Data consistency > 99%
4. ✅ DR infrastructure healthy

**Why This Matters:**
- Failing over to an unhealthy DR region makes the situation worse
- It is better to wait for replication to catch up than to serve stale data
- FINRA requires documented verification procedures
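
The four pre-flight checks can be expressed as one function that collects every failing condition instead of stopping at the first. This is a sketch under stated assumptions, not the `DRVerifier` implementation: the function name `preflight` and its parameters are hypothetical, while the thresholds come from the list above.

```python
MAX_FAILOVER_LAG_SECONDS = 10 * 60  # "Lag < 10 minutes"
MIN_CONSISTENCY_RATIO = 0.99        # "Data consistency > 99%"


def preflight(is_connected, lag_seconds, consistency_ratio, infra_healthy):
    """Return readiness plus the full list of blocking issues."""
    issues = []
    if not is_connected:
        issues.append("replication link is down")
    if lag_seconds >= MAX_FAILOVER_LAG_SECONDS:
        issues.append(f"lag {lag_seconds:.0f}s is not below {MAX_FAILOVER_LAG_SECONDS}s")
    if consistency_ratio <= MIN_CONSISTENCY_RATIO:
        issues.append(f"consistency {consistency_ratio:.2%} is not above 99%")
    if not infra_healthy:
        issues.append("DR infrastructure is unhealthy")
    return {"ready": not issues, "issues": issues}


print(preflight(True, 5.0, 0.998, True))
print(preflight(True, 900.0, 0.95, True))
```

Reporting all issues at once gives the on-call engineer the complete picture in a single run, which also makes the verification record easier to document.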

In [None]:
# Create DRVerifier
verifier = DRVerifier(monitor)
print("✅ DRVerifier initialized")

# Run comprehensive health checks
health = verifier.run_health_checks()

print(f"\nDR Health Check Results:")
print(f"  Ready for Failover: {health['ready']}")
print(f"  Issues: {health['issues'] if health['issues'] else 'None'}")
print(f"  Replication Lag: {health['replication_lag_seconds']:.1f}s")
print(f"  Data Consistency: {health['data_consistency']:.2%}")

# Expected: Ready True, Issues None, Lag ~5s, Consistency ~99.8%

In [None]:
print("SAVED_SECTION:4")

## Section 4: Failover Orchestration

### FailoverOrchestrator Class

Automates the complete failover workflow:

**Steps:**
1. **Detect** primary region failure (CloudWatch health check)
2. **Verify** DR readiness (pre-flight checks)
3. **Update** Route 53 DNS to point to DR region
4. **Wait** for DNS propagation (60 seconds)
5. **Verify** DR serving traffic
6. **Measure** and record RTO

**Typical Timeline:**
- 90 seconds: Failure detection (3 consecutive health check failures)
- 60 seconds: Pre-flight verification
- 60 seconds: DNS update
- 60 seconds: DNS propagation
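- **Total: 8-12 minutes** end to end (within 15-minute RTO). The four timed phases above sum to 4.5 minutes; the remainder is spent on steps without a listed duration, such as verifying that DR is serving traffic.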
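
Adding up the timed phases shows how much of the 15-minute RTO they consume and how much is left for everything else. The 30-second health check interval is an assumption chosen to match "3 consecutive failures = 90 seconds", and the dictionary keys are illustrative names.

```python
HEALTH_CHECK_INTERVAL_SECONDS = 30  # assumption: 3 failures x 30s = 90s
FAILURE_THRESHOLD = 3

phases_seconds = {
    "failure_detection": HEALTH_CHECK_INTERVAL_SECONDS * FAILURE_THRESHOLD,
    "preflight_verification": 60,
    "dns_update": 60,
    "dns_propagation": 60,
}

RTO_BUDGET_SECONDS = 15 * 60
listed_total = sum(phases_seconds.values())
headroom = RTO_BUDGET_SECONDS - listed_total

print(f"Listed phases: {listed_total / 60:.1f} min")
print(f"Headroom before RTO breach: {headroom / 60:.1f} min")
# Listed phases: 4.5 min
# Headroom before RTO breach: 10.5 min
```

The headroom has to absorb every step without a listed duration, including confirming that the DR region is actually serving traffic.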

In [None]:
# Create FailoverOrchestrator
orchestrator = FailoverOrchestrator(verifier)
print("✅ FailoverOrchestrator initialized")
print("")
print("⚠️ NOTE: execute_failover() is a CRITICAL operation")
print("In production, this redirects ALL traffic to DR region")
print("Only execute during actual disasters or quarterly DR tests")

# Expected: Warning about critical operation

In [None]:
# Simulate failover (demo only - not executing actual DNS changes)
print("Simulating failover for demonstration...")
print("")

result = orchestrator.execute_failover("Demo: Hard drive failure in US-EAST-1")

print(f"Failover Result:")
print(f"  Success: {result.success}")
print(f"  RTO: {result.rto_minutes:.1f} minutes (target: {RTO_TARGET_MINUTES})")
print(f"  Data Loss: {result.data_loss_minutes:.1f} minutes (target: {RPO_TARGET_MINUTES})")
print(f"  Timestamp: {result.timestamp.isoformat()}")
print(f"  Errors: {result.errors if result.errors else 'None'}")

# Expected: Success True, RTO ~8 minutes, Data Loss ~5 minutes

In [None]:
print("SAVED_SECTION:5")

## Section 5: FINRA Compliance Reporting

### ComplianceReporter Class

Generates quarterly DR test reports for FINRA Rule 4370.

**Report Contents:**
- Test date and quarter (e.g., 2024-Q4)
- RTO measured vs. target (e.g., 8.5 min vs. 15 min target)
- RPO measured vs. target (e.g., 5 min vs. 60 min target)
- Data consistency validation
- Overall PASS/FAIL status
- Compliance statement for auditors

**FINRA Requirements:**
- Test at least annually (industry: quarterly)
- Document results
- File with compliance team
- Make available to examiners
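
To make the report contents concrete, the sketch below derives the quarter label and the PASS/FAIL fields from measured values. It is an illustration, not the `ComplianceReporter` implementation. The function names and the `TEST FAILED` label are hypothetical, and the sample measurements (8.5 and 5 minutes) are the example figures from the list above.

```python
from datetime import datetime


def quarter_label(test_date: datetime) -> str:
    """Format a date as YEAR-Qn, for example 2024-Q4."""
    return f"{test_date.year}-Q{(test_date.month - 1) // 3 + 1}"


def analyze(measured_minutes: float, target_minutes: float) -> dict:
    """Compare one measured value against its target."""
    passed = measured_minutes <= target_minutes
    return {
        "measured_minutes": measured_minutes,
        "target_minutes": target_minutes,
        "pass": passed,
        "status": "PASS" if passed else "FAIL",
    }


def build_report(test_date: datetime, rto_minutes: float, data_loss_minutes: float) -> dict:
    """Assemble a minimal report; the test passes only if both targets are met."""
    rto = analyze(rto_minutes, 15)
    rpo = analyze(data_loss_minutes, 60)
    overall = "TEST PASSED" if rto["pass"] and rpo["pass"] else "TEST FAILED"
    return {
        "quarter": quarter_label(test_date),
        "rto_analysis": rto,
        "rpo_analysis": rpo,
        "overall_result": {"status": overall},
    }


report_sketch = build_report(datetime(2024, 12, 15, 14, 0), 8.5, 5.0)
print(report_sketch["quarter"], report_sketch["overall_result"]["status"])
# 2024-Q4 TEST PASSED
```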

In [None]:
# Create ComplianceReporter
reporter = ComplianceReporter()
print("✅ ComplianceReporter initialized")

# Generate quarterly report
test_date = datetime(2024, 12, 15, 14, 0, 0)  # Q4 2024 test

# Using simulated failover result from previous section
# In production, use actual test results
report = reporter.generate_quarterly_report(
    test_date=test_date,
    failover_result=result,
    replication_status=status
)

print(f"\nCompliance Report Generated:")
print(f"  Quarter: {report['quarter']}")
print(f"  RTO: {report['rto_analysis']['status']}")
print(f"  RPO: {report['rpo_analysis']['status']}")
print(f"  Overall: {report['overall_result']['status']}")

# Expected: Q4 2024, RTO PASS, RPO PASS, Overall TEST PASSED

In [None]:
# View detailed RTO analysis
print("RTO Analysis:")
for key, value in report['rto_analysis'].items():
    print(f"  {key}: {value}")

print("\nRPO Analysis:")
for key, value in report['rpo_analysis'].items():
    print(f"  {key}: {value}")

# Expected: measured_minutes < target_minutes, pass: True

In [None]:
print("SAVED_SECTION:6")

## Section 6: Quarterly DR Test Execution

### Complete Test Procedure

**Test Workflow:**

1. **Pre-Test Verification**
   - Confirm primary region healthy
   - Confirm replication lag < 5 minutes
   - Document baseline (document count, query performance)

2. **Simulate Primary Failure**
   - Block traffic to US-EAST-1 via security group
   - Start RTO timer

3. **Monitor Automated Response**
   - Health checks fail after 90 seconds
   - Lambda verifies DR readiness
   - Route 53 DNS updated to US-WEST-2

4. **Verify DR Functionality**
   - Issue test query to rag.yourcompany.com
   - Confirm response from US-WEST-2
   - Check document count consistency

5. **Measure RTO/RPO**
   - Calculate time from failure to DR serving traffic
   - Calculate data loss from replication lag

6. **Failback to Primary**
   - Remove security group block
   - Wait for replication catch-up
   - Route 53 DNS back to US-EAST-1

7. **Generate Report**
   - PASS/FAIL determination
   - File with compliance team
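
Steps 2 and 5 depend on consistent timestamps, so it helps to record named events during the drill and derive RTO and RPO from them afterwards. The `DrillTimer` class and the event names below are hypothetical, and the timestamps are made-up sample values rather than real test results.

```python
from datetime import datetime, timedelta


class DrillTimer:
    """Records named drill events and measures the minutes between them."""

    def __init__(self):
        self.events = {}

    def mark(self, name: str, at: datetime) -> None:
        self.events[name] = at

    def minutes_between(self, start: str, end: str) -> float:
        return (self.events[end] - self.events[start]).total_seconds() / 60


timer = DrillTimer()
t0 = datetime(2024, 12, 15, 14, 0, 0)  # sample: moment the security group block is applied

timer.mark("last_replicated_commit", t0 - timedelta(minutes=5))
timer.mark("primary_blocked", t0)  # RTO timer starts here
timer.mark("dr_serving_traffic", t0 + timedelta(minutes=8, seconds=30))

rto_minutes = timer.minutes_between("primary_blocked", "dr_serving_traffic")
rpo_minutes = timer.minutes_between("last_replicated_commit", "primary_blocked")
print(f"RTO: {rto_minutes:.1f} min, RPO: {rpo_minutes:.1f} min")
# RTO: 8.5 min, RPO: 5.0 min
```

In a real drill the timestamps would come from the clock at each step (for example `datetime.now()`), and the event log itself can be attached to the compliance report as evidence.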

In [None]:
# Example: Complete quarterly DR test
print("Quarterly DR Test Simulation")
print("=" * 50)
print(f"Test Date: {test_date.strftime('%Y-%m-%d %H:%M UTC')}")
print(f"Quarter: Q{(test_date.month-1)//3 + 1} {test_date.year}")
print("")

print("Step 1: Pre-Test Verification")
baseline = {
    "primary_documents": 10432,
    "primary_health": "healthy",
    "replication_lag": "4.2 seconds"
}
print(f"  Baseline: {baseline}")
print("")

print("Steps 2-5: Execute Failover (simulated above)")
print(f"  RTO Measured: {result.rto_minutes:.1f} minutes")
print(f"  RPO Measured: {result.data_loss_minutes:.1f} minutes")
print("")

print("Step 6: Failback to Primary (not simulated in this notebook)")
print("")

print("Step 7: Generate Compliance Report")
print(f"  Report Status: {report['overall_result']['status']}")
print("")

print("✅ Quarterly DR Test Complete")
print("📄 Report filed for FINRA examiner review")

# Expected: Test complete, report PASSED

In [None]:
print("SAVED_SECTION:7")

## Section 7: Cost Analysis

### Hot DR Monthly Costs

**Primary Region (US-EAST-1):**
- Pinecone Production tier: ₹60K/month
- RDS PostgreSQL Multi-AZ: ₹80K/month
- EC2 AutoScaling: ₹50K/month
- ElastiCache Redis: ₹15K/month
- S3 backup: ₹5K/month
- **Subtotal: ₹2.1L/month**

**DR Region (US-WEST-2):**
- Same infrastructure: ₹2.1L/month

**Cross-Region Services:**
- Route 53, CloudWatch, Lambda: ₹38K/month

**Long-Term Backup (Glacier):**
- 7-year retention: ₹20K/month

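**Incremental Hot DR cost: ~₹2.5L/month (~$3,000 USD)**. This is the DR region (₹2.1L) plus cross-region services (₹38K), or ₹2.48L. The primary region runs with or without DR, and Glacier retention adds a further ₹20K/month.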

### Break-Even Analysis

- **One 4-hour outage cost**: ₹1.2Cr+ (direct losses + fines + reputation)
- **Hot DR annual cost**: ₹30L
- **Break-even**: ₹1.2Cr / ₹30L per year = 4 years, so Hot DR pays for itself if it prevents ONE major outage every 4 years
- **Typical failure rate**: 1-2 incidents per decade
- **Conclusion**: Hot DR is cost-justified for trading-critical systems

In [None]:
# Cost calculation
monthly_dr_cost = 250000  # ₹2.5L
annual_dr_cost = monthly_dr_cost * 12
outage_cost_per_hour = 30000  # ₹30K
major_outage_hours = 4
indirect_cost_multiplier = 100  # Scales direct losses to include fines and reputation damage
major_outage_cost = outage_cost_per_hour * major_outage_hours * indirect_cost_multiplier

print("Hot DR Costs:")
print(f"  Monthly: ₹{monthly_dr_cost:,}")
print(f"  Annual: ₹{annual_dr_cost:,}")
print("")
print("Downtime Costs:")
print(f"  Per Hour: ₹{outage_cost_per_hour:,}")
print(f"  4-Hour Outage: ₹{major_outage_cost:,} (including indirect costs)")
print("")
# True division keeps fractional years; floor division would silently truncate them
print(f"Break-Even: Prevent 1 major outage every {major_outage_cost / annual_dr_cost:.1f} years")
print("")
print("Conclusion: Hot DR is cost-justified for trading-critical systems")

# Expected: Break-even ~4 years

In [None]:
print("SAVED_SECTION:8")

## Summary and Next Steps

### What You've Learned

1. ✅ **RTO/RPO Concepts**: 15-minute RTO and 1-hour RPO for trading systems
2. ✅ **Replication Monitoring**: PostgreSQL and Pinecone lag tracking
3. ✅ **DR Readiness**: Pre-flight verification before failover
4. ✅ **Failover Orchestration**: Automated DNS-based failover
5. ✅ **FINRA Compliance**: Quarterly DR testing and reporting
6. ✅ **Cost Justification**: Hot DR break-even analysis

### Regulatory Requirements Met

- ✅ **FINRA Rule 4370**: Business continuity planning with quarterly testing
- ✅ **SOX Section 404**: 7-year document retention with audit trail
- ✅ **GLBA**: Data encryption at rest and in transit

### Production Deployment Checklist

**Before going live:**

1. ☐ Configure AWS multi-region infrastructure
2. ☐ Set up Pinecone Production tier with cross-region replication
3. ☐ Configure PostgreSQL RDS Multi-AZ and read replica
4. ☐ Set up Route 53 hosted zone and health checks
5. ☐ Configure CloudWatch alarms (lag > 5 min, health check failures)
6. ☐ Integrate PagerDuty for on-call alerts
7. ☐ Execute quarterly DR test in staging environment
8. ☐ Generate and file compliance report
9. ☐ Document runbook for manual failover
10. ☐ Train operations team on failover procedures

### Next Steps

1. **Review the FastAPI endpoints** at http://localhost:8000/docs
2. **Run the test suite** with `pytest -v tests/`
3. **Configure your .env file** with actual AWS credentials (if deploying)
4. **Practice quarterly DR test** in staging environment
5. **Generate compliance report** for FINRA filing

### Resources

- **FINRA Rule 4370**: https://www.finra.org/rules-guidance/rulebooks/finra-rules/4370
- **AWS RDS Multi-AZ**: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.MultiAZ.html
- **Route 53 Health Checks**: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
- **Pinecone Production Tier**: https://docs.pinecone.io/

---

**Congratulations!** You've completed L3 M10.4: Disaster Recovery & Business Continuity.

Your financial RAG system is now production-ready with:
- 15-minute RTO capability
- 1-hour RPO compliance
- FINRA Rule 4370 quarterly testing
- Automated failover orchestration

In [None]:
print("SAVED_SECTION:9")
print("")
print("=" * 60)
print("L3 M10.4: Disaster Recovery & Business Continuity")
print("Notebook Complete!")
print("=" * 60)