# L3 M14.3: Tenant Lifecycle & Migrations

## Learning Arc

**Purpose:** Master zero-downtime tenant migrations, GDPR-compliant data deletion workflows, and backup/restore orchestration for enterprise multi-tenant RAG systems. This module teaches production-grade lifecycle management patterns used at Global Capability Centers (GCCs) serving Fortune 500 clients.

**Concepts Covered:**
- Blue-green deployment patterns with gradual traffic cutover (10% → 25% → 50% → 100%)
- GDPR Article 17 deletion workflows across 7+ systems (PostgreSQL, Redis, Pinecone, S3, CloudWatch, backups, analytics)
- Per-tenant backup and restore with point-in-time recovery
- Tenant cloning for staging/testing with data anonymization
- Sub-60-second rollback capability for failed migrations
- Multi-system orchestration using Celery and APScheduler
- Data consistency verification using checksums
- Legal hold checks preventing accidental deletion during litigation
- Cryptographically signed deletion certificates for compliance audit trails
- Cost-benefit analysis: zero-downtime (₹40-60 lakh) vs maintenance window (₹2-5 lakh)

**After Completing This Notebook:**
- You will understand how to design blue-green migrations with dual-write mode and incremental sync
- You can implement GDPR deletion workflows that systematically erase data across all systems
- You will recognize when to use zero-downtime migration vs maintenance windows vs rolling updates
- You can build backup/restore systems with configurable retention and cross-region replication
- You will handle common failure scenarios: data inconsistency, rollback failures, incomplete deletion
- You can debug migrations using verification scans, checksums, and deletion certificates
- You will implement legal hold checks to prevent regulatory violations
- You can orchestrate multi-system operations with parallel workers and distributed locks

**Context in Track L3.M14:**
This module builds on **M14.1 (Multi-Tenant Architecture)** and **M14.2 (Tenant Isolation & Security)** by adding lifecycle management: onboarding, migration, offboarding, and disaster recovery. It prepares you for **M14.4 (Operating Model & Governance)** where you'll learn cost allocation, SLA management, and incident response.

In [None]:
# Environment Setup
import os
import sys

# Add src to path for imports
if '../src' not in sys.path:
    sys.path.insert(0, '../src')

# OFFLINE mode for L3 consistency (no external infrastructure required)
OFFLINE = os.getenv("OFFLINE", "true").lower() == "true"

# Infrastructure services (optional - disabled by default)
POSTGRES_ENABLED = os.getenv("POSTGRES_ENABLED", "false").lower() == "true"
REDIS_ENABLED = os.getenv("REDIS_ENABLED", "false").lower() == "true"
PINECONE_ENABLED = os.getenv("PINECONE_ENABLED", "false").lower() == "true"
AWS_ENABLED = os.getenv("AWS_ENABLED", "false").lower() == "true"

print("L3 M14.3: Tenant Lifecycle & Migrations")
print("="*50)
print(f"OFFLINE mode: {OFFLINE}")
print(f"PostgreSQL: {'enabled' if POSTGRES_ENABLED else 'disabled'}")
print(f"Redis: {'enabled' if REDIS_ENABLED else 'disabled'}")
print(f"Pinecone: {'enabled' if PINECONE_ENABLED else 'disabled'}")
print(f"AWS S3: {'enabled' if AWS_ENABLED else 'disabled'}")
print()

if OFFLINE:
    print("⚠️  Running in OFFLINE mode")
    print("   → All infrastructure calls will be simulated")
    print("   → Set OFFLINE=false in .env to enable real operations")
else:
    print("✓ Online mode - infrastructure operations enabled")
    if not any([POSTGRES_ENABLED, REDIS_ENABLED, PINECONE_ENABLED, AWS_ENABLED]):
        print("⚠️  Warning: Online mode but no services enabled")
        print("   → Enable services in .env (POSTGRES_ENABLED=true, etc.)")

## Section 1: Blue-Green Migration Fundamentals

Blue-green deployment is a zero-downtime migration strategy where you:
1. Run parallel **blue** (current) and **green** (new) environments
2. Synchronize data between them
3. Gradually shift traffic from blue to green
4. Keep blue as instant rollback option
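
The gradual traffic shift in step 3 can be sketched with deterministic hash bucketing. This is a minimal illustration, not the routing used by `l3_m14_tenant_lifecycle`; `route_request` and the `user-N` keys are hypothetical names introduced here.

```python
import hashlib


def route_request(request_key: str, green_percentage: int) -> str:
    """Return 'green' for roughly green_percentage% of keys, otherwise 'blue'."""
    if not 0 <= green_percentage <= 100:
        raise ValueError("green_percentage must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the key to a stable bucket in 0..99; the same key always gets the same bucket
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "green" if bucket < green_percentage else "blue"


# Hypothetical request keys; in practice this might be a user or session ID
keys = [f"user-{i}" for i in range(10_000)]
for pct in (10, 25, 50, 100):
    share = sum(route_request(k, pct) == "green" for k in keys) / len(keys)
    print(f"target {pct:3d}% -> observed {share:.1%} on green")
```

Because each key has a fixed bucket, raising the percentage only ever moves keys from blue to green, and setting it back to 0 sends everything to blue again, which is what makes blue an instant rollback option.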

**Why zero-downtime matters:**
- Platinum tier SLAs guarantee 99.99% uptime (at most about 52.6 minutes of downtime per year)
- Financial services tenants typically cannot tolerate outages during trading hours
- Revenue impact: ₹1-5 lakh/hour for high-volume tenants
- Customer trust: Single outage can trigger contract renegotiation
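
The downtime budget behind an uptime SLA is simple arithmetic. The sketch below assumes a 365-day year; `downtime_budget_minutes` is a hypothetical helper, not part of the notebook's library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(sla_percent: float) -> float:
    """Maximum downtime per year, in minutes, permitted by an uptime SLA."""
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.9, 99.95, 99.99):
    print(f"{sla}% uptime -> {downtime_budget_minutes(sla):.1f} minutes/year")

# A single 3-hour maintenance window (180 minutes) versus the 99.99% budget
print(180 > downtime_budget_minutes(99.99))
```

At 99.99% the budget is roughly 52.6 minutes per year, so a single multi-hour maintenance window would exhaust it several times over.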

**Six-Phase Migration:**
1. **Provision:** Spin up green infrastructure (Terraform, K8s)
2. **Full Sync:** Bulk data transfer (AWS DataSync, pg_dump)
3. **Dual-Write:** Application writes to both blue and green
4. **Incremental Sync:** Close replication gap from transaction logs
5. **Cutover:** Gradual traffic shift (10% → 25% → 50% → 100%)
6. **Decommission:** Destroy blue after 24-hour stability period
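
Phases 3 and 4 (dual-write plus incremental sync) can be sketched as follows. This is an illustration under simplifying assumptions: `InMemoryStore` and `DualWriter` are hypothetical stand-ins for real datastore clients, and the notebook's own `migrate_tenant_blue_green` may work differently.

```python
class InMemoryStore:
    """Stand-in for a real datastore client (hypothetical, for illustration only)."""

    def __init__(self):
        self.data = {}
        self.available = True

    def put(self, key, value):
        if not self.available:
            raise ConnectionError("store unavailable")
        self.data[key] = value


class DualWriter:
    """Write to blue first (source of truth), then mirror the write to green."""

    def __init__(self, blue, green):
        self.blue = blue
        self.green = green
        self.pending_resync = []  # keys whose green write failed

    def write(self, key, value):
        self.blue.put(key, value)  # a blue failure propagates to the caller
        try:
            self.green.put(key, value)
        except ConnectionError:
            # Do not fail the request; remember the key for incremental sync
            self.pending_resync.append(key)

    def resync(self):
        """Replay failed green writes using blue's current value."""
        still_pending = []
        for key in self.pending_resync:
            try:
                self.green.put(key, self.blue.data[key])
            except ConnectionError:
                still_pending.append(key)
        self.pending_resync = still_pending


blue, green = InMemoryStore(), InMemoryStore()
writer = DualWriter(blue, green)
writer.write("doc:1", "v1")
green.available = False  # simulate a green outage during dual-write
writer.write("doc:2", "v1")
green.available = True
writer.resync()
print("in sync:", blue.data == green.data)
```

Keeping blue authoritative means a green failure never fails a customer request; the cost is that green can lag until the resync step closes the gap.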

In [None]:
# Import migration functions
from l3_m14_tenant_lifecycle import migrate_tenant_blue_green, MigrationStatus

# Example: Migrate a Platinum tier tenant
tenant_id = "tenant_platinum_001"
source_env = "blue"
target_env = "green"

print(f"Initiating blue-green migration: {tenant_id}")
print(f"Source: {source_env} → Target: {target_env}")
print()

result = migrate_tenant_blue_green(
    tenant_id=tenant_id,
    source_env=source_env,
    target_env=target_env,
    offline=OFFLINE
)

print(f"Migration result: {result['status']}")
if result.get('skipped'):
    print(f"Reason: {result['reason']}")
else:
    print(f"Duration: {result.get('duration_seconds', 'N/A')} seconds")
    print(f"Traffic percentage: {result.get('traffic_percentage', 'N/A')}%")

# Expected (offline mode): Skipped with simulation message
# Expected (online mode): Full 6-phase migration with checksum validation

## Section 2: GDPR Article 17 Deletion Workflow

GDPR Article 17 grants data subjects in the EU the **"right to erasure"** (right to be forgotten). When a tenant requests deletion:
- You have **one month** (Article 12(3), commonly tracked as 30 days) to complete erasure across ALL systems
- Missing even one system risks a fine of up to **€20 million** or 4% of worldwide annual turnover, whichever is higher
- Must handle **legal holds** (court orders preventing deletion)
- Must provide a **cryptographically signed certificate** as proof

**Seven Systems to Delete:**
1. **PostgreSQL:** Tenant registry, metadata, user accounts
2. **Redis:** Session cache, distributed locks, rate limit counters
3. **Pinecone:** Vector embeddings, semantic search indices
4. **S3:** Documents, images, file uploads, backups
5. **CloudWatch:** Application logs (anonymize tenant_id references)
6. **Backups:** Add to exclusion list, schedule purge
7. **Analytics:** Event streams, dashboards, aggregated metrics

**Production Reality:**
- First attempts find residual data in ~30% of verifications
- Common miss: Analytics databases not in deletion checklist
- Common miss: Backup tapes stored offsite for 7 years
- Manual audit required for first 5 deletions to build confidence
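
A deletion run across the seven systems, gated by a legal hold check, might be sketched like this. It is not the implementation of `execute_gdpr_deletion`; `run_deletion`, `LegalHoldError`, and the in-memory `stores` are hypothetical names used only to show the control flow.

```python
from concurrent.futures import ThreadPoolExecutor

SYSTEMS = ["postgresql", "redis", "pinecone", "s3", "cloudwatch", "backups", "analytics"]


class LegalHoldError(Exception):
    """Raised when a tenant under legal hold is submitted for deletion."""


def run_deletion(tenant_id, deleters, legal_holds):
    """deleters maps system name -> callable(tenant_id) that raises on failure."""
    if tenant_id in legal_holds:
        raise LegalHoldError(f"{tenant_id} is under legal hold; deletion blocked")
    missing = [s for s in SYSTEMS if s not in deleters]
    if missing:
        # Refuse to start if any system on the checklist has no deleter
        raise ValueError(f"no deleter registered for: {missing}")

    def _delete(system):
        try:
            deleters[system](tenant_id)
            return system, None
        except Exception as exc:  # record the failure instead of aborting the run
            return system, str(exc)

    with ThreadPoolExecutor(max_workers=len(SYSTEMS)) as pool:
        outcomes = dict(pool.map(_delete, SYSTEMS))
    failed = {s: err for s, err in outcomes.items() if err is not None}
    return {
        "systems_deleted": [s for s in SYSTEMS if s not in failed],
        "failed": failed,
        "complete": not failed,
    }


# Demo with in-memory dicts standing in for the real systems
stores = {s: {"tenant_a": "data", "tenant_b": "data"} for s in SYSTEMS}
deleters = {s: (lambda tid, s=s: stores[s].pop(tid, None)) for s in SYSTEMS}
result = run_deletion("tenant_a", deleters, legal_holds={"tenant_b"})
print(result["complete"], result["systems_deleted"])
```

Checking the hold before any deleter runs matters because a partial deletion cannot be undone; failing fast on a missing deleter addresses the "system not in the checklist" miss described above.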

In [None]:
# Import GDPR deletion functions
from l3_m14_tenant_lifecycle import execute_gdpr_deletion, verify_gdpr_deletion, generate_deletion_certificate
import uuid

# Example: Execute GDPR deletion for offboarded tenant
tenant_id = "tenant_offboarded_001"
request_id = str(uuid.uuid4())

print("GDPR Deletion Request")
print(f"Tenant: {tenant_id}")
print(f"Request ID: {request_id}")
print()

deletion_result = execute_gdpr_deletion(
    tenant_id=tenant_id,
    request_id=request_id,
    offline=OFFLINE
)

print(f"Deletion status: {deletion_result['status']}")
if deletion_result.get('skipped'):
    print(f"Reason: {deletion_result['reason']}")
else:
    print(f"Systems deleted: {', '.join(deletion_result['systems_deleted'])}")
    print(f"Certificate ID: {deletion_result['certificate_id']}")
    print(f"Completed at: {deletion_result['completed_at']}")

# Expected (offline): Simulated deletion across all systems
# Expected (online): Parallel deletion + verification + certificate generation

## Section 3: Deletion Verification and Certificate

**Why verification matters:**
- GDPR audits require proof of complete erasure
- Residual data in any system violates compliance
- Certificates provide cryptographic non-repudiation

**Verification Process:**
1. Query each system for tenant_id presence
2. Aggregate results into pass/fail per system
3. Generate deletion report
4. If complete: Sign certificate with SHA-256
5. If incomplete: Block certificate, flag for manual review

**Certificate Contents:**
- Certificate ID (unique identifier)
- Tenant ID and deletion request ID
- Timestamp of deletion
- List of systems verified
- Verification status (complete/incomplete)
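- SHA-256 signature for a tamper-evident audit trail (a bare hash only detects changes; non-repudiation needs a keyed HMAC or an asymmetric signature)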
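
One way to produce a verifiable certificate is a keyed HMAC-SHA256 over the certificate body. Note the swap: this sketch uses HMAC rather than a plain SHA-256 hash, and it is an assumption about how signing could work, not a description of `generate_deletion_certificate`. The function names and the demo key are hypothetical.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone


def _canonical(body: dict) -> bytes:
    # Stable serialization so signer and verifier hash identical bytes
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_certificate(tenant_id, request_id, systems_verified, secret_key: bytes) -> dict:
    body = {
        "certificate_id": str(uuid.uuid4()),
        "tenant_id": tenant_id,
        "request_id": request_id,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
        "systems_verified": sorted(systems_verified),
        "verification_status": "complete",
    }
    signature = hmac.new(secret_key, _canonical(body), hashlib.sha256).hexdigest()
    return {**body, "signature": signature}


def verify_certificate(certificate: dict, secret_key: bytes) -> bool:
    body = {k: v for k, v in certificate.items() if k != "signature"}
    expected = hmac.new(secret_key, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, certificate["signature"])


key = b"demo-key-load-from-a-secret-manager"  # hypothetical; never hard-code real keys
cert = sign_certificate("tenant_offboarded_001", "req-123", ["postgresql", "redis"], key)
print(verify_certificate(cert, key))
tampered = {**cert, "tenant_id": "someone_else"}
print(verify_certificate(tampered, key))
```

HMAC proves the certificate was issued by a holder of the key; if auditors must verify without sharing that key, an asymmetric signature would be the better fit.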

In [None]:
# Verify GDPR deletion completeness
systems_to_verify = ["postgresql", "redis", "pinecone", "s3", "cloudwatch", "backups", "analytics"]

print(f"Verifying deletion for tenant: {tenant_id}")
print(f"Systems to check: {len(systems_to_verify)}")
print()

verification = verify_gdpr_deletion(
    tenant_id=tenant_id,
    systems=systems_to_verify,
    offline=OFFLINE
)

print(f"Verification complete: {verification['complete']}")
print(f"Systems checked: {len(verification['systems_checked'])}")
print()
print("Per-system status:")
for system, status in verification['system_status'].items():
    status_icon = "✓" if status else "✗"
    print(f"  {status_icon} {system}: {'deleted' if status else 'residual data found'}")

if verification['complete']:
    print("\n✓ All systems verified clean - generating certificate...")
    certificate = generate_deletion_certificate(
        tenant_id=tenant_id,
        request_id=request_id,
        verification=verification,
        offline=OFFLINE
    )
    print(f"Certificate ID: {certificate['certificate_id']}")
    print(f"Signature: {certificate['signature'][:16]}...")
else:
    print("\n✗ Verification incomplete - manual review required")
    print(f"Remaining data in: {verification['remaining_data']}")

# Expected: All systems show deleted in offline mode
# In production: May find residual data in first attempts

## Section 4: Backup and Restore with Point-in-Time Recovery

**Backup strategies by tenant tier:**
- **Platinum:** Daily backups, 365 days retention, cross-region replication
- **Gold:** Weekly backups, 90 days retention, single region
- **Bronze:** Monthly backups, 30 days retention, single region

**Point-in-Time Recovery (PITR):**
- Restore to any timestamp within retention window
- Uses transaction logs + base backup
- Critical for ransomware recovery, data corruption, accidental deletion

**Cross-Region Replication:**
- Protects against regional disasters (datacenter fire, network partition)
- Adds 50-100ms replication lag
- Doubles storage cost but ensures business continuity

**Systems Backed Up:**
1. PostgreSQL: Full pg_dump + WAL logs
2. Redis: RDB snapshots + AOF logs
3. Pinecone: Namespace export to S3
4. S3: Versioning + cross-region replication
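
Point-in-time recovery starts by choosing the latest base backup taken at or before the target time, then replaying logs forward from it. The sketch below shows that selection together with the tier retention windows listed above; `select_base_backup` and the sample timestamps are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = {"platinum": 365, "gold": 90, "bronze": 30}


def select_base_backup(backup_times, target, tier, now):
    """Return (base_backup_time, log_span_to_replay) for a restore to `target`."""
    if target > now:
        raise ValueError("target is in the future")
    if target < now - timedelta(days=RETENTION_DAYS[tier]):
        raise ValueError("target is outside the retention window")
    ordered = sorted(backup_times)
    idx = bisect_right(ordered, target)  # count of backups taken at or before target
    if idx == 0:
        raise ValueError("no base backup exists at or before the target time")
    base = ordered[idx - 1]
    return base, target - base


now = datetime(2025, 1, 31, 12, 0, tzinfo=timezone.utc)
daily = [datetime(2025, 1, day, tzinfo=timezone.utc) for day in range(1, 32)]
target = datetime(2025, 1, 30, 18, 0, tzinfo=timezone.utc)
base, replay = select_base_backup(daily, target, "platinum", now)
print(f"restore base backup {base.isoformat()}, then replay {replay} of logs")
```

The less frequent the base backups, the longer the log replay: with weekly or monthly backups the span to replay can reach days or weeks, which lengthens restore time.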

In [None]:
# Import backup/restore functions
from l3_m14_tenant_lifecycle import create_tenant_backup, restore_tenant_backup

# Example: Create backup for Platinum tier tenant
tenant_id = "tenant_platinum_002"

print(f"Creating backup for tenant: {tenant_id}")
print("Tier: Platinum (365 days retention, cross-region)")
print()

backup_result = create_tenant_backup(
    tenant_id=tenant_id,
    retention_days=365,
    cross_region=True,
    offline=OFFLINE
)

print(f"Backup status: {backup_result['status']}")
if backup_result.get('skipped'):
    print(f"Reason: {backup_result['reason']}")
else:
    print(f"Backup ID: {backup_result['backup_id']}")
    print(f"Size: {backup_result['size_bytes'] / (1024*1024):.2f} MB")
    print(f"Cross-region: {backup_result['cross_region']}")
    print(f"Systems backed up: {', '.join(backup_result['systems'].keys())}")

# Save backup_id for restore demo
backup_id = backup_result.get('backup_id', 'backup_demo_001')

# Expected: Backup created across 4 systems (PostgreSQL, Redis, Pinecone, S3)

In [None]:
# Example: Restore from backup (disaster recovery scenario)
from datetime import datetime, timedelta

# Scenario: Restore to 24 hours ago (before data corruption)
point_in_time = datetime.now() - timedelta(hours=24)

print(f"Restoring tenant: {tenant_id}")
print(f"From backup: {backup_id}")
print(f"Point-in-time: {point_in_time.isoformat()}")
print()

restore_result = restore_tenant_backup(
    backup_id=backup_id,
    tenant_id=tenant_id,
    point_in_time=point_in_time,
    offline=OFFLINE
)

print(f"Restore status: {restore_result['status']}")
if restore_result.get('skipped'):
    print(f"Reason: {restore_result['reason']}")
else:
    print(f"Systems restored: {', '.join(restore_result['systems_restored'])}")
    print(f"Point-in-time: {restore_result['point_in_time']}")
    print(f"Verification: {restore_result['verification']['success']}")

# Expected: Schema compatibility check + restore + verification
# Production: 15-60 minutes depending on data size

## Section 5: Tenant Cloning for Staging and Testing

**Why clone tenants:**
- Test new features in staging before production rollout
- Reproduce production bugs in safe environment
- Create development environments with realistic data
- Validate migration workflows before live cutover

**Data Anonymization (Critical):**
- **PII masking:** Replace names with fake data (Faker library)
- **Email obfuscation:** user@example.com → user_12345@testdomain.com
- **Tokenization:** Replace credit cards with test numbers
- **Aggregation:** Sum revenue instead of itemized transactions

**Selective Synchronization:**
- **Full clone:** All data types (documents, embeddings, metadata, configs)
- **Config-only:** Just settings, no user data
- **Schema-only:** Database structure, no records
- **Metadata + configs:** Lightweight testing environment

**Legal Consideration:**
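- GDPR's purpose-limitation and data-minimization principles (Article 5) generally rule out copying PII to non-production environments without anonymization
- HIPAA safeguards follow the data: a staging environment that holds real PHI needs the same security controls as production
- Always anonymize by default unless there is explicit legal approval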
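
A minimal sketch of PII masking and email obfuscation using only the standard library follows. It does not use Faker, and `anonymize_email`, `anonymize_record`, the salt, and the sample record are all hypothetical. Strictly speaking, salted hashing is pseudonymization, so the salt should not be available in staging.

```python
import hashlib


def anonymize_email(email: str, salt: str = "staging-clone") -> str:
    """Map a real address to a stable fake one on a test domain."""
    normalized = email.strip().lower()
    digest = hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}@testdomain.com"


def anonymize_record(record: dict, pii_fields=("name", "phone")) -> dict:
    """Return a copy with the email obfuscated and other PII fields masked."""
    cleaned = dict(record)  # leave the source record untouched
    if "email" in cleaned:
        cleaned["email"] = anonymize_email(cleaned["email"])
    for field in pii_fields:
        if field in cleaned:
            cleaned[field] = "REDACTED"
    return cleaned


# Hypothetical sample record
record = {"id": 42, "name": "Asha Rao", "email": "Asha.Rao@example.com", "plan": "gold"}
print(anonymize_record(record))
```

Deterministic output keeps joins working in the clone: the same source email always maps to the same fake address, so foreign-key-like references stay consistent.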

In [None]:
# Import clone function
from l3_m14_tenant_lifecycle import clone_tenant

# Example: Clone production tenant to staging with anonymization
source_tenant = "tenant_production_001"
target_tenant = "tenant_staging_001"

print("Cloning tenant for staging environment")
print(f"Source: {source_tenant} (production)")
print(f"Target: {target_tenant} (staging)")
print("Anonymization: ENABLED (GDPR compliance)")
print()

clone_result = clone_tenant(
    source_tenant_id=source_tenant,
    target_tenant_id=target_tenant,
    anonymize_data=True,  # Always True for staging
    selective_sync=["documents", "embeddings", "configs"],  # Exclude raw user data
    offline=OFFLINE
)

print(f"Clone status: {clone_result['status']}")
if clone_result.get('skipped'):
    print(f"Reason: {clone_result['reason']}")
else:
    print(f"Data types cloned: {', '.join(clone_result['data_types'])}")
    print(f"Anonymized: {clone_result['anonymized']}")
    print(f"Timestamp: {clone_result['timestamp']}")

# Expected: Selective clone with PII anonymization
# Use staging environment to test new features safely

## Section 6: Migration Rollback and Data Consistency

**Why rollback matters:**
- 15-20% of first migrations require rollback
- Target rollback time: **<60 seconds** from detection to traffic recovery
- Faster rollback = less customer impact = preserved SLA

**Rollback Triggers:**
1. Data consistency failure (checksum mismatch)
2. Target environment errors (HTTP 5xx error rate above 5%)
3. Latency regression (p99 latency >2x baseline)
4. Business metric drop (conversion rate <80% baseline)
5. Manual trigger (engineering judgment)
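
The five triggers above can be evaluated from a single health snapshot. The sketch below encodes the listed thresholds; `HealthSnapshot`, `rollback_reasons`, and the sample numbers are hypothetical and not part of the notebook's library.

```python
from dataclasses import dataclass


@dataclass
class HealthSnapshot:
    checksums_match: bool
    error_rate: float  # fraction of requests returning HTTP 5xx
    p99_latency_ms: float
    baseline_p99_latency_ms: float
    conversion_rate: float
    baseline_conversion_rate: float
    manual_trigger: bool = False


def rollback_reasons(s: HealthSnapshot) -> list:
    """Return every trigger that fired; an empty list means keep going."""
    reasons = []
    if not s.checksums_match:
        reasons.append("data consistency failure")
    if s.error_rate > 0.05:
        reasons.append("error rate above 5%")
    if s.p99_latency_ms > 2 * s.baseline_p99_latency_ms:
        reasons.append("p99 latency above 2x baseline")
    if s.conversion_rate < 0.8 * s.baseline_conversion_rate:
        reasons.append("conversion rate below 80% of baseline")
    if s.manual_trigger:
        reasons.append("manual trigger")
    return reasons


snapshot = HealthSnapshot(
    checksums_match=True,
    error_rate=0.08,
    p99_latency_ms=450.0,
    baseline_p99_latency_ms=200.0,
    conversion_rate=0.031,
    baseline_conversion_rate=0.030,
)
reasons = rollback_reasons(snapshot)
print("ROLLBACK" if reasons else "CONTINUE", reasons)
```

Returning all reasons rather than the first one gives the post-mortem a complete picture of what degraded.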

**Rollback Process:**
1. **Immediate:** Route 100% traffic back to blue (DNS/load balancer update)
2. **Verify:** Confirm traffic flowing to blue, error rate normalized
3. **Restore:** Apply rollback snapshot if data corruption detected
4. **Post-mortem:** Document failure reason, update playbook

**Data Consistency Verification:**
- Calculate MD5/SHA checksums for each system
- Compare source vs target after sync
- Report differences by system and data type
- Block cutover if consistency <100%
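
Checksum comparison can be sketched with an order-independent SHA-256 per system. The result shape mirrors the fields the notebook prints (`system`, `source_checksum`, `target_checksum`), but this is not the implementation of `verify_data_consistency`; the function names and sample records are hypothetical.

```python
import hashlib
import json


def dataset_checksum(records) -> str:
    """Order-independent SHA-256 over a collection of JSON-serializable records."""
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode("utf-8")).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()


def compare_environments(source: dict, target: dict) -> dict:
    """source and target map system name -> list of records."""
    differences = []
    for system in sorted(set(source) | set(target)):
        src = dataset_checksum(source.get(system, []))
        tgt = dataset_checksum(target.get(system, []))
        if src != tgt:
            differences.append(
                {"system": system, "source_checksum": src, "target_checksum": tgt}
            )
    return {"consistent": not differences, "differences": differences}


blue_data = {
    "postgresql": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "redis": [{"k": "x"}],
}
green_data = {
    "postgresql": [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}],  # same rows, reordered
    "redis": [{"k": "y"}],  # value drifted
}
report = compare_environments(blue_data, green_data)
print(report["consistent"], [d["system"] for d in report["differences"]])
```

Sorting the per-row hashes before combining them avoids false mismatches when the two environments return rows in different orders.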

In [None]:
# Import verification and rollback functions
from l3_m14_tenant_lifecycle import verify_data_consistency, rollback_migration

# Example: Verify data consistency before cutover
tenant_id = "tenant_migration_001"
source_env = "blue"
target_env = "green"

print(f"Verifying data consistency for tenant: {tenant_id}")
print(f"Source: {source_env} vs Target: {target_env}")
print()

consistency = verify_data_consistency(
    tenant_id=tenant_id,
    source_env=source_env,
    target_env=target_env,
    offline=OFFLINE
)

print(f"Consistency check: {'PASS' if consistency['consistent'] else 'FAIL'}")
print(f"Systems checked: {', '.join(consistency['systems_checked'])}")

if consistency['consistent']:
    print("\n✓ All checksums match - safe to proceed with cutover")
else:
    print(f"\n✗ Data inconsistency detected in {len(consistency['differences'])} systems")
    print("Differences:")
    for diff in consistency['differences']:
        print(f"  - {diff['system']}: {diff['source_checksum']} != {diff['target_checksum']}")
    print("\nAction: Trigger full re-sync before proceeding")

# Expected (offline): Consistent (simulated checksums match)
# Production: May detect mismatches requiring re-sync

In [None]:
# Example: Rollback failed migration
# Scenario: Green environment showing high error rate

rollback_snapshot = "backup_pre_migration_001"

print("ROLLBACK INITIATED")
print(f"Tenant: {tenant_id}")
print("Reason: Target environment error rate >5%")
print(f"Rollback snapshot: {rollback_snapshot}")
print()

rollback_result = rollback_migration(
    tenant_id=tenant_id,
    rollback_snapshot=rollback_snapshot,
    offline=OFFLINE
)

print(f"Rollback status: {rollback_result['status']}")
if rollback_result.get('skipped'):
    print(f"Reason: {rollback_result['reason']}")
else:
    print(f"Rollback duration: {rollback_result['duration_seconds']:.2f} seconds")
    print("Target: <60 seconds (SLA requirement)")
    
    if rollback_result['duration_seconds'] < 60:
        print("\n✓ Rollback completed within SLA")
    else:
        print("\n⚠️  Rollback exceeded SLA - review DNS/load balancer config")

# Expected: Sub-second simulated rollback
# Production target: <60 seconds from decision to traffic recovery

## Section 7: Common Failures and Debugging

Based on production experience from GCCs serving Fortune 500 clients:

### Failure 1: Data Inconsistency (Checksum Mismatch)
- **Symptom:** Verification shows different checksums between source and target
- **Cause:** High write rate outpaces incremental sync, network partition during sync
- **Fix:** Full re-sync with parallel workers, increase sync frequency from 5min to 1min

### Failure 2: Rollback Failure (Load Balancer Config Drift)
- **Symptom:** DNS updated but traffic still hitting green environment
- **Cause:** Load balancer configuration out of sync with Terraform state
- **Fix:** Manual DNS update to blue IP, verify health checks enabled, quarterly rollback drills

### Failure 3: Incomplete GDPR Deletion
- **Symptom:** Verification finds residual data in analytics database
- **Cause:** Analytics system not documented in deletion checklist
- **Fix:** Multi-system discovery scan (`grep -r tenant_id`), update deletion checklist, manual audit for first 5 deletions

### Failure 4: Migration Timeout (High Write Rate)
- **Symptom:** Incremental sync never catches up, replication lag >30 minutes
- **Cause:** Tenant writing >10K records/second, single sync worker saturated
- **Fix:** Scale parallel workers from 4 to 16, consider short maintenance window for write-heavy tenants
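
Whether the sync ever catches up is a rate comparison: the backlog shrinks only when total sync throughput exceeds the write rate. The sketch below assumes each worker sustains 2,000 records/second, a made-up figure for illustration; `catch_up_seconds` is a hypothetical helper.

```python
def catch_up_seconds(backlog_records, write_rate, per_worker_rate, workers):
    """Seconds until replication lag reaches zero, or None if it never will."""
    net_rate = per_worker_rate * workers - write_rate
    if net_rate <= 0:
        return None  # writes arrive at least as fast as they are synced
    return backlog_records / net_rate


WRITE_RATE = 10_000  # records/second, from the failure description
BACKLOG = WRITE_RATE * 30 * 60  # 30 minutes of lag = 18,000,000 records
PER_WORKER = 2_000  # assumed per-worker throughput (hypothetical)

for workers in (4, 16):
    eta = catch_up_seconds(BACKLOG, WRITE_RATE, PER_WORKER, workers)
    if eta is None:
        print(f"{workers} workers: never catches up")
    else:
        print(f"{workers} workers: caught up in {eta / 60:.1f} minutes")
```

Under these assumptions 4 workers sync 8,000 records/second against 10,000 written, so lag grows without bound, while 16 workers clear the backlog in under 15 minutes.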

### Failure 5: Backup Restoration Failure (Schema Incompatibility)
- **Symptom:** Restore fails with column mismatch error
- **Cause:** Production schema evolved (added columns) but backup from older version
- **Fix:** Schema migration before restore, maintain version metadata in backups, monthly restore testing
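
A pre-restore schema check can catch this failure before any data is touched. The sketch compares column maps taken from the backup's metadata and from the live database; `schema_drift` and the sample schemas are hypothetical, and how the column maps are obtained is left open.

```python
def schema_drift(backup_schema: dict, current_schema: dict) -> dict:
    """Schemas map table name -> {column name: type}. Returns per-table drift."""
    report = {}
    for table in sorted(set(backup_schema) | set(current_schema)):
        old = backup_schema.get(table, {})
        new = current_schema.get(table, {})
        added = sorted(set(new) - set(old))
        removed = sorted(set(old) - set(new))
        retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
        if added or removed or retyped:
            report[table] = {"added": added, "removed": removed, "retyped": retyped}
    return report


backup = {"tenants": {"id": "uuid", "name": "text"}}
current = {"tenants": {"id": "uuid", "name": "text", "region": "text"}}
drift = schema_drift(backup, current)
print(drift if drift else "schemas match; safe to restore")
```

An empty report means the restore can proceed as-is; anything else signals that a schema migration has to run first.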

In [None]:
# Debugging: Simulate common failure scenarios

print("Common Failure Scenarios - Debugging Guide")
print("="*50)

# Scenario 1: Data Inconsistency
print("\n1. DATA INCONSISTENCY")
print("Symptoms:")
print("  - Checksum mismatch in verification")
print("  - Different record counts between blue and green")
print("Debug commands:")
print("  verify_data_consistency(tenant_id, 'blue', 'green')")
print("  # Check each system's checksum individually")
print("Fix:")
print("  # Full re-sync with increased parallelism")
print("  # Extend sync window, reduce write rate during migration")

# Scenario 2: Rollback Failure
print("\n2. ROLLBACK FAILURE")
print("Symptoms:")
print("  - Traffic still hitting green despite DNS update")
print("  - Rollback duration >60 seconds")
print("Debug commands:")
print("  # Check load balancer config")
print("  # Verify DNS propagation: dig tenant.example.com")
print("  # Check health checks: curl http://blue-lb/health")
print("Fix:")
print("  # Manual DNS update to blue IP")
print("  # Quarterly rollback drills to catch config drift")

# Scenario 3: Incomplete GDPR Deletion
print("\n3. INCOMPLETE GDPR DELETION")
print("Symptoms:")
print("  - Verification finds residual data in 1+ systems")
print("  - Deletion certificate blocked")
print("Debug commands:")
print("  verify_gdpr_deletion(tenant_id, all_systems)")
print("  # Discovery scan: grep -r tenant_id /var/log/* /backups/*")
print("Fix:")
print("  # Update deletion checklist with missing systems")
print("  # Re-run deletion workflow")
print("  # Manual audit for first 5 deletions")

# Scenario 4: Migration Timeout
print("\n4. MIGRATION TIMEOUT")
print("Symptoms:")
print("  - Incremental sync replication lag >30 minutes")
print("  - Never reaches 100% caught up")
print("Debug commands:")
print("  # Check write rate: SELECT COUNT(*) FROM writes WHERE timestamp > NOW() - INTERVAL '1 minute'")
print("  # Monitor sync workers: top -p <sync_worker_pid>")
print("Fix:")
print("  # Scale workers from 4 to 16")
print("  # Consider short maintenance window for write-heavy tenants")

# Scenario 5: Backup Restoration Failure
print("\n5. BACKUP RESTORATION FAILURE")
print("Symptoms:")
print("  - Restore fails with schema error (column mismatch)")
print("  - Version incompatibility between backup and current")
print("Debug commands:")
print("  # Check backup metadata: cat backup_manifest.json")
print("  # Compare schemas: pg_dump --schema-only current | diff - backup_schema.sql")
print("Fix:")
print("  # Run schema migration before restore")
print("  # Monthly restore testing to catch incompatibilities early")

## Section 8: Decision Card - When to Use Zero-Downtime Migration

### Cost-Benefit Analysis

| Approach | Development Cost | Operational Cost | Downtime | Best For |
|----------|-----------------|------------------|----------|----------|
| **Zero-Downtime** | ₹40-60 lakh | ₹50K-80K/migration | 0 minutes | Platinum tenants, 99.99% SLA |
| **Maintenance Window** | ₹2-5 lakh | ₹5K-15K/migration | 2-4 hours | Gold tenants, scheduled updates |
| **Rolling K8s Update** | Minimal | Minimal | 0 minutes | Code-only, no data migration |
| **Hybrid Tiered** | ₹15-25 lakh | Varies by tier | Varies | Multi-tier customer base |

### When to Use Zero-Downtime

**Use this approach when:**
- Tenant has strict SLA requiring 99.99% uptime (about 52.6 minutes of downtime per year at most)
- Revenue impact >₹1 lakh/hour during downtime
- Tenant tier is Platinum/Enterprise with premium support
- Migration spans business hours or peak usage times
- Regulatory compliance prohibits outages (financial trading, healthcare)
- Previous migrations caused customer escalations or churn
- Multi-region rollout with gradual geographic cutover
- Testing new infrastructure version before full commitment (canary)

**Do NOT use when:**
- Tenant accepts scheduled maintenance windows (Gold/Bronze tiers)
- Off-peak migration window available (nights, weekends for B2B)
- Same-region updates compatible with Kubernetes rolling deployment
- Budget constraints: ₹40-60 lakh development cost too high for ROI
- Small tenant with <10K queries/day (backup-restore faster)
- Infrastructure change is backward-compatible (no data migration)
- First-time migration (20% rollback rate too risky)

### Performance Metrics

- **Migration duration:** 6-8 hours (first attempt) → 3-4 hours (experienced)
- **Rollback time:** <60 seconds (SLA requirement)
- **Success rate:** 80% (first) → 98%+ (after 3-5 migrations)
- **Data consistency:** 100% (checksum verification)
- **Operational cost:** ₹50K-80K per migration (parallel infrastructure)

### Trade-offs

**Complexity:**
- 5000+ lines of orchestration code
- 7+ systems to coordinate
- Senior DevOps/SRE expertise required
- Quarterly rollback testing burden

**Latency:**
- +5-10ms write latency during dual-write mode
- Gradual cutover extends total migration time (vs instant cutover)

**Cost:**
- 10-20x higher development cost vs maintenance window
- Roughly 6x higher operational cost at typical values (₹50K-80K vs ₹5K-15K per migration, due to parallel infrastructure)
- But: 0 revenue loss from downtime, preserved customer trust

In [None]:
# Cost Calculator: Zero-Downtime vs Maintenance Window

def calculate_migration_costs(tenant_tier, queries_per_day, revenue_per_hour_downtime):
    """
    Calculate total cost of ownership for different migration strategies.
    
    Args:
        tenant_tier: 'platinum', 'gold', or 'bronze'
        queries_per_day: Daily query volume
        revenue_per_hour_downtime: Revenue lost per hour of downtime (in ₹)
    """
    print("Migration Cost Analysis")
    print(f"Tenant tier: {tenant_tier.upper()}")
    print(f"Query volume: {queries_per_day:,}/day")
    print(f"Revenue impact: ₹{revenue_per_hour_downtime:,}/hour downtime")
    print()
    
    # Zero-downtime migration
    zd_dev_cost = 50_00_000  # ₹50 lakh
    zd_op_cost = 65_000  # ₹65K per migration
    zd_downtime_hours = 0
    zd_revenue_loss = zd_downtime_hours * revenue_per_hour_downtime
    zd_total = zd_dev_cost + zd_op_cost + zd_revenue_loss
    
    print("ZERO-DOWNTIME MIGRATION:")
    print(f"  Development: ₹{zd_dev_cost:,}")
    print(f"  Operational: ₹{zd_op_cost:,}")
    print(f"  Revenue loss: ₹{zd_revenue_loss:,} ({zd_downtime_hours} hours)")
    print(f"  TOTAL: ₹{zd_total:,}")
    print()
    
    # Maintenance window migration
    mw_dev_cost = 3_50_000  # ₹3.5 lakh
    mw_op_cost = 10_000  # ₹10K per migration
    mw_downtime_hours = 3  # 3 hour maintenance window
    mw_revenue_loss = mw_downtime_hours * revenue_per_hour_downtime
    mw_total = mw_dev_cost + mw_op_cost + mw_revenue_loss
    
    print("MAINTENANCE WINDOW MIGRATION:")
    print(f"  Development: ₹{mw_dev_cost:,}")
    print(f"  Operational: ₹{mw_op_cost:,}")
    print(f"  Revenue loss: ₹{mw_revenue_loss:,} ({mw_downtime_hours} hours)")
    print(f"  TOTAL: ₹{mw_total:,}")
    print()
    
    # Recommendation
    print("RECOMMENDATION:")
    if revenue_per_hour_downtime > 1_00_000:  # ₹1 lakh/hour
        print("  → Use ZERO-DOWNTIME (high revenue impact)")
    elif tenant_tier == 'platinum':
        print("  → Use ZERO-DOWNTIME (SLA requirement)")
    elif queries_per_day < 10_000:
        print("  → Use MAINTENANCE WINDOW (low volume tenant)")
    else:
        print("  → Use MAINTENANCE WINDOW (cost-effective)")
    print()
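    print(f"Cost difference (single migration): ₹{abs(zd_total - mw_total):,}")
    # Breakeven = extra development cost / per-migration savings, where savings
    # include both operational cost and the revenue loss that zero-downtime avoids
    extra_dev_cost = zd_dev_cost - mw_dev_cost
    savings_per_migration = (mw_op_cost + mw_revenue_loss) - (zd_op_cost + zd_revenue_loss)
    if savings_per_migration > 0:
        print(f"ROI breakeven: {extra_dev_cost / savings_per_migration:.1f} migrations")
    else:
        print("ROI breakeven: never (zero-downtime costs more per migration)")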

# Example scenarios
print("Scenario 1: High-value Platinum tenant")
calculate_migration_costs(
    tenant_tier='platinum',
    queries_per_day=500_000,
    revenue_per_hour_downtime=3_00_000  # ₹3 lakh/hour
)

print("\n" + "="*70 + "\n")

print("Scenario 2: Low-volume Bronze tenant")
calculate_migration_costs(
    tenant_tier='bronze',
    queries_per_day=5_000,
    revenue_per_hour_downtime=10_000  # ₹10K/hour
)

## Section 9: GCC Enterprise Context

**What is a Global Capability Center (GCC)?**
Offshore/nearshore service delivery centers (e.g., Bangalore, Pune, Hyderabad) serving multiple business units of multinational corporations. Common in:
- Financial services (JPMorgan, Goldman Sachs, HSBC)
- Technology (Microsoft, Google, Amazon)
- Manufacturing (Siemens, GE, Bosch)

**Why Tenant Lifecycle Management is Critical at GCC Scale:**

1. **Multi-BU Complexity:** Single GCC serves 50-200 business units, each a separate tenant
2. **Cross-Region Operations:** Tenants span US, EU, APAC with different regulatory requirements
3. **Cost Attribution:** Finance requires per-tenant cost tracking for chargeback
4. **Compliance Layers:** Must satisfy parent company regulations + India operations + client contracts
5. **SLA Diversity:** Platinum tenants pay 5-10x more than Bronze, expect different treatment

**Stakeholder Perspectives:**

**CFO (Finance):**
- "What's the ROI of ₹50 lakh zero-downtime investment?"
- "Can we do tiered approach? Platinum gets zero-downtime, Bronze gets maintenance window?"
- Needs: Cost-benefit analysis, chargeback model, budget forecast

**CTO (Technology):**
- "What's our migration success rate? Rollback frequency?"
- "How do we prevent another 6-hour outage like last quarter?"
- Needs: Reliability metrics, incident post-mortems, capacity planning

**Compliance (Legal/Risk):**
- "How do we prove GDPR deletion to EU regulators?"
- "What if tenant is under legal hold during deletion request?"
- Needs: Audit trails, deletion certificates, legal hold workflow

**Production Deployment Checklist (GCC Standards):**
1. ✓ Technical review: Architecture, code quality, test coverage
2. ✓ Security review: PII handling, encryption, access controls
3. ✓ Compliance review: GDPR, SOC2, HIPAA applicability
4. ✓ Business review: Cost justification, ROI analysis, SLA impact
5. ✓ Governance review: Runbooks, on-call rotation, escalation paths

## Section 10: Summary and Next Steps

### What You Learned

**1. Blue-Green Migration Pattern**
- Six-phase orchestration: provision → sync → dual-write → catchup → cutover → decommission
- Gradual traffic cutover (10% → 25% → 50% → 100%) for safe rollback
- Sub-60 second rollback capability for failed migrations
- Data consistency verification using multi-system checksums

**2. GDPR Deletion Workflow**
- Systematic erasure across 7+ systems (PostgreSQL, Redis, Pinecone, S3, CloudWatch, backups, analytics)
- Legal hold checks preventing deletion during litigation
- Multi-system verification to catch incomplete deletion
- Cryptographically signed certificates (SHA-256) for audit trails

**3. Backup/Restore/Clone Operations**
- Tiered backup strategies by tenant SLA (Platinum: daily, Gold: weekly, Bronze: monthly)
- Point-in-time recovery using transaction logs + base backups
- Cross-region replication for disaster recovery
- Tenant cloning with PII anonymization for staging/testing

**4. Decision Framework**
- When to use zero-downtime (₹40-60 lakh dev cost) vs maintenance window (₹2-5 lakh)
- Cost-benefit analysis by tenant tier and revenue impact
- Trade-offs: complexity, latency, cost vs availability and customer trust

### Systems You Built

1. **Migration Orchestrator:** Blue-green deployment with 6-phase workflow
2. **GDPR Deletion Engine:** Multi-system deletion with verification and certificates
3. **Backup/Restore Service:** Per-tenant snapshots with PITR and cross-region replication
4. **Verification Framework:** Checksum validation and deletion completeness scans

### Career Relevance

**Roles where this matters:**
- **Senior DevOps Engineer:** Design and implement zero-downtime deployment strategies
- **Site Reliability Engineer (SRE):** Build backup/restore, incident response, disaster recovery
- **Platform Engineer:** Multi-tenant lifecycle management at scale
- **Compliance Engineer:** GDPR/HIPAA/SOC2 deletion workflows and audit trails

**Interview Questions You Can Answer:**
- "How would you migrate a multi-TB tenant database with zero downtime?"
- "Design a GDPR-compliant deletion system for a SaaS platform"
- "What's your approach to backup/restore testing in production?"
- "How do you handle rollback for failed migrations?"

### Next Module: M14.4 Operating Model & Governance

**What's next:**
- Multi-GCC operating models with cost allocation and chargeback
- SLA management and tiered support (Platinum/Gold/Bronze)
- Incident response and on-call rotation
- Compliance frameworks (SOC2, HIPAA, ISO 27001)
- Capacity planning and resource optimization

### Practice Exercises

1. **Extend the migration orchestrator:**
   - Add parallel sync workers (scale from 1 to 16)
   - Implement sync progress monitoring
   - Add automated rollback on error rate spike

2. **Enhance GDPR deletion:**
   - Add support for 8th system (e.g., Elasticsearch)
   - Implement log anonymization (PII masking)
   - Build deletion request approval workflow

3. **Backup optimization:**
   - Implement backup compression (reduce storage cost)
   - Add backup encryption (compliance requirement)
   - Build restore testing automation (monthly drills)

4. **Cost optimization:**
   - Build tiered migration strategy (Platinum/Gold/Bronze)
   - Implement backup lifecycle policies (archive old backups to Glacier)
   - Add cost tracking per tenant (chargeback model)

### Resources

- **GDPR Article 17:** [https://gdpr-info.eu/art-17-gdpr/](https://gdpr-info.eu/art-17-gdpr/)
- **AWS Well-Architected Framework:** Reliability pillar (backup/restore)
- **Google SRE Book:** Chapter 18 - Software Engineering in SRE
- **PostgreSQL Replication:** Logical replication for zero-downtime migrations
- **Terraform Blue-Green Modules:** Infrastructure as Code for dual environments