# L3 M12.4: Compliance Boundaries & Data Governance

## Learning Arc

**Purpose:** Build production-ready compliance management for multi-tenant RAG systems with automated data deletion across 7 systems to meet GDPR Article 17, CCPA, DPDPA, and other regulatory requirements.

**Concepts Covered:**
- Per-tenant compliance configuration (retention policies, data residency, regulations)
- Automated scheduled deletion across multi-system architecture
- GDPR Article 17 (Right to Erasure) 30-day SLA workflow
- Legal hold protection (litigation/investigation freeze)
- Multi-system cascade deletion (vector DB, S3, PostgreSQL, Redis, logs, backups, CDN)
- Immutable audit trail (7-10 year retention)
- Verification testing (48 hours post-deletion)
- Regulatory compliance (GDPR, CCPA, DPDPA, SOX, HIPAA, PCI-DSS, FINRA)
- GCC cost optimization (₹15K-50K/month vs. ₹2L-10L/month vendor)
- Failure scenarios and remediation (partial deletion, legal hold bypass, backup restoration)

**After Completing This Notebook:**
- You will understand per-tenant compliance configuration and how to support 7+ regulatory frameworks
- You can implement automated scheduled deletion with legal hold protection
- You will recognize the critical importance of multi-system cascade deletion and verification
- You can build GDPR Article 17 workflows with 30-day SLA compliance
- You will understand immutable audit trails and 7-10 year retention requirements
- You can calculate GCC compliance costs and ROI vs. vendor solutions
- You will recognize common failure scenarios and implement remediation strategies

**Context in Track L3.M12:**
This module builds on **L3 M11 (Data Security & Encryption)** and **L3 M10 (Multi-Tenancy & Access Control)**, preparing you for **L3 M13 (Cost Optimization & Monitoring)**.

## Section 1: Environment Setup & Service Detection

In [None]:
import os
import sys
from datetime import datetime, timedelta
import json

# Add src to path for imports
if '../src' not in sys.path:
    sys.path.insert(0, '../src')

# OFFLINE mode for L3 consistency (no external API calls)
OFFLINE = os.getenv("OFFLINE", "true").lower() == "true"

# SERVICE detection from script - Pinecone (PRIMARY) + AWS (SECONDARY)
PINECONE_ENABLED = os.getenv("PINECONE_ENABLED", "false").lower() == "true"
AWS_ENABLED = os.getenv("AWS_ENABLED", "false").lower() == "true"

if OFFLINE or (not PINECONE_ENABLED and not AWS_ENABLED):
    print("⚠️  Running in OFFLINE mode")
    print("   → External API calls will be skipped")
    print("   → Set OFFLINE=false plus PINECONE_ENABLED=true and/or AWS_ENABLED=true in .env to enable")
    print("   → Add API keys: PINECONE_API_KEY, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY")
else:
    print("✓ Online mode - external services enabled")
    if PINECONE_ENABLED:
        print("  ✓ Pinecone vector database enabled")
    if AWS_ENABLED:
        print("  ✓ AWS S3/CloudFront enabled")

# Expected: OFFLINE mode message (default)

## Section 2: Import Core Compliance Modules

In [None]:
from l3_m12_compliance_boundaries import (
    TenantComplianceConfig,
    ComplianceDeletionService,
    DeletionRequest,
    ComplianceAuditTrail,
    RegulationType,
    DataResidency,
    create_compliance_config,
    execute_scheduled_deletion,
    check_legal_hold,
    verify_deletion,
)

print("✓ Successfully imported compliance modules")
print(f"  Available regulations: {[r.value for r in RegulationType]}")
print(f"  Data residency options: {[d.value for d in DataResidency]}")

# Expected: Import success with regulation and residency lists

## Section 3: Per-Tenant Compliance Configuration

Create compliance configurations for different regulatory requirements:
- **Tenant A (EU)**: GDPR 90-day retention
- **Tenant B (US)**: SOX 7-year retention (2555 days)
- **Tenant C (India)**: DPDPA 180-day retention with data localization
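
As a rough illustration of what a per-tenant configuration object might enforce, here is a minimal sketch. `Regulation`, `TenantPolicy`, and `make_policy` are hypothetical stand-ins, not the actual `RegulationType`, `TenantComplianceConfig`, or `create_compliance_config` from `l3_m12_compliance_boundaries`, whose internals are not shown in this notebook.

```python
from dataclasses import dataclass
from enum import Enum


class Regulation(Enum):  # hypothetical stand-in for RegulationType
    GDPR = "GDPR"
    CCPA = "CCPA"
    DPDPA = "DPDPA"
    SOX = "SOX"


@dataclass(frozen=True)
class TenantPolicy:  # hypothetical stand-in for TenantComplianceConfig
    tenant_id: str
    regulations: tuple
    retention_days: int
    data_residency: str
    legal_hold_active: bool = False
    legal_hold_reason: str = ""

    def __post_init__(self):
        # Reject configurations that cannot be acted on safely.
        if self.retention_days <= 0:
            raise ValueError("retention_days must be positive")
        if self.legal_hold_active and not self.legal_hold_reason:
            raise ValueError("an active legal hold needs a recorded reason")


def make_policy(tenant_id, regulations, retention_days, data_residency, **kwargs):
    """Convert regulation names to enum members; unknown names raise ValueError."""
    regs = tuple(Regulation(name) for name in regulations)
    return TenantPolicy(tenant_id, regs, retention_days, data_residency, **kwargs)


policy = make_policy("tenant_a_eu", ["GDPR"], 90, "EU")
print(policy.retention_days, [r.value for r in policy.regulations])
```

Validating at construction time means a misconfigured tenant fails loudly when it is onboarded rather than when the nightly deletion job runs.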

In [None]:
# Tenant A: GDPR compliance (EU)
tenant_a_config = create_compliance_config(
    tenant_id="tenant_a_eu",
    tenant_name="ACME Corp EU Division",
    tenant_email="dpo@acme.eu",
    regulations=["GDPR", "SOX"],
    retention_days=90,  # GDPR Article 5(1)(e) storage limitation
    data_residency="EU",
    encryption_required=True,
    encryption_standard="AES-256",
)

print("✓ Tenant A (GDPR) configuration created:")
print(f"  Regulations: {[r.value for r in tenant_a_config.regulations]}")
print(f"  Retention: {tenant_a_config.retention_days} days")
print(f"  Data residency: {tenant_a_config.data_residency.value}")
print(f"  Auto-delete: {tenant_a_config.auto_delete_enabled}")

# Expected: GDPR tenant with 90-day retention, EU residency

In [None]:
# Tenant B: SOX compliance (US) - 7-year retention
tenant_b_config = create_compliance_config(
    tenant_id="tenant_b_us",
    tenant_name="TechStart Inc US",
    tenant_email="compliance@techstart.com",
    regulations=["CCPA", "SOX"],
    retention_days=2555,  # 7 x 365 days - SOX audit-record retention
    data_residency="US",
)

print("✓ Tenant B (SOX) configuration created:")
print(f"  Retention: {tenant_b_config.retention_days} days ({tenant_b_config.retention_days // 365} years)")
print(f"  Regulations: {[r.value for r in tenant_b_config.regulations]}")

# Expected: SOX tenant with 2555-day (7-year) retention

In [None]:
# Tenant C: DPDPA compliance (India) - data localization
tenant_c_config = create_compliance_config(
    tenant_id="tenant_c_in",
    tenant_name="Mumbai Finance Services",
    tenant_email="legal@mumbaifinance.in",
    regulations=["DPDPA", "PCI-DSS"],
    retention_days=180,  # DPDPA data minimization
    data_residency="IN",  # India data localization requirement
)

print("✓ Tenant C (DPDPA) configuration created:")
print(f"  Retention: {tenant_c_config.retention_days} days")
print(f"  Data residency: {tenant_c_config.data_residency.value} (India localization)")

# Expected: DPDPA tenant with 180-day retention, India residency
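
A residency setting only helps if something checks it before data is written. The sketch below shows one possible guard. The `ALLOWED_REGIONS` mapping and both function names are assumptions made for illustration; the region identifiers are real AWS region names, but which regions satisfy a given tenant's residency obligation is a legal decision, not something this code can settle.

```python
# Hypothetical mapping from residency code to permitted cloud regions.
ALLOWED_REGIONS = {
    "EU": {"eu-west-1", "eu-central-1"},
    "US": {"us-east-1", "us-west-2"},
    "IN": {"ap-south-1"},
}


def region_allowed(data_residency: str, region: str) -> bool:
    """True if `region` is permitted for the tenant's residency code."""
    return region in ALLOWED_REGIONS.get(data_residency, set())


def pick_region(data_residency: str, candidates: list) -> str:
    """Return the first candidate region that satisfies residency, else raise."""
    for region in candidates:
        if region_allowed(data_residency, region):
            return region
    raise ValueError(f"no candidate region satisfies residency {data_residency!r}")


print(region_allowed("IN", "ap-south-1"))
print(pick_region("EU", ["us-east-1", "eu-central-1"]))
```

An unknown residency code matches no region, so the guard refuses by default instead of silently allowing a write.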

## Section 4: Legal Hold Protection

**CRITICAL:** Legal holds prevent data deletion during litigation/investigation.
Unauthorized deactivation may constitute evidence destruction (federal crime).
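
Section 11 recommends checking the hold in three places (the config flag, a `legal_holds` table, and an external legal system API) and skipping deletion if any of them says yes. A minimal sketch of that "any source wins" rule follows. `any_legal_hold` and the three lambda sources are hypothetical, and treating an unreachable source as an active hold is an assumption of this sketch, not documented behavior of `check_legal_hold`.

```python
from typing import Callable, Dict, Optional, Tuple

HoldCheck = Callable[[str], Tuple[bool, Optional[str]]]


def any_legal_hold(tenant_id: str, sources: Dict[str, HoldCheck]) -> Tuple[bool, Optional[str]]:
    """Return (True, reason) if ANY source reports a hold or cannot be checked."""
    for name, check in sources.items():
        try:
            active, reason = check(tenant_id)
        except Exception as exc:
            # Fail closed: a source we cannot reach is treated as an active hold.
            return True, f"{name}: check failed ({exc})"
        if active:
            return True, f"{name}: {reason or 'no reason recorded'}"
    return False, None


# Hypothetical sources standing in for the config flag, the table, and the API.
sources = {
    "config_flag": lambda t: (False, None),
    "legal_holds_table": lambda t: (t == "tenant_litigation", "SEC Investigation"),
    "external_api": lambda t: (False, None),
}

print(any_legal_hold("tenant_a_eu", sources))
print(any_legal_hold("tenant_litigation", sources))
```

Failing closed costs at most a delayed deletion, which can be retried; failing open can destroy evidence, which cannot be undone.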

In [None]:
# Scenario 1: No legal hold (deletion allowed)
legal_hold_active, reason = check_legal_hold("tenant_a_eu", tenant_a_config)

print("Legal Hold Check - Tenant A:")
print(f"  Legal hold active: {legal_hold_active}")
print(f"  Reason: {reason or 'None - deletion allowed'}")

# Expected: Legal hold not active, deletion allowed

In [None]:
# Scenario 2: Activate legal hold (litigation pending)
tenant_litigation = create_compliance_config(
    tenant_id="tenant_litigation",
    tenant_name="Company Under Investigation",
    tenant_email="legal@company.com",
    regulations=["SOX", "FINRA"],
    retention_days=2555,
    data_residency="US",
    legal_hold_active=True,
    legal_hold_reason="SEC Investigation: Case #2024-SEC-98765",
)

legal_hold_active, reason = check_legal_hold("tenant_litigation", tenant_litigation)

print("\nLegal Hold Check - Litigation Tenant:")
print(f"  Legal hold active: {legal_hold_active}")
print(f"  Reason: {reason}")
print("  ⚠️  ALL DELETION BLOCKED - evidence preservation required")

# Expected: Legal hold active, SEC investigation reason

## Section 5: Scheduled Deletion Workflow

Daily execution at 2am UTC:
1. Check legal hold (3 sources)
2. Check auto_delete_enabled
3. Calculate cutoff date = NOW() - retention_days
4. Cascade delete across 7 systems
5. Log to immutable audit trail
6. Verify 48 hours later
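
The gating and date arithmetic in steps 1-3 and 6 can be sketched as a pure function. `plan_scheduled_run` is a hypothetical helper, not part of `execute_scheduled_deletion`; it only decides whether a run should proceed and computes the cutoff and verification times.

```python
from datetime import datetime, timedelta, timezone


def plan_scheduled_run(retention_days, legal_hold_active, auto_delete_enabled, now=None):
    """Decide whether to delete and, if so, compute the cutoff and verify times."""
    now = now or datetime.now(timezone.utc)
    if legal_hold_active:  # step 1: a hold always wins
        return {"skipped": True, "reason": "legal hold active", "cutoff_date": None}
    if not auto_delete_enabled:  # step 2
        return {"skipped": True, "reason": "auto-delete disabled", "cutoff_date": None}
    return {
        "skipped": False,
        "reason": None,
        "cutoff_date": now - timedelta(days=retention_days),  # step 3
        "verify_after": now + timedelta(hours=48),  # step 6
    }


run_time = datetime(2024, 11, 18, 2, 0, tzinfo=timezone.utc)  # a 2am UTC run
plan = plan_scheduled_run(90, legal_hold_active=False, auto_delete_enabled=True, now=run_time)
print(plan["cutoff_date"].isoformat(), plan["verify_after"].isoformat())
```

Passing `now` explicitly keeps the function deterministic, which makes the cutoff logic easy to unit test.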

In [None]:
# Initialize deletion service (OFFLINE mode)
deletion_service = ComplianceDeletionService(
    pinecone_client=None,
    s3_client=None,
    redis_client=None,
    db_session=None,
    cloudfront_client=None,
    offline_mode=True,  # No external API calls
)

print("✓ Deletion service initialized (OFFLINE mode)")
print("  → External API calls will be skipped")

# Expected: Offline mode initialization

In [None]:
# Execute scheduled deletion for Tenant A (GDPR)
system_config = {
    "s3_bucket": "compliance-documents",
    "cloudfront_distribution_id": None,
}

deletion_result = execute_scheduled_deletion(
    tenant_id="tenant_a_eu",
    compliance_config=tenant_a_config,
    deletion_service=deletion_service,
    config=system_config,
)

print("\nScheduled Deletion Result - Tenant A:")
print(f"  Tenant: {deletion_result['tenant_id']}")
print(f"  Cutoff date: {deletion_result['cutoff_date']}")
print(f"  All systems succeeded: {deletion_result['all_systems_succeeded']}")
print(f"  Audit trail ID: {deletion_result['audit_trail_id']}")
print(f"\n  Per-system results:")
for system, result in deletion_result['results'].items():
    status = "✓" if result['success'] else "✗"
    print(f"    {status} {system}: {result['count']} items, error: {result['error']}")

# Expected: Deletion completed across 7 systems (offline mode - skipped operations)

## Section 6: Legal Hold Blocking Deletion

When legal hold is active, scheduled deletion is BLOCKED to prevent evidence destruction.

In [None]:
# Attempt deletion with legal hold active (should be BLOCKED)
deletion_result_blocked = execute_scheduled_deletion(
    tenant_id="tenant_litigation",
    compliance_config=tenant_litigation,
    deletion_service=deletion_service,
    config=system_config,
)

print("Deletion Attempt with Legal Hold:")
print(f"  Skipped: {deletion_result_blocked['skipped']}")
print(f"  Reason: {deletion_result_blocked['reason']}")
print("  🚨 DELETION BLOCKED - legal hold protection working correctly")

# Expected: Deletion skipped, SEC investigation reason

## Section 7: Multi-System Cascade Deletion

Deletion cascades across 7 systems independently:
1. **Vector DB (Pinecone)** - namespace deletion with metadata filter
2. **S3 (AWS)** - prefix object deletion (batched, max 1,000/API call)
3. **PostgreSQL** - CASCADE delete with FK constraints
4. **Redis** - pattern key deletion
5. **Logs** - anonymization (PII replaced with `<REDACTED>`)
6. **Backups** - marked for deletion in next cycle
7. **CDN (CloudFront)** - cache invalidation
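
"Independently" is the key word: one system failing should neither stop the others nor be hidden in the overall result. Here is a minimal sketch of that pattern, assuming each system is represented by a callable that returns a deleted-item count. `cascade` and the example deleters are hypothetical; the result shape mirrors the `success`/`count`/`error` keys printed elsewhere in this notebook, but this is not the implementation of `ComplianceDeletionService.cascade_delete`.

```python
from typing import Callable, Dict


def cascade(tenant_id: str, deleters: Dict[str, Callable[[str], int]]) -> dict:
    """Run every deleter, recording success or failure per system."""
    results = {}
    for system, delete in deleters.items():
        try:
            results[system] = {"success": True, "count": delete(tenant_id), "error": None}
        except Exception as exc:
            # Record the failure and keep going with the remaining systems.
            results[system] = {"success": False, "count": 0, "error": str(exc)}
    return {
        "results": results,
        "all_systems_succeeded": all(r["success"] for r in results.values()),
    }


def _postgres_down(tenant_id):
    raise ConnectionError("PostgreSQL unavailable")


outcome = cascade(
    "tenant_a_eu",
    {"vector_db": lambda t: 1847, "s3": lambda t: 523, "postgresql": _postgres_down},
)
print(outcome["all_systems_succeeded"])
print(outcome["results"]["postgresql"])
```

Because the overall flag is derived from the per-system records, a partial deletion cannot be reported as a success.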

In [None]:
# Demonstrate cascade deletion logic
cutoff_date = datetime.utcnow() - timedelta(days=90)

cascade_results = deletion_service.cascade_delete(
    tenant_id="tenant_a_eu",
    cutoff_date=cutoff_date,
    config=system_config,
)

print("Cascade Deletion Results (7 Systems):")
print(f"  Cutoff date: {cutoff_date.isoformat()}\n")

for system, result in cascade_results.items():
    status_icon = "✓" if result['success'] else "✗"
    print(f"  {status_icon} {system.upper()}:")
    print(f"      Deleted: {result['count']} items")
    print(f"      Status: {result['error'] or 'Success'}")

# Expected: 7 systems processed (offline mode - operations skipped)
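
The S3 step above is batched because a single delete call accepts at most 1,000 keys. The sketch below only builds the request payloads, so it runs offline; `delete_batches` is a hypothetical helper, and the commented-out call shows where boto3's `delete_objects` would receive each payload in online mode.

```python
from typing import Iterable, Iterator, List

S3_MAX_KEYS_PER_CALL = 1000


def delete_batches(keys: Iterable[str], size: int = S3_MAX_KEYS_PER_CALL) -> Iterator[dict]:
    """Yield `Delete` payloads holding at most `size` keys each."""
    batch: List[dict] = []
    for key in keys:
        batch.append({"Key": key})
        if len(batch) == size:
            yield {"Objects": batch, "Quiet": True}
            batch = []
    if batch:  # final partial batch
        yield {"Objects": batch, "Quiet": True}


keys = [f"tenant_a_eu/doc_{i}.pdf" for i in range(2500)]
payloads = list(delete_batches(keys))
print([len(p["Objects"]) for p in payloads])

# Online mode (not executed here):
# for payload in payloads:
#     s3_client.delete_objects(Bucket="compliance-documents", Delete=payload)
```

Taking an iterable rather than a list lets the helper consume keys lazily from a paginated listing without holding every key in memory.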

## Section 8: GDPR Article 17 (Right to Erasure)

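Implement a user deletion request with a 30-day SLA. Article 17 itself requires erasure "without undue delay"; the one-month response window comes from GDPR Article 12(3), which this notebook treats as 30 days.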

In [None]:
# Create GDPR Article 17 deletion request
deletion_request = DeletionRequest(
    request_id="del_tenant_a_eu_user_123_1699920000",
    tenant_id="tenant_a_eu",
    user_id="user_123",
    request_type="gdpr_article_17",
)

print("GDPR Article 17 Deletion Request:")
print(f"  Request ID: {deletion_request.request_id}")
print(f"  Tenant: {deletion_request.tenant_id}")
print(f"  User: {deletion_request.user_id}")
print(f"  Type: {deletion_request.request_type}")
print(f"  Requested at: {deletion_request.requested_at.isoformat()}")
print(f"\n  SLA: 30 days (GDPR Article 17 requirement)")
print(f"  Deadline: {(deletion_request.requested_at + timedelta(days=30)).isoformat()}")

# Expected: Deletion request created with 30-day deadline
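
Tracking the deadline is simple date arithmetic, and an explicit helper makes overdue requests easy to surface in a dashboard or alert. `sla_status` is a hypothetical function, not part of `DeletionRequest`; the timestamp is the one embedded in the example request ID above.

```python
from datetime import datetime, timedelta, timezone

SLA_DAYS = 30


def sla_status(requested_at, now=None, sla_days=SLA_DAYS):
    """Report the deadline, whole days remaining, and whether it has passed."""
    now = now or datetime.now(timezone.utc)
    deadline = requested_at + timedelta(days=sla_days)
    return {
        "deadline": deadline,
        # timedelta.days floors, so any overdue request shows a negative number
        "days_remaining": (deadline - now).days,
        "overdue": now > deadline,
    }


requested_at = datetime.fromtimestamp(1699920000, tz=timezone.utc)
status = sla_status(requested_at, now=requested_at + timedelta(days=10))
print(status["deadline"].date(), status["days_remaining"], status["overdue"])
```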

## Section 9: Verification Testing (48 Hours Post-Deletion)

**CRITICAL:** Independent verification ensures data was actually deleted.
Don't reuse deletion code - check systems directly.
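
One way to keep verification independent is to give it its own read-only probes that count what is still stored, rather than trusting the deletion job's return values. The sketch below assumes each probe is a callable returning a residual item count. `verify_independently`, the in-memory `stores`, and the probes are all hypothetical and unrelated to the internals of `verify_deletion`.

```python
def verify_independently(tenant_id, probes):
    """Count residual items per system; a probe error counts as unverified."""
    systems = {}
    for system, count_residual in probes.items():
        try:
            leftover = count_residual(tenant_id)
            systems[system] = {
                "verified": leftover == 0,
                "message": f"{leftover} residual items",
            }
        except Exception as exc:
            systems[system] = {"verified": False, "message": f"probe failed: {exc}"}
    return {
        "tenant_id": tenant_id,
        "systems": systems,
        "all_verified": all(s["verified"] for s in systems.values()),
    }


# In-memory stand-ins for real systems; Redis still holds one key.
stores = {
    "vector_db": {"tenant_a_eu": []},
    "redis": {"tenant_a_eu": ["session:user_123"]},
}
probes = {name: (lambda t, s=store: len(s.get(t, []))) for name, store in stores.items()}

report = verify_independently("tenant_a_eu", probes)
print(report["all_verified"], report["systems"]["redis"]["message"])
```

A probe that errors out is reported as unverified, so an outage during verification cannot be mistaken for a clean result.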

In [None]:
# Run verification 48 hours post-deletion
verification_result = verify_deletion(
    tenant_id="tenant_a_eu",
    deletion_request_id="del_tenant_a_eu_user_123_1699920000",
    deletion_service=deletion_service,
    config=system_config,
)

print("Deletion Verification Results:")
print(f"  Tenant: {verification_result['tenant_id']}")
print(f"  Request ID: {verification_result['deletion_request_id']}")
print(f"  Verified at: {verification_result['verified_at']}")
print(f"  All verified: {verification_result['all_verified']}\n")

print("  Per-system verification:")
for system, result in verification_result['systems'].items():
    status = "✓" if result['verified'] else "✗"
    print(f"    {status} {system}: {result['message']}")

# Expected: All systems verified (offline mode)

## Section 10: Immutable Audit Trail

**NEVER DELETED** - 7-10 year retention for regulatory compliance.
Append-only log of all compliance events.
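
One common way to make an append-only log tamper-evident is to chain entries by hash, so that editing any past entry invalidates everything after it. The sketch below illustrates that idea only. `AppendOnlyAuditLog` is hypothetical, and whether `ComplianceAuditTrail` uses hash chaining, database permissions, or write-once storage is not stated in this notebook.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Stable SHA-256 over the JSON form of an entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AppendOnlyAuditLog:
    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, tenant_id, event_type, event_data):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        body = {
            "tenant_id": tenant_id,
            "event_type": event_type,
            "event_data": event_data,
            "prev_hash": prev_hash,
        }
        entry = {**body, "hash": _digest(body)}
        self._entries.append(entry)
        return entry

    def verify_chain(self) -> bool:
        """Recompute every hash and check each link to its predecessor."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


log = AppendOnlyAuditLog()
log.append("tenant_a_eu", "scheduled_deletion_executed", {"s3": {"deleted": 523}})
log.append("tenant_a_eu", "deletion_verified", {"all_verified": True})
print(log.verify_chain())
```

The class exposes no update or delete method; in production the same property would also need to be enforced by the storage layer, since in-process code can always be bypassed.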

In [None]:
# Create audit trail entry
audit_entry = ComplianceAuditTrail(
    tenant_id="tenant_a_eu",
    event_type="scheduled_deletion_executed",
    event_data={
        "cutoff_date": "2024-08-20T00:00:00Z",
        "retention_days": 90,
        "results": {
            "vector_db": {"deleted": 1847},
            "s3": {"deleted": 523},
            "postgresql": {"deleted": 523},
            "redis": {"deleted": 234},
        },
    },
)

print("Immutable Audit Trail Entry:")
print(f"  Audit ID: {audit_entry.id}")
print(f"  Tenant: {audit_entry.tenant_id}")
print(f"  Event: {audit_entry.event_type}")
print(f"  Created at: {audit_entry.created_at.isoformat()}")
print(f"  Retention: 7-10 years (NEVER DELETED)")
print(f"\n  Event data:")
print(f"    Cutoff date: {audit_entry.event_data['cutoff_date']}")
print(f"    Retention policy: {audit_entry.event_data['retention_days']} days")
print(f"    Deletion counts:")
for system, counts in audit_entry.event_data['results'].items():
    print(f"      {system}: {counts['deleted']} items")

# Expected: Audit entry with 7-10 year retention

## Section 11: Failure Scenarios & Remediation

Common failures and their fixes from production experience.

In [None]:
# Demonstrate failure scenarios
failure_scenarios = [
    {
        "failure": "Partial Deletion",
        "problem": "Vector DB deleted, S3 deleted, PostgreSQL down → partial deletion logged as success",
        "impact": "GDPR violation (data still exists but reported deleted)",
        "fix": "Track per-system status in deletion_status table; mark failed systems, retry tomorrow; don't mark deletion 'complete' until ALL systems succeed",
    },
    {
        "failure": "Legal Hold Not Checked",
        "problem": "Litigation active (legal hold = TRUE), scheduled job deletes anyway → evidence destruction",
        "impact": "Obstruction of justice charge (federal crime)",
        "fix": "Triple-check legal hold from 3 sources: (1) tenant_compliance_config.legal_hold_active, (2) legal_holds table, (3) External legal system API. If ANY = TRUE, skip deletion and alert Legal Counsel",
    },
    {
        "failure": "Backup Restore Brings Data Back",
        "problem": "Data deleted from production; backup restore on unrelated incident brings deleted data back",
        "impact": "GDPR violation (user's 'forgotten' data resurfaces)",
        "fix": "(1) Mark backups for deletion in same workflow, (2) Verification job checks backup_metadata.marked_for_deletion = TRUE, (3) Alert if backups not marked",
    },
    {
        "failure": "Third-Party Systems Not Notified",
        "problem": "Data deleted from your systems, but Mixpanel/Segment still have it",
        "impact": "GDPR violation (processor retains data without controller authorization)",
        "fix": "(1) Maintain inventory of third-party systems + deletion APIs, (2) Automated deletion notifications to processors, (3) Manual tickets for systems without API, (4) Track third-party deletion status in audit trail",
    },
    {
        "failure": "No Verification (False Confidence)",
        "problem": "Deletion job logs success, but data still exists (bug in deletion logic). Auditor finds data → compliance failure",
        "impact": "Regulatory violation + loss of audit credibility",
        "fix": "(1) Separate verification job runs 48 hours post-deletion, (2) Independent checks (don't reuse deletion code), (3) Verification failures trigger ops alert, (4) Log verification results (PASS/FAIL) to audit trail",
    },
]

print("Common Failure Scenarios & Remediation:\n")
for idx, scenario in enumerate(failure_scenarios, 1):
    print(f"{idx}. {scenario['failure']}")
    print(f"   Problem: {scenario['problem']}")
    print(f"   Impact: {scenario['impact']}")
    print(f"   Fix: {scenario['fix']}\n")

# Expected: 5 failure scenarios with remediation strategies
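
The fix for the first scenario (track per-system status, retry only what failed, and never mark the request complete early) can be sketched as follows. A plain dict stands in for the `deletion_status` table, and `pending_systems` and `retry_pending` are hypothetical helpers, not functions from the course library.

```python
def pending_systems(status):
    """Systems that have not yet confirmed deletion."""
    return [name for name, state in status.items() if state != "succeeded"]


def retry_pending(tenant_id, status, deleters):
    """Re-run only the pending systems and update `status` in place."""
    for name in pending_systems(status):
        try:
            deleters[name](tenant_id)
            status[name] = "succeeded"
        except Exception as exc:
            status[name] = f"failed: {exc}"
    return {"status": status, "complete": not pending_systems(status)}


# Day 1 left PostgreSQL failed, so the request is not complete yet.
status = {
    "vector_db": "succeeded",
    "s3": "succeeded",
    "postgresql": "failed: connection refused",
}
calls = []
deleters = {name: (lambda t, n=name: calls.append(n)) for name in status}

# Day 2: only PostgreSQL is retried.
result = retry_pending("tenant_a_eu", status, deleters)
print(result["complete"], calls)
```

Systems that already succeeded are not touched again, which keeps retries cheap and avoids re-running deletes with a newer cutoff by accident.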

## Section 12: Cost Analysis (GCC vs. Vendor)

Calculate compliance costs and ROI for GCC environment.

In [None]:
# Cost breakdown for different tenant scales
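def calculate_gcc_compliance_cost(num_tenants):
    """Calculate monthly compliance cost (INR) for a GCC."""
    if num_tenants <= 20:
        # Small GCC (per-tenant figure assumes 20 tenants)
        return {
            "celery_workers": 2000,
            "audit_log_storage": 3000,
            "deletion_api_calls": 500,
            "dpo_time": 10000,  # 20% FTE
            "total": 15000,  # rounded headline figure; the line items above sum to 15,500
            "per_tenant": 750,  # 15,000 / 20
        }
    else:
        # Medium GCC (figures assume 50 tenants regardless of num_tenants)
        return {
            "celery_workers": 5000,
            "audit_log_storage": 10000,
            "deletion_api_calls": 2000,
            "dpo_time": 35000,  # 100% FTE
            "total": 50000,  # rounded headline figure; the line items above sum to 52,000
            "per_tenant": 1000,  # 50,000 / 50
        }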

small_gcc_cost = calculate_gcc_compliance_cost(20)
medium_gcc_cost = calculate_gcc_compliance_cost(50)

print("GCC Compliance Cost Analysis:\n")
print("Small GCC (20 tenants):")
for item, cost in small_gcc_cost.items():
    if item == "total":
        print(f"  {'='*40}")
    print(f"  {item.replace('_', ' ').title()}: ₹{cost:,}")

print("\nMedium GCC (50 tenants):")
for item, cost in medium_gcc_cost.items():
    if item == "total":
        print(f"  {'='*40}")
    print(f"  {item.replace('_', ' ').title()}: ₹{cost:,}")

# ROI comparison
manual_deletion_cost = 60000  # DPO-only manual process
vendor_cost_min = 200000  # OneTrust, BigID minimum
vendor_cost_max = 1000000  # Enterprise tier

print("\nROI Comparison (Medium GCC):")
print(f"  Manual deletion (DPO only): ₹{manual_deletion_cost:,}/month")
print(f"  Automated (this system): ₹{medium_gcc_cost['total']:,}/month")
print(f"  Vendor solution (OneTrust): ₹{vendor_cost_min:,}-{vendor_cost_max:,}/month")
print(f"\n  Savings vs. Manual: ₹{manual_deletion_cost - medium_gcc_cost['total']:,}/month (₹{(manual_deletion_cost - medium_gcc_cost['total']) * 12:,}/year)")
print(f"  Savings vs. Vendor (min): ₹{vendor_cost_min - medium_gcc_cost['total']:,}/month (₹{(vendor_cost_min - medium_gcc_cost['total']) * 12:,}/year)")

# Expected: Cost breakdown showing ₹10K-150K monthly savings
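
To make the arithmetic behind those savings explicit, here is a small worked sketch using the same medium-GCC figures. `savings` is a hypothetical helper. Summing the four line items gives ₹52,000, slightly above the ₹50,000 headline total that the comparisons use, so the headline is best read as a rounded figure.

```python
line_items = {
    "celery_workers": 5000,
    "audit_log_storage": 10000,
    "deletion_api_calls": 2000,
    "dpo_time": 35000,
}
itemized_total = sum(line_items.values())  # 52,000
headline_total = 50000  # figure used in the comparison above


def savings(baseline_monthly, automated_monthly):
    """Return (monthly, annual) savings of automation against a baseline."""
    monthly = baseline_monthly - automated_monthly
    return monthly, monthly * 12


vs_manual = savings(60000, headline_total)
vs_vendor_min = savings(200000, headline_total)

print(f"Itemized total: {itemized_total:,} (headline {headline_total:,})")
print(f"vs. manual: {vs_manual[0]:,}/month, {vs_manual[1]:,}/year")
print(f"vs. vendor (min): {vs_vendor_min[0]:,}/month, {vs_vendor_min[1]:,}/year")
```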

## Section 13: Production Deployment Checklist

Ensure all components are ready before going to production.

In [None]:
# Production readiness checklist
production_checklist = [
    "✓ Tenant compliance config stored (per-tenant regulations, retention, residency)",
    "✓ Scheduled deletion job runs daily (2am UTC, processes all tenants)",
    "✓ Multi-system cascade implemented (vector DB, S3, PostgreSQL, Redis, logs, backups, CDN)",
    "✓ Legal hold triple-check (3 independent sources prevent evidence destruction)",
    "✓ Verification testing automated (48 hours post-deletion, alerts on failures)",
    "✓ Audit trail immutable (7-10 year retention, never deleted)",
    "✓ Third-party systems inventoried (Mixpanel, Segment, Snowflake, others)",
    "✓ Backup integration complete (marked for deletion, verified in next backup cycle)",
    "✓ CFO chargeback reports monthly (per-tenant cost allocation, ±2% accuracy)",
    "✓ DPO approval workflow (high-stakes tenants require manual sign-off before deletion)",
    "✓ Cross-border transfer mechanisms (SCCs or BCRs documented for EU↔India)",
    "✓ Regulatory inventory maintained (updated quarterly, reviewed by Compliance Officer)",
]

print("Production Deployment Checklist:")
print("=" * 80)
for item in production_checklist:
    print(item)
print("=" * 80)
print(f"\nTotal items: {len(production_checklist)}")
print("\n⚠️  IMPORTANT: Consult Legal Team before deploying to production")

# Expected: 12-item production checklist

## Section 14: Key Takeaways

**What You've Learned:**

1. **Per-Tenant Compliance**: Configure retention policies per tenant (90 days GDPR vs. 7 years SOX)
2. **Legal Hold Protection**: Triple-check prevents evidence destruction (litigation/investigation freeze)
3. **Multi-System Cascade**: Delete across 7 systems independently with per-system status tracking
4. **GDPR Article 17**: 30-day SLA with automated deletion and verification
5. **Immutable Audit Trail**: 7-10 year retention, never deleted, audit-ready evidence
6. **Verification Testing**: Independent checks 48 hours post-deletion (don't reuse deletion code)
7. **Failure Remediation**: Partial deletion, legal hold bypass, backup restoration - all have fixes
8. **GCC Cost Optimization**: ₹15K-50K/month vs. ₹2L-10L/month vendor (at 50 tenants: about ₹1.2L/year saved vs. manual, ₹18L/year vs. the lowest vendor tier)

**Common Pitfalls to Avoid:**
- ❌ Skipping legal hold checks → Evidence destruction (federal crime)
- ❌ No verification testing → False confidence (data still exists)
- ❌ Forgetting backups → GDPR violation (backup restore brings data back)
- ❌ Ignoring third-party processors → Compliance failure (Mixpanel still has data)
- ❌ Partial deletion without retry → GDPR violation (some systems succeed, others fail)

**Next Steps:**
1. Consult Legal Team before setting retention policies
2. Test deletion across all systems in UAT environment
3. Establish Legal Counsel workflows for legal holds
4. Review Data Processing Agreements with third-party processors
5. Implement verification testing (48 hours post-deletion)
6. Set up audit trail with 7-10 year retention

**Regulatory Warnings:**
- ⚠️  GDPR Compliance Requires Professional Legal Review - This Is Not Legal Advice
- ⚠️  Consult Legal Team Before Implementing Retention Policies
- ⚠️  Consult DevOps Team Before Implementing Deletion Automation
- ⚠️  Data Deletion Must Be Tested Across All Systems

**Continue to:** L3 M13 - Cost Optimization & Monitoring (caching, batching, usage tracking)