# L3 M1.2: Data Governance Requirements for RAG

## Learning Arc

**Duration:** 40-45 minutes  
**Level:** L3 SkillElevate  
**Track:** GCC Compliance Basics

### Learning Objectives

1. Implement data classification schemes with PII/PHI/financial detection using Presidio
2. Design complete data lineage tracking across 7 RAG systems with immutable audit trails
3. Apply retention policies across all systems with automated deletion via Airflow
4. Configure data residency controls for multi-region GCCs (GDPR Article 44, DPDPA)
5. Build consent management workflows with revocation mechanisms (GDPR Articles 6 & 7)
6. Execute GDPR Article 17 erasure requests with legal exception handling

### Module Sections

1. **Introduction & Hook** - Real GCC case study and €200,000 GDPR fine
2. **Technology Stack** - 6-layer architecture overview
3. **Conceptual Foundations** - 6 core concepts explained
4. **Technical Implementation** - 6 components built and tested
5. **Integration Examples** - End-to-end governance workflows
6. **Common Failures** - 6 failure scenarios with solutions
7. **Decision Card** - When to use/not use this approach
8. **Practice Exercise** - Build your own governance layer

**Prerequisites:** Generic CCC M1-M4, GCC Compliance M1.1

## Section 1: Setup and OFFLINE Mode Guard

This section configures the environment and checks for external service availability.

In [None]:
import os
import sys
from dotenv import load_dotenv

# Add project root to path
sys.path.insert(0, os.path.abspath('..'))

# Load environment variables
load_dotenv()

# Check service availability
PRESIDIO_ENABLED = os.getenv("PRESIDIO_ENABLED", "false").lower() == "true"
OPENAI_ENABLED = os.getenv("OPENAI_ENABLED", "false").lower() == "true"
PINECONE_ENABLED = os.getenv("PINECONE_ENABLED", "false").lower() == "true"

print("="*70)
print("L3 M1.2: Data Governance Requirements for RAG - Notebook")
print("="*70)
print("\nService Status:")
print(f"  {'✅' if PRESIDIO_ENABLED else '⚠️'} Presidio (local PII detection): {'Enabled' if PRESIDIO_ENABLED else 'Disabled'}")
print(f"  {'✅' if OPENAI_ENABLED else '⚠️'} OpenAI (embeddings): {'Enabled' if OPENAI_ENABLED else 'Disabled'}")
print(f"  {'✅' if PINECONE_ENABLED else '⚠️'} Pinecone (vector DB): {'Enabled' if PINECONE_ENABLED else 'Disabled'}")

if not any([PRESIDIO_ENABLED, OPENAI_ENABLED, PINECONE_ENABLED]):
    print("\n⚠️  Running in OFFLINE mode (no external services configured)")
    print("   All examples will use local processing only.")
    print("   To enable services, set environment variables in .env file.")
else:
    print("\n✅ External services configured")

print("="*70)

# Import core components
from src.l3_m1_compliance_foundations_rag_systems import (
    DataClassifier,
    LineageTracker,
    RetentionEngine,
    ResidencyController,
    ConsentManager,
    GDPRErasureWorkflow,
    DataType,
    Region,
    SensitivityLevel
)

print("\n✅ All modules imported successfully")
print("\n# SAVED_SECTION:1")

## Section 2: Introduction & The €200,000 Fine

### Real GCC Case Study

**The Scenario:**  
A Fortune 500 parent company deployed a RAG system for HR policy queries at their Global Capability Center. When an employee submitted a GDPR Article 17 erasure request, the team discovered:

- Data scattered across **7 untracked systems**
- No documented lineage from source documents to embeddings
- No way to prove complete deletion within the 30-day deadline
- **Result:** €200,000 GDPR fine for non-compliance

### Key Insight

> "Data governance is not a feature you add. It's an architecture you build."

### The 7 Systems Problem

1. **Vector Database** - Embeddings in Pinecone
2. **Document Store** - Original files in S3/GCS
3. **Application Logs** - CloudWatch/ELK
4. **Backup Systems** - S3 Glacier
5. **Cache Layer** - Redis
6. **Generation History** - PostgreSQL
7. **Analytics Database** - BigQuery/Snowflake

**Without lineage tracking:** Impossible to know which embeddings came from which source documents.

In [None]:
# Load example GDPR erasure request
import json

with open('../example_data.json', 'r') as f:
    example_data = json.load(f)

gdpr_request = example_data['gdpr_requests'][0]

print("Example GDPR Article 17 Erasure Request:")
print("="*50)
print(f"Request ID: {gdpr_request['request_id']}")
print(f"User ID: {gdpr_request['user_id']}")
print(f"Email: {gdpr_request['email']}")
print(f"Request Type: {gdpr_request['request_type']}")
print(f"Reason: {gdpr_request['reason']}")
print(f"Submitted: {gdpr_request['submitted_date']}")
print(f"Deadline: {gdpr_request['deadline']} (30 days)")
print("\n# Expected: Complete deletion across all 7 systems within 30 days")
print("\n# SAVED_SECTION:2")

## Section 3: Component 1 - Data Classification with Presidio

### Data Classification Engine

**Purpose:** Automatically classify documents by sensitivity level and detect PII/PHI/financial data.

**Classification Levels:**
- **PUBLIC** - Externally shareable, no encryption required
- **INTERNAL** - Company-wide access, encryption in transit
- **CONFIDENTIAL** - Limited access, encryption at rest + in transit
- **RESTRICTED** - Strictly controlled, highest security (PHI, SSN, etc.)

**Data Types Detected:**
- **PII:** Email, SSN, Phone, IBAN
- **PHI:** Medical records, diagnoses, medications
- **Financial:** Invoice numbers, account numbers, transactions
- **Proprietary:** Trade secrets, confidential information
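
When `PRESIDIO_ENABLED` is false, a coarse fallback can still flag obvious identifiers. The following is a minimal sketch using only the standard library, not the `DataClassifier` implementation: the patterns, the names `PII_PATTERNS`, `detect_pii`, and `sensitivity_for`, and the rule that an SSN forces RESTRICTED are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration; a real detector such as Presidio is far more robust.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d{1,3}-\d{3}-\d{3}-\d{4}\b"),
}


def detect_pii(text):
    """Return a sorted list of (entity_type, matched_text) pairs found in text."""
    found = []
    for entity_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group()))
    return sorted(found)


def sensitivity_for(entities):
    """Assumed rule: an SSN means RESTRICTED, any other PII means CONFIDENTIAL."""
    types = {entity_type for entity_type, _ in entities}
    if "US_SSN" in types:
        return "RESTRICTED"
    if types:
        return "CONFIDENTIAL"
    return "INTERNAL"


entities = detect_pii("Patient John Doe (SSN: 123-45-6789) diagnosed with Type 2 diabetes.")
print(entities, sensitivity_for(entities))
```

Regexes miss names, addresses, and most free-text PHI, so a sketch like this is only a safety net for offline runs, not a replacement for an NER-based detector.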

In [None]:
# Initialize Data Classifier
classifier = DataClassifier(enable_presidio=PRESIDIO_ENABLED)

# Example 1: HR Record (CONFIDENTIAL)
hr_text = """Employee Jane Smith (EMP-12345) submitted resignation letter. 
Email: jane.smith@company.com, Phone: +1-555-123-4567."""

hr_classification = classifier.classify_document(hr_text, {"document_type": "hr_records"})

print("Example 1: HR Record Classification")
print("="*50)
print(f"Sensitivity: {hr_classification['sensitivity_level'].upper()}")
print(f"Data Types: {', '.join(hr_classification['data_types'])}")
print(f"Retention: {hr_classification['retention_period_days']} days (7 years per FLSA)")
print(f"Encryption Required: {hr_classification['requires_encryption']}")
print(f"Access Groups: {', '.join(hr_classification['access_groups'])}")
print(f"PII Detected: {len(hr_classification['pii_entities'])} entities")
print()

# Example 2: Financial Document (CONFIDENTIAL - 10 years)
financial_text = "Invoice #INV-2024-001: Payment of $50,000. Account: 9876543210."
financial_classification = classifier.classify_document(
    financial_text, 
    {"document_type": "financial_statement"}
)

print("Example 2: Financial Document Classification")
print("="*50)
print(f"Sensitivity: {financial_classification['sensitivity_level'].upper()}")
print(f"Retention: {financial_classification['retention_period_days']} days (10 years per SOX 802)")
print()

# Example 3: Medical Record (RESTRICTED)
medical_text = "Patient John Doe (SSN: 123-45-6789) diagnosed with Type 2 diabetes. Medication: Metformin 500mg."
medical_classification = classifier.classify_document(
    medical_text,
    {"document_type": "medical_record"}
)

print("Example 3: Medical Record Classification")
print("="*50)
print(f"Sensitivity: {medical_classification['sensitivity_level'].upper()}")
print(f"Data Types: {', '.join(medical_classification['data_types'])}")
print(f"Retention: {medical_classification['retention_period_days']} days (7 years per HIPAA)")

print("\n# Expected: RESTRICTED classification with PHI + PII data types")
print("\n# SAVED_SECTION:3")

## Section 4: Component 2 - Data Lineage Tracking (7 Stages)

### Complete RAG Data Flow

**Purpose:** Track every data transformation from source document to final answer, enabling complete deletion.

**7 Stages:**
1. **Document Upload** → S3 with source_id
2. **Chunking** → Split document, generate chunk_ids
3. **Embedding** → OpenAI embeddings with embedding_ids
4. **Vector Storage** → Pinecone with vector_db_ids
5. **Retrieval** → User query, track retrieved chunks
6. **Generation** → LLM answer with generation_id
7. **Caching** → Redis with cache_key and TTL

**Why this matters:** For GDPR Article 17 erasure, you must delete from ALL 7 systems. Without lineage, you can't know which embeddings to delete.
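
One way to picture lineage is as a parent-to-child index that can be walked from a source document down to every derived artifact. The sketch below is a hypothetical in-memory `LineageGraph`, not the module's `LineageTracker`; the class, its method names, and the ID formats are assumptions chosen to mirror the stages above.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Tiny in-memory parent -> child index (hypothetical helper for illustration)."""

    def __init__(self):
        self._children = defaultdict(list)

    def link(self, parent_id, child_id):
        """Record that child_id was derived from parent_id."""
        self._children[parent_id].append(child_id)

    def descendants(self, source_id):
        """Breadth-first walk returning every artifact derived from source_id."""
        seen = {source_id}
        queue = deque([source_id])
        order = []
        while queue:
            current = queue.popleft()
            for child in self._children.get(current, []):
                if child not in seen:
                    seen.add(child)
                    order.append(child)
                    queue.append(child)
        return order


graph = LineageGraph()
graph.link("doc_hr_001", "doc_hr_001_chunk_1")
graph.link("doc_hr_001_chunk_1", "emb_doc_hr_001_1")
graph.link("emb_doc_hr_001_1", "vec_doc_hr_001_1")

# Everything that must be deleted when the source document is erased.
to_delete = graph.descendants("doc_hr_001")
print(to_delete)
```

The walk is what turns "delete this document" into a concrete list of chunk, embedding, and vector IDs; a production tracker would persist these links rather than hold them in memory.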

In [None]:
# Initialize Lineage Tracker
lineage_tracker = LineageTracker()

# Simulate complete RAG workflow for one document
source_id = "doc_hr_001"

print("Tracking Complete RAG Workflow")
print("="*50)

# Stage 1: Document Upload
upload_record = lineage_tracker.track_document_upload(
    source_id,
    {"filename": "employee_policy.pdf", "size": 102400, "uploaded_by": "hr_admin"}
)
print(f"Stage 1 - Document Upload: {upload_record}")

# Stage 2: Chunking
chunk_ids = [f"{source_id}_chunk_{i}" for i in range(1, 4)]
chunk_records = lineage_tracker.track_chunking(source_id, chunk_ids)
print(f"Stage 2 - Chunking: {len(chunk_records)} chunks tracked")

# Stage 3: Embeddings
for i, chunk_id in enumerate(chunk_ids, 1):
    embedding_id = f"emb_{source_id}_{i}"
    lineage_tracker.track_embedding(chunk_id, embedding_id, "text-embedding-ada-002")
print(f"Stage 3 - Embeddings: {len(chunk_ids)} embeddings tracked")

# Stage 4: Vector Storage
for i in range(1, 4):
    lineage_tracker.track_vector_storage(f"emb_{source_id}_{i}", f"vec_{source_id}_{i}", "hr_policies")
print(f"Stage 4 - Vector Storage: {len(chunk_ids)} vectors stored in Pinecone")

# Stage 5: Retrieval
query_id = "query_001"
retrieved_chunks = [chunk_ids[0], chunk_ids[2]]  # Retrieved chunk 1 and 3
lineage_tracker.track_retrieval(query_id, retrieved_chunks, "user_123")
print(f"Stage 5 - Retrieval: {len(retrieved_chunks)} chunks retrieved for query")

# Stage 6: Generation
generation_id = "gen_001"
lineage_tracker.track_generation(query_id, generation_id, "gpt-4")
print(f"Stage 6 - Generation: Answer generated with ID {generation_id}")

# Stage 7: Caching
cache_key = f"cache_{query_id}"
lineage_tracker.track_caching(query_id, cache_key, ttl_seconds=86400)
print("Stage 7 - Caching: Response cached for 24 hours")

# Get full lineage
full_lineage = lineage_tracker.get_full_lineage(source_id)
print(f"\nComplete Lineage: {len(full_lineage)} records tracked across all stages")

print("\n# Expected: 7+ lineage records (1 upload + 3 chunks + 3 embeddings + ...)")
print("\n# SAVED_SECTION:4")

## Section 5: Component 3 - Retention Policy Engine

### Automated Data Deletion

**Purpose:** Enforce legal retention requirements by automatically deleting data past its retention period.

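**Retention Policies** (periods configured in this module; the regulations listed are the drivers, though their statutory minimums can be shorter, e.g. 3 years for FLSA payroll records):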
- **HR Records:** 7 years (2555 days) - FLSA, EEOC
- **Financial:** 10 years (3650 days) - SOX 802
- **Medical:** 7 years (2555 days) - HIPAA
- **Marketing:** 30 days - GDPR minimization
- **Audit Logs:** 7 years - SOX, GDPR
- **General PII:** 3 years (1095 days) - GDPR minimization

**Implementation:** Apache Airflow DAGs run daily, check document age, delete from all 7 systems if expired.
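
The age check a daily job performs can be sketched with plain date arithmetic. This is an illustrative approximation rather than the `RetentionEngine` code: the `RETENTION_DAYS` keys and the `retention_status` function are hypothetical names, and the periods are copied from the policy list above.

```python
from datetime import date, timedelta

# Periods from the policy list above; the dictionary keys are hypothetical names.
RETENTION_DAYS = {
    "hr_records": 2555,
    "financial": 3650,
    "medical": 2555,
    "marketing": 30,
    "audit_logs": 2555,
    "general_pii": 1095,
}


def retention_status(document_type, created_on, today):
    """Return 'retain' until the retention period has elapsed, then 'delete'."""
    period = RETENTION_DAYS[document_type]
    expires_on = created_on + timedelta(days=period)
    days_left = (expires_on - today).days
    return {
        "expires_on": expires_on,
        "days_until_deletion": max(days_left, 0),
        "action": "retain" if days_left > 0 else "delete",
    }


status = retention_status("marketing", date(2024, 1, 1), date(2024, 1, 15))
print(status["action"], status["days_until_deletion"])
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the check deterministic and easy to test.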

In [None]:
# Initialize Retention Engine
retention_engine = RetentionEngine(lineage_tracker)

# Check retention compliance for HR document
print("Retention Compliance Check")
print("="*50)

compliance = retention_engine.check_retention_compliance(
    source_id,
    hr_classification
)

print(f"Document ID: {source_id}")
print(f"Retention Period: {compliance['retention_days']} days (7 years)")
print(f"Document Age: {compliance['document_age_days']} days")
print(f"Compliant: {compliance['compliant']}")
print(f"Action: {compliance['action'].upper()}")
print(f"Days Until Deletion: {compliance.get('days_until_deletion', 'N/A')}")
print()

# Schedule retention job for HR records
job_config = retention_engine.schedule_retention_job(
    "hr_records",
    cron_schedule="0 2 * * *"  # Daily at 2 AM
)

print("Scheduled Airflow Retention Job:")
print("="*50)
print(f"Job ID: {job_config['job_id']}")
print(f"Policy: {job_config['retention_policy']}")
print(f"Retention Days: {job_config['retention_days']}")
print(f"Schedule: {job_config['cron_schedule']} (Daily at 2 AM)")
print(f"Airflow DAG: {job_config['airflow_dag']}")
print(f"Status: {job_config['status'].upper()}")

print("\n# Expected: Document is compliant (age < 7 years), scheduled job created")
print("\n# SAVED_SECTION:5")

## Section 6: Component 4 - Data Residency Controller

### Multi-Region Compliance

**Purpose:** Ensure data stays in legally compliant regions (GDPR Article 44, DPDPA).

**Residency Rules:**
- **EU (GDPR Article 44):** EU personal data MUST stay in EU unless:
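  - Adequacy decision exists (for the US, the EU-US Data Privacy Framework adopted in July 2023 covers only organizations certified under it; Schrems II invalidated the earlier Privacy Shield)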
  - Standard Contractual Clauses (SCCs) implemented
  - Binding Corporate Rules (BCRs) in place
- **India (DPDPA 2023):** Personal data may be transferred abroad except to countries the central government restricts by notification (Section 16)
- **US:** Generally unrestricted for most data types

**Regions:**
- **EU:** Frankfurt (eu-central-1)
- **India:** Mumbai (ap-south-1)
- **US:** N. Virginia (us-east-1)
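
The EU rule above reduces to a small decision function: same-region storage passes, and EU-origin data leaving the EU needs a recorded transfer mechanism. The sketch below is an assumption-laden illustration, not the `ResidencyController` logic; `check_transfer`, the mechanism labels, and the decision to leave other origins unrestricted are all hypothetical simplifications.

```python
# Hypothetical labels for the three transfer mechanisms listed above.
EU_TRANSFER_MECHANISMS = {"adequacy_decision", "scc", "bcr"}


def check_transfer(origin, destination, mechanisms=()):
    """Allow same-region storage; require a transfer mechanism for EU data leaving the EU."""
    if origin == destination:
        return {"allowed": True, "reason": "same region"}
    if origin == "EU":
        accepted = EU_TRANSFER_MECHANISMS & set(mechanisms)
        if accepted:
            return {
                "allowed": True,
                "reason": "transfer mechanism: " + ", ".join(sorted(accepted)),
            }
        return {"allowed": False, "reason": "no adequacy decision, SCCs, or BCRs on record"}
    # Simplification: no restriction is modeled for other origins in this sketch.
    return {"allowed": True, "reason": "no restriction modeled for this origin"}


print(check_transfer("EU", "US"))
print(check_transfer("EU", "US", mechanisms=["scc"]))
```

Keeping the mechanism as explicit input forces the caller to record why a transfer was permitted, which is the evidence an auditor will ask for.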

In [None]:
# Initialize Residency Controller
residency_controller = ResidencyController()

print("Data Residency Validation")
print("="*50)

# Example 1: EU to EU (COMPLIANT)
validation1 = residency_controller.validate_residency(Region.EU, Region.EU)
print("Example 1: EU user data stored in EU")
print(f"  Compliant: {validation1['compliant']} ✅")
print(f"  Regulation: {validation1['regulation']}")
print(f"  Action: {validation1['action'].upper()}")
print()

# Example 2: EU to US (NON-COMPLIANT)
validation2 = residency_controller.validate_residency(Region.EU, Region.US)
print("Example 2: EU user data stored in US")
print(f"  Compliant: {validation2['compliant']} ❌")
print(f"  Action: {validation2['action'].upper()}")
print(f"  Allowed Regions: {', '.join(validation2['allowed_regions'])}")
print()

# Example 3: Cross-border transfer restrictions
transfer_check = residency_controller.enforce_cross_border_restrictions(Region.EU, Region.US)
print("Cross-Border Transfer: EU → US")
print(f"  Allowed: {transfer_check['allowed']} ❌")
print(f"  Reason: {transfer_check['reason']}")
print(f"  Required Mechanism: {transfer_check['required_mechanism']}")
print()

# Example 4: Regional routing
country_codes = ["DE", "IN", "US"]
print("Regional Routing by Country Code:")
for country in country_codes:
    region = residency_controller.route_to_compliant_region(country, {})
    print(f"  {country} → {region.value}")

print("\n# Expected: EU-to-EU allowed, EU-to-US blocked (requires SCCs)")
print("\n# SAVED_SECTION:6")

## Section 7: Component 5 - Consent Management

### GDPR Consent Requirements (Articles 6 & 7)

**Purpose:** Track and enforce user consent for data processing.

**GDPR Consent Must Be:**
- **Freely given** - No coercion
- **Specific** - Clear purpose defined
- **Informed** - User knows how data is used
- **Unambiguous** - Affirmative action required (no pre-checked boxes)
- **Withdrawable** - User can revoke at any time (Article 7(3))

**Legal Bases for Processing (Article 6):**
1. **Consent** - User explicitly agrees
2. **Contract** - Necessary for contract performance
3. **Legal Obligation** - Required by law
4. **Vital Interests** - Protect life
5. **Public Task** - Official authority
6. **Legitimate Interest** - Balanced against user rights
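
Purpose limitation and withdrawal can be modeled by keying each consent on user, data type, and purpose. The following is a minimal sketch under that assumption; `ConsentLedger` and its methods are hypothetical and are not the module's `ConsentManager` API.

```python
from datetime import datetime, timezone


class ConsentLedger:
    """Hypothetical in-memory ledger keyed by (user, data type, purpose)."""

    def __init__(self):
        self._records = {}

    def grant(self, user_id, data_type, purpose):
        """Record a consent for one specific purpose."""
        self._records[(user_id, data_type, purpose)] = {
            "granted_at": datetime.now(timezone.utc),
            "revoked_at": None,
        }

    def revoke(self, user_id, data_type):
        """Revoke every purpose for this user and data type; return how many were revoked."""
        count = 0
        for (uid, dtype, _purpose), record in self._records.items():
            if uid == user_id and dtype == data_type and record["revoked_at"] is None:
                record["revoked_at"] = datetime.now(timezone.utc)
                count += 1
        return count

    def is_allowed(self, user_id, data_type, purpose):
        """True only for an exact purpose match that has not been revoked."""
        record = self._records.get((user_id, data_type, purpose))
        return record is not None and record["revoked_at"] is None


ledger = ConsentLedger()
ledger.grant("user_001", "PII", "RAG query processing")
print(ledger.is_allowed("user_001", "PII", "RAG query processing"))
print(ledger.is_allowed("user_001", "PII", "marketing campaigns"))
```

Revocation stamps a timestamp instead of deleting the record, so the ledger can still show when consent was given and withdrawn.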

In [None]:
# Initialize Consent Manager
consent_manager = ConsentManager()

print("Consent Management Workflow")
print("="*50)

# Grant consent for RAG processing
consent1 = consent_manager.grant_consent(
    user_id="user_001",
    data_type=DataType.PII,
    purpose="RAG query processing",
    legal_basis="consent"
)
print("Consent Granted:")
print(f"  User: {consent1['user_id']}")
print(f"  Data Type: {consent1['data_type']}")
print(f"  Purpose: {consent1['purpose']}")
print(f"  Legal Basis: {consent1['legal_basis']}")
print(f"  Status: {consent1['status']} ✅")
print()

# Check consent before processing
has_consent = consent_manager.check_consent(
    "user_001",
    DataType.PII,
    "RAG query processing"
)
print(f"Consent Check (RAG processing): {has_consent} ✅")

# Check consent for different purpose (should fail)
has_marketing_consent = consent_manager.check_consent(
    "user_001",
    DataType.PII,
    "marketing campaigns"
)
print(f"Consent Check (marketing): {has_marketing_consent} ❌ (purpose mismatch)")
print()

# Grant additional consents
consent_manager.grant_consent("user_001", DataType.FINANCIAL, "analytics", "legitimate_interest")
consent_manager.grant_consent("user_001", DataType.PII, "service_improvement", "consent")

# Get all user consents (GDPR Article 15 - Right to Access)
all_consents = consent_manager.get_user_consents("user_001")
print(f"All User Consents: {len(all_consents)} active consents")
print()

# Revoke consent (GDPR Article 7(3))
revocation = consent_manager.revoke_consent("user_001", DataType.PII)
print("Consent Revocation (GDPR Article 7(3)):")
print(f"  User: {revocation['user_id']}")
print(f"  Data Type: {revocation['data_type']}")
print(f"  Revoked Count: {revocation['revoked_count']}")
print(f"  Timestamp: {revocation['revocation_timestamp']}")

# Verify consent revoked
consent_after_revocation = consent_manager.check_consent("user_001", DataType.PII, "RAG query processing")
print(f"\nConsent after revocation: {consent_after_revocation} ❌ (successfully revoked)")

print("\n# Expected: Consent granted, checked, and successfully revoked")
print("\n# SAVED_SECTION:7")

## Section 8: Component 6 - GDPR Article 17 Erasure Workflow

### Right to be Forgotten

**Purpose:** Execute complete user data deletion across all 7 systems within 30 days.

**GDPR Article 17 Requirements:**
- User can request erasure of ALL personal data
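- Must respond **without undue delay and within one month** of receipt (Article 12(3)); this module treats the deadline as **30 days**
- Must delete from **all systems** (including backups)
- Should produce a **deletion certificate** as proof (an accountability practice rather than a literal Article 17 requirement)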
- Must verify deletion across all systems

**Legal Exceptions (Article 17(3)):**
- Compliance with legal obligation
- Public interest/official authority
- Public health interest
- Archiving/research/statistics (public interest)
- Legal claims defense
- Freedom of expression

**Workflow Steps:**
1. Validate request (check legal exceptions)
2. Revoke all consents
3. Identify data locations via lineage
4. Delete from all 7 systems
5. Verify complete deletion
6. Generate deletion certificate
7. Notify user
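
Steps 4 to 6 can be sketched as a loop that deletes from each system, re-queries it to verify, and then fingerprints the resulting report. This is an illustrative outline, not the `GDPRErasureWorkflow` implementation: `run_erasure`, `make_system`, and the dictionary-backed stores standing in for Redis and Pinecone are all hypothetical.

```python
import hashlib
import json


def run_erasure(user_id, systems):
    """Delete, then verify, in every system.

    `systems` maps a system name to a (delete_fn, count_fn) pair of callables.
    """
    results = []
    for name in sorted(systems):
        delete_fn, count_fn = systems[name]
        deleted = delete_fn(user_id)
        remaining = count_fn(user_id)  # verification: re-query after deleting
        results.append({"system": name, "deleted": deleted, "remaining": remaining})
    report = {
        "user_id": user_id,
        "systems": results,
        "complete": all(r["remaining"] == 0 for r in results),
    }
    # A digest of the canonical JSON makes later tampering with the report detectable.
    payload = json.dumps(report, sort_keys=True).encode("utf-8")
    report["digest"] = hashlib.sha256(payload).hexdigest()
    return report


def make_system(store):
    """Wrap a dict-backed fake store in (delete_fn, count_fn) callables."""
    return (
        lambda uid: len(store.pop(uid, [])),
        lambda uid: len(store.get(uid, [])),
    )


cache = {"user_gdpr_001": ["k1", "k2"]}
vectors = {"user_gdpr_001": ["v1"]}
report = run_erasure(
    "user_gdpr_001",
    {"redis": make_system(cache), "pinecone": make_system(vectors)},
)
print(report["complete"], [r["system"] for r in report["systems"]])
```

Separating the delete and count callables means verification does not trust the deleter's own return value, which is the point of step 5.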

In [None]:
# Initialize GDPR Erasure Workflow
erasure_workflow = GDPRErasureWorkflow(
    lineage_tracker,
    consent_manager,
    retention_engine
)

print("GDPR Article 17 Erasure Workflow")
print("="*70)

# Step 1: Validate erasure request
user_to_erase = "user_gdpr_001"
validation = erasure_workflow.validate_erasure_request(
    user_to_erase,
    "I want all my data deleted per GDPR Article 17"
)

print("Step 1: Validate Erasure Request")
print(f"  User ID: {validation['user_id']}")
print(f"  Validated: {validation['validated']} ✅")
print(f"  Approval Status: {validation['approval_status'].upper()}")
print(f"  Legal Exceptions: {len(validation['exceptions'])}")
print()

# Step 2: Execute erasure across all systems
erasure_report = erasure_workflow.execute_erasure(user_to_erase)

print("Step 2: Execute Erasure Across All Systems")
print(f"  User ID: {erasure_report['user_id']}")
print(f"  Timestamp: {erasure_report['erasure_timestamp']}")
print(f"  Systems Processed: {len(erasure_report['systems_processed'])}")
print(f"  Total Records Deleted: {erasure_report['total_records_deleted']}")
print(f"  Completion Status: {erasure_report['completion_status'].upper()} ✅")
print()

print("Systems Deleted From:")
for system in erasure_report['systems_processed']:
    print(f"  • {system['system']}: {system.get('status', 'processed')}")
print()

# Step 3: Verify complete erasure
verification = erasure_workflow.verify_erasure(user_to_erase)

print("Step 3: Verify Complete Erasure")
print(f"  User ID: {verification['user_id']}")
print(f"  Verification Status: {verification['verification_status'].upper()} ✅")
print(f"  Remaining Data Found: {verification['remaining_data_found']} ✅")
print(f"  Systems Checked: {len(verification['systems_checked'])}")
print()

# Step 4: Generate deletion certificate
certificate = erasure_workflow.generate_deletion_certificate(user_to_erase, erasure_report)

print("Step 4: Deletion Certificate Generated")
print("="*70)
print(certificate[:500] + "...")
print("="*70)

print("\n✅ GDPR Article 17 Erasure Complete")
print("   - All consents revoked")
print("   - Data deleted from 7 systems")
print("   - Deletion verified")
print("   - Certificate generated for compliance proof")

print("\n# SAVED_SECTION:8")

## Section 9: End-to-End Integration Example

### Complete Governance Workflow

This example demonstrates a complete data governance workflow from document ingestion through GDPR erasure:

1. **Ingest Document** → Classify sensitivity
2. **Track Lineage** → Record all transformations
3. **Validate Residency** → Ensure compliant storage region
4. **Check Consent** → Verify user permission
5. **Monitor Retention** → Schedule automated deletion
6. **Handle Erasure** → Execute GDPR Article 17 workflow

In [None]:
print("End-to-End Data Governance Workflow")
print("="*70)

# Scenario: EU employee submits HR document, later requests deletion
user_id = "employee_eu_001"
document_text = "Employee Anna Schmidt (anna.schmidt@company.de) submitted expense report for €5,000."

# Step 1: Classify document
classification = classifier.classify_document(document_text, {"document_type": "hr_records"})
print("Step 1 - Classification:")
print(f"  Sensitivity: {classification['sensitivity_level']}")
print(f"  Retention: {classification['retention_period_days']} days")
print()

# Step 2: Track lineage
doc_id = "doc_eu_hr_001"
lineage_tracker.track_document_upload(doc_id, {"user_id": user_id, "region": "EU"})
lineage_tracker.track_chunking(doc_id, [f"{doc_id}_chunk_1"])
print("Step 2 - Lineage Tracking: ✅ Document and chunks tracked")
print()

# Step 3: Validate data residency (EU user → must stay in EU)
residency_check = residency_controller.validate_residency(Region.EU, Region.EU)
print("Step 3 - Data Residency:")
print(f"  EU user → EU storage: {residency_check['compliant']} ✅")
print()

# Step 4: Grant consent
consent_manager.grant_consent(user_id, DataType.PII, "HR processing", "contract")
has_consent = consent_manager.check_consent(user_id, DataType.PII, "HR processing")
print(f"Step 4 - Consent: {has_consent} ✅")
print()

# Step 5: Check retention compliance
retention_check = retention_engine.check_retention_compliance(doc_id, classification)
print("Step 5 - Retention:")
print(f"  Action: {retention_check['action']} (within 7-year period)")
print()

# Step 6: User requests deletion (GDPR Article 17)
print("Step 6 - GDPR Article 17 Erasure Request Received")
erasure_report = erasure_workflow.execute_erasure(user_id)
print(f"  Status: {erasure_report['completion_status']} ✅")
print(f"  Systems: {len(erasure_report['systems_processed'])} processed")
print()

print("="*70)
print("✅ Complete governance workflow executed successfully")
print("   From document ingestion through GDPR erasure")
print("\n# SAVED_SECTION:9")

## Section 10: Common Failures & Solutions

### 6 Failure Scenarios from Real GCCs

| Failure | Root Cause | Solution |
|---------|------------|----------|
| **Incomplete deletion** | Data remains in cache/backups | Implement lineage tracking across ALL 7 systems |
| **Untracked lineage** | Cannot locate embedding sources | Track every transformation with unique IDs |
| **Purpose creep** | Data used beyond consent scope | Enforce consent checks at retrieval time |
| **Cross-region leakage** | EU data stored outside EU | Validate residency before storage |
| **Retention gaps** | Data kept past legal limit | Schedule Airflow DAGs for automated deletion |
| **Consent withdrawal delays** | Processing continues after revocation | Implement real-time consent checks |

In [None]:
# Demonstrate three of the failure scenarios using the governance components
print("Common Failure Scenarios - Demonstrations")
print("="*70)

# Failure 1: Purpose Creep Detection
print("Failure 1: Purpose Creep Detection")
consent_manager.grant_consent("user_002", DataType.PII, "analytics", "consent")
allowed = consent_manager.check_consent("user_002", DataType.PII, "marketing")
print("  User consented to: analytics")
print("  Attempted use for: marketing")
print(f"  Allowed: {allowed} ❌ (PURPOSE CREEP BLOCKED)")
print()

# Failure 2: Cross-Region Leakage Detection
print("Failure 2: Cross-Region Leakage Detection")
eu_to_us = residency_controller.validate_residency(Region.EU, Region.US)
print("  EU data → US storage")
print(f"  Compliant: {eu_to_us['compliant']} ❌ (BLOCKED)")
print(f"  Required: {eu_to_us.get('allowed_regions', ['EU regions only'])}")
print()

# Failure 3: Consent Withdrawal Not Enforced
print("Failure 3: Consent Withdrawal Enforcement")
consent_manager.grant_consent("user_003", DataType.PII, "service", "consent")
print(f"  Consent granted: {consent_manager.check_consent('user_003', DataType.PII, 'service')} ✅")
consent_manager.revoke_consent("user_003", DataType.PII)
print(f"  Consent after revocation: {consent_manager.check_consent('user_003', DataType.PII, 'service')} ❌")
print("  Processing BLOCKED immediately after revocation ✅")

print("\n# Expected: All failures detected and blocked by governance system")
print("\n# SAVED_SECTION:10")

## Section 11: Decision Card - When to Use This

### ✅ Use This Comprehensive Governance When:

- **Multi-region GCC** serving 50+ business units
- **2+ regulatory jurisdictions** (US/EU/India)
- **Processing HR, financial, or health data** (high sensitivity)
- **Anticipating GDPR/SOX/DPDPA audits** within 12 months
- **Handling classified data** requiring strict access controls
- **Enterprise deployment** with >1000 users

### ❌ Don't Use When:

- Single-region deployment with public data only
- Early MVP stage (<6 months to production)
- Limited regulatory exposure
- Small user base (<100 users)
- Research/academic projects without real user data

### 💰 Cost Estimates:

- **Small GCC:** ~₹25K/month + ₹50K setup
- **Medium GCC:** ~₹1.5L/month + ₹2L setup
- **Large GCC:** ~₹4L/month + ₹5L setup

In [None]:
# Decision matrix evaluation
def evaluate_governance_need(multi_region, jurisdictions, sensitive_data, user_count):
    """
    Evaluate if comprehensive governance is needed.
    
    Returns recommendation with reasoning.
    """
    score = 0
    reasons = []
    
    if multi_region:
        score += 2
        reasons.append("Multi-region deployment requires residency controls")
    
    if jurisdictions >= 2:
        score += 2
        reasons.append(f"{jurisdictions} jurisdictions require comprehensive compliance")
    
    if sensitive_data:
        score += 2
        reasons.append("Sensitive data (PII/PHI/Financial) requires strict controls")
    
    if user_count > 1000:
        score += 1
        reasons.append("Large user base benefits from automation")
    
    if score >= 5:
        return "STRONGLY RECOMMENDED", reasons
    elif score >= 3:
        return "RECOMMENDED", reasons
    else:
        return "NOT NEEDED (use lightweight approach)", reasons

# Example scenarios
scenarios = [
    {"name": "Large GCC", "multi_region": True, "jurisdictions": 3, "sensitive": True, "users": 5000},
    {"name": "Medium GCC", "multi_region": True, "jurisdictions": 2, "sensitive": True, "users": 500},
    {"name": "Small Startup", "multi_region": False, "jurisdictions": 1, "sensitive": False, "users": 50},
]

print("Decision Matrix Evaluation")
print("="*70)

for scenario in scenarios:
    recommendation, reasons = evaluate_governance_need(
        scenario["multi_region"],
        scenario["jurisdictions"],
        scenario["sensitive"],
        scenario["users"]
    )
    
    print(f"\nScenario: {scenario['name']}")
    print(f"  Users: {scenario['users']}, Regions: {'Multi' if scenario['multi_region'] else 'Single'}")
    print(f"  Recommendation: {recommendation}")
    for reason in reasons:
        print(f"    • {reason}")

print("\n# SAVED_SECTION:11")

## Conclusion & Next Steps

### What You've Learned

✅ **Data Classification** with Presidio for PII/PHI/financial detection  
✅ **Complete Lineage Tracking** across 7 RAG system stages  
✅ **Automated Retention Policies** with Airflow DAGs  
✅ **Multi-Region Data Residency** (GDPR Article 44, DPDPA)  
✅ **Consent Management** with GDPR Articles 6 & 7 compliance  
✅ **GDPR Article 17 Erasure** with cross-system deletion  

### Real-World Impact

This architecture prevents:
- €200,000+ GDPR fines for non-compliance
- Data breaches from untracked PII
- Retention violations from expired data
- Cross-border transfer violations
- Consent violations from purpose creep

### Next Steps

1. **Deploy to Production:** Start with Presidio classification
2. **Implement Lineage:** Add tracking to existing RAG pipeline
3. **Configure Airflow:** Schedule retention jobs
4. **Test GDPR Workflow:** Execute practice erasure requests
5. **Document for Audits:** Generate compliance certificates

### Resources

- **API Documentation:** http://localhost:8000/docs
- **Test Suite:** `pytest tests/`
- **Augmented Script:** [GitHub](https://github.com/yesvisare/gcc_comp_ai_ccc_l2)
- **GDPR Full Text:** https://gdpr.eu/
- **DPDPA 2023:** https://www.meity.gov.in/dpdpa-2023

---

**Built with TechVoyageHub L3 SkillElevate Standards**  
**Compliance-Ready • Production-Grade • GCC-Optimized**