# L3 M12.2: Document Storage & Access Control

## Learning Arc

**Purpose:** Application-layer validation alone is insufficient for multi-tenant security. This module teaches you to architect document storage so tenant isolation is enforced at the storage layer, not just the application layer.

**Concepts Covered:**
- Bucket-Per-Tenant isolation model (maximum security, ₹20L/year for 50 tenants)
- Shared Bucket + IAM Policies isolation (complex but scalable, ₹8L/year)
- Hybrid Model with TenantS3Client wrapper (RECOMMENDED - balances security and cost)
- Presigned URLs with tenant validation (5-minute expiration, tag verification)
- Data residency enforcement for multi-region compliance (GDPR, DPDPA, HIPAA)
- TenantS3Client wrapper implementation (prevents direct boto3 access)
- Comprehensive audit logging for storage access (CloudTrail patterns)
- Cost trade-offs between isolation models (bucket limits, IAM complexity)
- Security vulnerabilities from application-only validation (leaked credentials, direct Console access)
- Multi-region bucket strategy (EU, US, India separate buckets)
- KMS encryption with tenant-specific keys (₹75/key/month)
- Cross-tenant attack prevention (prefix validation, object tagging)

**After Completing This Notebook:**
- You will understand why application-layer validation is insufficient (bugs, credential leaks, Console access)
- You can implement three distinct S3 isolation models with production-ready code
- You will design tenant-aware presigned URLs that enforce boundaries
- You can build data residency enforcement for multi-region compliance
- You will create comprehensive audit logging for storage operations
- You can analyze cost trade-offs between bucket-per-tenant vs shared-bucket approaches
- You will prevent cross-tenant document access through wrapper-based enforcement
- You can implement TenantS3Client that automatically scopes all operations to tenant prefix
- You will generate presigned URLs with security checks (prefix validation, tag verification)
- You can enforce data residency requirements at upload time (block wrong-region uploads)

**Context in Track L3.M12:**
This module builds on L3 M12.1 (Multi-Tenant Architecture Foundations) and prepares you for L3 M12.3 (Query Isolation for vector and traditional databases).

In [None]:
import os
import sys

# Add src to path for imports
if '../src' not in sys.path:
    sys.path.insert(0, '../src')

# OFFLINE mode for L3 consistency
OFFLINE = os.getenv("OFFLINE", "false").lower() == "true"

# AWS S3 toggle: real S3 API calls run only when AWS_S3_ENABLED=true
AWS_S3_ENABLED = os.getenv("AWS_S3_ENABLED", "false").lower() == "true"

if OFFLINE or not AWS_S3_ENABLED:
    print("⚠️ Running in OFFLINE/AWS_S3_DISABLED mode")
    print("   → External S3 API calls will be skipped")
    print("   → Set AWS_S3_ENABLED=true in .env to enable")
else:
    print("✓ Online mode - AWS S3 services enabled")

## 1. Introduction: The Problem with Application-Layer Validation

**Central Question:** How do you architect document storage so tenant isolation is enforced at the storage layer, not just the application layer?

**Why Application-Layer Validation is Insufficient:**
Even with tenant ID checks in code, vulnerabilities exist through:
- Bugs in validation functions (logic errors, edge cases)
- Accidental removal of checks during refactoring (and missed in code review)
- Direct S3 access via AWS Console (bypasses application)
- Leaked AWS credentials used outside the application
- Third-party tools accessing S3 directly (backup tools, monitoring)

**Real Scenario:**
A healthcare GCC hosts 25 hospital tenants. During refactoring, a developer accidentally removes the tenant validation check, and one hospital can now read another's patient records through the API. Worse, if AWS credentials leak, an attacker can download ALL documents directly from S3 (via the API or CLI), bypassing the application entirely.

In [None]:
# VULNERABLE IMPLEMENTATION - DO NOT USE IN PRODUCTION

def generate_presigned_url_vulnerable(s3_client, doc_key):
    """
    WRONG - no tenant validation!
    Attacker could request: GET /documents/tenant-b/secret-contract.pdf
    and receive a valid presigned URL for another tenant's document.
    """
    url = s3_client.generate_presigned_url(
        'get_object',
        Params={'Bucket': 'docs', 'Key': doc_key}
    )
    return url

# Expected: Attacker requests tenant-b/secret.pdf
# Result: Gets valid presigned URL (SECURITY BREACH)
print("VULNERABLE: No tenant validation before URL generation")

## 2. Three S3 Isolation Models

### Model 1: Bucket-Per-Tenant
**Structure:** Each tenant receives a dedicated S3 bucket (tenant-a, tenant-b, tenant-c)

**Pros:**
- Maximum isolation (bucket-level security boundary)
- Simplest IAM policies (one policy per bucket)
- Clear cost attribution (per-bucket billing)
- Distinct audit trails (per-bucket CloudTrail)

**Cons:**
- AWS bucket quotas (historically 100 per account by default, 1,000 on request; AWS has since raised these, so check the current S3 service quotas)
- Management overhead for 50+ tenants
- Per-bucket provisioning complexity
- Per-bucket logging costs

**Cost:** ₹20L annually for 50 tenants  
**Recommendation:** Premium GCCs with <100 tenants

---

### Model 2: Shared Bucket + IAM Policies
**Structure:** One bucket with tenant-specific IAM roles; isolation via bucket policies and object tagging

**Bucket Policy Pattern:**
```json
{
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::rag-docs-shared/*",
  "Condition": {
    "StringNotEquals": {
      "s3:ExistingObjectTag/tenant_id": "${aws:PrincipalTag/tenant_id}"
    }
  }
}
```

**Pros:**
- Scales to 1000+ tenants
- Single management point
- Lower storage costs
- Easier backup/restore

**Cons:**
- Complex IAM policies (human error risk)
- Requires strict object tagging discipline
- Harder cost attribution
- One misconfigured policy enables data breach

**Cost:** ₹8L annually  
**Recommendation:** Large GCCs (100+ tenants), cost-conscious organizations with strong IAM expertise

---

### Model 3: Hybrid (Shared Bucket + Tenant Prefixes + Application Wrapper) **RECOMMENDED**
**Structure:** Shared bucket with mandatory prefix isolation + TenantS3Client wrapper that prevents direct boto3 access

**Security Layers:**
- Prefix validation (tenant-{id}/)
- Metadata checks (tenant_id tag)
- Audit logging (all operations)
- Short expiration times (5 minutes)

**Pros:**
- Scales to 1000+ tenants
- Simpler IAM than Model 2
- Cost-effective (₹8L annually)
- Wrapper prevents S3 bypass

**Cons:**
- Requires disciplined code (never use boto3 directly)
- Wrapper must be bulletproof
- Still needs IAM policies as backup

**Cost:** ₹8L annually  
**Recommendation:** MOST GCCs (50-500 tenants) - balances security, scale, cost

In [None]:
import pandas as pd

# Decision Matrix for S3 Isolation Models
decision_matrix = pd.DataFrame({
    'Criteria': ['Tenant Scale', 'Annual Cost', 'Security Level', 'Implementation Complexity', 'Recommended For'],
    'Bucket-Per-Tenant': ['<100', '₹20L', 'Highest', 'Low', 'Premium GCCs'],
    'Shared+IAM': ['100-1000+', '₹8L', 'Medium', 'High', 'Cost-sensitive'],
    'Hybrid': ['50-500', '₹8L', 'High', 'Medium', 'Most GCCs']
})

print(decision_matrix.to_string(index=False))

# Expected: Table showing comparison of three models
# Hybrid model recommended for most use cases

## 3. TenantS3Client Wrapper Implementation

The `TenantS3Client` class wraps boto3 S3 client and enforces tenant isolation through:
- Automatic prefix enforcement (tenant-{id}/)
- Data residency validation (region matches tenant requirement)
- Audit logging (all operations tracked)
- Private boto3 client (not exposed directly)

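**Security Guarantee:** All operations issued through the wrapper are automatically scoped to the tenant prefix and cannot reach other tenants' documents. Code that bypasses the wrapper is not covered, which is why IAM policies remain in place as a backup.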
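
The source of `TenantS3Client` is not shown in this notebook, so the following is only a minimal sketch of the prefix-scoping idea. `PrefixScopedStore` is a hypothetical name, and a plain dict stands in for the S3 bucket; the real class may differ in every detail.

```python
class PrefixScopedStore:
    """Hypothetical stand-in for a tenant-scoped wrapper (not the module's TenantS3Client)."""

    def __init__(self, tenant_id: str, backend: dict):
        if not tenant_id or "/" in tenant_id:
            raise ValueError("tenant_id must be non-empty and contain no '/'")
        self.prefix = f"tenant-{tenant_id}/"
        self._backend = backend  # private: callers never touch the raw store

    def _scoped(self, key: str) -> str:
        # Callers pass tenant-relative keys; the prefix is always added here.
        if key.startswith("/") or ".." in key.split("/"):
            raise ValueError(f"invalid key: {key!r}")
        return self.prefix + key

    def upload(self, key: str, data: bytes) -> str:
        full_key = self._scoped(key)
        self._backend[full_key] = data
        return full_key

    def download(self, key: str) -> bytes:
        return self._backend[self._scoped(key)]

    def list_documents(self) -> list:
        # Only keys under this tenant's prefix are visible.
        n = len(self.prefix)
        return sorted(k[n:] for k in self._backend if k.startswith(self.prefix))


store = {}  # shared "bucket"
a = PrefixScopedStore("a", store)
b = PrefixScopedStore("b", store)
a.upload("report.pdf", b"A data")
b.upload("report.pdf", b"B data")

print(a.list_documents())        # ['report.pdf']
print(a.download("report.pdf"))  # b'A data'
```

Both tenants use the same relative key, yet each sees only its own object because the prefix is applied inside the wrapper rather than supplied by the caller.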

In [None]:
from l3_m12_document_storage import TenantS3Client

# Initialize tenant-scoped client
client = TenantS3Client("tenant-healthcare-us-001")

# Upload document - automatically prefixed with tenant-tenant-healthcare-us-001/
sample_data = b"Patient medical record - HIPAA protected"
s3_url = client.upload("patient-records/john-doe-2024.pdf", sample_data)

print(f"Uploaded to: {s3_url}")
# Expected: s3://rag-docs-us/tenant-tenant-healthcare-us-001/patient-records/john-doe-2024.pdf (offline)

print(f"\nClient prefix: {client.prefix}")
print(f"Client region: {client.region}")
print(f"AWS enabled: {client.aws_enabled}")

In [None]:
# List documents - only shows tenant's documents
documents = client.list_documents()

print("Documents for tenant-healthcare-us-001:")
for doc in documents:
    print(f"  - {doc}")

# Expected: example-doc1.pdf, example-doc2.pdf (offline mode)
# In production: actual documents under tenant prefix

## 4. Presigned URLs with Security Validation

Presigned URLs enable temporary, direct S3 downloads without exposing AWS credentials.

**Security Requirements Before Generating URL:**
1. Extract tenant from authenticated user context
2. Validate key starts with expected tenant prefix (tenant-{id}/)
3. Verify document exists and belongs to tenant via metadata check
4. Check object tags match tenant_id
5. Only then generate presigned URL with validation gates
6. Implement audit logging for presigned URL generation
7. Set short expiration (5 minutes maximum)

**CRITICAL:** ALWAYS validate tenant ownership BEFORE generating presigned URL. Never trust client-provided document keys without verification.
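
The validation gates above (steps 2 to 5 and 7) can be sketched as follows. This is an illustration, not the module's `PresignedURLService`: `presign_for_tenant`, `CrossTenantAccess`, and `FakeS3` are hypothetical names. The two client methods it calls, `get_object_tagging` and `generate_presigned_url`, do exist on the boto3 S3 client, and the fake mimics only the parts of their shape used here. Extracting the tenant from the authenticated user and audit logging are left out.

```python
MAX_EXPIRY_SECONDS = 300  # 5 minutes, the ceiling stated above


class CrossTenantAccess(Exception):
    """Hypothetical error type for this sketch."""


def presign_for_tenant(s3, bucket, tenant_id, doc_key, expires_in=MAX_EXPIRY_SECONDS):
    prefix = f"tenant-{tenant_id}/"
    # Gate 1: the key must sit under the caller's own prefix.
    if not doc_key.startswith(prefix):
        raise CrossTenantAccess("key is outside the tenant prefix")
    # Gate 2: the object's tenant_id tag must match (real S3 raises if the object is missing).
    tag_set = s3.get_object_tagging(Bucket=bucket, Key=doc_key)["TagSet"]
    tags = {t["Key"]: t["Value"] for t in tag_set}
    if tags.get("tenant_id") != tenant_id:
        raise CrossTenantAccess("tenant_id tag does not match")
    # Gate 3: never issue a URL that outlives the maximum expiry.
    expires_in = min(expires_in, MAX_EXPIRY_SECONDS)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": doc_key},
        ExpiresIn=expires_in,
    )


class FakeS3:
    """In-memory stand-in so the sketch runs without AWS."""

    def __init__(self, tags_by_key):
        self.tags_by_key = tags_by_key

    def get_object_tagging(self, Bucket, Key):
        tags = self.tags_by_key[Key]
        return {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}

    def generate_presigned_url(self, operation, Params, ExpiresIn):
        return f"https://example.invalid/{Params['Key']}?expires={ExpiresIn}"


fake = FakeS3({"tenant-a/x.pdf": {"tenant_id": "a"}})
print(presign_for_tenant(fake, "docs", "a", "tenant-a/x.pdf", expires_in=3600))
# https://example.invalid/tenant-a/x.pdf?expires=300
```

Passing the client in as a parameter keeps the gates testable offline; clamping the expiry inside the function means a caller cannot request an hour-long URL by accident.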

In [None]:
from l3_m12_document_storage import PresignedURLService

# Initialize presigned URL service
url_service = PresignedURLService(redis_client=None)

# Generate presigned URL for tenant's own document
tenant_id = "tenant-healthcare-us-001"
doc_key = f"tenant-{tenant_id}/patient-records/john-doe-2024.pdf"

url = url_service.generate_url(tenant_id, doc_key, expiration=300)

print(f"Presigned URL: {url}")
print(f"Expires in: 300 seconds (5 minutes)")

# Expected: https://offline-mode.s3.amazonaws.com/tenant-tenant-healthcare-us-001/patient-records/john-doe-2024.pdf?expires=300
# In production: actual AWS presigned URL with X-Amz-* parameters

In [None]:
from l3_m12_document_storage import Forbidden

# Attempt cross-tenant access (SHOULD FAIL)
attacker_tenant = "tenant-healthcare-us-001"
target_document = "tenant-tenant-logistics-eu-002/shipping-manifests/order-789.pdf"

try:
    url = url_service.generate_url(attacker_tenant, target_document)
    print("SECURITY BREACH: Cross-tenant access allowed!")
except Forbidden as e:
    print(f"✓ Cross-tenant access blocked: {e}")

# Expected: ✓ Cross-tenant access blocked: Cross-tenant access denied

## 5. Data Residency & Multi-Region Compliance

**Regulatory Context:**
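- **GDPR (EU):** Transfers of EU residents' personal data outside the EEA require an adequacy decision or safeguards such as Standard Contractual Clauses; many tenants simply require EU-only storage
- **DPDPA (India):** Permits cross-border transfers except to countries the government restricts; sectoral rules (for example, RBI rules for payment data) can require storage in India
- **HIPAA (US):** Does not itself mandate US-only storage, but covered entities commonly require US regions contractually
- **Schrems II Ruling:** Invalidated the EU-US Privacy Shield; EU-to-US transfers of personal data now need additional safeguards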

**GCC Scenario:**
A Bangalore-based GCC serving clients in London (GDPR), New York (US rules such as HIPAA), and Mumbai (DPDPA) cannot store all documents in a single us-east-1 bucket.

**Multi-Region Architecture:**
```
eu-west-1: rag-docs-eu (EU tenants)
us-east-1: rag-docs-us (US tenants)
ap-south-1: rag-docs-india (India tenants)
```
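
The region-to-bucket layout above amounts to a lookup plus a refusal to route anywhere else. A minimal sketch, assuming the bucket names shown above; `resolve_bucket` and `ResidencyError` are hypothetical names, not part of the module:

```python
REGION_BUCKETS = {
    "eu-west-1": "rag-docs-eu",
    "us-east-1": "rag-docs-us",
    "ap-south-1": "rag-docs-india",
}


class ResidencyError(Exception):
    """Hypothetical error type for this sketch."""


def resolve_bucket(required_region, requested_region=None):
    """Return the bucket for a tenant's required region, refusing any other region."""
    if required_region not in REGION_BUCKETS:
        raise ResidencyError(f"no bucket configured for {required_region}")
    if requested_region is not None and requested_region != required_region:
        raise ResidencyError(
            f"upload attempted in {requested_region}, tenant requires {required_region}"
        )
    return REGION_BUCKETS[required_region]


print(resolve_bucket("eu-west-1"))  # rag-docs-eu
```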

**Cost Implications:**
- Data transfer between regions: ₹6/GB
- At ₹6/GB, moving the full 10TB across regions in a month costs about ₹60K; if only 10% of it (1TB) crosses regions, the cost is about ₹6K/month
- Replication for DR: ₹120K/month additional

**Real Scenario Risk:**
A healthcare GCC has 25 hospital tenants (15 US, 10 EU). If EU data crosses to us-east-1 without enforcement, the GCC risks a GDPR fine of up to €20M or 4% of global annual turnover, whichever is higher.

In [None]:
from l3_m12_document_storage import DataResidencyValidator, TenantMetadata, DataResidencyViolation

# Initialize validator
validator = DataResidencyValidator()

# Create EU tenant metadata
eu_tenant = TenantMetadata(
    tenant_id="tenant-logistics-eu-002",
    data_residency="EU",
    data_residency_region="eu-west-1",
    encryption_key_id="key-eu-logistics"
)

# Validate correct region upload
required_region = validator.validate_upload(
    "tenant-logistics-eu-002",
    region_override=None,
    tenant_metadata=eu_tenant
)

print(f"✓ Required region for EU tenant: {required_region}")
# Expected: eu-west-1

In [None]:
# Attempt upload to wrong region (SHOULD FAIL)
try:
    validator.validate_upload(
        "tenant-logistics-eu-002",
        region_override="us-east-1",  # Wrong region for EU tenant
        tenant_metadata=eu_tenant
    )
    print("COMPLIANCE VIOLATION: Cross-region upload allowed!")
except DataResidencyViolation as e:
    print(f"✓ Data residency violation blocked: {e}")

# Expected: ✓ Data residency violation blocked: Upload attempted in us-east-1, but tenant tenant-logistics-eu-002 requires eu-west-1

In [None]:
# Show region mapping for all data residencies
print("Data Residency Region Mapping:")
for residency, region in validator.REGION_MAPPING.items():
    print(f"  {residency:10} → {region}")

# Expected:
#   EU         → eu-west-1
#   US         → us-east-1
#   India      → ap-south-1

## 6. Comprehensive Audit Logging

Audit logging is critical for:
- Compliance reporting (GDPR, HIPAA, SOC 2)
- Anomaly detection (>100 API calls/hour per tenant)
- Security incident investigation (who accessed what, when)
- Cost attribution (per-tenant storage usage)
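
The anomaly threshold mentioned above (more than 100 API calls per hour per tenant) can be checked with a sliding window over timestamps. This is a sketch under assumptions: `HourlyRateMonitor` is a hypothetical in-memory class, and a production system would more likely derive the counts from the audit store.

```python
from collections import defaultdict, deque


class HourlyRateMonitor:
    """Flags a tenant once it exceeds `threshold` calls within the trailing window."""

    def __init__(self, threshold=100, window_seconds=3600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # tenant_id -> timestamps in the window

    def record(self, tenant_id, timestamp):
        """Record one call; return True if the tenant is now over the threshold."""
        events = self._events[tenant_id]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the trailing window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) > self.threshold


monitor = HourlyRateMonitor()
flags = [monitor.record("tenant-healthcare-us-001", float(i)) for i in range(101)]
print(flags[99], flags[100])  # False True
```

The 100th call is still within the limit; the 101st inside the same hour trips the alert.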

**Logged Operations:**
- Upload (tenant_id, doc_key, timestamp, status)
- Download (tenant_id, doc_key, user_id, timestamp)
- Delete (tenant_id, doc_key, timestamp)
- List (tenant_id, prefix, timestamp)
- Presigned URL generation (tenant_id, doc_key, expiration)
- Cross-tenant access attempts (attacker_tenant, target_document)
- Data residency violations (tenant_id, attempted_region, required_region)
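
One way to keep these records uniform is a single builder that every operation goes through. The sketch below is not the module's `StorageAuditLogger`; the function names, logger name, and field layout are assumptions chosen to mirror the list above.

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("storage.audit")  # hypothetical logger name


def build_audit_record(operation, tenant_id, doc_key, status, user_id=None, **details):
    """Build one audit record; operation-specific fields go in **details."""
    record = dict(details)
    # Core fields are written last so details cannot overwrite them.
    record.update({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "operation": operation,
        "tenant_id": tenant_id,
        "doc_key": doc_key,
        "user_id": user_id,
        "status": status,
    })
    return record


def emit_audit(record):
    """Serialize the record as one JSON line with an AUDIT: prefix and log it."""
    line = "AUDIT: " + json.dumps(record, sort_keys=True)
    audit_log.info(line)
    return line


rec = build_audit_record(
    "residency_violation",
    "tenant-logistics-eu-002",
    "shipping-manifests/order-789.pdf",
    "denied",
    attempted_region="us-east-1",
    required_region="eu-west-1",
)
line = emit_audit(rec)
```

One JSON object per line keeps the log easy to query later, whether it ends up in a database table or a log aggregator.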

In [None]:
from l3_m12_document_storage import StorageAuditLogger

# Initialize audit logger (without database for demo)
audit_logger = StorageAuditLogger(db_connection=None)

# Log various operations
audit_logger.log_access(
    operation='upload',
    tenant_id='tenant-healthcare-us-001',
    doc_key='patient-records/john-doe-2024.pdf',
    user_id='user-doctor-123',
    status='success'
)

audit_logger.log_access(
    operation='presigned_url',
    tenant_id='tenant-logistics-eu-002',
    doc_key='shipping-manifests/order-789.pdf',
    status='success'
)

audit_logger.log_access(
    operation='download',
    tenant_id='tenant-fintech-india-003',
    doc_key='kyc-documents/customer-xyz.pdf',
    user_id='user-compliance-456',
    status='success'
)

print("✓ Audit logs recorded (check application logs)")
# Expected: Log entries in console output with AUDIT: prefix

In [None]:
# Retrieve access logs for tenant
logs = audit_logger.get_tenant_access_log('tenant-healthcare-us-001', days=30)

print("Access logs for tenant-healthcare-us-001 (last 30 days):")
for log in logs:
    print(f"  {log['operation']:15} - {log['count']:3} times by {log['user_id']}")

# Expected: Mock data showing upload and download operations
# In production: actual database query results

## 7. Common Implementation Pitfalls & Fixes

### Failure 1: Weak Prefix Validation
**Risk:** Users manipulate keys to access other tenants  
**Fix:** Validate prefix immutably in wrapper, never allow bypass
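
A common way this check goes wrong is omitting the trailing slash, so that one tenant ID matches as a string prefix of another. A small sketch (both function names are hypothetical):

```python
def naive_check(tenant_id, doc_key):
    # BUG: "tenant-1" is also a string prefix of "tenant-10/..."
    return doc_key.startswith(f"tenant-{tenant_id}")


def strict_check(tenant_id, doc_key):
    # The trailing slash pins the match to the whole tenant segment.
    return doc_key.startswith(f"tenant-{tenant_id}/")


print(naive_check("1", "tenant-10/secret.pdf"))   # True  (cross-tenant leak)
print(strict_check("1", "tenant-10/secret.pdf"))  # False
print(strict_check("1", "tenant-1/report.pdf"))   # True
```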

### Failure 2: Missing Object Tagging
**Risk:** Objects uploaded without tenant_id tags become unattributable  
**Fix:** Mandatory tagging on every upload, tag validation before presigned URL generation
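
One way to make tagging impossible to forget is to build the upload arguments in a single helper. The sketch below assembles keyword arguments for boto3's `put_object`, whose `Tagging` parameter takes tags encoded as a URL query string; `build_put_kwargs` itself is a hypothetical helper, not part of the module.

```python
from urllib.parse import urlencode


def build_put_kwargs(bucket, tenant_id, relative_key, body, extra_tags=None):
    """Return put_object kwargs that always carry the tenant_id tag."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every upload")
    tags = dict(extra_tags or {})
    tags["tenant_id"] = tenant_id  # set last so callers cannot override it
    return {
        "Bucket": bucket,
        "Key": f"tenant-{tenant_id}/{relative_key}",
        "Body": body,
        "Tagging": urlencode(tags),
    }


kwargs = build_put_kwargs(
    "rag-docs-us", "a", "report.pdf", b"data",
    extra_tags={"team": "radiology", "tenant_id": "spoofed"},
)
print(kwargs["Tagging"])  # team=radiology&tenant_id=a
# With a real client: s3.put_object(**kwargs)
```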

### Failure 3: Presigned URL Over-Sharing
**Risk:** URLs shared externally, cached in browsers, forwarded to others  
**Fix:** Short expiration (5 minutes), audit every generation, track URL usage

### Failure 4: Cross-Tenant IAM Misconfiguration
**Risk:** Over-permissive bucket policies allow cross-tenant reads  
**Fix:** Use explicit deny policies, regular IAM audit, principle of least privilege

### Failure 5: Data Residency Bypass
**Risk:** Engineers upload to wrong region to meet deadline  
**Fix:** Client-level region enforcement, CloudTrail alerts, monthly compliance audits

### Failure 6: Unencrypted Metadata
**Risk:** Object metadata (tenant_id tags) visible in CloudTrail  
**Fix:** Keep tag values to opaque, non-sensitive identifiers, restrict read access to CloudTrail logs, and encrypt any sensitive metadata in the application

### Failure 7: Presigned URLs Valid for Hours
**Risk:** Long-lived URLs forwarded externally and accessed after intended expiration  
**Fix:** Set `ExpiresIn=300` (5 minutes) in the `generate_presigned_url()` call

### Failure 8: Database Query Timeout (N+1 Queries)
**Risk:** Fetching tenant metadata for each document during bulk upload  
**Fix:** Cache tenant metadata in Redis with 1-hour TTL
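
The caching fix can be illustrated without Redis. The sketch below uses an in-process dictionary with a one-hour TTL as a stand-in; `TTLCache` and `load_tenant_metadata` are hypothetical, and the loader simulates the database query that would otherwise run once per document.

```python
import time


class TTLCache:
    """Tiny in-process cache; a stand-in for Redis in this sketch."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh hit: no database query
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def load_tenant_metadata(tenant_id):
    calls.append(tenant_id)  # simulates one database round trip
    return {"tenant_id": tenant_id, "data_residency_region": "eu-west-1"}


cache = TTLCache(ttl_seconds=3600)
for _ in range(1000):  # bulk upload of 1,000 documents
    cache.get_or_load("tenant-logistics-eu-002", load_tenant_metadata)

print(len(calls))  # 1
```

A thousand lookups cost a single query; the trade-off is that a change to tenant metadata can take up to one TTL to become visible.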

In [None]:
import pandas as pd

# Common failure scenarios table
failures = pd.DataFrame([
    {
        'Failure': 'Tenant A downloads Tenant B file',
        'Cause': 'Missing prefix validation',
        'Fix': 'Add doc_key.startswith(f"tenant-{tenant_id}/") check'
    },
    {
        'Failure': 'Wrong S3 region (GDPR violation)',
        'Cause': 'No region enforcement at upload',
        'Fix': 'Implement DataResidencyValidator in TenantS3Client'
    },
    {
        'Failure': 'Presigned URLs valid for hours',
        'Cause': 'Expiration set to 3600s',
        'Fix': 'Change ExpiresIn=300 (5 minutes)'
    },
    {
        'Failure': 'Objects without tenant tags',
        'Cause': 'Tagging not mandatory',
        'Fix': 'Enforce tagging in upload, fail without tags'
    },
    {
        'Failure': 'S3 costs 3x expected',
        'Cause': 'Cross-region transfers',
        'Fix': 'Block cross-region uploads at app layer'
    },
    {
        'Failure': 'CloudTrail shows "system" user',
        'Cause': 'IAM role lacks tagging',
        'Fix': 'Add tenant_id tags to IAM roles'
    },
    {
        'Failure': 'Presigned URL forwarded externally',
        'Cause': 'No expiration enforcement',
        'Fix': 'Return URL + explicit TTL message to frontend'
    },
    {
        'Failure': 'Database query timeout',
        'Cause': 'N+1 queries for tenant metadata',
        'Fix': 'Cache tenant metadata in Redis (1-hour TTL)'
    }
])

print(failures.to_string(index=False))
# Expected: Table showing all 8 failure scenarios with causes and fixes

## 8. GCC Enterprise Context: 50-Tenant Architecture

**Typical GCC Scenario:**
- 25 US healthcare providers (HIPAA: must stay in us-east-1)
- 15 EU logistics companies (GDPR: must stay in eu-west-1)
- 10 India financial services (RBI: must stay in ap-south-1)

**Implementation Decisions:**

1. **Use Hybrid Model** (shared bucket + tenant prefixes + wrapper)
   - Scales to 1000+ tenants without bucket limits
   - Cost ₹8L/year vs ₹20L for bucket-per-tenant
   - Still achieves high isolation through wrapper enforcement

2. **Multi-Region Bucket Strategy**
   - Create 3 separate S3 buckets in 3 regions
   - TenantS3Client routes based on tenant.data_residency_region
   - Each region has independent lifecycle, encryption, logging

3. **Encryption Strategy**
   - Create 50 individual KMS keys (one per tenant)
   - Cost: 50 × ₹75 = ₹3,750/month
   - Enables key rotation per tenant independently
   - Audit trail per key

4. **Cost Attribution**
   - Tag all objects with tenant_id and team (for chargeback)
   - Use AWS Cost Explorer to drill down by tag
   - Report monthly storage costs per tenant
   - Implement S3 lifecycle: Standard → Intelligent-Tiering → Glacier after 90 days

5. **Compliance Reporting**
   - CloudTrail logs all access (immutable audit)
   - Monthly audit reports: who accessed what, when
   - Automated anomaly detection: >100 API calls/hour per tenant flags alert
   - GDPR Data Subject Access Requests: query CloudTrail + object list for citizen data

6. **Disaster Recovery**
   - Enable cross-region replication for each region
   - us-east-1 → us-west-2 (₹0.02/1K requests)
   - eu-west-1 → eu-central-1
   - ap-south-1 → ap-southeast-1
   - RTO: <1 hour, RPO: <15 minutes
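
The KMS figure in decision 3 can be checked with a few lines of arithmetic. The sketch covers only the per-key monthly charge quoted above; per-request charges are left out.

```python
TENANTS = 50
KEY_COST_INR_PER_MONTH = 75  # per-key price quoted in this module

monthly_inr = TENANTS * KEY_COST_INR_PER_MONTH  # 50 x 75
annual_inr = monthly_inr * 12
annual_lakh = annual_inr / 100_000              # 1 lakh = 100,000

print(monthly_inr, annual_inr, annual_lakh)  # 3750 45000 0.45
```

The annual total, ₹0.45L, is the same KMS figure used in the cost comparison for a 50-tenant deployment.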

In [None]:
# Cost comparison for 50-tenant GCC
import pandas as pd

cost_analysis = pd.DataFrame([
    {
        'Component': 'S3 Storage (10TB)',
        'Bucket-Per-Tenant': '₹17.5L/year',
        'Hybrid': '₹17.5L/year',
        'Savings': '₹0'
    },
    {
        'Component': 'Bucket Provisioning',
        'Bucket-Per-Tenant': '₹2L/year',
        'Hybrid': '₹0.3L/year',
        'Savings': '₹1.7L'
    },
    {
        'Component': 'CloudTrail Logging',
        'Bucket-Per-Tenant': '₹0.5L/year',
        'Hybrid': '₹0.2L/year',
        'Savings': '₹0.3L'
    },
    {
        'Component': 'KMS Encryption (50 keys)',
        'Bucket-Per-Tenant': '₹0.45L/year',
        'Hybrid': '₹0.45L/year',
        'Savings': '₹0'
    },
    {
        'Component': 'TOTAL',
        'Bucket-Per-Tenant': '₹20.45L/year',
        'Hybrid': '₹18.45L/year',
        'Savings': '₹2L (10%)'
    }
])

print("Cost Analysis: 50-Tenant GCC Architecture")
print(cost_analysis.to_string(index=False))

# Expected: Table showing cost breakdown and 10% savings with Hybrid model

## 9. Decision Card: When to Use This Solution

### Use This Solution When:

✅ **50+ tenants** with varied data residency requirements (GDPR, DPDPA, HIPAA)  
✅ **Documents >100MB** requiring efficient download (presigned URLs avoid proxy overhead)  
✅ **Multi-region compliance mandatory** (EU, US, India with separate buckets)  
✅ **Cost optimization critical** (shared infrastructure reduces costs by 60%)  
✅ **Audit trail required** for regulatory compliance (CloudTrail + database logging)  
✅ **Team has AWS expertise** (IAM, KMS, S3 bucket policies)  
✅ **Asynchronous document access** (uploads/downloads don't block application flow)  
✅ **Large document corpus** (>10GB total storage across all tenants)  

### DO NOT Use When:

❌ **<10GB total storage** across all tenants (database BYTEA simpler)  
❌ **Simple row-level database security sufficient** (no regulatory compliance needs)  
❌ **Team lacks AWS expertise** and can't hire (IAM misconfiguration risks)  
❌ **Budget for managed solutions** like Dropbox Business (₹10-50/GB/month)  
❌ **Synchronous document access required** (<100ms latency needed)  
❌ **Write-heavy patterns** (>10K uploads/day incur high S3 PUT request costs)  
❌ **Single-region deployment** (multi-region overhead unnecessary)  
❌ **Documents <100MB** accessed frequently (database storage more efficient)  

### Trade-offs:

**Cost:**
- Bucket-Per-Tenant: ₹20L/year for 50 tenants (higher isolation, higher cost)
- Shared Bucket (Hybrid): ₹8L/year for 50 tenants (60% savings)
- Cross-region transfer: ₹6/GB for data movement
- KMS encryption: ₹75/key/month + ₹0.24/10K API calls

**Latency:**
- Presigned URL generation: <200ms (with Redis cache)
- Direct S3 download: 50-500ms (depends on region proximity)
- Cross-region access: +100-300ms (if user far from bucket region)

**Complexity:**
- Bucket-Per-Tenant: Low (simple IAM, easy to understand)
- Shared Bucket + IAM: High (complex policies, tag discipline required)
- Hybrid (TenantS3Client wrapper): Medium (wrapper must be used consistently)

**Security:**
- Best: Bucket-Per-Tenant (bucket-level isolation; still logical rather than physical separation)
- Good: Hybrid with wrapper enforcement (application-level isolation)
- Risk: Shared Bucket + IAM (policy misconfiguration vulnerability)

### Success Metrics:

- Zero successful cross-tenant document accesses; every blocked attempt appears in the audit logs
- All documents properly tagged with tenant_id within 48 hours
- Presigned URL generation <200ms (Redis cache hit rate >80%)
- Data residency violations = 0 (enforcement active)
- Cost per GB/month = ₹1.75 standard + ₹0.05 for tagging overhead

## 10. Hands-On: Practathon Mission

**Objective:** Implement Hybrid Model with verified isolation

**Tasks:**

1. ✓ Set up S3 bucket with tenant prefixes (simulated in offline mode)
2. ✓ Implement TenantS3Client wrapper (already done in module)
3. ✓ Generate and validate presigned URLs (tested above)
4. ✓ Set up audit logging (demonstrated with StorageAuditLogger)
5. ✓ Implement data residency enforcement (validated with DataResidencyValidator)

**Success Criteria:**
- ✓ Tenant A cannot download Tenant B's documents (verified with Forbidden exception)
- ✓ Presigned URLs expire after 5 minutes (default expiration=300)
- ✓ Cross-region upload attempts blocked (DataResidencyViolation raised)
- ✓ All access logged with timestamps and user IDs (audit_logger.log_access())

**Next Steps:**
1. Review the source code in `src/l3_m12_document_storage/__init__.py`
2. Run the test suite: `pytest tests/test_m12_document_storage.py -v`
3. Start the FastAPI server: `./scripts/run_api.ps1`
4. Explore API endpoints at http://localhost:8000/docs
5. Configure AWS credentials in `.env` to test with real S3

## Conclusion & Key Takeaways

**Key Learnings:**

1. **Application-layer isolation is insufficient** - storage-layer enforcement is essential
   - Bugs, leaked credentials, and direct Console access bypass application checks
   - TenantS3Client wrapper enforces prefix isolation at the client level

2. **Hybrid Model (shared bucket + wrapper + prefixes)** balances security, scale, cost for most GCCs
   - Scales to 1000+ tenants without AWS bucket limits
   - 60% cost savings vs bucket-per-tenant (₹8L vs ₹20L annually)
   - Wrapper prevents direct boto3 access, enforces tenant scoping

3. **Multi-region storage is mandatory** for regulated industries
   - GDPR (EU), DPDPA (India), HIPAA (US) require regional data residency
   - Plan for ₹60K+/month cross-region transfer costs
   - DataResidencyValidator blocks wrong-region uploads before S3 API call

4. **Presigned URLs must be validated BEFORE generation**, not trusted after
   - Always check: prefix matches tenant, document exists, tags match tenant_id
   - 5-minute expiration prevents URL forwarding abuse
   - Audit log every URL generation for compliance trail

5. **Data residency is a legal requirement**, not an optional compliance checkbox
   - GDPR violation fines: up to €20M or 4% of global annual turnover, whichever is higher
   - Enforce at upload time with region validation
   - CloudTrail alerts for cross-region access attempts

**Architecture Patterns Applied:**
- Wrapper pattern (TenantS3Client wraps boto3)
- Prefix-based isolation (tenant-{id}/ enforced immutably)
- Defense in depth (prefix + tagging + IAM + wrapper)
- Audit logging for all operations (CloudTrail + database)
- Multi-region data residency enforcement

**Next Modules:**
- **L3 M12.3:** Implement query isolation (vector DB + traditional DB separation)
- **L3 M12.4:** Compliance automation (GDPR/HIPAA audit trails)
- **L3 M13:** Scaling to 500+ tenants (performance optimization)

**Production Checklist:**
- [ ] Configure AWS credentials in .env (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- [ ] Set up Redis for presigned URL caching (reduces S3 API calls)
- [ ] Configure PostgreSQL for audit logging (immutable compliance trail)
- [ ] Create KMS keys per tenant (encryption key rotation)
- [ ] Set up CloudTrail in all regions (us-east-1, eu-west-1, ap-south-1)
- [ ] Implement cost attribution tags (tenant_id on all objects)
- [ ] Configure S3 lifecycle policies (Standard → Glacier after 90 days)
- [ ] Enable cross-region replication for DR (RTO <1 hour)
- [ ] Set up anomaly detection alerts (>100 API calls/hour)
- [ ] Test GDPR Data Subject Access Request workflow

---

**You now have production-ready multi-tenant document storage with storage-layer isolation enforcement!**