# L3 M3.3: Audit Logging & SIEM Integration for RAG Systems

## Learning Arc

By the end of this notebook, you will understand:

1. **Audit Logging vs. Application Logging** - Critical differences for compliance
2. **Six Critical Audit Points in RAG** - Where to log: query, access control, retrieval, generation, response, errors
3. **Immutability Strategies** - PostgreSQL RLS, S3 Object Lock, cryptographic hash chains
4. **Correlation ID Architecture** - Multi-tenant tracking with tenant_id + correlation_id + span_id
5. **SIEM Integration Patterns** - Real-time streaming to Splunk, Elasticsearch, Datadog
6. **Tiered Retention Policies** - Hot/warm/cold storage for 7-10 year compliance
7. **Hash Chain Verification** - Detecting tampering with cryptographic proofs
8. **Regulatory Requirements** - SOX, GDPR, HIPAA, PCI-DSS, ISO 27001 compliance mapping

**🎯 Final Outcome:** Design and implement production-grade audit logging for regulated RAG systems with cryptographic immutability and enterprise SIEM integration.

**📚 Regulatory Context:** This module addresses SOX Section 404 (immutable audit trails), GDPR Article 15 (right of access), HIPAA 164.312(b) (audit controls), and PCI-DSS Requirement 10 (access logging).

In [None]:
# OFFLINE Mode Check
# This module operates OFFLINE (no external LLM APIs)
# Requirements: PostgreSQL database only (AWS S3 and SIEM are optional)

import os
import sys

# Add parent directory to path for imports
sys.path.insert(0, os.path.abspath('..'))

OFFLINE_MODE = True  # This module is always offline (local PostgreSQL)

if OFFLINE_MODE:
    print("✅ OFFLINE Mode: This module uses local PostgreSQL only")
    print("No external LLM API calls required")
    print("")
    print("Optional integrations (not required for core functionality):")
    print("  - AWS S3: For long-term archival (7-10 year retention)")
    print("  - SIEM: For real-time security monitoring (Splunk/ELK/Datadog)")
else:
    print("⚠️ This module should always be offline")

# Expected: Confirmation of offline mode

## Section 1: Introduction & The HIPAA Audit Failure

### The $1.8M Mistake: Why Audit Logging Matters

**Real-world scenario:** A healthcare company deployed a RAG system to answer medical questions for staff. They logged application errors (500 errors, latency spikes) but **failed to log who accessed what patient data**.

During a HIPAA audit:
- **Auditor:** "Show me everyone who accessed patient ID 12345's records in Q3 2024."
- **Company:** "We... don't have that data. We log application health, not data access."
- **Result:** $1.8M fine for HIPAA 164.312(b) violation (lack of audit controls)

### The Fundamental Distinction

| Aspect | Application Logs | Audit Logs |
|--------|------------------|------------|
| **Purpose** | "Is my system healthy?" | "Can I prove compliance?" |
| **Content** | Errors, latency, resource usage | Who, what, when, why (data access) |
| **Retention** | 7-30 days (ops needs) | 7-10 years (regulatory) |
| **Immutability** | Overwrite old logs to save space | NEVER delete (regulatory violation) |
| **Audience** | Engineers, DevOps | Auditors, legal, compliance |

**Key Insight:** Application logs are for **engineering**. Audit logs are for **regulators**. This module teaches the latter.
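The auditor's question from the scenario above ("who accessed patient 12345 in Q3 2024?") is only answerable if retrieval events record the user, the records touched, and a timestamp. A minimal sketch of that lookup follows, using plain dictionaries rather than this module's classes; the field names and sample records are invented for illustration.

```python
from datetime import datetime

# Invented sample retrieval events; a real store would hold these as rows.
events = [
    {"user_id": "nurse-17", "event_type": "RAG_RETRIEVAL",
     "patient_ids": ["12345"], "timestamp": datetime(2024, 8, 2, 9, 15)},
    {"user_id": "dr-204", "event_type": "RAG_RETRIEVAL",
     "patient_ids": ["12345", "67890"], "timestamp": datetime(2024, 9, 30, 23, 59)},
    {"user_id": "dr-204", "event_type": "RAG_RETRIEVAL",
     "patient_ids": ["67890"], "timestamp": datetime(2024, 8, 15, 10, 0)},
    {"user_id": "admin-3", "event_type": "RAG_RETRIEVAL",
     "patient_ids": ["12345"], "timestamp": datetime(2024, 10, 1, 0, 0)},
]

def who_accessed(events, patient_id, start, end):
    """Return (user_id, timestamp) for retrievals touching patient_id in [start, end)."""
    return [
        (e["user_id"], e["timestamp"])
        for e in events
        if e["event_type"] == "RAG_RETRIEVAL"
        and patient_id in e["patient_ids"]
        and start <= e["timestamp"] < end
    ]

q3 = who_accessed(events, "12345", datetime(2024, 7, 1), datetime(2024, 10, 1))
print([user for user, _ in q3])  # ['nurse-17', 'dr-204']
```

Application logs that capture only errors and latency contain none of these three fields, which is why they cannot answer the question.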

In [None]:
# Import core audit logging components
from src.l3_m3_monitoring_reporting_compliance import (
    AuditEventType,
    DataClassification,
    AuditEvent,
    CorrelationContext,
    AuditLogger,
)
from datetime import datetime
import json

print("✅ Audit logging components imported successfully")
print(f"Available event types: {[e.name for e in AuditEventType]}")
print(f"Data classifications: {[c.name for c in DataClassification]}")

# Expected:
# ✅ Audit logging components imported successfully
# Available event types: ['RAG_QUERY', 'RAG_RETRIEVAL', 'RAG_GENERATION', ...]
# Data classifications: ['PUBLIC', 'INTERNAL', 'CONFIDENTIAL', 'RESTRICTED']

## Section 2: Theory - Six Critical Audit Points in RAG Systems

### Every RAG Request Has 6 Mandatory Logging Touchpoints

1. **Query Input Audit**
   - **What:** User ID, role, exact query text (or hash if PII-sensitive), timestamp
   - **Why:** Prove who initiated the request and what they asked
   - **Regulatory:** GDPR Article 15 (right to access own data)

2. **Access Control Decision**
   - **What:** Allow/deny decision, policy rule applied, user privileges
   - **Why:** Prove authorization was checked before data access
   - **Regulatory:** SOX 404 (internal controls), PCI-DSS Req 7

3. **Retrieval Audit**
   - **What:** Document IDs retrieved, relevance scores, classification levels
   - **Why:** Track which sensitive documents were exposed
   - **Regulatory:** HIPAA (PHI access), PCI-DSS (cardholder data)

4. **LLM Generation Audit**
   - **What:** Prompt sent to LLM, model used, response generated, token counts
   - **Why:** Detect prompt injection, data leakage, cost anomalies
   - **Regulatory:** ISO 27001 (information security controls)

5. **Response Delivery Audit**
   - **What:** Final response shown to user, acknowledgment, data classification
   - **Why:** Prove user received data (for right-to-access requests)
   - **Regulatory:** GDPR Article 15 (data access proof)

6. **Error Audit**
   - **What:** Failure type, stack trace, recovery action taken
   - **Why:** Detect tampering attempts disguised as errors
   - **Regulatory:** SOX 404 (detect fraud indicators)
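Audit point 1 allows logging a hash of the query instead of the raw text when the query itself may contain PII. One way to do that is a keyed hash (HMAC-SHA256), sketched below. The function name, the `hmac-sha256:` prefix, and the normalisation step are assumptions for illustration, not part of this module.

```python
import hashlib
import hmac

def hash_query(query: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of the query, prefixed with the algorithm name."""
    # Normalise whitespace and case so trivially different spellings hash the same.
    normalised = " ".join(query.lower().split())
    digest = hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"hmac-sha256:{digest}"

key = b"demo-only-key"  # hypothetical; load from a secrets manager in practice
h1 = hash_query("Show labs for patient 12345", key)
h2 = hash_query("  show LABS for patient 12345 ", key)
print(h1 == h2)  # True
```

A keyed hash is used here rather than a bare SHA-256 because short, guessable queries could otherwise be recovered by hashing candidate strings; whether that trade-off suits a given deployment is a design decision.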

### Volume Calculation

**Example:** 1,000 RAG queries/day with 6 events each:
- **Daily:** 1,000 queries × 6 events = 6,000 events
- **Yearly:** 6,000 × 365 = 2.19 million events
- **Storage:** 2.19M × 2 KB per event ≈ **4.2 GB/year**
- **7-year SOX retention:** 4.2 GB × 7 ≈ **29 GB total**

**Insight:** Storage is cheap (a 29 GB archive costs under $1/month even at S3 Standard rates of ~$0.023/GB/month). Non-compliance fines are expensive ($1.8M+).

In [None]:
# Create correlation context for multi-tenant tracking
context = CorrelationContext(
    tenant_id="finance",
    correlation_id="req-demo-12345",
    span_id="query-span-1"
)

print("Correlation Context Created:")
print(json.dumps(context.to_dict(), indent=2))
print()

# Create child span for nested operations
retrieval_context = context.create_child_span("retrieval")
print("Child Span for Retrieval:")
print(json.dumps(retrieval_context.to_dict(), indent=2))
print()

print("✅ Note: Same correlation_id, different span_ids")
print("   This links query → retrieval → generation in audit trail")

# Expected:
# Correlation Context Created:
# {
#   "tenant_id": "finance",
#   "correlation_id": "req-demo-12345",
#   "span_id": "query-span-1"
# }
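To show why the shared `correlation_id` matters, here is a standalone sketch of the same idea. `SpanContext` is a hypothetical stand-in, not the module's `CorrelationContext`, and the span ID format is an assumption.

```python
from dataclasses import dataclass, asdict
import uuid

@dataclass(frozen=True)
class SpanContext:
    """Hypothetical stand-in for the module's CorrelationContext."""
    tenant_id: str
    correlation_id: str
    span_id: str

    def child(self, operation: str) -> "SpanContext":
        # Same tenant and correlation_id; only the span_id changes.
        return SpanContext(self.tenant_id, self.correlation_id,
                           f"{operation}-{uuid.uuid4().hex[:8]}")

root = SpanContext("finance", "req-demo-12345", "query-span-1")
retrieval = root.child("retrieval")
generation = root.child("generation")

# Events from several requests, as they might sit in the audit store.
events = [
    {"ctx": asdict(root), "type": "RAG_QUERY"},
    {"ctx": asdict(retrieval), "type": "RAG_RETRIEVAL"},
    {"ctx": asdict(SpanContext("hr", "req-other", "query-span-9")), "type": "RAG_QUERY"},
    {"ctx": asdict(generation), "type": "RAG_GENERATION"},
]

# Reassemble one request's trail by filtering on tenant_id + correlation_id.
trail = [e["type"] for e in events
         if e["ctx"]["tenant_id"] == "finance"
         and e["ctx"]["correlation_id"] == "req-demo-12345"]
print(trail)  # ['RAG_QUERY', 'RAG_RETRIEVAL', 'RAG_GENERATION']
```

Filtering on `tenant_id` as well as `correlation_id` keeps one tenant's audit query from returning another tenant's events even if request IDs collide.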

## Section 3: Technology Stack & Architecture

### Three-Tier Audit Logging Architecture

```
┌─────────────────────────────────────────┐
│        RAG Application Layer            │
│   (Query → Retrieval → Generation)      │
└──────────────┬──────────────────────────┘
               │ Audit Events
               ▼
┌─────────────────────────────────────────┐
│      Audit Logger (This Module)         │
│  • Correlation ID Generation            │
│  • Cryptographic Hash Chaining          │
│  • Multi-Destination Routing            │
└──┬──────────┬──────────────┬────────────┘
   │          │              │
   ▼          ▼              ▼
┌──────┐  ┌──────┐      ┌────────┐
│ PG   │  │  S3  │      │  SIEM  │
│ RLS  │  │ Lock │      │ (Live) │
└──────┘  └──────┘      └────────┘
Hot        Cold          Real-time
(0-90d)    (7yr)         (SOC)
```

### Component Breakdown

**1. PostgreSQL with Row-Level Security (RLS)**
- **Purpose:** Hot storage (0-90 days) for fast queries
- **Immutability:** Application role can INSERT only, cannot DELETE/UPDATE
- **Cost:** $$ (database licensing)
- **Query Speed:** Fast (direct SQL)

**2. AWS S3 Object Lock**
- **Purpose:** Cold archive (7-10 years) for compliance
- **Immutability:** COMPLIANCE mode - cannot delete until retention expires
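- **Cost:** $ (~$0.023/GB/month at S3 Standard rates; a ~29 GB archive, i.e. 1,000 queries/day × 6 events × 2 KB over 7 years, costs under $1/month)
- **Query Speed:** Slow once objects transition to Glacier storage classes (retrieval can take minutes to hours)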

**3. SIEM Integration (Splunk/ELK/Datadog)**
- **Purpose:** Real-time security monitoring by SOC team
- **Features:** Correlation rules, anomaly detection, alerting
- **Cost:** $$$ (Splunk ~$150/GB/year, ELK self-hosted, Datadog ~$15/host/month)
- **Use Case:** Detect insider threats, data exfiltration, privilege escalation
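As a rough illustration of streaming an audit event to a SIEM, the sketch below wraps an event in the envelope used by Splunk's HTTP Event Collector (HEC) and builds the POST request without sending it. The host, token, and `rag:audit` sourcetype are placeholders, and the function name is hypothetical; the module's own SIEM client may work differently.

```python
import json
import urllib.request

def build_hec_request(url: str, token: str, event: dict,
                      epoch_time: float) -> urllib.request.Request:
    """Wrap one audit event in a Splunk HEC envelope (request is built, not sent)."""
    envelope = {
        "time": epoch_time,
        "sourcetype": "rag:audit",  # hypothetical sourcetype name
        "event": event,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(envelope).encode("utf-8"),
        headers={
            "Authorization": f"Splunk {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_hec_request(
    "https://splunk.example.internal:8088/services/collector/event",  # placeholder host
    "00000000-0000-0000-0000-000000000000",                           # placeholder token
    {"event_type": "RAG_RETRIEVAL", "tenant_id": "finance",
     "correlation_id": "req-demo-12345"},
    epoch_time=1735689600.0,
)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["event"]["correlation_id"])
# Sending would be: urllib.request.urlopen(req, timeout=5)
```

In practice the send should be asynchronous or queued so that a SIEM outage never blocks the RAG request path; the database write remains the system of record.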

### Immutability Strategy Comparison

| Approach | How It Works | Tamper Evidence |
|----------|--------------|----------------|
| **PostgreSQL RLS** | Database policies prevent DELETE/UPDATE | Access control (admin can still tamper if malicious) |
| **S3 Object Lock** | WORM (write-once-read-many) retention enforced by AWS | Strong (in COMPLIANCE mode no user, including root, can delete or overwrite until retention expires) |
| **Hash Chain** | SHA-256 linking like blockchain | Mathematical proof (broken chain = tampering) |

**Best Practice:** Use **hash chains** for tamper detection + **S3 Object Lock** for enforcement.
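The hash chain idea can be sketched in a few lines with `hashlib`. This is a minimal illustration under assumed conventions (canonical JSON, a `|` separator, hypothetical function names); the module's `AuditEvent` may compute its hashes differently.

```python
import hashlib
import json
from typing import Optional

def event_hash(payload: dict, previous_hash: Optional[str]) -> str:
    """SHA-256 over the previous hash plus the canonical JSON of the payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{previous_hash or ''}|{canonical}".encode("utf-8")).hexdigest()

def append_event(chain: list, payload: dict) -> None:
    """Append a record whose previous_hash points at the last record's hash."""
    previous_hash = chain[-1]["current_hash"] if chain else None
    chain.append({
        "payload": payload,
        "previous_hash": previous_hash,
        "current_hash": event_hash(payload, previous_hash),
    })

def first_broken_link(chain: list) -> Optional[int]:
    """Return the index of the first record that fails verification, else None."""
    expected_previous = None
    for i, record in enumerate(chain):
        if record["previous_hash"] != expected_previous:
            return i
        if record["current_hash"] != event_hash(record["payload"], record["previous_hash"]):
            return i
        expected_previous = record["current_hash"]
    return None

chain = []
append_event(chain, {"type": "RAG_QUERY", "user": "emp-5678"})
append_event(chain, {"type": "RAG_RETRIEVAL", "docs": ["doc-991"]})
append_event(chain, {"type": "RAG_GENERATION", "model": "gpt-4"})
print(first_broken_link(chain))  # None

chain[1]["payload"]["docs"] = ["doc-000"]  # simulate tampering
print(first_broken_link(chain))  # 1
```

A chain on its own only detects edits by someone who cannot recompute every later hash. An attacker with full write access could rebuild the chain from the tampered record onward, which is why pairing it with an externally enforced store such as S3 Object Lock is recommended.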

In [None]:
# Create audit event with hash chaining
event1 = AuditEvent(
    event_type=AuditEventType.RAG_QUERY,
    context=context,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "query": "What were Q4 2024 revenue figures for EMEA?",
        "client_ip": "10.0.1.45"
    },
    data_classification=DataClassification.CONFIDENTIAL,
    compliance_flags=["SOX_RELEVANT", "FINANCIAL_DATA"]
)

print("First Audit Event Created:")
print(f"Event ID: {event1.event_id}")
print(f"Event Type: {event1.event_type.value}")
print(f"User: {event1.user_id} ({event1.user_role})")
print(f"Previous Hash: {event1.previous_hash}")
print(f"Current Hash: {event1.current_hash[:16]}...")
print()

# Create second event linked to first
event2 = AuditEvent(
    event_type=AuditEventType.RAG_RETRIEVAL,
    context=retrieval_context,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "retrieved_doc_ids": ["doc-991", "doc-992", "doc-993"],
        "relevance_scores": [0.95, 0.87, 0.82]
    },
    data_classification=DataClassification.CONFIDENTIAL,
    compliance_flags=["SOX_RELEVANT"],
    previous_hash=event1.current_hash  # Link to previous event
)

print("Second Audit Event Created (Linked via Hash):")
print(f"Event ID: {event2.event_id}")
print(f"Event Type: {event2.event_type.value}")
print(f"Previous Hash: {event2.previous_hash[:16]}... (matches event1.current_hash)")
print(f"Current Hash: {event2.current_hash[:16]}...")
print()
print("✅ Hash Chain Verification:")
print(f"   event1.current_hash == event2.previous_hash: {event1.current_hash == event2.previous_hash}")

# Expected:
# First Audit Event Created:
# Event ID: <uuid>
# Event Type: RAG_QUERY
# Previous Hash: None (genesis event)
# Current Hash: a3d5f7e9... (SHA-256)
# 
# Second Audit Event Created (Linked via Hash):
# Previous Hash: a3d5f7e9... (matches event1)
# ✅ Hash Chain Verification: True

## Section 4: Technical Implementation - Hands-On Audit Logging

### Step 1: Initialize Audit Logger (PostgreSQL Only)

For this demo, we'll use a minimal configuration with just PostgreSQL. In production, you would enable S3 archival and SIEM integration.

In [None]:
# Import additional components
from src.l3_m3_monitoring_reporting_compliance import (
    PostgreSQLAuditStore,
    create_audit_logger,
)
from config import get_config

# Load configuration
try:
    config = get_config()
    print("✅ Configuration loaded successfully")
    print(f"   Database: {config['db_host']}:{config['db_port']}/{config['db_name']}")
    print(f"   AWS S3 Enabled: {config['aws_enabled']}")
    print(f"   SIEM Enabled: {config['siem_enabled']}")
    print(f"   Hash Chain Enabled: {config['enable_hash_chain']}")
except Exception as e:
    print(f"⚠️ Configuration error: {e}")
    print("   Continuing with demo (will skip database operations)")
    config = None

# Expected:
# ✅ Configuration loaded successfully
#    Database: localhost:5432/audit_logs
#    AWS S3 Enabled: False
#    SIEM Enabled: False
#    Hash Chain Enabled: True

### Step 2: Complete RAG Workflow with Audit Logging

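Simulate a full RAG request that logs the first five audit points, then verifies the hash chain. The sixth point, the error audit, fires only when a step fails, so it does not appear in this happy-path run: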

In [None]:
# Create correlation context for this request
workflow_context = CorrelationContext(
    tenant_id="finance",
    correlation_id="req-workflow-demo"
)

print("=" * 60)
print("SIMULATING FULL RAG WORKFLOW WITH AUDIT LOGGING")
print("=" * 60)
print()

# 1. QUERY INPUT AUDIT
print("[1/6] Query Input Audit")
query_event = AuditEvent(
    event_type=AuditEventType.RAG_QUERY,
    context=workflow_context,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "query": "What were Q4 2024 revenue figures for EMEA region?",
        "query_hash": "sha256:abc123...",
        "client_ip": "10.0.1.45",
        "user_agent": "Mozilla/5.0"
    },
    data_classification=DataClassification.CONFIDENTIAL,
    compliance_flags=["SOX_RELEVANT", "FINANCIAL_DATA"]
)
print(f"   ✅ Logged query from {query_event.user_id}")
print(f"   Event ID: {query_event.event_id}")
print()

# 2. ACCESS CONTROL DECISION
print("[2/6] Access Control Decision Audit")
access_context = workflow_context.create_child_span("access_control")
access_event = AuditEvent(
    event_type=AuditEventType.ACCESS_CONTROL,
    context=access_context,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "decision": "ALLOW",
        "policy_rule": "ROLE_BASED_ACCESS_CONTROL",
        "requested_resource": "confidential_financial_docs",
        "reason": "User has analyst role with finance department access"
    },
    data_classification=DataClassification.INTERNAL,
    previous_hash=query_event.current_hash
)
print(f"   ✅ Access decision: {access_event.data['decision']}")
print()

# 3. RETRIEVAL AUDIT
print("[3/6] Retrieval Audit")
retrieval_ctx = workflow_context.create_child_span("retrieval")
retrieval_event = AuditEvent(
    event_type=AuditEventType.RAG_RETRIEVAL,
    context=retrieval_ctx,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "retrieved_doc_ids": ["doc-991", "doc-992", "doc-993"],
        "relevance_scores": [0.95, 0.87, 0.82],
        "retrieval_method": "vector_search",
        "search_time_ms": 45
    },
    data_classification=DataClassification.CONFIDENTIAL,
    compliance_flags=["SOX_RELEVANT"],
    previous_hash=access_event.current_hash
)
print(f"   ✅ Retrieved {len(retrieval_event.data['retrieved_doc_ids'])} documents")
print()

# 4. LLM GENERATION AUDIT
print("[4/6] LLM Generation Audit")
generation_ctx = workflow_context.create_child_span("generation")
generation_event = AuditEvent(
    event_type=AuditEventType.RAG_GENERATION,
    context=generation_ctx,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "model": "gpt-4",
        "prompt_tokens": 1200,
        "completion_tokens": 250,
        "total_tokens": 1450,
        "generation_time_ms": 3500,
        "temperature": 0.7
    },
    data_classification=DataClassification.CONFIDENTIAL,
    compliance_flags=["SOX_RELEVANT"],
    previous_hash=retrieval_event.current_hash
)
print(f"   ✅ Generated response using {generation_event.data['model']}")
print(f"   Tokens used: {generation_event.data['total_tokens']}")
print()

# 5. RESPONSE DELIVERY AUDIT
print("[5/6] Response Delivery Audit")
response_ctx = workflow_context.create_child_span("response_delivery")
response_event = AuditEvent(
    event_type=AuditEventType.RESPONSE_DELIVERY,
    context=response_ctx,
    user_id="emp-5678",
    user_role="analyst",
    user_department="finance",
    data={
        "response_length": 850,
        "response_hash": "sha256:def456...",
        "delivery_time_ms": 120,
        "user_acknowledged": True
    },
    data_classification=DataClassification.CONFIDENTIAL,
    compliance_flags=["SOX_RELEVANT", "DATA_DELIVERY"],
    previous_hash=generation_event.current_hash
)
print("   ✅ Response delivered to user")
print()

# 6. Verify hash chain integrity (the error audit from Section 2 fires only on failures)
print("[6/6] Hash Chain Verification")
print(f"   Query → Access: {query_event.current_hash == access_event.previous_hash}")
print(f"   Access → Retrieval: {access_event.current_hash == retrieval_event.previous_hash}")
print(f"   Retrieval → Generation: {retrieval_event.current_hash == generation_event.previous_hash}")
print(f"   Generation → Response: {generation_event.current_hash == response_event.previous_hash}")
print()
print("✅ Complete audit trail with cryptographic hash chain intact!")
print()
print("Correlation ID: All events share same correlation_id for querying:")
print(f"   {workflow_context.correlation_id}")

# Expected:
# [1/6] Query Input Audit
#    ✅ Logged query from emp-5678
# [2/6] Access Control Decision Audit
#    ✅ Access decision: ALLOW
# ...
# ✅ Complete audit trail with cryptographic hash chain intact!
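The workflow above does not exercise the error audit point. One possible way to capture it is a context manager that records the failure type and stack trace before re-raising. This sketch uses plain dictionaries and an in-memory list; `audited_step`, `audit_sink`, and the `"ERROR"` event type name are hypothetical.

```python
import traceback
from contextlib import contextmanager
from datetime import datetime, timezone

audit_sink = []  # stand-in for the real audit store

@contextmanager
def audited_step(step: str, correlation_id: str, user_id: str):
    """Record an error audit event if the wrapped step raises, then re-raise."""
    try:
        yield
    except Exception as exc:
        audit_sink.append({
            "event_type": "ERROR",  # hypothetical event type name
            "step": step,
            "correlation_id": correlation_id,
            "user_id": user_id,
            "failure_type": type(exc).__name__,
            "message": str(exc),
            "stack_trace": traceback.format_exc(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        raise

try:
    with audited_step("retrieval", "req-workflow-demo", "emp-5678"):
        raise TimeoutError("vector store did not respond")
except TimeoutError:
    pass  # the caller still sees the failure; the audit record already exists

print(audit_sink[0]["failure_type"], audit_sink[0]["step"])  # TimeoutError retrieval
```

Re-raising keeps normal error handling intact, while the recorded `correlation_id` ties the failure to the other events of the same request.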

## Section 5: Reality Check - Production Considerations

### Common Production Pitfalls

**❌ Pitfall 1: Deleting Old Logs to Free Disk Space**
- **Impact:** Regulatory violation (SOX requires 7-year retention)
- **Fine Risk:** Regulators assume you're hiding fraudulent activity
- **Solution:** Implement tiered retention with archival to S3 Glacier

**❌ Pitfall 2: Giving Engineers DELETE Privileges**
- **Impact:** Insider can cover tracks by deleting audit logs
- **Solution:** Separate database roles:
  - `audit_app`: Can INSERT only (application writes logs)
  - `audit_admin`: Can SELECT only (for archival to S3)
  - NO role can DELETE or UPDATE

**❌ Pitfall 3: Logging to Application Database**
- **Impact:** If application DB is compromised, audit logs are too
- **Solution:** Use dedicated audit database with stricter access controls
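The append-only behaviour that Pitfall 2 calls for can be demonstrated without a PostgreSQL server. The sketch below swaps in SQLite triggers as a stand-in: in PostgreSQL the same effect would come from granting the `audit_app` role INSERT only and never granting UPDATE or DELETE. The table and trigger names are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE audit_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL,
    action TEXT NOT NULL
);
-- Reject any attempt to modify or remove existing rows.
CREATE TRIGGER audit_no_update BEFORE UPDATE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
CREATE TRIGGER audit_no_delete BEFORE DELETE ON audit_log
BEGIN SELECT RAISE(ABORT, 'audit_log is append-only'); END;
""")

conn.execute("INSERT INTO audit_log (user_id, action) VALUES (?, ?)",
             ("emp-5678", "RAG_QUERY"))

blocked = []
for statement in ("UPDATE audit_log SET user_id = 'someone-else' WHERE id = 1",
                  "DELETE FROM audit_log WHERE id = 1"):
    try:
        conn.execute(statement)
    except sqlite3.DatabaseError as exc:
        blocked.append(str(exc))

rows = conn.execute("SELECT user_id, action FROM audit_log").fetchall()
print(len(blocked))  # 2
print(rows)          # [('emp-5678', 'RAG_QUERY')]
```

Triggers, like role grants, can be dropped by a sufficiently privileged administrator, so this is access control rather than tamper evidence; the hash chain covers that gap.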

### Cost-Benefit Reality

**Storage Cost (7-year retention for 1,000 queries/day):**
- PostgreSQL (90 days hot): ~$50/year
- S3 Standard (365 days warm): ~$15/year
- S3 Glacier (7 years cold): ~$5/year
- **Total:** ~$70/year

**Non-Compliance Fine Risk:**
- HIPAA: $100-$50,000 per violation (up to $1.5M/year)
- SOX: Criminal penalties, company delisting
- GDPR: Up to 4% of annual global turnover or €20M, whichever is higher

**ROI:** Spending $70/year to avoid $1.8M fine = **25,000x return**

In [None]:
# Storage calculation for production planning

def calculate_audit_storage(
    queries_per_day: int,
    events_per_query: int = 6,
    bytes_per_event: int = 2048,  # 2KB
    retention_years: int = 7
):
    """
    Calculate audit log storage requirements.
    
    Args:
        queries_per_day: Daily RAG query volume
        events_per_query: Audit events per query (default 6)
        bytes_per_event: Average event size in bytes
        retention_years: Compliance retention period
    
    Returns:
        Storage requirements breakdown
    """
    events_per_day = queries_per_day * events_per_query
    bytes_per_day = events_per_day * bytes_per_event
    mb_per_day = bytes_per_day / (1024 * 1024)
    gb_per_year = (bytes_per_day * 365) / (1024 * 1024 * 1024)
    gb_total = gb_per_year * retention_years
    
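    # Cost estimate for archival storage, applied to the full retained volume
    cost_per_gb_month = 0.004  # approximate S3 Glacier rate (assumption; check current AWS pricing, Deep Archive is cheaper)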
    cost_per_year = gb_total * cost_per_gb_month * 12
    
    return {
        "events_per_day": events_per_day,
        "mb_per_day": round(mb_per_day, 2),
        "gb_per_year": round(gb_per_year, 2),
        "gb_total_retention": round(gb_total, 2),
        "estimated_cost_per_year": round(cost_per_year, 2),
    }

# Example: Small GCC (1,000 queries/day)
small_gcc = calculate_audit_storage(1000)
print("Small GCC (1,000 queries/day):")
print(f"  Events per day: {small_gcc['events_per_day']:,}")
print(f"  Storage per day: {small_gcc['mb_per_day']} MB")
print(f"  Storage per year: {small_gcc['gb_per_year']} GB")
print(f"  Total 7-year retention: {small_gcc['gb_total_retention']} GB")
print(f"  Estimated annual cost: ${small_gcc['estimated_cost_per_year']}")
print()

# Example: Medium GCC (10,000 queries/day)
medium_gcc = calculate_audit_storage(10000)
print("Medium GCC (10,000 queries/day):")
print(f"  Events per day: {medium_gcc['events_per_day']:,}")
print(f"  Total 7-year retention: {medium_gcc['gb_total_retention']} GB")
print(f"  Estimated annual cost: ${medium_gcc['estimated_cost_per_year']}")
print()

# Example: Large GCC (100,000 queries/day)
large_gcc = calculate_audit_storage(100000)
print("Large GCC (100,000 queries/day):")
print(f"  Events per day: {large_gcc['events_per_day']:,}")
print(f"  Total 7-year retention: {large_gcc['gb_total_retention']} GB")
print(f"  Estimated annual cost: ${large_gcc['estimated_cost_per_year']}")
print()
print("💡 Key Insight: Storage cost is negligible compared to compliance fines!")

# Expected:
# Small GCC (1,000 queries/day):
#   Events per day: 6,000
#   Storage per day: 11.72 MB
#   Storage per year: 4.18 GB
#   Total 7-year retention: 29.27 GB
#   Estimated annual cost: $1.41

## Section 6: Decision Card - When to Use This Approach

### ✅ WHEN TO USE Comprehensive Audit Logging

**Organization Profile:**
- Operating in **regulated industries**: healthcare (HIPAA), finance (SOX), payments (PCI-DSS), EU operations (GDPR)
- Enterprise GCC serving **10+ internal tenants** or external clients
- Handling **classified/confidential data** requiring audit proof
- Subject to **compliance audits** (annual or ad-hoc)

**Technical Profile:**
- Production RAG system processing **100+ queries/day**
- **Multiple user roles** with differentiated access levels
- Integration with **centralized security operations center (SOC)**
- Long-term data retention requirements **(6-10 years)**

**Risk Tolerance:**
- Non-compliance fines exceed logging implementation cost
- Regulatory audit failure **unacceptable to business**
- Need to **prove compliance** to external auditors/board

### ❌ WHEN NOT TO USE (or Defer)

**Skip Full Audit Logging If:**
- Internal **research project** with synthetic/test data only
- **Single-user** RAG system (no multi-tenant complexity)
- **Non-regulated industry** with no audit requirements
- **Short-lived proof-of-concept** (expected lifespan <6 months)
- **Zero external compliance** obligations

**Defer SIEM Integration If:**
- Organization has **no existing SIEM infrastructure**
- Security team **does not exist** or cannot monitor logs
- **Budget severely constrained** and basic database logging suffices
- Compliance requirements only require **log retention**, not real-time monitoring

### Trade-Offs Summary

| Approach | Cost | Tamper Resistance | Query Speed | Complexity |
|----------|------|-------------------|-------------|------------|
| PostgreSQL RLS | $$ | Medium | Fast | Low |
| S3 Object Lock | $ | Very High | Slow | Medium |
| Hash Chain | $ | Very High | Medium | High |
| **Recommended: Hash Chain + S3** | $ | Very High | Medium | Medium |

**Best Practice:** Start with PostgreSQL + hash chains. Add S3 archival when retention exceeds 365 days. Add SIEM when you have a security team ready to monitor.

## Section 7: Summary & Key Takeaways

### What You Learned

1. **Audit vs. Application Logging**
   - Application logs = system health ("Is it working?")
   - Audit logs = compliance proof ("Can I show the regulator?")
   - Retention: 7-30 days vs. 7-10 years

2. **Six Critical Audit Points**
   - Query input, access control, retrieval, generation, response, errors
   - Every RAG request = 6 audit events minimum
   - 1,000 queries/day = 6,000 events/day ≈ 4.2 GB/year

3. **Immutability Strategies**
   - PostgreSQL RLS: Access control (medium security)
   - S3 Object Lock: WORM retention enforced by AWS (high security)
   - Hash chains: Mathematical tamper proof (highest security)
   - **Best practice:** Hash chains + S3 Object Lock

4. **Correlation IDs for Multi-Tenant Tracking**
   - `tenant_id`: Business unit (finance, hr, legal)
   - `correlation_id`: Single user request
   - `span_id`: Specific operation (query, retrieval, generation)
   - Enables tenant-specific audit queries

5. **SIEM Integration**
   - Real-time security monitoring by SOC team
   - Detect anomalies: bulk exports, after-hours access, privilege escalation
   - Platforms: Splunk (~$150/GB/year), Elasticsearch (self-hosted), Datadog (~$15/host/month)

6. **Regulatory Compliance Mapping**
   - SOX Section 404: Immutable audit trails for financial data
   - GDPR Article 15: Right to access (correlation IDs support this)
   - HIPAA 164.312(b): Audit controls for PHI access
   - PCI-DSS Requirement 10: Payment card data logging

### Production Deployment Checklist

- [ ] PostgreSQL database with Row-Level Security configured
- [ ] Separate database roles: `audit_app` (INSERT), `audit_admin` (SELECT)
- [ ] Hash chain verification tested on sample data
- [ ] S3 bucket with Object Lock enabled (COMPLIANCE mode, 7-year retention)
- [ ] SIEM integration configured (if SOC team exists)
- [ ] Tiered retention policy: hot (90d), warm (365d), cold (7yr)
- [ ] Correlation ID generation at RAG request entry point
- [ ] All 6 audit points instrumented in RAG pipeline
- [ ] Test audit query: "Show all access to document X in Q4 2024"
- [ ] Incident response plan for broken hash chains
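The tiered retention item in the checklist (hot 90 days, warm 365 days, cold 7 years) can be expressed as a small age-to-tier function that an archival job might call. The function name, the `"expired"` label, and the choice of 7 × 365 days for the retention limit are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 90              # PostgreSQL
WARM_DAYS = 365            # S3 Standard
RETENTION_DAYS = 7 * 365   # S3 Glacier until the retention period ends

def retention_tier(event_time: datetime, now: datetime) -> str:
    """Map an event's age to the storage tier it should live in."""
    age = now - event_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=WARM_DAYS):
        return "warm"
    if age <= timedelta(days=RETENTION_DAYS):
        return "cold"
    return "expired"  # past retention; disposal still needs a compliance sign-off

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
for days_old in (10, 200, 1000, 3000):
    print(days_old, retention_tier(now - timedelta(days=days_old), now))
# 10 hot
# 200 warm
# 1000 cold
# 3000 expired
```

Passing `now` explicitly keeps the function deterministic and easy to test; an archival job would supply the current UTC time.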

### Next Steps

1. **Practice:** Run the API server and test all endpoints
2. **Explore:** Read the augmented script for deeper implementation details
3. **Extend:** Add custom compliance flags for your industry (e.g., MNPI for finance)
4. **Deploy:** Follow production deployment checklist above

### Resources

- **API Documentation:** http://localhost:8000/docs
- **Source Code:** `src/l3_m3_monitoring_reporting_compliance/`
- **Test Suite:** `tests/test_m3_monitoring_reporting_compliance.py`
- **Example Data:** `example_data.json`

---

**Remember:** Storage is cheap (~$70/year). Non-compliance fines are expensive ($1.8M+). When in doubt, log it and keep it immutable.

**🎯 You're now ready to implement production-grade audit logging for regulated RAG systems!**