# Bridge L3.M6.4 → L3.M7.1 Readiness Validation
## Security Complete → Observability Begins

**Purpose**: Validate that M6.4 security deliverables are operational before starting M7.1 distributed tracing.

---

## Section 1: M6.4 Delivery Recap

Module 6.4 completed the **enterprise security foundation** with four major achievements:

### 1. Comprehensive Audit Trail
- **ELK Stack Implementation**: Captures WHO (user_id, role, IP), WHAT (action, resource), WHEN (timestamp), and outcome
- **Tamper-Evident Storage**: Hash chaining makes any later modification of an audit record detectable
- **Centralized Aggregation**: Elasticsearch for compliance queries
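
The hash chaining mentioned above can be sketched as follows. This is a minimal illustration, not the M6.4 implementation: the SHA-256 choice, the all-zero genesis hash, and the names `chain_event` and `verify_chain` are assumptions made for the example. The event fields mirror the WHO/WHAT/WHEN/outcome structure listed above.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder for the first record's predecessor


def chain_event(event: dict, prev_hash: str) -> dict:
    """Return a copy of `event` linked to the previous record's hash."""
    payload = json.dumps(event, sort_keys=True)  # stable serialization
    digest = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
    return {**event, "prev_hash": prev_hash, "hash": digest}


def verify_chain(records: list) -> bool:
    """Recompute every hash; an edited or reordered record breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in records:
        event = {k: v for k, v in record.items() if k not in ("prev_hash", "hash")}
        if record["prev_hash"] != prev_hash:
            return False
        if chain_event(event, prev_hash)["hash"] != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True


first = chain_event(
    {"user_id": "u-1", "role": "admin", "ip": "10.0.0.5", "action": "read",
     "resource": "doc-42", "timestamp": "2024-01-01T00:00:00Z", "outcome": "success"},
    GENESIS_HASH,
)
second = chain_event(
    {"user_id": "u-2", "role": "analyst", "ip": "10.0.0.9", "action": "delete",
     "resource": "doc-42", "timestamp": "2024-01-01T00:05:00Z", "outcome": "denied"},
    first["hash"],
)
print(verify_chain([first, second]))
```

Because each hash covers the previous one, changing an early record invalidates every record after it, which is why the approach detects tampering rather than preventing it.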

### 2. GDPR Automation
- **Efficiency Gain**: 40 hours manual work → 5 minutes automated
- **Right-to-Erasure**: Automated across systems with audit proof
- **Consent Tracking**: Linked to processing actions

### 3. Retention Policies
- **Tiered Storage**:
  - Hot: 0-90 days (Elasticsearch)
  - Warm: 90 days-1 year (S3)
  - Cold: 1-7 years (Glacier)
- **Automatic Deletion**: After retention periods
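
The tier boundaries above translate into a simple age-based lookup. The sketch below is hypothetical: the function name `storage_tier`, the treatment of day 90 as warm, and the 365-day year are assumptions, since the source does not say how boundary days or leap years are handled.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 90        # 0-90 days: Elasticsearch
WARM_DAYS = 365      # 90 days-1 year: S3
COLD_DAYS = 7 * 365  # 1-7 years: Glacier (leap days ignored in this sketch)


def storage_tier(event_time, now=None):
    """Map an event timestamp to 'hot', 'warm', 'cold', or 'delete'."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - event_time).days
    if age_days < HOT_DAYS:
        return "hot"
    if age_days < WARM_DAYS:
        return "warm"
    if age_days < COLD_DAYS:
        return "cold"
    return "delete"  # past the retention period


reference = datetime(2024, 1, 1, tzinfo=timezone.utc)
for days in (10, 90, 400, 3000):
    print(days, storage_tier(reference - timedelta(days=days), reference))
```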

### 4. Complete Security Stack
- M6.1: PII redaction
- M6.2: Secrets management
- M6.3: RBAC
- M6.4: Compliance auditing

**Result**: Enterprise-grade security posture achieved.

---

## Section 2: Readiness Check #1 - ELK Stack Operational

**Requirement**: Kibana accessible at localhost:5601 with audit-logs-* index showing events from last 24 hours; Elasticsearch query returns ≥100 events.

**Pass Criteria**:
- Kibana UI responds at http://localhost:5601
- Index `audit-logs-*` exists
- Event count ≥ 100 in last 24 hours

In [None]:
import requests
from datetime import datetime, timedelta

def check_elk_stack():
    # Expected: Kibana responds, audit-logs-* index exists, ≥100 events in 24h
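    try:
        # Check Kibana accessibility; a non-2xx status counts as a failure
        kibana_url = "http://localhost:5601/api/status"
        resp = requests.get(kibana_url, timeout=5)
        status = "✓" if resp.ok else "✗"
        print(f"{status} Kibana: {resp.status_code}")
    except Exception as e:
        print(f"⚠️ Skipping (Kibana not available): {e}")
        return

    # Check that the index exists, then count events from the last 24 hours
    try:
        idx_url = "http://localhost:9200/_cat/indices/audit-logs-*?format=json"
        idx_resp = requests.get(idx_url, timeout=5)
        indices = idx_resp.json() if idx_resp.ok else []
        status = "✓" if indices else "✗"
        print(f"{status} Indices matching audit-logs-*: {len(indices)}")

        es_url = "http://localhost:9200/audit-logs-*/_count"
        query = {"query": {"range": {"@timestamp": {"gte": "now-24h"}}}}
        resp = requests.post(es_url, json=query, timeout=5)
        resp.raise_for_status()  # surface HTTP errors instead of reporting count=0
        count = resp.json().get('count', 0)
        status = "✓" if count >= 100 else "✗"
        print(f"{status} Events (24h): {count} (required: ≥100)")
    except Exception as e:
        print(f"⚠️ Skipping (Elasticsearch not available): {e}")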

# check_elk_stack()  # Uncomment to run

## Section 3: Readiness Check #2 - Structured Logging

**Requirement**: Application logs include unique `request_id` fields; correlation works between logs and audit events via request_id queries in Kibana.

**Pass Criteria**:
- Application logs contain `request_id` field
- Correlation between logs and audit events via `request_id`
- Kibana query can trace a request across both log types
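
For the criteria above to hold, every log line emitted while handling a request has to carry the same `request_id`. One way to do that with the standard library is sketched below. It is an assumption about how the application might log, not a description of the existing code: `RequestIdFilter`, `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical names, and the timestamp is local time without a zone offset.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled
request_id_var = contextvars.ContextVar("request_id", default=None)


class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the fields are queryable."""

    def format(self, record):
        return json.dumps({
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        })


def build_logger(name="app", stream=None):
    logger = logging.getLogger(name)
    logger.handlers.clear()  # avoid duplicate handlers on notebook re-runs
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger):
    """Simulate one request: an application log and an audit log share an ID."""
    request_id = uuid.uuid4().hex
    request_id_var.set(request_id)
    logger.info("request received")
    logger.info("audit: document read")
    return request_id


handle_request(build_logger())
```

Both printed lines contain the same `request_id`, which is the property a Kibana query on `request_id` relies on.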

In [None]:
def check_structured_logging():
    # Expected: Logs contain request_id, correlation works across log types
    try:
        es_url = "http://localhost:9200/_search"
        query = {
            "size": 1,
            "query": {"exists": {"field": "request_id"}},
            "_source": ["request_id", "@timestamp", "message"]
        }
        resp = requests.post(es_url, json=query, timeout=5)
        resp.raise_for_status()  # treat HTTP errors as failures, not as "no hits"
        hits = resp.json().get('hits', {}).get('hits', [])

        if hits:
            # str() guards against non-string request_id values before slicing
            req_id = str(hits[0]['_source'].get('request_id'))
            print(f"✓ request_id found: {req_id[:16]}...")
            # Correlation is not verified here; confirm it manually in Kibana
            print(f"→ Verify correlation: query Kibana with 'request_id:\"{req_id}\"'")
        else:
            print("✗ No logs with request_id field found")
    except Exception as e:
        print(f"⚠️ Skipping (Elasticsearch not available): {e}")

# check_structured_logging()  # Uncomment to run

## Section 4: Readiness Check #3 - Prometheus Metrics

**Requirement**: The Grafana dashboard at localhost:3000 refreshes every 15 seconds and shows P50 latency (300-600ms baseline), P95 latency (700-1,200ms), request rates, and an error rate below 1%.

**Pass Criteria**:
- Grafana accessible at http://localhost:3000
- Dashboard shows P50 latency: 300-600ms
- Dashboard shows P95 latency: 700-1,200ms
- Error rate < 1%
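
The numeric criteria above can be expressed as a small pass/fail helper. This is a sketch: `evaluate_metrics` is a hypothetical name, and it assumes the error rate is supplied as a fraction (0.01 means 1%).

```python
def evaluate_metrics(p50_ms, p95_ms, error_rate):
    """Compare measured values with the readiness thresholds.

    error_rate is a fraction of requests, so 0.01 corresponds to 1%.
    """
    return {
        "p50": 300 <= p50_ms <= 600,
        "p95": 700 <= p95_ms <= 1200,
        "error_rate": error_rate < 0.01,
    }


for name, passed in evaluate_metrics(p50_ms=450, p95_ms=850, error_rate=0.004).items():
    print(f"{'✓' if passed else '✗'} {name}")
```

A value below a range also fails here because it departs from the stated baseline; whether an unexpectedly fast reading should pass is a judgment call for whoever runs the check.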

In [None]:
def check_prometheus_metrics():
    # Expected: Grafana accessible, P50: 300-600ms, P95: 700-1200ms, errors <1%
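    try:
        grafana_url = "http://localhost:3000/api/health"
        resp = requests.get(grafana_url, timeout=5)
        status = "✓" if resp.ok else "✗"
        print(f"{status} Grafana: {resp.status_code}")
    except Exception as e:
        print(f"⚠️ Skipping (Grafana not available): {e}")
        return

    # Query Prometheus for the latency quantiles (values in ms)
    prom_url = "http://localhost:9090/api/v1/query"
    targets = {"P50": (0.50, 300, 600), "P95": (0.95, 700, 1200)}
    try:
        for name, (q, low, high) in targets.items():
            # sum by (le) merges all label sets into one series per quantile
            promql = (f"histogram_quantile({q}, "
                      "sum by (le) (rate(http_request_duration_ms_bucket[5m])))")
            resp = requests.get(prom_url, params={"query": promql}, timeout=5)
            resp.raise_for_status()
            result = resp.json().get("data", {}).get("result", [])
            if not result:
                print(f"✗ {name}: no data returned for {promql}")
                continue
            value = float(result[0]["value"][1])  # value is [timestamp, "number"]
            status = "✓" if low <= value <= high else "✗"
            print(f"{status} {name}: {value:.0f}ms (expected: {low}-{high}ms)")
        # The error-rate metric name is not defined here; check it on the dashboard
        print("  Error rate (<1%): verify on the Grafana dashboard")
    except Exception as e:
        print(f"⚠️ Skipping (Prometheus not available): {e}")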

# check_prometheus_metrics()  # Uncomment to run

## Section 5: Readiness Check #4 - M6 Completion

**Requirement**: GitHub repository contains commits for M6.1-M6.4; each module's functionality verified.

**Pass Criteria**:
- M6.1: PII detection functional
- M6.2: Vault secrets management operational
- M6.3: RBAC enforcement working
- M6.4: Elasticsearch audit events present

In [None]:
import subprocess
import os

def check_m6_completion():
    # Expected: Git commits for M6.1-M6.4, functionality verified
    if not os.path.exists('.git'):
        print("⚠️ Skipping (not a git repository)")
        return
    
    try:
        # Check for M6-related commits
        result = subprocess.run(['git', 'log', '--oneline', '--all', '--grep=M6'], 
                                capture_output=True, text=True, timeout=5)
        commits = result.stdout.strip().split('\n') if result.stdout.strip() else []
        status = "✓" if commits else "✗"  # zero matching commits is a failure
        print(f"{status} M6 commits found: {len(commits)}")
        
        # Expected: Verify module functionality (stubs)
        modules = ['M6.1 (PII)', 'M6.2 (Vault)', 'M6.3 (RBAC)', 'M6.4 (Audit)']
        for module in modules:
            print(f"  - {module}: Manual verification required")
    except Exception as e:
        print(f"⚠️ Skipping (git check failed): {e}")

# check_m6_completion()  # Uncomment to run

---

## Section 6: Call-Forward to M7.1 - Distributed Tracing

### The Gap We're About to Fill

**Current State**: Prometheus metrics show aggregate P95 latency of 850ms, but provide **zero visibility** into why individual requests exceed this baseline.

**The Problem**: A mysterious 4.2-second query cannot be diagnosed without request-level tracing.
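
A small numeric illustration of why the aggregate hides the outlier. The latencies below are synthetic, chosen only so that the P95 lands on 850ms with one 4.2-second request in the tail; they are not measurements from this system. The nearest-rank percentile is one common definition and may differ from how Prometheus estimates quantiles from histogram buckets.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in ms: 100 requests, one of which took 4.2 seconds
latencies_ms = [400] * 60 + [700] * 34 + [850] + [900, 1000, 1100, 1150, 4200]

print("P50:", percentile_nearest_rank(latencies_ms, 50))
print("P95:", percentile_nearest_rank(latencies_ms, 95))
print("max:", max(latencies_ms))
```

The P95 stays at 850 whether the slowest request took 1.2 seconds or 4.2 seconds, so the dashboard alone cannot show that the outlier exists, let alone which stage caused it.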

### What M7.1 Will Deliver

**OpenTelemetry Instrumentation** enabling:

1. **Request-Level Timing** (millisecond precision)
   - Trace requests through retrieval → reranking → generation stages
   - Identify exact bottlenecks in the pipeline

2. **Service Dependency Visualization**
   - See how failures cascade through the system
   - Understand component interactions

3. **Production Optimization Evidence**
   - Example: discover that OpenAI calls consume 85% of request latency
   - Enable data-driven model selection decisions
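
To make the idea of a per-request breakdown concrete before M7.1, here is a standard-library stand-in for span timing. It is deliberately not the OpenTelemetry API: the `Trace` class, its `span` and `breakdown` methods, and the sleep durations that simulate each stage are all invented for this sketch.

```python
import time
from contextlib import contextmanager


class Trace:
    """Collects (stage name, duration in ms) pairs for a single request."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans.append((name, (time.perf_counter() - start) * 1000))

    def breakdown(self):
        """Return each stage's share of total time (stage names assumed unique)."""
        total = sum(duration for _, duration in self.spans)
        return {name: duration / total for name, duration in self.spans} if total else {}


def handle_query(trace):
    # time.sleep stands in for the real work of each pipeline stage
    with trace.span("retrieval"):
        time.sleep(0.005)
    with trace.span("reranking"):
        time.sleep(0.005)
    with trace.span("generation"):
        time.sleep(0.05)


trace = Trace()
handle_query(trace)
for stage, share in trace.breakdown().items():
    print(f"{stage}: {share:.0%}")
```

Real tracing adds what this sketch lacks: parent/child span relationships, propagation of trace context across services, and export to a backend such as Jaeger.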

### Three Core Capabilities

| Capability | Value |
|------------|-------|
| **Per-Request Breakdown** | Identify which operation causes slowness in each request |
| **Cascading Failure Detection** | Visualize how one slow component impacts entire pipeline |
| **Jaeger UI** | Interactive trace visualization and analysis |

### Module 7 Roadmap (155 minutes total)

- **M7.1** (42 min): OpenTelemetry instrumentation - "Why was THIS query slow?"
- **M7.2** (38 min): Unified observability - Link metrics → logs → traces
- **M7.3** (35 min): Performance profiling - CPU/memory hotspots
- **M7.4** (40 min): Trace-based SLI monitoring and anomaly detection

**Next Step**: Once all readiness checks pass, begin M7.1 to gain request-level visibility into system performance.