# Bridge L3.M6.4 → L3.M7.1 Readiness Validation
## Security Complete → Observability Begins

## Purpose

Module 6.4 delivered enterprise security with audit trails, GDPR automation, and compliance logging through ELK and structured logging. Module 7.1 will shift from aggregate metrics to **request-level observability** using distributed tracing. This bridge validates that the security foundation is operational and metrics infrastructure is ready before adding trace instrumentation.

## Concepts Covered

- **ELK Stack readiness**: Verifying Kibana and Elasticsearch audit event storage
- **Structured logging validation**: Confirming request_id correlation across log types
- **Prometheus metrics baseline**: Checking P50/P95 latency targets for trace comparison
- **Security module verification**: Git commit evidence and functionality checks

## After Completing

After completing this bridge, you should be able to:

- Verify ELK stack is operational with ≥100 audit events in the last 24 hours
- Confirm structured logging includes request_id fields for correlation
- Validate Grafana shows P50 (300-600ms) and P95 (700-1,200ms) baselines
- Check M6.1-M6.4 commits exist and security features are functional
- Understand the gap distributed tracing will fill (aggregate vs. per-request visibility)

## Context in Track

**Bridge L3.M6.4 → L3.M7.1**: Security Complete → Observability Begins

This validation ensures the security and metrics infrastructure from Module 6 is operational before Module 7 adds OpenTelemetry tracing to diagnose per-request performance bottlenecks.

---

## Run Locally

**Windows**:
```powershell
$env:PYTHONPATH = "$PWD"; jupyter notebook
```

**macOS/Linux**:
```bash
PYTHONPATH="$PWD" jupyter notebook
```

**Note**: External services (Kibana, Grafana, Elasticsearch, Prometheus) are optional. Checks will skip gracefully if services are unavailable.

---

## Section 1: M6.4 Delivery Recap

Module 6.4 completed the **enterprise security foundation** with four major achievements:

### 1. Comprehensive Audit Trail
- **ELK Stack Implementation**: Captures WHO (user_id, role, IP), WHAT (action, resource), WHEN (timestamp), and outcome
- **Tamper-Proof Storage**: Hash chaining for audit integrity
- **Centralized Aggregation**: Elasticsearch for compliance queries
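
Hash chaining can be illustrated with a minimal sketch: each audit record stores a SHA-256 digest computed over the previous record's hash and its own content, so editing any earlier record invalidates every hash after it. The record layout, the all-zero starting hash, and the `chain_events`/`verify_chain` helpers are hypothetical and are not taken from the M6.4 implementation.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed starting value for the first record


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous record's hash together with this event's content."""
    payload = prev_hash + json.dumps(event, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chain_events(events: list[dict]) -> list[dict]:
    """Return records that each carry the hash of their predecessor."""
    records, prev_hash = [], GENESIS_HASH
    for event in events:
        record_hash = _digest(prev_hash, event)
        records.append({"event": event, "prev_hash": prev_hash, "hash": record_hash})
        prev_hash = record_hash
    return records


def verify_chain(records: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered record fails the check."""
    prev_hash = GENESIS_HASH
    for record in records:
        if record["prev_hash"] != prev_hash:
            return False
        if _digest(prev_hash, record["event"]) != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True
```

Running `verify_chain` over stored records is one way to produce integrity evidence during an audit: changing a single field in an old event makes the function return `False`.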

### 2. GDPR Automation
- **Efficiency Gain**: 40 hours manual work → 5 minutes automated
- **Right-to-Erasure**: Automated across systems with audit proof
- **Consent Tracking**: Linked to processing actions

### 3. Retention Policies
- **Tiered Storage**:
  - Hot: 0-90 days (Elasticsearch)
  - Warm: 90 days-1 year (S3)
  - Cold: 1-7 years (Glacier)
- **Automatic Deletion**: After retention periods
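
The tier boundaries above can be expressed as a small lookup. This is a sketch under stated assumptions: one year is treated as 365 days, an event exactly 90 days old already counts as warm, and the `storage_tier` function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Tier boundaries taken from the retention policy listed above.
HOT_MAX = timedelta(days=90)
WARM_MAX = timedelta(days=365)       # assumption: 1 year = 365 days
COLD_MAX = timedelta(days=7 * 365)   # assumption: 7 years = 2555 days


def storage_tier(event_time: datetime, now: datetime) -> str:
    """Map an audit event's age to its storage tier, or 'delete' once expired."""
    age = now - event_time
    if age < HOT_MAX:
        return "hot"     # Elasticsearch
    if age < WARM_MAX:
        return "warm"    # S3
    if age < COLD_MAX:
        return "cold"    # Glacier
    return "delete"      # past the retention period


# Example: an event from 1 January 2024 evaluated on 1 June 2024 (152 days old).
event_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(storage_tier(event_time, datetime(2024, 6, 1, tzinfo=timezone.utc)))  # warm
```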

### 4. Complete Security Stack
- M6.1: PII redaction
- M6.2: Secrets management
- M6.3: RBAC
- M6.4: Compliance auditing

**Result**: Enterprise-grade security posture achieved.

---

## Section 2: Readiness Check #1 - ELK Stack Operational

**Requirement**: Kibana accessible at localhost:5601 with audit-logs-* index showing events from last 24 hours; Elasticsearch query returns ≥100 events.

**Pass Criteria**:
- Kibana UI responds at http://localhost:5601
- Index `audit-logs-*` exists
- Event count ≥ 100 in last 24 hours

### Check ELK Stack

This function verifies Kibana is accessible and Elasticsearch contains sufficient audit events. If services are unavailable, the check skips gracefully with a warning.

```python
import requests

OFFLINE_MODE = False  # Set to True to skip all external service checks

def check_elk_stack():
    """Verify Kibana accessibility and Elasticsearch audit event count."""
    if OFFLINE_MODE:
        print("⚠️ Offline mode: Skipping ELK stack check")
        return

    # Check Kibana accessibility; a non-2xx status is reported as a failure
    try:
        kibana_url = "http://localhost:5601/api/status"
        resp = requests.get(kibana_url, timeout=5)
        status = "✓" if resp.ok else "✗"
        print(f"{status} Kibana: {resp.status_code}")
    except requests.RequestException:
        print("⚠️ Skipping (Kibana not available)")
        return

    # Count audit events from the last 24 hours across audit-logs-* indices
    try:
        es_url = "http://localhost:9200/audit-logs-*/_count"
        query = {"query": {"range": {"@timestamp": {"gte": "now-24h"}}}}
        resp = requests.post(es_url, json=query, timeout=5)
        resp.raise_for_status()  # Treat HTTP errors as "not available"
        count = resp.json().get('count', 0)
        status = "✓" if count >= 100 else "✗"
        print(f"{status} Events (24h): {count} (required: ≥100)")
    except (requests.RequestException, ValueError):
        print("⚠️ Skipping (Elasticsearch not available)")

# Uncomment to run: check_elk_stack()
```

## Section 3: Readiness Check #2 - Structured Logging

**Requirement**: Application logs include unique `request_id` fields; correlation works between logs and audit events via request_id queries in Kibana.

**Pass Criteria**:
- Application logs contain `request_id` field
- Correlation between logs and audit events via `request_id`
- Kibana query can trace a request across both log types

### Check Structured Logging

This function searches Elasticsearch for a log entry that contains a `request_id` field and prints a Kibana query for that ID. It confirms the field is present; checking that the same ID appears in both application logs and audit events is a follow-up query in Kibana.

```python
def check_structured_logging():
    """Verify logs contain request_id for cross-service correlation."""
    if OFFLINE_MODE:
        print("⚠️ Offline mode: Skipping structured logging check")
        return

    try:
        # Fetch one document that has a request_id field, from any index
        es_url = "http://localhost:9200/_search"
        query = {
            "size": 1,
            "query": {"exists": {"field": "request_id"}},
            "_source": ["request_id", "@timestamp", "message"]
        }
        resp = requests.post(es_url, json=query, timeout=5)
        resp.raise_for_status()  # Treat HTTP errors as "not available"
        hits = resp.json().get('hits', {}).get('hits', [])

        if hits:
            # str() guards against numeric IDs, which cannot be sliced
            req_id = str(hits[0]['_source'].get('request_id'))
            print(f"✓ request_id found: {req_id[:16]}...")
            print(f"✓ Correlation: Query Kibana with 'request_id:\"{req_id}\"'")
        else:
            print("✗ No logs with request_id field found")
    except (requests.RequestException, ValueError):
        print("⚠️ Skipping (Elasticsearch not available)")

# Uncomment to run: check_structured_logging()
```
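
The third pass criterion, tracing one request across both log types, could be checked with a query along these lines. This is a sketch, not part of the notebook: `audit-logs-*` comes from Check #1, while the `app-logs-*` pattern, the helper names, and the assumption that `request_id` is mapped as a keyword field are all hypothetical.

```python
from collections import defaultdict

# Assumed index patterns: audit-logs-* is used earlier; app-logs-* is hypothetical.
CORRELATION_INDICES = "app-logs-*,audit-logs-*"


def build_correlation_query(request_id: str, size: int = 100) -> dict:
    """Build a search body matching one request_id, oldest entry first."""
    return {
        "size": size,
        # Assumes request_id is mapped as a keyword (exact-match) field.
        "query": {"term": {"request_id": request_id}},
        "sort": [{"@timestamp": {"order": "asc"}}],
    }


def group_hits_by_index(hits: list[dict]) -> dict[str, list[dict]]:
    """Group search hits by source index to see which log types share the ID."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["_index"]].append(hit["_source"])
    return dict(grouped)
```

Posting the body to `http://localhost:9200/app-logs-*,audit-logs-*/_search` and passing `resp.json()["hits"]["hits"]` to `group_hits_by_index` shows whether both index families returned entries for the same request.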

## Section 4: Readiness Check #3 - Prometheus Metrics

**Requirement**: Grafana dashboard at localhost:3000 updates every 15 seconds showing P50 (300-600ms baseline), P95 (700-1,200ms), request rates, and error rates <1%.

**Pass Criteria**:
- Grafana accessible at http://localhost:3000
- Dashboard shows P50 latency: 300-600ms
- Dashboard shows P95 latency: 700-1,200ms
- Error rate < 1%

### Check Prometheus Metrics

This function verifies that Grafana is reachable and prints the PromQL queries for the expected latency baselines so they can be confirmed manually on the dashboard. These metrics establish the pre-tracing performance baseline for comparison after M7.1 instrumentation.

```python
def check_prometheus_metrics():
    """Verify Grafana accessibility and print expected latency metric queries."""
    if OFFLINE_MODE:
        print("⚠️ Offline mode: Skipping Prometheus metrics check")
        return

    try:
        grafana_url = "http://localhost:3000/api/health"
        resp = requests.get(grafana_url, timeout=5)
        status = "✓" if resp.ok else "✗"
        print(f"{status} Grafana: {resp.status_code}")

        # Display expected Prometheus queries for manual verification.
        # sum by (le) aggregates all series into one quantile per query.
        print("ℹ️ Expected metrics (verify manually): P50=300-600ms, P95=700-1200ms, errors<1%")
        print("  P50 query: histogram_quantile(0.50, sum by (le) (rate(http_request_duration_ms_bucket[5m])))")
        print("  P95 query: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m])))")
    except requests.RequestException:
        print("⚠️ Skipping (Grafana not available)")

# Uncomment to run: check_prometheus_metrics()
```
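
The pass criteria can also be evaluated programmatically once the values are available. The sketch below assumes the values come from the Prometheus HTTP API (`/api/v1/query`), whose instant-query responses carry samples as `[timestamp, "value"]` pairs; the `first_sample` and `check_baseline` helpers are hypothetical names, not part of the notebook.

```python
from typing import Optional


def first_sample(response: dict) -> Optional[float]:
    """Extract the first sample value from an instant-query response, if any."""
    if response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Instant vectors carry samples as [unix_timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def check_baseline(p50_ms: float, p95_ms: float, error_rate: float) -> dict[str, bool]:
    """Compare measured values with the pass criteria from this section."""
    return {
        "p50": 300 <= p50_ms <= 600,
        "p95": 700 <= p95_ms <= 1200,
        "error_rate": error_rate < 0.01,  # error rate as a fraction, not a percent
    }
```

For example, `check_baseline(450, 850, 0.004)` reports all three criteria as met, while a P95 of 1,500 ms would flip the `p95` entry to `False`.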

## Section 5: Readiness Check #4 - M6 Completion

**Requirement**: GitHub repository contains commits for M6.1-M6.4; each module's functionality verified.

**Pass Criteria**:
- M6.1: PII detection functional
- M6.2: Vault secrets management operational
- M6.3: RBAC enforcement working
- M6.4: Elasticsearch audit events present

### Check M6 Module Completion

This function searches git history for M6-related commits, providing evidence that security modules were completed. Manual verification of functionality is still required.

```python
import subprocess

def check_m6_completion():
    """Verify git commits exist for M6.1-M6.4 security modules."""
    if OFFLINE_MODE:
        print("⚠️ Offline mode: Skipping M6 completion check")
        return

    try:
        # Ask git directly so the check also works from a subdirectory
        inside = subprocess.run(
            ['git', 'rev-parse', '--is-inside-work-tree'],
            capture_output=True, text=True, timeout=5
        )
        if inside.returncode != 0:
            print("⚠️ Skipping (not a git repository)")
            return

        # Check for M6-related commits in git history
        result = subprocess.run(
            ['git', 'log', '--oneline', '--all', '--grep=M6'],
            capture_output=True, text=True, timeout=5
        )
        commits = result.stdout.splitlines()
        status = "✓" if commits else "✗"  # Zero matches is a failed check
        print(f"{status} M6 commits found: {len(commits)}")

        # List modules requiring manual verification
        modules = ['M6.1 (PII)', 'M6.2 (Vault)', 'M6.3 (RBAC)', 'M6.4 (Audit)']
        for module in modules:
            print(f"  - {module}: Manual verification required")
    except (OSError, subprocess.SubprocessError):
        print("⚠️ Skipping (git check failed)")

# Uncomment to run: check_m6_completion()
```
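
A single `--grep=M6` count does not show whether each of the four modules has a commit. A per-module breakdown could be sketched as follows, assuming commit subjects contain tags such as `M6.1` verbatim; the `commits_per_module` helper is hypothetical and works on the text output of `git log --oneline`.

```python
import re

MODULES = ("M6.1", "M6.2", "M6.3", "M6.4")


def commits_per_module(oneline_log: str) -> dict[str, int]:
    """Count commit subjects that mention each M6 module tag."""
    counts = {module: 0 for module in MODULES}
    for line in oneline_log.splitlines():
        for module in MODULES:
            # \b keeps "M6.1" from matching "M6.10"; assumes tags appear verbatim.
            if re.search(rf"\b{re.escape(module)}\b", line):
                counts[module] += 1
    return counts
```

Any module whose count is zero is a candidate for closer manual verification before moving on.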

---

## Section 6: Call-Forward to M7.1 - Distributed Tracing

### The Gap We're About to Fill

**Current State**: Prometheus metrics show aggregate P95 latency of 850ms, but provide **zero visibility** into why individual requests exceed this baseline.

**The Problem**: A mysterious 4.2-second query cannot be diagnosed without request-level tracing.
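
A small calculation shows why an aggregate percentile cannot surface such a request. The latencies below are synthetic and chosen only for illustration (99 ordinary requests plus one 4.2-second outlier); the nearest-rank `percentile` helper is a simplification of how Prometheus estimates quantiles from histogram buckets.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: 99 ordinary requests plus one 4.2-second outlier.
latencies_ms = [380 + 5 * i for i in range(99)] + [4200]

p95 = percentile(latencies_ms, 95)
slowest = max(latencies_ms)
print(f"P95 = {p95} ms, slowest request = {slowest} ms")
```

The P95 stays at 850 ms whether or not the outlier is present, and nothing in the aggregate says which request was slow or which stage caused it.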

### What M7.1 Will Deliver

**OpenTelemetry Instrumentation** enabling:

1. **Request-Level Timing** (millisecond precision)
   - Trace requests through retrieval → reranking → generation stages
   - Identify exact bottlenecks in the pipeline

2. **Service Dependency Visualization**
   - See how failures cascade through the system
   - Understand component interactions

3. **Production Optimization Evidence**
   - Example: Discover OpenAI calls consume 85% of latency
   - Enable data-driven model selection decisions
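
The idea of per-stage timing can be previewed with the standard library alone. The sketch below is a stand-in for what spans record, not the OpenTelemetry API: `timed_stage`, `handle_request`, and the sleep durations are hypothetical placeholders for the real pipeline.

```python
import time
from contextlib import contextmanager

spans: list[tuple[str, float]] = []  # (stage name, duration in ms) for one request


@contextmanager
def timed_stage(name: str):
    """Record how long the wrapped block takes, like a simplified span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((name, (time.perf_counter() - start) * 1000))


def handle_request() -> None:
    """Placeholder pipeline; the sleeps stand in for real work."""
    with timed_stage("retrieval"):
        time.sleep(0.01)
    with timed_stage("reranking"):
        time.sleep(0.01)
    with timed_stage("generation"):
        time.sleep(0.03)


handle_request()
total_ms = sum(duration for _, duration in spans)
for name, duration in spans:
    # Show each stage's duration and its share of the request total.
    print(f"{name:<11} {duration:7.1f} ms  {duration / total_ms:5.1%}")
```

Real tracing adds what this sketch lacks: a trace ID shared across services, parent/child span relationships, and export to a backend such as Jaeger.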

### Three Core Capabilities

| Capability | Value |
|------------|-------|
| **Per-Request Breakdown** | Identify which operation causes slowness in each request |
| **Cascading Failure Detection** | Visualize how one slow component impacts entire pipeline |
| **Jaeger UI** | Interactive trace visualization and analysis |

### Module 7 Roadmap (155 minutes total)

- **M7.1** (42 min): OpenTelemetry instrumentation - "Why was THIS query slow?"
- **M7.2** (38 min): Unified observability - Link metrics → logs → traces
- **M7.3** (35 min): Performance profiling - CPU/memory hotspots
- **M7.4** (40 min): Trace-based SLI monitoring and anomaly detection

**Next Step**: Once all readiness checks pass, begin M7.1 to gain request-level visibility into system performance.