# Bridge L3.M5.3 → L3.M5.4 Readiness Validation
**From:** Data Quality & Validation  
**To:** Vector Index Management

---

## 1. Recap: What M5.3 (Data Quality & Validation) Shipped

The previous module delivered four key capabilities:

### 1.1 Quality Scoring Algorithms
Scoring that detects corrupted text and OCR failures with **>80% accuracy** using character-distribution analysis.

### 1.2 Duplicate Detection at Scale
MinHash with locality-sensitive hashing (LSH), identifying near-duplicates across millions of chunks with a **<5% false-positive rate**.

### 1.3 Drift Monitoring
Chi-square tests that alert when document distributions change significantly, catching corpus shifts before they cause downstream impact.
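
As a reminder of how a chi-square drift check works, here is a minimal pure-Python sketch. It is an illustration, not the M5.3 implementation: the function names, the document-type categories, and the counts are all hypothetical, and it assumes every category in the current window also appears in the baseline with a non-zero count. The critical values are the standard chi-square table entries for a 0.05 significance level.

```python
from collections import Counter

# Standard chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(baseline: Counter, current: Counter) -> float:
    """Goodness-of-fit statistic of `current` against the baseline proportions."""
    baseline_total = sum(baseline.values())
    current_total = sum(current.values())
    statistic = 0.0
    for category, baseline_count in baseline.items():
        # Expected count if the current window followed the baseline mix.
        expected = current_total * baseline_count / baseline_total
        observed = current.get(category, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


def has_drifted(baseline: Counter, current: Counter) -> bool:
    """Flag drift when the statistic exceeds the 0.05 critical value."""
    degrees_of_freedom = len(baseline) - 1
    return chi_square_statistic(baseline, current) > CRITICAL_05[degrees_of_freedom]


# Hypothetical document-type counts: a baseline corpus and two recent windows.
baseline = Counter({"pdf": 600, "html": 300, "docx": 100})
same_mix = Counter({"pdf": 62, "html": 28, "docx": 10})
shifted = Counter({"pdf": 30, "html": 60, "docx": 10})

print(has_drifted(baseline, same_mix))
print(has_drifted(baseline, shifted))
```

The lookup table only covers up to four degrees of freedom; a corpus with more categories would need a larger table or a statistics library.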

### 1.4 Grafana Quality Dashboards
Near-real-time visibility into quality metrics, with new values surfaced in **under 30 seconds**.

---

## 2. Readiness Check #1: Quality Validation Pipeline Active

**Requirement:** Quality validation is actively running in the pipeline, with recent metrics logged.

This check verifies that the data quality systems from M5.3 are operational and producing metrics.

In [None]:
import os
from datetime import datetime, timedelta

# Check for quality metrics log/database
METRICS_PATH = os.getenv("QUALITY_METRICS_PATH", "./quality_metrics.json")

if not os.path.exists(METRICS_PATH):
    print("⚠️ Skipping (no quality metrics file found)")
    print(f"   Expected: {METRICS_PATH}")
else:
    # Expected: Recent metrics within last 24h
    # Expected: quality_score, duplicate_rate, drift_score present
    print("✓ Quality metrics file found")
    print(f"  Location: {METRICS_PATH}")
    # In production: verify the latest timestamp is under 24h old and all three fields are present
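
The production check described in the comments above could look roughly like the sketch below. The metrics file format is not specified in this notebook, so the record layout is an assumption: a JSON object with an ISO 8601 `timestamp` field alongside `quality_score`, `duplicate_rate`, and `drift_score`. The function name `validate_metrics` and the sample records are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("quality_score", "duplicate_rate", "drift_score")
MAX_AGE = timedelta(hours=24)


def validate_metrics(record: dict, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    raw_timestamp = record.get("timestamp")  # assumed field name
    if raw_timestamp is None:
        problems.append("missing field: timestamp")
        return problems
    logged_at = datetime.fromisoformat(raw_timestamp)
    if logged_at.tzinfo is None:
        # Assumption: timestamps without an offset are in UTC.
        logged_at = logged_at.replace(tzinfo=timezone.utc)
    if now - logged_at > MAX_AGE:
        problems.append(f"stale metrics: logged at {raw_timestamp}")
    return problems


# Hypothetical records, with a fixed "now" so the example is deterministic.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = json.loads(
    '{"timestamp": "2024-01-02T06:00:00+00:00", "quality_score": 0.91,'
    ' "duplicate_rate": 0.03, "drift_score": 0.1}'
)
stale = {"timestamp": "2023-12-30T06:00:00+00:00", "quality_score": 0.91}

print(validate_metrics(fresh, now))
print(validate_metrics(stale, now))
```

Returning a list of problems rather than a boolean keeps the readiness report specific about what to fix before moving on.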

## 3. Readiness Check #2: Minimum Vector Count in Pinecone

**Requirement:** At least 5,000 vectors are already in Pinecone, enough for meaningful testing.

⚠️ **Note:** Backup/restore operations can overwhelm the free Pinecone tier. Test on a subset first.

In [None]:
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_INDEX = os.getenv("PINECONE_INDEX", "rag-index")

if not PINECONE_API_KEY:
    print("⚠️ Skipping (no PINECONE_API_KEY)")
else:
    try:
        # Stub: In production, connect and query index stats
        # from pinecone import Pinecone
        # pc = Pinecone(api_key=PINECONE_API_KEY)
        # index = pc.Index(PINECONE_INDEX)
        # stats = index.describe_index_stats()
        # vector_count = stats.total_vector_count
        
        # Expected: vector_count >= 5000
        # Until the stub above is enabled, this only confirms the key is set.
        print(f"✓ PINECONE_API_KEY set; target index: {PINECONE_INDEX}")
        print("  Expected: ≥5,000 vectors for meaningful M5.4 testing")
    except Exception as e:
        print(f"⚠️ Pinecone check failed: {e}")
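
Once the stub is enabled, the threshold comparison itself is small. The sketch below wraps it in a helper that accepts any object exposing `describe_index_stats()` with a `total_vector_count` attribute, as in the commented code above. `check_vector_count` and `FakeIndex` are hypothetical names; the fake exists only so the example runs without credentials.

```python
from types import SimpleNamespace

MIN_VECTORS = 5_000


def check_vector_count(index, minimum: int = MIN_VECTORS) -> tuple[bool, int]:
    """Return (ready, count) based on the index's reported total vector count."""
    stats = index.describe_index_stats()
    count = stats.total_vector_count
    return count >= minimum, count


class FakeIndex:
    """Hypothetical stand-in for a Pinecone index handle."""

    def __init__(self, total: int):
        self._total = total

    def describe_index_stats(self):
        return SimpleNamespace(total_vector_count=self._total)


small = check_vector_count(FakeIndex(1_200))
large = check_vector_count(FakeIndex(12_000))
print(small, large)
```

In the notebook, the same helper would be called with the real `index` object from the stub instead of `FakeIndex`.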

## 4. Readiness Check #3: Required Metadata Fields

**Requirement:** Vector metadata includes `document_id` and `version` fields for targeted operations.

These fields enable targeted backup/restore and blue-green deployments in M5.4.

In [None]:
if not PINECONE_API_KEY:
    print("⚠️ Skipping (no PINECONE_API_KEY)")
else:
    try:
        # Stub: Query sample vectors and check metadata
        # (reuses `pc` from the Check #2 stub; 1536 is an assumed dimension
        # and must match the index)
        # index = pc.Index(PINECONE_INDEX)
        # sample = index.query(vector=[0.1]*1536, top_k=1, include_metadata=True)
        # metadata = sample['matches'][0]['metadata']
        # assert 'document_id' in metadata
        # assert 'version' in metadata

        # Expected: All vectors have 'document_id' and 'version' in metadata
        print("Metadata schema check (stub, nothing queried yet)")
        print("  Expected fields: document_id, version")
    except Exception as e:
        print(f"⚠️ Metadata check failed: {e}")
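
A single `top_k=1` sample says little about the rest of the index, so it can help to inspect a larger batch of matches and report which ones are incomplete. The sketch below assumes matches are dict-like with `id` and `metadata` keys, mirroring the access pattern in the stub above. The helper name and the sample matches are hypothetical.

```python
REQUIRED_METADATA = ("document_id", "version")


def missing_metadata(matches, required=REQUIRED_METADATA) -> dict[str, list[str]]:
    """Map each match id to the required fields its metadata lacks."""
    gaps = {}
    for match in matches:
        metadata = match.get("metadata") or {}  # treat absent metadata as empty
        missing = [field for field in required if field not in metadata]
        if missing:
            gaps[match["id"]] = missing
    return gaps


# Hypothetical query results: one complete, one partial, one with no metadata.
sample_matches = [
    {"id": "chunk-1", "metadata": {"document_id": "doc-42", "version": "v3"}},
    {"id": "chunk-2", "metadata": {"document_id": "doc-42"}},
    {"id": "chunk-3"},
]

print(missing_metadata(sample_matches))
```

Even a larger sample is still a sample: an empty result suggests the schema is in place but does not prove every vector carries both fields.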

## 5. Readiness Check #4: Prometheus Metrics for Index Size

**Requirement:** Prometheus is tracking index size through the `rag_documents_in_index` metric.

This metric is essential for M5.4's index health monitoring and alerting.

In [None]:
PROMETHEUS_URL = os.getenv("PROMETHEUS_URL", "http://localhost:9090")

try:
    # Stub: Query Prometheus for the metric
    # import requests
    # response = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
    #                        params={'query': 'rag_documents_in_index'},
    #                        timeout=5)
    # result = response.json()['data']['result']
    # if result:
    #     metric_value = result[0]['value'][1]

    # Expected: Metric 'rag_documents_in_index' exists and is being tracked
    # Until the stub is enabled, nothing below contacts Prometheus.
    print(f"Prometheus endpoint (not yet queried): {PROMETHEUS_URL}")
    print("  Expected metric: rag_documents_in_index")
except Exception as e:
    print(f"⚠️ Skipping (Prometheus not accessible): {e}")
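
The response handling in the stub can be separated from the HTTP call, which makes it easy to test offline. This sketch parses the JSON shape returned by Prometheus instant queries (`status`, then `data.result[].value` as a `[timestamp, "value"]` pair). The function name and the sample payloads, including the value `8421`, are made up for illustration.

```python
from typing import Optional


def latest_metric_value(payload: dict) -> Optional[float]:
    """Extract the first sample value from an instant-query response, if any."""
    if payload.get("status") != "success":
        return None
    results = payload.get("data", {}).get("result", [])
    if not results:
        return None  # metric is not being scraped or has no current sample
    _timestamp, raw_value = results[0]["value"]
    return float(raw_value)  # Prometheus encodes sample values as strings


# Hypothetical payloads: metric present, and metric absent.
present = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "rag_documents_in_index"},
             "value": [1700000000.0, "8421"]}
        ],
    },
}
absent = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(latest_metric_value(present))
print(latest_metric_value(absent))
```

A `None` result is the signal that `rag_documents_in_index` still needs to be exported before M5.4's alerting can rely on it.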

---

## 6. Call-Forward: What M5.4 (Vector Index Management) Will Introduce

Having validated that your data quality systems are operational and your vector infrastructure is ready, you're now prepared to tackle the next critical challenge: **infrastructure resilience**.

### The Problem
RAG systems face three major operational risks:
- **Data loss** from accidental deletions or corruption
- **Downtime** during index updates or schema migrations  
- **Performance degradation** that goes undetected until users complain

### What M5.4 Delivers

#### 6.1 Automated Backup and Restore
- **Nightly backups** with verification checksums
- **Recovery in minutes** rather than hours
- Enables confident experimentation and rollback

#### 6.2 Blue-Green Deployments
- **Zero-downtime** index migrations
- **Instant version switching** between old and new indices
- **Rollback capability** if issues are detected

#### 6.3 Index Health Monitoring
- **Automated alerts** for query latency spikes
- **Index size tracking** to prevent capacity issues
- **Corruption detection** before it impacts production

### Driving Question
*How do we ensure our RAG infrastructure is resilient, recoverable, and always available?*

M5.4 will answer this by implementing the operational safety net every production RAG system requires.

---

**Next Step:** Proceed to Module 5.4 to implement Vector Index Management.