# M1.3 → M1.4 Bridge Validation

## Purpose

You've built a complete document processing pipeline in M1.3 that extracts, chunks, enriches, embeds, and stores documents in Pinecone. Before moving to M1.4's query pipeline, you must validate that your processed data meets minimum quality and completeness thresholds. Without this validation, query failures in M1.4 will be ambiguous: you won't know if the problem is in retrieval logic or missing data. This notebook bridges the gap by confirming your vector database is query-ready.

## Concepts Covered

- **Vector database health checks:** Verifying minimum vector count and metadata completeness
- **Pipeline documentation:** Ensuring chunking parameters are recorded for debugging
- **Basic semantic search:** Testing retrieval with three query types (factual, how-to, comparison)
- **Offline-friendly validation:** Graceful skipping when API keys are unavailable

## After Completing

You will be able to:
- ✅ Confirm your Pinecone index has ≥100 vectors with required metadata fields
- ✅ Validate that sample retrieval returns expected metadata keys (source, chunk_id, content_type)
- ✅ Document your chunking strategy for future debugging
- ✅ Run basic semantic queries to verify the pipeline works end-to-end
- ✅ Identify specific gaps that need remediation before M1.4

## Context in Track

**Bridge:** L1.M1.3 (Document Processing) → L1.M1.4 (Query Pipeline)  
**Track:** CCC RAG Fundamentals - Module 1  
**Duration:** 15-20 minutes

---

## Run Locally (Windows-first)

```powershell
# Set environment variables
$env:PINECONE_API_KEY="your-key-here"
$env:OPENAI_API_KEY="your-key-here"
$env:PINECONE_INDEX_NAME="rag-index"

# Launch notebook
$env:PYTHONPATH = "$PWD"
jupyter notebook
```

**Linux/Mac:**
```bash
export PINECONE_API_KEY="your-key-here"
export OPENAI_API_KEY="your-key-here"
export PINECONE_INDEX_NAME="rag-index"
jupyter notebook
```

**Note:** Notebook runs offline-friendly. Cells skip gracefully if API keys are missing.

---

## Section 1: Recap - Pipeline Architecture

**M1.3 Built:** Complete 6-stage document processing pipeline

```
extract → clean → chunk → enrich → embed → store
```

**Key Components:**
- ✓ Semantic chunking (15-20% overlap, sentence boundaries)
- ✓ Metadata enrichment (source, chunk_id, content_type)
- ✓ Embedding generation (OpenAI/sentence-transformers)
- ✓ Vector storage (Pinecone)

**What's Next:** Query understanding, hybrid retrieval, reranking, response generation

---

## Section 2: Check - Pinecone Vector Count

**Requirement:** ≥100 vectors with complete metadata

**What this cell does:** Connects to Pinecone and queries index statistics to verify vector count meets the minimum threshold of 100 vectors. Skips gracefully if `PINECONE_API_KEY` is not set.

In [None]:
import os

# Offline-friendly: Check for API key before attempting connection
api_key = os.getenv("PINECONE_API_KEY")
index_name = os.getenv("PINECONE_INDEX_NAME", "rag-index")
index = None

if not api_key:
    print("⚠️  SKIP: PINECONE_API_KEY not set")
    print("   Set environment variable to run this check")
    print("   Example: export PINECONE_API_KEY='your-key-here'")
else:
    try:
        from pinecone import Pinecone
        
        pc = Pinecone(api_key=api_key)
        index = pc.Index(index_name)
        stats = index.describe_index_stats()
        
        vector_count = stats.get('total_vector_count', 0)
        print(f"📊 Vector Count: {vector_count}")
        
        if vector_count >= 100:
            print(f"✅ PASS: {vector_count} vectors (≥100 required)")
        else:
            print(f"❌ FAIL: {vector_count} vectors (need ≥100)")
            print("   → Return to M1.3 to process more documents")
            
    except Exception as e:
        print(f"⚠️  ERROR: {str(e)}")
        print("   Cannot connect to Pinecone - check credentials and network")
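
The threshold check can also be separated from the network call so it can be exercised offline. The sketch below assumes the stats payload is a mapping with `total_vector_count` and a `namespaces` dictionary of per-namespace `vector_count` values. The function name `summarize_stats` and the sample payload are hypothetical. Breaking the count down by namespace helps when the total passes but the vectors sit in a namespace your queries do not target.

```python
def summarize_stats(stats, minimum=100):
    """Summarize an index-stats mapping (shape assumed, see lead-in)."""
    total = stats.get("total_vector_count", 0)
    namespaces = stats.get("namespaces") or {}
    # Label the empty-string namespace so it is visible in the output
    per_namespace = {
        (name or "(default)"): ns.get("vector_count", 0)
        for name, ns in namespaces.items()
    }
    return {"total": total, "passed": total >= minimum, "per_namespace": per_namespace}

# Hypothetical payload for illustration; not real index data
sample_stats = {
    "dimension": 1536,
    "total_vector_count": 120,
    "namespaces": {"": {"vector_count": 20}, "docs": {"vector_count": 100}},
}
summary = summarize_stats(sample_stats)
print(summary)
```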

---

## Section 3: Check - Sample Retrieval & Metadata

**Requirement:** Sample vectors have source, chunk_id, content_type metadata

**What this cell does:** Queries Pinecone with a dummy vector to retrieve 3 sample records, then validates that all required metadata fields are present. This ensures M1.4's filtering and source attribution features will work.

In [None]:
# Offline-friendly: Skip if no API key or index unavailable
if not api_key:
    print("⚠️  SKIP: No API key - cannot test retrieval")
elif index is None:
    print("⚠️  SKIP: Pinecone index not available from previous step")
else:
    try:
        # Create dummy query vector (zeros for sampling)
        # Length must match the index dimension (1536 for text-embedding-ada-002)
        dummy_vector = [0.0] * 1536
        
        # Query for 3 samples
        results = index.query(vector=dummy_vector, top_k=3, include_metadata=True)
        
        if len(results['matches']) == 0:
            print("❌ FAIL: No vectors returned from query")
        else:
            print(f"✅ Retrieved {len(results['matches'])} sample vectors\n")
            
            required_keys = ['source', 'chunk_id', 'content_type']
            all_pass = True
            
            for i, match in enumerate(results['matches'][:3], 1):
                metadata = match.get('metadata', {})
                print(f"Sample {i}: score={match['score']:.3f}")
                
                missing = [k for k in required_keys if k not in metadata]
                if missing:
                    print(f"   ❌ Missing keys: {missing}")
                    all_pass = False
                else:
                    print(f"   ✅ Has: {', '.join(required_keys)}")
                    # Show sample values (truncated)
                    src = str(metadata.get('source', ''))[:40]
                    print(f"      source={src}...")
                print()
            
            if all_pass:
                print("✅ PASS: All samples have required metadata")
            else:
                print("❌ FAIL: Some samples missing metadata keys")
                print("   → Review M1.3 metadata enrichment step")
                
    except Exception as e:
        print(f"⚠️  ERROR: {str(e)}")
        print("   Cannot query index - check connection")
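
The metadata check can be factored into a small pure function and tested without a live index. This is a sketch under assumptions: the helper name `find_metadata_gaps` is hypothetical, matches are treated as plain dictionaries with `id` and `metadata` keys, and an empty string is counted as missing (a stricter rule than the cell above, which only checks key presence).

```python
REQUIRED_KEYS = ("source", "chunk_id", "content_type")

def find_metadata_gaps(matches, required=REQUIRED_KEYS):
    """Map each match id to the required keys that are missing or empty."""
    gaps = {}
    for match in matches:
        metadata = match.get("metadata") or {}
        # Treat None and empty strings as missing (assumption, see lead-in)
        missing = [key for key in required if metadata.get(key) in (None, "")]
        if missing:
            gaps[match.get("id", "<unknown>")] = missing
    return gaps

# Hypothetical sample matches for illustration
sample_matches = [
    {"id": "a", "metadata": {"source": "doc.pdf", "chunk_id": "a-0", "content_type": "text"}},
    {"id": "b", "metadata": {"source": "doc.pdf", "chunk_id": ""}},
    {"id": "c"},
]
metadata_gaps = find_metadata_gaps(sample_matches)
print(metadata_gaps)
```

Returning the gaps per vector id, instead of a single pass/fail flag, makes it easier to trace which documents need to be reprocessed in M1.3.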

---

## Section 4: Check - Chunking Params Documentation

**Requirement:** Chunking strategy documented (approach + parameters)

**What this cell does:** Searches for existing chunking configuration files (config.json, chunking_config.json, README.md). If none exist, creates a default `chunking_params.json` with M1.3 pipeline settings to enable future debugging.

In [None]:
import json
from pathlib import Path

# Look for existing chunking config
config_paths = [
    "config/chunking_config.json",
    "chunking_config.json",
    "chunking_params.json",  # Created by this cell on a previous run
    "config.json",
    "README.md"
]

found_config = None
for path in config_paths:
    if Path(path).exists():
        found_config = path
        break

if found_config:
    print(f"✅ Found config: {found_config}")
    if found_config.endswith('.json'):
        with open(found_config) as f:
            config = json.load(f)
            print(f"   Config keys: {list(config.keys())}")
    else:
        print(f"   (Documentation in {found_config})")
else:
    print("⚠️  No config found - creating chunking_params.json")
    
    # Default params based on M1.3 requirements
    default_params = {
        "chunking_strategy": "semantic",
        "chunk_size": 512,
        "chunk_overlap": 0.18,
        "overlap_description": "15-20% overlap",
        "sentence_boundary": True,
        "min_chunk_size": 100,
        "embedding_model": "text-embedding-ada-002",
        "embedding_dimension": 1536,
        "notes": "Parameters for M1.3 document processing pipeline"
    }
    
    with open("chunking_params.json", "w") as f:
        json.dump(default_params, f, indent=2)
    
    print("✅ Created chunking_params.json with default values")
    print(f"   Strategy: {default_params['chunking_strategy']}")
    print(f"   Chunk size: {default_params['chunk_size']}")
    print(f"   Overlap: {default_params['overlap_description']}")

print("\n✅ PASS: Chunking strategy documented")
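
Once the parameters are recorded, a quick sanity check can confirm they agree with the 15-20% overlap guideline. The sketch below is illustrative: the function names `validate_chunking_params` and `overlap_units` are hypothetical, and the unit of `chunk_size` (tokens or characters) is left open because it is not stated here.

```python
def validate_chunking_params(params):
    """Return a list of problems found in a chunking-params mapping."""
    problems = []
    size = params.get("chunk_size")
    overlap = params.get("chunk_overlap")
    min_size = params.get("min_chunk_size", 0)
    if not isinstance(size, int) or size <= 0:
        problems.append("chunk_size must be a positive integer")
    elif size < min_size:
        problems.append("chunk_size is smaller than min_chunk_size")
    if not isinstance(overlap, (int, float)) or not 0.15 <= overlap <= 0.20:
        problems.append("chunk_overlap should be a fraction between 0.15 and 0.20")
    return problems

def overlap_units(params):
    """Overlap expressed in the same units as chunk_size."""
    return round(params["chunk_size"] * params["chunk_overlap"])

# Same values as the defaults written by the cell above
example_params = {"chunk_size": 512, "chunk_overlap": 0.18, "min_chunk_size": 100}
print(validate_chunking_params(example_params))
print(overlap_units(example_params))
```

For the default values, 512 × 0.18 = 92.16, so consecutive chunks share roughly 92 units.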

---

## Section 5: Mini Smoke Test - Query Types

**Requirement:** Test 3 query types (factual, how-to, comparison)

**What this cell does:** Embeds three representative queries using OpenAI's API and retrieves the top matching chunk from Pinecone for each. Shows only the similarity score and source (truncated) to confirm end-to-end retrieval works before M1.4's advanced query pipeline.

In [None]:
# Offline-friendly: Skip if missing API keys
openai_key = os.getenv("OPENAI_API_KEY")

if not api_key:
    print("⚠️  SKIP: PINECONE_API_KEY not set")
elif not openai_key:
    print("⚠️  SKIP: OPENAI_API_KEY not set")
    print("   Set environment variable to run smoke test")
    print("   Example: export OPENAI_API_KEY='your-key-here'")
elif index is None:
    print("⚠️  SKIP: Pinecone index not available")
else:
    try:
        from openai import OpenAI
        client = OpenAI(api_key=openai_key)
        
        # Test queries: factual, how-to, comparison
        test_queries = [
            ("factual", "What is semantic chunking?"),
            ("how-to", "How do I improve RAG accuracy?"),
            ("comparison", "Compare dense vs sparse embeddings")
        ]
        
        print("🧪 Smoke Test: 3 Query Types\n")
        
        hits = 0  # Count queries that returned at least one match
        
        for query_type, query in test_queries:
            # Embed query
            response = client.embeddings.create(
                input=query,
                model="text-embedding-ada-002"
            )
            query_vec = response.data[0].embedding
            
            # Query Pinecone (top 1 only)
            results = index.query(vector=query_vec, top_k=1, include_metadata=True)
            
            if results['matches']:
                hits += 1
                top = results['matches'][0]
                score = top['score']
                metadata = top.get('metadata', {})
                # str() guards against non-string source values
                source = str(metadata.get('source', 'N/A'))[:30]
                
                print(f"{query_type.upper()}: \"{query}\"")
                print(f"  → score={score:.3f}, source={source}...")
            else:
                print(f"{query_type.upper()}: \"{query}\"")
                print("  → No results")
            print()
        
        # Only report PASS when every query type returned a match
        if hits == len(test_queries):
            print("✅ PASS: Query pipeline functional")
            print("   (M1.4 will add: query understanding, rerank, citations)")
        else:
            print(f"❌ FAIL: {len(test_queries) - hits} of {len(test_queries)} queries returned no results")
        
    except Exception as e:
        print(f"⚠️  ERROR: {str(e)}")
        print("   Cannot complete smoke test - check API keys and network")
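
To help interpret the scores printed by the smoke test, the sketch below computes cosine similarity by hand. It assumes the index was created with the cosine metric, which is not stated here; with a dot-product or Euclidean index the score has a different meaning. The function name `cosine_similarity` and the toy vectors are for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors; real embeddings here have 1536 dimensions
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Under that assumption, scores near 1 indicate the query and chunk point in nearly the same direction, while scores near 0 indicate little semantic relation.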

---

## Section 6: Call-Forward - M1.4 Query Pipeline

**Next Module:** M1.4 Query Pipeline & Response Generation

**What You'll Build:**

### 1. Query Understanding
- Query classification (factual/how-to/comparison)
- Query expansion (synonyms, related terms)
- Query preprocessing pipeline

### 2. Hybrid Retrieval
- Semantic search (dense vectors)
- Keyword search (BM25/sparse vectors)
- Fusion strategies for 20-40% better recall
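
As a preview of what a fusion strategy can look like, here is a minimal sketch of Reciprocal Rank Fusion (RRF), one common way to merge a dense ranking with a keyword ranking. Whether M1.4 uses RRF specifically is not stated here, so treat this as one possible approach. The document ids, the function name, and the constant `k=60` (a conventional default) are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids; each list contributes 1 / (k + rank) per id."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical result lists from a dense and a keyword retriever
dense_ranking = ["a", "b", "c"]
keyword_ranking = ["b", "d", "a"]
fused_ranking = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused_ranking)
```

RRF works on ranks instead of raw scores, so it does not require the two retrievers to produce scores on a comparable scale.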

### 3. Reranking & Citations
- Cross-encoder scoring for relevance
- Response generation with GPT-4
- Source attribution with proper citations

**Expected Outcomes:**
- Complete query-to-answer pipeline
- Intelligent question answering system
- Production-grade RAG capabilities

**Trade-offs:**
- Adds 200-400ms latency per query
- 5 new pipeline components (classifier → expander → retriever → reranker → generator)
- $150-500/month infrastructure costs

**Duration:** ~44 min video + 60 min hands-on

---

**Ready for M1.4!**