# M1.3 → M1.4 Bridge Validation Notebook

**Purpose:** Validate the M1.3 document processing pipeline before moving on to the M1.4 query pipeline

**From:** M1.3 Document Processing Pipeline  
**To:** M1.4 Query Pipeline & Response Generation

---

## Section 1: Recap - Pipeline Architecture

**M1.3 Built:** Complete 6-stage document processing pipeline

```
extract → clean → chunk → enrich → embed → store
```

**Key Components:**
- ✓ Semantic chunking (15-20% overlap, sentence boundaries)
- ✓ Metadata enrichment (source, chunk_id, content_type)
- ✓ Embedding generation (OpenAI/sentence-transformers)
- ✓ Vector storage (Pinecone)
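
As a reminder of what "semantic chunking with overlap on sentence boundaries" means in practice, here is a minimal sketch. It is not the M1.3 implementation: the function name `chunk_sentences`, the character-based `max_chars` budget, and the regex sentence splitter are all assumptions made for illustration (a real pipeline may count tokens and use a proper sentence tokenizer).

```python
import re

def chunk_sentences(text, max_chars=200, overlap_ratio=0.18):
    """Group whole sentences into chunks, carrying trailing sentences forward as overlap."""
    # Naive split after ., ! or ? followed by whitespace (an assumption for this sketch).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            # Close the current chunk, never cutting inside a sentence.
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit within the overlap budget.
            budget = int(max_chars * overlap_ratio)
            carried, used = [], 0
            for prev in reversed(current):
                if used + len(prev) > budget:
                    break
                carried.insert(0, prev)
                used += len(prev)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks

for chunk in chunk_sentences("First point. Second point. Third point.", max_chars=30):
    print(repr(chunk))
```

Because overlap is carried in whole sentences, the effective overlap only approximates the configured ratio. A single sentence longer than `max_chars` still becomes its own oversized chunk.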

**What's Next:** Query understanding, hybrid retrieval, reranking, response generation

---


## Section 2: Check - Pinecone Vector Count

**Requirement:** ≥100 vectors with complete metadata

**Test:** Query Pinecone stats API

In [None]:
import os
from pinecone import Pinecone

# Check 2.1: Pinecone connection and stats
try:
    api_key = os.getenv("PINECONE_API_KEY")
    if not api_key:
        print("❌ FAIL: PINECONE_API_KEY not set")
        print("   Skip gracefully - no API key available")
    else:
        pc = Pinecone(api_key=api_key)
        index_name = os.getenv("PINECONE_INDEX_NAME", "rag-index")
        
        # Expected: index stats showing vector count
        index = pc.Index(index_name)
        stats = index.describe_index_stats()
        
        vector_count = stats.get('total_vector_count', 0)
        print(f"📊 Vector Count: {vector_count}")
        
        if vector_count >= 100:
            print(f"✅ PASS: {vector_count} vectors (≥100 required)")
        else:
            print(f"❌ FAIL: {vector_count} vectors (need ≥100)")
            print("   → Return to M1.3 to process more documents")
            
except Exception as e:
    print(f"⚠️  ERROR: {str(e)}")
    print("   Skip gracefully - cannot connect to Pinecone")
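
The total count alone can hide a problem: vectors written to an unexpected namespace still count toward the total, but a query against the default namespace will not see them. The sketch below applies the same threshold to a plain dictionary shaped like an index-stats response and adds a per-namespace breakdown. The helper name `check_vector_count` and the sample numbers are made up for illustration, and the dictionary layout is an assumption about the response shape rather than something this notebook confirms.

```python
def check_vector_count(stats, minimum=100):
    """Return (passed, total, per_namespace) for a stats-like mapping."""
    total = stats.get("total_vector_count", 0)
    per_namespace = {
        # The default namespace is the empty string; give it a readable label.
        name or "(default)": ns.get("vector_count", 0)
        for name, ns in stats.get("namespaces", {}).items()
    }
    return total >= minimum, total, per_namespace

# Hypothetical stats, not real output from any index.
sample_stats = {
    "dimension": 1536,
    "total_vector_count": 120,
    "namespaces": {"": {"vector_count": 90}, "drafts": {"vector_count": 30}},
}
passed, total, per_namespace = check_vector_count(sample_stats)
print(passed, total, per_namespace)
```

In this made-up case the check passes on the total, yet only 90 vectors live in the default namespace that the later cells query.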

---


## Section 3: Check - Sample Retrieval & Metadata

**Requirement:** Sample vectors have source, chunk_id, content_type metadata

**Test:** Fetch 3 sample vectors and validate metadata keys

In [None]:
# Check 3.1: Sample retrieval with metadata validation
try:
    if not api_key:
        print("⚠️  SKIP: No API key - cannot test retrieval")
    else:
        # Dummy query vector, used only to sample stored vectors. It is non-zero
        # because an all-zero vector has no direction under a cosine metric.
        dummy_vector = [0.1] * 1536  # OpenAI embedding dimension
        
        # Query for 3 samples
        results = index.query(vector=dummy_vector, top_k=3, include_metadata=True)
        
        if len(results['matches']) == 0:
            print("❌ FAIL: No vectors returned from query")
        else:
            print(f"✅ Retrieved {len(results['matches'])} sample vectors\n")
            
            required_keys = ['source', 'chunk_id', 'content_type']
            all_pass = True
            
            for i, match in enumerate(results['matches'][:3], 1):
                metadata = match.get('metadata', {})
                print(f"Sample {i}: score={match['score']:.3f}")
                
                missing = [k for k in required_keys if k not in metadata]
                if missing:
                    print(f"   ❌ Missing keys: {missing}")
                    all_pass = False
                else:
                    print(f"   ✅ Has: {', '.join(required_keys)}")
                    # Show sample values (truncated)
                    src = str(metadata.get('source', ''))[:40]
                    print(f"      source={src}...")
                print()
            
            if all_pass:
                print("✅ PASS: All samples have required metadata")
            else:
                print("❌ FAIL: Some samples missing metadata keys")
                print("   → Review M1.3 metadata enrichment step")
                
except Exception as e:
    print(f"⚠️  ERROR: {str(e)}")
    print("   Skip gracefully")
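
The metadata rule can also be written as a small pure function, which makes it easy to test without a live index. This sketch is slightly stricter than the cell above, since it also treats `None` and empty strings as missing. The name `find_metadata_gaps` and the sample matches are hypothetical, and the matches are assumed to be plain dictionaries.

```python
REQUIRED_KEYS = ("source", "chunk_id", "content_type")

def find_metadata_gaps(matches, required=REQUIRED_KEYS):
    """Map each match id to the required keys that are missing or empty."""
    gaps = {}
    for match in matches:
        metadata = match.get("metadata") or {}
        missing = [k for k in required if metadata.get(k) in (None, "")]
        if missing:
            gaps[match.get("id", "<no id>")] = missing
    return gaps

# Made-up matches for illustration only.
sample_matches = [
    {"id": "a", "metadata": {"source": "guide.pdf", "chunk_id": "a-0", "content_type": "text"}},
    {"id": "b", "metadata": {"source": "", "chunk_id": "b-0"}},
    {"id": "c"},
]
print(find_metadata_gaps(sample_matches))
```

An empty result means every sampled vector carries all three keys with non-empty values.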

---


## Section 4: Check - Chunking Params Documentation

**Requirement:** Chunking strategy documented (approach + parameters)

**Test:** Check for config file or create JSON record if missing

In [None]:
import json
from pathlib import Path

# Check 4.1: Look for existing chunking config
config_paths = [
    "config/chunking_config.json",
    "chunking_config.json",
    "chunking_params.json",  # written by this cell on a previous run
    "config.json",
    "README.md"
]

found_config = None
for path in config_paths:
    if Path(path).exists():
        found_config = path
        break

if found_config:
    print(f"✅ Found config: {found_config}")
    if found_config.endswith('.json'):
        with open(found_config) as f:
            config = json.load(f)
            print(f"   Config keys: {list(config.keys())}")
    else:
        print(f"   (Documentation in {found_config})")
else:
    print("⚠️  No config found - creating chunking_params.json")
    
    # Default params based on M1.3 requirements
    default_params = {
        "chunking_strategy": "semantic",
        "chunk_size": 512,
        "chunk_overlap": 0.18,
        "overlap_description": "15-20% overlap",
        "sentence_boundary": True,
        "min_chunk_size": 100,
        "embedding_model": "text-embedding-ada-002",
        "embedding_dimension": 1536,
        "notes": "Parameters for M1.3 document processing pipeline"
    }
    
    with open("chunking_params.json", "w") as f:
        json.dump(default_params, f, indent=2)
    
    print("✅ Created chunking_params.json with default values")
    print(f"   Strategy: {default_params['chunking_strategy']}")
    print(f"   Chunk size: {default_params['chunk_size']}")
    print(f"   Overlap: {default_params['overlap_description']}")

print("\n✅ PASS: Chunking strategy documented")
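
To make the default parameters concrete, the arithmetic behind a 512-unit chunk with 0.18 overlap can be worked through as below. The config does not say whether `chunk_size` is measured in tokens or characters, so the sketch stays unit-agnostic. The helper names `overlap_plan` and `estimate_chunks` are hypothetical, and the chunk estimate assumes fixed-size windows, so sentence-boundary chunking will only approximate it.

```python
import math

def overlap_plan(chunk_size, overlap_ratio):
    """Return (overlap, stride) in the same units as chunk_size."""
    if not 0.15 <= overlap_ratio <= 0.20:
        raise ValueError("overlap_ratio is outside the documented 15-20% range")
    overlap = round(chunk_size * overlap_ratio)
    stride = chunk_size - overlap  # how far each new chunk advances
    return overlap, stride

def estimate_chunks(doc_length, chunk_size, overlap_ratio):
    """Estimate the chunk count for a document under fixed-size windows."""
    overlap, stride = overlap_plan(chunk_size, overlap_ratio)
    return max(1, math.ceil((doc_length - overlap) / stride))

overlap, stride = overlap_plan(512, 0.18)
print(overlap, stride)                   # 92 420
print(estimate_chunks(2000, 512, 0.18))  # 5
```

With these defaults, each chunk repeats about 92 units of the previous one and advances by 420, so a 2000-unit document yields roughly five chunks.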

---


## Section 5: Mini Smoke Test - Query Types

**Requirement:** Test 3 query types (factual, how-to, comparison)

**Test:** Basic semantic search → show top score only

In [None]:
# Check 5.1: Mini smoke test with 3 query types
try:
    openai_key = os.getenv("OPENAI_API_KEY")
    
    if not api_key or not openai_key:
        print("⚠️  SKIP: Missing API keys")
    else:
        from openai import OpenAI
        client = OpenAI(api_key=openai_key)
        
        # Test queries: factual, how-to, comparison
        test_queries = [
            ("factual", "What is semantic chunking?"),
            ("how-to", "How do I improve RAG accuracy?"),
            ("comparison", "Compare dense vs sparse embeddings")
        ]
        
        print("🧪 Smoke Test: 3 Query Types\n")

        all_returned = True  # becomes False if any query gets no match
        for query_type, query in test_queries:
            # Embed query
            response = client.embeddings.create(
                input=query,
                model="text-embedding-ada-002"
            )
            query_vec = response.data[0].embedding

            # Query Pinecone (top 1 only)
            results = index.query(vector=query_vec, top_k=1, include_metadata=True)

            print(f"{query_type.upper()}: \"{query}\"")
            if results['matches']:
                top = results['matches'][0]
                score = top['score']
                metadata = top.get('metadata', {})
                # str() guards against non-string source values before slicing
                source = str(metadata.get('source', 'N/A'))[:30]
                print(f"  → score={score:.3f}, source={source}...")
            else:
                print("  → No results")
                all_returned = False
            print()

        if all_returned:
            print("✅ PASS: Query pipeline functional")
            print("   (M1.4 will add: query understanding, rerank, citations)")
        else:
            print("❌ FAIL: At least one query returned no results")
        
except Exception as e:
    print(f"⚠️  ERROR: {str(e)}")
    print("   Skip gracefully")
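
To interpret the `score` values printed by the smoke test, it helps to recall what a cosine similarity score measures. The sketch below assumes the index was created with the cosine metric, which this notebook does not state. With a different metric (dot product or Euclidean), the scores have a different scale and meaning.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Under that assumption, a top score close to 1 means the query and the stored chunk point in nearly the same direction in embedding space, and a score near 0 means they are unrelated.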

---


## Section 6: Call-Forward - M1.4 Query Pipeline

**Next Module:** M1.4 Query Pipeline & Response Generation

**What You'll Build:**

### 1. Query Understanding
- Query classification (factual/how-to/comparison)
- Query expansion (synonyms, related terms)
- Query preprocessing pipeline
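
As a preview of query classification, a rule-based baseline might look like the sketch below. This is only an illustration: M1.4 may well use a model-based classifier, and the function name `classify_query` and its keyword patterns are hypothetical.

```python
import re

def classify_query(query):
    """Label a query as 'comparison', 'how-to', or 'factual' using keyword rules."""
    q = query.lower().strip()
    # Comparison cues can appear anywhere in the query.
    if re.search(r"\b(compare|versus|vs\.?|difference between)\b", q):
        return "comparison"
    # How-to cues are only checked at the start of the query.
    if re.match(r"(how (do|can|to|should)\b|steps to\b)", q):
        return "how-to"
    # Everything else falls back to factual.
    return "factual"

for q in ["What is semantic chunking?",
          "How do I improve RAG accuracy?",
          "Compare dense vs sparse embeddings"]:
    print(classify_query(q), "<-", q)
```

Keyword rules like these are cheap and transparent but brittle, so they are best seen as a baseline to compare a learned classifier against.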

### 2. Hybrid Retrieval
- Semantic search (dense vectors)
- Keyword search (BM25/sparse vectors)
- Fusion strategies for 20-40% better recall
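
One widely used fusion strategy is reciprocal rank fusion (RRF), sketched below. Whether M1.4 uses RRF or another method is not stated here, so treat this as one possible instance. The chunk ids are made up, and `k=60` is the commonly used default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list is ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the document's fused score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["c1", "c2", "c3"]   # hypothetical semantic search results
sparse_hits = ["c3", "c1", "c4"]  # hypothetical keyword search results
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
# ['c1', 'c3', 'c2', 'c4']
```

RRF uses only ranks, not raw scores, so it sidesteps the problem that dense similarity scores and BM25 scores live on different scales.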

### 3. Reranking & Citations
- Cross-encoder scoring for relevance
- Response generation with GPT-4
- Source attribution with proper citations

**Expected Outcomes:**
- Complete query-to-answer pipeline
- Intelligent question answering system
- Production-grade RAG capabilities

**Trade-offs:**
- Adds 200-400ms latency per query
- 5 new pipeline components (classifier → expander → retriever → reranker → generator)
- $150-500/month infrastructure costs

**Duration:** ~44 min video + 60 min hands-on

---

**Ready for M1.4!**

---
