# BRIDGE: M9.2 Multi-Hop Retrieval → M9.3 HyDE

**Purpose:** Minimal validation notebook to verify readiness for M9.3 Hypothetical Document Embeddings

**Duration:** 8-10 minutes  
**Format:** Bridge validation (checks only)

---

## Section 1: RECAP — What M9.2 Multi-Hop Retrieval Shipped

M9.2 delivered enterprise-grade multi-hop retrieval with the following achievements:

### ✓ Knowledge Graph with Neo4j
- Built document relationship graph tracking references
- Calculated PageRank importance scores (hub documents surface automatically)
- Provided citation trails for compliance requirements

### ✓ LLM-Powered Reference Extractor
- Implemented GPT-4o-mini extraction with **85%+ precision**
- Validated references against Pinecone corpus
- Prevented hallucinated document IDs

### ✓ Recursive Retriever with 5 Stop Conditions
- `max_hops=3` (maximum traversal depth)
- Visited tracking (prevent cycles)
- `relevance_threshold=0.7` (quality gate)
- `timeout=10s` (performance constraint)
- No-refs-found detection (graceful termination)
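
The five stop conditions above can be sketched as a breadth-first traversal. This is a minimal illustration, not the M9.2 implementation: `fetch` is a hypothetical callable returning a relevance score and the referenced document IDs, and counting hops as levels 0 to `max_hops - 1` is an assumption chosen to match the Hop 0/1/2 logging used later in this notebook.

```python
import time
from typing import Callable, List, Set, Tuple

# Hypothetical signature: fetch(doc_id) -> (relevance_score, referenced_doc_ids)
Fetch = Callable[[str], Tuple[float, List[str]]]


def multi_hop_retrieve(
    start_ids: List[str],
    fetch: Fetch,
    max_hops: int = 3,
    relevance_threshold: float = 0.7,
    timeout_s: float = 10.0,
) -> List[str]:
    """Follow references level by level until a stop condition fires."""
    deadline = time.monotonic() + timeout_s
    visited: Set[str] = set()
    collected: List[str] = []
    frontier = list(start_ids)

    for _hop in range(max_hops):          # stop 1: maximum traversal depth
        if not frontier:                  # stop 5: no references left to follow
            break
        next_frontier: List[str] = []
        for doc_id in frontier:
            if time.monotonic() > deadline:   # stop 4: timeout
                return collected
            if doc_id in visited:             # stop 2: cycle prevention
                continue
            visited.add(doc_id)
            score, refs = fetch(doc_id)
            if score < relevance_threshold:   # stop 3: quality gate
                continue
            collected.append(doc_id)
            next_frontier.extend(refs)
        frontier = next_frontier
    return collected


# Toy graph (illustrative data only): doc_id -> (relevance, references)
toy_graph = {
    "A": (0.90, ["B", "C"]),
    "B": (0.80, ["A", "D"]),  # points back to A (cycle)
    "C": (0.50, ["E"]),       # below the 0.7 threshold, so E is never reached
    "D": (0.75, ["F"]),
    "E": (0.90, []),
    "F": (0.90, []),
}


def fetch(doc_id: str) -> Tuple[float, List[str]]:
    return toy_graph[doc_id]


print(multi_hop_retrieve(["A"], fetch))  # ['A', 'B', 'D']
```

In this toy run, `F` is not retrieved because the depth limit is reached after three levels, and `E` is skipped because its only parent fails the quality gate.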

### ✓ Completeness Improvement: 40% → 87%
- Reference-chain queries now retrieve all linked documents
- Answer quality improved by **25%** on the 15-25% of queries that involve cross-references
- Critical for legal, medical, and compliance systems

---

**Key Takeaway:** You've built a system that handles queries that single-shot RAG and query decomposition couldn't solve, because it follows citation trails automatically.

## Section 2: Readiness Check #1 — Knowledge Graph & PageRank

**Requirement:** Knowledge graph tracking document relationships and PageRank scores

**Test:** Query Neo4j for hub documents (documents with 5+ incoming references)

**Impact:** HyDE will use these hub docs as examples for hypothetical answer generation

In [None]:
# Readiness Check #1: Knowledge Graph & PageRank
import os

# Check for Neo4j credentials
neo4j_uri = os.getenv("NEO4J_URI", "bolt://localhost:7687")
neo4j_user = os.getenv("NEO4J_USER", "neo4j")
neo4j_password = os.getenv("NEO4J_PASSWORD")

if not neo4j_password:
    print("⚠️ Skipping (no Neo4j credentials)")
    print("Set NEO4J_PASSWORD environment variable to run this check")
else:
    try:
        from neo4j import GraphDatabase
        
        # The driver is a context manager, so it is closed even if the query raises
        with GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_password)) as driver:
            # Query for hub documents (5+ incoming references)
            query = """
            MATCH (doc:Document)
            OPTIONAL MATCH (doc)<-[:REFERENCES]-(ref)
            WITH doc, count(ref) as incoming_refs, doc.pagerank as pagerank
            WHERE incoming_refs >= 5
            RETURN doc.id as doc_id, incoming_refs, pagerank
            ORDER BY incoming_refs DESC
            LIMIT 5
            """

            # Materialize the records while the session is still open
            with driver.session() as session:
                hub_docs = list(session.run(query))
        
        if len(hub_docs) > 0:
            print(f"✓ Found {len(hub_docs)} hub documents")
            # Expected: 3-5 hub documents with 5+ references
        else:
            print("✗ No hub documents found (need docs with 5+ incoming refs)")
            
    except Exception as e:
        print(f"⚠️ Neo4j connection failed: {str(e)[:50]}")

# Expected: 
# ✓ Found 3-5 hub documents
# These will be used as examples for HyDE hypothetical answer generation

## Section 3: Readiness Check #2 — Reference Extractor Precision

**Requirement:** Reference extractor achieving 85%+ precision (no hallucinated references)

**Test:** Run 20 extractions and check that every extracted doc_id exists in Pinecone; any ID that does not exist counts as a false positive

**Impact:** HyDE builds on retrieval quality—garbage references = garbage hypothetical answers

In [None]:
# Readiness Check #2: Reference Extractor Precision
import os  # imported again so this cell also runs on its own

# Check for OpenAI and Pinecone credentials
openai_key = os.getenv("OPENAI_API_KEY")
pinecone_key = os.getenv("PINECONE_API_KEY")

if not openai_key or not pinecone_key:
    print("⚠️ Skipping (no OpenAI or Pinecone keys)")
    print("Set OPENAI_API_KEY and PINECONE_API_KEY to run this check")
else:
    print("⚠️ Stub implementation - actual test would:")
    print("1. Extract references from 20 sample documents using GPT-4o-mini")
    print("2. Validate each extracted doc_id exists in Pinecone index")
    print("3. Calculate precision = valid_refs / total_extracted_refs")
    print("4. Verify precision >= 85%")
    
    # Stub result
    precision = 0.87  # Mock value
    if precision >= 0.85:
        print(f"\n✓ Reference extractor precision: {precision:.0%} (target: 85%+)")
    else:
        print(f"\n✗ Precision too low: {precision:.0%} (need 85%+)")

# Expected:
# ✓ Reference extractor precision: 87% (target: 85%+)
# No hallucinated doc_ids found in validation set
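
The precision calculation that the stub describes could look like the sketch below. It assumes the set of valid document IDs is already available in memory; in a real check the membership test would be a lookup against the Pinecone index. The function name and the sample IDs are illustrative.

```python
from typing import Iterable, List, Set, Tuple


def reference_precision(
    extracted_ids: Iterable[str], corpus_ids: Set[str]
) -> Tuple[float, List[str]]:
    """Return (precision, invalid_ids) for a batch of extracted references."""
    extracted = list(extracted_ids)
    if not extracted:
        raise ValueError("precision is undefined when nothing was extracted")
    # Any extracted ID missing from the corpus is a hallucinated reference
    invalid = [doc_id for doc_id in extracted if doc_id not in corpus_ids]
    precision = (len(extracted) - len(invalid)) / len(extracted)
    return precision, invalid


# Illustrative data: 100 known documents, 20 extractions, 3 of them invalid
corpus_ids = {f"doc-{i:03d}" for i in range(1, 101)}
extracted = [f"doc-{i:03d}" for i in range(1, 18)] + ["doc-901", "doc-902", "doc-903"]

precision, invalid = reference_precision(extracted, corpus_ids)
print(f"precision: {precision:.0%}, invalid: {invalid}")
```

Returning the invalid IDs alongside the score makes it easy to inspect which references the extractor hallucinated, not just how many.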

## Section 4: Readiness Check #3 — Multi-Hop Latency

**Requirement:** Multi-hop retrieval completing within 2 seconds for 3-hop queries

**Test:** Benchmark P95 latency, verify <2s with timeout=10s and max_hops=3

**Impact:** HyDE will add 500-1000ms, so if multi-hop is already slow, total latency becomes unacceptable

In [None]:
# Readiness Check #3: Multi-Hop Latency
import math

# Simulate multi-hop retrieval latency benchmark
print("⚠️ Stub implementation - actual test would:")
print("1. Run 100 multi-hop queries with max_hops=3")
print("2. Measure latency for each query")
print("3. Calculate P95 latency (95th percentile)")
print("4. Verify P95 < 2000ms")

# Stub result (mock latency measurements in milliseconds)
latencies_ms = [1200, 1350, 1150, 1420, 1380]  # Mock latency samples
# Nearest-rank P95: the sample at rank ceil(0.95 * n) in sorted order
p95_rank = math.ceil(0.95 * len(latencies_ms))
p95_latency = sorted(latencies_ms)[p95_rank - 1]  # 1420 for these samples

if p95_latency < 2000:
    print(f"\n✓ P95 latency: {p95_latency}ms (target: <2000ms)")
    print(f"  HyDE overhead: +500-1000ms → total ~{p95_latency + 750}ms")
else:
    print(f"\n✗ P95 latency too high: {p95_latency}ms (need <2000ms)")
    print("  Fix multi-hop performance before adding HyDE")

# Expected:
# ✓ P95 latency: 1420ms (target: <2000ms)
# HyDE overhead: +500-1000ms → total ~2170ms (acceptable)
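
A real version of this benchmark might be structured as below. This is a sketch under assumptions: `run_query` is a hypothetical zero-argument callable wrapping one multi-hop retrieval, and the nearest-rank method is one of several common percentile definitions.

```python
import math
import time
from typing import Callable, List, Sequence


def nearest_rank_percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the sample at rank ceil(pct/100 * n)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_latency_ms(run_query: Callable[[], object], n_runs: int = 100) -> float:
    """Time n_runs calls of run_query and return the P95 latency in ms."""
    latencies: List[float] = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return nearest_rank_percentile(latencies, 95)


# The mock samples from the cell above give the same P95 as the stub
print(nearest_rank_percentile([1200, 1350, 1150, 1420, 1380], 95))  # 1420
```

With only five samples the P95 is simply the maximum, which is why the readiness check calls for 100 queries before trusting the number.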

## Section 5: Readiness Check #4 — Relevance Threshold

**Requirement:** Relevance threshold preventing hop degradation (relevance ≥0.7 at Hop 2)

**Test:** Log relevance scores per hop, verify Hop 2 avg ≥0.7

**Impact:** HyDE generates hypothetical answers based on retrieved context—poor context = poor hypotheticals

In [None]:
# Readiness Check #4: Relevance Threshold

print("⚠️ Stub implementation - actual test would:")
print("1. Run 50 multi-hop queries with max_hops=3")
print("2. Log relevance scores for each hop (Hop 0, 1, 2)")
print("3. Calculate average relevance at Hop 2")
print("4. Verify Hop 2 avg relevance >= 0.7")

# Stub result (mock relevance scores)
hop_0_avg = 0.82  # Initial retrieval
hop_1_avg = 0.76  # First hop
hop_2_avg = 0.73  # Second hop (critical check)

print("\nRelevance by hop:")
print(f"  Hop 0 (initial): {hop_0_avg:.2f}")
print(f"  Hop 1: {hop_1_avg:.2f}")
print(f"  Hop 2: {hop_2_avg:.2f}")

if hop_2_avg >= 0.7:
    print(f"\n✓ Hop 2 relevance: {hop_2_avg:.2f} (target: ≥0.7)")
    print("  Context quality sufficient for HyDE hypothetical generation")
else:
    print(f"\n✗ Hop 2 relevance too low: {hop_2_avg:.2f} (need ≥0.7)")
    print("  Poor context will produce poor hypothetical answers")

# Expected:
# ✓ Hop 2 relevance: 0.73 (target: ≥0.7)
# Quality gate prevents degraded context from reaching HyDE
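
The per-hop averages could be computed from a relevance log along these lines. The log format, a sequence of `(hop, score)` pairs, is an assumption made for illustration, and the scores are mock values chosen to reproduce the stub's averages.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def average_relevance_by_hop(log: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group logged relevance scores by hop index and average each group."""
    by_hop: Dict[int, List[float]] = defaultdict(list)
    for hop, score in log:
        by_hop[hop].append(score)
    return {hop: mean(scores) for hop, scores in sorted(by_hop.items())}


# Mock log entries: (hop index, relevance score)
relevance_log = [
    (0, 0.84), (0, 0.80),
    (1, 0.78), (1, 0.74),
    (2, 0.75), (2, 0.71),
]

averages = average_relevance_by_hop(relevance_log)
for hop, avg in averages.items():
    print(f"Hop {hop}: {avg:.2f}")
print("Hop 2 passes:", averages[2] >= 0.7)
```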

## Section 6: CALL-FORWARD — What M9.3 HyDE Will Introduce

---

### The Problem: Vocabulary Mismatch

Your advanced multi-hop retrieval works well when user queries and documents share similar language. But what happens when:

**User asks:** "tax implications of stock options"  
**Document contains:** "equity compensation taxation framework under IRC Section 409A"

**Result:** Dense retrieval fails because of **vocabulary mismatch**, which affects 30-40% of queries in specialized domains.

---

### M9.3 Hypothetical Document Embeddings (HyDE) — Three Critical Capabilities

#### 1. Hypothetical Answer Generation
- Instead of embedding the user query, generate a hypothetical answer using an LLM
- Embed the hypothetical answer (which uses document-style language)
- Search for documents similar to the hypothetical answer
- **Bridges the vocabulary gap** between user language and document language
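
The flow above can be illustrated with a toy example. Everything here is a stand-in: `embed` is a bag-of-words counter rather than a real embedding model, `stub_generate` returns a canned string where an LLM call would go, and the second corpus document is invented for contrast. The point is only to show why searching with the hypothetical answer can match where the raw query does not.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    corpus: Dict[str, str],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Embed a hypothetical answer instead of the query, then rank documents."""
    hypothetical = generate(query)
    vec = embed(hypothetical)
    scored = [(doc_id, cosine(vec, embed(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def stub_generate(query: str) -> str:
    # Stand-in for an LLM call; a real system would prompt a model here
    return ("Stock options are equity compensation; their taxation follows "
            "the deferred compensation framework under IRC Section 409A.")


corpus = {
    "doc-409a": "equity compensation taxation framework under IRC Section 409A",
    "doc-travel": "employee travel reimbursement policy and expense limits",
}
query = "tax implications of stock options"

print("query vs doc-409a:", cosine(embed(query), embed(corpus["doc-409a"])))
print("HyDE top hit:", hyde_search(query, stub_generate, corpus)[0][0])
```

A real dense embedding model would not score the raw query at exactly zero as this word-overlap toy does, but the same effect applies in weaker form: the hypothetical answer sits closer to document-style language than the query.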

#### 2. Hybrid Retrieval (HyDE + Traditional Dense)
- Combine HyDE results (good for mismatch queries) with traditional dense search
- Deduplicate and merge results
- Dynamically route: use HyDE only when vocabulary mismatch is detected
- **Best of both worlds:** precision for formal queries + recall for natural queries
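
One way to merge and route the two result sets is sketched below. It assumes both retrievers return `(doc_id, score)` pairs sorted by descending score, with scores on a comparable scale (for example, the same embedding model and similarity metric). The `has_vocab_mismatch` predicate is hypothetical; how mismatch is actually detected is left to M9.3.

```python
from typing import Callable, Dict, List, Tuple

Ranked = List[Tuple[str, float]]


def merge_results(hyde_hits: Ranked, dense_hits: Ranked, top_k: int = 5) -> Ranked:
    """Deduplicate by doc_id, keeping the higher score, then re-rank."""
    best: Dict[str, float] = {}
    for doc_id, score in hyde_hits + dense_hits:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)[:top_k]


def hybrid_retrieve(
    query: str,
    dense_search: Callable[[str], Ranked],
    hyde_search: Callable[[str], Ranked],
    has_vocab_mismatch: Callable[[str], bool],  # hypothetical detector
    top_k: int = 5,
) -> Ranked:
    """Run dense search always; add HyDE only when a mismatch is detected."""
    dense_hits = dense_search(query)
    if not has_vocab_mismatch(query):
        return dense_hits[:top_k]
    return merge_results(hyde_search(query), dense_hits, top_k)


# Illustrative hits from each retriever
hyde_hits = [("a", 0.9), ("b", 0.6)]
dense_hits = [("b", 0.8), ("c", 0.7)]
print(merge_results(hyde_hits, dense_hits))  # [('a', 0.9), ('b', 0.8), ('c', 0.7)]
```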

#### 3. Cost-Performance Optimization
- HyDE adds 500-1000ms latency + $0.001-0.005 per query
- Implement caching for repeated queries
- A/B test: Measure if HyDE actually improves retrieval for your domain
- **Route intelligently:** 20-30% of queries benefit, 70-80% don't need it
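
Caching for repeated queries could be as simple as the sketch below. `generate_hypothetical` is a stub standing in for the LLM call, and normalizing queries by lowercasing and collapsing whitespace is an assumption; a production cache might also need expiry and a shared store.

```python
from functools import lru_cache
from typing import List

llm_call_log: List[str] = []  # records each simulated LLM call


def generate_hypothetical(query: str) -> str:
    """Stub for the LLM call that writes a hypothetical answer."""
    llm_call_log.append(query)
    return f"Hypothetical answer for: {query}"


@lru_cache(maxsize=1024)
def _cached_generate(normalized_query: str) -> str:
    return generate_hypothetical(normalized_query)


def cached_hypothetical(query: str) -> str:
    """Normalize the query so trivially different phrasings share a cache entry."""
    return _cached_generate(" ".join(query.lower().split()))


cached_hypothetical("Tax implications of stock options")
cached_hypothetical("  tax implications of   STOCK options ")
print("LLM calls made:", len(llm_call_log))  # 1
```

The second call is served from the cache, so the latency and per-query cost of generation are paid only once for that query.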

---

### Key Question for M9.3

**"Your retrieval works for well-phrased queries. But when users use natural language and your docs use formal terminology, how do you bridge that gap?"**

**The Answer:** HyDE generates a hypothetical answer in document-style language, then searches for documents similar to that hypothetical answer.

**The Twist:** HyDE only helps 20-30% of queries (those with vocabulary mismatch). For the rest, it adds latency with zero benefit. M9.3 teaches you **when to use it and when to skip it**.

---

### Next Module

Proceed to **M9.3 Concept: Hypothetical Document Embeddings (HyDE)** to implement vocabulary mismatch detection and hypothetical answer generation.