# Technique 5: HyDE (Hypothetical Document Embeddings)

## The Problem
Queries and documents are often phrased **differently**:
- Query: 'How do I...?'
- Doc: 'To register, you must...'

A question and the passage that answers it can land far apart in embedding space, which hurts retrieval!

## The Solution
**HyDE:**
1. Generate a hypothetical answer to the query
2. Embed that answer (not the query!)
3. Retrieve the documents most similar to it

This bridges the language gap!
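The three steps can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: `hyde_retrieve`, `toy_generate`, `toy_embed`, and the hand-picked 2-D vectors are all hypothetical stand-ins for an LLM call, an embedding model, and a vector store.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    corpus: List[str],
    k: int = 1,
) -> List[str]:
    hypothetical = generate(query)   # 1. generate a hypothetical answer
    vector = embed(hypothetical)     # 2. embed the answer, never the query
    ranked = sorted(corpus, key=lambda doc: cosine(vector, embed(doc)), reverse=True)
    return ranked[:k]                # 3. return the most similar documents


# Toy stand-ins (assumptions): a fixed "LLM" reply and hand-picked vectors.
HYPOTHETICAL = "To register a business, submit the form online."
DOC_REGISTER = "To register, you must file with the registry."
DOC_LOANS = "Loans are available from several banks."
TOY_VECTORS = {
    HYPOTHETICAL: [0.1, 0.9],
    DOC_REGISTER: [0.2, 0.8],
    DOC_LOANS: [0.9, 0.1],
}


def toy_generate(query: str) -> str:
    return HYPOTHETICAL


def toy_embed(text: str) -> Sequence[float]:
    return TOY_VECTORS[text]


top = hyde_retrieve("How do I register?", toy_generate, toy_embed, [DOC_LOANS, DOC_REGISTER])
print(top)
```

Note that the query string is only ever passed to `generate`; the search itself runs entirely on the hypothetical answer's vector.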

**Difficulty:** ⭐⭐⭐⭐☆

## 🎯 How HyDE Works: Query Transformation

**The Core Idea:** Don't search with the question - search with a hypothetical answer!

### Traditional RAG (Query-Based)
```
User Query: "How do I register my tech startup?"
     ↓ [Embed Query]
Vector: [0.23, 0.87, ...] (question embedding)
     ↓ [Search]
Retrieved: Documents that SOUND like questions ❌
```

### HyDE (Answer-Based)
```
User Query: "How do I register my tech startup?"
     ↓ [Generate Hypothetical Answer]
Hypothetical Doc: "To register a tech startup in Nigeria,
                   visit the CAC portal and submit your
                   business registration documents..."
     ↓ [Embed Hypothetical Answer]
Vector: [0.45, 0.12, ...] (answer-like embedding)
     ↓ [Search]
Retrieved: Documents that SOUND like answers ✅
```

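**Why It Works:** Hypothetical answers are written in the SAME STYLE as documents! The hypothetical answer may get facts wrong; it is only used to find real documents, and the final answer is generated from those.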
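A rough way to see the style effect is to compare word overlap. The sketch below uses bag-of-words cosine similarity as a crude stand-in for a real embedding model (an assumption made so it runs without an API key), and the `document` sentence is made up for illustration. Real embeddings capture meaning as well as wording, so the gap is usually less extreme than this.

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    """Lowercased bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "How do I register my tech startup?"
hypothetical = (
    "To register a tech startup in Nigeria, visit the CAC portal "
    "and submit your business registration documents."
)
# Made-up document text, written in the style of a business guide.
document = "To register, you must submit your business registration documents on the CAC portal."

query_score = cosine(bow(query), bow(document))
hyde_score = cosine(bow(hypothetical), bow(document))

print(f"query vs document:        {query_score:.2f}")
print(f"hypothetical vs document: {hyde_score:.2f}")
```

The question shares only the word "register" with the document, while the hypothetical answer shares most of its vocabulary, so it scores much higher.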

## Step 1: Imports

In [None]:
from utils_openai import (
    setup_openai_api, create_embeddings, create_llm, load_msme_data,
    create_vectorstore, get_baseline_prompt, load_existing_vectorstore,
)
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

print('[OK] Imports done!')

## Step 2: Setup

In [None]:
# First run: build the vector store from the CSV.
api_key = setup_openai_api()
embeddings = create_embeddings(api_key)
llm = create_llm(api_key)
docs, metas, ids = load_msme_data('msme.csv')
vectorstore = create_vectorstore(docs, metas, ids, embeddings, 'msme_t8', './chroma_db_t8')
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
print('[OK] Setup complete!')

In [None]:
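# Later runs: reuse the persisted vector store instead of rebuilding it
# (run this cell in place of the previous one).
api_key = setup_openai_api()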
embeddings = create_embeddings(api_key)
llm = create_llm(api_key)
vectorstore = load_existing_vectorstore(embeddings, 'msme_t8', './chroma_db_t8')
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})
print('[OK] Setup complete!')

## Step 3: Create HyDE Prompt
The prompt asks the LLM for a short hypothetical answer written in the style of the indexed documents:

In [None]:
hyde_template = '''Write a brief hypothetical answer (2-3 sentences) to this question about MSMEs in Nigeria.
Make it sound like it's from a business guide document.

Question: {question}

Hypothetical answer:'''

hyde_prompt = ChatPromptTemplate.from_template(hyde_template)
print('[OK] HyDE prompt ready!')

## Step 4: Build HyDE Document Generator

In [None]:
hyde_doc_generator = hyde_prompt | llm | StrOutputParser()

# Test it
test_q = 'How do I register a construction business?'
hypo_answer = hyde_doc_generator.invoke({'question': test_q})
print(f'Question: {test_q}\n')
print(f'Hypothetical answer:\n{hypo_answer}')

## Step 5: Build HyDE RAG Chain
Retrieve with the hypothetical document, then answer the original question:

In [None]:
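# The hypothetical answer, not the raw question, is what gets embedded and searched.
# A bare question string works as input because hyde_prompt has a single input variable.
hyde_retrieval = hyde_doc_generator | retriever

final_prompt = get_baseline_prompt()

hyde_rag_chain = (
    # 'context' comes from HyDE retrieval; 'question' is the user's original question.
    {'context': hyde_retrieval, 'question': RunnablePassthrough()}
    | final_prompt
    | llm
    | StrOutputParser()
)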
print('[OK] HyDE RAG chain ready!')
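The retriever returns a list of document objects, and the notebook does not show whether the prompt from `get_baseline_prompt()` expects that list or a plain string. If a plain string turns out to be needed, a small formatter along these lines could help. `format_docs` is a hypothetical helper, and the `SimpleNamespace` objects below only imitate the `page_content` attribute of LangChain documents.

```python
from types import SimpleNamespace


def format_docs(docs, separator="\n\n"):
    """Join the text of retrieved documents into one context string."""
    return separator.join(doc.page_content for doc in docs)


# Stand-ins for retrieved documents (assumption: only page_content matters here).
fake_docs = [
    SimpleNamespace(page_content="first"),
    SimpleNamespace(page_content="second"),
]

context = format_docs(fake_docs)
print(context)
```

In the chain above it would presumably slot in as `hyde_doc_generator | retriever | format_docs`, since LCEL accepts plain functions in a pipe; check what `get_baseline_prompt()` expects before adding it.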

## Step 6: Test

In [None]:
question = 'What are the loan options for tech startups?'

answer = hyde_rag_chain.invoke(question)
print(f'Question: {question}\n')
print(f'Answer:\n{answer}')

## ⚖️ HyDE Trade-offs

| Aspect | Standard RAG | HyDE RAG |
|--------|--------------|----------|
| **Latency** | Faster (roughly ~100ms) | Slower (roughly ~500ms, one extra LLM call) |
| **LLM Calls** | 1 (generation only) | 2 (hypothetical + generation) |
| **Cost** | Lower | Higher (up to ~2x the LLM calls) |
| **Query-Doc Match** | Direct embedding | Transformed embedding |
| **Language Gap** | Suffers from mismatch | Bridges the gap ✅ |
| **Best For** | Queries matching doc style | Casual/vague queries |
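The latency figures in the table are rough, so it is worth measuring on your own setup. A minimal timing sketch follows; `timed` and `fake_chain` are hypothetical names, and `fake_chain` simply sleeps in place of a real chain call.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_chain(question: str) -> str:
    # Stand-in for a chain's .invoke(); a real call would hit the LLM.
    time.sleep(0.01)
    return f"answer to: {question}"


answer, seconds = timed(fake_chain, "What are the loan options?")
print(f"{answer!r} took {seconds * 1000:.1f} ms")
```

With the notebook's objects this would be `timed(hyde_rag_chain.invoke, question)`, compared against the same call on a baseline chain that skips the hypothetical-answer step (not defined in this notebook). A single run is noisy, so averaging several calls gives a fairer picture.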


## When to Use
**Use when:**
- Queries and documents are phrased differently
- Queries are abstract or vague
- Documents use domain-specific vocabulary

**Avoid when:**
- Queries already match the documents' language
- The extra LLM call is too expensive or too slow
- Simple retrieval already works well

## Exercise
1. Compare HyDE vs baseline for vague queries
2. Test with technical vs casual questions
3. Analyze when HyDE helps most
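One possible starting point for the comparison is to measure how much the two retrievers agree. The sketch below is only a suggestion: `retrieval_overlap` and `only_in_hyde` are hypothetical helpers, and the ID lists are invented. How you obtain document IDs depends on how `create_vectorstore` stores them, which is not shown here.

```python
def retrieval_overlap(baseline_ids, hyde_ids):
    """Jaccard overlap between two sets of retrieved document IDs (1.0 if both are empty)."""
    a, b = set(baseline_ids), set(hyde_ids)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def only_in_hyde(baseline_ids, hyde_ids):
    """IDs that HyDE retrieved but the baseline missed, in HyDE's ranking order."""
    seen = set(baseline_ids)
    return [doc_id for doc_id in hyde_ids if doc_id not in seen]


# Invented IDs for illustration.
baseline_ids = ["d1", "d2", "d3"]
hyde_ids = ["d2", "d3", "d4"]

overlap = retrieval_overlap(baseline_ids, hyde_ids)
new_docs = only_in_hyde(baseline_ids, hyde_ids)
print(f"overlap: {overlap:.2f}, new from HyDE: {new_docs}")
```

Low overlap on vague queries and high overlap on technical ones would suggest HyDE is changing retrieval mainly where the phrasing gap is largest; whether the new documents are actually better still needs a manual check.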


In [None]:
# Your code here

**Next:** RAG Evaluation