## Better RAG with HyDE: Hypothetical Document Embeddings

## [Check out the detailed tutorial on my newsletter](https://www.rohan-paul.com/p/better-rag-with-hyde-hypothetical)

#### 🤖 Basic Concept

HyDE encodes a query into vector form by first having an instruction-following LLM generate a hypothetical text that answers it. That text is then fed into a contrastive encoder for embedding. This indirect approach captures semantic relevance without needing labeled relevance data.

![](assets/2025-03-30-22-04-44.png)

The code below also externalizes configuration, that is, it separates environment-specific settings (like your API endpoint) from application logic. Instead of hardcoding the URL, it reads it from an environment variable (or builds it dynamically), which makes the code more flexible, more secure, and easier to maintain.
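
As a minimal sketch of externalized configuration, the helper below reads both settings from a mapping and fails early when one is missing. The function name `load_api_settings`, the error message, and the placeholder URL are assumptions made for illustration; only the two variable names come from this notebook.

```python
import os


def load_api_settings(env=None):
    """Read the endpoint and key from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    base_url = env.get("OPENAI_BASE_URL")
    api_key = env.get("OPENAI_API_KEY")

    # Collect the names of any settings that are unset or empty.
    missing = [
        name
        for name, value in (("OPENAI_BASE_URL", base_url), ("OPENAI_API_KEY", api_key))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")

    return {"base_url": base_url, "api_key": api_key}


# Usage with an explicit mapping, so the example runs without real credentials.
settings = load_api_settings({
    "OPENAI_BASE_URL": "https://example.invalid/v1/",  # placeholder endpoint
    "OPENAI_API_KEY": "placeholder-key",
})
print(settings["base_url"])
```

Failing early keeps a missing key from surfacing later as a confusing authentication error inside an API call.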

#### ⚙️ Two-Step Mechanism

1. **Hypothetical Document Generation**  
An LLM receives a query and responds with an artificial passage that attempts to address the query. The output may contain inaccurate details, but it still encodes topic-related patterns.

2. **Contrastive Embedding**  
A contrastive encoder transforms the generated text into a dense vector. This encoding filters out fabricated elements and focuses on the core meaning. Final retrieval is done by matching this vector against the real corpus vectors.
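
The two steps can be sketched end to end without any API calls. Everything here is a stand-in chosen for illustration: `fake_llm` returns canned text instead of calling a real model, and `toy_encoder` is a hashed bag-of-words vector, not a contrastive encoder. Only the shape of the pipeline (generate, embed, match against corpus vectors) reflects HyDE.

```python
import hashlib

import numpy as np


def fake_llm(query):
    """Stand-in for an instruction-following LLM (assumption: canned text)."""
    return f"A passage that tries to answer: {query}. It may contain errors."


def toy_encoder(text, dim=64):
    """Stand-in for a contrastive encoder: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        token = token.strip(".,:;?!")
        if token:
            bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
            vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


def hyde_search(query, corpus, top_k=1):
    hypothetical = fake_llm(query)          # Step 1: hypothetical document
    query_vec = toy_encoder(hypothetical)   # Step 2: embed the generated text
    doc_vecs = np.stack([toy_encoder(doc) for doc in corpus])
    scores = doc_vecs @ query_vec           # cosine similarity (unit vectors)
    order = np.argsort(-scores)[:top_k]
    return [(corpus[i], float(scores[i])) for i in order]


corpus = [
    "Batteries and pumped hydro can store solar energy for later use.",
    "Sourdough bread needs a long fermentation before baking.",
]
for text, score in hyde_search("How to store solar energy cheaply?", corpus, top_k=2):
    print(f"{score:.3f}  {text}")
```

Note that the query itself is never embedded on its own; the corpus is searched with the vector of the generated passage.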

#### 🚩 Practical Advantages

- Straightforward to deploy: No explicit relevance training data is needed.  
- Flexible: Works across different domains and languages with minimal adjustments.  
- Solid performance: Often rivals or surpasses unsupervised approaches (like Contriever alone) and can approach fine-tuned retrievers.

#### 🌟 Key Highlights

→ Uses an LLM to propose a synthetic passage as a starting point.  
→ Contrastive embeddings focus on essential details, discarding hallucinated content.  
→ Matches synthetic vectors to real documents, bridging query and corpus.  
→ No domain-specific labeled data required.  
→ Outperforms many zero-shot dense retrieval baselines.

#### 🧲 Five One-Line Summaries

1. "It generates a fake document to guide retrieval in a real corpus."  
2. "Queries gain a synthetic helper document for sharper search in dense vectors."  
3. "A two-step trick that eliminates complex labels but keeps retrieval strong."  
4. "Hypothesis text plus contrastive encoding equals simplified, label-free document search."  
5. "Instructions feed a made-up passage, and vector matching does the rest."

In [None]:
# -------------------------------------------------------
# 1) LIBRARY IMPORTS
# -------------------------------------------------------
import os

import fitz  # PyMuPDF (install with: pip install pymupdf)
import matplotlib.pyplot as plt
import numpy as np
from openai import OpenAI

# -------------------------------------------------------
# 2) OPENAI CLIENT SETUP
# -------------------------------------------------------
# Instantiate the OpenAI client with an endpoint and an API key read from environment variables.
# Set both variables externally (e.g., in your OS or a .env file) before running this cell.
openai_base_url = os.getenv("OPENAI_BASE_URL")  # e.g., "https://api.studio.nebius.com/v1/"
openai_api_key = os.getenv("OPENAI_API_KEY")

api_client = OpenAI(
    base_url=openai_base_url,
    api_key=openai_api_key
)


In [None]:
# -------------------------------------------------------
# 3) DOCUMENT READING AND SEGMENTING
# -------------------------------------------------------
def pdf_to_text(pdf_file):
    """
    Reads a PDF file and extracts textual content page by page.

    Args:
        pdf_file (str): Path to the PDF.

    Returns:
        list of dict: Each item holds 'text' and 'metadata'.
    """
    print(f"Parsing PDF file: {pdf_file}")
    doc = fitz.open(pdf_file)
    contents = []

    for page_index in range(len(doc)):
        page_content = doc[page_index].get_text()
        if len(page_content.strip()) > 50:
            contents.append({
                "text": page_content,
                "metadata": {
                    "source": pdf_file,
                    "page_number": page_index + 1
                }
            })

    print(f"Pages extracted with sufficient content: {len(contents)}")
    return contents

def segment_text(raw_text, seg_size=1000, overlap=200):
    """
    Splits text into sections of fixed size, with optional overlap.

    Args:
        raw_text (str): Entire text to be split.
        seg_size (int): Character length of each segment.
        overlap (int): Overlapping characters between segments.

    Returns:
        list of dict: Segments with 'text' and 'metadata'.
    """
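    # A step of zero or less would raise in range() or silently produce no segments
    if overlap >= seg_size:
        raise ValueError("overlap must be smaller than seg_size")

    segments = []
    step = seg_size - overlap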

    for start_idx in range(0, len(raw_text), step):
        segment = raw_text[start_idx:start_idx + seg_size]
        if segment:
            segments.append({
                "text": segment,
                "metadata": {
                    "start": start_idx,
                    "end": start_idx + len(segment)
                }
            })

    print(f"Total segments created: {len(segments)}")
    return segments
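
To see where the chunk boundaries fall, the sketch below reproduces the stepping used by segment_text on lengths only, plus a guard against `overlap >= seg_size`, which would otherwise make the step zero or negative. The helper name `chunk_spans` is hypothetical and exists only for this illustration.

```python
def chunk_spans(text_len, seg_size=1000, overlap=200):
    """Return (start, end) character offsets for each segment."""
    if overlap >= seg_size:
        raise ValueError("overlap must be smaller than seg_size")

    step = seg_size - overlap
    spans = []
    for start in range(0, text_len, step):
        end = min(start + seg_size, text_len)  # the last segments may be shorter
        spans.append((start, end))
    return spans


# A 2,500-character page with the default settings: the step is 1000 - 200 = 800.
print(chunk_spans(2500))
# → [(0, 1000), (800, 1800), (1600, 2500), (2400, 2500)]
```

The final span (2400 to 2500) lies entirely inside the previous one, so short trailing segments like this are near-duplicates. Whether to drop them is a design choice this notebook leaves open.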


In [None]:

# -------------------------------------------------------
# 4) VECTOR STORAGE CLASS
# -------------------------------------------------------
class VectorStorage:
    """
    Stores text segments and corresponding vector embeddings.
    Provides similarity searches based on cosine similarity.
    """
    def __init__(self):
        self.embeds = []
        self.text_blobs = []
        self.meta_info = []

    def store(self, text_piece, vector_embedding, meta=None):
        """
        Inserts a new record in the vector store.

        Args:
            text_piece (str): Content chunk.
            vector_embedding (list of float): Embedding for chunk.
            meta (dict): Additional metadata to store.
        """
        self.embeds.append(np.array(vector_embedding))
        self.text_blobs.append(text_piece)
        self.meta_info.append(meta if meta else {})

    def find_similar(self, query_vector, top_k=5, filter_func=None):
        """
        Retrieves top_k similar chunks based on cosine similarity.

        Args:
            query_vector (list of float): Embedding vector of query.
            top_k (int): Number of similar results.
            filter_func (callable): Optional function to filter results.

        Returns:
            list of dict: The top_k results, sorted by similarity.
        """
        if not self.embeds:
            return []

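        query_vec = np.array(query_vector)
        query_norm = np.linalg.norm(query_vec)  # computed once, reused for every comparison
        sims = []

        for idx, db_vec in enumerate(self.embeds):
            if filter_func and not filter_func(self.meta_info[idx]):
                continue
            denom = query_norm * np.linalg.norm(db_vec)
            # Treat zero-length vectors as having no similarity instead of dividing by zero
            sim_val = np.dot(query_vec, db_vec) / denom if denom else 0.0
            sims.append((idx, sim_val))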

        sims.sort(key=lambda tup: tup[1], reverse=True)
        top_matches = []
        for i in range(min(top_k, len(sims))):
            found_idx, score_val = sims[i]
            top_matches.append({
                "text": self.text_blobs[found_idx],
                "metadata": self.meta_info[found_idx],
                "similarity": float(score_val)
            })

        return top_matches
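
For larger stores, the per-vector loop in find_similar can be replaced by a single matrix product. The sketch below is one possible vectorized variant, not part of the class above; the function name `cosine_top_k` and its return format of (index, score) pairs are assumptions.

```python
import numpy as np


def cosine_top_k(query_vector, matrix, top_k=5):
    """Rank the rows of `matrix` by cosine similarity to `query_vector`."""
    q = np.asarray(query_vector, dtype=float)
    m = np.asarray(matrix, dtype=float)
    if m.size == 0:
        return []

    dots = m @ q                                        # one dot product per row
    denom = np.linalg.norm(m, axis=1) * np.linalg.norm(q)
    # Rows (or a query) with zero norm get similarity 0 instead of NaN.
    sims = np.divide(dots, denom, out=np.zeros(len(m)), where=denom > 0)

    order = np.argsort(-sims)[:top_k]                   # highest similarity first
    return [(int(i), float(sims[i])) for i in order]


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(cosine_top_k([1.0, 0.0], stored, top_k=2))
```

Here row 0 points the same way as the query (similarity 1.0) and row 2 sits at 45 degrees (about 0.707), so those two are returned in that order.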


In [None]:

# -------------------------------------------------------
# 5) EMBEDDING FUNCTIONS
# -------------------------------------------------------
def create_vector(text_list, model_name="BAAI/bge-en-icl"):
    """
    Generates vector embeddings for a batch of text inputs.

    Args:
        text_list (list of str): List of textual data to embed.
        model_name (str): Model used for embedding.

    Returns:
        list of list of float: Embedding vectors.
    """
    if not text_list:
        return []

    batch_limit = 100
    all_vecs = []

    # Splitting into smaller groups to avoid hitting API limits
    for start_index in range(0, len(text_list), batch_limit):
        subset = text_list[start_index:start_index + batch_limit]
        resp = api_client.embeddings.create(
            model=model_name,
            input=subset
        )
        sub_embeddings = [r.embedding for r in resp.data]
        all_vecs.extend(sub_embeddings)

    return all_vecs


## 6) Building Vector Storage from a PDF

* Reads the PDF into a list of pages (via pdf_to_text).

* Splits each page into overlapping chunks (segment_text).

* Invokes create_vector to embed each chunk.

* Stores them in an instance of VectorStorage.

This function effectively converts the PDF into an indexed and searchable structure for RAG queries.

In [None]:

# -------------------------------------------------------
# 6) BUILD THE VECTOR STORAGE FROM A PDF
# -------------------------------------------------------
def build_vector_storage(pdf_file, seg_size=1000, overlap=200):
    """
    Constructs a vector store from a PDF by extracting text, segmenting,
    and embedding each segment.

    Args:
        pdf_file (str): Path to PDF.
        seg_size (int): Segment size in characters.
        overlap (int): Overlap in characters.

    Returns:
        VectorStorage: The completed vector store.
    """
    pages_data = pdf_to_text(pdf_file)
    all_sections = []

    for page_item in pages_data:
        sub_parts = segment_text(page_item["text"], seg_size, overlap)
        for part in sub_parts:
            # Update metadata with the page's info
            part["metadata"].update(page_item["metadata"])
        all_sections.extend(sub_parts)

    print("Computing embeddings for all segments...")
    text_inputs = [p["text"] for p in all_sections]
    text_embeddings = create_vector(text_inputs)

    vs = VectorStorage()
    for idx, fragment in enumerate(all_sections):
        vs.store(
            text_piece=fragment["text"],
            vector_embedding=text_embeddings[idx],
            meta=fragment["metadata"]
        )

    print(f"Vector storage completed with {len(all_sections)} total segments.")
    return vs


## 7) Crafting a Hypothetical Document

* Sends a system prompt and user prompt to the LLM.

* The system instructs the model to generate a single, cohesive document that might answer the question thoroughly.

* The user prompt clarifies the question.

* The outcome is a “hypothetical” text that might represent a perfect or near-perfect response to the user’s question. We don’t present this text to the user as the answer; instead, we embed it and use it to search our knowledge base.

## HyDE’s Core Concept

The fundamental idea behind HyDE (Hypothetical Document Embeddings) is that instead of directly embedding the user’s short (and sometimes ambiguous) query, you first generate a hypothetical response to that query: an imagined, in-depth document that could answer it in the best possible way. You then embed that document instead of the original query. By embedding a richer, more detailed piece of text, you often capture deeper semantics that can lead to more accurate retrieval from your knowledge base.

In [None]:
# -------------------------------------------------------
# 7) CREATE HYPOTHETICAL DOCUMENT (HyDE)
# -------------------------------------------------------
def craft_hypothetical_answer(user_query, approx_len=1000):
    """
    Uses an LLM to produce a hypothetical document that could
    potentially answer the user's query in depth.

    Args:
        user_query (str): The user's query.
        approx_len (int): Desired approximate length of the generated doc.

    Returns:
        str: The generated hypothetical answer.
    """
    system_context = (
        f"You are tasked with writing an authoritative, detailed piece of text. "
        f"Given a user question, produce a thorough and informative response of about {approx_len} characters. "
        f"Provide facts, examples, and clarity. Do not mention it's hypothetical."
    )
    user_request = f"Question: {user_query}\n\nWrite a comprehensive document that fully addresses this question."

    response_msg = api_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": system_context},
            {"role": "user", "content": user_request}
        ],
        temperature=0.1
    )

    return response_msg.choices[0].message.content


## 8) The HyDE RAG Process

* Generate a hypothetical answer to the user’s query via craft_hypothetical_answer.

* Embed that hypothetical document using create_vector.

* Retrieve top num_chunks from the vector store with the new embedding.

* (Optional) Compose a final answer, using the retrieved segments.

**Why does this help?**

A query might be only a few words. Generating a richer “potential answer” can produce a more comprehensive embedding that captures the semantics the user truly wants. This bridging can improve retrieval for short or ambiguous queries.

An instruction-following LLM creates a fake passage for each query q. Imagine you ask, “How to store solar energy cheaply?” The LLM responds with a quick, possibly inaccurate paragraph about solar energy storage. This text represents a hypothetical document.
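
The intuition can be checked on a deliberately contrived toy case. The sketch below scores the bare query and a hand-written "hypothetical" passage against one corpus chunk using plain word-count cosine similarity. All three texts are invented for illustration, and word overlap is only a rough proxy for what a dense encoder measures, so treat the numbers as a picture of the idea rather than evidence.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts with trailing punctuation stripped."""
    tokens = (tok.strip(".,?!").lower() for tok in text.split())
    return Counter(tok for tok in tokens if tok)


def cosine(a, b):
    """Cosine similarity between two Counters (missing words count as 0)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


corpus_chunk = "Pumped hydro and batteries store solar energy cheaply for later use."
query = "How to store solar energy cheaply?"
hypothetical = "Batteries and pumped hydro store solar energy for later use at low cost."

query_score = cosine(bag_of_words(query), bag_of_words(corpus_chunk))
hyde_score = cosine(bag_of_words(hypothetical), bag_of_words(corpus_chunk))

print(f"query vs chunk:        {query_score:.3f}")  # → 0.492
print(f"hypothetical vs chunk: {hyde_score:.3f}")   # → 0.836
```

The hypothetical passage shares ten words with the chunk, while the query shares four, which is why it lands closer. With a real encoder the effect depends on the model and the corpus.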

In [None]:

# -------------------------------------------------------
# 8) RAG WITH HYPOTHETICAL EMBEDDING (HyDE)
# -------------------------------------------------------
def execute_hyde_rag(query_text, vec_storage, num_chunks=5, gen_final=True):
    """
    Performs RAG with HyDE. Generates a hypothetical document,
    embeds it, retrieves similar text segments, and optionally
    composes a final answer.

    Args:
        query_text (str): User's query.
        vec_storage (VectorStorage): Populated vector store.
        num_chunks (int): Number of chunks to fetch.
        gen_final (bool): Whether to generate the final answer.

    Returns:
        dict: Includes the hypothetical doc, retrieved segments,
              and optionally the final composed response.
    """
    print(f"\n** HyDE RAG Query: {query_text} **\n")
    print("1) Creating hypothetical document...")
    hypothetical_text = craft_hypothetical_answer(query_text)
    print(f"Hypothetical document length: {len(hypothetical_text)} characters")

    print("2) Embedding the hypothetical document...")
    hypo_embed = create_vector([hypothetical_text])[0]

    print(f"3) Retrieving top {num_chunks} matching segments...")
    retrieved = vec_storage.find_similar(hypo_embed, top_k=num_chunks)

    outcomes = {
        "user_query": query_text,
        "hypo_document": hypothetical_text,
        "retrieved_fragments": retrieved
    }

    if gen_final:
        print("4) Composing final answer from retrieved segments...")
        final_ans = compose_answer(query_text, retrieved)
        outcomes["final_answer"] = final_ans

    return outcomes


In [None]:
# -------------------------------------------------------
# 9) COMPOSING THE FINAL ANSWER
# -------------------------------------------------------
def compose_answer(query_text, relevant_chunks):
    """
    Forms a concluding response from relevant chunks.

    Args:
        query_text (str): The original user query.
        relevant_chunks (list of dict): Top retrieved segments.

    Returns:
        str: Generated final answer.
    """
    combined_context = "\n\n".join([f["text"] for f in relevant_chunks])

    response_msg = api_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful AI assistant. "
                    "Use the context below to answer the user's question."
                )
            },
            {
                "role": "user",
                "content": f"Context:\n{combined_context}\n\nQuestion: {query_text}"
            }
        ],
        temperature=0.5,
        max_tokens=500
    )

    return response_msg.choices[0].message.content

## 10) Standard RAG

* Directly embeds the user query.

* Retrieves the top num_chunks.

* Optionally constructs a final answer from those retrieved pieces.

This is the baseline approach. It often works well, but it can be less robust than HyDE for short or vague queries.

In [None]:
# -------------------------------------------------------
# 10) STANDARD RAG (DIRECT QUERY EMBEDDING)
# -------------------------------------------------------
def execute_standard_rag(query_text, vec_storage, num_chunks=5, gen_final=True):
    """
    Carries out standard RAG: directly embed the user's query,
    retrieve similar segments, and optionally produce a final answer.

    Args:
        query_text (str): User's query.
        vec_storage (VectorStorage): Populated vector store.
        num_chunks (int): Number of chunks to retrieve.
        gen_final (bool): Whether to generate the final answer.

    Returns:
        dict: Contains the query, retrieved chunks, and optionally the answer.
    """
    print(f"\n** Standard RAG Query: {query_text} **\n")
    print("1) Generating query embedding...")
    direct_embed = create_vector([query_text])[0]

    print(f"2) Retrieving top {num_chunks} segments...")
    retrieved = vec_storage.find_similar(direct_embed, top_k=num_chunks)

    result_payload = {
        "user_query": query_text,
        "retrieved_fragments": retrieved
    }

    if gen_final:
        print("3) Composing final answer from retrieved segments...")
        final = compose_answer(query_text, retrieved)
        result_payload["final_answer"] = final

    return result_payload


In [None]:

# -------------------------------------------------------
# 11) COMPARISON AND EVALUATION
# -------------------------------------------------------
def assess_strategies(query_text, vec_storage, reference_solution=None):
    """
    Evaluates the outcomes of both HyDE-based RAG and
    Standard RAG for a particular query.

    Args:
        query_text (str): The query to evaluate.
        vec_storage (VectorStorage): Vector store containing chunk data.
        reference_solution (str): Optional known correct answer.

    Returns:
        dict: Contains responses from both approaches and a comparison.
    """
    # HyDE approach
    hyde_out = execute_hyde_rag(query_text, vec_storage)
    hyde_reply = hyde_out.get("final_answer", "")

    # Direct RAG approach
    standard_out = execute_standard_rag(query_text, vec_storage)
    standard_reply = standard_out.get("final_answer", "")

    # Evaluate differences
    comp = evaluate_outcomes(query_text, hyde_reply, standard_reply, reference_solution)

    return {
        "query": query_text,
        "hyde_response": hyde_reply,
        "hyde_hypothetical_doc": hyde_out["hypo_document"],
        "standard_response": standard_reply,
        "reference": reference_solution,
        "comparison": comp
    }


def evaluate_outcomes(query_text, hyde_ans, standard_ans, ref_ans=None):
    """
    Uses LLM to compare two responses (HyDE vs. Standard RAG)
    and optionally references a correct solution.

    Args:
        query_text (str): The original user query.
        hyde_ans (str): HyDE's final answer.
        standard_ans (str): Standard RAG final answer.
        ref_ans (str, optional): A known correct answer for reference.

    Returns:
        str: Summarized comparison text from the LLM.
    """
    sys_instructions = (
        "You are an evaluator comparing two approaches. "
        "Discuss accuracy, relevance, completeness, and clarity."
    )

    user_instructions = (
        f"Query: {query_text}\n\n"
        f"HyDE-based response:\n{hyde_ans}\n\n"
        f"Standard RAG response:\n{standard_ans}"
    )

    if ref_ans:
        user_instructions += f"\n\nReference Answer:\n{ref_ans}"

    user_instructions += (
        "\n\nPlease analyze which approach better addresses the question and why."
    )

    eval_msg = api_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": sys_instructions},
            {"role": "user", "content": user_instructions}
        ],
        temperature=0
    )

    return eval_msg.choices[0].message.content
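
The LLM-based comparison above is qualitative. As a complementary check that is not part of the original notebook, one could also measure how much the two retrieved sets overlap. The sketch below computes a Jaccard score over chunk identities, assuming each fragment carries the `source`, `page_number`, and `start` metadata produced earlier. The function names are hypothetical.

```python
def fragment_key(fragment):
    """Identify a chunk by where it came from, not by its text."""
    meta = fragment.get("metadata", {})
    return (meta.get("source"), meta.get("page_number"), meta.get("start"))


def retrieval_overlap(hyde_fragments, standard_fragments):
    """Jaccard overlap between two lists of retrieved fragments (0.0 to 1.0)."""
    hyde_keys = {fragment_key(f) for f in hyde_fragments}
    std_keys = {fragment_key(f) for f in standard_fragments}
    union = hyde_keys | std_keys
    if not union:
        return 0.0
    return len(hyde_keys & std_keys) / len(union)


# Hand-made fragments standing in for the "retrieved_fragments" lists.
hyde_hits = [
    {"metadata": {"page_number": 1, "start": 0}},
    {"metadata": {"page_number": 1, "start": 800}},
    {"metadata": {"page_number": 2, "start": 0}},
]
standard_hits = [
    {"metadata": {"page_number": 1, "start": 0}},
    {"metadata": {"page_number": 3, "start": 1600}},
]
print(retrieval_overlap(hyde_hits, standard_hits))  # → 0.25
```

A score near 1.0 means both strategies fetched almost the same chunks, so any difference in the final answers comes from generation rather than retrieval.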

In [None]:
# -------------------------------------------------------
# 12) RUNNING A BATCH EVALUATION
# -------------------------------------------------------
def batch_evaluation(pdf_file, queries, correct_answers=None, seg_size=1000, overlap=200):
    """
    Processes a PDF file into a vector store, then runs multiple queries
    against both HyDE-based RAG and Standard RAG, collecting comparisons.

    Args:
        pdf_file (str): Path to the PDF.
        queries (list of str): List of user queries.
        correct_answers (list of str): Reference solutions for each query.
        seg_size (int): Segment size for chunking.
        overlap (int): Overlap among segments.

    Returns:
        dict: Contains the list of comparison results and a summary.
    """
    vs = build_vector_storage(pdf_file, seg_size, overlap)

    results_list = []
    for idx, item_query in enumerate(queries):
        print(f"\n=== EVALUATING QUERY {idx+1}/{len(queries)}: {item_query} ===")
        reference_text = None
        if correct_answers and idx < len(correct_answers):
            reference_text = correct_answers[idx]

        outcome = assess_strategies(item_query, vs, reference_text)
        results_list.append(outcome)

    overall_summary = aggregate_results(results_list)
    return {
        "comparisons": results_list,
        "summary": overall_summary
    }

def aggregate_results(comparison_outcomes):
    """
    Summarizes the overall performance across all queries.

    Args:
        comparison_outcomes (list of dict): Results from assess_strategies.

    Returns:
        str: A compiled summary from the LLM.
    """
    sys_prompt = (
        "You are an expert in comparing retrieval approaches for QA. "
        "Based on multiple queries, provide an overall assessment comparing "
        "HyDE RAG with standard RAG. Discuss where each approach excels and any recommendations."
    )

    short_summaries = ""
    for idx, result_item in enumerate(comparison_outcomes):
        short_summaries += f"Query #{idx+1}: {result_item['query']}\nComparison snippet: {result_item['comparison'][:200]}...\n\n"

    user_prompt = (
        f"Below are partial comparisons of {len(comparison_outcomes)} queries. "
        f"Please form a comprehensive analysis:\n\n{short_summaries}"
    )

    response_msg = api_client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0
    )

    return response_msg.choices[0].message.content

In [None]:
# -------------------------------------------------------
# 13) VISUALIZING RESULTS
# -------------------------------------------------------
def draw_comparison(query_txt, hyde_data, standard_data):
    """
    Provides a simple visualization comparing the query,
    hypothetical document, and retrieved chunks from both approaches.

    Args:
        query_txt (str): The user's query string.
        hyde_data (dict): Output from execute_hyde_rag.
        standard_data (dict): Output from execute_standard_rag.
    """
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # 1) Plot the user query
    axes[0].text(0.5, 0.5, f"Query:\n\n{query_txt}",
                 ha='center', va='center', fontsize=11, wrap=True)
    axes[0].set_axis_off()

    # 2) Show the hypothetical document
    hypo_doc = hyde_data["hypo_document"]
    doc_excerpt = hypo_doc if len(hypo_doc) <= 500 else hypo_doc[:500] + "..."
    axes[1].text(0.5, 0.5, f"Hypothetical Doc:\n\n{doc_excerpt}",
                 ha='center', va='center', fontsize=10, wrap=True)
    axes[1].set_axis_off()

    # 3) Compare retrieved chunks
    hyde_chx = [c["text"][:100] + "..." for c in hyde_data["retrieved_fragments"]]
    std_chx = [c["text"][:100] + "..." for c in standard_data["retrieved_fragments"]]

    compare_text = "HyDE RAG retrieved:\n\n"
    for i, chunky in enumerate(hyde_chx):
        compare_text += f"{i+1}) {chunky}\n\n"

    compare_text += "\nStandard RAG retrieved:\n\n"
    for i, chunky in enumerate(std_chx):
        compare_text += f"{i+1}) {chunky}\n\n"

    axes[2].text(0.5, 0.5, compare_text, ha='center', va='center',
                 fontsize=8, wrap=True)
    axes[2].set_axis_off()

    plt.tight_layout()
    plt.show()

In [None]:
# -------------------------------------------------------
# 14) USAGE EXAMPLE
# -------------------------------------------------------
# Below is an example usage scenario that:
# 1) Processes a sample PDF into the vector store
# 2) Asks a query using HyDE
# 3) Asks a query using standard RAG
# 4) Compares & visualizes

if __name__ == "__main__":
    # Provide a path to a PDF file
    sample_pdf_path = "data/AI_Information.pdf"

    # Build vector store
    store = build_vector_storage(sample_pdf_path)

    # Example user query
    example_query = "What are the main ethical considerations in artificial intelligence development?"

    # HyDE-based retrieval
    hyde_outcome = execute_hyde_rag(example_query, store)
    print("\n-- HyDE-based Answer --")
    print(hyde_outcome.get("final_answer", ""))

    # Standard RAG retrieval
    standard_outcome = execute_standard_rag(example_query, store)
    print("\n-- Standard RAG Answer --")
    print(standard_outcome.get("final_answer", ""))

    # Visualization
    draw_comparison(example_query, hyde_outcome, standard_outcome)

    # Additional queries for a batch evaluation
    queries_to_test = [
        "How does neural network architecture impact AI performance?"
    ]
    # Optional reference answers
    references_list = [
        "Neural network architecture affects performance by influencing model capacity, generalization, and efficiency. Variations in layer depth, width, and connections optimize tasks like vision or language."
    ]

    # Execute a comprehensive run
    evaluation_data = batch_evaluation(
        pdf_file=sample_pdf_path,
        queries=queries_to_test,
        correct_answers=references_list
    )

    print("\n=== Overall Summary ===")
    print(evaluation_data["summary"])