# Hypothetical Document Embedding (HyDE) in Document Retrieval

## Overview

This code implements a Hypothetical Document Embedding (HyDE) system for document retrieval. HyDE uses a language model to turn a query into a hypothetical document that answers it, then searches with that document instead of the raw query, aiming to bridge the gap between query and document distributions in vector space.

## Motivation

Traditional retrieval methods often struggle with the semantic gap between short queries and longer, more detailed documents. HyDE addresses this by expanding the query into a full hypothetical document, potentially improving retrieval relevance by making the query representation more similar to the document representations in the vector space.

## Key Components

1. PDF processing and text chunking
2. Vector store creation using FAISS and OpenAI embeddings
3. Language model for generating hypothetical documents
4. Custom HyDERetriever class implementing the HyDE technique

## Method Details

### Document Preprocessing and Vector Store Creation

1. The PDF is processed and split into chunks.
2. A FAISS vector store is created using OpenAI embeddings for efficient similarity search.
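
To make the roles of chunk size and overlap concrete, here is a minimal sketch of overlapping character windows. It is a simplification for illustration only: the function name `split_into_chunks` is hypothetical, and the notebook's own helper may split on separators rather than at fixed offsets.

```python
def split_into_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = split_into_chunks(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.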

### Hypothetical Document Generation

1. A language model (`gpt-4o-mini` in the code below) is used to generate a hypothetical document that answers the given query.
2. The generation is guided by a prompt template that asks for a detailed document whose length matches the chunk size used in the vector store.
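
The prompt step can be sketched without any LLM dependency. The template text below paraphrases the one used later in the notebook; the names `HYDE_TEMPLATE` and `build_hyde_prompt` are assumptions made for this illustration.

```python
# Hypothetical template, modeled on the prompt used by the retriever class in this notebook
HYDE_TEMPLATE = (
    "Given the question '{query}', generate a hypothetical document that "
    "directly answers this question. The document should be detailed and "
    "in-depth. The document should be about {chunk_size} characters long."
)


def build_hyde_prompt(query: str, chunk_size: int = 500) -> str:
    """Fill the template with the user query and the target document length."""
    return HYDE_TEMPLATE.format(query=query, chunk_size=chunk_size)


prompt = build_hyde_prompt("What is the main cause of climate change?", chunk_size=500)
print(prompt)
```

Tying the requested length to the chunk size is meant to make the generated text resemble a stored chunk. Language models rarely hit an exact character count, so the length is best treated as a target rather than a guarantee.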

### Retrieval Process

The `HyDERetriever` class implements the following steps:

1. Generate a hypothetical document from the query using the language model.
2. Use the hypothetical document as the search query in the vector store.
3. Retrieve the most similar documents to this hypothetical document.
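
The three steps above can be sketched with a toy bag-of-words embedding and a canned generator in place of a real embedding model and LLM. Every name here (`embed`, `cosine`, `hyde_retrieve`, `canned_generate`) is hypothetical and exists only for this illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hyde_retrieve(query, chunks, generate, k=3):
    hypothetical_doc = generate(query)    # step 1: query -> hypothetical document
    query_vec = embed(hypothetical_doc)   # step 2: search with the hypothetical document
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k], hypothetical_doc   # step 3: top-k most similar chunks


def canned_generate(query):
    """Stand-in for the LLM call; returns a fixed answer-like passage."""
    return ("Recent warming is driven mainly by greenhouse gases such as "
            "carbon dioxide released by burning fossil fuels.")


chunks = [
    "Greenhouse gases such as carbon dioxide trap heat in the atmosphere.",
    "Coral reefs host a large share of marine biodiversity.",
    "Burning fossil fuels releases carbon dioxide.",
]
query = "Why is the planet warming?"
top, doc = hyde_retrieve(query, chunks, canned_generate, k=2)
print(top)
```

In this toy setup the raw query shares no words with the fossil-fuel chunk, while the hypothetical document shares several. That is the vocabulary gap HyDE tries to close; real embedding models narrow the gap on their own, so the benefit in practice depends on the model and the data.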

## Key Features

1. Query Expansion: Transforms short queries into detailed hypothetical documents.
2. Flexible Configuration: Allows adjustment of chunk size, overlap, and number of retrieved documents.
3. Integration with OpenAI Models: Uses an OpenAI chat model (`gpt-4o-mini` in the custom retriever) for hypothetical document generation and OpenAI embeddings for vector representation.

## Benefits of this Approach

1. Improved Relevance: By expanding queries into full documents, HyDE can potentially capture more nuanced and relevant matches.
2. Handling Complex Queries: Particularly useful for complex or multi-faceted queries that might be difficult to match directly.
3. Adaptability: The hypothetical document generation can adapt to different types of queries and document domains.
4. Potential for Better Context Understanding: The expanded query might better capture the context and intent behind the original question.

## Implementation Details

1. Uses an OpenAI chat model (through `ChatOpenAI`) for hypothetical document generation.
2. Employs FAISS for efficient similarity search in the vector space.
3. Allows for easy visualization of both the hypothetical document and retrieved results.

## Conclusion

Hypothetical Document Embedding (HyDE) addresses the semantic gap between queries and documents by using a language model to expand each query into a hypothetical document before retrieval. This can improve retrieval relevance, especially for complex or nuanced queries, at the cost of one extra LLM call per query. The technique could be particularly valuable in domains where understanding query intent and context is crucial, such as legal research, academic literature review, or advanced information retrieval systems.

# Package Installation and Imports

The cell below installs all necessary packages required to run this notebook.


In [1]:
# Install required packages
#!uv pip install langchain langchain-classic langchain-openai langchain-community langchain-text-splitters python-dotenv pymupdf pypdf chromadb faiss-cpu

In [None]:
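import os
import sys
from dotenv import load_dotenv

# Imported explicitly so the HyDERetriever cell does not rely on the star imports below
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from utils.helper_functions import *
from utils.evaluate_rag import *

# Load environment variables from a .env file
load_dotenv()

# Fail early with a clear message if the OpenAI API key is missing
# (assigning os.getenv(...) back to os.environ raises a TypeError when the key is unset)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your environment or .env file")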

### Define document(s) path

In [2]:
path = "data/Understanding_Climate_Change.pdf"

### Define the HyDE retriever class - creating vector store, generating hypothetical document, and retrieving

In [3]:
class HyDERetriever:
    def __init__(self, files_path, chunk_size=500, chunk_overlap=100):
        # LLM that writes the hypothetical document
        self.llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=4000)

        self.embeddings = OpenAIEmbeddings()
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        # Build the vector store from the PDF (encode_pdf is provided by the utils star imports)
        self.vectorstore = encode_pdf(files_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        # Prompt asking for an answer-like document about the size of one stored chunk
        self.hyde_prompt = PromptTemplate(
            input_variables=["query", "chunk_size"],
            template="""Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
            The document size has to be exactly {chunk_size} characters.""",
        )
        self.hyde_chain = self.hyde_prompt | self.llm

    def generate_hypothetical_document(self, query):
        """Ask the LLM for a hypothetical document that answers the query."""
        input_variables = {"query": query, "chunk_size": self.chunk_size}
        return self.hyde_chain.invoke(input_variables).content

    def retrieve(self, query, k=3):
        """Search the vector store with the hypothetical document instead of the raw query."""
        hypothetical_doc = self.generate_hypothetical_document(query)
        similar_docs = self.vectorstore.similarity_search(hypothetical_doc, k=k)
        return similar_docs, hypothetical_doc


### Create a HyDE retriever instance

In [4]:
retriever = HyDERetriever(path)

### Demonstrate on a use case

In [6]:
test_query = "What is the main cause of climate change?"
results, hypothetical_doc = retriever.retrieve(test_query)

In [12]:
results

[Document(id='35d97bcf-edc7-437b-b85c-6cdfddb6777e', metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-07-13T20:17:34+03:00', 'author': 'Nir', 'moddate': '2024-07-13T20:17:34+03:00', 'source': 'data/Understanding_Climate_Change.pdf', 'total_pages': 33, 'page': 0, 'page_label': '1'}, page_content='predict future trends. The evidence overwhelmingly shows that recent changes are primarily \ndriven by human activities, particularly the emission of greenhouse gases. \nChapter 2: Causes of Climate Change \nGreenhouse Gases \nThe primary cause of recent climate change is the increase in greenhouse gases in the \natmosphere. Greenhouse gases, such as carbon dioxide (CO2), methane (CH4), and nitrous \noxide (N2O), trap heat from the sun, creating a "greenhouse effect." This effect is essential'),
 Document(id='4fdf69a2-82bc-40e7-8e22-22bd09757d27', metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationd

### Display the hypothetical document and the retrieved documents

In [7]:
docs_content = [doc.page_content for doc in results]

print("hypothetical_doc:\n")
print(text_wrap(hypothetical_doc)+"\n")
show_context(docs_content)

hypothetical_doc:

**The Main Cause of Climate Change**  Climate change primarily results from human activities, particularly the burning
of fossil fuels such as coal, oil, and natural gas. This process releases significant amounts of carbon dioxide (CO2)
and other greenhouse gases into the atmosphere, enhancing the greenhouse effect. Deforestation further exacerbates the
issue by reducing the number of trees that can absorb CO2. Additionally, industrial processes, agriculture, and waste
management contribute to emissions. The cumulative effect of these activities leads to global warming, altering weather
patterns and impacting ecosystems worldwide.

Context 1:
predict future trends. The evidence overwhelmingly shows that recent changes are primarily 
driven by human activities, particularly the emission of greenhouse gases. 
Chapter 2: Causes of Climate Change 
Greenhouse Gases 
The primary cause of recent climate change is the increase in greenhouse gases in the 
atmosphere. Greenhou

### LangChain HyDE Implementation

In [8]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_classic.chains import HypotheticalDocumentEmbedder

# Load PDF
loader = PyPDFLoader("data/Understanding_Climate_Change.pdf")
documents = loader.load()

# Split
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Create HYDE embeddings
base_embeddings = OpenAIEmbeddings()
llm = OpenAI(temperature=0)
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(
    llm, base_embeddings, prompt_key="web_search"
)
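
Conceptually, a HyDE embedding wrapper leaves document embedding untouched and changes only how queries are embedded. The toy class below sketches that idea. It is not LangChain's code: `ToyHyDEEmbedder`, `length_vowel_embed`, and `canned_generate` are hypothetical names, and averaging several generated documents is an assumption of this sketch.

```python
from typing import Callable, List

Vector = List[float]


class ToyHyDEEmbedder:
    """Illustrative wrapper: documents use the base embedding, queries go through generation first."""

    def __init__(self, base_embed: Callable[[str], Vector], generate: Callable[[str], List[str]]):
        self.base_embed = base_embed
        self.generate = generate

    def embed_documents(self, texts: List[str]) -> List[Vector]:
        # Stored chunks are embedded as-is
        return [self.base_embed(t) for t in texts]

    def embed_query(self, query: str) -> Vector:
        # Generate hypothetical documents, embed them, and average the vectors
        hypothetical_docs = self.generate(query)
        vectors = self.embed_documents(hypothetical_docs)
        return [sum(col) / len(vectors) for col in zip(*vectors)]


def length_vowel_embed(text: str) -> Vector:
    """Hypothetical 2-d embedding: [character count, vowel count]."""
    return [float(len(text)), float(sum(ch in "aeiou" for ch in text.lower()))]


def canned_generate(query: str) -> List[str]:
    """Stand-in for the LLM; always returns the same two short 'documents'."""
    return ["solar power", "wind"]


embedder = ToyHyDEEmbedder(length_vowel_embed, canned_generate)
doc_vectors = embedder.embed_documents(["solar power"])
query_vector = embedder.embed_query("Which technologies cut emissions?")
print(doc_vectors, query_vector)
```

Because only the query path changes, a wrapper of this kind can be passed wherever a plain embedding object is expected, which is how `hyde_embeddings` is used in the next cell.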

In [9]:
# Create vector store
vectorstore = Chroma.from_documents(chunks, hyde_embeddings)

# Query
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("What are the roles of technology in Climate Change Mitigation?")

print(docs[0].page_content)

protect ecosystems. Practices such as agroforestry, precision farming, and regenerative 
agriculture offer pathways to a more sustainable and resilient food system. 
By understanding the causes, effects, and potential solutions to climate change, we can take 
informed actions to protect our planet for future generations. Global cooperation, innovation, 
and commitment are key to addressing this pressing challenge. 
 
Chapter 5: The Role of Technology in Climate Change 
Mitigation 
Advanced Renewable Energy Solutions 
Next-Generation Solar Technologies


### Minimal Usage

In [10]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

# Load PDF
docs = PyPDFLoader("data/Understanding_Climate_Change.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000).split_documents(docs)

# Create vector store and retriever
retriever = Chroma.from_documents(chunks, OpenAIEmbeddings()).as_retriever()
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# HYDE prompt
hyde_prompt = ChatPromptTemplate.from_template("Write a passage answering: {question}")

# Helper function to format documents
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

# Create HyDE RAG chain
hyde_chain = (
    {"question": RunnablePassthrough(), "hypothetical": hyde_prompt | llm | StrOutputParser()}
    | RunnableLambda(lambda x: {  # retrieve with the hypothetical passage, keep the original question
        "context": format_docs(retriever.invoke(x["hypothetical"])), 
        "question": x["question"]
    })
    | ChatPromptTemplate.from_template("Context: {context}\n\nQuestion: {question}\n\nAnswer:")
    | llm
    | StrOutputParser()
)

# Query
answer = hyde_chain.invoke("What is this about?")
print(answer)

This text is about the importance of informing the public about climate change through various means such as journalism, public engagement initiatives, integrating climate education into school curricula, public awareness campaigns, and lifelong learning initiatives. It emphasizes the role of media organizations, education, and community engagement in raising awareness, promoting sustainable behaviors, and preparing future generations to address climate challenges.


### Comparison

In [11]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

print("📚 Loading and processing PDF...")
docs = PyPDFLoader("data/Understanding_Climate_Change.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
print(f"✅ Created {len(chunks)} chunks from {len(docs)} pages")

print("\n💾 Creating vector store...")
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
print("✅ Vector store ready")

# Helper functions
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

def retrieve_direct(query):
    """Direct retrieval using the query."""
    docs = retriever.invoke(query)
    return format_docs(docs)

def retrieve_with_hyde(inputs):
    """Retrieve using hypothetical document."""
    hypothetical = inputs["hypothetical"]
    
    # Show what was generated
    print("\n🔮 Generated hypothetical document:")
    print(f"{hypothetical[:200]}...\n")
    
    docs = retriever.invoke(hypothetical)
    
    return {
        "context": format_docs(docs),
        "question": inputs["question"]
    }

# QA Prompt
qa_prompt = ChatPromptTemplate.from_template(
    """Use the following context to answer the question.

Context: {context}

Question: {question}

Answer:"""
)

# ============================================================================
# METHOD 1: STANDARD RAG (Direct Retrieval)
# ============================================================================
print("\n🔧 Building STANDARD RAG chain...")
standard_chain = (
    {
        "context": RunnableLambda(retrieve_direct),
        "question": RunnablePassthrough()
    }
    | qa_prompt
    | llm
    | StrOutputParser()
)
print("✅ Standard RAG chain ready")

# ============================================================================
# METHOD 2: HYDE RAG (Hypothetical Document Retrieval)
# ============================================================================
print("\n🔧 Building HYDE RAG chain...")
hyde_prompt = ChatPromptTemplate.from_template(
    """Write a detailed, informative passage that would answer this question. 
Include relevant facts, explanations, and context.

Question: {question}

Passage:"""
)

hyde_chain = (
    {
        "question": RunnablePassthrough(),
        "hypothetical": hyde_prompt | llm | StrOutputParser()
    }
    | RunnableLambda(retrieve_with_hyde)
    | qa_prompt
    | llm
    | StrOutputParser()
)
print("✅ HYDE RAG chain ready")

# ============================================================================
# COMPARE RESULTS
# ============================================================================
query = "What are the key findings about climate change?"

print("\n" + "=" * 80)
print("🔍 STANDARD RAG (Direct Query Retrieval)")
print("=" * 80)
standard_answer = standard_chain.invoke(query)
print(standard_answer)

print("\n" + "=" * 80)
print("🔮 HYDE RAG (Hypothetical Document Retrieval)")
print("=" * 80)
hyde_answer = hyde_chain.invoke(query)
print(hyde_answer)

print("\n" + "=" * 80)
print("📊 COMPARISON")
print("=" * 80)
print(f"Standard answer length: {len(standard_answer)} chars")
print(f"HYDE answer length: {len(hyde_answer)} chars")

📚 Loading and processing PDF...
✅ Created 97 chunks from 33 pages

💾 Creating vector store...
✅ Vector store ready

🔧 Building STANDARD RAG chain...
✅ Standard RAG chain ready

🔧 Building HYDE RAG chain...
✅ HYDE RAG chain ready

🔍 STANDARD RAG (Direct Query Retrieval)
Some key findings about climate change include the increase in extreme weather events such as hurricanes, heatwaves, droughts, and heavy rainfall. Warmer ocean temperatures intensify hurricanes and typhoons, leading to more destructive storms. Climate change is also linked to rising temperatures, more frequent and severe heatwaves, and changing seasons, all of which have significant impacts on ecosystems, human health, agriculture, and infrastructure.

🔮 HYDE RAG (Hypothetical Document Retrieval)

🔮 Generated hypothetical document:
Climate change is a pressing global issue that has been extensively studied by scientists and researchers. Some key findings about climate change include the fact 
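
Answer length alone says little about retrieval quality. A more direct check is to compare which chunks each method retrieved. The helper below is a hypothetical sketch (`retrieval_overlap` is not part of the notebook) that works on plain lists of chunk texts, for example `[d.page_content for d in retriever.invoke(query)]` for the direct path and the same call with the hypothetical passage for the HyDE path.

```python
def retrieval_overlap(direct_chunks, hyde_chunks):
    """Summarize how two retrieved chunk lists differ (illustrative helper)."""
    direct_set, hyde_set = set(direct_chunks), set(hyde_chunks)
    union = direct_set | hyde_set
    return {
        "shared": sorted(direct_set & hyde_set),
        "only_direct": sorted(direct_set - hyde_set),
        "only_hyde": sorted(hyde_set - direct_set),
        # Jaccard similarity: 1.0 means both methods retrieved the same chunks
        "jaccard": len(direct_set & hyde_set) / len(union) if union else 1.0,
    }


# Placeholder chunk texts standing in for real retrieval results
direct = ["chunk A", "chunk B", "chunk C"]
hyde = ["chunk B", "chunk C", "chunk D"]
report = retrieval_overlap(direct, hyde)
print(report)
```

A low overlap only shows that the two methods disagree; judging which set is better still requires reading the chunks or scoring them against labeled relevant passages.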

## LlamaIndex Implementation

In [None]:
#!uv pip install llama-index llama-index-llms-openai llama-index-embeddings-openai

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
import os

# Configure
os.environ["OPENAI_API_KEY"] = "your-api-key"

# Set global settings
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.embed_model = OpenAIEmbedding()

# Load PDF
documents = SimpleDirectoryReader(input_files=["document.pdf"]).load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create HYDE query engine
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(
    index.as_query_engine(),
    query_transform=hyde
)

# Query
response = hyde_query_engine.query("What is this document about?")
print(response)
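
The `include_original=True` flag suggests that the original query is kept alongside the generated text when the search embedding is built. The sketch below illustrates one way such a combination could work, assuming the two embeddings are simply averaged. It does not reproduce LlamaIndex internals, and `toy_embed`, `embedding_inputs`, and `mean_vector` are hypothetical names.

```python
VOCAB = ["climate", "energy", "policy"]


def toy_embed(text):
    """Toy embedding: how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def embedding_inputs(query, hypothetical_doc, include_original=True):
    """Strings whose embeddings are combined into the search vector."""
    strings = [hypothetical_doc]
    if include_original:
        strings.append(query)
    return strings


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (assumed aggregation)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


query = "climate policy"
hypothetical_doc = "energy policy shapes climate outcomes and energy use"

with_original = mean_vector([toy_embed(s) for s in embedding_inputs(query, hypothetical_doc, True)])
without_original = mean_vector([toy_embed(s) for s in embedding_inputs(query, hypothetical_doc, False)])
print(with_original, without_original)
```

Keeping the original query in the mix pulls the search vector back toward what the user actually asked, which can limit the damage when the generated passage drifts off topic.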

### Comparison

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Load PDF
documents = SimpleDirectoryReader(input_files=["document.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)

# Standard query engine
standard_engine = index.as_query_engine()

# HYDE query engine
hyde_engine = TransformQueryEngine(
    index.as_query_engine(),
    query_transform=HyDEQueryTransform(include_original=True)
)

# Compare
query = "What are the main conclusions?"

print("=" * 80)
print("STANDARD RETRIEVAL")
print("=" * 80)
print(standard_engine.query(query))

print("\n" + "=" * 80)
print("HYDE RETRIEVAL")
print("=" * 80)
print(hyde_engine.query(query))
