# **Hypothetical Document Embeddings (HyDE) RAG**

HyDE asks an LLM to write a hypothetical document for each query, that is, a plausible answer, and embeds that document instead of the query. Conventional RAG systems typically embed the user's query directly and compare it against the stored document embeddings. Because a hypothetical answer resembles the stored documents more closely than a short question does, its embedding steers retrieval toward passages that are more likely to contain pertinent information.

Research Paper: [HyDE](https://arxiv.org/pdf/2212.10496)

# **📚 Notebook Summary & Step-by-Step Guide**

## **🎯 What This Notebook Does**
This notebook implements **HyDE (Hypothetical Document Embeddings) RAG** - a novel approach that generates **hypothetical ideal documents** to improve retrieval. Instead of directly searching with the user's query, HyDE first generates what an ideal answer document would look like, then uses that hypothetical document for similarity search.

## **🔧 Key Libraries & Their Roles**

### **Core RAG Libraries**
- **`langchain`** - Main orchestration framework for advanced RAG patterns
- **`langchain-openai`** - OpenAI integration for embeddings and generation
- **`langchain-weaviate`** - Weaviate vector database integration
- **`athina`** - RAG evaluation and monitoring framework

### **Vector Database**
- **`weaviate`** - Enterprise-grade vector database with GraphQL API
- **`weaviate.classes.init.Auth`** - Authentication for Weaviate cloud services
- **`OpenAIEmbeddings`** - Text-to-vector conversion for semantic search

### **Hypothetical Document Generation**
- **`ChatOpenAI`** - LLM for generating hypothetical documents
- **`HypotheticalDocumentEmbedder`** - LangChain component for HyDE implementation

## **📋 Step-by-Step Process**

### **Step 1: Advanced Setup**
- Install HyDE-specific packages: `athina`, `langchain-weaviate`
- Configure multiple API keys: OpenAI, Athina, Weaviate
- Set up enterprise vector database connection

### **Step 2: Document Processing**
- Load CSV data using standard `CSVLoader`
- Split documents into 500-character chunks
- Prepare documents for hypothetical matching
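
The chunking step can be illustrated with a minimal sketch. It assumes plain fixed-width slicing with no overlap; the notebook itself uses LangChain's `RecursiveCharacterTextSplitter`, which additionally tries to break at separators such as paragraph and line boundaries, so the helper `split_into_chunks` below is a simplified stand-in and not that library's algorithm.

```python
def split_into_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters, with no overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical 1200-character input, used only to show the resulting chunk sizes.
sample = "x" * 1200
chunks = split_into_chunks(sample)
print([len(chunk) for chunk in chunks])  # [500, 500, 200]
```

With no overlap, concatenating the chunks reproduces the original text exactly.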

### **Step 3: Weaviate Vector Database Setup**
- Create enterprise Weaviate vectorstore with authentication
- Configure cloud-based vector storage with GraphQL API
- Index documents with OpenAI embeddings

### **Step 4: HyDE Implementation**
- **Hypothetical Generation**: Create ideal document based on query
- **Document Embedding**: Convert hypothetical document to vector
- **Similarity Search**: Find real documents similar to hypothetical one

### **Step 5: Query Transformation Workflow**
```
User Query → Generate Hypothetical Document → Embed Hypothetical → Search Real Docs
```

### **Step 6: Enhanced Retrieval Process**
- Generate a hypothetical document for the query (this notebook generates one; several can be generated for broader coverage)
- Use hypothetical embeddings instead of direct query embeddings
- Retrieve documents that match the "ideal" response pattern
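
When more than one hypothetical document is generated for a query, their embeddings have to be combined into a single search vector. The HyDE paper does this by averaging the embeddings of the sampled documents. The sketch below shows only that averaging step; the three-dimensional vectors are made-up values for illustration, and `mean_vector` is a hypothetical helper, not part of LangChain or Weaviate.

```python
def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Made-up embeddings of three hypothetical documents for the same query.
hypothetical_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
]
search_vector = mean_vector(hypothetical_embeddings)
print(search_vector)  # [2.0, 1.0, 1.0]
```

The averaged vector is then used for the similarity search in place of the query embedding.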

### **Step 7: Response Generation**
- Use retrieved documents as context for final response
- Leverage improved retrieval quality from HyDE approach
- Generate more accurate and relevant answers

## **🚀 Key Advantages of HyDE RAG**

### **Retrieval Enhancement**
- **Semantic Bridging**: Bridges gap between question and answer domains
- **Domain Alignment**: Hypothetical docs match target document style
- **Context Richness**: Generated docs provide richer search context

### **Query Understanding**
- **Intent Clarification**: Hypothetical generation clarifies query intent
- **Domain Specificity**: Creates domain-appropriate search targets
- **Vocabulary Matching**: Uses terminology likely to appear in relevant docs

### **Real-World Benefits**
- **Improved Precision**: Better matches to actually relevant documents
- **Reduced Noise**: Less irrelevant content in retrieved context
- **Domain Adaptation**: Works well across specialized domains

## **🧠 HyDE Workflow Deep Dive**

### **Traditional RAG:**
```
"What causes diabetes?" → Embed Query → Search Documents → Retrieve Results
```

### **HyDE RAG:**
```
"What causes diabetes?" → Generate Hypothetical Answer → Embed Hypothetical → Search → Better Results
```

**Hypothetical Document Example:**
```
Query: "What causes diabetes?"
Generated: "Diabetes is caused by insufficient insulin production or insulin resistance. 
Type 1 diabetes results from autoimmune destruction of pancreatic beta cells..."
```
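
The effect of searching with the hypothetical answer instead of the question can be illustrated with a toy comparison. The sketch below assumes a bag-of-words vector as a crude stand-in for a real embedding model, and `stored_passage` is an invented corpus passage; real dense embeddings capture meaning beyond shared words, so this only shows the vocabulary side of the argument.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts; a crude stand-in for a dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "What causes diabetes?"
hypothetical = (
    "Diabetes is caused by insufficient insulin production or insulin resistance. "
    "Type 1 diabetes results from autoimmune destruction of pancreatic beta cells."
)
# Invented passage standing in for a real document in the vector store.
stored_passage = (
    "In type 1 diabetes the immune system destroys the beta cells of the pancreas, "
    "so insulin production falls; in type 2 diabetes the body develops insulin resistance."
)

query_score = cosine(bag_of_words(query), bag_of_words(stored_passage))
hyde_score = cosine(bag_of_words(hypothetical), bag_of_words(stored_passage))
print(f"query vs passage:        {query_score:.2f}")
print(f"hypothetical vs passage: {hyde_score:.2f}")
```

In this toy setup the hypothetical answer scores higher against the stored passage than the bare question does, because it shares terms such as "insulin", "beta cells", and "resistance" with it.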

## **📊 Comparison Analysis**

| Aspect | Traditional RAG | HyDE RAG | Advantage |
|--------|----------------|----------|-----------|
| **Search Target** | User query | Hypothetical document | Better semantic match |
| **Vocabulary Gap** | Query ↔ Document mismatch | Aligned terminology | Improved findability |
| **Context Richness** | Short query | Rich hypothetical content | More search signals |
| **Domain Specificity** | Generic query terms | Domain-appropriate language | Better precision |

## **🏗️ Technical Architecture**

```
Query → LLM (Generate Hypothetical) → Embedding Model → Vector Search → Real Documents
  ↓                    ↓                      ↓               ↓
User Question → Ideal Answer → Vector Representation → Similar Real Docs → Final Context
```

## **⚙️ Implementation Components**

1. **Hypothetical Generator**: LLM creates ideal document
2. **Embedding Layer**: Converts hypothetical to vector
3. **Vector Database**: Stores and searches real document embeddings
4. **Similarity Matcher**: Finds real docs similar to hypothetical
5. **Context Assembler**: Prepares retrieved docs for generation
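
The five components can be wired together as in the sketch below. It is a minimal in-memory illustration under stated assumptions, not the notebook's implementation: `toy_generate` returns a canned answer in place of an LLM, `toy_embed` counts four hand-picked terms in place of an embedding model, a Python list stands in for the vector database, and the class name `HydePipeline` and the corpus are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical four-term vocabulary; a real system would use a learned embedding model.
VOCAB = ("library", "loan", "war", "insulin")


def toy_embed(text: str) -> list[float]:
    """Embedding layer stand-in: count the words that contain each vocabulary term."""
    words = text.lower().split()
    return [float(sum(term in word for word in words)) for term in VOCAB]


def toy_generate(query: str) -> str:
    """Hypothetical generator stand-in: a canned answer instead of an LLM call."""
    return "A library can borrow a book from another library through a loan request."


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class HydePipeline:
    generate: Callable[[str], str]        # 1. hypothetical generator
    embed: Callable[[str], list[float]]   # 2. embedding layer
    index: list[tuple[str, list[float]]] = field(default_factory=list)  # 3. vector store

    def add_documents(self, docs: list[str]) -> None:
        self.index.extend((doc, self.embed(doc)) for doc in docs)

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        # 4. similarity matcher: rank real documents against the hypothetical one
        vector = self.embed(self.generate(query))
        ranked = sorted(self.index, key=lambda item: cosine(vector, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

    def build_context(self, query: str, k: int = 1) -> str:
        # 5. context assembler: join retrieved passages for the final prompt
        return "\n\n".join(self.retrieve(query, k))


corpus = [
    "The borrowing library asks a lending library for the item and arranges the loan.",
    "The war in Europe ended in May 1945.",
    "Insulin resistance is a feature of type 2 diabetes.",
]
pipeline = HydePipeline(generate=toy_generate, embed=toy_embed)
pipeline.add_documents(corpus)
print(pipeline.build_context("how does interlibrary loan work"))
```

Keeping the generator and the embedder as plain callables makes it straightforward to swap the stubs for real model calls without touching the retrieval logic.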

## **💡 Learning Outcomes**
Students will understand:
- Advanced query transformation techniques for better retrieval
- Hypothetical document generation strategies
- Semantic bridging between queries and target documents
- Enterprise vector database integration (Weaviate)
- Domain adaptation techniques in RAG systems
- The concept of retrieval via generation-first approaches
- Production-scale RAG architecture with cloud vector databases

## **Initial Setup**

In [2]:
! pip install --q -U athina langchain-weaviate

In [3]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ['ATHINA_API_KEY'] = userdata.get('ATHINA_API_KEY')
os.environ['WEAVIATE_API_KEY'] = userdata.get('WEAVIATE_API_KEY')


## **Indexing**

In [4]:
# load embedding model
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [5]:
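# load data
# CSVLoader now lives in the langchain-community package; the older
# `langchain.document_loaders` import path is deprecated
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader("./context.csv")
documents = loader.load()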

In [6]:
# split documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
documents = text_splitter.split_documents(documents)

## **Weaviate Vector Database**

In [7]:
# connect to Weaviate Cloud (`os` is already imported above)
import weaviate
from weaviate.classes.init import Auth

wcd_url = 'your_url'  # placeholder: replace with your cluster URL

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,
    auth_credentials=Auth.api_key(os.environ['WEAVIATE_API_KEY']),
    headers={'X-OpenAI-Api-key': os.environ["OPENAI_API_KEY"]}
)

In [8]:
# create vectorstore
from langchain_weaviate.vectorstores import WeaviateVectorStore
vectorstore = WeaviateVectorStore.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
    client=client,
    index_name="your_collection_name",
    text_key="text",
)

In [9]:
# checking similarity search
vectorstore.similarity_search("world war II", k=3)

[Document(metadata={'source': './context.csv', 'row': 36.0}, page_content='context: ["The European theatre of World War II was one of the two main theatres of combat during World War II. It saw heavy fighting across Europe for almost six years, starting with Germany\'s invasion of Poland on 1 September 1939 and ending with the Western Allies conquering most of Western Europe, the Soviet Union conquering most of Eastern Europe and Germany\'s unconditional surrender on 8 May 1945 although fighting continued elsewhere in Europe until 25 May. On 5 June 1945, the Berlin'),
 Document(metadata={'source': './context.csv', 'row': 59.0}, page_content='war to end all wars" due to their perception of its then-unparalleled scale, devastation, and loss of life. After World War II\''),
 Document(metadata={'source': './context.csv', 'row': 59.0}, page_content='similarly wrote, "Some wars name themselves. This is the Great War." Contemporary Europeans also referred to it as "the war to end war" and it 

In [None]:
# # call vectorstore from weaviate cloud
# vectorstore = WeaviateVectorStore(client=client,index_name="your_collection_name",embedding=OpenAIEmbeddings(),text_key="text")

## **Chromadb (Optional)**

In [None]:
# # optional vectorstore
# !pip install chromadb
# # create vectorstore
# from langchain.vectorstores import Chroma
# vectorstore = Chroma.from_documents(documents, embeddings)

## **Retriever**

In [None]:
# create retriever
retriever = vectorstore.as_retriever()

## **Hypothetical Answer Chain**

In [None]:
# create llm
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()

In [None]:
# chain without the retriever: the LLM answers from its own knowledge,
# and that answer serves as the hypothetical document for HyDE
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

template ="""
You are a helpful assistant that answers questions.
Question: {input}
Answer:
"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", template),
        ("human", "{input}"),
    ]
)
qa_no_context = prompt | llm | StrOutputParser()

In [None]:
# response
question = 'how does interlibrary loan work'
answer = qa_no_context.invoke({"input": question})
answer

"Interlibrary loan is a service offered by libraries that allows patrons to borrow materials from other libraries. Here is how it typically works:\n\n1. **Request:** If a patron needs a book, article, or other material that their local library does not have, they can request it through interlibrary loan.\n  \n2. **Search:** The library staff will search for a library that owns the requested material and is willing to lend it.\n  \n3. **Request Submission:** Once a lending library is found, the request is submitted, and the material is shipped to the patron's local library.\n  \n4. **Pickup:** The patron can then pick up the material from their library, usually for a limited loan period.\n  \n5. **Return:** After using the material, the patron returns it to their library, which then returns it to the lending library.\n\nThere may be fees associated with interlibrary loan, depending on the policies of the lending library or the patron's local library. Interlibrary loan is a valuable serv

## **Combined RAG Chain**

In [None]:
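# HyDE retrieval: generate the hypothetical answer, then pass that text
# (not the original question) to the retriever as the search query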
retrieval_chain = qa_no_context | retriever
retrieved_docs = retrieval_chain.invoke({"input":question})

In [None]:
template = """
You are a helpful assistant that answers questions based on the provided context.
Use the provided context to answer the question.
Question: {input}
Context: {context}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

In [None]:
# final response
final_rag_chain.invoke({"context":retrieved_docs,"input":question})

'Interlibrary loan works by allowing patrons of one library to borrow physical materials or receive electronic documents that are held by another library. The borrowing library identifies potential lending libraries with the desired item, and the lending library delivers the item either physically or electronically. The borrowing library then receives the item, delivers it to their patron, and arranges for its return if necessary. In some cases, fees may accompany interlibrary loan services. Libraries negotiate for interlibrary loan eligibility, especially for digital materials like ebooks, through legal, technical, and licensing aspects.'
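
The final chain above receives `retrieved_docs` as a list of document objects, so the prompt most likely ends up containing their printed representation, metadata included. One common pattern is to join only the passage text before filling the prompt. The sketch below assumes a small `Doc` dataclass as a stand-in for LangChain's `Document` (only `page_content` and `metadata` are modeled), and `format_docs` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: only page_content and metadata are modeled."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the passage texts with blank lines, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Doc("First passage.", {"source": "./context.csv", "row": 1}),
    Doc("Second passage.", {"source": "./context.csv", "row": 2}),
]
print(format_docs(docs))
```

In the notebook, the same helper could be applied to `retrieved_docs` before it is passed as `context`, since it relies only on the `page_content` attribute.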

## **Preparing Data for Evaluation**

In [None]:
# create dataset
question = ["how does interlibrary loan work"]
response = []
contexts = []

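# Inference
for query in question:
  # retrieve with the HyDE chain for each query, so the generated response
  # and the evaluated contexts come from the same retrieval
  hyde_docs = retrieval_chain.invoke({"input": query})
  response.append(final_rag_chain.invoke({"context": hyde_docs, "input": query}))
  contexts.append([doc.page_content for doc in hyde_docs])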

# To dict
data = {
    "query": question,
    "response": response,
    "context": contexts,
}

In [None]:
# create dataset
from datasets import Dataset
dataset = Dataset.from_dict(data)

In [None]:
# create dataframe
import pandas as pd
df = pd.DataFrame(dataset)

In [None]:
df

| | query | response | context |
|---|---|---|---|
| 0 | how does interlibrary loan work | Interlibrary loan works by enabling patrons of... | [Procedures and methods ==\n\nAfter receiving ... |


In [None]:
# Convert to dictionary
df_dict = df.to_dict(orient='records')

# Convert context to list
for record in df_dict:
    if not isinstance(record.get('context'), list):
        if record.get('context') is None:
            record['context'] = []
        else:
            record['context'] = [record['context']]

## **Evaluation in Athina AI**

We will use the **Context Relevancy** eval here. It measures the relevancy of the retrieved context, calculated from both the query and the contexts. Please refer to our [documentation](https://docs.athina.ai/api-reference/evals/preset-evals/overview) for further details.
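
The evaluator's own description of this metric, which appears in the results at the end of the notebook, is the number of context sentences relevant to the query divided by the total number of sentences in the retrieved context. The arithmetic can be sketched as follows. This is an illustration only: the real evaluator relies on an LLM to decide which sentences are relevant, whereas `keyword_judge` below is a hypothetical keyword check, and the sentence splitter is a simple regular expression.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_relevancy(contexts: list[str], is_relevant: Callable[[str], bool]) -> float:
    """Relevant sentences divided by all sentences across the retrieved contexts."""
    sentences = [s for context in contexts for s in split_sentences(context)]
    if not sentences:
        return 0.0
    return sum(1 for s in sentences if is_relevant(s)) / len(sentences)


def keyword_judge(sentence: str) -> bool:
    """Hypothetical relevance judge for a question about interlibrary loans."""
    return any(term in sentence.lower() for term in ("librar", "loan"))


# Invented contexts: two of the four sentences concern libraries or loans.
example_contexts = [
    "Libraries lend items to other libraries. The war ended in 1945.",
    "Fees may apply to a loan. Insulin regulates blood sugar.",
]
score = context_relevancy(example_contexts, keyword_judge)
print(score)  # 0.5
```

Because the denominator counts every retrieved sentence, long chunks with only a few on-topic sentences produce low scores even when the needed passage has been retrieved.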

In [None]:
# set api keys for Athina evals
from athina.keys import AthinaApiKey, OpenAiApiKey
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

In [None]:
# load dataset
from athina.loaders import Loader
dataset = Loader().load_dict(df_dict)

In [None]:
# evaluate
from athina.evals import RagasContextRelevancy
RagasContextRelevancy(model="gpt-4o").run_batch(data=dataset).to_df()

evaluating with [context_relevancy]


100%|██████████| 1/1 [00:01<00:00,  1.82s/it]


You can view your dataset at: https://app.athina.ai/develop/1db6cbc3-1096-4b1d-881e-db538858691d


| | query | context | response | expected_response | display_name | failed | grade_reason | runtime | model | ragas_context_relevancy |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | how does interlibrary loan work | [Procedures and methods ==\n\nAfter receiving a request from their patron, the borrowing library identifies potential lending libraries with the desired item. The lending library then delivers the item physically or electronically, and the borrowing library receives the item, delivers it to their patron, and if necessary, arranges for its return. In some cases, fees accompany interlibrary loan services. While the majority of interlibrary loan requests are now managed through semi-automated, ... | Interlibrary loan works by enabling patrons of one library to borrow physical materials or receive electronic documents from another library that holds the desired items. The borrowing library identifies potential lending libraries with the item, and the lending library delivers the item to the borrowing library either physically or electronically. If necessary, the borrowing library arranges for the return of the item. Fees may accompany interlibrary loan services, and libraries negotiate f... | | Ragas Context Relevancy | | This metric is calulated by dividing the number of sentences in context that are relevant for answering the given query by the total number of sentences in the retrieved context | 2477 | gpt-4o | 0.04 |
