#### Hypothetical Document Embedding (HyDE) in Document Retrieval

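This code implements a Hypothetical Document Embedding (HyDE) system for document retrieval. Instead of embedding the user's question directly, HyDE asks an LLM to write a hypothetical document that answers the question, then uses that document as the search text. A short question and a stored passage tend to differ in length, vocabulary, and style, whereas a hypothetical answer resembles the indexed chunks much more closely. This is how HyDE aims to bridge the gap between query and document distributions in vector space.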
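
To make that intuition concrete, here is a toy sketch that compares a question and a hypothetical answer against one document chunk. Everything in it is an illustrative assumption rather than part of the notebook: the example texts are made up, `bow_cosine` is a hypothetical helper, and word-count vectors are only a crude stand-in for the dense embeddings used in the real pipeline.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up texts for illustration only.
question = "What is the main cause of climate change?"
hypothetical_answer = (
    "Burning fossil fuels releases carbon dioxide and other greenhouse gases, "
    "which trap heat in the atmosphere."
)
chunk = (
    "Greenhouse gases such as carbon dioxide trap heat in the atmosphere. "
    "Burning fossil fuels is the largest source."
)

query_score = bow_cosine(question, chunk)
hyde_score = bow_cosine(hypothetical_answer, chunk)
print(f"question vs chunk:            {query_score:.2f}")
print(f"hypothetical answer vs chunk: {hyde_score:.2f}")
```

In this toy setup the hypothetical answer scores noticeably higher than the question, because it shares content words with the chunk while the question shares only function words. Real embedding models capture meaning beyond word overlap, so the gap is usually less extreme in practice.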

In [9]:
import os
import sys

from dotenv import load_dotenv
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
# `langchain.vectorstores` is a deprecated import path; use the community package.
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate
from utility import encode_pdf, show_context

# Load API keys (e.g. OPENAI_API_KEY for ChatOpenAI) from a local .env file.
load_dotenv()
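
The `utility` module is not shown in this section. Judging from the output printed further down, `show_context` appears to print each retrieved chunk under a numbered `Context N:` header. A minimal stand-in under that assumption could look like the following; the name `show_context_sketch` is hypothetical and is chosen so that it does not shadow the real helper.

```python
from typing import Sequence


def show_context_sketch(context: Sequence[str]) -> None:
    """Print each chunk under a 1-based 'Context N:' header (assumed format)."""
    for i, chunk in enumerate(context, start=1):
        print(f"Context {i}:")
        print(chunk)
        print()  # blank line between chunks


show_context_sketch(["first retrieved chunk", "second retrieved chunk"])
```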

In [2]:
file_path = "data/Understanding_Climate_Change.pdf"

In [4]:
class HyDERetriever:
    """Retrieve chunks by searching with an LLM-generated hypothetical answer."""

    def __init__(self, file_path, chunk_size=500, chunk_overlap=100):
        self.llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini", max_tokens=4000)
        # Note: this embedding model is not passed to encode_pdf below.
        self.embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

        # encode_pdf (from utility) is expected to chunk the PDF and return a vector store.
        self.vectorstore = encode_pdf(file_path, chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap)

        # Ask for a document about as long as one chunk, so the hypothetical
        # text resembles what is stored in the index.
        self.prompt = PromptTemplate(
            template="""Given the question '{query}', generate a hypothetical document that directly answers this question. The document should be detailed and in-depth.
            The document size has to be exactly {chunk_size} characters.""",
            input_variables=['query', 'chunk_size']
        )

        self.hyde_chain = self.prompt | self.llm

    def generate_hypothetical_document(self, query):
        """Return the LLM-written hypothetical answer as plain text."""
        response = self.hyde_chain.invoke({'query': query, 'chunk_size': self.chunk_size})
        return response.content

    def retriever(self, query, k=3):
        """Return the hypothetical document and the k chunks most similar to it."""
        hypothetical_doc = self.generate_hypothetical_document(query)
        hyde_retriever = self.vectorstore.as_retriever(search_kwargs={'k': k})
        # Search with the hypothetical document, not with the raw query.
        similar_doc = hyde_retriever.invoke(hypothetical_doc)
        return hypothetical_doc, similar_doc
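
The two-step flow of the class (generate a hypothetical document, then search with it) can be tried without API keys by swapping in stand-ins. The sketch below is only an illustration under stated assumptions: `fake_llm`, `TinyStore`, `tokens`, and `hyde_retrieve` are hypothetical names, the chunks are made up, and word overlap (Jaccard similarity) replaces real embeddings.

```python
import re
from typing import Callable, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TinyStore:
    """Hypothetical in-memory store that ranks chunks by Jaccard word overlap."""

    def __init__(self, chunks: List[str]):
        self.chunks = chunks

    def search(self, text: str, k: int = 3) -> List[str]:
        q = tokens(text)

        def score(chunk: str) -> float:
            c = tokens(chunk)
            union = q | c
            return len(q & c) / len(union) if union else 0.0

        return sorted(self.chunks, key=score, reverse=True)[:k]


def fake_llm(query: str) -> str:
    """Stand-in for the LLM call: returns a canned hypothetical answer."""
    return "Burning fossil fuels releases carbon dioxide and other greenhouse gases."


def hyde_retrieve(query: str, generate: Callable[[str], str],
                  store: TinyStore, k: int = 3) -> Tuple[str, List[str]]:
    hypothetical_doc = generate(query)             # step 1: query -> hypothetical document
    hits = store.search(hypothetical_doc, k=k)     # step 2: search with that document
    return hypothetical_doc, hits


store = TinyStore([
    "Fossil fuels such as coal and oil release carbon dioxide when burned.",
    "What is the main topic of this chapter? It is coastal erosion.",
    "Rice paddies and livestock emit methane.",
])
query = "What is the main cause of climate change?"
doc, hits = hyde_retrieve(query, fake_llm, store, k=1)

print("raw query ->", store.search(query, k=1)[0])
print("HyDE      ->", hits[0])
```

With these made-up chunks, searching with the raw question returns the chunk that merely shares question words, while searching with the hypothetical answer returns the fossil-fuel chunk. Purely lexical matching exaggerates the effect; it is meant to show the control flow, not to predict how a dense retriever behaves.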

In [5]:
ret = HyDERetriever(file_path)

In [6]:
test_query = "What is the main cause of climate change?"
hypothetical_doc, results = ret.retriever(test_query)

In [10]:
docs_content = [doc.page_content for doc in results]

print("hypothetical_doc:\n")
print(hypothetical_doc)
show_context(docs_content)

hypothetical_doc:

**The Main Cause of Climate Change**

Climate change primarily results from human activities, particularly the burning of fossil fuels such as coal, oil, and natural gas. This process releases significant amounts of carbon dioxide (CO2) and other greenhouse gases into the atmosphere, enhancing the greenhouse effect. Deforestation further exacerbates the issue by reducing the number of trees that can absorb CO2. Additionally, industrial processes, agriculture, and waste management contribute to emissions. The cumulative effect of these activities leads to global warming, altering weather patterns and impacting ecosystems worldwide.
Context 1:
crucial role in sequestering carbon. Logging and land-use changes in these regions contribute 
to climate change. These forests are vital for regulating the Earth's climate and supporting 
indigenous communities and wildlife. 
Agriculture 
Agriculture contributes to climate change through methane emissions from livestock, rice 
p