In [7]:
%pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


# Sparse Retrieval (BM25)

In [2]:
from langchain.retrievers import BM25Retriever
from langchain.schema import Document

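# The index holds a single document, so it is the top hit for this query.
docs = [Document(page_content="India won the cricket match.")]
retriever = BM25Retriever.from_documents(docs)
query = "Who won the game?"
# invoke() replaces the deprecated get_relevant_documents()
results = retriever.invoke(query)
print(results[0].page_content)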

India won the cricket match.


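To make the "sparse" part concrete, here is a minimal pure-Python sketch of BM25 scoring. It is an illustration only, not the exact formula or tokenizer used by `rank_bm25` or `BM25Retriever`: the helper names (`toy_tokenize`, `toy_bm25_scores`), the toy corpus, the punctuation stripping, and the `k1`/`b` defaults are all assumptions made for this example. The point it shows is that BM25 rewards exact term overlap, so "victory" earns no credit for a query that says "won".

```python
import math
from collections import Counter


def toy_tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, strip punctuation.
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def toy_bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 variant."""
    tokenized = [toy_tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc_tokens in tokenized:
        tf = Counter(doc_tokens)
        score = 0.0
        for term in toy_tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Non-negative IDF variant (assumed here for simplicity).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            length_norm = 1 - b + b * len(doc_tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "India won the cricket match.",
    "The victory in the game delighted fans.",
    "Rain delayed the football final.",
]
toy_scores = toy_bm25_scores("Who won the game?", toy_docs)
for doc, score in zip(toy_docs, toy_scores):
    print(f"{score:.3f}  {doc}")
```

The third document shares only the common word "the" with the query, so it scores lowest; a dense retriever could still rank it by meaning, which is the motivation for the hybrid setup below.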


In [10]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import Document
from langchain.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [11]:
from langchain.document_loaders import PyMuPDFLoader

# 1. Load the PDF (one Document per page)
loader = PyMuPDFLoader("data/RAMAYANA.pdf")
documents = loader.load()

In [12]:
# 2. Chunk your documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
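The effect of `chunk_size=300` and `chunk_overlap=50` can be sketched with a plain sliding window. This is a simplification and an assumption on my part: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, so its chunks will not line up exactly with these. The function name `sliding_chunks` and the synthetic text are hypothetical.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=50):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


# Synthetic 700-character stand-in for a page of the PDF.
demo_text = "".join(str(i % 10) for i in range(700))
demo_chunks = sliding_chunks(demo_text)
print([len(c) for c in demo_chunks])
# The tail of one chunk is repeated at the head of the next.
print(demo_chunks[0][-50:] == demo_chunks[1][:50])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.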

In [13]:
# 3. Create Dense Retriever (FAISS + HuggingFace Embedding)
embedding_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2",
                                        model_kwargs={"device": "cpu"})
dense_vectorstore = FAISS.from_documents(chunks, embedding_model)
dense_retriever = dense_vectorstore.as_retriever(search_kwargs={"k": 4})
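Conceptually, the dense retriever embeds the query and returns the `k` chunks whose vectors lie closest to it. The sketch below shows that nearest-neighbor step with hand-made 2-D vectors and Euclidean distance. The vectors, the function name `top_k_nearest`, and the choice of metric are assumptions for illustration; real MiniLM embeddings have hundreds of dimensions and FAISS does this search far more efficiently.

```python
import math


def top_k_nearest(query_vec, doc_vecs, k=4):
    """Return (distance, index) pairs for the k vectors closest to query_vec."""
    distances = [(math.dist(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return sorted(distances)[:k]


# Made-up 2-D "embeddings" standing in for chunk vectors.
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
toy_query = [1.0, 0.05]

nearest = top_k_nearest(toy_query, toy_vectors, k=2)
for distance, index in nearest:
    print(f"chunk {index}: distance {distance:.3f}")
```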

In [14]:
# 4. Create Sparse Retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4  # Top k documents to retrieve

In [15]:
# 5. Create Ensemble Retriever (fuses the two rankings with weighted Reciprocal Rank Fusion)
ensemble_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],  # one weight per retriever, in the same order; raise the one whose signal should count more
)
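`EnsembleRetriever` merges the result lists by rank rather than by raw score, using weighted Reciprocal Rank Fusion (RRF). The following is a simplified sketch of that idea, not LangChain's actual code: the function name `weighted_rrf`, the chunk ids, and the two example rankings are hypothetical, and the constant `c=60` is the value commonly used for RRF, assumed here.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids; each list votes weight / (rank + c)."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical top-4 lists from the dense and the BM25 retriever.
toy_dense_ranking = ["chunk_a", "chunk_b", "chunk_c", "chunk_d"]
toy_sparse_ranking = ["chunk_b", "chunk_e", "chunk_a", "chunk_f"]

toy_fused = weighted_rrf([toy_dense_ranking, toy_sparse_ranking], weights=[0.6, 0.4])
for doc_id, score in toy_fused:
    print(f"{doc_id}: {score:.5f}")
```

In this toy case `chunk_b` ends up first even though the dense retriever ranked it second, because both retrievers placed it near the top. Chunks found by only one retriever still appear, so the fused list can be longer than either input list.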

In [16]:
# 6. Query the Hybrid Retriever
query = "How RAM army went to lanka?"
results = ensemble_retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()

In [17]:
# 7. Display results
for i, doc in enumerate(results):
    print(f"\n--- Result {i+1} ---")
    print(doc.page_content[:300], "\n...")


--- Result 1 ---
was ready and cheering with excitement, the Vanara army crossed 
the sea and reached Lanka. 
As soon as the Vanara army reached the gates of Lanka, Rama 
divided them into battalions and placed each group at important 
places. The whole area echoed with the sound of conches being 
...

--- Result 2 ---
37 
 
among us who is capable of flying across the ocean. So, Please tell 
us how we can get there.” Rama also asked his trusted friend about 
Lanka’s city plan, about its main gates, about trenches built around 
the fort and many more such information to plan the attack: Though 
...

--- Result 3 ---
army? If you plan to attack Rama, you will have to defeat me first.” 
Bharatha was extremely hurt by this suspicion. But he explained to 
Guha that he would take Rama back to Ayodhya and crown him the 
king. Guha was very happy to hear this. So he helped Bharatha, his 
...

--- Result 4 ---
war with Rama, Ravana was very angry. 
Vibhishana, are you really my brother? You t