# Hybrid Retrieval System using FAISS and BM25

## Overview
This notebook demonstrates a hybrid retrieval system that pairs ***dense retrieval*** using **FAISS** with ***sparse retrieval*** using **BM25**. The text of a PDF document is split into chunks, embedded, and indexed, so that a single query can be answered both by vector similarity and by keyword matching. The results of the two methods are returned side by side, which makes it easy to compare them and to see where each one contributes to search relevance.

## Features
- **Dense Retrieval with FAISS**: Utilizes embeddings generated by the `all-MiniLM-L6-v2` model for efficient vector similarity searches.
- **Sparse Retrieval with BM25**: Implements the BM25 algorithm for traditional keyword-based searches.
- **Hybrid Retrieval Approach**: Runs both methods on the same query and returns their top results together, so the dense and sparse rankings can be compared.
- **PDF Support**: Loads content directly from a PDF document for searching.



### Install Necessary Libraries


In [82]:
! pip install -qU semantic-chunkers datasets langchain pypdf faiss-cpu sentence_transformers rank_bm25 nltk



### Import Libraries

In [100]:
import torch
from langchain.document_loaders import PyPDFLoader
from semantic_chunkers import StatisticalChunker
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from semantic_router.encoders import HuggingFaceEncoder
from rank_bm25 import BM25Okapi

### Loading the Data (PDF Notes)

In [3]:
file_path = "/home/wassim/Downloads/KubernetesNotes.pdf"  
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()

In [4]:
pages

[Document(metadata={'source': '/home/wassim/Downloads/KubernetesNotes.pdf', 'page': 0}, page_content="Kubernetes For Everyone\nKubernetes introduction and features\nHow Kubernetes works?\nIn Kubernetes, there is a master node and multiple worker nodes, each worker node can handle\nmultiple pods.\nPods are just a bunch of containers clustered together as a working unit. You can start designing\nyour applications using pods.\nOnce your pods are ready, you can specify pod definitions to the master node, and how many you\nwant to deploy. From this point, Kubernetes is in control.\nIt takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes\nstarts new pods on a functioning worker node.\nThis makes the process of managing the containers easy and simple.\nIt makes it easy to build and add more features and improving the application to attain higher\ncustomer satisfaction.\nFinally, no matter what technology you're invested in, Kubernetes can help you.\nImage

In [5]:
content_list = [page.page_content for page in pages]
content = ' '.join(content_list)

In [6]:
content

'Kubernetes For Everyone\nKubernetes introduction and features\nHow Kubernetes works?\nIn Kubernetes, there is a master node and multiple worker nodes, each worker node can handle\nmultiple pods.\nPods are just a bunch of containers clustered together as a working unit. You can start designing\nyour applications using pods.\nOnce your pods are ready, you can specify pod definitions to the master node, and how many you\nwant to deploy. From this point, Kubernetes is in control.\nIt takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes\nstarts new pods on a functioning worker node.\nThis makes the process of managing the containers easy and simple.\nIt makes it easy to build and add more features and improving the application to attain higher\ncustomer satisfaction.\nFinally, no matter what technology you\'re invested in, Kubernetes can help you.\nImage credits: Source: Knoldus Inc What is the Master node and Worker node in #Kubernetes?\nExplained bel

### Loading the Embedding Model (all-MiniLM-L6-v2)

In [7]:
encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")

### Chunking the Data Semantically with Statistical Chunking

In [8]:
chunker = StatisticalChunker(
    encoder=encoder,
)

In [9]:
chunks = chunker(docs=[content])


2024-10-20 11:12:10 INFO semantic_chunkers.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically merging.
100%|██████████| 10/10 [00:03<00:00,  2.69it/s]


In [10]:
chunks

[[Chunk(splits=['Kubernetes For Everyone', 'Kubernetes introduction and features', 'How Kubernetes works?', 'In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle', 'multiple pods.', 'Pods are just a bunch of containers clustered together as a working unit.', 'You can start designing', 'your applications using pods.', 'Once your pods are ready, you can specify pod definitions to the master node, and how many you', 'want to deploy.', 'From this point, Kubernetes is in control.', 'It takes the pods and deploys them to the worker nods.', 'If a worker node goes down, Kubernetes', 'starts new pods on a functioning worker node.', 'This makes the process of managing the containers easy and simple.'], is_triggered=True, triggered_score=0.20561899514469217, token_count=133, metadata=None),
  Chunk(splits=['It makes it easy to build and add more features and improving the application to attain higher', 'customer satisfaction.', "Finally, no matter what tech

In [11]:
chunker.print(chunks[0])

Split 1, tokens 133, triggered by: 0.21
Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit. You can start designing your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and simple.
----------------------------------------------------------------------------------------


Split 2, tokens 123, triggered by: 0.12
It makes it easy to build and add more features and improving the application to attain higher customer satisfaction. Finally, no matte

### Convert Content into Embedding Vectors

In [13]:
concatenated_strings = []

# Iterate through each chunk in the nested list
for chunk_list in chunks:  # outer list
    for chunk in chunk_list:  # inner list of chunks
        # Join the splits of each chunk into a single string and append to the list
        concatenated_string = ' '.join(chunk.splits)
        concatenated_strings.append(concatenated_string)

In [97]:
concatenated_strings

['Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit. You can start designing your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and simple.',
 "It makes it easy to build and add more features and improving the application to attain higher customer satisfaction. Finally, no matter what technology you're invested in, Kubernetes can help you. Image credits: Source: Knoldus Inc What is the Master node and Worker node in #Kubernetes? Explained below, #Contain

In [15]:
concatenated_strings[0]

'Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit. You can start designing your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and simple.'

In [22]:
print("Number of chunks: ",len(concatenated_strings))

Number of chunks:  41


In [16]:
embeddings = encoder(concatenated_strings)

In [17]:
embeddings

[[0.0529448539018631,
  -0.02714216522872448,
  0.05985463783144951,
  0.019305849447846413,
  -0.008232896216213703,
  0.010053739883005619,
  0.0012692443560808897,
  -0.027787169441580772,
  0.06317197531461716,
  0.0818970799446106,
  -0.060394130647182465,
  0.0328148752450943,
  0.013584661297500134,
  -0.032351043075323105,
  -0.010991974733769894,
  -0.004144428763538599,
  0.05254707485437393,
  0.0028332581277936697,
  -0.001213244628161192,
  -0.0575278103351593,
  -0.014004688709974289,
  -0.05983709543943405,
  -0.040443770587444305,
  -0.021461157128214836,
  -0.08465968072414398,
  0.07583969831466675,
  -0.07944905012845993,
  -0.029923483729362488,
  0.012598779052495956,
  -0.026744134724140167,
  -0.07919760048389435,
  -0.028849810361862183,
  0.059428781270980835,
  0.07936401665210724,
  -0.026407890021800995,
  0.07387437671422958,
  0.019816020503640175,
  -0.014407020062208176,
  -0.07149255275726318,
  0.0684272050857544,
  0.01822800002992153,
  -0.0614311248

In [28]:
# The Cosine similarity between two embedding vectors
from sklearn.metrics.pairwise import cosine_similarity 
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Cosine similarity: {similarity[0][0]}")

Cosine similarity: 0.7863475968747669
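
As a reminder of what `cosine_similarity` computes, here is a minimal pure-Python sketch of the formula: the dot product divided by the product of the two vector norms. The helper name `cosine` and the toy 3-dimensional vectors are illustrative assumptions, not the real 384-dimensional embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical, not real embeddings)
u = [1.0, 2.0, 2.0]
v = [2.0, 1.0, 2.0]

print(cosine(u, v))  # 8 / (3 * 3), roughly 0.889
print(cosine(u, u))  # a vector compared with itself gives 1.0
```

A value near 1 means the two vectors point in almost the same direction, which is why the first two chunks (both introductory Kubernetes text) score fairly high.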


In [47]:
len(embeddings[6])

384

### Building the FAISS Index

In [48]:
def list_shape(lst):
    shape = [len(lst)]
    if isinstance(lst[0], list):
        shape.append(len(lst[0]))
    return shape

In [49]:
print(list_shape(embeddings))

[41, 384]


In [50]:
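# FAISS expects a float32 matrix of shape (n_vectors, dimension)
embeddings_np = np.array(embeddings).astype('float32')
# Build a flat (exact, brute-force) index; IndexFlatL2 ranks by squared Euclidean (L2) distance
index = faiss.IndexFlatL2(embeddings_np.shape[1])
# Add the chunk embeddings to the index
index.add(embeddings_np)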

In [52]:
print(f"Number of vectors in the index: {index.ntotal}")  

Number of vectors in the index: 41


In [78]:
embeddings_np[0:1]

array([[ 5.29448539e-02, -2.71421652e-02,  5.98546378e-02,
         1.93058494e-02, -8.23289622e-03,  1.00537399e-02,
         1.26924436e-03, -2.77871694e-02,  6.31719753e-02,
         8.18970799e-02, -6.03941306e-02,  3.28148752e-02,
         1.35846613e-02, -3.23510431e-02, -1.09919747e-02,
        -4.14442876e-03,  5.25470749e-02,  2.83325813e-03,
        -1.21324463e-03, -5.75278103e-02, -1.40046887e-02,
        -5.98370954e-02, -4.04437706e-02, -2.14611571e-02,
        -8.46596807e-02,  7.58396983e-02, -7.94490501e-02,
        -2.99234837e-02,  1.25987791e-02, -2.67441347e-02,
        -7.91976005e-02, -2.88498104e-02,  5.94287813e-02,
         7.93640167e-02, -2.64078900e-02,  7.38743767e-02,
         1.98160205e-02, -1.44070201e-02, -7.14925528e-02,
         6.84272051e-02,  1.82280000e-02, -6.14311248e-02,
        -1.24826156e-01, -3.12527036e-03,  6.09457819e-03,
         5.21901296e-03, -4.66375798e-02, -2.12810338e-02,
         4.69824336e-02, -1.41205080e-02,  3.71216461e-0

In [80]:
k = 3
# Sanity check: search for the k nearest neighbors of the first chunk's own embedding.
# The chunk itself should come back first, at distance 0.
distances, indices = index.search(embeddings_np[0:1], k)

print(f"Indices of nearest neighbors: {indices}")
print(f"Distances to nearest neighbors: {distances}")

Indices of nearest neighbors: [[0 2 1]]
Distances to nearest neighbors: [[0.         0.3200307  0.42730483]]
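
`IndexFlatL2` reports squared Euclidean distances. For unit-length vectors, squared distance and cosine similarity are tied together by the identity `||u - v||^2 = 2 - 2 * cos(u, v)`, so both produce the same ranking. The numbers in this notebook are consistent with that: the cosine similarity of 0.7863 between chunks 0 and 1 gives `2 - 2 * 0.7863 ≈ 0.4273`, which matches the distance to chunk 1 above. This suggests (an inference from the outputs, not something the notebook states) that the encoder returns normalized embeddings. The sketch below checks the identity on toy vectors; `squared_l2` and `normalize` are hypothetical helpers.

```python
import math

def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

# Toy unit vectors (hypothetical, not real embeddings)
u = normalize([1.0, 2.0, 2.0])
v = normalize([2.0, 1.0, 2.0])
cos_uv = sum(a * b for a, b in zip(u, v))

# Both expressions give the same number for unit vectors
print(squared_l2(u, v), 2 - 2 * cos_uv)

# Same identity applied to the cosine similarity printed earlier in this notebook
cos_01 = 0.7863475968747669
print(2 - 2 * cos_01)  # roughly 0.4273
```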


### Preparing the BM25 Index

In [90]:
concatenated_strings

['Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit. You can start designing your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and simple.',
 "It makes it easy to build and add more features and improving the application to attain higher customer satisfaction. Finally, no matter what technology you're invested in, Kubernetes can help you. Image credits: Source: Knoldus Inc What is the Master node and Worker node in #Kubernetes? Explained below, #Contain

In [103]:
# Tokenize each chunk by splitting on whitespace (case and punctuation are preserved)
tokenized_chunks = [doc.split() for doc in concatenated_strings]

print(tokenized_chunks[1])

['It', 'makes', 'it', 'easy', 'to', 'build', 'and', 'add', 'more', 'features', 'and', 'improving', 'the', 'application', 'to', 'attain', 'higher', 'customer', 'satisfaction.', 'Finally,', 'no', 'matter', 'what', 'technology', "you're", 'invested', 'in,', 'Kubernetes', 'can', 'help', 'you.', 'Image', 'credits:', 'Source:', 'Knoldus', 'Inc', 'What', 'is', 'the', 'Master', 'node', 'and', 'Worker', 'node', 'in', '#Kubernetes?', 'Explained', 'below,', '#Containerization', 'is', 'the', 'trend', 'that', 'is', 'taking', 'over', 'the', 'world,', 'allowing', 'firms', 'to', 'run', 'any', 'kind', 'of', 'different', 'applications', 'in', 'a', 'variety', 'of', 'different', 'environments.', 'To', 'keep', 'track', 'of', 'all', 'these', 'containers,', 'to', 'schedule,', 'to', 'manage,', 'and', 'to', 'orchestrate', 'them,', 'we', 'all', 'require', 'an', 'orchestration', 'tool.', 'Kubernetes', 'does', 'it']
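
Whitespace splitting keeps case and punctuation, so tokens such as `'satisfaction.'` and `'#Kubernetes?'` only match a query that contains exactly those strings, and a query term like `slave` never matches the token `master-slave`. A normalizing tokenizer is one possible refinement. The sketch below is an assumption for illustration, not what this notebook runs; `tokenize` is a hypothetical helper and would have to be applied to both the chunks and the queries.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

sentence = "Kubernetes is a master-slave type of architecture."

# Plain whitespace split: 'master-slave' and 'architecture.' stay as single tokens
print(sentence.split())

# Normalized tokens: lowercase, punctuation removed, hyphenated words split
print(tokenize(sentence))
# ['kubernetes', 'is', 'a', 'master', 'slave', 'type', 'of', 'architecture']
```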


In [104]:
bm25 = BM25Okapi(tokenized_chunks)
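
To make the scoring less of a black box, here is a minimal pure-Python sketch of the textbook BM25 formula: each query term contributes its inverse document frequency, scaled by a term-frequency factor that saturates with `k1` and is normalized for document length with `b`. This is an illustration under stated assumptions, not the `rank_bm25` implementation: `bm25_scores` is a hypothetical function, the IDF uses the non-negative `log(1 + ...)` variant, `k1=1.5` and `b=0.75` are common choices, and the library's exact IDF handling may differ.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with textbook BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy corpus (hypothetical, already tokenized)
corpus = [
    ["master", "node", "controls", "worker", "nodes"],
    ["pods", "run", "containers"],
    ["worker", "node", "runs", "pods"],
]
scores = bm25_scores(["master", "node"], corpus)
print(scores)  # document 0 matches both terms, document 1 matches none
```

Rarer terms (`master` appears in one document) carry more weight than common ones (`node` appears in two), which is why the first document ranks highest.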

In [117]:
query_test = "Master and slave"
tokenized_query = query_test.split()

In [118]:
# Retrieve the 3 chunks with the highest BM25 score for the query
top_docs = bm25.get_top_n(tokenized_query, concatenated_strings, n=3)
print(top_docs)

['exponentially well. Kubernetes is a master-slave type of architecture. It operated with Master node and worker node principles. What exactly they do? Master Node: >The main machine that controls the nodes > Main entry point for all administrative tasks > It handles the orchestration of the worker nodes Worker Node: > It is a worker machine in Kubernetes (used to be known as a minion) > This machine performs the requested tasks. The Master Node controls each Node > Runs containers inside pods > This is where the Docker engine runs and takes care of downloading images and starting containers Know in-depth concepts here in the original article: https://blog.risingstack.com/what-is-kubernetes- how-to-get-started/ #Containers are the de-facto deployment format of today. But where does #Kubernetes comes in the play? While tools such as #Docker provide the actual containers, we also need tools to take care of things such as replication, failovers, orchestration, and that is where Kubernetes

### Combining FAISS and BM25

In [121]:
def hybrid_search(query, top_k=3):
    """Return the top_k chunks from FAISS and from BM25 for the same query."""
    # Dense retrieval with FAISS: embed the query and find its nearest chunks
    query_embedding = np.array(encoder([query])).astype('float32')
    _, faiss_indices = index.search(query_embedding, top_k)

    # Sparse retrieval with BM25: score every chunk against the query tokens
    tokenized_query = query.split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_indices = bm25_scores.argsort()[-top_k:][::-1]  # highest scores first

    # Map indices back to chunk text; the two lists are returned separately, not merged
    faiss_results = [concatenated_strings[i] for i in faiss_indices[0]]
    bm25_results = [concatenated_strings[i] for i in bm25_indices]

    return {
        "faiss_results": faiss_results,
        "bm25_results": bm25_results
    }
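
`hybrid_search` returns the two result lists separately. If a single merged ranking is wanted, one common option is reciprocal rank fusion (RRF), which scores each document by the sum of `1 / (k + rank)` over the lists it appears in. This is a sketch of a technique the notebook does not use: `reciprocal_rank_fusion` is a hypothetical helper, `k=60` is a conventional constant, and the chunk ids below are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical chunk ids returned by each retriever, best match first
faiss_ids = [0, 12, 5]
bm25_ids = [12, 7, 0]

print(reciprocal_rank_fusion([faiss_ids, bm25_ids]))  # [12, 0, 7, 5]
```

Chunks found by both retrievers (12 and 0) rise to the top, while chunks found by only one keep their relative order. RRF uses ranks only, so the FAISS distances and BM25 scores do not need to be put on a common scale.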


In [122]:
query = "master and slave"
hybrid_search(query=query)

{'faiss_results': ['Kubernetes For Everyone Kubernetes introduction and features How Kubernetes works? In Kubernetes, there is a master node and multiple worker nodes, each worker node can handle multiple pods. Pods are just a bunch of containers clustered together as a working unit. You can start designing your applications using pods. Once your pods are ready, you can specify pod definitions to the master node, and how many you want to deploy. From this point, Kubernetes is in control. It takes the pods and deploys them to the worker nods. If a worker node goes down, Kubernetes starts new pods on a functioning worker node. This makes the process of managing the containers easy and simple.',
  'exponentially well. Kubernetes is a master-slave type of architecture. It operated with Master node and worker node principles. What exactly they do? Master Node: >The main machine that controls the nodes > Main entry point for all administrative tasks > It handles the orchestration of the work