<a href="https://colab.research.google.com/github/ranjith88697/Bootcamp_Acc/blob/main/Day10_BM25_and_FAISS_hybrid_search_practical_task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Goal: Create a code explanation for each cell as text below it.**

**Creating a hybrid search system using**
* Embeddings for semantic search (sentence_transformers)
* BM25 for keyword ranking (Sparse retrieval)
* FAISS as an index.









In [None]:
!pip install sentence-transformers



This command installs the Sentence Transformers library, which provides embedding models for semantic search.

In [None]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


This installs rank_bm25, a lightweight Python implementation of the BM25 ranking algorithm.
BM25 is a classic sparse retrieval method that scores documents based on keyword frequency and relevance. This gives the hybrid system the ability to capture exact keyword matches.

In [None]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.2


This installs FAISS, a fast vector similarity search library developed by Meta.
FAISS is used to build an index of dense embeddings and perform efficient nearest neighbor search.

In [None]:
import sentence_transformers

This cell imports the sentence_transformers package to confirm that the installation succeeded.

sentence_transformers: Provides pretrained embedding models that convert text into dense vectors for semantic search.

In [None]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

This cell loads all the core libraries required for building a hybrid search system:

numpy  
Used for numerical operations, especially when preparing vectors for FAISS.

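rank_bm25.BM25Okapi  
Implements the BM25 algorithm, a classic keyword-based ranking method. This provides the sparse retrieval capability.

sentence_transformers.SentenceTransformer  
Loads a pretrained embedding model that converts text into dense vectors for semantic search.

faiss  
A high-performance similarity search library used to index and query dense embeddings efficiently.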

In [None]:
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine Learning is a subset of AI.",
    "Deep Learning is a subset of Machine Learning.",
    "Natural Language Processing involves understanding text.",
    "Computer Vision allows machines to see and understand.",
    "AI includes areas like NLP and Computer Vision.",
    "The Pyramids of Giza are architectural marvels.",
    "Mozart was a prolific composer during the classical era.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile is one of the world's longest rivers.",
    "Van Gogh's Starry Night is a popular piece of art.",
    "Basketball is a sport played with a round ball and two teams."
]

This cell defines a small corpus of documents that the hybrid search system will index and search over.
The list intentionally mixes AI-related content (NLP, Computer Vision, Machine Learning) with general knowledge topics (geography, art, music, sports).

This variety makes it easier to test whether the search system can correctly identify which documents are relevant to a given query, especially when the query is about AI.

In [None]:
query = "Tell me about AI in text and vision."

This is the user query the hybrid search system will process.
It references AI, text (which relates to NLP), and vision (which relates to Computer Vision).

This makes it a good test case because the relevant information is spread across multiple documents.
The hybrid system is meant to use BM25 to match keywords like “AI”, “text”, and “vision”, and to use embeddings with FAISS to capture semantic meaning (e.g., “NLP” ≈ “text understanding”).

In [None]:
tokenized_corpus = [doc.split(" ") for doc in documents]

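This line prepares the documents for BM25 by splitting each document on spaces into a list of individual tokens.
BM25 works on tokenized text, so this step converts the raw strings into the format BM25Okapi expects.
Note that splitting on spaces keeps case and punctuation, so a document token such as "text." will not match the query token "text".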
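To see how much the tokenization choice matters, here is a small sketch that compares plain `split(" ")` with a normalizing tokenizer. The helper `simple_tokenize` is hypothetical and not part of the notebook; it is only one possible way to lowercase text and drop punctuation.

```python
import re

documents = [
    "Natural Language Processing involves understanding text.",
    "AI includes areas like NLP and Computer Vision.",
]
query = "Tell me about AI in text and vision."

# Plain whitespace splitting, as in the cell above: punctuation and case are kept
naive_doc_tokens = [doc.split(" ") for doc in documents]
naive_query_tokens = query.split(" ")

def simple_tokenize(text):
    """Hypothetical helper: lowercase the text and keep only alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

clean_doc_tokens = [simple_tokenize(doc) for doc in documents]
clean_query_tokens = simple_tokenize(query)

# The query token "text" differs from the document token "text." until normalized
print("text" in naive_doc_tokens[0], "text" in clean_doc_tokens[0])
# Likewise "vision." (query) never equals "Vision." (document) without normalization
print("vision." in naive_doc_tokens[1], "vision" in clean_doc_tokens[1])
```

If a tokenizer like this were used, the same function would need to be applied to both the documents and the query so that their tokens stay comparable.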

In [None]:
bm25 = BM25Okapi(tokenized_corpus)

This creates a BM25Okapi object from the tokenized documents.
It builds the internal BM25 index, enabling keyword-based scoring.
BM25 helps identify documents that share exact tokens with the query, such as "AI".
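For intuition about what the BM25 index computes, here is a minimal pure-Python sketch of a textbook BM25 score. It is not the rank_bm25 implementation: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, and the non-negative IDF variant are all assumptions chosen for illustration, and the library's numbers may differ.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_corpus)
    avg_len = sum(len(doc) for doc in tokenized_corpus) / n_docs
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in tokenized_corpus for term in set(doc))

    scores = []
    for doc in tokenized_corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            # Non-negative IDF variant (an assumption; libraries differ here)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized through b
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

corpus = [
    ["ai", "includes", "nlp", "and", "computer", "vision"],
    ["mount", "everest", "is", "the", "tallest", "mountain"],
    ["computer", "vision", "lets", "machines", "see"],
]
scores = bm25_scores(["ai", "vision"], corpus)
print(scores)
```

The first document matches both query terms and should score highest, the third matches only "vision", and the second matches nothing and scores zero.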

In [None]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

This loads a pretrained SentenceTransformer model that converts text into dense vector embeddings.
The chosen model, paraphrase-MiniLM-L6-v2, is lightweight and fast while still providing strong semantic understanding.

In [None]:
document_embeddings = model.encode(documents)

This line encodes every document into a numerical vector.
Each vector captures the meaning of the document, allowing the system to find semantically similar text even when exact keywords don't match.
These vectors will later be indexed by FAISS for fast similarity search.

In [None]:
index = faiss.IndexFlatL2(document_embeddings.shape[1])

This creates a FAISS index configured for L2 (Euclidean) distance.
The dimension of the index is set to match the embedding size.
FAISS will allow you to perform efficient nearest neighbor searches over the embedding space.
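Conceptually, a flat L2 index compares the query against every stored vector and returns the closest ones. The sketch below shows that idea in plain Python; `flat_l2_search` is a hypothetical stand-in, not a FAISS function. It returns squared distances, which, to the best of my knowledge, is also what IndexFlatL2 reports.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distances, positions) of the k stored vectors nearest to query."""
    distances = [
        sum((v_i - q_i) ** 2 for v_i, q_i in zip(vector, query))
        for vector in vectors
    ]
    # Sort positions by ascending distance and keep the k closest
    order = sorted(range(len(vectors)), key=lambda i: distances[i])[:k]
    return [distances[i] for i in order], order

stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]  # toy 2-dimensional "embeddings"
dists, positions = flat_l2_search(stored, query=[2.0, 2.0], k=2)
print(dists, positions)
```

The returned positions refer to the order in which vectors were stored, which is why the insertion order into an index matters later when mapping results back to documents.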

In [None]:
index.add(np.array(document_embeddings).astype('float32'))


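FAISS requires vectors in float32 format, so the embeddings are converted and added to the index.
Once added, the index is ready to perform semantic similarity searches for any query embedding you generate.
Note that the later cells in this notebook search a smaller sub-index built from the BM25 candidates rather than this full index.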

In [None]:
top_n = 10

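This sets the number of top BM25-ranked documents that are passed to the semantic search step.
Instead of searching the entire corpus, the candidate set is limited to the 10 most keyword-relevant documents, which narrows the semantic search.
With only 12 documents in this corpus, that removes just two of them; on a larger corpus the reduction would matter more.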

In [None]:
bm25_scores = bm25.get_scores(query.split(" "))

BM25 scores each document based on how well its tokens match the query tokens.
The query is split the same way as the documents so that its tokens are comparable. The result is an array of numerical scores, one per document.

In [None]:
top_docs_indices = np.argsort(bm25_scores)[-top_n:]

This line sorts the BM25 scores in ascending order and takes the indices of the 10 highest-scoring documents, which are then passed to the semantic stage.
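The ordering produced by this slicing pattern is easy to misread, so here is a small pure-Python sketch of the same selection. The helper `top_n_indices` is hypothetical and mirrors the argsort-and-slice pattern only for distinct scores; how ties are ordered may differ.

```python
def top_n_indices(scores, n):
    """Indices of the n largest scores, in ascending score order."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    return ascending[-n:]

scores = [0.0, 2.5, 0.7, 1.1]
selected = top_n_indices(scores, 2)
best_first = selected[::-1]

print(selected)    # [3, 1]: the best candidate comes last
print(best_first)  # [1, 3]: reversed to put the best candidate first
```

Because the next stage reranks the candidates anyway, the ascending order does not affect the final result here, but it matters if the BM25 ranking is inspected directly.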

In [None]:
top_docs_embeddings = [document_embeddings[i] for i in top_docs_indices]

Here you gather the dense embeddings for only the BM25-selected documents.
This reduces the search space, so FAISS compares the query only against these candidates.

In [None]:
query_embedding = model.encode([query])

The query is converted into a dense vector using the same SentenceTransformer model used for the documents.
This ensures the query and documents live in the same semantic space.

In [None]:
sub_index = faiss.IndexFlatL2(top_docs_embeddings[0].shape[0])

This initializes a new FAISS index, this time only for the top BM25 documents.
The index uses L2 distance and is sized according to the embedding dimension.

In [None]:
sub_index.add(np.array(top_docs_embeddings).astype('float32'))

This adds the embeddings of the BM25-selected documents to the FAISS sub-index.
FAISS requires vectors in float32 format, so the embeddings are converted before insertion. Once added, the sub-index is ready for semantic similarity search.

In [None]:
_, sub_dense_ranked_indices = sub_index.search(np.array(query_embedding).astype('float32'), top_n)

This performs a FAISS search using the query embedding.
FAISS returns the indices of the most semantically similar documents within the BM25 filtered subset.

The underscore _ receives the distances, which are not used here, while sub_dense_ranked_indices contains the ranked positions, ordered from nearest to farthest.

In [None]:
sub_dense_ranked_indices


array([[9, 8, 1, 0, 6, 7, 2, 4, 3, 5]])

This output shows the order of the top documents within the sub-index.
Each number refers to a position inside the BM25-filtered list, not an index into the original documents list.

In [None]:
final_ranked_indices = [top_docs_indices[i] for i in sub_dense_ranked_indices[0]]

This is the key step that merges sparse and dense retrieval.

It translates the FAISS ranking back into the original document indices.
The result is a final list of document positions ranked by semantic similarity, restricted to the BM25 candidate list.
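The index translation can be checked on a toy example. The values below are hypothetical and chosen only to make the mapping visible; they are not taken from the notebook's run.

```python
candidate_ids = [7, 2, 5]  # hypothetical original indices kept by the BM25 stage
sub_ranking = [2, 0, 1]    # hypothetical positions returned by the sub-index search

# Position p in the sub-index corresponds to candidate_ids[p] in the full corpus
final_ids = [candidate_ids[p] for p in sub_ranking]
print(final_ids)  # [5, 7, 2]
```

The sub-index only knows positions 0, 1, 2, so skipping this lookup would return the wrong documents.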

In [None]:
ranked_docs = [documents[i] for i in final_ranked_indices]

This extracts the actual text of the documents in their final hybrid-ranked order. The result is a list of documents that are both:

- Selected by keyword relevance (BM25)
- Ordered by semantic similarity to the query (FAISS)

In [None]:
ranked_docs

['AI includes areas like NLP and Computer Vision.',
 'Computer Vision allows machines to see and understand.',
 'Natural Language Processing involves understanding text.',
 'Deep Learning is a subset of Machine Learning.',
 "Van Gogh's Starry Night is a popular piece of art.",
 'Basketball is a sport played with a round ball and two teams.',
 'Mozart was a prolific composer during the classical era.',
 "The Nile is one of the world's longest rivers.",
 'The Pyramids of Giza are architectural marvels.',
 'Mount Everest is the tallest mountain on Earth.']

# Provide a brief description of the process this code implements.

This code implements a hybrid document retrieval system that combines keyword based search (BM25) with semantic vector search (SentenceTransformers + FAISS) to return the most relevant documents for a user query.

The workflow follows these steps:

- Prepare the data by loading documents and defining a query.

- Build a BM25 index to score documents based on keyword overlap with the query.

- Select the top BM25 candidates, narrowing the search space to the most likely matches.

- Generate dense embeddings for both documents and the query using a SentenceTransformer model.

- Use FAISS to perform fast semantic similarity search among the BM25 filtered documents.

- Rerank the candidates based on semantic similarity and return the final ordered list of relevant documents.

Overall, the system first filters documents using keywords, then reranks them by meaning. This can give more relevant results than keyword search alone while comparing the query against fewer vectors than a purely semantic search over the whole corpus.
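The filter-then-rerank pattern can be sketched independently of any particular library. In the sketch below, `hybrid_search`, `word_overlap`, `toy_distance`, and the hand-made `toy_vectors` are all hypothetical stand-ins for BM25 scoring and embedding distance, so it illustrates the control flow only, not the notebook's actual scores.

```python
def hybrid_search(query, documents, sparse_score, dense_distance, top_n):
    """Filter by a sparse score, then rerank the survivors by a dense distance."""
    # Stage 1: keep the top_n documents by keyword score
    by_sparse = sorted(range(len(documents)),
                       key=lambda i: sparse_score(query, documents[i]),
                       reverse=True)
    candidates = by_sparse[:top_n]
    # Stage 2: order the candidates by ascending distance to the query
    reranked = sorted(candidates, key=lambda i: dense_distance(query, documents[i]))
    return [documents[i] for i in reranked]

# Toy stand-in for BM25: number of shared lowercase words
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Toy stand-in for embeddings: hand-made 2-dimensional vectors
toy_vectors = {
    "ai vision": (1.0, 1.0),
    "ai and vision": (1.0, 2.0),
    "ai for text and vision tasks": (3.0, 1.0),
    "mount everest": (9.0, 9.0),
    "vision": (1.0, 1.5),
}

def toy_distance(query, doc):
    q, d = toy_vectors[query], toy_vectors[doc]
    return sum((a - b) ** 2 for a, b in zip(q, d))

docs = ["ai and vision", "ai for text and vision tasks", "mount everest", "vision"]
results = hybrid_search("ai vision", docs, word_overlap, toy_distance, top_n=3)
print(results)
```

Here the keyword stage drops "mount everest", and the dense stage then orders the three survivors by their distance to the query vector.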