#**Goal: Create a code explanation for each cell as text below it.**

**Creating a hybrid search system using**
* Embeddings for semantic search (sentence_transformers)
* BM25 for keyword ranking (sparse retrieval)
* FAISS as an index









In [1]:
!pip install sentence-transformers



In [2]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [3]:
!pip install faiss-cpu

Collecting faiss-cpu
  Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.6 kB)
Downloading faiss_cpu-1.13.2-cp310-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.8 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.13.2


In [4]:
import sentence_transformers

In [5]:
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import faiss

import sentence_transformers: this line imports the sentence_transformers library. It will be used to turn sentences into vectors.

numpy: for working with numbers and arrays

BM25Okapi: for keyword based search

SentenceTransformer: for creating sentence embeddings

faiss: for fast similarity search between vectors.

In [6]:
documents = [
    "Artificial Intelligence is changing the world.",
    "Machine Learning is a subset of AI.",
    "Deep Learning is a subset of Machine Learning.",
    "Natural Language Processing involves understanding text.",
    "Computer Vision allows machines to see and understand.",
    "AI includes areas like NLP and Computer Vision.",
    "The Pyramids of Giza are architectural marvels.",
    "Mozart was a prolific composer during the classical era.",
    "Mount Everest is the tallest mountain on Earth.",
    "The Nile is one of the world's longest rivers.",
    "Van Gogh's Starry Night is a popular piece of art.",
    "Basketball is a sport played with a round ball and two teams."
]

In [7]:
query = "Tell me about AI in text and vision."

The documents cell creates a list of example documents. Each document is a short sentence. These documents will be used to test the search system.

query = "Tell me about AI in text and vision."
This line sets the search query. This sentence will be used to find the most relevant documents from our list.

In [8]:
tokenized_corpus = [doc.split(" ") for doc in documents]

This line splits each document on spaces and creates a list of word lists, one per document.
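Splitting on a single space keeps case and punctuation attached to the words. For example, the query token "vision." matches neither "Vision" nor "Vision." in the documents, and "AI" does not match "AI.". The sketch below shows one possible normalizing tokenizer, assuming lowercase alphanumeric tokens are good enough for this corpus. The helper name tokenize is hypothetical and is not part of rank_bm25.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Tell me about AI in text and vision."))
print(tokenize("Machine Learning is a subset of AI."))
```

If a tokenizer like this is used, it should be applied to both the documents and the query so that the tokens line up.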

In [9]:
bm25 = BM25Okapi(tokenized_corpus)

This line creates a BM25 search model using the tokenized documents. This model will be used to score how well each document matches the search query.

In [None]:
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

In [11]:
document_embeddings = model.encode(documents)

model = SentenceTransformer('paraphrase-MiniLM-L6-v2'): Loads a pretrained model that can turn sentences into vectors.

document_embeddings = model.encode(documents): This line converts all the example documents into vectors using the model.

In [12]:
index = faiss.IndexFlatL2(document_embeddings.shape[1])

This cell creates a FAISS index that compares vectors by L2 (Euclidean) distance. The argument, document_embeddings.shape[1], is the number of dimensions in each document vector.

In [13]:
index.add(np.array(document_embeddings).astype('float32'))


This cell adds all the document vectors to the FAISS index, so we can search through them quickly.

In [14]:
top_n = 10

top_n = 10 sets the number of top results we want to get from our search to 10. Since the list has only 12 documents, the keyword stage will drop just two of them.

In [15]:
bm25_scores = bm25.get_scores(query.split(" "))

This line calculates a BM25 score for each document based on how well it matches the search query. The query is split on spaces in the same way as the documents.
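To make the scoring less of a black box, here is a simplified pure-Python sketch of Okapi BM25. It is not the rank_bm25 implementation: the function name bm25_scores, the defaults k1=1.5 and b=0.75, and the non-negative IDF variant are assumptions chosen for illustration, and the library may differ in details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a simplified Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(d) for d in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in tokenized_docs for term in set(d))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms missing from the document add nothing
            n_t = doc_freq[term]
            # Rare terms get a larger weight than common ones.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Long documents are penalised relative to the average length.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["ai", "and", "vision"], ["mount", "everest"], ["ai", "text"]]
print(bm25_scores(["ai", "text"], docs))
```

In this toy corpus the second document shares no term with the query and scores 0, while the third document matches both query terms and scores highest.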

In [16]:
top_docs_indices = np.argsort(bm25_scores)[-top_n:]

This line finds the indices of the 10 documents with the highest BM25 scores. Because np.argsort sorts in ascending order, the best match comes last in the result.
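The selection can be reproduced in plain Python with made-up scores, which also shows how to flip the result if a best-first order is wanted. The scores below are invented for illustration only.

```python
scores = [0.0, 2.5, 0.7, 1.1]  # made-up BM25 scores for four documents
top_n = 2

# Same idea as np.argsort(scores): indices ordered by ascending score.
ascending = sorted(range(len(scores)), key=lambda i: scores[i])
top_ascending = ascending[-top_n:]    # best document is last
top_descending = top_ascending[::-1]  # best document is first

print(top_ascending)   # [3, 1]
print(top_descending)  # [1, 3]
```

The order does not affect the final result in this notebook, because the selected documents are reranked by embedding distance afterwards.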

In [17]:
top_docs_embeddings = [document_embeddings[i] for i in top_docs_indices]

The line above creates a list of the embeddings for the top documents found by BM25.

In [18]:
query_embedding = model.encode([query])

The line above turns the search query into an embedding by using the same model as before.

In [19]:
sub_index = faiss.IndexFlatL2(top_docs_embeddings[0].shape[0])

This line creates a new FAISS index for the top documents by using the size of their embeddings.

In [20]:
sub_index.add(np.array(top_docs_embeddings).astype('float32'))

The cell adds the embeddings of the top documents to the new FAISS sub index, so we can search for the most similar documents among just these top results.

In [24]:
_, sub_dense_ranked_indices = sub_index.search(np.array(query_embedding).astype('float32'), top_n)

The cell searches the sub index with query_embedding and returns the positions, within the sub index, of the documents closest to the query, nearest first. The distances are also returned, but they are assigned to _ and ignored.
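Conceptually, a flat L2 index compares the query with every stored vector and returns the closest ones first. The pure-Python sketch below mimics that behaviour on toy 2-D vectors. The function flat_l2_search is hypothetical. It uses squared Euclidean distance, which is what FAISS's flat L2 index is understood to report, and FAISS itself does this far more efficiently and returns NumPy arrays.

```python
def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    # Squared Euclidean distance from the query to every stored vector.
    dists = [sum((a - b) ** 2 for a, b in zip(vec, query)) for vec in vectors]
    # Positions of the k smallest distances, nearest first.
    order = sorted(range(len(vectors)), key=lambda i: dists[i])[:k]
    return [dists[i] for i in order], order

vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, [1.0, 2.0], k=2)
print(indices)    # [1, 0]
print(distances)  # [1.0, 5.0]
```

As in the notebook, the returned indices are positions in the list that was searched, not positions in any larger collection.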

In [25]:
sub_dense_ranked_indices


array([[9, 8, 1, 0, 6, 7, 2, 4, 3, 5]])

sub_dense_ranked_indices is a variable that stores the indices of the top documents in the sub index that are most similar to the query. These indices help us find out which of the top documents best match what we are searching for.

In [26]:
final_ranked_indices = [top_docs_indices[i] for i in sub_dense_ranked_indices[0]]

The cell creates a new list of indices that shows the original positions of the top documents, now ranked by how similar they are to the query in the sub index. This tells us which documents from the whole list are the best matches overall.
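The sub index only knows positions 0 to top_n - 1, so each position has to be translated back through top_docs_indices. A toy illustration with made-up values (not the ones from this notebook):

```python
# Positions in the full document list that survived the keyword stage (made up).
top_docs_indices = [7, 2, 5]
# Positions within the sub index, ordered from most to least similar (made up).
sub_ranked = [2, 0, 1]

# Look up each sub-index position to recover the original document index.
final_ranked_indices = [top_docs_indices[i] for i in sub_ranked]
print(final_ranked_indices)  # [5, 7, 2]
```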

In [27]:
ranked_docs = [documents[i] for i in final_ranked_indices]

ranked_docs = [documents[i] for i in final_ranked_indices] creates a list of the actual documents, ordered by how well they match the query.

In [28]:
ranked_docs

['AI includes areas like NLP and Computer Vision.',
 'Computer Vision allows machines to see and understand.',
 'Natural Language Processing involves understanding text.',
 'Deep Learning is a subset of Machine Learning.',
 "Van Gogh's Starry Night is a popular piece of art.",
 'Basketball is a sport played with a round ball and two teams.',
 'Mozart was a prolific composer during the classical era.',
 "The Nile is one of the world's longest rivers.",
 'The Pyramids of Giza are architectural marvels.',
 'Mount Everest is the tallest mountain on Earth.']

#Provide a brief description of the process this code implements.

This code implements a hybrid search approach that combines keyword search (BM25) and semantic search using embeddings. First, BM25 retrieves the top documents based on keyword matching. Then those documents are reranked by semantic similarity to the query to return the most relevant results.
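The two stages can be summarised in one function. The sketch below is library-free so that it runs without a model download: hybrid_search, overlap, toy_embed and VOCAB are hypothetical names, and the toy scorer and toy embedding only stand in for the roles that BM25Okapi, SentenceTransformer and FAISS play in the notebook.

```python
def hybrid_search(query, documents, keyword_score, embed, top_n):
    """Two-stage retrieval: keyword filter first, then semantic reranking.

    keyword_score(query, doc) -> float, higher is better (stand-in for BM25).
    embed(text) -> list of floats (stand-in for a sentence-embedding model).
    """
    # Stage 1: keep the top_n documents by keyword score.
    by_keyword = sorted(range(len(documents)),
                        key=lambda i: keyword_score(query, documents[i]),
                        reverse=True)
    candidates = by_keyword[:top_n]

    # Stage 2: rerank the candidates by squared L2 distance to the query.
    q_vec = embed(query)

    def distance(i):
        return sum((a - b) ** 2 for a, b in zip(embed(documents[i]), q_vec))

    reranked = sorted(candidates, key=distance)
    return [documents[i] for i in reranked]


# Toy stand-ins, for illustration only.
VOCAB = ["ai", "text", "vision", "river"]

def overlap(query, doc):
    """Count the distinct words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_embed(text):
    """Bag-of-words counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

docs = ["ai and vision", "ai for text", "the nile is a river"]
print(hybrid_search("ai text", docs, overlap, toy_embed, top_n=2))
```

Passing the scorer and the embedding function as arguments keeps the two stages separate, so either one could be swapped for a real model without changing the control flow.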