<a href="https://colab.research.google.com/github/soumyashubham10/IEEE-ML/blob/main/Soumya_Shubham_IEEE_ML_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q datasets sentence-transformers faiss-cpu


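Environment setup and library installation: this cell installs the datasets, sentence-transformers, and faiss-cpu packages used throughout the notebook.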


In [None]:
from datasets import load_dataset

dataset = load_dataset(
    "ms_marco",
    "v1.1",
    split="train[:1%]"
)

print(dataset)


This cell loads a small subset of the MS MARCO v1.1 dataset using the Hugging Face datasets library.
Only the first 1% of the training split is loaded to ensure fast execution in Google Colab.

The Hugging Face datasets library is used because it loads datasets such as MS MARCO directly by name, with no manual download step.



In [None]:
print(dataset.column_names)


This command displays the available columns in the MS MARCO dataset.
It helps to understand the dataset structure before processing.

In [None]:
# Illustrative view of the two per-passage fields used below (values elided).
{
  'passage_text': '...',
  'is_selected': 1
}


In [None]:
dataset = load_dataset(
    "ms_marco",
    "v1.1",
    split="train[:10%]"
)

print("Queries loaded:", len(dataset))


This cell reloads MS MARCO v1.1 with the Hugging Face load_dataset function, this time taking the first 10% of the training split instead of 1%. Loading only a portion reduces memory usage and keeps experimentation fast. len(dataset) then gives the number of query records that were loaded.

In [None]:
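queries = []
relevant_docs = []

for item in dataset:
    queries.append(item["query"])

    # Collect the text of every passage marked as selected for this query.
    rel = set()
    for text, sel in zip(item["passages"]["passage_text"],
                         item["passages"]["is_selected"]):
        if sel == 1:
            rel.add(text)

    # relevant_docs[i] holds the gold passages for queries[i]; it may be empty.
    relevant_docs.append(rel)

print("Total queries:", len(queries))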


In [None]:
import pandas as pd
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

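This step imports the libraries used in the rest of the notebook: pandas for the corpus DataFrame, load_dataset for the data, SentenceTransformer for the embeddings, FAISS for the vector index, and NumPy for array operations.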

In [None]:
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)


This code fixes the randomness in the program.
The same seed value (42) is given to random, NumPy, and PyTorch so that every time the code runs, it produces the same random results.
This helps with reproducibility, meaning experiments and results can be repeated and verified easily.
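As a small illustration of what seeding provides, the sketch below uses only Python's built-in random module, so it is a simplification of the cell above (no NumPy or PyTorch). The helper name draw_numbers is hypothetical. Seeding twice with the same value should reproduce the same sequence.

```python
import random


def draw_numbers(seed, count=5):
    """Return `count` pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)  # separate generator, global state is left untouched
    return [rng.random() for _ in range(count)]


first_run = draw_numbers(42)
second_run = draw_numbers(42)
other_seed = draw_numbers(7)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```

The same idea applies to np.random.seed and torch.manual_seed: each library keeps its own generator, which is why the notebook seeds all three.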

In [None]:
texts = []
doc_ids = []

for item in dataset:
    for passage in item["passages"]["passage_text"]:
        texts.append(passage)
        doc_ids.append(len(doc_ids))
        if len(texts) >= 75000:
            break
    if len(texts) >= 75000:
        break

corpus_df = pd.DataFrame({
    "doc_id": doc_ids,
    "text": texts
})

print("Total passages extracted:", len(corpus_df))
corpus_df.head()


This cell extracts individual passages from the MS MARCO dataset and builds a document corpus. Each query in MS MARCO contains multiple passages. Passages are extracted sequentially and stored in a list, stopping once 75,000 have been collected.
A unique doc_id is assigned to each passage; it is simply the passage's position in the list.
This DataFrame serves as the database for the semantic search engine.
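The nested loop with its two break checks can also be written with a generator and itertools.islice. The following is only a sketch on a made-up toy dataset, not the notebook's code: toy_items and the helper iter_passages are hypothetical, and limit stands in for the 75,000 cap.

```python
from itertools import islice


def iter_passages(items):
    """Yield every passage text, query by query, in dataset order."""
    for item in items:
        yield from item["passages"]["passage_text"]


# Toy records that mimic the nested layout used above (made-up content).
toy_items = [
    {"query": "q1", "passages": {"passage_text": ["a", "b", "c"]}},
    {"query": "q2", "passages": {"passage_text": ["d"]}},
    {"query": "q3", "passages": {"passage_text": ["e", "f"]}},
]

limit = 5  # stands in for the 75,000 cap
texts = list(islice(iter_passages(toy_items), limit))
doc_ids = list(range(len(texts)))  # doc_id is the passage's position

print(texts)    # ['a', 'b', 'c', 'd', 'e']
print(doc_ids)  # [0, 1, 2, 3, 4]
```

If the dataset holds fewer passages than the limit, islice simply stops early, which matches the behaviour of the original loop.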


In [None]:
df = pd.DataFrame({
    "doc_id": doc_ids,
    "text": texts
})


In [None]:
print("Total passages available:", len(df))


Although the target was to extract 75,000 passages, the loaded subset contained only 67,656 passages in total, so the cap was never reached.

This occurs because:
 MS MARCO passages are distributed unevenly across queries.
 Some queries contain fewer passages than others.

Since the task allows flexibility in dataset size, the extracted corpus is still sufficient and valid for building and demonstrating a semantic search engine.


In [None]:
model = SentenceTransformer("all-MiniLM-L6-v2")

This line loads the pre-trained all-MiniLM-L6-v2 SentenceTransformer model. It converts each text passage into a 384-dimensional dense embedding that captures semantic meaning, so queries and documents can be compared by vector similarity.
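To make "vector similarity" concrete, here is a minimal sketch of cosine similarity in plain Python. The 3-dimensional vectors are made up for illustration and are not outputs of the model, and the function name cosine_similarity is an assumption of this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors standing in for embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_close = [2.0, 0.0, 2.0]  # same direction as the query, different length
doc_far = [0.0, 3.0, 0.0]    # orthogonal to the query

sim_close = cosine_similarity(query_vec, doc_close)
sim_far = cosine_similarity(query_vec, doc_far)

print(round(sim_close, 4))  # close to 1: vectors point the same way
print(round(sim_far, 4))    # 0: vectors share no direction
```

Cosine similarity ignores vector length and looks only at direction, which is why doc_close scores about 1 even though it is twice as long as the query.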

In [None]:
import torch
print(torch.cuda.is_available())


This code checks whether a CUDA-enabled GPU is available in the current runtime.
torch.cuda.is_available() returns True if PyTorch can access a GPU, which allows computations to be accelerated with CUDA. Otherwise it returns False, meaning all operations will run on the CPU.


In [None]:
embeddings = model.encode(
    corpus_df["text"].tolist(),
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True
)

print("Embedding shape:", embeddings.shape)

This code turns all passages into numerical embeddings and stores them as a NumPy array. batch_size=64 sets how many passages are encoded per batch, show_progress_bar=True displays a progress bar, and embeddings.shape gives the number of passages and the embedding dimension.
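The sketch below shows, in plain Python, roughly what splitting the input into batches means. It is not the sentence-transformers implementation; the helper iter_batches and the toy passages are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


toy_texts = [f"passage {i}" for i in range(10)]  # made-up stand-ins for corpus passages
batch_sizes = [len(batch) for batch in iter_batches(toy_texts, batch_size=4)]

print(batch_sizes)  # [4, 4, 2]
```

A larger batch size means fewer model calls but more memory per call, so the best value depends on the available GPU or CPU memory.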

In [None]:
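# Scale every embedding to unit length, in place.
faiss.normalize_L2(embeddings)

# Exact inner-product index sized to the embedding dimension.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

print("Total vectors indexed:", index.ntotal)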

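This cell prepares the embeddings for similarity search. faiss.normalize_L2 scales every vector to unit length in place, and IndexFlatIP performs an exact inner-product search over all stored vectors. For unit-length vectors the inner product equals cosine similarity.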
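The idea behind a flat inner-product index can be sketched without FAISS: normalize the vectors, score the query against every stored vector, and keep the best scores. The code below is a plain-Python illustration on made-up 2-dimensional vectors, not FAISS itself; l2_normalize and flat_ip_search are hypothetical helper names.

```python
import math


def l2_normalize(vec):
    """Return a copy of `vec` scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def flat_ip_search(index_vectors, query, top_k):
    """Score the query against every stored vector; return the top_k (score, position) pairs."""
    scored = [
        (sum(q * d for q, d in zip(query, doc)), pos)
        for pos, doc in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up toy vectors standing in for passage embeddings.
corpus_vectors = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [0.0, 2.0])]
query_vector = l2_normalize([6.0, 8.0])

results = flat_ip_search(corpus_vectors, query_vector, top_k=2)
for score, pos in results:
    print(pos, round(score, 4))
```

Position 0 ranks first because it points in the same direction as the query, even though the two raw vectors had different lengths before normalization.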

In [None]:
def recall_at_k(preds, gold, k=10):
    hit = 0
    for p, g in zip(preds, gold):
        if len(set(p[:k]) & g) > 0:
            hit += 1
    return hit / len(preds)


def mrr_at_k(preds, gold, k=10):
    total = 0
    for p, g in zip(preds, gold):
        for rank, doc in enumerate(p[:k], start=1):
            if doc in g:
                total += 1 / rank
                break
    return total / len(preds)


def evaluate_dense_retriever(k=10, limit=500):
    predictions = []
    gold = []

    for q, g in zip(queries[:limit], relevant_docs[:limit]):
        res = semantic_search(q, top_k=k)
        predictions.append(res["text"].tolist())
        gold.append(g)

    # Report the cutoff that was actually used instead of a hard-coded 10.
    print(f"Recall@{k}:", recall_at_k(predictions, gold, k))
    print(f"MRR@{k}:", mrr_at_k(predictions, gold, k))


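This code measures how well the semantic search works. It runs a search for each query and compares the returned passage texts with the passages marked as relevant for that query. Recall@10 is the fraction of queries for which at least one relevant passage appears in the top 10. MRR@10 averages 1/rank of the first relevant passage in the top 10, so it rewards placing the correct result higher. A query with no passage marked as relevant always counts as a miss. Together, these scores tell us how accurate and effective the search system is.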
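A small worked example may help in reading the two metrics. The functions below follow the same logic as the cell above, and the three queries, document names, and gold sets are made up for illustration.

```python
def recall_at_k(preds, gold, k=10):
    """Fraction of queries with at least one gold item in the top k."""
    hits = sum(1 for p, g in zip(preds, gold) if set(p[:k]) & g)
    return hits / len(preds)


def mrr_at_k(preds, gold, k=10):
    """Mean of 1/rank of the first gold item in the top k (0 if none)."""
    total = 0.0
    for p, g in zip(preds, gold):
        for rank, doc in enumerate(p[:k], start=1):
            if doc in g:
                total += 1 / rank
                break
    return total / len(preds)


# Made-up ranked results for three queries and their gold sets.
preds = [
    ["d1", "d2", "d3"],  # relevant passage at rank 1
    ["d4", "d5", "d6"],  # relevant passage at rank 3
    ["d7", "d8", "d9"],  # no relevant passage retrieved
]
gold = [{"d1"}, {"d6"}, {"d0"}]

recall = recall_at_k(preds, gold, k=3)  # 2 of 3 queries hit
mrr = mrr_at_k(preds, gold, k=3)        # (1 + 1/3 + 0) / 3

print(round(recall, 4))  # 0.6667
print(round(mrr, 4))     # 0.4444
```

Recall treats the first two queries equally, while MRR gives the second query only a third of the credit because its relevant passage sits at rank 3.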

In [None]:
def semantic_search(query, top_k=5):
    query_embedding = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_embedding)

    scores, indices = index.search(query_embedding, top_k)

    results = corpus_df.iloc[indices[0]].copy()
    results["score"] = scores[0]

    return results


This function takes a query, converts it into an embedding, and normalizes it. It then searches the FAISS index for the top_k most similar passages, looks up the matching rows in the DataFrame by position, adds their similarity scores as a score column, and returns the result.

In [None]:
query = "tell me about ieee"
results = semantic_search(query, top_k=5)
results

This output shows the result of a semantic search for the query "tell me about ieee".
The table lists the top 5 most semantically similar passages: doc_id identifies the document, text contains the retrieved passage, and score is its similarity to the query.
Higher scores indicate stronger semantic relevance, which is why the top results accurately describe IEEE.


In [None]:
semantic_search("machine learning in healthcare")


This line performs a semantic search for the query "machine learning in healthcare" and returns the default top 5 results.

In [None]:
query = "Tell me about Electrical engineering"
results = semantic_search(query, top_k=10)

for i, row in results.iterrows():
    print(f"\nScore: {row['score']:.4f}")
    print(row['text'][:300], "...")


This version of the code differs in output format because the results are printed manually using a loop instead of being displayed as a DataFrame.
Rather than showing all columns in tabular form, it prints each result one by one with its similarity score and only the first 300 characters of the passage.
This makes the output more readable and suitable for quick inspection.


In [None]:
semantic_search("tell me about the first prime minister of india", top_k=10)

In [None]:
!pip install -q rank-bm25


This command installs the rank-bm25 library, which is required for the bonus part of the evaluation criteria.

In [None]:
print("\n--- Dense Retriever Evaluation ---")
evaluate_dense_retriever()


Calling evaluate_dense_retriever() with its default arguments scores the dense retriever on the first 500 queries with k=10 and prints the resulting recall and MRR, showing how well it finds relevant passages for the given queries.

In [None]:
from rank_bm25 import BM25Okapi

tokenized_corpus = [doc.split() for doc in corpus_df["text"]]
bm25 = BM25Okapi(tokenized_corpus)


In [None]:
def bm25_search(query, top_k=10):
    scores = bm25.get_scores(query.split())
    top_idx = np.argsort(scores)[::-1][:top_k]
    return corpus_df.iloc[top_idx]


This code sets up BM25 document ranking for text search.
Each document is tokenized by splitting on whitespace, so matching is case-sensitive and punctuation stays attached to words.
The bm25_search function takes a query, tokenizes it the same way, computes a relevance score for every document, sorts the scores in descending order, and returns the top-k most relevant documents from the corpus.
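To show what a BM25 score is made of, here is a minimal sketch of a textbook BM25 formula in plain Python. It is not the rank-bm25 implementation and its numbers need not match that library's output; the function bm25_scores, the illustrative k1 and b values, the lowercasing step, and the three toy documents are all assumptions of this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len  # penalizes long documents
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms missing from the document add nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up toy corpus.
docs = [
    "machine learning helps doctors in healthcare",
    "the weather is sunny today",
    "deep learning for image recognition",
]
tokenized = [d.lower().split() for d in docs]
scores = bm25_scores("machine learning in healthcare".lower().split(), tokenized)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

print(ranking)  # [0, 2, 1]
```

The second document shares no word with the query and scores exactly zero, which illustrates why keyword-based ranking depends on exact term overlap.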

In [None]:
bm25_search("machine learning in healthcare")

Semantic search looks at the meaning of the query. It can find related results even if the exact words are different.

BM25 search looks for exact keywords in the text. It works well when the same words appear in the document.

In [None]:
def failure_analysis():
    q = "quantum cryptography standards"
    print("Query:", q)
    print("\nDense Results:")
    print(semantic_search(q)["text"].iloc[0][:300])

    print("\nBM25 Results:")
    print(bm25_search(q)["text"].iloc[0][:300])


In [None]:
failure_analysis()


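Failure analysis is the systematic study of why a system fails to perform its intended function, with the goal of identifying the root cause and preventing it from happening again. Here, failure_analysis() prints the first 300 characters of the top passage returned by the dense retriever and by BM25 for the query "quantum cryptography standards", so the two results can be inspected side by side.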