# 🔍 Hybrid Search for RAG Pipelines: BM25 + Vector Similarity

This notebook demonstrates **hybrid search** that combines:
- **BM25 sparse retrieval** for keyword relevance
- **Vector-based retrieval** for semantic understanding

Combining the two lets exact keyword matches and semantically related passages both surface in the results, which can improve recall and precision over either retriever alone in RAG systems.

📦 Required packages: `rank_bm25`, `faiss-cpu`, `openai`, `nltk`, `tiktoken`

In [None]:
# !pip install rank_bm25 faiss-cpu openai nltk tiktoken

## 🧹 1. Preprocessing: Tokenize & Chunk the Documents

In [None]:
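import nltk
import tiktoken
from nltk.tokenize import word_tokenize

# Tokenizer data used by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')

def chunk_text(text, max_tokens=300):
    """Greedily pack sentences into chunks of fewer than max_tokens tokens."""
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    # Naive sentence split on ". "; strip the trailing period so it is not doubled below.
    sentences = [s.strip().rstrip(".") for s in text.split(". ") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if len(enc.encode(current + sentence)) < max_tokens:
            current += sentence + ". "
        else:
            # Flush the current chunk; skip it when empty (first sentence already too long).
            if current:
                chunks.append(current.strip())
            current = sentence + ". "
    if current:
        chunks.append(current.strip())
    return chunks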
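
To see the greedy packing in isolation, here is a minimal sketch that swaps the tiktoken count for a whitespace word count so it runs offline. `pack_sentences` and `count_tokens` are hypothetical names used only for illustration; real token counts will differ from word counts.

```python
def count_tokens(text):
    # Stand-in for len(enc.encode(text)): counts whitespace-separated words.
    return len(text.split())

def pack_sentences(sentences, max_tokens):
    # Greedily append sentences to the current chunk while it stays under max_tokens.
    chunks, current = [], ""
    for sentence in sentences:
        if count_tokens(current + sentence) < max_tokens:
            current += sentence + ". "
        else:
            if current:
                chunks.append(current.strip())
            current = sentence + ". "
    if current:
        chunks.append(current.strip())
    return chunks

sentences = [
    "BM25 ranks by keywords",
    "Vectors capture meaning",
    "Hybrid search mixes both",
]
packed = pack_sentences(sentences, max_tokens=8)
print(packed)
```

With a budget of 8 words, the first two sentences share a chunk and the third starts a new one. A single sentence longer than the budget still becomes its own oversized chunk, since this scheme never splits inside a sentence.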

## 🧠 2. BM25 Index Setup (Sparse Retrieval)

In [None]:
from rank_bm25 import BM25Okapi

def build_bm25_index(chunks):
    tokenized_corpus = [word_tokenize(doc.lower()) for doc in chunks]
    return BM25Okapi(tokenized_corpus), tokenized_corpus
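
For intuition about what the BM25 index computes, the following is a small pure-Python sketch of textbook Okapi BM25 scoring. It is an illustration under assumptions, not the `rank_bm25` implementation: the function name `bm25_scores`, the smoothed IDF variant, and the `k1` and `b` values are choices made for this example, and whitespace splitting stands in for `word_tokenize`.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    # Textbook Okapi BM25 with a smoothed, non-negative IDF (an assumption of this sketch).
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

corpus = [
    "a license is required for commercial use".split(),
    "vector search finds similar meaning".split(),
    "the license fee is paid yearly".split(),
]
scores = bm25_scores("license fee".split(), corpus)
print(scores)
```

The third document matches both query terms and scores highest, the first matches only the more common term, and the second matches nothing and scores zero. This is the exact-match behavior that the sparse half of the hybrid contributes.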

## 🔢 3. Embedding + FAISS Vector Store (Semantic Retrieval)

In [None]:
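import numpy as np
import faiss
from openai import OpenAI

# Assumes openai>=1.0, where the legacy openai.Embedding.create call is no longer available.
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()

def embed_text(text_list, model="text-embedding-ada-002"):
    # One request for the whole batch; very large lists may need to be split into smaller batches.
    resp = client.embeddings.create(input=text_list, model=model)
    embeddings = [item.embedding for item in resp.data]
    return np.array(embeddings).astype("float32")

def build_faiss_index(embeddings):
    # IndexFlatL2 does exact (brute-force) search and reports squared L2 distances.
    dim = embeddings.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(embeddings)
    return index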
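
A flat L2 index performs an exact nearest-neighbor search and returns squared Euclidean distances, sorted from nearest to farthest, together with the positions of the matching vectors. The NumPy sketch below mimics that output shape on toy data so the `D` and `I` arrays used later are easier to read. `brute_force_l2_search` is a hypothetical helper for illustration and does not call FAISS.

```python
import numpy as np

def brute_force_l2_search(vectors, queries, k):
    # Squared Euclidean distance between every query and every stored vector.
    diffs = queries[:, None, :] - vectors[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Positions of the k nearest vectors per query, nearest first.
    order = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[1.0, 1.0]], dtype="float32")
D, I = brute_force_l2_search(vectors, queries, k=3)
print(D)
print(I)
```

Note that row `D[0]` is ordered by rank, not by the original vector position: its first entry belongs to vector `I[0][0]`, which here is vector 1 rather than vector 0.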

## 🔀 4. Hybrid Retrieval Logic: BM25 + Vector Weighted Combo

In [None]:
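def hybrid_search(query, chunks, bm25, tokenized_corpus, faiss_index, embeddings, alpha=0.5, k=5):
    # tokenized_corpus and embeddings are unused here; they are kept so existing calls still work.
    # BM25 sparse scores, one per chunk, in chunk order
    bm25_scores = bm25.get_scores(word_tokenize(query.lower()))

    # Vector search: D holds squared L2 distances sorted ascending, I the matching chunk indices
    q_embed = embed_text([query])
    D, I = faiss_index.search(q_embed, len(chunks))

    # Put distances back into chunk order so they line up with bm25_scores
    # (assumes the index holds exactly one vector per chunk)
    distances = np.empty(len(chunks), dtype="float32")
    distances[I[0]] = D[0]

    # Normalize both signals to [0, 1], where higher is better
    bm25_norm = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores) + 1e-8)
    sim_norm = 1 - distances / (np.max(distances) + 1e-8)

    # Weighted hybrid score: alpha weights BM25, (1 - alpha) weights vector similarity
    hybrid_scores = alpha * bm25_norm + (1 - alpha) * sim_norm
    top_indices = np.argsort(hybrid_scores)[-k:][::-1]
    return [chunks[i] for i in top_indices]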
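
The fusion step can be checked on toy numbers without calling any API. FAISS returns distances ordered by rank rather than by chunk position, so they have to be mapped back through `I` before being mixed with the BM25 scores. The sketch below does that and shows how `alpha` shifts the winner. `fuse_scores` and the sample arrays are made up for illustration.

```python
import numpy as np

def fuse_scores(bm25_scores, D_row, I_row, alpha=0.5):
    # Scatter rank-ordered distances back into chunk order.
    distances = np.empty(len(bm25_scores), dtype="float64")
    distances[I_row] = D_row
    # Min-max normalize BM25; turn distances into a similarity in [0, 1].
    span = np.max(bm25_scores) - np.min(bm25_scores)
    bm25_norm = (bm25_scores - np.min(bm25_scores)) / (span + 1e-8)
    sim_norm = 1 - distances / (np.max(distances) + 1e-8)
    return alpha * bm25_norm + (1 - alpha) * sim_norm

bm25_scores = np.array([0.0, 2.0, 1.0])  # one score per chunk, in chunk order
I_row = np.array([2, 0, 1])              # chunk indices, nearest first
D_row = np.array([0.0, 1.0, 4.0])        # squared distances, ascending

winners = {}
for alpha in (0.0, 0.5, 1.0):
    fused = fuse_scores(bm25_scores, D_row, I_row, alpha)
    winners[alpha] = int(np.argmax(fused))
    print(alpha, fused.round(3), winners[alpha])
```

With `alpha=0.0` only the vector signal counts and chunk 2 wins; with `alpha=1.0` only BM25 counts and chunk 1 wins. At `alpha=0.5` chunk 2 stays on top because it does reasonably well on both signals, which is the behavior a hybrid score is meant to reward.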

## ✅ 5. End-to-End Pipeline Demo

In [None]:
# full_text = """ Put your multi-paragraph document here """
# query = "What is the licensing requirement?"
# chunks = chunk_text(full_text)
# bm25, tokenized = build_bm25_index(chunks)
# embeds = embed_text(chunks)
# faiss_index = build_faiss_index(embeds)
# top_chunks = hybrid_search(query, chunks, bm25, tokenized, faiss_index, embeds)
# for i, chunk in enumerate(top_chunks):
#     print(f"--- Chunk {i+1} ---\n{chunk}\n")

## 📊 Summary: When to Use Hybrid Retrieval

| Case | Sparse (BM25) | Vector | Hybrid |
|------|----------------|--------|--------|
| Short text / exact match | ✅ | ❌ | ✅ |
| Long form / fuzzy match  | ❌ | ✅ | ✅ |
| Domain-sensitive terms   | ✅ | ✅ | ✅ |
| Noisy OCR/long docs      | ❌ | ✅ | ✅ |