# Hybrid Stack Overflow Search with Dense, Sparse and ColBERT Multivectors

This notebook implements a production-ready Stack Overflow Q&A search system using:
- **Dense vectors** (BAAI/bge-small-en-v1.5) for semantic search
- **Sparse vectors** (BM25) for lexical/keyword matching
- **ColBERT multivectors** for fine-grained reranking via late interaction

Dataset: `stackoverflow_combined.csv`, downloaded from https://www.kaggle.com/datasets/kutayahin/stackoverflow-programming-questions-2020-2025?resource=download

Pipeline: hybrid retrieval (dense + sparse with RRF fusion) → ColBERT reranking → formatted results.

## Cell 1: Setup and Dependencies

In [22]:
import os 
os.environ["QDRANT_URL"]="" 
os.environ["QDRANT_API_KEY"]=""


In [23]:
# Install dependencies if needed (uncomment for Colab)
# !pip install qdrant-client fastembed numpy ranx beautifulsoup4 pandas

import os

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding
from qdrant_client import QdrantClient, models

# Optional: for evaluation
try:
    import ranx
    HAS_RANX = True
except ImportError:
    HAS_RANX = False

CSV_PATH = "stackoverflow_combined.csv"

def get_qdrant_credentials():
    """Load the Qdrant URL and API key from environment variables."""
    qdrant_url = os.getenv("QDRANT_URL")
    qdrant_api_key = os.getenv("QDRANT_API_KEY")
    if not qdrant_url:
        # Fail early with a clear message instead of a confusing client error later
        raise ValueError("QDRANT_URL is not set; fill it in before running the notebook.")
    return qdrant_url, qdrant_api_key

url, api_key = get_qdrant_credentials()
print(url)




## Cell 2: Initialize Qdrant Client

In [5]:
client = QdrantClient(url=url, api_key=api_key)
collection_name = "stackoverflow_search"

# Optional: delete existing collection for fresh run
try:
    if client.collection_exists(collection_name):
        client.delete_collection(collection_name)
        print(f"Deleted existing collection '{collection_name}'")
except Exception as e:
    print(f"Note: {e}")

print("Qdrant client initialized.")

Qdrant client initialized.


## Cell 3: Collection Design

In [6]:
client.create_collection(
    collection_name=collection_name,
    vectors_config={
        "dense": models.VectorParams(
            size=384,
            distance=models.Distance.COSINE,
        ),
        "colbert": models.VectorParams(
            size=128,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            hnsw_config=models.HnswConfigDiff(m=0),  # Reranking only, no HNSW
        ),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(modifier=models.Modifier.IDF),
    },
)

# Payload indexes for faceting (tags and breadcrumbs support filter queries)
try:
    client.create_payload_index(collection_name, "tags", models.PayloadSchemaType.KEYWORD)
    client.create_payload_index(collection_name, "breadcrumbs", models.PayloadSchemaType.KEYWORD)
except Exception as e:
    print(f"Payload index note: {e}")

print("Collection 'stackoverflow_search' created with dense, sparse, and colbert vectors.")

Collection 'stackoverflow_search' created with dense, sparse, and colbert vectors.
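
The `colbert` vector above is configured with the `MAX_SIM` comparator. As a rough illustration of what late-interaction scoring computes, here is a minimal pure-Python sketch: for each query token, take the best cosine similarity against any document token, then sum those maxima. The helper names (`cosine`, `max_sim`) and the toy 2-dimensional vectors are assumptions for illustration only; Qdrant performs this computation server-side, and the real vectors have 128 dimensions. The sketch also assumes every document has at least one token vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def max_sim(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best match among the document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Toy example: two query tokens, two candidate documents
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one token, partially matching both

score_a = max_sim(query, doc_a)
score_b = max_sim(query, doc_b)
print(score_a, score_b)
```

Because the score is a sum over query tokens, it is not bounded by 1, unlike a single cosine similarity.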


## Cell 4: Load Stack Overflow CSV

In [8]:
def strip_html(html: str) -> str:
    """Strip HTML tags from body text."""
    if pd.isna(html) or not html:
        return ""
    soup = BeautifulSoup(str(html), "html.parser")
    return soup.get_text(separator=" ", strip=True)


def load_stackoverflow_csv(path: str, max_rows=None) -> list[dict]:
    """Load Stack Overflow questions from CSV and map to payload schema."""
    df = pd.read_csv(path, nrows=max_rows)
    chunks = []
    for _, row in df.iterrows():
        qid = row["question_id"]
        title = str(row["title"]) if pd.notna(row["title"]) else ""
        body = strip_html(row["body"])
        chunk_text = f"{title} {body}".strip()[:4000]  # Character cap to keep inputs short (approximate, not a token count)

        tags_raw = str(row["tags"]) if pd.notna(row["tags"]) else ""
        tags_list = [t.strip().lower() for t in tags_raw.split(",") if t.strip()]
        prog_lang = str(row["programming_language"]) if pd.notna(row.get("programming_language")) else ""
        breadcrumbs = [prog_lang] + tags_list if prog_lang else tags_list

        url = f"https://stackoverflow.com/questions/{qid}"
        chunks.append({
            "question_id": int(qid),
            "page_title": title,
            "section_title": "Question",
            "page_url": url,
            "section_url": url,
            "breadcrumbs": breadcrumbs[:10],
            "chunk_text": chunk_text,
            "prev_section_text": "",
            "next_section_text": "",
            "tags": tags_list,
        })
    return chunks


# Load Stack Overflow dataset (max_rows limits for faster demo; use None for full dataset)
chunks = load_stackoverflow_csv(CSV_PATH, max_rows=100)
print(f"Loaded {len(chunks)} Stack Overflow questions.")

Loaded 100 Stack Overflow questions.


## Cell 5: Embedding and Ingestion

In [9]:
dense_model = TextEmbedding("BAAI/bge-small-en-v1.5")
sparse_model = SparseTextEmbedding("Qdrant/bm25", language="english")
colbert_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")

texts = [c["chunk_text"] for c in chunks]

# Embed dense
dense_embeddings = list(dense_model.embed(texts))

# Embed sparse (each BM25 embedding exposes .indices and .values)
sparse_embeddings = list(sparse_model.embed(texts))

# Embed ColBERT (returns list of arrays, one per doc)
colbert_embeddings = list(colbert_model.embed(texts))

BATCH_SIZE = 64
points = []
for i, chunk in enumerate(chunks):
    dense_vec = dense_embeddings[i].tolist()
    sparse_vec = models.SparseVector(
        indices=sparse_embeddings[i].indices.tolist(),
        values=sparse_embeddings[i].values.tolist(),
    )
    colbert_mat = colbert_embeddings[i]  # shape (num_tokens, 128)
    colbert_list = colbert_mat.tolist()

    points.append(
        models.PointStruct(
            id=chunk["question_id"],
            vector={
                "dense": dense_vec,
                "sparse": sparse_vec,
                "colbert": colbert_list,
            },
            payload=chunk,
        )
    )

# Batch upsert to avoid timeouts on large datasets
for i in range(0, len(points), BATCH_SIZE):
    batch = points[i : i + BATCH_SIZE]
    client.upsert(collection_name=collection_name, points=batch)
print(f"Upserted {len(points)} Stack Overflow questions.")

Upserted 100 Stack Overflow questions.
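
The cell above builds every point in memory before upserting. For the full dataset, a lazy batching helper may be a better fit. The sketch below is one possible approach, using only the standard library; `iter_batches` is a hypothetical helper name, not part of `qdrant-client`.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 100 items in batches of 64, mirroring BATCH_SIZE in the cell above
batch_sizes = [len(batch) for batch in iter_batches(range(100), 64)]
print(batch_sizes)
```

In the notebook, each yielded batch could be passed to `client.upsert(collection_name=collection_name, points=batch)` in the same way as the slicing loop above.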


## Cell 6: Search Pipeline

In [19]:
def hybrid_search(query: str, limit: int = 10, prefetch_limit: int = 50):
    """Two-stage search: hybrid (dense+sparse RRF) -> ColBERT rerank."""
    # Embed query for all three representations
    dense_q = next(dense_model.embed([query])).tolist()
    sparse_q = next(sparse_model.query_embed([query]))
    colbert_q = list(colbert_model.query_embed([query]))[0].tolist()

    sparse_query = models.SparseVector(
        indices=sparse_q.indices.tolist(),
        values=sparse_q.values.tolist(),
    )

    # Stage 1: Prefetch dense + sparse, fuse with RRF
    hybrid_prefetch = models.Prefetch(
        prefetch=[
            models.Prefetch(query=dense_q, using="dense", limit=prefetch_limit),
            models.Prefetch(query=sparse_query, using="sparse", limit=prefetch_limit),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=prefetch_limit,
    )

    # Stage 2: Rerank prefetched results with ColBERT multivector
    results = client.query_points(
        collection_name=collection_name,
        prefetch=[hybrid_prefetch],
        query=colbert_q,
        using="colbert",
        limit=limit,
        with_payload=True,
    )
    return results


# Example search
query = "AttributeError NoneType request AI Tools Gradio"
results = hybrid_search(query)
print(f"Query: {query}")
print(f"Results: {len(results.points)} points")

Query: AttributeError NoneType request AI Tools Gradio
Results: 10 points
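
Stage 1 relies on Reciprocal Rank Fusion to merge the dense and sparse candidate lists. The following is a minimal sketch of the general RRF idea, not Qdrant's internal implementation: each document receives `1 / (k + rank)` from every list it appears in, and the contributions are summed. The function name `rrf_fuse`, the toy IDs, and the constant `k=60` (a value commonly used in the literature) are assumptions; the constant Qdrant uses is not stated in this notebook.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists; returns IDs sorted by descending RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Break ties by ID so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], str(doc_id)))


# Toy rankings: "q1" and "q3" appear in both lists, so they rise to the top
dense_ranking = ["q1", "q2", "q3"]
sparse_ranking = ["q3", "q1", "q4"]
fused = rrf_fuse([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only rank positions, so the dense cosine scores and the sparse BM25 scores never need to be on the same scale.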


## Cell 7: Result Formatting

In [20]:
def format_result(point, rank: int) -> str:
    p = point.payload or {}
    title = p.get("page_title", "N/A")
    url = p.get("section_url", p.get("page_url", ""))
    snippet = (p.get("chunk_text", "") or "")[:300]
    score = getattr(point, "score", None)
    tags = p.get("tags", [])
    tags_str = ", ".join(tags[:5]) if tags else ""
    return f"{rank}. {title}\n   URL: {url}\n   Tags: {tags_str}\n   Score: {score}\n   Snippet: {snippet}..."


for i, pt in enumerate(results.points, 1):
    print(format_result(pt, i))
    print()

1. AttributeError: 'NoneType' object has no attribute 'get' in request from AI Tools
   URL: https://stackoverflow.com/questions/79810291
   Tags: python, artificial-intelligence, huggingface, gradio
   Score: 26.74917
   Snippet: AttributeError: 'NoneType' object has no attribute 'get' in request from AI Tools I am trying to develop a tool that agent (CodeAgent) will call on user input (prompt) in Gradio. This tool will internally call a REST API to fetch the results. I am getting the error: Code:...

2. AttributeError: 'NoneType' object has no attribute 'columns' BPTK-Py
   URL: https://stackoverflow.com/questions/79808087
   Tags: python, debugging, stack, simulation
   Score: 15.661942
   Snippet: AttributeError: 'NoneType' object has no attribute 'columns' BPTK-Py First I used the first script and the result was an error,... After I was diagnosed with the second script, it turned out that the value for "Yearly_Tax_Cost_DM_After_Subsidies" actually exists. However, I need the first

## Cell 8: Evaluation

In [21]:
ground_truth_examples = [
    {
        "query": "AttributeError NoneType request AI Tools Gradio",
        "expected_ids": ["79810291"],
        "query_type": "error",
    },
    {
        "query": "Python SSH automation child process TTY",
        "expected_ids": ["79809924"],
        "query_type": "how-to",
    },
    {
        "query": "FastAPI middleware wrapper route",
        "expected_ids": ["79809886"],
        "query_type": "api",
    },
    {
        "query": "Polars rolling aggregation front date",
        "expected_ids": ["79809140"],
        "query_type": "how-to",
    },
    {
        "query": "Convert datetime timedelta to int",
        "expected_ids": ["79809098"],
        "query_type": "how-to",
    },
]


def extract_question_id(url: str):
    """Extract question_id from Stack Overflow URL."""
    if not url:
        return None
    parts = url.rstrip("/").split("/")
    return parts[-1] if parts else None


def compute_recall_at_k(retrieved_urls: list[str], expected_ids: list[str], k: int = 10) -> float:
    """Fraction of expected question IDs that appear in the top-k results."""
    if not expected_ids:
        return 0.0
    retrieved_ids = {extract_question_id(u) for u in retrieved_urls[:k]}
    hits = sum(1 for eid in expected_ids if eid in retrieved_ids)
    return hits / len(expected_ids)


def compute_mrr(retrieved_urls: list[str], expected_ids: list[str]) -> float:
    for rank, url in enumerate(retrieved_urls, 1):
        qid = extract_question_id(url)
        if qid and qid in expected_ids:
            return 1.0 / rank
    return 0.0


latencies = []
eval_results = []

import time

for ex in ground_truth_examples:
    t0 = time.perf_counter()
    res = hybrid_search(ex["query"])
    lat = (time.perf_counter() - t0) * 1000
    latencies.append(lat)

    urls = []
    for pt in res.points:
        u = (pt.payload or {}).get("section_url") or (pt.payload or {}).get("page_url", "")
        if u:
            urls.append(u)

    recall = compute_recall_at_k(urls, ex["expected_ids"])
    mrr = compute_mrr(urls, ex["expected_ids"])
    eval_results.append({
        "query": ex["query"],
        "recall@10": recall,
        "mrr": mrr,
        "latency_ms": lat,
    })

latencies_sorted = sorted(latencies)
# Index-based percentiles; with only 5 samples, P95 is simply the maximum
p50 = latencies_sorted[len(latencies_sorted) // 2] if latencies_sorted else 0
p95 = latencies_sorted[int(len(latencies_sorted) * 0.95)] if latencies_sorted else 0

print("Evaluation Results")
print("-" * 60)
for r in eval_results:
    print(f"Query: {r['query'][:50]}...")
    print(f"  Recall@10: {r['recall@10']:.2f}, MRR: {r['mrr']:.2f}, Latency: {r['latency_ms']:.0f}ms")
print("-" * 60)
print(f"Aggregate Recall@10: {np.mean([r['recall@10'] for r in eval_results]):.2f}")
print(f"Aggregate MRR: {np.mean([r['mrr'] for r in eval_results]):.2f}")
print(f"P50 Latency: {p50:.0f}ms, P95 Latency: {p95:.0f}ms")

Evaluation Results
------------------------------------------------------------
Query: AttributeError NoneType request AI Tools Gradio...
  Recall@10: 1.00, MRR: 1.00, Latency: 144ms
Query: Python SSH automation child process TTY...
  Recall@10: 1.00, MRR: 1.00, Latency: 125ms
Query: FastAPI middleware wrapper route...
  Recall@10: 1.00, MRR: 1.00, Latency: 92ms
Query: Polars rolling aggregation front date...
  Recall@10: 1.00, MRR: 1.00, Latency: 81ms
Query: Convert datetime timedelta to int...
  Recall@10: 1.00, MRR: 1.00, Latency: 85ms
------------------------------------------------------------
Aggregate Recall@10: 1.00
Aggregate MRR: 1.00
P50 Latency: 92ms, P95 Latency: 144ms
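
The P50/P95 figures above come from indexing into the sorted latency list. As a cross-check, here is a small sketch of the nearest-rank percentile definition applied to the rounded latencies printed above. `nearest_rank_percentile` is a hypothetical helper, and nearest-rank is only one of several percentile conventions; interpolating methods can give different values on small samples.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the smallest value such that at least pct% of the values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Rounded per-query latencies (ms) from the evaluation output above
latencies_ms = [144, 125, 92, 81, 85]
p50 = nearest_rank_percentile(latencies_ms, 50)
p95 = nearest_rank_percentile(latencies_ms, 95)
print(p50, p95)
```

With only five samples, P95 coincides with the slowest query, so it reflects a single measurement rather than a stable tail estimate.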


## Cell 9: Reflection and Tuning

## Learnings

**High-Level Summary**
- **Domain:** Stack Overflow Q&A search (programming questions)
- **Key Result:** Hybrid retrieval + multivector reranking reached Recall@10=1.00 and MRR=1.00 with P95=144ms on the 5 ground-truth queries.

**Reproducibility**
- **Notebook/App:** `hybrid_docs_search.ipynb`
- **Repo (optional):** —
- **Models:** dense=`BAAI/bge-small-en-v1.5`, sparse=`Qdrant/bm25` (BM25 + IDF), colbert=`colbert-ir/colbertv2.0`
- **Collection:** `stackoverflow_search` (Cosine), points=100
- **Dataset:** first 100 questions from `stackoverflow_combined.csv` (`max_rows=100`; Stack Overflow 2020–2025; snapshot: from file)
- **Ground truth:** 5 queries (error / how-to / api / how-to / how-to)

**Settings (today)**
- **Chunking:** One question per chunk (title + body, HTML stripped; truncation 4000 chars)
- **Payload fields:** question_id, page_title, section_title, page_url, section_url, breadcrumbs, tags, chunk_text, prev_section_text, next_section_text
- **Fusion:** RRF, k_dense=50, k_sparse=50
- **Reranker:** ColBERT (MaxSim), top-k=10
- **Index/Search params:** defaults for `dense` (hnsw_ef, m, ef_construct not tuned); HNSW disabled for `colbert` (m=0, used for reranking only)

**Queries (examples)**

Query: `AttributeError NoneType request AI Tools Gradio`

1. AttributeError: 'NoneType' object has no attribute 'get' in request from AI Tools
   URL: https://stackoverflow.com/questions/79810291
   Tags: python, artificial-intelligence, huggingface, gradio
   Score: 26.74917
   Snippet: AttributeError: 'NoneType' object has no attribute 'get' in request from AI Tools I am trying to develop a tool that agent (CodeAgent) will call on user input (prompt) in Gradio. This tool will internally call a REST API to fetch the results. I am getting the error: Code:...

2. AttributeError: 'NoneType' object has no attribute 'columns' BPTK-Py
   URL: https://stackoverflow.com/questions/79808087
   Tags: python, debugging, stack, simulation
   Score: 15.661942
   Snippet: AttributeError: 'NoneType' object has no attribute 'columns' BPTK-Py First I used the first script and the result was an error,... After I was diagnosed with the second script, it turned out that the value for "Yearly_Tax_Cost_DM_After_Subsidies" actually exists. However, I need the first script for...

3. AttributeError: module 'mysql.connector' has no attribute 'CMySQLConnection'
   URL: https://stackoverflow.com/questions/79809523
   Tags: python, mysql-connector-python
   Score: 13.641586
   Snippet: AttributeError: module 'mysql.connector' has no attribute 'CMySQLConnection' I'm trying to get MySQL connector working in VScode, have used to install it which seems to have worked but whenever I try to use I get AttributeError: module 'mysql.connector' has no attribute 'CMySQLConnection'. How to fi...

4. Why does appending data with PySpark raise a "SQLServerException: CREATE TABLE permission denied" exception?
   URL: https://stackoverflow.com/questions/79807775
   Tags: python, sql, sql-server, pyspark, databricks
   Score: 10.578304
   Snippet: Why does appending data with PySpark raise a "SQLServerException: CREATE TABLE permission denied" exception? In my Databricks cluster I'm trying to write a DataFrame to my table with the following code: And this line fails with Py4JJavaError: An error occurred while calling o53884.jdbc. : com.micros...

5. Why does a class attribute named `type` make `type[Foo]` fail?
   URL: https://stackoverflow.com/questions/79808954
   Tags: python, python-typing, mypy
   Score: 10.260448
   Snippet: Why does a class attribute named `type` make `type[Foo]` fail? When I define a class attribute named , a annotation inside the same class causes to report that the name is a variable and therefore “not valid as a type”. produces : I expected to refer to the built-in , but mypy treats as the class at...

6. CatalystAppError: {'code': 'FATAL ERROR', 'message': 'Catalyst headers are empty'} when initializing Catalyst app
   URL: https://stackoverflow.com/questions/79810057
   Tags: python, zoho, zohocatalyst
   Score: 10.194329
   Snippet: CatalystAppError: {'code': 'FATAL ERROR', 'message': 'Catalyst headers are empty'} when initializing Catalyst app I’m working on a chatbot project using the Zoho Catalyst SDK. My goal is to use the Catalyst Cache service to store session data for multiple users interacting with my chatbot independen...

7. ModuleNotFoundError: No module named 'kiwisolver' -- yet Requirement already satisfied: kiwisolver in /usr/lib/python3/dist-packages (1.3.2)
   URL: https://stackoverflow.com/questions/79808094
   Tags: python, installation, pip, virtualenv
   Score: 10.175929
   Snippet: ModuleNotFoundError: No module named 'kiwisolver' -- yet Requirement already satisfied: kiwisolver in /usr/lib/python3/dist-packages (1.3.2) I can run a software from a virtual environment that ends up with the following error: However, if I try to install it I get: What am I missing here?...

8. TypeError when calculating loan amount in my Python script
   URL: https://stackoverflow.com/questions/79807627
   Tags: python, typeerror
   Score: 9.209666
   Snippet: TypeError when calculating loan amount in my Python script I'm trying to write a python program that will calculate the maximum loan amount based on monthly payment, annual interest rate, and loan duration. Here is my current code : When I run the code I get the following error : How can I fix this ...

9. A wrapper route to simulate a FastAPI request
   URL: https://stackoverflow.com/questions/79809886
   Tags: python, fastapi, middleware, asgi
   Score: 8.647156
   Snippet: A wrapper route to simulate a FastAPI request My goal is to have a special route which receives name and payload for any other route and then (after doing some unpacking and decoding) call the target route which I prefer to go through HTTP middleware that does validation as if it was requested from ...

10. Import error on Flask CRUD App from import Modus from and url_decode
   URL: https://stackoverflow.com/questions/79809520
   Tags: python, flask
   Score: 8.57424
   Snippet: Import error on Flask CRUD App from import Modus from and url_decode I am trying to develop a CRUD app using flask. The API is a simple REST API using add, update, delete and show. The update on the route need to use a flask_modus and import Modus since I need to use request method using b"PATCH". I...

**Evaluation**
- Recall@10: 1.00 | MRR: 1.00 | P50: 92ms | P95: 144ms (5 queries)

Per-query results:

Query: AttributeError NoneType request AI Tools Gradio...
  Recall@10: 1.00, MRR: 1.00, Latency: 144ms
Query: Python SSH automation child process TTY...
  Recall@10: 1.00, MRR: 1.00, Latency: 125ms
Query: FastAPI middleware wrapper route...
  Recall@10: 1.00, MRR: 1.00, Latency: 92ms
Query: Polars rolling aggregation front date...
  Recall@10: 1.00, MRR: 1.00, Latency: 81ms
Query: Convert datetime timedelta to int...
  Recall@10: 1.00, MRR: 1.00, Latency: 85ms
Aggregate Recall@10: 1.00
Aggregate MRR: 1.00
P50 Latency: 92ms, P95 Latency: 144ms 

**Why these matched**
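- Dense embeddings capture semantic similarity between the query and the question text; sparse BM25 matches exact terms such as “AttributeError” or “Gradio” in the title and body (tags are stored in the payload only and are not embedded); ColBERT then reranks by token-level MaxSim similarity, so the intended question surfaces in the top 10.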

**Surprise**
- ColBERT reranking can noticeably reorder the RRF list; some queries that rank mid-list after fusion jump to #1 after MaxSim.

**Next step**
- Run with the full dataset (`max_rows=None`), tune `prefetch_limit` (50→100→200), and compare RRF vs DBSF to measure the impact on Recall@10 and latency.
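
For the RRF vs DBSF comparison, it may help to see how a score-based fusion differs from a rank-based one. The sketch below follows the usual description of Distribution-Based Score Fusion: normalize each list's scores using mean ± 3 standard deviations as the limits, then sum the normalized scores per document. It is an illustration under that assumption, not Qdrant's implementation; `dbsf_fuse` and the toy scores are hypothetical. In the notebook itself, the switch would presumably be `models.Fusion.DBSF` in place of `models.Fusion.RRF`.

```python
import statistics


def dbsf_fuse(score_lists):
    """Fuse {doc_id: score} dicts; returns (doc_id, fused_score) sorted descending."""
    fused = {}
    for scores in score_lists:
        values = list(scores.values())
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        low, high = mean - 3 * std, mean + 3 * std
        for doc_id, score in scores.items():
            # If all scores are identical, fall back to a neutral 0.5
            normalized = 0.5 if high == low else (score - low) / (high - low)
            fused[doc_id] = fused.get(doc_id, 0.0) + normalized
    return sorted(fused.items(), key=lambda item: (-item[1], str(item[0])))


# Toy scores on very different scales (cosine-like vs BM25-like)
dense_scores = {"q1": 0.9, "q2": 0.8, "q3": 0.7}
sparse_scores = {"q3": 12.0, "q1": 9.0, "q4": 6.0}
fused_dbsf = dbsf_fuse([dense_scores, sparse_scores])
print([doc_id for doc_id, _ in fused_dbsf])
```

Unlike RRF, this approach keeps information about score gaps within each list, which is one reason the two methods can rank the same candidates differently.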