# Module 1: Use Cases for Multi-Vector Search

Define a specific technical query about Python database connection pool exhaustion in async web applications.

In [1]:
query = "How can I prevent Python database connection " \
        "pool exhaustion in async web applications?"

Define four candidate documents with varying relevance levels, from highly relevant to keyword-stuffed to completely irrelevant.

In [2]:
documents = [
    # Document A: Highly relevant - addresses all query aspects
    "When async tasks fail to return database connections, the pool "
    "becomes exhausted and requests start failing. Ensuring "
    "connections are closed after awaits prevents this.",

    # Document B: Partially relevant - mentions some concepts
    "Database resource exhaustion can occur due to limited pool sizes.",

    # Document C: Keyword-stuffed - contains related terms without substance
    "Understanding concurrency, async IO, and database performance in "
    "Python web applications.",

    # Document D: Completely irrelevant
    "Handling training for pythons should be done gradually, starting "
    "with short sessions and increasing duration as the snake becomes "
    "more comfortable.",
]
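Before embedding anything, it helps to see why Document C counts as keyword-stuffed. The sketch below is an illustration added here and is not part of the notebook's pipeline. It counts how many distinct query words appear verbatim in each document. It assumes a simple lowercase `[a-z]+` tokenization with no stemming or stop-word removal.

```python
import re

# Same query and documents as the cells above
query = ("How can I prevent Python database connection "
         "pool exhaustion in async web applications?")

documents = [
    "When async tasks fail to return database connections, the pool "
    "becomes exhausted and requests start failing. Ensuring "
    "connections are closed after awaits prevents this.",
    "Database resource exhaustion can occur due to limited pool sizes.",
    "Understanding concurrency, async IO, and database performance in "
    "Python web applications.",
    "Handling training for pythons should be done gradually, starting "
    "with short sessions and increasing duration as the snake becomes "
    "more comfortable.",
]

def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

query_words = words(query)
# Number of distinct query words that each document repeats verbatim
overlap = [len(query_words & words(doc)) for doc in documents]
print(overlap)  # [3, 4, 6, 0]
```

Document C shares the most exact words with the query but says nothing about fixing the problem. Document A uses inflected forms ("connections", "exhausted", "prevents") that exact matching misses.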

Load a single-vector embedding model (BGE) to establish a baseline for comparison.

In [3]:
from fastembed import TextEmbedding

# Load the BAAI/bge-small-en-v1.5 model
dense_model = TextEmbedding("BAAI/bge-small-en-v1.5")

Embed the query as a single 384-dimensional vector.

In [4]:
dense_query_vector = next(dense_model.query_embed(query))
dense_query_vector.shape

(384,)

Embed all documents as single vectors for batch comparison.

In [5]:
import numpy as np

dense_vectors = np.array(list(dense_model.passage_embed(documents)))
dense_vectors.shape

(4, 384)
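The next cell scores documents with a plain dot product. A dot product equals cosine similarity only when both vectors have unit length. Otherwise it also rewards vector magnitude. This pure-Python sketch shows the two measures agreeing after L2 normalization. It uses made-up 3-dimensional vectors, not real BGE embeddings. Whether a model's outputs are already normalized is worth checking, for example with `np.linalg.norm(dense_vectors, axis=1)`.

```python
import math

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def norm(u: list[float]) -> float:
    return math.sqrt(dot(u, u))

def normalize(u: list[float]) -> list[float]:
    n = norm(u)
    return [a / n for a in u]

def cosine(u: list[float], v: list[float]) -> float:
    return dot(u, v) / (norm(u) * norm(v))

# Toy vectors (hypothetical, not model output)
q = [3.0, 4.0, 0.0]   # length 5
d = [1.0, 2.0, 2.0]   # length 3

raw_dot = dot(q, d)                          # 3 + 8 + 0 = 11
cos = cosine(q, d)                           # 11 / (5 * 3)
unit_dot = dot(normalize(q), normalize(d))   # equals the cosine

print(raw_dot, round(cos, 4), round(unit_dot, 4))
```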

Compute dot-product similarities between the query and each document. With single-vector search, the keyword-stuffed Document C (0.860) scores slightly higher than the genuinely relevant Document A (0.855).

In [6]:
import numpy as np

np.dot(dense_query_vector, dense_vectors.T)

array([0.85484755, 0.76778555, 0.86039245, 0.5341173 ], dtype=float32)
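To read the scores as a ranking, sort the document labels by score. This sketch copies the four values printed above, rounded to four decimals, rather than recomputing them.

```python
# Dense (BGE) scores printed by the cell above, rounded to 4 decimals
dense_scores = {"A": 0.8548, "B": 0.7678, "C": 0.8604, "D": 0.5341}

# Sort labels from most to least similar
dense_ranking = sorted(dense_scores, key=dense_scores.get, reverse=True)
print(dense_ranking)  # ['C', 'A', 'B', 'D']

# Margin by which keyword-stuffed C beats relevant A
margin = dense_scores["C"] - dense_scores["A"]
print(f"{margin:.4f}")  # 0.0056
```

The margin is small. This suggests that a single pooled vector barely distinguishes a document that addresses the question from one that only names its topics.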

Now load ColBERT, a late-interaction model that produces one embedding per token instead of a single vector for the whole text.

In [7]:
from fastembed import LateInteractionTextEmbedding

# Load the colbert-ir/colbertv2.0 model
colbert_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")

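Embed the query as a multi-vector representation: one 128-dimensional vector per token position. ColBERT pads short queries with `[MASK]` tokens to a fixed length of 32, so this query yields 32 vectors even though it contains fewer tokens.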

In [8]:
colbert_query_vector = next(colbert_model.query_embed(query))
colbert_query_vector.shape

(32, 128)
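ColBERT-style query padding gives every short query the same number of positions, and it can be sketched on plain strings. The function `pad_query`, the `[Q]` marker placement, and the whitespace tokenization are hypothetical simplifications, not fastembed's actual tokenizer. The sketch only shows how a short query ends up with a fixed number of positions (32, matching the shape above).

```python
def pad_query(tokens: list[str], length: int = 32,
              mask_token: str = "[MASK]") -> list[str]:
    """Hypothetical sketch: add special tokens, truncate if too long,
    then pad with mask tokens up to a fixed length."""
    framed = ["[CLS]", "[Q]"] + tokens + ["[SEP]"]
    framed = framed[:length]
    return framed + [mask_token] * (length - len(framed))

# Whitespace split stands in for a real WordPiece tokenizer
tokens = "how can i prevent python database connection pool".split()
padded = pad_query(tokens)
print(len(padded))  # 32
print(padded[:4], padded[-1])
```

In ColBERT the padded positions are also embedded and take part in scoring, which is why the query matrix has a fixed row count.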

Embed documents as multi-vectors. Each document gets a different number of vectors depending on its token count.

In [9]:
colbert_vectors = list(colbert_model.passage_embed(documents))
[cv.shape for cv in colbert_vectors]

[(30, 128), (13, 128), (16, 128), (25, 128)]
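Because every token keeps its own vector, multi-vector indexes are larger than single-vector ones. The calculation below uses the shapes printed above and assumes uncompressed 32-bit floats (4 bytes per value, no quantization). It gives the raw storage for these four documents.

```python
# Shapes printed above: ColBERT (tokens, 128) per document; BGE (4, 384)
colbert_shapes = [(30, 128), (13, 128), (16, 128), (25, 128)]
dense_shape = (4, 384)
BYTES_PER_FLOAT32 = 4

colbert_floats = sum(rows * dim for rows, dim in colbert_shapes)
dense_floats = dense_shape[0] * dense_shape[1]

print(colbert_floats, colbert_floats * BYTES_PER_FLOAT32)  # 10752 43008
print(dense_floats, dense_floats * BYTES_PER_FLOAT32)      # 1536 6144
print(colbert_floats / dense_floats)                       # 7.0
```

Here the multi-vector representation is seven times the size of the single-vector one. The gap grows with document length, because the ColBERT row count scales with the number of tokens.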

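Compute MaxSim scores. Unlike single-vector search, ColBERT ranks Document A highest, and the full order (A, B, C, D) matches the intended relevance levels. A likely reason is that ColBERT compares the query and each document token by token instead of through one pooled vector.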

In [10]:
for colbert_doc_vector in colbert_vectors:
    # For each document, compute similarity between all query-doc token pairs
    dot_product = np.dot(colbert_query_vector, colbert_doc_vector.T)
    # For each query token, take the maximum similarity with any doc token
    max_scores = dot_product.max(axis=1)
    # Sum these maximum similarities to get the final MaxSim score
    print(max_scores.sum())

21.181477
16.222866
14.829226
10.377555
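The loop above implements MaxSim. For each query vector, it takes the highest dot product with any document vector, then sums those maxima over the query vectors. The same computation without NumPy makes each step visible. It uses tiny hand-made 2-dimensional vectors, which are hypothetical values and not ColBERT output.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Sum, over query vectors, of the best dot product with any doc vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Two query "tokens" and two toy documents (hypothetical vectors)
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_covers_both = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
doc_repeats_one = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]

print(maxsim(query_vecs, doc_covers_both))  # best matches: 0.9 + 0.9
print(maxsim(query_vecs, doc_repeats_one))  # best matches: 0.9 + 0.1
```

The document that covers both query directions scores higher. MaxSim credits each query vector only once, with its best match, so repeating one matching vector adds nothing. This is one plausible reading of why related vocabulary alone did not lift Document C above Document A.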
