# Testing different representation methods

Dense retrieval is easy to start with, but it does not give the most accurate answers in every case. Sometimes we need exact keyword matching, and in those cases sparse vectors, such as BM25, might be more appropriate. They excel at matching proper names, which we may need when searching our datasets with company-specific constraints in mind. Let's add another representation method and build hybrid search using both of them.
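
To make the keyword-matching argument concrete, here is a minimal sketch of plain BM25 scoring in pure Python. It assumes naive whitespace tokenization and the commonly used defaults `k1=1.2` and `b=0.75`; the `bm25_scores` helper and the toy documents are made up for illustration, and the `Qdrant/bm25` model used later applies its own preprocessing, so real scores will differ.

```python
import math
from collections import Counter


def bm25_scores(
    query: str, docs: list[str], k1: float = 1.2, b: float = 0.75
) -> list[float]:
    """Score every document against the query with plain BM25 (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms (e.g. proper names) get a higher IDF weight
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1
            )
            tf = term_freq[term]
            # Length normalization penalizes long documents
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from the Hacker News dataset
toy_docs = [
    "qdrant is a vector database",
    "a search engine for logs and traces",
    "a database for time series",
]
toy_scores = bm25_scores("qdrant database", toy_docs)
```

The first document matches both query terms, and the rare term `qdrant` contributes more than the common term `database`, which is why exact names tend to dominate a sparse ranking.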

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

## Setting up another Qdrant collection for multiple vectors per point

If we want to use different search methods, we need to store multiple vectors per point in one collection. It's easier that way, as a multi-stage retrieval pipeline can then be launched in a single API call.

In [2]:
# See: https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-text-embedding-models
COLLECTION_NAME = "hackernews-hybrid-rag"

# Dense retrieval
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384
VECTOR_NAME = "bge-small-en-v1.5"

# Sparse model
BM25_MODEL_NAME = "Qdrant/bm25"
BM25_VECTOR_NAME = "bm25"

# Token-level representations
MUTLIVECTOR_MODEL_NAME = "colbert-ir/colbertv2.0"
MULTIVECTOR_SIZE = 128
MULTIVECTOR_NAME = "colbertv2.0"

In [3]:
from qdrant_client import QdrantClient, models

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)

In [4]:
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        VECTOR_NAME: models.VectorParams(
            size=VECTOR_SIZE,
            distance=models.Distance.COSINE,
        ),
        MULTIVECTOR_NAME: models.VectorParams(
            size=MULTIVECTOR_SIZE,
            distance=models.Distance.DOT,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            # Disable HNSW: these vectors are only used for reranking
            hnsw_config=models.HnswConfigDiff(m=0),
        ),
    },
    sparse_vectors_config={
        BM25_VECTOR_NAME: models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        ),
    },
)

True
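
The `MAX_SIM` comparator configured above can be illustrated with a small pure-Python sketch. For every query token it takes the best-matching document token and then sums these maxima. The `dot` and `max_sim` helpers and the two-dimensional toy vectors are assumptions made for illustration; real ColBERT token embeddings have 128 dimensions, and the actual scoring happens inside Qdrant.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_sim(
    query_tokens: list[list[float]], doc_tokens: list[list[float]]
) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(dot(q_vec, d_vec) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )


# Hypothetical token embeddings: one vector per token
toy_query = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
toy_doc_b = [[0.9, 0.1], [0.6, 0.4]]

score_a = max_sim(toy_query, toy_doc_a)  # 0.9 + 0.8, roughly 1.7
score_b = max_sim(toy_query, toy_doc_b)  # 0.9 + 0.4, roughly 1.3
```

Document A wins because each query token finds a close counterpart in it, while document B only covers the first query token well.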

## Migrating to multiple vectors

There is no need to recreate the dense embeddings, as we can migrate them from the previous collection and avoid recomputation. The sparse and multi-vector representations still have to be computed during the migration. Also, since we agreed we need more context, there is no point in storing submissions that lack a more detailed text description, so let's filter them out.
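
The filtering rule can be sketched as a plain Python predicate. It is only meant to show which payloads survive the migration; the `has_text` helper and the toy payloads are hypothetical, and the real filtering is done server-side by the Qdrant filter in the scroll call.

```python
def has_text(payload: dict) -> bool:
    """Return True when the payload carries a non-empty `text` field."""
    # .get() returns None both for a missing key and for an explicit null
    value = payload.get("text")
    return value not in (None, "", [])


# Hypothetical payloads covering the cases we want to drop
toy_payloads = [
    {"title": "A", "text": "body"},
    {"title": "B"},                # field missing
    {"title": "C", "text": None},  # field set to null
    {"title": "D", "text": ""},    # field set to an empty string
]
kept = [payload["title"] for payload in toy_payloads if has_text(payload)]
```

Only the first payload is kept, which mirrors the three `must_not` conditions: no missing field, no null value, and no empty string.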

In [5]:
OLD_COLLECTION_NAME = "hackernews-rag"

In [6]:
last_offset = None
while True:
    # Get a batch of records
    records, last_offset = client.scroll(
        collection_name=OLD_COLLECTION_NAME, 
        scroll_filter=models.Filter(
            must_not=[
                # Lack of field
                models.IsEmptyCondition(
                    is_empty=models.PayloadField(key="text"),
                ),
                # Field set to null value
                models.IsNullCondition(
                    is_null=models.PayloadField(key="text"),
                ),
                # Field set to an empty string
                models.FieldCondition(
                    key="text",
                    match=models.MatchValue(value=""),
                ),
            ],
        ),
        offset=last_offset,
        with_payload=True,
        with_vectors=True,
        limit=10,
    )

    # Migrate them to a new collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=[
            models.PointStruct(
                id=record.id,
                vector={
                    # Copy the dense embedding directly
                    VECTOR_NAME: record.vector[VECTOR_NAME],
                    # Calculate BM25 embedding
                    BM25_VECTOR_NAME: models.Document(
                        text=f"{record.payload['title']} {record.payload['text']}",
                        model=BM25_MODEL_NAME,
                    ),
                    # Calculate ColBERT embeddings as well
                    MULTIVECTOR_NAME: models.Document(
                        text=f"{record.payload['title']} {record.payload['text']}",
                        model=MUTLIVECTOR_MODEL_NAME,
                    ),
                },
                payload=record.payload,
            )
            for record in records
        ]
    )

    # Stop when the last batch has been already processed
    if last_offset is None:
        break

In [None]:
client.recover_snapshot(
    collection_name=COLLECTION_NAME,
    # Please do not modify the URL below
    location="https://storage.googleapis.com/tutorials-snapshots-bucket/workshop-improving-r-in-rag/hackernews-hybrid-rag.snapshot",
    wait=False, # Loading a snapshot may take some time, so let's avoid a timeout
)

## Experimenting with Hybrid Search

Our previous attempts to use dense retrieval to find some Qdrant-specific data weren't successful. Let's try to build a better retriever that uses keyword-based retrieval followed by dense reranking, so it hopefully captures more nuances.

In [7]:
def retrieve_dual(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 retrieval and dense reranking.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                # Prefetch ten times more!
                limit=(n_docs * 10)
            ),
        ],
        query=models.Document(
            text=q,
            model=MODEL_NAME,
        ),
        using=VECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [8]:
retrieve_dual("What does Qdrant do?", n_docs=10)

['How to Augment GPT-4 with Qdrant to Elevate Its Poetry Composition Capabilities GPT-4 and Qdrant synergize, transforming poetry with enhanced coherence and depth. Visit my medium article to view the code implementation: https:&#x2F;&#x2F;medium.com&#x2F;@akriti.upadhyay&#x2F;how-to-augment-gpt-4-with-qdrant-to-elevate-its-poetry-composition-capabilities-acbb7379346f',
 'Show HN: AssistantHunter, a GPT that searches through 9K+ GPTs for your task There will probably be over 1M GPTs by the end of the year, so I built a search engine for GPTs.<p>It&#x27;s built with GPT custom actions and qdrant.<p>We&#x27;re accepting new GPTs, please submit yours at assistanthunt.com.<p>Enjoy building!',
 'Show HN: VectorAdmin – An open-source vector database management system Hey HN,<p>At Mintplex Labs are building developer tools for AI applications. One area we encountered frustration was the use of Vector Databases like Pinecone, Chroma, QDrant, or Weaviate to &quot;unlock&quot; long-term memory a

In [9]:
from any_llm import acompletion
from typing import Callable

RetrieverFunc = Callable[[str, int], list[str]]


async def rag(q: str, retrieve_func: RetrieverFunc, *, n_docs: int = 10) -> str:
    """
    Run single-turn RAG on a given input query.
    Return just the model response.
    """
    docs = retrieve_func(q, n_docs)
    # Join outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12
    context = "\n".join(docs)
    messages = [
        {
            "role": "user",
            "content": (
                "Please provide a response to my question based only " +
                "on the provided context and only it. If it doesn't " +
                "contain any helpful information, please let me know " +
                "and admit you cannot produce relevant answer.\n" +
                f"<context>{context}</context>\n" +
                f"<question>{q}</question>"
            )
        }
    ]
    response = await acompletion(
        provider=os.environ.get("LLM_PROVIDER"),
        model="claude-sonnet-4-20250514",
        messages=messages,
    )
    return response.choices[0].message.content

In [10]:
response = await rag(
    "What does Qdrant do?", 
    retrieve_func=retrieve_dual
)
print(response)

Based on the provided context, Qdrant is a vector database that has several capabilities:

1. **Vector Database Functionality**: Qdrant is mentioned alongside other vector databases like Pinecone, Chroma, and Weaviate as a tool used to "unlock" long-term memory and contextual answers in AI applications.

2. **Integration with AI Models**: It can be integrated with GPT-4 to enhance poetry composition capabilities by providing "enhanced coherence and depth."

3. **Search Engine Backend**: Qdrant is used as part of the infrastructure for building search engines, such as the GPT search engine mentioned that searches through thousands of GPTs.

4. **Content Recommendation**: It serves as the foundation for content recommendation engines, specifically mentioned as being used with fastembed to create personalized "you might also like" sections for static sites.

5. **Embedding Management**: The context suggests it stores and manages embeddings (vector representations of data) that can be sear

In [11]:
retrieve_dual("How do I perform a KNN search on a large scale?", n_docs=10)

['Show HN: Quickwit – Cost-Efficient OSS Search Engine for Observability Hi HN, I’m one of the builders of Quickwit, a cloud-native OSS search engine for observability. As of 2023, we support logs and traces, metrics will come in 2024.<p>You know the pitch: while software like Datadog or Splunk are great, they often comes with hefty price tags. Our mission is to offer an affordable alternative. So we’ve built Quickwit, we’ve made it compatible with the observabilty ecosystem (OpenTelemetry, Jaeger, Grafana) and above all, we’ve made it cost-efficient &#x2F; “easy” to scale (well it’s never easy to scale to petabytes..).<p>To give you a glance at the engine performance I made a benchmark on the GitHub Archive dataset, 23 TB of events, here are the main observations:<p>Indexing: costs $2 per ingested TB. With 4CPU, throughput is at 20MBs However, you&#x27;ll observe &gt; 30MB throughput on simpler datasets, like logs and traces.<p>Search: a typical query costs $0.0002 per TB (considering

### More sophisticated reranking

Sparse retrieval with dense reranking can be a useful strategy, but it cannot support every possible search query. If sparse vectors fail to capture a particular semantic match, dense reranking never sees that document, so it is never retrieved. That's why it's pretty common to use both methods for prefetching and something else for reranking, so we get the best of both worlds.

In the simplest case, we can run both prefetches and combine the results with fusion based on the ranks as returned by the individual methods.
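
Rank-based fusion can be sketched in a few lines of pure Python. The example below implements the textbook Reciprocal Rank Fusion formula, where each document receives `1 / (k + rank)` from every ranking it appears in. The `reciprocal_rank_fusion` helper, the document ids, and `k=60` (the constant commonly cited for RRF) are assumptions for illustration; Qdrant's internal constant and tie handling may differ.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several ranked id lists into one, using ranks only (no raw scores)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gets credit from every list it appears in
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results of the two prefetches, best match first
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Documents returned by both methods (`doc_a` and `doc_c`) rise to the top, while documents found by only one method are kept but ranked lower. Because only ranks are used, the incomparable score scales of BM25 and cosine similarity never have to be reconciled.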

In [12]:
def retrieve_fusion(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 and dense retrieval + fusion to merge them.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                limit=n_docs,
            ),
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=MODEL_NAME,
                ),
                using=VECTOR_NAME,
                limit=n_docs,
            ),
        ],
        # Reciprocal Rank Fusion works on the rankings
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [13]:
retrieve_fusion("How do I perform a KNN search on a large scale?", n_docs=10)

['Show HN: GritQL, a Rust CLI for rewriting source code Hi everyone!<p>I’m excited to open source GritQL, a Rust CLI for searching and transforming source code.<p>GritQL comes from my experiences with conducting large scale refactors and migrations.<p>Usually, I would start exploring a codebase with grep. This is easy to start with, but most migrations end up accumulating additional requirements like ensuring the right packages are imported and excluding cases which don’t have a viable migration path.<p>Eventually, to build a complex migration, I usually ended up having to write a full codemod program with a tool like jscodeshift. This comes with its own problems:<p>- Most of the exploratory work has to be abandoned as you figure out how to represent your original regex search as an AST.\n- Reading&#x2F;writing a codemod requires mentally translating from AST names back to what source code actually looks like.\n- Performance is often an afterthought, so iterating on a large codemod can

In [14]:
response = await rag(
    "How do I perform a KNN search on a large scale?", 
    retrieve_func=retrieve_fusion
)
print(response)

Based on the provided context, I can offer limited information about large-scale KNN search.

The context mentions a few relevant points:

1. **Approximate k-NN algorithms**: One post asks about the state-of-the-art approximate k-NN search algorithms, specifically mentioning IVF, LSH, and HNSW as options, though no definitive answer is provided about which is best.

2. **KNN in RAG systems**: The Neum AI framework description mentions using "kNN search that the agent can use to find the right data in the db" as part of their RAG (Retrieval-Augmented Generation) system. They describe using vector embeddings and vector databases to search for semantically similar information at scale.

3. **Vector embeddings approach**: The context describes transforming content (like news articles, papers, etc.) into vector embeddings that represent semantic meaning, then organizing these into indexes for quick similarity searches.

However, I must admit that the provided context doesn't contain compreh

More complex problems may require stronger rerankers to capture the nuances of the data. That's why we also created ColBERT embeddings, and it's finally time to test them.

In [19]:
def retrieve_colbert_reranking(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 and dense retrieval + ColBERT to rerank them.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                limit=n_docs,
            ),
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=MODEL_NAME,
                ),
                using=VECTOR_NAME,
                limit=n_docs,
            ),
        ],
        # Reranking with ColBERT embeddings
        query=models.Document(
            text=q,
            model=MUTLIVECTOR_MODEL_NAME,
        ),
        using=MULTIVECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [20]:
response = await rag(
    "How do I perform a KNN search on a large scale?", 
    retrieve_func=retrieve_colbert_reranking
)
print(response)

Based on the provided context, I can only offer limited information about large-scale k-NN (k-nearest neighbor) search.

From the context, there is a question asking "What is the state of art approximate k-NN search algorithm today?" and mentions that someone is interested in learning what kind of ANN (Approximate Nearest Neighbor) algorithms big search companies like Google use, specifically mentioning IVF, LSH, and HNSW as examples.

Additionally, there's a mention in the Neum AI framework description of using "kNN search" with metadata embeddings to help an AI agent find the right data in databases, and reference to vector embeddings being used for semantic search where "vector representations are organized into indexes where we can quickly search for the pieces of information that most closely resembles (from a semantic perspective) a given question or query."

However, the context doesn't provide specific implementation details, performance characteristics, or concrete guidance on

In [21]:
retrieve_colbert_reranking("How do I perform a KNN search on a large scale?", n_docs=10)

['Show HN: Neum AI – Open-source large-scale RAG framework Over the last couple months we have been supporting developers in building large-scale RAG pipelines to process millions of pieces of data.<p>We documented our approach in an HN post (<a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37824547">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37824547</a>) a couple weeks ago. Today, we are open sourcing the framework we have developed.<p>The framework focuses on RAG data pipelines and provides scale, reliability, and data synchronization capabilities out of the box.<p>For those newer to RAG, it is a technique to provide context to Large Language Models. It consists of grabbing pieces of information (i.e. pieces of news articles, papers, descriptions, etc.) and incorporating them into prompts to help contextualize the responses. The technique goes one level deeper in finding the right pieces of information to incorporate. The search for relevant information is done 