# Easy ways to improve search relevance

So far, we have tested various embedding models, but that does not cover every way to make search results relevant. Relevance sometimes depends not on the document itself but on additional criteria, such as geographical proximity, recency, or other business rules that no embedding model can capture. These criteria may also change over time, so we cannot encode them directly in the vectors. Qdrant has mechanisms that improve the quality of the retrieval output with little effort and without re-embedding any documents.

In [None]:
from dotenv import load_dotenv

load_dotenv()

In [None]:
# See: https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-text-embedding-models
COLLECTION_NAME = "hackernews-hybrid-rag"

# Dense retrieval
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384
VECTOR_NAME = "bge-small-en-v1.5"

# Sparse model
BM25_MODEL_NAME = "Qdrant/bm25"
BM25_VECTOR_NAME = "bm25"

# Token-level representations
MUTLIVECTOR_MODEL_NAME = "colbert-ir/colbertv2.0"
MULTIVECTOR_SIZE = 128
MULTIVECTOR_NAME = "colbertv2.0"

In [None]:
from qdrant_client import QdrantClient, models

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)

In [None]:
from any_llm import acompletion
from typing import Callable

RetrieverFunc = Callable[[str, int], list[str]]

LLM_NAME = "claude-sonnet-4-20250514"

async def rag(q: str, retrieve_func: RetrieverFunc, *, n_docs: int = 10) -> str:
    """
    Run single-turn RAG on a given input query.
    Return just the model response.
    """
    docs = retrieve_func(q, n_docs)
    # Join outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12
    context = "\n".join(docs)
    messages = [
        {
            "role": "user",
            "content": (
                "Please provide a response to my question based only " +
                "on the provided context and only it. If it doesn't " +
                "contain any helpful information, please let me know " +
                "and admit you cannot produce a relevant answer.\n" +
                f"<context>{context}</context>\n" +
                f"<question>{q}</question>"
            )
        }
    ]
    response = await acompletion(
        provider=os.environ.get("LLM_PROVIDER"),
        model=LLM_NAME,
        messages=messages,
    )
    return response.choices[0].message.content

## Search diversity

Retrieval Augmented Generation is only as good as the documents passed to the LLM. A common issue in RAG-like applications is a lack of diversity: the retriever returns dozens of near duplicates. If a document adds no new information, it only wastes tokens, so it makes sense to diversify the results to cover a broader spectrum. Vector search alone always returns the documents with the highest scores, which is exactly what it is expected to do. Diversification is therefore typically a post-processing step: retrieve more candidates than needed, then choose a subset that balances relevance and diversity. Qdrant implements the Maximal Marginal Relevance (MMR) algorithm, which does exactly this, and exposes it through the Universal Query API.
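To build some intuition for what MMR does, here is a minimal pure-Python sketch of the greedy selection step. It is an illustration only, not Qdrant's internal implementation: the helper names `cosine` and `mmr_select` are hypothetical, and the scoring formula `(1 - diversity) * relevance - diversity * redundancy` is an assumption chosen to match the convention that `0.0` means relevance only and `1.0` means diversity only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: list[float],
    candidates: list[list[float]],
    k: int,
    diversity: float = 0.5,
) -> list[int]:
    """
    Greedy MMR sketch (hypothetical helper, not a Qdrant API).
    Returns the indices of the selected candidates, in pick order.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    def mmr_score(i: int) -> float:
        # Redundancy: similarity to the closest already selected item
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return (1 - diversity) * relevance[i] - diversity * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near duplicates (indices 0 and 1) and one different document (index 2)
query_vector = [1.0, 0.0]
candidate_vectors = [[1.0, 0.1], [1.0, 0.11], [0.6, 0.8]]

print(mmr_select(query_vector, candidate_vectors, k=2, diversity=0.0))
print(mmr_select(query_vector, candidate_vectors, k=2, diversity=0.75))
```

With `diversity=0.0` the two near duplicates win on relevance alone, while a higher value lets the less similar document replace the second duplicate. This is also why the retrieval step needs to fetch more candidates than it finally returns.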

In [None]:
def retrieve_diverse(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 and dense retrieval + MMR to diversify
    on ColBERT vectors.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=MODEL_NAME,
                ),
                using=VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
        ],
        # Maximal Marginal Relevance
        query=models.NearestQuery(
            nearest=models.Document(
                text=q,
                model=MUTLIVECTOR_MODEL_NAME,
            ),
            mmr=models.Mmr(
                # 0.0 - relevance only, 1.0 - diversity only
                diversity=0.75,
            )
        ),
        using=MULTIVECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['datetime']} {point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [None]:
retrieve_diverse("How do I perform a KNN search on a large scale?", n_docs=10)

In [None]:
response = await rag(
    "How do I perform a KNN search on a large scale?", 
    retrieve_func=retrieve_diverse
)
print(response)

In [None]:
response = await rag(
    "What is the community working on?", 
    retrieve_func=retrieve_diverse
)
print(response)

In [None]:
retrieve_diverse("What is the community working on?", n_docs=10)

## Applying business rules

Vector search does not account for factors such as geographical proximity or recency. We could apply payload filters to restrict the results to the last week, month, or year, but a filter is a hard limit, which is not ideal when we only want to express a preference. To combine relevance with additional criteria, we need to recalculate the final score from the scores returned by the individual retrieval methods and the other factors we care about. Score boosting is the mechanism that does exactly this.

HackerNews provides a `time` attribute that we converted to a proper `datetime` at the very beginning. Let's try to use it for recency.
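One common way to turn recency into a soft preference is an exponential decay multiplier applied to the relevance score. The sketch below reproduces the arithmetic in plain Python, assuming the decay takes the form `exp(ln(midpoint) * |x - target| / scale)`, so the multiplier equals `midpoint` when the item is exactly `scale` away from the target. The function name `exp_decay` and the fixed reference date are hypothetical and used only for illustration; check the Qdrant documentation for the exact formula it applies.

```python
import math
from datetime import datetime, timedelta, timezone


def exp_decay(
    x: datetime,
    target: datetime,
    scale_seconds: float,
    midpoint: float = 0.5,
) -> float:
    """
    Exponential decay multiplier in (0, 1] (hypothetical helper).
    Equals 1.0 at the target and `midpoint` at a distance of `scale_seconds`.
    """
    distance = abs((target - x).total_seconds())
    return math.exp(math.log(midpoint) * distance / scale_seconds)


# Fixed reference point so the example is reproducible
now = datetime(2025, 9, 25, tzinfo=timezone.utc)
one_year = 86400 * 365  # seconds in a day * 365

for age_days in (0, 30, 365, 730):
    multiplier = exp_decay(now - timedelta(days=age_days), now, one_year)
    print(f"{age_days:>4} days old -> multiplier {multiplier:.3f}")
```

Multiplying the retrieval score by this value keeps fresh documents nearly untouched and halves the score of a one-year-old document, without excluding anything outright the way a payload filter would.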

In [None]:
from datetime import datetime, timezone

def retrieve_recent(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 and dense retrieval + recency.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=MODEL_NAME,
                ),
                using=VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
        ],
        # Score boosting
        query=models.FormulaQuery(
            formula=models.MultExpression(
                mult=[
                    "$score",
                    models.ExpDecayExpression(
                        exp_decay=models.DecayParamsExpression(
                            x=models.DatetimeKeyExpression(
                                datetime_key="datetime" # payload key 
                            ),
                            # Current datetime in "2025-09-25T00:00:00Z" format
                            target=models.DatetimeExpression(
                                datetime=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                            ),
                            scale=86400 * 365, # 1 day in seconds * 365
                            # With midpoint=0.5, the decay multiplier drops to 0.5
                            # when the item's "datetime" is exactly `scale` (1 year)
                            # away from the target, and keeps shrinking beyond that
                            midpoint=0.5
                        )
                    )
                ]
            ),
        ),
        # Return only the requested number of documents
        limit=n_docs,
    )
    docs = [
        f"{point.payload['datetime']} {point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [None]:
retrieve_recent("What is the community working on?", n_docs=10)

In [None]:
response = await rag(
    "What is the community working on?", 
    retrieve_func=retrieve_recent
)
print(response)