# Easy ways to improve search relevance

So far, we have tested various embedding models, but that doesn't mean we have covered every way to make our search results relevant. Relevance is sometimes not about the document itself, but about additional criteria, such as geographical proximity, recency, or other business rules that no embedding model can capture. These criteria may also change over time, so we cannot encode them directly in the vectors. Qdrant has some mechanisms that can boost the quality of the retrieval outputs with little effort and without recomputing any vectors.

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
# See: https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-text-embedding-models
COLLECTION_NAME = "hackernews-hybrid-rag"

# Dense retrieval
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384
VECTOR_NAME = "bge-small-en-v1.5"

# Sparse model
BM25_MODEL_NAME = "Qdrant/bm25"
BM25_VECTOR_NAME = "bm25"

# Token-level representations
MUTLIVECTOR_MODEL_NAME = "colbert-ir/colbertv2.0"
MULTIVECTOR_SIZE = 128
MULTIVECTOR_NAME = "colbertv2.0"

In [3]:
from qdrant_client import QdrantClient, models

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)

In [4]:
from any_llm import acompletion
from typing import Callable

RetrieverFunc = Callable[[str, int], list[str]]


async def rag(q: str, retrieve_func: RetrieverFunc, *, n_docs: int = 10) -> str:
    """
    Run single-turn RAG on a given input query.
    Return just the model response.
    """
    docs = retrieve_func(q, n_docs)
    # Join outside the f-string: a backslash inside an f-string
    # expression is a syntax error before Python 3.12
    context = "\n".join(docs)
    messages = [
        {
            "role": "user",
            "content": (
                "Please provide a response to my question based only " +
                "on the provided context and only it. If it doesn't " +
                "contain any helpful information, please let me know " +
                "and admit you cannot produce a relevant answer.\n" +
                f"<context>{context}</context>\n" +
                f"<question>{q}</question>"
            )
        }
    ]
    response = await acompletion(
        provider=os.environ.get("LLM_PROVIDER"),
        model="claude-sonnet-4-20250514",
        messages=messages,
    )
    return response.choices[0].message.content

## Search diversity

Your Retrieval Augmented Generation pipeline is only as good as the documents it passes to the LLM. A common issue in RAG-like applications is a lack of diversity: the retriever returns dozens of near-duplicates. If a document does not bring any additional information, we're only wasting tokens on it, so it makes sense to diversify the results to cover a broader spectrum. Unfortunately, vector search alone will always return the documents with the highest scores, and that's what it is expected to do. Diversification is typically done as a post-processing step: we retrieve more candidates than we need and choose a subset that balances relevance with diversity. Qdrant implements the Maximal Marginal Relevance (MMR) algorithm, which does exactly this, and exposes it through the Universal Query API.
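
To build some intuition for what MMR does, here is a minimal pure-Python sketch of the greedy selection step. It is an illustration only, not Qdrant's implementation: the helper names `cosine` and `mmr_select`, the toy 2-dimensional vectors, and the exact scoring formula are assumptions made for this example. The `diversity` parameter is meant to follow the same convention as in the query used in this notebook (0.0 means relevance only, 1.0 means diversity only).

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: list[float],
    candidates: list[list[float]],
    k: int,
    diversity: float = 0.5,
) -> list[int]:
    """Greedily pick up to k candidate indices, trading relevance for novelty."""
    relevance = [cosine(query, c) for c in candidates]
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    while remaining and len(selected) < k:

        def mmr_score(i: int) -> float:
            # Redundancy: similarity to the closest already-selected candidate
            # (0.0 while nothing has been selected yet)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return (1.0 - diversity) * relevance[i] - diversity * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is different
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

relevance_only = mmr_select(query, candidates, k=2, diversity=0.0)
diversified = mmr_select(query, candidates, k=2, diversity=0.75)
print(relevance_only, diversified)
```

With `diversity=0.0` the two near-duplicates win because they are the closest to the query. With `diversity=0.75` the second pick is penalized for being almost identical to the first, so the less similar but novel candidate is chosen instead.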

In [5]:
def retrieve_diverse(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 and dense retrieval + MMR to diversify
    on ColBERT vectors.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=MODEL_NAME,
                ),
                using=VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
        ],
        # Maximal Marginal Relevance
        query=models.NearestQuery(
            nearest=models.Document(
                text=q,
                model=MUTLIVECTOR_MODEL_NAME,
            ),
            mmr=models.Mmr(
                # 0.0 - relevance only, 1.0 - diversity only
                diversity=0.75,
            )
        ),
        using=MULTIVECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['datetime']} {point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [6]:
retrieve_diverse("How do I perform a KNN search on a large scale?", n_docs=10)

['2023-11-21T19:20:29Z Show HN: Neum AI – Open-source large-scale RAG framework Over the last couple months we have been supporting developers in building large-scale RAG pipelines to process millions of pieces of data.<p>We documented our approach in an HN post (<a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37824547">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=37824547</a>) a couple weeks ago. Today, we are open sourcing the framework we have developed.<p>The framework focuses on RAG data pipelines and provides scale, reliability, and data synchronization capabilities out of the box.<p>For those newer to RAG, it is a technique to provide context to Large Language Models. It consists of grabbing pieces of information (i.e. pieces of news articles, papers, descriptions, etc.) and incorporating them into prompts to help contextualize the responses. The technique goes one level deeper in finding the right pieces of information to incorporate. The search for relevant

In [7]:
response = await rag(
    "How do I perform a KNN search on a large scale?", 
    retrieve_func=retrieve_diverse
)
print(response)

Based on the provided context, I can find some relevant information about performing KNN search on a large scale:

From the Neum AI RAG framework post, there are a few approaches mentioned for large-scale KNN search:

1. **Vector Embeddings and Vector Databases**: The context explains that pieces of information are "transformed into a vector embedding that represents the semantic meaning of the information. These vector representations are organized into indexes where we can quickly search for the pieces of information that most closely resembles (from a semantic perspective) a given question or query."

2. **High Throughput Distributed Architecture**: The Neum AI framework mentions supporting "large scale jobs through a high throughput distributed architecture" where you can "parallelize tasks like downloading documents, processing them, generating embedding and ingesting data into the vector DB."

3. **Metadata-Enhanced Search**: The framework supports "hybrid search using the availa

In [8]:
response = await rag(
    "What is the community working on?", 
    retrieve_func=retrieve_diverse
)
print(response)

Based on the provided context, there are multiple communities mentioned working on different things:

1. **Pirates community** - A founder-investor network with 1000 members on WhatsApp and Telegram focused on helping each other succeed, looking to evolve into a product.

2. **Bloxd Io community** - A community of builders and creators who connect, share works-in-progress, exchange ideas, and engage in friendly competitions around creative projects.

3. **Defense startup community** - People working in defense technology, specifically focusing on SIGINT & RF Spectrum Management solutions and researching market opportunities.

4. **Indie tech blog community** - A community for tech bloggers to share what they're working on, post drafts, and comment on other users' posts.

5. **JSON Schema organization** - Working on improving JSON Schema through Community Working Meetings and Office Hours.

6. **Oasis tech collective** - A community-driven collective building an advocacy flywheel throug

In [9]:
retrieve_diverse("What is the community working on?", n_docs=10)

['2023-05-30T18:30:35Z Ask HN: Where have you found community outside of work? Asking for myself and those who are looking for what good communities often provide: feeling of connection, purpose, a place to go, etc.',
 '2024-02-18T07:58:45Z How will you turn a community into a product? I run a tight founder investor network called Pirates. We have 1000 members active on WhatsApp and Telegram. Wonder how to evolve this community into a product. The goal is to help each other succeed. TIA',
 '2024-02-16T03:48:50Z Show HN: Bloxd Io It&#x27;s a community of passionate builders and creators. Connect with fellow players, share your works-in-progress, exchange ideas, and engage in friendly competitions. Join the conversation on social media using the hashtag #Bloxdio, and be part of a supportive community that celebrates creativity.',
 '2024-02-18T02:48:57Z Looking for Defense Startup Insights, Communities, Funding, etc. Hi all,<p>I’m currently working within the defense technology space focu

## Applying business rules

Vector search alone does not account for factors such as geographical proximity or recency. We could apply payload filters to restrict results to the last week, month, or year, but that's not ideal if we want to express a preference rather than a hard limit. To combine relevance with additional criteria, we need to recalculate the final score from the scores returned by the individual retrieval methods and the other factors we want to consider. Score boosting is the mechanism that does exactly this.

Hacker News provides a `time` attribute that we converted to a proper `datetime` at the very beginning. Let's try to use it for recency.
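
Before involving Qdrant, it may help to see how a recency preference can be expressed as a multiplier on the relevance score. The sketch below is plain Python and makes one assumption that should be checked against the Qdrant documentation: that exponential decay is computed as `exp(ln(midpoint) * |x - target| / scale)`, so the multiplier is 1.0 at the target and equals `midpoint` when the distance equals `scale`. The function names `exp_decay` and `recency_boosted`, the fixed `now` value, and the example scores are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone

ONE_YEAR_SECONDS = 86400 * 365


def exp_decay(x: float, target: float, scale: float, midpoint: float = 0.5) -> float:
    """Multiplier in (0, 1]: 1.0 at the target, `midpoint` at distance `scale`."""
    return math.exp(math.log(midpoint) * abs(x - target) / scale)


def recency_boosted(score: float, published: datetime, now: datetime) -> float:
    """Scale a relevance score by how close `published` is to `now`."""
    decay = exp_decay(
        published.timestamp(),
        now.timestamp(),
        scale=ONE_YEAR_SECONDS,
    )
    return score * decay


# A fixed reference point keeps the example reproducible
now = datetime(2025, 9, 25, tzinfo=timezone.utc)

# An older document with a high relevance score...
old_doc = recency_boosted(0.9, now - timedelta(days=730), now)
# ...can end up below a fresh document with a lower score
fresh_doc = recency_boosted(0.4, now, now)
print(old_doc, fresh_doc)
```

Unlike a payload filter, the old document is not excluded. It only needs a proportionally higher relevance score to outrank newer ones, which is the "preference, not a hard limit" behavior described above.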

In [12]:
from datetime import datetime, timezone

def retrieve_recent(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    with BM25 and dense retrieval + recency.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=[
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=BM25_MODEL_NAME,
                ),
                using=BM25_VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
            models.Prefetch(
                query=models.Document(
                    text=q,
                    model=MODEL_NAME,
                ),
                using=VECTOR_NAME,
                # Ten times more than expected
                limit=(n_docs * 10),
            ),
        ],
        # Score boosting
        query=models.FormulaQuery(
            formula=models.MultExpression(
                mult=[
                    "$score",
                    models.ExpDecayExpression(
                        exp_decay=models.DecayParamsExpression(
                            x=models.DatetimeKeyExpression(
                                datetime_key="datetime" # payload key 
                            ),
                            # Current datetime in "2025-09-25T00:00:00Z" format
                            target=models.DatetimeExpression(
                                datetime=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                            ),
                            scale=86400 * 365, # 1 day in seconds * 365 = 1 year
                            # If item's "datetime" is more than 1 year apart from
                            # the current datetime, the decay multiplier is less than 0.5
                            midpoint=0.5
                        )
                    )
                ]
            ),
        ),
        # Return only the requested number of documents
        limit=n_docs,
    )
    docs = [
        f"{point.payload['datetime']} {point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [13]:
retrieve_recent("What is the community working on?", n_docs=10)

['2025-09-11T09:43:57Z When Startups Ask for Free Security Work A few weeks ago, I explored [redacted], a YC-backed AI backend platform. Like many security researchers, I tend to poke at new tools to see how they handle common attack vectors.<p>It didn’t take long to find issues, both in security and user experience.<p>## The Vulnerabilities<p>*Authorization Flaw*: [redacted] limits free users to 3 items, with a paywall for more. But their API doesn’t enforce this. Anyone can bypass the frontend and call the API directly.<p>This classic flaw means free users can generate unlimited content, paid tiers lose value, and the business model collapses.<p>*UX Problems*: The platform also has confusing navigation, inconsistent design, poor hierarchy, clunky workflows, and unclear onboarding. When the product experience feels this raw, security flaws are just another sign of neglect.<p>## The Response<p>I asked in their community channel about their disclosure process. The founder replied:<p>“hi

In [14]:
response = await rag(
    "What is the community working on?", 
    retrieve_func=retrieve_recent
)
print(response)

Based on the provided context, I can see several different communities working on various projects:

1. **JSON Schema Community** - They are working on improving JSON Schema and making their Community Working Meetings and Office Hours more engaging. They're seeking feedback to attract new recurring participants to help improve JSON Schema.

2. **Defense Startup Community** - Someone is researching this community to understand market opportunities in defense technology, specifically focusing on SIGINT & RF Spectrum Management solutions.

3. **Various Individual Startup Communities** working on:
   - **Lambda** - An open-source, privacy-focused social media app
   - **Growr** - A marketplace app for buying and selling used baby and kids' items
   - **Frederick AI** - AI tools for early-stage startup founders, including market data collection and business plan generation
   - **Radius** - A Meetup.com alternative for creating communities and discovering events
   - **A VR-chat app** - A 3