# Filling a collection with data

We'll start our experiments with a collection filled with [HackerNews](https://news.ycombinator.com/) submissions. Retrieval Augmented Generation is typically built with dense vectors, so let's check whether that approach works in all the cases we would like to support. The [hackernews.csv](../data/hackernews.csv) file is a dump of HN submissions, without the comments. Let's process it!

## Setting up Qdrant collection

Our collection needs to be configured for a single vector per point. Even though there is just one vector, we'll use named vectors. If you want to use a different model, now is the time to configure it below; remember to update `VECTOR_SIZE` so it matches the output dimensionality of the model you choose.

In [None]:
from dotenv import load_dotenv

load_dotenv()

In [None]:
# See: https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-text-embedding-models
COLLECTION_NAME = "hackernews-rag"
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384
VECTOR_NAME = "bge-small-en-v1.5"

In [None]:
import os

from qdrant_client import QdrantClient, models

client = QdrantClient(
    os.environ.get("QDRANT_URL"),
    api_key=os.environ.get("QDRANT_API_KEY"),
)
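
Since `os.environ.get` returns `None` for unset variables, a missing `.env` entry may only surface later as a confusing connection error. The sketch below shows one way to check the configuration up front. The `missing_env_vars` helper is hypothetical, and whether `QDRANT_API_KEY` is really required depends on your deployment.

```python
import os
from typing import Mapping


def missing_env_vars(names: list[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or set to an empty string."""
    return [name for name in names if not env.get(name)]


# Variable names taken from this notebook; adjust the list to your setup.
missing = missing_env_vars(["QDRANT_URL", "QDRANT_API_KEY", "LLM_PROVIDER"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Run it right after `load_dotenv()` so that values from the `.env` file are already visible.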

In [None]:
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        VECTOR_NAME: models.VectorParams(
            size=VECTOR_SIZE,
            distance=models.Distance.COSINE,
        )
    },
)
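
The collection above is configured with `models.Distance.COSINE`. As a reminder of what that metric measures, here is a minimal pure-Python sketch of cosine similarity. The `cosine_similarity` helper and the toy vectors are illustrative only; they are not part of the Qdrant client and say nothing about how Qdrant computes the score internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; the real embeddings have VECTOR_SIZE dimensions.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 1.0, 0.0]      # shares no component with the query

print(cosine_similarity(query, same_direction))  # close to 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

The metric ignores vector length and only compares direction, which is why the scaled copy of the query scores as a near-perfect match.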

## Processing input data

Our dataset is a regular CSV file that we need to iterate over and store in Qdrant. We'll use the local inference mode provided by the Qdrant and FastEmbed integration, so we don't need to compute the vectors separately: we can pass the raw text and let the client convert it.

In [None]:
import csv

with open("../data/hackernews.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    row = next(reader)
    print(row)

In [None]:
from itertools import batched
from tqdm import tqdm
from datetime import datetime, timezone

with open("../data/hackernews.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for batch in tqdm(batched(reader, n=16)):
        client.upsert(
            collection_name=COLLECTION_NAME,
            points=[
                models.PointStruct(
                    # HackerNews id is Qdrant id as well
                    id=int(point["id"]),
                    vector={
                        VECTOR_NAME: models.Document(
                            text=f"{point['title']} {point['text']}",
                            model=MODEL_NAME,
                        )
                    },
                    payload={
                        # datetime.utcfromtimestamp is deprecated since Python 3.12,
                        # so build a timezone-aware UTC datetime instead
                        "datetime": datetime.fromtimestamp(int(point["time"]), tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                        **point
                    },
                )
                for point in batch
            ]
        )
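
Note that `itertools.batched` is only available from Python 3.12 onwards. On older interpreters, a small fallback built on `itertools.islice` should behave the same way for this use case. The `batched_compat` name is hypothetical, and the CSV rows below are made up.

```python
import csv
import io
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched_compat(iterable: Iterable[T], n: int) -> Iterator[tuple[T, ...]]:
    """Yield consecutive tuples of up to n items, like itertools.batched."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice consumes at most n items per call; an empty tuple ends the loop
    while batch := tuple(islice(iterator, n)):
        yield batch


# In-memory CSV standing in for ../data/hackernews.csv.
sample = io.StringIO("id,title,text\n1,First,\n2,Second,Body\n3,Third,\n")
reader = csv.DictReader(sample)
batch_sizes = [len(batch) for batch in batched_compat(reader, 2)]
print(batch_sizes)  # [2, 1]
```

The last batch is simply shorter when the number of rows is not a multiple of the batch size, so no rows are dropped.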

In [None]:
client.recover_snapshot(
    collection_name=COLLECTION_NAME,
    # Please do not modify the URL below
    location="https://storage.googleapis.com/tutorials-snapshots-bucket/workshop-improving-r-in-rag/hackernews-rag.snapshot",
    wait=False,  # Loading a snapshot may take some time, so let's avoid a timeout
)

## Building RAG with Qdrant-based retrieval

Let's build a naive RAG with dense vector search. It'll be a very basic process: use the original prompt as the query, then pass the retrieved context to the LLM.

In [None]:
from any_llm import list_models

list_models(provider=os.environ.get("LLM_PROVIDER"))

In [None]:
LLM_NAME = "claude-sonnet-4-20250514"

In [None]:
def retrieve(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        query=models.Document(
            text=q,
            model=MODEL_NAME,
        ),
        using=VECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [None]:
retrieve("What are the coolest ideas for an AI startup?", n_docs=10)

### Payload-based filtering

HackerNews submissions may have just a title, but such submissions rarely provide any useful information. It makes sense to exclude them entirely and focus only on those that carry more detail than the title alone.

In [None]:
client.create_payload_index(
    collection_name=COLLECTION_NAME,
    field_name="text",
    field_schema="keyword",
)

In [None]:
def retrieve_filtered(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query,
    but only those which have non-empty text attribute.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        query=models.Document(
            text=q,
            model=MODEL_NAME,
        ),
        query_filter=models.Filter(
            must_not=[
                # Lack of field
                models.IsEmptyCondition(
                    is_empty=models.PayloadField(key="text"),
                ),
                # Field set to null value
                models.IsNullCondition(
                    is_null=models.PayloadField(key="text"),
                ),
                # Field set to an empty string
                models.FieldCondition(
                    key="text",
                    match=models.MatchValue(value=""),
                ),
            ],
        ),
        using=VECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs
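
To make the three `must_not` conditions easier to reason about, here is a rough pure-Python mirror of what they exclude, as far as we understand the semantics described in the Qdrant documentation: `IsEmptyCondition` covers a missing field, a null, or an empty list; `IsNullCondition` covers an explicit null; and the `MatchValue` clause covers the empty string. The `has_text` helper is hypothetical and operates on plain dictionaries, not on Qdrant points.

```python
from typing import Any

_MISSING = object()  # sentinel to tell "no such key" apart from None


def has_text(payload: dict[str, Any], key: str = "text") -> bool:
    """Approximate the must_not filter: keep only payloads with real text."""
    value = payload.get(key, _MISSING)
    is_empty = value is _MISSING or value is None or value == []
    is_null = value is None
    is_blank = value == ""
    # must_not: a point passes only if none of the conditions matches
    return not (is_empty or is_null or is_blank)


# Made-up payloads covering each excluded case plus one that passes.
payloads = [
    {"title": "Only a title"},
    {"title": "Null text", "text": None},
    {"title": "Empty text", "text": ""},
    {"title": "With details", "text": "Some details"},
]
kept = [p["title"] for p in payloads if has_text(p)]
print(kept)  # ['With details']
```

Because the CSV reader produces empty strings for blank cells, the empty-string clause is likely the one doing most of the work for data loaded from `hackernews.csv`.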

In [None]:
retrieve_filtered("What are the coolest ideas for an AI startup?", n_docs=10)

In [None]:
from any_llm import acompletion
from typing import Callable

RetrieverFunc = Callable[[str, int], list[str]]


async def rag(q: str, retrieve_func: RetrieverFunc, *, n_docs: int = 10) -> str:
    """
    Run single-turn RAG on a given input query.
    Return just the model response.
    """
    docs = retrieve_func(q, n_docs)
    messages = [
        {
            "role": "user",
            "content": (
                "Please provide a response to my question based only " +
                "on the provided context. If it doesn't " +
                "contain any helpful information, please let me know " +
                "and admit you cannot produce a relevant answer.\n" +
                f"<context>{'\n'.join(docs)}</context>\n" +
                f"<question>{q}</question>"
            )
        }
    ]
    response = await acompletion(
        provider=os.environ.get("LLM_PROVIDER"),
        model=LLM_NAME,
        messages=messages,
    )
    return response.choices[0].message.content
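
Because `rag` accepts any function with the `Callable[[str, int], list[str]]` signature, the prompt assembly can be checked offline with a stub retriever and without calling an LLM. The sketch below assumes two hypothetical helpers, `build_prompt` and `fake_retrieve`; the instruction text is abbreviated and the canned documents are made up.

```python
from typing import Callable


def build_prompt(q: str, docs: list[str]) -> str:
    """Wrap the retrieved documents and the question in XML-like tags."""
    context = "\n".join(docs)
    return (
        "Please answer my question based only on the provided context.\n"
        f"<context>{context}</context>\n"
        f"<question>{q}</question>"
    )


def fake_retrieve(q: str, n_docs: int) -> list[str]:
    """Stand-in retriever returning canned documents (no Qdrant needed)."""
    canned = ["Doc about vector search", "Doc about startups", "Doc about LLMs"]
    return canned[:n_docs]


# Any function with this signature can be swapped in for the real retriever.
retriever: Callable[[str, int], list[str]] = fake_retrieve
question = "What is vector search?"
prompt = build_prompt(question, retriever(question, 2))
print(prompt)
```

Joining the documents before building the f-string also sidesteps a portability detail: a backslash inside an f-string expression, as in `{'\n'.join(docs)}`, is only accepted from Python 3.12 onwards.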

In [None]:
response = await rag(
    "What are the coolest ideas for an AI startup?", 
    retrieve_func=retrieve_filtered
)
print(response)

In [None]:
response = await rag("What does Qdrant do?", retrieve_func=retrieve_filtered)
print(response)

In [None]:
docs = retrieve_filtered("What does Qdrant do?", n_docs=10)
docs
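
One quick way to inspect a list like `docs` is to count how many of the retrieved documents mention the key term from the question at all. The sketch below uses a hypothetical `count_mentions` helper on made-up documents; it is a plain substring check, not a relevance measure.

```python
def count_mentions(docs: list[str], term: str) -> int:
    """Count the documents that contain the term, ignoring case."""
    needle = term.casefold()
    return sum(1 for doc in docs if needle in doc.casefold())


# Made-up documents standing in for the output of retrieve_filtered.
sample_docs = [
    "Show HN: A vector database written in Rust",
    "Ask HN: What does your startup do?",
    "Qdrant 1.x released with new features",
]
print(count_mentions(sample_docs, "qdrant"))  # 1
```

Calling `count_mentions(docs, "Qdrant")` on the real results gives a rough sense of whether the dense retriever surfaced anything about the product named in the question.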