# Filling a collection with data

We'll start our experiments with a collection filled with [HackerNews](https://news.ycombinator.com/) submissions. Retrieval Augmented Generation is typically built with dense vectors, so let's check whether that works in all the cases we would like to support. The [hackernews.csv](../data/hackernews.csv) file is a dump of HN submissions, without the comments. Let's process it!

## Setting up Qdrant collection

Our collection needs to be configured for a single vector per point. Even though we have just a single vector, we'll use named vectors. If you want to use a different model, now is the time to configure it below.

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
# See: https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-text-embedding-models
COLLECTION_NAME = "hackernews-rag"
MODEL_NAME = "BAAI/bge-small-en-v1.5"
VECTOR_SIZE = 384
VECTOR_NAME = "bge-small-en-v1.5"

In [3]:
from qdrant_client import QdrantClient, models

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)

In [4]:
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        VECTOR_NAME: models.VectorParams(
            size=VECTOR_SIZE,
            distance=models.Distance.COSINE,
        )
    },
)

True
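
The collection above is configured with cosine distance. As a rough illustration of what that metric compares, here is a minimal pure-Python sketch. It is not how Qdrant computes it internally, and the `cosine_similarity` helper is a name introduced only for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score 1 regardless of their length,
# while orthogonal vectors score 0.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

Because only the direction matters, two texts of very different lengths can still end up close to each other if the model maps them to similar directions.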

## Processing input data

Our dataset is a regular CSV file that we need to iterate over and store in Qdrant. We'll use local inference through the Qdrant and FastEmbed integration, so we don't need to compute the vectors separately: we can pass the raw text and let the client convert it.

In [6]:
import csv

with open("../data/hackernews.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    row = next(reader)
    print(row)

{'id': '7811', 'deleted': '', 'type': 'story', 'by': 'Oldude59', 'time': '1175345257', 'text': '', 'dead': '', 'parent': '', 'poll': '', 'url': 'http://buscreate.blogspot.com', 'score': '1', 'title': 'Business Creativity and Happiness', 'descendants': '0', 'karma': '2'}
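
Every field in the row above arrives as a string, including the Unix timestamp. The sketch below shows, under that assumption, how a single row could be turned into the text to embed and the payload to store. `row_to_document` is a hypothetical helper used only for illustration, and the sample row is a trimmed copy of the one printed above.

```python
from datetime import datetime, timezone


def row_to_document(row: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Turn one CSV row into (text to embed, payload to store)."""
    # Title and body are joined with a single space.
    text = f"{row['title']} {row['text']}"
    # The CSV keeps the Unix timestamp as a string; format it as ISO 8601 in UTC.
    created = datetime.fromtimestamp(int(row["time"]), tz=timezone.utc)
    payload = {"datetime": created.strftime("%Y-%m-%dT%H:%M:%SZ"), **row}
    return text, payload


sample = {
    "id": "7811",
    "time": "1175345257",
    "text": "",
    "title": "Business Creativity and Happiness",
}
text, payload = row_to_document(sample)
print(repr(text))
print(payload["datetime"])
```

Note that a submission without a body still produces a trailing space in the embedded text, which is why the retrieved titles later in this notebook end with a space.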


In [7]:
from itertools import batched
from tqdm import tqdm
from uuid import uuid4
from datetime import datetime, timezone

with open("../data/hackernews.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for batch in tqdm(batched(reader, n=16)):
        client.upsert(
            collection_name=COLLECTION_NAME,
            points=[
                models.PointStruct(
                    # HackerNews id is Qdrant id as well
                    id=int(point["id"]),
                    vector={
                        VECTOR_NAME: models.Document(
                            text=f"{point['title']} {point['text']}",
                            model=MODEL_NAME,
                        )
                    },
                    payload={
                        # utcfromtimestamp is deprecated since Python 3.12, so use an aware UTC datetime
                        "datetime": datetime.fromtimestamp(int(point["time"]), tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
                        **point
                    },
                )
                for point in batch
            ]
        )

10405it [00:00, 15508.45it/s]
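
`itertools.batched` is only available from Python 3.12 onwards. On older interpreters, a small stand-in along these lines could be used instead. `batched_compat` is a hypothetical name, and this is a sketch rather than a drop-in replacement for every edge case.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched_compat(iterable: Iterable[T], n: int) -> Iterator[tuple[T, ...]]:
    """Yield successive tuples of at most n items from the iterable."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls up to n items at a time; an empty tuple ends the loop.
    while batch := tuple(islice(iterator, n)):
        yield batch


batches = list(batched_compat(range(5), 2))
print(batches)
```

Since `tqdm` wraps the batched iterator, the progress counter above counts batches of up to 16 rows, not individual rows.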


In [9]:
client.recover_snapshot(
    collection_name=COLLECTION_NAME,
    # Please do not modify the URL below
    location="https://storage.googleapis.com/tutorials-snapshots-bucket/workshop-improving-r-in-rag/hackernews-rag.snapshot",
    wait=False, # Loading a snapshot may take some time, so let's avoid a timeout
)

## Building RAG with Qdrant-based retrieval

Let's build a naive RAG with dense vector search. It'll be a very basic process: we use the original prompt as the query and then pass the retrieved context to the LLM.

In [10]:
from any_llm import list_models

list_models(provider=os.environ.get("LLM_PROVIDER"))

[Model(id='claude-opus-4-1-20250805', created=1754352000, object='model', owned_by='anthropic'),
 Model(id='claude-opus-4-20250514', created=1747872000, object='model', owned_by='anthropic'),
 Model(id='claude-sonnet-4-20250514', created=1747872000, object='model', owned_by='anthropic'),
 Model(id='claude-3-7-sonnet-20250219', created=1740355200, object='model', owned_by='anthropic'),
 Model(id='claude-3-5-sonnet-20241022', created=1729555200, object='model', owned_by='anthropic'),
 Model(id='claude-3-5-haiku-20241022', created=1729555200, object='model', owned_by='anthropic'),
 Model(id='claude-3-5-sonnet-20240620', created=1718841600, object='model', owned_by='anthropic'),
 Model(id='claude-3-haiku-20240307', created=1709769600, object='model', owned_by='anthropic'),
 Model(id='claude-3-opus-20240229', created=1709164800, object='model', owned_by='anthropic')]

In [11]:
def retrieve(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        query=models.Document(
            text=q,
            model=MODEL_NAME,
        ),
        using=VECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs

In [12]:
retrieve("What are the coolest ideas for an AI startup?", n_docs=10)

['Why to build an AI startup ',
 'Startup Ideas ',
 'Discovering Cool Ideas Making Money with AI ',
 'Some startup ideas ',
 'Where are the opportunities for new startups in generative AI? ',
 'Make any AI generated app ',
 'Billion Dollar Startup Ideas ',
 'AI Startup Wants to Create Jobs, Not Take Them Away ',
 'Explore Side Hustle Ideas to Make Money with AI ',
 'Top Generative AI Startups 2024 ']

### Payload-based filtering

HackerNews submissions may consist of just a title, but such submissions rarely provide any useful information. It makes sense to exclude them entirely and focus only on the ones that carry more detail than the title alone.

In [13]:
client.create_payload_index(
    collection_name=COLLECTION_NAME,
    field_name="text",
    field_schema="keyword",
)

UpdateResult(operation_id=10416, status=<UpdateStatus.COMPLETED: 'completed'>)

In [14]:
def retrieve_filtered(q: str, n_docs: int) -> list[str]:
    """
    Retrieve documents based on the provided query,
    but only those which have non-empty text attribute.
    """
    result = client.query_points(
        collection_name=COLLECTION_NAME,
        query=models.Document(
            text=q,
            model=MODEL_NAME,
        ),
        query_filter=models.Filter(
            must_not=[
                # Lack of field
                models.IsEmptyCondition(
                    is_empty=models.PayloadField(key="text"),
                ),
                # Field set to null value
                models.IsNullCondition(
                    is_null=models.PayloadField(key="text"),
                ),
                # Field set to an empty string
                models.FieldCondition(
                    key="text",
                    match=models.MatchValue(value=""),
                ),
            ],
        ),
        using=VECTOR_NAME,
        limit=n_docs,
    )
    docs = [
        f"{point.payload['title']} {point.payload['text']}"
        for point in result.points
    ]
    return docs
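
To make the intent of the three `must_not` conditions easier to follow, here is a rough plain-Python approximation of the check applied to a single payload. `has_text` is a hypothetical helper; the real filtering happens inside Qdrant, and this sketch does not try to reproduce its exact semantics.

```python
from typing import Any, Mapping


def has_text(payload: Mapping[str, Any]) -> bool:
    """Return True only if the payload carries a non-empty 'text' value."""
    value = payload.get("text")
    # A missing key and an explicit null both show up as None here.
    if value is None:
        return False
    # The remaining case to reject is the empty string.
    return value != ""


examples = [
    {"title": "No text key"},
    {"title": "Null text", "text": None},
    {"title": "Empty text", "text": ""},
    {"title": "With text", "text": "Some details"},
]
kept = [p["title"] for p in examples if has_text(p)]
print(kept)
```

For this dataset the empty-string case is the one that matters most, because `csv.DictReader` yields `''` for empty cells, as the sample row printed earlier shows.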

In [15]:
retrieve_filtered("What are the coolest ideas for an AI startup?", n_docs=10)

['Startup IDEA:- INVEST IN AI MODELS by tokenizing it and selling parts Okay so the idea is similar to ICO(initial coin offering) in which different AI models will be tokenized and funds will be raised.<p>The idea is to create a platform where engineers can raise funds on the models created by them.<p>please review &amp; suggest!',
 'What are the productivity improvements you hope AI can help with? I&#x27;m curious to hear about the most ambitious or even the smallest productivity and creativity enhancements you guys believe AI can facilitate',
 'Ask HN: Hardware to Build an AI Assistant? Any idea on what to use? My first thought of course is a raspberry pi (I got a lot in a drawer!) but my requirements should be:<p>-a voice activated microphone\n-an audio output device (speaker)\n-maybe a single case containing all\n-optional: a small screen<p>ideas? thanks everyone!',
 ' AI-generated images look pretty cool sometimes, and I would like authors/publishers to go as crazy possible.',
 'S

In [16]:
from any_llm import acompletion
from typing import Callable

RetrieverFunc = Callable[[str, int], list[str]]


async def rag(q: str, retrieve_func: RetrieverFunc, *, n_docs: int = 10) -> str:
    """
    Run single-turn RAG on a given input query.
    Return just the model response.
    """
    docs = retrieve_func(q, n_docs)
    messages = [
        {
            "role": "user",
            "content": (
                "Please provide a response to my question based only " +
                "on the provided context and only it. If it doesn't " +
                "contain any helpful information, please let me know " +
                "and admit you cannot produce relevant answer.\n" +
                f"<context>{'\n'.join(docs)}</context>\n" +
                f"<question>{q}</question>"
            )
        }
    ]
    response = await acompletion(
        provider=os.environ.get("LLM_PROVIDER"),
        model="claude-sonnet-4-20250514",
        messages=messages,
    )
    return response.choices[0].message.content
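
It can help to inspect the assembled prompt without calling the LLM at all. The sketch below extracts the prompt construction into a hypothetical `build_prompt` helper and feeds it made-up documents. It also joins the context outside the f-string, since a backslash inside an f-string expression only parses on Python 3.12 or newer.

```python
def build_prompt(q: str, docs: list[str]) -> str:
    """Assemble the single-turn RAG prompt from a question and context docs."""
    # Join outside the f-string so the code also parses before Python 3.12.
    context = "\n".join(docs)
    return (
        "Please provide a response to my question based only "
        "on the provided context and only it. If it doesn't "
        "contain any helpful information, please let me know "
        "and admit you cannot produce relevant answer.\n"
        f"<context>{context}</context>\n"
        f"<question>{q}</question>"
    )


# Placeholder documents, not real retrieval results.
prompt = build_prompt("What is this about?", ["first doc", "second doc"])
print(prompt)
```

Printing the prompt this way makes it easy to see exactly which retrieved documents the model is asked to rely on.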

In [17]:
response = await rag(
    "What are the coolest ideas for an AI startup?", 
    retrieve_func=retrieve_filtered
)
print(response)

Based on the provided context, here are some of the coolest AI startup ideas mentioned:

1. **AI Model Tokenization Platform** - A platform similar to ICO (Initial Coin Offering) where AI models are tokenized and engineers can raise funds by selling parts/shares of their AI models to investors.

2. **AI Co-founder for Startups (Frederick AI)** - An AI assistant that helps run startups by automating early-stage processes from idea generation to execution. It collects market data 24/7 to detect consumer, business, and government problems, then creates business plans and landing pages in under a minute.

3. **AI-Powered Icon Generator** - A micro SaaS that generates unique and beautiful icons using AI technology.

4. **AI-Driven Email Platform (ColdBook)** - An emailing platform designed to help startup founders and sales professionals with their outreach efforts.

5. **AI Assistant Hardware** - Building physical AI assistant devices with voice activation, audio output, optional screens, 

In [18]:
response = await rag("What does Qdrant do?", retrieve_func=retrieve_filtered)
print(response)

Based on the provided context, I can only find limited information about what Qdrant does:

1. **Poetry enhancement**: Qdrant can be integrated with GPT-4 to "elevate its poetry composition capabilities" and help transform poetry "with enhanced coherence and depth."

2. **GPT search functionality**: Qdrant is used as a component in building a search engine for GPTs, specifically mentioned in the context of AssistantHunter, which searches through thousands of GPTs.

However, the context doesn't provide a comprehensive explanation of Qdrant's core functionality or what it fundamentally does as a technology platform. The information is limited to these two specific use cases mentioned in passing. I cannot provide a more detailed answer about Qdrant's capabilities based solely on this context.


In [19]:
docs = retrieve_filtered("What does Qdrant do?", n_docs=10)
docs

['How to Augment GPT-4 with Qdrant to Elevate Its Poetry Composition Capabilities GPT-4 and Qdrant synergize, transforming poetry with enhanced coherence and depth. Visit my medium article to view the code implementation: https:&#x2F;&#x2F;medium.com&#x2F;@akriti.upadhyay&#x2F;how-to-augment-gpt-4-with-qdrant-to-elevate-its-poetry-composition-capabilities-acbb7379346f',
 'Show HN: Qpackt – web server that can serve two versions of your website Hi guys,<p>This is very initial release of something I&#x27;ve been working on for the last few months.<p>Qpackt is an open source, Rust based, web server with several interesting features:<p>1. Basic analytics without tracking cookies.<p>2. Can serve multiple versions of your website. This can be used to A&#x2F;B tests, track users&#x27; engagement coming from different sources or gently roll users to the new version.<p>3. Auto-fetch (and renew) SSL certificates.<p>4. GUI configuration (mostly ;) ) for the ease of use.<p>It&#x27;s missing a lot 