# 03. Vector search

At scale, nobody runs an exact (brute-force) kNN search for vector similarity. Comparing the query against every stored vector just doesn't scale well enough once you have many thousands or millions of vectors. That is why there is a lot going on in the area of **Approximate Nearest Neighbours** (ANN). There are plenty of *vector databases* that implement similarity search as a service, and [Qdrant](https://qdrant.tech) is one of them.
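To make the cost of exact search concrete, here is a minimal pure-Python sketch of brute-force kNN with cosine similarity. The names `cosine_similarity`, `exact_knn` and `toy_vectors` are hypothetical and used only for illustration; this is not how Qdrant works internally. The point is that every query has to touch every stored vector.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score the query against every vector and keep the k best matches."""
    scored = [
        (idx, cosine_similarity(query, vector))
        for idx, vector in enumerate(vectors)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, made up for this example
toy_vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
]
nearest = exact_knn([1.0, 0.1], toy_vectors, k=2)
nearest_ids = [idx for idx, _ in nearest]
```

The work grows linearly with the number of stored vectors for each query, which is the cost ANN indexes are designed to avoid, at the price of occasionally missing a true neighbour.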

In [None]:
!cd .. && docker-compose up -d

In [None]:
import qdrant_client

In [None]:
client = qdrant_client.QdrantClient(
    host="localhost", port=6333, timeout=30
)

In [None]:
client.get_collections()

We can now load the data into a Qdrant collection, so that it can be queried efficiently later.

In [None]:
from datasets import load_dataset

import pandas as pd

In [None]:
tweet_qa_dataset = load_dataset("tweet_qa")
train_df = pd.DataFrame(tweet_qa_dataset["train"])

We're going to use a pretrained model to create the vectors. We'll vectorize the answers (the full tweets) and put them into the database.

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
embedder = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
from qdrant_client.http import models as rest

In [None]:
client.recreate_collection(
    collection_name="tweets-qa",
    vectors_config=rest.VectorParams(
        size=embedder.get_sentence_embedding_dimension(),
        distance=rest.Distance.COSINE,
    ),
)

In [None]:
answer_embeddings = embedder.encode(train_df["Tweet"].tolist())
client.upload_collection(
    collection_name="tweets-qa",
    vectors=answer_embeddings.tolist(),
    payload=[{"qid": qid} for qid in train_df["qid"]],
)

In [None]:
client.get_collection("tweets-qa")

The next step is to use the question embeddings to find the most relevant tweet for each question. Since we know which tweet is the correct one, we can measure the quality of the embeddings with **top-k accuracy**: the fraction of questions whose correct tweet appears among the k nearest results.
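To make the metric concrete, here is a minimal sketch that computes top-k accuracy from ranked result lists that are already available, so it runs without a Qdrant instance. The function name `top_k_accuracy_from_rankings` and the toy ids are hypothetical and made up for illustration.

```python
from typing import Sequence


def top_k_accuracy_from_rankings(
    targets: Sequence[int], rankings: Sequence[Sequence[int]], k: int
) -> float:
    """Fraction of queries whose target id is within the first k results."""
    hits = sum(
        1 for target, ranked in zip(targets, rankings) if target in ranked[:k]
    )
    return hits / len(targets)


# One expected id per query, and the ranked ids returned for that query
toy_targets = [10, 20, 30, 40]
toy_rankings = [
    [10, 99, 98],  # correct id at rank 1
    [97, 20, 96],  # correct id at rank 2
    [95, 94, 30],  # correct id at rank 3
    [93, 92, 91],  # correct id not returned
]

accuracy_at_1 = top_k_accuracy_from_rankings(toy_targets, toy_rankings, k=1)
accuracy_at_2 = top_k_accuracy_from_rankings(toy_targets, toy_rankings, k=2)
accuracy_at_3 = top_k_accuracy_from_rankings(toy_targets, toy_rankings, k=3)
```

The value can only stay the same or grow as k increases, because each larger result list contains the smaller one.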

In [None]:
question_embeddings = embedder.encode(train_df["Question"].tolist())

In [None]:
def top_k_accuracy(k: int) -> float:
    """Return the fraction of questions whose own tweet is among the top k hits."""
    found_in_top = 0
    for target_qid, question_embedding in zip(train_df["qid"],
                                              question_embeddings):
        # Ask Qdrant for the k tweets closest to the question vector
        response = client.search(
            collection_name="tweets-qa",
            query_vector=question_embedding,
            limit=k,
            with_payload=True,
        )
        top_qids = [point.payload.get("qid") for point in response]
        # Count a hit when the tweet stored under the same qid was returned
        if target_qid in top_qids:
            found_in_top += 1
    return found_in_top / train_df.shape[0]

In [None]:
top_k_accuracy(10)

In [None]:
top_k_accuracy(100)