# Qdrant 101

![qdrant](https://qdrant.tech/images/logo_with_text.png)

Vector databases are a relatively new way of interacting with abstract data representations derived from opaque machine learning models, with deep learning architectures being the most common ones. These representations are often called vectors or embeddings, and they are a compressed version of the data used to train a machine learning model to accomplish a task (e.g., sentiment analysis, speech recognition, object detection, and many more).

## Table of Contents

1. Learning Outcomes
2. What is Qdrant?
    - What are Vector Databases?
    - Why do We Need Vector Databases?
    - Overview of Qdrant's Architecture
    - How do We Get Started?
3. Getting Started
    - Adding Points
    - Payload
    - Search
4. Use Cases
    - Natural Language Processing
    - Computer Vision
    - Audio
    - Tabular
5. Conclusion

## 1. Learning Outcomes

## 2. What is Qdrant?

Qdrant "is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (i.e. vectors) with an additional payload." You can get started with plain Python using the `qdrant-client`, pull the latest Docker image of `qdrant` and connect to it locally, or try out Qdrant's Cloud free tier option until you are ready to make the full switch.

### 2.1 What Are Vector Databases?

A vector database is a type of database designed to store and query high-dimensional vectors efficiently. In traditional [OLTP](https://www.ibm.com/topics/oltp) and [OLAP](https://www.ibm.com/topics/olap) databases, data is organized in rows and columns, and queries are performed based on the values in those columns. However, in certain applications, such as machine learning, image recognition, natural language processing, and recommendation systems, data is often represented as vectors in a high-dimensional space.

A vector in this context is a mathematical representation of an object or data point, where each element of the vector corresponds to a specific feature or attribute of the object. For example, in an image recognition system, a vector could represent an image, with each element of the vector representing a pixel value or a descriptor of that pixel.

Vector databases are optimized for storing and querying these high-dimensional vectors efficiently, often using specialized data structures and indexing techniques. They enable fast similarity searches, allowing users to find vectors that are similar to a given query vector based on some distance metric, such as Euclidean distance or cosine similarity.
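
To make the two metrics mentioned above concrete, here is a minimal sketch of a brute-force similarity search in plain Python. The toy vectors and the `brute_force_search` helper are illustrative assumptions; a real vector database replaces this linear scan with specialized indexes.

```python
import math

def euclidean_distance(a, b):
    # straight-line distance between two vectors (smaller means more similar)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # cosine of the angle between two vectors (larger means more similar)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, top_k=2):
    # hypothetical helper: score every stored vector and keep the best top_k
    scored = [(idx, cosine_similarity(query, vec)) for idx, vec in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# made-up 3-dimensional vectors standing in for real embeddings
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [1.0, 0.05, 0.0]

nearest = brute_force_search(query, vectors)
print(nearest)  # (index, similarity) pairs, most similar first
```

The linear scan touches every stored vector, which is exactly the cost that indexing techniques are meant to avoid as collections grow.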

### 2.2 Why do we need Vector Databases?

Vector databases play a crucial role in various applications that require similarity search, such as recommendation systems, content-based image retrieval, and personalized search. By leveraging efficient indexing and search techniques, vector databases enable faster and more accurate retrieval of similar vectors, enabling advanced data analysis and decision-making.

In addition, other benefits of using vector databases include:
1. Efficient storage and indexing of high-dimensional data.
2. Ability to handle large-scale datasets with billions or trillions of data points.
3. Support for real-time analytics and queries.
4. Ability to handle complex data types, such as images, videos, and natural language text.
5. Improved performance and reduced latency in machine learning and AI applications.
6. Reduced development and deployment time and cost compared to building a custom solution.

Keep in mind that the specific benefits of using a vector database may vary depending on the use case of your organization and the features of the database.

### 2.3 Overview of Qdrant's Architecture (High-Level)

![qdrant](../images/qdrant_archi.png)

TODO: Add storage and third-party integrations to the image above

- Collections
- Distance Metrics
- Points
    - id
    - Vector
    - Payload
- Storage
- Clients
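
One way to picture how these pieces relate is with a couple of plain Python data classes. This is a conceptual sketch only: the `Point` and `Collection` classes below are hypothetical stand-ins, not Qdrant's implementation or its client API, and the sample payload is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union

@dataclass
class Point:
    # a point bundles an id, a vector, and an optional payload
    id: Union[int, str]
    vector: List[float]
    payload: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Collection:
    # a collection fixes the vector size and distance metric for all its points
    name: str
    size: int
    distance: str
    points: Dict[Union[int, str], Point] = field(default_factory=dict)

    def add(self, point: Point) -> None:
        # every vector must match the dimensionality declared for the collection
        if len(point.vector) != self.size:
            raise ValueError(
                f"expected a vector of size {self.size}, got {len(point.vector)}"
            )
        self.points[point.id] = point

songs = Collection(name="songs", size=3, distance="Cosine")
songs.add(Point(id=1, vector=[0.1, 0.2, 0.3], payload={"artist": "Jane Doe"}))
print(songs.points[1])
```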

### 2.4 How do we get started?

The open source version of Qdrant is available as a Docker image, and it can be pulled and run from any machine running Docker. If you don't have Docker installed on your machine, you can follow the instructions in the official documentation [here](https://docs.docker.com/get-docker/). After that, open your terminal and start by downloading the image with the following command.

```sh
docker pull qdrant/qdrant
```

Next, initialize Qdrant with the following command, and you should be good to go.

```sh
docker run -p 6333:6333 \
    -v $(pwd)/qdrant_storage:/qdrant/storage \
    qdrant/qdrant
```

If you experience any issues during the start process, please let us know in our [discord channel here](https://qdrant.to/discord). We are always available to help.

Now that you have Qdrant up and running, your next step is to pick a client to connect to it. We'll be using Python as it has the most mature data tools ecosystem out there. Therefore, let's start setting up our dev environment with the tools we'll be using.

```sh
# with mamba or conda
mamba create -n my_env python=3.10
mamba activate my_env

# or with venv
python -m venv venv
source venv/bin/activate

# install packages
pip install qdrant-client transformers datasets pandas numpy torch faker
```

After you have your environment ready, let's get started with Qdrant.

**Note:** At the time of writing, Qdrant supports Rust, Go, Python and TypeScript. We might see other programming languages getting added in the future.

## 3. Getting Started

The two pieces of `qdrant-client` we'll use the most are the `QdrantClient` class and the `models` module. The former lets us connect to a running Qdrant instance, or run an in-memory database by passing `location=":memory:"` instead of a host (this is a great feature for testing in a CI/CD pipeline). We'll start by instantiating our client using `host="localhost"` and `port=6333`, and then we'll inspect it.

In [None]:
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import CollectionStatus

In [None]:
client = QdrantClient(host="localhost", port=6333)
client

In OLTP and OLAP databases we call specific bundles of data **tables**, but in vector databases we refer to these bundles of vectors as **collections**. In the same way that we can create many tables in a database, we can create many collections in a vector database using a client. The key difference to note is that when we create a collection, we need to specify the dimensionality of its vectors beforehand with the parameter `size=...`, as well as the similarity metric with the parameter `distance=...` (changing either one later typically means recreating the collection).

The distances currently supported by Qdrant are:
- Cosine Similarity
- Dot Product
- Euclidean Distance
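
These three metrics are closely related. As a small sketch in plain Python (the two toy vectors are made up for illustration), cosine similarity is the dot product of the vectors after scaling each to unit length, and for unit vectors the squared Euclidean distance equals `2 - 2 * cosine`.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def normalize(a):
    # scale a vector to unit length
    length = norm(a)
    return [x / length for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]

cosine = dot(a, b) / (norm(a) * norm(b))
dot_of_unit_vectors = dot(normalize(a), normalize(b))
squared_euclidean_of_unit_vectors = math.dist(normalize(a), normalize(b)) ** 2

print(cosine, dot_of_unit_vectors, squared_euclidean_of_unit_vectors)
```

In practice this means that, on normalized vectors, all three metrics rank neighbours in the same order.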

Let's create our first collection and have the vectors be of width 100 and the distance set to [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity).

In [None]:
first_collection = client.recreate_collection(
    collection_name="my_first_collection",
    vectors_config=models.VectorParams(size=100, distance=models.Distance.COSINE)
)
print(first_collection)

In [None]:
collection_info = client.get_collection(collection_name="my_first_collection")
collection_info

Note the information available in the response, such as the status of the collection and the number of vectors it holds.

In [None]:
assert collection_info.status == CollectionStatus.GREEN
assert collection_info.vectors_count == 0

In [None]:
# we can check that our collection was indeed created with
client.get_collections()

There are a couple of things to notice from what we have done.
- The first is that when we started our Docker container, we created a local directory, `qdrant_storage`, where all of our collections, plus their metadata, will be saved. You can have a look at that directory in a *nix system with `tree qdrant_storage -L 2`. You should see the following.
    ```bash
    qdrant_storage
    ├── aliases
    │   └── data.json
    ├── collections
    │   └── my_first_collection
    └── raft_state
    ```
- The second is that we used `client.recreate_collection`, which deletes any existing collection with the same name before creating it again with the given parameters. The alternative, `client.create_collection`, can only be used once per collection name and fails if the collection already exists.

Now that we know how to create collections, let's create a bit of fake data and add some vectors to our collection.

### 3.1 Adding Points

Points are the central entity that Qdrant operates with. Each point is a record consisting of a vector, an id and an optional payload (which we'll talk more about in the next section).

The id can be an unsigned integer or a UUID. We are going to use a range of numbers for this.
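
As a quick sketch of the two id styles, using only the standard library: the `song_keys` list is a hypothetical example, and deriving ids with `uuid5` is one possible design choice rather than something Qdrant requires.

```python
import uuid

# Option 1: sequential unsigned integers
int_ids = list(range(5))

# Option 2: random UUIDs, represented as strings
random_ids = [str(uuid.uuid4()) for _ in range(5)]

# Option 3: deterministic UUIDs derived from a stable key,
# so the same record always maps to the same id
song_keys = ["https://example.com/songs/1", "https://example.com/songs/2"]  # hypothetical
stable_ids = [str(uuid.uuid5(uuid.NAMESPACE_URL, key)) for key in song_keys]

print(int_ids)
print(random_ids[0])
print(stable_ids[0])
```

Deterministic ids can be handy when the same data is loaded more than once, because re-running the load targets the same points instead of creating new ones.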

In [None]:
data = np.random.uniform(low=-1.0, high=1.0, size=(1_000, 100))
data

In [None]:
index = list(range(len(data)))
index[-10:]

In [None]:
client.upsert(
    collection_name="my_first_collection",
    points=models.Batch(
        ids=index,
        vectors=data.tolist()
    )
)

We can retrieve specific points based on their ID (for example, the point with ID 100) and get some additional information from that result.

In [None]:
client.retrieve(
    collection_name="my_first_collection",
    ids=[100],
    with_vectors=True # we can turn this on and off depending on our needs
)

We can also update our collection one point at a time, for example, as new data comes in.

In [None]:
def create_song():
    return np.random.uniform(low=-1.0, high=1.0, size=100).tolist()

In [None]:
client.upsert(
    collection_name="my_first_collection",
    points=[
        models.PointStruct(
            id=1000,
            vector=create_song(),
        )
    ]
)
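
The name `upsert` combines update and insert: a point whose id is new gets added, while a point whose id already exists gets overwritten. The sketch below mimics that behaviour with a plain dictionary. The `toy_upsert` function and the `store` dictionary are hypothetical helpers for illustration, not part of `qdrant-client`.

```python
store = {}

def toy_upsert(store, point_id, vector, payload=None):
    # insert the point if the id is new, otherwise replace the existing one
    store[point_id] = {"vector": vector, "payload": payload or {}}

toy_upsert(store, 1, [0.1, 0.2])
toy_upsert(store, 2, [0.3, 0.4])
toy_upsert(store, 1, [0.9, 0.9])  # same id as the first call: overwrites it

print(len(store))
print(store[1]["vector"])
```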

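We can also delete a point, such as the one we just added with ID 1000, in a straightforward fashion. Counting the points before and after confirms the deletion.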

In [None]:
client.count(
    collection_name="my_first_collection", 
    exact=True,
)

In [None]:
client.delete(
    collection_name="my_first_collection",
    points_selector=models.PointIdsList(
        points=[1000],
    ),
)

In [None]:
client.count(
    collection_name="my_first_collection", 
    exact=True,
)

### 3.2 Payloads

Qdrant has incredible features on top of speed and reliability, and one of its most useful ones is without a doubt the ability to store additional information along with vectors. In Qdrant terminology, this information is called a payload, and it is represented as a JSON object. Not only can you get this information back when you search in the database, but you can also filter your search by the fields in the payload, and we'll see how in a second.

Imagine the fake vectors we created actually represented a song. If we were building a recommender system for songs then, naturally, the things we would want to get back would be the song itself, the artist, maybe the genre, and so on.

What we'll do here is take advantage of a Python package called `faker` and create a bit of information to add to our payload.

In [None]:
from faker import Faker

In [None]:
fake_something = Faker()
fake_something.name()

In [None]:
payload = []

for i in range(len(data)):
    payload.append(
        {
            "artist": fake_something.name(),
            "song": " ".join(fake_something.words()),
            "url_song": fake_something.url(),
            "year": fake_something.year(),
            "country": fake_something.country()
        }
    )

payload[:3]

In [None]:
client.upsert(
    collection_name="my_first_collection",
    points=models.Batch(
        ids=index,
        vectors=data.tolist(),
        payloads=payload
    )
)

In [None]:
results = client.retrieve(
    collection_name="my_first_collection",
    ids=[10, 50, 100, 500],
    with_vectors=False
)
results

In [None]:
results[0].payload

In [None]:
# client.clear_payload(
#     collection_name="my_first_collection",
#     points_selector=models.PointIdsList(
#         points=index,
#     )
# )

### 3.3 Search

Now that we have our vectors with an ID and a payload, we can explore a few ways in which we can search for content when, in our use case, new music gets selected. Let's check it out.

Say, for example, that a new song comes in and our model immediately transforms it into a vector.

In [None]:
living_la_vida_loca = create_song()

In [None]:
client.search(
    collection_name="my_first_collection",
    query_vector=living_la_vida_loca,
    limit=10
)

Now imagine that we only want Australian songs recommended to us.

In [None]:
aussie_songs = models.Filter(
    must=[models.FieldCondition(key="country", match=models.MatchValue(value="Australia"))]
)

In [None]:
client.search(
    collection_name="my_first_collection",
    query_vector=living_la_vida_loca,
    query_filter=aussie_songs,
    limit=5
)
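
Conceptually, a filtered search keeps only the points whose payload satisfies the `must` conditions and ranks the survivors by similarity. Below is a minimal plain-Python sketch of that idea. The `toy_filtered_search` helper and the sample points are hypothetical, and the sketch makes no claim about how Qdrant evaluates filters internally.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_filtered_search(points, query, must, limit=5):
    # keep only points whose payload matches every key/value pair in `must`
    candidates = [
        p for p in points
        if all(p["payload"].get(key) == value for key, value in must.items())
    ]
    # rank the remaining points by similarity to the query, best first
    ranked = sorted(
        candidates,
        key=lambda p: cosine_similarity(query, p["vector"]),
        reverse=True,
    )
    return [p["id"] for p in ranked[:limit]]

# made-up points for illustration
points = [
    {"id": 0, "vector": [1.0, 0.0], "payload": {"country": "Australia"}},
    {"id": 1, "vector": [0.0, 1.0], "payload": {"country": "Australia"}},
    {"id": 2, "vector": [1.0, 0.1], "payload": {"country": "Canada"}},
]

hits = toy_filtered_search(points, query=[1.0, 0.2], must={"country": "Australia"})
print(hits)
```

Note that the point with id 2 is the closest to the query overall, yet it never appears in the result because its payload fails the filter.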

Lastly, say we want Aussie songs but we don't care how new or old these songs are, so we exclude the `year` field from the payload that comes back.

In [None]:
client.search(
    collection_name="my_first_collection",
    query_vector=living_la_vida_loca,
    query_filter=aussie_songs,
    with_payload=models.PayloadSelectorExclude(exclude=["year"]),
    limit=5
)

As you can see, the possibilities are endless.

## 4. Use Cases

The most common use case you will find as of today will most likely involve language-based Generative AI models, and understandably so. Models like GPT-3, Codex, 

### 4.1 Natural Language Processing

In NLP, vector databases are used to store word embeddings. Word embeddings are vector representations of words that capture their semantic meaning. They are used to improve the performance of NLP tasks such as text classification, machine translation, and question answering.



In [None]:
from transformers import GPT2Tokenizer, GPT2Model
from datasets import load_dataset
import numpy as np
import torch

In [None]:
dataset = load_dataset("ag_news", split="train")

In [None]:
dataset

In [None]:
dataset[1000]

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2').to(device)

In [None]:
tokenizer.eos_token

In [None]:
tokenizer.pad_token = tokenizer.eos_token

In [None]:
text = "That movie was amazing"
em = tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)
em

In [None]:
with torch.no_grad():
    embs = model(**em)
embs

In [None]:
embs.last_hidden_state.size()

In [None]:
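def mean_pooling(model_output, attention_mask):
    """Average the token embeddings of each sequence, ignoring padded positions."""
    token_embeddings = model_output[0]  # last hidden state: (batch, tokens, hidden)
    # broadcast the attention mask over the hidden dimension
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # clamp to avoid dividing by zero when a sequence is fully masked
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask

def embed_text(examples):
    """Tokenize a batch of texts and return one pooled embedding per text."""
    inputs = tokenizer(
        examples["text"], padding=True, truncation=True, return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        model_output = model(**inputs)
    pooled_embeds = mean_pooling(model_output, inputs["attention_mask"])
    return {"embedding": pooled_embeds.cpu().numpy()}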

In [None]:
# shuffle with a fixed seed, keep 3000 examples, and embed them in batches of 128
dataset = dataset.shuffle(seed=42).select(range(3000)).map(embed_text, batched=True, batch_size=128)
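
The `mean_pooling` helper above averages token embeddings while ignoring padded positions. The toy example below traces that arithmetic in plain Python on made-up numbers (one sequence with three token slots, the last one being padding), so the values are illustrative assumptions and not real model output. The `masked_mean` function is a hypothetical stand-in for the tensor version.

```python
# one sequence, three token slots, hidden size 2
token_embeddings = [
    [1.0, 2.0],
    [3.0, 4.0],
    [100.0, 100.0],  # padding slot: must not influence the result
]
attention_mask = [1, 1, 0]  # 1 = real token, 0 = padding

def masked_mean(token_embeddings, attention_mask):
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vec[i] * keep  # padded slots contribute zero
    # same guard as torch.clamp(..., min=1e-9) in the tensor version
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

pooled = masked_mean(token_embeddings, attention_mask)
print(pooled)  # [2.0, 3.0]
```

A plain average over all three slots would have been dominated by the padding values, which is why the mask matters.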

### 4.2 Computer Vision

In CV, vector databases are used to store image features. Image features are vector representations of images that capture their visual content. They are used to improve the performance of CV tasks such as object detection, image classification, and image retrieval.

In [None]:
from transformers import AutoImageProcessor, ResNetForImageClassification

In [None]:
dataset = load_dataset("marmal88/skin_cancer", split='train')
dataset

In [None]:
image = dataset[100]["image"]
image

In [None]:
image_processor = AutoImageProcessor.from_pretrained("microsoft/resnet-50")
model = ResNetForImageClassification.from_pretrained("microsoft/resnet-50")

In [None]:
inputs = image_processor(image, return_tensors="pt")
inputs.keys(), inputs.pixel_values.shape

In [None]:
ResNetForImageClassification.from_pretrained??

In [None]:
with torch.no_grad():
    features = model(**inputs)
features

In [None]:
# the model predicts one of the 1000 ImageNet classes
predicted_label = features.logits.argmax(-1).item()
print(model.config.id2label[predicted_label])

### 4.3 Audio

### 4.4 Tabular

## 5. Conclusion