### What is vector search?

Vector search uses machine learning (ML) to capture the meaning and context of unstructured data, such as text and images, and transforms it into a numeric representation. Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor (ANN) algorithms. Compared to traditional keyword search, it can return more relevant results when a query is worded differently from the stored text, and ANN indexes keep lookups fast on large collections.

`Semantic search` is a search method that focuses on understanding the meaning and intent behind a user's search query, rather than just matching keywords.

`Vector search` is a technique where text (or images, code, etc.) is converted into numerical vectors, and search is done by comparing the similarity of vectors — instead of matching words.

```It lets you search by meaning, not just keywords.```
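To make "comparing the similarity of vectors" concrete, here is a minimal sketch using cosine similarity. The three-dimensional vectors are hand-made for illustration and are not the output of any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings" (illustrative values, not model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Vectors pointing in similar directions score higher
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

A score close to 1 means the vectors point in nearly the same direction; a score near 0 means they are unrelated.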



### What is an Embedding?

To apply `vector search`, we first need to embed the unstructured data we want to search.

An `embedding` is just a list of numbers that represents the meaning of your text.

For example:

``` "The cat is cute" → [0.12, -0.44, 0.91, ..., 0.03] ```

### What is an Embedding Model?
An embedding model is a special AI model that creates these embeddings.

It takes text like this:


``` "The cat is cute" ```

And turns it into something like this:

``` [0.12, -0.44, 0.91, ..., 0.03] ```  

You can think of the embedding model as:

A translator that turns words into vectors (math) that computers can search, compare, and rank.


After vectorizing our data, we can perform operations like similarity search. Tools like Qdrant help by storing and indexing these vectors for fast and efficient retrieval.

### What is Qdrant?

Qdrant “is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (i.e. vectors) with an additional payload.” You can think of payloads as additional pieces of information that help you narrow down your search and return useful information to your users.

Qdrant’s Role:

- Qdrant is a vector database.

    It can:

    ✅ Store vectors (and associated metadata, aka “payload”)

    ✅ Index them efficiently

    ✅ Search them by similarity (e.g. cosine similarity)

    ✅ Filter results using metadata (e.g. filter by language, category, etc.)



### Qdrant Vs Elasticsearch

`Qdrant` and `Elasticsearch` differ in how they index and search data:

`Elasticsearch` is optimized for keyword-based search, using inverted indexes to match exact terms or phrases.

`Qdrant`, on the other hand, is designed for semantic search, using vector indexes to find similar meanings rather than exact words.

#### `Even Simpler Version`:

`Elasticsearch` finds documents by matching words.
`Qdrant` finds documents by matching meaning.
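The keyword side of this comparison can be sketched with a toy inverted index. This is a simplified illustration of the idea, not how Elasticsearch is implemented; the documents and function names are made up for the example.

```python
from collections import defaultdict

docs = {
    0: "how to repair a car engine",
    1: "automobile maintenance guide",
    2: "baking bread at home",
}

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def keyword_search(index, query):
    """Return ids of documents containing every query token."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

index = build_inverted_index(docs)
print(keyword_search(index, "car"))         # {0}
print(keyword_search(index, "automobile"))  # {1}
```

The query "car" never finds the "automobile" document, because the two share no tokens. A vector index is meant to close this gap by comparing meaning instead of exact words.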

### The two steps of a vector search

(1) Indexing stored data (one-time setup of your knowledge base)

- This is where you prepare your documents for search.

`Document Chunk → Embedding Model → Vector → Qdrant (Store)`


(2) Searching with a Query (Repeated at runtime)
- This happens when a user asks a question or wants to search.

`User Query → Embedding Model → Vector → Qdrant (Search) → Matching Text`


`Embedding Model` A model that converts text into vectors based on meaning

`Vector` A list of numbers (like [0.24, -0.18, 0.91, ...]) that encodes meaning

`Qdrant` A vector database: stores, indexes, and searches over vectors
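The two steps can be sketched end to end in plain Python. Everything here is a stand-in: `toy_embed` counts words from a made-up vocabulary (it captures no real meaning), and a plain list plays the role of Qdrant.

```python
import math

VOCAB = ["docker", "install", "homework", "deadline"]  # hypothetical toy vocabulary

def toy_embed(text):
    """Stand-in for an embedding model: counts vocabulary words in the text."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

# Step 1: indexing (done once) - embed every chunk and store vector + text
chunks = ["install docker on windows", "homework deadline is friday"]
store = [(toy_embed(chunk), chunk) for chunk in chunks]

# Step 2: searching (per query) - embed the query, return the closest chunk
def toy_search(query):
    query_vector = toy_embed(query)
    _, best_text = max(store, key=lambda item: cosine(query_vector, item[0]))
    return best_text

print(toy_search("when is the homework deadline"))  # homework deadline is friday
```

Note that the same embedding function is used for both the stored chunks and the query; otherwise the vectors would not be comparable.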





### Let's start working with Qdrant

`quickstart`

- Use the following link to quickly set up your Qdrant instance.

https://qdrant.tech/documentation/quickstart/

### Set up python client 

- Install the required library:

  `pip install -q "qdrant-client[fastembed]>=1.14.2"`

- It provides a client interface so your Python code can easily talk to the Qdrant server via its API.


In [52]:
%pip install -q "qdrant-client[fastembed]>=1.14.2"


Note: you may need to restart the kernel to use updated packages.


### Step 1. Import Required Libraries & Connect to Qdrant

Now let’s import the necessary modules from the qdrant-client package.

The QdrantClient class allows us to establish a connection to the Qdrant service,
while the models module provides definitions for various configurations and parameters we’ll use.


In [53]:
from qdrant_client import QdrantClient, models
client = QdrantClient("http://localhost:6333") #connecting to local Qdrant instance

### Step 2: Study the Dataset
To build a working vector search solution (and, more generally, to understand if/when/how it’s needed), it's good to study the dataset and figure out the nature and structure of the data we’re working with, for example:

- modality: is it text, images, videos, a combination?
- specifics: if it’s text, which language is used, how big are the text pieces, are there any special characters, etc.

It will help us define:

- the right data "schema" (what to vectorize, what to store as metadata, etc.);
- the right embedding model (the best fit based on the domain, precision & resource requirements).

We have a toy dataset provided for experimentation. Let's check it out:

In [54]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

In [55]:
# documents_raw

### Step 3. Choosing an Embedding Model

An embedding model is a tool that turns text into numbers (a meaningful vector).

Examples:

- `text-embedding-ada-002` by OpenAI

- Qdrant's `FastEmbed`

- models available through `transformers` by Hugging Face


The choice of an embedding model depends on many factors:

- The task, data modality, and data specifics;
- The trade-off between search precision and resource usage (larger embeddings require more storage and memory);
- The cost of inference (especially if you're using a third-party provider).

> The best way to select an embedding model is to test and benchmark different options on your own data.
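To get a feel for the storage side of this trade-off, here is a back-of-the-envelope sketch. It assumes each vector component is stored as a 4-byte float32 and ignores index and payload overhead, so treat the numbers as a lower bound rather than a measurement.

```python
def dense_vectors_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors (assumes float32), excluding index and payload."""
    return num_vectors * dim * bytes_per_value

for dim in (384, 512, 768):
    size_mb = dense_vectors_size_bytes(1_000_000, dim) / 1e6
    print(f"{dim} dims: {size_mb:.0f} MB per million vectors")
# 384 dims: 1536 MB per million vectors
# 512 dims: 2048 MB per million vectors
# 768 dims: 3072 MB per million vectors
```

Raw vector storage grows linearly with dimensionality, which is one reason a smaller model can be the pragmatic choice for a simple setup.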



In this particular study we will use `Qdrant's FastEmbed`.


`FastEmbed` is an optimized embedding solution designed specifically for Qdrant. It delivers low-latency, CPU-friendly embedding generation, eliminating the need for heavy frameworks like PyTorch or TensorFlow. It uses quantized model weights and ONNX Runtime, making it significantly faster than traditional Sentence Transformers on CPU while maintaining competitive accuracy.

FastEmbed supports:

`Dense embeddings` for text and images (the most common type in vector search, ones we're going to use today)

`Sparse embeddings` (e.g., BM25 and sparse neural embeddings)

`Multivector embeddings` (e.g., ColPali and ColBERT, late interaction models)

`Rerankers`

All of these can be directly used in Qdrant (as Qdrant supports dense, sparse & multivectors along with hybrid search).
FastEmbed’s integration with Qdrant allows you to directly pass text or images to the Qdrant client for embedding.

In this notebook, we’ll use FastEmbed for local inference with Qdrant.

> Keep in mind your machine's resources when choosing an embedding model for local inference

##### FastEmbed for Textual Data
Let’s select an embedding model to use for our course question answers, stored in text fields, from the options supported by FastEmbed.

In [56]:
from fastembed import TextEmbedding
TextEmbedding.list_supported_models()  # list the text embedding models FastEmbed supports

[{'model': 'BAAI/bge-base-en',
  'sources': {'hf': 'Qdrant/fast-bge-base-en',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.42,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model': 'BAAI/bge-base-en-v1.5',
  'sources': {'hf': 'qdrant/bge-base-en-v1.5-onnx-q',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.21,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model':

It makes sense to choose a model that produces small-to-moderate-sized embeddings (e.g., 512 dimensions), so we don’t overuse resources in our simple setup.

In [57]:
import json

EMBEDDING_DIMENSIONALITY = 512

for model in TextEmbedding.list_supported_models():
    if model["dim"] == EMBEDDING_DIMENSIONALITY:
        print(json.dumps(model, indent=2))

{
  "model": "BAAI/bge-small-zh-v1.5",
  "sources": {
    "hf": "Qdrant/bge-small-zh-v1.5",
    "url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
    "_deprecated_tar_struct": true
  },
  "model_file": "model_optimized.onnx",
  "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
  "license": "mit",
  "size_in_GB": 0.09,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "Qdrant/clip-ViT-B-32-text",
  "sources": {
    "hf": "Qdrant/clip-ViT-B-32-text",
    "url": null,
    "_deprecated_tar_struct": false
  },
  "model_file": "model.onnx",
  "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
  "license": "mit",
  "size_in_GB": 0.25,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "jinaai/jina-embeddings-v2-small-e

> We need an embedding model suitable for English text.

It also makes sense to select a unimodal model, since we’re not including images in our search, and specifically tailored solutions are usually better than universal ones.
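These criteria (512 dimensions, English, unimodal) could also be applied programmatically. The sketch below relies on the free-text `description` field containing the words "Unimodal" and "English", which is an assumption about its format rather than a documented contract; `pick_models` is a hypothetical helper name.

```python
def pick_models(models, dim, required_terms=("Unimodal", "English")):
    """Keep model names with a matching `dim` whose description mentions every required term."""
    return [
        m["model"]
        for m in models
        if m["dim"] == dim and all(term in m["description"] for term in required_terms)
    ]

# Abbreviated records copied from the output above
sample = [
    {"model": "BAAI/bge-small-zh-v1.5", "dim": 512,
     "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation"},
    {"model": "Qdrant/clip-ViT-B-32-text", "dim": 512,
     "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation"},
]

# Both are rejected: one is Chinese, the other is multimodal
print(pick_models(sample, dim=512))  # []
```

In the notebook, you would pass `TextEmbedding.list_supported_models()` instead of `sample`.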

> It seems like `jina-embeddings-v2-small-en` is a good choice!

In [58]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

Like most dense embedding models, `jina-embeddings-v2-small-en` was trained to measure semantic closeness using **cosine similarity**.

> The parameters of the chosen embedding model, including the output embedding dimensions and the semantic similarity (distance) metric, are required to configure semantic search in Qdrant.

Now we’re ready to configure and use Qdrant for semantic search. To fully understand what’s happening, here’s a quick overview of Qdrant’s core terminology:

`Points` are the central entity Qdrant works with.

A point is a record consisting of an `ID`, a `vector`, and an optional `payload`.

A `collection` is a named set of points (i.e., vectors with optional payloads) that you can search within.

Think of it as the `container` for your vector search solution, `a single business problem solved`.

> Qdrant supports different types of vectors to enable different modes of data exploration and search (dense, sparse, multivectors, and named vectors).

In this example, we’ll use the most common type, `dense vectors`.


Embeddings capture the semantic essence of the data, while the `payload` holds structured metadata.

This metadata becomes especially useful when applying filters or sorting during search. `Qdrant's payloads` can hold structured data like `booleans`, `keywords`, `geo-locations`, `arrays`, and `nested objects`.

### Step 4: Create a Collection
When creating a [collection](https://qdrant.tech/documentation/concepts/collections/), we need to specify:

`Name`: A unique identifier for the collection.

`Vector Configuration`:

- `Size`: The dimensionality of the vectors.
- `Distance Metric`: The method used to measure similarity between vectors.
    
There are additional parameters you can explore in our documentation. Moreover, you can configure other vector types in Qdrant beyond typical dense embeddings (e.g., for hybrid search). However, for this example, the simplest default configuration is sufficient.
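To build intuition for the distance metric, the sketch below shows that cosine similarity only looks at direction: once vectors are scaled to unit length, a plain dot product gives the same number. This is pure Python for illustration and does not call Qdrant.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(round(cosine(a, b), 6))                     # 0.96
print(round(dot(normalize(a), normalize(b)), 6))  # 0.96
```

Because magnitude is ignored, `[3.0, 4.0]` and `[6.0, 8.0]` are treated as identical under cosine similarity.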

In [59]:
from qdrant_client.http.exceptions import UnexpectedResponse

# Define the collection name
collection_name = "zoomcamp-rag"

# Create the collection with the specified vector parameters;
# if it already exists (e.g. when re-running the notebook), skip creation
try:
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
            distance=models.Distance.COSINE  # Distance metric for similarity search
        )
    )
except UnexpectedResponse as e:
    if "already exists" in str(e):
        print("Collection already exists. Skipping creation.")
    else:
        raise

Collection already exists. Skipping creation.


### Step 5: Create, Embed & Insert Points into the Collection

`Points` are the core data entities in Qdrant. Each point consists of:

- `ID`. A unique identifier. Qdrant supports both 64-bit unsigned integers and UUIDs.

- `Vector`. The embedding that represents the data point in vector space.

- `Payload` (optional). Additional metadata as key-value pairs.
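If you prefer UUIDs over a running counter, one option is to derive the ID from the document content so that re-running the notebook produces the same IDs. This is a sketch under assumptions: the namespace string and the `stable_point_id` helper are made up for illustration.

```python
import uuid

# Hypothetical namespace for this tutorial's points; any fixed UUID would do
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "zoomcamp-rag.example")

def stable_point_id(course, section, text):
    """Derive the same UUID string every time from the same document fields."""
    return str(uuid.uuid5(NAMESPACE, f"{course}|{section}|{text}"))

first = stable_point_id("mlops-zoomcamp", "General", "Some answer text")
second = stable_point_id("mlops-zoomcamp", "General", "Some answer text")
print(first == second)  # True
```

With content-derived IDs, upserting the same document twice overwrites the existing point instead of depending on the order in which documents were read.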

In [60]:
points = []
point_id = 0  # sequential integer IDs (avoids shadowing the built-in `id`)

for course in documents_raw:
    for doc in course['documents']:

        point = models.PointStruct(
            id=point_id,
            vector=models.Document(text=doc['text'], model=model_handle),  # embed text locally with "jinaai/jina-embeddings-v2-small-en" from FastEmbed
            payload={
                "text": doc['text'],
                "section": doc['section'],
                "course": course['course']
            }  # save all needed metadata fields
        )
        points.append(point)

        point_id += 1

Now we’re going to embed and upload points to our collection.


First, FastEmbed will download the selected model (the path defaults to `os.path.join(tempfile.gettempdir(), "fastembed_cache")`) and perform inference directly on your machine.

Then, the generated points will be upserted into the collection, and the vector index will be built.

In [61]:
client.upsert(
    collection_name=collection_name,
    points=points
)

UpdateResult(operation_id=4, status=<UpdateStatus.COMPLETED: 'completed'>)


The speed of upsert mainly depends on the time spent on local inference.

To speed this up, you could run FastEmbed on GPUs or use a machine with more resources.

In addition to basic upsert, Qdrant supports `batch upsert` in both column- and record-oriented formats.

The Python client offers:

- Parallelization

- Retries

- Lazy batching

These can be configured via parameters in the `upload_collection` and `upload_points` functions.
For details, check the documentation.
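To illustrate what lazy batching means, here is a minimal generator that slices any iterable into fixed-size batches without building them all up front. It is an illustration of the idea, not the client's internal implementation; `batched` is a helper written for this example.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# 948 placeholder items stand in for the points built earlier
sizes = [len(batch) for batch in batched(range(948), 256)]
print(sizes)  # [256, 256, 256, 180]
```

In the notebook, each batch could be passed to `client.upsert(collection_name=collection_name, points=batch)`, so only one batch is embedded and sent at a time.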

### Study Data Visually
Let’s explore the uploaded data in the Qdrant Web UI at http://localhost:6333/dashboard to study semantic similarity visually.

For example, using the `Visualize` tab in the `zoomcamp-rag` collection, we can view all answers to the course questions (948 points) and see how they group together by meaning, additionally coloured by the course type.

To do that, run the following command:
``` json
{
  "limit": 948,
  "color_by": {
    "payload": "course"
  }
}
```
This 2D representation is the result of dimensionality reduction applied to `jina-embeddings`.

### Step 6: Running a Similarity Search
Now, let’s find the stored text vector most similar to a given query embedding, that is, the most relevant answer to a given question.

##### How Similarity Search Works

1. Qdrant compares the query vector to stored vectors (based on a vector index) using the distance metric defined when creating the collection.

2. The closest matches are returned, ranked by similarity.

> The vector index is built for `approximate nearest neighbor (ANN)` search, making large-scale vector search feasible.
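For contrast with ANN, here is what exact (brute-force) nearest neighbor search looks like: score every stored vector and keep the best `k`. This is a toy sketch with hand-made two-dimensional vectors; its cost grows linearly with the number of stored vectors, which is the cost an ANN index is designed to avoid.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best (id, score) pairs."""
    scored = ((vec_id, cosine(query, vec)) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Hand-made toy vectors keyed by point id
vectors = {1: [1.0, 0.0], 2: [0.7, 0.7], 3: [0.0, 1.0]}
top = exact_top_k([1.0, 0.1], vectors, k=2)
print([vec_id for vec_id, _ in top])  # [1, 2]
```

An ANN index trades a small amount of accuracy for visiting only a fraction of the stored vectors per query.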

If you'd like to dive into our choice of vector index for vector search, check our article ["What is a vector database"](https://qdrant.tech/articles/what-is-a-vector-database/), or, for a more technical deep dive, our article on [Filterable Hierarchical Navigable Small World](https://qdrant.tech/articles/filtrable-hnsw/).

Let's define a search function:

In [62]:
def search(query, limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle 
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

Now let’s pick a random question from the course data.

As you remember, we didn’t upload the questions to Qdrant.

In [63]:
import random

course = random.choice(documents_raw)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent=2))

{
  "text": "You may have this error:\nRetrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.u\nrllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')':\n/simple/pandas/\nPossible solution might be:\n$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9",
  "section": "Module 1: Docker and Terraform",
  "question": "Docker - Cannot pip install on Docker container (Windows)"
}


Let's see which answer we get:

In [64]:
result = search(course_piece['question'])

In [65]:
result



**score** – the `cosine similarity` between the `question` and `text` embeddings.

Let’s compare the original and retrieved answers for our randomly selected question.

In [66]:
print(f"Question:\n{course_piece['question']}\n")
print("Top Retrieved Answer:\n{}\n".format(result.points[0].payload['text']))
print("Original Answer:\n{}".format(course_piece['text']))

Question:
Docker - Cannot pip install on Docker container (Windows)

Top Retrieved Answer:
When trying to rerun the docker file in Windows, as opposed to developing in WSL/Linux, I got the error of:
```
Neither ‘pipenv’ nor ‘asdf’ could be found to install Python.
You can specify specific versions of Python with:
$ pipenv –python path\to\python
```
The solution was to add Python311 installation folder to the PATH and restart the system and run the docker file again. That solved the error.
(Added by Abhijit Chakraborty)

Original Answer:
You may have this error:
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.u
rllib3.connection.HTTPSConnection object at 0x7efe331cf790>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')':
/simple/pandas/
Possible solution might be:
$ winpty docker run -it --dns=8.8.8.8 --entrypoint=bash python:3.9


Now let’s search for the answer to a question that wasn’t in the initial dataset.

In [67]:
print(search("What if I submit homeworks late?").points[0].payload['text'])

No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y
Older news:[source1] [source2]


### Step 7: Running a Similarity Search with Filters

We can refine our search using metadata filters.

> Qdrant’s custom vector index implementation, Filterable HNSW, allows for precise and scalable vector search with filtering conditions.

For example, we can search for an answer to a question related to a specific course from the three available in the dataset.
Using a `must` filter ensures that all specified conditions are met for a data point to be included in the search results.

> Qdrant also supports other filter clauses, such as `should` and `must_not`, and conditions such as `range`. For a full overview, check our [Filtering Guide](https://qdrant.tech/articles/vector-search-filtering/).
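As a mental model for these clauses, the toy function below evaluates `must`, `should`, and `must_not` against a single payload. It is a simplified reading of the semantics (each condition is a plain equality check on one key) and not Qdrant's actual filter engine; `matches` is a hypothetical helper.

```python
def matches(payload, must=(), should=(), must_not=()):
    """Evaluate toy filter clauses; each condition is a (key, value) equality check."""
    def holds(condition):
        key, value = condition
        return payload.get(key) == value

    return (
        all(holds(c) for c in must)                         # every `must` condition holds
        and (not should or any(holds(c) for c in should))   # at least one `should` holds
        and not any(holds(c) for c in must_not)             # no `must_not` condition holds
    )

payload = {"course": "mlops-zoomcamp", "section": "General"}
print(matches(payload, must=[("course", "mlops-zoomcamp")]))  # True
print(matches(payload, must_not=[("section", "General")]))    # False
```

Roughly, `must` behaves like AND, `should` like OR, and `must_not` like NOT applied to each of its conditions.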

To enable efficient filtering, we need to turn on [indexing of payload fields](https://qdrant.tech/documentation/concepts/indexing/#payload-index).

In [68]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword" # exact matching on string metadata fields
)

UpdateResult(operation_id=6, status=<UpdateStatus.COMPLETED: 'completed'>)

Now let's update our search function:

In [69]:
def search_in_course(query, course="mlops-zoomcamp", limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter( # filter by course name
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

Let’s see how the same question is answered across different courses:

`data-engineering-zoomcamp`, `machine-learning-zoomcamp`, and `mlops-zoomcamp`.

In [70]:
print(search_in_course("What if I submit homeworks late?", "mlops-zoomcamp").points[0].payload['text'])

Please choose the closest one to your answer. Also do not post your answer in the course slack channel.


# done!