# Install Required Libraries

- The qdrant-client package. We'll be using the Python client, but Qdrant also offers official clients for JavaScript/TypeScript, Go, and Rust, so you can choose the best fit for your own projects.
- The fastembed package, an optimized embedding (data vectorization) solution designed specifically for Qdrant. Make sure you install qdrant-client version >= 1.14.2 (with the fastembed extra, as in the command below) to use local inference with Qdrant.


In [None]:
# !python -m pip install -q "qdrant-client[fastembed]>=1.14.2"
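
After installing, it can help to confirm that the installed version satisfies the minimum. The sketch below uses only the standard library; the helper names `installed_version` and `meets_minimum` are hypothetical, and the comparison assumes plain dotted numeric versions such as 1.14.2 (anything after a non-numeric part is ignored).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def _numeric_parts(version):
    """Turn '1.14.2' into (1, 14, 2); stop at the first non-numeric part."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Compare two dotted numeric versions."""
    return _numeric_parts(version) >= _numeric_parts(minimum)


version = installed_version("qdrant-client")
if version is None:
    print("qdrant-client is not installed")
else:
    print(version, "OK" if meets_minimum(version, "1.14.2") else "too old")
```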

# Step 1: Import Required Libraries & Connect to Qdrant

The QdrantClient class allows us to establish a connection to the Qdrant service,<br>
while the models module provides definitions for various configurations and parameters we’ll use.


In [1]:
from qdrant_client import QdrantClient, models

In [40]:
# Initialize the client
client = QdrantClient("http://localhost:6333")  # connects to a local Qdrant instance

# Step 2: Study the Dataset

To build a working vector search solution (and, more generally, to understand if/when/how it’s needed), it's good to study the dataset and figure out the nature and structure of the data we’re working with, for example:

- modality — is it text, images, videos, a combination?
- specifics — if it’s text: language used, how big are the text pieces, are there any special characters, etc.

It will help us define:

- the right data "schema" (what to vectorize, what to store as metadata, etc);
- the right embedding model (the best fit based on the domain, precision & resource requirements).


In [5]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()

In [33]:
print(str(documents_raw)[:1_000])

[{'course': 'data-engineering-zoomcamp', 'documents': [{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", 'section': 'General course-related questions', 'question': 'Course - When will the course start?'}, {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites', 'section': 'General course-related questions', 'question': 'Course - What are the prerequisites for this course?'}, {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So 

The data already seems cleaned and chunked (i.e., divided into small pieces that embedding models can easily digest), so what's left is to define:

- which fields could be used for semantic search;
- which fields should be stored as metadata, e.g. usable in filtering conditions.

We have a dataset with three course types:
data-engineering-zoomcamp, machine-learning-zoomcamp, and mlops-zoomcamp.
Each course includes a collection of question and text (answer) pairs, along with the section the question refers to.
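
A quick way to confirm this structure is to count the documents per course and look at answer lengths. The sketch below runs on a tiny hand-made sample that only mimics the shape of documents_raw (the texts are placeholders, not real dataset entries); in the notebook you would pass documents_raw itself. The helper name `profile` is hypothetical.

```python
from statistics import mean

# Placeholder data with the same nesting as documents_raw.
sample_raw = [
    {"course": "data-engineering-zoomcamp", "documents": [
        {"text": "placeholder answer one", "section": "General", "question": "Q1?"},
        {"text": "placeholder answer number two", "section": "General", "question": "Q2?"},
    ]},
    {"course": "mlops-zoomcamp", "documents": [
        {"text": "short", "section": "Module 1", "question": "Q3?"},
    ]},
]


def profile(documents_raw):
    """Return {course: (number of documents, mean answer length in characters)}."""
    stats = {}
    for course in documents_raw:
        lengths = [len(doc["text"]) for doc in course["documents"]]
        stats[course["course"]] = (len(lengths), mean(lengths))
    return stats


for name, (count, avg_len) in profile(sample_raw).items():
    print(f"{name}: {count} documents, {avg_len:.1f} characters on average")
```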


---

_Which Fields Could Be Used for Semantic Search_<br>
If we’re building a Q&A retrieval-augmented generation (RAG) system,
it makes sense to store the text field (answers) as embeddings, and use vector search to find the most relevant answer to a given question query.

_Which Fields Should Be Stored as Metadata_<br>
For example, we could store the course and section fields as metadata.
This way, we can filter search results when asking questions related to a specific course or a specific section.
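
As a sketch of this split, the hypothetical helper below separates one record into the text to embed and the payload to store. The sample record is a placeholder that only mimics the dataset's shape.

```python
def split_fields(course_name, doc):
    """Return (text_to_embed, payload) for one question/answer record."""
    text_to_embed = doc["text"]      # the answer is what gets vectorized
    payload = {
        "text": doc["text"],         # kept so search results are human-readable
        "section": doc["section"],   # metadata usable in filters
        "course": course_name,       # metadata usable in filters
    }
    return text_to_embed, payload


doc = {"text": "placeholder answer", "section": "General", "question": "Q?"}
text, payload = split_fields("mlops-zoomcamp", doc)
print(text)
print(payload)
```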

# Step 3: Choosing the Embedding Model with FastEmbed

The choice of an embedding model depends on many factors:

- The task, data modality, and data specifics;
- The trade-off between search precision and resource usage (larger embeddings require more storage and memory);
- The cost of inference (especially if you're using a third-party provider);
- etc.

The best way to select an embedding model is to test and benchmark different options on your own data.

In this notebook, we’re going to use [FastEmbed](https://github.com/qdrant/fastembed) as our embedding provider.
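
To put the storage trade-off listed above into numbers, here is a back-of-the-envelope estimate. It assumes each vector component is stored as a 4-byte float and ignores index and payload overhead, so treat the result as a rough lower bound rather than an exact figure. The point count of 1,000 and the helper name `raw_vector_bytes` are arbitrary choices for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: components stored as 32-bit floats


def raw_vector_bytes(num_points, dim):
    """Raw size in bytes of num_points dense vectors with dim components each."""
    return num_points * dim * BYTES_PER_FLOAT32


for dim in (384, 512, 768):
    size_mib = raw_vector_bytes(1_000, dim) / 1024**2
    print(f"{dim} dims: {size_mib:.2f} MiB per 1,000 points")
```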

## FastEmbed for Textual Data

Let’s select an embedding model to use for our course question answers, stored in text fields, from the options supported by FastEmbed.


In [34]:
from fastembed import TextEmbedding
TextEmbedding.list_supported_models()

[{'model': 'BAAI/bge-base-en',
  'sources': {'hf': 'Qdrant/fast-bge-base-en',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.42,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model': 'BAAI/bge-base-en-v1.5',
  'sources': {'hf': 'qdrant/bge-base-en-v1.5-onnx-q',
   'url': 'https://storage.googleapis.com/qdrant-fastembed/fast-bge-base-en-v1.5.tar.gz',
   '_deprecated_tar_struct': True},
  'model_file': 'model_optimized.onnx',
  'description': 'Text embeddings, Unimodal (text), English, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.',
  'license': 'mit',
  'size_in_GB': 0.21,
  'additional_files': [],
  'dim': 768,
  'tasks': {}},
 {'model':

Choose a model that produces small-to-moderate-sized embeddings (e.g., 512 dimensions), so we don’t overuse resources in our simple setup.


In [36]:
import json

EMBEDDING_DIMENSIONALITY = 512

for model in TextEmbedding.list_supported_models():
    if model['dim'] == EMBEDDING_DIMENSIONALITY:
        print(json.dumps(model, indent=2))

{
  "model": "BAAI/bge-small-zh-v1.5",
  "sources": {
    "hf": "Qdrant/bge-small-zh-v1.5",
    "url": "https://storage.googleapis.com/qdrant-fastembed/fast-bge-small-zh-v1.5.tar.gz",
    "_deprecated_tar_struct": true
  },
  "model_file": "model_optimized.onnx",
  "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation, Prefixes for queries/documents: not so necessary, 2023 year.",
  "license": "mit",
  "size_in_GB": 0.09,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "Qdrant/clip-ViT-B-32-text",
  "sources": {
    "hf": "Qdrant/clip-ViT-B-32-text",
    "url": null,
    "_deprecated_tar_struct": false
  },
  "model_file": "model.onnx",
  "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation, Prefixes for queries/documents: not necessary, 2021 year",
  "license": "mit",
  "size_in_GB": 0.25,
  "additional_files": [],
  "dim": 512,
  "tasks": {}
}
{
  "model": "jinaai/jina-embeddings-v2-small-e

We need an embedding model suitable for English text.
<br><br>
It also makes sense to select a unimodal model, since we’re not including images in our search, and specifically tailored solutions are usually better than universal ones.
<br><br>
It seems like **jina-embeddings-v2-small-en** is a good choice!
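
The two criteria above can also be checked programmatically by inspecting the description field, which in the output shown earlier lists modality and language as comma-separated parts. The sketch below runs on shortened excerpts of the two complete entries printed above, plus one clearly hypothetical entry so the filter has something to return. The helper name `english_unimodal` is an assumption, and the description format may differ between FastEmbed versions.

```python
models_excerpt = [
    {"model": "BAAI/bge-small-zh-v1.5", "dim": 512,
     "description": "Text embeddings, Unimodal (text), Chinese, 512 input tokens truncation"},
    {"model": "Qdrant/clip-ViT-B-32-text", "dim": 512,
     "description": "Text embeddings, Multimodal (text&image), English, 77 input tokens truncation"},
    # Hypothetical entry, not a real model.
    {"model": "example/hypothetical-english-model", "dim": 512,
     "description": "Text embeddings, Unimodal (text), English, 512 input tokens truncation"},
]


def english_unimodal(models, dim):
    """Keep models of the given dimensionality described as Unimodal and English."""
    selected = []
    for model in models:
        parts = [part.strip() for part in model["description"].split(",")]
        is_unimodal = any(part.startswith("Unimodal") for part in parts)
        if model["dim"] == dim and is_unimodal and "English" in parts:
            selected.append(model["model"])
    return selected


print(english_unimodal(models_excerpt, dim=512))
```

In the notebook, the first argument would be the list returned by TextEmbedding.list_supported_models().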


In [37]:
model_handle = "jinaai/jina-embeddings-v2-small-en"

jina-embeddings-v2-small-en was trained to measure semantic closeness using cosine similarity. You can find this information, for example, on the model’s [Hugging Face card](https://huggingface.co/jinaai/jina-embeddings-v2-small-en).<br>

The parameters of the chosen embedding model, including the output embedding dimensions and the semantic similarity (distance) metric, are required to configure semantic search in Qdrant.<br>
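
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch, using toy 3-dimensional vectors rather than real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (close to 1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal vectors
```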

Here’s a quick overview of Qdrant’s core terminology:

- **Points** are the central entity Qdrant works with.
  <br>A point is a record consisting of an **ID**, a **vector**, and an optional **payload**.
- A **collection** is a named set of points (i.e., vectors with optional payloads) that you can search within.
  <br>Think of it as the container for your vector search solution, a single business problem solved.

Qdrant supports different types of vectors to enable different modes of data exploration and search (dense, sparse, multivectors, and named vectors).

In this example, we’ll use the most common type, **dense vectors**.

Embeddings capture the semantic essence of the data, while the **payload** holds structured metadata.<br>
This metadata becomes especially useful when applying filters or sorting during search. Qdrant's payloads can hold structured data like booleans, keywords, geo-locations, arrays, and nested objects.


# Step 4: Create a Collection

When creating a [collection](https://qdrant.tech/documentation/concepts/collections/), we need to specify:

- Name: A unique identifier for the collection.
- Vector Configuration:
  - Size: The dimensionality of the vectors.
  - Distance Metric: The method used to measure similarity between vectors.

There are additional parameters you can explore in our [documentation](https://qdrant.tech/documentation/concepts/collections/#create-a-collection). Moreover, you can configure other vector types in Qdrant beyond typical dense embeddings (e.g., for hybrid search). However, for this example, the simplest default configuration is sufficient.


In [41]:
# Define the collection name
collection_name = 'zoomcamp-rag'

# Create the collection with the specified vector parameters
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=EMBEDDING_DIMENSIONALITY,  # Dimensionality of the vectors
        distance=models.Distance.COSINE  # Distance metric for similarity search
    )
)

True

# Step 5: Create, Embed & Insert Points into the Collection

[Points](https://qdrant.tech/documentation/concepts/points/#points) are the core data entities in Qdrant. Each point consists of:

1. **ID**. A unique identifier. Qdrant supports both 64-bit unsigned integers and UUIDs.
2. **Vector**. The embedding that represents the data point in vector space.
3. **Payload** (optional). Additional metadata as key-value pairs.


In [43]:
points = []
point_id = 0  # sequential integer IDs; named point_id to avoid shadowing the built-in id()

for course in documents_raw:
    for doc in course['documents']:

        point = models.PointStruct(
            id=point_id,
            # embed text locally with "jinaai/jina-embeddings-v2-small-en" from FastEmbed
            vector=models.Document(text=doc['text'], model=model_handle),
            payload={
                'text': doc['text'],
                'section': doc['section'],
                'course': course['course']
            }  # save all needed metadata fields
        )
        points.append(point)

        point_id += 1
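
The loop above uses sequential integers as IDs. Since UUIDs are also accepted, one alternative is to derive a deterministic UUID from the content, so that re-running the upload targets the same IDs again instead of adding new ones. This is a sketch: the helper `stable_point_id` and the choice of key (course plus question, which is assumed to be unique) are not part of the original notebook.

```python
import uuid


def stable_point_id(course_name, question):
    """Derive the same UUID string every time for the same course/question pair."""
    key = f"{course_name}|{question}"
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))


first = stable_point_id("mlops-zoomcamp", "Q?")
second = stable_point_id("mlops-zoomcamp", "Q?")
other = stable_point_id("machine-learning-zoomcamp", "Q?")
print(first == second, first == other)
```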

In [46]:
for data in points[0]:
    print(data)

('id', 0)
('vector', Document(text="The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.", model='jinaai/jina-embeddings-v2-small-en', options=None))
('payload', {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register

Now we’re going to embed and upload points to our collection.

First, FastEmbed will fetch and download the selected model (the cache path defaults to os.path.join(tempfile.gettempdir(), "fastembed_cache")) and perform inference directly on your machine.
Then, the generated points will be upserted into the collection, and the vector index will be built.


In [None]:
client.upsert(
    collection_name=collection_name,
    points=points
)

The speed of upsert mainly depends on the time spent on local inference.<br>
To speed this up, you could run FastEmbed on GPUs or use a machine with more resources.

In addition to basic upsert, Qdrant supports **batch upsert** in both column- and record-oriented formats.

The Python client offers:

- Parallelization
- Retries
- Lazy batching

These can be configured via parameters in the upload_collection and upload_points functions.
For details, check the [documentation](https://qdrant.tech/documentation/concepts/points/#upload-points).
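
To illustrate the idea behind lazy batching, the sketch below yields fixed-size batches from any iterable without materializing all of them at once. It is a standalone illustration, not the client's internal implementation; the name `lazy_batches` is hypothetical.

```python
from itertools import islice


def lazy_batches(iterable, batch_size):
    """Yield lists of up to batch_size items, consuming the iterable on demand."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in lazy_batches(range(5), batch_size=2):
    print(batch)
```

In the notebook, each batch of points could then be passed to client.upsert in turn.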


## Study Data Visually

Let’s explore the uploaded data in the Qdrant Web UI at http://localhost:6333/dashboard to study semantic similarity visually.

For example, using the Visualize tab in the zoomcamp-rag collection, we can view all answers to the course questions (948 points) and see how they group together by meaning, additionally coloured by the course type.

To do that, run the following command:

```json
{
  "limit": 948,
  "color_by": {
    "payload": "course"
  }
}
```

This 2D representation is the result of dimensionality reduction applied to jina-embeddings.


# Step 6: Running a Similarity Search

Now, let’s find the most similar text vector in Qdrant to a given query embedding, i.e., the most relevant answer to a given question.

## How Similarity Search Works

1. Qdrant compares the query vector to stored vectors (based on a vector index) using the distance metric defined when creating the collection.
2. The closest matches are returned, ranked by similarity.

The vector index is built for **approximate** nearest neighbor (ANN) search, making large-scale vector search feasible.

If you'd like to dive into our choice of vector index for vector search, check our article ["What is a vector database"](https://qdrant.tech/articles/what-is-a-vector-database/), or, for a more technical deep dive, our article on [Filterable Hierarchical Navigable Small World](https://qdrant.tech/articles/filtrable-hnsw/).
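
For contrast with ANN search, the sketch below performs an exact, brute-force search: it scores every stored vector against the query and keeps the top matches. This is only an illustration on toy 2-dimensional vectors. It is not how Qdrant searches, since the point of a vector index is to avoid comparing the query against every stored vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def brute_force_search(query, stored, limit=1):
    """Return the `limit` (id, score) pairs with the highest cosine similarity."""
    scored = [(point_id, cosine(query, vector)) for point_id, vector in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


stored = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.7, 0.7]}
print(brute_force_search([0.9, 0.1], stored, limit=2))
```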

Let's define a search function:


In [48]:
def search(query, limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle 
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

Now let’s pick a random question from the course data.<br>
As you remember, we didn’t upload the questions to Qdrant.


In [51]:
import random

course = random.choice(documents_raw)
course_piece = random.choice(course['documents'])
print(json.dumps(course_piece, indent=2))

{
  "text": "When creating a duplicate of your dataframe by doing the following:\nX_train = df_train\nX_val = df_val\nYou\u2019re still referencing the original variable, this is called a shallow copy. You can make sure that no references are attaching both variables and still keep the copy of the data do the following to create a deep copy:\nX_train = df_train.copy()\nX_val = df_val.copy()\nAdded by Ixchel Garc\u00eda",
  "section": "2. Machine Learning for Regression",
  "question": "Null column is appearing even if I applied .fillna()"
}


Let's see which answer we get:


In [52]:
result = search(course_piece['question'])

In [53]:
result

QueryResponse(points=[ScoredPoint(id=286, version=0, score=0.79871285, payload={'text': 'If you encounter data type error on trip_type column, it may due to some nan values that isn’t null in bigquery.\nSolution: try casting it to FLOAT datatype instead of NUMERIC', 'section': 'Module 4: analytics engineering with dbt', 'course': 'data-engineering-zoomcamp'}, vector=None, shard_key=None, order_value=None)])

Let’s compare the original and retrieved answers for our randomly selected question.


In [54]:
print(f"Question:\n{course_piece['question']}\n")
print("Top Retrieved Answer:\n{}\n".format(result.points[0].payload['text']))
print("Original Answer:\n{}".format(course_piece['text']))

Question:
Null column is appearing even if I applied .fillna()

Top Retrieved Answer:
If you encounter data type error on trip_type column, it may due to some nan values that isn’t null in bigquery.
Solution: try casting it to FLOAT datatype instead of NUMERIC

Original Answer:
When creating a duplicate of your dataframe by doing the following:
X_train = df_train
X_val = df_val
You’re still referencing the original variable, this is called a shallow copy. You can make sure that no references are attaching both variables and still keep the copy of the data do the following to create a deep copy:
X_train = df_train.copy()
X_val = df_val.copy()
Added by Ixchel García
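
A single random question gives only an anecdote. To quantify how often the top result is the original answer, one could compute a hit rate over many questions. The sketch below takes the search function as a parameter so it can run with a stub; in the notebook you would pass a function that calls search and returns result.points[0].payload['text']. The names `hit_rate` and `stub_search` are hypothetical.

```python
def hit_rate(records, top_answer_fn):
    """Share of records whose question retrieves its own answer as the top result."""
    hits = sum(1 for rec in records if top_answer_fn(rec["question"]) == rec["text"])
    return hits / len(records)


# Stub standing in for a real vector search: it "knows" only one question.
def stub_search(question):
    return "answer A" if question == "question A" else "something else"


records = [
    {"question": "question A", "text": "answer A"},
    {"question": "question B", "text": "answer B"},
]
print(hit_rate(records, stub_search))
```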


Now let’s search for the answer to a question that wasn’t in the initial dataset.


In [55]:
print(search("What if I submit homeworks late?").points[0].payload['text'])

No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y
Older news:[source1] [source2]


# Step 7: Running a Similarity Search with Filters

We can refine our search using metadata filters.

- Qdrant’s custom vector index implementation, Filterable HNSW, allows for precise and scalable vector search with filtering conditions.

For example, we can search for an answer to a question related to a specific course from the three available in the dataset.
Using a must filter ensures that all specified conditions are met for a data point to be included in the search results.

- Qdrant also supports other filter clauses and conditions such as should, must_not, range, and more. For a full overview, check our [Filtering Guide](https://qdrant.tech/articles/vector-search-filtering/).
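
The semantics of these clauses can be sketched in plain Python. The function below is an illustration, not Qdrant's implementation: it treats each condition as a (key, value) exact match and assumes that must requires all conditions to hold, should requires at least one (when any are given), and must_not requires none.

```python
def matches(payload, must=(), should=(), must_not=()):
    """Evaluate simple exact-match clauses against a payload dict."""
    def holds(condition):
        key, value = condition
        return payload.get(key) == value

    return (
        all(holds(c) for c in must)
        and (not should or any(holds(c) for c in should))
        and not any(holds(c) for c in must_not)
    )


payload = {"course": "mlops-zoomcamp", "section": "Module 1"}
print(matches(payload, must=[("course", "mlops-zoomcamp")]))
print(matches(payload, must_not=[("course", "mlops-zoomcamp")]))
```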

To enable efficient filtering, we need to turn on [indexing of payload fields](https://qdrant.tech/documentation/concepts/indexing/#payload-index).


In [56]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword" # exact matching on string metadata fields
)

UpdateResult(operation_id=2, status=<UpdateStatus.COMPLETED: 'completed'>)
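
Conceptually, a keyword index maps each distinct field value to the points that carry it, so a filter can look up matching points directly instead of scanning every payload. The sketch below shows that idea with a plain dictionary; it is a simplified mental model, not a description of Qdrant's internal data structures, and the name `build_keyword_index` is hypothetical.

```python
from collections import defaultdict


def build_keyword_index(payloads, field):
    """Map each value of `field` to the set of point IDs that have it."""
    index = defaultdict(set)
    for point_id, payload in payloads.items():
        if field in payload:
            index[payload[field]].add(point_id)
    return index


payloads = {
    0: {"course": "data-engineering-zoomcamp"},
    1: {"course": "mlops-zoomcamp"},
    2: {"course": "mlops-zoomcamp"},
}
course_index = build_keyword_index(payloads, "course")
print(sorted(course_index["mlops-zoomcamp"]))
```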

Now let's update our search function:


In [57]:
def search_in_course(query, course="mlops-zoomcamp", limit=1):

    results = client.query_points(
        collection_name=collection_name,
        query=models.Document( #embed the query text locally with "jinaai/jina-embeddings-v2-small-en"
            text=query,
            model=model_handle
        ),
        query_filter=models.Filter( # filter by course name
            must=[
                models.FieldCondition(
                    key="course",
                    match=models.MatchValue(value=course)
                )
            ]
        ),
        limit=limit, # top closest matches
        with_payload=True #to get metadata in the results
    )

    return results

Let’s see how the same question is answered across different courses:<br>
data-engineering-zoomcamp, machine-learning-zoomcamp, and mlops-zoomcamp.


In [58]:
print(search_in_course("What if I submit homeworks late?", "mlops-zoomcamp").points[0].payload['text'])

Please choose the closest one to your answer. Also do not post your answer in the course slack channel.
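
To compare all three courses, the same call can be repeated in a loop. The helper below takes the search function as a parameter so the sketch can run with a stub; in the notebook you would pass search_in_course instead. The names `answers_by_course` and `stub_search_in_course` are hypothetical, and the stub only mimics the shape of a query response (an object with a points list).

```python
from types import SimpleNamespace

COURSES = ["data-engineering-zoomcamp", "machine-learning-zoomcamp", "mlops-zoomcamp"]


def answers_by_course(question, courses, search_fn):
    """Return {course: top answer text, or None if nothing matched}."""
    answers = {}
    for course in courses:
        points = search_fn(question, course).points
        answers[course] = points[0].payload["text"] if points else None
    return answers


# Stub standing in for search_in_course; returns a placeholder answer.
def stub_search_in_course(question, course, limit=1):
    point = SimpleNamespace(payload={"text": f"placeholder answer from {course}"})
    return SimpleNamespace(points=[point])


question = "What if I submit homeworks late?"
for course, answer in answers_by_course(question, COURSES, stub_search_in_course).items():
    print(f"{course}: {answer}")
```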


# Conclusion

🎉 Congratulations! You now have everything you need to run a simple semantic search with Qdrant! 👏

In general, data preparation, organization, and storage in a production-ready vector search solution is a topic worth a course of its own.
If you’re curious to dive deeper into efficient vector search setup, check out our [Vector Search Manuals](https://qdrant.tech/articles/vector-search-manuals/).

In the next videos, we will show you how to use [hybrid search](https://qdrant.tech/articles/hybrid-search/), combining the strengths of both keyword-based search and vector search. In many real-world applications, they work hand-in-hand, balancing the precision of keywords with the flexibility of embeddings to deliver the best results.

P.S. We encourage you to check out Qdrant’s capabilities, which go beyond similarity search powering RAG & agentic pipelines (but still, here's our [MCP server](https://github.com/qdrant/mcp-server-qdrant) ;) ).
