## Qdrant Essentials: Day 3 - Using Sparse Vectors in Qdrant

To interact with Qdrant, we'll install the Qdrant Python client. It lets us communicate with the Qdrant service, manage collections, and perform vector searches.

## Step 1: Install the Qdrant Client


In [None]:
!pip -q install qdrant-client

## Step 2: Import Required Libraries

Now let's import the necessary modules from the `qdrant-client` package. The `QdrantClient` class establishes a connection to the Qdrant service, while the [models module](https://python-client.qdrant.tech/qdrant_client.http.models.models) defines the configurations and parameters we'll use, including the data structures we need today: `SparseVector`, `SparseVectorParams`, and `SparseIndexParams`.

In [None]:
from qdrant_client import QdrantClient, models

## Step 3: Connect to Qdrant Cloud

To connect to [Qdrant Cloud](https://cloud.qdrant.io/login), you need:

*   Cluster URL: Found in your Qdrant Cloud dashboard.
*   API Key: Generated in the Qdrant Cloud API Keys section.

In [None]:
from google.colab import userdata

client = QdrantClient(
    url="your_url",
    api_key=userdata.get('api-key')
)
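
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of one alternative is to read the connection settings from environment variables. The variable names `QDRANT_URL` and `QDRANT_API_KEY` and the helper `qdrant_connection_kwargs` are assumptions made for this example, not names required by Qdrant.

```python
import os


def qdrant_connection_kwargs(env=os.environ):
    """Build keyword arguments for QdrantClient from environment variables.

    QDRANT_URL and QDRANT_API_KEY are hypothetical names chosen for this sketch.
    """
    url = env.get("QDRANT_URL")
    api_key = env.get("QDRANT_API_KEY")
    if not url:
        raise ValueError("QDRANT_URL is not set")
    kwargs = {"url": url}
    if api_key:  # only pass the key when one is configured
        kwargs["api_key"] = api_key
    return kwargs


# Usage (requires qdrant-client and a reachable cluster):
# client = QdrantClient(**qdrant_connection_kwargs())
```

Keeping the key out of the notebook itself avoids committing credentials by accident.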

## Step 4: Create a Collection with Sparse Vectors

Sparse vectors in Qdrant collections are configured using `sparse_vectors_config`.

Unlike the dense vector configuration (`vectors_config`), sparse vectors don’t require us to define a **size** or a **distance metric**:
- **Size** varies with the number of non-zero elements in each sparse vector.
>  The maximum number of non-zero elements (i.e., the sparse vector’s size) is bounded by the `uint32` index type: at most 4,294,967,296 (2^32) non-zero elements.
- **Distance metric** for comparing sparse vectors is always the dot product.
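
To make the "size" point concrete, here is a small sketch of converting a dense list into the `indices`/`values` form, keeping only the non-zero entries. The helper `dense_to_sparse` is written for this illustration and is not part of the Qdrant client.

```python
def dense_to_sparse(dense):
    """Return (indices, values) holding only the non-zero entries of a dense list."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return indices, values


dense = [0.0, 0.2, -0.2, 0.2, 0.0, 0.0]
indices, values = dense_to_sparse(dense)
print(indices)  # positions of the non-zero entries
print(values)   # the non-zero weights at those positions
```

The sparse form stores three entries here, regardless of how many zeros surround them in the dense list.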

Sparse vectors are not the default in Qdrant (unlike dense vectors). That's why, to configure a collection with sparse vectors, we need to **give them a name**.  

> **Named vectors** additionally allow us to use multiple vectors for the same point, for example, one dense and one sparse.  We’ll cover more about this in the upcoming videos on `Hybrid Search`.


In [None]:
# Define the collection name
collection_name = "sparse_vectors_collection"

# Create the collection with sparse vectors
client.create_collection(
    collection_name=collection_name,
    sparse_vectors_config={ #vector named "sparse_vector"
        "sparse_vector": models.SparseVectorParams(),
    },
)

True

You can **optionally** configure the parameters of the `inverted index`, for example:

- `full_scan_threshold` (integer) – when there are fewer vectors than this number, sparse vectors are compared directly (a full scan) instead of through the inverted index, although **the index is still built**.
- `on_disk` – whether the index is stored on disk or kept in RAM. The default is `False`, meaning it’s stored in RAM.
- `datatype` – precision of the values (non-zero weights) stored in the index. Options are `uint8`, `float16`, or `float32`. Default is `float32`.
> No matter which `datatype` you choose, the original values **will still be stored on disk**.

> Only configure these parameters if you’re confident about their impact. The default values are chosen to work well for most use cases.
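
As a rough illustration of what lower precision means for stored weights, the sketch below round-trips a few values through IEEE 754 half precision using Python's `struct` module. It only shows the general rounding effect of `float16`. It does not reproduce how Qdrant encodes values internally, and `uint8` in particular is not covered.

```python
import struct


def round_trip_float16(x):
    """Encode x as IEEE 754 half precision and decode it back to a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [0.2, -0.2, 0.1]
for w in weights:
    approx = round_trip_float16(w)
    print(f"{w:+.4f} -> {approx:+.10f} (abs error {abs(w - approx):.2e})")
```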


In [None]:
collection_name = "sparse_vectors_collection_custom_index"

client.create_collection(
    collection_name=collection_name,
    sparse_vectors_config={
        "sparse_vector": models.SparseVectorParams(
            index=models.SparseIndexParams( #inverted index parameters
                full_scan_threshold=0, # full scan is used below this many vectors; 0 means the inverted index is always used
                on_disk=False, # False keeps the inverted index in RAM
                datatype=models.VectorStorageDatatype("float32") # precision of values stored in the inverted index
            )
        ),
    },
)

True

> The `inverted index` is **always built** for sparse vectors, no matter how many vectors are in the collection.


## Step 5: Insert Sparse Vectors into the Collection

Let’s insert points with sparse vectors into our `"sparse_vectors_collection"`.

Sparse vectors in Qdrant are represented by:
- `indices` – the indices of non-zero dimensions (stored as `uint32`, so they can range from 0 to 4,294,967,295).
- `values` – the values of these non-zero dimensions (stored as a float).

There are two important rules when creating a sparse vector in Qdrant:
- The `indices` in a sparse vector must be **unique**.
- The `indices` array must be the **same length** as the `values` array.
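
These rules can be checked on the client side before sending data. The sketch below is a hypothetical helper, `validate_sparse_vector`, written for this tutorial. It is not a function of the Qdrant client, and the error messages Qdrant itself returns may differ.

```python
UINT32_MAX = 4_294_967_295


def validate_sparse_vector(indices, values):
    """Raise ValueError if (indices, values) breaks the sparse vector rules."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    if len(set(indices)) != len(indices):
        raise ValueError("indices must be unique")
    for i in indices:
        if not isinstance(i, int) or not 0 <= i <= UINT32_MAX:
            raise ValueError(f"index {i!r} is outside the uint32 range")
    return True


validate_sparse_vector([1, 2, 3], [0.2, -0.2, 0.2])  # passes
# validate_sparse_vector([1, 1], [0.2, 0.3])         # would raise: duplicate index
# validate_sparse_vector([1, 2], [0.2])              # would raise: length mismatch
```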

> Don’t confuse `indices` with the sparse vector’s `inverted index`. `indices` represent the non-zero dimensions of a sparse vector, while the `inverted index` is a data structure that helps compare sparse vectors efficiently.
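
To make the distinction tangible, here is a toy inverted index in plain Python. It maps each dimension to the points that have a non-zero value there, so a query only touches points sharing at least one dimension with it. This is a simplified sketch for intuition, not Qdrant's actual implementation. The functions `build_inverted_index` and `search` are invented for this example, and the two points mirror the ones inserted in this step.

```python
from collections import defaultdict


def build_inverted_index(points):
    """points: {point_id: (indices, values)} -> {dimension: [(point_id, value), ...]}"""
    index = defaultdict(list)
    for point_id, (indices, values) in points.items():
        for dim, val in zip(indices, values):
            index[dim].append((point_id, val))
    return dict(index)


def search(index, query_indices, query_values, limit=1):
    """Accumulate dot-product scores by visiting only the query's dimensions."""
    scores = defaultdict(float)
    for dim, q in zip(query_indices, query_values):
        for point_id, val in index.get(dim, []):
            scores[point_id] += q * val
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


points = {
    1: ([1, 2, 3], [0.2, -0.2, 0.2]),
    2: ([1, 5], [0.1, 0.1]),
}
index = build_inverted_index(points)
top = search(index, [1, 3], [1.0, 1.0], limit=2)
print(top)  # point ids with their scores, best first
```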

In [None]:
collection_name = "sparse_vectors_collection"

# Insert vectors into the collection
client.upsert(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=1,
            payload={},
            vector={ #vector named "sparse_vector"
                "sparse_vector": models.SparseVector(
                    indices=[1, 2, 3], #uint32, from 0 to 4_294_967_295
                    values=[0.2, -0.2, 0.2] #stored as floats
                )
            },
        ),
        models.PointStruct(
            id=2,
            payload={},
            vector={ #vector named "sparse_vector"
                "sparse_vector": models.SparseVector(
                    indices=[1, 5], #uint32, from 0 to 4_294_967_295
                    values=[0.1, 0.1] #stored as floats
                )
            },
        ),
    ],
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

## Step 6: Running Similarity Search on Sparse Vectors

Now, let’s find the most similar sparse vector in `"sparse_vectors_collection"` to a query sparse vector.


In [None]:
collection_name = "sparse_vectors_collection"

client.query_points(
    collection_name=collection_name,
    using="sparse_vector",  # we need to specify the name of our sparse vectors to search against them
    limit=1,                # return the top 1 most similar result
    query=models.SparseVector(
        indices=[1, 3],
        values=[1, 1]
    ),
    with_vectors=True # to see the top 1 most similar vector
)


QueryResponse(points=[ScoredPoint(id=1, version=0, score=0.4, payload={}, vector={'sparse_vector': SparseVector(indices=[1, 2, 3], values=[0.2, -0.2, 0.2])}, shard_key=None, order_value=None)])

Let’s understand why we got `Point 1` as the answer.

In the collection, we have two points:

- `Point 1` has three non-zero values: `values = [0.2, -0.2, 0.2]` with `indices = [1, 2, 3]`
- `Point 2` has two non-zero values: `values = [0.1, 0.1]` with `indices = [1, 5]`

Our query has `indices = [1, 3]` with corresponding `values = [1, 1]`.

**The similarity score for sparse vectors** is calculated by comparing only the matching indices shared between the query and the points: `[1, 3]` for `Point 1`, and `[1]` for `Point 2`.

We multiply the corresponding values and sum them up:

- `score(query, Point 1)` = 1 * 0.2 + 1 * 0.2 = **0.4**
- `score(query, Point 2)` = 1 * 0.1 = 0.1

Since `0.4` is higher than `0.1`, `Point 1` is more similar to our query.
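
The same arithmetic can be reproduced in a few lines of plain Python. This is a sketch for checking the scores by hand. `sparse_dot` is a helper written for this example, not a Qdrant client function.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors: sum over the indices they share."""
    b = dict(zip(indices_b, values_b))
    return sum(v * b[i] for i, v in zip(indices_a, values_a) if i in b)


query = ([1, 3], [1, 1])
point_1 = ([1, 2, 3], [0.2, -0.2, 0.2])
point_2 = ([1, 5], [0.1, 0.1])

score_1 = sparse_dot(*query, *point_1)
score_2 = sparse_dot(*query, *point_2)
print(score_1, score_2)
```

Index `2` of `Point 1` and index `5` of `Point 2` do not appear in the query, so they contribute nothing to the scores.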

> Search on sparse vectors is always **exact**.


Congratulations! 🎉 You can now work with sparse vectors in Qdrant!

Just like dense vectors, they can be combined with payload filters.  
They can also be combined with dense vectors in the same search.  
Why and how? You’ll see in our upcoming videos on **Hybrid Search**.