# Tuning semantic retrieval

There are several objectives we could optimize for in semantic retrieval: the **speed** of the retrieval, its **quality**, or its **memory usage**. We'll review some of the techniques in all three areas.

## Loading the configuration and pipeline

Again, let's start by loading the configuration, and then set up our retriever. We don't want a full RAG pipeline, as we are solely interested in the semantic search part. Improving a single component at a time should be easier to understand and debug.

In [None]:
from dotenv import load_dotenv

load_dotenv()

In [None]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    embed_model="local:BAAI/bge-large-en"
)

In [None]:
from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)
vector_store = QdrantVectorStore(
    client=client, 
    collection_name="hacker-news"
)

In [None]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)

In [None]:
from llama_index.vector_stores import MetadataFilters, MetadataFilter
from llama_index.indices.vector_store import VectorIndexRetriever

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="type", value="story"),
        ]
    ),
)

In [None]:
nodes = retriever.retrieve("What is the best way to learn programming?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

## Quality optimization

We have already implemented a basic RAG, and we might be happy with its quality. Measuring the quality of a semantic retrieval system has many aspects, and we will not go into details here. It mostly depends on the quality of the embedding model we use, which is a topic for another day.

However, vector databases typically approximate the nearest neighbor search, and this approximation comes at a cost: the returned results are not always the exact nearest neighbors. HNSW, the algorithm used in Qdrant, has some parameters that control how its internal structures are built, and these parameters can be tuned to improve the quality of the results. This is very specific to the vector database used, so it's configured through the Qdrant API.

In [None]:
client.get_collection(collection_name="hacker-news")

For now, the most interesting part is the `hnsw_config` field. The algorithm itself is controlled by two parameters. The `m` parameter is the number of edges per node in the graph. The larger the value, the higher the precision of the search, but the more space is required. The `ef_construct` parameter is the number of neighbors to consider while building the index. Again, the larger the value, the higher the precision, but the longer the indexing time.

Playing with both parameters **improves just the approximation of the exact nearest neighbors**, and a proper embedding model is still way more important. However, [this quality aspect might also be controlled, even in an automated way](https://qdrant.tech/documentation/tutorials/retrieval-quality/). For the time being, we'll simply increase both values, but won't measure the impact on the overall quality of search results.
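If we did want to measure the impact, one common metric is precision@k: the fraction of the true top-k neighbors that the approximate search also returns. Below is a minimal sketch of that calculation. The `precision_at_k` helper and the point ids are made up for illustration. In practice the exact ids would come from a full-scan search (Qdrant can run one when `exact=True` is set in the search parameters) and the approximate ids from the regular HNSW search.

```python
def precision_at_k(exact_ids, approx_ids, k):
    """Fraction of the top-k exact neighbors also present in the top-k approximate results."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / k


# Hypothetical point ids, made up for illustration only.
exact = [11, 42, 7, 93, 5]    # ground truth from an exact search
approx = [11, 7, 42, 18, 5]   # results of the approximate search

# Four of the five exact neighbors were found by the approximate search.
print(precision_at_k(exact, approx, k=5))
```

Averaging this value over a set of representative queries, before and after changing `m` and `ef_construct`, would show whether the new settings actually help.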

In [None]:
from qdrant_client import models

client.update_collection(
    collection_name="hacker-news",
    hnsw_config=models.HnswConfigDiff(
        m=32,
        ef_construct=200,
    )
)

In [None]:
import time

while True:
    collection = client.get_collection("hacker-news")
    if collection.status == models.CollectionStatus.GREEN:
        break
    time.sleep(1.0)
        
collection
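
The polling loop above never gives up, so it would spin forever if the collection never turned green. A minimal sketch of a reusable variant with a timeout might look like the following. The `wait_until` helper is hypothetical (it is not part of `qdrant_client`), and the readiness check used here is a stand-in so the snippet runs without a Qdrant instance.

```python
import time


def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in readiness check: reports "not ready" twice, then "ready".
_answers = iter([False, False, True])
wait_until(lambda: next(_answers), timeout=5.0, interval=0.01)
```

With the client from this notebook, the check could be `lambda: client.get_collection("hacker-news").status == models.CollectionStatus.GREEN`.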

In [None]:
nodes = retriever.retrieve("What is the best way to learn programming?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

## Memory optimization

Each point in a Qdrant collection consists of up to three elements: id, vector(s), and optional payload represented by a JSON object. Vectors are indexed in an HNSW graph, and search operations may involve semantic similarity and some payload-based criteria (it's best to add payload indexes on the fields we want to use for the filtering). Ideally, all the elements should be kept in RAM so access is fast.
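
To get a feeling for the numbers, here is a rough estimate of the RAM taken by the raw `float32` vectors alone. It is only a lower bound: the HNSW graph, payloads, and any internal overhead come on top of it. The `raw_vector_memory_bytes` helper and the collection size are assumptions made for illustration; the real size of the `hacker-news` collection is not stated here.

```python
def raw_vector_memory_bytes(num_vectors, dimensions, bytes_per_dimension=4):
    """Memory needed for the raw vectors alone (float32 takes 4 bytes per dimension)."""
    return num_vectors * dimensions * bytes_per_dimension


# BAAI/bge-large-en produces 1024-dimensional embeddings.
# The number of vectors is a hypothetical figure, not the real collection size.
num_vectors = 1_000_000
total = raw_vector_memory_bytes(num_vectors, dimensions=1024)
print(f"{total / 1024**3:.2f} GiB for the raw vectors")
```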

Unfortunately, semantic search is demanding in terms of memory, and some projects are implemented on a budget and can't afford machines with hundreds of gigabytes of RAM. Qdrant allows storing every single component on disk to reduce memory usage, but that comes with a performance cost. Let's compare the efficiency of the operations with all the components in RAM and with some of them on disk.

In [None]:
%%timeit -n 100 -r 5
retriever.retrieve("What is the best way to learn programming?")

In [None]:
client.update_collection(
    collection_name="hacker-news",
    hnsw_config=models.HnswConfigDiff(
        on_disk=True,
    ),
    vectors_config={
        "": models.VectorParamsDiff(
            on_disk=True,
        )
    },
)

In [None]:
while True:
    collection = client.get_collection("hacker-news")
    if collection.status == models.CollectionStatus.GREEN:
        break
    time.sleep(1.0)
        
collection

In [None]:
%%timeit -n 100 -r 5
retriever.retrieve("What is the best way to learn programming?")
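
The `%%timeit` magic is only available in IPython and Jupyter. Outside a notebook, a rough equivalent can be built on `time.perf_counter`, as in the sketch below. The `measure_latency` helper is hypothetical, and the workload is a stand-in so the snippet runs on its own.

```python
import statistics
import time


def measure_latency(fn, repeats=100):
    """Call fn() `repeats` times and return (mean, stdev) of seconds per call."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to compute a standard deviation")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload; with the retriever from this notebook it would be
# lambda: retriever.retrieve("What is the best way to learn programming?")
mean_s, stdev_s = measure_latency(lambda: sum(range(1000)), repeats=50)
print(f"{mean_s * 1e6:.1f} us +/- {stdev_s * 1e6:.1f} us per call")
```

Running it once before and once after moving components to disk gives two numbers that can be compared directly.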

## Speed optimization

There are various ways of optimizing semantic search in terms of speed. The most straightforward one is to reduce both the `m` and `ef_construct` parameters, that is, the opposite of what we did in the quality optimization section. However, this comes at the cost of the quality of the results.

Qdrant also provides a number of quantization techniques, and two of them are primarily used to increase speed and reduce memory at the same time:

1. **Scalar Quantization** - uses `int8` instead of `float32` to store each vector dimension
2. **Binary Quantization** - uses a single `bool` (one bit) instead of `float32` to store each vector dimension

The first one reduces memory usage by up to 4x and the second one by up to 32x, and both increase the speed of the search. However, the quality of the search results is reduced, and Binary Quantization is not suitable for all use cases. It only works well with some specific models, usually the ones with high dimensionality.
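
The idea behind both techniques, and where the 4x and 32x figures come from, can be illustrated with a simplified sketch. It is not Qdrant's actual implementation: the fixed `[lo, hi]` range, the sign-based binarization, and the toy vectors are assumptions made for illustration.

```python
def scalar_quantize(vector, lo, hi):
    """Map each float in [lo, hi] linearly onto an int8 value in [-128, 127]."""
    scale = 255.0 / (hi - lo)
    return [max(-128, min(127, round((x - lo) * scale) - 128)) for x in vector]


def binary_quantize(vector):
    """Keep one bit per dimension: 1 for positive values, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vector]


def hamming_distance(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy 4-dimensional vectors, made up for illustration.
vec_a = [0.12, -0.40, 0.33, -0.05]
vec_b = [0.10, 0.25, 0.30, -0.20]

print(scalar_quantize(vec_a, lo=-1.0, hi=1.0))
bits_a, bits_b = binary_quantize(vec_a), binary_quantize(vec_b)
print(bits_a, bits_b, hamming_distance(bits_a, bits_b))

# Storage per 1024-dimensional vector in each representation.
dimensions = 1024
float32_bytes = dimensions * 4   # 4 bytes per dimension
int8_bytes = dimensions * 1      # 1 byte per dimension
binary_bytes = dimensions // 8   # 1 bit per dimension
print(float32_bytes // int8_bytes, float32_bytes // binary_bytes)
```

Comparing bit vectors with a Hamming distance is much cheaper than computing a floating-point similarity, which is where the speed-up comes from. The price is that a lot of information per dimension is thrown away.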

In our case, we're going to set up binary quantization anyway. From the LlamaIndex perspective, the search operations are going to be executed in exactly the same way.

In [None]:
client.update_collection(
    collection_name="hacker-news",
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,
        )
    )
)

In [None]:
while True:
    collection = client.get_collection("hacker-news")
    if collection.status == models.CollectionStatus.GREEN:
        break
    time.sleep(1.0)
        
collection

In [None]:
nodes = retriever.retrieve("What is the best way to learn programming?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

In [None]:
%%timeit -n 100 -r 5
retriever.retrieve("What is the best way to learn programming?")