# Tuning semantic retrieval

Semantic retrieval can be optimized for several objectives: the **speed** of the retrieval, its **quality**, or its **memory usage**. We'll review some of the techniques in all three areas.

## Loading the configuration and pipeline

Again, let's start with loading the configuration, and then set up our retriever. We don't want a full RAG pipeline, as we are solely interested in the semantic search part. Improving a single component at a time should be easier to understand and debug. 

In [1]:
from dotenv import load_dotenv

load_dotenv()

True

In [2]:
from llama_index import ServiceContext

service_context = ServiceContext.from_defaults(
    embed_model="local:BAAI/bge-large-en"
)

In [3]:
from qdrant_client import QdrantClient
from llama_index.vector_stores.qdrant import QdrantVectorStore

import os

client = QdrantClient(
    os.environ.get("QDRANT_URL"), 
    api_key=os.environ.get("QDRANT_API_KEY"),
)
vector_store = QdrantVectorStore(
    client=client, 
    collection_name="hacker-news"
)

In [4]:
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)

In [5]:
from llama_index.vector_stores import MetadataFilters, MetadataFilter
from llama_index.indices.vector_store import VectorIndexRetriever

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=5,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="type", value="story"),
        ]
    ),
)

In [6]:
nodes = retriever.retrieve("What is the best way to learn programming?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 Ask HN: Where do you go to find recommendations for physical programming books?

I&#x27;m old school and like sitting down with a book both for learning actual coding and also for methodologies and philosophies. I don&#x27;t know where to go for recommendations. Any help? Thanks!

2 Ask HN: What to learn in order to get a software job in a decent country?

A good friend of mine is 18 and Russian. He is a programming prodigy and is trying to formulate a plan to get out. He&#x27;s thinking about his future CV and applying for jobs. What would be the best frameworks to invest time in getting experience with now?

3 Ask HN: What is the best way to get into building electronics as a programmer?

I am asking not only about learning what is taught in classes for solving ideal problems. I am talking about the real engineering like a hobbyist who actually understands what works in real life and how to build it properly.

4 Ask HN: Best tools for 4/5 year old to learn programming?

I&#x27;m lo

## Quality optimization

We have already implemented a basic RAG pipeline, and we might be happy with its quality. Measuring the quality of a semantic retrieval system has many aspects, and we won't go into detail here. Quality usually depends on the embedding model we use, which is a topic for another day.

However, vector databases typically approximate the nearest neighbor search, and this approximation comes at a cost: the results are not always the true nearest neighbors. HNSW, the algorithm used in Qdrant, has parameters that control how its internal structures are built, and these parameters can be tuned to improve the quality of the results. They are specific to the vector database used, so they are configured through the Qdrant API.
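To make "approximation" concrete, here is a minimal sketch of the exact search that an approximate index tries to imitate: score every stored vector against the query and keep the best `k`. The function names and toy vectors are made up for illustration and are not part of Qdrant or LlamaIndex.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force search: compare the query with every vector."""
    scored = [
        (cosine_similarity(query, vector), idx)
        for idx, vector in enumerate(vectors)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Hypothetical 2-dimensional "embeddings"
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
top_two = exact_top_k([1.0, 0.0], toy_vectors, k=2)
print(top_two)
```

The cost of this approach grows linearly with the number of vectors, which is why an index such as HNSW trades a little accuracy for much less work per query.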

In [7]:
client.get_collection(collection_name="hacker-news")

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=47037, indexed_vectors_count=45000, points_count=47037, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=1024, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None), payload_sch

For now, the most interesting part is the `hnsw_config` field. The algorithm itself is controlled by two parameters. The `m` parameter is the number of edges per node: the larger the value, the higher the precision of the search, but the more space required. The `ef_construct` parameter is the number of neighbors to consider while building the index: again, the larger the value, the higher the precision, but the longer the indexing time.

Tuning both parameters **only improves the approximation of the exact nearest neighbors**; a proper embedding model is still far more important. However, [this quality aspect might also be controlled, even in an automated way](https://qdrant.tech/documentation/tutorials/retrieval-quality/). For the time being, we'll simply increase both values without measuring the impact on the overall quality of the search results.
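If we did want to measure the approximation, one common metric is precision@k: the fraction of the approximate top-k results that also appear in the exact top-k. The sketch below assumes we already have both ID lists for the same query; the helper name and the IDs are hypothetical.

```python
def precision_at_k(approx_ids, exact_ids, k):
    """Share of the approximate top-k that is also in the exact top-k."""
    approx_top = set(approx_ids[:k])
    exact_top = set(exact_ids[:k])
    return len(approx_top & exact_top) / k


# Made-up point IDs returned for the same query
exact_ids = [11, 42, 7, 19, 3]   # from an exact (brute-force) search
approx_ids = [11, 7, 42, 56, 3]  # from the approximate index

score = precision_at_k(approx_ids, exact_ids, k=5)
print(score)
```

Averaging this score over a representative set of queries gives a single number to compare before and after changing `m` and `ef_construct`.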

In [8]:
from qdrant_client import models

client.update_collection(
    collection_name="hacker-news",
    hnsw_config=models.HnswConfigDiff(
        m=32,
        ef_construct=200,
    )
)

True

In [9]:
import time

while True:
    collection = client.get_collection("hacker-news")
    if collection.status == models.CollectionStatus.GREEN:
        break
    time.sleep(1.0)
        
collection

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=47037, indexed_vectors_count=45000, points_count=47037, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=1024, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=32, ef_construct=200, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None), payload_sch
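The polling loop above runs forever if the collection never turns green. A small helper with a timeout is one way to guard against that. This is a sketch: `wait_until` is a hypothetical name, and the condition is passed in as a callable so the helper does not depend on any client.

```python
import time


def wait_until(condition, timeout=60.0, interval=1.0):
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


# Assumed usage with the notebook's `client` and `models`:
# wait_until(
#     lambda: client.get_collection("hacker-news").status
#     == models.CollectionStatus.GREEN
# )
```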

In [10]:
nodes = retriever.retrieve("What is the best way to learn programming?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 Ask HN: Where do you go to find recommendations for physical programming books?

I&#x27;m old school and like sitting down with a book both for learning actual coding and also for methodologies and philosophies. I don&#x27;t know where to go for recommendations. Any help? Thanks!

2 Ask HN: What to learn in order to get a software job in a decent country?

A good friend of mine is 18 and Russian. He is a programming prodigy and is trying to formulate a plan to get out. He&#x27;s thinking about his future CV and applying for jobs. What would be the best frameworks to invest time in getting experience with now?

3 Ask HN: What is the best way to get into building electronics as a programmer?

I am asking not only about learning what is taught in classes for solving ideal problems. I am talking about the real engineering like a hobbyist who actually understands what works in real life and how to build it properly.

4 Ask HN: Best tools for 4/5 year old to learn programming?

I&#x27;m lo

## Memory optimization

Each point in a Qdrant collection consists of up to three elements: id, vector(s), and optional payload represented by a JSON object. Vectors are indexed in an HNSW graph, and search operations may involve semantic similarity and some payload-based criteria (it's best to add payload indexes on the fields we want to use for the filtering). Ideally, all the elements should be kept in RAM so access is fast.

Unfortunately, semantic search is memory-hungry, and some projects are built on a budget and can't afford machines with hundreds of gigabytes of RAM. Qdrant allows storing every single component on disk to reduce memory usage, but that comes with a performance cost. Let's compare the efficiency of the operations with all the components in RAM and with some of them on disk.
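To get a feel for the numbers, here is a back-of-the-envelope estimate of the raw vector storage for this collection. The vector count and dimensionality come from the collection info shown earlier. The 4 bytes per dimension assume `float32` storage, and the estimate ignores the HNSW graph, payloads, and any other overhead.

```python
# Values taken from the collection info printed earlier
num_vectors = 47_037
dimensions = 1_024

# Assumption: each dimension is stored as a 4-byte float32
bytes_per_dimension = 4

raw_vector_bytes = num_vectors * dimensions * bytes_per_dimension
raw_vector_mib = raw_vector_bytes / (1024 ** 2)

print(f"{raw_vector_bytes} bytes, about {raw_vector_mib:.0f} MiB")
```

This collection is small, but the same arithmetic scales linearly: a hundred times more vectors would need a hundred times more memory for the vectors alone.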

In [11]:
%%timeit -n 100 -r 5
retriever.retrieve("What is the best way to learn programming?")

169 ms ± 24.4 ms per loop (mean ± std. dev. of 5 runs, 100 loops each)


In [12]:
client.update_collection(
    collection_name="hacker-news",
    hnsw_config=models.HnswConfigDiff(
        on_disk=True,
    ),
    vectors_config={
        "": models.VectorParamsDiff(
            on_disk=True,
        )
    },
)

True

In [13]:
while True:
    collection = client.get_collection("hacker-news")
    if collection.status == models.CollectionStatus.GREEN:
        break
    time.sleep(1.0)
        
collection

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=47037, indexed_vectors_count=45000, points_count=47037, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=1024, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=True), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=32, ef_construct=200, full_scan_threshold=10000, max_indexing_threads=0, on_disk=True, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None), payload_sche

In [14]:
%%timeit -n 100 -r 5
retriever.retrieve("What is the best way to learn programming?")

152 ms ± 20.5 ms per loop (mean ± std. dev. of 5 runs, 100 loops each)


## Speed optimization

There are various ways of optimizing semantic search in terms of speed. The most straightforward one is to reduce the `m` and `ef_construct` parameters, the opposite of what we did in the quality section. However, this comes at the cost of the quality of the results.

Qdrant also provides a number of quantization techniques, and two of them are primarily used to increase speed and reduce memory at the same time:

1. **Scalar Quantization** - uses `int8` instead of `float32` to store each vector dimension
2. **Binary Quantization** - uses `bool` values instead of `float32` to store each vector dimension

The first one reduces memory usage by up to 4x and the second by up to 32x, and both increase the speed of the search. However, the quality of the search results is reduced, and Binary Quantization is not suitable for every use case: it only works well with some specific models, usually the ones with high dimensionality.
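The idea behind both techniques can be sketched in a few lines of plain Python. This is a simplified illustration, not Qdrant's implementation: the fixed value range, the function names, and the sample vector are assumptions made for the example.

```python
def scalar_quantize(vector, lo, hi):
    """Map floats from the range [lo, hi] to integers in [-128, 127]."""
    scale = 255 / (hi - lo)
    return [
        max(-128, min(127, round((x - lo) * scale) - 128))
        for x in vector
    ]


def binary_quantize(vector):
    """Keep only the sign of each dimension: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming_distance(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


vector = [0.12, -0.53, 0.98, -0.07]
int8_vector = scalar_quantize(vector, lo=-1.0, hi=1.0)
bits = binary_quantize(vector)
other_bits = binary_quantize([0.5, 0.5, -0.5, -0.5])
distance = hamming_distance(bits, other_bits)

# Storage per 1024-dimensional vector, in bytes
dimensions = 1024
float32_bytes = dimensions * 4   # 4 bytes per dimension
int8_bytes = dimensions * 1      # 1 byte per dimension
binary_bytes = dimensions // 8   # 1 bit per dimension

print(float32_bytes // int8_bytes, float32_bytes // binary_bytes)
```

Comparing bit vectors reduces to counting differing bits, which is much cheaper than floating-point arithmetic. That is where the speed-up comes from, and also why so much precision is lost.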

In our case, we're going to set up binary quantization anyway. From the LlamaIndex perspective, the search operations are issued in exactly the same way.

In [15]:
client.update_collection(
    collection_name="hacker-news",
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(
            always_ram=True,
        )
    )
)

True

In [16]:
while True:
    collection = client.get_collection("hacker-news")
    if collection.status == models.CollectionStatus.GREEN:
        break
    time.sleep(1.0)
        
collection

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=47037, indexed_vectors_count=45000, points_count=47037, segments_count=2, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=1024, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=True), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=32, ef_construct=200, full_scan_threshold=10000, max_indexing_threads=0, on_disk=True, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=BinaryQuantization(

In [17]:
nodes = retriever.retrieve("What is the best way to learn programming?")
for i, node in enumerate(nodes):
    print(i + 1, node.text, end="\n\n")

1 Ask HN: Where do you go to find recommendations for physical programming books?

I&#x27;m old school and like sitting down with a book both for learning actual coding and also for methodologies and philosophies. I don&#x27;t know where to go for recommendations. Any help? Thanks!

2 Ask HN: What to learn in order to get a software job in a decent country?

A good friend of mine is 18 and Russian. He is a programming prodigy and is trying to formulate a plan to get out. He&#x27;s thinking about his future CV and applying for jobs. What would be the best frameworks to invest time in getting experience with now?

3 Ask HN: What is the best way to get into building electronics as a programmer?

I am asking not only about learning what is taught in classes for solving ideal problems. I am talking about the real engineering like a hobbyist who actually understands what works in real life and how to build it properly.

4 Ask HN: Best tools for 4/5 year old to learn programming?

I&#x27;m lo

In [18]:
%%timeit -n 100 -r 5
retriever.retrieve("What is the best way to learn programming?")

144 ms ± 23.5 ms per loop (mean ± std. dev. of 5 runs, 100 loops each)
