## Evaluating Precision@k for Qdrant Retrieval Methods

This notebook aims to systematically evaluate the Precision@k for various retrieval methods offered by Qdrant. We will explore the following retrieval methods:

- **Search**: Finds the k nearest neighbors to a given target vector.
- **Recommend**: Identifies the k nearest neighbors to the positive vectors while distancing from the negative vectors, with two strategies:
    - AVERAGE_VECTOR
    - BEST_SCORE
- **Discovery**: Combines target vectors with (positive, negative) pairs for refined retrieval.

We begin by loading our dataset, preparing it for insertion into Qdrant, and subsequently evaluating each method's performance in terms of Precision@k. 

In the previous notebooks, we created data points with their corresponding embeddings, which we will now use to test the different retrieval techniques. We start by running the Qdrant container specified in the `docker-compose.yml` file.


In [1]:
!cd .. && docker-compose up -d qdrant

[1A[1B[0G[?25l[+] Running 0/0
 [33m⠋[0m Network stilsucher_default  Creating                                    [34m0.0s [0m
[?25h[1A[1A[0G[?25l[+] Running 0/1
 [33m⠙[0m Network stilsucher_default     Created                                  [34m0.1s [0m
 [33m⠋[0m Container stilsucher-qdrant-1  Start...                                 [34m0.1s [0m
[?25h[1A[1A[1A[0G[?25l[+] Running 0/2
 [33m⠹[0m Network stilsucher_default     Created                                  [34m0.2s [0m
 [33m⠙[0m Container stilsucher-qdrant-1  Start...                                 [34m0.2s [0m
[?25h[1A[1A[1A[0G[?25l[+] Running 1/2
 [33m⠸[0m Network stilsucher_default     Created                                  [34m0.2s [0m
 [32m✔[0m Container stilsucher-qdrant-1  [32mStart...[0m                                 [34m0.2s [0m
[?25h

In [2]:
from qdrant_client import QdrantClient
from qdrant_client.http import models
import numpy as np
from glob import glob
import pickle

# Global variable for the number of nearest neighbors to retrieve
k = 5

### Utility Functions for Data Loading and Preparation

Before we start, we define utility functions for loading our data, preparing vectors for Qdrant, and inserting data into a Qdrant collection. These functions streamline the process of managing our dataset and interacting with Qdrant.


In [3]:
def load_embeddings(file_pattern="embedding_*.pkl"):
    """Load embedding data from pickle files."""
    all_data = []
    for filename in sorted(glob(file_pattern)):
        with open(filename, "rb") as file:
            all_data.extend(pickle.load(file))
    return all_data

import time  # used to pause between collection status checks

def insert_into_qdrant(organized_data, qclient, collection_name="fclip"):
    """Insert data into a Qdrant collection and wait until it is ready."""
    points = [
        models.PointStruct(
            id=int(item['id']),
            payload=item['metadata'],  # Ensure 'metadata' is suitable for payload
            vector=item['target_ground_truth_image']
        )
        for item in organized_data
    ]
    response = qclient.upload_points(
        collection_name=collection_name,
        points=points,
        parallel=4,
        max_retries=3
    )
    # Poll until the collection reports GREEN, sleeping between checks
    # so the loop does not flood the server with requests
    while True:
        collection_info = qclient.get_collection(collection_name=collection_name)
        if collection_info.status == models.CollectionStatus.GREEN:
            break
        time.sleep(0.5)
    return response

def prepare_vector_for_qdrant(vector):
    """Prepare a vector for Qdrant by ensuring it's a list of floats."""
    return vector.tolist() if isinstance(vector, np.ndarray) else vector


### Loading Organized Data

Next, we load the organized dataset prepared for insertion and evaluation in Qdrant. In this case, that is the `article_embeddings_0.pkl` file, which is matched by the `article_embeddings_*.pkl` pattern.

In [4]:
organized_data = load_embeddings(file_pattern='article_embeddings_*.pkl')

# Ensure vectors are lists of floats, as required by Qdrant
for item in organized_data:
    for key in ['target', 'enhanced_target', 'positive', 'negative', 'target_ground_truth_image']:
        item[key] = prepare_vector_for_qdrant(item[key])
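
Because every later cell indexes into these records by key, a quick structural check can catch a malformed pickle early. The sketch below is a minimal example and not part of the original notebook: `REQUIRED_KEYS` is assembled by hand from the keys that the cells in this notebook read, and `find_incomplete_records` is a hypothetical helper name.

```python
# Keys that the cells in this notebook read from each record (collected by hand).
REQUIRED_KEYS = {
    "id", "metadata",
    "target", "enhanced_target", "positive", "negative",
    "target_ground_truth_image",
    "target_text", "enhanced_target_text", "positive_text", "negative_text",
}

def find_incomplete_records(records, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for records lacking any required key."""
    problems = []
    for index, record in enumerate(records):
        missing = sorted(required_keys - set(record))
        if missing:
            problems.append((index, missing))
    return problems
```

Calling `find_incomplete_records(organized_data)` right after loading should return an empty list if every record has the expected shape.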


### Initializing Qdrant Client and Inserting Data

With our dataset ready and utility functions in place, the next step is to initialize the Qdrant client, create the necessary collection, and insert our dataset into Qdrant. This setup is crucial for the subsequent retrieval and evaluation operations.


In [5]:
# Initialize Qdrant client
client = QdrantClient("http://localhost:6333")
# client = QdrantClient(path="./qdata")

# Collection name
collection_name = "fclip"

# Create the collection with the appropriate configuration
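def create_collection(client, collection_name, force=False):
    """Create the collection if missing. With force=True (assumed intent), drop and recreate it."""
    collections_list = [collection.name for collection in client.get_collections().collections]
    if force and collection_name in collections_list:
        client.delete_collection(collection_name=collection_name)
        collections_list.remove(collection_name)
    if collection_name not in collections_list:
        client.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(size=512, distance=models.Distance.COSINE),
        )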

create_collection(client=client, collection_name=collection_name)

# Insert data into Qdrant
insert_into_qdrant(organized_data, client, collection_name=collection_name)


In [6]:
client.get_collection('fclip')

CollectionInfo(status=<CollectionStatus.GREEN: 'green'>, optimizer_status=<OptimizersStatusOneOf.OK: 'ok'>, vectors_count=3771, indexed_vectors_count=0, points_count=3771, segments_count=8, config=CollectionConfig(params=CollectionParams(vectors=VectorParams(size=512, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None), shard_number=1, sharding_method=None, replication_factor=1, write_consistency_factor=1, read_fan_out_factor=None, on_disk_payload=True, sparse_vectors=None), hnsw_config=HnswConfig(m=16, ef_construct=100, full_scan_threshold=10000, max_indexing_threads=0, on_disk=False, payload_m=None), optimizer_config=OptimizersConfig(deleted_threshold=0.2, vacuum_min_vector_number=1000, default_segment_number=0, max_segment_size=None, memmap_threshold=None, indexing_threshold=20000, flush_interval_sec=5, max_optimization_threads=1), wal_config=WalConfig(wal_capacity_mb=32, wal_segments_ahead=0), quantization_config=None), payload_schema={})

### Evaluating Retrieval Methods

With our data now inserted into Qdrant, we can proceed to evaluate the Precision@k for each retrieval method. This evaluation will provide insights into the effectiveness of each method in retrieving relevant items from our dataset.
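
Each query in this notebook has exactly one ground-truth item, so the per-query score reduces to a binary hit: 1 if the item's id appears among the top k results and 0 otherwise. The reported Precision@k is the mean of these hits. A minimal sketch of that computation, independent of Qdrant (the function names are illustrative and the ids in the example are made up):

```python
def hit_at_k(expected_id, retrieved_ids, k):
    """Return True when the ground-truth id is among the first k retrieved ids."""
    return expected_id in retrieved_ids[:k]

def mean_hit_rate(expected_ids, retrieved_lists, k):
    """Average the per-query hits, mirroring the np.mean call in the evaluation functions."""
    hits = [hit_at_k(e, r, k) for e, r in zip(expected_ids, retrieved_lists)]
    return sum(hits) / len(hits) if hits else 0.0

# Toy example: two of the three queries find their item in the top 2.
print(mean_hit_rate([1, 2, 3], [[1, 9], [8, 2], [7, 6]], k=2))
```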

The goal is to explore various ways of performing Text2Image retrieval. Beyond the traditional search method where we specify a query and receive results, we also want to specify attributes we want to be considered in our search and attributes we want to avoid. This can be achieved using the `recommend` and `discover` APIs provided by Qdrant.

We aim to simulate five situations (the Recommendations API counts twice, once per strategy):
1. **Simple Text2Image search**: Users provide a description for an item they are looking for. For this, we use the `detail_desc` column in `articles.csv`.
2. **Text2Image search with enhanced description**: Users not only provide a description of what they are looking for, but also specify an attribute from the item they are looking for and something they want to avoid. To do this, we append the following to the end of the `detail_desc`: `It has to be {category value for that item}, not {another random value from the same category}`.
3. **Recommendations API**: For this strategy, we use the `detail_desc` and a `positive_sample` for the positive vectors and `negative_sample` for the negative vectors. We will test two strategies:
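    - `AVERAGE_VECTOR`: The positive and negative vectors are each averaged and combined into a single query vector (per Qdrant's documentation, `avg_positive + avg_positive - avg_negative`), which is then used for a regular search.
    - `BEST_SCORE`: Each candidate is compared against every positive and negative vector individually; its score is derived from the closest positive example, or penalized when a negative example is closer.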
4. **Discovery API**: This is similar to the `Recommendations API`, but the target will be `detail_desc` and the `positive` and `negative` pairs will be from the `positive_sample` and `negative_sample`.
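
To make the difference between the two recommendation strategies concrete, the sketch below reimplements them in plain Python on toy vectors. It follows the formulas described in Qdrant's documentation, but it is an illustration under that assumption and not Qdrant's actual implementation; all function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(components) / len(vectors) for components in zip(*vectors)]

def average_vector_query(positives, negatives):
    """Build one query vector: avg_positive + (avg_positive - avg_negative)."""
    avg_pos = mean_vector(positives)
    if not negatives:
        return avg_pos
    avg_neg = mean_vector(negatives)
    return [2 * p - n for p, n in zip(avg_pos, avg_neg)]

def best_score(candidate, positives, negatives):
    """Score one candidate against its closest positive and closest negative."""
    best_pos = max(cosine(candidate, p) for p in positives)
    best_neg = max((cosine(candidate, n) for n in negatives), default=-math.inf)
    if best_pos > best_neg:
        return best_pos
    # A negative example is at least as close: penalize the candidate.
    return -(best_neg ** 2)

positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0]]
print(average_vector_query(positives, negatives))
print(best_score([1.0, 0.0], positives, negatives))
print(best_score([-1.0, 0.0], positives, negatives))
```

With `AVERAGE_VECTOR`, the examples collapse into one query before the search runs, whereas `BEST_SCORE` keeps every example separate and evaluates each candidate against all of them.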

In [7]:
def search_and_evaluate_precision(qclient, organized_data, collection_name="fclip", target='target', k=5):
    """Evaluate Precision@k using Qdrant's search API."""
    precisions = []
    bad_queries = []
    good_queries = []
    for item in organized_data:
        query_vector = prepare_vector_for_qdrant(item[target])
        results = qclient.search(
            collection_name=collection_name,
            query_vector=query_vector,
            limit=k
        )
        retrieved_ids = [result.id for result in results]
        precision = int(item['id']) in retrieved_ids
        precisions.append(precision)
        if not precision:
            bad_queries.append(item[target + "_text"])
        elif retrieved_ids.index(int(item['id'])) in (0, 1):
            # Keep queries whose ground-truth item ranked first or second
            good_queries.append(item[target + "_text"])
    print(f"When using Search with {target}")
    print("Some good queries were:")
    for good_query in good_queries[:5]:
        print("\t", good_query)
        print()
    print("Some bad queries were:")
    for bad_query in bad_queries[:5]:
        print("\t", bad_query)
        print()
    print("-----------------------")
    return np.mean(precisions)

def recommend_and_evaluate_precision(qclient, organized_data, collection_name="fclip", k=5, strategy=models.RecommendStrategy.AVERAGE_VECTOR):
    """Evaluate Precision@k using Qdrant's recommend API with specified strategy."""
    precisions = []
    bad_queries = []
    good_queries = []
    for item in organized_data:
        positive_vectors = [prepare_vector_for_qdrant(item['target']), prepare_vector_for_qdrant(item['positive'])]
        negative_vectors = [prepare_vector_for_qdrant(item['negative'])]
        results = qclient.recommend(
            collection_name=collection_name,
            positive=positive_vectors,
            negative=negative_vectors,
            strategy=strategy,
            limit=k
        )
        retrieved_ids = [result.id for result in results]
        precision = int(item['id']) in retrieved_ids
        precisions.append(precision)
        if not precision:
            bad_queries.append((item["target_text"], item["positive_text"], item["negative_text"]))
        elif retrieved_ids.index(int(item['id'])) in (0, 1):
            # Keep queries whose ground-truth item ranked first or second
            good_queries.append((item["target_text"], item["positive_text"], item["negative_text"]))
    print("When using the Recommendations API")
    print("Some good queries were:")
    for good_query in good_queries[:5]:
        print("\ttarget (positive): ", good_query[0])
        print("\tpositive: ", good_query[1])
        print("\tnegative: ", good_query[2])
        print()
    print("Some bad queries were:")
    for bad_query in bad_queries[:5]:
        print("\ttarget (positive): ", bad_query[0])
        print("\tpositive: ", bad_query[1])
        print("\tnegative: ", bad_query[2])
        print()
    print("-----------------------")
    return np.mean(precisions)

def discovery_and_evaluate_precision(qclient, organized_data, collection_name="fclip", k=5, ef=128):
    """Evaluate Precision@k using Qdrant's discovery API."""
    precisions = []
    bad_queries = []
    good_queries = []
    for item in organized_data:
        target_vector = prepare_vector_for_qdrant(item['target'])
        context = models.ContextExamplePair(
            positive=item['positive'],
            negative=item['negative']
        )
        results = qclient.discover(
            collection_name=collection_name,
            target=target_vector,
            context=[context],
            limit=k,
            # The context hard-constrains the search space, which lowers accuracy;
            # a larger hnsw_ef is used to mitigate this.
            search_params=models.SearchParams(hnsw_ef=ef)
        )
        retrieved_ids = [result.id for result in results]
        precision = int(item['id']) in retrieved_ids
        precisions.append(precision)
        if not precision:
            bad_queries.append((item["target_text"], item["positive_text"], item["negative_text"]))
        elif retrieved_ids.index(int(item['id'])) in (0, 1):
            # Keep queries whose ground-truth item ranked first or second
            good_queries.append((item["target_text"], item["positive_text"], item["negative_text"]))
    print("When using the Discovery API")
    print("Some good queries were:")
    for good_query in good_queries[:5]:
        print("\ttarget: ", good_query[0])
        print("\tpositive: ", good_query[1])
        print("\tnegative: ", good_query[2])
        print()
    print("Some bad queries were:")
    for bad_query in bad_queries[:5]:
        print("\ttarget: ", bad_query[0])
        print("\tpositive: ", bad_query[1])
        print("\tnegative: ", bad_query[2])
        print()
    print("-----------------------")
    return np.mean(precisions)


In [8]:
k = 16  # You can adjust k as needed

# Evaluate Precision@k for each retrieval method
precision_search_target = search_and_evaluate_precision(client, organized_data, collection_name=collection_name, k=k)
precision_search_enhanced_target = search_and_evaluate_precision(client, organized_data, collection_name=collection_name, target='enhanced_target', k=k)
precision_recommend_avg = recommend_and_evaluate_precision(client, organized_data, collection_name=collection_name, k=k, strategy=models.RecommendStrategy.AVERAGE_VECTOR)
precision_recommend_best = recommend_and_evaluate_precision(client, organized_data, collection_name=collection_name, k=k, strategy=models.RecommendStrategy.BEST_SCORE)
precision_discovery = discovery_and_evaluate_precision(client, organized_data, collection_name=collection_name, k=k)

# Print results
print(f"Precision@{k} for Search (target): {precision_search_target}")
print(f"Precision@{k} for Search (enhanced target): {precision_search_enhanced_target}")
print(f"Precision@{k} for Recommend (Average Vector): {precision_recommend_avg}")
print(f"Precision@{k} for Recommend (Best Score): {precision_recommend_best}")
print(f"Precision@{k} for Discovery: {precision_discovery}")


When using Search with target
Some good queries were:
	 Two-strand hairband with braids in imitation suede and elastic at the back.

	 Cardigan in a bouclé knit made from a wool blend with a shawl collar, zip at one side and long sleeves.

	 Tights with an elasticated waist. 20 denier.

	 Jacket in sweatshirt fabric with a lined drawstring hood, zip down the front, side pockets and ribbing at the cuffs and hem.

	 Loafers in imitation suede with moccasin seams, decorative laces, fabric linings and insoles and rubber soles.

Some bad queries were:
	 Leggings in stretch jersey with an elasticated waist.

	 Tops in soft organic cotton jersey.

	 Short-sleeved top in jersey with sewn-in turn-ups on the sleeves.

	 Knee-length shorts in sweatshirt fabric with a low crotch, elasticated drawstring waist, side pockets, back pockets and ribbed hems.

	 Thin tights with an elasticated waist.

-----------------------
When using Search with enhanced_target
Some good queries were:
	 Two-strand hair

## Conclusion and Comparison of Retrieval Methods

Upon evaluating the Precision@k for each method, we can draw comparisons to understand the strengths and weaknesses of each retrieval method in the context of our specific dataset and use case.

Here are the results:

- Precision@16 for Search (target): ~69%
- Precision@16 for Search (enhanced description): ~72%
- Precision@16 for Recommend (Average Vector): ~67%
- Precision@16 for Recommend (Best Score): ~66.7%
- Precision@16 for Discovery: ~64.3%

From the results, we observe that enhancing the description of the item we are looking for, by specifying attributes to consider and to avoid, improves the Precision@k of the Search method (~72% versus ~69%). Plain Search on the original description comes second, ahead of both Recommendations API strategies: `AVERAGE_VECTOR` (~67%) and, slightly behind it, `BEST_SCORE` (~66.7%). The Discovery API yields the lowest score (~64.3%) among the five configurations.
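
To make the ordering explicit, the approximate figures listed above can be sorted programmatically. This is only a convenience sketch over the rounded numbers reported in this section, not a re-run of the evaluation; the dictionary keys are labels chosen here for readability.

```python
# Rounded Precision@16 values copied from the results list above (in percent).
results = {
    "Search (target)": 69.0,
    "Search (enhanced description)": 72.0,
    "Recommend (Average Vector)": 67.0,
    "Recommend (Best Score)": 66.7,
    "Discovery": 64.3,
}

# Sort from best to worst and print a small ranking.
ranking = sorted(results.items(), key=lambda pair: pair[1], reverse=True)
for position, (method, score) in enumerate(ranking, start=1):
    print(f"{position}. {method}: ~{score}%")
```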

It's also important to note a few things:  

- Our method of obtaining positive and negative samples is somewhat naive and may not always provide useful context. A potential future experiment could involve using a Generative Language Model that receives the item information and generates more semantically meaningful positive and negative samples. This could potentially yield better embeddings for the Discovery API.
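- We only perform retrieval over an approximate nearest neighbors index, so the way the index is built might also affect the results. We could experiment with different index types and parameters to see if the results improve. The collection info printed earlier reports `indexed_vectors_count=0`, so at this dataset size the queries may have been served without an HNSW index at all, which is worth checking before tuning index parameters.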
- We are only retrieving candidates, but experimenting with a way to rank them could also be interesting. We could train a second model that concatenates the text and image embeddings and predicts a score, and then use this score to rank the candidates. Tools like `Quaterion` could be useful for this.
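
For the point about index effects, one way to separate them from embedding quality is to compare the ids returned by Qdrant with an exact, brute-force ranking over the same vectors. The sketch below is a pure-Python illustration on toy data: `exact_top_k` and `overlap_at_k` are hypothetical helpers, and the "approximate" ids in the example are made up, not taken from a Qdrant query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, vectors_by_id, k):
    """Rank every stored vector by cosine similarity and keep the k best ids."""
    scored = sorted(
        vectors_by_id.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [point_id for point_id, _ in scored[:k]]

def overlap_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy collection of 2-D vectors keyed by point id.
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [-1.0, 0.0]}
exact = exact_top_k([1.0, 0.0], vectors, k=2)
print(exact, overlap_at_k([1, 3], exact))
```

An overlap close to 1.0 across many queries would suggest the index is not the bottleneck. Qdrant's search parameters also expose an `exact` option that bypasses the index, which may be a simpler route if the client version in use supports it.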

In [9]:
client.close()

In [10]:
!cd .. && docker-compose stop

[1A[1B[0G[?25l[+] Stopping 0/0
 [33m⠋[0m Container stilsucher-qdrant-1  Stopp...                                 [34m0.1s [0m
[?25h[1A[1A[0G[?25l[34m[+] Stopping 1/1[0m
 [32m✔[0m Container stilsucher-qdrant-1  [32mStopp...[0m                                 [34m0.2s [0m
[?25h