# Enhancing OpenAI Embeddings with Qdrant's Binary Quantization

## Introduction

OpenAI's third-generation embedding models, `text-embedding-3-small` and `text-embedding-3-large` (informally "Ada-003"), are a powerful tool for natural language processing (NLP) tasks. However, the sheer size of the embeddings can make them challenging to work with, especially in applications that require real-time search and retrieval. In this article, we explore how Qdrant's Binary Quantization can be used to enhance the performance and efficiency of OpenAI embeddings, making them more accessible and practical for a wide range of applications.

We begin by discussing the significance of OpenAI embeddings and the challenges associated with their use in real-world applications. We then introduce Qdrant's Binary Quantization and explain how it can be used to improve the performance of OpenAI embeddings. We present the results of an experiment that demonstrates the effectiveness of this approach, highlighting the substantial improvements in search efficiency and accuracy. Finally, we discuss the implications of these findings for real-world applications and provide best practices for leveraging Binary Quantization to enhance OpenAI embeddings.

## New OpenAI Embeddings: Performance and Changes
As embedding models continue to advance, the demand for powerful and efficient text-embedding models has grown significantly. OpenAI's `text-embedding-3` models are a prime example, offering state-of-the-art performance on a wide range of NLP tasks according to both [MTEB](https://huggingface.co/spaces/mteb/leaderboard) and [MIRACL](https://openai.com/blog/new-embedding-models-and-api-updates).

### Multi-lingual Support
OpenAI text-embedding-3-large is a multi-lingual model that can encode text in 100+ languages. This makes it an attractive choice for applications that require support for diverse languages. 

Comparing `text-embedding-ada-002` to `text-embedding-3-large`, the average MIRACL score increased from 31.4% to 54.9%.

### Matryoshka Representation Learning
The new OpenAI models have been trained with a novel approach called "Matryoshka Representation Learning". This approach allows developers to request embeddings of different sizes (number of dimensions) for both the small and large variants. This flexibility enables developers to choose the right balance between accuracy and size for their specific use case.
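To make the idea of "embeddings of different sizes" concrete, here is a minimal sketch of shortening an embedding locally by keeping its leading components and rescaling to unit length. The helper name `shorten_embedding` and the toy vector are illustrative assumptions, not part of any API; in practice you would usually request the desired size from the embedding API directly.

```python
import math


def shorten_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length.

    Hypothetical helper for illustration only.
    """
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Toy stand-in for a real embedding (real ones have hundreds or thousands of dims).
full = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1]
short = shorten_embedding(full, 4)
print(len(short), short)
```

The renormalization step matters when the shortened vectors are compared with cosine or dot-product similarity, since truncation alone changes their length.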

As the results below show, binary quantization retains good accuracy across different dimensions for both models.

## Enhanced Performance and Efficiency with Binary Quantization
The efficiency gains from Binary Quantization are twofold: 

First, it reduces the overall storage footprint, which is particularly beneficial for applications dealing with large-scale datasets. This reduction saves memory and lets you scale to up to 30x larger datasets at the same cost. Smaller data sizes also generally translate to faster retrieval. Second, Binary Quantization accelerates the search process itself. By reducing the distance calculations between vectors to bitwise operations, searches become significantly faster, enabling real-time querying even in extensive databases.
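Both gains can be illustrated with a small sketch. It assumes the simplest form of binary quantization (one bit per dimension, set when the component is positive) and 32-bit floats for the original vectors; Qdrant's internal implementation may differ in detail. The function names are illustrative.

```python
def binarize(vector):
    """Pack one bit per dimension into an int: 1 if the component is > 0, else 0."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits: a single XOR followed by a popcount."""
    return bin(a ^ b).count("1")


def storage_bytes(dims, bits_per_dim):
    """Raw bytes needed to store one vector of `dims` dimensions."""
    return dims * bits_per_dim // 8


doc = [0.12, -0.40, 0.03, -0.07, 0.55, -0.21, 0.09, 0.30]
query = [0.10, -0.35, -0.02, -0.01, 0.60, 0.15, 0.11, 0.25]
distance = hamming_distance(binarize(doc), binarize(query))
print("Hamming distance:", distance)

full_size = storage_bytes(3072, 32)   # float32 vector
binary_size = storage_bytes(3072, 1)  # binary-quantized vector
print(full_size, binary_size, full_size // binary_size)
```

Under these assumptions the raw per-vector ratio is 32x, which is in line with the "up to 30x" figure once index and payload overhead are taken into account.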

![](Accuracy_Models.png)

## Experiment Setup: OpenAI Embeddings in Focus

In our exploration of Binary Quantization's impact on search efficiency and accuracy, we centered our experiment around the powerful text-embedding models provided by OpenAI. These models, known for their robust ability to capture nuanced linguistic features and semantic relationships, served as the backbone of our analysis, allowing us to delve deep into the potential enhancements offered by Qdrant's Binary Quantization feature.

### Dataset

We use 100K random samples from the [OpenAI 1M](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M) dataset. From these, we select 100 records at random to act as queries and use their embeddings to search for the nearest neighbors in the dataset.
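The text does not spell out how accuracy is computed, so the following is only a sketch under one common assumption: accuracy is the fraction of the exact (brute-force) top-k neighbors that also appear in the approximate top-k. The function names and toy vectors are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k vectors with the highest dot product."""
    ranked = sorted(range(len(vectors)), key=lambda i: -dot(query, vectors[i]))
    return ranked[:k]


def accuracy_at_k(approx_ids, exact_ids):
    """Share of the exact neighbors that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]

truth = exact_top_k(query, vectors, k=2)
approx = [0, 1]  # pretend result of an approximate search
score = accuracy_at_k(approx, truth)
print(truth, score)
```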

### Experiment Parameters: Oversampling, Rescoring, and Search Limits

For each query, we run a parameter sweep to measure the impact of three key parameters on search accuracy and efficiency: oversampling, rescoring, and search limits.

- **Oversampling**: By oversampling, we aimed to mitigate the loss of information inherent in the quantization process, ensuring that the semantic richness of the OpenAI embeddings was preserved as much as possible. We experimented with different oversampling factors to observe how they affect the accuracy and efficiency of searches powered by Binary Quantization. In general, higher oversampling factors tend to improve the accuracy of searches but may come at the cost of increased computational overhead.

- **Rescoring**: Rescoring involves a secondary, more precise search step among the top candidates returned by the initial binary search. This process leverages the original high-dimensional vectors to refine the search results, which consistently improved accuracy in our experiments. We toggled rescoring on and off to measure its effectiveness in conjunction with Binary Quantization and to understand its impact on search performance.

- **Search Limits**: The search limit parameter defines the number of top results to consider in the search process. We experimented with various limits to observe how they affect the accuracy and efficiency of searches powered by Binary Quantization. By adjusting this parameter, we aimed to explore the trade-offs between search depth and performance, providing valuable insights for applications with different precision and speed requirements.
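How the three parameters interact can be sketched as a brute-force search over toy vectors. This is an illustration under simplifying assumptions (sign-based binarization, Hamming distance for the first pass, dot product for rescoring), not Qdrant's actual implementation; all names are hypothetical.

```python
def binarize(vector):
    """One bit per dimension: 1 if the component is > 0, else 0."""
    bits = 0
    for x in vector:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming(a, b):
    return bin(a ^ b).count("1")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search(query, vectors, codes, limit, oversampling=1.0, rescore=True):
    # Oversampling: fetch more candidates than requested from the binary index.
    n_candidates = max(limit, int(limit * oversampling))
    q_code = binarize(query)
    candidates = sorted(range(len(vectors)), key=lambda i: hamming(q_code, codes[i]))
    candidates = candidates[:n_candidates]
    # Rescoring: re-rank the candidates using the original full-precision vectors.
    if rescore:
        candidates.sort(key=lambda i: -dot(query, vectors[i]))
    # Search limit: return only the requested number of results.
    return candidates[:limit]


vectors = [
    [0.1, 0.9, 0.9, 0.9],      # same sign pattern as the query, but a weak match
    [1.0, -0.05, 0.1, 0.1],    # one sign differs, yet the true nearest neighbor
    [-0.5, -0.5, -0.5, -0.5],
]
codes = [binarize(v) for v in vectors]
query = [0.9, 0.1, 0.1, 0.1]

binary_only = search(query, vectors, codes, limit=1, oversampling=1.0, rescore=False)
refined = search(query, vectors, codes, limit=1, oversampling=2.0, rescore=True)
print(binary_only, refined)
```

In this toy case the binary pass alone ranks the weak match first, while oversampling by 2 lets the true nearest neighbor into the candidate set so that rescoring can promote it.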

Through this detailed setup, our experiment sought to shed light on the nuanced interplay between Binary Quantization and the high-quality embeddings produced by OpenAI's models. By meticulously adjusting and observing the outcomes under different conditions, we aimed to uncover actionable insights that could empower users to harness the full potential of Qdrant in combination with OpenAI's embeddings, regardless of their specific application needs.

## Results: Binary Quantization's Impact on OpenAI Embeddings

To analyze the performance difference between rescoring enabled (`True`) and disabled (`False`), we compare the accuracy results across different model configurations and search limits. As described above, rescoring runs an additional, more precise search over the top candidates returned by the initial query to refine the results.

### Rescoring

![](Rescoring_Impact.png)

Here are some key observations regarding the performance difference when rescoring is enabled versus when it is disabled:

1. **Significant Accuracy Improvement with Rescoring**:
   - Across all models and dimension configurations, enabling rescoring (`True`) consistently results in higher accuracy scores compared to when rescoring is disabled (`False`).
   - The improvement in accuracy is evident across various search limits (10, 20, 50, 100), underscoring the effectiveness of rescoring in refining the search results.

2. **Model and Dimension Specific Observations**:
   - For the `text-embedding-3-large` model with 3072 dimensions, rescoring boosts the accuracy from an average of about 76-77% without rescoring to 97-99% with rescoring, depending on the search limit and oversampling rate.
   - With the `text-embedding-3-small` model at 512 dimensions, accuracy increases from around 53-55% without rescoring to 71-91% with rescoring, highlighting the significant impact of rescoring, especially at lower dimensions.
   - For higher-dimension models (e.g., `text-embedding-3-large` with 3072 dimensions), the accuracy improvement with increased oversampling is more pronounced when rescoring is enabled, indicating a better utilization of the additional binary codes in refining search results.
   - In contrast, for lower-dimension models (e.g., `text-embedding-3-small` with 512 dimensions), the incremental accuracy gains from increased oversampling are less significant, even with rescoring enabled. This suggests diminishing returns from higher oversampling in lower-dimension spaces.

3. **Influence of Search Limit**:
   - The performance gain from rescoring seems to be relatively stable across different search limits, suggesting that rescoring consistently enhances accuracy regardless of the number of top results considered.

In summary, enabling rescoring dramatically improves search accuracy across all tested configurations, making it a crucial feature for applications where precision is paramount. The consistent performance boost provided by rescoring underscores its value in refining search results, particularly when working with complex, high-dimensional data like OpenAI embeddings. This enhancement is critical for applications that demand high accuracy, such as semantic search, content discovery, and recommendation systems, where the quality of search results directly impacts user experience and satisfaction.

The following notebook code loads the results for each configuration and prints the average accuracy by search limit, oversampling, and rescoring:

```python
import pandas as pd

# Model / dimension pairs evaluated in the experiment.
dataset_combinations = [
    {"model_name": "text-embedding-3-large", "dimensions": 3072},
    {"model_name": "text-embedding-3-large", "dimensions": 1024},
    {"model_name": "text-embedding-3-large", "dimensions": 1536},
    {"model_name": "text-embedding-3-small", "dimensions": 512},
    {"model_name": "text-embedding-3-small", "dimensions": 1024},
    {"model_name": "text-embedding-3-small", "dimensions": 1536},
]

for combination in dataset_combinations:
    model_name = combination["model_name"]
    dimensions = combination["dimensions"]
    print(f"Model: {model_name}, dimensions: {dimensions}")

    # One JSON record per line, one results file per model/dimension pair.
    results = pd.read_json(
        f"../results/results-{model_name}-{dimensions}.json", lines=True
    )

    # Exclude the smallest search limits (1 and 5) before averaging.
    results = results[~results["limit"].isin([1, 5])]

    # Mean accuracy for every (oversampling, rescore, limit) combination.
    average_accuracy = (
        results.groupby(["oversampling", "rescore", "limit"])["accuracy"]
        .mean()
        .reset_index()
    )

    # Rows: search limit. Columns: (oversampling, rescore) pairs.
    acc = average_accuracy.pivot(
        index="limit", columns=["oversampling", "rescore"], values="accuracy"
    )
    print(acc)
```

### Impact of Oversampling

![](Oversampling_Impact.png)

## Leveraging Binary Quantization: Best Practices
We recommend the following best practices for leveraging Binary Quantization to enhance OpenAI embeddings:

1. Embedding Model: According to MTEB, `text-embedding-3-large` is the most accurate of these models, so we recommend using it.
2. Dimensions: We recommend using the highest dimension available for the model, as it provides the best accuracy. This holds for English and other languages.
3. Oversampling: We recommend an oversampling factor of 3, which provides a good balance between accuracy and efficiency and suits a wide range of applications.
4. Rescoring: We recommend enabling rescoring to improve the accuracy of search results.
5. RAM: We recommend that the full vectors and payload be stored on disk, and only the binary quantization index be loaded into memory. This will help reduce the memory footprint and improve the overall efficiency of the system. The incremental latency from the disk read is negligible compared to the latency savings from the binary scoring in Qdrant, which uses SIMD instructions where possible.
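The recommendations above could translate into request bodies along these lines. This is a sketch based on our understanding of the Qdrant REST API; the collection name, the cosine distance, and the result limit are assumptions, and the field names should be checked against the documentation for your Qdrant version.

```python
import json

COLLECTION = "openai_embeddings"  # hypothetical collection name

# Body for creating the collection (PUT /collections/{COLLECTION}).
create_collection_body = {
    "vectors": {
        "size": 3072,          # highest dimension of text-embedding-3-large
        "distance": "Cosine",  # assumption: cosine similarity
        "on_disk": True,       # keep the full vectors on disk
    },
    "on_disk_payload": True,   # keep the payload on disk as well
    "quantization_config": {
        "binary": {"always_ram": True},  # only the binary index stays in memory
    },
}

# Search parameters (part of the body for POST /collections/{COLLECTION}/points/search).
search_body = {
    "limit": 10,  # assumption: example search limit
    "params": {
        "quantization": {
            "ignore": False,
            "rescore": True,      # recommendation 4
            "oversampling": 3.0,  # recommendation 3
        }
    },
}

print(json.dumps(create_collection_body, indent=2))
print(json.dumps(search_body, indent=2))
```

A real search request would also carry the query embedding in the body; it is omitted here to keep the focus on the quantization settings.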

## Conclusion
TBD

## Call to Action
TBD