# Binary Quantization with Qdrant

Binary quantization is a promising approach for speeding up retrieval and reducing the memory footprint of vector search engines. In this notebook, we show how to use Qdrant to apply binary quantization to vectors and run fast similarity search on the resulting index.

## Table of Contents
1. Imports
2. Download and Slice Dataset
3. Create Qdrant Collection
4. Indexing
5. Search

## 1. Imports

In [1]:
!pip install qdrant-client==1.5.1 pandas datasets --quiet --upgrade

In [2]:
import pandas as pd
import uuid
from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import PointStruct

## 2. Download and Slice Dataset

We will be using the [dbpedia-entities-openai-1M](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M) dataset, loaded with the [HuggingFace Datasets](https://huggingface.co/datasets) library. It contains 1M vectors of 1536 dimensions each. We will use the first 500K vectors here.
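
To make the memory argument concrete, here is a rough back-of-the-envelope sketch for the full 1M-vector dataset. It assumes the original vectors are stored as 32-bit floats and that binary quantization keeps one bit per dimension. `storage_bytes` is a hypothetical helper written for this illustration, not part of any library, and index and payload overhead are ignored.

```python
def storage_bytes(n_vectors: int, dim: int, bits_per_dim: int) -> int:
    """Raw vector storage in bytes, ignoring index and payload overhead."""
    bits_per_vector = dim * bits_per_dim
    # Round each vector up to a whole number of bytes.
    bytes_per_vector = (bits_per_vector + 7) // 8
    return n_vectors * bytes_per_vector


n_vectors, dim = 1_000_000, 1536
full = storage_bytes(n_vectors, dim, bits_per_dim=32)   # assumption: float32 storage
binary = storage_bytes(n_vectors, dim, bits_per_dim=1)  # assumption: 1 bit per dimension

print(f"float32: {full / 1e9:.2f} GB")
print(f"binary:  {binary / 1e6:.0f} MB")
print(f"ratio:   {full // binary}x")
```

Under these assumptions the quantized vectors are 32 times smaller than the originals, which is what makes it practical to keep them in RAM while the full-precision vectors stay on disk.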

In [3]:
import datasets
dataset = datasets.load_dataset("KShivendu/dbpedia-entities-openai-1M", split="train[0:500000]")

In [4]:
# !wget https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M/resolve/main/data/train-00000-of-00026-3c7b99d1c7eda36e.parquet -O train.parquet

In [5]:
# df = pd.read_parquet('train.parquet')
# len(df)
# dataset = df[:30000]
# del df

In [6]:
len(dataset)
dataset[0]

{'_id': '<dbpedia:Animalia_(book)>',
 'title': 'Animalia (book)',
 'text': "Animalia is an illustrated children's book by Graeme Base. It was originally published in 1986, followed by a tenth anniversary edition in 1996, and a 25th anniversary edition in 2012. Over three million copies have been sold.   A special numbered and signed anniversary edition was also published in 1996, with an embossed gold jacket.",
 'openai': [0.017398979514837265,
  -0.01408793218433857,
  -0.010348621755838394,
  -0.02228245511651039,
  -0.010668220929801464,
  0.025388959795236588,
  -0.030783794820308685,
  -0.027204282581806183,
  0.005008119158446789,
  -0.012988511472940445,
  0.007529756985604763,
  0.006433531641960144,
  -0.007580892648547888,
  0.002420963952317834,
  0.006503843702375889,
  0.024315105751156807,
  0.02932642214000225,
  0.027229851111769676,
  -0.013397597707808018,
  -0.0027149950619786978,
  0.0036338428035378456,
  0.04625239595770836,
  0.0095304474234581,
  -0.020863434299

## 3. Create Qdrant Collection

In [7]:
from concurrent.futures import ThreadPoolExecutor

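def make_point(row):
    """Convert one dataset row into a Qdrant point."""
    return PointStruct(
        id=uuid.uuid4().hex,  # random UUID, so IDs differ on every run
        vector=row["openai"],  # 1536-dimensional OpenAI embedding
        payload={
            "text": row["text"],
        }
    )

# Build all points up front; executor.map preserves the dataset order.
with ThreadPoolExecutor(max_workers=3) as executor:
    points = list(executor.map(make_point, dataset))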

In [8]:
len(points)

500000

In [10]:
from qdrant_client import QdrantClient

# client = QdrantClient(
#     url="https://<your-cluster-id>.<region>.aws.cloud.qdrant.io:6333",
#     api_key="<your-api-key>",
#     prefer_grpc=True
# )

client = QdrantClient(
    url="http://localhost:6333",  # REST port; gRPC uses grpc_port (default 6334)
    timeout=600,
    prefer_grpc=True,
)

collection_name = "binary-quantization"
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=1536,  # dimensionality of the OpenAI embeddings
        distance=models.Distance.COSINE,
        on_disk=True,  # keep the original float vectors on disk
    ),
    quantization_config=models.BinaryQuantization(
        # keep the binary-quantized vectors in RAM for fast search
        binary=models.BinaryQuantizationConfig(always_ram=True),
    ),
)

True
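
As a rough illustration of what the `BinaryQuantization` setting does conceptually, the toy sketch below reduces each vector to one bit per dimension and compares vectors by counting differing bits. This is not Qdrant's implementation: the sign-based rule (positive becomes 1, everything else 0) and the helper names `binarize` and `hamming_distance` are assumptions made for illustration.

```python
from typing import List


def binarize(vector: List[float]) -> int:
    """Pack the sign of each component into the bits of one integer."""
    bits = 0
    for value in vector:
        # Assumed rule: positive components map to 1, everything else to 0.
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two packed vectors differ."""
    return bin(a ^ b).count("1")


# Toy 4-dimensional vectors (real embeddings here have 1536 dimensions).
query = [0.2, -0.1, 0.4, -0.3]
close = [0.1, -0.2, 0.3, -0.5]   # same signs as the query
far = [-0.3, 0.2, -0.1, 0.4]     # opposite signs

print(hamming_distance(binarize(query), binarize(close)))  # 0
print(hamming_distance(binarize(query), binarize(far)))    # 4
```

Bit comparisons like this are much cheaper than floating-point distance computations, which is where the speedup comes from. The trade-off is lost precision, since vectors with matching signs become indistinguishable.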

In [11]:
client.upsert(
    collection_name=collection_name,
    points=points,
)

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Socket closed"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2023-09-15T20:29:54.714913+05:30", grpc_status:14, grpc_message:"Socket closed"}"
>
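
The `Socket closed` error above is plausibly caused by sending all 500,000 points in a single request, which can exceed message size limits or time out. That diagnosis is an assumption; the traceback does not confirm it. One possible workaround is to upsert in smaller batches, sketched below. `batched` and `upsert_in_batches` are hypothetical helpers, the batch size of 256 is arbitrary, and the only client call used is the same `client.upsert(collection_name=..., points=...)` as in the cell above.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def batched(items: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def upsert_in_batches(client: Any, collection_name: str,
                      points: Iterable[Any], batch_size: int = 256) -> int:
    """Upsert points batch by batch and return how many were sent."""
    sent = 0
    for batch in batched(points, batch_size):
        # Each request now carries only batch_size points.
        client.upsert(collection_name=collection_name, points=batch)
        sent += len(batch)
    return sent


# The helper splits 10 items into batches of at most 4.
print([len(b) for b in batched(range(10), 4)])  # [4, 4, 2]
```

With the objects defined earlier in this notebook, the call would be `upsert_in_batches(client, collection_name, points)`. If failures persist, a smaller `batch_size` is the first thing to try.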

In [None]:
collection_info = client.get_collection(collection_name=collection_name)

In [None]:
# client.delete_collection(collection_name=collection_name)