### Lab 3 - Product quantization in FAISS

Following on from Lab 2, where we used a flat index with full vectors: that approach soon becomes a problem with very large datasets, because every vector is stored in full. The solution is to compress the vectors, much like compressing a file to save disk space. FAISS provides a compression technique called Product Quantization (PQ). Because the compressed codes are much smaller than the original vectors, Product Quantization reduces memory use and can speed up vector search, at the cost of some accuracy.

Let's use the example of Canada to illustrate Product Quantization. Each Canadian province is divided into multiple municipal wards. Groups of municipal wards are clustered to form a city, a town, or a rural municipality. Each clustered group of wards is represented by a city council, townsmen, or a rural municipal council. In the same vein, we divide each vector into sub-vectors. The sub-vectors are clustered, and each cluster is summarized by a centroid. Each sub-vector is then represented by the ID of its nearest centroid.
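
The idea above can be sketched in plain NumPy. This is a minimal illustration rather than how FAISS implements PQ internally: the helper names (`kmeans`, `pq_train`, `pq_encode`, `pq_decode`), the random data, and the sizes (16 dimensions, 4 sub-vectors, 16 centroids) are all assumptions chosen to keep the example small.

```python
import numpy as np

def kmeans(x, n_centroids, n_iter=10, seed=0):
    """Tiny k-means; returns centroids of shape (n_centroids, x.shape[1])."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), n_centroids, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared L2 distance).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_centroids):
            members = x[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def pq_train(x, m, n_centroids):
    """Learn one codebook per sub-vector position."""
    return [kmeans(sub, n_centroids) for sub in np.split(x, m, axis=1)]

def pq_encode(x, codebooks):
    """Replace each sub-vector with the ID of its nearest centroid."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by looking up each centroid ID."""
    return np.hstack([cb[codes[:, j]] for j, cb in enumerate(codebooks)])

# Hypothetical data: 500 random vectors with 16 dimensions.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(500, 16)).astype(np.float32)

codebooks = pq_train(vectors, m=4, n_centroids=16)
codes = pq_encode(vectors, codebooks)
approx = pq_decode(codes, codebooks)

print("code shape:", codes.shape, "dtype:", codes.dtype)
print("storage:", vectors.nbytes, "bytes ->", codes.nbytes, "bytes")
print("mean squared reconstruction error:", float(np.mean((vectors - approx) ** 2)))
```

Each vector is now stored as 4 small integers instead of 16 floats. The reconstruction is only approximate, which is the accuracy cost of the compression.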

In the previous lab we used the IndexIVFFlat index for index partitioning. In this lab we will use the IndexIVFPQ index for product quantization.

In [None]:
# 1. This lab cannot be run on its own; you should have com
# 2. You will split the documents into sentences
# 3. Create a new index and train it on the data
# 4. Split the index into partitions of Voronoi cells
# 5. Given a query, e.g. "Who plays foot ", you find the k most similar sentences
# 6. Adjust the "nprobe" parameter to explore the trade-off between speed and accuracy

In [None]:
# You will need the Python libraries below for this tutorial. A basic understanding of Python is required.
# You can install the libraries using pip if they are not pre-installed in your notebook.

In [None]:
!pip install faiss-cpu
import requests
from io import StringIO
import pandas as pd
import numpy as np
import faiss

In [None]:
res = requests.get('https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt')
# create dataframe
data = pd.read_csv(StringIO(res.text), sep='\t', on_bad_lines='skip')
data.head()

In [None]:
# Put all the sentences from the sentence_A column into a Python list
# You will get about 4.5K sentences
sentences = data['sentence_A'].tolist()
sentences[:5]
len(sentences)

In [None]:
# You take all samples from both sentence A and B and merge them together
# You will get ~4.8K unique sentences
sentences = data['sentence_A'].tolist()
sentence_b = data['sentence_B'].tolist()
sentences.extend(sentence_b)   
len(set(sentences))  

In [None]:
# The dataset is still small. You are going to add more data by parsing the data from the URLs below
urls = [
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.train.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2013/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/OnWN.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/images.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/images.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/headlines.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/belief.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/answers-students.test.tsv',
    'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/answers-forums.test.tsv'
]

In [None]:
# Each of these datasets has the same structure, so we loop through them and add each one to our sentences data
for url in urls:
    res = requests.get(url)
    # extract to dataframe
    data = pd.read_csv(StringIO(res.text), sep='\t', header=None, on_bad_lines='skip')
    # add columns 1 and 2 to the sentences list
    sentences.extend(data[1].tolist())
    sentences.extend(data[2].tolist())

In [None]:
# Let's clean up the data by removing duplicates and NaN values
# You will get approximately 25K sentences
sentences = [s for s in set(sentences) if isinstance(s, str)]
len(sentences)

In [None]:
# You need to install sentence_transformers library. This framework provides an easy method to compute 
# dense vector representations for sentences, paragraphs, and images.
# For additional reading https://pypi.org/project/sentence-transformers/
!pip install sentence-transformers

In [None]:
# The models are based on transformer networks such as BERT / RoBERTa / XLM-RoBERTa
# and achieve state-of-the-art performance in various tasks. See the PyPI link above for the supported models.
# Initialize the sentence transformer model.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# create sentence embeddings using the multi-qa-MiniLM-L6-cos-v1 model from Hugging Face
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

In [None]:
# Let's get the dimensions 
d = sentence_embeddings.shape[1]
d

In [None]:
# IndexIVFPQ: inverted file index with product quantizer encoding.
# Each residual vector is encoded as a product quantizer code.
# nlist: the number of clusters to form. These clusters are the Voronoi cells.
# quantizer: assigns the vectors to a particular cluster. This is usually another index that uses the L2 (Euclidean) distance metric (we use the IndexFlatL2 index).
# m: the number of sub-vectors that each vector is split into. The dimension d has to be a multiple of m.
# nbits: the number of bits in each sub-vector code, so each sub-vector position has 2**nbits centroids.
nlist = 50
quantizer = faiss.IndexFlatL2(d)
m = 8
nbits = 8
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits, faiss.METRIC_L2)
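
To get a feel for how much these settings compress each vector, the arithmetic can be worked through directly. This is a rough sketch under stated assumptions: `d = 384` is the embedding size produced by `multi-qa-MiniLM-L6-cos-v1`, the variable names are illustrative, and the figures cover only the vector data itself (the index also stores an ID with every code, plus the codebooks and cluster centroids).

```python
# Assumed values: d = 384 is the embedding size of multi-qa-MiniLM-L6-cos-v1;
# 8 sub-vectors and 8 bits per code are the PQ settings used in this lab.
d = 384
n_subvectors = 8
bits_per_code = 8

sub_dim = d // n_subvectors                    # dimensions per sub-vector
centroids_per_sub = 2 ** bits_per_code         # codebook size per sub-vector position
bytes_full = d * 4                             # float32 uses 4 bytes per dimension
bytes_pq = n_subvectors * bits_per_code // 8   # one code per sub-vector

print("dimensions per sub-vector:", sub_dim)
print("centroids per sub-vector position:", centroids_per_sub)
print("bytes per full vector:", bytes_full)
print("bytes per PQ code:", bytes_pq)
print("compression ratio:", bytes_full // bytes_pq)
```

Under these assumptions, a 1,536-byte vector shrinks to an 8-byte code.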

In [None]:
# Check to see if the index is trained.
# When using a product quantizer in an inverted file index, training is required. The attribute below will return False.
index.is_trained

In [None]:
# Let's train the index and check that it is now trained. It should return True
index.train(sentence_embeddings)
index.is_trained  

In [None]:
# Let's add the vectors to the index
index.add(sentence_embeddings)

In [None]:
index.ntotal

In [None]:
%%time
# Now that our index is trained and populated, we are ready to search it
# "xq" is the query vector
# "nprobe" specifies the number of clusters to visit during the search operation
# "k" specifies the number of similar vectors to return from the visited clusters
# Then search with the query `xq` and the number of nearest neighbors to return, `k`
index.nprobe = 2
k = 4
xq = model.encode(["Who is playing football"])
D, I = index.search(xq, k)  # search
print(I)

In [None]:
# You will get the 4 nearest locations returned by the query, along with how long it took to return the results.

In [None]:
# Let's see the 4 nearest neighbours returned for the query "Who is playing football"
for location in I[0].tolist():
    print(location, ":", sentences[location])

In [None]:
%%time
# "nprobe" parameter specifies the number of clusters to visit during the search operation
# Let's increase the number of clusters to search
index.nprobe = 4
k = 4
xq = model.encode(["Who is playing football"])
D, I = index.search(xq, k)  # search
print(I)

In [None]:
%%time
# "nprobe" parameter specifies the number of clusters to visit during the search operation
# Let's increase the number of clusters to search again
index.nprobe = 8
k = 4
xq = model.encode(["Who is playing football"])
D, I = index.search(xq, k)  # search
print(I)

In [None]:
%%time
# "nprobe" parameter specifies the number of clusters to visit during the search operation
# Let's increase the number of clusters to search once more
index.nprobe = 16
k = 4
xq = model.encode(["Who is playing football"])
D, I = index.search(xq, k)  # search
print(I)
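
Why does visiting more clusters change the results? The effect can be illustrated with a toy inverted-file search in plain NumPy. This is a simplified sketch, not the FAISS implementation: it uses random data, picks random database vectors as cluster centroids instead of running k-means, and skips product quantization entirely. All names and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.normal(size=(2000, 32))   # hypothetical database vectors
xq = rng.normal(size=(20, 32))     # hypothetical query vectors
nlist, k = 16, 4

def sq_dists(a, b):
    """Pairwise squared L2 distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

# Coarse quantizer: random database vectors act as the cell centroids.
centroids = xb[rng.choice(len(xb), nlist, replace=False)]
cell_of = sq_dists(xb, centroids).argmin(axis=1)
inverted_lists = [np.flatnonzero(cell_of == c) for c in range(nlist)]

# Ground truth from an exhaustive search over every database vector.
exact = sq_dists(xq, xb).argsort(axis=1)[:, :k]

def ivf_search(nprobe):
    """Search only the nprobe cells whose centroids are closest to each query."""
    cell_order = sq_dists(xq, centroids).argsort(axis=1)[:, :nprobe]
    results = []
    for q, cells in zip(xq, cell_order):
        candidates = np.concatenate([inverted_lists[c] for c in cells])
        dists = ((xb[candidates] - q) ** 2).sum(axis=1)
        results.append(candidates[dists.argsort()[:k]])
    return results

recalls = {}
for nprobe in (1, 2, 4, 8, 16):
    found = ivf_search(nprobe)
    hits = sum(len(set(r) & set(e)) for r, e in zip(found, exact))
    recalls[nprobe] = hits / (len(exact) * k)
    print(f"nprobe={nprobe:2d}  fraction of true neighbours found: {recalls[nprobe]:.2f}")
```

With a small `nprobe`, true neighbours that sit in unvisited cells are missed. Visiting every cell (`nprobe = nlist`) is equivalent to an exhaustive search, which is why larger values trade speed for accuracy.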

In [None]:
# In this lab, you learned how to optimize search further using product quantization.
# You should see a 0 to 30% improvement in search speed.
# You may have noticed that one of the result locations differs from those returned by the IndexIVFFlat index.
# This is due to the additional layer of approximation introduced to gain speed.

# Next step: compare the results in this notebook with those obtained using different optimization techniques.
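
One possible way to make that comparison concrete is to measure how many of the IDs returned by one index also appear in the results of another. The helper below is a hypothetical sketch: `overlap_at_k` is not part of FAISS, and the ID arrays are made-up stand-ins for the `I` arrays returned by `index.search` in each notebook. It ignores `-1` entries, which FAISS uses to pad results when fewer than `k` neighbours are found.

```python
import numpy as np

def overlap_at_k(reference_ids, candidate_ids):
    """Mean fraction of reference neighbours that also appear in the candidate results."""
    scores = []
    for ref, cand in zip(np.asarray(reference_ids), np.asarray(candidate_ids)):
        ref_set = {int(i) for i in ref if i != -1}    # drop -1 padding
        cand_set = {int(i) for i in cand if i != -1}
        if ref_set:
            scores.append(len(ref_set & cand_set) / len(ref_set))
    return float(np.mean(scores)) if scores else 0.0

# Made-up example IDs standing in for real search results.
I_reference = np.array([[10, 42, 7, 99]])   # e.g. from an exact index
I_compressed = np.array([[10, 7, 42, 15]])  # e.g. from the IndexIVFPQ index

score = overlap_at_k(I_reference, I_compressed)
print("overlap:", score)
```

A score of 1.0 means both indexes returned the same set of neighbours, regardless of order.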