In this notebook, we will see how to create different types of indexes for efficient similarity search using the [FAISS](https://github.com/facebookresearch/faiss) library.

# Install dependencies

In [None]:
!pip install datasets
!pip install sentence_transformers
!pip install faiss-cpu

# Import

In [None]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

import faiss

import pandas as pd
import numpy as np
from tqdm import tqdm
import time
import plotly.express as px

# Data

We are going to use a Hugging Face text dataset (the `ax` configuration of GLUE), which yields 832 unique premise sentences. Each sentence will be encoded into a 768-dimensional vector by a BERT model.

In [None]:
dataset = load_dataset('glue', 'ax')
sentences = list(set(dataset['test']['premise']))
print(f"Number of documents: {len(sentences)}")
sentences[:5]

In [None]:
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')

In [None]:
def embed(model, sentences):
    return model.encode(sentences)

In [None]:
embeddings = embed(model, sentences) # this function may take a while to execute
dim = embeddings.shape[1]
print(f"Dimension of embeddings: {dim}")

Dimension of embeddings: 768


# IndexFlatL2 & IndexFlatIP
Let us create an *IndexFlatL2*. Note that *IndexFlatL2* does not require a training phase.

**NB:** *IndexFlatIP* works in the same way as *IndexFlatL2*, except that it ranks results by inner product (higher means more similar) instead of Euclidean distance.

In [None]:
index = faiss.IndexFlatL2(dim)
# index = faiss.IndexFlatIP(dim)
assert index.is_trained
index.add(embeddings)
print(f"Total number of documents: {index.ntotal}")

Total number of documents: 832


From now on, we are going to search for the 3 nearest neighbours of each of our queries.

In [None]:
k = 3
queries = ['I am reading an interesting book', 'I want to buy a car']
embedded_queries = embed(model, sentences=queries)

In [None]:
distances, indices = index.search(embedded_queries, k)
print(f"Distances:\n{distances}\n")
print(f"Indices:\n{indices}")

Distances:
[[265.9757  272.8875  276.9157 ]
 [ 95.71779 178.98122 229.10774]]

Indices:
[[710 623 227]
 [306 194 479]]


The *search()* method returns the distances to the found objects together with their index positions in the data. Note that *IndexFlatL2* reports squared Euclidean distances. Let us iterate over the results and print them.

In [None]:
def print_results(queries, distances, indices, sentences):
    for query, query_distances, query_indices in zip(queries, distances, indices):
        print(f"\nQuery: {query}")
        for i, (query_index, query_distance) in enumerate(zip(query_indices, query_distances), 1):
            document = sentences[query_index]
            print(f"{i}. Distance = {query_distance:.2f}. {document}")

In [None]:
print_results(queries, distances, indices, sentences)


Query: I am reading an interesting book
1. Distance = 265.98. The book astounds as Grossman richly, deeply develops characters and portrays suffering, but his portrayal of women still suffers from a lot of the unfortunate stereotypes and moralizing that we would expect of a writer from his time.
2. Distance = 272.89. The book astounds with Grossman's rich, deep character development and portrayal of suffering, but his portrayal of women still suffers from a lot of the unfortunate stereotypes and moralizing that we would expect of a writer from his time.
3. Distance = 276.92. This article reads like satire.

Query: I want to buy a car
1. Distance = 95.72. Musk decided to offer up his personal car.
2. Distance = 178.98. Musk decided to offer up his personal Tesla roadster.
3. Distance = 229.11. I can actually see him getting into a Lincoln saying this.
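
To make the numbers above more concrete, here is a minimal NumPy sketch of what an exhaustive (flat) search computes. It is an illustration only, not FAISS's implementation; the function name `flat_search` and the random toy data are assumptions made for this example.

```python
import numpy as np

def flat_search(database, queries, k, metric="l2"):
    """Exhaustive k-nearest-neighbour search over every database vector."""
    if metric == "l2":
        # squared Euclidean distance between every query and every database vector
        diff = queries[:, None, :] - database[None, :, :]
        scores = (diff ** 2).sum(axis=-1)
        order = np.argsort(scores, axis=1)[:, :k]    # smallest distance first
    else:
        # inner product: a larger value means a more similar vector
        scores = queries @ database.T
        order = np.argsort(-scores, axis=1)[:, :k]   # largest product first
    return np.take_along_axis(scores, order, axis=1), order

# toy data (hypothetical): 100 random vectors with 8 dimensions
rng = np.random.default_rng(0)
database = rng.standard_normal((100, 8)).astype("float32")
queries = database[:2] + 0.01   # two queries placed right next to rows 0 and 1

distances, indices = flat_search(database, queries, k=3)
print(f"Distances:\n{distances}\n")
print(f"Indices:\n{indices}")
```

Every query is compared with every stored vector, which is why no training is needed and why the result is exact. For unit-length vectors the two metrics produce the same ranking, because the squared distance then equals 2 minus twice the inner product.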


# IndexIVFFlat
Let us accelerate the search procedure with an **inverted file index**. For that, we are going to switch to *IndexIVFFlat*, which constructs a Voronoi diagram under the hood and uses it to reduce the search scope at query time.

To declare this type of index, we need to provide a **quantizer**, which assigns each vector to the closest region centre (centroid), and the **nlist** parameter, which defines the number of regions.

In [None]:
nlist = 20 # number of regions (Voronoi cells)
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)

Unlike *IndexFlatL2*, this index has to be trained before embeddings are added to it. Training builds the Voronoi diagram; after that, each newly added object is assigned to one of the regions.

On large datasets, the training procedure may take some time.

In [None]:
assert not index.is_trained
index.train(embeddings)

assert index.is_trained
index.add(embeddings)

print(f"Total number of documents: {index.ntotal}")

Total number of documents: 832


By default, for a new query, only the region of the single closest centroid is searched, and all the objects inside that region are used as candidates. Candidates from neighbouring regions, which may contain true nearest neighbours, are not checked, and this can lower accuracy. To mitigate this, the number of searched regions can be increased. FAISS exposes this parameter as the **nprobe** attribute.

In [None]:
index.nprobe = 3 # increasing the search scope to 3 regions

In [None]:
k = 3
queries = ['I am reading an interesting book', 'I want to buy a car']
embedded_queries = embed(model, sentences=queries)

In [None]:
distances, indices = index.search(embedded_queries, k)
print_results(queries, distances, indices, sentences)


Query: I am reading an interesting book
1. Distance = 276.92. This article reads like satire.
2. Distance = 279.86. The doctor bears some responsibility for successful care.
3. Distance = 283.93. The attorney bears some responsibility for successful care.

Query: I want to buy a car
1. Distance = 95.72. Musk decided to offer up his personal car.
2. Distance = 178.98. Musk decided to offer up his personal Tesla roadster.
3. Distance = 229.11. I can actually see him getting into a Lincoln saying this.
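
The idea behind the inverted file index can be sketched in plain NumPy. The code below is a simplified illustration under stated assumptions, not what FAISS actually runs: the helper names (`sq_dists`, `train_centroids`, `ivf_search`), the few rounds of k-means and the random toy data are all made up for this example.

```python
import numpy as np

def sq_dists(a, b):
    # squared Euclidean distance between every row of a and every row of b
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)

def train_centroids(data, nlist, n_iter=10, seed=0):
    # a few rounds of Lloyd's k-means, starting from randomly chosen data points
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        assignment = sq_dists(data, centroids).argmin(axis=1)
        for c in range(nlist):
            members = data[assignment == c]
            if len(members) > 0:   # keep the old centroid if a cell is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def ivf_search(data, centroids, inverted_lists, query, k, nprobe):
    # visit only the nprobe cells whose centroids are closest to the query
    cells = sq_dists(query[None, :], centroids)[0].argsort()[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in cells])
    dists = sq_dists(query[None, :], data[candidates])[0]
    order = dists.argsort()[:k]
    return dists[order], candidates[order]

# toy data (hypothetical): 500 random vectors with 16 dimensions
rng = np.random.default_rng(0)
data = rng.standard_normal((500, 16)).astype("float32")
nlist = 10
centroids = train_centroids(data, nlist)

# "adding" vectors: each one is stored in the list of its nearest centroid
assignment = sq_dists(data, centroids).argmin(axis=1)
inverted_lists = [np.flatnonzero(assignment == c) for c in range(nlist)]

query = rng.standard_normal(16).astype("float32")
exact = sq_dists(query[None, :], data)[0].argsort()[:3]
for nprobe in (1, 3, nlist):
    _, found = ivf_search(data, centroids, inverted_lists, query, k=3, nprobe=nprobe)
    overlap = len(set(found.tolist()) & set(exact.tolist()))
    print(f"nprobe={nprobe}: overlap with exact top-3 = {overlap}/3")
```

With `nprobe` equal to `nlist` every cell is visited, so the search degenerates to an exhaustive one and returns the exact neighbours. Smaller values trade some accuracy for fewer distance computations.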


# Performance comparison
We are going to measure how long it takes to find the nearest neighbour (k = 1) for 1000 queries with each of the described indexes, for different dataset sizes.

In [None]:
dim = 100
k = 1
dataset_sizes = [int(size) for size in np.linspace(5000, 100000, 39)]

# each value is a factory that builds a fresh index, so that no state
# (such as an already trained quantizer) is shared between runs
indexes = {
    'IndexFlatIP': lambda: faiss.IndexFlatIP(dim),
    'IndexFlatL2': lambda: faiss.IndexFlatL2(dim),
    'IndexIVFFlat': lambda: faiss.IndexIVFFlat(faiss.IndexFlatL2(dim), dim, 50),  # 50 regions
}

In [2]:
records = []
for dataset_size in tqdm(dataset_sizes):
    # FAISS works with float32 arrays; the same data is used for every index type
    dataset = np.random.randn(dataset_size, dim).astype('float32')
    query = np.random.randn(1000, dim).astype('float32')
    for index_type, build_index in indexes.items():
        index = build_index()
        if not index.is_trained:
            index.train(dataset)
        index.add(dataset)

        # measuring search time only (index construction is excluded)
        start = time.perf_counter()
        index.search(query, k)
        end = time.perf_counter()

        records.append({'index_type': index_type, 'dataset_size': dataset_size, 'time': end - start})

df_speed = pd.DataFrame(records)

In [None]:
fig = px.line(df_speed, x='dataset_size', y='time', color='index_type',
              title='Index search time for datasets of different sizes',
              labels=dict(dataset_size='Dataset size', time='Time (s)', index_type='Index'))
fig.show()

As we can see, the search time of *IndexFlatIP* and *IndexFlatL2* grows linearly with the dataset size. As expected, both indexes perform similarly. The situation is different with *IndexIVFFlat*, which searches only a fraction of the dataset for each query and therefore scales much better.
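
A rough back-of-the-envelope estimate helps explain the gap. The sketch below counts distance computations per query under simplifying assumptions: all regions contain the same number of vectors, `nprobe` is left at 1 as in the benchmark, and constant factors are ignored. The function names are hypothetical.

```python
def flat_comparisons(n_vectors):
    # a flat index compares the query with every stored vector
    return n_vectors

def ivf_comparisons(n_vectors, nlist, nprobe=1):
    # nlist comparisons against the centroids, then nprobe regions
    # of roughly n_vectors / nlist vectors each
    return nlist + nprobe * n_vectors / nlist

for n in (5_000, 100_000):
    print(n, flat_comparisons(n), ivf_comparisons(n, nlist=50))
# prints:
# 5000 5000 150.0
# 100000 100000 2050.0
```

Under these assumptions, the inverted file index with 50 regions performs roughly 2% to 3% of the comparisons a flat index needs, at the cost of possibly missing neighbours that lie in regions that were not probed.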