## FAISS

- An open-source library developed by Facebook AI Research
- Built for efficient similarity search over dense vectors
- Provides many index types for computing exact or approximate nearest-neighbour vectors

https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/

https://github.com/facebookresearch/faiss/wiki

https://github.com/facebookresearch/faiss/wiki/Faiss-indexes

- Distance calculation method: L2 (Euclidean distance)

- Cell-Probe Method

1. How do they work?
2. How do you use them?
3. How do they compare with the brute-force approach?

In [None]:
!pip install faiss-cpu


Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.6 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.4


In [None]:
# Efficient, high-dimensional indexing, GPU acceleration, versatility: Flat, IVF, PQ

In [None]:
import faiss
import numpy as np

In [None]:
# Generate some random vectors for demonstration

dimension = 64
num_vectors = 1000

query_vector = np.random.random((1, dimension)).astype('float32')
data_vectors = np.random.random((num_vectors, dimension)).astype('float32')

In [None]:
query_vector

array([[0.83631843, 0.24012427, 0.8614218 , 0.15864664, 0.2133401 ,
        0.9668993 , 0.7471294 , 0.25806025, 0.04115563, 0.89170384,
        0.52258253, 0.99731773, 0.52784187, 0.20670287, 0.47734925,
        0.27800924, 0.732037  , 0.77009296, 0.3643049 , 0.7415578 ,
        0.978897  , 0.07220161, 0.4572621 , 0.32064703, 0.5628677 ,
        0.2582574 , 0.02254519, 0.03717828, 0.4304883 , 0.8752154 ,
        0.82644206, 0.7229053 , 0.6928956 , 0.03229824, 0.6929262 ,
        0.4507195 , 0.5699979 , 0.02610633, 0.9497388 , 0.8031532 ,
        0.9929684 , 0.28322113, 0.63157773, 0.01299521, 0.25403017,
        0.73477316, 0.507482  , 0.5712722 , 0.8850522 , 0.9285125 ,
        0.7747772 , 0.55473506, 0.19500738, 0.9684543 , 0.9188128 ,
        0.87259597, 0.68696654, 0.689767  , 0.03645605, 0.15864179,
        0.5065967 , 0.11222789, 0.31783998, 0.25767252]], dtype=float32)

In [None]:
len(query_vector)

1

In [None]:
data_vectors

array([[0.40953824, 0.6455848 , 0.02474913, ..., 0.4171508 , 0.74445766,
        0.16831145],
       [0.96388644, 0.9969586 , 0.9585898 , ..., 0.09090518, 0.12181119,
        0.14733487],
       [0.585377  , 0.07516428, 0.537337  , ..., 0.9120703 , 0.9705723 ,
        0.81738025],
       ...,
       [0.50279367, 0.47600177, 0.13900325, ..., 0.5442151 , 0.49166384,
        0.22557332],
       [0.01815673, 0.09730221, 0.03071379, ..., 0.07988773, 0.9094777 ,
        0.85385627],
       [0.13798845, 0.21091416, 0.11956664, ..., 0.9482063 , 0.1706366 ,
        0.32964477]], dtype=float32)

In [None]:
len(data_vectors)

1000

In [None]:
# Create a simple 'flat' index

# An index is a data structure that lets us perform similarity search efficiently

index = faiss.IndexFlatL2(dimension)

In Faiss, an "index" is a data structure built over a set of vectors so that the nearest neighbours of a query vector can be retrieved quickly. Faiss provides several index types, each suited to different datasets and trade-offs between speed, memory, and accuracy. The flat index used here stores the vectors as they are and compares the query against every one of them, so its results are exact.


In [None]:
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x798f84699c80> >

In [None]:
# Add the data vectors to the index

index.add(data_vectors)

In [None]:
# Perform a vector search

k = 5 # number of nearest neighbours to retrieve

distances, indices = index.search(query_vector, k)

In [None]:
# Print the results (IndexFlatL2 reports squared L2 distances)

print("Query Vector:\n", query_vector)
print("\nNearest Neighbours:")
for i in range(k):
  print(f"Index : {indices[0][i]}, Distance: {distances[0][i]}")

Query Vector:
 [[0.83631843 0.24012427 0.8614218  0.15864664 0.2133401  0.9668993
  0.7471294  0.25806025 0.04115563 0.89170384 0.52258253 0.99731773
  0.52784187 0.20670287 0.47734925 0.27800924 0.732037   0.77009296
  0.3643049  0.7415578  0.978897   0.07220161 0.4572621  0.32064703
  0.5628677  0.2582574  0.02254519 0.03717828 0.4304883  0.8752154
  0.82644206 0.7229053  0.6928956  0.03229824 0.6929262  0.4507195
  0.5699979  0.02610633 0.9497388  0.8031532  0.9929684  0.28322113
  0.63157773 0.01299521 0.25403017 0.73477316 0.507482   0.5712722
  0.8850522  0.9285125  0.7747772  0.55473506 0.19500738 0.9684543
  0.9188128  0.87259597 0.68696654 0.689767   0.03645605 0.15864179
  0.5065967  0.11222789 0.31783998 0.25767252]]

Nearest Neighbours:
Index : 963, Distance: 6.606414318084717
Index : 431, Distance: 6.76371955871582
Index : 276, Distance: 7.1480817794799805
Index : 525, Distance: 7.3619384765625
Index : 967, Distance: 7.407787322998047
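To compare against the brute-force approach, here is a rough NumPy-only sketch of what an exhaustive L2 search computes. The helper name `brute_force_search` and the seeded random data are assumptions made for illustration. It returns squared Euclidean distances, which is also what `IndexFlatL2` reports.

```python
import numpy as np

def brute_force_search(data, query, k):
    """Return (squared L2 distances, indices) of the k rows of data closest to query."""
    diffs = data - query                      # broadcast query (1, d) against data (n, d)
    sq_dists = np.sum(diffs * diffs, axis=1)  # squared Euclidean distance per row
    nearest = np.argsort(sq_dists)[:k]        # indices of the k smallest distances
    return sq_dists[nearest], nearest

# Placeholder data with the same shapes as in the notebook.
rng = np.random.default_rng(0)
data = rng.random((1000, 64), dtype=np.float32)
query = rng.random((1, 64), dtype=np.float32)

bf_distances, bf_indices = brute_force_search(data, query, k=5)
for idx, dist in zip(bf_indices, bf_distances):
    print(f"Index : {idx}, Distance: {dist}")
```

Adding the same `data` to a `faiss.IndexFlatL2` and calling `index.search(query, 5)` should give the same indices, with distances that agree up to floating-point rounding.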


In [None]:
# Finding nearest neighbours

In [None]:
query_vector = np.array([[10.0] * 64], dtype='float32')

In [None]:
query_vector

array([[10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.,
        10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.,
        10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.,
        10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.,
        10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.]],
      dtype=float32)

In [None]:
dimension = 64
num_vectors = 1000
data_vectors = np.random.normal(loc=10, scale=1, size=(num_vectors, dimension)).astype('float32')

In [None]:
data_vectors

array([[10.976    ,  8.334698 ,  9.825231 , ..., 10.316686 , 10.298373 ,
        10.415993 ],
       [ 9.917686 ,  9.356894 , 10.280453 , ..., 10.649538 ,  9.656099 ,
         7.9423347],
       [ 9.821852 , 10.658847 ,  9.545566 , ..., 10.434568 , 10.512574 ,
         9.674537 ],
       ...,
       [ 9.518309 ,  9.511969 ,  8.602661 , ..., 11.113427 ,  8.587387 ,
         9.772464 ],
       [10.388731 , 10.094328 ,  8.88617  , ..., 10.442276 ,  9.386941 ,
         9.586914 ],
       [ 9.285284 , 10.785852 ,  9.863935 , ..., 10.706663 ,  9.920414 ,
        11.345318 ]], dtype=float32)

In [None]:
index = faiss.IndexFlatL2(dimension)

In [None]:
index.add(data_vectors)

In [None]:
k = 5
distances, indices = index.search(query_vector, k)

In [None]:
print("Query Vector:\n", query_vector)
print("\nNearest Neighbours:")
for i in range(k):
  index_number = indices[0][i]
  distance_value = distances[0][i]
  # First component of the matched vector, printed as a quick sanity check
  actual_number = data_vectors[index_number][0]
  print(f"Index : {index_number}, Actual Number : {actual_number}, Distance: {distance_value}")

Query Vector:
 [[10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
  10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
  10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10. 10.
  10. 10. 10. 10. 10. 10. 10. 10. 10. 10.]]

Nearest Neighbours:
Index : 405, Actual Number : 9.512840270996094, Distance: 33.853981018066406
Index : 317, Actual Number : 10.656074523925781, Distance: 37.03443145751953
Index : 684, Actual Number : 11.339433670043945, Distance: 38.95702362060547
Index : 213, Actual Number : 10.985352516174316, Distance: 40.08647918701172
Index : 127, Actual Number : 9.810687065124512, Distance: 40.38578796386719
