## Faiss (Facebook AI Similarity Search) for semantic search
Faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

### What is similarity search?

Given a set of vectors $\{x_1, \dots, x_n\}$ in dimension $d$, Faiss builds a data structure in RAM from it. After the structure is constructed, when given a new vector $x$ in dimension $d$, it efficiently performs the operation:

$$i = \mathrm{argmin}_i \lVert x - x_i \rVert$$

where $\lVert \cdot \rVert$ is the Euclidean distance ($L^2$).

In Faiss terms, the data structure is an index, an object that has an add method to add $x_i$ vectors. Note that the $x_i$'s are assumed to be fixed.

Computing the argmin is the search operation on the index.
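To make the argmin operation concrete, here is a minimal sketch in plain NumPy rather than Faiss. It illustrates the operation itself, not how Faiss implements it; the function name `nearest_neighbor` and the toy data are made up for illustration.

```python
import numpy as np

def nearest_neighbor(database: np.ndarray, query: np.ndarray) -> int:
    """Return the row index i that minimizes ||query - database[i]||."""
    # Euclidean distance from the query to every stored vector.
    distances = np.linalg.norm(database - query, axis=1)
    return int(np.argmin(distances))

# Toy data (illustrative only): four vectors in dimension 2.
database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([0.9, 0.2])

print(nearest_neighbor(database, query))  # 1
```

A scan like this touches every stored vector for every query, which is the cost that the approximate index types in Faiss are designed to reduce.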

This is all that Faiss is about. It can also:

- return not just the nearest neighbor, but also the 2nd nearest, 3rd, …, k-th nearest neighbor
- search several vectors at a time rather than one (batch processing). For many index types, this is faster than searching one vector after another
- trade precision for speed, i.e. give an incorrect result 10% of the time with a method that's 10x faster or uses 10x less memory
- perform maximum inner product search $\mathrm{argmax}_i \langle x, x_i \rangle$ instead of minimum Euclidean search. There is also limited support for other distances (L1, Linf, etc.)
- return all elements that are within a given radius of the query point (range search)
- store the index on disk rather than in RAM
- index binary vectors rather than floating-point vectors
- ignore a subset of index vectors according to a predicate on the vector ids
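Three of the capabilities listed above (k-nearest neighbors for a batch of queries, maximum inner product search, and range search) can be sketched with brute-force NumPy. This is an illustration under stated assumptions, not Faiss code: the function names are hypothetical, and `range_search_l2` takes a plain Euclidean radius, which may differ from the convention a given Faiss index uses.

```python
import numpy as np

def knn_l2(database, queries, k):
    """Batch k-nearest-neighbor search under squared Euclidean distance."""
    # Pairwise squared distances, shape (n_queries, n_database).
    diff = queries[:, None, :] - database[None, :, :]
    sq_dist = np.sum(diff * diff, axis=2)
    ids = np.argsort(sq_dist, axis=1)[:, :k]
    return np.take_along_axis(sq_dist, ids, axis=1), ids

def top_k_inner_product(database, queries, k):
    """Batch search for the k largest inner products."""
    scores = queries @ database.T
    ids = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, ids, axis=1), ids

def range_search_l2(database, query, radius):
    """Ids of all vectors within Euclidean distance `radius` of one query."""
    sq_dist = np.sum((database - query) ** 2, axis=1)
    return np.flatnonzero(sq_dist <= radius ** 2)

# Toy data (illustrative only).
database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
queries = np.array([[0.9, 0.2], [2.5, 2.5]])

_, l2_ids = knn_l2(database, queries, k=2)
print(l2_ids.tolist())  # [[1, 0], [3, 2]]

_, ip_ids = top_k_inner_product(database, queries, k=1)
print(ip_ids.tolist())  # [[3], [3]]

print(range_search_l2(database, queries[0], radius=1.0).tolist())  # [0, 1]
```

Note that the inner product search picks the longest vector for both queries, which is why the choice between L2 and inner product matters when vectors are not normalized.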

This notebook focuses on performing similarity searches within a dataset of text.

In [1]:
#Install Packages
!pip install faiss-cpu
!pip install sentence-transformers


Collecting faiss-cpu
  Downloading faiss_cpu-1.7.4-cp311-cp311-macosx_11_0_arm64.whl (2.7 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.4
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence-transformers)
  Obtaining dependency information for transformers<5.0.0,>=4.6.0 from https://files.pythonhost

  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
Downloading huggingface_hub-0.19.4-py3-none-any.whl (311 kB)
Downloading torch-2.1.1-cp311-none-macosx_11_0_arm64.whl (59.6 MB)
Downloading transformers-4.35.2-py3-none-any.whl (7.9 MB)
Downloading torchvision-0.16.1-cp311-cp311-macosx_11_0_arm64.whl (1.5 MB)

In [28]:
# import necessary libraries
import pandas as pd
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

In [4]:
# Raise the maximum displayed column width from the pandas default of 50
# to 100 characters. Strings longer than that are truncated when a
# DataFrame is displayed, so this keeps long texts readable in the output.
pd.set_option('display.max_colwidth', 100)

In [7]:
#read data file
df = pd.read_csv("input_text.csv")
df.shape


(8, 2)

In [10]:
#display dataframe
df

| | text | category |
|---|---|---|
| 0 | Meditation and yoga can improve mental health | Health |
| 1 | Fruits, whole grains and vegetables helps control blood pressure | Health |
| 2 | These are the latest fashion trends for this week | Fashion |
| 3 | Vibrant color jeans for male are becoming a trend | Fashion |
| 4 | The concert starts at 7 PM tonight | Event |
| 5 | Navaratri dandiya program at Expo center in Mumbai this october | Event |
| 6 | Exciting vacation destinations for your next trip | Travel |
| 7 | Maldives and Srilanka are gaining popularity in terms of low budget vacation places | Travel |


In [14]:
encoder = SentenceTransformer("all-mpnet-base-v2")
vectors = encoder.encode(df.text)



The model 'all-mpnet-base-v2' specified here is a pre-trained model provided by the Sentence Transformers library, designed to encode text into embeddings. It is based on the MPNet architecture and produces one fixed-size embedding per input sentence.

In [16]:
vectors.shape


(8, 768)

In [17]:
dimensionality = vectors.shape[1]
dimensionality


768

In [20]:
index = faiss.IndexFlatL2(dimensionality)

IndexFlatL2 is a simple flat index that performs a brute-force L2 (Euclidean) distance search: it stores the vectors as they are and compares the query against every one of them. The dimensionality of the embeddings is passed to the constructor so that the index is initialized with the appropriate vector size. This index will be used for adding and searching vectors based on their L2 distance.

In [26]:
index.add(vectors)
index

<faiss.swigfaiss.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x2951fadf0> >

This line of code uses the add method of the index object (a FAISS index created in the previous cell) to add the vectors stored in the vectors array. This process is essential in building the index, which allows for efficient similarity searches later. The vectors array contains the embeddings generated from the text data, and adding them to the index means that these embeddings can be quickly searched to find the most similar items.
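As a rough mental model of what adding to a flat index does, the sketch below uses a hypothetical NumPy class, `FlatL2Sketch`, as a stand-in for the real index. It assumes that ids are simply positions in insertion order and that the index offers no deduplication, so calling add twice with the same array stores every vector twice. The attribute name `ntotal` mirrors the counter that Faiss indexes expose; everything else is illustrative.

```python
import numpy as np

class FlatL2Sketch:
    """Toy stand-in for a flat index: stores vectors, scans all of them on search."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)

    @property
    def ntotal(self):
        # Number of vectors currently stored.
        return self.vectors.shape[0]

    def add(self, x):
        x = np.asarray(x, dtype=np.float32)
        assert x.ndim == 2 and x.shape[1] == self.dim
        # New vectors are appended; their ids are their row positions.
        self.vectors = np.vstack([self.vectors, x])

    def search(self, queries, k):
        queries = np.asarray(queries, dtype=np.float32)
        diff = queries[:, None, :] - self.vectors[None, :, :]
        sq_dist = np.sum(diff * diff, axis=2)
        ids = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
        return np.take_along_axis(sq_dist, ids, axis=1), ids

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 4)).astype(np.float32)

index = FlatL2Sketch(dim=4)
index.add(data)
index.add(data)  # re-running the add step stores every vector a second time
print(index.ntotal)  # 16

distances, ids = index.search(data[2:3], k=2)
print(ids.tolist())  # [[2, 10]]: row 2 and its duplicate at position 2 + 8
```

In a notebook, re-executing the cell that calls add has the same effect, so checking the stored count before searching is a cheap sanity check.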

In [27]:
search_query = "I want to buy a t-shirt"
# search_query = "where should I go for holidays"
# search_query = "An apple a day keeps the doctor away"
vec = encoder.encode(search_query)
vec.shape


(768,)

The encode method converts the text into a dense vector representation, capturing its semantic meaning.

In [29]:
svec = np.array(vec).reshape(1,-1)
svec.shape

(1, 768)

Reshaping the vector into a 2D array is a necessary step to align with the FAISS indexing and search requirements.
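The reshape can be sketched on a stand-in array. The example assumes, as is commonly stated for the Faiss Python bindings, that search expects a 2D float32 array of shape (n_queries, dim); the variable names are illustrative and the values are placeholders for a real embedding.

```python
import numpy as np

# Stand-in for the (768,) vector returned by encoder.encode on one string.
vec = np.arange(768, dtype=np.float64)

# One query becomes a matrix with a single row, stored as contiguous float32.
svec = np.ascontiguousarray(vec.reshape(1, -1), dtype=np.float32)
print(svec.shape, svec.dtype)  # (1, 768) float32

# Several queries can be stacked into one (n_queries, dim) batch instead.
batch = np.stack([vec, vec + 1.0]).astype(np.float32)
print(batch.shape)  # (2, 768)
```

Passing a list of strings to the encoder would return a 2D array directly, which makes the reshape unnecessary for batched queries.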

In [30]:
faiss.normalize_L2(svec)


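This function normalizes the vectors in place to unit length according to the L2 norm, which is a common preprocessing step in many similarity search tasks. With unit-length vectors, the similarity search ranks results by the angle between vectors rather than by their magnitude. This holds only if the vectors stored in the index are unit length as well; in this notebook only the query is normalized explicitly.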
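The link between unit-length vectors and angle-based ranking can be checked numerically: for unit vectors $a$ and $b$, $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$, so ordering by L2 distance is the same as ordering by cosine similarity. The sketch below uses a hypothetical helper, `normalize_rows`, written in NumPy; unlike `faiss.normalize_L2` it returns a new array rather than modifying its input.

```python
import numpy as np

def normalize_rows(x):
    """Scale each row of x to unit L2 norm (returns a new array)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

rng = np.random.default_rng(0)
a = normalize_rows(rng.normal(size=(1, 8)))
b = normalize_rows(rng.normal(size=(1, 8)))

cosine = float(np.dot(a[0], b[0]))
sq_l2 = float(np.sum((a[0] - b[0]) ** 2))

# Both sides of the identity agree up to floating-point error.
print(abs(sq_l2 - (2.0 - 2.0 * cosine)) < 1e-9)  # True
```

If only one side is normalized, the identity no longer applies and the L2 ranking also depends on the lengths of the stored vectors.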

In [40]:
distances, I = index.search(svec, k=2)
distances


array([[1.3956809, 1.3956809]], dtype=float32)

This searches the index (the FAISS index created earlier) using svec as the query vector. The argument k=2 specifies that the two nearest neighbors in the index should be returned for each query vector. The function returns two arrays: distances, which contains the distances of the nearest neighbors from the query (IndexFlatL2 reports squared L2 distances), and I, which contains the positions of these neighbors in the index.

In [43]:
I

array([[ 2, 10]])

In [42]:
I.tolist()

[[2, 10]]

In [44]:
row_indices = I.tolist()[0]
row_indices


[2, 10]

The code above converts the I array to a list and then selects the first element of that list. Recall that I holds, for each query vector, the positions of its nearest neighbors in the index. Since only one query vector was searched, I.tolist()[0] retrieves the list of positions for that single query, which is assigned to the variable row_indices. Note that position 10 does not exist in the 8-row DataFrame. The two identical distances and the offset of exactly 8 suggest that the vectors were added to the index twice (for example by re-running the index.add cell), in which case 10 is a duplicate of row 2.

In [57]:
# Ensure row_indices only contains valid indices for df
valid_row_indices = [idx for idx in row_indices if idx in df.index]

# Access only the valid rows in df
valid_rows = df.loc[valid_row_indices]
valid_rows


| | text | category |
|---|---|---|
| 2 | These are the latest fashion trends for this week | Fashion |
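The same filtering idea can be sketched without pandas, against a plain list of texts. The helper name `rows_for_ids` is hypothetical. The example also includes a -1 id for illustration, since Faiss uses -1 as a placeholder when fewer than k neighbors are found; a bounds check like this one drops both out-of-range and placeholder ids.

```python
import numpy as np

# Texts from the dataset shown earlier, in row order.
texts = [
    "Meditation and yoga can improve mental health",
    "Fruits, whole grains and vegetables helps control blood pressure",
    "These are the latest fashion trends for this week",
    "Vibrant color jeans for male are becoming a trend",
    "The concert starts at 7 PM tonight",
    "Navaratri dandiya program at Expo center in Mumbai this october",
    "Exciting vacation destinations for your next trip",
    "Maldives and Srilanka are gaining popularity in terms of low budget vacation places",
]

def rows_for_ids(ids, rows):
    """Return (id, row) pairs for ids that point at an existing row."""
    return [(i, rows[i]) for i in ids if 0 <= i < len(rows)]

# The two ids from the search output, plus an illustrative -1 placeholder.
I = np.array([[2, 10, -1]])
matches = rows_for_ids(I[0].tolist(), texts)
print(matches)  # [(2, 'These are the latest fashion trends for this week')]
```

Filtering hides the symptom rather than the cause: if ids beyond the dataset size appear, it is worth checking how many vectors the index actually holds.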




This notebook involves the following steps:

1. Installing and importing the necessary libraries.
2. Loading and displaying a dataset.
3. Generating embeddings for the text data using a pre-trained sentence transformer model.
4. Setting up a FAISS index for efficient similarity search.
5. Encoding a search query and using the FAISS index to find the items in the dataset most similar to this query.
6. Displaying the results of the similarity search.
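The steps above can be strung together in one small function. The sketch below keeps the shape of the pipeline (encode, search, map ids back to texts) but swaps in stand-ins so that it runs without downloads: `toy_encode` is a hypothetical bag-of-words encoder that only matches shared words and is not a semantic model, and the search is a brute-force NumPy scan rather than a Faiss index.

```python
import numpy as np

def toy_encode(texts, vocabulary):
    """Hypothetical stand-in for encoder.encode: unit-length bag-of-words vectors."""
    vectors = np.zeros((len(texts), len(vocabulary)), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            if word in vocabulary:
                vectors[row, vocabulary[word]] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)

def semantic_search(query, texts, encode, k=2):
    """Encode texts and query, scan all vectors, return the k closest texts."""
    database = encode(texts)
    query_vec = encode([query])
    sq_dist = np.sum((database - query_vec) ** 2, axis=1)
    ids = np.argsort(sq_dist)[:k]
    return [(texts[i], float(sq_dist[i])) for i in ids]

texts = [
    "The concert starts at 7 PM tonight",
    "Exciting vacation destinations for your next trip",
    "These are the latest fashion trends for this week",
]

# Vocabulary built from the dataset itself (illustrative only).
words = sorted({w for t in texts for w in t.lower().split()})
vocabulary = {w: i for i, w in enumerate(words)}

def encode(batch):
    return toy_encode(batch, vocabulary)

results = semantic_search("latest fashion trends", texts, encode, k=1)
print(results[0][0])  # These are the latest fashion trends for this week
```

Because the encoder is passed in as a function, replacing `encode` with a call to a real sentence embedding model would leave `semantic_search` unchanged.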