### Lab - Facebook AI Similarity Search (FAISS) 

In this notebook, we'll learn about Facebook AI Similarity Search (FAISS). Facebook released the FAISS library in March 2017. FAISS lets you search for multimedia documents that are similar to each other, a task where query-based search engines fall short.

### What is a vector database?
Traditional databases are made up of structured tables containing symbolic information. For example, a collection of images is represented as a table. Each row in the table contains the image identifier and image description. 

A vector database is a type of database that stores data as high-dimensional vectors, which are mathematical representations of features or attributes. The vectors are usually generated by applying a transformation or embedding function to raw data such as text, images, audio, or video. The embedding function can be based on various methods, such as machine learning models, word embeddings, and feature extraction algorithms.

AI tools, like text embedding (word2vec) or convolutional neural net (CNN) descriptors trained with deep learning, generate high-dimensional vectors.

### How do vector representations work?
The vector representation for images is designed to produce similar vectors for similar images, where similar vectors are defined as those that are nearby in Euclidean space. 
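
To make "nearby in Euclidean space" concrete, here is a minimal sketch. The three 4-dimensional vectors are made up for illustration and do not come from any real embedding model; `l2_distance` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, invented for this illustration only.
cat = np.array([0.9, 0.1, 0.0, 0.2])
kitten = np.array([0.8, 0.2, 0.1, 0.2])
truck = np.array([0.0, 0.9, 0.8, 0.1])

def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return float(np.linalg.norm(a - b))

d_cat_kitten = l2_distance(cat, kitten)
d_cat_truck = l2_distance(cat, truck)

# Similar items should sit closer together than dissimilar ones.
print(d_cat_kitten, d_cat_truck)
```

With these made-up values, the cat and kitten vectors are much closer to each other than the cat and truck vectors, which is the property a good embedding function aims for.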

### Benefits of Vector Database
The main advantage of a vector database is that it allows for fast and accurate similarity search and retrieval of data based on vector distance or similarity.


In [None]:
# In this lab, we will use FAISS to index and search sentence vectors
# 1. We will use Amazon_Shareholder_Letter_1997.txt as the input document
# 2. We will split the document into sentences
# 3. We will embed each sentence, create a new index, and add the embeddings to it
# 4. Given a query, e.g. "What did Jeff Bezos say about the internet?", find the k most similar sentences

Further reading:

- https://www.pinecone.io/learn/faiss-tutorial/

In [None]:
# We will need a few Python libraries for this tutorial. A basic understanding of Python is required.
# You can install the libraries using pip if they are not pre-installed in your notebook.

In [None]:
!pip install faiss-cpu
import requests
from io import StringIO
import pandas as pd
import numpy as np
import faiss

In [None]:
res = requests.get('https://raw.githubusercontent.com/r2rajan/genai/main/FAISS/Amazon_Shareholder_Letter_1997.txt')
# create dataframe
data = pd.read_csv(StringIO(res.text), sep='\t', on_bad_lines='skip', header=None, names=['Sentences'])
data.head()

In [None]:
# we take all the sentences from the Amazon Shareholder letter into a Python list
# you will get an output of 42 sentences
sentences = data['Sentences'].tolist()
print(sentences[:5])  # only the last expression in a cell is displayed, so print the preview
len(sentences)

In [None]:
# List of sentences from Amazon Shareholder letter
sentences

In [None]:
# remove duplicates and NaN, keeping the original sentence order
sentences = [s for s in dict.fromkeys(sentences) if isinstance(s, str)]

In [None]:
# You need to install the sentence-transformers library. This framework provides an easy way to compute
# dense vector representations for sentences, paragraphs, and images.
# For additional reading: https://pypi.org/project/sentence-transformers/
!pip install sentence-transformers
import sentence_transformers

In [None]:
# The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc.
# and achieve state-of-the-art performance in various tasks. See the PyPI page for the supported models.
# You need to initialize the sentence transformer model.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
# create sentence embeddings using the multi-qa-MiniLM-L6 model from Hugging Face
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape
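
The `cos` in the model name suggests it is tuned for cosine similarity. If its embeddings are unit length (an assumption you can check with `np.linalg.norm(sentence_embeddings, axis=1)`), then squared L2 distance and cosine similarity are tied by `||a - b||^2 = 2 - 2 * cos(a, b)`, so ranking by L2 distance gives the same order as ranking by cosine similarity. The sketch below checks this on random stand-in vectors, not on real model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for embeddings (not produced by the model), scaled to unit length.
vectors = rng.normal(size=(5, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

query = vectors[0]
others = vectors[1:]

cosine = others @ query                           # cosine similarity for unit vectors
squared_l2 = ((others - query) ** 2).sum(axis=1)  # squared Euclidean distance

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
identity_holds = np.allclose(squared_l2, 2 - 2 * cosine)
# Smallest distance first should match largest cosine first.
same_ranking = np.array_equal(np.argsort(squared_l2), np.argsort(-cosine))
print(identity_holds, same_ranking)
```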

In [None]:
# Let's get the embedding dimension
d = sentence_embeddings.shape[1]
d

In [None]:
# Let's build our first vector index using IndexFlatL2
# IndexFlatL2 measures the L2 (or Euclidean) distance between our query vector (xq) and every vector (y) loaded into the index.
# It's simple and accurate, but not fast.
# You want the index to have the same dimension as your embeddings
index = faiss.IndexFlatL2(d)
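
To show what an exhaustive L2 search does, here is a rough NumPy sketch of the same idea. It is not how FAISS is implemented; `flat_l2_search` and the toy data are hypothetical. Like `IndexFlatL2`, it reports squared L2 distances rather than their square roots.

```python
import numpy as np

def flat_l2_search(xb, xq, k):
    """Exhaustive search: compare every query with every stored vector.

    Returns (distances, ids), each of shape (n_queries, k), ordered from
    the smallest squared L2 distance to the largest.
    """
    # Pairwise squared distances, shape (n_queries, n_stored).
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Positions of the k smallest distances for each query.
    ids = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, ids, axis=1), ids

# Toy 2-dimensional "database" and a single query, made up for illustration.
xb = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]], dtype="float32")
xq = np.array([[0.9, 0.1]], dtype="float32")

D_demo, I_demo = flat_l2_search(xb, xq, k=2)
print(I_demo)  # ids of the two closest stored vectors
print(D_demo)  # their squared L2 distances
```

Every query is compared with every stored vector, so the work grows linearly with the number of vectors. That is why a flat index is accurate but slow on large collections.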

In [None]:
# Check to see if the index is trained. IndexFlatL2 requires no training, so this returns True
index.is_trained

In [None]:
# Let's load your sentence embeddings into the index
index.add(sentence_embeddings)

In [None]:
index.ntotal

In [None]:
# Query = xq
# Number of vectors to return = k
# We will search with the query `xq` and return the `k` nearest neighbors.
k = 4
xq = model.encode(["What did Bezos say about internet"])

In [None]:
# You will get the 4 nearest locations returned by the query, along with how long the search takes.

In [None]:
%%time
D, I = index.search(xq, k)  # search
print(I)

In [None]:
# Let's see the results of query and 4 nearest neighbours related to Jeff Bezos and Internet
for location in I[0].tolist():
    print(location, ":", sentences[location])

In [None]:
# we have 4 vectors to return (k) - so we initialize a zero array to hold them
vecs = np.zeros((k, d))
# then iterate through each location ID from I and reconstruct the vector from the index 
# Add the reconstructed vector to our zero-array
for i, val in enumerate(I[0].tolist()):
    vecs[i, :] = index.reconstruct(val)

In [None]:
# Let's look at the shape of the numpy array
vecs.shape

In [None]:
# Here are the first 100 components of the first result vector.
vecs[0][:100]

In [None]:
# That's the end of this simple lab to explore vector databases and vector search
# You used a simple flat index and did an exhaustive search on a very small dataset.
# A flat index is not ideal for very large datasets with billions of vectors, where performance is key
# Next steps: how to improve performance by partitioning the index
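
As a preview of partitioning, the sketch below groups vectors into cells around centroids and searches only the cells closest to the query. FAISS offers partitioned indexes such as `IndexIVFFlat`, which learn centroids with k-means. This sketch instead samples centroids at random from the data to stay short, so treat it as an illustration of the idea under that assumption, not as the FAISS algorithm. The function names, `nlist`, and `nprobe` values here are chosen for the example.

```python
import numpy as np

def build_partitions(xb, nlist, rng):
    """Sample nlist stored vectors as centroids and bucket every vector by its nearest centroid."""
    centroids = xb[rng.choice(len(xb), size=nlist, replace=False)]
    sq_dist = ((xb[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    cell_of = np.argmin(sq_dist, axis=1)
    cells = [np.flatnonzero(cell_of == c) for c in range(nlist)]
    return centroids, cells

def partitioned_search(xb, centroids, cells, q, k, nprobe):
    """Search only the nprobe cells whose centroids are closest to the query q."""
    nearest_cells = np.argsort(((centroids - q) ** 2).sum(axis=1))[:nprobe]
    candidates = np.concatenate([cells[c] for c in nearest_cells])
    dist = ((xb[candidates] - q) ** 2).sum(axis=1)
    order = np.argsort(dist)[:k]
    return dist[order], candidates[order]

rng = np.random.default_rng(0)
xb = rng.normal(size=(1000, 16))  # random stand-ins for stored embeddings
q = rng.normal(size=16)           # random stand-in for a query embedding
centroids, cells = build_partitions(xb, nlist=10, rng=rng)

# Probing every cell is an exhaustive search, so it matches brute force.
_, ids_all = partitioned_search(xb, centroids, cells, q, k=4, nprobe=10)
brute_force = np.argsort(((xb - q) ** 2).sum(axis=1))[:4]

# Probing fewer cells scans fewer vectors but may miss some true neighbours.
_, ids_fast = partitioned_search(xb, centroids, cells, q, k=4, nprobe=2)
print(ids_all, brute_force, ids_fast)
```

The trade-off is controlled by `nprobe`: a small value scans fewer vectors per query, while a value equal to `nlist` falls back to the exhaustive search used in this lab.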