# Vector Similarity Search With Faiss 

This notebook focuses on similarity search over vectors. We will go through the following steps:

- Creating random sentences
- Creating embeddings
- Creating a local vector database from those embeddings
- Semantic search with a given embedded query (a direct Euclidean search)
- Using the Faiss library and seeing the performance improvement

In [77]:
# Imports
import os
import csv
import random
import pandas as pd
import numpy as np
import faiss
from dotenv import load_dotenv
from faker import Faker
from openai import OpenAI

## Generate Random Sentences
Let's generate random sentences with the Faker library.

In [78]:
# Random sentence generator
fake = Faker()
def generate_meaningful_sentence(min_words=5, max_words=15):
    sentence = fake.sentence(nb_words=random.randint(min_words, max_words))
    return sentence

# Csv file writer
def create_csv(filename, num_sentences=20000, min_words=8, max_words=15):
    with open(filename, mode='w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Sentences"])
        for _ in range(num_sentences):
            sentence = generate_meaningful_sentence(min_words, max_words)
            writer.writerow([sentence])

In [79]:
# Create and write to csv file
create_csv('docs/random_sentences.csv')

## Creating Embeddings

Let's use OpenAI's "text-embedding-ada-002" model to create the embeddings.

This model is not free to use. If you want a free alternative, you can use an open-source embedding model.

In [80]:
# Import csv file
data = pd.read_csv('docs/random_sentences.csv')
data.head()

Unnamed: 0,Sentences
0,Less hair win focus government edge less knowl...
1,Several should high size turn sound side autho...
2,Above field cup trial door use challenge owner.
3,Now role player social before it good page for...
4,Them everything very official eat audience wai...


In [81]:
# Create Embeddings with OpenAI
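load_dotenv()  # reads OPENAI_API_KEY from a local .env file
openai_api_key = os.getenv('OPENAI_API_KEY')
client = OpenAI(api_key=openai_api_key)

def create_embeddings(data):
    # Request embeddings for a whole batch of sentences in one API call
    embeddings = client.embeddings.create(model="text-embedding-ada-002", input=list(data))
    # Faiss works with float32 vectors
    embeddings_array = np.array([x.embedding for x in embeddings.data], dtype=np.float32)
    return embeddings_array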

In [82]:
# Create embeddings with chunks using OpenAI
def chunk_sentences_and_create_embeddings(data, chunk_size):
    embedding_list = []
    num_sentences = len(data)
    for i in range(0, num_sentences, chunk_size):
        chunk_data = data[i:i+chunk_size]
        embeddings = create_embeddings(chunk_data)
        embedding_list.append(embeddings)
    # Stack the chunks row-wise; this also works when the last chunk is shorter
    return np.vstack(embedding_list)

embedding_array = chunk_sentences_and_create_embeddings(data["Sentences"], chunk_size=1000)
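
The chunk-and-stack pattern can be tried without calling the API. The sketch below is only an illustration: `fake_embed` and `embed_in_chunks` are hypothetical stand-ins, not part of the notebook or of the OpenAI client. It shows that row-wise stacking still produces a single 2-D array when the number of sentences is not a multiple of the chunk size.

```python
import numpy as np

def fake_embed(texts, dim=8):
    """Hypothetical stand-in for an embedding API: one row of `dim` floats per text."""
    rng = np.random.default_rng(0)
    return rng.random((len(texts), dim), dtype=np.float32)

def embed_in_chunks(texts, chunk_size, embed_fn):
    """Embed `texts` chunk by chunk and stack the results into one 2-D array."""
    parts = []
    for start in range(0, len(texts), chunk_size):
        parts.append(embed_fn(texts[start:start + chunk_size]))
    return np.vstack(parts)

sentences = [f"sentence {i}" for i in range(25)]  # 25 is not a multiple of 10
embeddings = embed_in_chunks(sentences, chunk_size=10, embed_fn=fake_embed)
print(embeddings.shape)  # (25, 8)
```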

In [94]:
# Reshape array
embedding_array = embedding_array.reshape(-1, 1536)

In [97]:
# Save Embeddings to csv file
pd.DataFrame(
    data=embedding_array
).to_csv('docs/embeddings.csv', index=False)

In [103]:
# Load embeddings
def load_embeddings(file_path):
    embeddings_df = pd.read_csv(file_path)
    embeddings_array = embeddings_df.to_numpy(dtype=np.float32)  # Faiss expects float32
    return embeddings_array

embedding_array = load_embeddings("docs/embeddings.csv")
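
As an alternative to CSV, NumPy's binary `.npy` format stores the array with its dtype and shape intact, so no text parsing or type conversion is needed on load. The sketch below uses a small random array and a temporary directory; the file name is hypothetical and the data is not the notebook's embeddings.

```python
import os
import tempfile
import numpy as np

# Stand-in data with the same width as the embeddings (1536 columns)
vectors = np.random.default_rng(0).random((100, 1536), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings.npy")  # hypothetical file name
    np.save(path, vectors)      # binary write, keeps dtype and shape
    restored = np.load(path)    # loaded fully into memory

print(restored.dtype, restored.shape)
```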

# Semantic Search

Semantic search is a technique used to retrieve information based on the meaning and context of the query rather than just keyword matching. It aims to understand the intent behind the query and provide relevant results.

In [120]:
# Query embedding
def get_query_embedding(query):
    response = client.embeddings.create(model="text-embedding-ada-002", input=query)
    return response.data[0].embedding

query = "I don't want to be here"
query_embedding = get_query_embedding(query)
len(query_embedding)

1536

## Flat L2 Index (Direct Euclidean Search)

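A flat L2 index stores every vector as-is and compares the query against all of them, using the Euclidean (L2) distance as the measure of similarity: the smaller the distance, the more similar the vectors. Note that Faiss's IndexFlatL2 reports squared L2 distances.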
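
To make the idea concrete, here is a minimal NumPy sketch of an exhaustive L2 search. It illustrates the computation only and is not how Faiss implements it; `exact_l2_search` and the random data are hypothetical.

```python
import numpy as np

def exact_l2_search(vectors, query, k):
    """Return (squared distances, indices) of the k vectors closest to `query`."""
    diffs = vectors - query                          # broadcast the query over all rows
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)   # row-wise squared L2 distance
    nearest = np.argsort(sq_dists)[:k]               # smallest distances first
    return sq_dists[nearest], nearest

rng = np.random.default_rng(0)
vectors = rng.random((1000, 16), dtype=np.float32)
query = vectors[42] + 0.001   # a point very close to row 42
distances, indices = exact_l2_search(vectors, query, k=5)
print(indices[0])  # 42
```

Every query touches every stored vector, so the cost grows linearly with the size of the collection.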

In [107]:
# Faiss Index
dimension = 1536 # dimension of the embeddings
index = faiss.IndexFlatL2(dimension) # Initialize the index
index.add(embedding_array) # Add the data to the index


In [108]:
# See if the index is trained
print(index.is_trained)

# Number of elements in the index
print(index.ntotal)

True
20000


In [130]:
# Search for the nearest neighbors of the query embedding
query_vector = np.array(query_embedding, dtype=np.float32).reshape(1, -1)  # Faiss expects a 2-D float32 array
D, I = index.search(query_vector, 5)
print(I)

[[17167 15240  9998  9215 14756]]


In [131]:
# Let's see the most related sentences
for i in I[0]:
    print(data.iloc[i]["Sentences"])

Recent history image feel morning outside ever try interest want leave resource.
Attention machine interest I visit bad now to despite present.
Weight billion save cold situation front school operation option interest down class meeting goal not allow find difference different myself.
Reality walk among by decide out look return.
Not product seat rock evening myself good.


These are quite good matches. The first sentence in particular is directly related, and the others seem to carry related meanings without sharing the same words.
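
One reason L2 distance works as a similarity measure here: for unit-length vectors, the squared L2 distance equals 2 - 2 * (cosine similarity), so ranking by smallest distance is the same as ranking by highest cosine similarity. The sketch below checks this on random unit vectors. Whether a given embedding model returns unit-length vectors is an assumption you should verify on your own data.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # make every row unit length
query = rng.normal(size=32)
query /= np.linalg.norm(query)

sq_l2 = np.sum((vectors - query) ** 2, axis=1)  # squared L2 distance to each row
cosine = vectors @ query                        # cosine similarity (rows are unit length)

# Identity for unit vectors: squared L2 distance = 2 - 2 * cosine similarity
print(np.allclose(sq_l2, 2 - 2 * cosine))
# The nearest vector by L2 is also the most similar one by cosine
print(np.argmin(sq_l2) == np.argmax(cosine))
```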

## Speeding Up

This was an exact search. We computed the exact Euclidean distance between the query and every stored vector, one by one, and listed the results from the smallest distance to the largest. Because the vectors are embeddings, the closest one has the highest similarity in meaning.
In the end, though, this is not a very fast process. As the data grows, the search becomes much more expensive. It is also wasteful to compute distances to most of the vectors, since they are too far away to matter.
Instead, we can restrict the search to a limited region. We can partition the vectors into clusters around centroids and search only the clusters nearest to the query. This way we gain a lot of search speed in exchange for some accuracy.
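
The centroid idea can be sketched in plain NumPy. This is a simplified illustration of an inverted-file (IVF) search, not Faiss's actual implementation; all function names and the random data are hypothetical. It trains centroids with a few k-means iterations, assigns each vector to its nearest centroid, and then scans only the `n_probe` cells closest to the query.

```python
import numpy as np

def sq_dists(a, b):
    """Squared L2 distances between every row of `a` and every row of `b`."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def train_centroids(vectors, n_clusters, n_iter=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns the centroid matrix."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        labels = sq_dists(vectors, centroids).argmin(axis=1)
        for c in range(n_clusters):
            members = vectors[labels == c]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids

def ivf_search(vectors, centroids, labels, query, k, n_probe):
    """Scan only the `n_probe` cells whose centroids are closest to `query`."""
    cells = sq_dists(query[None, :], centroids)[0].argsort()[:n_probe]
    candidates = np.flatnonzero(np.isin(labels, cells))
    d = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[d.argsort()[:k]]

rng = np.random.default_rng(1)
vectors = rng.random((2000, 16))
centroids = train_centroids(vectors, n_clusters=20)       # "train" step
labels = sq_dists(vectors, centroids).argmin(axis=1)      # "add" step: assign cells
query = rng.random(16)

approx = ivf_search(vectors, centroids, labels, query, k=4, n_probe=1)
exact = ivf_search(vectors, centroids, labels, query, k=4, n_probe=20)  # all cells
print(approx, exact)
```

With `n_probe` equal to the number of clusters, every vector is scanned and the result matches an exhaustive search. With a small `n_probe`, a true neighbor that sits in an unvisited cell is missed, which is where the accuracy loss comes from.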


In [132]:
# Initialize the index
nlist = 100 # number of clusters
k = 4 # number of nearest neighbors
quantizer = faiss.IndexFlatL2(dimension) # coarse quantizer: assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist) # the inverted-file (IVF) index

In [133]:
# Check if the index is trained
index.is_trained

False

The index is not trained yet. We need to train it first.

In [134]:
# Train the index
index.train(embedding_array)
index.is_trained

True

In [135]:
# Add the vectors, then search for the nearest neighbors of the query embedding
index.add(embedding_array)
D, I = index.search(query_vector, k)

In [139]:
I

array([[15240,  9998,  9215,  6322]], dtype=int64)

By default, the number of visited cells is 1. We can increase it with the parameter "index.nprobe".

In [137]:
# Increase the number of visited cells
index.nprobe = 10
D, I = index.search(query_vector, k)

In [140]:
I

array([[15240,  9998,  9215,  6322]], dtype=int64)

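The result is different from the exact search. It can get closer to the exact result by increasing nprobe, since more cells are then visited per query, but each search also becomes slower. Together, nlist and nprobe control this trade-off between accuracy and speed.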
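
One simple way to quantify the gap between the two searches is recall@k: the fraction of the exact top-k neighbors that the approximate search also returned. The sketch below applies it to the index lists printed above; `recall_at_k` is a hypothetical helper, not a Faiss function.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    return len(exact_top & set(approx_ids[:k])) / k

exact_ids = [17167, 15240, 9998, 9215, 14756]   # flat index result from above
approx_ids = [15240, 9998, 9215, 6322]          # IVF index result from above
print(recall_at_k(exact_ids, approx_ids, k=4))  # 0.75
```

Three of the four exact neighbors were found; the closest one (17167) was missed.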