### FAISS

Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. Its algorithms search in sets of vectors of any size. It is written entirely in C++, with wrappers for Python.


In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

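# Copy each key into the process environment only if it is set;
# assigning None to os.environ raises a TypeError.
for key in ("HF_TOKEN", "OPENAI_API_KEY"):
    value = os.getenv(key)
    if value is not None:
        os.environ[key] = value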

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

embeddings = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-mpnet-base-v2")
embeddings

HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

### Similarity Search

Given a set of vectors $x_i$ in dimension $d$, Faiss builds a data structure in RAM from them. Once the structure is built, given a new vector $x$ in dimension $d$, it efficiently performs the operation:
$j = \operatorname{argmin}_i \lVert x - x_i \rVert$

where $\lVert \cdot \rVert$ is the Euclidean distance ($L^2$).

In FAISS terms, the data structure is an index: an object with an add method for adding the $x_i$ vectors. Computing the argmin is the search operation on the index.
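To make the add/search idea concrete, here is a minimal pure-Python sketch of an exhaustive ("flat") index. The class name `TinyFlatIndex` and its methods are hypothetical and only mirror the concept; this is not the FAISS API, and real FAISS indexes work on NumPy arrays and are heavily optimized.

```python
import math


class TinyFlatIndex:
    """Toy exhaustive L2 index (illustrative only, not the FAISS API)."""

    def __init__(self, d):
        self.d = d          # dimension every stored vector must have
        self.vectors = []   # the x_i vectors, in insertion order

    def add(self, xs):
        # Store each vector after checking its dimension.
        for x in xs:
            if len(x) != self.d:
                raise ValueError(f"expected dimension {self.d}, got {len(x)}")
            self.vectors.append(list(x))

    def search(self, x, k=1):
        # Compute the distance from x to every stored vector,
        # then keep the k smallest as (distance, position) pairs.
        scored = [(math.dist(x, xi), i) for i, xi in enumerate(self.vectors)]
        scored.sort()
        return scored[:k]


index = TinyFlatIndex(d=2)
index.add([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
nearest = index.search([0.9, 1.2], k=2)
print(nearest)
```

With `k=1` the search is exactly the argmin above. Larger `k` returns the next-closest vectors as well.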

FAISS can also:
- Return not just the nearest neighbour but the $k$ nearest neighbours.
- Search several vectors at a time (batch processing).
- Trade precision for speed.
- Perform maximum inner product search $\operatorname{argmax}_i \langle x, x_i \rangle$ instead of minimum Euclidean search. There is also limited support for other distances.
- Store the index on disk rather than RAM.
- Index binary vectors rather than floating point vectors.
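Maximum inner product search and minimum Euclidean search can pick different neighbours when vectors have different lengths. The sketch below shows this with made-up 2-D vectors (illustrative values, not taken from the notebook). It also shows that the two criteria agree once everything is scaled to unit length.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [p / norm for p in v]


# Made-up vectors: the second one is much longer than the others.
vectors = [[1.0, 0.0], [3.0, 3.0], [0.9, 0.1]]
query = [1.0, 0.08]
positions = range(len(vectors))

# Minimum Euclidean distance versus maximum inner product on the raw vectors.
l2_best = min(positions, key=lambda i: math.dist(query, vectors[i]))
ip_best = max(positions, key=lambda i: dot(query, vectors[i]))

# The same two searches after normalizing every vector to unit length.
unit_vectors = [normalize(v) for v in vectors]
unit_query = normalize(query)
unit_l2_best = min(positions, key=lambda i: math.dist(unit_query, unit_vectors[i]))
unit_ip_best = max(positions, key=lambda i: dot(unit_query, unit_vectors[i]))

print(l2_best, ip_best, unit_l2_best, unit_ip_best)
```

On the raw vectors the long vector wins the inner product search simply because of its magnitude. After normalization both searches rank by angle alone.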

The most commonly used similarity measures are:

- Cosine similarity
- Euclidean distance

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Cats and dogs are both popular pets.",
    "I love my pet cat.",
    "Dogs are great companions."
] 

my_question = "What do cats and dogs have in common?"

In [4]:
doc_embed = embeddings.embed_documents(documents)
query_embed = embeddings.embed_query(my_question)
cosine_similarity([query_embed], doc_embed)

array([[0.20420025, 0.28294987, 0.69093253, 0.45672216, 0.57053787]])

In [5]:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances([query_embed], doc_embed)

array([[1.26158608, 1.19753927, 0.78621559, 1.04237982, 0.92678164]])
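The two outputs above rank the documents identically. For unit-length vectors this is expected, because $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$. The sketch below checks that identity against the printed numbers. That the embeddings are unit-length is an inference from these outputs, not something the notebook states.

```python
import math

# Values copied from the two cell outputs above.
cosine_scores = [0.20420025, 0.28294987, 0.69093253, 0.45672216, 0.57053787]
euclidean_scores = [1.26158608, 1.19753927, 0.78621559, 1.04237982, 0.92678164]

# For unit vectors, the Euclidean distance implied by a cosine score c is sqrt(2 - 2c).
implied = [math.sqrt(2 - 2 * c) for c in cosine_scores]
max_gap = max(abs(i - e) for i, e in zip(implied, euclidean_scores))

# Rank documents from most to least similar under each measure.
by_cosine = sorted(range(5), key=lambda i: -cosine_scores[i])
by_euclidean = sorted(range(5), key=lambda i: euclidean_scores[i])

print(max_gap)
print(by_cosine, by_euclidean)
```

Because the two rankings coincide here, choosing between cosine similarity and Euclidean distance would not change which document is retrieved for this query.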

### FAISS Vector Store

In [6]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

In [7]:
## Create a FAISS index
## The FAISS index stores the embeddings and searches over them.
## 768 is the embedding dimension of all-mpnet-base-v2; IndexFlatL2 does exact L2 search.
index = faiss.IndexFlatL2(768)
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001EE47720DE0> >
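The index dimension has to match the length of the vectors the embedding model produces. One way to avoid hard-coding it is to embed a short probe string and measure the result. The helper `infer_dimension` and the stand-in `fake_embed_query` below are hypothetical names used only for illustration. The stand-in lets the sketch run without downloading a model.

```python
def infer_dimension(embed_query, probe_text="dimension probe"):
    """Return the length of the vector produced by an embedding callable."""
    return len(embed_query(probe_text))


def fake_embed_query(text):
    # Stand-in embedder (hypothetical): always returns a 768-dimensional vector.
    return [0.0] * 768


dimension = infer_dimension(fake_embed_query)
print(dimension)
```

In this notebook, passing `embeddings.embed_query` instead of the stand-in should give the value to hand to `faiss.IndexFlatL2`, at the cost of one extra embedding call.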

In [8]:
## create a new FAISS vector store
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
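The constructor takes three pieces besides the embedding function: the index, the docstore, and the mapping between them. The following pure-Python sketch shows one plausible way these fit together, using plain lists and dicts as stand-ins. It is a conceptual model with made-up 2-D vectors, not LangChain's actual implementation.

```python
import math

# Toy stand-ins for the three pieces passed to the vector store (illustrative only).
index_rows = []             # plays the role of the FAISS index: vectors by row number
index_to_docstore_id = {}   # row number -> document id
docstore = {}               # document id -> document text


def add_text(doc_id, text, vector):
    """Store the vector in the next free row and record which document it belongs to."""
    row = len(index_rows)
    index_rows.append(vector)
    index_to_docstore_id[row] = doc_id
    docstore[doc_id] = text


def search(query_vector, k=1):
    """Find the k closest rows, then translate rows to ids and ids to texts."""
    rows = sorted(
        range(len(index_rows)),
        key=lambda r: math.dist(query_vector, index_rows[r]),
    )
    return [docstore[index_to_docstore_id[r]] for r in rows[:k]]


# Hand-made vectors standing in for real embeddings.
add_text("id-cat", "This is cat", [1.0, 0.0])
add_text("id-dog", "This is dog", [0.0, 1.0])
add_text("id-both", "This is cat and dog", [0.7, 0.7])

result = search([0.6, 0.8], k=1)
print(result)
```

The index only knows row numbers and vectors. The mapping and the docstore are what turn a search hit back into a document.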

In [10]:
## Add documents to the FAISS index
data = ['This is cat', 'This is dog', 'This is cat and dog', 'I love my cat', 'Dogs are great']
vector_store.add_texts(data)
## Perform a similarity search for the single closest text (k=1)
response = vector_store.similarity_search('What does cat and dogs have in common', k=1)
## Display the content of the first response
response[0].page_content

'This is cat and dog'

In [13]:
## uuid4 is used to generate unique identifiers for the documents
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

## Generate a unique identifier for each document
## The vector store uses these ids to keep track of the documents.
uuids = [str(uuid4()) for _ in range(len(documents))]


In [None]:
## define a new FAISS index with dot product similarity
import faiss
index = faiss.IndexFlatIP(768)

## Create a new FAISS vector store with the new index
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
## Add the documents to the FAISS vector store
## This will index the documents and make them searchable.
vector_store.add_documents(documents,ids=uuids)

['a6c6d81f-5c95-492d-89c0-763c3f4f2d86',
 '9965042e-9324-4db5-be04-e88c6149e3c7',
 '58286f76-23af-4aba-85a5-ffe071d90580',
 'c58446fe-0132-4da5-b796-fb03e74baafc',
 '0784b69b-3a60-4e69-835d-0066cb75e31d',
 '9d5f7884-7b84-4b2c-9975-c503ebaa3437',
 '5150d679-ab6e-41b9-ae41-cb9aa238da46',
 'e38ac633-d312-42eb-8646-f91066386ffb',
 '4bde7fba-f441-4ff6-bb83-d0b2a6de62c0',
 '94ebdeba-7cdf-439e-af24-3a2cd56f4c2d']

##### Querying the Vector Store

In [None]:
## Perform a similarity search
## This will return the top k most similar documents to the query.
vector_store.similarity_search("LangChain provides abstractions to make working with LLMs easy",k=2)

[Document(id='58286f76-23af-4aba-85a5-ffe071d90580', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain - come check it out!'),
 Document(id='e38ac633-d312-42eb-8646-f91066386ffb', metadata={'source': 'tweet'}, page_content='LangGraph is the best framework for building stateful, agentic applications!')]