### FAISS

Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. Its algorithms search in sets of vectors of any size. It is written in C++ with complete wrappers for Python.


In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

## Re-export the keys; skip missing ones, since os.environ rejects None values
for key in ('HF_TOKEN', 'OPENAI_API_KEY'):
    value = os.getenv(key)
    if value is not None:
        os.environ[key] = value

In [2]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_openai import ChatOpenAI

embeddings = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-mpnet-base-v2")
embeddings


HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

### Similarity Search

Given a set of vectors $x_i$ in dimension $d$, FAISS builds a data structure in RAM from them. Once the structure is built, given a new vector $x$ in dimension $d$, it efficiently performs the operation:

$j = \operatorname{argmin}_i \lVert x - x_i \rVert$

where $\lVert \cdot \rVert$ is the Euclidean distance ($L^2$).

In FAISS, this data structure is an index: an object that has an `add` method to add the $x_i$ vectors. Computing the argmin is the `search` operation on the index.
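To make the argmin concrete, here is a minimal pure-Python sketch of an exhaustive ("flat") search. `ToyFlatIndex` and its methods are hypothetical names chosen for illustration; this is not the FAISS API, only the idea of adding vectors and then scanning all of them for the closest one.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


class ToyFlatIndex:
    """Hypothetical stand-in for a flat index (illustration only, not FAISS)."""

    def __init__(self, d):
        self.d = d          # dimension every vector must have
        self.vectors = []   # the stored x_i

    def add(self, x):
        if len(x) != self.d:
            raise ValueError(f"expected dimension {self.d}, got {len(x)}")
        self.vectors.append(list(x))

    def search(self, x):
        """Return (j, distance) where j = argmin_i ||x - x_i||."""
        if not self.vectors:
            raise ValueError("index is empty")
        distances = [l2_distance(x, xi) for xi in self.vectors]
        j = min(range(len(distances)), key=distances.__getitem__)
        return j, distances[j]


index_demo = ToyFlatIndex(d=2)
for vec in [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]:
    index_demo.add(vec)

nearest_j, nearest_dist = index_demo.search((0.9, 1.2))
print(nearest_j, round(nearest_dist, 4))
```

A real index does the same job with optimized C++ code, and other index types avoid scanning every vector.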

FAISS can also:
- Return not just the nearest neighbour but also the 2nd, 3rd, ..., $k$-th nearest neighbours.
- Search several vectors at a time (batch processing).
- Trade precision for speed.
- Perform maximum inner product search $\operatorname{argmax}_i \langle x, x_i \rangle$ instead of minimum Euclidean search. There is limited support for other distances.
- Store the index on disk rather than RAM.
- Index binary vectors rather than floating point vectors.
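Three of the capabilities listed above (returning $k$ neighbours, batching queries, and switching to inner product) can be sketched in plain Python. The function `top_k` and the `metric` values `"l2"` and `"ip"` are assumptions made for this illustration and do not correspond to FAISS function names.

```python
import heapq


def inner_product(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def squared_l2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def top_k(queries, stored, k, metric="l2"):
    """For each query in the batch, return positions of the k best stored vectors."""
    results = []
    positions = range(len(stored))
    for q in queries:
        if metric == "l2":
            # Smaller distance means a better match.
            best = heapq.nsmallest(k, positions, key=lambda i: squared_l2(q, stored[i]))
        elif metric == "ip":
            # Larger inner product means a better match.
            best = heapq.nlargest(k, positions, key=lambda i: inner_product(q, stored[i]))
        else:
            raise ValueError(f"unknown metric: {metric}")
        results.append(best)
    return results


stored_vectors = [(1.0, 0.0), (0.0, 1.0), (3.0, 3.0), (0.9, 0.1)]
query_batch = [(1.0, 0.0), (0.0, 2.0)]

l2_hits = top_k(query_batch, stored_vectors, k=2, metric="l2")
ip_hits = top_k(query_batch, stored_vectors, k=2, metric="ip")
print(l2_hits)
print(ip_hits)
```

In this toy data the long vector `(3.0, 3.0)` wins every inner product search even though it is far from both queries, which is why inner product search is usually paired with normalized vectors.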

The most commonly used similarity measures are:

- Cosine similarity
- Euclidean distance

In [3]:
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Cats and dogs are both popular pets.",
    "I love my pet cat.",
    "Dogs are great companions."
] 

my_question = "What do cats and dogs have in common?"

In [4]:
doc_embed=embeddings.embed_documents(documents)
query_embed = embeddings.embed_query(my_question)
cosine_similarity([query_embed], doc_embed)

array([[0.20420025, 0.28294987, 0.69093253, 0.45672216, 0.57053787]])

In [5]:
from sklearn.metrics.pairwise import euclidean_distances
euclidean_distances([query_embed], doc_embed)

array([[1.26158608, 1.19753927, 0.78621559, 1.04237982, 0.92678164]])
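The two outputs above are closely related. For unit-length vectors, $\lVert a - b \rVert^2 = 2 - 2\cos(a, b)$, so both measures produce the same ranking. The sketch below checks this against the numbers printed above; it assumes the embeddings are normalized to unit length, which is what these numbers suggest.

```python
import math

# Values copied from the two outputs above.
cosine_scores = [0.20420025, 0.28294987, 0.69093253, 0.45672216, 0.57053787]
euclidean_scores = [1.26158608, 1.19753927, 0.78621559, 1.04237982, 0.92678164]

# For unit vectors: ||a - b|| = sqrt(2 - 2 * cos(a, b))
predicted = [math.sqrt(2 - 2 * c) for c in cosine_scores]
max_gap = max(abs(p - e) for p, e in zip(predicted, euclidean_scores))

# Highest cosine first versus smallest distance first.
rank_by_cosine = sorted(range(5), key=lambda i: -cosine_scores[i])
rank_by_distance = sorted(range(5), key=lambda i: euclidean_scores[i])

print(max_gap < 1e-5)
print(rank_by_cosine, rank_by_distance)
```

Both rankings put "Cats and dogs are both popular pets." first.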

### FAISS

In [6]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore

In [7]:
## Create a FAISS index
## The FAISS index is used to store and search the embeddings.
index = faiss.IndexFlatL2(768)
index

<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x000001EE47720DE0> >

In [8]:
## create a new FAISS vector store
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [10]:
## Add documents to the FAISS index
data = ['This is cat', 'This is dog', 'This is cat and dog', 'I love my cat', 'Dogs are great']
vector_store.add_texts(data)
## Perform a similarity search for the single most similar text (k=1)
response = vector_store.similarity_search('What does cat and dogs have in common', k=1)
## Display the content of the first response
response[0].page_content

'This is cat and dog'
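The three arguments passed to the vector store (`index`, `docstore`, `index_to_docstore_id`) work together: the index holds vectors by position, `index_to_docstore_id` maps each position to a document id, and the docstore maps that id to the stored text. The toy class below sketches that bookkeeping in plain Python. `ToyVectorStore` and `toy_embed` are hypothetical, and the two-dimensional "embedding" is made up purely for illustration.

```python
from uuid import uuid4


class ToyVectorStore:
    """Illustration of the bookkeeping only; not LangChain's implementation."""

    def __init__(self, embed):
        self.embed = embed               # callable: str -> list of floats
        self.vectors = []                # plays the role of the FAISS index
        self.index_to_docstore_id = {}   # position in the index -> document id
        self.docstore = {}               # document id -> stored text

    def add_texts(self, texts):
        ids = []
        for text in texts:
            doc_id = str(uuid4())
            position = len(self.vectors)
            self.vectors.append(self.embed(text))
            self.index_to_docstore_id[position] = doc_id
            self.docstore[doc_id] = text
            ids.append(doc_id)
        return ids

    def similarity_search(self, query, k=1):
        q = self.embed(query)

        def squared_distance(v):
            return sum((a - b) ** 2 for a, b in zip(q, v))

        ranked = sorted(range(len(self.vectors)),
                        key=lambda i: squared_distance(self.vectors[i]))
        # Translate index positions back into stored texts.
        return [self.docstore[self.index_to_docstore_id[i]] for i in ranked[:k]]


def toy_embed(text):
    """Made-up embedding: how often 'cat' and 'dog' occur in the text."""
    lowered = text.lower()
    return [float(lowered.count("cat")), float(lowered.count("dog"))]


toy_store = ToyVectorStore(toy_embed)
toy_ids = toy_store.add_texts(["This is cat", "This is dog", "This is cat and dog"])
toy_hits = toy_store.similarity_search("What do cats and dogs have in common", k=1)
print(toy_hits)
```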

In [13]:
## uuid4 is used to generate unique identifiers for the documents
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)

document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)

document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)

document_4 = Document(
    page_content="Robbers broke into the city bank and stole $1 million in cash.",
    metadata={"source": "news"},
)

document_5 = Document(
    page_content="Wow! That was an amazing movie. I can't wait to see it again.",
    metadata={"source": "tweet"},
)

document_6 = Document(
    page_content="Is the new iPhone worth the price? Read this review to find out.",
    metadata={"source": "website"},
)

document_7 = Document(
    page_content="The top 10 soccer players in the world right now.",
    metadata={"source": "website"},
)

document_8 = Document(
    page_content="LangGraph is the best framework for building stateful, agentic applications!",
    metadata={"source": "tweet"},
)

document_9 = Document(
    page_content="The stock market is down 500 points today due to fears of a recession.",
    metadata={"source": "news"},
)

document_10 = Document(
    page_content="I have a bad feeling I am going to get deleted :(",
    metadata={"source": "tweet"},
)

documents = [
    document_1,
    document_2,
    document_3,
    document_4,
    document_5,
    document_6,
    document_7,
    document_8,
    document_9,
    document_10,
]

## Generating unique identifiers for each document
## This is important for FAISS to keep track of the documents.
uuids = [str(uuid4()) for _ in range(len(documents))]


In [None]:
## define a new FAISS index with dot product similarity
import faiss
index = faiss.IndexFlatIP(768)

## Create a new FAISS vector store with the new index
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
## Add the documents to the FAISS vector store
## This will index the documents and make them searchable.
vector_store.add_documents(documents,ids=uuids)

['a6c6d81f-5c95-492d-89c0-763c3f4f2d86',
 '9965042e-9324-4db5-be04-e88c6149e3c7',
 '58286f76-23af-4aba-85a5-ffe071d90580',
 'c58446fe-0132-4da5-b796-fb03e74baafc',
 '0784b69b-3a60-4e69-835d-0066cb75e31d',
 '9d5f7884-7b84-4b2c-9975-c503ebaa3437',
 '5150d679-ab6e-41b9-ae41-cb9aa238da46',
 'e38ac633-d312-42eb-8646-f91066386ffb',
 '4bde7fba-f441-4ff6-bb83-d0b2a6de62c0',
 '94ebdeba-7cdf-439e-af24-3a2cd56f4c2d']

##### Querying Vector Store

In [None]:
## Perform a similarity search
## This will return the top k most similar documents to the query.
vector_store.similarity_search("LangChain provides abstractions to make working with LLMs easy",k=2)

[Document(id='58286f76-23af-4aba-85a5-ffe071d90580', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain - come check it out!'),
 Document(id='e38ac633-d312-42eb-8646-f91066386ffb', metadata={'source': 'tweet'}, page_content='LangGraph is the best framework for building stateful, agentic applications!')]

Performing Similarity Search with filters on metadata.

In [None]:
## Query directly with the vector store
## We can use the filter parameter to filter the results based on metadata.
results = vector_store.similarity_search(
    'Langchain provides abstractions to make workings with LLMs easy',
    k=2,
    filter={'source':'tweet'}
)

## Display the results
for res in results:
    print(f'-{res.page_content}-[{res.metadata}]')

-Building an exciting new project with LangChain - come check it out!-[{'source': 'tweet'}]
-LangGraph is the best framework for building stateful, agentic applications!-[{'source': 'tweet'}]


We can also apply advanced metadata filters to the same similarity search. The currently supported operators are:


    $eq (equals)
    $neq (not equals)
    $gt (greater than)
    $lt (less than)
    $gte (greater than or equal)
    $lte (less than or equal)
    $in (membership in list)
    $nin (not in list)
    $and (all conditions must match)
    $or (any condition must match)
    $not (negation of condition)
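To show how such operators combine, here is a toy matcher that evaluates a filter against a metadata dictionary. It is a sketch under simplifying assumptions, not LangChain's implementation: `matches` and `COMPARATORS` are hypothetical names, a missing key simply fails every comparison, and the `likes` field is invented for the example.

```python
import operator

# Comparison operators from the list above, mapped to plain Python checks.
COMPARATORS = {
    "$eq": operator.eq,
    "$neq": operator.ne,
    "$gt": operator.gt,
    "$lt": operator.lt,
    "$gte": operator.ge,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, condition):
    """Return True if `metadata` satisfies the filter `condition` (toy version)."""
    for key, expected in condition.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in expected):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in expected):
                return False
        elif key == "$not":
            if matches(metadata, expected):
                return False
        elif key not in metadata:
            return False  # toy choice: a missing field never matches
        elif isinstance(expected, dict):
            for op, operand in expected.items():
                if not COMPARATORS[op](metadata[key], operand):
                    return False
        elif metadata[key] != expected:
            return False  # plain value means equality
    return True


sample_metadata = [
    {"source": "tweet", "likes": 12},
    {"source": "news", "likes": 40},
    {"source": "website", "likes": 3},
]

simple_filter = {"source": "tweet"}
advanced_filter = {
    "$and": [
        {"source": {"$in": ["tweet", "news"]}},
        {"likes": {"$gte": 20}},
    ]
}
negated_filter = {"$not": {"source": {"$eq": "news"}}}

kept_simple = [m for m in sample_metadata if matches(m, simple_filter)]
kept_advanced = [m for m in sample_metadata if matches(m, advanced_filter)]
kept_negated = [m for m in sample_metadata if matches(m, negated_filter)]
print(kept_advanced)
```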


In [None]:
## Using advanced metadata filtering for the same similarity search
results = vector_store.similarity_search(
    'Langchain provides abstractions to make workings with LLMs easy',
    k=2,
    filter={'source':{'$eq':'news'}}
)

for res in results:
    print(f'-{res.page_content}-[{res.metadata}]')

-Robbers broke into the city bank and stole $1 million in cash.-[{'source': 'news'}]
-The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.-[{'source': 'news'}]


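We can also get the similarity score for each result by using **similarity_search_with_score()**. Because this vector store was built on `IndexFlatIP`, the score is an inner product, so a higher score means a closer match.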

In [36]:
results = vector_store.similarity_search_with_score(
    'How is the stock market',
    k=2,
    filter={'source':'news'}
)

for res,scr in results:
    print(f'*[Score={scr}] {res.page_content}-[{res.metadata}]')

*[Score=0.3349035978317261] The stock market is down 500 points today due to fears of a recession.-[{'source': 'news'}]
*[Score=0.0808696299791336] The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.-[{'source': 'news'}]


##### Retriever
We can also convert our vector store into a retriever for easy usage in our chains. This is how we will use it in RAG applications.

There are three values of `search_type` to choose from:
- Similarity (`'similarity'`, the default)
- Maximum Marginal Relevance (`'mmr'`)
- Similarity Score Threshold (`'similarity_score_threshold'`)
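Maximum Marginal Relevance is the least obvious of the three, so here is a small greedy sketch of the idea: each pick is rewarded for relevance to the query and penalized for similarity to documents already picked. `mmr_select` is a hypothetical helper, the vectors are made up, and `lambda_mult` is borrowed as a name for the relevance/diversity trade-off; this is not the library's implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidates, k, lambda_mult=0.5):
    """Greedily pick k candidate positions, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, candidates[i])
            # Similarity to the closest already-selected candidate (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidate_vecs = [(1.0, 0.0), (0.99, 0.1), (0.6, 0.8)]
query_vec_demo = (1.0, 0.2)

plain_top2 = sorted(
    range(len(candidate_vecs)),
    key=lambda i: -cosine(query_vec_demo, candidate_vecs[i]),
)[:2]
mmr_top2 = mmr_select(query_vec_demo, candidate_vecs, k=2, lambda_mult=0.5)
print(plain_top2, mmr_top2)
```

Plain similarity returns the two near-duplicates, while the MMR pick swaps the second one for the more diverse candidate.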

In [38]:
retriever = vector_store.as_retriever(search_type = 'mmr',search_kwargs={'k':1})
retriever.invoke('Stealing from the bank is crime', filter={'source':'news'})

[Document(id='c58446fe-0132-4da5-b796-fb03e74baafc', metadata={'source': 'news'}, page_content='Robbers broke into the city bank and stole $1 million in cash.')]

#### Saving and Loading 

So far the index has lived only in memory, but we can also save the FAISS index to disk and load it back. This is useful because we don't have to rebuild it each time we use it.

**Saving** - save_local(folder_path, index_name)

In [39]:
## Saving FAISS index in local
vector_store.save_local('faiss_index')

**Loading** - load_local()

In [43]:
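## Load the FAISS index from disk.
## allow_dangerous_deserialization=True is needed because the docstore is
## saved with pickle; only enable it for files you created yourself.
new_vector_store = FAISS.load_local(
    folder_path='faiss_index',
    embeddings=embeddings,
    allow_dangerous_deserialization=True
)

docs = new_vector_store.similarity_search('qux', k=1)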

In [44]:
docs

[Document(id='58286f76-23af-4aba-85a5-ffe071d90580', metadata={'source': 'tweet'}, page_content='Building an exciting new project with LangChain - come check it out!')]

In [79]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
emb = embeddings.embed_query('This is ashu')
len(emb)

1536
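The length printed above matters because the index dimension has to match the embedding model: 768 was used for the Hugging Face model earlier, and 1536 is used for this OpenAI model. One way to avoid hard-coding the number is to embed a short probe string and measure it. The helpers below are a hypothetical sketch, and `fake_embed_query` stands in for a real `embed_query` method so the example runs without an API key.

```python
def infer_dimension(embed_query, probe_text="dimension probe"):
    """Embed a short probe string and return the length of the vector."""
    return len(embed_query(probe_text))


def check_dimension(index_dim, embed_query):
    """Raise early if the index dimension and the embedding size disagree."""
    actual = infer_dimension(embed_query)
    if actual != index_dim:
        raise ValueError(
            f"index expects {index_dim}-d vectors but the model returns {actual}-d"
        )
    return actual


def fake_embed_query(text):
    """Stand-in for embeddings.embed_query; always returns a 1536-d vector."""
    return [0.0] * 1536


print(infer_dimension(fake_embed_query))
```

With the real objects from this notebook, the same idea would read `faiss.IndexFlatIP(infer_dimension(embeddings.embed_query))`, at the cost of one extra embedding call.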

##### Building a Basic RAG

In [95]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

## loading the LLM model
model= ChatOpenAI(model='gpt-4o-mini')

## loading the text document
loader = TextLoader('speech.txt')
documents = loader.load()

## defining index
index = faiss.IndexFlatIP(1536)

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap=50, add_start_index = True)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(model='text-embedding-3-small')
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
doc_ids = vector_store.add_documents(texts)
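The `chunk_size`, `chunk_overlap` and `add_start_index` settings used above are easier to picture with a simplified splitter. The sketch below is a plain fixed-size sliding window and is only an approximation: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_window_chunks` function does not do.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size pieces where consecutive pieces share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # start_index records where the piece begins in the original text.
        chunks.append({"start_index": start, "page_content": piece})
        if start + chunk_size >= len(text):
            break  # the last piece already reaches the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk["start_index"], chunk["page_content"])
```

Each piece begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.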


In [96]:
retriever_score = vector_store.as_retriever(search_type = 'similarity_score_threshold', search_kwargs={'score_threshold':0.4})
retriever = vector_store.as_retriever(search_type = 'mmr', search_kwargs={'k':2})
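The first retriever above keeps only results whose relevance score reaches the threshold. Conceptually that is a simple filter, sketched here with made-up scores; `filter_by_threshold` is a hypothetical helper, not a library function.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (doc, score) pairs whose relevance score is at least the threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= score_threshold]


# Invented relevance scores, where higher means more relevant.
scored = [("chunk A", 0.62), ("chunk B", 0.41), ("chunk C", 0.35), ("chunk D", 0.08)]
kept_chunks = filter_by_threshold(scored, score_threshold=0.4)
print(kept_chunks)
```

How raw index scores are turned into relevance scores depends on the distance strategy the vector store is configured with, so it is worth checking that it matches the index type before relying on a particular threshold value.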

In [97]:
retriever.invoke('What did PM talk about transformation of the Northeast')

[Document(id='0e37222c-426e-41da-92af-826f8d5a28df', metadata={'source': 'speech.txt', 'start_index': 2382}, page_content='The transformation that the Northeast has seen in the last 11 years is not just about numbers—it is change that can be felt on the ground. We have not just built a connection with the Northeast through government schemes—we have built a bond from the heart. You might be surprised to hear this: ministers from our central government have visited the Northeast more than 700 times. And it wasn’t just about visiting and leaving—the rule was to stay overnight. They experienced the land, they saw the'),
 Document(id='8204108c-2697-4a6f-972e-13dde727d68a', metadata={'source': 'speech.txt', 'start_index': 12352}, page_content="Rising Northeast is not just an investors' summit — it is a movement. It is a call to action. The future of Bharat will rise to new heights through the bright future of the Northeast. I have complete faith in all the business leaders. Come, let us tog

In [98]:
retriever_score.invoke('What did PM talk about transformation of the Northeast')

[Document(id='0e37222c-426e-41da-92af-826f8d5a28df', metadata={'source': 'speech.txt', 'start_index': 2382}, page_content='The transformation that the Northeast has seen in the last 11 years is not just about numbers—it is change that can be felt on the ground. We have not just built a connection with the Northeast through government schemes—we have built a bond from the heart. You might be surprised to hear this: ministers from our central government have visited the Northeast more than 700 times. And it wasn’t just about visiting and leaving—the rule was to stay overnight. They experienced the land, they saw the'),
 Document(id='d4efb4ba-e28d-4f68-b663-036fd64a6ca3', metadata={'source': 'speech.txt', 'start_index': 3764}, page_content='Revolution in the Northeast. For a long time, the Northeast remained neglected. But now, the Northeast is becoming a land of opportunities. We have invested hundreds of thousands of crores of rupees in connectivity infrastructure in the Northeast. If y

In [99]:
prompt = hub.pull('rlm/rag-prompt')
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

In [100]:
example = prompt.invoke(
    {'context':'(context goes here)', 'question':'(question goes here)'}
).to_messages()

assert len(example)==1
print(example[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: (question goes here) 
Context: (context goes here) 
Answer:


In [101]:
## context(retriever),prompt(hub),model(openai),parser(langchain)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {'context': retriever | format_docs, 'question':RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [102]:
rag_chain.invoke('What did PM talk about transformation of the Northeast')

'PM discussed the transformation of the Northeast as a significant change felt on the ground, emphasizing a deep connection and bond established through government initiatives. He highlighted the commitment of central government ministers, who have visited the region over 700 times to engage more meaningfully. The "Rising Northeast" initiative was described as a movement aimed at elevating the region and contributing to the broader growth of Bharat (India).'
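The chain above can be read as ordinary function composition: retrieve, join the chunks, fill the prompt, call the model, extract the text. The sketch below spells that out with stand-in functions so it runs without any API key. Every function here is hypothetical, and the fake retriever and model return canned values instead of real results.

```python
def fake_retriever(question):
    """Stand-in for the retriever: returns retrieved chunks as plain strings."""
    return ["chunk about topic A", "chunk about topic B"]


def join_chunks(chunks):
    """Same role as format_docs: merge the chunks into one context string."""
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    """Stand-in for the prompt template: slot question and context into text."""
    return f"Question: {inputs['question']} \nContext: {inputs['context']} \nAnswer:"


def fake_model(prompt_text):
    """Stand-in for the chat model: returns a message-like dictionary."""
    return {"content": f"(answer based on {len(prompt_text)} prompt characters)"}


def parse_output(message):
    """Same role as the output parser: keep only the text of the message."""
    return message["content"]


def toy_rag_chain(question):
    # The dict step: 'context' comes from retrieval, 'question' passes through.
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return parse_output(fake_model(fill_prompt(inputs)))


demo_question = "What is topic A?"
demo_prompt = fill_prompt(
    {"context": join_chunks(fake_retriever(demo_question)), "question": demo_question}
)
demo_answer = toy_rag_chain(demo_question)
print(demo_prompt)
print(demo_answer)
```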