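### Facebook AI Similarity Search (Faiss)
Faiss is a library for quickly searching embeddings of multimedia documents that are similar to each other. Here it indexes the embeddings of text chunks so that the chunks closest to a query can be retrieved.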
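
To make the idea concrete, here is a minimal brute-force sketch of what a similarity index does conceptually. It is not how Faiss is implemented; the 3-dimensional vectors and the helper names `l2_squared` and `top_k` are made up for illustration, and squared L2 distance is assumed as the metric.

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, vectors, k=2):
    """Return (position, distance) pairs for the k vectors closest to the query."""
    scored = [(i, l2_squared(query, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
nearest = top_k([1.0, 0.0, 0.0], vectors, k=2)
print(nearest)  # positions 0 and 1, closest first
```

A real index avoids comparing the query against every stored vector, which is where the speed comes from.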

In [1]:
from langchain_openai import ChatOpenAI
from constants import OPENAI_KEY
import os
os.environ["OPENAI_API_KEY"] = OPENAI_KEY

llm = ChatOpenAI(openai_api_key=OPENAI_KEY, temperature=0.7)

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter
from langchain_community.document_loaders import YoutubeLoader, TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

In [48]:
# loader = TextLoader("state_of_the_union.txt")
# Raw string so the backslashes in the Windows path are not read as escape sequences.
loader = TextLoader(r"D:\youtube\langchain\state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
db = FAISS.from_documents(docs, embeddings)
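
The splitting step can be pictured with a simplified sketch. This is not the `CharacterTextSplitter` implementation; it assumes a greedy strategy that packs separator-delimited pieces into chunks of at most `chunk_size` characters with no overlap, and `split_text` is a hypothetical helper.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc\n\ndddd"
chunks = split_text(sample, chunk_size=10)
print(chunks)  # ['aaaa\n\nbbbb', 'cccc\n\ndddd']
```

Each chunk is then embedded separately, so the chunk size controls how much text a single search hit returns.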


In [49]:
query = "What did the president say about Jackson"
docs = db.similarity_search(query)

In [50]:
docs

[Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in priv

In [6]:
print(docs[0].page_content)

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

A former top litigator in private practice. A former federal publi

#### Retriever

In [53]:
retriever = db.as_retriever()
retriever.get_relevant_documents("summary of the doc by 10 words only")

[Document(page_content='To get there, I call on Congress to fund ARPA-H, the Advanced Research Projects Agency for Health. \n\nIt’s based on DARPA—the Defense Department project that led to the Internet, GPS, and so much more.  \n\nARPA-H will have a singular purpose—to drive breakthroughs in cancer, Alzheimer’s, diabetes, and more. \n\nA unity agenda for the nation. \n\nWe can do this. \n\nMy fellow Americans—tonight , we have gathered in a sacred space—the citadel of our democracy. \n\nIn this Capitol, generation after generation, Americans have debated great questions amid great strife, and have done great things. \n\nWe have fought for freedom, expanded liberty, defeated totalitarianism and terror. \n\nAnd built the strongest, freest, and most prosperous nation the world has ever known. \n\nNow is the hour. \n\nOur moment of responsibility. \n\nOur test of resolve and conscience, of history itself. \n\nIt is in this moment that our character is formed. Our purpose is found. Our fut

In [10]:
# The vector store can also be converted into a retriever, the interface that most other LangChain components work with.
retriever = db.as_retriever()
docs = retriever.invoke(query)
print(docs[0].page_content)

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. 

We cannot let this happen. 

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. 

A former top litigator in private practice. A former federal publi
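
As a rough mental model, a retriever is a thin wrapper that hides the search call and its settings behind one `invoke(query)` method. The sketch below is a toy illustration, not LangChain's `VectorStoreRetriever`; `ToyRetriever`, `keyword_search`, and the three-sentence corpus are all hypothetical, and word overlap stands in for vector similarity.

```python
class ToyRetriever:
    """Wraps a search function and its keyword arguments behind invoke(query)."""

    def __init__(self, search_fn, **search_kwargs):
        self.search_fn = search_fn
        self.search_kwargs = search_kwargs

    def invoke(self, query):
        return self.search_fn(query, **self.search_kwargs)


CORPUS = [
    "the president nominated judge jackson",
    "congress should fund arpa-h",
    "pass the voting rights act",
]


def keyword_search(query, k=4):
    """Stand-in for a vector search: rank texts by the number of shared words."""
    words = set(query.lower().split())
    ranked = sorted(CORPUS, key=lambda t: len(words & set(t.split())), reverse=True)
    return ranked[:k]


toy_retriever = ToyRetriever(keyword_search, k=1)
print(toy_retriever.invoke("What did the president say about Jackson"))
```

Because callers only depend on `invoke`, the search strategy behind it can change without touching the calling code.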

In [13]:
# By default, the vector store retriever uses similarity search.
# If the underlying vector store supports maximum marginal relevance (MMR) search,
# you can specify that as the search type.
retriever = db.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents("what did he say about ketanji brown jackson")

In [12]:
docs

[Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in priv
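
Maximum marginal relevance picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The following is a minimal sketch of that idea, not LangChain's code; the `mmr` helper, the dot-product similarity, the toy 2-dimensional vectors, and the `lambda_mult` weighting are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mmr(query, vectors, k=2, lambda_mult=0.5):
    """Pick k positions, trading relevance to the query against redundancy with earlier picks."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = dot(query, vectors[i])
            redundancy = max((dot(vectors[i], vectors[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical, roughly unit-length vectors: 0 and 1 are near-duplicates, 2 points elsewhere.
vectors = [[1.0, 0.0], [0.995, 0.0999], [0.0, 1.0]]
query = [0.8, 0.6]

print(mmr(query, vectors, k=2, lambda_mult=0.5))  # [1, 2]: skips the near-duplicate
print(mmr(query, vectors, k=2, lambda_mult=1.0))  # [1, 0]: pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches a plain similarity ranking; lower values favor diversity.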

#### Similarity Search with score

In [14]:
# similarity_search_with_score returns each document together with its distance
# from the query (L2 distance), so a lower score means a closer match.
query = "what did he say about ketanji brown jackson"
docs_and_scores = db.similarity_search_with_score(query)
docs_and_scores[0]

(Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in priv

In [19]:
embedding_vector = embeddings.embed_query(query)
docs_and_scores = db.similarity_search_with_score_by_vector(embedding_vector)
#db.similarity_search_by_vector

In [20]:
docs_and_scores

[(Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in pri

In [26]:
for i in docs_and_scores:
    print(i)


(Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in priv

In [32]:
for i, x in enumerate(docs_and_scores):
    print(i, x)

0 (Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in pr

In [64]:
for doc, score in docs_and_scores:
    #print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
    print(f"Metadata: {doc.metadata}, Score: {score}")

Metadata: {'source': 'D:\\youtube\\langchain\\state_of_the_union.txt'}, Score: 0.48037999868392944
Metadata: {'source': 'D:\\youtube\\langchain\\state_of_the_union.txt'}, Score: 0.5242013931274414
Metadata: {'source': 'D:\\youtube\\langchain\\state_of_the_union.txt'}, Score: 0.5252190232276917
Metadata: {'source': 'D:\\youtube\\langchain\\state_of_the_union.txt'}, Score: 0.5510718822479248


In [39]:
# Similarity score threshold retrieval
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)


#docs = retriever.get_relevant_documents("what did he say about jackson")
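
Note that `score_threshold` is compared against a relevance score, where higher is better, rather than against the raw distances printed earlier. The sketch below assumes the conversion `1 - distance / sqrt(2)` for Euclidean distance; treat that formula as an assumption and verify it against your LangChain version. The distances are the scores from the output above, rounded to four decimals.

```python
import math


def relevance_score(distance):
    """Assumed conversion: distance 0 maps to 1.0, larger distances map to lower scores."""
    return 1.0 - distance / math.sqrt(2)


def keep_above(distances, score_threshold):
    """Return (distance, relevance) pairs whose relevance meets the threshold."""
    scored = [(d, relevance_score(d)) for d in distances]
    return [(d, r) for d, r in scored if r >= score_threshold]


# Distances copied (rounded) from the similarity_search_with_score output above.
distances = [0.4804, 0.5242, 0.5252, 0.5511]

print(len(keep_above(distances, 0.5)))   # 4: every hit clears a 0.5 threshold
print(len(keep_above(distances, 0.65)))  # 1: only the closest hit remains
```

Under this assumption a threshold of 0.5 would not filter out any of the four hits shown, so a stricter value is needed to narrow the results.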

In [40]:
retriever = db.as_retriever(search_kwargs={"k": 1})

In [44]:
retriever.get_prompts

<bound method Runnable.get_prompts of VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x000001F38673BB90>, search_kwargs={'k': 1})>

In [46]:
#retriever.get_relevant_documents("summary of the doc")


In [47]:


#docs = retriever.get_relevant_documents("summary")
#len(docs)

#### Saving and loading

In [33]:
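# You can also save and load a FAISS index.
# This is useful so you don't have to recreate it every time you use it.
# Note: recent langchain_community releases require passing
# allow_dangerous_deserialization=True to load_local, because the docstore is stored as a pickle file.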
db.save_local("faiss_index")
new_db = FAISS.load_local("faiss_index", embeddings)

docs = new_db.similarity_search(query)

docs[0]

Document(page_content='In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections. \n\nWe cannot let this happen. \n\nTonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in priva

#### Serializing and De-Serializing to bytes

In [35]:
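# serialize_to_bytes packs the FAISS index into a compact bytes object, which is handy
# if you want to store the index in a database such as SQL.
# Caution: this cell also switches `embeddings` to a HuggingFace model. A deserialized index
# can only be queried with the embedding model it was built with, so rebuild the index
# with FAISS.from_documents before searching with the new model.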
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

pkl = db.serialize_to_bytes()  # serializes the faiss
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

db = FAISS.deserialize_from_bytes(
    embeddings=embeddings, serialized=pkl
)  # Load the index
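
The round trip can be illustrated with a toy store and the standard `pickle` module. This is only a sketch: the dictionary layout and the `check_query` helper are hypothetical, and it assumes pickle-style serialization, as the variable name `pkl` suggests. It also shows why the embedding model used after loading has to produce vectors of the same dimension as the stored ones.

```python
import pickle

# Toy stand-in for a vector store: raw vectors plus the texts they belong to.
store = {
    "dim": 3,
    "vectors": [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]],
    "texts": ["first chunk", "second chunk"],
}

blob = pickle.dumps(store)     # bytes, suitable for a BLOB column or object storage
restored = pickle.loads(blob)  # only unpickle bytes that you created yourself


def check_query(store, query_vector):
    """Raise if the query embedding does not match the dimension the index was built with."""
    if len(query_vector) != store["dim"]:
        raise ValueError(
            f"query has {len(query_vector)} dimensions, index expects {store['dim']}"
        )
    return True


print(type(blob).__name__, restored == store)  # bytes True
```

Calling `check_query(restored, [0.1, 0.2])` raises a `ValueError`, which mirrors what happens when an index is queried with a different embedding model than the one that built it.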




#### Merging

In [54]:
db1 = FAISS.from_texts(["Hello How r u "], embeddings)
db2 = FAISS.from_texts(["I am fine, tell me about you?"], embeddings)

db1.docstore._dict

{'a1ea8bde-3cef-48f4-a028-dba4314ce452': Document(page_content='Hello How r u ')}

In [58]:
db2.docstore._dict

{'62df44a9-1ba6-48df-9dc9-63d4cc24b43c': Document(page_content='I am fine, tell me about you?')}

In [59]:
db1.merge_from(db2)

In [60]:
db1.docstore._dict

{'a1ea8bde-3cef-48f4-a028-dba4314ce452': Document(page_content='Hello How r u '),
 '62df44a9-1ba6-48df-9dc9-63d4cc24b43c': Document(page_content='I am fine, tell me about you?')}
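
Conceptually, a merge appends the second store's vectors to the first and shifts the position-to-ID mapping by the first store's length. The sketch below is a toy model, not the LangChain implementation; `ToyStore` and the `id-a`/`id-b` identifiers are hypothetical.

```python
class ToyStore:
    """Minimal stand-in for a vector store: vectors, a docstore, and a position -> ID map."""

    def __init__(self, entries):
        # entries: list of (doc_id, vector, text) tuples
        self.vectors = [vec for _, vec, _ in entries]
        self.docstore = {doc_id: text for doc_id, _, text in entries}
        self.index_to_docstore_id = {i: doc_id for i, (doc_id, _, _) in enumerate(entries)}

    def merge_from(self, other):
        """Append the other store's contents, refusing IDs that already exist here."""
        overlap = self.docstore.keys() & other.docstore.keys()
        if overlap:
            raise ValueError(f"duplicate IDs: {sorted(overlap)}")
        offset = len(self.index_to_docstore_id)
        self.vectors.extend(other.vectors)
        self.docstore.update(other.docstore)
        for i, doc_id in other.index_to_docstore_id.items():
            self.index_to_docstore_id[offset + i] = doc_id


store_a = ToyStore([("id-a", [0.0, 1.0], "Hello How r u ")])
store_b = ToyStore([("id-b", [1.0, 0.0], "I am fine, tell me about you?")])
store_a.merge_from(store_b)
print(store_a.index_to_docstore_id)  # {0: 'id-a', 1: 'id-b'}
```

Both stores must hold vectors from the same embedding model for the merged index to be meaningful.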

In [61]:
from langchain.schema import Document

list_of_documents = [
    Document(page_content="foo", metadata=dict(page=1)),
    Document(page_content="bar", metadata=dict(page=1)),
    Document(page_content="foo", metadata=dict(page=2)),
    Document(page_content="barbar", metadata=dict(page=2)),
    Document(page_content="foo", metadata=dict(page=3)),
    Document(page_content="fofof", metadata=dict(page=3)),
    Document(page_content="bar burr", metadata=dict(page=3)),
    Document(page_content="foo", metadata=dict(page=4)),
    Document(page_content="bar bruh", metadata=dict(page=4)),
    Document(page_content="fo", metadata=dict(page=4)),
]
dbnew = FAISS.from_documents(list_of_documents, embeddings)
results_with_scores = dbnew.similarity_search_with_score("fo")
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")

Content: fo, Metadata: {'page': 4}, Score: 0.0
Content: foo, Metadata: {'page': 1}, Score: 0.16252771019935608
Content: foo, Metadata: {'page': 2}, Score: 0.16252771019935608
Content: foo, Metadata: {'page': 3}, Score: 0.16252771019935608


In [62]:
results_with_scores = dbnew.similarity_search_with_score("bar")
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")

Content: bar, Metadata: {'page': 1}, Score: 0.0
Content: barbar, Metadata: {'page': 2}, Score: 0.01111103966832161
Content: bar bruh, Metadata: {'page': 4}, Score: 0.2908173203468323
Content: foo, Metadata: {'page': 1}, Score: 0.31312763690948486


In [67]:
results_with_scores = dbnew.similarity_search_with_score("bar", filter=dict(page=1))
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")

Content: bar, Metadata: {'page': 1}, Score: 0.0
Content: foo, Metadata: {'page': 1}, Score: 0.31312763690948486


In [68]:
results = dbnew.max_marginal_relevance_search("foo", filter=dict(page=1))
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")

Content: foo, Metadata: {'page': 1}
Content: bar, Metadata: {'page': 1}


In [69]:
results = dbnew.similarity_search("foo", filter=dict(page=1), k=1, fetch_k=4)
for doc in results:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}")

Content: foo, Metadata: {'page': 1}
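
The interplay of `k`, `fetch_k`, and `filter` is easier to see in a sketch. It assumes the filter is applied after fetching: the `fetch_k` nearest documents are retrieved first, then filtered by metadata, and at most `k` survivors are returned. `filtered_search` and the 1-dimensional toy vectors are hypothetical.

```python
def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def filtered_search(query, docs, metadata_filter, k=1, fetch_k=4):
    """Fetch the fetch_k nearest docs, keep those matching the filter, return at most k."""
    ranked = sorted(docs, key=lambda d: l2_squared(query, d["vector"]))[:fetch_k]
    matching = [
        d for d in ranked
        if all(d["metadata"].get(key) == value for key, value in metadata_filter.items())
    ]
    return matching[:k]


# Hypothetical documents: the only page-1 document is the third closest to the query.
toy_docs = [
    {"text": "a", "vector": [0.0], "metadata": {"page": 2}},
    {"text": "b", "vector": [0.1], "metadata": {"page": 2}},
    {"text": "c", "vector": [0.5], "metadata": {"page": 1}},
    {"text": "d", "vector": [0.9], "metadata": {"page": 3}},
]

print(filtered_search([0.0], toy_docs, {"page": 1}, k=1, fetch_k=2))  # []
print([d["text"] for d in filtered_search([0.0], toy_docs, {"page": 1}, k=1, fetch_k=4)])  # ['c']
```

If a filter returns fewer results than expected, raising `fetch_k` gives the filter more candidates to choose from.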


#### Delete

You can also delete documents by ID. Note that the IDs to delete must be IDs that exist in the docstore.

In [70]:
dbnew.index_to_docstore_id

{0: '0c2d29a8-7ec6-4327-b375-fe154d772290',
 1: '05ea62df-1219-4845-b391-0293b99169bf',
 2: 'e89c5276-5a9c-4986-a437-5e0b8c674c96',
 3: 'af33d11e-b3ad-4690-a373-b94c8344740e',
 4: '491acee6-77b9-448c-8a8d-efa5fa8a1d73',
 5: 'ff5cf0a9-4ef7-4772-9a61-00615b736580',
 6: 'aff7a0db-1fc7-4933-be46-69b1fe8e411d',
 7: '7dcfd08f-41ce-490b-a2f1-d61f5374a8fa',
 8: '920af885-ad1c-4232-82ae-9d0e6ce65da3',
 9: '40d78d29-10e0-44db-ae08-bbc34d3d7908'}

In [71]:
dbnew.delete([dbnew.index_to_docstore_id[0]])

True

In [72]:
dbnew.index_to_docstore_id

{0: '05ea62df-1219-4845-b391-0293b99169bf',
 1: 'e89c5276-5a9c-4986-a437-5e0b8c674c96',
 2: 'af33d11e-b3ad-4690-a373-b94c8344740e',
 3: '491acee6-77b9-448c-8a8d-efa5fa8a1d73',
 4: 'ff5cf0a9-4ef7-4772-9a61-00615b736580',
 5: 'aff7a0db-1fc7-4933-be46-69b1fe8e411d',
 6: '7dcfd08f-41ce-490b-a2f1-d61f5374a8fa',
 7: '920af885-ad1c-4232-82ae-9d0e6ce65da3',
 8: '40d78d29-10e0-44db-ae08-bbc34d3d7908'}
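
The output above shows that the surviving entries are renumbered from 0 after a delete. A minimal sketch of that behavior follows; it is a toy model under that assumption, and `delete_ids` and the `id-a`/`id-b`/`id-c` identifiers are hypothetical.

```python
def delete_ids(index_to_docstore_id, vectors, ids_to_delete):
    """Remove entries by docstore ID and renumber the survivors from 0."""
    missing = set(ids_to_delete) - set(index_to_docstore_id.values())
    if missing:
        raise ValueError(f"IDs not in the store: {sorted(missing)}")
    keep = [
        pos for pos, doc_id in sorted(index_to_docstore_id.items())
        if doc_id not in ids_to_delete
    ]
    new_vectors = [vectors[pos] for pos in keep]
    new_mapping = {new: index_to_docstore_id[old] for new, old in enumerate(keep)}
    return new_mapping, new_vectors


mapping = {0: "id-a", 1: "id-b", 2: "id-c"}
toy_vectors = [[0.0], [1.0], [2.0]]
mapping, toy_vectors = delete_ids(mapping, toy_vectors, ["id-a"])
print(mapping)  # {0: 'id-b', 1: 'id-c'}
```

Because positions shift, checking whether position 0 still exists says nothing about the deletion; check for the docstore ID itself.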

In [73]:
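# The deleted ID is now missing. The remaining entries are renumbered from 0, so check for
# the docstore ID (taken from the In [70] output) rather than for the integer position.
'0c2d29a8-7ec6-4327-b375-fe154d772290' in dbnew.index_to_docstore_id.values()

False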