## FAISS (Facebook AI Similarity Search)
FAISS (Facebook AI Similarity Search) is an open-source library developed by Meta AI for efficient similarity search in high-dimensional spaces. It is optimized for:
- Large-scale AI workloads
- High-speed similarity searches
- GPU acceleration

*FAISS stores vector embeddings and performs fast nearest-neighbor searches over them, supporting both exact and Approximate Nearest Neighbor (ANN) indexes.*

### Step 1: Generate Embeddings

In [6]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

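# Load the text file into a list of Document objects
loader = TextLoader("sample.txt")
doc = loader.load()

# Split into chunks targeting 100 characters, with a 20-character overlap
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)
docs = text_splitter.split_documents(doc)
docs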

Created a chunk of size 125, which is longer than the specified 100
Created a chunk of size 139, which is longer than the specified 100


[Document(metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'),
 Document(metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.'),
 Document(metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks.")]
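The "longer than the specified 100" warnings above arise because the splitter breaks text on a separator and does not cut inside a piece that is already longer than `chunk_size`. The sketch below is a simplified, pure-Python illustration of that behavior, not LangChain's implementation: `split_on_separator` is a hypothetical helper, it assumes a blank-line separator, and it ignores `chunk_overlap`.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole (simplified sketch;
    overlap handling is omitted).
    """
    chunks = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size on its own
    if current:
        chunks.append(current)
    return chunks


# Two short paragraphs followed by one 120-character paragraph.
sample = "Short paragraph.\n\nAnother short one.\n\n" + "x" * 120
chunk_lengths = [len(chunk) for chunk in split_on_separator(sample, chunk_size=100)]
print(chunk_lengths)
```

The two short paragraphs are merged into one chunk, while the 120-character paragraph stays whole and exceeds the limit, which mirrors the situation the warnings describe.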

In [7]:
from langchain_community.embeddings import OllamaEmbeddings

# Initialize the Ollama embeddings model
embeddings = OllamaEmbeddings(model='gemma2:2b')


In [8]:

from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(documents=docs, embedding=embeddings)
db



<langchain_community.vectorstores.faiss.FAISS at 0x27f3504b100>

In [14]:
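# Query the vector store.
# `query` is defined here and reused in later cells;
# this first search passes a separate, more specific question.
query = "What is Langchain"
query_result = db.similarity_search(query='what are the different sources can be managed by Langchain?')
query_result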

[Document(id='626c8a0e-98c4-41cc-bce8-6d05329c4c92', metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks."),
 Document(id='21623649-d47c-450c-92d4-08e9c639130f', metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'),
 Document(id='ea6f0b89-3d1d-422c-a062-01f76925d269', metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.')]

#### As a Retriever

We can also convert the vector store into a Retriever class. This allows us to easily use it in other LangChain methods, which largely work with retrievers.
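Conceptually, a retriever is a thin wrapper that exposes a uniform `invoke(query)` call over the store's search method. The toy classes below are hypothetical stand-ins written for illustration: they are not LangChain classes, and a simple word-overlap ranking replaces real embedding search. They only show the shape of the pattern.

```python
class ToyVectorStore:
    """Hypothetical stand-in for a vector store; ranks by shared words, not embeddings."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        # Count words shared with the query; more shared words ranks higher.
        scored = [(len(query_words & set(t.lower().split())), t) for t in self.texts]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

    def as_retriever(self, k=4):
        # Hand back an object that only knows how to answer queries.
        return ToyRetriever(self, k)


class ToyRetriever:
    """Hypothetical wrapper exposing a single invoke(query) entry point."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


store = ToyVectorStore([
    "LangChain is an open-source framework",
    "FAISS is a similarity search library",
    "Bananas are yellow",
])
toy_retriever = store.as_retriever(k=2)
toy_results = toy_retriever.invoke("what is faiss")
print(toy_results)
```

With the real store, the number of returned documents is controlled through `search_kwargs`, for example `db.as_retriever(search_kwargs={"k": 2})`.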

In [15]:
retriever = db.as_retriever()
query_results = retriever.invoke(query)
query_results

[Document(id='626c8a0e-98c4-41cc-bce8-6d05329c4c92', metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks."),
 Document(id='21623649-d47c-450c-92d4-08e9c639130f', metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'),
 Document(id='ea6f0b89-3d1d-422c-a062-01f76925d269', metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.')]

In [16]:
query_results[0].page_content

"Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks."

### Similarity Search with Score

There are some FAISS-specific methods. One of them is similarity_search_with_score, which returns not only the documents but also the distance score between the query and each of them. The returned score is an L2 distance, so a lower score is better.
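To make the score concrete, here is a minimal pure-Python sketch of ranking by L2 distance. The vectors and the `l2_distance` and `search_with_score` helpers are made up for illustration. FAISS computes this far more efficiently, and depending on the index it may report the squared distance, which preserves the ordering.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_with_score(query_vec, vectors, k=2):
    """Return the k (label, distance) pairs closest to query_vec, smallest distance first."""
    scored = [(label, l2_distance(query_vec, vec)) for label, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-D "embeddings" (illustrative values only).
vectors = {
    "doc_a": [0.0, 0.0],
    "doc_b": [3.0, 4.0],
    "doc_c": [1.0, 0.0],
}
hits = search_with_score([0.0, 0.0], vectors, k=2)
print(hits)
```

The closest vector comes first, just as the document with the lowest score is listed first in the FAISS results.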

In [17]:
docs_sscore = db.similarity_search_with_score(query)
docs_sscore

[(Document(id='626c8a0e-98c4-41cc-bce8-6d05329c4c92', metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks."),
  10290.141),
 (Document(id='21623649-d47c-450c-92d4-08e9c639130f', metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'),
  11005.59),
 (Document(id='ea6f0b89-3d1d-422c-a062-01f76925d269', metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.'),
  12517.192)]

In [18]:
embedding_vector = embeddings.embed_query(query)
embedding_vector

[-0.3284129202365875,
 2.8094985485076904,
 -3.145840883255005,
 ...]

In [19]:
docs_score = db.similarity_search_by_vector(embedding_vector)
docs_score

[Document(id='626c8a0e-98c4-41cc-bce8-6d05329c4c92', metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks."),
 Document(id='21623649-d47c-450c-92d4-08e9c639130f', metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'),
 Document(id='ea6f0b89-3d1d-422c-a062-01f76925d269', metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.')]

#### Saving and Loading the Vector DB

In [20]:
db.save_local("faiss_index")

In [23]:
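# allow_dangerous_deserialization=True is needed because the saved docstore is pickled;
# only load index files from a source you trust.
new_db = FAISS.load_local('faiss_index', embeddings, allow_dangerous_deserialization=True)
docs = new_db.similarity_search(query)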

In [24]:
docs

[Document(id='626c8a0e-98c4-41cc-bce8-6d05329c4c92', metadata={'source': 'sample.txt'}, page_content="Using LangChain's document loaders, we can efficiently fetch data from multiple sources and utilize it for various AI-based tasks."),
 Document(id='21623649-d47c-450c-92d4-08e9c639130f', metadata={'source': 'sample.txt'}, page_content='LangChain is an open-source framework designed to help developers build applications powered by large language models (LLMs).'),
 Document(id='ea6f0b89-3d1d-422c-a062-01f76925d269', metadata={'source': 'sample.txt'}, page_content='It provides tools for loading, processing, and managing different types of data sources such as text files, PDFs, web pages, and databases.')]