### FAISS
#### Facebook AI Similarity Search (FAISS) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, including sets that may not fit in RAM. It also contains supporting code for evaluation and parameter tuning.


In [19]:
from langchain_community.document_loaders import TextLoader  # Data ingestion
from langchain_community.embeddings import OllamaEmbeddings  # Embeddings
from langchain_community.vectorstores import FAISS           # Vector Stores
from langchain_text_splitters import CharacterTextSplitter   # Char text splitter

loader=TextLoader("five_engineering_disciplines.txt")
documents=loader.load()
text_splitter=CharacterTextSplitter(chunk_size=300,chunk_overlap=30)
docs=text_splitter.split_documents(documents)


In [20]:
docs

[Document(metadata={'source': 'five_engineering_disciplines.txt'}, page_content='The five-pillar framework shown in Figure 1.5 emerged directly from the systems\nchallenges that distinguish ML from traditional software. Each pillar\naddresses specific challenge categories while recognizing their interdependencies:\nData Engineering (Chapter 6) addresses the data-related challenges we identified:\nquality assurance, scale management, drift detection, and distribution\nshift. This pillar encompasses building robust data pipelines that ensure quality,\nhandle massive scale, maintain privacy, and provide the infrastructure upon\nwhich all ML systems depend. For systems like Waymo, this means managing\nterabytes of sensor data per vehicle, validating data quality in real-time,\ndetecting distribution shifts across different cities and weather conditions, and\nmaintaining data lineage for debugging and compliance. The techniques covered\ninclude data versioning, quality monitoring, drift det
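To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not LangChain's implementation: `CharacterTextSplitter` first splits on a separator (a blank line by default) and then merges the pieces, so a chunk can come out longer than `chunk_size` when the text contains no such separator. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample_text = "abcdefghijklmnopqrstuvwxyz"  # hypothetical input
chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=3)
print(chunks)
```

The overlap means the tail of one chunk is repeated at the head of the next, which helps keep a sentence that straddles a boundary retrievable from either side.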

In [21]:
# Create embeddings with Ollama and store them in FAISS.
# Prerequisites:
# - gemma3 does not support embeddings, so gemma:2b is used instead
# - gemma:2b was pulled to the local machine from the command prompt with "ollama run gemma:2b"
# - Install FAISS first with "pip install faiss-cpu"

embeddings=OllamaEmbeddings(model="gemma:2b")
db=FAISS.from_documents(docs,embeddings)
db

<langchain_community.vectorstores.faiss.FAISS at 0x1c91e0e08f0>

In [24]:
# Querying 
query="Explain about Waymo?"
docs=db.similarity_search(query)
docs[0].page_content

'The five-pillar framework shown in Figure 1.5 emerged directly from the systems\nchallenges that distinguish ML from traditional software. Each pillar\naddresses specific challenge categories while recognizing their interdependencies:\nData Engineering (Chapter 6) addresses the data-related challenges we identified:\nquality assurance, scale management, drift detection, and distribution\nshift. This pillar encompasses building robust data pipelines that ensure quality,\nhandle massive scale, maintain privacy, and provide the infrastructure upon\nwhich all ML systems depend. For systems like Waymo, this means managing\nterabytes of sensor data per vehicle, validating data quality in real-time,\ndetecting distribution shifts across different cities and weather conditions, and\nmaintaining data lineage for debugging and compliance. The techniques covered\ninclude data versioning, quality monitoring, drift detection algorithms,\nand privacy-preserving data processing.\nTraining Systems (C

In [None]:
### A retriever acts as an interface that fetches matching documents from the vector store.

### As a Retriever
#### We can also convert the vector store into a retriever. This makes it easy to use in other LangChain methods, most of which work with retrievers.
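The idea of a retriever as a thin interface over a store can be sketched without any external library. Everything below is hypothetical and only illustrates the pattern: `ToyVectorStore` ranks texts by shared words instead of embeddings, and `ToyRetriever` exposes a single `invoke(query)` method so that calling code does not need to know which store sits behind it. In LangChain itself, the number of results is typically passed as `as_retriever(search_kwargs={"k": ...})` rather than as a direct `k` argument.

```python
class ToyVectorStore:
    """Hypothetical stand-in for a vector store; ranks texts by shared words."""

    def __init__(self, texts):
        self.texts = list(texts)

    def similarity_search(self, query, k=4):
        query_words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & set(text.lower().split())),
            reverse=True,  # most shared words first
        )
        return ranked[:k]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


class ToyRetriever:
    """Wraps a store behind one method, invoke(query)."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


store = ToyVectorStore([
    "data pipelines and drift detection",
    "training systems at scale",
    "sensor data for Waymo vehicles",
])
toy_retriever = store.as_retriever(k=1)
result = toy_retriever.invoke("waymo sensor data")
print(result)
```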

In [26]:
retriever=db.as_retriever()
docs=retriever.invoke(query)
docs[0].page_content


'The five-pillar framework shown in Figure 1.5 emerged directly from the systems\nchallenges that distinguish ML from traditional software. Each pillar\naddresses specific challenge categories while recognizing their interdependencies:\nData Engineering (Chapter 6) addresses the data-related challenges we identified:\nquality assurance, scale management, drift detection, and distribution\nshift. This pillar encompasses building robust data pipelines that ensure quality,\nhandle massive scale, maintain privacy, and provide the infrastructure upon\nwhich all ML systems depend. For systems like Waymo, this means managing\nterabytes of sensor data per vehicle, validating data quality in real-time,\ndetecting distribution shifts across different cities and weather conditions, and\nmaintaining data lineage for debugging and compliance. The techniques covered\ninclude data versioning, quality monitoring, drift detection algorithms,\nand privacy-preserving data processing.\nTraining Systems (C

### Similarity Search with Score 
##### FAISS also offers some store-specific methods. One of them is similarity_search_with_score, which returns not only the documents but also the distance between the query and each document. The returned score is an L2 distance, so a lower score is better.
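As a rough illustration of what such a score means, the sketch below computes the L2 (Euclidean) distance on toy 2-dimensional vectors and returns `(text, score)` pairs, nearest first. It also computes the Manhattan (L1) distance for comparison, since the two are easy to confuse. The helper names and the toy data are assumptions for illustration, not FAISS or LangChain APIs. FAISS's flat L2 index is documented as reporting the squared distance (no square root), which would change the numbers but not the ranking.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l1_distance(a, b):
    """Manhattan (L1) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def search_with_score(query_vec, items, k=2):
    """items is a list of (text, vector); return (text, L2 distance), nearest first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in items]
    scored.sort(key=lambda pair: pair[1])  # lower distance = more similar
    return scored[:k]


# Hypothetical 2-dimensional "embeddings"; real ones have many more dimensions.
toy_items = [
    ("chunk A", [1.0, 0.0]),
    ("chunk B", [3.0, 4.0]),
    ("chunk C", [6.0, 8.0]),
]
toy_query_vec = [0.0, 0.0]

results = search_with_score(toy_query_vec, toy_items, k=2)
print(results)
print(l1_distance(toy_query_vec, [3.0, 4.0]))
```

For "chunk B" the L2 distance is 5.0 while the L1 distance is 7.0, so the two metrics give different scores for the same pair of vectors.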


In [28]:
docs_similarity_score=db.similarity_search_with_score(query)
docs_similarity_score  # each result is a (Document, score) tuple; the score (e.g. 4557.5547) is the L2 (Euclidean) distance, so lower means more similar


[(Document(id='5411a3cf-6254-4d0c-817d-f882c791d3a4', metadata={'source': 'five_engineering_disciplines.txt'}, page_content='The five-pillar framework shown in Figure 1.5 emerged directly from the systems\nchallenges that distinguish ML from traditional software. Each pillar\naddresses specific challenge categories while recognizing their interdependencies:\nData Engineering (Chapter 6) addresses the data-related challenges we identified:\nquality assurance, scale management, drift detection, and distribution\nshift. This pillar encompasses building robust data pipelines that ensure quality,\nhandle massive scale, maintain privacy, and provide the infrastructure upon\nwhich all ML systems depend. For systems like Waymo, this means managing\nterabytes of sensor data per vehicle, validating data quality in real-time,\ndetecting distribution shifts across different cities and weather conditions, and\nmaintaining data lineage for debugging and compliance. The techniques covered\ninclude da

In [29]:
# Can we pass an embedding vector directly and retrieve the matching chunks?

embedding_vector=embeddings.embed_query(query)
embedding_vector

[-0.16244126856327057,
 -0.5529950261116028,
 -1.1976211071014404,
 1.253867745399475,
 1.125076413154602,
 1.9726648330688477,
 -0.526867687702179,
 -1.705436110496521,
 -0.24934111535549164,
 0.47159725427627563,
 0.9513911604881287,
 -0.418928861618042,
 -3.6008641719818115,
 -0.24061466753482819,
 -0.9998064637184143,
 -0.9752729535102844,
 4.981388568878174,
 -1.6201950311660767,
 -1.0284048318862915,
 0.004647890105843544,
 0.9154911041259766,
 -0.35932761430740356,
 1.398829460144043,
 -0.6094000339508057,
 -1.0641615390777588,
 0.07642630487680435,
 0.7974633574485779,
 0.46997585892677307,
 0.7658807039260864,
 -3.182512044906616,
 -0.8836308717727661,
 1.9052541255950928,
 -0.2487129271030426,
 1.586466908454895,
 -0.7239973545074463,
 0.3568274676799774,
 1.5544346570968628,
 0.12234538048505783,
 2.827087879180908,
 -0.42380228638648987,
 -1.0067087411880493,
 0.18449804186820984,
 0.30362391471862793,
 -0.8231160640716553,
 -0.07804828882217407,
 0.04001285508275032,
 1.03

In [30]:
# Then search the store with that embedding vector instead of the query text
docs_score=db.similarity_search_by_vector(embedding_vector)
docs_score

[Document(id='5411a3cf-6254-4d0c-817d-f882c791d3a4', metadata={'source': 'five_engineering_disciplines.txt'}, page_content='The five-pillar framework shown in Figure 1.5 emerged directly from the systems\nchallenges that distinguish ML from traditional software. Each pillar\naddresses specific challenge categories while recognizing their interdependencies:\nData Engineering (Chapter 6) addresses the data-related challenges we identified:\nquality assurance, scale management, drift detection, and distribution\nshift. This pillar encompasses building robust data pipelines that ensure quality,\nhandle massive scale, maintain privacy, and provide the infrastructure upon\nwhich all ML systems depend. For systems like Waymo, this means managing\nterabytes of sensor data per vehicle, validating data quality in real-time,\ndetecting distribution shifts across different cities and weather conditions, and\nmaintaining data lineage for debugging and compliance. The techniques covered\ninclude dat
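The two cells above suggest that searching by text amounts to embedding the query and then searching by vector. A minimal sketch of that relationship follows, assuming a made-up embedding (`toy_embed`, which just counts vowels) in place of a real model; none of these names come from LangChain.

```python
def toy_embed(text):
    """Hypothetical embedding: counts of the letters a, e, i, o, u."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "aeiou"]


def search_by_vector(query_vec, texts, k=1):
    """Return the k texts whose toy embeddings are closest to query_vec."""
    def squared_l2(text):
        return sum((q - v) ** 2 for q, v in zip(query_vec, toy_embed(text)))
    return sorted(texts, key=squared_l2)[:k]


def search_by_text(query, texts, k=1):
    # Text search here is "embed the query, then search by vector".
    return search_by_vector(toy_embed(query), texts, k)


toy_texts = ["data engineering", "training systems", "model deployment"]
toy_query = "data engineers"

by_text = search_by_text(toy_query, toy_texts)
by_vector = search_by_vector(toy_embed(toy_query), toy_texts)
print(by_text, by_vector)
```

Searching by vector is useful when the embedding has already been computed, for example when it is cached or produced by a separate service.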

In [33]:
# Save the FAISS index to a local folder so it can be reloaded later
db.save_local("faiss_index")

In [37]:
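# Load the saved FAISS index from disk.
# allow_dangerous_deserialization=True is required because part of the index is stored with pickle;
# only enable it for files you created yourself or otherwise trust.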
new_db=FAISS.load_local("faiss_index",embeddings,allow_dangerous_deserialization=True)
docs=new_db.similarity_search(query)
docs

[Document(id='5411a3cf-6254-4d0c-817d-f882c791d3a4', metadata={'source': 'five_engineering_disciplines.txt'}, page_content='The five-pillar framework shown in Figure 1.5 emerged directly from the systems\nchallenges that distinguish ML from traditional software. Each pillar\naddresses specific challenge categories while recognizing their interdependencies:\nData Engineering (Chapter 6) addresses the data-related challenges we identified:\nquality assurance, scale management, drift detection, and distribution\nshift. This pillar encompasses building robust data pipelines that ensure quality,\nhandle massive scale, maintain privacy, and provide the infrastructure upon\nwhich all ML systems depend. For systems like Waymo, this means managing\nterabytes of sensor data per vehicle, validating data quality in real-time,\ndetecting distribution shifts across different cities and weather conditions, and\nmaintaining data lineage for debugging and compliance. The techniques covered\ninclude dat