In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
from langchain.schema import HumanMessage
from langchain_openai import ChatOpenAI

In [4]:
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import OpenSearchVectorSearch
from langchain_openai import OpenAIEmbeddings


loader = TextLoader("art_of_war.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
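
To make the chunking step above more concrete, here is a minimal sketch of greedy, separator-based splitting in plain Python. It is a simplified illustration, not LangChain's implementation: `split_into_chunks` is a hypothetical helper, and it assumes paragraphs are separated by a blank line (`"\n\n"`, which is also the default separator of `CharacterTextSplitter`) with no overlap between chunks.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    # Drop empty pieces so repeated blank lines do not create empty chunks.
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece
            # that is kept whole rather than cut in the middle.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_into_chunks(sample, chunk_size=10))
```

With `chunk_size=10`, the first two paragraphs fit into one chunk and the third starts a new one. A paragraph longer than `chunk_size` is kept as a single oversized chunk in this sketch.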

Similarity Search demonstration. 

In [5]:
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://localhost:9200",
    http_auth=("admin", "admin"),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    engine="faiss",
    space_type="innerproduct",
    ef_construction=256,
    m=48,
)

In [6]:
import time

query = "What can one do about being prepared to fight?"

# Record the start time
start_time = time.time()

# Perform the similarity search
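# Store the hits under a new name so the full chunk list in `docs`
# is not overwritten and can be indexed again later
results = docsearch.similarity_search(query, k=10)

# Calculate the elapsed time
elapsed_time = time.time() - start_time

print(results[0].page_content)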

print("Similarity search using FAISS took {:.2f} seconds.".format(elapsed_time))

23. Throw your soldiers into positions whence there is no escape,
and they will prefer death to flight. If they will face death, there
is nothing they may not achieve. Officers and men alike will put forth
their uttermost strength. 

24. Soldiers when in desperate straits lose the sense of fear. If
there is no place of refuge, they will stand firm. If they are in
hostile country, they will show a stubborn front. If there is no help
for it, they will fight hard. 

25. Thus, without waiting to be marshaled, the soldiers will be constantly
on the qui vive; without waiting to be asked, they will do your will;
without restrictions, they will be faithful; without giving orders,
they can be trusted. 

26. Prohibit the taking of omens, and do away with superstitious doubts.
Then, until death itself comes, no calamity need be feared.
Similarity search using FAISS took 0.40 seconds.


This notebook runs against an OpenSearch Docker instance from a Python 3.12 virtual environment. It loads the documents from a text file with TextLoader, generates embeddings, and indexes them in OpenSearch.

The system is now set up to perform semantic searches using OpenSearch. The document embeddings are indexed and can be queried to find semantically relevant results based on the vector space model.
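
As a rough illustration of what a query against the vector index computes, the sketch below ranks vectors by inner product, matching the `space_type="innerproduct"` setting used above. It is an exact brute-force search over toy two-dimensional vectors, whereas OpenSearch uses an approximate HNSW graph (which is what `ef_construction` and `m` configure). The function names and the sample vectors are assumptions made for the example.

```python
def inner_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_inner_product(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors with the largest inner product with the query."""
    scored = [(inner_product(query, v), i) for i, v in enumerate(vectors)]
    # Highest score first; ties keep their original order because the sort is stable.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy "embeddings" standing in for document chunks.
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vector = [1.0, 0.1]
print(top_k_by_inner_product(query_vector, chunk_vectors, k=2))
```

Brute force compares the query with every stored vector, which is fine for a handful of chunks but grows linearly with the collection. Approximate indexes trade a little recall for much lower query cost on large collections.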

What remains to be done is to rebuild the index with the NMSLIB engine in place of FAISS and compare the two. Both engines provide approximate k-NN search inside OpenSearch.

In [7]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import OpenSearchVectorSearch

In [8]:
docsearch = OpenSearchVectorSearch.from_documents(
    docs,
    embeddings,
    opensearch_url="https://localhost:9200",
    http_auth=("admin", "admin"),
    use_ssl=False,
    verify_certs=False,
    ssl_assert_hostname=False,
    ssl_show_warn=False,
    engine="nmslib",
    space_type="innerproduct",
    ef_construction=256,
    m=48,
)

In [9]:
import time

query = "What can one do about being prepared to fight?"

# Record the start time
start_time = time.time()

# Perform the similarity search
# Keep the hits separate from the chunk list in `docs`
results = docsearch.similarity_search(query, k=10)

# Calculate the elapsed time
elapsed_time = time.time() - start_time

print(results[0].page_content)

print("Similarity search using NMSLIB took {:.2f} seconds.".format(elapsed_time))

23. Throw your soldiers into positions whence there is no escape,
and they will prefer death to flight. If they will face death, there
is nothing they may not achieve. Officers and men alike will put forth
their uttermost strength. 

24. Soldiers when in desperate straits lose the sense of fear. If
there is no place of refuge, they will stand firm. If they are in
hostile country, they will show a stubborn front. If there is no help
for it, they will fight hard. 

25. Thus, without waiting to be marshaled, the soldiers will be constantly
on the qui vive; without waiting to be asked, they will do your will;
without restrictions, they will be faithful; without giving orders,
they can be trusted. 

26. Prohibit the taking of omens, and do away with superstitious doubts.
Then, until death itself comes, no calamity need be feared.
Similarity search using NMSLIB took 0.16 seconds.


There is a measurable difference between NMSLIB and FAISS: in this single run, the NMSLIB query took 0.16 seconds and the FAISS query took 0.40 seconds.
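
A single timed call can be skewed by warm-up effects, and each `similarity_search` call most likely also spends time embedding the query through the OpenAI API. A slightly more robust comparison repeats the call and reports the median. The helper below is a sketch; `time_call`, `repeats`, and `warmup` are names chosen for this example and are not part of LangChain or OpenSearch.

```python
import statistics
import time


def time_call(fn, repeats: int = 5, warmup: int = 1):
    """Call fn several times and return (median_seconds, all_samples)."""
    # Warm-up calls are executed but not measured.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples


# Stand-in workload so the sketch runs without an OpenSearch instance.
median_seconds, samples = time_call(lambda: sum(range(10_000)))
print("median: {:.6f} s over {} runs".format(median_seconds, len(samples)))

# In the notebook, the same helper could wrap the real query, for example:
# median_seconds, _ = time_call(lambda: docsearch.similarity_search(query, k=10))
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short durations.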

In [10]:
query = "What should one keep in mind when applying to college?"
results = docsearch.similarity_search(query)
print(results[0].page_content)

18. If asked how to cope with a great host of the enemy in orderly
array and on the point of marching to the attack, I should say: "Begin
by seizing something which your opponent holds dear; then he will
be amenable to your will." 

19. Rapidity is the essence of war: take advantage of the enemy's
unreadiness, make your way by unexpected routes, and attack unguarded
spots. 

20. The following are the principles to be observed by an invading
force: The further you penetrate into a country, the greater will
be the solidarity of your troops, and thus the defenders will not
prevail against you. 

21. Make forays in fertile country in order to supply your army with
food. 

22. Carefully study the well-being of your men, and do not overtax
them. Concentrate your energy and hoard your strength. Keep your army
continually on the move, and devise unfathomable plans.
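
The query about college still returns passages on warfare because a nearest-neighbour search always returns the closest chunks, however weak the match. One common way to handle this is to look at the similarity scores and discard weak hits. The sketch below filters `(item, score)` pairs of the kind returned by LangChain's `similarity_search_with_score`. The helper `filter_by_score`, the sample scores, and the threshold are hypothetical, and it assumes a higher score means a closer match. A sensible cut-off depends on the engine and space type and would need to be calibrated on real queries.

```python
def filter_by_score(scored_results, min_score: float):
    """Keep only the (item, score) pairs whose score is at least min_score."""
    return [(item, score) for item, score in scored_results if score >= min_score]


# Made-up pairs standing in for (Document, score) tuples from the vector store.
scored = [
    ("passage about readiness", 0.82),
    ("passage about supplies", 0.41),
    ("passage about omens", 0.35),
]

relevant = filter_by_score(scored, min_score=0.5)  # 0.5 is an arbitrary example threshold
for text, score in relevant:
    print("{:.2f}  {}".format(score, text))

if not relevant:
    print("No sufficiently similar passage found.")
```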


In this Jupyter notebook, I demonstrate how to build a semantic search system on OpenSearch using two k-NN engines: FAISS (Facebook AI Similarity Search) and NMSLIB. I load a sample text file with the LangChain TextLoader, split it into chunks with CharacterTextSplitter, and convert the text into embeddings using OpenAI's embedding model. The embeddings are then indexed in OpenSearch with the FAISS engine, and I show how to query that index with a similarity search over the embedded text.

Subsequently, I switch the engine to NMSLIB while maintaining the same methodology. I load the same text into a second OpenSearch k-NN (k-nearest neighbour) index, using the same chunking and embedding setup, and run the same query against it. Both indexes return the same top passage, and in this run the NMSLIB query completed faster (0.16 seconds versus 0.40 seconds). The exercise shows how these tools can be combined to run semantic searches over text data from a Python notebook.