In [1]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

### Load VectorStore

In [2]:
# Load the document and split it into chunks. Embedding and indexing happen in a later cell.
raw_documents = TextLoader('./input/state_of_the_union.txt').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

Created a chunk of size 1163, which is longer than the specified 1000
Created a chunk of size 1015, which is longer than the specified 1000
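
The "Created a chunk of size ..." warnings appear because `CharacterTextSplitter` splits only on its separator (a blank line, `"\n\n"`, by default) and then merges neighbouring pieces up to `chunk_size`. A single piece that is already longer than the limit is kept whole. The sketch below illustrates that behaviour in plain Python. It is a simplified illustration, not LangChain's implementation: `split_on_separator` is a hypothetical helper, the sample text is made up, and chunk overlap is ignored.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedy separator-based splitting (simplified, no overlap handling)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the current chunk
        else:
            if current:
                chunks.append(current)
            current = piece  # may exceed chunk_size if the piece itself is too long
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: four paragraphs, one of them longer than the limit.
sample = "\n\n".join(["a" * 30, "b" * 120, "c" * 40, "d" * 40])
chunks = split_on_separator(sample, chunk_size=100)
for chunk in chunks:
    note = " (longer than the specified 100)" if len(chunk) > 100 else ""
    print(f"chunk of size {len(chunk)}{note}")
```

The oversized paragraph comes out as its own chunk, which is the situation the two warnings above report.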


In [3]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

In [4]:
model_name = '/root/.cache/huggingface/hub/dataroot/models/GanymedeNil/text2vec-large-chinese'

In [5]:
embeddings = HuggingFaceEmbeddings(model_name=model_name, 
                                   model_kwargs={'device': 'cuda:0'},
                                   encode_kwargs={'normalize_embeddings': False})

No sentence-transformers model found with name /root/.cache/huggingface/hub/dataroot/models/GanymedeNil/text2vec-large-chinese. Creating a new one with MEAN pooling.
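
The message above means the directory is not a packaged sentence-transformers model, so the loader wraps the plain transformer with a mean pooling layer: the token vectors of each text are averaged, ignoring padding, to give one embedding per text. With `normalize_embeddings` set to `False`, that vector is returned without scaling it to unit length. A minimal pure-Python sketch of both steps follows; `mean_pool` and `l2_normalize` are hypothetical helper names and the token vectors are made-up numbers.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [value / count for value in total]


def l2_normalize(vec):
    """Scale a vector to unit length (what normalize_embeddings=True would add)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)   # unnormalized sentence vector
unit = l2_normalize(pooled)        # same direction, length 1
print(pooled, unit)
```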


In [6]:
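%%time
# `%%time` (cell magic) times the whole cell. A bare `%time` line with nothing after it
# times an empty statement, which is why the output recorded below is only microseconds.
db = FAISS.from_documents(documents, embedding=embeddings)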

CPU times: user 5 µs, sys: 3 µs, total: 8 µs
Wall time: 15.3 µs


In [7]:
# Optional: persist the index to disk and reload it later (run after `query` is defined below).
# db.save_local("faiss_index")
# new_db = FAISS.load_local("faiss_index", embeddings)
# docs = new_db.similarity_search(query)
# docs[0]

### Similarity search

In [8]:
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

In [9]:
print(docs[0].page_content)

Now, even after paying for what we spent on my watch, we will still face the massive deficit we had when I took office. More importantly, the cost of Medicare, Medicaid and Social Security will continue to skyrocket. That's why I've called for a bipartisan fiscal commission, modeled on a proposal by Republican Judd Gregg and Democrat Kent Conrad. This can't be one of those Washington gimmicks that lets us pretend we solved a problem. The commission will have to provide a specific set of solutions by a certain deadline. Yesterday, the Senate blocked a bill that would have created this commission. So I will issue an executive order that will allow us to go forward, because I refuse to pass this problem on to another generation of Americans. And when the vote comes tomorrow, the Senate should restore the pay-as-you-go law that was a big reason why we had record surpluses in the 1990s.


### Similarity search by vector

You can also search for documents similar to a given embedding vector with `similarity_search_by_vector`, which takes the embedding vector as its argument instead of a query string.

In [10]:
query_embedding = embeddings.embed_query(query)

In [11]:
docs = db.similarity_search_by_vector(query_embedding)

In [12]:
len(docs)

4
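
Four documents come back because `k` defaults to 4 in these search methods. Conceptually, a search by vector compares the query embedding with every stored embedding and keeps the `k` closest. The sketch below shows that idea as a brute-force search in plain Python, assuming Euclidean (L2) distance as the metric. `top_k_by_vector` is a hypothetical helper and the two-dimensional vectors are made up; a real index works on high-dimensional embeddings and may use a different metric or an approximate search.

```python
import math


def top_k_by_vector(query_vec, stored_vecs, k=4):
    """Return (index, distance) pairs for the k stored vectors closest to query_vec."""
    scored = [(i, math.dist(query_vec, vec)) for i, vec in enumerate(stored_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Made-up 2-D "embeddings" standing in for the indexed chunks.
stored = [[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0], [1.0, 0.0], [-3.0, 2.0]]
hits = top_k_by_vector([1.0, 1.0], stored)

for index, distance in hits:
    print(f"chunk {index}: distance {distance:.3f}")
```

Passing a precomputed vector this way is useful when the same query embedding is reused across several indexes, since the embedding model then only runs once.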

In [13]:
print(docs[0].page_content)

Now, even after paying for what we spent on my watch, we will still face the massive deficit we had when I took office. More importantly, the cost of Medicare, Medicaid and Social Security will continue to skyrocket. That's why I've called for a bipartisan fiscal commission, modeled on a proposal by Republican Judd Gregg and Democrat Kent Conrad. This can't be one of those Washington gimmicks that lets us pretend we solved a problem. The commission will have to provide a specific set of solutions by a certain deadline. Yesterday, the Senate blocked a bill that would have created this commission. So I will issue an executive order that will allow us to go forward, because I refuse to pass this problem on to another generation of Americans. And when the vote comes tomorrow, the Senate should restore the pay-as-you-go law that was a big reason why we had record surpluses in the 1990s.
