# Vector Stores

In [1]:
import os
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma

### Load Document and Split

In [2]:
loader = TextLoader('some_data/FDR_State_of_Union_1944.txt')
document = loader.load() # Returns a list of Document objects, each holding text (page_content) plus metadata

In [3]:
# Split into chunks, with chunk_size measured in tokens rather than characters
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents=document)

### Connect to OpenAI for Embeddings

In [4]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

### Create a Chroma DB Instance to Embed the Docs and Persist the DB

In [5]:
db = Chroma.from_documents(documents=docs, embedding=embeddings, persist_directory='some_data/speech_embedding_db')

### Load Embeddings from Disk

In [6]:
db_connection = Chroma(persist_directory='some_data/speech_embedding_db/', embedding_function=embeddings)

In [7]:
query = "What did FDR say about the cost of food law?"
docs = db_connection.similarity_search(query=query)

In [8]:
print(docs[0].page_content)

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:

(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.

(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.

(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer may expect for his production; and
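
A similarity search embeds the query and returns the stored chunks whose vectors lie closest to it. The sketch below illustrates that ranking step using only the standard library. The three-dimensional vectors, the labels, and the choice of cosine similarity are assumptions made for illustration; the metric Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k=4):
    """stored is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical toy "embeddings"; real embedding vectors have far more dimensions
toy_store = [
    ("cost of food law", [0.9, 0.1, 0.0]),
    ("national service law", [0.1, 0.9, 0.0]),
    ("tax law", [0.6, 0.3, 0.1]),
]
toy_query = [1.0, 0.0, 0.0]

print(top_k(toy_query, toy_store, k=2))  # ['cost of food law', 'tax law']
```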

### Add a new Document

In [9]:
loader = TextLoader("some_data/Lincoln_State_of_Union_1862.txt")
document = loader.load()

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents=document)

Created a chunk of size 608, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 686, which is longer than the specified 500
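
These warnings appear because `CharacterTextSplitter` only cuts at its separator (a blank line by default) and then merges neighbouring pieces up to `chunk_size`. A single paragraph longer than `chunk_size` is kept whole. The sketch below mimics that behaviour as a rough illustration, not the library's implementation. It counts whitespace-separated words as a stand-in for tiktoken tokens, and `split_on_separator` is a hypothetical helper name.

```python
import warnings


def split_on_separator(text, chunk_size, separator="\n\n",
                       length=lambda s: len(s.split())):
    """Greedily merge separator-delimited pieces until the next piece would
    push the chunk past chunk_size. Oversized pieces are kept whole."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        if length(piece) > chunk_size:
            warnings.warn(
                f"Piece of size {length(piece)} is longer than chunk_size {chunk_size}"
            )
        candidate = separator.join(current + [piece])
        if current and length(candidate) > chunk_size:
            # Close the current chunk and start a new one with this piece
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i j k\n\nl"
for chunk in split_on_separator(sample, chunk_size=5):
    print(repr(chunk), len(chunk.split()))
```

If chunks must never exceed the limit, a splitter that falls back to finer separators (for example `RecursiveCharacterTextSplitter`) is the usual alternative.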


In [10]:
# Add the new chunks to the existing Chroma collection by reusing the same persist_directory
db = Chroma.from_documents(docs, embedding=embeddings, persist_directory='some_data/speech_embedding_db')

In [11]:
docs = db.similarity_search('slavery')
print(docs[0].page_content)

As to the first article, the main points are, first, the emancipation; secondly, the length of time for consummating it (thirty-seven years); and, thirdly, the compensation.

The emancipation will be unsatisfactory to the advocates of perpetual slavery, but the length of time should greatly mitigate their dissatisfaction. The time spares both races from the evils of sudden derangement—in fact, from the necessity of any derangement—while most of those whose habitual course of thought will be disturbed by the measure will have passed away before its consummation. They will never see it. Another class will hail the prospect of emancipation, but will deprecate the length of time. They will feel that it gives too little to the now living slaves. But it really gives them much. It saves them from the vagrant destitution which must largely attend immediate emancipation in localities where their numbers are very great, and it gives the inspiring assurance that their posterity shall be free fore

### Retriever Property

#### Example code showing how to use a retriever
```
# Retrieve more documents with higher diversity
# Useful if your dataset has many similar documents
docsearch.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 6, 'lambda_mult': 0.25}
)
# search_kwargs (Optional[Dict]): keyword arguments passed to the search function.
# Can include:
#   k: number of documents to return (default: 4)
#   score_threshold: minimum relevance threshold, used with
#       search_type="similarity_score_threshold"
#   fetch_k: number of documents to pass to the MMR algorithm (default: 20)
#   lambda_mult: diversity of results returned by MMR;
#       1 for minimum diversity and 0 for maximum (default: 0.5)

# Fetch more documents for the MMR algorithm to consider
# But only return the top 5
docsearch.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 5, 'fetch_k': 50}
)

# Only retrieve documents that have a relevance score
# Above a certain threshold
docsearch.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={'score_threshold': 0.8}
)

# Only get the single most similar document from the dataset
docsearch.as_retriever(search_kwargs={'k': 1})

# Use a filter to only retrieve documents from a specific paper
docsearch.as_retriever(
    search_kwargs={'filter': {'paper_title':'GPT-4 Technical Report'}}
)
```
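
To make the roles of `k`, `fetch_k`, and `lambda_mult` concrete, here is a minimal standard-library sketch of Maximal Marginal Relevance. It is an illustration under stated assumptions, not the library's code: the two-dimensional vectors and labels are made up, cosine similarity is assumed as the metric, and `mmr_select` is a hypothetical helper name. Each step picks the candidate that maximises `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k=4, fetch_k=20, lambda_mult=0.5):
    """candidates is a list of (text, vector) pairs; return k texts chosen by MMR."""
    # Step 1: keep only the fetch_k candidates most similar to the query
    pool = sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:fetch_k]
    selected = []
    # Step 2: repeatedly take the candidate with the best relevance/redundancy trade-off
    while pool and len(selected) < k:
        def score(c):
            relevance = cosine(query_vec, c[1])
            redundancy = max((cosine(c[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return [text for text, _ in selected]


# Hypothetical toy vectors: two near-duplicates and one distinct chunk
toy_docs = [
    ("food law", [1.0, 0.0]),
    ("food law (near-duplicate)", [0.99, 0.05]),
    ("service law", [0.6, 0.8]),
]
query = [1.0, 0.0]

print(mmr_select(query, toy_docs, k=2, lambda_mult=1.0))   # pure relevance keeps the duplicate
print(mmr_select(query, toy_docs, k=2, lambda_mult=0.25))  # diversity swaps in 'service law'
```

With `lambda_mult=1.0` the selection reduces to a plain similarity ranking, while lower values penalise candidates that resemble chunks already chosen.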

In [32]:
db_connection = Chroma(persist_directory='some_data/speech_embedding_db', embedding_function=embeddings)


In [36]:
# Use the MMR algorithm - Maximal Marginal Relevance
retriever = db_connection.as_retriever(search_type="mmr", search_kwargs={'k':2, 'lambda_mult':0.25})
docs = retriever.invoke("Food Law")

In [37]:
print(f"Total Number of documents returned - {len(docs)}")
print("\n")
print(docs[0].page_content)

Total Number of documents returned - 2


(4) Early reenactment of. the stabilization statute of October, 1942. This expires June 30, 1944, and if it is not extended well in advance, the country might just as well expect price chaos by summer.

(5) A national service law- which, for the duration of the war, will prevent strikes, and, with certain appropriate exceptions, will make available for war production or for any other essential services every able-bodied adult in this Nation.

These five measures together form a just and equitable whole. I would not recommend a national service law unless the other laws were passed to keep down the cost of living, to share equitably the burdens of taxation, to hold the stabilization line, and to prevent undue profits.

The Federal Government already has the basic power to draft capital and property of all kinds for war purposes on a basis of just compensation.

As you know, I have for three years hesitated to recommend a national service act. Tod

In [38]:
# Only get the single most similar document from the dataset
retriever = db_connection.as_retriever(search_kwargs={'k':1})
docs = retriever.invoke("Food Law")

print(f"Total Number of documents returned - {len(docs)}")
print("\n")
print(docs[0].page_content)

Total Number of documents returned - 1


(4) Early reenactment of. the stabilization statute of October, 1942. This expires June 30, 1944, and if it is not extended well in advance, the country might just as well expect price chaos by summer.

(5) A national service law- which, for the duration of the war, will prevent strikes, and, with certain appropriate exceptions, will make available for war production or for any other essential services every able-bodied adult in this Nation.

These five measures together form a just and equitable whole. I would not recommend a national service law unless the other laws were passed to keep down the cost of living, to share equitably the burdens of taxation, to hold the stabilization line, and to prevent undue profits.

The Federal Government already has the basic power to draft capital and property of all kinds for war purposes on a basis of just compensation.

As you know, I have for three years hesitated to recommend a national service act. Tod

In [43]:
# Only retrieve documents that have a relevance score above a certain threshold
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5})
docs = retriever.invoke("Food Law")

  self.vectorstore.similarity_search_with_relevance_scores(
No relevant docs were retrieved using the relevance score threshold 0.5
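
An empty result is a legitimate outcome of threshold-based retrieval: only documents whose relevance score is at least `score_threshold` are returned. The sketch below shows that filtering step in isolation. The scores and chunk labels are hypothetical, `filter_by_threshold` is an illustrative helper name, and how real scores are derived depends on the store's distance metric.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (text, score) pairs whose score is at least score_threshold, best first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical relevance scores for two chunks
hypothetical_scores = [("chunk about food prices", 0.42), ("chunk about taxes", 0.31)]

print(filter_by_threshold(hypothetical_scores, 0.5))  # [] because nothing reaches 0.5
print(filter_by_threshold(hypothetical_scores, 0.3))  # both chunks, highest score first
```

Lowering the threshold, or inspecting the raw scores first, is a practical way to choose a cutoff that suits the data.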


In [44]:
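print(f"Total Number of documents returned - {len(docs)}")
print("\n")
# Guard against an empty result: no chunk cleared the 0.5 threshold above,
# so indexing docs[0] directly would raise IndexError
if docs:
    print(docs[0].page_content)
else:
    print("No documents met the score threshold.")

Total Number of documents returned - 0


No documents met the score threshold.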