# Vector Store

- Can store large N-dimensional vectors
- Can directly index an embedded vector to its associated string text document
- Can be queried, allowing for a cosine similarity search between a new vector not in the database and the stored vectors
- Can easily add, update, or delete vectors
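
To make the bullets above concrete, here is a minimal in-memory sketch of the idea. `ToyVectorStore` and its methods are hypothetical names for illustration, not part of Chroma or LangChain, and the two-dimensional vectors are made up. A real store also lets you configure the distance metric, so treat cosine similarity here as one common choice.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """Hypothetical in-memory store mapping id -> (vector, text)."""

    def __init__(self):
        self._rows = {}

    def upsert(self, doc_id, vector, text):
        # Adding and updating are the same operation: overwrite by id.
        self._rows[doc_id] = (vector, text)

    def delete(self, doc_id):
        self._rows.pop(doc_id, None)

    def query(self, vector, k=1):
        # Rank every stored vector by similarity to the query vector.
        ranked = sorted(
            self._rows.values(),
            key=lambda row: cosine_similarity(vector, row[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]

store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], "about taxes")
store.upsert("b", [0.0, 1.0], "about farming")
top = store.query([0.9, 0.1], k=1)
```

The query vector is not stored anywhere; it is only compared against what is already in the store.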

We're going to use an open-source vector store called Chroma, which works well with LangChain.

In [28]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
import chromadb
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

Logic:
1. Load document --> Split into chunks
2. Use embeddings --> Embed chunks --> Vectors
3. Vector chunks --> Save ChromaDB
4. "query" --> similarity search chromadb
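
The four steps above can be sketched without any external service. This is only an illustration: `toy_embed` is a made-up word-count embedding standing in for `OpenAIEmbeddings`, a plain list stands in for ChromaDB, and the sample text is invented.

```python
import math
from collections import Counter

def split_into_chunks(text, separator="\n\n"):
    """Step 1: split the document on blank lines."""
    return [part.strip() for part in text.split(separator) if part.strip()]

def toy_embed(text, vocabulary):
    """Step 2: stand-in embedding, one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

document = "taxes on war profits\n\nfood prices for the farmer\n\nnational service law"
chunks = split_into_chunks(document)
vocabulary = sorted({word for chunk in chunks for word in chunk.lower().split()})

# Step 3: keep each chunk next to its vector (the "database").
index = [(toy_embed(chunk, vocabulary), chunk) for chunk in chunks]

# Step 4: embed the query the same way and pick the most similar chunk.
query_vector = toy_embed("food prices", vocabulary)
best_chunk = max(index, key=lambda row: cosine(query_vector, row[0]))[1]
```

The query must be embedded with the same function as the chunks, otherwise the vectors are not comparable.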
   

In [2]:
loader = TextLoader("some_data/FDR_State_of_Union_1944.txt")

documents = loader.load()

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)

docs = text_splitter.split_documents(documents)

In [3]:
embedding_function = OpenAIEmbeddings()

In [6]:
# Save
db = Chroma.from_documents(docs, embedding_function, persist_directory="./db/speech_new_db")
db.persist()

This creates two Parquet files and an index folder (the on-disk layout of the Chroma version used in this notebook):
- chroma-collections: strings
- chroma-embeddings: vectors
- index: connects the two above

**Load DB from persisted directory**

In [7]:
db_new_connection = Chroma(persist_directory="./db/speech_new_db", embedding_function=embedding_function)

In [8]:
db_new_connection

<langchain.vectorstores.chroma.Chroma at 0x7f5e786100d0>

In [9]:
new_doc = "What did FDR say about the cost of food law?"
# new_doc = "cost of food law, FDR"  # this would give very similar results, since it's just a similarity search

# Chroma vectorizes this and returns the documents most similar to it
# It is not interpreting the question with an LLM

In [10]:
similar_docs = db_new_connection.similarity_search(new_doc, k=3)

In [11]:
len(similar_docs)

3

In [12]:
print(similar_docs[0].page_content)  # Even smaller chunks would be more effective

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:

(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate cost of the war to our sons and daughters. The tax bill now under consideration by the Congress does not begin to meet this test.

(2) A continuation of the law for the renegotiation of war contracts—which will prevent exorbitant profits and assure fair prices to the Government. For two long years I have pleaded with the Congress to take undue profits out of war.

(3) A cost of food law—which will enable the Government (a) to place a reasonable floor under the prices the farmer may expect for his production; and

## Adding new vectors

In [13]:
loader = TextLoader("some_data/Lincoln_State_of_Union_1862.txt")
documents = loader.load()

In [14]:
docs = text_splitter.split_documents(documents)

Created a chunk of size 608, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 686, which is longer than the specified 500
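
These warnings appear because `CharacterTextSplitter` only cuts on its separator (a blank line by default) and then packs the pieces into chunks. A single piece that is already longer than `chunk_size` cannot be cut further, so it is kept as is. Below is a simplified sketch of that behavior, not the library's actual code: `split_and_merge` is a hypothetical helper, it ignores chunk overlap, and it measures length in characters, whereas `from_tiktoken_encoder` counts tokens.

```python
def split_and_merge(text, chunk_size, separator="\n\n", length=len):
    """Split on `separator`, then pack pieces into chunks of at most `chunk_size`.

    A piece that is longer than `chunk_size` on its own is emitted unchanged.
    """
    chunks, current = [], []
    for piece in text.split(separator):
        candidate = separator.join(current + [piece])
        if current and length(candidate) > chunk_size:
            # The current chunk is full: close it and start a new one.
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

text = "short one\n\n" + "x" * 30 + "\n\nshort two"
chunks = split_and_merge(text, chunk_size=20)
oversized = [chunk for chunk in chunks if len(chunk) > 20]
```

If the oversized chunks matter, a splitter that falls back to smaller separators, such as `RecursiveCharacterTextSplitter`, is one option.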


In [15]:
# Reuse the persist directory from above so the new chunks are added to the existing store
db_new_connection = Chroma.from_documents(
    docs,
    embedding_function,
    persist_directory="./db/speech_new_db"
)

In [16]:
docs = db_new_connection.similarity_search("slavery")

In [17]:
docs[0].page_content

'As to the second article, I think it would be impracticable to return to bondage the class of persons therein contemplated. Some of them, doubtless, in the property sense belong to loyal owners, and hence provision is made in this article for compensating such. The third article relates to the future of the freed people. It does not oblige, but merely authorizes Congress to aid in colonizing such as may consent. This ought not to be regarded as objectionable on the one hand or on the other, insomuch as it comes to nothing unless by the mutual consent of the people to be deported and the American voters, through their representatives in Congress.\n\nI can not make it better known than it already is that I strongly favor colonization; and yet I wish to say there is an objection urged against free colored persons remaining in the country which is largely imaginary, if not sometimes malicious.\n\nIt is insisted that their presence would injure and displace white labor and white laborers. 

## Retrievers

A retriever is an easier, uniform way to access the search methods of the vector DB.

In more advanced topics, we usually pass a retriever object rather than the vector store itself.
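
As a rough illustration of why a retriever is convenient to pass around, here is a toy version of the pattern. `ToyStore` and `ToyRetriever` are hypothetical classes, and the word-overlap ranking is a stand-in for a real embedding comparison. The default of `k=4` mirrors LangChain's default for `similarity_search`.

```python
class ToyStore:
    """Hypothetical store with its own store-specific search method."""

    def __init__(self, texts):
        self.texts = texts

    def similarity_search(self, query, k=4):
        # Stand-in ranking: count shared words instead of comparing embeddings.
        words = set(query.lower().split())
        ranked = sorted(
            self.texts,
            key=lambda text: len(words & set(text.lower().split())),
            reverse=True,
        )
        return ranked[:k]

class ToyRetriever:
    """Exposes a single method, so callers need not know which store is behind it."""

    def __init__(self, store, k=4):
        self.store = store
        self.k = k

    def get_relevant_documents(self, query):
        return self.store.similarity_search(query, k=self.k)

store = ToyStore([
    "cost of food law",
    "tax law",
    "war contracts",
    "national service",
    "soldier vote",
])
retriever = ToyRetriever(store)
results = retriever.get_relevant_documents("cost food of law")
```

Anything that only needs "give me relevant documents for this string" can accept the retriever and stay unaware of Chroma.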

In [18]:
type(db_new_connection)

langchain.vectorstores.chroma.Chroma

In [20]:
retriever = db_new_connection.as_retriever()
type(retriever)

langchain.vectorstores.base.VectorStoreRetriever

In [21]:
results = retriever.get_relevant_documents("cost food of law")

In [23]:
len(results)

4

### MultiQuery retrieval

Documents in the vector store may use phrasing that we are not aware of, especially when they are large. This makes it hard to come up with the right query string for the similarity comparison.

We can use an LLM to generate multiple variations of our query using MultiQueryRetriever, allowing us to focus on key ideas rather than exact phrasing.

In [24]:
from langchain.document_loaders import WikipediaLoader

In [29]:
loader = WikipediaLoader(query="MKUltra")
documents = loader.load()

len(documents)

9

In [30]:
from langchain.text_splitter import CharacterTextSplitter

In [31]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(documents)

Created a chunk of size 516, which is longer than the specified 500


In [32]:
len(docs)

19

Since the LLM adds a bit of randomness to the generated queries, we should set a low temperature, such as zero, on the chat model. Temperature is a sampling parameter for text generation, so it does not belong on the embedding model.

In [33]:
from langchain.embeddings import OpenAIEmbeddings

embedding_function = OpenAIEmbeddings()

In [34]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(docs, embedding_function, persist_directory="./db/some_new_mkultra")
db.persist()

In [35]:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.chat_models import ChatOpenAI

In [36]:
question = "When was this declassifier?"

In [37]:
llm = ChatOpenAI(temperature=0)  # low temperature keeps the generated queries stable
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

In [38]:
# Logging behind the scenes

import logging
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

This generates multiple queries for our question, in an attempt to find better phrasing.
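
Conceptually, the retriever runs each generated query against the vector store and keeps the unique union of the results. The sketch below shows only that merging step. `multi_query_retrieve`, `fake_generate_queries`, `fake_retrieve`, and the document names are all made up for illustration; in the real retriever an LLM writes the query variants.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Run every query variant and keep each document once, in first-seen order."""
    unique_docs = []
    seen = set()
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical stand-in for the LLM that rephrases the question.
def fake_generate_queries(question):
    return [
        "date of declassification",
        "when were the files released",
        "declassification timeline",
    ]

# Hypothetical stand-in for the vector store lookup.
FAKE_INDEX = {
    "date of declassification": ["doc A", "doc B"],
    "when were the files released": ["doc B", "doc C"],
    "declassification timeline": ["doc C", "doc D"],
}

def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])

docs = multi_query_retrieve("When was this declassified?", fake_generate_queries, fake_retrieve)
```

Because overlapping results are merged, the number of documents returned is usually lower than the number of queries times `k`.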

In [39]:
# This will not directly answer a query
# It returns the n docs that are most similar/relevant
unique_docs = retriever_from_llm.get_relevant_documents(query=question)

INFO:langchain.retrievers.multi_query:Generated queries: ['1. Can you provide the date of the declassification?', '2. Do you have any information about the specific date when the declassification took place?', '3. Could you please share the timeline or date of the declassifier?']


In [40]:
len(unique_docs)

6

In [41]:
unique_docs[0].page_content

"== See also ==\nHuman experimentation in the United States\nProject MKULTRA\nProject ARTICHOKE\nProject CHATTER\nProject MKDELTA\nCIA cryptonym\nKurt Blome\nErich Traub\n\n\n== References ==\n\nBibliographyGoliszek, Andrew, In the name of science : a history of secret programs, medical research, and human experimentation St. Martin's Press, 2003\nSummary Report of CIA Investigation of MKNAOMI (US National Archives, released under the JFK Assassination Records Act, December 2017)"

**This has a problem:** It is returning the entire document chunk. We want only an "answer". Narrowing the result down to the relevant part is called context compression.