# MultiVector Retriever
It is often beneficial to store multiple vectors per document.  
LangChain has a base `MultiVectorRetriever` that makes querying this type of setup easy.  
Most of the complexity lies in how to create the multiple vectors per document.  
This notebook covers some of the common ways to create those vectors and use the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

* Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).
* Summary: create a summary for each document, embed that along with (or instead of) the document.
* Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.

Note that this also enables another method of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control.
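The idea of attaching hand-written queries to a document can be sketched without any LangChain machinery. The toy index below is only an illustration under stated assumptions: `ToyMultiVectorIndex`, `add_parent`, `add_representation`, and the word-overlap score are hypothetical stand-ins for a vector store and embedding similarity, not part of LangChain.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(a: str, b: str) -> float:
    """Jaccard word overlap; a crude stand-in for embedding similarity."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


class ToyMultiVectorIndex:
    """Hypothetical index: many searchable texts may point at one parent document."""

    def __init__(self) -> None:
        self.parents = {}          # doc_id -> full document text
        self.representations = []  # list of (searchable text, doc_id)

    def add_parent(self, doc_id: str, text: str) -> None:
        self.parents[doc_id] = text

    def add_representation(self, doc_id: str, text: str) -> None:
        self.representations.append((text, doc_id))

    def search(self, query: str) -> str:
        """Return the parent of the best-matching representation."""
        _, best_id = max(
            self.representations, key=lambda pair: overlap_score(query, pair[0])
        )
        return self.parents[best_id]


index = ToyMultiVectorIndex()
index.add_parent("essay", "Founders in their twenties can live cheaply and recover from mistakes.")
index.add_parent("speech", "The President honored a retiring Supreme Court Justice.")

# Representations derived from the documents themselves
index.add_representation("essay", "Founders in their twenties can live cheaply")
index.add_representation("speech", "The President honored a retiring Supreme Court Justice")

# A manually added query that should lead to the speech
index.add_representation("speech", "who is stepping down from the bench")

result = index.search("Who is stepping down from the bench?")
print(result)  # the full "speech" document
```

The query shares almost no words with the speech itself, yet the manually added representation routes it to the right parent. That is the extra control the note above refers to.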

In [70]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
# from langchain_chroma import Chroma
from langchain_community.vectorstores.chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [26]:
# loaders = [
#     TextLoader("../../../text_files/paul_graham_mit_essays.txt"),
#     # TextLoader("../../../text_files/state_of_the_union.txt"),
# ]
# docs = []
# for loader in loaders:
#     docs.extend(loader.load())
# print(len(docs))
# print(docs[0].page_content[:100])
# docs[0].metadata['test'] = "123"
# print(docs[0].metadata)

In [135]:
loaders = [
    TextLoader("../../../text_files/paul_graham_mit_essays.txt"),
    TextLoader("../../../text_files/state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
print(len(docs))
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
# text_splitter preserves the original metadata
docs = text_splitter.split_documents(docs)
print(len(docs))

for doc in docs:
    print(len(doc.page_content))

2
8
9947
9939
9566
6440
9947
9902
9874
9194


## Smaller chunks
It is often useful to retrieve larger chunks of information but embed smaller chunks. This lets the embeddings capture the semantic meaning as closely as possible while still passing as much context as possible downstream. Note that this is what the `ParentDocumentRetriever` does. Here we show what is going on under the hood.

In [136]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs] # unique id for each document

In [137]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [138]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc]) # Split the document into smaller chunks
    for _sub_doc in _sub_docs:
        _sub_doc.metadata[id_key] = _id # Add the parent document id to the metadata
    sub_docs.extend(_sub_docs)

In [139]:
print(len(sub_docs))

330


In [140]:
# test = list(zip(doc_ids, docs))
# print(len(test))
# print(test[0])

In [141]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))  # store as key-value pairs

In [143]:
# Vectorstore alone retrieves the small chunks
res = retriever.vectorstore.similarity_search("justice breyer")[0]
print(res.page_content)
print(res.metadata)

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
{'doc_id': 'a1cd6eac-eeeb-46f6-9e0c-e0b9a8091195', 'source': '../../../text_files/state_of_the_union.txt'}


In [146]:
# test
results = retriever.invoke("justice breyer")

results[0].page_content

'But in my administration, the watchdogs have been welcomed back. \n\nWe’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.  \n\nAnd tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud. \n\nBy the end of this year, the deficit will be down to less than half what it was before I took office.  \n\nThe only president ever to cut the deficit by more than one trillion dollars in a single year. \n\nLowering your costs also means demanding more competition. \n\nI’m a capitalist, but capitalism without competition isn’t capitalism. \n\nIt’s exploitation—and it drives up prices. \n\nWhen corporations don’t have to compete, their profits go up, your prices go up, and small businesses and family farmers and ranchers go under. \n\nWe see it happening with ocean carriers moving goods in and out of America. \n\nDuring the pandemic, these foreign-owned companies raised prices by as much 

In [145]:
# Retriever returns larger chunks
results = retriever.invoke("justice breyer")
# print(len(res))

for result in results:
    # print(result.metadata[id_key])
    print(len(result.page_content))
    print(result.metadata)
    print(result.page_content[:100])
    # print(result.page_content)
    print("-----")


9874
{'source': '../../../text_files/state_of_the_union.txt'}
But in my administration, the watchdogs have been welcomed back. 

We’re going after the criminals w
-----


The default search type the retriever performs on the vector database is a similarity search.  
LangChain vector stores also support searching via [Max Marginal Relevance](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search). To use it instead, set the `search_type` property as follows:

In [80]:
from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)

9874
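
To make the difference from plain similarity search concrete, here is a minimal sketch of maximal marginal relevance in its usual textbook form: each step picks the candidate with the best trade-off between relevance to the query and redundancy with what was already picked. The function names `cosine` and `mmr`, the `lambda_mult` parameter name, and the two-dimensional vectors are assumptions made for illustration; this is not LangChain's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k vectors chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [
    [1.0, 0.1],   # most relevant
    [1.0, 0.2],   # near-duplicate of the first
    [1.0, -1.0],  # less relevant, but different from the first
]

diverse = mmr(query, vectors, k=2, lambda_mult=0.5)
relevance_only = mmr(query, vectors, k=2, lambda_mult=1.0)
print(diverse)         # [0, 2]
print(relevance_only)  # [0, 1]
```

With `lambda_mult=1.0` the redundancy term vanishes and the result matches a plain similarity ranking; lower values trade some relevance for diversity.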

## Questions
* Q: When retrieving, is the query compared against the small chunks or the large chunks in the similarity search?
* Q: Why is `byte_store` needed?
* Q: What are the alternatives to `InMemoryByteStore`?
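
A minimal sketch of an answer to the first two questions, assuming the retriever works roughly like this: the query is matched against the small chunks in the vector store, the parent ids found in their metadata are de-duplicated in ranked order, and the large chunks are then fetched from the key-value store. The function `resolve_parents`, the plain dict used as a store, and the sample ids are all hypothetical; this is not LangChain's actual code.

```python
def resolve_parents(child_hits, store, id_key="doc_id"):
    """Map ranked child hits (metadata dicts) to parent documents, without duplicates."""
    ids = []
    for metadata in child_hits:
        parent_id = metadata.get(id_key)
        # Keep the first occurrence only, so the ranking order is preserved
        if parent_id is not None and parent_id not in ids:
            ids.append(parent_id)
    # Skip ids that have no parent in the store
    return [store[i] for i in ids if i in store]


# Hypothetical search results: metadata of four small chunks, best match first
child_hits = [
    {"doc_id": "p2"},
    {"doc_id": "p2"},
    {"doc_id": "p1"},
    {"doc_id": "p2"},
]
# Hypothetical key-value store holding the large parent chunks
store = {"p1": "parent one (large chunk)", "p2": "parent two (large chunk)"}

parents = resolve_parents(child_hits, store)
print(parents)  # ['parent two (large chunk)', 'parent one (large chunk)']
```

Under this reading, the byte store is needed because the vector store holds only the small searchable pieces. Anything that offers key-value lookup by id could presumably play the same role as `InMemoryByteStore`. It would also explain why four child hits can yield fewer than four parent documents.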


## Summary
Often a summary can distill what a chunk is about more accurately than the chunk itself, leading to better retrieval. Here we show how to create summaries and then embed them.

In [104]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [105]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()  # returns only the message content as a string, dropping other fields
)

In [106]:
summaries = chain.batch(docs, {"max_concurrency": 5})

In [111]:
print(type(summaries))
print(len(summaries))
print(summaries[0])

for summary in summaries:
    print(len(summary))

<class 'list'>
8
The document discusses the idea of starting a startup while in college or right after graduation. It highlights the advantages of starting a startup at a young age, such as having more stamina, being able to live cheaply, and being more flexible to try different ideas. The document also mentions that the prevailing winds change after graduation and suggests that the mid-twenties might be the sweet spot for startup founders. It emphasizes the importance of thinking cheaply and being able to recover from mistakes. The author encourages young people to consider starting a startup early but advises them to wait if they are uncertain.
637
660
733
562
535
682
632
557


In [128]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [129]:
# convert summaries to LangChain Documents
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

In [130]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [131]:
# We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

In [133]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [134]:
print(sub_docs[0].page_content[:100])
print(sub_docs[0].metadata)

But in my administration, the watchdogs have been welcomed back. 

We’re going after the criminals w
{'doc_id': '7b41dde9-e5a5-4e32-8f15-d2a9ddbcdd9a', 'source': '../../../text_files/state_of_the_union.txt'}


In [122]:
retrieved_docs = retriever.invoke("justice breyer")

In [127]:
print(retrieved_docs[0].page_content[:100])
print(retrieved_docs[0].metadata)

But in my administration, the watchdogs have been welcomed back. 

We’re going after the criminals w
{'source': '../../../text_files/state_of_the_union.txt'}


## Hypothetical Queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded.

In [147]:
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

In [158]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-3.5-turbo").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

In [160]:
chain.invoke(docs[0])

['What are the advantages of starting a startup while still in college?',
 'How does poverty serve as an advantage for young founders in startups?',
 'Why is it important for startup founders to have the ability to live and think cheaply?']

In [152]:
print(len(docs))

8


In [161]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})  # with gpt-4, this can easily exceed the tier-1 rate limit

In [164]:
print(len(hypothetical_questions))
for i, questions in enumerate(hypothetical_questions):
    print(f"questions for doc-{i}: {questions}")
    print("=======================================================")

8
questions for doc-0: ['What are the advantages of starting a startup while still in college?', 'How does poverty serve as an advantage for young founders in startups?', 'Why is it important for startup founders to operate cheaply and allow their ideas to evolve?']
questions for doc-1: ['How does the cost factor affect traditional long distance carriers in adapting to new technologies like VoIP?', 'What role does rootlessness play in the mobility and decision-making process of startup founders?', 'In what ways can the location and environment impact the success rate of startups, as discussed in the document?']
questions for doc-2: ['What factors determine the success of startup founders at top schools?', 'How can ignorance be a beneficial factor in discovering new ideas for startups?', 'What are the key differences between class projects and real startups in terms of problem-solving and measurement of progress?']
questions for doc-3: ['How can students prepare for starting a startup w

In [165]:
# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [166]:
question_docs = []
for i, question_list in enumerate(hypothetical_questions): # length of hypothetical_questions is the same as docs
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

In [174]:
print(len(question_docs))

24


In [167]:
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [168]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [169]:
sub_docs

[Document(page_content='How can the nomination of a Supreme Court Justice impact the legacy of excellence on the Court?', metadata={'doc_id': '6c4c7d99-203f-4b81-8afd-92d9ecd380e9'}),
 Document(page_content='In what ways can bipartisan support be mobilized to advance liberty, justice, and equality for all Americans?', metadata={'doc_id': '6c4c7d99-203f-4b81-8afd-92d9ecd380e9'}),
 Document(page_content='How can the Justice Department address pandemic fraud and prosecute criminals who stole relief money?', metadata={'doc_id': '9b0c98a8-6b06-43a1-8abf-8959268e04c7'}),
 Document(page_content='What measures can be taken to secure the border and fix the immigration system simultaneously?', metadata={'doc_id': '6c4c7d99-203f-4b81-8afd-92d9ecd380e9'})]

In [170]:
retrieved_docs = retriever.invoke("justice breyer")

In [171]:
len(retrieved_docs[0].page_content)

9194