# MultiVector Retriever
Storing multiple vectors per document is often beneficial: each vector can capture a different facet of the document (a small chunk, a summary, a likely question), while retrieval still returns the full document.  
LangChain provides a base `MultiVectorRetriever` that makes querying this kind of setup easy.  
Most of the complexity lies in how to create the multiple vectors per document.  
This notebook covers some common ways to create those vectors and use them with the `MultiVectorRetriever`.

The methods to create multiple vectors per document include:

* Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).
* Summary: create a summary for each document, embed that along with (or instead of) the document.
* Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.

Note that this also enables another method of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control.
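
All of the methods above share one data layout: several index entries point at a single parent through a shared metadata key. The following is a minimal, library-free sketch of that layout. The texts and the `kind` field are hypothetical and only serve as labels; plain dicts stand in for LangChain `Document` objects.

```python
import uuid

id_key = "doc_id"  # metadata key linking every index entry back to its parent

parent_id = str(uuid.uuid4())
parent_text = "Full text of a long document about retrieval ..."

# Hypothetical index entries; in a real setup each one is embedded separately.
entries = [
    {"page_content": "first small chunk", "metadata": {id_key: parent_id, "kind": "chunk"}},
    {"page_content": "second small chunk", "metadata": {id_key: parent_id, "kind": "chunk"}},
    {"page_content": "a short summary", "metadata": {id_key: parent_id, "kind": "summary"}},
    {"page_content": "what does this document explain?", "metadata": {id_key: parent_id, "kind": "question"}},
    # A manually added query that should always lead to this parent.
    {"page_content": "multi vector retrieval", "metadata": {id_key: parent_id, "kind": "manual"}},
]

# The parent itself is stored once, keyed by its id.
docstore = {parent_id: parent_text}

parent_ids = {entry["metadata"][id_key] for entry in entries}
print(len(entries), "index entries ->", len(parent_ids), "parent document")
```

Whichever entry matches a query, the lookup through `id_key` lands on the same stored parent.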

In [None]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryByteStore
# from langchain_chroma import Chroma
from langchain_community.vectorstores.chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
# loaders = [
#     TextLoader("../../../text_files/paul_graham_mit_essays.txt"),
#     # TextLoader("../../../text_files/state_of_the_union.txt"),
# ]
# docs = []
# for loader in loaders:
#     docs.extend(loader.load())
# print(len(docs))
# print(docs[0].page_content[:100])
# docs[0].metadata['test'] = "123"
# print(docs[0].metadata)

In [None]:
loaders = [
    TextLoader("../../../text_files/paul_graham_mit_essays.txt"),
    TextLoader("../../../text_files/state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
print(len(docs))
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
# text_splitter preserves the original metadata
docs = text_splitter.split_documents(docs)
print(len(docs))

for doc in docs:
    print(len(doc.page_content))

## Smaller chunks
It is often useful to embed small chunks but retrieve larger ones. Small chunks let the embeddings capture semantic meaning as closely as possible, while returning the larger parent passes as much context as possible downstream. Note that this is what the ParentDocumentRetriever does. Here we show what is going on under the hood.
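
The retrieval flow can be sketched without any library: the query is compared against the small chunks only, the parent ids of the best matches are collected in rank order without repeats, and the parents are then looked up by id. This is a simplified illustration, not LangChain's implementation. `overlap_score` is a toy stand-in for embedding similarity, and `retrieve_parents` and the sample texts are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance: number of shared lowercase words (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, child_chunks, docstore, id_key="doc_id", k=4):
    # 1. Rank the small chunks (not the parents) against the query.
    ranked = sorted(
        child_chunks,
        key=lambda chunk: overlap_score(query, chunk["page_content"]),
        reverse=True,
    )[:k]
    # 2. Collect parent ids in rank order, skipping repeats.
    ids = []
    for chunk in ranked:
        parent_id = chunk["metadata"].get(id_key)
        if parent_id is not None and parent_id not in ids:
            ids.append(parent_id)
    # 3. Look the parents up in the key-value store.
    return [docstore[parent_id] for parent_id in ids if parent_id in docstore]


docstore = {
    "p1": "PARENT 1: long text about the supreme court and justice breyer ...",
    "p2": "PARENT 2: long text about startups and founders ...",
}
child_chunks = [
    {"page_content": "justice breyer is retiring", "metadata": {"doc_id": "p1"}},
    {"page_content": "thank you justice breyer for your service", "metadata": {"doc_id": "p1"}},
    {"page_content": "founders should talk to users", "metadata": {"doc_id": "p2"}},
]

parents = retrieve_parents("justice breyer", child_chunks, docstore, k=2)
print(parents)
```

Both top-ranked chunks belong to `p1`, so that parent is returned once even though two of its chunks matched.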

In [None]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
import uuid

doc_ids = [str(uuid.uuid4()) for _ in docs] # unique id for each document

In [None]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [None]:
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc]) # Split the document into smaller chunks
    for _sub_doc in _sub_docs:
        _sub_doc.metadata[id_key] = _id # Add the parent document id to the metadata
    sub_docs.extend(_sub_docs)

In [None]:
print(len(sub_docs))

In [None]:
# test = list(zip(doc_ids, docs))
# print(len(test))
# print(test[0])

In [None]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs))) # store as (id, document) key-value pairs

In [None]:
# Vectorstore alone retrieves the small chunks
res = retriever.vectorstore.similarity_search("justice breyer")[0]
print(res.page_content)
print(res.metadata)

In [None]:
# test
results = retriever.invoke("justice breyer")

results[0].page_content

In [None]:
# Retriever returns larger chunks
results = retriever.invoke("justice breyer")
# print(len(res))

for result in results:
    # print(result.metadata[id_key])
    print(len(result.page_content))
    print(result.metadata)
    print(result.page_content[:100])
    # print(result.page_content)
    print("-----")


The default search type the retriever performs on the vector database is a similarity search.  
LangChain vector stores also support searching via [Maximal Marginal Relevance (MMR)](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search), which trades off relevance to the query against diversity among the results. To use it instead, set the `search_type` property as follows:

In [None]:
from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr

len(retriever.invoke("justice breyer")[0].page_content)
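
To make the difference between the two search types concrete, here is a small, self-contained sketch of greedy MMR selection over toy 2-D vectors. It follows the usual formulation, `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, and is not the vector store's actual code. The vectors and the function names `cosine` and `mmr` are made up for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k indices, balancing query relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected vector (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
# Vectors 0 and 1 are near-duplicates; vector 2 is less relevant but different.
doc_vecs = [[0.9, 0.4], [0.9, 0.41], [0.9, -0.5]]

by_similarity = sorted(
    range(len(doc_vecs)),
    key=lambda i: cosine(query_vec, doc_vecs[i]),
    reverse=True,
)[:2]
by_mmr = mmr(query_vec, doc_vecs, k=2)
print(by_similarity, by_mmr)  # prints [0, 1] [0, 2]
```

Plain similarity returns the two near-duplicates, while MMR replaces the second one with the more distinct vector.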

## Questions
* Q: When retrieving small or large chunks, is the query compared against the small chunks or the large chunks in the similarity search?
* Q: Why is `byte_store` needed?
* Q: What are the alternatives to `InMemoryByteStore`?
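
On the last two questions, one way to picture the role of the byte store: the vector store only holds the small indexed pieces, so the full parent documents have to live somewhere they can be fetched by id. A store that holds only bytes is enough, provided the documents are serialized on the way in and out, and any persistent key-value store of bytes could play the same role. The sketch below is a toy model of that idea. `ToyByteStore`, `dump_doc`, `load_doc`, and the JSON encoding are assumptions for illustration, not LangChain's classes or serialization format.

```python
import json


class ToyByteStore:
    """Hypothetical minimal key-value store that only holds bytes."""

    def __init__(self):
        self._data = {}

    def mset(self, pairs):
        for key, value in pairs:
            if not isinstance(value, bytes):
                raise TypeError("values must be bytes")
            self._data[key] = value

    def mget(self, keys):
        # Missing keys come back as None, in the same order as requested.
        return [self._data.get(key) for key in keys]


def dump_doc(doc):
    return json.dumps(doc).encode("utf-8")


def load_doc(raw):
    return None if raw is None else json.loads(raw.decode("utf-8"))


store = ToyByteStore()
doc = {"page_content": "full parent text", "metadata": {"source": "example.txt"}}
store.mset([("id-1", dump_doc(doc))])

restored = [load_doc(raw) for raw in store.mget(["id-1", "missing"])]
print(restored)
```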


## Summary
A summary can often distill what a chunk is about more accurately than the raw text does, leading to better retrieval. Here we show how to create summaries, and then embed those.

In [None]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

In [None]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser() # returns only the message content as a string, dropping other fields
)

In [None]:
summaries = chain.batch(docs, {"max_concurrency": 5})

In [None]:
print(type(summaries))
print(len(summaries))
print(summaries[0])

for summary in summaries:
    print(len(summary))

In [None]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
# Convert the summaries to LangChain Documents
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

In [None]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
# We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)

In [None]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [None]:
print(sub_docs[0].page_content[:100])
print(sub_docs[0].metadata)

In [None]:
retrieved_docs = retriever.invoke("justice breyer")

In [None]:
print(retrieved_docs[0].page_content[:100])
print(retrieved_docs[0].metadata)

## Hypothetical Queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded.

In [None]:
functions = [
    {
        "name": "hypothetical_questions",
        "description": "Generate hypothetical questions",
        "parameters": {
            "type": "object",
            "properties": {
                "questions": {
                    "type": "array",
                    "items": {"type": "string"},
                },
            },
            "required": ["questions"],
        },
    }
]

In [None]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser

chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-3.5-turbo").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
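
The last step of the chain pulls the `questions` list out of the model's function-call arguments, which arrive as a JSON string. A rough, library-free sketch of that extraction follows. The `function_call` payload and its question texts are invented for illustration, and `extract_key` is a hypothetical helper, not the parser's real implementation.

```python
import json

# Hypothetical shape of a function-call response; the question texts are invented.
function_call = {
    "name": "hypothetical_questions",
    "arguments": json.dumps(
        {
            "questions": [
                "What is the document about?",
                "Who is mentioned in the document?",
                "What decision does the document describe?",
            ]
        }
    ),
}


def extract_key(call, key_name):
    """Parse the JSON-encoded arguments and return the value stored under key_name."""
    arguments = json.loads(call["arguments"])
    return arguments[key_name]


questions = extract_key(function_call, "questions")
print(questions)
```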

In [None]:
chain.invoke(docs[0])

In [None]:
print(len(docs))

In [None]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5}) # with gpt-4 this easily exceeds the tier-1 rate limit

In [None]:
print(len(hypothetical_questions))
for i, questions in enumerate(hypothetical_questions):
    print(f"questions for doc-{i}: {questions}")
    print("=======================================================")

In [None]:
# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
question_docs = []
for i, question_list in enumerate(hypothetical_questions): # length of hypothetical_questions is the same as docs
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

In [None]:
print(len(question_docs))

In [None]:
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [None]:
sub_docs

In [None]:
retrieved_docs = retriever.invoke("justice breyer")

In [None]:
len(retrieved_docs[0].page_content)