# Multiple vectors per document

It is often useful to store multiple vectors per document. For example, we can embed multiple chunks of a document and associate those embeddings with the parent document, so that retriever hits on the chunks return the larger document.

LangChain implements a base MultiVectorRetriever, which simplifies this process. Much of the complexity lies in how to create the multiple vectors per document. This notebook covers some of the common ways to create those vectors and use the MultiVectorRetriever.

The methods to create multiple vectors per document include:

- Smaller chunks: split a document into smaller chunks, and embed those (this is ParentDocumentRetriever).
- Summary: create a summary for each document, embed that along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, embed those along with (or instead of) the document.

Note that this also enables another method of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, giving you more control.
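To make the idea concrete, here is a minimal, library-free sketch of how several index entries (a chunk, a summary, and a manually added query) can all point to one parent document. Word overlap is used as a crude stand-in for embedding similarity, and every name in it (`parents`, `index`, `overlap`, `search_parents`) is hypothetical rather than part of LangChain.

```python
# Toy multi-vector index: several searchable entries per parent document.
# Word overlap stands in for embedding similarity (an assumption for brevity).

parents = {
    "doc-1": "Full text of an essay about learning to program on an IBM 1401 ...",
    "doc-2": "Full text of a speech about the economy and the Supreme Court ...",
}

# Each entry is (searchable text, parent id). Several entries share a parent.
index = [
    ("first programs written on an IBM 1401", "doc-1"),  # small chunk
    ("summary: an essay on writing and programming", "doc-1"),  # summary
    ("speech honoring a retiring supreme court justice", "doc-2"),  # small chunk
    ("who did the president nominate", "doc-2"),  # manually added query
]


def overlap(a: str, b: str) -> int:
    """Count the distinct lowercase words shared by two strings."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def search_parents(query: str, k: int = 2) -> list[str]:
    """Rank entries by overlap, then return each matching parent once."""
    ranked = sorted(index, key=lambda entry: overlap(query, entry[0]), reverse=True)
    parent_ids: list[str] = []
    for text, parent_id in ranked[:k]:
        if overlap(query, text) > 0 and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


results = search_parents("who did the president nominate to the court")
```

In this toy, the two best-matching entries both belong to `doc-2`, so the parent is returned only once. The manually added query is what gives that document a strong match for this phrasing.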

In [2]:
import os
from dotenv import load_dotenv
load_dotenv()


True

In [3]:
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

loaders = [
    TextLoader("../paul_graham_essay.txt"),
    TextLoader("../state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)


# Store full documents



In [4]:

# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

In [5]:
print(docs)
len(docs)


[Document(metadata={'source': '../paul_graham_essay.txt'}, page_content='What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early

12

In [6]:
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

doc_ids = [str(uuid.uuid4()) for _ in docs]

In [7]:
doc_ids

['11e20baa-8938-45fc-ae12-d0bc948697a1',
 'a4669108-c1e8-45e6-b822-796a28931965',
 '8514e99b-e833-4049-9ccb-12240ad85c5c',
 'fffd47d1-34d3-4679-b3a4-e25b506b8dba',
 'c472e9cc-9692-4c12-8e29-97a78b3aab71',
 '402f3049-c0ca-44dd-93fc-152e3a1a866d',
 '6f15aadb-014b-42b8-af8e-6a3c15f6d458',
 '7e1ee0fe-6e87-4ba5-b42e-a1be806fb97e',
 '1832202e-c2c6-4781-8c53-92e418ea8ca2',
 'add97977-8c80-43f8-98fc-e243714efa13',
 'f8b16ba3-5b0a-4c1a-8b5c-c632cd8f95af',
 '47c1d651-55f0-455f-8f30-f2325650b885']

# Smaller chunks
It is often useful to retrieve larger chunks of information but embed smaller chunks.

This lets the embeddings capture semantic meaning as closely as possible while passing as much context as possible downstream.

Note that this is what the ParentDocumentRetriever does. Here we show what is going on under the hood.

# Create sub splits
Split each document into smaller chunks while preserving the metadata associated with each chunk, and tag every chunk with the ID of its parent document.


In [8]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
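The same bookkeeping can be shown without any library. The sketch below uses a naive fixed-width splitter as a stand-in for `RecursiveCharacterTextSplitter` (which splits on separators rather than at fixed offsets). The function `split_with_parent_id` and the dict-based document shape are hypothetical and exist only for this illustration.

```python
import uuid


def split_with_parent_id(docs, chunk_size, id_key="doc_id"):
    """Split each document's text and tag every chunk with its parent's id.

    `docs` is a list of dicts with "text" and "metadata" keys (a hypothetical
    shape used only for this sketch).
    """
    parent_ids = [str(uuid.uuid4()) for _ in docs]
    chunks = []
    for parent_id, doc in zip(parent_ids, docs):
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            chunks.append(
                {
                    "text": text[start : start + chunk_size],
                    # Copy the parent's metadata so the source is preserved,
                    # then add the link back to the parent.
                    "metadata": {**doc["metadata"], id_key: parent_id},
                }
            )
    return parent_ids, chunks


toy_docs = [
    {"text": "a" * 25, "metadata": {"source": "essay.txt"}},
    {"text": "b" * 10, "metadata": {"source": "speech.txt"}},
]
parent_ids, chunks = split_with_parent_id(toy_docs, chunk_size=10)
```

Every chunk keeps its original `source` and gains a `doc_id`, which is the only link the retriever needs to get from a small chunk back to its parent.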
    

In [9]:
sub_docs

[Document(metadata={'source': '../paul_graham_essay.txt', 'doc_id': '11e20baa-8938-45fc-ae12-d0bc948697a1'}, page_content="What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep."),
 Document(metadata={'source': '../paul_graham_essay.txt', 'doc_id': '11e20baa-8938-45fc-ae12-d0bc948697a1'}, page_content='The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with al

# Store the sub_docs and original docs

Finally, we index the documents in our vector store and document store:


1. Add the list of sub-documents (sub_docs) created earlier (from the splitting process) into the **vector store**.
2. Store the original documents in the **document store**, keyed by their document IDs, so they can be retrieved in full when needed.

In [10]:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# The vector store alone will retrieve small chunks:



In [11]:
retriever.vectorstore.similarity_search("justice breyer")

[Document(metadata={'doc_id': 'f8b16ba3-5b0a-4c1a-8b5c-c632cd8f95af', 'source': '../state_of_the_union.txt'}, page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.'),
 Document(metadata={'doc_id': '47c1d651-55f0-455f-8f30-f2325650b885', 'source': '../state_of_the_union.txt'}, page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.'),
 Document(metadata

# Whereas the retriever will return the larger parent document:



In [12]:
retriever.invoke("justice breyer")

[Document(metadata={'source': '../state_of_the_union.txt'}, page_content='But in my administration, the watchdogs have been welcomed back. \n\nWe’re going after the criminals who stole billions in relief money meant for small businesses and millions of Americans.  \n\nAnd tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud. \n\nBy the end of this year, the deficit will be down to less than half what it was before I took office.  \n\nThe only president ever to cut the deficit by more than one trillion dollars in a single year. \n\nLowering your costs also means demanding more competition. \n\nI’m a capitalist, but capitalism without competition isn’t capitalism. \n\nIt’s exploitation—and it drives up prices. \n\nWhen corporations don’t have to compete, their profits go up, your prices go up, and small businesses and family farmers and ranchers go under. \n\nWe see it happening with ocean carriers moving goods in and out of America. \n\nDur
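The difference between the two calls comes down to one extra lookup step. As a rough sketch of that step, the retriever collects the `doc_id` of each hit in rank order, drops repeats, and fetches the parents. Plain dicts stand in here for the vector store hits and the document store, and `parents_for_hits` is a hypothetical name, not a LangChain function.

```python
def parents_for_hits(hits, docstore, id_key="doc_id"):
    """Map ranked child hits to their parent documents, each parent once."""
    ordered_ids = []
    for hit in hits:
        parent_id = hit["metadata"].get(id_key)
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Skip ids that are missing from the store instead of raising.
    return [docstore[pid] for pid in ordered_ids if pid in docstore]


docstore = {"p1": "parent document one", "p2": "parent document two"}
hits = [
    {"text": "chunk a", "metadata": {"doc_id": "p2"}},
    {"text": "chunk b", "metadata": {"doc_id": "p2"}},
    {"text": "chunk c", "metadata": {"doc_id": "p1"}},
    {"text": "chunk d", "metadata": {"doc_id": "gone"}},
]
parent_docs = parents_for_hits(hits, docstore)
```

Four child hits collapse to two parents here, which is why the retriever can return fewer documents than the underlying vector store search did.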

By default, the retriever performs a similarity search on the vector store. LangChain vector stores also support searching via maximal marginal relevance (MMR), which trades some similarity to the query for diversity among the results. This can be controlled via the search_type parameter of the retriever:

In [13]:
from langchain.retrievers.multi_vector import SearchType

retriever.search_type = SearchType.mmr
len(retriever.invoke("justice breyer")[0].page_content)

9874
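For intuition about what MMR changes, here is a small, self-contained sketch of the selection rule: each step picks the candidate that best balances similarity to the query against similarity to what has already been picked. The function names, the `lambda_mult` weighting convention, and the toy vectors are assumptions for illustration; the vector store's own implementation may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the closest match among items already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
candidates = [
    [1.0, 0.1],  # closest to the query
    [1.0, 0.0],  # near-duplicate of candidate 0
    [0.6, 0.8],  # less similar to the query, but unlike candidate 0
]
picked = mmr_select(query, candidates, k=2)
by_similarity = sorted(
    range(len(candidates)), key=lambda i: cosine(query, candidates[i]), reverse=True
)[:2]
```

With these vectors, plain similarity ranking keeps the two near-duplicates (indices 0 and 1), while MMR keeps index 0 and then prefers the more distinct candidate at index 2. Setting `lambda_mult=1.0` removes the diversity penalty and reduces the rule to similarity ranking.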

# Associating summaries with a document for retrieval:
Next, we generate a summary of each document with an LLM, then embed the summaries so that they can be used to retrieve the full documents.


In [14]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [15]:
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)

In [16]:
summaries = chain.batch(docs, {"max_concurrency": 5})

In [17]:
summaries
len(summaries)

12

# Store the summaries 

### Each summary describes one of the larger parent documents

Initialize a MultiVectorRetriever as before, indexing the summaries in our vector store, and retaining the original documents in our document store:



In [18]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [28]:
summary_docs
len(summary_docs)


12

### Querying the vector store will return summaries (of the bigger documents):



In [29]:
sub_docs = retriever.vectorstore.similarity_search("justice breyer")

In [30]:

sub_docs[0]

Document(metadata={'doc_id': '6836889c-5321-4c5e-9067-6d2a47b323f6'}, page_content='The document is a speech by the President discussing key issues and initiatives for the nation. The President highlights the constitutional responsibility of nominating Supreme Court justices, specifically nominating Judge Ketanji Brown Jackson, emphasizing her qualifications and broad support. \n\nThe speech addresses immigration reform, advocating for a secure border while providing pathways to citizenship for certain groups and improving the immigration system. The President stresses the need to protect women\'s rights, particularly in light of challenges to Roe v. Wade, and calls for the passage of the bipartisan Equality Act to support LGBTQ+ rights.\n\nThe President introduces a "Unity Agenda," outlining four major priorities: combating the opioid epidemic, addressing mental health issues, supporting veterans, and aiming to end cancer. He outlines specific actions and commitments, including increa

In [31]:
sub_docs

[Document(metadata={'doc_id': '6836889c-5321-4c5e-9067-6d2a47b323f6'}, page_content='The document is a speech by the President discussing key issues and initiatives for the nation. The President highlights the constitutional responsibility of nominating Supreme Court justices, specifically nominating Judge Ketanji Brown Jackson, emphasizing her qualifications and broad support. \n\nThe speech addresses immigration reform, advocating for a secure border while providing pathways to citizenship for certain groups and improving the immigration system. The President stresses the need to protect women\'s rights, particularly in light of challenges to Roe v. Wade, and calls for the passage of the bipartisan Equality Act to support LGBTQ+ rights.\n\nThe President introduces a "Unity Agenda," outlining four major priorities: combating the opioid epidemic, addressing mental health issues, supporting veterans, and aiming to end cancer. He outlines specific actions and commitments, including incre

# Whereas the retriever will return the larger source document:



In [32]:
retrieved_docs = retriever.invoke("justice breyer")

In [33]:
retrieved_docs

[Document(metadata={'source': '../state_of_the_union.txt'}, page_content='One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. \n\nA former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. \n\nAnd if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. \n\nWe can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.  \n\nWe’ve set up joint p

In [34]:
len(retrieved_docs[0].page_content)

9194

# Hypothetical Queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions can then be embedded and associated with the document to improve retrieval.

In [35]:
from typing import List

from langchain_core.pydantic_v1 import BaseModel, Field


class HypotheticalQuestions(BaseModel):
    """Generate hypothetical questions."""

    questions: List[str] = Field(..., description="List of questions")


chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o-mini").with_structured_output(
        HypotheticalQuestions
    )
    | (lambda x: x.questions)
)

In [36]:
chain.invoke(docs[0])

['If the author had not discovered Lisp, how might their career path have differed?',
 'What if the author had pursued a career in philosophy instead of switching to AI?',
 "How would the author's perspective on programming have changed if they had access to modern technology during their early experiences?"]

In [37]:
# Batch chain over documents to generate hypothetical questions
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})


# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]


# Generate Document objects from hypothetical questions
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )


retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

# Note that querying the underlying vector store will retrieve hypothetical questions that are semantically similar to the input query:



In [38]:
sub_docs = retriever.vectorstore.similarity_search("justice breyer")

sub_docs

[Document(metadata={'doc_id': '8617081e-381f-4e64-b7bf-049b89ee38d8'}, page_content='What might happen if Judge Ketanji Brown Jackson is confirmed to the Supreme Court?'),
 Document(metadata={'doc_id': 'b4c85df7-b3b4-4809-9e1a-4a1301f9fa12'}, page_content='What if the Justice Department successfully prosecutes the pandemic fraud cases, how might that impact public trust in government relief programs?'),
 Document(metadata={'doc_id': '8617081e-381f-4e64-b7bf-049b89ee38d8'}, page_content='How would changes in immigration laws impact the economy and the lives of immigrants in the United States?'),
 Document(metadata={'doc_id': 'c09253fb-66f9-4c2d-8aa1-0d5665fb3283'}, page_content='What if the speaker had decided to pursue angel investing sooner after the acquisition by Yahoo?')]

# And invoking the retriever will return the corresponding document:



In [39]:
retrieved_docs = retriever.invoke("justice breyer")
len(retrieved_docs[0].page_content)

9194