# MultiVectorRetriever

- Author: [YooKyung Jeon](https://github.com/sirena1)
- Peer Review: [choincnp](https://github.com/choincnp), [Hye-yoonJeong](https://github.com/Hye-yoonJeong)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

LangChain's `MultiVectorRetriever` stores several vectors for each document, so a single document can be matched through different representations of its content, such as smaller chunks, a summary, or hypothetical questions. This can improve retrieval accuracy, because the text that is embedded for search does not have to be the same text that is returned by the retriever.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Methods for Generating Multiple Vectors Per Document](#methods-for-generating-multiple-vectors-per-document)
- [Chunk + Original Document Retrieval](#chunk--original-document-retrieval)
- [Storing summaries in vector storage](#storing-summaries-in-vector-storage)
- [Utilizing Hypothetical Queries to explore document content](#utilizing-hypothetical-queries-to-explore-document-content)

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

%%capture --no-stderr
!pip install langchain-opentutorial

In [None]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_openai",
        "langchain-chroma",
        "pymupdf",
        "python-dotenv",
    ],
    verbose=False,
    upgrade=False,
)

In [1]:
from dotenv import load_dotenv

load_dotenv(override=True)


In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "MultiVectorRetriever",
    }
)

## Methods for Generating Multiple Vectors Per Document

1. **Creating Small Chunks**: Divide the document into smaller chunks and generate separate embeddings for each chunk. This method enables a more granular focus on specific parts of the document. It can be implemented using the `ParentDocumentRetriever`, making it easier to explore detailed information.

2. **Summary Embeddings**: Generate a summary for each document and create embeddings based on this summary. Summary embeddings are particularly useful for quickly grasping the core content of a document. By focusing only on the summary instead of analyzing the entire document, efficiency can be significantly improved.

3. **Utilizing Hypothetical Questions**: Create relevant hypothetical questions for each document and generate embeddings based on these questions. This approach is helpful when deeper exploration of specific topics or content is needed. Hypothetical questions enable a broader perspective on the document's content, facilitating a more comprehensive understanding.

4. **Manual Addition**: Users can manually add specific questions or queries that should be considered during document retrieval. This method provides users with more control over the search process, allowing for customized searches tailored to their specific needs.
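
As a rough illustration of what the four approaches have in common, the sketch below models each searchable representation as a small record that carries the ID of its parent document. The `VectorRecord` type, its field names, and the sample text are hypothetical and are not part of LangChain; the point is only that several records can share one `doc_id`.

```python
from dataclasses import dataclass


@dataclass
class VectorRecord:
    """One searchable representation of a parent document (hypothetical type)."""

    text: str    # the text that would be embedded
    kind: str    # "chunk", "summary", "question", or "manual"
    doc_id: str  # ID of the parent document this record points to


parent_id = "doc-1"  # illustrative ID; the tutorial uses UUIDs

records = [
    VectorRecord("Solar panels convert sunlight into electricity.", "chunk", parent_id),
    VectorRecord("Overview of how solar power is generated.", "summary", parent_id),
    VectorRecord("How do solar panels produce electricity?", "question", parent_id),
    VectorRecord("photovoltaic basics", "manual", parent_id),
]

# Whatever its kind, every record resolves to the same parent document.
parent_ids = {record.doc_id for record in records}
print(parent_ids)
```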


Preprocessing consists of loading the data from a PDF file and, in the sections that follow, splitting the loaded documents into chunks of a specified size.

The split documents are later used for embedding and retrieval.

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")
docs = loader.load()

The loaded documents are stored in the `docs` variable, one `Document` per PDF page.

In [None]:
print(docs[5].page_content[:500])

## Chunk + Original Document Retrieval

When searching through large volumes of text, embedding the data in smaller chunks often gives more precise matches.

With `MultiVectorRetriever`, documents can be stored and managed as multiple vectors.

- The original documents are stored in the `docstore`.
- The embedded documents are stored in the `vectorstore`.

This allows for splitting documents into smaller units, enabling more accurate searches. Additionally, the contents of the original document can be accessed when needed.


In [None]:
import uuid
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_vector import MultiVectorRetriever

# Vector store for indexing child chunks
vectorstore = Chroma(
    collection_name="small_bigger_chunks",
    embedding_function=OpenAIEmbeddings(model="text-embedding-ada-002"),
)

# Storage layer for parent documents
store = InMemoryByteStore()

id_key = "doc_id"

# Retriever (initially empty)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)

# Generate document IDs
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Verify two of the generated IDs
print(doc_ids[:2])

Defining `parent_text_splitter` for Larger Chunks and `child_text_splitter` for Smaller Chunks

Here, `parent_text_splitter` produces the larger chunks (`chunk_size=600`) and `child_text_splitter` produces the smaller chunks (`chunk_size=200`).
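
To get a feel for how two chunk sizes relate, here is a simplified sketch that cuts text at fixed character offsets. This is an assumption made for illustration: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its real chunk boundaries and counts will differ. The `split_fixed` helper and the placeholder page are hypothetical.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


page = "x" * 1500  # stand-in for the text of one page

parent_chunks = split_fixed(page, 600)  # larger chunks
child_chunks = split_fixed(page, 200)   # smaller chunks

print(len(parent_chunks), len(child_chunks))
```

The smaller chunk size yields more, finer-grained pieces of the same page, which is what makes the child chunks useful for precise matching.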


In [None]:
# Create a RecursiveCharacterTextSplitter object for larger chunks
parent_text_splitter = RecursiveCharacterTextSplitter(chunk_size=600)

# Splitter to be used for generating smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

Create Parent documents as larger chunks.

In [None]:
parent_docs = []

for i, doc in enumerate(docs):
    # Retrieve the ID of the current document
    _id = doc_ids[i]
    # Split the current document into larger parent chunks
    parent_doc = parent_text_splitter.split_documents([doc])

    for _doc in parent_doc:
        # Store the document ID in the metadata
        _doc.metadata[id_key] = _id
    parent_docs.extend(parent_doc)

Verify the `doc_id` assigned to `parent_docs`


In [None]:
# Check the metadata of the generated Parent documents.
parent_docs[0].metadata

Create Child documents as relatively smaller chunks.

In [None]:
child_docs = []
for i, doc in enumerate(docs):
    # Retrieve the ID of the current document
    _id = doc_ids[i]
    # Split the current document into child documents
    child_doc = child_text_splitter.split_documents([doc])
    for _doc in child_doc:
        # Store the document ID in the metadata
        _doc.metadata[id_key] = _id
    child_docs.extend(child_doc)

Verify the `doc_id` assigned to `child_docs`.

In [None]:
# Check the metadata of the generated Child documents.
child_docs[0].metadata

Check the number of chunks for each split document.

In [None]:
print(f"Number of split parent_docs: {len(parent_docs)}")
print(f"Number of split child_docs: {len(child_docs)}")

Add both the parent chunks and the smaller child chunks to the vector store.

Next, map the original documents to the generated UUIDs and add them to the `docstore`.

- Use the `mset()` method to store the document IDs and their documents as key-value pairs in the document store.
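
The key-value idea behind `mset()` can be sketched with a plain dictionary. `TinyStore` below is a hypothetical stand-in written for this illustration, not LangChain's store class; it only mirrors the batch set and batch get pattern, returning `None` for keys that are not present.

```python
from typing import Optional


class TinyStore:
    """Hypothetical in-memory key-value store with batch set and get."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def mset(self, pairs: list[tuple[str, str]]) -> None:
        # Store every (key, value) pair, overwriting existing keys.
        for key, value in pairs:
            self._data[key] = value

    def mget(self, keys: list[str]) -> list[Optional[str]]:
        # Return values in the order of the requested keys; None if missing.
        return [self._data.get(key) for key in keys]


toy_ids = ["id-1", "id-2"]
toy_pages = ["text of page one", "text of page two"]

toy_store = TinyStore()
toy_store.mset(list(zip(toy_ids, toy_pages)))

found = toy_store.mget(["id-2", "missing"])
print(found)
```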

In [None]:
# Add both parent and child documents to the vector store
retriever.vectorstore.add_documents(parent_docs)
retriever.vectorstore.add_documents(child_docs)

# Store the original documents in the docstore
retriever.docstore.mset(list(zip(doc_ids, docs)))

Perform Similarity Search and Display the Most Similar Document Chunk

Use the `retriever.vectorstore.similarity_search` method to search within child and parent document chunks.

The chunks are returned in order of decreasing similarity, and each retrieved chunk is printed below.

In [None]:
# Perform similarity search on the vectorstore
relevant_chunks = retriever.vectorstore.similarity_search(
    "What is the name of the generative AI created by Samsung Electronics?"
)
print(f"Number of retrieved documents: {len(relevant_chunks)}")

In [None]:
for chunk in relevant_chunks:
    print(chunk.page_content, end="\n\n")
    print(">" * 100, end="\n\n")

Execute a Query Using the `retriever.invoke()` Method

The `retriever.invoke()` method searches the chunks in the vector store, then returns the full original documents that those chunks belong to, looked up in the `docstore` by `doc_id`.
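
The two-step lookup can be sketched in plain Python. This is a toy model under stated assumptions, not the library's implementation: word overlap stands in for embedding similarity, a dictionary stands in for the `docstore`, and the `retrieve` and `word_overlap` helpers and the sample texts are hypothetical.

```python
def word_overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct words shared by query and text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Small chunks that are searched, each pointing at a parent via doc_id.
toy_chunks = [
    {"text": "solar panels convert sunlight into electricity", "doc_id": "A"},
    {"text": "panels are mounted on rooftops", "doc_id": "A"},
    {"text": "wind turbines convert wind into electricity", "doc_id": "B"},
]

# Full parent documents, keyed by doc_id.
toy_docstore = {
    "A": "Full page about solar power",
    "B": "Full page about wind power",
}


def retrieve(query: str, k: int = 2) -> list[str]:
    # Step 1: rank the chunks and keep the top k hits.
    ranked = sorted(
        toy_chunks, key=lambda c: word_overlap(query, c["text"]), reverse=True
    )[:k]
    # Step 2: collect parent IDs in rank order, without duplicates.
    ids: list[str] = []
    for hit in ranked:
        if hit["doc_id"] not in ids:
            ids.append(hit["doc_id"])
    # Step 3: return the parent documents rather than the chunks.
    return [toy_docstore[i] for i in ids if i in toy_docstore]


result = retrieve("how do solar panels make electricity")
print(result)
```

Because several chunk hits can share one parent, the number of documents returned can be smaller than the number of chunks matched.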

In [None]:
relevant_docs = retriever.invoke(
    "What is the name of the generative AI created by Samsung Electronics?"
)
print(f"Number of retrieved documents: {len(relevant_docs)}", end="\n\n")
print("=" * 100, end="\n\n")
print(relevant_docs[0].page_content)

The default search type performed by the retriever in the vector database is similarity search.

LangChain vector stores also support searching with [Maximal Marginal Relevance](https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.max_marginal_relevance_search) (MMR).

If you want to use this method instead, you can configure the `search_type` property as follows.

- Set the `search_type` property of the `retriever` object to `SearchType.mmr`.
  - This specifies that the MMR (Maximal Marginal Relevance) algorithm should be used during the search.
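
For intuition, here is a minimal sketch of the MMR selection loop on toy 2-D vectors. It is a generic textbook-style version, not the code that runs inside the vector store; the `mmr` and `cosine` helpers, the `lambda_mult` name as used here, and the sample vectors are assumptions made for illustration. Each step picks the candidate that best balances relevance to the query against similarity to what has already been selected.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: list[float], candidates: list[list[float]], k: int,
        lambda_mult: float) -> list[int]:
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already selected (0 if none yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [
    [1.0, 0.10],  # very relevant
    [1.0, 0.12],  # near-duplicate of the first
    [0.6, 0.80],  # less relevant, but different
]

diverse = mmr(query_vec, candidate_vecs, k=2, lambda_mult=0.3)
plain = mmr(query_vec, candidate_vecs, k=2, lambda_mult=1.0)
print(diverse, plain)
```

With `lambda_mult=1.0` the redundancy term vanishes and the loop reduces to plain similarity ranking; lower values push the selection toward more varied results.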

In [None]:
from langchain.retrievers.multi_vector import SearchType

# Set the search type to Maximal Marginal Relevance (MMR)
retriever.search_type = SearchType.mmr

# Search all related documents
print(
    retriever.invoke(
        "What is the name of the generative AI created by Samsung Electronics?"
    )[0].page_content
)

In [None]:
from langchain.retrievers.multi_vector import SearchType

# Set search type to similarity_score_threshold
retriever.search_type = SearchType.similarity_score_threshold
retriever.search_kwargs = {"score_threshold": 0.3}

# Search all related documents
print(
    retriever.invoke(
        "What is the name of the generative AI created by Samsung Electronics?"
    )[0].page_content
)

In [None]:
from langchain.retrievers.multi_vector import SearchType

# Set search type to similarity and k value to 1
retriever.search_type = SearchType.similarity
retriever.search_kwargs = {"k": 1}

# Search all related documents
print(
    len(
        retriever.invoke(
            "What is the name of the generative AI created by Samsung Electronics?"
        )
    )
)

## Storing summaries in vector storage

A summary often captures the content of a chunk more precisely than the raw text does, which can lead to better search results.

This section describes how to generate summaries and how to embed them.

In [None]:
# Importing libraries for loading PDF files and splitting text
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the PDF file loader
loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")

# Split text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)

# Load the PDF file and split it into chunks
split_docs = loader.load_and_split(text_splitter)

# Output the number of split documents
print(f"Number of split documents: {len(split_docs)}")

In [None]:
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


summary_chain = (
    {"doc": lambda x: x.page_content}
    # Create a prompt template for document summaries
    | ChatPromptTemplate.from_messages(
        [
            ("system", "You are an expert in summarizing documents in Korean."),
            (
                "user",
                "Summarize the following documents in 3 sentences in bullet points format.\n\n{doc}",
            ),
        ]
    )
    # Use an OpenAI chat model to generate the summaries
    | ChatOpenAI(temperature=0, model="gpt-4o-mini")
    | StrOutputParser()
)

Summarize the documents in the `split_docs` list in a batch using the `summary_chain.batch` method.
- Here, the `max_concurrency` parameter is set to 10 so that up to 10 documents are processed concurrently.
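
The effect of a concurrency cap can be sketched with the standard library. This is only a rough analogy under stated assumptions: `fake_summarize` is a hypothetical stand-in for the LLM call, and a thread pool with `max_workers=10` plays the role of the concurrency limit. As with a batch call, the results come back in the same order as the inputs.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(".")[0] + "."


toy_texts = [
    f"Chunk {i} first sentence. Chunk {i} second sentence." for i in range(25)
]

# At most 10 inputs are processed at the same time.
with ThreadPoolExecutor(max_workers=10) as pool:
    # pool.map yields results in input order, regardless of completion order.
    toy_summaries = list(pool.map(fake_summarize, toy_texts))

print(len(toy_summaries), toy_summaries[3])
```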

In [None]:
# Handling batches of documents
summaries = summary_chain.batch(split_docs, {"max_concurrency": 10})

In [None]:
len(summaries)

Print the summary to see the results.

In [None]:
# Print the content of the original chunk.
print(split_docs[33].page_content, end="\n\n")
# Print a summary.
print("[summary]")
print(summaries[33])

Initialize the `Chroma` vector store to index the summaries. Use `OpenAIEmbeddings` as the embedding function.

- Use `"doc_id"` as the key representing the document ID.


In [None]:
import uuid

# Create a vector store to store the summary information.
summary_vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)

# Create a store for the parent documents (InMemoryByteStore was imported earlier).
store = InMemoryByteStore()

# Specify a key name to store the document ID.
id_key = "doc_id"

# Initialize the retriever (empty at startup).
retriever = MultiVectorRetriever(
    vectorstore=summary_vectorstore,  # vector store
    byte_store=store,  # byte store
    id_key=id_key,  # document ID
)
# Create a document ID.
doc_ids = [str(uuid.uuid4()) for _ in split_docs]

Create a `Document` for each summary and store the ID of its source chunk in the metadata under `doc_id`.


In [None]:
summary_docs = [
    # Create a Document object with the summary as the page content and the document ID as metadata.
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

The number of summary documents matches the number of original chunks.


In [None]:
# Number of documents in the summary
len(summary_docs)

- Add `summary_docs` to the vector store with `retriever.vectorstore.add_documents(summary_docs)`.
- Map `doc_ids` to `split_docs` with `retriever.docstore.mset(list(zip(doc_ids, split_docs)))` to store them in the document store.


In [None]:
# Add the summary documents to the vector store.
retriever.vectorstore.add_documents(summary_docs)

# Map the document ID to the document and store it in the document store.
retriever.docstore.mset(list(zip(doc_ids, split_docs)))

Perform a similarity search using the `similarity_search` method of the `vectorstore` object.


In [None]:
# Perform a similarity search.
result_docs = summary_vectorstore.similarity_search(
    "What is the name of the generative AI created by Samsung Electronics?"
)

In [None]:
# Print the first result document.
print(result_docs[0].page_content)

Use the `invoke()` method of the `retriever` object to retrieve the original chunks related to your question.


In [None]:
# Retrieve the original chunks related to the query.
retrieved_docs = retriever.invoke(
    "What is the name of the generative AI created by Samsung Electronics?"
)
print(retrieved_docs[0].page_content)

## Utilizing Hypothetical Queries to explore document content

An LLM can also be used to generate a list of hypothetical questions that a particular document could answer.

These generated questions can be embedded and stored in the vector store, so that user queries are matched against questions the document is able to answer.

Generating hypothetical questions can also help you identify the key topics and concepts in a document.


Below is an example of generating hypothetical questions using function calling.

In [None]:
functions = [
    {
        "name": "hypothetical_questions",  # Specify a name for the function.
        "description": "Generate hypothetical questions",  # Write a description of the function.
        "parameters": {  # Define the parameters of the function.
            "type": "object",  # Specifies the type of the parameter as an object.
            "properties": {  # Defines the properties of an object.
                "questions": {  # Define the 'questions' attribute.
                    "type": "array",  # Type 'questions' as an array.
                    "items": {
                        "type": "string"
                    },  # Specifies the array's element type as String.
                },
            },
            "required": ["questions"],  # Specify 'questions' as a required parameter.
        },
    }
]

Use `ChatPromptTemplate` to define a prompt template that generates three hypothetical questions based on the given document.

- Set `functions` and `function_call` so that the model calls the hypothetical question generation function.
- Use `JsonKeyOutputFunctionsParser` to parse the generated hypothetical questions and extract the value of the `questions` key.
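
Conceptually, extracting one key from a function call comes down to decoding the JSON string of arguments and reading a field. The sketch below shows that idea with the standard library only; the `extract_key` helper and the sample payload are hypothetical, and this is not the parser's actual implementation.

```python
import json

# Hypothetical function-call payload: the arguments arrive as a JSON string.
sample_call = {
    "name": "hypothetical_questions",
    "arguments": '{"questions": ["Question one?", "Question two?", "Question three?"]}',
}


def extract_key(call: dict, key_name: str):
    """Decode the JSON arguments of a function call and return one key's value."""
    arguments = json.loads(call["arguments"])
    return arguments[key_name]


parsed_questions = extract_key(sample_call, "questions")
print(parsed_questions)
```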

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain_openai import ChatOpenAI

hypothetical_query_chain = (
    {"doc": lambda x: x.page_content}
    # Ask for exactly 3 hypothetical questions that the document below can answer. This number can be adjusted.
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer. "
        "Potential users are those interested in the AI industry. Create questions that they would be interested in. "
        "Output should be written in Korean:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o-mini").bind(
        functions=functions, function_call={"name": "hypothetical_questions"}
    )
    # Extract the value corresponding to the "questions" key from the output.
    | JsonKeyOutputFunctionsParser(key_name="questions")
)

Run the chain on a single document.

- The output contains the three hypothetical questions that were generated.


In [None]:
# Run the chain for the given document.
hypothetical_query_chain.invoke(split_docs[33])

Use the `hypothetical_query_chain.batch` method to process all of the `split_docs` concurrently.

In [None]:
# Generate hypothetical questions for every chunk in a batch
hypothetical_questions = hypothetical_query_chain.batch(
    split_docs, {"max_concurrency": 10}
)

In [None]:
hypothetical_questions[33]

Below is the process for storing the generated hypothetical questions in the vector store, in the same way as before.


In [None]:
# Vector store to use for indexing the hypothetical questions
hypothetical_vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# Storage layer for the parent documents (InMemoryByteStore was imported earlier)
store = InMemoryByteStore()

id_key = "doc_id"
# Retriever (empty on startup)
retriever = MultiVectorRetriever(
    vectorstore=hypothetical_vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in split_docs]  # Create a document ID

Add metadata (document IDs) to the `question_docs` list.
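
The shape of this step is a flattening: each source chunk has a list of questions, and every question in that list receives the ID of the chunk it came from. A minimal sketch with plain dictionaries follows; the sample texts and variable names are hypothetical, and dictionaries stand in for `Document` objects.

```python
import uuid

toy_chunk_texts = ["text of chunk zero", "text of chunk one"]
toy_question_lists = [
    ["q0-a?", "q0-b?", "q0-c?"],  # questions generated for chunk zero
    ["q1-a?", "q1-b?", "q1-c?"],  # questions generated for chunk one
]

# One ID per source chunk.
toy_doc_ids = [str(uuid.uuid4()) for _ in toy_chunk_texts]

# Flatten: one record per question, tagged with its source chunk's ID.
question_records = [
    {"page_content": question, "metadata": {"doc_id": toy_doc_ids[i]}}
    for i, question_list in enumerate(toy_question_lists)
    for question in question_list
]

print(len(question_records))
```

Three questions per chunk therefore produce three searchable entries that all lead back to the same original chunk.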


In [None]:
question_docs = []
# save hypothetical_questions
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        # Create a Document object for each question in the list of questions, and include the document ID for that question in the metadata.
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )

Add the hypothetical questions to the vector store, and add the original chunks to the `docstore`.


In [None]:
# Add the hypothetical question documents to the vector store.
retriever.vectorstore.add_documents(question_docs)

# Map the document ID to the document and store it in the document store.
retriever.docstore.mset(list(zip(doc_ids, split_docs)))

Perform a similarity search using the `similarity_search` method of the `vectorstore` object.


In [None]:
# Search the vector store for similar documents.
result_docs = hypothetical_vectorstore.similarity_search(
    "What is the name of the generative AI created by Samsung Electronics?"
)

Below are the results of the similarity search.

Only the generated hypothetical questions were added to the vector store, so the search returns the questions that are most similar to the query, not the original document text.


In [None]:
# Output the results of the similarity search.
for doc in result_docs:
    print(doc.page_content)
    print(doc.metadata)

Use the `invoke` method of the `retriever` object to retrieve the original chunks related to a query. Here, the query is the text of the second hypothetical question returned above.


In [None]:
# Retrieve the original chunks related to the query.
retrieved_docs = retriever.invoke(result_docs[1].page_content)

# Output the documents found.
for doc in retrieved_docs:
    print(doc.page_content)