<a href="https://colab.research.google.com/github/sugarforever/LangChain-Advanced/blob/main/Retrievers/04_MultiVector_Retriever/04_MultiVector_Retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 04 MultiVector Retriever

Storing multiple vectors per document can improve retrieval quality, and the approach has proved useful in a range of use cases.

LangChain provides a retriever component, `MultiVectorRetriever`, that supports this mechanism. There are several ways to create the vectors for each document:

- Smaller Chunks
  
  Split a document into smaller chunks, and embed them.

- Summary

  Create a summary for each document and embed it along with (or instead of) the document.

- Hypothetical questions

  Create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.
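
All three methods share one idea: search over small derived entries, then return the full document each entry came from. The following is a minimal pure-Python sketch of that idea, not LangChain's implementation. The word-overlap `score`, the toy documents, and the `doc-a`/`doc-b` IDs are all assumptions made up for illustration; a real setup would use embeddings and a vector store.

```python
import re

def tokens(text):
    """Lowercased word set: a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, text):
    """Jaccard overlap between the query and the text."""
    q, t = tokens(query), tokens(text)
    return len(q & t) / len(q | t) if (q | t) else 0.0

# Full parent documents, keyed by made-up IDs.
parents = {
    "doc-a": "Revenue grew strongly. Gross margin rose to 70.1 percent. Gross profit also increased.",
    "doc-b": "Shares were repurchased. Cash dividends were paid to shareholders.",
}

# Several small index entries per parent; each entry remembers its parent ID.
index = [
    (sentence.strip(". "), doc_id)
    for doc_id, text in parents.items()
    for sentence in text.split(". ")
]

def retrieve(query):
    """Find the best-matching small entry, then return it with its full parent."""
    best_text, best_id = max(index, key=lambda entry: score(query, entry[0]))
    return best_text, parents[best_id]

matched, parent = retrieve("What is the gross margin?")
print(matched)
print(parent)
```

The small entry is what gets matched, but the caller receives the whole parent, so the downstream LLM sees the surrounding context as well.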

## Smaller Chunks

In [None]:
!pip install -q -U langchain openai chromadb tiktoken pypdf

In [None]:
!wget -O nvidia_10q_2023.pdf https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/19771e6b-cc29-4027-899e-51a0c386111e.pdf

--2023-10-15 20:15:41--  https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/19771e6b-cc29-4027-899e-51a0c386111e.pdf
Resolving d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)... 18.64.171.68, 18.64.171.136, 18.64.171.43, ...
Connecting to d18rn0p25nwr6d.cloudfront.net (d18rn0p25nwr6d.cloudfront.net)|18.64.171.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 458891 (448K) [application/pdf]
Saving to: ‘nvidia_10q_2023.pdf’


2023-10-15 20:15:41 (4.97 MB/s) - ‘nvidia_10q_2023.pdf’ saved [458891/458891]



In [None]:
import os
os.environ['OPENAI_API_KEY'] = "your valid openai api key"

In [None]:
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import PyPDFLoader

In [None]:
loaders = [PyPDFLoader('./nvidia_10q_2023.pdf')]
docs = []
for loader in loaders:
    docs.extend(loader.load())
# Split into large parent documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

In [None]:
len(docs)

51

In [None]:
print(docs[6])

page_content="NVIDIA CORPORATION AND SUBSIDIARIES\nCONDENSED CONSOLIDATED STATEMENTS OF SHAREHOLDERS’ EQUITY\nFOR THE SIX MONTHS ENDED JULY 30, 2023 AND JULY 31, 2022\n(Unaudited)\nCommon Stock\nOutstandingAdditional\nPaid-in\nCapitalAccumulated Other\nComprehensive LossRetained\nEarningsTotal\nShareholders'\nEquity (In millions, except per share data) Shares Amount\nBalances, January 29, 2023 2,466 $ 2 $ 11,971 $ (43) $ 10,171 $ 22,101 \nNet income — — — — 8,232 8,232 \nOther comprehensive loss — — — (8) — (8)\nIssuance of common stock from stock plans 14 — 247 — — 247 \nTax withholding related to vesting of restricted stock units (3) — (1,179) — — (1,179)\nShares repurchased (8) — (1) — (3,283) (3,284)\nCash dividends declared and paid ($0.08 per common share) — — — — (199) (199)\nStock-based compensation — — 1,591 — — 1,591 \nBalances, July 30, 2023 2,469 $ 2 $ 12,629 $ (51) $ 14,921 $ 27,501 \nBalances, January 30, 2022 2,506 $ 3 $ 10,385 $ (11) $ 16,235 $ 26,612 \nNet income — — —

In [None]:
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
import uuid
# One unique ID per parent document, used to link child chunks back to it
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

In [None]:
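sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    # Split each parent document into small child chunks
    _sub_docs = child_text_splitter.split_documents([doc])
    # Tag every child chunk with the ID of its parent
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)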

In [None]:
print(sub_docs[1])

page_content='For the quarterly period ended July 30, 2023\nOR\n☐ TRANSITION REPORT PURSUANT T O SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nCommission file number: 0-23985\nNVIDIA CORPORATION\n(Exact name of registrant as specified in its charter)\nDelaware 94-3177549\n(State or other jurisdiction of (I.R.S. Employer\nincorporation or organization) Identification No.)' metadata={'source': './nvidia_10q_2023.pdf', 'page': 0, 'doc_id': 'bad39a26-308b-4d75-918e-59ee4976c9e4'}


In [None]:
# Index the small chunks for similarity search
retriever.vectorstore.add_documents(sub_docs)
# Store the full parent documents under their IDs
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
# Vectorstore alone retrieves the small chunks
similar_docs = retriever.vectorstore.similarity_search("What is the gross margin?")

In [None]:
similar_docs

[Document(page_content='Gross margin increased from a year ago and sequentially, primarily reflecting growth in Data Center sales. The year-on-year increase also\nreflects the impact on the year-ago gross margin from $1.34 billion in inventory provisions and related charges.\nOperating expenses were up 10% from a year ago and up 6% sequentially, primarily driven by compensation and benefits, including stock-based', metadata={'doc_id': '3cd9d92f-c3d9-49cc-9528-d94ac2790659', 'page': 26, 'source': './nvidia_10q_2023.pdf'}),
 Document(page_content='Gross margin increased from a year ago and sequentially, primarily reflecting growth in Data Center sales. The year-on-year increase also\nreflects the impact on the year-ago gross margin from $1.34 billion in inventory provisions and related charges.\nOperating expenses were up 10% from a year ago and up 6% sequentially, primarily driven by compensation and benefits, including stock-based', metadata={'doc_id': 'b37124b4-df02-48a4-808d-66f87640

In [None]:
# The retriever returns the larger parent documents
relevant_docs = retriever.get_relevant_documents("What is the gross margin?")

In [None]:
len(relevant_docs)

2
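
The retriever returns fewer documents than the vector store returns chunks because several matching chunks can share one parent. As a rough sketch of that step (an illustration under assumptions, not the library's code), the parent IDs of the hits are collected in rank order without duplicates and then looked up in the document store. The `parents_for_hits` helper, the dict-based hits, and the `p1`/`p2` IDs are hypothetical.

```python
def parents_for_hits(hits, docstore, id_key="doc_id"):
    """Map ranked chunk hits to their distinct parents, keeping rank order."""
    ordered_ids = []
    for hit in hits:
        parent_id = hit.get(id_key)
        # Skip hits without a parent ID and parents that were already collected.
        if parent_id is not None and parent_id not in ordered_ids:
            ordered_ids.append(parent_id)
    # Ignore IDs that are missing from the document store.
    return [docstore[i] for i in ordered_ids if i in docstore]

# Four ranked chunk hits (metadata only) that point to just two parents.
hits = [{"doc_id": "p1"}, {"doc_id": "p2"}, {"doc_id": "p1"}, {"doc_id": "p2"}]
docstore = {"p1": "parent one", "p2": "parent two"}

found = parents_for_hits(hits, docstore)
print(len(found))
```

With four hits spread over two parents, the sketch yields two documents, which matches the kind of count seen above.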

In [None]:
relevant_docs[0]

Document(page_content='Second Quarter of Fiscal Year 2024 Summary\nThree Months Ended\n July 30, 2023 April 30, 2023 July 31, 2022Quarter-over-Quarter\nChangeYear-over-Year\nChange\n($ in millions, except per share data)\nRevenue $ 13,507 $ 7,192 $ 6,704 88 % 101 %\nGross margin 70.1 % 64.6 % 43.5 % 5.5 pts 26.6 pts\nOperating expenses $ 2,662 $ 2,508 $ 2,416 6 % 10 %\nOperating income $ 6,800 $ 2,140 $ 499 218 % 1,263 %\nNet income $ 6,188 $ 2,043 $ 656 203 % 843 %\nNet income per diluted share $ 2.48 $ 0.82 $ 0.26 202 % 854 %\nWe specialize in markets where our computing platforms can provide tremendous acceleration for applications. These platforms incorporate\nprocessors, interconnects, software, algorithms, systems, and services to deliver unique value. Our platforms address four large markets where\nour expertise is critical: Data Center, Gaming, Professional Visualization, and Automotive.\nRevenue for the second quarter of fiscal year 2024 was $13.51 billion, up 101% from a year

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [None]:
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=memory)

In [None]:
result = qa({"question": "What is the gross margin?"})

In [None]:
result

{'question': 'What is the gross margin?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin for the second quarter and first half of fiscal year 2024 was 70.1% and 68.2%, respectively.')],
 'answer': ' The gross margin for the second quarter and first half of fiscal year 2024 was 70.1% and 68.2%, respectively.'}

In [None]:
result = qa({"question": "What is the main contribution to it?"})
result

{'question': 'What is the main contribution to it?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin for the second quarter and first half of fiscal year 2024 was 70.1% and 68.2%, respectively.'),
  HumanMessage(content='What is the main contribution to it?'),
  AIMessage(content=' Higher revenue from Compute GPUs of 208% and 112%, respectively, and lower inventory provisions.')],
 'answer': ' Higher revenue from Compute GPUs of 208% and 112%, respectively, and lower inventory provisions.'}

## Summary

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
import uuid
from langchain.schema.document import Document

In [None]:
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(max_retries=0)
    | StrOutputParser()
)

In [None]:
summaries = chain.batch(docs, {"max_concurrency": 5})
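
`batch` runs the chain once per input, and `max_concurrency` caps how many calls run at the same time. The results come back in input order, which matters because each summary is later paired with its parent by position. A minimal sketch of the same pattern with the standard library follows; the `summarize` stand-in (it keeps only the first sentence) and the `batch` helper are hypothetical and do not call any LLM.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(text):
    """Hypothetical stand-in for the LLM call: keep the first sentence."""
    return text.split(".")[0].strip() + "."

def batch(func, inputs, max_concurrency=5):
    """Apply func to every input on at most max_concurrency threads, in input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map yields results in the order of the inputs.
        return list(pool.map(func, inputs))

texts = ["Revenue doubled. Costs rose.", "Margins improved. Demand was strong."]
toy_summaries = batch(summarize, texts, max_concurrency=2)
print(toy_summaries)
```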

In [None]:
# The vectorstore to use to index the summaries
vectorstore = Chroma(
    collection_name="summaries",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
summary_docs = [Document(page_content=s, metadata={id_key: doc_ids[i]}) for i, s in enumerate(summaries)]

In [None]:
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
sub_docs = vectorstore.similarity_search("What is the gross margin?")
sub_docs[0]

Document(page_content="In the second quarter of fiscal year 2024, the company's revenue was $13.51 billion, an increase of 101% compared to the previous year and 88% sequentially. The Data Center sector saw significant growth, with revenue up 171% year-on-year and 141% sequentially. Gaming revenue increased by 22% year-on-year and 11% sequentially. Professional Visualization revenue decreased by 24% year-on-year, but increased by 28% sequentially. Automotive revenue increased by 15% year-on-year, but decreased by 15% sequentially. Gross margin also increased, primarily due to growth in Data Center sales. Operating expenses increased by 10% year-on-year and 6% sequentially.", metadata={'doc_id': 'c43be092-0450-4d24-a169-68c5867e372a'})

In [None]:
retrieved_docs = retriever.get_relevant_documents("What is the gross margin?")
retrieved_docs[0]

Document(page_content='Second Quarter of Fiscal Year 2024 Summary\nThree Months Ended\n July 30, 2023 April 30, 2023 July 31, 2022Quarter-over-Quarter\nChangeYear-over-Year\nChange\n($ in millions, except per share data)\nRevenue $ 13,507 $ 7,192 $ 6,704 88 % 101 %\nGross margin 70.1 % 64.6 % 43.5 % 5.5 pts 26.6 pts\nOperating expenses $ 2,662 $ 2,508 $ 2,416 6 % 10 %\nOperating income $ 6,800 $ 2,140 $ 499 218 % 1,263 %\nNet income $ 6,188 $ 2,043 $ 656 203 % 843 %\nNet income per diluted share $ 2.48 $ 0.82 $ 0.26 202 % 854 %\nWe specialize in markets where our computing platforms can provide tremendous acceleration for applications. These platforms incorporate\nprocessors, interconnects, software, algorithms, systems, and services to deliver unique value. Our platforms address four large markets where\nour expertise is critical: Data Center, Gaming, Professional Visualization, and Automotive.\nRevenue for the second quarter of fiscal year 2024 was $13.51 billion, up 101% from a year

In [None]:
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True))

In [None]:
result = qa({"question": "What is the gross margin?"})
result

{'question': 'What is the gross margin?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin for the second quarter and first half of fiscal year 2024 was 70.1% and 68.2%, respectively.')],
 'answer': ' The gross margin for the second quarter and first half of fiscal year 2024 was 70.1% and 68.2%, respectively.'}

In [None]:
result = qa({"question": "What is the main contribution to it?"})
result

{'question': 'What is the main contribution to it?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin for the second quarter and first half of fiscal year 2024 was 70.1% and 68.2%, respectively.'),
  HumanMessage(content='What is the main contribution to it?'),
  AIMessage(content=' Higher revenue from Compute GPUs of 208% and 112%, respectively, and lower inventory provisions.')],
 'answer': ' Higher revenue from Compute GPUs of 208% and 112%, respectively, and lower inventory provisions.'}

## Hypothetical Questions

In [None]:
functions = [
    {
      "name": "hypothetical_questions",
      "description": "Generate hypothetical questions",
      "parameters": {
        "type": "object",
        "properties": {
          "questions": {
            "type": "array",
            "items": {
                "type": "string"
              },
          },
        },
        "required": ["questions"]
      }
    }
  ]

In [None]:
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template("Generate a list of 3 hypothetical questions that the below document could be used to answer:\n\n{doc}")
    | ChatOpenAI(max_retries=0, model="gpt-4").bind(functions=functions, function_call={"name": "hypothetical_questions"})
    | JsonKeyOutputFunctionsParser(key_name="questions")
)
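
The last step of the chain pulls the `questions` array out of the model's function call. With OpenAI function calling, the arguments arrive as a JSON-encoded string, so the parser has to decode them and pick one key. A small sketch of that extraction is below; the `extract_key` helper and the sample payload are made up for illustration and are not the parser's actual code.

```python
import json

def extract_key(function_call, key_name):
    """Decode the JSON-encoded arguments and return a single field."""
    arguments = json.loads(function_call["arguments"])
    return arguments[key_name]

# Made-up function call payload, shaped like the schema defined above.
sample_call = {
    "name": "hypothetical_questions",
    "arguments": json.dumps(
        {"questions": ["What was revenue?", "What drove gross margin?"]}
    ),
}

questions = extract_key(sample_call, "questions")
print(questions)
```

Each input document therefore produces a plain Python list of question strings, which is why the later cell flattens a list of lists.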

In [None]:
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})

In [None]:
vectorstore = Chroma(
    collection_name="hypo-questions",
    embedding_function=OpenAIEmbeddings()
)
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

In [None]:
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend([Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list])

In [None]:
question_docs

[Document(page_content='What is the type of report filed by NVIDIA Corporation to the Securities and Exchange Commission?', metadata={'doc_id': '3fd2d11e-ff02-4281-993a-7bb966011e52'}),
 Document(page_content='What is the state of incorporation for NVIDIA Corporation?', metadata={'doc_id': '3fd2d11e-ff02-4281-993a-7bb966011e52'}),
 Document(page_content='What is the number of outstanding shares of common stock of NVIDIA Corporation as of August 18, 2023?', metadata={'doc_id': '3fd2d11e-ff02-4281-993a-7bb966011e52'}),
 Document(page_content="What were NVIDIA's financial results for the quarter ended July 30, 2023?", metadata={'doc_id': '63b7d66e-3fe0-49b7-861b-cb86fd114dae'}),
 Document(page_content='What are the risk factors identified by NVIDIA for the quarter ended July 30, 2023?', metadata={'doc_id': '63b7d66e-3fe0-49b7-861b-cb86fd114dae'}),
 Document(page_content='How does NVIDIA Corporation disclose material financial information to its investors?', metadata={'doc_id': '63b7d66e-3

In [None]:
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

In [None]:
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0), retriever, memory=ConversationBufferMemory(memory_key="chat_history", return_messages=True))

In [None]:
result = qa({"question": "What is the gross margin?"})
result

{'question': 'What is the gross margin?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin was 70.1% for the second quarter and 68.2% for the first half of fiscal year 2024.')],
 'answer': ' The gross margin was 70.1% for the second quarter and 68.2% for the first half of fiscal year 2024.'}

In [None]:
result = qa({"question": "What is the main contribution to it?"})
result

{'question': 'What is the main contribution to it?',
 'chat_history': [HumanMessage(content='What is the gross margin?'),
  AIMessage(content=' The gross margin was 70.1% for the second quarter and 68.2% for the first half of fiscal year 2024.'),
  HumanMessage(content='What is the main contribution to it?'),
  AIMessage(content=' Higher revenue from Compute GPUs.')],
 'answer': ' Higher revenue from Compute GPUs.'}