<a href="https://colab.research.google.com/github/rvernica/notebook/blob/main/mongodb/langchain-parent-document-retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain MongoDB Integration - Parent Document Retrieval

This notebook is a companion to the [Parent Document Retrieval](https://www.mongodb.com/docs/atlas/ai-integrations/langchain/parent-document-retrieval/) page. Refer to the page for set-up instructions and detailed explanations.

In [1]:
!pip install --quiet --upgrade \
  langchain \
  langchain-community \
  langchain-core \
  langchain-mongodb \
  langchain-voyageai \
  langchain-google-genai \
  pymongo \
  pypdf

In [5]:
# Show this runtime's public IP address (for example, to add it to the Atlas IP access list)
!curl ifconfig.me

35.245.220.36

In [2]:
import os
from google.colab import userdata

os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
os.environ["VOYAGE_API_KEY"] = userdata.get("VOYAGE_API_KEY")
MONGODB_URI = userdata.get("MONGODB_URI")

In [21]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
import json

# Load the PDF
loader = PyPDFLoader("https://investors.mongodb.com/node/12881/pdf")
data = loader.load()

# Chunk into parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=20)
docs = parent_splitter.split_documents(data)

# Print a document
print(json.dumps(docs[0].metadata, indent=4))
docs[0].page_content

{
    "producer": "West Corporation using ABCpdf",
    "creator": "PyPDF",
    "creationdate": "2024-12-09T21:06:39+00:00",
    "title": "MongoDB, Inc. Announces Third Quarter Fiscal 2025 Financial Results",
    "source": "https://investors.mongodb.com/node/12881/pdf",
    "total_pages": 8,
    "page": 0,
    "page_label": "1"
}


'MongoDB, Inc. Announces Third Quarter Fiscal 2025 Financial Results\nDecember 9, 2024\nThird Quarter Fiscal 2025 Total Revenue of $529.4 million, up 22% Year-over-Year\nContinued Strong Customer Growth with Over 52,600 Customers as of October 31, 2024\nMongoDB Atlas Revenue up 26% Year-over-Year; 68% of Total Q3 Revenue\nNEW YORK , Dec. 9, 2024 /PRNewswire/ -- MongoDB, Inc. (NASDAQ: MDB) today announced its financial results for the third quarter ended October\n31, 2024.\n\xa0\n  \xa0\n"MongoDB\'s third quarter results were significantly ahead of expectations on the top and bottom line, driven by better-than-expected EA performance\nand 26% Atlas revenue growth.\xa0 We continue to see success winning new business due to the superiority of MongoDB\'s developer data platform in\naddressing a wide variety of mission-critical use cases," said Dev Ittycheria, President and Chief Executive Officer of MongoDB .\n"We continue to invest in our legacy app modernization and AI offerings as our d

In [6]:
from langchain_mongodb.retrievers import MongoDBAtlasParentDocumentRetriever
from langchain_voyageai import VoyageAIEmbeddings

# Define the embedding model to use
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")

# Define the chunking method for the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)

# Specify the database and collection name
database_name = "langchain_db"
collection_name = "parent_document"

# Create the parent document retriever
parent_doc_retriever = MongoDBAtlasParentDocumentRetriever.from_connection_string(
    connection_string = MONGODB_URI,
    child_splitter = child_splitter,
    embedding_model = embedding_model,
    database_name = database_name,
    collection_name = collection_name,
    text_key = "page_content",
    relevance_score_fn = "dotProduct",
    search_kwargs = { "k": 10 },
)
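
The `relevance_score_fn = "dotProduct"` setting above ranks child chunks by the dot product of their embeddings with the query embedding. For unit-length vectors, the dot product equals cosine similarity, which is why this metric is typically paired with normalized embeddings. The following is a minimal, library-free sketch of that equivalence. The helper names (`dot`, `normalize`, `cosine`) and the sample vectors are illustrative assumptions, not part of the notebook or of `langchain-mongodb`.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 2-dimensional "embeddings" used only for illustration
a = [3.0, 4.0]
b = [4.0, 3.0]

# After normalization, the plain dot product matches cosine similarity
unit_dot = dot(normalize(a), normalize(b))
print(unit_dot, cosine(a, b))
```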

In [17]:
# Ingest the documents into Atlas
parent_doc_retriever.add_documents(docs)
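
Conceptually, parent document retrieval indexes small child chunks for search but returns the larger parent documents they came from. The sketch below illustrates that flow in plain Python under simplifying assumptions: `split_fixed` is a naive stand-in for `RecursiveCharacterTextSplitter`, keyword counting stands in for vector similarity, and all names are hypothetical. It is not how `MongoDBAtlasParentDocumentRetriever` is implemented internally.

```python
def split_fixed(text, size, overlap):
    """Naive fixed-width splitter (stand-in for a real text splitter)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def index_children(parents, size, overlap):
    """Split every parent and remember which parent each child came from."""
    return [
        (chunk, parent_id)
        for parent_id, text in parents.items()
        for chunk in split_fixed(text, size, overlap)
    ]

def retrieve_parents(query, children, parents, k):
    """Rank child chunks, then return their parents without duplicates."""
    words = query.lower().split()
    # Toy relevance score: how many query words appear in the chunk
    ranked = sorted(
        children,
        key=lambda child: sum(word in child[0].lower() for word in words),
        reverse=True,
    )
    parent_ids = []
    for _chunk, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parents[parent_id] for parent_id in parent_ids]

# Hypothetical parent documents
parents = {
    "p1": "Atlas revenue grew strongly this quarter. " * 3,
    "p2": "The company launched an AI applications program. " * 3,
}
children = index_children(parents, size=40, overlap=5)
results = retrieve_parents("AI program", children, parents, k=3)
print(len(children), len(results))
```

Several top-ranked child chunks can point to the same parent, which is why the number of returned parents can be smaller than `k`.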

In [22]:
# Get the vector store instance from the retriever
vector_store = parent_doc_retriever.vectorstore

# Use helper method to create the vector search index
vector_store.create_vector_search_index(
   dimensions = 1024,       # The dimensions of the vector embeddings to be indexed
   wait_until_complete = 60 # Number of seconds to wait for the index to build (can take around a minute)
)
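
For reference, the helper call above corresponds roughly to an Atlas Vector Search index definition like the one sketched below. The function name `vector_index_definition` is hypothetical, and the field path `embedding` is an assumption about the vector store's default embedding field; adjust it if your collection stores embeddings elsewhere. The `numDimensions` value must match the embedding model's output size (1024 in this notebook), and `similarity` should match the retriever's `relevance_score_fn`.

```python
import json

def vector_index_definition(path, dimensions, similarity):
    """Build an Atlas Vector Search index definition as a plain dict."""
    allowed = {"euclidean", "cosine", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                 # assumed embedding field name
                "numDimensions": dimensions,  # must equal the embedding size
                "similarity": similarity,
            }
        ]
    }

definition = vector_index_definition("embedding", 1024, "dotProduct")
print(json.dumps(definition, indent=4))
```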


In [25]:
# Run a vector search query
results = parent_doc_retriever.invoke("AI technology")

for result in results:
    print(json.dumps(result.metadata, indent=4))
    print(result.page_content[:1000])

{
    "_id": "bf362080-fa47-49c1-bfcf-54c2fedb995f",
    "producer": "West Corporation using ABCpdf",
    "creator": "PyPDF",
    "creationdate": "2024-12-09T21:06:39+00:00",
    "title": "MongoDB, Inc. Announces Third Quarter Fiscal 2025 Financial Results",
    "source": "https://investors.mongodb.com/node/12881/pdf",
    "total_pages": 8,
    "page": 1,
    "page_label": "2"
}
downturns and/or the effects of rising interest rates, inflation and volatility in the global economy and financial markets on our business and future
operating results; our potential failure to meet publicly announced guidance or other expectations about our business and future operating results; our
limited operating history; our history of losses; failure of our platform to satisfy customer demands; the effects of increased competition; our
investments in new products and our ability to introduce new features, services or enhancements; our ability to effectively expand our sales and
marketing organization; o

In [24]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_google_genai import ChatGoogleGenerativeAI

# Define a prompt template
template = """
   Use the following pieces of context to answer the question at the end.
   {context}
   Question: {query}
"""
prompt = PromptTemplate.from_template(template)
model = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

# Construct a chain to answer questions on your data
chain = (
   {"context": parent_doc_retriever, "query": RunnablePassthrough()}
   | prompt
   | model
   | StrOutputParser()
)

# Prompt the chain
query = "In a list, what are MongoDB's latest AI announcements?"
answer = chain.invoke(query)
print(answer)

MongoDB's latest AI announcements include:

*   Launched a MongoDB University course focused on building AI applications with MongoDB and AWS.
*   Announced new technology integrations for AI, data analytics, and automating database deployments across on-premises, cloud, and edge environments at Microsoft Ignite.
*   Launched the MongoDB AI Applications Program (MAAP) in July 2024.
*   Capgemini, Confluent, IBM, Unstructured, and QuantumBlack, AI by McKinsey have joined the MAAP ecosystem.
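
In the chain above, the retriever's output (a list of documents) is placed into `{context}` as-is, so the prompt receives the list's string form, metadata included. One common alternative is to join only the text of each document before prompting. The sketch below shows such a helper. Both `format_docs` and the `Doc` stand-in class are illustrative assumptions so that the example runs without LangChain installed; real retriever results expose the same `page_content` attribute.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of each retrieved document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Hypothetical retrieved documents
retrieved = [
    Doc("First parent document text.", {"page": 0}),
    Doc("Second parent document text.", {"page": 1}),
]
context = format_docs(retrieved)
print(context)
```

In the notebook, a helper like this could be piped after the retriever, for example `parent_doc_retriever | format_docs`, which LangChain coerces into a runnable step.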
