# 3.4 Parent document retriever


## Setup

### Install dependencies

In [None]:
%pip install python-dotenv~=1.0 docarray~=0.40.0 pypdf~=5.1 --upgrade --quiet
%pip install chromadb~=0.5.18 sentence-transformers~=3.3 --upgrade --quiet 
%pip install langchain~=0.3.7 langchain_openai~=0.2.6 langchain_community~=0.3.5 langchain-chroma~=0.1.4 langchainhub~=0.1.21 --upgrade --quiet
%pip install langchain_experimental~=0.3.3 --upgrade --quiet

# If running locally, you can do this instead:
#%pip install -r ../requirements.txt

### Load environment variables

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

# If running in Google Colab, you can use this code instead:
# from google.colab import userdata
# os.environ["AZURE_OPENAI_API_KEY"] = userdata.get("AZURE_OPENAI_API_KEY")
# os.environ["AZURE_OPENAI_ENDPOINT"] = userdata.get("AZURE_OPENAI_ENDPOINT")

### Setup models

In [None]:
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
api_version = "2024-10-01-preview"
llm = AzureChatOpenAI(deployment_name="gpt-4o", temperature=0.0, openai_api_version=api_version)
embedding_model = AzureOpenAIEmbeddings(model="text-embedding-3-large", openai_api_version=api_version)

### Setup LangSmith tracing for this notebook

In [None]:
import os

# The API key and related settings are read from the .env file
# my_name = "Totoro"
# os.environ["LANGCHAIN_TRACING_V2"] = "true"
# os.environ["LANGCHAIN_PROJECT"] = f"tokyo24-test-{my_name}"

# How to use the Parent Document Retriever

When splitting documents for retrieval, two goals often conflict:

1. You want small chunks, so that each embedding accurately reflects the
    chunk's meaning. If a chunk is too long, its embedding can lose
    meaning.
2. You want chunks long enough that the context of each chunk is
    retained.

The `ParentDocumentRetriever` strikes that balance by splitting documents
into small chunks and indexing those. During retrieval, it first fetches the
small chunks, then looks up their parent IDs and returns the corresponding
larger documents.

**Note** that "parent document" refers to the document that a small chunk
originated from. This can either be the whole raw document **OR a larger
chunk**.
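
The lookup described above can be sketched in plain Python. This is a toy illustration, not the `ParentDocumentRetriever` implementation: word overlap stands in for embedding similarity, plain dicts and lists stand in for the vector store and docstore, and the helper names (`split_fixed`, `build_index`, `retrieve_parents`) are hypothetical.

```python
import re
import uuid


def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def tokens(text):
    """Lowercased set of alphabetic words; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(documents, child_size):
    """Store each parent once and index its small chunks with the parent id."""
    docstore = {}     # parent id -> full parent text
    child_index = []  # (child text, parent id) pairs
    for text in documents:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for child in split_fixed(text, child_size):
            child_index.append((child, parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index, k=2):
    """Rank the small chunks, then return their parents without duplicates."""
    q = tokens(query)
    ranked = sorted(
        child_index, key=lambda pair: len(q & tokens(pair[0])), reverse=True
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


documents = [
    "Cats sleep most of the day. " * 5 + "A cat named Totoro likes tuna.",
    "Trains in Tokyo run on time. " * 5 + "The Yamanote line is a loop.",
]
docstore, child_index = build_index(documents, child_size=40)
results = retrieve_parents("Totoro tuna", docstore, child_index, k=1)
print(len(results[0]), "characters returned from a 40-character chunk match")
```

The search runs over the small chunks, but what comes back is the whole parent text, which is the behaviour the rest of this notebook demonstrates with the real retriever.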

In [None]:
from langchain.retrievers import ParentDocumentRetriever

In [None]:
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
loaders = [
    TextLoader("../data/paul_graham_essay.txt"),
    TextLoader("../data/state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

## Retrieving full documents

In this mode, we want to retrieve the full documents. Therefore, we only specify a child splitter.

In [None]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=embedding_model
)
# The storage layer for the parent documents (using an in-memory store for simplicity)
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

In [None]:
retriever.add_documents(docs, ids=None)

This should yield two keys, because we added two documents.

In [None]:
list(store.yield_keys())

Let's now call the vector store search functionality - we should see that it returns small chunks (since we're storing the small chunks).

In [None]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [None]:
print(sub_docs[0].page_content)

Let's now retrieve from the overall **retriever**. This should return large documents - since it returns the documents where the smaller chunks are located.

In [None]:
retrieved_docs = retriever.invoke("justice breyer")

In [None]:
len(retrieved_docs[0].page_content)

## Retrieving larger chunks

Sometimes the full documents are too big to retrieve as is. In that case, we first split the raw documents into larger chunks, and then split those into smaller chunks. We index the smaller chunks, but on retrieval we return the larger chunks (still not the full documents).
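
As a rough sketch of this two-level split, the plain-Python snippet below cuts a text into parent chunks and then cuts each parent into child chunks that remember their parent's id. It uses naive fixed-width slicing rather than `RecursiveCharacterTextSplitter`, and the helper names (`split_fixed`, `two_level_split`) and the sample text are assumptions made for illustration.

```python
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def two_level_split(text, parent_size, child_size):
    """Split into parent chunks, then split each parent into child chunks."""
    parents = {}   # parent id -> parent chunk (what retrieval would return)
    children = []  # (child chunk, parent id) pairs (what would be indexed)
    for parent_id, parent in enumerate(split_fixed(text, parent_size)):
        parents[parent_id] = parent
        for child in split_fixed(parent, child_size):
            children.append((child, parent_id))
    return parents, children


# Synthetic stand-in for a long raw document
raw = "".join(f"Sentence number {i}. " for i in range(500))
parents, children = two_level_split(raw, parent_size=2000, child_size=400)

print(len(parents), "parent chunks;", len(children), "child chunks")
```

Every child is contained in exactly one parent, so a hit on a child chunk can always be traded for a parent of bounded size instead of the full document.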

In [None]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=embedding_model
)
# The storage layer for the parent documents
store = InMemoryStore()

In [None]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
retriever.add_documents(docs)

We can see that there are many more than two documents now - these are the larger chunks.

In [None]:
len(list(store.yield_keys()))

Let's make sure the underlying vector store still retrieves the small chunks.

In [None]:
sub_docs = vectorstore.similarity_search("justice breyer")

In [None]:
print(sub_docs[0].page_content)

In [None]:
retrieved_docs = retriever.invoke("justice breyer")

In [None]:
len(retrieved_docs[0].page_content)

In [None]:
print(retrieved_docs[0].page_content)