# **LangChain `ParentDocumentRetriever` Quick Reference**

## Introduction

The `ParentDocumentRetriever` splits each document into small chunks, embeds them, and indexes them in a vector store. At query time it searches those small chunks, reads the parent ID stored in each match's metadata, and returns the corresponding parent: either the full original document or, when a `parent_splitter` is configured, a larger predefined chunk. Small chunks keep each embedding focused on a single idea, while the returned parents carry enough surrounding context to be useful downstream.
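
The flow described above can be sketched without any library. This is a toy model, not LangChain's implementation: `ToyParentRetriever`, `split_words`, and `word_overlap` are hypothetical names, chunk sizes are counted in words, and a word-overlap score stands in for embedding similarity.

```python
import re
from collections import Counter


def split_words(text, words_per_chunk):
    """Split text into consecutive chunks of at most words_per_chunk words."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def word_overlap(query, chunk):
    """Toy relevance score: shared word occurrences (stand-in for embeddings)."""
    q = Counter(re.findall(r"\w+", query.lower()))
    c = Counter(re.findall(r"\w+", chunk.lower()))
    return sum((q & c).values())


class ToyParentRetriever:
    """Hypothetical model of the small-chunk search / parent-lookup flow."""

    def __init__(self, parent_words, child_words):
        self.parent_words = parent_words
        self.child_words = child_words
        self.parents = {}   # parent_id -> parent text (docstore role)
        self.children = []  # (child text, parent_id) pairs (vector store role)

    def add(self, text):
        for parent in split_words(text, self.parent_words):
            parent_id = len(self.parents)
            self.parents[parent_id] = parent
            for child in split_words(parent, self.child_words):
                self.children.append((child, parent_id))

    def retrieve(self, query, k=1):
        # Rank the small chunks, then return each matching parent once.
        ranked = sorted(
            self.children, key=lambda c: word_overlap(query, c[0]), reverse=True
        )
        seen, results = set(), []
        for _, parent_id in ranked[:k]:
            if parent_id not in seen:
                seen.add(parent_id)
                results.append(self.parents[parent_id])
        return results


toy = ToyParentRetriever(parent_words=12, child_words=4)
toy.add(
    "Cats sleep most of the day and hunt small prey at night. "
    "Rivers carry fresh water from high mountain springs down to the sea."
)
hits = toy.retrieve("fresh water", k=1)
print(hits)
```

The query matches the four-word child chunk "Rivers carry fresh water", but the caller receives the whole twelve-word parent sentence.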

### Key Features

- **Chunking Strategy**: Small child chunks are embedded and searched, while the larger parent they came from is what gets returned. Embeddings stay precise without discarding context.
- **Dynamic Retrieval**: Parents are looked up at query time from the IDs stored on the matching child chunks, so several matching chunks from the same parent yield that parent only once.
- **Metadata Handling**: The `child_metadata_fields` argument selects which parent metadata fields are kept on the child chunks, so they remain available for filtering during retrieval.

### Use Cases

1. **Contextual Retrieval**: Ideal for applications where understanding the context surrounding a specific piece of information is crucial, such as in question-answering systems.
2. **Efficient Document Management**: Useful in scenarios where large documents need to be managed and accessed quickly without losing important contextual information.

In short, the `ParentDocumentRetriever` searches on small chunks and returns larger ones, which keeps retrieval precise without stripping away the context around each match.

---

## Preparation

### Installing Required Libraries
This section installs the necessary Python libraries for working with LangChain, OpenAI embeddings, and Chroma vector store. These libraries include:
- `langchain-openai`: Provides integration with OpenAI's embedding models.
- `langchain_community`: Contains community-contributed modules and tools for LangChain.
- `langchain_experimental`: Includes experimental features and utilities for LangChain.
- `langchain-chroma`: Enables integration with the Chroma vector database.
- `chromadb`: The core library for the Chroma vector database.

In [None]:
!pip install -qU langchain-openai
!pip install -qU langchain_community
!pip install -qU langchain_experimental
!pip install -qU "langchain-chroma>=0.1.2"
!pip install -qU chromadb

### Initializing OpenAI Embeddings
This section demonstrates how to securely fetch an OpenAI API key using Kaggle's `UserSecretsClient` and initialize the OpenAI embedding model. The `OpenAIEmbeddings` class is used to create an embedding model instance, which will be used to convert text into numerical embeddings.

Key steps:
1. **Fetch API Key**: The OpenAI API key is securely retrieved using Kaggle's `UserSecretsClient`.
2. **Initialize Embeddings**: The `OpenAIEmbeddings` class is initialized with the `text-embedding-3-small` model and the fetched API key.

This setup ensures that the embedding model is ready for use in downstream tasks, such as caching embeddings or creating vector stores.

In [None]:
from langchain_openai import OpenAIEmbeddings
from kaggle_secrets import UserSecretsClient

# Fetch API key securely
user_secrets = UserSecretsClient()
my_api_key = user_secrets.get_secret("api-key-openai")

# Initialize OpenAI embeddings
embed = OpenAIEmbeddings(model="text-embedding-3-small", api_key=my_api_key)
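
Outside Kaggle, `kaggle_secrets` is not importable. A minimal sketch of a loader that falls back to an environment variable is shown here; the helper name `load_openai_key` and the `OPENAI_API_KEY` default are assumptions for illustration, while the secret name `api-key-openai` is the one used in the cell above.

```python
import os


def load_openai_key(secret_name="api-key-openai", env_var="OPENAI_API_KEY"):
    """Return the key from Kaggle secrets when available, else from the environment."""
    try:
        # Only importable inside a Kaggle notebook
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        key = os.environ.get(env_var)
        if not key:
            raise RuntimeError(f"Set the {env_var} environment variable.")
        return key
    return UserSecretsClient().get_secret(secret_name)
```

The result can be passed as `api_key=load_openai_key()` when constructing `OpenAIEmbeddings`.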

---

## 1. Document Retrieval & Management

### **Adding and Retrieving Documents**

This example demonstrates how to add documents to the `ParentDocumentRetriever` and subsequently retrieve them based on a query. It showcases the fundamental workflow of indexing and retrieving documents.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="doc_retrieval_add_retrieve")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Sample documents (create Document objects)
documents = [
    Document(page_content="Document 1 content goes here.", metadata={"source": "doc1"}),
    Document(page_content="Document 2 content goes here.", metadata={"source": "doc2"}),
]

# Add documents to the retriever
retriever.add_documents(documents)

# Retrieve documents relevant to a query
query = "content goes here"
relevant_docs = retriever.invoke(query)

print("Retrieved Documents:")
for doc in relevant_docs:
    print(f"Source: {doc.metadata['source']}, Content: {doc.page_content}")

### **Filtering Retrieved Documents by Metadata**

This example illustrates how to retrieve documents while filtering them based on specific metadata criteria. This is useful when you want to narrow down search results to documents from particular sources or with certain attributes.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="doc_retrieval_filter_metadata")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever with child metadata fields
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
    child_metadata_fields=["source"],
)

# Sample documents (use Document objects)
documents = [
    Document(page_content="Document 1 content goes here.", metadata={"source": "internal"}),
    Document(page_content="Document 2 content goes here.", metadata={"source": "external"}),
]

# Add documents to the retriever
retriever.add_documents(documents)

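# Retrieve documents relevant to a query with a metadata filter.
# invoke() has no metadata-filter argument; the filter is handed to the
# vector store search through the retriever's search_kwargs.
query = "content goes here"
metadata_filter = {"source": "internal"}
retriever.search_kwargs = {"filter": metadata_filter}
relevant_docs = retriever.invoke(query)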

print("Retrieved Documents with 'internal' source:")
for doc in relevant_docs:
    print(f"Source: {doc.metadata['source']}, Content: {doc.page_content}")
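
Metadata filters act on the child chunks held in the vector store, which is why the chunks need to carry the `source` field. The following library-free sketch models that idea with plain dictionaries; `filtered_parents` and the record layout are hypothetical, not LangChain internals.

```python
# Parents live in a docstore-like dict keyed by parent ID.
parents = {
    "p1": {"text": "Document 1 content goes here.", "metadata": {"source": "internal"}},
    "p2": {"text": "Document 2 content goes here.", "metadata": {"source": "external"}},
}

child_metadata_fields = ["source"]  # same role as the retriever argument above

# Each child keeps only the selected fields plus a link back to its parent.
children = []
for parent_id, parent in parents.items():
    meta = {k: parent["metadata"][k] for k in child_metadata_fields}
    meta["doc_id"] = parent_id
    children.append({"text": parent["text"], "metadata": meta})


def filtered_parents(children, parents, metadata_filter):
    """Keep children matching every filter item, then map them to their parents."""
    matched_ids = []
    for child in children:
        meta = child["metadata"]
        if all(meta.get(k) == v for k, v in metadata_filter.items()):
            if meta["doc_id"] not in matched_ids:
                matched_ids.append(meta["doc_id"])
    return [parents[i]["text"] for i in matched_ids]


internal_only = filtered_parents(children, parents, {"source": "internal"})
print(internal_only)
```

If `source` were not copied onto the children, the filter would match nothing and no parent would be returned.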

### **Updating Documents in the Retriever**
    
While the `ParentDocumentRetriever` does not provide a direct method to update documents, you can manage updates by removing existing documents and adding the updated versions. This example demonstrates how to perform such an update.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document  # Import the Document class

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="doc_retrieval_update_docs")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Sample documents (use Document objects)
documents = [
    Document(page_content="Original content of Document 1.", metadata={"doc_id": "doc1"}),
    Document(page_content="Original content of Document 2.", metadata={"doc_id": "doc2"}),
]

# Add documents to the retriever
retriever.add_documents(documents)

# Updated document (use Document object)
updated_document = Document(page_content="Updated content of Document 1.", metadata={"doc_id": "doc1"})

# ParentDocumentRetriever has no update or remove method, so delete the stale
# entries from the underlying stores, then add the updated version.
# Find stored parent chunks whose metadata marks them as "doc1".
stale_parent_ids = [
    key for key in store.yield_keys()
    if store.mget([key])[0].metadata.get("doc_id") == "doc1"
]

for parent_id in stale_parent_ids:
    # Child chunks store their parent's docstore key under retriever.id_key
    child_ids = vectorstore.get(where={retriever.id_key: parent_id})["ids"]
    if child_ids:
        vectorstore.delete(ids=child_ids)
store.mdelete(stale_parent_ids)

# Add updated document
retriever.add_documents([updated_document])

# Retrieve updated document
query = "updated content"
relevant_docs = retriever.invoke(query)

print("Retrieved Updated Documents:")
for doc in relevant_docs:
    print(f"Doc ID: {doc.metadata['doc_id']}, Content: {doc.page_content}")
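
The remove-then-add pattern can be modelled without any library. In this sketch a dict plays the docstore and a list plays the vector index; `add_document` and `replace_document` are hypothetical helpers, and a real setup would call the delete methods of its own stores instead.

```python
import uuid

docstore = {}      # parent_id -> parent record
vector_index = []  # child records, each pointing at a parent_id


def add_document(text, doc_key):
    """Store one parent and one child chunk that links back to it."""
    parent_id = str(uuid.uuid4())
    docstore[parent_id] = {"text": text, "doc_key": doc_key}
    vector_index.append({"text": text, "parent_id": parent_id})
    return parent_id


def replace_document(doc_key, new_text):
    """Delete every parent stored under doc_key (and its children), then re-add."""
    stale = [pid for pid, rec in docstore.items() if rec["doc_key"] == doc_key]
    for pid in stale:
        del docstore[pid]
    # Drop child chunks that would otherwise point at deleted parents
    vector_index[:] = [c for c in vector_index if c["parent_id"] not in stale]
    return add_document(new_text, doc_key)


add_document("Original content of Document 1.", "doc1")
add_document("Original content of Document 2.", "doc2")
replace_document("doc1", "Updated content of Document 1.")

texts = sorted(rec["text"] for rec in docstore.values())
print(texts)
```

Deleting from both stores matters: child chunks left behind would still match queries while pointing at parents that no longer exist.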

---
    
## 2. Batch Processing
    
### **Batch Adding Multiple Documents**
    
This example demonstrates how to add multiple documents to the `ParentDocumentRetriever` in a single batch operation. Batch processing can improve efficiency when dealing with large volumes of data.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document  # Import the Document class

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="batch_processing_add")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Batch of documents (use Document objects)
documents = [
    Document(page_content="Content of Document 1.", metadata={"source": "doc1"}),
    Document(page_content="Content of Document 2.", metadata={"source": "doc2"}),
    Document(page_content="Content of Document 3.", metadata={"source": "doc3"}),
]

# Batch add documents
retriever.add_documents(documents)

# Verify addition by retrieving a document
query = "Content of Document 2."
relevant_docs = retriever.invoke(query)

print("Retrieved Document:")
for doc in relevant_docs:
    print(f"Source: {doc.metadata['source']}, Content: {doc.page_content}")

### **Batch Retrieving Documents for Multiple Queries**
    
This example runs several queries in one call with `batch`, which returns one list of documents per query, in input order. By default the queries are executed concurrently in a thread pool, which can shorten wall-clock time when each retrieval waits on network or disk I/O.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document  # Import the Document class

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="batch_processing_retrieve_queries")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Python is a versatile programming language.", metadata={"doc_id": "doc1"}),
    Document(page_content="Java is widely used in enterprise applications.", metadata={"doc_id": "doc2"}),
    Document(page_content="JavaScript powers the web.", metadata={"doc_id": "doc3"}),
]
retriever.add_documents(documents)

# List of queries
queries = [
    "programming language",
    "enterprise applications",
    "web development",
]

# Batch retrieve documents for all queries
results = retriever.batch(queries)

for i, docs in enumerate(results):
    print(f"Results for Query {i+1}:")
    for doc in docs:
        print(f"Doc ID: {doc.metadata['doc_id']}, Content: {doc.page_content}")
    print("---")
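
As a rough mental model, a batch call fans the inputs out to worker threads and collects the results in input order. The sketch below shows that pattern with the standard library only; `toy_invoke` and `toy_batch` are hypothetical stand-ins, not the LangChain implementation.

```python
from concurrent.futures import ThreadPoolExecutor

CORPUS = [
    "Python is a versatile programming language.",
    "Java is widely used in enterprise applications.",
    "JavaScript powers the web.",
]


def toy_invoke(query):
    """Stand-in for one retrieval call: keep documents sharing a word with the query."""
    words = query.lower().split()
    return [doc for doc in CORPUS if any(word in doc.lower() for word in words)]


def toy_batch(queries, max_concurrency=4):
    """Run one call per query in worker threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(toy_invoke, queries))


batch_results = toy_batch(
    ["programming language", "enterprise applications", "web development"]
)
for docs in batch_results:
    print(docs)
```

Threads help here because retrieval calls mostly wait on I/O; for CPU-bound work the gain would be small.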

### **Batch Processing with Configuration**
    
This example demonstrates how to use different configurations for each batch invocation. This flexibility allows for customized retrieval behaviors based on specific requirements.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document  # Import the Document class

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="batch_processing_no_runnableconfig")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Machine learning enables computers to learn from data.", metadata={"doc_id": "doc1"}),
    Document(page_content="Deep learning is a subset of machine learning.", metadata={"doc_id": "doc2"}),
    Document(page_content="Artificial intelligence encompasses machine learning and deep learning.", metadata={"doc_id": "doc3"}),
]
retriever.add_documents(documents)

# Define batch inputs
inputs = ["machine learning", "deep learning"]

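# Batch retrieve documents, passing one RunnableConfig per input.
# The config sets run-level options (tags, run names); it does not change search parameters.
configs = [
    {"tags": ["query-machine-learning"], "run_name": "retrieve_machine_learning"},
    {"tags": ["query-deep-learning"], "run_name": "retrieve_deep_learning"},
]
results = retriever.batch(inputs, config=configs)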

for i, docs in enumerate(results):
    print(f"Results for Query '{inputs[i]}':")
    for doc in docs:
        print(f"Doc ID: {doc.metadata['doc_id']}, Content: {doc.page_content}")
    print("---")

---
    
## 3. Streaming
    
### **Streaming Retrieval of Documents**
    
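This example consumes retrieval results through `stream`. A retriever does not produce partial results, so the default `stream` implementation yields a single chunk that contains the full list of documents, which is why the code below loops over the chunk and then over the documents inside it. The interface is still useful when the retriever is one step in a larger chain whose later steps do stream.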

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="streaming_retrieval")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Stream processing allows handling data in real-time.", metadata={"doc_id": "doc1"}),
    Document(page_content="Batch processing handles large volumes of data at once.", metadata={"doc_id": "doc2"}),
    Document(page_content="Real-time analytics requires efficient streaming.", metadata={"doc_id": "doc3"}),
]
retriever.add_documents(documents)

# Stream retrieval of documents
query = "real-time data processing"

# Process each chunk returned by the stream method
for chunk in retriever.stream(query):
    for doc in chunk:  # Iterate through the documents in the chunk
        print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")
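
The shape of the loop above follows from how a default stream behaves when a component has no incremental output: it computes the full result and yields it once. A minimal sketch of that behaviour, with hypothetical `toy_invoke` and `toy_stream` functions:

```python
DOCS = [
    "Stream processing allows handling data in real-time.",
    "Batch processing handles large volumes of data at once.",
]


def toy_invoke(query):
    """Stand-in for a retrieval call: substring match over a tiny corpus."""
    return [doc for doc in DOCS if query.lower() in doc.lower()]


def toy_stream(query):
    """Default-style stream: compute the whole result, then yield it as one chunk."""
    yield toy_invoke(query)


chunks = list(toy_stream("real-time"))
print(len(chunks), chunks[0])
```

There is exactly one chunk, and that chunk is the complete result list.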

### **Streaming Events During Retrieval**
    
This example showcases how to generate and handle a stream of events related to the retrieval process. Event streaming provides insights into the internal operations and progress of the retriever.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="streaming_events_retrieval")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Event-driven architectures respond to events.", metadata={"doc_id": "doc1"}),
    Document(page_content="Streaming data enables real-time processing.", metadata={"doc_id": "doc2"}),
    Document(page_content="Asynchronous events improve system responsiveness.", metadata={"doc_id": "doc3"}),
]
retriever.add_documents(documents)

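# Stream events during retrieval
query = "real-time processing"

# stream() only yields the final documents; lifecycle events come from the
# async astream_events() API. Top-level `async for` works in a notebook cell.
async for event in retriever.astream_events(query, version="v2"):
    print(f"Event: {event['event']} ({event['name']})")
    if event["event"] == "on_retriever_end":
        output = event["data"].get("output") or []
        # Some versions wrap the documents in a {"documents": [...]} dict
        docs = output.get("documents", []) if isinstance(output, dict) else output
        for doc in docs:
            print(f"Retrieved Document ID {doc.metadata['doc_id']}, Content: {doc.page_content}")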

### **Combining Streaming with Listeners**
    
Listeners attached with `with_listeners` also fire when the retriever is consumed through `stream`, so you can react to the start, end, or failure of a streamed retrieval.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Define listener functions
def on_start(run_obj):
    print("Retrieval process started.")

def on_end(run_obj):
    print("Retrieval process completed.")

def on_error(run_obj):
    print("An error occurred during retrieval.")

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="streaming_with_listeners")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever with listeners
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
).with_listeners(
    on_start=on_start,
    on_end=on_end,
    on_error=on_error
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Listener functions can react to retrieval events.", metadata={"doc_id": "doc1"}),
    Document(page_content="Event listeners enhance the functionality of retrievers.", metadata={"doc_id": "doc2"}),
    Document(page_content="Proper error handling ensures system robustness.", metadata={"doc_id": "doc3"}),
]
retriever.add_documents(documents)

# Stream retrieval with listeners
query = "retrieval events"

for chunk in retriever.stream(query):  # Process each chunk (list of Document objects)
    for doc in chunk:  # Process each Document object within the chunk
        print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")

---
    
## 4. Event Handling
    
### **Binding Synchronous Lifecycle Listeners**
    
This example demonstrates how to bind synchronous lifecycle listeners (`on_start`, `on_end`, and `on_error`) to the `ParentDocumentRetriever`. Each listener receives the run object for the retrieval and executes at the corresponding stage of the process.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Define listener functions
def on_start(run_obj):
    print("Retrieval started.")

def on_end(run_obj):
    print("Retrieval ended.")

def on_error(run_obj):
    print("An error occurred during retrieval.")

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="event_handling_bind_listeners")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever with listeners
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
).with_listeners(
    on_start=on_start,
    on_end=on_end,
    on_error=on_error
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Event listeners allow custom actions during retrieval.", metadata={"doc_id": "doc1"}),
    Document(page_content="They can be used to log retrieval activities.", metadata={"doc_id": "doc2"}),
]
retriever.add_documents(documents)

# Invoke retrieval to trigger listeners
query = "event listeners"

retrieved_docs = retriever.invoke(query)

for doc in retrieved_docs:
    print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")
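
Conceptually, lifecycle listeners wrap a call with hooks that run before it, after it succeeds, and when it fails. The sketch below illustrates that pattern in plain Python; `with_listeners_sketch` is a hypothetical helper, and a plain dict stands in for the run object.

```python
def with_listeners_sketch(func, on_start=None, on_end=None, on_error=None):
    """Wrap func so the hooks fire at the start, end, or failure of each call."""

    def wrapped(query):
        run = {"inputs": {"query": query}, "outputs": None, "error": None}
        if on_start:
            on_start(run)
        try:
            run["outputs"] = {"documents": func(query)}
        except Exception as exc:
            run["error"] = repr(exc)
            if on_error:
                on_error(run)
            raise  # listeners observe the failure; they do not swallow it
        if on_end:
            on_end(run)
        return run["outputs"]["documents"]

    return wrapped


events = []
search = with_listeners_sketch(
    lambda q: [f"doc about {q}"],
    on_start=lambda run: events.append(("start", run["inputs"]["query"])),
    on_end=lambda run: events.append(("end", len(run["outputs"]["documents"]))),
    on_error=lambda run: events.append(("error", run["error"])),
)
found = search("event listeners")
print(events)
```

Re-raising after `on_error` keeps the listeners purely observational, which is the design choice this sketch makes.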

### **Dispatching Custom Events**
    
Although the `ParentDocumentRetriever` does not directly expose methods for dispatching custom events, you can integrate custom event dispatching within your application logic. This example illustrates how to simulate custom event handling during the retrieval process.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Define a custom event dispatcher
def dispatch_custom_event(event_name, data):
    print(f"Custom Event: {event_name}, Data: {data}")

# Define listener functions with custom event dispatching.
# Listeners receive a Run object: inputs and outputs are dicts, error is a string.
def on_start(run_obj):
    dispatch_custom_event("retrieval_started", {"query": (run_obj.inputs or {}).get("query")})

def on_end(run_obj):
    documents = (run_obj.outputs or {}).get("documents", [])
    dispatch_custom_event("retrieval_completed", {"num_documents": len(documents)})

def on_error(run_obj):
    dispatch_custom_event("retrieval_error", {"error": str(run_obj.error)})

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="event_handling_custom_events")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever with custom event listeners
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
).with_listeners(
    on_start=on_start,
    on_end=on_end,
    on_error=on_error
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Custom events provide flexibility in handling retrieval processes.", metadata={"doc_id": "doc1"}),
    Document(page_content="They can be tailored to specific application needs.", metadata={"doc_id": "doc2"}),
]
retriever.add_documents(documents)

# Invoke retrieval to trigger custom events
query = "custom events"

try:
    retrieved_docs = retriever.invoke(query)
    for doc in retrieved_docs:
        print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")
except Exception as e:
    print(f"Exception during retrieval: {e}")

### **Using Custom Callbacks with Listeners**
    
This example shows how to integrate custom callback functions with the retriever's lifecycle listeners to perform additional operations, such as logging or data transformation, during the retrieval process.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

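# Define custom callback functions.
# Each callback receives a Run object: inputs and outputs are dicts, error is a string.
def log_start(run_obj):
    print(f"[LOG] Retrieval started for query: '{(run_obj.inputs or {}).get('query')}'")

def log_end(run_obj):
    documents = (run_obj.outputs or {}).get("documents", [])
    print(f"[LOG] Retrieval ended. Number of documents retrieved: {len(documents)}")

def log_error(run_obj):
    print(f"[LOG] Retrieval failed with error: {run_obj.error}")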

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="event_handling_custom_callbacks")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever with custom callbacks
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
).with_listeners(
    on_start=log_start,
    on_end=log_end,
    on_error=log_error
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Callbacks enhance the functionality of retrieval processes.", metadata={"doc_id": "doc1"}),
    Document(page_content="They allow for custom operations during retrieval.", metadata={"doc_id": "doc2"}),
]
retriever.add_documents(documents)

# Invoke retrieval to trigger custom callbacks
query = "callbacks in retrieval"

retrieved_docs = retriever.invoke(query)

for doc in retrieved_docs:
    print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")

---
    
## 5. Error Handling
    
### **Implementing Retry Logic with `with_retry`**
    
This example demonstrates how to add retry logic to the `ParentDocumentRetriever` using the `with_retry` method. The retriever will attempt to retry the retrieval operation upon encountering specified exceptions.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="error_handling_retry_logic")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Reliable retrieval is crucial for applications.", metadata={"doc_id": "doc1"}),
]
retriever.add_documents(documents)

# Apply retry logic to the retriever
retriever_with_retry = retriever.with_retry(
    stop_after_attempt=3,
    retry_if_exception_type=(ValueError,),
    wait_exponential_jitter=True
)

# Invoke retrieval with retry logic
query = "reliable retrieval"

try:
    retrieved_docs = retriever_with_retry.invoke(query)
    for doc in retrieved_docs:
        print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")
except ValueError as e:
    print(f"Retrieval failed after retries: {e}")
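
The three arguments used above (attempt limit, exception types, exponential wait with jitter) describe a general retry pattern. A minimal standard-library sketch of that pattern follows; `retry_call` and its parameters are hypothetical and only approximate what a production retry helper does.

```python
import random
import time


def retry_call(func, *, retry_on=(ValueError,), max_attempts=3,
               base_delay=0.01, jitter=True):
    """Call func, retrying on the listed exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            if jitter:
                # Random extra wait so concurrent callers do not retry in lockstep
                delay += random.uniform(0, base_delay)
            time.sleep(delay)


calls = {"count": 0}


def flaky_search():
    """Fails twice with ValueError, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("transient failure")
    return ["Reliable retrieval is crucial for applications."]


retried_result = retry_call(flaky_search, max_attempts=3)
print(calls["count"], retried_result)
```

Exceptions outside `retry_on` propagate immediately, so a programming error is not retried as if it were a transient fault.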

### **Handling Specific Exceptions with Retries**
    
This example showcases how to configure the `with_retry` method to handle specific exception types. The retriever will only retry upon encountering the specified exceptions, allowing for more granular error management.

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain.schema import Document

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstore and docstore with a unique collection name
vectorstore = Chroma(embedding_function=embed, collection_name="error_handling_specific_retries")
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents (use Document objects)
documents = [
    Document(page_content="Selective error handling allows for precise control.", metadata={"doc_id": "doc1"}),
]
retriever.add_documents(documents)

# Apply selective retry logic
retriever_with_retry = retriever.with_retry(
    stop_after_attempt=2,
    retry_if_exception_type=(ValueError,),  # Only retry on ValueError
    wait_exponential_jitter=False
)

# Invoke retrieval with selective retry logic
query = "selective error handling"

try:
    retrieved_docs = retriever_with_retry.invoke(query)
    for doc in retrieved_docs:
        print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")
except Exception as e:
    print(f"Retrieval failed: {e}")

### **Combining Retry with Fallbacks**
    
Retries and fallbacks are separate `Runnable` wrappers (`with_retry` and `with_fallbacks`), and they can be chained for extra robustness. This example retries the primary retriever first and, if it still raises one of the handled exceptions, hands the query to a fallback retriever.
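
The control flow this combination produces can be sketched without the library. All names here (`call_with_fallbacks`, `primary_search`, `primary_with_retry`, `fallback_search`) are hypothetical, and re-raising the first error when every candidate fails is a choice made for this sketch.

```python
def call_with_fallbacks(primary, fallbacks, exceptions_to_handle=(ValueError,)):
    """Try primary, then each fallback in order; re-raise the first error if all fail."""
    first_error = None
    for candidate in [primary, *fallbacks]:
        try:
            return candidate()
        except exceptions_to_handle as exc:
            if first_error is None:
                first_error = exc
    raise first_error


attempts = {"primary": 0}


def primary_search():
    """Simulated primary retriever that always fails."""
    attempts["primary"] += 1
    raise ValueError("primary store unavailable")


def primary_with_retry(max_attempts=2):
    """Retry the primary a fixed number of times before giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return primary_search()
        except ValueError:
            if attempt == max_attempts:
                raise


def fallback_search():
    return ["Fallback retriever document content."]


combined_result = call_with_fallbacks(primary_with_retry, [fallback_search])
print(attempts["primary"], combined_result)
```

The primary is attempted twice before the fallback runs, so the order of wrapping determines how long a failing primary delays the fallback.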

In [None]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.retrievers.parent_document_retriever import ParentDocumentRetriever
from langchain_core.documents import Document  # preferred import path; langchain.schema is a legacy alias

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, add_start_index=True)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, add_start_index=True)

# Initialize vectorstores and docstores for primary and fallback with unique collection names
primary_vectorstore = Chroma(embedding_function=embed, collection_name="error_handling_retry_with_fallbacks_primary")
fallback_vectorstore = Chroma(embedding_function=embed, collection_name="error_handling_retry_with_fallbacks_fallback")
primary_store = InMemoryStore()
fallback_store = InMemoryStore()

# Initialize the primary retriever
primary_retriever = ParentDocumentRetriever(
    vectorstore=primary_vectorstore,
    docstore=primary_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents to the primary retriever
primary_retriever.add_documents([
    Document(page_content="Primary retriever document content.", metadata={"doc_id": "primary_doc"}),
])

# Apply retry logic to the primary retriever
primary_with_retry = primary_retriever.with_retry(
    stop_after_attempt=2,
    retry_if_exception_type=(ValueError,),
    wait_exponential_jitter=False
)

# Initialize the fallback retriever
fallback_retriever = ParentDocumentRetriever(
    vectorstore=fallback_vectorstore,
    docstore=fallback_store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add sample documents to the fallback retriever
fallback_retriever.add_documents([
    Document(page_content="Fallback retriever document content.", metadata={"doc_id": "fallback_doc"}),
])

# Combine the primary retriever with fallback retrievers
combined_retriever = primary_with_retry.with_fallbacks(
    fallbacks=[fallback_retriever],
    exceptions_to_handle=(ValueError,)
)

# Invoke combined retriever
query = "robust retrieval"

retrieved_docs = combined_retriever.invoke(query)

for doc in retrieved_docs:
    print(f"Retrieved Document: {doc.metadata['doc_id']}, Content: {doc.page_content}")
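
To make the order of operations concrete, here is a rough plain-Python sketch of "retry first, then fall back". It is only an illustration of the control flow under the assumption that two attempts are made in total; `with_retry`, `invoke_with_fallbacks`, `broken_primary`, and `healthy_fallback` are hypothetical helpers, not LangChain APIs.

```python
def with_retry(func, attempts=2, retry_on=(ValueError,)):
    """Wrap func so that handled errors are retried up to `attempts` times in total."""
    def wrapped(query):
        for attempt in range(1, attempts + 1):
            try:
                return func(query)
            except retry_on:
                if attempt == attempts:
                    raise  # give up and let the caller decide what to do
    return wrapped


def invoke_with_fallbacks(primary, fallbacks, query, handle=(ValueError,)):
    """Try `primary`, then each fallback in order, moving on only for handled errors."""
    last_error = None
    for retriever in [primary, *fallbacks]:
        try:
            return retriever(query)
        except handle as exc:
            last_error = exc  # remember the error and try the next candidate
    raise last_error


calls = []  # records which retriever was invoked, in order


def broken_primary(query):
    """Hypothetical retriever that always fails."""
    calls.append("primary")
    raise ValueError("primary store unavailable")


def healthy_fallback(query):
    """Hypothetical retriever that always succeeds."""
    calls.append("fallback")
    return [f"fallback result for {query!r}"]


result = invoke_with_fallbacks(
    with_retry(broken_primary), [healthy_fallback], "robust retrieval"
)
print(calls)
print(result)
```

The retry wrapper sits inside the fallback chain, so the fallback is consulted only after the primary has used up all of its attempts.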

---

## 6. Best Practices

### **Using ParentDocumentRetriever for Full and Larger Chunk Retrieval**

#### **Loading and Preparing Documents**
This section loads documents from text files with `TextLoader` so that they can be added to the retriever.

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load documents
loaders = [
    TextLoader("/kaggle/input/paul-graham-essay/paul_graham_essay.txt"),
    TextLoader("/kaggle/input/paul-graham-essay/state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

#### **Retrieving Full Documents Using Small Chunks**
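In this mode, documents are split into small chunks for indexing, but retrieval returns the full original documents. The `ParentDocumentRetriever` is configured with only a child splitter, so each source document is stored whole in the docstore and acts as the parent of its chunks.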
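
The child-to-parent lookup can be sketched in plain Python. This is a simplified illustration, not the library's code: a fixed-width `split_text` stands in for `RecursiveCharacterTextSplitter`, a substring match stands in for vector similarity search, and all function names and the sample text are made up for the example.

```python
import uuid


def split_text(text, chunk_size):
    """Naive fixed-width splitter (stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


docstore = {}     # parent id -> full document text
child_index = []  # (child chunk, parent id) pairs; stand-in for the vector store


def add_documents(texts, chunk_size=400):
    """Store each full document and index its chunks under the parent's id."""
    for text in texts:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for chunk in split_text(text, chunk_size):
            child_index.append((chunk, parent_id))


def retrieve(query):
    """Match child chunks, then return each distinct parent once, in hit order."""
    parent_ids = []
    for chunk, parent_id in child_index:
        if query.lower() in chunk.lower() and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


essay = "filler text. " * 100 + "Justice Breyer is mentioned here. " + "more filler. " * 100
add_documents([essay, "An unrelated short document."])

hits = retrieve("justice breyer")
print(len(docstore), "parents,", len(child_index), "child chunks")
print(len(hits), "parent returned, length", len(hits[0]))
```

The match happens on a chunk of at most 400 characters, yet the caller receives the whole document that chunk came from.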

In [None]:
# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="full_documents", embedding_function=embed)

# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

# Add documents to the retriever
retriever.add_documents(docs, ids=None)

# List the keys in the document store (one key per full parent document)
print("Keys:", list(store.yield_keys()))

# Perform similarity search in vectorstore
sub_docs = vectorstore.similarity_search("justice breyer")
print("Content length:", len(sub_docs[0].page_content))
print(sub_docs[0].page_content)

# Retrieve documents using the retriever; each result is a full parent document,
# so its length is much larger than the 400-character child chunk above
retrieved_docs = retriever.invoke("justice breyer")
print("Content length:", len(retrieved_docs[0].page_content))

#### **Retrieving Larger Chunks with Parent Splitting**
In this mode, documents are first split into larger chunks (parent documents), which are further split into smaller chunks (child documents). This provides a balance between granularity and context.
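
A minimal sketch of the two-level split, assuming simple fixed-width chunking rather than the recursive splitter used in this notebook. `split_text` and `build_hierarchy` are hypothetical helpers, and the 5,000-character placeholder text is invented for the example.

```python
def split_text(text, chunk_size):
    """Naive fixed-width splitter (stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_hierarchy(text, parent_size=2000, child_size=400):
    """Return (parents, children); each child records the index of its parent."""
    parents = split_text(text, parent_size)
    children = [
        (child, parent_idx)
        for parent_idx, parent in enumerate(parents)
        for child in split_text(parent, child_size)
    ]
    return parents, children


text = "x" * 5000  # placeholder document of 5,000 characters
parents, children = build_hierarchy(text)

print(len(parents), "parent chunks")
print(len(children), "child chunks")
print(max(len(p) for p in parents), "characters in the largest parent")
```

With a parent splitter, the docstore holds one entry per parent chunk rather than one per source file, which is why the key count printed in the next cell is larger than the number of loaded documents.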

In [None]:
# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# This text splitter is used to create the child documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="split_parents", embedding_function=embed)

# The storage layer for the parent documents
store = InMemoryStore()

# Initialize the retriever with parent and child splitters
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents to the retriever
retriever.add_documents(docs)

# Check the number of keys in the document store
print("Number of keys:", len(list(store.yield_keys())))

In [None]:
# Perform similarity search in vectorstore
sub_docs = vectorstore.similarity_search("justice breyer")
print("Content length:", len(sub_docs[0].page_content))
print(sub_docs[0].page_content)

In [None]:
# Retrieve documents using the retriever
retrieved_docs = retriever.invoke("justice breyer")
print("Content length:", len(retrieved_docs[0].page_content))
print(retrieved_docs[0].page_content)

## Conclusion

The `ParentDocumentRetriever` combines fine-grained similarity search over small chunks with the ability to return broader context. Documents are indexed at two levels: small child chunks that are embedded and searched, and parents (either the full documents or larger chunks) that are returned to the caller. This makes it suitable both when precise matching on short passages matters and when larger passages are needed for context. It works with LangChain text splitters, vector stores such as `Chroma`, and a document store such as `InMemoryStore` that holds the parent documents.