# **Parent Document Retriever**

Parent Document Retriever is a technique where large documents are split into smaller pieces, called "child chunks." These chunks are embedded and indexed so that the system can match specific parts of a document against a user’s query. The larger document, or "parent," is kept in a separate store and is returned only when one of its child chunks matches the query.

Reference: [Parent Document Retriever](https://python.langchain.com/docs/how_to/parent_document_retriever/)

# **📚 Notebook Summary & Step-by-Step Guide**

## **🎯 What This Notebook Does**
This notebook implements the **Parent Document Retriever**, a technique that splits documents into **small searchable chunks** (children) while preserving **large context blocks** (parents). When a small chunk matches a query, the system returns the parent block that chunk came from, which gives the model richer context while keeping the search precise.

## **🔧 Key Libraries & Their Roles**

### **Core RAG Libraries**
- **`langchain`** - Main orchestration with parent-child retrieval components
- **`langchain-openai`** - OpenAI integration for embeddings and chat
- **`chromadb`** - Vector database for child chunk storage
- **`athina`** - RAG evaluation and monitoring framework

### **Document Management**
- **`ParentDocumentRetriever`** - Core component managing parent-child relationships
- **`InMemoryStore`** - Storage layer for parent documents
- **`RecursiveCharacterTextSplitter`** - Dual-level document splitting

### **Vector Storage**
- **`Chroma`** - Vector database for child chunk embeddings
- **`OpenAIEmbeddings`** - Embedding model for semantic search

## **📋 Step-by-Step Process**

### **Step 1: Environment Setup**
- Install required packages: `athina`, `chromadb`
- Configure OpenAI and Athina API keys
- Set up dual-storage architecture

### **Step 2: Dual-Level Document Splitting**
- **Parent Splitter**: Create large chunks (2000 characters) for context
- **Child Splitter**: Create small chunks (400 characters) for search
- **Purpose**: Balance search precision with context richness
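
The two-level split in Step 2 can be sketched in plain Python. This is a simplified illustration that cuts fixed-size character windows; it is not `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraphs and lines. The helper names `split_fixed` and `split_two_levels` and the 4,500-character sample document are assumptions made for the example.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut text into consecutive windows of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_two_levels(text: str, parent_size: int = 2000, child_size: int = 400):
    """Return a list of (parent_chunk, [child_chunks]) pairs."""
    return [
        (parent, split_fixed(parent, child_size))
        for parent in split_fixed(text, parent_size)
    ]


document = "x" * 4500  # hypothetical 4,500-character document
levels = split_two_levels(document)

parent_sizes = [len(parent) for parent, _ in levels]
child_counts = [len(children) for _, children in levels]
print(parent_sizes)  # [2000, 2000, 500]
print(child_counts)  # [5, 5, 2]
```

Each child is cut from inside one parent, so every child belongs to exactly one parent.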

### **Step 3: Storage Architecture Setup**
- **Vector Store**: Chroma database for child chunk embeddings
- **Document Store**: InMemoryStore for parent document storage
- **Relationship Mapping**: Link child chunks to parent documents
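
A minimal sketch of the relationship mapping in Step 3, assuming a plain dictionary in place of `InMemoryStore` and a list of records in place of the Chroma collection. The names `docstore`, `child_index`, and `add_parent` are hypothetical. Each child record carries the ID of its parent, which is what makes the later lookup possible.

```python
import uuid

docstore = {}     # parent_id -> parent text (stands in for the document store)
child_index = []  # (child_text, parent_id) records (stands in for the vector store)


def add_parent(parent_text: str, child_size: int = 400) -> str:
    """Store one parent and index its children with a back-reference to it."""
    parent_id = str(uuid.uuid4())
    docstore[parent_id] = parent_text
    for start in range(0, len(parent_text), child_size):
        child = parent_text[start:start + child_size]
        child_index.append((child, parent_id))
    return parent_id


pid = add_parent("a" * 1000)
print(len(docstore), len(child_index))  # 1 3
```

In a real setup only the child text is embedded; the parent text is never searched directly and is reached only through the stored ID.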

### **Step 4: Parent Document Retriever Configuration**
- Create `ParentDocumentRetriever` with dual storage
- Configure parent and child splitters
- Set up automatic parent-child relationship management

### **Step 5: Retrieval Process**
- **Search Phase**: Query matches against small child chunks
- **Retrieval Phase**: Return corresponding large parent documents
- **Context Assembly**: Provide rich context for response generation
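
The search and retrieval phases in Step 5 can be illustrated with a toy example. The sketch below assumes a word-overlap score in place of embedding similarity, and hand-written sample data; `overlap_score` and `retrieve_parents` are hypothetical helpers, not LangChain functions.

```python
def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of distinct query words that appear in text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, child_index, docstore, k=2):
    """Search phase over children, then retrieval phase over parents."""
    ranked = sorted(
        child_index,
        key=lambda record: overlap_score(query, record[0]),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:  # several children may share one parent
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]


docstore = {
    "A": "Leaving Las Vegas is a 1995 drama film. Nicolas Cage stars in the film.",
    "B": "Chroma is a vector database. It stores embeddings.",
}
child_index = [
    ("Leaving Las Vegas is a 1995 drama film.", "A"),
    ("Nicolas Cage stars in the film.", "A"),
    ("Chroma is a vector database.", "B"),
    ("It stores embeddings.", "B"),
]

results = retrieve_parents("who stars in leaving las vegas", child_index, docstore)
print(results)
```

Both top-ranked children belong to parent `A`, so the parent is returned once rather than twice.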

### **Step 6: Enhanced RAG Pipeline**
- Integrate parent document retriever into RAG chain
- Generate responses with enriched context from parent documents
- Evaluate improvement in response quality and context richness

## **🚀 Key Advantages of Parent Document Retrieval**

### **Search Precision vs Context Richness**
- **Small Chunks**: Enable precise matching and efficient search
- **Large Parents**: Provide comprehensive context for generation
- **Best of Both**: Combine accuracy with contextual depth

### **Context Optimization**
- **Avoid Truncation**: Full parent documents prevent information loss
- **Rich Context**: More background information for better responses
- **Semantic Continuity**: Preserve document flow and relationships

### **Performance Benefits**
- **Efficient Search**: Small chunks create focused embeddings
- **Comprehensive Retrieval**: Parent docs provide complete context
- **Reduced Noise**: Precise child matching reduces irrelevant content

## **🏗️ Architecture Comparison**

### **Traditional RAG:**
```
Document → Split into chunks → Embed chunks → Search chunks → Return chunks
```

### **Parent Document RAG:**
```
Document → Split into Parents → Split into Children → Embed Children → Search Children → Return Parents
```

## **📊 Dual-Storage System**

| Component | Storage Type | Size | Purpose | Search Target |
|-----------|-------------|------|---------|---------------|
| **Child Chunks** | Vector DB (Chroma) | 400 chars | Search precision | ✅ Searchable |
| **Parent Documents** | In-memory store (`InMemoryStore`) | 2000 chars | Context richness | ❌ Retrieved only |
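
The sizes in the table imply roughly how many child embeddings each parent produces. The sketch below counts fixed-size windows and is only an estimate, since a separator-aware splitter produces chunks of varying length. The function name `children_per_parent` and the 200-character overlap in the second call are assumptions for illustration.

```python
import math


def children_per_parent(parent_size: int, child_size: int, overlap: int = 0) -> int:
    """Count sliding windows of child_size characters over parent_size characters."""
    if overlap >= child_size:
        raise ValueError("overlap must be smaller than child_size")
    if parent_size <= child_size:
        return 1
    stride = child_size - overlap  # how far each new window advances
    return 1 + math.ceil((parent_size - child_size) / stride)


print(children_per_parent(2000, 400))       # 5
print(children_per_parent(2000, 400, 200))  # 9
```

Overlap between children increases the number of vectors stored per parent, which is part of the storage cost of this approach.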

## **🔄 Retrieval Workflow**

```
User Query   → Embed Query   → Search Child Chunks → Identify Matches → Retrieve Parent Docs → Generate Response
     ↓               ↓                  ↓                    ↓                    ↓                    ↓
"What is X?" → Vector(query) → Child[1,5,7]        → Parent IDs       → Parent[A,B]          → Rich Answer
```

## **⚖️ Trade-off Analysis**

| Aspect | Traditional Chunking | Parent Document Retrieval | Winner |
|--------|---------------------|---------------------------|---------|
| **Search Precision** | Good | Excellent | Parent Doc |
| **Context Richness** | Limited | Comprehensive | Parent Doc |
| **Storage Efficiency** | Better | More complex | Traditional |
| **Setup Complexity** | Simple | Advanced | Traditional |
| **Response Quality** | Good | Superior | Parent Doc |

## **🎯 Use Cases & Benefits**

### **Ideal Scenarios:**
- **Long Documents**: Technical manuals, research papers, legal documents
- **Complex Topics**: Multi-paragraph explanations needed
- **Contextual Queries**: Questions requiring background information
- **Domain Expertise**: Specialized knowledge requiring comprehensive context

### **Key Improvements:**
- **Reduced Context Loss**: No important information truncation
- **Better Coherence**: Responses maintain logical flow
- **Enhanced Accuracy**: More complete information for generation
- **Improved User Experience**: More comprehensive and helpful answers

## **💡 Learning Outcomes**
Students will understand:
- Advanced document chunking strategies for optimal RAG performance
- Parent-child relationship management in retrieval systems
- Dual-storage architecture patterns for RAG optimization
- Trade-offs between search precision and context richness
- Production considerations for complex document retrieval
- Memory management in multi-level document systems
- Performance optimization techniques for large document collections

## **Initial Setup**

In [None]:
!pip install -q athina chromadb

In [None]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ['ATHINA_API_KEY'] = userdata.get('ATHINA_API_KEY')

## **Indexing**

In [None]:
# load embedding model
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
# load data
from langchain.document_loaders import CSVLoader
loader = CSVLoader("./context.csv")
documents = loader.load()

### **Parent Child Text Splitting**

In [None]:
# configure the two splitters and the parent store
from langchain.text_splitter import RecursiveCharacterTextSplitter

# create the parent documents - The big chunks
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)

# create the child documents - The small chunks
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# The storage layer for the parent chunks
from langchain.storage import InMemoryStore
store = InMemoryStore()

In [None]:
from langchain.vectorstores import Chroma
vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings)

## **Retriever**

In [None]:
# create retriever
from langchain.retrievers import ParentDocumentRetriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [None]:
# index the documents: parent chunks go to the docstore, child chunks to the vector store
retriever.add_documents(documents)

## **RAG Chain**

In [None]:
# create llm
from langchain_openai import ChatOpenAI
llm = ChatOpenAI()

In [None]:
# create document chain
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

template = """
You are a helpful assistant that answers questions based on the following context.
Context: {context}

Question: {input}

Answer:

"""
prompt = ChatPromptTemplate.from_template(template)

# Setup RAG pipeline
rag_chain = (
    {"context": retriever,  "input": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
# response
response = rag_chain.invoke("who played the lead roles in the movie leaving las vegas")
response

'Nicolas Cage played the role of Ben Sanderson, the alcoholic screenwriter, and Elisabeth Shue played the role of Sera, the sex worker, in the movie "Leaving Las Vegas."'
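
The chain above passes the retriever's output straight into `{context}`, so the prompt receives the string form of a list of document objects, metadata included. A common alternative is to join only the text of each document first. The sketch below shows such a helper under stated assumptions: `Doc` is a hypothetical stand-in that exposes the same `page_content` attribute as a LangChain document, and `format_docs` is an illustrative name.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs) -> str:
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([
    Doc("First parent chunk."),
    Doc("Second parent chunk.", {"row": 1}),
])
print(context)
```

With LangChain installed, a function like this can typically be composed into the chain as `retriever | format_docs` in place of the bare `retriever`.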

## **Preparing Data for Evaluation**

In [None]:
question = ["who played the lead roles in the movie leaving las vegas"]
response = []
contexts = []
ground_truth = ["Nicolas Cage stars as a suicidal alcoholic who has ended his personal and professional life to drink himself to death in Las Vegas ."]
# Inference: run the chain and record the retrieved parent chunks for each query
for query in question:
    response.append(rag_chain.invoke(query))
    contexts.append([doc.page_content for doc in retriever.invoke(query)])

# To dict
data = {
    "query": question,
    "response": response,
    "context": contexts,
    "expected_response": ground_truth
}

In [None]:
# create dataset
from datasets import Dataset
dataset = Dataset.from_dict(data)

In [None]:
# create dataframe
import pandas as pd
df = pd.DataFrame(dataset)

In [None]:
df

| | query | response | context | expected_response |
|---|-------|----------|---------|-------------------|
| 0 | who played the lead roles in the movie leaving las vegas | Nicolas Cage and Elisabeth Shue played the lead roles in the movie Leaving Las Vegas. | ['Leaving Las Vegas is a 1995 American drama film written and directed by Mike Figgis and based on the semi-autobiographical 1990 novel of the same name by John O\'Brien. Nicolas Cage stars as a suicidal alcoholic in Los Angeles who, having lost his family and been recently fired, has decided to move to Las Vegas and drink himself to death. He loads a supply of liquor and beer into his BMW and gets drunk as he drives from Los Angeles to Las Vegas. Once there, he develops a romantic relations... | Nicolas Cage stars as a suicidal alcoholic who has ended his personal and professional life to drink himself to death in Las Vegas . |


In [None]:
# Convert to dictionary
df_dict = df.to_dict(orient='records')

# Convert context to list
for record in df_dict:
    if not isinstance(record.get('context'), list):
        if record.get('context') is None:
            record['context'] = []
        else:
            record['context'] = [record['context']]

## **Evaluation in Athina AI**

We will use the **Context Recall** eval here. It measures the extent to which the retrieved context supports the expected response. Please refer to our [documentation](https://docs.athina.ai/api-reference/evals/preset-evals/overview) for further details.
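
As the evaluator's own `grade_reason` text in the output below describes it, context recall is the number of ground-truth sentences that can be attributed to the retrieved context, divided by the total number of ground-truth sentences. The sketch below shows only that final ratio. It assumes the per-sentence attribution flags are already available (in the real eval an LLM judges them), and `context_recall` is an illustrative function name, not part of the Athina or Ragas API.

```python
def context_recall(attributed: list[bool]) -> float:
    """Fraction of ground-truth sentences supported by the retrieved context."""
    if not attributed:
        raise ValueError("ground truth has no sentences")
    return sum(attributed) / len(attributed)


# One ground-truth sentence, supported by the context (as in this notebook's example)
print(context_recall([True]))  # 1.0

# Hypothetical case: three of four ground-truth sentences are supported
print(context_recall([True, False, True, True]))  # 0.75
```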

In [None]:
# set api keys for Athina evals
from athina.keys import AthinaApiKey, OpenAiApiKey
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))

In [None]:
# load dataset
from athina.loaders import Loader
dataset = Loader().load_dict(df_dict)

In [None]:
# evaluate
from athina.evals import RagasContextRecall
RagasContextRecall(model="gpt-4o").run_batch(data=dataset).to_df()

evaluating with [context_recall]


100%|██████████| 1/1 [00:01<00:00,  1.49s/it]


You can view your dataset at: https://app.athina.ai/develop/3e8a5c23-5ddc-4dd3-ae9a-0790587da1f5


| | query | context | response | expected_response | display_name | failed | grade_reason | runtime | model | ragas_context_recall |
|---|-------|---------|----------|-------------------|--------------|--------|--------------|---------|-------|----------------------|
| 0 | who played the lead roles in the movie leaving las vegas | ['Leaving Las Vegas is a 1995 American drama film written and directed by Mike Figgis and based on the semi-autobiographical 1990 novel of the same name by John O\'Brien. Nicolas Cage stars as a suicidal alcoholic in Los Angeles who, having lost his family and been recently fired, has decided to move to Las Vegas and drink himself to death. He loads a supply of liquor and beer into his BMW and gets drunk as he drives from Los Angeles to Las Vegas. Once there, he develops a romantic relations... | Nicolas Cage and Elisabeth Shue played the lead roles in the movie Leaving Las Vegas. | Nicolas Cage stars as a suicidal alcoholic who has ended his personal and professional life to drink himself to death in Las Vegas . | Ragas Context Recall | | Context Recall metric is calculated by dividing the number of sentences in the ground truth that can be attributed to retrieved context by the total number of sentences in the grouund truth | 2316 | gpt-4o | 1.0 |
