# Parent Document Retriever

- Author: [Yun Eun](https://github.com/yuneun92)
- Design:
- Peer Review:
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

This tutorial covers `ParentDocumentRetriever`, a retriever that indexes small chunks for precise search while returning the larger parent documents those chunks came from.

When splitting documents for search, two competing needs arise:

1. **Small Chunks:** Embeddings represent meaning most accurately when each chunk is short and focused.
2. **Context Preservation:** Retrieved text must be long enough to keep the surrounding context coherent.

> How It Works

`ParentDocumentRetriever` manages this balance by:

1. Splitting documents into small searchable chunks
2. Maintaining connections to parent documents via IDs
3. Returning the parent document of each matching chunk at query time
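
The idea can be sketched in plain Python without any LangChain components. This is a toy illustration only: the corpus, the sentence-based `split_into_chunks` helper, and the substring match (standing in for embedding similarity) are all assumptions made for the example.

```python
import uuid

def split_into_chunks(text: str) -> list[str]:
    """Naive sentence splitter (a stand-in for a real text splitter)."""
    return text.split(". ")

# Hypothetical toy corpus
documents = [
    "Embedding converts text into vectors. Word2Vec maps words to vector spaces.",
    "CSV stores tabular data. JSON is a lightweight data interchange format.",
]

docstore: dict[str, str] = {}             # parent ID -> full parent text
child_index: list[tuple[str, str]] = []   # (child chunk, parent ID)

for doc in documents:
    parent_id = str(uuid.uuid4())
    docstore[parent_id] = doc
    # Every child chunk remembers which parent it came from.
    for chunk in split_into_chunks(doc):
        child_index.append((chunk, parent_id))

def retrieve_parent(query: str) -> str:
    """Match against the small chunks, then return the linked parent."""
    for chunk, parent_id in child_index:
        if query.lower() in chunk.lower():
            return docstore[parent_id]
    raise LookupError(f"No chunk matches {query!r}")

parent = retrieve_parent("Word2Vec")
print(parent)
```

The search touches only the short chunks, yet the caller receives the whole parent text, which is the trade-off the retriever is built around.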

> Benefits

1. **Efficient Search:** Quick identification of relevant content
2. **Context Awareness:** Access to broader document context when needed
3. **Flexible Structure:** Works with both complete documents and larger chunks as parent documents

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Full Document Retrieval](#full-document-retrieval)
- [Adjusting Larger Chunk Sizes](#adjusting-larger-chunk-sizes)
---

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial langchain_chroma

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_community",
        "langchain_openai",
        "chromadb",
    ],
    verbose=False,
    upgrade=False,
)

from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever

In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Parent-Document-Retriever",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

First, let's load the documents that we'll use as data.

In [5]:
loaders = [
    # Load a file (add more loaders to this list for multiple files)
    TextLoader("./data/appendix-keywords.txt"),
]
# On Windows, pass an explicit encoding to the loader in the list above instead:
# TextLoader("./data/appendix-keywords.txt", encoding="utf-8")

docs = []
for loader in loaders:
    # Load the document using the loader and add it to the docs list.
    docs.extend(loader.load())


In [6]:
docs

[Document(metadata={'source': './data/appendix-keywords.txt'}, page_content='Semantic Search\n\nDefinition: Semantic search refers to a search method that understands the meaning behind user queries, going beyond simple keyword matching to return relevant results.  \nExample: When a user searches for "solar system planets," the search returns information about related planets like Jupiter and Mars.  \nRelated Keywords: Natural Language Processing, Search Algorithms, Data Mining  \n\n\nEmbedding\n\nDefinition: Embedding is the process of converting textual data, such as words or sentences, into low-dimensional continuous vectors. This enables computers to understand and process text.  \nExample: The word "apple" can be represented as a vector like [0.65, -0.23, 0.17].  \nRelated Keywords: Natural Language Processing, Vectorization, Deep Learning  \n\n\nToken\n\nDefinition: A token refers to smaller units of text obtained by breaking it into parts, such as words, sentences, or phrases.  

## Full Document Retrieval

In this mode, the retriever returns complete documents, so we only specify the `child_splitter`. Each original document serves as its own parent.

Later, we'll also specify the `parent_splitter` to compare the results.

In [7]:
# Define Child Splitter with chunk size
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Create a Chroma DB collection -- in memory version
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()

# Create Retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

Documents are added using the `retriever.add_documents(docs, ids=None)` method:
* If `ids` is `None`, they will be automatically generated.
* Set `add_to_docstore=False` when the parent documents are already in the docstore and should not be written again. In that case `ids` must be provided so the child chunks can be linked to the existing parents.
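
The ID rules above can be sketched as a small standalone function. `resolve_parent_ids` is a hypothetical helper written for this illustration, not part of LangChain; it only mirrors the behavior described in the list.

```python
import uuid

def resolve_parent_ids(documents, ids=None, add_to_docstore=True):
    """Hypothetical helper that mirrors the ID rules described above."""
    if ids is None:
        # Without caller-supplied IDs, the parents must be stored,
        # otherwise the generated IDs would point at nothing.
        if not add_to_docstore:
            raise ValueError("ids are required when add_to_docstore=False")
        return [str(uuid.uuid4()) for _ in documents]
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    return list(ids)

# IDs are generated automatically when none are given.
auto_ids = resolve_parent_ids(["doc a", "doc b"])

# Existing IDs are reused when the parents are already stored.
fixed_ids = resolve_parent_ids(["doc a"], ids=["doc-a"], add_to_docstore=False)

# Skipping the docstore without IDs is rejected.
try:
    resolve_parent_ids(["doc a"], add_to_docstore=False)
    rejected = False
except ValueError:
    rejected = True

print(auto_ids, fixed_ids, rejected)
```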

In [8]:
# Add documents to the retriever. 'docs' is a list of documents, and 'ids' is a list of unique document identifiers.
retriever.add_documents(docs, ids=None, add_to_docstore=True)

The store should now contain one key because we added one document.

- Convert the keys returned by the `store` object's `yield_keys()` method into a list.

In [9]:
# Return all keys from the store as a list.
list(store.yield_keys())

['739c2480-1aac-4090-ae21-ceebc416d099']

Let's try calling the vector store search function.

Since we are storing small chunks, we should see small chunks returned in the search results.

Perform similarity search using the `similarity_search` method of the vectorstore object.

In [10]:
# Perform similarity search
sub_docs = vectorstore.similarity_search("Word2Vec")

# Print the page_content property of the first element in the sub_docs list.
print(sub_docs[0].page_content)

Word2Vec


Now let's query the retriever itself. Because it **returns the parent documents** that contain the matching small chunks, the results are much larger.

Use the `invoke()` method of the `retriever` object to retrieve documents related to the query.

In [11]:
# Retrieve and fetch documents
retrieved_docs = retriever.invoke("Word2Vec")
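
Conceptually, `invoke()` runs the child-chunk search and then maps the hits back to their parents. The sketch below is a simplified illustration of that mapping step, not the library's code; the hit list, the `parent-1` and `parent-2` IDs, and the dictionary docstore are hypothetical.

```python
# Hypothetical search hits: (child chunk text, parent ID from its metadata)
child_hits = [
    ("Word2Vec", "parent-1"),
    ("Definition: Word2Vec is a natural language processing technique", "parent-1"),
    ("Related Keywords: Natural Language Processing, Embedding", "parent-2"),
]

# Hypothetical docstore mapping parent IDs to parent texts
docstore = {
    "parent-1": "Full text of the first parent document",
    "parent-2": "Full text of the second parent document",
}

# Collect each parent ID once, keeping the order of the best-ranked hit.
parent_ids = []
for _, parent_id in child_hits:
    if parent_id not in parent_ids:
        parent_ids.append(parent_id)

# Look the parents up; IDs missing from the docstore are skipped.
parents = [docstore[pid] for pid in parent_ids if pid in docstore]
print(parents)
```

Several child chunks can point at the same parent, so the result list is usually shorter than the list of chunk hits.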

In [12]:
# Print the length of the page content of the retrieved document
print(
    f"Document length: {len(retrieved_docs[0].page_content)}",
    end="\n\n=====================\n\n",
)

# Print a portion of the document
print(retrieved_docs[0].page_content[2000:2500])

Document length: 10044


 old.  
Related Keywords: Database, Query, Data Management  


CSV

Definition: CSV (Comma-Separated Values) is a file format used to store data, where each value is separated by a comma. It is often used for saving and exchanging tabular data.  
Example: A CSV file with headers "Name, Age, Job" could contain data like "John, 30, Developer."  
Related Keywords: Data Format, File Handling, Data Exchange  


JSON

Definition: JSON (JavaScript Object Notation) is a lightweight data interchange form


## Adjusting Larger Chunk Sizes

As the previous results show, **the entire document may be too large to return as is** (about 10,000 characters here).

In this case, what we actually want to do is first split the raw document into larger chunks, and then split those into smaller chunks.

We then index the small chunks but return the larger chunks during retrieval (still not the entire document).

- Use `RecursiveCharacterTextSplitter` to create parent and child documents.

    - Parent documents have `chunk_size` set to 1000.
    - Child documents have `chunk_size` set to 200, creating smaller sizes than the parent documents.
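
To make the two-level split concrete, here is a minimal sketch that uses a naive fixed-width splitter in place of `RecursiveCharacterTextSplitter` (which splits on separators, so its chunk boundaries will differ). The `split_text` helper and the 2,500-character placeholder document are assumptions for illustration.

```python
def split_text(text: str, chunk_size: int) -> list[str]:
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

raw_document = "x" * 2500  # hypothetical 2,500-character document

# Level 1: parent chunks of up to 1000 characters
parents = split_text(raw_document, 1000)

# Level 2: each parent is split again into children of up to 200 characters
children_by_parent = {
    idx: split_text(parent, 200) for idx, parent in enumerate(parents)
}

num_children = sum(len(chunks) for chunks in children_by_parent.values())
print(f"{len(parents)} parents, {num_children} children")
```

Only the children would be embedded and indexed; each one keeps the index of its parent so the larger chunk can be returned at query time.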




In [13]:
# Text splitter used to generate parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)

# Text splitter used to generate child documents
# Should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Vector store to be used for indexing child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# Storage layer for parent documents
store = InMemoryStore()

`ParentDocumentRetriever` is initialized with the following parameters:
* The `vectorstore` parameter specifies the vector store that indexes the child chunk embeddings.
* The `docstore` parameter specifies the document store that holds the parent documents.
* The `child_splitter` parameter specifies the splitter used to create child documents.
* The `parent_splitter` parameter specifies the splitter used to create parent documents.

`ParentDocumentRetriever` handles this two-level structure by splitting and storing parent and child documents separately. Child documents are used for matching, and the parent documents they belong to are returned.

In [14]:
retriever = ParentDocumentRetriever(
    # Specify the vector store
    vectorstore=vectorstore,
    # Specify the document store
    docstore=store,
    # Specify the child document splitter
    child_splitter=child_splitter,
    # Specify the parent document splitter
    parent_splitter=parent_splitter,
)

Add docs to the `retriever` object. This adds new documents to the set of documents that `retriever` can search through.

In [15]:
# Add documents to the retriever
retriever.add_documents(docs)

The store now holds many more keys than before. Each key identifies one of the larger parent chunks.

In [16]:
# Generate keys from the store, convert to list, and return the length
len(list(store.yield_keys()))

12

In [17]:
# Perform similarity search
sub_docs = vectorstore.similarity_search("Word2Vec")
# Print the page_content property of the first element in the sub_docs list
print(sub_docs[0].page_content)

Word2Vec


Now let's use the `invoke()` method of the `retriever` object to search for documents.

In [18]:
# Retrieve and fetch documents
retrieved_docs = retriever.invoke("Word2Vec")

# Print the page content of the first retrieved document
print(retrieved_docs[0].page_content)

Crawling

Definition: Crawling is the automated process of visiting web pages to collect data. It is commonly used for search engine optimization and data analysis.  
Example: Google’s search engine crawls websites to collect and index content.  
Related Keywords: Data Collection, Web Scraping, Search Engine  


Word2Vec

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces, representing semantic relationships between words.  
Example: In a Word2Vec model, "king" and "queen" are represented as vectors close to each other in the vector space.  
Related Keywords: Natural Language Processing, Embedding, Semantic Similarity  


LLM (Large Language Model)
