<a href="https://colab.research.google.com/github/sugarforever/LangChain-Tutorials/blob/main/LangChain_ParentDocumentRetriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to the LangChain Parent Document Retriever

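When splitting documents for retrieval, choosing a chunk size involves two conflicting goals:

- You may want small chunks, so that their embeddings reflect their meaning more accurately. If a chunk is too long, its embedding can lose meaning.
- You may also want chunks long enough to preserve fuller context.

`ParentDocumentRetriever` strikes this balance by splitting and storing documents at two levels. During retrieval, it first fetches the small chunks, then looks up the parent documents those chunks came from, and returns these larger pieces.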
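The retrieval flow described above can be sketched in plain Python. This is a toy illustration, not LangChain's implementation: the helper names (`split_fixed`, `build_index`, `retrieve_parents`) are hypothetical, and simple word overlap stands in for embedding similarity.

```python
# Toy sketch of "search small chunks, return their parents".
# Assumption: word overlap replaces the embedding similarity a real vector store uses.
import re


def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def tokens(text):
    """Lower-cased set of alphabetic words in `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))


def build_index(parents, child_size):
    """Return (child_text, parent_id) pairs, one per child chunk."""
    index = []
    for parent_id, parent in enumerate(parents):
        for child in split_fixed(parent, child_size):
            index.append((child, parent_id))
    return index


def retrieve_parents(query, parents, index, k=4):
    """Rank children by word overlap, then return their parents without duplicates."""
    q = tokens(query)
    ranked = sorted(index, key=lambda item: len(q & tokens(item[0])), reverse=True)
    seen, result = set(), []
    for _, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


parents = [
    "CarPlay navigation apps use map templates. Navigation apps show route guidance.",
    "Audio apps play music and podcasts through the car speakers.",
]
index = build_index(parents, child_size=40)
hits = retrieve_parents("navigation map", parents, index, k=2)
print(len(hits))
```

Both top-ranked children come from the first parent, so only one parent is returned. This is why a parent-level query can return fewer documents than the child-level search found.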

In [1]:
import os

os.environ['OPENAI_API_KEY'] = 'your valid OpenAI API key'

In [2]:
!pip install -q -U langchain openai chromadb


## Prepare a PDF document

In [3]:
!wget https://developer.apple.com/carplay/documentation/CarPlay-App-Programming-Guide.pdf

--2023-08-15 23:02:42--  https://developer.apple.com/carplay/documentation/CarPlay-App-Programming-Guide.pdf
Resolving developer.apple.com (developer.apple.com)... 17.253.21.203, 17.253.21.201, 2620:149:a10:f000::5, ...
Connecting to developer.apple.com (developer.apple.com)|17.253.21.203|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7694976 (7.3M) [application/pdf]
Saving to: ‘CarPlay-App-Programming-Guide.pdf’


2023-08-15 23:02:42 (83.6 MB/s) - ‘CarPlay-App-Programming-Guide.pdf’ saved [7694976/7694976]



In [4]:
!pip install -q pymupdf


In [6]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import PyMuPDFLoader

In [7]:
loader = PyMuPDFLoader('./CarPlay-App-Programming-Guide.pdf')
docs = loader.load()

In [8]:
len(docs)

54

## Parameters supported by ParentDocumentRetriever

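Among the supported parameters, `parent_splitter` and `child_splitter` deserve the most attention. They specify the splitter for parent documents and the splitter for child documents, respectively.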

### Without parent_splitter

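In this case, documents are not split at two levels. Each original document is itself a parent document.

Parent documents are stored in an `InMemoryStore`, and the embeddings of the child documents are stored in a vector store. This example uses Chroma (the `chromadb` package).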
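A minimal sketch of this storage layout, using plain Python containers in place of the real stores. Everything here is illustrative: the dict stands in for `InMemoryStore`, the list stands in for the vector store, the local `add_documents` function is hypothetical, and the `doc_id` metadata key is an assumption about how children point back to their parent.

```python
# Illustrative only: a dict and a list stand in for the docstore and the vector store.
import uuid

docstore = {}        # parent_id -> full parent text
vector_entries = []  # one entry per child chunk, carrying its parent's id


def add_documents(pages, child_size=400, id_key="doc_id"):
    """Store each page as a parent and index its fixed-size child chunks."""
    for page in pages:
        parent_id = str(uuid.uuid4())  # generated ID, as when no ids are supplied
        docstore[parent_id] = page
        for start in range(0, len(page), child_size):
            vector_entries.append(
                {"text": page[start:start + child_size], "metadata": {id_key: parent_id}}
            )


add_documents(["a" * 1000, "b" * 300])
print(len(docstore), len(vector_entries))
```

Only the child chunks would be embedded. The parent text is stored once and looked up by ID at query time.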

In [9]:
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

In [10]:
!pip install -q tiktoken


**Note: even when we let the `retriever` generate document IDs itself, we still have to pass `ids=None` explicitly.**

In [11]:
retriever.add_documents(docs, ids=None)

In [12]:
list(store.yield_keys())

['bde53b56-5891-4810-af11-6a7042726ab3',
 '4946fe18-4418-423b-9ead-58bc0f43f757',
 '15665906-0cd0-4f2f-b977-a5f5c107fc93',
 '402e060f-4e13-45c3-95c2-d4e28fff91a0',
 'c58b06dd-d710-4ce3-9eda-20b1b3d7cacb',
 '1eac9574-6dc9-4fb0-828c-36c02a3c762a',
 'e17e0a47-b75c-46aa-9e6d-9752b43f4295',
 '5ebcb89b-8a2a-441b-acba-e96c82f2f42c',
 '7cdd12f3-2928-4ba9-a4df-5b18b6105a63',
 'a64caaaf-2e26-4ef3-8eda-18ccf4f0eec8',
 '0d0cc9f6-2a00-4141-adb3-dd75049f8eca',
 'b27d1666-793c-46c1-b9df-e06f1a8540f6',
 '84a84e09-b043-490a-a07a-6d2adc1b07eb',
 'ceff23ee-1135-4132-85aa-2864bf59c1ed',
 'a2b9ef34-03a1-491a-82ed-2a181c008a49',
 'abb99674-280c-4bae-a4a8-fac50d329b1f',
 '75193b94-8d2e-4529-94e5-a00227b74316',
 '765ed16f-d29b-4b6e-bf3b-b9a4aaccd235',
 'de9d138c-f14d-47d2-8dda-b2db6f9c110d',
 'a72b0f9c-1afb-45cd-912f-a4ff57840f14',
 'd937a765-881c-49df-a0e1-d1689bcedcc8',
 '504d4182-f2c3-4377-9291-669c4551b668',
 '79c67194-d1a2-4a1a-8bbf-f26a8cb87aa0',
 '539f3a4d-0a45-4b5a-8f58-034484dfcfdb',
 'a89e12d9-956d-

In [13]:
sub_docs = vectorstore.similarity_search("How to build a CarPlay navigation app?")

In [14]:
len(sub_docs)

4

In [15]:
for sub_doc in sub_docs:
  print(len(sub_doc.page_content))

325
391
327
351


In [16]:
retrieved_docs = retriever.get_relevant_documents("How to build a CarPlay navigation app?")

In [17]:
len(retrieved_docs)

3

In [18]:
for retrieved_doc in retrieved_docs:
  print(len(retrieved_doc.page_content))

1562
4179
2195


### With parent_splitter

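In this case, documents are split at two levels. `parent_splitter` first splits the original documents into larger chunks, and `child_splitter` then splits those into smaller chunks.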
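The two-level split can be illustrated with a simple fixed-size splitter. This is a sketch under stated assumptions: `split_fixed` is a hypothetical helper that cuts by character count only, whereas `RecursiveCharacterTextSplitter` prefers to break on separators, so real chunk sizes and counts will differ.

```python
# Sketch of a parent/child split with a naive fixed-size splitter.
# Assumption: no separators or overlap, unlike RecursiveCharacterTextSplitter.
def split_fixed(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


page = "x" * 1200  # stand-in for one page of text

# Level 1: the page becomes parent chunks of up to 500 characters.
parent_chunks = split_fixed(page, 500)

# Level 2: each parent chunk becomes child chunks of up to 200 characters.
child_chunks = {i: split_fixed(p, 200) for i, p in enumerate(parent_chunks)}

parent_sizes = [len(p) for p in parent_chunks]
child_counts = [len(child_chunks[i]) for i in range(len(parent_chunks))]
print(parent_sizes, child_counts)
```

With a parent splitter in place, the docstore holds parent chunks rather than whole pages, so it contains more entries than there are pages, and each returned document is bounded by the parent chunk size.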

In [19]:
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
vectorstore = Chroma(collection_name="carplay_collection", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()

In [20]:
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [21]:
retriever.add_documents(documents=docs, ids=None)

In [22]:
len(list(store.yield_keys()))

238

In [23]:
sub_docs = vectorstore.similarity_search("How to build a CarPlay navigation app?")

len(sub_docs)

4

In [24]:
for sub_doc in sub_docs:
  print(len(sub_doc.page_content))

152
161
126
197


In [25]:
retrieved_docs = retriever.get_relevant_documents("How to build a CarPlay navigation app?")

In [26]:
len(retrieved_docs)

3

In [27]:
for retrieved_doc in retrieved_docs:
  print(len(retrieved_doc.page_content))

494
408
455


In [28]:
print(retrieved_docs[2].page_content)

Build a CarPlay navigation app 
The following section describes how to create a CarPlay navigation app. 
CarPlay navigation apps have additional UI elements and capabilities that are different from 
other CarPlay app types. Skip this section if you are not creating a navigation app. 
Additional templates for navigation apps 
CarPlay navigation apps use additional templates to display map information, a keyboard, and 
voice control feedback. 
Base View
