# Docugami
This notebook covers how to load documents from `Docugami`. It describes the advantages of using this system over alternative data loaders.

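## Prerequisites
1. Install the necessary Python packages.
2. Grab an access token for your workspace, and make sure it is set as the `DOCUGAMI_API_KEY` environment variable.
3. Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api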

In [1]:
# You need the dgml-utils package to use the DocugamiLoader (run pip install directly without "poetry run" if you are not using poetry)
!poetry run pip install docugami-langchain dgml-utils==0.3.0 --upgrade --quiet

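## Quick start

1. Create a [Docugami workspace](http://www.docugami.com) (free trials available).
2. Add your documents (PDF, DOCX or DOC) and allow Docugami to ingest and cluster them into sets of similar documents, e.g. NDAs, lease agreements, and service agreements. There is no fixed set of document types supported by the system; the clusters created depend on your particular documents, and you can [change the docset assignments](https://help.docugami.com/home/working-with-the-doc-sets-view) later.
3. Create an access token via the Developer Playground for your workspace. [Detailed instructions](https://help.docugami.com/home/docugami-api)
4. Explore the [Docugami API](https://api-docs.docugami.com) to get a list of your processed docset IDs, or just the document IDs for a particular docset.
5. Use the DocugamiLoader as detailed below to get rich semantic chunks for your documents.
6. Optionally, build and publish one or more [reports or abstracts](https://help.docugami.com/home/reports). This helps Docugami improve the semantic XML and tags based on your preferences, and those tags are then added to the DocugamiLoader output as metadata. Use techniques like the [self-querying retriever](/docs/how_to/self_query) to do high-accuracy document QA.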

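## Advantages vs other chunking techniques

Appropriate chunking of your documents is critical for retrieval from documents. Many chunking techniques exist, including simple ones that rely on whitespace and recursive chunk splitting based on character length. Docugami offers a different approach:

1. **Intelligent chunking:** Docugami breaks down every document into a hierarchical semantic XML tree of chunks of varying sizes, from single words or numerical values to entire sections. These chunks follow the semantic contours of the document, providing a more meaningful representation than arbitrary-length or simple whitespace-based chunking.
2. **Semantic annotations:** Chunks are annotated with semantic tags that are coherent across the document set, enabling consistent hierarchical queries across multiple documents even if they are written and formatted differently. For example, in a set of lease agreements, you can easily identify key provisions like the Landlord, Tenant, or Renewal Date, as well as more complex information such as the wording of any sub-lease provision or whether a specific jurisdiction has an exception section within a Termination Clause.
3. **Structured representation:** In addition, the XML tree indicates the structural contours of every document, using attributes denoting headings, paragraphs, lists, tables, and other common elements, and does so consistently across all supported document formats, such as scanned PDFs or DOCX files. It appropriately handles long-form document characteristics like page headers/footers or multi-column flows for clean text extraction.
4. **Additional metadata:** Chunks are also annotated with additional metadata if a user has been using Docugami. This additional metadata can be used for high-accuracy document QA without context window restrictions. See the detailed code walk-through below.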
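To make the contrast in the list above concrete, here is a minimal sketch comparing fixed-length chunking with chunking that follows an XML tree. The tag names and the tiny sample document are illustrative assumptions, not actual Docugami output, and the leaf-element walk is a simplification rather than Docugami's algorithm.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical, hand-written XML; real Docugami XML is richer and uses namespaces.
SAMPLE_XML = """
<Lease>
  <Parties>
    <Landlord>Menlo Group</Landlord>
    <Tenant>Shorebucks LLC</Tenant>
  </Parties>
  <Term>96 full calendar months</Term>
</Lease>
"""


def fixed_length_chunks(text: str, size: int) -> List[str]:
    """Naive chunking: cut every `size` characters, ignoring meaning."""
    return [text[i : i + size] for i in range(0, len(text), size)]


def semantic_chunks(xml_text: str) -> List[Tuple[str, str]]:
    """Return (tag, text) for every leaf element, following the XML structure."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.tag, el.text.strip())
        for el in root.iter()
        if len(el) == 0 and el.text and el.text.strip()
    ]


# Flatten the XML to plain text, as a loader without structure awareness would see it.
plain_text = " ".join("".join(ET.fromstring(SAMPLE_XML.strip()).itertext()).split())

print(fixed_length_chunks(plain_text, 16))  # cuts through the middle of names
print(semantic_chunks(SAMPLE_XML))  # one labeled chunk per semantic unit
```

The fixed-length version splits "Shorebucks LLC" across two chunks, while the tree walk keeps each value whole and labeled.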

In [2]:
import os

from docugami_langchain.document_loaders import DocugamiLoader

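## Load documents

If the `DOCUGAMI_API_KEY` environment variable is set, there is no need to pass it to the loader explicitly; otherwise you can pass it in as the `access_token` parameter.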

In [3]:
DOCUGAMI_API_KEY = os.environ.get("DOCUGAMI_API_KEY")
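The fallback described above (explicit token first, environment variable second) can be sketched as a small helper. `resolve_access_token` is a hypothetical name introduced only for illustration; it is not part of `docugami_langchain`.

```python
import os
from typing import Mapping, Optional


def resolve_access_token(
    explicit_token: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Prefer an explicitly passed token, else fall back to DOCUGAMI_API_KEY."""
    token = explicit_token or env.get("DOCUGAMI_API_KEY")
    if not token:
        # Fail early with a clear message instead of a later authentication error.
        raise RuntimeError(
            "No Docugami token found: set DOCUGAMI_API_KEY or pass a token explicitly."
        )
    return token
```

The returned value could then be handed to the loader through the `access_token` parameter mentioned above.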

In [4]:
docset_id = "26xpy3aes7xp"
document_ids = ["d7jqdzcj50sj", "cgd1eacfkchw"]

# To load all docs in the given docset ID, just don't provide document_ids
loader = DocugamiLoader(docset_id=docset_id, document_ids=document_ids)
chunks = loader.load()
len(chunks)

120

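The `metadata` for each `Document` (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

1. **id and source:** ID and name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.
2. **xpath:** XPath of the chunk inside the XML representation of the document. Useful for source citations that point directly to the actual chunk inside the document XML.
3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful for filtering out certain kinds of chunks if the caller needs to.
4. **tag:** Semantic tag for the chunk, obtained using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks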
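As a rough illustration of filtering on the `structure` attribute, the sketch below keeps only chunks whose structure includes a given token. The `Chunk` dataclass is a stand-in for a LangChain `Document` so the example runs without any dependencies, and the sample values (including the `table` entry) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dict."""

    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_by_structure(chunks: List[Chunk], wanted: str) -> List[Chunk]:
    """Keep chunks whose space-separated `structure` value contains `wanted`."""
    return [c for c in chunks if wanted in c.metadata.get("structure", "").split()]


sample = [
    Chunk("1.6 Rentable Area of the Premises.", {"structure": "lim h1"}),
    Chunk("OFFICE LEASE", {"structure": "h1 h1 p"}),
    Chunk("Rent schedule", {"structure": "table"}),  # hypothetical table chunk
]

headings = filter_by_structure(sample, "h1")
print([c.page_content for c in headings])
```

The same function should work on real loader output, since it only reads `metadata["structure"]`.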

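You can control chunking behavior by setting the following properties on the `DocugamiLoader` instance:

1. You can set minimum and maximum chunk sizes, which the system tries to adhere to with minimal truncation. Set `loader.min_text_length` and `loader.max_text_length` to control these.
2. By default, only the text for chunks is returned. However, Docugami's XML knowledge graph has additional rich information, including semantic tags for entities inside the chunk. Set `loader.include_xml_tags = True` if you want the additional XML metadata on the returned chunks.
3. In addition, you can set `loader.parent_hierarchy_levels` if you want Docugami to return parent chunks along with the chunks it returns. The child chunks point to the parent chunks via the `loader.parent_id_key` value. This is useful, for example, with the [MultiVector Retriever](/docs/how_to/multi_vector) for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval. See the detailed example later in this notebook.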

In [5]:
loader.min_text_length = 64
loader.include_xml_tags = True
chunks = loader.load()

for chunk in chunks[:5]:
    print(chunk)

page_content='MASTER SERVICES AGREEMENT\n <ThisServicesAgreement> This Services Agreement (the “Agreement”) sets forth terms under which <Company>MagicSoft, Inc. </Company>a <Org><USState>Washington </USState>Corporation </Org>(“Company”) located at <CompanyAddress><CompanyStreetAddress><Company>600 </Company><Company>4th Ave</Company></CompanyStreetAddress>, <Company>Seattle</Company>, <Client>WA </Client><ProvideServices>98104 </ProvideServices></CompanyAddress>shall provide services to <Client>Daltech, Inc.</Client>, a <Company><USState>Washington </USState>Corporation </Company>(the “Client”) located at <ClientAddress><ClientStreetAddress><Client>701 </Client><Client>1st St</Client></ClientStreetAddress>, <Client>Kirkland</Client>, <State>WA </State><Client>98033</Client></ClientAddress>. This Agreement is effective as of <EffectiveDate>February 15, 2021 </EffectiveDate>(“Effective Date”). </ThisServicesAgreement>' metadata={'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-sectio
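When `include_xml_tags` is enabled, the semantic tags appear inline in `page_content`, as in the output above. A minimal sketch of pulling out the innermost tagged values with a regular expression follows; this is an illustrative post-processing step under the assumption that tags are simple word-character names without attributes, and it is not a Docugami API.

```python
import re
from typing import List, Tuple

# Matches an element with no nested tags, e.g. <Client>Daltech, Inc.</Client>
LEAF_TAG = re.compile(r"<(\w+)>([^<>]*)</\1>")


def extract_leaf_tags(page_content: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs for the innermost tagged spans in a chunk."""
    return [(tag, text.strip()) for tag, text in LEAF_TAG.findall(page_content)]


sample = (
    "shall provide services to <Client>Daltech, Inc.</Client>. This Agreement is "
    "effective as of <EffectiveDate>February 15, 2021 </EffectiveDate>"
)
print(extract_leaf_tags(sample))
```

For anything beyond quick inspection, a real XML parser is the safer choice, since a regex cannot handle nesting in general.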

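Here is how to use the Docugami Loader for document QA over multiple documents:

You can use the Docugami Loader like a standard loader for document QA over multiple documents, with the benefit that its chunks follow the natural contours of the document. There are many good tutorials on how to do this, e.g. [this one](https://www.youtube.com/watch?v=3yPBVii7Ct0). We can use the same code, but with `DocugamiLoader` for better chunking, instead of loading text or PDF files directly with basic splitting techniques.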

In [6]:
!poetry run pip install --upgrade langchain-openai tiktoken langchain-chroma hnswlib

In [7]:
# For this example, we already have a processed docset for a set of lease documents
loader = DocugamiLoader(docset_id="zo954yqy53wp")
chunks = loader.load()

# strip semantic metadata intentionally, to test how things work without semantic metadata
for chunk in chunks:
    stripped_metadata = chunk.metadata.copy()
    for key in chunk.metadata:
        if key not in ["name", "xpath", "id", "structure"]:
            # remove semantic metadata
            del stripped_metadata[key]
    chunk.metadata = stripped_metadata

print(len(chunks))

4674


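The documents returned by the loader are already split, so we don't need a text splitter. Optionally, we can use the metadata on each document, for example the structure or tag attributes, to do any post-processing we want.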
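As one example of such post-processing, the sketch below counts how many chunks came from each source file by reading the `name` metadata key. The helper name and the sample metadata dicts are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def chunks_per_document(metadatas: Iterable[Dict[str, str]]) -> Counter:
    """Count chunks per source file, using the `name` metadata key."""
    return Counter(m.get("name", "<unknown>") for m in metadatas)


# Hand-written metadata dicts in the shape the loader returns.
sample_metadata = [
    {"name": "Sample Commercial Leases/Shorebucks LLC_WA.pdf", "structure": "lim h1"},
    {"name": "Sample Commercial Leases/Shorebucks LLC_AZ.pdf", "structure": "lim h1"},
    {"name": "Sample Commercial Leases/Shorebucks LLC_AZ.pdf", "structure": "p"},
]

counts = chunks_per_document(sample_metadata)
print(counts.most_common())
```

With real loader output, the call would be `chunks_per_document(chunk.metadata for chunk in chunks)`.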

We will use the output of the `DocugamiLoader` as-is to set up a retrieval QA chain the usual way.

In [8]:
from langchain.chains import RetrievalQA
from langchain_chroma import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=chunks, embedding=embedding)
retriever = vectordb.as_retriever()
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=retriever, return_source_documents=True
)

In [9]:
# Try out the retriever with an example query
qa_chain("What can tenants do with signage on their properties?")

{'query': 'What can tenants do with signage on their properties?',
 'result': ' Tenants can place or attach signage (digital or otherwise) to their property after receiving written permission from the landlord, which permission shall not be unreasonably withheld. The signage must conform to all applicable laws, ordinances, etc. governing the same, and tenants must remove all such signs by the termination of the lease.',
 'source_documents': [Document(page_content='6.01 Signage. Tenant may place or attach to the Premises signs (digital or otherwise) or other such identification as needed after receiving written permission from the Landlord, which permission shall not be unreasonably withheld. Any damage caused to the Premises by the Tenant’s erecting or removing such signs shall be repaired promptly by the Tenant at the Tenant’s expense. Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same. Tenant also agrees to have 

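## Using the Docugami knowledge graph for high-accuracy document QA

One issue with large documents is that the correct answer to a question may depend on parts of the document that are far apart. Typical chunking techniques, even with overlap, struggle to give the LLM enough context to answer such questions. With upcoming very large context LLMs, it may be possible to stuff a lot of tokens, perhaps even entire documents, into the context, but this will still hit limits for very long documents or large numbers of documents.

For example, if we ask a more complex question that requires the LLM to draw on information from different parts of a document, even OpenAI's powerful LLM is unable to answer correctly.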

In [10]:
chain_response = qa_chain("What is rentable area for the property owned by DHA Group?")
chain_response["result"]  # correct answer should be 13,500 sq ft

" I don't know."

In [11]:
chain_response["source_documents"]

[Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_WA.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:CatalystGroup/dg:chunk[6]/dg:chunk'}),
 Document(page_content='1.6 Rentable Area of the Premises.', metadata={'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'structure': 'lim h1', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:MenloGroup/dg:chunk[6]/dg:chunk'}),


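The chain fails to answer the question: it returns "I don't know" even though the correct answer is **13,500 sq ft**. If you look closely at the source chunks for this response, you will see that the chunking did not put the landlord name and the rentable area in the same context, so retrieval returned chunks from other leases that are irrelevant to DHA Group.

Docugami can help here. If a user has been [using Docugami](https://help.docugami.com/home/reports), chunks are annotated with additional metadata created using different techniques. More technical approaches will be added later.

Specifically, let's ask Docugami to return XML tags on its output, as well as additional metadata: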

In [12]:
loader = DocugamiLoader(docset_id="zo954yqy53wp")
loader.include_xml_tags = (
    True  # for additional semantics from the Docugami knowledge graph
)
chunks = loader.load()
print(chunks[0].metadata)

{'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '47297e277e556f3ce8b570047304560b', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'structure': 'h1 h1 p', 'tag': 'chunk Lease', 'Lease Date': 'March  29th , 2019', 'Landlord': 'Menlo Group', 'Tenant': 'Shorebucks LLC', 'Premises Address': '1564  E Broadway Rd ,  Tempe ,  Arizona  85282', 'Term of Lease': '96  full calendar months', 'Square Feet': '16,159'}


We can use a [self-querying retriever](/docs/how_to/self_query) to improve query accuracy, using this additional metadata:

In [13]:
!poetry run pip install --upgrade lark --quiet

In [14]:
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_chroma import Chroma

EXCLUDE_KEYS = ["id", "xpath", "structure"]
metadata_field_info = [
    AttributeInfo(
        name=key,
        description=f"The {key} for this chunk",
        type="string",
    )
    for key in chunks[0].metadata
    if key.lower() not in EXCLUDE_KEYS
]

document_content_description = "Contents of this chunk"
llm = OpenAI(temperature=0)

vectordb = Chroma.from_documents(documents=chunks, embedding=embedding)
retriever = SelfQueryRetriever.from_llm(
    llm, vectordb, document_content_description, metadata_field_info, verbose=True
)
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    verbose=True,
)

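Let's run the same question again. It returns the correct result because all the chunks carry metadata key/value pairs with key information about the document, even if that information is physically far away from the source chunk used to generate the answer.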

In [15]:
qa_chain(
    "What is rentable area for the property owned by DHA Group?"
)  # correct answer should be 13,500 sq ft



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What is rentable area for the property owned by DHA Group?',
 'result': ' The rentable area of the property owned by DHA Group is 13,500 square feet.',
 'source_documents': [Document(page_content='1.6 Rentable Area of the Premises.', metadata={'Landlord': 'DHA Group', 'Lease Date': 'March  29th , 2019', 'Premises Address': '111  Bauer Dr ,  Oakland ,  New Jersey ,  07436', 'Square Feet': '13,500', 'Tenant': 'Shorebucks LLC', 'Term of Lease': '84  full calendar  months', 'id': '5b39a1ae84d51682328dca1467be211f', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'lim h1', 'tag': 'chunk', 'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/docset:OFFICELEASE/docset:WITNESSETH-section/docset:WITNESSETH/dg:chunk/dg:chunk/docset:BasicLeaseInformation/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS-section/docset:BASICLEASEINFORMATIONANDDEFINEDTERMS/docset:DhaGroup/dg:chunk[6]/dg:chunk'}),
  D

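This time the answer is correct, because the self-querying retriever created a filter on the landlord attribute of the metadata, correctly filtering to documents that are specifically about the DHA Group landlord. The resulting source chunks are all relevant to this landlord, which improves answer accuracy even though the landlord is not directly mentioned in the specific chunk that contains the correct answer.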
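The effect of that metadata filter can be sketched in plain Python. This is a conceptual illustration only, not how `SelfQueryRetriever` is implemented: in the real chain, the LLM translates the question into a structured filter that the vector store applies. The sample values are taken from the outputs shown in this notebook.

```python
from typing import Dict, List

# Two chunks with identical text: vector similarity alone cannot tell them apart.
sample_chunks: List[Dict] = [
    {
        "page_content": "1.6 Rentable Area of the Premises.",
        "metadata": {"Landlord": "Menlo Group", "Square Feet": "16,159"},
    },
    {
        "page_content": "1.6 Rentable Area of the Premises.",
        "metadata": {"Landlord": "DHA Group", "Square Feet": "13,500"},
    },
]


def filter_by_metadata(chunks: List[Dict], key: str, value: str) -> List[Dict]:
    """Keep only chunks whose metadata has an exact match for key == value."""
    return [c for c in chunks if c["metadata"].get(key) == value]


matches = filter_by_metadata(sample_chunks, "Landlord", "DHA Group")
print(matches[0]["metadata"]["Square Feet"])
```

Because the document-level metadata travels with every chunk, the filter narrows the candidates to the right lease before any similarity ranking happens.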

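# Advanced Topic: Small-to-Big Retrieval with Document Knowledge Graph Hierarchy

Documents are inherently semi-structured, and the DocugamiLoader is able to navigate the semantic and structural contours of the document to provide parent chunk references on the chunks it returns. This is useful, for example, with the [MultiVector Retriever](/docs/how_to/multi_vector) for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval.

To get parent chunk references, you can set `loader.parent_hierarchy_levels` to a non-zero value.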

In [16]:
from typing import Dict, List

from docugami_langchain.document_loaders import DocugamiLoader
from langchain_core.documents import Document

loader = DocugamiLoader(docset_id="zo954yqy53wp")
loader.include_xml_tags = (
    True  # for additional semantics from the Docugami knowledge graph
)
loader.parent_hierarchy_levels = 3  # for expanded context
loader.max_text_length = (
    1024 * 8
)  # 8K chars are roughly 2K tokens (ref: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)
loader.include_project_metadata_in_doc_metadata = (
    False  # Not filtering on vector metadata, so remove to lighten the vectors
)
chunks: List[Document] = loader.load()

# build separate maps of parent and child chunks
parents_by_id: Dict[str, Document] = {}
children_by_id: Dict[str, Document] = {}
for chunk in chunks:
    chunk_id = chunk.metadata.get("id")
    parent_chunk_id = chunk.metadata.get(loader.parent_id_key)
    if not parent_chunk_id:
        # parent chunk
        parents_by_id[chunk_id] = chunk
    else:
        # child chunk
        children_by_id[chunk_id] = chunk
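The parent and child maps built above are what small-to-big retrieval relies on: search over the small child chunks, then hand the larger parent chunks to the LLM. A dependency-free sketch of that resolution step follows. `PARENT_ID_KEY` is a hypothetical key name (the loader exposes the real one as `loader.parent_id_key`), and plain dicts and strings stand in for `Document` objects.

```python
from typing import Dict, List

PARENT_ID_KEY = "parent_id"  # hypothetical; use loader.parent_id_key with real chunks


def resolve_parents(
    child_hits: List[Dict[str, str]], parents_by_id: Dict[str, str]
) -> List[str]:
    """Map matched child metadata to their parents, de-duplicated, in hit order."""
    seen = set()
    parents = []
    for hit in child_hits:
        parent_id = hit.get(PARENT_ID_KEY)
        if parent_id in parents_by_id and parent_id not in seen:
            seen.add(parent_id)
            parents.append(parents_by_id[parent_id])
    return parents


# Toy data: two child hits share one parent, a third points elsewhere.
toy_parents = {"p1": "24. SIGNS. (full section text)", "p2": "21. SERVICES AND UTILITIES."}
toy_hits = [{PARENT_ID_KEY: "p1"}, {PARENT_ID_KEY: "p1"}, {PARENT_ID_KEY: "p2"}]
print(resolve_parents(toy_hits, toy_parents))
```

De-duplicating matters because several small hits often fall inside the same parent section.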

In [17]:
# Explore some of the parent chunk relationships
for chunk_id, chunk in list(children_by_id.items())[:5]:
    parent_chunk_id = chunk.metadata.get(loader.parent_id_key)
    if parent_chunk_id:
        # child chunks have the parent chunk id set
        print(f"PARENT CHUNK {parent_chunk_id}: {parents_by_id[parent_chunk_id]}")
        print(f"CHUNK {chunk_id}: {chunk}")

PARENT CHUNK 7df09fbfc65bb8377054808aac2d16fd: page_content='OFFICE LEASE\n THIS OFFICE LEASE\n <Lease>(the "Lease") is made and entered into as of <LeaseDate>March 29th, 2019</LeaseDate>, by and between Landlord and Tenant. "Date of this Lease" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease. </Lease>\nW I T N E S S E T H\n <TheTerms> Subject to and on the terms and conditions of this Lease, Landlord leases to Tenant and Tenant hires from Landlord the Premises. </TheTerms>\n1. BASIC LEASE INFORMATION AND DEFINED TERMS.\nThe key business terms of this Lease and the defined terms used in this Lease are as follows:' metadata={'xpath': '/docset:OFFICELEASE-section/dg:chunk', 'id': '7df09fbfc65bb8377054808aac2d16fd', 'name': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'source': 'Sample Commercial Leases/Shorebucks LLC_NJ.pdf', 'structure': 'h1 h1 p h1 p lim h1 p', 'tag': 'chunk Lease chunk TheTerms'}
CHUNK 47297e277e556f3ce8b570047304560b: p

In [18]:
from langchain.retrievers.multi_vector import MultiVectorRetriever, SearchType
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="big2small", embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryStore()

# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    search_type=SearchType.mmr,  # use max marginal relevance search
    search_kwargs={"k": 2},
)

# Add child chunks to vector store
retriever.vectorstore.add_documents(list(children_by_id.values()))

# Add parent chunks to docstore
retriever.docstore.mset(parents_by_id.items())
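The retriever above is configured with `SearchType.mmr` (max marginal relevance). As a rough illustration of the idea, the sketch below greedily picks results that are relevant to the query but not redundant with results already chosen. It is a simplified, pure-Python version written for this explanation, not LangChain's implementation, and `lambda_mult` is used here as an assumed name for the relevance/diversity trade-off.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return indices of up to k candidates chosen by max marginal relevance."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Highest similarity to anything already picked (0.0 if nothing picked).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]  # index 1 duplicates index 0
print(mmr(query_vec, candidate_vecs, k=2, lambda_mult=0.3))
```

With a low `lambda_mult` the duplicate vector is skipped in favor of the more diverse one; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.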

In [19]:
# Query vector store directly, should return chunks
found_chunks = vectorstore.similarity_search(
    "what signs does Birch Street allow on their property?", k=2
)

for chunk in found_chunks:
    print(chunk.page_content)
    print(chunk.metadata[loader.parent_id_key])

24. SIGNS.
 <SIGNS>No signage shall be placed by Tenant on any portion of the Project. However, Tenant shall be permitted to place a sign bearing its name in a location approved by Landlord near the entrance to the Premises (at Tenant's cost) and will be furnished a single listing of its name in the Building's directory (at Landlord's cost), all in accordance with the criteria adopted <Frequency>from time to time </Frequency>by Landlord for the Project. Any changes or additional listings in the directory shall be furnished (subject to availability of space) for the then Building Standard charge. </SIGNS>
43090337ed2409e0da24ee07e2adbe94
<TheExterior> Tenant agrees that all signs, awnings, protective gates, security devices and other installations visible from the exterior of the Premises shall be subject to Landlord's prior written approval, shall be subject to the prior approval of the <Org>Landmarks </Org><Landmarks>Preservation Commission </Landmarks>of the City of <USState>New <Org

In [20]:
# Query retriever, should return parents (using MMR since that was set as search_type above)
retrieved_parent_docs = retriever.invoke(
    "what signs does Birch Street allow on their property?"
)
for chunk in retrieved_parent_docs:
    print(chunk.page_content)
    print(chunk.metadata["id"])

21. SERVICES AND UTILITIES.
 <SERVICESANDUTILITIES>Landlord shall have no obligation to provide any utilities or services to the Premises other than passenger elevator service to the Premises. Tenant shall be solely responsible for and shall promptly pay all charges for water, electricity, or any other utility used or consumed in the Premises, including all costs associated with separately metering for the Premises. Tenant shall be responsible for repairs and maintenance to exit lighting, emergency lighting, and fire extinguishers for the Premises. Tenant is responsible for interior janitorial, pest control, and waste removal services. Landlord may at any time change the electrical utility provider for the Building. Tenant’s use of electrical, HVAC, or other services furnished by Landlord shall not exceed, either in voltage, rated capacity, use, or overall load, that which Landlord deems to be standard for the Building. In no event shall Landlord be liable for damages resulting from th