In [62]:
%load_ext autoreload
%autoreload 2



# Docugami
This notebook covers how to load documents from `Docugami`. See [README](./README.md) for more details, and the advantages of using this system over alternative data readers.

## Prerequisites
1. Follow the Quick Start section in [README](./README.md)
2. Grab an access token for your workspace, and make sure it is set as the DOCUGAMI_API_KEY environment variable
3. Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api

## Load Documents

If the `DOCUGAMI_API_KEY` environment variable is set, there is no need to pass it to the reader explicitly. Otherwise, you can pass it in as the `access_token` parameter.
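As a minimal sketch of that precedence (explicit argument first, then the environment variable), the helper below shows one way a token could be resolved before constructing the reader. `resolve_access_token` is a hypothetical name used only for illustration and is not part of `DocugamiReader`.

```python
import os
from typing import Optional


def resolve_access_token(access_token: Optional[str] = None) -> str:
    """Return the explicit token if given, else fall back to DOCUGAMI_API_KEY.

    Hypothetical helper for illustration; not part of the DocugamiReader API.
    """
    token = access_token or os.environ.get("DOCUGAMI_API_KEY")
    if not token:
        raise ValueError(
            "No token found: set DOCUGAMI_API_KEY or pass access_token explicitly."
        )
    return token
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later during loading.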

The `DocugamiReader` has a default minimum chunk size of 32 (`min_text_length`). Chunks smaller than that are appended to subsequent chunks. Set `reader.min_text_length = 0` to get all structural chunks regardless of size.
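To make the merging behavior concrete, here is a rough sketch of how undersized chunks can be folded into the chunk that follows them. It illustrates the idea only and is not Docugami's actual implementation; `merge_small_chunks` and the space used as a joiner are assumptions.

```python
from typing import List


def merge_small_chunks(texts: List[str], min_length: int = 32) -> List[str]:
    """Fold chunks shorter than min_length into the chunk that follows them."""
    merged: List[str] = []
    pending = ""  # text from undersized chunks waiting for a following chunk
    for text in texts:
        candidate = f"{pending} {text}".strip() if pending else text
        if len(candidate) < min_length:
            pending = candidate  # still too small, keep accumulating
        else:
            merged.append(candidate)
            pending = ""
    if pending:
        merged.append(pending)  # trailing remainder has nothing to merge into
    return merged
```

With `min_length=0` no chunk is ever considered undersized, so every structural chunk is returned unchanged.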

In [63]:
from base import DocugamiReader

docset_id = "tjwrr2ekqkc3"
docset_name = "SEC 10-Q reports"
document_ids = ["ui7pkriyckwi", "1be3o7ch10iy"]

reader = DocugamiReader()
chunks = reader.load_data(docset_id=docset_id, document_ids=document_ids)

for chunk in chunks[:5]:
    print(chunk)
    print("*"*32)

Doc ID: 030129a9-ff06-47cb-a91d-16b92ebde04f
Text: UNITED STATES SECURITIES AND EXCHANGE COMMISSION
********************************
Doc ID: 1455074b-4e06-4764-b850-783a44412f22
Text: Washington , D.C. 20549  FORM 10-Q
********************************
Doc ID: 06515ffb-1cf3-48af-9354-b677a080df00
Text: ( Mark One )  ☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d)
OF THE SECURITIES EXCHANGE ACT OF 1934
********************************
Doc ID: 8d8643a5-26a0-4de1-a66c-a2b8759f8c7b
Text: For the quarterly period ended June 25, 2022
********************************
Doc ID: 45de6d72-4cac-44cd-b238-80c790dee052
Text: or  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE
SECURITIES EXCHANGE ACT OF 1934
********************************


The `metadata` for each `Document` (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

1. **id and name:** ID and name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.
2. **xpath:** XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.
3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.
4. **tag:** Semantic tag for the chunk, using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks
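As an example of using the `structure` attribute for filtering, the sketch below drops chunks that contain a given structural element. Plain dicts stand in for the returned chunks, and `drop_structure` plus the sample data are made up for illustration; only the `structure` key name comes from the list above.

```python
from typing import Any, Dict, List


def drop_structure(
    chunks: List[Dict[str, Any]], unwanted: str
) -> List[Dict[str, Any]]:
    """Keep chunks whose space-separated `structure` value lacks `unwanted`."""
    kept = []
    for chunk in chunks:
        # `structure` is a space-separated list such as "h1 p table"
        tokens = chunk["metadata"].get("structure", "").split()
        if unwanted not in tokens:
            kept.append(chunk)
    return kept


# Made-up sample chunks; real chunks come from reader.load_data(...)
sample = [
    {"text": "Quarterly revenue by segment", "metadata": {"structure": "h1 table"}},
    {"text": "Management discussion", "metadata": {"structure": "h1 p"}},
]
prose_only = drop_structure(sample, "table")
```

Matching on whole tokens rather than substrings avoids accidentally treating `td` as a match for `table`-less strings that merely contain those letters.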

You can control chunking behavior by setting the following properties on the `DocugamiReader` instance:

1. You can set min and max chunk size, which the system tries to adhere to with minimal truncation. You can set `reader.min_text_length` and `reader.max_text_length` to control these.
2. By default, only the text for chunks is returned. However, Docugami's XML knowledge graph has additional rich information, including semantic tags for entities inside the chunk. Set `reader.include_xml_tags = True` if you want the additional XML metadata on the returned chunks.
3. In addition, you can set `reader.parent_hierarchy_levels` if you want Docugami to return parent chunks in the chunks it returns. The child chunks point to the parent chunks via the `reader.parent_id_key` value. This is useful for [small-to-big](https://www.youtube.com/watch?v=ihSiRrOUwmg) retrieval.
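The small-to-big idea can be sketched without any retrieval library: search over small child chunks, then hand the larger parent chunks to the LLM. In the sketch below, plain dicts stand in for chunks, and `PARENT_ID_KEY = "parent_id"` is a placeholder assumption; in practice the key would be whatever `reader.parent_id_key` is set to.

```python
from typing import Any, Dict, List

PARENT_ID_KEY = "parent_id"  # placeholder; use reader.parent_id_key in practice


def expand_to_parents(
    hits: List[Dict[str, Any]], parents_by_id: Dict[str, Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Replace each retrieved child chunk with its parent, de-duplicated in order."""
    expanded: List[Dict[str, Any]] = []
    seen = set()
    for hit in hits:
        parent_id = hit["metadata"].get(PARENT_ID_KEY)
        # Fall back to the child itself when no parent is known
        target = parents_by_id.get(parent_id, hit)
        if target["id"] not in seen:
            seen.add(target["id"])
            expanded.append(target)
    return expanded
```

De-duplicating matters here because several small hits often share one parent, and sending the same parent text twice only wastes context.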

In [64]:
reader.min_text_length = 1024 * 4  # ~1k tokens
reader.max_text_length = 1024 * 24  # ~6k tokens
reader.include_xml_tags = True
reader.include_project_metadata_in_doc_metadata = False
chunks = reader.load_data(docset_id=docset_id)

for chunk in chunks[:5]:
    print(chunk)
    print("*" * 32)

Doc ID: b3854cc2-3ce3-4c98-8f8a-18e08fb98cf2
Text: UNITED STATES SECURITIES AND EXCHANGE COMMISSION
<USState>Washington</USState>, D.C. <ZipCode>20549 </ZipCode>  FORM
10-Q  (Mark One)  <ReportingPeriod> ☒ QUARTERLY REPORT PURSUANT TO
SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF <Act>1934
</Act></ReportingPeriod>  <FinancialReportingPeriod>
<FinancialReportingPeriod>For the quarterly p...
********************************
Doc ID: 56c3a970-cb2a-4efe-87c4-763160fd50fa
Text: Yes ☒ No ☐  <CompanySize> Indicate by check mark whether the
Registrant is a large accelerated filer, an accelerated filer, a non-
accelerated filer, a smaller reporting company, or an emerging growth
company. See the definitions of “large accelerated filer,”
“accelerated filer,” “smaller reporting company,” and “emerging growth
company” in Rule ...
********************************
Doc ID: 99a089b9-5da3-4f4a-baea-5f84b24cb4c4
Text: <FinancialStatementNotes> See accompanying Notes to Condensed
Consolidated Fina

## Basic Use: Docugami Reader for Document QA

You can use the Docugami Reader like a standard reader for Document QA over multiple docs, albeit with much better chunks that follow the natural contours of the document. There are many great tutorials on how to do this, e.g. [this one](https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html). We can just use the same code, but use the `DocugamiReader` for better chunking, instead of loading text or PDF files directly with basic splitting techniques.

The documents returned by the reader are already split into chunks. Optionally, we can use the metadata on each chunk, for example the structure or tag attributes, to do any post-processing we want.

We will just use the output of the `DocugamiReader` as-is to set up a query engine the usual way.

In [65]:
import chromadb
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.embeddings import OpenAIEmbedding
from llama_index.storage.storage_context import StorageContext
from llama_index.vector_stores import ChromaVectorStore

db = chromadb.PersistentClient(path="/tmp/docugami/chroma_db")
chroma_collection = db.get_or_create_collection("docugami_test")
embed_model = OpenAIEmbedding()

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = VectorStoreIndex.from_documents(
    chunks, storage_context=storage_context, service_context=service_context
)

query_engine = index.as_query_engine(similarity_top_k=5)

In [66]:
retriever = index.as_retriever()
retriever.retrieve("How much did Microsoft spend for opex in the latest quarter?")

[NodeWithScore(node=TextNode(id_='13e19753-b4af-4ec9-9cb0-2507992e4836', embedding=None, metadata={'xpath': '/dg:chunk/dg:chunk/dg:chunk[2]/dg:chunk[2]/dg:chunk[7]', 'id': 'a8862469fb69a5d4b316ebd22d63991d', 'name': '2023 Q3 MSFT.pdf', 'structure': 'lim h1 h1 lim lim lim lim p lim lim h1 h1 table p p p h1 table h1 p p p h1 table', 'tag': 'chunk table ResearchandDevelopmentExpenses RDExpenses SalesandMarketingExpenses', 'Financial Reporting Period': 'For the  Quarterly Period  Ended  September 30, 2023', 'Company': 'MICROSOFT CORPORATION'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=['xpath', 'id', 'structure'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='01ccab79-9d91-44f0-9cc6-349f6994b008', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'xpath': '/dg:chunk/dg:chunk/dg:chunk[2]/dg:chunk[2]/dg:chunk[7]', 'id': 'a8862469fb69a5d4b316ebd22d63991d', 'name': '2023 Q3 MSFT.pdf', 'structure': 'lim h1 h1 lim lim lim lim p lim lim h1 h1 table p p p 

In [67]:
# Try out the query engine with example query
response = query_engine.query("How much did Microsoft spend for opex in the latest quarter?")
print(response.response)

Microsoft spent $86 million for operating expenses in the latest quarter.


In [68]:
for node in response.source_nodes:
    print(node.node.extra_info["name"])
    print(node.node.text)

2023 Q3 MSFT.pdf
• Operating expenses increased <OperatingExpenseIncrease>$86 million </OperatingExpenseIncrease>or <OperatingExpenseIncrease>2% </OperatingExpenseIncrease>driven by investments in <AzureInvestments>Azure </AzureInvestments>and other cloud services.
 More Personal Computing
 Revenue increased <RevenueIncrease>$334 million </RevenueIncrease>or <RevenueGrowth>3%</RevenueGrowth>.
 • Windows revenue increased <WindowsRevenueIncrease>$254 million </WindowsRevenueIncrease>or <WindowsRevenueGrowth>5% </WindowsRevenueGrowth>driven by growth in Windows Commercial and Windows OEM. Windows Commercial products and cloud services revenue increased <WindowsCommercialRevenueGrowth>8% </WindowsCommercialRevenueGrowth>driven by demand for <WindowsRevenue>Microsoft </WindowsRevenue>365. Windows OEM revenue increased <WindowsRevenueGrowth>4%</WindowsRevenueGrowth>.
 • Gaming revenue increased <GamingRevenueIncrease>$309 million </GamingRevenueIncrease>or <GamingRevenueGrowth>9% </GamingRe

## Using Docugami to Add Metadata to Chunks for High Accuracy Document QA

One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, struggle to give the LLM sufficient context to answer such questions. With upcoming very large context LLMs, it may be possible to stuff a lot of tokens, perhaps even entire documents, inside the context, but this will still hit limits at some point with very long documents or a large number of documents.
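A toy example makes the limitation visible. The sketch below uses fixed-size character windows with overlap, which is a deliberate simplification of typical chunking techniques, and a made-up document in which the company name and the figure of interest are far apart. No single chunk ends up containing both.

```python
from typing import List


def chunk_with_overlap(text: str, size: int, overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


# Made-up document: the two facts are separated by a long stretch of filler
document = (
    "COMPANY: EXAMPLE CORP. "
    + "filler " * 50
    + "Commercial paper outstanding: 42"
)
windows = chunk_with_overlap(document, size=80, overlap=20)

# True if some window holds both the company name and the figure
both_in_one_chunk = any(
    "COMPANY" in window and "Commercial paper" in window for window in windows
)
```

Overlap only helps when related facts are closer together than the overlap width, which is rarely the case for facts stated on different pages.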

For example, if we ask a more complex question that requires the LLM to draw on chunks from different parts of the document, even OpenAI's powerful LLM is unable to answer correctly.

In [69]:
response = query_engine.query(
    "How much commercial paper does apple have outstanding as of April 2023?"
)
print(response.response)

As of April 2023, Apple has $2.0 billion of commercial paper outstanding.


In [70]:
for node in response.source_nodes:
    print(node.node.extra_info["name"])
    print(node.node.text)

2023 Q2 AAPL.pdf
645</Number>) </RepaymentsofCommercialPaperSixMonthsEndedApril1March2620232022></td> <td><RepaymentsofCommercialPaper>(<Number>5,144</Number>) </RepaymentsofCommercialPaper></td></tr> <tr><td><Maturities90DaysOrLessRepaymentsofCommercialPaperNet>Repayments of commercial paper, net </Maturities90DaysOrLessRepaymentsofCommercialPaperNet></td> <td><RepaymentsofCommercialPaperNetSixMonthsEndedApril1March2620232022>(<Number>2,645</Number>) </RepaymentsofCommercialPaperNetSixMonthsEndedApril1March2620232022></td> <td><RepaymentsofCommercialPaperNet>(<Number>3,953</Number>) </RepaymentsofCommercialPaperNet></td></tr><tr><td><Maturities90DaysOrLess/></td></tr> <tr><td><Maturities90DaysOrLessTotalProceedsFromRepaymentsofCommercialPaperNe>Total proceeds from/(repayments of) commercial paper, net </Maturities90DaysOrLessTotalProceedsFromRepaymentsofCommercialPaperNe></td> <td><TotalProceedsFromRepaymentsofCommercialPaperNetSixMonthsEndedApril1><Money>$ (7,960) </Money> <Money>$ <

At first glance the answer may seem plausible, but review the source chunks for this answer carefully. Chunking does not guarantee that identifying information, such as the company name and reporting period, ends up in the same context as the figure being asked about, since the two can be far apart in the document. The query engine can therefore end up relying on chunks that do not contain the correct answer, or even on chunks from unrelated documents, in which case the answer is incorrect.

Docugami can help here. Chunks are annotated with additional metadata created using different techniques if a user has been [using Docugami](https://help.docugami.com/home/reports). More technical approaches will be added later.

Specifically, the reader can attach this additional metadata, which Docugami produces after some additional use, as simple key/value pairs on every text chunk. The `reader.include_project_metadata_in_doc_metadata` flag set earlier appears to control this. With it set to `False`, as above, the chunks carry only the base metadata:

In [71]:
chunks[0].metadata

{'xpath': '/dg:chunk/dg:chunk/dg:chunk[1]',
 'id': '8d298e24da52d0b0220b9674f373f2fc',
 'name': '2022 Q3 AAPL.pdf',
 'structure': 'h1 h1 h1 p p p h1 p p p h1 p h1 p lim h1 p p h1 h1 h1 lim h1 lim h1 lim h1 lim lim h1 lim h1 lim h1 lim h1 lim h1 div lim p h1 h1 p h1 h1 div h1 h1 lim h1 p h1 div',
 'tag': 'chunk ReportingPeriod FinancialReportingPeriod TransitionReport TransitionPeriod CompanyName Phone SecuritiesRegisteredPursuanttoSection12b TheNasdaqStockMarketLLC FilingCompliance'}

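With project metadata included, chunks also carry semantic key/value pairs such as `Financial Reporting Period` and `Company` (visible in the retriever output earlier). These are based on key chunks in the document, even if they don't appear near the chunk in question.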
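To illustrate why document-level key/value pairs help, the sketch below narrows the candidate chunks by exact metadata match before any similarity ranking takes place. Plain dicts stand in for chunks, and `filter_by_metadata` is a hypothetical helper; the auto-retriever used in the next cell infers such filters from the query rather than taking them explicitly.

```python
from typing import Any, Dict, List


def filter_by_metadata(
    chunks: List[Dict[str, Any]], required: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Return chunks whose metadata matches every required key/value pair exactly."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]
```

Because every chunk of a file carries the same document-level values, a filter such as `{"Company": "MICROSOFT CORPORATION"}` keeps chunks from that filing even when the company name never appears in the chunk text.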

In [78]:
from llama_index.indices.vector_store.retrievers import VectorIndexAutoRetriever
from llama_index.vector_stores.types import MetadataInfo, VectorStoreInfo, VectorStoreQueryMode
from llama_index.query_engine import RetrieverQueryEngine

EXCLUDE_KEYS = ["id", "xpath", "structure"]
metadata_field_info = [
    MetadataInfo(
        name=key,
        description=f"The {key} for this chunk",
        type="str",
    )
    for key in chunks[0].metadata
    if key.lower() not in EXCLUDE_KEYS
]

vector_store_info = VectorStoreInfo(
    content_info=f"Key metadata about {docset_name}",
    metadata_info=metadata_field_info,
)
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    vector_store_query_mode=VectorStoreQueryMode.MMR,
    similarity_top_k=10,
)

query_engine_auto_retriever = RetrieverQueryEngine.from_args(retriever=retriever)

In [79]:
retriever.retrieve("How much did apple report for services revenue?")

[]

Let's run the same question again. When the chunks carry metadata key/value pairs with key information about the document, the auto-retriever can use them to find the right chunks even if this information is physically very far away from the source chunk used to generate the answer. In the run captured here, however, the retriever returned no nodes, so the response below is empty.

In [76]:
response = query_engine_auto_retriever.query(
    "How much commercial paper does apple have outstanding as of April 2023?"
)
print(response.response)

Empty Response


In [75]:
for node in response.source_nodes:
    print(node.node.extra_info["name"])
    print(node.node.text)