# Docugami
This notebook covers how to load documents from `Docugami`. See [README](./README.md) for more details, and the advantages of using this system over alternative data loaders.

## Prerequisites
1. Follow the Quick Start section in [README](./README.md)
2. Grab an access token for your workspace, and make sure it is set as the `DOCUGAMI_API_KEY` environment variable
3. Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api

## Load Documents

If the `DOCUGAMI_API_KEY` environment variable is set, there is no need to pass it to the loader explicitly. Otherwise, you can pass it in as the `access_token` parameter.
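The fallback described above can be sketched as a small helper. This is only an illustration: `resolve_docugami_token` is a hypothetical name, not part of the reader, and the reader may already perform an equivalent lookup internally.

```python
import os


def resolve_docugami_token(explicit_token=None):
    """Return the explicit token if given, else fall back to DOCUGAMI_API_KEY."""
    token = explicit_token or os.environ.get("DOCUGAMI_API_KEY")
    if not token:
        raise ValueError(
            "No Docugami access token: set DOCUGAMI_API_KEY or pass one explicitly."
        )
    return token


# Hypothetical usage with the reader from this notebook:
# loader = DocugamiReader(access_token=resolve_docugami_token())
```

Failing early with a clear message can be easier to debug than an authentication error raised later by the API call.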

In [7]:
from llama_index import download_loader

DocugamiReader = download_loader('DocugamiReader')

docset_id="ecxqpipcoe2p"
document_ids=["43rj0ds7s0ur", "bpc1vibyeke2"]

loader = DocugamiReader()
documents = loader.load_data(docset_id=docset_id, document_ids=document_ids)
documents

[Document(id_='c1adad58-13c4-4455-b286-68ade1aa23ef', embedding=None, metadata={'xpath': '/docset:MutualNon-disclosure/docset:MutualNon-disclosure/docset:MUTUALNON-DISCLOSUREAGREEMENT-section/docset:MUTUALNON-DISCLOSUREAGREEMENT/docset:ThisMutualNon-disclosureAgreement', 'id': '43rj0ds7s0ur', 'name': 'NDA simple layout.docx', 'structure': 'p', 'tag': 'ThisMutualNon-disclosureAgreement'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='368d8592f11eea5a4d5283bea95d58615ecb5c26d0ff334589530154567ba1c7', text='MUTUAL NON-DISCLOSURE AGREEMENT This  Mutual Non-Disclosure Agreement  (this “ Agreement ”) is entered into and made effective as of  April  4 ,  2018  between  Docugami Inc. , a  Delaware  corporation , whose address is  150  Lake Street South ,  Suite  221 ,  Kirkland ,  Washington  98033 , and  Caleb Divine , an individual, whose address is  1201  Rt  300 ,  Newburgh  NY  12550 .', start_char_idx=None, end_char_idx=None, text_template='{meta

The `metadata` for each `Document` (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

1. **id and name:** ID and Name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.
2. **xpath:** XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.
3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.
4. **tag:** Semantic tag for the chunk, using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks
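As a rough sketch of the filtering mentioned for the `structure` attribute, the helper below drops chunks whose structure matches an excluded set. The stand-in chunks and their values are made up for illustration, `drop_structures` is a hypothetical name, and the whitespace split assumes that `structure` may hold more than one attribute.

```python
from types import SimpleNamespace


def drop_structures(documents, excluded):
    """Keep chunks whose `structure` metadata shares no token with `excluded`."""
    excluded = set(excluded)
    kept = []
    for doc in documents:
        tokens = set(doc.metadata.get("structure", "").split())
        if not tokens & excluded:
            kept.append(doc)
    return kept


# Stand-in chunks with made-up values, shaped like the metadata shown above.
sample_chunks = [
    SimpleNamespace(text="MUTUAL NON-DISCLOSURE AGREEMENT", metadata={"structure": "h1"}),
    SimpleNamespace(text="This Agreement is entered into ...", metadata={"structure": "p"}),
    SimpleNamespace(text="Rent | Amount", metadata={"structure": "table"}),
]

body_chunks = drop_structures(sample_chunks, excluded={"h1", "table"})
print([c.metadata["structure"] for c in body_chunks])  # ['p']
```

In the notebook, the same function could be applied to the `documents` list returned by the loader, since each item exposes a `metadata` dictionary.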

## Basic Use: Docugami Loader for Document QA

You can use the `DocugamiReader` like a standard loader for Document QA over multiple docs, albeit with much better chunks that follow the natural contours of the document. There are many great tutorials on how to do this, e.g. [this one](https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html). We can use the same code, but with the `DocugamiReader` for better chunking, instead of loading text or PDF files directly with basic splitting techniques.

In [3]:
from llama_index import GPTVectorStoreIndex, download_loader

DocugamiReader = download_loader('DocugamiReader')
loader = DocugamiReader()

# For this example, we already have a processed docset for a set of lease documents
docset_id = "wh2kned25uqm"
documents = loader.load_data(docset_id=docset_id)

The documents returned by the loader are already split into chunks. Optionally, we can use the metadata on each chunk, for example the structure or tag attributes, to do any post-processing we want.
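One simple form of such post-processing is to summarize how the chunks are distributed before indexing. The sketch below counts chunks per value of a metadata key; `summarize_chunks` is a hypothetical helper and the sample metadata values are invented for illustration.

```python
from collections import Counter


def summarize_chunks(metadata_list, key):
    """Count how many chunks carry each value of the given metadata key."""
    return Counter(md.get(key, "<missing>") for md in metadata_list)


# Made-up metadata dictionaries shaped like the loader output.
sample_metadata = [
    {"name": "TruTone Lane 1.docx", "structure": "p"},
    {"name": "TruTone Lane 1.docx", "structure": "h1"},
    {"name": "Shorebucks LLC_CO.pdf", "structure": "p"},
]

per_file = summarize_chunks(sample_metadata, "name")
per_structure = summarize_chunks(sample_metadata, "structure")
print(per_file.most_common())
print(per_structure.most_common())

# In the notebook this could be called as:
# summarize_chunks([d.metadata for d in documents], "tag")
```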

We will just use the output of the `DocugamiReader` as-is to set up a query engine the usual way.

In [4]:
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=5)

In [6]:
# Try out the query engine with an example query
response = query_engine.query("What can tenants do with signage on their properties?")
print(response.response)
for node in response.source_nodes:
    print(node)


Tenants can place or attach signs (digital or otherwise) or other forms of identification to their properties after receiving written permission from the landlord. Any signs or other forms of identification must conform to all applicable laws, ordinances, etc. governing the same. Tenants must also have any window or glass identification completely removed and cleaned at their expense promptly upon vacating the premises.
NodeWithScore(node=Node(text='Signage.  Tenant  may place or attach to the  Premises signs  (digital or otherwise) or other such identification as needed after receiving written permission from the  Landlord , which permission shall not be unreasonably withheld. Any damage caused to the Premises by the  Tenant ’s erecting or removing such signs shall be repaired promptly by the  Tenant  at the  Tenant ’s expense . Any signs or other form of identification allowed must conform to all applicable laws, ordinances, etc. governing the same.  Tenant  also agrees to have any 

## Using Docugami to Add Metadata to Chunks for High Accuracy Document QA

One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, struggle to give the LLM sufficient context to answer such questions. With upcoming very-large-context LLMs, it may be possible to fit many tokens, perhaps even entire documents, inside the context, but this will still hit limits at some point with very long documents or a large number of documents.
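The chunking problem can be made concrete with a toy sketch. The fixed-size character chunker and the synthetic lease text below are assumptions for illustration only, not the splitter used by any particular library. Even with overlap, no single chunk contains both the landlord name and the deposit amount.

```python
def chunk_with_overlap(text, size, overlap):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Synthetic document: the two facts are separated by a long run of filler text.
filler = "Lorem ipsum clause text. " * 40
toy_lease = "Landlord: BIRCH STREET, LLC. " + filler + "Security Deposit: $78,000."

chunks = chunk_with_overlap(toy_lease, size=200, overlap=50)
together = [c for c in chunks if "BIRCH STREET" in c and "$78,000" in c]
print(len(together))  # 0
```

Increasing `size` until both facts fit in one chunk works for this toy text, but the same gap reappears as soon as the facts are further apart than the chunk size.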

For example, if we ask a more complex question that requires the LLM to draw on chunks from different parts of the document, even OpenAI's powerful LLM is unable to answer correctly.

In [9]:
response = query_engine.query("What is the security deposit for the property owned by Birch Street?")
print(response.response) # the correct answer should be $78,000
for node in response.source_nodes:
    print(node.metadata["name"])
    print(node.node.text)


The security deposit for the property owned by Birch Street is not specified in the context information provided.
Shorebucks LLC_CO.pdf
1.12 Security Deposit . As of the Date of this  Lease , there is no  Security Deposit .
Shorebucks LLC_AZ.pdf
22. SECURITY DEPOSIT . The  Security Deposit  shall be held by  Landlord  as security for  Tenant 's full and faithful performance  of this  Lease  including the payment of  Rent .  Tenant  grants  Landlord  a security interest in the  Security Deposit . The  Security Deposit  may be commingled with other funds of  Landlord  and  Landlord  shall have no liability for payment of any interest on the  Security Deposit .  Landlord  may apply the  Security Deposit  to the extent required to cure any default by  Tenant . If  Landlord  so applies the  Security Deposit ,  Tenant  shall deliver to  Landlord  the amount necessary to replenish the  Security Deposit  to its original sum within  five  days  after notice from  Landlord . The  Security Depos



At first glance the answer may seem reasonable, but if you review the source chunks carefully, you will see that the chunking of the document did not put the landlord name and the security deposit in the same context, since they are far apart in the document. The query engine therefore ends up retrieving chunks from other documents that are unrelated to the **Birch Street** landlord. That landlord is mentioned on the first page of the file **TruTone Lane 1.docx**, and none of the source chunks used by the query engine contain the correct answer (**$78,000**), so the answer is incorrect.
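A quick way to confirm this kind of retrieval miss is to check whether the expected answer appears in any retrieved chunk. The helper below is a hypothetical diagnostic, shown on two chunk texts copied from the output above (the second one is abbreviated here).

```python
def answer_in_sources(source_texts, expected):
    """Return True if any retrieved chunk text contains the expected string."""
    return any(expected in text for text in source_texts)


# Chunk texts taken from the output above; the second is shortened.
retrieved = [
    "1.12 Security Deposit . As of the Date of this  Lease , there is no  Security Deposit .",
    "22. SECURITY DEPOSIT . The  Security Deposit  shall be held by  Landlord  as security",
]

print(answer_in_sources(retrieved, "78,000"))  # False

# In the notebook this could be called as:
# answer_in_sources([n.node.text for n in response.source_nodes], "78,000")
```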

Docugami can help here. Chunks are annotated with additional metadata created using different techniques if a user has been [using Docugami](https://help.docugami.com/home/reports). More technical approaches will be added later.

Specifically, let's look at the additional metadata that Docugami returns on the documents after some additional use, in the form of simple key/value pairs on all the text chunks:

In [10]:
docset_id="wh2kned25uqm"
documents = loader.load_data(docset_id=docset_id)
documents[0].metadata

{'xpath': '/docset:OFFICELEASEAGREEMENT-section/docset:OFFICELEASEAGREEMENT/docset:ThisOfficeLeaseAgreement',
 'id': 'v1bvgaozfkak',
 'name': 'TruTone Lane 2.docx',
 'structure': 'p',
 'tag': 'ThisOfficeLeaseAgreement',
 'Landlord': 'BUBBA CENTER PARTNERSHIP',
 'Tenant': 'Truetone Lane LLC'}

In [11]:
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(similarity_top_k=5)

Let's run the same question again. It returns the correct result because every chunk carries metadata key/value pairs with key information about the document, even if this information is physically very far away from the source chunk used to generate the answer.

In [12]:
response = query_engine.query("What is the security deposit for the property owned by Birch Street?")
print(response.response) # the correct answer should be $78,000
for node in response.source_nodes:
    print(node.metadata["name"])
    print(node)


The security deposit for the property owned by Birch Street is $78,000.
TruTone Lane 1.docx
NodeWithScore(node=Node(text='$ 20,023.78  of the  Security  to the  Tenant  and the  Security obligation  shall be  $ 31,976.72  and remain until the expiration or earlier termination of this  Lease .', doc_id='d34995dc-cbe2-4f70-a248-ca0e8c937d7b', embedding=None, doc_hash='84ec2102e9e9cc07487556772b8f97aa14e01d6f763ba1315e0ae2132d67691c', extra_info={'xpath': '/docset:Rider/docset:RIDERTOLEASE-section/docset:RIDERTOLEASE/docset:FixedRent/docset:TermYearPeriod/docset:Lease/docset:_42hSmokingProhibitedTenant/docset:TenantsEmployees/docset:TheArea/docset:_56SecurityDeposit-section/docset:_56SecurityDeposit/docset:TheForegoing/docset:TheSecurity', 'id': 'omvs4mysdk6b', 'name': 'TruTone Lane 1.docx', 'structure': 'p', 'tag': 'TheSecurity', 'Landlord': 'BIRCH STREET ,  LLC', 'Tenant': 'Trutone Lane LLC'}, node_info={'start': 0, 'end': 171}, relationships={<DocumentRelationship.SOURCE: '1'>: '659e3
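One way to picture why the metadata helps is to assume that the index combines each chunk's metadata with its text before embedding and prompting. The sketch below illustrates that idea with a hypothetical `text_with_metadata` helper; the exact template used by the library may differ.

```python
def text_with_metadata(text, metadata, keys):
    """Prefix a chunk with selected metadata rendered as 'key: value' lines."""
    header = "\n".join(f"{k}: {metadata[k]}" for k in keys if k in metadata)
    return f"{header}\n\n{text}" if header else text


# Values copied from the source node shown above (chunk text shortened).
chunk_text = "the  Security obligation  shall be  $ 31,976.72  and remain until the expiration"
chunk_metadata = {
    "name": "TruTone Lane 1.docx",
    "Landlord": "BIRCH STREET ,  LLC",
    "Tenant": "Trutone Lane LLC",
}

enriched = text_with_metadata(chunk_text, chunk_metadata, keys=["Landlord", "Tenant"])
print(enriched)
```

With the landlord name attached to every chunk, a query that mentions Birch Street can match chunks from the right document even when the chunk text itself never names the landlord.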