In [28]:
%load_ext autoreload
%autoreload 2



In [29]:
%pip install --upgrade llama_index chromadb dgml-utils --quiet

Note: you may need to restart the kernel to use updated packages.


# Docugami
This notebook covers how to load documents from `Docugami`. See [README](./README.md) for more details, and the advantages of using this system over alternative data readers.

## Prerequisites
1. Follow the Quick Start section in [README](./README.md)
2. Grab an access token for your workspace, and make sure it is set as the `DOCUGAMI_API_KEY` environment variable
3. Grab some docset and document IDs for your processed documents, as described here: https://help.docugami.com/home/docugami-api

## Load Documents

If the `DOCUGAMI_API_KEY` environment variable is set, there is no need to pass it to the reader explicitly. Otherwise, you can pass it in as the `access_token` parameter.
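The fallback described above can be sketched roughly as follows. `resolve_access_token` is a hypothetical helper introduced only for illustration (it is not part of the reader); it prefers an explicitly supplied token and otherwise reads `DOCUGAMI_API_KEY` from the environment.

```python
import os
from typing import Optional


def resolve_access_token(explicit_token: Optional[str] = None) -> str:
    """Hypothetical helper: prefer an explicit token, else the environment."""
    token = explicit_token or os.environ.get("DOCUGAMI_API_KEY")
    if not token:
        raise ValueError(
            "No token found: set DOCUGAMI_API_KEY or pass a token explicitly."
        )
    return token


# Usage (commented out because it needs the reader from this repo):
# reader = DocugamiReader(access_token=resolve_access_token())
```

Failing early with a clear message is a design choice of this sketch; it makes a missing token visible before any request is made.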

In [30]:
from base import DocugamiReader

docset_id = "26xpy3aes7xp"
document_ids = ["d7jqdzcj50sj", "cgd1eacfkchw"]

reader = DocugamiReader()
chunks = reader.load_data(docset_id=docset_id, document_ids=document_ids)

len(chunks)

116

In [31]:
# Inspect one of the chunks
print(chunks[6].text)
print(chunks[6].metadata)

2.4 If scheduled onsite visits are cancelled less than ten (10) working days in advance of the scheduled date, Company is entitled to charge fifty percent (50%) of the expected revenue associated with this onsite activity as compensation.
{'xpath': '/dg:chunk/docset:MASTERSERVICESAGREEMENT-section/docset:MASTERSERVICESAGREEMENT/dg:chunk[1]/docset:Standard/dg:chunk[3]/dg:chunk[2]/docset:ADailyBasis/dg:chunk[4]/docset:IfScheduledOnsiteVisits', 'id': 'cgd1eacfkchw', 'name': 'Master Services Agreement - Daltech.docx', 'structure': 'lim p', 'tag': 'chunk IfScheduledOnsiteVisits', 'Liability': '', 'Workers Compensation Insurance': '$1,000,000', 'Limit': '$1,000,000', 'Commercial General Liability Insurance': '$2,000,000', 'Technology Professional Liability Errors Omissions Policy': '$5,000,000', 'Excess Liability Umbrella Coverage': '$9,000,000', 'Client': 'Daltech, Inc.', 'Services Agreement Date': 'INITIAL STATEMENT  OF WORK (SOW)  The purpose of this SOW is to describe the Software and Se

The `metadata` for each `Document` (really, a chunk of an actual PDF, DOC or DOCX) contains some useful additional information:

1. **id and name:** ID and Name of the file (PDF, DOC or DOCX) the chunk is sourced from within Docugami.
2. **xpath:** XPath inside the XML representation of the document, for the chunk. Useful for source citations directly to the actual chunk inside the document XML.
3. **structure:** Structural attributes of the chunk, e.g. h1, h2, div, table, td, etc. Useful to filter out certain kinds of chunks if needed by the caller.
4. **tag:** Semantic tag for the chunk, using various generative and extractive techniques. More details here: https://github.com/docugami/DFM-benchmarks
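As a rough illustration of filtering on the `structure` attribute, the sketch below drops entries whose structure contains unwanted markers. The helper name `drop_by_structure` and the sample metadata are made up for this example; it assumes only that `structure` is a space-separated string such as `'h1 h1 p'`, as in the output above.

```python
from typing import Dict, Iterable, List


def drop_by_structure(
    metadatas: Iterable[Dict[str, str]], unwanted: Iterable[str]
) -> List[Dict[str, str]]:
    """Keep only entries whose structure has none of the unwanted markers."""
    unwanted_set = set(unwanted)
    kept = []
    for metadata in metadatas:
        # "structure" is treated as a space-separated list, e.g. "h1 h1 p".
        markers = set(metadata.get("structure", "").split())
        if not markers & unwanted_set:
            kept.append(metadata)
    return kept


# Made-up sample metadata, shaped like the reader's output.
sample = [
    {"id": "a", "structure": "h1 h1 p"},
    {"id": "b", "structure": "table td"},
    {"id": "c", "structure": "lim p"},
]
body_only = drop_by_structure(sample, unwanted=["table", "td"])
print([m["id"] for m in body_only])
```

With real chunks, the same test could be applied to `chunk.metadata` for each chunk returned by the reader.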

You can control chunking behavior by setting the following properties on the `DocugamiReader` instance:

1. There is a minimum chunk size of 32 that can be changed. Chunks smaller than the minimum size are appended to subsequent chunks. You can set `reader.min_chunk_size = 0` to get all structural chunks regardless of size or `reader.min_chunk_size = 1024` to get very large chunks.
2. By default, only the text for chunks is returned. However, Docugami's XML knowledge graph has additional rich information, including semantic tags for entities inside the chunk. Set `reader.include_xml_tags = True` if you want the additional XML metadata on the returned chunks.
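To make the minimum-chunk-size behavior concrete, here is a minimal sketch of appending undersized chunks to the chunk that follows. It is an illustration only, not the reader's actual implementation: `merge_small_chunks` is a hypothetical function, and measuring size in characters is an assumption (the reader may measure size differently).

```python
from typing import List


def merge_small_chunks(texts: List[str], min_chunk_size: int) -> List[str]:
    """Join chunks shorter than min_chunk_size onto the chunk that follows."""
    merged: List[str] = []
    pending = ""  # small chunks waiting to be joined with the next one
    for text in texts:
        candidate = f"{pending} {text}".strip() if pending else text
        if len(candidate) < min_chunk_size:
            pending = candidate
        else:
            merged.append(candidate)
            pending = ""
    if pending:
        # Nothing follows the trailing small chunk, so keep it as-is.
        merged.append(pending)
    return merged


pieces = ["2.4", "If scheduled onsite visits are cancelled...", "OK"]
print(merge_small_chunks(pieces, min_chunk_size=8))
print(merge_small_chunks(pieces, min_chunk_size=0))
```

With a minimum of 0 every structural piece is kept separately, which mirrors the effect described for `reader.min_chunk_size = 0`.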

## Basic Use: Docugami Reader for Document QA

You can use the Docugami Reader like a standard reader for Document QA over multiple docs, albeit with much better chunks that follow the natural contours of the document. There are many great tutorials on how to do this, e.g. [this one](https://gpt-index.readthedocs.io/en/latest/getting_started/starter_example.html). We can just use the same code, but use the `DocugamiReader` for better chunking, instead of loading text or PDF files directly with basic splitting techniques.

In [32]:
# For this example, we already have a processed docset for a set of lease documents
docset_id = "zo954yqy53wp"
chunks = reader.load_data(docset_id=docset_id)

# strip semantic metadata intentionally, to test how things work without semantic metadata
for chunk in chunks:
    stripped_metadata = chunk.metadata.copy()
    for key in chunk.metadata:
        if key not in ["name", "xpath", "id", "structure"]:
            # remove semantic metadata
            del stripped_metadata[key]
    chunk.metadata = stripped_metadata

print(len(chunks))
print(chunks[0].text)
print(chunks[0].metadata)

4663
OFFICE LEASE THIS OFFICE LEASE (the "Lease") is made and entered into as of March 29th , 2019 , by and between Landlord and Tenant. "Date of this Lease" shall mean the date on which the last one of the Landlord and Tenant has signed this Lease.
{'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/dg:chunk/docset:Lease', 'id': 'cpzwzcurck2t', 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf', 'structure': 'h1 h1 p'}


The documents returned by the reader are already split into chunks. Optionally, we can use the metadata on each chunk, for example the structure or tag attributes, to do any post-processing we want.

We will just use the output of the `DocugamiReader` as-is to set up a query engine the usual way.

In [33]:
from llama_index import VectorStoreIndex, ServiceContext, StorageContext
from llama_index.llms import LangChainLLM
from llama_index.embeddings import LangchainEmbedding

# pick other providers and swap as needed
from llama_index.vector_stores import ChromaVectorStore
from langchain.chat_models.openai import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
import chromadb

def create_query_engine(chunks, index_name):
    llm = LangChainLLM(ChatOpenAI())
    embeddings = LangchainEmbedding(OpenAIEmbeddings())

    chroma_client = chromadb.PersistentClient(path="./temp/chroma.backup")
    chroma_collection = chroma_client.create_collection(index_name, get_or_create=True)
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)

    index = VectorStoreIndex.from_documents(
        chunks, storage_context=storage_context, service_context=service_context
    )

    return index.as_query_engine(similarity_top_k=5)

In [34]:
# Try out the query engine with example query
stripped_metadata_query_engine = create_query_engine(chunks, index_name="stripped_metadata_index")
response = stripped_metadata_query_engine.query("What can tenants do with signage on their properties?")
print(response.response + "\n")

for node in response.source_nodes:
    print(node)

Tenants are not given any specific rights or options regarding signage on their properties based on the provided context information.

Node ID: 161c8835-a933-4cd7-b207-cef0d6e0c78e
Text: (b) Enter upon and take possession of the Premises, by changing
locks if necessary, and lock out, expel or remove Tenant and any other
person who may be occupying all or any part of the Premises without
being liable for any claim for damages, and without causing a
trespass, without being liable for any claim for damages, and without
causing a te...
Score:  0.691

Node ID: a574cc2a-538b-4295-bc6a-39cc0ebcfc02
Text: (b) Enter upon and take possession of the Premises, by changing
locks if necessary, and lock out, expel or remove Tenant and any other
person who may be occupying all or any part of the Premises without
being liable for any claim for damages, and without causing a
trespass, without being liable for any claim for damages, and without
causing a te...
Score:  0.691

Node ID: 4142186e-b050-4650-8

## Using Docugami to Add Metadata to Chunks for High Accuracy Document QA

One issue with large documents is that the correct answer to your question may depend on chunks that are far apart in the document. Typical chunking techniques, even with overlap, struggle to give the LLM sufficient context to answer such questions. With upcoming very large context LLMs, it may be possible to stuff many tokens, perhaps even entire documents, into the context, but this will still hit limits at some point with very long documents or a large number of documents.
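The limitation of overlap can be seen in a small, self-contained sketch. The fixed-size sliding-window chunker below is a generic illustration (not what any particular library does), and the "document" is made up: a landlord name at the start and a deposit amount 200 words later.

```python
from typing import List


def chunk_with_overlap(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split words into windows of `size` that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [words[start:start + size] for start in range(0, len(words), step)]


# Made-up document: landlord on "page one", deposit far later.
words = ["LANDLORD_NAME"] + ["filler"] * 200 + ["DEPOSIT_AMOUNT"]
windows = chunk_with_overlap(words, size=50, overlap=10)

# Does any single window contain both facts?
together = any(
    "LANDLORD_NAME" in window and "DEPOSIT_AMOUNT" in window for window in windows
)
print(len(windows), together)
```

Overlap only links neighboring windows, so two facts separated by more than one window length never share a chunk, no matter how the overlap is tuned.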

For example, if we ask a more complex question that requires the LLM to draw on chunks from different parts of the document, even OpenAI's powerful LLM is unable to answer correctly.

In [35]:
response = stripped_metadata_query_engine.query("What is the security deposit for the property owned by Birch Street?")
print(response.response)  # the correct answer should be $78,000
for node in response.source_nodes:
    print(node.node.extra_info["name"])
    print(node.node.text)

The security deposit for the property owned by Birch Street is $788.
Lease Agreements/Bioplex, Inc.pdf
C. In addition, Tenant agrees to deposit with Landlord on the date hereof a security deposit in the amount of $5969 , which sum shall be held by Landlord, without obligation for interest, as security for the performance of Tenant’s covenants and obligation under this Lease, it being expressly understood and agreed that such deposit is not an advance rental deposit, not the last month’s rent nor a measure of Landlord’s damages in the event of Tenant’s default. Upon the occurrence of any event of default by Tenant, Landlord may, from time to time , without prejudices to any other remedy provided herein or provided by law, use such deposit to the extent necessary to make good any arrears of rent of other payments due Landlord hereunder, and any other damage, injury, expense or liability caused by such event of default; or to perform any obligation required of Tenant under the Lease; and 

At first glance the answer may seem plausible, but if you review the source chunks carefully, you will see that the chunking did not put the landlord name and the security deposit in the same context, since they are far apart in the document. The query engine therefore retrieves chunks from other documents that are unrelated to the **Birch Street** landlord. That landlord happens to be mentioned on the first page of the file **Shorebucks LLC_WA.pdf**, and none of the source chunks used by the query engine contain the correct answer (**$78,000**), so the answer is incorrect.
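Reviewing source chunks by eye gets tedious, so a simple check can help. The sketch below uses a hypothetical helper, `answer_in_sources`, that tests whether an expected string appears in any retrieved chunk text; the sample text is abbreviated from the output above.

```python
from typing import Iterable


def answer_in_sources(source_texts: Iterable[str], expected: str) -> bool:
    """Return True if any retrieved chunk text contains the expected string."""
    return any(expected in text for text in source_texts)


retrieved = ["... a security deposit in the amount of $5969 , which sum ..."]
print(answer_in_sources(retrieved, "$78,000"))
```

Against a real response, the texts could be taken from `node.node.text` for each entry in `response.source_nodes`, as in the cell above. A plain substring match is a crude test and will miss differently formatted amounts.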

Docugami can help here. Chunks are annotated with additional metadata created using different techniques if a user has been [using Docugami](https://help.docugami.com/home/reports). More technical approaches will be added later.

Specifically, let's load the data again, this time without stripping the semantic metadata, and look at the additional metadata that Docugami returns after some additional use. It takes the form of simple key/value pairs on all the text chunks:

In [36]:
chunks = reader.load_data(docset_id=docset_id)
chunks[0].metadata

{'xpath': '/docset:OFFICELEASE-section/docset:OFFICELEASE-section/dg:chunk/docset:Lease',
 'id': 'cpzwzcurck2t',
 'name': 'Sample Commercial Leases/Shorebucks LLC_AZ.pdf',
 'structure': 'h1 h1 p',
 'tag': 'chunk Lease',
 'Lease Date': 'March  29th , 2019',
 'Landlord': 'Menlo Group',
 'Tenant': 'Shorebucks LLC'}

Note the semantic metadata tags such as `Lease Date`, `Landlord`, and `Tenant`. They are based on key chunks in the document, even if those chunks don't appear near the chunk in question.
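One way to picture why this helps retrieval: if the key/value pairs are serialized alongside the chunk text, a question that mentions the landlord can match a chunk that only talks about the deposit. The sketch below illustrates that idea with a hypothetical `text_with_metadata` function; it is not the exact format llama_index uses when it combines metadata with text.

```python
from typing import Dict


def text_with_metadata(text: str, metadata: Dict[str, str]) -> str:
    """Prefix chunk text with one "key: value" line per metadata entry."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"{header}\n\n{text}" if header else text


# Values taken from the metadata output above; the text is abbreviated.
metadata = {"Landlord": "Menlo Group", "Tenant": "Shorebucks LLC"}
print(text_with_metadata("C. In addition, Tenant agrees to deposit ...", metadata))
```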

Let's run the same question again. The query engine should now be able to return the correct result, since every chunk carries metadata key/value pairs with key information about the document, even if that information is physically very far from the source chunk used to generate the answer.

In [37]:
with_metadata_query_engine = create_query_engine(chunks, index_name="with_metadata_index")
response = with_metadata_query_engine.query("What is the security deposit for the property owned by Birch Street?")
print(response.response)  # the correct answer should be $78,000
for node in response.source_nodes:
    print(node.node.extra_info["name"])
    print(node.node.text)

There is no information in the given context about the property being owned by Birch Street or the security deposit for that property.
Lease Agreements/Streethex, Inc.pdf
C. In addition, Tenant agrees to deposit with Landlord on the date hereof a security deposit in the amount of $788 , which sum shall be held by Landlord, without obligation for interest, as security for the performance of Tenant’s covenants and obligation under this Lease, it being expressly understood and agreed that such deposit is not an advance rental deposit, not the last month’s rent nor a measure of Landlord’s damages in the event of Tenant’s default. Upon the occurrence of any event of default by Tenant, Landlord may, from time to time , without prejudices to any other remedy provided herein or provided by law, use such deposit to the extent necessary to make good any arrears of rent of other payments due Landlord hereunder, and any other damage, injury, expense or liability caused by such event of default; or