<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FChunking%2FLarge%2520Document%2520Processing%2520-%2520Document%2520AI%2520Layout%2520Parser.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Large Document Processing - Document AI Layout Parser

Creating chunks of text from documents for use in [retrieval](../Retrieval/readme.md) systems is made simple with the [Layout Parser from Document AI](https://cloud.google.com/document-ai/docs/layout-parse-chunk). A companion workflow already covers the different ways to make processing requests and handle responses: [Process Documents - Document AI Layout Parser](./Process%20Documents%20-%20Document%20AI%20Layout%20Parser.ipynb).  This workflow applies the techniques from that companion workflow to process multiple very large documents.

**Use Case Exploration**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:
- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows.  These documents are directly used as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb).  The workflow below uses chunking of the document to set up a retrieval workflow.  The [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb) workflow generates embeddings for these chunks that are then used throughout the [Retrieval](../Retrieval/readme.md) examples.

## Costs
It is **not recommended** to run this notebook as a tutorial because it processes several thousand PDF pages and costs around $40.  The outputs are saved in the repository alongside this notebook for review.  The code here can serve as a getting-started guide for processing large documents.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.documentai', 'google-cloud-documentai', '2.31.0'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('fitz', 'pymupdf'),
    ('PIL', 'Pillow')
]

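import importlib.util, importlib.metadata
# assumption: the `packaging` library is available (it ships with most Jupyter environments)
from packaging.version import Version

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # compare parsed versions (plain strings sort '2.9.0' after '2.31.0')
        # and look up the installed version by distribution (install) name
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user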
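
Minimum-version checks are easy to get subtly wrong, because version strings compare character by character rather than numerically. The sketch below shows the difference using only the standard library. The helpers `version_tuple` and `needs_update` are hypothetical names introduced for illustration, and the parsing only handles plain dotted release numbers.

```python
import re
import importlib.metadata

def version_tuple(version):
    """Turn '2.31.0' into (2, 31, 0); stops at the first non-numeric piece."""
    numbers = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)

def needs_update(distribution, min_version):
    """True when the distribution is missing or older than min_version."""
    try:
        installed = importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return True
    return version_tuple(installed) < version_tuple(min_version)

# string comparison orders character by character, numeric tuples do not
print('2.9.0' < '2.31.0')                                # False
print(version_tuple('2.9.0') < version_tuple('2.31.0'))  # True
```

For pre-release or otherwise non-standard version strings a dedicated parser is the safer choice; this sketch is only meant to show why numeric comparison matters.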

### API Enablement

In [4]:
!gcloud services enable documentai.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, resume execution with the cell that follows this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'layout-parser-large-files'

# make this the gcs bucket for storing files
GCS_BUCKET = PROJECT_ID 

Packages

In [45]:
import os, time, io, re, json

import requests
import fitz #pymupdf
import PIL.Image

from google.cloud import documentai
from google.cloud import storage

Clients

In [9]:
# document AI client
LOCATION = REGION.split('-')[0]
docai_client = documentai.DocumentProcessorServiceClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)
docai_async_client = documentai.DocumentProcessorServiceAsyncClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)

# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

Parameters:

In [10]:
DIR = f"files/{EXPERIMENT}"

Environment:

In [11]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Documents

Retrieve the documents and store in GCS for processing.

### Retrieve Documents

In [12]:
freddie_url = 'https://guide.freddiemac.com/ci/okcsFattach/get/1002095_2'
fannie_url = 'https://singlefamily.fanniemae.com/media/39861/display'

In [13]:
freddie_retrieve = requests.get(freddie_url).content
fannie_retrieve = requests.get(fannie_url).content

In [14]:
freddie_doc = fitz.open(stream = freddie_retrieve, filetype = 'pdf')
fannie_doc = fitz.open(stream = fannie_retrieve, filetype = 'pdf')

In [15]:
freddie_doc.page_count, fannie_doc.page_count

(2641, 1180)

### Split Documents

The Layout Parser accepts at most 500 pages per document and can handle up to 5,000 files.  Here each PDF is split into parts of no more than 400 pages.
- [Layout Parser Limits](https://cloud.google.com/document-ai/docs/layout-parse-chunk#limitations)

In [16]:
def doc_parts(doc):
    """Split a PyMuPDF document into parts of at most max_pages pages."""
    start_page = 0
    max_pages = 400
    n_pages = doc.page_count
    
    doc_list = []
    while start_page < n_pages:
        # from_page and to_page are 0-based and inclusive, so the last valid index is n_pages - 1
        end_page = min(start_page + max_pages, n_pages) - 1
        new_doc = fitz.open()
        new_doc.insert_pdf(doc, from_page = start_page, to_page = end_page)
        doc_list.append(new_doc)
        start_page = end_page + 1
    
    print(f"The document has {n_pages} pages and has been split into parts with page counts: {[p.page_count for p in doc_list]}")
    
    return doc_list

In [17]:
freddie_parts = doc_parts(freddie_doc)

The document has 2641 pages and has been split into parts with page counts: [400, 400, 400, 400, 400, 400, 241]


In [18]:
fannie_parts = doc_parts(fannie_doc)

The document has 1180 pages and has been split into parts with page counts: [400, 400, 380]
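
The page arithmetic behind the split can be checked without opening a PDF. The sketch below uses a hypothetical helper, `page_ranges`, that returns inclusive 0-based page ranges (the convention used by `from_page` and `to_page`) and confirms the part sizes reported above.

```python
def page_ranges(n_pages, max_pages=400):
    """Return inclusive, 0-based (first_page, last_page) pairs covering n_pages."""
    ranges = []
    first = 0
    while first < n_pages:
        last = min(first + max_pages, n_pages) - 1
        ranges.append((first, last))
        first = last + 1
    return ranges

for total in (2641, 1180):
    sizes = [last - first + 1 for first, last in page_ranges(total)]
    print(total, sizes)
# 2641 [400, 400, 400, 400, 400, 400, 241]
# 1180 [400, 400, 380]
```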


### Save Documents To GCS Files

In [22]:
def doc_to_gcs(document, name):
    buffer = io.BytesIO()
    document.save(buffer)
    buffer.seek(0) # reset the position to the beginning
    blob = bucket.blob(f"{SERIES}/{EXPERIMENT}/{name}.pdf")
    blob.upload_from_file(buffer, content_type = 'application/pdf')
    print(f"The file 'gs://{bucket.name}/{blob.name}' is {(blob.size / (1024*1024)):.2f} MB")
    return blob

In [23]:
freddie_blob = doc_to_gcs(freddie_doc, 'full/freddie')

The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/full/freddie.pdf' is 21.44 MB


In [24]:
fannie_blob = doc_to_gcs(fannie_doc, 'full/fannie')

The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/full/fannie.pdf' is 4.55 MB


### Save Document Parts To GCS Files

In [25]:
freddie_blobs = [doc_to_gcs(doc, f'parts/freddie_part_{d}') for d, doc in enumerate(freddie_parts)]

The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_0.pdf' is 3.17 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_1.pdf' is 4.44 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_2.pdf' is 3.32 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_3.pdf' is 3.43 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_4.pdf' is 3.38 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_5.pdf' is 2.89 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/freddie_part_6.pdf' is 2.22 MB


In [26]:
fannie_blobs = [doc_to_gcs(doc, f'parts/fannie_part_{d}') for d, doc in enumerate(fannie_parts)]

The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/fannie_part_0.pdf' is 1.43 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/fannie_part_1.pdf' is 1.40 MB
The file 'gs://statmike-mlops-349915/applied-genai/layout-parser-large-files/parts/fannie_part_2.pdf' is 1.23 MB


---
## Layout Parser

Document AI is composed of multiple processors.  In this case the [Layout Parser](https://cloud.google.com/document-ai/docs/layout-parse-chunk) is used for its ability to detect and extract paragraphs, tables, titles, headings, page headers, and page footers.  

For a more thorough review of Document AI processors, including customized parsers, see the [Working With/Document AI](../../Working%20With/Document%20AI/readme.md) section of this repository.  This repository includes examples of processing documents at larger scales and storing the data for processing and retrieval.

### Get/Create Processor: Layout Parser

In [27]:
PARSER_DISPLAY_NAME = 'my_layout_processor'
PARSER_TYPE = 'LAYOUT_PARSER_PROCESSOR'
PARSER_VERSION = 'pretrained-layout-parser-v1.0-2024-06-03'

# look for an existing processor with the chosen display name
parser = None
for p in docai_client.list_processors(parent = f'projects/{PROJECT_ID}/locations/{LOCATION}'):
    if p.display_name == PARSER_DISPLAY_NAME:
        parser = p
        break

if parser is not None:
    print('Retrieved existing parser: ', parser.name)
else:
    # no match found, so create the processor
    parser = docai_client.create_processor(
        parent = f'projects/{PROJECT_ID}/locations/{LOCATION}',
        processor = dict(display_name = PARSER_DISPLAY_NAME, type_ = PARSER_TYPE, default_processor_version = PARSER_VERSION)
    )
    print('Created New Parser: ', parser.name)

Retrieved existing parser:  projects/1026793852137/locations/us/processors/3779bd3a8f535977


---
## Process Documents

For a complete overview of online and batch processing options check out the companion workflow: [Process Documents - Document AI Layout Parser](./Process%20Documents%20-%20Document%20AI%20Layout%20Parser.ipynb).  Here batch processing is used to accommodate the file size and number of files.

### Batch Processing: Multiple Documents and/or Larger Documents

With batch processing there are two ways to specify documents: a list of documents with URIs, or a URI prefix to match.  Either of these would work for the `input_documents` parameter of batch processing here:

**List Each Document**
```
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_documents = documentai.GcsDocuments(
                documents = [
                    
                    documentai.GcsDocument(
                        gcs_uri = f'gs://{bucket.name}/{SERIES}/{EXPERIMENT}/files/document.pdf',
                        mime_type = 'application/pdf'
                    ),
                    documentai.GcsDocument(
                        gcs_uri = f'gs://{bucket.name}/{SERIES}/{EXPERIMENT}/files/small_document.pdf',
                        mime_type = 'application/pdf'
                    )
                ]
            )
        )
```
**Common Prefix For Documents**
```
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_prefix = documentai.GcsPrefix(
                gcs_uri_prefix = f'gs://{bucket.name}/{SERIES}/{EXPERIMENT}/files'
            )
        )
```

Reference:
- [google.cloud.documentai.DocumentProcessorServiceClient.batch_process_documents()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceClient#google_cloud_documentai_v1_services_document_processor_service_DocumentProcessorServiceClient_batch_process_documents)

In [29]:
batch_job = docai_client.batch_process_documents(
    request = documentai.BatchProcessRequest(
        name = parser.name,
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_prefix = documentai.GcsPrefix(
                gcs_uri_prefix = f'gs://{bucket.name}/{SERIES}/{EXPERIMENT}/parts/'
            )
        ),
        document_output_config = documentai.DocumentOutputConfig(
            gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = f'gs://{bucket.name}/{SERIES}/{EXPERIMENT}/parsing/parts'
            )
        ),
        process_options = documentai.ProcessOptions(
            layout_config = documentai.ProcessOptions.LayoutConfig(
                chunking_config = documentai.ProcessOptions.LayoutConfig.ChunkingConfig(
                    chunk_size = 200,
                    include_ancestor_headings = True,
                )
            )
        )
    )
)

**NOTE:** This could take a while (15-30 minutes). The next cell will continually check on the progress and hold up execution until the batch job is complete.  

In [30]:
print(f'Waiting on batch job to complete: {batch_job.operation.name}')

while batch_job.running():
    time.sleep(10)

batch_job.result()

print(documentai.BatchProcessMetadata(batch_job.metadata).state)

Waiting on batch job to complete: projects/1026793852137/locations/us/operations/11360179740237196273
State.SUCCEEDED
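
The polling loop above waits indefinitely, which can leave a notebook hanging if a job stalls. A minimal sketch of a bounded wait follows. `wait_for` is a hypothetical helper, not part of the Document AI client, and the default interval and timeout are illustrative assumptions.

```python
import time

def wait_for(is_done, interval=10, timeout=3600):
    """Poll is_done() every `interval` seconds; raise TimeoutError after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not is_done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f'still running after {timeout} seconds')
        time.sleep(interval)
    return True
```

With the batch job from this notebook, the call could look like `wait_for(lambda: not batch_job.running(), interval = 10, timeout = 3600)`, followed by `batch_job.result()` as before.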


#### Retrieve Document Parsing Results

In [34]:
batch_responses = []
for process in documentai.BatchProcessMetadata(batch_job.metadata).individual_process_statuses:
    matches = re.match(r"gs://(.*?)/(.*)", process.output_gcs_destination)
    output_bucket, output_prefix = matches.groups()
    output_blobs = bucket.list_blobs(prefix = output_prefix)
    for blob in output_blobs:
        response = documentai.Document.from_json(blob.download_as_bytes(), ignore_unknown_fields = True)
        batch_responses.append((blob.name.split('/')[-1], response))
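
The regular expression in the cell above separates a `gs://` URI into its bucket name and object prefix. The standalone sketch below shows the same idea with explicit error handling. `split_gcs_uri` is a hypothetical helper and the URI in the example is made up for illustration.

```python
import re

def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    match = re.match(r'gs://([^/]+)/?(.*)', uri)
    if match is None:
        raise ValueError(f'not a gs:// URI: {uri!r}')
    return match.groups()

# hypothetical output destination, for illustration only
print(split_gcs_uri('gs://my-bucket/applied-genai/parsing/parts/123/0'))
# ('my-bucket', 'applied-genai/parsing/parts/123/0')
```

Keeping the parsed bucket name is useful when the output destination can live in a different bucket than the one the inputs were read from.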

In [35]:
len(batch_responses)

10

#### Review response for a document:

The full contents of the response will be covered in the [Process Responses](#process-responses) section below.

In [36]:
batch_responses[0][0], batch_responses[1][0]

('fannie_part_0-0.json', 'fannie_part_1-0.json')

In [37]:
batch_responses[0][1].chunked_document.chunks[0]

chunk_id: "c1"
content: "Fannie Mae"
page_span {
  page_start: 1
  page_end: 1
}

In [39]:
batch_responses[-1][-1].chunked_document.chunks[0]

chunk_id: "c1"
content: "# (b) How to obtain a short sale property value and minimum net proceeds\n\nWith the exception of Mortgages secured by properties subject to resale restrictions (in accordance with Chapters 4406, 4502 or 4504, as applicable), the Servicer must submit a request to Freddie Mac for the short sale property value and the minimum net proceeds via the \342\200\234Obtain Valuation\342\200\235 tab in Freddie Mac Real Estate Valuation and Pricing tool when considering a Borrower for a short sale. The Servicer must advise the Borrower that the person evaluating the Mortgaged Premises must be given interior access and that the Borrower must otherwise cooperate with the inspection. An \342\200\234estimated market value\342\200\235 of the Mortgaged Premises and the \342\200\234minimum net proceeds\342\200\235 as determined by Freddie Mac will be returned by the Real Estate Valuation and Pricing tool with a \"good through date\342\200\235 indicating the expiration date of the

---
## Process Responses

Create and save the chunks for further processing, like adding text embeddings with the workflow: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

### Shape Data For Saving

In [40]:
chunks = [
    dict(
        gse = batch[0].split('_')[0],
        filename = batch[0].split('-')[0],
        file_chunk_id = chunk.chunk_id,
        chunk_id = batch[0].split('-')[0] + '_' + chunk.chunk_id,
        content = chunk.content,
    ) for batch in batch_responses for chunk in batch[1].chunked_document.chunks
]
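
The identifying fields above are derived entirely from the output file name and the chunk id within that file. The sketch below isolates that logic in a hypothetical helper, `chunk_keys`, so it can be checked on its own. It assumes, as in this run, that the part names contain no hyphen other than the one before the suffix added to the output file name.

```python
def chunk_keys(output_name, file_chunk_id):
    """Derive identifying fields from an output file name such as 'fannie_part_0-0.json'."""
    filename = output_name.split('-')[0]  # text before the first hyphen
    return dict(
        gse=filename.split('_')[0],       # text before the first underscore
        filename=filename,
        file_chunk_id=file_chunk_id,
        chunk_id=f'{filename}_{file_chunk_id}',
    )

print(chunk_keys('fannie_part_0-0.json', 'c1'))
# {'gse': 'fannie', 'filename': 'fannie_part_0', 'file_chunk_id': 'c1', 'chunk_id': 'fannie_part_0_c1'}
```

Because chunk ids such as `c1` restart in every file, prefixing them with the file name is what makes `chunk_id` unique across all parts.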

In [41]:
len(chunks)

9040

In [42]:
chunks[0]

{'gse': 'fannie',
 'filename': 'fannie_part_0',
 'file_chunk_id': 'c1',
 'chunk_id': 'fannie_part_0_c1',
 'content': 'Fannie Mae'}

In [43]:
chunks[-1]

{'gse': 'freddie',
 'filename': 'freddie_part_6',
 'file_chunk_id': 'c618',
 'chunk_id': 'freddie_part_6_c618',
 'content': "# 9701.23: Consent Agreement terms and conditions (10/09/24)\n\n## (c) Consent Agreements\n\nExhibit 33D, Acknowledgment Agreement (Combination) Incorporated Provisions, as applicable. In no event shall any Advance Financing be cross-collateralized with any Collateral under any Servicing Contract Rights Financing. Any collateral under any Advance Financing is and will continue to be at all times separate and distinct from any and all Collateral under any Servicing Contract Rights Financing.”\n\n# (d) Collateral Pledge Agreements\n\nFreddie Mac reserves the right to condition its entry into a Consent Agreement on the Servicer's pledge of collateral pursuant to a Collateral Pledge Agreement in substantially the form and substance of Exhibit 104, Collateral Pledge Agreement."}

### Save Data Locally

Also, commit the files with this repository for future use by other workflows!

In [49]:
len(chunks)

9040

In [48]:
start_chunk = 0
max_chunk = 1000
chunk_lists = []
while start_chunk < len(chunks):
    end_chunk = min(start_chunk + max_chunk, len(chunks))
    chunk_lists.append(chunks[start_chunk:end_chunk])
    start_chunk = end_chunk

In [52]:
sum([len(c) for c in chunk_lists])

9040

In [53]:
for c, cl in enumerate(chunk_lists):
    with open(f'{DIR}/document-chunks-{c:04d}.jsonl', 'w') as f:
        for chunk in cl:
            f.write(json.dumps(chunk)+ '\n')

In [54]:
os.listdir(DIR)

['document-chunks-0008.jsonl',
 'document-chunks-0002.jsonl',
 '.ipynb_checkpoints',
 'document-chunks-0003.jsonl',
 'document-chunks-0009.jsonl',
 'document-chunks-0006.jsonl',
 'document-chunks-0004.jsonl',
 'document-chunks-0001.jsonl',
 'document-chunks-0007.jsonl',
 'document-chunks-0005.jsonl',
 'document-chunks-0000.jsonl']
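
Other workflows can load these files back into a single list. The sketch below is one way to do that, assuming the `document-chunks-NNNN.jsonl` naming used above. `read_chunks` is a hypothetical helper, and the round trip uses made-up rows in a temporary directory rather than the real outputs. Sorting the names matters because `os.listdir` returns entries in arbitrary order, and the name filter skips entries such as `.ipynb_checkpoints`.

```python
import json
import os
import tempfile

def read_chunks(directory):
    """Load every document-chunks-*.jsonl file in name order and return one list."""
    names = sorted(
        n for n in os.listdir(directory)
        if n.startswith('document-chunks-') and n.endswith('.jsonl')
    )
    rows = []
    for name in names:
        with open(os.path.join(directory, name)) as f:
            rows.extend(json.loads(line) for line in f if line.strip())
    return rows

# round trip with hypothetical rows in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, '.ipynb_checkpoints'))
    sample = [{'chunk_id': f'demo_c{i}', 'content': f'text {i}'} for i in range(5)]
    for part, start in enumerate(range(0, len(sample), 2)):
        with open(os.path.join(tmp, f'document-chunks-{part:04d}.jsonl'), 'w') as f:
            for row in sample[start:start + 2]:
                f.write(json.dumps(row) + '\n')
    restored = read_chunks(tmp)

print(restored == sample)  # True
```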