<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/Document%20AI%20-%20Process%20Documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FWorking%2520With%2FDocument%2520AI%2FDocument%2520AI%2520-%2520Process%2520Documents.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/Document%20AI%20-%20Process%20Documents.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Working%20With/Document%20AI/Document%20AI%20-%20Process%20Documents.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Document AI - Process Documents
> From the [Working With Document AI](https://github.com/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/readme.md) series in the [vertex-ai-mlops](https://github.com/statmike/vertex-ai-mlops/blob/main/readme.md) repository.

Document AI is an API where you interact with processors to extract information from documents.  You enable the API, create an instance of a processor in your project, send in document(s), and receive back JSON with the extracted information:

<p align="center" width="100%"><center>
    <img src="../../architectures/architectures/images/working with/documentai/readme/high_level.png">
</center></p>

This workflow covers all the ways to process a document, or many documents, using Python as the client. For details on how to extract elements from the responses, see the next workflow: [Document AI - Process Responses](./Document%20AI%20-%20Process%20Responses.ipynb).

---
**Documents**

Document AI sources are documents.  There are many supported document types (file formats):
- Supported [Document Types](https://cloud.google.com/document-ai/docs/file-types) like pdf, gif, tiff, jpeg, png, bmp, webp
- Additional support for [DocX files is in preview](https://cloud.google.com/document-ai/docs/enterprise-document-ocr#supported_file_formats).
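Before sending a file, it can help to check its extension against the supported types and pick the `mime_type` string that the request objects expect. The following is a minimal sketch: `SUPPORTED_MIME_TYPES` and `guess_docai_mime_type` are hypothetical names used only for illustration, and the mapping covers just the formats listed above. The linked page remains the authoritative list.

```python
import os

# Assumed mapping of file extensions to MIME types for the formats listed above.
SUPPORTED_MIME_TYPES = {
    '.pdf': 'application/pdf',
    '.gif': 'image/gif',
    '.tif': 'image/tiff',
    '.tiff': 'image/tiff',
    '.jpg': 'image/jpeg',
    '.jpeg': 'image/jpeg',
    '.png': 'image/png',
    '.bmp': 'image/bmp',
    '.webp': 'image/webp',
}

def guess_docai_mime_type(path):
    """Return the MIME type for a supported file, or None if unsupported."""
    extension = os.path.splitext(path)[1].lower()
    return SUPPORTED_MIME_TYPES.get(extension)

print(guess_docai_mime_type('Baseball - Wikipedia.pdf'))  # application/pdf
print(guess_docai_mime_type('notes.txt'))                 # None
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or raise an error.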

---
**Processing**

Processing can be orchestrated with one of the [client libraries](https://cloud.google.com/document-ai/docs/libraries), [REST](https://cloud.google.com/document-ai/docs/reference/rest), or [RPC](https://cloud.google.com/document-ai/docs/reference/rpc).  This workflow will use the [Python Client for Document AI](https://cloud.google.com/python/docs/reference/documentai/latest).

```
from google.cloud import documentai

docai = documentai.DocumentProcessorServiceClient()
```

> There is also an async client that can be used.  The methods have the same names and can be awaited with `await`:
> - `docai_async = documentai.DocumentProcessorServiceAsyncClient()`

Processing can be done online (one document) or in batch (multiple documents):
- online (one document):
    - `docai.process_document(request = documentai.types.ProcessRequest(...))`
- batch (multiple documents):
    - `docai.batch_process_documents(request = documentai.types.BatchProcessRequest(...))`

---
**Inputs & Outputs**

The following table breaks down the input and output locations by the type of processing:


<table style='text-align:center;vertical-align:middle;border:1px solid black' width="90%" cellpadding="1" cellspacing="0">
    <caption>Inputs & Outputs</caption>
    <col>
    <col>
    <col>
<!--..........................................................................................-->
    <thead>
        <tr>
            <th scope="col" style="width:20%">
                Processing Mode
            </th>
            <th scope="col" style="width:40%">
                Inputs
            </th>
            <th scope="col" style="width:40%">
                Outputs
            </th>
        </tr>
    </thead>
    <tbody>
<!--..........................................................................................-->
        <tr>
            <td>
                Online<br>(Single Document Per Request)
            </td>
            <td>
                <table>
                    <tr style='text-align:center'>
                        <td>One of:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>Document in GCS:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                        <pre>
response = docai.process_document(
    request = documentai.types.ProcessRequest(
        <b>inline_document</b> = documentai.types.Document(
            uri = 'gs://bucket/path/to/object.ext'
        )
    )
)
                        </pre>
                        </td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>Document as bytes</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                        <pre>
response = docai.process_document(
    request = documentai.types.ProcessRequest(
        # provide a bytes object
        <b>raw_document</b> = documentai.types.RawDocument(
            content = 
        )
    )
)
                        </pre>
                        </td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>Document in GCS</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                        <pre>
response = docai.process_document(
    request = documentai.types.ProcessRequest(
        # provide GCS URI as string
        <b>gcs_document</b> = documentai.types.GcsDocument(
            gcs_uri = 'gs://bucket/path/to/object'
        )
    )
)
                        </pre>
                        </td>
                    </tr>
                </table>
            </td>
            <td  style='text-align:left'>
                The response is an object containing the processed document.
                <br><pre>type(response) is documentai.types.ProcessResponse</pre>
                <br><br>It has an attribute holding the document:
                <br><pre>type(response.document) is documentai.types.Document</pre>
                <br><br>The document object has attributes for the document components, like:
                <ul>
                    <li>response.document.text is a string with the full text of the document</li>
                    <li>response.document.pages is a list of documentai.types.Document.Page objects</li>
                    <li>response.document.entities is a list of documentai.types.Document.Entity objects</li>
                </ul>
                <br>The document class provides methods for converting to Python objects:
                <ul>
                    <li>documentai.types.Document.to_dict(response.document) for a dictionary</li>
                    <li>documentai.types.Document.to_json(response.document) for a JSON string</li>
                </ul>
            </td>
        </tr>
<!--..........................................................................................-->
        <tr>
            <td>
                Batch<br>(Multiple Documents Per Request)
            </td>
            <td>
                <table>
                    <tr style='text-align:center'>
                        <td>One of:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>List of documents in GCS:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                <pre>
docai.batch_process_documents(
    request = documentai.types.BatchProcessRequest(
        <b>input_documents</b> = documentai.types.BatchDocumentsInputConfig(
            # provide a list of document objects that each have parameter gcs_uri = GCS URI as string
            <b>gcs_documents</b> = documentai.types.GcsDocuments(
                documents = [documentai.types.GcsDocument(gcs_uri = ), ...]
            )
        )
    )
)
                </pre>
                        </td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>All documents with GCS prefix:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                            <pre>
docai.batch_process_documents(
    request = documentai.types.BatchProcessRequest(
        <b>input_documents</b> = documentai.types.BatchDocumentsInputConfig(
            # provide a GCS URI (prefix) as string
            <b>gcs_prefix</b> = documentai.types.GcsPrefix(
                gcs_uri_prefix = 
            )
        )
    )
)
                            </pre>
                        </td>
                    </tr>
                </table>
            </td>
            <td style='text-align:left'>
                The batch processing job includes a parameter for configuring the output location of JSON files in GCS.<br><br>
                <pre>
docai.batch_process_documents(
    request = documentai.BatchProcessRequest(
        <b>document_output_config</b> = documentai.types.DocumentOutputConfig(
            <b>gcs_output_config</b> = documentai.types.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = 'gs://bucket/path/to/output', # the output JSON will be written to this directory
                field_mask = , # optional: fields to include in output
                sharding_config = # optional: sharding config for output
            )
        )
    )
)
                </pre>
            </td>
        </tr>     
<!--..........................................................................................-->
    </tbody>
</table>


---
**Processing Specifics**

There are limits to processing requests:
- the number of requests that can be made over a period of time: [Quotas](https://cloud.google.com/document-ai/quotas#quotas)
- the amount and size of content (documents, pages): [Content Limits](https://cloud.google.com/document-ai/quotas#content_limits)
- the processing request for each processor (parser) also has limits: [Processor Specific Limits](https://cloud.google.com/document-ai/quotas#processor_limits)

What does this mean in practice?  Let's walk through a single processor, the OCR parser. [This page](https://cloud.google.com/document-ai/docs/processors-list) has all the specifics for each parser.
- Parser Limits: the OCR parser
    - limit of 15 pages for an online request and 500 pages for a batch request
- Content Limits:
    - file size: 20MB online, and 1GB batch
    - files: 1 for online, 5000 for batch
        - but the OCR parser has a 500 page limit for batch
    - If the file type is an image (not PDF) then each page can be a max of 40 megapixels
- Requests (Quotas):
    - overall
        - 10,000 active pages per project
    - users:
        - 1800 requests per minute
    - online (per minute):
        - 600 per project
        - 120 per project/processor/multi-region (US, EU)
        - 6 per project/processor/single-region
    - batch (concurrent jobs):
        - 10 per project
        - 5 per project/multi-region
        - 5 per project/single-region
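The page and file size limits above can be turned into a simple routing rule for a single document. This is a minimal sketch under stated assumptions: the helper `choose_processing_mode` is hypothetical, the numbers are copied from the OCR parser limits listed above (they can change, so check the linked pages), and megabytes are assumed to be binary.

```python
# Limits for the OCR parser as listed above; treat these as a snapshot, not a guarantee.
MB = 1024 ** 2  # assumption: binary megabytes
ONLINE_LIMITS = {'pages': 15, 'bytes': 20 * MB}
BATCH_LIMITS = {'pages': 500, 'bytes': 1024 * MB}

def choose_processing_mode(page_count, size_bytes):
    """Return 'online', 'batch', or 'too_large' for a single document."""
    if page_count <= ONLINE_LIMITS['pages'] and size_bytes <= ONLINE_LIMITS['bytes']:
        return 'online'
    if page_count <= BATCH_LIMITS['pages'] and size_bytes <= BATCH_LIMITS['bytes']:
        return 'batch'
    return 'too_large'

print(choose_processing_mode(page_count = 12, size_bytes = 3 * MB))   # online
print(choose_processing_mode(page_count = 40, size_bytes = 3 * MB))   # batch
print(choose_processing_mode(page_count = 900, size_bytes = 3 * MB))  # too_large
```

A document flagged as `too_large` would need to be split before it is submitted.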


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/Document%20AI%20-%20Process%20Documents.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [2]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [3]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [4]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.documentai', 'google-cloud-documentai'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('PyPDF2', 'PyPDF2')
]

import importlib.util
install = False
for import_name, install_name in packages:
    try:
        found = importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised when a parent package (like google.cloud) is not installed
        found = False
    if not found:
        print(f'installing package {install_name}')
        install = True
        !pip install {install_name} -U -q --user

### API Enablement

In [6]:
!gcloud services enable documentai.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, resume running the notebook from the cell after this one.

In [7]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [8]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [9]:
REGION = 'us-central1'
SERIES = 'working-with-docai'
EXPERIMENT = 'process-documents'

# make this the gcs bucket for storing files
GCS_BUCKET = PROJECT_ID

# BigQuery Objects
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE_PREFIX = EXPERIMENT

Packages

In [135]:
import os, shutil, glob, json, asyncio, datetime, io

import PyPDF2

from google.cloud import documentai
from google.cloud import storage
from google.cloud import bigquery

Clients

In [11]:
# document AI client
LOCATION = REGION.split('-')[0]
docai = documentai.DocumentProcessorServiceClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)

# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)

# bq client
bq = bigquery.Client(project = PROJECT_ID)
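The client above derives `LOCATION` with `REGION.split('-')[0]`, which gives `us` for `us-central1`. For a European region the same expression gives `europe`, while the Document AI multi-region is named `eu`. The sketch below makes that mapping explicit. The helper names and the mapping are assumptions for illustration, and only the two multi-regions mentioned in this workflow (US, EU) are covered.

```python
# Assumed mapping from the first token of a region name to a Document AI multi-region.
LOCATION_BY_REGION_PREFIX = {'us': 'us', 'europe': 'eu'}

def docai_location(region):
    """Map a region such as 'us-central1' to a Document AI multi-region location."""
    prefix = region.split('-')[0]
    if prefix not in LOCATION_BY_REGION_PREFIX:
        raise ValueError(f'No Document AI multi-region assumed for region {region!r}')
    return LOCATION_BY_REGION_PREFIX[prefix]

def docai_endpoint(region):
    """Build the regional API endpoint string passed as `api_endpoint`."""
    return f'{docai_location(region)}-documentai.googleapis.com'

print(docai_endpoint('us-central1'))   # us-documentai.googleapis.com
print(docai_endpoint('europe-west4'))  # eu-documentai.googleapis.com
```

Raising on an unknown prefix surfaces a misconfigured region early rather than at the first API call.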

---
## Documents

This section prepares documents for processing.  The documents are in a local folder in the repository and are prepared for online and batch processing by either loading them directly or copying them to a GCS location within the bucket defined above with parameter `GCS_BUCKET`.

The files in the local folder `../shared files/docs` are PDF printouts of the following [Wikipedia](https://www.wikipedia.org/) pages:

|Document Name|Link|
|---|---|
|`../shared files/docs/Bayes' theorem - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Bayes%27_theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem)|
|`../shared files/docs/sports/Baseball - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Baseball](https://en.wikipedia.org/wiki/Baseball)|
|`../shared files/docs/sports/Football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Football](https://en.wikipedia.org/wiki/Football)|
|`../shared files/docs/sports/Association football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Association_football](https://en.wikipedia.org/wiki/Association_football)|
|`../shared files/docs/sports/American football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/American_football](https://en.wikipedia.org/wiki/American_football)|
|`../shared files/docs/sports/Hockey - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Hockey](https://en.wikipedia.org/wiki/Hockey)|
|`../shared files/docs/sports/Basketball - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Basketball](https://en.wikipedia.org/wiki/Basketball)|
|`../shared files/docs/sports/Cricket - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Cricket](https://en.wikipedia.org/wiki/Cricket)|
|`../shared files/docs/sports/Rugby football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Rugby_football](https://en.wikipedia.org/wiki/Rugby_football)|
|`../shared files/docs/sports/Golf - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Golf](https://en.wikipedia.org/wiki/Golf)|
|`../shared files/docs/jam_bands/Jam band - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Jam_band](https://en.wikipedia.org/wiki/Jam_band)|
|`../shared files/docs/jam_bands/Widespread Panic - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Widespread_Panic](https://en.wikipedia.org/wiki/Widespread_Panic)|
|`../shared files/docs/jam_bands/Cream (band) - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Cream_(band)](https://en.wikipedia.org/wiki/Cream_(band))|
|`../shared files/docs/jam_bands/Phish - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Phish](https://en.wikipedia.org/wiki/Phish)|
|`../shared files/docs/jam_bands/The Allman Brothers Band - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/The_Allman_Brothers_Band](https://en.wikipedia.org/wiki/The_Allman_Brothers_Band)|
|`../shared files/docs/jam_bands/Grateful Dead - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Grateful_Dead](https://en.wikipedia.org/wiki/Grateful_Dead)|



### Get The Documents

If you are working from a clone of this notebook's repository then the documents are already present. The following cell checks for the documents folder, `../shared files/docs`, and retrieves it with `git clone` if it is missing:

In [12]:
if not os.path.exists('../shared files/docs'):
    print('Retrieving documents...')
    if not os.path.exists('temp'):
        os.makedirs('temp')
    !git clone https://www.github.com/statmike/vertex-ai-mlops temp/vertex-ai-mlops
    shutil.copytree('temp/vertex-ai-mlops/Working With/Document AI/shared files/docs', '../shared files/docs')
    shutil.rmtree('temp/vertex-ai-mlops')
    print('Documents are now in folder `../shared files/docs`')
else:
    print('Documents Found in folder `../shared files/docs`')

Documents Found in folder `../shared files/docs`


### Copy Documents To GCS

Make a copy of the `../shared files/docs` folder in the GCS Bucket defined above with parameter `GCS_BUCKET`.  This will add a prefix (folder structure) of `/{SERIES}/{EXPERIMENT}`.

In [57]:
glob.glob(f'../shared files/docs/**/**')

['../shared files/docs/jam_bands/Widespread Panic - Wikipedia.pdf',
 '../shared files/docs/jam_bands/Cream (band) - Wikipedia.pdf',
 '../shared files/docs/jam_bands/The Allman Brothers Band - Wikipedia.pdf',
 '../shared files/docs/jam_bands/Jam band - Wikipedia.pdf',
 '../shared files/docs/jam_bands/Grateful Dead - Wikipedia.pdf',
 '../shared files/docs/jam_bands/Phish - Wikipedia.pdf',
 '../shared files/docs/sports/Golf - Wikipedia.pdf',
 '../shared files/docs/sports/Cricket - Wikipedia.pdf',
 '../shared files/docs/sports/Hockey - Wikipedia.pdf',
 '../shared files/docs/sports/Association football - Wikipedia.pdf',
 '../shared files/docs/sports/American football - Wikipedia.pdf',
 '../shared files/docs/sports/Football - Wikipedia.pdf',
 '../shared files/docs/sports/Rugby football - Wikipedia.pdf',
 '../shared files/docs/sports/Baseball - Wikipedia.pdf',
 '../shared files/docs/sports/Basketball - Wikipedia.pdf']

In [58]:
for file in glob.glob(f'../shared files/docs/**/**'):
    blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/{file[3:]}')
    blob.upload_from_filename(file)
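Without `recursive = True`, `glob` treats `**` like `*`, so the pattern above matches only files exactly one folder below `docs`. That is why `Bayes' theorem - Wikipedia.pdf`, which sits directly in `docs`, does not appear in the listing. If every PDF at any depth is wanted, a sketch like the following could be used. The helpers `find_pdfs` and `blob_name_for` are hypothetical, and the demonstration runs on a throwaway folder rather than the real documents.

```python
import os
import tempfile
from pathlib import Path

def find_pdfs(root):
    """Return every .pdf under root, at any depth, as sorted POSIX-style relative paths."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob('*.pdf'))

def blob_name_for(relative_path, series, experiment, folder = 'shared files/docs'):
    """Build an object name shaped like the ones above: {series}/{experiment}/{folder}/..."""
    return f'{series}/{experiment}/{folder}/{relative_path}'

# Demonstrate on a temporary folder that mimics the layout of the docs folder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sports'))
    for name in ['top level.pdf', os.path.join('sports', 'Baseball - Wikipedia.pdf')]:
        Path(tmp, name).touch()
    found = find_pdfs(tmp)

print(found)
print(blob_name_for(found[0], 'working-with-docai', 'process-documents'))
```

Each returned name could then be passed to `bucket.blob(...)` in place of the `file[3:]` slice used above.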

In [59]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/{SERIES}/{EXPERIMENT};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/working-with-docai/process-documents;tab=objects&project=statmike-mlops-349915


List files in bucket:

In [60]:
for blob in list(bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/shared files/docs')):
    print(blob.name)

working-with-docai/process-documents/shared files/docs/jam_bands/Cream (band) - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/jam_bands/Grateful Dead - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/jam_bands/Jam band - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/jam_bands/Phish - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/jam_bands/The Allman Brothers Band - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/jam_bands/Widespread Panic - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/sports/American football - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/sports/Association football - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/sports/Basketball - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/spor

---
## Processors

When submitting documents for processing in Document AI, the client routes the document to a processor.  There are many processors:
- [Full processor and detail list](https://cloud.google.com/document-ai/docs/processors-list)
- Check out the helpful table for processors in this workflow's [readme file](./readme.md)

When setting up a processor you can pick a specific version; otherwise the processor uses its default version.

This section shows how to:
- list available processors in the project: console and Python Client
    - describe processor(s)
- get/create a processor with desired type and version


### List Processors In This Project

List the processors that have already been created in this project, if any:

In [61]:
processors = list(docai.list_processors(parent = f'projects/{PROJECT_ID}/locations/{LOCATION}'))
len(processors)

4

In [62]:
if processors:
    print(f'View the processors in the console with this link:\nhttps://console.cloud.google.com/ai/document-ai/processors?project={PROJECT_ID}\n\n')
    for p, processor in enumerate(processors):
        print(
            f'Processors {p}: ', processor.display_name, 
            '\n\tis of type = ', processor.type_, 
            '\n\tand version = ',processor.default_processor_version.split('/')[-1])

View the processors in the console with this link:
https://console.cloud.google.com/ai/document-ai/processors?project=statmike-mlops-349915


Processors 0:  working-with-docai 
	is of type =  OCR_PROCESSOR 
	and version =  pretrained-ocr-v2.0-2023-06-02
Processors 1:  example-dot 
	is of type =  CUSTOM_EXTRACTION_PROCESSOR 
	and version =  pretrained-foundation-model-v1.0-2023-08-22
Processors 2:  my-invoice 
	is of type =  INVOICE_PROCESSOR 
	and version =  pretrained-invoice-v1.3-2022-07-15
Processors 3:  my_general_processor 
	is of type =  FORM_PARSER_PROCESSOR 
	and version =  pretrained-form-parser-v1.0-2020-09-23


### Create/Get A Processor

For this workflow we will use the [OCR parser](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr). We check the project for an existing OCR processor whose default version matches the desired version and, if none is found, create one.  The processor is stored in the Python variable `PARSER` and is referred to as a parser from here on.

Get the type and version from the list of available processors: https://cloud.google.com/document-ai/docs/processors-list

In [63]:
TYPE = 'OCR_PROCESSOR'
VERSION = 'pretrained-ocr-v2.0-2023-06-02'

Get an existing processor:

In [64]:
PARSER = ''
for processor in processors:
    if processor.type_ == TYPE and processor.default_processor_version.split('/')[-1] == VERSION:
        PARSER = processor
        break
        
if PARSER:
    print(f'There is an existing processor with the desired type and version in PARSER = {PARSER.display_name}')
else:
    print(f'Need to create a processor for the desired type and version: {TYPE}, {VERSION}')

There is an existing processor with the desired type and version in PARSER = working-with-docai
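The lookup loop above can also be written as a small function that returns `None` when nothing matches. This is a sketch only: `find_processor` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic just the attributes read in the cell above (`type_`, `default_processor_version`, `display_name`). Their resource paths are placeholders, not real processors.

```python
from types import SimpleNamespace

def find_processor(processors, type_, version):
    """Return the first processor matching type and default version, else None."""
    return next(
        (
            p for p in processors
            if p.type_ == type_ and p.default_processor_version.split('/')[-1] == version
        ),
        None
    )

# Stand-in objects with placeholder resource paths, used only for this illustration.
stand_ins = [
    SimpleNamespace(
        display_name = 'my-invoice',
        type_ = 'INVOICE_PROCESSOR',
        default_processor_version = 'projects/p/locations/us/processors/a/processorVersions/pretrained-invoice-v1.3-2022-07-15'
    ),
    SimpleNamespace(
        display_name = 'working-with-docai',
        type_ = 'OCR_PROCESSOR',
        default_processor_version = 'projects/p/locations/us/processors/b/processorVersions/pretrained-ocr-v2.0-2023-06-02'
    ),
]

match = find_processor(stand_ins, 'OCR_PROCESSOR', 'pretrained-ocr-v2.0-2023-06-02')
print(match.display_name if match else 'no match')  # working-with-docai
```

With the real client, the list returned by `docai.list_processors(...)` would take the place of `stand_ins`.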


Create the processor if an existing one was not found to match:

In [65]:
if not PARSER:
    PARSER = docai.create_processor(
        parent = f'projects/{PROJECT_ID}/locations/{LOCATION}',
        processor = documentai.Processor(
            display_name = SERIES,
            type_ = TYPE
        )
    )
    set_default = docai.set_default_processor_version(
        request = documentai.SetDefaultProcessorVersionRequest(
            processor = PARSER.name,
            default_processor_version = f'{PARSER.name}/processorVersions/{VERSION}'
        )
    )
    set_default.result()
    PARSER = docai.get_processor(
        name = PARSER.name
    )
    print(f'Processor created and in PARSER variable with display name = {PARSER.display_name}')

---
## Online Processing (single document)

There are three ways to provide a single document to the client and each is covered in this section.

> NOTE: The [OCR Processor](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr) has page limits of 15 for online and 500 for batch processing.

The following is the Python client reference to use for this online processing section:
- [google.cloud.documentai.DocumentProcessorServiceClient.process_document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceClient#google_cloud_documentai_v1_services_document_processor_service_DocumentProcessorServiceClient_process_document)

Using the processor stored in `PARSER` from above:

In [66]:
PARSER.name

'projects/1026793852137/locations/us/processors/d59e19cc08278630'
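Note that the resource name carries the project number rather than the project ID, along with the location and the processor ID. When those pieces are needed separately, they can be pulled apart with a sketch like this one; `parse_processor_name` is a hypothetical helper, not part of the client library.

```python
def parse_processor_name(name):
    """Split 'projects/{p}/locations/{l}/processors/{id}' into its parts."""
    parts = name.split('/')
    if len(parts) != 6 or parts[0::2] != ['projects', 'locations', 'processors']:
        raise ValueError(f'Unexpected processor resource name: {name!r}')
    return dict(zip(['project', 'location', 'processor_id'], parts[1::2]))

info = parse_processor_name('projects/1026793852137/locations/us/processors/d59e19cc08278630')
print(info)
```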

Specify the location, both the local folder and GCS, of one of the document samples:

In [67]:
local_doc_location = '../shared files/docs/sports/Baseball - Wikipedia.pdf'
gcs_doc_location = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/{local_doc_location[3:]}'

Read the document to a bytes object:

In [68]:
with open(local_doc_location, 'rb') as f:
    local_doc = f.read()

In [69]:
type(local_doc)

bytes

---
### Document as bytes: `inline_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - (This One) `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [70]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        inline_document = documentai.Document(
            content = local_doc,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 5
        )
    )
)

In [71]:
len(response.document.pages)

5

In [72]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
Baseball - Wikipedia
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
✰ B
ETTS
Baseball is a bat-and-ball sport played between two
Baseball
teams of nine players each, taking turns batting and
fielding. The game
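The request above sets `from_start = 5`, so only the first five pages are processed, which is why `len(response.document.pages)` returns 5. For a document longer than the online page limit, one possible approach (an assumption here, not something this workflow does) is to send several requests that each select a slice of pages. The hypothetical helper `page_chunks` below only computes the 1-indexed page lists; each list could then be supplied through the `individual_page_selector` field of `documentai.ProcessOptions`, which should be verified against the client reference before relying on it.

```python
def page_chunks(total_pages, limit = 15):
    """Split pages 1..total_pages into consecutive lists of at most `limit` pages."""
    if total_pages < 0 or limit < 1:
        raise ValueError('total_pages must be >= 0 and limit must be >= 1')
    return [
        list(range(start, min(start + limit, total_pages + 1)))
        for start in range(1, total_pages + 1, limit)
    ]

chunks = page_chunks(32)
print([(c[0], c[-1]) for c in chunks])  # [(1, 15), (16, 30), (31, 32)]
```

The default of 15 mirrors the online page limit noted for the OCR parser.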


---
### Document as bytes: `raw_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - (This One) `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [73]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        raw_document = documentai.RawDocument(
            content = local_doc,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 5
        )
    )
)

In [74]:
len(response.document.pages)

5

In [75]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
Baseball - Wikipedia
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
✰ B
ETTS
Baseball is a bat-and-ball sport played between two
Baseball
teams of nine players each, taking turns batting and
fielding. The game


---
### Document in GCS: `gcs_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - (This One) `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [76]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        gcs_document = documentai.GcsDocument(
            gcs_uri = gcs_doc_location,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 5
        )
    )
)

In [77]:
len(response.document.pages)

5

In [78]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
Baseball - Wikipedia
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
✰ B
ETTS
Baseball is a bat-and-ball sport played between two
Baseball
teams of nine players each, taking turns batting and
fielding. The game
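The `gcs_uri` passed above is a plain `gs://` string. When the same object needs to be read back with the storage client, which works with a bucket name and an object name, the URI has to be split. A minimal sketch follows; `split_gcs_uri` is a hypothetical helper.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket_name, object_name)."""
    scheme = 'gs://'
    if not uri.startswith(scheme):
        raise ValueError(f'Not a GCS URI: {uri!r}')
    bucket_name, _, object_name = uri[len(scheme):].partition('/')
    return bucket_name, object_name

bucket_name, object_name = split_gcs_uri(
    'gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf'
)
print(bucket_name)
print(object_name)
```

The resulting `object_name` has the same form as the blob names printed when listing the bucket earlier.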


---
## Batch Processing (multiple documents)

There are two ways to provide documents to the client and each is covered in this section.

> NOTE: The [OCR Processor](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr) has page limits of 15 for online and 500 for batch processing.

The following is the Python client reference to use for this batch processing section:
- [google.cloud.documentai.DocumentProcessorServiceClient.batch_process_documents()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceClient#google_cloud_documentai_v1_services_document_processor_service_DocumentProcessorServiceClient_batch_process_documents)

In [79]:
for blob in list(bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/shared files/docs/sports/B')):
    print(blob.name)

working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf
working-with-docai/process-documents/shared files/docs/sports/Basketball - Wikipedia.pdf


---
### Documents in GCS listed: `gcs_documents`

Specify a batch job with a list of one or more documents in GCS.

Reference:
- [documentai.BatchProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchProcessRequest)
    - `input_documents` = [documentai.BatchDocumentsInputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchDocumentsInputConfig)
        - `gcs_prefix` = [documentai.GcsPrefix()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsPrefix)
        - `gcs_documents` = [documentai.GcsDocuments](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocuments)
    - `document_output_config` = [documentai.DocumentOutputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.DocumentOutputConfig)

In [84]:
batch_job = docai.batch_process_documents(
    request = documentai.BatchProcessRequest(
        name = PARSER.name,
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_documents = documentai.GcsDocuments(
                documents = [
                    documentai.GcsDocument(
                        gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/shared files/docs/sports/Baseball - Wikipedia.pdf', 
                        mime_type = 'application/pdf'
                    ),
                    documentai.GcsDocument(
                        gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/shared files/docs/sports/Basketball - Wikipedia.pdf', 
                        mime_type = 'application/pdf'
                    )
                ]
            )
        ),
        document_output_config = documentai.DocumentOutputConfig(
            gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/parsing'
            )
        ),
    )
)

In [85]:
print(f'Waiting on batch job to complete: {batch_job.operation.name}')
batch_job.result()

Waiting on batch job to complete: projects/1026793852137/locations/us/operations/7439028764441085426




List the input and output locations for each document processed:

In [86]:
for d, doc in enumerate(batch_job.metadata.individual_process_statuses):
    print(f'Document {d}:\n\t{doc.input_gcs_source}\n\t{doc.output_gcs_destination}\n')

Document 0:
	gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf
	gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/7439028764441085426/0

Document 1:
	gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Basketball - Wikipedia.pdf
	gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/7439028764441085426/1



Read the results for each document and shard:

In [87]:
responses = []
for document in batch_job.metadata.individual_process_statuses:
    shards = []
    for shard in gcs.list_blobs(bucket, prefix = document.output_gcs_destination.split(f'gs://{GCS_BUCKET}/')[1]):
        if shard.content_type == 'application/json':
            print(shard.name)
            shards.append(
                documentai.Document.from_json(
                    shard.download_as_bytes(), 
                    ignore_unknown_fields = True
                )
            )
    responses.append(shards)

working-with-docai/process-documents/parsing/7439028764441085426/0/Baseball - Wikipedia-0.json
working-with-docai/process-documents/parsing/7439028764441085426/0/Baseball - Wikipedia-1.json
working-with-docai/process-documents/parsing/7439028764441085426/0/Baseball - Wikipedia-2.json
working-with-docai/process-documents/parsing/7439028764441085426/0/Baseball - Wikipedia-3.json
working-with-docai/process-documents/parsing/7439028764441085426/1/Basketball - Wikipedia-0.json
working-with-docai/process-documents/parsing/7439028764441085426/1/Basketball - Wikipedia-1.json
working-with-docai/process-documents/parsing/7439028764441085426/1/Basketball - Wikipedia-2.json
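
The `prefix` passed to `list_blobs` above is derived by splitting the `gs://` URI on the bucket name. The following is a minimal sketch of the same idea as a reusable helper. The names `split_gcs_uri` and `as_folder_prefix` are hypothetical and not part of any Google client library. The trailing slash is a suggested safeguard, assuming output folders keep being numbered `0`, `1`, `2`, and so on: a prefix ending in `/1` would also match `/10` if a job processed more than ten documents.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket_name, _, object_name = uri[len(scheme):].partition("/")
    return bucket_name, object_name


def as_folder_prefix(object_name: str) -> str:
    """Append a trailing slash so a prefix like '.../1' does not also match '.../10'."""
    return object_name if object_name.endswith("/") else object_name + "/"


# Example using an output destination printed earlier in this section
bucket_name, object_name = split_gcs_uri(
    "gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/7439028764441085426/1"
)
prefix = as_folder_prefix(object_name)
print(bucket_name)
print(prefix)
```

The resulting `prefix` could be passed to `gcs.list_blobs(bucket, prefix = prefix)` in place of the inline `split()` expression.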


The output is sharded into multiple files.  Review the number of pages, page range, and the start of the text from the OCR for each shard:

In [88]:
for document in responses:
    for shard in document:
        print('This shard:\n',
              f'\tHas {len(shard.pages)} pages: {[page.page_number for page in shard.pages]}',
              f'\n\tThe text starts with:\n{shard.text[0:200]}\n\n'
             )

This shard:
 	Has 10 pages: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
✰ B
ETTS
Baseball is a bat-and-ball sport played between two
Baseball
teams of nine players


This shard:
 	Has 10 pages: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
a number of competitions between clubs from different countries. Other
competitions between national teams, such as the Baseball World Cup
and the Olympic baseba


This shard:
 	Has 10 pages: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
4. Thurston (2000), p. 15; "Official Rules/Foreword" (http://mlb.mlb.com/mlb/official_info/official_rule
s/foreword.jsp). Major League Baseball. Archived (https:


This shard:
 	Has 5 pages: [31, 32, 33, 34, 35] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
170. "Open
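
Cloud Storage lists objects in lexicographic order by name, so once a document produces more than ten shards a file ending in `-10.json` is listed before `-2.json`. The sketch below sorts shard names by their numeric suffix instead. It assumes the `<name>-<index>.json` naming seen in the output above; `shard_index` and the `Report` file names are hypothetical and used only for illustration.

```python
import re


def shard_index(blob_name: str) -> int:
    """Return the trailing shard number from a name like '.../Report-3.json'."""
    match = re.search(r"-(\d+)\.json$", blob_name)
    if match is None:
        raise ValueError(f"no shard index found in {blob_name!r}")
    return int(match.group(1))


# Twelve made-up shard names for one document
names = [f"parsing/0/Report-{i}.json" for i in range(12)]

lexicographic = sorted(names)             # the order a name-sorted listing yields
by_shard = sorted(names, key=shard_index)  # the order of the pages in the document

print(lexicographic[:4])
print(by_shard[:4])
```

In the loops above, the same key could be applied to the blobs with `sorted(blobs, key = lambda b: shard_index(b.name))` before the shards are parsed.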

---
### Documents in GCS with prefix: `gcs_prefix`

Specify a batch job for all documents with the same GCS prefix.

Reference:
- [documentai.BatchProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchProcessRequest)
    - `input_documents` = [documentai.BatchDocumentsInputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchDocumentsInputConfig)
        - `gcs_prefix` = [documentai.GcsPrefix()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsPrefix)
        - `gcs_documents` = [documentai.GcsDocuments](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocuments)
    - `document_output_config` = [documentai.DocumentOutputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.DocumentOutputConfig)

In [92]:
batch_job = docai.batch_process_documents(
    request = documentai.BatchProcessRequest(
        name = PARSER.name,
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_prefix = documentai.GcsPrefix(
                gcs_uri_prefix = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/shared files/docs/sports/B'
            )
        ),
        document_output_config = documentai.DocumentOutputConfig(
            gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/parsing'
            )
        ),
    )
)

In [93]:
batch_job.operation.name

'projects/1026793852137/locations/us/operations/4343315965496664420'

In [94]:
print(f'Waiting on batch job to complete: {batch_job.operation.name}')
batch_job.result()

Waiting on batch job to complete: projects/1026793852137/locations/us/operations/4343315965496664420




List the input and output locations for each document processed:

In [95]:
for d, doc in enumerate(batch_job.metadata.individual_process_statuses):
    print(f'Document {d}:\n\t{doc.input_gcs_source}\n\t{doc.output_gcs_destination}\n')

Document 0:
	gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf
	gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/4343315965496664420/1

Document 1:
	gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Basketball - Wikipedia.pdf
	gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/4343315965496664420/0



Read the results for each document and shard:

In [96]:
responses = []
for document in batch_job.metadata.individual_process_statuses:
    shards = []
    for shard in gcs.list_blobs(bucket, prefix = document.output_gcs_destination.split(f'gs://{GCS_BUCKET}/')[1]):
        if shard.content_type == 'application/json':
            print(shard.name)
            shards.append(
                documentai.Document.from_json(
                    shard.download_as_bytes(), 
                    ignore_unknown_fields = True
                )
            )
    responses.append(shards)

working-with-docai/process-documents/parsing/4343315965496664420/1/Baseball - Wikipedia-0.json
working-with-docai/process-documents/parsing/4343315965496664420/1/Baseball - Wikipedia-1.json
working-with-docai/process-documents/parsing/4343315965496664420/1/Baseball - Wikipedia-2.json
working-with-docai/process-documents/parsing/4343315965496664420/1/Baseball - Wikipedia-3.json
working-with-docai/process-documents/parsing/4343315965496664420/0/Basketball - Wikipedia-0.json
working-with-docai/process-documents/parsing/4343315965496664420/0/Basketball - Wikipedia-1.json
working-with-docai/process-documents/parsing/4343315965496664420/0/Basketball - Wikipedia-2.json


The output is sharded into multiple files.  Review the number of pages, page range, and the start of the text from the OCR for each shard:

In [97]:
for document in responses:
    for shard in document:
        print('This shard:\n',
              f'\tHas {len(shard.pages)} pages: {[page.page_number for page in shard.pages]}',
              f'\n\tThe text starts with:\n{shard.text[0:200]}\n\n'
             )

This shard:
 	Has 10 pages: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
✰ B
ETTS
Baseball is a bat-and-ball sport played between two
Baseball
teams of nine players


This shard:
 	Has 10 pages: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
a number of competitions between clubs from different countries. Other
competitions between national teams, such as the Baseball World Cup
and the Olympic baseba


This shard:
 	Has 10 pages: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
4. Thurston (2000), p. 15; "Official Rules/Foreword" (http://mlb.mlb.com/mlb/official_info/official_rule
s/foreword.jsp). Major League Baseball. Archived (https:


This shard:
 	Has 5 pages: [31, 32, 33, 34, 35] 
	The text starts with:
10/27/23, 9:22 AM
Baseball - Wikipedia
170. "Open

### Working With Batch Jobs

Batch jobs are long-running operations. Managing them with tasks like listing, polling, and canceling can be an important part of a workflow.
- [Managing long-running operations (LROs)](https://cloud.google.com/document-ai/docs/long-running-operations)

This section uses the package [google.longrunning](https://cloud.google.com/service-infrastructure/docs/service-management/reference/rpc/google.longrunning#google.longrunning.ListOperationsRequest):
- [Document AI RPC Reference for google.longrunning](https://cloud.google.com/document-ai/docs/reference/rpc/google.longrunning#google.longrunning.GetOperationRequest)

In [98]:
import google.longrunning.operations_pb2 as LRO

What is the `batch_job` operation name?

#### Operation Status

In [99]:
operation = docai.get_operation(
    request = LRO.GetOperationRequest(
        name = batch_job.operation.name
    )
)
operation

name: "projects/1026793852137/locations/us/operations/4343315965496664420"
metadata {
  type_url: "type.googleapis.com/google.cloud.documentai.v1.BatchProcessMetadata"
  value: "\010\003\032\013\010\255\301\365\252\006\020\330\301\302J\"\013\010\330\303\365\252\006\020\310\342\277\025*\217\002\nqgs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf\022\000\032]gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/4343315965496664420/1*9\010\001\0225HumanReviewConfig is DISABLED, skipping human review.*\221\002\nsgs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Basketball - Wikipedia.pdf\022\000\032]gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/4343315965496664420/0*9\010\001\0225HumanReviewConfig is DISABLED, skipping human review."
}
done: true
response {
  type_url: "type.googleapis.com/google.cloud.documentai.v1.BatchProcessResponse"
}

The `metadata` from the operation needs to be deserialized for review:

In [100]:
operation_metadata = documentai.BatchProcessMetadata.deserialize(
    operation.metadata.value
)
operation_metadata

state: SUCCEEDED
create_time {
  seconds: 1700618413
  nanos: 156279000
}
update_time {
  seconds: 1700618712
  nanos: 45085000
}
individual_process_statuses {
  input_gcs_source: "gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Baseball - Wikipedia.pdf"
  status {
  }
  output_gcs_destination: "gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/4343315965496664420/1"
  human_review_status {
    state: SKIPPED
    state_message: "HumanReviewConfig is DISABLED, skipping human review."
  }
}
individual_process_statuses {
  input_gcs_source: "gs://statmike-mlops-349915/working-with-docai/process-documents/shared files/docs/sports/Basketball - Wikipedia.pdf"
  status {
  }
  output_gcs_destination: "gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/4343315965496664420/0"
  human_review_status {
    state: SKIPPED
    state_message: "HumanReviewConfig is DISABLED, skipping human review."
  }
}

Get the creation time for the batch job from the `metadata`:

In [101]:
created = datetime.datetime.fromtimestamp(operation_metadata.create_time.timestamp())
created

datetime.datetime(2023, 11, 22, 2, 0, 13, 156279)

Check whether the creation time falls within the last hour:

In [102]:
created > datetime.datetime.today() - datetime.timedelta(hours = 1)

True
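
Calling `batch_job.result()` blocks until the job finishes. When only the operation name is available, the `done` field returned by `get_operation` can be polled instead. The following is a minimal sketch of such a polling loop. `wait_until_done` is a hypothetical helper, and the default interval and timeout values are arbitrary assumptions rather than recommendations from the Document AI documentation.

```python
import time
from typing import Callable


def wait_until_done(
    is_done: Callable[[], bool],
    interval: float = 10.0,
    timeout: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Check is_done() every `interval` seconds.

    Returns True as soon as is_done() is true, or False if the next
    wait would run past `timeout` seconds.
    """
    deadline = clock() + timeout
    while True:
        if is_done():
            return True
        if clock() + interval > deadline:
            return False
        sleep(interval)


# Simulated operation that reports done on the third check (no real API call)
checks = iter([False, False, True])
finished = wait_until_done(
    lambda: next(checks), interval=0.0, timeout=1.0, sleep=lambda seconds: None
)
print(finished)
```

With the clients from this notebook, `is_done` could be `lambda: docai.get_operation(request = LRO.GetOperationRequest(name = batch_job.operation.name)).done`.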

#### List Operations

Each operation matching the filter criteria is listed.  The metadata can be deserialized for review with `documentai.BatchProcessMetadata.deserialize(operation.metadata.value)`.  More on filter criteria can be found here: [Document AI RPC Reference for google.longrunning](https://cloud.google.com/document-ai/docs/reference/rpc/google.longrunning#google.longrunning.GetOperationRequest).

In [103]:
list_operations = docai.list_operations(
    request = LRO.ListOperationsRequest(
        name = f'projects/{PROJECT_ID}/locations/{LOCATION}/operations',
        filter = "TYPE=BATCH_PROCESS_DOCUMENTS AND STATE=DONE"
    )
)
len(list_operations.operations)

8

Filter the list to operations created in the last hour:

In [104]:
operations = []
for op in list_operations.operations:
    metadata = documentai.BatchProcessMetadata.deserialize(op.metadata.value)
    if datetime.datetime.fromtimestamp(metadata.create_time.timestamp()) > (datetime.datetime.today() - datetime.timedelta(hours = 1)):
        operations.append(op)
len(operations)

3
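
The comparisons above use naive datetimes in the local time zone of the notebook. The following is a sketch of the same check with timezone-aware UTC values, which does not depend on where the code runs. `created_within` is a hypothetical helper, and the epoch value is the `create_time.seconds` shown in the metadata output earlier.

```python
from datetime import datetime, timedelta, timezone


def created_within(create_seconds: float, window: timedelta, now: datetime) -> bool:
    """Return True if the epoch timestamp falls inside `window` before `now` (both UTC-aware)."""
    created = datetime.fromtimestamp(create_seconds, tz=timezone.utc)
    return created > now - window


# create_time.seconds from the batch job metadata shown above
created = datetime.fromtimestamp(1700618413, tz=timezone.utc)
print(created.isoformat())

# Fixed reference times so the example is reproducible
half_hour_later = datetime(2023, 11, 22, 2, 30, tzinfo=timezone.utc)
two_hours_later = datetime(2023, 11, 22, 4, 0, tzinfo=timezone.utc)
print(created_within(1700618413, timedelta(hours=1), now=half_hour_later))
print(created_within(1700618413, timedelta(hours=1), now=two_hours_later))
```

In the filter loop, the first argument could be `metadata.create_time.timestamp()` and `now` could be `datetime.now(timezone.utc)`.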

---
## Async Processing

The following is the Python client reference to use for this async processing section:
- [google.cloud.documentai.DocumentProcessorServiceAsyncClient()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceAsyncClient)

### Online Async: Multiple Documents

Process multiple documents at the same time with the online async client.

In [118]:
docai_async = documentai.DocumentProcessorServiceAsyncClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)

#### List of local documents

In [124]:
docs = os.listdir('../shared files/docs/sports')
docs

['Golf - Wikipedia.pdf',
 'Cricket - Wikipedia.pdf',
 'Hockey - Wikipedia.pdf',
 'Association football - Wikipedia.pdf',
 'American football - Wikipedia.pdf',
 'Football - Wikipedia.pdf',
 'Rugby football - Wikipedia.pdf',
 'Baseball - Wikipedia.pdf',
 'Basketball - Wikipedia.pdf',
 '.ipynb_checkpoints']

In [125]:
docs = [doc for doc in docs if doc.endswith('.pdf')]

#### Read the documents to bytes

In [126]:
for d, doc in enumerate(docs):
    with open('../shared files/docs/sports/' + doc, 'rb') as f:
        docs[d] = f.read()

#### Process a single document (first 5 pages)

In [127]:
responses = []
responses.append(
    await docai_async.process_document(
        request = documentai.ProcessRequest(
            name = PARSER.name,
            inline_document = documentai.Document(
                content = docs[0],
                mime_type = 'application/pdf'
            ),
            process_options = documentai.ProcessOptions(
                from_start = 5
            )
        )
    )
)

In [128]:
type(responses[0].document)

google.cloud.documentai_v1.types.document.Document

In [129]:
len(responses[0].document.pages)

5

#### Process documents concurrently:

Keep in mind there are quota limits that could limit the amount of simultaneous processing in the project.  The following could be adapted to manage a limit for concurrency as well as error handling with a technique like exponential backoff.

In [130]:
responses = await asyncio.gather(
    *[
        docai_async.process_document(
            request = documentai.ProcessRequest(
                name = PARSER.name,
                inline_document = documentai.Document(
                    content = doc,
                    mime_type = 'application/pdf'
                ),
                process_options = documentai.ProcessOptions(
                    from_start = 5
                )
            )
        ) for doc in docs
    ]
)

In [131]:
for response in responses:
    print(f'Document has {len(response.document.pages)} pages')

Document has 5 pages
Document has 5 pages
Document has 5 pages
Document has 5 pages
Document has 5 pages
Document has 5 pages
Document has 5 pages
Document has 5 pages
Document has 5 pages


In [132]:
print(responses[-1].document.text[0:200])

W
WIKIPEDIA
The Free Encyclopedia
Basketball
A
THE TEMPERAT
ROPS BELG
Basketball is a team sport in which two teams, most
Basketball
commonly of five players each, opposing one another on a
rectangula
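
The concurrency limit and exponential backoff mentioned above can be sketched with `asyncio.Semaphore` and a small retry loop. This is one possible adaptation under stated assumptions, not the notebook's method: `call_with_retry`, `gather_limited`, and `fake_process` are hypothetical names, the limit and delay values are arbitrary, and `fake_process` stands in for `docai_async.process_document` so the example runs without any API call.

```python
import asyncio
import random


async def call_with_retry(make_call, max_attempts=5, base_delay=1.0,
                          max_delay=30.0, retry_on=(Exception,)):
    """Await make_call(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def gather_limited(calls, limit=3, **retry_kwargs):
    """Run zero-argument coroutine factories with at most `limit` in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(make_call):
        async with semaphore:
            return await call_with_retry(make_call, **retry_kwargs)

    return await asyncio.gather(*(run(make_call) for make_call in calls))


async def demo():
    attempts = {}
    in_flight = 0
    peak = 0

    async def fake_process(i):
        # Stand-in for a request: item 2 fails once, then succeeds
        nonlocal in_flight, peak
        attempts[i] = attempts.get(i, 0) + 1
        in_flight += 1
        peak = max(peak, in_flight)
        try:
            await asyncio.sleep(0.01)
            if i == 2 and attempts[i] == 1:
                raise RuntimeError("simulated quota error")
            return i * 10
        finally:
            in_flight -= 1

    calls = [lambda i=i: fake_process(i) for i in range(5)]
    results = await gather_limited(calls, limit=2, base_delay=0.01)
    return results, peak, attempts


# In a notebook, use `await demo()` instead of asyncio.run()
results, peak, attempts = asyncio.run(demo())
print(results, peak, attempts[2])
```

For the requests above, each entry of `calls` could be a `lambda` that returns `docai_async.process_document(request = ...)`, and `retry_on` could be narrowed to quota errors such as `google.api_core.exceptions.ResourceExhausted`.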


### Online Async: Multiple Parts Of Same Document

The OCR Parser has a maximum of 15 pages per request for online processing.  This shows how to use async online processing to parse the entire document in shards.

In [133]:
local_doc_location

'../shared files/docs/sports/Baseball - Wikipedia.pdf'

#### How many pages are in the document?

In [136]:
with open(local_doc_location, 'rb') as pdf:
    doc = pdf.read()
reader = PyPDF2.PdfReader(io.BytesIO(doc))
num_pages = len(reader.pages)
num_pages

35

#### List of Pages Per Shard

The OCR parser can handle 15 pages per online request.  While you can split the PDF file into multiple files here, at the client, it is also possible to use the `process_options = documentai.ProcessOptions()` parameter of `documentai.ProcessRequest()` to direct each request to a subset of the document pages.

- `documentai.ProcessOptions()` - [reference](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessOptions)
    - `from_start` = number of pages to process from start of document
    - `from_end` = number of pages to process from end of document
    - `individual_page_selector` = list of specific page numbers to process (`[int]`)
    
Here the `individual_page_selector` will be used.  It uses an index that starts at 1 to reference page numbers.

In [137]:
num_shards = num_pages // 15 + 1 * min(1, num_pages % 15)
num_shards

3

In [138]:
shards = []
for shard in range(num_shards):
    shards.append([i+1 for i in range(15*(shard), min(num_pages, 15*(shard+1)))])
shards

[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
 [31, 32, 33, 34, 35]]
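
The two cells above can also be expressed as a single helper. This is a minimal sketch with a hypothetical function name, `page_shards`, that steps through the document in blocks of `pages_per_shard` and returns 1-based page numbers suitable for `individual_page_selector`.

```python
from typing import List


def page_shards(num_pages: int, pages_per_shard: int = 15) -> List[List[int]]:
    """Split pages 1..num_pages into consecutive lists of at most pages_per_shard pages."""
    if num_pages < 0 or pages_per_shard < 1:
        raise ValueError("num_pages must be >= 0 and pages_per_shard must be >= 1")
    return [
        list(range(start + 1, min(num_pages, start + pages_per_shard) + 1))
        for start in range(0, num_pages, pages_per_shard)
    ]


shards = page_shards(35)
for shard in shards:
    print(len(shard), shard[0], shard[-1])
```

Stepping `range()` by the shard size avoids computing the number of shards separately and handles page counts that are exact multiples of 15 without producing an empty final shard.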

#### Process Shards Concurrently:

Keep in mind there are quota limits that could limit the amount of simultaneous processing in the project.  The following could be adapted to manage a limit for concurrency as well as error handling with a technique like exponential backoff.

In [139]:
responses = await asyncio.gather(
    *[
        docai_async.process_document(
            request = documentai.ProcessRequest(
                name = PARSER.name,
                inline_document = documentai.Document(
                    content = doc,
                    mime_type = 'application/pdf'
                ),
                process_options = documentai.ProcessOptions(
                    individual_page_selector = documentai.ProcessOptions.IndividualPageSelector(
                        pages = shard
                    )
                )
            )
        ) for shard in shards
    ]
)

In [140]:
for response in responses:
    print(f'Shard has {len(response.document.pages)} pages')
    pages = [page.page_number for page in response.document.pages]
    print(f'\tPage Numbers: {pages}')

Shard has 15 pages
	Page Numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Shard has 15 pages
	Page Numbers: [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
Shard has 5 pages
	Page Numbers: [31, 32, 33, 34, 35]


In [141]:
print(responses[-1].document.text[0:200])

10/27/23, 9:22 AM
Baseball - Wikipedia
170. "Opening Day Rosters Feature 230 Players Born Outside the U.S." (https://www.mlb.com/news/op
ening-day-rosters-feature-230-players-born-outside-the-us/c-116


---
## Store Responses

For batch processing the responses are already stored as JSON in GCS.  This section covers how to also store online responses as JSON in GCS.  It also covers loading the JSON responses from GCS to BigQuery as well as retrieving them from BigQuery.

### GCS

For batch processing the results are already output to GCS as JSON.  This section covers a method to store the online processing results that are local to this notebook.

> Note: The `documentai.document.Document` object is a protobuf.  It has built-in methods for conversion of messages like `.to_dict()` and `.to_json()`.  These follow the `google.protobuf.json_format` - [reference](https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html).  That means the parameters of `.to_dict()` are the same as `google.protobuf.json_format.MessageToDict()`, which includes `use_integers_for_enums`, which should be set to `False` to get the text representation of enum values.

In [142]:
local_doc_location

'../shared files/docs/sports/Baseball - Wikipedia.pdf'

In [143]:
type(response.document)

google.cloud.documentai_v1.types.document.Document

Objects like `document.Document` have `.to_json()` and `.to_dict()` methods. Here `.to_dict()` is used and then `json.dumps()` is used to write to GCS as a single-line JSON file.  Using `.to_json()` results in a JSON string with newlines and indenting that, when saved to GCS, cannot be directly loaded into BigQuery, which requires a single JSON record per line.

In [144]:
dict_document = documentai.Document.to_dict(response.document, use_integers_for_enums = False)

In [145]:
type(dict_document)

dict

Fix the dictionary to match the schema of the batch processing output by casting the page dimensions to integers:

In [146]:
for p, page in enumerate(dict_document['pages']):
    dict_document['pages'][p]['dimension']['width'] = int(page['dimension']['width'])
    dict_document['pages'][p]['dimension']['height'] = int(page['dimension']['height'])

Save document to GCS as json with `json.dumps()`:

In [147]:
f"{SERIES}/{EXPERIMENT}/parsing/online/{local_doc_location.split('/')[-1].split('.')[0]}.json"

'working-with-docai/process-documents/parsing/online/Baseball - Wikipedia.json'

In [148]:
blob = bucket.blob(f"{SERIES}/{EXPERIMENT}/parsing/online/{local_doc_location.split('/')[-1].split('.')[0]}.json")

In [149]:
blob.upload_from_string(data = json.dumps(dict_document), content_type = 'application/json')

In [150]:
for blob in list(bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/parsing/online')):
    print(blob.name)

working-with-docai/process-documents/parsing/online/Baseball - Wikipedia.json
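
The single-line requirement can be illustrated with the standard library alone. The sketch below uses a made-up `record` dictionary rather than a real Document AI response. It shows that `json.dumps()` without `indent` keeps a record on one line, even when a string value contains a newline, and that several records joined with newlines form newline-delimited JSON.

```python
import json

# A made-up record standing in for a document converted with .to_dict()
record = {"text": "line one\nline two", "pages": [{"page_number": 1}]}

pretty = json.dumps(record, indent=2)  # multi-line, human readable
single = json.dumps(record)            # one line; the inner newline is escaped as \n

print(len(pretty.splitlines()))
print(len(single.splitlines()))

# Newline-delimited JSON: one complete record per line
records = [record, {"text": "another document", "pages": []}]
ndjson = "\n".join(json.dumps(r) for r in records)
reloaded = [json.loads(line) for line in ndjson.splitlines()]
print(reloaded == records)
```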


#### Reload From GCS

In [151]:
reload_response = documentai.Document.from_json(
    blob.download_as_bytes(), 
    ignore_unknown_fields = True
)

In [152]:
type(reload_response)

google.cloud.documentai_v1.types.document.Document

In [153]:
print(reload_response.text[0:250])

10/27/23, 9:22 AM
Baseball - Wikipedia
170. "Opening Day Rosters Feature 230 Players Born Outside the U.S." (https://www.mlb.com/news/op
ening-day-rosters-feature-230-players-born-outside-the-us/c-116591920) Major League Baseball.
Retrieved April 24,


In [154]:
len(reload_response.pages)

5

### BigQuery

BigQuery is a great place to store, retrieve, and even query the json data from Document AI:
- Import Data:
    - [Load JSON data from GCS](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json)
- Retrieve Data:
    - [Python Client for BigQuery](https://cloud.google.com/python/docs/reference/bigquery/latest)
        - Use SQL from Python to retrieve results to Pandas dataframes
    - [BigFrames Client for BigQuery](https://cloud.google.com/python/docs/reference/bigframes/latest)
        - Use a pandas-like API to retrieve results to a Pandas dataframe
    - [Python Client for Google BigQuery Storage API](https://cloud.google.com/python/docs/reference/bigquerystorage/latest)
        - A client for directly ingesting data from BigQuery storage without the compute of a query.
- Query Data Inside BigQuery:
    - BigQuery has a native data type for JSON: [BigQuery JSON type](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type)
    - [Query JSON data](https://cloud.google.com/bigquery/docs/json-data#query_json_data)
    - All of [Google SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax)
    
This section covers loading results from above into BigQuery.  There is another workflow in this series that will cover using the results from/in BigQuery - [Document AI - Process Responses](./Document%20AI%20-%20Process%20Responses.ipynb).

#### Listing Results in GCS

List the JSON objects created in GCS by the processing above:

In [155]:
blobs = []
for blob in list(gcs.list_blobs(bucket, prefix = f'{SERIES}/{EXPERIMENT}/parsing')):
    if blob.name.endswith('.json'):
        print(blob.name)
        blobs.append(f'gs://{GCS_BUCKET}/{blob.name}')

working-with-docai/process-documents/parsing/11129619645237873382/0/Basketball - Wikipedia-0.json
working-with-docai/process-documents/parsing/11129619645237873382/0/Basketball - Wikipedia-1.json
working-with-docai/process-documents/parsing/11129619645237873382/0/Basketball - Wikipedia-2.json
working-with-docai/process-documents/parsing/11129619645237873382/1/Baseball - Wikipedia-0.json
working-with-docai/process-documents/parsing/11129619645237873382/1/Baseball - Wikipedia-1.json
working-with-docai/process-documents/parsing/11129619645237873382/1/Baseball - Wikipedia-2.json
working-with-docai/process-documents/parsing/11129619645237873382/1/Baseball - Wikipedia-3.json
working-with-docai/process-documents/parsing/12228812013921475134/0/Baseball - Wikipedia-0.json
working-with-docai/process-documents/parsing/12228812013921475134/0/Baseball - Wikipedia-1.json
working-with-docai/process-documents/parsing/12228812013921475134/0/Baseball - Wikipedia-2.json
working-with-docai/process-documen

In [156]:
len(blobs)

47

#### Create/Get BigQuery Dataset

In [157]:
ds = bigquery.DatasetReference(BQ_PROJECT, BQ_DATASET)
ds.location = REGION[0:2] # use multi-region if creating new
ds.labels = {'series': f'{SERIES}'}
ds = bq.create_dataset(dataset = ds, exists_ok = True)

#### Load JSON To BigQuery Table

Each file can have a slightly different schema due to additional elements found in some documents.  For this reason, first load a single result as a new table. Then iteratively load the remaining results with `schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]` as part of the `job_config`.

In [574]:
# make load job configuration
job_config = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE, #.WRITE_APPEND, #.WRITE_TRUNCATE, #.WRITE_EMPTY
    create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED, #.CREATE_NEVER
    #schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    autodetect = True
)    

# load the first JSON file into a new table
load_job = bq.load_table_from_uri(
    source_uris = blobs[0:1],
    destination = ds.table(BQ_TABLE_PREFIX),
    location = REGION[0:2],
    job_config = job_config
)
load_job.result()

LoadJob<project=statmike-mlops-349915, location=US, id=c120cd47-6e39-4c14-afab-fb5737574b4e>

In [None]:
# make load job configuration
job_config = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition = bigquery.WriteDisposition.WRITE_APPEND, #.WRITE_APPEND, #.WRITE_TRUNCATE, #.WRITE_EMPTY
    create_disposition = bigquery.CreateDisposition.CREATE_IF_NEEDED, #.CREATE_NEVER
    schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    autodetect = True
)  

for blob in blobs[1:]:
    # append each remaining JSON file, allowing new fields to be added to the schema
    load_job = bq.load_table_from_uri(
        source_uris = [blob],
        destination = ds.table(BQ_TABLE_PREFIX),
        location = REGION[0:2],
        job_config = job_config
    )
    load_job.result() 

#### Reload From BQ

Use the BigQuery Client for Python to retrieve the result of a query that constructs a JSON string within BigQuery using the `TO_JSON_STRING` function - [JSON functions reference](https://cloud.google.com/bigquery/docs/reference/standard-sql/json_functions).

In [676]:
reload_response = bq.query(query = f'''
SELECT
    TO_JSON_STRING(
        (
            SELECT AS STRUCT *
            FROM unnest([t])
        )
    ) as json_string
FROM `{ds.reference.project}.{ds.reference.dataset_id}.{BQ_TABLE_PREFIX}` t
LIMIT 1
''')

In [677]:
reload_response = [dict(row) for row in reload_response]

In [678]:
reload_response = documentai.Document.from_json(
    reload_response[0]['json_string'], 
    ignore_unknown_fields = True
)

In [679]:
type(reload_response)

google.cloud.documentai_v1.types.document.Document

In [680]:
len(reload_response.pages)

5

In [681]:
print(reload_response.text[0:250])

10/27/23, 9:22 AM
Baseball - Wikipedia
170. "Opening Day Rosters Feature 230 Players Born Outside the U.S." (https://www.mlb.com/news/op
ening-day-rosters-feature-230-players-born-outside-the-us/c-116591920) Major League Baseball.
Retrieved April 24,
