
# Document AI - Process Documents
> From the [Working With Document AI](https://github.com/statmike/vertex-ai-mlops/blob/main/Working%20With%20Document%20AI/readme.md) series in the [vertex-ai-mlops](https://github.com/statmike/vertex-ai-mlops/blob/main/readme.md) repository.

Document AI is an API where you interact with processors to extract information from documents.  You enable the API, create an instance of a processor in your project, send in document(s), and receive back JSON with the extracted information:

<p align="center" width="100%"><center>
    <img src="../architectures/architectures/images/working with/documentai/readme/high_level.png">
</center></p>

This workflow covers all the ways to process a document, or many documents, using Python as the client. For details on how to extract elements from the responses see the next workflow: [Document AI - Process Responses](./Document%20AI%20-%20Process%20Responses.ipynb)

---
**Documents**

Document AI takes documents as input.  Many document types (file formats) are supported:
- Supported [Document Types](https://cloud.google.com/document-ai/docs/file-types) like pdf, gif, tiff, jpeg, png, bmp, webp
- Additional support for [DocX files is in preview](https://cloud.google.com/document-ai/docs/enterprise-document-ocr#supported_file_formats).

---
**Processing**

Processing can be orchestrated with one of the [client libraries](https://cloud.google.com/document-ai/docs/libraries), [REST](https://cloud.google.com/document-ai/docs/reference/rest), or [RPC](https://cloud.google.com/document-ai/docs/reference/rpc).  This workflow will use the [Python Client for Document AI](https://cloud.google.com/python/docs/reference/documentai/latest).

```
from google.cloud import documentai

docai = documentai.DocumentProcessorServiceClient()
```

> There is also an async client that can be used.  The methods have the same names and can be awaited with `await`:
> - `docai_async = documentai.DocumentProcessorServiceAsyncClient()`

Processing can be done online (one document) or in batch (multiple documents):
- online (one document):
    - `docai.process_document(request = documentai.types.ProcessRequest())`
- batch (multiple documents):
    - `docai.batch_process_documents(request = documentai.types.BatchProcessRequest())`

---
**Inputs & Outputs**

The following table breaks down the input and output locations by the type of processing:


<table style='text-align:center;vertical-align:middle;border:1px solid black' width="90%" cellpadding="1" cellspacing="0">
    <caption>Inputs & Outputs</caption>
    <col>
    <col>
    <col>
<!--..........................................................................................-->
    <thead>
        <tr>
            <th scope="col" style="width:20%">
                Processing Mode
            </th>
            <th scope="col" style="width:40%">
                Inputs
            </th>
            <th scope="col" style="width:40%">
                Outputs
            </th>
        </tr>
    </thead>
    <tbody>
<!--..........................................................................................-->
        <tr>
            <td>
                Online<br>(Single Document Per Request)
            </td>
            <td>
                <table>
                    <tr style='text-align:center'>
                        <td>One of:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>Document in GCS:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                        <pre>
response = docai.process_document(
    request = documentai.types.ProcessRequest(
        <b>inline_document</b> = documentai.types.Document(
            uri = 'gs://bucket/path/to/object.ext'
        )
    )
)
                        </pre>
                        </td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>Document as bytes</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                        <pre>
response = docai.process_document(
    request = documentai.types.ProcessRequest(
        # provide a bytes object
        <b>raw_document</b> = documentai.types.RawDocument(
            content = 
        )
    )
)
                        </pre>
                        </td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>Document in GCS</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                        <pre>
response = docai.process_document(
    request = documentai.types.ProcessRequest(
        # provide GCS URI as string
        <b>gcs_document</b> = documentai.types.GcsDocument(
            gcs_uri = 'gs://bucket/path/to/object'
        )
    )
)
                        </pre>
                        </td>
                    </tr>
                </table>
            </td>
            <td  style='text-align:left'>
                The response is an object containing the document response.
                <br><pre>type(response) is documentai.types.ProcessResponse</pre>
                <br><br>This has an attribute with the document:
                <br><pre>type(response.document) is documentai.types.Document</pre>
                <br><br>The document object has attributes for the document components, like:
                <ul>
                    <li>response.document.text is a string with the full text of the document</li>
                    <li>response.document.pages is a list of documentai.types.Document.Page objects</li>
                    <li>response.document.entities is a list of documentai.types.Document.Entity objects</li>
                </ul>
                <br>The Document class has methods for converting a document to Python objects:
                <ul>
                    <li>documentai.Document.to_dict(response.document) for a dictionary</li>
                    <li>documentai.Document.to_json(response.document) for a JSON string</li>
                </ul>
            </td>
        </tr>
<!--..........................................................................................-->
        <tr>
            <td>
                Batch<br>(Multiple Documents Per Request)
            </td>
            <td>
                <table>
                    <tr style='text-align:center'>
                        <td>One of:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>List of documents in GCS:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                <pre>
docai.batch_process_documents(
    request = documentai.types.BatchProcessRequest(
        <b>input_documents</b> = documentai.types.BatchDocumentsInputConfig(
            # provide a list of document objects that each have parameter gcs_uri = GCS URI as string
            <b>gcs_documents</b> = documentai.types.GcsDocuments(
                documents = [documentai.types.GcsDocument(gcs_uri = ), ...]
            )
        )
    )
)
                </pre>
                        </td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>All documents with GCS prefix:</td>
                    </tr>
                    <tr style='text-align:left'>
                        <td>
                            <pre>
docai.batch_process_documents(
    request = documentai.types.BatchProcessRequest(
        <b>input_documents</b> = documentai.types.BatchDocumentsInputConfig(
            # provide a GCS URI (prefix) as string
            <b>gcs_prefix</b> = documentai.types.GcsPrefix(
                gcs_uri_prefix = 
            )
        )
    )
)
                            </pre>
                        </td>
                    </tr>
                </table>
            </td>
            <td style='text-align:left'>
                The batch processing job includes a parameter for configuring the output location of JSON files in GCS.<br><br>
                <pre>
docai.batch_process_documents(
    request = documentai.BatchProcessRequest(
        <b>document_output_config</b> = documentai.types.DocumentOutputConfig(
            <b>gcs_output_config</b> = documentai.types.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = 'gs://bucket/path/to/output', # the output JSON will be written to this directory
                field_mask = , # optional: fields to include in output
                sharding_config = # optional: sharding config for output
            )
        )
    )
)
                </pre>
            </td>
        </tr>     
<!--..........................................................................................-->
    </tbody>
</table>


---
**Processing Specifics**

There are limits to processing requests:
- the number of requests that can be made over a period of time: [Quotas](https://cloud.google.com/document-ai/quotas#quotas)
- the amount and size of content (documents, pages): [Content Limits](https://cloud.google.com/document-ai/quotas#content_limits)
- the processing request for each processor (parser) also has limits: [Processor Specific Limits](https://cloud.google.com/document-ai/quotas#processor_limits)

What does this actually mean?  Let's walk through a single processor, the OCR parser. [This page](https://cloud.google.com/document-ai/docs/processors-list) has all the specifics for each parser.
- Parser Limits: The OCR parser
    - limit of 15 pages for an online request and 500 pages for a batch request
- Content Limits:
    - file size: 20MB online, and 1GB batch
    - files: 1 for online, 5000 for batch
        - but the OCR parser has a 500 page limit for batch
    - If the file type is an image (not PDF) then each page can be a max of 40 megapixels
- Requests (Quotas):
    - overall
        - 10,000 active pages per project
    - users:
        - 1800 requests per minute
    - online (per minute):
        - 600 per project
        - 120 per project/processor/multi-region (US, EU)
        - 6 per project/processor/single-region
    - batch (concurrent jobs):
        - 10 per project
        - 5 per project/multi-region
        - 5 per project/single-region
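
The page and file size limits listed above can be turned into a quick pre-flight check before choosing online or batch processing. The following is a minimal sketch: the `OCR_LIMITS` table and the `choose_mode` helper are hypothetical names, the numbers are copied from the list above and may change, and 1 MB is assumed to mean 1024 * 1024 bytes.

```python
# Limits copied from the list above for the OCR parser. Treat them as a
# snapshot: quotas and limits can change over time.
MB = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
OCR_LIMITS = {
    'online': {'max_pages': 15, 'max_bytes': 20 * MB},
    'batch': {'max_pages': 500, 'max_bytes': 1024 * MB},
}

def choose_mode(page_count, size_bytes, limits=OCR_LIMITS):
    """Return 'online', 'batch', or None when the document exceeds both limits."""
    # Try the cheaper, synchronous mode first and fall back to batch.
    for mode in ('online', 'batch'):
        limit = limits[mode]
        if page_count <= limit['max_pages'] and size_bytes <= limit['max_bytes']:
            return mode
    return None

print(choose_mode(page_count=5, size_bytes=2 * MB))    # online
print(choose_mode(page_count=35, size_bytes=2 * MB))   # batch
print(choose_mode(page_count=900, size_bytes=2 * MB))  # None
```

A result of `None` suggests the document would need to be split before it is submitted to this parser.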


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Working%20With%20Document%20AI/Document%20AI%20-%20Process%20Documents.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [3]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [4]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [5]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.documentai', 'google-cloud-documentai'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('PIL', 'Pillow'),
    #('PyPDF2', 'PyPDF2'), 
]

import importlib.util
install = False
for package in packages:
    # find_spec returns None when the module is missing (the parent package must be importable)
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user

### API Enablement

In [6]:
!gcloud services enable documentai.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, resume running the notebook from the cell that follows this one.

In [7]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [8]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [10]:
REGION = 'us-central1'
SERIES = 'working-with-docai'
EXPERIMENT = 'process-documents'

# make this the gcs bucket for storing files
GCS_BUCKET = PROJECT_ID

Packages

In [118]:
import os, shutil, glob

import IPython
import PIL
import PIL.ImageFont, PIL.Image, PIL.ImageDraw

from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage

Clients

In [53]:
# document AI client
LOCATION = REGION.split('-')[0]
docai = documentai.DocumentProcessorServiceClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)

# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)
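
The client cell above derives the Document AI location by taking the first token of `REGION`, which works for `us-central1` but would give `europe` rather than `eu` for a European region. The sketch below is one possible way to make the mapping explicit. The `MULTI_REGIONS` table and both helper names are hypothetical, and it assumes only the `us` and `eu` multi-regions are of interest.

```python
# Assumption: only the two multi-regions mentioned in this workflow matter.
MULTI_REGIONS = {'us': 'us', 'europe': 'eu'}

def docai_location(region):
    """Map a region such as 'us-central1' to a Document AI multi-region."""
    prefix = region.split('-')[0]
    try:
        return MULTI_REGIONS[prefix]
    except KeyError:
        raise ValueError(f'no multi-region mapping assumed for {region!r}')

def docai_endpoint(location):
    """Build the regional API endpoint in the form used by the client cell above."""
    return f'{location}-documentai.googleapis.com'

print(docai_endpoint(docai_location('us-central1')))   # us-documentai.googleapis.com
print(docai_endpoint(docai_location('europe-west4')))  # eu-documentai.googleapis.com
```

The resulting string can be passed as `api_endpoint` in `client_options`, exactly as in the cell above.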

---
## Documents

This section prepares documents for processing.  In this case there are documents in a local folder in the repository that are prepared for online and batch serving by either loading them directly or copying them to a GCS location within the bucket defined above with parameter `GCS_BUCKET`.

The files in the local folder `/docs` are printed pages (to .pdf) from the following [Wikipedia](https://www.wikipedia.org/) pages:

|Document Name|Link|
|---|---|
|`docs/Bayes' theorem - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Bayes%27_theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem)|
|`docs/sports/Baseball - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Baseball](https://en.wikipedia.org/wiki/Baseball)|
|`docs/sports/Football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Football](https://en.wikipedia.org/wiki/Football)|
|`docs/sports/Association football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Association_football](https://en.wikipedia.org/wiki/Association_football)|
|`docs/sports/American football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/American_football](https://en.wikipedia.org/wiki/American_football)|
|`docs/sports/Hockey - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Hockey](https://en.wikipedia.org/wiki/Hockey)|
|`docs/sports/Basketball - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Basketball](https://en.wikipedia.org/wiki/Basketball)|
|`docs/sports/Cricket - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Cricket](https://en.wikipedia.org/wiki/Cricket)|
|`docs/sports/Rugby football - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Rugby_football](https://en.wikipedia.org/wiki/Rugby_football)|
|`docs/sports/Golf - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Golf](https://en.wikipedia.org/wiki/Golf)|
|`docs/jam_bands/Jam band - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Jam_band](https://en.wikipedia.org/wiki/Jam_band)|
|`docs/jam_bands/Widespread Panic - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Widespread_Panic](https://en.wikipedia.org/wiki/Widespread_Panic)|
|`docs/jam_bands/Cream (band) - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Cream_(band)](https://en.wikipedia.org/wiki/Cream_(band))|
|`docs/jam_bands/Phish - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Phish](https://en.wikipedia.org/wiki/Phish)|
|`docs/jam_bands/The Allman Brothers Band - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/The_Allman_Brothers_Band](https://en.wikipedia.org/wiki/The_Allman_Brothers_Band)|
|`docs/jam_bands/Grateful Dead - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Grateful_Dead](https://en.wikipedia.org/wiki/Grateful_Dead)|



### Get The Documents

If you are working from a clone of this notebook's repository then the documents are already present. The following cell checks for the documents folder, `/docs`, and if it is missing gets it (`git clone`):

In [54]:
if not os.path.exists('docs'):
    print('Retrieving documents...')
    if not os.path.exists('temp'):
        os.makedirs('temp')
    !git clone https://www.github.com/statmike/vertex-ai-mlops temp/vertex-ai-mlops
    shutil.copytree('temp/vertex-ai-mlops/Working With Document AI/docs', 'docs')
    shutil.rmtree('temp/vertex-ai-mlops')
    print('Documents are now in folder `/docs`')
else:
    print('Documents Found in folder `/docs`')

Documents Found in folder `/docs`


### Copy Documents To GCS

Make a copy of the `/docs` folder in the GCS Bucket defined above with parameter `GCS_BUCKET`.  This will add a prefix (folder structure) of `/{SERIES}/{EXPERIMENT}`.

In [39]:
glob.glob(f'docs/**/**')

['docs/jam_bands/Widespread Panic - Wikipedia.pdf',
 'docs/jam_bands/Cream (band) - Wikipedia.pdf',
 'docs/jam_bands/The Allman Brothers Band - Wikipedia.pdf',
 'docs/jam_bands/Jam band - Wikipedia.pdf',
 'docs/jam_bands/Grateful Dead - Wikipedia.pdf',
 'docs/jam_bands/Phish - Wikipedia.pdf',
 'docs/sports/Golf - Wikipedia.pdf',
 'docs/sports/Cricket - Wikipedia.pdf',
 'docs/sports/Hockey - Wikipedia.pdf',
 'docs/sports/Association football - Wikipedia.pdf',
 'docs/sports/American football - Wikipedia.pdf',
 'docs/sports/Football - Wikipedia.pdf',
 'docs/sports/Rugby football - Wikipedia.pdf',
 'docs/sports/Baseball - Wikipedia.pdf',
 'docs/sports/Basketball - Wikipedia.pdf']

In [42]:
for file in glob.glob(f'docs/**/**'):
    blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/{file}')
    blob.upload_from_filename(file)
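
Without `recursive=True`, the pattern `docs/**/**` behaves like `docs/*/*`, so it only matches entries exactly one folder below `docs` and skips a file stored directly in `docs`. The following sketch shows a recursive alternative on a throwaway folder. The helper names `list_pdfs` and `blob_name` and the sample file names are hypothetical.

```python
import glob
import os
import tempfile

def list_pdfs(root):
    """Return every .pdf under root, at any depth, in sorted order."""
    return sorted(glob.glob(os.path.join(root, '**', '*.pdf'), recursive=True))

def blob_name(series, experiment, path):
    """Build a GCS object name with forward slashes from a relative local path."""
    return f'{series}/{experiment}/' + path.replace(os.sep, '/')

# Demonstrate on a temporary folder so the sketch does not depend on /docs.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'docs', 'sports'))
    for rel in ('docs/top level.pdf', 'docs/sports/nested.pdf'):
        open(os.path.join(tmp, *rel.split('/')), 'wb').close()
    for path in list_pdfs(os.path.join(tmp, 'docs')):
        rel = os.path.relpath(path, tmp)
        print(blob_name('working-with-docai', 'process-documents', rel))
```

Each printed name could then be passed to `bucket.blob(...)` in the same way as the upload loop above.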

In [43]:
print(f"View the bucket directly here:\nhttps://console.cloud.google.com/storage/browser/{GCS_BUCKET}/{SERIES}/{EXPERIMENT};tab=objects&project={PROJECT_ID}")

View the bucket directly here:
https://console.cloud.google.com/storage/browser/statmike-mlops-349915/working-with-docai/process-documents;tab=objects&project=statmike-mlops-349915


List files in bucket:

In [51]:
for blob in list(bucket.list_blobs(prefix = f'{SERIES}/{EXPERIMENT}/docs')):
    print(blob.name)

working-with-docai/process-documents/docs/jam_bands/Cream (band) - Wikipedia.pdf
working-with-docai/process-documents/docs/jam_bands/Grateful Dead - Wikipedia.pdf
working-with-docai/process-documents/docs/jam_bands/Jam band - Wikipedia.pdf
working-with-docai/process-documents/docs/jam_bands/Phish - Wikipedia.pdf
working-with-docai/process-documents/docs/jam_bands/The Allman Brothers Band - Wikipedia.pdf
working-with-docai/process-documents/docs/jam_bands/Widespread Panic - Wikipedia.pdf
working-with-docai/process-documents/docs/sports/American football - Wikipedia.pdf
working-with-docai/process-documents/docs/sports/Association football - Wikipedia.pdf
working-with-docai/process-documents/docs/sports/Baseball - Wikipedia.pdf
working-with-docai/process-documents/docs/sports/Basketball - Wikipedia.pdf
working-with-docai/process-documents/docs/sports/Cricket - Wikipedia.pdf
working-with-docai/process-documents/docs/sports/Football - Wikipedia.pdf
working-with-docai/process-documents/docs/

---
## Processors

When submitting documents for processing in Document AI, the client routes the document to a processor.  There are many processors:
- [Full processor and detail list](https://cloud.google.com/document-ai/docs/processors-list)
- Check out the helpful table for processors in this workflow's [readme file](./readme.md)

When setting up a processor you can pick a specific version; otherwise the processor uses a default version.

This section shows how to:
- list available processors in the project: console and Python Client
    - describe processor(s)
- get/create a processor with desired type and version


### List Processors In This Project

If any processors have already been created in this project, list them:

In [61]:
processors = list(docai.list_processors(parent = f'projects/{PROJECT_ID}/locations/{LOCATION}'))
len(processors)

3

In [67]:
if processors:
    print(f'View the processors in the console with this link:\nhttps://console.cloud.google.com/ai/document-ai/processors?project={PROJECT_ID}\n\n')
    for p, processor in enumerate(processors):
        print(
            f'Processors {p}: ', processor.display_name, 
            'is of type = ', processor.type_, 
            ', and version = ',processor.default_processor_version.split('/')[-1])

View the processors in the console with this link:
https://console.cloud.google.com/ai/document-ai/processors?project=statmike-mlops-349915


Processors 0:  example-dot is of type =  CUSTOM_EXTRACTION_PROCESSOR , and version =  pretrained-foundation-model-v1.0-2023-08-22
Processors 1:  my-invoice is of type =  INVOICE_PROCESSOR , and version =  pretrained-invoice-v1.3-2022-07-15
Processors 2:  my_general_processor is of type =  FORM_PARSER_PROCESSOR , and version =  pretrained-form-parser-v1.0-2020-09-23


### Create/Get A Processor

For this workflow we will use the [OCR parser](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr). We can check the project for an existing processor that is an OCR parser with the desired version and, if one is not present, create it.  The processor will be stored in the Python variable `PARSER` and is referred to as a parser wherever it is used.

Get the type and version from the list of available processors: https://cloud.google.com/document-ai/docs/processors-list

In [77]:
TYPE = 'OCR_PROCESSOR'
VERSION = 'pretrained-ocr-v2.0-2023-06-02'

Get an existing processor:

In [78]:
PARSER = ''
for processor in processors:
    if processor.type_ == TYPE and processor.default_processor_version.split('/')[-1] == VERSION:
        PARSER = processor
        break
        
if PARSER:
    print(f'There is an existing processor with the desired type and version in PARSER = {PARSER.display_name}')
else:
    print(f'Need to create a processor for the desired type and version: {TYPE}, {VERSION}')

Need to create a processor for the desired type and version: OCR_PROCESSOR, pretrained-ocr-v2.0-2023-06-02


Create the processor if an existing one was not found to match:

In [79]:
if not PARSER:
    PARSER = docai.create_processor(
        parent = f'projects/{PROJECT_ID}/locations/{LOCATION}',
        processor = documentai.Processor(
            display_name = SERIES,
            type_ = TYPE,
            default_processor_version = VERSION
        )
    )
    print(f'Processor created and in PARSER variable with display name = {PARSER.display_name}')

Processor created and in PARSER variable with display name = working-with-docai
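
Requests sent to `PARSER.name` use the processor's default version. The `name` field of a processing request can also be the resource name of a specific processor version, which is the processor name followed by `/processorVersions/` and the version ID. A minimal sketch of building that name follows. The helper `processor_version_name` is hypothetical and the processor name is a placeholder.

```python
def processor_version_name(processor_name, version_id):
    """Append a version ID to a processor resource name."""
    return f'{processor_name}/processorVersions/{version_id}'

# Placeholder processor name in the same shape as PARSER.name.
example_processor = 'projects/123/locations/us/processors/abc'
print(processor_version_name(example_processor, 'pretrained-ocr-v2.0-2023-06-02'))
```

Passing the result as `name` in a request should pin processing to that version regardless of the processor's default.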


---
## Online Processing (single document)

There are three ways to provide a single document to the client and each is covered in this section.

> NOTE: The [OCR Processor](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr) has page limits of 15 for online, and 500 for batch processing.

The following is the Python client reference to use for this online processing section:
- [google.cloud.documentai.DocumentProcessorServiceClient.process_document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceClient#google_cloud_documentai_v1_services_document_processor_service_DocumentProcessorServiceClient_process_document)

Using the processor stored in `PARSER` from above:

In [81]:
PARSER.name

'projects/1026793852137/locations/us/processors/77d89f0b14d4643c'

Specify the location, in both the local folder and GCS, of one of the document samples:

In [146]:
local_doc_location = 'docs/sports/Baseball - Wikipedia.pdf'
gcs_doc_location = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/{local_doc_location}'

Read the document to a bytes object:

In [119]:
with open(local_doc_location, 'rb') as f:
    local_doc = f.read()

In [120]:
type(local_doc)

bytes
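
The requests in this section hardcode `mime_type = 'application/pdf'`. When the inputs mix file formats, the MIME type could instead be inferred from the file name. The sketch below uses the standard library `mimetypes` module and builds the request as a plain dictionary, which the client methods accept in place of a `ProcessRequest` object. The helper names `guess_mime_type` and `build_raw_request` and the placeholder processor name are hypothetical.

```python
import mimetypes

def guess_mime_type(path):
    """Infer a MIME type from the file extension, failing loudly when unknown."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        raise ValueError(f'cannot infer MIME type for {path!r}')
    return mime_type

def build_raw_request(processor_name, path, content):
    """Build a dict with the same fields as ProcessRequest with a raw_document."""
    return {
        'name': processor_name,
        'raw_document': {'content': content, 'mime_type': guess_mime_type(path)},
    }

request = build_raw_request(
    'projects/123/locations/us/processors/abc',  # placeholder processor name
    'docs/sports/Baseball - Wikipedia.pdf',
    b'%PDF-',  # placeholder bytes, not a real document
)
print(request['raw_document']['mime_type'])
```

With a real processor name and the bytes read above, the dictionary could be passed as `docai.process_document(request = request)`.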

---
### Document as a `Document` object: `inline_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - (This One) `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [135]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        inline_document = documentai.Document(
            content = local_doc,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 5
        )
    )
)

In [137]:
len(response.document.pages)

5

In [139]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
Baseball - Wikipedia
Baseball is a bat-and-ball sport played between two
teams of nine players each, taking turns batting and
fielding. The game occurs over the c


---
### Document as bytes: `raw_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - (This One) `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [141]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        raw_document = documentai.RawDocument(
            content = local_doc,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 5
        )
    )
)

In [142]:
len(response.document.pages)

5

In [143]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
Baseball - Wikipedia
Baseball is a bat-and-ball sport played between two
teams of nine players each, taking turns batting and
fielding. The game occurs over the c


---
### Document in GCS: `gcs_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - (This One) `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [148]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        gcs_document = documentai.GcsDocument(
            gcs_uri = gcs_doc_location,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 5
        )
    )
)

In [149]:
len(response.document.pages)

5

In [150]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
Baseball - Wikipedia
Baseball is a bat-and-ball sport played between two
teams of nine players each, taking turns batting and
fielding. The game occurs over the c


---
## Batch Processing (multiple documents)

There are two ways to provide documents to the client and each is covered in this section.

> NOTE: The [OCR Processor](https://cloud.google.com/document-ai/docs/processors-list#processor_doc-ocr) has page limits of 15 for online, and 500 for batch processing.

The following is the Python client reference to use for this batch processing section:
- [google.cloud.documentai.DocumentProcessorServiceClient.batch_process_documents()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceClient#google_cloud_documentai_v1_services_document_processor_service_DocumentProcessorServiceClient_batch_process_documents)

---
### Documents in GCS listed: `gcs_documents`

Reference:
- [documentai.BatchProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchProcessRequest)
    - `input_documents` = [documentai.BatchDocumentsInputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchDocumentsInputConfig)
        - `gcs_prefix` = [documentai.GcsPrefix()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsPrefix)
        - `gcs_documents` = [documentai.GcsDocuments](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocuments)
    - `document_output_config` = [documentai.DocumentOutputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.DocumentOutputConfig)

In [151]:
batch_job = docai.batch_process_documents(
    request = documentai.BatchProcessRequest(
        name = PARSER.name,
        input_documents = documentai.BatchDocumentsInputConfig(
            gcs_documents = documentai.GcsDocuments(
                documents = [
                    documentai.GcsDocument(
                        gcs_uri = gcs_doc_location, 
                        mime_type = 'application/pdf'
                    )
                ]
            )
        ),
        document_output_config = documentai.DocumentOutputConfig(
            gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
                gcs_uri = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/parsing'
            )
        ),
    )
)

In [153]:
print(f'Waiting on batch job to complete: {batch_job.operation.name}')
batch_job.result()

Waiting on batch job to complete: projects/1026793852137/locations/us/operations/17499314109986257616




List the input and output locations for each document processed:

In [159]:
for d, doc in enumerate(batch_job.metadata.individual_process_statuses):
    print(f'Document {d}:\n\t{doc.input_gcs_source}\n\t{doc.output_gcs_destination}\n')

Document 0:
	gs://statmike-mlops-349915/working-with-docai/process-documents/docs/sports/Baseball - Wikipedia.pdf
	gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/17499314109986257616/0



Read the results for the first document:

In [180]:
batch_responses = []
for json_output in gcs.list_blobs(bucket, prefix = batch_job.metadata.individual_process_statuses[0].output_gcs_destination.split(f'gs://{GCS_BUCKET}/')[1]):
    if json_output.content_type == 'application/json':
        print(json_output.name)
        batch_responses.append(documentai.Document.from_json(json_output.download_as_bytes(), ignore_unknown_fields = True))

working-with-docai/process-documents/parsing/17499314109986257616/0/Baseball - Wikipedia-0.json
working-with-docai/process-documents/parsing/17499314109986257616/0/Baseball - Wikipedia-1.json
working-with-docai/process-documents/parsing/17499314109986257616/0/Baseball - Wikipedia-2.json
working-with-docai/process-documents/parsing/17499314109986257616/0/Baseball - Wikipedia-3.json
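
The cell above recovers the object prefix by splitting the output URI on the known bucket name. A slightly more general approach, sketched here with a hypothetical helper `split_gcs_uri`, separates any `gs://` URI into its bucket and object parts without needing the bucket name in advance.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    scheme = 'gs://'
    if not uri.startswith(scheme):
        raise ValueError(f'not a GCS URI: {uri!r}')
    # partition splits on the first '/', so the object path keeps its own slashes
    bucket_name, _, object_name = uri[len(scheme):].partition('/')
    return bucket_name, object_name

bucket_name, prefix = split_gcs_uri(
    'gs://statmike-mlops-349915/working-with-docai/process-documents/parsing/17499314109986257616/0'
)
print(bucket_name)
print(prefix)
```

The second value can be used directly as the `prefix` argument when listing the output blobs.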


In [181]:
for document in batch_responses:
    print(len(document.pages), document.text[0:100])

10 10/27/23, 9:22 AM
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
Baseball - W
10 10/27/23, 9:22 AM
Baseball - Wikipedia
a number of competitions between clubs from different countri
10 Baseball - Wikipedia
10/27/23, 9:22 AM
4. Thurston (2000), p. 15; "Official Rules/Foreword" (http://
5 10/27/23, 9:22 AM
Baseball - Wikipedia
170. "Opening Day Rosters Feature 230 Players Born Outside th


The output is sharded and the shards need to be combined.  This is not a simple concatenation, as each shard has the structure for a subset of the input document(s).

In [167]:
json_bytes = b'\n'.join(json_bytes)

In [168]:
batch_response = documentai.Document.from_json(json_bytes, ignore_unknown_fields = True)

In [170]:
len(batch_response.pages)

5
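
As a rough illustration of what combining shards involves, the sketch below works on the shard JSON as plain dictionaries: it orders the shards, concatenates their text, and collects their pages. It is only a starting point under stated assumptions. The key names (`shardInfo`, `shardIndex`, `text`, `pages`) are assumed to follow the JSON form of the `Document` message, the helper `merge_shards` is hypothetical, and text anchors inside each page are not shifted to match the concatenated text.

```python
import json

def merge_shards(shard_json_strings):
    """Combine sharded Document JSON into one dict with full text and all pages."""
    shards = [json.loads(s) for s in shard_json_strings]
    # A missing shardIndex is treated as 0, since JSON output may omit default values.
    shards.sort(key=lambda d: int(d.get('shardInfo', {}).get('shardIndex', 0)))
    merged = {'text': '', 'pages': []}
    for shard in shards:
        merged['text'] += shard.get('text', '')
        merged['pages'].extend(shard.get('pages', []))
    return merged

# Toy shards, deliberately supplied out of order.
shard_a = json.dumps({'shardInfo': {'shardIndex': '1'}, 'text': 'world', 'pages': [{'pageNumber': 2}]})
shard_b = json.dumps({'text': 'hello ', 'pages': [{'pageNumber': 1}]})
merged = merge_shards([shard_a, shard_b])
print(merged['text'], len(merged['pages']))
```

Anything that indexes into the text, such as layout text anchors, would still need its offsets adjusted per shard before the merged structure is used for extraction.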

---
### Documents in GCS with prefix: `gcs_prefix`

Reference:
- [documentai.BatchProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchProcessRequest)
    - `input_documents` = [documentai.BatchDocumentsInputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.BatchDocumentsInputConfig)
        - `gcs_prefix` = [documentai.GcsPrefix()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsPrefix)
        - `gcs_documents` = [documentai.GcsDocuments](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocuments)
    - `document_output_config` = [documentai.DocumentOutputConfig()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.DocumentOutputConfig)

In [None]:
batch_response
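
A request that uses `gcs_prefix` differs from the `gcs_documents` request only in its `input_documents` field. The sketch below builds such a request as a plain dictionary, which the client accepts in place of a `BatchProcessRequest` object. The helper `build_prefix_batch_request` is hypothetical, and the processor name and bucket paths are placeholders.

```python
def build_prefix_batch_request(processor_name, input_prefix, output_uri):
    """Build a dict with the same fields as BatchProcessRequest using gcs_prefix."""
    return {
        'name': processor_name,
        'input_documents': {
            # every object under this prefix is submitted for processing
            'gcs_prefix': {'gcs_uri_prefix': input_prefix},
        },
        'document_output_config': {
            'gcs_output_config': {'gcs_uri': output_uri},
        },
    }

request = build_prefix_batch_request(
    'projects/123/locations/us/processors/abc',  # placeholder processor name
    'gs://bucket/path/to/docs/',                 # placeholder input prefix
    'gs://bucket/path/to/output',                # placeholder output location
)
print(request['input_documents']['gcs_prefix']['gcs_uri_prefix'])
```

With real values, the dictionary could be submitted as `batch_job = docai.batch_process_documents(request = request)` and awaited with `batch_job.result()` as in the previous section.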

---
## Async Processing

The following is the Python client reference to use for this async processing section:
- [google.cloud.documentai.DocumentProcessorServiceAsyncClient()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceAsyncClient)

### Multiple Online Processing Async Request
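
One possible pattern for sending several online requests concurrently is to wrap each call in a semaphore and collect the results with `asyncio.gather`, which keeps the number of in-flight requests below a chosen ceiling. The sketch below is not tied to the Document AI client: `process_many` and `fake_process` are hypothetical names, and the stand-in coroutine takes the place of the async client's `process_document` method so that the example runs without credentials.

```python
import asyncio

async def process_many(process_one, requests, max_concurrent=5):
    """Await process_one(request) for every request, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(request):
        async with semaphore:
            return await process_one(request)

    # gather returns results in the same order as the input requests
    return await asyncio.gather(*(guarded(r) for r in requests))

# Stand-in coroutine (assumption). With the real async client this might be:
#   process_one = lambda req: docai_async.process_document(request = req)
async def fake_process(request):
    await asyncio.sleep(0)
    return f"processed {request['name']}"

if __name__ == '__main__':
    results = asyncio.run(
        process_many(fake_process, [{'name': 'a'}, {'name': 'b'}], max_concurrent=1)
    )
    print(results)
```

Inside a notebook an event loop is already running, so `await process_many(...)` would be used in a cell instead of `asyncio.run(...)`. The value of `max_concurrent` should stay within the per-minute online quotas described earlier.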

### Multiple Batch Processing Async Request

---
## Store Responses

For batch processing the responses are already stored as JSON in GCS.  This section covers how to also store online responses as JSON in GCS.  It also covers loading the JSON responses from GCS to BigQuery as well as retrieving them from BigQuery.

### GCS
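
A minimal sketch of storing an online response as JSON in GCS follows. It assumes the JSON text has already been produced, for example with `documentai.Document.to_json(response.document)`, and it only relies on the bucket offering `blob(name)` and the blob offering `upload_from_string(data, content_type=...)`, as the `google-cloud-storage` objects do. The helper `store_json`, the fake bucket classes, and the object name are hypothetical and exist only so the example runs without credentials.

```python
def store_json(bucket, blob_name, json_text):
    """Upload JSON text to the bucket and return the resulting gs:// URI."""
    blob = bucket.blob(blob_name)
    blob.upload_from_string(json_text, content_type='application/json')
    return f'gs://{bucket.name}/{blob_name}'

# In-memory stand-ins for storage.Bucket and storage.Blob (assumption).
class FakeBlob:
    def __init__(self, store, name):
        self.store, self.name = store, name

    def upload_from_string(self, data, content_type='text/plain'):
        self.store[self.name] = (data, content_type)

class FakeBucket:
    name = 'example-bucket'

    def __init__(self):
        self.store = {}

    def blob(self, name):
        return FakeBlob(self.store, name)

fake_bucket = FakeBucket()
uri = store_json(fake_bucket, 'working-with-docai/process-documents/online/example.json', '{"text": "hello"}')
print(uri)
```

With the real `bucket` object from the setup section, the same call would write the response next to the batch outputs so that both kinds of results can be read back in the same way.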

### BigQuery