
# Document AI - Process Documents

Document AI is an API where you interact with processors to extract information from documents.  You enable the API, create an instance of a processor in your project, send in one or more documents, and receive back JSON with the extracted information:

<p align="center" width="100%"><center>
    <img src="../architectures/architectures/images/working with/documentai/readme/high_level.png">
</center></p>

This workflow covers the ways to process a document, or many documents, using Python as the client. For details on how to extract elements from the responses, see the next workflow: [Document AI - Process Responses](./Document%20AI%20-%20Process%20Responses.ipynb).

---
**Documents**

Document AI takes documents as its input.  Many document types (file formats) are supported:
- Supported [Document Types](https://cloud.google.com/document-ai/docs/file-types) include PDF, GIF, TIFF, JPEG, PNG, BMP, and WEBP
- Additional support for [DocX files is in preview](https://cloud.google.com/document-ai/docs/enterprise-document-ocr#supported_file_formats).
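
When a document is sent as raw bytes, the request also carries a MIME type. The following is a minimal sketch of mapping the file extensions listed above to MIME types. The helper name `mime_type_for` and the lookup table are illustrative assumptions, not part of the Document AI client, and the linked file-types page remains the authority on what each processor accepts.

```python
from pathlib import Path

# Hypothetical lookup table covering the formats listed above.
# The MIME strings are the standard registered names for each format.
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".gif": "image/gif",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".bmp": "image/bmp",
    ".webp": "image/webp",
}


def mime_type_for(filename: str) -> str:
    """Return the MIME type for a supported file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return SUPPORTED_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or filename!r}") from None
```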

---
**Processing**

Processing can be orchestrated with one of the [client libraries](https://cloud.google.com/document-ai/docs/libraries), [REST](https://cloud.google.com/document-ai/docs/reference/rest), or [RPC](https://cloud.google.com/document-ai/docs/reference/rpc).  This workflow will use the [Python Client for Document AI](https://cloud.google.com/python/docs/reference/documentai/latest).

```
from google.cloud import documentai

doc_ai = documentai.DocumentProcessorServiceClient()
```

> There is also an async client.  Its methods have the same names and can be awaited with `await`:
> `doc_ai_async = documentai.DocumentProcessorServiceAsyncClient()`
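
One reason to reach for the async client is to keep several online requests in flight at once. The sketch below shows one way to bound that concurrency with `asyncio.Semaphore`. The names `process_many`, `process_one`, and `max_concurrency` are hypothetical; `process_one` stands in for any coroutine function, such as one that awaits the async client's `process_document()`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_many(
    process_one: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    max_concurrency: int = 5,
) -> List[R]:
    """Run process_one over items with at most max_concurrency in flight.

    Results are returned in the same order as items.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await process_one(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```

With the async client, `process_one` might be a small coroutine that calls `await client.process_document(request=request)`. A suitable `max_concurrency` depends on the per-minute quotas that apply to the project.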

Processing can be done online (one document) or in batch (multiple documents):
- online: `doc_ai.process_document()`
- batch: `doc_ai.batch_process_documents()`

---
**Inputs & Outputs**

File Locations:
- inputs:
    - online processing
        - `documentai.ProcessRequest()`: includes one of these parameters:
            - `inline_document` = `documentai.types.Document()`
                - source is specified with: `uri` = GCS URI as string
            - `raw_document` = `documentai.types.RawDocument()`
                - source is specified with: `content` = bytes
            - `gcs_document` = `documentai.types.GcsDocument()`
                - source is specified with: `gcs_uri` = GCS URI as string
    - batch processing
        - `documentai.BatchDocumentsInputConfig()`: includes one of these parameters:
            - `gcs_prefix` = `documentai.types.GcsPrefix()`
                - sources are specified with: `gcs_uri_prefix` = GCS URI (prefix) as string
            - `gcs_documents` = `documentai.types.GcsDocuments()`
                - sources are specified with: documents = [list of `documentai.types.GcsDocument()`]
                    - each source has parameter: `gcs_uri` = GCS URI as string
- outputs:
    - online processing
        - response returned to client as `documentai.types.ProcessResponse()`
            - `document` = `documentai.types.Document()`
                - parameters:
                    - `text` =
                    - `pages` =
                    - `entities` = 
                    - ...
                - methods:
                    - `.to_dict()` for dictionary
                    - `.to_json()` for JSON
    - batch processing
        - the `doc_ai.batch_process_documents()` request has parameter `document_output_config` = `documentai.DocumentOutputConfig()`:
            - `gcs_output_config` = `documentai.types.DocumentOutputConfig.GcsOutputConfig()`
                - `gcs_uri` = GCS URI as string
                - `field_mask` (optional): which fields to include in output
                - `sharding_config` (optional): pages per shard
            - results are stored at the GCS URI specified in `gcs_uri` as JSON files
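
The request shapes outlined above can also be written as plain dictionaries, which the Python client accepts through its `request=` argument in place of the typed objects. The following is a sketch under that assumption. The helper names (`online_request`, `batch_request`, `run_online`) are hypothetical, and the processor name and GCS URIs are placeholders to be supplied by the caller.

```python
from typing import Any, Dict, Optional


def online_request(processor_name: str, content: bytes, mime_type: str) -> Dict[str, Any]:
    """Fields of a ProcessRequest that sends one document as raw bytes."""
    return {
        "name": processor_name,
        "raw_document": {"content": content, "mime_type": mime_type},
    }


def batch_request(
    processor_name: str,
    input_prefix: str,
    output_uri: str,
    pages_per_shard: Optional[int] = None,
) -> Dict[str, Any]:
    """Fields of a BatchProcessRequest that reads every file under a GCS prefix."""
    gcs_output: Dict[str, Any] = {"gcs_uri": output_uri}
    if pages_per_shard is not None:
        gcs_output["sharding_config"] = {"pages_per_shard": pages_per_shard}
    return {
        "name": processor_name,
        "input_documents": {"gcs_prefix": {"gcs_uri_prefix": input_prefix}},
        "document_output_config": {"gcs_output_config": gcs_output},
    }


def run_online(processor_name: str, content: bytes, mime_type: str):
    """Send one document for online processing and return the parsed Document."""
    # Imported lazily so the helpers above work without the client installed.
    from google.cloud import documentai

    # Default endpoint; a regional endpoint may be needed outside the US.
    client = documentai.DocumentProcessorServiceClient()
    response = client.process_document(
        request=online_request(processor_name, content, mime_type)
    )
    return response.document
```

A batch job would be started the same way with `client.batch_process_documents(request=batch_request(...))`, which returns a long-running operation rather than the documents themselves.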

---
**Processing Specifics**

There are limits to processing requests:
- the number of requests that can be made over a period of time: [Quotas](https://cloud.google.com/document-ai/quotas#quotas)
- the amount and size of content (documents, pages): [Content Limits](https://cloud.google.com/document-ai/quotas#content_limits)
- the processing request for each processor (parser) also has limits: [Processor Specific Limits](https://cloud.google.com/document-ai/quotas#processor_limits)

What does this actually mean?  Let's walk through a single processor, the OCR Parser. [This page](https://cloud.google.com/document-ai/docs/processors-list) has the specifics for each parser.
- Parser Limits: The OCR parser
    - limit of 15 pages for an online request and 500 pages for a batch request
- Content Limits:
    - file size: 20MB online, and 1GB batch
    - files: 1 for online, 5000 for batch
        - but the OCR parser has a 500 page limit for batch
    - If the file type is an image (not PDF) then each page can be a max of 40 megapixels
- Requests (Quotas):
    - overall
        - 10,000 active pages per project
    - users:
        - 1800 requests per minute
    - online (per minute):
        - 600 per project
        - 120 per project/processor/multi-region (US, EU)
        - 6 per project/processor/single-region
    - batch (concurrent jobs):
        - 10 per project
        - 5 per project/multi-region
        - 5 per project/single-region
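
To make these limits concrete, the sketch below picks a processing mode for one file from its page count and size, using the OCR parser numbers quoted above (15 pages and 20MB online, 500 pages and 1GB batch). The constants are copied from this section and may drift from the quotas page, reading 1MB as 1024 * 1024 bytes is an assumption, and the function name `choose_mode` is hypothetical.

```python
# Limits for the OCR parser as quoted in this section; verify them against
# the quotas page before relying on them.
ONLINE_MAX_PAGES = 15
BATCH_MAX_PAGES = 500
ONLINE_MAX_BYTES = 20 * 1024 * 1024  # 20MB, assuming binary megabytes
BATCH_MAX_BYTES = 1024 * 1024 * 1024  # 1GB, assuming binary gigabytes


def choose_mode(pages: int, size_bytes: int) -> str:
    """Return 'online', 'batch', or 'split' for one document."""
    if pages <= ONLINE_MAX_PAGES and size_bytes <= ONLINE_MAX_BYTES:
        return "online"
    if pages <= BATCH_MAX_PAGES and size_bytes <= BATCH_MAX_BYTES:
        return "batch"
    # Too large for either mode: the file would need to be split first.
    return "split"
```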


---
## TODO
- add copy of files from GitHub to local for the Colab
- may need more file examples
- use folder for type of demo: ocr, ocr_math, ocr_font, ocr_checkbox, form, invoice, summarizer
    - some of these are actually for other workflow notebooks, so don't copy all
- save online to local for use in the process request notebook

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Working%20With%20Document%20AI/Document%20AI%20-%20Process%20Documents.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.documentai', 'google-cloud-documentai'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('PIL', 'Pillow'),
    ('PyPDF2', 'PyPDF2'),
]

# importlib.util is a submodule, so import it explicitly
import importlib.util
install = False
for import_name, install_name in packages:
    # install only the packages whose import name cannot be found
    if not importlib.util.find_spec(import_name):
        print(f'installing package {install_name}')
        install = True
        !pip install {install_name} -U -q --user
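
One caveat with `importlib.util.find_spec`: for a dotted name it imports the parent package first, so it raises `ModuleNotFoundError` instead of returning `None` when the parent (for example `google.cloud`) is not installed at all. A small wrapper can absorb that case. This is a sketch, and the name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(import_name: str) -> bool:
    """Return True if import_name can be located without raising."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # A parent package of the dotted name is missing.
        return False
```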

### API Enablement

In [4]:
!gcloud services enable documentai.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'working-with-docai'
EXPERIMENT = 'process-documents'

# make this the gcs bucket for storing files
GCS_BUCKET = PROJECT_ID

# make this the BQ Project / Dataset / Table prefix to store responses
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT

Packages

In [8]:
import PyPDF2
import IPython
import PIL
import PIL.ImageFont, PIL.Image, PIL.ImageDraw

from google.cloud import documentai
from google.cloud.documentai_v1 import Document
from google.cloud import storage
from google.cloud import bigquery


Clients

In [9]:
# document AI client
LOCATION = REGION.split('-')[0]
docai_client = documentai.DocumentProcessorServiceClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

# gcs client: assumes bucket already exists
gcs = storage.Client(project = PROJECT_ID)
bucket = gcs.bucket(GCS_BUCKET)
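
The Document AI client cell above relies on a regional endpoint of the form `{location}-documentai.googleapis.com`. Requests later address a processor by its full resource name, `projects/{project}/locations/{location}/processors/{processor_id}`. The sketch below spells out both conventions. The helper names are hypothetical, the processor ID is a placeholder, and the region-to-location rule simply mirrors the `split('-')` used in that cell, so it only holds for regions whose prefix is itself a Document AI location, such as `us`.

```python
def docai_location(region: str) -> str:
    """Map a region such as 'us-central1' to its prefix ('us')."""
    return region.split("-")[0]


def docai_endpoint(region: str) -> str:
    """Build the regional API endpoint used in client_options."""
    return f"{docai_location(region)}-documentai.googleapis.com"


def processor_name(project: str, region: str, processor_id: str) -> str:
    """Build the full resource name expected in a request's 'name' field."""
    return (
        f"projects/{project}/locations/{docai_location(region)}"
        f"/processors/{processor_id}"
    )
```

The client also exposes a `processor_path(project, location, processor)` helper that builds the same resource name.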

---
## Documents

Local
GCS - create structure from local documents

---
## Processors

describe

list

get/create
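
The get/create step could follow a look-up-then-create pattern, sketched below. It assumes the display name is used as the lookup key (display names are not guaranteed to be unique), and the function name `get_or_create_processor` is hypothetical. It relies on the client's `list_processors()` and `create_processor()` methods.

```python
def get_or_create_processor(client, parent: str, display_name: str, processor_type: str):
    """Return the processor named display_name under parent, creating it if absent.

    parent has the form 'projects/{project}/locations/{location}'.
    """
    # Reuse an existing processor when one with this display name is found.
    for processor in client.list_processors(parent=parent):
        if processor.display_name == display_name:
            return processor

    # Imported lazily: only needed when a new processor has to be created.
    from google.cloud import documentai

    return client.create_processor(
        parent=parent,
        processor=documentai.Processor(
            display_name=display_name,
            type_=processor_type,  # for example 'OCR_PROCESSOR'
        ),
    )
```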

---
## Online Processing (single document)

---
## Batch Processing (multiple documents)

---
## Async Processing