<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/Document%20AI%20Processors%20-%20Document%20Summarizer.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FWorking%2520With%2FDocument%2520AI%2FDocument%2520AI%2520Processors%2520-%2520Document%2520Summarizer.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/Document%20AI%20Processors%20-%20Document%20Summarizer.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Working%20With/Document%20AI/Document%20AI%20Processors%20-%20Document%20Summarizer.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Document AI Processors - Document Summarizer
> From the [Working With Document AI](https://github.com/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/readme.md) series in the [vertex-ai-mlops](https://github.com/statmike/vertex-ai-mlops/blob/main/readme.md) repository.

Document AI is an API where you interact with processors to extract information from documents.  You enable the API, create an instance of a processor in your project, send in one or more documents, and receive back JSON with the extracted information:

<p align="center" width="100%"><center>
    <img src="../../architectures/architectures/images/working with/documentai/readme/high_level.png">
</center></p>

This notebook uses the [Summarizer parser](https://cloud.google.com/document-ai/docs/processors-list#processor_SUMMARIZER) which uses generative AI to help create a summary of the document.

The parser can be customized with parameters, but they are set up on the parser itself rather than on each parsing request.

---

**Processing**

A prior workflow covered all the ways to process a document, or many documents, using Python as the client: [Document AI - Process Documents](./Document%20AI%20-%20Process%20Documents.ipynb). It also shows how to store and retrieve responses from GCS and BigQuery.

---

**Responses**

There are many ways to process the responses from Document AI to extract the parts needed for a downstream application: paragraphs, tokens, entities, tables, and much more.  The choice depends on the application workflow and on whether a single document, a batch of documents, or an entire history of documents is being processed.  A prior workflow covered three common ways to process responses:
[Document AI - Process Responses](./Document%20AI%20-%20Process%20Responses.ipynb)
- Extraction: Python With Document AI Client
- Extraction: Python With Document AI Toolbox
- Extraction: Directly In BigQuery With SQL

---

**References:**
- [Python Client For Document AI](https://cloud.google.com/python/docs/reference/documentai/latest)
    - [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
- [Document AI Overview](https://cloud.google.com/document-ai/docs/overview)
    - [Handling the processing response](https://cloud.google.com/document-ai/docs/handle-response)

---
## Colab Setup

To run this notebook in Colab, click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Working%20With/Document%20AI/Document%20AI%20Processors%20-%20Document%20Summarizer.ipynb) and run the cells in this section.  Otherwise, skip this section.

The second cell below authenticates to GCP (follow the prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.documentai', 'google-cloud-documentai'),
    ('PIL', 'Pillow'),
    ('PyPDF2', 'PyPDF2')
]

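import importlib.util # import the submodule explicitly; `import importlib` alone does not guarantee `importlib.util` is loaded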
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
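One detail of the check above is worth a small sketch: for a dotted name such as `google.cloud.documentai`, `importlib.util.find_spec` imports the parent packages first and raises `ModuleNotFoundError` when a parent is missing, rather than returning `None`. The helper below (`is_installed` is a hypothetical name, not part of this notebook) wraps the lookup so that both cases read as "not installed":

```python
import importlib.util

def is_installed(import_name: str) -> bool:
    """Return True if `import_name` can be located by the import system."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised for dotted names when a parent package is not importable
        return False

print(is_installed('json'))
print(is_installed('package_that_does_not_exist.submodule'))
```

In the install loop, `if not is_installed(package[0]):` could stand in for the direct `find_spec` call.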

### API Enablement

In [4]:
!gcloud services enable documentai.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, continue running from the cell that follows this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [7]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [8]:
REGION = 'us-central1'
SERIES = 'working-with-docai'
EXPERIMENT = 'summarizer'

Packages

In [9]:
import os, io, json

import PyPDF2
import IPython
import PIL, PIL.ImageFont, PIL.Image, PIL.ImageDraw

from google.cloud import documentai

Clients

In [10]:
# document AI client
LOCATION = REGION.split('-')[0]
docai = documentai.DocumentProcessorServiceClient(
    client_options = dict(api_endpoint = f"{LOCATION}-documentai.googleapis.com")
)
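The client cell derives the Document AI location from the region by keeping only the part before the first hyphen, so `us-central1` becomes `us`, and then builds the regional API endpoint from it. A minimal sketch of that derivation, using hypothetical helper names (`docai_location`, `docai_endpoint`) and assuming the region always follows the `<multi-region>-<zone-group>` naming pattern:

```python
def docai_location(region: str) -> str:
    """Keep the multi-region prefix of a region name, e.g. 'us-central1' -> 'us'."""
    return region.split('-')[0]

def docai_endpoint(region: str) -> str:
    """Build the endpoint string in the same format the client cell uses."""
    return f"{docai_location(region)}-documentai.googleapis.com"

print(docai_location('us-central1'))   # us
print(docai_endpoint('us-central1'))   # us-documentai.googleapis.com
```

Whether a given prefix is an available Document AI location is not checked here; that depends on the service, not on the string manipulation.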

---
## Get The Document

This section gets the document that is sent for online processing later in this workflow.

|Document Name|Link|
|---|---|
|`../shared files/docs/sports/Baseball - Wikipedia.pdf`|[https://en.wikipedia.org/wiki/Baseball](https://en.wikipedia.org/wiki/Baseball)|

If you are working from a clone of this repository then the document is already present. The following cell checks for the file under the documents folder, `../shared files/docs`, and if it is missing downloads it from the repository with `requests`:

In [11]:
file = '../shared files/docs/sports/Baseball - Wikipedia.pdf'

if not os.path.exists(file):
    print('Retrieving document...')
    # create the target folder if it is missing
    os.makedirs(os.path.dirname(file), exist_ok = True)
    import requests, urllib.parse
    # file[3:] drops the leading '../' to get the path relative to 'Working With/'
    r = requests.get(f'https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Working%20With/{urllib.parse.quote(file[3:])}')
    r.raise_for_status() # fail here rather than saving an error page as the PDF
    with open(file, 'wb') as f:
        f.write(r.content)
    print(f'Document now at `{file}`')
else:
    print(f'Document Found at `{file}`')

Document Found at `../shared files/docs/sports/Baseball - Wikipedia.pdf`
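The download URL in the cell above is built by dropping the leading `../` from the local path and percent-encoding the remainder, so the spaces in the folder and file names become `%20` while the `/` separators are left alone. A small sketch of just that step (`raw_url` is a hypothetical helper name; the base URL is the one used in the cell):

```python
import urllib.parse

BASE = 'https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Working%20With/'

def raw_url(local_path: str) -> str:
    """Map a '../'-relative local path to its raw GitHub URL."""
    relative = local_path[3:]                   # drop the leading '../'
    return BASE + urllib.parse.quote(relative)  # '/' is kept, spaces become %20

print(raw_url('../shared files/docs/sports/Baseball - Wikipedia.pdf'))
```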


---
## Create/Get A Processor

For this workflow we will use the [Summarizer parser](https://cloud.google.com/document-ai/docs/processors-list#processor_SUMMARIZER). We can check for an existing processor in the project and, if it is not present, create one.  The processor is stored in the Python variable `PARSER` and is referred to as a parser from here on.

Get the type and version from the list of available processors: https://cloud.google.com/document-ai/docs/processors-list

What are the processors already created in this project environment?

In [12]:
processors = list(docai.list_processors(parent = f'projects/{PROJECT_ID}/locations/{LOCATION}'))
len(processors)

4

In [13]:
TYPE = 'SUMMARY_PROCESSOR'
VERSION = 'pretrained-foundation-model-v1.0-2023-08-22'

Get an existing processor:

In [14]:
PARSER = ''
for processor in processors:
    if processor.type_ == TYPE and processor.default_processor_version.split('/')[-1] == VERSION:
        PARSER = processor
        break
        
if PARSER:
    print(f'There is an existing processor with the desired type and version in PARSER = {PARSER.display_name}')
else:
    print(f'Need to create a processor for the desired type and version: {TYPE}, {VERSION}')

Need to create a processor for the desired type and version: SUMMARY_PROCESSOR, pretrained-foundation-model-v1.0-2023-08-22
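The matching cell compares `VERSION` with the last segment of `default_processor_version`, which is a full resource name made of alternating collection and ID segments. The sketch below shows that parsing on a made-up resource name; the project and processor IDs are placeholders, and `parse_resource_name` and `version_id` are hypothetical helper names:

```python
def parse_resource_name(resource_name: str) -> dict:
    """Split 'collection/id/collection/id/...' into a {collection: id} mapping."""
    parts = resource_name.split('/')
    return dict(zip(parts[0::2], parts[1::2]))

def version_id(resource_name: str) -> str:
    """Return the final segment, as done in the matching cell."""
    return resource_name.split('/')[-1]

# placeholder values for illustration only
name = (
    'projects/my-project/locations/us/processors/abc123'
    '/processorVersions/pretrained-foundation-model-v1.0-2023-08-22'
)
print(version_id(name))
print(parse_resource_name(name))
```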


Create the processor if an existing one was not found to match:

In [16]:
if not PARSER:
    PARSER = docai.create_processor(
        parent = f'projects/{PROJECT_ID}/locations/{LOCATION}',
        processor = documentai.Processor(
            display_name = f'{SERIES}-{EXPERIMENT}',
            type_ = TYPE
        )
    )
    set_default = docai.set_default_processor_version(
        request = documentai.SetDefaultProcessorVersionRequest(
            processor = PARSER.name,
            default_processor_version = f'{PARSER.name}/processorVersions/{VERSION}'
        )
    )
    set_default.result()
    PARSER = docai.get_processor(
        name = PARSER.name
    )
    print(f'Processor created and in PARSER variable with display name = {PARSER.display_name}')

Processor created and in PARSER variable with display name = working-with-docai-summarizer


---
## Online Processing (single document)

> A prior workflow covered all the ways to process a document, or many documents, using Python as the client: [Document AI - Process Documents](./Document%20AI%20-%20Process%20Documents.ipynb). It also shows how to store and retrieve responses from GCS and BigQuery.

This section uses one of the three online processing methods: `inline_document` where the document is provided as a bytes object.

> NOTE: The [Summarizer Processor](https://cloud.google.com/document-ai/docs/processors-list#processor_SUMMARIZER) has page limits of 15 for online processing and 250 for batch processing.

The following is the Python client reference to use for this online processing section:
- [google.cloud.documentai.DocumentProcessorServiceClient.process_document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.services.document_processor_service.DocumentProcessorServiceClient#google_cloud_documentai_v1_services_document_processor_service_DocumentProcessorServiceClient_process_document)

Using the processor stored in `PARSER` from above:

In [17]:
PARSER.name

'projects/1026793852137/locations/us/processors/8f2e1b8e4b2c6604'

Read the document to a bytes object:

In [18]:
with open(file, 'rb') as f:
    local_doc = f.read()

In [19]:
type(local_doc)

bytes

---
### Document as bytes: `inline_document`

Reference:
- [documentai.ProcessRequest()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessRequest)
    - (This One) `inline_document` = [documentai.Document()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document)
    - `raw_document` = [documentai.RawDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.RawDocument)
    - `gcs_document` = [documentai.GcsDocument()](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.GcsDocument)

In [31]:
response = docai.process_document(
    request = documentai.ProcessRequest(
        name = PARSER.name,
        inline_document = documentai.Document(
            content = local_doc,
            mime_type = 'application/pdf'
        ),
        process_options = documentai.ProcessOptions(
            from_start = 3
        )
    )
)

In [27]:
len(response.document.pages)

3
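The `from_start = 3` option in the request asks the processor to handle only the first three pages, which is why the response holds three pages. As a rough sketch of how that choice interacts with the page limits noted earlier (15 online, 250 batch), the hypothetical helper below picks a processing mode and a `from_start` value from a page count. The 40-page total is a made-up figure, not the length of the Baseball PDF; a real count could be read with a PDF library such as the `PyPDF2` package imported earlier.

```python
from typing import Optional

ONLINE_PAGE_LIMIT = 15   # taken from the note in this notebook
BATCH_PAGE_LIMIT = 250   # taken from the note in this notebook

def plan_processing(total_pages: int, wanted_pages: Optional[int] = None) -> dict:
    """Pick online or batch processing and a from_start value (hypothetical helper)."""
    pages = total_pages if wanted_pages is None else min(wanted_pages, total_pages)
    if pages < 1:
        raise ValueError('at least one page is required')
    if pages <= ONLINE_PAGE_LIMIT:
        mode = 'online'
    elif pages <= BATCH_PAGE_LIMIT:
        mode = 'batch'
    else:
        raise ValueError(f'{pages} pages exceeds the batch limit of {BATCH_PAGE_LIMIT}')
    # from_start is only needed when fewer pages than the whole document are wanted
    from_start = pages if pages < total_pages else None
    return {'mode': mode, 'pages': pages, 'from_start': from_start}

print(plan_processing(40, wanted_pages = 3))  # first three pages fit online processing
print(plan_processing(40))                    # the whole document needs batch processing
```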

In [28]:
print(response.document.text[0:250])

10/27/23, 9:22 AM
Baseball - Wikipedia
WIKIPEDIA
The Free Encyclopedia
Toggle the table of contents
Baseball
✰ B
ETTS
Baseball is a bat-and-ball sport played between two
Baseball
teams of nine players each, taking turns batting and
fielding. The game


---
## Extraction Methods

Below, the Python Client for Document AI is used to extract the parts of the processor response.  Other methods may be a better fit for different application flows.  The workflow [Document AI - Process Responses](./Document%20AI%20-%20Process%20Responses.ipynb) covers additional methods.
- Extraction: Python With Document AI Client
- Extraction: Python With Document AI Toolbox
- Extraction: Directly In BigQuery With SQL

---
## Extraction: Python With Document AI Client

The response from online processing is a [`documentai.ProcessResponse()`](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.ProcessResponse) which has a `document` attribute that is a [`documentai.Document()`](https://cloud.google.com/python/docs/reference/documentai/latest/google.cloud.documentai_v1.types.Document) protobuf.  This object can be directly used to extract elements of the document.  It can also be converted to other data types that can be iterated in Python:
- JSON with `documentai.Document.to_json(response.document)`
- Dictionary with `documentai.Document.to_dict(response.document)`
- > Note: These follow the `google.protobuf.json_format` - [reference](https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html).  That means the parameters of `.to_dict()` mirror those of `google.protobuf.json_format.MessageToDict()`, including `use_integers_for_enums`, which should be set to `False` to get the text representation of enum values.

The workflow has a helpful guide to the structure of the `documentai.Document` object in the [readme.md](./readme.md).

### Entities

Entities are key:value pairs extracted from the overall document.  Each parser has specific entities it is trained to detect and return.  For the Summarizer parser, the entity returned is a summary prepared by generative AI.
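Once a document has been converted to a dictionary, the entities can be walked with ordinary Python. The sketch below uses a hand-built dictionary as a stand-in for the converted response, so it runs without calling the API; the key names (`entities`, `type_`, `mention_text`, `normalized_value`, `text`) are assumed to match the attribute names used in this notebook, and `entity_texts` is a hypothetical helper.

```python
import json

# hand-built stand-in for a converted document; not real API output
doc_dict = {
    'entities': [
        {
            'type_': 'summary',
            'mention_text': 'raw summary text',
            'normalized_value': {'text': 'normalized summary text'},
        }
    ]
}

def entity_texts(document: dict) -> dict:
    """Map each entity type to its normalized text, falling back to mention_text."""
    result = {}
    for entity in document.get('entities', []):
        normalized = entity.get('normalized_value', {}).get('text')
        result[entity['type_']] = normalized or entity.get('mention_text', '')
    return result

print(entity_texts(doc_dict))

# the same structure survives a round trip through JSON text
assert json.loads(json.dumps(doc_dict)) == doc_dict
```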

In [32]:
len(response.document.entities)

1

In [33]:
response.document.entities[0]

type_: "summary"
mention_text: " \342\200\242 Baseball is a bat-and-ball sport played between two teams of nine players each, taking turns batting and fielding. \n\342\200\242 The objective of the offensive team (batting team) is to hit the ball into the field of play, away from the other team\'s players, allowing its players to run the bases, having them advance counter-clockwise around four bases to score what are called \"runs\". \n\342\200\242 The objective of the defensive team (referred to as the fielding team) is to prevent batters from becoming runners, and to prevent runners\' advance around the bases. \n\342\200\242 Baseball evolved from older bat-and-ball games already being played in England by the mid-18th century. \n\342\200\242 This game was brought by immigrants to North America, where the modern version developed. \n\342\200\242 In Major Le gue Baseball (MLB), the highest level of professional baseball in the United States and Canada, teams are divided into the Nationa
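The `\342\200\242` sequences in the representation above are not corruption: they are the three UTF-8 bytes of the bullet character, shown as octal escapes by the protobuf text format. Printing the string, as the next cell does, renders them as `•`. A short sketch of that decoding, followed by one way to split a bullet-style summary into separate points (the `summary` string here is a made-up stand-in, not the processor output):

```python
# the three octal-escaped bytes decode to a single bullet character
bullet = b'\342\200\242'.decode('utf-8')
print(bullet == '\u2022')

# made-up stand-in that follows the same ' • point \n• point' layout
summary = ' \u2022 First point. \n\u2022 Second point. \n\u2022 Third point.'

# split on the bullet and drop the empty leading piece and surrounding whitespace
points = [piece.strip() for piece in summary.split(bullet) if piece.strip()]
for point in points:
    print(point)
```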

In [35]:
print(response.document.entities[0].normalized_value.text)

 • Baseball is a bat-and-ball sport played between two teams of nine players each, taking turns batting and fielding. 
• The objective of the offensive team (batting team) is to hit the ball into the field of play, away from the other team's players, allowing its players to run the bases, having them advance counter-clockwise around four bases to score what are called "runs". 
• The objective of the defensive team (referred to as the fielding team) is to prevent batters from becoming runners, and to prevent runners' advance around the bases. 
• Baseball evolved from older bat-and-ball games already being played in England by the mid-18th century. 
• This game was brought by immigrants to North America, where the modern version developed. 
• In Major Le gue Baseball (MLB), the highest level of professional baseball in the United States and Canada, teams are divided into the National League (NL) and American League (AL), each with three divisions: East, West, and Central. 
• The MLB cham