
## Install requisites

In [1]:
%%capture
!pip3 install --upgrade google-cloud-documentai
!pip3 install --upgrade google-cloud-storage

## Credits

Much of the code in this demo comes from [Optical Character Recognition (OCR) with Document AI (Python) by Holt Skinner](https://codelabs.developers.google.com/codelabs/docai-ocr-python).

## Import libraries

In [13]:
import re
import os
from typing import List

from google.api_core.client_options import ClientOptions
from google.cloud import documentai_v1 as documentai
from google.cloud import storage

In [14]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/key.json' 
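
The cell above assumes that a service account key has already been uploaded to `/content/key.json`. A minimal sketch of a pre-flight check, using only the standard library, might look like the following. The helper name `require_key_file` is hypothetical and is not part of any Google library.

```python
import os


def require_key_file(path: str) -> str:
    """Return `path` unchanged if it points to an existing file, else raise."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Service account key not found at {path}")
    return path
```

One way to use it is to wrap the assignment, for example `os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = require_key_file("/content/key.json")`, so that a missing key fails here rather than later when the first client is created.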

In [15]:
PROJECT_ID = "gcp-adventure-x"
LOCATION = "us"  # Format is 'us' or 'eu'
PROCESSOR_ID = "89cd56bc4dfa8269"  # Create processor in Cloud Console

In [16]:
# Format 'gs://input_bucket/directory/file.pdf'
GCS_INPUT_URI = "gs://data-in-gcp/pdf/haldane-1932-causes-of-evolution-flat-sample.pdf"
INPUT_MIME_TYPE = "application/pdf"

In [17]:
# Format 'gs://output_bucket/directory'
GCS_OUTPUT_URI = "gs://data-in-gcp/output"

In [18]:
# Instantiates a client
docai_client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{LOCATION}-documentai.googleapis.com")
)

In [19]:
# The full resource name of the processor, e.g.:
# projects/project-id/locations/location/processors/processor-id
# You must create new processors in the Cloud Console first
RESOURCE_NAME = docai_client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)
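
`processor_path` is a convenience method that fills in a template string. As a rough illustration of what the resulting name looks like, the sketch below assembles it by hand, assuming the `projects/.../locations/.../processors/...` layout that Document AI uses for processor names. The function `build_processor_name` is hypothetical; in real code the client method is the safer choice.

```python
def build_processor_name(project_id: str, location: str, processor_id: str) -> str:
    """Assemble a processor resource name by hand (illustrative only)."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


# Uses the same values as the configuration cell above
print(build_processor_name("gcp-adventure-x", "us", "89cd56bc4dfa8269"))
# prints: projects/gcp-adventure-x/locations/us/processors/89cd56bc4dfa8269
```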

In [20]:
# Cloud Storage URI for the Input Document
input_document = documentai.GcsDocument(
    gcs_uri=GCS_INPUT_URI, mime_type=INPUT_MIME_TYPE
)

In [21]:
# Load GCS Input URI into a List of document files
input_config = documentai.BatchDocumentsInputConfig(
    gcs_documents=documentai.GcsDocuments(documents=[input_document])
)

In [22]:
# Cloud Storage URI for Output directory
gcs_output_config = documentai.DocumentOutputConfig.GcsOutputConfig(
    gcs_uri=GCS_OUTPUT_URI
)

In [23]:
# Load GCS Output URI into OutputConfig object
output_config = documentai.DocumentOutputConfig(gcs_output_config=gcs_output_config)

In [24]:
# Configure Process Request
request = documentai.BatchProcessRequest(
    name=RESOURCE_NAME,
    input_documents=input_config,
    document_output_config=output_config,
)

In [25]:
# Batch Process returns a Long Running Operation (LRO)
operation = docai_client.batch_process_documents(request)

In [26]:
# Continually polls the operation until it is complete.
# This could take some time for larger files
# Format: projects/PROJECT_NUMBER/locations/LOCATION/operations/OPERATION_ID
print(f"Waiting for operation {operation.operation.name} to complete...")
operation.result()

Waiting for operation projects/869337967308/locations/us/operations/4584907330761309997 to complete...
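
`operation.result()` blocks until the batch job finishes, which can take a while for large files. One way to bound the wait is to poll manually with a deadline. The sketch below assumes only that the object has a `done()` method returning a boolean, which the operation futures in `google-api-core` provide. The name `wait_until_done` and its default values are hypothetical.

```python
import time


def wait_until_done(operation, timeout_s: float = 600.0, poll_interval_s: float = 5.0) -> None:
    """Poll `operation.done()` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while not operation.done():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Operation still running after {timeout_s} seconds")
        time.sleep(poll_interval_s)
```

Passing a timeout directly, as in `operation.result(timeout=600)`, should achieve much the same thing with less code. The explicit loop is mainly useful when progress needs to be logged between polls.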




In [27]:
print("Document processing complete.")

Document processing complete.


In [28]:
# Once the operation is complete,
# get output document information from operation metadata
metadata = documentai.BatchProcessMetadata(operation.metadata)

In [29]:
if metadata.state != documentai.BatchProcessMetadata.State.SUCCEEDED:
    raise ValueError(f"Batch Process Failed: {metadata.state_message}")

In [30]:
documents: List[documentai.Document] = []

# Storage Client to retrieve the output files from GCS
storage_client = storage.Client()

In [31]:
# One process per Input Document
# pylint: disable=not-an-iterable
for process in metadata.individual_process_statuses:

    # output_gcs_destination format: gs://BUCKET/PREFIX/OPERATION_NUMBER/0
    # The GCS API requires the bucket name and URI prefix separately
    output_bucket, output_prefix = re.match(
        r"gs://(.*?)/(.*)", process.output_gcs_destination
    ).groups()

    # Get List of Document Objects from the Output Bucket
    output_blobs = storage_client.list_blobs(output_bucket, prefix=output_prefix)

    # DocAI may output multiple JSON files per source file
    for blob in output_blobs:
        # Document AI should only write JSON files to GCS; skip anything else
        if not blob.name.endswith(".json"):
            print(f"Skipping unsupported file type {blob.name}")
            continue

        print(f"Fetching {blob.name}")

        # Download JSON File and Convert to Document Object
        document = documentai.Document.from_json(
            blob.download_as_bytes(), ignore_unknown_fields=True
        )

        documents.append(document)

Fetching output/4584907330761309997/0/haldane-1932-causes-of-evolution-flat-sample-0.json
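
The loop above splits each output URI into a bucket name and an object prefix with `re.match(...).groups()`, which raises an unhelpful `AttributeError` if the URI is malformed. A small sketch of the same split with an explicit error is shown below. The helper `split_gcs_uri` is hypothetical and uses only the standard library.

```python
import re
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/prefix' into (bucket, prefix); raise on malformed input."""
    match = re.fullmatch(r"gs://([^/]+)/(.*)", uri)
    if match is None:
        raise ValueError(f"Not a valid gs:// URI: {uri!r}")
    return match.group(1), match.group(2)


print(split_gcs_uri("gs://data-in-gcp/output/4584907330761309997/0"))
# prints: ('data-in-gcp', 'output/4584907330761309997/0')
```

This version treats a URI with no slash after the bucket name, such as `gs://bucket`, as invalid. That is a design choice for this sketch, not a rule of the Cloud Storage API.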


In [32]:
# Print Text from all documents
# Truncated at 100 characters for brevity
for document in documents:
    print(document.text[:100])

THE CAUSES OF
EVOLUTION
CHAPTER I
INTRODUCTION
"Darwinism is dead."—Any sermon.
SEVENTY-TWO years ha


In [33]:
for document in documents:
    print(document.text)

THE CAUSES OF
EVOLUTION
CHAPTER I
INTRODUCTION
"Darwinism is dead."—Any sermon.
SEVENTY-TWO years have now elapsed since Darwin
and Wallace (1858) formulated the theory that
evolution had occurred largely as a result of natural
selection. The doctrine of evolution was not, of
course, new. But Lamarck and other eminent bio-
logists had failed to convince the scientific world or
the general public that evolution had occurred, still
less that it had occurred owing to the operation of
any particular set of causes. Darwin contrived to
carry a considerable measure of conviction on both
these points. The result has been that a generation
ago most people who believed in evolution held that
it had been largely due to natural selection. Nowa-
days a certain number of believers in evolution
do not regard natural selection as a cause of it,
I
B
THE CAUSES OF EVOLUTION
but I think that in general the two beliefs still go
together.
So close a correlation is rather rare in the history
of human though