# Form Parsing using Google Cloud Document AI

This notebook shows how to use Google Cloud Document AI to parse a campaign disclosure form.

It accompanies this Medium article:
https://medium.com/@lakshmanok/how-to-parse-forms-using-google-cloud-document-ai-68ad47e1c0ed

## Document

As an example, let's take this US election campaign disclosure form.

In [2]:
!ls *.pdf

scott_walker.pdf


In [3]:
from IPython.display import IFrame
IFrame("./scott_walker.pdf", width=600, height=300)

## Upload to Cloud Storage

Document AI works with documents on Cloud Storage, so let's upload the doc.

In [4]:
BUCKET="ai-analytics-solutions-kfpdemo"  # CHANGE to a bucket that you own

In [6]:
!gsutil cp scott_walker.pdf gs://{BUCKET}/formparsing/scott_walker.pdf

Copying file://scott_walker.pdf [Content-Type=application/pdf]...
/ [1 files][209.7 KiB/209.7 KiB]                                                
Operation completed over 1 objects/209.7 KiB.                                    


In [19]:
!gsutil ls gs://{BUCKET}/formparsing/scott_walker.pdf

gs://ai-analytics-solutions-kfpdemo/formparsing/scott_walker.pdf


## Enable Document AI

1. First enable Document AI in your project by visiting
https://console.developers.google.com/apis/api/documentai.googleapis.com/overview

2. Find out who you are running as:

In [14]:
!gcloud auth list

                  Credentialed Accounts
ACTIVE  ACCOUNT
*       379218021631-compute@developer.gserviceaccount.com

To set the active account, run:
    $ gcloud config set account `ACCOUNT`



3. Create a service account by visiting
https://console.cloud.google.com/iam-admin/serviceaccounts/create
and give this service account the Document AI Core Service Account authorization.

4. Give the ACTIVE ACCOUNT listed above the ability to use the service account you just created.

## Call Document AI

In [20]:
%%bash
PDF="gs://ai-analytics-solutions-kfpdemo/formparsing/scott_walker.pdf" # CHANGE to your PDF file
REGION="us"  # change to "eu" (lowercase, it becomes part of the hostname) if the bucket is in the EU

cat <<EOM > request.json
{
   "inputConfig":{
      "gcsSource":{
         "uri":"${PDF}"
      },
      "mimeType":"application/pdf"
   },
   "documentType":"general",
   "formExtractionParams":{
      "enabled":true
   }
}
EOM

# Send request to Document AI.
PROJECT=$(gcloud config get-value project)
echo "Sending the following request to Document AI in ${PROJECT} ($REGION region), saving to response.json"
cat request.json

# The location in the URL path must match the regional endpoint, so use ${REGION} in both places.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://${REGION}-documentai.googleapis.com/v1beta2/projects/${PROJECT}/locations/${REGION}/documents:process" \
  > response.json

Sending the following request to Document AI in ai-analytics-solutions (us region), saving to response.json
{
   "inputConfig":{
      "gcsSource":{
         "uri":"gs://ai-analytics-solutions-kfpdemo/formparsing/scott_walker.pdf"
      },
      "mimeType":"application/pdf"
   },
   "documentType":"general",
   "formExtractionParams":{
      "enabled":true
   }
}


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3366k    0 3366k  100   246   832k     60  0:00:04  0:00:04 --:--:--  832k
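
The same request body and endpoint can also be assembled in Python, which avoids hand-editing JSON when the PDF or region changes. This is a minimal sketch, not part of the original notebook: `build_request` and `build_endpoint` are hypothetical helper names, the bucket in the usage line is a placeholder, and the sketch assumes the location in the URL path should match the regional hostname.

```python
import json

def build_request(gcs_uri, mime_type="application/pdf"):
    """Return the form-parsing request body shown above as a Python dict."""
    return {
        "inputConfig": {
            "gcsSource": {"uri": gcs_uri},
            "mimeType": mime_type,
        },
        "documentType": "general",
        "formExtractionParams": {"enabled": True},
    }

def build_endpoint(project, region="us"):
    """Return the v1beta2 documents:process URL for a project and region (assumed layout)."""
    return ("https://{r}-documentai.googleapis.com/v1beta2/"
            "projects/{p}/locations/{r}/documents:process").format(r=region, p=project)

# Placeholder bucket and project: CHANGE to your own.
request = build_request("gs://my-bucket/formparsing/scott_walker.pdf")
print(json.dumps(request, indent=3))
print(build_endpoint("my-project", region="us"))
```

Writing `json.dumps(request)` to request.json gives a file that the curl command can send unchanged.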


In [22]:
!tail response.json

                }
              ]
            },
            "orientation": "PAGE_UP"
          }
        }
      ]
    }
  ]
}


## Parse the response

Let's use Python to parse the response and pull out specific fields.


In [4]:
import json
# Use a context manager so the file is closed once the JSON is loaded.
with open('response.json') as ifp:
    response = json.load(ifp)
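
Because curl saves whatever the server returns, response.json may hold an error object instead of a parsed document (for example, when the service account lacks permission). The sketch below is one way to check before indexing into the result. `check_response` is a hypothetical helper, and it assumes the usual Google API error shape of a top-level "error" object with "code" and "message" fields.

```python
def check_response(response):
    """Raise if the parsed JSON looks like an API error rather than a document."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError("Document AI call failed: {} {}".format(
            err.get("code"), err.get("message")))
    # The rest of this notebook relies on these two top-level keys.
    missing = [key for key in ("text", "pages") if key not in response]
    if missing:
        raise ValueError("Response is missing expected keys: {}".format(missing))
    return response

# Tiny stand-in for a successful response; pass the real parsed JSON instead.
check_response({"text": "PAGE 1/14\n", "pages": []})
```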

In [29]:
allText = response['text']
print(allText[:100])

10/09/2020 15:51
Image# 202010099285076251
PAGE 1/14
7
REPORT OF RECEIPTSFECAND DISBURSEMENTSFORM 3P


### Option 1: Parsing blocks of text

As an example, let's try to get the "Cash on Hand" amount. It is on page 2 of the form, and the answer is $75,931.36.
All the text in the document is in the allText field. We just need to find the right starting and ending index
for what we want to extract.

In [30]:
print(allText.index("CASH ON HAND"))

1719


We know that "Cash on Hand" is on page 2, which is index 1 in the zero-based list of pages.

In [33]:
response['pages'][1]['blocks'][5]

{'layout': {'textAnchor': {'textSegments': [{'startIndex': '1716',
     'endIndex': '1827'}]},
  'confidence': 1,
  'boundingPoly': {'normalizedVertices': [{'x': 0.068627454, 'y': 0.24873738},
    {'x': 0.6764706, 'y': 0.24873738},
    {'x': 0.6764706, 'y': 0.25757575},
    {'x': 0.068627454, 'y': 0.25757575}]},
  'orientation': 'PAGE_UP'}}

In [34]:
response['pages'][1]['blocks'][5]['layout']['textAnchor']['textSegments'][0]

{'startIndex': '1716', 'endIndex': '1827'}

In [35]:
startIndex = int(response['pages'][1]['blocks'][5]['layout']['textAnchor']['textSegments'][0]['startIndex'])
endIndex = int(response['pages'][1]['blocks'][5]['layout']['textAnchor']['textSegments'][0]['endIndex'])
allText[startIndex:endIndex]

'6. CASH ON HAND AT BEGINNING OF REPORTING PERIOD .............................................................\n'

Cool, we are at the right part of the document! Let's get the next block, which should be the actual amount.

In [55]:
def extractText(allText, elem):
    # Slice allText using the first text segment of the element's text anchor.
    startIndex = int(elem['textAnchor']['textSegments'][0]['startIndex'])
    endIndex = int(elem['textAnchor']['textSegments'][0]['endIndex'])
    return allText[startIndex:endIndex].strip()

# Block 6 follows the "CASH ON HAND" label (block 5) and holds the amount.
amount = float(extractText(allText, response['pages'][1]['blocks'][6]['layout']))
print(amount)

75931.36
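
The page and block indices used above were found by inspection. A label search can locate them instead, as in the rough sketch below. It runs on a toy response built to mimic the structure shown in this notebook. `segment_bounds` and `find_block` are hypothetical helpers, and treating a missing startIndex as 0 is an assumption about how zero-valued fields may be serialized.

```python
def segment_bounds(layout):
    """Return (start, end) offsets of the first text segment of a layout."""
    seg = layout['textAnchor']['textSegments'][0]
    # Assumption: a missing startIndex means offset 0.
    return int(seg.get('startIndex', 0)), int(seg['endIndex'])

def find_block(response, all_text, label):
    """Return (page_index, block_index) of the first block containing label, else None."""
    for p, page in enumerate(response['pages']):
        for b, block in enumerate(page.get('blocks', [])):
            start, end = segment_bounds(block['layout'])
            if label in all_text[start:end]:
                return p, b
    return None

# Toy data: a label block followed by an amount block.
label_line = "6. CASH ON HAND AT BEGINNING OF REPORTING PERIOD ....\n"
toy_text = label_line + "75931.36\n"
toy_response = {'pages': [{'blocks': [
    {'layout': {'textAnchor': {'textSegments': [
        {'endIndex': str(len(label_line))}]}}},
    {'layout': {'textAnchor': {'textSegments': [
        {'startIndex': str(len(label_line)), 'endIndex': str(len(toy_text))}]}}},
]}]}

p, b = find_block(toy_response, toy_text, "CASH ON HAND")
# As in the notebook, the amount is assumed to sit in the block right after the label.
start, end = segment_bounds(toy_response['pages'][p]['blocks'][b + 1]['layout'])
amount = float(toy_text[start:end].strip())
print(amount)  # 75931.36
```

On the real response, the call would be `find_block(response, allText, "CASH ON HAND")`. Whether the value always lands in the very next block depends on the form layout, so it is worth checking per document.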


### Option 2: Parsing form fields

What we did with blocks of text was quite low-level. Document AI understands that forms tend to have key-value pairs, and the JSON response includes these extracted key-value pairs as well.

Besides FormField, Document AI can also extract Paragraph and Table elements from the document.

In [47]:
response['pages'][1].keys()

dict_keys(['pageNumber', 'dimension', 'layout', 'blocks', 'paragraphs', 'lines', 'tokens', 'tables', 'formFields'])
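
The tables key listed above can be read with the same text-anchor approach. The sketch below turns one table into a list of rows, using toy data. `anchor_text` and `table_rows` are hypothetical helpers, and the key names headerRows, bodyRows and cells are assumptions based on the Document AI table message, so check them against your own response.json before relying on them.

```python
def anchor_text(all_text, layout):
    """Join every text segment referenced by a layout's text anchor."""
    segments = layout.get('textAnchor', {}).get('textSegments', [])
    return ''.join(
        all_text[int(s.get('startIndex', 0)):int(s['endIndex'])] for s in segments
    ).strip()

def table_rows(all_text, table):
    """Return header rows followed by body rows, each as a list of cell strings."""
    rows = []
    for row in table.get('headerRows', []) + table.get('bodyRows', []):
        rows.append([anchor_text(all_text, cell['layout']) for cell in row.get('cells', [])])
    return rows

def cell(start, end):
    """Build a toy table cell pointing at all_text[start:end]."""
    return {'layout': {'textAnchor': {'textSegments': [
        {'startIndex': str(start), 'endIndex': str(end)}]}}}

# Toy text and table: one header row and one body row.
toy_text = "ItemAmountRent100"
toy_table = {
    'headerRows': [{'cells': [cell(0, 4), cell(4, 10)]}],
    'bodyRows': [{'cells': [cell(10, 14), cell(14, 17)]}],
}
print(table_rows(toy_text, toy_table))  # [['Item', 'Amount'], ['Rent', '100']]
```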

In [51]:
response['pages'][1]['formFields'][2]

{'fieldName': {'textAnchor': {'textSegments': [{'startIndex': '1719',
     'endIndex': '1765'}]},
  'confidence': 0.9962783,
  'boundingPoly': {'normalizedVertices': [{'x': 0.0922335, 'y': 0.24873738},
    {'x': 0.4584429, 'y': 0.24873738},
    {'x': 0.4584429, 'y': 0.2587827},
    {'x': 0.0922335, 'y': 0.2587827}]},
  'orientation': 'PAGE_UP'},
 'fieldValue': {'textAnchor': {'textSegments': [{'startIndex': '1716',
     'endIndex': '1842'}]},
  'confidence': 0.9962783,
  'boundingPoly': {'normalizedVertices': [{'x': 0.068627454, 'y': 0.24873738},
    {'x': 0.90849674, 'y': 0.24873738},
    {'x': 0.90849674, 'y': 0.26767677},
    {'x': 0.068627454, 'y': 0.26767677}]},
  'orientation': 'PAGE_UP'}}

In [53]:
fieldName = extractText(allText, response['pages'][1]['formFields'][2]['fieldName'])
fieldValue = extractText(allText, response['pages'][1]['formFields'][2]['fieldValue'])
print('key={}\nvalue={}'.format(fieldName, fieldValue))

key=CASH ON HAND AT BEGINNING OF REPORTING PERIOD
value=6. CASH ON HAND AT BEGINNING OF REPORTING PERIOD .............................................................
75931.36
, , .
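
As the output shows, the extracted value can include the label, leader dots and stray punctuation along with the amount. One possible cleanup is to collect every form field on a page into a dict and then pull the first number out of a value with a regular expression. This is a sketch on toy data, with assumptions: `anchor_text`, `form_fields` and `first_amount` are hypothetical helpers, and the regex expects amounts with exactly two decimal places.

```python
import re

def anchor_text(all_text, layout):
    """Join every text segment referenced by a layout's text anchor."""
    segments = layout.get('textAnchor', {}).get('textSegments', [])
    return ''.join(
        all_text[int(s.get('startIndex', 0)):int(s['endIndex'])] for s in segments
    ).strip()

def form_fields(all_text, page, min_confidence=0.0):
    """Map field name to field value for one page, skipping low-confidence names."""
    result = {}
    for field in page.get('formFields', []):
        if field['fieldName'].get('confidence', 0.0) >= min_confidence:
            name = anchor_text(all_text, field['fieldName'])
            result[name] = anchor_text(all_text, field['fieldValue'])
    return result

def first_amount(value):
    """Return the first number with two decimals found in value, or None."""
    match = re.search(r'-?\d[\d,]*\.\d{2}', value)
    return float(match.group().replace(',', '')) if match else None

# Toy page: the value span overlaps the label, as in the real response.
label = "CASH ON HAND"
toy_text = "6. CASH ON HAND ....\n75931.36\n"
name_start = toy_text.index(label)
toy_page = {'formFields': [{
    'fieldName': {'confidence': 0.99, 'textAnchor': {'textSegments': [
        {'startIndex': str(name_start), 'endIndex': str(name_start + len(label))}]}},
    'fieldValue': {'confidence': 0.99, 'textAnchor': {'textSegments': [
        {'endIndex': str(len(toy_text))}]}},
}]}

fields = form_fields(toy_text, toy_page, min_confidence=0.9)
print(first_amount(fields[label]))  # 75931.36
```

On the real response, the equivalent call would be `form_fields(allText, response['pages'][1])`.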


Enjoy!

Copyright 2020 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.