<h1> <b> Amazon Textract Primitives and APIs</b> </h1>

Amazon Textract is a document analysis service that detects and extracts printed text, handwriting, and structured data (such as fields of interest with their values, and tables) from images and scans of documents. Amazon Textract's machine learning models have been trained on millions of documents so that virtually any document type you upload is automatically recognized and processed for text extraction. For each element it identifies, the service returns a confidence score so that you can make informed decisions about how to use the results. This workshop focuses on four Textract APIs, each of which plays a different role in document processing.

- Selected kernel: Python 3 (Data Science)

<h3> <strong>For more details about each of these APIs, including when to use them, refer to Module 1 of the Workshop Guide.</strong> </h3>

- Python libraries:
    - [amazon-textract-prettyprinter](https://pypi.org/project/amazon-textract-prettyprinter/) provides functions that format the output received from Textract in more easily consumable formats, including CSV and Markdown.
    - [amazon-textract-response-parser](https://pypi.org/project/amazon-textract-response-parser/) parses the JSON returned by Amazon Textract and provides programming-language-specific constructs for working with the different parts of a document. textractor is an example of a proof-of-concept batch processing tool that uses the response parser library to generate output in multiple formats.

In [1]:
# Install the Textract helper libraries before importing them
!python -m pip install amazon-textract-prettyprinter
!python -m pip install amazon-textract-response-parser

import os
import pprint

import boto3
import textractprettyprinter
from trp import Document
from textractprettyprinter.t_pretty_print import (
    Pretty_Print_Table_Format,
    Textract_Pretty_Print,
    get_lines_string,
    get_string,
)



<h1> Detect Text with Amazon Textract - Local Document </h1>

Amazon Textract performs OCR through the Detect Document Text API, which returns all of the raw text found in the input document. In this section the document is read from local disk and sent to the API as bytes. For the full list of Amazon Textract operations, see [Amazon Textract API Operations](https://docs.aws.amazon.com/textract/latest/dg/API_Operations.html).

The sample document is a W-2 form, the Wage and Tax Statement used in the US.

In [2]:
# Initialize the connection to Amazon Textract
textract = boto3.client('textract')

# Select the document
document = 'w2example.jpg'


In [3]:
# Send the document to the Detect Document Text API
# Use a separate name for the file handle so `document` keeps holding the path
with open(document, 'rb') as image_file:
    imageBytes = bytearray(image_file.read())

textract_response = textract.detect_document_text(Document={'Bytes': imageBytes})

In [4]:
# Parse the response and collect each line and word with its confidence score
doc = Document(textract_response)
extract_info = []
for page in doc.pages:
    # Record every line, followed by the words it contains
    for line in page.lines:
        line_extraction = "Line: {}--{}".format(line.text, line.confidence)
        extract_info.append(line_extraction) 
        for word in line.words:
            word_extraction = "Word: {}--{}".format(word.text, word.confidence)
            extract_info.append(word_extraction) 
extract_info[0:15]

["Line: a Employee's social security number--99.8873519897461",
 'Word: a--99.88846588134766',
 "Word: Employee's--99.92671966552734",
 'Word: social--99.68712615966797',
 'Word: security--99.95610046386719',
 'Word: number--99.97835540771484',
 'Line: 22222--97.93190002441406',
 'Word: 22222--97.93190002441406',
 'Line: 054-22-1254--99.81170654296875',
 'Word: 054-22-1254--99.81170654296875',
 'Line: OMB No. 1545-0008--99.65917205810547',
 'Word: OMB--99.75211334228516',
 'Word: No.--99.65322875976562',
 'Word: 1545-0008--99.57217407226562',
 'Line: b Employer identification number (EIN)--99.92436981201172']
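
The confidence values printed above can drive a simple review rule. The following is a minimal sketch, assuming a hand-written sample dictionary shaped like a Textract response (`Blocks` entries with `BlockType`, `Text`, and `Confidence`); the helper name `split_by_confidence`, the 90.0 threshold, and the "smudged text" entry are illustrative assumptions, not part of the workshop.

```python
# Hypothetical, hand-written sample shaped like a Textract response.
# A real response carries many more fields per block.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "OMB No. 1545-0008", "Confidence": 99.66},
        {"BlockType": "LINE", "Id": "l2", "Text": "22222", "Confidence": 97.93},
        {"BlockType": "LINE", "Id": "l3", "Text": "smudged text", "Confidence": 62.10},
    ]
}


def split_by_confidence(response, threshold=90.0, block_type="LINE"):
    """Split blocks of one type into (accepted, needs_review) by confidence."""
    accepted, needs_review = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != block_type:
            continue  # skip PAGE, WORD, and other block types
        # Blocks without a Confidence field are treated as 0.0 and sent to review
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block["Text"])
        else:
            needs_review.append(block["Text"])
    return accepted, needs_review


accepted, needs_review = split_by_confidence(sample_response, threshold=90.0)
print("accepted:", accepted)
print("needs review:", needs_review)
```

The right threshold depends on the use case; a value that works for free-text search may be too permissive for tax identifiers.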

<h4>(Print Sample) Line Block of the detect_document_text response</h4>
<img src="w2_response_sample.png" alt="w2_response_sample" width="800"/>

In [None]:
# Optional step: print the full raw Textract output. Refer to the Workshop Guide for more details.
#pprint.pprint(textract_response)
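
If you prefer to work with the raw response instead of the parser library, the sketch below shows one way to walk it. It assumes a hand-written sample in which each `LINE` block lists its `WORD` blocks through a `CHILD` relationship, which is how the raw JSON links them; the helper names `child_ids` and `lines_with_words` are hypothetical.

```python
# Hypothetical sample: a trimmed-down imitation of the raw response.
sample_response = {
    "Blocks": [
        {"Id": "page-1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["line-1", "line-2"]}]},
        {"Id": "line-1", "BlockType": "LINE", "Text": "OMB No. 1545-0008",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "OMB"},
        {"Id": "w2", "BlockType": "WORD", "Text": "No."},
        {"Id": "w3", "BlockType": "WORD", "Text": "1545-0008"},
        {"Id": "line-2", "BlockType": "LINE", "Text": "22222",
         "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
        {"Id": "w4", "BlockType": "WORD", "Text": "22222"},
    ]
}


def child_ids(block):
    """Return the Ids listed under the block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def lines_with_words(response):
    """Return (line_text, [word_text, ...]) tuples in response order."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        words = [
            blocks_by_id[block_id]["Text"]
            for block_id in child_ids(block)
            if blocks_by_id[block_id]["BlockType"] == "WORD"
        ]
        result.append((block["Text"], words))
    return result


for line_text, words in lines_with_words(sample_response):
    print(line_text, "->", words)
```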

In [5]:
#Using a post-processing library to clean up output
pretty_printed = get_lines_string(textract_json=textract_response)
print(pretty_printed[0:80])

a Employee's social security number
22222
054-22-1254
OMB No. 1545-0008
b Employ


<h1> Detect Text with Amazon Textract - S3 Document </h1>
Amazon Textract performs OCR using the Detect Document Text API. Here the API extracts all of the raw text from an input document stored in Amazon S3, which is referenced by bucket and object name instead of being uploaded as bytes.

In [6]:
s3BucketName = 'aim316-bucket'
documentName = 'w2example.jpg'

textract_s3_doc_response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

pprinted_s3_doc_response = get_lines_string(textract_json=textract_s3_doc_response)
print(pprinted_s3_doc_response[0:80])

a Employee's social security number
22222
054-22-1254
OMB No. 1545-0008
b Employ
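
The local and S3 calls differ only in the `Document` argument. As a minimal sketch, the hypothetical helper `build_document_param` below (not part of boto3 or the workshop) builds either form of that dictionary so the same call site can handle both sources.

```python
def build_document_param(local_path=None, bucket=None, key=None):
    """Build the dict passed as Document= to a Textract call.

    Exactly one source is expected: a local file path, or an S3 bucket and key.
    """
    if local_path is not None:
        # Local file: send the raw bytes with the request
        with open(local_path, "rb") as image_file:
            return {"Bytes": image_file.read()}
    if bucket and key:
        # S3 object: Textract reads the document from the bucket itself
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("Provide either local_path, or both bucket and key")


# Bucket and object names reused from the cell above
s3_param = build_document_param(bucket="aim316-bucket", key="w2example.jpg")
print(s3_param)
```

With a client in scope, the result would be passed as `textract.detect_document_text(Document=s3_param)`.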


<h1> Analyze Document with Amazon Textract - Local Document (Forms) </h1>
The Analyze Document API builds on top of the Detect Document Text API and also detects structure within a document by finding tables or form (key-value) pairs. This section requests form data with the `FORMS` feature type.

In [7]:

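document = 'invoice.jpg'

# Call the Analyze Document API, requesting form (key-value) extraction
# Use a separate name for the file handle so `document` keeps holding the path
with open(document, "rb") as image_file:
    response = textract.analyze_document(
        Document={
            'Bytes': image_file.read(),
        },
        FeatureTypes=["FORMS"])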
    

In [8]:
#post-process the results
print(get_string(textract_json=response,
               output_type=[Textract_Pretty_Print.FORMS]))

|------------------|-------------------------------------------------------------------------|
| Key              | Value                                                                   |
| Invoice Number   | 0000005                                                                 |
| Date of Issue    | 06/03/2019                                                              |
| Due Date         | 07/03/2019                                                              |
| Amount Due (CAD) | $5,500.00                                                               |
| Billed To        | Aden Matchett Vandelay Group 123 Main Street Townsville, Ontario M4L2DY |
| Amount Due (CAD) | $5,500.00                                                               |
| Subtotal         | 5,500.00                                                                |
| Total            | 5,500.00                                                                |
| Tax              | 0.00                                                                    |
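
The pretty printer hides how keys and values are linked in the raw response. The sketch below illustrates that linkage, assuming a hand-written sample in which a `KEY_VALUE_SET` block with entity type `KEY` points to its `VALUE` block through a `VALUE` relationship and both point to their `WORD` blocks through `CHILD` relationships. The helper names are hypothetical.

```python
# Hypothetical sample imitating the FORMS portion of an analyze_document response.
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Invoice"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Number"},
        {"Id": "w3", "BlockType": "WORD", "Text": "0000005"},
        {"Id": "k2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v2"]},
                           {"Type": "CHILD", "Ids": ["w4", "w5"]}]},
        {"Id": "v2", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w6"]}]},
        {"Id": "w4", "BlockType": "WORD", "Text": "Due"},
        {"Id": "w5", "BlockType": "WORD", "Text": "Date"},
        {"Id": "w6", "BlockType": "WORD", "Text": "07/03/2019"},
    ]
}


def related_ids(block, relationship_type):
    """Return the Ids a block lists under one relationship type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks that are children of a block."""
    children = (blocks_by_id[i] for i in related_ids(block, "CHILD"))
    return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")


def key_value_pairs(response):
    """Return (key, value) tuples; a list keeps duplicate keys intact."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_blocks = (blocks_by_id[i] for i in related_ids(block, "VALUE"))
        value = " ".join(block_text(v, blocks_by_id) for v in value_blocks)
        pairs.append((block_text(block, blocks_by_id), value))
    return pairs


print(key_value_pairs(sample_response))
```

A list of tuples is used instead of a dictionary because the same key can appear more than once on a form, as `Amount Due (CAD)` does in the output above.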

<h1> Analyze Document with Amazon Textract - Local Document (Tables) </h1>

The Analyze Document API builds on top of the Detect Document Text API and also detects **structure** within a document **by finding tables or form (key-value) pairs**. This section requests table data with the `TABLES` feature type.

In [9]:
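document = 'invoice.jpg'

# Call the Analyze Document API, requesting table extraction
# Use a separate name for the file handle so `document` keeps holding the path
with open(document, "rb") as image_file:
    response = textract.analyze_document(
        Document={
            'Bytes': image_file.read(),
        },
        FeatureTypes=["TABLES"])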


In [10]:
#post-process the results
print(get_string(textract_json=response,
               output_type=[Textract_Pretty_Print.TABLES]))

|-------------|------------------|-----|------------|
| Description | Rate             | Qty | Line Total |
| Project     | $5,000.00        | 1   | $5,000.00  |
| Expenses    | $500.00          | 1   | $500.00    |
|             | Subtotal         |     | 5,500.00   |
|             | Tax              |     | 0.00       |
|             | Total            |     | 5,500.00   |
|             | Amount Paid      |     | 0.00       |
|             | Amount Due (CAD) |     | $5,500.00  |
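
To turn a table into rows for further processing, you can read the cell positions from the raw response. The following is a minimal sketch, assuming a hand-written sample in which a `TABLE` block lists `CELL` blocks as children and each cell carries 1-based `RowIndex` and `ColumnIndex` values; `table_grids` and `related_ids` are hypothetical helper names.

```python
# Hypothetical sample imitating the TABLES portion of an analyze_document response.
sample_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c11", "c12", "c21", "c22"]}]},
        {"Id": "c11", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "c12", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        # An empty cell has no CHILD relationship
        {"Id": "c21", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1},
        {"Id": "c22", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Description"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Rate"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Subtotal"},
    ]
}


def related_ids(block, relationship_type):
    """Return the Ids a block lists under one relationship type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def table_grids(response):
    """Return one list of rows per TABLE block; empty cells become ''."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    grids = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in related_ids(block, "CHILD"):
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = (blocks_by_id[i] for i in related_ids(cell, "CHILD"))
            text = " ".join(w["Text"] for w in words if w["BlockType"] == "WORD")
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = text
        if not cells:
            continue  # nothing to lay out for this table
        row_count = max(row for row, _ in cells)
        column_count = max(column for _, column in cells)
        grids.append([
            [cells.get((row, column), "") for column in range(1, column_count + 1)]
            for row in range(1, row_count + 1)
        ])
    return grids


for grid in table_grids(sample_response):
    for row in grid:
        print(row)
```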




<h1> Analyze Expense with Amazon Textract</h1>
The Analyze Expense API is a purpose-built API designed to extract line-item details, in addition to key-value pairs, from invoices and receipts.

In [11]:
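document = "invoice.jpg"

# Use a separate name for the file handle so `document` keeps holding the path
with open(document, 'rb') as image_file:
    imageBytes = bytearray(image_file.read())

# Call the Analyze Expense API
response = textract.analyze_expense(Document={'Bytes': imageBytes})
#pprint.pprint(response)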

<h4>(Print Sample) Summary field of the analyze_expense response</h4>
<img src="invoice_analysis_expense_value.png" alt="invoice_analysis_expense_value" width="1000"/>
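
The summary fields and line items can be pulled out of the response with a few nested loops. The sketch below assumes a hand-written sample that follows the response layout (`ExpenseDocuments`, `SummaryFields`, `LineItemGroups`, `LineItems`, `LineItemExpenseFields`); the sample values and the helper name `summarize_expense` are illustrative assumptions, and a real response contains more fields such as confidence scores and geometry.

```python
# Hypothetical sample imitating part of an analyze_expense response.
sample_response = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "INVOICE_RECEIPT_ID"},
                 "LabelDetection": {"Text": "Invoice Number"},
                 "ValueDetection": {"Text": "0000005"}},
                # Some fields are detected without a printed label
                {"Type": {"Text": "TOTAL"},
                 "ValueDetection": {"Text": "5,500.00"}},
            ],
            "LineItemGroups": [
                {"LineItems": [
                    {"LineItemExpenseFields": [
                        {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Project"}},
                        {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "$5,000.00"}},
                    ]},
                    {"LineItemExpenseFields": [
                        {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Expenses"}},
                        {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "$500.00"}},
                    ]},
                ]}
            ],
        }
    ]
}


def summarize_expense(response):
    """Return (summary, line_items) extracted from an expense response.

    summary is a list of (type, label, value) tuples.
    line_items is a list of {field_type: value} dicts, one per line item.
    """
    summary, line_items = [], []
    for expense_document in response.get("ExpenseDocuments", []):
        for field in expense_document.get("SummaryFields", []):
            summary.append((
                field.get("Type", {}).get("Text", ""),
                field.get("LabelDetection", {}).get("Text", ""),
                field.get("ValueDetection", {}).get("Text", ""),
            ))
        for group in expense_document.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                line_items.append({
                    f.get("Type", {}).get("Text", ""):
                        f.get("ValueDetection", {}).get("Text", "")
                    for f in item.get("LineItemExpenseFields", [])
                })
    return summary, line_items


summary, line_items = summarize_expense(sample_response)
print(summary)
print(line_items)
```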