# Amazon Textract
#### Automatically extract printed text, handwriting, and data from any document

## Contents 
1. [Detect text from local image](#item1)
1. [Detect text from S3 object](#item2)
1. [Reading order](#item3)
1. [Amazon Translate](#item4)
1. [Forms: Key-Value Pairs](#item5)
1. [Tables](#item6)

Install TRP, the Textract Results Parser.  
Requires Python 3.6 or later.  
This Python library makes it easier to navigate the JSON results from a Textract analysis.  
https://pypi.org/project/textract-trp/

In [None]:
!pip install textract-trp

In [None]:
import os
import boto3
import time
from IPython.display import Image, display, IFrame
from trp import Document
from PIL import Image as PImage, ImageDraw
from sagemaker import get_execution_role

In [None]:
# Current AWS Region. Use this to choose the corresponding S3 bucket with sample content

mySession = boto3.session.Session()
awsRegion = mySession.region_name

print('Region:', awsRegion)
print('SageMaker Execution Role:', get_execution_role())

## IAM Roles and Permissions <a name="IAM"></a>

Within SageMaker Studio, each SageMaker User has an IAM Role known as the `SageMaker Execution Role`. Each Notebook for this user will run with this Role and the Permissions specified by this Role. The name of this Role can be found in the Details section of each SageMaker User in the AWS Console.

For the code which runs in this notebook, the `SageMaker Execution Role` needs additional permissions to allow it to use Amazon Textract and Amazon Comprehend. 

1. In the AWS Console, navigate to the IAM service and attach these two managed policies to your SageMaker Execution Role:
- AmazonTextractFullAccess
- AmazonComprehendFullAccess

2. Also, an Amazon Comprehend service Role needs to be created to grant Amazon Comprehend read access to your input data.  

*Create a service role for Amazon Comprehend*  
Follow along with the instructor.  
When creating this new Role, the default Policies are sufficient (i.e., no other Policies need to be added/modified).  
In our example, we are creating a Role with the name `myComprehendServiceRole`

3. Lastly, the `SageMaker Execution Role` must be allowed to Pass the Comprehend Service Role. 

To allow this, attach a Policy to the `SageMaker Execution Role`. In the Policy below, the Resource entry is the ARN of the Comprehend service Role that you created. You can either create this as a new Policy and attach it, or add it as an in-line Policy.  
In our example, we are creating a Policy with the name `ComprehendDataAccessForSageMaker`

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::810190279255:role/myComprehendServiceRole"
        }
    ]
}
```
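
If you would rather generate this Policy document than edit it by hand, a minimal sketch follows. The helper name `build_pass_role_policy` and the account ID `123456789012` are placeholders chosen for illustration and are not part of the workshop; substitute your own account ID and the name of your Comprehend service Role.

```python
import json


def build_pass_role_policy(account_id: str, role_name: str) -> dict:
    """Return a policy document allowing GetRole/PassRole on a single role.

    Hypothetical helper for illustration only.
    """
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["iam:GetRole", "iam:PassRole"],
                "Resource": role_arn,
            }
        ],
    }


# Placeholder account ID; replace with your own
policy = build_pass_role_policy("123456789012", "myComprehendServiceRole")
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the IAM console's JSON policy editor.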

#### Continue, now that the proper permissions are set up
#### S3 bucket that contains sample documents
We provide sample documents in this bucket so that you do not have to download and upload the test documents manually.

In [None]:
s3BucketName = "aws-workshops-" + awsRegion

In [None]:
# Amazon S3 client
s3 = boto3.client('s3')

# Amazon Textract client
textract = boto3.client('textract')

# 1. Detect text from local image <a name="item1"></a>

https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html

In [None]:
# Document
documentName = "./data/simple-document-image.jpg"
display(Image(filename=documentName))

In [None]:
# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print (item["Text"])

# 2. Detect text from S3 object <a name="item2"></a>

https://docs.aws.amazon.com/textract/latest/dg/API_DetectDocumentText.html

In [None]:
# Document
documentName = "textract-samples/simple-document-image.jpg"
display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': s3BucketName, 'Key': documentName})))

In [None]:
# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })


# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print (item["Text"])

## Lines and Words of Text - JSON Structure

https://docs.aws.amazon.com/textract/latest/dg/API_BoundingBox.html  
https://docs.aws.amazon.com/textract/latest/dg/text-location.html  
https://docs.aws.amazon.com/textract/latest/dg/how-it-works-lines-words.html  
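
As a rough illustration of the structure these pages describe, the sketch below walks a small hand-written sample shaped like a `DetectDocumentText` response. It is not real Textract output. Each `LINE` block lists its `WORD` blocks through a `CHILD` relationship, and `BoundingBox` values are ratios of the page width and height, so they must be scaled to get pixel coordinates. The helper names `child_ids` and `bbox_to_pixels` and the 1000 x 800 image size are assumptions made for this example.

```python
# Hand-written sample in the shape of a DetectDocumentText response (not real output)
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "page-1",
         "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
        {"BlockType": "LINE", "Id": "line-1", "Text": "Hello world",
         "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5,
                                      "Width": 0.5, "Height": 0.125}},
         "Relationships": [{"Type": "CHILD", "Ids": ["word-1", "word-2"]}]},
        {"BlockType": "WORD", "Id": "word-1", "Text": "Hello"},
        {"BlockType": "WORD", "Id": "word-2", "Text": "world"},
    ]
}


def child_ids(block):
    """Collect the Ids listed in a block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def bbox_to_pixels(bbox, image_width, image_height):
    """Scale a ratio-based BoundingBox to (left, top, right, bottom) pixels."""
    left = round(bbox["Left"] * image_width)
    top = round(bbox["Top"] * image_height)
    right = round((bbox["Left"] + bbox["Width"]) * image_width)
    bottom = round((bbox["Top"] + bbox["Height"]) * image_height)
    return left, top, right, bottom


blocks_by_id = {block["Id"]: block for block in sample_response["Blocks"]}

for block in sample_response["Blocks"]:
    if block["BlockType"] == "LINE":
        words = [blocks_by_id[i]["Text"] for i in child_ids(block)]
        # Image size is an assumption for the example
        box = bbox_to_pixels(block["Geometry"]["BoundingBox"], 1000, 800)
        print(block["Text"], words, box)
```

The pixel tuple uses the `(left, top, right, bottom)` order that `PIL.ImageDraw.rectangle` accepts, so it can be used to draw boxes over the source image.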

In [None]:
# Document
documentName = "./data/OneLine.png"
display(Image(filename=documentName))

In [None]:
# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print (item["Text"])

In [None]:
print("JSON Response\n===================")
display(response)

# 3. Reading order <a name="item3"></a>

In [None]:
# Document
documentName = "textract-samples/two-column-image.jpg"
display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': s3BucketName, 'Key': documentName})))

In [None]:
# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [None]:
# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        bbox = item["Geometry"]["BoundingBox"]
        bbox_left = bbox["Left"]
        bbox_right = bbox["Left"] + bbox["Width"]
        bbox_centre = bbox["Left"] + bbox["Width"] / 2
        column_found = False
        for index, column in enumerate(columns):
            # Horizontal midpoint of the column
            column_centre = (column['left'] + column['right']) / 2

            if (column['left'] < bbox_centre < column['right']) or (bbox_left < column_centre < bbox_right):
                # The line overlaps this column
                lines.append([index, item["Text"]])
                column_found = True
                break
        if not column_found:
            # Start a new column spanning this line's horizontal extent
            columns.append({'left': bbox_left, 'right': bbox_right})
            lines.append([len(columns) - 1, item["Text"]])

# Stable sort: groups lines by column and keeps their original order within each column
lines.sort(key=lambda x: x[0])
for line in lines:
    print(f'{line[0]}: {line[1]}')
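
The column-grouping idea can also be packaged as a function and tried without calling Textract. The sketch below is one possible arrangement, not the workshop's own code. The function name `group_lines_by_column` and the four sample blocks are made up for illustration, and the blocks carry only the fields the function reads.

```python
def group_lines_by_column(blocks):
    """Assign each LINE block to a column and return (column_index, text) pairs."""
    columns = []  # each entry: {"left": float, "right": float}
    lines = []    # each entry: (column_index, text)
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        bbox = block["Geometry"]["BoundingBox"]
        left = bbox["Left"]
        right = bbox["Left"] + bbox["Width"]
        centre = (left + right) / 2
        for index, column in enumerate(columns):
            column_centre = (column["left"] + column["right"]) / 2
            # Match if either midpoint falls inside the other's horizontal span
            if column["left"] < centre < column["right"] or left < column_centre < right:
                lines.append((index, block["Text"]))
                break
        else:
            # No existing column matched: this line starts a new one
            columns.append({"left": left, "right": right})
            lines.append((len(columns) - 1, block["Text"]))
    # sorted() is stable, so the order within each column is preserved
    return sorted(lines, key=lambda entry: entry[0])


def make_line(text, left, width):
    """Build a minimal LINE block for the example (hypothetical helper)."""
    return {"BlockType": "LINE", "Text": text,
            "Geometry": {"BoundingBox": {"Left": left, "Width": width}}}


# Lines listed row by row across a two-column page
sample_blocks = [
    make_line("Left 1", 0.05, 0.40),
    make_line("Right 1", 0.55, 0.40),
    make_line("Left 2", 0.05, 0.35),
    make_line("Right 2", 0.55, 0.30),
]

ordered = group_lines_by_column(sample_blocks)
for column_index, text in ordered:
    print(f"{column_index}: {text}")
```

On this sample, both left-column lines are printed before the right-column lines, which is the reading order a person would follow.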

# 4. Amazon Translate <a name="item4"></a>

In [None]:
# Document
documentName = "textract-samples/simple-document-image.jpg"
display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': s3BucketName, 'Key': documentName})))

In [None]:
# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

# Amazon Translate client
translate = boto3.client('translate')

translation = []

print ('Original Language:')
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print (item["Text"])
        result = translate.translate_text(Text=item["Text"], SourceLanguageCode="en", TargetLanguageCode="es")
        translation.append(result.get('TranslatedText'))
print ('\n')

print ('Spanish:')

for line in translation:
    print(line)


# 5. Forms: Key-Value Pairs <a name="item5"></a>

https://docs.aws.amazon.com/textract/latest/dg/API_AnalyzeDocument.html

In [None]:
# Document
documentName = "textract-samples/employmentapp.png"
display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': s3BucketName, 'Key': documentName})))

In [None]:
# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS"])

doc = Document(response)

for page in doc.pages:
    for field in page.form.fields:
        print(f'Key: {field.key}\nValue: {field.value}')


#### Form Data (Key-Value Pairs) JSON Structure

https://docs.aws.amazon.com/textract/latest/dg/how-it-works-kvp.html  
https://docs.aws.amazon.com/textract/latest/dg/how-it-works-selectables.html
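
To make the structure on those pages concrete, here is a minimal sketch that pairs keys with values without using TRP. It runs on a hand-written sample, not real Textract output. In a response, a `KEY_VALUE_SET` block whose `EntityTypes` contains `KEY` points to its value block through a `VALUE` relationship, and both blocks point to their `WORD` (or `SELECTION_ELEMENT`) blocks through `CHILD` relationships. The helper names and the choice to render a selected checkbox as `X` are assumptions for this example.

```python
# Hand-written sample blocks (not real Textract output)
sample_blocks = [
    {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "KEY_VALUE_SET", "Id": "v1", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"BlockType": "WORD", "Id": "w1", "Text": "Full"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Name:"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Jane"},
    {"BlockType": "WORD", "Id": "w4", "Text": "Doe"},
]


def related_ids(block, relationship_type):
    """Collect the Ids from a block's relationships of the given type."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == relationship_type:
            ids.extend(relationship["Ids"])
    return ids


def block_text(block, blocks_by_id):
    """Join the text of a block's WORD children; selected checkboxes become 'X'."""
    parts = []
    for child_id in related_ids(block, "CHILD"):
        child = blocks_by_id[child_id]
        if child["BlockType"] == "WORD":
            parts.append(child["Text"])
        elif (child["BlockType"] == "SELECTION_ELEMENT"
              and child.get("SelectionStatus") == "SELECTED"):
            parts.append("X")
    return " ".join(parts)


def extract_key_values(blocks):
    """Return a dict mapping each key's text to its value's text."""
    blocks_by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key = block_text(block, blocks_by_id)
            value = " ".join(block_text(blocks_by_id[value_id], blocks_by_id)
                             for value_id in related_ids(block, "VALUE"))
            pairs[key] = value
    return pairs


pairs = extract_key_values(sample_blocks)
print(pairs)
```

With a real response you would pass `response["Blocks"]` to `extract_key_values`. Note that a plain dict keeps only the last value when two fields share the same key text.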


In [None]:
print("JSON Response\n===================")
response

# 6. Tables <a name="item6"></a>
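When Textract analyzes a document for tables, it returns each cell with its row and column position, so cells can be extracted by row and column. The cell below reuses the `documentName` set in the Forms section.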

In [None]:
# Call Amazon Textract
response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])

#print(response)

doc = Document(response)

for page in doc.pages:
    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

#### Table JSON Structure
How it works

https://docs.aws.amazon.com/textract/latest/dg/how-it-works-tables.html
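
As a sketch of that structure, the code below rebuilds a table as a list of rows without TRP. It runs on a hand-written sample, not real Textract output. A `TABLE` block lists its `CELL` blocks as `CHILD` relationships, each `CELL` carries a 1-based `RowIndex` and `ColumnIndex`, and the cell's words are again `CHILD` relationships. The function names and sample values are assumptions for this example, and merged cells are not handled.

```python
# Hand-written sample blocks (not real Textract output)
sample_blocks = [
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"BlockType": "CELL", "Id": "c1", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"BlockType": "CELL", "Id": "c2", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"BlockType": "CELL", "Id": "c3", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    # An empty cell has no CHILD relationship
    {"BlockType": "CELL", "Id": "c4", "RowIndex": 2, "ColumnIndex": 2},
    {"BlockType": "WORD", "Id": "w1", "Text": "Start"},
    {"BlockType": "WORD", "Id": "w2", "Text": "Date"},
    {"BlockType": "WORD", "Id": "w3", "Text": "Employer"},
    {"BlockType": "WORD", "Id": "w4", "Text": "2009"},
]


def child_ids(block):
    """Collect the Ids listed in a block's CHILD relationships."""
    ids = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            ids.extend(relationship["Ids"])
    return ids


def table_to_rows(table_block, blocks_by_id):
    """Return the table as a list of rows, each a list of cell strings."""
    cells = {}
    for cell_id in child_ids(table_block):
        cell = blocks_by_id[cell_id]
        if cell["BlockType"] != "CELL":
            continue
        words = [blocks_by_id[word_id]["Text"]
                 for word_id in child_ids(cell)
                 if blocks_by_id[word_id]["BlockType"] == "WORD"]
        cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    row_count = max(row for row, _ in cells)
    column_count = max(column for _, column in cells)
    # RowIndex and ColumnIndex start at 1
    return [[cells.get((row, column), "") for column in range(1, column_count + 1)]
            for row in range(1, row_count + 1)]


blocks_by_id = {block["Id"]: block for block in sample_blocks}
tables = [table_to_rows(block, blocks_by_id)
          for block in sample_blocks if block["BlockType"] == "TABLE"]
for row in tables[0]:
    print(row)
```

A list of rows in this shape can be passed directly to `csv.writer(...).writerows` or to a `pandas.DataFrame` constructor if you want to export the table.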

In [None]:
response