# Amazon Textract and Amazon Comprehend AI Services
### Example of extracting insights from a PDF Document


## Contents 
1. [Background](#Background)
1. [Notes and Configuration](#Notes-and-Configuration)
1. [Amazon Textract](#Amazon-Textract)
1. [Amazon Comprehend](#Amazon-Comprehend)
1. [Key Phrase Extraction](#Key-Phrase-Extraction)
1. [Sentiment Analysis](#Sentiment-Analysis)
1. [Entity Recognition](#Entity-Recognition)
1. [PII Entity Recognition](#PII-Entity-Recognition)
1. [Topic Modeling](#Topic-Modeling)

  
## Background
The goal of this exercise is to extract insights from an existing PDF document. Amazon Textract first extracts the text from the document. Several Amazon Comprehend APIs then analyze that text to produce insights about the document.  

## Notes and Configuration
* Kernel `Python 3 (Data Science)` works well with this notebook

In [None]:
import os
import json
import sys
import time
import boto3
import sagemaker as sm

## Setup
Set some variables that will be used throughout this example

In [None]:
region = 'us-east-1'

sess = sm.Session()
s3bucket = sess.default_bucket()    
s3prefix = 'comprehend'

# this is where the various analysis results files will be stored on the local file system of this SageMaker instance
results_dir = './results'
!mkdir -p $results_dir

# this is the IAM Role that defines which permissions this SageMaker instance has
sm_execution_role = sm.get_execution_role()
print('sagemaker execution role: ', sm_execution_role)
print('s3 bucket:', s3bucket)

## IAM Roles and Permissions:

Within SageMaker Studio, each SageMaker User has an IAM Role known as the `SageMaker Execution Role`. Each Notebook for this user will run with this Role and the Permissions specified by this Role. The name of this Role can be found in the Details section of each SageMaker User in the AWS Console.

For the code that runs in this notebook, the `SageMaker Execution Role` needs additional permissions to use Amazon Textract and Amazon Comprehend. In the AWS Console, navigate to the IAM service and attach these two AWS managed policies to your SageMaker Execution Role:
- AmazonTextractFullAccess
- AmazonComprehendFullAccess

Also, an Amazon Comprehend service Role needs to be created to grant Amazon Comprehend read access to your input data.  
When creating this new Role, the default Policies are sufficient (i.e., no other Policies need to be added/modified).  
In our example, we are creating a Role with the name `myComprehendServiceRole`
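
For readers who prefer to script the Role instead of using the Console, the following is a minimal sketch of the trust policy a Comprehend service Role typically carries. The helper name `build_comprehend_trust_policy` is hypothetical and not part of this notebook; the resulting JSON could be supplied as the `AssumeRolePolicyDocument` when creating a Role through IAM. The Role would still need a permissions policy granting read access to your input data, which is not shown here.

```python
import json

def build_comprehend_trust_policy():
    """Return a trust policy that lets the Amazon Comprehend service assume a Role.

    Hypothetical helper for illustration; adapt it to your own account setup.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "comprehend.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }

# serialize the policy so it can be pasted into the Console or passed to an API call
trust_policy_json = json.dumps(build_comprehend_trust_policy(), indent=4)
print(trust_policy_json)
```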

Lastly, the `SageMaker Execution Role` must be allowed to pass the Comprehend service Role to Amazon Comprehend. To allow this, attach a Policy to the `SageMaker Execution Role`. In the Policy below, the Resource entry is the ARN of the Comprehend service Role which you created. You can either create this as a new Policy and attach it, or add it as an in-line Policy.  
In our example, we are creating a Role with the name `ComprehendDataAccessForSageMaker`

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::810190279255:role/myComprehendServiceRole"
        }
    ]
}
```

In [None]:
# set this to the ARN of the Comprehend service Role you created
# (the Role that grants Amazon Comprehend read access to your input data)
comprehend_role = 'arn:aws:iam::662559257807:role/ComprehendDataAccessForSageMaker'

---
## Amazon Textract
Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.  
  
In the next few cells the following steps will be performed:
1. A specified PDF document is uploaded to Amazon S3 and analyzed by an asynchronous Amazon Textract job.  
1. The result of this analysis is a JSON response in which each element (a Block) contains details about a specific instance of text in the PDF.  
1. The JSON response is retrieved through the Textract API, one page of results at a time.  
1. The responses are then post-processed to produce a text file with one detected line of text per line.  


In [None]:
# create a boto3 session
# this session will be used for the remainder of this notebook
session = boto3.Session(region_name=region)


In [None]:
# create the Textract Job
# alternative source document: 'amazon-press-release.png'
textract_src_filename = 'police-report.pdf'

# upload the source document to S3 for Textract to access
!aws s3 cp data/$textract_src_filename s3://$s3bucket/$s3prefix/$textract_src_filename

textract_client = session.client('textract')
response = textract_client.start_document_text_detection(
    DocumentLocation={
    'S3Object': {
        'Bucket': s3bucket,
        'Name': f'{s3prefix}/{textract_src_filename}'
        }
    }
)

JobId = response["JobId"]
print('JobId: %s' % (JobId))


In [None]:
response = textract_client.get_document_text_detection(JobId=JobId)
print(response['JobStatus'])
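
The cell above checks the job status once, so it may need to be re-run until the job finishes. A minimal polling sketch is shown below; the helper name `wait_for_textract_job` and its `poll_seconds` and `max_polls` parameters are assumptions made for illustration, not part of the original notebook.

```python
import time

def wait_for_textract_job(client, job_id, poll_seconds=5, max_polls=120):
    """Poll get_document_text_detection until the job is no longer IN_PROGRESS.

    Returns the first response whose JobStatus differs from IN_PROGRESS
    (for example SUCCEEDED or FAILED). Hypothetical helper for illustration.
    """
    for _ in range(max_polls):
        response = client.get_document_text_detection(JobId=job_id)
        if response['JobStatus'] != 'IN_PROGRESS':
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f'Textract job {job_id} still IN_PROGRESS after {max_polls} polls')
```

In this notebook it could be called as `response = wait_for_textract_job(textract_client, JobId)` before the results are collected.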

In [None]:
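if response['JobStatus'] != 'SUCCEEDED':
    # a bare `raise` fails here because there is no active exception to re-raise
    raise RuntimeError(f"Textract job {JobId} has status {response['JobStatus']}, expected SUCCEEDED")

# collect every page of results; a response carries a NextToken while more results remain
pages = [response]
while 'NextToken' in response:
    response = textract_client.get_document_text_detection(JobId=JobId, NextToken=response['NextToken'])
    pages.append(response)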


In [None]:
lines = []

# iterate through the Textract JSON responses, keeping the text of each LINE block
for page in pages:
    for blk in page['Blocks']:
        if blk['BlockType'] == 'LINE':
            lines.append(blk['Text'])

# write one detected line of text per line into the results directory
textract_results_filename = 'textract-results.txt'
with open(os.path.join(results_dir, textract_results_filename), 'w') as fd:
    for line in lines:
        fd.write(f'{line}\n')



In [None]:
lines
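
Note that `pages` above holds pages of API results, not pages of the document. If the lines are needed per document page, a sketch along these lines may help. It assumes each LINE block carries a `Page` number (falling back to 1 when the field is absent); the helper name `group_lines_by_page` is hypothetical.

```python
from collections import defaultdict

def group_lines_by_page(result_pages):
    """Group LINE text by document page number.

    `result_pages` is a list of Textract responses, each with a 'Blocks' list.
    Assumes blocks expose a 'Page' field; blocks without one are filed under page 1.
    """
    lines_by_page = defaultdict(list)
    for result_page in result_pages:
        for blk in result_page.get('Blocks', []):
            if blk['BlockType'] == 'LINE':
                lines_by_page[blk.get('Page', 1)].append(blk['Text'])
    return dict(lines_by_page)
```

In this notebook it could be called as `group_lines_by_page(pages)`.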

---
## Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. The service provides APIs for Key Phrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection so you can easily integrate natural language processing into your applications. The following cells walk through several examples of how to use these APIs.  


## Key Phrase Extraction
Use Amazon Comprehend to extract Key Phrases in the text from the Textract analysis.  
The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.
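
The notebook sends one Textract line per request, which keeps each input small. If longer inputs are preferred, one possible approach is to pack lines into chunks that stay under the byte limit. The sketch below assumes the 5,000-byte limit stated above; the helper name `chunk_lines_by_bytes` is hypothetical.

```python
def chunk_lines_by_bytes(lines, max_bytes=5000):
    """Pack lines into newline-joined chunks, each strictly under max_bytes of UTF-8."""
    chunks = []
    current = []        # lines collected for the chunk being built
    current_size = 0    # UTF-8 size of the current chunk, including joining newlines
    for line in lines:
        line_size = len(line.encode('utf-8'))
        if line_size >= max_bytes:
            raise ValueError('a single line reaches the byte limit and cannot be packed')
        # one extra byte for the newline joining this line to the previous one
        extra = line_size + (1 if current else 0)
        if current_size + extra >= max_bytes:
            chunks.append('\n'.join(current))
            current, current_size = [line], line_size
        else:
            current.append(line)
            current_size += extra
    if current:
        chunks.append('\n'.join(current))
    return chunks
```

Each returned chunk could then be passed as `Text` to a single Comprehend call, at the cost of results no longer being reported per line.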


In [None]:
# create the comprehend boto3 client (from the existing boto3 session)
comp_client = session.client('comprehend')


In [None]:
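key_phrases = []
seen_phrases = set()  # phrase texts already collected, used to skip duplicates

for line in lines:     
    response = comp_client.detect_key_phrases(Text=line, LanguageCode='en')
    for kp in response['KeyPhrases']:
        # compare on the phrase text; each kp is a dict, so it never equals a plain string
        if kp['Text'] not in seen_phrases:
            seen_phrases.add(kp['Text'])
            key_phrases.append(kp)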


In [None]:
key_phrases
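
Each key phrase comes back as a dictionary that includes a `Text` and a confidence `Score`. A minimal sketch for keeping only confident phrases, ranked by score, might look like the following. The helper name `top_key_phrases` and the 0.9 threshold are assumptions chosen for illustration.

```python
def top_key_phrases(key_phrases, min_score=0.9):
    """Return (text, score) pairs at or above min_score, highest score first.

    When a phrase text appears more than once, its best score is kept.
    """
    best = {}
    for kp in key_phrases:
        text, score = kp['Text'], kp['Score']
        if score >= min_score and score > best.get(text, 0.0):
            best[text] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

In this notebook it could be called as `top_key_phrases(key_phrases)`.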

---
## Sentiment Analysis
Use Amazon Comprehend to determine the Sentiment of each line of text from the Textract analysis. Each line is labeled with one of the following:
* POSITIVE, NEUTRAL, NEGATIVE, MIXED

The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.


In [None]:
sentiments = []
    
for line in lines:
    response = comp_client.detect_sentiment(Text=line, LanguageCode='en')
    sentiments.append(response['Sentiment'])


In [None]:
sentiments
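
Calling the API once per line can be slow for long documents. One alternative is the `batch_detect_sentiment` operation, which accepts several documents per request. The sketch below assumes a limit of 25 documents per batch; the helper name `batch_detect_sentiments` is hypothetical. Documents reported in the response's error list are left as `None`.

```python
def batch_detect_sentiments(client, texts, language_code='en', batch_size=25):
    """Return one sentiment label per input text, using batched requests.

    Entries stay None for documents the service could not process.
    """
    sentiments = [None] * len(texts)
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.batch_detect_sentiment(TextList=batch, LanguageCode=language_code)
        for result in response['ResultList']:
            # 'Index' is the zero-based position of the document within this batch
            sentiments[start + result['Index']] = result['Sentiment']
    return sentiments
```

In this notebook it could be called as `batch_detect_sentiments(comp_client, lines)`.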

---
## Entity Recognition
Use Amazon Comprehend to detect Entities in the text from the Textract analysis.  
What are the types of Entities?
* PERSON, ORGANIZATION, DATE, QUANTITY, LOCATION, TITLE, COMMERCIAL_ITEM, EVENT, OTHER

The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.


In [None]:

entities = []
for line in lines:
    response = comp_client.detect_entities(Text=line, LanguageCode='en')
    entities.append(response['Entities'])


In [None]:
entities
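
The raw output is a list of entity lists, one per line, which can be hard to scan. A minimal sketch for summarizing it by entity type is shown below. The helper name `group_entities_by_type` and its `min_score` parameter are assumptions for illustration; it relies on each entity having `Type`, `Text`, and `Score` fields.

```python
from collections import defaultdict

def group_entities_by_type(entities_per_line, min_score=0.0):
    """Map each entity Type to a sorted list of the distinct entity texts found."""
    grouped = defaultdict(set)
    for line_entities in entities_per_line:
        for ent in line_entities:
            if ent['Score'] >= min_score:
                grouped[ent['Type']].add(ent['Text'])
    return {etype: sorted(texts) for etype, texts in grouped.items()}
```

In this notebook it could be called as `group_entities_by_type(entities, min_score=0.8)`.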

---
## PII Entity Recognition
Use Amazon Comprehend to detect PII Entities in the text from the Textract analysis.  
What are the types of PII Entities?  
* NAME, DATE_TIME, ADDRESS, USERNAME, URL, EMAIL, PHONE, CREDIT_DEBIT_EXPIRY, PASSWORD, AGE, among others

The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.


In [None]:
pii_entities = []
for line in lines:
    response = comp_client.detect_pii_entities(Text=line, LanguageCode='en')
    pii_entities.append(response['Entities'])    
                    

In [None]:
pii_entities
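
PII entities are reported with a `Type` and with `BeginOffset` and `EndOffset` positions into the input text instead of the matched text itself. A common use of these offsets is redaction. The sketch below is one way to do it, assuming the entities for a line do not overlap; the helper name `redact_pii` is hypothetical.

```python
def redact_pii(text, pii_entities):
    """Replace each PII span in text with a [TYPE] placeholder.

    Spans are replaced from right to left so earlier offsets stay valid.
    Assumes the spans do not overlap.
    """
    redacted = text
    for ent in sorted(pii_entities, key=lambda e: e['BeginOffset'], reverse=True):
        placeholder = f"[{ent['Type']}]"
        redacted = redacted[:ent['BeginOffset']] + placeholder + redacted[ent['EndOffset']:]
    return redacted
```

In this notebook, `[redact_pii(line, ents) for line, ents in zip(lines, pii_entities)]` would produce a redacted copy of every line.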

---
## Topic Modeling
Use Amazon Comprehend to extract Topics in the text from the Textract analysis.  

In this example, we are running the analysis as an asynchronous job, so the results are stored in a file in the S3 bucket we specify.  
This analysis may take up to 10 minutes to run.  

The output consists of two files:  
* *topic-terms.csv:* A list of topics in the collection. For each topic, the list includes the top terms according to their weight.  
* *doc-topics.csv:* Lists the documents associated with a topic and the proportion of the document that is concerned with the topic.


In [None]:
# put the file to be analyzed into the s3 bucket
# in this example, this file is the results from running textract on a pdf
s3dest = f's3://{s3bucket}/{s3prefix}/{textract_results_filename}'
!aws s3 cp ./results/$textract_results_filename $s3dest

In [None]:
# start the Amazon Comprehend Topics Analysis job
# create a unique Job Name
JobName = 'MyJobName-%d' % (time.time())

# ClientRequestToken is optional and is omitted here, since a fixed placeholder
# value would be reused on every run of this cell
request = {
   "DataAccessRoleArn": comprehend_role,
   "InputDataConfig": { 
      # ONE_DOC_PER_FILE treats each input file as a single document
      "InputFormat": "ONE_DOC_PER_FILE",
      "S3Uri": s3dest
   },
   "JobName": JobName,
   "OutputDataConfig": { 
      "S3Uri": f's3://{s3bucket}/{s3prefix}/'
   }
}

# create the comprehend analysis job
response = comp_client.start_topics_detection_job(**request)
JobId = response['JobId']
print(JobId)


In [None]:
response = comp_client.describe_topics_detection_job(JobId=JobId)
print(response)
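
Because the job runs asynchronously, the cell above may need to be re-run until the job has finished. A minimal polling sketch is shown below. The helper name `wait_for_topics_job` and its `poll_seconds` and `max_polls` parameters are assumptions for illustration; it treats COMPLETED, FAILED, and STOPPED as final states.

```python
import time

def wait_for_topics_job(client, job_id, poll_seconds=30, max_polls=60):
    """Poll describe_topics_detection_job until the job reaches a final state.

    Returns the TopicsDetectionJobProperties dictionary of the finished job.
    """
    final_states = {'COMPLETED', 'FAILED', 'STOPPED'}
    for _ in range(max_polls):
        props = client.describe_topics_detection_job(JobId=job_id)['TopicsDetectionJobProperties']
        if props['JobStatus'] in final_states:
            return props
        time.sleep(poll_seconds)
    raise TimeoutError(f'Topics job {job_id} did not finish after {max_polls} polls')
```

In this notebook it could be called as `props = wait_for_topics_job(comp_client, JobId)`, after which `props['JobStatus']` should be checked for `COMPLETED` before the output is downloaded.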


In [None]:
# the comprehend analysis results are in the s3 bucket, full path is S3Uri
s3uri = response['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
basename = os.path.basename(s3uri)

# copy the 'output.tar.gz' file from the s3 bucket to the results folder
!aws s3 cp $s3uri $results_dir

# extract the contents of this tarball, which are two files: topic-terms.csv, doc-topics.csv
!(cd $results_dir; tar xzf $basename)
!(cd $results_dir; rm -f $basename)

print('See the following files:')
!ls -l $results_dir/topic-terms.csv
!ls -l $results_dir/doc-topics.csv
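
Once the files are extracted, they can be inspected in Python. The sketch below groups terms by topic using the standard `csv` module. It assumes `topic-terms.csv` has a header row with the columns `topic`, `term`, and `weight`; verify the header of your own output before relying on it. The helper name `load_topic_terms` is hypothetical.

```python
import csv
import io
from collections import defaultdict

def load_topic_terms(csv_text):
    """Parse topic-terms CSV text into {topic_number: [(term, weight), ...]}.

    Terms within each topic are ordered by descending weight.
    Assumes columns named topic, term, and weight.
    """
    topics = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        topics[int(row['topic'])].append((row['term'], float(row['weight'])))
    for terms in topics.values():
        terms.sort(key=lambda term_weight: term_weight[1], reverse=True)
    return dict(topics)
```

In this notebook it could be used as `load_topic_terms(open(os.path.join(results_dir, 'topic-terms.csv')).read())`.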