# Amazon Textract and Amazon Comprehend AI Services
### Example on extracting insights from a PDF Document


## Contents 
1. [Background](#Background)
1. [Notes and Configuration](#Notes-and-Configuration)
1. [Amazon Textract](#Amazon-Textract)
1. [Amazon Comprehend](#Amazon-Comprehend)
1. [Key Phrase Extraction](#Key-Phrase-Extraction)
1. [Sentiment Analysis](#Sentiment-Analysis)
1. [Entity Recognition](#Entity-Recognition)
1. [PII Entity Recognition](#PII-Entity-Recognition)
1. [Topic Modeling](#Topic-Modeling)


  
## Background
The goal of this exercise is to learn some insights from an existing PDF document. This is done by using Amazon Textract to extract the text from the document. This text is then analyzed by several Amazon Comprehend services to produce some insights about the document.  

The PDF document used in this example is a ...

## Notes and Configuration
* Kernel `Python 3 (Data Science)` works well with this notebook
* The CSV results files produced by this script use the pipe '|' symbol as a delimiter. When viewing these files in SageMaker Studio, be sure and change the Delimiter to 'pipe'.


#### Regarding IAM Roles and Permissions:

Within SageMaker Studio, each SageMaker User has an IAM Role known as the `SageMaker Execution Role`. Each Notebook for this user will run with this Role and the Permissions specified by this Role. The name of this Role can be found in the Details section of each SageMaker User in the AWS Console.

For the code which runs in this notebook, the `SageMaker Execution Role` needs additional permissions to allow it to use Amazon Textract and Amazon Comprehend. In the AWS Console, navigate to the IAM service and add these two services to your SageMaker Execution Role:
- AmazonTextractFullAccess
- AmazonComprehendFullAccess

Also, an Amazon Comprehend service Role needs to be created to grant Amazon Comprehend read access to your input data.  
When creating this new Role, the default Policies are sufficient (i.e., no other Policies need to be added/modified).

Lastly, the `SageMaker Execution Role` must be allowed to Pass the Comprehend Service Role. To allow this, you must attach a Policy to the `SageMaker Execution Role`. Below, the Resource entry is the ARN of the Comprehend service Role which you created. You can either create this as a new Policy and attach it or add it as an in-line Policy.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:PassRole"
                ],
                "Resource": "arn:aws:iam::810190279255:role/amComprehendServiceRole"
            }
        ]
    }



In [2]:
import os
import json
import sys
import time
import boto3
from sagemaker import get_execution_role

## Setup
Set some variables that will be used throughout this example

In [3]:
region = 'us-east-1'

# change this to an existing S3 bucket in your AWS account
s3bucket = 'am-tmp2'
s3prefix = 'comprehend'

# this is where the various analysis results files will be stored on the local file system of this SageMaker instance
results_dir = './results'
!mkdir -p $results_dir

# this is the IAM Role that defines which permissions this SageMaker instance has
sm_execution_role = get_execution_role()


---
## Functions
The following functions provide a wrapper around the actual API calls for Amazon Textract and Amazon Comprehend


---
## Amazon Textract
Amazon Textract is a machine learning service that automatically extracts text, handwriting and data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.  
  
In the next few cells the following steps will be performed:
1. A specified PDF document will be uploaded to Amazon S3 to be analyzed by Amazon Textract.  
1. The result of this analysis is a JSON file with each element containing details about a specific instance of text in the PDF.  
1. This JSON file is copied from S3 to this local SageMaker instance.  
1. The JSON file is then read and post-processed to produce a text file with one tweet (or other social media post) per line.  


In [5]:
# create a boto3 session
# this session will be used for the remainder of this notebook
session = boto3.Session(region_name=region)


In [7]:
# create the Textract Job
textract_src_filename = 'amazon-press-release.png'

# upload the PDF to S3 for Textract to process
!aws s3 cp data/$textract_src_filename s3://$s3bucket/$s3prefix/$textract_src_filename


textract_client = session.client('textract')
response = textract_client.start_document_text_detection(
    DocumentLocation={
    'S3Object': {
        'Bucket': s3bucket,
        'Name': f'{s3prefix}/{textract_src_filename}'
        }
    }
)

JobId = response["JobId"]
print('JobId: %s' % (JobId))

upload: data/amazon-press-release.png to s3://am-tmp2/comprehend/amazon-press-release.png
JobId: 6abdedb8e09bd13c876c29b4ff0716d912856a4bf455f9cae5e1703418bf13a9


In [10]:
response = textract_client.get_document_text_detection(JobId=JobId)
print(response['JobStatus'])

SUCCEEDED


In [11]:
if response['JobStatus'] != 'SUCCEEDED':
    raise
    
pages = []
while(True):
    pages.append(response)
    nextToken = None
    if('NextToken' in response):
        nextToken = response['NextToken']
        response = client.get_document_text_detection(JobId=JobId, NextToken=nextToken)

    if nextToken == None:
        break


In [54]:
lines = []

# iterate through the Textract JSON response, looking for the LINE and WORD entries
for page in pages:
    for blk in page['Blocks']:
        if blk['BlockType'] in ['LINE']:
            lines.append(blk['Text'])

textract_results_filename = 'textract-results.txt'
with open(f'./results/{textract_results_filename}', 'w') as fd:
    for line in lines:
        fd.write(f'{line}\n')



In [23]:
lines

['August 24, 2017 at 1:45 PM EDT',
 'SEATTLE & AUSTIN, WIRE)--Aug. 24, 2017- NASDAQ:AMZN)-Amazon and Whole Foods',
 "Market today announced that Amazon's acquisition of Whole Foods Market will close on Monday August 28,",
 "2017, and the two companies will together pursue the vision of making Whole Foods Market's high-quality,",
 'natural and organic food affordable for everyone. As a down payment on that vision, Whole Foods Market will',
 'offer lower prices starting Monday on a selection of best-selling grocery staples across its stores, with more to',
 'come.',
 'In addition, Amazon and Whole Foods Market technology teams will begin to integrate Amazon Prime into',
 'the Whole Foods Market point-of-sale system, and when this work is complete, Prime members will receive',
 'special savings and in-store benefits. The two companies will invent in additional areas over time, including in',
 'merchandising and logistics, to enable lower prices for Whole Foods Market customers.',
 '"We\'r

---
## Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. The service provides APIs for Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection so you can easily integrate natural language processing into your applications. The following cells will walk through several examples of how to use the API.  


## Key Phrase Extraction
Use Amazon Comprehend to extract Key Phrases in the text from the Textract analysis.


In [24]:
# create the comprehend boto3 client (from the existing boto3 session)
comp_client = session.client('comprehend')


In [27]:
key_phrases = []

for line in lines:     
    response = comp_client.detect_key_phrases(Text=line, LanguageCode='en')
    for kp in response['KeyPhrases']:
        if kp['Text'] not in key_phrases:
            key_phrases.append(kp)


In [31]:
len(key_phrases)

174

---
## Sentiment Analysis
Use Amazon Comprehend to determine the Sentiment of each line of text from the Textract analysis.
* POSITIVE, NEUTRAL, NEGATIVE

In [33]:
# print('Started Comprehend Sentiment Analysis job')
# sentiments = GetSentiments(comp_client, textract_results_filename, comprehend_sentiments_results_filename)
# print('See results file: %s\n' % (comprehend_sentiments_results_filename))

# freq = CalcFrequencies(sentiments)
# print('Frequencies:')
# for d in sentiments:
#     print('%s: %.2f' % (d, freq[d]))        

    
# keep a running total of the various sentiments
sentiments = []
    

for line in lines:
    response = comp_client.detect_sentiment(Text=line, LanguageCode='en')
    sentiments.append(response['Sentiment'])


In [34]:
sentiments

['NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'NEUTRAL',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'NEUTRAL',
 'POSITIVE',
 'NEUTRAL',
 'NEUTRAL']

---
## Entity Recognition
Use Amazon Comprehend to detect Entities in the text from the Textract analysis.  
What are the type of Entities?
* PERSON, ORGANIZATION, DATE, QUANTITY, LOCATION, TITLE, COMMERCIAL_ITEM, EVENT, OTHER

In [41]:

entities = []
for line in lines:
    response = comp_client.detect_entities(Text=line, LanguageCode='en')
    entities.append(response['Entities'])



In [42]:
entities

[[{'Score': 0.9941008687019348,
   'Type': 'DATE',
   'Text': 'August 24, 2017 at 1:45 PM EDT',
   'BeginOffset': 0,
   'EndOffset': 30}],
 [{'Score': 0.9128297567367554,
   'Type': 'ORGANIZATION',
   'Text': 'SEATTLE & AUSTIN, WIRE',
   'BeginOffset': 0,
   'EndOffset': 22},
  {'Score': 0.9979518055915833,
   'Type': 'DATE',
   'Text': 'Aug. 24, 2017',
   'BeginOffset': 25,
   'EndOffset': 38},
  {'Score': 0.9296031594276428,
   'Type': 'ORGANIZATION',
   'Text': 'NASDAQ',
   'BeginOffset': 40,
   'EndOffset': 46},
  {'Score': 0.7923950552940369,
   'Type': 'ORGANIZATION',
   'Text': 'AMZN',
   'BeginOffset': 47,
   'EndOffset': 51},
  {'Score': 0.9962045550346375,
   'Type': 'ORGANIZATION',
   'Text': 'Amazon',
   'BeginOffset': 53,
   'EndOffset': 59},
  {'Score': 0.8217054009437561,
   'Type': 'ORGANIZATION',
   'Text': 'Whole Foods',
   'BeginOffset': 64,
   'EndOffset': 75}],
 [{'Score': 0.9663324356079102,
   'Type': 'ORGANIZATION',
   'Text': 'Market',
   'BeginOffset': 0,
   '

---
## PII Entity Recognition
Use Amazon Comprehend to detect PII Entities in the text from the Textract analysis.  
What are the types of PII Entities?  
* NAME, DATE-TIME, ADDRESS, USERNAME, URL, EMAIL, PHONE, CREDIT-DEBIT-EXPIRY, PASSWORD, AGE


In [43]:
pii_entities = []
for line in lines:
    response = comp_client.detect_pii_entities(Text=line, LanguageCode='en')
    pii_entities.append(response['Entities'])    
                    

In [44]:
pii_entities

[[{'Score': 0.9999969601631165,
   'Type': 'DATE_TIME',
   'BeginOffset': 0,
   'EndOffset': 15},
  {'Score': 0.9999958872795105,
   'Type': 'DATE_TIME',
   'BeginOffset': 19,
   'EndOffset': 26}],
 [{'Score': 0.9999973773956299,
   'Type': 'DATE_TIME',
   'BeginOffset': 25,
   'EndOffset': 38}],
 [{'Score': 0.9999973773956299,
   'Type': 'DATE_TIME',
   'BeginOffset': 85,
   'EndOffset': 101}],
 [{'Score': 0.9923071265220642,
   'Type': 'DATE_TIME',
   'BeginOffset': 0,
   'EndOffset': 4}],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'Score': 0.9999886751174927,
   'Type': 'NAME',
   'BeginOffset': 43,
   'EndOffset': 53}],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'Score': 0.9999813437461853,
   'Type': 'NAME',
   'BeginOffset': 5,
   'EndOffset': 16}],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [],
 [{'Score': 0.9999982118606567,
   'Type': 'URL',
   'BeginOffset': 73,
   'EndOffset': 83}],
 [],
 [],
 [{'Score': 0.999982476234436,
   'Type': 'URL',
   'Be

---
## Topic Modeling
Use Amazon Comprehend to extract Topics in the text from the Textract analysis.  

In this example, we are running the analysis as an asynchronous job, so the results are stored in a file in the S3 bucket we specify.  
This analysis may take up to 10 minutes to run.  

The output results are two files:  
*topic_terms.csv:*  A list of topics in the collection. For each topic, the list includes the top terms by topic according to their weight.  
*doc-topics.csv:*   Lists the documents associated with a topic and the proportion of the document that is concerned with the topic.


In [59]:
# put the file to be analyzed into the s3 bucket
# in this example, this file is the results from running textract on a pdf
s3dest = f's3://{s3bucket}/{s3prefix}/{textract_results_filename}'
!aws s3 cp ./results/$textract_results_filename $s3dest

upload: results/textract-results.txt to s3://am-tmp2/comprehend/textract-results.txt


In [94]:
# start the Amazon Comprehend Topics Analysis job
# create a unique Job Name
JobName = 'MyJobName-%d' % (time.time())

# Why is this IAM Role needed?
comprehend_role = 'arn:aws:iam::662559257807:role/ComprehendDataAccess-SM'

request = {
   "ClientRequestToken": "string",
   "DataAccessRoleArn": comprehend_role,
   "InputDataConfig": { 
      "InputFormat": "ONE_DOC_PER_FILE",
      "S3Uri": s3dest
   },
   "JobName": JobName,
   "OutputDataConfig": { 
      "S3Uri": f's3://{s3bucket}/{s3prefix}/'
   }
}

# create the comprehend analysis job
response = comp_client.start_topics_detection_job(**request)
JobId = response['JobId']
print(JobId)


0341c4209663f04171aaa7eb246f6fe8


In [103]:
response = comp_client.describe_topics_detection_job(JobId=JobId)
status = response['TopicsDetectionJobProperties']['JobStatus']
print(status)
        
    

COMPLETED


In [104]:
response['TopicsDetectionJobProperties']

{'JobId': '0341c4209663f04171aaa7eb246f6fe8',
 'JobName': 'MyJobName-1632756590',
 'JobStatus': 'COMPLETED',
 'SubmitTime': datetime.datetime(2021, 9, 27, 15, 29, 50, 875000, tzinfo=tzlocal()),
 'EndTime': datetime.datetime(2021, 9, 27, 15, 44, 0, 794000, tzinfo=tzlocal()),
 'InputDataConfig': {'S3Uri': 's3://am-tmp2/comprehend/textract-results.txt',
  'InputFormat': 'ONE_DOC_PER_FILE'},
 'OutputDataConfig': {'S3Uri': 's3://am-tmp2/comprehend/662559257807-TOPICS-0341c4209663f04171aaa7eb246f6fe8/output/output.tar.gz'},
 'NumberOfTopics': 10,
 'DataAccessRoleArn': 'arn:aws:iam::662559257807:role/ComprehendDataAccess-SM'}

In [105]:
# the comprehend analysis results are in the s3 bucket, full path is S3Uri
s3uri = response['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
basename = os.path.basename(s3uri)

# copy the 'output.tar.gz' file from the s3 bucket to the results folder
!aws s3 cp $s3uri $results_dir

# extract the contents of this tarball, which are two files: topic-terms.csv, doc-topics.csv
!(cd $results_dir; tar xzf $basename)
!(cd $results_dir; rm -f $basename)

print('See the following files:')
!ls -l $results_dir/topic-terms.csv
!ls -l $results_dir/doc-topics.csv

download: s3://am-tmp2/comprehend/662559257807-TOPICS-0341c4209663f04171aaa7eb246f6fe8/output/output.tar.gz to results/output.tar.gz
See the following files:
-rw-r--r-- 1 root root 2352 Sep 27 15:41 ./results/topic-terms.csv
-rw-r--r-- 1 root root 93 Sep 27 15:41 ./results/doc-topics.csv


In [89]:
response

{'TopicsDetectionJobProperties': {'JobId': '27d8c680451146edb26a006424c7f9fb',
  'JobName': 'MyJobName-1632753183',
  'JobStatus': 'COMPLETED',
  'SubmitTime': datetime.datetime(2021, 9, 27, 14, 33, 3, 487000, tzinfo=tzlocal()),
  'EndTime': datetime.datetime(2021, 9, 27, 15, 3, 19, 262000, tzinfo=tzlocal()),
  'InputDataConfig': {'S3Uri': 's3://am-tmp2/comprehend/textract-results.txt',
   'InputFormat': 'ONE_DOC_PER_LINE'},
  'OutputDataConfig': {'S3Uri': 's3://am-tmp2/comprehend/662559257807-TOPICS-27d8c680451146edb26a006424c7f9fb/output/output.tar.gz'},
  'NumberOfTopics': 10,
  'DataAccessRoleArn': 'arn:aws:iam::662559257807:role/ComprehendDataAccess-SM'},
 'ResponseMetadata': {'RequestId': '490a64a7-b1bb-4278-a27c-2a9ceb6963d5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '490a64a7-b1bb-4278-a27c-2a9ceb6963d5',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '632',
   'date': 'Mon, 27 Sep 2021 15:21:14 GMT'},
  'RetryAttempts': 0}}