# Amazon Textract and Amazon Comprehend AI Services
### Example on extracting insights from a PDF Document


## Contents 
1. [Background](#Background)
1. [Notes and Configuration](#Notes-and-Configuration)
1. [Functions](#Functions)
1. [Amazon Textract](#Amazon-Textract)
1. [Amazon Comprehend](#Amazon-Comprehend)
1. [Key Phrase Extraction](#Key-Phrase-Extraction)
1. [Sentiment Analysis](#Sentiment-Analysis)
1. [Entity Recognition](#Entity-Recognition)
1. [PII Entity Recognition](#PII-Entity-Recognition)
1. [Topic Modeling](#Topic-Modeling)


  
## Background
The goal of this exercise is to learn some insights from an existing PDF document. This is done by using Amazon Textract to extract the text from the document. This text is then analyzed by several Amazon Comprehend services to produce some insights about the document.  

The PDF document used in this example is a compiled list of tweets or other social media posts. Each post is separated by a URL that points to that posting. When the text is extracted from the PDF document, the text is re-assembled into a single line of text which is the full text of the tweet or post. The resulting text file contains one tweet/post per line.

## Notes and Configuration
* Kernel `Python 3 (Data Science)` works well with this notebook
* The CSV results files produced by this script use the pipe '|' symbol as a delimiter. When viewing these files in SageMaker Studio, be sure and change the Delimiter to 'pipe'.


#### Regarding IAM Roles and Permissions:

Within SageMaker Studio, each SageMaker User has an IAM Role known as the `SageMaker Execution Role`. Each Notebook for this user will run with this Role and the Permissions specified by this Role. The name of this Role can be found in the Details section of each SageMaker User in the AWS Console.

For the code which runs in this notebook, the `SageMaker Execution Role` needs additional permissions to allow it to use Amazon Textract and Amazon Comprehend. In the AWS Console, navigate to the IAM service and add these two services to your SageMaker Execution Role:
- AmazonTextractFullAccess
- AmazonComprehendFullAccess

Also, an Amazon Comprehend service Role needs to be created to grant Amazon Comprehend read access to your input data.  
When creating this new Role, the default Policies are sufficient (i.e., no other Policies need to be added/modified).

Lastly, the `SageMaker Execution Role` must be allowed to Pass the Comprehend Service Role. To allow this, you must attach a Policy to the `SageMaker Execution Role`. Below, the Resource entry is the ARN of the Comprehend service Role which you created. You can either create this as a new Policy and attach it or add it as an in-line Policy.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:GetRole",
                    "iam:PassRole"
                ],
                "Resource": "arn:aws:iam::810190279255:role/amComprehendServiceRole"
            }
        ]
    }



In [1]:
import os
import json
import sys
import time
import boto3
from sagemaker import get_execution_role

## Setup
Set some variables that will be used throughout this example

In [2]:
region = 'us-east-1'

# change this to an existing S3 bucket in your AWS account
bucket = 'am-buck1'

# this is where the various analysis results files will be stored on the local file system of this SageMaker instance
results_dir = './results'
!mkdir -p $results_dir

# the pdf file to be analyzed by Textract
textract_src_filename = 'Alabama2.pdf'

# the name of the file where the JSON results from Textract are saved
json_textract_results_filename = f'{results_dir}/textract-results.json'

# the post-processed results of the JSON results
textract_results_filename = f'{results_dir}/textract-results.txt'

# the results of Amazon Comprehend - Key Phrases detection
comprehend_keyphrases_results_filename = f'{results_dir}/comp-keyphrases.csv'

# the results of Amazon Comprehend - Sentiment Analysis
comprehend_sentiments_results_filename = f'{results_dir}/comp-sentiment.csv'

# the results of Amazon Comprehend - Entities Detection
comprehend_entities_results_filename = f'{results_dir}/comp-entities.csv'

# the results of Amazon Comprehend - Entities Detection
comprehend_pii_entities_results_filename = f'{results_dir}/comp-pii_entities.csv'

# the results of Amazon Comprehend - Topics Detection
comprehend_topics_results_filename = f'{results_dir}/comp-topics.csv'

# this is the IAM Role that defines which permissions this SageMaker instance has
sm_execution_role = get_execution_role()

# why is this IAM Role needed? See the Notes at the beginning of this Notebook for an explanation
# replace the string below with the ARN of the Amazon Comprehend Service Role that you create in your AWS account
comprehend_role = 'arn:aws:iam::810190279255:role/amComprehendServiceRole'



---
## Functions
The following functions provide a wrapper around the actual API calls for Amazon Textract and Amazon Comprehend


In [3]:
'''
FixupString()
Input: string
Returns: lowercase string with punctuation and stop words removed
'''
def FixupString(s):
    punc = '!()-[]{};:\,<>./?@#$%^&*_~"\''
    
    # these are the English stop words from the NLTK package
    stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 
                 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', 
                 "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 
                 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 
                 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 
                 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 
                 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 
                 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 
                 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 
                 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', 
                 "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 
                 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', 
                 "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"
                ]
    
    # remove punctuation
    for c in punc: 
        if c in s: 
            s = s.replace(c, '')
        
    # convert to lowercase
    s = s.lower()
    
    # remove stop words
    words = s.split()
    for w in words:
        if w in stopwords:
            words.remove(w)
 
    # put string back together
    s = ' '.join(words)
    
    return(s)

'''
CalcFrequencies()
Input: dict with keys and numeric values
Returns: dict with the same keys and numeric frequency
'''
def CalcFrequencies(di):
    
    freq = {}
    
    sum = 0
    for d in di:
        sum += di[d]
    
    for d in di:
        freq[d] = di[d]/sum

    return freq

'''
StartTextractJob()
Input: textract-boto3-client, dict with keys and numeric values
Returns: dict with the same keys and numeric frequency
'''
def StartTextractJob(client, bucket, src_filename):
    response = client.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': bucket,
            'Name': src_filename
        }
    })

    return response["JobId"]


'''
GetTextractJobResults()
Input: textract-boto3-client, textract job id
Returns: textract results
'''
def GetTextractJobResults(client, jobId):

    pages = []
    seconds_ctr = 0
    response = client.get_document_text_detection(JobId=jobId)
    print('working.', end='')
    while(seconds_ctr < 600):
        print('.', end='')
        response = client.get_document_text_detection(JobId=jobId)
        if response['JobStatus'] == 'IN_PROGRESS':
            time.sleep(2)
            seconds_ctr += 2
        else:
            break
            
    if response['JobStatus'] == 'SUCCEEDED':
        while(True):
            pages.append(response)
            nextToken = None
            if('NextToken' in response):
                nextToken = response['NextToken']
                response = client.get_document_text_detection(JobId=jobId, NextToken=nextToken)

            if nextToken == None:
                break
                
    print('\n')
    return pages

'''
SaveTextractResults()
Input: Textract results (JSON), and output filename
Output: a file with the postprocessed results
Returns: nothing
'''
def SaveTextractResults(response, filename):
    with open(filename, 'w') as fd:
        # iterate through the Textract responses, looking for the LINE and WORD entries
        for resp in response:
            for blk in resp['Blocks']:
                if blk['BlockType'] in ['LINE', 'WORD']:
                    # if 'http' is found at the beginning of the line, we assume a new paragraph of text will be started
                    loc = blk['Text'].find('https')
                    if loc >= 0 and loc <= 2:
                        fd.write('\n')
                    else:
                        fd.write('%s ' % blk['Text'])


'''
GetKeyPhrases()
Input: Boto3 client for Comprehend, text file (each line will be analyzed for key phrases)
Output: a file with the postprocessed results
Returns: dict with key phrases (as the key) and counts
'''
def GetKeyPhrases(client, infile):

    line_ctr = 0
    print('working.', end='')
    
    # keep a running total of the various sentiments
    result_ctr = {}

    with open(infile) as fd:
        lines = fd.readlines()
        for line in lines:
            line_ctr += 1
            if line_ctr % 5 == 0:
                print('.', end='')
            if len(line) > 1:
                # maximum text length for Comprehend Key Phrases is 5,000 characters
                line = line[:4998]
                line = line.replace('|', ' ')           
                response = client.detect_key_phrases(Text=line, LanguageCode='en')
                for keyphrase in response['KeyPhrases']:
                    kp = keyphrase['Text']
                    if kp in result_ctr:
                        result_ctr[kp] += 1
                    else:
                        result_ctr[kp] = 1

    print('\n')

    # sort the dictionary by values
    sorted_ctr = dict(sorted(result_ctr.items(), key=lambda x: x[1], reverse=True))
    return sorted_ctr


'''
GetSentiments()
Input: Boto3 client for Comprehend, text file (each line will be analyzed for sentiment), output file name
Output: file with each line and its sentiment
Returns: dict with sentiment types and counts
'''
def GetSentiments(client, infile, outfile):

    line_ctr = 0
    print('working.', end='')
    
    # keep a running total of the various sentiments
    result_ctr = {}

    with open(infile) as fd:
        lines = fd.readlines()
    
    with open(outfile, 'w') as fd:
        for line in lines:
            line_ctr += 1
            if line_ctr % 5 == 0:
                print('.', end='')
            # maximum text length for Comprehend Sentiment is 5,000 characters
            line = line[:4998]
            line = line.replace('|', ' ')           
            if len(line) > 1:
                response = client.detect_sentiment(Text=line, LanguageCode='en')
                sentiment = response['Sentiment']
                if sentiment in result_ctr:
                    result_ctr[sentiment] += 1
                else:
                    result_ctr[sentiment] = 1
                fd.write('%s|%s' % (sentiment, line))
    
    print('\n')

    # sort the dictionary by values
    sorted_ctr = dict(sorted(result_ctr.items(), key=lambda x: x[1], reverse=True))
    return sorted_ctr


'''
GetEntities()
Input: text file (each line will be analyzed for entities), output file name
Output: file with each line and its entity
Returns: dict with entity types and counts
'''
def GetEntities(client, infile, outfile):

    line_ctr = 0
    print('working.', end='')
    
    # keep a running total of the various sentiments
    result_ctr = {}

    with open(infile) as fd:
        lines = fd.readlines()

    with open(outfile, 'w') as fd:
        for line in lines:
            line_ctr += 1
            if line_ctr % 5 == 0:
                print('.', end='')
            line = line[:4998]
            line = line.replace('|', ' ')                       
            if len(line) > 1:
                # maximum text length for Comprehend Entities is 5,000 characters
                response = client.detect_entities(Text=line, LanguageCode='en')
                for entity in response['Entities']:
                    etype = entity['Type']
                    if etype in result_ctr:
                        result_ctr[etype] += 1
                    else:
                        result_ctr[etype] = 1
                    fd.write('%s|%s\n' % (etype, entity['Text']))
    print('\n')
    
    # sort the dictionary by values
    sorted_ctr = dict(sorted(result_ctr.items(), key=lambda x: x[1], reverse=True))
    return sorted_ctr


'''
GetPIIEntities()
Input: text file (each line will be analyzed for PII entities), output file name
Output: file with each line and its entity type
Returns: dict with PII entity types and counts
'''
def GetPIIEntities(client, infile, outfile):

    line_ctr = 0
    print('working.', end='')
    
    # keep a running total of the various sentiments
    result_ctr = {}

    with open(infile) as fd:
        lines = fd.readlines()

    with open(outfile, 'w') as fd:
        for line in lines:
            line_ctr += 1
            if line_ctr % 5 == 0:
                print('.', end='')
            # maximum text length for Comprehend Entities is 5,000 characters
            line = line[:4998]
            line = line.replace('|', ' ')           
            if len(line) > 1:
                response = client.detect_pii_entities(Text=line, LanguageCode='en')
                for entity in response['Entities']:
                    etype = entity['Type']
                    if etype in result_ctr:
                        result_ctr[etype] += 1
                    else:
                        result_ctr[etype] = 1
                    fd.write('%s|%s\n' % (etype, line[entity['BeginOffset']:entity['EndOffset']]))
                    
    print('\n')
    
    # sort the dictionary by values
    sorted_ctr = dict(sorted(result_ctr.items(), key=lambda x: x[1], reverse=True))
    return sorted_ctr


'''
StartTopicsAnalysisJob()
Input: Boto3 Comprehend client, S3 bucket, Role, and the file containing the text to be analyzed
Returns: Job ID
'''
def StartTopicsAnalysisJob(client, bucket, role, infile):
    # create a unique Job Name
    JobName = 'MyJobName-%d' % (time.time())

    request = {
       "ClientRequestToken": "string",
       "DataAccessRoleArn": '%s' % (role),
       "InputDataConfig": { 
          "InputFormat": "ONE_DOC_PER_FILE",
          "S3Uri": 's3://%s/%s' % (bucket, infile)
       },
       "JobName": JobName,
       "OutputDataConfig": { 
          "S3Uri": 's3://%s' % (bucket)
       }
    }

    # create the comprehend analysis job
    job = client.start_topics_detection_job(**request)
    return job['JobId']

'''
GetTopicsAnalysisJob()
Input: Boto3 Comprehend client, JobId
Returns: 
'''
def GetTopics(client, jobid):
    seconds_ctr = 0
    print('working.', end='')
    
    while(seconds_ctr < 3600):
        response = client.describe_topics_detection_job(JobId=jobid)
        status = response['TopicsDetectionJobProperties']['JobStatus']
        if status == 'IN_PROGRESS' or status == 'SUBMITTED':
            print('.', end='')
            seconds_ctr += 3
            time.sleep(3)
        else:
            break
            
    print('\n')
    return response['TopicsDetectionJobProperties']


---
## Amazon Textract
Amazon Textract is a machine learning service that automatically extracts text, handwriting and data from scanned documents that goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.  
  
In the next few cells the following steps will be performed:
1. A specified PDF document will be uploaded to Amazon S3 and analyzed by Amazon Textract.  
1. The result of this analysis is a JSON file with each element containing details about a specific instance of text in the PDF.  
1. This JSON file is copied from S3 to this local SageMaker instance.  
1. The JSON file is then read and post-processed to produce a text file with one tweet (or other social media post) per line.  


In [4]:
# create a boto3 session
# this session will be used for the remainder of this notebook
session = boto3.Session(region_name=region)


In [5]:
# create the Textract Job
textract_client = session.client('textract')
jobId = StartTextractJob(textract_client, bucket, textract_src_filename)
print('Started Textract job at %s' % (time.ctime()))
print('JobId: %s' % (jobId))


Started Textract job at Fri Apr 16 20:14:28 2021
JobId: 3b805af6818cac67cb193e86a514d21268f82cba584f50b308658e0d633b9c15


In [6]:
%%time
# wait for job to complete and get the results
results = GetTextractJobResults(textract_client, jobId)

# save the entire results set to a local file
# this file isn't used in the remaining example, but you can open this JSON file in your Jupyter Notebook and view the elements returned by Textract
with open(json_textract_results_filename, 'w') as fd:
    json.dump(results, fd)
    

working..................................................................................

CPU times: user 4.69 s, sys: 119 ms, total: 4.81 s
Wall time: 2min 52s


In [7]:
SaveTextractResults(results, textract_results_filename)
print('See results file: %s\n' % textract_results_filename)


See results file: ./results/textract-results.txt



---
## Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. The service provides APIs for Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection so you can easily integrate natural language processing into your applications. The following cells will walk through several examples of how to use the API.  


## Key Phrase Extraction
Use Amazon Comprehend to extract Key Phrases in the text from the Textract analysis.


In [8]:
# create the comprehend boto3 client (from the existing boto3 session)
comp_client = session.client('comprehend')


In [9]:
print('Started Comprehend Key Phrases Analysis job')
keyphrase_counts = GetKeyPhrases(comp_client, textract_results_filename)

# calculate the frequency of each key phrase
freq = CalcFrequencies(keyphrase_counts)

# the results file is in csv format and includes the raw counts and the frequency
with open(comprehend_keyphrases_results_filename, 'w') as fd:
    fd.write('key_phrase|count|frequency\n')
    for kp in keyphrase_counts:        
        fd.write('%s|%d|%.4f\n' % (kp, keyphrase_counts[kp], freq[kp]))

print('See results file: %s' % (comprehend_keyphrases_results_filename))


Started Comprehend Key Phrases Analysis job
working...................................................................................

See results file: ./results/comp-keyphrases.csv


---
## Sentiment Analysis
Use Amazon Comprehend to determine the Sentiment of each line of text from the Textract analysis.

In [10]:
print('Started Comprehend Sentiment Analysis job')
sentiments = GetSentiments(comp_client, textract_results_filename, comprehend_sentiments_results_filename)
print('See results file: %s\n' % (comprehend_sentiments_results_filename))

freq = CalcFrequencies(sentiments)
print('Frequencies:')
for d in sentiments:
    print('%s: %.2f' % (d, freq[d]))        
    


Started Comprehend Sentiment Analysis job
working...................................................................................

See results file: ./results/comp-sentiment.csv

Frequencies:
NEUTRAL: 0.72
NEGATIVE: 0.26
POSITIVE: 0.02
MIXED: 0.00


---
## Entity Recognition
Use Amazon Comprehend to detect Entities in the text from the Textract analysis.  
What are the type of Entities?
* PERSON, ORGANIZATION, DATE, QUANTITY, LOCATION, TITLE, COMMERCIAL_ITEM, EVENT, OTHER

In [11]:
print('Started Comprehend Entities Analysis job')
entities = GetEntities(comp_client, textract_results_filename, comprehend_entities_results_filename)
print('See results file: %s\n' % (comprehend_entities_results_filename))

freq = CalcFrequencies(entities)
print('Frequencies:')
for d in entities:
    print('%s: %.2f' % (d, freq[d]))        
                    

Started Comprehend Entities Analysis job
working...................................................................................

See results file: ./results/comp-entities.csv

Frequencies:
PERSON: 0.31
ORGANIZATION: 0.18
DATE: 0.17
QUANTITY: 0.14
LOCATION: 0.09
OTHER: 0.05
TITLE: 0.02
COMMERCIAL_ITEM: 0.02
EVENT: 0.01


---
## PII Entity Recognition
Use Amazon Comprehend to detect PII Entities in the text from the Textract analysis.  
What are the types of PII Entities?  
* NAME, DATE-TIME, ADDRESS, USERNAME, URL, EMAIL, PHONE, CREDIT-DEBIT-EXPIRY, PASSWORD, AGE


In [12]:
print('Started Comprehend PII Entities Analysis job')
pii_entities = GetPIIEntities(comp_client, textract_results_filename, comprehend_pii_entities_results_filename)
print('See results file: %s\n' % (comprehend_pii_entities_results_filename))

freq = CalcFrequencies(pii_entities)
print('Frequencies:')
for d in pii_entities:
    print('%s: %.2f' % (d, freq[d]))        
                    

Started Comprehend PII Entities Analysis job
working...................................................................................

See results file: ./results/comp-pii_entities.csv

Frequencies:
NAME: 0.34
DATE_TIME: 0.30
ADDRESS: 0.16
USERNAME: 0.10
URL: 0.06
EMAIL: 0.05
PHONE: 0.00
CREDIT_DEBIT_EXPIRY: 0.00
PASSWORD: 0.00
AGE: 0.00


---
## Topic Modeling
Use Amazon Comprehend to extract Topics in the text from the Textract analysis.  

In this example, we are running the analysis as an asynchronous job, so the results are stored in a file in the S3 bucket we specify.  
This analysis may take up to 10 minutes to run.  

In this example, the input is a single file. Each line in the file is considered a document. This is best for short documents, such as social media postings.  
Each line must end with a line feed (LF, \n), a carriage return (CR, \r), or both (CRLF, \r\n).   

The output results are two files:  
*topic_terms.csv:*  A list of topics in the collection. For each topic, the list includes the top terms by topic according to their weight.  
*doc-topics.csv:*   Lists the documents associated with a topic and the proportion of the document that is concerned with the topic.


In [13]:
# put the file to be analyzed into the s3 bucket
# in this example, this file is the results from running textract on a pdf
s3dest = 's3://%s/%s' % (bucket, os.path.basename(textract_results_filename))
!aws s3 cp $textract_results_filename $s3dest

upload: results/textract-results.txt to s3://am-buck1/textract-results.txt


In [14]:
# start the Amazon Comprehend Topics Analysis job
jobId = StartTopicsAnalysisJob(comp_client, bucket, comprehend_role, textract_results_filename)
print('Started Comprehend Topics Analysis job at %s' % (time.ctime()))
print('JobId: %s' % (jobId))


Started Comprehend Topics Analysis job at Fri Apr 16 20:18:55 2021
JobId: f573bc7bcb05193bffe517e60be95991


In [15]:
%%time
results = GetTopics(comp_client, jobId)
print(results['JobStatus'])

# the comprehend analysis results are in the s3 bucket, full path is S3Uri
s3uri = results['OutputDataConfig']['S3Uri']
basename = os.path.basename(s3uri)

# copy the 'output.tar.gz' file from the s3 bucket to the results folder
!aws s3 cp $s3uri $results_dir

# remove the folder/file from s3
!aws s3 rm $s3uri

# extract the contents of this tarball, which are two files: topic-terms.csv, doc-topics.csv
!(cd $results_dir; tar xzf $basename)
!(cd $results_dir; rm -f $basename)

print('See the following files:')
!ls -l $results_dir/topic-terms.csv
!ls -l $results_dir/doc-topics.csv


working...................................................................................................................................................................

COMPLETED
download: s3://am-buck1/810190279255-TOPICS-f573bc7bcb05193bffe517e60be95991/output/output.tar.gz to results/output.tar.gz
delete: s3://am-buck1/810190279255-TOPICS-f573bc7bcb05193bffe517e60be95991/output/output.tar.gz
See the following files:
-rw-r--r-- 1 root root 2568 Apr 16 20:25 ./results/topic-terms.csv
-rw-r--r-- 1 root root 195 Apr 16 20:25 ./results/doc-topics.csv
CPU times: user 738 ms, sys: 109 ms, total: 847 ms
Wall time: 8min 11s
