# Amazon Comprehend
#### Extract valuable insights from text within documents

## Contents 
1. [Setup](#Setup)
1. [IAM Roles and Permissions](#IAM)
1. [Amazon Textract](#Textract)
1. [Key Phrase Extraction](#KeyPhrases)
1. [Sentiment Analysis](#Sentiment)
1. [Entity Recognition](#Entity)
1. [PII Entity Recognition](#PII)
1. [PII Using Batch Processing](#PII-Batch)
1. [Topic Modeling](#Topics)
1. [Custom Classifiers](#Custom)

#### Notes and Configuration
* Kernel `Python 3 (Data Science)` works well with this notebook

In [None]:
import os
import json
import sys
import time
import boto3
import sagemaker
import pandas as pd

## Setup <a name="Setup"></a>
Set the variables that are used throughout this example.

In [None]:
session = boto3.Session()
region = session.region_name
sm_session = sagemaker.Session()

s3bucket = sm_session.default_bucket()    
s3prefix = 'comprehend'

In [None]:
# this is where the various analysis results files will be stored on the local file system of this SageMaker instance
results_dir = './results'
!mkdir -p $results_dir

# this is the IAM Role that defines which permissions this SageMaker instance has
sagemaker_role = sagemaker.get_execution_role()
print('sagemaker execution role: ', sagemaker_role)
print('s3 bucket:', s3bucket)
print('s3 prefix:', s3prefix)
print('region:', region)

In [None]:
# set this to the ARN of the Role you created in the Textract Notebook
#comprehend_role = 'arn:aws:iam::662559257807:role/myComprehendDataAccessRole'
comprehend_role = '<enter your Comprehend Role created from the previous Textract workshop>'

---
## Amazon Textract <a name="Textract"></a>
Amazon Textract is a machine learning service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.  
  
In the next few cells the following steps will be performed:
1. A specified PDF document will be uploaded to Amazon S3 to be analyzed by Amazon Textract.  
1. Amazon Textract analyzes the document asynchronously. The result is JSON, with each element containing details about a specific instance of text in the document.  
1. Once the job completes, the JSON result is retrieved page by page through the Textract API.  
1. The JSON is then post-processed to produce a text file.
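The polling and paging steps above can be sketched as two small helpers. This is a minimal sketch, not the notebook's own code: the names `wait_for_textract_job` and `collect_lines` and the `delay` and `max_attempts` parameters are hypothetical, and `client` is assumed to be a boto3 Textract client such as the `textract_client` created in this notebook.

```python
import time


def wait_for_textract_job(client, job_id, delay=5, max_attempts=120):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS."""
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            return response
        if status == "FAILED":
            # surface the failure instead of polling forever
            raise RuntimeError(response.get("StatusMessage", "Textract job failed"))
        time.sleep(delay)
    raise TimeoutError(f"Textract job {job_id} did not finish in time")


def collect_lines(client, job_id, first_page):
    """Follow NextToken across result pages and return the text of LINE blocks."""
    lines, page = [], first_page
    while True:
        lines += [b["Text"] for b in page.get("Blocks", []) if b["BlockType"] == "LINE"]
        token = page.get("NextToken")
        if not token:
            return lines
        page = client.get_document_text_detection(JobId=job_id, NextToken=token)
```

With a real client, `"\n".join(collect_lines(textract_client, JobId, wait_for_textract_job(textract_client, JobId)))` would yield the document text one line per row.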


In [None]:
# create the Textract Job
#textract_src_filename = 'amazon-press-release.png'
#textract_src_filename = 'police-report.pdf'
textract_src_filename = 'resume.pdf'

# upload the source document to S3 for Textract to access
!aws s3 cp data/$textract_src_filename s3://$s3bucket/$s3prefix/$textract_src_filename

textract_client = session.client('textract')
response = textract_client.start_document_text_detection(
    DocumentLocation={
    'S3Object': {
        'Bucket': s3bucket,
        'Name': f'{s3prefix}/{textract_src_filename}'
        }
    }
)

JobId = response["JobId"]
print('JobId: %s' % (JobId))


In [None]:
response = textract_client.get_document_text_detection(JobId=JobId)
print(response)

In [None]:
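# poll until the Textract job finishes; stop instead of looping forever if it fails
while response['JobStatus'] == 'IN_PROGRESS':
    print('.', end='')
    time.sleep(5)
    response = textract_client.get_document_text_detection(JobId=JobId)
if response['JobStatus'] == 'FAILED':
    raise RuntimeError(response.get('StatusMessage', 'Textract job failed'))
print('done')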

In [None]:
# collect every page of results by following NextToken until it is absent
pages = [response]
while 'NextToken' in response:
    response = textract_client.get_document_text_detection(JobId=JobId, NextToken=response['NextToken'])
    pages.append(response)

In [None]:
fulltext = ''

# iterate through the Textract JSON response, keeping only the LINE entries
for page in pages:
    for blk in page['Blocks']:
        if blk['BlockType'] in ['LINE']:
            fulltext = fulltext + '\n' + blk['Text']

textract_results_filename = 'textract-results.txt'
with open(f'./results/{textract_results_filename}', 'w') as fd:
    fd.write(f'{fulltext}\n')



In [None]:
print(fulltext)

---
## Amazon Comprehend
Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to discover insights from text. The service provides APIs for Keyphrase Extraction, Sentiment Analysis, Entity Recognition, Topic Modeling, and Language Detection so you can easily integrate natural language processing into your applications. The following cells will walk through several examples of how to use the API.  


## Key Phrase Extraction <a name="KeyPhrases"></a>
Use Amazon Comprehend to extract Key Phrases in the text from the Textract analysis.  
The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.
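Text extracted from a longer document can exceed this size limit. One way to stay under it is to group whole lines into chunks by their encoded size. The helper below is a sketch under stated assumptions: `split_by_bytes` and the 4,500-byte default are hypothetical choices, and a single line longer than `max_bytes` would still produce an oversized chunk.

```python
def split_by_bytes(text, max_bytes=4500):
    """Group whole lines into chunks whose UTF-8 size stays under max_bytes."""
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        line_size = len(line.encode("utf-8")) + 1  # +1 for the joining newline
        if current and size + line_size > max_bytes:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk could then be passed separately, for example `comp_client.detect_key_phrases(Text=chunk, LanguageCode='en')` for every `chunk` in `split_by_bytes(fulltext)`.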


In [None]:
# create the comprehend boto3 client (from the existing boto3 session)
comp_client = session.client('comprehend')


In [None]:
response = comp_client.detect_key_phrases(Text=fulltext, LanguageCode='en')
response

---
## Sentiment Analysis <a name="Sentiment"></a>
Use Amazon Comprehend to determine the sentiment of a piece of text, such as a line from the Textract analysis. The cell below uses a short sample sentence. The sentiment is one of:
* POSITIVE, NEUTRAL, NEGATIVE, MIXED

The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.
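To score each line of the Textract output rather than a single string, the lines can be sent in groups with `batch_detect_sentiment`, which accepts up to 25 documents per call. The sketch below assumes `client` is a boto3 Comprehend client; the helper name `sentiment_by_line` is hypothetical, and entries reported in the response's `ErrorList` are ignored for brevity.

```python
def sentiment_by_line(client, text, language_code="en", batch_size=25):
    """Return (line, sentiment) pairs for every non-empty line of text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    results = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        response = client.batch_detect_sentiment(TextList=batch, LanguageCode=language_code)
        # Index refers to the position of the document inside this batch
        for item in response["ResultList"]:
            results.append((batch[item["Index"]], item["Sentiment"]))
    return results
```

With the objects from this notebook, the call would be `sentiment_by_line(comp_client, fulltext)`.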


In [None]:
sample_text = 'That person looks happy today'
response = comp_client.detect_sentiment(Text=sample_text, LanguageCode='en')
response

---
## Entity Recognition <a name="Entity"></a>
Use Amazon Comprehend to detect Entities in the text from the Textract analysis.  
What are the types of Entities?
* PERSON, ORGANIZATION, DATE, QUANTITY, LOCATION, TITLE, COMMERCIAL_ITEM, EVENT, OTHER

The input is a UTF-8 text string that must contain fewer than 5,000 bytes of UTF-8 encoded characters.
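The `detect_entities` response holds an `Entities` list in which each item carries a `Type`, the matched `Text`, and a confidence `Score`. A small helper can group the results by type for easier reading. This is an illustrative sketch: `entities_by_type` and the 0.8 score cutoff are hypothetical choices, not part of the Comprehend API.

```python
def entities_by_type(response, min_score=0.8):
    """Group distinct entity texts by entity type, dropping low-confidence hits."""
    grouped = {}
    for ent in response.get("Entities", []):
        if ent["Score"] < min_score:
            continue
        texts = grouped.setdefault(ent["Type"], [])
        if ent["Text"] not in texts:
            texts.append(ent["Text"])
    return grouped
```

It would be applied to the response of `comp_client.detect_entities(Text=fulltext, LanguageCode='en')`.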


In [None]:
response = comp_client.detect_entities(Text=fulltext, LanguageCode='en')
response

---
## PII Entity Recognition <a name="PII"></a>
Use Amazon Comprehend to detect PII Entities in the text from the Textract analysis.  
What are the types of PII Entities?  
* NAME, DATE_TIME, ADDRESS, USERNAME, URL, EMAIL, PHONE, CREDIT_DEBIT_EXPIRY, PASSWORD, AGE

In this example, we analyze a single string of text.
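Unlike `detect_entities`, the `detect_pii_entities` response does not include the matched text, only the `Type`, `Score`, `BeginOffset`, and `EndOffset` of each entity. The sketch below uses those offsets to mask the original string locally. The helper name `redact` and the 0.5 score cutoff are hypothetical, and the offsets are assumed to be character positions in the input string.

```python
def redact(text, pii_entities, min_score=0.5):
    """Replace each detected PII span with its entity type in brackets."""
    redacted = text
    # work from the end of the string so earlier offsets stay valid
    for ent in sorted(pii_entities, key=lambda e: e["BeginOffset"], reverse=True):
        if ent["Score"] >= min_score:
            redacted = (
                redacted[:ent["BeginOffset"]]
                + f"[{ent['Type']}]"
                + redacted[ent["EndOffset"]:]
            )
    return redacted
```

With the response from the cell below, the call would be `redact(fulltext, response['Entities'])`.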

In [None]:
response = comp_client.detect_pii_entities(Text=fulltext, LanguageCode='en')
response           

---
## PII Using Batch Processing <a name="PII-Batch"></a>
Use Amazon Comprehend to detect and redact PII in a text file stored in S3 (here, the text that Textract extracted from the PDF).  

In this example, we are running the analysis as an asynchronous job, so the results are stored in a file in the S3 bucket we specify.  
This analysis may take up to 10 minutes to run.  
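Asynchronous Comprehend jobs start in the `SUBMITTED` state before moving to `IN_PROGRESS`, so a wait loop needs to treat both as unfinished. A reusable waiter might look like the sketch below. The name `wait_for_job` and its parameters are hypothetical; `describe` is assumed to be any zero-argument callable that returns the job description.

```python
import time


def wait_for_job(describe, properties_key, delay=10, max_attempts=180):
    """Call describe() until the job reaches a terminal status, then return its properties."""
    for _ in range(max_attempts):
        props = describe()[properties_key]
        if props["JobStatus"] not in ("SUBMITTED", "IN_PROGRESS", "STOP_REQUESTED"):
            return props
        time.sleep(delay)
    raise TimeoutError("job did not reach a terminal status in time")
```

For the PII job, this could be used as `wait_for_job(lambda: comp_client.describe_pii_entities_detection_job(JobId=JobId), 'PiiEntitiesDetectionJobProperties')`. The same helper fits the topics and classification jobs by changing the describe call and the properties key.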



In [None]:
# put the file to be analyzed into the s3 bucket
# in this example, this file is the results from running textract on a pdf
s3dest = f's3://{s3bucket}/{s3prefix}/{textract_results_filename}'
!aws s3 cp ./results/$textract_results_filename $s3dest

In [None]:
request = comp_client.start_pii_entities_detection_job(
    InputDataConfig={
        'S3Uri': s3dest,
        'InputFormat': 'ONE_DOC_PER_FILE'
    },
    OutputDataConfig={
        'S3Uri': f's3://{s3bucket}/{s3prefix}'
    },
    Mode='ONLY_REDACTION',
    RedactionConfig={
        'PiiEntityTypes': ['ALL'],
        'MaskMode': 'REPLACE_WITH_PII_ENTITY_TYPE',
        'MaskCharacter': '*'
    },
    DataAccessRoleArn=comprehend_role,
    #JobName='string',
    LanguageCode='en'
)


In [None]:
JobId = request['JobId']
while True:
    response = comp_client.describe_pii_entities_detection_job(JobId=JobId)
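    # a new job starts as SUBMITTED, so treat both SUBMITTED and IN_PROGRESS as unfinished
    if response['PiiEntitiesDetectionJobProperties']['JobStatus'] not in ['SUBMITTED', 'IN_PROGRESS']: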
        print('')
        break
    print('.', end='')
    time.sleep(5)
    
response

---
## Topic Modeling <a name="Topics"></a>
Use Amazon Comprehend to extract Topics in the text from the Textract analysis.  

In this example, we are running the analysis as an asynchronous job, so the results are stored in a file in the S3 bucket we specify.  
This analysis may take up to 10 minutes to run.  

The output results are two files:  
*topic-terms.csv:*  A list of topics in the collection. For each topic, the list includes the top terms ranked by weight.  
*doc-topics.csv:*   Lists the documents associated with a topic and the proportion of the document that is concerned with the topic.
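Once the files are extracted, the topic terms can be summarized with the standard `csv` module. The sketch below assumes that `topic-terms.csv` has a header row with the columns `topic`, `term`, and `weight`; check the header of your own output before relying on it. The helper name `top_terms` is hypothetical.

```python
import csv
from collections import defaultdict


def top_terms(csv_file, n=3):
    """Return {topic number: [terms]} with the n highest-weighted terms per topic."""
    by_topic = defaultdict(list)
    for row in csv.DictReader(csv_file):
        by_topic[int(row["topic"])].append((float(row["weight"]), row["term"]))
    return {
        topic: [term for _, term in sorted(pairs, reverse=True)[:n]]
        for topic, pairs in sorted(by_topic.items())
    }
```

A typical call would be `top_terms(open('./results/topic-terms.csv', newline=''))` after the results have been downloaded and extracted.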


In [None]:
# start the Amazon Comprehend Topics Analysis job
request = comp_client.start_topics_detection_job(
    InputDataConfig = { 
      "InputFormat": "ONE_DOC_PER_FILE",
      "S3Uri": s3dest
    },
    OutputDataConfig = { 
      "S3Uri": f's3://{s3bucket}/{s3prefix}/'
    },
    DataAccessRoleArn = comprehend_role
)

JobId = request['JobId']
print(JobId)


In [None]:
%%time
while True:
    response = comp_client.describe_topics_detection_job(JobId=JobId)
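    # a new job starts as SUBMITTED, so treat both SUBMITTED and IN_PROGRESS as unfinished
    if response['TopicsDetectionJobProperties']['JobStatus'] not in ['SUBMITTED', 'IN_PROGRESS']: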
        print('')
        break
    print('.', end='')
    time.sleep(10)
response


In [None]:
# the comprehend analysis results are in the s3 bucket, full path is S3Uri
s3uri = response['TopicsDetectionJobProperties']['OutputDataConfig']['S3Uri']
basename = os.path.basename(s3uri)

# copy the 'output.tar.gz' file from the s3 bucket to the results folder
!aws s3 cp $s3uri $results_dir

# extract the contents of this tarball, which are two files: topic-terms.csv, doc-topics.csv
!(cd $results_dir; tar xzf $basename)
!(cd $results_dir; rm -f $basename)

print('See the following files:')
!ls -l $results_dir/topic-terms.csv
!ls -l $results_dir/doc-topics.csv

---
## Amazon Comprehend - Custom Classifiers <a name="Custom"></a>
Use Amazon Comprehend to label or classify documents. This functionality gives you the ability to perform document classifications that are unique to your business.

In this example, we are going to label some resumes.

First, split the resume data into training and test sets, then upload the training set to S3 so we can train our classifier.
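For multi-label training, Comprehend's CSV format expects one document per line with no header row: the labels come first, joined by the label delimiter, followed by the document text. The sketch below shows a reproducible split and that row layout using only the standard library. The helper names, the fixed seed, and the `(labels, text)` row shape are assumptions for illustration, since the column layout of *resumes.csv* is not shown here.

```python
import csv
import random


def split_rows(rows, train_frac=0.9, seed=42):
    """Shuffle a copy of rows with a fixed seed and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def to_training_csv(rows, out, delimiter="|"):
    """Write (labels, text) rows as headerless 'LABEL1|LABEL2,text' lines."""
    writer = csv.writer(out)
    for labels, text in rows:
        # each document has to stay on a single line
        writer.writerow([delimiter.join(labels), text.replace("\n", " ")])
```

Fixing the seed keeps the same documents in the test set across notebook runs, which makes later comparisons of predictions easier.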

In [None]:
df = pd.read_csv('./data/resumes.csv')

# use 90% of our data for training and the remainder for testing
df_train = df.sample(frac = 0.9)
df_test = df.drop(df_train.index)

df_train.to_csv('./data/train.csv', index=False)
df_test.to_csv('./data/test.csv', index=False)

In [None]:
s3dest = f's3://{s3bucket}/{s3prefix}/data/train.csv'
!aws s3 cp data/train.csv $s3dest

## Create (train) the Custom Classifier
This step takes a long time to run (almost two hours) on the *resumes.csv* data set.

In [None]:
s3results = f's3://{s3bucket}/{s3prefix}/custom/'

request = comp_client.create_document_classifier(
    DocumentClassifierName='myComprehendClassifier',
    DataAccessRoleArn=comprehend_role,
    InputDataConfig={
        'DataFormat': 'COMPREHEND_CSV',
        'S3Uri': s3dest,
        'LabelDelimiter': '|',
    },
    OutputDataConfig={
        'S3Uri': s3results,
    },
    LanguageCode='en',
    Mode='MULTI_LABEL',
    VersionName='v1'
)


In [None]:
print(request)

In [None]:
%%time
while True:
    response = comp_client.describe_document_classifier(DocumentClassifierArn=request['DocumentClassifierArn'])
    status = response["DocumentClassifierProperties"]["Status"]
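    # stop on any terminal status; a failed training run is reported as IN_ERROR
    if status in ['TRAINED', 'TRAINED_WITH_WARNING', 'IN_ERROR', 'STOPPED']: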
        print('')
        break
    print('.', end='')
    time.sleep(10)
response

### Now let's take our held-out test resumes and run inference on them

In [None]:
s3dest = f's3://{s3bucket}/{s3prefix}/data/test.csv'
!aws s3 cp ./data/test.csv $s3dest

### Start the Classification Job

In [None]:
arn = response['DocumentClassifierProperties']['DocumentClassifierArn']
request = comp_client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': f's3://{s3bucket}/{s3prefix}/data/test.csv',
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': f's3://{s3bucket}/{s3prefix}/inf-results/'
    },
    DataAccessRoleArn = comprehend_role,
    DocumentClassifierArn = arn
)

request

### Wait for the Classification Job to complete

In [None]:
%%time
while True:
    response = comp_client.describe_document_classification_job(JobId=request['JobId'])
    status = response["DocumentClassificationJobProperties"]["JobStatus"]
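    # a new job starts as SUBMITTED, so treat both SUBMITTED and IN_PROGRESS as unfinished
    if status not in ['SUBMITTED', 'IN_PROGRESS']: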
        print('')
        break
    print('.', end='')
    time.sleep(10)
response

In [None]:
s3file = response['DocumentClassificationJobProperties']['OutputDataConfig']['S3Uri']

In [None]:
!aws s3 cp $s3file results/

In [None]:
# use the classification job's output location (s3file), not the topics job's s3uri
basename = os.path.basename(s3file)
!(cd results; tar xvzf $basename)

In [None]:
!head ./results/predictions.jsonl

In [None]:
!head -3 ./data/test.csv | tail -1
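To compare the predictions with the test data programmatically, the JSON Lines file can be parsed into a mapping from input line number to predicted labels. This sketch assumes each record has a `Line` field plus a `Labels` list (multi-label mode) or a `Classes` list (multi-class mode) of `Name` and `Score` pairs; inspect the `head` output above to confirm the shape. The helper name `predicted_labels` and the 0.5 threshold are hypothetical.

```python
import json


def predicted_labels(jsonl_lines, threshold=0.5):
    """Map each input line number to the label names scoring at or above threshold."""
    predictions = {}
    for raw in jsonl_lines:
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        scored = record.get("Labels") or record.get("Classes") or []
        predictions[int(record["Line"])] = [
            item["Name"] for item in scored if item["Score"] >= threshold
        ]
    return predictions
```

A typical call would be `predicted_labels(open('./results/predictions.jsonl'))`, after which entry `n` can be checked against the label on the matching row of *test.csv*.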