# Using Amazon Comprehend Through the boto3 API 

This notebook shows how to use the boto3 API to call Amazon Comprehend, both for real-time analysis and for scheduling analysis jobs.
1. Credentials. For boto3 to work, you need to create an IAM user, obtain its `aws_access_key_id` and `aws_secret_access_key`, and configure your credentials using the AWS Command Line Interface (AWS CLI).
2. Cost. If you are using the AWS Free Tier, you can analyze 50K units a month for free. In my example, every tweet is a unit. In the scheduled job I am analyzing 10K tweets at once, so the free tier runs out pretty fast, and then it's \$1 per 10K. Be sure to check pricing before you proceed: https://aws.amazon.com/comprehend/pricing/
3. Reference. Boto3 Comprehend: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html  Boto3 S3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
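
To get a feel for how quickly the free tier runs out, here is a rough cost sketch. It uses only the figures quoted in item 2 above (50K free units, \$1 per 10K units, one tweet counted as one unit); the function name and its parameters are made up for illustration, and the pricing page remains the authority on how units are actually counted.

```python
def estimate_monthly_cost(units, free_units=50_000, usd_per_10k=1.0):
    """Rough monthly cost in USD, based on the figures quoted in the text."""
    # Only units beyond the free allowance are billed
    billable = max(0, units - free_units)
    return billable / 10_000 * usd_per_10k

# Six scheduled jobs of 10K tweets each, counting one unit per tweet
print(estimate_monthly_cost(6 * 10_000))  # 1.0
```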

In [1]:
import logging  # used by upload_file() to report S3 errors
import boto3
from botocore.exceptions import ClientError
import pandas as pd
import json
import tarfile

## Quick Reference to S3 Upload and Download

Upload data files to S3

In [2]:
def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """

    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True

In [3]:
bucket_name  = 'comprehend-api'
local_file_name = 'Comprehend/amazon_tweets.csv'
aws_file_name = 'input-data/amazon_tweets.csv'
# Upload file to specific location
upload_file(local_file_name, bucket_name, aws_file_name)

True

Downloading Results

In [5]:
entities_results_S3Url = 's3://BUCKETNAME/results/LONG_LONG_NAME/output/output.tar.gz'
local_results_filename = 'Comprehend/outputs/entities.tar.gz'

s3 = boto3.client('s3')
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = entities_results_S3Url.replace(s3_name, '')
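
The string replacement above only works when the URI starts with the expected bucket name. As an alternative sketch, the URI can be split with the standard library instead. The helper name `split_s3_uri` and the example path are hypothetical.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(f'Not an S3 URI: {uri}')
    # netloc holds the bucket name; the path is the object key with a leading '/'
    return parsed.netloc, parsed.path.lstrip('/')

# Hypothetical results location, for illustration only
bucket, key = split_s3_uri('s3://comprehend-api/results/example/output/output.tar.gz')
print(bucket)  # comprehend-api
print(key)     # results/example/output/output.tar.gz
```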

In [6]:
# Download results
s3.download_file(bucket_name,
                 results_aws_filename, 
                 local_results_filename)

Unzip the results

In [7]:
def extract_targz(targz_file, output_path = ''):
    if targz_file.endswith("tar.gz"):
        tar = tarfile.open(targz_file, "r:gz")
        tar.extractall(path = output_path)
        tar.close()
    elif targz_file.endswith("tar"):
        tar = tarfile.open(targz_file, "r:")
        tar.extractall(path = output_path)
        tar.close()

In [8]:
output_path = 'Comprehend/outputs/extracted'
extract_targz(local_results_filename, output_path)
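
Note that `extract_targz` silently does nothing when the file name ends in neither `tar.gz` nor `tar`. A possible variant, sketched below under the assumption that the input is always some kind of tar archive, lets `tarfile` detect the compression itself and closes the file even if extraction fails. The name `extract_archive` is made up for this example.

```python
import tarfile

def extract_archive(archive_path, output_path='.'):
    """Extract a .tar or .tar.gz archive and return the member names."""
    # 'r:*' opens the archive with transparent compression detection;
    # the 'with' block closes the file even if extraction raises
    with tarfile.open(archive_path, 'r:*') as tar:
        names = tar.getnames()
        tar.extractall(path=output_path)
    return names
```

It would be called the same way as the original, for example `extract_archive(local_results_filename, output_path)`.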

## Live Single Record Processing

In [9]:
local_file_name = 'Comprehend/amazon_tweets.csv'
region_name = 'us-east-2' # USE YOUR REGION HERE
# read data
df = pd.read_csv(local_file_name, header = None, names = ['amazon_tweets'], dtype = 'str')
# Record to examine
df.loc[0].item()

In [10]:
# Initialize the Comprehend client
comprehend = boto3.client(service_name='comprehend', region_name=region_name)

In [15]:
# Run sentiment analysis
sentiment_output = comprehend.detect_sentiment(Text=df.loc[0].item(), LanguageCode='en')
# Output
sentiment_output

{'Sentiment': 'NEUTRAL',
 'SentimentScore': {'Positive': 0.28640827536582947,
  'Negative': 0.0001400432811351493,
  'Neutral': 0.7134460806846619,
  'Mixed': 5.648644219036214e-06},
 'ResponseMetadata': {'RequestId': '40eb6a43-8745-4cf6-ab4d-884dce7fd680',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '40eb6a43-8745-4cf6-ab4d-884dce7fd680',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '163',
   'date': 'Tue, 29 Sep 2020 18:03:53 GMT'},
  'RetryAttempts': 0}}

In [16]:
#Score
sentiment_output['SentimentScore']

{'Positive': 0.28640827536582947,
 'Negative': 0.0001400432811351493,
 'Neutral': 0.7134460806846619,
 'Mixed': 5.648644219036214e-06}
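
The `Sentiment` field is simply the label with the highest score. A small sketch, using the scores printed above, shows how to pull out the top label yourself (the variable names are illustrative):

```python
# Scores copied from the output above
sentiment_score = {
    'Positive': 0.28640827536582947,
    'Negative': 0.0001400432811351493,
    'Neutral': 0.7134460806846619,
    'Mixed': 5.648644219036214e-06,
}

# Pick the key whose value is largest
top_label = max(sentiment_score, key=sentiment_score.get)
print(top_label, round(sentiment_score[top_label], 3))  # Neutral 0.713
```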

### Live Multiple Record Processing

In [17]:
# A single batch call can analyze a maximum of 25 records
tweets25 = list(df.amazon_tweets[0:25])
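
If you have more than 25 records, one option is to split the list into chunks of 25 and send one batch request per chunk. The sketch below shows only the splitting step; `chunks` is a hypothetical helper and the sample list is made up.

```python
def chunks(items, size=25):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up data standing in for a longer list of tweets
sample = [f'tweet {i}' for i in range(60)]
print([len(batch) for batch in chunks(sample)])  # [25, 25, 10]
```

Each chunk could then be passed as `TextList` to one of the batch calls used below.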

#### Sentiment Analysis

In [18]:
#Run a sentiment batch
sentiment_batch = comprehend.batch_detect_sentiment(TextList=tweets25,
                                                    LanguageCode='en')

Check a tweet and its sentiment

In [19]:
tweets25[4]

'@thom_galvin https://t.co/hLLKVH322N Book to soon to come! Getting excited!'

In [20]:
sentiment_batch['ResultList'][4]

{'Index': 4,
 'Sentiment': 'POSITIVE',
 'SentimentScore': {'Positive': 0.9799827337265015,
  'Negative': 6.78108845022507e-05,
  'Neutral': 0.019944585859775543,
  'Mixed': 4.9112672968476545e-06}}

Parse the results into a dataframe

In [21]:
def parse_sentiment_batch(data):
    """Flatten a batch sentiment response into one row per record."""
    rows = []
    for line in data['ResultList']:
        row = dict(line['SentimentScore'])  # Positive, Negative, Neutral, Mixed
        # add the remaining top-level fields (Index, Sentiment)
        for field, value in line.items():
            if field != 'SentimentScore':
                row[field] = value
        rows.append(row)
    # DataFrame.append was removed in pandas 2.0, so build the frame in one step
    return pd.DataFrame(rows)

In [22]:
parse_sentiment_batch(sentiment_batch).head()

Unnamed: 0,Positive,Negative,Neutral,Mixed,Index,Sentiment
0,0.286408,0.00014,0.713446,6e-06,0,NEUTRAL
1,0.015133,0.361038,0.619455,0.004373,1,NEUTRAL
2,0.000617,0.991953,0.007428,2e-06,2,NEGATIVE
3,0.001871,0.74723,0.25088,1.9e-05,3,NEGATIVE
4,0.979983,6.8e-05,0.019945,5e-06,4,POSITIVE
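
The `Index` column refers back to the position of each record in the list that was sent. Batch responses also carry an `ErrorList` for records that could not be processed. The sketch below pairs each input text with its label and marks failed records with `None`; the function name, the sample texts, and the sample error entry are made up for illustration.

```python
def label_texts(texts, response):
    """Pair each input text with its sentiment, or None if it failed."""
    labels = {item['Index']: item['Sentiment'] for item in response['ResultList']}
    failed = {item['Index'] for item in response.get('ErrorList', [])}
    return [(text, None if i in failed else labels.get(i))
            for i, text in enumerate(texts)]

# Hand-built response in the same shape as a batch sentiment response
sample_response = {
    'ResultList': [{'Index': 0, 'Sentiment': 'POSITIVE'}],
    'ErrorList': [{'Index': 1, 'ErrorCode': 'INTERNAL_SERVER_ERROR',
                   'ErrorMessage': 'made-up error for illustration'}],
}
print(label_texts(['great book', 'second tweet'], sample_response))
# [('great book', 'POSITIVE'), ('second tweet', None)]
```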


#### Entities Detection

In [23]:
# Run entities analysis on a batch
entities_batch = comprehend.batch_detect_entities(TextList=tweets25, LanguageCode='en')

In [24]:
# Sample tweet
tweets25[0]

'https://t.co/L7HVYLQu9Q We\xa0look forward to connecting with you through the camera tonight at 7pm CT! \xa0 Here are the ways to watch: Watch at https://t.co/kcNTEKjuSU or via Facebook Live, Roku or Amazon Fire https://t.co/QdUkItoUlT'

In [25]:
# Entities in a sample tweet
pd.DataFrame(entities_batch['ResultList'][0]['Entities'])

Unnamed: 0,Score,Type,Text,BeginOffset,EndOffset
0,0.987823,OTHER,https://t.co/L7HVYLQu9Q,0,23
1,0.868062,DATE,tonight at,82,92
2,0.869554,DATE,7pm CT,93,99
3,0.992966,OTHER,https://t.co/kcNTEKjuSU,140,163
4,0.658535,TITLE,Facebook Live,171,184
5,0.956544,ORGANIZATION,Roku,186,190
6,0.966188,ORGANIZATION,Amazon,194,200
7,0.954848,COMMERCIAL_ITEM,Fire,201,205
8,0.991052,OTHER,https://t.co/QdUkItoUlT,206,229
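
`BeginOffset` and `EndOffset` index into the original text, and `Score` can be used to keep only the more confident entities. The sketch below uses a subset of the rows shown above; the 0.9 threshold and the choice to drop `OTHER` entities are arbitrary assumptions made for this example.

```python
# Entity records copied from the sample output above (subset of rows)
entities = [
    {'Score': 0.987823, 'Type': 'OTHER', 'Text': 'https://t.co/L7HVYLQu9Q',
     'BeginOffset': 0, 'EndOffset': 23},
    {'Score': 0.658535, 'Type': 'TITLE', 'Text': 'Facebook Live',
     'BeginOffset': 171, 'EndOffset': 184},
    {'Score': 0.956544, 'Type': 'ORGANIZATION', 'Text': 'Roku',
     'BeginOffset': 186, 'EndOffset': 190},
    {'Score': 0.966188, 'Type': 'ORGANIZATION', 'Text': 'Amazon',
     'BeginOffset': 194, 'EndOffset': 200},
    {'Score': 0.954848, 'Type': 'COMMERCIAL_ITEM', 'Text': 'Fire',
     'BeginOffset': 201, 'EndOffset': 205},
]

# Slicing the source text with the offsets recovers the entity text
tweet_start = 'https://t.co/L7HVYLQu9Q We'  # first characters of the sample tweet
first = entities[0]
print(tweet_start[first['BeginOffset']:first['EndOffset']] == first['Text'])  # True

# Keep confident entities and drop the catch-all OTHER type
min_score = 0.9
confident = [e['Text'] for e in entities
             if e['Score'] >= min_score and e['Type'] != 'OTHER']
print(confident)  # ['Roku', 'Amazon', 'Fire']
```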


Parse entities into a dataframe

In [26]:
# Function to parse the dictionary
def parse_entities_batch(data):
    nested_json = 'Entities'  # nested sub-dictionary to extract data from
    frames = []
    for line in data['ResultList']:
        dt_temp = pd.DataFrame(line[nested_json])  # one row per entity
        for field in line:  # add common fields (here: Index)
            if field != nested_json:
                dt_temp[field] = line[field]
        frames.append(dt_temp)
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [27]:
entities_batch_df = parse_entities_batch(entities_batch)
entities_batch_df.head(15)

Unnamed: 0,Score,Type,Text,BeginOffset,EndOffset,Index
0,0.987823,OTHER,https://t.co/L7HVYLQu9Q,0,23,0
1,0.868062,DATE,tonight at,82,92,0
2,0.869554,DATE,7pm CT,93,99,0
3,0.992966,OTHER,https://t.co/kcNTEKjuSU,140,163,0
4,0.658535,TITLE,Facebook Live,171,184,0
5,0.956544,ORGANIZATION,Roku,186,190,0
6,0.966188,ORGANIZATION,Amazon,194,200,0
7,0.954848,COMMERCIAL_ITEM,Fire,201,205,0
8,0.991052,OTHER,https://t.co/QdUkItoUlT,206,229,0
9,0.59791,PERSON,@JediLive,0,9,1


#### Key Phrases

In [28]:
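# Combine all 25 tweets into a single document.
# Note: joining with '' fuses the last word of one tweet with the first word
# of the next (see 'ONhaven' in the results); ' '.join(tweets25) avoids that.
tweet_dump = [''.join(tweets25)]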

In [29]:
# Run the Key Phrases batch
key_phrases_batch_output = comprehend.batch_detect_key_phrases(TextList=tweet_dump, LanguageCode='en')

In [30]:
# Results
pd.DataFrame(key_phrases_batch_output['ResultList'][0]['KeyPhrases'])

Unnamed: 0,Score,Text,BeginOffset,EndOffset
0,0.809662,the camera tonight,71,89
1,0.999997,7pm CT,93,99
2,0.999996,the ways,112,120
3,0.735539,Watch,131,136
4,0.949925,"Facebook Live, Roku or Amazon Fire https://t.c...",171,251
...,...,...,...,...
118,0.999983,the twilight movies,3984,4003
119,0.845206,their site BUT HULU ADDED THEM YOU,4008,4042
120,0.963395,KNOW WTF GOIN ONhaven,4051,4072
121,0.913137,amazon,4095,4101


## Scheduling an Analysis Job

I highly recommend that you run at least one Comprehend job from the point-and-click interface first, especially if you are new to AWS. That way a data access role gets created for you, and you can simply copy its ARN (the `data_access_role_arn` value) from the job description.

In [5]:
bucket_name  = 'comprehend-api'
local_file_name = 'Comprehend/amazon_tweets.csv'
aws_file_name = 'input-data/amazon_tweets.csv'
# Upload file to specific location
upload_file(local_file_name, bucket_name, aws_file_name)

True

In [6]:
# Set these values before running the program
input_s3_url = 's3://comprehend-api/input-data'
input_doc_format = 'ONE_DOC_PER_LINE'
output_s3_url = 's3://comprehend-api/results'
data_access_role_arn = "arn:aws:iam::XXXXXXXXX:role/service-role/YOUR_ROLE_NAME"

# Set up job configuration
input_data_config = {'S3Uri': input_s3_url, 'InputFormat': input_doc_format}
output_data_config = {'S3Uri': output_s3_url}

In [7]:
# Begin a job to detect entities in the document collection
comprehend = boto3.client('comprehend')
start_job_entities = comprehend.start_entities_detection_job(
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn,
    LanguageCode='en')
job_id = start_job_entities['JobId']
print(f'Started Entities Detection Job: {job_id}')

Started Entities Detection Job: 48a33dda99b62aec30273a0d757a3d90


In [12]:
# Retrieve information about the job
describe_result = comprehend.describe_entities_detection_job(JobId=job_id)
job_status = describe_result['EntitiesDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')
if job_status == 'FAILED':
    print(f'Reason: {describe_result["EntitiesDetectionJobProperties"]["Message"]}')

Job Status: COMPLETED
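
Instead of re-running the cell above by hand, the status check can be wrapped in a simple polling loop. This is only a sketch: `wait_for_entities_job`, `poll_seconds`, and `max_polls` are names invented here, and the defaults are arbitrary.

```python
import time

def wait_for_entities_job(client, job_id, poll_seconds=30, max_polls=120):
    """Poll an entities detection job until it reaches a final state."""
    for _ in range(max_polls):
        result = client.describe_entities_detection_job(JobId=job_id)
        status = result['EntitiesDetectionJobProperties']['JobStatus']
        # These three statuses mean the job will not change any more
        if status in ('COMPLETED', 'FAILED', 'STOPPED'):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f'Job {job_id} did not finish after {max_polls} polls')

# Usage with the client and job ID from the cells above:
# job_status = wait_for_entities_job(comprehend, job_id)
```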


In [15]:
# List all entities-detection jobs
list_result = comprehend.list_entities_detection_jobs()
for job in list_result['EntitiesDetectionJobPropertiesList']:
    print(f'Job ID: {job["JobId"]}, Status: {job["JobStatus"]}')

Job ID: 3332908855ffce7a71524805a6b9f6ad, Status: COMPLETED
Job ID: 54605b617b2e0c8fedc67962130dc3cc, Status: COMPLETED
Job ID: cd1b44d1a6f1db713f96298b5987c716, Status: COMPLETED
Job ID: 1a5a38d9a5c02ee15ed806675271a034, Status: COMPLETED
Job ID: cc651ab875e5e50eb476bfac7b424cb7, Status: COMPLETED
Job ID: 48a33dda99b62aec30273a0d757a3d90, Status: COMPLETED
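
A single list call returns one page of jobs. If the account has many jobs, the response includes a `NextToken` that can be passed back to fetch the next page. The generator below is a sketch of that loop; the name `iter_entities_jobs` is made up.

```python
def iter_entities_jobs(client):
    """Yield every entities detection job, following NextToken across pages."""
    kwargs = {}
    while True:
        page = client.list_entities_detection_jobs(**kwargs)
        yield from page['EntitiesDetectionJobPropertiesList']
        token = page.get('NextToken')
        if not token:
            break  # no more pages
        kwargs['NextToken'] = token

# Usage with the client from the cells above:
# for job in iter_entities_jobs(comprehend):
#     print(f'Job ID: {job["JobId"]}, Status: {job["JobStatus"]}')
```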



### Download and Process the Results

In [33]:
# job_id = '48a33dda99b62aec30273a0d757a3d90'
# Link to results file

entities_results_S3Url = comprehend.describe_entities_detection_job(
    JobId=job_id
)['EntitiesDetectionJobProperties']['OutputDataConfig']['S3Uri']

# S3 Uri
#entities_results_S3Url

If you don't know where your results are, get a listing of your bucket

In [34]:
# List all files and folders in the bucket
def s3_bucket_list_obj(bucket):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket)
    for obj in bucket.objects.all():
        print(obj.key)

In [36]:
# List all files
s3_bucket_list_obj(bucket_name)

Download the results

In [37]:
s3 = boto3.client('s3')
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = entities_results_S3Url.replace(s3_name, '')
local_results_filename = 'Comprehend/outputs/entities.tar.gz'

In [38]:
# Download results
s3.download_file(bucket_name,
                 results_aws_filename, 
                 local_results_filename)

In [39]:
# Unzip the file
import tarfile

def extract_targz(targz_file, output_path = ''):
    if targz_file.endswith("tar.gz"):
        tar = tarfile.open(targz_file, "r:gz")
        tar.extractall(path = output_path)
        tar.close()
    elif targz_file.endswith("tar"):
        tar = tarfile.open(targz_file, "r:")
        tar.extractall(path = output_path)
        tar.close()

In [40]:
output_path = 'Comprehend/outputs/extracted'
extract_targz(local_results_filename, output_path)

Open the results

In [41]:
# Read the output (one JSON document per line) into a list of dictionaries
input_file = output_path + '/output'
with open(input_file, 'r') as f:  # 'with' closes the file when done
    entities = [json.loads(line) for line in f]
len(entities)

10000
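
Each line of the output file is one JSON document with an `Entities` list plus the `File` and `Line` it came from. The sketch below parses two such lines from an in-memory string and flattens them into one row per entity. The first line reuses values from record 150 of this data set; the second line, with an empty entity list, is made up to show that such records produce no rows.

```python
import io
import json

# Two sample lines in the same shape as the job output
sample = io.StringIO(
    '{"Entities": [{"BeginOffset": 20, "EndOffset": 22, "Score": 0.621316, '
    '"Text": "US", "Type": "LOCATION"}], "File": "amazon_tweets.csv", "Line": 150}\n'
    '{"Entities": [], "File": "amazon_tweets.csv", "Line": 151}\n'
)

# Skip blank lines, parse the rest
records = [json.loads(line) for line in sample if line.strip()]

# One row per entity, carrying File and Line along
rows = [{**entity, 'File': rec['File'], 'Line': rec['Line']}
        for rec in records for entity in rec['Entities']]

print(len(records), len(rows))  # 2 1
print(rows[0]['Text'], rows[0]['Line'])  # US 150
```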

In [42]:
# This function parses entities data into a dataframe
def parse_entities(data):
    nested_json = 'Entities'  # nested sub-dictionary to extract data from
    frames = []
    for line in data:
        dt_temp = pd.DataFrame(line[nested_json])  # one row per entity
        for field in line:  # add common fields (File, Line)
            if field != nested_json:
                dt_temp[field] = line[field]
        frames.append(dt_temp)
    # DataFrame.append was removed in pandas 2.0; concatenate once instead
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [43]:
# Produce the dataframe - this may take some time if you have >1,000 records
entities_df = parse_entities(entities)

#### Record Check

In [44]:
# Load original data
df = pd.read_csv(local_file_name, header = None, names = ['amazon_tweets'], dtype = 'str')

In [47]:
record_no = 150
# Tweet text
print(df.loc[record_no].item())
# Results
entities_df.query('Line == @record_no')

A LIGHT LAST SEEN - US https://t.co/uKL2cS4ayd Jaynie Highsmith needs a happy ending but happy doesn't last forever. Will she repeat the same mistakes over &amp; over? #SecondChances #WomensFiction @Grace_Greene https://t.co/o43zSpujcP https://t.co/9F8uG20lfr


Unnamed: 0,BeginOffset,EndOffset,Score,Text,Type,File,Line
550,0.0,17.0,0.765808,A LIGHT LAST SEEN,TITLE,amazon_tweets.csv,150
551,20.0,22.0,0.621316,US,LOCATION,amazon_tweets.csv,150
552,23.0,46.0,0.995827,https://t.co/uKL2cS4ayd,OTHER,amazon_tweets.csv,150
553,47.0,63.0,0.999372,Jaynie Highsmith,PERSON,amazon_tweets.csv,150
554,198.0,211.0,0.572987,@Grace_Greene,PERSON,amazon_tweets.csv,150
555,212.0,235.0,0.994321,https://t.co/o43zSpujcP,OTHER,amazon_tweets.csv,150
556,236.0,259.0,0.989828,https://t.co/9F8uG20lfr,OTHER,amazon_tweets.csv,150
