# AWS Comprehend Sentiment Analysis Using Python

This notebook shows how to use boto3, the AWS SDK for Python, to call Amazon Comprehend for real-time sentiment analysis and for scheduling asynchronous analysis jobs.
1. Credentials. For boto3 to work, you need to create an IAM user, obtain an `aws_access_key_id` and `aws_secret_access_key`, and configure your credentials using the AWS Command Line Interface (AWS CLI).
2. Cost. If you are on the AWS free tier, you can analyze 50K units a month for free. Each unit is 100 characters. In my example, every tweet is ~2 units. In the scheduled job I am analyzing 10K tweets at once, so the free tier runs out pretty fast, and after that it's \$1 per 10K units. Be sure to check pricing before you proceed: https://aws.amazon.com/comprehend/pricing/
3. Reference. Boto3 Comprehend: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html  Boto3 S3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html
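
To get a feel for the cost point above before sending anything to AWS, here is a rough estimator. It is only a sketch: it assumes the figures quoted in this notebook (100 characters per unit, 50K free units, \$1 per 10K units), and `min_units_per_doc` is a hypothetical knob for any per-request minimum the pricing page may specify, so check the pricing page for the real numbers.

```python
import math

# Assumptions taken from the notes above, not from the live pricing page
CHARS_PER_UNIT = 100
PRICE_PER_UNIT = 1 / 10_000  # USD, i.e. $1 per 10K units


def estimate_units(texts, min_units_per_doc=1):
    """Estimate units for an iterable of strings, rounding each text up."""
    return sum(
        max(min_units_per_doc, math.ceil(len(text) / CHARS_PER_UNIT))
        for text in texts
    )


def estimate_cost(texts, free_units=50_000, min_units_per_doc=1):
    """Return (total_units, estimated_cost_in_usd) after the free allowance."""
    units = estimate_units(texts, min_units_per_doc)
    billable = max(0, units - free_units)
    return units, billable * PRICE_PER_UNIT


# Three made-up texts of 150, 80 and 201 characters
sample = ["x" * 150, "y" * 80, "z" * 201]
units, cost = estimate_cost(sample, free_units=0)
print(units, round(cost, 4))
```

Once the tweets are loaded, `estimate_cost(df.walmart_tweets)` gives a ballpark figure for the whole file.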

In [1]:
import boto3
from botocore.exceptions import ClientError
import pandas as pd
import json
import tarfile

In [2]:
link_to_data = "https://github.com/tanyazyabkina/AmazonComprehend/blob/master/walmart_tweets_1k.csv?raw=true"

local_file_name = 'Comprehend/walmart_1k.csv'

df = pd.read_csv(link_to_data, header = None, names = ['walmart_tweets'], dtype = 'str', encoding = 'utf-8') 
df.to_csv(local_file_name, encoding = 'utf-8', index = False, header = False)

In [3]:
df.head()

,walmart_tweets
0,Tony Hawk’s Pro Skater 1+2 (PS4) is $33.88 on ...
1,@CassieFambro we were just saying that yesterd...
2,@lxoG21 I love me some Walmart candles lol the...
3,I actually am too 🤔 need to go shopping. 24/7 ...
4,@diancalondon Bill was.....Sunday morning Khak...


## Real Time Single Record Processing

With this type of processing you can analyze a single piece of text up to 5K bytes long.
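
Since the limit is in bytes rather than characters, emoji and other non-ASCII characters count for more than one. The sketch below measures a string's UTF-8 size and trims it without cutting a character in half. `MAX_BYTES`, `utf8_size` and `truncate_utf8` are names I made up for illustration, and I am assuming the limit is 5,000 bytes.

```python
MAX_BYTES = 5000  # assumption: the "5K bytes" limit, taken as 5,000


def utf8_size(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def truncate_utf8(text, max_bytes=MAX_BYTES):
    """Cut text to at most max_bytes UTF-8 bytes, dropping any partial character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards the incomplete multi-byte sequence at the cut
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(utf8_size("a🤔"))         # the emoji takes 4 bytes, so this prints 5
print(truncate_utf8("a🤔", 3))  # the emoji does not fit, so this prints a
```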

In [4]:
# Record to examine
text = df.loc[3].item()
print(text)

I actually am too 🤔 need to go shopping. 24/7 Walmart come back pls https://t.co/BB5cXNbSzo


In [9]:
# Initialize the Comprehend client
comprehend = boto3.client(service_name='comprehend', region_name='us-east-2')

In [10]:
# Run sentiment analysis
sentiment_output = comprehend.detect_sentiment(Text=text, LanguageCode='en')
# Output
sentiment_output

{'Sentiment': 'NEUTRAL',
 'SentimentScore': {'Positive': 0.08862289786338806,
  'Negative': 0.09810954332351685,
  'Neutral': 0.8021741509437561,
  'Mixed': 0.011093364097177982},
 'ResponseMetadata': {'RequestId': '1704d62a-20c6-4782-b3c8-ae51a8436116',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '1704d62a-20c6-4782-b3c8-ae51a8436116',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '162',
   'date': 'Fri, 30 Apr 2021 20:28:06 GMT'},
  'RetryAttempts': 0}}

In [11]:
#Scores
sentiment_output['SentimentScore']

{'Positive': 0.08862289786338806,
 'Negative': 0.09810954332351685,
 'Neutral': 0.8021741509437561,
 'Mixed': 0.011093364097177982}

In [12]:
# Sentiment
sentiment_output['Sentiment']

'NEUTRAL'

## Real-Time Batch Processing

With this type of processing you can submit up to 25 documents of up to 5,000 bytes each in a single list.

In [13]:
# Create a batch
text_list = list(df.walmart_tweets[0:25])

#Run a sentiment batch
sentiment_batch = comprehend.batch_detect_sentiment(TextList=text_list,
                                                    LanguageCode='en')

Single record check

In [14]:
text_list[4]

'@diancalondon Bill was.....Sunday morning Khaki Walmart fly....in his own way...the heart wants what it wants. Yeah. Maybe the pickens are slim midwest? The only one I understood and felt bad for was Barb, because she felt she owed Bill for being there. https://t.co/BOCvIDvAmc'

In [15]:
sentiment_batch['ResultList'][4]

{'Index': 4,
 'Sentiment': 'NEUTRAL',
 'SentimentScore': {'Positive': 0.07533818483352661,
  'Negative': 0.30534008145332336,
  'Neutral': 0.6191905736923218,
  'Mixed': 0.0001311323867412284}}

Parse the results into a dataframe

In [16]:
def parse_sentiment_batch(data):
    # One row per result: score columns first, then the label, indexed by position
    results = data['ResultList']
    scores = pd.DataFrame([item['SentimentScore'] for item in results])
    scores['Sentiment'] = [item.get('Sentiment') for item in results]
    scores['Index'] = [item.get('Index') for item in results]
    return scores.set_index('Index')

In [17]:
parse_sentiment_batch(sentiment_batch).head()

Index,Positive,Negative,Neutral,Mixed,Sentiment
0,0.000804,7.5e-05,0.9991,2.2e-05,NEUTRAL
1,0.027284,0.519825,0.452641,0.00025,NEGATIVE
2,0.995531,9.4e-05,0.004349,2.6e-05,POSITIVE
3,0.088623,0.09811,0.802174,0.011093,NEUTRAL
4,0.075338,0.30534,0.619191,0.000131,NEUTRAL
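
If you have more than 25 documents and still want real-time results, you can send them in consecutive batches. Below is a minimal sketch of that idea, assuming `texts` is a plain Python list. `chunked` and `batch_sentiment_all` are hypothetical helpers, not part of boto3. They shift each `Index` so it refers to the position in the full list, and they collect anything reported in the response's `ErrorList` separately.

```python
def chunked(items, size=25):
    """Yield (offset, sublist) pairs of at most `size` items each."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def batch_sentiment_all(client, texts, language_code="en", batch_size=25):
    """Run batch_detect_sentiment over a list of any length.

    `client` is a boto3 Comprehend client. Returns (results, errors), where
    every item's 'Index' is its position in the original `texts` list.
    """
    results, errors = [], []
    for offset, chunk in chunked(texts, batch_size):
        response = client.batch_detect_sentiment(
            TextList=chunk, LanguageCode=language_code
        )
        for item in response.get("ResultList", []):
            results.append({**item, "Index": item["Index"] + offset})
        for item in response.get("ErrorList", []):
            errors.append({**item, "Index": item["Index"] + offset})
    return sorted(results, key=lambda item: item["Index"]), errors


# Usage (makes real, billable AWS calls):
# results, errors = batch_sentiment_all(comprehend, list(df.walmart_tweets[0:100]))
# parse_sentiment_batch({'ResultList': results}).head()
```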


## Asynchronous Processing - Scheduling an Analysis Job

I highly recommend that you run at least one Comprehend job from the point-and-click console, especially if you are new to AWS. This way you can create a data access role (passed to the API as `data_access_role_arn`), and then you can simply copy the role ARN from the job description.

You will need to create your S3 bucket through the web interface or through the boto3 API.

Note that I use different folders, `s3://comprehend-api/input-data` and `s3://comprehend-api/results`, for input data and results output. This way, your results will not be mistaken for inputs if you analyze all files in a folder.
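
If you go the boto3 route for the bucket, one detail is easy to trip over: outside `us-east-1`, `create_bucket` needs a `CreateBucketConfiguration` with the region as `LocationConstraint`, while in `us-east-1` that argument has to be left out. The helper below is a hypothetical sketch that only builds the arguments. The bucket name and region in the usage comment are the ones used in this notebook.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        # Every region other than us-east-1 must be named explicitly
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


print(create_bucket_kwargs("comprehend-api", "us-east-2"))

# Usage (makes a real AWS call, and bucket names must be globally unique):
# import boto3
# s3 = boto3.client("s3", region_name="us-east-2")
# s3.create_bucket(**create_bucket_kwargs("comprehend-api", "us-east-2"))
```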

#### Upload the data to S3

In [18]:
local_file_name = 'Comprehend/walmart_1k.csv'
bucket_name  = 'comprehend-api'
aws_file_name = 'input-data/walmart_1k.csv'

s3 = boto3.client('s3')

# Upload file to specific location
s3.upload_file(local_file_name, bucket_name, aws_file_name)

#### Configure Sentiment Detection Job

In [19]:
# Set these values before running the program
input_s3_url = 's3://comprehend-api/input-data/walmart_1k.csv' #folder, file or prefix
input_doc_format = 'ONE_DOC_PER_LINE'
output_s3_url = 's3://comprehend-api/results'
data_access_role_arn = "arn:aws:iam::YOUR_ACCOUNT_NUMBER:role/service-role/YOUR_SERVICE_ROLE"

# Set up job configuration
input_data_config = {'S3Uri': input_s3_url, 'InputFormat': input_doc_format}
output_data_config = {'S3Uri': output_s3_url}

#### Run the job

In [20]:
# Start the client
comprehend = boto3.client('comprehend')

# Begin a job to detect sentiment in the document collection
start_job_sentiment = comprehend.start_sentiment_detection_job(
    InputDataConfig=input_data_config,
    OutputDataConfig=output_data_config,
    DataAccessRoleArn=data_access_role_arn,
    LanguageCode='en',
    JobName='Walmart_1K_tweets')
job_id = start_job_sentiment['JobId']
print(f'Your Sentiment Detection JobID is: {job_id}')

Your Sentiment Detection JobID is: b6a7626c25c4c5ffcf558901743eb8ff


Your job runs asynchronously. **This may take several minutes to run.**
When the status turns to COMPLETED, you can retrieve the results.

In [27]:
# Retrieve information about the job - the job may take a while to run
describe_result = comprehend.describe_sentiment_detection_job(JobId=job_id)
job_status = describe_result['SentimentDetectionJobProperties']['JobStatus']
print(f'Job Status: {job_status}')
if job_status == 'FAILED':
    print(f'Reason: {describe_result["SentimentDetectionJobProperties"]["Message"]}')

Job Status: IN_PROGRESS
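
Instead of re-running the status cell by hand, you can poll until the job finishes. This is a minimal sketch, not an official waiter: `wait_for_sentiment_job` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and I am assuming `COMPLETED`, `FAILED` and `STOPPED` are the statuses after which nothing changes any more.

```python
import time

# Statuses after which the job is assumed not to change any more
TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}


def wait_for_sentiment_job(client, job_id, poll_seconds=30,
                           timeout_seconds=3600, sleep=time.sleep):
    """Poll describe_sentiment_detection_job until the job finishes.

    Returns the SentimentDetectionJobProperties dict of the finished job.
    Raises TimeoutError if the job is still running after timeout_seconds.
    """
    waited = 0
    while True:
        response = client.describe_sentiment_detection_job(JobId=job_id)
        props = response["SentimentDetectionJobProperties"]
        if props["JobStatus"] in TERMINAL_STATES:
            return props
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Job {job_id} is still {props['JobStatus']} after {waited}s"
            )
        sleep(poll_seconds)
        waited += poll_seconds


# Usage (calls AWS repeatedly until the job is done):
# props = wait_for_sentiment_job(comprehend, job_id)
# print(props["JobStatus"], props.get("Message", ""))
```

The `sleep` argument is there only so the loop can be exercised without actually waiting.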


#### Download the Results

In [None]:
#  The results are located here
results_S3Url = comprehend.describe_sentiment_detection_job(
    JobId=job_id)['SentimentDetectionJobProperties']['OutputDataConfig']['S3Uri']

# Your Output S3 Url
results_S3Url

If you don't know where your results are, list the contents of your bucket.

In [35]:
# List all files and folders in the bucket
def s3_bucket_list_obj(bucket):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(bucket)
    for obj in bucket.objects.all():
        print(obj.key)

In [None]:
# List all files
s3_bucket_list_obj(bucket_name)

Download the results

In [31]:
# Give your local results file a name
results_name = 'sentiment'

local_results_filename = 'Comprehend/outputs/' + results_name + '.tar.gz'
s3_name = 's3://' + bucket_name + '/'
results_aws_filename = results_S3Url.replace(s3_name, '')

# Download results
s3 = boto3.client('s3')
s3.download_file(bucket_name,
                 results_aws_filename, 
                 local_results_filename)
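
The string `replace` above works as long as the output URL starts with exactly `s3://` plus `bucket_name`. A slightly more defensive alternative is to split the URI into bucket and key with the standard library. `split_s3_uri` is a hypothetical helper, and the URI in the example is a made-up placeholder, not a real job output path.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only
bucket, key = split_s3_uri("s3://comprehend-api/results/EXAMPLE/output/output.tar.gz")
print(bucket)
print(key)

# Usage with the values from the cells above:
# bucket, key = split_s3_uri(results_S3Url)
# s3.download_file(bucket, key, local_results_filename)
```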

#### Process the results

In [32]:
# Extract a .tar.gz or .tar archive into output_path
def extract_targz(targz_file, output_path = ''):
    if targz_file.endswith("tar.gz"):
        mode = "r:gz"
    elif targz_file.endswith("tar"):
        mode = "r:"
    else:
        return
    # The context manager closes the archive even if extraction fails
    with tarfile.open(targz_file, mode) as tar:
        tar.extractall(path = output_path)

In [33]:
# Extracts a file called 'output' into output_path
results_name = 'sentiment'
local_results_filename = 'Comprehend/outputs/' + results_name + '.tar.gz'
output_path = 'Comprehend/outputs/extracted' 
extract_targz(local_results_filename, output_path)

Read the results

In [34]:
# Read the JSON Lines output into a list of dictionaries
input_file = output_path + '/output'
with open(input_file, 'r', encoding = 'utf-8') as f:
    results = [json.loads(line) for line in f]
print('Number of records in the output:',len(results))

Number of records in the output: 1000


This is what the output looks like. Note that the records are often not in the same order in which they were sent.

In [36]:
results[0]

{'File': 'walmart_1k.csv',
 'Line': 2,
 'Sentiment': 'POSITIVE',
 'SentimentScore': {'Mixed': 2.5774566893232986e-05,
  'Negative': 9.411451173946261e-05,
  'Neutral': 0.00434883451089263,
  'Positive': 0.9955312609672546}}
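
Because the order is not guaranteed, it can be worth checking that every input line came back before sorting. Here is a small sketch of such a check. It assumes `Line` is zero-based, which is what the output here suggests, and `missing_lines` is a hypothetical helper.

```python
def missing_lines(records, expected_count):
    """Return the sorted line numbers in range(expected_count) with no result."""
    seen = {record["Line"] for record in records if "Line" in record}
    return sorted(set(range(expected_count)) - seen)


# Toy example: results for lines 2 and 0 arrived, line 1 did not
toy_results = [{"Line": 2, "Sentiment": "POSITIVE"},
               {"Line": 0, "Sentiment": "NEUTRAL"}]
print(missing_lines(toy_results, 3))  # prints [1]

# Usage with the job output read above:
# missing_lines(results, len(df))
```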

In [37]:
# This function parses sentiment data into a dataframe indexed by line number
def parse_sentiment(data):
    parsed = pd.DataFrame([item['SentimentScore'] for item in data])
    parsed['File'] = [item.get('File') for item in data]
    parsed['Sentiment'] = [item.get('Sentiment') for item in data]
    parsed['Line'] = [item.get('Line') for item in data]
    return parsed.set_index('Line')

In [38]:
# Produce the dataframe
# Let's sort and index the dataframe by line
sentiment_results = parse_sentiment(results).sort_index()

In [39]:
sentiment_results.head()

Line,Mixed,Negative,Neutral,Positive,File,Sentiment
0,2.2e-05,7.5e-05,0.9991,0.000804,walmart_1k.csv,NEUTRAL
1,0.000151,0.542811,0.416986,0.040052,walmart_1k.csv,NEGATIVE
2,2.6e-05,9.4e-05,0.004349,0.995531,walmart_1k.csv,POSITIVE
3,0.011093,0.09811,0.802174,0.088623,walmart_1k.csv,NEUTRAL
4,0.000217,0.35373,0.539221,0.106831,walmart_1k.csv,NEUTRAL


#### Results Validation

Let's take a look at a record and validate that the results were similar (they won't be exactly the same), and that we sorted the dataframe correctly.

In [40]:
record_no = 235
# Tweet text
print('TWEET TEXT:\n', 
      df.loc[record_no].item())
# Real Time Results
print('\nREAL TIME RESULTS:\n') 
print(comprehend.detect_sentiment(Text=df.loc[record_no].item(), LanguageCode='en')['SentimentScore'])
# Job Results
print('\nASYNCHRONOUS RESULTS:')
print(sentiment_results.loc[record_no])

TWEET TEXT:
 life was so much simpler when i would go to walmart with my mom and play plants vs zombies on all the display computers while she grocery shopped 😔😔😔

REAL TIME RESULTS:

{'Positive': 0.5083320140838623, 'Negative': 0.0402490496635437, 'Neutral': 0.34519028663635254, 'Mixed': 0.10622867196798325}

ASYNCHRONOUS RESULTS:
Mixed              0.106229
Negative           0.040249
Neutral             0.34519
Positive           0.508332
File         walmart_1k.csv
Sentiment          POSITIVE
Name: 235, dtype: object
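
Eyeballing one record works, but the same comparison can be done with a tolerance. The sketch below compares two score dictionaries key by key. `scores_match` and the `abs_tol` default are my own choices for illustration, and the numbers are the ones printed in this notebook (the asynchronous ones as rounded in the dataframe display).

```python
import math


def scores_match(a, b, abs_tol=1e-4):
    """True if both dicts have the same keys and every score is within abs_tol."""
    return a.keys() == b.keys() and all(
        math.isclose(a[key], b[key], abs_tol=abs_tol) for key in a
    )


# Record 235: real-time scores vs. asynchronous scores
real_time = {'Positive': 0.5083320140838623, 'Negative': 0.0402490496635437,
             'Neutral': 0.34519028663635254, 'Mixed': 0.10622867196798325}
job = {'Positive': 0.508332, 'Negative': 0.040249,
       'Neutral': 0.34519, 'Mixed': 0.106229}
print(scores_match(real_time, job))  # True

# Record 1: batch scores vs. asynchronous scores, which differ noticeably
batch_1 = {'Positive': 0.027284, 'Negative': 0.519825,
           'Neutral': 0.452641, 'Mixed': 0.00025}
job_1 = {'Positive': 0.040052, 'Negative': 0.542811,
         'Neutral': 0.416986, 'Mixed': 0.000151}
print(scores_match(batch_1, job_1))  # False
```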


#### Save the results file

In [41]:
output_name = 'Comprehend/outputs/sentiment_results_walmart_1k.xlsx'
sentiment_results['Text'] = df.walmart_tweets
sentiment_results.to_excel(output_name, engine = 'xlsxwriter')