### Amazon Comprehend Demo

***
Copyright [2017]-[2017] Amazon.com, Inc. or its affiliates. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance with the License. A copy of the License is located at

http://aws.amazon.com/apache2.0/

or in the "license" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
***

### Prerequisites:

#### Identity and Access Management

The user or role that runs this notebook must have AWS Identity and Access Management (IAM) permissions for the API calls it makes. AWS provides managed policies that help you get started quickly. For the Amazon Comprehend calls in this example, apply the following managed policy to your user or role:

    ComprehendReadOnly

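The notebook also calls Amazon Polly to synthesize speech, so the same user or role needs permission for the `polly:SynthesizeSpeech` action. For production implementations, follow the AWS IAM best practices, which are out of scope for this workshop.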
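
For readers who prefer a narrower grant than a managed policy, the boto3 calls made later in this notebook map to four IAM actions. The sketch below builds such a policy document as plain JSON. The `build_policy` helper and the `"Resource": "*"` scope are assumptions for illustration, not a reviewed production policy.

```python
import json

# IAM actions matching the boto3 calls used later in this notebook.
NOTEBOOK_ACTIONS = [
    "comprehend:DetectSentiment",
    "comprehend:BatchDetectKeyPhrases",
    "comprehend:BatchDetectEntities",
    "polly:SynthesizeSpeech",
]


def build_policy(actions):
    """Return an IAM policy document (as a dict) allowing the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                # Assumption: the actions are granted on all resources.
                "Resource": "*",
            }
        ],
    }


policy_json = json.dumps(build_policy(NOTEBOOK_ACTIONS), indent=2)
print(policy_json)
```

The printed JSON can be pasted into the IAM console as an inline policy; adjust the action list if you add or remove calls.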

In [None]:
# download review dataset

#!curl -O http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz

In [1]:
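import random
def generate_slug():
    # 32 random bits as 8 zero-padded hex digits; '{:6x}' could pad with
    # spaces, which would end up in the output file name
    value = random.getrandbits(32)
    return '{:08x}'.format(value)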

# Polly voice IDs that can be passed as voice_id: Nicole, Enrique, Tatyana, Carmen, Lotte, Russell, Geraint, Mads, Penelope, Joanna, Matthew, Brian, Seoyeon, Maxim, Ricardo, Ruben, Giorgio, Carla, Naja, Astrid, Maja, Ivy, Chantal, Kimberly, Amy, Vicki, Marlene, Ewa, Conchita, Karl, Mathieu, Miguel, Justin, Jacek, Takumi, Ines, Cristiano, Gwyneth, Mizuki, Celine, Jan, Liv, Joey, Filiz, Dora, Raveena, Aditi, Salli, Vitoria, Emma, Hans, Kendra
def get_polly_as_mp3(message='hello', lang='en-US', text_type='ssml', voice_id='Amy'):
    # Escape &, < and > so that review text cannot break the SSML markup
    from xml.sax.saxutils import escape
    message_formatted = '<speak><lang xml:lang="{}">{}</lang></speak>'.format(lang, escape(message))
    #pprint('Calling Polly: {}'.format(message_formatted))
    response = polly.synthesize_speech(
        Text=message_formatted,
        TextType=text_type,
        OutputFormat="mp3",                                           
        VoiceId=voice_id
    )

    #pprint (response)
     
    outfile = "pollyresponse_{}.mp3".format(generate_slug())
    data = response['AudioStream'].read()

    with open(outfile, 'wb') as f:
        f.write(data)
    return outfile

def markdown_text(i,text):
    pprint(text)
    result = Markdown("""

### Review #{}

> ## {}

### Sentiment: {}

- Positive: {:2.2f}%
- Negative: {:2.2f}%
- Neutral: {:2.2f}%
- Mixed: {:2.2f}%

---
"""
.format(i+1,text[0],text[2],text[3]*100,text[4]*100, text[5]*100,text[6]*100)
)

    return result
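
`markdown_text` reads its fields by position (`text[0]`, `text[2]`, and so on), which is hard to follow. Below is a minimal sketch of a named record built from a Comprehend-style sentiment response. `ReviewSentiment`, `from_response` and `summary_line` are hypothetical names, and the sample values are made up to imitate the shape of the `detect_sentiment` response used in this notebook.

```python
from collections import namedtuple

# Field order matches the rows appended to `out` in this notebook.
ReviewSentiment = namedtuple(
    'ReviewSentiment',
    ['text', 'asin', 'sentiment', 'positive', 'negative', 'neutral', 'mixed'])


def from_response(review_text, asin, response):
    """Build a record from a detect_sentiment-style response dict."""
    scores = response['SentimentScore']
    return ReviewSentiment(
        review_text, asin, response['Sentiment'],
        scores['Positive'], scores['Negative'],
        scores['Neutral'], scores['Mixed'])


def summary_line(record):
    """Format the label and two of the scores as percentages."""
    return '{}: positive {:.2f}%, negative {:.2f}%'.format(
        record.sentiment, record.positive * 100, record.negative * 100)


# Made-up sample values, not real Comprehend output.
sample_response = {
    'Sentiment': 'NEGATIVE',
    'SentimentScore': {'Positive': 0.25, 'Negative': 0.5,
                       'Neutral': 0.125, 'Mixed': 0.125},
}
record = from_response('sample text', 'B000EXAMPLE', sample_response)
print(summary_line(record))
```

A namedtuple is still a tuple, so positional access such as `record[0]` and `record[2]` keeps working with the existing `markdown_text`.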


In [2]:
import boto3
import gzip
import json
import IPython
from IPython.display import HTML, Audio, display, Markdown
from pprint import pprint

comprehend = boto3.client('comprehend', region_name='us-east-1')
polly = boto3.client('polly', region_name='us-east-1')

In [3]:
filename = 'reviews_Amazon_Instant_Video_5.json.gz'
f = gzip.open(filename, 'r') 
out = [] 
x = 5 # x is decremented before the check below, so only the first 4 entries are processed
for line in f: 
    x -= 1
    if x == 0:
        break
    review = json.loads(line)
    # get sentiment for reviewText
    reviewText = review['reviewText']
    if len(reviewText.encode('utf-8')) > 5000: # the limit is in bytes, so measure the UTF-8 encoding; skip longer entries
        print ('Skipping: %s' % reviewText)
    else:
        textSentiment = comprehend.detect_sentiment(
                            Text=reviewText,
                            LanguageCode='en'
                            )

        scores = textSentiment['SentimentScore']
        out.append([
            review['reviewText'],
            review['asin'],
            textSentiment['Sentiment'],
            scores['Positive'],
            scores['Negative'],
            scores['Neutral'],
            scores['Mixed'],
        ])
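
A countdown variable like `x` above is easy to get off by one. Below is a minimal sketch of the same read using `itertools.islice`, which takes exactly the first `limit` lines. `read_reviews` and its `limit` parameter are hypothetical names, not part of the original notebook.

```python
import gzip
import itertools
import json


def read_reviews(path, limit):
    """Return the first `limit` review dicts from a gzipped JSON-lines file."""
    # 'rt' decodes the stream to text; the with-block closes the file.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        return [json.loads(line) for line in itertools.islice(f, limit)]
```

For example, `read_reviews('reviews_Amazon_Instant_Video_5.json.gz', 5)` would return five reviews, or fewer if the file is shorter.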

In [4]:
entries = 5
output_text = []
pollyfile = []
for i in range(0,entries):
    output_text.append(markdown_text(i,out[i]))
    pollyfile.append(get_polly_as_mp3(out[i][0]))

['I had big expectations because I love English TV, in particular '
 "Investigative and detective stuff but this guy is really boring. It didn't "
 'appeal to me at all.',
 'B000H00VBQ',
 'NEGATIVE',
 0.019105231389403343,
 0.8179743885993958,
 0.021709030494093895,
 0.14121128618717194]
['I highly recommend this series. It is a must for anyone who is yearning to '
 'watch "grown up" television. Complex characters and plots to keep one '
 'totally involved. Thank you Amazin Prime.',
 'B000H00VBQ',
 'POSITIVE',
 0.9963352680206299,
 0.00014676454884465784,
 0.0024864415172487497,
 0.0010313879465684295]
["This one is a real snoozer. Don't believe anything you read or hear, it's "
 'awful. I had no idea what the title means. Neither will you.',
 'B000H00VBQ',
 'NEGATIVE',
 0.038532987236976624,
 0.828794538974762,
 0.09235388040542603,
 0.040318556129932404]
['Mysteries are interesting.  The tension between Robson and the tall blond is '
 'good but not always believable.  She often seeme

IndexError: list index out of range
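
The `IndexError` comes from the loading loop: `x` starts at 5 and is decremented before the `x == 0` check, so the loop stops after four reviews while this cell asks for five. Below is a minimal sketch that reproduces the count with stand-in data and bounds the loop by the list length. `take_with_countdown` is a hypothetical helper that mirrors only the countdown logic.

```python
def take_with_countdown(lines, start):
    """Mirror the notebook's countdown: decrement first, then test for zero."""
    taken = []
    x = start
    for line in lines:
        x -= 1
        if x == 0:
            break
        taken.append(line)
    return taken


# Stand-in data instead of real reviews.
sample_out = take_with_countdown(['review %d' % n for n in range(100)], 5)
print(len(sample_out))  # 4

# Never index past the end of the list.
entries = min(5, len(sample_out))
for i in range(entries):
    print(i, sample_out[i])
```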

In [5]:
display(output_text[0])
Audio(pollyfile[0])




### Review #1

> ## I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.

### Sentiment: NEGATIVE

- Positive: 1.91%
- Negative: 81.80%
- Neutral: 2.17%
- Mixed: 14.12%

---


In [6]:
display(output_text[2])
Audio(pollyfile[2])



### Review #3

> ## This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.

### Sentiment: NEGATIVE

- Positive: 3.85%
- Negative: 82.88%
- Neutral: 9.24%
- Mixed: 4.03%

---


In [7]:
display(output_text[3])
Audio(pollyfile[3])



### Review #4

> ## Mysteries are interesting.  The tension between Robson and the tall blond is good but not always believable.  She often seemed uncomfortable.

### Sentiment: MIXED

- Positive: 21.44%
- Negative: 4.95%
- Neutral: 2.31%
- Mixed: 71.30%

---


In [None]:
import csv
# newline='' lets the csv module control line endings itself
with open('sentiment.csv', 'w', newline='') as csvfile:
    linewriter = csv.writer(csvfile, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL)
    linewriter.writerow(['review', 'asin', 'Sentiment', 'Positive', 'Negative', 'Neutral', 'Mixed'])
    for row in out:
        linewriter.writerow(row)

In [8]:
!cat ./sentiment.csv

review;asin;Sentiment;Positive;Negative;Neutral;Mixed
I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.;B000H00VBQ;NEGATIVE;0.019105231389403343;0.8179743885993958;0.021709030494093895;0.14121128618717194
I highly recommend this series. It is a must for anyone who is yearning to watch "grown up" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.;B000H00VBQ;POSITIVE;0.9963352680206299;0.00014676454884465784;0.0024864415172487497;0.0010313879465684295
This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.;B000H00VBQ;NEGATIVE;0.038532987236976624;0.828794538974762;0.09235388040542603;0.040318556129932404
Mysteries are interesting.  The tension between Robson and the tall blond is good but not always believable.  She often seemed uncomfortable.;B000H00
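
Because the file uses `;` as the delimiter and `|` as the quote character, a reader has to be configured the same way. Below is a minimal round-trip sketch with an in-memory buffer. The `write_rows` and `read_rows` helpers and the two sample rows are made up for illustration.

```python
import csv
import io


def write_rows(rows):
    """Serialize rows with the same dialect settings as the notebook."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, delimiter=';', quotechar='|',
                        quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buffer.getvalue()


def read_rows(text):
    """Parse text written by write_rows back into lists of strings."""
    reader = csv.reader(io.StringIO(text, newline=''),
                        delimiter=';', quotechar='|')
    return [row for row in reader]


# Made-up rows; the second field contains the delimiter on purpose.
rows = [['review', 'Sentiment'], ['Good; not great', 'MIXED']]
csv_text = write_rows(rows)
print(csv_text)
print(read_rows(csv_text))
```

A field that contains `;` is wrapped in `|...|` on output and restored on input. Note that numeric scores come back as strings and need converting with `float`.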

In [9]:
key_phrases = comprehend.batch_detect_key_phrases(
                            TextList=[x[0] for x in out[:25]],
                            LanguageCode='en'
                            )
pprint(key_phrases['ResultList'])

[{'Index': 0,
  'KeyPhrases': [{'BeginOffset': 6,
                  'EndOffset': 22,
                  'Score': 0.9991924166679382,
                  'Text': 'big expectations'},
                 {'BeginOffset': 38,
                  'EndOffset': 48,
                  'Score': 0.9947996139526367,
                  'Text': 'English TV'},
                 {'BeginOffset': 53,
                  'EndOffset': 97,
                  'Score': 0.9266034364700317,
                  'Text': 'particular Investigative and detective stuff'},
                 {'BeginOffset': 102,
                  'EndOffset': 110,
                  'Score': 0.9989846348762512,
                  'Text': 'this guy'}]},
 {'Index': 1,
  'KeyPhrases': [{'BeginOffset': 19,
                  'EndOffset': 30,
                  'Score': 0.9816858172416687,
                  'Text': 'this series'},
                 {'BeginOffset': 92,
                  'EndOffset': 102,
                  'Score': 0.909174919128418,
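
Each `KeyPhrases` entry carries character offsets into the submitted text, so `text[BeginOffset:EndOffset]` returns the phrase itself. The batch operations also accept at most 25 documents per call, which is presumably why the cell slices `out[:25]`. Below is a minimal sketch of both points, using entries copied from the output above and a hypothetical `chunks` helper. No AWS call is made.

```python
def chunks(items, size=25):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def phrases_by_offset(text, result):
    """Slice each key phrase out of `text` using its offsets."""
    return [text[p['BeginOffset']:p['EndOffset']]
            for p in result['KeyPhrases']]


text = ("I had big expectations because I love English TV, in particular "
        "Investigative and detective stuff but this guy is really boring. "
        "It didn't appeal to me at all.")
# Abridged from the first result entry shown above.
result = {'Index': 0, 'KeyPhrases': [
    {'BeginOffset': 6, 'EndOffset': 22, 'Text': 'big expectations'},
    {'BeginOffset': 38, 'EndOffset': 48, 'Text': 'English TV'},
]}

print(phrases_by_offset(text, result))
print([len(c) for c in chunks(list(range(60)))])
```

To process more than 25 reviews, one option is to loop over `chunks(texts)` and call `batch_detect_key_phrases` once per chunk. The `Index` values then restart at 0 in every response.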
           

In [10]:
entities = comprehend.batch_detect_entities(
                            TextList=[x[0] for x in out[:5]],
                            LanguageCode='en'
                            )

In [11]:
pprint(entities['ResultList'])

[{'Entities': [{'BeginOffset': 38,
                'EndOffset': 45,
                'Score': 0.9055940508842468,
                'Text': 'English',
                'Type': 'OTHER'}],
  'Index': 0},
 {'Entities': [{'BeginOffset': 173,
                'EndOffset': 185,
                'Score': 0.6836399435997009,
                'Text': 'Amazin Prime',
                'Type': 'TITLE'}],
  'Index': 1},
 {'Entities': [], 'Index': 2},
 {'Entities': [{'BeginOffset': 48,
                'EndOffset': 54,
                'Score': 0.9408263564109802,
                'Text': 'Robson',
                'Type': 'PERSON'}],
  'Index': 3}]
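
Each result is tied to its input by `Index`, and a review with no detected entities still appears, with an empty list. Below is a minimal sketch that groups entity texts by `Type` and drops low-scoring ones. `entities_by_type` and the 0.7 threshold are hypothetical, and the sample is abridged from the output above.

```python
from collections import defaultdict


def entities_by_type(result_list, min_score=0.0):
    """Map entity type -> list of (input index, entity text)."""
    grouped = defaultdict(list)
    for result in result_list:
        for entity in result['Entities']:
            if entity['Score'] >= min_score:
                grouped[entity['Type']].append(
                    (result['Index'], entity['Text']))
    return dict(grouped)


# Abridged from the ResultList printed above (offsets omitted).
sample = [
    {'Entities': [{'Score': 0.9055940508842468, 'Text': 'English',
                   'Type': 'OTHER'}], 'Index': 0},
    {'Entities': [{'Score': 0.6836399435997009, 'Text': 'Amazin Prime',
                   'Type': 'TITLE'}], 'Index': 1},
    {'Entities': [], 'Index': 2},
    {'Entities': [{'Score': 0.9408263564109802, 'Text': 'Robson',
                   'Type': 'PERSON'}], 'Index': 3},
]

print(entities_by_type(sample, min_score=0.7))
```

With the 0.7 threshold, the misspelled "Amazin Prime" entity (score about 0.68) is filtered out, while "English" and "Robson" remain.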
