# Sentiment Analysis Pipeline

### Part 1: Subcategory Sentiment Analysis

Import required libraries:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import datetime
from datetime import date, time
from dateutil.parser import parse

import json
import pickle
%matplotlib inline

Import the IBM Watson libraries and add API credentials. Note: real credentials are not included here because this is a public repository.

In [2]:
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import Features, KeywordsOptions, EntitiesOptions, EmotionOptions, SentimentOptions

ibm_pass = "x"
ibm_user = "x"
ibm = {'u':ibm_user, 'p':ibm_pass}

`getSentiment` makes an API request and parses the document-level sentiment score from the response.

`getNLU` computes the number of credits a request uses from its character count: one credit per 10,000 characters, rounded up.

In [3]:
from watson_developer_cloud import WatsonApiException

def getSentiment(text):
    # Only texts longer than 20 characters are sent to the API; shorter ones get NaN.
    if len(text) > 20:
        natural_language_understanding = NaturalLanguageUnderstandingV1(
          username=ibm['u'],
          password=ibm['p'],
          version="2017-02-27")
        try:
            response = natural_language_understanding.analyze(text=text,features=Features(sentiment=SentimentOptions(document=True)))
            report = response["sentiment"]["document"]["score"]
        except WatsonApiException:
            # Mark failed requests so they can be identified later.
            report = "Error"
    else:
        report = np.nan
    return report

def getNLU(x):
    return (x if x % 10000 == 0 else x + 10000 - x % 10000)/10000
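
`getNLU` rounds the character count up to the next multiple of 10,000 and divides by 10,000, which is a ceiling division. A minimal sketch of the same arithmetic follows, assuming (as the function implies) that one credit covers up to 10,000 characters. The name `nlu_units` and its `chars_per_unit` parameter are hypothetical and used for illustration only.

```python
import math

def nlu_units(char_count, chars_per_unit=10000):
    """Ceiling of char_count / chars_per_unit, returned as a float like getNLU."""
    return float(math.ceil(char_count / chars_per_unit))

# A short review, an exact multiple, one character over, and a longer text.
for n in (29, 10000, 10001, 25000):
    print(n, nlu_units(n))
```

Sorting by the credit count, as done later in this notebook, puts the cheapest requests first.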

Test that the `getSentiment` function works.

In [4]:
combined = getSentiment("This camera worked qute well, I am really happy with its image quality and eas-of-use.")
combined

0.953814

Loading our data

In [5]:
df = pd.read_pickle("Electronics_meta.pickle")
df.head(1)

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime | year | length_review | title | brand | price | sub_category_0 | sub_category_1 | sales_category | sales_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AO94DHGC771SJ | 528881469 | amazdnu | [0, 0] | We got this GPS for my husband who is an (OTR)... | 5.0 | Gotta have GPS! | 1370131200 | 2013-06-02 | 2013 | 156 | Rand McNally 528881469 7-inch Intelliroute TND... | | 299.99 | Electronics | GPS & Navigation | unkown | 0 |


Process the data to group reviews by subcategory and date, i.e. we want to see which reviews were written for products in a particular subcategory on each day.

In [6]:
df = df[['sub_category_1','reviewTime','reviewText']]
df = df.groupby(['reviewTime','sub_category_1']).agg(lambda x: ". ".join(x.tolist()))
df['charlen']=list(map(lambda x:len(x),df.reviewText))
df = df[df.charlen>20]
df.head(1)

| reviewTime | sub_category_1 | reviewText | charlen |
|---|---|---|---|
| 1999-06-13 | Portable Audio & Video | The RIO rocks! It is so great that Diamond Mul... | 507 |
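
To make the grouping step concrete, here is a small sketch on made-up data (the three toy reviews are assumptions, not rows from the dataset). It uses the same `groupby(...).agg(...)` pattern as the cell above: every (date, subcategory) pair becomes one row, and that day's reviews are joined into a single string.

```python
import pandas as pd

# Toy data standing in for the review table (values are made up).
toy = pd.DataFrame({
    "reviewTime": ["2014-07-03", "2014-07-03", "2014-07-04"],
    "sub_category_1": ["Cases", "Cases", "Cases"],
    "reviewText": ["Fits well", "Nice feel", "Too tight"],
})

# One row per (date, subcategory); the reviews in each group are joined with ". ".
grouped = toy.groupby(["reviewTime", "sub_category_1"]).agg(lambda x: ". ".join(x.tolist()))

# Character length of the combined text, as used for the charlen filter.
grouped["charlen"] = grouped["reviewText"].str.len()
print(grouped)
```

The two reviews from 2014-07-03 end up in one combined text, so a later sentiment request covers a whole day of reviews for that subcategory rather than a single review.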


In [7]:
senti_cat = df.copy()  # work on a copy so the column assignment does not touch df
senti_cat['NLU'] = list(map(lambda x:getNLU(x),senti_cat.charlen))
senti_cat = senti_cat.sort_values(['NLU', 'charlen'], ascending=[True, True])
senti_cat.head(5)

| reviewTime | sub_category_1 | reviewText | charlen | NLU |
|---|---|---|---|---|
| 2014-07-03 | Luggage & Travel Gear | This works good for my Kindle | 29 | 1.0 |
| 2007-06-14 | Cases | Looks great, fits like a glove. Nice feel. | 42 | 1.0 |
| 2009-05-02 | eBook Readers & Accessories | This is a must have for the DS of course. [...] | 48 | 1.0 |
| 2014-07-02 | Car Care | work great, many uses for them i would recomme... | 53 | 1.0 |
| 2000-08-26 | Computers & Accessories | No problems, worked under 98, NT, 2K and Linux... | 55 | 1.0 |


Iterate over the data and store the sentiment scores in a new column, then save the results to .pickle files for future reference. We store many small pickles because we made the Sentiment Analysis requests in small batches, so that API credits were not wasted in case of server errors.

In [None]:
import datetime

BATCH = 1000

# Score rows 0 to 39,999 in batches of 1,000 and pickle each batch separately.
for start in range(0, 40000, BATCH):
    batchN = senti_cat.iloc[start:start+BATCH, :].copy()  # copy avoids SettingWithCopyWarning
    t = datetime.datetime.now()
    print("Start: " + t.strftime('%H:%M:%S'))
    batchN['sentScore'] = [getSentiment(text) for text in batchN.reviewText]
    filename = "catDate_sentScores_" + str(start//1000) + "k_" + str((start+BATCH)//1000) + "k.pickle"
    with open(filename, "wb") as pickling_on:
        pickle.dump(batchN[['sentScore']], pickling_on)
    print(filename + " complete")

# Score the remaining rows (40,000 onward) as one final batch.
batchN = senti_cat.iloc[40000:, :].copy()
t = datetime.datetime.now()
print("Start: " + t.strftime('%H:%M:%S'))
batchN['sentScore'] = [getSentiment(text) for text in batchN.reviewText]
filename = "catDate_sentScores_40k_end.pickle"
with open(filename, "wb") as pickling_on:
    pickle.dump(batchN[['sentScore']], pickling_on)
print(filename + " complete")
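
The batching idea can be sketched independently of the Watson API. The version below is an illustration under stated assumptions, not the code used for the real run: `score_in_batches`, `fake_score`, and the file-name pattern are hypothetical, and the skip-if-file-exists check is an addition that the loop above does not have. It shows how per-batch pickles let an interrupted run resume without paying for finished batches again.

```python
import os
import pickle
import tempfile

def score_in_batches(texts, score_fn, batch_size, out_dir):
    """Score texts batch by batch, pickling each batch; skip batches already on disk."""
    paths = []
    for start in range(0, len(texts), batch_size):
        path = os.path.join(out_dir, "scores_{}_{}.pickle".format(start, start + batch_size))
        if not os.path.exists(path):  # resume: finished batches are not scored again
            scores = [score_fn(t) for t in texts[start:start + batch_size]]
            with open(path, "wb") as fh:
                pickle.dump(scores, fh)
        paths.append(path)
    return paths

# Stand-in scorer so the sketch runs without API credentials.
def fake_score(text):
    return len(text) / 100.0

with tempfile.TemporaryDirectory() as tmp:
    files = score_in_batches(["a" * 30, "b" * 40, "c" * 50], fake_score, 2, tmp)
    print([os.path.basename(f) for f in files])
```

With a batch size of 1,000, a server error costs at most the credits of the batch in progress.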

### Part 2: Product Sentiment Analysis

Include required libraries

In [2]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer



Initialize the NLTK VADER sentiment analyzer and run a quick test to check that it works.

In [3]:
sentimentAnalyzer = SentimentIntensityAnalyzer()
sentence = "This camera worked qute well, I am really happy with its image quality and ease-of-use."
sentimentAnalyzer.polarity_scores(sentence)

{'compound': 0.7264, 'neg': 0.0, 'neu': 0.663, 'pos': 0.337}
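
The `compound` value is a normalized score between -1 and 1, while `neg`, `neu`, and `pos` are proportions of the text. If a discrete label is needed, a common convention suggested by VADER's authors is to treat `compound >= 0.05` as positive and `compound <= -0.05` as negative. The sketch below assumes that convention; `label_from_compound` is a hypothetical helper and is not part of NLTK or of this notebook's pipeline.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER polarity_scores dict to a coarse label (threshold is an assumption)."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the test sentence above.
example = {'compound': 0.7264, 'neg': 0.0, 'neu': 0.663, 'pos': 0.337}
print(label_from_compound(example))
```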

Loading our data

In [7]:
df = pd.read_pickle("electronics_meta_sentiment.pickle")
df.shape

(1689188, 23)

Calculate sentiment score for each review

In [51]:
import datetime

test = df

t = datetime.datetime.now()
print ("Start: "+ t.strftime('%H:%M:%S'))

test['score']=list(map(lambda x:sentimentAnalyzer.polarity_scores(x),test.reviewText))
test['sentiment_com']= list(map(lambda x:x['compound'],test.score))
test['sentiment_pos'] = list(map(lambda x:x['pos'],test.score))
test['sentiment_neg']= list(map(lambda x:x['neg'],test.score))
test['sentiment_neu']= list(map(lambda x:x['neu'],test.score))

t = datetime.datetime.now()
print ("End  : "+ t.strftime('%H:%M:%S'))


Start: 20:39:50
End  : 21:18:13
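
The cell above maps over the `score` column once per sentiment field. As an alternative sketch (not the approach used in this notebook), the dictionaries can be expanded into columns in a single step by building a DataFrame from them. The toy values below are copied from two rows of the output shown later, and the column names mirror the ones used above.

```python
import pandas as pd

# Two score dictionaries in the shape returned by polarity_scores.
toy = pd.DataFrame({"score": [
    {"neg": 0.048, "neu": 0.763, "pos": 0.189, "compound": 0.9769},
    {"neg": 0.064, "neu": 0.902, "pos": 0.034, "compound": -0.7845},
]})

# Expand each dict into its own columns, keeping the original index for the join.
expanded = pd.DataFrame(toy["score"].tolist(), index=toy.index)
expanded = expanded.rename(columns={
    "compound": "sentiment_com",
    "pos": "sentiment_pos",
    "neg": "sentiment_neg",
    "neu": "sentiment_neu",
})
toy = toy.join(expanded)
print(toy[["sentiment_com", "sentiment_pos", "sentiment_neg", "sentiment_neu"]])
```

This does not change the cost of the VADER scoring itself, which is where most of the run time above is likely spent.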


See updated dataframe

In [55]:
test.head()

| | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime | year | ... | price | sub_category_0 | sub_category_1 | sales_category | sales_rank | score | sentiment_com | sentiment_pos | sentiment_neg | sentiment_neu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AO94DHGC771SJ | 528881469 | amazdnu | [0, 0] | We got this GPS for my husband who is an (OTR)... | 5.0 | Gotta have GPS! | 1370131200 | 2013-06-02 | 2013 | ... | 299.99 | Electronics | GPS & Navigation | unkown | 0 | {'neg': 0.048, 'neu': 0.763, 'pos': 0.189, 'co... | 0.9769 | 0.189 | 0.048 | 0.763 |
| 1 | AMO214LNFCEI4 | 528881469 | Amazon Customer | [12, 15] | I'm a professional OTR truck driver, and I bou... | 1.0 | Very Disappointed | 1290643200 | 2010-11-25 | 2010 | ... | 299.99 | Electronics | GPS & Navigation | unkown | 0 | {'neg': 0.054, 'neu': 0.894, 'pos': 0.052, 'co... | 0.4359 | 0.052 | 0.054 | 0.894 |
| 2 | A3N7T0DY83Y4IG | 528881469 | C. A. Freeman | [43, 45] | Well, what can I say. I've had this unit in m... | 3.0 | 1st impression | 1283990400 | 2010-09-09 | 2010 | ... | 299.99 | Electronics | GPS & Navigation | unkown | 0 | {'neg': 0.041, 'neu': 0.879, 'pos': 0.08, 'com... | 0.9858 | 0.08 | 0.041 | 0.879 |
| 3 | A1H8PY3QHMQQA0 | 528881469 | Dave M. Shaw "mack dave" | [9, 10] | Not going to write a long review, even thought... | 2.0 | Great grafics, POOR GPS | 1290556800 | 2010-11-24 | 2010 | ... | 299.99 | Electronics | GPS & Navigation | unkown | 0 | {'neg': 0.061, 'neu': 0.875, 'pos': 0.063, 'co... | -0.2307 | 0.063 | 0.061 | 0.875 |
| 4 | A24EV6RXELQZ63 | 528881469 | Wayne Smith | [0, 0] | I've had mine for a year and here's what we go... | 1.0 | Major issues, only excuses for support | 1317254400 | 2011-09-29 | 2011 | ... | 299.99 | Electronics | GPS & Navigation | unkown | 0 | {'neg': 0.064, 'neu': 0.902, 'pos': 0.034, 'co... | -0.7845 | 0.034 | 0.064 | 0.902 |


Save data in a pickle.

In [56]:
with open("electronics_meta_sentiment.pickle","wb") as pickling_on:
    pickle.dump(test, pickling_on)

Calculate sentiment score for each review summary

In [8]:
import datetime

test = df

t = datetime.datetime.now()
print ("Start: "+ t.strftime('%H:%M:%S'))

test['summary_score']=list(map(lambda x:sentimentAnalyzer.polarity_scores(x),test.summary))
test['summary_sentiment_com']= list(map(lambda x:x['compound'],test.summary_score))
test['summary_sentiment_pos'] = list(map(lambda x:x['pos'],test.summary_score))
test['summary_sentiment_neg']= list(map(lambda x:x['neg'],test.summary_score))
test['summary_sentiment_neu']= list(map(lambda x:x['neu'],test.summary_score))

t = datetime.datetime.now()
print ("End  : "+ t.strftime('%H:%M:%S'))

Start: 00:26:06
End  : 00:28:19

Compare the summary sentiment scores with the review-text sentiment scores


In [10]:
test[['reviewText','summary','summary_sentiment_com','summary_sentiment_pos','summary_sentiment_neg','summary_sentiment_neu','sentiment_com','sentiment_pos','sentiment_neg','sentiment_neu']]

| | reviewText | summary | summary_sentiment_com | summary_sentiment_pos | summary_sentiment_neg | summary_sentiment_neu | sentiment_com | sentiment_pos | sentiment_neg | sentiment_neu |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | We got this GPS for my husband who is an (OTR)... | Gotta have GPS! | 0.0000 | 0.000 | 0.000 | 1.000 | 0.9769 | 0.189 | 0.048 | 0.763 |
| 1 | I'm a professional OTR truck driver, and I bou... | Very Disappointed | -0.5256 | 0.000 | 0.772 | 0.228 | 0.4359 | 0.052 | 0.054 | 0.894 |
| 2 | Well, what can I say. I've had this unit in m... | 1st impression | 0.2263 | 0.655 | 0.000 | 0.345 | 0.9858 | 0.080 | 0.041 | 0.879 |
| 3 | Not going to write a long review, even thought... | Great grafics, POOR GPS | 0.0688 | 0.413 | 0.386 | 0.201 | -0.2307 | 0.063 | 0.061 | 0.875 |
| 4 | I've had mine for a year and here's what we go... | Major issues, only excuses for support | 0.4019 | 0.351 | 0.000 | 0.649 | -0.7845 | 0.034 | 0.064 | 0.902 |
| 5 | I am using this with a Nook HD+. It works as d... | HDMI Nook adapter cable | 0.0000 | 0.000 | 0.000 | 1.000 | 0.5719 | 0.163 | 0.000 | 0.837 |
| 6 | The cable is very wobbly and sometimes disconn... | Cheap proprietary scam | -0.5719 | 0.000 | 0.649 | 0.351 | -0.5256 | 0.000 | 0.139 | 0.861 |
| 7 | This adaptor is real easy to setup and use rig... | A Perfdect Nook HD+ hook up | 0.0000 | 0.000 | 0.000 | 1.000 | 0.9146 | 0.145 | 0.020 | 0.835 |
| 8 | This adapter easily connects my Nook HD 7&#34;... | A nice easy to use accessory. | 0.6908 | 0.655 | 0.000 | 0.345 | 0.9209 | 0.183 | 0.000 | 0.817 |
| 9 | This product really works great but I found th... | This works great but read the details... | 0.3716 | 0.298 | 0.000 | 0.702 | 0.9636 | 0.109 | 0.000 | 0.891 |
