# Predicting sentiment from product reviews

## Fire up Pandas

In [1]:
import pandas
import sklearn
import numpy as np

## Read some product review data
Loading reviews for a set of baby products.

In [5]:
products = pandas.read_csv('../lectures/data/amazon_baby.csv')

In [6]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


## Some quick data cleaning
Let's look at some of the reviews:

In [7]:
products['review'][0]

'These flannel wipes are OK, but in my opinion not worth keeping.  I also ordered someImse Vimse Cloth Wipes-Ocean Blue-12 countwhich are larger, had a nicer, softer texture and just seemed higher quality.  I use cloth wipes for hands and faces and have been usingThirsties 6 Pack Fab Wipes, Boyfor about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'

Looking at reviews 30 to 50, we see a missing value (`NaN`) for review #38:

In [8]:
products['review'][30:50]

30    Beautiful little book.  A great little short s...
31    This book is so worth the money. It says 9+ mo...
32    we just got this book for our one-year-old and...
33    The book is colorful and is perfect for 6month...
34    The book is cute, and we are huge fans of Lama...
35    What a great book for babies!  I'd been lookin...
36    My son loved this book as an infant.  It was p...
37    Our baby loves this book & has loved it for a ...
38                                                  NaN
39    My son likes brushing elmo's teeth. Almost too...
40    This was a birthday present for my 2 year old ...
41    This bear is absolutely adorable and I would g...
42    My baby absolutely loves Elmo and so this book...
43    I bought two for recent baby showers!  The boo...
44    We wanted to get another book like the Big Bir...
45    This is a cute little peek-a-boo story book.  ...
46    My 3 month old son loves this book. We read it...
47    Very cute interactive book! My son loves t

So let's clean that up by replacing missing reviews with an empty string:

In [9]:
def cleanNaN(value):
    if pandas.isnull(value):
        return ""
    else:
        return value

In [10]:
products['review'] = products['review'].apply(cleanNaN)
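The same cleanup can also be written with pandas' built-in `Series.fillna`, which avoids calling a Python function once per row. A minimal sketch on a small made-up Series (`toy_reviews` is hypothetical and stands in for `products['review']`):

```python
import numpy as np
import pandas

# Hypothetical stand-in for products['review'], with one missing entry
toy_reviews = pandas.Series(["Great book", np.nan, "Too small"])

# Replace missing values with an empty string in one vectorized call
cleaned = toy_reviews.fillna("")

print(cleaned.tolist())
```

On the real data this would presumably read `products['review'] = products['review'].fillna("")`.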

In [11]:
products['review'][30:50]

30    Beautiful little book.  A great little short s...
31    This book is so worth the money. It says 9+ mo...
32    we just got this book for our one-year-old and...
33    The book is colorful and is perfect for 6month...
34    The book is cute, and we are huge fans of Lama...
35    What a great book for babies!  I'd been lookin...
36    My son loved this book as an infant.  It was p...
37    Our baby loves this book & has loved it for a ...
38                                                     
39    My son likes brushing elmo's teeth. Almost too...
40    This was a birthday present for my 2 year old ...
41    This bear is absolutely adorable and I would g...
42    My baby absolutely loves Elmo and so this book...
43    I bought two for recent baby showers!  The boo...
44    We wanted to get another book like the Big Bir...
45    This is a cute little peek-a-boo story book.  ...
46    My 3 month old son loves this book. We read it...
47    Very cute interactive book! My son loves t

Now the data looks cleaner: review #38 is an empty string instead of `NaN`.

## Build the word count vector for each review

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
vect = CountVectorizer(token_pattern = r'\b\w+\b')
features = vect.fit_transform(products['review'])

In [14]:
type(features)

scipy.sparse.csr.csr_matrix

In [15]:
features.shape

(183531, 68069)

The matrix has one row per review (183,531) and one column per distinct token (68,069).

Let's look at the first 20 features (the words). The vocabulary is sorted, so tokens that start with digits come first.

In [16]:
# on scikit-learn >= 1.0, use get_feature_names_out() instead
vect.get_feature_names()[0:20]

['0',
 '00',
 '000',
 '0001',
 '000ft',
 '000importer',
 '000sqft',
 '001',
 '001cm',
 '00am',
 '00amcreepy',
 '00cons',
 '00dollars',
 '00etwhile',
 '00not',
 '00pm',
 '01',
 '01262',
 '016sc01',
 '01992']

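Let's look up the word "colorful" in the vocabulary. Note that `vocabulary_` maps each word to its column index in the feature matrix, not to the number of times it appears: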

In [17]:
vect.vocabulary_.get(u'colorful')

14124
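To get an actual frequency for a word, sum the matrix column at that index. A small sketch on a made-up corpus (the `docs` list and the `toy_*` names are hypothetical, for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = ["a colorful book", "colorful and colorful", "plain book"]

toy_vect = CountVectorizer(token_pattern=r'\b\w+\b')
toy_features = toy_vect.fit_transform(docs)

# vocabulary_ maps a word to its column index in the sparse matrix
col = toy_vect.vocabulary_['colorful']

# Summing that column gives the total number of occurrences
total_count = int(toy_features[:, col].sum())

# The number of stored (nonzero) entries is the number of documents containing the word
doc_count = int(toy_features[:, col].nnz)

print(col, total_count, doc_count)
```

Applied to the notebook's data, the same pattern would be `features[:, vect.vocabulary_['colorful']].sum()`.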

## Build a sentiment classifier
Examine the ratings for **all** the reviews we have:

In [18]:
products['rating'].plot(y='rating', orientation='horizontal', kind='hist', bins=5)

<matplotlib.axes._subplots.AxesSubplot at 0x7f2cb6111518>

## Define what's a positive and a negative sentiment

We will ignore all reviews with a rating of 3, since they tend to have a neutral sentiment. Reviews with a rating of 4 or higher will be considered positive, while those with a rating of 2 or lower will be considered negative.

In [19]:
# ignore all 3* reviews; .copy() avoids a SettingWithCopyWarning when a column is added later
products = products[products['rating'] != 3].copy()

In [20]:
# positive sentiment = 4* or 5* reviews
products['sentiment'] = products['rating'] >= 4

In [21]:
products.head()

Unnamed: 0,name,review,rating,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,True
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,True
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,True
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,True
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,True


## Let's train the sentiment classifier

In [22]:
products.sentiment.value_counts()

True     140259
False     26493
Name: sentiment, dtype: int64
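The classes are heavily imbalanced, which matters when judging accuracy: a model that always predicts the positive class would already be right most of the time. A quick sketch of that baseline, using only the counts printed above:

```python
# Class counts taken from the value_counts() output above
n_positive = 140259
n_negative = 26493

# Accuracy of a trivial classifier that always predicts "positive"
baseline_accuracy = n_positive / (n_positive + n_negative)

print(round(baseline_accuracy, 4))
```

Any trained classifier should be compared against this figure (about 0.84) rather than against 0.5.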

In [23]:
# Define X and y
X = products['review']
y = products['sentiment']

In [24]:
# split into training and testing sets
# (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

(133401,)
(33351,)
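With classes this imbalanced, a plain random split can leave the two sets with slightly different class ratios. `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both. This is an optional variation, not what the notebook does. A sketch on made-up labels, assuming a scikit-learn version that provides `sklearn.model_selection` (`toy_X`, `toy_y`, and the `*s_*` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with a similar kind of imbalance: 80 positive, 20 negative
toy_y = np.array([True] * 80 + [False] * 20)
toy_X = np.arange(len(toy_y))  # placeholder features

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)

# Fraction of positive labels in each split
print(ys_train.mean(), ys_test.mean())
```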




In [25]:
# instantiate the vectorizer
vect = CountVectorizer()

In [26]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<133401x57485 sparse matrix of type '<class 'numpy.int64'>'
	with 7080210 stored elements in Compressed Sparse Row format>

In [27]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<33351x57485 sparse matrix of type '<class 'numpy.int64'>'
	with 1749877 stored elements in Compressed Sparse Row format>
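Fitting the vectorizer on the training split only, then reusing it on the test split, keeps test-set vocabulary from leaking into the model. One way to make that ordering hard to get wrong is scikit-learn's `make_pipeline`, sketched here on a tiny made-up dataset (the `toy_*` names and texts are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only
toy_train = ["love it", "great product", "terrible quality", "broke quickly",
             "love this great toy", "terrible, broke in a day"]
toy_labels = [True, True, False, False, True, False]
toy_test = ["great, love it", "terrible and broke"]

# fit() learns the vocabulary and the classifier from the training text only
toy_model = make_pipeline(CountVectorizer(), LogisticRegression())
toy_model.fit(toy_train, toy_labels)

# predict() applies the already-fitted vocabulary to unseen text;
# words not seen during training (such as "and") are simply ignored
toy_pred = toy_model.predict(toy_test)
print(toy_pred)
```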

In [28]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [29]:
# class predictions and predicted probabilities
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]

## Evaluate the model
We evaluate the model by its accuracy and its [area under the ROC curve](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) score:

In [30]:
# calculate accuracy and AUC
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.933885040928
0.95652134862


And the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix):

In [31]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 3862  1412]
 [  793 27284]]
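Rows of the matrix are the true classes and columns the predicted ones, ordered `False` then `True`, so the four cells are true negatives, false positives, false negatives, and true positives. A short sketch deriving per-class rates from the numbers printed above:

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual (False, True), columns = predicted (False, True)
cm = np.array([[3862, 1412],
               [793, 27284]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews classified correctly
precision = tp / (tp + fp)        # predicted positives that are truly positive
recall = tp / (tp + fn)           # actual positives the model found
specificity = tn / (tn + fp)      # actual negatives the model found

print(accuracy, precision, recall, specificity)
```

The accuracy recomputed this way matches the value reported earlier. The specificity is much lower (roughly 0.73), which shows that the model misses a sizable share of negative reviews, something the headline accuracy hides because positive reviews dominate the data.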
