# Predicting sentiment from product reviews

As usual, the basic imports come first. To begin with, we need pandas for DataFrames and scikit-learn (sklearn) for machine learning.

In [13]:
import pandas
import sklearn

### Loading reviews for a set of baby products.

First, let's read the required data from a CSV file. In this project we work with data from Amazon, specifically reviews of baby products.

In [14]:
data = pandas.read_csv('../lectures/data/amazon_baby.csv')

Let's look at the first few rows to see what the data looks like.

In [15]:
data.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


As we can see, there are four columns: an index, the product name, the review text, and the rating.
In this task we will try to classify and analyze these reviews to find the best and the worst items.

### Data cleaning

Let's see what a single entry looks like.

In [16]:
data['review'][0]

'These flannel wipes are OK, but in my opinion not worth keeping.  I also ordered someImse Vimse Cloth Wipes-Ocean Blue-12 countwhich are larger, had a nicer, softer texture and just seemed higher quality.  I use cloth wipes for hands and faces and have been usingThirsties 6 Pack Fab Wipes, Boyfor about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'

Looking at reviews 30 to 50, we see a missing value (NaN) in the review column:

In [17]:
data['review'][30:50]

30    Beautiful little book.  A great little short s...
31    This book is so worth the money. It says 9+ mo...
32    we just got this book for our one-year-old and...
33    The book is colorful and is perfect for 6month...
34    The book is cute, and we are huge fans of Lama...
35    What a great book for babies!  I'd been lookin...
36    My son loved this book as an infant.  It was p...
37    Our baby loves this book & has loved it for a ...
38                                                  NaN
39    My son likes brushing elmo's teeth. Almost too...
40    This was a birthday present for my 2 year old ...
41    This bear is absolutely adorable and I would g...
42    My baby absolutely loves Elmo and so this book...
43    I bought two for recent baby showers!  The boo...
44    We wanted to get another book like the Big Bir...
45    This is a cute little peek-a-boo story book.  ...
46    My 3 month old son loves this book. We read it...
47    Very cute interactive book! My son loves t

So let's clean that up by replacing missing reviews with an empty string.

In [18]:
def cleanNaN(value):
    if pandas.isnull(value):
        return ""
    else:
        return value

In [19]:
data['review'] = data['review'].apply(cleanNaN)

In [20]:
data['review'][30:50]

30    Beautiful little book.  A great little short s...
31    This book is so worth the money. It says 9+ mo...
32    we just got this book for our one-year-old and...
33    The book is colorful and is perfect for 6month...
34    The book is cute, and we are huge fans of Lama...
35    What a great book for babies!  I'd been lookin...
36    My son loved this book as an infant.  It was p...
37    Our baby loves this book & has loved it for a ...
38                                                     
39    My son likes brushing elmo's teeth. Almost too...
40    This was a birthday present for my 2 year old ...
41    This bear is absolutely adorable and I would g...
42    My baby absolutely loves Elmo and so this book...
43    I bought two for recent baby showers!  The boo...
44    We wanted to get another book like the Big Bir...
45    This is a cute little peek-a-boo story book.  ...
46    My 3 month old son loves this book. We read it...
47    Very cute interactive book! My son loves t

Now the data looks cleaner: the review at index 38 is an empty string instead of NaN.

## Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews.
A vector consisting of word counts is often referred to as bag-of-words features.
Since most words occur in only a few reviews, word count vectors are sparse.
For this reason, scikit-learn and many other tools use sparse matrices to
store a collection of word count vectors. Refer to appropriate manuals to produce
sparse word count vectors. General steps for extracting word count vectors are as follows:

- Learn a vocabulary (set of all words) from the training data. Only the words that show
  up in the training data will be considered for feature extraction.
- Compute the occurrences of the words in each review and collect them into a row vector.
- Build a sparse matrix where each row is the word count vector for the corresponding review.
  Call this matrix train_matrix.
- Using the same mapping between words and columns, convert the test data into a sparse
  matrix test_matrix.
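
The steps above can be sketched in plain Python to make the mechanics visible. This is only an illustration built on the standard library: the helper names `tokenize`, `build_vocabulary`, and `count_vector` and the toy reviews are assumptions, not part of the notebook, and the dense lists stand in for the sparse rows that `CountVectorizer` produces.

```python
import re
from collections import Counter

# Same token pattern that the notebook passes to CountVectorizer.
TOKEN_PATTERN = re.compile(r"\b\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(documents):
    """Map each distinct training word to a column index, in sorted order."""
    words = sorted({token for doc in documents for token in tokenize(doc)})
    return {word: column for column, word in enumerate(words)}

def count_vector(text, vocabulary):
    """Return a dense word-count row; words outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    row = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            row[vocabulary[word]] = n
    return row

# Toy data (hypothetical reviews, not taken from the dataset).
train_reviews = ["Great wipes, great price", "Not great"]
test_reviews = ["Great stroller"]  # 'stroller' never appears in training

vocabulary = build_vocabulary(train_reviews)
train_matrix = [count_vector(r, vocabulary) for r in train_reviews]
test_matrix = [count_vector(r, vocabulary) for r in test_reviews]

print(vocabulary)
print(train_matrix)
print(test_matrix)
```

Note that the test review is converted with the vocabulary learned from the training reviews, so the unseen word is simply dropped and both matrices share the same columns.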

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

In [22]:
vect = CountVectorizer(token_pattern = r'\b\w+\b')
features = vect.fit_transform(data['review'])

In [23]:
type(features)

scipy.sparse.csr.csr_matrix

In [24]:
features.shape

(183531, 68069)

The vocabulary contains 68,069 distinct tokens, so about 68k words.

Let's look at the first 20 features (the words). They are plain strings; the `u'` prefix that Python 2 would show for unicode strings does not appear in the output below.

In [25]:
vect.get_feature_names()[0:20]

['0',
 '00',
 '000',
 '0001',
 '000ft',
 '000importer',
 '000sqft',
 '001',
 '001cm',
 '00am',
 '00amcreepy',
 '00cons',
 '00dollars',
 '00etwhile',
 '00not',
 '00pm',
 '01',
 '01262',
 '016sc01',
 '01992']

## Build a sentiment classifier
Examine the ratings for **all** the reviews we have:

In [26]:
data['rating'].plot(y='rating', orientation='horizontal', kind='hist', bins=5)

<matplotlib.axes._subplots.AxesSubplot at 0x7f8e06b4ef98>

## Define what's a positive and a negative sentiment

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.  Reviews with a rating of 4 or higher will be considered positive, while the ones with rating of 2 or lower will have a negative sentiment.   

In [27]:
# ignore all 3* reviews; .copy() avoids a SettingWithCopyWarning when columns are added later
data = data[data['rating'] != 3].copy()

In [28]:
# positive sentiment = 4* or 5* reviews
data['sentiment'] = data['rating'] >= 4

In [29]:
data.head()

Unnamed: 0,name,review,rating,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,True
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,True
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,True
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,True
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,True


## Let's train the sentiment classifier

We will now use logistic regression to create a sentiment classifier on the training data.

 - Learn a logistic regression classifier using the training data.

 - The model has one coefficient per word in the vocabulary learned from the training data.
 Recall from the lecture that positive weights w_j correspond to words that push the prediction
 toward positive sentiment, while negative weights push it toward negative sentiment.
 Calculate the number of nonnegative (>= 0) coefficients.
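
A minimal sketch of that coefficient count, assuming the model's weights are available as a flat sequence aligned with the vocabulary. The helper `summarize_coefficients` and the toy words and weights are hypothetical; in the notebook the inputs would presumably be `vect.get_feature_names()` and `logreg.coef_[0]` once the model has been fitted.

```python
def summarize_coefficients(words, weights, top=3):
    """Count nonnegative weights and list the words with the extreme weights."""
    n_nonnegative = sum(1 for w in weights if w >= 0)
    pairs = sorted(zip(weights, words))            # ascending by weight
    most_negative = [word for _, word in pairs[:top]]
    most_positive = [word for _, word in pairs[::-1][:top]]
    return n_nonnegative, most_positive, most_negative

# Hypothetical weights, for illustration only (not taken from the fitted model).
words = ["love", "great", "broke", "waste", "car"]
weights = [1.4, 0.9, -1.7, -2.1, 0.0]

n_nonnegative, most_positive, most_negative = summarize_coefficients(words, weights, top=2)
print(n_nonnegative)
print(most_positive)
print(most_negative)
```

A weight of exactly zero is counted as nonnegative here, matching the ">= 0" rule stated above.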

In [30]:
data.sentiment.value_counts()

True     140259
False     26493
Name: sentiment, dtype: int64

In [31]:
# Define X and y
X = data['review']
y = data['sentiment']

In [32]:
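# split into training and testing sets
# (train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split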
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

(133401,)
(33351,)




In [33]:
# instantiate the vectorizer
vect = CountVectorizer()

In [35]:
# learn training data vocabulary, then create document-term matrix
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
X_train_dtm

<133401x57485 sparse matrix of type '<class 'numpy.int64'>'
	with 7080210 stored elements in Compressed Sparse Row format>

In [36]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

<33351x57485 sparse matrix of type '<class 'numpy.int64'>'
	with 1749877 stored elements in Compressed Sparse Row format>

In [37]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [38]:
# class predictions and predicted probabilities
y_pred_class = logreg.predict(X_test_dtm)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]

In [39]:
y_pred_prob

array([ 0.49793407,  0.20228641,  0.09211547, ...,  0.99998893,
        0.99982235,  0.92829456])
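
To make the link between the two kinds of prediction concrete: in binary logistic regression the probability of the positive class is the sigmoid of a linear score, P(y = +1 | x) = 1 / (1 + exp(-(w·x + b))), and the predicted class is positive when that probability exceeds 0.5. The sketch below illustrates this with made-up scores; the function names and the scores are assumptions, not values from the notebook.

```python
import math

def sigmoid(score):
    """Numerically stable logistic function."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

def predict_from_score(score, threshold=0.5):
    """Return (probability of the positive class, predicted class)."""
    probability = sigmoid(score)
    return probability, probability > threshold

# Hypothetical linear scores w.x + b for three reviews.
for score in (-3.0, 0.0, 2.5):
    probability, is_positive = predict_from_score(score)
    print(score, round(probability, 4), is_positive)
```

A probability close to 0.5, like the first value in the array above, means the model is nearly undecided about that review.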

### Apply the trained model

We now apply the trained model (`logreg`) to every review in `data`, training and test rows alike,
using the vocabulary already learned by `vect`. The goal is to find the reviews with the highest
predicted probability of being classified as positive. We refer to these as the "most positive reviews."
After sorting, `head()` shows the top five; `head(20)` would show the top 20.

In [40]:
data_vect_dtm = vect.transform(data['review'])
data_vect_dtm

<166752x57485 sparse matrix of type '<class 'numpy.int64'>'
	with 8830087 stored elements in Compressed Sparse Row format>

In [41]:
data['predicted_sentiment'] = logreg.predict(data_vect_dtm)

In [42]:
data['predicted_proba'] = logreg.predict_proba(data_vect_dtm)[:, 1]

In [43]:
data.head()

Unnamed: 0,name,review,rating,sentiment,predicted_sentiment,predicted_proba
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,True,True,0.61267
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,True,True,0.994001
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,True,True,0.999792
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,True,True,0.999376
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,True,True,0.999997


In [44]:
data = data.sort_values('predicted_proba', ascending=False)

In [45]:
data.head()

Unnamed: 0,name,review,rating,sentiment,predicted_sentiment,predicted_proba
91485,"Dream On Me / Mia Moda Atmosferra Stroller, Nero",I love this stroller SO much! I am not afraid ...,5,True,True,1.0
50315,"P'Kolino Silly Soft Seating in Tias, Green",I've purchased both the P'Kolino Little Reader...,4,True,True,1.0
93690,The First Years Ignite Stroller,The last thing we wanted was to purchase more ...,5,True,True,1.0
21557,"Joovy Caboose Stand On Tandem Stroller, Black","Ok, I read all the reviews already posted here...",5,True,True,1.0
127831,"Mountain Buggy Duet Double Buggy Stroller, Bla...",My local BRU had only 15 strollers to look at ...,5,True,True,1.0


Now, let us repeat this exercise to find the "most negative reviews," the ones with the
lowest predicted probability of being classified as positive. Because `data` is already
sorted by `predicted_proba` in descending order, these reviews sit at the end of the frame,
so `tail()` returns them.

In [46]:
data.tail()

Unnamed: 0,name,review,rating,sentiment,predicted_sentiment,predicted_proba
2186,Philips Avent 3 Pack 9oz Bottles,"(This is a long review, but if you read the wh...",1,False,False,4.398846e-18
10180,Arms Reach Co-Sleeper brand Mini Co-Sleeper Ba...,"Please see my email to the company:Hello,I am ...",1,False,False,2.18058e-18
120707,The European NANNY Baby Movement Monitor - EU ...,"The previous reviewers laud the ""piece of mind...",1,False,False,1.100787e-18
120219,Levana Safe N'See Digital Video Baby Monitor w...,I have NEVER written a review before for anyth...,1,False,False,9.061516999999999e-19
147902,Graco Pack 'n Play Playard - Dempsey,My disappointment with this product prompted m...,1,False,False,1.274915e-22
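
Sorting the whole frame is one way to find the extremes; when only the top or bottom k rows are needed, a partial selection does the same job. The sketch below shows the idea in plain Python with `heapq`. The rows and their values are made up, and the helper names are assumptions; in the notebook itself something like `data.nlargest(20, 'predicted_proba')` would likely be the pandas counterpart.

```python
import heapq

def most_positive(rows, k):
    """Return the k rows with the highest predicted probability, highest first."""
    return heapq.nlargest(k, rows, key=lambda row: row["predicted_proba"])

def most_negative(rows, k):
    """Return the k rows with the lowest predicted probability, lowest first."""
    return heapq.nsmallest(k, rows, key=lambda row: row["predicted_proba"])

# Toy rows standing in for the scored reviews (values are made up).
rows = [
    {"name": "stroller", "predicted_proba": 0.999},
    {"name": "monitor", "predicted_proba": 0.002},
    {"name": "wipes", "predicted_proba": 0.61},
    {"name": "playard", "predicted_proba": 0.0001},
]

top = most_positive(rows, 2)
bottom = most_negative(rows, 2)
print([r["name"] for r in top])
print([r["name"] for r in bottom])
```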


## Evaluate the model 

We will now evaluate the trained classifier on the held-out test set
by looking at its accuracy and its area under the ROC curve (AUC):

In [47]:
# calculate accuracy and AUC
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.roc_auc_score(y_test, y_pred_prob))

0.933885040928
0.95652134862


And the confusion matrix:

In [48]:
print(metrics.confusion_matrix(y_test, y_pred_class))

[[ 3862  1412]
 [  793 27284]]
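
The confusion matrix can be unpacked into the usual summary numbers. The sketch below uses the four counts printed above and assumes scikit-learn's layout (rows are true labels, columns are predicted labels, with `False` before `True`). The accuracy it computes should match the value reported earlier, and the majority-class baseline shows what always predicting "positive" would score on this test set.

```python
# Counts copied from the confusion matrix above:
# rows = true label (False, True), columns = predicted label (False, True).
confusion = [[3862, 1412],
             [793, 27284]]

(tn, fp), (fn, tp) = confusion
total = tn + fp + fn + tp

accuracy = (tp + tn) / total           # share of all reviews classified correctly
precision = tp / (tp + fp)             # share of predicted positives that are truly positive
recall = tp / (tp + fn)                # share of true positives that the model finds
majority_baseline = (tp + fn) / total  # accuracy of always predicting "positive"

print(accuracy)
print(precision)
print(recall)
print(majority_baseline)
```

Because positive reviews dominate the data, the baseline is already high, so the accuracy figure is best read next to it rather than on its own.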


### Classifier with fewer words

In [49]:
significant_words = ['love','great','easy','old','little','perfect',
                     'loves','well','able','car','broke','less','even','waste',
                     'disappointed','work','product','money','would','return']

In [50]:
count_vect = CountVectorizer(vocabulary=significant_words)
features = count_vect.fit_transform(data['review'])

In [51]:
type(features)

scipy.sparse.csr.csr_matrix

In [52]:
features.shape

(166752, 20)

In [53]:
# the vocabulary is fixed to significant_words, so fit() keeps it as is;
# then create the document-term matrix for the training data
count_vect.fit(X_train)
train_matrix_word_subset = count_vect.transform(X_train)
train_matrix_word_subset

<133401x20 sparse matrix of type '<class 'numpy.int64'>'
	with 298168 stored elements in Compressed Sparse Row format>

In [54]:
# transform testing data with count_vect (the 20-word vocabulary), not vect,
# so the columns match train_matrix_word_subset
test_matrix_word_subset = count_vect.transform(X_test)
test_matrix_word_subset.shape

(33351, 20)

In [55]:
subset_logreg = LogisticRegression()
subset_logreg.fit(train_matrix_word_subset, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)