# Predicting sentiment from product reviews

As usual, the most basic imports come first. To begin with, we need pandas for DataFrames and scikit-learn for machine learning.

In [3]:
import json
import math
import operator
import string
from collections import Counter

import numpy
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer

### Loading reviews for a set of baby products.

First, let's read the required data from a CSV file. In this project we work with data from Amazon, specifically reviews of baby products.

In [6]:
data = pd.read_csv('../lectures/data/amazon_baby.csv')

Let's look at the first few rows to see what the data looks like.

In [7]:
data.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


As we can see, there are four columns: an index, the product name, the review text, and the rating.
In this task we will classify and analyze those reviews to find the best and the worst items.

### Data cleaning

Let's see what a single entry looks like.

In [8]:
data['review'][0]

'These flannel wipes are OK, but in my opinion not worth keeping.  I also ordered someImse Vimse Cloth Wipes-Ocean Blue-12 countwhich are larger, had a nicer, softer texture and just seemed higher quality.  I use cloth wipes for hands and faces and have been usingThirsties 6 Pack Fab Wipes, Boyfor about 8 months now and need to replace them because they are starting to get rough and have had stink issues for a while that stripping no longer handles.'

The text contains punctuation that we don't need, so we remove it:

In [9]:
data['review_clean'] = data['review'].str.replace(r'[^\w\s]', '', regex=True)
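
To see what this pattern does on a single string, here is a minimal sketch that uses Python's standard `re` module instead of pandas. The helper name `strip_punctuation` and the sample sentence are made up for illustration; only the character class `[^\w\s]` comes from the cell above.

```python
import re

def strip_punctuation(text):
    """Drop every character that is neither a word character nor whitespace."""
    return re.sub(r'[^\w\s]', '', text)

# Hypothetical sample sentence, not taken from the dataset
sample = "It's a great book, my 6-month-old loves it!"
cleaned = strip_punctuation(sample)
print(cleaned)  # Its a great book my 6monthold loves it
```

Note that removing hyphens and apostrophes joins the surrounding characters, so `6-month-old` becomes `6monthold`. Underscores are kept, because `\w` matches them.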

Looking at reviews 30 to 50, we can see a missing value (NaN) among them:

In [10]:
data['review'][30:50]

30    Beautiful little book.  A great little short s...
31    This book is so worth the money. It says 9+ mo...
32    we just got this book for our one-year-old and...
33    The book is colorful and is perfect for 6month...
34    The book is cute, and we are huge fans of Lama...
35    What a great book for babies!  I'd been lookin...
36    My son loved this book as an infant.  It was p...
37    Our baby loves this book & has loved it for a ...
38                                                  NaN
39    My son likes brushing elmo's teeth. Almost too...
40    This was a birthday present for my 2 year old ...
41    This bear is absolutely adorable and I would g...
42    My baby absolutely loves Elmo and so this book...
43    I bought two for recent baby showers!  The boo...
44    We wanted to get another book like the Big Bir...
45    This is a cute little peek-a-boo story book.  ...
46    My 3 month old son loves this book. We read it...
47    Very cute interactive book! My son loves t

So let's clean that up by replacing missing reviews with an empty string:

In [13]:
def cleanNaN(value):
    if pd.isnull(value):
        return ""
    else:
        return value

In [14]:
data['review_clean'] = data['review_clean'].apply(cleanNaN)

In [15]:
data['review_clean'][30:50]

30    Beautiful little book  A great little short st...
31    This book is so worth the money It says 9 mont...
32    we just got this book for our oneyearold and s...
33    The book is colorful and is perfect for 6month...
34    The book is cute and we are huge fans of Lamaz...
35    What a great book for babies  Id been looking ...
36    My son loved this book as an infant  It was pe...
37    Our baby loves this book  has loved it for a w...
38                                                     
39    My son likes brushing elmos teeth Almost too n...
40    This was a birthday present for my 2 year old ...
41    This bear is absolutely adorable and I would g...
42    My baby absolutely loves Elmo and so this book...
43    I bought two for recent baby showers  The book...
44    We wanted to get another book like the Big Bir...
45    This is a cute little peekaboo story book  Its...
46    My 3 month old son loves this book We read it ...
47    Very cute interactive book My son loves th

Now the data looks cleaner: the review at index 38 is an empty string instead of NaN.

## Define what's a positive and a negative sentiment

We will ignore all reviews with a rating of 3, since they tend to have a neutral sentiment. Reviews with a rating of 4 or higher are considered positive, while those with a rating of 2 or lower are considered negative.

In [18]:
data = data[data.rating != 3].copy()  # Ignore neutral ratings; copy so the new column goes on an independent DataFrame

data['sentiment'] = data['rating'].apply(lambda rating: +1 if rating > 3 else -1)  # Assign sentiment tags
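
The labeling rule can also be checked in isolation. The sketch below is plain Python; the function name `rating_to_sentiment` and the sample ratings are illustrative assumptions and not part of the notebook.

```python
def rating_to_sentiment(rating):
    """Map a star rating to +1 (positive) or -1 (negative); 3 is treated as neutral."""
    if rating == 3:
        return None  # neutral reviews are dropped before labeling
    return 1 if rating > 3 else -1

# Hypothetical ratings covering every possible star value
ratings = [1, 2, 3, 4, 5]
labels = [rating_to_sentiment(r) for r in ratings if r != 3]
print(labels)  # [-1, -1, 1, 1]
```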

In [19]:
data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1


## Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews.
A vector consisting of word counts is often referred to as bag-of-words features.
Since most words occur in only a few reviews, word count vectors are sparse.
For this reason, scikit-learn and many other tools use sparse matrices to
store a collection of word count vectors. Refer to appropriate manuals to produce
sparse word count vectors. General steps for extracting word count vectors are as follows:

- Learn a vocabulary (set of all words) from the training data. Only the words that show
  up in the training data will be considered for feature extraction.
- Compute the occurrences of the words in each review and collect them into a row vector.
- Build a sparse matrix where each row is the word count vector for the corresponding review.
  Call this matrix train_matrix.
- Using the same mapping between words and columns, convert the test data into a sparse
  matrix test_matrix.
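
The steps above can be sketched in plain Python to make the idea concrete. This is only an illustration under simplifying assumptions (lowercasing and whitespace tokenization, rows stored as `{column: count}` dictionaries); it is not how `CountVectorizer` is implemented, and the names `learn_vocabulary` and `to_sparse_rows` as well as the sample sentences are hypothetical.

```python
from collections import Counter

def learn_vocabulary(documents):
    """Assign a column index to every distinct word in the training documents."""
    vocabulary = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def to_sparse_rows(documents, vocabulary):
    """Return one {column_index: count} dict per document; unknown words are skipped."""
    rows = []
    for doc in documents:
        counts = Counter(w for w in doc.lower().split() if w in vocabulary)
        rows.append({vocabulary[w]: n for w, n in counts.items()})
    return rows

# Made-up documents standing in for the review text
train_docs = ["great book great price", "broke after one day"]
test_docs = ["great day terrible book"]

vocab = learn_vocabulary(train_docs)            # learned from training data only
train_rows = to_sparse_rows(train_docs, vocab)
test_rows = to_sparse_rows(test_docs, vocab)    # same word-to-column mapping

print(vocab)
print(train_rows)
print(test_rows)
```

The word `terrible` never appears in the training documents, so it has no column and is dropped from the test row. The same thing happens to unseen words when a fitted vectorizer transforms the test set.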

In [20]:

train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Bag of words training model
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])

sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])

coefs = sentiment_model.coef_[0]  # one learned weight per vocabulary word

print("Number of co-efficients", len(coefs))

# Count the weights that are zero or positive
count = int((coefs >= 0).sum())

print("Number of non negative coeffs ", count)

sample_test_data = test_data[10:13]
print(sample_test_data)

def probability(score):
    return (1 / (1 + numpy.exp(-score)))

sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)
print(sentiment_model.predict(sample_test_matrix))

test_set_scores = sentiment_model.decision_function(test_matrix)
names = test_data["name"]
# Map product name -> score; a product with several reviews keeps only the last score seen
name_predictions = dict(zip(names, test_set_scores))

sorted_reviews = sorted(name_predictions.items(), key=operator.itemgetter(1), reverse=True)

most_positive_reviews = sorted_reviews[:20]
print(most_positive_reviews)

most_negative_reviews = sorted_reviews[-1:-21:-1]  # 20 lowest scores, most negative first
print(most_negative_reviews)

def get_accuracy(model, data_matrix, dataset):
    predictions = model.predict(data_matrix)

    match_predictions_labels = list(zip(predictions, dataset))

    correct_count = 0
    for prediction, label in match_predictions_labels:
        if prediction == label:
            correct_count += 1
    return (float(correct_count) / len(match_predictions_labels))

# Classifier with a set of significant words

significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
                     'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed',
                     'work', 'product', 'money', 'would', 'return']

vectorizer_word_subset = CountVectorizer(vocabulary=significant_words)  # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])

simple_model_coef_table = list(zip(significant_words, simple_model.coef_.flatten()))

print(simple_model_coef_table)

print("Training Set Accuracy : Sentiment Model ", get_accuracy(sentiment_model, train_matrix, train_data["sentiment"]))
print("Test Set Accuracy : Sentiment Model ", get_accuracy(sentiment_model, test_matrix, test_data["sentiment"]))
print("Training Set Accuracy : Simple Model ", get_accuracy(simple_model, train_matrix_word_subset, train_data["sentiment"]))
print("Test Set Accuracy : Simple Model ", get_accuracy(simple_model, test_matrix_word_subset, test_data["sentiment"]))

Number of co-efficients 121871
Number of non negative coeffs  85814
                                                    name  \
80054  Simple Wishes Hands-Free Breastpump Bra, Pink,...   
44765  Moby Wrap UV SPF 50+ 100% Cotton Baby Carrier,...   
48173  The First Years American Red Cross Deluxe Nail...   

                                                  review  rating  \
80054  I like the idea but the problems:-awkward clos...       2   
44765  This is my 2nd Moby, just wanted another color...       5   
48173  This is the best nail clipper! Definitely reco...       5   

                                            review_clean  sentiment  
80054  I like the idea but the problemsawkward closin...         -1  
44765  This is my 2nd Moby just wanted another color ...          1  
48173  This is the best nail clipper Definitely recom...          1  
[ -0.54655151  17.03374564   4.88448576]
[-1  1  1]
[('Zooper 2011 Waltz Standard Stroller, Flax Brown', 63.963700627352594), ('Bumbleride