# Predicting sentiment from product reviews


I will use product review data from Amazon.com to predict whether the sentiment of a product review is positive or negative. I will use SFrames to do this. First I will fire up GraphLab Create, then I will train a logistic regression model to predict the sentiment of product reviews.

    
## Fire up GraphLab Create

In [1]:
from __future__ import division
import graphlab
import math
import string

# Data preparation

We will use a dataset consisting of baby product reviews on Amazon.com. A detailed description of the dataset was given previously.

In [3]:
products = graphlab.SFrame('amazon_baby.gl/')

Now, let's see what the dataset looks like.

In [4]:
products

name,review,rating
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0


## Build the word count vector for each review

Let us explore a specific example of a baby product.


In [5]:
products[269] # pick an arbitrary data point (index 269)

{'name': 'The First Years Massaging Action Teether',
 'rating': 5.0,
 'review': 'A favorite in our house!'}

To make the later steps easier, the data needs some cleaning first. I remove the punctuation from every review and add a new column, `word_count`, to the `products` SFrame. It holds, for each review, a dictionary mapping each word to the number of times it appears. The counts are computed with GraphLab's built-in function `graphlab.text_analytics.count_words`.

In [6]:
def remove_punctuation(text):
    import string
    # Python 2 str.translate: no translation table (None), delete every punctuation character
    return text.translate(None, string.punctuation)

review_without_punctuation = products['review'].apply(remove_punctuation)
products['word_count'] = graphlab.text_analytics.count_words(review_without_punctuation)

In [7]:
products[269]['word_count'] # word counts of the same review (index 269)

{'a': 1L, 'favorite': 1L, 'house': 1L, 'in': 1L, 'our': 1L}
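
For readers without GraphLab, the same idea can be sketched in plain Python 3. This is only an approximation: it assumes that the counts are taken over lowercased, whitespace-separated tokens (which matches the output above, where `'A'` became `'a'`), and the helper name `count_words` here is a stand-in, not the GraphLab function itself.

```python
import string
from collections import Counter

def remove_punctuation(text):
    # Python 3 equivalent of the Python 2 call text.translate(None, string.punctuation)
    return text.translate(str.maketrans('', '', string.punctuation))

def count_words(text):
    # Assumption: lowercase the text and split on whitespace before counting
    tokens = remove_punctuation(text).lower().split()
    return dict(Counter(tokens))

example_counts = count_words('A favorite in our house!')
```

Applied to the review at index 269, this gives one count per word, with the exclamation mark removed before tokenizing.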

## Extract sentiments

I will **ignore** all reviews with rating = **3**, since they tend to have a **neutral sentiment**, and consider only ratings of **1, 2, 4 and 5**.

Reviews with a rating of 4 or higher are treated as *positive* reviews, and reviews with a rating of 2 or lower as *negative*. **For the sentiment column, we use +1 for the positive class label and -1 for the negative class label**. In short, ratings of 4 and 5 mean "**positive review**" and ratings of 1 and 2 mean "**negative review**".

In [8]:
products = products[products['rating'] != 3]
len(products)

166752

With the neutral reviews removed, the `sentiment` column can be computed from `rating`: +1 if the rating is above 3, and -1 otherwise.

In [9]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
products

name,review,rating,word_count,sentiment
Planetwise Wipe Pouch,it came early and was not disappointed. i love ...,5.0,"{'and': 3L, 'love': 1L, 'it': 3L, 'highly': 1L, ...",1
Annas Dream Full Quilt with 2 Shams ...,Very soft and comfortable and warmer than it ...,5.0,"{'and': 2L, 'quilt': 1L, 'it': 1L, 'comfortable': ...",1
Stop Pacifier Sucking without tears with ...,This is a product well worth the purchase. I ...,5.0,"{'and': 3L, 'ingenious': 1L, 'love': 2L, 'is': ...",1
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5.0,"{'and': 2L, 'all': 2L, 'help': 1L, 'cried': 1L, ...",1
Stop Pacifier Sucking without tears with ...,"When the Binky Fairy came to our house, we didn't ...",5.0,"{'and': 2L, 'cute': 1L, 'help': 2L, 'habit': 1L, ...",1
A Tale of Baby's Days with Peter Rabbit ...,"Lovely book, it's bound tightly so you may no ...",4.0,"{'shop': 1L, 'be': 1L, 'is': 1L, 'bound': 1L, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",Perfect for new parents. We were able to keep ...,5.0,"{'and': 2L, 'all': 1L, 'right': 1L, 'able': 1L, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",A friend of mine pinned this product on Pinte ...,5.0,"{'and': 1L, 'fantastic': 1L, 'help': 1L, 'give': ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'all': 1L, 'standarad': 1L, 'another': 1L, ...",1
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'all': 2L, 'nannys': 1L, 'just': 1L, 'sleep': 2L, ...",1


Now, the dataset contains an extra column called **sentiment** which is either positive (+1) or negative (-1).
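
The filtering and labelling steps can be illustrated on a plain Python list. The ratings below are made-up values for illustration only, and `rating_to_sentiment` is a hypothetical helper that mirrors the lambda used above.

```python
def rating_to_sentiment(rating):
    # Same rule as the lambda above: ratings above 3 are positive, the rest negative
    return 1 if rating > 3 else -1

# Hypothetical ratings, not taken from the dataset
ratings = [3.0, 5.0, 5.0, 4.0, 2.0, 1.0]

# Drop the neutral reviews first, then assign a label to each remaining rating
kept = [r for r in ratings if r != 3]
labels = [rating_to_sentiment(r) for r in kept]
```

Because the neutral ratings are removed first, the `else` branch only ever sees ratings of 1 and 2.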

## Splitting data into training and test sets

The data needs to be split into two parts: training data and test data. I am doing a train/test split with about 80% of the data in the training set and 20% in the test set. Setting `seed=1` makes the split reproducible.


In [10]:
train_data, test_data = products.random_split(.8, seed=1)
print len(train_data)
print len(test_data)

133416
33336
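
Note that the split is only approximately 80/20: 133416 of 166752 rows is about 80.01%. One way such a split can behave is sketched below in plain Python, under the assumption that each row is sent to the training set independently with probability 0.8. The `random_split` function here is a hypothetical stand-in and will not reproduce GraphLab's exact partition.

```python
import random

def random_split(rows, fraction, seed):
    # Assumption: each row goes to the first part independently with
    # probability `fraction`, so the sizes are only approximately 80/20.
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of the SFrame
train_rows, test_rows = random_split(rows, 0.8, seed=1)
```

Using a fixed seed means that running the split again produces the same two parts.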


# Train a sentiment classifier with logistic regression

I am now using logistic regression to create a sentiment classifier on the training data. For any model, we need to specify the features and the target. This model will use the column **word_count** as a feature and the column **sentiment** as the target.

In [11]:
# The model is named sentiment_model. Every model needs a target and some features.
# target = 'sentiment', since I am building a sentiment classifier
# features = ['word_count'], since the words in a review are used to judge its sentiment
# validation_set = None, since the test data defined above is used for evaluation instead
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)

In [12]:
sentiment_model

Class                         : LogisticClassifier

Schema
------
Number of coefficients        : 121713
Number of examples            : 133416
Number of classes             : 2
Number of feature columns     : 1
Number of unpacked features   : 121712

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 6
Solver status                 : TERMINATED: Terminated due to numerical difficulties.
Training time (sec)           : 5.3665

Settings
--------
Log-likelihood                : inf

Highest Positive Coefficients
-----------------------------
word_count[mobileupdate]      : 41.9847
word_count[placeid]           : 41.7354
word_count[labelbox]          : 41.151
word_count[httpwwwamazoncomreviewrhgg6qp7tdnhbrefcmcrprcmtieutf8asinb00318cla0nodeid]: 40.0454
word_count[knobskeeping]      : 36.2091

Lowest Negative Coefficients
------------

In [13]:
weights = sentiment_model.coefficients  # .coefficients gives the learned weights of the trained model
weights.column_names()

['name', 'index', 'class', 'value', 'stderr']

In [14]:
weights

name,index,class,value,stderr
(intercept),,1,1.30337080544,
word_count,recommend,1,0.303815600015,
word_count,highly,1,1.49183015276,
word_count,disappointed,1,-3.95748618393,
word_count,love,1,1.43301685439,
word_count,it,1,0.00986646490307,
word_count,planet,1,-0.797764553926,
word_count,and,1,0.048449573172,
word_count,bags,1,0.165541436615,
word_count,wipes,1,-0.0949937947269,


There are a total of **121713 coefficients** in the model: one for each of the 121712 distinct words, plus the intercept. A positive weight pushes the predicted sentiment towards positive, while a negative weight pushes it towards negative. So every word with [value > 0] makes a review that contains it more likely to be classified as positive, whereas every word with [value < 0] makes it more likely to be classified as negative.
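
As a small illustration of how to read these weights, the sketch below ranks the ten coefficients shown in the table above. The values are copied from that output; the tuple layout and variable names are chosen here for illustration only.

```python
# (name, index, value) rows copied from the coefficient table above
rows = [
    ('(intercept)', None, 1.30337080544),
    ('word_count', 'recommend', 0.303815600015),
    ('word_count', 'highly', 1.49183015276),
    ('word_count', 'disappointed', -3.95748618393),
    ('word_count', 'love', 1.43301685439),
    ('word_count', 'it', 0.00986646490307),
    ('word_count', 'planet', -0.797764553926),
    ('word_count', 'and', 0.048449573172),
    ('word_count', 'bags', 0.165541436615),
    ('word_count', 'wipes', -0.0949937947269),
]

# Keep only the word weights (skip the intercept) and rank them by value
word_weights = {index: value for name, index, value in rows if name == 'word_count'}
ranked = sorted(word_weights, key=word_weights.get)
most_negative = ranked[0]
most_positive = ranked[-1]
```

Among these ten rows, words such as "it" and "and" have weights close to zero, so they barely move the prediction in either direction.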

In [154]:
weights.num_rows() # one weight per distinct word in the training reviews,
                   # plus one for the intercept

121713

In [155]:
num_positive_weights = (weights['value'] >= 0).sum()  # weights equal to 0 are counted as positive here
num_negative_weights = (weights['value'] < 0).sum()

print "Number of positive weights: %s " % num_positive_weights
print "Number of negative weights: %s " % num_negative_weights

Number of positive weights: 68419 
Number of negative weights: 53294 


## Making predictions with logistic regression

Now that the model is trained, I can make predictions on the **test data**. So far I have used only the training data. I will now predict the sentiment of the test samples with the model and compare each prediction to the actual sentiment given in the test data. In this way, we can also calculate the accuracy of our logistic regression model.
Let's take three data points from the test_data, those at indices 10, 11 and 12.

In [26]:
sample_test_data = test_data[10:13] # slice out the rows at indices 10, 11 and 12
print sample_test_data['rating']
sample_test_data

[5.0, 2.0, 1.0]


name,review,rating,word_count,sentiment
Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in ...,5.0,"{'and': 2L, 'all': 1L, 'love': 1L, ...",1
Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The decals ...,2.0,"{'and': 1L, 'wall': 1L, 'them': 1L, 'decals': ...",-1
New Style Trailing Cherry Blossom Tree Decal ...,Was so excited to get this product for my baby ...,1.0,"{'all': 1L, 'money': 1L, 'into': 1L, 'it': 3L, ...",-1


Let's go deeper into the first row of the **sample_test_data**. Here's the full review:

In [27]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

That review seems pretty positive.

Now, let's see what the next row of the **sample_test_data** looks like. As we could guess from the sentiment (-1), the review is quite negative.

In [28]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

We will now make a **class** prediction for the **sample_test_data**. The `sentiment_model` defined above should predict **+1** if the sentiment is positive and **-1** if the sentiment is negative. The first step is to compute the score (margin) $\mathbf{w}^T h(\mathbf{x}_i)$ of each review:

In [29]:
scores = sentiment_model.predict(sample_test_data, output_type='margin')
print scores

[6.734619727060483, -5.734130996760992, -14.668460404469744]
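
To make the score concrete, here is a rough sketch of how a margin could be computed by hand: the intercept plus, for every word in the review, its weight times its count. The weights are a few of the coefficients listed earlier; the two word-count dictionaries are invented toy reviews, and treating unseen words as weight 0 is an assumption of this sketch.

```python
def margin(word_count, weights, intercept):
    # Score = intercept + sum over words of (weight * count).
    # Assumption: words without a learned weight contribute nothing.
    return intercept + sum(weights.get(word, 0.0) * count
                           for word, count in word_count.items())

# A few coefficients copied from the weights table shown earlier
intercept = 1.30337080544
weights = {
    'love': 1.43301685439,
    'disappointed': -3.95748618393,
    'it': 0.00986646490307,
}

# Hypothetical toy reviews, already converted to word counts
positive_review = {'love': 1, 'it': 2}
negative_review = {'disappointed': 1, 'it': 1}

positive_score = margin(positive_review, weights, intercept)
negative_score = margin(negative_review, weights, intercept)
```

The first toy review ends up with a positive score and the second with a negative one, mainly because of the large negative weight on "disappointed".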


### Predicting sentiment

These scores can be used to make class predictions as follows:

$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) > 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

The function below uses the scores to calculate $\hat{y}$, the class predictions:

In [46]:
def class_predictions(scores):      # Defining class prediction function
    preds = []
    for score in scores:
        if score > 0:
            pred = 1
        else:
            pred = -1
        preds.append(pred)
    return preds

In [47]:
class_predictions(scores)

[1, -1, -1]

The following cell verifies that these class predictions are the same as those obtained from GraphLab Create.

In [42]:
print "Class predictions according to GraphLab Create:" 
print sentiment_model.predict(sample_test_data)

Class predictions according to GraphLab Create:
[1L, -1L, -1L]




### Probability predictions

The probability of a prediction expresses how confident the model is in its result.

Recall from the lectures that we can calculate the probability predictions from the scores using:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the **scores** calculated previously, the probability that a sentiment is positive can be computed with the above formula. For each row, the probability should be a number in the range **[0, 1]**.

In [50]:
def calculate_proba(scores):
    proba_preds = []
    for score in scores:
        proba_pred =  1 / (1 + math.exp(-score))
        proba_preds.append(proba_pred)
    return proba_preds

calculate_proba(scores)

[0.9988123848377212, 0.0032232681817983204, 4.261557996650197e-07]
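
One caveat with the direct formula: `math.exp(-score)` raises an `OverflowError` for very negative scores (roughly below -709 with double precision). A possible variant that avoids this is sketched below; the function name `sigmoid` is introduced here and is not part of the notebook above. For negative scores it uses the algebraically equivalent form $e^{s} / (1 + e^{s})$.

```python
import math

def sigmoid(score):
    # For non-negative scores the usual form is safe: exp(-score) is at most 1
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores use exp(score) / (1 + exp(score)), which cannot overflow
    z = math.exp(score)
    return z / (1.0 + z)

# Scores copied from the margin output above
scores = [6.734619727060483, -5.734130996760992, -14.668460404469744]
probabilities = [sigmoid(s) for s in scores]
```

For the three sample scores this agrees with `calculate_proba`; the difference only matters for extreme scores.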

# Finding the most positive and negative reviews

In [55]:
# probability predictions on test_data using the sentiment_model
test_data['proba_pred'] = sentiment_model.predict(test_data, output_type='probability')
test_data

name,review,rating,word_count,sentiment,proba_pred
"Baby Tracker&reg; - Daily Childcare Journal, ...",This has been an easy way for my nanny to record ...,4.0,"{'all': 1L, 'standarad': 1L, 'another': 1L, ...",1,0.758399887752
"Baby Tracker&reg; - Daily Childcare Journal, ...",I love this journal and our nanny uses it ...,4.0,"{'all': 2L, 'nannys': 1L, 'just': 1L, 'sleep': 2L, ...",1,0.999999999966
Nature's Lullabies First Year Sticker Calendar ...,"I love this little calender, you can keep ...",5.0,"{'and': 1L, 'babys': 1L, 'love': 1L, 'like': 1L, ...",1,0.22895097808
Nature's Lullabies Second Year Sticker Calendar ...,"I had a hard time finding a second year calendar, ...",5.0,"{'and': 3L, 'all': 1L, 'months': 1L, ...",1,0.999999558063
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4.0,"{'and': 2L, 'because': 1L, 'family': 1L, ...",1,0.990542169248
"Lamaze Peekaboo, I Love You ...",My son loved this book as an infant. It was ...,5.0,"{'all': 1L, 'being': 1L, 'infant': 1L, 'course': ...",1,0.999999295968
"Lamaze Peekaboo, I Love You ...",Our baby loves this book & has loved it for a ...,5.0,"{'and': 1L, 'own': 1L, 'it': 3L, 'our': 1L, ...",1,0.99976447628
"SoftPlay Giggle Jiggle Funbook, Happy Bear ...",This bear is absolutely adorable and I would ...,2.0,"{'and': 3L, 'cute': 1L, 'rating': 1L, ...",-1,0.722834466283
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,I bought two for recent baby showers! The book ...,5.0,"{'beautiful': 1L, 'and': 2L, 'love': 1L, 'elmo': ...",1,0.999266840896
Baby's First Year Undated Wall Calendar with ...,I searched high and low for a first year cale ...,5.0,"{'remembering': 1L, 'and': 4L, ...",1,0.999786830048


In [162]:
test_data['name','proba_pred'].topk('proba_pred', k=20).print_rows(20) # the 20 reviews with the highest predicted probability of being positive

+-------------------------------+------------+
|              name             | proba_pred |
+-------------------------------+------------+
| Britax Decathlon Convertib... |    1.0     |
| Ameda Purely Yours Breast ... |    1.0     |
| Traveling Toddler Car Seat... |    1.0     |
| Shermag Glider Rocker Comb... |    1.0     |
| Cloud b Sound Machine Soot... |    1.0     |
| JP Lizzy Chocolate Ice Cla... |    1.0     |
| Fisher-Price Rainforest Me... |    1.0     |
| Lilly Gold Sit 'n' Stroll ... |    1.0     |
|  Fisher-Price Deluxe Jumperoo |    1.0     |
| North States Supergate Pre... |    1.0     |
|   Munchkin Mozart Magic Cube  |    1.0     |
| Britax Marathon Convertibl... |    1.0     |
| Wizard Convertible Car Sea... |    1.0     |
|   Capri Stroller - Red Tech   |    1.0     |
| Peg Perego Primo Viaggio C... |    1.0     |
| HALO SleepSack Micro-Fleec... |    1.0     |
| Leachco Snoogle Total Body... |    1.0     |
| Summer Infant Complete Nur... |    1.0     |
| Safety 1st 

Similarly, the prediction probabilities can be used to find the 20 reviews in the **test_data** with the **lowest probability** of being classified as a **positive review**. These are the 20 most negative reviews.

In [61]:
test_data['name','proba_pred'].topk('proba_pred', k=20, reverse=True).print_rows(20)

+-------------------------------+--------------------+
|              name             |     proba_pred     |
+-------------------------------+--------------------+
| Jolly Jumper Arctic Sneak ... | 7.80415068198e-100 |
| Levana Safe N'See Digital ... |  6.8365088551e-25  |
| Snuza Portable Baby Moveme... | 2.12654510822e-24  |
| Fisher-Price Ocean Wonders... | 2.24582080778e-23  |
| VTech Communications Safe ... | 1.32962966148e-22  |
| Safety 1st High-Def Digita... | 2.06872097469e-20  |
| Chicco Cortina KeyFit 30 T... | 5.93881994667e-20  |
| Prince Lionheart Warmies W... | 6.28510016532e-20  |
| Valco Baby Tri-mode Twin S... | 8.05528712682e-20  |
| Adiri BPA Free Natural Nur... | 8.46521724932e-20  |
| Munchkin Nursery Projector... | 1.52853945169e-19  |
| The First Years True Choic... | 1.77901889388e-19  |
| Nuby Natural Touch Silicon... | 1.15227353847e-18  |
| Peg-Perego Tatamia High Ch... | 1.26175666135e-18  |
|    Fisher-Price Royal Potty   | 1.60282966314e-18  |
| Safety 1
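
The two `topk` calls above select the rows with the largest or smallest value in one column. A plain Python sketch of the same selection, using `heapq` from the standard library, might look like the following. The rows are made-up examples, and this `topk` helper is a hypothetical stand-in rather than the SFrame method.

```python
import heapq

def topk(rows, column, k, reverse=False):
    # reverse=False returns the k rows with the largest value in `column`,
    # reverse=True the k rows with the smallest value
    pick = heapq.nsmallest if reverse else heapq.nlargest
    return pick(k, rows, key=lambda row: row[column])

# Hypothetical rows, not taken from the dataset
rows = [
    {'name': 'Product A', 'proba_pred': 0.98},
    {'name': 'Product B', 'proba_pred': 0.02},
    {'name': 'Product C', 'proba_pred': 0.75},
    {'name': 'Product D', 'proba_pred': 0.40},
]

most_positive = topk(rows, 'proba_pred', k=2)
most_negative = topk(rows, 'proba_pred', k=2, reverse=True)
```

Both results come back ordered from the most extreme value inwards, so the first element of `most_negative` is the row with the lowest probability.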

## Computing accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by


$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

This can be computed as follows:

* **Step 1:** Use the trained model to compute class predictions (**Hint:** Use the `predict` method)
* **Step 2:** Count the number of data points when the predicted class labels match the ground truth labels (called `true_labels` below).
* **Step 3:** Divide the total number of correct predictions by the total number of data points in the dataset.

The cells below first check that comparing two SArrays works element-wise, and then define a function that computes the classification accuracy:

In [81]:
# Test SArray comparison
print graphlab.SArray([1,1,1]) == sample_test_data['sentiment']
print sentiment_model.predict(sample_test_data) == sample_test_data['sentiment']

[1L, 0L, 0L]
[1L, 1L, 1L]


In [82]:
def get_classification_accuracy(model, data, true_labels):
    predictions = model.predict(data)
    num_correct = sum(predictions == true_labels)
    accuracy = num_correct/len(data)
    return accuracy

Now, let's compute the classification accuracy of the **sentiment_model** on the **test_data**.

In [83]:
get_classification_accuracy(sentiment_model, test_data, test_data['sentiment'])

0.9145368370530358

Hence the accuracy of the sentiment_model on the test_data is approximately 91.45%.
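
The accuracy formula itself does not depend on GraphLab. As a minimal sketch on plain Python lists (the function name and the length check are additions made here for illustration):

```python
def classification_accuracy(predictions, true_labels):
    # Fraction of positions where the predicted label equals the true label
    if len(predictions) != len(true_labels):
        raise ValueError('predictions and true_labels must have the same length')
    num_correct = sum(1 for p, t in zip(predictions, true_labels) if p == t)
    return num_correct / float(len(true_labels))

# Labels of the three sample reviews used earlier
true_labels = [1, -1, -1]
model_accuracy = classification_accuracy([1, -1, -1], true_labels)
all_positive_accuracy = classification_accuracy([1, 1, 1], true_labels)
```

On the three sample reviews, the model's predictions match every label, while predicting +1 for everything matches only the first one.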