# Predicting sentiment from product reviews

The goal of this assignment is to explore logistic regression and feature engineering with pandas and scikit-learn.

In this assignment, you will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative. You will:

   * Use pandas DataFrames to do some feature engineering.
   * Train a logistic regression model to predict the sentiment of product reviews.
   * Inspect the weights (coefficients) of a trained logistic regression model.
   * Make a prediction (both class and probability) of sentiment for a new product review.
   * Given the logistic regression weights, predictors and ground truth labels, write a function to compute the accuracy of the model.
   * Inspect the coefficients of the logistic regression model and interpret their meanings.
   * Compare multiple logistic regression models.

1. Import libraries

In [148]:
# import the required modules
import numpy as np
import pandas as pd

#### Load Amazon data

In [2]:
data = pd.read_csv('amazon_baby.csv')

In [3]:
data.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


2. Perform text cleaning

In [5]:
import string

def remove_punctuation(text):
    # str() also turns a missing review (NaN) into the literal string 'nan'
    return str(text).translate(str.maketrans('', '', string.punctuation))

data['review_clean'] = data['review'].apply(remove_punctuation)

In [149]:
pd.set_option('max_colwidth', None)
data.head(1)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.,5,it came early and was not disappointed i love planet wise bags and now my wipe holder it keps my osocozy wipes moist and does not leak highly recommend it,1


In [150]:
# check whether there are null values in the data
data.isnull().sum()

name            296
review            0
rating            0
review_clean      0
sentiment         0
dtype: int64

In [151]:
# Fill missing (NaN) reviews with an empty string
data = data.fillna({'review':''})

In [9]:
data.isnull().sum()

name            318
review            0
rating            0
review_clean      0
dtype: int64

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   name          183213 non-null  object
 1   review        183531 non-null  object
 2   rating        183531 non-null  int64 
 3   review_clean  183531 non-null  object
dtypes: int64(1), object(3)
memory usage: 5.6+ MB


#### Extract Sentiment

3. We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.

In [153]:
# remove reviews with a rating of 3
data = data[data['rating'] != 3]

4. Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with a rating of 2 or lower are negative.

In [12]:
data['sentiment'] = data.rating.apply(lambda x: +1 if x>=4 else -1)

In [154]:
pd.set_option('max_colwidth', 50)
data['sentiment'].value_counts()

 1    140259
-1     26493
Name: sentiment, dtype: int64

In [155]:
data.head(1)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1


In [115]:
# from sklearn.model_selection import train_test_split
# # np.random.rand(1)
# train_data, test_data = train_test_split(data, test_size = 0.2, random_state = 1)

In [101]:
# np.random.rand(1)
# msk = np.random.rand(len(data)) < 0.8
# train = data[msk]
# test = data[~msk]

### Split into training and test sets

5. Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set.

In [157]:
# load indices
import json
with open('module-2-assignment-test-idx.json') as test_data_file:    
    test_data_idx = json.load(test_data_file)
with open('module-2-assignment-train-idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

print(train_data_idx[:3])
print(test_data_idx[:3])

[0, 1, 2]
[8, 9, 14]


In [127]:
# train_data_index = pd.read_json('module-2-assignment-train-idx.json')
# test_data_index = pd.read_json('module-2-assignment-test-idx.json')

In [158]:
# split train and test according to given indices
train_data = data.iloc[train_data_idx]
test_data = data.iloc[test_data_idx]

In [159]:
train_data.shape

(133416, 5)

### Build the word count vector for each review 

6. We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as *bag-of-words features*.

General steps for extracting word count vectors are as follows:

   * Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
   * Compute the occurrences of the words in each review and collect them into a row vector.
   * Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
   * Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.
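
The steps above can be sketched without any library on a toy corpus. This is only an illustration, not how `CountVectorizer` is implemented: the helper names (`tokenize`, `learn_vocabulary`, `count_vector`) and the toy reviews are made up, and the rows are plain dense lists rather than a sparse matrix.

```python
import re
from collections import Counter

# Token pattern used in this notebook's CountVectorizer cell: keeps single-letter words
TOKEN = re.compile(r"\b\w+\b")


def tokenize(text):
    # CountVectorizer lowercases by default, so the sketch does the same
    return TOKEN.findall(text.lower())


def learn_vocabulary(documents):
    # Map every distinct training word to a column index (sorted for stability)
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: column for column, word in enumerate(words)}


def count_vector(text, vocabulary):
    # One row of the word-count matrix; words outside the vocabulary are ignored
    row = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:
            row[vocabulary[word]] = count
    return row


toy_train = ["great product great price", "broke after a week"]  # made-up reviews
toy_test = ["great week terrible price"]                         # 'terrible' is unseen

vocabulary = learn_vocabulary(toy_train)
toy_train_rows = [count_vector(doc, vocabulary) for doc in toy_train]
toy_test_rows = [count_vector(doc, vocabulary) for doc in toy_test]

print(vocabulary)
print(toy_train_rows)  # [[0, 0, 0, 2, 1, 1, 0], [1, 1, 1, 0, 0, 0, 1]]
print(toy_test_rows)   # [[0, 0, 0, 1, 1, 0, 1]]
```

Note that the test row has the same number of columns as the training rows, and the unseen word `terrible` contributes nothing, which is why the vocabulary must be learned from the training data only.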

In [133]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [134]:
train_matrix

<133416x121713 sparse matrix of type '<class 'numpy.int64'>'
	with 7327230 stored elements in Compressed Sparse Row format>

### Train a sentiment classifier with logistic regression

7. Learn a logistic regression classifier using the training data.

In [103]:
from sklearn.linear_model import LogisticRegression

In [147]:
sentiment_model = LogisticRegression(max_iter = 5000)
sentiment_model.fit(train_matrix, train_data['sentiment'])

LogisticRegression(max_iter=5000)

8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights $w_j$ correspond to words that indicate positive sentiment, while negative weights correspond to words that indicate negative sentiment.

Calculate the number of positive (>= 0, which is actually nonnegative) coefficients. 

Quiz question 1 : How many weights are >= 0?

In [144]:
np.sum(sentiment_model.coef_ >=0)

91143

### Question 1
How many weights are greater than or equal to 0?

Ans = 91143

### Making predictions with logistic regression

9. Now make predictions on the test data.

In [160]:
sample_test_data = test_data[10:13]
sample_test_data

Unnamed: 0,name,review,rating,review_clean,sentiment
59,Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in...,5,Absolutely love it and all of the Scripture in...,1
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The dec...,2,Would not purchase again or recommend The deca...,-1
91,New Style Trailing Cherry Blossom Tree Decal R...,Was so excited to get this product for my baby...,1,Was so excited to get this product for my baby...,-1


In [172]:
sample_test_data['review'][59]

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [173]:
sample_test_data['review'][71]

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

10. We will now make a class prediction for the sample_test_data. 

In [176]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
scores

array([  5.58698191,  -3.1829204 , -10.43157565])

### Predicting Sentiment

11. These scores can be used to make class predictions as follows:

$\hat{y}_i = +1$ if $\mathbf{w}^\top h(\mathbf{x}_i) > 0$

$\hat{y}_i = -1$ if $\mathbf{w}^\top h(\mathbf{x}_i) \le 0$

In [184]:
scores

array([  5.58698191,  -3.1829204 , -10.43157565])
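
The cell above only re-displays the scores, so here is a minimal sketch of the thresholding rule itself. The score values are copied from the output above. The names `sample_scores` and `predict_class` are illustrative and not part of the assignment.

```python
# Scores copied from the decision_function output above
sample_scores = [5.58698191, -3.1829204, -10.43157565]


def predict_class(score):
    # +1 when the score is strictly positive, otherwise -1
    return 1 if score > 0 else -1


class_predictions = [predict_class(score) for score in sample_scores]
print(class_predictions)  # [1, -1, -1]
```

These labels agree with the `sentiment` column of `sample_test_data` shown earlier. Calling `sentiment_model.predict(sample_test_matrix)` in the notebook should give the same three labels.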

### Probability Predictions

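12. Calculate the probability predictions from the scores using the sigmoid (logistic) function: $P(y_i = +1 \mid \mathbf{x}_i, \mathbf{w}) = \dfrac{1}{1 + \exp(-\mathbf{w}^\top h(\mathbf{x}_i))}$.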

In [181]:
p = []
for score in scores:
    # the sigmoid of the score is the probability of the positive class
    p.append(1/(1+np.exp(-score)))

In [183]:
# Probability
p

[0.9962676649615883, 0.03981354150234166, 2.9485700521919945e-05]

Checkpoint: Make sure your probability predictions match the ones obtained from sentiment_model.

In [190]:
sentiment_model.predict_proba(sample_test_matrix)

array([[3.73233504e-03, 9.96267665e-01],
       [9.60186458e-01, 3.98135415e-02],
       [9.99970514e-01, 2.94857005e-05]])
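
The checkpoint can also be automated instead of compared by eye. The sketch below assumes the second column of the `predict_proba` output is the probability of the +1 class, which matches the values printed above. The scores and probabilities are copied from this notebook's outputs. The tolerance of `1e-6` is an arbitrary choice that allows for the rounding in the printed numbers.

```python
import math


def sigmoid(score):
    # Logistic function: maps a real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))


# Values copied from the outputs above
sample_scores = [5.58698191, -3.1829204, -10.43157565]
reported_positive_probability = [9.96267665e-01, 3.98135415e-02, 2.94857005e-05]

computed = [sigmoid(score) for score in sample_scores]
matches = [
    math.isclose(ours, reported, rel_tol=1e-6)
    for ours, reported in zip(computed, reported_positive_probability)
]
print(matches)  # [True, True, True]
```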

#### Quiz question: 2
Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

Ans = Third

### Find the most positive (and negative) review

13. We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points.

In [192]:
test_matrix

<33336x121713 sparse matrix of type '<class 'numpy.int64'>'
	with 1817073 stored elements in Compressed Sparse Row format>

In [218]:
test_scores = sentiment_model.decision_function(test_matrix)
test_scores

array([ 1.2769364 , 14.11412158,  2.60359001, ..., 12.16459142,
       12.95557298,  3.98220291])

In [228]:
max_positive_idx = np.argsort(-test_scores)

In [229]:
max_positive_idx[:20]

array([18112, 15732, 24286, 25554, 24899,  9125, 21531, 32782, 14482,
       30535,  9555, 30634, 17558, 26830, 11923, 20743,  4140, 30076,
       33060, 26838])

In [230]:
test_scores[max_positive_idx]

array([ 53.92399591,  52.39795   ,  48.75925227, ..., -30.11931041,
       -33.90267007, -34.61875371])

In [241]:
test_data.iloc[max_positive_idx[:20]]

Unnamed: 0,name,review,rating,review_clean,sentiment
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",I bought this carrier when my daughter was abo...,5,I bought this carrier when my daughter was abo...,1
87017,Baby Einstein Around The World Discovery Center,I am so HAPPY I brought this item for my 7 mon...,5,I am so HAPPY I brought this item for my 7 mon...,1
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,I got this stroller for my daughter prior to t...,1
140816,"Diono RadianRXT Convertible Car Seat, Plum",I bought this seat for my tall (38in) and thin...,5,I bought this seat for my tall 38in and thin 2...,1
137034,Graco Pack 'n Play Element Playard - Flint,My husband and I assembled this Pack n' Play l...,4,My husband and I assembled this Pack n Play la...,1
50315,"P'Kolino Silly Soft Seating in Tias, Green",I've purchased both the P'Kolino Little Reader...,4,Ive purchased both the PKolino Little Reader C...,1
119182,Roan Rocco Classic Pram Stroller 2-in-1 with B...,Great Pram Rocco!!!!!!I bought this pram from ...,5,Great Pram RoccoI bought this pram from Europe...,1
180646,Mamas &amp; Papas 2014 Urbo2 Stroller - Black,After much research I purchased an Urbo2. It's...,4,After much research I purchased an Urbo2 Its e...,1
80155,"Simple Wishes Hands-Free Breastpump Bra, Pink,...","I just tried this hands free breastpump bra, a...",5,I just tried this hands free breastpump bra an...,1
168081,Buttons Cloth Diaper Cover - One Size - 8 Colo...,"We are big Best Bottoms fans here, but I wante...",4,We are big Best Bottoms fans here but I wanted...,1


#### Quiz Question: 3 
Which of the following products are represented in the 20 most positive reviews?

### Quiz Question 4
Which of the following products are represented in the 20 most negative reviews?

In [242]:
max_negative_idx = np.argsort(test_scores)
max_negative_idx

array([ 2931, 21700, 13939, ..., 24286, 15732, 18112])

In [244]:
test_scores[max_negative_idx]

array([-34.61875371, -33.90267007, -30.11931041, ...,  48.75925227,
        52.39795   ,  53.92399591])

In [243]:
test_data.iloc[max_negative_idx[:20]]

Unnamed: 0,name,review,rating,review_clean,sentiment
16042,Fisher-Price Ocean Wonders Aquarium Bouncer,We have not had ANY luck with Fisher-Price pro...,2,We have not had ANY luck with FisherPrice prod...,-1
120209,Levana Safe N'See Digital Video Baby Monitor w...,This is the first review I have ever written o...,1,This is the first review I have ever written o...,-1
77072,Safety 1st Exchangeable Tip 3 in 1 Thermometer,I thought it sounded great to have different t...,1,I thought it sounded great to have different t...,-1
48694,Adiri BPA Free Natural Nurser Ultimate Bottle ...,I will try to write an objective review of the...,2,I will try to write an objective review of the...,-1
155287,VTech Communications Safe &amp; Sounds Full Co...,"This is my second video monitoring system, the...",1,This is my second video monitoring system the ...,-1
94560,The First Years True Choice P400 Premium Digit...,Note: we never installed batteries in these un...,1,Note we never installed batteries in these uni...,-1
53207,Safety 1st High-Def Digital Monitor,We bought this baby monitor to replace a diffe...,1,We bought this baby monitor to replace a diffe...,-1
81332,Cloth Diaper Sprayer--styles may vary,I bought this sprayer out of desperation durin...,1,I bought this sprayer out of desperation durin...,-1
113995,Motorola Digital Video Baby Monitor with Room ...,DO NOT BUY THIS BABY MONITOR!I purchased this ...,1,DO NOT BUY THIS BABY MONITORI purchased this m...,-1
10677,Philips AVENT Newborn Starter Set,"It's 3am in the morning and needless to say, t...",1,Its 3am in the morning and needless to say thi...,-1


### Compute accuracy of the classifier

15. We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by

$\text{accuracy} = \dfrac{\text{number of correctly classified examples}}{\text{total number of examples}}$

This can be computed as follows:

   * Step 1: Use the sentiment_model to compute class predictions.
   * Step 2: Count the number of data points when the predicted class labels match the ground truth labels.
   * Step 3: Divide the total number of correct predictions by the total number of data points in the dataset.
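
Steps 2 and 3 can be wrapped in one small reusable function. This is a sketch: the name `accuracy` and the toy labels are illustrative. It takes the predictions from Step 1 as input, so it does not depend on any particular model.

```python
def accuracy(predicted_labels, true_labels):
    # Accept any iterable (list, NumPy array, pandas Series)
    predicted_labels = list(predicted_labels)
    true_labels = list(true_labels)
    if len(predicted_labels) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    # Step 2: count the predictions that match the ground truth
    correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    # Step 3: divide by the total number of data points
    return correct / len(true_labels)


toy_predictions = [1, 1, -1, 1]   # made-up labels for illustration
toy_truth = [1, -1, -1, 1]
print(accuracy(toy_predictions, toy_truth))  # 0.75
```

In this notebook the call would be `accuracy(sentiment_model.predict(test_matrix), test_data['sentiment'])`.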

In [248]:
# prediction on test data
# step 1: compute class predictions
predict = sentiment_model.predict(test_matrix)

In [249]:
predict

array([1, 1, 1, ..., 1, 1, 1])

In [251]:
# step 2: count the predictions that match the ground-truth labels
truth_labels = np.sum(predict == test_data['sentiment'])
truth_labels

31084

In [254]:
total_labels = len(test_data)
total_labels

33336

In [256]:
# step 3: calculate accuracy
accuracy = truth_labels/total_labels
print(accuracy)

0.9324454043676506


### Quiz Question: 5.
What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places 

ans = 0.93

### Learn another classifier with fewer words

16. There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of the words that occur in the reviews.

In [257]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [258]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

### Train a logistic regression model on a subset of data

17. Now build a logistic regression classifier with train_matrix_word_subset as features and sentiment as the target. Call this model simple_model.

In [259]:
simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])

LogisticRegression()

18. Let us inspect the weights (coefficients) of the simple_model. First, build a table to store (word, coefficient) pairs. 

In [280]:
simple_model.coef_.flatten()

array([ 1.36369679,  0.94395038,  1.19221941,  0.08542375,  0.52017372,
        1.51026262,  1.67326913,  0.50375976,  0.19093732,  0.05881344,
       -1.65214402, -0.20934844, -0.51145646, -2.03448908, -2.34847753,
       -0.62130739, -0.32049066, -0.89806176, -0.36215714, -2.10981455])

In [283]:
simple_model_coef_table = pd.DataFrame({'word':significant_words,
              'coefficient': simple_model.coef_.flatten() })

In [284]:
#Sort the data frame by the coefficient value in descending order
simple_model_coef_table.sort_values('coefficient', ascending = False)

Unnamed: 0,word,coefficient
6,loves,1.673269
5,perfect,1.510263
0,love,1.363697
2,easy,1.192219
1,great,0.94395
4,little,0.520174
7,well,0.50376
8,able,0.190937
3,old,0.085424
9,car,0.058813


### Quiz Question: 7
Consider the coefficients of simple_model. How many of the 20 coefficients (corresponding to the 20 significant_words) are positive for the simple_model?

In [286]:
np.sum(simple_model.coef_>0)

10

### Quiz Question: 8
Are the positive words in the simple_model also positive words in the sentiment_model?

Ans = Yes
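
No cell in this notebook backs up that answer, so one way to check it is to look up the same words in the larger model. The helper below is a hypothetical sketch: `coefficients_for_words` and the toy vocabulary and weights are made up for illustration.

```python
def coefficients_for_words(words, vocabulary, coefficients):
    # vocabulary maps word -> column index; coefficients is a flat sequence
    # indexed by column. Words missing from the vocabulary are skipped.
    return {
        word: coefficients[vocabulary[word]]
        for word in words
        if word in vocabulary
    }


toy_vocabulary = {"bad": 0, "great": 1, "love": 2}   # made-up mapping
toy_coefficients = [-1.2, 0.8, 1.5]                  # made-up weights

looked_up = coefficients_for_words(
    ["love", "great", "missing"], toy_vocabulary, toy_coefficients
)
all_positive = all(value > 0 for value in looked_up.values())
print(looked_up)     # {'love': 1.5, 'great': 0.8}
print(all_positive)  # True
```

In the notebook, the equivalent call would pass the ten words with positive `simple_model` coefficients, `vectorizer.vocabulary_` as the mapping, and `sentiment_model.coef_.flatten()` as the coefficients.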

### Comparing models

19. We will now compare the accuracy of the **sentiment_model** and the **simple_model**.

First, compute the classification accuracy of the sentiment_model on the train_data.

Now, compute the classification accuracy of the simple_model on the train_data.

In [290]:
predict_train = sentiment_model.predict(train_matrix)
predict_train

array([1, 1, 1, ..., 1, 1, 1])

In [293]:
truth_labels_train = np.sum(predict_train == train_data['sentiment'])
# use len(train_data); `train` is only defined in the commented-out random split above
total_train_labels = len(train_data)
accuracy_train = truth_labels_train/total_train_labels
print(total_train_labels)
print(round(accuracy_train, 4))

133416
0.9686


In [297]:
predict_simple_train = simple_model.predict(train_matrix_word_subset)
# predict_simple_train

In [303]:
truth_simple_labels = np.sum(predict_simple_train == train_data['sentiment'])
total_simple_labels = len(train_data)
accuracy__simple_train = truth_simple_labels/total_simple_labels
# print the denominator actually used for this accuracy
print(total_simple_labels)
print(accuracy__simple_train)

133416
0.8668225700065959


### Quiz Question: 9
Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

Ans - sentiment_model

20. Now, we will repeat this exercise on the test_data. 

Start by computing the classification accuracy of the sentiment_model on the test_data.

In [306]:
# classification accuracy of the sentiment_model on the test_data.
test_predict = sentiment_model.predict(test_matrix)
test_truth_label = np.sum(test_predict == test_data['sentiment'])
test_labels = len(test_data)
sent_test_accuracy = test_truth_label/test_labels
print(sent_test_accuracy)
# classification accuracy of the simple_model on the test_data.
test_simple_predict = simple_model.predict(test_matrix_word_subset)
test_simple_truth_label = np.sum(test_simple_predict == test_data['sentiment'])
test_labels = len(test_data)
simple_test_accuracy = test_simple_truth_label/test_labels
print(simple_test_accuracy)

0.9324454043676506
0.8693604511639069


### Quiz Question: 10
Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?
 
Ans - sentiment_model

### Baseline: Majority class prediction

In [307]:
positive_label = len(test_data[test_data['sentiment']>0])
negative_label = len(test_data[test_data['sentiment']<0])
print("positive_label is {}, negative_label is {}".format(positive_label, negative_label))

positive_label is 28095, negative_label is 5241


In [309]:
Majority_class_accuracy = positive_label*1./(positive_label+negative_label)
Majority_class_accuracy

0.8427825773938085
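
A majority class classifier ignores the review text and always predicts the most common label, so its accuracy equals the share of that label in the data. The sketch below rebuilds the test labels from the counts printed above (28095 positive, 5241 negative). The function name `majority_class_accuracy` is illustrative.

```python
from collections import Counter


def majority_class_accuracy(labels):
    # Always predicting the most frequent label is right exactly as often
    # as that label occurs in the data
    labels = list(labels)
    if not labels:
        raise ValueError("cannot compute a baseline on an empty dataset")
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Labels rebuilt from the counts reported in the cell above
example_labels = [1] * 28095 + [-1] * 5241
label, baseline = majority_class_accuracy(example_labels)
print(label, round(baseline, 2))  # 1 0.84
```

Any trained model should beat this number to be considered useful. Here the sentiment_model's test accuracy of about 0.93 is well above the 0.84 baseline.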

### Quiz Question: 11
Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).

Ans 0.84

### Quiz Question: 12
Is the sentiment_model definitely better than the majority class classifier (the baseline)?

YES