# Predicting sentiment from product reviews

The goal of this assignment is to explore logistic regression and feature engineering. The assignment text refers to Turi Create, but this notebook uses pandas and scikit-learn instead.

In this assignment, you will use product review data from Amazon.com to predict whether the sentiment expressed in a product review is positive or negative. You will:

   * Use SFrames to do some feature engineering.
   * Train a logistic regression model to predict the sentiment of product reviews.
   * Inspect the weights (coefficients) of a trained logistic regression model.
   * Make a prediction (both class and probability) of sentiment for a new product review.
   * Given the logistic regression weights, predictors and ground truth labels, write a function to compute the accuracy of the model.
   * Inspect the coefficients of the logistic regression model and interpret their meanings.
   * Compare multiple logistic regression models.

1. Import libraries

In [148]:
# import the required modules
import numpy as np
import pandas as pd

#### Load Amazon data

In [2]:
data = pd.read_csv('amazon_baby.csv')

In [3]:
data.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


2. Perform text cleaning

In [5]:
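import string

def remove_punctuation(text):
    # str() guards against non-string values; note that a missing review (NaN) becomes the literal string 'nan'
    return str(text).translate(str.maketrans('', '', string.punctuation))

# Strip punctuation from every review
data['review_clean'] = data['review'].apply(remove_punctuation)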
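
As a quick illustration of what this cleaning step does, here is a minimal sketch on a made-up string (`sample` is invented for this example and is not taken from the dataset). It also shows what happens to a missing value, since `str()` turns NaN into the text `nan`.

```python
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    return str(text).translate(str.maketrans('', '', string.punctuation))

# Invented sample review, for illustration only
sample = "Great product! Didn't leak, highly recommend."
cleaned = remove_punctuation(sample)
print(cleaned)  # Great product Didnt leak highly recommend

# A missing review (NaN) is converted to the string 'nan' by str()
missing = remove_punctuation(float('nan'))
print(missing)  # nan
```

Note that apostrophes are removed as well, so "Didn't" becomes "Didnt" and is treated as a single token later on.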

In [149]:
pd.set_option('max_colwidth', None)
data.head(1)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love planet wise bags and now my wipe holder. it keps my osocozy wipes moist and does not leak. highly recommend it.,5,it came early and was not disappointed i love planet wise bags and now my wipe holder it keps my osocozy wipes moist and does not leak highly recommend it,1


In [150]:
# check whether there are null values in the data
data.isnull().sum()

name            296
review            0
rating            0
review_clean      0
sentiment         0
dtype: int64

In [151]:
# Fill missing (NaN) reviews with an empty string
data = data.fillna({'review':''})

In [9]:
data.isnull().sum()

name            318
review            0
rating            0
review_clean      0
dtype: int64

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   name          183213 non-null  object
 1   review        183531 non-null  object
 2   rating        183531 non-null  int64 
 3   review_clean  183531 non-null  object
dtypes: int64(1), object(3)
memory usage: 5.6+ MB


#### Extract Sentiment

3. We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.

In [153]:
# drop reviews with a rating of 3
data = data[data['rating'] != 3]

4. Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with a rating of 2 or lower are negative.

In [12]:
data['sentiment'] = data.rating.apply(lambda x: +1 if x>=4 else -1)

In [154]:
pd.set_option('max_colwidth', 50)
data['sentiment'].value_counts()

 1    140259
-1     26493
Name: sentiment, dtype: int64
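
The filtering and labelling rules from steps 3 and 4 can be sketched on a small list of ratings. The values in `ratings` are invented for illustration; only the two class counts at the end are copied from the output above.

```python
# Toy ratings, invented for illustration only
ratings = [5, 3, 1, 4, 2, 3]

# Step 3: drop neutral reviews (rating == 3)
kept = [r for r in ratings if r != 3]

# Step 4: map the remaining ratings to +1 (positive) or -1 (negative)
sentiments = [+1 if r >= 4 else -1 for r in kept]

print(kept)        # [5, 1, 4, 2]
print(sentiments)  # [1, -1, 1, -1]

# Class balance, using the counts reported by value_counts() above
n_positive, n_negative = 140259, 26493
positive_fraction = n_positive / (n_positive + n_negative)
print(round(positive_fraction, 3))  # 0.841
```

Roughly 84% of the remaining reviews are positive, so the classes are imbalanced. This is worth keeping in mind when judging accuracy later.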

In [155]:
data.head(1)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1


In [115]:
# from sklearn.model_selection import train_test_split
# # np.random.rand(1)
# train_data, test_data = train_test_split(data, test_size = 0.2, random_state = 1)

In [101]:
# np.random.rand(1)
# msk = np.random.rand(len(data)) < 0.8
# train = data[msk]
# test = data[~msk]

#### Split into training and test sets

5. Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set.

In [157]:
# load the train/test indices
import json
with open('module-2-assignment-test-idx.json') as test_data_file:    
    test_data_idx = json.load(test_data_file)
with open('module-2-assignment-train-idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

print(train_data_idx[:3])
print(test_data_idx[:3])

[0, 1, 2]
[8, 9, 14]


In [127]:
# train_data_index = pd.read_json('module-2-assignment-train-idx.json')
# test_data_index = pd.read_json('module-2-assignment-test-idx.json')

In [158]:
# split train and test according to given indices
train_data = data.iloc[train_data_idx]
test_data = data.iloc[test_data_idx]

In [159]:
train_data.shape

(133416, 5)
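
When a split is defined by precomputed index lists, a quick sanity check is to confirm that the two lists do not overlap and that the proportions are as expected. The sketch below uses small hypothetical lists (`train_idx`, `test_idx`) in place of the JSON files; the final ratio uses the row counts shown in the outputs above.

```python
# Hypothetical index lists standing in for the JSON files (toy values)
train_idx = [0, 1, 2, 3, 5, 6, 7, 10]
test_idx = [4, 9]

# No row should appear in both sets
overlap = set(train_idx) & set(test_idx)
print(overlap)  # set()

# Fraction of rows assigned to the training set
train_fraction = len(train_idx) / (len(train_idx) + len(test_idx))
print(train_fraction)  # 0.8

# The same ratio with the notebook's numbers: 133416 training rows
# out of 140259 + 26493 labelled reviews
notebook_fraction = 133416 / (140259 + 26493)
print(round(notebook_fraction, 4))  # 0.8001
```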

#### Build the word count vector for each review 

6. We will now compute the count of each word that appears in the reviews. A vector of word counts is often referred to as *bag-of-words features*.

General steps for extracting word count vectors are as follows:

   * Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
   * Compute the occurrences of the words in each review and collect them into a row vector.
   * Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix `train_matrix`.
   * Using the same mapping between words and columns, convert the test data into a sparse matrix `test_matrix`.
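
To make these steps concrete, here is a minimal plain-Python sketch on toy documents. It is not how `CountVectorizer` is implemented: it builds dense lists rather than a sparse matrix, and the names `tokenize`, `vocab` and `to_row` are made up for this example. It lowercases the text and uses the token pattern `\b\w+\b`, which keeps single-letter words.

```python
import re
from collections import Counter

TOKEN = re.compile(r'\b\w+\b')  # also matches single-letter words

def tokenize(text):
    return TOKEN.findall(text.lower())

# Toy documents, invented for illustration
train_docs = ["i love it", "it broke it leaked"]
test_docs = ["love it a lot"]

# Step 1: learn the vocabulary from the training data only.
# Each word gets a column index (alphabetical order here).
train_words = {w for doc in train_docs for w in tokenize(doc)}
vocab = {w: j for j, w in enumerate(sorted(train_words))}

def to_row(doc):
    # Step 2: count word occurrences and lay them out by column index.
    # Words missing from the vocabulary are silently ignored.
    counts = Counter(tokenize(doc))
    return [counts.get(w, 0) for w in vocab]

# Steps 3 and 4: one row per review, same word-column mapping for both sets
train_rows = [to_row(d) for d in train_docs]
test_rows = [to_row(d) for d in test_docs]

print(vocab)       # {'broke': 0, 'i': 1, 'it': 2, 'leaked': 3, 'love': 4}
print(train_rows)  # [[0, 1, 1, 0, 1], [1, 0, 2, 1, 0]]
print(test_rows)   # [[0, 0, 1, 0, 1]]
```

The test review contains "a" and "lot", which never appear in the training data, so they contribute nothing to its row.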

In [133]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn the vocabulary from the training data, assign a column to each word,
# and convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [134]:
train_matrix

<133416x121713 sparse matrix of type '<class 'numpy.int64'>'
	with 7327230 stored elements in Compressed Sparse Row format>

#### Train a sentiment classifier with logistic regression

7. Learn a logistic regression classifier using the training data.

In [103]:
from sklearn.linear_model import LogisticRegression

In [147]:
sentiment_model = LogisticRegression(max_iter = 5000)
sentiment_model.fit(train_matrix, train_data['sentiment'])

LogisticRegression(max_iter=5000)

8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that a positive weight w_j means word j contributes toward positive sentiment, while a negative weight contributes toward negative sentiment.

Calculate the number of nonnegative (>= 0) coefficients.

Quiz question 1: How many weights are >= 0?

In [144]:
np.sum(sentiment_model.coef_ >=0)

91143
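
The distinction between "positive" and "nonnegative" matters because `>= 0` also counts weights that are exactly zero. A small sketch with a toy array (the values in `coef` are invented; for a binary scikit-learn logistic regression, `coef_` has shape `(1, n_features)`):

```python
import numpy as np

# Toy coefficient array shaped like coef_ for a binary model: (1, n_features)
coef = np.array([[0.7, -1.2, 0.0, 0.3, -0.05]])

# Summing a boolean array counts its True entries
n_nonnegative = int(np.sum(coef >= 0))  # includes the zero weight
n_positive = int(np.sum(coef > 0))      # strictly positive only
n_zero = int(np.sum(coef == 0))

print(n_nonnegative, n_positive, n_zero)  # 3 2 1
```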

#### Making predictions with logistic regression

9. Now that a model is trained, we can make predictions on the test data.
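
In logistic regression, the predicted probability of the positive class is the sigmoid of the linear score, P(y = +1 | x) = 1 / (1 + exp(-(b + w·x))), and the predicted class is +1 when that probability exceeds 0.5. The sketch below applies this by hand; the weights, intercept and word counts are hypothetical and are not the fitted values from `sentiment_model`.

```python
import math

def predict_proba_positive(weights, intercept, counts):
    # Linear score: intercept plus the weighted sum of word counts
    score = intercept + sum(w * x for w, x in zip(weights, counts))
    # Sigmoid maps the score to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical weights for the words "love", "broke", "it"
weights = [1.5, -2.0, 0.0]
intercept = 0.0

# Word counts for the toy review "love it"
counts = [1, 0, 1]

p = predict_proba_positive(weights, intercept, counts)
label = +1 if p > 0.5 else -1

print(round(p, 4))  # 0.8176
print(label)        # 1
```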