# Predicting sentiment from product reviews
The goal of this assignment is to explore logistic regression and feature engineering with existing scikit-learn functions.

In this assignment, you will use product review data from Amazon.com to predict whether a review expresses positive or negative sentiment about a product. You will:

* Use scikit-learn pipelines.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, compute the accuracy of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.

## Imports

In [163]:
import json  # needed below to load the train/test index files
import os
import string
import numpy as np
import pandas as pd

In [164]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

## Data
For this notebook, we will use the `Amazon Product Review` dataset.

In [167]:
#change to data directory
os.chdir('/Users/wbailey7/Courses/u-wash-machine-learning/classification/data')

In [168]:
#read in product review data
products = pd.read_csv('amazon_baby.csv')

In [169]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB


In [170]:
products.describe(include=object).T.iloc[:10] # All object cols

Unnamed: 0,count,unique,top,freq
name,183213,32417,Vulli Sophie the Giraffe Teether,785
review,182702,182642,very nice,5


In [171]:
products.describe().T.iloc[:10] # All numeric cols

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rating,183531.0,4.120448,1.285017,1.0,4.0,5.0,5.0,5.0


In [172]:
#replace null values with empty string
products = products.fillna({'review':''})

In [173]:
#remove punctuation
def remove_punctuation(text):
    # translation table that deletes every character in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

products["review_clean"] = products['review'].apply(remove_punctuation)
products = products[["name","review_clean","rating"]]
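To make the cleaning step concrete, here is a minimal sketch of what `remove_punctuation` does to a single string. The example sentence is made up for illustration and is not taken from the dataset.

```python
import string

def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character.
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

# Hypothetical review text, for illustration only.
example = "Great product! Didn't expect it to work so well..."
cleaned = remove_punctuation(example)
print(cleaned)
```

Note that apostrophes are removed too, so a contraction such as "Didn't" becomes the single token "Didnt" rather than two words.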

In [174]:
#ignore all reviews with rating = 3, since they tend to have a neutral sentiment
products = products[products["rating"]!=3].reset_index(drop=True)

In [175]:
products

Unnamed: 0,name,review_clean,rating
0,Planetwise Wipe Pouch,it came early and was not disappointed i love ...,5
1,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
2,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase I h...,5
3,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried nonstop when I tried...,5
4,Stop Pacifier Sucking without tears with Thumb...,When the Binky Fairy came to our house we didn...,5
...,...,...,...
166747,Baby Teething Necklace for Mom Pretty Donut Sh...,Such a great idea very handy to have and look ...,5
166748,Baby Teething Necklace for Mom Pretty Donut Sh...,This product rocks It is a great blend of fun...,5
166749,Abstract 2 PK Baby / Toddler Training Cup (Pink),This item looks great and cool for my kidsI kn...,5
166750,"Baby Food Freezer Tray - Bacteria Resistant, B...",I am extremely happy with this product I have ...,5


In [176]:
# Reviews with a rating of 4 or higher are treated as positive, and reviews with a rating of 2
# or lower as negative. In the sentiment column, +1 is the positive class label and -1 is
# the negative class label.
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
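The labelling rule can be checked on a tiny frame. The sketch below uses a made-up `toy` DataFrame and `np.where` as a vectorized alternative to `apply`; this is only one possible way to write the rule, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings (3-star reviews are assumed to be filtered out already).
toy = pd.DataFrame({"rating": [1, 2, 4, 5]})

# +1 for ratings above 3, -1 otherwise.
toy["sentiment"] = np.where(toy["rating"] > 3, 1, -1)
print(toy)
```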

In [177]:
#load the train/test row indices and split the data
with open('module-2-assignment-test-idx.json') as test_data_file:    
    test_data_idx = json.load(test_data_file)
with open('module-2-assignment-train-idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

train_data = products.iloc[train_data_idx]
test_data = products.iloc[test_data_idx]

train_data = train_data[train_data["rating"]!=3].reset_index(drop=True)

test_data = test_data[test_data["rating"]!=3].reset_index(drop=True)

print(len(train_data) + len(test_data))
print(len(products))

166752
166752


In [178]:
models = {}

In [179]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
logistic = LogisticRegression(max_iter=10000, n_jobs=1, solver='liblinear', tol=0.0001)

pipeline = Pipeline([
    ('vectorizer', vectorizer),  # step 1 - vectorize the review text
    ('model', logistic)  # step 2 - classifier
])

pipeline.steps

[('vectorizer', CountVectorizer(token_pattern='\\b\\w+\\b')),
 ('model', LogisticRegression(max_iter=10000, n_jobs=1, solver='liblinear'))]
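The `token_pattern` argument matters because the default `CountVectorizer` pattern only keeps tokens of two or more characters. The following sketch compares the two settings on made-up documents; the names `docs`, `default_vocab` and `single_vocab` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents containing single-character words.
docs = ["i love it", "a great buy"]

# Default pattern: tokens need at least two word characters.
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Pattern used in this notebook: single-character tokens are kept.
single_vocab = sorted(CountVectorizer(token_pattern=r'\b\w+\b').fit(docs).vocabulary_)

print(default_vocab)
print(single_vocab)
```

With the notebook's pattern, words such as "i" and "a" become features of their own, which increases the vocabulary size slightly.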

In [180]:
models['logistic'] = pipeline.fit(train_data['review_clean'], train_data['sentiment'])

In [184]:
# count the coefficients that are non-negative (>= 0)
np.sum(models['logistic']['model'].coef_ >= 0)

87058

In [189]:
sample_test_data = test_data.iloc[10:13]
sample_test_matrix = pipeline['vectorizer'].transform(sample_test_data['review_clean'])
print(models['logistic']['model'].classes_)
print(models['logistic']['model'].predict_proba(sample_test_matrix))

[-1  1]
[[3.67619649e-03 9.96323804e-01]
 [9.59679693e-01 4.03203066e-02]
 [9.99970318e-01 2.96823831e-05]]
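The columns of `predict_proba` follow the order of `classes_`, so the second column is the probability of the +1 class. For a binary model fitted with the `liblinear` solver, that probability is the sigmoid of the linear score returned by `decision_function`. The sketch below checks this on made-up numeric data rather than on the review data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset, for illustration only.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])

clf = LogisticRegression(solver='liblinear').fit(X, y)

scores = clf.decision_function(X)              # linear score w.x + b
manual_positive = 1.0 / (1.0 + np.exp(-scores))  # sigmoid of the score
probs = clf.predict_proba(X)                   # columns ordered as clf.classes_

print(clf.classes_)
print(np.allclose(manual_positive, probs[:, 1]))
```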


In [193]:
test_matrix = pipeline['vectorizer'].transform(test_data['review_clean'])

print(models['logistic']['model'].classes_)
print(models['logistic']['model'].predict_proba(test_matrix))

[-1  1]
[[2.15378492e-01 7.84621508e-01]
 [7.64757625e-07 9.99999235e-01]
 [6.68436602e-02 9.33156340e-01]
 ...
 [5.16272880e-06 9.99994837e-01]
 [2.41371340e-06 9.99997586e-01]
 [1.82542660e-02 9.81745734e-01]]
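The cells above transform the text with the vectorizer step and then call the model step. A fitted `Pipeline` can also be called directly on raw text, which should give the same probabilities. A minimal sketch on made-up reviews:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training data, for illustration only.
train_reviews = ["love it great", "great product love",
                 "terrible broke", "broke terrible waste"]
train_labels = [1, 1, -1, -1]

toy_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(token_pattern=r'\b\w+\b')),
    ('model', LogisticRegression(solver='liblinear')),
]).fit(train_reviews, train_labels)

new_reviews = ["love this great product", "waste it broke"]

# Two steps: vectorize, then predict with the model step.
matrix = toy_pipeline['vectorizer'].transform(new_reviews)
two_step = toy_pipeline['model'].predict_proba(matrix)

# One step: the pipeline applies the vectorizer internally.
one_step = toy_pipeline.predict_proba(new_reviews)

print(np.allclose(two_step, one_step))
```

Words that were not seen during fitting (such as "this" here) are ignored by `transform`.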


In [195]:
def get_classification_accuracy(model, data, true_labels):
    pred_y = model.predict(data)
    correct = np.sum(pred_y==true_labels)
    accuracy = round(correct/len(true_labels),2)
    return accuracy

get_classification_accuracy(models['logistic']['model'] ,
                            test_matrix,
                            test_data["sentiment"])

0.93
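As a cross-check, the manual accuracy computation (correct predictions divided by total predictions) can be compared with scikit-learn's `accuracy_score`. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels.
true_labels = np.array([1, -1, 1, 1, -1])
pred_labels = np.array([1, -1, -1, 1, -1])

# Fraction of predictions that match the true labels.
manual = np.sum(pred_labels == true_labels) / len(true_labels)
library = accuracy_score(true_labels, pred_labels)

print(manual, library)
```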