In [121]:
import pandas as pd
import numpy as np

In [122]:
a = "a"
type(a)

str

In [123]:
dtype_dict = {'name': str, 'review': str, 'rating': int}

In [124]:
products = pd.read_csv('amazon_baby.csv', dtype=dtype_dict)

In [125]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [126]:
products['review'].describe()

count        182702
unique       182642
top       Very good
freq              5
Name: review, dtype: object

In [127]:
import string

trans = {}
for c in string.punctuation:
    trans[c] = None    
trans = str.maketrans(trans)

def remove_punctuation(text):
    return text.translate(trans)

In [128]:
a = remove_punctuation("Cake. cake! cake.... haha")
print (a)

Cake cake cake haha
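
As an aside, `str.maketrans` also accepts a third argument listing characters to delete, so the same table can be built without the loop. A minimal sketch; the names `PUNCT_TABLE` and `strip_punctuation` are hypothetical, chosen so they do not shadow the function defined above:

```python
import string

# Third argument of str.maketrans: every character listed here maps to None,
# which str.translate treats as "delete this character".
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("Cake. cake! cake.... haha"))
print(strip_punctuation("non-stop"))
```

Note that hyphens and apostrophes are deleted too, so "non-stop" becomes "nonstop" and "I've" becomes "Ive", which is visible in the `review_clean` column later on.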


### Fill N/A

In [129]:
products.fillna({"review": ''}, inplace=True)

### Clean punctuation

In [130]:
products['review_clean'] = products['review'].apply(remove_punctuation)

In [131]:
products.head()

Unnamed: 0,name,review,rating,review_clean
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,These flannel wipes are OK but in my opinion n...
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...


###  Extract sentiments

Ignore all reviews with rating = 3, since they tend to have a neutral sentiment. 

In [134]:
products = products[products['rating'] != 3]

In [135]:
len(products)

166752

Next, assign reviews with a rating of 4 or higher to the positive class and reviews with a rating of 2 or lower to the negative class. The sentiment column uses +1 for the positive class label and -1 for the negative class label. A convenient approach is an anonymous function that converts a rating into a class label, applied to every element of the rating column. The assignment text describes this with SFrame's apply(); the pandas Series.apply() used below plays the same role.



In [136]:
products['sentiment'] = products['rating'].apply(
    lambda rating: 1 if rating > 3 else -1
)
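
The same labelling rule can be sketched on a plain Python list, which makes the treatment of 3-star reviews explicit. The function name `rating_to_sentiment` and the example ratings are made up for illustration:

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to +1 or -1; neutral 3-star ratings get no label."""
    if rating == 3:
        return None
    return 1 if rating > 3 else -1


ratings = [3, 5, 5, 2, 1, 4]  # made-up example ratings

# Drop neutral reviews first, then label the rest (same order as the notebook).
labels = [rating_to_sentiment(r) for r in ratings if r != 3]
print(labels)
```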

### Split into training and test sets

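The split below uses the precomputed train/test index files that come with the assignment, so the results are reproducible. A random split with scikit-learn's train_test_split (commented out in the next cell) is an alternative.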

In [144]:
# from sklearn.model_selection import train_test_split
# train_test_split?
# train, test = train_test_split(products, test_size=0.2, shuffle=True)
# len(train), len(test)

In [145]:
%ls

Predicting sentiment from product reviews.ipynb
amazon_baby.csv*
amazon_baby.gl/
module-2-assignment-test-idx.json
module-2-assignment-train-idx.json


In [149]:
train_idx = pd.read_json('module-2-assignment-train-idx.json')
test_idx = pd.read_json('module-2-assignment-test-idx.json')

In [161]:
train_data = products.iloc[train_idx[0]]
test_data = products.iloc[test_idx[0]]

In [162]:
len(train_data), len(test_data)

(133416, 33336)
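
The index files are used with `iloc`, so they appear to hold positional row numbers into the table that remains after the 3-star reviews were dropped (133416 + 33336 = 166752, the length reported above). A minimal sketch of that kind of positional split on plain lists, assuming each file is a flat JSON array of integers; the inline JSON strings and row values are hypothetical:

```python
import json

rows = ["r0", "r1", "r2", "r3", "r4"]  # stand-in for the filtered products table

# Hypothetical file contents; the real files are much longer.
train_positions = json.loads("[0, 2, 3]")
test_positions = json.loads("[1, 4]")

# Positional selection, analogous to DataFrame.iloc.
train_rows = [rows[i] for i in train_positions]
test_rows = [rows[i] for i in test_positions]

# Sanity checks worth running on a real split: no overlap, full coverage.
no_overlap = set(train_positions).isdisjoint(test_positions)
full_coverage = len(train_rows) + len(test_rows) == len(rows)
print(no_overlap, full_coverage)
```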

### Build the word count vector for each review

In [163]:
from sklearn.feature_extraction.text import CountVectorizer

In [167]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')


In [177]:
vectorizer?

First, learn the vocabulary from the training data and assign a column to each word,
then convert the training data into a sparse matrix.

In [175]:
train_matrix = vectorizer.fit_transform(train_data['review_clean'])

Second, convert the test data into a sparse matrix, using the same word-column mapping

In [176]:
test_matrix = vectorizer.transform(test_data['review_clean'])
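
To see why the test data is only transformed and never fitted, here is a minimal pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation; the function names and the example reviews are made up, and it only mimics the lowercasing, the `\b\w+\b` token pattern, and the sorted word-to-column mapping:

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w+\b")  # same token pattern as in the notebook


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


def fit_vocabulary(docs):
    """Assign a column index to every distinct training token, in sorted order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Return one sparse {column: count} dict per document.

    Tokens that are not in the vocabulary are silently skipped.
    """
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocab)
        rows.append({vocab[tok]: n for tok, n in counts.items()})
    return rows


train_docs = ["I love it", "love love this"]  # made-up training reviews
test_docs = ["I love pandas"]                 # "pandas" never occurs in training

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

Because the columns are fixed by the training vocabulary, a word that appears only in the test reviews has no column and contributes nothing to the test matrix.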

### Train a sentiment classifier with logistic regression

Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.


In [187]:
from sklearn.linear_model import LogisticRegression

In [179]:
logReg = LogisticRegression()

In [180]:
sentiment_model = logReg.fit(train_matrix, train_data['sentiment'])



In [185]:
len(sentiment_model.coef_[0])

121712

There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that a positive weight w_j pushes the prediction toward positive sentiment, while a negative weight pushes it toward negative sentiment. Calculate the number of nonnegative (>= 0) coefficients, which the assignment refers to as positive.



In [201]:
positive_coeff = (sentiment_model.coef_[0] >= 0).sum()
positive_coeff

85776

### Making predictions with logistic regression

In [210]:
sample_test_data = test_data.iloc[10:13]
print (sample_test_data)

                                                 name  \
59                          Our Baby Girl Memory Book   
71  Wall Decor Removable Decal Sticker - Colorful ...   
91  New Style Trailing Cherry Blossom Tree Decal R...   

                                               review  rating  \
59  Absolutely love it and all of the Scripture in...       5   
71  Would not purchase again or recommend. The dec...       2   
91  Was so excited to get this product for my baby...       1   

                                         review_clean  sentiment  
59  Absolutely love it and all of the Scripture in...          1  
71  Would not purchase again or recommend The deca...         -1  
91  Was so excited to get this product for my baby...         -1  


In [213]:
sample_test_data.iloc[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [215]:
sample_test_data.iloc[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [216]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])

In [220]:
scores_sample = sentiment_model.decision_function(sample_test_matrix)

In [221]:
print (scores_sample)

[  5.60072079  -3.14163294 -10.40078476]


In [273]:
def predict_wo_prob_arr(score):
    return np.array([+1 if u > 0 else -1 for u in score])

In [222]:
def predict_wo_prob(score):
    if score > 0:
        return +1
    else: 
        return -1

### Probability predictions

In [223]:
import math

In [226]:
def predict_with_prob(score):
    return 1.0 / (1.0 + math.e ** (-score))

In [229]:
predict_with_prob(scores_sample)

array([9.96318405e-01, 4.14222326e-02, 3.04076855e-05])
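
The probability is the sigmoid of the score, P(y = +1 | x) = 1 / (1 + exp(-score)). A minimal scalar sketch that reproduces the three values above from the printed scores; the branch for negative scores is an added precaution (not part of the notebook), because the direct formula can overflow for a very large negative scalar score:

```python
import math


def sigmoid(score):
    """Numerically stable 1 / (1 + exp(-score)) for a single float."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)  # score < 0, so z lies in [0, 1) and cannot overflow
    return z / (1.0 + z)


scores = [5.60072079, -3.14163294, -10.40078476]  # scores_sample printed above
probs = [sigmoid(s) for s in scores]
print(probs)

# The predicted class depends only on the sign of the score:
# score > 0 is equivalent to probability > 0.5.
print([1 if p > 0.5 else -1 for p in probs])
```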

### Quiz question: Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

In [231]:
1 + predict_with_prob(scores_sample).argmin()

3

### Find the most positive (and negative) review


Make probability predictions on test_data using the sentiment_model

In [233]:
scores_test = sentiment_model.decision_function(test_matrix)

In [236]:
scores_test_prob = predict_with_prob(scores_test)
scores_test_prob

array([0.78204172, 0.99999927, 0.93448641, ..., 0.99999448, 0.99999737,
       0.98092862])

In [259]:
top20_positive = (-scores_test_prob).argsort()[:20]

### Quiz Question: Which of the following products are represented in the 20 most positive reviews?

In [264]:
top20_positive_scores = scores_test_prob[top20_positive]
top20_positive_scores

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])
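
All twenty probabilities print as 1., so at this precision they cannot be told apart, and some may be exactly equal in floating point. Since the sigmoid is monotonic, ranking by the raw scores from `decision_function` gives the same order wherever the probabilities differ and still separates the saturated cases. A small sketch with made-up scores:

```python
import math


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


raw_scores = [40.0, 50.0, 45.0]  # made-up large positive scores
probs = [sigmoid(s) for s in raw_scores]

# In double precision all three probabilities round to exactly 1.0.
print(probs)

# Sorting indices by raw score (descending) still yields a strict ranking.
order_by_score = sorted(range(len(raw_scores)), key=lambda i: -raw_scores[i])
print(order_by_score)
```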

In [260]:
test_data.iloc[top20_positive]

Unnamed: 0,name,review,rating,review_clean,sentiment
80155,"Simple Wishes Hands-Free Breastpump Bra, Pink,...","I just tried this hands free breastpump bra, a...",5,I just tried this hands free breastpump bra an...,1
180646,Mamas & Papas 2014 Urbo2 Stroller - Black,After much research I purchased an Urbo2. It's...,4,After much research I purchased an Urbo2 Its e...,1
50315,"P'Kolino Silly Soft Seating in Tias, Green",I've purchased both the P'Kolino Little Reader...,4,Ive purchased both the PKolino Little Reader C...,1
114796,"Fisher-Price Cradle 'N Swing, My Little Snuga...",My husband and I cannot state enough how much ...,5,My husband and I cannot state enough how much ...,1
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,I got this stroller for my daughter prior to t...,1
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",I bought this carrier when my daughter was abo...,5,I bought this carrier when my daughter was abo...,1
168697,Graco FastAction Fold Jogger Click Connect Str...,Graco's FastAction Jogging Stroller definitely...,5,Gracos FastAction Jogging Stroller definitely ...,1
147949,"Baby Jogger City Mini GT Single Stroller, Shad...","Amazing, Love, Love, Love it !!! All 5 STARS a...",5,Amazing Love Love Love it All 5 STARS all the...,1
137034,Graco Pack 'n Play Element Playard - Flint,My husband and I assembled this Pack n' Play l...,4,My husband and I assembled this Pack n Play la...,1
52631,Evenflo X Sport Plus Convenience Stroller - Ch...,After seeing this in Parent's Magazine and rea...,5,After seeing this in Parents Magazine and read...,1


In [256]:
top20_negative = scores_test_prob.argsort()[:20]

In [266]:
top20_negative_scores = scores_test_prob[top20_negative]
top20_negative_scores

array([9.08317913e-16, 1.94209990e-15, 7.66737346e-14, 1.43149111e-13,
       1.81556052e-13, 4.62631205e-13, 3.40619596e-11, 4.07309421e-11,
       9.43925918e-11, 1.12050143e-10, 4.21305023e-10, 4.76896283e-10,
       6.51348958e-10, 6.59023885e-10, 7.01273287e-10, 7.88538826e-10,
       8.57785868e-10, 1.09127223e-09, 1.62447457e-09, 1.63473356e-09])

In [257]:
test_data.iloc[top20_negative]

Unnamed: 0,name,review,rating,review_clean,sentiment
16042,Fisher-Price Ocean Wonders Aquarium Bouncer,We have not had ANY luck with Fisher-Price pro...,2,We have not had ANY luck with FisherPrice prod...,-1
120209,Levana Safe N'See Digital Video Baby Monitor w...,This is the first review I have ever written o...,1,This is the first review I have ever written o...,-1
77072,Safety 1st Exchangeable Tip 3 in 1 Thermometer,I thought it sounded great to have different t...,1,I thought it sounded great to have different t...,-1
48694,Adiri BPA Free Natural Nurser Ultimate Bottle ...,I will try to write an objective review of the...,2,I will try to write an objective review of the...,-1
155287,VTech Communications Safe & Sounds Full Co...,"This is my second video monitoring system, the...",1,This is my second video monitoring system the ...,-1
94560,The First Years True Choice P400 Premium Digit...,Note: we never installed batteries in these un...,1,Note we never installed batteries in these uni...,-1
53207,Safety 1st High-Def Digital Monitor,We bought this baby monitor to replace a diffe...,1,We bought this baby monitor to replace a diffe...,-1
81332,Cloth Diaper Sprayer--styles may vary,I bought this sprayer out of desperation durin...,1,I bought this sprayer out of desperation durin...,-1
113995,Motorola Digital Video Baby Monitor with Room ...,DO NOT BUY THIS BABY MONITOR!I purchased this ...,1,DO NOT BUY THIS BABY MONITORI purchased this m...,-1
10677,Philips AVENT Newborn Starter Set,"It's 3am in the morning and needless to say, t...",1,Its 3am in the morning and needless to say thi...,-1


### Compute accuracy of the classifier


### Quiz Question: What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

In [276]:
predicted_sentiment_test = predict_wo_prob_arr(scores_test)

In [278]:
corrected_classified_examples = (predicted_sentiment_test == test_data['sentiment']).sum()

In [281]:
accuracy = corrected_classified_examples / len(test_data)
print ("%.2f" % accuracy)

0.93
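
For reference, accuracy is the number of correctly classified examples divided by the total number of examples. A minimal sketch on plain lists; the helper names and the example scores and labels are made up:

```python
def classify(scores):
    """Predict +1 for a positive score and -1 otherwise."""
    return [1 if s > 0 else -1 for s in scores]


def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


example_scores = [5.6, -3.1, -10.4, 0.2]  # made-up scores
example_labels = [1, -1, -1, -1]          # made-up true sentiments

predictions = classify(example_scores)
print(predictions)
print(accuracy(predictions, example_labels))
```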


Accuracy on the training data

In [283]:
scores_train = sentiment_model.decision_function(train_matrix)

In [284]:
predicted_sentiment_train = predict_wo_prob_arr(scores_train)

In [288]:
accuracy_train = (predicted_sentiment_train == train_data['sentiment']).sum() / len(train_data)

In [289]:
print ("Train accuracy: %.2f" % accuracy_train)

Train accuracy: 0.97


### Learn another classifier with fewer words

In [290]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [291]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words)

In [292]:
train_matrix_word_subset = vectorizer_word_subset.fit_transform(
    train_data['review_clean']
)

In [293]:
test_matrix_word_subset = vectorizer_word_subset.transform(
    test_data['review_clean']
)

### Train a logistic regression model on a subset of words

In [294]:
simple_model = LogisticRegression()

In [295]:
simple_model.fit(train_matrix_word_subset, train_data['sentiment'])



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [298]:
simple_model_coef_table = pd.DataFrame({
    'word': significant_words,
    'coefficient': simple_model.coef_.flatten()
})

In [302]:
simple_model_coef_table.sort_values(['coefficient'], ascending=False)

Unnamed: 0,word,coefficient
6,loves,1.673074
5,perfect,1.509812
0,love,1.36369
2,easy,1.192538
1,great,0.944
4,little,0.520186
7,well,0.50376
8,able,0.190909
3,old,0.085513
9,car,0.058855


### Quiz Question: Consider the coefficients of simple_model. How many of the 20 coefficients (corresponding to the 20 significant_words) are positive for the simple_model?

In [310]:
positive_words = simple_model_coef_table[simple_model_coef_table['coefficient'] > 0]
positive_words

Unnamed: 0,word,coefficient
0,love,1.36369
1,great,0.944
2,easy,1.192538
3,old,0.085513
4,little,0.520186
5,perfect,1.509812
6,loves,1.673074
7,well,0.50376
8,able,0.190909
9,car,0.058855


In [311]:
len(positive_words)

10

### Quiz Question: Are the positive words in the simple_model also positive words in the sentiment_model?

In [387]:
# Note: newer scikit-learn releases replace this method with get_feature_names_out()
vectorizer.get_feature_names()

['0',
 '00',
 '000',
 '0001',
 '001',
 '001cm',
 '002',
 '01',
 '010',
 '010204',
 '0104',
 '010613do',
 '01082013',
 '012',
 '012010',
 '012013',
 '01202012',
 '01252013',
 '01302012my',
 '01312009',
 '015a',
 '017',
 '0182196',
 '02',
 '020',
 '02000z',
 '02060',
 '0207',
 '02072',
 '02090',
 '020902nd',
 '0209a',
 '021',
 '02100',
 '02100a10search',
 '0210a',
 '02172014after',
 '02180',
 '021meal',
 '02220',
 '024',
 '025',
 '02534',
 '02640a',
 '02644',
 '02700',
 '02720',
 '03',
 '030611fantastic',
 '0311',
 '032010',
 '03212014',
 '034',
 '036',
 '03lbs',
 '03mo',
 '03mo36mo612mo',
 '03months',
 '03mos',
 '03mosbut',
 '04',
 '0409',
 '0427',
 '04302013',
 '046060us',
 '05',
 '050',
 '052',
 '05202013',
 '05my',
 '06',
 '0635',
 '065',
 '06a',
 '06mfor',
 '06mo',
 '06mosit',
 '06mths',
 '07',
 '07122011by',
 '07182013',
 '072012',
 '073',
 '075',
 '075long',
 '08',
 '0804',
 '080412',
 '080710',
 '08120firms',
 '0813',
 '081713',
 '08280',
 '08all',
 '08while',
 '09',
 '09082009',

In [384]:
sentiment_model.coef_

array([[-1.23660236e+00,  2.00327531e-04,  2.59395339e-02, ...,
         1.14378316e-02,  3.17780649e-03, -7.09468744e-05]])

In [388]:
sentiment_model_coef_table = pd.DataFrame(
{"word": vectorizer.get_feature_names(),
 "coefficient": sentiment_model.coef_.flatten()
}

)

In [389]:
sentiment_model_coef_table.head()

Unnamed: 0,word,coefficient
0,0,-1.236602
1,00,0.0002
2,000,0.02594
3,0001,0.006096
4,001,4.5e-05


In [390]:
simple_positive_idx = sentiment_model_coef_table['word'].isin(positive_words['word'])

In [391]:
sentiment_model_coef_table[simple_positive_idx].sort_values(['coefficient'],
                                                            ascending=False)

Unnamed: 0,word,coefficient
78982,perfect,1.859421
63567,love,1.575313
63646,loves,1.517091
37640,easy,1.358595
48789,great,1.229169
62602,little,0.635883
117906,well,0.540439
7386,able,0.389545
22122,car,0.122838
74106,old,0.05422
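
The comparison can also be done programmatically rather than by eye. The sketch below copies the coefficients from the two tables above into plain dictionaries (the variable names are made up) and checks whether any of the ten words changes sign in the full model:

```python
# Coefficients of the ten positive words in simple_model (from the table above).
simple_coef = {
    "loves": 1.673074, "perfect": 1.509812, "love": 1.363690,
    "easy": 1.192538, "great": 0.944000, "little": 0.520186,
    "well": 0.503760, "able": 0.190909, "old": 0.085513, "car": 0.058855,
}

# Coefficients of the same words in sentiment_model (from the table above).
full_coef = {
    "perfect": 1.859421, "love": 1.575313, "loves": 1.517091,
    "easy": 1.358595, "great": 1.229169, "little": 0.635883,
    "well": 0.540439, "able": 0.389545, "car": 0.122838, "old": 0.054220,
}

# Words that are positive in simple_model but not in sentiment_model.
sign_flips = [w for w in simple_coef if full_coef[w] <= 0]
print(sign_flips)

# Words whose weight is smaller in the full model than in the simple one.
smaller_in_full = sorted(w for w in simple_coef if full_coef[w] < simple_coef[w])
print(smaller_in_full)
```

An empty `sign_flips` list means every word that is positive in simple_model is also positive in sentiment_model, although the magnitudes differ.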


###  Comparing models

### Quiz Question: Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

In [353]:
# accuracy of sentiment_model on train data
accuracy_train


0.9676800383762068

In [352]:
# accuracy of sentiment_model on test data
accuracy


0.9323854091672666

In [354]:
# accuracy of simple_model on train_data

In [358]:
simple_scores_train = simple_model.decision_function(train_matrix_word_subset)

In [360]:
simples_classification_train = predict_wo_prob_arr(simple_scores_train)

In [366]:
simples_classification_train_acc = (simples_classification_train == train_data['sentiment']).sum() / len(train_data)
simples_classification_train_acc

0.8668225700065959

In [355]:
# accuracy of simple_model on test_data

In [371]:
simples_scores_test = simple_model.decision_function(test_matrix_word_subset)

In [372]:
simples_class_test = predict_wo_prob_arr(simples_scores_test)

In [374]:
simples_class_test_acc = (simples_class_test == test_data['sentiment']).sum() / len(test_data)
simples_class_test_acc

0.8693604511639069

### Quiz Question: Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

In [376]:
accuracy_train > simples_classification_train_acc

True

### Quiz Question: Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

In [377]:
accuracy > simples_class_test_acc

True

### Baseline: Majority class prediction


### Quiz Question: Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).



In [380]:
majority_acc_test = (test_data['sentiment'] == 1).sum() / len(test_data)
majority_acc_test

0.8427825773938085
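
The cell above assumes that +1 is the majority class and measures its share directly on the test set. A slightly more conventional variant, sketched here under that framing, picks the majority class from the training labels and then scores it on the test labels. The function name and the example labels are made up:

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Pick the most frequent training label and score it on the test labels."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)


example_train = [1, 1, 1, -1, 1, -1]  # made-up training sentiments
example_test = [1, -1, 1, 1]          # made-up test sentiments

label, baseline_acc = majority_baseline_accuracy(example_train, example_test)
print(label, baseline_acc)
```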

### Quiz Question: Is the sentiment_model definitely better than the majority class classifier (the baseline)?

In [381]:
accuracy > majority_acc_test

True