### Load Amazon dataset

In [1]:
import pandas as pd
import numpy as np

dtype_dict = {'name':str, 'review':str, 'rating': int}
amazon = pd.read_csv('amazon_baby.csv', dtype = dtype_dict)
amazon.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


### Perform text cleaning
We start by removing punctuation, so that words "cake." and "cake!" are counted as the same word.

 *   Write a function remove_punctuation that strips punctuation from a line of text
 *   Apply this function to every element in the review column of products, and save the result to a new column review_clean.

IMPORTANT. Make sure to fill n/a values in the review column with empty strings (if applicable). The n/a values indicate empty reviews. For instance, Pandas's the fillna() method lets you replace all N/A's in the review columns as follows:

In [2]:
amazon = amazon.fillna({'review':''}) 

In [3]:
def remove_punctuation(text):
    import string
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator) 

amazon['review_clean'] = amazon['review'].apply(remove_punctuation)
amazon.head()

Unnamed: 0,name,review,rating,review_clean
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,These flannel wipes are OK but in my opinion n...
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...


### Extract Sentiments

We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.

In [4]:
amazon = amazon[amazon['rating'] != 3]
amazon.head()

Unnamed: 0,name,review,rating,review_clean
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...


Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column. 

In [5]:
amazon['sentiment'] = amazon['rating'].apply(lambda rating : +1 if rating > 3 else -1)
amazon.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1


### Split into training and test sets

Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. If you are using SFrame, make sure to use seed=1 so that you get the same result as everyone else does. (This way, you will get the right numbers for the quiz.)

In [6]:
import json

train_data_index = json.loads(open('module-2-assignment-train-idx.json').read())
test_data_index = json.loads(open('module-2-assignment-test-idx.json').read())

In [7]:
train_data = amazon.iloc[train_data_index]
test_data = amazon.iloc[test_data_index]

### Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as bag-of-word features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

  *  Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
  *  Compute the occurrences of the words in each review and collect them into a row vector.
  *  Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
  *  Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

The following cell uses CountVectorizer in scikit-learn. Notice the token_pattern argument in the constructor.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
     # Use this token pattern to keep single-letter words
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

### Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data.

Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to weights that cause positive sentiment, while negative weights correspond to negative sentiment. Calculate the number of positive (>= 0, which is actually nonnegative) coefficients. 

In [9]:
from sklearn.linear_model import LogisticRegression

sentiment_model = LogisticRegression().fit(train_matrix, train_data['sentiment'])

# Question 1
How many weights are greater than or equal to 0? 

In [10]:
weights = sentiment_model.coef_
num_positive_weights = (weights >= 0).sum()
num_negative_weights = (weights < 0).sum()

print("Num of positive weights: ", num_positive_weights)
print("Num of negative weights: ", num_negative_weights)

Num of positive weights:  87151
Num of negative weights:  34561


### Making predictions with logistic regression

Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from the SFrame test_data and print their content:

In [11]:
sample_test_data = test_data[10:13]
sample_test_data

Unnamed: 0,name,review,rating,review_clean,sentiment
59,Our Baby Girl Memory Book,Absolutely love it and all of the Scripture in...,5,Absolutely love it and all of the Scripture in...,1
71,Wall Decor Removable Decal Sticker - Colorful ...,Would not purchase again or recommend. The dec...,2,Would not purchase again or recommend The deca...,-1
91,New Style Trailing Cherry Blossom Tree Decal R...,Was so excited to get this product for my baby...,1,Was so excited to get this product for my baby...,-1


Let's dig deeper into the first row of the sample_test_data. Here's the full review:

In [12]:
sample_test_data.iloc[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

Now, let's see what the next row of the sample_test_data looks like. As we could guess from the rating (-1), the review is quite negative.

In [13]:
sample_test_data.iloc[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

We will now make a class prediction for the sample_test_data. The sentiment_model should predict +1 if the sentiment is positive and -1 if the sentiment is negative. Recall from the lecture that the score (sometimes called margin) for the logistic regression model is defined as:

$score_{i}=w_{T} * h(x_{i})$

where h(xi) represents the features for data point i. We will write some code to obtain the scores. For each row, the score (or margin) is a number in the range (-inf, inf). Use a pre-built function in your tool to calculate the score of each data point in sample_test_data. In scikit-learn, you can call the decision_function() function.

Hint: You'd probably need to convert sample_test_data into the sparse matrix format first.

In [14]:
score_test = sentiment_model.decision_function(test_matrix)
score_sample_test = score_test[10:13]
score_sample_test

array([  5.60193782,  -3.1693431 , -10.42378132])

In [15]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
score_sample_test_new = sentiment_model.decision_function(sample_test_matrix)
score_sample_test_new

array([  5.60193782,  -3.1693431 , -10.42378132])

### Prediciting Sentiment

These scores can be used to make class predictions as follows:

$y^i = +1 if w⊺h(xi)>0 −1 if w⊺h(xi)≤0$

Using scores, write code to calculate predicted labels for sample_test_data.

Checkpoint: Make sure your class predictions match with the ones obtained from sentiment_model. The logistic regression classifier in scikit-learn comes with the predict function for this purpose.


In [16]:
sentiment_model.predict(sample_test_matrix)

array([ 1, -1, -1])

### Probability Predictions

Recall from the lectures that we can also calculate the probability predictions from the scores using:

$P(yi=+1|xi,w)=1/(1+exp(−w⊺h(x_i)))$

Using the scores calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range [0, 1].

Checkpoint: Make sure your probability predictions match the ones obtained from sentiment_model.

In [17]:
prob_sample_test = sentiment_model.predict_proba(sample_test_matrix)
prob_sample_test

array([[  3.67713366e-03,   9.96322866e-01],
       [  9.59664165e-01,   4.03358355e-02],
       [  9.99970284e-01,   2.97164132e-05]])

# Question 2
Of the three data points in sample_test_data, which one has the lowest probability of being classified as a positive review?

In [18]:
prob_sample_test[:,1]

array([  9.96322866e-01,   4.03358355e-02,   2.97164132e-05])

### Find the most positive (and negative) review

We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points.

Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

To calculate these top-20 reviews, use the following steps:

   * Make probability predictions on test_data using the sentiment_model.
   * Sort the data according to those predictions and pick the top 20.

In [19]:
test_data['prediction_probability'] = sentiment_model.predict_proba(test_matrix)[:,1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [20]:
most_positive_review_test_data = test_data.sort_values(by = 'prediction_probability', ascending = False)
most_positive_review_test_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment,prediction_probability
119182,Roan Rocco Classic Pram Stroller 2-in-1 with B...,Great Pram Rocco!!!!!!I bought this pram from ...,5,Great Pram RoccoI bought this pram from Europe...,1,1.0
97325,Freemie Hands-Free Concealable Breast Pump Col...,I absolutely love this product. I work as a C...,5,I absolutely love this product I work as a Cu...,1,1.0
133651,"Britax 2012 B-Agile Stroller, Red",[I got this stroller for my daughter prior to ...,4,I got this stroller for my daughter prior to t...,1,1.0
114796,"Fisher-Price Cradle 'N Swing, My Little Snuga...",My husband and I cannot state enough how much ...,5,My husband and I cannot state enough how much ...,1,1.0
80155,"Simple Wishes Hands-Free Breastpump Bra, Pink,...","I just tried this hands free breastpump bra, a...",5,I just tried this hands free breastpump bra an...,1,1.0


# Question 3
Which of the following products are represented in the 20 most positive reviews?

In [21]:
most_positive_review_test_data.head(20)['name']

119182    Roan Rocco Classic Pram Stroller 2-in-1 with B...
97325     Freemie Hands-Free Concealable Breast Pump Col...
133651                    Britax 2012 B-Agile Stroller, Red
114796    Fisher-Price Cradle 'N Swing,  My Little Snuga...
80155     Simple Wishes Hands-Free Breastpump Bra, Pink,...
22586        Britax Decathlon Convertible Car Seat, Tiffany
180646        Mamas &amp; Papas 2014 Urbo2 Stroller - Black
50315            P'Kolino Silly Soft Seating in Tias, Green
147949    Baby Jogger City Mini GT Single Stroller, Shad...
52631     Evenflo X Sport Plus Convenience Stroller - Ch...
168081    Buttons Cloth Diaper Cover - One Size - 8 Colo...
66059          Evenflo 6 Pack Classic Glass Bottle, 4-Ounce
137034           Graco Pack 'n Play Element Playard - Flint
100166    Infantino Wrap and Tie Baby Carrier, Black Blu...
140816           Diono RadianRXT Convertible Car Seat, Plum
87017       Baby Einstein Around The World Discovery Center
168697    Graco FastAction Fold Jogger C

# Question 4
Which of the following products are represented in the 20 most negative reviews?

In [22]:
most_negative_review_test_data = test_data.sort_values(by = 'prediction_probability', ascending = True)
most_negative_review_test_data.head(20)['name']

16042           Fisher-Price Ocean Wonders Aquarium Bouncer
120209    Levana Safe N'See Digital Video Baby Monitor w...
77072        Safety 1st Exchangeable Tip 3 in 1 Thermometer
48694     Adiri BPA Free Natural Nurser Ultimate Bottle ...
155287    VTech Communications Safe &amp; Sounds Full Co...
94560     The First Years True Choice P400 Premium Digit...
53207                   Safety 1st High-Def Digital Monitor
81332                 Cloth Diaper Sprayer--styles may vary
10677                     Philips AVENT Newborn Starter Set
113995    Motorola Digital Video Baby Monitor with Room ...
59546                Ellaroo Mei Tai Baby Carrier - Hershey
9915           Cosco Alpha Omega Elite Convertible Car Seat
40079     Chicco Cortina KeyFit 30 Travel System in Adve...
172090    Belkin WeMo Wi-Fi Baby Monitor for Apple iPhon...
75994            Peg-Perego Tatamia High Chair, White Latte
149987                     NUK Cook-n-Blend Baby Food Maker
154878    VTech Communications Safe &amp

### Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifier. Recall that the accuracy is given by

accuracy=# correctly classified examples / # total examples

This can be computed as follows:

 *   Step 1: Use the sentiment_model to compute class predictions.
 *   Step 2: Count the number of data points when the predicted class labels match the ground truth labels.
 *   Step 3: Divide the total number of correct predictions by the total number of data points in the dataset.



In [23]:
test_data['prediction'] = sentiment_model.predict(test_matrix)
test_data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


Unnamed: 0,name,review,rating,review_clean,sentiment,prediction_probability,prediction
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1,0.784511,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,I love this journal and our nanny uses it ever...,1,0.999999,1
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,I love this little calender you can keep track...,1,0.933217,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,I had a hard time finding a second year calend...,1,0.999979,1
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,One of babys first and favorite books and it i...,1,0.980231,1


In [39]:
test_label = test_data['sentiment'] == test_data['prediction']
total_true = sum(test_label)
accuracy_score = total_true/len(test_label)
print("Accuracy score: ", accuracy_score)

Accuracy score:  0.932295416367


# Question 5
What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

In [51]:
from sklearn import metrics

y_predict = sentiment_model.predict(test_matrix)

print("Accuracy score: ", metrics.accuracy_score(y_predict, test_data['sentiment']))

Accuracy score:  0.932295416367


# Question 6
Does a higher accuracy value on the training_data always imply that the classifier is better?

* No, higher accuracy on training data does not necessarily imply that the classifier is better.

### Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subet of words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [40]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

### Train a logistic regression model on a subset of data

Now build a logistic regression classifier with train_matrix_word_subset as features and sentiment as the target. Call this model simple_model.

Let us inspect the weights (coefficients) of the simple_model. First, build a table to store (word, coefficient) pairs.

In [43]:
simple_model = LogisticRegression().fit(train_matrix_word_subset, train_data['sentiment'])

# Question 7
Consider the coefficients of simple_model. There should be 21 of them, an intercept term + one for each word in significant_words.

How many of the 20 coefficients (corresponding to the 20 significant_words and excluding the intercept term) are positive for the simple_model?

In [47]:
(simple_model.coef_ >= 0).sum()

10

# Question 8
Are the positive words in the simple_model also positive words in the sentiment_model?

In [54]:
simple_model.coef_

array([[ 1.36368976,  0.94399959,  1.19253827,  0.08551278,  0.52018576,
         1.50981248,  1.67307389,  0.50376046,  0.19090857,  0.05885467,
        -1.65157634, -0.20956286, -0.51137963, -2.03369861, -2.34829822,
        -0.62116877, -0.32055624, -0.89803074, -0.36216674, -2.10933109]])

In [73]:
sentiment_words = [word for word in vectorizer.vocabulary_.keys()]
significant_words_index = [sentiment_words.index(word) for word in significant_words]
sentiment_weight = [sentiment_model.coef_[0][word_index] for word_index in significant_words_index]

In [77]:
sentiment_dict = {word:weight for (word, weight) in zip(significant_words, sentiment_weight)}
simple_dict = {word:weight for (word, weight) in zip(significant_words, simple_model.coef_[0])}

word_compare = pd.DataFrame({"weight on sentiment_model": sentiment_dict, "weight on simple_model": simple_dict})
word_compare

Unnamed: 0,weight on sentiment_model,weight on simple_model
able,0.2129163,0.190909
broke,-0.7191992,-1.651576
car,0.05519557,0.058855
disappointed,0.002854324,-2.348298
easy,-0.0054688,1.192538
even,0.07929719,-0.51138
great,0.06483661,0.944
less,0.04175778,-0.209563
little,-0.3146887,0.520186
love,0.2670837,1.36369


### Comparing models

We will now compare the accuracy of the sentiment_model and the simple_model.

First, compute the classification accuracy of the sentiment_model on the train_data.

Now, compute the classification accuracy of the simple_model on the train_data.

In [53]:
y_predict_sentiment_model = sentiment_model.predict(train_matrix)
y_predict_simple_model = simple_model.predict(train_matrix_word_subset)

print("Accuracy score for sentiment_model on the train_data: ", metrics.accuracy_score(y_predict_sentiment_model, train_data['sentiment']))
print("Accuracy score for simple_model on the train_data: ", metrics.accuracy_score(y_predict_simple_model, train_data['sentiment']))

Accuracy score for sentiment_model on the train_data:  0.96849703184
Accuracy score for simple_model on the train_data:  0.866822570007


# Question 9
Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

* sentiment_model

# Question 10
Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

In [83]:
y_predict_sentiment_model_test = sentiment_model.predict(test_matrix)
y_predict_simple_model_test = simple_model.predict(test_matrix_word_subset)

print("Accuracy score for sentiment_model on the test data: ", metrics.accuracy_score(y_predict_sentiment_model_test, test_data['sentiment']))
print("Accuracy score for simple model on the test data: ", metrics.accuracy_score(y_predict_simple_model_test, test_data['sentiment']))

Accuracy score for sentiment_model on the test data:  0.932295416367
Accuracy score for simple model on the test data:  0.869360451164


### Baseline: Majority class prediction

It is quite common to use the majority class classifier as the a baseline (or reference) model for comparison with your classifier model. The majority classifier model predicts the majority class for all data points. At the very least, you should healthily beat the majority class classifier, otherwise, the model is (usually) pointless.



In [89]:
test_data['sentiment'].value_counts()

 1    28095
-1     5241
Name: sentiment, dtype: int64

In [90]:
y_predict_majority_class = np.ones((len(test_data['sentiment']), 1), dtype = int)

# Question 11
Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).

In [91]:
print("Accuracy score for majority class classifier model on the test data: ", metrics.accuracy_score(y_predict_majority_class, test_data['sentiment']))

Accuracy score for majority class classifier model on the test data:  0.842782577394


# Question 12
Is the sentiment_model definitely better than the majority class classifier (the baseline)?

* Yes