In [1]:
from __future__ import print_function
import sframe

### Load Amazon dataset

In [4]:
products = sframe.SFrame('amazon_baby.gl/')

[INFO] sframe.cython.cy_server: SFrame v2.1 started. Logging /tmp/sframe_server_1518774093.log


### Perform text cleaning
* Remove punctuation (simple text cleaning)

In [2]:
import string

def remove_punctuation(text):
    # Python 2 str.translate: the second argument lists the characters to delete.
    return text.translate(None, string.punctuation)
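
The `translate(None, ...)` call above is the Python 2 form. On Python 3, `str.translate` takes a single translation table instead, so a minimal equivalent sketch (the name `remove_punctuation_py3` is only illustrative and not part of the notebook) might look like this:

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with ASCII punctuation removed (Python 3 only)."""
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punctuation_py3("Would NOT stick! Literally stayed stuck...")
print(cleaned)
```

Only the characters in `string.punctuation` are removed; whitespace and letter case are left untouched.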

Fill missing (`n/a`) reviews with empty strings before cleaning, so that every row of `review clean` is a string.

In [5]:
products = products.fillna('review','')

In [6]:
products['review clean'] = products['review'].apply(remove_punctuation)

### Extract sentiments
Ignore all reviews with rating = 3, since they tend to express a neutral sentiment.

In [7]:
products = products[products['rating'] != 3]

Create a new column in products called `sentiment`, assigning +1 (positive sentiment) to reviews with rating > 3 and -1 (negative sentiment) otherwise.

In [9]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3
                                                else -1)
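
The labeling rule can also be written as a small standalone function. This is only a sketch with made-up ratings; `rating_to_sentiment` is a hypothetical helper, not something the notebook defines.

```python
def rating_to_sentiment(rating):
    """Map a star rating to +1 (rating > 3) or -1 (rating < 3); rating 3 gets no label."""
    if rating == 3:
        return None  # neutral reviews are dropped before labeling
    return +1 if rating > 3 else -1

ratings = [5.0, 2.0, 1.0, 3.0, 4.0]   # made-up ratings for illustration
labels = [rating_to_sentiment(r) for r in ratings if r != 3]
print(labels)   # [1, -1, -1, 1]
```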

### Split into training and test datasets

In [11]:
train_data, test_data = products.random_split(.8, seed=1)

### Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as **bag-of-words features**. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to the documentation of your tool to produce sparse word count vectors. The general steps for extracting word count vectors are as follows:

* Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
* Compute the occurrences of the words in each review and collect them into a row vector.
* Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix **train_matrix**.
* Using the same mapping between words and columns, convert the test data into a sparse matrix **test_matrix**.
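
The steps above can be sketched in plain Python. This is a simplified illustration, not what `CountVectorizer` does internally: it assumes whitespace tokenization, assigns column indices in first-seen order, and stores each row as a `{column index: count}` dict instead of a real sparse matrix. The function names and example reviews are made up.

```python
from collections import Counter

def learn_vocabulary(documents):
    """Assign a column index to each distinct word seen in the training documents."""
    vocabulary = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary

def to_sparse_rows(documents, vocabulary):
    """Turn each document into a {column index: count} dict, skipping unknown words."""
    rows = []
    for doc in documents:
        counts = Counter(w for w in doc.lower().split() if w in vocabulary)
        rows.append({vocabulary[w]: c for w, c in counts.items()})
    return rows

train_docs = ["love it love it", "would not buy again"]   # made-up reviews
test_docs = ["love it would buy again and again"]

vocab = learn_vocabulary(train_docs)
train_rows = to_sparse_rows(train_docs, vocab)
# "and" never appeared in the training data, so it is ignored here.
test_rows = to_sparse_rows(test_docs, vocab)
```

Reusing `vocab` for the test documents is what keeps the columns of the training and test representations aligned.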

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [131]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review clean'])

In [132]:
test_matrix = vectorizer.transform(test_data['review clean'])

### Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data.

7. Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to words associated with positive sentiment, while negative weights correspond to words associated with negative sentiment. Calculate the number of nonnegative (>= 0) coefficients.

In [15]:
from sklearn.linear_model import LogisticRegression

In [133]:
logistic = LogisticRegression()
sentiment_model = logistic.fit(train_matrix, train_data['sentiment'])

**Quiz question: How many weights are >= 0?**

In [134]:
len(sentiment_model.coef_.flatten())

121712

In [135]:
positive = [p for p in list(sentiment_model.coef_.flatten()) if p >= 0]

In [136]:
len(positive)

87151

### Making predictions with logistic regression

9. Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from the SFrame test_data and prints their content:

In [32]:
sample_test_data = test_data[10:13]
print(sample_test_data)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-----------+
|          review clean         | sentiment |
+-------------------------------+-----------+
| Absolutely love it and all... |     1     |
| Would not purchase again o... |     -1    |
| Was so excited to get this... |     -1    |
+-------------------------------+-----------+
[3 rows x 5 columns]



In [33]:
sample_test_data[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [34]:
sample_test_data[1]['review']

'Would not purchase again or recommend. The decals were thick almost plastic like and were coming off the wall as I was applying them! The would NOT stick! Literally stayed stuck for about 5 minutes then started peeling off.'

In [35]:
sample_test_data[2]['review']

"Was so excited to get this product for my baby girls bedroom!  When I got it the back is NOT STICKY at all!  Every time I walked into the bedroom I was picking up pieces off of the floor!  Very very frustrating!  Ended up having to super glue it to the wall...very disappointing.  I wouldn't waste the time or money on it."

We will now make a class prediction for the sample_test_data. The sentiment_model should predict +1 if the sentiment is positive and -1 if the sentiment is negative. Recall from the lecture that the score (sometimes called margin) for the logistic regression model is defined as:

$$\text{score}_i = \mathbf{w}^T h(\mathbf{x}_i)$$

where $h(\mathbf{x}_i)$ represents the features for data point $i$. We will write some code to obtain the scores. For each row, the score (or margin) is a number in the range $(-\infty, \infty)$. Use a pre-built function in your tool to calculate the score of each data point in sample_test_data. In scikit-learn, you can call the `decision_function()` method of the model.

Hint: You'd probably need to convert sample_test_data into the sparse matrix format first.
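
To make the score formula concrete, here is a minimal sketch that computes it for one review stored as a `{column index: count}` dict. The weights and the review are made up, and `score` is a hypothetical helper; scikit-learn's `decision_function` performs the same dot product and also adds the fitted intercept.

```python
def score(sparse_row, weights, intercept=0.0):
    """Compute w^T h(x) + intercept for one review stored as {column index: count}."""
    return intercept + sum(weights[j] * count for j, count in sparse_row.items())

weights = [1.5, -2.0, 0.25]        # hypothetical coefficients, one per vocabulary word
review = {0: 2, 1: 1}              # word 0 appears twice, word 1 once
margin = score(review, weights)    # 1.5*2 + (-2.0)*1 = 1.0
```

Words that do not occur in the review contribute nothing to the sum, which is why a sparse representation is enough.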

In [36]:
sample_test_matrix = vectorizer.transform(sample_test_data['review clean'])

In [37]:
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)

[  5.60193782  -3.1693431  -10.42378132]


### Predicting Sentiment

11. These scores can be used to make class predictions as follows:

$$\hat{\mathbf{y}}_i = \begin{cases} +1 & \text{if } \mathbf{w}^T h(\mathbf{x}_i) > 0 \\ -1 & \text{if } \mathbf{w}^T h(\mathbf{x}_i) \le 0 \end{cases}$$

Using scores, write code to calculate predicted labels for sample_test_data.

**Checkpoint**: Make sure your class predictions match with the ones obtained from sentiment_model. The logistic regression classifier in scikit-learn comes with the predict function for this purpose.
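
A minimal sketch of the thresholding rule, applied to the three scores printed earlier for sample_test_data (`predict_labels` is an illustrative name, not part of scikit-learn):

```python
def predict_labels(scores):
    """Return +1 where the score is strictly positive and -1 otherwise."""
    return [+1 if s > 0 else -1 for s in scores]

scores = [5.60193782, -3.1693431, -10.42378132]   # scores reported for sample_test_data
labels = predict_labels(scores)
print(labels)   # [1, -1, -1]
```

Note that a score of exactly 0 falls on the -1 side, as in the rule above.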

In [38]:
sentiment = sentiment_model.predict(sample_test_matrix)

In [39]:
sentiment

array([ 1, -1, -1])

In [40]:
sentiment_model.predict_proba(sample_test_matrix)

array([[  3.67713366e-03,   9.96322866e-01],
       [  9.59664165e-01,   4.03358355e-02],
       [  9.99970284e-01,   2.97164132e-05]])

In [41]:
sentiment_model.predict(sample_test_matrix)

array([ 1, -1, -1])

### Probability Predictions

Recall from the lectures that we can also calculate the probability predictions from the scores using:

$$P(\mathbf{y}_i = +1 \mid \mathbf{x}_i,\mathbf{w})=\frac{1}{1+\exp(-\mathbf{w}^T h(\mathbf{x}_i))}$$

Using the scores calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probability should be a number in the range [0, 1].

**Checkpoint**: Make sure your probability predictions match the ones obtained from sentiment_model.

**Quiz question**: Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

In [42]:
import numpy as np

In [43]:
def calc_prob(scores):
    return 1/(1 + np.exp(-scores))

In [44]:
calc_prob(scores)

array([  9.96322866e-01,   4.03358355e-02,   2.97164132e-05])
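
`calc_prob` is fine for the scores seen here. For very negative scores, however, `exp(-score)` can overflow (NumPy emits a warning, while `math.exp` raises `OverflowError`). A sketch of a numerically safer variant in plain Python follows; `stable_sigmoid` is a hypothetical helper name.

```python
import math

def stable_sigmoid(score):
    """Compute 1 / (1 + exp(-score)) without overflowing for large |score|."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores, use the equivalent form exp(score) / (1 + exp(score)).
    z = math.exp(score)
    return z / (1.0 + z)

probabilities = [stable_sigmoid(s) for s in [5.60193782, -3.1693431, -10.42378132]]
```

Both branches evaluate the same formula; the split only ensures that `math.exp` is never called with a large positive argument.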

### Find the most positive (and negative) review

We now turn to examining the full test dataset, **test_data**, and use `sklearn.linear_model.LogisticRegression` to form predictions on all of the test data points.

Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

To calculate these top-20 reviews, use the following steps:

1. Make probability predictions on test_data using the sentiment_model.
2. Sort the data according to those predictions and pick the top 20.
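
The two steps above can be sketched without SFrame as follows. The product names and probabilities are made up, and `top_k_by_probability` is a hypothetical helper. When many reviews share the same probability (for example, values that round to 1.0), which of them land in the top k depends on how ties are broken.

```python
import heapq

def top_k_by_probability(names, probabilities, k, most_positive=True):
    """Return the k (probability, name) pairs with the highest or lowest probability."""
    pairs = list(zip(probabilities, names))
    pick = heapq.nlargest if most_positive else heapq.nsmallest
    return pick(k, pairs, key=lambda pair: pair[0])

names = ["stroller", "monitor", "bottle", "decal"]   # made-up product names
probs = [0.99, 0.02, 0.75, 0.10]                      # made-up P(y = +1) values
most_positive = top_k_by_probability(names, probs, k=2)
most_negative = top_k_by_probability(names, probs, k=2, most_positive=False)
```

Passing `most_positive=False` selects the lowest probabilities instead, which corresponds to sorting in the opposite order.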

**Quiz Question**: Which of the following products are represented in the 20 most positive reviews?

In [137]:
proba = sentiment_model.predict_proba(test_matrix)
sentiment = sentiment_model.predict(test_matrix)
test_data['predicted_sentiment'] = sentiment
test_data['predicted_proba'] = proba[:, 1]

In [138]:
top_twenty = test_data.sort('predicted_proba', ascending=False)[:20][['name', 'sentiment', 'predicted_sentiment', 'predicted_proba']]

In [160]:
top_twenty.print_rows(num_rows=20)

+-------------------------------+-----------+---------------------+-----------------+
|              name             | sentiment | predicted_sentiment | predicted_proba |
+-------------------------------+-----------+---------------------+-----------------+
| Baby Jogger City Mini GT S... |     1     |          1          |       1.0       |
| Graco FastAction Fold Jogg... |     1     |          1          |       1.0       |
| Evenflo 6 Pack Classic Gla... |     1     |          1          |       1.0       |
| Britax Decathlon Convertib... |     1     |          1          |       1.0       |
| Diono RadianRXT Convertibl... |     1     |          1          |       1.0       |
| P'Kolino Silly Soft Seatin... |     1     |          1          |       1.0       |
| Mamas &amp; Papas 2014 Urb... |     1     |          1          |       1.0       |
| Simple Wishes Hands-Free B... |     1     |          1          |       1.0       |
| Roan Rocco Classic Pram St... |     1     |         

Now, let us repeat this exercise to find the "most negative reviews." Use the prediction probabilities to find the 20 reviews in the test_data with the lowest probability of being classified as a positive review. Repeat the same steps above but make sure you sort in the opposite order.

**Quiz Question**: Which of the following products are represented in the 20 most negative reviews?

In [139]:
bottom_twenty = test_data.sort('predicted_proba')[:20][['name', 'sentiment', 'predicted_sentiment', 'predicted_proba']]

In [161]:
bottom_twenty.print_rows(num_rows=20)

+-------------------------------+-----------+---------------------+
|              name             | sentiment | predicted_sentiment |
+-------------------------------+-----------+---------------------+
| Fisher-Price Ocean Wonders... |     -1    |          -1         |
| Levana Safe N'See Digital ... |     -1    |          -1         |
| Safety 1st Exchangeable Ti... |     -1    |          -1         |
| Adiri BPA Free Natural Nur... |     -1    |          -1         |
| VTech Communications Safe ... |     -1    |          -1         |
| The First Years True Choic... |     -1    |          -1         |
| Safety 1st High-Def Digita... |     -1    |          -1         |
| Cloth Diaper Sprayer--styl... |     -1    |          -1         |
| Philips AVENT Newborn Star... |     -1    |          -1         |
| Motorola Digital Video Bab... |     -1    |          -1         |
| Ellaroo Mei Tai Baby Carri... |     -1    |          -1         |
| Cosco Alpha Omega Elite Co... |     -1    |   

## Quiz answers
1. 85974
2. third
3. top_twenty.print_rows(num_rows=20)
4. bottom_twenty.print_rows(num_rows=20)
5. 0.93
6. No

In [140]:
correct = len(test_data[test_data['sentiment'] == test_data['predicted_sentiment']])
total = len(test_data)
accuracy = correct / float(total)
accuracy

0.9322954163666907

### Learn another classifier with fewer words

In [141]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [142]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review clean'])

In [143]:
logistic = LogisticRegression()
simple_model = logistic.fit(train_matrix_word_subset, train_data['sentiment'])

In [144]:
simple_model_coef_table = sframe.SFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})

In [145]:
simple_positive_words = simple_model_coef_table[simple_model_coef_table['coefficient'] >= 0]

In [146]:
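# vocabulary_ is a dict (word -> column index); get_feature_names() lists the words
# in column order so they line up with coef_ (get_feature_names_out() in newer scikit-learn).
sentiment_model_coef_table = sframe.SFrame({'word':vectorizer.get_feature_names(),
                                         'coefficient':sentiment_model.coef_.flatten()})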

In [127]:
word_subset = sentiment_model_coef_table[sentiment_model_coef_table['word'].is_in(significant_words)]

No, not all positive words in the simple model are positive in the full model.

In [147]:
simple_model_predict = simple_model.predict(train_matrix_word_subset)
train_data['predicted_simple_model'] = simple_model_predict

In [148]:
full_model_predict = sentiment_model.predict(train_matrix)
train_data['predicted_full_model'] = full_model_predict

In [149]:
simple_model_train_accuracy = len(train_data[train_data['sentiment'] == train_data['predicted_simple_model']])/float(len(train_data))
full_model_train_accuracy = len(train_data[train_data['sentiment'] == train_data['predicted_full_model']])/float(len(train_data))
print(simple_model_train_accuracy, full_model_train_accuracy)

0.866822570007 0.96849703184


In [150]:
simple_model_test_predict = simple_model.predict(test_matrix_word_subset)
test_data['predicted_simple_model'] = simple_model_test_predict
full_model_test_predict = sentiment_model.predict(test_matrix)
test_data['predicted_full_model'] = full_model_test_predict

In [151]:
simple_model_test_accuracy = len(test_data[test_data['sentiment'] == test_data['predicted_simple_model']])/float(len(test_data))
full_model_test_accuracy = len(test_data[test_data['sentiment'] == test_data['predicted_full_model']])/float(len(test_data))
print(simple_model_test_accuracy, full_model_test_accuracy)

0.869360451164 0.932295416367


In [152]:
import sklearn

In [154]:
from sklearn.dummy import DummyClassifier

In [162]:
dummy = DummyClassifier(strategy='most_frequent')

In [163]:
dummy_model = dummy.fit(train_matrix, train_data['sentiment'])

In [164]:
test_data['dummy'] = dummy_model.predict(test_matrix)

In [165]:
dummy_test_accuracy = len(test_data[test_data['sentiment'] == test_data['dummy']])/float(len(test_data))
print(dummy_test_accuracy)

0.842782577394
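
The `most_frequent` strategy ignores the features and always predicts the label that is most common in the training data, so its test accuracy is simply the fraction of test reviews carrying that label. A minimal standalone sketch, using made-up labels and a hypothetical helper name:

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for y in test_labels if y == majority_label)
    return majority_label, correct / float(len(test_labels))

train_labels = [1, 1, 1, -1]        # made-up labels for illustration
test_labels = [1, -1, 1, 1, -1]
label, baseline_accuracy = majority_class_accuracy(train_labels, test_labels)
```

A classifier is only useful if it beats this baseline; here the baseline reaches about 0.843 on the test data, compared with about 0.869 for the simple model and about 0.932 for the full model.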
