# Predicting sentiment from product reviews

The goal of this assignment is to explore logistic regression and feature engineering with existing GraphLab Create functions.

In this assignment, you will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative. You will:

- Use SFrames to do some feature engineering.
- Train a logistic regression model to predict the sentiment of product reviews.
- Inspect the weights (coefficients) of a trained logistic regression model.
- Make a prediction (both class and probability) of sentiment for a new product review.
- Given the logistic regression weights, predictors and ground truth labels, write a function to compute the accuracy of the model.
- Inspect the coefficients of the logistic regression model and interpret their meanings.
- Compare multiple logistic regression models.

In [1]:
import graphlab
import string

Load Amazon dataset

In [2]:
products = graphlab.SFrame('amazon_baby.gl/')

Perform text cleaning

- Write a function remove_punctuation that strips punctuation from a line of text
- Apply this function to every element in the review column of products, and save the result to a new column review_clean.

In [3]:
products['review_clean'] = products['review'].fillna('')

In [4]:
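def remove_punctuation(text):
    # Python 2 str.translate: None means no character mapping is applied;
    # the second argument lists the characters to delete.
    return text.translate(None, string.punctuation)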

review_without_punctuation = products['review_clean'].apply(remove_punctuation)
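The two-argument form of `str.translate` used above exists only in Python 2. A minimal sketch of an alternative that should behave the same on Python 2 and Python 3 follows; the name `remove_punctuation_portable` is hypothetical and not part of the assignment.

```python
import string

# Hypothetical helper, not part of the assignment: it filters characters
# one by one instead of relying on the Python 2-only str.translate form.
_PUNCTUATION = set(string.punctuation)


def remove_punctuation_portable(text):
    """Return text with every ASCII punctuation character removed."""
    return ''.join(ch for ch in text if ch not in _PUNCTUATION)
```

It could be passed to `products['review_clean'].apply(...)` in place of `remove_punctuation`.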

Build the word count vector for each review

In [5]:
products['word_count'] = graphlab.text_analytics.count_words(review_without_punctuation)
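To make the word count vector concrete, here is a rough pure-Python sketch of what a per-review word count looks like. It assumes lowercasing and whitespace splitting; `count_words_sketch` is a hypothetical stand-in, and the GraphLab function may tokenize differently.

```python
from collections import Counter


def count_words_sketch(text):
    """Illustrative stand-in: lowercase, split on whitespace, count tokens."""
    # Assumption: tokens are separated by whitespace only.
    return dict(Counter(text.lower().split()))


example_counts = count_words_sketch("Love it love IT")
```

Each review becomes a dictionary that maps a word to the number of times it occurs, which is the shape of the `word_count` column.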

Extract Sentiments

In [6]:
products = products[products['rating'] != 3]

In [7]:
products['sentiment'] = products['rating'].apply(lambda x: +1 if x > 3 else -1)
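The two steps above (drop neutral 3-star reviews, then map the remaining ratings to +1 or -1) can be sketched on a plain list. The ratings below are made up for illustration and `rating_to_sentiment` is a hypothetical name.

```python
def rating_to_sentiment(rating):
    """Map a non-neutral star rating to +1 (positive) or -1 (negative)."""
    return 1 if rating > 3 else -1


# Made-up ratings, for illustration only.
ratings = [5, 3, 1, 4, 2]

# Drop neutral reviews first, then label what is left.
non_neutral = [r for r in ratings if r != 3]
labels = [rating_to_sentiment(r) for r in non_neutral]
```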

Split into training and test sets

In [8]:
train_data, test_data = products.random_split(0.8, seed=1)

Train a sentiment classifier with logistic regression

In [9]:
sentiment_model = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=['word_count'],
                                                      validation_set=None)

# How many weights are >= 0?

In [10]:
weights = sentiment_model.coefficients
positive_weights = weights[weights['value'] >= 0]
print len(positive_weights)

68419


Making predictions with logistic regression

In [11]:
sample_test_data = test_data[10:13]
print sample_test_data

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
|   Our Baby Girl Memory Book   | Absolutely love it and all... |  5.0   |
| Wall Decor Removable Decal... | Would not purchase again o... |  2.0   |
| New Style Trailing Cherry ... | Was so excited to get this... |  1.0   |
+-------------------------------+-------------------------------+--------+
+-------------------------------+-------------------------------+-----------+
|          review_clean         |           word_count          | sentiment |
+-------------------------------+-------------------------------+-----------+
| Absolutely love it and all... | {'and': 2, 'all': 1, 'love... |     1     |
| Would not purchase again o... | {'and': 1, 'would': 2, 'al... |     -1    |
| Was so excited to get this... | {'all': 1, 'money': 1, 'in... |     -1    |
+-------------------------------+-------------------------------+-----------+

In [12]:
sample_test_data[2]['review']

"Was so excited to get this product for my baby girls bedroom!  When I got it the back is NOT STICKY at all!  Every time I walked into the bedroom I was picking up pieces off of the floor!  Very very frustrating!  Ended up having to super glue it to the wall...very disappointing.  I wouldn't waste the time or money on it."

Probability Predictions

# Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

In [13]:
scores = sentiment_model.predict(sample_test_data, output_type='probability')
print scores

[0.9988123848377207, 0.0032232681817989848, 4.261557996652647e-07]
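A logistic regression model turns a real-valued score into a probability with the sigmoid function. The sketch below assumes the score is an intercept plus the sum of count times weight over the words in a review; the weights are made up and the function names are hypothetical, so this illustrates the idea rather than GraphLab's internals.

```python
import math


def sigmoid(score):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_probability(word_count, weights, intercept=0.0):
    """P(sentiment = +1) for one review given as a word -> count dict."""
    # Words without a learned weight contribute nothing to the score.
    score = intercept + sum(count * weights.get(word, 0.0)
                            for word, count in word_count.items())
    return sigmoid(score)


# Made-up weights, for illustration only.
example_weights = {'love': 1.5, 'waste': -2.0}
p_pos = predict_probability({'love': 2, 'it': 1}, example_weights)
p_neg = predict_probability({'waste': 1, 'money': 1}, example_weights)
```

A score of 0 gives a probability of exactly 0.5, large positive scores approach 1, and large negative scores approach 0, which matches the pattern of the three probabilities printed above.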


In [14]:
test_data['pred_prob'] = sentiment_model.predict(test_data, output_type='probability')

Find the most positive (and negative) review

In [15]:
sorted_test_data = test_data.sort(['pred_prob'], ascending=False)

In [16]:
top20 = sorted_test_data.topk('pred_prob', 20)

# Which of the following products are represented in the 20 most positive reviews?

In [17]:
top20[0:20]['name']

dtype: str
Rows: 20
['Peg Perego Aria Light Weight One Hand Fold Stroller in Moka', 'Regalo Easy Step Walk Thru Gate, White', 'Ingenuity Cradle and Sway Swing, Bella Vista', "The Original CJ's BuTTer (All Natural Mango Sugar Mint, 12 oz. tub)", 'Moby Wrap Original 100% Cotton Baby Carrier, Red', 'Moby Wrap Original 100% Cotton Baby Carrier, Red', 'Baby Jogger City Mini GT Double Stroller, Shadow/Orange', 'Baby Jogger City Mini GT Single Stroller, Shadow/Orange', 'Ameda Purely Yours Breast Pump - Carry All', 'Fisher-Price Rainforest Melodies and Lights Deluxe Gym', 'Munchkin Mozart Magic Cube', 'bumGenius One-Size Cloth Diaper Twilight', 'timi &amp; leslie Charlie 7-Piece Diaper Bag Set, Light Brown', 'Skip Hop Studio Diaper Bag, Black Dot', 'Philips AVENT BPA Free Contemporary Freeflow Pacifier, 0-6 Months, 2-Pack, Colors and Designs May Vary', 'Skip Hop Bento Diaper Tote Bag, Black', 'Safety 1st Magnetic Locking System', 'Baby Planet Endangered Species Sport Lemur Frog Stroller', 'Sum
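Selecting the k rows with the highest or lowest predicted probability does not depend on how the rows were sorted beforehand. A small sketch with the standard `heapq` module follows; the rows are made up and `top_k` is a hypothetical helper, not the SFrame method.

```python
import heapq


def top_k(rows, key, k, lowest=False):
    """Return the k rows with the largest (or smallest) value under key."""
    pick = heapq.nsmallest if lowest else heapq.nlargest
    return pick(k, rows, key=lambda row: row[key])


# Made-up rows, for illustration only.
reviews = [
    {'name': 'A', 'pred_prob': 0.20},
    {'name': 'B', 'pred_prob': 0.99},
    {'name': 'C', 'pred_prob': 0.01},
    {'name': 'D', 'pred_prob': 0.75},
]

most_positive = [row['name'] for row in top_k(reviews, 'pred_prob', 2)]
most_negative = [row['name'] for row in top_k(reviews, 'pred_prob', 2, lowest=True)]
```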

In [18]:
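# topk returns the k largest values by default, regardless of any prior sort,
# so reverse=True is needed for the 20 lowest probabilities. The table printed
# below, with its 4.0 and 5.0 ratings, reflects a call without reverse=True.
bot20 = test_data.topk('pred_prob', 20, reverse=True)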

# Which of the following products are represented in the 20 most negative reviews?

In [19]:
bot20.print_rows(20)

+-------------------------------+-------------------------------+--------+
|              name             |             review            | rating |
+-------------------------------+-------------------------------+--------+
| HABA Heart Princess Pacifi... | I got introduced to Haha p... |  5.0   |
| Evenflo X Sport Plus Conve... | After seeing this in Paren... |  5.0   |
| Meeno Baby Cool Me Seat Li... | Numbers speak for themselv... |  5.0   |
| Medela Contact Nipple Shie... | My son has a very bad latc... |  4.0   |
| Lamaze High-Contrast Panda... | Ha ha, this was bought for... |  5.0   |
| Safety 1st Magnetic Lockin... | I installed the previous v... |  5.0   |
| ESPRIT Sun Speed Stroller ... | I bought the orange one fo... |  4.0   |
| BRICA Baby In-Sight Magica... | First off, let me start by... |  4.0   |
| Baby Planet Endangered Spe... | We purchased this Baby Pla... |  5.0   |
| RECARO ProRIDE Convertible... | We recently moved our son ... |  5.0   |
| green sprouts Stacking 

Compute accuracy of the classifier

In [20]:
test_data['pred_sentiment'] = sentiment_model.predict(test_data, output_type='class')

In [21]:
test_data['correc_pred'] = (test_data['sentiment'] == test_data['pred_sentiment'])

# What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

In [22]:
print test_data['correc_pred'].sum() / float(len(test_data))

0.914536837053
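The accuracy computation above is the number of correctly classified reviews divided by the total number of reviews. A minimal standalone sketch on plain lists follows; the function name and the sample labels are illustrative assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the ground truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))


# Made-up predictions and labels: three of the four agree.
example_accuracy = accuracy([1, -1, 1, 1], [1, -1, -1, 1])
```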


In [23]:
train_data2 = train_data  # same SFrame object, not a copy: columns added below also appear on train_data

In [24]:
train_data2['pred_sentiment'] = sentiment_model.predict(train_data2, output_type='class')

In [25]:
train_data2['correc_pred'] = (train_data2['sentiment'] == train_data2['pred_sentiment'])

# Does a higher accuracy value on the training_data always imply that the classifier is better?

In [26]:
print train_data2['correc_pred'].sum() / float(len(train_data2))

0.979440247047


Learn another classifier with fewer words

In [27]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [28]:
train_data['word_count_subset'] = train_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)
test_data['word_count_subset'] = test_data['word_count'].dict_trim_by_keys(significant_words, exclude=False)
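With `exclude=False`, trimming by keys keeps only the listed words in each word count dictionary. A plain-Python sketch of that behaviour on a single dictionary follows; `trim_by_keys` is a hypothetical helper and the sample counts are made up.

```python
def trim_by_keys(word_count, keep):
    """Keep only the entries of word_count whose key is in keep."""
    keep = set(keep)
    return {word: count for word, count in word_count.items() if word in keep}


# Made-up word counts, for illustration only.
trimmed = trim_by_keys({'love': 2, 'the': 5, 'waste': 1}, ['love', 'waste', 'great'])
```

Words from the list that never occur in a review (here `great`) are simply absent from the result.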

Train a logistic regression model on a subset of the words

In [29]:
sentiment_model2 = graphlab.logistic_classifier.create(train_data,
                                                      target = 'sentiment',
                                                      features=['word_count_subset'],
                                                      validation_set=None)

# Consider the coefficients of simple_model (named sentiment_model2 in this notebook). How many of the 20 coefficients (corresponding to the 20 significant_words) are positive for the simple_model?

In [30]:
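weights2 = sentiment_model2.coefficients
# coefficients includes the '(intercept)' row, so the count below includes
# the intercept if its value is positive.
positive_weights2 = weights2[weights2['value'] > 0]
print len(positive_weights2)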

11


Comparing models

# Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

In [31]:
test_data['pred_sentiment2'] = sentiment_model2.predict(test_data, output_type='class')
test_data['correc_pred2'] = (test_data['sentiment'] == test_data['pred_sentiment2'])
print test_data['correc_pred2'].sum() / float(len(test_data))

0.869300455964


# Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

In [32]:
train_data2['pred_sentiment2'] = sentiment_model2.predict(train_data2, output_type='class')
train_data2['correc_pred2'] = (train_data2['sentiment'] == train_data2['pred_sentiment2'])
print train_data2['correc_pred2'].sum() / float(len(train_data2))

0.866815074654


Baseline: Majority class prediction

In [33]:
test_data['one'] = 1
test_data['baseline1'] = (test_data['sentiment'] == test_data['one'])
print test_data['baseline1'].sum() / float(len(test_data))

0.842782577394
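The cell above hard-codes +1 as the majority class. A slightly more general sketch would pick the most common label from the training labels and then score it on the test labels; the function name and the sample labels below are assumptions for illustration.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Return (majority label in train_labels, its accuracy on test_labels)."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, hits / float(len(test_labels))


# Made-up labels, for illustration only.
baseline = majority_class_accuracy([1, 1, -1], [1, -1, 1, 1])
```

A classifier is only useful if its test accuracy clearly exceeds this baseline.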
