# Predicting sentiment from product reviews


The goal of this first notebook is to explore logistic regression and feature engineering with existing GraphLab functions.

In this notebook you will use product review data from Amazon.com to predict whether the sentiment expressed in a product's reviews is positive or negative. You will:

* Use SFrames to do some feature engineering.
* Train a logistic regression model to predict the sentiment of product reviews.
* Inspect the weights (coefficients) of a trained logistic regression model.
* Make a prediction (both class and probability) of sentiment for a new product review.
* Given the logistic regression weights, predictors and ground truth labels, write a function to compute the **accuracy** of the model.
* Inspect the coefficients of the logistic regression model and interpret their meanings.
* Compare multiple logistic regression models.
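
As a rough sketch of the accuracy computation listed above: it assumes labels are coded +1/-1 (as the sentiment column later in this notebook is) and that the class is predicted from the sign of the linear score, which is equivalent to thresholding the predicted probability at 0.5. The function name, variable names, and toy numbers are hypothetical.

```python
import numpy as np

def compute_accuracy(weights, features, labels):
    """Fraction of examples whose predicted class matches the true label."""
    scores = features.dot(weights)              # linear score w^T x for every row
    predictions = np.where(scores > 0, 1, -1)   # positive score -> +1, otherwise -1
    return np.mean(predictions == labels)

# Toy data (made up): an intercept column plus one feature
features = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]])
weights = np.array([0.0, 1.0])
labels = np.array([1, -1, -1, -1])

accuracy = compute_accuracy(weights, features, labels)
print(accuracy)
```

Here three of the four toy predictions agree with the labels.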

Let's get started!

### Import Necessary Libraries

In [40]:
from __future__ import division

#Import numpy and pandas
import numpy as np
import pandas as pd

#Import matplotlib and seaborn for plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns

#Magic line to print graph on same notebook
%matplotlib inline

import math

### Step 1 : Data Preparation
We will use a dataset consisting of baby product reviews on Amazon.com.

In [47]:
#Passing dtype={'name':str,'review':str,'rating':int} to read_csv did not give the expected
#column types in pandas, so for the time being the types are assigned after reading the file
products_df = pd.read_csv( "../../../../data/amazon_baby/amazon_baby.csv")

products_df.name = products_df.name.astype(str)
products_df.review = products_df.review.astype(str)
products_df.rating = products_df.rating.astype(int)

#Print the info on the dataframe
products_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
name      183531 non-null object
review    183531 non-null object
rating    183531 non-null int64
dtypes: int64(1), object(2)
memory usage: 4.2+ MB
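
One possible reason the `dtype` argument seems to have no effect is that `read_csv` parses empty fields as NaN, which is a float even inside a string column. The sketch below assumes that is the cause here (the notebook does not confirm it) and uses three made-up rows in place of the real file. `keep_default_na=False` tells pandas to leave empty fields as empty strings.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real file; the second review is empty
csv_text = u"name,review,rating\nWipes,Works fine,4\nPouch,,5\nQuilt,Too thin,1\n"

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'name': str, 'review': str, 'rating': int},
    keep_default_na=False,   # keep empty fields as '' instead of NaN
)
print(sample_df['review'].tolist())
```

With this option no later `astype` or `fillna` call is needed for the text columns, at the cost of also disabling NaN detection for markers such as `NA`.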


In [48]:
#Print the shape of the dataframe
print(products_df.shape)
#Now let's print the first few rows
products_df.head()

(183531, 3)


| | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


### Step 2 : Perform Text Cleaning

In [49]:
import string
def remove_punctutation(text):
    """Return text with all punctuation removed.
    This function strips every punctuation character from a line of text.
    We will apply it to every element in the review column of products_df.
    """
    #Python 2 str.translate: no translation table, delete all chars in string.punctuation
    return text.translate(None, string.punctuation)
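
The two-argument `translate` call above is specific to Python 2 strings. Under Python 3, a roughly equivalent helper might look like the sketch below; the name `remove_punctuation_py3` is hypothetical and not part of the original notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None (delete)
_PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with all ASCII punctuation removed (Python 3)."""
    return text.translate(_PUNCTUATION_TABLE)

cleaned = remove_punctuation_py3("it came early and was not disappointed. i love it!")
print(cleaned)
```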

In [53]:
#Fill n/a values in the review column with empty strings (if applicable).
#The n/a values indicate empty reviews. The pandas fillna() method
#replaces all N/A values in the review column as follows.
#This must happen before cleaning, because a missing value is not a string.
products_df = products_df.fillna({'review':''})
#Apply the cleaning function to the review column
products_df['review_clean'] = products_df['review'].apply(remove_punctutation)
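
The same fill-then-clean order can also be written at the column level. The sketch below assumes Python 3 and uses made-up reviews. It relies on `Series.str.translate`, which applies a translation table to every element without an explicit `apply`.

```python
import string
import pandas as pd

# Made-up reviews; the second one is missing
reviews = pd.Series(["Great, love it!", None, "Non-stop crying..."])

# Table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

# Fill missing reviews first, then strip punctuation from the whole column
cleaned_reviews = reviews.fillna('').str.translate(table)
print(cleaned_reviews.tolist())
```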

### Step 3  : Extract Sentiments

In [54]:
# We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.
products_df = products_df[products_df['rating'] != 3]

In [57]:
#Now we will label each remaining review as positive or negative
#rating >= 4 --> positive review (+1)
#rating <= 2 --> negative review (-1)
products_df['sentiment'] = products_df['rating'].apply(lambda rating: +1 if rating > 3 else -1)
products_df.head()

| | name | review | rating | review_clean | sentiment |
|---|---|---|---|---|---|
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 | it came early and was not disappointed i love ... | 1 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 | Very soft and comfortable and warmer than it l... | 1 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 | This is a product well worth the purchase I h... | 1 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 | All of my kids have cried nonstop when I tried... | 1 |
| 5 | Stop Pacifier Sucking without tears with Thumb... | When the Binky Fairy came to our house, we did... | 5 | When the Binky Fairy came to our house we didn... | 1 |
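
The filtering and labeling steps above can also be expressed without a per-row lambda. The sketch below uses a made-up ratings column. The `.copy()` call is an addition that makes the filtered frame independent of the original, which avoids pandas' chained-assignment warning when the new column is added.

```python
import numpy as np
import pandas as pd

# Made-up ratings standing in for the real reviews
ratings_df = pd.DataFrame({'rating': [5, 3, 1, 4, 2]})

# Drop neutral reviews, then label the rest: rating > 3 -> +1, otherwise -1
labeled_df = ratings_df[ratings_df['rating'] != 3].copy()
labeled_df['sentiment'] = np.where(labeled_df['rating'] > 3, 1, -1)
print(labeled_df)
```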


### Step 4 : Split into training and test sets

In [62]:
#sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
#Hold out 20% of the rows for testing (the default would be 75% / 25%)
train_data, test_data = train_test_split(products_df, test_size=0.2, random_state=1)
print(len(train_data))
print(len(test_data))
#I am going to stop using sklearn here for this assignment, simply because train_test_split
#produces a different number of rows and columns than the GraphLab split does.
#It would waste too much time to keep using sklearn while trying to match the expected answers.
#Hence I will go back to using sklearn after completing the assignments.

133401
33351
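
For comparison, an 80/20 split can be sketched with pandas alone. This is an illustration on made-up data, not the method the assignment uses. Which rows land in each part depends on the library's random number generator and sampling procedure, so the same seed in another library will generally not reproduce the same split.

```python
import pandas as pd

# Ten made-up rows standing in for the review data
toy_df = pd.DataFrame({'review_id': range(10)})

# Sample 80% of the rows for training; the remaining rows form the test set
train_part = toy_df.sample(frac=0.8, random_state=1)
test_part = toy_df.drop(train_part.index)

print(len(train_part))
print(len(test_part))
```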
