*by @rguptabinary*

Credit: Coursera assignment

In this notebook we build three models on the Amazon reviews dataset.
* Model 1: uses all the words appearing in the dataset as the model's vocabulary.
* Model 2: uses a limited set of words as the vocabulary.
* Model 3: majority class prediction.

We'll use sklearn's LogisticRegression.

In [2]:
import sframe
products = sframe.SFrame('amazon_baby.gl/')

## Exploring the data

In [3]:
print "products has {} rows and {} columns".format(products.shape[0], products.shape[1])
print "column names {}.".format(products.column_names())
print "rating unique values: {}".format(products['rating'].unique().sort())

products has 183531 rows and 3 columns
column names ['name', 'review', 'rating'].
rating unique values: [1.0, 2.0, 3.0, 4.0, 5.0]


## Perform text cleaning
We would like to remove punctuation characters from reviews so that words like "cool" and "cool!" are treated as the same. The downside of this simple technique is that contractions like "would've" and "should've" lose their apostrophes and become "wouldve" and "shouldve". A smarter method would be needed to handle them. Below we:
* define a remove_punctuation function that takes a review text and returns a cleaned one.
* create a new column 'review_clean' in the data frame and fill it using that function.

We also replace reviews that have an NA value with empty strings.

In [4]:
# fill in NA values with empty strings for review column
products = products.fillna('review','')

In [5]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)

In [6]:
products[0:1]

name,review,rating,review_clean
Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3.0,These flannel wipes are OK but in my opinion not ...
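
The `text.translate(None, string.punctuation)` call above works only on Python 2 byte strings. As a minimal sketch, assuming Python 3 (the name `remove_punctuation_py3` is hypothetical and not part of the notebook), the same cleaning can be written with a translation table. It also shows the downside mentioned earlier: the apostrophe in "would've" is simply dropped.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with ASCII punctuation removed (Python 3 sketch)."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation_py3("cool!"))     # cool
print(remove_punctuation_py3("would've"))  # wouldve
```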


## Extract the sentiments
We will ignore reviews with rating = 3, treating them as neutral. Reviews with rating > 3 are treated as positive (+1) and reviews with rating < 3 as negative (-1).

In [7]:
# ignore all neutral reviews, i.e. those with rating 3
products = products[products['rating'] != 3]

# making reviews categorical with labels +1, -1
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

products[products['rating'] < 3][0:1]

name,review,rating,review_clean,sentiment
Nature's Lullabies Second Year Sticker Calendar ...,I only purchased a second-year calendar for ...,2.0,I only purchased a secondyear calendar for ...,-1


## Splitting the data into training and testing set
We train the model on the training set and evaluate it on the testing set using some metric. Testing on data the model hasn't seen lets us detect overfitting and gives a less biased estimate of performance. Here we split the data in a ratio of roughly 80:20.

In [8]:
# splitting into training and testing data
train_data, test_data = products.random_split(.8, seed=1)
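
To illustrate what a seeded split does, here is a minimal sketch in plain Python. It is not SFrame's `random_split` algorithm; the helper name `split_rows` and the shuffle-then-cut approach are assumptions made for illustration. Fixing the seed makes the split reproducible across runs.

```python
import random

def split_rows(rows, fraction, seed):
    """Shuffle a copy of rows with a fixed seed and cut it at `fraction`."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
train_rows, test_rows = split_rows(rows, 0.8, seed=1)
print(len(train_rows), len(test_rows))
```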

## Bag-of-word features
We will compute the count of each word that appears in a review. Since most words occur in only a few reviews, this vector is sparse.

In [9]:
# build a word count vector for each review
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

# First, learn vocabulary from the training data and assign columns to words
train_matrix = vectorizer.fit_transform(train_data['review_clean'])

# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

In [10]:
train_matrix.shape

(133416, 121712)
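
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of a bag-of-words encoding. It is not how `CountVectorizer` is implemented; the helpers `learn_vocabulary` and `to_sparse_counts` are hypothetical names. The vocabulary is learned from the training text only, so words that appear only in test text are dropped, and each review stores only its non-zero counts.

```python
import re
from collections import Counter

TOKEN = re.compile(r'\b\w+\b')  # same token pattern as in the cell above

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def learn_vocabulary(docs):
    """Map each word seen in docs to a column index (sorted for stability)."""
    words = sorted({w for doc in docs for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}

def to_sparse_counts(doc, vocab):
    """Return {column: count} for the words of doc that are in vocab."""
    counts = Counter(tokenize(doc))
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

train_docs = ["I love it", "love love this"]
vocab = learn_vocabulary(train_docs)
print(vocab)
# "stroller" was never seen in training, so it gets no column.
print(to_sparse_counts("love this stroller", vocab))
```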

## Training time!

In [11]:
# train a sentiment classifier with logistic regression
from sklearn.linear_model import LogisticRegression

sentiment_model = LogisticRegression()

sentiment_model.fit(train_matrix, train_data['sentiment'])


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
# number of non-negative weights (>= 0)
sum(x>=0 for x in sentiment_model.coef_[0])

85810

In [13]:
# making predictions with the learned model

In [14]:
sample_test_data = test_data[10:13]
print sample_test_data[0]['review_clean']

Absolutely love it and all of the Scripture in it  I purchased the Baby Boy version for my grandson when he was born and my daughterinlaw was thrilled to receive the same book again


In [15]:
print sample_test_data[1]['review_clean']

Would not purchase again or recommend The decals were thick almost plastic like and were coming off the wall as I was applying them The would NOT stick Literally stayed stuck for about 5 minutes then started peeling off


In [16]:
print sample_test_data[2]['review_clean']

Was so excited to get this product for my baby girls bedroom  When I got it the back is NOT STICKY at all  Every time I walked into the bedroom I was picking up pieces off of the floor  Very very frustrating  Ended up having to super glue it to the wallvery disappointing  I wouldnt waste the time or money on it


In [17]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print scores

[  5.60283428  -3.14916172 -10.41701265]
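
The scores printed above are linear margins: for logistic regression, `decision_function` returns the dot product of the learned weights with the feature vector plus the intercept, and `predict` returns the positive class when that score is greater than zero. A minimal sketch with made-up numbers (the values in `weights` and `intercept` are hypothetical, not taken from the trained model):

```python
# Hypothetical word weights and intercept, for illustration only.
weights = {'love': 1.5, 'great': 1.0, 'waste': -2.0, 'disappointed': -2.5}
intercept = 0.5

def linear_score(word_counts):
    """Score = intercept + sum of weight * count over known words."""
    return intercept + sum(weights.get(w, 0.0) * c
                           for w, c in word_counts.items())

def predict_label(score):
    """+1 when the score is positive, otherwise -1."""
    return 1 if score > 0 else -1

positive = {'love': 2, 'great': 1}
negative = {'waste': 1, 'disappointed': 1}
print(predict_label(linear_score(positive)))
print(predict_label(linear_score(negative)))
```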


In [18]:
# predicting on the first sample, a positive review
sentiment_model.predict(sample_test_matrix[0])

array([1], dtype=int64)

In [19]:
import numpy as np
def calculate_prob(scores):
    return 1.0/(1+np.exp(-1.0*scores))

In [20]:
scores = np.array(sentiment_model.decision_function(sample_test_matrix))
scores = calculate_prob(scores)
scores

array([  9.96326149e-01,   4.11243216e-02,   2.99182300e-05])
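
`calculate_prob` is the logistic (sigmoid) function, which maps a score to the estimated probability of the positive class. As a quick check using only the standard library, applying it to the three scores printed earlier should reproduce the probabilities above. The helper name `sigmoid` is an assumption for this sketch.

```python
import math

def sigmoid(score):
    """Logistic function: 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))

# Scores copied from the decision_function output above.
for s in [5.60283428, -3.14916172, -10.41701265]:
    print(sigmoid(s))
```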

In [21]:
# 3rd one is the most negative review

## Find the most positive (and negative) review
* Make probability predictions on test_data using the sentiment_model.
* Sort the data according to those predictions and pick the top 20.

In [22]:
test_data_vectorized = vectorizer.transform(test_data['review_clean'])

In [23]:
test_data_vectorized.shape

(33336, 121712)

In [24]:
test_data['score'] = sentiment_model.decision_function(test_data_vectorized).reshape(-1)

test_data_prob = calculate_prob(test_data['score'])

test_data['prob_score'] = test_data_prob

test_data['predictions'] = sentiment_model.predict(test_data_vectorized)

test_data = test_data.sort('prob_score', ascending=False)

test_data[0:20]

name,review,rating,review_clean,sentiment
"Infantino Wrap and Tie Baby Carrier, Black ...",I bought this carrier when my daughter was ...,5.0,I bought this carrier when my daughter was ...,1
Mamas &amp; Papas 2014 Urbo2 Stroller - Black ...,After much research I purchased an Urbo2. It's ...,4.0,After much research I purchased an Urbo2 Its ...,1
Evenflo X Sport Plus Convenience Stroller - ...,After seeing this in Parent's Magazine and ...,5.0,After seeing this in Parents Magazine and ...,1
"Baby Jogger City Mini GT Single Stroller, ...","Amazing, Love, Love, Love it !!! All 5 STARS all ...",5.0,Amazing Love Love Love it All 5 STARS all the w ...,1
Graco FastAction Fold Jogger Click Connect ...,Graco's FastAction Jogging Stroller ...,5.0,Gracos FastAction Jogging Stroller definitely g ...,1
Roan Rocco Classic Pram Stroller 2-in-1 with ...,Great Pram Rocco!!!!!!I bought this pram from ...,5.0,Great Pram RoccoI bought this pram from Europe ...,1
"Britax 2012 B-Agile Stroller, Red ...",[I got this stroller for my daughter prior to the ...,4.0,I got this stroller for my daughter prior to the ...,1
Freemie Hands-Free Concealable Breast Pump ...,I absolutely love this product. I work as a ...,5.0,I absolutely love this product I work as a ...,1
"P'Kolino Silly Soft Seating in Tias, Green ...",I've purchased both the P'Kolino Little Reader ...,4.0,Ive purchased both the PKolino Little Reader ...,1
Baby Einstein Around The World Discovery Center ...,I am so HAPPY I brought this item for my 7 mo ...,5.0,I am so HAPPY I brought this item for my 7 mo ...,1

score,prob_score,predictions
53.5447091132,1.0,1
40.8739657108,1.0,1
40.4746780404,1.0,1
38.20916635,1.0,1
39.8726614079,1.0,1
41.8885046733,1.0,1
48.3269510045,1.0,1
38.9144924702,1.0,1
43.2443955792,1.0,1
52.0364368725,1.0,1
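
Note that `prob_score` is displayed as 1.0 for every row above, while the `score` column is not in descending order. A likely explanation is floating-point saturation: for scores this large, `exp(-score)` is too small to change `1 + exp(-score)` in double precision, so the probabilities tie at exactly 1.0 and the sort cannot distinguish them. The sketch below, using only the standard library, demonstrates the effect on scores copied from the table. Sorting on the raw score instead would avoid the ties.

```python
import math

def sigmoid(score):
    """Logistic function: 1 / (1 + exp(-score))."""
    return 1.0 / (1.0 + math.exp(-score))

# Three scores copied from the table above.
scores = [53.5447091132, 40.8739657108, 38.20916635]
probs = [sigmoid(s) for s in scores]
print(probs)                         # every value is exactly 1.0
print(sorted(scores, reverse=True))  # raw scores still rank the reviews
```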


## Computing accuracy

In [27]:
correct_pred = test_data[test_data['sentiment'] == test_data['predictions']].shape[0]
total_len = test_data.shape[0]

accuracy = (1.0 * correct_pred)/total_len

print accuracy

0.932235421166
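
Accuracy is the fraction of examples whose predicted label equals the true label. A minimal pure-Python sketch of the same calculation (the function name `accuracy_score_simple` and the toy labels are assumptions for illustration):

```python
def accuracy_score_simple(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return float(correct) / len(y_true)

y_true = [1, 1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1]
print(accuracy_score_simple(y_true, y_pred))  # 0.6
```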


## Model with selective and limited vocabulary

In [28]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [29]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
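
With a fixed vocabulary, every review is reduced to one count per listed word and all other words are ignored. A minimal pure-Python sketch of that idea follows; the helper `count_fixed_vocabulary` and the shortened word list are assumptions for illustration, and this is not `CountVectorizer`'s implementation.

```python
import re

# Shortened word list for the sketch; the notebook uses 20 words.
significant = ['love', 'great', 'waste', 'money']

def count_fixed_vocabulary(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    return [tokens.count(word) for word in vocabulary]

print(count_fixed_vocabulary("Great stroller we love love it", significant))
```

Each review becomes a vector of the same short length, which keeps the model small and its weights easy to inspect.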

In [30]:
simple_model = LogisticRegression()

simple_model.fit(train_matrix_word_subset, train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [31]:
test_data['score2'] = simple_model.decision_function(test_matrix_word_subset).reshape(-1)

test_data_prob2 = calculate_prob(test_data['score2'])

test_data['prob_score2'] = test_data_prob2

test_data['predictions2'] = simple_model.predict(test_matrix_word_subset)

In [32]:
sum(x>=0 for x in simple_model.coef_[0])

10

In [33]:
correct_pred2 = test_data[test_data['sentiment'] == test_data['predictions2']].shape[0]
total_len = test_data.shape[0]

accuracy2 = (1.0 * correct_pred2)/total_len

print accuracy2

0.869360451164


## Majority class prediction
Our model should perform better than random guessing. One benchmark we can use is majority class prediction: we compute the accuracy obtained by predicting the majority class, here positive, for every review.

In [34]:
# majority class
correct_pred3 = test_data[test_data['sentiment'] == 1].shape[0]
total_len = test_data.shape[0]

accuracy3 = (1.0 * correct_pred3)/total_len

print accuracy3

0.842782577394
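
The baseline can also be written generically: find the most frequent label and measure how often it is correct. In the notebook the positive class is hard-coded; the sketch below, with the hypothetical helper `majority_class_accuracy` and toy labels, instead picks the majority label from the training labels, which is an assumption of this sketch.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Predict the most common training label for every test example."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == majority_label)
    return majority_label, float(hits) / len(test_labels)

train_labels = [1, 1, 1, -1, 1, -1]
test_labels = [1, -1, 1, 1]
print(majority_class_accuracy(train_labels, test_labels))  # (1, 0.75)
```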


In [35]:
print 'Model 1 (all words) accuracy     : {}'.format(accuracy)
print 'Model 2 (limited words) accuracy : {}'.format(accuracy2)
print 'Model 3 (majority class) accuracy: {}'.format(accuracy3)

Model 1 (all words) accuracy     : 0.932235421166
Model 2 (limited words) accuracy : 0.869360451164
Model 3 (majority class) accuracy: 0.842782577394
