# Logistic Regression

The goal of this assignment is to implement your own logistic regression classifier. You will:

- Extract features from Amazon product reviews.
- Convert a pandas DataFrame into a NumPy array.
- Implement the link function for logistic regression.
- Write a function to compute the derivative of the log likelihood function with respect to a single coefficient.
- Implement gradient ascent.
- Given a set of coefficients, predict sentiments.
- Compute classification accuracy for the logistic regression model.

In [35]:
from __future__ import division  # __future__ imports must come before any other statement
import pandas as pd
import numpy as np
import string
import json

Load review dataset

In [3]:
products = pd.read_csv('amazon_baby_subset.csv')

Let us take a quick look at the dataset. The name column holds the name of each product. List the names of the first 10 products in the dataset.

In [36]:
products.name[0:10]

0    Stop Pacifier Sucking without tears with Thumb...
1      Nature's Lullabies Second Year Sticker Calendar
2      Nature's Lullabies Second Year Sticker Calendar
3                          Lamaze Peekaboo, I Love You
4    SoftPlay Peek-A-Boo Where's Elmo A Children's ...
5                            Our Baby Girl Memory Book
6    Hunnt&reg; Falling Flowers and Birds Kids Nurs...
7    Blessed By Pope Benedict XVI Divine Mercy Full...
8    Cloth Diaper Pins Stainless Steel Traditional ...
9    Cloth Diaper Pins Stainless Steel Traditional ...
Name: name, dtype: object

Next, count the number of positive and negative reviews.

In [38]:
print "num of positive reviews:", len(products[products.sentiment == 1])
print "num of negative reviews:", len(products[products.sentiment == -1])

num of positive reviews: 26579
num of negative reviews: 26493


Apply text cleaning on the review data

1) Remove punctuation

In [40]:
def remove_punctuation(text):
    # Python 2 str.translate: no translation table (None), delete every punctuation character
    return text.translate(None, string.punctuation)
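
The two-argument call `text.translate(None, string.punctuation)` only works on Python 2 byte strings. A minimal sketch of an equivalent for Python 3, assuming the reviews are `str` objects; the name `remove_punctuation_py3` is hypothetical and not part of this notebook:

```python
import string

# Translation table that maps every ASCII punctuation character to None (deleted).
_PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with all ASCII punctuation characters removed (Python 3 only)."""
    return text.translate(_PUNCTUATION_TABLE)

cleaned = remove_punctuation_py3("Great product, works well!!")
```

Building the table once outside the function avoids rebuilding it for every review.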

In [41]:
products = products.fillna({'review':''}) # replace missing reviews with '' so that every review is a string
products['review_clean'] = products.review.apply(remove_punctuation)

2) Compute word counts (only for important_words)

In [42]:
with open('important_words.json') as data_file:    
    important_words = json.load(data_file)

In [43]:
for word in important_words:
    products[word] = products.review_clean.apply(lambda r: r.split().count(word))
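
Calling `r.split().count(word)` once per word splits and scans every review again for each important word. A sketch of a single-pass alternative with `collections.Counter`, shown on a toy review and a made-up three-word vocabulary (both are illustrative assumptions, not data from important_words.json):

```python
from collections import Counter

toy_important_words = ['love', 'great', 'broke']   # hypothetical vocabulary
toy_review = 'love love this stroller great value'

# Count every token once, then look up only the words of interest.
# A Counter returns 0 for tokens it has never seen.
token_counts = Counter(toy_review.split())
word_counts = {word: token_counts[word] for word in toy_important_words}
```

The resulting counts match what `toy_review.split().count(word)` returns for each word.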

# How many reviews contain the word perfect?

In [44]:
len(products[products.perfect > 0])

2955

Convert data frame to multi-dimensional array

In [45]:
def get_numpy_data(dataframe, features, label):
    dataframe['one'] = 1    # constant column for the intercept term
    features = ['one'] + features
    features_array = dataframe[features].values    # 2-D array, one row per review
    output_label = dataframe[label].values         # 1-D array of labels

    return (features_array, output_label)

In [46]:
feature_matrix, sentiments = get_numpy_data(products, important_words, 'sentiment')

# How many features are there in the feature_matrix?

In [47]:
print np.shape(feature_matrix)

(53072L, 194L)


In [48]:
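def predict_probability(feature_matrix, coefficients):
    # score_i = dot product of the features of review i with the coefficients
    score = np.dot(feature_matrix, coefficients)
    # link function: P(y_i = +1 | x_i, w) = 1 / (1 + exp(-score_i))
    predictions = 1 / (1 + np.exp(-score))
    return (predictions)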
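
As a quick sanity check of the link function, the sketch below evaluates it on a tiny made-up feature matrix. The toy arrays are assumptions for illustration; the function body repeats the one above so that the block runs on its own:

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    score = np.dot(feature_matrix, coefficients)
    return 1 / (1 + np.exp(-score))

# Two toy reviews: an intercept column followed by two word counts.
toy_features = np.array([[1., 2., 0.],
                         [1., 0., 3.]])

probs_zero = predict_probability(toy_features, np.zeros(3))             # every score is 0
probs_toy = predict_probability(toy_features, np.array([0., 1., -1.]))  # scores are 2 and -3
```

With all-zero coefficients every review gets probability 0.5; a positive score pushes the probability above 0.5 and a negative score pushes it below.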

In [49]:
def feature_derivative(error, feature):
    # sum over all reviews of (feature value * error), for a single coefficient
    derivative = np.dot(feature, error)
    return (derivative)
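
For reference, the quantity that feature_derivative computes is the partial derivative of the log likelihood with respect to a single coefficient. Written out, with $h_j(\mathbf{x}_i)$ denoting feature $j$ of review $i$ and $N$ the number of reviews (this notation is an assumption; the notebook itself states no formula):

```latex
\frac{\partial \ell}{\partial w_j}
  = \sum_{i=1}^{N} h_j(\mathbf{x}_i)
    \left( \mathbf{1}[y_i = +1] - P(y_i = +1 \mid \mathbf{x}_i, \mathbf{w}) \right)
```

The term in parentheses is the error vector, so the sum is exactly the dot product np.dot(feature, error).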

In [50]:
def compute_log_likelihood(feature_matrix, sentiments, coefficients):
    indicator = (sentiments == +1)    # True for positive reviews
    score = np.dot(feature_matrix, coefficients)
    # sum over reviews of (indicator - 1)*score - log(1 + exp(-score))
    ll = np.sum((indicator - 1)*score - np.log(1 + np.exp(-score)))
    return ll
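
`np.exp(-score)` overflows to `inf` once a score drops below roughly -709, which turns the whole log likelihood into `-inf`. A sketch of an overflow-safe variant that relies on `np.logaddexp(0, -score)`, which equals `log(1 + exp(-score))`; the name `compute_log_likelihood_stable` and the toy data are hypothetical:

```python
import numpy as np

def compute_log_likelihood_stable(feature_matrix, sentiments, coefficients):
    indicator = (sentiments == +1)
    score = np.dot(feature_matrix, coefficients)
    # log(1 + exp(-score)) evaluated without forming exp(-score) explicitly
    return np.sum((indicator - 1) * score - np.logaddexp(0, -score))

# The third toy review produces a score of -1000, which would overflow np.exp.
toy_features = np.array([[1., 2.], [1., -1.], [1., -1000.]])
toy_sentiments = np.array([+1, -1, -1])
toy_coefficients = np.array([0., 1.])

ll_stable = compute_log_likelihood_stable(toy_features, toy_sentiments, toy_coefficients)
```

For moderate scores the two versions agree; they differ only when the naive expression overflows.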

In [51]:
def logistic_regression(feature_matrix, sentiments, initial_coefficients, step_size, max_iter):

    coefficients = np.array(initial_coefficients)    # copy, so the initial values are not modified
    num_coeffs = len(coefficients)

    indicator = (sentiments == +1)

    for i in xrange(max_iter):
        # predicted P(y = +1) under the current coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        error = (indicator - predictions)

        # gradient ascent step, one coefficient at a time
        for j in xrange(num_coeffs):
            derivative = feature_derivative(error, feature_matrix[:,j])
            coefficients[j] = (coefficients[j] + step_size*derivative)

        # report the log likelihood every 10 iterations
        if (i%10 == 0):
            ll = compute_log_likelihood(feature_matrix, sentiments, coefficients)
            print i, ll

    return (coefficients)
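
Because error is computed once per iteration, before any coefficient changes, the inner loop over j is equivalent to a single matrix-vector product. A vectorized sketch under that observation; the name `logistic_regression_vectorized` and the toy data are illustrative assumptions, and the progress printing is left out:

```python
import numpy as np

def logistic_regression_vectorized(feature_matrix, sentiments, initial_coefficients,
                                   step_size, max_iter):
    coefficients = np.array(initial_coefficients, dtype=float)
    indicator = (sentiments == +1)
    for _ in range(max_iter):
        predictions = 1 / (1 + np.exp(-np.dot(feature_matrix, coefficients)))
        error = indicator - predictions
        # Derivatives for all coefficients at once: one dot product per column.
        gradient = np.dot(feature_matrix.T, error)
        coefficients = coefficients + step_size * gradient
    return coefficients

# Tiny, linearly separable toy problem: intercept plus two word counts.
toy_features = np.array([[1., 2., 0.], [1., 0., 1.], [1., 3., 1.], [1., 0., 2.]])
toy_sentiments = np.array([+1, -1, +1, -1])
toy_coefficients = logistic_regression_vectorized(
    toy_features, toy_sentiments, np.zeros(3), step_size=0.1, max_iter=100)
toy_predictions = np.where(np.dot(toy_features, toy_coefficients) > 0, +1, -1)
```

On a matrix the size of feature_matrix, a single product is typically much faster than looping over 194 columns in Python.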

Now, let us run the logistic regression solver with the parameters below:

In [23]:
init_coeffs = np.zeros(np.size(feature_matrix, axis=1))
step_size = 1e-7
max_iter = 301

# As each iteration of gradient ascent passes, does the log likelihood increase or decrease?

In [52]:
coefficients = logistic_regression(feature_matrix, sentiments, init_coeffs, step_size, max_iter)

0 -36780.9388104
10 -36723.5732784
20 -36666.7762571
30 -36610.5346
40 -36554.8360409
50 -36499.6690994
60 -36445.0229966
70 -36390.8875808
80 -36337.253263
90 -36284.1109584
100 -36231.4520359
110 -36179.2682736
120 -36127.5518186
130 -36076.2951527
140 -36025.4910611
150 -35975.1326055
160 -35925.2130998
170 -35875.7260888
180 -35826.6653291
190 -35778.0247728
200 -35729.7985518
210 -35681.9809651
220 -35634.5664668
230 -35587.5496553
240 -35540.925264
250 -35494.6881529
260 -35448.8333007
270 -35403.3557983
280 -35358.2508422
290 -35313.5137295
300 -35269.1398524


In [54]:
def predicting_sentiment(feature_matrix, coefficients):
    score = np.dot(feature_matrix, coefficients)
    # predict +1 when the score is positive, otherwise -1
    sentiments = np.where(score > 0, 1, -1)
    return (sentiments)
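
Thresholding the score at 0 gives the same labels as thresholding the predicted probability at 0.5, because the link function maps a score of 0 to exactly 0.5 and is strictly increasing. A small check of that equivalence on made-up scores (the values are illustrative only):

```python
import numpy as np

toy_scores = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])
toy_probabilities = 1 / (1 + np.exp(-toy_scores))

by_score = np.where(toy_scores > 0, +1, -1)
by_probability = np.where(toy_probabilities > 0.5, +1, -1)
```

A score of exactly 0 is classified as -1 under both rules.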

In [55]:
products['pred_sentiments'] = predicting_sentiment(feature_matrix, coefficients)

# How many reviews were predicted to have positive sentiment?

In [56]:
positive_sentiments = len(products[products.pred_sentiments > 0])
print positive_sentiments

26147


In [57]:
num_correctly_classified = len(products[products.sentiment == products.pred_sentiments])
print num_correctly_classified

39872


# What is the accuracy of the model on the predictions made above? (Round to 2 decimal places.)

In [58]:
accuracy = (num_correctly_classified / len(products))
print accuracy

0.751281278263
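
To put this accuracy in context, it can be compared with a majority-class baseline that predicts +1 for every review. The sketch below uses only the counts printed earlier in this notebook (53072 reviews, 26579 of them positive, 39872 classified correctly); the variable names are illustrative:

```python
num_reviews = 53072
num_positive = 26579
num_correct = 39872

# float() keeps the division exact on Python 2 as well as Python 3.
accuracy = num_correct / float(num_reviews)
baseline_accuracy = num_positive / float(num_reviews)   # always predicting +1

print(round(accuracy, 2))            # 0.75
print(round(baseline_accuracy, 2))   # 0.5
```

The classes are almost perfectly balanced, so the model's 0.75 is well above the 0.5 that a constant prediction achieves.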


Which words contribute most to positive & negative sentiments

In [59]:
coeff_no_intercept = list(coefficients[1:]) # exclude intercept
# pair each word with its coefficient, then sort from most positive to most negative
word_coefficient_tuples = [(word, coeff) for word, coeff in zip(important_words, coeff_no_intercept)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)

# Which word is not present in the top 10 "most positive" words?

In [61]:
word_coefficient_tuples[0:10]

[(u'great', 0.066296001387255332),
 (u'love', 0.065724092902079465),
 (u'easy', 0.064614818395628382),
 (u'little', 0.045170071725352888),
 (u'loves', 0.044896490742526995),
 (u'well', 0.029938542166391018),
 (u'perfect', 0.029668947385268679),
 (u'old', 0.019844898517189645),
 (u'nice', 0.018310362000933323),
 (u'daughter', 0.01755742870228319)]

# Which word is not present in the top 10 "most negative" words?

In [62]:
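# the slice [-11:-1] stops before the last element, so the single most negative word is not shown;
# word_coefficient_tuples[-10:] would include it
word_coefficient_tuples[-11:-1]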

[(u'waste', -0.024082264094189641),
 (u'monitor', -0.024619007521754197),
 (u'return', -0.02664239217459799),
 (u'back', -0.027971174584613689),
 (u'disappointed', -0.02902990836552595),
 (u'get', -0.029045848805011123),
 (u'even', -0.030252629052075222),
 (u'work', -0.033213613120973838),
 (u'money', -0.039083643082457598),
 (u'product', -0.04175318698775627)]