# Logistic Regression

In this notebook, we explore [*Logistic Regression*](https://en.wikipedia.org/wiki/Logistic_regression) and its application to sentiment analysis.

The goal is to use [amazon_baby_subset.csv](../Data/amazon_baby_subset.csv), which contains 4 columns: product name, customer review, rating, and sentiment. The rating goes from 1 (worst) to 5 (best), and the sentiment is -1 if the rating is low (< 3) and 1 if it is good (>= 3). Here, logistic regression is used to assign a weight to each important word in the reviews (the important words are given for now; later we will see how to select them) and to build a model that predicts whether a future review is good or bad.

First, let's load the required packages and the dataset.

In [1]:
import turicreate as tc
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:
products = tc.SFrame('../Data/amazon_baby_subset.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
products.head()

name,review,rating,sentiment
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,We wanted to get something to keep track ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,My daughter had her 1st baby over a year ago. ...,5,1
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4,1
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,Very cute interactive book! My son loves this ...,5,1
Our Baby Girl Memory Book,"Beautiful book, I love it to record cherished t ...",5,1
Hunnt&reg; Falling Flowers and Birds Kids ...,"Try this out for a spring project !Easy ,fun and ...",5,1
Blessed By Pope Benedict XVI Divine Mercy Full ...,very nice Divine Mercy Pendant of Jesus now on ...,5,1
Cloth Diaper Pins Stainless Steel ...,We bought the pins as my 6 year old Autistic son ...,4,1
Cloth Diaper Pins Stainless Steel ...,It has been many years since we needed diaper ...,5,1


We can count how many positive and negative reviews the data set has.

In [4]:
print 'Number of positive reviews =', len(products[products['sentiment']==1])
print 'Number of negative reviews =', len(products[products['sentiment']==-1])

Number of positive reviews = 26579
Number of negative reviews = 26493


Pretty close numbers!

The reviews contain punctuation. Let's remove it and also create columns with the counts of the important words. The important words are in the file [important_words.json](../Data/important_words.json).

## Cleaning the data

First, load the important words from the *json* file:

In [5]:
with open('../Data/important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

Now, let's remove the punctuation from the reviews:

In [6]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)
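
The cell above relies on the two-argument form of `str.translate`, which exists only in Python 2. As a minimal sketch, assuming the notebook were run under Python 3, the same cleaning could look like the following. The name `remove_punctuation_py3` and the sample sentence are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    # Hypothetical Python 3 counterpart of remove_punctuation (not part of the original notebook).
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punctuation_py3("Very cute, interactive book! My son loves this.")
possessive = remove_punctuation_py3("baby's")  # the apostrophe is punctuation too
```

Note that apostrophes are also stripped, so a possessive such as "baby's" becomes the single token "babys" and will not be counted as "baby" later on.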

The next step is to count each important word and create new columns with the results.

In [7]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
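
To see what the `split().count(word)` expression does, here is a small sketch on a made-up review and a hypothetical three-word list (not the real `important_words`). It counts exact, case-sensitive, whitespace-separated tokens, so "loves" does not count toward "love" and "babys" does not count toward "baby".

```python
# Made-up cleaned review and a hypothetical short word list, for illustration only.
review = "baby loves it my babys toy and my baby sleeps"
toy_words = ['baby', 'love', 'toy']

# Same counting expression as in the cell above, applied to one string.
counts = {word: review.split().count(word) for word in toy_words}
```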

In [8]:
products[important_words[0]] # This is the count of the word 'baby' for each review.

dtype: int
Rows: 53072
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, ... ]

In [9]:
len(products[products[important_words[0]] > 0]) # Number of reviews that contain the word 'baby'

12174

How many times does the word *baby* appear across all the reviews?

In [10]:
sum(products['baby'])

18715

## Starting the logistic regression

Logistic regression models a categorical outcome. Here, in sentiment analysis, the outcome takes only 2 values: 1 for a good review and -1 for a bad review.

The whole idea is to estimate the probability that a review is good or bad. This probability is computed using the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) (also known as the sigmoid function):

$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w}^T h(\mathbf{x}_i)}} $$

where, in our code, the feature vector $h(\mathbf{x}_i)$ represents the word counts of **important_words** in the review $\mathbf{x}_i$. For a given vector of weights $\mathbf{w}$ and a review $\mathbf{x}_i$, this equation gives the probability that the sentiment $y_i$ is positive (equal to 1). The values range from 0 to 1. We can choose a threshold (usually 0.5): a probability above it is treated as a positive sentiment, and a probability below it as a negative one.
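
As a minimal numeric sketch of the logistic function and the 0.5 threshold, consider three made-up scores $\mathbf{w}^T h(\mathbf{x}_i)$. The helper name `sigmoid` and the score values are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(score):
    # Logistic function: maps any real score to a value in (0, 1).
    return 1.0 / (1.0 + np.exp(-score))

scores = np.array([-2.0, 0.5, 3.0])   # made-up values of w^T h(x_i)
probabilities = sigmoid(scores)       # estimates of P(y_i = +1 | x_i, w)

# Apply the 0.5 threshold: +1 above it, -1 otherwise.
predicted_sentiment = np.where(probabilities > 0.5, 1, -1)
```

A negative score gives a probability below 0.5, a positive score gives one above 0.5, and a score of exactly 0 gives exactly 0.5.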

The best predictions come from optimized weights, and a good way to obtain such weights is by [maximizing the likelihood function](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). The [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function) is written as:

$$\ell(\mathbf{w}) = \prod_{i = 1}^{N}P(y_i | \mathbf{x}_i,\mathbf{w})$$

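The optimization can be done with a gradient method. Because we are maximizing, this is gradient ascent, which is the same as gradient descent applied to the negative of the function. To find the optimized weights, we need the derivative of the likelihood function with respect to $\mathbf{w}$. However, differentiating a product of $N$ terms is not easy. A good strategy is to work with the [natural logarithm](https://en.wikipedia.org/wiki/Natural_logarithm) of the likelihood function, called the **log-likelihood**. The logarithm turns the product into a sum and, because it is strictly increasing, it has the same maximizer.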

$$\ell\ell(\mathbf{w}) = \ln\prod_{i = 1}^{N}P(y_i | \mathbf{x}_i,\mathbf{w}) = \sum_{i = 1}^{N}\ln P(y_i | \mathbf{x}_i,\mathbf{w})$$

The log-likelihood can be computed using the following formula:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \{ (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln[1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))]\} $$
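
For readers who want to see where this formula comes from, here is a short derivation sketch. The shorthand $s_i = \mathbf{w}^T h(\mathbf{x}_i)$ is introduced here only for brevity. For a positive review:

$$\ln P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \ln\frac{1}{1 + e^{-s_i}} = -\ln(1 + e^{-s_i})$$

For a negative review, the probability is one minus the positive probability:

$$\ln P(y_i = -1 | \mathbf{x}_i,\mathbf{w}) = \ln\frac{e^{-s_i}}{1 + e^{-s_i}} = -s_i - \ln(1 + e^{-s_i})$$

The two cases differ only by the term $-s_i$, which is present when $y_i = -1$ and absent when $y_i = +1$. The factor $(\mathbf{1}[y_i = +1] - 1)$ is exactly $-1$ in the first situation and $0$ in the second, so both cases collapse into the single expression above.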

The derivative of the log-likelihood is:

$$\frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)$$

where $\mathbf{1}[y_i = +1]$ is equal to 1 if the sentiment is positive (+1) and 0 if the sentiment is negative (-1). Now that we can compute the gradient of the log-likelihood, the gradient method can be applied. At each iteration $t$, every coefficient $w_j$ moves a step of size $\lambda$ in the direction of the gradient:

$$w_j^{(t+1)} = w_j^{(t)} + \lambda\frac{\partial\ell\ell}{\partial w_j}$$
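
The gradient formula and the update rule above can be sanity-checked on toy data before applying them to the reviews. The sketch below is not the notebook's implementation. The function names, the 4-by-3 matrix `H`, the labels `y`, and the step size are all made up for illustration. It compares the analytic gradient with a central finite-difference estimate and then takes a few update steps, recording the log-likelihood after each one.

```python
import numpy as np

def log_likelihood(w, H, y):
    # Log-likelihood formula from the text, with H holding h(x_i) as rows.
    scores = H.dot(w)
    indicator = (y == +1).astype(float)
    return np.sum((indicator - 1) * scores - np.log(1 + np.exp(-scores)))

def gradient(w, H, y):
    # Analytic gradient: sum_i h_j(x_i) * (1[y_i = +1] - P(y_i = +1 | x_i, w)).
    indicator = (y == +1).astype(float)
    probabilities = 1 / (1 + np.exp(-H.dot(w)))
    return H.T.dot(indicator - probabilities)

# Made-up toy data: first column is the intercept, the rest are word counts.
H = np.array([[1., 2., 0.], [1., 0., 1.], [1., 1., 1.], [1., 0., 3.]])
y = np.array([+1, -1, +1, -1])
w = np.zeros(3)

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e, H, y) - log_likelihood(w - eps * e, H, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(w, H, y)

# A few update steps: adding the gradient climbs the log-likelihood.
step_size = 0.1
history = [log_likelihood(w, H, y)]
for _ in range(20):
    w = w + step_size * gradient(w, H, y)
    history.append(log_likelihood(w, H, y))
```

Because the update adds the gradient rather than subtracting it, the recorded log-likelihood values should rise from their starting point when the step size is small enough.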

Great! Now let's implement the equations above to find the optimal weights $\mathbf{w}$. Then we can use them to predict the sentiment of the reviews.

The first step is to convert the Turi Create SFrame data to NumPy arrays to perform the math. The following function receives the SFrame data, the list of desired features (the important word counts), and the class label (in this case, 'sentiment'). It returns two outputs: a matrix with the features (word counts) plus a constant intercept column (= 1), and an array with the sentiment (-1 or +1) of each review.

In [11]:
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1 # Initial value for the intercept, including it in the sframe
    features = ['intercept'] + features # Including 'intercept' in the features array
    features_sframe = data_sframe[features] # Saving a sframe with only the desired features 
    feature_matrix = features_sframe.to_numpy() # Converting the features sframe to a numpy array (matrix)
    label_sarray = data_sframe[label] # Picking the desired label
    label_array = label_sarray.to_numpy() # Converting the label to a numpy array
    return feature_matrix, label_array

In [12]:
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')
feature_matrix.shape

(53072, 194)

In [13]:
print len(products) # Number of reviews
print len(important_words) + 1 # Number of features plus the intercept

53072
194


Okay, the *feature_matrix* matches in size the number of reviews and the number of features (plus the intercept).
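
To make the layout of the feature matrix concrete without Turi Create, here is a tiny NumPy-only sketch. The array `word_counts` is made up (3 reviews, 2 words), and the variable names are hypothetical. The intercept is simply a column of ones placed in front of the word counts.

```python
import numpy as np

# Made-up word counts: 3 reviews (rows) x 2 important words (columns).
word_counts = np.array([[0, 2], [1, 0], [3, 1]])

# Constant intercept feature, one entry per review.
intercept = np.ones((word_counts.shape[0], 1), dtype=word_counts.dtype)

# Intercept first, then the word counts, mirroring ['intercept'] + features.
toy_feature_matrix = np.hstack([intercept, word_counts])
```

With 2 words the toy matrix has 3 columns, in the same way that the 193 important words above yield 194 columns.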

Let's now create a function to calculate the probability of a positive sentiment (the negative sentiment probability is just one minus the positive sentiment probability), given an array of coefficients and the feature matrix. The output is an array with one predicted probability per review (that is, per row of the feature matrix).

In [14]:
def predict_probability(coefficients, feature_matrix):
    # Computing P(y_i = +1 | x_i, w), using the logistic function
    # feature_matrix is N x D and coefficients has length D, so the product yields N scores
    predictions = 1/(1 + np.exp(-np.dot(feature_matrix, coefficients)))
    return predictions
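
The order of the arguments in the matrix product matters here: with an N-by-D feature matrix and D coefficients, the score vector is the feature matrix times the coefficients, giving one score per review. The following shape check is a sketch on made-up toy values. The function name `predict_probability_sketch` and the arrays are assumptions for illustration.

```python
import numpy as np

def predict_probability_sketch(coefficients, feature_matrix):
    # One score per row: (N x D) matrix times a length-D vector gives a length-N vector.
    scores = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

# Made-up toy data: 2 reviews, intercept plus 2 word counts.
toy_features = np.array([[1., 2., 0.], [1., 0., 3.]])
toy_coefficients = np.array([0., 1., -1.])

toy_probabilities = predict_probability_sketch(toy_coefficients, toy_features)
```

The scores for the two toy reviews are 2 and -3, so the first probability is above 0.5 and the second is below it.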

We will now write a function that computes the derivative of the log-likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:
* **errors**: vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* **feature**: vector containing $h_j(\mathbf{x}_i)$ for all $i$.