# Logistic Regression

In this notebook, we explore [*Logistic Regression*](https://en.wikipedia.org/wiki/Logistic_regression) and its application to sentiment analysis.

The goal is to use the [amazon_baby_subset.csv](../Data/amazon_baby_subset.csv) dataset, which contains 4 columns: product name, customer review, customer rating, and sentiment. The rating goes from 1 (worst) to 5 (best), and the sentiment is -1 if the rating is low (< 3) and 1 if it is good (>= 3). Here, logistic regression is used to assign a weight to each important word in the reviews and to build a model that predicts whether a future review is positive or negative. The list of important words is given for now; we will see how to select them later.
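
As a rough illustration of how per-word weights become a prediction, the sketch below scores one review as a weighted sum of its word counts and passes that score through the logistic (sigmoid) function. The weights, the example counts, and the helper names `sigmoid` and `predict_sentiment` are made up for illustration; they are not coefficients learned from this dataset.

```python
import math

def sigmoid(score):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def predict_sentiment(word_counts, weights, intercept=0.0):
    """Return (P(sentiment = +1), predicted label) for one review."""
    score = intercept + sum(c * w for c, w in zip(word_counts, weights))
    prob_positive = sigmoid(score)
    label = 1 if prob_positive >= 0.5 else -1
    return prob_positive, label

# Hypothetical weights for three words: 'great', 'love', 'disappointed'
example_weights = [1.2, 0.9, -2.0]
# Counts of those three words in a made-up review
example_counts = [1, 0, 1]

prob, label = predict_sentiment(example_counts, example_weights)
print('P(positive) = {:.3f}, predicted sentiment = {}'.format(prob, label))
```

A positive weight pushes the probability toward 1 each time the word appears, and a negative weight pushes it toward 0.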

First, let's load the required packages and the dataset.

In [5]:
import turicreate as tc
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [6]:
products = tc.SFrame('../Data/amazon_baby_subset.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [7]:
products.head()

name,review,rating,sentiment
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,We wanted to get something to keep track ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,My daughter had her 1st baby over a year ago. ...,5,1
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4,1
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,Very cute interactive book! My son loves this ...,5,1
Our Baby Girl Memory Book,"Beautiful book, I love it to record cherished t ...",5,1
Hunnt&reg; Falling Flowers and Birds Kids ...,"Try this out for a spring project !Easy ,fun and ...",5,1
Blessed By Pope Benedict XVI Divine Mercy Full ...,very nice Divine Mercy Pendant of Jesus now on ...,5,1
Cloth Diaper Pins Stainless Steel ...,We bought the pins as my 6 year old Autistic son ...,4,1
Cloth Diaper Pins Stainless Steel ...,It has been many years since we needed diaper ...,5,1


We can count how many positive and negative reviews the data set has.

In [8]:
print 'Number of positive reviews =', len(products[products['sentiment']==1])
print 'Number of negative reviews =', len(products[products['sentiment']==-1])

Number of positive reviews = 26579
Number of negative reviews = 26493


Pretty close numbers!
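
To put a number on "close", the small sketch below computes the share of positive reviews from the two counts printed above. A classifier that always predicted the majority class would be right about that fraction of the time, which gives a simple baseline for any model trained later. The variable names are chosen here for illustration.

```python
n_positive = 26579  # count printed above
n_negative = 26493  # count printed above

total = n_positive + n_negative
positive_fraction = float(n_positive) / total

print('Total reviews: {}'.format(total))
print('Fraction of positive reviews: {:.4f}'.format(positive_fraction))
```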

The reviews contain punctuation. Let's clean them and also create a column with the count of each important word. The important words are in the file [important_words.json](../Data/important_words.json).

## Cleaning the data

First, load the important words from the *json* file:

In [9]:
with open('../Data/important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
# Under Python 2, json.load returns unicode strings; convert them to plain str
important_words = [str(s) for s in important_words]
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

Now, let's remove the punctuation from the reviews:

In [10]:
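import string

def remove_punctuation(text):
    # Python 2 str.translate: None means "no character mapping";
    # the second argument lists the characters to delete.
    return text.translate(None, string.punctuation)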

products['review_clean'] = products['review'].apply(remove_punctuation)
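
To see what this cleaning step does to a single string, here is a small standalone sketch. The cell above passes `None` as the first argument to `translate`, which only works on Python 2 byte strings; the sketch uses the Python 3 spelling with `str.maketrans` instead. The function name `remove_punctuation_py3` and the sample sentence are assumptions made for this example.

```python
import string

def remove_punctuation_py3(text):
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

sample = "Very cute, interactive book! My son loves this..."
cleaned = remove_punctuation_py3(sample)
print(cleaned)
```

Apostrophes count as punctuation too, so a token such as `baby's` becomes `babys` and will not match the important word `baby` in the counting step.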

The next step is to count how many times each important word appears in each review and to store the counts in a new column named after the word.

In [11]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
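
The counting rule used above can be tried on plain Python strings. This sketch applies the same `split().count(word)` expression to a few made-up reviews; the helper name `count_word` and the sample sentences are illustrative only. It shows that the match is exact and case-sensitive on whitespace-separated tokens.

```python
def count_word(review_text, word):
    """Count exact, case-sensitive, whitespace-delimited matches of `word`."""
    return review_text.split().count(word)

reviews = [
    "My baby loves this and so does my other baby",
    "Baby gear that works",
    "great for babies",
]

baby_counts = [count_word(r, 'baby') for r in reviews]
print(baby_counts)
```

Only the first review contributes to the count: `Baby` differs in case and `babies` is a different token, so neither matches `baby`.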

In [19]:
products[important_words[0]] # This is the count of the word 'baby' for each review.

dtype: int
Rows: 53072
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, ... ]

In [20]:
len(products[products[important_words[0]] > 0]) # Number of reviews that contain the word 'baby'

12174