#### Classifier Using Logistic Regression

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [16]:
products = pd.read_csv('amazon_baby.csv')
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [17]:
products = products.fillna({'review':''})  # fill in N/A's in the review column
# make sure every entry in the review column is a string
products = products.astype({"review": str})

In [18]:
def remove_punctuation(text):
    import string
    return text.translate(text.maketrans('', '', string.punctuation))

products['review_clean'] = products['review'].apply(remove_punctuation)
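
As a quick check of what `remove_punctuation` does to a single string, here is a minimal sketch on a made-up sentence (the sample text is an assumption, not a review from the dataset). Note that apostrophes are deleted too, so contractions such as "isn't" become "isnt".

```python
import string

def remove_punctuation(text):
    # str.maketrans('', '', chars) builds a table that deletes every character in chars
    return text.translate(str.maketrans('', '', string.punctuation))

sample = "It's great, isn't it? 5/5!"   # made-up example text
cleaned = remove_punctuation(sample)
print(cleaned)
```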

In [19]:
# Ignore all reviews with rating = 3, since they tend to have a neutral sentiment.
products = products[products['rating'] != 3].copy()  # copy, so that columns added later go to an independent frame

In [20]:
# Assign reviews with a rating of 4 or higher to the positive class (+1) and reviews with a rating of 2 or lower to the negative class (-1).
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

In [21]:
products.head(1)

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1


In [31]:
# split into train and test
# load the given data
trainidx = pd.read_json('module-2-assignment-train-idx.json')
trainidx[0] = trainidx[0] + 1 
testidx  = pd.read_json('module-2-assignment-test-idx.json')
testidx[0] = testidx[0] + 1 

In [41]:
train_data = pd.merge(trainidx, products, left_on=0, right_on=products.index)

In [43]:
test_data = pd.merge(testidx, products, left_on=0, right_on=products.index)

In [None]:
# Text to vector conversion

In [45]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# Use this token pattern to keep single-letter words
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
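
To make the word-to-column mapping concrete, the following is a rough standard-library sketch of the bag-of-words idea. It is not how `CountVectorizer` is implemented; `tokenize` and `bag_of_words` are hypothetical helper names and the documents are made up. It shows why the vocabulary is learned from the training documents only, and how the token pattern `\b\w+\b` keeps single-letter words that a two-or-more-character pattern would drop.

```python
import re
from collections import Counter

def tokenize(text, token_pattern=r'\b\w+\b'):
    # Lowercase first (CountVectorizer also lowercases by default), then extract tokens
    return re.findall(token_pattern, text.lower())

def bag_of_words(text, vocabulary):
    # One count per vocabulary word; words outside the vocabulary are ignored
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

train_docs = ["I love this a lot", "a bad buy"]          # made-up training documents
vocabulary = sorted({tok for doc in train_docs for tok in tokenize(doc)})

train_rows = [bag_of_words(doc, vocabulary) for doc in train_docs]
test_row = bag_of_words("I love a bargain", vocabulary)   # "bargain" was never seen in training

# A pattern requiring two or more characters drops "i" and "a"
long_tokens = tokenize("I love this a lot", r'\b\w\w+\b')

print(vocabulary)
print(train_rows)
print(test_row)
print(long_tokens)
```

The test document is encoded with the training vocabulary, so the unseen word "bargain" contributes nothing to `test_row`.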

In [None]:
# Train a classifier

In [48]:
from sklearn.linear_model import LogisticRegression
sentiment_model = LogisticRegression().fit(train_matrix, train_data['sentiment'])



In [52]:
# There should be over 100,000 coefficients in this sentiment_model.
# Recall from the lecture that positive weights w_j push the prediction toward positive sentiment,
# while negative weights push it toward negative sentiment. Calculate the number of
# nonnegative (>= 0) coefficients.
sentiment_model.coef_.shape

(1, 113092)

In [66]:
sentiment_model.intercept_

array([1.37357778])

#### Quiz question: How many weights are >= 0?

In [65]:

sum(1 for item in sentiment_model.coef_[0] if item >= 0)

80328
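
The same count can be obtained with a vectorized comparison instead of a Python-level loop. A small sketch on made-up weights, where the array `coef` is only a stand-in with the same `(1, n_features)` shape as `sentiment_model.coef_`, not the model's actual values:

```python
import numpy as np

# Made-up weights with the same 2-D shape as coef_ (one row, one column per word)
coef = np.array([[0.8, -1.2, 0.0, 0.3, -0.1]])

num_nonnegative = int(np.count_nonzero(coef >= 0))  # counts zeros as well
num_positive = int(np.count_nonzero(coef > 0))      # strictly positive only

print(num_nonnegative, num_positive)
```

The two counts differ whenever some weight is exactly zero, which is why the quiz wording (>= 0) matters.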

In [76]:
#Making predictions with logistic regression
sample_test_data = test_data[10:13]


In [77]:
sample_test_data

Unnamed: 0,0,name,review,rating,review_clean,sentiment
10,54,Baby's First Year Undated Wall Calendar with S...,A friend bought me this calendar when our daug...,5,A friend bought me this calendar when our daug...,1
11,65,My Kindergarten Year - A Keepsake Book,I was pleasantly surprised upon receiving the ...,5,I was pleasantly surprised upon receiving the ...,1
12,83,Cloth Diaper Pins Stainless Steel Traditional ...,"As another reviewer noted, these are great for...",5,As another reviewer noted these are great for ...,1


In [90]:
sentiment_model.predict(test_matrix[10:13])

array([1, 1, 1], dtype=int64)

In [83]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)

[ 5.5781005   6.74978895 11.42518643]


In [89]:
[1 if item > 0 else -1 for item in scores]

[1, 1, 1]

In [None]:
# The thresholded scores agree with the output of sentiment_model.predict above.
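
The rule being checked here is that a binary linear classifier predicts the positive class exactly when the score is greater than zero. A minimal sketch with the three scores printed above plus one hypothetical negative score; `predict_label` is a helper name introduced for illustration, and the labels follow the +1/-1 encoding of the `sentiment` column:

```python
def predict_label(score):
    # Positive score -> +1, otherwise -1 (the label encoding used for training)
    return 1 if score > 0 else -1

scores = [5.5781005, 6.74978895, 11.42518643]   # values printed by decision_function above
labels = [predict_label(s) for s in scores]
print(labels)

print(predict_label(-0.3))   # hypothetical negative score
```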

#### Quiz question: Of the three data points in sample_test_data, which one (first, second, or third) has the lowest probability of being classified as a positive review?

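In [91]:
# Probability predictions: P(sentiment = +1 | review) = 1 / (1 + exp(-score))
p = 1/(1+np.exp(-scores))
print(p)


[0.99623449 0.99883024 0.99998908]


In [92]:
# Answer: first (it has the lowest score, hence the lowest probability of being positive)
np.argmin(p)

0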
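
For logistic regression the probability of the positive class is P(y = +1 | x) = 1 / (1 + exp(-score)), so it increases with the score, and P(y = -1 | x) = 1 / (1 + exp(score)) is its complement. The sketch below uses only the standard library and the three scores printed earlier; `sigmoid` is a helper name introduced here, written in a form that avoids overflow for large negative scores.

```python
import math

def sigmoid(score):
    # Numerically stable logistic function: never calls exp on a large positive number
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

scores = [5.5781005, 6.74978895, 11.42518643]   # decision_function outputs from above

p_positive = [sigmoid(s) for s in scores]        # P(y = +1 | x)
p_negative = [1.0 - p for p in p_positive]       # P(y = -1 | x)

# Position of the review least likely to be classified as positive
lowest = min(range(len(scores)), key=lambda i: p_positive[i])

print(p_positive)
print(lowest)
```

Because the sigmoid is monotonic, the review with the smallest score is also the one with the smallest probability of being positive.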

In [115]:
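# Find the most positive (and negative) reviews
# We now turn to the full test dataset, test_data, and use the fitted
# sklearn.linear_model.LogisticRegression model to form predictions on all of the test data points.
scores_test = sentiment_model.decision_function(test_matrix)
# Note: 1/(1+exp(score)) is P(sentiment = -1 | review), which decreases as the score grows;
# P(sentiment = +1 | review) is 1/(1+np.exp(-scores_test)).
p_test = 1/(1+np.exp(scores_test))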

In [116]:
#Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability
#of being classified as a positive review. We refer to these as the "most positive reviews."
p_test

array([1.17078810e-01, 1.13057479e-06, 4.70049081e-04, ...,
       2.13327001e-01, 9.67594855e-01, 3.37800997e-01])

In [117]:
indicesTest = np.argsort(p_test)

In [121]:
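# Indices of the 20 reviews with the largest p_test values. Because p_test is computed as
# 1/(1+exp(score)), these are the 20 lowest-scoring reviews, i.e. the ones the model rates most negative.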
indicesTest[-20:]

array([10763, 21884, 14285, 24456, 23756, 20648, 26403, 21203, 26491,
       13780,  8677, 12064, 25501, 23914,  6962,  5027, 10381, 16261,
        1891, 15749], dtype=int64)
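
`np.argsort` returns indices in ascending order of the values, so the last k entries point at the k largest values and the first k at the k smallest. A short sketch on made-up numbers (the array `probs` is illustrative only); which end counts as "most positive" depends on whether the array holds P(+1) or P(-1):

```python
import numpy as np

probs = np.array([0.10, 0.95, 0.40, 0.99, 0.02])   # made-up probabilities

order = np.argsort(probs)      # indices from smallest to largest value
k = 2
top_k = order[-k:][::-1]       # k largest values, largest first
bottom_k = order[:k]           # k smallest values, smallest first

print(order)
print(top_k)
print(bottom_k)
```

Slicing with `[:k]` returns exactly k indices; a slice such as `[:k + 1]` would return one more than intended.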

#### Quiz Question: Which of the following products are represented in the 20 most positive reviews?

In [122]:
# The answer depends on the quiz options; merge the selected indices (largest p_test values) with test_data to read the review text.
highest_probability_indices = pd.DataFrame(indicesTest[-20:])

In [138]:
for item in pd.merge(highest_probability_indices, test_data['review'], left_on=0, right_on=test_data.index)['review']:
    print(item)
    print("\n")

This is basically an overpriced piece of fabric. All you get is a square sewn to four fabric "straps" and you do all the work. It is tedious, hard, annoying, dangerous because you CANNOT do it alone, someone has to hold the baby and/or tie the damn thing around you, and you basically have to do this operation on a bed cause you might drop the baby, it happened a couple of times to us. Then, after this time-consuming frustrating sweaty experience, once the baby is "on" it's really uncomfortable, it breaks your back and if you just want to get the baby out for a few moments and back inside it's plain impossible. Just buy a backpack carrier, we did that and regreted so much buying this "hip-looking" overpriced piece of junk! They should be sued, seriously!!! Oh, and hope that your baby doesn't start crying and kicking through the process, if he does, then it's easier to tie a bobcat around you, i swear!


Assembly wasn't too terrible - took about an hour. However, the anchors included in 

#### Quiz Question: Which of the following products are represented in the 20 most negative reviews?


In [139]:
# The 20 reviews with the smallest p_test values (the highest scores)
lowest_probability_indices = pd.DataFrame(indicesTest[:20])
for item in pd.merge(lowest_probability_indices, test_data['review'], left_on=0, right_on=test_data.index)['review']:
    print(item)
    print("\n")



The joovy zoom 360 was the perfect solution for us. We couldn't justify spending the money on a mountain buggy terrain, but we wanted a very sturdy all-terrain jogger with a locking swivel wheel. I tried out a BOB as well, in the store. I also wanted a large sun canopy, and a seat that my daughter would be able to fit in for years. This stroller is affordable while still having most of the features I was looking for in a jogger. The biggest compromise for me was that I had wanted a hand brake, but honestly I probably don't need it. This stroller is so easy to push and stop that it is unnecessary.The fabric is sturdy and feels like it will really last.The foot rest is far enough away from the seat that my daughter will be able to fit comfortably in the seat for several years without outgrowing it. It is sturdy metal with drainage holes. (I didn't like the BOB's foot rest because it was made of fabric.)The locking swivel wheel is easy to lock or unlock. It doesn't shake when I jog. I u

#### Quiz Question: What is the accuracy of the sentiment_model on the test_data? Round your answer to 2 decimal places (e.g. 0.76).

In [147]:
from sklearn.metrics import accuracy_score
accuracy_score(test_data['sentiment'], sentiment_model.predict(test_matrix))

0.9313110641668867
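
Accuracy is simply the fraction of predictions that equal the true label. A minimal standard-library sketch on made-up labels; `accuracy` and `majority_baseline` are hypothetical helpers, and the baseline comparison is an addition that is not part of the original notebook. The baseline shows the accuracy reached by always predicting the most common class, a useful reference when the classes are imbalanced.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(y_true):
    # Accuracy of a classifier that always predicts the most frequent label
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

y_true = [1, -1, 1, 1]    # made-up labels
y_pred = [1, 1, 1, -1]    # made-up predictions

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```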

#### Quiz Question: Does a higher accuracy value on the training_data always imply that the classifier is better?
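No. A higher accuracy on the training data can simply reflect overfitting; a classifier is better only if it also performs better on data it was not trained on, such as test_data.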