In [2]:
import pandas as pd
import numpy as np  
from sklearn.model_selection import train_test_split  
from sklearn.feature_extraction.text import  CountVectorizer
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from time import time
from sklearn.naive_bayes import MultinomialNB, BernoulliNB,GaussianNB
from sklearn import tree
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model.stochastic_gradient import SGDClassifier
from scipy.sparse import csr_matrix,coo_matrix,csc_matrix
import scipy
import json

# 1. Introduction, task, and data 

**1.1 Overview**

It is well known that customers' reviews are important information. This assignment aims to develop predictive models for classifying the good or bad reviews, and improve models(accuracy,F-score,etc) by features decomposition and different classifiers. In the last part,explore why some reviews were misclassified and analyze the error samples.

**1.2 Organize and split data**

In [14]:
#read data
fn = ".../input_data/Beauty_5.json"
reviews = (eval(line) for line in open(fn))
reviews = pd.DataFrame(reviews)
#Giving labels to reviews 
#Assume reviews' ratings that over than 4.5 are 'good' reviews
reviews['good']= reviews.overall > 4.5

In [294]:
#train data(70%),test data(20%),evaluation data(10%)
train_data, test_data = train_test_split(reviews, test_size=0.3, random_state=1)
test_data, val_data = train_test_split(test_data ,test_size=0.33,random_state=1)

In [290]:
#the size of the data sets
len(reviews),len(train_data),len(test_data),len(val_data)

(198502, 138951, 39899, 19652)

In [291]:
#label distribution 
sum(train_data.good)/len(train_data),sum(test_data.good)/len(test_data),sum(val_data.good)/len(val_data)

(0.57607357989507091, 0.57921251159176923, 0.57882149399552207)

split all data into train data(70%),test data(20%),evaluation data(10%),the good labels accounted for about 57% in three data sets.

# 2. Baseline model & performance

The accuracy of baseline model (predict good labels) is 57.92%.

# 3. Initial model

In [282]:
# Create document term matrix of reviews
vectorizer = TfidfVectorizer(input='content')
x_train = vectorizer.fit_transform(train_data.reviewText)  
x_test = vectorizer.transform(test_data.reviewText)
y_train = train_data.good
y_test = test_data.good

In [89]:
#fit MultinomialNB
time_star = time()
clf = MultinomialNB().fit(x_train,y_train)
time_end = time()
train_time = time_end-time_star 
print (u'train time:%.3fs' % train_time)

time_star = time()
predictions = clf.predict(x_test)
time_end = time()
test_time = time_end-time_star 
print (u'test time:%.3fs' % test_time)

acc = metrics.accuracy_score(y_test, predictions)
print (u'accuracy:%.2f%%' % (100 * acc))
print(metrics.classification_report(y_test, predictions, target_names=["Bad", "Good"]))

train time：0.099s
test time：0.022s
accuracy：74.84%
             precision    recall  f1-score   support

        Bad       0.84      0.50      0.62     16789
       Good       0.72      0.93      0.81     23110

avg / total       0.77      0.75      0.73     39899



The accuracy of initial model is 74.84% which is higher than baseline model(57.92%).According to the classification report,this model is good at identifying true positive samples(recall = 0.93,f1-score = 0.81).However,this model is not good at identifying true negative samples(f1-score = 0.63).

# 4. Model improvements

**4.1  feature improvements** 

Try to add new feature(user mean scores)by using train data information.

In [370]:
means =train_data.groupby("reviewerID").overall.mean()
train_data = pd.merge(train_data, pd.DataFrame(means),left_on="reviewerID", right_index=True)
test_data = pd.merge(test_data, pd.DataFrame(means),how = "left",left_on="reviewerID", right_index=True)
val_data = pd.merge(val_data, pd.DataFrame(means),how = "left",left_on="reviewerID", right_index=True)
#Handle null values
val_data['overall_y'] = val_data['overall_y'].fillna(np.mean(train_data.overall_y))
test_data['overall_y'] = test_data['overall_y'].fillna(np.mean(train_data.overall_y))

In [472]:
#Converted to matrix
mean_train = coo_matrix(train_data['overall_y'])
mean_val = coo_matrix(val_data['overall_y'])
mean_test = coo_matrix(test_data['overall_y'])
#conbine dtm features and mean scores features
features_train = scipy.sparse.hstack([mean_train.T,train_data.reviewText])
features_val = scipy.sparse.hstack([mean_val.T,train_data.reviewText])
features_test = scipy.sparse.hstack([mean_test.T,train_data.reviewText])

try an classifier MultinomialNB and compared with initial model.

In [475]:
clf = MultinomialNB().fit(features_train,y_train)
predictions = clf.predict(features_val)
acc = metrics.accuracy_score(y_val, predictions)
print (u'accuracy:%.2f%%' % (100 * acc))

accuracy：70.49%


Add the average score of users lead to reduce the accuracy compared with the initial model. It may be because the user mean ratings have some noise data. So I decided to use another way to imporve features.

Firtly,Stop word removal,feature weighting to improve the feature of reviews.

In [427]:
vectorizer = TfidfVectorizer(input='content',max_df=0.5, sublinear_tf=True,use_idf=True)
x_train = vectorizer.fit_transform(train_data.reviewText) 
x_val = vectorizer.transform(val_data.reviewText)
x_test = vectorizer.transform(test_data.reviewText)
y_train = train_data.good
y_test = test_data.good
y_val = val_data.good

Secondly,Feature decomposition by linearsvc.

In [428]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
lsvc = LinearSVC(penalty="l1",dual=False).fit(x_train,y_train)
model = SelectFromModel(lsvc, prefit=True)
x_train_lsvc = model.transform(x_train)
x_test_lsvc = model.transform(x_test)
x_val_lsvc = model.transform(x_val)

In [268]:
# How many features removed?
x_train.shape[1]-x_train_lsvc.shape[1],x_test.shape[1]-x_test_lsvc.shape[1],x_val.shape[1]-x_val_lsvc.shape[1]

(52572, 52572, 52572)

**4.2 Classifier selection**

Considering the computation time, system memory and label characteristics, the linear model (eg.SGDClassifier,LogisticRegression) will be used in this part.

Firstly,Select best hyperparameter for SGDClassifier and LogisticRegression by GridSearchCV,then fit train data.

In [293]:
from sklearn.model_selection import GridSearchCV
parameters = {'loss':['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron'], 
              'penalty':["l1","l2"],
              'n_jobs': [-1,1],
             }
SGDC = SGDClassifier()
gs = GridSearchCV(SGDC, parameters)
gs.fit(x_train_lsvc, y_train)

In [None]:
parameters = {'dual':[False], 
              'penalty':["l2"],
              'n_jobs': [-1,1],
              'C':[ 0.01,  0.21,  0.41,  0.61,  0.81,  1.01,  1.21,  1.41,  1.61,  1.81],
              "solver":["newton-cg", "lbfgs", "liblinear", "sag"]
             }
lr = LogisticRegression()
gs_lr = GridSearchCV(lr, parameters)
gs_lr.fit(x_train_lsvc, y_train)

Then,predict the evaluation data.

In [361]:
time_star = time()
predictions = gs.predict(x_val_lsvc)
time_end = time()
test_time = time_end-time_star

acc = metrics.accuracy_score(y_val, predictions)

print (u'avg fit time:%.3fs' % np.mean(gs.cv_results_['mean_fit_time']))
print (u'test time:%.3fs' % test_time)
print (u'accuracy:%.2f%%' % (100 * acc))
print(metrics.classification_report(y_val, predictions, target_names=['Bad', 'Good']))

avg fit time：0.358s
test time：0.003s
accuracy：80.02%
             precision    recall  f1-score   support

        Bad       0.75      0.57      0.65      6293
       Good       0.82      0.91      0.86     13359

avg / total       0.79      0.80      0.79     19652



In [360]:
time_star = time()
predictions = gs_lr.predict(x_val_lsvc)
time_end = time()
test_time = time_end-time_star

acc = metrics.accuracy_score(y_val, predictions)

print (u'avg fit time:%.3fs' % np.mean(gs_lr.cv_results_['mean_fit_time']))
print (u'test time:%.3fs' % test_time)
print (u'accuracy:%.2f%%' % (100 * acc))
print(metrics.classification_report(y_val, predictions, target_names=['Bad', 'Good']))

avg fit time：2.547s
test time：0.003s
accuracy：80.28%
             precision    recall  f1-score   support

        Bad       0.73      0.61      0.66      6293
       Good       0.83      0.90      0.86     13359

avg / total       0.80      0.80      0.80     19652



In terms of the accuracy and the stability of the model, LogisticRegression (accuracy = 80.28%, avg f1-score = 0.80) is slightly higher than SGDClassifier (accuracy = 80.02%, avg f1-score = 0.79), but the fit time of SGDClassifier (0.358s) is shorter than logical regression (2.547s). Finally, I decided to choose the higher accuracy classifier(logical regression). Next,the LogisticRegression will be used to predict test data.

In [429]:
time_star = time()
clf = gs_lr.best_estimator_.fit(x_train_lsvc,y_train)
time_end = time()
train_time = time_end-time_star 
print (u'train time:%.3fs' % train_time)

time_star = time()
predictions = clf.predict(x_test_lsvc)
time_end = time()
test_time = time_end-time_star 
print (u'test time:%.3fs' % test_time)

acc = metrics.accuracy_score(y_test, predictions)
print (u'accuracy:%.2f%%' % (100 * acc))
print(metrics.classification_report(y_test, predictions, target_names=["Bad", "Good"]))

train time：3.930s
test time：0.007s
accuracy：80.56%
             precision    recall  f1-score   support

        Bad       0.74      0.61      0.67     12893
       Good       0.83      0.90      0.86     27006

avg / total       0.80      0.81      0.80     39899



The accuracy and f1-score of LogisticRegression is 80.56%, which is higher than initial model(74.84%). Just like the initial model,this model is good at identifying true positive samples(recall = 0.93,f1-score = 0.81). Despite reduced dimension of features, the training time of LogisticRegression is longer than the initial model due to the complexity of the calculation. Compared with initial model,the accuracy rate increased about 6%,precision,recall and f1-score increased 3%,4%,and 7% respectively.

# 5. Error analysis



The probability of bad are predicted as good is higher than good are predicted as bad.Is there a mistake in label good and bad? The wrong samples will be analyzed to answer this question in this part.

In [222]:
#select wrong samples
wrong_samples = test_data[np.logical_xor(test_data.good,predictions)][['reviewerName',"overall_x","summary","reviewText","good","overall_y"]]
wrong_samples = wrong_samples.rename(columns={"reviewerName": "Name", "overall_x": "overall",'overall_y':'mean_overall'})

**5.1 For 'bad' labels are predicted as 'good'**

In [227]:
wrong_samples[wrong_samples.Name =='*rose*']

Unnamed: 0,Name,overall,summary,reviewText,good,mean_overall
192896,*rose*,4.0,Dove Pure Care Dry Oil Conditioner,This is a nice conditioner that works very wel...,False,4.093023
179479,*rose*,4.0,"CoverGirl 300 Flamed Out Mascara, Very Black B...",I like this mascara. When I'm done applying i...,False,4.093023
57798,*rose*,4.0,Works - with a couple irritations,This works very well. It goes on smoothly and...,False,4.093023
189651,*rose*,4.0,Suave Professionals Natural Infusion Seaweed a...,I like this shampoo. The scent is different f...,False,4.093023
30029,*rose*,4.0,Made my hair feel so soft with big bouncy curls!,I love this! Works great and is so easy to us...,False,4.093023
189851,*rose*,4.0,Suave Professionals Split End Rescue Conditioner,We use this conditioner with the Suave Split E...,False,4.093023
165330,*rose*,3.0,Suave Professionals Moroccan Infusion Shine Sh...,This is a very luxurious shampoo. It lathers ...,False,4.093023


This user is a typical representative. In her opinion 4 stars means that the product is very good, so her comments have many words like "like", "good". Therefore, the standard of the good labels(>= 4.5)leads to wrong classification.

**5.2 For 'good' labels are predicted as 'bad'**

In [367]:
good_bad = wrong_samples.ix[[91760,14438,94180,75360]]

Unnamed: 0,Name,overall,summary,reviewText,good,mean_overall
91760,Amazon Customer,5.0,Not sure if it works for me,I received this very quickly from the supplier...,True,4.5
14438,Amazon Customer,5.0,"Yep, it's acetone",Not much to say about this product except that...,True,3.75
94180,Amazon Customer,5.0,Does not cause flaked/peeling skin (for me),I didn't have to worry about peeling. The inst...,True,4.25
75360,Amazon Customer,5.0,keeps depression at bay,I've been using Nioxin since my hair started f...,True,5.0


In [368]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
for i in good_bad['reviewText']:
    alz = SentimentIntensityAnalyzer()
    res = alz.polarity_scores(i)
    print(res)

{'compound': 0.2732, 'neu': 0.952, 'pos': 0.048, 'neg': 0.0}
{'compound': 0.2617, 'neu': 0.961, 'pos': 0.039, 'neg': 0.0}
{'compound': 0.4025, 'neu': 0.955, 'pos': 0.033, 'neg': 0.012}
{'compound': -0.1531, 'neu': 0.918, 'pos': 0.0, 'neg': 0.082}


We can try to do Sentiment anaysis for these misclassified samples,It is clear that these comments are very neutral, and even have some negative words, which means that some customers have complained about the product, but still give 5 star.

# 6. Conclusion

According to the error analysis, I think the effectiveness of LogisticRegression is fine compared with initial model. There are some limitations in this assignment. 

Firstly,the standard of the good labels may mismatch the review texts.As I mentioned in the error analysis,the standard of good products is very subjective,so it is not accurate to label reviews by using rating >= 4.5, more information needed to create the standard of good reviews.

Secondly,Due to memory limitations, some complex classifiers are not tried. Complex algorithms may have a better performance on text classify(e.g.SVM, Neural Networks),but these classifiers need computers with high standard configuration.

Finally,More useful features should be added. For exmaple,user's average score may be a good feature, but we should handle the noise data; On the other hand, brands,text length etc also can as features.

In sum, the prediction model trained in this assignment has room for improvement,Once has a good standard of classify good/bad reviews and different classifiers can uesed, this can prove a very robust method of prediecting reviews.