# Amazon Reviews Classifier

Let's define a Review class that stores the text and star rating of a review and derives a sentiment label from the rating.

In [1]:
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    POSITIVE = 'POSITIVE'
    NEUTRAL = 'NEUTRAL'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score    
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        # score <= 2 is negative, score >= 4 is positive,
        # anything in between is neutral
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score >= 4:
            return Sentiment.POSITIVE
        else:
            return Sentiment.NEUTRAL

#### Load Data
Let's load the JSON data. The file holds one JSON object per line, so we parse it line by line.

In [2]:
import json

file_name = './dataset/Books_small.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))

print(reviews[5].text)
print(reviews[5].score)
print(reviews[5].sentiment)

Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!
4.0
POSITIVE


#### Prep Data
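Machine learning models work with numerical data rather than raw text, so we first split the reviews into training and test sets and then convert the text into vectors.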

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

In [4]:
print(train[0].text, train[0].score, train[0].sentiment)

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down. 5.0 POSITIVE


In [5]:
train_x = [x.text for x in train]
train_y = [x.sentiment for x in train]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(train_x[0])
print(train_y[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
POSITIVE


### Bag of Words

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit_transform() is equivalent to calling
# vectorizer.fit(train_x) followed by vectorizer.transform(train_x)
train_x_vectors = vectorizer.fit_transform(train_x)

# Don't fit the vectorizer again on test_x; reuse the training vocabulary
test_x_vectors = vectorizer.transform(test_x)

train_x_vectors[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])
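To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of roughly what the vectorizer does: learn a vocabulary from the training text, then count how often each vocabulary word appears in a document. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are made up for illustration, and the tokenization only approximates CountVectorizer's defaults.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # roughly mirroring CountVectorizer's default behaviour.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # Map every distinct training token to a column index (alphabetical order).
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab):
    # One row per document, one column per vocabulary word.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

# Hypothetical toy data, not taken from the reviews dataset.
train_docs = ["great book, great story", "boring story"]
vocab = fit_vocabulary(train_docs)
train_vectors = transform(train_docs, vocab)

# Words never seen during fitting ("ending") are simply ignored.
test_vectors = transform(["great ending"], vocab)

print(vocab)
print(train_vectors)
print(test_vectors)
```

This also shows why the vectorizer must not be refitted on the test set: the column positions only mean something relative to the vocabulary learned from the training data.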

### Linear SVM

In [7]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

n=2
print(test_x[n])
print(test_y[n])

clf_svm.predict(test_x_vectors[n])

Michael Cunningham mesmerizes with the thoughtful, elegant prose that is this book.  The reader becomes so close to its characters...the reader feels what these brothers feel.  Beautiful and tragic...a book that will stay with me for a long, long time.  Thank you again, Mr. Cunningham.  The Hours remains at the top of my list and The Snow Queen is another gift to your readers.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Decision Tree

In [8]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)


n=42
print(test_x[n])
print(test_y[n])

clf_dec.predict(test_x_vectors[n])

The review of innovation techniques and examples of their application to healthcare problems is absolutely amazing!  The Lean, on the other hand, is sketchy and not convincing.  Still, an informative and a well-written book.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Naive Bayes

In [9]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
# GaussianNB needs a dense array, not a sparse matrix
clf_gnb.fit(train_x_vectors.toarray(), train_y)


n=5
print(test_x[n])
print(test_y[n])

clf_gnb.predict(test_x_vectors[n].toarray())

An intriguing book, but I am ashamed to admit I sometimes did not fully comprehend what was said, particularly when Chesterton referred to other people (writers) that I am not knowledgeable of.  At any rate a good read.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y) 


n=6
print(test_x[n])
print(test_y[n])

clf_log.predict(test_x_vectors[n])

I loved this book.  It was very interesting to hear the background involved in training a helper dog and Luis's background in the service.  I learned a lot, but was entertained at the same time.  Of course, I have a a Golden Retriever myself, so I may be biased, but mine is as dumb as a brick!  Great book.
POSITIVE


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array(['POSITIVE'], dtype='<U8')
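The warning above says the solver stopped before converging and suggests raising `max_iter`. A minimal sketch of that change follows, on a tiny made-up dataset; the value 1000 is an arbitrary assumption and has not been tuned for the reviews data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for train_x / train_y.
texts = [
    "great book loved it",
    "wonderful story great characters",
    "terrible plot boring",
    "awful writing waste of time",
]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# Give the solver more iterations than the default before it gives up.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

pred = clf.predict(vec.transform(["great story"]))
print(pred)
```

Whether more iterations change the scores on the real data is something to check rather than assume; the model that hit the limit may already be close to its optimum.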

### Evaluation

In [11]:
# Mean accuracies
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8242424242424242
0.7787878787878788
0.8121212121212121
0.8303030303030303


In [12]:
# F1 Scores
from sklearn.metrics import f1_score

labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]
print(labels)

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=labels))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=labels))


['POSITIVE', 'NEUTRAL', 'NEGATIVE']
[0.91319444 0.21052632 0.22222222]
[0.88307155 0.13333333 0.        ]
[0.89678511 0.08510638 0.09090909]
[0.91370558 0.12244898 0.1       ]
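To see how accuracy can look fine while a per-class F1 score collapses, here is a small hand-rolled sketch of the F1 computation for a single label. The function name `per_class_f1` and the toy labels are made up for illustration; in the notebook the numbers come from `f1_score`.

```python
def per_class_f1(y_true, y_pred, label):
    # Count true positives, false positives and false negatives for one label.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No correct predictions for this label: F1 is 0 by convention here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced case: a model that always answers POSITIVE.
y_true = ["POSITIVE"] * 8 + ["NEGATIVE"] * 2
y_pred = ["POSITIVE"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_positive = per_class_f1(y_true, y_pred, "POSITIVE")
f1_negative = per_class_f1(y_true, y_pred, "NEGATIVE")

print(accuracy, f1_positive, f1_negative)
```

In this toy case the accuracy is 0.8 and the POSITIVE F1 is high, yet the NEGATIVE F1 is 0 because the model never predicts that class.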


#### All the models predict POSITIVE well, but the NEUTRAL and NEGATIVE scores are poor
Let's find out why.

In [13]:
print(len(train_y))
print(train_y.count(Sentiment.POSITIVE),train_y.count(Sentiment.POSITIVE)/len(train_y))
print(train_y.count(Sentiment.NEUTRAL),train_y.count(Sentiment.NEUTRAL)/len(train_y))
print(train_y.count(Sentiment.NEGATIVE),train_y.count(Sentiment.NEGATIVE)/len(train_y))

670
552 0.8238805970149253
71 0.10597014925373134
47 0.07014925373134329


So we see that about 82% of the training data is POSITIVE. Thus, our models will be heavily biased towards POSITIVE.
Now, there are only 47 NEGATIVEs in this data. To balance the dataset, we would either need to cut all three classes down to around 47 data points each (which is too small for a dataset) OR...

#### GET A BIGGER DATASET!!!

In [14]:
file_name = './dataset/Books_small_10000.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))

print(reviews[5].text)
print(reviews[5].score)
print(reviews[5].sentiment)

I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.
5.0
POSITIVE


In [15]:
# Evenly distribute positives and negatives   
import random
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
    def evenly_distribute(self):
        # Keep every NEGATIVE review and an equal number of POSITIVE ones;
        # NEUTRAL reviews are dropped.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

In [16]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

#### Note to self: do the same for NEUTRAL as well
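One possible way to extend the balancing to all classes, NEUTRAL included, is to downsample every class to the size of the smallest one. The sketch below is an assumption about how that could look, not the notebook's implementation; `balance_classes`, `get_label` and the toy tuples are hypothetical names and data.

```python
import random
from collections import Counter, defaultdict

def balance_classes(items, get_label, seed=42):
    # Group the items by label.
    groups = defaultdict(list)
    for item in items:
        groups[get_label(item)].append(item)

    # Downsample each group to the size of the smallest one.
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))

    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: (text, sentiment) pairs with a skewed distribution.
data = (
    [("p%d" % i, "POSITIVE") for i in range(6)]
    + [("u%d" % i, "NEUTRAL") for i in range(3)]
    + [("n%d" % i, "NEGATIVE") for i in range(2)]
)

balanced = balance_classes(data, get_label=lambda item: item[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

With Review objects, the label function would be `lambda r: r.sentiment`. Random sampling is used here instead of taking the first items so that the kept reviews do not depend on file order.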

In [17]:
train_cont = ReviewContainer(train)
test_cont = ReviewContainer(test)

# Balance both splits. If the test data is not evenly distributed,
# the NEGATIVE F1 score doesn't improve. Why?
train_cont.evenly_distribute()
test_cont.evenly_distribute()

print(len(train_cont.reviews))
print(len(test_cont.reviews))

872
416


In [18]:
train_x = train_cont.get_text()
train_y = train_cont.get_sentiment()

test_x = test_cont.get_text()
test_y = test_cont.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


In [19]:
print(len(train_y))
print(train_y.count(Sentiment.POSITIVE),train_y.count(Sentiment.POSITIVE)/len(train_y))
print(train_y.count(Sentiment.NEUTRAL),train_y.count(Sentiment.NEUTRAL)/len(train_y))
print(train_y.count(Sentiment.NEGATIVE),train_y.count(Sentiment.NEGATIVE)/len(train_y))

872
436 0.5
0 0.0
436 0.5


In [20]:
print(len(test_y))
print(test_y.count(Sentiment.POSITIVE),test_y.count(Sentiment.POSITIVE)/len(test_y))
print(test_y.count(Sentiment.NEUTRAL),test_y.count(Sentiment.NEUTRAL)/len(test_y))
print(test_y.count(Sentiment.NEGATIVE),test_y.count(Sentiment.NEGATIVE)/len(test_y))

416
208 0.5
0 0.0
208 0.5


### Bag of Words Vectorization
This time, let's use the TF-IDF vectorizer (TfidfVectorizer) instead of raw word counts.

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

# Don't fit the vectorizer again on test_x; reuse the training vocabulary
test_x_vectors = vectorizer.transform(test_x)

train_x_vectors[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])
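The output is now made of floats because TF-IDF rescales the raw counts. As a rough sketch of the weighting, assuming TfidfVectorizer's default settings (smoothed IDF and L2 row normalization), each count is multiplied by `ln((1 + n_docs) / (1 + doc_freq)) + 1` and the row is then scaled to unit length. The helper `tfidf_row` and the toy count matrix are made up for illustration.

```python
import math

def tfidf_row(counts, doc_freq, n_docs):
    # Weight each term count by its smoothed inverse document frequency.
    weights = [
        c * (math.log((1 + n_docs) / (1 + df)) + 1.0)
        for c, df in zip(counts, doc_freq)
    ]
    # Scale the row to unit Euclidean length (L2 normalization).
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

# Hypothetical count matrix for the vocabulary below (2 documents).
vocab = ["book", "boring", "great", "story"]
counts = [
    [1, 0, 2, 1],  # "great book, great story"
    [0, 1, 0, 1],  # "boring story"
]
n_docs = len(counts)

# Number of documents in which each term appears at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

rows = [tfidf_row(row, doc_freq, n_docs) for row in counts]
for row in rows:
    print([round(w, 3) for w in row])
```

In the first row, "book" and "story" both occur once, but "story" gets the lower weight because it appears in every document and therefore says less about any one of them.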

### Linear SVM

In [69]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

n=2
print(test_x[n])
print(test_y[n])

clf_svm.predict(test_x_vectors[n])

I think you should just take the title to heart, and skip the content. Try that experience. It's a little bit like seeing a movie preview and then realizing that the full length film isn't going to give you that much more information.
NEGATIVE


array(['NEGATIVE'], dtype='<U8')

### Decision Tree

In [72]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)


n=42
print(test_x[n])
print(test_y[n])

clf_dec.predict(test_x_vectors[n])

Although the Navajo element was interesting as background and provided a view as to life around a reservation, with its cultural nuances, the story itself was very basic. There were limited twists, if any, and the ending left me feeling like the reading journey was uneventful. Disappointing overall.
NEGATIVE


array(['POSITIVE'], dtype='<U8')

### Naive Bayes

In [70]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
# GaussianNB needs a dense array, not a sparse matrix
clf_gnb.fit(train_x_vectors.toarray(), train_y)


n=5
print(test_x[n])
print(test_y[n])

clf_gnb.predict(test_x_vectors[n].toarray())

I bought it as a refresher, something to read on the plane. It doesn't work on Kindle....tables and text get scrambled. You can work it out but quickly get to a point where the pain is way higher than the payoff.
NEGATIVE


array(['POSITIVE'], dtype='<U8')

### Logistic Regression

In [71]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y) 


n=6
print(test_x[n])
print(test_y[n])

clf_log.predict(test_x_vectors[n])

I really wanted to like this book. Everything was there for a great story, it just never came together. The conversation between characters felt stilted and forced, even when the author clearly wasn't meaning for it to. There wasn't enough character development for the supporting characters, but with that said this book seemed to still drag on and on. It was irritating to read and I had to make myself finish it. I believe that this author has a ton of potential.... but needs refining ( based off of this book)
NEGATIVE


array(['NEGATIVE'], dtype='<U8')

### Evaluation

In [73]:
# Mean accuracies
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6298076923076923
0.6610576923076923
0.8052884615384616


In [74]:
# F1 Scores
from sklearn.metrics import f1_score

labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]
print(labels)

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=labels))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=labels))


['POSITIVE', 'NEUTRAL', 'NEGATIVE']
[0.80582524 0.         0.80952381]
[0.62439024 0.         0.63507109]
[0.65693431 0.         0.66508314]
[0.80291971 0.         0.80760095]




In [75]:
test_set = ['Very good book!', 'A must read for any war buff.',
            'It is not bad', 'Horrible Waste of time'] 

test_set = vectorizer.transform(test_set)

print(clf_svm.predict(test_set))
print(clf_dec.predict(test_set))
print(clf_gnb.predict(test_set.toarray()))
print(clf_log.predict(test_set))

['POSITIVE' 'POSITIVE' 'NEGATIVE' 'NEGATIVE']
['POSITIVE' 'POSITIVE' 'NEGATIVE' 'POSITIVE']
['NEGATIVE' 'POSITIVE' 'NEGATIVE' 'NEGATIVE']
['POSITIVE' 'POSITIVE' 'NEGATIVE' 'NEGATIVE']


### Tuning Our Model (with Grid Search)

In [76]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('linear', 'rbf'), 'C':[1,2,4,8,16]}

svc = svm.SVC()
tuned_svm = GridSearchCV(svc, parameters,cv=5)
tuned_svm.fit(train_x_vectors, train_y)


GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [1, 2, 4, 8, 16], 'kernel': ('linear', 'rbf')})
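To make clear what the grid search does, the sketch below enumerates the candidate settings implied by the `parameters` dict in plain Python. It only illustrates the Cartesian product of the options and the resulting number of cross-validation fits; it is not how GridSearchCV is implemented internally.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 2, 4, 8, 16]}
cv = 5

# Every combination of one value per parameter is a candidate model.
names = sorted(parameters)
combos = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

# Each candidate is trained and scored once per cross-validation fold.
n_fits = len(combos) * cv

for combo in combos:
    print(combo)
print(len(combos), "candidates,", n_fits, "cross-validation fits")
```

So 2 kernels times 5 values of C give 10 candidates, and with 5 folds that is 50 fits before the best setting is refitted on the whole training set.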

In [77]:
import pandas as pd

# cv_results_ is a dict of arrays, one entry per parameter combination
resdf = pd.DataFrame(tuned_svm.cv_results_)
resdf

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.210116 | 0.009128 | 0.044887 | 0.001079 | 1 | linear | {'C': 1, 'kernel': 'linear'} | 0.845714 | 0.817143 | 0.873563 | 0.793103 | 0.816092 | 0.829123 | 0.027788 | 6 |
| 1 | 0.228531 | 0.002997 | 0.053695 | 0.001238 | 1 | rbf | {'C': 1, 'kernel': 'rbf'} | 0.868571 | 0.84 | 0.885057 | 0.775862 | 0.816092 | 0.837117 | 0.038705 | 2 |
| 2 | 0.215722 | 0.006513 | 0.043921 | 0.001008 | 2 | linear | {'C': 2, 'kernel': 'linear'} | 0.828571 | 0.817143 | 0.862069 | 0.810345 | 0.821839 | 0.827993 | 0.018047 | 7 |
| 3 | 0.231588 | 0.004199 | 0.054353 | 0.000959 | 2 | rbf | {'C': 2, 'kernel': 'rbf'} | 0.868571 | 0.84 | 0.890805 | 0.775862 | 0.821839 | 0.839415 | 0.039596 | 1 |
| 4 | 0.225223 | 0.008298 | 0.044203 | 0.001582 | 4 | linear | {'C': 4, 'kernel': 'linear'} | 0.828571 | 0.84 | 0.850575 | 0.804598 | 0.816092 | 0.827967 | 0.016392 | 8 |
| 5 | 0.238724 | 0.004569 | 0.057497 | 0.002302 | 4 | rbf | {'C': 4, 'kernel': 'rbf'} | 0.868571 | 0.84 | 0.87931 | 0.775862 | 0.821839 | 0.837117 | 0.036779 | 2 |
| 6 | 0.224928 | 0.008383 | 0.045405 | 0.001535 | 8 | linear | {'C': 8, 'kernel': 'linear'} | 0.828571 | 0.84 | 0.850575 | 0.804598 | 0.816092 | 0.827967 | 0.016392 | 8 |
| 7 | 0.231566 | 0.002896 | 0.05392 | 0.001242 | 8 | rbf | {'C': 8, 'kernel': 'rbf'} | 0.868571 | 0.84 | 0.87931 | 0.775862 | 0.821839 | 0.837117 | 0.036779 | 2 |
| 8 | 0.220163 | 0.006604 | 0.047966 | 0.005303 | 16 | linear | {'C': 16, 'kernel': 'linear'} | 0.828571 | 0.84 | 0.850575 | 0.804598 | 0.816092 | 0.827967 | 0.016392 | 8 |
| 9 | 0.24585 | 0.011305 | 0.056053 | 0.003503 | 16 | rbf | {'C': 16, 'kernel': 'rbf'} | 0.868571 | 0.84 | 0.87931 | 0.775862 | 0.821839 | 0.837117 | 0.036779 | 2 |


In [95]:
print(tuned_svm.best_estimator_)
print(tuned_svm.best_params_)

SVC(C=2)
{'C': 2, 'kernel': 'rbf'}


In [79]:
print(tuned_svm.score(test_x_vectors, test_y))

0.8173076923076923


In [96]:
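best_tuned_svm = tuned_svm.best_estimator_

# GridSearchCV already refits the best estimator on the full training set
# (refit=True by default), so this extra fit is redundant but harmless.
best_tuned_svm.fit(train_x_vectors, train_y)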

n=2
print(test_x[n])
print(test_y[n])

best_tuned_svm.predict(test_x_vectors[n])

I think you should just take the title to heart, and skip the content. Try that experience. It's a little bit like seeing a movie preview and then realizing that the full length film isn't going to give you that much more information.
NEGATIVE


array(['NEGATIVE'], dtype='<U8')

In [97]:
print(best_tuned_svm.score(test_x_vectors, test_y))

0.8173076923076923


In [99]:
best_tuned_svm.get_params(deep=True)

{'C': 2,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [100]:
# F1 Scores
from sklearn.metrics import f1_score

labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]
print(labels)

print(f1_score(test_y, tuned_svm.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, best_tuned_svm.predict(test_x_vectors), average=None, labels=labels))

['POSITIVE', 'NEUTRAL', 'NEGATIVE']
[0.82075472 0.         0.81372549]
[0.82075472 0.         0.81372549]




### Saving the Model

In [104]:
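import os
import pickle

# Create the output directory if it doesn't exist yet
os.makedirs('./models', exist_ok=True)

with open('./models/senti_clf_svm_c2_rbf.pkl', 'wb') as f:
    pickle.dump(best_tuned_svm, f)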

### Loading the Model

In [105]:
import pickle

with open('./models/senti_clf_svm_c2_rbf.pkl', 'rb') as f:
    loaded_svm_clf = pickle.load(f)
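Note that the pickled classifier expects input that has already been vectorized, so the fitted vectorizer has to be available wherever the model is loaded. One possible approach, sketched here as an assumption rather than what this notebook does, is to bundle the vectorizer and the classifier in a scikit-learn Pipeline and pickle that single object. The toy texts are made up, and `pickle.dumps` is used in place of a file to keep the example self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for train_x / train_y.
texts = [
    "great book loved it",
    "wonderful story great characters",
    "terrible plot boring",
    "awful writing waste of time",
]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it.
model = make_pipeline(TfidfVectorizer(), SVC(C=2, kernel="rbf"))
model.fit(texts, labels)

# Serialize and restore the whole pipeline in one step.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored object accepts raw strings directly.
print(restored.predict(["a great read"]))
```

With this layout there is only one artifact to save and load, and the text preprocessing cannot drift out of sync with the classifier.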

In [111]:
n = 42
print(test_x[n])
print(test_y[n])
loaded_svm_clf.predict(test_x_vectors[n])

Although the Navajo element was interesting as background and provided a view as to life around a reservation, with its cultural nuances, the story itself was very basic. There were limited twists, if any, and the ending left me feeling like the reading journey was uneventful. Disappointing overall.
NEGATIVE


array(['NEGATIVE'], dtype='<U8')