## Predicting Positive or Negative Comments

In [113]:
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

In [114]:
class Review:
    
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

In [None]:
# Not used here, but this container could be used to balance the training data so that no
# sentiment class outnumbers the others, which should help the model perform better.

class ReviewContainer:
    
    def __init__(self, reviews):
        self.reviews = reviews
        
    def evenly_distribute(self):
        # filter() returns a lazy iterator in Python 3, so wrap it in list() to allow indexing
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        neutral = list(filter(lambda x: x.sentiment == Sentiment.NEUTRAL, self.reviews))
        
        # Print one sample per class
        print(negative[0].text)
        print(positive[0].text)
        print(neutral[0].text)
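
The comment above describes balancing the classes, but `evenly_distribute` only prints one sample per class. A minimal sketch of one way to do the balancing, by downsampling every class to the size of the smallest one, could look like this. The `Item` tuple and the `downsample` function are hypothetical stand-ins for illustration, not part of the notebook.

```python
import random
from collections import Counter, namedtuple

# Hypothetical stand-in for the notebook's Review class (only text and sentiment matter here).
Item = namedtuple("Item", ["text", "sentiment"])

def downsample(reviews, seed=42):
    """Return a shuffled list in which every sentiment class has the same size."""
    groups = {}
    for review in reviews:
        groups.setdefault(review.sentiment, []).append(review)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Made-up, imbalanced data: 6 positive, 3 neutral, 2 negative.
reviews = (
    [Item(f"good {i}", "POSITIVE") for i in range(6)]
    + [Item(f"okay {i}", "NEUTRAL") for i in range(3)]
    + [Item(f"bad {i}", "NEGATIVE") for i in range(2)]
)
balanced = downsample(reviews)
print(Counter(item.sentiment for item in balanced))
```

Downsampling throws away data from the larger classes, so it trades training volume for balance.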
    

In [115]:
import json

In [116]:
# Run with this file first and check the model: it performs well for POSITIVE but poorly for the other classes, as shown below.
#file_name = "books_small.json" 

# Then run with this file, which has more varied data. With it, the predictions for all three classes improve.

file_name = "books_big.json" 

reviews = []

with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review["reviewText"], review["overall"]))
        
print(reviews[1].text)
print(reviews[1].score)
print(reviews[1].sentiment)

I enjoyed this short book. But it was way way to short ....I can see how easily it would have been to add several chapters.
3.0
NEUTRAL
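
The loop above assumes that each line of the file is a separate JSON object (the JSON Lines layout) with at least a `reviewText` and an `overall` field. A small sketch with two made-up records, read from an in-memory `io.StringIO` buffer instead of a file, shows the parsing step in isolation.

```python
import io
import json

# Two made-up records in the same shape the notebook reads; not taken from the dataset.
fake_file = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

parsed = []
for line in fake_file:
    record = json.loads(line)  # one JSON object per line
    parsed.append((record["reviewText"], record["overall"]))

print(parsed)
```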


## Prep Data

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
# 33% will be test data and 67% will be training data

training, test = train_test_split(reviews, test_size=0.33, random_state=42) 

print(len(training))
print(len(test))
print(training[0].text)
print(test[0].text)

6700
3300
Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found thi
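
A plain random split does not guarantee that rare classes appear in the same proportion in both parts. As a hedged sketch, `train_test_split` accepts a `stratify` argument that keeps the label proportions roughly equal. The labels below are made up, and the notebook itself does not use this option.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 90 positive and 10 negative samples.
labels = ["POSITIVE"] * 90 + ["NEGATIVE"] * 10
samples = list(range(len(labels)))  # placeholder items standing in for reviews

train_items, test_items, train_labels, test_labels = train_test_split(
    samples, labels, test_size=0.33, random_state=42, stratify=labels
)

print(len(train_items), len(test_items))
print(Counter(train_labels), Counter(test_labels))
```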

## Training Data

In [119]:
# Training features (x) and labels (y)

train_x = [x.text for x in training] # Features: the review text
train_y = [x.sentiment for x in training] # Labels: POSITIVE, NEGATIVE or NEUTRAL

print(train_x[0])
print(train_y[0])

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an

## Test Data

In [120]:
test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(test_x[0])
print(test_y[0])

was sent an Arc of this book for an honest review and here it is = This is the kind of book that you want to read while sitting in front of the fire with a cup of hot apple cider and a blanket over your legs.I have read many of Jaci Burton's books and have never been disappointed. This first book in her new Hope series does not disappoint either.This is the story of Emma, a new vet who has come back home to open her own practice and Luke McCormack, a police officer in the same town.Both have been previously burned by love so both have issues but, that doesn't stop them from acting on that attraction.This book pulls you in from the first page, wraps you up and doesn't let you go until the end.I loved it!
POSITIVE


## Bag of Words Vectorization (converts each text into a vector of word counts)

In [121]:
from sklearn.feature_extraction.text import CountVectorizer

### Training and Test Data Vectorization

In [122]:
vectorizer = CountVectorizer()

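# Option 1: fit and transform in two steps

# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x)

# Option 2: fit and transform in one step

train_x_vectors = vectorizer.fit_transform(train_x)

# The test data is only transformed, not fitted, so it reuses the vocabulary learned from the training data.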
test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0])
print(train_x_vectors[0].toarray())

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an
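
The review text is too long to see what the vectorizer produces, so here is a minimal sketch on two made-up sentences. It shows the learned vocabulary, the resulting count vectors, and how `transform` ignores words that were not seen during fitting. The name `toy_vectorizer` is illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, used only to make the vectors small enough to read.
docs = ["this book is great", "this book is bad"]

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each learned word to its column index (alphabetical order).
print(sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]))
print(toy_vectors.toarray())

# Words not seen during fit ("awful") are ignored by transform.
unseen_vector = toy_vectorizer.transform(["great awful book"]).toarray()
print(unseen_vector)
```

Each row is one sentence and each column is one vocabulary word, so the two sentences differ only in the `bad` and `great` columns.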

## Classification

#### Linear SVM (Support Vector Machine)

In [123]:
from sklearn import svm

clf_svm = svm.SVC(kernel="linear")

clf_svm.fit(train_x_vectors, train_y)

print(test_x[0])
print(test_x_vectors[0])

clf_svm.predict(test_x_vectors[0])

was sent an Arc of this book for an honest review and here it is = This is the kind of book that you want to read while sitting in front of the fire with a cup of hot apple cider and a blanket over your legs.I have read many of Jaci Burton's books and have never been disappointed. This first book in her new Hope series does not disappoint either.This is the story of Emma, a new vet who has come back home to open her own practice and Luke McCormack, a police officer in the same town.Both have been previously burned by love so both have issues but, that doesn't stop them from acting on that attraction.This book pulls you in from the first page, wraps you up and doesn't let you go until the end.I loved it!
  (0, 683)	1
  (0, 1295)	2
  (0, 1326)	5
  (0, 1562)	1
  (0, 1624)	1
  (0, 2011)	1
  (0, 2210)	1
  (0, 2543)	2
  (0, 2922)	1
  (0, 3137)	4
  (0, 3161)	1
  (0, 3218)	2
  (0, 3643)	1
  (0, 3655)	1
  (0, 3677)	1
  (0, 3706)	1
  (0, 4520)	1
  (0, 4930)	1
  (0, 5962)	1
  (0, 6910)	1
  (0, 69

array(['POSITIVE'], dtype='<U8')

#### SVM Accuracy

In [125]:
clf_svm.score(test_x_vectors, test_y)

0.8124242424242424

#### Decision Tree

In [126]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()

clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

#### Decision Tree Accuracy

In [127]:
clf_dec.score(test_x_vectors, test_y)

0.7724242424242425

#### Naive Bayes

In [128]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()

clf_gnb.fit(train_x_vectors.toarray(), train_y) # GaussianNB needs a dense array; toarray() returns an ndarray rather than an np.matrix

clf_gnb.predict(test_x_vectors[0].toarray())

array(['POSITIVE'], dtype='<U8')

#### Naive Bayes Accuracy

In [129]:
clf_gnb.score(test_x_vectors.toarray(), test_y)

0.6587878787878788
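
`GaussianNB` models each feature as a continuous, normally distributed value and needs a dense array. For word counts, scikit-learn also provides `MultinomialNB`, which accepts the sparse matrix directly. The sketch below uses made-up sentences and is an alternative to consider, not the notebook's method; whether it scores better on the book reviews is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training sentences and labels, for illustration only.
toy_texts = [
    "great book loved it",
    "loved this great story",
    "terrible boring book",
    "boring waste terrible",
]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_texts)  # stays sparse

clf_mnb = MultinomialNB()
clf_mnb.fit(toy_counts, toy_labels)  # no conversion to a dense array needed

predictions = clf_mnb.predict(toy_vectorizer.transform(["loved great", "boring terrible"]))
print(predictions)
```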

#### Logistic Regression

In [130]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()

clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])


array(['POSITIVE'], dtype='<U8')

#### Logistic Regression Accuracy

In [131]:
clf_log.score(test_x_vectors, test_y)

0.8478787878787879

## Accuracy

In [132]:
# Mean Accuracy

print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8124242424242424
0.7724242424242425
0.6587878787878788
0.8478787878787879


## F1 Score

In [133]:
# F1 Score

from sklearn.metrics import f1_score

# Without labels=..., the scores come back in sorted label order: NEGATIVE, NEUTRAL, POSITIVE
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None))

# With books_small.json, SVM scores well for POSITIVE but poorly for the other classes. It improves with books_big.json.
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, 
      labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

# With books_small.json, Decision Tree scores well for POSITIVE but poorly for the other classes. It improves with books_big.json.
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, 
      labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

# With books_small.json, Naive Bayes scores well for POSITIVE but poorly for the other classes. It improves with books_big.json.
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, 
      labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

# With books_small.json, Logistic Regression scores well for POSITIVE but poorly for the other classes. It improves with books_big.json.
# Logistic Regression accepts the sparse matrix directly, so no dense conversion is needed.
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, 
      labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]))

# The models predict POSITIVE well but the other classes very poorly. This could be a model issue or a data issue.
# The models perform better after switching to books_big.json.

[0.40268456 0.2656     0.90738061]
[0.90738061 0.2656     0.40268456]
[0.87628866 0.17258883 0.17232376]
[0.7996939  0.1260745  0.11851852]
[0.92493017 0.29714286 0.4092219 ]
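
With `average=None`, `f1_score` returns one value per class. Each value is the harmonic mean of that class's precision and recall. A minimal sketch that computes this by hand on made-up labels (the helper `per_class_f1` is hypothetical and not part of scikit-learn):

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 4 positive, 2 negative, 2 neutral reviews.
y_true = ["POSITIVE"] * 4 + ["NEGATIVE"] * 2 + ["NEUTRAL"] * 2
y_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE",
          "NEGATIVE", "POSITIVE",
          "POSITIVE", "NEUTRAL"]

for label in ("POSITIVE", "NEUTRAL", "NEGATIVE"):
    print(label, round(per_class_f1(y_true, y_pred, label), 3))
```

In this toy case POSITIVE has precision 3/5 and recall 3/4, giving an F1 of about 0.667, while NEGATIVE has precision and recall of 1/2 each, giving 0.5.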


## Looking at training data to fix the model

In [135]:
print("TOTAL: " + str(len(train_y)))
print("POSITIVE : " + str(train_y.count(Sentiment.POSITIVE)))
print("NEUTRAL : " + str(train_y.count(Sentiment.NEUTRAL)))
print("NEGATIVE : " + str(train_y.count(Sentiment.NEGATIVE)))

# As we can see, the training data contains far more POSITIVE cases than NEUTRAL or NEGATIVE ones.

TOTAL: 6700
POSITIVE : 5611
NEUTRAL : 653
NEGATIVE : 436
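
These counts also give a reference point for the accuracy figures: a classifier that always answers POSITIVE would already be right for every POSITIVE review. A small sketch of that arithmetic, using the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the training-set output above.
counts = {"POSITIVE": 5611, "NEUTRAL": 653, "NEGATIVE": 436}
total = sum(counts.values())

majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                        # 6700
print(majority_label)               # POSITIVE
print(round(baseline_accuracy, 4))  # 0.8375
```

Assuming the test split has a similar class mix, mean accuracy alone says little about how well NEUTRAL and NEGATIVE are recognized, which is why the per-class F1 scores are more informative here.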


## Sample Testing

In [137]:
test_set = ["I thoroughly enjoyed this, 5 stars", "bad book do not buy", "horrible waste of time"]
new_test = vectorizer.transform(test_set)

print(clf_svm.predict(new_test))
print(clf_log.predict(new_test))
print(clf_dec.predict(new_test))
print(clf_gnb.predict(new_test.toarray()))

['POSITIVE' 'NEGATIVE' 'NEGATIVE']
['POSITIVE' 'NEGATIVE' 'NEGATIVE']
['POSITIVE' 'POSITIVE' 'NEGATIVE']
['NEGATIVE' 'NEGATIVE' 'NEGATIVE']


## Tuning model with Grid Search

In [140]:
from sklearn.model_selection import GridSearchCV

parameters = {
    "kernel": ("linear", "rbf"), 
    "C": (1, 4, 8, 16, 32)
    }

svc = svm.SVC()

clf_grd = GridSearchCV(svc, parameters, cv=5)

clf_grd.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
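
After fitting, a `GridSearchCV` object exposes which parameter combination won. A minimal sketch on made-up one-dimensional data (not the review data) shows the attributes `best_params_`, `best_score_` and `best_estimator_`. On the notebook's object the same attributes would be read from `clf_grd`.

```python
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Made-up, clearly separable toy data: small values are class 0, large values class 1.
toy_x = [[0], [1], [2], [3], [10], [11], [12], [13]]
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]

toy_grid = GridSearchCV(
    svm.SVC(),
    {"kernel": ("linear", "rbf"), "C": (1, 4)},
    cv=2,  # small fold count because the toy data set is tiny
)
toy_grid.fit(toy_x, toy_y)

print(toy_grid.best_params_)  # winning parameter combination
print(toy_grid.best_score_)   # mean cross-validated accuracy of that combination
toy_predictions = toy_grid.best_estimator_.predict([[1], [12]])
print(toy_predictions)
```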

## Grid Search Accuracy

In [142]:
print(clf_grd.score(test_x_vectors, test_y))

0.8396969696969697


## Saving Model

In [143]:
import pickle

with open("./models/Scikit_Learn_Sentiment_Classifier.pkl", "wb") as f:
    pickle.dump(clf_grd, f)

## Importing the Model

In [145]:
with open("./models/Scikit_Learn_Sentiment_Classifier.pkl", "rb") as f:
    loaded_clf = pickle.load(f)

print(test_x[0])    
print(loaded_clf.predict(test_x_vectors[0]))

was sent an Arc of this book for an honest review and here it is = This is the kind of book that you want to read while sitting in front of the fire with a cup of hot apple cider and a blanket over your legs.I have read many of Jaci Burton's books and have never been disappointed. This first book in her new Hope series does not disappoint either.This is the story of Emma, a new vet who has come back home to open her own practice and Luke McCormack, a police officer in the same town.Both have been previously burned by love so both have issues but, that doesn't stop them from acting on that attraction.This book pulls you in from the first page, wraps you up and doesn't let you go until the end.I loved it!
['POSITIVE']
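
The pickle above stores only the classifier, so predicting on new raw text after loading also needs the fitted `vectorizer`. One way to keep the two together, sketched here under the assumption that a scikit-learn `Pipeline` is acceptable, is to pickle a single pipeline object. The sentences are made up, and the round trip goes through memory (`pickle.dumps`) rather than the notebook's file path.

```python
import pickle

from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up training sentences and labels, for illustration only.
toy_texts = [
    "great book loved it",
    "loved this great story",
    "terrible boring book",
    "boring waste terrible",
]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

text_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", svm.SVC(kernel="linear")),
])
text_clf.fit(toy_texts, toy_labels)

# Serialize and restore the whole pipeline in memory.
restored = pickle.loads(pickle.dumps(text_clf))

# The restored object accepts raw strings; vectorization happens inside the pipeline.
restored_predictions = restored.predict(["loved great", "boring terrible"])
print(restored_predictions)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.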
