#### Data Class

In [91]:
import random
class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    def get_text(self):
        return [x.text for x in self.reviews]
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    def evenly_distribute(self):
        # Downsample to the size of the smaller class; NEUTRAL reviews are dropped.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        n = min(len(negative), len(positive))
        self.reviews = negative[:n] + positive[:n]
        random.shuffle(self.reviews)
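
To see what the balancing step does, here is a small self-contained sketch on made-up star ratings. The ratings, the `to_sentiment` helper, and the fixed shuffle seed are assumptions for illustration only; the thresholds mirror `Review.get_sentiment`. It takes `min(...)` of the two class sizes so the result stays balanced even if negatives outnumber positives.

```python
import random
from collections import Counter

def to_sentiment(score):
    # Same thresholds as Review.get_sentiment: <=2 negative, 3 neutral, else positive.
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"

# Hypothetical ratings, skewed toward positive.
scores = [5, 4, 5, 1, 3, 5, 2, 4, 5, 3]
labels = [to_sentiment(s) for s in scores]
before = Counter(labels)

# Keep only the two polar classes and cut both to the smaller class size.
negative = [l for l in labels if l == "NEGATIVE"]
positive = [l for l in labels if l == "POSITIVE"]
n = min(len(negative), len(positive))
balanced = negative[:n] + positive[:n]
random.Random(0).shuffle(balanced)  # seeded only to make the sketch repeatable
after = Counter(balanced)

print(before)
print(after)
```

Balancing discards data (here 6 of 10 ratings), which is why the training set later shrinks from 6700 reviews to 436 per class.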
        
        
        

#### Load data

In [92]:
import json

file_name = 'Books_small_10000.json'

reviews = []

with open(file_name) as f:
    for line in f:
        # Each line holds one JSON object; json.loads parses it into a Python dict
        # (json.dumps does the reverse, dict to JSON string).
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
reviews[3].sentiment


'POSITIVE'
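
The loading loop relies on the file being in JSON Lines form, one JSON object per line. A minimal sketch of that parsing step, using an in-memory file with two made-up records (the record contents are assumptions; the `reviewText` and `overall` keys are the ones used above):

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
fake_file = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

records = []
for line in fake_file:
    record = json.loads(line)  # JSON string -> Python dict
    records.append((record["reviewText"], record["overall"]))

# json.dumps goes the other way: Python dict -> JSON string.
round_trip = json.dumps({"reviewText": "Loved it", "overall": 5.0})

print(records)
```

Calling `json.load` on the whole file would fail here, because the file as a whole is not a single JSON document.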

#### Prep data

In [93]:
from sklearn.model_selection import train_test_split
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)



In [94]:
len(test) 

3300

In [95]:
len(training)

6700

In [96]:
print(training[0].text)
print(training[0].score)

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an

In [98]:
train_container.evenly_distribute()

train_x = train_container.get_text()       # x: the input (review text)
train_y = train_container.get_sentiment()  # y: the label we want to predict (sentiment)

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()


print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


#### Bag of words vectorizer (TF-IDF)

In [119]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and IDF weights from the training text and
# vectorizes it in one call. It is equivalent to:
#   vectorizer.fit(train_x)
#   train_x_vectors = vectorizer.transform(train_x)
train_x_vectors = vectorizer.fit_transform(train_x)

# Only transform the test data, so it is encoded with the vocabulary learned from training.
test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())


    

I bought this book because I have always had an interest in insects and like many other books on this subject I thought I would get a lot of info about the structure of indiviidual insects along with hey looked functioned mated etc,odd facts,a history of each one in general ABOUT insects this author seems hung up on the holocost and the Nazis habit of comparing jews to insects etc etc,very little actual info about individual insects,so if you are interested in insects this is not the book to buy,this is more of a history book.
[[0. 0. 0. ... 0. 0. 0.]]
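
The printed row is almost all zeros because each review uses only a small fraction of the learned vocabulary. To make the fit/transform split concrete, here is a pure-Python sketch of TF-IDF weighting. It follows the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults, as far as I understand them, but it is an illustration and not the library's implementation; the function names and the three toy documents are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_tfidf(docs):
    """Learn the vocabulary and smoothed IDF weights from training documents."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    return vocab, idf

def transform_tfidf(docs, vocab, idf):
    """Encode documents with an already learned vocabulary; unknown words are ignored."""
    rows = []
    for d in docs:
        counts = Counter(tokenize(d))
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm if norm else 0.0 for v in row])
    return rows

train_docs = ["great book", "bad book", "great great read"]
vocab, idf = fit_tfidf(train_docs)                      # fit: training data only
train_vectors = transform_tfidf(train_docs, vocab, idf)
test_vectors = transform_tfidf(["unseen words only"], vocab, idf)

print(vocab)
print(test_vectors)
```

The test document shares no words with the training vocabulary, so its vector is all zeros. That is the reason the test set is only transformed and never fitted: the classifier can only use columns it saw during training.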


## Classification

#### Linear SVM

In [120]:
from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
print(test_x[0])
clf_svm.predict(test_x_vectors[0])

This book is published by Amazon - so I was surprised at the clunky writing. The manuscript cried out for an editor who could have removed some of the cliches and the profusion of needless adjectives and adverbs. Of course, no editor could have breathed much life into the wooden characters and stilted dialogue - but that's another issue.The quality of the writing shows itself early when we learn that one character has &#34;dread etched on his face,&#34; and another &#34;burned with curiosity&#34; while a third has a mouth &#34;frozen in a crooked half smile&#34; and for a fourth, &#34;anger coursed through his body.&#34; A young girl writes in her diary that &#34;Virginie respects that which is haram or forbidden.&#34; Kind of the diarist to explain.I have to admit I did not get through this long slog set in Egypt in 1919 and 1940. The world is too full of good books to waste time on mediocre ones.


array(['NEGATIVE'], dtype='<U8')

#### Decision Tree

In [121]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
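print(test_x[0])
clf_dec.predict(test_x_vectors[0])

This book is published by Amazon - so I was surprised at the clunky writing. The manuscript cried out for an editor who could have removed some of the cliches and the profusion of needless adjectives and adverbs. Of course, no editor could have breathed much life into the wooden characters and stilted dialogue - but that's another issue.The quality of the writing shows itself early when we learn that one character has &#34;dread etched on his face,&#34; and another &#34;burned with curiosity&#34; while a third has a mouth &#34;frozen in a crooked half smile&#34; and for a fourth, &#34;anger coursed through his body.&#34; A young girl writes in her diary that &#34;Virginie respects that which is haram or forbidden.&#34; Kind of the diarist to explain.I have to admit I did not get through this long slog set in Egypt in 1919 and 1940. The world is too full of good books to waste time on mediocre ones.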


array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [143]:
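from sklearn.naive_bayes import GaussianNB
clf_gnb = GaussianNB()
# GaussianNB needs dense input. toarray() returns a plain ndarray, whereas
# todense() returns np.matrix, which newer scikit-learn releases reject.
clf_gnb.fit(train_x_vectors.toarray(), train_y)
clf_gnb.predict(test_x_vectors[0].toarray())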



array(['NEGATIVE'], dtype='<U8')

#### Logistic Regression

In [144]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)
print(test_x[0])
clf_log.predict(test_x_vectors[0])

This book is published by Amazon - so I was surprised at the clunky writing. The manuscript cried out for an editor who could have removed some of the cliches and the profusion of needless adjectives and adverbs. Of course, no editor could have breathed much life into the wooden characters and stilted dialogue - but that's another issue.The quality of the writing shows itself early when we learn that one character has &#34;dread etched on his face,&#34; and another &#34;burned with curiosity&#34; while a third has a mouth &#34;frozen in a crooked half smile&#34; and for a fourth, &#34;anger coursed through his body.&#34; A young girl writes in her diary that &#34;Virginie respects that which is haram or forbidden.&#34; Kind of the diarist to explain.I have to admit I did not get through this long slog set in Egypt in 1919 and 1940. The world is too full of good books to waste time on mediocre ones.


array(['NEGATIVE'], dtype='<U8')

#### Evaluation

In [147]:
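# Mean accuracy on the balanced test set, printed in this order:
# SVM, Gaussian naive Bayes, decision tree, logistic regression
print(clf_svm.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))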

0.8076923076923077
0.6610576923076923
0.6706730769230769
0.8052884615384616




In [124]:
from sklearn.metrics import f1_score

# Per-class F1 score, reported in the order given by `labels`
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))

# Uncomment to compare the other classifiers:
# print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))
# print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE]))


[0.80582524 0.80952381]
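
With `average=None`, one F1 value is returned per label, each being the harmonic mean of that label's precision and recall. A pure-Python sketch of the calculation on six made-up predictions (the `f1_per_class` helper and the toy labels are assumptions, not part of the notebook):

```python
def f1_per_class(y_true, y_pred, labels):
    """Return one F1 score per label, in the order of `labels`."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores

# Made-up labels: one positive and one negative review are misclassified.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

scores = f1_per_class(y_true, y_pred, ["POSITIVE", "NEGATIVE"])
print(scores)
```

Reporting F1 per class is more informative than mean accuracy when classes are imbalanced; on a balanced test set like this one the two metrics tend to tell a similar story.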


In [125]:
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))
print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))

436
436
208
208


In [127]:
test_set = ["I thoroughly enjoyed this,5 stars", "bad book", "horrible waste of time"]
# New text must go through the same fitted vectorizer that was used for training.
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)


array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

#### Tuning our model (with grid search)

In [135]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)

clf.fit(train_x_vectors, train_y)
clf.best_params_

{'C': 4, 'kernel': 'rbf'}
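
To get a feel for how much work the search does, the sketch below enumerates the same parameter grid by hand. It only counts candidates and does not train anything; the variable names are assumptions for illustration.

```python
from itertools import product

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
cv = 5  # number of cross-validation folds, as passed to GridSearchCV above

# Every combination of one value per parameter is a candidate model.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

# Each candidate is trained once per fold.
cv_fits = len(candidates) * cv

print(len(candidates), cv_fits)
```

Two kernels times five values of `C` gives 10 candidates and 50 cross-validation fits. By default `GridSearchCV` then refits the best candidate on the whole training set, so the cost grows quickly as parameters are added to the grid.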

In [136]:
print(clf.score(test_x_vectors, test_y))

0.8197115384615384


#### Saving model

In [138]:
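import os
import pickle

# open() fails if ./models does not exist, so create it first.
os.makedirs('./models', exist_ok=True)
with open('./models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)  # saves the tuned classifier only, not the fitted vectorizer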

#### Load model

In [139]:
with open('./models/sentiment_classifier.pkl','rb') as f:
    loaded_clf = pickle.load(f)

In [141]:
print(test_x[0])

loaded_clf.predict(test_x_vectors[0])

This book is published by Amazon - so I was surprised at the clunky writing. The manuscript cried out for an editor who could have removed some of the cliches and the profusion of needless adjectives and adverbs. Of course, no editor could have breathed much life into the wooden characters and stilted dialogue - but that's another issue.The quality of the writing shows itself early when we learn that one character has &#34;dread etched on his face,&#34; and another &#34;burned with curiosity&#34; while a third has a mouth &#34;frozen in a crooked half smile&#34; and for a fourth, &#34;anger coursed through his body.&#34; A young girl writes in her diary that &#34;Virginie respects that which is haram or forbidden.&#34; Kind of the diarist to explain.I have to admit I did not get through this long slog set in Egypt in 1919 and 1940. The world is too full of good books to waste time on mediocre ones.


array(['NEGATIVE'], dtype='<U8')