### Data Classes

In [70]:
import random

class Sentiment:  # enum class
    NEGATIVE = 'NEGATIVE'
    NEUTRAL = 'NEUTRAL'
    POSITIVE = 'POSITIVE'

class Review:
    def __init__(self, text, score):  # 'self' is a necessary initialization, other two are optional
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
    
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else: 
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

### Load the Data

In [50]:
import json

file_name = './sklearn-master/data/sentiment/Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        # reviews.append((review['reviewText'], review['overall']))     ## without using class
        reviews.append(Review(review['reviewText'], review['overall']))    ## using class

# reviews[5][1]    ## without using class
# reviews[5][0]    ## without using class

print(reviews[5].score)    ## using class
print(reviews[5].text)    ## using class


5.0
I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.


### Prep the Data

In [84]:
from sklearn.model_selection import train_test_split # trying to split the data in training and test set

training, test = train_test_split(reviews, test_size = 0.33, random_state = 42)
print(len(training))
print(len(test))

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)

train_container.evenly_distribute()
test_container.evenly_distribute()

6700
3300


In [87]:
# train_x = [x.text for x in training]
# train_y = [x.sentiment for x in training]

# test_x = [x.text for x in test]
# test_y = [x.sentiment for x in test]

train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))

# print(train_x[0], train_y[0], '\n')
# print(test_x[0], test_y[0])

208
208


#### Bag of Words Vectorization

In [131]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# CountVectorizer weights each word equally - Point to be noted
# So we can use something like Term Frequency Inverse Document Frequency Vectorizer to handle that issue

# vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x) # 1 STEP PROCESS

test_x_vectors = vectorizer.transform(test_x) # The reasons we didn't fit is
# because we didn't wanna fit a new model for testing, the model will be the training model

# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x) # 2 STEP PROCESS

print(train_x[0])
print(train_x_vectors[0])

This book is a really good choice for anyone who wants to learn about the Louisiana Purchase, but is not interested in a heavily documented academic study. I prefer a documented book on any history topic, but I realize that is not what many prefer. Fleming is an excellent writer and while this book is not documented it is a reliable source on this momentous event in United States history. Highly recommended.
  (0, 6444)	0.11390450271244093
  (0, 3769)	0.10358985835820621
  (0, 7469)	0.14312928345225817
  (0, 8359)	0.156877491045465
  (0, 2783)	0.15139547296682715
  (0, 5164)	0.17390619455317205
  (0, 7338)	0.156877491045465
  (0, 6524)	0.17390619455317205
  (0, 4277)	0.03573968486953971
  (0, 8693)	0.08800670271859598
  (0, 423)	0.031069135183508054
  (0, 8838)	0.10820978004877607
  (0, 2814)	0.12440561647121268
  (0, 416)	0.05922282121984397
  (0, 3116)	0.17390619455317205
  (0, 4891)	0.07829496533593017
  (0, 8679)	0.061518235586859364
  (0, 7925)	0.04367989242155538
  (0, 6408)	0.13

### Classification

#### Linear SVM

In [121]:
from sklearn import svm

clf_svm = svm.SVC(kernel = 'linear') # setting the classifier as SVM
clf_svm.fit(train_x_vectors, train_y) # training the model

print(test_x[0])

print(clf_svm.predict(test_x_vectors[0]))

I'm absolutely sick and tired of the man cheats, woman sucks it up, no grovel romances.  Caden cheated.  More than once.  He had ZERO grovel time.  None.  Zip, zilch, nada.  Oh, but we are told he is sorry.  And remorseful.  Ever is like 'dude, you cheated.. How many times?' And flounces off for a WEEK, to come back and say I forgive you, for me, because I can't find a better man who won't step out on me.  Give me a break!  Do I want reality in my romances, not necessarily- I don't need discourse on hygiene, taxes, and mortgage payments, but I expect to read about realistic response when you find out your TWIN and your HUSBAND consistently bumped uglies whilst you were in a coma.  Reality- if Caden and Ever had a chance, it would have been after she dumped him, got some therapy, and found herself some equal footing.  Not a week and oh, I forgive you.  Eden didn't deserve squat and she is the one who walks away a winner.  Uh, yeah, no.  I was wondering how Wilder would make this work an

#### Decision Tree

In [122]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [130]:
# from sklearn.naive_bayes import GaussianNB

# clf_gnb = GaussianNB()
# clf_gnb.fit(train_x_vectors.todense(), train_y)

# clf_gnb.predict(test_x_vectors[0])

#### Logistic Regression

In [98]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

### Evaluation

#### Mean Accuracy

In [99]:
print('Accuracy for SVM is: ' + str(round(float(clf_svm.score(test_x_vectors, test_y))*100, 2)) + '%')
print('Accuracy for Decision Tree is: ' + str(round(float(clf_dec.score(test_x_vectors, test_y))*100, 2)) + '%')
print('Accuracy for Logistic Regression is: ' + str(round(float(clf_log.score(test_x_vectors, test_y))*100, 2)) + '%')

# Case 1
# Accuracy for SVM is: 82.42%
# Accuracy for Decision Tree is: 76.36%
# Accuracy for Logistic Regression is: 83.03%

# Case 2
# Accuracy for SVM is: 79.81%
# Accuracy for Decision Tree is: 64.42%
# Accuracy for Logistic Regression is: 81.49%

# Case 3
# Accuracy for SVM is: 80.77%
# Accuracy for Decision Tree is: 62.74%
# Accuracy for Logistic Regression is: 80.53%

Accuracy for SVM is: 80.77%
Accuracy for Decision Tree is: 62.74%
Accuracy for Logistic Regression is: 80.53%


#### F1 Scores

In [100]:
from sklearn.metrics import f1_score

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average = None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]))

# Case 1
# [0.91319444 0.21052632 0.22222222]
# [0.87170475 0.1        0.06451613]
# [0.91370558 0.12244898 0.1       ]

# Case 2
# [0.8028169  0.79310345]
# [0.65258216 0.63546798]
# [0.82051282 0.808933  ]  ## Didn't include NEUTRAL

# Case 3
# [0.80582524 0.80952381]
# [0.61728395 0.63700234]
# [0.80291971 0.80760095]

[0.80582524 0.80952381]
[0.61728395 0.63700234]
[0.80291971 0.80760095]


##### At this point we can see that all our models are performing very well when it comes to predicting the POSITIVE part of the rating but very bad when it comes to predicting the NEUTRAL or the NEGATIVE. So now, we will be improving the NEUTRAL/NEGATIVE part of our model

In [83]:
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


##### It is quite evident that the number of positive versus the number of negative sentiments is very heavily biased towards the positive side, hence, our model will also likely be biased, which shows in the F1 score of the testing phase

##### Now, we will go up again at the beginning and replace the current JSON file with the new JSON file containing not just 1000, but around 10000 data points so our model can be trained better 

In [101]:
test_set = ['This book is not good', 'horrible book', 'interesting book']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['NEGATIVE', 'POSITIVE', 'POSITIVE'], dtype='<U8')

In [106]:
from sklearn.model_selection import GridSearchCV
# to find out the best parameters to be used within a classifier parameter list

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
tuned_clf_svm = GridSearchCV(svc, parameters, cv = 5) # 'cv' stands for cross validation, 
# or number of iterations to check the optimal parameter values

tuned_clf_svm.fit(train_x_vectors, train_y)

dec = DecisionTreeClassifier()
tuned_clf_dec = GridSearchCV(svc, parameters, cv = 5)

tuned_clf_dec.fit(train_x_vectors, train_y)

In [107]:
print('Accuracy for SVM is: ' + str(round(float(tuned_clf_svm.score(test_x_vectors, test_y))*100, 2)) + '%')
print('Accuracy for Decision Tree is: ' + str(round(float(tuned_clf_dec.score(test_x_vectors, test_y))*100, 2)) + '%')

Accuracy for SVM is: 80.77%
Accuracy for Decision Tree is: 80.77%


### Saving the Model

In [109]:
import pickle # for saving the models

with open('./models/sentiment_classifier_dectree.pkl', 'wb') as f:
    pickle.dump(tuned_clf_dec, f)

with open('./models/sentiment_classifier_svm.pkl', 'wb') as f:
    pickle.dump(tuned_clf_svm, f)

### Load the Model

In [110]:
with open('./models/sentiment_classifier_dectree.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [116]:
for i in range(10):
    print(test_x[i]) if len(test_x[i]) < 50 else print(test_x[i][:50])
    print(loaded_clf.predict(test_x_vectors[i]))

I'm absolutely sick and tired of the man cheats, w
['NEGATIVE']
Although the Navajo element was interesting as bac
['NEGATIVE']
Enjoyed the book at lot. It has a very good plot. 
['POSITIVE']
did not serve my need.
['NEGATIVE']
Well, the story did have potential but oh boy, I'v
['NEGATIVE']
All my annoyance melted. "You dumb-a@#," I crooned
['POSITIVE']
Confusing, too many tangents, easily forgotten
['NEGATIVE']
I love HM Ward and I love the Fero boys. But becau
['POSITIVE']
Intercepting Love- LP Dover ( 4 Stars)This book wa
['NEGATIVE']
My husband and I got new computers each and he had
['NEGATIVE']
