# Loading the Data

We import the data and extract the features relevant to determining sentiment: the review text and its score.

In [133]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # Score of 4 or 5
            return Sentiment.POSITIVE


class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
        
        
    def evenly_distribute(self):
        # Downsample POSITIVE reviews to match the NEGATIVE count.
        # NEUTRAL reviews are dropped entirely.
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        # neutral = list(filter(lambda x: x.sentiment == Sentiment.NEUTRAL, self.reviews))
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

In [134]:
import json

file_name = "/Users/rishabhtiwari/Desktop/Summer-ml-project/Books_small_10000.json"

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))

reviews[444].text

'A Look InsideDo not for a moment think that this is a romance novel. It definitely is not. While there are romantic relationships involved, the story is much deeper. Its title, The Hurricane  Sisters  A Novel, is accurate. This book covers many social issues today that are often held in family secrets. While it is slow-going at first, once the &#8220;secrets&#8221; begin to be revealed, the story moves right along. Set in the Charleston, South Carolina area, typical of the author, the characters are well-developed. The book is not, in my opinion, appropriate for younger readers, due to the nature of these &#8220;family secrets&#8221;. It could easily be turned into a soap opera for television.'

# Preprocessing Data

In [135]:
len(reviews)

10000

In [136]:
from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state = 42)

train_container = ReviewContainer(training)

test_container = ReviewContainer(test)

len(train_container.reviews)

6700

In [137]:
len(training)

6700

In [138]:
len(test)

3300

In [139]:
print(training[0].text)

Olivia Hampton arrives at the Dunraven family home as cataloger of their extensive library. What she doesn't expect is a broken carriage wheel on the way. Nor a young girl whose mind is clearly gone, an old man in need of care himself (and doesn&#8217;t quite seem all there in Olivia&#8217;s opinion). Furthermore, Marion Dunraven, the only sane one of the bunch and the one Olivia is inexplicable drawn to, seems captive to everyone in the dusty old house. More importantly, she doesn't expect to fall in love with Dunraven's daughter Marion.Can Olivia truly believe the stories of sadness and death that surround the house, or are they all just local neighborhood rumor?Was that carriage trouble just a coincidence or a supernatural sign to stay away? If she remains, will the Castle&#8217;s dark shadows take Olivia down with them or will she and Marion long enough to declare their love?Patty G. Henderson has created an atmospheric and intriguing story in her Gothic tale. I found this to be an

In [140]:
print(training[0].sentiment)

POSITIVE


In [141]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


# Bag-of-Words Vectorization

In [149]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # TF-IDF = term frequency-inverse document frequency

# Example sentences:
# this book is great !
# this book was so bad

# vectorizer = CountVectorizer()
vectorizer = TfidfVectorizer()
# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x)

train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)

print(train_x[0])
print(train_x_vectors[0].toarray())

I fully enjoyed the presentation although you need patience for the first third to get the full picture. Nothing too detailed to lose you in minutia, just a well detailed story with interesting characters. He throws in good excitement at the family level and military decisions. I thoroughly enjoyed the book and it's presented in a way to make you feel it really happened.
[[0. 0. 0. ... 0. 0. 0.]]
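
The printed row above is almost all zeros because each review uses only a small fraction of the vocabulary. To see what the vectorizer produces on something small enough to read, here is a minimal sketch that fits a separate `TfidfVectorizer` on the two example sentences from the comments in the cell above. The names `docs`, `toy_vectorizer`, `toy_vectors`, `vocab`, and `dense` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The two example sentences from the comments in the cell above.
docs = ["this book is great !", "this book was so bad"]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(docs)

# Vocabulary learned from the toy corpus; alphabetical order matches the column order.
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)

# Dense view: one row per sentence, one column per vocabulary word.
dense = toy_vectors.toarray()
print(dense.round(2))
```

Words that occur in both sentences ("this", "book") receive a lower weight than words unique to one sentence ("great", "bad"), which is the point of the inverse-document-frequency term.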


# Classification

### Linear SVM

In [150]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

test_x[0]

clf_svm.predict(test_x_vectors[0])

array(['NEGATIVE'], dtype='<U8')

### Decision Tree

In [151]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

clf_dec.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

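### Naive Bayes

In [152]:
# Disabled: GaussianNB requires dense input, so the sparse TF-IDF matrices
# would need .toarray() before fit/predict.
# from sklearn.naive_bayes import GaussianNB

# clf_gnb = GaussianNB()
# clf_gnb.fit(train_x_vectors.toarray(), train_y)

# clf_gnb.predict(test_x_vectors[0].toarray())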

### Logistic Regression

In [153]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()

clf_log.fit(train_x_vectors, train_y)

clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

# Evaluation

In [154]:
# Mean Accuracy
mean_acc_svm = clf_svm.score(test_x_vectors, test_y)
mean_acc_dec = clf_dec.score(test_x_vectors, test_y)
# mean_acc_gnb = clf_gnb.score(test_x_vectors, test_y)  # clf_gnb is commented out above
mean_acc_log = clf_log.score(test_x_vectors, test_y)

print(mean_acc_svm)
print(mean_acc_dec)
# print(mean_acc_gnb)
print(mean_acc_log)

0.8076923076923077
0.6538461538461539
0.8052884615384616


In [155]:
# F1 Scores
from sklearn.metrics import f1_score

f1_svm = f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels = [Sentiment.POSITIVE,Sentiment.NEGATIVE])
f1_dec = f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels = [Sentiment.POSITIVE,Sentiment.NEGATIVE])
#f1_gnb = f1_score(test_y, clf_gnb.predict(test_x_vectors), average=None, labels = [Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
f1_log = f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE])

print(f1_svm)
print(f1_dec)
#print(f1_gnb)
print(f1_log)

[0.80582524 0.80952381]
[0.65048544 0.65714286]
[0.80291971 0.80760095]


In [156]:
train_y[0:5]

['POSITIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']

Before balancing, POSITIVE reviews far outnumber NEGATIVE ones, which would bias the model toward POSITIVE. After `evenly_distribute()`, the two classes should be equally represented in the test set as well:

In [157]:
print(test_y.count(Sentiment.POSITIVE))
print(test_y.count(Sentiment.NEGATIVE))

208
208


In [158]:
test_set = ['very fun', "bad book do not buy", "horrible waste of time"]
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')

# Tuning our model (with Grid Search)


Grid search is a technique for finding good hyperparameters: it trains and cross-validates the model on every combination in a given parameter grid and keeps the best one.

In [159]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)
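
Once a `GridSearchCV` object is fitted, the winning combination and its cross-validated score can be read from `best_params_` and `best_score_`. The sketch below shows this on a tiny made-up corpus so that it runs on its own. The toy texts, the labels, the name `search`, and `cv=2` (needed because the toy set is so small) are assumptions for illustration, not the notebook's data or settings.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy corpus, invented for illustration only.
toy_texts = [
    "great book loved it", "wonderful story great characters",
    "loved every page", "great fun to read",
    "bad book hated it", "terrible story boring characters",
    "hated every page", "boring waste of time",
]
toy_labels = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

toy_vectors = TfidfVectorizer().fit_transform(toy_texts)

# Same parameter grid as the cell above; cv=2 because the toy set is tiny.
parameters = {"kernel": ("linear", "rbf"), "C": (1, 4, 8, 16, 32)}
search = GridSearchCV(svm.SVC(), parameters, cv=2)
search.fit(toy_vectors, toy_labels)

print(search.best_params_)  # winning kernel / C combination
print(search.best_score_)   # mean cross-validated accuracy of that combination

# With the default refit=True, predict and score delegate to the best estimator.
print(search.predict(toy_vectors[:1]))
```

Applied to the notebook's own objects, the equivalent calls would be `clf.best_params_` and `clf.score(test_x_vectors, test_y)`.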

In [160]:
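# Note: this scores the original clf_svm, not the grid-searched model.
# To evaluate the tuned model, use clf.score(test_x_vectors, test_y) instead.
mean_acc_svm = clf_svm.score(test_x_vectors, test_y)
print(mean_acc_svm)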

0.8076923076923077


There is some room for improvement:
  1. Use more words in the corpus (a larger training set).
  2. Words like "good!" and "good" should be treated as the same token (for example, by stripping punctuation).
  3. Try a different language model.
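
On item 2, it is worth checking how the vectorizer already tokenizes text before adding a custom cleaning step. With default settings, `TfidfVectorizer` lowercases the input and keeps only runs of two or more word characters, so trailing punctuation is dropped. The sketch below checks this with `build_analyzer()`; the names `analyze`, `tokens_plain`, `tokens_punct`, and `tokens_mixed` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the callable the vectorizer uses to turn raw text into tokens.
analyze = TfidfVectorizer().build_analyzer()

tokens_plain = analyze("good")
tokens_punct = analyze("good!")
print(tokens_plain, tokens_punct)

# Case is folded as well, since lowercase=True is the default.
tokens_mixed = analyze("Good, GOOD... good?")
print(tokens_mixed)
```

If a custom `tokenizer` or `token_pattern` is passed to the vectorizer, this behaviour may no longer hold, and an explicit punctuation-stripping step could then be needed.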

# Saving Model

### Save model

In [169]:
import pickle

# Save the classifier and the vectorizer to separate files; writing both to the
# same path would overwrite the classifier with the vectorizer.
# NOTE: the vectorizer file name below is an assumed name, chosen for this fix.
with open(r'/Users/rishabhtiwari/Desktop/Summer-ml-project/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf_svm, f)

with open(r'/Users/rishabhtiwari/Desktop/Summer-ml-project/sentiment_vectorizer.pkl', 'wb') as g:
    pickle.dump(vectorizer, g)

### Load model

In [170]:
import pickle

with open(r'/Users/rishabhtiwari/Desktop/Summer-ml-project/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

with open(r'/Users/rishabhtiwari/Desktop/Summer-ml-project/sentiment_vectorizer.pkl', 'rb') as g:
    loaded_vectorizer = pickle.load(g)
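
A quick round-trip check can confirm that a saved classifier and vectorizer load back as working objects and give the same predictions as the originals. The sketch below is self-contained and uses a throwaway toy corpus, a temporary directory, and one file per object. The data, the file names `clf.pkl` and `vectorizer.pkl`, and all variable names are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, invented for illustration only.
toy_texts = ["great book", "loved it", "bad book", "hated it"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_vectorizer = TfidfVectorizer()
toy_clf = svm.SVC(kernel="linear")
toy_clf.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    clf_path = os.path.join(tmp_dir, "clf.pkl")
    vec_path = os.path.join(tmp_dir, "vectorizer.pkl")

    # Save each object to its own file.
    with open(clf_path, "wb") as f:
        pickle.dump(toy_clf, f)
    with open(vec_path, "wb") as g:
        pickle.dump(toy_vectorizer, g)

    # Load them back while the temporary files still exist.
    with open(clf_path, "rb") as f:
        restored_clf = pickle.load(f)
    with open(vec_path, "rb") as g:
        restored_vectorizer = pickle.load(g)

# New text must go through the restored vectorizer before prediction.
new_text = ["great book"]
original = toy_clf.predict(toy_vectorizer.transform(new_text))
restored = restored_clf.predict(restored_vectorizer.transform(new_text))
print(original, restored)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, and pickle files from untrusted sources should not be loaded at all.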