
### Data Class

In [None]:
import random

class Sentiment:
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        # Scores of 2 or less are negative, 3 is neutral, anything higher is positive.
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:
            return Sentiment.POSITIVE

class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def evenly_distribute(self):
        # Balance the classes: drop NEUTRAL reviews and keep only as many
        # POSITIVE reviews as there are NEGATIVE ones, then shuffle.
        negative = [x for x in self.reviews if x.sentiment == Sentiment.NEGATIVE]
        positive = [x for x in self.reviews if x.sentiment == Sentiment.POSITIVE]
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)
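
The balancing step in `evenly_distribute` can be illustrated on a handful of made-up reviews. This is only a sketch: the `samples` list and the `to_sentiment` helper are hypothetical stand-ins for the `Review` class, using the same score thresholds.

```python
import random

# Hypothetical (text, score) pairs standing in for real reviews.
samples = [("loved it", 5), ("great read", 4), ("fine", 3),
           ("awful", 1), ("wonderful", 5), ("boring", 2)]

def to_sentiment(score):
    # Same thresholds as Review.get_sentiment.
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"

labelled = [(text, to_sentiment(score)) for text, score in samples]

# Drop NEUTRAL and truncate POSITIVE to the number of NEGATIVE reviews.
negative = [r for r in labelled if r[1] == "NEGATIVE"]
positive = [r for r in labelled if r[1] == "POSITIVE"]
balanced = negative + positive[:len(negative)]
random.shuffle(balanced)

labels = [label for _, label in balanced]
print(labels.count("NEGATIVE"), labels.count("POSITIVE"))  # 2 2
```

Truncating the positive list only balances the data when positives outnumber negatives, which is the case for this review dataset.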
        

### Load Data


In [None]:
import json

file_name = 'd:/Users/Asus/Desktop/DataScience/Books_small.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))
        
reviews[5].score
reviews[5].text
reviews[5].sentiment

'POSITIVE'

In [None]:
import json

file_name = 'd:/Users/Asus/Desktop/DataScience/Books_small_10000.json'

reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'], review['overall']))

reviews[9999].text


"Highly recommend this entire trilogy. It is very well written and held me in suspense and kept me reading.  Even with the same old young girl heroine who goes head strong and hell bent on saving the new world, id tecommend this book to dystopian fiction fans!  Not overdone, thankfully! A fesw situations made it feel like I've read this same plot before....but these were well thought out and much better written!  This authr has a gift a d I will be looking forward to reading more of ber work."

In [None]:

# Split the reviews into a training set and a test set

from sklearn.model_selection import train_test_split

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

train_container = ReviewContainer(training)
test_container = ReviewContainer(test)
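
As a rough illustration of what `test_size=0.33` does, the sketch below splits a hypothetical list of 100 integers (a stand-in for the reviews) and checks the resulting sizes. Fixing `random_state` makes the shuffle reproducible.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the list of Review objects.
items = list(range(100))

train_part, test_part = train_test_split(items, test_size=0.33, random_state=42)

# About a third of the items end up in the test set; the rest are used for training.
print(len(train_part), len(test_part))

# The two parts never overlap and together cover every item.
print(sorted(train_part + test_part) == items)
```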




### Prep Data

In [None]:
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment() 

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


### Bag of Words Vectorization


In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


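# Example documents a bag-of-words model turns into per-word features:
#   "this book is great"
#   "this book was so bad"

# vectorizer = CountVectorizer()  # raw word counts instead of TF-IDF weights

vectorizer = TfidfVectorizer()
# fit_transform is equivalent to vectorizer.fit(train_x) followed by
# train_x_vectors = vectorizer.transform(train_x)
train_x_vectors = vectorizer.fit_transform(train_x)

# Transform only: the test set reuses the vocabulary learned from the training set.
test_x_vectors = vectorizer.transform(test_x)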

print(train_x[0])
print(train_x_vectors[0].toarray())

train_x_vectors
train_y

I got half way through and had to quit.  There was nothing I liked about the book.  Oh, I guess the title isn't bad.
[[ 0.  0.  0. ...,  0.  0.  0.]]


['NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'NEGATIVE',
 'NEGATIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'NEGATIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 'POSITIVE',
 ...]
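
To make the bag-of-words idea concrete, here is a minimal sketch that runs `CountVectorizer` on the two example sentences from the comments in the cell above. The names `docs`, `vectorizer_demo`, and `vocab` are introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this book is great", "this book was so bad"]

vectorizer_demo = CountVectorizer()
counts = vectorizer_demo.fit_transform(docs)

# vocabulary_ maps each word to its column index; list the words in column order.
vocab = sorted(vectorizer_demo.vocabulary_, key=vectorizer_demo.vocabulary_.get)
print(vocab)
# ['bad', 'book', 'great', 'is', 'so', 'this', 'was']

# One row per document, one column per vocabulary word.
print(counts.toarray())
# [[0 1 1 1 0 1 0]
#  [1 1 0 0 1 1 1]]
```

`TfidfVectorizer` produces a matrix of the same shape but replaces the raw counts with weights that downplay words appearing in many documents.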

## Classification

### Linear SVM

In [None]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)
test_x[0]
clf_svm.predict(test_x_vectors[0])

array(['NEGATIVE'],
      dtype='<U8')

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)
clf_dec.predict(test_x_vectors[0])

array(['NEGATIVE'],
      dtype='<U8')

### Naive Bayes

In [None]:
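from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# GaussianNB needs dense input, so the sparse TF-IDF matrix is densified first.
# The outputs recorded in this notebook came from a run that used DecisionTreeClassifier here.
to_dense = FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)
clf_gnb = make_pipeline(to_dense, GaussianNB())
clf_gnb.fit(train_x_vectors, train_y)
clf_gnb.predict(test_x_vectors[0])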


array(['NEGATIVE'],
      dtype='<U8')

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression


clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y)
clf_log.predict(test_x_vectors[0])

array(['NEGATIVE'],
      dtype='<U8')

## Evaluation

### Mean Accuracy

In [None]:
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.807692307692
0.644230769231
0.634615384615
0.802884615385
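
Mean accuracy is the fraction of test reviews labelled correctly. As a rough check, and assuming the scores were computed on the balanced test set of 208 positive and 208 negative reviews counted later in this notebook, they can be converted back into numbers of correct predictions:

```python
# Assumption: the scores were computed on the balanced test set (208 + 208 reviews).
n_test = 208 + 208

scores = {
    "clf_svm": 0.807692307692,
    "clf_dec": 0.644230769231,
    "clf_gnb": 0.634615384615,
    "clf_log": 0.802884615385,
}

# accuracy = correct / n_test, so correct = accuracy * n_test (rounded to an integer).
correct = {name: round(acc * n_test) for name, acc in scores.items()}

for name, n_correct in correct.items():
    print(f"{name}: {n_correct}/{n_test} = {n_correct / n_test:.6f}")
```

Each score maps cleanly onto a whole number of reviews out of 416, which is consistent with that assumption.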


### F1 Score

In [None]:
from sklearn.metrics import f1_score
f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

# NEUTRAL is left out of labels because evenly_distribute() drops neutral reviews.
# f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])
# f1_score(test_y, clf_gnb.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])
# f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([ 0.80582524,  0.80952381])
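
With `average=None`, `f1_score` returns one value per label, each the harmonic mean of that label's precision and recall. The sketch below computes it by hand on a small hypothetical set of labels; `f1_for_label`, `y_true`, and `y_pred` are illustrative names, not part of scikit-learn.

```python
def f1_for_label(y_true, y_pred, label):
    # Count true positives, false positives and false negatives for one label.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: one POSITIVE review is misclassified as NEGATIVE.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

f1_pos = f1_for_label(y_true, y_pred, "POSITIVE")  # precision 1, recall 2/3
f1_neg = f1_for_label(y_true, y_pred, "NEGATIVE")  # precision 2/3, recall 1
print(round(f1_pos, 3), round(f1_neg, 3))  # 0.8 0.8
```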

In [None]:
# This output appears to predate balancing the test set with evenly_distribute().
test_y.count(Sentiment.POSITIVE)

2767

In [None]:
test_y.count(Sentiment.NEGATIVE)

208

In [None]:
# After test_container.evenly_distribute(), both classes have the same count.
test_y.count(Sentiment.POSITIVE)

208

In [None]:
test_y.count(Sentiment.NEGATIVE)

208

In [None]:
test_set = ['I throughly enjoyed this, 5 stars','bad book do not buy','horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'],
      dtype='<U8')

In [None]:
test_set = ['not great','bad book do not buy','horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['NEGATIVE', 'NEGATIVE', 'NEGATIVE'],
      dtype='<U8')

In [None]:
test_set = ['a time passer','bad book do not buy','horrible waste of time']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'],
      dtype='<U8')

### Saving Model

#### Save Model 

In [None]:
import pickle 

with open('d:/Users/Asus/Desktop/DataScience/models/sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf_svm, f)


#### Load Model 

In [None]:
with open('d:/Users/Asus/Desktop/DataScience/models/sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)

In [None]:
print(test_x[0])
loaded_clf.predict(test_x_vectors[0])

First of all, this book needs some serious editing.  This book probably should not have been sold in it's  current version.  The story itself is a little unbelievable.  Billionaire brothers who beat up their fathers don't exist in most people's lives.  The storyline or what I could make out to be the storyline is really weak.  Save your money.


array(['NEGATIVE'],
      dtype='<U8')
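
The pickled classifier only accepts vectors produced by the same fitted vectorizer, so a fresh session would also need that vectorizer. One possible approach, sketched here on a tiny hypothetical dataset, is to bundle both steps into a single scikit-learn pipeline and pickle that. The round trip goes through `pickle.dumps` in memory, so no file path is assumed.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Tiny hypothetical training set, for illustration only.
texts = ["great book loved it", "wonderful story great characters",
         "terrible book waste of time", "awful boring waste"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(texts, labels)

# Serialize and restore the whole pipeline in memory.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored pipeline accepts raw strings directly.
print(restored.predict(["great story", "boring waste of time"]))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.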