
In [1]:
import random

class Sentiment:
  NEGATIVE = "NEGATIVE"
  NEUTRAL = "NEUTRAL"
  POSITIVE = "POSITIVE"

class Review:
  def __init__(self, text, score):
    self.text = text
    self.score = score
    self.sentiment = self.get_sentiment()
  
  def get_sentiment(self):
    if self.score <= 2:
      return Sentiment.NEGATIVE
    elif self.score == 3:
      return Sentiment.NEUTRAL
    else:  # score is 4 or 5
      return Sentiment.POSITIVE
  
class ReviewContainer:
  def __init__(self, reviews):
    self.reviews = reviews

  def get_text(self):
    return [x.text for x in self.reviews]

  def get_sentiment(self):
    return [x.sentiment for x in self.reviews]

  def evenly_distribute(self):
    negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
    positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
    neutral = list(filter(lambda x: x.sentiment == Sentiment.NEUTRAL, self.reviews))

    # Trim every class to the size of the smallest one, so the result stays
    # balanced even when NEGATIVE is not the rarest class.
    n = min(len(negative), len(positive), len(neutral))

    self.reviews = negative[:n] + positive[:n] + neutral[:n]
    random.shuffle(self.reviews)

    # print(negative[0].text)
    # print(len(negative))
    # print(len(positive))
    # print(len(neutral))
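The balancing step above keeps the same number of reviews for each sentiment class. The following is a minimal standalone sketch of that idea on toy data, not the notebook's own code: `balance_by_label`, `label_of`, and the toy `(text, score)` pairs are hypothetical names introduced only for illustration.

```python
import random
from collections import defaultdict

def balance_by_label(items, get_label, seed=0):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        groups[get_label(item)].append(item)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group[:smallest])
    # A seeded shuffle keeps the sketch reproducible.
    random.Random(seed).shuffle(balanced)
    return balanced

def label_of(pair):
    """Map a (text, score) pair to a sentiment label using the same thresholds as Review."""
    score = pair[1]
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"

# Toy data: 3 positive, 1 neutral, 2 negative reviews.
toy = [("great", 5), ("loved it", 4), ("fine", 5), ("ok", 3), ("bad", 1), ("awful", 2)]
balanced = balance_by_label(toy, label_of)
counts = {}
for pair in balanced:
    counts[label_of(pair)] = counts.get(label_of(pair), 0) + 1
print(counts)
```

Downsampling discards data from the larger classes, which is a reasonable trade-off here because an unbalanced training set would bias the classifiers toward the majority sentiment.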



Inspecting the raw data

In [2]:
import json
file_name = './sample_data/Books_small_10000.json'

reviews = []
# Print only the first record to see which fields each review carries
with open(file_name) as f:
  for line in f:
    print(line)
    break

{"reviewerID": "A1F2H80A1ZNN1N", "asin": "B00GDM3NQC", "reviewerName": "Connie Correll", "helpful": [0, 0], "reviewText": "I bought both boxed sets, books 1-5.  Really a great series!  Start book 1 three weeks ago and just finished book 5.  Sloane Monroe is a great character and being able to follow her through both private life and her PI life gets a reader very involved!  Although clues may be right in front of the reader, there are twists and turns that keep one guessing until the last page!  These are books you won't be disappointed with.", "overall": 5.0, "summary": "Can't stop reading!", "unixReviewTime": 1390435200, "reviewTime": "01 23, 2014"}



Loading data

In [3]:
# Each line is one JSON object; keep only the review text and its star rating
with open(file_name) as f:
  for line in f:
    review = json.loads(line)
    reviews.append(Review(review['reviewText'], review['overall']))

Prep Data

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.25, random_state=27)

train_container = ReviewContainer(train)
test_container = ReviewContainer(test)

In [5]:
train_container.evenly_distribute()
test_container.evenly_distribute()

train_x = train_container.get_text()
train_y = train_container.get_sentiment()

test_x = test_container.get_text()
test_y = test_container.get_sentiment()

Bag-of-words vectorization (TF-IDF)

In [6]:
# Use TF-IDF weights rather than raw word counts
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)

In [7]:
print(train_x[0])
print(train_x_vectors[0].toarray())
# print(train_x_vectors[0])

The CD works as a decent companion to the book, although I might have found an MP3-download format easier to manage in some cases. The CD is useful for those times when the sound of another's voice can help you to tune out the &#34;noise&#34; inside your own thoughts and focus on the meditation.
[[0. 0. 0. ... 0. 0. 0.]]
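The dense row printed above is almost entirely zeros because each review uses only a tiny fraction of the vocabulary. As a rough illustration of what `TfidfVectorizer` produces, here is a small sketch on a made-up three-document corpus (the documents and the `*_demo` names are assumptions, not part of the dataset). With the default settings, each row is scaled to unit length and words that occur in fewer documents receive a larger weight.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus, used only to make the weighting visible.
docs = [
    "great book great story",
    "boring book",
    "great characters",
]

vectorizer_demo = TfidfVectorizer()
X_demo = vectorizer_demo.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each term to its column index.
col = vectorizer_demo.vocabulary_
terms = sorted(col, key=col.get)
dense = X_demo.toarray()

print(terms)
print(dense.round(2))

# Default settings L2-normalize every row.
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms)
```

In the second document, "boring" appears in one document while "book" appears in two, so "boring" ends up with the higher weight.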


## Classification

Linear Support Vector Machines

In [8]:
from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

SVC(kernel='linear')

In [9]:
test_x[0]

"I laughed.  I cried.  I loved this book.  A sensitive portrayal of a totally dependent life without being overly sentimental or cloying.  Who set who's life free...........a beautiful story."

In [10]:
clf_svm.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

Decision tree

In [11]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

DecisionTreeClassifier()

In [12]:
clf_dec.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

Naive Bayes

In [13]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)

GaussianNB()

In [14]:
clf_gnb.predict(test_x_vectors[0].toarray())

array(['POSITIVE'], dtype='<U8')

Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression(solver='newton-cg')
clf_log.fit(train_x_vectors, train_y)

LogisticRegression(solver='newton-cg')

In [16]:
clf_log.predict(test_x_vectors[0])

array(['POSITIVE'], dtype='<U8')

### EVALUATION  

Mean accuracy (the value returned by each classifier's `score` method)

In [17]:
# for Linear Support Vector Machines
clf_svm.score(test_x_vectors, test_y)

0.6431623931623932

In [18]:
# for Decision tree classifier
clf_dec.score(test_x_vectors, test_y)

0.4423076923076923

In [19]:
# for Naive Bayes - GaussianNB
clf_gnb.score(test_x_vectors.toarray(), test_y)

0.4465811965811966

In [20]:
# for logistic regression - newton-cg solver
clf_log.score(test_x_vectors, test_y)

0.6773504273504274

## F1 Score

In [21]:
from sklearn.metrics import f1_score


labels = [Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]

# Per-class F1 in the order POSITIVE, NEUTRAL, NEGATIVE
print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=labels))  # linear SVM
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=labels))  # decision tree
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=labels))  # Gaussian NB
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=labels))  # logistic regression

[0.73684211 0.54368932 0.64473684]
[0.49681529 0.38795987 0.43962848]
[0.46357616 0.4137931  0.46853147]
[0.75       0.59210526 0.68243243]
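With `average=None`, `f1_score` returns one value per class, in the order given by `labels`. To make those numbers less opaque, the sketch below recomputes per-class F1 by hand as the harmonic mean of precision and recall, on a small invented set of labels (`y_true`, `y_pred`, and `f1_for` are illustrative assumptions, not the notebook's data).

```python
from sklearn.metrics import f1_score

# Invented labels, just to show the arithmetic.
y_true = ["POSITIVE", "POSITIVE", "NEUTRAL", "NEGATIVE", "NEGATIVE", "NEUTRAL"]
y_pred = ["POSITIVE", "NEUTRAL", "NEUTRAL", "NEGATIVE", "POSITIVE", "NEUTRAL"]
labels = ["POSITIVE", "NEUTRAL", "NEGATIVE"]

def f1_for(label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

manual = [f1_for(label) for label in labels]
per_class = f1_score(y_true, y_pred, average=None, labels=labels)

print(manual)
print(per_class)
```

Both print statements should show the same three values, which is a quick way to confirm how the per-class scores in the cell above are defined.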


## Grid Search

In [24]:
from sklearn.model_selection import GridSearchCV

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
svc = svm.SVC()

clf = GridSearchCV(svc, parameters, cv=5)
clf.fit(train_x_vectors, train_y)

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')})

In [25]:
clf.score(test_x_vectors, test_y)

0.6495726495726496
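The score above comes from the best parameter combination, which `GridSearchCV` refits on the full training set by default. The fitted search object also records which combination won and how each one did in cross-validation. The sketch below shows those attributes on a small synthetic dataset from `make_classification` (the dataset, the reduced grid, and the `*_toy` names are assumptions chosen to keep the example fast).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF vectors and sentiment labels.
X_toy, y_toy = make_classification(n_samples=60, n_features=10, random_state=0)

param_grid = {"kernel": ("linear", "rbf"), "C": (1, 8)}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_toy, y_toy)

# Winning combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))

# Mean cross-validated accuracy for every combination that was tried.
results = search.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(mean_score, 3))
```

Running the same inspection on `clf` from the cell above would show which `kernel` and `C` produced the reported test accuracy.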

## Saving the model with pickle

In [28]:
import pickle

with open('./sample_data/sentiment_classifier.pkl', 'wb') as fi:
  pickle.dump(clf, fi)

## Loading the model

In [29]:
with open('./sample_data/sentiment_classifier.pkl', 'rb') as fi:
  loaded_clf = pickle.load(fi)

In [30]:
print(test_x[2])

Good information but definitely not complete. Web contacts were terrible. Could use some illustrations. It's a thin little book but good to put in your pocket. Good for a library but not the only book you'll need.


In [31]:
loaded_clf.predict(test_x_vectors[2])

array(['NEUTRAL'], dtype='<U8')

In [32]:
print(test_x[8])
print(loaded_clf.predict(test_x_vectors[8]))

I love Scott Turow, especially Presumed Innocent and Innocent, as well as The Burden of Proof, but this one was really awful. Not only does it have fathers committing incest and then killing daughters, it has wives killing husbands. Basically I think he lost control of the story here. Too bad because it had a terrific premise.
['NEGATIVE']


In [33]:
print(test_x[35])
print(loaded_clf.predict(test_x_vectors[35]))

When he first started off, there was a receptive audience of older folks willing and eager to pay O'Rourke good money to confirm...and enhance...their worst impressions of a generation they resented. Now in his 60s, his audience is primarily those members of the Boomer generation who, like O'Rourke, react with glee that their generation didn't live up to its ideals because it justifies all the time they spent watching from the sidelines as their bolder brethren tried to change a society sorely in need of change. They are the self-loathing Boomers. (More review of O'Rourke at The Nobby Works, Curse of the Boomers, Part 2.)[...]
['NEGATIVE']


In [34]:
print(test_x[27])
print(loaded_clf.predict(test_x_vectors[27]))

Lacy and Jason face a mystery that has Lacy in the hospital and Jason feeling out of it.  As the wedding approaches the intrigue builds.  You won't want to miss this or the five other books that proceed this sixth book in the series.  I have them all.  Great reads each one.
['POSITIVE']
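The pickle file written above holds only the fitted grid-search classifier, so the predictions in these cells still depend on `test_x_vectors` from the in-memory `TfidfVectorizer`. One possible way to make the saved model usable on raw text is to bundle the vectorizer and classifier in a scikit-learn `Pipeline` and pickle that instead. The sketch below illustrates this under stated assumptions: the toy texts, labels, and step names are invented, and it serializes to bytes in memory rather than to a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Invented training data, only to make the sketch runnable.
toy_texts = ["loved this book", "a great story", "terrible plot", "awful and boring"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline fits the vectorizer and the classifier together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear")),
])
pipeline.fit(toy_texts, toy_labels)

# Round-trip through pickle; a file opened in 'wb'/'rb' mode works the same way.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored object accepts raw strings, with no separate vectorizer needed.
new_texts = ["a great book", "boring plot"]
print(restored.predict(new_texts))
```

As with any pickle, the file should be loaded only from a trusted source and with a scikit-learn version compatible with the one that wrote it.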
