# Amazon Reviews Classifier

Let's define a Review class that turns a review's star score into a sentiment label!

In [29]:
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    POSITIVE = 'POSITIVE'
    NEUTRAL = 'NEUTRAL'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score    
        self.sentiment = self.get_sentiment()
        
    def get_sentiment(self):
        # Scores of 2 or less are negative, 4 or more are positive, the rest are neutral
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score >= 4:
            return Sentiment.POSITIVE
        else:
            return Sentiment.NEUTRAL
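
As a quick check of the thresholds, here is a minimal standalone sketch. The function name `score_to_sentiment` is made up for this illustration; it only mirrors the rule in `Review.get_sentiment` and is not part of the notebook's classes.

```python
def score_to_sentiment(score):
    """Same thresholds as Review.get_sentiment, written as a plain function."""
    if score <= 2:
        return 'NEGATIVE'
    elif score >= 4:
        return 'POSITIVE'
    return 'NEUTRAL'

# Walk through the five possible star ratings
for score in [1.0, 2.0, 3.0, 4.0, 5.0]:
    print(score, score_to_sentiment(score))
```

Only a score strictly between 2 and 4 ends up NEUTRAL, so with whole-star ratings that means exactly 3 stars.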

#### Load Data
Let's load the JSON data. The file stores one JSON object per line.

In [3]:
import json

file_name = './dataset/Books_small.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))

print(reviews[5].text)
print(reviews[5].score)
print(reviews[5].sentiment)

Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!
4.0
POSITIVE
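
The loading loop relies on the one-object-per-line layout. Here is a small sketch of the same idea on in-memory data, so it runs without the dataset file. The two sample records and the helper name `load_reviews` are invented for illustration; only the `reviewText` and `overall` keys come from the cell above.

```python
import io
import json

# Two made-up records in the same one-JSON-object-per-line layout
sample = io.StringIO(
    '{"reviewText": "Great read.", "overall": 5.0}\n'
    '\n'
    '{"reviewText": "Not for me.", "overall": 2.0}\n'
)

def load_reviews(lines):
    """Parse JSON Lines input into (text, score) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # json.loads would raise on an empty line
        record = json.loads(line)
        pairs.append((record['reviewText'], record['overall']))
    return pairs

pairs = load_reviews(sample)
print(pairs)
```

Skipping blank lines is an extra safety step that the original loop does not have; it only matters if the file contains empty lines.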


#### Prep Data
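Machine learning models work on numerical data rather than raw text, so we first split the reviews into training and test sets and later convert the text into vectors.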

In [4]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

In [5]:
print(train[0].text, train[0].score, train[0].sentiment)

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down. 5.0 POSITIVE
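
One possible variation, not used in this notebook: `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses made-up labels with an 80/10/10 mix instead of the real reviews.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with an 80/10/10 mix, standing in for review sentiments
labels = ['POSITIVE'] * 80 + ['NEUTRAL'] * 10 + ['NEGATIVE'] * 10
texts = ['review %d' % i for i in range(len(labels))]

# stratify=labels asks for the same class mix in train and test
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

print(Counter(toy_train_y))
print(Counter(toy_test_y))
```

Without `stratify`, a small minority class can end up over- or under-represented in the test split purely by chance.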


In [6]:
train_x = [x.text for x in train]
train_y = [x.sentiment for x in train]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(train_x[0])
print(train_y[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
POSITIVE


### Bag of Words

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

# fit_transform below is equivalent to calling these two in sequence:
# vectorizer.fit(train_x)
# vectorizer.transform(train_x)

train_x_vectors = vectorizer.fit_transform(train_x)

# Don't refit the vectorizer on test_x; reuse the vocabulary learned from train_x
test_x_vectors = vectorizer.transform(test_x)

train_x_vectors[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])
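
To see what those vectors contain, here is a rough pure-Python sketch of the bag-of-words idea. It is only an approximation of what `CountVectorizer` does with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), not its actual implementation. The two sentences are made up.

```python
import re
from collections import Counter

docs = ['I love the book', 'I did not love the ending of the book']

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Every distinct token gets a fixed column, in sorted order
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_counts(text):
    # Count each vocabulary word; words outside the vocabulary are ignored
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]

print(vocabulary)
for doc in docs:
    print(to_counts(doc))
```

Each row has one column per vocabulary word, which is why the real vectors are mostly zeros: any single review uses only a tiny part of the vocabulary. Words that were never seen while fitting simply have no column, which is why the vectorizer must not be refit on the test data.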

### Linear SVM

In [8]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

n=2
print(test_x[n])
print(test_y[n])

clf_svm.predict(test_x_vectors[n])

Michael Cunningham mesmerizes with the thoughtful, elegant prose that is this book.  The reader becomes so close to its characters...the reader feels what these brothers feel.  Beautiful and tragic...a book that will stay with me for a long, long time.  Thank you again, Mr. Cunningham.  The Hours remains at the top of my list and The Snow Queen is another gift to your readers.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Decision Tree

In [9]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)


n=42
print(test_x[n])
print(test_y[n])

clf_dec.predict(test_x_vectors[n])

The review of innovation techniques and examples of their application to healthcare problems is absolutely amazing!  The Lean, on the other hand, is sketchy and not convincing.  Still, an informative and a well-written book.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Naive Bayes

In [10]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
# GaussianNB needs a dense array, not a sparse matrix
clf_gnb.fit(train_x_vectors.toarray(), train_y) 


n=5
print(test_x[n])
print(test_y[n])

clf_gnb.predict(test_x_vectors[n].toarray())

An intriguing book, but I am ashamed to admit I sometimes did not fully comprehend what was said, particularly when Chesterton referred to other people (writers) that I am not knowledgeable of.  At any rate a good read.
POSITIVE


array(['POSITIVE'], dtype='<U8')
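
A possible alternative, not what this notebook uses: scikit-learn's `MultinomialNB` is designed for count features and accepts the sparse matrix directly, so the `.toarray()` conversion is not needed. The sentences below are made up just to show the call pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training sentences, only to show the call pattern
toy_texts = ['great book loved it', 'wonderful story great characters',
             'terrible book waste of time', 'boring story hated it']
toy_labels = ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_texts)  # sparse word counts

clf_mnb = MultinomialNB()
clf_mnb.fit(toy_vectors, toy_labels)  # sparse input is accepted as-is

mnb_prediction = clf_mnb.predict(toy_vectorizer.transform(['great story']))
print(mnb_prediction)
```

Whether it scores better than `GaussianNB` on the review data would have to be measured; this sketch makes no claim about that.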

### Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y) 


n=6
print(test_x[n])
print(test_y[n])

clf_log.predict(test_x_vectors[n])

I loved this book.  It was very interesting to hear the background involved in training a helper dog and Luis's background in the service.  I learned a lot, but was entertained at the same time.  Of course, I have a a Golden Retriever myself, so I may be biased, but mine is as dumb as a brick!  Great book.
POSITIVE


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array(['POSITIVE'], dtype='<U8')
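
The warning above suggests raising `max_iter`. Here is a minimal sketch of that on made-up sentences; the value 1000 is an arbitrary choice for illustration, not a tuned setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up sentences, only to show the call pattern
toy_texts = ['great book loved it', 'wonderful story great characters',
             'terrible book waste of time', 'boring story hated it']
toy_labels = ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']

toy_vectorizer = CountVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_texts)

# max_iter raises the solver's iteration cap (the default is 100)
clf_toy_log = LogisticRegression(max_iter=1000)
clf_toy_log.fit(toy_vectors, toy_labels)

print(clf_toy_log.n_iter_)  # iterations the solver actually used
log_prediction = clf_toy_log.predict(toy_vectorizer.transform(['great story']))
print(log_prediction)
```

If `n_iter_` stays below the cap, the solver stopped because it converged rather than because it ran out of iterations.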

### Evaluation

In [12]:
# Mean accuracies
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8242424242424242
0.7515151515151515
0.8121212121212121
0.8303030303030303


In [13]:
# F1 Scores
from sklearn.metrics import f1_score

labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]
print(labels)

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=labels))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=labels))


['POSITIVE', 'NEUTRAL', 'NEGATIVE']
[0.91319444 0.21052632 0.22222222]
[0.85815603 0.14925373 0.06896552]
[0.89678511 0.08510638 0.09090909]
[0.91370558 0.12244898 0.1       ]
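
Each number above is a per-class F1 score, the harmonic mean of that class's precision and recall. Here is a hand-rolled sketch of the calculation on a made-up example. The helper `f1_for_label` is written for this illustration and is not scikit-learn's implementation.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 8 positives, 2 negatives, and a model that always answers POSITIVE
y_true = ['POSITIVE'] * 8 + ['NEGATIVE'] * 2
y_pred = ['POSITIVE'] * 10

f1_pos = f1_for_label(y_true, y_pred, 'POSITIVE')
f1_neg = f1_for_label(y_true, y_pred, 'NEGATIVE')
print(f1_pos, f1_neg)
```

The always-POSITIVE model is right 80% of the time here and gets a high POSITIVE F1, yet its NEGATIVE F1 is zero. That is the same pattern as in the scores above, only more extreme.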


#### All the models predict POSITIVE well, but their NEUTRAL and NEGATIVE F1 scores are poor!!
Let's find out why.

In [14]:
print(len(train_y))
print(train_y.count(Sentiment.POSITIVE),train_y.count(Sentiment.POSITIVE)/len(train_y))
print(train_y.count(Sentiment.NEUTRAL),train_y.count(Sentiment.NEUTRAL)/len(train_y))
print(train_y.count(Sentiment.NEGATIVE),train_y.count(Sentiment.NEGATIVE)/len(train_y))

670
552 0.8238805970149253
71 0.10597014925373134
47 0.07014925373134329
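
A little arithmetic on the counts just printed. This sketch only reuses the three class counts from the output; nothing else is assumed.

```python
# Training-set class counts taken from the output above
counts = {'POSITIVE': 552, 'NEUTRAL': 71, 'NEGATIVE': 47}
total = sum(counts.values())

# Training accuracy of a "model" that always predicts the biggest class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total
print(majority_label, round(baseline_accuracy, 4))

# Dataset size if every class were cut down to the smallest one
smallest = min(counts.values())
balanced_size = smallest * len(counts)
print(smallest, balanced_size)
```

Always answering POSITIVE would already be right about 82% of the time on the training data, so the accuracies above say little about the minority classes.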


So we see that around 82% of the training data is POSITIVE. Thus, our models will be heavily biased towards POSITIVE.
Now, there are only 47 NEGATIVEs in this data. To balance the dataset, we would either need to cut all three classes down to around 47 data points each (which is too small for a dataset) OR...

#### GET A BIGGER DATASET!!!

In [15]:
file_name = './dataset/Books_small_10000.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))

print(reviews[5].text)
print(reviews[5].score)
print(reviews[5].sentiment)

I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia's trauma to her trying to cope.  I love the way the story displays how there is no "just bouncing back" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character's voice on a strong subject and making it so that other peoples story may be heard through Mia's.
5.0
POSITIVE


In [42]:
# Evenly distribute positives and negatives   
import random
class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews
    
    def get_text(self):
        return [x.text for x in self.reviews]
    
    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]
    
    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        # Class sizes before balancing
        print(len(negative))
        print(len(positive))
        # Keep only as many positives as there are negatives (NEUTRAL reviews are dropped)
        positive_shrunk = positive[:len(negative)]
        self.reviews = negative + positive_shrunk
        random.shuffle(self.reviews)

In [26]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

#### Note to Self:  Do for NEUTRAL as well
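
One way this could look, as a rough sketch rather than the notebook's actual method: group the items by label and downsample every group to the size of the smallest one. The helper `balance_classes`, its `seed` parameter, and the toy data are all made up for illustration. With real `Review` objects the label getter would be `lambda r: r.sentiment`.

```python
import random
from collections import Counter, defaultdict

def balance_classes(items, get_label, seed=42):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item in items:
        by_label[get_label(item)].append(item)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Random sample instead of taking the first `smallest` items
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy (text, sentiment) pairs standing in for Review objects
data = ([('p%d' % i, 'POSITIVE') for i in range(8)]
        + [('u%d' % i, 'NEUTRAL') for i in range(3)]
        + [('n%d' % i, 'NEGATIVE') for i in range(2)])

balanced = balance_classes(data, get_label=lambda pair: pair[1])
print(Counter(label for _, label in balanced))
```

Sampling at random differs from the slice used in `evenly_distribute`, which always keeps the first reviews in the list.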

In [69]:
train_cont = ReviewContainer(train)
test_cont = ReviewContainer(test)

# If the test data is not evenly distributed as well, the NEGATIVE F1 score doesn't improve. WHY??
train_cont.evenly_distribute()
test_cont.evenly_distribute()

print(len(train_cont.reviews))
print(len(test_cont.reviews))

436
5611
208
2767
872
416
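
A possible answer to the WHY, sketched as arithmetic. It assumes the third and fourth numbers printed above (208 and 2767) are the test set's NEGATIVE and POSITIVE counts before balancing, and it imagines a hypothetical classifier that gets 80% of each class right. The 80% figure is invented for illustration, not a measured result.

```python
def f1_negative(n_neg, n_pos, recall_neg, recall_pos):
    """NEGATIVE-class F1 for a classifier with the given per-class recalls."""
    tp = recall_neg * n_neg        # negatives correctly flagged
    fp = (1 - recall_pos) * n_pos  # positives wrongly flagged as negative
    precision = tp / (tp + fp)
    return 2 * precision * recall_neg / (precision + recall_neg)

# Same hypothetical classifier, two different test sets
f1_imbalanced = f1_negative(n_neg=208, n_pos=2767, recall_neg=0.8, recall_pos=0.8)
f1_balanced = f1_negative(n_neg=208, n_pos=208, recall_neg=0.8, recall_pos=0.8)

print(round(f1_imbalanced, 3))
print(round(f1_balanced, 3))
```

With so many more positives in the unbalanced test set, even a small share of misclassified positives outnumbers the correctly found negatives. That drags down NEGATIVE precision, and with it the F1 score, even though the classifier itself has not changed.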


In [70]:
train_x = train_cont.get_text()
train_y = train_cont.get_sentiment()

test_x = test_cont.get_text()
test_y = test_cont.get_sentiment()

print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

436
436


In [71]:
print(len(train_y))
print(train_y.count(Sentiment.POSITIVE),train_y.count(Sentiment.POSITIVE)/len(train_y))
print(train_y.count(Sentiment.NEUTRAL),train_y.count(Sentiment.NEUTRAL)/len(train_y))
print(train_y.count(Sentiment.NEGATIVE),train_y.count(Sentiment.NEGATIVE)/len(train_y))

872
436 0.5
0 0.0
436 0.5


In [72]:
print(len(test_y))
print(test_y.count(Sentiment.POSITIVE),test_y.count(Sentiment.POSITIVE)/len(test_y))
print(test_y.count(Sentiment.NEUTRAL),test_y.count(Sentiment.NEUTRAL)/len(test_y))
print(test_y.count(Sentiment.NEGATIVE),test_y.count(Sentiment.NEGATIVE)/len(test_y))

416
208 0.5
0 0.0
208 0.5


### Bag of Words

In [73]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_x_vectors = vectorizer.fit_transform(train_x)

# Don't refit the vectorizer on test_x; reuse the vocabulary learned from train_x
test_x_vectors = vectorizer.transform(test_x)

train_x_vectors[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

### Linear SVM

In [74]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

n=2
print(test_x[n])
print(test_y[n])

clf_svm.predict(test_x_vectors[n])

It was a good book . The story and the characters came together nicely and the ending was awesome .
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Decision Tree

In [75]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)


n=42
print(test_x[n])
print(test_y[n])

clf_dec.predict(test_x_vectors[n])

I choose this book because I enjoy books about the 2nd world war.  Great reading and another &#34;can't put the book down.&#34; I reccommend it to all that is into this subject.
POSITIVE


array(['NEGATIVE'], dtype='<U8')

### Naive Bayes

In [76]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
# GaussianNB needs a dense array, not a sparse matrix
clf_gnb.fit(train_x_vectors.toarray(), train_y) 


n=5
print(test_x[n])
print(test_y[n])

clf_gnb.predict(test_x_vectors[n].toarray())

Robin writes such wondrous feel-good romances! I love them! When I pick up one of her books, I know that I will be in for a wonderful, enjoyable read.Yours at Midnight was a treat. Lyric and Quinn are wonderful characters. Quinn is our tortured hero who has been in love with Lyric since childhood. But... Lyric was besties with Quinn's brother. The three of them as next-door neighbors made a lot of memories through the years.Quinn's brother has now tragically passed away. Quinn has been away from home having left shortly after his brother's funeral. He hasn't seen or talked to Lyric in all that time, and he has an apology to make to her. Lyric has a few big secrets of her own. LOL! I wanted to shake her more than once!I connected immediately to both Lyric and Quinn. They're both survivors; each was stubborn and strong. The book's supporting characters and the setting as the time line neared New Year's Eve were perfect. I couldn't put the book down as Lyric's and Quinn's re-developing mo

array(['POSITIVE'], dtype='<U8')

### Logistic Regression

In [77]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y) 


n=6
print(test_x[n])
print(test_y[n])

clf_log.predict(test_x_vectors[n])

Excellent story dealing with what might have happened during the occupation of Rome. A must read for any war buff.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Evaluation

In [78]:
# Mean accuracies
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.7980769230769231
0.6418269230769231
0.6346153846153846
0.8149038461538461


In [79]:
# F1 Scores
from sklearn.metrics import f1_score

labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]
print(labels)

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=labels))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=labels))


['POSITIVE', 'NEUTRAL', 'NEGATIVE']
[0.8028169  0.         0.79310345]
[0.63746959 0.         0.64608076]
[0.59574468 0.         0.66666667]
[0.82051282 0.         0.808933  ]


  average, "true nor predicted", 'F-score is', len(true_sum)


In [96]:
test_set = ['Very good book!', 'A must read for any war buff.',
            'It is not bad', 'Horrible Waste of time'] 

test_set = vectorizer.transform(test_set)

print(clf_svm.predict(test_set))
print(clf_dec.predict(test_set))
print(clf_gnb.predict(test_set.toarray()))
print(clf_log.predict(test_set))

['POSITIVE' 'POSITIVE' 'NEGATIVE' 'NEGATIVE']
['POSITIVE' 'POSITIVE' 'NEGATIVE' 'NEGATIVE']
['NEGATIVE' 'NEGATIVE' 'NEGATIVE' 'NEGATIVE']
['POSITIVE' 'POSITIVE' 'NEGATIVE' 'NEGATIVE']
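
Every model labels 'It is not bad' as NEGATIVE. One likely reason is that a bag of words looks at single words, so 'not' and 'bad' are counted separately and the phrase 'not bad' is never seen as a unit. The sketch below shows the difference between single words and word pairs using a made-up helper, `ngrams`. In scikit-learn, the `ngram_range` parameter of `CountVectorizer` (for example `ngram_range=(1, 2)`) adds such pairs as features; whether that fixes this prediction would have to be tested.

```python
def ngrams(text, n):
    """Return all runs of n consecutive lowercase words in the text."""
    tokens = text.lower().split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase = 'It is not bad'
unigrams = ngrams(phrase, 1)
bigrams = ngrams(phrase, 2)

print(unigrams)
print(bigrams)
```

Only the bigram list contains 'not bad' as a single feature that a model could learn to treat differently from 'bad' alone.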
