# Amazon Reviews Classifier

Let's define a Review class, along with a Sentiment class that holds its three possible labels.

In [1]:
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    POSITIVE = 'POSITIVE'
    NEUTRAL = 'NEUTRAL'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score >= 4:
            return Sentiment.POSITIVE
        else:
            return Sentiment.NEUTRAL
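
As a quick sanity check, here is a small sketch of how the score-to-sentiment mapping behaves. The two classes are repeated so the snippet runs on its own; the sample text and scores are made up for illustration and do not come from the dataset.

```python
class Sentiment:
    NEGATIVE = 'NEGATIVE'
    POSITIVE = 'POSITIVE'
    NEUTRAL = 'NEUTRAL'

class Review:
    def __init__(self, text, score):
        self.text = text
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score >= 4:
            return Sentiment.POSITIVE
        else:
            return Sentiment.NEUTRAL

# Made-up reviews, one per star rating from 1 to 5
samples = [Review('sample text', score) for score in [1.0, 2.0, 3.0, 4.0, 5.0]]
sentiments = [r.sentiment for r in samples]
print(sentiments)
# ['NEGATIVE', 'NEGATIVE', 'NEUTRAL', 'POSITIVE', 'POSITIVE']
```

So only a score of exactly 3 ends up NEUTRAL, while 1 and 2 are NEGATIVE and 4 and 5 are POSITIVE.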

#### Load Data
Let's load the JSON data. Each line of the file holds one review.

In [2]:
import json

file_name = './dataset/Books_small.json'
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        reviews.append(Review(review['reviewText'],review['overall']))

print(reviews[5].text)
print(reviews[5].score)
print(reviews[5].sentiment)

Love the book, great story line, keeps you entertained.for a first novel from this author she did a great job,  Would definitely recommend!
4.0
POSITIVE


#### Prep Data
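Machine learning models work with numerical data rather than raw text, so we first split the reviews into training and test sets and separate the text from the labels.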

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(reviews, test_size=0.33, random_state=42)

In [4]:
print(train[0].text, train[0].score, train[0].sentiment)

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down. 5.0 POSITIVE


In [5]:
train_x = [x.text for x in train]
train_y = [x.sentiment for x in train]

test_x = [x.text for x in test]
test_y = [x.sentiment for x in test]

print(train_x[0])
print(train_y[0])

Vivid characters and descriptions. The author has created a tale that grabs your attention and I couldn't put it down.
POSITIVE


### Bag of Words

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

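# fit_transform() below is equivalent to calling fit() and then transform():
# vectorizer.fit(train_x)
# train_x_vectors = vectorizer.transform(train_x)

train_x_vectors = vectorizer.fit_transform(train_x)

# Don't fit the vectorizer again on test_x; reuse the vocabulary learned from train_x
test_x_vectors = vectorizer.transform(test_x)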

train_x_vectors[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

### Linear SVM

In [19]:
from sklearn import svm

clf_svm = svm.SVC(kernel='linear')

clf_svm.fit(train_x_vectors, train_y)

n=2
print(test_x[n])
print(test_y[n])

clf_svm.predict(test_x_vectors[n])

Michael Cunningham mesmerizes with the thoughtful, elegant prose that is this book.  The reader becomes so close to its characters...the reader feels what these brothers feel.  Beautiful and tragic...a book that will stay with me for a long, long time.  Thank you again, Mr. Cunningham.  The Hours remains at the top of my list and The Snow Queen is another gift to your readers.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Decision Tree

In [25]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)


n=42
print(test_x[n])
print(test_y[n])

clf_dec.predict(test_x_vectors[n])

The review of innovation techniques and examples of their application to healthcare problems is absolutely amazing!  The Lean, on the other hand, is sketchy and not convincing.  Still, an informative and a well-written book.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Naive Bayes

In [32]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()
# GaussianNB needs dense input, so convert the sparse matrix with toarray()
clf_gnb.fit(train_x_vectors.toarray(), train_y)


n=5
print(test_x[n])
print(test_y[n])

clf_gnb.predict(test_x_vectors[n].toarray())

An intriguing book, but I am ashamed to admit I sometimes did not fully comprehend what was said, particularly when Chesterton referred to other people (writers) that I am not knowledgeable of.  At any rate a good read.
POSITIVE


array(['POSITIVE'], dtype='<U8')

### Logistic Regression

In [36]:
from sklearn.linear_model import LogisticRegression

clf_log = LogisticRegression()
clf_log.fit(train_x_vectors, train_y) 


n=6
print(test_x[n])
print(test_y[n])

clf_log.predict(test_x_vectors[n])

I loved this book.  It was very interesting to hear the background involved in training a helper dog and Luis's background in the service.  I learned a lot, but was entertained at the same time.  Of course, I have a a Golden Retriever myself, so I may be biased, but mine is as dumb as a brick!  Great book.
POSITIVE


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


array(['POSITIVE'], dtype='<U8')
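
The warning above suggests raising `max_iter` (or scaling the data). A minimal sketch of the first option is shown below on made-up toy data. The value 1000 is an assumption chosen for illustration, not a tuned setting, and the `toy_*` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy data, only here so the snippet runs on its own
toy_x = ['loved it', 'great read', 'terrible plot', 'boring and bad']
toy_y = ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEGATIVE']

toy_vectors = CountVectorizer().fit_transform(toy_x)

# Allow the solver more iterations than the default before it gives up
clf_toy = LogisticRegression(max_iter=1000)
clf_toy.fit(toy_vectors, toy_y)

# n_iter_ reports how many iterations the solver actually used
print(clf_toy.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver stopped because it converged and not because it ran out of iterations.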

### Evaluation

In [40]:
# Mean accuracies
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors.toarray(), test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8242424242424242
0.7757575757575758
0.8121212121212121
0.8303030303030303


In [52]:
# F1 Scores
from sklearn.metrics import f1_score

labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE]
print(labels)

print(f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=labels))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()), average=None, labels=labels))
print(f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=labels))


['POSITIVE', 'NEUTRAL', 'NEGATIVE']
[0.91319444 0.21052632 0.22222222]
[0.87915937 0.12698413 0.07692308]
[0.89678511 0.08510638 0.09090909]
[0.91370558 0.12244898 0.1       ]
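
Each row above holds one F1 score per class, in the order given by `labels`. As a rough illustration of how to read these numbers, here is a small sketch on made-up labels and predictions (not the notebook's data), computed next to the plain accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

labels = ['POSITIVE', 'NEUTRAL', 'NEGATIVE']

# Made-up ground truth and predictions: the classifier leans towards POSITIVE
y_true = ['POSITIVE'] * 4 + ['NEUTRAL'] * 2 + ['NEGATIVE'] * 2
y_pred = ['POSITIVE'] * 4 + ['POSITIVE', 'NEUTRAL'] + ['POSITIVE', 'NEGATIVE']

acc = accuracy_score(y_true, y_pred)
per_class_f1 = f1_score(y_true, y_pred, average=None, labels=labels)

print(acc)           # 6 of 8 predictions are right: 0.75
print(per_class_f1)  # roughly [0.8, 0.667, 0.667]
```

With `average=None`, `f1_score` returns one value per class instead of a single averaged number, so weak classes are not hidden behind a strong one.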


#### All the models predict POSITIVE well, but NEUTRAL and NEGATIVE are bad!!
Let's find out why.

In [60]:
print(len(train_y))
print(train_y.count(Sentiment.POSITIVE),train_y.count(Sentiment.POSITIVE)/len(train_y))
print(train_y.count(Sentiment.NEUTRAL),train_y.count(Sentiment.NEUTRAL)/len(train_y))
print(train_y.count(Sentiment.NEGATIVE),train_y.count(Sentiment.NEGATIVE)/len(train_y))

670
552 0.8238805970149253
71 0.10597014925373134
47 0.07014925373134329


So we see that around 82% of the training data is POSITIVE. Thus, our models will be heavily biased towards POSITIVE.
Now, there are only 47 NEGATIVEs in this data. To balance the dataset, we will either need to cut all three classes down to around 47 data points each (which is too small for a dataset) OR...

#### GET A BIGGER DATASET!!!
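
For comparison, the first option (shrinking every class to the size of the smallest one) could look roughly like the sketch below. `downsample` is a hypothetical helper written for this illustration, and the toy data is made up; it only mimics the skew seen in the training set.

```python
import random
from collections import defaultdict

def downsample(texts, labels, seed=42):
    """Hypothetical helper: keep the same number of examples for every class."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    # Every class is cut down to the size of the rarest class
    smallest = min(len(items) for items in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        for text in rng.sample(items, smallest):
            balanced.append((text, label))
    rng.shuffle(balanced)  # avoid having the classes grouped together

    balanced_x = [text for text, _ in balanced]
    balanced_y = [label for _, label in balanced]
    return balanced_x, balanced_y

# Made-up, heavily skewed toy data
toy_x = [f'review {i}' for i in range(10)]
toy_y = ['POSITIVE'] * 7 + ['NEUTRAL'] * 2 + ['NEGATIVE'] * 1

bal_x, bal_y = downsample(toy_x, toy_y)
print(len(bal_x), sorted(bal_y))
# 3 ['NEGATIVE', 'NEUTRAL', 'POSITIVE']
```

The balanced set is only as large as three times the rarest class, which is exactly why a bigger dataset is the better fix here.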