### NLP Project: Reviews Classifier

In this project, I aim to build a model that classifies Amazon reviews by sentiment: Positive, Negative, or Neutral.

The data is provided as JSON, so the json package is needed to read it in.

In [1]:
import json
from reviewer import Review
from data_optimiser import DataOptimizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import svm
from sklearn import linear_model
from sklearn.metrics import f1_score

In [2]:
data = []

with open('./Data/Books_small_10000.json') as f:
    for line in f:
        json_line = json.loads(line)
        data.append(Review(json_line['reviewText'],json_line['overall']))

The full dataset is very large and contains more than 100,000 reviews. For performance, I'm only going to use 9000 of them. Furthermore, I'll distribute the sentiments equally, i.e. each sentiment label will appear a roughly equal number of times in the data.

Split the data: 70% training, 30% testing:
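The `Review` and `DataOptimizer` helpers are not shown in this notebook, so the following is only a minimal sketch of how star ratings could be mapped to sentiment labels and then balanced. The rating thresholds, the function names `rating_to_sentiment` and `balance_by_label`, and the downsampling strategy are all assumptions made for illustration, not the actual implementation.

```python
import random
from collections import defaultdict


def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a label (these thresholds are an assumption)."""
    if rating <= 2:
        return "Negative"
    if rating == 3:
        return "Neutral"
    return "Positive"


def balance_by_label(items, label_of, seed=10):
    """Downsample every class to the size of the smallest class."""
    groups = defaultdict(list)
    for item in items:
        groups[label_of(item)].append(item)

    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous runs
    return balanced


# Toy (text, rating) pairs standing in for the real reviews.
toy_reviews = [
    ("great", 5.0), ("loved it", 4.0), ("superb", 5.0),
    ("fine", 3.0), ("awful", 1.0), ("bad", 2.0),
]
balanced = balance_by_label(toy_reviews, lambda r: rating_to_sentiment(r[1]))
```

Downsampling to the smallest class discards data, but it keeps a classifier from simply favouring the majority label, which matters here because book reviews tend to skew positive.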

In [3]:
train_data, test_data = train_test_split(data, test_size=0.3, train_size=0.7, random_state=10)
train_reviews, train_sentiments = DataOptimizer(train_data).get_reviews_ratings()
test_reviews, test_sentiments = DataOptimizer(test_data).get_reviews_ratings()

I'll be using a "Bag of Words" method to extract the features from the reviews. 

In [4]:
feature_extractor = CountVectorizer()
# Note: fit_transform learns the vocabulary and returns a SciPy sparse matrix (not a dense NumPy array). Use .toarray() to inspect it if needed.
train_X = feature_extractor.fit_transform(train_reviews)
test_X = feature_extractor.transform(test_reviews)
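To make the bag-of-words idea concrete, here is a rough pure-Python sketch of what the vectorizer computes conceptually: lowercase the text, keep tokens of two or more word characters, build a sorted vocabulary from the training documents only, and count occurrences per document. The helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy sentences are made up for illustration; the real `CountVectorizer` stores the counts in a sparse matrix rather than nested lists.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, so "I" and "a" are dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Count vocabulary tokens per document; unseen tokens are ignored."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


train_docs = ["I loved this book", "this book was bad, really bad"]
vocabulary = fit_vocabulary(train_docs)        # learned from training text only
train_rows = transform(train_docs, vocabulary)
test_rows = transform(["a great book"], vocabulary)

print(vocabulary)
print(train_rows)
print(test_rows)
```

Note that "great" never appears in the training sentences, so it has no column and is silently dropped from the test row. This is why the test reviews are passed through `transform` rather than `fit_transform`: the test features must use the vocabulary learned from the training data.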

Let's fit a few different models and use evaluation metrics to compare how effective each one is. I'll create an SVM model and a logistic regression model.

In [5]:
support_vm = svm.SVC(kernel='linear').fit(train_X, train_sentiments)

In [6]:
# Mean accuracy on the training data (not the held-out test set)
support_vm.score(train_X, train_sentiments)

0.7792114695340502

In [7]:
print(f1_score(test_sentiments, support_vm.predict(test_X), average=None, labels=['Positive', 'Neutral', 'Negative']))

[0.64705882 0.46060606 0.54054054]
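With `average=None`, `f1_score` returns one F1 value per class, in the order given by `labels`, so the three numbers above correspond to Positive, Neutral, and Negative respectively. As a sketch of what each number means, the following hand-rolled version computes the same per-class quantity, F1 = 2·TP / (2·TP + FP + FN), on a small made-up set of labels. The function name `per_class_f1` and the toy predictions are illustrative only.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return one F1 score per label, treating each label as one-vs-rest."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0.0 here.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return scores


# Toy labels, not taken from the review data.
y_true = ["Positive", "Positive", "Neutral", "Negative", "Negative", "Neutral"]
y_pred = ["Positive", "Positive", "Neutral", "Negative", "Positive", "Negative"]

scores = per_class_f1(y_true, y_pred, ["Positive", "Neutral", "Negative"])
print(scores)
```

Reading the SVM result this way, the Neutral class has the lowest score (about 0.46), which suggests that neutral reviews are the hardest for this model to separate from the other two.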
