### NLP Project: Reviews Classifier

In this project, I aim to build a model to classify Amazon reviews into classes. In particular deciding if reviews are Positive or Negative.

Data is provided as JSON, need JSON package to read data in. 

In [1]:
import json
import numpy as np
import pickle
from reviewer import Review
from data_optimiser import DataOptimizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import svm
from sklearn import linear_model
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

The data is very large and contains more than 100,000 reviews, for performance, I'm going to only use 10000 reviews. Furthermore, I'll distribute the sentiments equally, ie each sentiment label will appear roughly an equal number of times in the data.

In [2]:
data = []
num_datapoints = 10000
count = 0
with open('./Data/Video_Games_5.json') as f:
    for line in f:
        if count <= num_datapoints:
            count += 1 
            json_line = json.loads(line)
            data.append(Review(json_line['reviewText'],json_line['overall']))
        else:
            break 

Split Data: 70% Train, 30% testing:

In [3]:
train_data, test_data = train_test_split(data, test_size=0.3, train_size=0.7, random_state=10)
train_reviews, train_sentiments = DataOptimizer(train_data).get_reviews_ratings()
test_reviews, test_sentiments = DataOptimizer(test_data).get_reviews_ratings()
print(len(train_reviews))
print(len(test_reviews))

2562
1124


I'll be using a "Bag of Words" method to extract the features from the reviews. Note that: CountVectorizer/TfidfVectorizer can fit and transform, the ouput is stored as a dense numpy array. Can use .toarray() to visualise if needed.

In [4]:
feature_extractor = TfidfVectorizer()
train_X = feature_extractor.fit_transform(train_reviews)
test_X = feature_extractor.transform(test_reviews)

##### Fitting Model:

Using GridSearch to optimize model hyperparameters.

In [14]:
parameters = {'kernel': ('linear', 'rbf', 'sigmoid'), 'C' : (1, 2, 4, 10)}
support_vector_model = svm.SVC()
tuned_model = GridSearchCV(support_vector_model, parameters, cv=5)
tuned_model.fit(train_X, train_sentiments)
tuned_model.best_params_

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': (1, 2, 4, 10),
                         'kernel': ('linear', 'rbf', 'sigmoid')})

In [17]:
tuned_model.score(test_X, test_sentiments)

0.7428825622775801

In [18]:
print(f1_score(test_sentiments, tuned_model.predict(test_X), average=None, labels=['Positive', 'Negative']))

[0.74447392 0.74127126]


Save Model:

In [21]:
with open('./Models/SVM_Games_Reviews.pkl', 'wb') as f:
    pickle.dump(tuned_model, f)

Load Model:

In [22]:
with open('./Models/SVM_Games_Reviews.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [25]:
example_pos = 'This game was amazing, well worth the money'
example_neg = 'Game was awful, total waste of money'
examples = [example_pos, example_neg]
vec_example = feature_extractor.transform(examples)
loaded_model.predict(vec_example)

array(['Positive', 'Negative'], dtype='<U8')