### **Amazon product review analysis**

To compare three different models, I use the ROC AUC score as the target metric. This metric is usually used when all classes are equally important to the research, which is the case in this work. Moreover, the classes in the sample dataset are balanced, so the metric is appropriate. Three models were built: FastText, TF-IDF + XGBoost, and an RNN.
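
As a sketch of what the metric measures: ROC AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The helper `roc_auc` and the toy data below are illustrative assumptions and not part of the notebook, which uses `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 and 0 are the two classes.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.45, 0.8]  # continuous scores
y_hard = [0, 0, 0, 1]            # the same scores thresholded at 0.5

auc_scores = roc_auc(y_true, y_score)
auc_hard = roc_auc(y_true, y_hard)
```

In this toy case the continuous scores rank every positive above every negative, while thresholding at 0.5 first discards part of that ranking information and gives a lower value.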


First, install and import all the necessary packages and load the data.

In [0]:
pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp36-cp36m-linux_x86_64.whl size=3016314 sha256=c3bf1fb9fbee919d9418377be8eb20a69143e86ee6a0042dc20ddd7aa2a0ecd2
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c154

In [0]:
from zipfile import ZipFile
import pickle
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import fasttext
import string
import re
import bz2
import csv

from keras.preprocessing.text import Tokenizer

from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
from xgboost import XGBClassifier
from nltk.corpus import stopwords
from nltk import word_tokenize
import nltk

Using TensorFlow backend.


In [0]:
# Load the training data
with bz2.BZ2File("train.ft.txt.bz2") as f:
    train = [line.decode('utf-8') for line in f.readlines()]
print(len(train))

3600000


In [0]:
# Load the test data
with bz2.BZ2File("test.ft.txt.bz2") as f:
    test = [line.decode('utf-8') for line in f.readlines()]
print(len(test), 'number of records in the test set')

400000 number of records in the test set


In [0]:
train[1:10]

["__label__2 The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.\n",
 '__label__2 Amazing!: This soundtrack is my favorite music of all time, hands down. The intense sadness of "Prisoners of Fate" (which means all the more if you\'ve played the game) and the hope in "A Distant Promise" and "Girl who Stole the Star" have been an important inspiration to me personally throughout my teen years. The higher energy tracks like "Chrono Cross ~ Time\'s Scar~", "Time of the Dreamwatch", and "Chronomantique" (indefinably remeniscent of Chrono Tri

In [0]:
train = pd.DataFrame(train)
train.to_csv("train.txt", index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")

test = pd.DataFrame(test)
test.to_csv("test.txt", index=False, sep=' ', header=False, quoting=csv.QUOTE_NONE, quotechar="", escapechar=" ")


## **The FastText model**

In [0]:
model = fasttext.train_supervised('train.txt', label_prefix='__label__', thread=4, epoch=50)
print(model.labels, 'are the labels or targets the model is predicting')

['__label__1', '__label__2'] are the labels or targets the model is predicting


In [0]:
# Remove the __label__1 and __label__2 prefixes from the test set to run the predict function.
# `test` is a DataFrame at this point, so iterate over its single column, test[0].
test_n = [w.replace('__label__2 ', '') for w in test[0]]
test_n = [w.replace('__label__1 ', '') for w in test_n]
test_n = [w.replace('\n', '') for w in test_n]

In [0]:
pred = model.predict(test_n)

In [0]:
# Recode the actual targets and the predictions to 0s and 1s
# (`test` is a DataFrame here, so the raw lines are in column test[0])
labels = [0 if x.split(' ')[0] == '__label__1' else 1 for x in test[0]]
pred_labels = [0 if x == ['__label__1'] else 1 for x in pred[0]]

# Compute ROC AUC from the hard label predictions
print(roc_auc_score(labels, pred_labels))

0.8954399999999999




The FastText model reaches a ROC AUC score of 89.5%, which is relatively high. I first tried building this model with only 10 epochs, and it gave an even higher score of around 91.7%, which may mean that the 50-epoch run overfits the training data.
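
The score above is computed from hard 0/1 predictions, whereas ROC AUC is normally computed from continuous scores. A minimal sketch of how the top-1 output could be turned into a positive-class probability follows. It assumes that `model.predict` on a list of texts returns a pair of per-text label lists and probability lists, and that for a two-class softmax model the other class has probability `1 - p`. The helper `positive_scores` and the example values are hypothetical.

```python
def positive_scores(pred_labels, pred_probs, positive="__label__2"):
    """Convert top-1 (label, probability) predictions into P(positive class)."""
    scores = []
    for labels, probs in zip(pred_labels, pred_probs):
        p = float(probs[0])
        # If the top label is the other class, the positive class gets the remainder.
        scores.append(p if labels[0] == positive else 1.0 - p)
    return scores


# Hypothetical values in the assumed shape of model.predict(list_of_texts):
example_labels = [["__label__2"], ["__label__1"], ["__label__1"]]
example_probs = [[0.90], [0.75], [0.60]]

scores = positive_scores(example_labels, example_probs)
```

With the notebook's variables this would correspond to `roc_auc_score(labels, positive_scores(pred[0], pred[1]))`, which has not been run here.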


## **TFIDF + XGBoost model**

In [0]:
# Read the review texts, dropping the 11-character label prefix ("__label__N ")
with open('train.txt', 'r', encoding='utf-8') as f:
    X_train = [line[11:] for line in f]

with open('test.txt', 'r', encoding='utf-8') as f:
    X_test = [line[11:] for line in f]

In [0]:
# Encode the targets: __label__1 -> 0, anything else (__label__2) -> 1
with open('train.txt', 'r', encoding='utf-8') as f:
    y_train = [0 if line[:10] == '__label__1' else 1 for line in f]

with open('test.txt', 'r', encoding='utf-8') as f:
    y_test = [0 if line[:10] == '__label__1' else 1 for line in f]
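
The two cells above rely on fixed slice positions (`[:10]` and `[11:]`). As an alternative sketch, the label and the text can be separated in one pass by splitting on the first space. The helper name `split_label` and its `negative_label` parameter are hypothetical, and the sample line is shortened from the data shown earlier.

```python
def split_label(line, negative_label="__label__1"):
    """Split one fastText-format line into (target, text).

    The target is 0 for `negative_label` and 1 otherwise, which matches
    the encoding used for y_train and y_test.
    """
    label, _, text = line.partition(" ")
    return (0 if label == negative_label else 1), text


sample = "__label__2 Amazing!: This soundtrack is my favorite music of all time\n"
target, text = split_label(sample)
```

Like the slicing version, this keeps the trailing newline in the text.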

In [0]:
stop_words = set(stopwords.words('english'))

In [0]:
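porter_stemmer = nltk.PorterStemmer()  # created once and reused across calls

def Token(str_input):
    # Keep letters, digits and hyphens, lowercase, split on whitespace, then stem
    words = re.sub(r"[^A-Za-z0-9\-]", " ", str_input).lower().split()
    return [porter_stemmer.stem(word) for word in words]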

In [0]:
# Pass the stop words as a list: newer scikit-learn versions reject a set here
vectorizer = TfidfVectorizer(tokenizer=Token, stop_words=list(stop_words))

In [0]:
trainXGB = vectorizer.fit_transform(X_train)
testXGB = vectorizer.transform(X_test)
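
As a rough illustration of what the vectorizer computes, here is a plain-Python sketch of TF-IDF weighting. It assumes scikit-learn's default settings, namely raw term counts, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. The function `tfidf` and the toy corpus are hypothetical and skip the stemming and stop-word removal done above.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so that every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


# Toy corpus (hypothetical), already tokenized.
docs = [["great", "soundtrack"], ["great", "price"], ["bad", "price"]]
weights = tfidf(docs)
```

In the first document, "soundtrack" gets a larger weight than "great" because it occurs in fewer documents.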

In [0]:
XGB = XGBClassifier()

In [0]:
XGB.fit(trainXGB, y_train)

XGB.save_model("XGB.bin")

In [0]:
XGB.load_model("XGB.bin")

In [0]:
# Predict on the vectorized test set with the fitted classifier
XGB_pred = XGB.predict(testXGB)

with open('XGB_pred', 'wb') as f:
    pickle.dump(XGB_pred, f)

In [0]:
with open('XGB_pred', 'rb') as f:
    XGB_pred = pickle.load(f)

In [0]:
XGB_auc = roc_auc_score(y_test, XGB_pred)
print ("SCORE:", XGB_auc)

SCORE: 0.85914


The result of the TF-IDF + XGBoost model is noticeably lower than that of the FastText model (85.9% versus 89.5% with 50 epochs and about 91.7% with 10 epochs). Let's have a look at the RNN model.

## **RNN model**

I decided to build the RNN model with an LSTM layer, two Dense layers and a Dropout layer to reduce overfitting.

In [0]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical

In [0]:
token = Tokenizer(num_words=500)
token.fit_on_texts(X_train)
sequences = token.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=100)
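
To make the padding step concrete, here is a plain-Python sketch of what `pad_sequences` does to a single sequence. It assumes the Keras defaults, where both padding and truncation are applied at the start of the sequence. The helper `pad_pre` is hypothetical and expects a positive `maxlen`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short = pad_pre([5, 12, 7], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)
```

With `maxlen=100`, reviews shorter than 100 tokens get leading zeros and longer ones keep only their last 100 tokens.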

In [0]:
def RNN():
    inputs = Input(name='inputs', shape=[100])
    layer = Embedding(500, 50, input_length=100)(inputs)  # Embedding layer
    layer = LSTM(64)(layer)  # Recurrent layer
    layer = Dense(256, activation='relu')(layer)  # Fully connected layer
    layer = Dropout(0.5)(layer)  # Dropout for regularization
    layer = Dense(1, name='out_layer')(layer)  # Single output unit
    layer = Activation('sigmoid')(layer)  # Output in (0, 1), as binary cross-entropy expects
    model = Model(inputs=inputs, outputs=layer)
    return model

In [7]:
model = RNN()
model.summary()

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 100)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 50)           25000     
_________________________________________________________________
lstm (LSTM)                  (None, 64)                29440     
_________________________________________________________________
dense (Dense)                (None, 256)               16640     
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 257       
_________________________________________________________________
activation (Activation)      (None, 1)                 0     
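
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for these layer types. The helper names are hypothetical, and the sizes are the ones used in this notebook (vocabulary 500, embedding size 50, 64 LSTM units, 256 dense units).

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


counts = {
    "embedding": embedding_params(500, 50),
    "lstm": lstm_params(50, 64),
    "dense": dense_params(64, 256),
    "out_layer": dense_params(256, 1),
}
total = sum(counts.values())
```

The values match the `Param #` column above. The input, dropout and activation layers have no trainable parameters.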

In [0]:
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy'])

In [0]:
# fit the data to RNN with batch size of 256 and 3 epochs
model.fit(sequences_matrix, np.array(y_train), batch_size=256, epochs=3,
          validation_split=0.2)

In [0]:
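test_matrix = sequence.pad_sequences(token.texts_to_sequences(X_test), maxlen=100)  # same preprocessing as for training
roc_auc_score(y_test, model.predict(test_matrix).ravel())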

Unfortunately, I couldn't fit this model even with only 3 epochs: Colab restarted the runtime every time I ran the cells. I assume this model would give better results than the previous two, especially after some adjustments, but I cannot prove it because of the limited computing power.
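
One possible way to reduce the load is to train on a subset that is read lazily from the compressed file instead of loading all 3.6 million lines at once. This assumes memory was the bottleneck, which was not verified. The helper `read_head` is hypothetical and is shown here on a small temporary file. Taking the head of the file also assumes that the records are not ordered by label.

```python
import bz2
import os
import tempfile
from itertools import islice


def read_head(path, n_lines):
    """Read only the first `n_lines` lines of a bz2-compressed text file."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return list(islice(f, n_lines))


# Demo on a small temporary file with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ft.txt.bz2")
    with bz2.open(demo_path, "wt", encoding="utf-8") as f:
        f.writelines(f"__label__{1 + i % 2} review {i}\n" for i in range(10))
    head = read_head(demo_path, 3)
```

With the notebook's data the call would look like `read_head("train.ft.txt.bz2", 200000)`, where the subset size is an arbitrary example value.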

Of the FastText and TF-IDF + XGBoost models, the first shows better results.