# IBM Advanced Data Science Capstone Project
## Sentiment Analysis of Amazon Customer Reviews
### Harsh V Singh, Apr 2021

## Model Definition

In this notebook, we define the machine learning models used to predict the sentiment of an Amazon customer review from its heading and text. The raw data has already been preprocessed into a training set of tokenized and vectorized review-text features, together with a binary sentiment label that is 1 for positive and 0 for negative reviews.

## Importing required Python libraries and initializing Apache Spark environment

In [30]:
import numpy as np
import pandas as pd
import math
import time
from pathlib import Path
from scipy import sparse
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import seaborn as sns

import sklearn
from sklearn.naive_bayes import ComplementNB
from sklearn import metrics

import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM, Masking, Embedding
from keras import regularizers
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import StructType, StructField, FloatType, IntegerType, StringType, ArrayType
from pyspark.sql.functions import udf, rand, col, concat, coalesce
from pyspark.ml.feature import HashingTF, IDF

CPU_CORES = 6
conf = SparkConf().setMaster("local[*]") \
    .setAll([("spark.driver.memory", "24g"),\
             ("spark.executor.memory", "4g"), \
             ("spark.driver.maxResultSize", "24g"), \
             ("spark.executor.cores", CPU_CORES), \
             ("spark.executor.heartbeatInterval", "3600s"), \
             ("spark.network.timeout", "7200s")])
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession.builder.getOrCreate()

RUN_SAMPLE_CODE = True
NUM_FEATURES = 2**15
SEED_NUMBER = 1324

In [31]:
# Function to print time taken by a particular process, given the start and end times
def printElapsedTime(startTime, endTime):
    elapsedTime = endTime - startTime
    print("-- Process time = %.2f seconds --"%(elapsedTime))

## Method 1: Training models using TFIDF vectorized data

First, we will use the TFIDF vectorized data to build a baseline Naive Bayes model and then train a neural network with 2 hidden layers.
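
As a reminder of what the TF-IDF features represent, the following is a minimal NumPy sketch on a hypothetical three-document count matrix. It assumes the smoothed inverse document frequency `log((m + 1) / (df + 1))`, which is the formula used by Spark ML's `IDF`; the preprocessing step may have used different settings, and the names `counts`, `doc_freq` and `tfidf` are illustrative only.

```python
import numpy as np

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 0, 1],
                   [0, 1, 1],
                   [1, 0, 1]], dtype=float)

n_docs = counts.shape[0]
# Number of documents in which each term appears at least once
doc_freq = (counts > 0).sum(axis=0)
# Smoothed inverse document frequency (assumed formula, see the text above)
idf = np.log((n_docs + 1) / (doc_freq + 1))
# TF-IDF weight: raw term frequency scaled by how rare the term is
tfidf = counts * idf

print(np.round(tfidf, 3))
```

Under this formula a term that occurs in every document (the third column) gets a weight of zero, while rarer terms are weighted up.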

### Loading TFIDF train/test data

We will begin by loading the train/test data.


In [32]:
if RUN_SAMPLE_CODE:
    sourceDir = "data/sample/tfData"
    X_train_tf = sparse.load_npz(sourceDir + "/X_train.npz")
    X_test_tf = sparse.load_npz(sourceDir + "/X_test.npz")

    X_train_tf.sort_indices()
    X_test_tf.sort_indices()

    y_train_tf = pd.read_csv(sourceDir + "/y_train.csv")["review_sentiment"].to_numpy()
    y_test_tf = pd.read_csv(sourceDir + "/y_test.csv")["review_sentiment"].to_numpy()

    print("X_train_tf is of type %s and shape %s."%(type(X_train_tf), X_train_tf.shape))
    print("y_train_tf is of type %s, shape %s and %d unique classes."%(type(y_train_tf), y_train_tf.shape, len(np.unique(y_train_tf))))

X_train_tf is of type <class 'scipy.sparse.csr.csr_matrix'> and shape (31701, 71798).
y_train_tf is of type <class 'numpy.ndarray'>, shape (31701,) and 2 unique classes.


### Predictions using a Naive Bayes model for setting a baseline

**ComplementNB** implements the Complement Naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) algorithm that is particularly suited for imbalanced data sets. CNB regularly outperforms MNB on text classification tasks so we will be using this model for our baseline.
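
To make the idea concrete, here is a simplified NumPy sketch of the CNB decision rule on made-up data. For each class, the weights are estimated from the documents that do *not* belong to it, and a document is assigned to the class whose complement fits it worst. The sketch leaves out options that scikit-learn's `ComplementNB` provides (such as weight normalization), and the helper names `fit_complement_nb` and `predict_complement_nb` are hypothetical.

```python
import numpy as np

def fit_complement_nb(X, y, alpha=1.0):
    """Return (classes, weights) with weights[c, i] = -log P(term i | not class c)."""
    classes = np.unique(y)
    # Per-class term totals, shape (n_classes, n_terms)
    class_counts = np.vstack([X[y == c].sum(axis=0) for c in classes])
    # Complement counts: totals over all documents outside class c, plus smoothing
    comp_counts = class_counts.sum(axis=0) - class_counts + alpha
    theta = comp_counts / comp_counts.sum(axis=1, keepdims=True)
    return classes, -np.log(theta)

def predict_complement_nb(X, classes, weights):
    # Pick the class whose complement explains the document worst
    return classes[np.argmax(X @ weights.T, axis=1)]

# Imbalanced toy data: three documents of class 1, one of class 0
X_toy = np.array([[3, 0, 1], [2, 0, 0], [4, 1, 0], [0, 3, 1]], dtype=float)
y_toy = np.array([1, 1, 1, 0])

classes, weights = fit_complement_nb(X_toy, y_toy)
pred = predict_complement_nb(np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]), classes, weights)
print(pred)
```

Because each class's weights are computed from the other classes' documents, the minority class is scored using the larger pool of majority-class data, which is one intuition for why the method copes better with imbalance.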

In [6]:
if RUN_SAMPLE_CODE:
    tfCNBModel = ComplementNB().fit(X_train_tf, y_train_tf)
    print("ComplementNB Accuracy: %.2f%%"%(100 * metrics.accuracy_score(y_test_tf, tfCNBModel.predict(X_test_tf))))

### Predictions using a Keras Neural Network

We will use a **Sequential** model with **two** hidden layers and a **sigmoid** activation on the output layer. Hyperparameters such as the *L2 regularization strength, dropout rate, number of nodes in the hidden layers and the activation functions* can be tuned to find the combination that gives the best accuracy on the test data.
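
To get a feel for the size of such a network, the short sketch below counts the trainable parameters of a stack of fully connected layers. It assumes 71,798 input features (the width of `X_train_tf` reported above) and 32 nodes in each hidden layer; `dense_param_count` is a hypothetical helper, not part of Keras.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Input width, two hidden layers of 32 nodes, one sigmoid output node
layer_sizes = [71798, 32, 32, 1]
total_params = dense_param_count(layer_sizes)
print(total_params)  # 2298657
```

Almost all of the parameters sit in the first layer, which is why regularization and dropout on that layer matter most for a sparse, high-dimensional input like this one.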

In [7]:
# Plot the model accuracy and loss over the training epochs
def plotTrainingPerformance(history, figTitle, figSize=(12,5)):
    fig = plt.figure(figsize=figSize)
    sns.set_theme()
    sns.set_style("white")
    
    xvals = np.arange(len(history.history["accuracy"])) + 1

    fig.add_subplot(1, 2, 1)
    sns.lineplot(x=xvals, y=history.history["accuracy"])
    plt.xticks(xvals)
    plt.ylabel("accuracy")
        
    fig.add_subplot(1, 2, 2)
    sns.lineplot(x=xvals, y=history.history["loss"])
    plt.xticks(xvals)
    plt.ylabel("loss")
    
    fig.suptitle(figTitle)
    plt.show()

In [8]:
# Function to compile, fit and predict keras model
def fitAndPredictModel(modelName, model, X_train, y_train, X_test, y_test, loss, optimizer, metrics, epochs, batch_size):
    # Compile the model
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics)
    # Fit the model on the training data
    history = model.fit(X_train, y_train.reshape((-1,1)), epochs=epochs, batch_size=batch_size)
    # Plot training performance
    plotTrainingPerformance(history=history, figTitle="Training Accuracy/ Loss over Epochs")
    # Predict review sentiments on the test data and check model accuracy
    _, accuracy = model.evaluate(X_test, y_test.reshape((-1,1)))
    print("%s Accuracy: %.2f%%" % (modelName, accuracy*100))
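
For reference, the loss and metric that this helper is later called with (`binary_crossentropy` and `accuracy`) can be written out in a few lines of NumPy. This is only an illustrative sketch on made-up predictions; the function names and the `eps` clipping value are assumptions, not the Keras implementation.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities so that log(0) can never occur
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    # A sigmoid output above the threshold counts as a positive prediction
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.4, 0.7])

bce = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print("loss = %.4f, accuracy = %.2f" % (bce, acc))
```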

In [9]:
if RUN_SAMPLE_CODE:

    # Model definition
    tfModel = Sequential()
    l2Reg = 1e-3
    dropout = 0.5
    tfModel.add(Dense(32, input_shape=(X_train_tf.shape[1],), \
        kernel_regularizer=regularizers.l2(l2Reg), \
        bias_regularizer=regularizers.l2(l2Reg)))
    tfModel.add(Activation('relu'))
    tfModel.add(Dropout(dropout))
    # Only the first layer needs input_shape; later layers infer it from the previous layer
    tfModel.add(Dense(32, \
        kernel_regularizer=regularizers.l2(l2Reg), \
        bias_regularizer=regularizers.l2(l2Reg)))
    tfModel.add(Activation('relu'))
    tfModel.add(Dropout(dropout))
    tfModel.add(Dense(1))
    tfModel.add(Activation('sigmoid'))

    # Compile, fit and predict model
    fitAndPredictModel(
        modelName="Neural Network", model=tfModel, X_train=X_train_tf, y_train=y_train_tf, X_test=X_test_tf, y_test=y_test_tf, 
        loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"], epochs=5, batch_size=128)

## Method 2: Training models using sequential word vectors data

Now, we will use the sequential word vectors data to train a recurrent neural network with **1 LSTM layer**, **1 Dense layer** and a **sigmoid** output layer.

### Loading sample sequential train/test data

We will begin by loading the train/test data.

In [10]:
def getVocabularyCount(X_train):
    return (max([max(x) for x in X_train]) + 1)
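
The `+ 1` is needed because token ids start at 0, so an embedding table must have one more row than the largest id. The sketch below illustrates this on two made-up sequences; the function is repeated so that the snippet runs on its own, and `X_toy` and `embedding_table` are hypothetical names.

```python
import numpy as np

def getVocabularyCount(X_train):
    return (max([max(x) for x in X_train]) + 1)

# Two made-up token-id sequences; the largest id is 9
X_toy = np.array([[4, 1, 7, 0],
                  [2, 9, 0, 0]])
vocab = getVocabularyCount(X_toy)

# A lookup table with `vocab` rows can be indexed by every id from 0 to 9
embedding_table = np.zeros((vocab, 3))
looked_up = embedding_table[X_toy]
print(vocab, looked_up.shape)
```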

In [11]:
if RUN_SAMPLE_CODE:
    sourceDir = "data/sample/seqData"
    X_train_seq = np.load(sourceDir + "/X_train.npy")
    X_test_seq = np.load(sourceDir + "/X_test.npy")
    y_train_seq = pd.read_csv(sourceDir + "/y_train.csv")["review_sentiment"].to_numpy()
    y_test_seq = pd.read_csv(sourceDir + "/y_test.csv")["review_sentiment"].to_numpy()

    vocabCount = getVocabularyCount(X_train_seq)

### Predictions using a Keras LSTM Recurrent Neural Network

We will use a **Sequential** model with **one** .... and a **sigmoid** activation on the output layer. Hyperparameters such as the *L2 regularization strength, dropout rate, number of nodes in the hidden layers and the activation functions* can be tuned to find the combination that gives the best accuracy on the test data.

In [12]:
if RUN_SAMPLE_CODE:
    # Model definition
    seqModel = Sequential()
    l2Reg = 1e-2
    dropout = 0.5
    seqModel.add(Embedding(input_dim=vocabCount, output_dim=64, input_length=X_train_seq.shape[1], mask_zero=True))
    seqModel.add(LSTM(32, return_sequences=False, dropout=dropout, recurrent_dropout=dropout))
    seqModel.add(Dense(32, activation='relu', \
        kernel_regularizer=regularizers.l2(l2Reg), \
        bias_regularizer=regularizers.l2(l2Reg)))
    seqModel.add(Dropout(dropout))
    seqModel.add(Dense(1, activation='sigmoid'))

    # Compile, fit and predict model
    fitAndPredictModel(
        modelName="LSTM RNN", model=seqModel, X_train=X_train_seq, y_train=y_train_seq, X_test=X_test_seq, y_test=y_test_seq, 
        loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"], epochs=5, batch_size=128)
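
Setting `mask_zero=True` on the `Embedding` layer tells Keras to treat token id 0 as padding, so that the LSTM skips those timesteps. The NumPy sketch below shows the kind of boolean mask this corresponds to, assuming the sequences were padded with 0; `X_pad` is a made-up batch.

```python
import numpy as np

# Made-up batch of token-id sequences padded with 0 to a common length of 5
X_pad = np.array([[5, 2, 9, 0, 0],
                  [7, 0, 0, 0, 0],
                  [3, 3, 1, 4, 8]])

# True where a real token is present, False where the position is padding
mask = X_pad != 0
# Number of unmasked timesteps per sequence
seq_lengths = mask.sum(axis=1)
print(mask)
print(seq_lengths)
```

A consequence of reserving 0 for padding is that no real token may use id 0.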

In [13]:
seqTrain = spark.read.parquet("data/seqTrain.parquet")
seqTrain = seqTrain.repartition(CPU_CORES)
print("There are %d samples in the training data."%(seqTrain.count()))
seqTrain.show(5)

featCount = len(seqTrain.select("features").take(1)[0][0])

There are 79292 samples in the training data.
+--------------------+----------------+
|            features|review_sentiment|
+--------------------+----------------+
|[-0.0591365794037...|               1|
|[-0.0607305002509...|               1|
|[0.02287686754120...|               1|
|[-0.0188678904944...|               0|
|[-0.0777446282467...|               1|
+--------------------+----------------+
only showing top 5 rows



In [23]:
seqModel = Sequential()
l2Reg = 1e-2
dropout = 0.5
seqModel.add(Embedding(input_dim=NUM_FEATURES, output_dim=64, input_length=featCount, mask_zero=True))
seqModel.add(LSTM(32, return_sequences=False, dropout=dropout, recurrent_dropout=dropout))
seqModel.add(Dense(32, activation='relu', \
    kernel_regularizer=regularizers.l2(l2Reg), \
    bias_regularizer=regularizers.l2(l2Reg)))
seqModel.add(Dropout(dropout))
seqModel.add(Dense(1, activation='sigmoid'))

In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(seqTrain.select("features").toPandas(), seqTrain.select("review_sentiment").toPandas(), \
    test_size=0.2, random_state=SEED_NUMBER)

In [34]:
fitAndPredictModel(
        modelName="LSTM RNN", model=seqModel, X_train=X_train, y_train=y_train.to_numpy(), X_test=X_test, y_test=y_test.to_numpy(), 
        loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"], epochs=5, batch_size=128)


ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type DenseVector).
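
The error indicates that the `features` column still holds Spark `DenseVector` objects, which TensorFlow cannot turn into a tensor. One possible way around this, sketched below under stated assumptions, is to stack the vectors into a single 2-D float array before calling `fit`. `ToyVector` is a hypothetical stand-in for `DenseVector` (only its `toArray()` method is mimicked) so that the snippet runs without Spark.

```python
import numpy as np

class ToyVector:
    """Hypothetical stand-in for pyspark.ml.linalg.DenseVector."""
    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def toArray(self):
        return self._values

# Stands in for the object-typed column X_train["features"]
features_column = [ToyVector([0.1, 0.2]), ToyVector([0.3, 0.4]), ToyVector([0.5, 0.6])]

# Stack one row per sample into a dense (n_samples, n_features) float array
X_dense = np.stack([v.toArray() for v in features_column]).astype("float32")
print(X_dense.shape, X_dense.dtype)
```

Note that an `Embedding` layer expects integer token ids, so float-valued features like those in the `seqTrain.show(5)` output would likely need a different input layer even after this conversion.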

In [37]:
X_train
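# Convert Spark ML vectors to plain lists of floats (seqTrain is assumed to be the DataFrame meant by the original `df`)
to_array = udf(lambda v: v.toArray().tolist(), ArrayType(FloatType()))
seqTrain = seqTrain.withColumn("features", to_array("features"))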

Unnamed: 0,features
53805,"[0.0011533140786923468, 0.028042381648750354, ..."
73531,"[0.03016689089902987, -0.006372004863806069, -..."
24181,"[-0.07831877383600491, -0.09187610984708254, 0..."
45385,"[0.07983627039939166, 0.05028595492476598, -0...."
24148,"[-0.1311945755218263, -0.031005788164643142, -..."
...,...
20152,"[-0.08307440197103473, -0.06721421543010314, -..."
74073,"[-0.04373169668463313, 0.018820407859642396, 0..."
65386,"[-0.074089135613966, 0.048293973036509535, -0...."
56813,"[0.005523316178689985, 0.004374821971663658, 0..."
