# IBM Advanced Data Science Capstone Project
## Sentiment Analysis of Amazon Customer Reviews
### Harsh V Singh, Apr 2021

## Model Definition

In this notebook, we define the machine learning model that is trained to predict the sentiment of an Amazon customer review from its heading and text. The raw data has already been preprocessed into a training set of tokenized and vectorized features of the review text, together with a binary review sentiment label: 1 for positive reviews and 0 for negative reviews.

## Importing required Python libraries and initializing Apache Spark environment

In [63]:
import numpy as np
import pandas as pd
import math
import time
from pathlib import Path
from scipy import sparse
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

import seaborn as sns
import sklearn

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType, ArrayType
from pyspark.sql.functions import udf, rand, col, concat, coalesce
from pyspark.ml.feature import HashingTF, IDF

CPU_CORES = 6
# Local Spark configuration: generous driver memory and long timeouts for heavy local jobs
conf = SparkConf().setMaster("local[*]").setAll([
    ("spark.driver.memory", "24g"),
    ("spark.executor.memory", "4g"),
    ("spark.driver.maxResultSize", "24g"),
    ("spark.executor.cores", CPU_CORES),
    ("spark.executor.heartbeatInterval", "3600s"),
    ("spark.network.timeout", "7200s"),
])
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession.builder.getOrCreate()

SEED_NUMBER = 1324

In [None]:
#spark.sparkContext.stop()

In [2]:
# Function to print time taken by a particular process, given the start and end times
def printElapsedTime(startTime, endTime):
    elapsedTime = endTime - startTime
    print("Process time = %.2f seconds."%(elapsedTime))

## Loading data

We begin by loading the training and test data.


In [69]:
sourceDir = "data/sample/tfData"
X_train = sparse.load_npz(sourceDir + "/X_train.npz")
X_test = sparse.load_npz(sourceDir + "/X_test.npz")

X_train.sort_indices()
X_test.sort_indices()

y_train = pd.read_csv(sourceDir + "/y_train.csv")["review_sentiment"].to_numpy()
y_test = pd.read_csv(sourceDir + "/y_test.csv")["review_sentiment"].to_numpy()

In [70]:
print("X_train is of type %s and shape %s."%(type(X_train), X_train.shape))
print("y_train is of type %s, shape %s and %d unique classes."%(type(y_train), y_train.shape, len(np.unique(y_train))))

X_train is of type <class 'scipy.sparse.csr.csr_matrix'> and shape (616, 6109).
y_train is of type <class 'numpy.ndarray'>, shape (616,) and 2 unique classes.
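
Because accuracy is the metric used in the rest of this notebook, it can help to check how the two sentiment classes are distributed before interpreting any score. The following is a minimal sketch, assuming a small hypothetical label array `y_demo` in place of `y_train`; the helper name `class_proportions` is illustrative and not part of the original notebook.

```python
import numpy as np

def class_proportions(labels):
    """Return a dict mapping each class label to its share of the samples."""
    values, counts = np.unique(labels, return_counts=True)
    total = counts.sum()
    return {int(v): float(c / total) for v, c in zip(values, counts)}

# Hypothetical labels standing in for y_train (1 = positive, 0 = negative).
y_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1])
proportions = class_proportions(y_demo)
print(proportions)
```

If one class dominates, a classifier that always predicts that class already reaches that class's share as its accuracy, so this share is a useful reference point for the scores reported later.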


In [71]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Baseline: multinomial Naive Bayes fitted directly on the sparse feature matrix
clf = MultinomialNB().fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy: %.2f%%"%(100 * metrics.accuracy_score(y_test, predicted)))


MultinomialNB Accuracy: 75.32%
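
Accuracy alone does not show which class the errors fall on. As a sketch of one way to break the score down, the snippet below computes confusion-matrix counts, precision and recall with NumPy on made-up labels. Here `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and `predicted`, and `binary_confusion` is an illustrative helper rather than part of the original notebook.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true_demo = np.array([1, 0, 1, 1, 0, 1])
y_pred_demo = np.array([1, 1, 1, 0, 0, 1])

tp, tn, fp, fn = binary_confusion(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Guard against division by zero when a class is never predicted or never present.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("TP=%d TN=%d FP=%d FN=%d" % (tp, tn, fp, fn))
print("accuracy=%.3f precision=%.3f recall=%.3f" % (accuracy, precision, recall))
```

On the real data, `sklearn.metrics.confusion_matrix(y_test, predicted)` provides the same counts arranged as a 2x2 array.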


In [72]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

In [77]:
# Two hidden layers of 256 ReLU units followed by a single sigmoid output unit
model = Sequential()
model.add(Dense(256, input_shape=(X_train.shape[1],)))
model.add(Activation('relu'))
# input_shape is only needed on the first layer; later layers infer it
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
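
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each `Dense` layer has (inputs x units) weights plus one bias per unit. The sketch below does this arithmetic in plain Python, assuming the input width of 6109 features reported for `X_train` earlier in the notebook; `dense_param_count` is an illustrative helper, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input width followed by the unit counts of the three Dense layers.
layer_sizes = [6109, 256, 256, 1]

per_layer = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(per_layer)

for index, count in enumerate(per_layer, start=1):
    print("Dense layer %d: %d parameters" % (index, count))
print("Total trainable parameters: %d" % total_params)
```

Nearly all of the parameters sit in the first layer, which connects every input feature to every hidden unit. With only 616 training rows, a network of this size may overfit, so the test accuracy is worth reading with that in mind.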

In [78]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [79]:
# epochs=3 matches the three epochs shown in the recorded training log
model.fit(X_train, y_train.reshape((-1,1)), epochs=3, batch_size=64)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x202f7897ca0>

In [82]:
_, accuracy = model.evaluate(X_test, y_test.reshape((-1,1)))
print("Sequential Neural Network Accuracy: %.2f%%" % (accuracy*100))

Sequential Neural Network Accuracy: 75.32%
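
For a sigmoid output, the accuracy reported by `model.evaluate` is based on thresholding the predicted probabilities at 0.5, which is the Keras default for binary accuracy. As a sketch of how class labels could be derived from such probabilities, the snippet below uses a hypothetical array `probs_demo` in place of the output of `model.predict(X_test)`; the helper name `to_labels` and the array `truth_demo` are illustrative.

```python
import numpy as np

def to_labels(probabilities, threshold=0.5):
    """Map sigmoid probabilities to 0/1 labels; values above the threshold become 1."""
    probs = np.asarray(probabilities).reshape(-1)
    return (probs > threshold).astype(int)

# Hypothetical sigmoid outputs with shape (n_samples, 1), as a Dense(1) layer would return.
probs_demo = np.array([[0.91], [0.12], [0.50], [0.67]])
truth_demo = np.array([1, 0, 1, 1])

labels_demo = to_labels(probs_demo)
accuracy_demo = float(np.mean(labels_demo == truth_demo))

print("Predicted labels:", labels_demo.tolist())
print("Accuracy: %.2f%%" % (100 * accuracy_demo))
```

Having explicit labels also makes it possible to compare the network's predictions with those of the Naive Bayes classifier sample by sample, rather than only through a single accuracy figure.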
