For an in-class assignment, we were tasked with applying machine learning on Spark. We used sentiment analysis to classify Amazon reviews as positive or negative.

In [1]:
import pyspark as ps    # for the pyspark suite
import os               # for environ variables in Part 3

%load_ext autoreload
%autoreload 2

spark = ps.sql.SparkSession.builder \
            .appName("df lecture") \
            .getOrCreate()

Start by loading the Amazon reviews from JSON into a dataframe.

In [3]:
df_reviews = spark.read.json('../data/reviews_Musical_Instruments_5.json.gz')

In [4]:
df_reviews.printSchema() # take a look at our schema 

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)



In [5]:
print(df_reviews.count()) # how many observations are in our dataframe?

10261


We are only interested in the 'reviewText' and 'overall' columns, so let's only use those.

In [6]:
df_corpus = df_reviews.select('reviewText', 'overall')

In [7]:
df_corpus.printSchema() # double check our new schema 

root
 |-- reviewText: string (nullable = true)
 |-- overall: double (nullable = true)



Since we want to separate positive from negative reviews, we need to create labels for classification and make sure our classes are balanced.

In [8]:
from pyspark.sql.functions import count

Ratings are 1-5, so let's get a total count of each rating.

In [11]:
res_test = df_corpus.groupBy("overall").agg(count("overall"))
res_test.printSchema()

root
 |-- overall: double (nullable = true)
 |-- count(overall): long (nullable = false)



In [12]:
classes_count = dict(res_test.collect())
print("class representation: {}".format(classes_count))

class representation: {1.0: 217, 4.0: 2084, 3.0: 772, 2.0: 250, 5.0: 6938}


The classes are heavily imbalanced. We are only interested in clearly positive and clearly negative reviews, so let's look at the extremes (ratings 1 and 5). Then we can downsample the larger class so that both classes are the same size.

In [14]:
balanced_classsize = min(classes_count[1.0], classes_count[5.0], 10000)
print("using limit size: {}".format(balanced_classsize))

using limit size: 217


In [16]:
from pyspark.sql.functions import rand

Let's use 'filter' to create a 'positive' reviews dataframe and a second 'negative' reviews dataframe. Each one is shuffled with 'orderBy(rand())' and cut down to the balanced class size with 'limit'.

In [17]:
dataset_neg = df_corpus.filter(df_corpus["overall"] <= 1.0).orderBy(rand()).limit(balanced_classsize)
dataset_pos = df_corpus.filter(df_corpus["overall"] >= 5.0).orderBy(rand()).limit(balanced_classsize)

Then combine these two dataframes.

In [18]:
df_posnegdataset = dataset_pos.union(dataset_neg)

Confirm that the two classes are balanced.

In [19]:
print("data points in the neg class: {}".format(dataset_neg.count()))
print("data points in the pos class: {}".format(dataset_pos.count()))

data points in the neg class: 217
data points in the pos class: 217
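
The filter, shuffle, limit, and union steps above amount to downsampling the majority class. As a rough illustration of the same idea outside Spark, here is a minimal plain-Python sketch. The 'downsample' helper and the toy rows are hypothetical and are not part of the assignment code.

```python
import random

def downsample(rows, label_of, classes, seed=0):
    """Return a class-balanced sample: every class is cut to the size of the smallest one."""
    rng = random.Random(seed)
    # group the rows by class
    by_class = {c: [r for r in rows if label_of(r) == c] for c in classes}
    # the smallest class sets the size every class is reduced to
    size = min(len(members) for members in by_class.values())
    balanced = []
    for c in classes:
        # sampling without replacement plays the role of orderBy(rand()).limit(size)
        balanced.extend(rng.sample(by_class[c], size))
    return balanced

# Hypothetical toy data: (review text, star rating)
toy_rows = [("bad", 1.0)] * 3 + [("ok", 3.0)] * 4 + [("great", 5.0)] * 10
toy_balanced = downsample(toy_rows, label_of=lambda r: r[1], classes=[1.0, 5.0])
toy_counts = {c: sum(1 for r in toy_balanced if r[1] == c) for c in (1.0, 5.0)}
print(toy_counts)
```

Downsampling throws away most of the 5-star reviews. That keeps the example simple, but it leaves only 434 rows to train and test on.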


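Create a new column with '0' as the label for the negative class and '1' as the label for the positive class. Since only ratings of 1 and 5 remain in the dataframe, (overall - 1) / 4 maps them to exactly 0.0 and 1.0.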

In [27]:
df_posnegdataset = df_posnegdataset.withColumn("label", (df_posnegdataset['overall']-1.0)/4.0)
df_posnegdataset.printSchema()

root
 |-- reviewText: string (nullable = true)
 |-- overall: double (nullable = true)
 |-- label: double (nullable = true)
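
To see why the rescaling only works as a binary label here, the small sketch below applies the same formula to every possible rating. 'rating_to_label' is a hypothetical helper name used for illustration.

```python
def rating_to_label(overall):
    """Linearly rescale a 1-5 star rating to the [0, 1] range."""
    return (overall - 1.0) / 4.0

# apply the formula to every possible star rating
rating_labels = {stars: rating_to_label(stars) for stars in (1.0, 2.0, 3.0, 4.0, 5.0)}
print(rating_labels)
```

Ratings of 2, 3, and 4 would produce fractional values (0.25, 0.5, 0.75), which are not valid class labels. The formula is safe only because those ratings were filtered out earlier.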



Import the 'indexing_pipeline' function defined in the nlp_pipeline file in this folder. Use that function to build the vocabulary of the dataframe and to turn every review into a feature vector indexed against that vocabulary.

In [28]:
from nlp_pipeline import indexing_pipeline

In [29]:
df_output, vocab = indexing_pipeline(df_posnegdataset, inputCol="reviewText")

In [30]:
df_output.printSchema()

root
 |-- reviewText: string (nullable = true)
 |-- overall: double (nullable = true)
 |-- label: double (nullable = true)
 |-- bow: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- vector_tf: vector (nullable = true)
 |-- features: vector (nullable = true)



Take a look at some vocab.

In [31]:
print("vocabulary: {}".format(vocab[0:10]))

vocabulary: ['guitar', 'b', 'string', 'pedal', 'great', 'good', 'sound', 'other', 'time', 'product']
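
The nlp_pipeline module is not reproduced here, so the following is only a generic illustration of what a bag-of-words indexing step does, not the actual implementation. The tiny stop list, the helper names, and the toy reviews are all assumptions made for the example.

```python
from collections import Counter
import re

# tiny hypothetical stop list; a real pipeline would use a much longer one
STOPWORDS = {"the", "a", "is", "and", "it", "this", "my"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_vocab(docs):
    """Vocabulary ordered by descending corpus frequency (ties broken alphabetically)."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def to_count_vector(doc, vocab):
    """Term-frequency vector: one count per vocabulary entry."""
    counts = Counter(tokenize(doc))
    return [counts[w] for w in vocab]

toy_docs = ["This guitar is great", "The pedal and the guitar sound great", "Cheap pedal"]
toy_vocab = build_vocab(toy_docs)
toy_vectors = [to_count_vector(d, toy_vocab) for d in toy_docs]
print(toy_vocab)
print(toy_vectors)
```

In the schema above, 'bow' would correspond to the token list and 'vector_tf' to the count vector. How 'features' is derived from 'vector_tf' depends on the pipeline and is not shown in this notebook.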


Use 'randomSplit' to split our data into a training set (70%) and a test set (30%), then use 'persist' to cache the training set for model fitting.

In [32]:
splits = df_output.randomSplit([0.7, 0.3])
df_train = splits[0]
df_test = splits[1]

df_train.persist()

DataFrame[reviewText: string, overall: double, label: double, bow: array<string>, vector_tf: vector, features: vector]
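
Because 'randomSplit' assigns rows at random, the 0.7 and 0.3 weights are targets rather than exact proportions, and without a seed the split changes from run to run ('randomSplit' accepts an optional 'seed' argument). The plain-Python sketch below illustrates that behaviour. 'random_split' is a hypothetical helper and is not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # cumulative boundaries, e.g. roughly [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # the last split catches any rounding error in the final boundary
            if u < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

toy_ids = list(range(434))  # same number of rows as the balanced dataset (2 x 217)
toy_train, toy_test = random_split(toy_ids, [0.7, 0.3], seed=42)
toy_train_again, toy_test_again = random_split(toy_ids, [0.7, 0.3], seed=42)
print(len(toy_train), len(toy_test))
```

Passing the same seed reproduces the same split, which makes the accuracy figure further down comparable across runs.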

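Fit a multinomial naive Bayes classifier on the training set, then use it to generate predictions for the test set. By default the classifier reads its inputs from the 'features' and 'label' columns.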

In [33]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [34]:
# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(df_train)

# apply the model on the test set
result = model.transform(df_test)
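
To make the 'smoothing' parameter more concrete, here is a minimal plain-Python sketch of multinomial naive Bayes with additive (Laplace) smoothing. It is a textbook version written for illustration, with hypothetical helper names and toy data, and is not Spark's implementation.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, vocab, smoothing=1.0):
    """Return (log_prior, log_likelihood) for a multinomial naive Bayes model."""
    log_prior = {}
    log_likelihood = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d if tok in vocab)
        # additive smoothing keeps words unseen in a class from getting probability zero
        denom = sum(counts.values()) + smoothing * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + smoothing) / denom) for w in vocab}
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log prior plus summed word log likelihoods."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical tokenized toy reviews; 0.0 = negative, 1.0 = positive
toy_reviews = [["great", "sound"], ["nice", "great", "tone"],
               ["cheap", "problem"], ["problem", "return"]]
toy_labels = [1.0, 1.0, 0.0, 0.0]
toy_words = sorted({t for d in toy_reviews for t in d})
nb_prior, nb_likelihood = train_multinomial_nb(toy_reviews, toy_labels, toy_words)
pred_pos = predict(["great", "tone"], nb_prior, nb_likelihood)
pred_neg = predict(["cheap", "problem"], nb_prior, nb_likelihood)
print(pred_pos, pred_neg)
```

The per-class log likelihoods in this sketch play the same role as 'model.theta' in the fitted Spark model, which is used further down to inspect individual words.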

Use the multiclass classification evaluator to compute accuracy on the test set.

In [35]:
# keep only label and prediction to compute accuracy
predictionAndLabels = result.select("prediction", "label")

# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                                  metricName="accuracy")

print("Accuracy: {}".format(evaluator.evaluate(predictionAndLabels)))

Accuracy: 0.6746031746031746


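Our model classifies roughly 67% of the instrument reviews in the test set correctly. Since the two classes are balanced, random guessing would score about 50%, so the model is better than chance but far from reliable.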
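
Accuracy is simply the fraction of test rows whose prediction matches the label. A minimal sketch with made-up predictions follows; the 'accuracy' helper and the toy lists are hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical predictions and labels; 4 of the 6 agree
toy_preds = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
toy_truth = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
toy_accuracy = accuracy(toy_preds, toy_truth)
print(toy_accuracy)
```

As a sanity check on the reported figure, 0.6746... equals 85/126, which would be consistent with 85 correct predictions on a test set of 126 rows. The split sizes are not printed in this notebook, so that reading is an inference rather than a recorded result.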

Now, let's see if we can find words related to the positive class, as well as words related to the negative class.

In [36]:
import numpy as np

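# model.theta holds log P(word | class), one row per class;
# after transposing, rows are words and columns are classes (0 = neg, 1 = pos)
thetaarray = model.theta.toArray().T

vocab_size = len(vocab)

# 'S10' stores each word as a byte string truncated to 10 characters
dtype = [('label', 'S10'), ('neg', float), ('pos', float)]
# score each word by how likely it is in one class and how unlikely in the other
prob_values = [ (vocab[i],
                 np.exp(thetaarray[i,0])*(1-np.exp(thetaarray[i,1])),
                 (1-np.exp(thetaarray[i,0]))*np.exp(thetaarray[i,1]))
               for i in range(vocab_size) ]

a = np.array(prob_values, dtype=dtype)       # create a structured array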

In [37]:
prob_values[0]

('guitar', 0.006222922580237934, 0.008793205823640094)

In [38]:
np.sort(a, order='pos')[::-1][0:20]

array([(b'pedal', 0.00667606, 0.01036225),
       (b'string', 0.00391814, 0.00955271),
       (b'guitar', 0.00622292, 0.00879321),
       (b'great', 0.00169551, 0.00781073),
       (b'ipad', 0.0014542 , 0.00682595),
       (b'delay', 0.00035155, 0.00615357),
       (b'easi', 0.00063057, 0.00605684),
       (b'sound', 0.00373599, 0.00598788),
       (b'mike', 0.00040292, 0.0053778 ),
       (b'price', 0.0027482 , 0.00501963),
       (b'good', 0.00468863, 0.00496378),
       (b'pick', 0.00283501, 0.00482706),
       (b'other', 0.00448186, 0.004827  ),
       (b'cabl', 0.00386692, 0.00481758),
       (b'jam', 0.00070566, 0.0047976 ),
       (b'stand', 0.00327582, 0.004767  ),
       (b'tone', 0.00152812, 0.00466939),
       (b"b'i", 0.00195458, 0.00439361),
       (b'nice', 0.00143984, 0.0043902 ),
       (b'tuner', 0.00469819, 0.0043608 )],
      dtype=[('label', 'S10'), ('neg', '<f8'), ('pos', '<f8')])

In [39]:
np.sort(a, order='neg')[::-1][0:20]

array([(b'pedal', 0.00667606, 0.01036225),
       (b'record', 0.00643652, 0.0016922 ),
       (b'guitar', 0.00622292, 0.00879321),
       (b'product', 0.00610722, 0.00305226),
       (b'thing', 0.00518676, 0.00162531),
       (b'cheap', 0.00492438, 0.00114616),
       (b'mic', 0.00486441, 0.00245868),
       (b'tuner', 0.00469819, 0.0043608 ),
       (b'good', 0.00468863, 0.00496378),
       (b'problem', 0.00460113, 0.0027619 ),
       (b'other', 0.00448186, 0.004827  ),
       (b'time', 0.00428997, 0.003345  ),
       (b'b', 0.00425538, 0.00403088),
       (b'capo', 0.00422081, 0.00342679),
       (b'review', 0.00417728, 0.00135955),
       (b'strap', 0.00416568, 0.00412251),
       (b'batteri', 0.00413894, 0.00240294),
       (b'someth', 0.00412973, 0.0017657 ),
       (b'same', 0.00408317, 0.00122643),
       (b'way', 0.00403511, 0.00261567)],
      dtype=[('label', 'S10'), ('neg', '<f8'), ('pos', '<f8')])

Some of the top words in the positive list, like 'great', 'easi' (the stem of 'easy'), and 'nice', are clearly associated with the positive class, while words like 'problem' and 'cheap' are more obviously related to the negative class. However, most of the words are product related, and several of them ('pedal', 'guitar', 'good', 'other', 'tuner') rank highly in both lists. That is because we ran the model on a dataset that only included instrument reviews, so the positive and negative associations are biased by the vocabulary of the products being reviewed. To get a better sense of the words associated with positive or negative reviews in general, we need to broaden our dataset.
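
Words such as 'pedal' and 'guitar' top both lists because the scores above are dominated by how frequent a word is overall. One possible alternative, which this notebook does not use, is to rank words by the difference of their log probabilities between the two classes, so that words equally common in both classes score near zero. With the fitted model that difference would be thetaarray[i,1] - thetaarray[i,0]. The sketch below uses made-up probabilities and a hypothetical 'rank_by_log_odds' helper.

```python
import math

def rank_by_log_odds(words, log_p_neg, log_p_pos, top_n=3):
    """Rank words by log P(word | pos) - log P(word | neg)."""
    scored = [(w, lp - ln) for w, ln, lp in zip(words, log_p_neg, log_p_pos)]
    scored.sort(key=lambda item: item[1], reverse=True)
    most_positive = scored[:top_n]
    most_negative = scored[-top_n:][::-1]  # most negative word first
    return most_positive, most_negative

# Hypothetical per-class word probabilities (not taken from the fitted model)
toy_terms = ["guitar", "great", "cheap", "pedal", "problem"]
toy_p_neg = [0.30, 0.05, 0.25, 0.30, 0.10]
toy_p_pos = [0.30, 0.30, 0.05, 0.30, 0.05]
most_pos, most_neg = rank_by_log_odds(
    toy_terms,
    [math.log(p) for p in toy_p_neg],
    [math.log(p) for p in toy_p_pos],
    top_n=2,
)
print(most_pos)
print(most_neg)
```

In this toy example 'guitar' and 'pedal' get a score of zero because they are equally likely in both classes, while 'great' and 'cheap' move to the two ends of the ranking.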