# Amazon Review Data Pipeline

The goal of this Jupyter notebook is to:
1. Load Amazon Review data acquired from [https://nijianmo.github.io/amazon/index.html](https://nijianmo.github.io/amazon/index.html)
2. Turn it into a Spark RDD
3. Create a Bag-of-Words with:
    - Single words
    - 1-grams and 2-grams
4. Visualize the frequency of these words' occurrences in the review data
5. Create a LabeledPoint RDD for each review and save it as a libsvm file.

## Preparing Data

The data cannot be read directly from its gzipped format into a Spark DataFrame. Doing so produces schema errors because there are a few naming collisions in the JSON data.

Instead, the data must be unzipped and a few of its errors must be corrected before it is loaded.

The following script achieves that. Specifically, it ensures consistent casing for the field names 'style' and 'styleName'.

Note: it is possible that both `gzip` and `sed` will need to be installed.

In [3]:
%%bash

# Decompress the gzip archive, keeping the original file (-k)
gzip -dk "$PWD/data/Musical_Instruments_5.json.gz"

# Replace problematic field names
sed -i "s:style Name:styleName:" "$PWD/data/Musical_Instruments_5.json"
sed -i "s:style name:styleName:" "$PWD/data/Musical_Instruments_5.json"
sed -i "s:Style:style:" "$PWD/data/Musical_Instruments_5.json"

gzip: /home/steph/proj/550/project/data/Musical_Instruments_5.json already exists;	not overwritten
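
Where `gzip` and `sed` are not available, the same preparation could be sketched in pure Python. This is only a minimal sketch and not the method used in this notebook: `normalize_style_keys` and `decompress_and_fix` are hypothetical helper names, and the replacements simply mirror the three `sed` commands above (first match per line, applied in the same order).

```python
import gzip

# Same substitutions, in the same order, as the sed commands in the bash cell
REPLACEMENTS = (
    ("style Name", "styleName"),
    ("style name", "styleName"),
    ("Style", "style"),
)

def normalize_style_keys(line):
    """Replace the first occurrence of each problematic field name in one line."""
    for old, new in REPLACEMENTS:
        line = line.replace(old, new, 1)
    return line

def decompress_and_fix(src_gz, dst_json):
    """Stream a gzipped JSON-lines file to disk, fixing field names on the way."""
    with gzip.open(src_gz, "rt", encoding="utf-8") as src, \
            open(dst_json, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize_style_keys(line))

# Example call, using the paths from this notebook:
# decompress_and_fix("data/Musical_Instruments_5.json.gz",
#                    "data/Musical_Instruments_5.json")
```

Streaming line by line keeps memory use flat, at the cost of being slower than the shell tools.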


## Imports

In [350]:
from pyspark.sql import SparkSession
from pyspark.mllib.util import MLUtils
import os
import gzip
import json
import re

## Spark DataFrame

The goal of this section is to load the cleaned JSON file from above into a Spark DataFrame, drop all columns except 'overall' (rating) and 'reviewText', and create an RDD from this data.

#### Creating Spark Session

In [5]:
spark = SparkSession.builder \
    .master("local[16]") \
    .appName("data_pipeline") \
    .getOrCreate()

22/12/14 17:53:05 WARN Utils: Your hostname, Zephyrus resolves to a loopback address: 127.0.1.1; using 172.17.20.254 instead (on interface eth0)
22/12/14 17:53:05 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/14 17:53:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### Creating Spark Dataframe

In [6]:
spark_df = spark.read.json('data/Musical_Instruments_5.json')
spark_df.printSchema()


root
 |-- asin: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Color Name:: string (nullable = true)
 |    |-- Color:: string (nullable = true)
 |    |-- Configuration:: string (nullable = true)
 |    |-- Edition:: string (nullable = true)
 |    |-- Format:: string (nullable = true)
 |    |-- Item Display Length:: string (nullable = true)
 |    |-- Item Package Quantity:: string (nullable = true)
 |    |-- Length:: string (nullable = true)
 |    |-- Model Number:: string (nullable = true)
 |    |-- Number of Items:: string (nullable = true)
 |    |-- Package Quantity:: string (nullable = true)
 |    |-- Package Type:: string (nullable = true)
 |    |-- Platform for Display:

                                                                                

In [7]:
print(f"There are {spark_df.count()} rows of data")

There are 231392 rows of data


## Cleaning Data

1. Drop all columns except `overall` and `reviewText`
2. Drop all rows with NA values
3. Convert to RDD
4. Convert to lowercase
5. Strip non-alphabetic/space chars
6. Split into vector of strings
7. Remove empty reviews
8. Simplify rating into 0 or 1
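
To make the effect of these steps concrete, here is a minimal plain-Python sketch of what happens to a single review, without Spark. The helper name `clean_review` and the sample reviews are made up for illustration. The regex is equivalent to the one used in the cell below once the text has been lowercased.

```python
import re

def clean_review(overall, review_text):
    """Return (label, tokens) for one review, or None if no tokens remain."""
    text = review_text.lower()
    text = re.sub(r"[^a-z ]+", "", text)  # keep only letters and spaces
    tokens = text.split()
    if not tokens:
        return None  # empty reviews are filtered out
    label = 1 if overall in (4, 5) else 0  # 4-5 -> positive, 1-3 -> negative
    return (label, tokens)

print(clean_review(5.0, "Great guitar!!! 10/10"))  # (1, ['great', 'guitar'])
print(clean_review(2.0, "Didn't work."))           # (0, ['didnt', 'work'])
```

Note that stripping punctuation turns contractions such as "didn't" into "didnt", which is why tokens like `dont` and `didnt` show up in the n-gram counts later on.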

In [8]:
# Drop all columns except reviewText and overall
df = spark_df[['overall', 'reviewText']]
df.printSchema()

# Drop all rows with NA values
df = df.dropna()

# Convert to rdd
rdd = df.rdd

# Convert to lowercase
rdd = rdd.map(lambda x: (x[0], x[1].lower()))

# Strip everything except letters and spaces
rdd = rdd.map(lambda x: (x[0], re.sub(r'[^A-Za-z ]+', '', x[1])))

# Convert review to array of words
rdd = rdd.map(lambda x: (x[0], x[1].split()))

# Convert overall rating to 1 or 0
# 4-5 -> Positive
# 1-3 -> Negative
rdd = rdd.map(lambda x: (1, x[1]) if x[0] in [4,5] else (0, x[1]))

# Filter out empty reviews
basic_rdd = rdd.filter(lambda x: len(x[1]) > 0)

# Cache basic_rdd, then count it so the cache is materialized
basic_rdd.cache()
total_rows = basic_rdd.count()

root
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)


## Creating Bags-of-Words

### Method for Filtering Stop Words

For all $(k, v)$ where $v$ is a list of strings, remove all $e$ from $v$ where $e \in \text{stopwords}$. 

Stopwords sourced from [here](http://snowball.tartarus.org/algorithms/english/stop.txt).
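
The cell below reads `data/stopwords.txt` one word per line, so it presumably works on an already cleaned copy of the list. If the raw Snowball file is used instead, its comments (which begin with a vertical bar) have to be removed first. A minimal sketch, assuming that raw format; `parse_stopwords` and its `strip_non_alpha` option are hypothetical names and not part of the notebook:

```python
import re

def parse_stopwords(text, strip_non_alpha=False):
    """Parse a stop word list where '|' starts a comment, one word per line."""
    words = set()
    for line in text.splitlines():
        word = line.split("|", 1)[0].strip().lower()
        if strip_non_alpha:
            # Apply the same character filter as the review cleaning step
            word = re.sub(r"[^a-z ]+", "", word)
        if word:
            words.add(word)
    return words

sample = " | comment line\ni              | subject\nme\ndon't\n"
print(sorted(parse_stopwords(sample)))
print(sorted(parse_stopwords(sample, strip_non_alpha=True)))
```

Because the cleaning step strips apostrophes from the reviews, any contractions in the stop word list only match the review tokens if they are normalized the same way, which is what `strip_non_alpha=True` does in this sketch.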

In [70]:
# Load the stopwords into a set for fast membership tests
with open("data/stopwords.txt") as file:
    stopwords = set(file.read().splitlines())

### Creating OneGram Bags of Words

In [71]:
def remove_stopwords(x):
    # Keep only the tokens of review x[1] that are not stopwords
    return (x[0], [s for s in x[1] if s not in stopwords])

In [194]:
positive_onegram_counts = basic_rdd \
    .filter(lambda x: x[0] == 1) \
    .map(remove_stopwords) \
    .flatMap(lambda x: x[1]) \
    .map(lambda x: (tuple([x]), 1)) \
    .reduceByKey(lambda a,b: a+b) \
    .filter(lambda x: x[1] > 1) \
    .sortBy(lambda x: -x[1])

negative_onegram_counts = basic_rdd \
    .filter(lambda x: x[0] == 0) \
    .map(remove_stopwords) \
    .flatMap(lambda x: x[1]) \
    .map(lambda x: (tuple([x]), 1)) \
    .reduceByKey(lambda a,b: a+b) \
    .filter(lambda x: x[1] > 1) \
    .sortBy(lambda x: -x[1])

                                                                                

In [195]:
# Cache before counting so that the count materializes the cached RDDs
positive_onegram_counts.cache()
negative_onegram_counts.cache()

num_shared_pos_onegrams = positive_onegram_counts.count()
num_shared_neg_onegrams = negative_onegram_counts.count()

print(f"There are {num_shared_pos_onegrams} shared Positive OneGrams in the corpus")
print(f"There are {num_shared_neg_onegrams} shared Negative OneGrams in the corpus")

There are 53160 shared Positive OneGrams in the corpus
There are 21554 shared Negative OneGrams in the corpus
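
For readers less familiar with the `flatMap` / `reduceByKey` pattern, the counting pipeline above can be sketched on a small in-memory list with `collections.Counter`. This is an illustrative equivalent only: `count_onegrams`, the toy reviews and the toy stopword set are made up for the example.

```python
from collections import Counter

def count_onegrams(token_lists, stopwords, min_count=2):
    """Count 1-tuples over all reviews, keep shared ones, sort by frequency."""
    counts = Counter(
        (tok,)                      # same key shape as tuple([x]) in the RDD code
        for tokens in token_lists
        for tok in tokens
        if tok not in stopwords
    )
    # "Shared" means the OneGram occurs more than once in the corpus
    kept = [(gram, c) for gram, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda pair: -pair[1])

reviews = [["great", "guitar"], ["great", "the", "sound"], ["guitar"]]
print(count_onegrams(reviews, stopwords={"the"}))
# [(('great',), 2), (('guitar',), 2)]
```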


In [196]:
print("Positive OneGrams:")
for a in positive_onegram_counts.take(6):
    print('\t', end="")
    print(a)

print("Negative OneGrams:")
for a in negative_onegram_counts.take(6):
    print('\t', end="")
    print(a)

Positive OneGrams:
	(('great',), 74484)
	(('good',), 54966)
	(('guitar',), 51021)
	(('sound',), 47321)
	(('one',), 44627)
	(('like',), 44025)
Negative OneGrams:
	(('one',), 10232)
	(('just',), 9416)
	(('like',), 9228)
	(('good',), 8735)
	(('guitar',), 8474)
	(('sound',), 8006)



### Creating TwoGram Bag of Words

In [260]:
n = 2

def ngram(x):
    result = []
    for i in range(len(x[1])-n+1):
        gram = []
        for j in range(i, i+n):
            gram.append(x[1][j])
        result.append(tuple(gram))
    return (x[0], result)

def ngram_remove_stopwords(x):
    # Drop n-grams whose first or last word is a stopword
    result = []
    for s in x[1]:
        if s[0] not in stopwords and s[-1] not in stopwords:
            result.append(s)
    return (x[0], result)
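
As a worked example of the two functions above for `n = 2`, the following standalone sketch builds the TwoGrams of one tokenized review and then drops those that start or end with a stopword. The names `bigrams` and `drop_stopword_edges`, the sample tokens and the toy stopword set are illustrative assumptions.

```python
def bigrams(tokens):
    """Return all adjacent word pairs of a token list as tuples."""
    return list(zip(tokens, tokens[1:]))

def drop_stopword_edges(grams, stopwords):
    """Keep only n-grams whose first and last words are not stopwords."""
    return [g for g in grams if g[0] not in stopwords and g[-1] not in stopwords]

tokens = ["works", "great", "for", "me"]
grams = bigrams(tokens)
print(grams)
# [('works', 'great'), ('great', 'for'), ('for', 'me')]
print(drop_stopword_edges(grams, stopwords={"for", "me"}))
# [('works', 'great')]
```

Filtering after building the n-grams, rather than before, avoids gluing together words that were not actually adjacent in the review.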

In [263]:
positive_twogram_counts = basic_rdd \
    .filter(lambda x: x[0] == 1) \
    .map(ngram) \
    .map(ngram_remove_stopwords) \
    .flatMap(lambda x: [(a, 1) for a in x[1]]) \
    .reduceByKey(lambda a,b: a+b) \
    .filter(lambda x: x[1] > 1) \
    .sortBy(lambda x: -x[1])

negative_twogram_counts = basic_rdd \
    .filter(lambda x: x[0] == 0) \
    .map(ngram) \
    .map(ngram_remove_stopwords) \
    .flatMap(lambda x: [(a, 1) for a in x[1]]) \
    .reduceByKey(lambda a,b: a+b) \
    .filter(lambda x: x[1] > 1) \
    .sortBy(lambda x: -x[1])

                                                                                

In [264]:
# Cache before counting so that the count materializes the cached RDDs
positive_twogram_counts.cache()
negative_twogram_counts.cache()

num_shared_pos_twograms = positive_twogram_counts.count()
num_shared_neg_twograms = negative_twogram_counts.count()

print(f"There are {num_shared_pos_twograms} shared Positive TwoGrams in the corpus")
print(f"There are {num_shared_neg_twograms} shared Negative TwoGrams in the corpus")

There are 245943 shared Positive TwoGrams in the corpus
There are 59697 shared Negative TwoGrams in the corpus


In [266]:
print("Positive TwoGrams:")
for a in positive_twogram_counts.take(6):
    print('\t', end="")
    print(a)

print("Negative TwoGrams:")
for a in negative_twogram_counts.take(6):
    print('\t', end="")
    print(a)

Positive TwoGrams:
	(('works', 'great'), 7120)
	(('good', 'quality'), 3826)
	(('great', 'product'), 3631)
	(('great', 'price'), 3372)
	(('highly', 'recommend'), 2990)
	(('sounds', 'great'), 2702)
Negative TwoGrams:
	(('much', 'better'), 658)
	(('dont', 'know'), 628)
	(('can', 'get'), 451)
	(('power', 'supply'), 441)
	(('didnt', 'work'), 425)
	(('sound', 'quality'), 424)


### Generating Bag of 3000 OneGrams

In [201]:
top_3k_pos_onegrams = set(positive_onegram_counts.map(lambda x: x[0]).take(3000))
top_3k_neg_onegrams = set(negative_onegram_counts.map(lambda x: x[0]).take(3000))
common_3k_onegrams = top_3k_pos_onegrams.intersection(top_3k_neg_onegrams)

In [202]:
num_3k_pos = int((3000 - len(common_3k_onegrams)) / 2)
num_3k_neg = 3000 - len(common_3k_onegrams) - num_3k_pos

unique_3k_pos_onegrams = set(positive_onegram_counts \
    .map(lambda x: x[0]) \
    .filter(lambda x: not x in common_3k_onegrams) \
    .take(num_3k_pos))

unique_3k_neg_onegrams = set(negative_onegram_counts \
    .map(lambda x: x[0]) \
    .filter(lambda x: not x in common_3k_onegrams) \
    .take(num_3k_neg))

bag_3k = common_3k_onegrams.union(unique_3k_neg_onegrams).union(unique_3k_pos_onegrams)

In [203]:
print(f"There are {len(common_3k_onegrams)} common OneGrams")
print(f"We will include {len(unique_3k_pos_onegrams)} unique Positive OneGrams")
print(f"We will include {len(unique_3k_neg_onegrams)} unique Negative OneGrams")
print(f"There are {len(bag_3k)} OneGrams in total")

There are 2630 common OneGrams
We will include 185 unique Positive OneGrams
We will include 185 unique Negative OneGrams
There are 3000 OneGrams in total
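
The bag construction used here (and repeated for the other bags below) can be summarized in a small plain-Python sketch: take the n-grams common to both top-N lists, then fill the remaining slots half from the positive ranking and half from the negative ranking. `build_bag` is a hypothetical helper operating on ordinary lists sorted by descending frequency, not on RDDs.

```python
def build_bag(pos_ranked, neg_ranked, size):
    """Combine two frequency-ranked lists into one bag of (at most) `size` items."""
    common = set(pos_ranked[:size]) & set(neg_ranked[:size])
    remaining = size - len(common)
    num_pos = remaining // 2          # same as int(remaining / 2) above
    num_neg = remaining - num_pos     # negative side gets the extra slot if odd
    unique_pos = [g for g in pos_ranked if g not in common][:num_pos]
    unique_neg = [g for g in neg_ranked if g not in common][:num_neg]
    return common | set(unique_pos) | set(unique_neg)

pos = ["a", "b", "c", "d", "e"]
neg = ["a", "x", "b", "y", "z"]
print(sorted(build_bag(pos, neg, size=4)))
# ['a', 'b', 'c', 'x']
```

The bag can end up smaller than `size` if an item picked as unique for one class is also picked for the other, so it is worth checking the final length, as the notebook does.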


### Generating Bag of 5000 OneGrams

In [204]:
top_5k_pos_onegrams = set(positive_onegram_counts.map(lambda x: x[0]).take(5000))
top_5k_neg_onegrams = set(negative_onegram_counts.map(lambda x: x[0]).take(5000))
common_5k_onegrams = top_5k_pos_onegrams.intersection(top_5k_neg_onegrams)

In [217]:
num_5k_pos = int((5000 - len(common_5k_onegrams)) / 2)
num_5k_neg = 5000 - len(common_5k_onegrams) - num_5k_pos

unique_5k_pos_onegrams = set(positive_onegram_counts \
    .map(lambda x: x[0]) \
    .filter(lambda x: not x in common_5k_onegrams) \
    .take(num_5k_pos))

unique_5k_neg_onegrams = set(negative_onegram_counts \
    .map(lambda x: x[0]) \
    .filter(lambda x: not x in common_5k_onegrams) \
    .take(num_5k_neg))

bag_5k_a = common_5k_onegrams.union(unique_5k_neg_onegrams).union(unique_5k_pos_onegrams)

In [218]:
print(f"There are {len(common_5k_onegrams)} common OneGrams")
print(f"We will include {len(unique_5k_pos_onegrams)} unique Positive OneGrams")
print(f"We will include {len(unique_5k_neg_onegrams)} unique Negative OneGrams")
print(f"There are {len(bag_5k_a)} OneGrams in total")

There are 4330 common OneGrams
We will include 335 unique Positive OneGrams
We will include 335 unique Negative OneGrams
There are 5000 OneGrams in total


### Generating Bag of 3000 OneGrams + 2000 TwoGrams

In [220]:
top_2k_pos_twograms = set(positive_twogram_counts.map(lambda x: x[0]).take(2000))
top_2k_neg_twograms = set(negative_twogram_counts.map(lambda x: x[0]).take(2000))
common_2k_twograms = top_2k_pos_twograms.intersection(top_2k_neg_twograms)

In [224]:
num_2k_pos = int((2000 - len(common_2k_twograms)) / 2)
num_2k_neg = 2000 - len(common_2k_twograms) - num_2k_pos

unique_2k_pos_twograms = set(positive_twogram_counts \
    .map(lambda x: x[0]) \
    .filter(lambda x: not x in common_2k_twograms) \
    .take(num_2k_pos))

unique_2k_neg_twograms = set(negative_twogram_counts \
    .map(lambda x: x[0]) \
    .filter(lambda x: not x in common_2k_twograms) \
    .take(num_2k_neg))

bag_5k_b = common_2k_twograms.union(unique_2k_neg_twograms).union(unique_2k_pos_twograms).union(bag_3k)

In [225]:
print(f"There are {len(common_2k_twograms)} common TwoGrams")
print(f"We will include {len(unique_2k_pos_twograms)} unique Positive TwoGrams")
print(f"We will include {len(unique_2k_neg_twograms)} unique Negative TwoGrams")
print(f"We will include {len(bag_3k)} OneGrams")
print(f"There are {len(bag_5k_b)} NGrams in total")

There are 1265 common TwoGrams
We will include 367 unique Positive TwoGrams
We will include 368 unique Negative TwoGrams
We will include 3000 OneGrams
There are 5000 NGrams in total


## Create Data Sets

### Generate OneGram RDD

In [237]:
onegram_tuples_rdd = basic_rdd \
    .map(lambda x: (x[0], [tuple([s]) for s in x[1]]))

### Generate OneGram + TwoGram RDD

In [277]:
n = 2
def one_and_two_grams(x):
    result = []
    for s in x[1]:
        result.append(tuple([s])) # Get all OneGrams
    for i in range(len(x[1])-n+1):
        gram = []
        for j in range(i, i+n):
            gram.append(x[1][j])
        result.append(tuple(gram)) # Get all TwoGrams
    return (x[0], result)

In [278]:
twogram_tuples_rdd = basic_rdd.map(one_and_two_grams)

### Filter Method

In [315]:
def bag_filter(x, bag):
    # Keep only the n-grams of x[1] that are part of the bag
    result = []
    for s in x[1]:
        if s in bag:
            result.append(s)
    return (x[0], result)

### Filter Data

In [316]:
data_3k_rdd = onegram_tuples_rdd \
    .map(lambda x: bag_filter(x, bag_3k)) \
    .filter(lambda x: len(x[1]) > 0) # Ensure there is at least one feature

data_5k_a_rdd = onegram_tuples_rdd \
    .map(lambda x: bag_filter(x, bag_5k_a)) \
    .filter(lambda x: len(x[1]) > 0) # Ensure there is at least one feature

data_5k_b_rdd = twogram_tuples_rdd \
    .map(lambda x: bag_filter(x, bag_5k_b)) \
    .filter(lambda x: len(x[1]) > 0) # Ensure there is at least one feature

num_3k_remaining = data_3k_rdd.count()
num_5k_a_remaining = data_5k_a_rdd.count()
num_5k_b_remaining = data_5k_b_rdd.count()

                                                                                

In [317]:
print(f"{num_3k_remaining/total_rows*100:.2f}% of rows remain after Bag_3k filter")
print(f"{num_5k_a_remaining/total_rows*100:.2f}% of rows remain after Bag_5k_a filter")
print(f"{num_5k_b_remaining/total_rows*100:.2f}% of rows remain after Bag_5k_b filter")

99.32% of rows remain after Bag_3k filter
99.50% of rows remain after Bag_5k_a filter
99.32% of rows remain after Bag_5k_b filter


## Prepare Data for Logistic Regression

### Convert to RDD\<LabeledPoint\>

In [335]:
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import SparseVector

In [338]:
def convert_to_lp_rdd(x, bag):
    # Feature indices follow the iteration order of `bag`
    present = set(x[1])  # set lookup instead of scanning the list for every feature
    feature_ids = []
    feature_vals = []
    for idx, val in enumerate(bag):
        if val in present:
            feature_ids.append(idx)
            feature_vals.append(1)
    return LabeledPoint(x[0], SparseVector(len(bag), feature_ids, feature_vals))

In [346]:
from pyspark.mllib.util import MLUtils
import os

In [349]:
model_A = data_3k_rdd.map(lambda x: convert_to_lp_rdd(x, bag_3k))
model_B = data_5k_a_rdd.map(lambda x: convert_to_lp_rdd(x, bag_5k_a))
model_C = data_5k_b_rdd.map(lambda x: convert_to_lp_rdd(x, bag_5k_b))

if not os.path.isdir("data/model_A"):
    MLUtils.saveAsLibSVMFile(model_A, "data/model_A")
if not os.path.isdir("data/model_B"):
    MLUtils.saveAsLibSVMFile(model_B, "data/model_B")
if not os.path.isdir("data/model_C"):
    MLUtils.saveAsLibSVMFile(model_C, "data/model_C")
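
To illustrate what ends up in the saved files, here is a minimal sketch of how one review maps to a LIBSVM-style line: the label, followed by `index:value` pairs with one-based, ascending indices. `to_libsvm_line` and the tiny vocabulary are hypothetical, and the exact number formatting written by `MLUtils.saveAsLibSVMFile` may differ from this sketch.

```python
def to_libsvm_line(label, grams, vocab_index):
    """Encode the n-grams present in one review as a binary LIBSVM-style line."""
    # Zero-based positions of the n-grams that are in the vocabulary
    present = sorted({vocab_index[g] for g in grams if g in vocab_index})
    # LIBSVM indices are one-based and must be in ascending order
    features = " ".join(f"{i + 1}:1" for i in present)
    return f"{label} {features}"

vocab_index = {("great",): 0, ("guitar",): 1, ("works", "great"): 2}
review = [("works",), ("great",), ("works", "great")]
print(to_libsvm_line(1, review, vocab_index))
# 1 1:1 3:1
```

Using a fixed `vocab_index` mapping, rather than the iteration order of a set, is one way to keep feature indices stable between runs.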

                                                                                