<hr>
# Data Cleaning

Notebook for testing the cleaning process for our dataset.
<hr>
## Setup

In [1]:
import pandas as pd
import numpy as np
import amzn_reviews_cleaner_funcs as amzn
from pyspark.sql import SparkSession

# load the autoreload extension before enabling it
%load_ext autoreload
%autoreload 2

<hr>
## Load Data

In [2]:
# create spark session
# reuses the active SparkContext if one exists (e.g. `sc` in a pyspark shell)
spark = SparkSession.builder.getOrCreate()

# get dataframe
# specify s3 as source with s3a://
df = spark.read.json("s3a://amazon-review-data/reviews_Musical_Instruments_5.json.gz")
df.show(3)

+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|      asin| helpful|overall|          reviewText| reviewTime|    reviewerID|        reviewerName|             summary|unixReviewTime|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
|1384719342|  [0, 0]|    5.0|Not much to write...|02 28, 2014|A2IBPI20UZIR0U|cassandra tu "Yea...|                good|    1393545600|
|1384719342|[13, 14]|    5.0|The product does ...|03 16, 2013|A14VAT5EAX3D9S|                Jake|                Jake|    1363392000|
|1384719342|  [1, 1]|    5.0|The primary job o...|08 28, 2013|A195EZSQDW3E21|Rick Bennette "Ri...|It Does The Job Well|    1377648000|
+----------+--------+-------+--------------------+-----------+--------------+--------------------+--------------------+--------------+
only showing top 3 rows



<hr>
## Test helper module

### Add tfidf vectors

In [4]:
df_tfidf, vocab = amzn.add_tfidf(df)

df_tfidf.select("idf_vector").show(3)

IllegalArgumentException: u'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
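
The exception above suggests that the tf-idf step received the raw `reviewText` string column rather than an array of tokens. As a plain-Python illustration only (this is not the Spark code inside `amzn_reviews_cleaner_funcs`, whose internals are not shown in this notebook), the sketch below shows the shape the input presumably needs: each review as a list of strings. The function name `to_tokens` is hypothetical.

```python
import re

def to_tokens(text):
    """Lowercase a review and split it into a list of word tokens.

    Hypothetical helper for illustration; not part of the amzn module.
    """
    # Keep runs of letters, digits, and apostrophes; anything else separates tokens.
    return re.findall(r"[a-z0-9']+", text.lower())

review = "Not much to write about here, but it does exactly what it's supposed to."
tokens = to_tokens(review)
print(tokens)
```

In the Spark pipeline, the equivalent would likely be to run the cleaning and tokenizing steps first and pass the resulting token column to the tf-idf step.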

### Test extraction of top n features

In [15]:
df_features = amzn.add_top_features(df_tfidf, vocab)

In [16]:
df_features.select("top_features").show(3)

+--------------------+
|        top_features|
+--------------------+
|[supposed, record...|
|[nose, candy, car...|
|[pops, allowing, ...|
+--------------------+
only showing top 3 rows
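
For intuition, here is a minimal plain-Python sketch of how top-n features could be picked from a sparse tf-idf vector and a vocabulary list. It assumes the vector is available as an index-to-weight mapping and that `vocab[i]` is the term for index `i`. The function `top_n_features` and the weights are made up for illustration and may differ from what `amzn.add_top_features` actually does.

```python
def top_n_features(weights, vocab, n=3):
    """Return the n vocabulary terms with the largest tf-idf weights.

    weights: dict mapping vocabulary index -> tf-idf weight (a sparse vector)
    vocab:   list where vocab[i] is the term for index i
    """
    # Sort indices by descending weight; break ties by index so results are deterministic.
    ranked = sorted(weights, key=lambda i: (-weights[i], i))
    return [vocab[i] for i in ranked[:n]]

vocab = ["the", "pop", "filter", "crisp", "amazon"]
weights = {0: 0.1, 1: 1.4, 2: 0.9, 3: 2.2}  # made-up illustrative weights
print(top_n_features(weights, vocab, n=2))
```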



### Test clean_reviewText()

In [10]:
df_clean = amzn.clean_reviewText(df)
df_clean.select("cleanText").show(3)

+--------------------+
|           cleanText|
+--------------------+
|Not much to write...|
|The product does ...|
|The primary job o...|
+--------------------+
only showing top 3 rows



### Test removal of empty tokens

In [15]:
# clean
df_clean = amzn.clean_reviewText(df)

# tokenize
df_raw_tokens = amzn.tokenize(df_clean)

In [7]:
df_raw_tokens.select("raw_tokens").show(3)

+--------------------+
|          raw_tokens|
+--------------------+
|[not, much, to, w...|
|[the, product, do...|
|[the, primary, jo...|
+--------------------+
only showing top 3 rows



In [22]:
# remove stopwords
#df_tokens = amzn.remove_stop_words(df_raw_tokens)
df_remove_test = amzn.remove_empty_tokens(df_raw_tokens)
test_row = df_remove_test.select("raw_tokens").first()

TypeError: __init__() takes at least 2 arguments (1 given)

In [19]:
test_row["raw_tokens"]

u"[not, much, to, write, about, here, but, it, does, exactly, what, it's, supposed, to, filters, out, the, pop, sounds, now, my, recordings, are, much, more, crisp, it, is, one, of, the, lowest, prices, pop, filters, on, amazon, so, might, as, well, buy, it, they, honestly, work, the, same, despite, their, pricing]"
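
Note that the value shown above is a single string that looks like a list, not an actual list of tokens, so any per-token filtering needs a real list to operate on. As a plain-Python stand-in (not the implementation of `amzn.remove_empty_tokens`, which is not shown here), removing empty tokens from a list could look like this:

```python
def remove_empty_tokens(tokens):
    """Drop tokens that are empty or contain only whitespace.

    Illustrative stand-in; the amzn module's version may behave differently.
    """
    return [t for t in tokens if t and t.strip()]

raw_tokens = ["not", "much", "", "to", " ", "write"]
print(remove_empty_tokens(raw_tokens))
```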