# Amazon Review ETL

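This notebook is an ETL pipeline for Amazon pet product review data. It reads the reviews from a TSV file with Spark, splits them into review, customer, product, and vine tables, and prepares them for loading into Postgres.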

## Dependencies

In [1]:
# Download a Postgres driver to allow Spark to interact with Postgres
!curl -O https://jdbc.postgresql.org/download/postgresql-42.2.16.jar

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  979k  100  979k    0     0   807k      0  0:00:01  0:00:01 --:--:--  807k


In [33]:
# Locate Spark
import findspark
findspark.init()

# Dependencies
from pyspark import SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType
from pyspark.sql.functions import col
from config import db_password

# Create a Spark session with the Postgres JDBC driver on the driver classpath
spark = SparkSession.builder \
                    .appName('amz') \
                    .config('spark.driver.extraClassPath', 'postgresql-42.2.16.jar') \
                    .getOrCreate()
spark

## Extract

1. [Download the data](https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Pet_Products_v1_00.tsv.gz)
2. Unzip the downloaded file, rename it to `pet_product_reviews.tsv` (the name the read cell expects), and move it to the same directory as this notebook

Source: https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt
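The two manual steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `download` and `gunzip` and the local file names are assumptions made for illustration, and it presumes the URL from step 1 is still reachable.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# URL copied from step 1; the local file names used below are assumptions
DATA_URL = (
    "https://s3.amazonaws.com/amazon-reviews-pds/tsv/"
    "amazon_reviews_us_Pet_Products_v1_00.tsv.gz"
)


def download(url: str, dest: Path) -> Path:
    """Stream a remote file to disk without holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip(src: Path, dest: Path) -> Path:
    """Decompress a .gz file to dest, copying in chunks."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest


# Example usage (a large download, so it is left commented out):
# archive = download(DATA_URL, Path("pet_product_reviews.tsv.gz"))
# gunzip(archive, Path("pet_product_reviews.tsv"))
```

Writing the decompressed file straight to `pet_product_reviews.tsv` means the read cell can find it without a separate rename.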

In [24]:
# Read in data
df = spark.read.csv('pet_product_reviews.tsv', sep='\t', header=True, inferSchema=True)
df.show(2, vertical=True, truncate=False)

-RECORD 0----------------------------
 marketplace       | US
 customer_id       | 28794885
 review_id         | REAKC26P07MDN

In [25]:
# Schema and row count
df.printSchema()
df.count()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: integer (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: integer (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: string (nullable = true)



2643619
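`inferSchema=True` makes Spark scan the file an extra time and leaves `review_date` as a string. As an alternative, the schema can be passed explicitly. The sketch below builds a DDL-formatted schema string from the column names printed above; typing `review_date` as `DATE` is an assumption that every value is formatted as `yyyy-MM-dd`, and the names `COLUMNS` and `schema_ddl` are hypothetical.

```python
# Column names and order follow the printSchema() output above.
# DATE for review_date is an assumption (values formatted as yyyy-MM-dd).
COLUMNS = [
    ("marketplace", "STRING"),
    ("customer_id", "INT"),
    ("review_id", "STRING"),
    ("product_id", "STRING"),
    ("product_parent", "INT"),
    ("product_title", "STRING"),
    ("product_category", "STRING"),
    ("star_rating", "INT"),
    ("helpful_votes", "INT"),
    ("total_votes", "INT"),
    ("vine", "STRING"),
    ("verified_purchase", "STRING"),
    ("review_headline", "STRING"),
    ("review_body", "STRING"),
    ("review_date", "DATE"),
]

# Join the pairs into a DDL string such as "marketplace STRING, customer_id INT, ..."
schema_ddl = ", ".join(f"{name} {dtype}" for name, dtype in COLUMNS)

# Usage inside the notebook (needs the `spark` session created earlier):
# df = spark.read.csv('pet_product_reviews.tsv', sep='\t', header=True, schema=schema_ddl)
```

With an explicit schema, values that do not parse as the declared type are typically read as null in the reader's default permissive mode, so it is worth comparing null counts against the inferred version before relying on it.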

## Transform

In [36]:
# Review table: one row per review; review_date is read as a string, so cast it to a date
review_cols = ['review_id', 'customer_id', 'product_id', 'product_parent',
               df['review_date'].cast(DateType()).alias('review_date')]
review_df = df.select(review_cols).orderBy('review_id')
print(review_df.count())
review_df.show(5)

2643619
+--------------+-----------+----------+--------------+-----------+
|     review_id|customer_id|product_id|product_parent|review_date|
+--------------+-----------+----------+--------------+-----------+
|R100065P6TTS3J|   42711834|B000HHM6PS|      64793786| 2014-04-06|
|R10007MH6NTVFM|   25105396|B0090Z9FFC|     675354291| 2015-06-24|
|R1000CIZTRNP23|   25423435|B00K1B6RCI|     308737701| 2015-03-06|
|R1000DL08MOV57|   13165224|B0006L2PCO|     976701490| 2015-04-12|
|R1000JOVLD0J41|   34245087|B000TZ7022|     870738517| 2010-11-13|
+--------------+-----------+----------+--------------+-----------+
only showing top 5 rows



In [29]:
# Customer table: number of reviews written by each customer
customer_df = df.groupBy('customer_id').count().orderBy('customer_id')
customer_df = customer_df.withColumnRenamed('count', 'review_count')
print(customer_df.count())
customer_df.show(5)

1415190
+-----------+------------+
|customer_id|review_count|
+-----------+------------+
|      10003|           2|
|      10164|           1|
|      10206|           1|
|      10227|           2|
|      10228|           1|
+-----------+------------+
only showing top 5 rows



In [31]:
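# Product table: distinct (product_id, product_title) pairs
# Note: a product_id listed under more than one title keeps one row per title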
product_df = df.select(['product_id', 'product_title']).dropDuplicates().orderBy('product_id')
print(product_df.count())
product_df.show(5)

239343
+----------+--------------------+
|product_id|       product_title|
+----------+--------------------+
|0310824230|Advantage Flea Co...|
|039480001X|  The Cat in the Hat|
|0615553605|Pet Qwerks Treat ...|
|0684836483|250 Things You Ca...|
|0761129804|  Pop Bottle Science|
+----------+--------------------+
only showing top 5 rows



In [38]:
# Vine table: rating, vote counts, and Vine / verified-purchase flags for each review
vine_cols = ['review_id', 'star_rating', 'helpful_votes', 'total_votes', 'vine', 'verified_purchase']
vine_df = df.select(vine_cols).orderBy('review_id')
print(vine_df.count())
vine_df.show(5)

2643619
+--------------+-----------+-------------+-----------+----+-----------------+
|     review_id|star_rating|helpful_votes|total_votes|vine|verified_purchase|
+--------------+-----------+-------------+-----------+----+-----------------+
|R100065P6TTS3J|          4|            0|          0|   N|                Y|
|R10007MH6NTVFM|          3|            0|          0|   N|                Y|
|R1000CIZTRNP23|          4|            3|          3|   N|                Y|
|R1000DL08MOV57|          5|            0|          0|   N|                Y|
|R1000JOVLD0J41|          2|            0|          0|   N|                N|
+--------------+-----------+-------------+-----------+----+-----------------+
only showing top 5 rows



## Load
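A minimal sketch of how the four DataFrames could be written to Postgres through the JDBC driver downloaded earlier. The helper names `jdbc_target` and `load_tables`, the connection values (host, port, database, user), and the target table names are all assumptions for illustration; only `db_password` and the DataFrame names come from this notebook.

```python
def jdbc_target(host, port, database, user, password):
    """Build the JDBC URL and connection properties for a Postgres database."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


def load_tables(tables, url, properties, mode="append"):
    """Write each DataFrame to the Postgres table named by its dictionary key."""
    for table_name, frame in tables.items():
        frame.write.jdbc(url=url, table=table_name, mode=mode, properties=properties)


# Example usage inside the notebook (all connection values are placeholders):
# url, properties = jdbc_target("localhost", 5432, "amazon_reviews", "postgres", db_password)
# load_tables(
#     {
#         "review": review_df,
#         "customer": customer_df,
#         "product": product_df,
#         "vine": vine_df,
#     },
#     url,
#     properties,
# )
```

The `append` mode assumes the target tables already exist with matching columns; passing `mode="overwrite"` would instead let Spark drop and recreate them.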