# Amazon Products and Reviews Data (Sports and Outdoor) ETL 
### Data Engineering Capstone Project

#### Project Summary
In this project we explore the [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/).
Particularly focusing on the sports and outdoor category.

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import configparser
from datetime import datetime
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
from pyspark.sql.functions import col, size
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pyspark.sql.types as T


config = configparser.ConfigParser()
config.read('aws.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['default']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['default']['AWS_SECRET_ACCESS_KEY']

In [2]:
# Read in the data here
spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .config("spark.sql.broadcastTimeout", "36000")\
        .set("spark.speculation","false")\
        .getOrCreate()

spark

In [3]:
# Some udfs used later
size_ = udf(lambda xs: len(xs) if xs else 0, IntegerType())

In [4]:
from datetime import datetime
get_timestamp = udf(lambda ms: datetime.fromtimestamp(ms), T.TimestampType())

### Step 1: Scope the Project and Gather Data

#### Scope 
In this project I explored the sports and outdoor category of [amazon review data](https://nijianmo.github.io/amazon/), the end goal is to produce a relational database based on the dataset. The richness of these data would be useful for analysis of various kind. The main tool in this project is PySpark.


#### Describe and Gather Data
The [amazon review data](https://nijianmo.github.io/amazon/) can be downloaded from their project website after filling out a request form. The website provides a review dataset, a product dataset and a core-review dataset "in which all users and items have at least 5 reviews". For this project I opt to work on the full product and review data, but focusing on the 'Sports and Outdoors' section. There exists many other version of the amazon review dataset, even one provided by [amazon](https://s3.amazonaws.com/amazon-reviews-pds/readme.html). The selected dataset stands out as one of the latest verion (2018) as well as it provides a separate products dataset, which believe to be a web crawl of amazon prodcut webpages. This combo provides a better context to understand the reviews and accumulated on Amazon.com.


See below for some sample entries.

For ease of processing and testing
Use split command to slice the dataset into pieces.
Then use aws cli sync to upload to s3 buckets.

Repeat similar operations for the review data.

It is recommended to process on a EC2 instance, to take advantage of the relative high speed within AWS.

The EC2 micro instance provides at most 30 GB of free storage and should be sufficient to process all data.

In [5]:
products_data = "s3a://capstone-zwmtrue/products/products_9011.json"
#work with a subset of 10x5000 records here
products = spark.read.json(products_data)

In [6]:
reviews_data = "s3a://capstone-zwmtrue/reviews/reviews_9011.json"
#work with a subset of 10x5000 records here
reviews = spark.read.json(reviews_data)

### Step 2: Explore and Assess the Data
#### Explore the Data 
First let's take a look at the imported data schema and counts.

In [7]:
reviews.printSchema()

root
 |-- asin: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Color:: string (nullable = true)
 |    |-- Size:: string (nullable = true)
 |    |-- Style Name:: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- vote: string (nullable = true)



In [8]:
df_review = reviews.limit(1000).toPandas()
df_review.count()

asin              1000
image               24
overall           1000
reviewText        1000
reviewTime        1000
reviewerID        1000
reviewerName      1000
style              283
summary            998
unixReviewTime    1000
verified          1000
vote               284
dtype: int64

In [9]:
df_review.head()
# Using pandas b/c the preview is nicer

Unnamed: 0,asin,image,overall,reviewText,reviewTime,reviewerID,reviewerName,style,summary,unixReviewTime,verified,vote
0,B000FDDWB6,,1.0,Unfortunately nowadays you have to spend quite...,"06 15, 2013",A2NCK6EQTR2M3U,Bikenmike,,Gear hunter,1371254400,False,
1,B000FDDWB6,,2.0,"This bike I owned for about 4 months, mostly f...","06 12, 2013",A1NLA4BIBEY5G,J. Nicholson,,"Not really durable, cheaply made",1370995200,True,
2,B000FDDWB6,,5.0,I have had this bike for more than a year. I ...,"06 6, 2013",AWMGBGFW2TIFA,D.R. of IL,,It works for what I needed,1370476800,True,
3,B000FDDWB6,,5.0,I absolutely love this bike! I've had it for a...,"06 4, 2013",A1SDMTG1OKI27M,Skye,,Amazing Product for an Amazing Price!,1370304000,False,
4,B000FDDWB6,,1.0,It is a shame that any company would put a bic...,"05 29, 2013",A1C1AE4YLCLFDZ,Jed Kelson,,A rotten lemon,1369785600,False,2.0


In [10]:
products.printSchema()

root
 |-- also_buy: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- also_view: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- asin: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- date: string (nullable = true)
 |-- description: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- details: string (nullable = true)
 |-- feature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- fit: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- main_cat: string (nullable = true)
 |-- price: string (nullable = true)
 |-- rank: string (nullable = true)
 |-- similar_item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- asin: string (nullable = true)
 |    |    |-- features: struct (nullabl

In [11]:
df_products = products.limit(1000).toPandas()
df_products.count()

also_buy         194
also_view        304
asin            1000
brand            831
category         914
date             138
description      770
details          831
feature          755
fit               28
image            557
main_cat         889
price            387
rank             978
similar_item      12
tech1             11
title           1000
dtype: int64

In [12]:
df_products.head()

Unnamed: 0,also_buy,also_view,asin,brand,category,date,description,details,feature,fit,image,main_cat,price,rank,similar_item,tech1,title
0,"[B0072084UO, B004VQ6170, B0197DBCYS, B01ESKH8S...",,B00FQUR4F8,Scuba Choice,,,[One of the best selling items in Scuba Choice...,"\n <div class=""content"">\n\n\n\n\n\n\n<ul...","[Length: 2-1/4"", Will support the heaviest fis...",,,Sports & Outdoors,$7.98,"770,132inSportsOutdoors(",,,Scuba Choice Archery Bow Fishing Fish Hunting ...
1,,,B00FQV0R7O,Scuba Choice,,,[One of the best selling items in Scuba Choice...,"\n <div class=""content"">\n\n\n\n\n\n\n<ul...","[Length: 3.5"", A compact tool for smoother and...",,,Sports & Outdoors,$7.48,"2,069,324inSportsOutdoors(",,,"Scuba Choice 3.5"" Black Archery Bow Fishing Fi..."
2,"[B076TDHCG8, B01N7NL02R, B07D9P92N9, B07CN7D7L...","[B076TDHCG8, B01N7NL02R, B07D9P92N9, B07GJLQ7K...",B00FQV25WO,The North Face,"[Sports & Outdoors, Outdoor Recreation, Winter...",,[Get back to basics with a breathable cotton/o...,"\n <div class=""content"">\n\n\n\n\n\n\n<ul...","[Cotton,Polyester, Made in USA or Imported, Cl...",,[https://images-na.ssl-images-amazon.com/image...,Sports & Outdoors,"$14.05 - $1,999.00","8,003inSportsOutdoors(",,,The North Face Men's Short Sleeve Half Dome Tee
3,"[B076TDHCG8, B01N7NL02R, B07D9P92N9, B07CN7D7L...","[B076TDHCG8, B01N7NL02R, B07D9P92N9, B07GJLQ7K...",B00FQV31OK,The North Face,"[Sports & Outdoors, Outdoor Recreation, Winter...",,[Get back to basics with a breathable cotton/o...,"\n <div class=""content"">\n\n\n\n\n\n\n<ul...","[Cotton,Polyester, Made in USA or Imported, Cl...",,[https://images-na.ssl-images-amazon.com/image...,Sports & Outdoors,"$14.05 - $1,999.00","8,003inSportsOutdoors(",,,The North Face Men's Short Sleeve Half Dome Tee
4,"[B076TDHCG8, B01N7NL02R, B07D9P92N9, B07CN7D7L...","[B076TDHCG8, B01N7NL02R, B07D9P92N9, B07GJLQ7K...",B00FQV3ISY,The North Face,"[Sports & Outdoors, Outdoor Recreation, Winter...",,[Get back to basics with a breathable cotton/o...,"\n <div class=""content"">\n\n\n\n\n\n\n<ul...","[Cotton,Polyester, Made in USA or Imported, Cl...",,[https://images-na.ssl-images-amazon.com/image...,Sports & Outdoors,"$14.05 - $1,999.00","8,003inSportsOutdoors(",,,The North Face Men's Short Sleeve Half Dome Tee


'asin': Product id and short for 'Amazon Standard Identification Number', the product page can be accessed at [https://amazon.com/dp/0889350426](https://amazon.com/dp/0889350426)

'also_buy','also_view': Array of asins of related products.

'similar_item': This appear to a web scrape of the similar item section on the product page, we only need the respective asin here.

Spin off separate tables for these 3 different categories, this should be helpful to establish product cluster.

The following columns are some of the less complex ones and seems to associate with the product more closely

'brand': Brand name.

'date': Date of product listed as string in American format (MMM d, YYYY)

'description': An array of strings contains the product description. This needs to be concated. There appear to be instances where raw web scrapes are passed in.

'details': Detail section on prodcut page, frequently contain raw html.

'feature_merged': Featue section on product page.

'main_cat': Main category of product.

'price': Price listed, when multiple price for different size/color exists, only extract the first. (This dues part to pyspark doesn't provide an "extract all" feature to its regexp funcitions.

'rank': Rank of the first category listed, this is usually the main category.

'title': Title string.

#### Cleaning Steps

Now we clean the core part of products and reviews

For products we need to 
    1. Convert date to datetime
    2. Merge description, feature into one string per row. 
    3. Extract the first item as the primary price/rank.
    4. Parse the html table in fit column
    5. Designated category id for the main_cat
    6. Rename asin to product_id
    

In [13]:
products_cleaned = (
    products
    .withColumn('product_date', F.to_date('date', 'MMM d, YYYY'))
    .withColumn('description_merged', F.concat_ws(' ', 'description'))
    .withColumn('feature_merged', F.concat_ws(' ', 'feature'))
    .withColumn('first_price', F.regexp_extract('price', r'\d+\.\d+', 0).cast('Decimal'))
    .withColumn('first_rank', F.regexp_extract('rank', r'\d+\.\d+', 0).cast('Int'))
    #.withColumn('fit_str', parse_table_udf(col('fit')))
    .withColumn('fit_id', F.sha2(col('fit'), 256))
    .withColumn('main_cat_id', F.sha2('main_cat', 256))
    .withColumn('brand_id', F.sha2('brand', 256))
)
products_cleaned.printSchema()

root
 |-- also_buy: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- also_view: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- asin: string (nullable = true)
 |-- brand: string (nullable = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- date: string (nullable = true)
 |-- description: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- details: string (nullable = true)
 |-- feature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- fit: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- main_cat: string (nullable = true)
 |-- price: string (nullable = true)
 |-- rank: string (nullable = true)
 |-- similar_item: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- asin: string (nullable = true)
 |    |    |-- features: struct (nullabl

Let's move on to reviews dataset

On the basic level we need to 
    1. Create a primary key.
    2. Convert date string and unix timestamp properly.
    3. Convert vote to integer.
    4. Rename reviewerID to user_Id.

In [14]:
reviews_cleaned = (
    reviews
    .withColumn('review_Id', F.sha2(F.concat_ws(' ', col('asin'), col('reviewerID')), 256))
    .withColumn('review_Date', F.to_date('reviewTime', 'MM dd, yyyy'))
    .withColumn('review_TS', get_timestamp('unixReviewTime'))
    .withColumn('votes', col('vote').cast("Int"))
    .withColumnRenamed('reviewerID', 'user_Id')
)
reviews_cleaned.printSchema()

root
 |-- asin: string (nullable = true)
 |-- image: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- user_Id: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- style: struct (nullable = true)
 |    |-- Color:: string (nullable = true)
 |    |-- Size:: string (nullable = true)
 |    |-- Style Name:: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: long (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- vote: string (nullable = true)
 |-- review_Id: string (nullable = true)
 |-- review_Date: date (nullable = true)
 |-- review_TS: timestamp (nullable = true)
 |-- votes: integer (nullable = true)



### Step 3: Define the Data Model and Run Pipelines to Model the Data 
#### 3.1 Conceptual Data Model
Let's build a star schema around fact reivews. 

fact_review table is defined as follows

In [15]:
fact_review = (
    reviews_cleaned.join(products_cleaned, ['asin'], how='full')
    .withColumn('year', F.year('review_Date'))
    .withColumn('month', F.month('review_Date'))
    .select(
        col('review_Id'), 
        col('asin'), 
        col('overall'), 
        col('reviewText').alias('review_Text'), 
        col('review_Date'),
        col('year'),
        col('month'),
        col('user_Id'), 
        col('review_TS'), 
        col('summary'), 
        col('verified'), 
        col('votes'), 
        col('main_cat_id').alias('category_id'),
        col('brand_id'),
        col('fit_id')
    )
)
fact_review.printSchema()

root
 |-- review_Id: string (nullable = true)
 |-- asin: string (nullable = true)
 |-- overall: double (nullable = true)
 |-- review_Text: string (nullable = true)
 |-- review_Date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- user_Id: string (nullable = true)
 |-- review_TS: timestamp (nullable = true)
 |-- summary: string (nullable = true)
 |-- verified: boolean (nullable = true)
 |-- votes: integer (nullable = true)
 |-- category_id: string (nullable = true)
 |-- brand_id: string (nullable = true)
 |-- fit_id: string (nullable = true)



The following should create our dim_product table in PySpark

In [16]:
dim_product = (
    products_cleaned
        .select(
            col('asin'), 
            col('brand_id'), 
            col('description_merged').alias('description'), 
            col('details'), 
            col('feature_merged').alias('feature'), 
            col('first_price'), 
            col('first_rank'), 
            col('main_cat_id').alias('category_id'), 
            col('product_date'), 
            F.year('product_date').alias('year'),
            F.month('product_date').alias('month'),
            col('fit_id'), 
            col('title')
        )
)
dim_product.printSchema()

root
 |-- asin: string (nullable = true)
 |-- brand_id: string (nullable = true)
 |-- description: string (nullable = false)
 |-- details: string (nullable = true)
 |-- feature: string (nullable = false)
 |-- first_price: decimal(10,0) (nullable = true)
 |-- first_rank: integer (nullable = true)
 |-- category_id: string (nullable = true)
 |-- product_date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- fit_id: string (nullable = true)
 |-- title: string (nullable = true)



The user dimension table

In [19]:
dim_user = (
    reviews_cleaned
    .select(
        'user_id', 
        col('reviewerName').alias('user_name')
    )
)
dim_user.printSchema()

root
 |-- user_id: string (nullable = true)
 |-- user_name: string (nullable = true)



The brand dim table

In [20]:
dim_brand = (
    products_cleaned
    .select(
        col('brand_id'),
        col('brand')
    )
)
dim_brand.printSchema()

root
 |-- brand_id: string (nullable = true)
 |-- brand: string (nullable = true)



The category dim table

In [21]:
dim_category = (
    products_cleaned
    .select(
        col('main_cat_id').alias('category_id'),
        col('main_cat').alias('category')
    )
)
dim_brand.printSchema()

root
 |-- brand_id: string (nullable = true)
 |-- brand: string (nullable = true)



I also included a dim table for the fit data, since it provides another interesting angle of analysis. 
The fit string is provided as html table, I had initially developed a UDF to parse these tables, but seems to not scale well. 
Will leave it here for future references.

In [22]:
import lxml.html
#
def parse_table(s):
    """takes in html table, spit out a dictionary of table elements"""
    table_root = lxml.html.fromstring(s)
    table_list = table_root.xpath('//tr/td//span//text()')
    table_dict = {table_list[k].strip():int(table_list[k+1]) for k in range(0, len(table_list), 2)}
    return table_dict

spark.udf.register("parseTable", parse_table)
parse_table_udf = udf(parse_table, T. MapType(T.StringType(), T.IntegerType()))

In [23]:
dim_fit = (
    products_cleaned
    .select(
        col('fit_id'),
        col('fit')
    )
)
dim_fit.printSchema()

root
 |-- fit_id: string (nullable = true)
 |-- fit: string (nullable = true)



In [28]:
fact_review.show()

+--------------------+----------+-------+--------------------+-----------+----+-----+--------------+-------------------+--------------------+--------+-----+-----------+--------+------+
|           review_Id|      asin|overall|         review_Text|review_Date|year|month|       user_Id|          review_TS|             summary|verified|votes|category_id|brand_id|fit_id|
+--------------------+----------+-------+--------------------+-----------+----+-----+--------------+-------------------+--------------------+--------+-----+-----------+--------+------+
|605e4a1939949635d...|B000FDVVF0|    5.0|Product was as-ad...| 2011-05-14|2011|    5|A3AAEKI85KYR6D|2011-05-13 17:00:00|Excellent product...|    true|    2|       null|    null|  null|
|3ddaa237b7a05a290...|B000FDVVF0|    5.0|I write "Hits the...| 2011-04-22|2011|    4| APGE05SC50SPL|2011-04-21 17:00:00|Hits the middle g...|    true|    5|       null|    null|  null|
|9690a5bf06c5991cc...|B000FDVVF0|    1.0|It was easy enoug...| 2011-04-15|2

In [29]:
dim_product.show()

+----------+--------------------+--------------------+--------------------+--------------------+-----------+----------+--------------------+------------+----+-----+------+--------------------+
|      asin|            brand_id|         description|             details|             feature|first_price|first_rank|         category_id|product_date|year|month|fit_id|               title|
+----------+--------------------+--------------------+--------------------+--------------------+-----------+----------+--------------------+------------+----+-----+------+--------------------+
|B00FQUR4F8|8baeea39de5cf707a...|One of the best s...|
      <div class...|Length: 2-1/4" Wi...|          8|      null|9fa507308e7849891...|        null|null| null|  null|Scuba Choice Arch...|
|B00FQV0R7O|8baeea39de5cf707a...|One of the best s...|
      <div class...|Length: 3.5" A co...|          7|      null|9fa507308e7849891...|        null|null| null|  null|Scuba Choice 3.5"...|
|B00FQV25WO|418e83d148e4c4feb...|Ge

In [30]:
dim_user.show()

+--------------+--------------------+
|       user_id|           user_name|
+--------------+--------------------+
|A2NCK6EQTR2M3U|           Bikenmike|
| A1NLA4BIBEY5G|        J. Nicholson|
| AWMGBGFW2TIFA|          D.R. of IL|
|A1SDMTG1OKI27M|                Skye|
|A1C1AE4YLCLFDZ|          Jed Kelson|
|A2XNQIEI1YEJFN|           Ryan Carr|
|A1BMF7OAXSP4L2|                Seth|
|A3LT3K6E0Q7JO4|    Rebecca Wolinski|
| AOI2NMASX8QJX|   Fil&#039;s review|
|A1Q272JSV2BDML|DarklyDreamingDexter|
|A3Q1942ZWO9BJY|             Francis|
|A13WD26BXDOOBL|    Randall Batridge|
|A37ZQ3KP44OLKA|      C. Comperatore|
| AUPLRWWSBQHQB|      Alex Churchill|
|A2W3N6LQCZOBBV|        Sean Sampson|
|A2LZCCWR6MDA0D|               Bowen|
| AFBZ7G9XOJQR5|             ambschi|
| AE0G2J5UD328R|              Delcio|
| A5B4BPLRX110P|              amarot|
|A20BODDLOJMVQB|George T. Chamber...|
+--------------+--------------------+
only showing top 20 rows



In [31]:
dim_brand.show()

+--------------------+--------------+
|            brand_id|         brand|
+--------------------+--------------+
|8baeea39de5cf707a...|  Scuba Choice|
|8baeea39de5cf707a...|  Scuba Choice|
|418e83d148e4c4feb...|The North Face|
|418e83d148e4c4feb...|The North Face|
|418e83d148e4c4feb...|The North Face|
|418e83d148e4c4feb...|The North Face|
|0c944d4e85f8876ac...|  Kalaj Kutter|
|418e83d148e4c4feb...|The North Face|
|418e83d148e4c4feb...|The North Face|
|4050b2c7254c63794...|          XLAB|
|418e83d148e4c4feb...|The North Face|
|68d667821f8254407...|  GUNS4US Inc.|
|992a91880377c8b30...|       Scanpod|
|418e83d148e4c4feb...|The North Face|
|418e83d148e4c4feb...|The North Face|
|45e1e3e71ed2fd4e1...|          5.11|
|                null|          null|
|cce7668153f0c4c04...|       Gerbing|
|cce7668153f0c4c04...|       Gerbing|
|3bdc7e229b05bc9e9...|         Casio|
+--------------------+--------------+
only showing top 20 rows



In [32]:
dim_category.show()

+--------------------+-----------------+
|         category_id|         category|
+--------------------+-----------------+
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|                null|             null|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|9fa507308e7849891...|Sports & Outdoors|
|                null|             null|
+--------------------+-----------------+
only showing top

In [33]:
dim_fit.show()

+------+----+
|fit_id| fit|
+------+----+
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
|  null|null|
+------+----+
only showing top 20 rows



### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
output_data = "s3a://capstone-zwmtrue/data-model"
fact_review.write.partitionBy('year','month').parquet(output_data + "review.parquet", mode="overwrite")

In [None]:
dim_product.write.partitionBy('year','month').parquet(output_data + "product.parquet", mode="overwrite")

In [None]:
dim_user.write.parquet(output_data + "user.parquet", mode="overwrite")

In [None]:
dim_brand.write.parquet(output_data + "brand.parquet", mode="overwrite")

In [None]:
dim_category.write.parquet(output_data + "category.parquet", mode="overwrite")

In [None]:
dim_fit.write.parquet(output_data + "fit.parquet", mode="overwrite")

#### 4.2 Data Quality Checks

The original data does however include many nulls. 

In [None]:
fact_review.count()

In [None]:
fact_review.printSchema()

In [None]:
dim_product.count()

#### 4.2 Data dictionary 

review_id: ID of review

asin/product_Id: ASIN (Amazon Standard Inventory Number)

overall: Producting score (1-5)

reviewText: Text content of review

review_Date: Date of review

user_Id: Reviewer ID

user_name: Reviewer Name

summary: Summary of user review 

review_TS: Timestamp of review in mms

verified: Verified purchase

votes: Number of vote for this review

category_id: ID of product's main category

category: Strings of product main category. A product can be of many cateogries I choose to just keep the main one for simplicity.

brand_id: ID of product's brand

fit_id: ID of product's fit statistics. 

fit: HTML table of product's fit statistics, needs to be further processed.

description: Product description

details: Details of product 

feature: Feature of product



#### Step 5: Complete Project Write Up

Spark seems to be the ideal tool for this project, I have been wanting to try it out and although there're limitations to the framework, it proves to be very efficient both in performance and the amount of code it takes to perform the task I faced.

This data should probably updated in an annual or semi-annual frequency, since product reviews don't happen very often for the majority of items.

If data is increased by 100 fold, we will need run this ETL Process in batches using spark clusters.

For multiple access, we should consider write the parquets to HDFS.