## Big Data Challenge
### Goals

1) Perform the ETL process completely in the cloud and upload a dataframe to an RDS instance

2) Use PySpark or SQL to perform a statistical analysis of selected data.

Note: I run Spark on my local machine so that I can load the data into my local Postgres database and avoid paying for AWS.

In [32]:
# https://medium.com/beeranddiapers/installing-apache-spark-on-mac-os-ce416007d79f
# installed spark on mac using brew and modified bash_profile 
import os
SPARK_VERSION = 'spark-3.3.1'
import findspark
findspark.init()
from dotenv import load_dotenv
from pyspark.sql.functions import col,to_date,count

# Load .env before reading from it; os.getenv() returns None for unset variables
env_loaded = load_dotenv()
DB_HOST = os.getenv('DB_HOST')
DB_NAME = os.getenv('DB_NAME')
DB_USER = os.getenv('DB_USER')
DB_PASS = os.getenv('DB_PASS')
env_loaded

True
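
Because `os.getenv` returns `None` for a variable that is not set, a missing entry in `.env` may only surface later, at write time. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper (not part of `python-dotenv`), and the values passed in the example are placeholders rather than real credentials.

```python
import os


def require_env(names, environ=os.environ):
    """Return {name: value} for every name, or raise if any is unset or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Example with an explicit mapping (placeholder values, not real credentials).
settings = require_env(
    ["DB_USER", "DB_PASS"],
    {"DB_USER": "demo_user", "DB_PASS": "demo_pass"},
)
```

In the notebook this would be called after `load_dotenv()`, for example `require_env(["DB_USER", "DB_PASS"])`, so that a missing variable stops the run before any Spark work starts.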

In [6]:
# Start Spark session on local machine
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("CloudETLProject").getOrCreate()

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark write to postgres example") \
    .config("spark.jars", "/usr/local/Cellar/apache-spark/3.2.0/libexec/jars/postgresql-42.2.9.jar") \
    .getOrCreate()


In [7]:
from pyspark import SparkFiles
# Load the Amazon Wireless reviews TSV from S3 into a DataFrame
url = "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Wireless_v1_00.tsv.gz"
spark.sparkContext.addFile(url)

In [8]:
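# Note: in Spark datetime patterns "mm" means minute-of-hour ("MM" is month), so this
# timestampFormat does not match the yyyy-MM-dd dates in the file, and review_date
# stays a string in the schema printed below.
df = spark.read.option('header', 'true').csv(
    SparkFiles.get("amazon_reviews_us_Wireless_v1_00.tsv.gz"),
    inferSchema=True,
    sep='\t',
    timestampFormat="mm/dd/yy",
)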
# df.show(10)

In [9]:
# Count number of records in the spark dataframe
df.count()

9002021

In [10]:
# Print schema
df.printSchema()

root
 |-- marketplace: string (nullable = true)
 |-- customer_id: integer (nullable = true)
 |-- review_id: string (nullable = true)
 |-- product_id: string (nullable = true)
 |-- product_parent: integer (nullable = true)
 |-- product_title: string (nullable = true)
 |-- product_category: string (nullable = true)
 |-- star_rating: string (nullable = true)
 |-- helpful_votes: integer (nullable = true)
 |-- total_votes: integer (nullable = true)
 |-- vine: string (nullable = true)
 |-- verified_purchase: string (nullable = true)
 |-- review_headline: string (nullable = true)
 |-- review_body: string (nullable = true)
 |-- review_date: string (nullable = true)



In [24]:
# Create review dataframe
review_table = df.select("customer_id","product_id","product_parent","review_date")

In [26]:
type(review_table)
review_table.show()

+-----------+----------+--------------+-----------+
|customer_id|product_id|product_parent|review_date|
+-----------+----------+--------------+-----------+
|   16414143|B00YL0EKWE|     852431543| 2015-08-31|
|   50800750|B00XK95RPQ|     516894650| 2015-08-31|
|   15184378|B00SXRXUKO|     984297154| 2015-08-31|
|   10203548|B009V5X1CE|     279912704| 2015-08-31|
|     488280|B00D93OVF0|     662791300| 2015-08-31|
|   13334021|B00XVGJMDQ|     421688488| 2015-08-31|
|   27520697|B00KQW1X1C|     554285554| 2015-08-31|
|   48086021|B00IP1MQNK|     488006702| 2015-08-31|
|   12738196|B00HVORET8|     389677711| 2015-08-31|
|   15867807|B00HX3G6J6|     299654876| 2015-08-31|
|    1972249|B00U4NATNQ|     577878727| 2015-08-31|
|   10956619|B00SZEFDH8|     654620704| 2015-08-31|
|   14805911|B00JRJUL9U|     391166958| 2015-08-31|
|   15611116|B00KQ4T0HE|     481551630| 2015-08-31|
|   39298603|B00M0YWKPM|     685107474| 2015-08-31|
|   17552454|B00KDZEE68|     148320945| 2015-08-31|
|   12218556

### Note: why the write to a table failed

I spent HOURS trying to figure out why I could not write the contents of a dataframe to a table.
See https://cumsum.wordpress.com/2020/09/26/pyspark-attributeerror-nonetype-object-has-no-attribute/

The cause was a `.show()` at the end of my create-dataframe command. `show()` prints the rows and returns `None`, so the variable held `None` instead of a dataframe. Instead, create the dataframe in one command, then execute `dataframe.show()` in the next command.
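
The pitfall can be reproduced without Spark. The sketch below uses a hypothetical `TinyFrame` class as a stand-in for a DataFrame (it is not a Spark API, and the row is made up); the only behavior it imitates is that `select` returns a new frame while `show` prints and returns `None`.

```python
class TinyFrame:
    """Hypothetical stand-in for a Spark DataFrame (illustration only)."""

    def __init__(self, rows):
        self.rows = rows

    def select(self, *names):
        # Like DataFrame.select: returns a new frame, so calls can be chained.
        return TinyFrame([{name: row[name] for name in names} for row in self.rows])

    def show(self):
        # Like DataFrame.show: prints for display and returns None.
        for row in self.rows:
            print(row)


df = TinyFrame([{"customer_id": 1, "product_id": "A"}])  # made-up row

# Pitfall: the variable captures the return value of show(), which is None.
broken = df.select("customer_id").show()

# Fix: assign the frame first, then display it in a separate statement.
review_table = df.select("customer_id")
review_table.show()
```

Here `broken` is `None`, so a later `broken.write` would raise the `AttributeError: 'NoneType' object has no attribute ...` described in the linked post, while `review_table` is still a frame.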


In [27]:
review_table.write \
    .mode("overwrite") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/big_data") \
    .option("dbtable", "review_id_table") \
    .option("user", DB_USER) \
    .option("password", DB_PASS) \
    .option("driver", "org.postgresql.Driver") \
    .save()

In [28]:
# Create products dataframe
products_table = df.select("product_id","product_title")

In [29]:
products_table.write \
    .mode("overwrite") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/big_data") \
    .option("dbtable", "products") \
    .option("user", DB_USER) \
    .option("password", DB_PASS) \
    .option("driver", "org.postgresql.Driver") \
    .save()

In [30]:
# Create vine dataframe
vine_table = df.select("review_id","star_rating","helpful_votes","total_votes","vine")

In [31]:
vine_table.write \
    .mode("overwrite") \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/big_data") \
    .option("dbtable", "vine_table") \
    .option("user", DB_USER) \
    .option("password", DB_PASS) \
    .option("driver", "org.postgresql.Driver") \
    .save()
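
The three write cells above differ only in the DataFrame and the table name, so the shared options could be factored out. This is a sketch, not the method used in this notebook: `jdbc_options` and `write_table` are hypothetical helpers, the defaults mirror the URL used in the cells above, and the credentials in the example are placeholders.

```python
def jdbc_options(table, user, password, host="localhost", port=5432, database="big_data"):
    """Build the option dict for a PostgreSQL JDBC write."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def write_table(df, table, user, password, mode="overwrite"):
    """Write a Spark DataFrame to `table` using the shared JDBC options."""
    options = jdbc_options(table, user, password)
    df.write.mode(mode).format("jdbc").options(**options).save()


# Building the options needs no Spark session (placeholder credentials).
opts = jdbc_options("products", "demo_user", "demo_pass")
```

With a running session, each write would then reduce to one call such as `write_table(products_table, "products", DB_USER, DB_PASS)`.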

In [33]:
# customer counts
customers_table = (
    df.select(col('customer_id').cast('int'))
    .groupBy('customer_id')
    .agg(count('customer_id').alias("customer_count"))
)
customers_table.show()

+-----------+--------------+
|customer_id|customer_count|
+-----------+--------------+
|   46909180|             6|
|   42560427|             7|
|   43789873|             3|
|   22037526|             2|
|   34220092|             2|
|   42801586|             1|
|    9565734|             2|
|   15829398|             1|
|   38247118|             1|
|   32478248|             2|
|   48114630|             1|
|   23085063|             1|
|   32787070|             3|
|   43515569|             1|
|    4919528|             2|
|    5088547|             2|
|   41852407|             3|
|   49703087|             1|
|   12713799|             1|
|   36728141|             8|
+-----------+--------------+
only showing top 20 rows

