
Using Pyspark to perform ETL and connect to an AWS RDS instance


tonydefor/Amazon_Vine_Analysis


Amazon_Vine_Analysis

Using PySpark to perform ETL and connect to an AWS RDS instance

Overview

In this project, we have access to approximately 50 datasets. Each one contains reviews of a specific product category, from apparel to wireless products. We'll pick one of these datasets and use PySpark to perform the ETL process: extract the dataset, transform the data, connect to an AWS RDS instance, and load the transformed data into the RDS database tables, which we manage through pgAdmin. Next, we'll use PySpark, Pandas, or SQL to determine whether there is any bias toward favorable reviews from Vine members in our dataset. Then, we'll write a summary of the analysis for Jennifer to submit to the SellBy stakeholders.

Results

Using our knowledge of the cloud ETL process, we'll create an AWS RDS database with tables in pgAdmin, pick a dataset from the Amazon Review datasets, and extract the dataset into a DataFrame. We'll transform the DataFrame into four separate DataFrames that match the table schema in pgAdmin. Then, we'll upload the transformed data into the appropriate tables and run queries in pgAdmin to confirm that the data has been uploaded.
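The extract, transform, and load steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in this repository: the table names and column groupings in `TABLE_COLUMNS`, the helper function names, and the endpoint, database name, and credentials are all hypothetical placeholders, and the sketch assumes the chosen dataset is a tab-separated file with a header row. The `df` and `spark` arguments are expected to be a PySpark DataFrame and SparkSession created elsewhere, with the PostgreSQL JDBC driver available to Spark.

```python
# Hypothetical table layout: adjust to match the schema created in pgAdmin.
TABLE_COLUMNS = {
    "review_id_table": ["review_id", "customer_id", "product_id",
                        "product_parent", "review_date"],
    "products_table": ["product_id", "product_title"],
    "customers_table": ["customer_id"],
    "vine_table": ["review_id", "star_rating", "helpful_votes",
                   "total_votes", "vine", "verified_purchase"],
}


def extract_reviews(spark, path):
    """Read a tab-separated review file into a Spark DataFrame.

    `spark` is an existing SparkSession; `path` is a placeholder for the
    location of the chosen dataset.
    """
    return spark.read.csv(path, sep="\t", header=True, inferSchema=True)


def split_for_tables(df):
    """Return one DataFrame per target table by selecting its columns.

    Aggregations or de-duplication that a real schema might need are
    deliberately left out of this sketch.
    """
    return {name: df.select(*cols) for name, cols in TABLE_COLUMNS.items()}


def jdbc_url(endpoint, port=5432, database="postgres"):
    """Build a PostgreSQL JDBC URL for an RDS endpoint."""
    return f"jdbc:postgresql://{endpoint}:{port}/{database}"


def jdbc_properties(user, password):
    """Connection properties passed to Spark's JDBC writer."""
    return {"user": user, "password": password,
            "driver": "org.postgresql.Driver"}


def load_table(df, table, url, properties, mode="append"):
    """Write a DataFrame into an existing PostgreSQL table over JDBC."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)
```

With a running SparkSession, the pieces would be chained as `extract_reviews`, then `split_for_tables`, then one `load_table` call per entry in the returned dictionary. Using `mode="append"` keeps the tables created in pgAdmin intact rather than replacing them.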

  • First, we create a new database on our Amazon RDS server, and then we run queries in pgAdmin to create the tables for the new database.

Screenshot 2023-01-23 at 1 13 09 AM

  • Next, we extract the dataset we chose for review and load it into a new DataFrame.

Screenshot 2023-01-23 at 1 16 11 AM

  • Using our knowledge of PySpark, Pandas, or SQL, we’ll determine if there is any bias towards reviews that were written as part of the Vine program. For this analysis, we'll determine if having a paid Vine review makes a difference in the percentage of 5-star reviews.

Screenshot 2023-01-23 at 1 18 26 AM
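The comparison in the last step can be sketched in PySpark as below. This is an illustrative outline under stated assumptions, not the exact query shown in the screenshot: it assumes the DataFrame has a `vine` column that flags Vine reviews and a numeric `star_rating` column, and the function names are hypothetical. Any filtering of the reviews before the comparison is left out.

```python
def five_star_percentage(five_star_count, total_count):
    """Percentage of reviews that are 5-star; 0.0 when there are no reviews."""
    if total_count == 0:
        return 0.0
    return 100.0 * five_star_count / total_count


def vine_five_star_summary(df):
    """Summarize 5-star share per value of the `vine` column.

    `df` is assumed to be a PySpark DataFrame with `vine` and
    `star_rating` columns. PySpark is imported lazily so the helper
    above stays usable without a Spark installation.
    """
    from pyspark.sql import functions as F

    rows = (
        df.groupBy("vine")
        .agg(
            F.count("*").alias("total_reviews"),
            # Count a review as 1 when it has five stars, otherwise 0.
            F.sum(F.when(F.col("star_rating") == 5, 1).otherwise(0))
            .alias("five_star_reviews"),
        )
        .collect()
    )
    return {
        row["vine"]: {
            "total_reviews": row["total_reviews"],
            "five_star_reviews": row["five_star_reviews"],
            "five_star_pct": five_star_percentage(
                row["five_star_reviews"], row["total_reviews"]
            ),
        }
        for row in rows
    }
```

Comparing the `five_star_pct` values for the Vine and non-Vine groups gives a first indication of whether Vine reviews skew toward 5 stars. Group sizes matter here, since a small Vine group can produce a percentage that is not very reliable.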
