The aim of this project is to build a scalable ETL pipeline that takes app data from the media app "Sparkify", currently stored in S3, and loads it into five normalized tables in Amazon Redshift. Storing the data in Third Normal Form (3NF) on Redshift enables data analysis while making efficient use of storage.
This zipped folder contains four additional files necessary to run the pipeline:
- 'sql_queries.py': Defines the SQL queries used to create the empty tables and populate them with the Sparkify data (a sketch of the kind of statements it might define appears after this list)
- 'create_tables.py': Creates the tables by executing the SQL queries defined in sql_queries.py
- 'etl.py': Executes the SQL queries that load the data into the created tables on the AWS Redshift cluster (this script and create_tables.py follow the connect-and-execute pattern sketched after this list)
- 'dwh.cfg': Contains the credentials and addresses needed to access the AWS cluster and the Sparkify data, including HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT, ARN, LOG_DATA, LOG_JSONPATH, and SONG_DATA (an example layout appears after the run steps below)
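As a rough illustration only, the sketch below shows the kind of definitions 'sql_queries.py' might hold: one CREATE TABLE statement and one Redshift COPY statement built from the dwh.cfg values. The table name, columns, and config section names are assumptions for illustration, not the project's actual code.

```python
# Hypothetical excerpt in the style of sql_queries.py; the table, columns, and
# config section names ([S3], [IAM_ROLE]) are illustrative assumptions.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# DDL for one of the empty tables that create_tables.py would create.
staging_events_table_create = """
CREATE TABLE IF NOT EXISTS staging_events (
    artist   VARCHAR,
    song     VARCHAR,
    user_id  INTEGER,
    ts       BIGINT
);
"""

# Redshift COPY template that etl.py would run to bulk-load the S3 log data,
# using the LOG_DATA, ARN, and LOG_JSONPATH values from dwh.cfg.
staging_events_copy = """
COPY staging_events FROM '{}'
IAM_ROLE '{}'
JSON '{}'
REGION 'us-west-2';
""".format(
    config.get("S3", "LOG_DATA"),
    config.get("IAM_ROLE", "ARN"),
    config.get("S3", "LOG_JSONPATH"),
)

# Lists like these let the other scripts loop over the queries in order.
create_table_queries = [staging_events_table_create]
copy_table_queries = [staging_events_copy]
```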
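Similarly, 'create_tables.py' and 'etl.py' presumably share a connect-and-execute pattern: read the cluster credentials from dwh.cfg, open a connection to Redshift, and run a list of queries. The sketch below shows that pattern with psycopg2; the [CLUSTER] section name and the run_queries helper are assumptions, not the actual scripts.

```python
# Hypothetical run pattern shared by create_tables.py and etl.py; the
# [CLUSTER] section name and run_queries helper are illustrative assumptions.
import configparser

import psycopg2


def run_queries(queries, conn_string):
    """Connect to the Redshift cluster and execute each query in order."""
    conn = psycopg2.connect(conn_string)
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
        conn.commit()
    conn.close()


if __name__ == "__main__":
    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    # Assumes the [CLUSTER] keys appear in the order HOST, DB_NAME, DB_USER,
    # DB_PASSWORD, DB_PORT, as in the dwh.cfg sketch after the run steps.
    conn_string = "host={} dbname={} user={} password={} port={}".format(
        *config["CLUSTER"].values()
    )
    # create_tables.py would pass the CREATE TABLE queries here; etl.py would
    # pass the COPY and INSERT queries defined in sql_queries.py.
    run_queries([], conn_string)
```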
To execute this pipeline you must:
- Enter your personal AWS details, from an active cluster in the us-west-2 region, into the dwh.cfg file (see the example layout after these steps) and save it in the same folder as the other files listed above
- From a terminal, first run the 'sql_queries.py' file
- Then run the 'create_tables.py' file
- Lastly, run the 'etl.py' file
- Allow several minutes for the data transfer to complete
- Confirm the completion of the ETL pipeline from your AWS account
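For reference, a possible layout for dwh.cfg is sketched below. The section names are assumptions and every value is a placeholder; replace them with your own cluster details, IAM role ARN, and the S3 paths to the Sparkify data.

```
[CLUSTER]
HOST=your-cluster.xxxxxxxxxxxx.us-west-2.redshift.amazonaws.com
DB_NAME=your_db_name
DB_USER=your_db_user
DB_PASSWORD=your_db_password
DB_PORT=5439

[IAM_ROLE]
ARN=arn:aws:iam::123456789012:role/your-redshift-s3-read-role

[S3]
LOG_DATA=s3://your-bucket/log_data
LOG_JSONPATH=s3://your-bucket/log_json_path.json
SONG_DATA=s3://your-bucket/song_data
```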
Thanks to Udacity Mentor Survesh: https://knowledge.udacity.com/questions/1041967