Data Warehouse with Redshift

In this project, we will build an ETL pipeline for a database hosted on Redshift. We'll load data from S3 to staging tables on Redshift and execute SQL to transform data to another table in Star Schema

Their analytics team to continue finding insights into what songs their users are listening to.

Source data

Dataset stored in S3, the udacity-dend bucket is situated in the us-west-2 region:

Song data: s3://udacity-dend/song_data
Log data: s3://udacity-dend/log_data

To properly read log data s3://udacity-dend/log_data, you'll need the following metadata file:

Log metadata: s3://udacity-dend/log_json_path.json

Data model

Staging tables

Star schema tables

From song and event dataset, we'll create a star schema optimized for queries on song play analysis. This following tables:

users table: user in the app
- PK: user_id ~ unique id for all user
artists table:
- PK: artist_id ~ unique id for all artist
songs table:
- PK: song_id ~ unique id for all song
- FK: artist_id ~ artist_id of song
songplays table:
- PK: songplay_id ~ unique id for all song play(auto increase) with page NextSong
- FK: artist_id ~ artist_id of song play
- FK: song_id ~ song_id of song play
- FK: user_id ~ user_id of user play song
time table:
- PK: start time ~ unique timestamp of records in songplays table

Project file

create_table.py is where we'll create our staging/fact and dimension tables for the star schema in Redshift.
etl.py is where we'll load data from S3 into staging tables on Redshift and then process that data into your analytics tables on Redshift.
sql_queries.py is where we'll define SQL statements, which will be imported into the two other files above.
dwh.cfg is where we'll define config of connection to Redshift, S3 part of dataset, IAM role

ETL pipeline

run create_table.py
- drop_tables: drop all table if exist
- create_tables: create all table if not exist
run etl.py
- load_staging_tables: load data from S3 to staging tables in database
- insert_tables: move data from staging tables to star schema's tables
- get_most_played_song: execute query get_most_played_song

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.idea		.idea
data sample		data sample
img		img
README.md		README.md
create_tables.py		create_tables.py
dwh.cfg		dwh.cfg
etl.py		etl.py
sql.py		sql.py
sql_queries.py		sql_queries.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Data Warehouse with Redshift

Source data

Data model

Staging tables

Star schema tables

Project file

ETL pipeline

About

Releases

Packages

Languages

tripham208/repo_pmt_udacity-de_with_aws-002

Folders and files

Latest commit

History

Repository files navigation

Data Warehouse with Redshift

Source data

Data model

Staging tables

Star schema tables

Project file

ETL pipeline

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages