A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs on user activity in the app, and a directory of JSON metadata on the songs in the app.
The task is to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can find insights about what songs their users are listening to.
There are two datasets that reside in S3.
(1). Link to Song data: s3://udacity-dend/song_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata
about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
For example, here is the filepath to a file in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
Single song file example:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "",
"artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
(2). Link to Log data: s3://udacity-dend/log_data
The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These
simulate activity logs from an imaginary music streaming app based on configuration settings.
The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
(1). sql_queries.py
Defines the SQL statements for creating tables, dropping tables, copying data from S3 into the staging tables, and inserting data from the staging
tables into the dimensional tables.
These statements are imported by create_tables.py and etl.py.
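For illustration, a minimal sketch of the kind of statements defined there is shown below. The dwh.cfg config file, its section and key names, the staging table names, and the region are assumptions made for this sketch, not the file's exact contents.

import configparser

# dwh.cfg and its section/key names are assumed for this sketch;
# the config values are expected to include the quotes, e.g.
# LOG_DATA='s3://udacity-dend/log_data'
config = configparser.ConfigParser()
config.read('dwh.cfg')

# drop/create statements, one pair per table
staging_events_table_drop = "drop table if exists staging_events;"

# bulk-load the raw JSON from S3 into a staging table using the IAM role;
# LOG_JSONPATH would point to a JSONPaths file ('auto' also works)
staging_events_copy = """
    copy staging_events
    from {}
    iam_role {}
    format as json {}
    region 'us-west-2';
""".format(config['S3']['LOG_DATA'],
           config['IAM_ROLE']['ARN'],
           config['S3']['LOG_JSONPATH'])

# transform staged data into a dimensional table
song_table_insert = """
    insert into songs (song_id, title, artist_id, year, duration)
    select distinct song_id, title, artist_id, year, duration
    from staging_songs;
"""

# create_tables.py and etl.py iterate over lists of these statements, e.g.
# drop_table_queries, create_table_queries, copy_table_queries, insert_table_queries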
(2). create_tables.py
Creates the fact and dimension tables for the star schema in Redshift.
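A minimal sketch of what this script typically looks like, assuming psycopg2, a dwh.cfg file with a [CLUSTER] section (host, dbname, user, password, port in that order), and query lists named drop_table_queries / create_table_queries exported by sql_queries.py:

import configparser
import psycopg2
from sql_queries import create_table_queries, drop_table_queries


def drop_tables(cur, conn):
    # drop existing tables so the script can be re-run cleanly
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    # create the staging, fact and dimension tables
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    # connect to the Redshift cluster described in dwh.cfg
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()))
    cur = conn.cursor()

    drop_tables(cur, conn)
    create_tables(cur, conn)

    conn.close()


if __name__ == "__main__":
    main()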
(3). etl.py
Executes the SQL statements defined in sql_queries.py: loads data from S3 into the staging tables on Redshift and then processes that data into the analytics tables on Redshift.
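Its main() is essentially the same connection boilerplate as in create_tables.py; what differs are the two loading steps. A sketch, assuming sql_queries.py exports copy_table_queries and insert_table_queries lists:

from sql_queries import copy_table_queries, insert_table_queries


def load_staging_tables(cur, conn):
    # run the COPY statements that bulk-load the raw JSON from S3
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    # transform the staged data into the fact and dimension tables
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()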
(4). README.md
A summary and supporting material for the whole project.
(1). Finish sql_queries.py
(2). Finish create_tables.py
(3). Create an IAM role with read-only access to S3 and a security group.
(4). Launch a Redshift cluster (do not let your cluster run overnight or over the weekend, to prevent unexpectedly high charges). A boto3 sketch of steps (3) and (4) is given after this list.
(5). Run create_tables.py in the terminal by typing "python create_tables.py".
A new terminal can be opened from the "File" tab at the top left: hover over "New" and select "Terminal".
(6). In the query editor under your cluster you will see your database and all of its tables (you must run "python create_tables.py" first).
To use the query editor with your cluster, make sure that:
a. the node type is one of the supported types (DC1.8xlarge, DC2.large, DC2.8xlarge, or DS2.8xlarge), and
b. IAM policies that enable access to the query editor are attached to your IAM user.
Also make sure 'public' is selected in the schema dropdown.
(7). Run etl.py in the terminal by typing "python etl.py".
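For reference, steps (3) and (4) can be scripted with boto3 instead of the AWS console. This is an illustrative sketch only: the region, role name, cluster identifier, node type/count, and password are assumptions to replace with your own values.

import json
import boto3

REGION = 'us-west-2'  # assumed region
iam = boto3.client('iam', region_name=REGION)
redshift = boto3.client('redshift', region_name=REGION)

# step (3): an IAM role that Redshift can assume, with read-only access to S3
iam.create_role(
    RoleName='dwhRole',
    Description='Allows Redshift to read from S3',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{'Effect': 'Allow',
                       'Action': 'sts:AssumeRole',
                       'Principal': {'Service': 'redshift.amazonaws.com'}}]}))
iam.attach_role_policy(
    RoleName='dwhRole',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')
role_arn = iam.get_role(RoleName='dwhRole')['Role']['Arn']

# step (4): launch the cluster (delete it as soon as you are done!)
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='dwh',
    ClusterIdentifier='dwhCluster',
    MasterUsername='dwhuser',
    MasterUserPassword='Passw0rd123',  # placeholder - choose your own
    IamRoles=[role_arn])

The security group from step (3) only needs an inbound rule that opens the Redshift port (5439 by default) to your IP address; that can be added in the EC2 console or with boto3's ec2 client.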
(1). Fact Table
songplays - records in event data associated with song plays i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
(2). Dimension Tables
users - users in the app:
user_id, first_name, last_name, gender, level
songs - songs in the music database:
song_id, title, artist_id, year, duration
artists - artists in the music database:
artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units:
start_time, hour, day, week, month, year, weekday
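As an illustration, the fact table above could be declared in Redshift roughly like this (the data types, identity column, and dist/sort key choices are assumptions, not a prescribed design):

create table if not exists songplays (
    songplay_id  int identity(0,1) primary key,
    start_time   timestamp not null sortkey,
    user_id      int not null,
    level        varchar,
    song_id      varchar distkey,
    artist_id    varchar,
    session_id   int,
    location     varchar,
    user_agent   varchar
);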
(1). Find the most popular artists
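One possible way to write this against the star schema described above (using only the songplays and artists tables and columns listed there):

select a.name as artist_name, count(*) as play_count
from songplays sp
join artists a on a.artist_id = sp.artist_id
group by a.name
order by play_count desc
limit 10;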
(2). Find the most popular songs
select song_title, count(song_title) as song_played
from staging_events
group by song_title
order by song_played desc;
(3). Song plays by hour
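For example, using the time dimension listed above:

select t.hour, count(*) as play_count
from songplays sp
join time t on t.start_time = sp.start_time
group by t.hour
order by t.hour;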
(4). Song plays by level and gender
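For example, joining the users dimension:

select u.level, u.gender, count(*) as play_count
from songplays sp
join users u on u.user_id = sp.user_id
group by u.level, u.gender
order by play_count desc;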
- https://www.klipfolio.com/blog/create-sql-dashboard
- https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-authorization.html
- https://github.com/FedericoSerini/DEND-Project-3-Data-Warehouse-AWS/blob/master/sql_queries.py
- https://github.com/janjagusch/dend_03_data_warehouse
- https://www.postgresql.org/docs/9.4/functions-datetime.html