A music streaming startup called Sparkify has grown its user base and song database and wants to move its
data warehouse to a data lake. Its data resides in S3, in a directory of JSON logs of user activity on the app
as well as a directory of JSON metadata on the songs in the app.
Build an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set
of dimensional tables. This allows the data analytics team to continue finding insights into what songs their users
are listening to.
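As a starting point, a minimal sketch of the Spark session the pipeline might use is shown below; the hadoop-aws package version is an assumption and should match your Hadoop build:

```python
from pyspark.sql import SparkSession

def create_spark_session():
    # hadoop-aws lets Spark read/write S3 via the s3a:// scheme;
    # the version below is an assumption -- match it to your Hadoop build.
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark
```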
The two datasets reside in S3. Here are the links for each:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata
about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
For example:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
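A sketch of how the song files could be read with the session created above; the wildcard path mirrors the partitioning shown, and the artist_* column names are assumptions based on the typical layout of these files:

```python
from pyspark.sql.functions import col

# Sketch: read every song file under the partitioned prefix shown above.
song_data = "s3a://udacity-dend/song_data/*/*/*/*.json"
song_df = spark.read.json(song_data)

# songs dimension: one row per song_id
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# artists dimension: one row per artist_id
artists_table = song_df.select(
    col("artist_id"),
    col("artist_name").alias("name"),
    col("artist_location").alias("location"),
    col("artist_latitude").alias("latitude"),
    col("artist_longitude").alias("longitude"),
).dropDuplicates(["artist_id"])
```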
The second dataset consists of log files in JSON format generated by an event simulator based on the songs above. The activity
logs simulate user activity on an imaginary music streaming app based on configuration settings.
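A sketch of reading the log files and keeping only song-play events; the year/month wildcard layout of log_data and the camelCase column names are assumptions:

```python
from pyspark.sql.functions import col

# Sketch: read the log files and keep only song-play events (page == 'NextSong').
log_data = "s3a://udacity-dend/log_data/*/*/*.json"
log_df = spark.read.json(log_data)
log_df = log_df.filter(log_df.page == "NextSong")

# users dimension, deduplicated on user_id
users_table = log_df.select(
    col("userId").alias("user_id"),
    col("firstName").alias("first_name"),
    col("lastName").alias("last_name"),
    "gender",
    "level",
).dropDuplicates(["user_id"])
```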
You can download the datasets to your desktop and run the process locally. Once all processing is complete, load the output Parquet data back to S3.
etl.py: reads data from S3, processes the data using Spark, and writes it back to S3
dl.cfg: contains your AWS credentials (do not expose them in public); a sample layout follows this list
README.md: explanation of the project process and decisions
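A sketch of how etl.py might load the credentials from dl.cfg; the section name [AWS] and the key names are assumptions, so use whatever your template expects, and keep the values unquoted:

```python
import configparser
import os

# Sketch, assuming dl.cfg looks like this (no quotation marks around values):
#   [AWS]
#   AWS_ACCESS_KEY_ID=...
#   AWS_SECRET_ACCESS_KEY=...
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```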
Using the song and log datasets, create a star schema optimized for queries on song play analysis.
It includes the following tables.
Fact Table:
songplays - records in the log data associated with song plays, i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables:
users - users in the app
user_id, first_name, last_name, gender, level
songs - songs in the music database
song_id, title, artist_id, year, duration
artists - artists in the music database
artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
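Building on the song_df and log_df sketches above, the time and songplays tables could be derived roughly as follows; the ts column is assumed to be in epoch milliseconds, and matching on title, artist name, and duration is an assumption for linking log events to songs:

```python
from pyspark.sql.functions import (
    col, hour, dayofmonth, weekofyear, month, year,
    date_format, monotonically_increasing_id
)

# ts is in epoch milliseconds; casting seconds to timestamp gives start_time
log_df = log_df.withColumn("start_time", (col("ts") / 1000).cast("timestamp"))

# time dimension: break start_time into its parts
time_table = (
    log_df.select("start_time").dropDuplicates()
    .withColumn("hour", hour("start_time"))
    .withColumn("day", dayofmonth("start_time"))
    .withColumn("week", weekofyear("start_time"))
    .withColumn("month", month("start_time"))
    .withColumn("year", year("start_time"))
    .withColumn("weekday", date_format("start_time", "E"))
)

# songplays fact table: match each log event to a song by title, artist and duration
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title)
        & (log_df.artist == song_df.artist_name)
        & (log_df.length == song_df.duration),
        "left",
    )
    .withColumn("songplay_id", monotonically_increasing_id())
    .select(
        "songplay_id", "start_time",
        col("userId").alias("user_id"), "level",
        "song_id", "artist_id",
        col("sessionId").alias("session_id"),
        "location", col("userAgent").alias("user_agent"),
    )
)
```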
(1) Complete the code in etl.py
(2) Create a new IAM user with S3 full access
(check your users to see if you have one already. Normally, create 3 types of users:
user 1 with read-only access
user 2 with full access
user 3 with admin access)
(3) Launch a cluster (specify the master user name as the one with full access)
In the terminal, run python etl.py
A new user was created on AWS IAM with S3 full access. Use its access key ID and secret access key to fill in the credentials in dl.cfg (no quotation marks needed). Cluster type: multi-node; set the number of nodes to 4.
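Once the tables are built, they can be written back to S3 as Parquet. Partitioning songs by year and artist and time by year and month is a common layout, though the exact choice and the output bucket name below are assumptions:

```python
# Sketch: persist the tables from the earlier sketches as Parquet in your own bucket.
output_data = "s3a://your-output-bucket/"  # placeholder bucket name

songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id").parquet(output_data + "songs/")
artists_table.write.mode("overwrite").parquet(output_data + "artists/")
users_table.write.mode("overwrite").parquet(output_data + "users/")
time_table.write.mode("overwrite") \
    .partitionBy("year", "month").parquet(output_data + "time/")
songplays_table.write.mode("overwrite").parquet(output_data + "songplays/")
```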