A music streaming startup called Sparkify has grown its user base and song database and wants to move its
data warehouse to a data lake. Its data resides in S3, in a directory of JSON logs of user activity on the app
as well as a directory of JSON metadata on the songs in the app.
Build an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set
of dimensional tables. This allows the data analytics team to continue finding insights into what songs their users
are listening to.
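As a starting point, a minimal sketch of the Spark session the pipeline might use is shown below; the hadoop-aws package version is an assumption and should match your Hadoop build:

```python
from pyspark.sql import SparkSession

def create_spark_session():
    # hadoop-aws lets Spark read/write S3 via the s3a:// scheme;
    # the version below is an assumption -- match it to your Hadoop build.
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0") \
        .getOrCreate()
    return spark
```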
The two datasets reside in S3. Here are the links for each:
- Song data: s3://udacity-dend/song_data
- Log data: s3://udacity-dend/log_data
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata
about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
For example:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
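A sketch of how the song files could be read with the session created above; the wildcard path mirrors the partitioning shown, and the artist_* column names are assumptions based on the typical layout of these files:

```python
from pyspark.sql.functions import col

# Sketch: read every song file under the partitioned prefix shown above.
song_data = "s3a://udacity-dend/song_data/*/*/*/*.json"
song_df = spark.read.json(song_data)

# songs dimension: one row per song_id
songs_table = song_df.select(
    "song_id", "title", "artist_id", "year", "duration"
).dropDuplicates(["song_id"])

# artists dimension: one row per artist_id
artists_table = song_df.select(
    col("artist_id"),
    col("artist_name").alias("name"),
    col("artist_location").alias("location"),
    col("artist_latitude").alias("latitude"),
    col("artist_longitude").alias("longitude"),
).dropDuplicates(["artist_id"])
```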
The second dataset consists of log files in JSON format generated by an event simulator based on the songs above. The activity
logs simulate user activity on an imaginary music streaming app based on configuration settings.
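A sketch of reading the log files and keeping only song-play events; the year/month wildcard layout of log_data and the camelCase column names are assumptions:

```python
from pyspark.sql.functions import col

# Sketch: read the log files and keep only song-play events (page == 'NextSong').
log_data = "s3a://udacity-dend/log_data/*/*/*.json"
log_df = spark.read.json(log_data)
log_df = log_df.filter(log_df.page == "NextSong")

# users dimension, deduplicated on user_id
users_table = log_df.select(
    col("userId").alias("user_id"),
    col("firstName").alias("first_name"),
    col("lastName").alias("last_name"),
    "gender",
    "level",
).dropDuplicates(["user_id"])
```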
You can download the datasets to your desktop and run the process locally. Once all processing is complete, load the output Parquet data back to S3.
etl.py: reads data from S3, processes the data using Spark, and writes it back to S3
dl.cfg: contains your AWS credentials (do not expose them in public); a sample layout follows this list
README.md: explanation of the project process and decisions
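A sketch of how etl.py might load the credentials from dl.cfg; the section name [AWS] and the key names are assumptions, so use whatever your template expects, and keep the values unquoted:

```python
import configparser
import os

# Sketch, assuming dl.cfg looks like this (no quotation marks around values):
#   [AWS]
#   AWS_ACCESS_KEY_ID=...
#   AWS_SECRET_ACCESS_KEY=...
config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```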
Using the song and log datasets, create a star schema optimized for queries on song play analysis.
It includes the following tables.
Fact Table:
songplays - records in the log data associated with song plays, i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
Dimension Tables:
users - users in the app
user_id, first_name, last_name, gender, level
songs - songs in the music database
song_id, title, artist_id, year, duration
artists - artists in the music database
artist_id, name, location, latitude, longitude
time - timestamps of records in songplays broken down into specific units
start_time, hour, day, week, month, year, weekday
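Building on the song_df and log_df sketches above, the time and songplays tables could be derived roughly as follows; the ts column is assumed to be in epoch milliseconds, and matching on title, artist name, and duration is an assumption for linking log events to songs:

```python
from pyspark.sql.functions import (
    col, hour, dayofmonth, weekofyear, month, year,
    date_format, monotonically_increasing_id
)

# ts is in epoch milliseconds; casting seconds to timestamp gives start_time
log_df = log_df.withColumn("start_time", (col("ts") / 1000).cast("timestamp"))

# time dimension: break start_time into its parts
time_table = (
    log_df.select("start_time").dropDuplicates()
    .withColumn("hour", hour("start_time"))
    .withColumn("day", dayofmonth("start_time"))
    .withColumn("week", weekofyear("start_time"))
    .withColumn("month", month("start_time"))
    .withColumn("year", year("start_time"))
    .withColumn("weekday", date_format("start_time", "E"))
)

# songplays fact table: match each log event to a song by title, artist and duration
songplays_table = (
    log_df.join(
        song_df,
        (log_df.song == song_df.title)
        & (log_df.artist == song_df.artist_name)
        & (log_df.length == song_df.duration),
        "left",
    )
    .withColumn("songplay_id", monotonically_increasing_id())
    .select(
        "songplay_id", "start_time",
        col("userId").alias("user_id"), "level",
        "song_id", "artist_id",
        col("sessionId").alias("session_id"),
        "location", col("userAgent").alias("user_agent"),
    )
)
```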
(1) Complete the code in etl.py
(2) Create a new IAM user with S3 full access
(check your users to see if you have one already. Normally, create 3 types of users:
user 1 with read-only access
user 2 with full access
user 3 with admin access)
(3) Launch a cluster (specify the master user name as the one with full access)
In the terminal, run python etl.py
A new user was created on AWS IAM with S3 full access. Use its access key ID and secret access key to fill in the credentials in dl.cfg (no quotation marks needed). Cluster type: multi-node; set the number of nodes to 4.
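Once the tables are built, they can be written back to S3 as Parquet. Partitioning songs by year and artist and time by year and month is a common layout, though the exact choice and the output bucket name below are assumptions:

```python
# Sketch: persist the tables from the earlier sketches as Parquet in your own bucket.
output_data = "s3a://your-output-bucket/"  # placeholder bucket name

songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id").parquet(output_data + "songs/")
artists_table.write.mode("overwrite").parquet(output_data + "artists/")
users_table.write.mode("overwrite").parquet(output_data + "users/")
time_table.write.mode("overwrite") \
    .partitionBy("year", "month").parquet(output_data + "time/")
songplays_table.write.mode("overwrite").parquet(output_data + "songplays/")
```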