Skip to content

Files

Latest commit

 

History

History

Project_3_Data_Warehouse_with_Redshift

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 

Introduction

A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto cloud. Their data resides in S3 ( a directory of JSON logs on the app's user activity section & a directoty of JSON metadata on the app's song section)

My Task

Build a ETL pipeline to extract data from S3 , stage them in Redshift, transform them into a set of dimensional tables for analytics team to find insights about what songs their users are listening to.

The Datasets

There are two datasets that reside in S3.

  (1). Link to Song data: s3://udacity-dend/song_data

  The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata   
  about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. 
  
  For example,here are filepaths to a file in this dataset.    
  song_data/A/B/C/TRABCEI128F424C983.json

 Single song file example :
 {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "",    
 "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

(2). Link to Log data: s3://udacity-dend/log_data

 It consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These 
 simulate app activity logs from an imaginary music streaming app based on configuration settings.     
 The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset.

 log_data/2018/11/2018-11-12-events.json
 log_data/2018/11/2018-11-13-events.json         

Files in the Repository

(1).sql_queries.py
Define SQL statements like creating tables, drop tables, copying data from S3 to staging tables, inserting data from satging 
tables to dimensional tables.

The statements will be imported into create_table.py and etl.py

(2).create_tables.py
 Create dimension and fact tables for the star schema in Redshift

(3).etl.py
Execute the SQL statement defined in sql_queries.py. Load data from S3 into staging tables on Redshift and then process that data     into the analytics tables on Redshift.

(4).README.md
A summary and supporting material for the whole project. 

Steps to Run the Program

 (1). Finish sql_queries.py
 (2). Finish create_tables.py
 (3). Create a IAM role with read only access and a security group
 (4). Launch a redshift cluster ( Do not let your cluster run overnight or over the weekend to prevent unexpected high charges)
 (5). Run create_tables.py in the terminal by type "python create_tables.py". 
     A new termnial will be available if you click "file" tab on the top left, move your cursor to "new" and select 'terminal'
 (6). In query editor under cluster you will see your database and all your tables (you must run "python create_tables.py" first)
     To use query editor in your cluster, you need to make sure 
     a. the node type meeting the standard (DC1.8xlarge, DC2.large, DC2.8xlarge, DS2.8xlarge either one is fine)
     b. attach Iam policies to query editor to enable access to query editor
     make sure select 'public' in the dropdown button under schema
(7). Run etl.py in the terminal by type "python etl.py". 

Star Schema for Song Play Analysis

(1).Fact Table
songplays - records in event data associated with song plays i.e. records with page NextSong
songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
(2).Dimension Tables
users - users in the app:
    user_id, first_name, last_name, gender, level
songs - songs in music database:
    song_id, title, artist_id, year, duration
artists - artists in music database:
    artist_id, name, location, lattitude, longitude
time - timestamps of records in songplays broken down into specific units:
    start_time, hour, day, week, month, year, weekday

Example Queries

(1).Find the popular artists

(2).Find the popular songs
select song_title, count(song_title) as song_played
from staging_events 
group by song_title
order by song_played desc;

(3).Song plays by hours

(4).song plays by level and gender 

References: