- configparser
- psycopg2
In this project I practice data modeling on AWS, using JSON metadata that describes songs and JSON log files that record user activity. The tables are created and stored in Redshift. In addition, to analyze user activity, fact and dimension tables are built from the song metadata and the activity logs.
- create_tables.py: creates the tables on Redshift
- etl.py: defines the ETL process
- sql_queries.py: defines the SQL queries
- create_tables.drop_tables: Drop existing tables
- create_tables.create_tables: Create the tables
- etl.load_staging_tables: Load the JSON files on S3 into the staging tables
- etl.insert_tables: Insert staging data into the fact and dimension tables
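The create_tables entry points above can be sketched as follows. This is a minimal sketch: the query-list names imported from sql_queries.py and the `[CLUSTER]` config section of dwh.cfg are assumptions, not confirmed names.

```python
# Minimal sketch of create_tables.py. The query lists
# (drop_table_queries, create_table_queries) and the [CLUSTER]
# section of dwh.cfg are assumed names for illustration.
import configparser


def drop_tables(cur, conn, queries):
    """Drop each table so the script can be re-run from a clean state."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn, queries):
    """Create the staging, fact, and dimension tables."""
    for query in queries:
        cur.execute(query)
        conn.commit()


def main():
    # psycopg2 and sql_queries are imported here so the helpers above
    # can be exercised without a live Redshift cluster.
    import psycopg2
    from sql_queries import create_table_queries, drop_table_queries

    config = configparser.ConfigParser()
    config.read("dwh.cfg")
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config["CLUSTER"].values()
        )
    )
    cur = conn.cursor()
    drop_tables(cur, conn, drop_table_queries)
    create_tables(cur, conn, create_table_queries)
    conn.close()
```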
The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
And below is an example of what a single activity log in 2018-11-13-events.json looks like.
{"artist":null,"auth":"Logged In","firstName":"Kevin","gender":"M","itemInSession":0,"lastName":"Arellano","length":null,"level":"free","location":"Harrisburg-Carlisle, PA","method":"GET","page":"Home","registration":1540006905796.0,"sessionId":514,"song":null,"status":200,"ts":1542069417796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.125 Safari\/537.36\"","userId":"66"}
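As a quick sanity check, a single event record like the one above can be parsed with the standard library. Note that `ts` is a Unix epoch in milliseconds, which matters when building the time table later. The record below is abbreviated to a few of the fields shown above.

```python
import json
from datetime import datetime, timezone

# One event record from the log sample above, abbreviated to the
# fields relevant to the songplays fact table.
record = ('{"level": "free", "sessionId": 514, "ts": 1542069417796, '
          '"userId": "66", "page": "Home"}')

event = json.loads(record)

# "ts" is a Unix epoch timestamp in milliseconds, so divide by 1000
# before converting to a datetime.
start_time = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
print(start_time.date())  # 2018-11-13, matching the file name
```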
A summary of the ETL process is below. For more details, see etl.ipynb, etl.py and sql_queries.py.
- Copy song JSON files on S3 to staging_songs table on Redshift.
- Copy log JSON files on S3 to staging_events table on Redshift.
- The column names in the log JSON files differ from the column names of staging_events, so a JSONPaths file is needed. https://docs.aws.amazon.com/ja_jp/redshift/latest/dg/copy-usage_notes-copy-from-json.html
- Select columns for song ID, title, artist ID, year, and duration from staging_songs.
- Execute an insert query to songs table in Redshift.
- Select columns for artist ID, name, location, latitude, and longitude from staging_songs.
- Execute an insert query to artists table in Redshift.
- Extract the timestamp, hour, day, week of year, month, year, and weekday from staging_events.
- Execute an insert query to time table in Redshift.
- Select columns for user ID, first name, last name, gender and level from staging_events.
- Execute an insert query to users table in Redshift.
- Select the timestamp, user ID, level, song ID, artist ID, session ID, location, and user agent from staging_events.
- Log files don't include song ID or artist ID, so these IDs are obtained by executing a select query against the songs and artists tables.
- Execute an insert query to songplays table in Redshift.
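Taken together, the last three steps amount to a single INSERT ... SELECT. The sketch below is illustrative only: the staging column names (which mirror the JSON fields shown above), the join keys (song title and artist name), and the 'NextSong' page filter are all assumptions about the actual statement in sql_queries.py. Redshift has no from_unixtime, so the millisecond epoch in `ts` is converted by adding an interval to TIMESTAMP 'epoch'.

```python
# Sketch of the songplays insert. Staging column names are assumed to
# mirror the JSON fields in the log sample above.
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
       e.userId,
       e.level,
       s.song_id,
       a.artist_id,
       e.sessionId,
       e.location,
       e.userAgent
FROM staging_events e
JOIN songs s ON e.song = s.title
JOIN artists a ON e.artist = a.name
WHERE e.page = 'NextSong';
"""
```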
Create the tables and run the ETL:
$ python create_tables.py
$ python etl.py
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull |
---|---|---|---|---|---|---|---|
public | songplays | songplay_id | bigint | lzo | FALSE | 0 | TRUE |
public | songplays | start_time | timestamp without time zone | lzo | FALSE | 0 | TRUE |
public | songplays | user_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | songplays | level | character varying(256) | lzo | FALSE | 0 | FALSE |
public | songplays | song_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | songplays | artist_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | songplays | session_id | integer | lzo | FALSE | 0 | FALSE |
public | songplays | location | character varying(256) | lzo | FALSE | 0 | FALSE |
public | songplays | user_agent | character varying(256) | lzo | FALSE | 0 | FALSE |
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull |
---|---|---|---|---|---|---|---|
public | users | user_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | users | first_name | character varying(256) | lzo | FALSE | 0 | FALSE |
public | users | last_name | character varying(256) | lzo | FALSE | 0 | FALSE |
public | users | gender | character varying(256) | lzo | FALSE | 0 | FALSE |
public | users | level | character varying(256) | lzo | FALSE | 0 | FALSE |
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull |
---|---|---|---|---|---|---|---|
public | songs | song_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | songs | title | character varying(256) | lzo | FALSE | 0 | FALSE |
public | songs | artist_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | songs | year | integer | lzo | FALSE | 0 | FALSE |
public | songs | duration | double precision | none | FALSE | 0 | FALSE |
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull |
---|---|---|---|---|---|---|---|
public | artists | artist_id | character varying(256) | lzo | FALSE | 0 | TRUE |
public | artists | name | character varying(256) | lzo | FALSE | 0 | FALSE |
public | artists | location | character varying(256) | lzo | FALSE | 0 | FALSE |
public | artists | latitude | double precision | none | FALSE | 0 | FALSE |
public | artists | longitude | double precision | none | FALSE | 0 | FALSE |
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull |
---|---|---|---|---|---|---|---|
public | time | start_time | timestamp without time zone | lzo | FALSE | 0 | TRUE |
public | time | hour | integer | lzo | FALSE | 0 | FALSE |
public | time | day | integer | lzo | FALSE | 0 | FALSE |
public | time | week | integer | lzo | FALSE | 0 | FALSE |
public | time | month | integer | lzo | FALSE | 0 | FALSE |
public | time | year | integer | lzo | FALSE | 0 | FALSE |
public | time | weekday | integer | lzo | FALSE | 0 | FALSE |
I would like to thank Udacity for their advice and review.