### Delete Existing Resources

In [1]:
%%time
!python scripts/delete_resources.py

ERROR:root:Cluster 'dwhCluster' does not exist.
ERROR:root:Another error occurred: CusterProps must not be None. Skipping Ingress revocation.
ERROR:root:Cluster 'dwhCluster' does not exist. Skipping cluster deletion.
ERROR:root:IAM role 'dwhRole' does not exist. Skipping IAM role deletion.
CPU times: user 8.95 ms, sys: 11 ms, total: 20 ms
Wall time: 1.62 s


### Create Required Resources
This includes the IAM role, Redshift cluster, and ingress rules.

In [2]:
%%time
!python scripts/create_resources.py

INFO:root:IAM Role Created: ARN arn:aws:iam::736720720705:role/dwhRole
INFO:root:Cluster is creating...2.384185791015625e-07 seconds elapsed.
INFO:root:Cluster is creating...10.238889217376709 seconds elapsed.
INFO:root:Cluster is creating...20.997263431549072 seconds elapsed.
INFO:root:Cluster is creating...31.69753861427307 seconds elapsed.
INFO:root:Cluster is creating...42.36496829986572 seconds elapsed.
INFO:root:Cluster is creating...53.10138392448425 seconds elapsed.
INFO:root:Cluster is creating...63.818397998809814 seconds elapsed.
INFO:root:Cluster is creating...74.51712822914124 seconds elapsed.
CPU times: user 671 ms, sys: 239 ms, total: 911 ms
Wall time: 1min 29s
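
The log above shows the script polling roughly every ten seconds until the cluster is ready. The contents of `scripts/create_resources.py` are not shown here, so the following is only a minimal sketch of that kind of polling loop. The function `wait_until_available` and its `get_status` callable are hypothetical names introduced for illustration, and the status string `"available"` is an assumption about what the status source reports.

```python
import logging
import time


def wait_until_available(get_status, poll_seconds=10.0, timeout_seconds=600.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'available' or the timeout passes.

    get_status is a hypothetical callable returning the cluster status string.
    Returns the elapsed seconds on success; raises TimeoutError otherwise.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if get_status() == "available":
            return elapsed
        if elapsed >= timeout_seconds:
            raise TimeoutError(f"Cluster not available after {elapsed:.0f} s")
        logging.info("Cluster is creating...%.1f seconds elapsed.", elapsed)
        sleep(poll_seconds)
```

With boto3, `get_status` could plausibly wrap `describe_clusters(ClusterIdentifier=...)["Clusters"][0]["ClusterStatus"]`. Injecting `sleep` and `clock` keeps the loop testable without waiting in real time.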


### Create Tables
Staging tables are created first, followed by the fact and dimension tables.

In [5]:
%%time
!python scripts/create_tables.py

CPU times: user 45 ms, sys: 24.5 ms, total: 69.6 ms
Wall time: 6.27 s


**ERD**

The entity relationship diagram for the generated tables is shown below:

![erd](images/erd.png)

The fact and dimension tables are arranged in a star schema. The staging tables stand alone, with no relationships to the other tables.

### ELT
Data is first loaded from S3 into the staging tables, then transformed from the staging tables into the star schema.

In [7]:
%%time
!python scripts/etl.py load

INFO:root:Loading staging data.
INFO:root:Staging tables loaded.
CPU times: user 37.9 s, sys: 12.3 s, total: 50.2 s
Wall time: 1h 54min 11s
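
Loading from S3 into Redshift staging tables is typically done with the `COPY` command. The statements used by `scripts/etl.py` are not shown here, so this is only a sketch of how such a statement might be assembled. The helper `build_copy_statement`, the bucket path, the account ID in the role ARN, and the assumption that the source files are JSON are all hypothetical.

```python
def build_copy_statement(table, s3_path, iam_role_arn, json_option="auto", region=None):
    """Return a Redshift COPY statement for JSON files stored in S3.

    Values are interpolated directly, so they should come from trusted
    configuration only, never from user input.
    """
    parts = [
        f"COPY {table}",
        f"FROM '{s3_path}'",
        f"IAM_ROLE '{iam_role_arn}'",
        f"FORMAT AS JSON '{json_option}'",
    ]
    if region:
        parts.append(f"REGION '{region}'")
    return "\n".join(parts) + ";"


# Hypothetical bucket and role ARN, for illustration only
statement = build_copy_statement(
    table="staging_songs",
    s3_path="s3://example-bucket/song_data",
    iam_role_arn="arn:aws:iam::123456789012:role/dwhRole",
    region="us-west-2",
)
print(statement)
```

The resulting string can be passed to `cur.execute(...)` on a psycopg2 cursor connected to the cluster.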


In [9]:
%%time
!python scripts/etl.py etl

INFO:root:Performing ETL.
INFO:root:ETL completed.
CPU times: user 37.6 ms, sys: 20.1 ms, total: 57.6 ms
Wall time: 5.45 s


### Sample Queries
A few simple queries check the table row counts and produce basic analytics, such as the most played artists and songs.

In [8]:
import psycopg2
from scripts.helpers import LoadConfig

In [10]:
# Load parameters from dwh.cfg
config = LoadConfig(autoload=True)

conn = psycopg2.connect(
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    host=config.get("CLUSTER", "HOST"),
    port=config.get("CLUSTER", "DB_PORT")
)
cur = conn.cursor()

In [11]:
# Table row counts
query = """ 
select 'staging_events' as table_name, count(*) as row_count from staging_events
union select 'staging_songs' as table_name, count(*) as row_count from staging_songs
union select 'f_songplay' as table_name, count(*) as row_count from f_songplay
union select 'd_artist' as table_name, count(*) as row_count from d_artist
union select 'd_song' as table_name, count(*) as row_count from d_song
union select 'd_user' as table_name, count(*) as row_count from d_user
union select 'd_time' as table_name, count(*) as row_count from d_time
order by 2 desc;
"""
cur.execute(query)
print(cur.fetchall())

# Top played artists
# Note: a single artist_id can map to several names in d_artist,
# e.g. ABCD -> "Some Artist" and ABCD -> "Some Artist; Some Backing Artist".
# The subquery keeps one name per artist_id so plays are not double-counted.
query = """
select 
	da.name,
	count(fs.song_id) as song_play_count
from f_songplay fs
join (
	select
		artist_id,
		name,
		row_number() over (partition by artist_id order by name) as row_number
	from d_artist
) da on da.artist_id = fs.artist_id and da.row_number = 1
group by da.name
order by 2 desc
limit 10;
"""
cur.execute(query)
print(cur.fetchall())

# Top played songs by listen time
# The same row_number() subquery keeps one name per artist_id in d_artist
query = """
select
	ds.title as song_title,
	da.name as artist_name,
	sum(ds.duration) as song_play_time
from f_songplay fs
join d_song ds on ds.song_id = fs.song_id
join (
	select
		artist_id,
		name,
		row_number() over (partition by artist_id order by name) as row_number
	from d_artist
) da on da.artist_id = fs.artist_id and da.row_number = 1
group by 1,2
order by 3 desc
limit 10;
"""
cur.execute(query)
print(cur.fetchall())

|song_title|artist_name|song_play_time|
|----------|-----------|--------------|
|Greece 2000|3 Drives On A Vinyl|24762.881199999996|
|Sehr kosmisch|Harmonia|13771.32771|
|You're The One|Dwight Yoakam|8854.370100000007|
|Stronger|Kanye West|8737.26728|
|What Goes Around...Comes Around|Justin Timberlake|6554.84754|
|Yellow|Coldplay|6440.217719999998|
|Revelry|Kings Of Leon|5448.47742|
|Bring Me To Life|Evanescence|5299.00763|
|Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile)|Barry Tuckwell/Academy of St Martin-in-the-Fields/Sir Neville Marriner|5266.015870000001|
|Just Dance|Lady GaGa|5222.523800000001|
