# Data Modeling with Postgres
In this project I apply data modeling with Postgres and build an ETL pipeline in Python.

Assume a startup wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to.

**I created a database schema and ETL pipeline for this analysis.**


Currently, the data resides in two directories: one with JSON logs of user activity on the app, and one with JSON metadata on the songs in the app.

## My data sources:

### **Song Dataset**
The song dataset is a subset of the [Million Song Dataset](http://millionsongdataset.com/).

Sample record:
```
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```
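
As a rough sketch of how one such record could map onto the `songs` and `artists` dimension tables described in the Database Schema section, the snippet below splits the sample record into two tuples. The function name `split_song_record` is hypothetical, and the column order is assumed to follow the table definitions in this README.

```python
import json

# Sample record from the song dataset (copied from above).
SAMPLE = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, '
    '"artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)

def split_song_record(raw):
    """Split one song JSON string into a songs row and an artists row (hypothetical helper)."""
    rec = json.loads(raw)
    song_row = (rec["song_id"], rec["title"], rec["artist_id"],
                rec["year"], rec["duration"])
    artist_row = (rec["artist_id"], rec["artist_name"], rec["artist_location"],
                  rec["artist_latitude"], rec["artist_longitude"])
    return song_row, artist_row

song_row, artist_row = split_song_record(SAMPLE)
print(song_row)
print(artist_row)
```

JSON `null` values come back as Python `None`, which psycopg2 writes as SQL `NULL`.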

### **Log Dataset**
The log dataset is generated by the [Event Simulator](https://github.com/Interana/eventsim).

Sample record:
```
{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}
```

### Database Schema

The schema I used is a star schema.
There is one main **fact table** containing the measures associated with each event (user song plays), and 4 **dimension tables**, each with a primary key that is referenced from the fact table.
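
To make the fact-to-dimension relationship concrete, here is a minimal sketch that joins `songplays` to `songs` and `artists` and counts plays per song. It uses Python's built-in `sqlite3` as a stand-in so it runs without a Postgres server. The tables are trimmed to a few of the columns listed below, and the two songplay rows are made up for illustration; only the song and artist come from the sample song record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Postgres, illustration only
cur = conn.cursor()

# Trimmed-down dimension and fact tables (a subset of the real columns).
cur.executescript("""
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT NOT NULL, artist_id TEXT NOT NULL);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    user_id INT NOT NULL,
    song_id TEXT NOT NULL REFERENCES songs (song_id),
    artist_id TEXT NOT NULL REFERENCES artists (artist_id)
);
""")

cur.execute("INSERT INTO artists VALUES (?, ?)",
            ("ARJIE2Y1187B994AB7", "Line Renaud"))
cur.execute("INSERT INTO songs VALUES (?, ?, ?)",
            ("SOUPIRU12A6D4FA1E1", "Der Kleine Dompfaff", "ARJIE2Y1187B994AB7"))
# Hypothetical play events: two users playing the same song.
cur.executemany(
    "INSERT INTO songplays (user_id, song_id, artist_id) VALUES (?, ?, ?)",
    [(39, "SOUPIRU12A6D4FA1E1", "ARJIE2Y1187B994AB7"),
     (8, "SOUPIRU12A6D4FA1E1", "ARJIE2Y1187B994AB7")],
)

# Join the fact table to its dimensions through their primary keys.
cur.execute("""
SELECT s.title, a.name, COUNT(*) AS plays
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
JOIN artists a ON sp.artist_id = a.artist_id
GROUP BY s.title, a.name
ORDER BY plays DESC
""")
rows = cur.fetchall()
print(rows)
conn.close()
```

The same `SELECT` should carry over to Postgres unchanged, since it only relies on standard joins and aggregation.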


#### Fact Table
**songplays** - records in the log data associated with song plays, i.e. records with page `NextSong`
- songplay_id (INT) PRIMARY KEY: ID of each user song play
- start_time (DATE) NOT NULL: Timestamp of beginning of user activity
- user_id (INT) NOT NULL: ID of user
- level (TEXT): User level {free | paid}
- song_id (TEXT) NOT NULL: ID of Song played
- artist_id (TEXT) NOT NULL: ID of Artist of the song played
- session_id (INT): ID of the user Session 
- location (TEXT): User location 
- user_agent (TEXT): Agent used by user to access Sparkify platform
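
A minimal sketch of the `NextSong` filter mentioned above, using only the standard library. It assumes the log files hold one JSON object per line. The first event is the sample log record trimmed to a few fields; the second is a made-up `NextSong` event added so the filter has something to keep. The function name `song_play_events` is hypothetical.

```python
import json

# Two log lines, one JSON object per line. The second event is hypothetical.
LOG_LINES = [
    '{"artist": null, "page": "Home", "sessionId": 38, "song": null, '
    '"ts": 1541105830796, "userId": "39"}',
    '{"artist": "Line Renaud", "page": "NextSong", "sessionId": 38, '
    '"song": "Der Kleine Dompfaff", "ts": 1541105930796, "userId": "39"}',
]

def song_play_events(lines):
    """Parse JSON log lines and keep only the song-play (NextSong) events."""
    events = (json.loads(line) for line in lines)
    return [e for e in events if e["page"] == "NextSong"]

plays = song_play_events(LOG_LINES)
print(len(plays), plays[0]["song"])
```

Note that `userId` arrives as a string in the logs, so it would need converting to `int` before being written to an INT column.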

#### Dimension Tables
**users** - users in the app
- user_id (INT) PRIMARY KEY: ID of user
- first_name (TEXT) NOT NULL: First name of user
- last_name (TEXT) NOT NULL: Last name of user
- gender (TEXT): Gender of user {M | F}
- level (TEXT): User level {free | paid}

**songs** - songs in music database
- song_id (TEXT) PRIMARY KEY: ID of Song
- title (TEXT) NOT NULL: Title of Song
- artist_id (TEXT) NOT NULL: ID of song Artist
- year (INT): Year of song release
- duration (FLOAT) NOT NULL: Song duration in seconds

**artists** - artists in music database
- artist_id (TEXT) PRIMARY KEY: ID of Artist
- name (TEXT) NOT NULL: Name of Artist
- location (TEXT): Name of artist's city
- lattitude (FLOAT): Latitude of artist's location
- longitude (FLOAT): Longitude of artist's location

**time** - timestamps of records in songplays broken down into specific units
- start_time (DATE) PRIMARY KEY: Timestamp of row
- hour (INT): Hour of start_time
- day (INT): Day of month of start_time
- week (INT): Week of year of start_time
- month (INT): Month of start_time
- year (INT): Year of start_time
- weekday (TEXT): Name of the weekday of start_time
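
The breakdown above can be sketched with the standard library alone. This assumes the `ts` field in the logs is milliseconds since the Unix epoch and interprets it in UTC; both are assumptions, and the function name `time_row` is hypothetical. The week number used here is the ISO week.

```python
from datetime import datetime, timezone

def time_row(ts_ms):
    """Break an epoch timestamp in milliseconds into the time-table fields."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)  # assumed UTC
    return {
        "start_time": t,
        "hour": t.hour,
        "day": t.day,
        "week": t.isocalendar()[1],   # ISO week number
        "month": t.month,
        "year": t.year,
        "weekday": t.strftime("%A"),  # weekday name
    }

row = time_row(1541105830796)  # ts of the sample log record
print(row["start_time"].isoformat(), row["weekday"])
```

Keeping the hour alongside the date only makes sense if `start_time` is stored with its time component, which is worth checking against the column type used in the actual table definition.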

```python
import os
import glob
import psycopg2
import pandas as pd
from sql import *
```

Refer to the official [website](https://www.postgresql.org/download/) or [this site](https://www.tutorialspoint.com/python_data_access/python_postgresql_database_connection.htm) for instructions on installing the Postgres database.

```python
# Connect to the Postgres database using the psycopg2 package.
conn = psycopg2.connect(database='postgres', user='postgres', password='postgres', host='127.0.0.1')
cursor = conn.cursor()
cursor.execute("select version()")
data = cursor.fetchone()
print("Connection established to: ", data)
```

```
Connection established to:  ('PostgreSQL 14.1 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit',)
```
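
The connection above hard-codes its credentials. One possible alternative, sketched here as an assumption rather than as part of this project, is to read them from environment variables and fall back to the values used above. The helper name `connection_params` is hypothetical; the variable names follow the `PG*` convention used by PostgreSQL client tools.

```python
import os

def connection_params(env=os.environ):
    """Collect psycopg2.connect keyword arguments from environment variables."""
    return {
        "database": env.get("PGDATABASE", "postgres"),
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", "postgres"),
        "host": env.get("PGHOST", "127.0.0.1"),
    }

# Pass a plain dict instead of os.environ to see the override behaviour.
params = connection_params({"PGHOST": "db.example.internal"})
print(params["host"], params["database"])

# Usage with a running server:
# conn = psycopg2.connect(**connection_params())
```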


```python
# Close the connection when finished:
# conn.close()
```

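```python
def get_file(filepath):
    """Return the absolute paths of all JSON files under filepath."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        # Collect every .json file in the current directory.
        for f in glob.glob(os.path.join(root, '*.json')):
            all_files.append(os.path.abspath(f))
    return all_files
```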

```python
for root, dirs, files in os.walk('data/song_data'):
    print(root, dirs, files)
```

data/song_data ['A'] []
data/song_data/A ['A', 'B'] []
data/song_data/A/A ['A', 'C', 'B'] []
data/song_data/A/A/A [] ['TRAAAEF128F4273421.json', 'TRAAARJ128F9320760.json', 'TRAAAFD128F92F423A.json', 'TRAAAPK128E0786D96.json', 'TRAAAVO128F93133D4.json', 'TRAAABD128F429CF47.json', 'TRAAAAW128F429D538.json', 'TRAAADZ128F9348C2E.json', 'TRAAAVG12903CFA543.json', 'TRAAAMO128F1481E7F.json', 'TRAAAMQ128F1460CD3.json']
data/song_data/A/A/C [] ['TRAACOW128F933E35F.json', 'TRAACPE128F421C1B9.json', 'TRAACTB12903CAAF15.json', 'TRAACFV128F935E50B.json', 'TRAACIW12903CC0F6D.json', 'TRAACVS128E078BE39.json', 'TRAACER128F4290F96.json', 'TRAACSL128F93462F4.json', 'TRAACHN128F1489601.json', 'TRAACLV128F427E123.json', 'TRAACZK128F4243829.json', 'TRAACCG128F92E8A55.json', 'TRAACQT128F9331780.json', 'TRAACNS128F14A2DF5.json']
data/song_data/A/A/B [] ['TRAABDL12903CAABBA.json', 'TRAABXG128F9318EBD.json', 'TRAABYN12903CFD305.json', 'TRAABNV128F425CEE1.json', 'TRAABJL12903CDCF1A.json', 'TRAABCL128F4286650.js