Udacity-Nano-Degree-Cloud-Data-Warehouse

Introduction

  • A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata about the songs in the app.

  • We will build an ETL data pipeline to move this data from S3 into a data warehouse on Amazon Redshift.

Architecture

Architecture diagram: data_warehouse_architecture.png

Requirements

  • Create an AWS Redshift cluster
  • Create an IAM role with permission to read the required data stored in AWS S3
  • Access permissions to the AWS Redshift cluster
  • Python 3
  • Install psycopg2: pip install psycopg2
  • Fill in the appropriate configuration values in the dwh.cfg file; these settings grant access to AWS resources, so keep the file private (a sketch of reading it follows this list).
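
For reference, a minimal sketch of reading dwh.cfg from Python with configparser. The section and key names used here (CLUSTER/HOST, DB_NAME, IAM_ROLE/ARN, and so on) are assumptions about the config layout, not the actual file contents.

import configparser

# Minimal sketch: read dwh.cfg with configparser.
# NOTE: the section/key names below are assumptions for illustration;
# match them to the actual dwh.cfg layout.
config = configparser.ConfigParser()
config.read("dwh.cfg")

host = config.get("CLUSTER", "HOST")          # Redshift cluster endpoint
db_name = config.get("CLUSTER", "DB_NAME")
db_user = config.get("CLUSTER", "DB_USER")
db_password = config.get("CLUSTER", "DB_PASSWORD")
db_port = config.get("CLUSTER", "DB_PORT")
iam_role_arn = config.get("IAM_ROLE", "ARN")  # role allowed to read the S3 data

print(f"Will connect to {host}:{db_port}/{db_name} as {db_user}")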

Dataset

Song dataset

  • The dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song.

  • Example: {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

User activity dataset

  • The second dataset consists of log files in JSON format generated by the event simulator based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.

Log data preview: Log_Data.img

Create Table Schema

Execute the create_tables.py script:

python create_tables.py

This will connect to the Redshift cluster and execute the queries that create the following tables, covering both the staging tables and the dimensional model (a connection sketch follows the list):

  • Staging
    • staging_events
    • staging_songs
  • Dimensional model
    • songplays
    • users
    • songs
    • artists
    • time
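
For illustration, a minimal sketch of the pattern create_tables.py follows: connect to Redshift with psycopg2 using the values from dwh.cfg, then run the DDL. The songplays columns shown here are assumptions included only to show the shape of a star-schema fact table; the real DDL lives in the project's SQL queries.

import configparser
import psycopg2

# Minimal sketch of the create_tables.py pattern: connect to Redshift via
# psycopg2 and run DDL. The songplays definition below is illustrative only.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Example fact-table DDL (column names are assumptions, not the project schema)
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
conn.commit()
conn.close()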

Build ETL Pipeline

Execute the etl.py script:

python etl.py

This will connect to the Redshift cluster and load the data from S3:

  • Song data: s3://udacity-dend/song_data
  • Log data: s3://udacity-dend/log_data

The data is first loaded into the staging tables, staging_events and staging_songs, and then transformed into the analytics tables of the star schema for data consumers (a sketch of both phases follows).
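
As a minimal sketch of what those two phases amount to in etl.py: a COPY from S3 into a staging table using the IAM role, then an INSERT ... SELECT into an analytics table. The bucket paths come from this README; the JSON format option, region, and column mappings are assumptions for illustration.

import configparser
import psycopg2

# Minimal sketch of the two ETL phases: COPY raw JSON from S3 into staging,
# then INSERT ... SELECT into the star-schema tables.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()
iam_role = config.get("IAM_ROLE", "ARN")

# Phase 1: bulk-load song JSON from S3 into a staging table.
# The JSON 'auto' option and region are assumptions for illustration.
cur.execute(f"""
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{iam_role}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")

# Phase 2: transform staging rows into a dimension table (illustrative columns).
cur.execute("""
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    SELECT DISTINCT song_id, title, artist_id, year, duration
    FROM staging_songs
    WHERE song_id IS NOT NULL;
""")
conn.commit()
conn.close()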
