Skip to content

Data Engineer (Udacity): Project 4 Data Lakes with Spark on Amazon Web Service (AWS)

Notifications You must be signed in to change notification settings

tianlinhe/data_lakes_Spark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DATA LAKES WITH SPARK

Author: Tianlin He

Date: 18 Oct 2020

Tag: #Udacity #Data Engineering #AWS #S3 #Spark

Project Overview

A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

As their data engineer, you are tasked with building an ETL pipeline that

  1. extracts the data (format=JSON) from S3,
  2. process the data with Spark ,
  3. load the data back to S3 (format=Parquet) as a set of dimensional tables.

You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.

Dataset

You'll be working with two datasets that reside in S3 with the following link:

  • Song data: s3://udacity-dend/song_data

  • Log data: s3://udacity-dend/log_data

Song Data

The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Data

The log data are in JSON format files generated by this event simulator based on the songs in the dataset above. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

And below is an example of what the data in a log file, 2018-11-12-events.json, looks like:

img

Project Development

The project is developed in a local enviornment (MacOS) and based on libraries including pyspark spark.jars.packages, and org.apache.hadoop:hadoop-aws:2.7.0.

Overview of the steps

  1. On AWS console, create IAM user
    • save the access key and secrete in dl.cfg
  2. On AWS console, create a S3 bucket
    • The name of my S3 bucket is aws-emr-resources-327442518701-us-west-2
    • The output from this project will be saved in the subfolder s3a://aws-emr-resources-327442518701-us-west-2/datalakes-project/
  3. Test the pipeline with local data, to check the correctness of codes:
    1. Run etl_local.py test on one song file and one log file
    2. Run etl_local.py full on full data downloaded from udacity workspace
  4. Run the pipeline with data stored in S3, either
    • Run etl.py locally (very slow, the speed is limited by Internet data transfer Internet between different regions) OR
    • Run etl.py on an emr cluster

Files

  • etl_local.py: pipeline for data stored on S3 s3://udacity-dend/
  • etl.py: pipeline for test data downloaded from project workspace and stored locally on /Users/hetianlin/OneDrive/GitHub/data_lakes_with_Spark/data/
  • sample_dl.cfg is a copy of dl.cfg without key and secret, becuase they ought NOT to be accessible to the public

Test on local data

python etl_local.py full
songs_table: 71 rows
+------------------+-----------------+------------------+----+---------+
|           song_id|            title|         artist_id|year| duration|
+------------------+-----------------+------------------+----+---------+
|SOGOSOV12AF72A285E|¿Dónde va Chichi?|ARGUVEV1187B98BA17|1997|313.12934|
+------------------+-----------------+------------------+----+---------+
only showing top 1 row

artists_table: 69 rows
+------------------+---------------+---------------+---------------+----------------+
|         artist_id|    artist_name|artist_location|artist_latitude|artist_longitude|
+------------------+---------------+---------------+---------------+----------------+
|AR3JMC51187B9AE49D|Backstreet Boys|    Orlando, FL|       28.53823|       -81.37739|
+------------------+---------------+---------------+---------------+----------------+
only showing top 1 row

time_table: 6820 rows
+-------------+----------+-------------------+----+---+----+-----+----+--------+
|           ts| timestamp|         start_time|hour|day|week|month|year| weekday|
+-------------+----------+-------------------+----+---+----+-----+----+--------+
|1542241826796|1542241826|2018-11-15 01:30:26|   1| 15|  46|   11|2018|Thursday|
+-------------+----------+-------------------+----+---+----+-----+----+--------+
only showing top 1 row

song_df: 71 rows
+------------------+--------------------+--------+----+------------------+
|           song_id|               title|duration|year|         artist_id|
+------------------+--------------------+--------+----+------------------+
|SOAOIBZ12AB01815BE|I Hold Your Hand ...|43.36281|2000|ARPBNLO1187FB3D52F|
+------------------+--------------------+--------+----+------------------+
only showing top 1 row

songplays_table: 6834 rows
  user_id level song_id artist_id  sessionId  \
0      26  free    None      None        583   
1      26  free    None      None        583   
2      26  free    None      None        583   

                             location  \
0  San Jose-Sunnyvale-Santa Clara, CA   
1  San Jose-Sunnyvale-Santa Clara, CA   
2  San Jose-Sunnyvale-Santa Clara, CA   

                                           userAgent  year  month  songplay_id  
0  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            0  
1  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            1  
2  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            2  

Releases

No releases published

Packages

No packages published

Languages