Author: Tianlin He
Date: 18 Oct 2020
Tag: #Udacity #Data Engineering #AWS #S3 #Spark
A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As their data engineer, you are tasked with building an ETL pipeline that
- extracts the data (format: JSON) from S3,
- processes the data with Spark, and
- loads the data back to S3 (format: Parquet) as a set of dimensional tables.
You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.
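The extract-transform-load shape above can be sketched as a pair of processing functions; this is an illustrative outline (function names and paths are mine, not necessarily the project's exact code), where `spark` is an already-created SparkSession:

```python
def process_song_data(spark, input_data, output_data):
    """Extract song JSON from S3, build the songs and artists tables, load as Parquet."""
    df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    # songs dimension table
    songs = df.select("song_id", "title", "artist_id", "year", "duration") \
              .dropDuplicates(["song_id"])
    songs.write.mode("overwrite").partitionBy("year", "artist_id") \
         .parquet(output_data + "songs/")

    # artists dimension table
    artists = df.select("artist_id", "artist_name", "artist_location",
                        "artist_latitude", "artist_longitude") \
                .dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(output_data + "artists/")
```

`process_log_data` follows the same pattern for the users, time, and songplays tables.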
You'll be working with two datasets that reside in S3 at the following paths:
- Song data: `s3://udacity-dend/song_data`
- Log data: `s3://udacity-dend/log_data`
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The log data are in JSON format files generated by this event simulator based on the songs in the dataset above. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
And the data in a log file such as 2018-11-12-events.json is newline-delimited JSON, one event record per line, with fields including `ts`, `userId`, `level`, `sessionId`, `location`, and `userAgent`.
The project was developed in a local environment (macOS) using `pyspark`, with `spark.jars.packages` set to `org.apache.hadoop:hadoop-aws:2.7.0` so that Spark can read from and write to S3 via `s3a://` paths.
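A minimal sketch of that session setup (the import is deferred inside the function so the snippet stays importable on machines without Spark):

```python
def create_spark_session():
    """Create a SparkSession with the hadoop-aws package so s3a:// paths work."""
    from pyspark.sql import SparkSession  # deferred: only needed when actually running
    return (SparkSession.builder
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
            .getOrCreate())
```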
- On the AWS console, create an IAM user
  - Save the access key and secret in `dl.cfg`
- On the AWS console, create an S3 bucket
  - The name of my S3 bucket is `aws-emr-resources-327442518701-us-west-2`
  - The output from this project will be saved in the subfolder `s3a://aws-emr-resources-327442518701-us-west-2/datalakes-project/`
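Loading the credentials from `dl.cfg` is a few lines with `configparser`; the section and key names below are an assumed layout (adjust them to match your actual file), and `read_string` stands in for `config.read("dl.cfg")`:

```python
import configparser
import os

# Assumed dl.cfg layout -- adjust section/key names to match your file
SAMPLE_CFG = """\
[AWS]
AWS_ACCESS_KEY_ID=YOUR_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # in the real pipeline: config.read("dl.cfg")

# Export the credentials so hadoop-aws can pick them up
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```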
- Test the pipeline with local data, to check the correctness of the code:
  - Run `etl_local.py test` on one song file and one log file
  - Run `etl_local.py full` on the full data downloaded from the Udacity workspace
- Run the pipeline with data stored in S3, either:
  - Run `etl.py` locally (very slow; the speed is limited by Internet data transfer between regions), OR
  - Run `etl.py` on an EMR cluster
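The `test`/`full` switch can be as simple as reading the first command-line argument; a hypothetical sketch (the globs below are illustrative, not the project's exact paths):

```python
import sys

# "test" points at a single song/log file, "full" at the whole local dataset
MODES = {
    "test": {"song_glob": "data/song_data/A/B/C/TRABCEI128F424C983.json",
             "log_glob": "data/log_data/2018/11/2018-11-12-events.json"},
    "full": {"song_glob": "data/song_data/*/*/*/*.json",
             "log_glob": "data/log_data/*/*/*.json"},
}

def pick_mode(argv):
    """Return the run configuration for `etl_local.py <mode>` (default: test)."""
    mode = argv[1] if len(argv) > 1 else "test"
    if mode not in MODES:
        raise SystemExit(f"usage: etl_local.py [{'|'.join(MODES)}]")
    return MODES[mode]
```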
- `etl.py`: pipeline for data stored on S3 (`s3://udacity-dend/`)
- `etl_local.py`: pipeline for test data downloaded from the project workspace and stored locally on `/Users/hetianlin/OneDrive/GitHub/data_lakes_with_Spark/data/`
- `sample_dl.cfg`: a copy of `dl.cfg` without the key and secret, because they ought NOT to be accessible to the public
Output of `python etl_local.py full`:
songs_table: 71 rows
+------------------+-----------------+------------------+----+---------+
| song_id| title| artist_id|year| duration|
+------------------+-----------------+------------------+----+---------+
|SOGOSOV12AF72A285E|¿Dónde va Chichi?|ARGUVEV1187B98BA17|1997|313.12934|
+------------------+-----------------+------------------+----+---------+
only showing top 1 row
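Assuming the songs table is written with `partitionBy("year", "artist_id")` (the usual choice in this project), Spark lays the Parquet files out in Hive-style partition directories. A stdlib sketch of the resulting layout for the row above:

```python
def partition_dir(base, row, keys):
    """Hive-style partition path, as Spark's DataFrameWriter.partitionBy lays it out."""
    return base + "".join(f"{k}={row[k]}/" for k in keys)

row = {"song_id": "SOGOSOV12AF72A285E", "year": 1997,
       "artist_id": "ARGUVEV1187B98BA17", "duration": 313.12934}
path = partition_dir("songs/", row, ["year", "artist_id"])
# e.g. "songs/year=1997/artist_id=ARGUVEV1187B98BA17/"
```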
artists_table: 69 rows
+------------------+---------------+---------------+---------------+----------------+
| artist_id| artist_name|artist_location|artist_latitude|artist_longitude|
+------------------+---------------+---------------+---------------+----------------+
|AR3JMC51187B9AE49D|Backstreet Boys| Orlando, FL| 28.53823| -81.37739|
+------------------+---------------+---------------+---------------+----------------+
only showing top 1 row
time_table: 6820 rows
+-------------+----------+-------------------+----+---+----+-----+----+--------+
| ts| timestamp| start_time|hour|day|week|month|year| weekday|
+-------------+----------+-------------------+----+---+----+-----+----+--------+
|1542241826796|1542241826|2018-11-15 01:30:26| 1| 15| 46| 11|2018|Thursday|
+-------------+----------+-------------------+----+---+----+-----+----+--------+
only showing top 1 row
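The `ts` column holds epoch milliseconds, and every derived column comes straight from the converted timestamp. A stdlib sketch of that derivation (shown in UTC; the `hour` in the table above reflects the local timezone of the session that produced it):

```python
from datetime import datetime, timezone

def time_row(ts_ms):
    """Derive the time_table columns from an epoch-milliseconds value (UTC)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {"ts": ts_ms,
            "start_time": dt.strftime("%Y-%m-%d %H:%M:%S"),
            "hour": dt.hour, "day": dt.day,
            "week": dt.isocalendar()[1],
            "month": dt.month, "year": dt.year,
            "weekday": dt.strftime("%A")}

row = time_row(1542241826796)  # the ts from the sample row above
```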
song_df: 71 rows
+------------------+--------------------+--------+----+------------------+
| song_id| title|duration|year| artist_id|
+------------------+--------------------+--------+----+------------------+
|SOAOIBZ12AB01815BE|I Hold Your Hand ...|43.36281|2000|ARPBNLO1187FB3D52F|
+------------------+--------------------+--------+----+------------------+
only showing top 1 row
songplays_table: 6834 rows
   user_id level song_id artist_id  sessionId                            location                                          userAgent  year  month  songplay_id
0       26  free    None      None        583  San Jose-Sunnyvale-Santa Clara, CA  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            0
1       26  free    None      None        583  San Jose-Sunnyvale-Santa Clara, CA  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            1
2       26  free    None      None        583  San Jose-Sunnyvale-Santa Clara, CA  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            2