Author: Tianlin He
Date: 18 Oct 2020
Tag: #Udacity #Data Engineering #AWS #S3 #Spark
A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.
As their data engineer, you are tasked with building an ETL pipeline that
- extracts the data (format: JSON) from S3,
- processes the data with Spark, and
- loads the data back to S3 (format: Parquet) as a set of dimensional tables.
You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and compare your results with their expected results.
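The extract-transform-load shape above can be sketched as a pair of processing functions; this is an illustrative outline (function names and paths are mine, not necessarily the project's exact code), where `spark` is an already-created SparkSession:

```python
def process_song_data(spark, input_data, output_data):
    """Extract song JSON from S3, build the songs and artists tables, load as Parquet."""
    df = spark.read.json(input_data + "song_data/*/*/*/*.json")

    # songs dimension table
    songs = df.select("song_id", "title", "artist_id", "year", "duration") \
              .dropDuplicates(["song_id"])
    songs.write.mode("overwrite").partitionBy("year", "artist_id") \
         .parquet(output_data + "songs/")

    # artists dimension table
    artists = df.select("artist_id", "artist_name", "artist_location",
                        "artist_latitude", "artist_longitude") \
                .dropDuplicates(["artist_id"])
    artists.write.mode("overwrite").parquet(output_data + "artists/")
```

`process_log_data` follows the same pattern for the users, time, and songplays tables.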
You'll be working with two datasets that reside in S3 at the following paths:
- Song data: `s3://udacity-dend/song_data`
- Log data: `s3://udacity-dend/log_data`
The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
The log data are in JSON format files generated by this event simulator based on the songs in the dataset above. The log files are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
And the data in a log file such as 2018-11-12-events.json is newline-delimited JSON, one event record per line, with fields including `ts`, `userId`, `level`, `sessionId`, `location`, and `userAgent`.
The project was developed in a local environment (macOS) using `pyspark`, with `spark.jars.packages` set to `org.apache.hadoop:hadoop-aws:2.7.0` so that Spark can read from and write to S3 via `s3a://` paths.
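A minimal sketch of that session setup (the import is deferred inside the function so the snippet stays importable on machines without Spark):

```python
def create_spark_session():
    """Create a SparkSession with the hadoop-aws package so s3a:// paths work."""
    from pyspark.sql import SparkSession  # deferred: only needed when actually running
    return (SparkSession.builder
            .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.0")
            .getOrCreate())
```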
- On the AWS console, create an IAM user
  - Save the access key and secret in `dl.cfg`
- On the AWS console, create an S3 bucket
  - The name of my S3 bucket is `aws-emr-resources-327442518701-us-west-2`
  - The output from this project will be saved in the subfolder `s3a://aws-emr-resources-327442518701-us-west-2/datalakes-project/`
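Loading the credentials from `dl.cfg` is a few lines with `configparser`; the section and key names below are an assumed layout (adjust them to match your actual file), and `read_string` stands in for `config.read("dl.cfg")`:

```python
import configparser
import os

# Assumed dl.cfg layout -- adjust section/key names to match your file
SAMPLE_CFG = """\
[AWS]
AWS_ACCESS_KEY_ID=YOUR_KEY
AWS_SECRET_ACCESS_KEY=YOUR_SECRET
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # in the real pipeline: config.read("dl.cfg")

# Export the credentials so hadoop-aws can pick them up
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]
```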
- Test the pipeline with local data, to check the correctness of the code:
  - Run `etl_local.py test` on one song file and one log file
  - Run `etl_local.py full` on the full data downloaded from the Udacity workspace
- Run the pipeline with data stored in S3, either:
  - Run `etl.py` locally (very slow; the speed is limited by Internet data transfer between regions), OR
  - Run `etl.py` on an EMR cluster
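The `test`/`full` switch can be as simple as reading the first command-line argument; a hypothetical sketch (the globs below are illustrative, not the project's exact paths):

```python
import sys

# "test" points at a single song/log file, "full" at the whole local dataset
MODES = {
    "test": {"song_glob": "data/song_data/A/B/C/TRABCEI128F424C983.json",
             "log_glob": "data/log_data/2018/11/2018-11-12-events.json"},
    "full": {"song_glob": "data/song_data/*/*/*/*.json",
             "log_glob": "data/log_data/*/*/*.json"},
}

def pick_mode(argv):
    """Return the run configuration for `etl_local.py <mode>` (default: test)."""
    mode = argv[1] if len(argv) > 1 else "test"
    if mode not in MODES:
        raise SystemExit(f"usage: etl_local.py [{'|'.join(MODES)}]")
    return MODES[mode]
```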
- `etl.py`: pipeline for data stored on S3 (`s3://udacity-dend/`)
- `etl_local.py`: pipeline for test data downloaded from the project workspace and stored locally on `/Users/hetianlin/OneDrive/GitHub/data_lakes_with_Spark/data/`
- `sample_dl.cfg`: a copy of `dl.cfg` without the key and secret, because they ought NOT to be accessible to the public
Output of `python etl_local.py full`:
songs_table: 71 rows
+------------------+-----------------+------------------+----+---------+
| song_id| title| artist_id|year| duration|
+------------------+-----------------+------------------+----+---------+
|SOGOSOV12AF72A285E|¿Dónde va Chichi?|ARGUVEV1187B98BA17|1997|313.12934|
+------------------+-----------------+------------------+----+---------+
only showing top 1 row
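Assuming the songs table is written with `partitionBy("year", "artist_id")` (the usual choice in this project), Spark lays the Parquet files out in Hive-style partition directories. A stdlib sketch of the resulting layout for the row above:

```python
def partition_dir(base, row, keys):
    """Hive-style partition path, as Spark's DataFrameWriter.partitionBy lays it out."""
    return base + "".join(f"{k}={row[k]}/" for k in keys)

row = {"song_id": "SOGOSOV12AF72A285E", "year": 1997,
       "artist_id": "ARGUVEV1187B98BA17", "duration": 313.12934}
path = partition_dir("songs/", row, ["year", "artist_id"])
# e.g. "songs/year=1997/artist_id=ARGUVEV1187B98BA17/"
```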
artists_table: 69 rows
+------------------+---------------+---------------+---------------+----------------+
| artist_id| artist_name|artist_location|artist_latitude|artist_longitude|
+------------------+---------------+---------------+---------------+----------------+
|AR3JMC51187B9AE49D|Backstreet Boys| Orlando, FL| 28.53823| -81.37739|
+------------------+---------------+---------------+---------------+----------------+
only showing top 1 row
time_table: 6820 rows
+-------------+----------+-------------------+----+---+----+-----+----+--------+
| ts| timestamp| start_time|hour|day|week|month|year| weekday|
+-------------+----------+-------------------+----+---+----+-----+----+--------+
|1542241826796|1542241826|2018-11-15 01:30:26| 1| 15| 46| 11|2018|Thursday|
+-------------+----------+-------------------+----+---+----+-----+----+--------+
only showing top 1 row
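The `ts` column holds epoch milliseconds, and every derived column comes straight from the converted timestamp. A stdlib sketch of that derivation (shown in UTC; the `hour` in the table above reflects the local timezone of the session that produced it):

```python
from datetime import datetime, timezone

def time_row(ts_ms):
    """Derive the time_table columns from an epoch-milliseconds value (UTC)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {"ts": ts_ms,
            "start_time": dt.strftime("%Y-%m-%d %H:%M:%S"),
            "hour": dt.hour, "day": dt.day,
            "week": dt.isocalendar()[1],
            "month": dt.month, "year": dt.year,
            "weekday": dt.strftime("%A")}

row = time_row(1542241826796)  # the ts from the sample row above
```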
song_df: 71 rows
+------------------+--------------------+--------+----+------------------+
| song_id| title|duration|year| artist_id|
+------------------+--------------------+--------+----+------------------+
|SOAOIBZ12AB01815BE|I Hold Your Hand ...|43.36281|2000|ARPBNLO1187FB3D52F|
+------------------+--------------------+--------+----+------------------+
only showing top 1 row
songplays_table: 6834 rows
   user_id level song_id artist_id  sessionId                            location                                          userAgent  year  month  songplay_id
0       26  free    None      None        583  San Jose-Sunnyvale-Santa Clara, CA  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            0
1       26  free    None      None        583  San Jose-Sunnyvale-Santa Clara, CA  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            1
2       26  free    None      None        583  San Jose-Sunnyvale-Santa Clara, CA  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/5...  2018     11            2