
Project: Data Lake

A music streaming startup, Sparkify, has grown its user base and song database even more and wants to move its data warehouse to a data lake. The data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata about the songs in the app.

As Sparkify's data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, processes it using Spark, and loads it back into S3 as a set of dimensional tables. This will allow the analytics team to continue finding insights into what songs their users are listening to.

You'll be able to test your database and ETL pipeline by running queries given to you by Sparkify's analytics team and comparing your results with their expected results.

Table of contents

  • Structure
  • Requirements
  • Preparing AWS
  • Cloning the repository
  • Creating the virtual environment
  • How to use

Structure

This tree shows the repository structure. Only the project's main files are described.

.
├── images
│   ├── iam-create-user-01.png
│   ├── iam-create-user-02.png
│   ├── iam-create-user-03.png
│   ├── iam-create-user-04.png
│   ├── iam-create-user-05.png
│   ├── s3-artists-parquet.png
│   ├── s3-create-bucket-01.png
│   ├── s3-create-bucket-02.png
│   └── s3-create-bucket-03.png
├── src
│   ├── pipelines
│   │   ├── __init__.py
│   │   ├── artists.py             # Artists table pipeline
│   │   ├── base.py                # Pipeline common code
│   │   ├── cache.py               # Dataframes cache
│   │   ├── songplays.py           # Songplays table pipeline
│   │   ├── songs.py               # Songs table pipeline
│   │   ├── time.py                # Time table pipeline
│   │   └── users.py               # Users table pipeline
│   ├── config.py                  # Application config manager
│   ├── etl.py                     # ETL pipeline script
│   └── sparkify.cfg               # Application config file
├── .editorconfig
├── .gitignore
├── README.md
└── requirements.txt

Requirements

It is assumed that the tools below are properly installed locally (a quick way to check them is shown right after the list):

  • Python is a programming language that lets you work quickly and integrate systems more effectively. Version 3.x is required.
  • pip is the package installer for Python.
  • Virtualenv is a tool to create isolated Python environments.
  • Amazon Web Services is the world's most comprehensive and broadly adopted cloud platform, offering over 165 fully featured services from data centers globally.
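A quick way to confirm the tooling is to ask each tool for its version; these commands are just a sanity check and not part of the project:

# Check the local tooling; any Python 3.x should do
python3 --version
pip --version
virtualenv --version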

Preparing AWS

Before running the pipelines that will load the source data into our data lake, we must set up a few things in our AWS account.

Creating a bucket

We are using Amazon S3 as the data lake, so we must first create the bucket in which the data will be stored.

Open the AWS console, log into your account and go to S3. Choose Buckets from the left side menu, and then click the button Create bucket.

Create bucket

Then choose a name for the bucket; remember to check the naming rules here. For the region we will use US West (Oregon), the same as the source data bucket.

Bucket name and region

For the other options and permissions we will keep the default values. Just follow the remaining steps and confirm the bucket creation.
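If you prefer the command line and have the AWS CLI installed (it is not required by this project), the same bucket can be created from the Terminal; the bucket name below is just a placeholder:

# Create the data lake bucket in US West (Oregon); use your own bucket name
aws s3api create-bucket \
    --bucket your-data-lake-bucket \
    --region us-west-2 \
    --create-bucket-configuration LocationConstraint=us-west-2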

Once the bucket is created, select it from the main list. A panel will appear on the right, showing some properties of the selected bucket. Click the button Copy the Bucket ARN; we will need the ARN in a couple of minutes.

Bucket ARN

Creating an administrator user

Now go to IAM. Choose Users from the left side menu, and then click the button Add user.

Add user

Fill in the User name box, and also mark the Programmatic access checkbox.

User details

Do not add any permissions to the user; just go to the end of the wizard, where the credentials are shown.

User credentials

Copy the Access key ID and the Secret access key directly from the site and paste them into the application configuration file src/sparkify.cfg, under the corresponding configuration keys:

...
ACCESS_KEY_ID = the-access-key-id
SECRET_ACCESS_KEY = the-secret-access-key
...
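For reference, this is roughly how the application's config manager (src/config.py) might read those keys with Python's configparser; the [AWS] section name here is an assumption, so check src/sparkify.cfg for the actual layout:

# Minimal sketch: load the AWS credentials from src/sparkify.cfg
import configparser

config = configparser.ConfigParser()
config.read("sparkify.cfg")  # run from the src directory

access_key_id = config["AWS"]["ACCESS_KEY_ID"]          # assumed section name
secret_access_key = config["AWS"]["SECRET_ACCESS_KEY"]  # assumed section name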

Important: never share these credentials with anyone, or your AWS account could be compromised.

It's time to add some permissions to the user. Go to the Users menu and navigate to the user's Summary page. There, select the Permissions tab and click the button Add inline policy.

User summary

The inline policy can be created easily using the visual editor, but we will write the policy's JSON document instead. Select the JSON tab.

Inline policy

And now copy and paste the JSON below into the policy editor.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::udacity-dend"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::udacity-dend/log-data/*",
                "arn:aws:s3:::udacity-dend/song-data/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": [
                "data-lake-bucket-arn"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject"
            ],
            "Resource": [
                "data-lake-bucket-arn/*"
            ]
        }
    ]
}

The permissions are listed in the Statement collection:

  • The user can list the source bucket
  • The user can get all the objects stored in the log-data and song-data folders
  • The user can list the data lake bucket
  • The user can get, create and delete all the objects stored in the data lake bucket

In the last two statements, make the Resource point to the data lake: replace the text data-lake-bucket-arn with the actual bucket ARN that we copied before. Once the policy is ready, click the button Review policy and then Save changes.
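As an alternative to the console, and assuming you have the AWS CLI configured with an account that can manage IAM, the same inline policy can be attached from the Terminal; the user name, policy name and file name below are placeholders:

# Attach the inline policy (saved locally as policy.json) to the new user
aws iam put-user-policy \
    --user-name your-etl-user \
    --policy-name data-lake-access \
    --policy-document file://policy.json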

Cloning the repository

The first step is to clone this repository. Just type the following command in your Terminal:

# Clone the repository...
git clone https://github.com/vermicida/data-lake.git

# ...and move to its directory
cd data-lake

Creating the virtual environment

Once the project is cloned, you must create a new virtual environment for it. You may have multiple Python versions installed, so it's important to specify the correct one with the --python argument. The code in this repository has been successfully tested on Python 3.6, but it should work on any Python 3 version; set the version that applies to your case in the following commands:

# Create the virtual environment...
virtualenv --python python3.6 project-env

# ...and activate it
source project-env/bin/activate

It's time to install the project's dependencies. You will use pip to do this job:

pip install -r requirements.txt

It will be ready in a minute.


How to use

These are the steps to follow in order to make the ETL pipeline work.

Running the ETL

This step will test your patience: it can take a few hours to complete. It performs the actions described below; a simplified PySpark sketch of these steps follows the list:

  1. Create a Spark session.
  2. Read the song data, stored in S3, into a dataframe. Note: I decided to take just a slice of this data because of the slowness of the writing process that comes later. Instead of getting all the JSON files, it only takes those from partition A/A/A.
  3. Transform the song data to create the songs dimension table and store it back in S3 in Apache Parquet format.
  4. Transform the song data to create the artists dimension table and store it back in S3 in Apache Parquet format.
  5. Read the log data, stored in S3, into a dataframe.
  6. Transform the log data to create the users dimension table and store it back in S3 in Apache Parquet format.
  7. Transform the log data to create the time dimension table and store it back in S3 in Apache Parquet format.
  8. Transform the song and log data to create the songplays fact table and store it back in S3 in Apache Parquet format.
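The sketch below illustrates the general shape of steps 1 to 3 with PySpark. The actual implementation lives in src/etl.py and the src/pipelines modules; the output bucket name, the column selection and the Hadoop AWS package version are assumptions made for the example:

# Simplified sketch of steps 1-3: Spark session, read song data, write the songs table.
# AWS credentials are expected to be available to Hadoop (e.g. via environment variables).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sparkify-data-lake")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .getOrCreate()
)

# Step 2: read only the A/A/A partition of the song data, as noted above
song_df = spark.read.json("s3a://udacity-dend/song-data/A/A/A/*.json")

# Step 3: project the songs dimension and write it back to the data lake in
# Parquet format, partitioned by year and artist_id (an illustrative layout)
songs_df = song_df.select("song_id", "title", "artist_id", "year", "duration").dropDuplicates(["song_id"])
(
    songs_df.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://your-data-lake-bucket/songs/")  # placeholder bucket name
)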

Run this command:

# Move to the src directory...
cd src

# ...and run the ETL
python etl.py

Now you'd better go eat something, take a nap, or watch the last episode of Money Heist.

When the ETL process ends, go to your data lake bucket and verify that it worked as expected. The image below shows a sample of the artists table in Apache Parquet format.

Artists table in Apache Parquet format
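If you have the AWS CLI at hand, you can also list the generated files without leaving the Terminal; replace the bucket name and, if needed, the table prefix with your own:

# List the Parquet files written for the artists table
aws s3 ls s3://your-data-lake-bucket/artists/ --recursive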
