# Parallel ETL

## Import libraries

In [1]:
import pandas as pd
import boto3
import configparser
from time import time

In [2]:
%load_ext sql

## STEP 1. Get the params of the created redshift cluster

After running the previous exercise (L4_E1), we need to take note of:
* The Redshift cluster <font color='red'>endpoint</font>.
* The <font color='red'>IAM role ARN</font> that gives Redshift access to read from S3.

In [3]:
config = configparser.ConfigParser()
config.read_file(open('dwh.cfg'))
KEY=config.get('AWS','key')
SECRET= config.get('AWS','secret')

DWH_DB= config.get("DWH","DWH_DB")
DWH_DB_USER= config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD= config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT = config.get("DWH","DWH_PORT")

# Copied from L4_E1 notebook once it is run
DWH_ENDPOINT="dwhcluster.cw5pguvqjfzx.us-east-1.redshift.amazonaws.com" 
DWH_ROLE_ARN="arn:aws:iam::113458468422:role/dwhRole"
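
For reference, the layout that the cell above expects from `dwh.cfg` can be sketched as follows. This is a minimal sketch: the section and option names come from the `config.get` calls above, but every value is a hypothetical placeholder, and the in-memory string stands in for the real file.

```python
import configparser

# Hypothetical dwh.cfg contents; every value below is a placeholder,
# only the section and option names are taken from the notebook.
SAMPLE_CFG = """
[AWS]
key = YOUR_ACCESS_KEY_ID
secret = YOUR_SECRET_ACCESS_KEY

[DWH]
DWH_DB = dwh
DWH_DB_USER = dwhuser
DWH_DB_PASSWORD = example-password
DWH_PORT = 5439
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # read_string parses text instead of a file

# get() always returns a string; getint() converts the port to an integer.
dwh_db = config.get("DWH", "DWH_DB")
dwh_port_text = config.get("DWH", "DWH_PORT")
dwh_port = config.getint("DWH", "DWH_PORT")

print(dwh_db, repr(dwh_port_text), dwh_port)
```

Note that `config.get` returns the port as a string, which is fine for building a connection URL but needs `getint` if a number is required.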

## STEP 2. Connect to the Redshift Cluster

In [4]:
conn_string="postgresql://{}:{}@{}:{}/{}".format(DWH_DB_USER, DWH_DB_PASSWORD, DWH_ENDPOINT, DWH_PORT,DWH_DB)
print(conn_string)
%sql $conn_string

postgresql://dwhuser:Passw0rd@dwhcluster.cw5pguvqjfzx.us-east-1.redshift.amazonaws.com:5439/dwh
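
The printed connection string above includes the password in clear text, and a password containing characters such as `@` or `/` would break the URL. A possible way to handle both, sketched with the standard library only, is shown below. The helper names `build_conn_string` and `masked_conn_string` and the sample values are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, db):
    """Build a PostgreSQL-style URL, percent-encoding the user and password."""
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, db
    )


def masked_conn_string(user, host, port, db):
    """Return a printable form of the URL with the password hidden."""
    return "postgresql://{}:***@{}:{}/{}".format(quote_plus(user), host, port, db)


# Hypothetical values, used only to show the encoding.
url = build_conn_string("dwhuser", "P@ss/w0rd", "example-host", 5439, "dwh")
safe_to_print = masked_conn_string("dwhuser", "example-host", 5439, "dwh")
print(safe_to_print)
```

In the notebook, `url` would be passed to `%sql` while only the masked form is printed.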


Now we create an S3 resource and use the `udacity-labs` bucket, where the data is already loaded.

In [5]:
s3 = boto3.resource("s3",
                    region_name="us-east-1",
                    aws_access_key_id=KEY,
                    aws_secret_access_key=SECRET)

sampleDbBucket =  s3.Bucket("udacity-labs")
for obj in sampleDbBucket.objects.filter(Prefix="tickets"):
    print(obj)

s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/full/')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/full/full.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00000-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00001-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00002-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00003-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00004-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00005-d33afb94-b8af-407d-abd5-

As we can see, the bucket holds a dataset called `tickets` in two layouts:
* One file, `full.csv.gz`, that contains all the data.
* The same data split into 9 separate files.

We're going to compare how long it takes to load the data from the split files and from the single file.
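
To check how many data files each layout contains without reading the listing by eye, the keys can be grouped by folder. The sketch below assumes a hypothetical helper, `count_files_per_folder`, and uses made-up example keys rather than the real bucket contents.

```python
from collections import Counter
from posixpath import dirname


def count_files_per_folder(keys):
    """Count data files per folder, skipping folder placeholder keys."""
    counts = Counter()
    for key in keys:
        if key.endswith("/"):
            continue  # keys ending in "/" are folder markers, not data files
        counts[dirname(key)] += 1
    return dict(counts)


# Hypothetical keys that mimic the shape of the listing above.
example_keys = [
    "tickets/",
    "tickets/full/",
    "tickets/full/full.csv.gz",
    "tickets/split/",
    "tickets/split/part-a.csv.gz",
    "tickets/split/part-b.csv.gz",
]

file_counts = count_files_per_folder(example_keys)
print(file_counts)
```

Against the real bucket, the same function could be fed `obj.key for obj in sampleDbBucket.objects.filter(Prefix="tickets")`.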

## STEP 3a. Create table for partitioned data

In [6]:
%%sql 
DROP TABLE IF EXISTS "sporting_event_ticket";
CREATE TABLE "sporting_event_ticket" (
    "id" double precision DEFAULT nextval('sporting_event_ticket_seq') NOT NULL,
    "sporting_event_id" double precision NOT NULL,
    "sport_location_id" double precision NOT NULL,
    "seat_level" numeric(1,0) NOT NULL,
    "seat_section" character varying(15) NOT NULL,
    "seat_row" character varying(10) NOT NULL,
    "seat" character varying(10) NOT NULL,
    "ticketholder_id" double precision,
    "ticket_price" numeric(8,2) NOT NULL
);

 * postgresql://dwhuser:***@dwhcluster.cw5pguvqjfzx.us-east-1.redshift.amazonaws.com:5439/dwh
Done.
Done.


[]

## STEP 3b. Load partitioned data into the cluster

We make use of the COPY command to load data from `s3://udacity-labs/tickets/split/part` prefix.

In [7]:
%%time
qry = f"""
    COPY sporting_event_ticket FROM 's3://udacity-labs/tickets/split/part'
    CREDENTIALS 'aws_iam_role={DWH_ROLE_ARN}'
    gzip DELIMITER ';' compupdate off REGION 'us-west-2'
"""

%sql $qry

 * postgresql://dwhuser:***@dwhcluster.cw5pguvqjfzx.us-east-1.redshift.amazonaws.com:5439/dwh
Done.
CPU times: user 4.82 ms, sys: 349 µs, total: 5.17 ms
Wall time: 14.8 s


[]
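
The `FROM` clause above names a key prefix rather than a single object, so COPY loads every object whose key starts with that prefix. The sketch below illustrates this matching rule and wraps the statement in a small builder so the same text can be reused for both loads. The helper names, the example keys, and the role ARN are hypothetical; the COPY options are the ones used in the cell above.

```python
def build_copy_sql(table, s3_path, role_arn, region="us-west-2", delimiter=";"):
    """Return a COPY statement with the same options as the notebook cell."""
    return (
        f"COPY {table} FROM '{s3_path}'\n"
        f"CREDENTIALS 'aws_iam_role={role_arn}'\n"
        f"gzip DELIMITER '{delimiter}' compupdate off REGION '{region}'"
    )


def keys_matching_prefix(keys, prefix):
    """Return the keys a prefix-based load would pick up (plain string match)."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical keys and role ARN, for illustration only.
example_keys = [
    "tickets/full/full.csv.gz",
    "tickets/split/part-a.csv.gz",
    "tickets/split/part-b.csv.gz",
]
matched = keys_matching_prefix(example_keys, "tickets/split/part")

copy_sql = build_copy_sql(
    "sporting_event_ticket",
    "s3://udacity-labs/tickets/split/part",
    "arn:aws:iam::123456789012:role/exampleRole",
)
print(matched)
print(copy_sql)
```

Building the statement in one place also makes it harder to point a load at the wrong target table.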

## STEP 4a. Create table for non-partitioned data

Let's now do the same, but for non-partitioned data.

In [8]:
%%sql
DROP TABLE IF EXISTS "sporting_event_ticket_full";
CREATE TABLE "sporting_event_ticket_full" (
    "id" double precision DEFAULT nextval('sporting_event_ticket_seq') NOT NULL,
    "sporting_event_id" double precision NOT NULL,
    "sport_location_id" double precision NOT NULL,
    "seat_level" numeric(1,0) NOT NULL,
    "seat_section" character varying(15) NOT NULL,
    "seat_row" character varying(10) NOT NULL,
    "seat" character varying(10) NOT NULL,
    "ticketholder_id" double precision,
    "ticket_price" numeric(8,2) NOT NULL
);

 * postgresql://dwhuser:***@dwhcluster.cw5pguvqjfzx.us-east-1.redshift.amazonaws.com:5439/dwh
Done.
Done.


[]

## STEP 4b. Load non-partitioned data into the cluster

We load the data from the `s3://udacity-labs/tickets/full/full.csv.gz` file.

In [9]:
%%time

qry = f"""
    COPY sporting_event_ticket_full FROM 's3://udacity-labs/tickets/full/full.csv.gz'
    CREDENTIALS 'aws_iam_role={DWH_ROLE_ARN}'
    gzip DELIMITER ';' compupdate off REGION 'us-west-2'
"""

%sql $qry

 * postgresql://dwhuser:***@dwhcluster.cw5pguvqjfzx.us-east-1.redshift.amazonaws.com:5439/dwh
Done.
CPU times: user 5.67 ms, sys: 205 µs, total: 5.87 ms
Wall time: 21.6 s


[]

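As the wall times show, loading the split files took 14.8 s versus 21.6 s for the single file, so in this run the load was faster from the partitioned source. Redshift can load multiple files in parallel, whereas a single gzip-compressed file cannot be divided among workers.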
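
The size of the difference can be worked out from the two wall times reported above. This is a small arithmetic sketch; the variable names are illustrative, and the numbers come from a single run each, so they should be read as indicative rather than as a benchmark.

```python
# Wall times reported by %%time in the two COPY cells above.
split_seconds = 14.8  # load from the split files
full_seconds = 21.6   # load from the single full file

speedup = full_seconds / split_seconds        # how many times faster
saved_seconds = full_seconds - split_seconds  # absolute time saved
saved_fraction = saved_seconds / full_seconds # share of the full-file time

print(f"speedup: {speedup:.2f}x")
print(f"saved: {saved_seconds:.1f} s ({saved_fraction:.0%} of the full-file load)")
```

Repeating each load several times and comparing averages would give a more reliable figure.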

## STEP 5. Clean up your resources

We follow the last step from the **L4_E1 notebook** to clean up the resources, so that we don't incur additional costs.