# Load Data, parallel (partitioned) vs. single-threaded (non-partitioned)

When moving large amounts of data from an S3 staging area into Redshift, use the `COPY` command rather than `INSERT`. `COPY` can parallelize the ingestion when the data is split into multiple files, because each file can be ingested independently by a slice in the cluster. The cluster created for this exercise has 4 nodes, so we can expect at least 4 parallel ingestions (the exact degree of parallelism depends on the number of slices per node, which varies by node type). This can significantly reduce the time it takes to ingest large payloads, as this notebook demonstrates.
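To make "broken into parts" concrete, here is a minimal sketch of how a delimited data set could be split into several gzip-compressed files before upload. It is not how the sample data in this notebook was produced; the function name `split_into_gzip_parts`, the `part-00000.csv.gz` naming scheme, and the round-robin distribution are all assumptions made for illustration, and the sketch does not quote fields that contain the delimiter.

```python
import gzip
import tempfile
from pathlib import Path


def split_into_gzip_parts(rows, out_dir, num_parts, delimiter=";"):
    """Write rows round-robin into num_parts gzip-compressed delimited files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: a shared "part-" prefix plus a zero-padded index.
    paths = [out_dir / f"part-{i:05d}.csv.gz" for i in range(num_parts)]
    # Open every part in text mode so rows can be streamed out one at a time.
    handles = [gzip.open(p, "wt", encoding="utf-8", newline="") for p in paths]
    try:
        for i, row in enumerate(rows):
            # Row i goes to part i % num_parts, which keeps the parts similar in size.
            handles[i % num_parts].write(delimiter.join(map(str, row)) + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


# Example: 10 made-up rows split into 4 parts inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_rows = [(i, f"section-{i % 3}", 10.0 + i) for i in range(10)]
    part_paths = split_into_gzip_parts(sample_rows, tmp_dir, num_parts=4)
    print([p.name for p in part_paths])
```

Choosing a number of parts that is a multiple of the number of slices generally lets every slice take an equal share of the work.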

In [1]:
%load_ext sql

In [3]:
import boto3
import configparser
import pandas as pd
from time import time

## Connect to the Redshift cluster
(Run the first notebook to set up the cluster and obtain the necessary parameters.)

In [4]:
config = configparser.ConfigParser()
config.read_file(open('dwh.cfg'))
KEY=config.get('AWS','key')
SECRET= config.get('AWS','secret')

DWH_DB= config.get("DWH","DWH_DB")
DWH_DB_USER= config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD= config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT = config.get("DWH","DWH_PORT")

In [5]:
# FILL IN THE REDSHIFT ENDPOINT HERE
# e.g. DWH_ENDPOINT="redshift-cluster-1.csmamz5zxmle.us-west-2.redshift.amazonaws.com" 
DWH_ENDPOINT="dwhcluster.cyki8txni5j5.us-west-2.redshift.amazonaws.com" 
    
# FILL IN THE IAM ROLE ARN you got in step 2.2 of the previous exercise
# e.g. DWH_ROLE_ARN="arn:aws:iam::988332130976:role/dwhRole"
DWH_ROLE_ARN="arn:aws:iam::873674308518:role/dwhRole"

In [6]:
conn_string=f"postgresql://{DWH_DB_USER}:{DWH_DB_PASSWORD}@{DWH_ENDPOINT}:{DWH_PORT}/{DWH_DB}"

print(conn_string)
%sql $conn_string

postgresql://dwhuser:Passw0rd@dwhcluster.cyki8txni5j5.us-west-2.redshift.amazonaws.com:5439/dwh


'Connected: dwhuser@dwh'

In [7]:
s3 =  boto3.resource("s3",
                     region_name="us-west-2",
                     aws_access_key_id=KEY,
                     aws_secret_access_key=SECRET
                     )

sampleDbBucket =  s3.Bucket("udacity-labs")

In [8]:
# Check the sample data
for obj in sampleDbBucket.objects.filter(Prefix="tickets"):
    print(obj)

s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/full/')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/full/full.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00000-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00001-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00002-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00003-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00004-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz')
s3.ObjectSummary(bucket_name='udacity-labs', key='tickets/split/part-00005-d33afb94-b8af-407d-abd5-

**Note:** The `full` folder contains the complete data set as a single file, and the `split` folder contains the same data partitioned into 10 files.

## Copy The Partitioned Data

**Create table**

In [9]:
%%sql 
DROP TABLE IF EXISTS "sporting_event_ticket";
CREATE TABLE "sporting_event_ticket" (
    "id" double precision DEFAULT nextval('sporting_event_ticket_seq') NOT NULL,
    "sporting_event_id" double precision NOT NULL,
    "sport_location_id" double precision NOT NULL,
    "seat_level" numeric(1,0) NOT NULL,
    "seat_section" character varying(15) NOT NULL,
    "seat_row" character varying(10) NOT NULL,
    "seat" character varying(10) NOT NULL,
    "ticketholder_id" double precision,
    "ticket_price" numeric(8,2) NOT NULL
);

* postgresql://dwhuser:***@dwhcluster.cyki8txni5j5.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
Done.


[]

**Load partitioned data**

Use the `COPY` command to load data from `s3://udacity-labs/tickets/split/part` using your IAM role credentials. The files are gzip-compressed and use `;` as the delimiter.

`/part` is the common key prefix of the partitioned files. `COPY` loads every object whose key begins with this prefix, so all of these files can be loaded in parallel.

(Note: the `compupdate off` option tells Redshift to skip automatic compression analysis during the load. We do this for demo purposes only, so that the two load times are comparable.)
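The prefix behavior described above can be illustrated locally with plain string matching. This is only a sketch of the selection rule, not a call into Redshift or boto3; the helper name `keys_with_prefix` is made up for this example, and the keys are a subset of the listing printed earlier.

```python
def keys_with_prefix(keys, prefix):
    """Return the object keys that start with prefix, mimicking a key-prefix selection."""
    return [key for key in keys if key.startswith(prefix)]


# A subset of the keys listed from the bucket earlier in this notebook.
keys = [
    "tickets/",
    "tickets/full/",
    "tickets/full/full.csv.gz",
    "tickets/split/",
    "tickets/split/part-00000-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz",
    "tickets/split/part-00001-d33afb94-b8af-407d-abd5-59c0ee8f5ee8-c000.csv.gz",
]

selected = keys_with_prefix(keys, "tickets/split/part")
for key in selected:
    print(key)
```

Only the `part-...` files are selected; the folder placeholders and `full.csv.gz` are not. The same rule means that any unrelated object sharing the prefix would also be loaded, so it is worth keeping prefixes specific.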

In [10]:
%%time
qry = f"""
    COPY sporting_event_ticket 
    FROM 's3://udacity-labs/tickets/split/part'
    credentials 'aws_iam_role={DWH_ROLE_ARN}'
    gzip delimiter ';' compupdate off region 'us-west-2';
"""

%sql $qry

* postgresql://dwhuser:***@dwhcluster.cyki8txni5j5.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
Wall time: 11.2 s


[]

## Load Non-Partitioned Data

**Create table**

In [11]:
%%sql
DROP TABLE IF EXISTS "sporting_event_ticket_full";
CREATE TABLE "sporting_event_ticket_full" (
    "id" double precision DEFAULT nextval('sporting_event_ticket_seq') NOT NULL,
    "sporting_event_id" double precision NOT NULL,
    "sport_location_id" double precision NOT NULL,
    "seat_level" numeric(1,0) NOT NULL,
    "seat_section" character varying(15) NOT NULL,
    "seat_row" character varying(10) NOT NULL,
    "seat" character varying(10) NOT NULL,
    "ticketholder_id" double precision,
    "ticket_price" numeric(8,2) NOT NULL
);

* postgresql://dwhuser:***@dwhcluster.cyki8txni5j5.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
Done.


[]

**Load non-partitioned data**

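Use the `COPY` command to load data from `s3://udacity-labs/tickets/full/full.csv.gz` using your IAM role credentials. The file is gzip-compressed and uses `;` as the delimiter.

- Note that this load is slower than loading the partitioned data, because a single file offers no parts to ingest in parallel.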

In [12]:
%%time

qry = f"""
    COPY sporting_event_ticket_full 
    FROM 's3://udacity-labs/tickets/full/full.csv.gz' 
    credentials 'aws_iam_role={DWH_ROLE_ARN}' 
    gzip delimiter ';' compupdate off region 'us-west-2';
"""

%sql $qry

* postgresql://dwhuser:***@dwhcluster.cyki8txni5j5.us-west-2.redshift.amazonaws.com:5439/dwh
Done.
Wall time: 22.9 s


[]
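To put the two wall times reported above (11.2 s partitioned, 22.9 s non-partitioned) side by side, here is a small sketch that computes the speedup. The helpers `timed` and `speedup` are hypothetical names introduced for this example; `timed` shows one way a load call could be measured in plain Python instead of with `%%time`. The figures come from a single run, so treat the ratio as indicative rather than a benchmark.

```python
from time import perf_counter


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - start


def speedup(baseline_seconds, improved_seconds):
    """Return how many times faster the improved run is than the baseline."""
    return baseline_seconds / improved_seconds


# Wall times reported by the two %%time cells in this notebook.
partitioned_seconds = 11.2
full_seconds = 22.9

ratio = speedup(full_seconds, partitioned_seconds)
print(f"Partitioned load was about {ratio:.2f}x faster")
```

With these two measurements the partitioned load comes out roughly twice as fast; results will vary with cluster size, node type, and network conditions.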

---