# Lesson 4 - Exercise 1: Load Staging Table with COPY Command

## Learning Objectives

By completing this exercise, you will:

1. Create your own S3 bucket for staging data
2. Upload data files to S3
3. Use the COPY command to load data into Redshift
4. Understand COPY options for different file formats
5. Validate successful data loads

## Prerequisites

- AWS credentials configured in `aws_config.py`
- `van_transit_trips_postgres.csv` data file
- Redshift Serverless workgroup available

## Context

In production Redshift workflows, data typically flows from ETL scripts to S3, then into Redshift via the COPY command. COPY reads the staged files in parallel across Redshift's compute resources, which makes it far more efficient than row-by-row INSERTs for bulk loads.

**Note on Authentication**: In production, you would use `IAM_ROLE` for secure access. In this workspace, we pass temporary credentials through the `CREDENTIALS` parameter because no default IAM role is configured on the workgroup.

---
## Setup: Imports and Configuration

In [1]:
# ========= Imports
import os
import time
import uuid
from datetime import datetime

import pandas as pd
import numpy as np
import boto3
from botocore.exceptions import ClientError

print("Imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

Imports successful!
   - pandas version: 2.3.1
   - numpy version: 2.2.6


In [2]:
# ========= Load AWS Credentials
import aws_config

AWS_REGION = os.getenv('AWS_REGION')
REDSHIFT_DATABASE = os.getenv('REDSHIFT_DATABASE')
REDSHIFT_WORKGROUP = os.getenv('REDSHIFT_WORKGROUP')

# Get credentials for COPY command
AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY')
AWS_SESSION_TOKEN = os.getenv('AWS_SESSION_TOKEN')

print("Configuration loaded!")
print(f"   - AWS Region: {AWS_REGION}")
print(f"   - Redshift Database: {REDSHIFT_DATABASE}")
print(f"   - Redshift Workgroup: {REDSHIFT_WORKGROUP}")
print(f"   - Credentials: {'Loaded' if AWS_ACCESS_KEY_ID else 'Missing'}")

Configuration loaded!
   - AWS Region: us-east-1
   - Redshift Database: dev
   - Redshift Workgroup: udacity-dwh-wg
   - Credentials: Loaded


In [3]:
# ========= Helper Function: Execute Redshift Query
def execute_redshift_query(sql, fetch_results=True):
    """Execute a SQL query on Redshift Serverless and optionally return results."""
    client = boto3.client('redshift-data', region_name=AWS_REGION)
    
    # Execute query
    response = client.execute_statement(
        WorkgroupName=REDSHIFT_WORKGROUP,
        Database=REDSHIFT_DATABASE,
        Sql=sql
    )
    query_id = response['Id']
    
    # Wait for completion
    status = 'SUBMITTED'
    while status in ['SUBMITTED', 'PICKED', 'STARTED']:
        time.sleep(1)
        status_response = client.describe_statement(Id=query_id)
        status = status_response['Status']
    
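    # Treat both FAILED and ABORTED as errors so a cancelled query is not silently ignored
    if status in ('FAILED', 'ABORTED'):
        error = status_response.get('Error', 'Unknown error')
        raise RuntimeError(f"Query {status.lower()}: {error}")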
    
    # Fetch results if requested
    if fetch_results and status == 'FINISHED':
        try:
            result = client.get_statement_result(Id=query_id)
            columns = [col['name'] for col in result['ColumnMetadata']]
            rows = []
            for record in result['Records']:
                # SQL NULLs arrive as {'isNull': True}; map them to None instead of True
                row = [None if field.get('isNull') else next(iter(field.values()), None)
                       for field in record]
                rows.append(row)
            return pd.DataFrame(rows, columns=columns)
        except client.exceptions.ResourceNotFoundException:
            return None
    return None

print("Helper functions defined!")

Helper functions defined!
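
The helper above reads only the first page returned by `get_statement_result`. The Data API paginates larger result sets with a `NextToken` field, so a query that returns many rows may need a loop like the one sketched below. `fetch_all_records` and `FakeClient` are hypothetical names introduced for illustration; `FakeClient` only imitates the paging behaviour so the sketch can run without AWS access.

```python
def fetch_all_records(client, query_id):
    """Collect every page of a Data API result set into one list of records."""
    records = []
    kwargs = {"Id": query_id}
    while True:
        page = client.get_statement_result(**kwargs)
        records.extend(page["Records"])
        token = page.get("NextToken")
        if not token:
            return records
        # Ask for the next page on the following iteration
        kwargs["NextToken"] = token


class FakeClient:
    """Hypothetical stand-in for the redshift-data client (demo only)."""

    def __init__(self, pages):
        self._pages = pages

    def get_statement_result(self, Id, NextToken=None):
        index = int(NextToken) if NextToken else 0
        page = {"Records": self._pages[index]}
        if index + 1 < len(self._pages):
            page["NextToken"] = str(index + 1)
        return page


# Two pages with one single-column record each
demo_client = FakeClient([[[{"longValue": 1}]], [[{"longValue": 2}]]])
all_records = fetch_all_records(demo_client, "demo-query-id")
print(len(all_records))
```

With a real client, the same function could replace the single `get_statement_result` call inside `execute_redshift_query`.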


---
## Step 1: Create S3 Bucket for Staging

We'll create a unique S3 bucket to store our staging data.

In [4]:
# ========= Create S3 bucket
s3 = boto3.client('s3', region_name=AWS_REGION)

# Generate unique bucket name
unique_id = str(uuid.uuid4())[:8]
BUCKET_NAME = f"udacity-redshift-staging-{unique_id}"

try:
    if AWS_REGION == 'us-east-1':
        s3.create_bucket(Bucket=BUCKET_NAME)
    else:
        s3.create_bucket(
            Bucket=BUCKET_NAME,
            CreateBucketConfiguration={'LocationConstraint': AWS_REGION}
        )
    print(f"SUCCESS: Created bucket '{BUCKET_NAME}'")
except ClientError as e:
    # Compare the structured error code rather than searching the message text
    if e.response['Error']['Code'] == 'BucketAlreadyOwnedByYou':
        print(f"Bucket already exists: {BUCKET_NAME}")
    else:
        raise

print(f"\nS3 URI: s3://{BUCKET_NAME}/")

SUCCESS: Created bucket 'udacity-redshift-staging-4a247f6b'

S3 URI: s3://udacity-redshift-staging-4a247f6b/
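
Bucket names are global, and S3 rejects names that break its naming rules. The sketch below checks a candidate name against a subset of those rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit) before any API call is made. The function names are hypothetical, and the check is deliberately stricter than S3 itself, which also permits dots.

```python
import re
import uuid

# Subset of the S3 naming rules: 3-63 chars, lowercase letters, digits, hyphens,
# beginning and ending with a letter or digit. Dots are intentionally excluded.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name):
    """Return True if the name passes the simplified rule set above."""
    return _BUCKET_NAME.fullmatch(name) is not None


def make_staging_bucket_name(prefix="udacity-redshift-staging"):
    """Append an 8-character random suffix and validate the result."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not is_valid_bucket_name(name):
        raise ValueError(f"Invalid bucket name: {name}")
    return name


print(is_valid_bucket_name("udacity-redshift-staging-4a247f6b"))
print(is_valid_bucket_name("My_Bucket"))
```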


---
## Step 2: Review and Upload Transit Trip Data to S3

We'll use the Vancouver Transit trips dataset, which contains real transit data.

In [5]:
# ========= Review the source data
csv_path = "data/van_transit_trips_postgres.csv"

trips_df = pd.read_csv(csv_path)

print(f"Dataset: {csv_path}")
print(f"Shape: {trips_df.shape[0]:,} rows x {trips_df.shape[1]} columns")
print(f"\nColumns ({len(trips_df.columns)}):")
for i, col in enumerate(trips_df.columns, 1):
    print(f"   {i:2}. {col}")

Dataset: data/van_transit_trips_postgres.csv
Shape: 2,500 rows x 23 columns

Columns (23):
    1. trip_id
    2. rider_id
    3. route_id
    4. mode
    5. origin_station_id
    6. destination_station_id
    7. board_datetime
    8. alight_datetime
    9. country
   10. province
   11. fare_class
   12. payment_method
   13. transfers
   14. zones_charged
   15. distance_km
   16. base_fare_cad
   17. discount_rate
   18. discount_amount_cad
   19. yvr_addfare_cad
   20. total_fare_cad
   21. on_time_arrival
   22. service_disruption
   23. polyline_stations


In [6]:
# ========= Preview the data
print("Sample data (first 5 rows):")
trips_df.head()

Sample data (first 5 rows):


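|   | trip_id | rider_id | route_id | mode | origin_station_id | destination_station_id | board_datetime | alight_datetime | country | province | ... | zones_charged | distance_km | base_fare_cad | discount_rate | discount_amount_cad | yvr_addfare_cad | total_fare_cad | on_time_arrival | service_disruption | polyline_stations |
|---|---------|----------|----------|------|-------------------|------------------------|----------------|-----------------|---------|----------|-----|---------------|-------------|---------------|---------------|---------------------|-----------------|----------------|-----------------|--------------------|-------------------|
| 0 | T100000 | R33247 | R111 | bus | S021 | S004 | 2024-01-31 10:45:08 | 2024-01-31 11:12:09 | CA | BC | ... | 1 | 7.26 | 3.32 | 0.0 | 0.0 | 0.0 | 3.32 | True | False | S001\|S025\|S009\|S008\|S009 |
| 1 | T100001 | R43159 | R033 | bus | S005 | S025 | 2024-08-08 00:16:41 | 2024-08-08 00:44:35 | CA | BC | ... | 1 | 11.66 | 3.17 | 0.0 | 0.0 | 0.0 | 3.17 | True | False | S004\|S023\|S025\|S030\|S018\|S003\|S020 |
| 2 | T100002 | R18110 | R001 | bus | S014 | S002 | 2024-05-28 02:42:12 | 2024-05-28 03:14:48 | CA | BC | ... | 1 | 15.35 | 3.12 | 0.32 | 1.0 | 0.0 | 2.12 | False | False | S001\|S004\|S008\|S009\|S018\|S021\|S001\|S019\|S007 |
| 3 | T100003 | R97939 | R023 | seabus | S023 | S021 | 2025-06-14 06:40:38 | 2025-06-14 06:53:33 | CA | BC | ... | 1 | 4.74 | 3.12 | 0.32 | 1.0 | 0.0 | 2.12 | True | False | S023\|S018\|S014\|S008\|S016\|S020 |
| 4 | T100004 | R85766 | R103 | seabus | S009 | S027 | 2024-01-29 18:06:53 | 2024-01-29 18:21:56 | CA | BC | ... | 2 | 7.99 | 4.51 | 0.0 | 0.0 | 0.0 | 4.51 | True | False | S028\|S001\|S026\|S027\|S006\|S024\|S014\|S011\|S009\|S005 |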


In [7]:
# ========= Upload CSV to S3
s3_key = 'staging/trips/van_transit_trips.csv'

# Upload the file
with open(csv_path, 'rb') as f:
    s3.put_object(
        Bucket=BUCKET_NAME,
        Key=s3_key,
        Body=f
    )

# Get file size
response = s3.head_object(Bucket=BUCKET_NAME, Key=s3_key)
file_size = response['ContentLength']

print(f"Uploaded: s3://{BUCKET_NAME}/{s3_key}")
print(f"   - Records: {len(trips_df):,}")
print(f"   - Size: {file_size:,} bytes")

Uploaded: s3://udacity-redshift-staging-4a247f6b/staging/trips/van_transit_trips.csv
   - Records: 2,500
   - Size: 459,888 bytes


In [8]:
# ========= Verify the upload
response = s3.list_objects_v2(Bucket=BUCKET_NAME, Prefix='staging/trips/')

print("Files in staging/trips/:")
for obj in response.get('Contents', []):
    print(f"   - {obj['Key']} ({obj['Size']:,} bytes)")

Files in staging/trips/:
   - staging/trips/van_transit_trips.csv (459,888 bytes)
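
Listing the key confirms that the object exists, but not that it is complete. A minimal sketch of a stronger check, assuming the `head_object` response shape used earlier, compares the local file size with the reported `ContentLength`. `upload_size_matches` is a hypothetical helper, and the demo uses a temporary file and a hand-built dictionary in place of a real S3 response.

```python
import os
import tempfile


def upload_size_matches(local_path, head_response):
    """Return True when the local file size equals the S3 ContentLength."""
    return os.path.getsize(local_path) == head_response["ContentLength"]


# Demo: a small temporary file stands in for the CSV
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"trip_id,rider_id\nT100000,R33247\n")
    demo_path = tmp.name

local_size = os.path.getsize(demo_path)
size_ok = upload_size_matches(demo_path, {"ContentLength": local_size})
size_bad = upload_size_matches(demo_path, {"ContentLength": local_size - 1})
os.remove(demo_path)

print(size_ok, size_bad)
```

In the notebook, the call would look like `upload_size_matches(csv_path, s3.head_object(Bucket=BUCKET_NAME, Key=s3_key))`.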


---
## Step 3: Create Staging Table in Redshift

Before we can COPY data, we need a target table in Redshift. By default, COPY maps CSV fields to table columns by position, so the table's column order and data types must match the CSV.

In [9]:
# ========= Create staging table with correct schema
create_table_sql = """
DROP TABLE IF EXISTS public.stg_trips_raw;

CREATE TABLE public.stg_trips_raw (
    trip_id               VARCHAR(32),
    rider_id              VARCHAR(32),
    route_id              VARCHAR(32),
    mode                  VARCHAR(16),
    origin_station_id     VARCHAR(32),
    destination_station_id VARCHAR(32),
    board_datetime        TIMESTAMP,
    alight_datetime       TIMESTAMP,
    country               VARCHAR(8),
    province              VARCHAR(8),
    fare_class            VARCHAR(16),
    payment_method        VARCHAR(32),
    transfers             INTEGER,
    zones_charged         INTEGER,
    distance_km           DECIMAL(10,2),
    base_fare_cad         DECIMAL(10,2),
    discount_rate         DECIMAL(5,3),
    discount_amount_cad   DECIMAL(10,2),
    yvr_addfare_cad       DECIMAL(10,2),
    total_fare_cad        DECIMAL(10,2),
    on_time_arrival       BOOLEAN,
    service_disruption    BOOLEAN,
    polyline_stations     VARCHAR(512)
);
"""

execute_redshift_query(create_table_sql, fetch_results=False)
print("SUCCESS: Created table public.stg_trips_raw")
print("   - Columns: 23 (matching CSV schema)")

SUCCESS: Created table public.stg_trips_raw
   - Columns: 23 (matching CSV schema)
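
Because a CSV load fills table columns in order, a column that is out of place may load into the wrong field or fail on a type mismatch. The sketch below compares two ordered column lists and reports the positions that differ. `column_mismatches` is a hypothetical helper, and the short lists are illustrative; in the notebook the inputs could be `list(trips_df.columns)` and the column names from the `CREATE TABLE` statement.

```python
from itertools import zip_longest


def column_mismatches(csv_columns, table_columns):
    """Return (position, csv_name, table_name) for every position that differs."""
    mismatches = []
    pairs = zip_longest(csv_columns, table_columns)
    for position, (csv_col, table_col) in enumerate(pairs, start=1):
        if csv_col != table_col:
            mismatches.append((position, csv_col, table_col))
    return mismatches


csv_cols = ["trip_id", "rider_id", "route_id", "mode"]
good_table = ["trip_id", "rider_id", "route_id", "mode"]
swapped_table = ["trip_id", "route_id", "rider_id", "mode"]

print(column_mismatches(csv_cols, good_table))
print(column_mismatches(csv_cols, swapped_table))
```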


---
## Step 4: Load Data with COPY Command

Now we'll use the COPY command to load data from S3 into Redshift.

### COPY Command Components

| Component | Purpose |
|-----------|----------|
| `COPY table_name` | Target table for the load |
| `FROM 's3://...'` | S3 path to data files |
| `CREDENTIALS` | AWS credentials for S3 access (workspace) |
| `IAM_ROLE` | IAM role ARN for S3 access (production) |
| `FORMAT AS CSV` | File format |
| `IGNOREHEADER 1` | Skip header row |

### Production vs Workspace Authentication

```sql
-- PRODUCTION: Use IAM Role (recommended)
COPY table_name FROM 's3://bucket/path/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS CSV;

-- WORKSPACE: Use temporary credentials
COPY table_name FROM 's3://bucket/path/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...;token=...'
FORMAT AS CSV;
```
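
The two authentication styles differ only in one clause, so a small helper can choose between them. This is a sketch under the assumption that a role ARN, when supplied, should take precedence over key-based credentials; `copy_auth_clause` and its parameter names are hypothetical, and the key values below are placeholders.

```python
def copy_auth_clause(iam_role_arn=None, access_key_id=None,
                     secret_access_key=None, session_token=None):
    """Build the authorization clause of a COPY statement."""
    if iam_role_arn:
        return f"IAM_ROLE '{iam_role_arn}'"
    if not (access_key_id and secret_access_key):
        raise ValueError("Provide an IAM role ARN or an access key pair")
    parts = [
        f"aws_access_key_id={access_key_id}",
        f"aws_secret_access_key={secret_access_key}",
    ]
    # Temporary credentials also carry a session token
    if session_token:
        parts.append(f"token={session_token}")
    return "CREDENTIALS '" + ";".join(parts) + "'"


print(copy_auth_clause(iam_role_arn="arn:aws:iam::123456789012:role/RedshiftS3Role"))
print(copy_auth_clause(access_key_id="KEY", secret_access_key="SECRET",
                       session_token="TOKEN"))
```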

In [10]:
# ========= Execute COPY command
s3_path = f"s3://{BUCKET_NAME}/staging/trips/"

copy_sql = f"""
COPY public.stg_trips_raw
FROM '{s3_path}'
CREDENTIALS 'aws_access_key_id={AWS_ACCESS_KEY_ID};aws_secret_access_key={AWS_SECRET_ACCESS_KEY};token={AWS_SESSION_TOKEN}'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
REGION '{AWS_REGION}';
"""

print("Executing COPY command...")
print("=" * 60)
print(f"""
COPY public.stg_trips_raw
FROM '{s3_path}'
CREDENTIALS '***hidden***'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
REGION '{AWS_REGION}';
""")
print("=" * 60)

start_time = time.time()
execute_redshift_query(copy_sql, fetch_results=False)
elapsed = time.time() - start_time

print(f"\nSUCCESS: COPY completed in {elapsed:.1f} seconds")

Executing COPY command...

COPY public.stg_trips_raw
FROM 's3://udacity-redshift-staging-4a247f6b/staging/trips/'
CREDENTIALS '***hidden***'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
REGION 'us-east-1';


SUCCESS: COPY completed in 21.7 seconds
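
The cell above keeps a second, hand-written copy of the statement just to print it without secrets. One alternative, sketched here, is to mask the `CREDENTIALS` value in the real statement so the printed text cannot drift from what is executed. `redact_credentials` is a hypothetical helper, and the sample statement uses placeholder keys.

```python
import re

# Matches CREDENTIALS followed by a single-quoted string
_CREDENTIALS = re.compile(r"(CREDENTIALS\s+)'[^']*'", re.IGNORECASE)


def redact_credentials(sql):
    """Replace the quoted CREDENTIALS value with a fixed mask."""
    return _CREDENTIALS.sub(r"\1'***hidden***'", sql)


sample_sql = (
    "COPY public.stg_trips_raw\n"
    "FROM 's3://your-bucket/staging/trips/'\n"
    "CREDENTIALS 'aws_access_key_id=KEY;aws_secret_access_key=SECRET;token=TOKEN'\n"
    "FORMAT AS CSV;"
)
safe_sql = redact_credentials(sample_sql)
print(safe_sql)
```

In the notebook this would be called as `print(redact_credentials(copy_sql))`.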


---
## Step 5: Validate the Data Load

In [11]:
# ========= Verify row count
result = execute_redshift_query("SELECT COUNT(*) as row_count FROM public.stg_trips_raw;")
db_count = int(result['row_count'].iloc[0])
csv_count = len(trips_df)

print("Row Count Validation:")
print(f"   - CSV file:  {csv_count:,} rows")
print(f"   - Database:  {db_count:,} rows")
print(f"   - Match: {'YES' if db_count == csv_count else 'NO - Check STL_LOAD_ERRORS'}")

Row Count Validation:
   - CSV file:  2,500 rows
   - Database:  2,500 rows
   - Match: YES


In [12]:
# ========= Check for NULL values in key columns
null_check_sql = """
SELECT 
    COUNT(*) as total_rows,
    SUM(CASE WHEN trip_id IS NULL THEN 1 ELSE 0 END) as null_trip_id,
    SUM(CASE WHEN rider_id IS NULL THEN 1 ELSE 0 END) as null_rider_id,
    SUM(CASE WHEN board_datetime IS NULL THEN 1 ELSE 0 END) as null_board_datetime,
    SUM(CASE WHEN fare_class IS NULL THEN 1 ELSE 0 END) as null_fare_class,
    SUM(CASE WHEN total_fare_cad IS NULL THEN 1 ELSE 0 END) as null_total_fare
FROM public.stg_trips_raw;
"""

result = execute_redshift_query(null_check_sql)
print("NULL Value Check:")
for col in result.columns:
    print(f"   - {col}: {result[col].iloc[0]}")

NULL Value Check:
   - total_rows: 2500
   - null_trip_id: 0
   - null_rider_id: 0
   - null_board_datetime: 0
   - null_fare_class: 0
   - null_total_fare: 0
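
The NULL check follows one pattern per column, so the query can be generated from a list of column names. The sketch below assumes the names come from trusted code, since they are interpolated directly into the SQL; `build_null_check_sql` is a hypothetical helper.

```python
def build_null_check_sql(table, columns):
    """Build a query that counts NULLs in each listed column of the table."""
    checks = ",\n    ".join(
        f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END) AS null_{col}"
        for col in columns
    )
    return f"SELECT\n    COUNT(*) AS total_rows,\n    {checks}\nFROM {table};"


key_columns = ["trip_id", "rider_id", "board_datetime"]
generated_sql = build_null_check_sql("public.stg_trips_raw", key_columns)
print(generated_sql)
```

The resulting string can be passed to `execute_redshift_query` in the same way as `null_check_sql`.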


In [13]:
# ========= Sample data from staging table
sample_sql = """
SELECT 
    trip_id, rider_id, route_id, mode, fare_class, 
    total_fare_cad, board_datetime
FROM public.stg_trips_raw
LIMIT 5;
"""

print("Sample Data from Staging Table:")
execute_redshift_query(sample_sql)

Sample Data from Staging Table:


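|   | trip_id | rider_id | route_id | mode | fare_class | total_fare_cad | board_datetime |
|---|---------|----------|----------|------|------------|----------------|----------------|
| 0 | T100000 | R33247 | R111 | bus | adult | 3.32 | 2024-01-31 10:45:08 |
| 1 | T100001 | R43159 | R033 | bus | adult | 3.17 | 2024-08-08 00:16:41 |
| 2 | T100002 | R18110 | R001 | bus | youth | 2.12 | 2024-05-28 02:42:12 |
| 3 | T100003 | R97939 | R023 | seabus | youth | 2.12 | 2025-06-14 06:40:38 |
| 4 | T100004 | R85766 | R103 | seabus | adult | 4.51 | 2024-01-29 18:06:53 |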


In [14]:
# ========= Summary statistics
summary_sql = """
SELECT 
    COUNT(*) as total_trips,
    COUNT(DISTINCT rider_id) as unique_riders,
    COUNT(DISTINCT route_id) as unique_routes,
    COUNT(DISTINCT fare_class) as fare_classes,
    MIN(board_datetime) as earliest_trip,
    MAX(board_datetime) as latest_trip,
    ROUND(SUM(total_fare_cad), 2) as total_revenue,
    ROUND(AVG(total_fare_cad), 2) as avg_fare
FROM public.stg_trips_raw;
"""

print("Staging Table Summary:")
print("=" * 60)
result = execute_redshift_query(summary_sql)
for col in result.columns:
    print(f"   {col}: {result[col].iloc[0]}")

Staging Table Summary:
   total_trips: 2500
   unique_riders: 2455
   unique_routes: 120
   fare_classes: 5
   earliest_trip: 2024-01-01 01:08:58
   latest_trip: 2025-06-29 20:37:07
   total_revenue: 7570.32
   avg_fare: 3.02


---
## Step 6: COPY Options Reference

The COPY command supports many options for different data scenarios.

### Common COPY Options

| Option | Purpose | Example |
|--------|---------|----------|
| `IGNOREHEADER n` | Skip first n rows (header) | `IGNOREHEADER 1` |
| `DELIMITER 'char'` | Column separator | `DELIMITER ','` |
| `TIMEFORMAT` | Timestamp parsing | `TIMEFORMAT 'auto'` |
| `DATEFORMAT` | Date parsing | `DATEFORMAT 'auto'` |
| `BLANKSASNULL` | Empty strings become NULL | - |
| `EMPTYASNULL` | Empty fields become NULL | - |
| `MAXERROR n` | Allow up to n errors | `MAXERROR 100` |
| `TRUNCATECOLUMNS` | Truncate data exceeding column width | - |

In [17]:
# ========= COPY Command Reference Examples

print("COPY Command Examples")
print("=" * 60)

# Example 1: Basic CSV load (Production with IAM Role)
print("""
1. PRODUCTION: Basic CSV Load with IAM Role
--------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1;
""")

# Example 2: CSV with error handling
print("""
2. Robust CSV Load with Error Handling
--------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1
DELIMITER ','
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
BLANKSASNULL
EMPTYASNULL
ACCEPTINVCHARS AS '?'
MAXERROR 100
TRUNCATECOLUMNS;
""")

# Example 3: Parquet format (recommended for production)
print("""
3. Parquet Format (Recommended for Production)
----------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

Note: Parquet is 5-10x faster than CSV and doesn't need
      IGNOREHEADER or DELIMITER options.
""")

COPY Command Examples

1. PRODUCTION: Basic CSV Load with IAM Role
--------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1;


2. Robust CSV Load with Error Handling
--------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1
DELIMITER ','
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
BLANKSASNULL
EMPTYASNULL
ACCEPTINVCHARS AS '?'
MAXERROR 100
TRUNCATECOLUMNS;


3. Parquet Format (Recommended for Production)
----------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

Note: Parquet is 5-10x faster than CSV and doesn't need
      IGNOREHEADER or DELIMITER options.



---
## Step 7: Check for Load Errors

If a COPY command fails, or loads fewer rows than expected, Redshift's load error log records the file, line number, and column behind each rejected row.

In [19]:
# ========= Check for COPY errors
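# Note: access to stl_load_errors depends on your privileges and deployment type.
# On Redshift Serverless, load errors are exposed through the SYS_LOAD_ERROR_DETAIL
# view instead, so this query may be rejected in this workspace.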

error_sql = """
SELECT 
    starttime,
    filename,
    line_number,
    colname,
    err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
"""

try:
    errors = execute_redshift_query(error_sql)
    if errors is not None and len(errors) > 0:
        print("Recent COPY errors:")
        print(errors)
    else:
        print("No COPY errors found!")
except Exception as e:
    if 'permission denied' in str(e).lower():
        print("Note: Cannot access stl_load_errors (requires superuser privileges)")
        print("\nIn production environments, you would check this table to debug COPY failures.")
        print("\nAlternative: If COPY fails, the error message in the exception usually")
        print("contains enough information to diagnose the issue.")
    else:
        print(f"Error: {e}")

Note: Cannot access stl_load_errors (requires superuser privileges)

In production environments, you would check this table to debug COPY failures.

Alternative: If COPY fails, the error message in the exception usually
contains enough information to diagnose the issue.
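
On Redshift Serverless, the `SYS_LOAD_ERROR_DETAIL` monitoring view is the documented place to look for COPY errors. The sketch below builds an equivalent query against that view. The column names (`start_time`, `file_name`, `line_number`, `column_name`, `error_message`) reflect my understanding of the AWS documentation and should be verified against your environment; `build_load_error_sql` is a hypothetical helper.

```python
def build_load_error_sql(limit=10):
    """Build a query for the most recent COPY errors on Redshift Serverless."""
    limit = int(limit)  # guard against non-numeric input
    return (
        "SELECT start_time, file_name, line_number, column_name, error_message\n"
        "FROM sys_load_error_detail\n"
        "ORDER BY start_time DESC\n"
        f"LIMIT {limit};"
    )


serverless_error_sql = build_load_error_sql(limit=5)
print(serverless_error_sql)
```

The string can be passed to `execute_redshift_query` inside the same `try`/`except` pattern used above.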


---
## Cleanup (Optional)

Delete the S3 bucket when you're done to avoid storage charges.

In [None]:
# ========= OPTIONAL: Delete S3 bucket
# Uncomment the code below to delete the bucket after the exercise

# print("Deleting S3 bucket and contents...")
# 
# # First delete all objects
# response = s3.list_objects_v2(Bucket=BUCKET_NAME)
# for obj in response.get('Contents', []):
#     s3.delete_object(Bucket=BUCKET_NAME, Key=obj['Key'])
#     print(f"   Deleted: {obj['Key']}")
# 
# # Then delete the bucket
# s3.delete_bucket(Bucket=BUCKET_NAME)
# print(f"   Deleted bucket: {BUCKET_NAME}")

print("Cleanup skipped. Uncomment the code above to delete the S3 bucket.")
print(f"\nBucket to delete later: {BUCKET_NAME}")
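
A single `list_objects_v2` call returns at most 1,000 keys, and `delete_objects` accepts at most 1,000 keys per request, so a bucket holding more objects needs batching. The sketch below shows only the batching step; `delete_batches` is a hypothetical helper, and the key names are made up for the demo.

```python
def delete_batches(keys, batch_size=1000):
    """Yield payloads sized for the Delete argument of s3.delete_objects."""
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}


demo_keys = [f"staging/trips/part-{i:04d}.csv" for i in range(2500)]
batches = list(delete_batches(demo_keys))
print([len(batch["Objects"]) for batch in batches])
```

Each payload would be sent with `s3.delete_objects(Bucket=BUCKET_NAME, Delete=payload)`, with the keys gathered through the `list_objects_v2` paginator.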

---
## Summary

In this exercise, you learned:

1. **S3 Bucket Creation** - Created a unique bucket for staging data
2. **Data Upload** - Uploaded CSV files to S3 using boto3
3. **Table Creation** - Created a staging table matching the source schema
4. **COPY Command** - Loaded data from S3 into Redshift using COPY
5. **Validation** - Verified the data load with row counts and sample queries

### Key Takeaways

- **COPY is faster** than INSERT for bulk loads (parallel across nodes)
- **Schema must match** between source files and target table
- **Check STL_LOAD_ERRORS** (or `SYS_LOAD_ERROR_DETAIL` on Redshift Serverless) when loads fail or have fewer rows than expected
- **Use IAM roles** in production instead of credential strings

### Production COPY Template

```sql
-- Idempotent staging load (PRODUCTION)
TRUNCATE TABLE stg.trips_raw;

COPY stg.trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto';

-- Validate
SELECT COUNT(*) FROM stg.trips_raw;
```