# Lesson 4 - Exercise 1: Load Staging Table with COPY Command

## Learning Objectives

By completing this exercise, you will:

1. Create your own S3 bucket for staging data
2. Upload data files to S3
3. Use the COPY command to load data into Redshift
4. Understand COPY options for different file formats
5. Validate successful data loads

## Prerequisites

- AWS credentials configured in `aws_config.py`
- Redshift Serverless workgroup available

## Context

In production Redshift workflows, data typically flows from ETL scripts to S3 and then into Redshift via the COPY command. COPY loads files in parallel across the available compute, which makes it far more efficient than row-by-row INSERT statements.

**Note on Authentication**: In production, you would use `IAM_ROLE` for secure access. In this workspace, we use temporary credentials with the `CREDENTIALS` parameter because no default IAM role is configured on the Redshift Serverless workgroup.

---
## Setup: Imports and Configuration

In [18]:
# ========= Imports
import os
import time
import uuid
from datetime import datetime

import pandas as pd
import numpy as np
import boto3
from botocore.exceptions import ClientError

print("Imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

Imports successful!
   - pandas version: 2.3.1
   - numpy version: 2.2.6


In [19]:
# ========= Load AWS Credentials
import aws_config

AWS_REGION = os.getenv('AWS_REGION')
REDSHIFT_DATABASE = os.getenv('REDSHIFT_DATABASE')
REDSHIFT_WORKGROUP = os.getenv('REDSHIFT_WORKGROUP')

# Get credentials for COPY command
AWS_ACCESS_KEY_ID = os.getenv('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.getenv('AWS_SECRET_ACCESS_KEY')
AWS_SESSION_TOKEN = os.getenv('AWS_SESSION_TOKEN')

print("Configuration loaded!")
print(f"   - AWS Region: {AWS_REGION}")
print(f"   - Redshift Database: {REDSHIFT_DATABASE}")
print(f"   - Redshift Workgroup: {REDSHIFT_WORKGROUP}")
print(f"   - Credentials: {'Loaded' if AWS_ACCESS_KEY_ID else 'Missing'}")

Configuration loaded!
   - AWS Region: us-east-1
   - Redshift Database: dev
   - Redshift Workgroup: udacity-dwh-wg
   - Credentials: Loaded


In [20]:
# ========= Helper Function: Execute Redshift Query
def execute_redshift_query(sql, fetch_results=True):
    """Execute a SQL query on Redshift Serverless and optionally return results."""
    client = boto3.client('redshift-data', region_name=AWS_REGION)
    
    # Submit the query; the Data API is asynchronous and returns an Id immediately
    response = client.execute_statement(
        WorkgroupName=REDSHIFT_WORKGROUP,
        Database=REDSHIFT_DATABASE,
        Sql=sql
    )
    query_id = response['Id']
    
    # Poll until the statement reaches a terminal state
    status = 'SUBMITTED'
    while status in ['SUBMITTED', 'PICKED', 'STARTED']:
        time.sleep(1)
        status_response = client.describe_statement(Id=query_id)
        status = status_response['Status']
    
    # FAILED and ABORTED are both terminal states that produce no results
    if status != 'FINISHED':
        error = status_response.get('Error', 'Unknown error')
        raise Exception(f"Query {status.lower()}: {error}")
    
    # Fetch results if requested (first page only; NextToken is not followed)
    if fetch_results:
        try:
            result = client.get_statement_result(Id=query_id)
            columns = [col['name'] for col in result['ColumnMetadata']]
            rows = []
            for record in result['Records']:
                # Each field is a one-key dict such as {'stringValue': 'x'};
                # SQL NULLs arrive as {'isNull': True} and must map to None
                rows.append([
                    None if field.get('isNull') else next(iter(field.values()))
                    for field in record
                ])
            return pd.DataFrame(rows, columns=columns)
        except ClientError:
            # Statements without a result set (DDL, COPY) have nothing to fetch
            return None
    
    return None

print("Helper functions defined!")

Helper functions defined!
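
The polling loop in the helper above waits for as long as the statement stays in a non-terminal state. The sketch below shows one way to bound that wait with a timeout and exponential backoff. `wait_for_statement` and its parameters are hypothetical names introduced for illustration; the function only assumes a client object that exposes `describe_statement(Id=...)`, as the boto3 `redshift-data` client does.

```python
import time

# Terminal statuses reported by describe_statement
TERMINAL_STATES = {"FINISHED", "FAILED", "ABORTED"}


def wait_for_statement(client, statement_id, timeout_s=300.0,
                       initial_delay_s=0.5, max_delay_s=5.0):
    """Poll describe_statement until a terminal status or until timeout_s elapses.

    Hypothetical helper: returns the last describe_statement response,
    or raises TimeoutError if the statement is still running at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        desc = client.describe_statement(Id=statement_id)
        if desc["Status"] in TERMINAL_STATES:
            return desc
        if time.monotonic() + delay > deadline:
            raise TimeoutError(
                f"Statement {statement_id} still {desc['Status']} after {timeout_s}s"
            )
        time.sleep(delay)
        # Double the wait between polls, capped at max_delay_s
        delay = min(delay * 2, max_delay_s)
```

The caller still has to inspect `desc["Status"]` afterwards, since `FAILED` and `ABORTED` are returned rather than raised.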


In [21]:
# ========= Build COPY credentials string
# This creates the credentials parameter for the COPY command

def get_copy_credentials():
    """Build the CREDENTIALS string for COPY command."""
    creds = f"aws_access_key_id={AWS_ACCESS_KEY_ID};aws_secret_access_key={AWS_SECRET_ACCESS_KEY}"
    if AWS_SESSION_TOKEN:
        creds += f";token={AWS_SESSION_TOKEN}"
    return creds

# Test that credentials are available
if AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
    print("COPY credentials ready!")
else:
    print("WARNING: Missing AWS credentials. COPY command will fail.")

COPY credentials ready!


---
## Step 1: Create Your Own S3 Bucket

Each student needs their own S3 bucket for staging data. We'll create one with a unique name.

In [22]:
# ========= Create a unique S3 bucket for this workspace
s3 = boto3.client('s3', region_name=AWS_REGION)

# Generate a unique bucket name
BUCKET_NAME = f"udacity-redshift-staging-{uuid.uuid4().hex[:8]}"

try:
    # Create bucket (us-east-1 doesn't need LocationConstraint)
    if AWS_REGION == 'us-east-1':
        s3.create_bucket(Bucket=BUCKET_NAME)
    else:
        s3.create_bucket(
            Bucket=BUCKET_NAME,
            CreateBucketConfiguration={'LocationConstraint': AWS_REGION}
        )
    print(f"SUCCESS: Created bucket '{BUCKET_NAME}'")
except ClientError as e:
    print(f"ERROR: {e}")

# Save bucket name for later use
print(f"\n*** IMPORTANT: Save this bucket name ***")
print(f"    BUCKET_NAME = '{BUCKET_NAME}'")

SUCCESS: Created bucket 'udacity-redshift-staging-27070584'

*** IMPORTANT: Save this bucket name ***
    BUCKET_NAME = 'udacity-redshift-staging-27070584'
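
Bucket names are global across S3 and must follow naming rules, so it can help to check a generated name before calling `create_bucket`. The sketch below covers only a subset of the general-purpose bucket rules (length, allowed characters, no adjacent dots, not shaped like an IPv4 address). `is_plausible_bucket_name` and `make_bucket_name` are hypothetical helpers, not part of boto3.

```python
import re
import uuid

# 3-63 characters, lowercase letters, digits, dots, and hyphens,
# starting and ending with a letter or digit
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
_IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def is_plausible_bucket_name(name):
    """Check a subset of the S3 bucket naming rules (not exhaustive)."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    if ".." in name:
        return False
    if _IPV4_RE.fullmatch(name):
        return False
    return True


def make_bucket_name(prefix="udacity-redshift-staging"):
    """Build a name like the one used in this exercise and sanity-check it."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if not is_plausible_bucket_name(name):
        raise ValueError(f"Generated bucket name is not valid: {name!r}")
    return name


print(make_bucket_name())
```

A name that passes this check can still be rejected by S3, for example when another account already owns it.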


In [23]:
# ========= Create folder structure in S3 (zero-byte placeholder objects whose keys end in '/')
folders = [
    'staging/trips/',
    'staging/stations/',
    'staging/events/'
]

for folder in folders:
    s3.put_object(Bucket=BUCKET_NAME, Key=folder)
    print(f"Created: s3://{BUCKET_NAME}/{folder}")

Created: s3://udacity-redshift-staging-27070584/staging/trips/
Created: s3://udacity-redshift-staging-27070584/staging/stations/
Created: s3://udacity-redshift-staging-27070584/staging/events/


---
## Step 2: Upload Sample Data to S3

We'll create sample trip data and upload it to our S3 bucket.

In [24]:
# ========= Create sample trip data
np.random.seed(42)
n_records = 1000

trips_df = pd.DataFrame({
    'trip_id': [f'TRIP_{i:06d}' for i in range(1, n_records + 1)],
    'rider_id': [f'RIDER_{np.random.randint(1, 201):05d}' for _ in range(n_records)],
    'start_station_id': np.random.randint(1, 50, n_records),
    'end_station_id': np.random.randint(1, 50, n_records),
    'start_time': pd.date_range('2024-01-01', periods=n_records, freq='15min'),
    'end_time': pd.date_range('2024-01-01 00:30:00', periods=n_records, freq='15min'),
    'duration_minutes': np.random.randint(5, 60, n_records),
    'distance_km': np.round(np.random.uniform(0.5, 15.0, n_records), 2),
    'fare_usd': np.round(np.random.uniform(2.50, 25.00, n_records), 2)
})

print(f"Created {len(trips_df)} trip records")
trips_df.head()

Created 1000 trip records


| | trip_id | rider_id | start_station_id | end_station_id | start_time | end_time | duration_minutes | distance_km | fare_usd |
|---|---------|----------|------------------|----------------|------------|----------|------------------|-------------|----------|
| 0 | TRIP_000001 | RIDER_00103 | 40 | 9 | 2024-01-01 00:00:00 | 2024-01-01 00:30:00 | 52 | 1.5 | 10.73 |
| 1 | TRIP_000002 | RIDER_00180 | 49 | 39 | 2024-01-01 00:15:00 | 2024-01-01 00:45:00 | 49 | 0.58 | 10.46 |
| 2 | TRIP_000003 | RIDER_00093 | 44 | 31 | 2024-01-01 00:30:00 | 2024-01-01 01:00:00 | 40 | 14.09 | 9.31 |
| 3 | TRIP_000004 | RIDER_00015 | 19 | 32 | 2024-01-01 00:45:00 | 2024-01-01 01:15:00 | 34 | 12.42 | 24.45 |
| 4 | TRIP_000005 | RIDER_00107 | 42 | 41 | 2024-01-01 01:00:00 | 2024-01-01 01:30:00 | 59 | 12.23 | 6.33 |


In [25]:
# ========= Upload CSV to S3
# Encode once so the uploaded body and the reported size refer to the same bytes
csv_bytes = trips_df.to_csv(index=False).encode('utf-8')

s3.put_object(
    Bucket=BUCKET_NAME,
    Key='staging/trips/trips_2024_01.csv',
    Body=csv_bytes
)

print(f"Uploaded: s3://{BUCKET_NAME}/staging/trips/trips_2024_01.csv")
print(f"   - Records: {len(trips_df)}")
print(f"   - Size: {len(csv_bytes):,} bytes")

Uploaded: s3://udacity-redshift-staging-27070584/staging/trips/trips_2024_01.csv
   - Records: 1000
   - Size: 83,449 bytes


In [26]:
# ========= Verify the upload
response = s3.list_objects_v2(Bucket=BUCKET_NAME, Prefix='staging/trips/')

print("Files in staging/trips/:")
for obj in response.get('Contents', []):
    print(f"   - {obj['Key']} ({obj['Size']:,} bytes)")

Files in staging/trips/:
   - staging/trips/ (0 bytes)
   - staging/trips/trips_2024_01.csv (83,449 bytes)
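
COPY loads every object under the prefix, and its parallelism pays off mainly when the data is spread over several files. The sketch below splits a DataFrame into a few CSV parts that could each be uploaded under `staging/trips/`. `split_into_csv_parts` and the `part_NNNN.csv` naming are assumptions made for illustration; every part keeps its own header row, which matches a load that uses `IGNOREHEADER 1`.

```python
import numpy as np
import pandas as pd


def split_into_csv_parts(df, n_parts, key_prefix):
    """Return {s3_key: csv_text} with the rows of df spread over n_parts files."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    # Split the row positions into n_parts near-equal contiguous chunks
    chunks = np.array_split(np.arange(len(df)), n_parts)
    parts = {}
    for i, positions in enumerate(chunks):
        key = f"{key_prefix}part_{i:04d}.csv"
        parts[key] = df.iloc[positions].to_csv(index=False)
    return parts


# Small stand-in for trips_df
demo_df = pd.DataFrame({
    "trip_id": [f"TRIP_{i:06d}" for i in range(1, 11)],
    "fare_usd": np.arange(10) + 2.5,
})

demo_parts = split_into_csv_parts(demo_df, 4, "staging/trips/")
for key, text in demo_parts.items():
    print(key, len(text.splitlines()) - 1, "rows")
```

Each value in the returned dict can be passed to `s3.put_object` as `Body=text.encode('utf-8')`, in the same way as the single-file upload above.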


---
## Step 3: Create Staging Table in Redshift

Before we can COPY data, we need a target table in Redshift.

In [27]:
# ========= Create staging table
create_table_sql = """
DROP TABLE IF EXISTS public.stg_trips_raw;

CREATE TABLE public.stg_trips_raw (
    trip_id VARCHAR(50),
    rider_id VARCHAR(50),
    start_station_id INTEGER,
    end_station_id INTEGER,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    duration_minutes INTEGER,
    distance_km DECIMAL(10,2),
    fare_usd DECIMAL(10,2)
);
"""

execute_redshift_query(create_table_sql, fetch_results=False)
print("SUCCESS: Created table public.stg_trips_raw")

SUCCESS: Created table public.stg_trips_raw
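
The staging DDL above was written by hand. For quick experiments, a first draft can be derived from the DataFrame's dtypes, as sketched below. The type mapping and the helper names `redshift_type_for` and `build_create_table_sql` are assumptions chosen for illustration; the mapping is deliberately coarse (for example `BIGINT` for every integer column and `DOUBLE PRECISION` instead of `DECIMAL(10,2)`), so hand-written DDL remains preferable when exact types matter.

```python
import pandas as pd


def redshift_type_for(dtype):
    """Map a pandas dtype to a coarse Redshift column type (illustrative mapping)."""
    if pd.api.types.is_bool_dtype(dtype):
        return "BOOLEAN"
    if pd.api.types.is_integer_dtype(dtype):
        return "BIGINT"
    if pd.api.types.is_float_dtype(dtype):
        return "DOUBLE PRECISION"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    # Fall back to a generic string column for everything else
    return "VARCHAR(256)"


def build_create_table_sql(table_name, df):
    """Build a CREATE TABLE statement from the columns and dtypes of df."""
    columns = ",\n".join(
        f"    {name} {redshift_type_for(dtype)}"
        for name, dtype in df.dtypes.items()
    )
    return f"CREATE TABLE {table_name} (\n{columns}\n);"


# Small stand-in with the same kinds of columns as trips_df
demo_df = pd.DataFrame({
    "trip_id": ["TRIP_000001", "TRIP_000002"],
    "start_station_id": [40, 49],
    "start_time": pd.to_datetime(["2024-01-01 00:00:00", "2024-01-01 00:15:00"]),
    "fare_usd": [10.73, 10.46],
})

demo_ddl = build_create_table_sql("public.stg_trips_demo", demo_df)
print(demo_ddl)
```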


---
## Step 4: Load Data with COPY Command

Now we'll use the COPY command to load data from S3 into Redshift.

### COPY Command Components

| Component | Purpose |
|-----------|----------|
| `COPY table_name` | Target table for the load |
| `FROM 's3://...'` | S3 path to data files |
| `CREDENTIALS` | AWS credentials for S3 access (workspace) |
| `IAM_ROLE` | IAM role ARN for S3 access (production) |
| `FORMAT AS CSV` | File format |
| `IGNOREHEADER 1` | Skip header row |

### Production vs Workspace Authentication

```sql
-- PRODUCTION: Use IAM Role (recommended)
COPY table_name FROM 's3://bucket/path/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3Role'
FORMAT AS CSV;

-- WORKSPACE: Use temporary credentials
COPY table_name FROM 's3://bucket/path/'
CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...;token=...'
FORMAT AS CSV;
```
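
The two authentication styles can sit behind one small function, so the rest of the load code does not need to know which environment it runs in. This is a minimal sketch: `build_auth_clause` is a hypothetical helper, and the key values in the example are placeholders, not real credentials.

```python
def build_auth_clause(iam_role_arn=None, access_key_id=None,
                      secret_access_key=None, session_token=None):
    """Return the COPY authorization clause, preferring IAM_ROLE when an ARN is given."""
    if iam_role_arn:
        return f"IAM_ROLE '{iam_role_arn}'"
    if access_key_id and secret_access_key:
        creds = (
            f"aws_access_key_id={access_key_id};"
            f"aws_secret_access_key={secret_access_key}"
        )
        # Temporary credentials also carry a session token
        if session_token:
            creds += f";token={session_token}"
        return f"CREDENTIALS '{creds}'"
    raise ValueError("Provide either an IAM role ARN or an access key pair")


# Production style
print(build_auth_clause(iam_role_arn="arn:aws:iam::123456789012:role/RedshiftS3Role"))

# Workspace style, with placeholder values
print(build_auth_clause(access_key_id="EXAMPLEKEYID",
                        secret_access_key="example-secret",
                        session_token="example-token"))
```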

In [28]:
# ========= COPY data from S3 to Redshift
copy_sql = f"""
COPY public.stg_trips_raw
FROM 's3://{BUCKET_NAME}/staging/trips/'
CREDENTIALS '{get_copy_credentials()}'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
REGION '{AWS_REGION}';
"""

# Print a sanitized version (hide credentials)
print("Executing COPY command...")
print("=" * 60)
print(f"""
COPY public.stg_trips_raw
FROM 's3://{BUCKET_NAME}/staging/trips/'
CREDENTIALS '***hidden***'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
REGION '{AWS_REGION}';
""")
print("=" * 60)

try:
    execute_redshift_query(copy_sql, fetch_results=False)
    print("\nSUCCESS: COPY command completed!")
except Exception as e:
    print(f"\nERROR: {e}")
    print("\nTroubleshooting tips:")
    print("  1. Check that your AWS credentials haven't expired")
    print("  2. Verify the S3 bucket and path exist")
    print("  3. Check STL_LOAD_ERRORS (or SYS_LOAD_ERROR_DETAIL on Serverless) for detailed error info")

Executing COPY command...

COPY public.stg_trips_raw
FROM 's3://udacity-redshift-staging-27070584/staging/trips/'
CREDENTIALS '***hidden***'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
REGION 'us-east-1';


SUCCESS: COPY command completed!
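
The cell above prints a sanitized copy of the statement by retyping it with the credentials hidden, which can drift from the statement that is actually executed. A small redaction function, sketched below, derives the printable version from the real SQL instead. `redact_copy_sql` is a hypothetical helper and only masks the `CREDENTIALS '...'` form used in this notebook.

```python
import re

# Matches CREDENTIALS followed by a single-quoted string
_CREDENTIALS_RE = re.compile(r"(CREDENTIALS\s+')[^']*(')", flags=re.IGNORECASE)


def redact_copy_sql(sql):
    """Return sql with the contents of any CREDENTIALS '...' clause masked."""
    return _CREDENTIALS_RE.sub(r"\1***hidden***\2", sql)


# Placeholder values, not real credentials
demo_sql = """
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
CREDENTIALS 'aws_access_key_id=EXAMPLEKEYID;aws_secret_access_key=example-secret;token=example-token'
FORMAT AS CSV
IGNOREHEADER 1;
"""

print(redact_copy_sql(demo_sql))
```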


In [29]:
# ========= Validate the load
validation_sql = """
SELECT 
    COUNT(*) as row_count,
    MIN(start_time) as earliest_trip,
    MAX(start_time) as latest_trip,
    ROUND(AVG(fare_usd), 2) as avg_fare
FROM public.stg_trips_raw;
"""

result = execute_redshift_query(validation_sql)
print("Load Validation Results:")
print(result)

Load Validation Results:
   row_count        earliest_trip          latest_trip avg_fare
0       1000  2024-01-01 00:00:00  2024-01-11 09:45:00    13.44
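
Reading the validation output by eye works for one table. The sketch below compares the warehouse summary against the source DataFrame programmatically. `check_load` is a hypothetical helper; it assumes a summary with the `row_count`, `earliest_trip`, and `latest_trip` columns returned by the validation query above, with timestamps as strings.

```python
import pandas as pd


def check_load(source_df, loaded):
    """Compare source_df with a one-row warehouse summary; return a list of problems."""
    problems = []
    if int(loaded["row_count"]) != len(source_df):
        problems.append(
            f"row count mismatch: source={len(source_df)}, loaded={loaded['row_count']}"
        )
    if pd.Timestamp(loaded["earliest_trip"]) != source_df["start_time"].min():
        problems.append("earliest start_time differs from source")
    if pd.Timestamp(loaded["latest_trip"]) != source_df["start_time"].max():
        problems.append("latest start_time differs from source")
    return problems


# Stand-in for trips_df: same start_time range as the generated sample data
demo_source = pd.DataFrame({
    "start_time": pd.date_range("2024-01-01", periods=1000, freq="15min"),
})

# Values taken from the validation output above
demo_summary = {
    "row_count": 1000,
    "earliest_trip": "2024-01-01 00:00:00",
    "latest_trip": "2024-01-11 09:45:00",
}

print(check_load(demo_source, demo_summary) or "Load matches source")
```

With the notebook's own objects, the call would look like `check_load(trips_df, result.iloc[0])`.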


In [30]:
# ========= View sample records
sample_sql = """
SELECT * FROM public.stg_trips_raw LIMIT 5;
"""

result = execute_redshift_query(sample_sql)
print("Sample Records:")
result

Sample Records:


| | trip_id | rider_id | start_station_id | end_station_id | start_time | end_time | duration_minutes | distance_km | fare_usd |
|---|---------|----------|------------------|----------------|------------|----------|------------------|-------------|----------|
| 0 | TRIP_000001 | RIDER_00103 | 40 | 9 | 2024-01-01 00:00:00 | 2024-01-01 00:30:00 | 52 | 1.5 | 10.73 |
| 1 | TRIP_000002 | RIDER_00180 | 49 | 39 | 2024-01-01 00:15:00 | 2024-01-01 00:45:00 | 49 | 0.58 | 10.46 |
| 2 | TRIP_000003 | RIDER_00093 | 44 | 31 | 2024-01-01 00:30:00 | 2024-01-01 01:00:00 | 40 | 14.09 | 9.31 |
| 3 | TRIP_000004 | RIDER_00015 | 19 | 32 | 2024-01-01 00:45:00 | 2024-01-01 01:15:00 | 34 | 12.42 | 24.45 |
| 4 | TRIP_000005 | RIDER_00107 | 42 | 41 | 2024-01-01 01:00:00 | 2024-01-01 01:30:00 | 59 | 12.23 | 6.33 |


---
## Step 5: COPY Options Reference

The COPY command supports many options for different data scenarios.

### Common COPY Options

| Option | Purpose | Example |
|--------|---------|----------|
| `IGNOREHEADER n` | Skip first n rows (header) | `IGNOREHEADER 1` |
| `DELIMITER 'char'` | Column separator | `DELIMITER ','` |
| `TIMEFORMAT` | Timestamp parsing | `TIMEFORMAT 'auto'` |
| `DATEFORMAT` | Date parsing | `DATEFORMAT 'auto'` |
| `BLANKSASNULL` | Whitespace-only `CHAR`/`VARCHAR` fields become NULL | - |
| `EMPTYASNULL` | Empty `CHAR`/`VARCHAR` fields become NULL | - |
| `MAXERROR n` | Tolerate bad rows; the load fails once the error count reaches n | `MAXERROR 100` |
| `TRUNCATECOLUMNS` | Truncate data exceeding column width | - |
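
Options from this table are simply appended to the statement, one per line, after the format clause. The sketch below assembles a COPY statement from its parts so that option sets can be varied per source. `build_copy_sql` is a hypothetical helper; it does no validation of the options it is given, and the bucket and role ARN are placeholders.

```python
def build_copy_sql(table_name, s3_path, auth_clause, file_format="CSV", options=()):
    """Assemble a COPY statement from a target, source, auth clause, and options."""
    lines = [
        f"COPY {table_name}",
        f"FROM '{s3_path}'",
        auth_clause,
        f"FORMAT AS {file_format}",
    ]
    # Options are passed through unchanged, one per line
    lines.extend(options)
    return "\n".join(lines) + ";"


demo_copy_sql = build_copy_sql(
    "public.stg_trips_raw",
    "s3://your-bucket/staging/trips/",
    "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'",
    options=["IGNOREHEADER 1", "TIMEFORMAT 'auto'", "BLANKSASNULL", "MAXERROR 100"],
)
print(demo_copy_sql)
```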

In [31]:
# ========= COPY Command Reference Examples

print("COPY Command Examples")
print("=" * 60)

# Example 1: Basic CSV load (Production with IAM Role)
print("""
1. PRODUCTION: Basic CSV Load with IAM Role
--------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1;
""")

# Example 2: CSV with error handling
print("""
2. Robust CSV Load with Error Handling
--------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1
DELIMITER ','
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
BLANKSASNULL
EMPTYASNULL
ACCEPTINVCHARS AS '?'
MAXERROR 100
TRUNCATECOLUMNS;
""")

# Example 3: Parquet format (recommended for production)
print("""
3. Parquet Format (Recommended for Production)
----------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

Note: Parquet loads are often faster than CSV, and the format needs no
      IGNOREHEADER or DELIMITER options.
""")

COPY Command Examples

1. PRODUCTION: Basic CSV Load with IAM Role
--------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1;


2. Robust CSV Load with Error Handling
--------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1
DELIMITER ','
TIMEFORMAT 'auto'
DATEFORMAT 'auto'
BLANKSASNULL
EMPTYASNULL
ACCEPTINVCHARS AS '?'
MAXERROR 100
TRUNCATECOLUMNS;


3. Parquet Format (Recommended for Production)
----------------------------------------------
COPY public.stg_trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS PARQUET;

Note: Parquet loads are often faster than CSV, and the format needs no
      IGNOREHEADER or DELIMITER options.



---
## Step 6: Check for Load Errors

If your COPY command fails, you can check the error logs.

In [32]:
# ========= Check for COPY errors
# Note: on provisioned clusters, stl_load_errors shows regular users only their own rows.
# Redshift Serverless generally does not expose STL tables, so this query may be
# denied here; SYS_LOAD_ERROR_DETAIL is the corresponding view on Serverless.

error_sql = """
SELECT 
    starttime,
    filename,
    line_number,
    colname,
    err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;
"""

try:
    errors = execute_redshift_query(error_sql)
    if errors is not None and len(errors) > 0:
        print("Recent COPY errors:")
        print(errors)
    else:
        print("No COPY errors found!")
except Exception as e:
    if 'permission denied' in str(e).lower():
        print("Note: Cannot access stl_load_errors in this workspace (permission denied)")
        print("\nOn provisioned clusters, you would check this table to debug COPY failures.")
        print("\nAlternative: If COPY fails, the error message in the exception usually")
        print("contains enough information to diagnose the issue.")
    else:
        print(f"Error: {e}")

Note: Cannot access stl_load_errors in this workspace (permission denied)

On provisioned clusters, you would check this table to debug COPY failures.

Alternative: If COPY fails, the error message in the exception usually
contains enough information to diagnose the issue.
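
On Redshift Serverless, load errors are surfaced through the `SYS_LOAD_ERROR_DETAIL` monitoring view rather than through STL tables. The sketch below builds an equivalent query against that view. The column names (`start_time`, `file_name`, `line_number`, `column_name`, `error_message`) follow our reading of the AWS documentation and should be verified against your environment; `build_load_error_sql` is a hypothetical helper.

```python
def build_load_error_sql(limit=10):
    """Build a query for recent COPY errors from SYS_LOAD_ERROR_DETAIL.

    Column names are assumed from the AWS documentation; verify before relying on them.
    """
    limit = int(limit)
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT
    start_time,
    file_name,
    line_number,
    column_name,
    error_message
FROM sys_load_error_detail
ORDER BY start_time DESC
LIMIT {limit};
"""


print(build_load_error_sql(5))
```

The resulting string can be passed to `execute_redshift_query` in the same way as `error_sql` above.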


---
## Step 7: Idempotent Load Pattern

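In production, use TRUNCATE + COPY to make loads repeatable: rerunning the load replaces the table contents instead of appending duplicates. Note that TRUNCATE in Redshift commits immediately and cannot be rolled back, so a COPY that fails afterwards leaves the table empty until the next successful run.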

In [33]:
# ========= Idempotent load pattern
def load_staging_table(table_name, s3_path, truncate_first=True):
    """Load a staging table from S3 with optional truncate."""
    
    if truncate_first:
        print(f"Truncating {table_name}...")
        execute_redshift_query(f"TRUNCATE TABLE {table_name};", fetch_results=False)
    
    copy_sql = f"""
    COPY {table_name}
    FROM '{s3_path}'
    CREDENTIALS '{get_copy_credentials()}'
    FORMAT AS CSV
    IGNOREHEADER 1
    TIMEFORMAT 'auto'
    REGION '{AWS_REGION}';
    """
    
    print(f"Loading from {s3_path}...")
    execute_redshift_query(copy_sql, fetch_results=False)
    
    # Validate
    count_result = execute_redshift_query(f"SELECT COUNT(*) as cnt FROM {table_name};")
    row_count = count_result['cnt'].iloc[0]
    print(f"SUCCESS: Loaded {row_count:,} rows into {table_name}")
    
    return row_count

# Test the idempotent load
load_staging_table(
    'public.stg_trips_raw',
    f's3://{BUCKET_NAME}/staging/trips/'
)

Truncating public.stg_trips_raw...
Loading from s3://udacity-redshift-staging-27070584/staging/trips/...
SUCCESS: Loaded 1,000 rows into public.stg_trips_raw


np.int64(1000)
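
When an empty table after a failed load is not acceptable, one alternative is to replace `TRUNCATE` with `DELETE` and submit both statements together. The sketch below uses the Data API's `batch_execute_statement`, which, as we understand the AWS documentation, runs its statements in order as a single transaction, so a failed COPY would roll the `DELETE` back. `build_reload_statements` and `submit_reload` are hypothetical helpers, and `DELETE` is slower than `TRUNCATE` on large tables.

```python
def build_reload_statements(table_name, copy_sql):
    """Return the ordered statements for a delete-then-load refresh."""
    return [f"DELETE FROM {table_name};", copy_sql]


def submit_reload(client, workgroup, database, table_name, copy_sql):
    """Submit the refresh as one batch and return the batch statement Id.

    client is expected to be a boto3 'redshift-data' client (or a compatible object).
    """
    response = client.batch_execute_statement(
        WorkgroupName=workgroup,
        Database=database,
        Sqls=build_reload_statements(table_name, copy_sql),
    )
    return response["Id"]


# Placeholder COPY statement for illustration
demo_statements = build_reload_statements(
    "public.stg_trips_raw",
    "COPY public.stg_trips_raw FROM 's3://your-bucket/staging/trips/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole' "
    "FORMAT AS CSV IGNOREHEADER 1;",
)
print(demo_statements[0])
```

The returned Id can be polled with `describe_statement`, as in the helper at the top of this notebook.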

---
## Step 8: Clean Up (Optional)

Delete the S3 bucket when you're done. Note: The bucket will also be deleted when your Udacity session ends.

In [17]:
# ========= Clean up S3 bucket (OPTIONAL - uncomment to run)
# WARNING: This will delete your bucket and all its contents!

# def delete_bucket(bucket_name):
#     s3_resource = boto3.resource('s3', region_name=AWS_REGION)
#     bucket = s3_resource.Bucket(bucket_name)
#     
#     # Delete all objects first
#     bucket.objects.all().delete()
#     print(f"Deleted all objects in {bucket_name}")
#     
#     # Delete the bucket
#     bucket.delete()
#     print(f"Deleted bucket {bucket_name}")

# delete_bucket(BUCKET_NAME)

print("Cleanup code is commented out. Uncomment to delete your bucket.")

Cleanup code is commented out. Uncomment to delete your bucket.


---
## Summary

### Key Takeaways

1. **Create your own S3 bucket** for staging data
2. **COPY is the production standard** for loading data from S3 to Redshift
3. **Use `IAM_ROLE` in production** for secure, managed access
4. **Use `CREDENTIALS` in development** when IAM roles aren't available
5. **Configure COPY options** based on your data format and quality
6. **Make loads idempotent** with the TRUNCATE + COPY pattern
7. **Always validate** with row counts and sample queries

### Production COPY Template

```sql
-- Idempotent staging load (PRODUCTION)
TRUNCATE TABLE stg.trips_raw;

COPY stg.trips_raw
FROM 's3://your-bucket/staging/trips/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
FORMAT AS CSV
IGNOREHEADER 1
TIMEFORMAT 'auto';

-- Validate
SELECT COUNT(*) FROM stg.trips_raw;
```