# Lesson 2: Exercise 1 Solution - Write DDL for Riders

## Goal

Design a **slowly-changing snapshot** dimension for riders with a surrogate key and a distribution and sort strategy optimized for joins from rider-centric facts.

## What You Will Build

Create `dw_dim_rider` with:

- `rider_sk` (surrogate key, `IDENTITY`)
- Natural key `rider_id` (string)
- Attributes: `rider_segment`, `effective_from`, `effective_to`, `is_current`
- Compression encodings
- Distribution and sort keys to support joins by rider

### Acceptance Criteria

- Table created successfully in Redshift
- `rider_sk` is `IDENTITY(1,1)` and declared `PRIMARY KEY` (informational in Redshift; uniqueness is not enforced)
- Distribution favors collocating rider-centric facts
- Sort key supports common lookups by rider

---


## Imports and Dependencies

Run this cell first to import all required libraries.

In [1]:
# ========= Imports
import os
import time
from typing import Dict, Any, List

import pandas as pd
import boto3

print("All imports successful!")
print(f"   - pandas version: {pd.__version__}")

All imports successful!
   - pandas version: 2.3.1


---
## Configuration

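Configure your Redshift connection. The cell reads credentials and connection settings from environment variables set by `aws_config.py`.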

In [2]:
# ========= CONFIG (edit for your environment)
# Set your AWS credentials in the aws_config.py file
from aws_config import *  # This sets all AWS env vars

# ---- Read configuration from environment
AWS_ACCESS_KEY_ID           = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY       = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN           = os.getenv("AWS_SESSION_TOKEN")
AWS_REGION                  = os.getenv("AWS_REGION")
REDSHIFT_DATABASE           = os.getenv("REDSHIFT_DATABASE")
REDSHIFT_WORKGROUP          = os.getenv("REDSHIFT_WORKGROUP")
REDSHIFT_SECRET_ARN         = os.getenv("REDSHIFT_SECRET_ARN")            # Optional
REDSHIFT_CLUSTER_IDENTIFIER = os.getenv("REDSHIFT_CLUSTER_IDENTIFIER")    # For provisioned
REDSHIFT_DB_USER            = os.getenv("REDSHIFT_DB_USER")               # For provisioned

print("Configuration loaded!")
print(f"   - AWS Region: {AWS_REGION}")
print(f"   - Redshift: {REDSHIFT_DATABASE} (workgroup: {REDSHIFT_WORKGROUP})")
print()
if AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY:
    print(f"   AWS credentials found (Key ID: {AWS_ACCESS_KEY_ID[:8]}...)")
    if AWS_SESSION_TOKEN:
        print("   AWS session token found (temporary credentials)")
else:
    print("   WARNING: AWS credentials NOT FOUND!")
    print("      Redshift operations will fail with 'NoCredentialsError'.")

Configuration loaded!
   - AWS Region: us-east-1
   - Redshift: dev (workgroup: udacity-dwh-wg)

   AWS credentials found (Key ID: ASIA54I5...)
   AWS session token found (temporary credentials)


---
## Redshift Functions

These helper functions match the patterns used in the final project. Learning them here will prepare you for the capstone.

In [3]:
# ========= Redshift Functions

session_boto = boto3.Session(region_name=AWS_REGION)
rsd = session_boto.client("redshift-data", region_name=AWS_REGION)


def _rs_kwargs() -> Dict[str, Any]:
    """
    Shared Redshift Data API connection args.
    
    Supports both:
    - Serverless: uses WorkgroupName (and optionally SecretArn)
    - Provisioned: uses ClusterIdentifier and DbUser
    """
    base = dict(Database=REDSHIFT_DATABASE)
    if REDSHIFT_WORKGROUP:
        base["WorkgroupName"] = REDSHIFT_WORKGROUP
        if REDSHIFT_SECRET_ARN:
            base["SecretArn"] = REDSHIFT_SECRET_ARN
    elif REDSHIFT_CLUSTER_IDENTIFIER and REDSHIFT_DB_USER:
        base["ClusterIdentifier"] = REDSHIFT_CLUSTER_IDENTIFIER
        base["DbUser"] = REDSHIFT_DB_USER
    else:
        raise RuntimeError("Configure Redshift serverless OR provisioned for Data API.")
    return base


def rs_exec(sql: str, params: List[Dict[str, Any]] = None, return_results=False, timeout_s=900):
    """
    Execute SQL on Redshift via the Data API.
    
    Args:
        sql: SQL statement to execute
        params: Optional list of parameter dicts for parameterized queries
        return_results: If True, fetch and return query results
        timeout_s: Maximum seconds to wait for query completion (default 15 min)
    
    Returns:
        List of dicts if return_results=True or query is SELECT, else None
    """
    sql = sql.strip()
    if not sql:
        return None
    
    # Build request kwargs
    kwargs = _rs_kwargs()
    kwargs["Sql"] = sql
    if params:
        kwargs["Parameters"] = params
    
    # Execute statement
    sid = rsd.execute_statement(**kwargs)["Id"]
    
    # Poll for completion
    start = time.time()
    while True:
        d = rsd.describe_statement(Id=sid)
        if d["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        if time.time() - start > timeout_s:
            raise TimeoutError("Redshift statement timeout")
        time.sleep(0.5)
    
    # Check for errors
    if d["Status"] != "FINISHED":
        raise RuntimeError(f"Redshift SQL failed: {d.get('Error')}\n---\n{sql}")
    
    # Return results for SELECT queries or when explicitly requested
    if return_results or sql.lower().startswith("select"):
        out, next_token = [], None
        while True:
            args = dict(Id=sid)
            if next_token:
                args["NextToken"] = next_token
            r = rsd.get_statement_result(**args)
            cols = [c["name"] for c in r["ColumnMetadata"]]
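            for rec in r["Records"]:
                row = []
                for cell in rec:
                    # Each cell is a single-key dict, e.g. {"stringValue": "abc"}.
                    # Note: SQL NULL arrives as {"isNull": True}, so it is returned as True.
                    row.append(next(iter(cell.values())))
                out.append(dict(zip(cols, row)))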
            next_token = r.get("NextToken")
            if not next_token:
                break
        return out
    
    return None


print("Redshift functions defined: _rs_kwargs(), rs_exec()")

Redshift functions defined: _rs_kwargs(), rs_exec()
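
The `params` argument of `rs_exec` is passed through unchanged as the Data API's `Parameters` field, which takes a list of `{"name": ..., "value": ...}` dicts and is referenced in the SQL text with `:name` placeholders. A minimal sketch of building that list follows; `to_params` is a hypothetical helper (not part of the lesson code) and `R-1001` is a made-up rider ID.

```python
from typing import Any, Dict, List


def to_params(**values: Any) -> List[Dict[str, str]]:
    """Build a Data API `Parameters` list from keyword arguments.

    Hypothetical helper for illustration. Values are converted with str()
    because the Data API expects parameter values as strings.
    """
    return [{"name": name, "value": str(value)} for name, value in values.items()]


# Named placeholders in the SQL text use the :name form.
lookup_sql = """
SELECT rider_sk, rider_segment
FROM public.dw_dim_rider
WHERE rider_id = :rider_id
"""

params = to_params(rider_id="R-1001")  # sample value, not real data
print(params)

# With a live connection this could be run as:
#   rs_exec(lookup_sql, params=params, return_results=True)
```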


---
## Step 1: Design the dim_rider Table DDL

Define the DDL for the rider dimension table with:

| Component | Purpose |
|-----------|----------|
| **Surrogate key** (`rider_sk`) | Warehouse-generated, auto-incrementing ID |
| **Natural key** (`rider_id`) | Original ID from source system |
| **Attributes** (`rider_segment`) | Descriptive fields for analysis |
| **SCD fields** | `effective_from`, `effective_to`, `is_current` for slowly changing dimension support |
| **DISTKEY/SORTKEY** | Both on `rider_id` for fast joins during fact loading |

In [4]:
DDL_DIM_RIDER = """
-- =============================================================
-- public.dw_dim_rider
-- Grain: 1 row per rider (current snapshot with SCD support)
-- =============================================================

DROP TABLE IF EXISTS public.dw_dim_rider;

CREATE TABLE public.dw_dim_rider (
    -- Surrogate key (warehouse-generated)
    rider_sk        BIGINT IDENTITY(1,1),
    
    -- Natural key (from source system)
    rider_id        VARCHAR(32)  ENCODE zstd,
    
    -- Descriptive attributes
    rider_segment   VARCHAR(16)  ENCODE zstd,
    
    -- Slowly Changing Dimension (SCD) tracking fields
    effective_from  TIMESTAMP    ENCODE zstd,
    effective_to    TIMESTAMP    ENCODE zstd,
    is_current      BOOLEAN      ENCODE zstd,
    
    -- Primary key constraint
    PRIMARY KEY (rider_sk)
)
-- Collocate rider-centric joins by distributing on the natural key
DISTKEY (rider_id)
-- Speed up point lookups and range scans by rider_id
SORTKEY (rider_id);
"""

print("DDL for dim_rider:")
print("=" * 60)
print(DDL_DIM_RIDER)

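DDL for dim_rider:
============================================================

-- =============================================================
-- public.dw_dim_rider
-- Grain: 1 row per rider (current snapshot with SCD support)
-- =============================================================
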
DROP TABLE IF EXISTS public.dw_dim_rider;

CREATE TABLE public.dw_dim_rider (
    -- Surrogate key (warehouse-generated)
    rider_sk        BIGINT IDENTITY(1,1),
    
    -- Natural key (from source system)
    rider_id        VARCHAR(32)  ENCODE zstd,
    
    -- Descriptive attributes
    rider_segment   VARCHAR(16)  ENCODE zstd,
    
    -- Slowly Changing Dimension (SCD) tracking fields
    effective_from  TIMESTAMP    ENCODE zstd,
    effective_to    TIMESTAMP    ENCODE zstd,
    is_current      BOOLEAN      ENCODE zstd,
    
    -- Primary key constraint
    PRIMARY KEY (rider_sk)
)
-- Collocate rider-centric joins by distributing on the natural key
DISTKEY (rider_id)
-- Speed up point lookups and range scans by rider_id
SORTKEY (rider_id);



---
## Step 2: Execute the DDL

Create the `dw_dim_rider` table in Redshift.

In [5]:
rs_exec(DDL_DIM_RIDER)
print("Table public.dw_dim_rider created successfully!")

Table public.dw_dim_rider created successfully!
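
The DDL string holds two statements (`DROP TABLE` and `CREATE TABLE`) that the cell submits in a single call. If you would rather submit them one at a time, or pass them as the `Sqls` list of the Data API's `batch_execute_statement`, a naive splitter is enough for this DDL. This is only a sketch: `split_statements` is a hypothetical helper, and it assumes no semicolon appears inside a string literal or a comment.

```python
from typing import List


def split_statements(sql: str) -> List[str]:
    """Split a SQL script on semicolons and drop empty fragments.

    Naive on purpose: assumes no semicolon occurs inside a string literal
    or a comment, which holds for the DDL in this exercise.
    """
    return [part.strip() for part in sql.split(";") if part.strip()]


# Shortened stand-in for DDL_DIM_RIDER so the snippet runs on its own.
sample_ddl = """
DROP TABLE IF EXISTS public.dw_dim_rider;

CREATE TABLE public.dw_dim_rider (
    rider_sk BIGINT IDENTITY(1,1),
    rider_id VARCHAR(32) ENCODE zstd,
    PRIMARY KEY (rider_sk)
)
DISTKEY (rider_id)
SORTKEY (rider_id);
"""

statements = split_statements(sample_ddl)
for i, stmt in enumerate(statements, start=1):
    # Show only the first line of each statement.
    print(f"[{i}] {stmt.splitlines()[0]}")

# With a live connection, each statement could be run in order:
#   for stmt in statements:
#       rs_exec(stmt)
```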


---
## Step 3: Validate the Table Structure

Query the information schema to verify the table was created with the correct columns.

In [6]:
validation_sql = """
SELECT 
    column_name,
    data_type,
    character_maximum_length,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'dw_dim_rider'
ORDER BY ordinal_position;
"""

columns = rs_exec(validation_sql, return_results=True)

print("Table Structure for dw_dim_rider:")
print("-" * 60)
if columns:
    df = pd.DataFrame(columns)
    display(df)
else:
    print("No columns found. Check if table was created.")

Table Structure for dw_dim_rider:
------------------------------------------------------------


| | column_name | data_type | character_maximum_length | is_nullable |
|---|---|---|---|---|
| 0 | rider_sk | bigint | True | NO |
| 1 | rider_id | character varying | 32 | YES |
| 2 | rider_segment | character varying | 16 | YES |
| 3 | effective_from | timestamp without time zone | True | YES |
| 4 | effective_to | timestamp without time zone | True | YES |
| 5 | is_current | boolean | True | YES |
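
The `True` values under `character_maximum_length` for the non-string columns are not real lengths. The Data API encodes each cell as a single-key dict such as `{"stringValue": "rider_sk"}` and encodes SQL NULL as `{"isNull": True}`; `rs_exec` takes the first value of each cell, so a NULL comes back as `True`. A minimal sketch of a decoder that maps NULL to `None` instead is shown here. The names `decode_cell` and `decode_records` are hypothetical, and the sample records are hand-written to mirror the output above.

```python
from typing import Any, Dict, List


def decode_cell(cell: Dict[str, Any]) -> Any:
    """Return the Python value of one Data API field, mapping NULL to None."""
    if cell.get("isNull"):
        return None
    # Non-null cells carry exactly one typed key (stringValue, longValue, ...).
    return next(iter(cell.values()))


def decode_records(
    columns: List[str], records: List[List[Dict[str, Any]]]
) -> List[Dict[str, Any]]:
    """Turn Data API records into a list of column-name -> value dicts."""
    return [dict(zip(columns, (decode_cell(c) for c in rec))) for rec in records]


cols = ["column_name", "data_type", "character_maximum_length", "is_nullable"]
sample_records = [
    [{"stringValue": "rider_sk"}, {"stringValue": "bigint"},
     {"isNull": True}, {"stringValue": "NO"}],
    [{"stringValue": "rider_id"}, {"stringValue": "character varying"},
     {"longValue": 32}, {"stringValue": "YES"}],
]

rows = decode_records(cols, sample_records)
print(rows[0]["character_maximum_length"])  # None
print(rows[1]["character_maximum_length"])  # 32
```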


---
## Step 4: Check Distribution and Sort Keys

Verify that the DISTKEY and SORTKEY were applied correctly.

In [7]:
properties_sql = """
SELECT 
    "column",
    type,
    encoding,
    distkey,
    sortkey
FROM pg_table_def
WHERE schemaname = 'public'
  AND tablename = 'dw_dim_rider'
ORDER BY sortkey, "column";
"""

properties = rs_exec(properties_sql, return_results=True)

print("Distribution and Sort Key Configuration:")
print("-" * 60)
if properties:
    df = pd.DataFrame(properties)
    display(df)
else:
    print("Could not retrieve table properties.")

Distribution and Sort Key Configuration:
------------------------------------------------------------


| | column | type | encoding | distkey | sortkey |
|---|---|---|---|---|---|
| 0 | effective_from | timestamp without time zone | zstd | False | 0 |
| 1 | effective_to | timestamp without time zone | zstd | False | 0 |
| 2 | is_current | boolean | zstd | False | 0 |
| 3 | rider_segment | character varying(16) | zstd | False | 0 |
| 4 | rider_sk | bigint | az64 | False | 0 |
| 5 | rider_id | character varying(32) | zstd | True | 1 |
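
The output shows `rider_id` as the distribution key (`distkey = True`) and as the first sort key column (`sortkey = 1`). The same check can be automated. The sketch below uses a hypothetical `check_keys` helper on rows copied from the output above, so it runs without a connection; with a live connection, the `properties` list returned by `rs_exec` could be passed instead.

```python
from typing import Any, Dict, List


def check_keys(rows: List[Dict[str, Any]], dist_col: str, sort_col: str) -> None:
    """Raise AssertionError unless dist_col is the DISTKEY and sort_col leads the SORTKEY."""
    dist_cols = [r["column"] for r in rows if r["distkey"]]
    # pg_table_def reports the sort key position: 0 = not in the sort key, 1 = first column.
    lead_sort = [r["column"] for r in rows if r["sortkey"] == 1]
    assert dist_cols == [dist_col], f"unexpected DISTKEY columns: {dist_cols}"
    assert lead_sort == [sort_col], f"unexpected leading SORTKEY columns: {lead_sort}"


# Rows copied from the recorded output (type column omitted for brevity).
sample_rows = [
    {"column": "effective_from", "encoding": "zstd", "distkey": False, "sortkey": 0},
    {"column": "effective_to", "encoding": "zstd", "distkey": False, "sortkey": 0},
    {"column": "is_current", "encoding": "zstd", "distkey": False, "sortkey": 0},
    {"column": "rider_segment", "encoding": "zstd", "distkey": False, "sortkey": 0},
    {"column": "rider_sk", "encoding": "az64", "distkey": False, "sortkey": 0},
    {"column": "rider_id", "encoding": "zstd", "distkey": True, "sortkey": 1},
]

check_keys(sample_rows, dist_col="rider_id", sort_col="rider_id")
print("DISTKEY and SORTKEY are on rider_id")
```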


---

## Design Rationale

### Why This Design?

| Design Choice | Rationale |
|---------------|------------|
| **Surrogate key (`rider_sk`)** | Keeps fact tables narrow and stable. Source IDs can change; surrogate keys don't. |
| **DISTKEY on `rider_id`** | Distributes dimension rows by the natural key. When loading facts, we join on `rider_id` to get `rider_sk`; that join avoids redistribution when the incoming rows are also distributed on `rider_id`. |
| **SORTKEY on `rider_id`** | Accelerates equality predicates and merge joins for common lookups. |
| **SCD fields** | `effective_from`, `effective_to`, `is_current` support tracking rider attribute changes over time. |
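| **ENCODE zstd** | General-purpose compression applied to the string, timestamp, and boolean columns. `rider_sk` has no explicit encoding and was assigned `az64` by Redshift (see the Step 4 output). |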
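
The lookup described in the `DISTKEY` row, joining incoming fact rows to the dimension on `rider_id` to fetch `rider_sk`, can be sketched offline. The example below uses Python's built-in `sqlite3` as a stand-in for Redshift, so `IDENTITY`, `ENCODE`, `DISTKEY`, and `SORTKEY` are omitted. The staging table `stg_trips` and all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dw_dim_rider (
    rider_sk      INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for IDENTITY(1,1)
    rider_id      TEXT,
    rider_segment TEXT,
    is_current    INTEGER
);
CREATE TABLE stg_trips (trip_id TEXT, rider_id TEXT);  -- hypothetical staging table

INSERT INTO dw_dim_rider (rider_id, rider_segment, is_current) VALUES
    ('R-1', 'commuter', 1),
    ('R-2', 'casual',   1);
INSERT INTO stg_trips VALUES ('T-10', 'R-2'), ('T-11', 'R-1'), ('T-12', 'R-2');
""")

# Resolve each staged trip's natural key to the current surrogate key.
fact_rows = conn.execute("""
    SELECT s.trip_id, d.rider_sk
    FROM stg_trips AS s
    JOIN dw_dim_rider AS d
      ON d.rider_id = s.rider_id
     AND d.is_current = 1
    ORDER BY s.trip_id
""").fetchall()

print(fact_rows)  # [('T-10', 2), ('T-11', 1), ('T-12', 2)]
conn.close()
```

The fact table then stores only `rider_sk`, which is why the dimension's natural key matters mainly at load time.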

### Grain

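**1 row = 1 rider** in the current snapshot. Once history is tracked via the SCD fields, the grain becomes one row per rider per effective period, with exactly one `is_current = TRUE` row per rider.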
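
Tracking history through the SCD fields means a change to a rider attribute closes the current row and inserts a new version. A minimal sketch of that pattern follows, using `sqlite3` as a stand-in for Redshift. The `apply_segment_change` helper, the rider ID, segments, and timestamps are all hypothetical, a `NULL` `effective_to` for the open row is an assumption, and the lesson does not prescribe this load logic.

```python
import sqlite3


def apply_segment_change(conn, rider_id, new_segment, changed_at):
    """Expire the rider's current row, then insert the new version as current."""
    conn.execute(
        "UPDATE dw_dim_rider SET effective_to = ?, is_current = 0 "
        "WHERE rider_id = ? AND is_current = 1",
        (changed_at, rider_id),
    )
    conn.execute(
        "INSERT INTO dw_dim_rider "
        "(rider_id, rider_segment, effective_from, effective_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",  # assumption: open rows carry a NULL effective_to
        (rider_id, new_segment, changed_at),
    )


conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dw_dim_rider (
    rider_sk       INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for IDENTITY(1,1)
    rider_id       TEXT,
    rider_segment  TEXT,
    effective_from TEXT,
    effective_to   TEXT,
    is_current     INTEGER
)""")
conn.execute(
    "INSERT INTO dw_dim_rider "
    "(rider_id, rider_segment, effective_from, effective_to, is_current) "
    "VALUES ('R-1', 'casual', '2024-01-01 00:00:00', NULL, 1)"
)

apply_segment_change(conn, "R-1", "commuter", "2024-06-01 00:00:00")

history = conn.execute(
    "SELECT rider_sk, rider_segment, effective_to, is_current "
    "FROM dw_dim_rider WHERE rider_id = 'R-1' ORDER BY rider_sk"
).fetchall()
current_count = conn.execute(
    "SELECT COUNT(*) FROM dw_dim_rider WHERE rider_id = 'R-1' AND is_current = 1"
).fetchone()[0]

print(history)
# [(1, 'casual', '2024-06-01 00:00:00', 0), (2, 'commuter', None, 1)]
print(current_count)  # 1
conn.close()
```

Each version gets its own `rider_sk`, so facts loaded before the change keep pointing at the row that was current at the time.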

### Conformed Dimension

This `dw_dim_rider` table is designed to be **conformed** across multiple fact tables:
- `fact_trips` (rider who took the trip)
- `fact_events` (rider associated with the event)
- `fact_graph_edges` (riders in relationships)

All three facts will reference `rider_sk`, enabling consistent cross-dataset analysis.