# Lesson 3: Exercise 1 - Extract and Transform Trips from PostgreSQL

## Goal

Extract trips data from PostgreSQL and transform it into a **staging format** ready for warehouse loading. This is the first step in our ETL pipeline.

## Prerequisites

You should have completed:
- **Lesson 1, Exercise 1**: Connected to PostgreSQL and previewed the `raw_trips` table
- **Lesson 2, Exercise 1**: Designed the `dw_dim_rider` table in Redshift

## What You Will Build

A Pandas-based ETL script that:

1. Connects to PostgreSQL and extracts trips data
2. Cleans and standardizes fields (whitespace, nulls, data types)
3. Validates the transformation
4. Outputs to staging format (CSV/Parquet)

---

## Imports and Dependencies

Run this cell first to import all required libraries.

In [None]:
# ========= Imports
import os
from datetime import datetime
from typing import Tuple, List
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text

print("All imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

---
## Configuration

**Important:** These credentials match the `populate-postgres.py` script from Lesson 1. Update only if your environment differs.
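
If your environment does differ, interpolating a password that contains characters such as `@` or `/` directly into the URI yields an invalid connection string. The following is a minimal sketch, not part of `populate-postgres.py`: the helper name `build_pg_uri` and the environment variable names are assumptions made for illustration. It reads optional overrides and percent-encodes the credentials with the standard library.

```python
import os
from urllib.parse import quote_plus


def build_pg_uri(env=None):
    """Build a PostgreSQL URI, percent-encoding the user name and password.

    The variable names (PG_USER, PG_PASSWORD, ...) are hypothetical; the
    defaults mirror the values used in this notebook's configuration cell.
    """
    env = os.environ if env is None else env
    user = env.get("PG_USER", "temp")
    password = env.get("PG_PASSWORD", "temp")
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    db = env.get("PG_DB", "postgres")
    # quote_plus escapes characters that would otherwise break URI parsing
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


# Pass a plain dict instead of os.environ to see the encoding in isolation
example_uri = build_pg_uri({"PG_PASSWORD": "p@ss/word"})
print(example_uri)
```

With the default `temp`/`temp` credentials the result is identical to the f-string built in the cell below, so this only matters once real passwords are involved.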

In [None]:
# ========= PostgreSQL Configuration ==========
# These match the populate-postgres.py script from Lesson 1.
# Only change if your environment is different.

PG_HOST = "localhost"      # Database host
PG_PORT = "5432"           # Database port
PG_DB = "postgres"         # Database name (populate script uses 'postgres')
PG_USER = "temp"           # User from populate-postgres.py
PG_PASSWORD = "temp"       # Password from populate-postgres.py

# Build connection URI
PG_URI = f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"

# Output paths
OUTPUT_STAGING_CSV = "/tmp/stg_trips_raw.csv"
OUTPUT_STAGING_PARQUET = "/tmp/stg_trips_raw.parquet"

print("Configuration set!")
print(f"   - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB}")
print(f"   - User: {PG_USER}")
print(f"   - Output CSV: {OUTPUT_STAGING_CSV}")
print(f"   - Output Parquet: {OUTPUT_STAGING_PARQUET}")

---
## Populate the Database

Run this cell to populate PostgreSQL with sample data (if not already done).

In [None]:
!python populate-postgres.py

---
## Column Specification

Define the expected columns and their types for the staging table. This matches the structure we'll use in Redshift.
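
To make the kind codes concrete, here is one possible way they could translate into pandas dtypes. The `KIND_TO_DTYPE` mapping is an assumption for illustration (the exercise does not prescribe it), and the column list is a shortened sample of the full spec.

```python
import pandas as pd

# Assumed mapping from the exercise's kind codes to pandas dtypes
KIND_TO_DTYPE = {
    "s": "object",           # plain Python strings
    "ts": "datetime64[ns]",  # timestamps
    "i": "Int64",            # nullable integer (capital I)
    "f": "float64",
    "b": "boolean",          # nullable boolean
}

# Shortened sample of the column spec, one column per kind
colspec_sample = [
    ("trip_id", "s"),
    ("board_datetime", "ts"),
    ("transfers", "i"),
    ("distance_km", "f"),
    ("on_time_arrival", "b"),
]

# Build an empty, correctly typed frame from the spec
empty_staging = pd.DataFrame(
    {name: pd.Series(dtype=KIND_TO_DTYPE[kind]) for name, kind in colspec_sample}
)
dtype_names = {name: str(dtype) for name, dtype in empty_staging.dtypes.items()}
print(dtype_names)
```

The nullable `Int64` and `boolean` dtypes are chosen here because they can hold missing values, which plain `int64` and `bool` cannot.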

In [None]:
# ========= Column specs for staging (name, kind)
# kind: 's' = string, 'ts' = timestamp, 'i' = integer, 'f' = float, 'b' = boolean

TRIPS_COLSPEC = [
    ('trip_id', 's'),
    ('rider_id', 's'),
    ('route_id', 's'),
    ('mode', 's'),
    ('origin_station_id', 's'),
    ('destination_station_id', 's'),
    ('board_datetime', 'ts'),
    ('alight_datetime', 'ts'),
    ('country', 's'),
    ('province', 's'),
    ('fare_class', 's'),
    ('payment_method', 's'),
    ('transfers', 'i'),
    ('zones_charged', 'i'),
    ('distance_km', 'f'),
    ('base_fare_cad', 'f'),
    ('discount_rate', 'f'),
    ('discount_amount_cad', 'f'),
    ('yvr_addfare_cad', 'f'),
    ('total_fare_cad', 'f'),
    ('on_time_arrival', 'b'),
    ('service_disruption', 'b'),
    ('polyline_stations', 's'),
]

print(f"Column spec defined: {len(TRIPS_COLSPEC)} columns")
print("\nColumn breakdown:")
print(f"   - String columns: {sum(1 for _, k in TRIPS_COLSPEC if k == 's')}")
print(f"   - Timestamp columns: {sum(1 for _, k in TRIPS_COLSPEC if k == 'ts')}")
print(f"   - Integer columns: {sum(1 for _, k in TRIPS_COLSPEC if k == 'i')}")
print(f"   - Float columns: {sum(1 for _, k in TRIPS_COLSPEC if k == 'f')}")
print(f"   - Boolean columns: {sum(1 for _, k in TRIPS_COLSPEC if k == 'b')}")

---
## Helper Functions

These functions are adapted from the project solution for cleaning and transforming data.
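
Before reading `trim_df()`, it may help to see its two core operations applied to a tiny hand-made Series. This is an illustrative sketch on invented values, not data from `raw_trips`.

```python
import numpy as np
import pandas as pd

# Invented sample: padded text, a real None, the literal text "nan", an empty string
raw = pd.Series(["  A1 ", None, "nan", "", " B2"], dtype=object)

# Same two operations that trim_df() applies to each object column
stripped = raw.astype(str).str.strip()
cleaned = stripped.replace({"nan": np.nan, "None": np.nan, "NaN": np.nan, "": np.nan})

# True marks the positions that ended up as missing values
missing_mask = cleaned.isna().tolist()
print(cleaned.tolist())
print(missing_mask)
```

One consequence worth knowing: a genuine text value of `"nan"` or `"None"` in the source is also turned into a missing value, because after the string conversion it cannot be told apart from a real null.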

In [None]:
# ========= Helper Functions (from project solution)

def trim_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize text fields (strip whitespace) and handle NaN values properly.
    
    This is the same helper used in the final project for cleaning extracted data.
    
    Args:
        df: Input DataFrame with potentially messy string fields
    
    Returns:
        Cleaned DataFrame with stripped strings and proper NaN handling
    """
    df = df.copy()  # Avoid modifying the original DataFrame
    for c in df.select_dtypes(include=['object']).columns:
        # Convert to string and strip whitespace (non-string values become text too)
        df[c] = df[c].astype(str).str.strip()
        # Replace the text 'nan', 'None', 'NaN', and empty strings with actual NaN
        df[c] = df[c].replace({'nan': np.nan, 'None': np.nan, 'NaN': np.nan, '': np.nan})
    return df


def validate_columns(df: pd.DataFrame, colspec: List[Tuple[str, str]]) -> dict:
    """
    Validate that DataFrame columns match the expected spec.
    
    Args:
        df: DataFrame to validate
        colspec: List of (column_name, type_code) tuples
    
    Returns:
        Dict with validation results
    """
    expected_cols = {c for c, _ in colspec}
    actual_cols = set(df.columns)
    
    return {
        'missing': expected_cols - actual_cols,
        'extra': actual_cols - expected_cols,
        'matched': expected_cols & actual_cols,
        'valid': expected_cols == actual_cols
    }


print("Helper functions defined: trim_df(), validate_columns()")

---
## Step 1: Connect to PostgreSQL

Establish a connection to the PostgreSQL database using SQLAlchemy (same pattern as Lesson 1, Exercise 1).
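
If the connection fails, the traceback can be long. The sketch below wraps the same `SELECT 1` check in a small helper that reports failures briefly. The name `can_connect` is hypothetical, and an in-memory SQLite URI stands in for PostgreSQL so the example runs without a database server.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError


def can_connect(uri):
    """Return True if a trivial query succeeds against the given database URI."""
    engine = create_engine(uri)
    try:
        with engine.connect() as conn:
            return conn.execute(text("SELECT 1")).scalar() == 1
    except SQLAlchemyError as exc:
        # Covers connection errors raised through the database driver
        print(f"Connection failed: {exc}")
        return False
    finally:
        # Release pooled connections whether or not the check succeeded
        engine.dispose()


# SQLite in memory is only a stand-in; pass PG_URI in the notebook
ok = can_connect("sqlite:///:memory:")
print(f"Connection OK: {ok}")
```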

In [None]:
# ========= STEP 1: Connect to PostgreSQL
print("Step 1: Connecting to PostgreSQL...")
print("-" * 50)

engine = create_engine(PG_URI)

# Test the connection
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # Round-trip query to confirm the connection works
    print("Successfully connected to PostgreSQL!")
    
    # Get row count
    count_result = conn.execute(text("SELECT COUNT(*) FROM raw_trips"))
    total_rows = count_result.scalar()
    print(f"Total rows in raw_trips: {total_rows:,}")

---
## Step 2: Extract Data from PostgreSQL

Pull all trips data from the `raw_trips` table. In production ETL, you might add filters for incremental loads.

**TODO**: Write a SQL query to extract all columns from `raw_trips`. The columns should match the `TRIPS_COLSPEC` defined above:

- trip_id, rider_id, route_id, mode
- origin_station_id, destination_station_id
- board_datetime, alight_datetime
- country, province, fare_class, payment_method
- transfers, zones_charged, distance_km
- base_fare_cad, discount_rate, discount_amount_cad, yvr_addfare_cad, total_fare_cad
- on_time_arrival, service_disruption, polyline_stations
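
The incremental-load filter mentioned above can be sketched as a bound parameter on a timestamp column. This example is an assumption-laden illustration: it uses an in-memory SQLite table with two of the columns and invented rows, and `last_loaded` is a hypothetical watermark that a real pipeline would persist between runs.

```python
import sqlite3

import pandas as pd

# Stand-in source table with invented rows (SQLite instead of PostgreSQL)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_trips (trip_id TEXT, board_datetime TEXT)")
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?)",
    [
        ("T1", "2024-01-01 08:00:00"),
        ("T2", "2024-01-02 09:30:00"),
        ("T3", "2024-01-03 17:45:00"),
    ],
)

# Hypothetical watermark: the latest board_datetime already loaded
last_loaded = "2024-01-01 23:59:59"

# The bound parameter keeps the value out of the SQL text itself
sql = """
SELECT trip_id, board_datetime
FROM raw_trips
WHERE board_datetime > ?
ORDER BY board_datetime
"""
new_trips = pd.read_sql(sql, conn, params=(last_loaded,))
conn.close()

print(new_trips)
```

With SQLAlchemy and PostgreSQL the placeholder style differs (named parameters such as `:last_loaded` inside `text()`), but the idea of filtering on a stored watermark is the same.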

In [None]:
# ========= STEP 2: Extract from PostgreSQL
print("Step 2: Extracting trips data from PostgreSQL...")
print("-" * 50)

# TODO: Write your SQL query to extract all trips columns
SQL_EXTRACT = """
-- TODO: Write your SELECT statement here

"""

with engine.connect() as conn:
    trips_raw = pd.read_sql(text(SQL_EXTRACT), conn)

print(f"Extracted {len(trips_raw):,} rows")
print(f"Columns: {len(trips_raw.columns)}")
print(f"\nSample data (first 3 rows):")
display(trips_raw.head(3))

---
## Step 3: Transform and Clean Data

Apply transformations to prepare the data for staging:
- Strip whitespace from string fields
- Standardize NULL representations
- Parse timestamps
- Ensure proper data types

**TODO**: Complete the transformation logic:

1. Apply `trim_df()` to clean string fields
2. Parse timestamp columns using `pd.to_datetime()` with `errors='coerce'`
3. Convert boolean columns (handle string representations like 'true', 'false')
4. Ensure numeric columns have proper types using `pd.to_numeric()`
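
The conversions listed above behave as follows on a small invented DataFrame. Treat this as a sketch of the individual calls rather than the exercise solution; in particular, `bool_map` is an assumed set of text spellings and may need extending for your data.

```python
import pandas as pd

# Invented sample with one deliberately bad value per column
sample = pd.DataFrame({
    "board_datetime": ["2024-01-01 08:00:00", "not a date"],
    "on_time_arrival": ["true", "False"],
    "distance_km": ["12.5", "n/a"],
    "transfers": ["1", None],
})

# Timestamps: unparseable values become NaT instead of raising
sample["board_datetime"] = pd.to_datetime(sample["board_datetime"], errors="coerce")

# Booleans: normalize text spellings, then use the nullable boolean dtype
bool_map = {"true": True, "false": False, "t": True, "f": False, "1": True, "0": False}
sample["on_time_arrival"] = (
    sample["on_time_arrival"].astype(str).str.strip().str.lower().map(bool_map).astype("boolean")
)

# Floats: unparseable values become NaN
sample["distance_km"] = pd.to_numeric(sample["distance_km"], errors="coerce")

# Integers: nullable Int64 keeps missing values without falling back to float
sample["transfers"] = pd.to_numeric(sample["transfers"], errors="coerce").astype("Int64")

print(sample.dtypes)
print(sample)
```

Lower-casing before the lookup also covers columns that arrive as real booleans, since their string forms are `True` and `False`.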

In [None]:
# ========= STEP 3: Transform and clean
print("Step 3: Transforming and cleaning data...")
print("-" * 50)

# TODO: Apply string cleaning using trim_df()
# The placeholder copies the data so later edits do not alter trips_raw
trips_clean = trips_raw.copy()  # Replace with: trim_df(trips_raw)

# TODO: Parse timestamp columns
timestamp_cols = [c for c, k in TRIPS_COLSPEC if k == 'ts']
for col in timestamp_cols:
    if col in trips_clean.columns:
        # TODO: Convert column to datetime
        pass

# TODO: Ensure boolean columns are proper booleans
boolean_cols = [c for c, k in TRIPS_COLSPEC if k == 'b']
for col in boolean_cols:
    if col in trips_clean.columns:
        # TODO: Handle string boolean representations
        pass

# TODO: Ensure numeric columns are proper types
float_cols = [c for c, k in TRIPS_COLSPEC if k == 'f']
for col in float_cols:
    if col in trips_clean.columns:
        # TODO: Convert to numeric
        pass

int_cols = [c for c, k in TRIPS_COLSPEC if k == 'i']
for col in int_cols:
    if col in trips_clean.columns:
        # TODO: Convert to integer (use 'Int64' for nullable integers)
        pass

print(f"Transformation complete!")
print(f"   - Rows: {len(trips_clean):,}")
print(f"   - Columns: {len(trips_clean.columns)}")

---
## Step 4: Validate the Transformation

Before outputting, verify that:
- All expected columns are present
- Row counts match (no data loss)
- Key fields have no unexpected nulls
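
The three checks above can also be collected into one reusable function that returns a list of problems, which is easier to act on in an automated job than printed output. This is a sketch under assumptions: `check_staging` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd


def check_staging(source_rows, df, expected_cols, key_fields):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # 1. All expected columns are present
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # 2. Row counts match (no data loss)
    if len(df) != source_rows:
        problems.append(f"row count changed: {source_rows} -> {len(df)}")

    # 3. Key fields have no nulls
    for field in key_fields:
        if field in df.columns:
            nulls = int(df[field].isna().sum())
            if nulls:
                problems.append(f"{field}: {nulls} null(s)")

    return problems


# Invented sample: one null key and one missing column
sample = pd.DataFrame({"trip_id": ["T1", None], "rider_id": ["R1", "R2"]})
issues = check_staging(
    source_rows=2,
    df=sample,
    expected_cols=["trip_id", "rider_id", "route_id"],
    key_fields=["trip_id", "rider_id"],
)
print(issues)
```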

In [None]:
# ========= STEP 4: Validate
print("Step 4: Validating transformation...")
print("-" * 50)

# Check column alignment
validation = validate_columns(trips_clean, TRIPS_COLSPEC)

if validation['valid']:
    print("Column validation: PASSED")
else:
    print("Column validation: ISSUES FOUND")
    if validation['missing']:
        print(f"   Missing columns: {validation['missing']}")
    if validation['extra']:
        print(f"   Extra columns: {validation['extra']}")

# Check row counts
print(f"\nRow count check:")
print(f"   Source rows: {len(trips_raw):,}")
print(f"   Output rows: {len(trips_clean):,}")
print(f"   Match: {'YES' if len(trips_raw) == len(trips_clean) else 'NO - DATA LOSS!'}")

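# Check for nulls in key fields
key_fields = ['trip_id', 'rider_id', 'route_id']
print(f"\nNull check for key fields:")
for field in key_fields:
    if field not in trips_clean.columns:
        print(f"   {field}: column missing")
        continue
    null_count = trips_clean[field].isna().sum()
    # Guard against division by zero when no rows were extracted
    null_pct = null_count / len(trips_clean) * 100 if len(trips_clean) else 0.0
    print(f"   {field}: {null_count} nulls ({null_pct:.2f}%)")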

---
## Step 5: Output to Staging Format

Save the transformed data in formats suitable for warehouse loading:
- **CSV**: Human-readable, compatible with Redshift COPY
- **Parquet**: Compressed, columnar format for efficient loading

**TODO**: Write the staging DataFrame to CSV (the steps below cover CSV only; writing Parquet to `OUTPUT_STAGING_PARQUET` is not required here):

1. Select only the columns defined in `TRIPS_COLSPEC` (in order)
2. Save to CSV using `to_csv()` with `index=False`
3. Report the file size
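
The write-and-report steps look like this on a two-row invented DataFrame written to a temporary directory. The Parquet part is an optional extra: `to_parquet()` needs a Parquet engine such as `pyarrow` or `fastparquet`, so the sketch assumes one may be absent and skips that step if so.

```python
import os
import tempfile

import pandas as pd

# Invented sample standing in for trips_staging
sample = pd.DataFrame({"trip_id": ["T1", "T2"], "total_fare_cad": [3.15, 4.55]})

# Temporary directory so the example does not touch /tmp/stg_trips_raw.csv
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "stg_sample.csv")

# index=False keeps the DataFrame index out of the file
sample.to_csv(csv_path, index=False)
csv_kb = os.path.getsize(csv_path) / 1024
print(f"CSV saved: {csv_path}")
print(f"   Size: {csv_kb:.3f} KB")

# Optional Parquet copy; skipped when no Parquet engine is installed
parquet_path = os.path.join(out_dir, "stg_sample.parquet")
try:
    sample.to_parquet(parquet_path, index=False)
    parquet_written = True
except ImportError:
    parquet_written = False
print(f"Parquet written: {parquet_written}")
```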

In [None]:
# ========= STEP 5: Output to staging format
print("Step 5: Outputting to staging format...")
print("-" * 50)

# Select only the columns defined in the spec (in order)
output_cols = [c for c, _ in TRIPS_COLSPEC]
trips_staging = trips_clean[output_cols]

# TODO: Output to CSV
# trips_staging.to_csv(OUTPUT_STAGING_CSV, index=False)

# TODO: Report file size
# file_size = os.path.getsize(OUTPUT_STAGING_CSV) / 1024
# print(f"CSV saved: {OUTPUT_STAGING_CSV}")
# print(f"   Size: {file_size:.1f} KB")

---
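## Step 6: Verify Output

Read the CSV back to confirm that the row count and column order survived the write. This cell needs the file produced in Step 5, so complete that TODO first.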
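
Keep in mind that CSV stores no type information, so `pd.read_csv()` re-infers dtypes when the file is read back. The sketch below uses invented rows with zero-padded IDs (an assumption; your `trip_id` values may look different) to show how explicit `dtype` and `parse_dates` arguments preserve the intended types.

```python
import io

import pandas as pd

# Invented CSV content; the second row has an empty transfers value
csv_text = (
    "trip_id,board_datetime,transfers\n"
    "00123,2024-01-01 08:00:00,1\n"
    "00124,2024-01-01 09:15:00,\n"
)

# Default inference: IDs are parsed as integers and lose their leading zeros
naive = pd.read_csv(io.StringIO(csv_text))

# Explicit types: IDs stay text, timestamps are parsed, integers stay nullable
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"trip_id": str, "transfers": "Int64"},
    parse_dates=["board_datetime"],
)

print(naive["trip_id"].tolist())
print(typed["trip_id"].tolist())
print(typed.dtypes)
```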

In [None]:
# ========= STEP 6: Verify output
print("Step 6: Verifying output...")
print("-" * 50)

# Read back the CSV
trips_verify = pd.read_csv(OUTPUT_STAGING_CSV)

print(f"Read back {len(trips_verify):,} rows from CSV")
print(f"Columns match: {list(trips_verify.columns) == output_cols}")

# Show first few rows
print(f"\nFirst 3 rows of staged data:")
display(trips_verify.head(3))

---
## Step 7: Clean Up

Dispose of the SQLAlchemy engine to close its pooled database connections.

In [None]:
# ========= STEP 7: Clean up
engine.dispose()
print("PostgreSQL connection closed.")

---
## Summary

Generate a final summary report of the ETL job.

In [None]:
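# ========= Final Summary
# Derive the status from actual checks rather than assuming success
csv_written = os.path.exists(OUTPUT_STAGING_CSV)
rows_match = len(trips_raw) == len(trips_staging)
null_trip_ids = int(trips_staging['trip_id'].isna().sum())
status = "SUCCESS" if csv_written and rows_match and null_trip_ids == 0 else "CHECK REQUIRED"

print("=" * 60)
print("ETL JOB SUMMARY: Trips to Staging")
print("=" * 60)
print(f"""
Source:           PostgreSQL {PG_HOST}:{PG_PORT}/{PG_DB}
Table:            raw_trips
Output CSV:       {OUTPUT_STAGING_CSV} ({'written' if csv_written else 'NOT FOUND'})

Records:
  - Extracted:    {len(trips_raw):,}
  - Transformed:  {len(trips_clean):,}
  - Staged:       {len(trips_staging):,}

Data Quality:
  - Columns:      {len(trips_staging.columns)} of {len(TRIPS_COLSPEC)} expected
  - Null trip_id: {null_trip_ids}

Status:           {status}
Completed:        {datetime.now().isoformat()}
""")
print("=" * 60)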