# Lesson 1 — Exercise 1 (Python): Connect to PostgreSQL and Preview Trips

## Goal

Connect to a PostgreSQL source containing a `raw_trips` table and
produce an initial data profile to understand the schema.

## What to build

A Jupyter notebook that:

1.  Connects to Postgres via `SQLAlchemy`/`psycopg2`.

2.  Runs SQL queries to:

    -   Count rows.
    -   Sample 10 rows.
    -   Compute quick null/unique stats for key columns.

3.  Writes a CSV preview to `/tmp/trips_preview.csv` and prints a short
    profile to stdout.

### Acceptance criteria

-   Uses env var `PG_URI` or individual connection parameters.
-   Prints row count, column dtypes, null percentage, and unique counts
    for: `trip_id`, `rider_id`, `route_id`, `board_datetime`, `alight_datetime`.
-   Saves `/tmp/trips_preview.csv` with 10 sampled rows.

------------------------------------------------------------------------

## Lesson 1 Exercise 1 Solution: Connect to PostgreSQL and Preview Trips

Populate the Postgres database by running the provided script:

In [1]:
!python populate-postgres.py

Postgres is up!
Connecting to PostgreSQL...
Created table public.raw_trips.
Loaded trips from ./data/van_transit_trips_postgres.csv into public.raw_trips.


### Step 1: Imports and Configuration

Load required libraries and set up connection parameters from environment variables.

In [2]:
import os
import pandas as pd
from sqlalchemy import create_engine, text

# --- Configuration (from environment variables or defaults) ---
PG_HOST = os.environ.get("PG_HOST", "localhost")
PG_PORT = os.environ.get("PG_PORT", "5432")
PG_DB = os.environ.get("PG_DB", "postgres")
PG_USER = os.environ.get("PG_USER", "temp")
PG_PASSWORD = os.environ.get("PG_PASSWORD", "temp")

# Build connection URI
PG_URI = os.environ.get(
    "PG_URI", 
    f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"
)

OUT_FILE = "/tmp/trips_preview.csv"

print(f"PostgreSQL Host: {PG_HOST}:{PG_PORT}")
print(f"Database: {PG_DB}")
print(f"Output file: {OUT_FILE}")

PostgreSQL Host: localhost:5432
Database: postgres
Output file: /tmp/trips_preview.csv
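One caveat with the f-string URI above: a password containing characters such as `@`, `/`, or `:` would be misparsed unless it is percent-encoded. Here is a minimal sketch of a safer builder using only the standard library. `build_pg_uri` is a hypothetical helper name, not part of the exercise code, and it takes the environment as a plain mapping so it can be tried without touching `os.environ`.

```python
from urllib.parse import quote_plus


def build_pg_uri(env):
    """Return PG_URI if set, else build one from individual PG_* parameters.

    `env` is any mapping of environment variables (for example os.environ).
    Defaults mirror the ones used in the cell above.
    """
    if env.get("PG_URI"):
        return env["PG_URI"]
    # Percent-encode credentials so '@', '/' or ':' cannot break the URI.
    user = quote_plus(env.get("PG_USER", "temp"))
    password = quote_plus(env.get("PG_PASSWORD", "temp"))
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    db = env.get("PG_DB", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{db}"


# Made-up password, for illustration only.
print(build_pg_uri({"PG_PASSWORD": "p@ss/word"}))
# postgresql://temp:p%40ss%2Fword@localhost:5432/postgres
```

In the notebook this would be called as `build_pg_uri(os.environ)`. Avoid printing the result there, since it contains the password.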


### Step 2: Connect to PostgreSQL

Establish a connection to the PostgreSQL database using SQLAlchemy.

In [3]:
engine = create_engine(PG_URI)

# Test the connection with a trivial round-trip query
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
    print("Successfully connected to PostgreSQL!")

Successfully connected to PostgreSQL!
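If the database container is still starting, the first connection attempt can fail. The sketch below retries a readiness check a few times before giving up. `wait_until_ready` and its parameters are hypothetical, and the `check` callable stands in for something like the `SELECT 1` round trip above.

```python
import time


def wait_until_ready(check, attempts=5, delay_s=1.0, sleep=time.sleep):
    """Call `check()` until it succeeds or `attempts` is exhausted.

    `check` should raise an exception while the database is unavailable.
    Returns the 1-based attempt number that succeeded, and re-raises the
    last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            check()
            return attempt
        except Exception as exc:  # in real code, catch the driver's error type
            last_error = exc
            if attempt < attempts:
                sleep(delay_s)
    raise last_error


# Demo with a stand-in check that fails twice, then succeeds.
calls = {"n": 0}


def flaky_check():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database not ready")


print(wait_until_ready(flaky_check, delay_s=0))  # 3
```

With SQLAlchemy, `check` could be a small function that opens `engine.connect()` and executes `text("SELECT 1")`, catching `sqlalchemy.exc.OperationalError` rather than a bare `Exception`.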


### Step 3: Count Total Rows

Get the total number of rows in the `raw_trips` table to understand the data volume.

In [4]:
SQL_COUNT = "SELECT COUNT(*) AS n FROM raw_trips"

with engine.connect() as conn:
    result = conn.execute(text(SQL_COUNT))
    row_count = result.scalar()

print(f"Total rows in raw_trips: {row_count:,}")

Total rows in raw_trips: 2,500


### Step 4: Sample Rows from the Table

Pull a random sample of 10 rows to understand the data structure and content. Because the query uses `ORDER BY RANDOM()`, each run returns a different sample.

In [5]:
SQL_SAMPLE = """
SELECT 
    trip_id, 
    rider_id, 
    route_id, 
    mode,
    origin_station_id, 
    destination_station_id,
    board_datetime, 
    alight_datetime,
    country, 
    province, 
    fare_class, 
    payment_method,
    transfers, 
    zones_charged, 
    distance_km,
    base_fare_cad, 
    discount_rate, 
    discount_amount_cad,
    yvr_addfare_cad, 
    total_fare_cad,
    on_time_arrival, 
    service_disruption,
    polyline_stations
FROM raw_trips
ORDER BY RANDOM()
LIMIT 10
"""

with engine.connect() as conn:
    sample_df = pd.read_sql(text(SQL_SAMPLE), conn)

print(f"Sampled {len(sample_df)} rows")
print("-" * 40)
display(sample_df)

Sampled 10 rows
----------------------------------------


,trip_id,rider_id,route_id,mode,origin_station_id,destination_station_id,board_datetime,alight_datetime,country,province,...,zones_charged,distance_km,base_fare_cad,discount_rate,discount_amount_cad,yvr_addfare_cad,total_fare_cad,on_time_arrival,service_disruption,polyline_stations
0,T101596,R98478,R030,bus,S011,S003,2024-08-04 06:11:34,2024-08-04 06:36:49,CA,BC,...,1,11.72,3.15,0.0,0.0,0.0,3.15,True,False,S023|S030|S018|S009|S025|S005|S015|S030
1,T100395,R39187,R060,bus,S013,S014,2025-01-27 01:13:26,2025-01-27 01:40:43,CA,BC,...,1,12.59,3.14,0.0,0.0,0.0,3.14,True,False,S018|S016|S002|S022|S001|S024|S006|S003|S017|S029
2,T101921,R96402,R113,bus,S002,S011,2025-06-27 04:01:03,2025-06-27 04:37:03,CA,BC,...,1,12.43,3.15,0.0,0.0,0.0,3.15,True,False,S015|S014|S008|S020|S030|S006|S019|S007|S026|S021
3,T100961,R20101,R037,skytrain,S025,S028,2025-02-09 11:24:54,2025-02-09 11:47:31,CA,BC,...,1,14.14,3.2,0.4,1.28,0.0,1.92,True,False,S013|S011|S017|S022|S023|S002|S004
4,T102294,R94935,R018,skytrain,S012,S011,2024-11-24 04:52:06,2024-11-24 05:10:52,CA,BC,...,1,11.73,3.33,0.0,0.0,0.0,3.33,True,False,S011|S023|S013|S005|S025
5,T101987,R35826,R009,bus,S013,S024,2024-10-08 16:10:45,2024-10-08 16:36:58,CA,BC,...,1,8.84,3.11,0.35,1.09,0.0,2.02,False,False,S010|S006|S010|S005|S020|S011|S015|S008
6,T101621,R96846,R077,skytrain,S026,S003,2024-05-03 00:29:29,2024-05-03 01:06:07,CA,BC,...,1,25.3,3.41,0.0,0.0,0.0,3.41,True,False,S026|S001|S029|S005|S026|S021
7,T100017,R56175,R048,bus,S025,S002,2024-07-15 05:23:51,2024-07-15 06:10:13,CA,BC,...,1,10.69,3.16,0.0,0.0,0.0,3.16,True,False,S028|S004|S006|S022|S006|S027|S022|S014|S021
8,T101879,R17609,R016,bus,S023,S008,2024-11-09 19:01:31,2024-11-09 19:36:42,CA,BC,...,1,13.67,3.2,0.0,0.0,0.0,3.2,True,False,S015|S025|S030|S014|S012
9,T101754,R86705,R065,skytrain,S006,S021,2024-12-24 00:32:31,2024-12-24 01:02:01,CA,BC,...,1,13.56,3.18,0.0,0.0,0.0,3.18,True,False,S009|S018|S021|S016|S018|S014|S004|S023|S002
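`ORDER BY RANDOM()` sorts every row before applying the `LIMIT`. That is fine for 2,500 rows but becomes expensive on large tables. PostgreSQL's `TABLESAMPLE` clause is one alternative. The sketch below only builds the query text. `build_sample_sql` is a hypothetical helper, `SELECT *` is used for brevity instead of the explicit column list, and the sampling percentage is an assumption you would tune to the table size.

```python
def build_sample_sql(table, limit=10, percent=None):
    """Build a sampling query for PostgreSQL.

    percent=None -> exact-size random sample via ORDER BY RANDOM().
    percent=p    -> approximate sample via TABLESAMPLE BERNOULLI (p), which
                    avoids the sort but may return fewer than `limit` rows.
    Accepts plain, unqualified table names only.
    """
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    limit = int(limit)
    if percent is None:
        return f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT {limit}"
    percent = float(percent)
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return f"SELECT * FROM {table} TABLESAMPLE BERNOULLI ({percent}) LIMIT {limit}"


print(build_sample_sql("raw_trips"))
print(build_sample_sql("raw_trips", percent=5))
```

Note that combining `TABLESAMPLE` with `LIMIT` keeps the first sampled rows in scan order, so the result is likely less uniform than a full random sort. For a quick preview that trade-off is usually acceptable.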


### Step 5: Profile Key Columns

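Compute null percentages and unique counts for the key columns we'll need for staging. These statistics are computed on the 10-row sample, so treat them as a quick sanity check rather than a profile of the full table.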

In [6]:
# Key columns for profiling
key_columns = ["trip_id", "rider_id", "route_id", "board_datetime", "alight_datetime"]

# Build profile
profile_data = []
for col in key_columns:
    if col in sample_df.columns:
        series = sample_df[col]
        profile_data.append({
            "column": col,
            "dtype": str(series.dtype),
            "null_pct": round(series.isna().mean() * 100, 2),
            "n_unique": series.nunique(dropna=True),
            "sample_value": series.dropna().iloc[0] if not series.dropna().empty else None
        })
    else:
        profile_data.append({
            "column": col,
            "dtype": "MISSING",
            "null_pct": None,
            "n_unique": None,
            "sample_value": None
        })

profile_df = pd.DataFrame(profile_data)

print("Key Column Profile (from sample):")
print("-" * 40)
display(profile_df)

Key Column Profile (from sample):
----------------------------------------


,column,dtype,null_pct,n_unique,sample_value
0,trip_id,object,0.0,10,T101596
1,rider_id,object,0.0,10,R98478
2,route_id,object,0.0,10,R030
3,board_datetime,datetime64[ns],0.0,10,2024-08-04 06:11:34
4,alight_datetime,datetime64[ns],0.0,10,2024-08-04 06:36:49
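The profile above is computed on the 10-row sample, so its null percentages and unique counts say little about the full table. The sketch below pushes the same statistics down to the database. `build_profile_sql` and `profile` are hypothetical helpers. The demo runs against an in-memory SQLite table with made-up rows purely so the block is self-contained. The generated SQL uses only standard aggregates (`COUNT(*)`, `COUNT(col)`, `COUNT(DISTINCT col)`) that PostgreSQL also supports.

```python
import sqlite3


def build_profile_sql(table, columns):
    """Return one SELECT computing row count, null count and distinct count
    for each column. Identifiers are interpolated, so pass trusted names only."""
    parts = ["COUNT(*) AS n_rows"]
    for col in columns:
        # COUNT(col) skips NULLs, so the difference is the null count.
        parts.append(f"COUNT(*) - COUNT({col}) AS {col}__nulls")
        parts.append(f"COUNT(DISTINCT {col}) AS {col}__unique")
    return f"SELECT {', '.join(parts)} FROM {table}"


def profile(conn, table, columns):
    """Run the profile query and return {column: {"null_pct", "n_unique"}}."""
    row = conn.execute(build_profile_sql(table, columns)).fetchone()
    n_rows = row[0]
    result = {}
    for i, col in enumerate(columns):
        nulls, unique = row[1 + 2 * i], row[2 + 2 * i]
        null_pct = round(100.0 * nulls / n_rows, 2) if n_rows else None
        result[col] = {"null_pct": null_pct, "n_unique": unique}
    return result


# Stand-in data: three made-up trips, one with a missing rider_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_trips (trip_id TEXT, rider_id TEXT)")
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?)",
    [("T1", "R1"), ("T2", None), ("T3", "R1")],
)
stats = profile(conn, "raw_trips", ["trip_id", "rider_id"])
print(stats)
conn.close()
```

Against the notebook's engine, the same query text could be run with `conn.execute(text(build_profile_sql("raw_trips", key_columns))).fetchone()`. A `trip_id` unique count below the row count would be an early hint of duplicate keys.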


### Step 6: Check All Column Data Types

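Review the data types of all columns to plan the staging table DDL. Note that these are the pandas dtypes inferred when the sample was loaded, not the PostgreSQL column types.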

In [7]:
print("All Column Data Types:")
print("-" * 40)
for col, dtype in sample_df.dtypes.items():
    null_count = sample_df[col].isna().sum()
    print(f"{col:30} {str(dtype):15} (nulls: {null_count})")

All Column Data Types:
----------------------------------------
trip_id                        object          (nulls: 0)
rider_id                       object          (nulls: 0)
route_id                       object          (nulls: 0)
mode                           object          (nulls: 0)
origin_station_id              object          (nulls: 0)
destination_station_id         object          (nulls: 0)
board_datetime                 datetime64[ns]  (nulls: 0)
alight_datetime                datetime64[ns]  (nulls: 0)
country                        object          (nulls: 0)
province                       object          (nulls: 0)
fare_class                     object          (nulls: 0)
payment_method                 object          (nulls: 0)
transfers                      int64           (nulls: 0)
zones_charged                  int64           (nulls: 0)
distance_km                    float64         (nulls: 0)
base_fare_cad                  float64         (nulls: 0)
discount
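The dtypes printed above are pandas dtypes, not PostgreSQL column types, so turning them into a staging DDL needs a mapping. The sketch below drafts a `CREATE TABLE` statement from a dtype listing like this one. The mapping, the `draft_staging_ddl` helper, and the `stg_trips` table name are all assumptions for illustration. A real staging table would likely use tighter types, for example `NUMERIC` for fares.

```python
# Assumed pandas-dtype -> PostgreSQL type mapping; adjust to your conventions.
PANDAS_TO_PG = {
    "object": "TEXT",
    "int64": "BIGINT",
    "float64": "DOUBLE PRECISION",
    "bool": "BOOLEAN",
    "datetime64[ns]": "TIMESTAMP",
}


def draft_staging_ddl(table, dtypes):
    """Draft a CREATE TABLE statement from {column: pandas dtype string}.

    Unknown dtypes fall back to TEXT so the draft is always valid SQL.
    """
    cols = [
        f"    {name} {PANDAS_TO_PG.get(dtype, 'TEXT')}"
        for name, dtype in dtypes.items()
    ]
    return f"CREATE TABLE {table} (\n" + ",\n".join(cols) + "\n);"


# A few columns taken from the dtype listing above.
ddl = draft_staging_ddl(
    "stg_trips",
    {
        "trip_id": "object",
        "board_datetime": "datetime64[ns]",
        "transfers": "int64",
        "distance_km": "float64",
    },
)
print(ddl)
```

In the notebook, the input mapping could be built with `{col: str(dtype) for col, dtype in sample_df.dtypes.items()}`.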

### Step 7: Write Sample to CSV

Save the sample trips to a CSV file for reference.

In [8]:
sample_df.to_csv(OUT_FILE, index=False)
print(f"Wrote {len(sample_df)} rows to {OUT_FILE}")

# Verify the file
verify_df = pd.read_csv(OUT_FILE)
print(f"Verified: {len(verify_df)} rows, {len(verify_df.columns)} columns")

Wrote 10 rows to /tmp/trips_preview.csv
Verified: 10 rows, 23 columns


### Step 8: Clean Up

Dispose of the SQLAlchemy engine to close its pooled connections.

In [9]:
engine.dispose()
print("PostgreSQL connection closed.")

PostgreSQL connection closed.


------------------------------------------------------------------------

## Summary

In this exercise, you:

1. Connected to PostgreSQL using SQLAlchemy
2. Counted total rows to understand data volume
3. Sampled rows to explore the data structure
4. Profiled key columns (`trip_id`, `rider_id`, `route_id`, timestamps)
5. Reviewed all column data types for staging planning
6. Exported a preview CSV for reference

These patterns (env-driven config, bounded reads, quick stats, CSV outputs) carry over directly to building **ETL pipelines**.