# Lesson 1 - Exercise 1 (Python): Connect to PostgreSQL and Preview Trips

## Goal

Connect to a PostgreSQL source containing a `raw_trips` table and
produce an initial data profile to understand the schema you'll stage
into Redshift later.

## What to build

A Jupyter notebook that:

1.  Connects to Postgres via `SQLAlchemy`/`psycopg2`.

2.  Runs SQL queries to:

    -   Count rows.
    -   Sample 10 rows.
    -   Compute quick null/unique stats for key columns.

3.  Writes a CSV preview to `/tmp/trips_preview.csv` and prints a short
    profile to stdout.

------------------------------------------------------------------------

### Step 1: Imports and Configuration

Load the required libraries and read the connection parameters from environment variables, falling back to local defaults when a variable is unset.

In [None]:
import os
import pandas as pd
from sqlalchemy import create_engine, text

# --- Configuration (from environment variables or defaults) ---
PG_HOST = os.environ.get("PG_HOST", "localhost")
PG_PORT = os.environ.get("PG_PORT", "5432")
PG_DB = os.environ.get("PG_DB", "transit")
PG_USER = os.environ.get("PG_USER", "postgres")
PG_PASSWORD = os.environ.get("PG_PASSWORD", "postgres")

# Build connection URI (a PG_URI environment variable, if set, overrides the values above)
PG_URI = os.environ.get(
    "PG_URI",
    f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"
)

OUT_FILE = "/tmp/trips_preview.csv"

print(f"PostgreSQL Host: {PG_HOST}:{PG_PORT}")
print(f"Database: {PG_DB}")
print(f"Output file: {OUT_FILE}")
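The f-string in the cell above places the password into the URI unescaped, so a password containing characters such as `@` or `/` would produce a URI that is parsed incorrectly. Here is a minimal sketch of one way to guard against that, assuming the standard library's `urllib.parse.quote_plus` is acceptable for your setup. `build_pg_uri` is a hypothetical helper name and the credentials are made up for illustration.

```python
from urllib.parse import quote_plus


def build_pg_uri(user, password, host, port, db):
    """Assemble a PostgreSQL URI, percent-encoding the credentials."""
    # quote_plus escapes characters such as '@', '/' and ':' that would
    # otherwise be mistaken for URI delimiters.
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Hypothetical credentials, for illustration only
uri = build_pg_uri("postgres", "p@ss/word", "localhost", "5432", "transit")
print(uri)
```
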

### Step 2: Connect to PostgreSQL

Establish a connection to the PostgreSQL database using SQLAlchemy.

In [None]:
engine = create_engine(PG_URI)

# Test the connection
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # raises an exception if the connection fails
    print("Successfully connected to PostgreSQL!")
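If you would rather get a readable message than a long traceback when the database is unreachable, the check can be wrapped in a small function. This is only a sketch: `can_connect` is a hypothetical helper, and an in-memory SQLite engine stands in for PostgreSQL so the example runs without a server.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError


def can_connect(engine):
    """Return True if a trivial query succeeds, False otherwise."""
    try:
        with engine.connect() as conn:
            return conn.execute(text("SELECT 1")).scalar() == 1
    except SQLAlchemyError as exc:
        print(f"Connection failed: {exc}")
        return False


# In-memory SQLite stands in for PostgreSQL (assumption for this sketch)
demo_engine = create_engine("sqlite://")
print(can_connect(demo_engine))
demo_engine.dispose()
```
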

### Step 3: Count Total Rows

Get the total number of rows in the `raw_trips` table to understand the data volume.

**TODO**: Write a SQL query to count the total number of rows in the `raw_trips` table. Execute the query and store the result in `row_count`.

In [None]:
SQL_COUNT = ""  # TODO: Write your SQL query here

with engine.connect() as conn:
    # TODO: Execute the query and extract the row count
    row_count = None  # must hold an int, or the ":," format in the print below raises TypeError

print(f"Total rows in raw_trips: {row_count:,}")
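As a rough illustration of the pattern (not the exercise solution against your database), the sketch below counts rows in a toy `raw_trips` table. It assumes an in-memory SQLite database from the standard library as a stand-in for PostgreSQL, and the two-column table is invented for the demo. With SQLAlchemy, the single value could be read with `.scalar()` on the result instead of `fetchone()[0]`.

```python
import sqlite3

# Toy stand-in for the real table (hypothetical two-column schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_trips (trip_id TEXT, route_id TEXT)")
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?)",
    [(f"T{i:04d}", f"R{i % 3}") for i in range(1250)],
)

SQL_COUNT_DEMO = "SELECT COUNT(*) FROM raw_trips"
# COUNT(*) yields one row with one column; take that single value.
demo_row_count = conn.execute(SQL_COUNT_DEMO).fetchone()[0]
conn.close()

print(f"Total rows in raw_trips: {demo_row_count:,}")
```
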

### Step 4: Sample Rows from the Table

Pull a sample of rows to understand the data structure and content.

**TODO**: Write a SQL query that returns 10 randomly chosen rows from the `raw_trips` table. Select the columns listed in the comment at the top of the cell below, use `ORDER BY RANDOM()` to shuffle the rows, and cap the result with `LIMIT 10`.

In [None]:
# Columns to select:
# trip_id, rider_id, route_id, mode, origin_station_id, destination_station_id,
# board_datetime, alight_datetime, country, province, fare_class, payment_method,
# transfers, zones_charged, distance_km, base_fare_cad, discount_rate, discount_amount_cad,
# yvr_addfare_cad, total_fare_cad, on_time_arrival, service_disruption, polyline_stations

SQL_SAMPLE = """
-- TODO: Write your SQL query here
"""

with engine.connect() as conn:
    sample_df = pd.read_sql(text(SQL_SAMPLE), conn)

print(f"Sampled {len(sample_df)} rows")
print("-" * 40)
display(sample_df)
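The shape of a bounded random sample can be tried out on a toy table first. The sketch below again assumes an in-memory SQLite database as a stand-in for PostgreSQL (both accept `ORDER BY RANDOM()` and `LIMIT`), with an invented two-column `raw_trips` table.

```python
import sqlite3

# Toy stand-in for the real table (hypothetical two-column schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_trips (trip_id INTEGER, route_id TEXT)")
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?)",
    [(i, f"R{i % 5}") for i in range(50)],
)

SQL_SAMPLE_DEMO = """
SELECT trip_id, route_id
FROM raw_trips
ORDER BY RANDOM()  -- shuffle the rows
LIMIT 10           -- keep the read bounded
"""
sample_rows = conn.execute(SQL_SAMPLE_DEMO).fetchall()
conn.close()

print(f"Sampled {len(sample_rows)} rows")
```

Note that `ORDER BY RANDOM()` has to assign a random key to every row before the `LIMIT` applies, so it gets expensive on large tables. For an approximate sample of a big table, PostgreSQL's `TABLESAMPLE` clause may be a cheaper option.
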

### Step 5: Profile Key Columns

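Compute null percentages and unique counts for the key columns we'll need for staging. Because these statistics come from the 10-row sample rather than the full table, treat them as a rough first look, not as table-wide figures.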

**TODO**: Build a profile for the key columns listed below. For each column, compute:
- `dtype`: The pandas data type of the column
- `null_pct`: The percentage of null values (rounded to 2 decimal places)
- `n_unique`: The number of unique non-null values
- `sample_value`: A sample non-null value from the column

Store the results in `profile_data` as a list of dictionaries.

In [None]:
# Key columns for profiling
key_columns = ["trip_id", "rider_id", "route_id", "board_datetime", "alight_datetime"]

# Build profile
profile_data = []
for col in key_columns:
    if col in sample_df.columns:
        series = sample_df[col]
        # TODO: Build a dictionary with keys: column, dtype, null_pct, n_unique, sample_value
        profile_data.append({
            "column": col,
            "dtype": None,      # TODO: Get the dtype as a string
            "null_pct": None,   # TODO: Calculate null percentage (0-100)
            "n_unique": None,   # TODO: Count unique non-null values
            "sample_value": None  # TODO: Get a sample non-null value
        })
    else:
        profile_data.append({
            "column": col,
            "dtype": "MISSING",
            "null_pct": None,
            "n_unique": None,
            "sample_value": None
        })

profile_df = pd.DataFrame(profile_data)

print("Key Column Profile (from sample):")
print("-" * 40)
display(profile_df)
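One possible way to compute the four statistics for a single column is sketched below on a small hand-made DataFrame. `profile_column` and the toy data are hypothetical and only illustrate the pandas calls involved (`isna`, `dropna`, `nunique`); adapt the idea to the loop in the cell above.

```python
import pandas as pd


def profile_column(series):
    """Summarise one column: dtype, null percentage, unique count, sample."""
    non_null = series.dropna()
    return {
        "column": series.name,
        "dtype": str(series.dtype),
        # isna().mean() is the fraction of nulls; scale to 0-100
        "null_pct": round(float(series.isna().mean() * 100), 2),
        "n_unique": int(non_null.nunique()),
        # First non-null value, or None when the column is entirely null
        "sample_value": non_null.iloc[0] if not non_null.empty else None,
    }


# Toy data, invented for this sketch
toy_trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3", "t4"],
    "rider_id": ["r1", None, "r1", "r2"],
})
toy_profile = pd.DataFrame([profile_column(toy_trips[c]) for c in toy_trips.columns])
print(toy_profile)
```
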

### Step 6: Check All Column Data Types

Review the data types of all columns to plan the staging table DDL.

**TODO**: Loop through all columns in `sample_df` and print each column name, its data type, and the count of null values. Format the output so columns align nicely.

In [None]:
print("All Column Data Types:")
print("-" * 40)
# TODO: Loop through sample_df.dtypes and print column info
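For the alignment part, f-string width specifiers do most of the work. The sketch below is one way to do it on a toy DataFrame; `describe_columns` is a hypothetical helper, and the column width is derived from the longest column name rather than hard-coded.

```python
import pandas as pd


def describe_columns(df):
    """Return one aligned line per column: name, dtype, null count."""
    width = max(len(str(col)) for col in df.columns)
    lines = []
    for col, dtype in df.dtypes.items():
        nulls = int(df[col].isna().sum())
        # '<' left-aligns and pads each field to a fixed width
        lines.append(f"{col:<{width}}  {str(dtype):<12}  nulls={nulls}")
    return lines


# Toy data, invented for this sketch
dtypes_demo_df = pd.DataFrame({
    "transfers": [0, 1, 2],
    "distance_km": [1.5, None, 3.0],
})
column_lines = describe_columns(dtypes_demo_df)
print("\n".join(column_lines))
```
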


### Step 7: Write Sample to CSV

Save the sample trips to a CSV file for reference.

**TODO**: Write `sample_df` to the CSV file specified by `OUT_FILE`. Then read it back to verify the write was successful.

In [None]:
# TODO: Write sample_df to CSV (without the index)


# TODO: Read the file back to verify
verify_df = None

print(f"Wrote {len(sample_df)} rows to {OUT_FILE}")
print(f"Verified: {len(verify_df)} rows, {len(verify_df.columns)} columns")
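The write-then-read-back check can be rehearsed on a toy DataFrame before touching `/tmp`. This sketch assumes a temporary directory and an invented file name so that it leaves nothing behind; the pandas calls (`to_csv` with `index=False`, then `read_csv`) are the same ones the TODOs call for.

```python
import os
import tempfile

import pandas as pd

# Toy data, invented for this sketch
csv_demo_df = pd.DataFrame({
    "trip_id": ["T1", "T2"],
    "total_fare_cad": [3.15, 4.55],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "trips_preview_demo.csv")
    # index=False keeps the DataFrame index out of the file
    csv_demo_df.to_csv(demo_path, index=False)
    round_trip_df = pd.read_csv(demo_path)

print(f"Verified: {len(round_trip_df)} rows, {len(round_trip_df.columns)} columns")
```

Keep in mind that CSV does not store data types: timestamp columns come back as plain strings unless you pass `parse_dates` to `read_csv`, so matching row and column counts is a sanity check rather than proof of an identical DataFrame.
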

### Step 8: Clean Up

Dispose of the SQLAlchemy engine to close the connections held in its pool.

In [None]:
engine.dispose()
print("PostgreSQL connection closed.")

------------------------------------------------------------------------

## Summary

In this exercise, you:

1. Connected to PostgreSQL using SQLAlchemy
2. Counted total rows to understand data volume
3. Sampled rows to explore the data structure
4. Profiled key columns (`trip_id`, `rider_id`, `route_id`, `board_datetime`, `alight_datetime`)
5. Reviewed all column data types for staging planning
6. Exported a preview CSV for reference

These patterns (env-driven config, bounded reads, quick stats, CSV outputs) are exactly what you'll reuse when building **ETL pipelines** in later lessons and for the final **e-commerce project**.