# Lesson 3: Exercise 3 - Create Derived Fare Class Dimension

## Goal

Build a **derived dimension** that classifies fares into analytical categories. Derived dimensions add business meaning that doesn't exist in the raw source data.

## Prerequisites

You should have completed:
- **Lesson 1, Exercise 1**: Connected to PostgreSQL (`raw_trips` table)
- **Lesson 3, Exercise 1**: Extracted and transformed trips data

## What You Will Build

A Pandas-based transformation that:

1. Extracts fare data from PostgreSQL
2. Analyzes fare distribution
3. Creates fare buckets based on `total_fare_cad`
4. Adds discount bands based on `discount_rate`
5. Builds a reusable `dw_dim_fare_class` dimension table

### Why Derived Dimensions Matter

Raw data contains continuous values like `total_fare_cad = 3.47`. Analysts need categorical groupings:
- "How do Premium fares compare to Standard fares?"
- "What's the distribution of discounted vs. full-price trips?"

Derived dimensions compute these categories **once during ETL**, so every query uses consistent logic.

---

## Imports and Dependencies

Run this cell first to import all required libraries.

In [None]:
# ========= Imports
import os
from datetime import datetime
from typing import Tuple, List
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text

print("All imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

---
## Configuration

**Important:** These credentials match the `populate-postgres.py` script from Lesson 1. Update only if your environment differs.

In [None]:
# ========= PostgreSQL Configuration ==========
# These match the populate-postgres.py script from Lesson 1.

PG_HOST = "localhost"      # Database host
PG_PORT = "5432"           # Database port
PG_DB = "postgres"         # Database name (populate script uses 'postgres')
PG_USER = "temp"           # User from populate-postgres.py
PG_PASSWORD = "temp"       # Password from populate-postgres.py

PG_URI = f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"

# Output path
OUTPUT_DIM_FARE_CLASS = "/tmp/dim_fare_class.csv"

# ========= FARE CLASSIFICATION RULES ==========
# These thresholds define our fare buckets
FARE_BUCKETS = {
    'Budget': (0, 3.00),        # $0.00 - $2.99
    'Standard': (3.00, 4.50),   # $3.00 - $4.49
    'Premium': (4.50, 6.50),    # $4.50 - $6.49
    'Luxury': (6.50, float('inf'))  # $6.50+
}

# Discount classification rules
DISCOUNT_BANDS = {
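    'Full Price': (0, 0.01),            # below 1% (effectively no discount)
    'Light Discount': (0.01, 0.20),     # 1% to under 20%
    'Moderate Discount': (0.20, 0.35),  # 20% to under 35%
    'Heavy Discount': (0.35, 1.01)      # 35%+; upper bound is 1.01 so a rate of 1.00 (100%) is still included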
}

print("Configuration set!")
print(f"   - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB} (user: {PG_USER})")
print(f"   - Output: {OUTPUT_DIM_FARE_CLASS}")
print(f"\nFare bucket thresholds:")
for name, (low, high) in FARE_BUCKETS.items():
    high_str = f"${high:.2f}" if high != float('inf') else "unlimited"
    print(f"   {name}: ${low:.2f} - {high_str}")
print(f"\nDiscount band thresholds:")
for name, (low, high) in DISCOUNT_BANDS.items():
    print(f"   {name}: {low*100:.0f}% - {high*100:.0f}%")

---
## Populate the Database

Run this cell to populate PostgreSQL with sample data (if not already done).

In [None]:
!python populate-postgres.py

---
## Helper Functions

A utility function that standardizes text fields and missing values before the data is classified.

In [None]:
# ========= Helper Functions

def trim_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize text fields and handle NaN values.
    """
    df = df.copy()
    for c in df.select_dtypes(include=['object']).columns:
        df[c] = df[c].astype(str).str.strip()
        # Use mask instead of replace to avoid FutureWarning
        df[c] = df[c].mask(df[c].isin(['nan', 'None', 'NaN', '']), np.nan)
    return df


print("Helper function defined: trim_df()")

---
## Classification Functions

These functions classify continuous values (fares, discount rates) into categorical buckets.

**TODO**: Implement two classification functions:

1. `classify_fare(fare, buckets)`: 
   - Takes a fare amount and the FARE_BUCKETS dict
   - Returns the bucket name (e.g., 'Budget', 'Standard', 'Premium', 'Luxury')
   - Handle NaN values by returning 'Unknown'
   - Logic: Loop through buckets, check if `low <= fare < high`

2. `classify_discount(rate, bands)`:
   - Takes a discount rate (0.0 to 1.0) and the DISCOUNT_BANDS dict
   - Returns the band name (e.g., 'Full Price', 'Light Discount', etc.)
   - Handle NaN values by returning 'Unknown'
   - Logic: Loop through bands, check if `low <= rate < high`
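
The half-open interval check described above (`low <= value < high`) can be sketched on its own before you write the two functions. This is only an illustration: the `classify_value` helper and the `EXAMPLE_RANGES` thresholds are hypothetical names made up for this sketch, not part of the exercise.

```python
import math

# Hypothetical thresholds for illustration only; the exercise uses
# FARE_BUCKETS and DISCOUNT_BANDS from the Configuration cell.
EXAMPLE_RANGES = {
    "Low": (0.0, 10.0),
    "High": (10.0, float("inf")),
}


def classify_value(value, ranges):
    """Return the name of the first half-open range [low, high) containing value."""
    # Treat None and NaN as missing values
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "Unknown"
    for name, (low, high) in ranges.items():
        # Lower bound inclusive, upper bound exclusive
        if low <= value < high:
            return name
    # The value fell outside every range (for example, a negative number)
    return "Unknown"


print(classify_value(9.99, EXAMPLE_RANGES))          # Low
print(classify_value(10.0, EXAMPLE_RANGES))          # High
print(classify_value(-1.0, EXAMPLE_RANGES))          # Unknown
print(classify_value(float("nan"), EXAMPLE_RANGES))  # Unknown
```

Because the upper bound is exclusive, a value that sits exactly on a threshold always belongs to the higher range, so no value can match two ranges.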

In [None]:
# ========= Classification Functions

def classify_fare(fare: float, buckets: dict) -> str:
    """
    Classify a fare amount into a bucket.
    
    Args:
        fare: The fare amount in CAD
        buckets: Dict mapping bucket names to (min, max) tuples
    
    Returns:
        Bucket name as string
    
    Example:
        >>> classify_fare(3.50, FARE_BUCKETS)
        'Standard'
    """
    # TODO: Implement this function
    # 1. Check if fare is NaN using pd.isna(fare)
    # 2. Loop through buckets.items() to get (bucket_name, (low, high))
    # 3. Return bucket_name if low <= fare < high
    # 4. Return 'Unknown' if no bucket matches
    
    return 'Unknown'  # TODO: Replace with your implementation


def classify_discount(rate: float, bands: dict) -> str:
    """
    Classify a discount rate into a band.
    
    Args:
        rate: Discount rate (0.0 to 1.0)
        bands: Dict mapping band names to (min, max) tuples
    
    Returns:
        Band name as string
    """
    # TODO: Implement this function (similar logic to classify_fare)
    
    return 'Unknown'  # TODO: Replace with your implementation


# Test your functions
print("Testing classify_fare():")
print(f"   $2.50 -> {classify_fare(2.50, FARE_BUCKETS)} (expected: Budget)")
print(f"   $3.50 -> {classify_fare(3.50, FARE_BUCKETS)} (expected: Standard)")
print(f"   $5.00 -> {classify_fare(5.00, FARE_BUCKETS)} (expected: Premium)")
print(f"   $8.00 -> {classify_fare(8.00, FARE_BUCKETS)} (expected: Luxury)")

print("\nTesting classify_discount():")
print(f"   0.00 -> {classify_discount(0.00, DISCOUNT_BANDS)} (expected: Full Price)")
print(f"   0.10 -> {classify_discount(0.10, DISCOUNT_BANDS)} (expected: Light Discount)")
print(f"   0.25 -> {classify_discount(0.25, DISCOUNT_BANDS)} (expected: Moderate Discount)")
print(f"   0.40 -> {classify_discount(0.40, DISCOUNT_BANDS)} (expected: Heavy Discount)")

---
## Step 1: Connect and Extract Fare Data from PostgreSQL

Pull fare-related columns from the `raw_trips` table.

**TODO**: Write a SQL query to extract fare-related columns:
- `total_fare_cad`
- `discount_rate`
- `base_fare_cad`
- `discount_amount_cad`
- `zones_charged`
- `fare_class` (aliased as `source_fare_class`)

Filter out rows where `total_fare_cad IS NULL`.
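
One possible shape for this query is sketched below. To keep the sketch runnable without the course database, it runs against a tiny in-memory SQLite table standing in for `raw_trips`. The SQLite stand-in, the column types, and the two sample rows are assumptions made for illustration; in the notebook the same `SELECT` would run through the SQLAlchemy engine against PostgreSQL.

```python
import sqlite3

import pandas as pd

# Candidate query; the column names come from the TODO list above.
SQL_FARES_EXAMPLE = """
SELECT
    total_fare_cad,
    discount_rate,
    base_fare_cad,
    discount_amount_cad,
    zones_charged,
    fare_class AS source_fare_class
FROM raw_trips
WHERE total_fare_cad IS NOT NULL
"""

# In-memory stand-in for raw_trips (made-up rows and types, not course data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_trips ("
    "total_fare_cad REAL, discount_rate REAL, base_fare_cad REAL, "
    "discount_amount_cad REAL, zones_charged INTEGER, fare_class TEXT)"
)
conn.executemany(
    "INSERT INTO raw_trips VALUES (?, ?, ?, ?, ?, ?)",
    [
        (3.50, 0.00, 3.50, 0.00, 1, "adult"),
        (None, 0.25, 4.00, 1.00, 2, "senior"),  # removed by the WHERE clause
    ],
)

sample = pd.read_sql(SQL_FARES_EXAMPLE, conn)
conn.close()
print(sample)
```

Only the first row survives the filter, and the `fare_class` column appears under its alias `source_fare_class`.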

In [None]:
# ========= STEP 1: Extract fare data from PostgreSQL
print("Step 1: Extracting fare data from PostgreSQL...")
print("-" * 50)

# Connect to PostgreSQL
engine = create_engine(PG_URI)

# TODO: Write SQL to extract fare-related columns
SQL_FARES = """
-- TODO: Write your SELECT statement here

"""

with engine.connect() as conn:
    trips = pd.read_sql(text(SQL_FARES), conn)

trips = trim_df(trips)
trips['total_fare_cad'] = pd.to_numeric(trips['total_fare_cad'], errors='coerce')
trips['discount_rate'] = pd.to_numeric(trips['discount_rate'], errors='coerce')

print(f"Extracted {len(trips):,} trips with fare data")
print(f"\nSample:")
display(trips.head())

---
## Step 2: Analyze Fare Distribution

Understand the distribution of fares to validate our bucket boundaries.

In [None]:
# ========= STEP 2: Analyze fare distribution
print("Step 2: Analyzing fare distribution...")
print("-" * 50)

print("Fare statistics (total_fare_cad):")
display(trips['total_fare_cad'].describe())

print("\nDiscount statistics (discount_rate):")
display(trips['discount_rate'].describe())

In [None]:
# Fare percentiles - helps validate bucket boundaries
print("Fare percentiles:")
print("-" * 50)
percentiles = [10, 25, 50, 75, 90, 95, 99]
for p in percentiles:
    value = trips['total_fare_cad'].quantile(p/100)
    print(f"   {p}th percentile: ${value:.2f}")

print(f"\nDiscount rate distribution:")
print("-" * 50)
discount_dist = trips['discount_rate'].value_counts(bins=5, sort=False)
print(discount_dist)

---
## Step 3: Apply Classifications

Categorize each trip into fare buckets and discount bands.

**TODO**: Apply your classification functions to create two new columns:

1. `fare_bucket`: Result of applying `classify_fare()` to `total_fare_cad`
2. `discount_band`: Result of applying `classify_discount()` to `discount_rate`

Use the Series `.apply()` method with a lambda function (each column you select from `trips` is a Series).
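
For comparison, the same bucketing can also be done without a Python-level loop by using `pd.cut`. This is an alternative technique, not the approach the exercise asks for. The sketch below assumes the bin edges are copied by hand from `FARE_BUCKETS`; the `fare_edges`, `fare_labels`, and `fares` names are made up for the example.

```python
import numpy as np
import pandas as pd

# Bin edges and labels copied from FARE_BUCKETS in the Configuration cell.
fare_edges = [0, 3.00, 4.50, 6.50, np.inf]
fare_labels = ["Budget", "Standard", "Premium", "Luxury"]

# Made-up fares, including a boundary value (3.00) and a missing value.
fares = pd.Series([2.50, 3.00, 5.00, 8.00, np.nan])

# right=False makes each bin [low, high), matching the low <= fare < high rule.
buckets = pd.cut(fares, bins=fare_edges, labels=fare_labels, right=False)

# pd.cut leaves missing or out-of-range values as NaN; label them 'Unknown'.
buckets = buckets.cat.add_categories("Unknown").fillna("Unknown")

result = buckets.astype(str).tolist()
print(result)
```

On large tables the vectorized version is usually faster than `.apply()`, but the bin edges have to be kept in sync with the configuration dict by hand.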

In [None]:
# ========= STEP 3: Apply classifications
print("Step 3: Applying fare and discount classifications...")
print("-" * 50)

# TODO: Apply fare bucket classification
# trips['fare_bucket'] = trips['total_fare_cad'].apply(
#     lambda x: classify_fare(x, FARE_BUCKETS)
# )

# TODO: Apply discount band classification
# trips['discount_band'] = trips['discount_rate'].apply(
#     lambda x: classify_discount(x, DISCOUNT_BANDS)
# )

# Placeholder columns (remove after implementing above)
trips['fare_bucket'] = 'Unknown'
trips['discount_band'] = 'Unknown'

# Show distribution
print(f"\nFare bucket distribution:")
fare_bucket_counts = trips['fare_bucket'].value_counts()
for bucket, count in fare_bucket_counts.items():
    pct = count / len(trips) * 100
    print(f"   {bucket}: {count:,} ({pct:.1f}%)")

print(f"\nDiscount band distribution:")
discount_band_counts = trips['discount_band'].value_counts()
for band, count in discount_band_counts.items():
    pct = count / len(trips) * 100
    print(f"   {band}: {count:,} ({pct:.1f}%)")

In [None]:
# Cross-tabulation: fare bucket vs discount band
print("Cross-tabulation: Fare Bucket vs Discount Band")
print("-" * 50)
cross_tab = pd.crosstab(
    trips['fare_bucket'], 
    trips['discount_band'],
    margins=True
)
display(cross_tab)

---
## Step 4: Build the Dimension Table

Create a proper dimension table with surrogate keys and metadata.

**TODO**: Build the dimension table by:

1. Get unique combinations of `fare_bucket` and `discount_band`
2. Create a `fare_class_code` by combining shortened bucket and band names (e.g., 'BUD_FULL')
3. Create a `fare_class_label` as a readable combination (e.g., 'Budget / Full Price')
4. Add `fare_range` description based on FARE_BUCKETS
5. Sort by fare_bucket and discount_band
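
Steps 1 to 4 can be sketched on a small stand-in DataFrame as follows. The names `trips_example`, `fare_classes_example`, `EXAMPLE_BUCKETS`, and `describe_range` are hypothetical and exist only in this sketch. The `- 0.01` used for the upper end of each range assumes fares are recorded in whole cents, which follows the comments in the Configuration cell but is not confirmed by the data.

```python
import pandas as pd

# Thresholds copied from FARE_BUCKETS; renamed so the sketch does not
# overwrite the notebook's own variable.
EXAMPLE_BUCKETS = {
    "Budget": (0, 3.00),
    "Standard": (3.00, 4.50),
    "Premium": (4.50, 6.50),
    "Luxury": (6.50, float("inf")),
}

# Made-up classified trips standing in for the real `trips` DataFrame.
trips_example = pd.DataFrame({
    "fare_bucket": ["Budget", "Budget", "Standard", "Luxury"],
    "discount_band": ["Full Price", "Full Price", "Light Discount", "Heavy Discount"],
})

# 1. Unique combinations of bucket and band
fare_classes_example = (
    trips_example[["fare_bucket", "discount_band"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# 2. Code: first 3 characters of the bucket + first 4 of the band, upper-cased
fare_classes_example["fare_class_code"] = (
    fare_classes_example["fare_bucket"].str[:3].str.upper()
    + "_"
    + fare_classes_example["discount_band"].str[:4].str.upper()
)

# 3. Human-readable label
fare_classes_example["fare_class_label"] = (
    fare_classes_example["fare_bucket"] + " / " + fare_classes_example["discount_band"]
)


# 4. Range description derived from the thresholds
def describe_range(bucket: str) -> str:
    if bucket not in EXAMPLE_BUCKETS:
        return "Unknown"
    low, high = EXAMPLE_BUCKETS[bucket]
    if high == float("inf"):
        return f"${low:.2f}+"
    # Assumes whole-cent fares, so the largest value in [low, high) is high - 0.01
    return f"${low:.2f} - ${high - 0.01:.2f}"


fare_classes_example["fare_range"] = fare_classes_example["fare_bucket"].map(describe_range)
print(fare_classes_example)
```

The sorting in step 5 is already provided in the code cell, so it is left out here.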

In [None]:
# ========= STEP 4: Build dimension table
print("Step 4: Building dimension table...")
print("-" * 50)


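# TODO: Get unique combinations of fare_bucket and discount_band.
# Store the result in a DataFrame named `fare_classes`; the sorting code
# below and the next cell both rely on that name.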


# TODO: Create fare_class_code (e.g., 'BUD_FULL' for Budget/Full Price)
# Hint: Use string slicing to get first 3-4 chars of each part


# TODO: Create fare_class_label (e.g., 'Budget / Full Price')


# TODO: Add fare_range description

# Sort for consistent ordering
bucket_order = ['Budget', 'Standard', 'Premium', 'Luxury', 'Unknown']
band_order = ['Full Price', 'Light Discount', 'Moderate Discount', 'Heavy Discount', 'Unknown']

fare_classes['bucket_sort'] = fare_classes['fare_bucket'].map(
    {v: i for i, v in enumerate(bucket_order)}
).fillna(99)
fare_classes['band_sort'] = fare_classes['discount_band'].map(
    {v: i for i, v in enumerate(band_order)}
).fillna(99)

fare_classes = fare_classes.sort_values(['bucket_sort', 'band_sort']).reset_index(drop=True)
fare_classes = fare_classes.drop(columns=['bucket_sort', 'band_sort'])


print(f"Created {len(fare_classes)} unique fare class combinations")
print(f"\nFare class dimension:")
display(fare_classes)

In [None]:
# Create final dimension with warehouse-ready structure
print("Creating warehouse-ready dimension structure...")
print("-" * 50)

dim_fare_class = pd.DataFrame({
    'fare_class_code': fare_classes['fare_class_code'],
    'fare_class_label': fare_classes['fare_class_label'],
    'fare_bucket': fare_classes['fare_bucket'],
    'fare_range': fare_classes['fare_range'],
    'discount_band': fare_classes['discount_band'],
    'created_at': datetime.now(),
    'is_current': True
})

# Add surrogate key (will be IDENTITY in Redshift)
dim_fare_class.insert(0, 'fare_class_sk', range(1, len(dim_fare_class) + 1))

print(f"Final dimension structure:")
print(dim_fare_class.dtypes)
print(f"\nDimension rows: {len(dim_fare_class)}")
display(dim_fare_class)

---
## Step 5: Validate the Dimension

Ensure all trips can be classified and there are no gaps in coverage.
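
Besides counting 'Unknown' rows in the data, the threshold definitions themselves can be checked for gaps or overlaps. The sketch below is one way to do that; `find_boundary_mismatches` is a hypothetical helper, and the thresholds are copied from the Configuration cell under a different name.

```python
def find_boundary_mismatches(ranges):
    """Return (previous_high, next_low) pairs where consecutive ranges do not meet.

    A pair with previous_high < next_low is a gap; previous_high > next_low
    is an overlap.
    """
    ordered = sorted(ranges.values())  # sort (low, high) tuples by lower bound
    mismatches = []
    for (_, prev_high), (next_low, _) in zip(ordered, ordered[1:]):
        if prev_high != next_low:
            mismatches.append((prev_high, next_low))
    return mismatches


# Thresholds copied from FARE_BUCKETS in the Configuration cell.
EXAMPLE_FARE_BUCKETS = {
    "Budget": (0, 3.00),
    "Standard": (3.00, 4.50),
    "Premium": (4.50, 6.50),
    "Luxury": (6.50, float("inf")),
}

# A deliberately broken definition: nothing covers values from 1 up to 2.
BROKEN_EXAMPLE = {"A": (0, 1), "B": (2, 3)}

print(find_boundary_mismatches(EXAMPLE_FARE_BUCKETS))  # []
print(find_boundary_mismatches(BROKEN_EXAMPLE))        # [(1, 2)]
```

An empty result means each range starts exactly where the previous one ends, so any remaining 'Unknown' rows come from missing or out-of-range values rather than from holes in the rules.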

In [None]:
# ========= STEP 5: Validate
print("Step 5: Validating dimension coverage...")
print("-" * 50)

# Check for unclassified fares
unclassified_fares = trips[trips['fare_bucket'] == 'Unknown']
print(f"Unclassified fares: {len(unclassified_fares):,}")

if len(unclassified_fares) > 0:
    print("   Sample unclassified fare values:")
    display(unclassified_fares['total_fare_cad'].head(10))

# Check for unclassified discounts
unclassified_discounts = trips[trips['discount_band'] == 'Unknown']
print(f"\nUnclassified discounts: {len(unclassified_discounts):,}")

if len(unclassified_discounts) > 0:
    print("   Sample unclassified discount rates:")
    display(unclassified_discounts['discount_rate'].head(10))

---
## Step 6: Output the Dimension

Save the dimension table for loading into the data warehouse.

In [None]:
# ========= STEP 6: Output
print("Step 6: Outputting dimension table...")
print("-" * 50)

if not dim_fare_class.empty:
    dim_fare_class.to_csv(OUTPUT_DIM_FARE_CLASS, index=False)
    file_size = os.path.getsize(OUTPUT_DIM_FARE_CLASS) / 1024
    
    print(f"Saved to: {OUTPUT_DIM_FARE_CLASS}")
    print(f"File size: {file_size:.1f} KB")
    print(f"Total fare classes: {len(dim_fare_class)}")
else:
    print("No data to save - complete the TODO sections above first.")

---
## Step 7: Clean Up

In [None]:
# ========= STEP 7: Clean up
engine.dispose()
print("PostgreSQL connection closed.")

---
## Key Takeaways

### Derived Dimension Best Practices

1. **Analyze first**: Understand the data distribution before defining buckets
2. **Document rules**: Every threshold should have a business justification
3. **Handle edge cases**: Always include an "Unknown" category
4. **Version control**: Rules may change; track when and why

### Common Derived Dimensions

| Dimension | Based On | Categories |
|-----------|----------|------------|
| Age Band | birth_date | Child, Teen, Adult, Senior |
| Price Tier | unit_price | Budget, Mid, Premium |
| Distance Band | distance_km | Short, Medium, Long |
| Time of Day | timestamp | Morning, Afternoon, Evening, Night |