# Lesson 3: Exercise 3 Solution - Create Derived Fare Class Dimension

## Goal

Build a **derived dimension** that classifies fares into analytical categories. Derived dimensions add business meaning that doesn't exist in the raw source data.

## Prerequisites

You should have completed:
- **Lesson 1, Exercise 1**: Connected to PostgreSQL (`raw_trips` table)
- **Lesson 3, Exercise 1**: Extracted and transformed trips data

## What You Will Build

A Pandas-based transformation that:

1. Extracts fare data from PostgreSQL
2. Analyzes fare distribution
3. Creates fare buckets based on `total_fare_cad`
4. Adds discount bands based on `discount_rate`
5. Builds a reusable `dw_dim_fare_class` dimension table

### Why Derived Dimensions Matter

Raw data contains continuous values like `total_fare_cad = 3.47`. Analysts need categorical groupings:
- "How do Premium fares compare to Standard fares?"
- "What's the distribution of discounted vs. full-price trips?"

Derived dimensions compute these categories **once during ETL**, so every query uses consistent logic.

### Acceptance Criteria

- Fare buckets are clearly defined with boundaries
- Discount bands categorize discount levels
- Output dimension is ready for warehouse loading
- Logic is documented and reproducible

---

## Imports and Dependencies

Run this cell first to import all required libraries.

In [1]:
# ========= Imports
import os
from datetime import datetime
from typing import Tuple, List
import numpy as np
import pandas as pd
from sqlalchemy import create_engine, text

print("All imports successful!")
print(f"   - pandas version: {pd.__version__}")
print(f"   - numpy version: {np.__version__}")

All imports successful!
   - pandas version: 2.3.1
   - numpy version: 2.2.6


---
## Configuration

**Important:** These credentials match the `populate-postgres.py` script from Lesson 1. Update only if your environment differs.

In [2]:
# ========= PostgreSQL Configuration ==========
# These match the populate-postgres.py script from Lesson 1.

PG_HOST = "localhost"      # Database host
PG_PORT = "5432"           # Database port
PG_DB = "postgres"         # Database name (populate script uses 'postgres')
PG_USER = "temp"           # User from populate-postgres.py
PG_PASSWORD = "temp"       # Password from populate-postgres.py

PG_URI = f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DB}"

# Output path
OUTPUT_DIM_FARE_CLASS = "/tmp/dim_fare_class.csv"

# ========= FARE CLASSIFICATION RULES ==========
# These thresholds define our fare buckets.
# Each range is half-open: low <= fare < high (see classify_fare below).
FARE_BUCKETS = {
    'Budget': (0, 3.00),        # $0.00 - $2.99
    'Standard': (3.00, 4.50),   # $3.00 - $4.49
    'Premium': (4.50, 6.50),    # $4.50 - $6.49
    'Luxury': (6.50, float('inf'))  # $6.50+
}

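# Discount classification rules (half-open ranges: low <= rate < high)
DISCOUNT_BANDS = {
    'Full Price': (0, 0.01),            # 0% up to, but not including, 1%
    'Light Discount': (0.01, 0.20),     # 1-19%
    'Moderate Discount': (0.20, 0.35),  # 20-34%
    'Heavy Discount': (0.35, 1.01)      # 35-100%; the 1.01 upper bound keeps a 100% discount in range
}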

print("Configuration set!")
print(f"   - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB} (user: {PG_USER})")
print(f"   - Output: {OUTPUT_DIM_FARE_CLASS}")
print(f"\nFare bucket thresholds:")
for name, (low, high) in FARE_BUCKETS.items():
    high_str = f"${high:.2f}" if high != float('inf') else "unlimited"
    print(f"   {name}: ${low:.2f} - {high_str}")
print(f"\nDiscount band thresholds:")
for name, (low, high) in DISCOUNT_BANDS.items():
    print(f"   {name}: {low*100:.0f}% - {high*100:.0f}%")

Configuration set!
   - PostgreSQL: localhost:5432/postgres (user: temp)
   - Output: /tmp/dim_fare_class.csv

Fare bucket thresholds:
   Budget: $0.00 - $3.00
   Standard: $3.00 - $4.50
   Premium: $4.50 - $6.50
   Luxury: $6.50 - unlimited

Discount band thresholds:
   Full Price: 0% - 1%
   Light Discount: 1% - 20%
   Moderate Discount: 20% - 35%
   Heavy Discount: 35% - 101%
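
The configuration cell hard-codes the lesson credentials. If your environment differs, one option is to read overrides from environment variables. The following is a minimal sketch, not part of the lesson code: `build_pg_uri` is a hypothetical helper, and the environment variable names (`PG_HOST`, `PG_PORT`, `PG_DB`, `PG_USER`, `PG_PASSWORD`) are assumptions chosen to mirror the constants above.

```python
import os
from urllib.parse import quote_plus


def build_pg_uri(env=None) -> str:
    """Build a PostgreSQL URI, falling back to the lesson defaults.

    `env` is any mapping of settings; it defaults to os.environ.
    The variable names below are assumptions, not lesson requirements.
    """
    env = os.environ if env is None else env
    host = env.get("PG_HOST", "localhost")
    port = env.get("PG_PORT", "5432")
    db = env.get("PG_DB", "postgres")
    user = env.get("PG_USER", "temp")
    password = env.get("PG_PASSWORD", "temp")
    # Percent-encode credentials so characters such as '@' or '/'
    # cannot be mistaken for URI delimiters.
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


# With no overrides, the result matches the hard-coded PG_URI.
print(build_pg_uri({}))  # postgresql://temp:temp@localhost:5432/postgres
```

Passing an explicit mapping, as in the last line, makes the helper easy to check without touching the real environment.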


---
## Verify Database Setup

This cell verifies that PostgreSQL has the `raw_trips` data. If verification fails, run:

```python
!python populate-postgres.py
```

In [3]:
# ========= Verify PostgreSQL ==========
print("Verifying PostgreSQL setup...")
try:
    engine = create_engine(PG_URI)
    with engine.connect() as conn:
        result = conn.execute(text("SELECT COUNT(*) FROM raw_trips"))
        count = result.scalar()
    print(f"   OK: raw_trips has {count:,} rows")
    engine.dispose()
    print("   Verification PASSED - ready to proceed!")
except Exception as e:
    print(f"   ERROR: {e}")
    print("\nPlease run: python populate-postgres.py")

Verifying PostgreSQL setup...
   OK: raw_trips has 2,500 rows
   Verification PASSED - ready to proceed!


---
## Helper Functions

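Helpers for cleaning text columns (`trim_df`) and for categorizing continuous values into buckets (`classify_fare`, `classify_discount`). Both classifiers treat each range as half-open (`low <= value < high`) and return `'Unknown'` for missing or out-of-range values.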

In [4]:
# ========= Helper Functions

def trim_df(df: pd.DataFrame) -> pd.DataFrame:
    """
    Standardize text fields and handle NaN values.
    """
    df = df.copy()
    for c in df.select_dtypes(include=['object']).columns:
        df[c] = df[c].astype(str).str.strip()
        # Use mask instead of replace to avoid FutureWarning
        df[c] = df[c].mask(df[c].isin(['nan', 'None', 'NaN', '']), np.nan)
    return df


def classify_fare(fare: float, buckets: dict) -> str:
    """
    Classify a fare amount into a bucket.
    
    Args:
        fare: The fare amount in CAD
        buckets: Dict mapping bucket names to (min, max) tuples
    
    Returns:
        Bucket name as string
    
    Example:
        >>> classify_fare(3.50, FARE_BUCKETS)
        'Standard'
    """
    if pd.isna(fare):
        return 'Unknown'
    
    for bucket_name, (low, high) in buckets.items():
        if low <= fare < high:
            return bucket_name
    
    return 'Unknown'


def classify_discount(rate: float, bands: dict) -> str:
    """
    Classify a discount rate into a band.
    
    Args:
        rate: Discount rate (0.0 to 1.0)
        bands: Dict mapping band names to (min, max) tuples
    
    Returns:
        Band name as string
    """
    if pd.isna(rate):
        return 'Unknown'
    
    for band_name, (low, high) in bands.items():
        if low <= rate < high:
            return band_name
    
    return 'Unknown'


print("Helper functions defined: trim_df(), classify_fare(), classify_discount()")

Helper functions defined: trim_df(), classify_fare(), classify_discount()
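
Because the classifiers test `low <= value < high`, each range's upper bound must equal the next range's lower bound. Otherwise some values fall into a gap and come back as `'Unknown'`, or an overlap is silently resolved by dictionary order. A minimal sketch of a sanity check follows; `find_boundary_problems` is a hypothetical helper that is not part of the lesson code, and `FARE_BUCKETS` is repeated only so the block runs on its own.

```python
FARE_BUCKETS = {
    "Budget": (0, 3.00),
    "Standard": (3.00, 4.50),
    "Premium": (4.50, 6.50),
    "Luxury": (6.50, float("inf")),
}


def find_boundary_problems(ranges: dict) -> list:
    """Describe every gap or overlap between consecutive (low, high) ranges."""
    problems = []
    # Sort by lower bound so the check does not depend on dict order.
    ordered = sorted(ranges.items(), key=lambda item: item[1][0])
    for (name_a, (_, high_a)), (name_b, (low_b, _)) in zip(ordered, ordered[1:]):
        if high_a < low_b:
            problems.append(f"gap between {name_a} and {name_b}")
        elif high_a > low_b:
            problems.append(f"overlap between {name_a} and {name_b}")
    return problems


print(find_boundary_problems(FARE_BUCKETS))  # []
```

An empty list means the ranges tile the number line from the first lower bound to the last upper bound with no gaps or overlaps.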


---
## Step 1: Connect and Extract Fare Data from PostgreSQL

Pull fare-related columns from the `raw_trips` table.

In [5]:
# ========= STEP 1: Extract fare data from PostgreSQL
print("Step 1: Extracting fare data from PostgreSQL...")
print("-" * 50)

# Connect to PostgreSQL
engine = create_engine(PG_URI)

# Extract fare-related columns
SQL_FARES = """
SELECT 
    total_fare_cad,
    discount_rate,
    base_fare_cad,
    discount_amount_cad,
    zones_charged,
    fare_class as source_fare_class
FROM raw_trips
WHERE total_fare_cad IS NOT NULL
"""

with engine.connect() as conn:
    trips = pd.read_sql(text(SQL_FARES), conn)

trips = trim_df(trips)
trips['total_fare_cad'] = pd.to_numeric(trips['total_fare_cad'], errors='coerce')
trips['discount_rate'] = pd.to_numeric(trips['discount_rate'], errors='coerce')

print(f"Extracted {len(trips):,} trips with fare data")
print(f"\nSample:")
display(trips.head())

Step 1: Extracting fare data from PostgreSQL...
--------------------------------------------------
Extracted 2,500 trips with fare data

Sample:


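|   | total_fare_cad | discount_rate | base_fare_cad | discount_amount_cad | zones_charged | source_fare_class |
|---|----------------|---------------|---------------|---------------------|---------------|-------------------|
| 0 | 3.32 | 0.0 | 3.32 | 0.0 | 1 | adult |
| 1 | 3.17 | 0.0 | 3.17 | 0.0 | 1 | adult |
| 2 | 2.12 | 0.32 | 3.12 | 1.0 | 1 | youth |
| 3 | 2.12 | 0.32 | 3.12 | 1.0 | 1 | youth |
| 4 | 4.51 | 0.0 | 4.51 | 0.0 | 2 | adult |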


---
## Step 2: Analyze Fare Distribution

Understand the distribution of fares to validate our bucket boundaries.

In [6]:
# ========= STEP 2: Analyze fare distribution
print("Step 2: Analyzing fare distribution...")
print("-" * 50)

print("Fare statistics (total_fare_cad):")
display(trips['total_fare_cad'].describe())

print("\nDiscount statistics (discount_rate):")
display(trips['discount_rate'].describe())

Step 2: Analyzing fare distribution...
--------------------------------------------------
Fare statistics (total_fare_cad):


count    2500.000000
mean        3.028128
std         0.941073
min         1.860000
25%         2.170000
50%         3.130000
75%         3.190000
max         9.580000
Name: total_fare_cad, dtype: float64


Discount statistics (discount_rate):


count    2500.000000
mean        0.122988
std         0.163261
min         0.000000
25%         0.000000
50%         0.000000
75%         0.320000
max         0.400000
Name: discount_rate, dtype: float64

In [7]:
# Fare percentiles
print("Fare percentiles:")
print("-" * 50)
percentiles = [10, 25, 50, 75, 90, 95, 99]
for p in percentiles:
    value = trips['total_fare_cad'].quantile(p/100)
    print(f"   {p}th percentile: ${value:.2f}")

print(f"\nDiscount rate distribution:")
print("-" * 50)
discount_dist = trips['discount_rate'].value_counts(bins=5, sort=False)
print(discount_dist)

Fare percentiles:
--------------------------------------------------
   10th percentile: $2.05
   25th percentile: $2.17
   50th percentile: $3.13
   75th percentile: $3.19
   90th percentile: $4.10
   95th percentile: $4.57
   99th percentile: $8.00

Discount rate distribution:
--------------------------------------------------
(-0.0014, 0.08]    1544
(0.08, 0.16]         89
(0.16, 0.24]          0
(0.24, 0.32]        411
(0.32, 0.4]         456
Name: count, dtype: int64


---
## Step 3: Apply Classifications

Categorize each trip into fare buckets and discount bands.

In [8]:
# ========= STEP 3: Apply classifications
print("Step 3: Applying fare and discount classifications...")
print("-" * 50)

# Apply fare bucket classification
trips['fare_bucket'] = trips['total_fare_cad'].apply(
    lambda x: classify_fare(x, FARE_BUCKETS)
)
print("Applied fare bucket classification")

# Apply discount band classification
trips['discount_band'] = trips['discount_rate'].apply(
    lambda x: classify_discount(x, DISCOUNT_BANDS)
)
print("Applied discount band classification")

# Show distribution
print(f"\nFare bucket distribution:")
fare_bucket_counts = trips['fare_bucket'].value_counts()
for bucket, count in fare_bucket_counts.items():
    pct = count / len(trips) * 100
    print(f"   {bucket}: {count:,} ({pct:.1f}%)")

print(f"\nDiscount band distribution:")
discount_band_counts = trips['discount_band'].value_counts()
for band, count in discount_band_counts.items():
    pct = count / len(trips) * 100
    print(f"   {band}: {count:,} ({pct:.1f}%)")

Step 3: Applying fare and discount classifications...
--------------------------------------------------
Applied fare bucket classification
Applied discount band classification

Fare bucket distribution:
   Standard: 1,389 (55.6%)
   Budget: 863 (34.5%)
   Premium: 207 (8.3%)
   Luxury: 41 (1.6%)

Discount band distribution:
   Full Price: 1,544 (61.8%)
   Heavy Discount: 456 (18.2%)
   Moderate Discount: 411 (16.4%)
   Light Discount: 89 (3.6%)
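
Row-wise `apply` is easy to read but calls a Python function once per trip. On larger tables, a vectorized alternative such as `pd.cut` may be preferable. The sketch below is an assumption-laden alternative, not the lesson's method: `classify_with_cut` is a hypothetical helper, and it only matches `classify_fare` when the ranges are listed in ascending order with shared boundaries.

```python
import numpy as np
import pandas as pd

FARE_BUCKETS = {
    "Budget": (0, 3.00),
    "Standard": (3.00, 4.50),
    "Premium": (4.50, 6.50),
    "Luxury": (6.50, float("inf")),
}


def classify_with_cut(values: pd.Series, ranges: dict) -> pd.Series:
    """Vectorized bucketing; assumes ascending, contiguous (low, high) ranges."""
    names = list(ranges)
    # Bin edges: the first lower bound, then every upper bound.
    edges = [ranges[names[0]][0]] + [ranges[name][1] for name in names]
    # right=False makes each bin half-open: low <= value < high.
    binned = pd.cut(values, bins=edges, labels=names, right=False)
    # Missing or out-of-range values come back as NaN; label them 'Unknown'.
    return binned.cat.add_categories("Unknown").fillna("Unknown").astype(str)


sample = pd.Series([2.99, 3.00, 4.49, 4.50, 6.50, np.nan, -1.0])
print(classify_with_cut(sample, FARE_BUCKETS).tolist())
```

The sample values sit on or next to each boundary, which makes it straightforward to compare the result against `classify_fare` before swapping one for the other.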


In [9]:
# Cross-tabulation: fare bucket vs discount band
print("Cross-tabulation: Fare Bucket vs Discount Band")
print("-" * 50)
cross_tab = pd.crosstab(
    trips['fare_bucket'], 
    trips['discount_band'],
    margins=True
)
display(cross_tab)

Cross-tabulation: Fare Bucket vs Discount Band
--------------------------------------------------


| fare_bucket \ discount_band | Full Price | Heavy Discount | Light Discount | Moderate Discount | All |
|-----------------------------|------------|----------------|----------------|-------------------|-----|
| Budget | 0 | 442 | 75 | 346 | 863 |
| Luxury | 22 | 10 | 0 | 9 | 41 |
| Premium | 207 | 0 | 0 | 0 | 207 |
| Standard | 1315 | 4 | 14 | 56 | 1389 |
| All | 1544 | 456 | 89 | 411 | 2500 |


---
## Step 4: Build the Dimension Table

Create a proper dimension table with surrogate keys and metadata.

In [10]:
# ========= STEP 4: Build dimension table
print("Step 4: Building dimension table...")
print("-" * 50)

# Create unique fare class combinations
fare_classes = trips[['fare_bucket', 'discount_band']].drop_duplicates().reset_index(drop=True)

# Add a composite fare_class code
fare_classes['fare_class_code'] = (
    fare_classes['fare_bucket'].str[:3].str.upper() + '_' +
    fare_classes['discount_band'].str.replace(' ', '').str[:4].str.upper()
)

# Add descriptive label
fare_classes['fare_class_label'] = (
    fare_classes['fare_bucket'] + ' / ' + fare_classes['discount_band']
)

# Add fare range description
def get_fare_range(bucket):
    if bucket in FARE_BUCKETS:
        low, high = FARE_BUCKETS[bucket]
        if high == float('inf'):
            return f"${low:.2f}+"
        return f"${low:.2f} - ${high:.2f}"
    return "Unknown"

fare_classes['fare_range'] = fare_classes['fare_bucket'].apply(get_fare_range)

# Sort for consistent ordering
bucket_order = ['Budget', 'Standard', 'Premium', 'Luxury', 'Unknown']
band_order = ['Full Price', 'Light Discount', 'Moderate Discount', 'Heavy Discount', 'Unknown']

fare_classes['bucket_sort'] = fare_classes['fare_bucket'].map(
    {v: i for i, v in enumerate(bucket_order)}
).fillna(99)
fare_classes['band_sort'] = fare_classes['discount_band'].map(
    {v: i for i, v in enumerate(band_order)}
).fillna(99)

fare_classes = fare_classes.sort_values(['bucket_sort', 'band_sort']).reset_index(drop=True)
fare_classes = fare_classes.drop(columns=['bucket_sort', 'band_sort'])

print(f"Created {len(fare_classes)} unique fare class combinations")
print(f"\nFare class dimension:")
display(fare_classes)

Step 4: Building dimension table...
--------------------------------------------------
Created 11 unique fare class combinations

Fare class dimension:


|    | fare_bucket | discount_band | fare_class_code | fare_class_label | fare_range |
|----|-------------|---------------|-----------------|------------------|------------|
| 0 | Budget | Light Discount | BUD_LIGH | Budget / Light Discount | $0.00 - $3.00 |
| 1 | Budget | Moderate Discount | BUD_MODE | Budget / Moderate Discount | $0.00 - $3.00 |
| 2 | Budget | Heavy Discount | BUD_HEAV | Budget / Heavy Discount | $0.00 - $3.00 |
| 3 | Standard | Full Price | STA_FULL | Standard / Full Price | $3.00 - $4.50 |
| 4 | Standard | Light Discount | STA_LIGH | Standard / Light Discount | $3.00 - $4.50 |
| 5 | Standard | Moderate Discount | STA_MODE | Standard / Moderate Discount | $3.00 - $4.50 |
| 6 | Standard | Heavy Discount | STA_HEAV | Standard / Heavy Discount | $3.00 - $4.50 |
| 7 | Premium | Full Price | PRE_FULL | Premium / Full Price | $4.50 - $6.50 |
| 8 | Luxury | Full Price | LUX_FULL | Luxury / Full Price | $6.50+ |
| 9 | Luxury | Moderate Discount | LUX_MODE | Luxury / Moderate Discount | $6.50+ |
| 10 | Luxury | Heavy Discount | LUX_HEAV | Luxury / Heavy Discount | $6.50+ |


In [11]:
# Create final dimension with warehouse-ready structure
print("Creating warehouse-ready dimension structure...")
print("-" * 50)

dim_fare_class = pd.DataFrame({
    'fare_class_code': fare_classes['fare_class_code'],
    'fare_class_label': fare_classes['fare_class_label'],
    'fare_bucket': fare_classes['fare_bucket'],
    'fare_range': fare_classes['fare_range'],
    'discount_band': fare_classes['discount_band'],
    'created_at': datetime.now(),
    'is_current': True
})

# Add surrogate key (will be IDENTITY in Redshift)
dim_fare_class.insert(0, 'fare_class_sk', range(1, len(dim_fare_class) + 1))

print(f"Final dimension structure:")
print(dim_fare_class.dtypes)
print(f"\nDimension rows: {len(dim_fare_class)}")
display(dim_fare_class)

Creating warehouse-ready dimension structure...
--------------------------------------------------
Final dimension structure:
fare_class_sk                int64
fare_class_code             object
fare_class_label            object
fare_bucket                 object
fare_range                  object
discount_band               object
created_at          datetime64[us]
is_current                    bool
dtype: object

Dimension rows: 11


|    | fare_class_sk | fare_class_code | fare_class_label | fare_bucket | fare_range | discount_band | created_at | is_current |
|----|---------------|-----------------|------------------|-------------|------------|---------------|------------|------------|
| 0 | 1 | BUD_LIGH | Budget / Light Discount | Budget | $0.00 - $3.00 | Light Discount | 2025-12-15 20:44:31.211168 | True |
| 1 | 2 | BUD_MODE | Budget / Moderate Discount | Budget | $0.00 - $3.00 | Moderate Discount | 2025-12-15 20:44:31.211168 | True |
| 2 | 3 | BUD_HEAV | Budget / Heavy Discount | Budget | $0.00 - $3.00 | Heavy Discount | 2025-12-15 20:44:31.211168 | True |
| 3 | 4 | STA_FULL | Standard / Full Price | Standard | $3.00 - $4.50 | Full Price | 2025-12-15 20:44:31.211168 | True |
| 4 | 5 | STA_LIGH | Standard / Light Discount | Standard | $3.00 - $4.50 | Light Discount | 2025-12-15 20:44:31.211168 | True |
| 5 | 6 | STA_MODE | Standard / Moderate Discount | Standard | $3.00 - $4.50 | Moderate Discount | 2025-12-15 20:44:31.211168 | True |
| 6 | 7 | STA_HEAV | Standard / Heavy Discount | Standard | $3.00 - $4.50 | Heavy Discount | 2025-12-15 20:44:31.211168 | True |
| 7 | 8 | PRE_FULL | Premium / Full Price | Premium | $4.50 - $6.50 | Full Price | 2025-12-15 20:44:31.211168 | True |
| 8 | 9 | LUX_FULL | Luxury / Full Price | Luxury | $6.50+ | Full Price | 2025-12-15 20:44:31.211168 | True |
| 9 | 10 | LUX_MODE | Luxury / Moderate Discount | Luxury | $6.50+ | Moderate Discount | 2025-12-15 20:44:31.211168 | True |
| 10 | 11 | LUX_HEAV | Luxury / Heavy Discount | Luxury | $6.50+ | Heavy Discount | 2025-12-15 20:44:31.211168 | True |
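
The `fare_class_code` is built from short prefixes of the bucket and band names, so two names that share a prefix would produce the same code. A minimal sketch of a uniqueness check follows. `make_code` is a hypothetical helper that mirrors the string slicing used above, and the `Staff` bucket is an invented name used only to show a collision.

```python
from itertools import product

BUCKETS = ["Budget", "Standard", "Premium", "Luxury", "Unknown"]
BANDS = ["Full Price", "Light Discount", "Moderate Discount",
         "Heavy Discount", "Unknown"]


def make_code(bucket: str, band: str) -> str:
    """First 3 letters of the bucket + '_' + first 4 letters of the band."""
    return bucket[:3].upper() + "_" + band.replace(" ", "")[:4].upper()


# Generate a code for every possible bucket/band pair.
codes = [make_code(bucket, band) for bucket, band in product(BUCKETS, BANDS)]
duplicates = sorted({code for code in codes if codes.count(code) > 1})
print(duplicates)  # []

# A hypothetical future bucket named 'Staff' would collide with 'Standard'.
print(make_code("Staff", "Full Price") == make_code("Standard", "Full Price"))  # True
```

Running a check like this whenever bucket or band names change helps keep the code usable as a lookup key.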


---
## Step 5: Validate the Derived Dimension

Ensure all trips can be classified and the dimension is complete.

In [12]:
# ========= STEP 5: Validate
print("Step 5: Validating derived dimension...")
print("-" * 50)

# Check all trips can be classified
unclassified_fares = trips[trips['fare_bucket'] == 'Unknown']
unclassified_discounts = trips[trips['discount_band'] == 'Unknown']

print(f"Classification coverage:")
print(f"   Unclassified fares: {len(unclassified_fares):,}")
print(f"   Unclassified discounts: {len(unclassified_discounts):,}")

# Check dimension completeness
trip_combinations = trips[['fare_bucket', 'discount_band']].drop_duplicates()
dim_combinations = dim_fare_class[['fare_bucket', 'discount_band']]

missing_in_dim = trip_combinations.merge(
    dim_combinations, 
    on=['fare_bucket', 'discount_band'], 
    how='left', 
    indicator=True
)
missing = missing_in_dim[missing_in_dim['_merge'] == 'left_only']

print(f"\nDimension completeness:")
print(f"   Trip combinations: {len(trip_combinations)}")
print(f"   Dimension entries: {len(dim_fare_class)}")
print(f"   Missing from dim: {len(missing)}")

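if len(missing) == 0:
    print("   All combinations covered - PASSED")
else:
    print("   Missing combinations - FAILED")
    print(missing[['fare_bucket', 'discount_band']].to_string(index=False))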

Step 5: Validating derived dimension...
--------------------------------------------------
Classification coverage:
   Unclassified fares: 0
   Unclassified discounts: 0

Dimension completeness:
   Trip combinations: 11
   Dimension entries: 11
   Missing from dim: 0
   All combinations covered - PASSED


In [13]:
# Validate bucket boundaries make sense
print("Fare bucket validation:")
print("-" * 50)
for bucket in ['Budget', 'Standard', 'Premium', 'Luxury']:
    bucket_trips = trips[trips['fare_bucket'] == bucket]
    if len(bucket_trips) > 0:
        min_fare = bucket_trips['total_fare_cad'].min()
        max_fare = bucket_trips['total_fare_cad'].max()
        avg_fare = bucket_trips['total_fare_cad'].mean()
        print(f"   {bucket}:")
        print(f"      Count: {len(bucket_trips):,}")
        print(f"      Range: ${min_fare:.2f} - ${max_fare:.2f}")
        print(f"      Average: ${avg_fare:.2f}")

Fare bucket validation:
--------------------------------------------------
   Budget:
      Count: 863
      Range: $1.86 - $2.99
      Average: $2.19
   Standard:
      Count: 1,389
      Range: $3.00 - $4.14
      Average: $3.17
   Premium:
      Count: 207
      Range: $4.50 - $4.91
      Average: $4.57
   Luxury:
      Count: 41
      Range: $6.91 - $9.58
      Average: $7.91


---
## Step 6: Output the Dimension

Save the derived dimension for loading into the warehouse.

In [14]:
# ========= STEP 6: Output
print("Step 6: Outputting derived dimension...")
print("-" * 50)

# Save to CSV
dim_fare_class.to_csv(OUTPUT_DIM_FARE_CLASS, index=False)
file_size = os.path.getsize(OUTPUT_DIM_FARE_CLASS) / 1024

print(f"Saved to: {OUTPUT_DIM_FARE_CLASS}")
print(f"File size: {file_size:.1f} KB")
print(f"Total fare classes: {len(dim_fare_class)}")

Step 6: Outputting derived dimension...
--------------------------------------------------
Saved to: /tmp/dim_fare_class.csv
File size: 1.2 KB
Total fare classes: 11


---
## Step 7: Document the Logic

Create documentation for the derived dimension so future maintainers understand the rules.

In [15]:
# ========= STEP 7: Document
documentation = f"""
# Derived Dimension: dim_fare_class

## Purpose
Categorizes transit fares into analytical buckets for reporting and analysis.

## Source
Derived from PostgreSQL `raw_trips.total_fare_cad` and `raw_trips.discount_rate`

## Fare Bucket Rules
| Bucket   | Min (CAD) | Max (CAD) | Description          |
|----------|-----------|-----------|----------------------|
| Budget   | $0.00     | $2.99     | Low-cost fares       |
| Standard | $3.00     | $4.49     | Typical single-zone  |
| Premium  | $4.50     | $6.49     | Multi-zone fares     |
| Luxury   | $6.50     | +         | Extended travel      |

## Discount Band Rules
| Band              | Min Rate | Max Rate | Description           |
|-------------------|----------|----------|-----------------------|
| Full Price        | 0%       | 0%       | No discount applied   |
| Light Discount    | 1%       | 19%      | Minor concession      |
| Moderate Discount | 20%      | 34%      | Youth/Senior discount |
| Heavy Discount    | 35%      | 100%     | Major concession      |

## Usage
Join fact_trips to dim_fare_class on the composite key (fare_bucket, discount_band)
or create a lookup by fare_class_code.

## Maintenance
- Review bucket thresholds annually
- Update if fare structure changes
- Regenerate dimension after rule changes

Generated: {datetime.now().isoformat()}
"""

print(documentation)

# Save documentation
doc_path = "/tmp/dim_fare_class_README.md"
with open(doc_path, 'w') as f:
    f.write(documentation)
print(f"\nDocumentation saved to: {doc_path}")


# Derived Dimension: dim_fare_class

## Purpose
Categorizes transit fares into analytical buckets for reporting and analysis.

## Source
Derived from PostgreSQL `raw_trips.total_fare_cad` and `raw_trips.discount_rate`

## Fare Bucket Rules
| Bucket   | Min (CAD) | Max (CAD) | Description          |
|----------|-----------|-----------|----------------------|
| Budget   | $0.00     | $2.99     | Low-cost fares       |
| Standard | $3.00     | $4.49     | Typical single-zone  |
| Premium  | $4.50     | $6.49     | Multi-zone fares     |
| Luxury   | $6.50     | +         | Extended travel      |

## Discount Band Rules
| Band              | Min Rate | Max Rate | Description           |
|-------------------|----------|----------|-----------------------|
| Full Price        | 0%       | 0%       | No discount applied   |
| Light Discount    | 1%       | 19%      | Minor concession      |
| Moderate Discount | 20%      | 34%      | Youth/Senior discount |
| Heavy Discount    | 35%      | 100%     | Major concession      |

---
## Step 8: Clean Up

Dispose of the SQLAlchemy engine to release its pooled database connections.

In [16]:
# ========= STEP 8: Clean up
engine.dispose()
print("PostgreSQL connection closed.")

PostgreSQL connection closed.


---
## Summary

In [17]:
# ========= Final Summary
print("=" * 60)
print("DERIVED DIMENSION SUMMARY: Fare Class")
print("=" * 60)
print(f"""
Source:
  - PostgreSQL: {PG_HOST}:{PG_PORT}/{PG_DB}
  - Table: raw_trips
  - Records analyzed: {len(trips):,}

Output:
  - Dimension: {OUTPUT_DIM_FARE_CLASS}
  - Entries: {len(dim_fare_class)}

Fare Buckets:
{fare_bucket_counts.to_string()}

Discount Bands:
{discount_band_counts.to_string()}

Validation:
  - Unclassified fares: {len(unclassified_fares)}
  - Unclassified discounts: {len(unclassified_discounts)}
  - Dimension complete: {'YES' if len(missing) == 0 else 'NO'}

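Status:    {'SUCCESS' if len(unclassified_fares) == 0 and len(unclassified_discounts) == 0 and len(missing) == 0 else 'FAILED'}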
Completed: {datetime.now().isoformat()}
""")
print("=" * 60)

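============================================================
DERIVED DIMENSION SUMMARY: Fare Class
============================================================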

Source:
  - PostgreSQL: localhost:5432/postgres
  - Table: raw_trips
  - Records analyzed: 2,500

Output:
  - Dimension: /tmp/dim_fare_class.csv
  - Entries: 11

Fare Buckets:
fare_bucket
Standard    1389
Budget       863
Premium      207
Luxury        41

Discount Bands:
discount_band
Full Price           1544
Heavy Discount        456
Moderate Discount     411
Light Discount         89

Validation:
  - Unclassified fares: 0
  - Unclassified discounts: 0
  - Dimension complete: YES

Status:    SUCCESS
Completed: 2025-12-15T20:44:55.624062

============================================================

---
## Key Takeaways

### Derived Dimension Best Practices

1. **Analyze first**: Understand the data distribution before defining buckets
2. **Document rules**: Every threshold should have a business justification
3. **Handle edge cases**: Always include an "Unknown" category
4. **Version control**: Rules may change; track when and why

### Common Derived Dimensions

| Dimension | Based On | Categories |
|-----------|----------|------------|
| Age Band | birth_date | Child, Teen, Adult, Senior |
| Price Tier | unit_price | Budget, Mid, Premium |
| Distance Band | distance_km | Short, Medium, Long |
| Time of Day | timestamp | Morning, Afternoon, Evening, Night |
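
The same range-lookup pattern used for fare buckets carries over to the other dimensions in this table. As a minimal sketch, here is a Time of Day classifier. The hour boundaries and the `classify_time_of_day` helper are assumptions for illustration; real boundaries should come from the business rules.

```python
from datetime import datetime

# Assumed boundaries, as half-open hour ranges: low <= hour < high.
TIME_OF_DAY_BANDS = {
    "Night": (0, 6),
    "Morning": (6, 12),
    "Afternoon": (12, 18),
    "Evening": (18, 24),
}


def classify_time_of_day(ts) -> str:
    """Map a timestamp to a time-of-day band; None maps to 'Unknown'."""
    if ts is None:
        return "Unknown"
    for band_name, (low, high) in TIME_OF_DAY_BANDS.items():
        if low <= ts.hour < high:
            return band_name
    return "Unknown"


print(classify_time_of_day(datetime(2025, 12, 15, 20, 44)))  # Evening
print(classify_time_of_day(datetime(2025, 12, 15, 6, 0)))    # Morning
```

Only the thresholds dictionary changes between dimensions; the half-open comparison and the `'Unknown'` fallback stay the same.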