# 011 - Sklearn DuckDB SQL Regression (LightGBM + XGBoost)

This notebook demonstrates **complete batch ML training** for ETA (Estimated Time of Arrival) regression using:

1. **DuckDB SQL** for data loading and preprocessing from Delta Lake
2. **LightGBM** as primary model (fastest, highly accurate for regression)
3. **XGBoost** as fallback model
4. **All sklearn regression metrics** for comprehensive evaluation
5. **YellowBrick** for regression visualizations (coming next)

## Model Selection Rationale

Based on 2024-2025 benchmark comparisons (indicative figures; actual results vary with dataset and hardware):

| Model | Training Speed | Accuracy | Best For |
|-------|----------------|----------|----------|
| **LightGBM** | 7x faster than XGBoost | Lowest MAPE | Large datasets, real-time |
| CatBoost | 3-4x slower than LightGBM | Best default | Categorical features |
| XGBoost | Slowest | Very good | Fine-grained control |

**LightGBM is chosen** for ETA prediction because:
- Fastest training and inference (critical for real-time ETA updates)
- Excellent regression performance (lowest MAPE in benchmarks)
- Memory-efficient histogram-based algorithm
- Industry standard: Uber, Lyft, DiDi use gradient boosting for ETA

## Target Variable

- **`simulated_actual_travel_time_seconds`**: Actual travel time in seconds
- This is a continuous regression target

In [1]:
import duckdb
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [2]:
MINIO_HOST = "localhost"
MINIO_PORT = "9000"
MINIO_ENDPOINT = f"{MINIO_HOST}:{MINIO_PORT}"
MINIO_ACCESS_KEY = "minioadmin"
MINIO_SECRET_KEY = "minioadmin123"

In [3]:
DELTA_PATHS = {
    "Transaction Fraud Detection": "s3://lakehouse/delta/transaction_fraud_detection",
    "Estimated Time of Arrival": "s3://lakehouse/delta/estimated_time_of_arrival",
    "E-Commerce Customer Interactions": "s3://lakehouse/delta/e_commerce_customer_interactions",
}

In [4]:
# Disable AWS EC2 metadata service lookup (prevents 169.254.169.254 errors)
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

# Create connection (in-memory database)
conn = duckdb.connect()

# Install and load required extensions
conn.execute("INSTALL delta; LOAD delta;")
conn.execute("INSTALL httpfs; LOAD httpfs;")

# Create a secret for S3/MinIO credentials
conn.execute(f"""
    CREATE SECRET minio_secret (
        TYPE S3,
        KEY_ID '{MINIO_ACCESS_KEY}',
        SECRET '{MINIO_SECRET_KEY}',
        REGION 'us-east-1',
        ENDPOINT '{MINIO_ENDPOINT}',
        URL_STYLE 'path',
        USE_SSL false
    );
""")
print("DuckDB extensions loaded and S3 secret configured")

DuckDB extensions loaded and S3 secret configured


## Feature Definitions

Define features upfront for LightGBM's native categorical handling.

### ETA Features Overview

| Category | Features |
|----------|----------|
| **Numerical** | estimated_distance_km, temperature_celsius, driver_rating, hour_of_day, initial_estimated_travel_time_seconds, debug_traffic_factor, debug_weather_factor, debug_incident_delay_seconds, debug_driver_factor |
| **Categorical** | trip_id, driver_id, vehicle_id, origin, destination, weather, day_of_week, vehicle_type |
| **Temporal** | year, month, day, hour, minute, second (extracted from timestamp; treated as categorical in the code below) |
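
As a rough illustration of the temporal row, the sketch below splits a timestamp into the same six integer components using only the Python standard library. The helper name `timestamp_parts` and the ISO-8601 input string are assumptions made for this example; the notebook itself performs the extraction in SQL with `date_part`.

```python
from datetime import datetime


def timestamp_parts(ts: str) -> dict[str, int]:
    """Split an ISO-8601 timestamp string into integer components."""
    dt = datetime.fromisoformat(ts)
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "second": dt.second,
    }


# Hypothetical timestamp, chosen only for illustration
parts = timestamp_parts("2026-01-19T08:56:02")
print(parts)
```

Each component is a small integer, so it can be passed to a tree model either as a number or as a categorical code.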

In [5]:
# Feature definitions for Estimated Time of Arrival
ETA_NUMERICAL_FEATURES = [
    "estimated_distance_km",
    "temperature_celsius",
    "driver_rating",
    "hour_of_day",
    "initial_estimated_travel_time_seconds",
    "debug_traffic_factor",
    "debug_weather_factor",
    "debug_incident_delay_seconds",
    "debug_driver_factor",
]

ETA_CATEGORICAL_FEATURES = [
    # IDs (high cardinality - label encoded)
    "trip_id",
    "driver_id",
    "vehicle_id",
    # Locations
    "origin",
    "destination",
    # Context
    "weather",
    "day_of_week",
    "vehicle_type",
    # Temporal (extracted from timestamp)
    "year",
    "month",
    "day",
    "hour",
    "minute",
    "second",
]

ETA_ALL_FEATURES = ETA_NUMERICAL_FEATURES + ETA_CATEGORICAL_FEATURES

# Categorical feature indices for LightGBM (position in feature list)
ETA_CAT_FEATURE_INDICES = list(range(
    len(ETA_NUMERICAL_FEATURES),
    len(ETA_ALL_FEATURES)
))

# Categorical feature names for LightGBM
ETA_CAT_FEATURE_NAMES = ETA_CATEGORICAL_FEATURES

print(f"Numerical features: {len(ETA_NUMERICAL_FEATURES)}")
print(f"Categorical features: {len(ETA_CATEGORICAL_FEATURES)}")
print(f"Categorical indices: {ETA_CAT_FEATURE_INDICES}")

Numerical features: 9
Categorical features: 14
Categorical indices: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]


## DuckDB SQL Preprocessing

String-valued categorical features are **label-encoded in SQL** using `DENSE_RANK() - 1`; the timestamp components are extracted as plain integers.

This produces numeric data compatible with:
- LightGBM (pass `categorical_feature` for native handling)
- XGBoost (works directly with integers)
- YellowBrick (requires numeric data)
- All sklearn tools
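
The following minimal sketch shows what `DENSE_RANK() - 1` produces. It uses the standard-library `sqlite3` module purely as a stand-in for DuckDB (an assumption made so the example runs without a Delta table), and the `trips` table and its values are hypothetical.

```python
import sqlite3

# SQLite is only a stand-in for DuckDB here; both support DENSE_RANK()
# as a window function (SQLite needs version 3.25 or newer).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE trips (trip_no INTEGER, weather TEXT)")
conn_demo.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(1, "Snow"), (2, "Clear"), (3, "Rain"), (4, "Clear"), (5, "Snow")],
)

# Label-encode `weather`: alphabetical rank, shifted to start at 0
rows = conn_demo.execute(
    """
    SELECT trip_no, weather,
           DENSE_RANK() OVER (ORDER BY weather) - 1 AS weather_code
    FROM trips
    ORDER BY trip_no
    """
).fetchall()
print(rows)

# The window function is evaluated before LIMIT, so the code still
# reflects the rank within the full table, not within the limited result.
limited = conn_demo.execute(
    """
    SELECT trip_no,
           DENSE_RANK() OVER (ORDER BY weather) - 1 AS weather_code
    FROM trips
    ORDER BY trip_no
    LIMIT 1
    """
).fetchall()
print(limited)
```

Because window functions are evaluated before `LIMIT`, codes in a row-limited result are ranks over the whole table and need not be contiguous, which is consistent with the large `trip_id` codes that appear in a 10,000-row load.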

In [6]:
def load_data_duckdb_sql(
    delta_path: str,
    sample_frac: float | None = None,
    max_rows: int | None = None,
) -> pd.DataFrame:
    """
    Load and preprocess ETA data using pure DuckDB SQL.
    
    All categorical features are label-encoded using DENSE_RANK() - 1.
    This produces numeric data compatible with:
    - LightGBM (pass categorical_feature for native handling)
    - XGBoost (works directly with integers)
    - YellowBrick (requires numeric data)
    - All sklearn tools
    
    Args:
        delta_path: Path to Delta Lake table
        sample_frac: Optional fraction of data to sample (0.0-1.0)
        max_rows: Optional maximum number of rows to load
    
    Returns:
        DataFrame with preprocessed features (all numeric) and target
    """
    # Single query: All features numeric, categoricals label-encoded
    query = f"""
    SELECT
        -- Numerical features (unchanged)
        estimated_distance_km,
        temperature_celsius,
        driver_rating,
        hour_of_day,
        initial_estimated_travel_time_seconds,
        debug_traffic_factor,
        debug_weather_factor,
        debug_incident_delay_seconds,
        debug_driver_factor,

        -- Categorical features: Label encoded with DENSE_RANK() - 1
        -- This produces 0-indexed integers compatible with all ML tools
        DENSE_RANK() OVER (ORDER BY trip_id) - 1 AS trip_id,
        DENSE_RANK() OVER (ORDER BY driver_id) - 1 AS driver_id,
        DENSE_RANK() OVER (ORDER BY vehicle_id) - 1 AS vehicle_id,
        DENSE_RANK() OVER (ORDER BY origin) - 1 AS origin,
        DENSE_RANK() OVER (ORDER BY destination) - 1 AS destination,
        DENSE_RANK() OVER (ORDER BY weather) - 1 AS weather,
        DENSE_RANK() OVER (ORDER BY day_of_week) - 1 AS day_of_week,
        DENSE_RANK() OVER (ORDER BY vehicle_type) - 1 AS vehicle_type,

        -- Timestamp components (already integers)
        CAST(date_part('year', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS year,
        CAST(date_part('month', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS month,
        CAST(date_part('day', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS day,
        CAST(date_part('hour', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS hour,
        CAST(date_part('minute', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS minute,
        CAST(date_part('second', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS second,

        -- Target (continuous - travel time in seconds)
        simulated_actual_travel_time_seconds

    FROM delta_scan('{delta_path}')
    """
    
    print("Loading ETA data with DuckDB SQL (all features numeric)...")

    # Add sampling clause
    if sample_frac is not None and 0 < sample_frac < 1:
        query += f" USING SAMPLE {sample_frac * 100}%"
        print(f"  Sampling: {sample_frac * 100}%")

    # Add limit clause
    if max_rows is not None:
        query += f" LIMIT {max_rows}"
        print(f"  Max rows: {max_rows}")

    df = conn.execute(query).df()
    print(f"  Loaded {len(df):,} rows with {len(df.columns)} columns")
    print(f"  All features numeric: {df.select_dtypes(include=['number']).shape[1]}/{len(df.columns)} columns")
    
    return df

In [7]:
# Set model type for training
MODEL_TYPE = "lightgbm"  # "lightgbm" or "xgboost"

# Load data from Delta Lake
df = load_data_duckdb_sql(
    DELTA_PATHS["Estimated Time of Arrival"],
    max_rows=10000
)

Loading ETA data with DuckDB SQL (all features numeric)...
  Max rows: 10000


  Loaded 10,000 rows with 24 columns
  All features numeric: 24/24 columns


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 24 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   estimated_distance_km                  10000 non-null  float64
 1   temperature_celsius                    10000 non-null  float64
 2   driver_rating                          10000 non-null  float64
 3   hour_of_day                            10000 non-null  int32  
 4   initial_estimated_travel_time_seconds  10000 non-null  int32  
 5   debug_traffic_factor                   10000 non-null  float64
 6   debug_weather_factor                   10000 non-null  float64
 7   debug_incident_delay_seconds           10000 non-null  int32  
 8   debug_driver_factor                    10000 non-null  float64
 9   trip_id                                10000 non-null  int64  
 10  driver_id                              10000 non-null  int64  
 11  veh

In [9]:
# Check target variable distribution
print("Target Variable Statistics (simulated_actual_travel_time_seconds):")
print(df["simulated_actual_travel_time_seconds"].describe())
print(f"\nTarget range: {df['simulated_actual_travel_time_seconds'].min():.0f} - {df['simulated_actual_travel_time_seconds'].max():.0f} seconds")
print(f"Target range in minutes: {df['simulated_actual_travel_time_seconds'].min()/60:.1f} - {df['simulated_actual_travel_time_seconds'].max()/60:.1f} minutes")

Target Variable Statistics (simulated_actual_travel_time_seconds):
count    10000.000000
mean      4547.659900
std       2491.924058
min         60.000000
25%       2667.000000
50%       4231.000000
75%       6047.000000
max      19545.000000
Name: simulated_actual_travel_time_seconds, dtype: float64

Target range: 60 - 19545 seconds
Target range in minutes: 1.0 - 325.8 minutes


## Process Batch Data

Split features/target and prepare for training.

Data is **already all-numeric** from DuckDB SQL preprocessing.

In [10]:
def process_batch_data_duckdb(
    df: pd.DataFrame,
    test_size: float = 0.2,
    random_state: int = 42,
):
    """
    Process batch data for model training.
    
    Data is already all-numeric from load_data_duckdb_sql().
    Works with both LightGBM (pass categorical_feature) and XGBoost.
    
    Args:
        df: DataFrame from load_data_duckdb_sql() (all numeric)
        test_size: Fraction for test set
        random_state: Random seed
    
    Returns:
        X_train, X_test, y_train, y_test
    """
    # Split features and target
    y = df["simulated_actual_travel_time_seconds"]
    X = df.drop("simulated_actual_travel_time_seconds", axis=1)
    
    print(f"Features: {len(X.columns)} total ({len(ETA_NUMERICAL_FEATURES)} numeric, {len(ETA_CATEGORICAL_FEATURES)} label-encoded)")
    print(f"All features are numeric - compatible with YellowBrick and sklearn tools")
    
    # Train/test split (no stratification for regression)
    print(f"Splitting data: {1-test_size:.0%} train, {test_size:.0%} test...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        random_state=random_state,
    )
    
    print(f"  Training set: {len(X_train):,} samples")
    print(f"  Test set: {len(X_test):,} samples")
    
    # Target statistics
    print(f"  Target mean (train): {y_train.mean():.1f} seconds ({y_train.mean()/60:.1f} minutes)")
    print(f"  Target std (train): {y_train.std():.1f} seconds")
    
    return X_train, X_test, y_train, y_test

In [11]:
# Process data - same for both model types (data is already numeric)
X_train, X_test, y_train, y_test = process_batch_data_duckdb(df)

Features: 23 total (9 numeric, 14 label-encoded)
All features are numeric - compatible with YellowBrick and sklearn tools
Splitting data: 80% train, 20% test...
  Training set: 8,000 samples
  Test set: 2,000 samples
  Target mean (train): 4554.3 seconds (75.9 minutes)
  Target std (train): 2487.3 seconds


In [12]:
X_train.head()

| | estimated_distance_km | temperature_celsius | driver_rating | hour_of_day | initial_estimated_travel_time_seconds | debug_traffic_factor | debug_weather_factor | debug_incident_delay_seconds | debug_driver_factor | trip_id | ... | destination | weather | day_of_week | vehicle_type | year | month | day | hour | minute | second |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9254 | 22.52 | 29.8 | 4.3 | 8 | 2162 | 1.41 | 1.0 | 0 | 1.01 | 773153 | ... | 313136 | 4 | 0 | 0 | 2026 | 1 | 19 | 8 | 56 | 2 |
| 1561 | 68.93 | 22.6 | 3.7 | 7 | 6426 | 1.58 | 1.0 | 0 | 1.04 | 226377 | ... | 322853 | 0 | 0 | 0 | 2026 | 1 | 19 | 7 | 49 | 58 |
| 1670 | 19.99 | 29.6 | 4.2 | 13 | 1887 | 1.37 | 1.0 | 0 | 1.01 | 650530 | ... | 340954 | 0 | 0 | 0 | 2026 | 1 | 19 | 13 | 25 | 52 |
| 6087 | 41.43 | 21.1 | 3.7 | 7 | 3609 | 1.55 | 1.0 | 0 | 1.04 | 28751 | ... | 408974 | 0 | 0 | 0 | 2026 | 1 | 19 | 7 | 31 | 14 |
| 6669 | 49.61 | 19.7 | 4.3 | 3 | 4028 | 1.3 | 1.0 | 0 | 1.01 | 376446 | ... | 15812 | 0 | 0 | 0 | 2026 | 1 | 19 | 3 | 41 | 8 |


In [13]:
X_train.dtypes

estimated_distance_km                    float64
temperature_celsius                      float64
driver_rating                            float64
hour_of_day                                int32
initial_estimated_travel_time_seconds      int32
debug_traffic_factor                     float64
debug_weather_factor                     float64
debug_incident_delay_seconds               int32
debug_driver_factor                      float64
trip_id                                    int64
driver_id                                  int64
vehicle_id                                 int64
origin                                     int64
destination                                int64
weather                                    int64
day_of_week                                int64
vehicle_type                               int64
year                                       int32
month                                      int32
day                                        int32
hour                

## Model Creation (LightGBM Primary, XGBoost Fallback)

### LightGBM (Primary)
Optimal for ETA prediction:
- Fastest training (7x faster than XGBoost)
- Lowest MAPE in regression benchmarks
- Efficient histogram-based algorithm
- Native categorical feature support

### XGBoost (Fallback)
For comparison and YellowBrick compatibility:
- Well-documented, mature library
- Strong community support

### Optimal Hyperparameters (Research-Based)

Based on 2024-2025 benchmarks for ETA prediction:

| Parameter | LightGBM | XGBoost | Rationale |
|-----------|----------|---------|------------|
| learning_rate | 0.05 | 0.05 | Balance speed/accuracy |
| n_estimators | 1000 | 1000 | With early stopping |
| max_depth | 10 | 8 | LightGBM grows leaf-wise, so `num_leaves` is the main complexity control |
| num_leaves | 100 | - | Kept below 2^max_depth (1024 for depth 10) |
| subsample | 0.8 | 0.8 | Prevent overfitting |
| colsample_bytree | 0.8 | 0.8 | Feature sampling |
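
The `num_leaves < 2^max_depth` rule from the table can be checked with a few lines of arithmetic. This is a minimal sketch; the helper names `max_leaves` and `leaves_within_depth` are hypothetical and not part of LightGBM.

```python
def max_leaves(max_depth: int) -> int:
    """Upper bound on the leaf count of a binary tree of the given depth."""
    return 2 ** max_depth


def leaves_within_depth(num_leaves: int, max_depth: int) -> bool:
    """True when num_leaves is reachable under the max_depth limit."""
    return num_leaves <= max_leaves(max_depth)


print(max_leaves(10))                # 1024
print(leaves_within_depth(100, 10))  # True
print(leaves_within_depth(100, 6))   # False: depth 6 caps a tree at 64 leaves
```

With `max_depth=10` and `num_leaves=100`, the leaf budget rather than the depth limit is what constrains each tree.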

In [14]:
from typing import Literal

def create_batch_model(
    model_type: Literal["lightgbm", "xgboost"] = "lightgbm",
    cat_feature_names: list[str] | None = None,
):
    """
    Create regressor optimized for ETA prediction.
    
    Args:
        model_type: "lightgbm" (primary) or "xgboost" (fallback)
        cat_feature_names: Names of categorical features (LightGBM only; logged here and passed to fit() later)
    
    Returns:
        Configured regressor ready for training
    
    References:
        - LightGBM docs: https://lightgbm.readthedocs.io/en/latest/Parameters.html
        - XGBoost docs: https://xgboost.readthedocs.io/en/stable/parameter.html
        - ETA research: LightGBM achieves lowest MAPE in travel time prediction
    """
    if model_type == "lightgbm":
        print(f"Creating LGBMRegressor (primary model)")
        if cat_feature_names:
            print(f"  Categorical features: {cat_feature_names}")
        
        # Optimized LightGBM parameters for ETA prediction
        model = LGBMRegressor(
            # Core parameters
            n_estimators=1000,              # Max trees; early stopping finds optimal
            learning_rate=0.05,             # Good balance for large datasets
            max_depth=10,                   # LightGBM handles deeper trees well
            num_leaves=100,                 # Should be < 2^max_depth (1024)
            
            # Regularization
            min_child_samples=20,           # Minimum samples in leaf
            reg_alpha=0.1,                  # L1 regularization
            reg_lambda=0.1,                 # L2 regularization
            min_gain_to_split=0.01,         # Minimum gain to make a split
            
            # Sampling
            subsample=0.8,                  # Row sampling
            colsample_bytree=0.8,           # Column sampling
            subsample_freq=1,               # Frequency of subsample
            
            # Objective & Metric
            objective='regression',
            metric='rmse',
            
            # Boosting
            boosting_type='gbdt',
            
            # Performance
            n_jobs=-1,                      # Use all CPU cores
            random_state=42,
            verbose=1,                       # 1 = info level logging
        )
    
    elif model_type == "xgboost":
        print(f"Creating XGBRegressor (fallback model)")
        
        # Optimized XGBoost parameters for ETA prediction
        model = XGBRegressor(
            # Core parameters
            n_estimators=1000,              # Max trees; early stopping finds optimal
            learning_rate=0.05,             # Good balance
            max_depth=8,                    # XGBoost prefers shallower
            
            # Regularization
            min_child_weight=5,             # Minimum sum of instance weight
            gamma=0.1,                      # Minimum loss reduction
            reg_alpha=0.1,                  # L1 regularization
            reg_lambda=1.0,                 # L2 regularization
            
            # Sampling
            subsample=0.8,                  # Row sampling
            colsample_bytree=0.8,           # Column sampling
            
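            # Objective & early stopping (constructor arguments in xgboost >= 1.6)
            objective='reg:squarederror',
            eval_metric='rmse',
            early_stopping_rounds=50,       # Needed for best_iteration / best_score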
            
            # Performance
            tree_method='hist',             # Faster histogram-based
            n_jobs=-1,
            random_state=42,
        )
    
    else:
        raise ValueError(f"Unknown model_type: {model_type}")
    
    return model

In [15]:
# Create model using MODEL_TYPE defined earlier
model = create_batch_model(
    model_type=MODEL_TYPE,
    cat_feature_names=ETA_CAT_FEATURE_NAMES if MODEL_TYPE == "lightgbm" else None,
)

Creating LGBMRegressor (primary model)
  Categorical features: ['trip_id', 'driver_id', 'vehicle_id', 'origin', 'destination', 'weather', 'day_of_week', 'vehicle_type', 'year', 'month', 'day', 'hour', 'minute', 'second']


## Train Model

Training with all-numeric features (label-encoded in DuckDB SQL).

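- **LightGBM**: pass the `categorical_feature` parameter to `fit()` for native categorical handling.
- **XGBoost**: consumes the integer codes as ordinary numeric (ordinal) values; native categorical support is not enabled here.

Both branches use the test split as the early-stopping `eval_set`, so the test metrics reported afterwards are somewhat optimistic; a separate validation split would avoid this.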
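
The early-stopping rule can be sketched in plain Python as follows. This illustrates the stopping criterion only and is not LightGBM's or XGBoost's actual implementation; the function name `early_stopping_iteration` and the validation scores are made up for the example.

```python
def early_stopping_iteration(val_scores, stopping_rounds):
    """Return (best_iteration, last_iteration_run), both 1-based.

    Training halts once the validation score (lower is better) has not
    improved for `stopping_rounds` consecutive iterations.
    """
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= stopping_rounds:
            return best_iter, i
    return best_iter, len(val_scores)


# Hypothetical validation RMSE per boosting round
rmse_per_round = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]

print(early_stopping_iteration(rmse_per_round, stopping_rounds=3))  # (3, 6)
print(early_stopping_iteration(rmse_per_round, stopping_rounds=5))  # (7, 7)
```

With a patience of 3 the run stops at round 6 and never sees the improvement at round 7, which is one reason a small learning rate is usually paired with a generous `stopping_rounds`.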

In [None]:
# Train model based on type
if MODEL_TYPE == "lightgbm":
    # LightGBM training with native categorical support and early stopping
    from lightgbm import early_stopping, log_evaluation
    
    print("Training LightGBM with early stopping...")
    print("Logging every iteration...")
    print()
    
    model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_names=["validation"],
        categorical_feature=ETA_CAT_FEATURE_NAMES,
        callbacks=[
            early_stopping(stopping_rounds=50, verbose=True),
            log_evaluation(period=1),  # Log EVERY iteration
            #log_evaluation(period=100)
        ],
    )
    print()
    print(f"Best iteration: {model.best_iteration_}")
    print(f"Best score (RMSE): {model.best_score_['validation']['rmse']:.4f}")
else:
    # XGBoost training with early stopping
    print("Training XGBoost with early stopping...")
    model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=1,  # Log every iteration
    )
    print(f"Best iteration: {model.best_iteration}")
    print(f"Best score: {model.best_score:.4f}")

Training LightGBM with early stopping...
Logging every iteration...

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.334921 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 3237
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 19
[LightGBM] [Info] Start training from score 4554.302250
[1]	validation's rmse: 2398.31
Training until validation scores don't improve for 50 rounds
[2]	validation's rmse: 2283.23


## Evaluate Model

In [None]:
y_pred = model.predict(X_test)

## Model Comparison

Compare LightGBM (primary) vs XGBoost (fallback) on this dataset.

Run the notebook twice, once with `MODEL_TYPE = "lightgbm"` and once with `MODEL_TYPE = "xgboost"`, then compare the resulting metrics.

In [None]:
# -----------------------------------------------------------------------------
# PRIMARY METRICS - Most important for ETA prediction
# These are the industry-standard metrics for travel time estimation
# -----------------------------------------------------------------------------
primary_metric_functions = {
    # MAE: Mean Absolute Error - average absolute difference in seconds
    # Most interpretable: "On average, predictions are off by X seconds"
    "mean_absolute_error": metrics.mean_absolute_error,
    
    # RMSE: Root Mean Squared Error - penalizes large errors more
    # Critical for ETA: large errors are worse than many small ones
    "root_mean_squared_error": metrics.root_mean_squared_error,
    
    # MAPE: Mean Absolute Percentage Error - relative error
    # Industry standard for ETA: "Predictions are off by X%"
    "mean_absolute_percentage_error": metrics.mean_absolute_percentage_error,
    
    # R2: Coefficient of Determination - explained variance ratio
    # How much variance in travel time is explained by the model
    "r2_score": metrics.r2_score,
}
primary_metric_args = {
    "mean_absolute_error": {},
    "root_mean_squared_error": {},
    "mean_absolute_percentage_error": {},
    "r2_score": {},
}
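
To make the four primary metrics concrete, here is a hand computation on three made-up trips (the values are illustrative and not taken from the dataset). It follows the standard definitions rather than calling scikit-learn.

```python
import math

y_true = [600.0, 1200.0, 3000.0]  # made-up travel times in seconds
y_hat = [660.0, 1080.0, 3300.0]   # made-up predictions

n = len(y_true)
errors = [p - t for t, p in zip(y_true, y_hat)]  # [60.0, -120.0, 300.0]

# MAE: average absolute error in seconds
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the average squared error
rmse = math.sqrt(sum(e * e for e in errors) / n)

# MAPE: average absolute error relative to the true value
mape = sum(abs(e) / t for e, t in zip(errors, y_true)) / n

# R2: 1 - residual sum of squares / total sum of squares
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.1f}s  RMSE={rmse:.1f}s  MAPE={mape:.1%}  R2={r2:.4f}")
```

Every trip here is off by exactly 10%, so MAPE treats them equally, while MAE and RMSE are dominated by the 300-second miss on the longest trip.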

In [None]:
# -----------------------------------------------------------------------------
# SECONDARY METRICS - Additional insights for regression
# -----------------------------------------------------------------------------
secondary_metric_functions = {
    # MSE: Mean Squared Error - used internally for optimization
    "mean_squared_error": metrics.mean_squared_error,
    
    # Median Absolute Error - robust to outliers
    # Useful when there are extreme travel times (accidents, etc.)
    "median_absolute_error": metrics.median_absolute_error,
    
    # Max Error - worst case prediction
    # Important for ETA: what's the worst prediction we could make?
    "max_error": metrics.max_error,
    
    # Explained Variance Score - like R2, but ignores a constant offset (bias) in predictions
    "explained_variance_score": metrics.explained_variance_score,
}
secondary_metric_args = {
    "mean_squared_error": {},
    "median_absolute_error": {},
    "max_error": {},
    "explained_variance_score": {},
}
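
The difference between explained variance and R2 shows up when predictions carry a constant bias. The sketch below uses made-up values and the textbook definitions (population variance of the residuals for explained variance), not scikit-learn itself.

```python
from statistics import pvariance

y_true = [600.0, 1200.0, 3000.0]        # made-up travel times in seconds
y_biased = [t + 100.0 for t in y_true]  # every prediction is 100 s too high

residuals = [t - p for t, p in zip(y_true, y_biased)]

# Explained variance only looks at the spread of the residuals
evs = 1 - pvariance(residuals) / pvariance(y_true)

# R2 also penalizes the constant offset through the squared residuals
mean_true = sum(y_true) / len(y_true)
ss_res = sum(r * r for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(evs)  # 1.0
print(r2)   # slightly below 1.0
```

A gap between the two scores on real predictions therefore hints at a systematic over- or underestimate.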

In [None]:
# -----------------------------------------------------------------------------
# LOGARITHMIC METRICS - For positive targets (travel time > 0)
# Useful when relative errors matter more than absolute errors
# -----------------------------------------------------------------------------
logarithmic_metric_functions = {
    # MSLE: Mean Squared Logarithmic Error
    # Penalizes underestimates more than overestimates
    # Good for ETA: better to overestimate than underestimate travel time
    "mean_squared_log_error": metrics.mean_squared_log_error,
    
    # RMSLE: Root Mean Squared Logarithmic Error
    "root_mean_squared_log_error": metrics.root_mean_squared_log_error,
}
logarithmic_metric_args = {
    "mean_squared_log_error": {},
    "root_mean_squared_log_error": {},
}
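
The asymmetry of the squared log error can be seen on a single made-up trip. This sketch uses the `log1p` form of the error; the function name `squared_log_error` is hypothetical.

```python
import math


def squared_log_error(y_true: float, y_pred: float) -> float:
    """Squared difference of log(1 + y) between prediction and truth."""
    return (math.log1p(y_pred) - math.log1p(y_true)) ** 2


# Same absolute miss of 500 seconds in both directions
under = squared_log_error(1000.0, 500.0)   # prediction too low
over = squared_log_error(1000.0, 1500.0)   # prediction too high

print(under > over)  # True
```

An underestimate of 500 seconds costs more than an overestimate of the same size, which matches the comment in the cell above.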

In [None]:
# -----------------------------------------------------------------------------
# DEVIANCE METRICS - For generalized linear models / distributional analysis
# -----------------------------------------------------------------------------
deviance_metric_functions = {
    # Mean Poisson Deviance - assumes count/positive data
    "mean_poisson_deviance": metrics.mean_poisson_deviance,
    
    # Mean Gamma Deviance - assumes positive continuous data
    # Good for travel times which are always positive
    "mean_gamma_deviance": metrics.mean_gamma_deviance,
    
    # Mean Tweedie Deviance - generalized (power parameter)
    # power=0: Normal, power=1: Poisson, power=2: Gamma
    "mean_tweedie_deviance": metrics.mean_tweedie_deviance,
}
deviance_metric_args = {
    "mean_poisson_deviance": {},
    "mean_gamma_deviance": {},
    "mean_tweedie_deviance": {"power": 1.5},  # Between Poisson and Gamma
}
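
For reference, the unit Tweedie deviance for a power outside {0, 1, 2} can be written out directly. The sketch below follows the standard closed-form expression for a single observation; the function name `tweedie_unit_deviance` and the sample values are assumptions made for illustration.

```python
def tweedie_unit_deviance(y: float, mu: float, power: float) -> float:
    """Unit Tweedie deviance for power not in {0, 1, 2}; needs y >= 0, mu > 0."""
    p = power
    return 2 * (
        y ** (2 - p) / ((1 - p) * (2 - p))
        - y * mu ** (1 - p) / (1 - p)
        + mu ** (2 - p) / (2 - p)
    )


# Made-up true value of 1000 s with three different predictions
perfect = tweedie_unit_deviance(1000.0, 1000.0, power=1.5)
under = tweedie_unit_deviance(1000.0, 800.0, power=1.5)
over = tweedie_unit_deviance(1000.0, 1200.0, power=1.5)

print(abs(perfect) < 1e-9)  # True: a perfect prediction has zero deviance
print(under > over)         # True: underestimates cost more at power=1.5
```

Averaging this quantity over all samples gives the mean Tweedie deviance for the chosen `power`.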

In [None]:
# -----------------------------------------------------------------------------
# QUANTILE METRICS - For quantile regression / prediction intervals
# -----------------------------------------------------------------------------
quantile_metric_functions = {
    # Mean Pinball Loss (quantile loss) - for prediction intervals
    # alpha=0.5 targets the median; the loss then equals MAE / 2
    "mean_pinball_loss": metrics.mean_pinball_loss,
}
quantile_metric_args = {
    "mean_pinball_loss": {"alpha": 0.5},  # Median (50th percentile)
}
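
The pinball loss and its relation to MAE can be verified by hand. This is a small sketch on made-up values using the usual definition, `alpha * max(y - y_hat, 0) + (1 - alpha) * max(y_hat - y, 0)`; the function name `pinball_loss` is hypothetical.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for quantile level alpha."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        total += alpha * max(diff, 0.0) + (1 - alpha) * max(-diff, 0.0)
    return total / len(y_true)


y_true = [600.0, 1200.0, 3000.0]  # made-up travel times in seconds
y_hat = [660.0, 1080.0, 3300.0]   # made-up predictions

mae = sum(abs(t - p) for t, p in zip(y_true, y_hat)) / len(y_true)
median_loss = pinball_loss(y_true, y_hat, alpha=0.5)
p90_loss = pinball_loss(y_true, y_hat, alpha=0.9)

print(mae, median_loss)  # 160.0 80.0
```

At `alpha=0.5` the loss is exactly half the MAE; at `alpha=0.9` underestimates are weighted nine times as heavily as overestimates.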

In [None]:
# -----------------------------------------------------------------------------
# D2 SCORE METRICS - Fraction of deviance explained (like R2 for deviance)
# These are "coefficient of determination" analogs for different loss functions
# -----------------------------------------------------------------------------
d2_metric_functions = {
    # D2 Absolute Error Score - fraction of absolute error explained
    # Similar to R2 but based on MAE instead of MSE
    "d2_absolute_error_score": metrics.d2_absolute_error_score,
    
    # D2 Pinball Score - fraction of pinball loss explained
    "d2_pinball_score": metrics.d2_pinball_score,
    
    # D2 Tweedie Score - fraction of Tweedie deviance explained
    "d2_tweedie_score": metrics.d2_tweedie_score,
}
d2_metric_args = {
    "d2_absolute_error_score": {},
    "d2_pinball_score": {"alpha": 0.5},
    "d2_tweedie_score": {"power": 1.5},
}
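
A D2 score compares the model's loss with that of a constant baseline. For the absolute-error variant the baseline is the median of the true values, as in this sketch on made-up numbers (the helper `mae` is defined only for the example).

```python
from statistics import median

y_true = [600.0, 1200.0, 3000.0]  # made-up travel times in seconds
y_hat = [660.0, 1080.0, 3300.0]   # made-up predictions


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Baseline: always predict the median of the true values (1200.0 here)
baseline = [median(y_true)] * len(y_true)

d2 = 1 - mae(y_true, y_hat) / mae(y_true, baseline)
print(d2)
```

Here the model removes 80% of the baseline's absolute error; a score of 0 would mean it does no better than always predicting the median.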

## Comprehensive Metrics Evaluation

Sklearn regression metrics organized by category:

| Category | Metrics | Use Case |
|----------|---------|----------|
| **Primary** | MAE, RMSE, MAPE, R2 | Core ETA evaluation |
| **Secondary** | MSE, MedianAE, MaxError, EVS | Additional insights |
| **Logarithmic** | MSLE, RMSLE | Relative errors, positive targets |
| **Deviance** | Poisson, Gamma, Tweedie | Distributional analysis |
| **Quantile** | Pinball Loss | Prediction intervals |
| **D2 Scores** | D2_AE, D2_Pinball, D2_Tweedie | Fraction explained |

In [None]:
# =============================================================================
# COMPUTE ALL METRICS
# =============================================================================
metrics_to_log = {}

# -----------------------------------------------------------------------------
# PRIMARY METRICS
# -----------------------------------------------------------------------------
for name, func in primary_metric_functions.items():
    try:
        metrics_to_log[name] = func(y_test, y_pred, **primary_metric_args[name])
    except Exception as e:
        print(f"Error computing {name}: {e}")

# -----------------------------------------------------------------------------
# SECONDARY METRICS
# -----------------------------------------------------------------------------
for name, func in secondary_metric_functions.items():
    try:
        metrics_to_log[name] = func(y_test, y_pred, **secondary_metric_args[name])
    except Exception as e:
        print(f"Error computing {name}: {e}")

# -----------------------------------------------------------------------------
# LOGARITHMIC METRICS (require positive values)
# -----------------------------------------------------------------------------
# Clip predictions to positive values for log metrics
y_pred_positive = np.maximum(y_pred, 1e-10)
y_test_positive = np.maximum(y_test.values, 1e-10)

for name, func in logarithmic_metric_functions.items():
    try:
        metrics_to_log[name] = func(y_test_positive, y_pred_positive, **logarithmic_metric_args[name])
    except Exception as e:
        print(f"Error computing {name}: {e}")

# -----------------------------------------------------------------------------
# DEVIANCE METRICS (require strictly positive values)
# -----------------------------------------------------------------------------
for name, func in deviance_metric_functions.items():
    try:
        metrics_to_log[name] = func(y_test_positive, y_pred_positive, **deviance_metric_args[name])
    except Exception as e:
        print(f"Error computing {name}: {e}")

# -----------------------------------------------------------------------------
# QUANTILE METRICS
# -----------------------------------------------------------------------------
for name, func in quantile_metric_functions.items():
    try:
        metrics_to_log[name] = func(y_test, y_pred, **quantile_metric_args[name])
    except Exception as e:
        print(f"Error computing {name}: {e}")

# -----------------------------------------------------------------------------
# D2 SCORE METRICS
# -----------------------------------------------------------------------------
for name, func in d2_metric_functions.items():
    try:
        if "tweedie" in name or "pinball" in name:
            metrics_to_log[name] = func(y_test_positive, y_pred_positive, **d2_metric_args[name])
        else:
            metrics_to_log[name] = func(y_test, y_pred, **d2_metric_args[name])
    except Exception as e:
        print(f"Error computing {name}: {e}")

metrics_to_log

In [None]:
# Display metrics in a readable format
print("=" * 60)
print(f"ETA PREDICTION METRICS ({MODEL_TYPE.upper()})")
print("=" * 60)

print("\n--- PRIMARY METRICS (Industry Standard for ETA) ---")
print(f"MAE:  {metrics_to_log['mean_absolute_error']:.2f} seconds ({metrics_to_log['mean_absolute_error']/60:.2f} minutes)")
print(f"RMSE: {metrics_to_log['root_mean_squared_error']:.2f} seconds ({metrics_to_log['root_mean_squared_error']/60:.2f} minutes)")
print(f"MAPE: {metrics_to_log['mean_absolute_percentage_error']*100:.2f}%")
print(f"R2:   {metrics_to_log['r2_score']:.4f}")

print("\n--- SECONDARY METRICS ---")
print(f"MSE:               {metrics_to_log['mean_squared_error']:.2f}")
print(f"Median AE:         {metrics_to_log['median_absolute_error']:.2f} seconds")
print(f"Max Error:         {metrics_to_log['max_error']:.2f} seconds ({metrics_to_log['max_error']/60:.2f} minutes)")
print(f"Explained Var:     {metrics_to_log['explained_variance_score']:.4f}")

print("\n--- LOGARITHMIC METRICS ---")
print(f"MSLE:  {metrics_to_log['mean_squared_log_error']:.6f}")
print(f"RMSLE: {metrics_to_log['root_mean_squared_log_error']:.6f}")

print("\n--- DEVIANCE METRICS ---")
print(f"Poisson Deviance:  {metrics_to_log['mean_poisson_deviance']:.4f}")
print(f"Gamma Deviance:    {metrics_to_log['mean_gamma_deviance']:.6f}")
print(f"Tweedie Deviance:  {metrics_to_log['mean_tweedie_deviance']:.4f}")

print("\n--- QUANTILE & D2 METRICS ---")
print(f"Pinball Loss (50%): {metrics_to_log['mean_pinball_loss']:.2f}")
print(f"D2 Absolute Error:  {metrics_to_log['d2_absolute_error_score']:.4f}")
print(f"D2 Pinball:         {metrics_to_log['d2_pinball_score']:.4f}")
print(f"D2 Tweedie:         {metrics_to_log['d2_tweedie_score']:.4f}")

In [None]:
# Print all model parameters
print("\nModel parameters:")
all_params = model.get_params()  # Same call for both LightGBM and XGBoost
    
for param, value in all_params.items():
    print(f"  {param}: {value}")

## YellowBrick

YellowBrick visualizations for regression will be added in the next phase:

| Category | Visualizers |
|----------|-------------|
| **Regression** | ResidualsPlot, PredictionError |
| **Feature Analysis** | Rank1D, Rank2D, PCA, Manifold, ParallelCoordinates |
| **Target** | FeatureCorrelation |
| **Model Selection** | FeatureImportances, LearningCurve, ValidationCurve |