Model Development with Various Transformations of Target Variable (Shortened Version)

This script trains and evaluates XGBoost models on the raw target variable (consumption_peak_next_day) and on its log, square-root, and cube-root transformations, without applying StandardScaler to the input variables.

Note: This is a shortened version that omits feature importances and detailed per-fold performance information.
For the full code, see the file
'7A) Model Development (Various Transformations of Target Variable, Without StandardScaler for Input Variables)' in the full-code folder.

Data source: 
- Preprocessed data: '6) daily_consumption_data_full.csv'

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit
from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
from sklearn.metrics import make_scorer, r2_score

df = pd.read_csv('6) daily_consumption_data_full.csv', parse_dates=['date'])

In [2]:
# Categorisation of input variables
weather_features = [
    'temp_max_today', 'humidity_min_today', 'humidity_max_today', 'windspeed_max_today', 'precip_sum_today', 'solarradiation_sum_today',
    'humidity_at_peak_consumption_today', 'windspeed_at_peak_consumption_today', 'precip_at_peak_consumption_today', 
    'solarradiation_at_peak_consumption_today', 'temp_range_today', 'windspeed_min_today'
]

temporal_features = [
    'is_winter', 'is_summer', 'is_autumn', 'is_weekend', 'is_holiday', 'day_of_week_sin', 
    'day_of_week_cos', 'week_of_month_sin', 'week_of_month_cos'
]

consumption_features = [
    'consumption_sum_today', 'consumption_peak_today', 'consumption_min_today', 'prev_day_peak', 'same_day_last_week_peak', 
    'avg_peak_3d', 'avg_peak_7d', 'max_peak_7d', 'max_peak_3d'
]

household_features = [
    'household_size', 'male_occupants', 'female_occupants', 'count_children', 'ownership_owned', 
    'ownership_rented', 'ownership_other', 'work_from_home', 'housing_house', 'housing_apartment', 
    'count_rooms', 'electric_central_heating', 'heating_manual_boiler', 'heating_thermostatic_valves', 
    'heating_auto_set_times', 'heating_auto_temp_control', 'heating_not_sure', 'uses_electric_heater'
]

pricing_features = [
    'prop_low_price', 'prop_high_price', 'tariff_at_peak_consumption_today'
]

attitudinal_and_behavioural_features = [
    'interest_in_renewable_energy', 'interest_in_microgeneration', 'climate_change_concern', 
    'lifestyle_environment', 'smart_meter_bill_understanding', 'smart_meter_consumption_understanding'
]

appliance_features = [
    'washing_machine_fixed_schedule', 'tumble_dryer_fixed_schedule', 'dishwasher_fixed_schedule', 
    'immersion_water_heater_fixed_schedule', 'electric_oven_fixed_schedule', 'electric_hob_fixed_schedule', 
    'ironing_fixed_schedule', 'electric_shower_fixed_schedule', 'kettle_fixed_schedule', 'lighting_fixed_schedule', 
    'electric_heater_fixed_schedule', 'washer-dryer_combined_timer_use', 'washing_machine_timer_use', 
    'tumble_dryer_timer_use', 'dishwasher_timer_use', 'electric_space_heating_timer_use', 
    'washer-dryer_combined_ownership', 'washing_machine_ownership', 'tumble_dryer_ownership', 'dishwasher_ownership', 
    'electric_space_heating_ownership', 'count_low_efficiency_bulbs', 'total_refrigeration_units', 
    'count_cooking_appliances', 'count_laundry_appliances', 'count_kitchen_appliances', 
    'count_heating_water_appliances', 'count_entertainment_devices', 'count_computing_devices', 'count_tv', 
    'tv_energy_score'
]

# Define feature sets for models
model_1_features = temporal_features + consumption_features
model_2_features = model_1_features + weather_features
model_3_features = model_2_features + household_features + pricing_features + attitudinal_and_behavioural_features + appliance_features

# Apply different transformations as the target variable is skewed

In [3]:
transformations = {
    'log_consumption_peak_next_day': np.log1p,
    'sqrt_consumption_peak_next_day': np.sqrt,
    'cbrt_consumption_peak_next_day': np.cbrt
}

for new_col, func in transformations.items():
    df[new_col] = func(df['consumption_peak_next_day'])
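
Each transformation above has an exact inverse, which is what later allows predictions to be scored on the original scale. The following is a minimal sketch of that round trip; the `peaks` values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical peak values on the original scale (not from the dataset)
peaks = np.array([0.0, 0.5, 1.0, 4.0, 8.0])

# Forward transformations, mirroring the `transformations` dictionary
log_t = np.log1p(peaks)    # log(1 + x), defined at x = 0
sqrt_t = np.sqrt(peaks)
cbrt_t = np.cbrt(peaks)

# Inverse transformations that return to the original scale
log_back = np.expm1(log_t)
sqrt_back = sqrt_t ** 2
cbrt_back = cbrt_t ** 3

for name, back in [('log1p', log_back), ('sqrt', sqrt_back), ('cbrt', cbrt_back)]:
    print(name, 'round trip recovers input:', np.allclose(back, peaks))
```

Note that squaring is only a true inverse of the square root for non-negative values, so a negative prediction on the square-root scale would be mapped to a positive value.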

# Train models using each transformation of the target variable, consumption_peak_next_day

In [4]:
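# Define the hyperparameter search distributions for RandomizedSearchCV
# (scipy's uniform(loc, scale) samples from [loc, loc + scale]; randint(low, high) excludes high)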
param_grid = {
    'n_estimators': randint(100, 3000),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.29),
    'subsample': uniform(0.5, 0.5),
    'colsample_bytree': uniform(0.5, 0.5),
    'min_child_weight': randint(1, 10),
    'gamma': uniform(0, 0.5),
    'reg_alpha': uniform(0, 1),
    'reg_lambda': uniform(0, 1)
}

# Define date ranges for the folds as (train_start, train_end, val_start, val_end)
# Only 2013 data is available, so the training window expands across three folds and December is held out as the test set
folds = [
    ('2013-01-08', '2013-08-31', '2013-09-01', '2013-09-30'),
    ('2013-01-08', '2013-09-30', '2013-10-01', '2013-10-31'),
    ('2013-01-08', '2013-10-31', '2013-11-01', '2013-11-30')
]
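
The folds above form an expanding window: every fold starts training on 2013-01-08 and validates on the month that follows its training period. The sketch below checks this on a hypothetical frame with one row per calendar day (the real data may contain many rows per date); `toy` and `date_mask` are illustrative names, with `date_mask` using the same inclusive date comparison as `create_dataset`.

```python
import pandas as pd

# Hypothetical frame with one row per day of 2013
toy = pd.DataFrame({'date': pd.date_range('2013-01-01', '2013-12-31', freq='D')})

folds = [
    ('2013-01-08', '2013-08-31', '2013-09-01', '2013-09-30'),
    ('2013-01-08', '2013-09-30', '2013-10-01', '2013-10-31'),
    ('2013-01-08', '2013-10-31', '2013-11-01', '2013-11-30'),
]

def date_mask(frame, start, end):
    """Boolean mask for rows whose date lies in [start, end], both ends inclusive."""
    return (frame['date'] >= pd.to_datetime(start)) & (frame['date'] <= pd.to_datetime(end))

fold_sizes = []
for train_start, train_end, val_start, val_end in folds:
    train = toy[date_mask(toy, train_start, train_end)]
    val = toy[date_mask(toy, val_start, val_end)]
    # Training must end before validation begins, so no later rows leak into training
    assert train['date'].max() < val['date'].min()
    fold_sizes.append((len(train), len(val)))

print(fold_sizes)
```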

In [5]:
def run_model(features, df, folds, param_grid, model_name):
    def create_dataset(df, start_date, end_date):
        mask = (df['date'] >= pd.to_datetime(start_date)) & (df['date'] <= pd.to_datetime(end_date))
        X = df.loc[mask, features]
        y = df.loc[mask, ['consumption_peak_next_day', 'log_consumption_peak_next_day', 
                          'sqrt_consumption_peak_next_day', 'cbrt_consumption_peak_next_day']]
        return X, y

    def calculate_metrics(y_true, y_pred, transformation='consumption_peak_next_day'):
        # Back-transform to the original scale so metrics are comparable across targets
        if transformation == 'log_consumption_peak_next_day':
            y_true, y_pred = np.expm1(y_true), np.expm1(y_pred)
        elif transformation == 'sqrt_consumption_peak_next_day':
            y_true, y_pred = y_true ** 2, y_pred ** 2
        elif transformation == 'cbrt_consumption_peak_next_day':
            y_true, y_pred = y_true ** 3, y_pred ** 3
        
        mae = np.mean(np.abs(y_true - y_pred))
        
        non_zero = (y_true != 0)
        mape = np.mean(np.abs((y_true[non_zero] - y_pred[non_zero]) / y_true[non_zero])) * 100
        
        wape = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100
        
        r2 = r2_score(y_true, y_pred)
        
        return mae, mape, wape, r2
    
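    def wape_score(y_true, y_pred):
        # Used only inside the hyperparameter search, where it is computed on the scale of the
        # target being fitted (no back-transformation); returns a fraction, not a percentage
        return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))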

    wape_scorer = make_scorer(wape_score, greater_is_better=False)

    results = {target: {'mae_scores_train': [], 'mape_scores_train': [], 'wape_scores_train': [], 'r2_scores_train': [],
                        'mae_scores_val': [], 'mape_scores_val': [], 'wape_scores_val': [], 'r2_scores_val': [],
                        'best_params': []} 
               for target in ['consumption_peak_next_day', 'log_consumption_peak_next_day', 
                              'sqrt_consumption_peak_next_day', 'cbrt_consumption_peak_next_day']}

    for fold, (train_start, train_end, val_start, val_end) in enumerate(folds, 1):
        X_train, y_train = create_dataset(df, train_start, train_end)
        X_val, y_val = create_dataset(df, val_start, val_end)
        
        tscv = TimeSeriesSplit(n_splits=3)
        
        for target in ['consumption_peak_next_day', 'log_consumption_peak_next_day', 
                       'sqrt_consumption_peak_next_day', 'cbrt_consumption_peak_next_day']:
            
            model = XGBRegressor(objective='reg:squarederror', n_jobs=-1, enable_categorical=True)
            random_search = RandomizedSearchCV(model, param_distributions=param_grid, 
                                               n_iter=100, cv=tscv.split(X_train), 
                                               scoring=wape_scorer, 
                                               n_jobs=-1, random_state=1)
            random_search.fit(X_train, y_train[target])
            
            best_model = random_search.best_estimator_
            best_params = random_search.best_params_
            results[target]['best_params'].append(best_params)
            
            train_predictions = best_model.predict(X_train)
            train_mae, train_mape, train_wape, train_r2 = calculate_metrics(y_train[target], train_predictions, target)
            results[target]['mae_scores_train'].append(train_mae)
            results[target]['mape_scores_train'].append(train_mape)
            results[target]['wape_scores_train'].append(train_wape)
            results[target]['r2_scores_train'].append(train_r2)
            
            val_predictions = best_model.predict(X_val)
            val_mae, val_mape, val_wape, val_r2 = calculate_metrics(y_val[target], val_predictions, target)
            results[target]['mae_scores_val'].append(val_mae)
            results[target]['mape_scores_val'].append(val_mape)
            results[target]['wape_scores_val'].append(val_wape)
            results[target]['r2_scores_val'].append(val_r2)

    print(f"\n--- Results for {model_name} ---")
    for target in ['consumption_peak_next_day', 'log_consumption_peak_next_day', 
                   'sqrt_consumption_peak_next_day', 'cbrt_consumption_peak_next_day']:
        print(f"\nResults for {target}:")
        print(f"Average Training MAE: {np.mean(results[target]['mae_scores_train']):.4f}")
        print(f"Average Training MAPE: {np.mean(results[target]['mape_scores_train']):.4f}%")
        print(f"Average Training WAPE: {np.mean(results[target]['wape_scores_train']):.4f}%")
        print(f"Average Training R²: {np.mean(results[target]['r2_scores_train']):.4f}")
        print(f"Average Validation MAE: {np.mean(results[target]['mae_scores_val']):.4f}")
        print(f"Average Validation MAPE: {np.mean(results[target]['mape_scores_val']):.4f}%")
        print(f"Average Validation WAPE: {np.mean(results[target]['wape_scores_val']):.4f}%")
        print(f"Average Validation R²: {np.mean(results[target]['r2_scores_val']):.4f}")

    test_start = '2013-12-01'
    test_end = '2013-12-30'

    X_train_final, y_train_final = create_dataset(df, df['date'].min(), pd.to_datetime(test_start) - pd.Timedelta(days=1))
    X_test, y_test = create_dataset(df, test_start, test_end)

    final_models = {}
    test_metrics = {}

    print("\n--- Test Metrics ---")
    for target in ['consumption_peak_next_day', 'log_consumption_peak_next_day', 
                   'sqrt_consumption_peak_next_day', 'cbrt_consumption_peak_next_day']:
        # Reuse the hyperparameters from the fold with the lowest validation WAPE
        best_params = results[target]['best_params'][np.argmin(results[target]['wape_scores_val'])]
        final_model = XGBRegressor(**best_params, enable_categorical=True)
        final_model.fit(X_train_final, y_train_final[target])

        test_predictions = final_model.predict(X_test)
        test_mae, test_mape, test_wape, test_r2 = calculate_metrics(y_test[target], test_predictions, target)
        print(f"\nTest metrics for {target}:")
        print(f"MAE: {test_mae:.4f}")
        print(f"MAPE: {test_mape:.4f}%")
        print(f"WAPE: {test_wape:.4f}%")
        print(f"R²: {test_r2:.4f}")

        final_models[target] = final_model
        test_metrics[target] = {'mae': test_mae, 'mape': test_mape, 'wape': test_wape, 'r2': test_r2}

    return final_models, test_metrics
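
In `run_model`, `wape_scorer` receives the target exactly as it was fitted, whereas `calculate_metrics` back-transforms both arrays before scoring. The two WAPE values therefore generally differ for a transformed target. The sketch below illustrates this with hypothetical numbers for the cube-root case; the standalone `wape` and `mape` helpers are re-implementations for illustration, not the functions defined above.

```python
import numpy as np

def wape(y_true, y_pred):
    """Weighted absolute percentage error, in percent."""
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)) * 100

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, skipping zero targets."""
    non_zero = y_true != 0
    return np.mean(np.abs((y_true[non_zero] - y_pred[non_zero]) / y_true[non_zero])) * 100

# Hypothetical targets and predictions on the cube-root scale
y_true_cbrt = np.array([1.0, 2.0, 3.0])
y_pred_cbrt = np.array([2.0, 2.0, 3.0])

# WAPE as the search sees it: computed directly on the transformed scale
wape_transformed = wape(y_true_cbrt, y_pred_cbrt)          # 1 / 6, about 16.67%

# WAPE and MAPE as reported: cube both arrays first to return to the original scale
wape_original = wape(y_true_cbrt ** 3, y_pred_cbrt ** 3)   # 7 / 36, about 19.44%
mape_original = mape(y_true_cbrt ** 3, y_pred_cbrt ** 3)   # mean of (7, 0, 0), about 233.33%

print(f"WAPE on cube-root scale: {wape_transformed:.2f}%")
print(f"WAPE on original scale:  {wape_original:.2f}%")
print(f"MAPE on original scale:  {mape_original:.2f}%")
```

The large MAPE comes from a single small target value, which is one reason WAPE is often easier to interpret when some true values are close to zero.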

# Run and compare models

In [6]:
# Dictionary to store results
all_results = {}

# Run models 1 to 3
for i, features in enumerate([model_1_features, model_2_features, model_3_features], 1):
    model_name = f"Model {i}"
    final_models, test_metrics = run_model(features, df, folds, param_grid, model_name)
    all_results[model_name] = {
        'features': features,
        'test_metrics': test_metrics,
    }

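# Train model 4, which only uses the top 15 features from model 3
# (after the loop above, final_models holds the models fitted on model_3_features;
# importances are taken from the model trained on the untransformed target)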
model_4_features = [feature for feature, importance in 
                    sorted(zip(model_3_features, final_models['consumption_peak_next_day'].feature_importances_), 
                           key=lambda x: x[1], reverse=True)[:15]]
model_4, metrics_4 = run_model(model_4_features, df, folds, param_grid, "Model 4 (Only Top 15 Features from Model 3)")

all_results["Model 4"] = {
    'features': model_4_features,
    'test_metrics': metrics_4,
}

# Compare the models
print("\n--- Model Comparison ---")
for model_name, results in all_results.items():
    print(f"\n{model_name}:")
    print(f"Number of features: {len(results['features'])}")
    for target, metrics in results['test_metrics'].items():
        print(f"\nTest metrics for {target}:")
        print(f"  MAE: {metrics['mae']:.4f}")
        print(f"  MAPE: {metrics['mape']:.4f}%")
        print(f"  WAPE: {metrics['wape']:.4f}%")
        print(f"  R²: {metrics['r2']:.4f}")


--- Results for Model 1 ---

Results for consumption_peak_next_day:
Average Training MAE: 0.3335
Average Training MAPE: 56.9102%
Average Training WAPE: 29.9714%
Average Training R²: 0.6999
Average Validation MAE: 0.3579
Average Validation MAPE: 62.1932%
Average Validation WAPE: 31.9314%
Average Validation R²: 0.5990

Results for log_consumption_peak_next_day:
Average Training MAE: 0.3383
Average Training MAPE: 51.1096%
Average Training WAPE: 30.4043%
Average Training R²: 0.6679
Average Validation MAE: 0.3535
Average Validation MAPE: 53.8475%
Average Validation WAPE: 31.5326%
Average Validation R²: 0.5924

Results for sqrt_consumption_peak_next_day:
Average Training MAE: 0.3336
Average Training MAPE: 45.6123%
Average Training WAPE: 29.9862%
Average Training R²: 0.6772
Average Validation MAE: 0.3522
Average Validation MAPE: 48.4540%
Average Validation WAPE: 31.4192%
Average Validation R²: 0.5928

Results for cbrt_consumption_peak_next_day:
Average Training MAE: 0.3275
Average Training M

# Get the best hyperparameters for the cube root transformation of Model 4 (the optimal model)

In [7]:
best_params_cbrt_model4 = model_4['cbrt_consumption_peak_next_day'].get_params()
# Filter only the tuned hyperparameters
tuned_params = ['n_estimators', 'max_depth', 'learning_rate', 'subsample', 'colsample_bytree', 'min_child_weight', 'gamma', 'reg_alpha', 'reg_lambda']

print("\nBest tuned hyperparameters for cube root transformation of Model 4:")
for param in tuned_params:
    print(f"{param}: {best_params_cbrt_model4[param]}")


Best tuned hyperparameters for cube root transformation of Model 4:
n_estimators: 1270
max_depth: 8
learning_rate: 0.018057171166114286
subsample: 0.7390362533589154
colsample_bytree: 0.7292919068542159
min_child_weight: 5
gamma: 0.05667096138379524
reg_alpha: 0.45240482674645155
reg_lambda: 0.4500867470007611
