In this notebook, I use UCI's [Electrical Grid Stability data](https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+) to predict whether a given combination of power system conditions would result in an unstable grid - and therefore risk causing blackouts or damaging equipment. After brief data exploration and processing, I ran baseline models, confirmed my feature selection, then optimised the models. The best classifier scored 98.3% accuracy on held-out data, similar to a deep learning model [in this publication](https://link.springer.com/article/10.1007/s42979-021-00463-5/tables/3), while the best regressor gave an R<sup>2</sup> score of 95.9%. Finally, I pickled the best models and created [a simple app](https://share.streamlit.io/sowla/grid_stability_app/main/grid_stability.py), where users can adjust model inputs and see how the resulting predictions are affected.

<a id="overview"></a>
### Overview

* [Introduction](#introduction)
* [Quick EDA](#quick-eda)
* [Build baseline models](#build-baseline-models) (fit baseline models, check feature selection and overfitting)
* [Optimise models](#optimise-models) (with randomised and grid searches, results summarised [here](#hyperparameter-tuning-summary))
* [Test final model](#test-final-model)
* [Interesting links](#interesting-links) (related to power grids)

<a id="introduction"></a>
### Introduction

The share of renewable electricity production in Germany has grown from [9% in 2002 to 51% in 2020](https://energy-charts.info/charts/renewable_share/chart.htm?l=en&c=DE&interval=year), which is [important progress towards meeting climate targets](https://2022.entsos-tyndp-scenarios.eu/wp-content/uploads/2021/04/entsog_entso-e_TYNDP2022_Joint_Scenarios_Final_Storyline_Report_210421.pdf). However, the intermittency of weather-dependent renewable sources makes it [harder and more expensive](https://www.drax.com/wp-content/uploads/2020/08/200828_Drax20_Q2_Report_005.pdf) to maintain grid stability (a balance of electricity production and consumption). The [Decentral Smart Grid Control](https://iopscience.iop.org/article/10.1088/1367-2630/17/1/015002#njp505903s5) (DSGC) concept was proposed as a way to adjust price based on supply and demand in a decentralised way - giving consumers an incentive to adjust their usage and help stabilise the grid, without needing to centrally collect their usage data.

The data set I'm using [was originally simulated](https://arxiv.org/pdf/1508.02217v1.pdf) to explore if grid stability can be maintained under DSGC, assuming a 4-node architecture: one producer providing electricity to three consumers. There are 10,000 instances and 12 attributes:  
- `p[x]` (`p1` to `p4`): power produced or consumed; `p1 = abs(p2 + p3 + p4)`
- `g[x]` (`g1` to `g4`): price elasticity - willingness of each node to adapt their consumption or production per second (gamma)  
- `tau[x]` (`tau1` to `tau4`): how long it takes for each node to adapt their production or consumption in seconds

where `p1`, `g1` and `tau1` are related to the electricity producer; the rest are related to the electricity consumers.

There are also two target variables:  
- `stab`: a number representing grid stability (positive if unstable)
- `stabf`: a categorical version of `stab`

and I've worked on both the classification and regression problems.

In [None]:
# data exploration / preprocessing
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (StandardScaler, RobustScaler,
                                   FunctionTransformer, QuantileTransformer)

# model fitting / evaluation / export
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge
from sklearn.ensemble import (RandomForestClassifier, RandomForestRegressor, 
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR, LinearSVC, LinearSVR
from lightgbm import LGBMClassifier, LGBMRegressor, plot_metric, plot_importance
from sklearn.inspection import permutation_importance
from sklearn.model_selection import (train_test_split, StratifiedKFold, KFold,
                                     cross_validate,
                                     RandomizedSearchCV, GridSearchCV)
from sklearn.metrics import (plot_confusion_matrix, classification_report,
                             plot_roc_curve)
import pickle

from IPython.display import display_html

In [None]:
smart_grid_orig = pd.read_csv("../input/ucis-electrical-grid-stability-simulated-data/Data_for_UCI_named.csv")

smart_grid_orig.head(3)

Back to [Overview](#overview)

<a id="quick-eda"></a>
### Quick EDA

Since the data was simulated and very clean, I only briefly explored and processed it, e.g. ran basic quality checks, looked at class balance, encoded labels and renamed columns for clarity:

In [None]:
assert smart_grid_orig.isna().sum().sum() == 0, "some data missing"

stab_fine = max(smart_grid_orig.query("stabf == 'stable'")["stab"]) < 0
stabf_fine = min(smart_grid_orig.query("stabf == 'unstable'")["stab"]) > 0
assert (stab_fine & stabf_fine), "unexpected stab/stabf relationship"

print("As expected, no missing data and `stab` values of less than 0 are considered stable.")

In [None]:
fig, axs = plt.subplots(1, 4, figsize=(15, 3))

for axs_ind, feature_group in enumerate(["tau", "p", "g"]):
    smart_grid_orig.boxplot(
        column=[feature_group + str(i + 1) for i in range(4)], 
        ax= axs[axs_ind]
    )
smart_grid_orig.boxplot(column="stab", ax= axs[3])

for axs_ind, title in enumerate(["reaction time", "power production/consumption",
                                 "price elasticity", "grid stability"]):
    axs[axs_ind].set(title=title);

In [None]:
print(smart_grid_orig["stabf"].value_counts(normalize=True))  # pretty balanced

In [None]:
smart_grid = smart_grid_orig.assign(stabf = lambda x: x.stabf.replace({"unstable": 0, "stable": 1}))

smart_grid.columns = (smart_grid.columns
                      .str.replace("tau", "delay")
                      .str.replace("p", "power")
                      .str.replace("g", "adapt"))

In [None]:
g = sns.PairGrid(smart_grid, diag_sharey=False,
                 corner=True, height=0.6, aspect=1)
g.map_lower(sns.scatterplot, s=1)
g.map_diag(sns.histplot);

In [None]:
plt.figure(figsize = (10, 5))
sns.heatmap(smart_grid.corr(), fmt=".2f", annot=True);  # default pearson

The correlations between stability (`stab` or `stabf`) and the `delay[x]` or `adapt[x]` columns were weak, whereas there was no obvious relationship between the `power[x]` columns and stability.

As expected, `power1` (power generated) was correlated with the other `power[x]` columns (power consumed), but there was no obvious correlation within `delay[x]` or `adapt[x]` columns.

In [None]:
# https://www.researchgate.net/post/Multicollinearity_issues_is_a_value_less_than_10_acceptable_for_VIF
feat_cols = smart_grid.drop(["stab", "stabf"], axis=1).columns

pd.DataFrame({
    "variables": smart_grid[feat_cols].columns,
    "VIF": [variance_inflation_factor(smart_grid[feat_cols].values, ind)
            for ind in range(len(feat_cols))]
}).transpose()

Both Pearson correlation and variance inflation factor suggested the `power[x]` columns may be collinear, so I decided to remove the producer `power1` column (this might explain why it's not an explanatory variable according to the UCI documentation).
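
To make the VIF reading more concrete, here is a minimal sketch on synthetic data (not the UCI data). The VIF of a feature is 1 / (1 - R<sup>2</sup>), where R<sup>2</sup> comes from regressing that feature on all the others. The helper `vif_by_hand` and the toy columns are made up for illustration, and this version fits an intercept, so its numbers may differ from the `statsmodels` output above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif_by_hand(X, col):
    """VIF of column `col`: 1 / (1 - R^2) from regressing it on the other columns."""
    others = np.delete(X, col, axis=1)
    target = X[:, col]
    r_squared = LinearRegression().fit(others, target).score(others, target)
    return 1.0 / (1.0 - r_squared)


rng = np.random.default_rng(0)
# toy "consumers" draw power (negative), toy "producer" roughly balances them
consumers = rng.uniform(-2.0, -0.5, size=(1000, 3))
producer = -consumers.sum(axis=1) + rng.normal(scale=0.05, size=1000)
unrelated = rng.normal(size=1000)  # a feature with no link to the others

X_toy = np.column_stack([producer, consumers, unrelated])
vifs = [vif_by_hand(X_toy, col) for col in range(X_toy.shape[1])]
print(np.round(vifs, 1))  # order: producer, three consumers, unrelated
```

In this toy setup the producer and consumer columns get large VIFs because each is almost a linear combination of the others, while the unrelated column stays close to 1.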

In [None]:
fig, axs = plt.subplots(3, 4, figsize=(10, 5))
plt.subplots_adjust(wspace=0.3, hspace=0.5)

for row_ind, feat_type in enumerate(["delay", "power", "adapt"]):
    for col_ind in range(4):
        show_legend = (row_ind == 0) and (col_ind == 3)
        sns.histplot(
            smart_grid, x=feat_type + str(col_ind + 1), hue="stabf",
            multiple="stack", legend=show_legend,
            ax=axs[row_ind, col_ind]
        )
        
        if col_ind > 0:
            axs[row_ind, col_ind].set_ylabel("")

Share of unstable events:
- increased with reaction delay (of both producers and consumers) until roughly 5 seconds, after which the share of unstable events was relatively unaffected by further increases in delay times
- increased linearly with price elasticity (of both producers and consumers)
- still seemed uncorrelated with amount of power produced/consumed

In [None]:
(smart_grid
 .assign(
     sum_delay = lambda x: x["delay1"] + x["delay2"] + x["delay3"] + x["delay4"],
     sum_adapt = lambda x: x["adapt1"] + x["adapt2"] + x["adapt3"] + x["adapt4"]
 )
 .pipe((sns.scatterplot, "data"), 
       x="sum_delay", y="sum_adapt", hue="stabf", alpha=0.2)
);

Together, very high or very low sums of `delay[x]` and `adapt[x]` values should be indicators of (in)stability. Without further processing, these summarised values would highly correlate with existing features, so I'd rather keep the individual features.

In contrast, I think it might be worth removing the rest of the `power[x]` columns if they're unhelpful.

Back to [Overview](#overview)

<a id="build-baseline-models"></a>
### Build baseline models

*Fit baseline models (keeping feature scaling in pipeline to avoid data leakage but allow for easy adjustments when optimising), use feature importance and coefficients to confirm feature selection, check for overfitting*
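
As a minimal sketch of why the scaler sits inside the pipeline (using synthetic data from `make_classification`, not the grid data): during cross-validation, each fold's `StandardScaler` is fitted on that fold's training rows only, so validation rows never influence the scaling statistics. The names `X_demo`, `y_demo` and `demo_cv` are placeholders for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("estimator", LogisticRegression())])
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
res = cross_validate(pipe, X_demo, y_demo, cv=demo_cv, return_estimator=True)

# the fixed random_state makes split() reproduce the folds used above
scaler_uses_train_only = []
for (train_idx, _), fitted_pipe in zip(demo_cv.split(X_demo, y_demo),
                                       res["estimator"]):
    fold_mean = fitted_pipe.named_steps["scaler"].mean_
    scaler_uses_train_only.append(
        np.allclose(fold_mean, X_demo[train_idx].mean(axis=0)))

print(scaler_uses_train_only)
```

Scaling the full data set before splitting would instead bake the validation rows' means and variances into the training features.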

I first split the data, holding out 20% as the test set and using stratification to keep a consistent class share for the classification task:

In [None]:
# following conventions, X contains the features and y contains the labels
X = smart_grid.drop(["stab", "stabf"], axis=1)
y = smart_grid[["stab", "stabf"]]

X_train_val_, X_test_, y_train_val, y_test = \
    train_test_split(X, y, test_size=0.2, stratify=y["stabf"], random_state=0)

# unsure about how X should be processed at the moment
X_train_val_w_pwr = X_train_val_.drop(["power1"], axis=1)
X_train_val_no_pwr = (X_train_val_
                      .drop(["power1", "power2", "power3", "power4"], axis=1))
# labels for both tasks
clf_y_train_val, clf_y_test = y_train_val["stabf"], y_test["stabf"]
reg_y_train_val, reg_y_test = y_train_val["stab"], y_test["stab"]

assert all(clf_y_train_val.value_counts(normalize=True) == \
    clf_y_test.value_counts(normalize=True)), \
    "inconsistent class share after split"

To get a better understanding of the data and which models perform well, I fitted a few models using 5-fold cross validation to two versions of the data (with or without the consumer `power[x]` features):

In [None]:
# classifiers
log_reg = LogisticRegression(random_state=1)
rfc = RandomForestClassifier(random_state=1)
knc = KNeighborsClassifier()
lgbc = LGBMClassifier(random_state=1)
svc = SVC()

# regressors
lin_reg = LinearRegression()
rfr = RandomForestRegressor(random_state=1)
knr = KNeighborsRegressor()
lgbr = LGBMRegressor(random_state=1)
svr = SVR()

# cross-validation splitters
skf = StratifiedKFold(random_state=1, shuffle=True)
kf = KFold(random_state=1, shuffle=True)
print("For classifiers:", skf)
print("For regressors:", kf)

In [None]:
def get_baseline_scores(est_names, X_train, y_train_val, cv):
    baseline_scores_list = []
    for est_name in est_names:
        pipe = Pipeline(steps=[("scaler", StandardScaler()),
                               ("estimator", globals()[est_name])])
        
        est_res = cross_validate(pipe,
                                 # https://github.com/dmlc/xgboost/issues/6908,
                                 np.ascontiguousarray(X_train),
                                 y_train_val,
                                 cv=cv, return_train_score=True)
        baseline_scores_list.append(
            pd.DataFrame({"estimator": est_name, 
                          "train_score": est_res.get("train_score"),
                          "val_score": est_res.get("test_score"), 
                          "fit_time_s": est_res.get("fit_time")
                         }))
    baseline_scores_df = (pd.concat(baseline_scores_list)
                          .sort_values("val_score", ascending=False))
    return baseline_scores_df


def fmt_bl_scores(scores_df, caption):
    metric_cols = ["train_score", "val_score", "fit_time_s"]
    final_df = (scores_df
                .pipe(pd.pivot_table, 
                      values=metric_cols, 
                      index="estimator", 
                      aggfunc={columns: np.mean for columns in metric_cols})
                .sort_values("val_score", ascending=False)
                
                # display options
                .style.format("{:.1%}", subset=["train_score", "val_score"])
                .set_table_attributes("style='display:inline'")
                .set_caption(caption)._repr_html_())
    return final_df

blc_w_pwr = get_baseline_scores(["log_reg", "rfc", "knc", "lgbc", "svc"],
                                X_train_val_w_pwr, clf_y_train_val, skf)
blc_no_pwr = get_baseline_scores(["log_reg", "rfc", "knc", "lgbc", "svc"],
                                 X_train_val_no_pwr, clf_y_train_val, skf)

blr_w_pwr = get_baseline_scores(["lin_reg", "rfr", "knr", "lgbr", "svr"],
                                X_train_val_w_pwr, reg_y_train_val, kf)
blr_no_pwr = get_baseline_scores(["lin_reg", "rfr", "knr", "lgbr", "svr"],
                                 X_train_val_no_pwr, reg_y_train_val, kf)
        
display_html(
    fmt_bl_scores(blc_w_pwr, 
                  "Baseline classifier accuracy (with power columns)") +
    "\xa0" * 2 +  # so both still fit in a line on small screens
    fmt_bl_scores(blc_no_pwr, 
                  "Baseline classifier accuracy (no power columns)") +
    "<br/><br/>" +
    fmt_bl_scores(blr_w_pwr,
                  "Baseline regressor R2 (with power columns)") +
    "\xa0" * 2 +
    fmt_bl_scores(blr_no_pwr, 
                  "Baseline regressor R2 (no power columns)"),
    raw=True
)

Removing the consumer `power[x]` columns increased accuracy scores for the top four out of five classifiers, and maintained or increased the R<sup>2</sup> score for all regressors. To get a better feel for this, I looked at the feature/permutation importance in tree-based models and absolute values of coefficients in L1-penalised linear models for each feature. As examples, I'm showing one type each for classification and regression models:
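
Permutation importance can be sketched by hand: shuffle one column at a time and record how much the model's score drops. The example below is an illustration on synthetic data where, by construction, only the first feature carries signal; `X_toy`, `y_toy` and `score_drops` are hypothetical names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
# only feature 0 drives the target; features 1 and 2 are pure noise
y_toy = 3 * X_toy[:, 0] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_toy, y_toy)
baseline_r2 = model.score(X_toy, y_toy)

score_drops = []
for col in range(X_toy.shape[1]):
    X_shuffled = X_toy.copy()
    # break the link between this feature and the target
    X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
    score_drops.append(baseline_r2 - model.score(X_shuffled, y_toy))

print(np.round(score_drops, 3))
```

A large drop means the model relied on that feature; drops near zero suggest the feature contributed little. The cell below applies the library versions of these ideas to the grid data.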

In [None]:
X_train_wp, X_val_wp, clf_y_train_wp, clf_y_val_wp = \
    train_test_split(X_train_val_w_pwr, clf_y_train_val, 
                     test_size = 0.3, random_state=0)
X_train_wp, X_val_wp, reg_y_train_wp, reg_y_val_wp = \
    train_test_split(X_train_val_w_pwr, reg_y_train_val, 
                     test_size = 0.3, random_state=0)

## LightGBM feature importances
lgbc.fit(X_train_wp, clf_y_train_wp)

## RF permutation importances (model agnostic method)
rfr.fit(X_train_wp, reg_y_train_wp)
rfr_pi_res = permutation_importance(rfr, X_train_wp, reg_y_train_wp,
                                   n_repeats=10, random_state=0)
rfr_pi_df = (pd.DataFrame(rfr_pi_res.get("importances"),
                         index=X_train_wp.columns,
                         columns=[i + 1 for i in range(10)])
            .reset_index()
            .pipe(pd.melt,
                  id_vars=["index"], value_vars=[i + 1 for i in range(10)],
                  var_name="rep", value_name="permutation_importance")
            .rename(columns={"index": "features"}))
rfr_pi_order = (rfr_pi_df.groupby("features")["permutation_importance"]
               .mean()
               .sort_values(ascending=False).index)

## Logistic regression / linear SVR coefficients (L1-based)
ss = StandardScaler()
def plot_abs_coefs(model_inst, model_name, y_train, ax):
    coefs_df = pd.DataFrame(
        {"features": X_train_wp.columns,
         "coefficients": (model_inst
                          # fit on the labels passed in, so the regressor
                          # is not trained on the classification target
                          .fit(ss.fit_transform(X_train_wp), y_train)
                          .coef_.flatten())}
    ).assign(abs_coef = lambda x: x["coefficients"].abs())
    feat_order = (coefs_df
                  .groupby("features")["abs_coef"].mean()
                  .sort_values(ascending=False).index)
    (sns.barplot(data=coefs_df, x="abs_coef", y="features",
                color="lightgrey", order=feat_order, ax=ax)
     .set(title=f"Absolute values of {model_name} coefficients"))
    
## Plots
fig, axs = plt.subplots(2, 2, figsize=(12, 8))
plt.subplots_adjust(wspace=0.3, hspace=0.4)

plot_importance(lgbc, xlabel="Total gains of splits which use the feature", 
                title="LightGBM Classifier feature importance",
                importance_type="gain", precision=0, grid=False, ax=axs[0, 0])
plot_abs_coefs(model_inst=LogisticRegression(solver="liblinear", penalty="l1"),
               model_name="logistic regression", 
               y_train=clf_y_train_wp, ax=axs[0, 1])

sns.barplot(data=rfr_pi_df, x="permutation_importance", y="features",
            ci="sd", order=rfr_pi_order, color="lightgrey",
            ax=axs[1, 0]).set(title="RF Regressor permutation importance")
plot_abs_coefs(model_inst=LinearSVR(random_state=1, max_iter=2000),
               model_name="linear SVR", y_train=reg_y_train_wp, ax=axs[1, 1]);

As suggested by the data exploration and baseline model performances, reaction delay and willingness to adapt seem to be much more relevant than amount of power produced/consumed for both the classification and regression tasks. To me, this suggests the `power[x]` columns were adding more noise than signal, and should be removed moving forward:

In [None]:
def process_X(X_df):
    final_df = X_df.drop(["power1", "power2", "power3", "power4"], axis=1)
    return final_df

X_train_val = process_X(X_train_val_)
X_test = process_X(X_test_)

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(10, 8))
plt.subplots_adjust(hspace=0.3)
for metric_df, ax_pos, estimator, metric in \
    zip([blc_no_pwr, blr_no_pwr], [0, 1],
        ["classifier", "regressor"], ["accuracy", "R-squared"]):
    (metric_df
     .pipe(pd.melt, 
           id_vars="estimator", 
           value_vars=["train_score", "val_score"], 
           var_name="metric_name", value_name="metric_value")
     .pipe((sns.boxplot, "data"), x="metric_value", y="estimator", 
           hue="metric_name", ax=axs[ax_pos])
     .set(title=f"Baseline {estimator} performance",
         xlim=(-0.05, None), xlabel=metric, ylabel=estimator))

Focusing on models that were fitted on the data without power consumption columns, all classifiers scored over 80% on validation accuracy. Support Vector (`svc`), Light Gradient Boosting Machine (LightGBM, `lgbc`) and random forest (RF, `rfc`) models gave the best scores. Validation R<sup>2</sup> scores varied widely across regressors, with LightGBM (`lgbr`), RF (`rfr`) and K-Nearest Neighbour (`knr`) being the best performers.

For both classification and regression, the LightGBM and RF models had close-to-perfect train scores. This could suggest overfitting, but is not uncommon for such ensembles, since enough trees could have trained on each training case to outweigh those that didn't. Instead of scoring based on the default accuracy for classification and R<sup>2</sup> for regression, I could look at log loss and mean squared error (MSE), for example:

In [None]:
X_train, X_val, clf_y_train, clf_y_val = \
    train_test_split(X_train_val, clf_y_train_val, 
                     test_size = 0.3, random_state=0)
X_train, X_val, reg_y_train, reg_y_val = \
    train_test_split(X_train_val, reg_y_train_val, 
                     test_size = 0.3, random_state=0)

fig, axs = plt.subplots(1, 2, figsize=(10, 3))
plt.subplots_adjust(wspace=0.4)
lgbc.fit(X_train, clf_y_train, 
        eval_set=[(X_train, clf_y_train), (X_val, clf_y_val)], 
        eval_names=["train set", "validation set"], verbose=0)
lgbr.fit(X_train, reg_y_train, 
        eval_set=[(X_train, reg_y_train), (X_val, reg_y_val)], 
        eval_names=["train set", "validation set"], verbose=0)
plot_metric(lgbc, title="Baseline LightGBM Classifier",
            xlabel="n_estimators", grid=False, ax=axs[0])
plot_metric(lgbr, title="Baseline LightGBM Regression",
            xlabel="n_estimators", ylabel="MSE", grid=False, ax=axs[1]);

For the LightGBM classifier, neither training nor validation log loss had decreased to a stable level. If anything, I think it was probably underfitted, so I could increase the learning rate and/or number of trees. The fit looks much better for the LightGBM regressor, and since the validation MSE doesn't start to drift away from the train MSE, I don't think it's overfitted.

Another way to check the fit on training samples is to look at the out-of-bag (OOB) score (i.e. calculate scores using only trees that didn't train on the particular training samples), for example:

In [None]:
rfc.fit(X_train, clf_y_train)
rfc_oob = RandomForestClassifier(oob_score=True, random_state=0)
rfc_oob.fit(X_train, clf_y_train)

rfr.fit(X_train, reg_y_train)
rfr_oob = RandomForestRegressor(oob_score=True, random_state=0)
rfr_oob.fit(X_train, reg_y_train)

pd.DataFrame({
    "classifier": [rfc.score(X_train, clf_y_train),
                   rfc_oob.oob_score_,
                   rfc.score(X_val, clf_y_val)],
    "regressor": [rfr.score(X_train, reg_y_train),
                  rfr_oob.oob_score_,
                  rfr.score(X_val, reg_y_val)],
}, index=["training set score", "training set OOB score",
          "validation set score"]
).style.format("{:.3f}")

Unlike the default training scores, the OOB training scores were slightly lower than the validation scores for both tasks, so I don't think the baseline RF models were overfitted either.

Back to [Overview](#overview)

<a id="optimise-models"></a>
### Optimise models
*Hyperparameter tuning with randomised and grid searches*

To search for optimised hyperparameters and feature transformers, I used exhaustive grid searches when the search space was small; otherwise, I randomly sampled 60 conditions, which should give me [a close approximation](https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html) with about 95% probability if at least 5% of the total conditions in the search space are close-to-optimal.
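
The reasoning behind the figure of 60 can be checked with a few lines of arithmetic. This sketch assumes each sampled condition independently has a 5% chance of landing in the close-to-optimal region, which is a simplification of how candidates are actually drawn.

```python
n_iter = 60                # number of randomly sampled conditions
share_near_optimal = 0.05  # assumed share of close-to-optimal conditions

# probability that every one of the 60 draws misses the good region
p_all_miss = (1 - share_near_optimal) ** n_iter
p_at_least_one_hit = 1 - p_all_miss

print(f"P(at least one close-to-optimal draw) = {p_at_least_one_hit:.3f}")
```

Under that assumption, the chance of finding at least one close-to-optimal condition is a little over 95%, and adding more iterations gives diminishing returns.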

The feature transformations I explored were: standardising the feature so it has a mean of 0 and standard deviation of 1 (`StandardScaler`), rescaling the feature using outlier-insensitive statistics (`RobustScaler`) or [collapsing any outliers](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html?highlight=scalers#quantiletransformer-uniform-output) (`QuantileTransformer`). For context, I also added `FunctionTransformer`, which allows me to exclude the transformation step.
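
To illustrate how these transformers differ, here is a small sketch on a made-up column with nine ordinary values and one extreme outlier (`n_quantiles=10` is chosen only to match the ten toy samples):

```python
import numpy as np
from sklearn.preprocessing import (StandardScaler, RobustScaler,
                                   QuantileTransformer)

# nine "ordinary" values and one extreme outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

scaled = {
    "standard": StandardScaler().fit_transform(values).ravel(),
    "robust": RobustScaler().fit_transform(values).ravel(),
    "quantile": QuantileTransformer(n_quantiles=10).fit_transform(values).ravel(),
}

for name, out in scaled.items():
    inlier_spread = out[:-1].max() - out[:-1].min()
    print(f"{name:>8}: inlier spread = {inlier_spread:.3f}, "
          f"outlier maps to {out[-1]:.2f}")
```

In this toy case, `StandardScaler` squeezes the ordinary values into a narrow band because the outlier inflates the standard deviation, `RobustScaler` keeps them spread out but leaves the outlier far away, and `QuantileTransformer` pulls the outlier to the upper edge of the output range.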

In [None]:
def gen_search_eval(estimator_name: str, search_params,
                    pred_task:str, search_mode: str):

    cv = skf if pred_task == "clf" else kf
    y_train_val = clf_y_train_val if pred_task == "clf" else reg_y_train_val
    
    pipeline_steps = Pipeline(
        steps=[("transformer", StandardScaler()),
                ("estimator", globals()[estimator_name])]
    )
    
    common_args = dict(estimator=pipeline_steps, n_jobs=-1, 
                       cv=cv, verbose=1,
                       return_train_score=True,  # to check for overfitting
                       error_score="raise")  # if error during fitting
    random_cv = RandomizedSearchCV(param_distributions=search_params, 
                                   n_iter=60, random_state=1, **common_args)
    grid_cv = GridSearchCV(param_grid=search_params, **common_args)
    search_cv = grid_cv if search_mode == "grid" else random_cv

    search_cv.fit(X_train_val, y_train_val)
    return search_cv


def fmt_search_res(search_res):
    summary_df = pd.DataFrame(
        {key: search_res.cv_results_.get(key) 
         for key in ["params", "rank_test_score", 
                     "mean_fit_time", "mean_score_time",
                     "mean_train_score", "mean_test_score",
                     "std_train_score", "std_test_score"]}
    )
    
    final_df = (pd.DataFrame(summary_df["params"].tolist())  # expand params
                .reset_index()
                .merge(summary_df.reset_index(drop=True).reset_index())
                .drop(["index", "params"], axis=1)
                .sort_values(["mean_test_score", "mean_fit_time"], 
                             ascending=[False, True]))
    return final_df


# feature transformations important for distance-based algo eg. SVM, KNN
transformers = [StandardScaler(),  # very sensitive to outliers
                RobustScaler(),  # not affected by a few extreme outliers
                FunctionTransformer(),  # func=None is the identity (do nothing); picklable, unlike a lambda
                QuantileTransformer()]  # collapses outliers

**Support Vector Machine**

SVM generates hyperplanes that separate data points from each class for classification problems, or, for regression problems, fits a hyperplane that keeps as many points as possible within a margin around it. Some of the most important hyperparameters are probably the kernel that is used to transform the data, and a few parameters that adjust the regularisation.
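
One detail about the `gamma` options: according to the scikit-learn documentation, `"scale"` uses 1 / (n_features * X.var()) and `"auto"` uses 1 / n_features. The sketch below computes both on synthetic data (the helper functions are written for illustration and are not library calls) to show that the two nearly coincide once features are standardised.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.uniform(0.5, 10.0, size=(500, 8))  # toy data with eight features
X_std = StandardScaler().fit_transform(X_raw)


def gamma_scale(X):
    """Documented formula for gamma='scale'."""
    return 1.0 / (X.shape[1] * X.var())


def gamma_auto(X):
    """Documented formula for gamma='auto'."""
    return 1.0 / X.shape[1]


print("raw data:         ", gamma_scale(X_raw), gamma_auto(X_raw))
print("standardised data:", gamma_scale(X_std), gamma_auto(X_std))
```

So the choice between the two settings mostly matters for transformers that do not bring the overall feature variance close to 1.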

In [None]:
svm_params = [{"estimator__kernel": ["rbf", "sigmoid"],  # default rbf
               # gamma reshapes decision boundary (high overfits)
               "estimator__gamma": ["auto", "scale"]},  # default scale
              {"estimator__kernel": ["linear"]}]  # gamma has no effect on linear

for param_dict in svm_params:
    param_dict.update({"transformer": transformers,
                       "estimator__C": np.logspace(-2, 2, 5)})

svc_raw_clf_res = gen_search_eval("svc", svm_params, "clf", "random")
print(f"\nBest classifier pipeline: {svc_raw_clf_res.best_estimator_}")

svc_clf_res = fmt_search_res(svc_raw_clf_res)
svc_clf_res.head(3)

In [None]:
svr_raw_reg_res = gen_search_eval("svr", svm_params, "reg", "random")
print(f"\nBest regressor pipeline: {svr_raw_reg_res.best_estimator_}")

svr_reg_res = fmt_search_res(svr_raw_reg_res)
svr_reg_res.head(3)

**Light Gradient Boosting Machine**

Gradient boosting models iteratively build weak prediction models, minimising loss at each stage. Of the gradient boosting models available, I picked LightGBM, since it tends to be [much faster than XGBoost](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#comparison-experiment) as well as [CatBoost](https://publications.waset.org/10009954/comparison-between-xgboost-lightgbm-and-catboost-using-a-home-credit-dataset), while giving similar results. LightGBM has an [overwhelming number of adjustable parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html), some that adjust the ensemble model itself and others that put constraints on the individual trees.
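
The core boosting idea can be sketched in a few lines. This is a bare-bones illustration for squared-error regression on synthetic data, using shallow scikit-learn trees as the weak learners. It is not how LightGBM is implemented (which adds histogram binning, leaf-wise growth and much more), and all names are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0])

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant model
mse_per_round = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    # for squared error, the negative gradient is just the residual
    residuals = y_toy - prediction
    weak_learner = DecisionTreeRegressor(max_depth=2, random_state=0)
    weak_learner.fit(X_toy, residuals)
    # take a small step towards correcting the current errors
    prediction += learning_rate * weak_learner.predict(X_toy)
    mse_per_round.append(np.mean((y_toy - prediction) ** 2))

print(f"MSE before boosting: {mse_per_round[0]:.4f}, "
      f"after 50 rounds: {mse_per_round[-1]:.4f}")
```

The interplay between `learning_rate` and the number of rounds here mirrors the trade-off between `learning_rate` and `n_estimators` when tuning.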

In [None]:
lgb_params = {
    "transformer": transformers,
    # generally, slower learning rates need more trees
    "estimator__n_estimators": [50, 100, 250, 500],  # default 100
    "estimator__learning_rate": [0.001, 0.01, 0.1, 0.2, 0.3],  # default 0.1
    "estimator__boosting_type": ["gbdt", "dart"],  # default gbdt
    # restrict tree growth and adjust regularisation to prevent overfitting
    "estimator__max_depth": [25, 50, -1],  # default -1 (no limit)
    "estimator__min_child_samples": [20, 30, 40],  # default 20
    "estimator__reg_alpha": np.logspace(-2, 2, 5),  # lambda_l1; default 0
    "estimator__reg_lambda": np.logspace(-2, 2, 5)  # lambda_l2; default 0
}

lgbc_raw_clf_res = gen_search_eval("lgbc", lgb_params, "clf", "random")
print(f"\nBest classifier pipeline: {lgbc_raw_clf_res.best_estimator_}")

lgbc_clf_res = fmt_search_res(lgbc_raw_clf_res)
lgbc_clf_res.head(3)

In [None]:
lgbr_raw_reg_res = gen_search_eval("lgbr", lgb_params, "reg", "random")
print(f"\nBest regressor pipeline: {lgbr_raw_reg_res.best_estimator_}")

lgbr_reg_res = fmt_search_res(lgbr_raw_reg_res)
lgbr_reg_res.head(3)

In [None]:
# take just estimator parts since plot_metric doesn't work with pipelines
tuned_lgbc = lgbc_raw_clf_res.best_estimator_[1]
tuned_lgbr = lgbr_raw_reg_res.best_estimator_[1]
qt = QuantileTransformer()

tuned_lgbc.fit(X_train, clf_y_train, 
           eval_set=[(X_train, clf_y_train), 
                     (X_val, clf_y_val)], 
           eval_names=["train set", "validation set"],
           verbose=0)
tuned_lgbr.fit(qt.fit_transform(X_train), reg_y_train, 
           eval_set=[(qt.fit_transform(X_train), reg_y_train), 
                     (qt.transform(X_val), reg_y_val)], 
           eval_names=["train set", "validation set"],
           verbose=0)

fig, axs = plt.subplots(1, 2, figsize=(10, 3))
plt.subplots_adjust(wspace=0.4)
plot_metric(tuned_lgbc, title="Tuned LightGBM Classifier",
            xlabel="n_estimators", grid=False, ax=axs[0])
plot_metric(tuned_lgbr, title="Tuned LightGBM Regressor",
            xlabel="n_estimators", ylabel="MSE", grid=False, ax=axs[1]);

With the tuned parameters, both training and validation loss decrease to a plateau for the best tuned classifier and regressor. The classifier fit improved from the baseline model; the regressor fit seems similar, but importantly, not overfitted.

**Random Forest**

Random forest models build decision trees from random subsets of samples and features, and also have many hyperparameters to tune. Of course, the impact of each parameter may vary depending on the data set. For example, with only eight features in our data set, switching `max_features` between "sqrt" and "log2" should have little impact on performance, since they resolve to two and three features per split respectively (note that "auto" means "sqrt" for classifiers but all features for regressors; `max_features` could also be set to a specific number).
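
As a quick check of how the named `max_features` options resolve for eight features, the sketch below applies the formulas as I understand them from the scikit-learn documentation (integer part of the square root or base-2 logarithm, with a minimum of one):

```python
import numpy as np

n_features = 8  # features left after dropping the power columns

features_per_split = {
    "sqrt": max(1, int(np.sqrt(n_features))),
    "log2": max(1, int(np.log2(n_features))),
    "all features": n_features,
}
print(features_per_split)
```

With so few features, the named options span a narrow range, so explicit integers may be the more informative values to search over.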

In [None]:
rf_params = {
    "transformer": transformers,
    "estimator__n_estimators": [50, 100, 250, 500],  # default 100
    "estimator__max_depth": [15, 50, None],
    "estimator__min_samples_split": [2, 5, 10],  # default 2
    "estimator__min_samples_leaf": [1, 2, 5]  # default 1
}

rfc_raw_clf_res = gen_search_eval("rfc", rf_params, "clf", "random")
print(f"\nBest classifier pipeline: {rfc_raw_clf_res.best_estimator_}")

rfc_clf_res = fmt_search_res(rfc_raw_clf_res)
rfc_clf_res.head(3)

In [None]:
rfr_raw_reg_res = gen_search_eval("rfr", rf_params, "reg", "random")
print(f"\nBest regressor pipeline: {rfr_raw_reg_res.best_estimator_}")

rfr_reg_res = fmt_search_res(rfr_raw_reg_res)
rfr_reg_res.head(3)

In [None]:
from sklearn.base import clone

tuned_rfc = rfc_raw_clf_res.best_estimator_[1]
# clone so the OOB variant is a separate model, not an alias of tuned_rfc
tuned_rfc_oob = clone(tuned_rfc).set_params(oob_score=True)
tuned_rfr = rfr_raw_reg_res.best_estimator_[1]
tuned_rfr_oob = clone(tuned_rfr).set_params(oob_score=True)

# transform once and reuse for both the plain and OOB fits
rs = RobustScaler()
X_train_rs = rs.fit_transform(X_train)
tuned_rfc.fit(X_train_rs, clf_y_train)
tuned_rfc_oob.fit(X_train_rs, clf_y_train)
qt = QuantileTransformer()
X_train_qt = qt.fit_transform(X_train)
tuned_rfr.fit(X_train_qt, reg_y_train)
tuned_rfr_oob.fit(X_train_qt, reg_y_train)

pd.DataFrame({
    "classifier": [tuned_rfc.score(rs.transform(X_train), clf_y_train),
                   tuned_rfc_oob.oob_score_,
                   tuned_rfc.score(rs.transform(X_val), clf_y_val)],
    "regressor": [tuned_rfr.score(qt.transform(X_train), reg_y_train),
                  tuned_rfr_oob.oob_score_,
                  tuned_rfr.score(qt.transform(X_val), reg_y_val)],
}, index=["training set score", "training set OOB score",
          "validation set score"]
).style.format("{:.3f}")

Neither the tuned RF classifier nor the regressor seems overfitted.

**KNN**

k-nearest neighbours makes predictions based on values of the nearest neighbours. For classification, the neighbours' labels are combined by a simple majority vote, whereas their values are averaged for regression. This means that hyperparameters related to the number of neighbours, how distances are calculated and how much weight is put on each neighbour's vote all affect the performance.
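
A hand-rolled sketch of both prediction modes, on four made-up one-dimensional points, shows how the `weights` setting can change the outcome. The data and variable names are purely illustrative.

```python
import numpy as np

# four known points with a class label and a numeric target each
X_known = np.array([0.0, 1.0, 2.0, 10.0])
labels = np.array([0, 0, 1, 1])
targets = np.array([0.1, 0.2, 0.9, 5.0])

query, k = 1.6, 3
distances = np.abs(X_known - query)
nearest = np.argsort(distances)[:k]  # indices of the k closest points

# uniform weights: plain majority vote / plain average
uniform_vote = np.bincount(labels[nearest]).argmax()
uniform_mean = targets[nearest].mean()

# distance weights: closer neighbours count more (weight = 1 / distance)
inv_dist = 1.0 / distances[nearest]
weighted_vote = np.bincount(labels[nearest], weights=inv_dist).argmax()
weighted_mean = np.average(targets[nearest], weights=inv_dist)

print("uniform: ", uniform_vote, round(uniform_mean, 3))
print("distance:", weighted_vote, round(weighted_mean, 3))
```

Here two of the three neighbours belong to class 0, so the uniform vote picks class 0, but the single closest neighbour belongs to class 1 and tips the distance-weighted vote the other way.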

In [None]:
# typical k to start with (eg. https://stackoverflow.com/a/11569262):
print("Typical `k` to start with (square root of number of samples): ",
      np.sqrt(X_train_val.shape[0]))

knn_params = {
    "transformer": transformers,
    "estimator__n_neighbors": [5, 15, 45, 90],  # default 5
    # distance function, default minkowski
    "estimator__metric": ["minkowski", "euclidean", "manhattan"],
    "estimator__weights": ["uniform", "distance"],  # default uniform
    "estimator__algorithm": ["auto", "ball_tree", "kd_tree"]  # default auto
}

knc_raw_clf_res = gen_search_eval("knc", knn_params, "clf", "random")
print(f"\nBest classifier pipeline: {knc_raw_clf_res.best_estimator_}")

knc_clf_res = fmt_search_res(knc_raw_clf_res)
knc_clf_res.head(3)

In [None]:
knr_raw_reg_res = gen_search_eval("knr", knn_params, "reg", "random")
print(f"\nBest regressor pipeline: {knr_raw_reg_res.best_estimator_}")

knr_reg_res = fmt_search_res(knr_raw_reg_res)
knr_reg_res.head(3)

**Logistic/Linear Regression**

Linear regression predicts the target as a weighted sum of the input features plus an intercept. Logistic regression passes that weighted sum through the logistic (sigmoid) function to turn it into a class probability. I explored parameters controlling regularisation type and strength, as well as the algorithm used to find the optimal coefficients.
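
As a rough illustration of how the two models relate, the sketch below computes a linear combination and then squashes it into a probability. The coefficients, intercept and feature values are made up, and the 0.5 decision threshold is an assumption for the example.

```python
import math


def linear_combination(coefs, intercept, features):
    """Weighted sum of the features plus an intercept (linear regression output)."""
    return intercept + sum(c * x for c, x in zip(coefs, features))


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up coefficients and inputs, for illustration only
coefs = [0.8, -1.5]
intercept = 0.2
features = [1.0, 0.5]

z = linear_combination(coefs, intercept, features)  # 0.2 + 0.8 - 0.75
prob = sigmoid(z)                                   # probability of class 1
predicted_class = int(prob >= 0.5)                  # assumed 0.5 threshold
```

Regularisation then limits how large the coefficients may grow. In scikit-learn's `LogisticRegression`, a smaller `C` means stronger regularisation, whereas for `Lasso` and `Ridge` a larger `alpha` does.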

In [None]:
log_reg_params = [
    # compatible with both l1 (lasso) and l2 (ridge)
    {"estimator__solver": ["saga", "liblinear"],
     "estimator__penalty": ["l1", "l2"]},
    {"estimator__solver": ["lbfgs", "newton-cg", "sag"],  # default lbfgs
     "estimator__penalty": ["l2"]}]
    
# add the shared options to both parameter grids
for param_dict in log_reg_params:
    param_dict.update({"transformer": transformers,
                       # inverse of lambda, default 1
                       "estimator__C": np.logspace(-2, 2, 5)})

log_reg_raw_clf_res = gen_search_eval("log_reg", log_reg_params, "clf", "grid")
print(f"\nBest performing pipeline: {log_reg_raw_clf_res.best_estimator_}")

log_reg_clf_res = fmt_search_res(log_reg_raw_clf_res)
log_reg_clf_res.head(3)

In [None]:
lin_reg_params = [
    {"estimator": [LinearRegression()]},
    {"estimator": [Lasso(random_state=1)],
     "estimator__alpha": np.logspace(-2, 2, 5)},
    {"estimator": [Ridge(random_state=1)],
     "estimator__alpha": np.logspace(-2, 2, 5),
     "estimator__solver": ["auto", "svd", "sparse_cg", "lsqr", "sag"]} # default auto
    ]
    
lin_reg_raw_reg_res = gen_search_eval("lin_reg", lin_reg_params, "reg", "grid")
print(f"\nBest performing pipeline: {lin_reg_raw_reg_res.best_estimator_}")

lin_reg_reg_res = fmt_search_res(lin_reg_raw_reg_res)
lin_reg_reg_res.head(3)

<a id="hyperparameter-tuning-summary"></a>
**Optimisation summary**

In [None]:
def get_tuned_score(est_name:str, task:str):
    score = globals()[f"{est_name}_raw_{task}_res"].best_score_
    return score

summary_df_display_text = ""

for bl_df, mode in zip([blc_no_pwr, blr_no_pwr], ["clf", "reg"]):
    caption = "classifier summary" if mode == "clf" else "regressor summary"
    html_text = (
        bl_df
        .pipe(pd.pivot_table,
              values="val_score",
              index="estimator",
              aggfunc="mean")
        .rename(columns={"val_score": "baseline"})
        .reset_index()
        .sort_values("baseline", ascending=False)
        .assign(tuned = lambda x: x["estimator"].apply(get_tuned_score, task=mode),
                diff = lambda x: x["tuned"] - x["baseline"])
        
        # display options
        .style.format("{:.1%}", subset = ["baseline", "tuned", "diff"])
        .set_table_attributes("style='display:inline'")
        .set_caption(caption)._repr_html_()
    )
    summary_df_display_text = summary_df_display_text + html_text + "\xa0" * 2
    
display_html(summary_df_display_text, raw=True)

Overall, optimisation maintained or slightly improved accuracy for all classifiers (by 0.1 - 1.5 percentage points) and R<sup>2</sup> for all regressors (by 0.0 - 1.4 percentage points), with the best performance coming from the SVM classifier and the LightGBM regressor.

Back to [Overview](#overview)

<a id="test-final-model"></a>
### Test final model

In [None]:
best_clf = svc_raw_clf_res.best_estimator_
best_clf.fit(X_train_val, clf_y_train_val)
print("SVM classifier test score: ", best_clf.score(X_test, clf_y_test), "\n")

print(classification_report(
    y_true=clf_y_test, y_pred=best_clf.predict(X_test),
    labels=[0, 1], target_names=["Unstable", "Stable"]
))

fig, axs = plt.subplots(1, 2, figsize=(10, 3))
plt.subplots_adjust(wspace=0.3)
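# the Display classes below require scikit-learn >= 1.0
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
ConfusionMatrixDisplay.from_estimator(best_clf, X_test, clf_y_test, ax=axs[0])
RocCurveDisplay.from_estimator(best_clf, X_test, clf_y_test, ax=axs[1])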

In [None]:
best_reg = lgbr_raw_reg_res.best_estimator_
best_reg.fit(X_train_val, reg_y_train_val)
print("LightGBM regressor test score: ", best_reg.score(X_test, reg_y_test))

On the held-out test data, the final classifier scored 98.3% accuracy. Precision, recall and the resulting F1 scores are high for both classes, and the ROC curve looks great. Of course, a score this high is extremely rare when working with real data, and suggests there isn't a lot of noise in this simulated data set. Still, a deep learning model [in this publication](https://link.springer.com/article/10.1007/s42979-021-00463-5/tables/3) showed similar performance (97.5% accuracy, 98.7% precision, 98.2% F1-score) on the same data.
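
For reference, precision, recall, F1 and accuracy all derive from the confusion-matrix counts. The sketch below uses made-up counts, not the results from this notebook, and the helper name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)  # of all predicted positives, how many were right
    recall = tp / (tp + fn)     # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Made-up confusion-matrix counts, for illustration only
tp, fp, fn, tn = 90, 5, 10, 95

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
```

Because F1 is the harmonic mean of precision and recall, it is only high when both are high, which is why it is a useful check alongside accuracy.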

The final regressor performed similarly well, scoring 95.9% R<sup>2</sup> on the held-out test data. I then pickled both models so that I could easily use them in [a simple app](https://share.streamlit.io/sowla/grid_stability_app/main/grid_stability.py) that shows how adjusting model inputs affects the resulting predictions:

In [None]:
# context managers make sure the files are flushed and closed
with open("grid_clf.pkl", "wb") as f:
    pickle.dump(best_clf, f)
with open("grid_reg.pkl", "wb") as f:
    pickle.dump(best_reg, f)
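
On the app side, loading is the mirror image of dumping. The sketch below shows the round trip with a stand-in object and a temporary directory so that it runs on its own. In practice the object would be the fitted pipeline and the file one of the `.pkl` files written above. How the app itself loads the models is not shown in this notebook, so treat this as an assumption.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted pipeline (hypothetical; any picklable object
# round-trips the same way)
stand_in_model = {"name": "grid_clf", "params": {"C": 1.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "grid_clf.pkl"

    # write the object to disk
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # read it back, as an app would do at start-up
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

round_trip_ok = loaded_model == stand_in_model
```

Two caveats apply to real models: only unpickle files from a trusted source, and load them with the same library versions (for example scikit-learn) that were used to create them.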

Back to [Overview](#overview)

<a id="interesting-links"></a>
### Interesting links

I thought I'd share a few links/facts I found while doing this project, in case they're interesting to anyone else :)

- Commercial electricity producers and consumers have to give [quarter-hourly forecasts](https://www.amprion.net/Energy-Market/Balancing-Groups/Balancing-Group-Price/Important-Information.html) of the amount of electricity they'll produce/consume. You can see current German data [here](https://www.smard.de/home/marktdaten?marketDataAttributes=%7B%22resolution%22:%22quarterhour%22,%22region%22:%22DE%22,%22moduleIds%22:%5B1000100,1000101,1000102,1000103,1000104,1000108,1000109,1000110,1000111,1000112,1000113,1000121,5000410,1001226,1001228,1001227,1001223,1001224,1001225,1004066,1004067,1004068,1004069,1004071,1004070,2000122,6000411%5D,%22selectedCategory%22:null,%22activeChart%22:true,%22style%22:%22color%22,%22from%22:1621366895489,%22to%22:1621626095488%7D) (it looks like predictions for power consumption are a lot more accurate than for power generation).
- Transmission system operators have to constantly keep track of and react to changes within [their own grids as well as their neighbours'](https://www.entsoe.eu/regions/). So much coordination has to go on that they form "[Regional Security Coordinator](https://www.entsoe.eu/major-projects/rscis/#why-do-we-need-to-strengthen-regional-coordination-now)" companies together.
- "Prosumers", individuals and businesses that act as consumers *and* producers, can [contribute to the energy system](https://smarten.eu/wp-content/uploads/2020/05/Smart_Energy_Prosumers_2020.pdf) in many ways.