MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

This notebook is adapted from Francesca Lazzeri's Energy Demand Forecast Workbench workshop.

Copyright (c) 2021 PyLadies Amsterdam, Alyona Galyeva

# Linear regression with recursive feature elimination

In [3]:
%matplotlib inline
import os
import pickle
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import pandas as pd
import numpy as np
from typing import Callable
# from azureml.core import Workspace, Dataset
# from azureml.core.experiment import Experiment
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_selection import RFECV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


This notebook shows how to train a linear regression model to forecast future energy demand. In particular, the model is trained to predict energy demand in period $t+1$, one hour ahead of the current period $t$. This is known as 'one-step' time series forecasting because we predict one period into the future.
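As a minimal sketch of what a one-step-ahead setup looks like in pandas, the example below builds a target shifted one hour into the future and a one-hour lag feature. The tiny DataFrame and the column `target_t_plus_1` are hypothetical; whether the CSV files used in this notebook were prepared this way upstream is an assumption, not something the notebook confirms.

```python
import pandas as pd

# Hypothetical hourly data; column names mirror the ones used in this notebook
df = pd.DataFrame({
    'timestamp': pd.date_range('2021-01-01', periods=6, freq=pd.Timedelta(hours=1)),
    'load_actuals_mw': [100.0, 110.0, 120.0, 130.0, 140.0, 150.0],
    'temperature': [5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# One-step-ahead target: the demand observed one hour after each row (period t+1)
df['target_t_plus_1'] = df['load_actuals_mw'].shift(-1)

# Lag feature: the temperature observed one hour before each row (period t-1)
df['temperature_lag1'] = df['temperature'].shift(1)

# The first row has no lag and the last row has no target, so both are dropped
supervised = df.dropna().reset_index(drop=True)
print(supervised[['timestamp', 'temperature_lag1', 'load_actuals_mw', 'target_t_plus_1']])
```

Each remaining row pairs information available at time $t$ with the demand at $t+1$, which is the shape a one-step forecaster is trained on.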

In [None]:
# os.mkdir('Linear_models')
# os.mkdir('NaiveBayes_models')

In [4]:
WORKDIR = os.getcwd()
TRAIN_DIR = os.path.join(WORKDIR, '../data-processing/data/train')
TEST_DIR = os.path.join(WORKDIR, '../data-processing/data/test')

In [5]:
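# Load every training CSV into a DataFrame named df_<name>, where <name> is the
# file name without its 5-character prefix and 4-character extension
for train_data in os.listdir(TRAIN_DIR):
    locals()['df_' + train_data[5:-4]] = pd.read_csv(os.path.join(TRAIN_DIR, train_data))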
    print(f'The local variable of type DataFrame, df_{train_data[5:-4]} is created')
# df_train_1.head()

The local variable of type DataFrame, df_train_4 is created
The local variable of type DataFrame, df_train_3 is created
The local variable of type DataFrame, df_train_2 is created
The local variable of type DataFrame, df_train_1 is created
The local variable of type DataFrame, df_outlier_train_4 is created
The local variable of type DataFrame, df_outlier_train_3 is created
The local variable of type DataFrame, df_outlier_train_2 is created
The local variable of type DataFrame, df_outlier_train_1 is created


In [6]:
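# Load every testing CSV into a DataFrame named df_<name>, where <name> is the
# file name without its 5-character prefix and 4-character extension
for test_data in os.listdir(TEST_DIR):
    locals()['df_' + test_data[5:-4]] = pd.read_csv(os.path.join(TEST_DIR, test_data))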
    print(f'The local variable of type DataFrame, df_{test_data[5:-4]} is created')
# df_test_1.head()

The local variable of type DataFrame, df_test_4 is created
The local variable of type DataFrame, df_test_1 is created
The local variable of type DataFrame, df_test_2 is created
The local variable of type DataFrame, df_test_3 is created
The local variable of type DataFrame, df_outlier_test_4 is created
The local variable of type DataFrame, df_outlier_test_1 is created
The local variable of type DataFrame, df_outlier_test_3 is created
The local variable of type DataFrame, df_outlier_test_2 is created
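Assigning through `locals()` works here only because, at the top level of a notebook, `locals()` is the module namespace. A dictionary gives the same name-based access without relying on that. The sketch below is one possible alternative, not the notebook's method: `load_frames` is a hypothetical helper, and the file name pattern `data_<name>.csv` is an assumption inferred from the `[5:-4]` slicing above.

```python
from pathlib import Path

import pandas as pd


def load_frames(directory: str) -> dict:
    """Read every CSV in `directory` into a dict keyed like the notebook's variables."""
    frames = {}
    for path in sorted(Path(directory).glob('*.csv')):
        # Assumed naming: 'data_train_1.csv' -> 'df_train_1' (same [5:-4] slice as above)
        frames['df_' + path.name[5:-4]] = pd.read_csv(path)
    return frames
```

With this helper, `frames = load_frames(TRAIN_DIR)` followed by `frames['df_train_1']` would replace the dynamically created variable `df_train_1`.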


Create the design matrix: each column represents a model feature and each row is a training example. We remove the *load_actuals_mw* and *timestamp* columns, since the former is the prediction target and the latter is not a model feature.

In [None]:
# # Create a function that can return all the DataFrames of the training & testing variables,
# # making it easier to modify all the variables at once.

# def modify_all_df(function: Callable) -> None:
#     for df_key in locals().keys():
#         if 'data_' in df_key:
#             # locals[df_key].apply(lambda x: function)
#             function(locals()[df_key])

In [7]:
import re
train_regex = re.compile(r'df_(\w+_)?train_\d')
df_train_list = list(filter(train_regex.match, locals().keys()))
df_train_list

['df_train_4',
 'df_train_3',
 'df_train_2',
 'df_train_1',
 'df_outlier_train_4',
 'df_outlier_train_3',
 'df_outlier_train_2',
 'df_outlier_train_1']

In [30]:
# set X for every training set by dropping the 'timestamp' & 'load_actuals_mw' columns
# set y for every training set by selecting only the 'load_actuals_mw' column
X_train_list = []
for train_df in df_train_list:
    locals()[f'X_{train_df[3:]}'] = locals()[train_df].drop(['timestamp', 'load_actuals_mw'], axis=1)
    X_train_list.append(f'X_{train_df[3:]}')
    print(f'The variable, X_{train_df[3:]} is created.')

    locals()[f'y_{train_df[3:]}'] = locals()[train_df]['load_actuals_mw']
    print(f'The variable, y_{train_df[3:]} is created.')

The variable, X_train_4 is created.
The variable, y_train_4 is created.
The variable, X_train_3 is created.
The variable, y_train_3 is created.
The variable, X_train_2 is created.
The variable, y_train_2 is created.
The variable, X_train_1 is created.
The variable, y_train_1 is created.
The variable, X_outlier_train_4 is created.
The variable, y_outlier_train_4 is created.
The variable, X_outlier_train_3 is created.
The variable, y_outlier_train_3 is created.
The variable, X_outlier_train_2 is created.
The variable, y_outlier_train_2 is created.
The variable, X_outlier_train_1 is created.
The variable, y_outlier_train_1 is created.


### Create predictive model pipeline

Here we use sklearn's Pipeline functionality to create a predictive model pipeline. For this model, the pipeline implements the following steps:
- **one-hot encode categorical variables** - this creates a feature for each unique value of a categorical feature. For example, the feature *dayofweek* has 7 unique values. This feature is split into 7 individual features dayofweek0, dayofweek1, ... , dayofweek6. The value of these features is 1 if the timeStamp corresponds to that day of the week, otherwise it is 0.
- **recursive feature elimination with cross validation (RFECV)** - it is often the case that some features add little predictive power to a model and may even make the model accuracy worse. Recursive feature elimination tests the model accuracy on increasingly smaller subsets of the features to identify the subset which produces the most accurate model. Cross validation is used to test each subset on multiple folds of the input data. The best model is that which achieves the lowest mean squared error averaged across the cross validation folds.
- **train final model** - the best feature subset found by the feature elimination process is used to train the final estimator on the whole dataset.
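The three steps above can be sketched as a single sklearn `Pipeline`. This is an illustration under stated assumptions rather than the pipeline built later in this notebook: the data is synthetic, and the names `X_demo`, `y_demo`, `encode` and `demo_pipeline` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data: two numeric features and one categorical feature
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'temperature': rng.normal(10.0, 5.0, n),
    'noise': rng.normal(0.0, 1.0, n),
    'month': rng.integers(1, 5, n),  # integer categories 1..4
})
y_demo = 3.0 * X_demo['temperature'] + 2.0 * X_demo['month'] + rng.normal(0.0, 0.1, n)

# Step 1: one-hot encode the categorical column and pass the others through.
# sparse_threshold=0 forces a dense array for the next step.
encode = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['month'])],
    remainder='passthrough',
    sparse_threshold=0,
)

# Steps 2 and 3: RFECV picks the feature subset by time-ordered cross validation,
# then refits the estimator on the selected features using all rows.
select_and_fit = RFECV(
    estimator=LinearRegression(),
    cv=TimeSeriesSplit(n_splits=3),
    scoring='neg_mean_squared_error',
)

demo_pipeline = Pipeline([('onehot', encode), ('rfecv', select_and_fit)])
demo_pipeline.fit(X_demo, y_demo)

n_kept = demo_pipeline.named_steps['rfecv'].n_features_
predictions = demo_pipeline.predict(X_demo)
print(f'{n_kept} of 6 encoded features kept')
```

Keeping the encoder inside the pipeline means the same encoding is applied automatically at prediction time, which is the main reason to prefer it over encoding the data by hand.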

Identify the categorical columns and their indices. Note that the encoding itself is done in the following cells with `pd.get_dummies` rather than with a `OneHotEncoder`:

In [9]:
categorical_cols = ['month', 'season']
sample_df = locals()["df_train_1"]  # all DataFrames share the same structure, so one sample is enough to find the column indices
cat_cols_index = [sample_df.columns.get_loc(col) for col in sample_df.columns if col in categorical_cols]
cat_cols_index

[6, 7]

In [33]:
# convert the categorical columns into dummy (one-hot) variables
for train_X in X_train_list:
    locals()['dummy_' + train_X[2:]] = pd.get_dummies(data = locals()[train_X], columns = categorical_cols)
    print(f"The variable, dummy_{train_X[2:]} is created")

The variable, dummy_train_4 is created
The variable, dummy_train_3 is created
The variable, dummy_train_2 is created
The variable, dummy_train_1 is created
The variable, dummy_outlier_train_4 is created
The variable, dummy_outlier_train_3 is created
The variable, dummy_outlier_train_2 is created
The variable, dummy_outlier_train_1 is created
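One thing to watch with `pd.get_dummies` is that it only creates columns for the categories present in the frame it is given. If a test file contains a month or season that a training file lacks, the two dummy frames end up with different columns. The sketch below shows one common way to align them with `reindex`; the toy frames are hypothetical and this step is not part of the notebook itself.

```python
import pandas as pd

categorical_cols = ['month', 'season']

# Hypothetical toy frames: the test frame contains month 6 and season 3,
# which never occur in the training frame
train = pd.DataFrame({'temperature': [1.0, 2.0, 3.0], 'month': [1, 2, 2], 'season': [1, 1, 2]})
test = pd.DataFrame({'temperature': [4.0, 5.0], 'month': [2, 6], 'season': [2, 3]})

dummy_train = pd.get_dummies(train, columns=categorical_cols, dtype=int)
dummy_test = pd.get_dummies(test, columns=categorical_cols, dtype=int)

# Keep exactly the training columns, in the training order:
# categories unseen in training are dropped, and categories missing
# from the test frame become all-zero columns
dummy_test_aligned = dummy_test.reindex(columns=dummy_train.columns, fill_value=0)

print(list(dummy_train.columns))
print(dummy_test_aligned)
```

After alignment the test matrix has the same width and column order as the training matrix, so a model fitted on `dummy_train.values` can score it.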


In [34]:
lr = LinearRegression(fit_intercept=True)
lr

LinearRegression()

In [35]:
naive_bayes = GaussianNB()

For hyperparameter tuning and feature selection, cross validation will be performed using the training set. With time series forecasting, it is important that test data comes from a later time period than the training data. This also applies to each fold in cross validation. Therefore a time series split is used to create three folds for cross validation as illustrated below. Each time series plot represents a separate training/test split, with the training set coloured in blue and the test set coloured in red. Note that, even in the first split, the training data covers at least a full year so that the model can learn the annual seasonality of the demand.

In [36]:
tscv = TimeSeriesSplit(n_splits=3)
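To make the fold structure concrete, the sketch below prints the indices that `TimeSeriesSplit(n_splits=3)` produces for twelve time-ordered samples. The array `X_demo` is a hypothetical stand-in for the real training data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 samples, already in time order

splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X_demo)
]
for train_idx, test_idx in splits:
    print('train:', train_idx, 'test:', test_idx)
# train: [0, 1, 2] test: [3, 4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7, 8]
# train: [0, 1, 2, 3, 4, 5, 6, 7, 8] test: [9, 10, 11]
```

In every fold the test indices come strictly after the training indices, and the training window grows from one fold to the next.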

Create the RFECV object. Note the metric for evaluating the model on each fold is the negative mean squared error. The best model is that which maximises this metric.

In [37]:
regr_cv = RFECV(estimator=lr,
             cv=tscv,
             scoring='neg_mean_squared_error',
             verbose = 2,
             n_jobs = -1)
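The sign convention can be checked directly: sklearn scorers follow a "greater is better" rule, so the `'neg_mean_squared_error'` scorer returns the mean squared error multiplied by -1. The sketch below uses synthetic data, and `X_demo`, `y_demo` and `model` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Synthetic regression data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# The scorer takes (estimator, X, y) and returns a value to maximise
scorer = get_scorer('neg_mean_squared_error')
score = scorer(model, X_demo, y_demo)

# The plain metric takes (y_true, y_pred) and returns a value to minimise
mse = mean_squared_error(y_demo, model.predict(X_demo))

print(score, -mse)  # the two values agree
```

Maximising this score is therefore the same as minimising the mean squared error.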

Create the model pipeline object.

In [38]:
lr_pipeline = Pipeline([('rfecv', regr_cv)])
lr_pipeline

Pipeline(steps=[('rfecv',
                 RFECV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None),
                       estimator=LinearRegression(), n_jobs=-1,
                       scoring='neg_mean_squared_error', verbose=2))])

In [39]:
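# NOTE: regr_cv wraps LinearRegression, so this pipeline is identical to lr_pipeline;
# the GaussianNB instance created above is never used (see the repr below).
naive_bayes_pipeline = Pipeline([('rfecv', regr_cv)])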
naive_bayes_pipeline

Pipeline(steps=[('rfecv',
                 RFECV(cv=TimeSeriesSplit(gap=0, max_train_size=None, n_splits=3, test_size=None),
                       estimator=LinearRegression(), n_jobs=-1,
                       scoring='neg_mean_squared_error', verbose=2))])

In [40]:
def save_model(model, model_name: str, DIR: str) -> None:
    # Pickle the fitted model object itself (not just its name) to DIR/<model_name>.pkl
    with open(os.path.join(DIR, model_name + '.pkl'), 'wb') as f:
        pickle.dump(model, f)

In [41]:
dummy_train_1

Unnamed: 0,temperature,solar_ghi,solar_prediction_mw,wind_prediction_mw,workdayornot,temperature_lag1,temperature_lag2,temperature_lag3,temperature_lag4,temperature_lag5,...,solar_prediction_mw_lag4,solar_prediction_mw_lag5,solar_prediction_mw_lag6,month_1,month_2,month_3,month_4,month_5,season_1,season_2
0,134.6351,0.0,0.0,53.9495,0,134.6573,134.6795,134.8876,134.9232,134.9587,...,0.0,0.0,0.0000,1,0,0,0,0,1,0
1,134.6129,0.0,0.0,50.5166,0,134.6351,134.6573,134.6795,134.8876,134.9232,...,0.0,0.0,0.0000,1,0,0,0,0,1,0
2,134.5925,0.0,0.0,47.2634,0,134.6129,134.6351,134.6573,134.6795,134.8876,...,0.0,0.0,0.0000,1,0,0,0,0,1,0
3,134.5740,0.0,0.0,44.7925,0,134.5925,134.6129,134.6351,134.6573,134.6795,...,0.0,0.0,0.0000,1,0,0,0,0,1,0
4,134.5555,0.0,0.0,42.2637,0,134.5740,134.5925,134.6129,134.6351,134.6573,...,0.0,0.0,0.0000,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14471,143.4997,0.0,0.0,127.1455,0,143.6210,143.7457,143.8703,143.9949,144.1505,...,0.0,0.0,0.6215,0,0,0,0,1,0,1
14472,143.3816,0.0,0.0,126.5843,0,143.4997,143.6210,143.7457,143.8703,143.9949,...,0.0,0.0,0.0000,0,0,0,0,1,0,1
14473,143.2635,0.0,0.0,124.9796,0,143.3816,143.4997,143.6210,143.7457,143.8703,...,0.0,0.0,0.0000,0,0,0,0,1,0,1
14474,143.1455,0.0,0.0,125.5090,0,143.2635,143.3816,143.4997,143.6210,143.7457,...,0.0,0.0,0.0000,0,0,0,0,1,0,1


In [59]:
for i in range(1, 5):
    try:
        # Linear Regression
        LR_DIR = os.path.join(WORKDIR, 'Linear_models')

        # processed data
        lr_pipeline.fit(locals()[f'dummy_train_{i}'].values, locals()[f'y_train_{i}'])
        save_model(lr_pipeline, f'lr_model_{i}', LR_DIR)

        # outliers-included data: features and target must come from the same DataFrame
        lr_pipeline.fit(locals()[f'dummy_outlier_train_{i}'].values, locals()[f'y_outlier_train_{i}'])
        save_model(lr_pipeline, f'lr_outliers_model_{i}', LR_DIR)
    except Exception as e:
        # Log the failure and carry on with the next model
        with open(os.path.join(WORKDIR, 'error_handler.txt'), 'a') as f:
            f.write('< LR_model >\n')
            f.write(f'{e}\n')
            f.write('=================\n')

    try:
        # Naive Bayes
        NB_DIR = os.path.join(WORKDIR, 'NaiveBayes_models')

        # processed data
        naive_bayes_pipeline.fit(locals()[f'dummy_train_{i}'].values, locals()[f'y_train_{i}'])
        save_model(naive_bayes_pipeline, f'NaiveBayes_model_{i}', NB_DIR)

        # outliers-included data: features and target must come from the same DataFrame
        naive_bayes_pipeline.fit(locals()[f'dummy_outlier_train_{i}'].values, locals()[f'y_outlier_train_{i}'])
        save_model(naive_bayes_pipeline, f'NaiveBayes_outliers_model_{i}', NB_DIR)
    except Exception as e:
        # Log the failure and carry on with the next iteration
        with open(os.path.join(WORKDIR, 'error_handler.txt'), 'a') as f:
            f.write('< NaiveBayes_model >\n')
            f.write(f'{e}\n')
            f.write('=================\n')

Fitting estimator with 36 features.
Fitting estimator with 36 features.
Fitting estimator with 36 features.
Fitting estimator with 35 features.
Fitting estimator with 34 features.
Fitting estimator with 35 features.
Fitting estimator with 35 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 34 features.
Fitting estimator with 34 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 33 features.
Fitting estimator with 32 features.
Fitting estimator with 29 features.
Fitting estimator with 33 features.
Fitting estimator with 28 features.
Fitting estimator with 27 features.
Fitting estimator with 31 features.
Fitting estimator with 30 features.
Fitting estimator with 26 features.
Fitting estimator with 32 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 29 features.
Fitting estimator with 22 fe