# Choosing window features using LASSO

In this notebook, we build a large set of lag, rolling window, and expanding window features with a pipeline, and then use LASSO as a feature selection method to reduce that set.


## Data set synopsis


We will work with the hourly electricity demand dataset, which records electricity demand for the state of Victoria in Australia from 2002 to the start of 2015.

For instructions on how to download, prepare, and store the dataset, refer to notebook number 4, in the folder "01-Create-Datasets" from this repo.


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_context("talk")

# Load data

In [2]:
data = pd.read_csv(
    "../Datasets/victoria_electricity_demand.csv",
    usecols=["demand", "temperature", "date_time"],
    index_col=["date_time"],
    parse_dates=["date_time"],
)

In [3]:
# For this demo we will use a subset of the data
data = data.loc["2010":]

In [4]:
data.head()

date_time,demand,temperature
2010-01-01 00:00:00,8314.448682,21.525
2010-01-01 01:00:00,8267.187296,22.4
2010-01-01 02:00:00,7394.528444,22.15
2010-01-01 03:00:00,6952.04752,21.8
2010-01-01 04:00:00,6867.199634,20.25


# Create lag and window features using a pipeline

In [5]:
from feature_engine.timeseries.forecasting import LagFeatures, WindowFeatures, ExpandingWindowFeatures
from feature_engine.imputation import DropMissingData
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [6]:
df = data.copy()

In [7]:
# Lag features
lag_transformer = LagFeatures(variables=["demand", "temperature"],
                              periods=[1, 2, 3, 24, 24 * 7])
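
To make the lag transformation concrete, here is a minimal pandas sketch of what a lag feature is. It assumes that a lag of `k` corresponds to shifting the series forward by `k` rows; the `toy` frame and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy hourly series; the values are illustrative only.
toy = pd.DataFrame(
    {"demand": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.date_range("2010-01-01", periods=5, freq=pd.Timedelta(hours=1)),
)

# A lag of k hours is the series shifted forward by k rows, so the value
# in row t comes from row t - k. The first k rows have no history (NaN).
for lag in [1, 2]:
    toy[f"demand_lag_{lag}"] = toy["demand"].shift(lag)

print(toy)
```

The leading missing values produced by the shift are the reason a step that drops incomplete rows is needed later in the pipeline.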

In [8]:
# Window features
window_transformer = WindowFeatures(
    variables=["demand", "temperature"],
    functions=["mean", "std", "kurt", "skew"],
    window=[24, 24 * 7, 24 * 7 * 4, 24 * 7 * 4 * 12],
    periods=1,
)
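
As a rough pandas equivalent of one window feature, the sketch below computes a rolling mean and then shifts it by one row, which is what `periods=1` is assumed to do here. The shift matters: without it, the window ending at time `t` would include the value we are trying to predict. The `toy` series is made up for illustration.

```python
import pandas as pd

# Toy series; the values are illustrative only.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], name="demand")

# Mean over the 3 most recent rows, then shifted by one row so that the
# feature at row t only uses observations up to row t - 1.
window_mean = toy.rolling(window=3).mean().shift(1)

print(window_mean)
```

A window of size `w` shifted by one row leaves the first `w` rows empty, so the largest window in a pipeline determines how many initial rows are lost.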

In [9]:
# Expanding features
expanding_window_transformer = ExpandingWindowFeatures(
    variables=["demand"], 
    functions=["mean", "std", "kurt", "skew"]
)

In [10]:
# Drop missing data introduced by window and lag features
imputer = DropMissingData()

In [11]:
pipe = Pipeline(
    [
        ("lag", lag_transformer),
        ("rolling", window_transformer),
        ("expanding", expanding_window_transformer),
        ("drop_missing", imputer)
    ]
)

df = pipe.fit_transform(df)
df

date_time,demand,temperature,demand_lag_1,temperature_lag_1,demand_lag_2,temperature_lag_2,demand_lag_3,temperature_lag_3,demand_lag_24,temperature_lag_24,...,demand_window_8064_kurt,demand_window_8064_skew,temperature_window_8064_mean,temperature_window_8064_std,temperature_window_8064_kurt,temperature_window_8064_skew,demand_expanding_mean,demand_expanding_std,demand_expanding_kurt,demand_expanding_skew
2010-12-03 00:00:00,7650.165828,17.825,8311.641438,18.325,8194.758870,18.650,8810.225934,19.000,7594.965872,18.500,...,-0.103898,0.422491,15.744401,5.742699,0.659439,0.778176,9842.090580,1804.188369,-0.103898,0.422491
2010-12-03 01:00:00,7927.140368,17.850,7650.165828,17.825,8311.641438,18.325,8194.758870,18.650,7914.538048,18.375,...,-0.104160,0.422422,15.743942,5.742385,0.660363,0.778423,9841.818798,1804.241597,-0.104074,0.422631
2010-12-03 02:00:00,7327.146056,17.675,7927.140368,17.850,7650.165828,17.825,8311.641438,18.325,7321.428112,17.875,...,-0.104297,0.422397,15.743378,5.741955,0.661546,0.778705,9841.581422,1804.255693,-0.104144,0.422815
2010-12-03 03:00:00,7088.725786,17.625,7327.146056,17.675,7927.140368,17.850,7650.165828,17.825,7045.315052,17.425,...,-0.104314,0.422377,15.742823,5.741552,0.662685,0.778992,9841.269728,1804.361038,-0.104420,0.422872
2010-12-03 04:00:00,7458.314830,17.625,7088.725786,17.625,7327.146056,17.675,7927.140368,17.850,7396.896962,17.275,...,-0.104304,0.422436,15.742305,5.741194,0.663728,0.779267,9840.928560,1804.509421,-0.104740,0.422842
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2015-02-28 19:00:00,9596.777060,28.350,9979.909902,30.850,10258.585392,31.550,10019.921572,31.250,9980.108798,19.700,...,-0.398990,0.291512,15.987450,5.252119,1.053306,0.793359,9463.653128,1752.041445,0.252346,0.499886
2015-02-28 20:00:00,8883.230296,22.200,9596.777060,28.350,9979.909902,30.850,10258.585392,31.550,9411.874558,18.750,...,-0.398995,0.291267,15.988492,5.253737,1.051447,0.793594,9463.656071,1752.022191,0.252414,0.499886
2015-02-28 21:00:00,8320.260550,18.900,8883.230296,22.200,9596.777060,28.350,9979.909902,30.850,8653.510960,18.300,...,-0.398790,0.291143,15.988814,5.254039,1.050537,0.793438,9463.643240,1752.004951,0.252485,0.499911
2015-02-28 22:00:00,8110.055916,18.900,8320.260550,18.900,8883.230296,22.200,9596.777060,28.350,8256.683092,18.150,...,-0.398505,0.291119,15.988796,5.254028,1.050578,0.793450,9463.617965,1751.993833,0.252528,0.499947


In [12]:
# Split the data into a training set and a test set.
# The most recent 24 hours are held out as the test set.
split_date = df.index[-1] - pd.Timedelta("1D")
df_train = df[df.index <= split_date]
df_test = df[df.index > split_date]

In [13]:
df_train.tail()

date_time,demand,temperature,demand_lag_1,temperature_lag_1,demand_lag_2,temperature_lag_2,demand_lag_3,temperature_lag_3,demand_lag_24,temperature_lag_24,...,demand_window_8064_kurt,demand_window_8064_skew,temperature_window_8064_mean,temperature_window_8064_std,temperature_window_8064_kurt,temperature_window_8064_skew,demand_expanding_mean,demand_expanding_std,demand_expanding_kurt,demand_expanding_skew
2015-02-27 19:00:00,9980.108798,19.7,10068.040568,20.85,10483.536412,21.55,10960.255988,22.5,10190.802512,19.7,...,-0.401271,0.293753,15.975459,5.238756,1.063314,0.792157,9464.004063,1752.302275,0.251445,0.499568
2015-02-27 20:00:00,9411.874558,18.75,9980.108798,19.7,10068.040568,20.85,10483.536412,21.55,9610.025236,18.6,...,-0.401475,0.2936,15.975663,5.238869,1.062868,0.792025,9464.015478,1752.284576,0.251491,0.499553
2015-02-27 21:00:00,8653.51096,18.3,9411.874558,18.75,9980.108798,19.7,10068.040568,20.85,8719.930158,17.8,...,-0.401489,0.293441,15.975769,5.238916,1.062665,0.791956,9464.014325,1752.265215,0.251564,0.499561
2015-02-27 22:00:00,8256.683092,18.15,8653.51096,18.3,9411.874558,18.75,9980.108798,19.7,8271.486968,17.8,...,-0.401304,0.293361,15.975905,5.238962,1.062443,0.791866,9463.996399,1752.249983,0.251627,0.499591
2015-02-27 23:00:00,8716.498334,17.8,8256.683092,18.15,8653.51096,18.3,9411.874558,18.75,8800.900636,17.65,...,-0.401094,0.293336,15.976079,5.239011,1.06219,0.791753,9463.969697,1752.239804,0.251666,0.499627


In [14]:
df_test.head()

date_time,demand,temperature,demand_lag_1,temperature_lag_1,demand_lag_2,temperature_lag_2,demand_lag_3,temperature_lag_3,demand_lag_24,temperature_lag_24,...,demand_window_8064_kurt,demand_window_8064_skew,temperature_window_8064_mean,temperature_window_8064_std,temperature_window_8064_kurt,temperature_window_8064_skew,demand_expanding_mean,demand_expanding_std,demand_expanding_kurt,demand_expanding_skew
2015-02-28 00:00:00,8003.228986,17.65,8716.498334,17.8,8256.683092,18.15,8653.51096,18.3,8121.868698,17.05,...,-0.400972,0.29326,15.976265,5.239049,1.061962,0.791634,9463.953166,1752.223953,0.251731,0.499656
2015-02-28 01:00:00,7522.86262,17.2,8003.228986,17.65,8716.498334,17.8,8256.683092,18.15,7629.796248,16.55,...,-0.400779,0.293268,15.976438,5.239082,1.061758,0.791524,9463.920861,1752.218042,0.251751,0.499693
2015-02-28 02:00:00,7156.310422,17.2,7522.86262,17.2,8003.228986,17.65,8716.498334,17.8,7316.774812,15.95,...,-0.400588,0.293403,15.9766,5.239099,1.061606,0.791426,9463.877934,1752.222443,0.251728,0.499721
2015-02-28 03:00:00,7074.676782,17.2,7156.310422,17.2,7522.86262,17.2,8003.228986,17.65,7280.901386,15.3,...,-0.400539,0.293662,15.976804,5.239115,1.061434,0.791303,9463.826904,1752.23667,0.251676,0.499735
2015-02-28 04:00:00,7204.031944,17.2,7074.676782,17.2,7156.310422,17.2,7522.86262,17.2,7603.832712,14.25,...,-0.400547,0.293967,15.977034,5.239128,1.061256,0.791167,9463.77407,1752.253314,0.251617,0.499744


# Use LASSO to select features

Let's create the target and features. 

In [15]:
# Create target variable
y_train = df_train["demand"]

# Drop demand and temperature from the features: their values at the
# forecast time are not known when we make the prediction.
X_train = df_train.drop(columns=["demand", "temperature"])

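We standardize the features before fitting LASSO because the L1 penalty acts on the size of the coefficients, and coefficient size depends on the scale of each feature.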


In [16]:
# Standardize the features, keeping the column names and the datetime index
X_train_ = StandardScaler().fit_transform(X_train)
X_train = pd.DataFrame(data=X_train_, columns=X_train.columns, index=X_train.index)
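
If the held-out set is scaled later, one common approach is to fit the scaler on the training data only and reuse it for the test data. The sketch below illustrates this on synthetic data; `X_tr`, `X_te`, and the column names are hypothetical and stand in for the real feature frames.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
cols = ["small_scale", "large_scale"]
X_tr = pd.DataFrame(rng.normal(loc=[0, 1000], scale=[1, 200], size=(100, 2)), columns=cols)
X_te = pd.DataFrame(rng.normal(loc=[0, 1000], scale=[1, 200], size=(20, 2)), columns=cols)

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_tr)

# Apply the same transformation to both sets, keeping names and index.
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=cols, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=cols, index=X_te.index)

print(X_tr_scaled.describe().loc[["mean", "std"]])
```

Fitting the scaler on the training data alone keeps information from the test period out of the preprocessing step.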

In [17]:
from sklearn.linear_model import Lasso

In [18]:
model = Lasso(alpha=1, random_state=0)
model.fit(X_train, y_train)
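
The value `alpha=1` above is a fixed choice. As one possible alternative, the sketch below picks `alpha` by cross-validation with splits that respect time order, using scikit-learn's `LassoCV` and `TimeSeriesSplit`. The synthetic `X` and `y`, the grid of candidate values, and the number of splits are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data: only the first two features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Candidate regularization strengths (hypothetical grid).
alphas = np.logspace(-3, 1, 20)

# Each split trains on earlier rows and validates on later rows.
cv = TimeSeriesSplit(n_splits=5)

search = LassoCV(alphas=alphas, cv=cv, random_state=0)
search.fit(X, y)

print("chosen alpha:", search.alpha_)
```

Larger values of `alpha` shrink more coefficients to exactly zero, so the chosen value directly controls how many features survive.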

In [19]:
feature_importances = pd.Series(index=X_train.columns, data=model.coef_)

In [20]:
feature_importances.abs().sort_values(ascending=False)

demand_lag_1                    2190.943047
demand_lag_2                     877.510005
demand_lag_168                   213.888675
temperature_lag_1                190.948998
demand_window_24_mean            147.480480
temperature_lag_3                103.812987
demand_lag_24                    102.349227
demand_window_24_skew             63.210775
temperature_lag_168               55.132948
demand_window_168_mean            41.089953
demand_lag_3                      36.014360
temperature_window_24_mean        32.886558
demand_window_672_mean            21.355831
temperature_window_24_std         17.670526
temperature_window_24_skew        13.614140
temperature_lag_24                12.771168
demand_expanding_kurt              9.456486
demand_expanding_mean              7.193448
demand_window_672_kurt             5.706723
temperature_window_24_kurt         5.626029
temperature_window_168_skew        4.992910
temperature_lag_2                  4.831044
temperature_window_8064_std     

The lag features have the largest coefficients, led by `demand_lag_1` and `demand_lag_2`, but several window features, such as `demand_window_24_mean`, also receive non-zero coefficients. The dominance of the short lags is expected: the most recent observations tend to be highly predictive of the next one.
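
To turn the coefficients into an actual feature subset, one simple option is to keep only the features whose LASSO coefficient is non-zero. The sketch below shows the idea on synthetic data; the frame `X`, the feature names, and `alpha=0.5` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only feature_0 and feature_1 drive the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(500, 6)),
    columns=[f"feature_{i}" for i in range(6)],
)
y = 5.0 * X["feature_0"] - 3.0 * X["feature_1"] + rng.normal(scale=0.1, size=500)

lasso = Lasso(alpha=0.5).fit(X, y)

# Features with a coefficient of exactly zero were dropped by the L1 penalty.
coefs = pd.Series(lasso.coef_, index=X.columns)
selected = coefs[coefs != 0].index.tolist()

# Reduced feature matrix to pass on to a downstream model.
X_selected = X[selected]

print(selected)
```

The same filter could be applied to `feature_importances` in this notebook, with the resulting column list used to subset both the training and test features.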