# Train Neural Network

This notebook serves as an example of how to train and test a model yourself on your own data.

### Imports

In [1]:
import numpy as np 
import pandas as pd
import os
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from tensorflow import keras
import pickle

# add the src directory of this repo to the system path, then import the necessary functions
sys.path.append(os.path.dirname(os.getcwd())+'/src') 
from utils import *
from model import * 

### Settings and Configurations

First, configure which features to use, plus the width and overlap of the sliding windows for feature creation. The configurations refer to the data examples provided in the ```data/customer_data/``` folder. The assumption is that the data of every individual household is stored in one file and that the energy consumption and weather data, along with their timestamps, are stored in the same data frame. While we use energy data in 15 min resolution, you could also train models on other resolutions. In that case, you may want to adjust the feature calculation (especially the temporal features).
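
To illustrate what "encoding the time as cyclic features" (the `feat_time` setting) can mean, here is a minimal sketch. It is not the implementation in `utils.py`, which is not shown in this notebook; the helper `cyclic_time_features` and its `period` argument are hypothetical. The sketch maps the minute of the day onto a circle with sine and cosine, so that 23:45 and 00:00 end up as close together as 00:00 and 00:15.

```python
import numpy as np

def cyclic_time_features(minutes_of_day, period=24 * 60):
    """Map minutes since midnight onto the unit circle (hypothetical helper)."""
    angle = 2.0 * np.pi * np.asarray(minutes_of_day, dtype=float) / period
    # two columns: sine and cosine of the daily angle
    return np.column_stack([np.sin(angle), np.cos(angle)])

# 15 min resolution gives 96 time slots per day
slots = np.arange(0, 24 * 60, 15)
features = cyclic_time_features(slots)

# distance between 23:45 and 00:00 equals distance between 00:00 and 00:15
wrap_distance = np.linalg.norm(features[-1] - features[0])
step_distance = np.linalg.norm(features[1] - features[0])
```

If you train on a different resolution, only the spacing of `slots` changes in this sketch. The daily period stays the same.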

In [2]:
#--------------------------
# FEATURES CONFIGURATION
#--------------------------
feat_time = True # whether or not to encode the time as cyclic features 
feat_energymean = True # whether or not to use the mean energy of the sequence as feature
feat_weather = True # whether or not to use the weather as feature 
feat_dhw = True # whether or not to use the DHW production as feature
weather_columns = ['daily_avgtemp', 'hourly_avgtemp', 'daily_maxtemp', 'daily_mintemp'] # columns in data frame to be used for weather features

#-----------------------------
# SLIDING WINDOW CONFIGURATION
#-----------------------------
window_length = 8 # width of sliding window - 8 values for 15 min intervals means 2 hours
window_overlap = 4 # overlap of sliding windows - 4 values for 15 min intervals means 1 hour

#--------------------------------------
# DEFINE DATA FRAME COLUMNS TO BE USED
#--------------------------------------
key_consumption_input = 'kWh_aggregated' # column name of the input energy consumption
key_consumption_output = 'kWh_heat_pump' # column name of the output energy consumption to be learned
key_timestamp = 'timestamp' # column name of the timestamp

### Create Sliding Window Data
The following code prepares the arrays of features used as input by the neural network and the arrays of targets (energy sequences) to be learned by the network. The corresponding functions assume that the data is stored in one data frame per customer, which holds the energy consumption and weather data as separate columns. Consequently, the example below creates a dictionary to hold the pre-processed data for each customer. The keys of the dictionary are the customer IDs, and the values are dictionaries with the input and output values for the network in the form of arrays. In these arrays, each row corresponds to one sliding window calculated from the original time series. This approach makes it easy to decide in a later step which households to use for training and which for testing.
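
As a rough illustration of "one row per sliding window", the sketch below cuts a series into overlapping windows. It assumes that consecutive windows start `window_length - window_overlap` values apart and that an incomplete trailing window is dropped. The repository's own function may handle these details differently, and the helper `sliding_windows` is hypothetical.

```python
import numpy as np

def sliding_windows(values, window_length, window_overlap):
    """Cut a 1-D series into overlapping windows, one window per row (hypothetical helper)."""
    step = window_length - window_overlap  # assumed distance between window starts
    if step <= 0:
        raise ValueError("window_overlap must be smaller than window_length")
    values = np.asarray(values, dtype=float)
    n_windows = (len(values) - window_length) // step + 1
    if n_windows <= 0:
        return np.empty((0, window_length))
    starts = np.arange(n_windows) * step
    # incomplete windows at the end of the series are dropped
    return np.stack([values[s:s + window_length] for s in starts])

# one day of 15 min readings = 96 values
series = np.arange(96)
windows = sliding_windows(series, window_length=8, window_overlap=4)
```

With the settings used in this notebook (8 values per window, 4 values overlap), each window covers 2 hours and starts 1 hour after the previous one.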

**NOTE:** The code below depends on the settings chosen above (e.g. which features are used and how the sliding windows are configured). The order of the features also matters, as the models provided in this repository were trained with the features in the order defined here. The code below also shows how to save and restore the data dictionary, so that for large data sets you only have to create it once per configuration. The approach below only serves as an example, and you are free to handle the feature creation in any other way that fits your workflow.

**Note (once again): if you change the feature-related settings above, you need to rerun the following cell to create a new data dictionary.**
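
The "create once per configuration" idea can be sketched independently of this repository as a small load-or-build helper. Everything here is an assumption for illustration: `load_or_create`, the `config` dictionary, and the toy `build` function are hypothetical and only use the standard library.

```python
import os
import pickle
import tempfile

def load_or_create(path, build_fn):
    """Return the cached object at `path`, or build it, cache it, and return it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# hypothetical configuration encoded in the cache file name,
# so that every configuration gets its own file
config = {"time": True, "weather": True, "windowlength": 8, "overlap": 4}
file_name = "data_dict_" + "_".join(f"{k}{v}" for k, v in config.items()) + ".pkl"

cache_path = os.path.join(tempfile.mkdtemp(), "data_temp", file_name)
calls = []

def build():
    calls.append(1)  # count how often the expensive step actually runs
    return {"customer_1": {"X": [[0.1, 0.2]], "y": [[0.05, 0.1]]}}

first = load_or_create(cache_path, build)   # builds and saves
second = load_or_create(cache_path, build)  # loads from file
```

Because the configuration is part of the file name, changing a setting leads to a new file instead of silently reusing stale features.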

In [3]:
#-------------------------------------------------------
# OPTIONAL: LOAD DICTIONARY IF IT HAS BEEN SAVED EARLIER
#-------------------------------------------------------

# define the folder where the dictionary should be saved and in case it does not exist create it
folder = os.path.dirname(os.getcwd()) + '/data_temp/'
mkdir(folder)

# create a unique file name according to the features and sliding window configuration
file_name = 'data_dict_time{}_energymean{}_weather{}_dhw{}_windowlength{}_overlap{}_weathercols{}.pkl'.format(feat_time, feat_energymean, feat_weather, feat_dhw, window_length, window_overlap, str(weather_columns).replace('\'', '').replace(' ', '').replace('[', '').replace(']', '').replace(',', '-'))

# when the dictionary has been created previously, load it from file
if file_exists(folder + file_name):

    print('Data dictionary already exists. Loading it from file...')
    with open(folder + file_name, 'rb') as f:
        data_dict = pickle.load(f)
    print('Data dictionary loaded from file.')

#--------------------------
# PREPARE DATA DICTIONARY
#--------------------------

else: 
    print('Creating data dictionary...')

    # load overview of systems with available data 
    df_systems = pd.read_csv(os.path.dirname(os.getcwd()) + '/data/meta_data.csv')

    # create a dictionary to save the data with customerID as key and a dictionary with features and targets as value
    data_dict = {}

    # loop over each HP system and create features and targets 
    for idx, row in tqdm(df_systems.iterrows(), total=df_systems.shape[0]):

        # get customer id and information whether HP is responsible for DHW production
        customer_id = row['customer_id']
        dhw_value = row['dhw_production']

        # load original smart meter and weather data frame 
        df = pd.read_csv('{}/data/customer_data/{}.csv'.format(os.path.dirname(os.getcwd()), customer_id))

        # check that the data is valid 
        if len(df) == 0: 
            continue

        # create the features as input to the network (i.e. X)
        np_features = create_features_array_from_original_data_frame(df, window_length, window_overlap, key_consumption_input, dhw_value=dhw_value, key_timestamp=key_timestamp, feat_energymean=feat_energymean, feat_time=feat_time, feat_weather=feat_weather, feat_dhw=feat_dhw, weather_columns=weather_columns)
        
        # create the targets as output of the network (i.e. Y)
        np_targets = create_features_array_from_original_data_frame(df, window_length, window_overlap, key_consumption_output, dhw_value=dhw_value, key_timestamp=key_timestamp, feat_energymean=False, feat_time=False, feat_weather=False, feat_dhw=False, weather_columns=weather_columns)

        # make sure that features and targets match in their length
        if np_targets.shape[0] > np_features.shape[0]:
            np_targets = np.delete(np_targets, -1, axis=0)

        elif np_targets.shape[0] < np_features.shape[0]:
            np_features = np.delete(np_features, -1, axis=0)

        np.testing.assert_equal(np_features.shape[0], np_targets.shape[0])

        # add to dictionary 
        data_dict[customer_id] = {'X': np_features, 'y': np_targets, 'original_data' : df}

#----------------------------------------------
# OPTIONAL: SAVE DATA DICTIONARY FOR LATER USE
#----------------------------------------------
    
# save the dictionary
if not file_exists(folder+file_name):

    print('Saving data dictionary to file...')
    with open(folder + file_name, 'wb') as f:
        pickle.dump(data_dict, f)
    print('Data dictionary saved to file.')


Creating data dictionary...


  0%|          | 0/25 [00:00<?, ?it/s]

100%|██████████| 25/25 [00:01<00:00, 18.28it/s]


Saving data dictionary to file...
Data dictionary saved to file.


### Create Training and Test Set Through Concatenation
In the previous cell, we created the features and targets for our desired configuration and stored them in a dictionary (for each household separately). Next, we need to decide which households to use for training and which for testing. In the file ```data/meta_data.csv```, we list which households belong to which fold of a 5-fold cross-validation. We therefore select one fold and use all of its households for testing, while all others are used for training.

**NOTE:** In practice, you would probably combine this cell with the cells below and loop over the folds, training and evaluating one model per fold and then calculating the average scores across all folds. At least, this is what we did in the paper. Although not implemented here, you could consider storing and loading the final feature and target arrays for each fold using ```np.save(...)``` and ```np.load(...)```, as long as the assignment of households to folds does not change.
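
A loop over the folds with cached arrays could look roughly like the following sketch. It uses toy data and hypothetical names (`build_training_arrays`, `fold_of`, the file name pattern) and leaves out the actual training and evaluation, which would go where the comment indicates.

```python
import os
import tempfile
import numpy as np

def build_training_arrays(data_dict, fold_of, test_fold):
    """Concatenate X and y of all households that are not in the test fold."""
    train_ids = [cid for cid, fold in fold_of.items() if fold != test_fold]
    X = np.concatenate([data_dict[cid]["X"] for cid in train_ids], axis=0)
    y = np.concatenate([data_dict[cid]["y"] for cid in train_ids], axis=0)
    return X, y

# toy stand-ins for meta_data.csv and the data dictionary (10 households, 5 folds)
rng = np.random.default_rng(0)
fold_of = {f"house_{i}": i % 5 for i in range(10)}
data_dict = {cid: {"X": rng.normal(size=(6, 3)), "y": rng.normal(size=(6, 8))}
             for cid in fold_of}

cache_dir = tempfile.mkdtemp()
shapes = {}
for test_fold in range(5):
    x_path = os.path.join(cache_dir, f"X_train_fold{test_fold}.npy")
    y_path = os.path.join(cache_dir, f"y_train_fold{test_fold}.npy")
    if os.path.isfile(x_path) and os.path.isfile(y_path):
        X_train, y_train = np.load(x_path), np.load(y_path)
    else:
        X_train, y_train = build_training_arrays(data_dict, fold_of, test_fold)
        np.save(x_path, X_train)
        np.save(y_path, y_train)
    # train and evaluate one model for this fold here
    shapes[test_fold] = (X_train.shape, y_train.shape)
```

Collecting the arrays in a list and concatenating once also avoids repeatedly copying a growing array, which can matter for large data sets.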

In [4]:
#----------------------------------------
# SELECT FOLD
#----------------------------------------

# NOTE: the current fold is the one used for testing and all other folds are used for training 
# NOTE: in practice you would loop over the folds and train and evaluate the model for each fold, then average the scores for final reporting 
current_fold = 0

# define households to be used for training and evaluation 
df_systems = pd.read_csv(os.path.dirname(os.getcwd()) + '/data/meta_data.csv')
households_train = df_systems[df_systems['kfold'] != current_fold]['customer_id'].values
households_test = df_systems[df_systems['kfold'] == current_fold]['customer_id'].values

#----------------------------------------
# PREPARE TRAINING DATA SET 
#----------------------------------------

# create training set by concatenating features and targets of the different households that should be used for training
X_train = None 
y_train = None

# loop over training households and concatenate
for customer_id in tqdm(households_train):

    # create features 
    if X_train is None:
        X_train = data_dict[customer_id]['X']
    else: 
        X_train = np.concatenate([X_train, data_dict[customer_id]['X']], axis=0)

    # create targets
    if y_train is None:
        y_train = data_dict[customer_id]['y']
    else:
        y_train = np.concatenate([y_train, data_dict[customer_id]['y']], axis=0)

# drop rows with nan values in either X_train or y_train - but needs to be done in both arrays correspondingly
# NOTE: axis=1 checks every row (i.e. every sliding window) - axis=0 would return column indices instead
indices = np.where(np.isnan(X_train).any(axis=1) | np.isnan(y_train).any(axis=1))[0]
if len(indices) > 0:
    print('Dropping {} rows due to nan values in either features or targets.'.format(len(indices)))
    X_train = np.delete(X_train, indices, axis=0)
    y_train = np.delete(y_train, indices, axis=0)

# NOTE: test data set is not created here because evaluation will happen in a later stage and will rather be handled individually per household

100%|██████████| 20/20 [00:00<00:00, 270.98it/s]
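
Rows with missing values have to be removed from the features and the targets in step, otherwise the two arrays no longer line up. As a minimal sketch, assuming both arrays are two-dimensional with one sliding window per row, a boolean mask keeps the two in sync (the helper `drop_nan_rows` is hypothetical):

```python
import numpy as np

def drop_nan_rows(X, y):
    """Remove every row that contains a NaN in X or in y from both arrays."""
    keep = ~(np.isnan(X).any(axis=1) | np.isnan(y).any(axis=1))
    return X[keep], y[keep]

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, np.nan], [0.7, 0.8]])

# row 1 has a NaN in X, row 2 has a NaN in y, so both rows are dropped
X_clean, y_clean = drop_nan_rows(X, y)
```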


### Create and Train the Model
Below, an example model is trained for 20 epochs. While we keep the configuration simple, note that more parameters can be configured (see the ```__init__()``` function in ```model.py```). Set the parameter ```apple_silicon``` to ```True``` on Apple silicon chips: it selects a legacy version of the Adam optimizer, which is recommended for the TensorFlow version configured during the Anaconda installation of this repo.

**NOTE:** The code is currently written so that it uses all available CPUs, but it does not consider GPU training or any other speed optimization because this is highly individual. You may want to adjust these things according to your own needs and setup after forking the repo. Please consider the code provided as the bare minimum to start with.

**NOTE:** With the ```basepath``` parameter you have the option to provide the model a path to a folder where it should store the models and tensorboard files. When ```basepath=None```, the model instance will create a folder named ```results/<model_name>``` to store relevant files and tensorboards will be stored in ```results/tensorboards/<model_name>```. 

In [5]:
#----------------------------------------
# CREATE AND TRAIN MODEL 
#----------------------------------------
        
# define model parameters
# NOTE: more parameters can be configured - see model.py --> __init__() for more details - for this example, we just keep it simple
layer_shapes = [50, 100, 200, 100, 50]
epochs = 20
model_name = 'test_model' # NOTE: if no model name is provided, the model instance would create a unique model name based on the fold and the chosen configs
basepath = None # option to provide a path to a folder where the models and tensorboards should be saved - if None, results folder in this repo will be created
apple_silicon = True

# create model 
model = Model(model_name, current_fold, layer_shapes=layer_shapes, epochs=epochs, apple_silicon=apple_silicon, basepath=basepath)

# train model 
model.fit(X_train, y_train, validation_split=0.1, shuffle_data=True, verbose=1)

[24/02/19 17:04:50 test_model] [DEBUG] Handler: Intialization successful. Name: test_model
[24/02/19 17:04:50 test_model] [DEBUG] Model.set_epochs(): Number of epochs set to 20.
[24/02/19 17:04:50 test_model] [DEBUG] Model.set_batch_size(): Batch size set to 2048.
[24/02/19 17:04:50 test_model] [DEBUG] Model.__init__(): Object creation successful.
[24/02/19 17:04:50 test_model] [INFO] Model.fit(): Starting training.
[24/02/19 17:04:50 test_model] [DEBUG] Model.create_callbacks(): Callbacks created.
[24/02/19 17:04:50 test_model] [DEBUG] Model.__init_nn__(): Initialization successful.
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
[24/02/19 17:05:12 test_model] [DEBUG] Model.fit(): Reloaded best model from checkpoint.
[24/02/19 17:05:12 test_model] [INFO] Model.save_x_scaler(): X_Scaler saved as: /Users/to

### Evaluate Model With Test Data

To evaluate the model, we loop over all households belonging to the fold that was not used for training. Using the ```ResultReporter``` class from the ```resultreporter.py```-file, we can calculate the score per household. The model evaluation can then be done by averaging across the scores.

**NOTE:** In this repo, we only provide a minimum amount of data to show you how to use the code, therefore the results below cannot be interpreted as reflecting real performance.

In [6]:
#----------------------------------------
# SELECT FOLD
#----------------------------------------

# NOTE: the current fold is the one used for testing and all other folds are used for training 
# NOTE: in practice you would loop over the folds and train and evaluate the model for each fold, then average the scores for final reporting 
current_fold = 0

# define households to be used for training and evaluation 
df_systems = pd.read_csv(os.path.dirname(os.getcwd()) + '/data/meta_data.csv')
households_train = df_systems[df_systems['kfold'] != current_fold]['customer_id'].values
households_test = df_systems[df_systems['kfold'] == current_fold]['customer_id'].values

#--------------------------------------------
# APPLY MODEL TO TEST HOUSEHOLDS AND EVALUATE
#--------------------------------------------

# create a data frame to store the evaluation scores for all households 
df_eval = None 

# loop over customer households 
for customer_id in tqdm(households_test):

    # get sliding window features and original data frame from the data dictionary
    np_test_features = data_dict[customer_id]['X']
    df_original = data_dict[customer_id]['original_data']

    # make a prediction for the test features 
    np_test_predictions = model.predict(np_test_features)

    # transform the sliding window predictions back into time series and add as new column to original data frame
    df_predicted = create_data_series_from_sliding_windows(np_test_predictions, window_overlap, df_original=df_original, new_column_name='kWh_heat_pump_predicted')
    df_predicted.dropna(subset=['kWh_heat_pump_predicted', 'kWh_heat_pump'], inplace=True)

    # create an instance of the result reporter, which calculates the scores - NOTE: need to create one instance per household! 
    reporter = ResultReporter('regression', groundtruth=df_predicted['kWh_heat_pump'].values, predictions=df_predicted['kWh_heat_pump_predicted'].values)
    df_scores = reporter.getResultDataFrame(only_relevant_metrics=True, parameter_evaluated='{}'.format(customer_id))
    if df_eval is None:
        df_eval = df_scores
    else: 
        df_eval = pd.concat([df_eval, df_scores], axis=0)

# final adjustments to evaluation scores
df_eval.reset_index(inplace=True, drop=True)
df_eval.drop(columns=['task_type'], inplace=True)
df_eval.rename(columns={'parameter_evaluated': 'customer_id'}, inplace=True)
df_eval.insert(0, 'fold', current_fold)

# save the evaluation scores to file
handler = model.get_handler()
df_eval.to_csv(handler.basepath + 'evaluation_scores_fold{}.csv'.format(current_fold), index=False)
display(df_eval)

100%|██████████| 5/5 [00:00<00:00,  5.15it/s]


| | fold | customer_id | meanAbsoluteError | meanSquaredError | rootMeanSquaredError | rootMeanSquaredLogError | R2Score | maxResidualError | medianAbsoluteError |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8110168 | 0.395554 | 0.449508 | 0.670453 | -0.399801 | -0.705767 | 3.733869 | 0.189453 |
| 1 | 0 | 2104831 | 0.389896 | 0.270485 | 0.520082 | -0.653769 | -0.127457 | 3.602547 | 0.300877 |
| 2 | 0 | 1077878 | 0.166093 | 0.072521 | 0.269297 | -1.311939 | 0.674504 | 2.180855 | 0.075477 |
| 3 | 0 | 1058711 | 1.091673 | 4.989607 | 2.233743 | 0.803679 | -40.173324 | 9.332566 | 0.465779 |
| 4 | 0 | 1222107 | 0.386123 | 0.67293 | 0.820323 | -0.198057 | -7.262323 | 5.04699 | 0.202847 |
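
The final averaging step is not part of the cell above. A minimal sketch, using the ```meanAbsoluteError``` values from the example output of this notebook, could look as follows. The variable names are hypothetical, and, as noted above, the numbers come from a minimal data set and do not reflect real performance.

```python
import numpy as np

# meanAbsoluteError per household, copied from the example output above
mae_per_household = {
    8110168: 0.395554,
    2104831: 0.389896,
    1077878: 0.166093,
    1058711: 1.091673,
    1222107: 0.386123,
}

values = np.array(list(mae_per_household.values()))
mean_mae = values.mean()
median_mae = np.median(values)  # less sensitive to a single outlier household
worst_household = max(mae_per_household, key=mae_per_household.get)
```

Reporting the median next to the mean can help when a single household dominates the average, as household 1058711 does in this toy output.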
