# Keras + TensorFlow and Hyperopt Python tutorial
Made by Ties van der Heijden, TU Delft

In this exercise we will continue with Dutch DAM price forecasting. This time we will specify our neural network in detail and optimize its hyperparameters using Hyperopt.

To do this, the following packages are necessary:
- NumPy
- pandas
- Matplotlib
- TensorFlow
- Hyperopt

And some specific functions are handy:
- scikit-learn: KFold, StandardScaler
- pathlib: Path


PS: Be sure to create a new environment for TF + Keras + Hyperopt, since mixing pip and Anaconda installs in one environment can cause package conflicts later on. If conflicts do occur, it is better to have them in a fresh Python environment than in your main one.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow
from tensorflow import keras
import tensorflow.keras.backend as K

from hyperopt import hp, fmin, tpe, STATUS_OK, Trials, space_eval

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from pathlib import Path

import pickle

from tensorflow.keras.layers import LeakyReLU

from termcolor import colored

print(''), print(colored('Finished','green'));


Finished


## Keras model creation - MLP

Start the function with the following command:
```python
tensorflow.keras.backend.clear_session()
```
Otherwise Keras keeps all previously trained models in RAM, which slowly fills up the memory.<br>

First we will build a function that returns a Keras MLP as a function of its hyperparameters. Use the following hyperparameters:
- Hidden nodes in layer 1
- Hidden nodes in layer 2
- Activation function of the hidden layers
- Loss function (see keras.losses)
- Dropout rate
- Weight initialization (see keras.initializers)

We will fix some things in the model:
- Use the SGD algorithm to train the model. The optimizer's parameters can be included as variables for the function (learning_rate, called lr in older Keras versions, momentum and nesterov), see keras.optimizers.SGD.
- For kernel regularization we will use an L2 regularizer with a 1e-4 penalty term (see keras.regularizers). This keeps the weights small, which helps against overfitting.

Build the following Keras sequential model<br>
Layer 1: Hidden layer 1 (see keras.layers.Dense)<br>
Layer 2: Dropout layer (see keras.layers.Dropout)<br>
Layer 3: Hidden layer 2<br>
Layer 4: Output layer - think about the activation function to be used in the output layer.<br>
**Compile the model with the specified loss function and optimizer, and return it!**<br>

Make sure that the function returns the model, so that the following code would work:<br>
model = model_build_function(params)<br>
fit = model.fit(x = ..., y = ..., batch_size = ..., epochs = ...)<br>

<ins>Handy link:</ins><br>
https://keras.io/api/
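
As a rough illustration of the specification above, a model-building function could look like the sketch below. The name `build_mlp`, the dictionary keys and the example values are assumptions made for this sketch, and the 24 outputs are taken from the `Dense(24)` layer used in the code of this notebook. The learning rate is passed in directly here, without the division by 1000.

```python
from tensorflow import keras


def build_mlp(params, n_features, n_outputs=24):
    """Return a compiled two-hidden-layer MLP (sketch; key names are assumptions)."""
    keras.backend.clear_session()  # drop models left over from earlier calls

    l2 = keras.regularizers.l2(1e-4)  # fixed L2 penalty on the kernels
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        keras.layers.Dense(int(params["units1"]),
                           activation=params["activation"],
                           kernel_initializer=params["weight_init"],
                           kernel_regularizer=l2),
        keras.layers.Dropout(params["dropout"]),
        keras.layers.Dense(int(params["units2"]),
                           activation=params["activation"],
                           kernel_initializer=params["weight_init"],
                           kernel_regularizer=l2),
        # Linear output layer, since the targets are unbounded regression values
        keras.layers.Dense(n_outputs, activation="linear"),
    ])

    optimizer = keras.optimizers.SGD(learning_rate=params["learning_rate"],
                                     momentum=params["momentum"],
                                     nesterov=params["nesterov"])
    model.compile(loss=params["loss"], optimizer=optimizer)
    return model


# Hypothetical hyperparameter values, only to show the call
example_params = {
    "units1": 200, "units2": 100, "activation": "relu", "dropout": 0.2,
    "weight_init": "random_normal", "loss": "mae",
    "learning_rate": 0.01, "momentum": 0.3, "nesterov": True,
}
model = build_mlp(example_params, n_features=10)
model.summary()
```

Because the function only builds and compiles the model, the training call stays outside it, which makes it easy to reuse the same function inside a cross-validation loop.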

In [3]:
def neural_net(params):

    tensorflow.keras.backend.clear_session()

    print ('Params testing: ', params)
    model = Sequential()
    # Quantized hyperopt distributions return floats, so cast the layer sizes to int
    model.add(Dense(int(params['units1']), input_dim = x_train_array.shape[1], kernel_initializer=params['weight_init']))
    model.add(Activation(params['activation']))
    model.add(Dropout(params['dropout1']))

    model.add(Dense(int(params['units2']), kernel_initializer=params['weight_init']))
    model.add(Activation(params['activation']))
    model.add(Dropout(params['dropout2']))

    # Linear output layer: the searched activation belongs to the hidden layers only
    model.add(Dense(24))

    # The learning rate is sampled between 1 and 20 and rescaled here.
    # Note: `decay` is only accepted by older Keras optimizers.
    sgd_optimizer = keras.optimizers.SGD(learning_rate=params['learning_rate']/1000, decay=1e-7, momentum=params['momentum'], nesterov=params['nesterov'])

    model.compile(loss = params['loss'], optimizer = sgd_optimizer, metrics = ["mae"])

    model.fit(x_train_array, y_train_array, epochs=int(params['nb_epochs']), batch_size=int(params['batch_size']), verbose = 1, validation_data = (x_val_array, y_val_array))

    preds  = model.predict(x_val_array, batch_size = int(params['batch_size']), verbose = 1)
    acc = mean_absolute_error(y_val_array, preds)
    print('MAE:', acc, flush=True)
    # fmin minimizes 'loss', so return the MAE itself and not its negative
    return {'loss': acc, 'status': STATUS_OK}

print(''), print(colored('Finished','green'));


Finished


## Define the hyperparameter search space

(1) Define the hyperparameters:
- Hidden nodes in layers 1 and 2, which must take integer values only. The available parameter types are listed in the Hyperopt FMin wiki. A quantized uniform distribution can be used here. To limit the search space, divide the domain into steps of 5 nodes. For hidden layer 1, search between 150 and 300 nodes; for hidden layer 2, search between 50 and 200 nodes.
- Dropout rate, which takes continuous values below 1. Use a uniform distribution between 0 and 0.5.
- Activation function for the hidden layers. This is a clear case for the 'choice' function in Hyperopt. Try 'ReLU' and 'LeakyReLU'. Optional: add a nested uniform distribution for the alpha parameter of the LeakyReLU.
- Loss function, another 'choice' parameter. Try the RMSE and MAE. Note that Keras has no built-in 'rmse' loss string, so the RMSE has to be passed as a custom loss function.
- Weight initialization. Use the choice-type parameter to try both 'RandomNormal' and 'RandomUniform' (see the Keras Initializers doc).
- Learning rate of the SGD. To keep things easy, make it a quantized uniform distribution between 1 and 20 in steps of 1 and divide it by 1000 in your loop.
- Momentum, make this a uniform distribution between 0 and 0.5.
- Nesterov, which is a Boolean that can be described using the choice function.
- Epochs, which is an integer value between 100 and 300. Steps of 10 can be used.
- Batch size, which can take an integer value from 50 to 200. Note: if the optimization crashes due to memory issues, reduce the batch size.

(2) Define the search space:
In Hyperopt, a search space is defined as a Python dictionary of hyperparameters, as in the following example:
```python
    n1 = hp.quniform('Hidden nodes layer 1', 150, 300, 5)
    n2 = hp.quniform('Hidden nodes layer 2', 50, 200, 5)
    
    search_space = {
        'Hidden nodes layer 1': n1,
        'Hidden nodes layer 2': n2
    }
```



<ins>Handy link:</ins><br>
http://hyperopt.github.io/hyperopt/#documentation <br>
https://github.com/hyperopt/hyperopt/wiki/FMin <br>
https://keras.io/api/

In [5]:
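from hyperopt.pyll import scope

def rmse(y_true, y_pred):
    # Keras has no built-in 'rmse' loss string, so define it with the backend functions
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))

# scope.int casts the floats returned by quniform to integers
space = {   'units1': scope.int(hp.quniform('units1', 150, 300, 5)),

            'units2': scope.int(hp.quniform('units2', 50, 200, 5)),

            'dropout1': hp.uniform('dropout1', 0, 0.5),

            'dropout2': hp.uniform('dropout2', 0, 0.5),

            'batch_size' : hp.choice('batch_size', [100]),

            # Between 100 and 300 epochs in steps of 10
            'nb_epochs' :  scope.int(hp.quniform('nb_epochs', 100, 300, 10)),

            'activation': hp.choice('activation', ['relu', LeakyReLU(alpha=0.05)]),
            
            'loss': hp.choice('loss', ['mae', rmse]),

            'weight_init': hp.choice('weight_init', ['random_normal', 'random_uniform']),

            'momentum': hp.uniform('momentum', 0, 0.5),

            'learning_rate': hp.quniform('learning_rate', 1, 20, 1),

            'nesterov': hp.choice('nesterov', [True, False]),
            
        }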

print(''), print(colored('Finished','green'));


Finished


## Build your train function

Now you can build the last piece of the puzzle needed to optimize hyperparameters of your MLP.<br>

Define a function that takes a dictionary of hyperparameters as input. Make sure to redefine integer values as such, since hyperopt returns floats from quantized distributions.
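
The cast to integers can be done in one place at the top of the function. The helper below is a minimal sketch; the name `cast_integer_params` and the default key names are assumptions and should be adapted to the keys of your own search space.

```python
def cast_integer_params(params, integer_keys=("units1", "units2", "nb_epochs", "batch_size")):
    """Return a copy of params in which the listed keys are converted to int."""
    cast = dict(params)  # copy, so the dictionary handed over by hyperopt stays untouched
    for key in integer_keys:
        if key in cast:
            cast[key] = int(round(cast[key]))
    return cast


# Hypothetical sample, shaped like what a quantized distribution returns
sampled = {"units1": 215.0, "units2": 80.0, "dropout1": 0.23, "nb_epochs": 150.0}
clean = cast_integer_params(sampled)
print(clean)
```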

The function should:<br>
(1) Read the hyperparameters.<br>
(2) Loop over a 5-fold cross-validation (see the scikit-learn KFold function) in which:
- A Keras model is declared with the given hyperparameters.
- The input features are scaled using the scikit-learn StandardScaler. Fit the scaler on the training set of the fold and scale the test set with those same scaling factors. This has to be done inside the KFold loop to prevent information leakage.
- The model is trained on the training set of the given fold.
- The trained model is evaluated on the test set of the given fold, using the <ins>Mean Absolute Error!</ins> <br>

Note: you can read in your data before calling the function, which saves a lot of runtime.<br>
(3) Make a Python list with the MAE of the five folds (for example called 'losses') and return a dictionary in the following format:
```python
    {'loss': np.mean(losses), 'status': STATUS_OK, 'losses': losses}
```
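
The cross-validation loop described above can be sketched as follows. To keep the example light, a scikit-learn `Ridge` regressor stands in for the Keras MLP and the data are synthetic; both are assumptions made for illustration, as are the names `cross_validate` and `make_model`. The point is where the scaler is fitted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler


def cross_validate(X, y, make_model, n_splits=5):
    """Return the MAE of every fold, scaling inside the loop to avoid leakage."""
    losses = []
    for train_idx, test_idx in KFold(n_splits=n_splits).split(X):
        # Fit the scaler on the training part of the fold only
        scaler = StandardScaler().fit(X[train_idx])
        X_train = scaler.transform(X[train_idx])
        X_test = scaler.transform(X[test_idx])  # reuse the training statistics

        model = make_model()  # fresh model for every fold
        model.fit(X_train, y[train_idx])
        preds = model.predict(X_test)
        losses.append(mean_absolute_error(y[test_idx], preds))
    return losses


# Synthetic stand-in data: 200 samples, 8 features, 24 targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=(8, 24)) + rng.normal(scale=0.1, size=(200, 24))

losses = cross_validate(X, y, make_model=lambda: Ridge(alpha=1.0))
# hyperopt's STATUS_OK constant is the string 'ok'
result = {"loss": float(np.mean(losses)), "status": "ok", "losses": losses}
print(result["loss"])
```

Passing the model as a factory (`make_model`) keeps the loop independent of the model type, so a function that builds the Keras model could be dropped in instead of `Ridge`.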


## Ready to loop

(1) Read in your data; there is no need to scale it. Just make sure to have an input features array (X) and a target array (y).<br>
(2) Declare a Hyperopt Trials object.<br>
(3) Run the search! Let's use the Tree-structured Parzen Estimator (TPE) algorithm. Use the fmin function:
```python 
def train(hyperparameters):
    ...
    return dict

search_space = {...}
trials = Trials()
X, y = load_data()

best = fmin(fn = train, 
            space = search_space, 
            algo = tpe.suggest, 
            max_evals = 500, 
            trials = trials, 
            show_progressbar = True
           )
```
(4) Save your trials object! It can be stored as a pickle. Be careful with pickles: unpickling a file can execute arbitrary code, so only load pickle files that you created yourself or that come from a trusted source. Here is an example of proper pickle usage:
```python
save_trials_path = Path(path_to_folder)
with open(save_trials_path / 'trials.pickle', 'wb') as pickle_file:
    pickle.dump(trials, pickle_file)

# ...rest of code
```

Note: first run it once with max_evals = 1 to check that everything works. If the search takes ages, you can reduce the search space by fixing some hyperparameters (for example by only using the MAE loss, keeping the SGD parameters at their defaults, or using a fixed number of epochs and/or a fixed batch_size), which allows for fewer evaluations. This assignment is meant to show what is possible on a big computer; it might not be feasible on your own PC.

In [4]:
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Keras classes used by neural_net, imported from tensorflow.keras like the imports at the top
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation

# (1) Load data

feat_X = pd.read_pickle("./variables/feat_X.pkl")
feat_y = pd.read_pickle("./variables/feat_y.pkl")

perc_test = 0.1

X_train, X_val, y_train, y_val = train_test_split(feat_X, feat_y, test_size = perc_test,shuffle = False, random_state = 1236548)
print('Number of samples in the training set:', X_train.shape[0])
print('Number of samples in the validation set:', X_val.shape[0])

print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)

x_train_array = np.array(X_train, dtype = float)
y_train_array = np.array(y_train)
x_val_array = np.array(X_val, dtype = float)
y_val_array = np.array(y_val)

print(x_train_array.shape)
print(x_val_array.shape)
print(y_train_array.shape)
print(y_val_array.shape)

# (2) Trials object

trials = Trials()

# (3) Run

best = fmin(neural_net, space, algo=tpe.suggest, max_evals = 1, trials=trials)
print('best: ')
print(best)

# (4) Save

save_trials_path = Path('./')
with open(save_trials_path / 'trials.pickle', 'wb') as pickle_file:
    pickle.dump(trials, pickle_file)

print(''), print(colored('Finished','green'));

 - 0s 8ms/step - loss: 5.4747 - mae: 5.4747 - val_loss: 4.3006 - val_mae: 4.3006

Epoch 253/300
 - 0s 15ms/step - loss: 5.4378 - mae: 5.4378 - val_loss: 4.4698 - val_mae: 4.4698

Epoch 254/300
 - 0s 8ms/step - loss: 5.4081 - mae: 5.4081 - val_loss: 4.6696 - val_mae: 4.6696

Epoch 255/300
 - 0s 13ms/step - loss: 5.5751 - mae: 5.5751 - val_loss: 4.8256 - val_mae: 4.8256

Epoch 256/300
 1/17 [>.............................]
 - ETA: 0s - loss