# BME-336546-C11-Hyperparameters optimization in neural networks

## Medical topic
Refer to previous tutorial.

## Dataset
Refer to previous tutorial.

## Main ML topic:
Hyperparameter optimization through validation of CNNs.

## Theory reminders
A nice visualization of the link between gradient descent on the error surface and the goodness of fit is shown here:
<center><img src="images/gif.gif" width=400></center>

Recall that we scale the data mostly because gradient descent is sensitive to feature scale and/or to avoid biased learning.
<center><img src="images/grad_descent.png" width=400></center>

In neural networks, we have multiple layers that we need to scale correctly. How should we do that? We can actually *learn* the scaling and shifting factors. Typically, we define trainable parameters $\gamma$ (scale) and $\beta$ (shift) per layer, and every batch passing through the layer is standardized as follows:
<center><img src="images/batch_normalization.png" width=300></center>

Thus, these parameters also affect the loss function $(L(w) \rightarrow L(w,\beta, \gamma))$ and they can be updated by gradient descent. Training typically becomes much faster when batch normalization is used.
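
To make the standardize-then-rescale step concrete, here is a minimal NumPy sketch of the forward pass of batch normalization at training time. The function name `batch_norm_forward`, the value of `eps` and the toy data are assumptions made for illustration; Keras' `BatchNormalization` layer additionally keeps running statistics for inference, which are omitted here.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # eps (assumed value) avoids division by zero
    return gamma * x_hat + beta

# Toy batch: 32 samples, 4 features, deliberately far from zero mean / unit variance
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

out_identity = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
out_shifted = batch_norm_forward(x, gamma=2.0 * np.ones(4), beta=3.0 * np.ones(4))

print(out_identity.mean(axis=0).round(3), out_identity.std(axis=0).round(3))
print(out_shifted.mean(axis=0).round(3), out_shifted.std(axis=0).round(3))
```

With $\gamma=1$ and $\beta=0$ the output is simply the standardized batch; other values let the network choose a different scale and offset for each feature.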

There are two more main factors that affect the gradient descent process:
* Stochastic vs. minibatch.
* Constant learning rate vs. adaptive one.

Here is a quick summary of stochastic vs. minibatch:


<left><img src="images/batch_stoch_2.png" width="400"></left>
<right><img src="images/batch_stoch.png" width="400"></right>
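
In code, the difference between the regimes comes down to how many samples feed each weight update. The sketch below uses a hypothetical helper, `iterate_minibatches`, and a toy dataset size; it only counts the updates per epoch and does not train anything. A batch size of 1 corresponds to stochastic gradient descent, and a batch size equal to the dataset size corresponds to full-batch gradient descent.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield shuffled index arrays; each one would drive a single weight update."""
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = np.random.default_rng(0)
n_samples = 10  # toy dataset size (assumption)

# Number of weight updates in one epoch for each regime
updates_per_epoch = {
    batch_size: sum(1 for _ in iterate_minibatches(n_samples, batch_size, rng))
    for batch_size in (1, 4, n_samples)
}
print(updates_per_epoch)
```

Smaller batches give more (and noisier) updates per epoch, while larger batches give fewer updates, each based on a better estimate of the gradient.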


For gradient descent with momentum, we add to the constant-learning-rate update another term that depends on the previous step. Thus, if the current step points "in the same direction" as the previous one, the weights are "pushed" further in that direction, as if the learning rate were larger. If not, the previous step is damped by the new one. Here $v_n$ is the previous step (the "velocity"), $\mu$ is the momentum coefficient and $\alpha$ is the learning rate:  
$$\begin{align}
v_{n+1} &= \mu v_n - \alpha \frac{\partial{L}}{\partial{W_n}}\\
W_{n+1} &= W_n + v_{n+1}
\end{align}$$
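
The two update equations translate almost line by line into code. The following is a minimal sketch on a one-dimensional quadratic loss $L(w)=(w-3)^2$; the function name `sgd_momentum`, the loss and the values of $\alpha$, $\mu$ and the number of steps are assumptions chosen only for illustration.

```python
def sgd_momentum(grad_fn, w0, alpha, mu, n_steps):
    """Iterate v <- mu * v - alpha * dL/dw, then w <- w + v, starting from v = 0."""
    w, v = w0, 0.0
    for _ in range(n_steps):
        v = mu * v - alpha * grad_fn(w)
        w = w + v
    return w

def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

# Same small learning rate, with and without momentum (mu = 0 is plain gradient descent)
w_plain = sgd_momentum(grad, w0=0.0, alpha=0.01, mu=0.0, n_steps=100)
w_momentum = sgd_momentum(grad, w0=0.0, alpha=0.01, mu=0.9, n_steps=100)

print(f"plain: {w_plain:.4f}, momentum: {w_momentum:.4f}")
```

In this particular setting the run with momentum ends closer to the minimum after the same number of steps. This is not guaranteed in general: with a larger learning rate, momentum can overshoot and oscillate around the minimum.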

Finally, in neural networks, we have a lot of hyperparameters (degrees of freedom) that we need to tune or set. Among them we can find the following:
* Batch size.
* Number of epochs.
* Types of regularization.
* Weights' initialization.
* Learning rate types and magnitude.
* Activation functions.
* Number of neurons or filters in every layer.
* Network's depth (number of layers).

## Data loading
The datasets are located in a shared folder on triton at `/MLdata/MLcourse/LTAF/`. Let's convert this notebook into a `.py` file, or work directly with the `ipynb` file, and check out the benefits of PyCharm Professional when it comes to working with remote servers.  

In [None]:
import numpy as np
import itertools
from tqdm import tqdm
import pickle
import sys
import pandas as pd
import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
mpl.style.use(['ggplot']) 
# %matplotlib inline
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from IPython.display import display, clear_output
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV
import os
# os.environ['TF_CPP_MIN_LOG_LEVEL']='3'

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, load_model 
from tensorflow.keras.layers import Dense, Dropout, Activation, Conv1D, MaxPool1D, Flatten, BatchNormalization
from tensorflow.keras import utils
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.optimizers import SGD

In [None]:
data_src = '/MLdata/MLcourse/LTAF/'
y = np.load(data_src + 'y_LTAF.npy')
rr = np.load(data_src + 'rr_LTAF.npy')

rr_train, rr_test, y_train, y_test = train_test_split(rr, y, test_size = 0.20, random_state = 336546, stratify=y)
# rr_train, rr_val, y_train, y_val = train_test_split(rr_train_orig, y_train_orig, test_size = 0.20, random_state = 336546, stratify=y_train_orig)

# rr_train_orig = rr_train_orig.reshape(rr_train_orig.shape[0],rr_train_orig.shape[1],1)
rr_train = rr_train.reshape(rr_train.shape[0],rr_train.shape[1],1)
# rr_val = rr_val.reshape(rr_val.shape[0],rr_val.shape[1],1)
rr_test = rr_test.reshape(rr_test.shape[0],rr_test.shape[1],1)

In [None]:
tf.keras.backend.clear_session()
config = tf.compat.v1.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))

## Specific task: Tune your neural network
Build the same neural network as in the previous tutorial inside `create_model` and add batch normalization between every convolution/dense layer. Don't change `window_size` and `len_sub_window`. Choose only 3 hyperparameters to tune and set no more than 2 options for every hyperparameter. Note that batch size, number of epochs and weights' initialization are external to the model itself and do not count as arguments of the function `create_model`. However, you can include them in the grid search. If you choose not to tune the batch size and/or the number of epochs, then you must define them in `KerasClassifier`. Find more guidance [here](https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/).
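
The limit on the grid size is there to keep the training time manageable: three hyperparameters with two options each give at most $2^3 = 8$ combinations, and every combination is trained once per cross-validation fold. The sketch below enumerates such a grid with `itertools.product`; the hyperparameter values and the number of folds are hypothetical and only illustrate the counting.

```python
import itertools

# Hypothetical grid: three hyperparameters, two options each (illustrative values only)
param_grid = {
    "batch_size": [2000, 4000],
    "n_filters_start": [32, 64],
    "dropout": [0.2, 0.5],
}
n_folds = 3  # assumed number of cross-validation folds

# Cartesian product of all options, one dict per combination
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} model fits")
```

When `refit=True` (the scikit-learn default), `GridSearchCV` performs one additional fit of the best combination on the whole training set.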

In [None]:
def create_model(window_size=60, len_sub_window=10,n_filters_start=64,n_hidden_start=512,dropout=0.5,lr=0.01, momentum=0):
    #----------------------Implement your code here:------------------------------
    model = Sequential()
    model.add(Conv1D(n_filters_start, len_sub_window, activation='relu', input_shape=(window_size, 1)))
    model.add(BatchNormalization())
    model.add(Conv1D(2 * n_filters_start, len_sub_window, activation='relu'))
    model.add(BatchNormalization())
    model.add(MaxPool1D())
    model.add(Conv1D(4 * n_filters_start, len_sub_window, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout))
    model.add(Flatten())
    model.add(Dense(n_hidden_start, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(int(n_hidden_start / 2), activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(int(n_hidden_start / 4), activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(dropout))   
    model.add(Dense(1, activation='sigmoid'))
    optimizer = SGD(learning_rate=lr, momentum=momentum)  # `lr` is a deprecated alias of `learning_rate`
    model.compile(optimizer=optimizer, metrics=['accuracy'], loss='binary_crossentropy')
    #------------------------------------------------------------------------------
    return model

In [None]:
model = KerasClassifier(build_fn=create_model,verbose=1, epochs=30)
batch_size = [2000, 4000]
n_filters_start = [32, 64, 128]
dropout = [0.1, 0.2, 0.5]
param_grid = dict(batch_size=batch_size, n_filters_start=n_filters_start, dropout=dropout)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1, cv=3) 

In [None]:
grid_result = grid.fit(rr_train, y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Fit your best model with the complete training set, name it `final_model` and evaluate on the testing set.

In [None]:
#----------------------Implement your code here:------------------------------
final_model = KerasClassifier(build_fn=create_model,verbose=1, epochs=30, **grid_result.best_params_)
final_model.fit(rr_train, y_train)
#------------------------------------------------------------------------------

In [None]:
from sklearn.metrics import confusion_matrix
calc_TN = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 0]
calc_FP = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[0, 1]
calc_FN = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 0]
calc_TP = lambda y_true, y_pred: confusion_matrix(y_true, y_pred)[1, 1]

In [None]:
def stat_metric(y_test, y_pred_test, y_pred_proba_test, clf_name, temp=np.empty(())):
    TN = calc_TN(y_test, y_pred_test)
    FP = calc_FP(y_test, y_pred_test)
    FN = calc_FN(y_test, y_pred_test)
    TP = calc_TP(y_test, y_pred_test)
    Se = TP/(TP+FN)
    Sp = TN/(TN+FP)
    PPV = TP/(TP+FP)
    NPV = TN/(TN+FN)
    Acc = (TP+TN)/(TP+TN+FP+FN)
    F1 = (2*Se*PPV)/(Se+PPV)
    print('The fitted classifier is ' + clf_name + '\n')
    print('Sensitivity is {:.2f}. \nSpecificity is {:.2f}. \nPPV is {:.2f}. \nNPV is {:.2f}. \nAccuracy is {:.2f}. \nF1 is {:.2f}. '.format(Se,Sp,PPV,NPV,Acc,F1))
    if temp.size == 1:
        print('AUROC is {:.2f}'.format(roc_auc_score(y_test, y_pred_proba_test[:,1])))
    else:
        print('AUROC is {:.2f}'.format(roc_auc_score(y_test, temp[:,1])))

In [None]:
y_pred_test = final_model.predict(rr_test)
y_pred_test[y_pred_test>=0.5] = 1
y_pred_test[y_pred_test<0.5] = 0

In [None]:
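# KerasClassifier.predict returns class labels, so take the sigmoid probabilities from the underlying Keras model
temp = final_model.model.predict(rr_test)
# Two-column array [P(class 0), P(class 1)], the layout that stat_metric expects
temp2 = np.zeros((temp.shape[0], 2))
temp2[:,0] = 1-temp[:,0]
temp2[:,1] = temp[:,0]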

In [None]:
stat_metric(y_test, y_pred_test, temp2, clf_name='CNN', temp=temp2)  # temp2 holds the class probabilities

Images credit:
* [batch vs. stochastic](https://www.kdnuggets.com/2016/08/gentlest-introduction-tensorflow-part-2.html/2)
* [batch vs. stochastic2](https://xzz201920.medium.com/gradient-descent-stochastic-vs-mini-batch-vs-batch-vs-adagrad-vs-rmsprop-vs-adam-3aa652318b0d)
* [gif](https://towardsdatascience.com/improving-vanilla-gradient-descent-f9d91031ab1d)
* [scaling](https://www.commonlounge.com/discussion/fc3fb95081e54ab3b00368aacbdc62be/history)
* [batch normalization](https://arxiv.org/pdf/1502.03167v3.pdf)

#### *This tutorial was written by [Moran Davoodi](mailto:morandavoodi@gmail.com) with the assistance of [Yuval Ben Sason](mailto:yuvalbse@gmail.com) & Kevin Kotzen*

# That's all folks! Hope you enjoyed the course :)