# What are Hyperparameters?

Model hyperparameters are settings that govern the entire training process. They include variables that determine the network structure (for example, the number of hidden units) and variables that determine how the network is trained (for example, the learning rate).

# Why do we need Hyperparameters?

Deep learning has had a significant impact on computer vision, natural language processing, and speech recognition. The large amounts of data generated every day can be used to train deep neural networks, which are often preferred over traditional machine learning algorithms for their higher performance and precision. This has led to various successful commercial products.

Even though deep learning is a booming field, certain practical aspects of it remain a black box. Applied deep learning is a highly iterative process, and there are various hyperparameters to keep in mind before training a model.

These hyperparameters play a major role in balancing the tradeoffs involved in fitting a model that generalizes well beyond the training dataset. Some of these tradeoff concepts are as follows.

## Generalization
Suppose we have trained a classification model on 10k labeled images. When we evaluate the model on those same images, it predicts the labels with a mind-blowing 99% accuracy.

But when we try the same model on unseen data, it fails to perform well. This is a case of <b>overfitting</b>.

Our goal while training a network is to have it generalize well beyond the training data, which means capturing the true signal in the data rather than memorizing the noise.

In statistics, this is termed <b>"goodness of fit"</b>, which refers to how well our predicted values match the true values.
![MLMeme.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/MachineLearningGeneralization.jpg?raw=true)

## Bias-Variance Tradeoff

The bias-variance tradeoff is one of the important aspects of applied machine learning. It has simple yet powerful implications for model complexity and performance.

We say a model has high <b>bias</b> when the algorithm is not flexible enough to capture the underlying pattern in the data. Linear parametric algorithms with low complexity, such as linear regression and Naive Bayes, tend to have high bias.

<b>Variance</b> arises when the algorithm is highly flexible and sensitive to the particular training data it sees. Non-linear algorithms with high complexity, such as decision trees and neural networks, tend to have high variance.

![Bias-VarianceTradeoff.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/1_kADA5Q4al9DRLoXck6_6Xw.png?raw=true)
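
For squared-error loss, the tradeoff can be written down explicitly. As a sketch, assume the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and let $\hat{f}$ be the model fitted on a randomly drawn training set (these symbols are introduced here for illustration and do not appear elsewhere in this notebook). The expected prediction error at a point $x$ then splits into three parts:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big] + \sigma^2
$$

The first term is the squared bias, the second is the variance, and the third is irreducible noise. Making a model more flexible usually lowers the first term and raises the second.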

## Overfitting vs Underfitting

In both cases, the model fails to predict unseen data well.
An algorithm <b>overfits</b> the training data when it memorizes the noise instead of the underlying signal. Complex algorithms such as neural networks are usually prone to overfitting.

An algorithm <b>underfits</b> the training data when it is not able to capture the true signal in the data. Underfitted models have poor accuracy on the training data as well as on the test data.
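
The difference can be seen on a toy problem. The following is a minimal sketch using only the Python standard library; the data-generating line `y = 2x + 1`, the noise level, and the three model names are assumptions made up for this illustration, not part of the dataset used later.

```python
import random

def make_data(n, rng):
    """Draw n noisy samples from the line y = 2x + 1 (a made-up ground truth)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
x_train, y_train = make_data(50, rng)
x_test, y_test = make_data(50, rng)

mean_x = sum(x_train) / len(x_train)
mean_y = sum(y_train) / len(y_train)

def underfit(x):
    # Too rigid: ignores x and always predicts the training mean.
    return mean_y

def memorizer(x):
    # Too flexible: returns the label of the closest training point,
    # so it reproduces the training noise exactly.
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]

# Ordinary least-squares line as a model of suitable complexity.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
         / sum((x - mean_x) ** 2 for x in x_train))
intercept = mean_y - slope * mean_x

def linear(x):
    return slope * x + intercept

models = [("underfit", underfit), ("memorizer", memorizer), ("linear", linear)]
results = {name: (mse(m, x_train, y_train), mse(m, x_test, y_test))
           for name, m in models}

for name, (train_err, test_err) in results.items():
    print(f"{name:10s} train MSE = {train_err:.3f}   test MSE = {test_err:.3f}")
```

The memorizer reaches zero training error by construction, and the mean predictor cannot beat the least-squares line on the training set. Comparing the two columns of the printout shows how differently each model carries over to unseen data.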

To manage these tradeoffs and make sure that the neural network generalizes well beyond the training data, it is imperative that the network is well architected.

# How should you Architect your Keras Neural network: Hyperparameters

## What are the types of Hyperparameters

### 1. Number of Hidden Layers and Neuron Counts
The information below is taken from: [Keras Layers](https://keras.io/layers/core/)

Layer types and when to use them:

* **Activation**: Layer that simply applies an activation function. The activation function can also be specified as part of other layer types.
* **ActivityRegularization**: Used to add L1/L2 regularization outside of a layer. Lasso (L1) and Ridge (L2) regularization can also be specified as part of other layer types.
* **Dense**: The original neural network layer type. Every neuron is connected to every neuron in the next layer. The input vector is one-dimensional, and placing certain inputs next to each other has no effect.
* **Dropout**: Randomly sets a fraction `rate` (between 0 and 1) of the input units to 0 at each update during training, which helps prevent overfitting. Dropout only occurs during training.
* **Flatten**: Flattens the input to 1D. Does not affect the batch size.
* **Input**: The entry point of the network. It augments the tensor object with certain attributes that are used by the Keras model.
* **Lambda**: Wraps an arbitrary expression as a Layer object.
* **Masking**: Masks (conceals) parts of a sequence by using a mask value to skip timesteps.
* **Permute**: Permutes the dimensions of the input according to a given pattern. Useful, for example, for connecting RNNs and convnets together.
* **RepeatVector**: Repeats the input N times.
* **Reshape**: Reshapes the input to a given shape, similar to NumPy's `reshape`.
* **SpatialDropout1D**: Similar to dropout, but drops entire 1D feature maps instead of individual elements.
* **SpatialDropout2D**: Similar to dropout, but drops entire 2D feature maps instead of individual elements.
* **SpatialDropout3D**: Similar to dropout, but drops entire 3D feature maps instead of individual elements.

### 2A. Activation Functions

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron's input is relevant for the model's prediction.

The information below is taken from: [Activation Function Cheat Sheets](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html)

* **Linear**: Pass through activation function. Usually used on the output layer of a regression neural network.
![Linear.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/linear.png?raw=true)
* **ELU**: Exponential Linear Unit. Tends to converge the cost toward zero faster and produce more accurate results. Can produce negative outputs.
![ELU.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/elu.PNG?raw=true)
* **SELU**: Scaled Exponential Linear Unit, essentially ELU multiplied by a scaling constant.
* **SoftPlus**: Softplus activation function, $\log(\exp(x) + 1)$. [Introduced](https://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf) in 2001.
* **tanh**: Classic neural network activation function, though often replaced by the ReLU family in modern networks.
![tanh.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/TanH.PNG?raw=true)
* **softsign**: Softsign activation function, $x / (|x| + 1)$. Similar to tanh, but not widely used.
* **ReLU**: Very popular neural network activation function. Used for hidden layers; cannot output negative values. No trainable parameters.
![ReLU.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/Relu.PNG?raw=true)
* **sigmoid**: Classic neural network activation. Often used on the output layer of a binary classifier.
![Sigmoid.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/sigmoid.PNG?raw=true)
* **hard_sigmoid**: Less computationally expensive variant of sigmoid.
* **exponential**: Exponential (base e) activation function.
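
To make the list above concrete, here is a minimal sketch of most of these functions for a single scalar input, written with the standard `math` module rather than Keras. The function names are chosen for this illustration; the two SELU constants are the commonly quoted values, rounded to eight decimals.

```python
import math

def linear(x):
    return x  # pass-through

def relu(x):
    return max(0.0, x)  # never negative

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # squashes to (0, 1)

def tanh(x):
    return math.tanh(x)  # squashes to (-1, 1)

def softplus(x):
    return math.log(math.exp(x) + 1.0)  # smooth approximation of ReLU

def softsign(x):
    return x / (abs(x) + 1.0)  # similar shape to tanh

def elu(x, alpha=1.0):
    # Identity for positive inputs, saturating exponential for negative ones.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

# Constants commonly used for SELU (rounded).
SELU_ALPHA = 1.67326324
SELU_SCALE = 1.05070098

def selu(x):
    return SELU_SCALE * elu(x, SELU_ALPHA)  # ELU times a scaling constant

def exponential(x):
    return math.exp(x)

activations = [linear, relu, sigmoid, tanh, softplus, softsign, elu, selu, exponential]
for fn in activations:
    values = ", ".join(f"{fn(x): .4f}" for x in (-2.0, 0.0, 2.0))
    print(f"{fn.__name__:12s} {values}")
```

Printing each function at -2, 0 and 2 shows at a glance which ones can output negative values and which ones saturate.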


### 2B. Advanced Activation Functions

The information below is taken from: [Keras Advanced Activation Functions](https://keras.io/layers/advanced-activations/)

* **LeakyReLU**: Leaky version of a Rectified Linear Unit. It allows a small gradient when the unit is not active, controlled by the `alpha` hyperparameter.
![LeakyReLU](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/LeakyRelU.PNG?raw=true)
* **PReLU**: Parametric Rectified Linear Unit. It learns `alpha` during training instead of treating it as a fixed hyperparameter.
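
The difference between the two can be sketched in a few lines of plain Python. The default `alpha=0.3` and the helper `prelu_grad_alpha` are illustrative assumptions, not Keras API; the point is only that PReLU uses the same formula as LeakyReLU but also has a gradient with respect to `alpha`, which is what lets it be learned.

```python
def leaky_relu(x, alpha=0.3):
    """Identity for positive inputs, a small slope alpha for negative ones."""
    return x if x > 0 else alpha * x

def prelu_grad_alpha(x):
    """Derivative of the same function with respect to alpha.

    It is non-zero only for non-positive inputs, so only inactive units
    contribute to the update of alpha.
    """
    return x if x <= 0 else 0.0

for x in (-2.0, -0.5, 0.0, 1.5):
    print(f"x = {x: .1f}  leaky_relu = {leaky_relu(x, alpha=0.1): .2f}  "
          f"d/d_alpha = {prelu_grad_alpha(x): .1f}")
```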

### 3A. Regularization: L1, L2

* [Keras Regularization](https://keras.io/regularizers/)


Regularizers allow you to apply penalties on layer parameters or layer activity during optimization. These penalties are incorporated into the loss function that the network optimizes.

To create a less complex model when you have a large number of features in your dataset, some of the regularization techniques used to address overfitting and feature selection are:

* **Lasso Regression (L1)**: (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function.
![Lasso](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/Lasso.png?raw=true)
* **Ridge Regression (L2)**: adds the “squared magnitude” of the coefficients as a penalty term to the loss function.
![Ridge](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/Ridge.PNG?raw=true)

<i> The key difference between the techniques is that Lasso shrinks the coefficients of the less important features to zero, thus removing some features altogether, whereas Ridge regression suppresses the less important features but does not eliminate them completely. So Lasso works well for feature selection when we have a huge number of features. </i>

In our ResNet implementation, we used Ridge (L2) regularization for image classification.
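
The two penalty terms, and the reason Lasso can zero out coefficients while Ridge only shrinks them, can be sketched in plain Python. The weight values, the strength `lam`, and the threshold `t` are made-up numbers for illustration. The two `*_shrink` helpers are the standard proximal updates for each penalty, shown here as one way to see the effect rather than as what Keras does internally.

```python
import math

def l1_penalty(weights, lam):
    """Lasso term: lam times the sum of absolute values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge term: lam times the sum of squares."""
    return lam * sum(w * w for w in weights)

def l1_shrink(w, t):
    """Soft-thresholding: moves w toward zero by t and clamps at exactly zero."""
    return math.copysign(max(abs(w) - t, 0.0), w)

def l2_shrink(w, t):
    """Ridge-style shrinkage: rescales w, so it gets smaller but stays non-zero."""
    return w / (1.0 + 2.0 * t)

weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical layer weights
lam = 0.01                        # hypothetical regularization strength
print("L1 penalty:", l1_penalty(weights, lam))
print("L2 penalty:", l2_penalty(weights, lam))

t = 0.1
for w in (0.05, -0.05, 1.0):
    print(f"w = {w: .2f}  after L1 step: {l1_shrink(w, t): .3f}  "
          f"after L2 step: {l2_shrink(w, t): .3f}")
```

Small weights fall inside the threshold of the L1 step and become exactly zero, which is the feature-selection behavior described above.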

### 3B. Dropout

* [Keras Dropout](https://keras.io/layers/core/)

Dropout, as explained in the layers section, is another method to curb overfitting. A fraction of the neuron outputs from the incoming layer is intentionally dropped to make sure that the network does not over-learn the training data.
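
As a rough sketch of the mechanism, the function below applies dropout to a plain Python list. It follows the usual "inverted dropout" convention, in which the kept values are scaled by `1 / (1 - rate)` so that the expected sum stays the same. The function name and the explicit `rng` argument are assumptions made for this example.

```python
import random

def dropout(inputs, rate, training, rng):
    """Zero each input with probability `rate` (0 <= rate < 1) during training.

    Kept inputs are scaled up by 1 / (1 - rate). Outside of training the
    inputs pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(inputs)
    keep = 1.0 - rate
    return [x / keep if rng.random() >= rate else 0.0 for x in inputs]

rng = random.Random(42)
activations = [1.0] * 10

print("training :", dropout(activations, rate=0.2, training=True, rng=rng))
print("inference:", dropout(activations, rate=0.2, training=False, rng=rng))
```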

### 4. Batch Normalization
Batch

## Aim
We are going to use a simple dataset and evaluate the network with our own hyperparameters. Eventually, we are going to use Bayesian Optimization to tune those hyperparameters so as to minimize the log loss of the network.

Next, we encode the dataset into a feature vector:

In [3]:
import pandas as pd
from scipy.stats import zscore

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Generate dummies for Job
col = 'job'
df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
df.drop(col, axis=1, inplace=True)

# Generate dummies for Area
col = 'area'
df = pd.concat([df, pd.get_dummies(df[col], prefix=col)], axis=1)
df.drop(col, axis=1, inplace=True)

# Check for Missing values of income, and fill it with the column's median
med = df['income'].median()
df['income'] = df['income'].fillna(med)

# Standardize ranges
df['income'] = zscore(df['income'])
df['aspect'] = zscore(df['aspect'])
df['save_rate'] = zscore(df['save_rate'])
df['age'] = zscore(df['age'])
df['subscriptions'] = zscore(df['subscriptions'])

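# Convert to numpy - Classification
x_columns = df.columns.drop('product').drop('id')
# Cast to float32: newer pandas versions return boolean dummy columns, which
# would otherwise turn the mixed feature matrix into an object array
x = df[x_columns].values.astype('float32')
dummies = pd.get_dummies(df['product']) # Classification
products = dummies.columns
y = dummies.values.astype('float32')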
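
The two transformations used in the cell above can be sketched without pandas or SciPy. The helper names `zscore` and `one_hot` and the sample values are illustrative assumptions. The sketch uses the population standard deviation, which matches the default of `scipy.stats.zscore`.

```python
import statistics

def zscore(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(labels):
    """Return the sorted categories and one indicator row per label."""
    categories = sorted(set(labels))
    rows = [[1.0 if label == c else 0.0 for c in categories] for label in labels]
    return categories, rows

incomes = [30000.0, 45000.0, 60000.0]   # made-up numeric column
jobs = ["b", "a", "b"]                  # made-up categorical column

print(zscore(incomes))
print(one_hot(jobs))
```

Standardizing keeps columns with large raw ranges, such as income, from dominating the gradient updates, and the indicator columns let the network consume categorical values as numbers.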

Below is the definition of the `evaluate_network` function, which takes in a few hyperparameters and calculates the log loss of the resulting architecture.

In [5]:
import os
import numpy as np
import tensorflow.keras.initializers
import statistics
import tensorflow.keras
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, LeakyReLU, PReLU
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

def evaluate_network(dropout, lr, neuronPct, neuronShrink):
    SPLITS = 2
    # Bootstrapping to reduce overfit
    boot = StratifiedShuffleSplit(n_splits=SPLITS, test_size=0.1)
    
    # Track progress
    mean_benchmark = []
    epochs_needed = []
    num = 0
    neuronCount = int(neuronPct*5000)
    
    # loop through samples
    for train, test in boot.split(x,df['product']):
        num +=1
        
        # Split training and testing data
        x_train = x[train]
        y_train = y[train]
        y_test = y[test]
        x_test = x[test]
        
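        # Construct neural network
        model = Sequential()

        # Reset the layer size for every split; otherwise the second model
        # would start from the shrunken count left over by the first one
        neuronCount = int(neuronPct*5000)
        layer = 0
        while neuronCount>25 and layer<10:
            if layer==0:
                model.add(Dense(neuronCount,
                               input_dim= x.shape[1],
                               activation= PReLU()))
            else:
                model.add(Dense(neuronCount, activation= PReLU()))
            layer += 1 # Count layers so the layer==0 and layer<10 checks work
            model.add(Dropout(dropout))

            # Dense expects an integer number of units
            neuronCount = int(neuronCount*neuronShrink)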
            
        # Add the output layer, and put activation as Softmax (Classification)    
        model.add(Dense(y.shape[1], activation='softmax'))
        # The learning rate is passed explicitly because it is one of the values to be optimized
        model.compile(loss='categorical_crossentropy', optimizer= Adam(learning_rate= lr))
        
        # Use the EarlyStopping callback to monitor val_loss. If it stops improving
        # by at least min_delta, wait for 100 more epochs (patience) and then
        # restore the weights from the epoch with the lowest loss. This is another
        # way to guard against overfitting
        monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=100,
                               verbose=1, mode='auto', restore_best_weights= True)
        
        # Train on the bootstrap sample
        model.fit(x_train, y_train, validation_data=(x_test, y_test), callbacks=[monitor],
                 verbose= 0, epochs=1000)
        epochs = monitor.stopped_epoch
        epochs_needed.append(epochs)
        
        # Predict on the out of boot(validation)
        pred = model.predict(x_test)
        
        # Measure this bootstrap's log loss
        y_compare = np.argmax(y_test, axis=1)
        score = metrics.log_loss(y_compare, pred)
        mean_benchmark.append(score)
        m1 = statistics.mean(mean_benchmark)
        m2 = statistics.mean(epochs_needed)
        mdev = statistics.pstdev(mean_benchmark)
        
    tensorflow.keras.backend.clear_session() # Release the Keras state held by the models built above
    # Return the negative mean log loss, so that larger values are better
    return(-m1)

print(evaluate_network(
    dropout=0.2,
    lr=1e-3,
    neuronPct=0.2,
    neuronShrink=0.2))

Restoring model weights from the end of the best epoch.
Epoch 00107: early stopping
Restoring model weights from the end of the best epoch.
Epoch 00227: early stopping
-0.7663188438257202


The `evaluate_network` function has four parameters:
* **Dropout**: The fraction of neuron outputs to drop from one layer to the next.
* **Learning Rate**: Parameter of the optimizer that controls the step size, and hence how quickly it approaches a minimum.
* **Neuron Percentage**: The fraction of the maximum of 5000 neurons to use in the first layer.
* **Neuron Shrink**: The factor by which the neuron count is multiplied for each subsequent layer.
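
How the last two parameters translate into an architecture can be sketched on its own. The helper below is a hypothetical stand-alone version of the layer-size schedule, as read from the parameter descriptions: start at a fraction of 5000 neurons, multiply by the shrink factor for each further layer, and stop once the count is 25 or less or ten layers have been added.

```python
def hidden_layer_sizes(neuron_pct, neuron_shrink,
                       max_neurons=5000, min_neurons=25, max_layers=10):
    """Return the list of hidden-layer widths for the given hyperparameters."""
    sizes = []
    count = int(neuron_pct * max_neurons)
    while count > min_neurons and len(sizes) < max_layers:
        sizes.append(count)
        # Layer widths must be whole numbers of neurons.
        count = int(count * neuron_shrink)
    return sizes

print(hidden_layer_sizes(0.2, 0.2))
print(hidden_layer_sizes(1.0, 0.5))
```

With `neuronPct=0.2` and `neuronShrink=0.2` this gives three hidden layers of 1000, 200 and 40 neurons. The next step, 8 neurons, falls below the cutoff of 25.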

With these fixed values, we got a log loss of approximately 0.77 (the function returns its negative, about -0.77). Now we are going to optimize these values so that the loss can be decreased further. We are also going to note the values at which we get the lowest loss.
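
For reference, the quantity being minimized can be computed by hand. The following is a minimal sketch of multi-class log loss in plain Python; the clipping constant `eps` is an assumption added to avoid taking `log(0)`, and the sample predictions are made up.

```python
import math

def log_loss(true_classes, probabilities, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for cls, probs in zip(true_classes, probabilities):
        p = min(max(probs[cls], eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total -= math.log(p)
    return total / len(true_classes)

y_true = [0, 2, 1]                      # index of the correct class per sample
y_pred = [[0.7, 0.2, 0.1],              # confident and correct
          [0.1, 0.3, 0.6],              # fairly confident and correct
          [0.5, 0.3, 0.2]]              # true class gets only 0.3

print(log_loss(y_true, y_pred))
```

Log loss is zero only when the true class always receives probability 1, and it grows quickly when the model is confidently wrong. It is not bounded by 1, which is why it should be read as a score rather than as a percentage.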