# What are Hyperparameters?

Model hyperparameters are properties that govern the entire training process. They include the variables that determine the network structure (for example, the number of hidden units) and the variables that determine how the network is trained (for example, the learning rate).
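
As a rough illustration, the two groups of hyperparameters can be kept apart in a plain configuration dictionary. Every name and value below (`num_hidden_layers`, `hidden_units`, `learning_rate`, and so on) is an illustrative assumption, not a recommendation.

```python
# Hypothetical hyperparameter configuration; all names and values are illustrative.
hyperparameters = {
    # Hyperparameters that determine the network structure
    "structure": {
        "num_hidden_layers": 2,
        "hidden_units": 64,
        "activation": "relu",
    },
    # Hyperparameters that determine how the network is trained
    "training": {
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 20,
    },
}

# Hyperparameters are chosen before training starts, whereas weights and
# biases are learned by the training process itself.
for group, settings in hyperparameters.items():
    print(group, settings)
```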

# Why do we need Hyperparameters?

Deep learning has had a significant impact on computer vision, natural language processing, and speech recognition. Because large amounts of data are generated every day and can be used to train deep neural networks, deep learning is often preferred over traditional machine learning algorithms for its higher performance and precision, which has led to various successful commercial products.

Even though deep learning is a booming field, some of its practical aspects remain a black box. Applied deep learning is a highly iterative process, and there are various hyperparameters to keep in mind before training a model.

These hyperparameters play a major role in balancing tradeoffs so that the model generalizes well beyond the training dataset. Some of these tradeoff concepts are as follows.

## Generalization
Suppose we have trained a classification model on 10k labelled images. We test the model and it predicts the labels with a mind-blowing 99% accuracy.

But when we try the same model on unseen data, it fails to perform well. This is a case of <b>overfitting</b>.

Our goal while training a network is to have it generalize well beyond the training data, which means capturing the true signal in the data rather than memorizing the noise.

In statistics, this is termed <b>"goodness of fit"</b>, which refers to how well the predicted values match the true values.
![MLMeme.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/1_o8AOh44AcgCx81ARFbZQZg.png?raw=true)

## Bias-Variance Tradeoff

The bias-variance tradeoff is one of the important aspects of applied machine learning. It has simple but powerful implications for model complexity and performance.

We say a model has high <b>bias</b> when the algorithm is not flexible enough to capture the true signal in the data. Linear parametric algorithms with low complexity, such as linear regression and Naive Bayes, tend to have high bias.

A model has high <b>variance</b> when the algorithm is highly flexible and sensitive to the training data. Non-linear algorithms with high complexity, such as decision trees and neural networks, tend to have high variance.

![Bias-VarianceTradeoff.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/1_kADA5Q4al9DRLoXck6_6Xw.png?raw=true)
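
For squared-error loss, the tradeoff is commonly written as the decomposition below. The notation is an assumption introduced here, not taken from the text above: $y = f(x) + \varepsilon$ is the target with noise variance $\sigma^2$, $\hat{f}(x)$ is the model's prediction, and the expectations are taken over training sets and noise.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
+ \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
+ \sigma^2
$$

The first term is the squared bias, the second is the variance, and $\sigma^2$ is irreducible noise. Making a model more flexible usually lowers the first term and raises the second.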

## Overfitting vs Underfitting

In both cases, the model fails to predict unseen data well.
An algorithm <b>overfits</b> the training data when it memorizes the noise instead of the signal. Complex algorithms such as neural networks are usually prone to overfitting.

An algorithm <b>underfits</b> the training data when it is not able to capture the true signal in the data. Underfitted models have poor accuracy on both the training data and the test data.
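
One rough way to tell the two cases apart is to compare training and test accuracy. The sketch below is only a heuristic: the helper `diagnose_fit` and its thresholds `min_accuracy` and `max_gap` are hypothetical and would need tuning for a real problem.

```python
def diagnose_fit(train_accuracy, test_accuracy, min_accuracy=0.8, max_gap=0.1):
    """Classify a model's fit from its accuracies (thresholds are arbitrary assumptions)."""
    # Poor accuracy even on the training data suggests underfitting.
    if train_accuracy < min_accuracy:
        return "underfitting"
    # Good training accuracy but a large drop on test data suggests overfitting.
    if train_accuracy - test_accuracy > max_gap:
        return "overfitting"
    return "reasonable fit"


print(diagnose_fit(0.99, 0.70))  # overfitting
print(diagnose_fit(0.60, 0.58))  # underfitting
print(diagnose_fit(0.92, 0.90))  # reasonable fit
```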

To manage these tradeoffs and make sure that a neural network generalizes well beyond the training data, it is imperative that the network is well architected.

# How should you Architect your Keras Neural network: Hyperparameters

## What are the types of Hyperparameters?

### 1. Number of Hidden Layers and Neuron Counts
The information below is taken from: [Keras Layers](https://keras.io/layers/core/)

Layer types and when to use them:

* **Activation**: Layer that simply applies an activation function. The activation function can also be specified as part of another layer type.
* **ActivityRegularization**: Used to add L1/L2 regularization outside of a layer. L1 (Lasso) and L2 (Ridge) regularization can also be specified as part of another layer type.
* **Dense**: The original neural network layer type. Every neuron is connected to every neuron in the next layer. The input vector is one-dimensional, and placing certain inputs next to each other has no effect.
* **Dropout**: Randomly sets a fraction `rate` (between 0 and 1) of the input units to 0 at each update during training, which helps prevent overfitting. Dropout only occurs during training.
* **Flatten**: Flattens the input to 1D. Does not affect the batch size.
* **Input**: The entry point of the network. It augments the tensor object with certain attributes that are used by the Keras model.
* **Lambda**: Wraps an arbitrary expression as a Layer object.
* **Masking**: Masks (conceals) a sequence by using a mask value to skip timesteps.
* **Permute**: Permutes the dimensions of the input according to a given pattern. Useful, for example, for connecting RNNs and convnets together.
* **RepeatVector**: Repeats the input N times.
* **Reshape**: Reshapes the input to a given shape, similar to NumPy's `reshape`.
* **SpatialDropout1D**: Similar to Dropout, but drops entire 1D feature maps instead of individual elements.
* **SpatialDropout2D**: Similar to Dropout, but drops entire 2D feature maps instead of individual elements.
* **SpatialDropout3D**: Similar to Dropout, but drops entire 3D feature maps instead of individual elements.
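
To make the behaviour of two of these layers concrete, here is a minimal pure-Python sketch of what `Dropout` and `Flatten` do to their inputs. It is not Keras code: the functions `dropout` and `flatten` are hypothetical helpers, and the sketch assumes the "inverted dropout" convention, in which surviving values are scaled by `1 / (1 - rate)` during training.

```python
import random


def dropout(inputs, rate, training, rng=random):
    """Zero each value with probability `rate`, during training only (requires rate < 1)."""
    if not training or rate == 0.0:
        return list(inputs)  # at inference time the inputs pass through unchanged
    keep_scale = 1.0 / (1.0 - rate)  # rescale survivors so the expected sum is preserved
    return [0.0 if rng.random() < rate else x * keep_scale for x in inputs]


def flatten(batch):
    """Flatten each 2D sample to 1D; the number of samples (batch size) is unchanged."""
    return [[value for row in sample for value in row] for sample in batch]


activations = [1.0, 2.0, 3.0, 4.0]
# During training, each value is either dropped (0.0) or doubled (rate=0.5).
print(dropout(activations, rate=0.5, training=True, rng=random.Random(0)))
# Outside training, nothing is dropped.
print(dropout(activations, rate=0.5, training=False))  # [1.0, 2.0, 3.0, 4.0]

batch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 samples, each of shape 2x2
print(flatten(batch))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

In Keras itself, the same switch between training and inference behaviour is handled by the framework rather than by an explicit `training` argument in user code.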

### 2. Activation Functions

Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron's input is relevant for the model's prediction.

The information below is taken from: [Activation Function Cheat Sheets](https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html)

* **Linear**: Pass through activation function. Usually used on the output layer of a regression neural network.
![Linear.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/linear.png?raw=true)
* **ELU**: Exponential linear unit. Tends to converge the cost to zero faster and produce more accurate results. Can produce negative outputs.
![ELU.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/elu.PNG?raw=true)
* **SELU**: Scaled exponential linear unit, essentially ELU multiplied by a scaling constant.
* **SoftPlus**: Softplus activation function, $\log(\exp(x) + 1)$. [Introduced](https://papers.nips.cc/paper/1920-incorporating-second-order-functional-knowledge-for-better-option-pricing.pdf) in 2001.
* **tanh**: Classic neural network activation function, though often replaced by the ReLU family in modern networks.
![tanh.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/TanH.PNG?raw=true)
* **softsign**: Softsign activation function, $x / (|x| + 1)$. Similar to tanh, but not widely used.
* **ReLU**: Very popular neural network activation function. Used for hidden layers; cannot output negative values. No trainable parameters.
![ReLU.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/Relu.PNG?raw=true)
* **sigmoid**: Classic neural network activation function. Often used on the output layer of a binary classifier.
![Sigmoid.png](https://github.com/smit585/ReinforcementLearning/blob/BayesianOptimization/GitLab%20Images/sigmoid.PNG?raw=true)
* **hard_sigmoid**: Less computationally expensive variant of sigmoid.
* **exponential**: Exponential (base e) activation function.
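
Several of the functions listed above can be written directly from their definitions. The sketch below uses scalar, pure-Python versions for readability. These are not the Keras implementations: they are naive (for example, `math.exp` overflows for very large inputs), and the SELU constants `alpha` and `scale` are the commonly published values, stated here as an assumption.

```python
import math


def linear(x):
    return x  # pass-through


def relu(x):
    return max(0.0, x)  # never negative


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    return math.log(math.exp(x) + 1.0)


def softsign(x):
    return x / (abs(x) + 1.0)


def elu(x, alpha=1.0):
    # Identity for positive inputs, smooth negative saturation otherwise.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # ELU multiplied by a scaling constant.
    return scale * elu(x, alpha)


functions = [linear, relu, sigmoid, math.tanh, softplus, softsign, elu, selu]
for fn in functions:
    # Evaluate each activation at a negative, zero, and positive input.
    print(fn.__name__, [round(fn(v), 3) for v in (-2.0, 0.0, 2.0)])
```

Comparing the printed rows shows the properties noted in the list: ReLU never returns a negative value, while ELU and SELU do for negative inputs.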
