from IPython.display import Image

# Deep Neural Network Training 

## Vanishing and Exploding Gradient

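During backpropagation, gradients often get smaller and smaller as the algorithm progresses down to the lower layers, so the weights of those layers are left virtually unchanged and training never converges to a good solution. This is the vanishing gradients problem. In the opposite case, the gradients grow larger and larger until the weight updates diverge, which is the exploding gradients problem.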

Looking at the logistic activation function, you can see that when
inputs become large (negative or positive), the function saturates at 0 or 1, with a
derivative extremely close to 0. Backpropagation then has almost no gradient left to pass back through the network.

![title](../Images/Hands_on_ml/sigmoid.png)
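
The saturation described above can be checked numerically. The following is a minimal sketch using only the standard library; the function names `sigmoid` and `sigmoid_grad` are illustrative, and the depth estimate ignores the weight terms that also enter the real gradient.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0 and collapses as |z| grows.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid = {sigmoid(z):.5f}  derivative = {sigmoid_grad(z):.6f}")

# Backpropagation multiplies one such factor per layer, so even in the
# best case (derivative at z = 0) the product shrinks geometrically with depth.
best_case_10_layers = sigmoid_grad(0.0) ** 10
print(best_case_10_layers)
```

Because the derivative never exceeds its value at z = 0, stacking many logistic layers can only make this product smaller.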

### Xavier and He initialization

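In their 2010 paper, Glorot and Bengio propose a way to significantly alleviate this problem.

The authors argue that the signal must flow properly in both directions: the variance of the outputs of each layer should be equal to the variance of its inputs, and the gradients should have equal variance before and after flowing through a layer in the reverse direction.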

![title](../Images/Hands_on_ml/11_func_table.png)
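
As a rough illustration, the sketch below draws weights with the commonly cited Glorot (Xavier) formulas and with the He formula usually paired with ReLU. The constants in the table image above are not reproduced here and may differ from these variants; the helper names and the layer sizes (300 inputs, 100 outputs) are assumptions chosen for the example.

```python
import math
import random

def glorot_normal(fan_in, fan_out, rng):
    """Weights ~ N(0, sigma^2) with sigma = sqrt(2 / (fan_in + fan_out))."""
    sigma = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]

def glorot_uniform(fan_in, fan_out, rng):
    """Weights ~ U(-r, r) with r = sqrt(6 / (fan_in + fan_out))."""
    r = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-r, r) for _ in range(fan_out)] for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """Weights ~ N(0, sigma^2) with sigma = sqrt(2 / fan_in), often used with ReLU."""
    sigma = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]

def variance(matrix):
    """Population variance of all entries of a weight matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(42)   # fixed seed so the run is repeatable
fan_in, fan_out = 300, 100

w_normal = glorot_normal(fan_in, fan_out, rng)
w_uniform = glorot_uniform(fan_in, fan_out, rng)
w_he = he_normal(fan_in, fan_out, rng)

# Both Glorot variants target the same variance, 2 / (fan_in + fan_out).
print("target :", 2.0 / (fan_in + fan_out))
print("normal :", variance(w_normal))
print("uniform:", variance(w_uniform))
print("he     :", variance(w_he), "target", 2.0 / fan_in)
```

The uniform and normal variants differ only in the shape of the distribution: a uniform distribution on (-r, r) has variance r²/3, which is why r uses 6 where sigma uses 2.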

### Nonsaturating Activation Functions

One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function.

ReLU suffers from a problem known as dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. This is especially likely if you use a large learning rate.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. 

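This function is defined as $\text{LeakyReLU}_\alpha(z) = \max(\alpha z, z)$, where the hyperparameter α sets the slope of the function for z < 0, that is, how much it "leaks".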

![title](../Images/Hands_on_ml/11_func_leakyrelu.png)
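
A minimal sketch of the definition, next to plain ReLU for comparison. The function names are illustrative, and the formula assumes 0 ≤ α ≤ 1 so that `max` picks the right branch.

```python
def relu(z):
    """Plain ReLU: zero output and zero gradient for every negative input."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): identity for z >= 0, slope alpha for z < 0."""
    return max(alpha * z, z)

def leaky_relu_grad(z, alpha=0.01):
    """Slope of the leaky ReLU; never exactly 0, so the neuron can recover."""
    return 1.0 if z > 0 else alpha

for z in (-5.0, -1.0, 0.0, 3.0):
    print(z, relu(z), leaky_relu(z), leaky_relu(z, alpha=0.2), leaky_relu_grad(z))
```

The small negative slope is what keeps a gradient flowing for negative inputs, which is the mechanism that lets a "dead" unit start learning again.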

In a 2015 empirical comparison of ReLU variants, setting α = 0.2 (a huge leak) seemed to result in better performance than α = 0.01 (a small leak).

Another variant is the parametric leaky ReLU (PReLU), where α is learned during training: instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter.

This was reported to strongly outperform ReLU on large image datasets, but on smaller
datasets it runs the risk of overfitting the training set.

![title](../Images/Hands_on_ml/11_func_ELU.png)

The exponential linear unit (ELU), proposed in a 2015 paper by Clevert et al., outperformed all the ReLU variants in the authors' experiments: training time was reduced and the neural network performed better on the test set.
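
For reference, the ELU is usually written as α(exp(z) − 1) for z < 0 and z otherwise. The sketch below implements that form with the standard library; the function names are illustrative.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * math.expm1(z)

def elu_grad(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

# For very negative inputs the output approaches -alpha instead of growing
# without bound, and the gradient stays nonzero for moderately negative z.
for z in (-10.0, -1.0, 0.0, 2.0):
    print(z, elu(z), elu_grad(z))
```

With α = 1 the slope is continuous at z = 0, since both branches have derivative 1 there.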

Although your mileage will vary, in general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs, since the exponential in ELU is slower to compute.

If you don’t want to tweak yet another hyperparameter, you may just use the commonly used default α values (0.01 for the leaky ReLU, and 1 for ELU).

### Batch Normalization

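Batch Normalization is a technique that addresses the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.

Just before the activation function of each layer, the operation zero-centers and normalizes the inputs over the current mini-batch, then scales and shifts the result using two new parameters per layer. In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.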

![title](../Images/Hands_on_ml/11_Batch_normalize.png)
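
The normalization step can be sketched for a single feature as follows. This is an illustrative standard-library version, not a library implementation: `gamma` (scale), `beta` (shift), and the smoothing term `eps` are the usual names, and the batch values are made up.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    m = len(batch)
    mean = sum(batch) / m                          # mini-batch mean
    var = sum((x - mean) ** 2 for x in batch) / m  # mini-batch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * xh + beta for xh in x_hat]     # learned scale and shift

batch = [10.0, 12.0, 14.0, 16.0, 18.0]
out = batch_norm(batch, gamma=2.0, beta=0.5)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((o - out_mean) ** 2 for o in out) / len(out))
print(out_mean, out_std)  # close to beta and gamma respectively
```

Whatever the scale of the incoming values, the output has mean close to `beta` and standard deviation close to `gamma`, which is how the layer gets to choose its own input distribution.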

Batch Normalization usually improves the quality of the model considerably, but it adds computation to every layer, so each training step and each prediction take longer.

In TensorFlow 1.x, you can apply Batch Normalization with `tf.layers.batch_normalization()`.

You can use `functools.partial` to avoid repeating the same arguments for every Batch Normalization layer.
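
The idea looks like this. To keep the sketch runnable without TensorFlow, `batch_norm_layer` is a hypothetical stand-in that only records its arguments; in real code the wrapped function would be the actual batch normalization layer.

```python
from functools import partial

def batch_norm_layer(inputs, training, momentum=0.99, epsilon=1e-3):
    """Hypothetical stand-in for a real layer; it just returns its arguments."""
    return {"inputs": inputs, "training": training,
            "momentum": momentum, "epsilon": epsilon}

# Without partial, every call repeats the same keyword arguments.
bn1 = batch_norm_layer("hidden1", training=True, momentum=0.9)
bn2 = batch_norm_layer("hidden2", training=True, momentum=0.9)

# With partial, the shared arguments are bound once.
my_batch_norm = partial(batch_norm_layer, training=True, momentum=0.9)
bn1_short = my_batch_norm("hidden1")
bn2_short = my_batch_norm("hidden2")

# A bound keyword argument can still be overridden at call time.
bn_eval = my_batch_norm("hidden1", training=False)
print(bn1_short == bn1, bn_eval["training"])
```

Binding the arguments once also means a change to the momentum or the training flag happens in a single place.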

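Whenever you run an operation that depends on the Batch Normalization layer, you need to set the `is_training` placeholder to True (during training, when the layer uses the statistics of the current mini-batch) or False (during inference, when it uses the moving averages accumulated during training).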
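
To see why the flag matters, here is a small single-feature sketch that keeps moving averages during training and uses them at inference time. The class `BatchNormFeature`, its attribute names, and the momentum value are assumptions made for the illustration; real layers also learn the scale and shift, which stay fixed here.

```python
import math

class BatchNormFeature:
    """Batch normalization for one feature, with a training/inference switch."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0      # kept fixed in this sketch
        self.momentum, self.eps = momentum, eps
        self.moving_mean, self.moving_var = 0.0, 1.0

    def __call__(self, batch, training):
        if training:
            # Use the statistics of the current mini-batch and update the averages.
            m = len(batch)
            mean = sum(batch) / m
            var = sum((x - mean) ** 2 for x in batch) / m
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Use the averages accumulated during training.
            mean, var = self.moving_mean, self.moving_var
        return [self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta
                for x in batch]

bn = BatchNormFeature(momentum=0.9)
batch = [10.0, 12.0, 14.0, 16.0, 18.0]
for _ in range(200):                 # repeated training steps on the same batch
    bn(batch, training=True)

single_eval = bn([18.0], training=False)[0]   # normalized with the moving averages
single_train = bn([18.0], training=True)[0]   # a batch of one normalizes to beta
print(single_eval, single_train)
```

In training mode a single example is its own batch mean, so it always maps to `beta` regardless of its value; this is the kind of silent error that a wrong flag at prediction time produces.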