# Chapter 11: Training deep neural networks


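Deep neural networks suffer from vanishing gradients (too small for the lower layers to learn anything) or exploding gradients (so large that training diverges).
This was one of the reasons they were largely abandoned for a while.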

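Xavier Glorot and Yoshua Bengio showed in 2010 that this was caused by the combination of the logistic activation function (its mean output is 0.5 rather than 0, and its gradient is close to 0 at the edges, where the output saturates at 0 or 1) and the weight initialization used at the time.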
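
To make the saturation point concrete, here is a minimal sketch in plain Python (the function names are mine, not from the paper). It evaluates the logistic function and its derivative $\sigma(z)(1 - \sigma(z))$ at a few inputs. The derivative peaks at 0.25 for $z = 0$ and shrinks toward 0 as the output approaches 0 or 1.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: s * (1 - s), where s = logistic(z)."""
    s = logistic(z)
    return s * (1.0 - s)

# Print the output and the gradient for inputs from deep saturation to the center.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z={z:6.1f}  output={logistic(z):.5f}  gradient={logistic_grad(z):.5f}")
```

Backpropagation multiplies factors like these layer by layer, so even in the best case (0.25 per layer, ignoring the weights) ten logistic layers contribute a factor below $10^{-6}$.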

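The outcome was that the variance of each layer's outputs should match the variance of its inputs, and likewise for the gradients flowing backward. That can only hold exactly when fan-in equals fan-out for a layer, so the initialization is done in a crafty way that compromises between the two. This is now called Xavier (or Glorot) initialization: $\sigma^2 = 1/{fan}_{avg}$, where ${fan}_{avg}$ is the average of fan-in and fan-out.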


Another approach is to use Rectified Linear Units (ReLU) with another kind of initialization called He Initialization. $\sigma^2 = 2/{fan}_{in}$

A third approach is Scaled Exponential Linear Units (SELU) with a third initialization mechanism called LeCun Initialization. $\sigma^2 = 1/{fan}_{in}$

In Keras, this is set per layer through the `kernel_initializer` argument, e.g. `kernel_initializer="he_uniform"` or `"he_normal"`.
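
The three variance formulas are easy to compare side by side. The sketch below is plain Python, not Keras code: `init_std`, `uniform_limit`, and the scheme labels `"glorot"`, `"he"`, `"lecun"` are hypothetical names chosen for illustration, and the layer sizes are made up. It also shows how a uniform variant can match the same variance: a uniform distribution on $[-r, r]$ has variance $r^2/3$, so $r = \sqrt{3\sigma^2}$.

```python
import math

def init_std(fan_in, fan_out, scheme):
    """Standard deviation of the weights for each initialization scheme."""
    if scheme == "glorot":
        variance = 1.0 / ((fan_in + fan_out) / 2.0)  # 1 / fan_avg
    elif scheme == "he":
        variance = 2.0 / fan_in
    elif scheme == "lecun":
        variance = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return math.sqrt(variance)

def uniform_limit(std):
    """Half-width r of a uniform distribution on [-r, r] with the given std."""
    return math.sqrt(3.0) * std

# Example layer with 300 inputs and 100 outputs (arbitrary sizes).
for scheme in ("glorot", "he", "lecun"):
    std = init_std(fan_in=300, fan_out=100, scheme=scheme)
    print(f"{scheme:6s}  std={std:.4f}  uniform limit={uniform_limit(std):.4f}")
```

He initialization gives exactly twice the variance of LeCun initialization for the same fan-in, which fits the intuition that ReLU zeroes out roughly half of its inputs.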

The 2010 paper showed that just because biological neurons seem to use a roughly logistic activation, we don't have to. In fact, using it causes problems such as the saturating gradients described earlier. ReLU activation functions work well in practice, but suffer from the dying ReLU problem: once a unit's weighted input is negative for every training example, it outputs 0, its gradient is 0, and it stays stuck there. A leaky ReLU avoids this by giving negative inputs a small slope, so it outputs a small negative value instead of a flat 0 and the unit can still recover.
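
The difference between the two is easiest to see in the gradients. This is a minimal sketch in plain Python, with a leak slope of `alpha=0.01` picked as an assumed example value:

```python
def relu(z):
    """ReLU: passes positive inputs through, outputs 0 otherwise."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: 0 for negative inputs, so no learning signal there."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: negative inputs are scaled by a small slope alpha."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: alpha for negative inputs, never exactly 0."""
    return 1.0 if z > 0 else alpha

for z in (-3.0, -0.5, 2.0):
    print(f"z={z:5.1f}  relu={relu(z):.3f} (grad {relu_grad(z):.2f})  "
          f"leaky={leaky_relu(z):.3f} (grad {leaky_relu_grad(z):.2f})")
```

A unit whose input is negative for every example gets a ReLU gradient of 0 everywhere, so its weights stop updating. With the leaky version the gradient is small but nonzero, so the weights keep moving.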

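The exponential linear unit (ELU) replaces the negative side of ReLU with an exponential function shifted down by 1, so the output approaches $-\alpha$ (instead of 0) as $z \to -\infty$ and is 0 at $z = 0$: ${ELU}(z) = \alpha(\exp(z) - 1)$ for $z < 0$, and ${ELU}(z) = z$ for $z \ge 0$. With $\alpha = 1$ it is smooth everywhere, including at $z = 0$. Its gradient is nonzero for negative inputs, which helps with vanishing gradients and dead units. It is slower to compute than ReLU.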
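
A small sketch of ELU in plain Python, assuming the common piecewise definition ($\alpha(\exp(z) - 1)$ for negative $z$, and $z$ itself otherwise) with $\alpha = 1$ as the default. The function names are mine.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

def elu_grad(z, alpha=1.0):
    """Gradient of ELU: 1 for z >= 0, alpha * exp(z) for z < 0."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-50.0, -1.0, 0.0, 2.0):
    print(f"z={z:6.1f}  elu={elu(z):.5f}  gradient={elu_grad(z):.5f}")
```

For very negative inputs the output flattens out near $-\alpha$ and the gradient becomes tiny, but unlike ReLU it is never exactly 0. The call to `math.exp` is the extra cost compared to ReLU.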

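SELU only self-normalizes under specific conditions: a plain sequential network (no skip connections), the `"lecun_normal"` initializer, standard-scaled inputs (mean 0, variance 1), and SELU in every hidden layer. When these hold, the output of each layer tends to keep mean 0 and variance 1 during training. This is desirable because it counters vanishing and exploding gradients.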
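
The self-normalizing behavior can be poked at with a toy simulation. This is only a sketch under stated assumptions: plain Python, random untrained weights drawn with variance $1/{fan}_{in}$ (the LeCun formula), standard normal inputs, and arbitrary small sizes so it runs quickly. `simulate` and its parameters are hypothetical names; the two constants are the standard published SELU values.

```python
import math
import random

# Standard published SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: a scaled ELU with fixed alpha and scale."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def simulate(n_units=64, n_layers=10, n_samples=100, seed=0):
    """Push standardized inputs through a stack of SELU layers.

    Returns the mean and variance of the final layer's activations.
    """
    rng = random.Random(seed)
    # Standardized inputs: each feature drawn with mean 0 and variance 1.
    batch = [[rng.gauss(0.0, 1.0) for _ in range(n_units)] for _ in range(n_samples)]
    std = math.sqrt(1.0 / n_units)  # LeCun normal: variance 1 / fan_in
    for _ in range(n_layers):
        # One weight row per output unit; no bias, no skip connections.
        weights = [[rng.gauss(0.0, std) for _ in range(n_units)] for _ in range(n_units)]
        batch = [[selu(sum(w * x for w, x in zip(row, sample))) for row in weights]
                 for sample in batch]
    values = [v for sample in batch for v in sample]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

mean, var = simulate()
print(f"after 10 SELU layers: mean={mean:.3f}, variance={var:.3f}")
```

With layers this narrow the result will not be exactly 0 and 1, but it should stay in that neighborhood instead of collapsing or blowing up as depth grows. Swapping `selu` for another activation, or changing `std`, is a quick way to see the conditions matter.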


The initialization choices listed above only help at the start of training. During training, the intermediate layers can still end up with poor gradients. Batch Normalization adds an extra layer that, during training, estimates the mean and variance of its inputs over the current mini-batch and standardizes the inputs with them. It then scales and shifts the result using two parameters per input, which are learned by backpropagation like any other weights. After training there is no mini-batch to rely on, so the layer standardizes its inputs with running averages of the means and variances collected during training, and then applies the learned scale and shift.
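
The mechanics can be sketched for a single feature in plain Python. This is an illustration, not how any library implements it: the class name, the `momentum` and `eps` defaults, and the fixed `gamma`/`beta` values are assumptions, and the backpropagation step that would actually learn `gamma` and `beta` is left out.

```python
import math

class BatchNormOneFeature:
    """Batch normalization for one scalar feature (forward pass only)."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.gamma = 1.0        # output scale (would be learned by backprop)
        self.beta = 0.0         # output shift (would be learned by backprop)
        self.moving_mean = 0.0  # running average of batch means
        self.moving_var = 1.0   # running average of batch variances
        self.momentum = momentum
        self.eps = eps          # avoids division by zero

    def __call__(self, batch, training):
        if training:
            # Statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update the running averages used later at inference time.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # No batch statistics at inference: use the running averages.
            mean, var = self.moving_mean, self.moving_var
        # Standardize, then scale and shift.
        return [self.gamma * (x - mean) / math.sqrt(var + self.eps) + self.beta
                for x in batch]

bn = BatchNormOneFeature()
print(bn([1.0, 2.0, 3.0, 4.0], training=True))   # standardized with batch statistics
print(bn([1.0, 2.0, 3.0, 4.0], training=False))  # standardized with running averages
```

The two calls give different outputs for the same inputs, because after a single training batch the running averages have barely moved from their initial values. They only become useful estimates after many batches.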

I don't have a good sense of it though. The standard-scaling of the input and the scale-and-shift of the output both happen inside the same BN layer, so the real question is where that layer goes: before a hidden layer's activation function or after it. Not sure which works better in practice.




