# Training Deep Neural Networks

In Chapter 10, we introduced ANNs and trained our first deep neural networks, but they were fairly shallow, with only a few hidden layers. What if you need to tackle a much more complex problem, such as detecting hundreds of different types of objects in high-resolution images? You may need to train a much deeper neural network, perhaps with 10 or more hidden layers. But training networks that deep presents its own challenges:

* You may be faced with the _vanishing gradients_ problem or the related _exploding gradients_ problem, in which the gradients become smaller and smaller or larger and larger as they flow backward through the network during backpropagation, making the lower layers very hard to train.
* You might not have enough training data for such a large network, or it might be too costly to label.
* Training might be too slow.
* A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

## The Vanishing / Exploding Gradients Problem

As discussed in the previous chapter, backpropagation works by going from the output layer to the input layer, propagating the error gradients along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.
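
The update step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Keras implements it: the function name `gradient_descent_step`, the toy cost function, and the learning rate of 0.1 are all assumptions chosen for the example.

```python
def gradient_descent_step(params, grads, learning_rate=0.1):
    """Return updated parameters: theta <- theta - learning_rate * gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Toy cost J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# In a real network, backpropagation would supply the gradients instead.
w = [0.0]
for _ in range(50):
    grads = [2 * (w[0] - 3)]
    w = gradient_descent_step(w, grads)

print(w[0])  # moves toward the minimum at w = 3
```

If the gradient fed into this step is close to 0, the parameter barely moves, which is the mechanism behind the problems discussed next.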

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layers' weights virtually unchanged, which is why it is called the `Vanishing Gradients Problem`.

In some cases, the opposite can happen: the gradients grow bigger and bigger until layers get extremely large weight updates and the algorithm diverges. This is called the `Exploding Gradients Problem`, and it is more common in recurrent neural networks.

More generally, deep neural networks suffer from unstable gradients: different layers may learn at widely different speeds.
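
To make the effect concrete, here is a small sketch that backpropagates through a toy chain of single-unit layers. It is an illustration under simplifying assumptions (one neuron per layer, no biases, the same weight in every layer), and `backprop_factors` is a hypothetical helper, not a library function. Each layer multiplies the gradient by its weight times the local derivative of its activation function, so the product shrinks or grows exponentially with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_factors(weight, n_layers, activation="sigmoid", x=0.5):
    """Gradient magnitudes seen at each layer, from the top layer down to the input."""
    # Forward pass: store each layer's pre-activation value z.
    zs, a = [], x
    for _ in range(n_layers):
        z = weight * a
        zs.append(z)
        a = sigmoid(z) if activation == "sigmoid" else z
    # Backward pass: chain rule, multiplying local derivatives layer by layer.
    grad, grads = 1.0, []
    for z in reversed(zs):
        local = sigmoid(z) * (1.0 - sigmoid(z)) if activation == "sigmoid" else 1.0
        grad *= local * weight
        grads.append(abs(grad))
    return grads

# Sigmoid derivative is at most 0.25, so with weight 1 the gradient vanishes.
vanishing = backprop_factors(weight=1.0, n_layers=10)
# A linear chain with weight 1.5 multiplies the gradient by 1.5 per layer.
exploding = backprop_factors(weight=1.5, n_layers=10, activation="linear")

print(vanishing[0], vanishing[-1])
print(exploding[0], exploding[-1])
```

The first value in each list is the gradient just below the output layer and the last is the gradient reaching the input, so comparing them shows how differently the top and bottom layers would be updated.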

## Glorot and He Initialization

`Glorot initialization` (also known as Xavier initialization) and `He initialization` are techniques for initializing the weights of a neural network. They are designed to mitigate the vanishing/exploding gradients problem in deep neural networks by giving the optimization process a better starting point: the variance of the random initial weights is scaled according to the number of inputs and outputs of each layer.

`Glorot initialization` is recommended for layers with no activation function (`None`) or with the tanh, logistic (sigmoid), or softmax activation functions, and `He initialization` is recommended for ReLU and its variants.
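
As a rough sketch of what the two schemes do, the snippet below samples weights in plain Python. It assumes the usual definitions: Glorot uniform draws from the range ±sqrt(6 / (fan_in + fan_out)), giving a variance of 2 / (fan_in + fan_out), and He normal draws from a Gaussian with variance 2 / fan_in. The helper names are hypothetical, and Keras's own initializers may differ in detail (for example, by truncating the normal distribution).

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix with variance 2 / (fan_in + fan_out)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def variance(matrix):
    flat = [w for row in matrix for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)

rng = random.Random(42)  # fixed seed so the run is repeatable
fan_in, fan_out = 200, 100
glorot_var = variance(glorot_uniform(fan_in, fan_out, rng))
he_var = variance(he_normal(fan_in, fan_out, rng))

print(f"Glorot uniform: sampled {glorot_var:.5f}, target {2.0 / (fan_in + fan_out):.5f}")
print(f"He normal:      sampled {he_var:.5f}, target {2.0 / fan_in:.5f}")
```

He initialization uses a larger variance than Glorot for the same layer shape, which compensates for ReLU zeroing out roughly half of the pre-activations.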

By default, Keras uses Glorot initialization with a uniform distribution. When creating a layer, you can switch to He initialization by setting `kernel_initializer="he_uniform"` or `kernel_initializer="he_normal"`:

In [2]:
import tensorflow as tf
import keras 

print(f"Tensorflow: {tf.__version__}")
print(f"keras: {keras.__version__}")

Tensorflow: 2.13.0
keras: 2.13.1


In [3]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<keras.src.layers.core.dense.Dense at 0x162d26980>

## Nonsaturating Activation Functions

One of the key insights in the 2010 paper by Glorot and Bengio was that the problems with unstable gradients were in part due to a poor choice of activation function: the sigmoid, which saturates for inputs of large magnitude. It turns out that the ReLU activation function performs much better in DNNs, mostly because it does not saturate for positive values and because it is fast to compute.

However, the ReLU activation function is not perfect. It suffers from a problem known as dying ReLUs: during training, some neurons effectively "die", meaning they stop outputting anything other than 0. A neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. Once a ReLU neuron is in this state, its gradient is 0 for every instance, so it stops learning and does not contribute to the gradients during backpropagation.
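
The following toy example illustrates a dead neuron. The inputs, weights, and bias are made-up values chosen so that the weighted sum is negative for every instance; nothing here comes from a real training run.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

# Hypothetical training set with non-negative features.
inputs = [[0.2, 0.7], [1.0, 0.1], [0.5, 0.5]]
# Weights and bias that have drifted to negative values.
weights, bias = [-1.0, -2.0], -0.1

# Weighted sum, output, and local gradient for each instance.
zs = [sum(w * x for w, x in zip(weights, row)) + bias for row in inputs]
outputs = [relu(z) for z in zs]
local_grads = [relu_grad(z) for z in zs]

print(zs)           # every weighted sum is negative
print(outputs)      # so the neuron outputs 0 for every instance
print(local_grads)  # and passes back a gradient of 0 every time
```

Because the local gradient is 0 for all instances, the gradient with respect to the weights is also 0, and Gradient Descent has nothing to push the neuron back into its active range.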

To solve this problem, you may want to use a variant of the ReLU function such as the Leaky ReLU. The Leaky ReLU introduces a small, nonzero slope for negative input values, so the gradient is never exactly 0 and a neuron always has a chance to recover.
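
A minimal sketch of the function and its derivative in plain Python follows. The slope parameter is called `alpha` here and defaults to 0.01; both the name and the value are assumptions for illustration, not fixed by the text above.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for positive inputs, alpha * z for negative inputs."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha (small but nonzero) otherwise."""
    return 1.0 if z > 0 else alpha

zs = [-2.0, -0.5, 0.0, 1.5]
outputs = [leaky_relu(z) for z in zs]
grads = [leaky_relu_grad(z) for z in zs]

print(outputs)  # negative inputs are scaled down rather than zeroed
print(grads)    # the gradient is never exactly 0
```

In Keras 2.x, the corresponding layer is `keras.layers.LeakyReLU`, whose slope argument is named `alpha` in that version line; check the documentation for the version you have installed, since the argument name may differ in other releases.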