# Training Deep Neural Networks

In chapter 10, we introduced ANN's and trained our first deep neural networks, but they were shallow in nature with only a few hidden layers. What if we tackle a much more complex problem, such as detecting hundreds of different types of objects in high-resolution images? You may need to train a much deeper neural network - perhaps with 10 or more hidden layers. But training networks that deep present their own challenges:

* You may be faced with the _vanishing grandients_ problem or the related _exploding gradients problem_, which is when the gradients during backpropogation become smaller and smaller or larger and larger, making the lower layers very hard to train
* You might not have enough training data for such a large network, or it might be too costly to label.
* Training might be too slow.
* A model with millions of parameters would severely risk overfitting the training set, especially if there are not enough training instances or if they are too noisy.

## The Vanishing / Exploding Gradients Problem

As discussed in the previous chapter, backpropogation works by going from the ouput layer to the input layer, propogating the the error gradients along the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.

Unfortuantely, gradients often get smaller and smaller as the algorithm progress down to the lower layers, and as a result, the GD update leaves the lower layers weights virtually unchanged, which is why it's called the `Vanishing Gradients Problem`.

In some cases, the opposite can happen - the gradients can grow bigger and bigger until layers get insanely large weight updates and the algorithm diverges, which is called the `Exploding Gradients Problem`. The `Exploding Gradients Problem` is more common in recurrent neural networks.

More generally, deep neural networks suffer from unstable gradients - different layers may learn at widely different speeds.

## Glorot and He Initialization

`Glorot initialization` (also known as Xavier initialization) and `He initialization` are techniques for initializing the weights of a neural network, and are designed to address the `The Vanishing / Exploding Gradients Problem` of training deep neural networks by providing a better starting point for the optimization process.

`Glorot initialization` is recommended for tanh- and sigmoid-based activations functions (e.g None, tanh, logistic, softmax) and `He initialization` is recommended for ReLU-based activation functions.

By default keras uses Glorot initialization with a uniform distribution. When creating a layer, you can change to He initialization by setting `kernal_initializer="he_uniform"` or `kernal_initializer="he_normal"`:

In [1]:
import tensorflow as tf
import keras 

print(f"Tensorflow: {tf.__version__}")
print(f"keras: {keras.__version__}")

Tensorflow: 2.13.0
keras: 2.13.1


In [3]:
keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")

<keras.src.layers.core.dense.Dense at 0x162d26980>

## Nonsaturating Activation Functions

One of the key insights in the 2010 paper by Glorot and Benigio was that the problems with unstable graidents were in part due to a poor choice of activation functions. But it turns out that the ReLU activation function performs much better in DNN's, mostly because it does not saturate for positive values and because it is fast to compute.

Howeverm the ReLU activation function is not perfect - it suffers from a problem known as the dying ReLU's: during training, some neurons effectively "die", meaning they stop outputting anything other than 0. A neuron dies when it's weights get tweaked in such a way that the weighted sum of it's inputs are negative for all instances in the training set. Once a ReLU neuron is in this state, it stops learning and does not contribute to the gradients during backpropagation.

To solve this problem, you may want to use a a variants of the ReLU function.

### Leaky ReLU

The Leaky ReLU introduces a small, non-zero slope for the negative values of the input, which is controlled by the hyperparameter `alpha`, which is typically set `0.01`. How large or small `alpha` is controlls how _leaky_ the ReLU functions becomes.

To use the Leaky ReLU in Keras:

In [4]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(alpha=0.2)
])

2023-11-29 18:43:58.704269: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1
2023-11-29 18:43:58.704312: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2023-11-29 18:43:58.704322: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
2023-11-29 18:43:58.704506: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-11-29 18:43:58.704835: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


### Exponential Linear Unit (ELU)

The ELU activation function looks a lot like the ReLU function, with a few major differences:

* It takes on negative values when `z < 0`, which allows the unit to have an average output closer to 0 and helps allevaite the vanishing graidents problem. The `alpha` hyperparameter defines the value that the ELU function approcahes `z` is a large negative number, and it is usually set to 1.
* It has a non-zero gradient for `z < 0`, which helps it avoid the dead neurons problem.
* If `alpha` is equal to 1 then the function is smooth everywhere, includingaround `z = 0`, which helps speed up GD since it does not bounce as much to the left and right of `z = 0`.

The main drawback of the ELU activcation function is that it is slower to compute than the ReLU function and it's variants (due to it's use of exponential function), even though it's faster convergence rate during training will compensate for that slow computation.

In [5]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer="he_normal"),
    keras.layers.ELU(alpha=0.1)
])

### Scaled Exponential Linear Unit

Scaled Exponential Linear Unit (SELU) is a variation of the Exponential Linear Unit (ELU) activation function that is it is designed to induce self-normalizing properties in a neural network. The SELU activation function often signficantly outperforms other activation functions for DNN's, however, it needs to meet a few conditions in order to work properly:

* The input features must be standardized (mean of 0 and a standard deviation of 1)
* Every hidden layers weights must be initialized with LeCun normal initialization. In Keras, this means setting `kernal_initializer="lecun_normal"`.
* The networks architecture must be Sequential.
* Every hidden layer must be Dense, although there is some research that shows SeLU can improve performance in CNN's as well.

To use SeLU in Keras:

In [6]:
keras.layers.Dense(10, activation="selu", kernel_initializer="lecun_normal")

<keras.src.layers.core.dense.Dense at 0x16298d510>

## Batch Normalization

Although using He initialization calong with ELU (or any varianet of ReLU) can significantly reduce the danger of the vanishing / exploding gradients problems at the beggining of training, it doesn't guarantee that they won't come back during training.

`Batching Normalization` is a technique designed to address this problem by adding an operation in the model just before or after the activatino function of each hidden layer. This operation simply zero-centers and normalizes each input, then scales and shifts the result using two new paremeter vectors per layer: one for scaling, and the other for shifting. In other words, the oepration lets the model learn the optimal sacle and mean of each of the layers inputs. In many cases if you ad a BN layer as the first layer of your network, you do not need to standardize your training set!

The Batch Normalization Algorithm works by: 
1.  Calculate the mean of each input across the entire mini-batch.
2. Calculate the standard deviation of each input across the entire mini-batch.
3. Standardize each input instance in the mini-batch using the mean and standard deviation calculated above.
4. Offset and scale each instance after standardization using a different offset parameter and scaling parameter for each instance.

Each input will have its own mean, standard deviation, offset and scaling parameters & the offset and scaling parameter vectors are learned through backpropagation. The mean and standard deviation parameter vectors are calculated per mini-batch during training time and not learned during backpropagation.Additionally, batch normalization has regularization effects, reducing the need for other regularization techniques like dropout in some cases. 

However, there is a runtime penalty: Batch Normalization opten slows down the network when making predictions since there are more computations to perform. However, it is possible to fuse a Batch Normalization layer with previous layer after training to avoid this, through updating the previous layer's weights and biases so that its output matches the scale and offset of the Batch Normalization layer.

### Implementing Batch Normalization with Keras

As with most things with Keras, implementing Batch Normalization is simple and intuitive. Just add a `BatchNormalization` layer before or after each hidden layers activation function, and optionally add a BN layer as well as the first layer in your model:

In [12]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_4 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization_10 (Ba  (None, 784)               3136      
 tchNormalization)                                               
                                                                 
 dense_13 (Dense)            (None, 300)               235500    
                                                                 
 batch_normalization_11 (Ba  (None, 300)               1200      
 tchNormalization)                                               
                                                                 
 dense_14 (Dense)            (None, 100)               30100     
                                                                 
 batch_normalization_12 (Ba  (None, 100)              

Let's look a the parameters of the first BN layer. Two are trainianble (via backpropogtion) and two are not:

In [14]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization_10/gamma:0', True),
 ('batch_normalization_10/beta:0', True),
 ('batch_normalization_10/moving_mean:0', False),
 ('batch_normalization_10/moving_variance:0', False)]

The authors of BN argued in favor adding the BN layers before the activation functions, rather than after (as we just did), but which is preferable seems to depend on the task - you can experiment with this too to se which option works best on your dataset. To add the BN layers before the actviation functoins, you must remove the activation function from the hidden layers and add them as seperate layers after the BN layers.

Moreover, since a Batch Normalizatoin layer includes one offset parameter per input, you can remove the bias term fro mthe previous layer (just pass `use_bias=False` when creating it):

In [17]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28,28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, kernel_initializer="he_normal", use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

model.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_5 (Flatten)         (None, 784)               0         
                                                                 
 batch_normalization_13 (Ba  (None, 784)               3136      
 tchNormalization)                                               
                                                                 
 dense_16 (Dense)            (None, 300)               235200    
                                                                 
 batch_normalization_14 (Ba  (None, 300)               1200      
 tchNormalization)                                               
                                                                 
 dense_17 (Dense)            (None, 100)               30000     
                                                                 
 batch_normalization_15 (Ba  (None, 100)              

The `BatchNormalization` class has quite a few hyperparameters we can tweak, although the defaults will usually be fine:

* `momentum` - used by the BN layer when it updates to exponential moving averages. A good momentum value is typically close to 1: for example `0.9`, `0.99`, or `0.999`. You may want more 9's for larger datasets and smaller mini batches.
* `axis` - determines which axis should be normalized, and defaults to -1 meaning it will normalize the last axis.

`BatchNormalization` has become one of hte most-used layers in deep neural networks, to the point that it is often omitted in diagrams as it assumed that BN will be added after every layer!

## Gradinet Clippings