# The Vanishing/Exploding Gradients Problems

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np

import tensorflow as tf
from tensorflow import keras

print(tf.__version__)
print(keras.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

2.2.0
2.3.0-tf
Num GPUs Available:  1


### Glorot and He Initialization

In [2]:
keras.layers.Dense(10, activation="relu",
                   kernel_initializer="he_normal")

<tensorflow.python.keras.layers.core.Dense at 0x18321414e88>

In [3]:
he_avg_init = keras.initializers.VarianceScaling(scale=2.,
                                                 mode='fan_avg',
                                                 distribution='uniform')
keras.layers.Dense(10, activation='sigmoid',
                   kernel_initializer=he_avg_init)

<tensorflow.python.keras.layers.core.Dense at 0x18328a53048>
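As a rough illustration of what these initializers compute, the sketch below reproduces the variance-scaling rule in plain Python: the weight variance is `scale / fan`, where `fan` is `fan_in`, `fan_out`, or their average depending on `mode`. The helper name `variance_scaling_spread` is hypothetical (it is not a Keras function), and the extra correction Keras applies when sampling from a truncated normal distribution is left out for brevity.

```python
import math

def variance_scaling_spread(fan_in, fan_out, scale=1.0, mode="fan_in",
                            distribution="normal"):
    """Return the stddev (normal) or the limit (uniform) of the weight distribution.

    Hypothetical helper mirroring the variance-scaling rule; it ignores the
    correction used for truncated normal sampling.
    """
    fan = {"fan_in": fan_in,
           "fan_out": fan_out,
           "fan_avg": (fan_in + fan_out) / 2}[mode]
    variance = scale / max(1.0, fan)
    if distribution == "uniform":
        # A uniform distribution on [-limit, limit] has variance limit**2 / 3.
        return math.sqrt(3.0 * variance)
    return math.sqrt(variance)

# Glorot uniform: scale=1, fan_avg, uniform -> limit = sqrt(6 / (fan_in + fan_out))
glorot_limit = variance_scaling_spread(300, 100, scale=1.0, mode="fan_avg",
                                       distribution="uniform")
# He normal: scale=2, fan_in, normal -> stddev = sqrt(2 / fan_in)
he_stddev = variance_scaling_spread(300, 100, scale=2.0, mode="fan_in")
# The scale=2, fan_avg, uniform variant configured in the cell above
he_avg_limit = variance_scaling_spread(300, 100, scale=2.0, mode="fan_avg",
                                       distribution="uniform")
```

The layer sizes (300 inputs, 100 outputs) are arbitrary values chosen for the example.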

### Nonsaturating Activation Functions

In [12]:
model = keras.models.Sequential([
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.LeakyReLU(alpha=0.2),
    keras.layers.Dense(10, kernel_initializer='he_normal'),
    keras.layers.PReLU(),
    keras.layers.Dense(10, activation="selu",
                       kernel_initializer="lecun_normal")
])

So, which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If the network’s architecture prevents it from self-normalizing, then ELU may perform better than SELU (since SELU is not smooth at z = 0). If you care a lot about runtime latency, then you may prefer leaky ReLU. If you don’t want to tweak yet another hyperparameter, you may use the default α values used by Keras (e.g., 0.3 for leaky ReLU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, such as RReLU if your network is overfitting or PReLU if you have a huge training set. That said, because ReLU is the most used activation function (by far), many libraries and hardware accelerators provide ReLU-specific optimizations; therefore, if speed is your priority, ReLU might still be the best choice.
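To make the comparison concrete, here is a minimal scalar sketch of the activation functions ranked above. The function names are illustrative, and the two SELU constants are the commonly quoted values of α and the scale factor; treat this as a reading aid rather than a replacement for the Keras implementations.

```python
import math

# Commonly quoted SELU constants (alpha and scale).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.3):
    # alpha is the slope for z < 0; 0.3 matches the Keras default mentioned above.
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smoothly approaches -alpha for very negative z.
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def selu(z):
    # SELU is a scaled ELU with fixed alpha and scale.
    return SELU_SCALE * elu(z, SELU_ALPHA)

values = {name: fn(-2.0) for name, fn in
          [("relu", relu), ("leaky_relu", leaky_relu), ("elu", elu), ("selu", selu)]}
```

For a negative input such as -2.0, ReLU outputs exactly zero, whereas the other three keep a nonzero output, which is what lets gradients keep flowing.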

### Batch Normalization

In [13]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [14]:
model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_7 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_15 (Batc (None, 784)               3136      
_________________________________________________________________
dense_20 (Dense)             (None, 300)               235500    
_________________________________________________________________
batch_normalization_16 (Batc (None, 300)               1200      
_________________________________________________________________
dense_21 (Dense)             (None, 100)               30100     
_________________________________________________________________
batch_normalization_17 (Batc (None, 100)               400       
_________________________________________________________________
dense_22 (Dense)             (None, 10)                1010      

In [15]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization_15/gamma:0', True),
 ('batch_normalization_15/beta:0', True),
 ('batch_normalization_15/moving_mean:0', False),
 ('batch_normalization_15/moving_variance:0', False)]
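The four variables listed above map onto the Batch Normalization computation roughly as follows: `gamma` and `beta` are learned by backpropagation, while the moving mean and variance are updated with an exponential moving average, which is why they are reported as non-trainable. The sketch handles a single feature in plain Python; the function name is hypothetical, and `momentum=0.99` and `epsilon=1e-3` are the Keras defaults as far as I know.

```python
import math

def batch_norm_train_step(batch, gamma, beta, moving_mean, moving_var,
                          momentum=0.99, epsilon=1e-3):
    """One training-time Batch Normalization pass for a single feature."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # Standardize with the batch statistics, then scale (gamma) and shift (beta).
    outputs = [gamma * (x - mean) / math.sqrt(var + epsilon) + beta for x in batch]
    # The moving statistics are not trained; they track the batch statistics
    # and are used in place of them at inference time.
    moving_mean = momentum * moving_mean + (1 - momentum) * mean
    moving_var = momentum * moving_var + (1 - momentum) * var
    return outputs, moving_mean, moving_var

outputs, moving_mean, moving_var = batch_norm_train_step(
    [1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0, moving_mean=0.0, moving_var=1.0)

# Four parameters per input feature: gamma, beta, moving mean, moving variance.
bn_params_for_784_inputs = 4 * 784
```

The last line accounts for the 3,136 parameters reported for the first Batch Normalization layer in the summary; half of them (the moving statistics) are non-trainable.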

You may find that training is rather slow, because each epoch takes much more time when you use Batch Normalization. This is usually counterbalanced by the fact that convergence is much faster with BN, so it will take fewer epochs to reach the same performance. All in all, wall time will usually be shorter (this is the time measured by the clock on your wall).

### Gradient Clipping

In [21]:
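# Two alternatives: the second assignment overrides the first, so keep only one.
optimizer = keras.optimizers.SGD(clipvalue=1.0)  # clip every gradient component to [-1.0, 1.0]
optimizer = keras.optimizers.SGD(clipnorm=1.0)   # rescale the gradient if its ℓ2 norm exceeds 1.0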
model.compile(loss='mse', optimizer=optimizer)
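The difference between the two options is easier to see on a small vector. The sketch below is a plain-Python approximation that treats the gradient as one flat list (the helper names are hypothetical): clipping by value can change the gradient's direction, while clipping by norm preserves it.

```python
import math

def clip_by_value(grad, clip_value):
    # Clip each component independently to [-clip_value, clip_value].
    return [max(-clip_value, min(clip_value, g)) for g in grad]

def clip_by_norm(grad, clip_norm):
    # Rescale the whole vector only if its L2 norm exceeds clip_norm.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

grad = [0.9, 100.0]
clipped_value = clip_by_value(grad, 1.0)  # [0.9, 1.0]: now points mostly along the diagonal
clipped_norm = clip_by_norm(grad, 1.0)    # same direction as grad, norm 1.0
```

With `clip_by_norm`, the first component becomes almost zero because the original vector pointed almost entirely along the second axis.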

# Reusing Pretrained Layers

In [22]:
model.save("my_model_A.h5")

In [24]:
model_A = keras.models.load_model("my_model_A.h5")
# Reusing all layers except the last layer
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

Note that model_A and model_B_on_A now share some layers. When you train model_B_on_A, it will also affect model_A. If you want to avoid that, you need to clone model_A before you reuse its layers. To do this, you clone model A’s architecture with clone_model(), then copy its weights (since clone_model() does not clone the weights):

In [25]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

In [26]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False
    
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])

You must always compile your model after you freeze or unfreeze layers.

In [27]:
# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
#                            validation_data=(X_valid_B, y_valid_B))

# for layer in model_B_on_A.layers[:-1]:
#     layer.trainable = True

# optimizer = keras.optimizers.SGD(learning_rate=1e-4) # the default learning rate is 1e-2
# model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
#                      metrics=["accuracy"])
# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
#                            validation_data=(X_valid_B, y_valid_B))

# Faster Optimizers

### Momentum Optimization

In [28]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons it’s good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.
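A minimal sketch of the momentum update may help here, assuming the textbook form m ← βm − η∇ and θ ← θ + m (the function name is illustrative). The momentum hyperparameter β plays the role of friction: with a constant gradient, the velocity does not grow forever but settles at a terminal value of η∇ / (1 − β).

```python
def momentum_step(theta, m, grad, learning_rate=0.001, beta=0.9):
    # The momentum vector accumulates gradients; beta < 1 acts as friction.
    m = beta * m - learning_rate * grad
    theta = theta + m
    return theta, m

# With a constant gradient of 1.0, the velocity approaches
# -learning_rate / (1 - beta) = -0.01, i.e. 10x a plain gradient descent step.
theta, m = 0.0, 0.0
for _ in range(500):
    theta, m = momentum_step(theta, m, grad=1.0)
terminal_velocity = m
```

With β = 0.9, the terminal speed is ten times the size of a plain gradient descent step, which is why momentum optimization escapes plateaus faster.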

### Nesterov Accelerated Gradient

In [29]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
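The only change Nesterov Accelerated Gradient makes to momentum optimization is where the gradient is measured: slightly ahead, at θ + βm, rather than at θ. The sketch below follows that textbook formulation on the toy cost function f(θ) = θ²; it is not necessarily how Keras implements the update internally, and `nesterov_step` and `grad_fn` are hypothetical names.

```python
def nesterov_step(theta, m, grad_fn, learning_rate=0.001, beta=0.9):
    # Measure the gradient at the look-ahead position, not at theta itself.
    lookahead_grad = grad_fn(theta + beta * m)
    m = beta * m - learning_rate * lookahead_grad
    return theta + m, m

def grad_fn(theta):
    # Gradient of the toy cost function f(theta) = theta ** 2.
    return 2.0 * theta

theta, m = 5.0, 0.0
for _ in range(2000):
    theta, m = nesterov_step(theta, m, grad_fn, learning_rate=0.01)
```

After the loop, `theta` is very close to the minimum at 0.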

### AdaGrad

In [30]:
optimizer = keras.optimizers.Adagrad()
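As a sketch of the idea behind AdaGrad, the update below accumulates the squared gradients per parameter and divides each step by the square root of that sum, so steep dimensions are slowed down more than gentle ones. This is the textbook rule in plain Python, starting the accumulator at zero; the Keras optimizer may use a different initial accumulator value, and the function name is illustrative.

```python
import math

def adagrad_step(theta, s, grad, learning_rate=0.001, epsilon=1e-7):
    # s accumulates the squared gradients, one entry per parameter.
    s = [si + g * g for si, g in zip(s, grad)]
    # Each parameter gets its own effective learning rate.
    theta = [t - learning_rate * g / math.sqrt(si + epsilon)
             for t, g, si in zip(theta, grad, s)]
    return theta, s

# Two dimensions with very different gradient scales.
theta, s = [0.0, 0.0], [0.0, 0.0]
theta, s = adagrad_step(theta, s, grad=[100.0, 0.01])
```

Despite gradients that differ by four orders of magnitude, both parameters move by roughly the same amount on the first step, which illustrates the per-dimension scaling. Because `s` only ever grows, the steps keep shrinking, which is the weakness RMSProp addresses.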

### RMSProp

In [31]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
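The `rho` argument is the decay rate of an exponential moving average of the squared gradients. A minimal single-parameter sketch, assuming the textbook update rule (the function name is illustrative):

```python
import math

def rmsprop_step(theta, s, grad, learning_rate=0.001, rho=0.9, epsilon=1e-7):
    # Exponentially decaying average: old squared gradients fade away.
    s = rho * s + (1 - rho) * grad * grad
    theta = theta - learning_rate * grad / math.sqrt(s + epsilon)
    return theta, s

# With a constant gradient of 5.0, s settles near 5.0 ** 2 = 25.0
# instead of growing without bound.
theta, s = 0.0, 0.0
for _ in range(200):
    theta, s = rmsprop_step(theta, s, grad=5.0)
```

Since `s` stabilizes, the step size stays close to the learning rate rather than decaying toward zero over time.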

### Adam and Nadam and Adamax Optimization

In [33]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = keras.optimizers.Nadam()
optimizer = keras.optimizers.Adamax()
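Adam combines the two previous ideas: `beta_1` controls a decaying average of the gradients (as in momentum optimization) and `beta_2` a decaying average of the squared gradients (as in RMSProp), with a bias correction for the first few steps. The sketch below is the commonly published form of the update for a single parameter; the exact placement of `epsilon` can differ between implementations, and the function name is illustrative.

```python
import math

def adam_step(theta, m, s, grad, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    m = beta_1 * m + (1 - beta_1) * grad         # decaying mean of gradients
    s = beta_2 * s + (1 - beta_2) * grad * grad  # decaying mean of squared gradients
    # m and s start at 0, so they are biased toward 0 early on; t starts at 1.
    m_hat = m / (1 - beta_1 ** t)
    s_hat = s / (1 - beta_2 ** t)
    theta = theta - learning_rate * m_hat / (math.sqrt(s_hat) + epsilon)
    return theta, m, s

# Thanks to the bias correction, the very first step has a magnitude close to
# the learning rate, whatever the scale of the gradient.
theta, m, s = adam_step(1.0, 0.0, 0.0, grad=50.0, t=1)
```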

Adaptive optimization methods (including RMSProp, Adam, and Nadam optimization) are often great, converging fast to a good solution. However, a 2017 paper by Ashia C. Wilson et al. showed that they can lead to solutions that generalize poorly on some datasets. So when you are disappointed by your model’s performance, try using plain Nesterov Accelerated Gradient instead: your dataset may just be allergic to adaptive gradients. Also check out the latest research, because it’s moving fast.

##### TRAINING SPARSE MODELS


All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.

One easy way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to zero). Note that this will typically not lead to a very sparse model, and it may degrade the model’s performance.

A better option is to apply strong ℓ1 regularization during training (we will see how later in this chapter), as it pushes the optimizer to zero out as many weights as it can (as discussed in “Lasso Regression” in Chapter 4).

If these techniques remain insufficient, check out the TensorFlow Model Optimization Toolkit (TF-MOT), which provides a pruning API capable of iteratively removing connections during training based on their magnitude.
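The simplest of these options, zeroing out tiny weights after training, can be sketched in a few lines of plain Python. The threshold of 0.01 and the helper names are assumptions made for the example; this is not the TF-MOT pruning API.

```python
def prune_small_weights(weights, threshold=1e-2):
    # Set every weight whose magnitude is below the threshold to exactly zero.
    return [0.0 if abs(w) < threshold else w for w in weights]

def sparsity(weights):
    # Fraction of weights that are exactly zero.
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.5, -0.003, 0.0001, -1.2, 0.02, 0.009]
pruned = prune_small_weights(weights)
```

Here half of the weights end up at zero. In a real model trained without ℓ1 regularization, far fewer weights tend to fall below such a threshold, which is why this approach alone typically does not produce a very sparse model.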

### Learning Rate Scheduling

In [34]:
# Power scheduling
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)
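To see what the `decay` argument does, the sketch below assumes the rule used by this generation of Keras optimizers: the learning rate at a given training step is the initial rate divided by (1 + decay × step). The function name is hypothetical.

```python
def power_schedule(step, initial_lr=0.01, decay=1e-4):
    # Power scheduling with exponent 1: lr = lr0 / (1 + decay * step).
    return initial_lr / (1.0 + decay * step)

# The learning rate is halved after 1 / decay = 10,000 steps,
# then drops to a third after 20,000 steps, and so on.
lr_start = power_schedule(0)
lr_10k = power_schedule(10_000)
lr_20k = power_schedule(20_000)
```

The schedule drops quickly at first and then more and more slowly, since each further reduction takes another 10,000 steps.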


# Avoiding Overfitting Through Regularization

### ℓ1 and ℓ2 Regularization

In [35]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))
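The value passed to `l2()` is the regularization factor. As a rough sketch of what gets added to the loss for this layer, the penalty is the factor times the sum of the squared weights (and, for ℓ1, the factor times the sum of their absolute values). The helper names are illustrative.

```python
def l2_penalty(weights, factor=0.01):
    # Extra loss term contributed by an l2 kernel regularizer.
    return factor * sum(w * w for w in weights)

def l1_penalty(weights, factor=0.01):
    # Extra loss term contributed by an l1 kernel regularizer.
    return factor * sum(abs(w) for w in weights)

weights = [1.0, -2.0, 3.0]
l2_term = l2_penalty(weights)  # 0.01 * (1 + 4 + 9)
l1_term = l1_penalty(weights)  # 0.01 * (1 + 2 + 3)
```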

Since you will typically want to apply the same regularizer to all layers in your network, as well as using the same activation function and the same initialization strategy in all hidden layers, you may find yourself repeating the same arguments. This makes the code ugly and error-prone. To avoid this, you can try refactoring your code to use loops. Another option is to use Python’s functools.partial() function, which lets you create a thin wrapper for any callable, with some default argument values:

In [36]:
from functools import partial

RegularizedDense = partial(keras.layers.Dense,
                           activation='elu',
                           kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                     kernel_initializer="glorot_uniform")
])
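The output layer above relies on the fact that keyword arguments passed at call time override the defaults stored by `partial()`. A small standalone sketch, using a hypothetical `describe_layer` function in place of `keras.layers.Dense`, shows the behavior:

```python
from functools import partial

def describe_layer(units, activation="linear", kernel_initializer="glorot_uniform"):
    # Stand-in for a layer constructor: just records its arguments.
    return {"units": units,
            "activation": activation,
            "kernel_initializer": kernel_initializer}

# Thin wrapper with new default argument values.
EluLayer = partial(describe_layer, activation="elu", kernel_initializer="he_normal")

hidden = EluLayer(300)  # uses the wrapper's defaults
output = EluLayer(10, activation="softmax",
                  kernel_initializer="glorot_uniform")  # call-time arguments win
```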

### Dropout

In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).

In [37]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So make sure to evaluate the training loss without dropout (e.g., after training).
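The reason the two losses are not directly comparable is visible in a minimal sketch of (inverted) dropout, written here in plain Python with illustrative names: during training, each input is dropped with probability `rate` and the survivors are scaled up by 1 / (1 − rate), while at inference time the layer passes its inputs through unchanged.

```python
import random

def dropout(inputs, rate, training, rng=random):
    if not training:
        # Inference: dropout is a no-op.
        return list(inputs)
    keep_prob = 1.0 - rate
    # Training: drop each input with probability `rate`, scale up the rest
    # so that the expected value of each output matches the input.
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]

rng = random.Random(42)  # seeded so the example is reproducible
inputs = [1.0] * 10000
train_out = dropout(inputs, rate=0.2, training=True, rng=rng)
test_out = dropout(inputs, rate=0.2, training=False)
```

The training-time outputs are noisy (each is either 0 or 1.25 here) even though their average stays close to 1, so the training loss is measured on a handicapped network while the validation loss is not.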

### Max-Norm Regularization

In [38]:
keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",
                   kernel_constraint=keras.constraints.max_norm(1.))

<tensorflow.python.keras.layers.core.Dense at 0x183976defc8>
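As a sketch of what the constraint does after each training step, assuming it is applied to the incoming weight vector of one neuron: if the ℓ2 norm of the vector exceeds the maximum `r`, the vector is rescaled so that its norm is (almost exactly) `r`; otherwise it is left alone. The function name and the small `epsilon` guard are illustrative.

```python
import math

def max_norm_constrain(weights, max_value=1.0, epsilon=1e-7):
    # weights: incoming connection weights of a single neuron.
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    # Rescale so the norm drops to max_value (epsilon avoids division by zero).
    return [w * max_value / (norm + epsilon) for w in weights]

too_large = max_norm_constrain([3.0, 4.0])  # norm 5.0 -> rescaled to about [0.6, 0.8]
unchanged = max_norm_constrain([0.3, 0.4])  # norm 0.5 -> left as is
```

Reducing `max_value` increases the amount of regularization.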

# Default DNN configuration

| Hyperparameter | Default value |
|---|---|
| Kernel initializer | He initialization |
| Activation function | ELU |
| Normalization | None if shallow; Batch Norm if deep |
| Regularization | Early stopping (+ ℓ2 reg. if needed) |
| Optimizer | Momentum (or RMSProp or Nadam) |
| Learning rate schedule | 1cycle |

# Default DNN configuration for a self-normalizing net

| Hyperparameter | Default value |
|---|---|
| Kernel initializer | LeCun initialization |
| Activation function | SELU |
| Normalization | None (self-normalization) |
| Regularization | Alpha Dropout if needed |
| Optimizer | Momentum (or RMSProp or Nadam) |
| Learning rate schedule | 1cycle |

Don’t forget to normalize the input features! You should also try to reuse parts of a pretrained neural network if you can find one that solves a similar problem, or use unsupervised pretraining if you have a lot of unlabeled data, or use pretraining on an auxiliary task if you have a lot of labeled data for a similar task.

While the previous guidelines should cover most cases, here are some exceptions:

- If you need a sparse model, you can use ℓ1 regularization (and optionally zero out the tiny weights after training). If you need an even sparser model, you can use the TensorFlow Model Optimization Toolkit. This will break self-normalization, so you should use the default configuration in this case.

- If you need a low-latency model (one that performs lightning-fast predictions), you may need to use fewer layers, fold the Batch Normalization layers into the previous layers, and possibly use a faster activation function such as leaky ReLU or just ReLU. Having a sparse model will also help. Finally, you may want to reduce the float precision from 32 bits to 16 or even 8 bits (see “Deploying a Model to a Mobile or Embedded Device”). Again, check out TF-MOT.

- If you are building a risk-sensitive application, or inference latency is not very important in your application, you can use MC Dropout to boost performance and get more reliable probability estimates, along with uncertainty estimates.