## A large portion of this code is taken from Aurélien Géron's Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd edition). I have added comments based on information in the book as well as information I found elsewhere.

## Chapter 11

## Vanishing/Exploding gradients problem

Vanishing gradients:
  - Backpropagation passes the error gradient from the output layer down to the input layer.
    If the gradients at the upper layers are small, they get smaller still at each lower layer.
    In a Deep Neural Network with many layers, the gradients at the bottom layers end up close to zero.
    The Gradient Descent update then leaves the lower-layer weights virtually unchanged, and training never converges to a good solution.
Exploding gradients:
  - If the loss function has a steep gradient at the location where you're computing it,
    there will be a large change in the weight values. This in turn can cause you to overshoot the minimum
    and land on the other side of the loss function. If the gradient at that point is also large,
    this can cause another large change in the weight values, and another overshoot.
  - Sometimes these overshoots get larger and larger, taking you farther and farther from the minimum.
    Sometimes they take you into a region of the loss function where the gradients are smaller,
    which temporarily keeps the weight changes small. But if you get back to where the
    gradients are large, you may find yourself overshooting again. This will cause you to oscillate
    and never reach the minimum.
  - More generally, deep neural networks suffer from unstable gradients:
    different layers may learn at widely different speeds.

These problems are largely caused by combining the Sigmoid activation function
with weights initialized from a Standard Normal distribution
(0 mean, 1 std. dev.). With this initialization,
the variance of each layer's outputs is much greater than the variance of its inputs.
The variance keeps growing layer after layer, until the top layers' sigmoid functions saturate at 0 or 1.
When the output saturates, the gradient is close to zero,
so backpropagation has almost no gradient to propagate back,
and what little gradient there is gets diluted further at each lower layer.
The Sigmoid function has a mean of 0.5, not 0, which exacerbates the problem.
The Hyperbolic tangent function behaves better (since its mean is 0).
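
The saturation effect can be seen in a small NumPy experiment. This is only a sketch: the layer width (100 units), the batch size, and the random seed are arbitrary assumptions, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


n_units = 100                                  # assumed layer width
x = rng.standard_normal((1000, n_units))       # standardized inputs (variance ~1)
w = rng.standard_normal((n_units, n_units))    # weights drawn from N(0, 1)

z = x @ w                # weighted sums fed to the sigmoid
a = sigmoid(z)
grad = a * (1.0 - a)     # derivative of the sigmoid, at most 0.25

print("input variance:       ", x.var())
print("weighted-sum variance:", z.var())
print("mean sigmoid gradient:", grad.mean())
```

With these assumptions the weighted sums have a far larger variance than the inputs, and the average sigmoid gradient falls well below its maximum of 0.25, which is the saturation described above.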

### Glorot and He Initialization

The solution is Glorot initialization, which keeps the variance of each layer's outputs close to the variance of its inputs:
$$fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$$
Initialize the weights from a Normal distribution with 0 mean and variance $\sigma^2 = \frac{1}{fan_{avg}}$

Or, a uniform distribution between $\pm$r, with $r = \sqrt{\frac{3}{fan_{avg}}} = \sqrt{3{\sigma}^2}$

Glorot initialization can speed up training considerably.

| Initialization | Activation functions | $\sigma^2$ (Normal) |
|---|---|---|
|Glorot|None, tanh, logistic, softmax|$\frac{1}{fan_{avg}}$|
|He|ReLU and variants|$\frac{2}{fan_{in}}$|
|LeCun|SELU|$\frac{1}{fan_{in}}$|
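
As a rough illustration of the table, the helper below computes the Normal standard deviation and the equivalent uniform limit $r = \sqrt{3\sigma^2}$ for each scheme. The function name `init_params` and the layer sizes are made up for this sketch.

```python
import math


def init_params(fan_in, fan_out, scheme="glorot"):
    """Return (sigma, r) for a hypothetical layer with the given fan-in/fan-out.

    sigma is the std. dev. for a Normal initializer and r is the limit of the
    equivalent uniform distribution on [-r, r] (which has variance r**2 / 3).
    """
    fan_avg = (fan_in + fan_out) / 2
    variance = {
        "glorot": 1 / fan_avg,
        "he": 2 / fan_in,
        "lecun": 1 / fan_in,
    }[scheme]
    return math.sqrt(variance), math.sqrt(3 * variance)


for scheme in ("glorot", "he", "lecun"):
    sigma, r = init_params(fan_in=300, fan_out=100, scheme=scheme)
    print(f"{scheme:7s} sigma={sigma:.4f} r={r:.4f}")
```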

By default Keras uses Glorot initialization with a uniform distribution.
To use He init with $fan_{in}$:
  - keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal') # or kernel_initializer='he_uniform'

To use He init with $fan_{avg}$:
  - he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg', distribution='uniform')
  - keras.layers.Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)
  - With distribution="uniform", samples are drawn from a uniform distribution within [-limit, limit], with 
    - limit = sqrt(3 * scale / n).
    - n = number of input units for mode='fan_in', number of output units for mode='fan_out', or their average for mode='fan_avg'
  - So the scale value sets the limits of the uniform distribution

In [2]:
# By default Keras uses Glorot initialization with a uniform distribution. To use He init with fan_in:

# keras.layers.Dense(10, activation='relu', kernel_initializer='he_normal') 
# or kernel_initializer='he_uniform'

# To use He init with fan_avg:
# he_avg_init = keras.initializers.VarianceScaling(scale=2, mode='fan_avg', distribution='uniform')
# keras.layers.Dense(10, activation='sigmoid', kernel_initializer=he_avg_init)

# With distribution="uniform", samples are drawn from a uniform distribution within [-limit, limit], with
#     limit = sqrt(3 * scale / n).
#     n = number of input units (mode='fan_in'), output units (mode='fan_out'), or their average (mode='fan_avg')
# So the scale value sets the limits of the uniform distribution

### Nonsaturating Activation Functions

ReLU:
  - $$\begin{align}ReLU(z) & = 0 \;\;\; \text{if} \; z < 0 \\
                         & = z \;\;\; \text{if} \; z \geq 0 \end{align}$$
  - ReLU function thus does not saturate for positive values.
  - It is also fast to compute, since we only have to look at
    z to decide the output.
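  - If a neuron's weighted sum of inputs is negative for every instance in the training set,
    it outputs 0 and its gradient is 0 too, so Gradient Descent can no longer update it.
    The neuron is effectively dead. This is the dying ReLU problem.
  - A leaky variant such as Leaky ReLU avoids this.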
  
Leaky ReLU:
  - $$Leaky \; ReLU_\alpha(z) = max(\alpha z, z)$$
  - $\alpha$ is the slope of the function for negative z
  - Setting $\alpha$ to 0.2 (a huge leak) seemed to perform better than 0.01 (a small leak)
  - The Keras LeakyReLU layer defaults to $\alpha$ = 0.3
  - **Preferred if you're worried about runtime latency**
  
RReLU (Randomized Leaky ReLU) 
  - uses random $\alpha$ for training, 
    and a fixed average $\alpha$ for testing.
    It performed fairly well and seemed to act as a regularizer
  - **Preferred if you're overfitting**
  
PReLU (Parametric Leaky ReLU) 
  - allows $\alpha$ to be learned during training.
    It becomes a parameter that backprop can modify.
    It strongly outperformed ReLU on large datasets, but risks overfitting on smaller datasets.
  - **Preferred if you have a huge training set**
  
ELU (Exponential Linear Unit):
  - Outperformed all the ReLU variants in its authors' experiments
  - Training time was reduced
  - Network performed better on test set
  - $$\begin{align}ELU_\alpha(z) & = \alpha(e^z - 1) & \text{if} \; z < 0 \\
                                 & = z & \text{if} \; z \geq 0 \end{align}$$
  - ELU increases linearly with slope 1 for z > 0, and
    becomes more and more negative with an asymptote at -$\alpha$ as z becomes more negative
  - If $\alpha\,=\,1$, the function is smooth everywhere, including at z = 0.
    This allows Gradient Descent to speed up, as it does not bounce left and right of 0.
  - ELU is slower to compute than ReLU because of the exponential. Its faster convergence
    compensates for that during training, but the slow computation still hurts at test time.
    
SELU (Scaled ELU):
  - The network will self-normalize (the output of each layer tends to preserve mean 0 and std. dev. 1 during training, which solves the vanishing/exploding gradients problem) if:
    - Your network consists only of a stack of Dense layers
    - All hidden layers use the SELU activation function
    - Input features are all standardized (mean 0, std. dev. 1)
    - All hidden-layer weights are initialized with LeCun normal initialization (kernel_initializer = 'lecun_normal')
    - The network's architecture is sequential: no Recurrent Networks and no skip connections (as in Wide and Deep nets)
  - Some researchers have noted that SELU also works well for Convolutional Networks (even though they are not plain stacks of Dense layers).
    
**In general, prefer SELU > ELU > Leaky ReLU (and its variants) > ReLU > tanh > logistic**
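
The activation functions above can be sketched in NumPy as follows. The function names are made up for this sketch. The two SELU constants are the values commonly quoted for SELU and are an addition here, since the notes above do not give them.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def leaky_relu(z, alpha=0.2):
    # max(alpha * z, z) gives slope alpha for z < 0 (valid for 0 < alpha < 1)
    return np.maximum(alpha * z, z)


def elu(z, alpha=1.0):
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z < 0, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0), z)


# Assumed constants (commonly quoted for SELU, not taken from these notes)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(z):
    return SELU_SCALE * elu(z, SELU_ALPHA)


z = np.array([-2.0, 0.0, 3.0])
print(relu(z), leaky_relu(z), elu(z), selu(z))
```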

In [3]:
# This is how you use the Leaky ReLU and SELU activation functions
#
# model = keras.models.Sequential([
#     [ ... ]
#     keras.layers.Dense(10, kernel_initializer = "he_normal"), 
#     keras.layers.LeakyReLU(alpha = 0.2),   # Add the LeakyReLU layer right after each layer you want to
#                                            # apply it to. Default alpha = 0.3.
#                                            # For PReLU, replace this LeakyReLU() with PReLU().
#                                            # No official implementation of RReLU() yet - you write your own.
#     [ ... ]
# ]) 

# layer = keras.layers.Dense(10, activation='selu', kernel_initializer='lecun_normal')

### Batch Normalization (BN)

Batch Normalization is one of the most-used layers in deep neural networks. So much so that it is often omitted from diagrams, on the assumption that it is there after each layer.

Why do we need it:
  - The Glorot/He initialization plus any nonsaturating activation function reduces the danger of vanishing/exploding gradients at the beginning of training. But they can come back during training. BN addresses this issue.

How:
  - Add BN layer before/after the activation function.

How does it work:
  - Evaluates mean and std dev of the input over the current mini-batch (hence it's called Batch Normalization).
  
  $$\displaystyle\begin{align}\mu_B & = \frac{1}{m_B}\sum_{i=1}^{m_B}x^{(i)} \\
                          {\sigma_B}^2 & = \frac{1}{m_B}\sum_{i=1}^{m_B}{(x^{(i)} - \mu_B)}^2\end{align}$$
  - 0-centers and normalizes each input, then scales and shifts it according to two new parameter vectors per layer.
  
  $$\begin{align}\widehat{x}^{(i)} & = \frac{x^{(i)}\;-\;\mu_B}{\sqrt{{\sigma_B}^2\;+\;\epsilon}} \\
                           z^{(i)} & = \gamma\;\otimes\;\widehat{x}^{(i)}\;+\;\beta\end{align}$$
                           
  <paragraph><center>where $\;\otimes \;=\;$ elementwise$\;$ multiplication</center></paragraph>
  
  <paragraph><center>$\epsilon \;=\; $small number to avoid divide-by-zero error, called a smoothing term, typically $10^{-5}$</center></paragraph>
  - Each batch-normalized layer learns four parameter vectors during training:
    - $\gamma$ : the output scale vector (learned through backpropagation)
    - $\beta$  : the output offset vector (learned through backpropagation)
    - $\mu$    : final input mean vector (estimated using a moving average of the layer's input means)
    - $\sigma$ : final input std dev vector (estimated using a moving average of the layer's input std devs)
  - The batch mean/std.dev are used during training, and the final mean/std.dev are used after training, when predictions may be made on single instances rather than on batches.
  - Benefits:
    - Considerably improved all the deep neural networks its authors experimented with
    - Strongly reduced the vanishing gradients problem
    - Networks were also much less sensitive to weight initialization
    - Much larger learning rates could be used, speeding up learning
    - Acts as a regularizer, reducing the need for other regularization techniques
    - Does not affect shallower networks as much, but has a tremendous impact on deep networks.
    - Each epoch takes longer because of the extra computations, but convergence needs far fewer epochs, so overall wall time is usually shorter.
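
The training-time equations above can be written out in NumPy as a minimal sketch. The function name `batch_norm_train`, the batch shape, and the initial values of `gamma` (ones) and `beta` (zeros) are assumptions made for illustration.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Apply the training-time BN equations to a mini-batch x of shape [m_B, features]."""
    mu_b = x.mean(axis=0)                       # batch mean, one value per feature
    var_b = x.var(axis=0)                       # batch variance (divides by m_B)
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)   # zero-center and normalize
    z = gamma * x_hat + beta                    # elementwise scale and shift
    return z, mu_b, var_b


rng = np.random.default_rng(42)                    # arbitrary seed
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # made-up mini-batch
gamma = np.ones(4)                                 # assumed initial scale
beta = np.zeros(4)                                 # assumed initial offset

z, mu_b, var_b = batch_norm_train(x, gamma, beta)
print(z.mean(axis=0), z.std(axis=0))
```

With `gamma` at 1 and `beta` at 0, each output feature has a mean of roughly 0 and a standard deviation of roughly 1, whatever the scale of the inputs.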

In [None]:
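# Add a Batch Normalization layer before/after each hidden layer's activation function,
# and optionally as the first layer after the input (as shown here).
#
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Flatten, Dense, BatchNormalization, Activation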
#
# model = Sequential([  # In this tiny example with just 2 hidden layers, it's unlikely that Batch Normalization
#                       # will have a very positive impact. For deeper networks, it can make a tremendous difference.
#   Flatten(input_shape=[28, 28]),
#   BatchNormalization(),
#   Dense(300, activation='elu', kernel_initializer='he_normal'),
#   BatchNormalization(),
#   Dense(100, activation='elu', kernel_initializer='he_normal'),
#   BatchNormalization(),
#   Dense(10, activation='softmax')
# ])
# model.summary()  # This will show you the Batch Normalization layers. Each BN layer adds
#                  # 4 parameters per input (gamma, beta, mu, sigma), e.g. 4 x 784 = 3,136 for the first one.
#                  # mu and sigma are moving averages that backprop does not touch, so Keras lists them
#                  # as non-trainable params at the bottom of the summary.
# [(var.name, var.trainable) for var in model.layers[1].variables] # will list trainable/untrainable parameters.
#
# model.layers[1].updates  # Shows the operations Keras created to update the moving averages (mu, sigma)
#                          # at each iteration. Since we're using the Tensorflow backend, these are TF operations.

# Shows how to add BN layers before the activation function.
# You should try adding them before and after the activation function,
# and choosing the way that works best.
# model = Sequential([
#   Flatten(input_shape=[28, 28]),
#   BatchNormalization(),
#   Dense(300, kernel_initializer='he_normal', use_bias=False), # Removed activation function.
#                                                               # The BN layer already adds one offset (beta)
#                                                               # per input, so drop the bias with use_bias=False.
#   BatchNormalization(),
#   Activation('elu'),
#   Dense(100, kernel_initializer='he_normal', use_bias=False),
#   BatchNormalization(),
#   Activation('elu'),
#   Dense(10, activation='softmax')
# ])
# 

${\bf Momentum}$ hyperparameter for BN layer:

The BN layer has quite a few hyperparameters you can tweak.
The defaults will usually be fine,
but you may occasionally need to tweak the momentum.
Given a new vector of input means or std devs (call it $v$) computed over the current batch,
the BN layer updates the running average $\widehat{v}$ using:
$$\widehat{v} = \widehat{v}\;\times\;momentum\;+\;v\;\times\;(1 - momentum)$$
A good momentum is typically close to 1 (ex. 0.9, 0.99, 0.999). You want more 9's for larger datasets and smaller mini-batches.
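
The running-average update is simple enough to trace by hand. The sketch below uses a made-up function name and made-up batch means, and starts the running average at 0 purely for illustration.

```python
def update_running_average(v_hat, v, momentum=0.99):
    """One BN moving-average update: v is the new batch statistic, v_hat the running average."""
    return v_hat * momentum + v * (1.0 - momentum)


v_hat = 0.0                          # assumed starting value
for batch_mean in [2.0, 2.0, 2.0]:   # made-up batch means
    v_hat = update_running_average(v_hat, batch_mean, momentum=0.9)
    print(v_hat)
```

Each update moves the running average a fraction (1 - momentum) of the way toward the new batch value, so a momentum closer to 1 gives a smoother but slower-moving estimate.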

${\bf Axis}$ hyperparameter for BN layer:

- Determines which axis should be normalized.
- Defaults to -1, the last axis: the statistics are computed across all the other axes. So for a batch of shape [batch size, features], each feature is normalized using the mean and std dev computed across the batch.
- If the BN layer came before the Flatten layer, the batch shape would be [batch size, height, width], so each of the 28 columns of pixels would be normalized using statistics computed across the batch size and height. This gives 28 means and 28 std devs, one per column. To treat the 784 pixels independently instead, set axis=[1, 2].
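
The number of statistics each setting produces can be checked with plain NumPy reductions. This sketch only mimics which axes the statistics are averaged over; it does not call Keras, and the batch of random 28 x 28 "images" is made up.

```python
import numpy as np

batch = np.random.default_rng(0).random((32, 28, 28))  # [batch size, height, width]

# axis=-1: one statistic per index of the last axis, averaged over batch and height
per_column_mean = batch.mean(axis=(0, 1))

# axis=[1, 2]: one statistic per pixel, averaged over the batch only
per_pixel_mean = batch.mean(axis=0)

print(per_column_mean.shape)  # 28 means, one per column
print(per_pixel_mean.shape)   # 28 x 28 = 784 means, one per pixel
```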

### Gradient Clipping

- Gradient Clipping is most often used in Recurrent Neural Networks (RNN), where Batch Normalization is tricky to use. For other types of networks, BN is usually sufficient.
- Gradient Clipping clips the gradients during backprop so they never exceed some threshold. ex:

optimizer = keras.optimizers.SGD(clipvalue=1.0)  # all gradients clipped to $\pm$1.0

model.compile(loss='mse', optimizer=optimizer)

- clipvalue can change the orientation of the gradient vector. ex. if the original gradient = [0.9, 100.0], after clipping it will be [0.9, 1.0], which points in a very different direction.
- Use clipnorm instead of clipvalue to keep the orientation of the gradient vector. This rescales the whole gradient if its L2 norm > threshold. clipnorm=1.0 will change [0.9, 100.0] to [0.00899, 0.9999], preserving its orientation.
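
The difference between the two clipping modes can be reproduced with a few lines of plain Python. The helper names `clip_by_value` and `clip_by_norm` are made up for this sketch and only imitate what the clipvalue and clipnorm arguments are described as doing above.

```python
import math


def clip_by_value(grad, threshold):
    """Clip every component independently to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]


def clip_by_norm(grad, threshold):
    """Rescale the whole vector if its L2 norm exceeds threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    return [g * threshold / norm for g in grad]


grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # component ratio changes: orientation is lost
print(clip_by_norm(grad, 1.0))   # component ratio is kept: orientation is preserved
```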

## Using pretrained layers

- Transfer learning: 
  - Use the lower layers of a DNN trained to solve a similar problem as the first layers of your network. ex. To recognize facial expressions, you can use a DNN that is trained to recognize faces. The lower layers of this network will have learned where the mouth, eyes, etc, are on the face. You can then add your upper layers to recognize facial expressions.
  - Transfer learning will work best when the inputs have similar low-level features.
  - The more similar the tasks are, the more layers you will want to use. For very similar tasks, keep all hidden layers, and replace just the output layer.
- Initially, try locking the weights for the pre-trained layers, so when training the upper layers, those weights remain the same.
- If that does not work, try unfreezing one or two of the top hidden layers so that backprop can update their weights. The more training data you have, the more layers you can unfreeze. Reduce the learning rate when you unfreeze layers.
- If you still have problems:
  - and you have little training data, try dropping the top hidden layers and freezing all remaining hidden layers.
  - and you have plenty of training data, try replacing the top hidden layers instead of dropping them, and even adding more hidden layers.

### Transfer learning with Keras

In [11]:
# If you build model_B_on_A directly from model_A's layers, the two models share those layers,
# so training model_B_on_A would also modify model_A. To avoid that, clone model_A first.
# This is just an example. Transfer learning does not work well with small dense networks, probably because
# - small networks learn few patterns
# - dense networks learn very specific patterns which are unlikely to be useful in other tasks
# Transfer learning works best with deep convolutional neural networks, which tend to learn
# feature detectors that are much more general (especially in the lower layers).
# Deep convolutional networks are covered in Chapter 14.
# 
# from tensorflow import keras
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense
#
# model_A = keras.models.load_model('my_model_A.h5')
# model_A_clone = keras.models.clone_model(model_A) # model_A is the model that solves the similar problem.
# model_A_clone.set_weights(model_A.get_weights())  # This is done since clone_model() does not clone weights.
# model_B_on_A = Sequential(model_A_clone.layers[:-1]) # Include all layers except output
# model_B_on_A.add(Dense(1, activation='sigmoid'))     # Our own output layer

# for layer in model_B_on_A.layers[:-1]:               # Freeze the borrowed layers for the first few epochs.
#     layer.trainable = False                          # The new output layer starts with random weights, so
#                                                      # its large early error gradients could wreck the
#                                                      # borrowed weights before it has learned reasonable ones.

# model_B_on_A.compile(loss='binary_crossentropy',     # You must always compile your model after you
#                      optimizer='sgd',                # freeze/unfreeze layers.
#                      metrics=['accuracy'])

# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,         # Train the model for a few epochs
#                            validation_data=(X_valid_B, y_valid_B))

# for layer in model_B_on_A.layers[:-1]:               # By now the new layer should have learned good weights,
#     layer.trainable = True                           # so unfreeze the lower layers.

# optimizer = keras.optimizers.SGD(learning_rate=1e-4) # After unfreezing the lower layers, it is a good idea to
#                                                      # lower the learning rate to avoid damaging the
#                                                      # lower-layer weights.

# model_B_on_A.compile(loss='binary_crossentropy',     # You must always compile your model after you
#                      optimizer=optimizer,            # freeze/unfreeze layers. Use the low-learning-rate optimizer.
#                      metrics=['accuracy'])

# history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
#                            validation_data=(X_valid_B, y_valid_B))

### Unsupervised pretraining

Suppose you have a complex task but not much labeled training data, and cannot find a model trained on a similar task. You should:
  - try to get more labeled training data. If you cannot,
  - try to perform unsupervised pretraining:
    - Requires plenty of unlabeled training data
    - Use it to pretrain an unsupervised model (ex. autoencoder/GAN (Generative Adversarial Network)). See Ch 17.
    - Use the lower layers of the unsupervised model, add your output layer on top, and fine-tune the final network using supervised learning (with the labeled training data).

### Pretraining on an auxiliary task

If you don't have much labeled training data, another approach is to:
  - train a first network on an auxiliary task for which you have labeled data. Then reuse the lower layers for your actual task.
  - ex. for a Natural Language Processing (NLP) task, you can:
    - download a corpus of millions of text documents,
    - randomly mask out some words and train a model to predict what those words are.
    - This will train your model to "understand" the language to some extent.
    - Then you can reuse it for your actual task and fine-tune it for your labeled data.
 
Self-supervised learning:
  - you automatically generate the labels from the data itself.
  - train a model on the resulting labeled dataset using supervised learning.
  - Does not require human labeling whatsoever, so it is best classified as a form of unsupervised learning.

## Faster Optimizers

Most popular algorithms:
  - momentum optimization
  - Nesterov Accelerated Gradient
  - AdaGrad
  - RMSProp
  - Adam
  - Nadam

### Momentum Optimization

  - Gradient Descent takes small regular steps down the slope. The step size is also a function of the gradient value. If the gradient is tiny, the step size will be tiny also.

$$\theta \leftarrow \theta - \eta\nabla_{\theta}J(\theta)$$

  - Momentum optimization uses the gradient as an acceleration rather than as a speed: it adds the gradient step to a momentum vector m, so the steps keep growing while successive gradients point the same way.
  
$$\begin{align}m & \leftarrow\beta m - \eta \nabla_{\theta}J(\theta) \\
          \theta & \leftarrow \theta + m \end{align}$$
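          <paragraph><center>where $\beta$ is the momentum and should be set between 0 (high friction) and 1 (no friction).
          If the gradient stays constant, the terminal velocity (the maximum size of the update) is $\frac{1}{1 - \beta}$ times $\eta\nabla_{\theta}J(\theta)$. With $\beta$ = 0.9 that is 10 x $\eta\nabla_{\theta}J(\theta)$, so
          momentum optimization ends up going 10 times faster than gradient descent.</center></paragraph>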
  - Benefits:
    - escapes from plateaus faster than Gradient Descent
    - can roll past local optima
    - For deep networks that don't use Batch Normalization, the upper layers will end up with inputs of very different scales. Momentum optimization helps a lot to converge for layers across all input scales.
  - Issues:
    - The optimizer may overshoot in one direction, then overshoot in another, oscillating for a while before stabilizing at the minimum. This is why it is good to have some friction: it gets rid of the oscillations and speeds up convergence.
  - momentum value of 0.9 usually works well in practice.
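
The effect of accumulating momentum can be checked numerically. The sketch below applies the two update equations to a made-up constant gradient of 1.0. The function name and the values of `eta` and `beta` are assumptions for illustration.

```python
def momentum_step(theta, m, grad, eta=0.1, beta=0.9):
    """One momentum optimization update for a scalar parameter."""
    m = beta * m - eta * grad   # the gradient acts as an acceleration on m
    theta = theta + m           # the parameter moves by the momentum vector
    return theta, m


theta, m = 0.0, 0.0
constant_grad = 1.0             # made-up constant gradient
for _ in range(200):
    theta, m = momentum_step(theta, m, constant_grad)

print(m)  # approaches -eta * grad / (1 - beta)
```

With these values a plain gradient descent step would be 0.1, while the momentum step settles at about ten times that size.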

### Nesterov Accelerated Gradient (NAG) AKA Nesterov Momentum Optimization

  - This variant of momentum optimization is almost always faster than vanilla momentum optimization.
  - It measures the gradient of the cost function not at the current position $\theta$ but slightly ahead in the direction of the momentum, at $(\theta + \beta \; m)$. Vanilla momentum optimization computes the gradient at $\theta$, before the momentum step is applied. This works because in general the momentum vector points in the right direction (toward the optimum), so the gradient measured a bit farther in that direction is slightly more accurate.
  
  $$\begin{align}m & \leftarrow\beta m - \eta \nabla_{\theta}J(\theta + \beta \; m) \\
          \theta & \leftarrow \theta + m \end{align}$$
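
A single NAG update can be traced on a toy cost function. In this sketch, the function names, the cost $J(\theta) = \theta^2$, and the starting values are all made up for illustration.

```python
def nag_step(theta, m, grad_fn, eta=0.1, beta=0.9):
    """One Nesterov Accelerated Gradient update for a scalar parameter."""
    lookahead_grad = grad_fn(theta + beta * m)  # gradient measured slightly ahead
    m = beta * m - eta * lookahead_grad
    return theta + m, m


def grad_fn(theta):
    return 2.0 * theta          # gradient of the toy cost J(theta) = theta**2


theta, m = 1.0, -0.5            # made-up starting position and momentum
theta, m = nag_step(theta, m, grad_fn)
print(theta, m)
```

Here the gradient is evaluated at 1.0 + 0.9 x (-0.5) = 0.55 rather than at 1.0, which is the only difference from the vanilla momentum update.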

In [None]:
# optimizer = keras.optimizers.SGD(learning_rate=0.0001, momentum=0.9, nesterov=True)  # Nesterov Accelerated Gradient (NAG), 
#                                                                                      # AKA Nesterov Momentum Optimization

### AdaGrad

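  - Gradient Descent starts by going down the steepest slope, which does not point straight toward the global optimum. It would be nice if the algorithm could correct its direction earlier to point more toward the global optimum. AdaGrad achieves this by scaling down the gradient vector along the steepest dimensions.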

$$\begin{align}s & \leftarrow s + \nabla_{\theta}J(\theta) \otimes \nabla_{\theta} J(\theta) \\
         \theta  & \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \oslash \sqrt{s + \epsilon} \end{align}$$
         <paragraph><center>Accumulate the square of the gradient into the variable s.
    Then scale down the gradient vector by a factor of $\sqrt{s + \epsilon}$.
    The $\oslash$ is the symbol for elementwise division.
    $\epsilon$ is a smoothing term to avoid div-by-zero. Typically set to 1e-10. This decays the learning rate faster for steeper dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. </center></paragraph>
  - Benefits:
    - It helps point the update toward the global optimum.
    - This algorithm requires much less tuning of the learning rate $\eta$
    - Performs well for simple quadratic problems - may be efficient for Linear Regression.
  - Issues:
    - Often stops too early. The learning rate gets scaled down so much that the algorithm stops before reaching global optimum. You should not use it for Deep NNs.
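
A single AdaGrad update on a made-up two-dimensional gradient shows the adaptive scaling. The function name and all values are assumptions chosen so the arithmetic is easy to follow.

```python
import numpy as np


def adagrad_step(theta, s, grad, eta=0.1, eps=1e-10):
    """One AdaGrad update: accumulate squared gradients, then scale the step."""
    s = s + grad * grad                             # elementwise square, accumulated
    theta = theta - eta * grad / np.sqrt(s + eps)   # elementwise division
    return theta, s


theta = np.array([1.0, 1.0])
s = np.zeros(2)
grad = np.array([10.0, 0.1])    # one steep dimension, one gentle dimension

theta, s = adagrad_step(theta, s, grad)
print(theta, s)
```

Although the two gradient components differ by a factor of 100, both parameters move by about the same amount on this first step, because each component is divided by its own accumulated magnitude.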

### RMSProp

  - Since AdaGrad runs the risk of slowing down a bit too fast and never converging, RMSProp accumulates the gradients from only the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay during gradient accumulation (first equation below):

$$\begin{align}s & \leftarrow \beta s + (1 - \beta) \nabla_{\theta}J(\theta) \otimes \nabla_{\theta} J(\theta) \\
         \theta  & \leftarrow \theta - \eta \nabla_{\theta} J(\theta) \oslash \sqrt{s + \epsilon} \end{align}$$
         <paragraph><center>Accumulate the square of the latest gradients into the variable s.
    Then scale down the gradient vector by a factor of $\sqrt{s + \epsilon}$.
    The $\oslash$ is the symbol for elementwise division.
    The decay rate is $\beta$.
    $\epsilon$ is a smoothing term to avoid div-by-zero. Typically set to 1e-10. This decays the learning rate faster for steeper dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. </center></paragraph>

  - The decay rate $\beta$ is typically set to 0.9, a default that usually works well.
  - Can lead to solutions that generalize poorly on some datasets. So if you're seeing this, try using plain Nesterov Accelerated Gradient instead.
  - RMSProp was the preferred optimization algorithm for many researchers until Adam optimization came around.
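
The contrast with AdaGrad shows up when the gradient stays constant: RMSProp's accumulator levels off instead of growing without bound. The sketch below uses a made-up function name and a made-up constant gradient of 2.0.

```python
import math


def rmsprop_step(theta, s, grad, eta=0.001, beta=0.9, eps=1e-10):
    """One RMSProp update for a scalar parameter."""
    s = beta * s + (1.0 - beta) * grad * grad       # decayed accumulation
    theta = theta - eta * grad / math.sqrt(s + eps)
    return theta, s


theta, s = 1.0, 0.0
for _ in range(200):
    theta, s = rmsprop_step(theta, s, grad=2.0)     # made-up constant gradient

print(s)      # levels off near grad**2 instead of growing forever
print(theta)
```

Because `s` settles near the squared gradient, the step size settles near `eta` rather than shrinking toward zero as it would with AdaGrad.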

In [None]:
# optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9) # rho is the decay rate beta in the above text

### Adam, AdaMax and Nadam Optimization

  - Adam (Adaptive Moment Estimation) combines the ideas of momentum optimization and RMSProp.
    - Like momentum optimization, it keeps track of an exponentially decaying average of past gradients
    - Like RMSProp, it keeps track of an exponentially decaying average of past squared gradients
    
 $$\begin{align}m & \leftarrow \beta_1 m - (1 - \beta_1) \nabla_{\theta} J(\theta) \\
                s & \leftarrow \beta_2 s + (1 - \beta_2) \nabla_{\theta}J(\theta) \otimes \nabla_{\theta} J(\theta) \\
      \widehat{m} & \leftarrow \frac{m}{1 - \beta_1^T} \\
      \widehat{s} & \leftarrow \frac{s}{1 - \beta_2^T} \\
           \theta & \leftarrow \theta + \;\eta \; \widehat{m} \; \oslash \sqrt{\widehat{s} + \epsilon}
      \end{align}$$
      <paragraph><center>T represents the iteration number starting at 1.
    $m$ and $s$ are initialized to 0, so they are biased toward 0 at the beginning of training; the third and fourth equations boost them to compensate.</center></paragraph>
    - Momentum decay $\beta_1$ is typically set to 0.9
    - Scaling decay $\beta_2$ is typically set to 0.999
    - Smoothing term $\epsilon$ defaults to None, which tells Keras to use keras.backend.epsilon(), which is 1e-7. If you want, you can change it using keras.backend.set_epsilon()
    - Learning rate $\eta$ is typically set to 0.001. Since Adam is adaptive, it requires less tuning of $\eta$
    - Can lead to solutions that generalize poorly on some datasets. So if you're seeing this, try using plain Nesterov Accelerated Gradient instead.
  - AdaMax
    - Adam scales down the parameter updates by the L2 norm of the time-decayed gradients. AdaMax replaces that L2 norm with the $L_{\infty}$ norm (the max value of the time-decayed gradients).
    - This can make AdaMax more stable than Adam, but in general Adam performs better, so AdaMax is one more optimizer to try if you're seeing problems with Adam
  - Nadam
    - This is Adam + the Nesterov trick, so it will often converge slightly faster than Adam.
    - Can lead to solutions that generalize poorly on some datasets. So if you're seeing this, try using plain Nesterov Accelerated Gradient instead.
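
The five Adam equations can be traced for one step on a scalar parameter. This is a sketch that keeps the sign convention used above (m accumulates the negative gradient); the function name and the gradient value are made up, and `t` plays the role of the iteration number.

```python
import math


def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar parameter; t is the iteration number, starting at 1."""
    m = beta1 * m - (1.0 - beta1) * grad          # decaying average of (negative) gradients
    s = beta2 * s + (1.0 - beta2) * grad * grad   # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                # bias correction for m
    s_hat = s / (1.0 - beta2 ** t)                # bias correction for s
    theta = theta + eta * m_hat / math.sqrt(s_hat + eps)
    return theta, m, s


theta, m, s = adam_step(theta=1.0, m=0.0, s=0.0, grad=4.0, t=1)
print(theta, m, s)
```

On the first iteration the bias correction cancels the (1 - beta) factors, so the parameter moves by about `eta` regardless of the gradient's magnitude.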

In [None]:
# optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

### Training Sparse Models

- If you need
  - blazing fast model at runtime, or
  - you need to take up less memory
- then
  - train the model as usual
  - set its tiny weights to 0 (for example by editing the arrays returned by model.get_weights() and passing them back to model.set_weights())
  - This may typically not lead to a very sparse model, and
  - may degrade the model's performance.
- Better option
  - Apply strong L1 regularization during training (this zeroes out many weights)
- If this is not enough
  - Use TensorFlow Model Optimization Toolkit (TF-MOT) which has a pruning API that iteratively removes connections during training based on their magnitude.
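
The simplest option in the list, zeroing out tiny weights after training, can be sketched with NumPy. The function name, the threshold, and the weight matrix below are all made up for illustration; this is not the TF-MOT pruning API.

```python
import numpy as np


def zero_small_weights(weights, threshold=1e-2):
    """Set weights whose magnitude is below threshold to 0 and report the sparsity."""
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    sparsity = float(np.mean(pruned == 0.0))   # fraction of weights that are exactly 0
    return pruned, sparsity


w = np.array([[0.5, -0.003],
              [0.008, -1.2]])                  # made-up weight matrix
pruned, sparsity = zero_small_weights(w)
print(pruned)
print(sparsity)
```

Applied to a real model, the same function could be mapped over each array from model.get_weights() before calling model.set_weights(), though as noted above this may degrade the model's performance.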

### Optimizer Comparison

<table>
    <caption>Optimizer Comparison</caption>
    <tr><th>Class</th><th>Convergence Speed</th><th>Convergence quality</th></tr>
    <tr><td>SGD</td><td>*</td><td>**</td></tr>
    <tr><td>SGD(momentum=...)</td><td>**</td><td>***</td></tr>
    <tr><td>SGD(momentum=..., nesterov=True)</td><td>**</td><td>***</td></tr>
    <tr><td>Adagrad</td><td>***</td><td>* (stops too early)</td></tr>
    <tr><td>RMSProp</td><td>***</td><td>** or ***</td></tr>
    <tr><td>Adam</td><td>***</td><td>** or ***</td></tr>
    <tr><td>Nadam</td><td>***</td><td>** or ***</td></tr>
    <tr><td>AdaMax</td><td>***</td><td>** or ***</td></tr>
</table>