# Chapter 11: Training deep neural networks


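Deep neural networks can suffer from vanishing (too small) or exploding (too large) gradients, which makes the lower layers very hard to train.
This was one of the reasons deep networks were largely abandoned in the early 2000s.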

Xavier Glorot and Yoshua Bengio showed in 2010 that this was caused by the combination of the logistic activation function (its mean output is 0.5 rather than 0, and it saturates at outputs near 0 or 1, where the gradient is close to zero) and the weight initialization in use at the time.

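Their recommendation was to keep the variance of each layer's outputs equal to the variance of its inputs, which strictly requires fan-in to equal fan-out in the hidden layers. When the two differ, a good compromise is to draw the initial weights at random with a variance that accounts for both, now called Xavier (or Glorot) initialization: $\sigma^2 = 1/{fan}_{avg}$, where ${fan}_{avg}$ is the average of fan-in and fan-out.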


Another approach is to use Rectified Linear Units (ReLU) with another kind of initialization called He Initialization. $\sigma^2 = 2/{fan}_{in}$

A third approach is Scaled Exponential Linear Units (SELU) with a third initialization mechanism called LeCun Initialization. $\sigma^2 = 1/{fan}_{in}$

The initializer can be chosen in Keras by passing kernel_initializer="he_uniform" or "he_normal" etc. when creating a layer.
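
To make the three variance formulas above concrete, here is a minimal sketch in plain Python that turns fan-in and fan-out into the standard deviation of a zero-mean normal initializer. The function name `init_stddev` and the scheme labels are made up for illustration; this follows the formulas in these notes and is not a copy of what Keras does internally.

```python
import math

def init_stddev(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of a zero-mean normal weight initializer."""
    if scheme == "glorot":      # sigma^2 = 1 / fan_avg
        variance = 1.0 / ((fan_in + fan_out) / 2.0)
    elif scheme == "he":        # sigma^2 = 2 / fan_in
        variance = 2.0 / fan_in
    elif scheme == "lecun":     # sigma^2 = 1 / fan_in
        variance = 1.0 / fan_in
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return math.sqrt(variance)

# Example: a layer with 3072 inputs (a flattened 32x32x3 image) and 100 neurons.
for scheme in ("glorot", "he", "lecun"):
    print(scheme, init_stddev(3072, 100, scheme))
```

For the same layer, He initialization always gives a standard deviation that is $\sqrt{2}$ times the LeCun one, since the two differ only by the factor 2 in the variance.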

The 2010 paper showed that just because biological neurons seem to use a roughly logistic activation, we don't have to. In fact, using it in deep networks causes the saturation problems described above. ReLU activation functions work well in practice, but suffer from the "dying ReLU" problem: once a neuron starts outputting 0 for every input, its gradient is 0 and it stays there. A leaky ReLU avoids this by keeping a small slope for negative inputs, so it outputs a small negative value instead of 0, and it often works better as a result.
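
The difference between the two is easiest to see in the gradients. The sketch below is plain Python on scalars; the function names and the slope `alpha=0.01` are illustrative assumptions, not values taken from the chapter.

```python
def relu(z):
    """ReLU: passes positive inputs through, outputs 0 otherwise."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU: exactly 0 for negative inputs, so no learning signal."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs instead of a flat 0."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: never 0, so the neuron can always recover."""
    return 1.0 if z > 0 else alpha

for z in (-2.0, 0.5):
    print(z, relu(z), relu_grad(z), leaky_relu(z), leaky_relu_grad(z))
```

For a negative input the ReLU gradient is 0, while the leaky version still passes back a gradient of `alpha`.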

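The exponential linear unit (ELU) is the identity for $z \ge 0$ and an exponential function shifted down by 1 for $z < 0$, so it approaches $-1$ as $z \to -\infty$ and is 0 at $z = 0$: ${ELU}(z) = \exp(z) - 1$ for $z < 0$, and ${ELU}(z) = z$ otherwise. It is differentiable everywhere and has a nonzero gradient for negative inputs, which helps against vanishing gradients and dead neurons. It is slower to compute than ReLU.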
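
As a quick check of that shape, here is a minimal scalar sketch. The `alpha` parameter is the usual generalization of ELU (the negative branch is multiplied by it); `alpha=1.0` gives the form described in these notes.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-50.0, -1.0, 0.0, 2.0):
    print(z, elu(z))
```

Very negative inputs map to values just above $-1$, zero maps to zero, and positive inputs pass through unchanged.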

SELU needs a plain sequential network (no skip connections), the specific initializer "lecun_normal", and standard-scaled inputs. Under those conditions, and when all hidden layers use SELU, the network self-normalizes: the output of each layer tends to keep mean 0 and variance 1, which is desirable.
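
The self-normalizing property can be checked numerically for a single unit. The sketch below assumes the standard SELU constants (roughly 1.6733 for alpha and 1.0507 for the scale) and integrates over a standard normal input with a simple midpoint sum; `gaussian_moments` is a helper written for this note, not a library function.

```python
import math

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: a scaled ELU with fixed alpha and scale constants."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def gaussian_moments(f, lo=-10.0, hi=10.0, steps=20000):
    """Mean and variance of f(Z) for Z ~ N(0, 1), via a midpoint sum."""
    h = (hi - lo) / steps
    mean = 0.0
    second = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        p = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * h
        y = f(z)
        mean += p * y
        second += p * y * y
    return mean, second - mean * mean

mean, var = gaussian_moments(selu)
print("mean:", mean, "variance:", var)
```

If the input to the activation has mean 0 and variance 1, the output has (very nearly) mean 0 and variance 1 as well, which is why standard-scaled inputs and LeCun initialization matter.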


The initialization choices listed above only help at the start of training. During training, the intermediate layers can still end up with poor gradients. Batch Normalization adds extra layers that estimate the mean and variance of their inputs over each mini-batch (during training), use them to standardize those inputs, and then scale and shift the result with two parameters per input that are learned through backpropagation. After training, the layer standardizes its inputs using moving averages of the means and variances seen during training, then applies the learned scale and shift.

I don't have a good sense of it though. To standard-scale the input, it must be applied before a hidden neural network layer, and to modify the output, it should be applied after the neural network layer. Not sure how this really works in practice.
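
One way to see it: a single Batch Normalization layer does both steps on its own inputs, first standardizing them and then scaling and shifting the standardized values. It can be placed either before or after a hidden layer's activation function. Here is a minimal sketch for one feature in plain Python; the names `gamma`, `beta` and `eps` follow the usual notation, and the function names are made up for this note.

```python
import math

def batch_norm_train(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Training mode: standardize with the batch statistics, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    return out, mean, var

def batch_norm_infer(x, moving_mean, moving_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference mode: use statistics accumulated during training, not the batch."""
    return gamma * (x - moving_mean) / math.sqrt(moving_var + eps) + beta

out, mean, var = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0)
print(out, mean, var)
print(batch_norm_infer(2.5, moving_mean=mean, moving_var=var, gamma=2.0, beta=5.0))
```

In a real layer `gamma` and `beta` are trained, and the moving statistics are updated a little on every training batch.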


This chapter was mostly talk. The exercises are where the information gets solidified.



# Exercises

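1. It is not ok to initialize all the weights to the same value, even a randomly chosen one: identical weights make every neuron in a layer compute the same output and receive the same update. Each weight should be sampled independently so that this symmetry is broken.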

2. Nope, not ok to initialize the weights to 0: every neuron in a layer would then compute the same output and receive the same update. Bias terms are the exception and can safely start at 0.

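3. SELU has a nonzero gradient everywhere, so its neurons cannot die the way ReLU neurons can. It can output negative values, which keeps the average output closer to 0, and it usually converges faster, even though it is slower to compute.
 When SELU is used in all hidden layers (and its other conditions are met), the network self-normalizes to mean 0 and standard deviation of 1, which greatly helps convergence.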

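4. SELU: when the input can be standard-scaled and the network is a plain sequential stack (no skip connections), so that it can self-normalize.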

5. No idea. Probably the result doesn't converge?

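6. Sparse models can be produced by:
  * Training normally, then setting the tiny weights to 0. (Dropout does not do this: it only removes nodes temporarily during training, and the final model is dense.)
  * Strong $\ell_1$ regularization, which pushes many weights to exactly 0.
  * Use TensorFlow's Model Optimization Toolkit (MOT) to prune connections with small magnitude.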
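
The idea of pruning connections with small magnitude can be sketched in a few lines of plain Python. This is only an illustration of the principle on a nested list of weights; the helper names and the threshold are hypothetical, and it is not the Model Optimization Toolkit API.

```python
def prune_small_weights(weights, threshold=0.05):
    """Return a copy of a weight matrix with small-magnitude entries set to 0."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]

def sparsity(weights):
    """Fraction of entries that are exactly 0."""
    flat = [w for row in weights for w in row]
    return sum(1 for w in flat if w == 0.0) / len(flat)

weights = [[0.80, -0.01, 0.03],
           [-0.40, 0.002, 0.25]]
pruned = prune_small_weights(weights)
print(pruned)
print(sparsity(pruned))
```

Here three of the six weights fall below the threshold, so half of the connections are removed.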

7. Dropout does slow down training, as we typically need more iterations to converge. It has no effect on inference speed, because it is only active during training. MC Dropout trains exactly like regular dropout, but it does slow down inference: each prediction requires running the model several times with dropout turned on (as in training mode) and averaging the results.
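
To show why MC Dropout costs extra time at inference, here is a toy sketch in plain Python. The "model" is just a weighted sum with dropout applied to its inputs, and all names are hypothetical. With a Keras model, the usual equivalent is to call the model repeatedly with `training=True` and average the outputs.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def predict(inputs, weights, rate=0.5, rng=None):
    """Toy linear model; dropout is only active when an rng is passed in."""
    if rng is not None:
        inputs = dropout(inputs, rate, rng)
    return sum(w * x for w, x in zip(weights, inputs))

def mc_dropout_predict(inputs, weights, n_samples=10000, rate=0.5, seed=42):
    """Average many stochastic predictions; the spread hints at the uncertainty."""
    rng = random.Random(seed)
    samples = [predict(inputs, weights, rate, rng) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / n_samples
    return mean, var ** 0.5

x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 1.0]
print("regular prediction:", predict(x, w))
print("MC Dropout mean and std:", mc_dropout_predict(x, w))
```

A regular prediction is one forward pass, while the MC Dropout estimate needs `n_samples` passes, which is where the inference slowdown comes from.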

8. Doing that below here.






In [68]:
# Common imports

import matplotlib.cm as cm
from matplotlib.image import imread
import matplotlib as mpl
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3

import numpy as np

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

from sklearn.metrics import accuracy_score
from sklearn.metrics import silhouette_samples
from sklearn.metrics import silhouette_score

from sklearn.datasets import fetch_california_housing

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import tensorflow as tf
from tensorflow import keras

print("TF version ", tf.__version__)
print("Keras version ", keras.__version__)

# Custom error handler for the entire notebook so stack traces are not lost
from IPython.core.ultratb import AutoFormattedTB

# initialize the formatter for making the tracebacks into strings
itb = AutoFormattedTB(mode = 'Plain', tb_offset = 1)

# Define a global with the stack trace that we can append to in the handler.
viki_stack_trace = ''

# this function will be called on exceptions in any cell
def custom_exc(shell, etype, evalue, tb, tb_offset=None):
    global viki_stack_trace

    # still show the error within the notebook, don't just swallow it
    shell.showtraceback((etype, evalue, tb), tb_offset=tb_offset)

    # grab the traceback and make it into a list of strings
    stb = itb.structured_traceback(etype, evalue, tb)
    sstb = itb.stb2text(stb)

    print(sstb)  # <--- this is the variable with the traceback string
    viki_stack_trace = viki_stack_trace + sstb

# this registers a custom exception handler for the whole current notebook
get_ipython().set_custom_exc((Exception,), custom_exc)


TF version  2.3.0
Keras version  2.4.0


In [8]:
cifar = keras.datasets.cifar10.load_data()

In [52]:
(X, y), (testX, testy) = keras.datasets.cifar10.load_data()

In [53]:
X.shape

(50000, 32, 32, 3)

In [56]:
y.shape

(50000, 1)

In [55]:
X[2].shape

(32, 32, 3)

In [75]:
from sklearn.base import clone

def create_keras_classifier_model(n_neurons=100):
    """Build and compile a deep fully connected classifier for CIFAR10.

    Args:
        n_neurons(int): Number of neurons in each hidden layer

    Returns:
        Compiled keras model

    """
    # create model
    model = keras.models.Sequential()

    # The input: we get 32x32 pixels, each with 3 colors (rgb)
    model.add(keras.layers.Flatten(input_shape=[32, 32, 3]))
    # Then 20 fully connected hidden layers (100 neurons each by default)
    for _ in range(20):
        model.add(keras.layers.Dense(
            n_neurons,
            activation="elu",
            kernel_initializer=tf.keras.initializers.HeNormal()
        ))
    # Now add the output layer: 10 classes in CIFAR10, so 10 outputs.
    model.add(keras.layers.Dense(10, activation="softmax"))

    # summary() prints the table itself and returns None
    model.summary()
    # Compile model
    model.compile(
        loss="sparse_categorical_crossentropy",
        optimizer="nadam",
        metrics=["accuracy"]
    )
    return model

# scikit-learn wrapper around the same builder; extra keyword arguments
# such as n_neurons are forwarded to create_keras_classifier_model
estimator = keras.wrappers.scikit_learn.KerasClassifier(
    build_fn=create_keras_classifier_model,
    n_neurons=100)

viki_stack_trace = ''

mm = create_keras_classifier_model(100)


Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_7 (Flatten)          (None, 3072)              0         
_________________________________________________________________
dense_182 (Dense)            (None, 100)               307300    
_________________________________________________________________
dense_183 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_184 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_185 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_186 (Dense)            (None, 100)               10100     
_________________________________________________________________
dense_187 (Dense)            (None, 100)             

Need to create a model and test against the training data.

In [None]:
# X, y are the CIFAR10 training arrays loaded above
history = mm.fit(X, y, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30