# Optimizing DNNs

This is companion code for Lecture 11, and closely follows Chapter 11 of Geron.

In this lab, we're going to show how to implement all of the DNN optimizations we discussed in Lecture 11 in Keras. This includes:

* __Selecting Activation Functions__: SELU, ELU, Leaky ReLU, ReLU.
* __Choosing Weight Initializations__: He for ReLU variants, LeCun for SELU and Glorot for others. 
* __Batch Normalization__
* __Gradient Clipping__
* __Choosing Optimizers__: Momentum, Nesterov, AdaGrad, RMSProp, Adam, Nadam and AdaMax. 
* __Regularization__: $\ell_p$ regularization, Max-Norm, dropout and early stopping. 

### Note: 

In the notebook below I will be training each model for 10 epochs for time reasons. This is really not enough to see some of the gains, as the initial model is still clearly making steady improvement at this stage. You should really train each model for at least 100 epochs.

We'll start with the fashion MNIST dataset and a simple network:

In [6]:
import tensorflow as tf
from tensorflow import keras

fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu"))
model.add(keras.layers.Dense(100, activation="relu"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_3 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 300)               235500    
_________________________________________________________________
dense_10 (Dense)             (None, 100)               30100     
_________________________________________________________________
dense_11 (Dense)             (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________


In [7]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: sgd, and metric: accuracy.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## Time

I'm getting roughly .87 accuracy after 10 epochs, 8s per epoch for a training time of 84s.  Not bad. 

To time the whole run we can use the `time` library to mark the beginning and the end. The `time.time()` command returns the number of seconds since January 1, 1970, 00:00:00 (UTC), the so-called _epoch_. We can then measure the total time of a training run with

    import time

    start = time.time()
    ## Run Training Code
    end = time.time()
    print(end - start)
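
If the same timing pattern is repeated in many cells, it can be wrapped in a small context manager. The sketch below is one possible way to do that; the `timer` helper is a hypothetical name, not part of the original notebook, and it swaps `time.time()` for `time.perf_counter()`, a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label="Training Time"):
    """Print the wall-clock time spent inside the `with` block (hypothetical helper)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    result = {}
    try:
        yield result
    finally:
        # Store the elapsed time so the caller can also read it afterwards
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f}s")

# Usage: replace the sleep with the model.fit(...) call
with timer() as t:
    time.sleep(0.1)  # stand-in for the training code
```

The elapsed time remains available in `t["seconds"]` after the block exits, which is handy for comparing runs.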

## Activation Functions

Changing the activation functions using Keras is simple. In the creation of the dense layer, you just specify `activation=` and `kernel_initializer=`

    keras.layers.Dense(10, activation="relu", kernel_initializer="he_normal")
    
For the full list of activations, see https://keras.io/activations/. In our case, let's change all of our layers to ELU units with He initialization. The default kernel initializer is `glorot_uniform`, or uniformly distributed weights according to the Glorot normalization. If we change the activation we should also change the initializer. The list of initializers can be found at https://keras.io/initializers/.
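
The pairing rule from the top of this lab (He for ReLU variants, LeCun for SELU, Glorot for others) can be written down as a small lookup. This is only a sketch: `pick_initializer` is a hypothetical helper, and the set of names treated as "ReLU variants" is an assumption made for illustration.

```python
def pick_initializer(activation):
    """Return a Keras initializer name suited to an activation name (hypothetical helper)."""
    # Assumed set of ReLU-like activations for this sketch
    relu_family = {"relu", "elu", "leaky_relu"}
    if activation in relu_family:
        return "he_normal"
    if activation == "selu":
        return "lecun_normal"
    # Everything else falls back to the Dense layer default
    return "glorot_uniform"

for act in ["relu", "elu", "selu", "tanh"]:
    print(act, "->", pick_initializer(act))

# With Keras available, the result can be passed straight to a layer:
# keras.layers.Dense(300, activation="selu",
#                    kernel_initializer=pick_initializer("selu"))
```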

In [11]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_5 (Flatten)          (None, 784)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 300)               235500    
_________________________________________________________________
dense_16 (Dense)             (None, 100)               30100     
_________________________________________________________________
dense_17 (Dense)             (None, 10)                1010      
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
_________________________________________________________________


In [12]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: sgd, and metric: accuracy.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

import time
start = time.time()

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

end = time.time()
print("Training Time:", end - start)

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Time: 81.31610941886902


In this case we're getting slightly better results with the original ReLU function, but try training longer and see what you find. We're already doing fairly well here so we might not see too many gains. 

## Batch Normalization

Recall that Batch Normalization adds a layer before each dense layer that normalizes the input data. In Keras, implementing batch normalization works exactly as you might expect: add a `keras.layers.BatchNormalization()` layer before any layer whose inputs we want to normalize.

In [13]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_6 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization (BatchNo (None, 784)               3136      
_________________________________________________________________
dense_18 (Dense)             (None, 300)               235500    
_________________________________________________________________
batch_normalization_1 (Batch (None, 300)               1200      
_________________________________________________________________
dense_19 (Dense)             (None, 100)               30100     
_________________________________________________________________
batch_normalization_2 (Batch (None, 100)               400       
_________________________________________________________________
dense_20 (Dense)             (None, 10)               

Notice that the number of parameters has gone up. This is because each batch normalization layer has its own parameters: a learned scale and offset, plus running statistics of the data. Let's take a quick look at a batch normalization layer. Using the `model.layers` list, we will access the variables of the first batch normalization layer (`model.layers[1]`) and take a look at the 3136 parameters inside:

In [15]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('batch_normalization/gamma:0', True),
 ('batch_normalization/beta:0', True),
 ('batch_normalization/moving_mean:0', False),
 ('batch_normalization/moving_variance:0', False)]

We see that there are two trainable parameters, the scale `gamma` and the offset (center) `beta`, and two non-trainable parameters computed from the batches, the `moving_mean` $\hat{\mu}$ and the `moving_variance` $\hat{\sigma}^2$. The last two are moving averages of the batch means and variances, to be used at prediction time. There are a lot of little hyperparameters to play with when using batch normalization, but usually the defaults can be used.
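
To make the role of the four variables concrete, here is a minimal NumPy sketch of what a batch normalization layer computes. It is an illustration under stated assumptions, not the Keras implementation: the function names are hypothetical, and `momentum=0.99` and `eps=1e-3` are chosen to mirror the documented Keras defaults.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, moving_mean, moving_var,
                     momentum=0.99, eps=1e-3):
    """Training mode: normalize with batch statistics and update the moving averages."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    out = gamma * x_hat + beta               # learned scale and offset
    new_mean = momentum * moving_mean + (1 - momentum) * mu
    new_var = momentum * moving_var + (1 - momentum) * var
    return out, new_mean, new_var

def batch_norm_predict(x, gamma, beta, moving_mean, moving_var, eps=1e-3):
    """Prediction mode: use the stored moving statistics instead of batch statistics."""
    x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # 32 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)              # initial scale and offset
moving_mean, moving_var = np.zeros(4), np.ones(4)  # initial running statistics

out, moving_mean, moving_var = batch_norm_train(x, gamma, beta, moving_mean, moving_var)
print(out.mean(axis=0).round(6))   # close to 0 for every feature
print(out.std(axis=0).round(3))    # close to 1 for every feature
```

Each of the four variables holds one value per input feature, which is why the layer after `Flatten` reports 4 × 784 = 3136 parameters.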

A final note: batch normalization already centers the data with its own offset, so a bias term would be redundant. Dense layers contain a bias term by default, so you may want to turn it off using `use_bias=False`.
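
The parameter counts reported by `model.summary()` can be checked by hand, which also shows what dropping the bias saves. The two helper functions below are hypothetical names used only for this arithmetic.

```python
def dense_params(n_inputs, n_units, use_bias=True):
    """Weights plus (optionally) one bias per unit."""
    return n_inputs * n_units + (n_units if use_bias else 0)

def batch_norm_params(n_features):
    """gamma, beta, moving_mean, moving_variance: one value each per feature."""
    return 4 * n_features

print(batch_norm_params(784))                  # 3136
print(dense_params(784, 300))                  # 235500
print(dense_params(784, 300, use_bias=False))  # 235200
print(dense_params(300, 100, use_bias=False))  # 30000
```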

In [21]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal",use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Model: "sequential_11"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_11 (Flatten)         (None, 784)               0         
_________________________________________________________________
batch_normalization_13 (Batc (None, 784)               3136      
_________________________________________________________________
dense_31 (Dense)             (None, 300)               235200    
_________________________________________________________________
batch_normalization_14 (Batc (None, 300)               1200      
_________________________________________________________________
dense_32 (Dense)             (None, 100)               30000     
_________________________________________________________________
batch_normalization_15 (Batc (None, 100)               400       
_________________________________________________________________
dense_33 (Dense)             (None, 10)              

In [22]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: sgd, and metric: accuracy.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

import time
start = time.time()

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

end = time.time()
print("Training Time:", end - start)

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Time: 125.75716733932495


It has also been argued that batch normalization layers should be added before the activation function, so that the data is properly centered and scaled before activation. To do this, remove the activation argument from the dense layers and add a `keras.layers.Activation("elu")` layer after each batch normalization layer:

In [23]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(300, kernel_initializer="he_normal",use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("elu"))
model.add(keras.layers.Dense(100, kernel_initializer="he_normal",use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("elu"))
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Model: "sequential_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_12 (Flatten)         (None, 784)               0         
_________________________________________________________________
batch_normalization_16 (Batc (None, 784)               3136      
_________________________________________________________________
dense_34 (Dense)             (None, 300)               235200    
_________________________________________________________________
batch_normalization_17 (Batc (None, 300)               1200      
_________________________________________________________________
activation (Activation)      (None, 300)               0         
_________________________________________________________________
dense_35 (Dense)             (None, 100)               30000     
_________________________________________________________________
batch_normalization_18 (Batc (None, 100)             

In [24]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: sgd, and metric: accuracy.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

import time
start = time.time()

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

end = time.time()
print("Training Time:", end - start)

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Time: 125.7387306690216


We see that our training is taking longer due to the extra steps added, but we are rewarded with a higher validation accuracy. 

## Gradient Clipping and Optimizers

Gradient clipping is done at the optimizer level by setting the `clipvalue` to your desired value. For example, to use gradient clipping with SGD we use

    optimizer = keras.optimizers.SGD(clipvalue=1.0)
    model.compile(loss="mse", optimizer=optimizer)
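
Keras optimizers also accept a `clipnorm` argument, which rescales the whole gradient instead of clipping each component. The plain-Python sketch below illustrates the difference between the two ideas; the function names are hypothetical and this is not how Keras implements clipping internally.

```python
import math

def clip_by_value(grad, clipvalue):
    """Clip every component into [-clipvalue, clipvalue]; may change the direction."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grad]

def clip_by_norm(grad, clipnorm):
    """Rescale the whole vector if its L2 norm exceeds clipnorm; keeps the direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)
    return [g * clipnorm / norm for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # [0.9, 1.0]
print(clip_by_norm(grad, 1.0))   # same direction as grad, norm 1
```

Clipping by value turns a gradient that pointed almost entirely along the second axis into one that points roughly along the diagonal, while clipping by norm preserves the ratio between the components.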
    
#### Momentum Optimization and Nesterov Accelerated Gradient

Momentum optimization can be similarly used by setting the `momentum = ` value in most optimizers to be nonzero. For example, to enable momentum optimization in SGD we use

    optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)

Similarly, to use the Nesterov trick we pass `nesterov=True` to the optimizer:

    optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
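
To see what these two flags change, here is a small sketch of the momentum and Nesterov update rules applied to the one-dimensional function $f(w) = w^2$. It follows one common formulation of the updates; the function name, the test problem, and the step settings are assumptions chosen for illustration.

```python
def sgd_momentum(grad_fn, w, learning_rate=0.1, momentum=0.9,
                 nesterov=False, steps=200):
    """Minimize a 1-D function with momentum (or Nesterov) updates."""
    velocity = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # The velocity accumulates an exponentially decaying sum of past gradients
        velocity = momentum * velocity - learning_rate * g
        if nesterov:
            # Nesterov: take the step using the look-ahead velocity
            w = w + momentum * velocity - learning_rate * g
        else:
            w = w + velocity
    return w

grad = lambda w: 2.0 * w  # gradient of f(w) = w**2

print(sgd_momentum(grad, 5.0))                 # plain momentum
print(sgd_momentum(grad, 5.0, nesterov=True))  # Nesterov variant
```

Both runs should end close to the minimum at $w = 0$; varying `steps` shows how quickly each variant damps its oscillations.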
    
#### RMSProp and Adam

RMSProp and Adam are built-in Keras optimizers and can be initialized using

    optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

and

    optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
    
For Adam, `beta_1` is the momentum falloff rate and `beta_2` is the scale falloff rate.
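
The roles of `beta_1` and `beta_2` are easier to see in code. The following is a textbook-style sketch of the Adam update for a single scalar weight, not the Keras implementation; the function name and the test function $f(w) = w^2$ are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, w, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, steps=1):
    """Run `steps` Adam updates on a scalar weight."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g        # decaying average of gradients
        v = beta_2 * v + (1 - beta_2) * g * g    # decaying average of squared gradients
        m_hat = m / (1 - beta_1 ** t)            # bias correction for the zero start
        v_hat = v / (1 - beta_2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w

grad = lambda w: 2.0 * w  # gradient of f(w) = w**2

print(adam_minimize(grad, 5.0, steps=1))
print(adam_minimize(grad, 5.0, learning_rate=0.1, steps=500))
```

Because the step is divided by the square root of the squared-gradient average, the very first update has size close to `learning_rate` regardless of how large the gradient is.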

Nadam and AdaMax are also implemented: https://keras.io/optimizers/.  


### Implementing

To use a different optimizer, and to be able to specify its parameters, we just change the compile line

    model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])

to take an initialized optimizer:

    optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

The code below implements the Adam optimizer.

In [16]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(10, activation="softmax"))

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_7 (Flatten)          (None, 784)               0         
_________________________________________________________________
batch_normalization_3 (Batch (None, 784)               3136      
_________________________________________________________________
dense_21 (Dense)             (None, 300)               235500    
_________________________________________________________________
batch_normalization_4 (Batch (None, 300)               1200      
_________________________________________________________________
dense_22 (Dense)             (None, 100)               30100     
_________________________________________________________________
batch_normalization_5 (Batch (None, 100)               400       
_________________________________________________________________
dense_23 (Dense)             (None, 10)               

In [17]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: Adam, and metric: accuracy.

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

import time
start = time.time()

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

end = time.time()
print("Training Time:", end - start)

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Time: 129.37052965164185


Just by changing the optimizer we're above 90% accuracy, and with no appreciable difference in training time.  

## Adding Regularization

Regularization is done at the layer level, allowing us to choose which layers we regularize if we wish. To turn on regularization, pass a `kernel_regularizer=` argument to the layer creation function

    layer = keras.layers.Dense(100, activation="elu",
        kernel_initializer="he_normal",
        kernel_regularizer=keras.regularizers.l2(0.01))
        
For $\ell_1$ or mixed $\ell_1$-$\ell_2$ regularization use `keras.regularizers.l1` or `keras.regularizers.l1_l2`. Max-Norm is implemented as a constraint rather than a regularizer: pass `kernel_constraint=keras.constraints.max_norm(1.)` to the layer. The full list of regularizers can be found here: https://keras.io/regularizers/
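
As a rough illustration of what these options do to a weight matrix, the NumPy sketch below computes the $\ell_1$ and $\ell_2$ penalty terms and applies a max-norm rescaling. The function names are hypothetical, and the code is a sketch of the ideas rather than the Keras source; it assumes each column of the kernel holds the incoming weights of one unit.

```python
import numpy as np

def l2_penalty(weights, l2=0.01):
    """Term added to the loss: l2 * sum of squared weights."""
    return l2 * np.sum(np.square(weights))

def l1_penalty(weights, l1=0.01):
    """Term added to the loss: l1 * sum of absolute weights."""
    return l1 * np.sum(np.abs(weights))

def max_norm(weights, max_value=1.0, eps=1e-7):
    """Rescale each unit's incoming weight vector so its norm is at most max_value."""
    norms = np.sqrt(np.sum(np.square(weights), axis=0, keepdims=True))
    desired = np.clip(norms, 0, max_value)
    return weights * desired / (eps + norms)

W = np.array([[3.0, 0.1],
              [4.0, 0.2]])

print(l2_penalty(W))  # 0.01 * (9 + 16 + 0.01 + 0.04)
print(l1_penalty(W))  # 0.01 * (3 + 4 + 0.1 + 0.2)
print(max_norm(W))    # first column rescaled to norm 1, second left almost unchanged
```

The penalties change the loss that is minimized, while the constraint is applied to the weights directly after each update.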

In [25]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Dense(10, activation="softmax",kernel_regularizer=keras.regularizers.l2(0.01)))

model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_13 (Flatten)         (None, 784)               0         
_________________________________________________________________
batch_normalization_19 (Batc (None, 784)               3136      
_________________________________________________________________
dense_37 (Dense)             (None, 300)               235500    
_________________________________________________________________
batch_normalization_20 (Batc (None, 300)               1200      
_________________________________________________________________
dense_38 (Dense)             (None, 100)               30100     
_________________________________________________________________
batch_normalization_21 (Batc (None, 100)               400       
_________________________________________________________________
dense_39 (Dense)             (None, 10)              

In [26]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: Adam, and metric: accuracy.

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

import time
start = time.time()

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

end = time.time()
print("Training Time:", end - start)

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Time: 126.95995712280273


Regularization is slowing down the training, which is perhaps unsurprising, but it seems to have really punished us for no particular gain in accuracy. We may be regularizing too much, but it could also be that, while it takes longer to get where it's going, the regularized classifier is more accurate on new data.

## Dropout

To add dropout to the layers of the network, we simply need to add a dropout layer before any dense layers:

    keras.layers.Dropout(rate=0.2)
    
Dropout tends to significantly slow convergence, but the results are much better if you have the training time.
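
For intuition about what the layer does, here is a minimal NumPy sketch of "inverted" dropout: units are zeroed at random during training and the survivors are scaled up so the expected activation is unchanged, while at prediction time the input passes through untouched. The `dropout` function is a hypothetical stand-in, not the Keras layer.

```python
import numpy as np

def dropout(x, rate=0.2, training=True, rng=None):
    """Inverted dropout: zero a fraction `rate` of entries, scale the rest by 1/(1-rate)."""
    if not training:
        return x  # dropout is inactive at prediction time
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(x.shape) >= rate   # True where the unit is kept
    return x * keep_mask / (1.0 - rate)

x = np.ones((1000, 100))
out = dropout(x, rate=0.2, rng=np.random.default_rng(0))

print("fraction dropped:", np.mean(out == 0))   # close to the rate
print("mean activation:", out.mean())           # close to the input mean of 1
```

Since each training step sees a different random mask, the loss is noisier from step to step, which is one way to understand the slower convergence.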

In [27]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal",kernel_regularizer=keras.regularizers.l2(0.01)))
model.add(keras.layers.Dropout(rate=0.2))
model.add(keras.layers.Dense(10, activation="softmax",kernel_regularizer=keras.regularizers.l2(0.01)))

model.summary()

Model: "sequential_14"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_14 (Flatten)         (None, 784)               0         
_________________________________________________________________
dropout (Dropout)            (None, 784)               0         
_________________________________________________________________
dense_40 (Dense)             (None, 300)               235500    
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0         
_________________________________________________________________
dense_41 (Dense)             (None, 100)               30100     
_________________________________________________________________
dropout_2 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_42 (Dense)             (None, 10)              

In [28]:
## Compile the model with loss: sparse_categorical_crossentropy, optimizer: Adam, and metric: accuracy.

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

import time
start = time.time()

## Fit the model for 10 epochs, validating on the validation data.
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

end = time.time()
print("Training Time:", end - start)

Train on 55000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training Time: 67.59742403030396


# Question:

What about using dropout and batch normalization together? Does the order matter? For example, we could use

Dense -> Batch Norm -> Activation -> Dropout -> Dense

as the original paper suggests, but recent tests suggest this is actually not the optimal order. Perform some tests to try to determine which ordering works best for the network above.