## Hands-On Machine Learning with Scikit-Learn & TensorFlow

## CHAPTER 11: Training Deep Neural Nets

### Common DNN problems:
1.  Vanishing Gradients:
    -  gradients get smaller and smaller as backpropagation progresses down to the lower layers, so Gradient Descent leaves the lower-layer weights virtually unchanged and training never converges to a good solution
    -  **solutions** => He Init; ELU; BN
2.  Exploding Gradients:
    -  gradients grow larger and larger, so many layers get extremely large weight updates and training diverges
    -  **solutions** => He Init; ELU; BN; Gradient Clipping
3.  Slow Training:
    -  **solutions** => faster optimizers (momentum, nesterov, adagrad, rmsprop, adam)
4.  Overfitting training set due to millions of parameters:
    -  **solutions** => early stopping, L1 and L2 regularization, dropout, max-norm regularization, data augmentation
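
A rough numeric illustration of problems 1 and 2, assuming a chain of single-unit sigmoid layers (the sigmoid choice and the helper names are assumptions for this sketch, not part of the notes): the backpropagated signal is a product of per-layer factors, so it shrinks or blows up exponentially with depth.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25

def backprop_factor(n_layers, z=0.0, weight=1.0):
    """Product of the local derivatives along a chain of 1-unit sigmoid layers."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * sigmoid_grad(z)
    return factor

for depth in (1, 5, 10, 20):
    # small weights: the factor vanishes; large weights: it explodes
    print(depth, backprop_factor(depth), backprop_factor(depth, weight=10.0))
```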

### _dying ReLUs_:
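-  some neurons "die" during training: they stop outputting anything other than 0
-  more likely with a large learning rate
-  happens when a neuron's weights get updated so that the weighted sum of its inputs is negative for every training instance, so ReLU always outputs 0
-  the neuron is unlikely to recover because the gradient of the ReLU function is 0 when its input is negative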
-  **solutions** => _leaky ReLU; ELU_
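
A minimal sketch of why a dead ReLU stays dead while a leaky ReLU can recover. The helper names are hypothetical; `alpha=0.01` mirrors the leak used in the `leaky_relu` exercise cell.

```python
def relu_grad(z):
    # derivative of max(0, z): no signal flows back for negative inputs
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    # derivative of max(alpha * z, z): a small slope survives for negative inputs
    return 1.0 if z > 0 else alpha

weighted_sums = [-3.0, -0.5, 2.0]
print([relu_grad(z) for z in weighted_sums])        # [0.0, 0.0, 1.0]
print([leaky_relu_grad(z) for z in weighted_sums])  # [0.01, 0.01, 1.0]
```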

### _Batch Normalization_:
-  adds an operation in the model just before the activation function of each layer
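-  zero-centers and normalizes each layer's inputs, then scales and shifts the result using two new parameters per layer (one for scaling, one for shifting)
-  during training, uses the current mini-batch's mean + standard deviation; at test time, uses the whole training set's mean + standard deviation (typically tracked as moving averages during training)
-  acts like a regularization technique; however, **training is rather slow at first while Gradient Descent is searching for the optimal scales and offsets**, and the extra computation at each layer makes predictions slower
-  **does not have a huge impact on small DNNs; however, BN is a great technique for larger DNNs (i.e. more layers/neurons)**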
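
A simplified sketch of the BN computation for one feature over one mini-batch, in plain Python. The function name and the `eps` default are assumptions for illustration; `gamma` and `beta` stand for the learned scale and offset.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Zero-center and normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
print(out)  # roughly zero mean and unit variance
```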

### Gradient Clipping:
-  process of clipping gradients during backpropagation so gradients never exceed some threshold
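
A minimal sketch of clipping by value in plain Python (the helper name is hypothetical), mirroring what `tf.clip_by_value` does element-wise in the exercise further down:

```python
def clip_by_value(grads, threshold):
    # force every gradient component into [-threshold, threshold]
    return [max(-threshold, min(threshold, g)) for g in grads]

print(clip_by_value([-5.0, -0.3, 0.0, 0.7, 12.0], threshold=1.0))
# [-1.0, -0.3, 0.0, 0.7, 1.0]
```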

### Reusing Pretrained Layers AKA "Transfer Learning":
-  better to find an existing NN with similar structure and reuse lower layers of network
-  helps speed up training and requires less training data
-  _**if the input data of the new model does not have the same size as the inputs of the reused model, add a preprocessing step to resize them; in general, transfer learning only works well if the inputs have similar low-level features**_
-  tensorflow model zoo => https://github.com/tensorflow/models

### Freezing Lower Layers / Caching Frozen Layers:
-  Freeze:
    -  means lower-layer weights will be fixed
    -  higher-layer weights will be easier to train
-  Cache:
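    -  frozen layers never change, so compute the output of the topmost frozen layer once for the whole training set, keep it in memory, and train the upper layers on these cached outputs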
    -  significantly improves training speed _**if system has enough RAM**_

### Faster Optimizers:
-  default optimizer:
    -  Gradient Descent:
        -  updates the weights by subtracting the gradient of the cost function with regard to the weights, multiplied by the learning rate
        -  if gradients are small tweaking is slow
-  faster optimizers:
    -  Momentum
    -  Nesterov
    -  AdaGrad
    -  RMSProp
    -  Adam
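
A toy comparison of plain Gradient Descent and momentum on a gently sloped 1-D cost, as a sketch only. The cost function, slope, and helper names are assumptions chosen to make the "small gradients, slow tweaking" point visible.

```python
def gradient(theta, slope=0.01):
    # gradient of the toy cost J(theta) = 0.5 * slope * theta**2
    return slope * theta

def run_gd(theta, lr, n_steps):
    for _ in range(n_steps):
        theta -= lr * gradient(theta)
    return theta

def run_momentum(theta, lr, n_steps, beta=0.9):
    m = 0.0  # momentum vector (a scalar here)
    for _ in range(n_steps):
        m = beta * m - lr * gradient(theta)  # past updates keep pushing
        theta += m
    return theta

theta_gd = run_gd(10.0, lr=0.1, n_steps=100)
theta_momentum = run_momentum(10.0, lr=0.1, n_steps=100)
print(theta_gd, theta_momentum)  # momentum ends up much closer to the optimum at 0
```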

### Learning Rate:
-  Constant Learning Rates:
    -  way too high => training might diverge
    -  too low => slow training
    -  slightly too high => fast progress at first, then training bounces around the optimum without settling
    -  **helps to train the NN several times for just a few epochs using different learning rates and compare the learning curves**
-  Learning Schedules (**reduce learning rate during training**):
    -  predetermined piecewise constant learning rate
    -  performance scheduling
    -  exponential scheduling
    -  power scheduling
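
Two of these schedules sketched as plain functions. The boundaries, values, and defaults are made-up numbers for illustration; the power schedule uses the common form eta0 * (1 + t/r)^(-c).

```python
def piecewise_constant(epoch, boundaries=(5, 15), values=(0.1, 0.01, 0.001)):
    """Predetermined piecewise constant rate: one value per epoch range."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

def power_schedule(step, eta0=0.1, r=1000.0, c=1.0):
    """Power scheduling: the rate drops quickly at first, then more and more slowly."""
    return eta0 * (1.0 + step / r) ** (-c)

print([piecewise_constant(e) for e in (0, 5, 20)])
print([power_schedule(s) for s in (0, 1000, 3000)])
```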

### Regularization:
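-  Early Stopping => interrupt training when performance on the validation set starts dropping
-  L1/L2 => constrain the NN's connection weights by adding a penalty to the cost function (**L1 tends to produce a sparse model**)
-  Dropout => at every training step, each neuron has a probability of being temporarily _"dropped out"_ AKA ignored during that step
-  MaxNorm => constrains the weights of each neuron's incoming connections so that their L2 norm stays ≤ the max-norm hyperparameter (r); reducing r increases regularization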
-  Data Augmentation => **generates new training instances via existing instances to boost training set size**
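
A minimal sketch of the max-norm constraint for one neuron's incoming weights, assuming it is applied after each training step (the helper name is hypothetical):

```python
import math

def max_norm_clip(weights, r=1.0):
    """Rescale a neuron's incoming weights so that their L2 norm is at most r."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= r:
        return list(weights)  # already within the constraint
    return [w * r / norm for w in weights]

print(max_norm_clip([3.0, 4.0], r=1.0))  # norm 5 is rescaled down to norm 1
print(max_norm_clip([0.3, 0.4], r=1.0))  # norm 0.5 is left unchanged
```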

## _Exercises_

In [1]:
import tensorflow as tf
print(tf.__version__)

import sklearn
print(sklearn.__version__)

1.13.1
0.20.0


In [2]:
n_inputs = 28 * 28 # MNIST
n_hidden1 = 300

### he init

In [4]:
tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

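# note: the defaults (scale=1.0, mode="fan_in") give a variance of 1/fan_in;
# He initialization proper corresponds to scale=2.0
he_init = tf.variance_scaling_initializer()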
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")
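
For intuition, a plain-Python sketch of He initialization in its fan-in, normal-distribution form, where the standard deviation is sqrt(2 / fan_in). The helper names and the seed are assumptions; this is not TensorFlow's implementation.

```python
import math
import random

def he_stddev(fan_in):
    # He initialization (fan-in variant): variance = 2 / fan_in
    return math.sqrt(2.0 / fan_in)

def he_normal(fan_in, fan_out, seed=42):
    """Draw a fan_in x fan_out weight matrix from N(0, 2 / fan_in)."""
    rng = random.Random(seed)
    std = he_stddev(fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

W = he_normal(28 * 28, 300)
print(he_stddev(28 * 28), len(W), len(W[0]))
```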

### leaky relu function

In [5]:
tf.reset_default_graph()

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")

### elu function

In [6]:
tf.reset_default_graph()

import numpy as np  # needed by the NumPy reference implementation below

def elu(z, alpha=1):
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

### bn

In [8]:
# build the layers, applying batch normalization before each activation
tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)

In [9]:
# same layers, using functools.partial to avoid repeating the BN parameters
tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

In [11]:
# complete bn example
tf.reset_default_graph()
import numpy as np

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

batch_norm_momentum = 0.9
learning_rate = 0.01
n_epochs = 20
batch_size = 200

def shuffle_batch(X, y, batch_size):
    rnd_idx = np.random.permutation(len(X))
    n_batches = len(X) // batch_size
    for batch_idx in np.array_split(rnd_idx, n_batches):
        X_batch, y_batch = X[batch_idx], y[batch_idx]
        yield X_batch, y_batch

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.variance_scaling_initializer()

    my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=batch_norm_momentum)

    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./bn_model.ckpt")

0 Validation accuracy: 0.8948
1 Validation accuracy: 0.917
2 Validation accuracy: 0.9296
3 Validation accuracy: 0.9392
4 Validation accuracy: 0.9456
5 Validation accuracy: 0.9524
6 Validation accuracy: 0.9542
7 Validation accuracy: 0.9598
8 Validation accuracy: 0.9612
9 Validation accuracy: 0.9644
10 Validation accuracy: 0.965
11 Validation accuracy: 0.967
12 Validation accuracy: 0.9678
13 Validation accuracy: 0.967
14 Validation accuracy: 0.9676
15 Validation accuracy: 0.9704
16 Validation accuracy: 0.9714
17 Validation accuracy: 0.971
18 Validation accuracy: 0.9716
19 Validation accuracy: 0.972


In [12]:
[v.name for v in tf.trainable_variables()]

['hidden1/kernel:0',
 'hidden1/bias:0',
 'batch_normalization/gamma:0',
 'batch_normalization/beta:0',
 'hidden2/kernel:0',
 'hidden2/bias:0',
 'batch_normalization_1/gamma:0',
 'batch_normalization_1/beta:0',
 'outputs/kernel:0',
 'outputs/bias:0',
 'batch_normalization_2/gamma:0',
 'batch_normalization_2/beta:0']

In [13]:
[v.name for v in tf.global_variables()]

['hidden1/kernel:0',
 'hidden1/bias:0',
 'batch_normalization/gamma:0',
 'batch_normalization/beta:0',
 'batch_normalization/moving_mean:0',
 'batch_normalization/moving_variance:0',
 'hidden2/kernel:0',
 'hidden2/bias:0',
 'batch_normalization_1/gamma:0',
 'batch_normalization_1/beta:0',
 'batch_normalization_1/moving_mean:0',
 'batch_normalization_1/moving_variance:0',
 'outputs/kernel:0',
 'outputs/bias:0',
 'batch_normalization_2/gamma:0',
 'batch_normalization_2/beta:0',
 'batch_normalization_2/moving_mean:0',
 'batch_normalization_2/moving_variance:0']
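
The `moving_mean` and `moving_variance` variables appear only in the global list because they are not trained by Gradient Descent; the extra update ops refresh them as exponential moving averages of the batch statistics. A minimal sketch of that update, assuming the usual rule `moving = momentum * moving + (1 - momentum) * batch_stat` (the helper name is hypothetical):

```python
def update_moving_average(moving, batch_stat, momentum=0.9):
    # a high momentum means the running estimate changes slowly
    return momentum * moving + (1.0 - momentum) * batch_stat

moving_mean = 0.0  # starting value of the running estimate
for batch_mean in [2.0, 2.0, 2.0]:
    moving_mean = update_moving_average(moving_mean, batch_mean)
print(moving_mean)  # creeps toward 2.0 one batch at a time
```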

### gradient clipping

In [14]:
# complete gradient clip example
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_hidden3 = 50
n_hidden4 = 50
n_hidden5 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3")
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4")
    hidden5 = tf.layers.dense(hidden4, n_hidden5, activation=tf.nn.relu, name="hidden5")
    logits = tf.layers.dense(hidden5, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

learning_rate = 0.01

threshold = 1.0 # clips between -1 and 1

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

init = tf.global_variables_initializer()
saver = tf.train.Saver()

n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./clipped_model.ckpt")

0 Validation accuracy: 0.4686
1 Validation accuracy: 0.7286
2 Validation accuracy: 0.8542
3 Validation accuracy: 0.8866
4 Validation accuracy: 0.9024
5 Validation accuracy: 0.9142
6 Validation accuracy: 0.922
7 Validation accuracy: 0.9266
8 Validation accuracy: 0.9286
9 Validation accuracy: 0.9338
10 Validation accuracy: 0.9384
11 Validation accuracy: 0.9416
12 Validation accuracy: 0.9458
13 Validation accuracy: 0.9496
14 Validation accuracy: 0.952
15 Validation accuracy: 0.9544
16 Validation accuracy: 0.9528
17 Validation accuracy: 0.957
18 Validation accuracy: 0.9582
19 Validation accuracy: 0.9606


### reuse tf model via loading specific layers

In [19]:
# re-using gradient clipping model
tf.reset_default_graph()

saver = tf.train.import_meta_graph("./clipped_model.ckpt.meta") # read meta from trained model

In [20]:
# graph's operations by names
for op in tf.get_default_graph().get_operations():
    print(op.name)

X
y
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
hidden1/bias/Initializer/zeros
hidden1/bias
hidden1/bias/Assign
hidden1/bias/read
dnn/hidden1/MatMul
dnn/hidden1/BiasAdd
dnn/hidden1/Relu
hidden2/kernel/Initializer/random_uniform/shape
hidden2/kernel/Initializer/random_uniform/min
hidden2/kernel/Initializer/random_uniform/max
hidden2/kernel/Initializer/random_uniform/RandomUniform
hidden2/kernel/Initializer/random_uniform/sub
hidden2/kernel/Initializer/random_uniform/mul
hidden2/kernel/Initializer/random_uniform
hidden2/kernel
hidden2/kernel/Assign
hidden2/kernel/read
hidden2/bias/Initializer/zeros
hidden2/bias
hidden2/bias/Assign
hidden2/bias/read
dn

In [21]:
# import / re-use specific layers
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")

accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

In [22]:
# store the important ops in a collection so they are easy to find later
for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)

# retrieve them from the collection
X, y, accuracy, training_op = tf.get_collection("my_important_ops")

# continue training the restored, pre-trained model
with tf.Session() as sess:
    saver.restore(sess, "./clipped_model.ckpt")
    
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./tf_reuse_model.ckpt")   

INFO:tensorflow:Restoring parameters from ./clipped_model.ckpt
0 Validation accuracy: 0.9616
1 Validation accuracy: 0.9628
2 Validation accuracy: 0.9626
3 Validation accuracy: 0.9642
4 Validation accuracy: 0.9644
5 Validation accuracy: 0.9652
6 Validation accuracy: 0.9664
7 Validation accuracy: 0.967
8 Validation accuracy: 0.9662
9 Validation accuracy: 0.9662
10 Validation accuracy: 0.9676
11 Validation accuracy: 0.9668
12 Validation accuracy: 0.9692
13 Validation accuracy: 0.9678
14 Validation accuracy: 0.9688
15 Validation accuracy: 0.972
16 Validation accuracy: 0.969
17 Validation accuracy: 0.9642
18 Validation accuracy: 0.97
19 Validation accuracy: 0.9694


### reuse tf model via loading all layers and chopping off layers not needed

In [23]:
# build a new graph for re-using layers
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")       # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2") # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3") # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")                         # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [24]:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression [restores hidden layers 1, 2, and 3]
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver() # saver for new model to save once trained

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./clipped_model.ckpt")

    for epoch in range(n_epochs):                                            
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size): 
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})        
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})     
        print(epoch, "Validation accuracy:", accuracy_val)                   

    save_path = saver.save(sess, "./tf_chopped_model.ckpt") # saves new model

INFO:tensorflow:Restoring parameters from ./clipped_model.ckpt
0 Validation accuracy: 0.918
1 Validation accuracy: 0.9358
2 Validation accuracy: 0.9426
3 Validation accuracy: 0.9482
4 Validation accuracy: 0.9522
5 Validation accuracy: 0.9538
6 Validation accuracy: 0.9554
7 Validation accuracy: 0.9574
8 Validation accuracy: 0.9588
9 Validation accuracy: 0.9592
10 Validation accuracy: 0.9606
11 Validation accuracy: 0.9612
12 Validation accuracy: 0.963
13 Validation accuracy: 0.9634
14 Validation accuracy: 0.9658
15 Validation accuracy: 0.9652
16 Validation accuracy: 0.9668
17 Validation accuracy: 0.967
18 Validation accuracy: 0.9654
19 Validation accuracy: 0.9692
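
The `scope` argument above is a regular expression matched against the start of each variable name. A small stand-alone sketch of that filtering with `re.match`; the name list is written out by hand to follow the naming pattern seen earlier, and `filter_by_scope` is a hypothetical helper, not a TensorFlow function.

```python
import re

variable_names = [
    "hidden1/kernel:0", "hidden1/bias:0",
    "hidden2/kernel:0", "hidden2/bias:0",
    "hidden3/kernel:0", "hidden3/bias:0",
    "hidden4/kernel:0", "hidden4/bias:0",
    "outputs/kernel:0", "outputs/bias:0",
]

def filter_by_scope(names, scope):
    # re.match anchors the pattern at the beginning of the name
    pattern = re.compile(scope)
    return [name for name in names if pattern.match(name)]

print(filter_by_scope(variable_names, "hidden[123]"))         # layers to restore
print(filter_by_scope(variable_names, "hidden[34]|outputs"))  # e.g. layers to keep training
```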


### reuse models from other frameworks [_can get tricky!_]

In [25]:
tf.reset_default_graph()

n_inputs = 2
n_hidden1 = 3

In [26]:
original_w = [[1., 2., 3.], [4., 5., 6.]] # Load the weights from the other framework
original_b = [7., 8., 9.]                 # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
# [...] Build the rest of the model

# Get a handle on the assignment nodes for the hidden1 variables
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # [...] Train the model on your new task
    print(hidden1.eval(feed_dict={X: [[10.0, 11.0]]}))

[[ 61.  83. 105.]]


### freeze lower layers:
-  provide optimizer list of variables to train [_exclude "freeze" variables from lower layers_]

In [27]:
# create new graph for demonstrating freezing v1 example
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")       # reused
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2") # reused
    hidden3 = tf.layers.dense(hidden2, n_hidden3, activation=tf.nn.relu, name="hidden3") # reused
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu, name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")                         # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

In [28]:
with tf.name_scope("train"):                                         
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)     
    train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                   scope="hidden[34]|outputs") # vars from hidden layers 3 & 4 and the output layer
    training_op = optimizer.minimize(loss, var_list=train_vars) # leaves out hidden layers 1 & 2 vars, hence frozen!

In [29]:
init = tf.global_variables_initializer()
new_saver = tf.train.Saver()

In [30]:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./clipped_model.ckpt")

    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./freeze_lower_layers_v1.ckpt")

INFO:tensorflow:Restoring parameters from ./clipped_model.ckpt
0 Validation accuracy: 0.8596
1 Validation accuracy: 0.9258
2 Validation accuracy: 0.9384
3 Validation accuracy: 0.9452
4 Validation accuracy: 0.9474
5 Validation accuracy: 0.952
6 Validation accuracy: 0.9524
7 Validation accuracy: 0.954
8 Validation accuracy: 0.9558
9 Validation accuracy: 0.9556
10 Validation accuracy: 0.9556
11 Validation accuracy: 0.955
12 Validation accuracy: 0.9556
13 Validation accuracy: 0.9552
14 Validation accuracy: 0.9566
15 Validation accuracy: 0.9554
16 Validation accuracy: 0.956
17 Validation accuracy: 0.9566
18 Validation accuracy: 0.9572
19 Validation accuracy: 0.9564


### freeze lower layers:
-  via **stop_gradient()**

In [31]:
# create new graph for demonstrating freezing v2 example
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [32]:
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1") # reused frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2") # reused frozen
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3") # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs") # new!

In [33]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [34]:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./clipped_model.ckpt")

    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./freeze_lower_layers_v2.ckpt")

INFO:tensorflow:Restoring parameters from ./clipped_model.ckpt
0 Validation accuracy: 0.9106
1 Validation accuracy: 0.9376
2 Validation accuracy: 0.9436
3 Validation accuracy: 0.9502
4 Validation accuracy: 0.951
5 Validation accuracy: 0.9526
6 Validation accuracy: 0.9522
7 Validation accuracy: 0.9528
8 Validation accuracy: 0.9522
9 Validation accuracy: 0.9546
10 Validation accuracy: 0.9536
11 Validation accuracy: 0.9532
12 Validation accuracy: 0.9546
13 Validation accuracy: 0.9558
14 Validation accuracy: 0.955
15 Validation accuracy: 0.9564
16 Validation accuracy: 0.9582
17 Validation accuracy: 0.9558
18 Validation accuracy: 0.957
19 Validation accuracy: 0.9576


### cache frozen layers

In [35]:
# create new graph for demonstrating caching example
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300 # reused
n_hidden2 = 50  # reused
n_hidden3 = 50  # reused
n_hidden4 = 20  # new!
n_outputs = 10  # new!

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1") # reused frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2") # reused frozen & cached
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3") # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4") # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs") # new!

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [36]:
reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
restore_saver = tf.train.Saver(reuse_vars) # to restore layers 1-3

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [37]:
import numpy as np

n_batches = len(X_train) // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./clipped_model.ckpt")
    
    h2_cache = sess.run(hidden2, feed_dict={X: X_train}) # cache
    h2_cache_valid = sess.run(hidden2, feed_dict={X: X_valid}) # cache

    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(len(X_train))
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(y_train[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            sess.run(training_op, feed_dict={hidden2:hidden2_batch, y:y_batch})

        accuracy_val = accuracy.eval(feed_dict={hidden2: h2_cache_valid, 
                                                y: y_valid})             
        print(epoch, "Validation accuracy:", accuracy_val)               

    save_path = saver.save(sess, "./cache_layers.ckpt")

INFO:tensorflow:Restoring parameters from ./clipped_model.ckpt
0 Validation accuracy: 0.912
1 Validation accuracy: 0.9312
2 Validation accuracy: 0.9392
3 Validation accuracy: 0.9448
4 Validation accuracy: 0.9464
5 Validation accuracy: 0.9502
6 Validation accuracy: 0.951
7 Validation accuracy: 0.9516
8 Validation accuracy: 0.953
9 Validation accuracy: 0.9526
10 Validation accuracy: 0.9548
11 Validation accuracy: 0.9552
12 Validation accuracy: 0.9552
13 Validation accuracy: 0.9558
14 Validation accuracy: 0.9554
15 Validation accuracy: 0.9572
16 Validation accuracy: 0.9568
17 Validation accuracy: 0.9576
18 Validation accuracy: 0.9582
19 Validation accuracy: 0.9588


### momentum optimization

In [38]:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)

### nesterov accelerated gradient

In [39]:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)
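
A toy sketch of the difference between plain momentum and Nesterov accelerated gradient in their textbook form, where the gradient is measured slightly ahead at `theta + beta * m`. The cost function and helper names are assumptions, and TensorFlow's internal implementation may differ in details.

```python
def gradient(theta):
    return theta  # gradient of the toy cost J(theta) = 0.5 * theta**2

def momentum_path(theta, lr=0.1, beta=0.9, n_steps=200, nesterov=False):
    m, path = 0.0, [theta]
    for _ in range(n_steps):
        # Nesterov looks ahead in the direction of the momentum before measuring the gradient
        lookahead = theta + beta * m if nesterov else theta
        m = beta * m - lr * gradient(lookahead)
        theta += m
        path.append(theta)
    return path

plain = momentum_path(10.0)
nag = momentum_path(10.0, nesterov=True)
print(plain[-1], nag[-1])  # both approach the optimum at 0
```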

### adagrad

In [40]:
optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)

### rmsprop

In [41]:
optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)
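
A scalar sketch contrasting AdaGrad and RMSProp. AdaGrad accumulates all past squared gradients, so its steps keep shrinking; RMSProp uses an exponentially decaying average instead. The helper names are hypothetical, and the `momentum` term passed to `RMSPropOptimizer` above is left out of this simplified version.

```python
import math

def adagrad_steps(grads, lr=0.01, eps=1e-10):
    s, steps = 0.0, []
    for g in grads:
        s += g * g  # the sum of squared gradients only ever grows
        steps.append(lr * g / math.sqrt(s + eps))
    return steps

def rmsprop_steps(grads, lr=0.01, decay=0.9, eps=1e-10):
    s, steps = 0.0, []
    for g in grads:
        s = decay * s + (1.0 - decay) * g * g  # only recent gradients count
        steps.append(lr * g / math.sqrt(s + eps))
    return steps

grads = [1.0] * 50  # constant gradient, to isolate the effect of the scaling
print(adagrad_steps(grads)[-1], rmsprop_steps(grads)[-1])
```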

### adam optimization

In [42]:
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
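
A scalar sketch of the Adam update in its standard formulation, which combines a momentum-like average of gradients with an RMSProp-like average of squared gradients, both bias-corrected. The function name is hypothetical; the defaults follow the commonly used values (0.9, 0.999, 1e-8).

```python
import math

def adam_minimize(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=1):
    m = s = 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g      # decaying average of gradients
        s = beta2 * s + (1.0 - beta2) * g * g  # decaying average of squared gradients
        m_hat = m / (1.0 - beta1 ** t)         # bias correction for the zero start
        s_hat = s / (1.0 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta

# the first step has a size of about lr, whatever the scale of the gradient
print(adam_minimize(lambda th: th, 10.0))
print(adam_minimize(lambda th: 1000.0 * th, 10.0))
```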

### learning schedule

In [43]:
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

In [44]:
with tf.name_scope("train"):
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False, name="global_step")
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                               decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
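
With its default `staircase=False`, `tf.train.exponential_decay` computes `initial_learning_rate * decay_rate ** (global_step / decay_steps)`. The plain-Python sketch below reproduces that formula with the values from the cell above, to show how the learning rate shrinks as training steps accumulate; the function name `exponential_decay` here is a local stand-in, not the TensorFlow op.

```python
def exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate):
    """Learning rate after global_step training steps (continuous decay)."""
    return initial_learning_rate * decay_rate ** (global_step / decay_steps)


initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1 / 10

for step in (0, 5000, 10000, 20000):
    lr = exponential_decay(initial_learning_rate, step, decay_steps, decay_rate)
    print(step, lr)
```

So the learning rate is divided by 10 every `decay_steps` steps, and `global_step` advances by one on each call to `training_op` because it is passed to `minimize()`.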

In [45]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [46]:
n_epochs = 5
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./learning_schedule.ckpt")

0 Validation accuracy: 0.963
1 Validation accuracy: 0.9724
2 Validation accuracy: 0.9758
3 Validation accuracy: 0.9794
4 Validation accuracy: 0.982


### l1 regularization:
-  manual method

In [47]:
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    logits = tf.layers.dense(hidden1, n_outputs, name="outputs")

In [48]:
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")

scale = 0.001 # l1 regularization hyperparameter

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")
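
The arithmetic of the manual ℓ1 penalty can be checked on a tiny example in plain Python. The weight values and the base loss below are made up for illustration, and `l1_penalty` is a hypothetical helper; only `scale` matches the cell above.

```python
def l1_penalty(weight_matrices):
    """Sum of absolute values of every weight in every matrix."""
    return sum(abs(w) for matrix in weight_matrices for row in matrix for w in row)


W1 = [[0.5, -1.0], [2.0, 0.0]]   # made-up hidden layer weights
W2 = [[-0.25], [0.75]]           # made-up output layer weights

scale = 0.001                    # l1 regularization hyperparameter
base_loss = 0.3                  # made-up average cross entropy

reg_losses = l1_penalty([W1, W2])
loss = base_loss + scale * reg_losses
print("penalty:", reg_losses, "total loss:", loss)
```

Only the kernels are penalized, not the biases, which mirrors the TensorFlow code above where just `hidden1/kernel:0` and `outputs/kernel:0` are fetched.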

In [49]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [50]:
n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./l1_reg_v1.ckpt")

0 Validation accuracy: 0.8166
1 Validation accuracy: 0.8572
2 Validation accuracy: 0.8752
3 Validation accuracy: 0.8864
4 Validation accuracy: 0.8934
5 Validation accuracy: 0.8974
6 Validation accuracy: 0.902
7 Validation accuracy: 0.9042
8 Validation accuracy: 0.9062
9 Validation accuracy: 0.9064
10 Validation accuracy: 0.908
11 Validation accuracy: 0.9094
12 Validation accuracy: 0.909
13 Validation accuracy: 0.9108
14 Validation accuracy: 0.9098
15 Validation accuracy: 0.9096
16 Validation accuracy: 0.909
17 Validation accuracy: 0.9062
18 Validation accuracy: 0.9078
19 Validation accuracy: 0.907


### l1 regularization:
-  function method (automated approach)

In [54]:
tf.reset_default_graph()

n_inputs = 28 * 28 # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

scale = 0.001

In [55]:
from functools import partial  # needed for partial(); harmless if already imported earlier

my_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")

In [56]:
with tf.name_scope("loss"):                                     
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(  
        labels=y, logits=logits)                                
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")   
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    loss = tf.add_n([base_loss] + reg_losses, name="loss")

In [57]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [58]:
n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./l1_reg_v2.ckpt")

0 Validation accuracy: 0.8242
1 Validation accuracy: 0.8746
2 Validation accuracy: 0.8902
3 Validation accuracy: 0.8994
4 Validation accuracy: 0.9054
5 Validation accuracy: 0.9102
6 Validation accuracy: 0.9114
7 Validation accuracy: 0.915
8 Validation accuracy: 0.916
9 Validation accuracy: 0.9164
10 Validation accuracy: 0.9188
11 Validation accuracy: 0.9198
12 Validation accuracy: 0.918
13 Validation accuracy: 0.9198
14 Validation accuracy: 0.9208
15 Validation accuracy: 0.9218
16 Validation accuracy: 0.9212
17 Validation accuracy: 0.9198
18 Validation accuracy: 0.9208
19 Validation accuracy: 0.9196


### dropout

In [62]:
tf.reset_default_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [63]:
training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")
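
What a dropout layer does to its inputs can be sketched in plain Python. During training each value is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)` so the expected sum stays the same; outside training the input passes through unchanged. The function below is a hypothetical stand-in written for illustration, not the TensorFlow implementation.

```python
import random


def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` during training and rescale the rest."""
    if not training:
        return list(values)
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
activations = [1.0] * 1000       # made-up layer outputs

dropped = dropout(activations, rate=0.5, training=True, rng=rng)
unchanged = dropout(activations, rate=0.5, training=False, rng=rng)

print("mean while training:", sum(dropped) / len(dropped))
print("mean at inference:  ", sum(unchanged) / len(unchanged))
```

This is why the `training` placeholder matters: it is fed as `True` only while running the training op and left at its default of `False` for evaluation.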

In [64]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [65]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training: True})
        # dropout is disabled here: `training` keeps its default value of False
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./dropout_opt.ckpt")

0 Validation accuracy: 0.9252
1 Validation accuracy: 0.9464
2 Validation accuracy: 0.9544
3 Validation accuracy: 0.9606
4 Validation accuracy: 0.96
5 Validation accuracy: 0.962
6 Validation accuracy: 0.9668
7 Validation accuracy: 0.967
8 Validation accuracy: 0.9686
9 Validation accuracy: 0.9698
10 Validation accuracy: 0.9692
11 Validation accuracy: 0.971
12 Validation accuracy: 0.9702
13 Validation accuracy: 0.9684
14 Validation accuracy: 0.9698
15 Validation accuracy: 0.9708
16 Validation accuracy: 0.9722
17 Validation accuracy: 0.9718
18 Validation accuracy: 0.9738
19 Validation accuracy: 0.9726


### max-norm regularization (user-defined function)

In [66]:
def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None # there is no regularization loss term
    return max_norm
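
The clipping that `tf.clip_by_norm(..., axes=1)` applies can be illustrated in plain Python: an ℓ2 norm is computed along axis 1 (one norm per row of the weight matrix), and any row whose norm exceeds the threshold is rescaled so that its norm equals the threshold. The helper `clip_rows_by_norm` and the weight values are assumptions made for the example.

```python
import math


def clip_rows_by_norm(weights, threshold):
    """Rescale every row whose l2 norm exceeds `threshold`; leave the others as is."""
    clipped = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        factor = threshold / norm if norm > threshold else 1.0
        clipped.append([w * factor for w in row])
    return clipped


weights = [[3.0, 4.0],    # norm 5.0, above the threshold, gets rescaled
           [0.3, 0.4]]    # norm 0.5, below the threshold, left unchanged

clipped = clip_rows_by_norm(weights, threshold=1.0)
print(clipped)
```

Since the regularizer function returns `None`, nothing is added to the loss; the constraint only takes effect when the collected assign ops are run after each training step.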

In [67]:
tf.reset_default_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")

In [68]:
max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [69]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [70]:
n_epochs = 20
batch_size = 50

clip_all_weights = tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)  # apply the max-norm clipping ops after every training step
        acc_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid}) 
        print(epoch, "Validation accuracy:", acc_valid)               

    save_path = saver.save(sess, "./max_norm_opt.ckpt")             

0 Validation accuracy: 0.9526
1 Validation accuracy: 0.971
2 Validation accuracy: 0.9748
3 Validation accuracy: 0.9754
4 Validation accuracy: 0.9776
5 Validation accuracy: 0.9788
6 Validation accuracy: 0.9794
7 Validation accuracy: 0.9786
8 Validation accuracy: 0.9816
9 Validation accuracy: 0.9788
10 Validation accuracy: 0.9808
11 Validation accuracy: 0.9836
12 Validation accuracy: 0.9828
13 Validation accuracy: 0.9826
14 Validation accuracy: 0.9816
15 Validation accuracy: 0.9828
16 Validation accuracy: 0.9818
17 Validation accuracy: 0.9836
18 Validation accuracy: 0.9834
19 Validation accuracy: 0.984


### additional exercises:

https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb

1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?
2. Is it okay to initialize the bias terms to 0?
3. Name three advantages of the ELU activation function over ReLU.
4. In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?
5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer?
6. Name three ways you can produce a sparse model.
7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)?
8. Deep Learning.
    -  a. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.
    -  b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.
    -  c. Tune the hyperparameters using cross-validation and see what precision you can achieve.
    -  d. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?
    -  e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?
9. Transfer learning.
    -  a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a fresh new one.
    -  b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?
    -  c. Try caching the frozen layers, and train the model again: how much faster is it now?
    -  d. Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?
    -  e. Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?
10. Pretraining on an auxiliary task.
    -  a. In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data. Start by building two DNNs (let’s call them DNN A and B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add a single output layer on top of both DNNs. You should use TensorFlow’s concat() function with axis=1 to concatenate the outputs of both DNNs along the horizontal axis, then feed the result to the output layer. This output layer should contain a single neuron using the logistic activation function.
    -  b. Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.
    -  c. Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.
    -  d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.
