# Training Deep Neural Nets

This is a theoretical article, so most of the code doesn't actually run. I will update it later.

In [1]:
import tensorflow as tf
import numpy as np

By default the `fully_connected()` function uses Xavier initialization. We can change this to He initialization:

In [2]:
X = tf.placeholder(tf.float32, shape=(None, 784), name='X')
y = tf.placeholder(tf.float32, shape=(None), name='y')

In [3]:
n_hidden1 = 300

In [4]:
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden_1 = tf.contrib.layers.fully_connected(X, n_hidden1, weights_initializer=he_init, scope='h1')

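As a rough illustration of what He initialization does, here is a NumPy-only sketch (not the `tf.contrib` implementation): weights are drawn from a zero-mean normal distribution with variance 2 / fan_in. The helper name `he_normal` and the seed are assumptions made for this example.

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    """Draw a (fan_in, fan_out) weight matrix with variance 2 / fan_in."""
    stddev = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=stddev, size=(fan_in, fan_out))

rng = np.random.default_rng(42)
W = he_normal(784, 300, rng)
print(W.shape)             # (784, 300)
print(W.var(), 2.0 / 784)  # the two values should be close
```
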
TensorFlow offers an `elu()` function that we can use to build our neural network.

In [6]:
hidden_1 = tf.contrib.layers.fully_connected(X, n_hidden1, activation_fn=tf.nn.elu)

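For reference, the ELU activation itself can be sketched in NumPy as follows. This is only an illustration of the formula (identity for positive inputs, `alpha * (exp(z) - 1)` otherwise), with `alpha=1.0` chosen as an assumption for the example.

```python
import numpy as np

def elu(z, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(z) - 1) otherwise."""
    z = np.asarray(z, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

z = np.array([-2.0, -0.5, 0.0, 1.5])
out = elu(z)
print(out)  # negative inputs saturate toward -alpha, positive ones pass through
```
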
Older TensorFlow releases did not have a predefined function for leaky ReLUs (later 1.x releases added `tf.nn.leaky_relu()`), but it is easy to define one:

In [9]:
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.contrib.layers.fully_connected(X, n_hidden1, activation_fn=leaky_relu)

## Implementing Batch Normalization with TensorFlow

In [10]:
from tensorflow.contrib.layers import batch_norm, fully_connected

n_inputs = 28*28 
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')

# tell the batch_norm() function whether it should use the current
# mini-batch's mean and stddev or the running avg
is_training = tf.placeholder(tf.bool, shape=(), name='is_training')

bn_params = {
    'is_training': is_training,
    'decay': 0.99, # decay rate of the running averages
    'updates_collections': None,
    # for non ReLU: 
    # 'scale': True 
}

hidden_1 = fully_connected(X, n_hidden1, scope='hidden1',
                          normalizer_fn=batch_norm, normalizer_params=bn_params)
hidden_2 = fully_connected(hidden_1, n_hidden2, scope='hidden2',
                          normalizer_fn=batch_norm, normalizer_params=bn_params)
logits = fully_connected(hidden_2, n_outputs, activation_fn=None, scope='outputs',
                        normalizer_fn=batch_norm, normalizer_params=bn_params)

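To make the role of `is_training` and `decay` more concrete, here is a simplified NumPy sketch of the normalization step. It leaves out the learned scale and offset parameters, and the function name, the `state` dictionary, and the `eps` value are assumptions made for this example rather than the `batch_norm()` internals.

```python
import numpy as np

def batch_norm_forward(x, state, is_training, decay=0.99, eps=1e-3):
    """Normalize x of shape (batch, features); `state` holds the running averages."""
    if is_training:
        mean, var = x.mean(axis=0), x.var(axis=0)
        # exponential moving averages, used later at test time
        state["mean"] = decay * state["mean"] + (1 - decay) * mean
        state["var"] = decay * state["var"] + (1 - decay) * var
    else:
        mean, var = state["mean"], state["var"]
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
state = {"mean": np.zeros(3), "var": np.ones(3)}
x = rng.normal(loc=5.0, scale=2.0, size=(64, 3))

train_out = batch_norm_forward(x, state, is_training=True)
test_out = batch_norm_forward(x, state, is_training=False)
print(train_out.mean(axis=0))  # close to 0: the batch statistics were used
print(state["mean"])           # moved only 1% of the way toward the batch mean
```
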
To avoid repeating the same parameters over and over, we can create an argument scope using the `arg_scope()` function: the first parameter is a list of functions, and the other parameters will be passed to these functions automatically.

In [None]:
with tf.contrib.framework.arg_scope(
    [fully_connected], 
    normalizer_fn=batch_norm, 
    normalizer_params=bn_params):
    hidden_1 = fully_connected(X, n_hidden1, scope='hidden1')
    hidden_2 = fully_connected(hidden_1, n_hidden2, scope='hidden2')
    logits = fully_connected(hidden_2, n_outputs, scope='outputs', activation_fn=None)

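The idea of binding shared arguments once can be mimicked in plain Python with `functools.partial`. This is only an analogy, not how `arg_scope()` is implemented, and `fake_fully_connected` is a hypothetical stand-in that merely records its arguments.

```python
from functools import partial

def fake_fully_connected(inputs, n_units, scope, normalizer_fn=None, activation_fn="relu"):
    """Stand-in for a layer constructor: it only records how it was called."""
    return {"inputs": inputs, "n_units": n_units, "scope": scope,
            "normalizer_fn": normalizer_fn, "activation_fn": activation_fn}

# Bind the shared argument once instead of repeating it on every call
bn_fully_connected = partial(fake_fully_connected, normalizer_fn="batch_norm")

hidden_1 = bn_fully_connected("X", 300, scope="hidden1")
logits = bn_fully_connected("hidden_1", 10, scope="outputs", activation_fn=None)
print(hidden_1["normalizer_fn"], logits["activation_fn"])  # batch_norm None
```
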
The execution phase is also pretty much the same, with one exception. Whenever you run an operation that depends on the `batch_norm` layer, you need to set the `is_training` placeholder to `True` or `False`:

`sess.run(training_op, feed_dict={is_training: True, X: X_batch, y: y_batch})`

### Gradient Clipping

In TensorFlow, the optimizer's `minimize()` function takes care of both computing the gradients and applying them. To clip the gradients in between, we must instead call the optimizer's `compute_gradients()` method first, then create an operation to clip the gradients using the `clip_by_value()` function, and finally create an operation to apply the clipped gradients using the optimizer's `apply_gradients()` method:

In [None]:
threshold = 1.0
learning_rate = 0.01
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var) 
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

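Clipping by value simply limits every gradient component to the range from `-threshold` to `threshold`. A minimal NumPy sketch of that element-wise rule, using made-up gradient values:

```python
import numpy as np

threshold = 1.0
grads = np.array([-3.2, -0.4, 0.0, 0.7, 5.1])  # made-up gradient values

# Element-wise clipping to the range [-threshold, threshold]
clipped = np.clip(grads, -threshold, threshold)
print(clipped.tolist())  # [-1.0, -0.4, 0.0, 0.7, 1.0]
```
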
## Reusing a TensorFlow Model

In [None]:
# ... construct the original model
saver = tf.train.Saver() # saver for all variables of the original model

with tf.Session() as sess:
    saver.restore(sess, './my_original_model.ckpt')
    # train it on our new task

In general we will want to reuse only part of the original model. A simple solution is to configure the `Saver` to restore only a subset of the variables from the original model. 

In [None]:
# ... build new model with the same definition as before for hidden layers 1-3

init = tf.global_variables_initializer()

reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                              scope='hidden[123]')
# map each name in the checkpoint to the variable that should receive its value
reuse_vars_dict = {var.op.name: var for var in reuse_vars}
original_saver = tf.train.Saver(reuse_vars_dict) # saver to restore the original model

new_saver = tf.train.Saver() # saver to save the new model

with tf.Session() as sess:
    sess.run(init)
    original_saver.restore(sess, './my_original_model.ckpt') # restore layers 1 to 3
    # ... train the new model
    new_saver.save(sess, './my_new_model.ckpt')

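The `scope` argument of `tf.get_collection()` is treated as a regular expression that is matched against the start of each variable name. The selection can be reproduced with the standard `re` module. The variable names below are hypothetical.

```python
import re

# Hypothetical variable names, as they might appear in a five-layer network
var_names = [
    "hidden1/weights:0", "hidden1/biases:0",
    "hidden2/weights:0", "hidden3/weights:0",
    "hidden4/weights:0", "outputs/weights:0",
]

scope = "hidden[123]"
# re.match anchors the pattern at the start of each name
reuse_names = [name for name in var_names if re.match(scope, name)]
print(reuse_names)
```
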
### Reusing Models from Other Frameworks

If the model was trained using another framework, we will need to load weights manually, then assign them to the appropriate variables.

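The manual step usually amounts to renaming the exported arrays and fixing their layout before assigning them. The sketch below assumes a hypothetical export in which dense weights are stored as (outputs, inputs), while our layers expect (inputs, outputs). The names in `foreign_weights` and `name_map` are made up for the example.

```python
import numpy as np

# Hypothetical export from another framework: dense weights stored as (outputs, inputs)
foreign_weights = {
    "layer1.weight": np.arange(6, dtype=np.float32).reshape(3, 2),
    "layer1.bias": np.zeros(3, dtype=np.float32),
}

# Hypothetical mapping from foreign names to the names used in our graph
name_map = {"layer1.weight": "hidden1/weights", "layer1.bias": "hidden1/biases"}

converted = {}
for foreign_name, value in foreign_weights.items():
    # Our dense layers expect (inputs, outputs), so 2-D kernels are transposed
    converted[name_map[foreign_name]] = value.T if value.ndim == 2 else value

print({name: value.shape for name, value in converted.items()})
# {'hidden1/weights': (2, 3), 'hidden1/biases': (3,)}
```
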
### Freezing the Lower Layers

It is a good idea to "freeze" the lower-layer weights when training the new DNN: if the lower-layer weights are fixed, the higher-layer weights will be easier to train, because they no longer have to adapt to a moving target.

To freeze the lower layers during training, the simplest solution is to give the optimizer the list of variables to train, excluding the variables from the lower layers:

In [None]:
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 
                              scope='hidden[34]|outputs')
training_op = optimizer.minimize(loss, var_list=train_vars)

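The effect of passing a restricted variable list can be sketched without TensorFlow: a gradient descent step is applied only to the listed variables, so the others keep their values. The parameter names, gradients, and learning rate here are made up for the example.

```python
import numpy as np

# Hypothetical parameters and gradients, keyed by variable name
params = {"hidden1/weights": np.ones(2), "hidden4/weights": np.ones(2),
          "outputs/weights": np.ones(2)}
grads = {name: np.full(2, 0.5) for name in params}

learning_rate = 0.5
train_vars = ["hidden4/weights", "outputs/weights"]  # hidden1 is frozen

# One gradient descent step that only touches the variables in train_vars
for name in train_vars:
    params[name] = params[name] - learning_rate * grads[name]

print(params["hidden1/weights"].tolist())  # [1.0, 1.0]
print(params["hidden4/weights"].tolist())  # [0.75, 0.75]
```
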
### $ \ell_1 $ and $ \ell_2 $ Regularization

In [None]:
# ... construct the neural net
with tf.contrib.framework.arg_scope(
    [fully_connected], 
    weights_regularizer=tf.contrib.layers.l1_regularizer(scale=0.01)):
    hidden1 = fully_connected(X, n_hidden1, scope='hidden1')
    # ...

__Don't forget to add the regularization losses to your overall loss.__

In [None]:
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name='loss')

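To see what gets added to the loss, here is a small NumPy sketch of both penalties for a made-up weight matrix. It assumes the usual definitions: the $ \ell_1 $ penalty is `scale` times the sum of absolute weights, and the $ \ell_2 $ penalty is `scale` times half the sum of squared weights (the factor 0.5 is a convention assumed here).

```python
import numpy as np

weights = np.array([[0.5, -1.0], [2.0, 0.0]])  # made-up weight matrix
base_loss = 1.25                               # made-up data loss
scale = 0.01

l1_penalty = scale * np.abs(weights).sum()            # 0.01 * 3.5
l2_penalty = scale * 0.5 * np.square(weights).sum()   # 0.01 * 0.5 * 5.25

loss_l1 = base_loss + l1_penalty
loss_l2 = base_loss + l2_penalty
print(loss_l1, loss_l2)
```
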
### Dropout

In [None]:
keep_prob = 0.5
X_drop = tf.contrib.layers.dropout(X, keep_prob, is_training=is_training)

hidden_1 = fully_connected(X_drop, n_hidden1, scope='hidden1')
hidden1_drop = tf.contrib.layers.dropout(hidden_1, keep_prob, is_training=is_training)

# ...
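
As a rough picture of what a dropout layer does, here is a NumPy sketch of inverted dropout: during training each unit is kept with probability `keep_prob` and the survivors are scaled by `1 / keep_prob`, while at test time the input passes through unchanged. The function below is an illustration written for this example, not the `tf.contrib` implementation.

```python
import numpy as np

def dropout(x, keep_prob, is_training, rng):
    """Inverted dropout: zero units at random and rescale the survivors."""
    if not is_training:
        return x  # at test time the input passes through unchanged
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, keep_prob=0.5, is_training=True, rng=rng)
test_out = dropout(x, keep_prob=0.5, is_training=False, rng=rng)
print(train_out.mean())  # close to 1.0, since survivors are scaled by 1 / keep_prob
```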