In [57]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
import os

In [58]:
n_inputs = 28*28 #MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [59]:
X = tf.placeholder(tf.float32, shape = (None, n_inputs), name = "X")
y = tf.placeholder(tf.int64, shape = (None), name = "y")

# TRAINING DEEP NEURAL NETWORKS

Three issues with training deep neural networks:

1. Vanishing or exploding gradient problem.
2. Training the network is slow.
3. Risk of overfitting with millions of parameters.

## 1) Vanishing Gradient Problem

### Choice of weight initialization

Use Xavier or He initialization of the weights (chosen according to the activation function) instead of plain $\mathcal{N}(0,1)$ initialization.

In [103]:
#Running this cell twice raises the ValueError shown below, because the
#variable h1/weights already exists in the default graph after the first run.
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = fully_connected(X, n_hidden1, weights_initializer = he_init, 
                          scope = "h1")

ValueError: Variable h1/weights already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

  File "/Users/siddharth/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 208, in variable
    caching_device=caching_device)
  File "/Users/siddharth/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 177, in func_with_args
    return func(*args, **current_args)
  File "/Users/siddharth/anaconda3/lib/python3.7/site-packages/tensorflow/contrib/framework/python/ops/variables.py", line 244, in model_variable
    caching_device=caching_device, device=device)
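
To make the idea behind these initializers concrete, here is a minimal pure-Python sketch of the textbook standard deviations: the fan-in variant of He initialization, $\sqrt{2/n_{in}}$, and Xavier initialization, $\sqrt{2/(n_{in}+n_{out})}$. The helper names are hypothetical, and the sketch does not claim to reproduce exactly what variance_scaling_initializer() does internally.

```python
import math
import random

def he_stddev(fan_in):
    """Standard deviation for He initialization (fan-in variant, for ReLU-like units)."""
    return math.sqrt(2.0 / fan_in)

def xavier_stddev(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) initialization with a normal distribution."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, stddev, seed=0):
    """Draw a fan_in x fan_out weight matrix from N(0, stddev^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, stddev) for _ in range(fan_out)]
            for _ in range(fan_in)]

# Same layer sizes as the first hidden layer in this notebook.
n_inputs, n_hidden1 = 28 * 28, 300
weights = init_weights(n_inputs, n_hidden1, he_stddev(n_inputs))
print(he_stddev(n_inputs), xavier_stddev(n_inputs, n_hidden1))
```

Both values are far smaller than 1, which is the point: the spread of the initial weights shrinks as the number of connections grows.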


### Nonsaturating Activation Functions

A poorly chosen activation function can lead to vanishing gradients. ReLU does not saturate for positive inputs, so it largely avoids this issue. However, ReLU suffers from the dying units problem: if the input to a neuron becomes negative, its output becomes 0, the gradient is 0 as well, and the unit is unable to learn any further.

Leaky ReLU solves this issue: negative inputs do not lead to dead units but rather to units in a coma that can be revived.

$$ \text{LeakyReLU}_{\alpha}(z) = \max(\alpha z, z) $$

where $\alpha$ is the slope of the function for $z < 0$, i.e. how much it "leaks" (usually between 0.01 and 0.2).

An improvement on leaky ReLU that makes the curve smoother around 0 is the Exponential Linear Unit (ELU). It overcomes the dying units problem and also speeds up convergence.

$$
\text{ELU}_{\alpha}(z) =
\begin{cases}
\alpha(\exp(z) - 1) & \text{if } z < 0 \\
z & \text{if } z \geq 0
\end{cases}
$$

The code for ELU is as follows:

In [5]:
hidden1 = fully_connected(X, n_hidden1, activation_fn = tf.nn.elu)

Leaky ReLU has to be defined by hand:

In [6]:
def leaky_relu(z, name = None):
    return tf.maximum(0.01*z, z, name = name)

hidden1 = fully_connected(X, n_hidden1, activation_fn = leaky_relu)
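
To compare the three activation functions on plain numbers, here is a small scalar sketch that follows the formulas above. It uses only the standard library; the default alpha values are assumptions chosen for illustration.

```python
import math

def relu(z):
    """Standard ReLU: outputs 0 for every negative input."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: keeps a small slope alpha for negative inputs."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """ELU: exponential curve for z < 0, identity for z >= 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For negative inputs ReLU is flat at 0, whereas leaky ReLU and ELU still produce a nonzero output and therefore a nonzero gradient.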

### Batch Normalization

Good weight initialization and a suitable activation function prevent vanishing gradients at the beginning of training, but they do not guarantee that the problem stays away during training. Batch normalization addresses this.

Each layer in a neural net has a simple goal: to model the input from the layer below it. Each layer therefore tries to adapt to its input, but for hidden layers things get a bit complicated. The statistical distribution of a hidden layer's input changes as the parameters of the previous layers are updated. This is called internal covariate shift. If the input distribution keeps changing, the hidden layers keep trying to adapt to the new distribution, which slows down convergence. It is like a goal that keeps moving for the hidden layers.

So the batch normalization (BN) algorithm tries to normalize the inputs to each hidden layer so that their distribution is fairly constant as training proceeds. This improves convergence of the neural net.

BN acts like a regularizer reducing the need for other regularization techniques like dropout. 

We will use the batch_norm() function (and not batch_normalization()) to implement BN:

In [7]:
from tensorflow.contrib.layers import batch_norm

In [11]:
is_training = tf.placeholder(tf.bool, shape = (), name = 'is_training')
bn_params = {
    'is_training': is_training,
    'decay': 0.99,
    'updates_collections': None  
}

hidden1 = fully_connected(X, n_hidden1, scope = "hidden1", 
                          normalizer_fn = batch_norm, 
                          normalizer_params=bn_params)

hidden2 = fully_connected(hidden1, n_hidden2, scope = "hidden2",
                         normalizer_fn = batch_norm, normalizer_params = bn_params)

logits = fully_connected(hidden2, n_outputs, activation_fn = None,
                        scope = "outputs", normalizer_fn=batch_norm,
                        normalizer_params = bn_params)

Explanation of code:

- is_training is either True or False and determines whether batch_norm() uses the current mini-batch's mean and standard deviation (during training) or the running averages that it keeps track of (during testing).

- bn_params is a dictionary that is passed to the batch_norm() function. The algorithm uses exponential decay to compute the running averages, which is why it requires the decay parameter.

- Given a new value v, the running average v' is updated through:

$$v' \leftarrow v' \times \text{decay} + v \times (1 - \text{decay})$$

- A good decay value is close to 1 (0.9, 0.99, etc.).
- updates_collections is set to None so that the running averages are updated automatically.
- Lastly we create the layers by calling fully_connected() function.
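
The running-average update and the training/testing switch described above can be sketched in plain Python. This is a simplified illustration, not the batch_norm() implementation: it normalizes a single feature, omits the learned scale and offset, and the epsilon value and helper names are assumptions.

```python
import math

def batch_stats(values):
    """Mean and (biased) variance of one feature over a mini-batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def update_running(running, new_value, decay=0.99):
    """Exponential-decay update: v' <- v' * decay + v * (1 - decay)."""
    return running * decay + new_value * (1.0 - decay)

def normalize(values, mean, var, eps=1e-3):
    """Zero-center and rescale values; eps avoids division by zero."""
    return [(v - mean) / math.sqrt(var + eps) for v in values]

running_mean, running_var = 0.0, 1.0
batch = [2.0, 4.0, 6.0, 8.0]

# Training: normalize with the mini-batch statistics and update the averages.
mean, var = batch_stats(batch)
running_mean = update_running(running_mean, mean)
running_var = update_running(running_var, var)
train_out = normalize(batch, mean, var)

# Testing: normalize with the running averages instead.
test_out = normalize(batch, running_mean, running_var)
print(running_mean, running_var)
```

With a decay of 0.99 a single mini-batch barely moves the running averages, which is why they only become reliable after many training steps.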

To avoid repeating the same parameters over and over again you can make an *argument scope* using the arg_scope() function:


In [18]:
#needed to put reuse = TRUE because weight for these variables is already
#defined above.
with tf.contrib.framework.arg_scope(
    [fully_connected], normalizer_fn = batch_norm,
    normalizer_params = bn_params):
    hidden1 = fully_connected(X, n_hidden1, scope = "hidden1", reuse=True)
    hidden2 = fully_connected(hidden1, n_hidden2, scope = "hidden2",reuse=True)
    logits = fully_connected(hidden2, n_outputs, scope = "outputs", 
                           activation_fn = None, reuse=True)

The execution phase is also pretty much the same, with one exception. Whenever you run an operation that depends on the batch_norm layer, you need to set the is_training placeholder to True or False.

In [22]:
#with tf.Session() as sess:
#    sess.run(init)
#    for epoch in range(n_epochs):
#        for X_batch, y_batch in zip(X_batches, y_batches):
#            sess.run(training_op, feed_dict = {is_training:True, X:X_batch, 
#                                          y: y_batch})
#            accuracy_score = accuracy.eval(
#                feed_dict = {is_training:False, X:X_test_scaled, y:y_test})
#            print(accuracy_score)

### Gradient Clipping

Gradient clipping targets the exploding gradients problem rather than the vanishing gradients problem. The idea is to clip the gradients during backpropagation so that they never exceed some threshold.

We achieve this by computing the gradients with the optimizer's compute_gradients() method, clipping them with the *clip_by_value()* function, and applying the clipped gradients with the apply_gradients() method.

In [26]:
#threshold = 1.0
#optimizer = tf.train.GradientDescentOptimizer(learning_rate)
#grads_and_vars = optimizer.compute_gradients(loss)
#capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
#             for grad, var in grads_and_vars]
#training_op = optimizer.apply_gradients(capped_gvs)
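
As a minimal illustration of what element-wise clipping does to a gradient vector, here is a pure-Python sketch. The gradient values are made up for the example, and the helper merely mimics the behavior of clip_by_value() on a flat list.

```python
def clip_by_value(values, low, high):
    """Clamp every element of values into the interval [low, high]."""
    return [min(max(v, low), high) for v in values]

threshold = 1.0
grads = [-3.2, 0.5, 2.0, -0.1]  # hypothetical gradient components
capped = clip_by_value(grads, -threshold, threshold)
print(capped)
```

Note that clipping each component independently can change the direction of the gradient vector, not only its length.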

## Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch. Instead, find an existing neural network that accomplishes a similar task and reuse its lower layers. This is called transfer learning.

In [28]:
saver = tf.train.Saver()
#os.chdir('/Users/siddharth/Desktop/TensorFlow/tmp2/')


In [36]:
#[....]construct original model

#If we wanted to restore the original model we would use the following:
#with tf.Session() as sess:
#    saver.restore(sess, "my_original_model.ckpt")

However, in general we will only reuse part of the original model. To do so, configure the Saver to restore only a subset of the variables from the original model. For example, the following code restores only hidden layers 1, 2 and 3:

In [40]:
[...]
##build new model with the same definition as before for hidden layers 1-3

#init = tf.global_variables_initializer()

#reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
#                              scope = "hidden[123]")

#map each variable's name in the checkpoint to the variable in the new model
#reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
#original_saver = tf.train.Saver(reuse_vars_dict) #saver to restore the original model

#new_saver = tf.train.Saver() #saver to save new model

#with tf.Session() as sess:
#    sess.run(init)
#    original_saver.restore(sess, "./my_original_model.ckpt")#restore layer 1 to 3
#    [...] #train the new model
#    new_saver.save(sess, "./my_new_model.ckpt")#save the whole model



- First we build the new model, making sure to copy the original model's hidden layers 1 to 3.

- We also create a node to initialize all variables.

- Then we get the list of all trainable variables and keep only the ones whose scope matches the regular expression "hidden[123]" (i.e. we get all the trainable variables in hidden layers 1 to 3).

- Next we create a dictionary mapping the name of each variable in the original model to its counterpart in the new model (generally you want to keep the exact same names).

- Then we create a Saver that will restore only these variables, and we create another Saver to save the entire new model, not just layers 1 to 3.

- We then start a session and initialize all variables in the model, then restore the variable values from original model's layers 1 to 3. 

- Finally we train the model on the new task and save it.

### Freezing the Lower Layers

It is likely that the lower layers of the DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are. It is generally a good idea to "freeze" their weights when training the new DNN: if the lower-layer weights are fixed, then the higher-layer weights are easier to train.

To freeze lower layers during training, the simplest solution is to give the optimizer the list of variables to train, excluding the variables from the lower layers:

In [46]:
#train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 
#                              scope = "hidden[34]|outputs")
#training_op = optimizer.minimize(loss, var_list = train_vars)

The first line gets a list of trainable variables in hidden layers 3 and 4 as well as the output layer. This leaves out the variables in hidden layers 1 and 2.

The second line ensures that only the variables collected in the first line (hidden layers 3 and 4 and the output layer) are optimized, thus freezing layers 1 and 2.
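
The scope argument is treated as a regular expression matched against the start of each variable name, so the exact pattern matters. The following sketch mimics that filtering with Python's re.match on a hypothetical list of variable names; the names and the helper are assumptions for illustration.

```python
import re

# Hypothetical variable names for a network with four hidden layers.
variable_names = [
    "hidden1/weights:0",
    "hidden2/weights:0",
    "hidden3/weights:0",
    "hidden4/weights:0",
    "outputs/weights:0",
]

def filter_by_scope(names, scope):
    """Keep the names whose beginning matches the scope regular expression."""
    pattern = re.compile(scope)
    return [name for name in names if pattern.match(name)]

print(filter_by_scope(variable_names, "hidden[123]"))
print(filter_by_scope(variable_names, "hidden[34]|outputs"))
print(filter_by_scope(variable_names, "hidden[34]outputs"))
```

Without the `|` alternation the last pattern matches nothing, because no name starts with "hidden3outputs" or "hidden4outputs".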

### Caching the Frozen Layers

Since the frozen layers won't change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost, as you will only need to go through the frozen layers once per training instance (instead of once per epoch). For example, you could first run the whole training set through the lower layers:


In [55]:
#hidden2_outputs = sess.run(hidden2, feed_dict = {X: X_train})

Then during training, instead of building batches of training instances, you would build batches of outputs from hidden layer 2 and feed them to the training operation:

In [50]:
import numpy as np

In [56]:
n_epochs = 100
n_batches = 500

#for each in range(n_epochs):
#    shuffle_idx = np.random.permutation(len(hidden2_outputs))
#    hidden2_batches = np.array_split(hidden2_outputs[shuffle_idx], n_batches)
#    y_batches = np.array_split(y_train[shuffle_idx], n_batches)
#    for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
#        sess.run(training_op, feed_dict={hidden2: hidden2_batch, y:y_batch})

The last line runs the training operation defined earlier and feeds it a batch of outputs from second hidden layer.
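
The shuffling and splitting step can be tried out without TensorFlow. The sketch below uses small made-up lists in place of the cached hidden layer 2 outputs and the labels, and a hypothetical make_batches() helper that splits into nearly equal parts in the same spirit as np.array_split().

```python
import random

def make_batches(features, labels, n_batches, rng):
    """Shuffle features and labels together and split them into n_batches parts."""
    idx = list(range(len(features)))
    rng.shuffle(idx)
    base, extra = divmod(len(idx), n_batches)
    start = 0
    for b in range(n_batches):
        size = base + (1 if b < extra else 0)  # the first `extra` batches get one more item
        chunk = idx[start:start + size]
        start += size
        yield [features[i] for i in chunk], [labels[i] for i in chunk]

rng = random.Random(42)
# Stand-ins for the cached outputs of hidden layer 2 and the training labels.
hidden2_outputs = [[float(i), float(2 * i)] for i in range(10)]
y_train = [i % 2 for i in range(10)]

batches = list(make_batches(hidden2_outputs, y_train, n_batches=3, rng=rng))
for hidden2_batch, y_batch in batches:
    print(len(hidden2_batch), y_batch)
```

Using the same shuffled index for both lists keeps every cached output paired with its label.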

## 2) Training Network is Slow

### Faster Optimizers

- Momentum Optimization
- Nesterov Accelerated Gradient
- AdaGrad
- RMSProp
- Adam Optimization

(Just use Adam)

The code for using any of these optimizers is simple. For example, the momentum and Adam optimizers can be created as follows:

In [60]:
optimizer = tf.train.MomentumOptimizer(learning_rate = learning_rate, 
                                       momentum = 0.9)

In [61]:
optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate)
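
To give a feel for what momentum optimization does, here is a minimal sketch on a one-dimensional quadratic. It accumulates a velocity from past gradients and moves the parameter along it. The function name, the test problem $f(x) = x^2$ and the hyperparameter values are assumptions for illustration, not the internals of MomentumOptimizer.

```python
def momentum_minimize(grad_fn, x0, learning_rate=0.1, momentum=0.9, n_steps=200):
    """Minimize a 1-D function with momentum, given its gradient function."""
    x, velocity = x0, 0.0
    for _ in range(n_steps):
        # The velocity accumulates past gradients, decayed by the momentum factor.
        velocity = momentum * velocity + grad_fn(x)
        x -= learning_rate * velocity
    return x

# f(x) = x^2 has gradient 2x and its minimum at x = 0.
x_min = momentum_minimize(lambda x: 2.0 * x, x0=5.0)
print(x_min)
```

The iterate oscillates around the minimum while the oscillations shrink, so the result ends up very close to 0.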

### Learning Rate Scheduling

The idea is to start with a high learning rate and then reduce it over time, which usually reaches a good solution faster than a constant learning rate. Common types of learning rate schedule are:

- Predetermined piecewise constant learning rate
- Performance scheduling
- Exponential scheduling
- Power scheduling

Of these, exponential scheduling is favored: we set the learning rate to a function of the iteration number $t$: $\eta(t) = \eta_{0} 10^{-t/r}$. It requires tuning $\eta_{0}$ and $r$. The learning rate drops by a factor of 10 every $r$ steps.

The implementation code is as follows:


In [62]:
initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable = False)
learning_rate = tf.train.exponential_decay(initial_learning_rate,
                                        global_step, decay_steps,
                                        decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum = 0.9)
training_op = optimizer.minimize(loss, global_step = global_step)

After setting the hyperparameter values, we create a non-trainable variable global_step (initialized to 0) to keep track of the current training iteration number. 

Then we define the exponentially decaying learning rate (with $\eta_{0} = 0.1$ and $r = 10,000$) using TensorFlow's exponential_decay() function.

Next, we create an optimizer using this decaying learning rate (momentum in the above case).

Finally, we create the training operation by calling the optimizer's minimize() method, passing it global_step so that it is incremented at every training step.
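
The schedule itself is easy to check by hand. The sketch below evaluates $\eta(t) = \eta_{0} 10^{-t/r}$ written in terms of the same three hyperparameters as the cell above; the function name is hypothetical.

```python
def exponential_schedule(step, initial_learning_rate=0.1,
                         decay_steps=10000, decay_rate=0.1):
    """Learning rate after `step` iterations: eta0 * decay_rate ** (step / decay_steps)."""
    return initial_learning_rate * decay_rate ** (step / decay_steps)

for step in (0, 5000, 10000, 20000):
    print(step, exponential_schedule(step))
```

With decay_rate = 1/10, the learning rate is divided by 10 every decay_steps iterations.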

Adaptive optimizers such as AdaGrad, RMSProp and Adam automatically reduce the learning rate during training, so it is not necessary to add an extra learning schedule. Use learning rate scheduling for the other optimization algorithms.

## 3) Avoiding Overfitting Through Regularization

- Early stopping
- L1 regularization
- L2 regularization
- Dropout
- Max-norm regularization
- Data augmentation

### Early Stopping

One way to implement this in TF is to evaluate the model on a validation set at regular intervals (e.g. every 50 steps), and save a "winner" snapshot if it outperforms previous "winner" snapshots. Count the number of steps since the last "winner" snapshot was saved, and interrupt training when this number reaches some limit (e.g. 2000 steps). Then restore the last "winner" snapshot.

Although early stopping works well in practice, you can usually get much higher performance out of your network by combining it with regularization techniques.
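
The bookkeeping behind early stopping can be sketched without any model. The function below walks through a made-up sequence of validation losses, one per evaluation, and reports which step produced the "winner" and when training would be interrupted. The function name, the loss values and the small patience value are assumptions for illustration.

```python
def early_stopping(validation_losses, check_interval=50, patience_steps=2000):
    """Return (best_step, best_loss, stop_step) for a sequence of validation losses."""
    best_loss = float("inf")
    best_step = None
    steps_since_best = 0
    step = 0
    for i, loss in enumerate(validation_losses):
        step = (i + 1) * check_interval
        if loss < best_loss:
            # New "winner": a real training loop would save a snapshot here.
            best_loss, best_step = loss, step
            steps_since_best = 0
        else:
            steps_since_best += check_interval
            if steps_since_best >= patience_steps:
                break  # no improvement for too long: interrupt training
    return best_step, best_loss, step

losses = [1.0, 0.8, 0.7, 0.75, 0.9, 0.72, 0.95]  # hypothetical validation losses
result = early_stopping(losses, check_interval=50, patience_steps=150)
print(result)
```

In this run the best loss is reached at step 150 and training stops at step 300, after which the snapshot from step 150 would be restored.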

### L1 and L2 Regularization

L1 and L2 regularization constrain a neural network's connection weights (but typically not its biases).

One way to do this using TF is to simply add the regularization term to your cost function. For example, assuming you have just one hidden layer with weights weights1 and one output layer with weights weights2, then you can apply L1 regularization like this:

In [65]:
[...]#construct the neural network
#base_loss = tf.reduce_mean(xentropy, name = "avg_xentropy")
#reg_losses = tf.reduce_sum(tf.abs(weights1)) + tf.reduce_sum(tf.abs(weights2))
#loss = tf.add(base_loss, scale*reg_losses, name = "loss")


A more convenient way to implement L1/L2 regularizer is as follows:

In [76]:
#arg_scope = tf.contrib.framework.arg_scope
#with arg_scope(
#    [fully_connected],
#    weights_regularizer = tf.contrib.layers.l1_regularizer(scale = 0.1)):
#    hidden1 = fully_connected(X, n_hidden1, scope = "hidden1")
#    hidden2 = fully_connected(hidden1, n_hidden2, scope = "hidden2")
#    logits = fully_connected(hidden2, n_outputs, activation_fn=None, 
#                            scope = "out")

This code creates a NN with two hidden layers and output layer, and it also creates nodes in the graph to compute L1 regularization loss corresponding to each layer's weights.

TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization losses to your overall loss, like this:

In [77]:
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name = "loss")
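
To see how the penalty enters the loss numerically, here is a small sketch with made-up weight matrices, a made-up base loss and an assumed scale. It simply sums the absolute values of all weights and adds the scaled result to the base loss.

```python
def l1_penalty(weights):
    """Sum of the absolute values of all entries of a weight matrix."""
    return sum(abs(w) for row in weights for w in row)

# Hypothetical weights for one hidden layer and one output layer.
weights1 = [[0.5, -1.0], [2.0, 0.0]]
weights2 = [[-0.25], [0.75]]

base_loss = 0.3  # hypothetical cross-entropy value
scale = 0.01     # assumed regularization strength

reg_losses = l1_penalty(weights1) + l1_penalty(weights2)
loss = base_loss + scale * reg_losses
print(reg_losses, loss)
```

Because the penalty grows with the magnitude of every weight, minimizing the total loss pushes weights that contribute little toward 0.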

### Dropout

The most popular regularization technique for deep neural networks is arguably dropout.

At every training step, every neuron has a probability p of being temporarily "dropped out" meaning it will be entirely ignored during this training step, but it may be active during the next step.

Dropout rate is typically set to 50%.

To implement dropout using TensorFlow we can simply apply the dropout() function to the input layer and to the output of every hidden layer. During training, this function randomly drops some neurons (setting their outputs to 0) and divides the outputs of the remaining neurons by the keep probability. After training, this function does nothing at all.

In [80]:
from tensorflow.contrib.layers import dropout

In [85]:
[...]

#is_training = tf.placeholder(tf.bool, shape = (), name = "is_training")

#keep_prob = 0.5
#X_drop = dropout(X, keep_prob, is_training = is_training)

#hidden1 = fully_connected(X_drop, n_hidden1, scope = "hidden1")
#hidden1_drop = dropout(hidden1, keep_prob, is_training = is_training)

#hidden2 = fully_connected(hidden1_drop, n_hidden2, scope = "hidden2")
#hidden2_drop = dropout(hidden2, keep_prob, is_training = is_training)

#logits = fully_connected(hidden2_drop, n_outputs, activation_fn = None,
#                        scope = "outputs")



Of course, just like you did earlier for batch normalization, you need to set is_training to True when training and False when testing.
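
The behavior described above (drop at random, rescale the survivors, do nothing at test time) fits in a few lines of plain Python. This is a sketch of the idea on a flat list of activations, not the contrib dropout() implementation; the function signature is an assumption.

```python
import random

def dropout(values, keep_prob, is_training, rng):
    """Drop each value with probability 1 - keep_prob and rescale the rest."""
    if not is_training:
        return list(values)  # at test time the layer is the identity
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

rng = random.Random(0)
activations = [1.0] * 1000

train_out = dropout(activations, keep_prob=0.5, is_training=True, rng=rng)
test_out = dropout(activations, keep_prob=0.5, is_training=False, rng=rng)

kept_fraction = sum(1 for v in train_out if v != 0.0) / len(train_out)
print(kept_fraction, sum(train_out) / len(train_out))
```

Dividing the surviving outputs by keep_prob keeps the expected total input to the next layer roughly the same during training and testing.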

### Max-Norm Regularization

Max-norm regularization constrains the weights $\mathbf{w}$ of each neuron's incoming connections such that $\lVert \mathbf{w} \rVert_2 \leq r$, where $r$ is the max-norm hyperparameter.

TensorFlow does not provide an off-the-shelf max-norm regularizer. The following code creates a node clip_weights that clips the weights variable along the second axis so that each row vector has a maximum norm of 1.0:

In [89]:
#threshold = 1.0
#clipped_weights = tf.clip_by_norm(weights, clip_norm = threshold, 
#                                  axes = 1)
#clip_weights = tf.assign(weights, clipped_weights)

You would then apply this operation after each training step, like so:

In [92]:
#with tf.Session() as sess:
#    [...]
#    for epoch in range(n_epochs):
#        [...]
#        for X_batch, y_batch in zip(X_batches, y_batches):
#            sess.run(training_op, feed_dict = {X: X_batch, y: y_batch})
#            clip_weights.eval()

In [88]:
#?zip

#The zip() function takes iterables (zero or more),
#aggregates their elements position by position,
#and returns an iterator of tuples.

You may wonder how to get access to the weights variable of each layer. For this you can simply use a variable scope like this:

In [96]:
#hidden1 = fully_connected(X, n_hidden1, scope = "hidden1")

with tf.variable_scope("hidden1", reuse = True):
    weights1 = tf.get_variable("weights")

Tensor("hidden1/weights/read:0", shape=(784, 300), dtype=float32)


To print out all the global variables we can use the following command.


In [97]:
#To print out all the global variables we can use the following command

for variable in tf.global_variables():
    print(variable.name)

h1/weights:0
h1/biases:0
fully_connected/weights:0
fully_connected/biases:0
fully_connected_1/weights:0
fully_connected_1/biases:0
hidden1/weights:0
hidden1/BatchNorm/beta:0
hidden1/BatchNorm/moving_mean:0
hidden1/BatchNorm/moving_variance:0
hidden2/weights:0
hidden2/BatchNorm/beta:0
hidden2/BatchNorm/moving_mean:0
hidden2/BatchNorm/moving_variance:0
outputs/weights:0
outputs/BatchNorm/beta:0
outputs/BatchNorm/moving_mean:0
outputs/BatchNorm/moving_variance:0
Variable:0
hidden1/weights/Momentum:0
hidden1/BatchNorm/beta/Momentum:0
hidden2/weights/Momentum:0
hidden2/BatchNorm/beta/Momentum:0
outputs/weights/Momentum:0
outputs/BatchNorm/beta/Momentum:0


Although the preceding solution should work fine, it is a bit messy. A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function:

In [98]:
def max_norm_regularizer(threshold, axes = 1, name = "max_norm",
                        collection = "max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm = threshold, axes = axes)
        clip_weights = tf.assign(weights, clipped, name = name)
        tf.add_to_collection(collection, clip_weights)
        return None #there is no regularization loss term
    return max_norm

The function returns a parameterized max_norm() function that you can use like any other regularizer:

In [105]:
#max_norm_reg = max_norm_regularizer(threshold = 1.0)
#hidden1 = fully_connected(X, n_hidden1, scope = "hidden1",
#                         weights_regularizer = max_norm_reg)

You need to fetch these clipping operations and run them after each training step:

In [109]:
#clip_all_weights = tf.get_collection("max_norm")

#with tf.Session() as sess:
#    [...]
#    for epoch in range(n_epochs):
#        [...]
#        for X_batch, y_batch in zip(X_batches, y_batches):
#            sess.run(training_op, feed_dict = {X:X_batch, y:y_batch})
#            sess.run(clip_all_weights)
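
The clipping operation itself can be illustrated on a small matrix. The sketch below rescales every row whose L2 norm exceeds the threshold and leaves the other rows untouched, which is the row-wise behavior described above; the helper name and the weight values are made up.

```python
import math

def clip_rows_by_norm(weights, threshold):
    """Rescale each row so that its L2 norm is at most threshold."""
    clipped = []
    for row in weights:
        norm = math.sqrt(sum(w * w for w in row))
        scale = threshold / norm if norm > threshold else 1.0
        clipped.append([w * scale for w in row])
    return clipped

weights = [[3.0, 4.0],   # norm 5.0: gets scaled down
           [0.3, 0.4]]   # norm 0.5: left unchanged
clipped = clip_rows_by_norm(weights, threshold=1.0)
print(clipped)
```

Rescaling preserves the direction of each weight vector and only shortens it, unlike element-wise clipping.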

### Data Augmentation 

Data augmentation generates new training instances from existing ones, artificially enlarging the training set. This reduces overfitting, which makes it a regularization technique.

TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation and hue, making it easy to implement data augmentation.
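
Two of these transformations, flipping and shifting, are simple enough to sketch on a tiny image stored as a nested list. The helpers below are hypothetical stand-ins written for illustration and are not TensorFlow's image operations.

```python
def flip_left_right(image):
    """Mirror an image (a list of pixel rows) horizontally."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift an image to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[:width - pixels] for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]

# Each transformed copy keeps the label of the original image.
augmented = [image, flip_left_right(image), shift_right(image, 1)]
for img in augmented:
    print(img)
```

Each original image now yields three training instances, which is how augmentation enlarges the training set without new labeled data.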