# UBCS3 Machine Learning:<br /> Hands on Machine Learning with Scikit-Learn and Tensorflow

## Chapter 11: Training Deep Neural Networks (2 March 2018)

See [Hands on Machine Learning with Scikit-Learn and Tensorflow](https://github.com/ageron/handson-ml) by Aurélien Géron.

### Goals

* Avoid the vanishing/exploding gradients problem.
* Expedite training for large networks.
* Avoid overfitting large parameter sets to training data.

### Important Notes

* Almost all of this material is compiled directly from Géron's [book](http://shop.oreilly.com/product/0636920052289.do) and/or [IPython Notebook](https://github.com/ageron/handson-ml/blob/master/11_deep_learning.ipynb), with the aim of better accommodating time constraints (at the expense of greater leaps between techniques).

In [1]:
import numpy as np
import os

# to make this notebook's output stable across runs
def reset_graph(seed=42):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

In [2]:
import tensorflow as tf

reset_graph()


### Vanishing Gradients

<img src="./img/sigmoid.png" width="400px">

#### Xavier/Glorot and He Weight initialization

For the signal to flow properly, Glorot and Bengio argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs, and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction. Both conditions cannot hold at once unless $n_{\text{in}} = n_{\text{out}}$, so a compromise that has proven to work in practice is to initialize the connection weights randomly as follows (the second argument of $\mathcal N$ is the variance):
$$
w \sim \mathcal N \big( 0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\big) %
\qquad \text{or} \qquad %
w \sim \mathrm{Unif}\big(\big[ - \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},  \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\big] \big)
$$

| Activation Function | Uniform distribution $[-r,r]$ | Normal distribution                |
| ------------------- | ----------------------------- | ---------------------------------- |
| Logistic            | $r = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}$ | $\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$ |
| Hyperbolic Tangent  | $r = 4 \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}$ | $\sigma = 4\sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$ | 
| ReLU (and variants) | $r = \sqrt 2 \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}$ | $\sigma = \sqrt{2}\sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$ |

Ref: [(Glorot & Bengio, 2010)](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
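
As a rough sanity check of the table above, here is a minimal pure-Python sketch that computes $r$ and $\sigma$ from the fan-in and fan-out. The function names and the `gain` parameter (the multiplier $1$, $4$, or $\sqrt 2$ from the table) are assumptions made for illustration, not part of any library.

```python
import math
import random

def glorot_limits(n_in, n_out, gain=1.0):
    """Return (r, sigma): the uniform bound and the normal std. dev. from the table.

    `gain` is a hypothetical name for the per-activation multiplier
    (1 for logistic, 4 for tanh, sqrt(2) for ReLU in the table above).
    """
    r = gain * math.sqrt(6.0 / (n_in + n_out))
    sigma = gain * math.sqrt(2.0 / (n_in + n_out))
    return r, sigma

def sample_uniform_weights(n_in, n_out, gain=1.0, seed=42):
    """Draw an n_in x n_out weight matrix from Unif([-r, r])."""
    rng = random.Random(seed)
    r, _ = glorot_limits(n_in, n_out, gain)
    return [[rng.uniform(-r, r) for _ in range(n_out)] for _ in range(n_in)]

# MNIST-sized first hidden layer: 784 inputs, 300 outputs
r, sigma = glorot_limits(28 * 28, 300)
W_demo = sample_uniform_weights(28 * 28, 300)
```

Note that a uniform distribution on $[-r, r]$ has variance $r^2/3$, which equals $\sigma^2$ here, so the two columns of the table describe the same variance.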

In [3]:
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

In [4]:
he_init = tf.contrib.layers.variance_scaling_initializer()  # He initialization, for ReLU
xav_init = tf.contrib.layers.xavier_initializer()  # Xavier/Glorot initialization, for logistic (not used below)
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")

### Dying ReLUs

During training, some ReLU neurons "die": they output only $0$. Sometimes more than half of a network's neurons die during training, especially with a large learning rate. A dead neuron is unlikely to "come back to life" because the gradient of the ReLU is $0$ wherever its input is negative, so Gradient Descent no longer updates its weights. Some possible solutions are below.

#### Solution: Pick activation function that won't die

* **Leaky ReLU:** Defined by $\text{LReLU}_\alpha(z) := \max\{ \alpha z , z \}$ for a small slope $0 < \alpha < 1$ (e.g. $\alpha \approx 0.15$; the code below uses $\alpha = 0.01$).


<img src="./img/lrelu.png" width="400px">
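
To make the difference from a plain ReLU concrete, the following is a small scalar sketch of both activations and their derivatives. The function names are made up for this illustration, and the derivative at exactly $z = 0$ is set by convention.

```python
def relu_scalar(z):
    """Plain ReLU: max(0, z)."""
    return max(0.0, z)

def relu_grad_scalar(z):
    """Derivative of ReLU: zero for negative inputs, so a dead neuron gets no update."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_scalar(z, alpha=0.01):
    """Leaky ReLU: max(alpha * z, z), valid for 0 < alpha < 1."""
    return max(alpha * z, z)

def leaky_relu_grad_scalar(z, alpha=0.01):
    """Derivative of leaky ReLU: alpha (not zero) for negative inputs."""
    return 1.0 if z > 0 else alpha

for z in (-2.0, 0.5):
    print(z, relu_scalar(z), relu_grad_scalar(z),
          leaky_relu_scalar(z), leaky_relu_grad_scalar(z))
```

Because the slope for $z < 0$ is $\alpha$ rather than $0$, a neuron with negative pre-activations still receives a small gradient and can recover.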

##### Implementing Leaky ReLU in `tensorflow`

Older versions of `tensorflow` have no built-in leaky ReLU (`tf.nn.leaky_relu` was only added in version 1.4), so we define it ourselves.

In [5]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")


Let's train a neural network on MNIST using the Leaky ReLU. First let's create the graph:

In [6]:
reset_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [7]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

In [8]:
with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=leaky_relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [9]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

In [10]:
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

In [11]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

In [12]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

Let's load the data:

In [13]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [14]:
n_epochs = 40
batch_size = 50

In [15]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if epoch % 5 == 0:
            acc_batch = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
            acc_val = accuracy.eval(feed_dict={X: mnist.validation.images, y: mnist.validation.labels})
            print(epoch, "Batch accuracy:", acc_batch, "Validation accuracy:", acc_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Batch accuracy: 0.86 Validation accuracy: 0.9044
5 Batch accuracy: 0.94 Validation accuracy: 0.951
10 Batch accuracy: 0.96 Validation accuracy: 0.9666
15 Batch accuracy: 1.0 Validation accuracy: 0.9722
20 Batch accuracy: 1.0 Validation accuracy: 0.9748
25 Batch accuracy: 1.0 Validation accuracy: 0.9768
30 Batch accuracy: 0.98 Validation accuracy: 0.9778
35 Batch accuracy: 0.96 Validation accuracy: 0.9796


* **Exponential Linear Unit:** Defined by $\text{ELU}_\alpha(z) := \begin{cases}\alpha (e^z - 1) & z < 0\\ z & z \geq 0\end{cases}$

<img src="./img/elu.png" width="400px">
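
The piecewise definition above can be sketched for a single scalar as follows. This is only an illustration with made-up function names; it uses `math.expm1`, which computes $e^z - 1$ accurately for small $z$.

```python
import math

def elu_scalar(z, alpha=1.0):
    """ELU: z for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * math.expm1(z)

def elu_grad_scalar(z, alpha=1.0):
    """Derivative of ELU: 1 for z >= 0, alpha * exp(z) for z < 0 (never exactly dead for moderate z)."""
    return 1.0 if z >= 0 else alpha * math.exp(z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, elu_scalar(z), elu_grad_scalar(z))
```

For very negative $z$ the output saturates at $-\alpha$, and the gradient for $z < 0$ is positive, unlike the plain ReLU.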

ELUs are easy to implement in `tensorflow` because they're part of the default package.

In [16]:
reset_graph()

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

In [17]:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

#### Batch Normalization

The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). BN lets the model learn the optimal scale and mean of the inputs for each layer.

It estimates the inputs’ mean and standard deviation by evaluating the mean and standard deviation of the inputs over the current mini-batch:
\begin{align*}
\text{empirical mean over $B$} = \mu_B &:= \frac{1}{m_B} \sum_{i=1}^{m_B} x^{(i)}\\
\text{empirical variance over $B$} = \sigma^2_B &:= \frac{1}{m_B} \sum_{i=1}^{m_B} (x^{(i)} - \mu_B)^2\\
\text{zero-centered and normalized input} = \hat x^{(i)} &:= \frac{x^{(i)} - \mu_B}{\sqrt{\sigma^2_B + \varepsilon}}\\
\text{output of the BN operation} = z^{(i)} &:= \gamma \hat x^{(i)} + \beta
\end{align*}
Above, $m_B$ is the number of instances in the mini-batch $B$, $\gamma$ is a scaling parameter for the layer, $\beta$ is a shifting/offset parameter for the layer, and $\varepsilon$ is a tiny positive number (e.g. $10^{-5}$) that avoids division by zero.
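
The four equations above can be traced by hand on a tiny mini-batch. The sketch below handles a single feature (a list of scalars) in pure Python; the function name and the example values are assumptions chosen for illustration.

```python
import math

def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply the BN equations to one feature over a mini-batch `xs`.

    Returns (z, mu, var): the BN outputs, the batch mean, and the batch variance.
    """
    m = len(xs)
    mu = sum(xs) / m                                  # empirical mean over the batch
    var = sum((x - mu) ** 2 for x in xs) / m          # empirical (biased) variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # zero-center and normalize
    z = [gamma * xh + beta for xh in x_hat]           # scale and shift
    return z, mu, var

z_demo, mu_demo, var_demo = batch_norm_1d([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
print(mu_demo, var_demo, z_demo)
```

After normalization the outputs have mean $\beta$ and standard deviation close to $\gamma$ (slightly less, because of $\varepsilon$).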

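At test time there is no mini-batch to compute statistics over, so BN instead uses running averages of the mean and variance, updated with exponential decay over consecutive mini-batches during training:
$$
\hat \nu \leftarrow \hat \nu \cdot \text{momentum} + \nu \cdot (1 - \text{momentum})
$$
where $\nu$ is the statistic computed on the current mini-batch ($\mu_B$ or $\sigma^2_B$), $\hat \nu$ is its running average, and $\text{momentum} \approx 0.9$.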
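
The update rule is a plain exponential moving average. A minimal sketch, with a hypothetical helper name and made-up batch statistics:

```python
def update_running_average(running, batch_value, momentum=0.9):
    """One exponential-decay update of a running statistic."""
    return running * momentum + batch_value * (1.0 - momentum)

# Start from 0 and observe a batch mean of 1.0 three times in a row.
running_mean = 0.0
for batch_mean in [1.0, 1.0, 1.0]:
    running_mean = update_running_average(running_mean, batch_mean)

print(running_mean)  # 1 - 0.9**3, i.e. about 0.271
```

With a momentum of $0.9$ the running average moves only 10% of the way toward each new batch statistic, which is why it takes many mini-batches to settle.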

**Note:** Training is rather slow at first while Gradient Descent searches for good scales and offsets for each layer, but it accelerates once it has found reasonably good values.

Ref: [(Ioffe & Szegedy, 2015)](https://arxiv.org/pdf/1502.03167v3.pdf)

##### Implementing Batch Normalization

First reset the workspace and define some parameters for the network we'll build.

In [18]:
reset_graph()

import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

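Define a `placeholder_with_default` to tell `tensorflow` whether or not we're in `'training'` mode. During training, batch normalization normalizes with the current mini-batch statistics and updates the moving averages; otherwise (the default here, `False`) it normalizes with the moving averages.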

In [19]:
training = tf.placeholder_with_default(False, shape=(), name='training')

Then construct a network! `tf.layers` has a `batch_normalization` function that takes a `momentum` parameter, as described above.

In [20]:
from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

**Example:** Build a network for MNIST using ELU activation and BN at each layer

In [21]:
reset_graph()

batch_norm_momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.contrib.layers.variance_scaling_initializer()

    my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=batch_norm_momentum)

    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), 
                              name='accuracy')
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [22]:
n_epochs = 20
batch_size = 200

In [23]:
tf.get_collection(tf.GraphKeys.UPDATE_OPS)

[<tf.Tensor 'dnn/batch_normalization/AssignMovingAvg:0' shape=(300,) dtype=float32_ref>,
 <tf.Tensor 'dnn/batch_normalization/AssignMovingAvg_1:0' shape=(300,) dtype=float32_ref>,
 <tf.Tensor 'dnn/batch_normalization_2/AssignMovingAvg:0' shape=(100,) dtype=float32_ref>,
 <tf.Tensor 'dnn/batch_normalization_2/AssignMovingAvg_1:0' shape=(100,) dtype=float32_ref>,
 <tf.Tensor 'dnn/batch_normalization_3/AssignMovingAvg:0' shape=(10,) dtype=float32_ref>,
 <tf.Tensor 'dnn/batch_normalization_3/AssignMovingAvg_1:0' shape=(10,) dtype=float32_ref>]

In [24]:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.8665
1 Test accuracy: 0.8963
2 Test accuracy: 0.9128
3 Test accuracy: 0.9229
4 Test accuracy: 0.9277
5 Test accuracy: 0.9344
6 Test accuracy: 0.9382
7 Test accuracy: 0.9422
8 Test accuracy: 0.9445
9 Test accuracy: 0.947
10 Test accuracy: 0.9504
11 Test accuracy: 0.9536
12 Test accuracy: 0.9571
13 Test accuracy: 0.9568
14 Test accuracy: 0.96
15 Test accuracy: 0.9599
16 Test accuracy: 0.9622
17 Test accuracy: 0.9619
18 Test accuracy: 0.9636
19 Test accuracy: 0.9638


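This is lower accuracy than the Leaky ReLU network reached above, although the two runs are not directly comparable: that one reported validation accuracy after 40 epochs with a batch size of 50, while this one reports test accuracy after 20 epochs with a batch size of 200. Géron only claims that ELU and BN perform well for deep nets, of which our toy network is not an example, which may be why we cannot expect great performance from them here.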

**One more thing:** notice that the list of trainable variables is shorter than the list of all global variables. This is because the moving averages are non-trainable variables. If you want to reuse a pretrained neural network (see below), you must not forget these non-trainable variables.

In [25]:
[v.name for v in tf.trainable_variables()]

['hidden1/kernel:0',
 'hidden1/bias:0',
 'batch_normalization/gamma:0',
 'batch_normalization/beta:0',
 'hidden2/kernel:0',
 'hidden2/bias:0',
 'batch_normalization_1/gamma:0',
 'batch_normalization_1/beta:0',
 'outputs/kernel:0',
 'outputs/bias:0',
 'batch_normalization_2/gamma:0',
 'batch_normalization_2/beta:0']

In [26]:
[v.name for v in tf.global_variables()]

['hidden1/kernel:0',
 'hidden1/bias:0',
 'batch_normalization/gamma:0',
 'batch_normalization/beta:0',
 'batch_normalization/moving_mean:0',
 'batch_normalization/moving_variance:0',
 'hidden2/kernel:0',
 'hidden2/bias:0',
 'batch_normalization_1/gamma:0',
 'batch_normalization_1/beta:0',
 'batch_normalization_1/moving_mean:0',
 'batch_normalization_1/moving_variance:0',
 'outputs/kernel:0',
 'outputs/bias:0',
 'batch_normalization_2/gamma:0',
 'batch_normalization_2/beta:0',
 'batch_normalization_2/moving_mean:0',
 'batch_normalization_2/moving_variance:0']

### Reusing pretrained layers

<img src="./img/pretrained-layers.png" width="400px">

#### Reusing a TensorFlow Model

In [27]:
reset_graph()

Note that by default, a `Saver` saves the structure of the graph into a `.meta` file, so that's the file you should load:

In [28]:
saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

Next you need to get a handle on all the operations you will need for training. If you don't know the graph's structure, you can list all the operations:

In [29]:
for op in tf.get_default_graph().get_operations():
    print(op.name)

X
y
hidden1/kernel/Initializer/random_uniform/shape
hidden1/kernel/Initializer/random_uniform/min
hidden1/kernel/Initializer/random_uniform/max
hidden1/kernel/Initializer/random_uniform/RandomUniform
hidden1/kernel/Initializer/random_uniform/sub
hidden1/kernel/Initializer/random_uniform/mul
hidden1/kernel/Initializer/random_uniform
hidden1/kernel
hidden1/kernel/Assign
hidden1/kernel/read
dnn/hidden1/kernel/Regularizer/clip_by_norm/mul
dnn/hidden1/kernel/Regularizer/clip_by_norm/Sum/reduction_indices
dnn/hidden1/kernel/Regularizer/clip_by_norm/Sum
dnn/hidden1/kernel/Regularizer/clip_by_norm/Rsqrt
dnn/hidden1/kernel/Regularizer/clip_by_norm/mul_1/y
dnn/hidden1/kernel/Regularizer/clip_by_norm/mul_1
dnn/hidden1/kernel/Regularizer/clip_by_norm/Const
dnn/hidden1/kernel/Regularizer/clip_by_norm/truediv/y
dnn/hidden1/kernel/Regularizer/clip_by_norm/truediv
dnn/hidden1/kernel/Regularizer/clip_by_norm/Minimum
dnn/hidden1/kernel/Regularizer/clip_by_norm/mul_2
dnn/hidden1/kernel/Regularizer/clip_b

Oops, that's a lot of operations! It's much easier to use TensorBoard to visualize the graph. The following hack will allow you to visualize the graph within Jupyter (if it does not work with your browser, you will need to use a `FileWriter` to save the graph and then visualize it in TensorBoard):

In [30]:
from IPython.display import clear_output, Image, display, HTML

def strip_consts(graph_def, max_const_size=32):
    """Strip large constant values from graph_def."""
    strip_def = tf.GraphDef()
    for n0 in graph_def.node:
        n = strip_def.node.add() 
        n.MergeFrom(n0)
        if n.op == 'Const':
            tensor = n.attr['value'].tensor
            size = len(tensor.tensor_content)
            if size > max_const_size:
                tensor.tensor_content = b"<stripped %d bytes>"%size
    return strip_def

def show_graph(graph_def, max_const_size=32):
    """Visualize TensorFlow graph."""
    if hasattr(graph_def, 'as_graph_def'):
        graph_def = graph_def.as_graph_def()
    strip_def = strip_consts(graph_def, max_const_size=max_const_size)
    code = """
        <script>
          function load() {{
            document.getElementById("{id}").pbtxt = {data};
          }}
        </script>
        <link rel="import" href="https://tensorboard.appspot.com/tf-graph-basic.build.html" onload=load()>
        <div style="height:600px">
          <tf-graph-basic id="{id}"></tf-graph-basic>
        </div>
    """.format(data=repr(str(strip_def)), id='graph'+str(np.random.rand()))

    iframe = """
        <iframe seamless style="width:1200px;height:620px;border:0" srcdoc="{}"></iframe>
    """.format(code.replace('"', '&quot;'))
    display(HTML(iframe))

In [31]:
show_graph(tf.get_default_graph())

Once you know which operations you need, you can get a handle on them using the graph's `get_operation_by_name()` or `get_tensor_by_name()` methods:

In [32]:
X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")

accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")

KeyError: "The name 'eval/accuracy:0' refers to a Tensor which does not exist. The operation, 'eval/accuracy', does not exist in the graph."

In [34]:
training_op = tf.get_default_graph().get_operation_by_name("train/GradientDescent")

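KeyError: "The name 'train/GradientDescent' refers to an Operation not in the graph."

**Note:** These two lookups appear to fail in this run because later cells in this notebook overwrite `./my_model_final.ckpt`: the operation list above contains the `clip_by_norm` ops of the max-norm model, which defines neither an `eval/accuracy` nor a `train/GradientDescent` operation. The names are valid for the checkpoint saved by the batch normalization example, so restore that one (or save each model under its own file name) before running these cells.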

In [35]:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")

    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_new_model_final.ckpt")    

INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt
0 Test accuracy: 0.9636
1 Test accuracy: 0.965
2 Test accuracy: 0.9661
3 Test accuracy: 0.9664
4 Test accuracy: 0.9667
5 Test accuracy: 0.9679
6 Test accuracy: 0.9686
7 Test accuracy: 0.9691
8 Test accuracy: 0.9691
9 Test accuracy: 0.9698
10 Test accuracy: 0.9709
11 Test accuracy: 0.9707
12 Test accuracy: 0.9706
13 Test accuracy: 0.9714
14 Test accuracy: 0.9712
15 Test accuracy: 0.9707
16 Test accuracy: 0.9713
17 Test accuracy: 0.9715
18 Test accuracy: 0.9722
19 Test accuracy: 0.9716


Finally, it's possible to strip off the upper layers and add new ones while holding onto the lower layers, but I doubt we'll get there in only an hour.

### Avoiding Overfitting Through Regularization

#### $\ell_1$ and $\ell_2$ regularization

Let's implement $\ell_1$ regularization manually. First, we create the model, as usual (with just one hidden layer this time, for simplicity):

In [36]:
reset_graph()

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    logits = tf.layers.dense(hidden1, n_outputs, name="outputs")

Next, we get a handle on the layer weights, and we compute the total loss, which is equal to the sum of the usual cross entropy loss and the $\ell_1$ loss (i.e., the sum of the absolute values of the weights, multiplied by the hyperparameter `scale`):

In [37]:
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")

scale = 0.001 # l1 regularization hyperparameter

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")
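
Stripped of the graph machinery, the loss built above amounts to the following arithmetic. This pure-Python sketch uses made-up weight matrices and a made-up base loss purely for illustration; the function names are not from any library.

```python
def l1_penalty(weight_matrices):
    """Sum of absolute values over every entry of every weight matrix."""
    return sum(abs(w) for W in weight_matrices for row in W for w in row)

def total_loss(base_loss, weight_matrices, scale=0.001):
    """Base loss plus the l1 penalty scaled by the regularization hyperparameter."""
    return base_loss + scale * l1_penalty(weight_matrices)

# Hypothetical tiny weights standing in for W1 and W2
W1_demo = [[1.0, -2.0], [0.5, -0.5]]
W2_demo = [[0.25], [-0.75]]
loss_demo = total_loss(0.3, [W1_demo, W2_demo])
print(loss_demo)  # 0.3 + 0.001 * 5.0
```

Only the kernels are penalized here, matching the cell above, which leaves the biases out of `reg_losses`.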

In [38]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name="accuracy")

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [39]:
n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.8347
1 Test accuracy: 0.8713
2 Test accuracy: 0.8832
3 Test accuracy: 0.8907
4 Test accuracy: 0.8959
5 Test accuracy: 0.8978
6 Test accuracy: 0.9015
7 Test accuracy: 0.9035
8 Test accuracy: 0.9033
9 Test accuracy: 0.9064
10 Test accuracy: 0.906
11 Test accuracy: 0.907
12 Test accuracy: 0.9076
13 Test accuracy: 0.9068
14 Test accuracy: 0.9065
15 Test accuracy: 0.9076
16 Test accuracy: 0.9068
17 Test accuracy: 0.9063
18 Test accuracy: 0.9064
19 Test accuracy: 0.9066


This approach becomes tedious when there are many layers. Instead, use the `l1_regularizer()`, *etc.* functions. First, define the regularization hyperparameter and construct your network using the `kernel_regularizer` argument.
```python
from functools import partial

# assumes a fresh graph in which X, y, n_hidden1, n_hidden2 and n_outputs are defined as above
scale = 0.001 # l1 regularization hyperparameter

my_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")
```
Next, define your base loss function on the new `logits`.
```python
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
```
Then, add the regularization losses to your base loss to obtain your loss function.
```python
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")
```

#### Max norm

Max-norm regularization constrains the weights $w$ of the incoming connections of each neuron such that $\|w\|_2 \leq r$ for some hyperparameter $r$. This is typically implemented by computing $\|w\|_2$ after each training step and, if it exceeds $r$, clipping $w$ according to $w \leftarrow w \cdot r / \|w\|_2$. To implement max-norm "regularization" we'll wind up getting a handle on each of the weights and creating an `Operation` that will reassign a *clipped* set of weights to those tensors. We'll have the `Session` object `run` that `Operation` on each iteration after the `training_op` call.

**Pet peeve:** I would call this a *constraint* and not merely regularization since $\|w\|_2$ **cannot** be greater than $r$.
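
The clipping rule itself is short. Here is a pure-Python sketch for a single weight vector; the function name is made up, and it is only meant to mirror what a norm-clipping operation does to one vector.

```python
import math

def clip_by_l2_norm(w, r):
    """Return w unchanged if ||w||_2 <= r, otherwise rescale it to have norm r."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm <= r:
        return list(w)
    return [wi * r / norm for wi in w]

clipped_demo = clip_by_l2_norm([3.0, 4.0], r=1.0)    # norm 5, rescaled to norm 1
unchanged_demo = clip_by_l2_norm([0.3, 0.4], r=1.0)  # norm 0.5, left alone
print(clipped_demo, unchanged_demo)
```

Clipping preserves the direction of $w$ and only shrinks its length, which is why it acts as a hard constraint rather than a penalty term in the loss.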

First, build the neural network: a plain and simple neural net for MNIST with just 2 hidden layers:

In [40]:
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

Next, let's get a handle on the first hidden layer's weight and create an operation that will compute the clipped weights using the `clip_by_norm()` function. Then we create an assignment operation to assign the clipped weights to the weights variable:

In [41]:
threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

We can do this as well for the second hidden layer:

In [42]:
weights2 = tf.get_default_graph().get_tensor_by_name("hidden2/kernel:0")
clipped_weights2 = tf.clip_by_norm(weights2, clip_norm=threshold, axes=1)
clip_weights2 = tf.assign(weights2, clipped_weights2)

Let's add an initializer and a saver:

In [43]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

And now we can train the model. It's pretty much as usual, except that right after running the `training_op`, we run the `clip_weights` and `clip_weights2` operations:

In [44]:
n_epochs = 20
batch_size = 50

In [45]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            # re-clip the weights of both hidden layers after every training step
            clip_weights.eval()
            clip_weights2.eval()
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        print(epoch, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.9525
1 Test accuracy: 0.9688
2 Test accuracy: 0.9732
3 Test accuracy: 0.9755
4 Test accuracy: 0.975
5 Test accuracy: 0.9769
6 Test accuracy: 0.9795
7 Test accuracy: 0.9778
8 Test accuracy: 0.979
9 Test accuracy: 0.9792
10 Test accuracy: 0.9811
11 Test accuracy: 0.9802
12 Test accuracy: 0.9826
13 Test accuracy: 0.9816
14 Test accuracy: 0.9824
15 Test accuracy: 0.9817
16 Test accuracy: 0.9823
17 Test accuracy: 0.9827
18 Test accuracy: 0.9829
19 Test accuracy: 0.9827


The implementation above is straightforward and it works fine, but it is a bit messy. A better approach is to define a `max_norm_regularizer()` function:

In [46]:
def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None # there is no regularization loss term
    return max_norm

Then you can call this function to get a max norm regularizer (with the threshold you want). When you create a hidden layer, you can pass this regularizer to the `kernel_regularizer` argument:

In [47]:
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

In [48]:
max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

In [49]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Training is as usual, except you must run the weights clipping operations after each training operation:

In [50]:
n_epochs = 20
batch_size = 50

In [51]:
clip_all_weights = tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        print(epoch, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Test accuracy: 0.9534
1 Test accuracy: 0.9657
2 Test accuracy: 0.9729
3 Test accuracy: 0.9718
4 Test accuracy: 0.9752
5 Test accuracy: 0.9749
6 Test accuracy: 0.9777
7 Test accuracy: 0.9796
8 Test accuracy: 0.9791
9 Test accuracy: 0.9801
10 Test accuracy: 0.9792
11 Test accuracy: 0.9801
12 Test accuracy: 0.9811
13 Test accuracy: 0.9815
14 Test accuracy: 0.9802
15 Test accuracy: 0.982
16 Test accuracy: 0.9805
17 Test accuracy: 0.9813
18 Test accuracy: 0.9813
19 Test accuracy: 0.9812


### Faster Optimizers

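* Momentum Optimization (with momentum vector $m$, momentum hyperparameter $\beta$, learning rate $\eta$, and cost function $J$)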

\begin{align*}
m &\leftarrow \beta m - \eta \nabla_\theta J(\theta)
\\
\theta &\leftarrow \theta + m
\end{align*}

* Nesterov Accelerated Gradient

\begin{align*}
m &\leftarrow \beta m - \eta \nabla_\theta J(\theta + \beta m)
\\
\theta & \leftarrow \theta + m
\end{align*}

* Adagrad 
* RMSProp
* Adam
* Learning Rate Scheduling / Annealing

For a complete list, see [the documentation](https://www.tensorflow.org/api_guides/python/train).
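
To see the two update rules above in action, here is a small pure-Python sketch that minimizes the toy cost $J(\theta) = \theta^2$. The cost function, the hyperparameter values, and the function names are all assumptions chosen for illustration.

```python
def grad_J(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

def momentum_step(theta, m, eta=0.1, beta=0.9):
    """Momentum optimization: the gradient is measured at theta."""
    m = beta * m - eta * grad_J(theta)
    return theta + m, m

def nesterov_step(theta, m, eta=0.1, beta=0.9):
    """Nesterov Accelerated Gradient: the gradient is measured at theta + beta * m."""
    m = beta * m - eta * grad_J(theta + beta * m)
    return theta + m, m

theta_mom, m_mom = 1.0, 0.0
theta_nag, m_nag = 1.0, 0.0
for _ in range(200):
    theta_mom, m_mom = momentum_step(theta_mom, m_mom)
    theta_nag, m_nag = nesterov_step(theta_nag, m_nag)

print(abs(theta_mom), abs(theta_nag))
```

On this particular problem both iterates approach the minimum at $\theta = 0$, and the Nesterov variant oscillates less because it looks ahead in the direction of the momentum before measuring the gradient.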

