## Avoiding Overfitting Through Regularization
### $\ell_1$ and $\ell_2$ Regularization
We first implement $\ell_1$ regularization manually.

In [1]:
from utils import *
%load_ext autoreload
%autoreload 2

In [3]:
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300 # Just one hidden layer
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name='hidden1')
    logits = tf.layers.dense(hidden1, n_outputs, name='outputs')

Next, we get a handle on the layer weights and compute the total loss, which is the sum of the usual cross-entropy loss and the $\ell_1$ loss (i.e., the sum of the absolute values of the weights, multiplied by the `scale` hyperparameter).

In [4]:
W1 = tf.get_default_graph().get_tensor_by_name('hidden1/kernel:0')
W2 = tf.get_default_graph().get_tensor_by_name('outputs/kernel:0')

scale = 0.001 # l1 regularization hyperparameter

with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    base_loss = tf.reduce_mean(xentropy, name='avg_xentropy')
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name='loss')
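To make the arithmetic of this loss concrete, here is a minimal pure-Python sketch on made-up numbers. The toy weights and the cross-entropy value are hypothetical, chosen only so that the sum is easy to check by hand, and `l1_penalty` is an illustrative helper, not a TensorFlow function.

```python
def l1_penalty(weight_matrices):
    """Sum of the absolute values of every weight in every matrix."""
    return sum(abs(w) for matrix in weight_matrices for row in matrix for w in row)

# Hypothetical toy values (not taken from the MNIST model above).
W1_toy = [[0.5, -1.0], [2.0, 0.0]]   # sum of |w| = 3.5
W2_toy = [[-0.25], [0.25]]           # sum of |w| = 0.5
base_loss_toy = 0.3                  # stand-in for the mean cross entropy
scale_toy = 0.001                    # plays the same role as `scale` above

reg_toy = l1_penalty([W1_toy, W2_toy])
total_loss_toy = base_loss_toy + scale_toy * reg_toy
print(reg_toy, total_loss_toy)
```

With `scale = 0.001`, a total absolute weight of 4.0 adds only 0.004 to the loss, which gives a feel for how the hyperparameter trades off the two terms.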

The rest is as usual:

In [5]:
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

learning_rate = 0.01

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [9]:
X_train, X_valid, y_train, y_valid = get_mnist()

In [None]:
n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, 'Validation accuracy:', accuracy_val)
    save_path = saver.save(sess, './my_model_final.ckpt')

Alternatively, we can pass a regularization function to `tf.layers.dense()`. The layer uses it to create the operations that compute the regularization loss and adds these operations to the collection of regularization losses. The beginning is the same as above, except that we add a second hidden layer:

In [11]:
reset_graph()

n_inputs = 28*28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

Next, we use `functools.partial()` to avoid repeating the same arguments over and over again. Note that we set the `kernel_regularizer` argument, and that the output layer overrides `activation` with `None`:

In [12]:
scale = 0.001

In [15]:
from functools import partial

In [16]:
my_dense_layer = partial(tf.layers.dense, activation=tf.nn.relu, kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope('dnn'):
    hidden1 = my_dense_layer(X, n_hidden1, name='hidden1')
    hidden2 = my_dense_layer(hidden1, n_hidden2, name='hidden2')
    logits = my_dense_layer(hidden2, n_outputs, activation=None, name='outputs')
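The output layer above passes `activation=None` even though `partial()` already fixed `activation`. The sketch below illustrates why that works: keyword arguments given at call time take precedence over those stored in the partial object. The `dense` function here is a hypothetical stand-in that only records its arguments; it is not the TensorFlow layer.

```python
from functools import partial

def dense(inputs, units, activation=None, kernel_regularizer=None, name=None):
    """Hypothetical stand-in for a layer constructor; it only records its arguments."""
    return {"units": units, "activation": activation,
            "kernel_regularizer": kernel_regularizer, "name": name}

# Fix the arguments shared by every layer.
my_layer = partial(dense, activation="relu", kernel_regularizer="l1")

hidden = my_layer("X", 300, name="hidden1")
# A keyword passed at call time overrides the one stored in the partial.
output = my_layer("X", 10, activation=None, name="outputs")

print(hidden["activation"], output["activation"])  # relu None
print(output["kernel_regularizer"])                # l1
```

Note that the regularizer is still applied to the output layer, since only `activation` was overridden.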

Now we must add the regularization losses to the base loss:

In [17]:
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    base_loss = tf.reduce_mean(xentropy, name='avg_xentropy')
    reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    loss = tf.add_n([base_loss] + reg_losses, name='loss')

In [20]:
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32), name='accuracy')

learning_rate = 0.01

with tf.name_scope('train'):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [21]:
n_epochs = 20
batch_size = 200

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

0 Validation accuracy: 0.8274
1 Validation accuracy: 0.8766
2 Validation accuracy: 0.8952
3 Validation accuracy: 0.9016
4 Validation accuracy: 0.9084
5 Validation accuracy: 0.9096
6 Validation accuracy: 0.9124
7 Validation accuracy: 0.9154
8 Validation accuracy: 0.9178
9 Validation accuracy: 0.919
10 Validation accuracy: 0.92
11 Validation accuracy: 0.9224
12 Validation accuracy: 0.9212
13 Validation accuracy: 0.9228
14 Validation accuracy: 0.9224
15 Validation accuracy: 0.9216
16 Validation accuracy: 0.9218
17 Validation accuracy: 0.9228
18 Validation accuracy: 0.9216
19 Validation accuracy: 0.9214


## Dropout
Dropout is a fairly simple algorithm: at every training step, every neuron (including the input neurons, but excluding the output neurons) has a probability $p$ of being temporarily "dropped out", meaning it is entirely ignored during that training step but may be active during the next one. The hyperparameter $p$ is called the *dropout rate*, and it is typically set to 50%. After training, neurons are no longer dropped.

There is one important technical detail. Suppose $p = 50\%$: during testing, a neuron is then connected to twice as many input neurons as it was (on average) during training. To compensate, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron receives a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. More generally, we need to multiply each input connection weight by the *keep probability* $(1-p)$ after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
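The second alternative (dividing by the keep probability during training) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `dropout` helper and the constant activations are hypothetical, and the fixed seed is there just to make the run repeatable.

```python
import random

def dropout(values, rate, training, rng):
    """Zero each value with probability `rate` and divide the survivors
    by the keep probability (1 - rate); do nothing when not training."""
    if not training:
        return list(values)              # nothing is dropped after training
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(42)                  # fixed seed so the run is repeatable
activations = [1.0] * 10000              # hypothetical layer outputs

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)
test_mean = sum(test_out) / len(test_out)
print(train_mean, test_mean)
```

About half of the training outputs are zero and the rest are doubled, so the average signal during training stays close to the average signal at test time, which is the point of the rescaling.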

In [8]:
reset_graph()
n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

In [9]:
training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5  # dropout rate == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu, name='hidden1')
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu, name='hidden2')
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name='outputs')

In [10]:
with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
with tf.name_scope('train'):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss)

with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [11]:
X_train, X_valid, y_train, y_valid = get_mnist()

In [12]:
n_epochs = 20
batch_size = 50
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch, training:True})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, 'Validation accuracy:', accuracy_val)
    save_path = saver.save(sess, '/tmp/my_model_final.ckpt')

0 Validation accuracy: 0.9254
1 Validation accuracy: 0.9452
2 Validation accuracy: 0.9492
3 Validation accuracy: 0.9566
4 Validation accuracy: 0.9618
5 Validation accuracy: 0.9602
6 Validation accuracy: 0.96
7 Validation accuracy: 0.9674
8 Validation accuracy: 0.969
9 Validation accuracy: 0.9712
10 Validation accuracy: 0.9696
11 Validation accuracy: 0.9676
12 Validation accuracy: 0.9714
13 Validation accuracy: 0.9714
14 Validation accuracy: 0.9722
15 Validation accuracy: 0.9696
16 Validation accuracy: 0.9726
17 Validation accuracy: 0.9726
18 Validation accuracy: 0.974
19 Validation accuracy: 0.9742


Dropout tends to slow down convergence significantly, but when tuned properly it usually produces a better model.

## Max-Norm Regularization
Max-norm regularization constrains, for each neuron, the weights $\mathbf{w}$ of the incoming connections such that $\|\mathbf{w}\|_2 \le r$, where $r$ is the max-norm hyperparameter and $\|\cdot\|_2$ is the $\ell_2$ norm.

We typically implement this constraint by computing $\|\mathbf{w}\|_2$ after each training step and clipping $\mathbf{w}$ if needed ($\mathbf{w} \leftarrow \mathbf{w}\frac{r}{\|\mathbf{w}\|_2}$).

Reducing $r$ increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).
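As a worked example of the clipping rule, the sketch below applies it to a single weight vector in plain Python. The helper name `clip_by_max_norm` and the sample vectors are hypothetical and only serve the illustration.

```python
import math

def clip_by_max_norm(w, r):
    """Return w rescaled to L2 norm r if its norm exceeds r, else a copy of w."""
    norm = math.sqrt(sum(v * v for v in w))
    if norm <= r:
        return list(w)                   # constraint already satisfied
    return [v * r / norm for v in w]     # w <- w * r / ||w||_2

big = clip_by_max_norm([3.0, 4.0], r=1.0)     # norm 5.0, gets rescaled
small = clip_by_max_norm([0.3, 0.4], r=1.0)   # norm 0.5, left unchanged
print(big, small)
```

The first vector keeps its direction but is shrunk onto the ball of radius $r$; the second is untouched because it already lies inside it.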

In [13]:
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name='hidden1')
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, name='hidden2')
    logits = tf.layers.dense(hidden2, n_outputs, name='outputs')

with tf.name_scope('loss'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name='loss')
    
with tf.name_scope('train'):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)
    
with tf.name_scope('eval'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

Next, let's get a handle on the first hidden layer's weights and create an operation that computes the clipped weights using the `clip_by_norm()` function. Then we create an assignment operation that assigns the clipped weights to the weights variable.

In [14]:
threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name('hidden1/kernel:0')
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

We can do this as well for the second hidden layer:

In [15]:
weights2 = tf.get_default_graph().get_tensor_by_name('hidden2/kernel:0')
clipped_weights2 = tf.clip_by_norm(weights2, clip_norm=threshold, axes=1)
clip_weights2 = tf.assign(weights2, clipped_weights2)
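The `axes` argument decides which slices of the kernel are clipped. As far as we understand `tf.clip_by_norm`, the norm is computed over the dimensions listed in `axes`, so on a 2-D kernel of shape `(n_inputs, n_units)` the value `axes=1` yields one norm per row, whereas `axes=0` would yield one norm per column (that is, per neuron's incoming weights). The pure-Python sketch below contrasts the two on a toy matrix. The helpers `clip_rows` and `clip_columns` are hypothetical and are not TensorFlow functions, so it is worth checking the `tf.clip_by_norm` documentation before relying on either reading.

```python
import math

def clip_rows(matrix, r):
    """Clip each row to L2 norm <= r (one norm per row)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        factor = r / norm if norm > r else 1.0
        out.append([v * factor for v in row])
    return out

def clip_columns(matrix, r):
    """Clip each column to L2 norm <= r (one norm per column)."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*clip_rows(transposed, r))]

# Toy kernel: rows index input units, columns index neurons.
kernel_toy = [[3.0, 0.0],
              [4.0, 0.5]]

by_row = clip_rows(kernel_toy, 1.0)
by_column = clip_columns(kernel_toy, 1.0)
print(by_row)
print(by_column)
```

The two results differ, so the choice of axis determines whether the constraint applies per input unit or per neuron.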

In [16]:
init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [17]:
show_graph(tf.get_default_graph())

Now we can train the model. Training is the same as usual, except that right after running the `training_op` we run the `clip_weights` and `clip_weights2` operations:

In [18]:
n_epochs = 20
batch_size = 50

In [20]:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            clip_weights.eval()
            clip_weights2.eval()
        acc_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, 'Validation accuracy:', acc_valid)
    
    save_path = saver.save(sess, '/tmp/my_model_final.ckpt')

0 Validation accuracy: 0.9562
1 Validation accuracy: 0.9684
2 Validation accuracy: 0.9718
3 Validation accuracy: 0.9772
4 Validation accuracy: 0.9758
5 Validation accuracy: 0.9764
6 Validation accuracy: 0.9822
7 Validation accuracy: 0.98
8 Validation accuracy: 0.9806
9 Validation accuracy: 0.9834
10 Validation accuracy: 0.9824
11 Validation accuracy: 0.9838
12 Validation accuracy: 0.9824
13 Validation accuracy: 0.984
14 Validation accuracy: 0.983
15 Validation accuracy: 0.983
16 Validation accuracy: 0.9832
17 Validation accuracy: 0.984
18 Validation accuracy: 0.9842
19 Validation accuracy: 0.9842


The implementation above is straightforward and it works fine, but it is a bit messy. A better approach is to define a `max_norm_regularizer()` function:

In [21]:
def max_norm_regularizer(threshold, axes=1, name='max_norm', collection='max_norm'):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm

This function returns a parameterized `max_norm()` function that you can use like any other regularizer: when you create a hidden layer, you can pass this regularizer to the `kernel_regularizer` argument.
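The pattern here is a closure that registers a deferred side effect and returns `None` because it contributes nothing to the loss. A minimal pure-Python sketch of that pattern follows. The `collections` dictionary and `add_to_collection` helper are hypothetical stand-ins for the graph's named collections, and the weights are a plain nested list rather than a TensorFlow variable.

```python
import math

collections = {}  # hypothetical stand-in for the graph's named collections

def add_to_collection(name, op):
    collections.setdefault(name, []).append(op)

def max_norm_regularizer(threshold, collection="max_norm"):
    def max_norm(weights):
        def clip_op():
            # Rescale in place any row whose L2 norm exceeds the threshold.
            for row in weights:
                norm = math.sqrt(sum(v * v for v in row))
                if norm > threshold:
                    row[:] = [v * threshold / norm for v in row]
        add_to_collection(collection, clip_op)
        return None  # no loss term to add
    return max_norm

reg = max_norm_regularizer(threshold=1.0)
W_toy = [[3.0, 4.0], [0.3, 0.4]]
returned = reg(W_toy)             # registers the clipping step, clips nothing yet

for op in collections["max_norm"]:
    op()                          # what we would run after each training step
print(returned, W_toy)
```

Calling the regularizer only records the clipping step; the weights change when the collected operations are run, which mirrors running the collected assignment operations after each `training_op`.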

In [22]:
reset_graph()

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 50
n_outputs = 10

learning_rate = 0.01
momentum = 0.9

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

In [23]:
max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope('dnn'):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, kernel_regularizer=max_norm_reg, name='hidden1')
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu, kernel_regularizer=max_norm_reg, name='hidden2')
    logits = tf.layers.dense(hidden2, n_outputs, name='outputs')

In [24]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
    training_op = optimizer.minimize(loss)    

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
saver = tf.train.Saver()

In [25]:
n_epochs = 20
batch_size = 50

clip_all_weights = tf.get_collection('max_norm')

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)
        acc_valid = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, 'Validation accuracy:', acc_valid)
    
    save_path = saver.save(sess, '/tmp/my_model_final.ckpt')

0 Validation accuracy: 0.9558
1 Validation accuracy: 0.97
2 Validation accuracy: 0.9732
3 Validation accuracy: 0.9744
4 Validation accuracy: 0.9768
5 Validation accuracy: 0.9784
6 Validation accuracy: 0.9796
7 Validation accuracy: 0.981
8 Validation accuracy: 0.981
9 Validation accuracy: 0.9812
10 Validation accuracy: 0.982
11 Validation accuracy: 0.9812
12 Validation accuracy: 0.9812
13 Validation accuracy: 0.9822
14 Validation accuracy: 0.982
15 Validation accuracy: 0.9832
16 Validation accuracy: 0.9822
17 Validation accuracy: 0.983
18 Validation accuracy: 0.9824
19 Validation accuracy: 0.9828


In [27]:
tf.__version__

'1.12.0'