## Dropout

**Dropout** is one of the most popular regularization techniques for deep neural networks (DNNs).

The idea behind dropout is simple. At every training step, every neuron (including input neurons but excluding output neurons) has a probability $p$ of being *temporarily dropped out*, meaning it is ignored entirely during that training step. It may become active again in subsequent training steps.

The hyperparameter $p$ is called the *dropout rate* and is typically set to 0.5 (50%). After training, neurons are no longer dropped.
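The dropping rule described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how TensorFlow implements it: `apply_dropout` and its arguments are hypothetical names, and the sketch deliberately leaves out any rescaling of the surviving activations.

```python
import random

def apply_dropout(activations, p, training, rng=random):
    """Zero each activation independently with probability p while training.

    Hypothetical helper for illustration only; no rescaling is applied here.
    """
    if not training:
        # After training no neuron is dropped, so activations pass through.
        return list(activations)
    # rng.random() is uniform in [0, 1), so each neuron is kept with
    # probability 1 - p and dropped (set to 0.0) with probability p.
    return [a if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(42)
layer_output = [0.2, 1.5, 0.7, 3.1, 0.9, 2.4]

# A fresh random mask is drawn on every call, i.e. on every training step.
print(apply_dropout(layer_output, p=0.5, training=True, rng=rng))
print(apply_dropout(layer_output, p=0.5, training=True, rng=rng))

# With training=False the activations are returned unchanged.
print(apply_dropout(layer_output, p=0.5, training=False, rng=rng))
```

Because the mask is resampled on each call, a neuron dropped in one step can be active again in the next.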

At each training step, a different neural network is effectively sampled. Since each droppable neuron can be either present or absent, there are $2^N$ possible networks, where $N$ is the number of droppable neurons.

As a result, it is nearly impossible to sample the same neural network twice. The sampled networks are not independent, since they share many of their weights, but they are all different.
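To get a feel for the $2^N$ count, the sketch below enumerates every keep/drop pattern for a tiny case and then measures how large the count becomes for a realistic $N$. The helper `all_masks` is a hypothetical name, and $N = 784 + 300 + 100$ is an assumption based on the input and hidden layer sizes used in the code later in this section.

```python
from itertools import product

def all_masks(n):
    """Enumerate every keep (1) / drop (0) pattern over n droppable neurons."""
    return list(product([0, 1], repeat=n))

# Tiny case: 3 droppable neurons give 2**3 = 8 distinct sub-networks.
masks = all_masks(3)
for mask in masks:
    print(mask)
print("number of sub-networks:", len(masks))

# Assumed realistic case: 784 inputs + 300 + 100 hidden neurons are droppable.
n_droppable = 784 + 300 + 100
n_networks = 2 ** n_droppable
print("decimal digits in 2**N:", len(str(n_networks)))
```

With a count that large, sampling the same mask twice during training is practically impossible.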

With dropout, the final network can therefore be seen as an averaging ensemble of all these smaller networks (much like the Random Forest algorithm averages many decision trees).

**Note:**
Suppose that $p = 0.5$. During testing, a neuron is then connected to twice as many input neurons as it was (on average) during training. To compensate, we need to multiply each neuron's input connection weights by 0.5 after training. Otherwise, each neuron receives a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well.

**More generally, we need to multiply each input connection weight by the keep probability $(1 - p)$ after training. Alternatively, we can divide each neuron's output by the keep probability during training. These two methods are not exactly equivalent, but both work well.**
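The two compensation schemes can be compared on a single neuron. The sketch below is a hedged illustration, not a library implementation. It computes the exact expected weighted input of a neuron by averaging over all keep/drop masks of its inputs. The function name, weights, and inputs are made up for the example, and only the expectation of the weighted sum is checked, not the behavior after a nonlinearity.

```python
from itertools import product

def expected_training_input(weights, inputs, p, inverted):
    """Exact expected weighted input of one neuron under dropout.

    Averages over every keep (1) / drop (0) mask of the inputs. When
    `inverted` is True, kept inputs are divided by the keep probability
    during training. Hypothetical helper for illustration.
    """
    keep = 1.0 - p
    scale = 1.0 / keep if inverted else 1.0
    total = 0.0
    for mask in product([0, 1], repeat=len(inputs)):
        # Probability of drawing this particular mask.
        prob = 1.0
        for m in mask:
            prob *= keep if m else p
        z = sum(w * x * m * scale for w, x, m in zip(weights, inputs, mask))
        total += prob * z
    return total

weights = [0.5, -1.0, 2.0, 0.25]   # made-up values
inputs = [1.0, 2.0, 0.5, 4.0]      # made-up values
p = 0.5

# Weighted input at test time when no neuron is dropped and nothing is scaled.
full = sum(w * x for w, x in zip(weights, inputs))

# Scheme 1: plain dropout in training, weights multiplied by (1 - p) afterwards.
print(expected_training_input(weights, inputs, p, inverted=False), (1 - p) * full)

# Scheme 2: divide by the keep probability in training, weights left untouched.
print(expected_training_input(weights, inputs, p, inverted=True), full)
```

In both schemes the expected signal seen during training matches the signal seen at test time, which is the point of the correction.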

In TensorFlow (the 1.x API used below), we can use the `tf.layers.dropout()` function to apply dropout in a DNN.

If the model is overfitting, we should increase the dropout rate. Conversely, we should decrease the dropout rate if the model underfits the training set.

Note that dropout tends to slow down convergence, but it **usually** results in a much better model.

In [1]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

tf.reset_default_graph()

n_inputs = 784
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None,), name="y")  # 1-D vector of class labels

In [2]:
dropout_rate = 0.25 # 1 - keep_prob

# Switches dropout on (True) during training; defaults to False for evaluation
training = tf.placeholder_with_default(False, shape=(), name="training")

# Apply dropout for input neurons
X_dropped = tf.layers.dropout(X, rate=dropout_rate, training=training)

with tf.name_scope("dnn"):
    # Use dropped X for the first hidden layer
    hidden1 = tf.layers.dense(X_dropped, n_hidden1, activation=tf.nn.relu, name="hidden1")
    
    # Apply dropout to the first hidden layer's outputs before the second hidden layer
    hidden1_dropped = tf.layers.dropout(hidden1, rate=dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_dropped, n_hidden2, activation=tf.nn.relu, name="hidden2")
    
    # Apply dropout to the second hidden layer's outputs before the output layer
    hidden2_dropped = tf.layers.dropout(hidden2, rate=dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_dropped, n_outputs, name="outputs")

In [3]:
with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

In [4]:
with tf.name_scope("train"):
    initial_learning_rate = 0.1
    decay_steps = 10000
    decay_rate = 1/10
    global_step = tf.Variable(0, trainable=False, name="global_step")
    learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
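
The learning rate schedule above follows the formula documented for `tf.train.exponential_decay` with its default non-staircase behavior: `initial_learning_rate * decay_rate ** (global_step / decay_steps)`. As a rough sketch in plain Python (the function below is a stand-in written for illustration, not the TensorFlow op), the rate shrinks by a factor of 10 every 10,000 training steps with the values used here:

```python
def exponential_decay(initial_learning_rate, global_step, decay_steps, decay_rate):
    """Plain-Python stand-in for the continuous (non-staircase) decay formula."""
    return initial_learning_rate * decay_rate ** (global_step / decay_steps)

# Same hyperparameters as in the cell above.
for step in (0, 5000, 10000, 20000):
    print(step, exponential_decay(0.1, step, decay_steps=10000, decay_rate=1 / 10))
```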

In [5]:
with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [6]:
init = tf.global_variables_initializer()

n_epochs = 100
batch_size = 100

mnist = input_data.read_data_sets("/tmp/data/")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={training: True, X: X_batch, y: y_batch})
        
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images, y: mnist.test.labels})
        print("Epoch:", epoch, "--", "Test Accuracy:", acc_test)

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz
Epoch: 0 -- Test Accuracy: 0.9596
Epoch: 1 -- Test Accuracy: 0.9666
Epoch: 2 -- Test Accuracy: 0.9684
Epoch: 3 -- Test Accuracy: 0.9728
Epoch: 4 -- Test Accuracy: 0.9764
Epoch: 5 -- Test Accuracy: 0.9777
Epoch: 6 -- Test Accuracy: 0.9789
Epoch: 7 -- Test Accuracy: 0.9812
Epoch: 8 -- Test Accuracy: 0.9808
Epoch: 9 -- Test Accuracy: 0.9811
Epoch: 10 -- Test Accuracy: 0.9819
Epoch: 11 -- Test Accuracy: 0.9831
Epoch: 12 -- Test Accuracy: 0.9827
Epoch: 13 -- Test Accuracy: 0.9839
Epoch: 14 -- Test Accuracy: 0.9836
Epoch: 15 -- Test Accuracy: 0.985
Epoch: 16 -- Test Accuracy: 0.9843
Epoch: 17 -- Test Accuracy: 0.9836
Epoch: 18 -- Test Accuracy: 0.9843
Epoch: 19 -- Test Accuracy: 0.9838
Epoch: 20 -- Test Accuracy: 0.9844
Epoch: 21 -- Test Accuracy: 0.9845
Epoch: 22 -- Test Accuracy: 0.9845
Epoch: 23 -- Tes