In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

import matplotlib.pyplot as plt

import time
from datetime import datetime, timedelta

from pylib.draw_nn import draw_neural_net_fig

In [None]:
sess = None

def reset_vars():
    sess.run(tf.global_variables_initializer())

def reset_tf():
    global sess
    if sess:
        sess.close()
    tf.reset_default_graph()
    sess = tf.Session()

<!-- requirement: pylib/__init__.py -->
<!-- requirement: pylib/draw_nn.py -->
<!-- requirement: images/Accuracy_NoDropout.png-->
<!-- requirement: images/Accuracy_Dropout.png -->

# Deep Neural Networks

## What is deep learning?

Deep learning is a branch of machine learning that tries to emulate the biological structure and function of the brain using artificial neural networks. These networks include: 

- Multilayer Perceptron Networks
- Convolutional Neural Networks
- Recurrent Neural Networks

Additionally, these networks are hierarchical or multilayered, enabling them to model high-level abstractions in data. For this reason, deep learning is also called **hierarchical learning**. (Notice how there are two hidden layers in the figure of the multilayer perceptron network below.)

In [None]:
draw_neural_net_fig([20, 14, 12, 10])

There are benefits to using hierarchical models. The performance of older machine learning algorithms tends to level off as the training set grows, whereas the performance of deep learning algorithms keeps improving with the amount of data they are trained on. Consequently, given enough data, deep learning algorithms typically outperform traditional ones. These models also have the ability to automatically extract features from data in a process called [feature learning](https://en.wikipedia.org/wiki/Feature_learning). This ability eliminates the need for a priori knowledge of the data to construct features, which is particularly useful when dealing with complex data such as images.

Deep learning has some pretty neat applications. Not only can we classify images with a high degree of accuracy, but we can also use deep learning algorithms to [generate captions](https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html), [summarize](https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html) and [translate](https://research.googleblog.com/2016/09/a-neural-network-for-machine.html) text, [generate audio](https://deepmind.com/blog/wavenet-generative-model-raw-audio/), and [produce art](https://github.com/lengstrom/fast-style-transfer/). 


## Multilayer perceptron

We will start by replicating some of the code we used for our basic neural network. 

In [None]:
mnist = input_data.read_data_sets('/tmp/data', one_hot=True)

In [None]:
N_PIXELS = 28 * 28
N_CLASSES = 10
BATCH_SIZE = 100
LEARNING_RATE = 0.5

hidden_size = 64

### Initializing Weights and Biases

As a reminder, we want to initialize our weights with random values to break symmetry between neurons in a hidden layer. Additionally, we want to choose small values to avoid the **vanishing gradient problem**, where the weighted sum of the inputs (plus a bias) falls on the flat portion of the sigmoid curve, so the gradient is close to zero and learning stalls. What is the proper scale of the weights?  Most of our activation functions have their best response for inputs of $\mathcal O(1)$.  If we have $m$ random inputs, each of $\mathcal O(1)$, we expect their sum to scale as $\sqrt m$.  Therefore, weights are often chosen randomly with a mean of zero and standard deviation of $1/\sqrt m$, which keeps the weighted sum at $\mathcal O(1)$.
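
As a rough numerical check of this scaling argument, the sketch below uses plain Python rather than TensorFlow. It assumes the inputs are independent standard normal draws, which is an idealization of real pixel data, and `weighted_sum_std` is a hypothetical helper introduced only for this illustration.

```python
import math
import random

def weighted_sum_std(m, n_trials=1000, seed=0):
    """Empirical standard deviation of sum_i w_i * x_i over many random draws.

    Assumes x_i ~ N(0, 1) (inputs of order one) and w_i ~ N(0, 1/m).
    """
    rng = random.Random(seed)
    stddev = m ** -0.5  # the 1/sqrt(m) weight scale discussed in the text
    sums = []
    for _trial in range(n_trials):
        total = 0.0
        for _i in range(m):
            total += rng.gauss(0.0, stddev) * rng.gauss(0.0, 1.0)
        sums.append(total)
    mean = sum(sums) / n_trials
    variance = sum((s - mean) ** 2 for s in sums) / n_trials
    return math.sqrt(variance)

for m in (16, 64, 256):
    print("m=%3d  std of weighted sum: %.2f" % (m, weighted_sum_std(m)))
```

Under these assumptions, the spread of the weighted sum should stay of order one as $m$ grows, which is the point of the $1/\sqrt m$ scaling.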

For very large layers, this gives rather small weights.  An alternative approach is to only provide $k < m$ non-zero weights when initializing neurons.  This scheme, known as **sparse initialization**, provides more diversity amongst the neurons at initialization.  It can, however, also produce very slow convergence as "incorrect" choices of non-zero weights have to be removed.

In the code below, we initialize our weights by sampling from a truncated normal distribution, where any weight more than 2 standard deviations from the mean is re-picked. We also initialize the biases to zero.

In [None]:
def initializer(shape):
    return tf.truncated_normal(shape, stddev=shape[0]**-0.5)
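
To make the re-picking step concrete, here is a minimal plain-Python sketch of a truncated normal sampler. It only illustrates the idea and is not how `tf.truncated_normal` is implemented internally; `truncated_normal_sample` is a hypothetical helper name.

```python
import random

def truncated_normal_sample(stddev, rng, mean=0.0):
    """Draw from N(mean, stddev^2), re-drawing until within 2 stddev of the mean."""
    while True:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2.0 * stddev:
            return value

rng = random.Random(0)
m = 28 * 28              # number of inputs to the layer
stddev = m ** -0.5       # same scale as initializer() uses via shape[0]
weights = [truncated_normal_sample(stddev, rng) for _ in range(1000)]
print("largest |weight| / stddev: %.2f" % (max(abs(w) for w in weights) / stddev))
```

Because out-of-range draws are discarded, no initial weight ends up more than two standard deviations from zero.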

### Adding Hidden Layers

A single-layer neural network only works well on linearly separable data. By adding a hidden layer with a nonlinear activation, we can handle a much wider range of classification problems. The exercise at the end of the basic neural network notebook was to add a layer to our model to improve the accuracy of our predictions. We will present the solution below.
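
A classic illustration of this point is XOR, which is not linearly separable but can be computed with one hidden layer. The sketch below is plain Python with hand-picked weights (an assumption made for illustration, not weights learned by training); `two_layer_xor` is a hypothetical name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def two_layer_xor(x1, x2):
    """Tiny network: 2 inputs -> 2 sigmoid hidden units -> 1 sigmoid output."""
    # Hidden unit 1 behaves like OR, hidden unit 2 like AND.
    h_or = sigmoid(20 * x1 + 20 * x2 - 10)
    h_and = sigmoid(20 * x1 + 20 * x2 - 30)
    # Output fires when OR is on and AND is off, i.e. XOR.
    return sigmoid(20 * h_or - 20 * h_and - 10)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print("%d XOR %d -> %.3f" % (x1, x2, two_layer_xor(x1, x2)))
```

No single line in the input plane separates the two XOR classes, but the hidden units re-express the inputs so that the output unit can.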

In [None]:
reset_tf()

x = tf.placeholder(tf.float32, [None, N_PIXELS], name="pixels")
y_label = tf.placeholder(tf.float32, [None, N_CLASSES], name="labels")

W1 = tf.Variable(initializer([N_PIXELS, hidden_size]), name="weights")
b1 = tf.Variable(tf.zeros([hidden_size]), name="biases")

hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1)

W2 = tf.Variable(initializer([hidden_size, N_CLASSES]), name="weights2")
b2 = tf.Variable(tf.zeros([N_CLASSES]), name="biases2")

y = tf.matmul(hidden, W2) + b2

In [None]:
def merge_dicts(*args):
    d = dict()
    for a in args:
        d.update(a)
    return d

def optimize(x, y, y_label, steps_total, steps_print, train_feed=None, test_feed=None):
    # Extra feeds (e.g. the dropout rate) for the training and evaluation runs
    train_feed = train_feed or {}
    test_feed = test_feed or {}

    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y,
                                                                  labels=y_label))
    train = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(y_label, 1)), tf.float32))

    # Both evaluations use test_feed, so that dropout (if any) is switched off
    eval_test = merge_dicts({x: mnist.test.images, y_label: mnist.test.labels}, test_feed)
    eval_train = merge_dicts({x: mnist.train.images, y_label: mnist.train.labels}, test_feed)

    reset_vars()
    for i in range(steps_total):
        batch_x, batch_y = mnist.train.next_batch(BATCH_SIZE)
        sess.run(train,
                 feed_dict=merge_dicts({x: batch_x, y_label: batch_y}, train_feed))
        if i % steps_print == 0 or i == steps_total - 1:
            # Each line shows [loss, accuracy]
            print("Test:  %s" % sess.run([loss, accuracy], feed_dict=eval_test))
            print("Train: %s" % sess.run([loss, accuracy], feed_dict=eval_train))

optimize(x, y, y_label, 10000, 1000)
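
For readers who want to see the arithmetic behind the two numbers that `optimize` prints, here is a plain-Python sketch of softmax cross-entropy and arg-max accuracy for lists of logits and one-hot labels. The helper names are ours, and TensorFlow's own implementation is vectorized and differs in detail.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label, for one example."""
    shift = max(logits)  # subtract the maximum for numerical stability
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    log_probs = [z - log_norm for z in logits]
    return -sum(t * lp for t, lp in zip(one_hot_label, log_probs))

def accuracy(all_logits, all_labels):
    """Fraction of examples whose largest logit matches the labelled class."""
    hits = 0
    for logits, label in zip(all_logits, all_labels):
        if logits.index(max(logits)) == label.index(max(label)):
            hits += 1
    return hits / float(len(all_logits))

logits = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0]]
labels = [[1, 0, 0], [1, 0, 0]]
losses = [softmax_cross_entropy(z, t) for z, t in zip(logits, labels)]
print("mean loss: %.3f" % (sum(losses) / len(losses)))
print("accuracy:  %.2f" % accuracy(logits, labels))
```

The loss is small when the logit for the correct class dominates and grows when the model puts its weight on a wrong class, while the accuracy only checks which class has the largest logit.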

## Layer API

Setting up all of this math is obviously going to get tedious as we increase the number of layers. To address this, TensorFlow provides a [layers API](https://www.tensorflow.org/api_docs/python/tf/layers), which lets us create individual layers with a single line.  Let's recreate this two-layer network in the new API.

The input tensors are created in the same way as before.

In [None]:
reset_tf()

x = tf.placeholder(tf.float32, [None, N_PIXELS], name="pixels")
y_label = tf.placeholder(tf.float32, [None, 10], name="labels")

We have been using **dense** layers.  That is, each neuron is connected to all of the inputs to the layer.  First we create the hidden layer:

In [None]:
hidden = tf.layers.dense(x, hidden_size, activation=tf.nn.sigmoid, use_bias=True,
    kernel_initializer=tf.truncated_normal_initializer(stddev=N_PIXELS**-0.5))

This sets up a weight matrix (referred to as the **kernel**, for reasons to be discussed later) of `size(x)`$\times$`hidden_size`, multiplies it with $x$, adds a bias term (since `use_bias=True`), and sends the result through the sigmoid activation function.  We use the same truncated normal initializer for the weights as before.
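
As a minimal sketch of the computation just described, the function below evaluates a dense layer for a single input vector in plain Python. It is an illustration under simplifying assumptions (nested lists instead of tensors, one example at a time), and `dense_forward` is a hypothetical name, not part of the TensorFlow API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(inputs, kernel, bias, activation=None):
    """Compute activation(inputs . kernel + bias) for one example.

    inputs: list of n_in floats
    kernel: n_in x n_out nested list (the weight matrix)
    bias:   list of n_out floats
    """
    outputs = []
    for j in range(len(bias)):
        z = bias[j] + sum(inputs[i] * kernel[i][j] for i in range(len(inputs)))
        outputs.append(activation(z) if activation else z)
    return outputs

kernel = [[1.0, 0.0, -1.0],
          [0.5, 1.0, 0.0]]   # 2 inputs, 3 neurons
bias = [0.0, 0.0, 1.0]
print(dense_forward([1.0, 2.0], kernel, bias))                      # no activation
print(dense_forward([1.0, 2.0], kernel, bias, activation=sigmoid))  # sigmoid layer
```

Every output neuron sees every input, which is what makes the layer dense.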

We use a second dense layer to produce the final output.  We don't need an activation function here, as we'll feed it into the softmax function ourselves.

In [None]:
y = tf.layers.dense(hidden, 10, activation=None, use_bias=True,
    kernel_initializer=tf.truncated_normal_initializer(stddev=hidden_size**-0.5))

The training proceeds identically to before.

In [None]:
optimize(x, y, y_label, 10000, 1000)

This API makes it easy to add layers and neurons to a neural network. However, in doing so, we run the risk of overfitting our model.

## Overfitting and dropout

Consider a neural network with two hidden layers, with 400 neurons in each, that has been trained on the MNIST dataset. Now consider the figure below. It shows the model's performance on training (blue) and test (orange) data as a function of training steps. The gap between the training and test curves suggests that we are capturing too much of the variance in our training dataset such that our model doesn't generalize to new data. In other words, we have **overfit** the model. 


![No dropout](images/Accuracy_NoDropout.png)

Overfitting tends to occur when there are too many hidden layers or neurons for the amount of training data, while underfitting (when the model does not capture enough variance in the data) tends to occur when there are too few. A general rule of thumb is that the number of neurons in a hidden layer should be between the size of the input and output layers.

There are several [strategies](http://neuralnetworksanddeeplearning.com/chap3.html#regularization) to prevent overfitting, but we will consider **dropout**. For dropout, we assign a probability that a neuron will remain in the network for each iteration of training. A common choice for this probability is 0.5, and you can read more about dropout in [Srivastava et al (2014)](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf).  _Run the cell below to visualize how a neural network changes with dropout_. In TensorFlow, we apply the dropout regularization function right after the activation function:

`p = tf.placeholder(tf.float32)
Y = tf.nn.relu(tf.matmul(x, weights) + biases)
Y_ = tf.nn.dropout(Y, p)`
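
To show what happens to the activations, here is a plain-Python sketch of dropout with a keep probability. It follows the common "inverted dropout" convention, in which kept activations are scaled up by `1 / keep_prob` so that their expected value is unchanged; `apply_dropout` is a hypothetical helper, not the TensorFlow function.

```python
import random

def apply_dropout(activations, keep_prob, rng):
    """Keep each activation with probability keep_prob, rescaled by 1 / keep_prob."""
    dropped = []
    for a in activations:
        if rng.random() < keep_prob:
            dropped.append(a / keep_prob)  # kept, and scaled up
        else:
            dropped.append(0.0)            # removed for this training step
    return dropped

rng = random.Random(0)
activations = [1.0] * 10
print(apply_dropout(activations, 0.5, rng))  # roughly half are zeroed
print(apply_dropout(activations, 1.0, rng))  # keep_prob of 1 leaves them unchanged
```

A new random mask is drawn on every call, so each training step effectively trains a different thinned network.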

In [None]:
draw_neural_net_fig([20, 14, 12, 10], 0.7) #last argument is the probability of keeping a neuron

We do not apply dropout on the test data. Applying a dropout of $p=0.5$ produces the set of curves below. You can see that the gap is no longer present. 

![With dropout](images/Accuracy_Dropout.png)

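Dropout can be created through the layer API.  Note that `tf.layers.dropout` takes the dropout *rate*, the fraction of units to drop, rather than the probability of keeping a unit used by `tf.nn.dropout`.  By setting the rate to a placeholder, we can adjust it later.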

In [None]:
reset_tf()

x = tf.placeholder(tf.float32, [None, N_PIXELS], name="pixels")
y_label = tf.placeholder(tf.float32, [None, 10], name="labels")
dropout = tf.placeholder(tf.float32)

DROPOUT = 0.2
LAYERS = [400, 400]

In [None]:
layer = x
for size in LAYERS:
    layer = tf.layers.dense(layer, size, activation=tf.nn.relu, use_bias=True,
        kernel_initializer=tf.truncated_normal_initializer(stddev=layer.shape[1].value**-0.5))
    layer = tf.layers.dropout(layer, rate=dropout, training=True)

y = tf.layers.dense(layer, N_CLASSES, activation=None, use_bias=True,
    kernel_initializer=tf.truncated_normal_initializer(stddev=layer.shape[1].value**-0.5))

We apply dropout only during training.  We want to make predictions using the best model we have, namely by using all of the neurons we can.  The dropout layer's training flag can be used here, but we found it easier to just adjust the dropout rate to 0 in the test case.

In [None]:
BATCH_SIZE = 400
n_steps = 3000  # reduce (e.g. to 300) for a quick run
optimize(x, y, y_label, n_steps, 100, train_feed={dropout: DROPOUT}, test_feed={dropout: 0})

We can also make predictions using our model by running the cell below.

In [None]:
def predict(idx):
    image = mnist.test.images[idx]
    # sess.run returns an array with one entry per image; take the single prediction
    return sess.run(tf.argmax(y, 1), feed_dict={x: [image], dropout: 0})[0]

idx = 0

plt.imshow(mnist.test.images[idx].reshape((28, 28)), cmap=plt.cm.gray_r)
plt.title("Predicted: %d, Actual: %d" % (predict(idx), np.argmax(mnist.test.labels[idx])));

## Exercise: Adding flexibility

Try increasing the number of **hidden layers** and **neurons**. How does increasing these values influence the model's accuracy (for both training and test data)? The time it takes to train the model? The time it takes the model's accuracy to converge? Are you overfitting your model with these values? How do you know?

## Exercise: Adding dropout

Play around with the values of **dropout**. How does changing the dropout rate change the fit of your model? 

## Exercise: Changing the learning rate

Finally, play around with the **learning rate**. In particular, try using the value from your solution to the basic neural networks exercise. What do you notice?

*Copyright &copy; 2017 The Data Incubator.  All rights reserved.*