In [2]:
# Import MNIST data set
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot = True)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The MNIST data set is split into three parts:

* 55,000 data points of training data (mnist.train)
* 10,000 data points of test data (mnist.test)
* 5,000 points of validation data (mnist.validation)

Each image is 28x28 pixels. The classification labels are one-hot encoded, so each label is a vector of length 10 (one entry per digit), with the entry for the true digit set to 1 and the rest set to 0.
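
As a small illustration of this label format, the sketch below builds the label vector for a single digit in plain Python. The helper name `one_hot` is hypothetical and is not part of the MNIST loader.

```python
def one_hot(digit, num_classes=10):
    """Return a list of length num_classes with 1.0 at index `digit`."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be in [0, num_classes)")
    vec = [0.0] * num_classes
    vec[digit] = 1.0
    return vec

label = one_hot(3)
print(label)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```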

### Softmax Regressions
We classify the unrolled (flattened) images, so our training tensor, mnist.train.images, has shape 55000x784. The training label tensor, mnist.train.labels, has shape 55000x10.

We also add a bias to each class, which lets the model express that some classes are more likely independent of the input. The evidence for a class $i$ given an input $x$ is:

$$evidence_i = \sum_j W_{i, j}x_j + b_i$$

where $W_i$ are the weights and $b_i$ is the bias for class $i$, and $j$ is an index for summing over the pixels in our input image $x$. We then convert the evidence tallies into our predicted probabilities $y$ using the "softmax" function:

$$y = softmax(evidence) = normalize(e^{evidence})$$

Expanded, you get:

$$softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
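
To make the formula concrete, here is a minimal sketch of softmax in plain Python, applied to three made-up evidence values. It follows the formula directly and is not how TensorFlow implements `tf.nn.softmax` internally.

```python
import math

def softmax(evidence):
    """Exponentiate each entry, then normalize so the entries sum to 1."""
    exps = [math.exp(e) for e in evidence]
    total = sum(exps)
    return [v / total for v in exps]

probs = softmax([1.0, 2.0, 3.0])  # hypothetical evidence for three classes
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
```

The largest evidence value gets the largest probability, and the outputs always sum to 1.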

### Implementing the Regression
We now implement the softmax regression for classification.

Here, x isn't a specific value. It's a placeholder - a value that we'll input when we ask TensorFlow to run a computation. We want to be able to input any number of MNIST images, each flattened into a 784-dimensional vector. We represent this as a 2-D tensor of floating-point numbers with a shape [None, 784] (here, None means that a dimension can be of any length).

In [4]:
import tensorflow as tf
x = tf.placeholder(tf.float32, [None, 784])

We treat the weights and biases as TensorFlow variables:

In [5]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

Notice that W has a shape of [784, 10] because we want to multiply the 784-dimensional image vectors by it to produce 10-dimensional vectors of evidence for the different classes. b has a shape of [10] so we can add it to the output.

We can now implement our model (in a single line!):

In [6]:
y = tf.nn.softmax(tf.matmul(x, W) + b)

To train the model, we need a measure of how wrong its predictions are. We use the **cross-entropy** error, defined as:

$$H_{y'}(y) = -\sum_i y_i' \log(y_i)$$

where $y$ is our predicted probability distribution and $y'$ is the true distribution (the given labels).
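
As a worked example of this definition, the sketch below evaluates the cross-entropy for one one-hot label against two made-up predictions. With a one-hot label, the sum reduces to minus the log of the probability assigned to the true class.

```python
import math

def cross_entropy(y_true, y_pred):
    """H_{y'}(y) = -sum_i y'_i * log(y_i); entries with y'_i == 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0.0, 1.0, 0.0]        # one-hot label: the true class is 1
confident = [0.05, 0.90, 0.05]  # hypothetical prediction, mostly right
unsure = [0.40, 0.30, 0.30]     # hypothetical prediction, mostly wrong

loss_confident = cross_entropy(y_true, confident)  # equals -log(0.90)
loss_unsure = cross_entropy(y_true, unsure)        # equals -log(0.30)
print(loss_confident, loss_unsure)
```

The better prediction has the lower loss, which is why minimizing cross-entropy pushes probability toward the correct class.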

To implement cross-entropy, we need to first add a new placeholder to input the correct answers:

In [7]:
y_ = tf.placeholder(tf.float32, [None, 10])

Then we can implement the cross-entropy function, $-\sum y' \log(y)$:

In [8]:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices = [1]))

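A better way to get the cross-entropy is to use the tf.nn.softmax_cross_entropy_with_logits function, which is more numerically stable than the calculation above. It applies the softmax internally, so it must be given the unnormalized evidence (the logits), not the softmax output y:

In [13]:
# Pass the pre-softmax evidence; passing y would apply softmax twice.
logits = tf.matmul(x, W) + b
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels = y_, logits = logits))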
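
To see where the stability concern comes from, the sketch below compares a naive "softmax, then log" calculation with a log-sum-exp formulation in plain Python. It illustrates the general idea only and is not a description of TensorFlow's internal implementation; the extreme logit values are made up to force the failure.

```python
import math

def naive_cross_entropy(logits, target):
    """Softmax first, then log: math.exp overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    """Log-sum-exp with the maximum subtracted, so exp never sees a large input."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

extreme_logits = [1000.0, 0.0, -1000.0]
try:
    naive = naive_cross_entropy(extreme_logits, 0)
except OverflowError:
    naive = None  # math.exp(1000.0) is out of range for a float

stable = stable_cross_entropy(extreme_logits, 0)
print(naive, stable)  # None 0.0
```

For moderate logits both functions agree; they only diverge once the exponentials leave the floating-point range.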

Now that we know what we want our model to do, it's very easy to have TensorFlow train it to do so. Because TensorFlow knows the entire graph of your computations, it can automatically use the backpropagation algorithm to efficiently determine how your variables affect the loss you ask it to minimize.

In [24]:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
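
TensorFlow derives the gradients for us, but for softmax followed by cross-entropy the gradient of the loss with respect to the evidence has a well-known closed form, $y - y'$. The sketch below, in plain Python rather than TensorFlow, checks that form against finite differences and then takes one gradient-descent step with the same learning rate of 0.5. The evidence values are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical safety
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def loss(z, y_true):
    """Cross-entropy between the one-hot label and softmax(z)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, softmax(z)) if t > 0)

z = [0.5, -1.0, 2.0]       # hypothetical evidence for three classes
y_true = [0.0, 0.0, 1.0]   # one-hot label

# Closed-form gradient with respect to the evidence: y - y'
analytic = [p - t for p, t in zip(softmax(z), y_true)]

# Central finite differences as an independent check
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_hi = list(z)
    z_lo = list(z)
    z_hi[i] += eps
    z_lo[i] -= eps
    numeric.append((loss(z_hi, y_true) - loss(z_lo, y_true)) / (2 * eps))

# One gradient-descent step on the evidence
learning_rate = 0.5
z_next = [v - learning_rate * g for v, g in zip(z, analytic)]
print(loss(z, y_true), loss(z_next, y_true))
```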

We can now launch the model in an InteractiveSession, which is convenient in interactive contexts such as this Jupyter notebook. (If you are not using an InteractiveSession, you should build the entire computation graph before starting a session and launching the graph.)

In [25]:
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict = {x: batch_xs, y_: batch_ys})

Using small batches of random data is called stochastic training. Ideally, we'd like to use all our data for every step of training because that would give us a better sense of what we should be doing, but that's expensive. So, instead, we use a different subset every time. Doing this is cheap and has much of the same benefit.
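
The batching idea can be sketched without TensorFlow. The generator below is a hypothetical stand-in for `next_batch`, not the loader's actual code: it shuffles the example indices, hands them out in fixed-size slices, and reshuffles once a full pass is exhausted.

```python
import random

def minibatches(num_examples, batch_size, num_steps, seed=0):
    """Yield lists of example indices, reshuffling after each full pass."""
    rng = random.Random(seed)
    order = list(range(num_examples))
    rng.shuffle(order)
    start = 0
    for _ in range(num_steps):
        if start + batch_size > num_examples:
            rng.shuffle(order)  # start a new pass over the data
            start = 0
        yield order[start:start + batch_size]
        start += batch_size

batches = list(minibatches(num_examples=1000, batch_size=100, num_steps=10))
print(len(batches), len(batches[0]))  # 10 100
```

Each step touches only 100 examples, yet over a full pass every example is used exactly once.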

### Evaluating our model
We first figure out where our model predicts the correct label. tf.argmax is an extremely useful function which gives you the index of the highest entry in a tensor along some axis:

In [26]:
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
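
The same computation can be traced by hand on a tiny example. The sketch below uses plain Python lists with made-up predictions and labels for four examples and three classes; the helper `argmax` is a hypothetical stand-in for `tf.argmax` along the class axis.

```python
def argmax(values):
    """Index of the largest entry (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

predictions = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
labels = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 0],
]

# Compare predicted and true class per example, then average the 0/1 results
correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
accuracy = sum(1.0 if c else 0.0 for c in correct) / len(correct)
print(correct, accuracy)  # [True, False, True, True] 0.75
```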

Now, we get the accuracy over our test data set (not the training data set we used to train the model!):

In [27]:
print(sess.run(accuracy, feed_dict = {x: mnist.test.images, y_: mnist.test.labels}))

0.9055


This is pretty bad, so we'll try to improve it by building a multilayer convolutional network, which should get us to around 99.2% accuracy.

## Build a multilayer convolutional network
### Weight initialization
We first need to create a lot of weights and biases. Generally, weights should be initialized with a small amount of noise for symmetry breaking and to prevent zero gradients. Since we'll be using rectified linear unit (ReLU) neurons, we'll also initialize them with a slightly positive initial bias to avoid "dead neurons".

We create two handy functions to help us with this:

In [28]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape = shape)
    return tf.Variable(initial)
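
For intuition about the weight initializer: `tf.truncated_normal` draws from a normal distribution and re-draws any value that falls more than two standard deviations from the mean. The sketch below imitates that behavior in plain Python; `truncated_normal_py` is a hypothetical helper written for this illustration, not TensorFlow's implementation.

```python
import random

def truncated_normal_py(n, stddev=0.1, mean=0.0, seed=0):
    """Draw n normal samples, rejecting any further than 2 * stddev from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples

weights = truncated_normal_py(1000)
print(min(weights), max(weights))  # both within [-0.2, 0.2]
```

With stddev = 0.1, every initial weight lands in [-0.2, 0.2], so no unit starts with an unusually large weight.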

### Convolution and pooling
TensorFlow also gives us a lot of flexibility in convolution and pooling operations. In this example, we'll implement a vanilla version with a stride of one, zero-padded so that the output is the same size as the input. Our pooling is plain old max pooling over 2x2 blocks.

We also abstract these into functions:

In [29]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides = [1, 1, 1, 1], padding = 'SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = 'SAME')
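
To show what 2x2 max pooling with a stride of 2 does to a single channel, here is a sketch on a made-up 4x4 image using plain Python lists. The helper `max_pool_2x2_py` is hypothetical and assumes even height and width, so no padding is involved.

```python
def max_pool_2x2_py(image):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, len(image[0]), 2)
        ]
        for r in range(0, len(image), 2)
    ]

image = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]
pooled = max_pool_2x2_py(image)
print(pooled)  # [[4, 1], [2, 8]]
```

The 4x4 input becomes 2x2: each spatial dimension is halved.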

### First convolutional layer
Our first layer will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch. Its weight tensor will have a shape of [5, 5, 1, 32]. The first two dimensions (5 and 5) are the patch size, the next (1) is the number of input channels, and the last (32) is the number of output channels. We also have a bias vector with a component for each output channel.

In [30]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

To apply the layer, we first reshape x to a 4-D tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels.

In [32]:
# The shape argument (here, [-1, 28, 28, 1]) can have at most
# one component equal to the special value of -1. The size of
# that dimension is computed so that the total size of x
# remains constant. In particular, a shape of [-1] flattens
# x into a 1-D vector.
x_image = tf.reshape(x, [-1, 28, 28, 1])
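
The rule for the -1 entry is simple arithmetic, sketched below in plain Python. The function `infer_shape` is a hypothetical helper for illustration only; it mimics the rule described in the comment above rather than calling TensorFlow.

```python
def infer_shape(total_size, shape):
    """Replace a single -1 in `shape` so that the product of dims equals total_size."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if total_size % known != 0:
        raise ValueError("total size is not divisible by the known dimensions")
    return [total_size // known if dim == -1 else dim for dim in shape]

# A batch of 100 flattened images holds 100 * 784 values
print(infer_shape(100 * 784, [-1, 28, 28, 1]))  # [100, 28, 28, 1]
print(infer_shape(100 * 784, [-1]))             # [78400]
```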

We then convolve x_image with the weight tensor, add the bias, apply the ReLU function, and finally max pool. The max_pool_2x2 function will reduce the image size to 14x14.

In [35]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)  # reduces image size to 28/2 = 14

### Second convolutional layer
In order to build a deep network, we stack several layers of this type. The second layer will have 64 features for each 5x5 patch.

In [36]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)  # reduces image size to 14/2 = 7
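
It can help to trace the tensor shapes through both layers. The sketch below does the arithmetic in plain Python, assuming the 'SAME' padding rule that the spatial output size is ceil(input size / stride). The helper names and the batch size of 50 are chosen for illustration.

```python
import math

def conv_same(shape, out_channels, stride=1):
    """Shape after a 'SAME'-padded convolution on an [N, H, W, C] tensor."""
    n, h, w, _ = shape
    return (n, math.ceil(h / stride), math.ceil(w / stride), out_channels)

def pool_same(shape, stride=2):
    """Shape after 'SAME'-padded pooling; the channel count is unchanged."""
    n, h, w, c = shape
    return (n, math.ceil(h / stride), math.ceil(w / stride), c)

shapes = [(50, 28, 28, 1)]                # x_image
shapes.append(conv_same(shapes[-1], 32))  # h_conv1
shapes.append(pool_same(shapes[-1]))      # h_pool1
shapes.append(conv_same(shapes[-1], 64))  # h_conv2
shapes.append(pool_same(shapes[-1]))      # h_pool2

for s in shapes:
    print(s)

_, h, w, c = shapes[-1]
flat_size = h * w * c  # 7 * 7 * 64 = 3136 values per image
```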

### Densely connected layer
Now that the image size has been reduced to 7x7, we add a fully-connected layer with 1024 neurons to allow processing on the entire image. We reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU.

In [37]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

### Dropout
To reduce overfitting, we will apply dropout before the readout layer. We create a placeholder for the probability that a neuron's output is kept during dropout. This allows us to turn dropout on during training, and turn it off during testing.

TensorFlow's tf.nn.dropout operation automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling.

NOTE: For this small convolutional network, performance is actually nearly identical with and without dropout. Dropout is often very effective at reducing overfitting, but it is most useful when training very large neural networks.

In [38]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
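
The scaling mentioned above can be sketched in plain Python: each output is kept with probability keep_prob and, if kept, divided by keep_prob, so the expected value of each output is unchanged. The function `dropout_py` is a hypothetical illustration of that idea, not TensorFlow's code.

```python
import random

def dropout_py(values, keep_prob, rng):
    """Zero each value with probability 1 - keep_prob; scale survivors by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 10000

dropped = dropout_py(activations, keep_prob=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)  # close to 1.0 on average

unchanged = dropout_py(activations, keep_prob=1.0, rng=rng)  # dropout turned off
print(mean_after, unchanged == activations)
```

With keep_prob set to 1.0 the operation is the identity, which is how dropout is switched off at test time.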

### Readout layer
Finally, we add a readout layer, just like the one-layer softmax regression above.

In [39]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
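
As a quick sanity check on the size of this network, the sketch below counts the trainable parameters from the weight and bias shapes used above. It is plain arithmetic; the dictionary layout is just one convenient way to write it down.

```python
from functools import reduce
from operator import mul

# (weight shape, bias length) for each layer defined above
layers = {
    "conv1": ([5, 5, 1, 32], 32),
    "conv2": ([5, 5, 32, 64], 64),
    "fc1": ([7 * 7 * 64, 1024], 1024),
    "fc2": ([1024, 10], 10),
}

counts = {
    name: reduce(mul, weight_shape, 1) + bias_len
    for name, (weight_shape, bias_len) in layers.items()
}
total = sum(counts.values())

for name, count in counts.items():
    print(name, count)
print("total", total)  # total 3274634
```

Almost all of the roughly 3.3 million parameters sit in the first fully-connected layer.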

### Train and evaluate the model
Now, we train and evaluate this model using code that is nearly identical to that for the simple one-layer softmax network above.

The differences are that:

* We will replace the gradient descent optimizer with the more sophisticated Adam optimizer.
* We will include the additional parameter keep_prob in feed_dict to control dropout (keep_prob is the probability that a neuron's output is kept).
* We will add logging to every 100th iteration in the training process.

Note that running this will take a while (it requires 20,000 training iterations).
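
For readers curious about what makes Adam different from plain gradient descent, here is a sketch of the textbook Adam update for a single scalar parameter in plain Python. It keeps running averages of the gradient and of its square and scales each step accordingly. The function `adam_step` is a hypothetical helper; TensorFlow's optimizer may differ in details such as how epsilon is applied. The toy objective, f(theta) = theta^2, is made up for illustration.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta ** 2, whose gradient is 2 * theta
theta, m, v = 1.0, 0.0, 0.0
first_theta, _, _ = adam_step(theta, 2 * theta, m, v, t=1)

for t in range(1, 11):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(first_theta, theta)
```

On the first step the update has magnitude close to the learning rate regardless of the gradient's scale, which is one reason a small rate such as 1e-4 is a common choice with Adam.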

In [None]:
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels = y_, logits = y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
sess.run(tf.global_variables_initializer())
for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict = {
            x: batch[0], y_: batch[1], keep_prob: 1.0
        })
        print("Step %d) Training accuracy: %g" % (i, train_accuracy))
    train_step.run(feed_dict = {x: batch[0], y_: batch[1], keep_prob: 0.5})

print("\nTest accuracy: %g" % accuracy.eval(feed_dict = {
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0
}))