# Deep MNIST
Adapted from Google's [Deep MNIST for Experts](http://www.tensorflow.org/tutorials/mnist/pros/index.html)

Let's use TensorFlow on the classic MNIST dataset.
Let us first load the data using a script that Google provided. Loading the data is not difficult (it just requires us to download it from Yann LeCun's website), but it's faster this way. The script creates a directory `MNIST_data` in which it stores the files.

In [2]:
import tensorflow as tf
import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


We'll run the example in an `InteractiveSession`, as this allows us to build our computation graph and run it step-by-step.

In [6]:
sess = tf.InteractiveSession()

Exception AssertionError: AssertionError() in <bound method InteractiveSession.__del__ of <tensorflow.python.client.session.InteractiveSession object at 0x7f4e53c5b8d0>> ignored


## Placeholders
We now create nodes for the input images and our target output classes.

In [7]:
x = tf.placeholder('float', shape=[None, 784])
y_ = tf.placeholder('float', shape=[None, 10])

`x` and `y_` are placeholders: they hold no values themselves, and we supply the values via `feed_dict` each time we run a computation. Both are 2d tensors. `None` indicates that the first dimension (i.e. the batch size) can be of any size. 784 is the dimensionality of a single flattened 28x28 MNIST image, while 10 is the number of digit classes we want to predict.
Note: the `shape` argument is optional, but it lets TensorFlow catch bugs stemming from inconsistent tensor shapes.
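
To make the shape `[None, 10]` of the targets concrete, here is a minimal pure-Python sketch of one-hot encoding. The helper `one_hot` is hypothetical and only for illustration; in this notebook the loader already produces such labels via `one_hot=True`.

```python
def one_hot(label, num_classes=10):
    """Return a list of length num_classes with 1.0 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

# A batch of three labels becomes a 3 x 10 matrix:
# one row per example, one column per digit class.
labels = [3, 0, 9]
batch_labels = [one_hot(label) for label in labels]
print(len(batch_labels), len(batch_labels[0]))
print(batch_labels[0])
```

The number of rows is the batch size, which is why the first dimension of the placeholder is left as `None`.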
## Variables
We define the weights `W` and the biases `b` as variables. A `Variable` lives in TensorFlow's computation graph and can be used and modified by the computation. Model parameters are usually `Variable`s.

In [9]:
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

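We pass the initial values for each parameter to `Variable`. In this case, we initialize both `W` and `b` with zeros. Note: In general, weights should be initialized with small random values centered at zero (mean 0) to break symmetry; for this single-layer softmax model, zero initialization still works.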
In order to use them within a session, we have to initialize them using that session.

In [10]:
sess.run(tf.initialize_all_variables())

## Predicted class and cost function
We can now implement our first layer by computing $x \cdot W + b$ and then the softmax probabilities of each class.

In [11]:
y = tf.nn.softmax(tf.matmul(x, W) + b)

We choose cross-entropy as our cost function. Calling `tf.reduce_sum` without specifying an axis sums over all dimensions of the tensor, i.e. over both the examples in the batch and the classes.

In [13]:
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
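
To see what these two operations compute, here is a minimal pure-Python sketch of softmax and of the summed cross-entropy. The function names and the toy inputs are our own assumptions, not TensorFlow's implementation. It uses all-zero logits, which is what our zero-initialized `W` and `b` produce for any input.

```python
import math

def softmax(logits):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def summed_cross_entropy(targets, probs):
    """Sum of -t * log(p) over all examples and all classes."""
    return -sum(t * math.log(p)
                for t_row, p_row in zip(targets, probs)
                for t, p in zip(t_row, p_row))

# Two examples, ten classes, all-zero logits.
logits = [[0.0] * 10, [0.0] * 10]
probs = [softmax(row) for row in logits]
targets = [[1.0] + [0.0] * 9, [0.0] * 9 + [1.0]]
loss = summed_cross_entropy(targets, probs)
print(probs[0][0])  # every class gets the same probability
print(loss)         # equals 2 * log(10) for two examples
```

With uniform predictions, each example contributes $\log 10$ to the cost, so the cost at the start of training grows with the batch size because we sum rather than average.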

## Training
In order to train our model, we specify an optimizer with a step length (the learning rate, here 0.01) and provide it with the cost function to minimize.

In [15]:
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

`train_step` applies the gradient descent updates to the parameters. We can thus train our model by repeatedly running `train_step`.

In [20]:
for i in range(1000):
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})

We replace the `placeholder` tensors `x` and `y_` with the training examples using `feed_dict`. Note: You can replace any tensor in the computation graph using `feed_dict`.
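
As an illustration of what a single gradient descent update does, here is a sketch on a toy one-parameter objective $f(w) = (w - 3)^2$. The objective is our own assumption and not part of the model; we only reuse the learning rate and the number of steps from above.

```python
def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

learning_rate = 0.01
w = 0.0
for step in range(1000):
    # Move against the gradient, scaled by the learning rate.
    w = w - learning_rate * grad(w)

print(w)  # close to the minimizer w = 3
```

In TensorFlow, the gradient is derived automatically from the computation graph, and the update is applied to every `Variable` that the cost depends on.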
## Evaluation
We can now evaluate the accuracy of our model. We use `tf.argmax` to get the index of the highest entry along a given axis; here axis 1 runs over the classes, so we obtain the predicted and the true label of each example. `tf.equal` returns a tensor of booleans, which we cast to floating point numbers and then average.

In [21]:
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, 'float'))
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

0.9152
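
The same accuracy computation can be sketched in pure Python on a tiny hand-made batch. The helper names and the numbers are assumptions for illustration only.

```python
def argmax(row):
    """Index of the largest entry in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def batch_accuracy(predictions, targets):
    """Fraction of rows whose predicted class matches the true class."""
    correct = [argmax(p) == argmax(t) for p, t in zip(predictions, targets)]
    # Cast booleans to floats and take the mean.
    return sum(1.0 if c else 0.0 for c in correct) / len(correct)

predictions = [[0.1, 0.8, 0.1],
               [0.6, 0.3, 0.1],
               [0.2, 0.2, 0.6]]
targets = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0]]
toy_accuracy = batch_accuracy(predictions, targets)
print(toy_accuracy)  # two of the three rows are classified correctly
```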


# Building a CNN
Let's do something a bit more advanced. How about a convolutional neural network (CNN)?
## Weight initialization
As we need to create a lot of weights and biases, we create two helper functions that will assist us. We initialize the weights with a truncated normal distribution for symmetry breaking and the biases with a small positive value to avoid "dead" neurons as we use ReLU. Note that the weights are initialized with a standard deviation of 0.1 rather than 1.0, which may seem surprising; it keeps the initial weights small.

In [22]:
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
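
`tf.truncated_normal` draws from a normal distribution but discards and re-draws any value more than two standard deviations from the mean. A minimal pure-Python sketch of that idea follows; the function below is our own illustration using rejection sampling, not TensorFlow's implementation.

```python
import random

def truncated_normal_samples(n, mean=0.0, stddev=0.1, seed=0):
    """Draw n normal samples, re-drawing any beyond 2 standard deviations."""
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        v = rng.gauss(mean, stddev)
        if abs(v - mean) <= 2 * stddev:  # keep only values near the mean
            values.append(v)
    return values

samples = truncated_normal_samples(1000)
print(min(samples), max(samples))  # both lie within [-0.2, 0.2]
```

With `stddev=0.1`, no initial weight has a magnitude above 0.2.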

## Convolution and pooling
Let's now also create helper functions for our convolution and pooling. We use a stride of one and zero-padding as well as max pooling over 2x2 blocks.

In [25]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
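
To illustrate what max pooling over 2x2 blocks with a stride of 2 does, here is a pure-Python sketch on a single 4x4 image with one channel. The helper name and the input are assumptions for illustration; it expects an even height and width.

```python
def max_pool_2x2_single(image):
    """Max pool one 2d image (list of rows) over non-overlapping 2x2 blocks."""
    height, width = len(image), len(image[0])
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, width, 2)]
            for r in range(0, height, 2)]

image = [[1, 2, 5, 6],
         [3, 4, 7, 8],
         [9, 10, 13, 14],
         [11, 12, 15, 16]]
pooled = max_pool_2x2_single(image)
print(pooled)  # [[4, 8], [12, 16]]
```

Each output entry is the maximum of one 2x2 block, so the height and width are halved.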

## First convolutional layer
Our first convolutional layer computes 32 features for each 5x5 patch. The weights thus have a shape of `[5, 5, 1, 32]`, with the patch size as the first two dimensions and the number of input channels and the number of output channels as the last two, respectively. We thus also have 32 bias units, one per output channel.

In [26]:
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

We reshape `x` to a 4d tensor, with the second and third dimensions corresponding to image width and height, and the final dimension corresponding to the number of color channels. We can then apply the convolution with the weight tensor, add the bias, apply the ReLU function, and then max pool.

In [27]:
x_image = tf.reshape(x, [-1,28,28,1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

## Second convolutional layer
We can now stack a second convolutional layer on top of the first one. This time, we compute 64 features for each 5x5 patch.

In [28]:
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

## Fully connected layer
We now add a fully connected layer with 1024 neurons on top. As the image size has been reduced to 7x7, we reshape the tensor from the pooling layer into a batch of vectors, multiply by a weight matrix, add a bias, and apply a ReLU.

In [29]:
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
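
The reduction to 7x7 can be checked with a few lines of arithmetic. With `'SAME'` padding, the output size along one spatial dimension is the input size divided by the stride, rounded up. The helper below is our own sketch of that rule.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a 'SAME'-padded convolution or pooling op."""
    return int(math.ceil(size / float(stride)))

size = 28
size = same_padding_output(size, 1)  # first convolution, stride 1
size = same_padding_output(size, 2)  # first 2x2 max pool, stride 2
size = same_padding_output(size, 1)  # second convolution, stride 1
size = same_padding_output(size, 2)  # second 2x2 max pool, stride 2
flat_size = size * size * 64         # 64 feature maps after the second layer
print(size, flat_size)  # 7 3136
```

The convolutions leave the image size unchanged and each pooling layer halves it, so 28 becomes 14 and then 7.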

## Dropout
We can now add dropout as well. We create a `placeholder` for the probability that a unit's output is kept, which allows us to turn dropout on during training (`keep_prob` below 1) and off during testing (`keep_prob` of 1).

In [30]:
keep_prob = tf.placeholder('float')
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
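
`tf.nn.dropout` keeps each element with probability `keep_prob` and scales the kept elements by `1 / keep_prob`, so the expected value of each output stays the same. Here is a minimal pure-Python sketch of that behaviour; the function and its inputs are our own illustration, not TensorFlow's implementation.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob, scaled by 1 / keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in values]

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(activations, 0.5, rng)  # each entry is 0.0 or doubled
test_out = dropout(activations, 1.0, rng)   # nothing is dropped or rescaled
print(train_out)
print(test_out)
```

Because of the rescaling, no extra correction is needed at test time: feeding a `keep_prob` of 1 simply passes the activations through.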

## Softmax layer
We now add a final softmax layer.

In [31]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
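
To get a feeling for the size of the model, we can count the parameters from the shapes passed to `weight_variable` and `bias_variable` above. This is a small sketch; the dictionary simply repeats those shapes.

```python
def num_elements(shape):
    """Number of entries in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

shapes = {
    "W_conv1": [5, 5, 1, 32],     "b_conv1": [32],
    "W_conv2": [5, 5, 32, 64],    "b_conv2": [64],
    "W_fc1": [7 * 7 * 64, 1024],  "b_fc1": [1024],
    "W_fc2": [1024, 10],          "b_fc2": [10],
}
total_params = sum(num_elements(s) for s in shapes.values())
print(num_elements(shapes["W_fc1"]))  # 3211264
print(total_params)                   # 3274634
```

Almost all of the parameters sit in the weight matrix of the fully connected layer.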

## Training and evaluation
We can now train our model, this time using the Adam optimizer, adding `keep_prob` to `feed_dict`, and logging the training accuracy every 100th iteration.

In [32]:
cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess.run(tf.initialize_all_variables())
for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i%100 == 0:
        train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g" % accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.14
step 100, training accuracy 0.8
step 200, training accuracy 0.9
step 300, training accuracy 0.9
step 400, training accuracy 0.96
step 500, training accuracy 0.92
step 600, training accuracy 0.94
step 700, training accuracy 0.98
step 800, training accuracy 0.96
step 900, training accuracy 1
step 1000, training accuracy 0.9
step 1100, training accuracy 0.96
step 1200, training accuracy 0.98
step 1300, training accuracy 0.98
step 1400, training accuracy 0.94
step 1500, training accuracy 0.96
step 1600, training accuracy 0.96
step 1700, training accuracy 1
step 1800, training accuracy 0.94
step 1900, training accuracy 0.98
step 2000, training accuracy 1
step 2100, training accuracy 0.98
step 2200, training accuracy 1
step 2300, training accuracy 0.94
step 2400, training accuracy 0.98
step 2500, training accuracy 0.98
step 2600, training accuracy 0.98
step 2700, training accuracy 1
step 2800, training accuracy 0.96
step 2900, training accuracy 1
step 3000, tra

Note: This takes a __very__ long time to train.