# 1. Build a multilayer convolutional network

Following the previous exercise, which used a `SoftMax network` to learn the MNIST dataset, here we build a convolutional neural network and train it on the same dataset.

This exercise is [the latter half of the MNIST for experts tutorial.](https://www.tensorflow.org/get_started/mnist/pros)

## 1.0 Import MNIST dataset

First we import the MNIST dataset and create nodes for the input images `x` and the output classes `y_`:

In [61]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)

import tensorflow as tf
x = tf.placeholder(tf.float32, shape = [None, 784])
y_ = tf.placeholder(tf.float32, shape = [None, 10])

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


The value **`None`** in the shape of `x` indicates that the first dimension, corresponding to the **batch size**, can be of any size. The second dimension, 784, is the number of pixels in a flattened 28x28 image.

The shape argument to placeholder is optional, but it allows TensorFlow to automatically **catch bugs stemming from inconsistent tensor shapes.**
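
To make the role of `None` concrete, here is a minimal sketch of the kind of shape check described above. The helper `shape_matches` is hypothetical and written only for illustration; it is not a TensorFlow function, and TensorFlow's own shape checking is more elaborate.

```python
def shape_matches(declared, actual):
    """Return True if `actual` is compatible with `declared`.

    A `None` entry in `declared` matches any size, mirroring how the
    batch dimension of the placeholder `x` is left open.
    """
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


x_shape = [None, 784]  # same shape as the placeholder `x`

print(shape_matches(x_shape, [50, 784]))     # a training batch of 50 images
print(shape_matches(x_shape, [10000, 784]))  # the full MNIST test set
print(shape_matches(x_shape, [50, 28, 28]))  # wrong rank, so it is rejected
```

Any batch size passes, but a tensor with the wrong rank or the wrong number of pixels does not.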



## 1.1 Initializations and defining functions

### 1.1.1 Weight initialization

To create a convolutional neural network, we need to create a lot of weights and biases. The weights should be initialized with a small amount of noise to break symmetry and to prevent 0 gradients.

Because we're using **[ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) as the activation function (not softmax this time),** it is also good practice to initialize the biases with a **slightly positive value to avoid dead neurons.**

In [62]:
def weight_variable(shape):
    # Initialize with random variables distributed as truncated normal distribution with mean 0 and standard deviation 0.1
    initial = tf.truncated_normal(shape,stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    # Initialize with constant 0.1
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
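
As a rough illustration of what a truncated normal initializer produces, the following sketch draws samples in plain Python and re-draws any value that falls more than two standard deviations from the mean. The function name `truncated_normal_samples` and the fixed seed are assumptions made for this example; it is not how TensorFlow implements the op internally.

```python
import random


def truncated_normal_samples(n, stddev=0.1, mean=0.0, seed=0):
    """Draw n samples from N(mean, stddev^2), rejecting any sample that
    lies more than two standard deviations away from the mean."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        value = rng.gauss(mean, stddev)
        if abs(value - mean) <= 2 * stddev:
            samples.append(value)
    return samples


weights = truncated_normal_samples(1000)  # small random values break symmetry
biases = [0.1] * 32                       # slightly positive, as in bias_variable

print(min(weights), max(weights))         # all values lie within [-0.2, 0.2]
```

With a standard deviation of 0.1, every weight stays within 0.2 of zero, which avoids the occasional large outlier that an untruncated normal would produce.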

### 1.1.2 Convolution and pooling

We use a stride of one and zero padding (`padding='SAME'`) so that the output is the same size as the input. This holds regardless of the filter size, which is 5x5 in the layers below.

Our pooling is plain old max pooling over 2x2 blocks. To keep our code cleaner, let's also abstract those operations into functions:

In [63]:
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1,1,1,1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1,2,2,1], strides=[1,2,2,1], padding="SAME")
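
The spatial sizes produced by these two helpers can be checked with a little arithmetic. This is a minimal sketch, assuming the usual rule that `'SAME'` padding yields `ceil(input / stride)` along each spatial dimension; the function names are hypothetical, and the `'VALID'` case is included only for comparison.

```python
import math


def same_output_size(input_size, stride):
    """Spatial output size under 'SAME' padding: ceil(input_size / stride).
    The filter size does not appear in the formula."""
    return math.ceil(input_size / stride)


def valid_output_size(input_size, filter_size, stride):
    """Spatial output size under 'VALID' (no) padding, for comparison."""
    return (input_size - filter_size) // stride + 1


after_conv = same_output_size(28, stride=1)          # conv2d keeps 28
after_pool = same_output_size(after_conv, stride=2)  # max_pool_2x2 halves it

print(after_conv, after_pool)
print(valid_output_size(28, filter_size=5, stride=1))  # would shrink without padding
```

A 28x28 input therefore stays 28x28 after `conv2d` and becomes 14x14 after `max_pool_2x2`, whereas an unpadded 5x5 convolution would shrink it to 24x24.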

## 1.2 Building convolutional neural networks

### 1.2.1 First convolutional layer

We can now implement the first convolutional layer. **In this network, each convolutional layer consists of a convolution followed by max pooling.**

The convolution will compute **32 features for each 5x5 patch**, and therefore, its **weight tensor** will have a **shape of [5,5,1,32]:** the first two dimensions are **the patch size**, the next is **the number of input channels**, and the last is **the number of output channels.**

In [64]:
W_conv1 = weight_variable([5,5,1,32])

And we need a bias vector with **a component for each output channel:**

In [65]:
b_conv1 = bias_variable([32])

To apply the layer, we first reshape the input `x` to a 4d tensor with shape [-1,28,28,1]: the second and third dimensions correspond to the image height and width (28x28), and the final dimension corresponds to the number of color channels (1 in our case). The first dimension is the batch size; the special value `-1` tells `tf.reshape` to infer it from the total number of elements, so any batch size works.

In [66]:
x_image = tf.reshape(x, [-1, 28, 28, 1])
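
The `-1` in the reshape stands for "whatever size makes the element count work out". The sketch below mimics that inference in plain Python; `infer_shape` is a hypothetical helper written for this example, not part of TensorFlow.

```python
def infer_shape(num_elements, shape):
    """Replace a single -1 in `shape` with the size implied by num_elements."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if num_elements % known != 0:
        raise ValueError("cannot reshape %d elements into %r" % (num_elements, shape))
    if -1 not in shape and known != num_elements:
        raise ValueError("cannot reshape %d elements into %r" % (num_elements, shape))
    return [num_elements // known if dim == -1 else dim for dim in shape]


# A batch of 50 flattened images holds 50 * 784 values.
print(infer_shape(50 * 784, [-1, 28, 28, 1]))
# The 10000 test images work with exactly the same target shape.
print(infer_shape(10000 * 784, [-1, 28, 28, 1]))
```

In both cases the first dimension resolves to the batch size, which is why the same graph can be fed training batches and the full test set.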

We then convolve the input `x_image` with the weight tensor, add the bias, apply the `ReLU` function, and finally max pool.

In [67]:
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

As a result of max pooling, each feature map shrinks from 28x28 to 14x14, so `h_pool1` has shape [batch, 14, 14, 32].
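
To see what 2x2 max pooling does to a single feature map, here is a small pure-Python sketch on a 4x4 example. The function `max_pool_2x2_lists` is a hypothetical stand-in that works on nested lists with even height and width; it ignores the batch and channel dimensions that `tf.nn.max_pool` handles.

```python
def max_pool_2x2_lists(image):
    """2x2 max pooling with stride 2 on a 2-D list of even height and width."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1],
                image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool_2x2_lists(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

Each output value is the largest entry of one non-overlapping 2x2 block, so both the height and the width are halved.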

### 1.2.2 Second convolutional layer

We stack multiple layers to build a deep network. The second layer will have 64 features for each 5x5 patch.

In [68]:
W_conv2 = weight_variable([5,5,32,64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

### 1.2.3 Densely connected layer

Now that the image size has been reduced to 7x7 (because we applied `max_pool_2x2` twice to a 28x28 image), **we add a fully-connected layer with 1024 neurons to allow processing on the entire image.**

In [69]:
W_fc1 = weight_variable([7*7*64, 1024])
b_fc1 = bias_variable([1024])

We first reshape the tensor from the pooling layer into a batch of vectors. Then we multiply it by a weight matrix `W_fc1`, add a bias `b_fc1`, and then apply `ReLU`.

In [70]:
h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
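
The flatten-then-multiply step can be illustrated on a toy example. In this sketch a 2x2 image with 2 channels stands in for the 7x7x64 pooled output, and the helpers `flatten` and `dense_relu` are hypothetical names; the weights are arbitrary values chosen so that one output is positive and one is clipped by ReLU.

```python
def flatten(feature_maps):
    """Flatten one example of shape [height][width][channels] into a vector,
    in row-major order."""
    return [value for row in feature_maps for pixel in row for value in pixel]


def dense_relu(vector, weights, biases):
    """Compute relu(vector @ weights + biases) for a single example."""
    outputs = []
    for j, bias in enumerate(biases):
        total = bias + sum(v * weights[i][j] for i, v in enumerate(vector))
        outputs.append(max(0.0, total))
    return outputs


example = [[[1.0, 2.0], [3.0, 4.0]],
           [[5.0, 6.0], [7.0, 8.0]]]

flat = flatten(example)                    # 2 * 2 * 2 = 8 values
weights = [[0.1, -1.0] for _ in range(8)]  # 8 inputs, 2 output neurons
biases = [0.1, 0.1]

hidden = dense_relu(flat, weights, biases)
print(len(flat), hidden)
```

In the real layer the flattened vector has 7 * 7 * 64 = 3136 entries and there are 1024 output neurons, but the computation has the same form.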

### <a name="dropout">1.2.4 Dropout </a>

[To reduce overfitting](https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks-Part-2/), we apply dropout before the readout layer. We use a `placeholder` for the **probability that a neuron's output is kept** during dropout. This allows us to **turn dropout on during training, and turn it off during testing.**

TensorFlow's `tf.nn.dropout` op automatically handles scaling neuron outputs in addition to masking them, so dropout just works without any additional scaling. (Dropout is most useful for very large neural networks; for the small convolutional network used throughout this exercise, it makes little difference.)

In [71]:
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
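
The masking and scaling can be sketched in a few lines of plain Python. This is an illustrative version of "inverted" dropout, in which kept values are divided by `keep_prob` so that the expected output matches the input; the function name `inverted_dropout` and the fixed seed are assumptions for this example.

```python
import random


def inverted_dropout(values, keep_prob, seed=0):
    """Keep each value with probability keep_prob and scale the kept values
    by 1 / keep_prob; dropped values become 0."""
    rng = random.Random(seed)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


activations = [1.0] * 10000

train_out = inverted_dropout(activations, keep_prob=0.5)  # training setting
test_out = inverted_dropout(activations, keep_prob=1.0)   # evaluation setting

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(test_out == activations)          # nothing is dropped or rescaled
```

With `keep_prob` set to 0.5, roughly half the activations are zeroed and the rest are doubled, so the average stays near the original value. With `keep_prob` set to 1.0 the layer passes its input through unchanged, which is why feeding 1.0 turns dropout off.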

### 1.2.5 Readout layer

Lastly, we add a readout layer, just like the one used for the one-layer softmax regression. It maps the 1024 features to 10 class scores (logits).

In [72]:
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2
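
With all four layers defined, it can be useful to count the trainable parameters implied by the weight and bias shapes used above. The following sketch does that arithmetic in plain Python; the dictionary layout and the helper `num_elements` are just conveniences for this example.

```python
def num_elements(shape):
    """Number of scalar entries in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


# (weight shape, bias shape) for each layer, copied from the cells above.
layer_shapes = {
    "conv1": ([5, 5, 1, 32], [32]),
    "conv2": ([5, 5, 32, 64], [64]),
    "fc1": ([7 * 7 * 64, 1024], [1024]),
    "fc2": ([1024, 10], [10]),
}

param_counts = {
    name: num_elements(w) + num_elements(b)
    for name, (w, b) in layer_shapes.items()
}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print(name, count)
print("total", total_params)
```

Almost all of the roughly 3.3 million parameters sit in the first fully-connected layer; the two convolutional layers together account for only about 52 thousand.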

## 1.3 Train and evaluate the model

We train and evaluate the model in much the same way as the single-layer `SoftMax network` from the previous exercise. The differences are:

1. here we use the `ADAM optimizer`, which is more sophisticated than the steepest gradient descent optimizer (and we use a learning rate of 1e-4 instead of the 0.5 used for the `SoftMax network`),
2. `feed_dict` takes an additional entry, `keep_prob`, to control the [dropout keep probability](#dropout), and
3. we log the training accuracy every 100th iteration.

Also, we use `tf.Session` instead of `tf.InteractiveSession` to better separate the process of creating the graph (i.e., model specification) from the process of evaluating the graph (model fitting).
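
To give a feel for how ADAM differs from plain gradient descent, here is a sketch of a single update for one scalar parameter, following the textbook form of the algorithm. The hyperparameter defaults `beta1`, `beta2` and `eps` are the commonly quoted ones and are assumptions here; TensorFlow's `AdamOptimizer` may arrange the arithmetic slightly differently.

```python
import math


def sgd_step(theta, grad, lr=0.5):
    """Plain gradient descent: the step is proportional to the gradient."""
    return theta - lr * grad


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a scalar parameter at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


small, _, _ = adam_step(1.0, grad=2.0, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(1.0, grad=200.0, m=0.0, v=0.0, t=1)

print(sgd_step(1.0, grad=2.0))  # moves by lr * grad
print(small, large)             # both move by roughly lr, whatever the gradient scale
```

Because the step is normalized by the running magnitude of the gradient, the first update has a size of about the learning rate regardless of how large the gradient is. That is one reason a much smaller learning rate such as 1e-4 is appropriate here.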

In [74]:
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    # The `with ... as sess` form closes the session automatically once the block is exited.
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        batch = mnist.train.next_batch(50)
        
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x:batch[0], y_:batch[1], keep_prob:1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
            
        train_step.run(feed_dict={x:batch[0], y_:batch[1], keep_prob: 0.5})
    
    print('test accuracy %g' % accuracy.eval(feed_dict={x:mnist.test.images, y_:mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.06


KeyboardInterrupt: 
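
The `cross_entropy` and `accuracy` nodes in the training cell can be mirrored in plain Python to show what they compute. This is a minimal sketch for illustration; the function names are hypothetical, and the loss is written in a numerically stable log-sum-exp form rather than copied from TensorFlow's implementation.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_accuracy(batch_logits, batch_labels):
    """Fraction of examples whose predicted class matches the label."""
    hits = [argmax(z) == argmax(y) for z, y in zip(batch_logits, batch_labels)]
    return sum(hits) / len(hits)


label_3 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

uniform_logits = [0.0] * 10  # an uninformative prediction
confident_logits = [0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

print(softmax_cross_entropy(uniform_logits, label_3))    # equals ln(10)
print(softmax_cross_entropy(confident_logits, label_3))  # close to 0
print(batch_accuracy([confident_logits, uniform_logits], [label_3, label_3]))
```

An uninformative prediction over 10 classes gives a loss of ln(10), about 2.3, and an untrained network should be right about one time in ten, which is in line with the low training accuracy logged at step 0.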

In [75]:
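# Entire code in one script

from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

# Load MNIST and define the placeholders for images, labels
mnist = input_data.read_data_sets('MNIST_data/', one_hot=True)
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
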
def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)

def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')

W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])

x_image = tf.reshape(x, [-1, 28, 28, 1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv = tf.matmul(h_fc1_drop, W_fc2) + b_fc2

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(20000):
        batch = mnist.train.next_batch(50)
        if i % 100 == 0:
            train_accuracy = accuracy.eval(feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print('step %d, training accuracy %g' % (i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

    print('test accuracy %g' % accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

step 0, training accuracy 0.14
step 100, training accuracy 0.84
step 200, training accuracy 0.88


KeyboardInterrupt: 