# COMP90051 Workshop 6
## Convolutional neural net (CNN) in TensorFlow
***
In the previous worksheet, we implemented a logistic regression classifier for the MNIST data set in TensorFlow, which achieved a test accuracy of 92%. 
In this worksheet, we hope to improve upon this accuracy by implementing a convolutional neural network (CNN)—a model that is more naturally suited to image data.
We'll assume familiarity with the TensorFlow fundamentals covered previously.
By the end of this worksheet you should be able to:
* build more complex computation graphs
* apply composite operators (e.g. those available under [`tf.layers`](https://www.tensorflow.org/api_docs/python/tf/layers))
* monitor computations in [TensorBoard](https://www.tensorflow.org/guide/summaries_and_tensorboard) (a web app that's included with TensorFlow)

*Note: this worksheet draws on material from the following tutorials: [link 1](https://www.tensorflow.org/tutorials/estimators/cnn) and [link 2](https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist/).*

Let's begin by importing the required packages.

In [1]:
%matplotlib inline
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import os

### 1. Resuming from Worksheet 5
We're going to use the same MNIST data set as in Worksheet 5, so that we can compare the accuracy of the CNN with logistic regression.

In Worksheet 5, we unrolled the 2D image arrays into feature vectors, as was required for logistic regression. However, here we leave the image arrays intact, as the CNN assumes images as input (it exploits spatial locality between the pixels). 
We again apply a rescaling transformation.

In [2]:
from tensorflow.keras.datasets import mnist
(images_train, labels_train), (images_test, labels_test) = mnist.load_data()

# Rescale
images_train = images_train.astype('float32')/255
images_test = images_test.astype('float32')/255
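
As a rough illustration of the difference between the two input layouts, the NumPy sketch below rescales a small synthetic batch (random pixels standing in for MNIST, so `fake_images` is purely hypothetical) and compares the unrolled form used in Worksheet 5 with the image form used here.

```python
import numpy as np

# Synthetic stand-in for a batch of MNIST images: 4 images of 28x28 uint8 pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Rescale to floats on [0, 1], as in the cell above.
scaled = fake_images.astype('float32') / 255

# Worksheet 5 style: unroll each image into a 784-dimensional feature vector.
flat = scaled.reshape(len(scaled), -1)

# CNN style: keep the 2D layout and add a trailing channel axis (1 = greyscale).
with_channel = scaled[..., np.newaxis]

print(flat.shape)          # (4, 784)
print(with_channel.shape)  # (4, 28, 28, 1)
```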

Below we define some constants related to the data set.

In [3]:
IM_WIDTH = images_train.shape[1]      # size of image axis 1 in pixels (MNIST images are square: 28x28)
IM_HEIGHT = images_train.shape[2]     # size of image axis 2 in pixels
NUM_CLASSES = 10                      # number of classes (0-9)

We again make use of the `DatasetIterator` defined in Worksheet 5, which provides an interface for drawing randomised mini-batches from the training set.
Note that we continue to use a batch size of 100 (you may consider changing this).

In [4]:
class DatasetIterator:
    """
    An iterator that returns randomized batches from a data set (with features and labels)
    """
    def __init__(self, features, labels, batch_size):
        assert(features.shape[0]==labels.shape[0])
        assert(batch_size > 0 and batch_size <= features.shape[0])
        self.features = features
        self.labels = labels
        self.num_instances = features.shape[0]
        self.batch_size = batch_size
        self.num_batches = self.num_instances//self.batch_size
        if (self.num_instances%self.batch_size!=0):
            self.num_batches += 1
        self._i = 0
        self._rand_ids = None

    def __iter__(self):
        self._i = 0
        self._rand_ids = np.random.permutation(self.num_instances)
        return self
    
    def next(self):
        # Python 2 style alias for __next__
        return self.__next__()
    
    def __next__(self):
        if self.num_instances - self._i >= self.batch_size:
            this_rand_ids = self._rand_ids[self._i:self._i + self.batch_size]
            self._i += self.batch_size
            return self.features[this_rand_ids], self.labels[this_rand_ids]
        elif self.num_instances - self._i > 0:
            this_rand_ids = self._rand_ids[self._i::]
            self._i = self.num_instances
            return self.features[this_rand_ids], self.labels[this_rand_ids]
        else:
            raise StopIteration()
            
batch_size = 100
train_iterator = DatasetIterator(images_train, labels_train, batch_size)
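
To see what the iterator produces, here is a minimal generator-based sketch of the same idea (not the class above; `iterate_minibatches` and the toy arrays are hypothetical names used only for illustration). Each pass shuffles the instance indices once and yields consecutive slices, so the final batch is smaller when the batch size does not divide the data set size.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, rng):
    """Yield randomised mini-batches; the last batch may be smaller."""
    num_instances = features.shape[0]
    order = rng.permutation(num_instances)
    for start in range(0, num_instances, batch_size):
        ids = order[start:start + batch_size]
        yield features[ids], labels[ids]

# Hypothetical toy data: 250 instances with 3 features each.
rng = np.random.default_rng(0)
toy_features = np.arange(250 * 3).reshape(250, 3)
toy_labels = np.arange(250)

batch_sizes = [len(y) for _, y in
               iterate_minibatches(toy_features, toy_labels, 100, rng)]
print(batch_sizes)  # [100, 100, 50]
```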

### 2. Placeholders for data input
Following Worksheet 5, we again define placeholders for inputting data (images + labels) into the graph.
This time we group the placeholders for the images and labels under a [variable scope](https://www.tensorflow.org/api_docs/python/tf/variable_scope) called `'input'`.
By using variable scopes, we can simplify the graph visualisation in TensorBoard.

In [5]:
with tf.variable_scope('input'):
    X = tf.placeholder(dtype=tf.float32, shape=[None, IM_WIDTH, IM_HEIGHT], name='images')
    Y = tf.placeholder(dtype=tf.int32, shape=[None,], name='labels')

### 3. CNN architecture
Due to hardware and time constraints, we must limit the size of our CNN; otherwise it will take too long to train.

For the convolutional layers, we follow the "convolutional pyramid" design principle—i.e. successive layers have decreasing spatial dimensions, but increasing depth. (This architecture is biologically motivated.)
The reduction in the spatial dimensions is achieved through max pooling.

After the convolutional layers, we add two densely-connected layers which combine the higher-level features to make a classification.
We also make use of dropout (a regularization method whereby randomly chosen units are temporarily removed from the network during training) to prevent overfitting.
Note that the final layer is similar to the logistic regression model (although the input differs).
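
The following NumPy sketch shows one common formulation of dropout ("inverted" dropout), in which the surviving units are scaled by 1 / (1 - rate) during training so that nothing needs to change at inference time. It is an illustration under that assumption, not the TensorFlow implementation; the function name and arguments are hypothetical.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each unit with probability `rate` during training."""
    if not training:
        return activations  # inference: all units kept, no rescaling needed
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the surviving units so the expected activation is unchanged.
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 256))

train_out = dropout(activations, rate=0.4, training=True, rng=rng)
test_out = dropout(activations, rate=0.4, training=False, rng=rng)

print((train_out == 0).mean())  # fraction of dropped units, close to 0.4
print(train_out.mean())         # close to 1.0, thanks to the rescaling
```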

**Exercise (Advanced/Optional):** If you're interested in learning more about dropout (not examinable), you may like to read the following paper:
> Srivastava et al. "Dropout: a simple way to prevent neural networks from overfitting." JMLR 15.1 (2014): 1929-1958. [link](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)

**Architecture overview**
1. *Convolutional Layer #1* | 8 5×5 filters with a stride of 1 and a ReLU activation function.
2. *Pooling Layer #1* | Max pooling with a 2×2 filter and stride of 2 (implies pooled regions do not overlap).
3. *Convolutional Layer #2* | 16 5×5 filters with a stride of 1 and a ReLU activation function.
4. *Pooling Layer #2* | Same specs as pooling layer #1.
5. *Dense Layer #1* | 256 neurons, with dropout regularization rate of 0.4 (probability of 0.4 that any given element will be dropped during training).
6. *Dense Layer #2* | Logits Layer. 10 neurons, one for each digit target class (0–9).
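
To make the pooling step concrete, the sketch below applies 2×2 max pooling with a stride of 2 to a single small feature map in NumPy. It is a simplified illustration (one channel, even dimensions only), and `max_pool_2x2` is a hypothetical helper rather than part of the model code.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 on a (height, width) array."""
    height, width = feature_map.shape
    assert height % 2 == 0 and width % 2 == 0
    # Split into non-overlapping 2x2 blocks, then take the max of each block.
    blocks = feature_map.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 2, 5, 6],
                        [3, 4, 7, 8],
                        [9, 1, 0, 2],
                        [1, 1, 3, 4]])
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 8]
#  [9 4]]
```

Because the windows do not overlap, each spatial dimension is halved, which is how a 28×28 map becomes 14×14.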

In [15]:
DEPTH_C1 = 8       # depth of convolutional layer #1
DEPTH_C2 = 16      # depth of convolutional layer #2
UNITS_D1 = 256     # number of neurons in dense layer #1

Fill in the missing parts of the model (Convolutional Layer #2 and Pooling Layer #2) below.

In [7]:
with tf.variable_scope('cnn_model'):
    # Boolean placeholder which is set to True for training, and False for inference.
    # This is required to implement dropout. 
    training_mode = tf.placeholder(dtype=tf.bool, name='training_mode')
    
    # Input Layer
    input_layer = tf.reshape(X, [-1, IM_WIDTH, IM_HEIGHT, 1])

    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(inputs=input_layer, filters=DEPTH_C1, kernel_size=[5, 5], 
                             padding='same', activation=tf.nn.relu, use_bias=True, 
                             name='conv_layer_1')

    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2, 
                                    name='pool_layer_1')

    # Convolutional Layer #2 
    conv2 = tf.layers.conv2d(inputs=pool1, filters=DEPTH_C2, kernel_size=[5, 5], 
                             padding='same', activation=tf.nn.relu, use_bias=True,
                             name='conv_layer_2') # fill in

    # Pooling Layer #2
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2, 
                                    name='pool_layer_2') # fill in

    # Dense Layer #1
    pool2_flat = tf.reshape(pool2, shape=[-1, 7*7*DEPTH_C2], name='pool_layer_2_flat')
    dense = tf.layers.dense(inputs=pool2_flat, units=UNITS_D1, activation=tf.nn.relu, 
                            name='dense_layer_1')
    dropout = tf.layers.dropout(inputs=dense, rate=0.4, training=training_mode, name='dropout')

    # Dense Layer #2 (Logits Layer)
    logits = tf.layers.dense(inputs=dropout, units=NUM_CLASSES, use_bias=True,
                             name='dense_layer_2')
    
    # Predicted labels
    predictions = tf.argmax(logits, axis=1)

**Question:** What is the shape of the tensor output at each layer? Assume a single training instance is passed through the network. It may be helpful to review the lecture slides describing convolutional and max pooling layers.
*(Hint: the `padding='same'` option for `tf.layers.conv2d` adds a border of zeros around the input so that, for a stride of 1, the width/height of the output equals the width/height of the input.)*

**Answer:** 
* *Output of Convolutional Layer #1: (1, 28, 28, 8).*
* *Output of Pooling Layer #1: (1, 14, 14, 8).*
* *Output of Convolutional Layer #2: (1, 14, 14, 16).*
* *Output of Pooling Layer #2: (1, 7, 7, 16). This is then flattened to (1, 784).*
* *Output of Dense Layer #1: (1, 256)*
* *Output of Dense Layer #2: (1, 10)*
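
The shapes in the answer can be checked with a few lines of plain Python. This is only a sketch of the shape arithmetic, assuming 'same' padding with stride 1 for the convolutions and unpadded 2×2 pooling with stride 2; the helper names are hypothetical.

```python
def conv_same_shape(shape, filters):
    """'same' padding with stride 1 keeps height/width; depth becomes `filters`."""
    batch, height, width, _ = shape
    return (batch, height, width, filters)

def max_pool_shape(shape, pool=2, stride=2):
    """Unpadded max pooling shrinks height/width; depth is unchanged."""
    batch, height, width, depth = shape
    return (batch, (height - pool) // stride + 1,
            (width - pool) // stride + 1, depth)

shape = (1, 28, 28, 1)              # one greyscale MNIST image
conv1 = conv_same_shape(shape, 8)
pool1 = max_pool_shape(conv1)
conv2 = conv_same_shape(pool1, 16)
pool2 = max_pool_shape(conv2)
flat = (pool2[0], pool2[1] * pool2[2] * pool2[3])

for name, s in [('conv1', conv1), ('pool1', pool1), ('conv2', conv2),
                ('pool2', pool2), ('flat', flat)]:
    print(name, s)
```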

### 4. Minimizing the empirical loss
To measure the discrepancy between the predicted class distribution and the true labels, we use the softmax cross entropy—the same loss we used for logistic regression in Worksheet 5.

Fill in the blank below to calculate the loss from the true labels `Y` and the output of the network `logits`.
(Hint: The built-in losses can be found under the [`tf.losses`](https://www.tensorflow.org/api_docs/python/tf/losses) namespace.)

In [8]:
with tf.variable_scope('loss'):
    loss = tf.losses.sparse_softmax_cross_entropy(labels=Y, logits=logits) # fill in
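
For intuition about what this loss computes, here is a NumPy sketch of softmax cross entropy with integer ("sparse") labels, averaged over the batch. It is a hand-written illustration rather than the TensorFlow implementation, and the averaging is an assumption of this sketch.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Mean cross entropy between integer labels and softmax(logits)."""
    # Subtract the row-wise max before exponentiating, for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to the true class of each instance.
    true_class_log_probs = log_probs[np.arange(len(labels)), labels]
    return -true_class_log_probs.mean()

logits = np.zeros((2, 10))  # all-zero logits: uniform over the 10 classes
labels = np.array([3, 7])
uniform_loss = sparse_softmax_cross_entropy(labels, logits)
print(uniform_loss)  # log(10), about 2.3026
```

A loss near log(10) is therefore what one would expect from a network that has not yet learned anything.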

To minimize the loss, we'll use the built-in Adam optimizer (in practice it often converges more rapidly than plain gradient descent).
Note: we've defined the `global_step` variable to keep track of how many parameter updates have been performed.

In [9]:
with tf.variable_scope('train'):
    opt = tf.train.AdamOptimizer(learning_rate=0.001)
    global_step = tf.Variable(0, name='global_step', trainable=False)
    train_op = opt.minimize(loss=loss, global_step=global_step)
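
As a rough guide to what the optimizer does on each parameter update, the sketch below implements the textbook Adam update (as described by Kingma and Ba) for a single scalar parameter. TensorFlow's implementation may differ in details, and the toy loss is hypothetical.

```python
import math

def adam_minimize(grad_fn, x, num_steps, learning_rate=0.001,
                  beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Textbook Adam on a single scalar parameter."""
    m, v = 0.0, 0.0
    for step in range(1, num_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** step)        # bias correction
        v_hat = v / (1 - beta2 ** step)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x

def toy_loss(x):
    return (x - 3.0) ** 2   # hypothetical toy loss with its minimum at x = 3

def toy_grad(x):
    return 2.0 * (x - 3.0)

x_after_1 = adam_minimize(toy_grad, x=0.0, num_steps=1)
x_after_200 = adam_minimize(toy_grad, x=0.0, num_steps=200)
print(x_after_1)                             # about 0.001
print(toy_loss(0.0), toy_loss(x_after_200))  # the loss decreases
```

Note that the first step has a size of roughly `learning_rate`, regardless of the scale of the gradient.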

### 5. Evaluation and TensorBoard summaries
As in Worksheet 5, we'll use accuracy to evaluate the CNN.
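When using the built-in [`tf.metrics.accuracy`](https://www.tensorflow.org/api_docs/python/tf/metrics/accuracy) implementation, `acc_op` must be run to update the accuracy. If only `acc` is run, an out-of-date value (computed from internal local variables) may be returned. Note that the metric is cumulative: it keeps running counts over every call to `acc_op`, so the reported value is an average over all batches evaluated since the local variables were last initialised.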

Since we're going to use TensorBoard to monitor training progress, we need to define some Summary operations (available under the [`tf.summary`](https://www.tensorflow.org/api_docs/python/tf/summary) namespace).
Below we define `loss_summary` and `acc_summary` to monitor the loss and accuracy.
Then we merge the summaries into a single Summary operation `eval_summaries` (for simplicity).

In [10]:
with tf.variable_scope('evaluation'):
    acc, acc_op = tf.metrics.accuracy(labels=Y, predictions=predictions, name='accuracy')
    loss_summary = tf.summary.scalar('loss', loss)
    acc_summary = tf.summary.scalar('accuracy', acc)
    eval_summaries = tf.summary.merge([loss_summary, acc_summary])
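
The distinction between `acc` and `acc_op` can be mimicked in plain Python. The class below is a hypothetical analogue (not TensorFlow code) that keeps `total` and `count` counters: `update` plays the role of `acc_op` and `value` plays the role of `acc`.

```python
class StreamingAccuracy:
    """Running accuracy over all batches seen since the last reset."""
    def __init__(self):
        self.total = 0.0   # number of correct predictions so far
        self.count = 0.0   # number of predictions so far

    def update(self, labels, predictions):
        """Plays the role of `acc_op`: update the counters, return the new value."""
        self.total += sum(int(y == p) for y, p in zip(labels, predictions))
        self.count += len(labels)
        return self.value()

    def value(self):
        """Plays the role of `acc`: read the counters without updating them."""
        return self.total / self.count if self.count > 0 else 0.0

metric = StreamingAccuracy()
print(metric.update([1, 2, 3, 4], [1, 2, 3, 0]))  # 0.75
print(metric.value())                             # 0.75 (unchanged until next update)
print(metric.update([5, 6], [5, 6]))              # 5/6: a running average, not 1.0
```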

We'd also like to monitor some of the filters (a.k.a. kernels) in the first convolutional layer of the network. These will show up in the 'Images' tab in TensorBoard.
To do this we:
* extract the kernel tensor from `conv_layer_1`
* rescale the kernel tensor so that all values are on the unit interval
* transpose the kernel tensor so that the depth dimension is first
* define an image Summary operator

In [11]:
with tf.variable_scope('cnn_model/conv_layer_1', reuse=True):
    kernel = tf.get_variable('kernel')
    with tf.variable_scope('visualization'):
        # scale weights to [0, 1]
        x_min = tf.reduce_min(kernel)
        x_max = tf.reduce_max(kernel)
        kernel_0_to_1 = (kernel - x_min) / (x_max - x_min)

        # to tf.summary.image format
        kernel_transposed = tf.transpose(kernel_0_to_1, [3, 0, 1, 2])

        # this will display 5 filters from the 8 in conv_layer_1
        filter_summary = tf.summary.image('filters', kernel_transposed, max_outputs=5)
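
The rescale-and-transpose steps can be tried out in NumPy on a random array with the same layout as the kernel, (filter height, filter width, input channels, output channels). The array here is a hypothetical stand-in, not the trained weights.

```python
import numpy as np

# Hypothetical stand-in for the conv_layer_1 kernel:
# (filter_height, filter_width, in_channels, out_channels) = (5, 5, 1, 8).
rng = np.random.default_rng(0)
kernel = rng.normal(size=(5, 5, 1, 8))

# Min-max rescale so every weight lies on [0, 1].
kernel_0_to_1 = (kernel - kernel.min()) / (kernel.max() - kernel.min())

# Move the output-channel axis to the front: one 5x5x1 "image" per filter.
kernel_transposed = np.transpose(kernel_0_to_1, (3, 0, 1, 2))

print(kernel_transposed.shape)  # (8, 5, 5, 1)
```

After the transpose, the array has the (batch, height, width, channels) layout that an image summary expects, with each filter treated as one greyscale image.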

### 6. Running TensorBoard

Before opening a session to train the CNN, you should start TensorBoard so that you can monitor progress.

To do this on the lab machine:

1. Start an Anaconda Prompt in the `workshop06` directory and run the command: `python -m tensorboard.main --logdir=mnist_log --host=localhost`
2. Navigate to http://localhost:6006/ in your web browser.
3. If successful, you should see the following web page. Later on, this will be populated with useful info.

![TensorBoard](https://screenshotscdn.firefoxusercontent.com/images/8941dbec-7dfb-4e5a-b015-225345f7615f.png)

### 7. Training

We can finally start training the CNN. 
Below we specify the log directory for TensorBoard and the number of epochs (full sweeps through the training data).
You'll soon see that training is slow on the CPU, so we're limited to a small number of epochs.

In [12]:
LOG_DIR = os.path.join(os.curdir, 'mnist_log')
NUM_EPOCHS = 5
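
It can help to work out how many parameter updates these settings imply. The sketch below assumes the standard MNIST training set of 60,000 images and the batch size of 100 used for the `DatasetIterator`; the evaluation intervals are those used in the training loop further down.

```python
import math

num_train = 60000    # size of the MNIST training set
batch_size = 100     # as set for the DatasetIterator
num_epochs = 5       # NUM_EPOCHS above

steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * num_epochs

# Evaluations triggered by the training loop: the full training set every
# 100 steps, and the full test set every 10 steps.
train_evals = total_steps // 100
test_evals = total_steps // 10

print(steps_per_epoch)  # 600
print(total_steps)      # 3000
print(train_evals)      # 30
print(test_evals)       # 300
```

The frequent full-data evaluations account for a good share of the running time on a CPU, so increasing the intervals is one way to speed things up.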

We need to create an initializer for the global and local variables.

In [13]:
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())

And open a session to run operations on the graph.

In [14]:
with tf.Session() as sess:
    sess.run(init)
    # Instantiate writers for TensorBoard (for saving serialized summaries to disk)
    train_summary_writer = tf.summary.FileWriter(os.path.join(LOG_DIR, 'train'), sess.graph)
    test_summary_writer = tf.summary.FileWriter(os.path.join(LOG_DIR, 'test'), sess.graph)
    
    # Run optimizer for multiple epochs
    for epoch in range(NUM_EPOCHS):
        print("Starting epoch {}.".format(epoch))
        for X_batch, Y_batch in train_iterator:
            # Run a training step
            _, step = sess.run([train_op, global_step],
                               feed_dict={X: X_batch, Y: Y_batch, training_mode: True})
            # Every 100 batches compute the accuracy on the training set and save the filters in the first convolutional layer
            if (step % 100 == 0 and step > 0):
                train_accuracy, eval_s, filter_s = sess.run([acc_op, eval_summaries, filter_summary], 
                                  feed_dict={X: images_train, Y: labels_train, training_mode: False})
                train_summary_writer.add_summary(eval_s, global_step=step)
                train_summary_writer.add_summary(filter_s, global_step=step)
                print("\tTraining accuracy at step {}: {}.".format(step, train_accuracy))
            # Every 10 batches compute the accuracy on the test set.
            if (step % 10 == 0):
                test_accuracy, eval_s = sess.run([acc_op, eval_summaries], 
                                 feed_dict={X: images_test, Y: labels_test, training_mode: False})
                test_summary_writer.add_summary(eval_s, global_step=step)
    print("Optimization complete.")
    
    train_summary_writer.close()
    test_summary_writer.close()

Starting epoch 0.
	Training accuracy at step 100: 0.8704533576965332.
	Training accuracy at step 200: 0.9113644957542419.
	Training accuracy at step 300: 0.929695725440979.
	Training accuracy at step 400: 0.9406428337097168.
	Training accuracy at step 500: 0.9478151798248291.
	Training accuracy at step 600: 0.9531726241111755.
Starting epoch 1.
	Training accuracy at step 700: 0.9572432637214661.
	Training accuracy at step 800: 0.9604693055152893.
	Training accuracy at step 900: 0.9631503224372864.
	Training accuracy at step 1000: 0.9652717113494873.
	Training accuracy at step 1100: 0.967065691947937.
	Training accuracy at step 1200: 0.9685727953910828.
Starting epoch 2.
	Training accuracy at step 1300: 0.9699966311454773.
	Training accuracy at step 1400: 0.9712287187576294.
	Training accuracy at step 1500: 0.9722719788551331.
	Training accuracy at step 1600: 0.9732027649879456.
	Training accuracy at step 1700: 0.97404944896698.
	Training accuracy at step 1800: 0.9748313426971436.
Start

**Question:** Are 5 training epochs sufficient for this problem?

**Answer:** *For this particular run (note that there is variability due to randomisation), the optimiser does not appear to have converged after 5 epochs. 
Both the loss and accuracy curves in TensorBoard were continuing to improve (albeit at a slow rate) when the optimisation was terminated.*

### 8. Extension activities
* Count the number of scalar parameters in the CNN model. How does this compare to logistic regression (from Worksheet 5)? *\[Answer: This CNN has 206,954 parameters, whereas logistic regression had only 7,850 parameters.\]*
* Remove dropout from the architecture. What happens to the train/test curves? Does the model now overfit?
* Vary `DEPTH_C1`, `DEPTH_C2` and/or `UNITS_D1`. How do these parameters affect the goodness of fit?
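
For the first activity, the parameter counts can be reproduced by hand. The sketch below assumes every convolutional and dense layer has a bias term, as in the model definition; the helper functions are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_depth, out_depth):
    """Weights plus one bias per output filter."""
    return kernel_h * kernel_w * in_depth * out_depth + out_depth

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return in_units * out_units + out_units

cnn_params = (
    conv2d_params(5, 5, 1, 8)        # conv_layer_1
    + conv2d_params(5, 5, 8, 16)     # conv_layer_2
    + dense_params(7 * 7 * 16, 256)  # dense_layer_1
    + dense_params(256, 10)          # dense_layer_2
)
logreg_params = dense_params(28 * 28, 10)

print(cnn_params)     # 206954
print(logreg_params)  # 7850
```

Most of the CNN's parameters sit in the first dense layer, so `UNITS_D1` has by far the largest effect on model size.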