# Introduction

In this tutorial, we'll walk through the TensorFlow code for building a convolutional neural network (CNN) that classifies MNIST digits. Following the code and concepts requires some familiarity with building neural networks in TensorFlow. If you want to review or learn about that, the notes from last week's workshop are [here](https://github.com/uclaacmai/tf-workshop-series/tree/master/week6-neural-nets).


In [1]:
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting MNIST_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


Let's define some standard hyperparameters for our network.

In [25]:
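n_epochs = 2000      # number of training steps (one minibatch per step), not full passes over the data
minibatch_size = 50  # number of examples per training step
lr = 1e-4            # learning rate for the Adam optimizer
keep = 0.5           # probability of keeping a unit during dropout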

Next, we'll create the standard placeholders for our input training examples and their corresponding labels. The first dimension denotes the number of examples being fed into the network. Setting it to ```None``` leaves its size unspecified, which lets us feed batches of different sizes into the network.

In [16]:
x = tf.placeholder(tf.float32, shape = [None, 28,28,1]) # TensorFlow's default image layout: batch x height x width x color channels
y_ = tf.placeholder(tf.float32, shape = [None, 10]) # batch x number of classes (one-hot labels)
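
To make the effect of the ```None``` dimension concrete, here is a small NumPy-only sketch (it does not build a TensorFlow graph). The helper name `to_nhwc` and the random stand-in data are assumptions made for illustration. The point is only that flattened 28 x 28 images reshape to the `[batch, 28, 28, 1]` layout for any batch size.

```python
import numpy as np

def to_nhwc(flat_images):
    """Reshape flattened 28x28 images of shape (N, 784) to (N, 28, 28, 1).

    Hypothetical helper for illustration; -1 lets NumPy infer the batch size N.
    """
    return flat_images.reshape(-1, 28, 28, 1)

# Random stand-ins for MNIST minibatches of two different sizes.
small_batch = to_nhwc(np.random.rand(50, 784).astype(np.float32))
large_batch = to_nhwc(np.random.rand(128, 784).astype(np.float32))

print(small_batch.shape)  # (50, 28, 28, 1)
print(large_batch.shape)  # (128, 28, 28, 1)
```

Both arrays match the placeholder shape `[None, 28, 28, 1]`, so either could be fed to `x`.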

# Convolutions and max-pooling

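At the core of convolutional neural networks is the convolution. Mathematically, a convolution is an operation on two functions that produces a third function. In a CNN, the two "functions" are the input image (or the previous layer's activations) and a small learned filter, also called a kernel: the filter slides across the input, and at each position the output is the sum of the elementwise products between the filter and the patch beneath it. Convolutions have rigorous mathematical theory behind them; if you're interested in learning more, we recommend Wikipedia's article on convolution and Christopher Olah's explanation of convolutions.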
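
As a rough illustration of the sliding-filter idea, here is a minimal NumPy sketch for a single-channel image with no padding and a stride of 1. The function name `conv2d_valid`, the toy image, and the hand-picked edge filter are all assumptions for this example; in the network itself the filter values are learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    Like tf.nn.conv2d, this does not flip the kernel, so strictly speaking it
    computes a cross-correlation.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A 4x4 image whose left half is dark (0) and right half is bright (1).
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
# A 2x2 filter that responds to a dark-to-bright step from left to right.
edge_kernel = np.array([[-1, 1],
                        [-1, 1]], dtype=float)

response = conv2d_valid(image, edge_kernel)
print(response)
# [[0. 2. 0.]
#  [0. 2. 0.]
#  [0. 2. 0.]]
```

The response is nonzero only in the middle column, where the filter straddles the edge between the dark and bright halves.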

Each convolution is followed by a non-linearity (here, ReLU), after which a `max-pooling` operation is performed. Max-pooling downsamples the activations from the convolution step by keeping only the largest activation in each small region. This can reduce overfitting, since each region is summarized by its "most important" (in a sense) activation, and it lowers computational cost because later layers operate on fewer values. Finally, max-pooling makes the network somewhat insensitive to small shifts in the input: a feature that moves by a pixel within a pooling region yields the same pooled output.
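
The downsampling step can be sketched in NumPy for a single 2-D activation map. The helper name `max_pool_2x2_numpy` and the toy values are assumptions for illustration, and the sketch assumes both dimensions are even.

```python
import numpy as np

def max_pool_2x2_numpy(activations):
    """Take the maximum over non-overlapping 2x2 blocks of a 2-D array.

    Assumes both dimensions are even; hypothetical helper for illustration.
    """
    h, w = activations.shape
    # Split rows and columns into (block index, position within block).
    blocks = activations.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

activations = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool_2x2_numpy(activations)
print(pooled)
# [[4 2]
#  [2 8]]
```

The 4 x 4 map shrinks to 2 x 2, and each output value is the largest activation in its 2 x 2 region.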


The next cell defines a few helper functions that make the later layer definitions a little nicer to read.

In [17]:
def weight_variable(shape):
    """Initializes weights randomly from a truncated normal distribution (stddev 0.1).
    Params: shape: list of dimensionality of the tensor to be initialized
    """
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    """Initializes the bias term to a small positive constant (0.1).
    Params: shape: list of dimensionality for the bias term.
    """
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    """Performs a convolution over a given input x with some filter W.
    Uses a stride of length 1 and SAME padding (padded with zeros at the edges)
    Params:
    x: tensor: the batch of images (or activations) to be convolved over
    W: the kernel (tensor) with which to convolve.
    """
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
# strides is a length-4 list that specifies how far the filter moves along each dimension of our input x.
# the dimensions correspond to the following (in order): batch, height of image, width of image, # of channels in image

def max_pool_2x2(x):
    """Performs a max pooling operation over a 2 x 2 region"""
    # ksize: we only want to take the maximum over 1 example and 1 channel. 
    # the middle elements are 2 x 2 because we want to take maxima over 2 x 2 regions
    
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME') # stride 2 right and 2 down
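
The `weight_variable` helper above relies on `tf.truncated_normal`, which redraws any sample that falls more than two standard deviations from the mean. A rough NumPy sketch of that behavior, under the assumption of a zero mean, is shown below. The function name, the `seed` argument, and the resampling loop are illustrative and are not TensorFlow's implementation.

```python
import numpy as np

def truncated_normal(shape, stddev=0.1, seed=0):
    """Sample from N(0, stddev^2), redrawing values beyond 2 * stddev.

    Illustrative sketch only; `seed` is a hypothetical argument for repeatability.
    """
    rng = np.random.RandomState(seed)
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        # Redraw only the entries that landed too far from the mean.
        values[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values) > 2 * stddev
    return values

weights = truncated_normal((5, 5, 1, 32))
print(weights.shape)                  # (5, 5, 1, 32)
print(np.abs(weights).max() <= 0.2)   # True
```

Truncation keeps every initial weight within 0.2 of zero, so no single weight starts out unusually large.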

# Model Architecture

We will implement a (relatively) simple convolutional neural network model. It'll be composed of two convolutional layers, each of which is followed by a max-pooling operation to reduce dimensionality. 

We will then follow these two convolutional layers with two fully connected layers, similar to those in the vanilla neural network that we implemented last week. However, we're also going to add an operation known as **dropout**. Dropout is a regularization technique that helps prevent large neural networks from overfitting the training dataset. With dropout, each activation in a hidden layer is discarded (set to zero) independently with a fixed probability during training, so that it contributes nothing to what the next layer computes. In the code below, `keep_prob` is the probability that an activation is kept.

Dropout has several interesting consequences. First, it introduces a degree of per-(hidden) layer sparsity during training: at any given layer, a certain proportion of the inputs into that layer will be zero. Moreover, since we resample which neurons to discard at every training step, we can think of this as training an ensemble of (correlated) neural networks, because a different subset of neurons is active at each step. When we predict using our network (without dropout), it's as if we're getting a prediction from an ensemble of neural networks, without having to train an ensemble in the first place (which is much more expensive).
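
The masking step can be sketched in NumPy. This is a minimal sketch of "inverted" dropout, in which kept activations are scaled by `1 / keep_prob`, as `tf.nn.dropout` also does. The function name, the explicit `rng` argument, and the all-ones input are assumptions for illustration.

```python
import numpy as np

def dropout(activations, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob, then scale the survivors
    by 1 / keep_prob so the expected activation is unchanged."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.RandomState(0)
hidden = np.ones((1000, 100))  # stand-in for a hidden layer's activations
dropped = dropout(hidden, keep_prob=0.5, rng=rng)

print(np.unique(dropped))  # [0. 2.]
print(dropped.mean())      # close to 1.0
```

Because the kept units are scaled up during training, no rescaling is needed at prediction time: feeding `keep_prob = 1.0` simply passes the activations through unchanged.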

A few questions to sanity check your understanding of dropout: 
 1. Why should we not use dropout in the final layer? 
 2. Why do we not use dropout when testing the network accuracy on testing datasets? Why is it a bad idea to use in predictions? 
 3. Which dropout probability corresponds to the largest number of ways to have the neurons in a hidden layer active/inactive? (hint: think combinatorics).
 
Finally, we will have an output layer that does a standard matrix multiplication to generate class predictions.

In [18]:
W_conv1 = weight_variable([5, 5, 1, 32]) # 5 x 5 kernel, across an image with 1 channel to 32 channels
b_conv1 = bias_variable([32])
h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

In [19]:
W_conv2 = weight_variable([5, 5, 32, 64]) # 5 x 5 kernel, across an "image" with 32 channels to 64 channels
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)
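
It can help to trace the spatial size of the activations through the two conv/pool stages above. With `padding='SAME'`, the output size along each spatial dimension is `ceil(input_size / stride)`. The short sketch below applies that rule; the helper name `same_padding_output_size` is an assumption for illustration.

```python
import math

def same_padding_output_size(input_size, stride):
    """Spatial output size for padding='SAME': ceil(input_size / stride)."""
    return math.ceil(input_size / stride)

size = 28                                        # MNIST images are 28 x 28
size = same_padding_output_size(size, stride=1)  # conv1 (stride 1): 28
size = same_padding_output_size(size, stride=2)  # pool1 (stride 2): 14
size = same_padding_output_size(size, stride=1)  # conv2 (stride 1): 14
size = same_padding_output_size(size, stride=2)  # pool2 (stride 2): 7

flattened = size * size * 64                     # conv2 produces 64 channels
print(size, flattened)  # 7 3136
```

This is where the `7 * 7 * 64` input size of the first fully connected layer comes from.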

Fully connected layers

In [20]:
W_fc1 = weight_variable([7 * 7 * 64, 1024]) # two 2 x 2 poolings with SAME padding shrink 28 -> 14 -> 7, and conv2 outputs 64 channels
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

In [21]:
W_fc2 = weight_variable([1024, 256])
b_fc2 = bias_variable([256])
h_fc2 = tf.nn.relu(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
h_fc2_drop = tf.nn.dropout(h_fc2, keep_prob)

In [22]:
W_fc3 = weight_variable([256, 10])
b_fc3 = bias_variable([10])
y_out = tf.matmul(h_fc2_drop, W_fc3) + b_fc3
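
To get a feel for where the model's capacity lives, we can count the trainable parameters per layer directly from the weight and bias shapes used in the cells above. This is plain arithmetic, and the names `layers`, `param_counts`, and `total_params` are chosen for this sketch.

```python
from functools import reduce
from operator import mul

# (layer name, weight shape, number of biases), copied from the cells above.
layers = [
    ("conv1", [5, 5, 1, 32], 32),
    ("conv2", [5, 5, 32, 64], 64),
    ("fc1", [7 * 7 * 64, 1024], 1024),
    ("fc2", [1024, 256], 256),
    ("fc3", [256, 10], 10),
]

# Parameters per layer = product of the weight dimensions + number of biases.
param_counts = {
    name: reduce(mul, weight_shape) + n_biases
    for name, weight_shape, n_biases in layers
}
total_params = sum(param_counts.values())

for name, count in param_counts.items():
    print("{}: {:,}".format(name, count))
print("total: {:,}".format(total_params))  # total: 3,529,354
```

The first fully connected layer accounts for about 3.2 million of the roughly 3.5 million parameters, while the two convolutional layers together hold only about 52 thousand.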

In [23]:
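# softmax_cross_entropy_with_logits returns one loss per example; average them into a scalar loss
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits = y_out, labels = y_))
train_step = tf.train.AdamOptimizer(lr).minimize(cross_entropy)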

correct_prediction = tf.equal(tf.argmax(y_, axis = 1), tf.argmax(y_out, axis = 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.global_variables_initializer()
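
For intuition about what the loss and accuracy ops compute, here is a NumPy sketch of the same quantities on a tiny hand-made batch. `tf.nn.softmax_cross_entropy_with_logits` produces one loss value per example, which the sketch mirrors. The function names and the toy logits and labels are assumptions for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot `labels` and softmax(`logits`)."""
    # Subtracting the row maximum avoids overflow and leaves the softmax unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest logit matches the one-hot label."""
    return np.mean(np.argmax(logits, axis=1) == np.argmax(labels, axis=1))

# Three examples, three classes; the third example is misclassified.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3],
                   [1.2, 0.2, 0.1]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)

losses = softmax_cross_entropy(logits, labels)
print(losses)                                # one loss per example
print(losses.mean())                         # scalar loss for the batch
print(accuracy_from_logits(logits, labels))  # 2 of 3 correct
```

The misclassified third example contributes the largest loss, which is the signal the optimizer uses to adjust the weights.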

In [26]:
with tf.Session() as sess:
    sess.run(init)
    # Each iteration is one training step on a single minibatch.
    for i in range(n_epochs):
        batch = mnist.train.next_batch(minibatch_size)
        training_inputs = batch[0].reshape([minibatch_size,28,28,1]) # flat 784-vectors -> 28 x 28 x 1 images
        training_labels = batch[1]
        if i % 100 == 0:
            print("epoch: {}".format(i))
            # Evaluate on the current minibatch with dropout disabled (keep_prob = 1.0).
            train_acc = accuracy.eval(feed_dict = {x: training_inputs, y_: training_labels, keep_prob : 1.0})
            print("training accuracy: {}".format(train_acc))
        # Take one optimization step with dropout enabled.
        sess.run([train_step], feed_dict = {x: training_inputs, y_: training_labels, keep_prob : keep})
    # Evaluate on the full test set with dropout disabled.
    test_inputs = mnist.test.images.reshape([-1,28,28,1])
    test_labels = mnist.test.labels
    test_acc = accuracy.eval(feed_dict = {x: test_inputs, y_: test_labels, keep_prob : 1.0})
    print("test accuracy: {}".format(test_acc))

epoch: 0
training accuracy: 0.07999999821186066
epoch: 100
training accuracy: 0.6000000238418579
epoch: 200
training accuracy: 0.8600000143051147
epoch: 300
training accuracy: 0.8600000143051147
epoch: 400
training accuracy: 0.9399999976158142
epoch: 500
training accuracy: 0.9599999785423279
epoch: 600
training accuracy: 0.9800000190734863
epoch: 700
training accuracy: 0.9800000190734863
epoch: 800
training accuracy: 1.0
epoch: 900
training accuracy: 0.8999999761581421
epoch: 1000
training accuracy: 0.9599999785423279
epoch: 1100
training accuracy: 0.9599999785423279
epoch: 1200
training accuracy: 0.9399999976158142
epoch: 1300
training accuracy: 1.0
epoch: 1400
training accuracy: 0.9399999976158142
epoch: 1500
training accuracy: 0.9800000190734863
epoch: 1600
training accuracy: 0.9800000190734863
epoch: 1700
training accuracy: 0.9800000190734863
epoch: 1800
training accuracy: 0.9800000190734863
epoch: 1900
training accuracy: 0.9399999976158142
test accuracy: 0.9678999781608582
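
A note on reading this output: each "epoch" printed above is a single minibatch step, and each training accuracy is measured on one minibatch of 50 examples, which is why the values move in multiples of 0.02 and fluctuate from one printout to the next. The arithmetic below estimates how many full passes over the training data the run makes, assuming `mnist.train` holds 55,000 examples (the default split of `read_data_sets`, which is not shown in the output above).

```python
n_steps = 2000            # the value of n_epochs in this notebook
minibatch_size = 50
n_train_examples = 55000  # assumption: default size of mnist.train

examples_seen = n_steps * minibatch_size
passes_over_data = examples_seen / n_train_examples

print(examples_seen)               # 100000
print(round(passes_over_data, 2))  # 1.82
```

Under that assumption the network sees each training image fewer than two times on average.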


# Additional Resources

* CNN [tutorial](https://www.tensorflow.org/tutorials/deep_cnn) from the TensorFlow docs
* Stanford's [course](http://cs231n.github.io/convolutional-networks/) on CNNs
* Michael Nielsen's [chapter](http://neuralnetworksanddeeplearning.com/chap6.html) on CNNs in his book
* Facebook's [video](https://www.facebook.com/Engineering/videos/10154673882797200/) on ML and CNNs