# Convolutional Network in TensorFlow

This notebook constructs a Convolutional Neural Network (CNN) in TensorFlow step by step.
* A convolutional network typically has two parts: 
    * convolution part for constructing feature representations from the original images
        * convolutional layers
        * pooling layers
        * dropout layers
    * fully-connected part 
        * multiple fully-connected layers, as in an ordinary multilayer neural network

<img src="images/cnn_architecture.png" alt="Drawing"/>

## TensorFlow Convolution Layer

We use [tf.nn.conv2d](https://www.tensorflow.org/api_docs/python/tf/nn/conv2d) to build a convolution layer. Its signature is:

```python
tf.nn.conv2d(
    input,
    filter,
    strides,
    padding,
    use_cudnn_on_gpu=True,
    data_format='NHWC',
    dilations=[1, 1, 1, 1],
    name=None
)
```

`filter`, `strides` and `padding` are the three arguments most relevant to the convolutional computation.

* `filter`: A 4-D tensor of shape `[filter_height, filter_width, in_channels, out_channels]`
* `strides`: A 1-D tensor of length 4. The stride of the sliding window for each dimension of input. The dimension order is determined by the value of `data_format`, which defaults to "NHWC" and specifies the data storage order `[batch, height, width, channels]`.
* `padding`: A string from: "SAME", "VALID". The type of padding algorithm to use.

With these arguments set up, the convolution layer takes an input tensor of shape `[batch, in_height, in_width, in_channels]` and outputs a tensor of shape `[batch, out_height, out_width, out_channels]`

The `out_height` and `out_width` are determined by the filter, strides and padding that we set up. TensorFlow computes them as follows:

<b>SAME Padding</b>, the output height and width are computed as:

> <b style='color:blue'>out_height</b> = <b style='color:red'>ceil( float(in_height) / float(strides[1]) )</b>

> <b style='color:blue'>out_width</b> = <b style='color:red'>ceil( float(in_width) / float(strides[2]) )</b>

<b>VALID Padding</b>, the output height and width are computed as:

> <b style='color:blue'>out_height</b> = <b style='color:red'>ceil( float(in_height - filter_height + 1 ) / float(strides[1]))</b>

> <b style='color:blue'>out_width</b> = <b style='color:red'>ceil( float(in_width - filter_width + 1 ) / float(strides[2]))</b>
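
The two formulas above can be checked with a few lines of plain Python. This is a minimal sketch rather than TensorFlow code: the helper name `conv_output_size` is hypothetical, and it handles one spatial dimension (height or width) at a time.

```python
import math

def conv_output_size(in_size, filter_size, stride, padding):
    """Output height or width of a convolution, following the formulas above."""
    if padding == "SAME":
        # SAME padding: the filter size does not affect the output size.
        return math.ceil(in_size / stride)
    if padding == "VALID":
        # VALID padding: only positions where the filter fits entirely are used.
        return math.ceil((in_size - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

print(conv_output_size(4, 2, 2, "VALID"))  # 2
print(conv_output_size(28, 5, 1, "SAME"))  # 28
print(conv_output_size(5, 2, 2, "SAME"))   # 3
```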

The following `conv2d` function sets up the strides, padding and filter weight/bias (`F_W` and `F_b`) such that the output shape is (1, 2, 2, 3). Note that `F_W` and `F_b` are TensorFlow variables.

In [95]:
import tensorflow as tf
import numpy as np

x = np.array([
    [0, 1, 0.5, 10],
    [2, 2.5, 1, -8],
    [4, 0, 5, 6],
    [15, 1, 2, 3]], dtype=np.float32).reshape((1, 4, 4, 1))
X = tf.constant(x)

def conv2d(input_):
    # Filter (weights and bias)
    
    # The shape of the filter weight is (height, width, input_depth, output_depth)
    # The shape of the filter bias is (output_depth)
    # Define the filter weights `F_W` and filter bias `F_b`.
    # NOTE: Remember to wrap them in `tf.Variable`, they are trainable parameters.
    F_W = tf.Variable(tf.random_normal([2, 2, 1, 3]))
    F_b = tf.Variable(tf.random_normal([3]))
    
    # Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # Set the padding, either 'VALID' or 'SAME'.
    padding = 'VALID'
    
    # https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#conv2d
    # `tf.nn.conv2d` does not include the bias computation so we have to add it ourselves after.
    return tf.nn.conv2d(input_, F_W, strides, padding) + F_b


In [96]:
out = conv2d(X)
print(out)

Tensor("add:0", shape=(1, 2, 2, 3), dtype=float32)


We want to transform the input shape (1, 4, 4, 1) to (1, 2, 2, 3), so we use TensorFlow's formula for the output shape, as shown above.

I choose 'VALID' for the padding algorithm. I find it simpler to understand and it achieves the result I'm looking for.

Plugging in the values:

> out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2

> out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2

In order to change the depth from 1 to 3, I have to set the output depth of my filter appropriately:

> F_W = tf.Variable(tf.truncated_normal((2, 2, 1, 3))) # (height, width, input_depth, output_depth)

> F_b = tf.Variable(tf.zeros(3)) # (output_depth)

The input has a depth of 1, so I set that as the input_depth of the filter.
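
To make the sliding-window arithmetic concrete, here is a minimal pure-Python sketch of a single-channel convolution with VALID padding on the same 4x4 input. It is an illustration, not the TensorFlow implementation: the helper `conv2d_valid` is hypothetical, it produces one output channel instead of three, it omits the bias, and it uses a fixed filter of ones (an assumption, since the notebook's filter is random) so that each output is simply the sum of its 2x2 window.

```python
def conv2d_valid(image, kernel, stride):
    """Naive single-channel 2-D convolution (sum of products) with VALID padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    # Only window positions that fit entirely inside the image are visited.
    for top in range(0, len(image) - k_h + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k_w + 1, stride):
            row.append(sum(image[top + i][left + j] * kernel[i][j]
                           for i in range(k_h) for j in range(k_w)))
        out.append(row)
    return out

image = [[0, 1, 0.5, 10],
         [2, 2.5, 1, -8],
         [4, 0, 5, 6],
         [15, 1, 2, 3]]
# Hypothetical filter of ones, so each output value is the window sum.
kernel = [[1, 1], [1, 1]]

result = conv2d_valid(image, kernel, stride=2)
print(result)  # [[5.5, 3.5], [20, 16]]
```

A 4x4 input with a 2x2 filter and a stride of 2 gives a 2x2 output, matching the height and width computed above.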

## TensorFlow Pooling Layers

A pooling layer is configured by its window size, its strides, and the padding algorithm. See the TensorFlow documentation for <b style='color:red'>[tf.nn.max_pool()](https://www.tensorflow.org/versions/r0.11/api_docs/python/nn.html#max_pool)</b> for details. Strides and padding work the same way as they do for a convolution.

The function signature for `tf.nn.max_pool()` is shown below:

```python
tf.nn.max_pool(
    value,
    ksize,
    strides,
    padding,
    data_format='NHWC',
    name=None
)
```

* `value`: The input. A 4-D Tensor of the format specified by data_format.
* `ksize`: A 1-D int Tensor of 4 elements. The size of the window for each dimension of the input tensor.
* `strides`: A 1-D int Tensor of 4 elements. The stride of the sliding window for each dimension of the input tensor.
* `padding`: A string, either 'VALID' or 'SAME'. The padding algorithm.
* `data_format`: A string. 'NHWC', 'NCHW' and 'NCHW_VECT_C' are supported.
* `name`: Optional name for the operation.

Note that a pooling layer does not change the number of channels of its input, whereas a convolutional layer can. Like a convolutional layer, however, it can shrink the 2-D size (height and width) of the input feature maps.

**Set up the strides, padding and ksize such that the output shape after pooling is (1, 2, 2, 1).**

We want to transform the input shape (1, 4, 4, 1) to (1, 2, 2, 1). We choose 'VALID' for the padding algorithm; the pooling window size (`ksize`) takes the place of the filter size in the output-size formula:

> <b style='color:blue'>out_height</b> = <b style='color:red'>ceil(float(in_height - filter_height + 1) / float(strides[1]))</b>

> <b style='color:blue'>out_width</b>  = <b style='color:red'>ceil(float(in_width - filter_width + 1) / float(strides[2]))</b>

Plugging in the values:

> out_height = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2

> out_width  = ceil(float(4 - 2 + 1) / float(2)) = ceil(1.5) = 2

The depth doesn't change during a pooling operation so I don't have to worry about that.

In [None]:
"""
Set the values to `strides` and `ksize` such that
the output shape after pooling is (1, 2, 2, 1).
"""
import tensorflow as tf
import numpy as np

# `tf.nn.max_pool` requires the input be 4D (batch_size, height, width, depth)
# (1, 4, 4, 1)
x = np.array([
    [0, 1, 0.5, 10],
    [2, 2.5, 1, -8],
    [4, 0, 5, 6],
    [15, 1, 2, 3]], dtype=np.float32).reshape((1, 4, 4, 1))
X = tf.constant(x)

def maxpool(input_):
    #  Set the ksize (filter size) for each dimension (batch_size, height, width, depth)
    ksize = [1, 2, 2, 1]
    #  Set the stride for each dimension (batch_size, height, width, depth)
    strides = [1, 2, 2, 1]
    # set the padding, either 'VALID' or 'SAME'.
    padding = "VALID"

    return tf.nn.max_pool(input_, ksize, strides, padding)
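
The cell above only builds the pooling op. As a rough check of what it should produce, the following pure-Python sketch applies 2x2 max pooling with a stride of 2 and VALID padding to the same input. The helper `max_pool_valid` is hypothetical and works on a single 2-D feature map rather than a 4-D tensor.

```python
def max_pool_valid(image, ksize, stride):
    """Naive 2-D max pooling with a square window and VALID padding."""
    out = []
    for top in range(0, len(image) - ksize + 1, stride):
        row = []
        for left in range(0, len(image[0]) - ksize + 1, stride):
            # Keep the largest value inside the current window.
            row.append(max(image[top + i][left + j]
                           for i in range(ksize) for j in range(ksize)))
        out.append(row)
    return out

image = [[0, 1, 0.5, 10],
         [2, 2.5, 1, -8],
         [4, 0, 5, 6],
         [15, 1, 2, 3]]

pooled = max_pool_valid(image, ksize=2, stride=2)
print(pooled)  # [[2.5, 10], [15, 6]]
```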

## Put Together

### Dataset

We will use MNIST handwritten digits. The dataset contains 60,000 examples for training and 10,000 examples for testing. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels) with values from 0 to 1. Here the images are loaded without flattening, so each one has shape (28, 28, 1).

<img src="images/mnist_pic.png" alt="Drawing"/>

We import the MNIST dataset with a convenient TensorFlow function that batches, scales, and one-hot encodes the data.

In [10]:
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

# Note that we read images that are not flattened (reshape=False)
mnist = input_data.read_data_sets("/tmp/tensorflow/mnist/input_data", one_hot=True, reshape=False)

Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz


In [84]:
# define hyperparameters
learning_rate = 0.001
num_steps = 500
epochs = 2
batch_size = 128
display_step = 10

# number of samples to calculate validation and accuracy
test_valid_size = 256

# total number of classes
n_classes = 10

dropout = 0.75


### Weights and Biases

We first define the weights and biases for each layer. 

> Note that, for convolutional layers ('wc1' and 'wc2'), weights are actually filters that will be learned from training.

In [44]:
weights = {
    # dimension of filter/weight: (height, width, input_depth, output_depth)
    # 5x5 conv or filter, 1 input (1 image), 32 outputs
    'wc1': tf.Variable(tf.random_normal([5, 5, 1, 32])),
    # 5x5 conv or filter, 32 inputs, 64 outputs
    'wc2': tf.Variable(tf.random_normal([5, 5, 32, 64])),
    # fully connected, 7*7*64 inputs, 1024 outputs
    'wd1': tf.Variable(tf.random_normal([7*7*64, 1024])),
    # 1024 inputs, 10 outputs (class prediction)
    'out': tf.Variable(tf.random_normal([1024, n_classes]))
}

biases = {
    'bc1' : tf.Variable(tf.random_normal([32])),
    'bc2' : tf.Variable(tf.random_normal([64])),
    'bd1' : tf.Variable(tf.random_normal([1024])),
    'out' : tf.Variable(tf.random_normal([n_classes]))               
}

You may wonder where the 7\*7\*64 for 'wd1' comes from.
> As we will see later, both convolutions use "SAME" padding with a stride of 1, and each max pooling uses "SAME" padding with a stride of 2 for height and width. Each pooling step therefore halves the spatial size (28 to 14 to 7), so we get 64 feature maps of shape (7, 7) when we reach the fully connected layer. To feed them into the fully connected layer for classification, we flatten the 64 feature maps of shape (7, 7) into a vector of length 7\*7\*64.
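
The 7\*7\*64 figure can be reproduced by tracing the spatial size through the network. This is a small sketch in plain Python, assuming "SAME" padding everywhere, a stride of 1 for the convolutions and a stride of 2 for the pooling steps; the helper `same_output_size` is hypothetical.

```python
import math

def same_output_size(in_size, stride):
    """Output height or width under SAME padding."""
    return math.ceil(in_size / stride)

size = 28                         # MNIST images are 28x28
size = same_output_size(size, 1)  # conv1, stride 1 -> 28
size = same_output_size(size, 2)  # pool1, stride 2 -> 14
size = same_output_size(size, 1)  # conv2, stride 1 -> 14
size = same_output_size(size, 2)  # pool2, stride 2 -> 7

flat_features = size * size * 64  # 64 feature maps after conv2
print(size, flat_features)        # 7 3136
```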

### Convolutions

In [46]:
def conv2d(X, W, b, k=1):
    X = tf.nn.conv2d(X, W, strides=[1, k, k, 1], padding="SAME")
    X = tf.nn.bias_add(X, b)
    return tf.nn.relu(X)

The <b style='color: red'>tf.nn.conv2d()</b> function computes the convolution against weight (i.e., filter) W as shown above.

In TensorFlow, `strides` is an array of 4 elements: the first element is the stride over the batch and the last element is the stride over features.

It's good practice to remove the batches or features you want to skip from the data set rather than use a stride to skip them, so the first and last elements of `strides` are normally set to 1 in order to use all batches and features.

The middle two elements are the strides for height and width respectively. A stride is often quoted as a single number because it is usually square, with <b style='color: red'>height = width</b>.

> When someone says they are using a stride of 3, they usually mean <b style='color: red'>tf.nn.conv2d(x, W, strides=[1, 3, 3, 1])</b>.

To make life easier, the code uses `tf.nn.bias_add()` to add the bias. It is a special case of `tf.add()` in which the bias must be 1-D and is broadcast over the last (channel) dimension of the input.
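
Adding a 1-D bias to a feature map means adding one bias value per channel at every spatial position. The following pure-Python sketch shows that idea on a tiny made-up feature map of shape (height, width, channels); the helper `bias_add` and the sample values are illustrative assumptions, not TensorFlow code.

```python
def bias_add(feature_map, bias):
    """Add bias[c] to channel c at every position of an (H, W, C) nested list."""
    return [[[value + bias[c] for c, value in enumerate(pixel)]
             for pixel in row]
            for row in feature_map]

# Hypothetical feature map: height 1, width 2, 3 channels.
feature_map = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]
bias = [10.0, 20.0, 30.0]  # one value per channel

biased = bias_add(feature_map, bias)
print(biased)  # [[[11.0, 22.0, 33.0], [14.0, 25.0, 36.0]]]
```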

### Max Pooling

<img src="images/pooling_example.png" alt="Drawing"/>

The above is an example of max pooling with a 2x2 filter and stride of 2. The left square is the input and the right square is the output. For example, [[1, 1], [5, 6]] becomes 6 and [[3, 2], [1, 2]] becomes 3.

In [48]:
def maxpool2d(X, k=2):
    return tf.nn.max_pool(X, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')

The <b style='color: red'>tf.nn.max_pool()</b> function does exactly what you would expect: it performs max pooling with the `ksize` parameter as the size of the filter.

### Convolutional NN Model

<img src="images/cnn_architecture2.png" alt="Drawing"/>

In the code below, we create two layers that each apply a convolution followed by max pooling, then a fully connected layer and an output layer. The transformation of each layer to new dimensions is shown by the printed tensor shapes. For example, the first layer reshapes the images from 28x28x1 to 28x28x32 in the convolution step. The next step applies max pooling, turning each sample into 14x14x32. All the layers are applied from conv1 to output, producing 10 class predictions.

In [49]:
def conv_net(X, weights, biases, dropout):
    
    # Convolution Layer 1
    conv1 = conv2d(X, weights['wc1'], biases['bc1'])
    # Max Pooling (down-sampling)
    conv1 = maxpool2d(conv1)
    print('conv1:', conv1)

    # Convolution Layer 2
    conv2 = conv2d(conv1, weights['wc2'], biases['bc2'])
    # Max Pooling (down-sampling)
    conv2 = maxpool2d(conv2)
    print('conv2:', conv2)
    
    # Fully connected layer
    # Reshape conv2 output to fit the first fully connected layer input
    
#     conv2_shape = conv2.get_shape().as_list()
#     fc1 = tf.reshape(conv2, [-1, np.prod(conv2_shape[1:])])
    
    print('wd1 shape', weights['wd1'].get_shape().as_list())
    fc1 = tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
    print('fc1 after reshape conv2:', fc1)
    
    fc1 = tf.add(tf.matmul(fc1, weights['wd1']), biases['bd1'])
    fc1 = tf.nn.relu(fc1)
    fc1 = tf.nn.dropout(fc1, dropout)
    
    # Output, class prediction
    out = tf.add(tf.matmul(fc1, weights['out']), biases['out'])
    print('out:', out)
    return out
    

* One pitfall is the reshape of the output of the last max pooling layer to fit the first fully-connected layer by using [<b style='color: red'>tf.reshape</b>](https://www.tensorflow.org/api_docs/python/tf/reshape)
```python
# conv2 has shape (?, 7, 7, 64)
# weights['wd1'] has shape (3136, 1024)
tf.reshape(conv2, [-1, weights['wd1'].get_shape().as_list()[0]])
```    
* When constructing the convolutional neural network, you calculated the output shape of the last pooling layer. In this case, it is (?, 7, 7, 64). You then defined the input dimension of the first fully-connected layer as $7 * 7 * 64 = 3136$.
* Basically, this reshape flattens each training/testing example so that it fits the fully-connected layers. In this case, it converts conv2 from its original shape (?, 7, 7, 64) to a matrix with shape (?, 3136), where 3136 is the input dimension of the first fully-connected layer and '?' indicates the batch size of the input data.

**special value -1**
* The special value -1 in the shape, according to the official explanation, means that:

> If one component of shape is the special value -1, the size of that dimension is computed so that the total size remains constant. In particular, a shape of [-1] flattens into 1-D. At most one component of shape can be -1.

* tf.reshape(conv2, [-1, 3136]) means that 
> If we flatten conv2 to a 2-D matrix and the second dimension of the matrix is defined as 3136, the first dimension of the matrix can be inferred by: (the total size of the matrix) / 3136
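
The inference rule for -1 is simple arithmetic: divide the total number of elements by the product of the known dimensions. Below is a minimal pure-Python sketch of that rule. The helper `resolve_shape` is hypothetical and the batch size of 128 is just an example value.

```python
def resolve_shape(total_size, shape):
    """Replace a single -1 in `shape` so that the total size stays constant."""
    if shape.count(-1) > 1:
        raise ValueError("at most one component of shape can be -1")
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if total_size % known != 0:
        raise ValueError("total size is not divisible by the known dimensions")
    return [total_size // known if dim == -1 else dim for dim in shape]

batch = 128                  # example batch size
total = batch * 7 * 7 * 64   # number of elements in conv2 for this batch

print(resolve_shape(total, [-1, 3136]))  # [128, 3136]
print(resolve_shape(total, [-1]))        # [401408]
```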

**another way**
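* A more explicit alternative computes the flattened size from the static shape of conv2. The batch dimension of conv2 is unknown (`None`) when the graph is built, so it still has to be passed as -1:
```python
conv2_shape = conv2.get_shape().as_list()  # [None, 7, 7, 64]
tf.reshape(conv2, [-1, np.prod(conv2_shape[1:])])
```
* This converts conv2 from its original shape (?, 7, 7, 64) to a matrix with shape (?, 7 \* 7 \* 64)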

### Loss function

In [75]:
# tf Graph inputs that will be fed values when running the TensorFlow session
X = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
Y = tf.placeholder(tf.float32, shape=[None, n_classes])

# dropout (keep probability)
keep_prob = tf.placeholder(tf.float32)

# Model
logits = conv_net(X, weights, biases, keep_prob)
prediction = tf.nn.softmax(logits)

# loss function
loss_op = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))

# define optimizer to minimize the loss w.r.t the parameters of the network
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
# optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
train_op = optimizer.minimize(loss_op)

# Accuracy
correct_pred = tf.equal(tf.argmax(prediction, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

conv1: Tensor("MaxPool_24:0", shape=(?, 14, 14, 32), dtype=float32)
conv2: Tensor("MaxPool_25:0", shape=(?, 7, 7, 64), dtype=float32)
wd1 shape [3136, 1024]
fc1 after reshape conv2: Tensor("Reshape_44:0", shape=(?, 3136), dtype=float32)
out: Tensor("Add_23:0", shape=(?, 10), dtype=float32)


When using [tf.nn.softmax_cross_entropy_with_logits](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits), you should be careful about the following points:

> This operation expects **unscaled logits**, since it performs a softmax on logits internally for efficiency. Do not call this op with the output of softmax, as it will produce incorrect results.

> `logits` and `labels` must have the same shape, e.g. `[batch_size, num_classes]` and the same dtype (either float16, float32, or float64).

> To avoid confusion, it is required to pass only named arguments to this function.

The most important arguments for this function are:
* `labels`: Each row labels[i] must be a valid probability distribution.
* `logits`: Unscaled log probabilities.
* `dim`: The class dimension. Defaulted to -1 which is the last dimension.

It returns a **1-D tensor of length `batch_size`** of the same type as logits with the softmax cross entropy loss.

Backpropagation in this version of softmax cross entropy will happen only into logits. To calculate a cross entropy loss that allows backpropagation into both logits and labels, see [tf.nn.softmax_cross_entropy_with_logits_v2](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits_v2).

[tf.reduce_mean](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/reduce_mean) computes the mean of elements across specified dimensions of a tensor.

> Reduces input_tensor along the dimensions given in axis. Unless keep_dims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keep_dims is true, the reduced dimensions are retained with length 1.

> If axis has no entries, all dimensions are reduced, and a tensor with a single element is returned

The most important arguments for this function:
* `input_tensor`: The tensor to reduce. Should have numeric type.
* `axis`: The dimensions to reduce. If None (the default), reduces all dimensions.
* `keep_dims`: If true, retains reduced dimensions with length 1.

It returns the reduced tensor.

As shown in the code, we are using `tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y))` to compute the loss of all the training examples in a mini-batch. 
* `tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y)` computes a 1-D tensor in which each element is the softmax cross entropy value for one training example; there are `batch_size` of them in total.
* Then, the `tf.reduce_mean` computes the mean of the `batch_size` number of softmax cross entropy values.
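
The two steps can be mimicked in plain Python to see what the numbers mean. This is a sketch of the math only, not of TensorFlow's implementation: the helper `softmax_cross_entropy` is hypothetical, the logits and labels are made-up values, and the log-sum-exp form is used for numerical stability.

```python
import math

def softmax_cross_entropy(logits, labels):
    """Per-example softmax cross entropy for unscaled logits and one-hot labels."""
    losses = []
    for row_logits, row_labels in zip(logits, labels):
        # log(sum(exp(z))) computed stably by subtracting the row maximum.
        m = max(row_logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row_logits))
        log_probs = [z - log_sum_exp for z in row_logits]
        losses.append(-sum(y * lp for y, lp in zip(row_labels, log_probs)))
    return losses

# Made-up mini-batch of 2 examples with 3 classes.
logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
labels = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

per_example = softmax_cross_entropy(logits, labels)  # one loss per example
mean_loss = sum(per_example) / len(per_example)      # what reduce_mean returns
print(per_example, mean_loss)
```

The first example has uniform logits, so its loss is log(3); the second puts more weight on the correct class and gets a smaller loss.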


### Session

In [83]:
print("Using learning rate:", learning_rate)
print("Using keep pro:", dropout)
print("epochs:", epochs)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    
#     for step in range(1, num_steps+1):
    for ep in range(epochs):
        for batch in range(mnist.train.num_examples//batch_size + 1):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(train_op, feed_dict={X:batch_x, Y:batch_y, keep_prob:dropout})
#         if step % display_step == 0 or step == 1:
            loss = sess.run(loss_op, feed_dict={X:batch_x, Y:batch_y, keep_prob:1.0})
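            # Accuracy is measured on the first `test_valid_size` validation images,
            # even though the log line below labels it "Training Accuracy".
            acc = sess.run(accuracy, feed_dict={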
                X: mnist.validation.images[:test_valid_size],
                Y: mnist.validation.labels[:test_valid_size],
                keep_prob: 1.})

            if batch % 50 == 0:
                print("batch " + str(batch) + ", Minibatch Loss= " + \
                      "{:.4f}".format(loss) + ", Training Accuracy= " + \
                      "{:.3f}".format(acc))
#             if batch % 60 == 0:
#                 print("ep", ep, 'batch', batch, loss, valid_acc)
                
                
    # Calculate Test Accuracy
    test_acc = sess.run(accuracy, feed_dict={
        X: mnist.test.images[:test_valid_size],
        Y: mnist.test.labels[:test_valid_size],
        keep_prob: 1.0})
    print('Testing Accuracy: {}'.format(test_acc))

Using learning rate: 0.001
Using keep pro: 0.75
epochs: 1
batch 0, Minibatch Loss= 85614.4062, Training Accuracy= 0.160
batch 50, Minibatch Loss= 8908.8672, Training Accuracy= 0.395
batch 100, Minibatch Loss= 4839.6494, Training Accuracy= 0.629
batch 150, Minibatch Loss= 3041.9094, Training Accuracy= 0.719
batch 200, Minibatch Loss= 2652.6052, Training Accuracy= 0.723
batch 250, Minibatch Loss= 2085.2380, Training Accuracy= 0.746
batch 300, Minibatch Loss= 1951.3878, Training Accuracy= 0.777
batch 350, Minibatch Loss= 2389.5933, Training Accuracy= 0.789
batch 400, Minibatch Loss= 2486.0088, Training Accuracy= 0.785
Testing Accuracy: 0.77734375


**Additional Resources**

There are many wonderful free resources that allow you to go into more depth around Convolutional Neural Networks. In this course, our goal is to give you just enough intuition to start applying this concept on real world problems so you have enough of an exposure to explore more on your own. We strongly encourage you to explore some of these resources more to reinforce your intuition and explore different ideas.

These are the resources we recommend in particular:

* Andrej Karpathy's [CS231n Stanford course](http://cs231n.github.io/) on Convolutional Neural Networks.
* Michael Nielsen's [free book](http://neuralnetworksanddeeplearning.com/) on Deep Learning.
* Goodfellow, Bengio, and Courville's more advanced [free book](http://deeplearningbook.org/) on Deep Learning.