# Convolutional neural networks from scratch

Now let's take a look at *convolutional neural networks* (CNNs), the models people really use for classifying images. 

In [15]:
from __future__ import print_function
import mxnet as mx
from mxnet import nd, autograd
import numpy as np

ctx = mx.cpu()

## MNIST data (last one, we promise!)

In [16]:
mnist = mx.test_utils.get_mnist()
batch_size = 64
train_data = mx.io.NDArrayIter(mnist["train_data"], mnist["train_label"], batch_size, shuffle=True)
test_data = mx.io.NDArrayIter(mnist["test_data"], mnist["test_label"], batch_size, shuffle=True)

## Convolutional neural networks (CNNs)

In the [previous example](5a-mlp-scratch.ipynb), we connected the nodes our neural networks in what seems like the simplest possible way. Every node in each layer was connected to every node in the subsequent layers. 

![](https://github.com/zackchase/mxnet-the-straight-dope/blob/master/img/multilayer-perceptron.png?raw=true)

This can require a lot of parameters! If our input were a 256x256 color image (still quite small for a photograph), and our network had 1,000 nodes in the first hidden layer, then our first weight matrix would require (256x256x3)x1000 parameters. That's nearly 200 million. Moreover the hidden layer would ignore all the spatial structure in the input image even though we know the local structure represents and powerful source of prior knowledge. 

Convolutional neural networks incorporate convolutional layers. These layers associate each of their nodes with a small window, called a *receptive field*, in the previous layer. This allows us to first learn local features via transformations that are applied in the same way for the top right corner as for the bottom left. Then we collect all this local information to predict global qualities of the image (like whether or not it depicts a dog). 

![](http://cs231n.github.io/assets/cnn/depthcol.jpeg)
(Image credit: Stanford cs231n http://cs231n.github.io/assets/cnn/depthcol.jpeg)

In short, there are two new concepts you need to grep here. First, we'll be introducting *convolutional* layers. Second, we'll be interleaving them with *pooling* layers. 

##  Parameters

Each node in convolutional layer is associated associated with a 3D block (height x width x channel) in the input tensor. Moreover, the convolutional layer itself has multiple output chanels. So the layer is parameterized by a 4 dimensional weight tensor, commonly called a *convolutional kernel*. 

The output tensor is produced by sliding the kernel across the input image skiping locations according to a pre-defined *stride* (but we'll just assume that to be 1 in this tutorial). Let's initialize some such kernels from scratch.

In [17]:
W1 = nd.random_normal(shape=(20, 1, 3,3)) *.01
b1 = nd.random_normal(shape=20) * .01

W2 = nd.random_normal(shape=(50, 20, 5, 5)) *.01
b2 = nd.random_normal(shape=50) * .01

W3 = nd.random_normal(shape=(800,128)) *.01
b3 = nd.random_normal(shape=128) *.01

W4 = nd.random_normal(shape=(128,10)) *.01
b4 = nd.random_normal(shape=10) *.01

params = [W1, b1, W2, b2, W3, b3, W4, b4]

And assign space for gradients

In [18]:
for param in params:
    param.attach_grad()

## Convolving with MXNet's NDArrray

To write a convolution when using *raw MXNet*, we use the function ``nd.Convolution()``. This function takes a few important arguments: inputs (``data``), a 4D weight matrix (``weight``), a bias (``bias``), the shape of the kernel (``kernel``), and a number of filters (``num_filter``).

In [19]:
data = train_data.next().data[0].as_in_context(ctx)
conv = nd.Convolution(data=data, weight=W1, bias=b1, kernel=(3,3), num_filter=20)
print(conv.shape)

(64, 20, 26, 26)


Note the shape. The number of examples (64) remains unchanged. The number of channels (also called *filters*) has increased to 20. And because the (3,3) kernel can only be applied in 26 different heights and widths (without the kernel busting over the image border), our output is 26,26. There are some weird padding tricks we can use when we want the input and output to have the same height and width dimensions, but we won't get into that now.

## Average pooling

The other new component of this model is the pooling layer. Pooling gives us a way to downsample in the spatial dimensions. Early convnets typically used average pooling, but max pooling tends to give better results. 

In [20]:
pool = nd.Pooling(data=conv, pool_type="max", kernel=(2,2), stride=(2,2))
print(pool.shape)

(64, 20, 13, 13)


Note that the batch and channel components of the shape are unchanged but that the height and width have been downsampled from (26,26) to (13,13).

## Activation function

In [21]:
def relu(X):
    return nd.maximum(X,nd.zeros_like(X))

## Softmax output

In [22]:
def softmax(y_linear):
    exp = nd.exp(y_linear-nd.max(y_linear))
    partition =nd.sum(exp, axis=0, exclude=True).reshape((-1,1))
    return exp / partition

## Define the model

Now we're ready to define our model

In [29]:
def net(X, debug=False):
    h1_conv = nd.Convolution(data=X, weight=W1, bias=b1, kernel=(3,3), num_filter=20)
    h1_activation = relu(h1_conv)
    h1 = nd.Pooling(data=h1_activation, pool_type="avg", kernel=(2,2), stride=(2,2))
    if debug:
        print("h1 shape: %s" % (np.array(h1.shape)))
        
    h2_conv = nd.Convolution(data=h1, weight=W2, bias=b2, kernel=(5,5), num_filter=50)
    h2_activation = relu(h2_conv)
    h2 = nd.Pooling(data=h2_activation, pool_type="avg", kernel=(2,2), stride=(2,2))
    if debug:
        print("h2 shape: %s" % (np.array(h2.shape)))
    
    ########################
    #  Flattening h2 so that we can feed it into a fully-connected layer
    ########################
    h2 = nd.flatten(h2)
    if debug:
        print("Flat h2 shape: %s" % (np.array(h2.shape)))
    
    
    h3_linear = nd.dot(h2, W3) + b3
    h3 = relu(h3_linear)
    if debug:
        print("h3 shape: %s" % (np.array(h3.shape)))
        
    yhat_linear = nd.dot(h3, W4) + b4
    yhat = softmax(yhat_linear)
    if debug:
        print("yhat shape: %s" % (np.array(yhat.shape)))
    
    return yhat


## Test run

We can now print out the shapes of the activations at each layer by using the debug flag.

In [24]:
output = net(data, debug=True)

h1 shape: [64 20 13 13]
h2 shape: [64 50  4  4]
Flat h2 shape: [ 64 800]
h3 shape: [ 64 128]
yhat shape: [64 10]


## The cross-entropy loss function

In [25]:
def cross_entropy(yhat, y):
    return - nd.sum(y * nd.log(yhat), axis=0, exclude=True)

## Optimizer

In [26]:
def SGD(params, lr):    
    for param in params:
        param[:] = param - lr * param.grad

## Evaluation metric

In [27]:
metric = mx.metric.Accuracy()

def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    
    data_iterator.reset()
    for i, batch in enumerate(data_iterator):
        with autograd.record():
            data = batch.data[0].as_in_context(ctx)
            label = batch.label[0].as_in_context(ctx)
            label_one_hot = nd.one_hot(label, 10)
            output = net(data)
        
        metric.update([label], [output])
    return metric.get()[1]

## The training loop

In [28]:
epochs = 5
moving_loss = 0.

for e in range(epochs):
    train_data.reset()
    for i, batch in enumerate(train_data):
        with autograd.record():
            data = batch.data[0].as_in_context(ctx)
            label = batch.label[0].as_in_context(ctx)
            label_one_hot = nd.one_hot(label, 10)
            output = net(data)
            loss = cross_entropy(output, label_one_hot)
            loss.backward()
        SGD(params, .01)
        
        ##########################
        #  Keep a moving average of the losses
        ##########################
        if i == 0:
            moving_loss = np.mean(loss.asnumpy()[0])
        else:
            moving_loss = .99 * moving_loss + .01 * np.mean(loss.asnumpy()[0])
            
    test_accuracy = evaluate_accuracy(test_data, net)
    train_accuracy = evaluate_accuracy(train_data, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" % (e, moving_loss, train_accuracy, test_accuracy))    
    

Epoch 0. Loss: 0.0761822477841, Train_acc 0.976640981735, Test_acc 0.977806528662
Epoch 1. Loss: 0.0312450892538, Train_acc 0.980829052511, Test_acc 0.977598342652
Epoch 2. Loss: 0.020098715804, Train_acc 0.982876712329, Test_acc 0.981232690669
Epoch 3. Loss: 0.0174940529287, Train_acc 0.984467751142, Test_acc 0.983072160081
Epoch 4. Loss: 0.00930268655741, Train_acc 0.985850456621, Test_acc 0.984602297774


## Conclusion

Contained in this example are nearly all the important ideas you'll need to start attacking problem in computer vision. While state of the art vision systems incorporate few more Believe it or not, if you knew just the content in this tutorial 5 years ago, you could have sold a startup to a Fortune500 company for millions of dollars. Unfortunately, the world has gotten marginally more sophisticates, so we'll have to come up with some more sophisticated tutorials to follow.

For whinges or inquiries, [open an issue on  GitHub.](https://github.com/zackchase/mxnet-the-straight-dope)