# Convolution
Up until now, we've been treating an MNIST digit as a really, really big vector. This is odd because, as far as the network knows, each component/pixel is independent of the rest. Since the input is actually a 2D image, we know that this isn't the case! How, then, can we take advantage of this additional structure?

If we wanted to stick with linear layers, we could try connecting each $3\times 3$ neighborhood of pixels to a single output. In that case, if we had a $5\times 5$ image, we'd have nine separate `nn.Linear` modules (one per neighborhood) producing a $3\times 3$ output. That's not too bad, except it assumes that the information learned about the top-left neighborhood is independent of that learned about the bottom-right. For instance, if we wanted to detect a 1, we'd need a vertical line detector; a repeated-linear network would have to learn this same detector for every position in the image!

Much better would be to learn a bunch of different $n\times n$ *filters* and slide (or "*convolve*") each one over every $n\times n$ neighborhood in the image. Now, you can have one filter that detects horizontal edges, another that detects vertical edges, and possibly even one that detects the holes inside of 0s and 9s! The best part is that, for an $N\times N$ image, you now only have to learn $n^2 \cdot \text{numFilters}$ parameters instead of $n^2 \cdot (N-n+1)^2$, which speeds up computation and helps stave off overfitting.
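To put numbers on that saving, here is a rough sketch in plain Lua (no Torch needed). The two helper names are made up for this example, biases are ignored, and a single input channel is assumed.

```lua
-- Hypothetical helpers (not part of Torch): count weights only, no biases.

-- One independent n x n weight set for every valid position in an N x N image.
function repeatedLinearParams(N, n)
  local perSide = N - n + 1
  return n * n * perSide * perSide
end

-- One shared n x n weight set per filter, reused at every position.
function convParams(n, numFilters)
  return n * n * numFilters
end

print(repeatedLinearParams(28, 3)) -- 9 * 26 * 26 = 6084
print(convParams(3, 5))            -- 9 * 5 = 45
```

For a $28\times 28$ digit and $3\times 3$ neighborhoods, five shared filters need 45 weights where the repeated-linear approach needs 6084.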

If this doesn't make sense yet, don't fret. First, check out this diagram of the computation performed during an application of a convolutional filter.

<img src="https://i.stack.imgur.com/GvsBA.jpg">
<div class="figcaption">Figure 1: Computation performed during an application of a single conv filter.</div>

Still considering the image in Figure 1, to produce the entire *output feature map*, one must simply apply the conv filter at every valid location in the image. Note that, due to edge effects, the output feature map has size $o = N - n + 1$ (where $N$ is the input size and $n$ is the kernel/filter size).
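If you'd like to see the sliding spelled out, the following is a minimal sketch in plain Lua using nested tables instead of Tensors. The function name `convolveValid` and the hand-written filter are assumptions made for illustration. It computes the sliding dot product without flipping the kernel, which is what deep learning libraries usually mean by "convolution".

```lua
-- Slide an n x n filter over an N x N image (nested Lua tables) and
-- return the (N - n + 1) x (N - n + 1) output feature map.
function convolveValid(image, filter)
  local N, n = #image, #filter
  local o = N - n + 1
  local out = {}
  for i = 1, o do
    out[i] = {}
    for j = 1, o do
      local sum = 0
      -- dot product between the filter and the neighborhood at (i, j)
      for u = 1, n do
        for v = 1, n do
          sum = sum + filter[u][v] * image[i + u - 1][j + v - 1]
        end
      end
      out[i][j] = sum
    end
  end
  return out
end

-- A 5x5 image containing a vertical bar in the middle column.
image = {
  {0, 0, 1, 0, 0},
  {0, 0, 1, 0, 0},
  {0, 0, 1, 0, 0},
  {0, 0, 1, 0, 0},
  {0, 0, 1, 0, 0},
}

-- A hand-written vertical line detector: it likes a bright center column.
filter = {
  {-1, 2, -1},
  {-1, 2, -1},
  {-1, 2, -1},
}

featureMap = convolveValid(image, filter)
for i = 1, #featureMap do
  print(table.concat(featureMap[i], " ")) -- each row prints: -3 6 -3
end
```

The output is $3\times 3$ (that is, $5 - 3 + 1$ on each side), and the response peaks in the column where the filter sits squarely on the bar.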

If you grok the previous bit, this next part should be a simple extension: if you have $f$ filters, applying all of them to an input will give you $f$ output feature maps. You can then stack those up into a *volume* of size $f\times o \times o$ and feed them into the next conv layer. This is to say that, in general, the input to a single conv filter is an $f \times n \times n$ volume. In fact, you can think of an RGB image as a special case of a conv volume where each channel is one feature map (albeit one that is highly correlated with the others).
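For those who prefer symbols, the same idea can be written as a single formula. The notation is an assumption made here, not something from Torch: $x$ is the input volume with $f_{\text{in}}$ feature maps, $w_k$ and $b_k$ are the weights and bias of the $k$-th filter, and indices start at 1.

$$y_{k,i,j} = b_k + \sum_{c=1}^{f_{\text{in}}} \sum_{u=1}^{n} \sum_{v=1}^{n} w_{k,c,u,v}\; x_{c,\,i+u-1,\,j+v-1}, \qquad 1 \le k \le f,\quad 1 \le i, j \le o$$

Each filter therefore carries $f_{\text{in}} \cdot n^2$ weights plus one bias, so a layer with $f$ filters has $f \cdot (f_{\text{in}} \cdot n^2 + 1)$ parameters in total.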

Again, for you diagram people:

<img src="https://upload.wikimedia.org/wikipedia/commons/6/68/Conv_layer.png" style="width: 33%">
<div class="figcaption">Figure 2: Input and output conv volumes.</div>

As each feature map goes through more and more convolutional layers, the features become increasingly abstract. For example, your first layer might contain a horizontal and a vertical edge detector. In the second layer, a filter could then combine the feature maps of the - and | detectors to yield a + detector!

If diagrams aren't your style and you prefer the cold, hard code, the module of the hour is [`nn.SpatialConvolution`](https://github.com/torch/nn/blob/master/doc/convolution.md#nn.SpatialConvolution):

In [1]:
require 'nn'
conv = nn.SpatialConvolution(3, 5, 3, 3) -- 3 input feature maps of size 28x28 -> 5 feature maps of size 26x26
print(tostring(conv))

nn.SpatialConvolution(3 -> 5, 3x3)	


In [2]:
img = torch.rand(2, 3, 28, 28) -- a batch of 2 random 3-channel images at MNIST's 28x28 size
conv:forward(img)
print(conv.output:size())

  2
  5
 26
 26
[torch.LongStorage of size 4]




### Taking a Page from Term Papers

One last thing before we slide on to the next part: if you want to capture details present at the edges of the image, the general consensus is that one should add a border of zeros known as *padding*. So, if we wanted to produce a $28 \times 28$ output in the previous example, we would do:

In [3]:
padW, padH = 1, 1
conv = nn.SpatialConvolution(3, 5, 3, 3, 1, 1, padW, padH) -- add a frame of width 1 before convolving
print(conv:forward(img):size())

  2
  5
 28
 28
[torch.LongStorage of size 4]
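Why does a pad of 1 do the trick? A quick derivation, assuming a stride of 1 and $p$ zeros added on every side (the symbol $p$ is introduced here for illustration):

$$o = (N + 2p) - n + 1$$

Setting $o = N$ and solving for $p$ gives

$$p = \frac{n - 1}{2}$$

For the $3\times 3$ filter above, $p = 1$ and $o = 28 + 2 - 3 + 1 = 28$, matching the printed size. Note that $p$ only comes out as a whole number when $n$ is odd.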



## Pooling

Pooling, in a nutshell, throws out several pixels in a neighborhood in exchange for *translation invariance*, which is just a fancy way of saying that shifting your input won't significantly change your output. 

The two most common types of pooling are max and average pooling, both of which are available in Torch as [`nn.SpatialMaxPooling`](https://github.com/torch/nn/blob/master/doc/convolution.md#nn.SpatialMaxPooling) and [`nn.SpatialAveragePooling`](https://github.com/torch/nn/blob/master/doc/convolution.md#nn.SpatialAveragePooling).

Is it that time already for more diagrams?

<img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/Max_pooling.png" style="width: 33%">
<div class="figcaption">Figure 3: Max pooling of a single feature map using a $2\times 2$ kernel and stride.</div>



In [6]:
pool = nn.SpatialMaxPooling(2, 2) -- takes the max in a 2x2 neighborhood
print(tostring(pool))             -- slides by 2 px after each application (there's no overlap)

nn.SpatialMaxPooling(2x2, 2,2)	


In [7]:
img = torch.Tensor(1, 6, 6):random(99)
print(img)
print(pool:forward(img))
print(pool:backward(img, torch.ones(1, 3, 3))) -- mask of maximal pixels in each neighborhood

(1,.,.) = 
  17  40  93  77  19  60
  23  36  59  55  43  74
  20  55  74  62  53  64
  86  63  25  63  51  36
  66  68  17  67  88  17
  80   8  41  60  51  79
[torch.DoubleTensor of size 1x6x6]

(1,.,.) = 
  40  93  74
  86  74  64
  80  67  88
[torch.DoubleTensor of size 1x3x3]

(1,.,.) = 
  0  1  1  0  0  0
  0  0  0  0  0  1
  0  0  1  0  0  1
  1  0  0  0  0  0
  0  0  0  1  1  0
  1  0  0  0  0  0
[torch.DoubleTensor of size 1x6x6]
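To demystify what the module is doing, here is a small sketch of non-overlapping max pooling in plain Lua, run on the same $6\times 6$ values printed above. The function name `maxPool` is made up for this example, and it assumes the side length is divisible by the kernel size.

```lua
-- Non-overlapping k x k max pooling over a square feature map (nested tables).
-- Assumes the side length of the map is divisible by k.
function maxPool(map, k)
  local o = math.floor(#map / k)
  local out = {}
  for i = 1, o do
    out[i] = {}
    for j = 1, o do
      local top, left = (i - 1) * k, (j - 1) * k
      local best = map[top + 1][left + 1]
      -- keep the largest value in the k x k neighborhood
      for u = 1, k do
        for v = 1, k do
          best = math.max(best, map[top + u][left + v])
        end
      end
      out[i][j] = best
    end
  end
  return out
end

-- The 6x6 feature map from the cell above.
map = {
  {17, 40, 93, 77, 19, 60},
  {23, 36, 59, 55, 43, 74},
  {20, 55, 74, 62, 53, 64},
  {86, 63, 25, 63, 51, 36},
  {66, 68, 17, 67, 88, 17},
  {80,  8, 41, 60, 51, 79},
}

pooled = maxPool(map, 2)
for i = 1, #pooled do
  print(table.concat(pooled[i], " "))
end
-- 40 93 74
-- 86 74 64
-- 80 67 88
```

The result agrees with the output of `pool:forward(img)`: each $2\times 2$ block is replaced by its largest value.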




#### Strided convolutions

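A *strided* convolution moves the filter by more than one pixel between applications, which shrinks the output feature map much like pooling does. Torch lets you do this using the `dW` and `dH` parameters like so: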

In [7]:
img = torch.rand(2, 3, 28, 28)
dW, dH = 2, 2
conv = nn.SpatialConvolution(3, 5, 3, 3, dW, dH) -- stride of 2 in each direction
print(tostring(conv))
print(conv:forward(img):size())

nn.SpatialConvolution(3 -> 5, 3x3, 2,2)	
  2
  5
 13
 13
[torch.LongStorage of size 4]
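All three output sizes seen so far (26, 28, and 13) follow from one formula, $o = \lfloor (N + 2p - n)/s \rfloor + 1$, where $s$ is the stride and $p$ the padding. Here is a small sketch in plain Lua; the helper name `convOutputSize` is an assumption for illustration and not part of Torch.

```lua
-- Hypothetical helper: side length of the output feature map for an
-- N x N input, n x n filter, given stride and zero padding per side.
function convOutputSize(N, n, stride, pad)
  stride = stride or 1 -- default: move one pixel at a time
  pad = pad or 0       -- default: no padding
  return math.floor((N + 2 * pad - n) / stride) + 1
end

print(convOutputSize(28, 3))       -- plain 3x3 conv  -> 26
print(convOutputSize(28, 3, 1, 1)) -- with padding 1  -> 28
print(convOutputSize(28, 3, 2, 0)) -- with stride 2   -> 13
```

The floor matters for the strided case: $(28 - 3)/2 = 12.5$, so the last partial step is dropped.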



## To 99.7% and Beyond!

With convolutions and pooling in hand, we can now go about improving our paltry 98% on MNIST to something that's actually acceptable.

We'll use the same training framework as before; the only thing that will change is the model, which you'll place in [`mnist/models/conv.lua`](../edit/mnist/models/conv.lua).

Whereas before we stacked some combination of `nn.Linear` and `nn.ReLU`, this time around, you'll want to do something more akin to [`nn.SpatialConvolution`](https://github.com/torch/nn/blob/master/SpatialConvolution.lua), [`nn.ReLU`](https://github.com/torch/nn/blob/master/doc/transfer.md#relu), and then throw in a [`nn.SpatialMaxPooling`](https://github.com/torch/nn/blob/master/doc/convolution.md#nn.SpatialMaxPooling) every so often.



In [15]:
local model = nn.Sequential()
-- stage 1 : filter bank -> squashing -> max pooling
model:add(nn.SpatialConvolutionMM(1, 32, 5, 5))    -- 1x28x28  -> 32x24x24
model:add(nn.Tanh())
model:add(nn.SpatialMaxPooling(3, 3, 3, 3, 1, 1))  -- 32x24x24 -> 32x8x8
-- stage 2 : filter bank -> squashing -> max pooling
model:add(nn.SpatialConvolutionMM(32, 64, 5, 5))   -- 32x8x8   -> 64x4x4
model:add(nn.Tanh())
model:add(nn.SpatialMaxPooling(2, 2, 2, 2))        -- 64x4x4   -> 64x2x2
-- stage 3 : standard 2-layer MLP
model:add(nn.Reshape(64*2*2))
model:add(nn.Linear(64*2*2, 200))
model:add(nn.Tanh())
model:add(nn.Linear(200, 10))
print(model)

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> output]
  (1): nn.SpatialConvolutionMM(1 -> 32, 5x5)
  (2): nn.Tanh
  (3): nn.SpatialMaxPooling(3x3, 3,3, 1,1)
  (4): nn.SpatialConvolutionMM(32 -> 64, 5x5)
  (5): nn.Tanh
  (6): nn.SpatialMaxPooling(2x2, 2,2)
  (7): nn.Reshape(256)
  (8): nn.Linear(256 -> 200)
  (9): nn.Tanh
  (10): nn.Linear(200 -> 10)
}
{
  gradInput : DoubleTensor - empty
  modules : 
    {
      1 : 
        nn.SpatialConvolutionMM(1 -> 32, 5x5)
        {
          dH : 1
          dW : 1
          nInputPlane : 1
          output : DoubleTensor - empty
          kH : 5
          gradBias : DoubleTensor - size: 32
          padH : 0
          bias : DoubleTensor - size: 32
          weight : DoubleTensor - size: 32x25
          _type : torch.DoubleTensor
          gradWeight : DoubleTensor - size: 32x25
          padW : 0
          nOutputPlane : 32
          kW : 5
          gradInput : DoubleTensor - empty
        

    4 : 
        nn.SpatialConvolutionMM(32 -> 64, 5x5)
        {
          dH : 1
          dW : 1
          nInputPlane : 32
          output : DoubleTensor - empty
          kH : 5
          gradBias : DoubleTensor - size: 64
          padH : 0
          bias : DoubleTensor - size: 64
          weight : DoubleTensor - size: 64x800
          _type : torch.DoubleTensor
          gradWeight : DoubleTensor - size: 64x800
          padW : 0
          nOutputPlane : 64
          kW : 5
          gradInput : DoubleTensor - empty
        }
      5 : 
        nn.Tanh
        {
          gradInput : DoubleTensor - empty
          _type : torch.DoubleTensor
          output : DoubleTensor - empty
        }
      6 : 
        nn.SpatialMaxPooling(2x2, 2,2)
        {
          dH : 2
          dW : 2
          kW : 2
          gradInput : DoubleTensor - empty
          indices : DoubleTensor - empty
          _type : torch.DoubleTensor
          padH : 0
          ceil_mode : false
          out

        nn.Reshape(256)
        {
          _type : torch.DoubleTensor
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
          size : LongStorage - size: 1
          nelement : 256
          batchsize : LongStorage - size: 2
        }
      8 : 
        nn.Linear(256 -> 200)
        {
          gradBias : DoubleTensor - size: 200
          weight : DoubleTensor - size: 200x256
          _type : torch.DoubleTensor
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
          bias : DoubleTensor - size: 200
          gradWeight : DoubleTensor - size: 200x256
        }
      9 : 
        nn.Tanh
        {
          gradInput : DoubleTensor - empty
          _type : torch.DoubleTensor
          output : DoubleTensor - empty
        }
      10 : 
        nn.Linear(200 -> 10)
        {
          gradBias : DoubleTensor - size: 10
          weight : DoubleTensor - size: 10x200
          _type : torch.DoubleTensor
          ou

As before, verify that your network has the correct input/output sizes:

In [11]:
dofile('mnist/test/conv_io.lua') -- check the Tensor sizes!

Passed!	
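If the check ever fails, it helps to trace the spatial size by hand. Below is a rough sketch in plain Lua that walks a $28\times 28$ digit through the conv and pooling layers of the model above. The helper `layerOutSize` is hypothetical, and it assumes floor rounding, which matches the `ceil_mode : false` shown in the model dump.

```lua
-- Hypothetical helper: output side length of a conv or pooling layer
-- with a k x k kernel, given stride and zero padding per side.
function layerOutSize(size, k, stride, pad)
  return math.floor((size + 2 * pad - k) / stride) + 1
end

-- {name, kernel, stride, pad} for each spatial layer in the model above
layers = {
  {"conv 5x5",        5, 1, 0},
  {"max pool 3x3 /3", 3, 3, 1},
  {"conv 5x5",        5, 1, 0},
  {"max pool 2x2 /2", 2, 2, 0},
}

size = 28 -- MNIST digits are 28x28
for _, layer in ipairs(layers) do
  size = layerOutSize(size, layer[2], layer[3], layer[4])
  print(layer[1], size) -- sizes go 24, 8, 4, 2
end

flattened = 64 * size * size -- 64 feature maps in the last conv layer
print(flattened)             -- 256, the input size of the first nn.Linear
```

That final 256 is where the `64*2*2` passed to `nn.Reshape` and `nn.Linear` comes from.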


Train to your heart's content! (Don't forget to twiddle hyperparameters!)

In [16]:
trainMNIST = dofile('mnist/main.lua')
trainMNIST({modelType='conv', nEpochs=5})



Epoch  1 | train loss: 1.323 | val loss: 0.525 | val acc: 89.76%
Epoch  2 | train loss: 0.367 | val loss: 0.250 | val acc: 94.01%
Epoch  3 | train loss: 0.220 | val loss: 0.175 | val acc: 95.64%
Epoch  4 | train loss: 0.164 | val loss: 0.136 | val acc: 96.46%
Epoch  5 | train loss: 0.134 | val loss: 0.115 | val acc: 96.96%
