# Convolutional networks
The objective of this worksheet is to introduce you to a specific type of layer: the convolutional layer.
Neural networks that contain such layers are typically called 'Convolutional networks'.

- 'convnets' for short, are networks with their initial layers based on the operation of convolution
- convolutional layers have several effects
 - they improve the statistical efficiency through parameter sharing a.k.a. tied weights (a regularization effect)
 - they can be seen as filters of the input that pick out a specific feature over the entire input
 - main uses: image and signal processing, recently also for text processing
- conv layers are often coupled together with aggregating layers to reduce the number of neurons in subsequent layers -> max pooling


1. Time this code (per-epoch) on your computational device, on the zseason computer, it takes 75s/epoch.  We don't like that.
2. Can you convert this to GPU-based code? How fast can you make it run?
3. What is the lowest test error you can achieve?

In [4]:
require 'torch'
require 'nn'
require 'utils'
require 'optim'
require 'cunn'
print('ready')

ready	


In [2]:
-- snippet shamelessly ripped from https://github.com/torch/tutorials/blob/master/A_datasets/mnist.lua 
-- load the training data (downloads when first used)
tar = 'http://torch7.s3-website-us-east-1.amazonaws.com/data/mnist.t7.tgz'

if not paths.dirp('mnist.t7') then
   print('==> downloading dataset')
   os.execute('wget ' .. tar)
   os.execute('tar xvf ' .. paths.basename(tar))
end

train_file = 'mnist.t7/train_32x32.t7'
test_file = 'mnist.t7/test_32x32.t7'

----------------------------------------------------------------------
print('==> loading dataset')

-- We load the dataset from disk, it's straightforward

train_data = torch.load(train_file,'ascii')
test_data = torch.load(test_file,'ascii')

==> loading dataset	


In [5]:
-- let's preprocess the data a bit, so they are between 0 and 1
train_data.data = train_data.data:type('torch.DoubleTensor'):mul(1.0 / 256.0):cuda()
test_data.data = test_data.data:type('torch.DoubleTensor'):mul(1.0 / 256.0):cuda()

In [None]:
img = test_data.data[{519}]:clone()
itorch.image(img)
itorch.image(utils.enlarge(img,4))

In [None]:
?nn.SpatialConvolution

In [None]:
conv2 = nn.SpatialConvolution(1, 1, 2, 2)
print(conv2)
conv2.weight:zero()
conv2.weight[{1,1,1,1}] = -1
conv2.weight[{1,1,2,2}] = 1
conv2.bias[1] = 0

In [None]:
-- let's print out only the last two dimensions
print(conv2.weight[{1,1}])

In [None]:
-- what does this filter do?
itorch.image(utils.enlarge(conv2:forward(img),4))

## Max-Pooling
- reducing the amount of processed data
- enforcing translational invariance

In [None]:
?nn.SpatialMaxPooling

In [None]:
mp = nn.SpatialMaxPooling(2, 2, 2, 2)
seq = nn.Sequential()
seq:add(conv2)
seq:add(mp)

In [None]:
itorch.image(utils.enlarge(seq:forward(img), 4))

## Multiple filters at the same time - input/output planes
- planes allow us to keep the networks sequential while being able to apply multiple filters
- we **stack** filters on top of each other to generate multiple output images

In [None]:
conv2_2 = nn.SpatialConvolution(1, 2, 2, 2)
conv2_2.weight:zero()
print(conv2_2.weight:size())
conv2_2.weight[{1,1,{1,2},1}] = 1
conv2_2.weight[{1,1,{1,2},2}] = -1
conv2_2.weight[{2,1,1,{1,2}}] = 1
conv2_2.weight[{2,1,2,{1,2}}] = -1

In [None]:
img2_2 = conv2_2:forward(img)
print(img2_2:size())

In [None]:
-- iTorch already knows, how to handle stacks of images
itorch.image(utils.enlarge(img2_2,4))

In [None]:
?nn.Reshape

In [None]:
?nn.SpatialMaxPooling

## Convolutional network to process MNIST data

In [6]:
-- network structure taken from 
-- http://nbviewer.ipython.org/github/eladhoffer/Talks/blob/master/DL_class2015/Deep%20Learning%20with%20Torch.ipynb
net = nn.Sequential()
net:add(nn.SpatialConvolution(1, 8, 5, 5)) -- 1 input image channel, 64 output channels, 5x5 convolution kernel
net:add(nn.SpatialMaxPooling(2,2,2,2))      -- A max-pooling operation that looks at 2x2 windows and finds the max.
net:add(nn.ReLU())                          -- ReLU activation function
net:add(nn.SpatialConvolution(8, 16, 3, 3))
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(16, 32, 3, 3))
net:add(nn.View(32*4*4):setNumInputDims(3)) -- reshapes from a 3D tensor of 32x4x4 into 1D tensor of 32*4*4
net:add(nn.Linear(32*4*4, 128))             -- fully connected layer (matrix multiplication between input and weights)
net:add(nn.ReLU())
net:add(nn.Dropout(0.5))                    --Dropout layer with p=0.5
net:add(nn.Linear(128,10))                  -- 10 is the number of outputs of the network (in this case, 10 digits)

-- loss is cross-entropy (== logsoftmax + classnllcriterion)
loss = nn.CrossEntropyCriterion():cuda()
print(net)

-- move entire network to the GPU
net = net:cuda()

parameters, gradParameters = net:getParameters()

nn.Sequential {
  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> output]
  (1): nn.SpatialConvolution(1 -> 8, 5x5)
  (2): nn.SpatialMaxPooling(2,2,2,2)
  (3): nn.ReLU
  (4): nn.SpatialConvolution(8 -> 16, 3x3)
  (5): nn.SpatialMaxPooling(2,2,2,2)
  (6): nn.ReLU
  (7): nn.SpatialConvolution(16 -> 32, 3x3)
  (8): nn.View
  (9): nn.Linear(512 -> 128)
  (10): nn.ReLU
  (11): nn.Dropout(0.500000)
  (12): nn.Linear(128 -> 10)
}
{
  gradInput : DoubleTensor - empty
  modules : 
    {
  

    1 : 
        nn.SpatialConvolution(1 -> 8, 5x5)
        {
          dH : 1
          dW : 1
          nOutputPlane : 8
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
          gradBias : DoubleTensor - size: 8
          padH : 0
          weight : DoubleTensor - size: 8x1x5x5
          bias : DoubleTensor - size: 8
          gradWeight : DoubleTensor - size: 8x1x5x5
          padW : 0
          nInputPlane : 1
          kW : 5
          kH : 5
        }
      2 : 
        nn.SpatialMaxPooling(2,2,2,2)
        {
    

      dH : 2
          dW : 2
          padH : 0
          gradInput : DoubleTensor - empty
          indices : DoubleTensor - empty
          kH : 2
          ceil_mode : false
          output : DoubleTensor - empty
          padW : 0
          kW : 2
        }
      3 : 
        nn.ReLU
        {
          inplace : false
          threshold : 0
          val : 0
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
        }
      4 : 
        nn.SpatialConvolution(8 -> 16, 3x3)
        {
          dH : 1


          dW : 1
          nOutputPlane : 16
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
          gradBias : DoubleTensor - size: 16
          padH : 0
          weight : DoubleTensor - size: 16x8x3x3
          bias : DoubleTensor - size: 16
          gradWeight : DoubleTensor - size: 16x8x3x3
          padW : 0
          nInputPlane : 8
          kW : 3
          kH : 3
        }
      5 : 
        nn.SpatialMaxPooling(2,2,2,2)
        {
          dH : 2
          dW : 2


          padH : 0
          gradInput : DoubleTensor - empty
          indices : DoubleTensor - empty
          kH : 2
          ceil_mode : false
          output : DoubleTensor - empty
          padW : 0
          kW : 2
        }
      6 : 
        nn.ReLU
        {
          inplace : false
          threshold : 0
          val : 0
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
        }
      7 : 
        nn.SpatialConvolution(16 -> 32, 3x3)
        {
          dH : 1
          dW : 1
          

nOutputPlane : 32
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
          gradBias : DoubleTensor - size: 32
          padH : 0
          weight : DoubleTensor - size: 32x16x3x3
          bias : DoubleTensor - size: 32
          gradWeight : DoubleTensor - size: 32x16x3x3
          padW : 0
          nInputPlane : 16
          kW : 3
          kH : 3
        }
      8 : 
        nn.View
        {
          numInputDims : 3
          size : LongStorage - size: 1
          numElements : 512
        }
      9 : 
     

   nn.Linear(512 -> 128)
        {
          gradBias : DoubleTensor - size: 128
          weight : DoubleTensor - size: 128x512
          bias : DoubleTensor - size: 128
          gradInput : DoubleTensor - empty
          gradWeight : DoubleTensor - size: 128x512
          output : DoubleTensor - empty
        }
      10 : 
        nn.ReLU
        {
          inplace : false
          threshold : 0
          val : 0
          output : DoubleTensor - empty
          gradInput : DoubleTensor - empty
        }
      11 : 
        nn.Dropout(0.500000)
        {
          v2 : true
          noise : DoubleTensor - empty
          train : true
      

    p : 0.5
          gradInput : DoubleTensor - empty
          output : DoubleTensor - empty
        }
      12 : 
        nn.Linear(128 -> 10)
        {
          gradBias : DoubleTensor - size: 10
          weight : DoubleTensor - size: 10x128
          bias : DoubleTensor - size: 10
          gradInput : DoubleTensor - empty
          gradWeight : DoubleTensor - size: 10x128
          output : DoubleTensor - empty
        }
    }
  output : DoubleTensor - empty
}


In [7]:
-- random initialization according to Xavier
local tmp = math.sqrt(1. / 25)
net:get(1).weight:uniform(-tmp, tmp)
net:get(1).bias:zero()
tmp = math.sqrt(1. / 9)
net:get(4).weight:uniform(-tmp, tmp)
net:get(4).bias:zero()
tmp = math.sqrt(1. / 9)
net:get(7).weight:uniform(-tmp, tmp)
net:get(7).bias:zero()

tmp = math.sqrt(1. / net:get(9).bias:size(1))
net:get(9).weight:uniform(-tmp, tmp)
net:get(9).bias:zero()
tmp = math.sqrt(1. / net:get(12).bias:size(1))
net:get(12).weight:uniform(-tmp, tmp)
net:get(12).bias:zero()

-- initialize state for the optimizer
opt_state = {}
print('Random init complete')

Random init complete	


In [8]:
-- optimization parameters
nepochs = 1
ntrain = 60000
--ntrain = 10000
ntest = 10000
batch_size = 100
etest = 1

-- SGD
--opt_param = {
--    learningRate = 0.01,
--    momentum = 0.9,
--    learningRateDecay = 1e-3
--}

-- AdaDelta
execute_optimizer = optim.adadelta
opt_params = {
    rho = 0.9,
    eps = 1e-6
}

-- train the network
for e=1,nepochs do
    
    local timer = torch.Timer() 
    local confmat = optim.ConfusionMatrix(10, {'0','1','2','3','4','5','6','7','8','9'})
    local train_err = 0
    for b=1,ntrain/batch_size do
        
        function batch_eval(x)
            
            local err = 0
            
            if x ~= parameters then
                parameters:copy(x)
            end
            
            -- DO NOT FORGET TO ZERO THE GRADIENT STORAGE
            -- BEFORE EACH MINIBATCH
            gradParameters:zero()
        
            for i=1,batch_size do
                -- select current training example
                local ndx = (b-1) * batch_size + i
                local x = train_data.data[ndx]
                local t = train_data.labels[ndx]
            
                -- run it through the network & accumulate error
                local y = net:forward(x)
                err = err + loss:forward(y, t)
                
                confmat:add(y, t)

                -- run it backward from the loss criterion
                local dt_dy = loss:backward(y, t) 
                net:backward(x, dt_dy)
            end
            
            train_err = train_err + err
            
            return err, gradParameters
        end
        
        execute_optimizer(batch_eval, parameters, opt_params, opt_state)
    end
    
    print('******************* EPOCH ' .. e .. ' ************************')

    print('TRAINING [error = ' .. train_err .. ']')
    print(confmat:__tostring__())
    
    -- compute testing error every etest epochs
    if e % etest == 0 then
        local terr = 0
        local p = 0
        confmat:zero()
        for i=1,ntest do
            local x = test_data.data[i]
            local t = test_data.labels[i]
            local y = net:forward(x)
            confmat:add(y, t)
            terr = terr + loss:forward(y, t)
        end
        print('TESTING [error = ' .. terr .. ']')
        print(confmat:__tostring__())
    end
        
    print('Epoch took ' .. timer:time().real .. ' seconds.')
end

******************* EPOCH 1 ************************	
TRAINING [error = 15617.553837895]	


ConfusionMatrix:
[[    5686       7      24      13      21      31      67      16      36      22]   95.999% 	[class: 0]
 [       3    6499      64      16      18      13      15      35      67      12]   96.396% 	[class: 1]
 [      37      64    5447      84      57      11      60      77     100      21]   91.423% 	[class: 2]
 [      30      36     103    5554       9     148      21      70     106      54]   90.589% 	[class: 3]
 [      29      38      29       7    5337      14      93      28      45     222]   91.356% 	[class: 4]
 [      52      24       6     169      24    4869      89      18     122      48]   89.817% 	[class: 5]
 [      77      29      28      10      66      74    5566       8      48      12]   94.052% 	[class: 6]
 [      34      51      64      58      52      17      16    5795      32     146]   92.498% 	[class: 7]
 [      30      77      91      90      44     107      58      33    5217     104]   89.164% 	[class: 8]
 [      41      43      23   

TESTING [error = 1094.9916245341]	
ConfusionMatrix:
[[     958       0       3       1       0       3       7       1       2       5]   97.755% 	[class: 0]
 [       1    1119       2       2       0       2       0       5       3       1]   98.590% 	[class: 1]
 [       4       4     990      11       0       0       2       9      12       0]   95.930% 	[class: 2]
 [       1       0       4     960       0      34       0       4       4       3]   95.050% 	[class: 3]
 [       2       1       0       0     939       1       7       2       4      26]   95.621% 	[class: 4]
 [       2       1       0       3       0     880       3       0       1       2]   98.655% 	[class: 5]
 [       6       2       0       0       1      23     922       0       4       0]   96.242% 	[class: 6]
 [       2       1       8       4       3       1       0     994       4      11]   96.693% 	[class: 7]
 [       4       1       4       2       4      11       2       4     935       7]   95.996% 	[clas

In [None]:
-- Let us pass the image through the first convolutional layer and look at the results
imgf = net:get(1):forward(img)
print('Source input')
itorch.image(img)
print('After convolution')
itorch.image(image.scale(image.toDisplayTensor({input=imgf}),400))
print('Convolutional filters')
scaled_weights = image.scale(image.toDisplayTensor({input=net:get(1).weight,padding=2}),300)
itorch.image(scaled_weights)