# First brush with the MNIST digits problem
The objective of this notebook is to introduce the MNIST digits classification challenge and implement a first solution that we will keep improving.

The dataset is derived from handwritten digits collected by NIST, the US National Institute of Standards and Technology. Task: recognize the digit from its image. This is a 10-class problem with a training set of 60k 32x32 images and a test set of 10k 32x32 images. Build a neural network that can correctly identify the images and achieve an error rate of less than 3% [on this 10-class problem!]

In [None]:
require 'nn'
require 'optim'
require 'paths'
print('ready')

-- set the number of threads [set to number of cores]
torch.setnumthreads(2)

In [None]:
-- snippet shamelessly ripped from https://github.com/torch/tutorials/blob/master/A_datasets/mnist.lua 
-- load the training data (downloads when first used)
tar = 'http://torch7.s3-website-us-east-1.amazonaws.com/data/mnist.t7.tgz'

if not paths.dirp('mnist.t7') then
   print('==> downloading dataset')
   os.execute('wget ' .. tar)
   os.execute('tar xvf ' .. paths.basename(tar))
end

train_file = 'mnist.t7/train_32x32.t7'
test_file = 'mnist.t7/test_32x32.t7'

----------------------------------------------------------------------
print('==> loading dataset')

-- We load the dataset from disk, it's straightforward

train_data = torch.load(train_file,'ascii')
test_data = torch.load(test_file,'ascii')

In [None]:
-- print some training data
print(train_data)

In [None]:
-- example image
itorch.image(train_data.data[{500,1}])

In [None]:
-- let's preprocess the data a bit: flatten each image into a 1024-vector
-- and scale the pixel values so they lie between 0 and 1
train_data.data = train_data.data:reshape(60000, 32*32):type('torch.DoubleTensor'):mul(1.0 / 256.0)
test_data.data = test_data.data:reshape(10000, 32*32):type('torch.DoubleTensor'):mul(1.0 / 256.0)

-- remove mean
--train_data.data = train_data.data - torch.repeatTensor(torch.mean(train_data.data,2),1,1024)
--test_data.data = test_data.data - torch.repeatTensor(torch.mean(test_data.data,2),1,1024)

In [None]:
-- build our NN model
net = nn.Sequential()
net:add(nn.Linear(32*32, 250))
net:add(nn.SoftSign())
net:add(nn.Linear(250,10))
loss = nn.CrossEntropyCriterion()

print(net)

## Initialization code
- here, the network is initialized to small random values
- good initialization is critical, see e.g. [Sutskever et al., 2013](http://www.cs.toronto.edu/~fritz/absps/momentum.pdf)
- we use the fan-in based uniform initialization discussed in [Glorot & Bengio, 2010](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf)
- when you modify the network, add code to randomly initialize the new nn.Linear layers as well!
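
For reference, the two uniform schemes discussed by Glorot & Bengio can be written as follows, where $n_j$ is the number of inputs of a layer and $n_{j+1}$ its number of outputs. The first is the common fan-in heuristic, which is what the bound math.sqrt(1. / 1024) in the initialization cell corresponds to for the first layer. The second is the normalized variant proposed in the paper, which this notebook does not use.

$$
W \sim U\left[-\frac{1}{\sqrt{n_j}},\ \frac{1}{\sqrt{n_j}}\right]
\qquad\text{(fan-in heuristic)}
$$

$$
W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]
\qquad\text{(normalized initialization)}
$$

For the first layer ($n_j = 1024$, $n_{j+1} = 250$) the bounds are $1/32 \approx 0.031$ and $\sqrt{6/1274} \approx 0.069$, respectively.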

In [None]:
-- per-layer initialization, uniform in [-1/sqrt(fan_in), 1/sqrt(fan_in)], as discussed by Glorot & Bengio
-- (we will discuss this topic more tomorrow)
local tmp = math.sqrt(1. / 1024)   -- layer 1 has 32*32 = 1024 inputs
net:get(1).weight:uniform(-tmp, tmp)
net:get(1).bias:zero() -- per Bengio & Bergstra 2012
tmp = math.sqrt(1. / 250)          -- layer 3 has 250 inputs
net:get(3).weight:uniform(-tmp, tmp)
net:get(3).bias:zero()

--parameters:uniform(-0.01, 0.01)
print('Random initialization')

## Training code
- networks are typically trained in mini-batches, which are a middle ground between fully online and full-batch learning
- the mini-batch size and the learning rate must be tuned to 'good values'; unfortunately, those are problem specific
- here's a starting point, matching the code below: learning rate 0.01 and batch size of 100
- training proceeds in epochs; one epoch is a full pass through all the training data
- we split each epoch into mini-batches of size 100; the code accumulates the gradients of all examples in a mini-batch and then updates the parameters once
- when you change the parameters to see whether the network learns well, don't forget to rerun the random initialization
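
To make the epoch/mini-batch bookkeeping concrete, here is a minimal plain-Lua sketch. The helper name `batch_ranges` is hypothetical and not part of the notebook; it only lists which example indices fall into each mini-batch, assuming that an incomplete trailing batch is dropped (which is what a loop of the form `for b=1,ntrain/batch_size` does).

```lua
-- Hypothetical helper (not part of the notebook): list the first and last
-- example index of every mini-batch in one epoch. An incomplete trailing
-- batch is dropped.
local function batch_ranges(ntrain, batch_size)
    local ranges = {}
    for b = 1, math.floor(ntrain / batch_size) do
        local first = (b - 1) * batch_size + 1
        local last = b * batch_size
        ranges[#ranges + 1] = {first, last}
    end
    return ranges
end

local ranges = batch_ranges(60000, 100)
print(#ranges)                                 -- parameter updates per epoch
print(ranges[1][1], ranges[1][2])              -- first mini-batch
print(ranges[#ranges][1], ranges[#ranges][2])  -- last mini-batch
```

With a smaller batch size the network is updated more often per epoch, but each update is based on a noisier gradient estimate.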

NOTE: In this [blog post](http://blog.bigml.com/2013/07/01/you-dont-need-coursera-to-get-started-with-machine-learning/), bigml.com present a 1-click solution and explain that you don't need Coursera to get good results (their test accuracy: 90.6% on a subset of 4000 training and 1000 test examples). Maybe true, but we are still going to at least match that score in this first example.

In [None]:
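-- params
nepochs = 40               -- number of passes through the training data
lambda = 0.01              -- learning rate
lambda_decay = 1e-3        -- relative learning-rate decay applied after each epoch
batch_size = 100           -- number of examples per mini-batch
etest = 1                  -- evaluate on the test set every etest epochs
print_diagnostics = false  -- print weight/gradient norms after each epoch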

-- dataset sizes
ntrain = train_data.labels:size(1)
--ntrain = 10000 -- small subset of the dataset to tune the learning rate
--ntrain = 4000
ntest = test_data.labels:size(1)
--ntest = 1000

-- for each epoch
for e=1,nepochs do

    local timer = torch.Timer() 
    local err = 0
    local misses = 0
    local confmat = optim.ConfusionMatrix(10, {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'})
    for b=1,ntrain/batch_size do

        -- DO NOT FORGET TO ZERO THE GRADIENT STORAGE
        -- BEFORE EACH MINIBATCH
        net:zeroGradParameters()

        for i=1,batch_size do
            -- select current training example
            local ndx = (b-1) * batch_size + i
            local x = train_data.data[ndx]
            local t = train_data.labels[ndx]

            -- run it through the network & accumulate error
            local y = net:forward(x)
            err = err + loss:forward(y, t)
            
            confmat:add(y,t)
            
            local _, digit = torch.max(y,1)
            if digit[1] ~= t then
                misses = misses + 1
            end

            -- run it backward from the loss criterion
            local dl_dy = loss:backward(y, t)  -- gradient of the loss w.r.t. the output y
            net:backward(x, dl_dy)
        end

        -- simple network update (gradient descent with learning rate = lambda);
        -- note that the gradients are summed, not averaged, over the mini-batch
        net:updateParameters(lambda)
        
    end

    print('Training error after EPOCH ' .. e .. ' = ' .. err .. ' misses = ' .. misses .. ' learning_rate = ' .. lambda)
    
    --FIXME: uncomment if you want to see how the network is learning
    --print(confmat:__tostring__())

    -- compute testing error every etest epochs
    if e % etest == 0 then
        local terr = 0
        local p = 0
        confmat:zero()
        for i=1,ntest do
            local x = test_data.data[i]
            local t = test_data.labels[i]
            local y = net:forward(x)
            local _, digit = torch.max(y,1)  -- y holds raw class scores; we only need the argmax
            if digit[1] ~= t then 
                --print('test[' .. i .. '] tgt ' .. (t-1) .. ' prediction ' .. (digit[1]-1))
                p = p + 1
            end
            confmat:add(y,t)
            terr = terr + loss:forward(y, t)
        end
        print('Testing error after epoch ' .. e .. ' = ' .. terr .. ' in counts ' .. p .. '/' .. ntest)
        if e == nepochs then
            print('**** FINAL CONFUSION MATRIX ****')
            print(confmat:__tostring__())
        end
    end

    -- print diagnostic info about the training process
    if print_diagnostics then
        print('       DIAGNOSTICS       ')
        for _, l in ipairs({1,3}) do
            local w = net:get(l).weight
            local nw = w:nElement()
            local w_max =  torch.max(torch.abs(w))
            local wg = net:get(l).gradWeight
            local wg_max =  torch.max(torch.abs(wg))
            local b = net:get(l).bias
            local nb = b:nElement()
            local b_max = torch.max(torch.abs(b))
            print('layer ' .. l)
            print('  weight norms max/2/1 [' .. w_max .. ', ' .. torch.norm(w,2)/nw .. ', ' .. torch.norm(w,1)/nw .. ']')
            print('  w/grad norms max/2/1 [' .. wg_max .. ', ' .. torch.norm(wg,2)/nw .. ', ' .. torch.norm(wg,1)/nw .. ']')
            print('  bias norms max/2/1 [' .. b_max .. ', ' .. torch.norm(b,2)/nb .. ', ' .. torch.norm(b,1)/nb .. ']')
        end
    end

    print('Epoch took ' .. timer:time().real .. ' seconds.')
    
    -- apply decay before next epoch
    lambda = lambda * (1 - lambda_decay)

end
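
The decay step at the end of each epoch multiplies the learning rate by (1 - lambda_decay), so the rate used in epoch e is lambda_0 * (1 - lambda_decay)^(e - 1). The following plain-Lua sketch evaluates this schedule; the function name `learning_rate_at` is made up for illustration and the values are the ones from the params block.

```lua
-- Learning rate used in a given epoch when the rate is multiplied by
-- (1 - decay) after every epoch. Hypothetical helper, for illustration only.
local function learning_rate_at(epoch, lambda0, decay)
    return lambda0 * (1 - decay) ^ (epoch - 1)
end

for _, e in ipairs({1, 10, 40}) do
    print(string.format('epoch %2d: learning rate %.6f',
                        e, learning_rate_at(e, 0.01, 1e-3)))
end
```

With these settings the decay is very mild: the rate in the last of the 40 epochs is only about 4% smaller than the initial one, so a larger lambda_decay is worth trying if training oscillates late.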

In [None]:
-- let's plot the first-layer weights (one row per hidden unit)
require 'gnuplot'
gnuplot.imagesc(net:get(1).weight, 'color')

### Suggested experiments
Summarize and record your observations for each of these experiments.
- change the learning rate, observe the magnitude of the gradient updates
- change the number of neurons, observe the behavior
- switch on the diagnostic code and observe the gradient norms in detail
- add another layer to the network (e.g. a 32-neuron layer after layer 1, again followed by nn.SoftSign()); don't forget to add its initialization!
- try changing the activation functions to nn.Sigmoid(); is training easier?
- instead of using the same learning rate for all layers, try custom per-layer learning rates! This means replacing the single net:updateParameters(lambda) call in the training loop. Where would you increase them?
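
One possible way to approach the last experiment is sketched below. It relies on the fact that every nn module has its own updateParameters method, so each nn.Linear layer can take its gradient step separately. The helper `update_per_layer` and the rates in `layer_rates` are hypothetical and chosen only for illustration; they are not tuned values.

```lua
require 'nn'

-- Same architecture as in this notebook.
local net = nn.Sequential()
net:add(nn.Linear(32*32, 250))
net:add(nn.SoftSign())
net:add(nn.Linear(250, 10))
local loss = nn.CrossEntropyCriterion()

-- Hypothetical per-layer learning rates, keyed by the layer's index in net.
local layer_rates = { [1] = 0.01, [3] = 0.02 }

-- Possible replacement for net:updateParameters(lambda): every listed
-- layer takes a gradient step with its own learning rate.
local function update_per_layer(model, rates)
    for index, rate in pairs(rates) do
        model:get(index):updateParameters(rate)
    end
end

-- One training step on a random input, just to exercise the helper.
local x = torch.rand(32*32)
local t = 3
net:zeroGradParameters()
local y = net:forward(x)
loss:forward(y, t)
net:backward(x, loss:backward(y, t))
local before = net:get(3).weight:clone()
update_per_layer(net, layer_rates)
print(torch.dist(before, net:get(3).weight))  -- size of the step taken by layer 3
```

Layers without parameters (here the nn.SoftSign at index 2) are simply left out of the table.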

In [None]:
-- this always computes the error over the entire test set
local terr = 0
local p = 0
local ntest = test_data.labels:size(1)
errlist = {}
for i=1,ntest do
    local x = test_data.data[i]
    local t = test_data.labels[i]
    local y = net:forward(x)
    local _, digit = torch.max(y,1)
    digit = digit[1]
    if digit ~= t then 
        errlist[#errlist+1]={i, (t-1), (digit-1)}
        p = p + 1
    end
    terr = terr + loss:forward(y, t)
end
--print(errlist)
print('Testing error = ' .. terr .. ' in counts ' .. p .. '/' .. ntest)

In [None]:
-- This can be used to examine the network's performance on any chosen test digit.
ndx = 126
itorch.image(test_data.data[ndx]:reshape(32,32))
y = net:forward(test_data.data[ndx])
print(torch.cat(torch.range(0,9),y,2))
score, digit = torch.max(y,1)  -- y holds raw class scores, not log-probabilities
digit = digit[1] - 1
target = test_data.labels[ndx]-1
print('Prediction ' .. digit .. ' real label ' .. target)

In [None]:
-- test on an arbitrary image (it must be 32x32 pixels; we keep only the first channel)
require 'image'
img = image.load('handwritten.png')
img = img[1]
itorch.image(img)

In [None]:
y = nn.SoftMax():forward(net:forward(img:reshape(1024)))
print(torch.cat(torch.range(0,9),y,2))
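
The network's last layer outputs unnormalized class scores, because nn.CrossEntropyCriterion applies the log-softmax itself during training. That is why the cell above wraps the output in nn.SoftMax() to obtain probabilities. A minimal plain-Lua sketch of that conversion follows; the function name `softmax` and the example scores are made up for illustration.

```lua
-- Numerically stable softmax over a plain Lua array of scores.
local function softmax(scores)
    local max_score = -math.huge
    for _, s in ipairs(scores) do
        max_score = math.max(max_score, s)
    end
    local probs, total = {}, 0
    for i, s in ipairs(scores) do
        probs[i] = math.exp(s - max_score)  -- subtracting the max avoids overflow
        total = total + probs[i]
    end
    for i = 1, #probs do
        probs[i] = probs[i] / total
    end
    return probs
end

local probs = softmax({2.0, 1.0, -1.0})
for i, p in ipairs(probs) do
    print(i - 1, p)  -- 0-based class label, as in the printout above, and its probability
end
```

The ordering of the classes is unchanged by the softmax, so the predicted digit is the same whether the argmax is taken over the scores or over the probabilities.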