# The simplest neural network
The objective of this notebook is to familiarize you with **three other perspectives** on a neural network, besides viewing it as a computational or flow graph (its structure).  Both the example solved here and the 'network' we use to solve it are intentionally trivial.

The **first** is as a mathematical function with two arguments: an input $x$ and parameters $\theta$.  A neural network is simply a mapping $$y = f(x, \theta)$$

The **second** is as a geometrical object.  A (classification) network with fixed parameters generates a decision surface in sample space.

The **third** looks at the network from an error perspective.  Given a fixed 'training set' $$\{(x_i,y_i) \mid i=1,\ldots,n\},$$ the network exhibits an error on this training set for each possible parameter vector $\theta$.  The training problem is then to find parameters for which the error is acceptably low.  Using a loss function $l(\cdot,\cdot)$, which quantifies the error on a single sample, we can formalize this error surface as $$e(\theta) = \sum_{i=1}^n l(y_i, f(x_i,\theta))$$
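The three views can be sketched in a few lines of plain Lua (no Torch needed). This is only an illustration: the names `f`, `decision_rule` and `error_surface`, as well as the two-sample training set, are made up for this sketch and are not part of the notebook's `utils` module.

```lua
-- Hypothetical toy network: y = sign(theta[1]*x[1] + theta[2]*x[2] + theta[3])
local function f(x, theta)
  local h = theta[1] * x[1] + theta[2] * x[2] + theta[3]
  if h >= 0 then return 1 else return -1 end
end

-- View 1: a function of both the input and the parameters
print(f({10, 1}, {1, -1, 0}))

-- View 2: fix theta and obtain a decision rule over sample space
local function decision_rule(theta)
  return function (x) return f(x, theta) end
end
local rule = decision_rule({1, -1, 0})
print(rule({1, 10}))

-- View 3: fix the training set and obtain the error as a function of theta
local function error_surface(samples, loss)
  return function (theta)
    local e = 0
    for _, s in ipairs(samples) do
      e = e + loss(s.y, f(s.x, theta))
    end
    return e
  end
end

local samples = { {x = {10, 1}, y = 1}, {x = {1, 10}, y = -1} }  -- made-up data
local squared = function (y, out) return 0.5 * (y - out)^2 end
local e = error_surface(samples, squared)
print(e({1, -1, 0}))   -- both samples are classified correctly, so the error is zero
print(e({-1, 1, 0}))   -- both samples are misclassified, each contributing 2
```

Note that `e` takes only the parameters as its argument: the data are baked in, which is exactly what makes it a surface over parameter space.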

In [2]:
-- welcome to the Ostrava workshop on NN and Deep Learning
Plot = require 'itorch.Plot'
require 'utils'
require 'gnuplotx'

In [3]:
-- fix the random seed so that the generated data are reproducible
torch.manualSeed(12345)

## The Reds vs. Blues problem

In [4]:
-- make our dataset
N = 100
stdev = 1.5
Reds, Blues = unpack(utils.make_data(N, stdev))

-- make a helper function that plots our data so we don't need to redo this all the time
function plot_data(title, xlabel, ylabel)
    local p = Plot()
    p:circle(Reds[{{},1}], Reds[{{},2}], 'red', 'reds'):circle(Blues[{{},1}], Blues[{{},2}], 'blue', 'blues'):draw()
    p:xaxis(xlabel):yaxis(ylabel)
    p:title(title):redraw()
    return p
end

-- plot our dataset
plot_data('Reds vs. Blues [stdev=' .. stdev .. ']', 'feature 1 [-]', 'feature 2 [-]')

## A neural network with two weights and a bias

In [5]:
-- This rule computes the result of a perceptron classifier
-- y = sign(w1 * x1 + w2 * x2 + b)
-- where ps = [w1, w2, b] are the weights and the bias, respectively
-- the sign function maps negative values to -1, all others to 1
-- xs = [x1, x2] are the inputs
function perceptron_hard(xs, ps)
  local h = xs[1] * ps[1] + xs[2] * ps[2] + ps[3]
  if h >= 0 then return 1. else return -1. end
end

utils.print_rule_output({10, 1}, {1, -1, 0}, perceptron_hard)
utils.print_rule_output({1, 10}, {1, -1, 0}, perceptron_hard)

Perceptron with weights [1, -1] and bias 0 maps [10, 1] to 1	
Perceptron with weights [1, -1] and bias 0 maps [1, 10] to -1	


In [6]:
-- what happens if we keep the second input fixed but keep increasing the first input
-- over a range?

x1s = torch.range(1, 100) * 0.1
y3s = utils.map(function (x) return perceptron_hard({x, 3}, {1, -1, 0}) end, x1s)
y5s = utils.map(function (x) return perceptron_hard({x, 5}, {1, -1, 0}) end, x1s)
y7s = utils.map(function (x) return perceptron_hard({x, 7}, {1, -1, 0}) end, x1s)

p = Plot():line(x1s, y3s, 'red', '3'):title('Output for fixed x2 and changing x1'):xaxis('x1'):yaxis('output'):draw()
p:line(x1s, y5s, 'green', '5')
p:line(x1s, y7s, 'blue', '7'):legend(true):redraw()

## Geometrical representation of a network
- Where does the perceptron switch from -1 to 1 and vice versa?
- We can look at the perceptron from the perspective of its decision hyperplane
- That is the set of points where $w_1 x_1 + w_2 x_2 + b = 0$
- In two dimensions this is the equation of a line that we can plot (**this line is the separator associated with the parameters**)
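
To draw the separator we can solve $w_1 x_1 + w_2 x_2 + b = 0$ for $x_2$. Below is a minimal sketch in plain Lua, assuming $w_2 \neq 0$. The helper `separator_x2` is hypothetical; how `utils.perceptron_separator` computes its points is not shown in this notebook.

```lua
-- Hypothetical helper: x2 coordinate of the separator at a given x1.
-- ps = {w1, w2, b}; requires w2 ~= 0 (otherwise the line is vertical).
local function separator_x2(ps, x1)
  assert(ps[2] ~= 0, 'w2 must be non-zero')
  return -(ps[1] * x1 + ps[3]) / ps[2]
end

local ps = {1, -1, -5}
for _, x1 in ipairs({0, 5, 10}) do
  local x2 = separator_x2(ps, x1)
  -- every point on the separator gives a pre-activation h of zero
  local h = ps[1] * x1 + ps[2] * x2 + ps[3]
  print(x1, x2, h)
end
```

Scaling all three parameters by a positive constant leaves the line unchanged. Multiplying them by -1 also keeps the line but swaps which side is mapped to 1, which is why `{1, -1, -5}` and `{-1, 1, 5}` share a separator yet classify the points oppositely.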

In [7]:
p = Plot()
p:circle(Reds[{{},1}], Reds[{{},2}], 'red', 'reds'):circle(Blues[{{},1}], Blues[{{},2}], 'blue', 'blues'):draw()
p:xaxis('x1 [-]'):yaxis('x2 [-]')
p:title('Example decision surfaces for different parameters'):redraw()
pss = { {1, -1, -5}}
for i=1,#pss do
    xx, yy = utils.perceptron_separator(pss[i])
    p:line(xx, yy, '#000000')
end
p:redraw()

In [8]:
-- let's try to classify reds with a 1 and blues with a -1
-- which of the above curves looks best?

-- first we rearrange our information into data and labels
train_data = torch.cat(Reds, Blues, 1)
train_labels = torch.cat(torch.ones(N/2), - torch.ones(N/2), 1)

-- look at some example results
for i=40,60 do
    local train_i, label_i = train_data[i], train_labels[i]
    --FIXME fill in example here
    local res_i = perceptron_hard(train_i, {-1, 1, 5})
    print('Data ' .. utils.vec2str(train_i) .. ' map to ' .. res_i .. ' target is ' .. label_i .. ' difference is ' .. (label_i - res_i))
end

Data [4.52375, 11.5243] map to 1 target is 1 difference is 0	
Data [3.30055, 9.88281] map to 1 target is 1 difference is 0	
Data [2.87942, 12.1371] map to 1 target is 1 difference is 0	
Data [4.29597, 7.51912] map to 1 target is 1 difference is 0	
Data [1.60533, 10.5039] map to 1 target is 1 difference is 0	
Data [3.60234, 8.14657] map to 1 target is 1 difference is 0	
Data [5.28854, 10.4837] map to 1 target is 1 difference is 0	
Data [3.94889, 9.41947] map to 1 target is 1 difference is 0	
Data [7.04433, 7.76757] map to 1 target is 1 difference is 0	
Data [4.2667, 11.4489] map to 1 target is 1 difference is 0	
Data [5.29223, 12.9837] map to 1 target is 1 difference is 0	
Data [9.67164, 5.48929] map to 1 target is -1 difference is -2	
Data [9.93426, 8.17141] map to 1 target is -1 difference is -2	
Data [11.4127, 5.08333] map to -1 target is -1 difference is 0	
Data [12.4855, 7.63803] map to 1 target is -1 difference is -2	
Data [10.1739, 5.63366] map to 1 target is -1 difference is -2	
Data [8.53568, 2.70081] map to -1 target is -1 difference is 0	
Data [11.1372, 3.37552] map to -1 target is -1 difference is 0	
Data [9.4488, 5.04929] map to 1 target is -1 difference is -2	
Data [9.86975, 7.15336] map to 1 target is -1 difference is -2	
Data [9.89294, 3.72821] map to -1 target is -1 difference is 0	


## Loss functions

In [11]:
-- that's nice but we need to measure our error quantitatively
-- to do this, we need a 'loss function'
-- the loss function below is the first we will encounter, the squared error loss
-- (averaged over the dataset below, it becomes the mean squared error, MSE)
-- note: sometimes the 0.5 factor is omitted from the definition; this rescales the loss
-- but does not change which parameters are optimal
mse_loss = function (y, t) return 0.5 * (y-t)^2 end

-- this function accepts a data tensor, associated targets, the decision rule and the loss function
-- and computes the mean 'loss' over the entire set
function dataset_error(data, targets, rule, loss)
  local tloss = 0
  for i=1,data:size(1) do
    -- add up the loss of the rule on each sample
    tloss = tloss + loss(rule(data[i]), targets[i])
  end
  return tloss / data:size(1)
end

In [12]:
-- let's test this for our perceptron with some parameters
-- change these to see if you can find a smaller loss
-- FIXME: fill in the rule here
my_perceptron = function (xs) return perceptron_hard(xs, {-1,1,5}) end
print('MSE of my_perceptron is ' .. dataset_error(train_data, train_labels, my_perceptron, mse_loss))

MSE of my_perceptron is 0.62	
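
Because the hard perceptron outputs only $\pm 1$, this number has a direct interpretation. A correct prediction contributes $0$ to the sum and a wrong one contributes $\tfrac{1}{2}(\pm 2)^2 = 2$, so with $m$ misclassified samples out of $n$:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^n \tfrac{1}{2}\,\big(f(x_i,\theta) - y_i\big)^2 = \frac{2m}{n}$$

With the $n = 100$ samples constructed above, an MSE of 0.62 corresponds to $m = 31$ misclassified points.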


## The 'error surface' representation of a network
Given a fixed training set (pairs x,y), if we change the parameters of the network, our error changes as well.
For our super-simple network, we can plot this and examine the errors for different weights (we keep the bias zero).  Can you pick out the best parameters?

In [18]:
-- in principle, we can test the perceptron for many possible parameters and paint the result!

-- this computes the errors for 21 x 21 weight combinations (w1 and w2 in -1, -0.9, ..., 1)
hard_training_errors = torch.zeros(21, 21)
w1_train_grid = torch.zeros(21, 21)
w2_train_grid = torch.zeros(21, 21)
for i=-10,10 do
  for j=-10,10 do
    -- the rule whose error is plotted: hard perceptron with weights (i/10, j/10) and zero bias
    local f_hard = function (xs) return perceptron_hard(xs, {i/10, j/10, 0}) end
    local err_hard = dataset_error(train_data, train_labels, f_hard, mse_loss)
    hard_training_errors[{i+11,j+11}] = err_hard

    w1_train_grid[{i+11,j+11}] = i/10
    w2_train_grid[{i+11,j+11}] = j/10
    -- this is a colormap plot, where green = 0 and red = 256
    --p:circle({i/10}, {j/10}, string.format('#%02x%02x00',err_hard,255-err_hard))
  end
end

--p:redraw()
print(hard_training_errors:size())

-- Plot the error surface
gnuplotx.figure(1)
gnuplotx.xlabel('w1')
gnuplotx.ylabel('w2')
gnuplotx.zlabel('error')
gnuplotx.title('Error surface for 2-weight hard perceptron [bias=0]')
gnuplotx.splot(w1_train_grid,w2_train_grid,hard_training_errors)


 21
 21
[torch.LongStorage of size 2]



In [19]:
-- From the error surface, what's the best perceptron then?
p = Plot()
p:circle(Reds[{{},1}], Reds[{{},2}], 'red', 'reds'):circle(Blues[{{},1}], Blues[{{},2}], 'blue', 'blues'):draw()
p:xaxis('x1 [-]'):yaxis('x2 [-]')
p:title('Decision surface for my best rule')

-- FIXME: fill best perceptron parameters into next line
xx, yy = utils.perceptron_separator({-0.5,0.5,0})
p:line(xx, yy, '#000000'):redraw()

## Training a neural network
That was easy, but real problems are never this easy.  First of all, with more than 3 dimensions visualization becomes almost impossible.  Moreover, a grid with 21 values per parameter requires $21^d$ error evaluations for $d$ parameters, so with 10000 dimensions this is not the way to go :)

We need an automatic learning process.  All the automatic learning processes we will use rely on the algorithm looking around in the immediate neighborhood of the current parameters and 'looking' for a good direction to go.

With our hard decision rule, however, we have a problem ...

In [20]:
-- Optimization algorithms rely on local information around the current point in parameter space.
-- Here, you are given a working point, try to change the parameters by e.g. 0.01, what happens to the error?
-- Based on these tiny changes, can you decide which way to go?
my_perceptron = function (xs) return perceptron_hard(xs, {-1.00, 1.30, 0}) end
print('MSE of my_perceptron is ' .. dataset_error(train_data, train_labels, my_perceptron, mse_loss))

MSE of my_perceptron is 0.08	


In [9]:
-- Let's have a look at this for a fixed parameter w2 = 1 and varying w1.
-- For a range of w1, we compute the error.  This is a 'slice' of the error surface we discussed.
-- If you happen to start at w1=-1.15 (and w2 = 1), where would you move if you could only 'see' up to distance 0.01?
-- What is your conclusion?
w1s = torch.range(-10, 10) * 0.03 - 1
mses = torch.Tensor(w1s:size())
for i=1,w1s:size(1) do
    local rule = function (xs) return perceptron_hard(xs, {w1s[i],1,0}) end
    mses[i] = dataset_error(train_data, train_labels, rule, mse_loss)
end

p = Plot():line(w1s, mses):title('MSE of hard perceptron vs. w1'):xaxis('w1'):yaxis('MSE'):draw()


## The 'soft' perceptron

In [22]:
-- the soft perceptron rule
function perceptron_soft(xs, ps, a)
  local h = xs[1] * ps[1] + xs[2] * ps[2] + ps[3]
  return math.tanh(h * a)
end
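
Written as a formula, the soft rule replaces the sign function by a hyperbolic tangent with steepness $a$:

$$y = \tanh(a\,h), \qquad h = w_1 x_1 + w_2 x_2 + b, \qquad \frac{\partial y}{\partial h} = a\,(1 - y^2)$$

For $a \to \infty$ the output approaches the hard rule $\operatorname{sign}(h)$ for every $h \neq 0$, while for any finite $a$ the output changes smoothly when the parameters change.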

In [23]:
-- compute the error surface for the soft perceptron
soft_training_errors = torch.zeros(21,21)
-- this computes the errors for 21 x 21 weight combinations
for i=-10,10 do
  for j=-10,10 do
    -- the soft perceptron with weights (i/10, j/10), zero bias and steepness a = 1
    local f_soft = function (xs) return perceptron_soft(xs, {i/10, j/10, 0}, 1) end
    local err_soft = dataset_error(train_data, train_labels, f_soft, mse_loss)

    soft_training_errors[{i+11,j+11}] = err_soft
  end
end

In [24]:
-- Plot the error surface for the soft perceptron
gnuplotx.figure(2)
gnuplotx.xlabel('w1')
gnuplotx.ylabel('w2')
gnuplotx.zlabel('error')
gnuplotx.title('Error surface for 2-weight soft perceptron [bias=0]')
gnuplotx.splot(w1_train_grid,w2_train_grid,soft_training_errors)

In [25]:
x1s = torch.range(1, 100) * 0.1
y3s = utils.map(function (x) return perceptron_soft({x, 3}, {1, -1, 0}, 1) end, x1s)
y5s = utils.map(function (x) return perceptron_soft({x, 5}, {1, -1, 0}, 1) end, x1s)
y7s = utils.map(function (x) return perceptron_soft({x, 7}, {1, -1, 0}, 1) end, x1s)

p = Plot():line(x1s, y3s, 'red', 'x2=3'):title('Output of soft perceptron for fixed x2 and changing x1'):xaxis('x1 [-]'):yaxis('output'):legend(true):draw()
p:line(x1s, y5s, 'green', 'x2=5')
p:line(x1s, y7s, 'blue', 'x2=7'):redraw()

In [26]:
x1s = torch.range(1, 100) * 0.1
y3s = utils.map(function (x) return perceptron_soft({x, 5}, {1, -1, 0}, 0.1) end, x1s)
y5s = utils.map(function (x) return perceptron_soft({x, 5}, {1, -1, 0}, 1) end, x1s)
y7s = utils.map(function (x) return perceptron_soft({x, 5}, {1, -1, 0}, 10) end, x1s)

p = Plot():line(x1s, y3s, 'red', 'a=0.1'):title('Output of soft perceptron for [x1,5] with changing a'):xaxis('x1 [-]'):yaxis('output'):legend(true):draw()
p:line(x1s, y5s, 'green', 'a=1'):line(x1s, y7s, 'blue', 'a=10'):redraw()

In [27]:
-- remember the soft perceptron? let's try again with that rule
-- if you change the parameters by a tiny bit, what happens?
-- try to change the parameters by e.g. 0.001
my_perceptron = function (xs) return perceptron_soft(xs, {-1.7, 1.1, 0}, 1) end
print('MSE of my_perceptron is ' .. dataset_error(train_data, train_labels, my_perceptron, mse_loss))

MSE of my_perceptron is 0.18014532898922	


In [28]:
-- Let's compute the error for a number of params for the SOFT perceptron
-- This error surface 'slice' differs substantially from the (corresponding) slice above for the HARD perceptron.
w1s = torch.range(-10, 10) * 0.03 - 1
mses = torch.Tensor(w1s:size())
for i=1,w1s:size(1) do
    local rule = function (xs) return perceptron_soft(xs, {w1s[i],1,0}, 1) end
    mses[i] = dataset_error(train_data, train_labels, rule, mse_loss)
end

p = Plot():line(w1s, mses):title('MSE of soft perceptron vs. w1'):xaxis('w1'):yaxis('MSE'):draw()
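
The difference between the two slices can also be checked numerically. The plain-Lua sketch below estimates the slope of the error with respect to `w1` by a central finite difference, once with the hard rule and once with the soft rule. The four data points and the helper names (`mse`, `slope`) are made up for illustration and are not the notebook's dataset.

```lua
-- Made-up data: inputs x and targets t
local data = { {x = {1, 3}, t = 1}, {x = {2, 5}, t = 1},
               {x = {4, 1}, t = -1}, {x = {6, 2}, t = -1} }

-- numerically stable tanh (math.tanh is not available in recent Lua versions)
local function tanh(z)
  local e = math.exp(-2 * math.abs(z))
  local t = (1 - e) / (1 + e)
  if z >= 0 then return t else return -t end
end

local function hard(h) if h >= 0 then return 1 else return -1 end end

-- mean of 0.5 * (y - t)^2 over the data for weights (w1, w2) and zero bias
local function mse(act, w1, w2)
  local total = 0
  for _, s in ipairs(data) do
    local y = act(w1 * s.x[1] + w2 * s.x[2])
    total = total + 0.5 * (y - s.t)^2
  end
  return total / #data
end

-- central finite difference of the error with respect to w1
local function slope(act, w1, w2, eps)
  return (mse(act, w1 + eps, w2) - mse(act, w1 - eps, w2)) / (2 * eps)
end

local slope_hard = slope(hard, -0.5, 1, 0.01)
local slope_soft = slope(tanh, -0.5, 1, 0.01)
print('hard: ' .. slope_hard)  -- zero unless some point changes class within +-eps
print('soft: ' .. slope_soft)  -- non-zero, so it tells us which way to move
```

With the hard rule the error is piecewise constant, so a small step almost never changes it; the soft rule gives a usable slope at the same point.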

## Training process
Let's start building an automatic procedure to train this network.  What we need is

1. a starting point
2. a procedure to step from any point to a (hopefully better) point
3. a rule that decides when to stop this process

Here we make the following choices:

1. I pick some starting points that demonstrate some principle/problem for you :)
2. We use gradient descent (I will derive the update rule).  The code below uses the batch variant, which averages the gradient over the whole training set before each update; the online variant would instead update after every sample
3. We stop after a fixed number of applications of rule (2)

**Notes**
- notice how information flows alternately in two directions during training
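
The update rule follows from the chain rule. As a sketch, write $t$ for the target and $y$ for the network output (as the code does), and consider one sample with $h = w_1 x_1 + w_2 x_2 + b$, output $y = \tanh(h)$ (steepness $a = 1$) and loss $l = \tfrac{1}{2}(y - t)^2$. Then

$$\frac{\partial l}{\partial w_k} = \frac{\partial l}{\partial y}\,\frac{\partial y}{\partial h}\,\frac{\partial h}{\partial w_k} = (y - t)\,(1 - y^2)\,x_k, \qquad \frac{\partial l}{\partial b} = (y - t)\,(1 - y^2)$$

A gradient step with learning rate $\lambda$ moves the parameters against the gradient $g$, here taken as the average of the per-sample gradients over the training set:

$$\theta \leftarrow \theta - \lambda\, g$$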

In [29]:
-- we can try to descend along this error surface from a random initialization
-- and always head in the direction of largest decrease of error (gradient descent)

-- training parameters (try other learning rates - 0.01, 0.001, ...)
lambda = 0.1

-- number of times the algorithm should go through all of the data
epochs = 50

-- Some interesting 'random' initializations
--ps = torch.Tensor{-1, 0.5, 0}
--ps = torch.Tensor{0.01, 0, 0}
--ps = torch.Tensor{0.1, 0.15, 0}
ps = torch.Tensor{0.25, -0.3, 0}
--ps = torch.Tensor{-0.02, -0.01, 0} -- another initialization, try this!
--ps = torch.Tensor{-0.8, -0.8, 0} -- same separator (plot it above!) but works differently? how? why?
--ps = torch.Tensor{torch.normal(), torch.normal(), 0}

-- parameters after each epoch will be stored here
pss = {{ps[1],ps[2],ps[3]}}

print('Initial parameters ' .. utils.vec2str(ps))

-- store the error before training (at index 1)
errs = torch.Tensor(epochs+1)
local rule = function (xs) return perceptron_soft(xs, ps, 1) end
errs[1] = dataset_error(train_data, train_labels, rule, mse_loss)

-- go through all of the data 'epochs' times
for e=1,epochs do
    -- we accumulate the training error here
    local total_err = 0
    
    -- gradient storage (we accumulate all gradients across the entire dataset)
    local dw1 = 0
    local dw2 = 0
    local db = 0
    
    -- iterate over all training examples
    for i=1,train_labels:size(1) do
        
        -- get the i-th sample (input and output/label/target)
        local xs, target = train_data[i], train_labels[i]
        
        -- compute forward pass (the soft perceptron rule with steepness 1, written out manually)
        local h = xs[1] * ps[1] + xs[2] * ps[2] + ps[3]
        local y = math.tanh(h)
        
        -- accumulate total error (sum of squared errors, without the 0.5 factor)
        total_err = total_err + (target - y)^2

        -- compute the backward pass (gradient of 0.5 * (y - target)^2 w.r.t. the weights)
        dw1 = dw1 + (y - target) * (1 - y^2) * xs[1]
        dw2 = dw2 + (y - target) * (1 - y^2) * xs[2]
        
        -- we don't train the bias in these examples but if you want to, try it
        --db= db + (y - target) * (1 - y^2)
    end

    local dps = torch.Tensor{dw1, dw2, db} / train_labels:size(1)
    print('Gradient for epoch ' .. e .. ' is ' .. utils.vec2str(dps) .. ' total_err = ' .. total_err)
        
    -- the gradient update rule
    ps = ps - dps * lambda
    
    -- we manually copy the current parameter table (fastest and easiest)
    pss[e+1] = {ps[1], ps[2], ps[3]}
    
    -- compute the error 
    local rule = function (xs) return perceptron_soft(xs, ps, 1) end
    errs[e+1] = dataset_error(train_data, train_labels, rule, mse_loss)
end

-- note, at zero we have the initial error
Plot():line(torch.range(0,epochs), errs, 'red', 'MSE'):circle(torch.range(0,epochs), errs, 'red', 'MSE'):xaxis('epoch'):yaxis('MSE'):title('MSE vs epoch'):draw()
print('Final parameters ' .. utils.vec2str(ps))

final_rule = function (xs) return perceptron_hard(xs, ps) end
print('Error of hard perceptron (with same parameters) is ' .. dataset_error(train_data, train_labels, final_rule, mse_loss))

Initial parameters [0.25, -0.3, 0]	


Gradient for epoch 1 is [2.74659, 0.646956, 0] total_err = 321.63443272704
Gradient for epoch 2 is [0.00495307, -0.0277399, 0] total_err = 199.68870458433
Gradient for epoch 3 is [0.00460753, -0.029128, 0] total_err = 199.67245862762
Gradient for epoch 4 is [0.00425989, -0.0306561, 0] total_err = 199.65463948534
Gradient for epoch 5 is [0.00390681, -0.0323474, 0] total_err = 199.63498139031
Gradient for epoch 6 is [0.00354452, -0.0342307, 0] total_err = 199.61315931181
Gradient for epoch 7 is [0.00316867, -0.0363412, 0] total_err = 199.58877096446
Gradient for epoch 8 is [0.00277404, -0.0387236, 0] total_err = 199.56131198302
Gradient for epoch 9 is [0.00235428, -0.0414346, 0] total_err = 199.53014102384
Gradient for epoch 10 is [0.00190141, -0.0445475, 0] total_err = 199.4944296865
Gradient for epoch 11 is [0.0014052, -0.0481589, 0] total_err = 199.45308898136
Gradient for epoch 12 is [0.000852141, -0.0523983, 0] total_err = 199.40465851875
Gradient for epoch 13 is [0.000223932, -0.0574437, 0] total_err = 199.34713449984
Gradient for epoch 14 is [-0.000504937, -0.063546, 0] total_err = 199.27769343354
Gradient for epoch 15 is [-0.00137125, -0.0710706, 0] total_err = 199.19223032556
Gradient for epoch 16 is [-0.00243007, -0.0805692, 0] total_err = 199.08454950768
Gradient for epoch 17 is [-0.0037677, -0.0929139, 0] total_err = 198.94486424337
Gradient for epoch 18 is [-0.00552698, -0.109564, 0] total_err = 198.75681486643
Gradient for epoch 19 is [-0.00796143, -0.13314, 0] total_err = 198.49100409599
Gradient for epoch 20 is [-0.0115635, -0.168826, 0] total_err = 198.08931256054
Gradient for epoch 21 is [-0.0174144, -0.228288, 0] total_err = 197.42061581704
Gradient for epoch 22 is [-0.0283368, -0.343328, 0] total_err = 196.12553804755
Gradient for epoch 23 is [-0.0537945, -0.631299, 0] total_err = 192.85375973551
Gradient for epoch 24 is [-0.12663, -1.8148, 0] total_err = 178.44231466427
Gradient for epoch 25 is [4.83218, 1.16311, 0] total_err = 92.545005728558
Gradient for epoch 26 is [-0.0950398, -0.327762, 0] total_err = 196.54730937875
Gradient for epoch 27 is [-0.181859, -0.62167, 0] total_err = 193.27174294909
Gradient for epoch 28 is [-0.548628, -1.71821, 0] total_err = 178.23774906974
Gradient for epoch 29 is [-1.66808, -2.97112, 0] total_err = 54.07755631298
Gradient for epoch 30 is [4.05529, 2.13487, 0] total_err = 80.112197890965
Gradient for epoch 31 is [-1.33146, -2.60604, 0] total_err = 79.022222178042
Gradient for epoch 32 is [0.107526, 0.0608539, 0] total_err = 1.9382542920109
Gradient for epoch 33 is [0.0272595, -0.0112046, 0] total_err = 1.7651581589315
Gradient for epoch 34 is [0.0193887, -0.0181755, 0] total_err = 1.7491559186219
Gradient for epoch 35 is [0.0181356, -0.0189857, 0] total_err = 1.7351264898375
Gradient for epoch 36 is [0.0177989, -0.0189466, 0] total_err = 1.7214079955406
Gradient for epoch 37 is [0.0176008, -0.0187858, 0] total_err = 1.7079583589133
Gradient for epoch 38 is [0.0174273, -0.0186099, 0] total_err = 1.6947681689442
Gradient for epoch 39 is [0.0172609, -0.0184348, 0] total_err = 1.6818291285483
Gradient for epoch 40 is [0.0170986, -0.0182631, 0] total_err = 1.6691333260426
Gradient for epoch 41 is [0.01694, -0.018095, 0] total_err = 1.656673197008
Gradient for epoch 42 is [0.0167847, -0.0179306, 0] total_err = 1.64444150444
Gradient for epoch 43 is [0.0166328, -0.0177696, 0] total_err = 1.6324313206788
Gradient for epoch 44 is [0.016484, -0.017612, 0] total_err = 1.6206360105626
Gradient for epoch 45 is [0.0163384, -0.0174576, 0] total_err = 1.6090492156907
Gradient for epoch 46 is [0.0161956, -0.0173064, 0] total_err = 1.5976648397111
Gradient for epoch 47 is [0.0160558, -0.0171582, 0] total_err = 1.5864770345547
Gradient for epoch 48 is [0.0159187, -0.017013, 0] total_err = 1.5754801875431
Gradient for epoch 49 is [0.0157844, -0.0168706, 0] total_err = 1.5646689093075
Gradient for epoch 50 is [0.0156527, -0.016731, 0] total_err = 1.5540380224572


Final parameters [-0.550593, 0.576486, 0]	


Error of hard perceptron (with same parameters) is 0	
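
One way to gain confidence in a hand-derived backward pass like the one in the training loop is to compare it with a finite-difference estimate of the gradient. The sketch below does this in plain Lua on three made-up samples; `loss`, `analytic_grad` and `numeric_grad` are hypothetical helpers written for this illustration, not part of the notebook.

```lua
-- Made-up data: inputs x and targets t
local data = { {x = {1, 3}, t = 1}, {x = {4, 1}, t = -1}, {x = {2, 2}, t = 1} }

-- numerically stable tanh (math.tanh is not available in recent Lua versions)
local function tanh(z)
  local e = math.exp(-2 * math.abs(z))
  local t = (1 - e) / (1 + e)
  if z >= 0 then return t else return -t end
end

-- mean of 0.5 * (y - t)^2 with y = tanh(w1*x1 + w2*x2 + b)
local function loss(ps)
  local total = 0
  for _, s in ipairs(data) do
    local y = tanh(ps[1] * s.x[1] + ps[2] * s.x[2] + ps[3])
    total = total + 0.5 * (y - s.t)^2
  end
  return total / #data
end

-- backward pass: (y - t) * (1 - y^2) * input, averaged over the data
local function analytic_grad(ps)
  local g = {0, 0, 0}
  for _, s in ipairs(data) do
    local y = tanh(ps[1] * s.x[1] + ps[2] * s.x[2] + ps[3])
    local d = (y - s.t) * (1 - y^2)
    g[1] = g[1] + d * s.x[1] / #data
    g[2] = g[2] + d * s.x[2] / #data
    g[3] = g[3] + d / #data
  end
  return g
end

-- central differences, one parameter at a time
local function numeric_grad(ps, eps)
  local g = {}
  for k = 1, #ps do
    local up, down = {ps[1], ps[2], ps[3]}, {ps[1], ps[2], ps[3]}
    up[k], down[k] = up[k] + eps, down[k] - eps
    g[k] = (loss(up) - loss(down)) / (2 * eps)
  end
  return g
end

local ps = {0.25, -0.3, 0}
local ga, gn = analytic_grad(ps), numeric_grad(ps, 1e-5)
local max_diff = 0
for k = 1, 3 do
  print(k, ga[k], gn[k])
  max_diff = math.max(max_diff, math.abs(ga[k] - gn[k]))
end
print('largest difference: ' .. max_diff)
```

If the two columns disagree by much more than rounding error, the derivation or its implementation most likely contains a mistake.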


In [30]:
-- print the parameters after the last epoch
print(pss[#pss])

{
  1 : -0.55059279673384
  2 : 0.57648584057477
  3 : 0
}



In [31]:
p = plot_data('Decision surface after selected epochs', 'feature 1 [-]', 'feature 2 [-]')
sel_epochs = {1, 5, 10, 20, 30, 40, 50}
for i=1,#sel_epochs do
    local e = sel_epochs[i]
    if e <= epochs then
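        -- pss[1] holds the initial parameters, so the parameters after epoch e are at index e+1
        xx,yy = utils.perceptron_separator(pss[e + 1])
        -- another colormap hack - greener epochs are later (green channel kept within 0..255)
        p:line(xx, yy, string.format('#40%02x40', math.floor(e * 255 / epochs)), 'ep. ' .. e)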
    end
end
p:redraw()

### Let's visualize this training
In the following plot, we will show the path the algorithm took during training.
Rerun these blocks for different starting points, change the learning rate `lambda`, and observe the changes.
Summarize your findings.

In [32]:
-- store the optimizer path: one row (w1, w2, error) per stored parameter vector
sgd_path = torch.cat(torch.Tensor{pss}[{1,{},{1,2}}],errs,2)

In [33]:
-- We now plot the optimizer paths that we stored
gnuplotx.figure(3)
gnuplotx.xlabel('w1')
gnuplotx.ylabel('w2')
gnuplotx.zlabel('error')
gnuplotx.raw('set xrange [-1:1]')
gnuplotx.raw('set yrange [-1:1]')
gnuplotx.raw('set zrange [0:200]')
gnuplotx.displaypaths({w1_train_grid,w2_train_grid,soft_training_errors},{'sgd_1',sgd_path})