Need help for backward training #91
Comments
@russellfei can you provide a small snippet of your model, along with the input tensor sizes? Your network's math does not seem to work out; maybe you are providing a gradOutput that is too big.
Thx~ @soumith
----------------------------------------------------------------------
function train()
-- epoch tracker
epoch = epoch or 1
-- local vars
local time = sys.clock()
local batchSize = opt.batchSize
-- create augmented dataset
if opt.augment == 'false' then
----> added by r.f.
local totalSize = trainData:size()
-- shuffle at each epoch
trsize = 1680
local shuffle = torch.randperm(trsize):type('torch.LongTensor')
-- BDHW mode
local inputs = torch.Tensor(totalSize,nBands,height,width)
local targets = torch.Tensor(totalSize):zero()
-- shuffle input
inputs = trainData.data:index(1,shuffle)
targets = trainData.labels:index(1,shuffle)
--print('targets of train data')
--print(targets)
-- do one epoch
print('==> doing epoch on training data:')
print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
for t = 1,totalSize,opt.batchSize do
-- disp progress
xlua.progress(t, totalSize)
-- create mini batch
-------------------------------------------------------
-- the key for use CUDA lies in the support in torch lib
-- not in table, as a result, this code will surely fail.
-- need to add flag 'bmode' for handling with cudaconvnet api
-- TBD
------------------------------------------------------------
local input = inputs[t]
local target = targets[t]
-- evaluate function for complete mini batch
---> get all output at first ---------------
--> error: input is not a floatTensor ???
-- essential data format
if opt.type == 'double' then input = input:double() end
if opt.type == 'cuda' then input = input:cuda() end
-- optimize on current mini-batch
------------------------------------------------------------
-- optim function
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
--print('--> data preparation')
local batchSize = opt.batchSize
-- get new parameters
if x ~= parameters then
parameters:copy(x)
end
-- reset gradients
gradParameters:zero()
-- f is the average of all criterions
local f = 0
--print('---> forward propagation')
local outputs = model:forward(input)
outputs = outputs:float()
---> transfer to floatTensor to calculate
-- calculate gradient matrix
local df_do = torch.Tensor(outputs:size())
--print('---> gradients accumulation')
for i = 1,batchSize do
-- estimate f
local err = criterion:forward(outputs, target)
f = f + err
--print('add err 1')
-- estimate df/dW
-- split to calculate df_do
df_do = criterion:backward(outputs,target)
--print('---> backprop')
-- do backwards together
if opt.type == 'cuda' then
model:backward( input,df_do:cuda() )
else
model:backward( input,df_do )
end
-- update confusion
confusion:add(outputs, target)
end
-- normalize gradients and f(X)
gradParameters:div( batchSize )
f = f/batchSize
-- check for convergence at 1st epoch
-- if error doesn't decrease to less than half
-- that model might be diverged.
--print('err: ' .. (f))
-- return f and df/dX
return f,gradParameters
end
--print '------>start to optim'
if optimMethod == optim.asgd then
_,_,average = optimMethod(feval, parameters, optimState)
else
optimMethod(feval, parameters, optimState)
end
end
else
-- augmented inputs and targets
-- store entire augment dataset needs 155G RAM
-- do immediate augment as alternatives
if opt.augment == 'true' then
local bangIdx = 2640
trsize = 1680
local totalSize = bangIdx * trsize
local shuffle = torch.randperm(trsize):type('torch.LongTensor')
-- BDHW mode
local in_inputs = torch.Tensor(trsize,nBands,height,width)
local in_targets = torch.Tensor(trsize):zero()
-- shuffle input
in_inputs = trainData.data:index(1,shuffle)
in_targets = trainData.labels:index(1,shuffle)
-- augmented one image
local inputs = torch.Tensor(bangIdx,nBands,height,width)
local targets = torch.Tensor(bangIdx):zero()
-- do one epoch
print('==> doing epoch on training data:')
print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
for t = 1,totalSize,opt.batchSize do
-- disp progress
xlua.progress(t, totalSize)
-- augment first image
if (t-1) % bangIdx == 0 then
-- originImageIndex: j
local j = torch.ceil(t/bangIdx)
inputs,targets = dataBang(in_inputs[j],in_targets[j])
end
-- create mini batch
--print('==> map index')
-- related idx for inputs
p_idx = t % bangIdx
--print('p idx = '..p_idx..', t = '..t)
local input = inputs[p_idx]
local target = targets[p_idx]
-- essential data format
if opt.type == 'double' then input = input:double() end
if opt.type == 'cuda' then input = input:cuda() end
------------------------------------------------------------
-- optim function
-- create closure to evaluate f(X) and df/dX
local feval = function(x)
--print('--> data preparation')
local batchSize = opt.batchSize
-- get new parameters
if x ~= parameters then
parameters:copy(x)
end
-- reset gradients
gradParameters:zero()
-- f is the average of all criterions
local f = 0
--print('---> forward propagation')
local outputs = model:forward(input)
outputs = outputs:float()
---> transfer to floatTensor to calculate
-- calculate gradient matrix
local df_do = torch.Tensor(outputs:size())
--print('---> gradients accumulation')
for i = 1,batchSize do
-- estimate f
local err = criterion:forward(outputs, target)
f = f + err
--print('add err 1')
-- estimate df/dW
-- split to calculate df_do
df_do = criterion:backward(outputs,target)
--print('---> backprop')
-- do backwards together
if opt.type == 'cuda' then
model:backward( input,df_do:cuda() )
else
model:backward( input,df_do )
end
-- update confusion
confusion:add(outputs, target)
end
-- normalize gradients and f(X)
gradParameters:div( batchSize )
f = f/batchSize
-- check for convergence at 1st epoch
-- if error doesn't decrease to less than half
-- that model might be diverged.
--print('err: ' .. (f))
-- return f and df/dX
return f,gradParameters
end
-- optimize on current mini-batch
--print ('==> start to optim')
if optimMethod == optim.asgd then
_,_,average = optimMethod(feval, parameters, optimState)
else
optimMethod(feval, parameters, optimState)
end
end
else
print 'error at data augment flag value'
end
end
--------end of local optim function--------------------------------------
-- time taken
time = sys.clock() - time
time = time / trainData:size()
print("\n==> time to learn 1 sample = " .. (time*1000) .. 'ms')
-- print confusion matrix
print(confusion)
sys.sleep(1)
-- update logger/plot
trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
if opt.plot then
trainLogger:style{['% mean class accuracy (train set)'] = '-'}
trainLogger:plot()
end
-- save/log current net
local filename = paths.concat(opt.save, 'model.net')
os.execute('mkdir -p ' .. sys.dirname(filename))
print('==> saving model to '..filename)
torch.save(filename, model)
-- next epoch
confusion:zero()
epoch = epoch + 1
end
In the snippet above, there're two identical The According to source code of
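One detail in the augmented branch that may be worth double checking is the index mapping `p_idx = t % bangIdx`. The sketch below is plain Lua (no Torch needed) and only reuses `bangIdx = 2640` from the post; `postedIndex` and `oneBasedIndex` are hypothetical helper names, and the `+ 1` variant is a suggested alternative rather than the author's code. It assumes the loop visits every `t`, as it would with a batch size of 1.

```lua
-- bangIdx mirrors the value used in the posted train() function.
local bangIdx = 2640

-- Mapping as posted: evaluates to 0 whenever t is a multiple of bangIdx,
-- which is not a valid index for a 1-based tensor.
local function postedIndex(t)
  return t % bangIdx
end

-- Suggested 1-based alternative (an assumption, not from the thread):
-- always lands in the range 1..bangIdx.
local function oneBasedIndex(t)
  return (t - 1) % bangIdx + 1
end

-- Print both mappings around the first wrap-around point.
for _, t in ipairs({1, 2639, 2640, 2641}) do
  print(t, postedIndex(t), oneBasedIndex(t))
end
```

If the loop step is larger than 1, `t` may never hit an exact multiple of `bangIdx`, so whether this matters depends on `opt.batchSize`.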
OK, so if one feval is working fine and the other fails, your dataBang function is not giving correctly sized inputs.
Morning~ @soumith
model = nn.Sequential()
if opt.model == 'convnet' then
-- input dimensions
if opt.augment == 'true' then
nBands = 3
width = 112
height = 112
--TODO: specify augmented cnn arch
hidConv = {96,128,256,384,512,768,210}
filtsize = {5,5,3,3,3,3}
poolsize = {2,0,3,0,4,0}
-- stage 1 : filter bank -> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
-- stage 2 : filter bank -> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
model:add(nn.ReLU())
--model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
-- stage 3: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))
-- stage 4: filter bank --> nonlinear --> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
model:add(nn.ReLU())
--model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))
-- stage 5: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))
-- stage 6: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
model:add(nn.ReLU())
--model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))
-- stage 6 : standard 2-layer neural network
model:add(nn.Reshape(hidConv[6]))
model:add(nn.Linear(hidConv[6], hidConv[7]))
model:add(nn.Tanh())
model:add(nn.Linear(hidConv[7], noutputs))
else
if opt.augment == 'false' then
nBands = 3
width = 256
height = 256
-- hidden units, filter sizes (for ConvNet only):
hidConv = {128,256,384,512,768,768,210}
filtsize = {5,7,5,5,3,3}
poolsize = {2,2,2,2,2,3}
-- stage 1 : filter bank -> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(nBands, hidConv[1], filtsize[1], filtsize[1]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[1],2,poolsize[1],poolsize[1],poolsize[1],poolsize[1]))
-- stage 2 : filter bank -> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[1], hidConv[2], filtsize[2], filtsize[2]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[2],2,poolsize[2],poolsize[2],poolsize[2],poolsize[2]))
-- stage 3: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[2], hidConv[3], filtsize[3], filtsize[3]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[3],poolsize[3],poolsize[3],poolsize[3],poolsize[3]))
-- stage 4: filter bank --> nonlinear --> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[3], hidConv[4], filtsize[4], filtsize[4]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[4],poolsize[4],poolsize[4],poolsize[4],poolsize[4]))
-- stage 5: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[4], hidConv[5], filtsize[5], filtsize[5]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[5],poolsize[5],poolsize[5],poolsize[5],poolsize[5]))
-- stage 6: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
model:add(nn.ReLU())
model:add(nn.SpatialLPPooling(hidConv[6],poolsize[6],poolsize[6],poolsize[6],poolsize[6]))
-- stage 6 : standard 2-layer neural network
model:add(nn.Reshape(hidConv[6]))
model:add(nn.Linear(hidConv[6], hidConv[7]))
model:add(nn.Tanh())
model:add(nn.Linear(hidConv[7], noutputs))
end
end
end
The forward process is identical, because only one image is passed forward and backward at a time. I'll check it again.
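To check whether the spatial math of the 112x112 network works out, the sizes can be traced by hand. This is a minimal sketch in plain Lua, assuming unpadded stride-1 convolutions (output side `n - k + 1`) and pooling with kernel equal to stride (output side `floor((n - k) / d) + 1`). The `filtsize` and `poolsize` tables are copied from the augmented branch of the post, with 0 standing for a commented-out pooling layer; `convOut`, `poolOut` and `traceSizes` are hypothetical helper names.

```lua
-- Values copied from the augmented (112x112) branch of the posted model.
local filtsize = {5, 5, 3, 3, 3, 3}
local poolsize = {2, 0, 3, 0, 4, 0}  -- 0 means the pooling line is commented out

-- Output side of an unpadded, stride-1 convolution with a k x k kernel.
local function convOut(n, k)
  return n - k + 1
end

-- Output side of a pooling layer with kernel k and stride d.
local function poolOut(n, k, d)
  return math.floor((n - k) / d) + 1
end

-- Returns the spatial side length after each of the six stages.
local function traceSizes(side)
  local sizes = {}
  for stage = 1, #filtsize do
    side = convOut(side, filtsize[stage])
    if poolsize[stage] > 0 then
      side = poolOut(side, poolsize[stage], poolsize[stage])
    end
    sizes[stage] = side
  end
  return sizes
end

local sizes = traceSizes(112)
print(table.concat(sizes, " -> "))
```

Under these assumptions the side lengths after the six stages are 54, 50, 16, 14, 3 and 1, so the final feature map is 1x1 and `nn.Reshape(hidConv[6])` is consistent as far as spatial size goes. The trace says nothing about the number of planes between layers.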
The network looks fine. However, what I am saying is: check that your dataBang function always gives out 112x112 crops. When doing random crops, you might hit a corner case somewhere.
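A size check of this kind could look like the sketch below. It is plain Lua working on a list of dimensions, so it runs without Torch; `checkDims` is a hypothetical helper, and the expected shape `{3, 112, 112}` comes from `nBands`, `height` and `width` in the post.

```lua
-- Expected per-sample shape from the post: nBands x height x width.
local expected = {3, 112, 112}

-- Hypothetical helper: returns true when dims matches want, otherwise
-- false plus a message naming the first dimension that differs.
local function checkDims(dims, want)
  if #dims ~= #want then
    return false, string.format("expected %d dims, got %d", #want, #dims)
  end
  for i = 1, #want do
    if dims[i] ~= want[i] then
      return false, string.format("dim %d: expected %d, got %d", i, want[i], dims[i])
    end
  end
  return true
end

print(checkDims({3, 112, 112}, expected))
print(checkDims({3, 112, 111}, expected))
```

In the training loop, the dimension list would come from each sample returned by `dataBang`, for instance by wrapping the call in `assert(checkDims(...))`. How the tensor size is converted to a plain table is left open here, since I have not run this against Torch.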
@soumith and the input size for that first image is
input
3
112
112
[torch.LongStorage of size 3]
df_do
21
[torch.LongStorage of size 1]
using this snippet
print()
print('input')
print(#input)
--print('---> forward propagation')
local outputs = model:forward(input)
outputs = outputs:float()
---> transfer to floatTensor to calculate
-- calculate gradient matrix
local df_do = torch.Tensor(outputs:size())
print('df_do')
print(#df_do)
Maybe But I also checked the
The size of df_do should be equal to noutputs, afaik.
Also, try replacing the LPPooling with MaxPooling and see if that works, just to be sure something funky is not going on with LPPooling.
Thanks @soumith
Genius!!! @soumith Notes:
What was the solution?
Maybe there's something wrong with how I call `SpatialLPPooling`
I changed
==> defining some tools
nn.Sequential {
[input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> output]
(1): nn.SpatialConvolutionMM
(2): nn.ReLU
(3): nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Square
(2): nn.SpatialSubSampling
(3): nn.Sqrt
}
(4): nn.SpatialConvolutionMM
(5): nn.ReLU
(6): nn.SpatialConvolutionMM
(7): nn.ReLU
(8): nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Square
(2): nn.SpatialSubSampling
(3): nn.Sqrt
}
(9): nn.SpatialConvolutionMM
(10): nn.ReLU
(11): nn.SpatialConvolutionMM
(12): nn.ReLU
(13): nn.Sequential {
[input -> (1) -> (2) -> (3) -> output]
(1): nn.Square
(2): nn.SpatialSubSampling
(3): nn.Sqrt
}
(14): nn.SpatialConvolutionMM
(15): nn.ReLU
(16): nn.Reshape
(17): nn.Linear
(18): nn.Tanh
(19): nn.Linear
(20): nn.LogSoftMax
}
==> configuring optimizer
==> training!
==> doing epoch on training data:
==> online epoch # 1 [batchSize = 1]
/usr/local/bin/luajit: /usr/local/share/lua/5.1/nn/Sequential.lua:37: size mismatch
stack traceback:
[C]: in function 'updateOutput'
/usr/local/share/lua/5.1/nn/Sequential.lua:37: in function 'forward'
ucmcnn_aug_LP.lua:939: in function 'opfunc'
/usr/local/share/lua/5.1/optim/sgd.lua:40: in function 'optimMethod'
ucmcnn_aug_LP.lua:979: in function 'train'
ucmcnn_aug_LP.lua:1166: in main chunk
[C]: in function 'dofile'
/usr/local/lib/luarocks/rocks/trepl/scm-1/bin/th:109: in main chunk
[C]: at 0x00404480
Well, there is a block of code in
if pnorm == 2 then
self:add(nn.Square())
else
self:add(nn.Power(pnorm))
end
self:add(nn.SpatialSubSampling(nInputPlane, kW, kH, dW, dH))
if pnorm == 2 then
self:add(nn.Sqrt())
else
self:add(nn.Power(1/pnorm))
end
self:get(2).bias:zero()
self:get(2).weight:fill(1)
I think there's some rule to follow when In short, here's something we have to dig into. Too tired to continue, see you~
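For `pnorm == 2`, the quoted block chains a square, a sub-sampling layer with weight 1 and bias 0 (a plain sum over each window), and a square root, so each output value is the L2 norm of its window. The sketch below reproduces that arithmetic in plain Lua on a single 2D plane, assuming non-overlapping k x k windows; `l2Pool` is a hypothetical name and the code is an illustration of the math, not the library implementation.

```lua
-- L2 pooling over non-overlapping k x k windows of one 2D plane,
-- given as a table of rows.
local function l2Pool(img, k)
  local out = {}
  local rows = math.floor(#img / k)
  local cols = math.floor(#img[1] / k)
  for r = 1, rows do
    out[r] = {}
    for c = 1, cols do
      local sum = 0
      for i = (r - 1) * k + 1, r * k do
        for j = (c - 1) * k + 1, c * k do
          -- square each value, then sum over the window
          -- (the sub-sampling step with weight 1 and bias 0)
          sum = sum + img[i][j] ^ 2
        end
      end
      -- square root of the window sum
      out[r][c] = math.sqrt(sum)
    end
  end
  return out
end

-- One 2x2 window holding 3, 4, 0, 0: sqrt(9 + 16) = 5.
local pooled = l2Pool({{3, 4}, {0, 0}}, 2)
print(pooled[1][1])
```

The pooling only changes the spatial size, not the number of planes, which is why the plane count passed to the pooling layer has to match the convolution that precedes it.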
The bug has been caught, a really little bug!
-- stage 6: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[6], hidConv[6], filtsize[6], filtsize[6]))
should be
-- stage 6: filter bank --> nonlinear -> L2 pooling
model:add(nn.SpatialConvolutionMM(hidConv[5], hidConv[6], filtsize[6], filtsize[6]))
However, during these past hours, I've noted another weird thing and I'll open a new issue for this. Thanks~ @soumith
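A mismatch like this can be caught before training by checking that each convolution's input plane count equals the previous convolution's output plane count. The sketch below is plain Lua; the `hidConv` values and the stage 6 line are copied from the augmented branch of the post, while the `convs` table and `firstMismatch` are hypothetical names introduced for illustration.

```lua
local nBands = 3
local hidConv = {96, 128, 256, 384, 512, 768, 210}

-- {nInputPlane, nOutputPlane} for each convolution, as written in the post.
local convs = {
  {nBands,     hidConv[1]},
  {hidConv[1], hidConv[2]},
  {hidConv[2], hidConv[3]},
  {hidConv[3], hidConv[4]},
  {hidConv[4], hidConv[5]},
  {hidConv[6], hidConv[6]},  -- stage 6 as originally posted
}

-- Returns the index of the first layer whose input plane count differs
-- from the previous layer's output, plus both counts; nil if all match.
local function firstMismatch(layers)
  for i = 2, #layers do
    if layers[i][1] ~= layers[i - 1][2] then
      return i, layers[i - 1][2], layers[i][1]
    end
  end
  return nil
end

local stage, produced, expectedIn = firstMismatch(convs)
if stage then
  print(string.format("stage %d: previous layer outputs %d planes, layer expects %d",
    stage, produced, expectedIn))
end
```

With the posted values the check flags stage 6, where the previous layer produces 512 planes and the layer expects 768. Replacing the first entry of that row with `hidConv[5]` makes the chain consistent.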
According to the contribution guidelines of torch, please delete this issue, because it is a personal help request that should have been posted on the mailing list (the Google group, which I often have no access to). Thanks @soumith
Hi, all~
Currently, I'm plugging `nn` modules together through a `Sequential` container. My NN script is adapted from @soumith/galaxyzoo for CUDA usage.
Everything works fine; however, this error message is quite confusing.
I've checked `Sequential.lua` (and found that `model:backward(input, df_do:cuda())` is related to `Module.lua` and `Power.lua` later).
There are two identical code snippets in my script; one of them works fine and the other doesn't.
Can anyone help me figure this out?
BTW, some functions in `Module.lua` do nothing with their input parameters. Are those parameters cleared when `zeroParameters()` is called?
Thanks~