
Only CUDA tensors are supported for cudnn.BatchNormalization! #219

Open
byronwwang opened this issue Jul 13, 2016 · 8 comments
@byronwwang

The layer preceding SpatialBatchNormalization is nn.Abs, which cannot be converted to cudnn.Abs(). Does this cause the problem?

@fmassa
Collaborator

fmassa commented Jul 13, 2016

You can have nn modules running on the GPU, and you can mix nn and cudnn modules, but in that case they must all be on the GPU.
You need to convert your network with :cuda() if you want to use cudnn.
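A minimal sketch of that order of operations (the layer sizes here are arbitrary, just for illustration):

```lua
require 'cunn'
require 'cudnn'

local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())

-- Move the whole network to the GPU first...
model:cuda()
-- ...then swap supported nn modules for their cudnn counterparts.
-- Modules without a cudnn equivalent stay as nn, but still run on the GPU.
cudnn.convert(model, cudnn)

local y = model:forward(torch.rand(1, 3, 8, 8):cuda())
```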

@byronwwang
Author

@fmassa Yes, I did this. When I use `cudnn.convert(model, cudnn, function(module) return torch.type(module):find('SpatialBatchNormalization') end)`, it works.

@fmassa
Collaborator

fmassa commented Jul 14, 2016

Could you write a small working example that illustrates the issue?

@byronwwang
Author

Maybe `nn.Identity()()` caused the problem.

```lua
require 'nngraph'
require 'cutorch'
require 'cunn'
require 'cudnn'

local input = nn.Identity()()

local features = nn.Sequential()
features:add(nn.SpatialConvolution(1,8,5,5,1,1,2,2))
features:add(nn.Abs())
features:add(nn.SpatialBatchNormalization(8,nil,nil,false))
features:add(nn.Tanh())
features:add(nn.SpatialAveragePooling(5,5,2,2,2,2))
features:add(nn.SpatialConvolution(8,1,5,5,1,1,2,2))
features:add(nn.Tanh())
features:add(nn.SpatialAveragePooling(5,5,2,2,2,2))

local classifier = nn.Sequential()
classifier:add(nn.View(64*1*1))
classifier:add(nn.Linear(64, 2))
classifier:add(nn.LogSoftMax())

local model = nn.gModule({input},{classifier(features(input))})

local x = torch.rand(3,1,32,32):type('torch.CudaTensor')

model:cuda()
cudnn.convert(model, cudnn)
--cudnn.convert(model, cudnn, function(module) return torch.type(module):find('SpatialBatchNormalization') end)
local y = model:forward(x)
print(y)
```

@iN1k1

iN1k1 commented Jul 19, 2016

Hi all,
I don't know if my issue is related to the one posted by @byronwwang, but I'm experiencing something weird after updating to cuDNN v5.
If I include a SpatialBatchNormalization layer in any network, I get NaN output during the inference phase. During training, everything goes smoothly as usual.
I noticed that the NaN output is due to the running_mean and running_var variables contained in that layer: all the entries in those vectors are NaN themselves.

The same network was working before I updated to cuDNN v5.

@soumith
Owner

soumith commented Jul 27, 2016

@iN1k1 that is quite strange. During training, did you have a batch size of 1?

@iN1k1

iN1k1 commented Jul 27, 2016

Nope, but I discovered it is related to threads. I started my code on top of your imagenet-multiGPU.torch example. If I run the procedure with a single thread, then everything works. But if I increase the number of threads, I not only get NaNs; even when that does not happen (it seems to be quite random), I obtain exactly the same accuracy no matter how many training epochs have been performed.

@Jongchan

Jongchan commented Nov 28, 2016

@byronwwang
Not sure if you can read Chinese, but the reason is analyzed here.

The problem seems to come from converting an nn-type batch norm to a cudnn-type batch norm. Because the layer is created with affine = false, the nn batch norm layer has no weight and bias tensors, so the assertion that checks the weight/bias type fails after the layer is converted to the cudnn type.
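If that analysis is right, a sketch of the workaround already used earlier in this thread (excluding batch-norm layers from the conversion, so the affine = false layers with nil weight/bias stay as nn modules) would be:

```lua
-- Assumes `model` has already been moved to the GPU with model:cuda().
-- The third argument to cudnn.convert is an exclusion predicate:
-- returning true for a module leaves it unconverted.
cudnn.convert(model, cudnn, function(module)
   return torch.type(module):find('BatchNormalization') ~= nil
end)
```

The excluded nn.SpatialBatchNormalization layers still run on the GPU; they just use the nn implementation instead of the cudnn one.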
