
Only CUDA tensors are supported for cudnn.BatchNormalization! #219

Open
byronwwang opened this issue Jul 13, 2016 · 8 comments
@byronwwang

The layer preceding SpatialBatchNormalization is nn.Abs, which cannot be converted to cudnn.Abs(). Does this cause the problem?

@fmassa
Collaborator

fmassa commented Jul 13, 2016

You can have nn modules running on the GPU, and you can mix nn and cudnn modules, but in that case they must all be on the GPU.
You need to convert your network with :cuda() if you want to use cudnn.
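A minimal sketch of that order of operations (the layer sizes here are arbitrary, just for illustration):

```lua
require 'cunn'
require 'cudnn'

local model = nn.Sequential()
model:add(nn.SpatialConvolution(3, 16, 3, 3, 1, 1, 1, 1))
model:add(nn.ReLU())

-- Move the whole network to the GPU first...
model:cuda()
-- ...then swap supported nn modules for their cudnn counterparts.
-- Modules without a cudnn equivalent stay as nn, but still run on the GPU.
cudnn.convert(model, cudnn)

local y = model:forward(torch.rand(1, 3, 8, 8):cuda())
```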

@byronwwang
Author

@fmassa Yes, I did this. When I use `cudnn.convert(model, cudnn, function(module) return torch.type(module):find('SpatialBatchNormalization') end)`, it works.

@fmassa
Collaborator

fmassa commented Jul 14, 2016

Could you write a small working example that illustrates the issue?

@byronwwang
Author

Maybe `nn.Identity()()` caused the problem.

```lua
require 'nngraph'
require 'cutorch'
require 'cunn'
require 'cudnn'

local input = nn.Identity()()

local features = nn.Sequential()
features:add(nn.SpatialConvolution(1,8,5,5,1,1,2,2))
features:add(nn.Abs())
features:add(nn.SpatialBatchNormalization(8,nil,nil,false))
features:add(nn.Tanh())
features:add(nn.SpatialAveragePooling(5,5,2,2,2,2))
features:add(nn.SpatialConvolution(8,1,5,5,1,1,2,2))
features:add(nn.Tanh())
features:add(nn.SpatialAveragePooling(5,5,2,2,2,2))

local classifier = nn.Sequential()
classifier:add(nn.View(64*1*1))
classifier:add(nn.Linear(64, 2))
classifier:add(nn.LogSoftMax())

local model = nn.gModule({input},{classifier(features(input))})

local x = torch.rand(3,1,32,32):type('torch.CudaTensor')

model:cuda()
cudnn.convert(model, cudnn)
--cudnn.convert(model, cudnn, function(module) return torch.type(module):find('SpatialBatchNormalization') end)
local y = model:forward(x)
print(y)
```

@iN1k1

iN1k1 commented Jul 19, 2016

Hi all,
I don't know if my issue is related to the one posted by @byronwwang, but I'm experiencing something weird after updating to cuDNN v5.
If I include a SpatialBatchNormalization layer in any network, I get NaN output during the inference phase. During training, everything goes smoothly as usual.
I noticed that the NaN output is due to the running_mean and running_var variables contained in that layer: all the entries in those vectors are NaN themselves.

The same network was working before I updated to cuDNN v5.

@soumith
Owner

soumith commented Jul 27, 2016

@iN1k1 that is quite strange. During training, did you have a batch size of 1?

@iN1k1

iN1k1 commented Jul 27, 2016

Nope, but I discovered it is related to threads. I started my code on top of your imagenet-multiGPU.torch example. If I run the procedure with a single thread, then everything works. But if I increase the number of threads, I not only get NaNs; even when that does not happen (it seems to be quite random), I obtain exactly the same accuracy no matter how many training epochs have been performed.

@Jongchan

Jongchan commented Nov 28, 2016

@byronwwang
Not sure if you can read Chinese, but the reason is analyzed here.

The problem seems to come from converting an nn-type batch norm to a cudnn-type batch norm. Because the layer is created with affine = false, the nn batch norm layer has no weight and bias tensors, so the assertion that checks the weight/bias type fails after the layer is converted to the cudnn type.
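If that analysis is right, a sketch of the workaround already used earlier in this thread (excluding batch-norm layers from the conversion, so the affine = false layers with nil weight/bias stay as nn modules) would be:

```lua
-- Assumes `model` has already been moved to the GPU with model:cuda().
-- The third argument to cudnn.convert is an exclusion predicate:
-- returning true for a module leaves it unconverted.
cudnn.convert(model, cudnn, function(module)
   return torch.type(module):find('BatchNormalization') ~= nil
end)
```

The excluded nn.SpatialBatchNormalization layers still run on the GPU; they just use the nn implementation instead of the cudnn one.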
