
OS X R2 Tanh and SoftMax tests fail #9

Closed
szagoruyko opened this issue Dec 19, 2014 · 15 comments

@szagoruyko (Collaborator)

Just tested on Ubuntu and all tests pass, but on OS X they don't:

____*__*______  ==> Done Completed 50 asserts in 14 tests with 3 errors
--------------------------------------------------------------------------------
Tanh_single
error on state (forward)
 LT(<) violation   val=nan, condition=0.0001
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:329: in function <test/test.lua:303>

--------------------------------------------------------------------------------
Tanh_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:332: in function <test/test.lua:303>

--------------------------------------------------------------------------------
SoftMax_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:467: in function <test/test.lua:437>

--------------------------------------------------------------------------------

weird
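
For context, here is a minimal sketch of what these assertions check (shapes and tolerances are illustrative, not the exact ones from test/test.lua): the cudnn module is run forward and backward against the plain nn reference on the GPU, and the maximum absolute difference must stay below the thresholds shown in the log.

```lua
-- Minimal sketch (not the actual test/test.lua): cudnn.Tanh should
-- match nn.Tanh within a tolerance on the same CUDA input.
-- Shapes and tolerances here are illustrative.
require 'cunn'
require 'cudnn'

local input      = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)
local gradOutput = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)

local ref = nn.Tanh():cuda()   -- reference implementation
local mod = cudnn.Tanh()       -- cuDNN-backed implementation

local refOut = ref:forward(input)
local out    = mod:forward(input)
print('forward max error:', (out - refOut):abs():max())    -- expected < 1e-4

local refGrad = ref:backward(input, gradOutput)
local grad    = mod:backward(input, gradOutput)
print('backward max error:', (grad - refGrad):abs():max()) -- expected < 1e-2
```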

@soumith (Owner) commented Dec 19, 2014

Aaaah, it is very likely there is a bug in their OS X version. Can't think of any other explanation, as it works cleanly on Linux. You can report bugs to them via the NVIDIA developer tool.

@szagoruyko (Collaborator, Author)

Yes, on Linux only MaxPooling fails sometimes, as they mention in the docs. On OS X, ReLU, Tanh, Sigmoid and SoftMax actually all fail a lot. Will report a bug.

@szagoruyko (Collaborator, Author)

Just caught the same problem on Linux!

--------------------------------------------------------------------------------
ReLU_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:350: in function <test/test.lua:321>

--------------------------------------------------------------------------------

@soumith (Owner) commented Dec 24, 2014

OK, I am going to run the unit tests a few thousand times and see how that goes.
Also, are you making sure to use CUDA 6.5 on Linux?

@soumith (Owner) commented Dec 24, 2014

Can you give me more details about your Linux setup, for a possible reproduction?

@szagoruyko (Collaborator, Author)

Yes: CUDA 6.5, 4 Titan Blacks, 340.29 driver, with torch, cutorch, nn and cunn updated to the latest version. I also have another machine (mostly identical) on which it fails too, e.g. here with Sigmoid:

--------------------------------------------------------------------------------
Sigmoid_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    ...ocks/torch-distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:484: in function <test/test.lua:455>

--------------------------------------------------------------------------------

@soumith (Owner) commented Dec 24, 2014

Thanks, having a look.

@szagoruyko (Collaborator, Author)

It's Ubuntu 14.04, btw, and within a single test run it fails on all 4 cards, not just one.

@soumith (Owner) commented Dec 24, 2014

OK, that's an interesting detail.

@soumith (Owner) commented Dec 24, 2014

OK, over several hundred runs I reproduced this once on my Tesla K40. Now trying to print out the specific input shape etc. and reproduce this consistently.

@soumith (Owner) commented Dec 25, 2014

I'm not able to reproduce this even over thousands of runs. I got the NaN once, but can't seem to get it again.
Can you run this: https://github.com/soumith/cudnn.torch/blob/burepro/test/test.lua#L415
and then send me the files badTanh.t7, badSoftmax.t7, badSigmoid.t7, etc.?
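
For reference, here is a hedged sketch of the kind of repro helper being requested — not necessarily what the linked burepro branch actually does — which loops the comparison and saves the offending input with torch.save when a NaN shows up:

```lua
-- Hedged sketch of a repro helper: re-run the backward comparison in a
-- loop and dump the offending tensors when a NaN appears.
require 'cunn'
require 'cudnn'

local function checkOnce()
   local input      = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)
   local gradOutput = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)
   local mod = cudnn.Tanh()
   mod:forward(input)
   local grad = mod:backward(input, gradOutput)
   if grad:ne(grad):sum() > 0 then    -- NaN check: NaN ~= NaN
      -- save as CPU tensors so the file can be loaded on any machine
      torch.save('badTanh.t7', {input = input:float(), gradOutput = gradOutput:float()})
      return true
   end
   return false
end

for i = 1, 10000 do
   if checkOnce() then print('caught a NaN on iteration ' .. i); break end
end
```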

@szagoruyko (Collaborator, Author)

Hm, no luck after 100 runs; will leave it running overnight.

@soumith (Owner) commented Dec 25, 2014

OK, I reproduced the NaNs. It is very likely that the cuDNN folks are using the fast approximations, so in very extreme precision cases they generate NaNs. We went down that path in the past and reverted; we have a long history with these things. I will report this to them.

Fast approximations docs:
http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html

@philvdm commented Dec 26, 2014

I believe it is because you use beta=0 and it is not handled properly in cuDNN R2 RC for the activation functions.
When beta=0, we are supposed to write the result directly (without reading the prior destination values), because 0 x NaN = NaN.
It will be fixed in the R2 final release.
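
To make the arithmetic concrete: cuDNN's alpha/beta parameters blend a freshly computed result with whatever is already in the destination buffer, roughly dst = alpha * result + beta * dst. The tiny Lua sketch below (plain arithmetic, not cuDNN code) shows why applying that formula literally with beta = 0 leaks a NaN from an uninitialized destination, and why the beta = 0 case has to skip the read entirely.

```lua
-- Plain-Lua illustration of the beta=0 pitfall described above;
-- this is not cuDNN code, just the arithmetic.
local alpha, beta = 1, 0
local result = 0.5          -- freshly computed activation value
local dst = 0/0             -- destination buffer holds an uninitialized NaN

-- Naive blend: 0 * NaN is NaN, so the NaN leaks into the output.
local naive = alpha * result + beta * dst
print(naive)                -- nan

-- Correct handling: skip reading the destination when beta == 0.
local fixed = (beta == 0) and (alpha * result) or (alpha * result + beta * dst)
print(fixed)                -- 0.5
```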

@soumith (Owner) commented Dec 26, 2014

Thanks @philvdm, will wait for the final release.
