
OS X R2 Tanh and SoftMax tests fail #9

Closed
szagoruyko opened this issue Dec 19, 2014 · 15 comments

@szagoruyko (Collaborator)

Just tested on Ubuntu and all tests pass, but on OS X they don't:

____*__*______  ==> Done Completed 50 asserts in 14 tests with 3 errors
--------------------------------------------------------------------------------
Tanh_single
error on state (forward)
 LT(<) violation   val=nan, condition=0.0001
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:329: in function <test/test.lua:303>

--------------------------------------------------------------------------------
Tanh_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:332: in function <test/test.lua:303>

--------------------------------------------------------------------------------
SoftMax_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:467: in function <test/test.lua:437>

--------------------------------------------------------------------------------

weird
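
For context, here is a minimal sketch of what these assertions check (shapes and tolerances are illustrative, not the exact ones from test/test.lua): the cudnn module is run forward and backward against the plain nn reference on the GPU, and the maximum absolute difference must stay below the thresholds shown in the log.

```lua
-- Minimal sketch (not the actual test/test.lua): cudnn.Tanh should
-- match nn.Tanh within a tolerance on the same CUDA input.
-- Shapes and tolerances here are illustrative.
require 'cunn'
require 'cudnn'

local input      = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)
local gradOutput = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)

local ref = nn.Tanh():cuda()   -- reference implementation
local mod = cudnn.Tanh()       -- cuDNN-backed implementation

local refOut = ref:forward(input)
local out    = mod:forward(input)
print('forward max error:', (out - refOut):abs():max())    -- expected < 1e-4

local refGrad = ref:backward(input, gradOutput)
local grad    = mod:backward(input, gradOutput)
print('backward max error:', (grad - refGrad):abs():max()) -- expected < 1e-2
```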

@soumith (Owner) commented Dec 19, 2014

Aaaah, it is very likely there is a bug in their OS X version. Can't think of any other explanation, as it works cleanly on Linux. You can report bugs to them via the NVIDIA developer tool.

@szagoruyko (Collaborator, Author)

Yes, on Linux only MaxPooling fails sometimes, as they mention in the docs. On OS X, ReLU, Tanh, Sigmoid and SoftMax actually all fail a lot. Will report a bug.

@szagoruyko (Collaborator, Author)

Just caught the same problem on Linux!

--------------------------------------------------------------------------------
ReLU_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:350: in function <test/test.lua:321>

--------------------------------------------------------------------------------

@soumith (Owner) commented Dec 24, 2014

OK, I am going to run the unit tests a few thousand times and see how that goes.
Also, are you making sure to use CUDA 6.5 on Linux?

@soumith (Owner) commented Dec 24, 2014

Can you give me more details about your Linux setup, for a possible reproduction?

@szagoruyko (Collaborator, Author)

Yes: CUDA 6.5, 4 Titan Blacks, 340.29 driver, with torch, cutorch, nn and cunn updated to the latest version. I also have another machine (mostly identical) on which it fails too, e.g. here with Sigmoid:

--------------------------------------------------------------------------------
Sigmoid_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    ...ocks/torch-distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:484: in function <test/test.lua:455>

--------------------------------------------------------------------------------

@soumith (Owner) commented Dec 24, 2014

Thanks, having a look.

@szagoruyko (Collaborator, Author)

It's Ubuntu 14.04, btw, and within a single test run it fails on all 4 cards, not just one.

@soumith (Owner) commented Dec 24, 2014

OK, that's an interesting detail.

@soumith (Owner) commented Dec 24, 2014

OK, over several hundred runs I reproduced this once on my Tesla K40. Now trying to print out the specific input shape etc. and reproduce this consistently.

@soumith (Owner) commented Dec 25, 2014

I'm not able to reproduce this even over thousands of runs. I got the NaN once, but can't seem to get it again.
Can you run this: https://github.com/soumith/cudnn.torch/blob/burepro/test/test.lua#L415
and then send me the files badTanh.t7, badSoftmax.t7, badSigmoid.t7, etc.?
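
For reference, here is a hedged sketch of the kind of repro helper being requested — not necessarily what the linked burepro branch actually does — which loops the comparison and saves the offending input with torch.save when a NaN shows up:

```lua
-- Hedged sketch of a repro helper: re-run the backward comparison in a
-- loop and dump the offending tensors when a NaN appears.
require 'cunn'
require 'cudnn'

local function checkOnce()
   local input      = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)
   local gradOutput = torch.CudaTensor(32, 64, 8, 8):uniform(-1, 1)
   local mod = cudnn.Tanh()
   mod:forward(input)
   local grad = mod:backward(input, gradOutput)
   if grad:ne(grad):sum() > 0 then    -- NaN check: NaN ~= NaN
      -- save as CPU tensors so the file can be loaded on any machine
      torch.save('badTanh.t7', {input = input:float(), gradOutput = gradOutput:float()})
      return true
   end
   return false
end

for i = 1, 10000 do
   if checkOnce() then print('caught a NaN on iteration ' .. i); break end
end
```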

@szagoruyko (Collaborator, Author)

Hm, no luck after 100 runs; will leave it running overnight.

@soumith (Owner) commented Dec 25, 2014

OK, I reproduced the NaNs. It is very likely that the cuDNN folks are using the fast approximations, so in very extreme precision cases they generate NaNs. We went down that path in the past and reverted; we have a long history with these things. I will report this to them.

Fast approximations docs:
http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html

@philvdm commented Dec 26, 2014

I believe it is because you use beta=0 and it is not handled properly in cuDNN R2 RC for the activation functions.
When beta=0, we are supposed to write the result directly (without reading the prior destination values), because 0 x NaN = NaN.
It will be fixed in the R2 final release.
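
To make the arithmetic concrete: cuDNN's alpha/beta parameters blend a freshly computed result with whatever is already in the destination buffer, roughly dst = alpha * result + beta * dst. The tiny Lua sketch below (plain arithmetic, not cuDNN code) shows why applying that formula literally with beta = 0 leaks a NaN from an uninitialized destination, and why the beta = 0 case has to skip the read entirely.

```lua
-- Plain-Lua illustration of the beta=0 pitfall described above;
-- this is not cuDNN code, just the arithmetic.
local alpha, beta = 1, 0
local result = 0.5          -- freshly computed activation value
local dst = 0/0             -- destination buffer holds an uninitialized NaN

-- Naive blend: 0 * NaN is NaN, so the NaN leaks into the output.
local naive = alpha * result + beta * dst
print(naive)                -- nan

-- Correct handling: skip reading the destination when beta == 0.
local fixed = (beta == 0) and (alpha * result) or (alpha * result + beta * dst)
print(fixed)                -- 0.5
```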

@soumith (Owner) commented Dec 26, 2014

Thanks @philvdm, will wait for the final release.
