nn.testcuda() produces unstable results on Yosemite 10.10.1 with CUDA 6.5 #50

sneiman · 2014-12-02T03:28:21Z

Getting unstable results with nn.testcuda(). Sometimes it passes, sometimes it fails, sometimes it segfaults. Ran the tests because of a cpu->gpu results discrepancy for identical scripts and data.

Running Macbook Pro Retina 10,1 (mid 2012).
Yosemite 10.10.1, CUDA 6.5 - latest drivers and libs as of 12/1/14.
Re-installed today - as part of ongoing effort to solve cpu/gpu discrepancies - described at end.
Latest Torch7 install - using '2 line' scripts from Torch.ch. Used Clang 6.0 as CUDA 6.5 is incompatible with gcc49. Am not clear how scripts deal with libstdc++ issues.
Ran dependencies script as normal admin user.
Ran luajit-torch script using sudo -s.
This fails to build a loadable cunn properly. Local build fix did not work, due to cmake 3.0.2 changes in rpath handling. Edited FindCUDA.cmake as recommended - produced loadable libcunn.so.

The attached terminal session shows a common failure mode. Repeated testing shows passes, passes with significant delays, and failures ranging from a single failed test, to a segfault, to out of memory.

Some background: have been struggling for 2-3 weeks trying to get cpu and gpu results to match. Have reinstalled all Torch components as well as CUDA numerous times. Did experiments with setting manualSeed(). Found that each platform produced repeatable results, but none of them matched. This covers cpu and gpu on OS X, Ubuntu 14.04, and CentOS 6.6. Timing differences with and without gpu are also inconsistent. It feels to me like this could be some kind of install issue, but after having built the environment from scratch numerous times, I am in the dark as to what it might be.
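
One general reason two platforms can each be repeatable yet disagree with each other is that floating-point addition is not associative, so a different accumulation order gives a different, but deterministic, result. The Python sketch below illustrates only that effect. It is not a diagnosis of the failures reported here, and the helper names `f32` and `sum_f32` are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_f32(values):
    """Accumulate left to right, rounding to single precision after each add."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total

# The same four numbers in two different orders.
order_a = [1e8, 1.0, -1e8, 1.0]
order_b = [1e8, -1e8, 1.0, 1.0]

result_a = sum_f32(order_a)  # the first 1.0 is lost when added to 1e8
result_b = sum_f32(order_b)  # the large terms cancel first, so both 1.0s survive

print(result_a, result_b)
```

Each order gives the same answer on every run, but the two answers differ, which is why results from different devices are usually compared with a tolerance rather than for exact equality.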

soumith · 2014-12-02T03:30:20Z

sigh! I don't have OS X Yosemite, and I don't have an OS X powered CUDA machine.
If anyone can give me ssh access to an OS X powered CUDA machine with Yosemite (or if you know how I can get a contbuild to run on this combo), it would be much appreciated. I'll set up continuous builds for these.

sneiman · 2014-12-04T01:13:37Z

soumith - I've been exploring related issues in an effort to provide some support. Sorry I am not conversant enough in the libs to help more, but I did discover some other failures that may be related:
torch.test() and nn.test() both work fine UNTIL the default tensor type is set to torch.FloatTensor. Then there are many failures. Additionally, nn.test fails more tests and segfaults regularly if cunn is used with FloatTensor as the default type.

Unfortunately I did not see any change in nn.testcuda() when leaving default type unchanged. They could be unrelated - but it all smells connected.

s

szagoruyko · 2014-12-04T01:56:07Z

fixed my cutorch finally! there is a lot of weird stuff going on with these malformed libraries and install_name_tool, I was only able to install it with cmake 3.1
so I'm able to reproduce, @sneiman are you running an old macbook with 512gb 650m?
these errors happen sometimes to me when I run tests, not every time.
@soumith what about a virtual machine?

sneiman · 2014-12-04T02:30:25Z

I’m running a mid-2012 MacBook Pro Retina with 16 GB RAM, an Nvidia GeForce GT 650M with 1 GB VRAM, and a 4-core 2.7 GHz i7.
Yosemite (my bad) 10.10.1.

Does anyone have a better experience on a newer MBP? For the record, in my experience this particular mbp vintage has a lot of little problems. Drivers that don’t run, strange usb behavior – for example, cannot tribe gaze tracker.

I am also dual booting xubuntu 14.04. It seems to have similar problems with the torch.test() and nn.test() with FloatTensor as default and with nn.testcuda(). Don’t take that to the bank – I was rushing to a meeting and did not keep good notes.

All of this is to have a workflow that makes it easy for me to go from OS X -> xubuntu on laptop, and run the same code for long training and parameter searches on the big GPU box sitting in my office.

I have not had problems with cutorch.

Got cunn to build – local build with cmake 3.0.2 and the edit you advised. 3.0.2 is latest on brew.

However, am still having gpu training issues. Using code that is identical to cpu it fails to train at all. OS X and xubuntu. It seems very fragile as well. Still not convinced there isn’t something more deeply wrong – like wrong stdlib in some library that the torch chain is dependent on.

I did finally get OS X and ubuntu behavior to be the same. Not the best behavior – but the same is good.

Still banging away at it,

S


sneiman · 2014-12-04T02:31:40Z

Btw – I have not found a straightforward VM setup that gives useful gpu access for cuda. If there is one I'd love to know about it, as xubuntu and OS X were not designed to dual boot.

S


soumith · 2014-12-04T02:33:56Z

> I am also dual booting xubuntu 14.04. It seems to have similar problems with the torch.test() and nn.test() with FloatTensor as default

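If you use FloatTensor as default, the jacobian tests will fail (as completely expected). This is because we define the perturbation amount for calculating finite-difference-based derivatives to be 1e-6, which is too close to single-precision rounding error (about 1e-7 near 1.0) to give accurate estimates.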
https://github.com/torch/nn/blob/master/Jacobian.lua#L70
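
To make the single-precision point concrete, here is a minimal Python sketch (not the Lua code in Jacobian.lua). It assumes a central-difference estimate and simulates single precision by rounding every intermediate value through `struct`. The helper names `f32` and `central_diff` and the test function `square` are hypothetical, chosen only for illustration.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def central_diff(f, x, eps, rnd=lambda v: v):
    """Central finite difference, with every intermediate passed through rnd."""
    hi = rnd(f(rnd(x + eps)))
    lo = rnd(f(rnd(x - eps)))
    return rnd(rnd(hi - lo) / rnd(2 * eps))

def square(v):
    # Exact derivative at x is 2 * x.
    return v * v

x, eps = 1.0, 1e-6  # 1e-6 is the perturbation quoted above

err64 = abs(central_diff(square, x, eps) - 2 * x)       # double precision
err32 = abs(central_diff(square, x, eps, f32) - 2 * x)  # simulated single precision

print("double error:", err64)
print("single error:", err32)
```

Under these assumptions the double-precision estimate agrees with the exact derivative to many digits, while the simulated single-precision estimate differs from 2.0 by a few hundredths, far outside any tight test tolerance.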

soumith · 2014-12-04T02:34:35Z

Lots of modules on CPU use these jacobian tests to check for correctness.

sneiman · 2014-12-04T02:39:17Z

Got it. Thx.

szagoruyko · 2015-01-18T14:21:06Z

It seems to me that torch/trepl#13 fixed the concat operator error shown in the screenshot. I've run testcuda several times and the only error I get is out of memory, so probably the out-of-memory error was triggering a call into trepl, which then raised the "no concat operator" error. This can be closed, I think.

soumith · 2015-01-18T16:27:09Z

Thanks a lot, Sergey

soumith closed this as completed Jan 18, 2015

This was referenced Jul 23, 2015:

Test failed with th -lcunn -e "nn.testcuda()" #117 (Open)

Torch failed with CUDA test in Hal cBio/cbio-cluster#291 (Closed)
