nn.testcuda() produces unstable results on Yosemite 10.10.1 with CUDA 6.5 #50

Closed
sneiman opened this issue Dec 2, 2014 · 10 comments

Comments

@sneiman

sneiman commented Dec 2, 2014

Getting unstable results with nn.testcuda(). Sometimes it passes, sometimes it fails, sometimes it segfaults. Ran the tests because of a cpu->gpu results discrepancy for identical scripts and data.

Running a MacBook Pro Retina 10,1 (mid 2012).
Yosemite 10.10.1, CUDA 6.5, latest drivers and libs as of 12/1/14.
Re-installed today as part of an ongoing effort to solve the cpu/gpu discrepancies described at the end.
Latest Torch7 install, using the '2 line' scripts from Torch.ch. Used Clang 6.0, as CUDA 6.5 is incompatible with gcc49. Am not clear how the scripts deal with libstdc++ issues.
Ran the dependencies script as a normal admin user.
Ran the luajit-torch script using sudo -s.
This fails to build a loadable cunn. The local build fix did not work, due to cmake 3.0.2 changes in rpath handling. Edited FindCUDA.cmake as recommended, which produced a loadable libcunn.so.

The attached terminal session shows a common failure mode. Repeated testing shows passes, passes with significant delays, and failures ranging from a single failed test, to a segfault, to out of memory.

testcuda fails
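
Because the outcome varies from run to run, it can help to tally outcomes over many runs rather than describe them from memory. The following is a minimal sketch, not something used in this thread: the helper names `classify` and `tally_runs` are hypothetical, and it assumes a POSIX system where a negative return code means the process was killed by a signal (for example 11 for SIGSEGV).

```python
import subprocess
from collections import Counter


def classify(returncode):
    """Map a process return code to a coarse outcome label."""
    if returncode == 0:
        return "pass"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as a negative code,
        # e.g. -11 for SIGSEGV.
        return "signal %d" % -returncode
    return "fail (exit %d)" % returncode


def tally_runs(cmd, runs=10):
    """Run cmd `runs` times and count how often each outcome occurs."""
    outcomes = Counter()
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes[classify(result.returncode)] += 1
    return outcomes
```

A call such as `tally_runs(["th", "-e", "require 'cunn'; nn.testcuda()"], runs=20)` would be one way to use it, assuming the `th` launcher is on the PATH and that a failing test run exits with a non-zero status; neither assumption is confirmed in this thread.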

Some background: have been struggling for 2-3 weeks trying to get cpu and gpu results to match. Have reinstalled all Torch components as well as CUDA numerous times. Did experiments with setting manualSeed(). Found that each platform produced repeatable results, but no two platforms matched each other. This is cpu and gpu on OS X, Ubuntu 14.04, and CentOS 6.6. Timing differences with and without gpu are also inconsistent. Feels to me that this could be some kind of install issue, but after having built the environment from scratch numerous times, am in the dark as to what it might be.
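
Bitwise-identical floating-point results across CPU and GPU are generally not expected, so "did not match" is easier to act on when it is quantified. The sketch below is an illustration only: the function name `max_relative_difference` and the sample values are made up, and it does not explain the failures reported here. It only helps separate rounding-level differences from real divergence.

```python
def max_relative_difference(a, b, eps=1e-12):
    """Largest elementwise |x - y| / max(|x|, |y|, eps) over two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    worst = 0.0
    for x, y in zip(a, b):
        # eps keeps the ratio finite when both values are zero.
        scale = max(abs(x), abs(y), eps)
        worst = max(worst, abs(x - y) / scale)
    return worst


# Made-up stand-ins for the same output computed on two platforms.
cpu_out = [0.125, -2.5, 1000.0]
gpu_out = [0.125, -2.5000002, 1000.0001]
print(max_relative_difference(cpu_out, gpu_out))
```

For single-precision results, differences around 1e-7 to 1e-6 per operation are in the range rounding alone can produce; differences many orders of magnitude larger point at something else.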

@soumith
Member

soumith commented Dec 2, 2014

sigh! I don't have OSX Yosemite, and I don't have an OSX powered CUDA machine.
If anyone can give me ssh access to an OSX powered CUDA machine with Yosemite (or if you know how I can get a contbuild to run on this combo), it would be much appreciated. I'll set up continuous builds for these.

@sneiman
Author

sneiman commented Dec 4, 2014

soumith - I've been exploring related issues in an effort to provide some support. Sorry I am not conversant enough in the libs to help more, but I did discover some other failures that may be related:
torch.test() and nn.test() both work fine UNTIL the default tensor type is set to torch.FloatTensor. Then there are many failures. Additionally, nn.test() fails more tests and segfaults regularly if cunn is used with FloatTensor as the default type.

Unfortunately I did not see any change in nn.testcuda() when leaving the default type unchanged. They could be unrelated, but it all smells connected.

s

@szagoruyko
Member

fixed my cutorch finally! there is a lot of weird stuff going on with these malformed libraries and install_name_tool, I was only able to install it with cmake 3.1
so I'm able to reproduce, @sneiman are you running an old macbook with 512gb 650m?
these errors happen sometimes to me when I run tests, not every time.
@soumith what about a virtual machine?

@sneiman
Author

sneiman commented Dec 4, 2014

I’m running a mid-2012 MacBook Pro Retina with 16 GB RAM, an Nvidia GeForce GT 650M with 1 GB VRAM, and a 4-core 2.7 GHz i7.
Yosemite (my bad) 10.10.1.

Does anyone have a better experience on a newer MBP? For the record, in my experience this particular mbp vintage has a lot of little problems. Drivers that don’t run, strange usb behavior – for example, cannot tribe gaze tracker.

I am also dual booting xubuntu 14.04. It seems to have similar problems with the torch.test() and nn.test() with FloatTensor as default and with nn.testcuda(). Don’t take that to the bank – I was rushing to a meeting and did not keep good notes.

All of this is to have a workflow that makes it easy for me to go from OS X -> xubuntu on laptop, and run the same code for long training and parameter searches on the big GPU box sitting in my office.

I have not had problems with cutorch.

Got cunn to build – local build with cmake 3.0.2 and the edit you advised. 3.0.2 is latest on brew.

However, am still having gpu training issues. Using code that is identical to the cpu version, it fails to train at all, on both OS X and xubuntu. It seems very fragile as well. Still not convinced there isn’t something more deeply wrong, like the wrong stdlib in some library that the torch chain depends on.

I did finally get OS X and ubuntu behavior to be the same. Not the best behavior – but the same is good.

Still banging away at it,

S

This entire message is confidential. If it isn't intended for you, you may not use it – so please throw it away and forget about it.


@sneiman
Author

sneiman commented Dec 4, 2014

Btw – I have not found a straightforward vm setup that gives useful gpu access for cuda. If there is one, I'd love to know about it, as xubuntu and OS X were not designed to dual boot.

S


@soumith
Member

soumith commented Dec 4, 2014

> I am also dual booting xubuntu 14.04. It seems to have similar problems with the torch.test() and nn.test() with FloatTensor as default

If you use FloatTensor as default, the Jacobian tests will fail (as completely expected). This is because we define the perturbation amount for calculating finite-difference-based derivatives to be 1e-6:
https://github.com/torch/nn/blob/master/Jacobian.lua#L70
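
To illustrate why a 1e-6 perturbation is a problem in single precision: a float carries roughly 7 significant decimal digits, so near 1.0 a step of 1e-6 is only a handful of representable values wide and rounding dominates the difference quotient. The sketch below is a generic central-difference estimate, not the code from Jacobian.lua; the helper names are hypothetical and single precision is emulated by rounding each intermediate through `struct`.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def central_difference(f, x, h, rnd):
    """Estimate f'(x) as (f(x+h) - f(x-h)) / (2h), rounding each step with rnd."""
    hi = rnd(f(rnd(x + h)))
    lo = rnd(f(rnd(x - h)))
    return rnd(rnd(hi - lo) / rnd(2.0 * h))


def square(x):
    return x * x  # exact derivative at x is 2x


x, h = 1.0, 1e-6  # 1e-6 is the perturbation quoted in the comment above
exact = 2.0 * x
err_double = abs(central_difference(square, x, h, lambda v: v) - exact)
err_single = abs(central_difference(square, x, h, to_float32) - exact)
print("double-precision error:", err_double)
print("single-precision error:", err_single)
```

With the same step size, the single-precision estimate is off by several orders of magnitude more than the double-precision one, which is consistent with Jacobian checks failing once the default tensor type is switched to FloatTensor.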

@soumith
Member

soumith commented Dec 4, 2014

Lots of modules on CPU use these Jacobian tests to check for correctness.

@sneiman
Author

sneiman commented Dec 4, 2014

Got it. Thx.

@szagoruyko
Member

Seems to me that torch/trepl#13 fixed the concat operator error shown in the screenshot. I've run testcuda several times and the only error I get is out of memory, so the out-of-memory failure was probably triggering a call into trepl, which then raised this "no concat operator" error. Can be closed, I think.

@soumith soumith closed this as completed Jan 18, 2015
@soumith
Member

soumith commented Jan 18, 2015

Thanks a lot, Sergey.
