nn.testcuda() produces unstable results on Yosemite 10.10.1 with CUDA 6.5 #50
Comments
Sigh! I don't have OSX Yosemite, and I don't have an OSX-powered CUDA machine.
soumith - I've been exploring related issues in an effort to provide some support; sorry I am not conversant enough in the libs to help more. I did discover some other failures that may be related, though unfortunately I did not see any change in nn.testcuda() when leaving the default type unchanged. They could be unrelated, but it all smells connected. S
Fixed my cutorch finally! There is a lot of weird stuff going on with these malformed libraries and install_name_tool; I was only able to install it with cmake 3.1.
I'm running a mid-2012 MacBook Pro Retina with 16 GB RAM, an NVIDIA GeForce GT 650M with 1 GB VRAM, and a 4-core 2.7 GHz i7. Does anyone have a better experience on a newer MBP? For the record, in my experience this particular MBP vintage has a lot of little problems: drivers that don't run, strange USB behavior (for example, it cannot drive a gaze tracker).

I am also dual-booting Xubuntu 14.04. It seems to have similar problems with torch.test() and nn.test() with FloatTensor as default, and with nn.testcuda(). Don't take that to the bank - I was rushing to a meeting and did not keep good notes. All of this is to get a workflow that makes it easy for me to go from OS X to Xubuntu on the laptop, and run the same code for long training runs and parameter searches on the big GPU box sitting in my office.

I have not had problems with cutorch. Got cunn to build - a local build with cmake 3.0.2 and the edit you advised (3.0.2 is the latest on brew). However, I am still having GPU training issues: using code that is identical to the CPU version, it fails to train at all, on both OS X and Xubuntu. It also seems very fragile. I am still not convinced there isn't something more deeply wrong - like the wrong stdlib in some library that the Torch chain depends on. I did finally get OS X and Ubuntu behavior to be the same. Not the best behavior - but the same is good. Still banging away at it.

S
Btw - I have not found a straightforward VM setup that gives useful GPU access for CUDA. If there is one, I'd love to know about it, as Xubuntu and OS X were not designed to dual-boot.

S
If you use FloatTensor as the default, the Jacobian tests will fail (as completely expected). This is because we define the perturbation amount for calculating finite-difference derivatives to be 1e-6, which is close to single-precision machine epsilon, so the difference quotient is dominated by rounding error.
Lots of CPU modules use these Jacobian tests to check for correctness.
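To make the precision point concrete, here is a minimal sketch (not the actual nn test harness; the function and values are made up for illustration) of a finite-difference estimate using the same 1e-6 perturbation in double versus single precision:

```lua
-- Minimal sketch: finite-difference derivative of f(x) = x^2 at x = 3
-- with the same 1e-6 perturbation the Jacobian tests use.
-- In double precision the estimate is close to the true value (6);
-- in single precision the perturbation is only a few ulps of x, so the
-- difference quotient is dominated by rounding error and blows the
-- test tolerance.
require 'torch'

local eps = 1e-6

-- double precision
local xd = torch.DoubleTensor{3}
local fd = ((xd + eps):pow(2) - xd:clone():pow(2)) / eps
print('double estimate:', fd[1])   -- ~6.000

-- single precision
local xf = torch.FloatTensor{3}
local ff = ((xf + eps):pow(2) - xf:clone():pow(2)) / eps
print('float estimate:', ff[1])    -- noticeably off from 6
```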
Got it. Thx.
It seems to me that torch/trepl#13 fixed the error in the screenshot with the concat operator. I've run testcuda several times; the only error I get now is out-of-memory, so the failure was probably caused by calling into trepl, which was giving this "no concat operator" error. Can be closed, I think.
Thanks a lot, Sergey.
Getting unstable results with nn.testcuda(): it sometimes passes, sometimes fails, and sometimes segfaults. I ran the tests because of a CPU->GPU results discrepancy for identical scripts and data.
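For reference, the test was invoked like this (a minimal sketch, assuming a standard Torch install with cutorch and cunn available):

```lua
-- Minimal repro sketch: load the CUDA packages and run the cunn test
-- suite. On this machine the outcome varies from run to run:
-- pass, single-test failure, out-of-memory, or segfault.
require 'cutorch'
require 'cunn'

nn.testcuda()
```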
Running a MacBook Pro Retina 10,1 (mid-2012).
Yosemite 10.10.1, CUDA 6.5 - latest drivers and libs as of 12/1/14.
Re-installed today, as part of an ongoing effort to solve the CPU/GPU discrepancies described at the end.
Latest Torch7 install, using the '2 line' scripts from torch.ch. Used Clang 6.0, as CUDA 6.5 is incompatible with gcc 4.9. I am not clear how the scripts deal with libstdc++ issues.
Ran dependencies script as normal admin user.
Ran luajit-torch script using sudo -s.
This fails to build a loadable cunn properly. The local-build fix did not work, due to cmake 3.0.2 changes in rpath handling. Editing FindCUDA.cmake as recommended produced a loadable libcunn.so.
The attached terminal session shows a common failure mode. Repeated testing shows passes, passes with significant delays, and failures ranging from a single failing test to segfaults and out-of-memory errors.
Some background: I have been struggling for 2-3 weeks trying to get CPU and GPU results to match. I have reinstalled all Torch components, as well as CUDA, numerous times, and did experiments with setting manualSeed(). Each platform produced repeatable results, but none of them matched; this is CPU and GPU on OS X, Ubuntu 14.04, and CentOS 6.6. Timing differences with and without the GPU are also inconsistent. It feels like this could be some kind of install issue, but after having built the environment from scratch numerous times, I am in the dark as to what it might be.
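For context, the seeding experiments looked roughly like this (a sketch, not my exact script; the seed value, tensor sizes, and nn.Linear layer are arbitrary, and I am assuming cutorch.manualSeed for the GPU RNG):

```lua
-- Sketch of the CPU/GPU comparison: seed both RNGs, run the same
-- forward pass on each device, and compare the outputs. Each platform
-- is repeatable run-to-run, but CPU and GPU disagree, and the results
-- also differ across OS X / Ubuntu / CentOS.
require 'nn'
require 'cutorch'
require 'cunn'

torch.manualSeed(1234)     -- CPU RNG
cutorch.manualSeed(1234)   -- RNG on the current GPU

local input  = torch.randn(16, 10)
local linear = nn.Linear(10, 5)

local cpuOut = linear:forward(input)
local gpuOut = linear:clone():cuda():forward(input:cuda()):float()

-- Ideally this difference is ~0 up to float rounding; here it is
-- larger than expected and varies by platform.
print((cpuOut:float() - gpuOut):abs():max())
```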