Training on a GTX 1080 does not work, produces random labels #3507
Comments
Judging from my own model, it seems that the labels returned are simply always zero. Is there a simple way to confirm this with the MNIST model? P.S.: I have run the mnist_with_summaries.py code, and there everything seems fine. What is the difference between those two?
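One way to check the "always zero" hypothesis without rerunning the whole MNIST pipeline is to look at the argmax of the logits the model emits and see whether every prediction lands on the same class. A minimal sketch with NumPy (the function name and sample arrays are illustrative, not from the MNIST example code):

```python
import numpy as np

def labels_collapsed(logits):
    # A prediction set is degenerate when argmax picks the
    # same class for every single example.
    preds = np.argmax(np.asarray(logits), axis=1)
    return np.unique(preds).size == 1

healthy = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]   # argmax: 1, 0, 1
broken  = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]   # argmax: 0, 0, 0
```

If `labels_collapsed` returns True on a batch of real logits, the network output has collapsed to a single label.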
@akors would you let me know if CUDA 8 helps?
Yes, it seems to have fixed the issue. Thank you for your support, and sorry that I didn't find the two threads myself.
I'm having the exact same issues with the GTX 1080 and CUDA 7.5, but I am unable to successfully compile TensorFlow with CUDA 8.0. Is anybody able to share their compiled TensorFlow whl file that works with CUDA 8.0?
@ashleyjsands I have one compiled from the master branch (I think it's labeled 0.10rc0 or something) for Ubuntu 16.04.1. What is your system? If it's incompatible with my whl, try these instructions for compilation: #2053 (comment)
@ashleyjsands You have to rename (or symlink) it to
Thanks @akors. Yes, I'm using Ubuntu 16.04, but I am using Python 2. I will give Python 3 a try along with your whl file. Thanks for sending it through. I tried your instructions for compilation previously, but I got stuck at the part where it came to editing the CROSSTOOL file: I had apt-get installed gcc-4.9 on my machine and couldn't figure out the correct paths to set for the cxx_builtin_include_directory values.
Glad I could help.
First, for CUDA 8, you need GCC 5.3.1, which is not available in the repos; you have to compile that manually. Second, here's my patch to the CROSSTOOL file:

```diff
diff --git a/third_party/gpus/crosstool/CROSSTOOL b/third_party/gpus/crosstool/CROSSTOOL
index 8db81a9..d026738 100644
--- a/third_party/gpus/crosstool/CROSSTOOL
+++ b/third_party/gpus/crosstool/CROSSTOOL
@@ -50,6 +50,7 @@ toolchain {
   # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
   # and the device compiler to use "-std=c++11".
   cxx_flag: "-std=c++11"
+  cxx_flag: "-D_FORCE_INLINES"
   linker_flag: "-lstdc++"
   linker_flag: "-B/usr/bin/"
@@ -57,8 +58,10 @@ toolchain {
   # used by gcc. That works because bazel currently doesn't track files at
   # absolute locations and has no remote execution, yet. However, this will need
   # to be fixed, maybe with auto-detection?
-  cxx_builtin_include_directory: "/usr/lib/gcc/"
-  cxx_builtin_include_directory: "/usr/local/include"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/lib/gcc/"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/local/include"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/include"
+  cxx_builtin_include_directory: "/usr/local/cuda-8.0/include/"
   cxx_builtin_include_directory: "/usr/include"
   tool_path { name: "gcov" path: "/usr/bin/gcov" }
```
Hi!
I have come across a very strange issue: training on an NVIDIA GTX 1080 does not work at all, and judging from the error rate, the predicted labels are completely random.
I have two almost identical systems (see below). On the system with the GTX 960, training runs perfectly fine, while on the system with the GTX 1080, training simply doesn't work.
To test this, I ran:

```
python -m tensorflow.models.image.mnist.convolutional
```
On System 1 (GTX 960), I get to an error rate below 4% within 2-3 batches, and at the end it's below 1%.
On the GTX 1080 system, the performance simply never improves! The error rate stays steady at around 90%.
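A steady ~90% error rate is exactly what uniform random guessing over MNIST's 10 classes produces in expectation, which supports the random-labels hypothesis. A quick sanity check of that arithmetic:

```python
import random

# With 10 classes, a classifier that outputs labels uniformly at
# random is wrong 9 times out of 10 in expectation: 1 - 1/10 = 0.9.
random.seed(0)
n, classes = 100_000, 10
true_labels = [random.randrange(classes) for _ in range(n)]
predictions = [random.randrange(classes) for _ in range(n)]
error_rate = sum(t != p for t, p in zip(true_labels, predictions)) / n
```

With 100,000 samples, `error_rate` lands very close to 0.9, matching the plateau observed on the GTX 1080.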
I tested this with TensorFlow 0.9, from the release PIP package for Python 3.5 with GPU support.
I also tested this with TensorFlow master from a week ago (fc91629), compiled to a PIP package on one machine, installed on both machines.
Here are the full system specs; the only differences between them are the GPU (1080 vs 960) and the driver (367.35 vs 361.42):
System 1:
System 2:
My LD_LIBRARY_PATH on both machines is:

```
:/usr/local/cuda/lib64:/usr/local/cuda-7.5/extras/CUPTI/lib64/
```
My CUDA version is 7.5 and cuDNN is 4.0.7 on BOTH machines.
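For anyone wanting to double-check which cuDNN release a machine actually has, cuDNN records its version as `#define` lines in cudnn.h. A small parser sketch (the header path in the usage comment is an assumption for a default CUDA install; older headers may lack the macros):

```python
import re

def cudnn_version(header_text):
    """Extract the x.y.z version from cudnn.h's #define lines."""
    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        m = re.search(r"#define\s+%s\s+(\d+)" % key, header_text)
        if m is None:
            return None  # version macros not present in this header
        parts.append(m.group(1))
    return ".".join(parts)

# Usage (path assumed):
#   cudnn_version(open("/usr/local/cuda/include/cudnn.h").read())
sample = ("#define CUDNN_MAJOR 4\n"
          "#define CUDNN_MINOR 0\n"
          "#define CUDNN_PATCHLEVEL 7\n")
```

On the `sample` text above, this reports "4.0.7", matching the version stated here.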
Output of

```
ls -l /usr/local/cuda/lib64
```

on the machine with the GTX 960: https://gist.github.com/akors/30f5fbe3994e3ac40a4adbb6f76eb756#file-cudalibs-gtx960-txt
and on the machine with the GTX 1080: https://gist.github.com/akors/30f5fbe3994e3ac40a4adbb6f76eb756#file-cudalibs-gtx1080-txt
Does anyone know what could cause this and how to fix this?