Training on a GTX 1080 does not work, produces random labels #3507

Closed
akors opened this issue Jul 26, 2016 · 9 comments
Labels
stat:awaiting response (Status - Awaiting response from author)

Comments

@akors

akors commented Jul 26, 2016

Hi!
I have come across a very strange issue: training on an NVIDIA GTX 1080 does not work at all, and judging from the error rate, the predicted labels are completely random.

I have 2 almost identical systems (see below), and while training runs perfectly fine on the system with the GTX 960, on the system with the GTX 1080 it simply doesn't work.

To test this, I ran the following command:

python -m tensorflow.models.image.mnist.convolutional

On System 1 (GTX 960), I get to an error rate below 4% within the first couple of hundred steps, and at the end it's below 1%:

Step 0 (epoch 0.00), 7.6 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 13.0 ms
Minibatch loss: 3.296, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 7.3%
Step 200 (epoch 0.23), 13.1 ms
Minibatch loss: 3.459, learning rate: 0.010000
Minibatch error: 12.5%
Validation error: 3.9%
...
Step 8500 (epoch 9.89), 13.0 ms
Minibatch loss: 1.604, learning rate: 0.006302
Minibatch error: 1.6%
Validation error: 0.9%
Test error: 0.8%

On the GTX 1080 system, performance simply never improves! The error rate is steady at around 90%.

Step 8400 (epoch 9.77), 5.5 ms
Minibatch loss: 3.881, learning rate: 0.006302
Minibatch error: 85.9%
Validation error: 88.7%
Step 8500 (epoch 9.89), 5.5 ms
Minibatch loss: 3.877, learning rate: 0.006302
Minibatch error: 87.5%
Validation error: 88.7%
Test error: 89.7%

I tested this with TensorFlow 0.9, from the release PIP package for Python 3.5 with GPU support.
I also tested this with TensorFlow master from a week ago (fc91629), compiled to a PIP package on one machine, installed on both machines.

Here are the full system specs; the only differences between them are the GPU (1080 vs. 960) and the driver (367.35 vs. 361.42):

System 1:

  • OS: Ubuntu 16.04.1 LTS
  • Kernel: 4.4.0-31-generic
  • NVIDIA Driver Version: 361.42
  • CPU: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz
  • RAM: 12 GB
  • GPU: NVIDIA GTX 960, 4GB VRAM (edition: MSI GTX 960 Gaming 4G)

System 2:

  • OS: Ubuntu 16.04.1 LTS
  • Kernel: 4.4.0-31-generic
  • NVIDIA Driver Version: 367.35
  • CPU: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz
  • RAM: 12 GB
  • GPU: NVIDIA GTX 1080, 8GB VRAM

My LD_LIBRARY_PATH on both machines is:
:/usr/local/cuda/lib64:/usr/local/cuda-7.5/extras/CUPTI/lib64/

My CUDA version is 7.5 and cuDNN is 4.0.7 on BOTH machines.

Output of ls -l /usr/local/cuda/lib64 on the machines with the GTX 960 and GTX 1080:
https://gist.github.com/akors/30f5fbe3994e3ac40a4adbb6f76eb756#file-cudalibs-gtx960-txt
https://gist.github.com/akors/30f5fbe3994e3ac40a4adbb6f76eb756#file-cudalibs-gtx1080-txt

Does anyone know what could cause this and how to fix it?

@akors
Author

akors commented Jul 26, 2016

Judging from my own model, it seems that the labels returned are simply always zero.

Is there a simple way to confirm this with the mnist model?

PS: I have run the mnist_with_summaries.py code, and there everything seems fine. What is the difference between the two?
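
For reference, one quick way to check whether the GPU itself returns garbage, independent of the mnist script (a rough sketch; any op would do), is to run the same matmul on both GPU and CPU and compare the results:

# Sketch: if the GPU result is all zeros or differs wildly from the CPU
# result, the problem is in the GPU kernels rather than in the mnist model.
python - <<'EOF'
import numpy as np
import tensorflow as tf

a = np.random.rand(512, 512).astype(np.float32)

with tf.device('/gpu:0'):
    gpu_result = tf.matmul(tf.constant(a), tf.constant(a))
with tf.device('/cpu:0'):
    cpu_result = tf.matmul(tf.constant(a), tf.constant(a))

with tf.Session() as sess:
    g, c = sess.run([gpu_result, cpu_result])

print('max abs difference GPU vs CPU:', np.abs(g - c).max())
EOF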

@yaroslavvb
Contributor

yaroslavvb commented Jul 26, 2016

People have had similar problems with the GTX 1080 (e.g., #3068), which were fixed by building TensorFlow with CUDA 8.0 instead of the default 7.5; see #3052.
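
For reference, the CUDA 8.0 build followed the usual build-from-source steps (a sketch; exact configure prompts and paths depend on the machine), pointing ./configure at CUDA 8.0 and compute capability 6.1 for the GTX 1080:

# from a TensorFlow source checkout, with CUDA 8.0 and cuDNN installed
./configure
# answer yes to GPU support, give 8.0 as the CUDA version and 6.1 as the
# compute capability
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl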

@michaelisard added the stat:awaiting response label on Jul 26, 2016
@michaelisard

@akors would you let me know if CUDA 8 helps?

@akors
Author

akors commented Jul 26, 2016

Yes, it seems to have fixed the issue. Thank you for your support, and sorry that I didn't find the two threads myself.

@ashleyjsands

I'm having the exact same issue with the GTX 1080 and CUDA 7.5, but I am unable to successfully compile TensorFlow with CUDA 8.0. Is anybody able to share their compiled TensorFlow whl file that works with CUDA 8.0?

@akors
Author

akors commented Aug 7, 2016

@ashleyjsands I have one compiled from the master branch (I think it's labeled 0.10rc0 or something) for Ubuntu 16.04.1. What is your system?

If it's incompatible with my whl, try these instructions for compilation: #2053 (comment)

@akors
Author

akors commented Aug 8, 2016

@ashleyjsands
Here's my version compiled for CUDA 8.0: https://dl.dropboxusercontent.com/u/1414175/tensorflow_gpu_cuda8-0.9.0-4fb28d6-py3-none-any.whl

You have to rename (or symlink) it to tensorflow-0.9.0-py3-none-any.whl to install it.
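
For example (a sketch; adjust the filename to wherever the download landed):

ln -s tensorflow_gpu_cuda8-0.9.0-4fb28d6-py3-none-any.whl tensorflow-0.9.0-py3-none-any.whl
pip3 install tensorflow-0.9.0-py3-none-any.whl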

@ashleyjsands

Thanks @akors. Yes, I'm using Ubuntu 16.04, but I am using Python 2. I will give Python 3 a try along with your whl file. Thanks for sending it through.

I tried your instructions for compilation previously, but I got stuck at the part about editing the CROSSTOOL file: I apt-get installed gcc-4.9 on my machine and couldn't figure out the correct paths to set for the cxx_builtin_include_directory values.

@akors
Author

akors commented Aug 8, 2016

Glad I could help.

I tried your instructions for compilation previously, but I got stuck at the part about editing the CROSSTOOL file: I apt-get installed gcc-4.9 on my machine and couldn't figure out the correct paths to set for the cxx_builtin_include_directory values.

First, for CUDA 8 you need GCC 5.3.1, which is not available in the repos; you have to compile it manually. Second, here's my patch to the CROSSTOOL file:

diff --git a/third_party/gpus/crosstool/CROSSTOOL b/third_party/gpus/crosstool/CROSSTOOL
index 8db81a9..d026738 100644
--- a/third_party/gpus/crosstool/CROSSTOOL
+++ b/third_party/gpus/crosstool/CROSSTOOL
@@ -50,6 +50,7 @@ toolchain {
   # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
   # and the device compiler to use "-std=c++11".
   cxx_flag: "-std=c++11"
+  cxx_flag: "-D_FORCE_INLINES"
   linker_flag: "-lstdc++"
   linker_flag: "-B/usr/bin/"

@@ -57,8 +58,10 @@ toolchain {
   # used by gcc. That works because bazel currently doesn't track files at
   # absolute locations and has no remote execution, yet. However, this will need
   # to be fixed, maybe with auto-detection?
-  cxx_builtin_include_directory: "/usr/lib/gcc/"
-  cxx_builtin_include_directory: "/usr/local/include"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/lib/gcc/"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/local/include"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/include"
+  cxx_builtin_include_directory: "/usr/local/cuda-8.0/include/"
   cxx_builtin_include_directory: "/usr/include"
   tool_path { name: "gcov" path: "/usr/bin/gcov" }
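
To apply it, save the diff (for example as crosstool.patch; the name is arbitrary) in the TensorFlow source root and run:

git apply crosstool.patch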
