Training on a GTX 1080 does not work, produces random labels #3507

Closed
akors opened this issue Jul 26, 2016 · 9 comments
Labels
stat:awaiting response (Status - Awaiting response from author)

Comments

@akors

akors commented Jul 26, 2016

Hi!
I have come across a very strange issue: training on an NVIDIA GTX 1080 does not work at all, and judging from the error rate, the predicted labels are completely random.

I have 2 almost identical systems (see below), and while training runs perfectly fine on the system with the GTX 960, on the system with the GTX 1080 it simply doesn't work.

To test this, I ran the following command:

python -m tensorflow.models.image.mnist.convolutional

On System 1 (GTX 960), I get to an error rate below 4% within the first couple of hundred steps, and at the end it's below 1%:

Step 0 (epoch 0.00), 7.6 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 13.0 ms
Minibatch loss: 3.296, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 7.3%
Step 200 (epoch 0.23), 13.1 ms
Minibatch loss: 3.459, learning rate: 0.010000
Minibatch error: 12.5%
Validation error: 3.9%
...
Step 8500 (epoch 9.89), 13.0 ms
Minibatch loss: 1.604, learning rate: 0.006302
Minibatch error: 1.6%
Validation error: 0.9%
Test error: 0.8%

On the GTX 1080 system, performance simply never improves! The error rate is steady at around 90%.

Step 8400 (epoch 9.77), 5.5 ms
Minibatch loss: 3.881, learning rate: 0.006302
Minibatch error: 85.9%
Validation error: 88.7%
Step 8500 (epoch 9.89), 5.5 ms
Minibatch loss: 3.877, learning rate: 0.006302
Minibatch error: 87.5%
Validation error: 88.7%
Test error: 89.7%

I tested this with TensorFlow 0.9, from the release PIP package for Python 3.5 with GPU support.
I also tested this with TensorFlow master from a week ago (fc91629), compiled to a PIP package on one machine, installed on both machines.

Here are the full system specs; the only differences between them are the GPU (1080 vs. 960) and the driver (367.35 vs. 361.42):

System 1:

  • OS: Ubuntu 16.04.1 LTS
  • Kernel: 4.4.0-31-generic
  • NVIDIA Driver Version: 361.42
  • CPU: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz
  • RAM: 12 GB
  • GPU: NVIDIA GTX 960, 4GB VRAM (edition: MSI GTX 960 Gaming 4G)

System 2:

  • OS: Ubuntu 16.04.1 LTS
  • Kernel: 4.4.0-31-generic
  • NVIDIA Driver Version: 367.35
  • CPU: Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz
  • RAM: 12 GB
  • GPU: NVIDIA GTX 1080, 8GB VRAM

My LD_LIBRARY_PATH on both machines is:
:/usr/local/cuda/lib64:/usr/local/cuda-7.5/extras/CUPTI/lib64/

My CUDA version is 7.5 and cuDNN is 4.0.7 on BOTH machines.

Output of ls -l /usr/local/cuda/lib64 on the machines with the GTX 960 and GTX 1080:
https://gist.github.com/akors/30f5fbe3994e3ac40a4adbb6f76eb756#file-cudalibs-gtx960-txt
https://gist.github.com/akors/30f5fbe3994e3ac40a4adbb6f76eb756#file-cudalibs-gtx1080-txt

Does anyone know what could cause this and how to fix it?

@akors
Author

akors commented Jul 26, 2016

Judging from my own model, it seems that the labels returned are simply always zero.

Is there a simple way to confirm this with the mnist model?

PS: I have run the mnist_with_summaries.py code, and there everything seems fine. What is the difference between the two?
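
For reference, one quick way to check whether the GPU itself returns garbage, independent of the mnist script (a rough sketch; any op would do), is to run the same matmul on both GPU and CPU and compare the results:

# Sketch: if the GPU result is all zeros or differs wildly from the CPU
# result, the problem is in the GPU kernels rather than in the mnist model.
python - <<'EOF'
import numpy as np
import tensorflow as tf

a = np.random.rand(512, 512).astype(np.float32)

with tf.device('/gpu:0'):
    gpu_result = tf.matmul(tf.constant(a), tf.constant(a))
with tf.device('/cpu:0'):
    cpu_result = tf.matmul(tf.constant(a), tf.constant(a))

with tf.Session() as sess:
    g, c = sess.run([gpu_result, cpu_result])

print('max abs difference GPU vs CPU:', np.abs(g - c).max())
EOF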

@yaroslavvb
Contributor

yaroslavvb commented Jul 26, 2016

People have had similar problems with the GTX 1080 (e.g., #3068), which were fixed by building TensorFlow with CUDA 8.0 instead of the default 7.5; see #3052.
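
For reference, the CUDA 8.0 build followed the usual build-from-source steps (a sketch; exact configure prompts and paths depend on the machine), pointing ./configure at CUDA 8.0 and compute capability 6.1 for the GTX 1080:

# from a TensorFlow source checkout, with CUDA 8.0 and cuDNN installed
./configure
# answer yes to GPU support, give 8.0 as the CUDA version and 6.1 as the
# compute capability
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl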

@michaelisard added the stat:awaiting response label on Jul 26, 2016
@michaelisard

@akors would you let me know if CUDA 8 helps?

@akors
Author

akors commented Jul 26, 2016

Yes, it seems to have fixed the issue. Thank you for your support, and sorry that I didn't find the two threads myself.

@ashleyjsands

I'm having the exact same issue with the GTX 1080 and CUDA 7.5, but I am unable to successfully compile TensorFlow with CUDA 8.0. Is anybody able to share their compiled TensorFlow whl file that works with CUDA 8.0?

@akors
Author

akors commented Aug 7, 2016

@ashleyjsands I have one compiled from the master branch (I think it's labeled 0.10rc0 or something) for Ubuntu 16.04.1. What is your system?

If it's incompatible with my whl, try these instructions for compilation: #2053 (comment)

@akors
Author

akors commented Aug 8, 2016

@ashleyjsands
Here's my version compiled for CUDA 8.0: https://dl.dropboxusercontent.com/u/1414175/tensorflow_gpu_cuda8-0.9.0-4fb28d6-py3-none-any.whl

You have to rename (or symlink) it to tensorflow-0.9.0-py3-none-any.whl to install it.
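
For example (a sketch; adjust the filename to wherever the download landed):

ln -s tensorflow_gpu_cuda8-0.9.0-4fb28d6-py3-none-any.whl tensorflow-0.9.0-py3-none-any.whl
pip3 install tensorflow-0.9.0-py3-none-any.whl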

@ashleyjsands

Thanks @akors. Yes, I'm using Ubuntu 16.04, but I am using Python 2. I will give Python 3 a try along with your whl file. Thanks for sending it through.

I tried your instructions for compilation previously, but I got stuck at the part about editing the CROSSTOOL file: I apt-get installed gcc-4.9 on my machine and couldn't figure out the correct paths to set for the cxx_builtin_include_directory values.

@akors
Author

akors commented Aug 8, 2016

Glad I could help.

I tried your instructions for compilation previously, but I got stuck at the part about editing the CROSSTOOL file: I apt-get installed gcc-4.9 on my machine and couldn't figure out the correct paths to set for the cxx_builtin_include_directory values.

First, for CUDA 8 you need GCC 5.3.1, which is not available in the repos; you have to compile it manually. Second, here's my patch to the CROSSTOOL file:

diff --git a/third_party/gpus/crosstool/CROSSTOOL b/third_party/gpus/crosstool/CROSSTOOL
index 8db81a9..d026738 100644
--- a/third_party/gpus/crosstool/CROSSTOOL
+++ b/third_party/gpus/crosstool/CROSSTOOL
@@ -50,6 +50,7 @@ toolchain {
   # Use "-std=c++11" for nvcc. For consistency, force both the host compiler
   # and the device compiler to use "-std=c++11".
   cxx_flag: "-std=c++11"
+  cxx_flag: "-D_FORCE_INLINES"
   linker_flag: "-lstdc++"
   linker_flag: "-B/usr/bin/"

@@ -57,8 +58,10 @@ toolchain {
   # used by gcc. That works because bazel currently doesn't track files at
   # absolute locations and has no remote execution, yet. However, this will need
   # to be fixed, maybe with auto-detection?
-  cxx_builtin_include_directory: "/usr/lib/gcc/"
-  cxx_builtin_include_directory: "/usr/local/include"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/lib/gcc/"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/local/include"
+  cxx_builtin_include_directory: "/opt/gcc-5.3/include"
+  cxx_builtin_include_directory: "/usr/local/cuda-8.0/include/"
   cxx_builtin_include_directory: "/usr/include"
   tool_path { name: "gcov" path: "/usr/bin/gcov" }
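
To apply it, save the diff (for example as crosstool.patch; the name is arbitrary) in the TensorFlow source root and run:

git apply crosstool.patch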
