Inception retraining / transfer learning fails when running with GPU #3560

Closed
theclifbar opened this Issue Jul 29, 2016 · 6 comments


@theclifbar
theclifbar commented Jul 29, 2016 edited

Thanks so much for releasing TensorFlow. We're experimenting with the image retraining example as described here: https://www.tensorflow.org/versions/r0.9/how_tos/image_retraining/index.html

Everything in TensorFlow has worked perfectly for us, including the test and GPU setup validation samples. However, when running the Inception retraining code on the GPU, TensorFlow raises an error. Since the error originates in check_numerics_op.cc, I thought it was worth reporting.
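
For context, here is a minimal CPU-only example (not taken from retrain.py) of what the op behind that error does: tf.check_numerics raises as soon as a tensor contains NaN or Inf, which is why the retrainer aborts when the GPU returns non-finite activations.

    import tensorflow as tf

    # A tensor containing a NaN trips the CheckNumerics op immediately.
    x = tf.constant([1.0, float('nan')])
    checked = tf.check_numerics(x, 'activation input is not finite.')

    with tf.Session() as sess:
        try:
            sess.run(checked)
        except tf.errors.InvalidArgumentError as e:
            print('CheckNumerics fired: %s' % e.message)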

For example, CPU-only bottleneck generation for 600 classes on a recent 36-core machine takes nearly a month, so working multi-GPU support for bottleneck creation would be really great. It would help us learn faster and hopefully contribute to the project sooner.

Abbreviated output (full output is attached):
TensorFlow_Retraining_Error.txt

python tensorflow/examples/image_retraining/retrain.py --image_dir ~/flower_photos
Mon Jul 25 00:54:39 PDT 2016
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 1080
...
Creating bottleneck at /tmp/bottleneck/dandelion/3365850019_8158a161a8_n.jpg.txt
E tensorflow/core/kernels/check_numerics_op.cc:157] abnormal_detected_host @0x1000e000300 = {1, 0} activation input is not finite.
Traceback (most recent call last):
File "tensorflow/examples/image_retraining/retrain.py", line 824, in
tf.app.run()

Environment info

Operating System: Ubuntu 14.04
GPU: GTX 1080 (two cards)

Installed version of CUDA and cuDNN:
/usr/local/cuda/lib/libcudadevrt.a
/usr/local/cuda/lib/libcudart.so -> libcudart.so.7.5
/usr/local/cuda/lib/libcudart.so.7.5 -> libcudart.so.7.5.18
/usr/local/cuda/lib/libcudart.so.7.5.18
/usr/local/cuda/lib/libcudart_static.a

  1. The commit hash (git rev-parse HEAD)
    v0.9.0 25023df
  2. The output of bazel version
    Build label: 0.2.3
    Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
    Build time: Tue May 17 14:21:13 2016 (1463494873)
    Build timestamp: 1463494873
    Build timestamp as int: 1463494873

Steps to reproduce

  1. Build tensorflow from source with GPU support; all demos and tests working properly
  2. bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir ~/flower_photos runs properly but only uses CPU when generating bottleneck files
  3. python tensorflow/examples/image_retraining/retrain.py --image_dir ~/flower_photos utilizes the GPU, but crashes during bottleneck file generation due to an "irregular" response from the GPU. If the /tmp/bottlenecks directory has not yet been populated, retraining on the GPU fails almost immediately, after processing only a few files (see the sketch just below this list).
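
The sketch referenced in step 3 is a minimal, standalone check (nothing here is specific to retrain.py) that logs where each op is placed, which makes a silent CPU-only fallback obvious up front:

    import tensorflow as tf

    a = tf.constant([1.0, 2.0], name='a')
    b = tf.constant([3.0, 4.0], name='b')
    c = a * b

    # log_device_placement prints the device chosen for every op at run time.
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))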

What have you tried?

  1. If cached bottleneck files already exist in /tmp/bottlenecks, then retraining on the GPU works properly. However, building those bottleneck files in the first place has to be done on the CPU, which takes forever (nearly a month on a dual-socket E5-2699 v3 36-core machine for around 600 classes); the caching pattern is sketched just below this list.
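
For reference, this is roughly the caching pattern as I understand it; the tensor names here are my assumptions about the frozen Inception graph and may differ between versions. Each image is pushed through the graph once, the resulting bottleneck vector is written out as comma-separated text, and later runs read the cached file instead of recomputing:

    import os

    # Assumed tensor names in the Inception graph that retrain.py loads.
    BOTTLENECK_TENSOR = 'pool_3/_reshape:0'
    JPEG_DATA_TENSOR = 'DecodeJpeg/contents:0'

    def get_or_create_bottleneck(sess, image_path, bottleneck_path):
        # Return the cached bottleneck vector, computing and saving it if missing.
        if not os.path.exists(bottleneck_path):
            with open(image_path, 'rb') as f:
                jpeg_data = f.read()
            bottleneck = sess.run(
                sess.graph.get_tensor_by_name(BOTTLENECK_TENSOR),
                {sess.graph.get_tensor_by_name(JPEG_DATA_TENSOR): jpeg_data})
            with open(bottleneck_path, 'w') as f:
                f.write(','.join(str(x) for x in bottleneck.flatten()))
        with open(bottleneck_path) as f:
            return [float(x) for x in f.read().split(',')]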
@michaelisard
Member

Thank you for the detailed bug report! We have heard of other people having problems with GTX 1080 #3507 which have been fixed by switching to CUDA 8.0. Is it easy for you to try either using a different GPU or building with 8.0? Unfortunately 8.0 isn't fully supported by TensorFlow yet so that isn't an ideal solution, but it has unblocked other people and will be supported before long.

@theclifbar

Thanks for reading it, @michaelisard! We are very excited by the possibilities opened up by TensorFlow. I can give CUDA 8.0 RC a try.

We could try other GPUs, though these GTX 1080s were bought specifically for TensorFlow. On a somewhat related note, is the 8GB of GPU memory going to be a major limiting factor, in your opinion? Our training sets typically run around 60GB in size (actual total image file size; the bottlenecks are much smaller). The new Titan X Pascal cards have 12GB; I wonder if we'll be handicapped in the long run by the 8GB in terms of batch sizes and other constraints.

@JohnAllen
Contributor

CUDA 8.0 is working for me with a GTX 1080 on Ubuntu 16.04. I can confirm that creating bottlenecks takes forever: a few seconds per image for me.

@michaelisard
Member

@theclifbar I was only suggesting you try another GPU as a debugging aid to help pinpoint the problem: TensorFlow should work on GTX 1080 but others have had trouble on that specific card with earlier versions of CUDA and that may be the issue here. Please report back if CUDA8 doesn't fix it.

I can't comment on the bottleneck question I'm afraid since I am not familiar with the details of the model implementation: @shlens does this sound expected?

@theclifbar

Thanks @michaelisard, upgrading to the CUDA 8.0 Release Candidate and cuDNN 5 and rebuilding from source with GPU support enabled has resolved this issue!

@JohnAllen, thanks for the information as well. We've found that the retrainer, when built for GPUs, takes about 0.1 seconds per image bottleneck; previously, on a 36-core machine, it took around 3 seconds per bottleneck with all cores maxed out. If your setup is still taking 3 seconds per image bottleneck, I'm happy to help debug it with you, since it really should be 20-30x faster on a GTX 1080. You might simply need to rebuild the retrainer with CUDA enabled (a quick check is sketched below).
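
For anyone unsure whether their build is actually CUDA-enabled, a quick standalone check along these lines (just a sketch, independent of retrain.py) is to pin a small matmul to /gpu:0 with soft placement disabled, so a CPU-only build fails loudly instead of silently running on the CPU:

    import tensorflow as tf

    with tf.device('/gpu:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.matmul(a, a)

    # allow_soft_placement=False turns a missing or unusable GPU into an error
    # rather than a silent fallback to the CPU.
    config = tf.ConfigProto(allow_soft_placement=False, log_device_placement=True)
    with tf.Session(config=config) as sess:
        print(sess.run(b))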

The retrainer / transfer learning example does not seem to support multiple GPUs, unfortunately, even with the --num_gpus flag. I can open another issue for that if it's neither by design nor a known issue. Thanks again; you have literally saved us weeks of processing time.
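
In the meantime, the rough workaround I have in mind (not something retrain.py supports out of the box; inception_graph_def below stands in for the GraphDef the script loads) is to import one copy of the graph per GPU and split the image list across them:

    import tensorflow as tf

    def build_inception_on(device, graph_def):
        # Import a copy of the frozen Inception graph pinned to the given device.
        g = tf.Graph()
        with g.as_default(), tf.device(device):
            tf.import_graph_def(graph_def, name='')
        return g

    # Example usage (one session per GPU, each fed a shard of the image list):
    # g0 = build_inception_on('/gpu:0', inception_graph_def)
    # g1 = build_inception_on('/gpu:1', inception_graph_def)
    # Use tf.Session(graph=g0, config=tf.ConfigProto(allow_soft_placement=True))
    # so CPU-only ops such as DecodeJpeg can still fall back to the CPU.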

@theclifbar theclifbar closed this Aug 1, 2016
@Adithya2015
Adithya2015 commented Feb 23, 2017 edited

Hello everyone. I'm having an issue running bottleneck generation on the GPU. Currently it takes about 1 second per bottleneck, and when I check GPU utilization with nvidia-smi it fluctuates between 0 and 20%.

Also, I noticed that this gets printed to my screen every time sess.run() is called:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Could this be a source of overhead? I also checked the device placement logs; here are the first few lines from the log file:

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
Creating bottleneck at ./newmodel/bottleneck/baby_in_crib/1_136.jpg.txt
DecodeJpeg_1: (DecodeJpeg): /job:localhost/replica:0/task:0/cpu:0
Shape: (Shape): /job:localhost/replica:0/task:0/cpu:0
DecodeJpeg: (DecodeJpeg): /job:localhost/replica:0/task:0/cpu:0
Cast: (Cast): /job:localhost/replica:0/task:0/gpu:0
ExpandDims: (ExpandDims): /job:localhost/replica:0/task:0/gpu:0
ResizeBilinear: (ResizeBilinear): /job:localhost/replica:0/task:0/gpu:0
Sub: (Sub): /job:localhost/replica:0/task:0/gpu:0
Mul: (Mul): /job:localhost/replica:0/task:0/gpu:0
conv/Conv2D: (Conv2D): /job:localhost/replica:0/task:0/gpu:0
conv/batchnorm: (BatchNormWithGlobalNormalization): /job:localhost/replica:0/task:0/gpu:0
conv/CheckNumerics: (CheckNumerics): /job:localhost/replica:0/task:0/gpu:0
conv/control_dependency: (Identity): /job:localhost/replica:0/task:0/gpu:0
conv: (Relu): /job:localhost/replica:0/task:0/gpu:0
conv_1/Conv2D: (Conv2D): /job:localhost/replica:0/task:0/gpu:0
conv_1/batchnorm: (BatchNormWithGlobalNormalization): /job:localhost/replica:0/task:0/gpu:0
conv_1/CheckNumerics: (CheckNumerics): /job:localhost/replica:0/task:0/gpu:0
conv_1/control_dependency: (Identity): /job:localhost/replica:0/task:0/gpu:0
conv_1: (Relu): /job:localhost/replica:0/task:0/gpu:0
conv_2/Conv2D: (Conv2D): /job:localhost/replica:0/task:0/gpu:0
conv_2/batchnorm: (BatchNormWithGlobalNormalization): /job:localhost/replica:0/task:0/gpu:0
...

As I understand it, the first few operations are being run on the CPU and the rest on the GPU. Is the data transfer between CPU and GPU the cause of the slow execution? @theclifbar, did you have any such issues while running your model?
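
Here is roughly how I plan to check where the time goes (just a sketch; bottleneck_tensor, jpeg_data_tensor and image_paths are placeholders for the objects retrain.py already builds): reuse a single Session and time only the per-image sess.run calls, so graph and device setup cost is excluded.

    import time

    def time_bottlenecks(sess, bottleneck_tensor, jpeg_data_tensor, image_paths):
        # Time only the per-image sess.run calls against an already-open Session.
        for path in image_paths:
            with open(path, 'rb') as f:
                jpeg_data = f.read()
            start = time.time()
            sess.run(bottleneck_tensor, {jpeg_data_tensor: jpeg_data})
            print('%s: %.3f s' % (path, time.time() - start))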
This is my first post here, so please let me know if I should open a separate issue for this. Any help is much appreciated. Thanks!
