Inception retraining / transfer learning fails when running with GPU #3560
Thanks so much for releasing TensorFlow. We're experimenting with the image retraining example as described here: https://www.tensorflow.org/versions/r0.9/how_tos/image_retraining/index.html
Everything in TensorFlow has worked perfectly for us, including the test and GPU setup validation samples. However, when running the Inception retraining code with the GPU, TensorFlow raises an error. Since the error occurs in
For example, CPU-only bottleneck generation for 600 classes on a recent 36-core machine takes nearly a month, so working multi-GPU support for bottleneck creation would be really great. It would help us learn faster and hopefully contribute to the project sooner.
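To make concrete what "bottleneck generation" involves, here is a minimal sketch of the caching step, assuming the TF r0.9-era graph API. The graph filename and tensor names below match what the frozen classify_image Inception graph ships with, but treat them and all paths as illustrative assumptions, not retrain.py's exact code:

```python
import os
import numpy as np
import tensorflow as tf

# Illustrative names: the frozen Inception graph and the tensors the
# retraining flow reads from it. Treat paths as placeholders.
GRAPH_PATH = 'classify_image_graph_def.pb'
BOTTLENECK_TENSOR = 'pool_3/_reshape:0'   # 2048-d penultimate activations
JPEG_INPUT = 'DecodeJpeg/contents:0'      # feed raw JPEG bytes here

graph_def = tf.GraphDef()
with open(GRAPH_PATH, 'rb') as f:
    graph_def.ParseFromString(f.read())
bottleneck, jpeg_in = tf.import_graph_def(
    graph_def, name='', return_elements=[BOTTLENECK_TENSOR, JPEG_INPUT])

def cache_bottleneck(sess, image_path, cache_dir):
    """Run one image through the network and save its bottleneck vector."""
    with open(image_path, 'rb') as img:
        values = sess.run(bottleneck, {jpeg_in: img.read()})
    np.save(os.path.join(cache_dir, os.path.basename(image_path) + '.npy'),
            np.squeeze(values))

# Training then reuses the cached vectors, so the expensive forward pass
# happens exactly once per image.
with tf.Session() as sess:
    cache_bottleneck(sess, 'flower_photos/daisy/example.jpg',
                     '/tmp/bottlenecks')
```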
Abbreviated output (full output is attached):
```
python tensorflow/examples/image_retraining/retrain.py --image_dir ~/flower_photos
```
Operating System: Ubuntu 14.04
Installed version of CUDA and cuDNN:
Steps to reproduce
What have you tried?
Thank you for the detailed bug report! We have heard of other people having problems with the GTX 1080 (#3507) that were fixed by switching to CUDA 8.0. Is it easy for you to try either a different GPU or building with 8.0? Unfortunately, 8.0 isn't fully supported by TensorFlow yet, so that isn't an ideal solution, but it has unblocked other people and will be supported before long.
Thanks for reading it, @michaelisard! We are very excited by the possibilities opened up by TensorFlow. I can give the CUDA 8.0 RC a try.
We could try other GPUs, though these GTX 1080s were bought specifically for TensorFlow. On a somewhat related question: do you expect the 8 GB of GDDR5X memory to be a major limiting factor? Our training sets typically run around 60 GB (total image file size; the bottlenecks are much smaller). The new Titan X Pascal cards have 12 GB, and I wonder if the 8 GB will handicap us in the long run in terms of batch sizes.
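A rough sanity check on that question (the numbers below are my own back-of-the-envelope with a hypothetical batch size, not measurements): GPU memory requirements scale with batch size and model size, not with total dataset size, since images stream from disk.

```python
# GPU memory pressure comes from what is resident at once (weights plus
# one batch of activations), not from the 60 GB dataset on disk.
batch_size = 100                     # hypothetical
image_floats = 299 * 299 * 3         # Inception-v3 input resolution
input_batch_bytes = batch_size * image_floats * 4  # float32 = 4 bytes
print('input batch: %.1f MB' % (input_batch_bytes / 1e6))  # ~107.3 MB
# Weights and intermediate activations add a few GB on top of this, so
# 8 GB is workable at moderate batch sizes; the extra 4 GB on a Titan X
# mainly buys headroom for larger batches or bigger models.
```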
@theclifbar I was only suggesting you try another GPU as a debugging aid to help pinpoint the problem: TensorFlow should work on the GTX 1080, but others have had trouble with that specific card on earlier versions of CUDA, and that may be the issue here. Please report back if CUDA 8 doesn't fix it.
I can't comment on the bottleneck question I'm afraid since I am not familiar with the details of the model implementation: @shlens does this sound expected?
Thanks @michaelisard, upgrading to the CUDA 8.0 Release Candidate and cuDNN 5, then building from source again with GPUs enabled, has resolved this issue!
@JohnAllen, thanks for your information as well. We've found that the retrainer, when built for GPUs, takes about 0.1 seconds per image bottleneck; previously, on a 36-core machine, it took around 3 seconds per bottleneck with all cores maxed out. If your setup is still taking 3 seconds per image bottleneck, I'm happy to help debug it, as it really should be 20-30x faster on the GTX 1080. You might simply need to rebuild the trainer with CUDA enabled.
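A quick way to confirm a rebuild actually picked up CUDA, using only generic TensorFlow APIs (nothing retrain.py-specific), is to pin a trivial op to the GPU; with soft placement left at its default, this fails fast on a CPU-only build:

```python
import tensorflow as tf

# Pin a trivial op to the first GPU. On a CPU-only build (or with no
# visible CUDA device), running this raises an InvalidArgumentError
# instead of silently falling back to the CPU, because
# allow_soft_placement defaults to False.
with tf.device('/gpu:0'):
    c = tf.constant([1.0, 2.0]) + tf.constant([3.0, 4.0])

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))  # device placements are logged to stderr
```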
The retrainer / transfer learning example does not seem to support multiple GPUs, unfortunately, even with the --num_gpus flag. I can open another issue for that if it's neither by design nor a known issue. Thanks again; you have literally saved us weeks of processing time.
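Since retrain.py doesn't shard bottleneck creation itself, here is a hedged sketch of how one might do it by hand; NUM_GPUS, the tower naming, and the driver function are my own illustration, not anything retrain.py provides. Note that a plain Python loop still serializes the sess.run calls, so you'd want one worker thread per tower to actually keep both GPUs busy.

```python
import tensorflow as tf

NUM_GPUS = 2  # assumption: two visible CUDA devices

def load_graph_def(path):
    """Read the frozen Inception GraphDef from disk."""
    graph_def = tf.GraphDef()
    with open(path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    return graph_def

def build_towers(graph_def, num_gpus):
    """Import one copy of the inference graph per GPU ("towers")."""
    towers = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            bottleneck, jpeg_in = tf.import_graph_def(
                graph_def, name='tower_%d' % i,
                return_elements=['pool_3/_reshape:0',
                                 'DecodeJpeg/contents:0'])
        towers.append((bottleneck, jpeg_in))
    return towers

towers = build_towers(load_graph_def('classify_image_graph_def.pb'),
                      NUM_GPUS)

# DecodeJpeg has no GPU kernel, so soft placement is needed to let it
# fall back to the CPU while the convolutions stay on the GPUs.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))

def bottleneck_for(image_path, image_index):
    """Run one image through a tower chosen round-robin by index."""
    bottleneck, jpeg_in = towers[image_index % NUM_GPUS]
    with open(image_path, 'rb') as f:
        return sess.run(bottleneck, {jpeg_in: f.read()})
```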
Hello everyone. I'm having an issue running the bottlenecks on the GPU. Currently it takes about 1 s per bottleneck, and when I check the GPU utilization with nvidia-smi, it fluctuates between 0 and 20%.
Also, I noticed that this gets printed to my screen every time sess.run() is called.
As I understand it, the operations in the first few layers are being run on the CPU and the rest on the GPU. Is the data transfer between CPU and GPU the cause of the slow execution? @theclifbar did you have any such issues while running your model?
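One generic way to answer that question is TensorFlow's timeline tracing: collect a trace for a single bottleneck sess.run and inspect it in chrome://tracing. The graph path, tensor names, and image filename below are illustrative assumptions, mirroring the earlier sketch:

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Load the frozen Inception graph (same assumptions as the earlier sketch).
graph_def = tf.GraphDef()
with open('classify_image_graph_def.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
bottleneck, jpeg_in = tf.import_graph_def(
    graph_def, name='',
    return_elements=['pool_3/_reshape:0', 'DecodeJpeg/contents:0'])

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess, open('some_image.jpg', 'rb') as img:
    sess.run(bottleneck, {jpeg_in: img.read()},
             options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open it in chrome://tracing to see which ops,
# on which device, dominate the ~1 s per bottleneck.
tl = timeline.Timeline(run_metadata.step_stats)
with open('/tmp/bottleneck_trace.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```

The trace shows a separate lane per device, so it is easy to see whether the CPU-placed ops, the GPU kernels, or the copies between them account for most of the time.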