TF 1.14.0 training crashes with unimplemented Conv2D errors (works fine in TF 1.13.2) #32691
Comments
Hi @mschonwe,
@Leslie-Fang the device placement should be putting these ops on the GPU (as far as I know). The issue only crops up in newer versions of TF; in older versions the GPU utilization is appropriately high.
I've tracked down the function where it goes off the rails (when running TF 1.14.0):
I found an issue that seems related: /issues/26411
Training runs without error. It seems to me it would be preferable to place the conv2d op on the GPU (were it not for this issue).
@mschonwe,
This is based on the tensorflow/tensor2tensor project; the initial post describes how to reproduce it. The function that causes the trouble is the conv2d in tensor2tensor/layers/common_audio.py add_delta_deltas(). The likely issue is that an optimization pass rewrites a conv2d op, which should be placed on the CPU, to use a version of tf.nn.conv2d() that is only available on the GPU. Since I have a work-around (above) for the bug, I am OK with closing this issue.
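For reference, the work-around pattern described above can be sketched as follows. This is a minimal illustration, not the actual add_delta_deltas() code: the filter and input shapes here are made up, and it assumes a TF install where `tf.nn.conv2d` runs eagerly (TF 2.x) or inside a session (TF 1.x).

```python
import tensorflow as tf

# Hypothetical sketch of the work-around: pin the conv2d onto the CPU so
# the layout optimizer cannot rewrite it into an NCHW (GPU-only) variant.
# Shapes are illustrative only.
filters = tf.ones([3, 1, 1, 1])    # [height, width, in_channels, out_channels]
features = tf.ones([1, 10, 8, 1])  # NHWC: [batch, time, freq, channels]

with tf.device("/cpu:0"):
    # Explicit CPU placement keeps the op in NHWC, which the CPU kernel supports.
    out = tf.nn.conv2d(features, filters, strides=[1, 1, 1, 1], padding="SAME")

print(out.shape)  # (1, 10, 8, 1): SAME padding with stride 1 preserves the shape
```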
@mschonwe, thank you for reporting this issue. I am confused why placing on the CPU would help at all. The error is that the CPU only supports NHWC and that NCHW is being used. Do you understand why explicitly placing the op on the CPU would help? If not, I'll CC some other people to try to figure this out.
@reedwm I don't understand why it works, only that it does. I agree that the error message seems to indicate that the offending op is placed on the CPU even without the explicit device placement. My guess is that (without the explicit device placement) there are multiple optimization passes: the first chooses the op for the GPU, and a later optimization places that op on the CPU, which then fails. /issues/26411 describes a different problem, but seems related.
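The guessed sequence of events can be sketched as a toy simulation. This is pure Python with no TF; the registry contents are assumptions inferred from the error message in this thread, and the names are illustrative, not actual TF internals.

```python
# Toy model of per-device kernel support: the CPU Conv2D kernel only
# implements NHWC, while the GPU kernel supports both layouts.
SUPPORTED = {
    ("Conv2D", "CPU"): {"NHWC"},
    ("Conv2D", "GPU"): {"NHWC", "NCHW"},
}

def run_kernel(op, device, data_format):
    """Run an op on a device, failing like TF's 'Unimplemented' error
    when the kernel does not support the requested data format."""
    if data_format not in SUPPORTED[(op, device)]:
        raise RuntimeError(
            f"Unimplemented: {op} on {device} does not support {data_format}")
    return "ok"

# Pass 1: the placer chooses the GPU, so the layout optimizer
# rewrites the op from NHWC to NCHW. This is fine on the GPU.
assert run_kernel("Conv2D", "GPU", "NCHW") == "ok"

# Pass 2 (the guess above): a later pass moves the now-NCHW op
# to the CPU, whose kernel only implements NHWC, and it crashes.
try:
    run_kernel("Conv2D", "CPU", "NCHW")
except RuntimeError as e:
    print(e)  # Unimplemented: Conv2D on CPU does not support NCHW
```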
/CC @ezhulenev, do you know if this is an issue with the layout optimizer? @mschonwe, also consider trying TF 1.15-rc1.
I did a quick try with TF 1.15-rc1, but got an error:
Based on my understanding, this is what happens:
When the Conv2D is explicitly placed on the CPU, it prevents the layout optimizer from swapping the data format. @mschonwe, what if you explicitly put this Conv2D on the GPU?
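The data-format swap being discussed is just a reordering of the tensor dimensions. A minimal pure-Python sketch of what the layout optimizer does to a shape (no TF required; the function names are made up for illustration):

```python
def nhwc_to_nchw(shape):
    """Reorder an NHWC shape tuple to NCHW, as the layout optimizer
    does when it rewrites a conv for the GPU."""
    n, h, w, c = shape
    return (n, c, h, w)

def nchw_to_nhwc(shape):
    """Inverse reordering, back to the CPU-supported NHWC layout."""
    n, c, h, w = shape
    return (n, h, w, c)

nhwc = (32, 80, 40, 3)  # batch, height, width, channels
nchw = nhwc_to_nchw(nhwc)
print(nchw)  # (32, 3, 80, 40)
assert nchw_to_nhwc(nchw) == nhwc  # the swap round-trips cleanly
```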
Hacked past the session_creation_timeout_secs issue by hardcoding a return in /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/run_config.py
Still fails in TF 1.15-rc2 when no explicit device placement is given. However, it does work now when the GPU is specified.
Still fails in TF 1.15-rc3 when no explicit device placement is given.
Faced the same issue. The program gets stuck when using the GPU but works on the CPU. There is no error message for me; it just hangs at the tf.nn.conv2d call. The explicit device placement by @mschonwe helped.
This seems to be it: tf.nn.conv2d or tf.nn.conv1d inside a tf.data.Dataset.map doesn't work with the GPU enabled for me.
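Applying the same work-around inside a Dataset.map might look like the sketch below. This is a hypothetical illustration assuming TF 2.x eager execution; the `add_context` function and all shapes are made up, and a conv1d stands in for whatever op the map actually applies.

```python
import tensorflow as tf

# conv1d filter: [width, in_channels, out_channels] (illustrative shape)
kernel = tf.ones([3, 1, 1])

def add_context(x):
    # Pin the conv onto the CPU, mirroring the work-around from this thread.
    with tf.device("/cpu:0"):
        return tf.nn.conv1d(x, kernel, stride=1, padding="SAME")

# Four dummy sequences of length 10 with 1 channel, batched in pairs.
ds = tf.data.Dataset.from_tensor_slices(tf.ones([4, 10, 1])).batch(2)
ds = ds.map(add_context)

for batch in ds:
    # SAME padding with stride 1 preserves the [batch, width, channels] shape.
    assert tuple(batch.shape) == (2, 10, 1)
```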
Hi there, we are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists in newer versions. Please open a new issue for any help you need against 2.x, and we will get you the right help. This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.
Environment
Issue
A change in TensorFlow has broken tensor2tensor LibriSpeech training.
Running LibriSpeech training crashes with Unimplemented Conv2D errors.
Expected behavior
This works fine in earlier versions of TensorFlow (e.g. 1.13.2).
Code to reproduce the issue
Via the NVIDIA Docker Hub:

```shell
run tensorflow/tensorflow:1.14.0-gpu
pip install tensorflow-hub && pip install tensor2tensor
apt-get update && apt-get install sox
t2t-trainer --problem=librispeech_clean_small --model=transformer --output_dir=/models/JUNK --data_dir=/data/ --save_checkpoints_secs=1800 --schedule=train --hparams_set=transformer_librispeech
```
(note: sox and --generate are only needed once, to prep the dataset)
Other info / logs
Related to closed issue #32017.