
TF 1.14.0 training crashes with unimplemented Conv2D errors (works fine in TF 1.13.2) #32691

Closed
mschonwe opened this issue Sep 20, 2019 · 16 comments
Assignees
Labels
comp:apis (High-level API related issues), TF 1.14 (for issues seen with TF 1.14), type:bug (Bug)

Comments

@mschonwe

Environment

  • Ubuntu 16.04:
  • Docker based tensorflow/tensorflow:1.14.0-gpu
  • tensor2tensor==1.14.0 (pip installed in container)
  • Python 2.7
  • CUDA/cuDNN version: 10/7 (defaults from docker image)
  • GPUs (tested on many from 1080 to RTX Titan)

Issue
A change in TensorFlow has broken tensor2tensor Librispeech training.

Running librispeech training crashes with Unimplemented Conv2D errors.

  (0) Unimplemented:  The Conv2D op currently only supports the NHWC tensor format on the CPU. The op was given the format: NCHW
         [[{{node Conv2D}}]]
  (1) Unimplemented:  The Conv2D op currently only supports the NHWC tensor format on the CPU. The op was given the format: NCHW
         [[{{node Conv2D}}]]
         [[Shape_3/_8]]

Expected behavior
This works fine in earlier versions of TensorFlow (e.g. 1.13.2).

Code to reproduce the issue
Using the TensorFlow GPU image from Docker Hub:
run tensorflow/tensorflow:1.14.0-gpu
pip install tensorflow-hub && pip install tensor2tensor
apt-get update && apt-get install sox
t2t-trainer --problem=librispeech_clean_small --model=transformer --output_dir=/models/JUNK --data_dir=/data/ --save_checkpoints_secs=1800 --schedule=train --hparams_set=transformer_librispeech
(note: sox and --generate are only needed once, to prep the dataset)

Other info / logs
Related to closed issue #32017.

@Leslie-Fang
Contributor

Hi @mschonwe
It seems you are installing a GPU version.
Why does it throw an error message about the CPU? Did you run the training on CPU?

@mschonwe
Author

@Leslie-Fang the device placement should be putting these ops on the GPU (afaik). The issue only crops up in newer versions of TF; in older versions the GPU utilization is appropriately high.

@mschonwe
Author

I've tracked down the function where it goes off the rails (when running TF 1.14.0):

import numpy as np
import scipy.signal
import tensorflow as tf

def add_delta_deltas(filterbanks, name=None):
  """Compute time first and second-order derivative channels.
  Args:
    filterbanks: float32 tensor with shape [batch_size, len, num_bins, 1]
    name: scope name
  Returns:
    float32 tensor with shape [batch_size, len, num_bins, 3]
  """
  delta_filter = np.array([2, 1, 0, -1, -2])
  delta_delta_filter = scipy.signal.convolve(delta_filter, delta_filter, "full")

  # Stack identity, delta, and delta-delta filters into one HWIO kernel.
  delta_filter_stack = np.array(
      [[0] * 4 + [1] + [0] * 4, [0] * 2 + list(delta_filter) + [0] * 2,
       list(delta_delta_filter)],
      dtype=np.float32).T[:, None, None, :]

  delta_filter_stack /= np.sqrt(
      np.sum(delta_filter_stack**2, axis=0, keepdims=True))

  # NHWC is requested explicitly here, yet the NCHW error still occurs.
  filterbanks = tf.nn.conv2d(
      filterbanks, delta_filter_stack, [1, 1, 1, 1], "SAME", data_format="NHWC",
      name=name)
  return filterbanks
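For what it's worth, the filter stack that function builds can be checked with NumPy/SciPy alone, independent of TensorFlow (a standalone sketch of just the kernel construction, not the conv itself):

```python
import numpy as np
import scipy.signal

# Rebuild the kernel from add_delta_deltas without TensorFlow.
delta_filter = np.array([2, 1, 0, -1, -2])
delta_delta_filter = scipy.signal.convolve(delta_filter, delta_filter, "full")

delta_filter_stack = np.array(
    [[0] * 4 + [1] + [0] * 4, [0] * 2 + list(delta_filter) + [0] * 2,
     list(delta_delta_filter)],
    dtype=np.float32).T[:, None, None, :]
delta_filter_stack /= np.sqrt(
    np.sum(delta_filter_stack**2, axis=0, keepdims=True))

# An HWIO conv2d filter: height 9, width 1, 1 input channel,
# 3 output channels (identity, delta, delta-delta), each unit-norm.
print(delta_filter_stack.shape)  # (9, 1, 1, 3)
```

So the kernel itself is fine; the failure has to come from how the conv2d op gets placed and rewritten.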

@mschonwe
Author

mschonwe commented Sep 21, 2019

I found an issue that seems related: /issues/26411
I changed add_delta_deltas to hard code placement on CPU:

  with tf.device('/cpu:0'):
    filterbanks = tf.nn.conv2d(
        filterbanks, delta_filter_stack, [1, 1, 1, 1], "SAME",
        data_format="NHWC", name=name)

Training then runs without error. It seems to me it would be preferable to place the conv2d op on the GPU (were it not for this issue).

@gadagashwini-zz gadagashwini-zz self-assigned this Sep 23, 2019
@gadagashwini-zz gadagashwini-zz added TF 1.14 for issues seen with TF 1.14 comp:apis Highlevel API related issues labels Sep 23, 2019
@gadagashwini-zz
Contributor

@mschonwe,
Please provide the minimal standalone code to reproduce the reported issue. Thanks!

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Sep 23, 2019
@mschonwe
Author

This is based on the tensorflow/tensor2tensor project; the initial post describes how to reproduce. The function that causes the trouble is the conv2d in tensor2tensor/layers/common_audio.py, add_delta_deltas().

The (likely) issue is the optimization pass causing a conv2d op, which should be placed on CPU, to be rewritten to use a version of tf.nn.conv2d() that is only available on GPU.

Since I have a work-around (above) for the bug, I am ok closing this issue.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Sep 25, 2019
@gadagashwini-zz gadagashwini-zz added the type:support Support issues label Sep 25, 2019
@ymodak ymodak added type:bug Bug and removed type:support Support issues labels Sep 27, 2019
@rachellim rachellim assigned reedwm and unassigned rachellim Sep 30, 2019
@reedwm
Member

reedwm commented Sep 30, 2019

@mschonwe, thank you for reporting this issue. I am confused as to why placing the op on the CPU would help at all. The error says the CPU only supports NHWC and that NCHW is being used. Yet the add_delta_deltas function explicitly uses NHWC, so the error should not occur whether the op is on the CPU or the GPU. Also, the error message indicates the op is already placed on the CPU, so explicitly placing it on the CPU should not change anything.

Do you understand why explicitly placing the op on the CPU would help? If not, I'll CC some other people to try to figure this out.
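(For readers unfamiliar with the two formats: NHWC and NCHW differ only in axis order, and converting between them is a transpose. A minimal NumPy illustration, not TensorFlow code:)

```python
import numpy as np

# A batch of 2 images, 4x5 spatial, 3 channels, in NHWC order.
x_nhwc = np.zeros((2, 4, 5, 3))

# NHWC -> NCHW: move the channel axis in front of the spatial axes.
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # shape (2, 3, 4, 5)

# NCHW -> NHWC: the inverse permutation recovers the original layout.
x_back = np.transpose(x_nchw, (0, 2, 3, 1))  # shape (2, 4, 5, 3)
```

TensorFlow's layout optimizer inserts exactly this kind of transpose pair around convolutions when it rewrites them to NCHW for the GPU.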

@mschonwe
Author

mschonwe commented Sep 30, 2019

@reedwm I don't understand why it works, only that it does.

I agree that the error message seems to indicate that the offending op is placed on the CPU even without the explicit device placement.

My guess was that (w/o the explicit device placement) there are multiple optimization passes: the first chooses the op for the GPU, and a later pass places that op on the CPU, which then fails.

/issues/26411 relates to a different problem, but seems related.

@reedwm
Member

reedwm commented Sep 30, 2019

/CC @ezhulenev, do you know if this is an issue with the layout optimizer?

@mschonwe, also consider trying TF 1.15-rc1 (pip install tensorflow-gpu==1.15rc1)

@reedwm reedwm added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Sep 30, 2019
@mschonwe
Author

I did a quick try with TF 1.15-rc1, but got an error:
AttributeError: 'RunConfig' object has no attribute '_session_creation_timeout_secs'
I tried hard-coding it to session_creation_timeout_secs=7200, but that didn't work (because the subfunctions didn't know about that parameter). I'll look at this again when I have time.

@ezhulenev
Member

Based on my understanding, this is what happens:

  1. Conv2D without an explicit device string is placed on the GPU.
  2. The layout optimizer changes the data format NHWC->NCHW because it's optimal for the GPU.
  3. Then something puts Conv2D back on the CPU <<<---- after a discussion we have no idea what it could be

When Conv2D is explicitly placed on the CPU, it prevents the layout optimizer from swapping the data format.

@mschonwe what if you explicitly put this Conv2D on the GPU?
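(Aside: the layout swap in step 2 is semantics-preserving in itself; the same convolution arithmetic runs whichever axis order is used. A toy single-channel NumPy/SciPy sketch, purely illustrative and not TensorFlow API:)

```python
import numpy as np
from scipy.signal import correlate

# One image, one channel: compute the same cross-correlation ("conv")
# from NHWC-ordered and NCHW-ordered copies of the data.
img = np.arange(20.0).reshape(4, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

x_nhwc = img[None, :, :, None]               # [N, H, W, C]
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # [N, C, H, W]

out_nhwc = correlate(x_nhwc[0, :, :, 0], kernel, mode="valid")
out_nchw = correlate(x_nchw[0, 0], kernel, mode="valid")
assert np.allclose(out_nhwc, out_nchw)  # same math, different layout
```

So the rewrite is only a problem when a later pass moves the now-NCHW op onto a device (the CPU) that doesn't implement that format.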

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 1, 2019
@mschonwe
Author

mschonwe commented Oct 2, 2019

Hacked past the session_creation_timeout_secs issue by hardcoding a return in /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/run_config.py

@property
def session_creation_timeout_secs(self):
  return 7200  # was: return self._session_creation_timeout_secs

Still fails in TF1.15-rc2 when no explicit device placement is given.

HOWEVER, it does WORK now when GPU is specified (with tf.device('/gpu:0'):).

@ymodak ymodak removed their assignment Oct 4, 2019
@mschonwe
Author

mschonwe commented Oct 9, 2019

Still fails in TF1.15-rc3 when no explicit device placement is given.

@rusiaaman

Faced the same issue. The program gets stuck when using the GPU but works on the CPU. There is no error message for me; it just remains stuck at the tf.nn.conv2d call.

Explicit device placement by @mschonwe helped.

/issues/26411 relates to a different problem, but seems related.

This seems to be it; tf.nn.conv2d or tf.nn.conv1d inside a tf.data map doesn't work with the GPU enabled for me.

@tensorflowbutler
Member

Hi There,

We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists there. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.

