
TF 1.14.0 training crashes with unimplemented Conv2D errors (works fine in TF 1.13.2) #32691

Closed
mschonwe opened this issue Sep 20, 2019 · 16 comments
Assignees
Labels
comp:apis (High-level API related issues), TF 1.14 (for issues seen with TF 1.14), type:bug (Bug)

Comments

@mschonwe

Environment

  • Ubuntu 16.04:
  • Docker based tensorflow/tensorflow:1.14.0-gpu
  • tensor2tensor==1.14.0 (pip installed in container)
  • Python 2.7
  • CUDA/cuDNN version: 10/7 (defaults from docker image)
  • GPUs (tested on many from 1080 to RTX Titan)

Issue
A change in TensorFlow has broken tensor2tensor Librispeech training.

Running librispeech training crashes with Unimplemented Conv2D errors.

  (0) Unimplemented:  The Conv2D op currently only supports the NHWC tensor format on the CPU. The op was given the format: NCHW
         [[{{node Conv2D}}]]
  (1) Unimplemented:  The Conv2D op currently only supports the NHWC tensor format on the CPU. The op was given the format: NCHW
         [[{{node Conv2D}}]]
         [[Shape_3/_8]]

Expected behavior
This works fine in earlier versions of TensorFlow (e.g. 1.13.2).

Code to reproduce the issue
Using the TensorFlow GPU image from Docker Hub:
run tensorflow/tensorflow:1.14.0-gpu
pip install tensorflow-hub && pip install tensor2tensor
apt-get update && apt-get install sox
t2t-trainer --problem=librispeech_clean_small --model=transformer --output_dir=/models/JUNK --data_dir=/data/ --save_checkpoints_secs=1800 --schedule=train --hparams_set=transformer_librispeech
(note: sox and --generate are only needed once, to prep the dataset)

Other info / logs
Related to closed issue #32017.

@Leslie-Fang
Contributor

Hi @mschonwe
It seems you are installing a GPU version.
Why does it throw an error message about the CPU? Did you run the training on CPU?

@mschonwe
Author

@Leslie-Fang the device placement should be putting these ops on the GPU (afaik). The issue only crops up in newer versions of TF; in older versions the GPU utilization is appropriately high.

@mschonwe
Author

I've tracked down the function where it goes off the rails (when running TF 1.14.0):

import numpy as np
import scipy.signal
import tensorflow as tf

def add_delta_deltas(filterbanks, name=None):
  """Compute time first and second-order derivative channels.
  Args:
    filterbanks: float32 tensor with shape [batch_size, len, num_bins, 1]
    name: scope name
  Returns:
    float32 tensor with shape [batch_size, len, num_bins, 3]
  """
  delta_filter = np.array([2, 1, 0, -1, -2])
  delta_delta_filter = scipy.signal.convolve(delta_filter, delta_filter, "full")

  # Stack identity, delta, and delta-delta filters into one HWIO kernel.
  delta_filter_stack = np.array(
      [[0] * 4 + [1] + [0] * 4, [0] * 2 + list(delta_filter) + [0] * 2,
       list(delta_delta_filter)],
      dtype=np.float32).T[:, None, None, :]

  delta_filter_stack /= np.sqrt(
      np.sum(delta_filter_stack**2, axis=0, keepdims=True))

  # NHWC is requested explicitly here, yet the NCHW error still occurs.
  filterbanks = tf.nn.conv2d(
      filterbanks, delta_filter_stack, [1, 1, 1, 1], "SAME", data_format="NHWC",
      name=name)
  return filterbanks
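For what it's worth, the filter stack that function builds can be checked with NumPy/SciPy alone, independent of TensorFlow (a standalone sketch of just the kernel construction, not the conv itself):

```python
import numpy as np
import scipy.signal

# Rebuild the kernel from add_delta_deltas without TensorFlow.
delta_filter = np.array([2, 1, 0, -1, -2])
delta_delta_filter = scipy.signal.convolve(delta_filter, delta_filter, "full")

delta_filter_stack = np.array(
    [[0] * 4 + [1] + [0] * 4, [0] * 2 + list(delta_filter) + [0] * 2,
     list(delta_delta_filter)],
    dtype=np.float32).T[:, None, None, :]
delta_filter_stack /= np.sqrt(
    np.sum(delta_filter_stack**2, axis=0, keepdims=True))

# An HWIO conv2d filter: height 9, width 1, 1 input channel,
# 3 output channels (identity, delta, delta-delta), each unit-norm.
print(delta_filter_stack.shape)  # (9, 1, 1, 3)
```

So the kernel itself is fine; the failure has to come from how the conv2d op gets placed and rewritten.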

@mschonwe
Author

mschonwe commented Sep 21, 2019

I found an issue that seems related: /issues/26411
I changed add_delta_deltas to hard code placement on CPU:

  with tf.device('/cpu:0'):
    filterbanks = tf.nn.conv2d(
        filterbanks, delta_filter_stack, [1, 1, 1, 1], "SAME",
        data_format="NHWC", name=name)

Training then runs without error. It seems to me it would be preferable to place the conv2d op on the GPU (were it not for this issue).

@gadagashwini-zz gadagashwini-zz self-assigned this Sep 23, 2019
@gadagashwini-zz gadagashwini-zz added TF 1.14 for issues seen with TF 1.14 comp:apis Highlevel API related issues labels Sep 23, 2019
@gadagashwini-zz
Contributor

@mschonwe,
Please provide the minimal standalone code to reproduce the reported issue. Thanks!

@gadagashwini-zz gadagashwini-zz added the stat:awaiting response Status - Awaiting response from author label Sep 23, 2019
@mschonwe
Author

This is based on the tensorflow/tensor2tensor project; the initial post describes how to reproduce. The function that causes the trouble is the conv2d in tensor2tensor/layers/common_audio.py, add_delta_deltas().

The (likely) issue is the optimization pass causing a conv2d op, which should be placed on CPU, to be rewritten to use a version of tf.nn.conv2d() that is only available on GPU.

Since I have a work-around (above) for the bug, I am ok closing this issue.

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Sep 25, 2019
@gadagashwini-zz gadagashwini-zz added the type:support Support issues label Sep 25, 2019
@ymodak ymodak added type:bug Bug and removed type:support Support issues labels Sep 27, 2019
@rachellim rachellim assigned reedwm and unassigned rachellim Sep 30, 2019
@reedwm
Member

reedwm commented Sep 30, 2019

@mschonwe, thank you for reporting this issue. I am confused as to why placing the op on the CPU would help at all. The error says the CPU only supports NHWC and that NCHW is being used. Yet the add_delta_deltas function explicitly uses NHWC, so the error should not occur whether the op is on the CPU or the GPU. Also, the error message indicates the op is already placed on the CPU, so explicitly placing it on the CPU should not change anything.

Do you understand why explicitly placing the op on the CPU would help? If not, I'll CC some other people to try to figure this out.
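(For readers unfamiliar with the two formats: NHWC and NCHW differ only in axis order, and converting between them is a transpose. A minimal NumPy illustration, not TensorFlow code:)

```python
import numpy as np

# A batch of 2 images, 4x5 spatial, 3 channels, in NHWC order.
x_nhwc = np.zeros((2, 4, 5, 3))

# NHWC -> NCHW: move the channel axis in front of the spatial axes.
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # shape (2, 3, 4, 5)

# NCHW -> NHWC: the inverse permutation recovers the original layout.
x_back = np.transpose(x_nchw, (0, 2, 3, 1))  # shape (2, 4, 5, 3)
```

TensorFlow's layout optimizer inserts exactly this kind of transpose pair around convolutions when it rewrites them to NCHW for the GPU.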

@mschonwe
Author

mschonwe commented Sep 30, 2019

@reedwm I don't understand why it works, only that it does.

I agree that the error message seems to indicate that the offending op is placed on the CPU even without the explicit device placement.

My guess was that (w/o the explicit device placement) there are multiple optimization passes: the first chooses the op for the GPU, and a later pass places that op on the CPU, which then fails.

/issues/26411 relates to a different problem, but seems related.

@reedwm
Member

reedwm commented Sep 30, 2019

/CC @ezhulenev, do you know if this is an issue with the layout optimizer?

@mschonwe, also consider trying TF 1.15-rc1 (pip install tensorflow-gpu==1.15rc1)

@reedwm reedwm added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Sep 30, 2019
@mschonwe
Author

I did a quick try with TF 1.15-rc1, but got an error:
AttributeError: 'RunConfig' object has no attribute '_session_creation_timeout_secs'
I tried hard-coding it to session_creation_timeout_secs=7200, but that didn't work (because the subfunctions didn't know about that parameter). I'll look at this again when I have time.

@ezhulenev
Member

Based on my understanding, this is what happens:

  1. Conv2D without an explicit device string is placed on the GPU.
  2. The layout optimizer changes the data format NHWC->NCHW because it's optimal for the GPU.
  3. Then something puts Conv2D back on the CPU <<<---- after a discussion we have no idea what it could be

When Conv2D is explicitly placed on the CPU, it prevents the layout optimizer from swapping the data format.

@mschonwe what if you explicitly put this Conv2D on the GPU?
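(Aside: the layout swap in step 2 is semantics-preserving in itself; the same convolution arithmetic runs whichever axis order is used. A toy single-channel NumPy/SciPy sketch, purely illustrative and not TensorFlow API:)

```python
import numpy as np
from scipy.signal import correlate

# One image, one channel: compute the same cross-correlation ("conv")
# from NHWC-ordered and NCHW-ordered copies of the data.
img = np.arange(20.0).reshape(4, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])

x_nhwc = img[None, :, :, None]               # [N, H, W, C]
x_nchw = np.transpose(x_nhwc, (0, 3, 1, 2))  # [N, C, H, W]

out_nhwc = correlate(x_nhwc[0, :, :, 0], kernel, mode="valid")
out_nchw = correlate(x_nchw[0, 0], kernel, mode="valid")
assert np.allclose(out_nhwc, out_nchw)  # same math, different layout
```

So the rewrite is only a problem when a later pass moves the now-NCHW op onto a device (the CPU) that doesn't implement that format.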

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Oct 1, 2019
@mschonwe
Author

mschonwe commented Oct 2, 2019

Hacked past the session_creation_timeout_secs issue by hardcoding a return in /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/run_config.py

@property
def session_creation_timeout_secs(self):
  return 7200  # was: return self._session_creation_timeout_secs

Still fails in TF1.15-rc2 when no explicit device placement is given.

HOWEVER, it does WORK now when GPU is specified (with tf.device('/gpu:0'):).

@ymodak ymodak removed their assignment Oct 4, 2019
@mschonwe
Author

mschonwe commented Oct 9, 2019

Still fails in TF1.15-rc3 when no explicit device placement is given.

@rusiaaman

Faced the same issue. The program gets stuck when using the GPU but works on the CPU. There is no error message for me; it just remains stuck at the tf.nn.conv2d call.

Explicit device placement by @mschonwe helped.

/issues/26411 relates to a different problem, but seems related.

This seems to be it; tf.nn.conv2d or tf.nn.conv1d inside a tf.data map doesn't work with the GPU enabled for me.

@tensorflowbutler
Member

Hi There,

We are checking to see if you still need help on this, as you are using an older version of TensorFlow which is officially considered end of life. We recommend that you upgrade to the latest 2.x version and let us know if the issue still persists there. Please open a new issue for any help you need against 2.x, and we will get you the right help.

This issue will be closed automatically 7 days from now. If you still need help with this issue, please provide us with more information.

