
BUG: symbolic layer triggers device creation #25946

Closed
ppwwyyxx opened this issue Feb 20, 2019 · 12 comments
Assignees
Labels
comp:keras (Keras related issues) · stale (to be closed automatically if no activity) · stat:awaiting response (awaiting response from author) · TF 1.13 · type:bug

Comments

@ppwwyyxx
Contributor

ppwwyyxx commented Feb 20, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: n/a
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): b'v1.13.0-rc2-0-gc865ec5621' 1.13.0-rc2
  • Python version: 3.7
  • Bazel version (if compiling from source): n/a
  • GCC/Compiler version (if compiling from source): n/a
  • CUDA/cuDNN version: 10.0 / 7.4.2
  • GPU model and memory: GTX 960M

Describe the current behavior
The following code:

import tensorflow as tf
a = tf.placeholder(tf.float32, [100, 100, 100, 100])
b = tf.layers.Conv2DTranspose(3, 3, data_format='channels_first')
output = b.apply(a)

prints:

2019-02-20 10:20:05.505595: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-20 10:20:05.578782: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-20 10:20:05.579477: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55fd579f65d0 executing computations on platform CUDA. Devices:
2019-02-20 10:20:05.579513: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 960M, Compute Capability 5.0
2019-02-20 10:20:05.606095: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz                                
2019-02-20 10:20:05.606746: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55fd57b39b00 executing computations on platform Host. Devices:
2019-02-20 10:20:05.606785: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>               
2019-02-20 10:20:05.607093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:                              
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.0975
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.92GiB
2019-02-20 10:20:05.607118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0                                
2019-02-20 10:20:05.608205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-20 10:20:05.608229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0                                                        
2019-02-20 10:20:05.608240: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N                                                       
2019-02-20 10:20:05.608504: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1742 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)        

As the log shows, constructing this purely symbolic graph initializes the GPU devices. This should not happen in symbolic functions.

Initializing the GPU devices has many side effects.
It can lead to different kinds of failures, such as #8136 (comment). The largest side effect is that any GPU-related flags given to a tf.Session created after device initialization will not take effect. It also makes it much harder to use Horovod, because Horovod requires initializing the GPU in a specific way (with visible_device_list). If a graph containing Conv2DTranspose is created before the session (which is the standard way of using TF 1.0), Horovod will fail to initialize the session. (cc @alsrgv)

The bug exists for Conv2DTranspose, but not for Conv2D.
It exists in 1.13.0rc0 and does not exist in 1.12.0.
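For concreteness, a minimal sketch of the failure mode described above (assuming TF 1.13 on a GPU build; the specific config options are only illustrative):

import tensorflow as tf

# Building the graph already initializes the GPU with default options...
a = tf.placeholder(tf.float32, [100, 100, 100, 100])
output = tf.layers.Conv2DTranspose(3, 3, data_format='channels_first').apply(a)

# ...so GPU-related options passed to a session created afterwards do not
# take effect (and in the Horovod case, session creation fails outright).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = '0'
config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)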

@ppwwyyxx
Contributor Author

ppwwyyxx commented Feb 20, 2019

This bug comes from keras/backend.py, where conv2d_transpose lists the available devices in order to check data_format.

In fact, the entire keras/backend.py file relies heavily on looking at the available devices.
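For reference, the device query itself is what triggers initialization. A minimal illustration (assuming TF 1.13; device_lib is an internal module, used here only to make the point explicit):

from tensorflow.python.client import device_lib

# Merely listing the local devices initializes the GPU runtime for the
# process; the Keras backend performs an equivalent query when it checks
# whether NCHW is supported.
print([d.name for d in device_lib.list_local_devices()])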

@alsrgv
Contributor

alsrgv commented Feb 21, 2019

I'm guessing we'll have to stick with https://github.com/horovod/horovod/blob/master/examples/keras_imagenet_resnet50.py#L59 in a preamble for any Keras API usage.
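For reference, a sketch of that preamble (paraphrasing the linked example; assumes horovod.keras imported as hvd and tf.keras.backend as K):

import tensorflow as tf
from tensorflow.keras import backend as K
import horovod.keras as hvd

# Initialize Horovod and pin each process to its own GPU *before* any Keras
# call that might list devices.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))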

@facaiy facaiy added type:bug Bug comp:keras Keras related issues TF 1.13 Issues related to TF 1.13 labels Feb 22, 2019
@ppwwyyxx
Contributor Author

3 weeks with no response?

@facaiy facaiy added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 15, 2019
@facaiy
Member

facaiy commented Mar 15, 2019

@qlzh727 Are you a good person to look at this?

@qlzh727
Member

qlzh727 commented Mar 15, 2019

I am quite occupied right now with some RNN work, but I will reroute this to the correct owner.

@robieta

robieta commented Mar 15, 2019

It's not obvious to me how one would get around this, given that checking devices triggers initialization code if the devices are not already initialized. NHWC vs. NCHW device compatibility issues are among the more common difficulties users encounter, which is why we check for them. Ultimately, I think @alsrgv's solution is probably correct: if you need to set specific process-level config, it will have to be done at the very start.

That said, if you can think of a better solution feel free to suggest it or open a PR.

@ppwwyyxx
Contributor Author

Device initialization is not the only issue here.
A summary of the cause:
Certain Keras layers call the following function:

def _has_nchw_support():
  explicitly_on_cpu = _is_current_explicit_device('CPU')
  gpus_available = bool(_get_available_gpus())
  return not explicitly_on_cpu and gpus_available

in keras/backend.py. When the function returns False but the layer is called with the NCHW format, the layer applies format conversions such as transposes.
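A simplified sketch of that fallback pattern (not the actual keras/backend.py code; it reuses the _has_nchw_support function quoted above):

import tensorflow as tf

def conv2d_nchw_sketch(x, kernel):
  # If NCHW appears unsupported, transpose to NHWC, run the op, and
  # transpose the result back to NCHW.
  if _has_nchw_support():
    return tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='SAME',
                        data_format='NCHW')
  x = tf.transpose(x, [0, 2, 3, 1])     # NCHW -> NHWC
  y = tf.nn.conv2d(x, kernel, strides=[1, 1, 1, 1], padding='SAME',
                   data_format='NHWC')
  return tf.transpose(y, [0, 3, 1, 2])  # NHWC -> NCHW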

There are at least three issues with this approach:

  1. The function _has_nchw_support is clearly wrong.
    Many of the involved ops support NCHW on CPUs with an MKL build, and on TPUs.

    Consequence: these Keras layers do not behave properly (transposes may be added) on CPUs with an MKL build or on TPUs.

  2. Graph construction should be conceptually independent of execution.
    -- This, IMHO, is the core beauty of a graph computation framework.

    By looking at the available devices during graph construction, it makes the implicit assumption that the graph will be executed on the same devices, which is often not valid.

    Consequence: these Keras layers do not behave properly if the graph is not executed on the same devices. Examples include:
    (1) Creating a graph for deployment (on different machines)
    (2) Architecture search (where some workers generate graphs and other workers run them)
    (3) Distributed graphs with heterogeneous workers, where the whole graph may be constructed on a single worker.

    The automatic format conversion, if needed, should be done at the execution level instead.

  3. Looking at GPU devices has side effects. This is an unfortunate fact.

    Consequence: after constructing a graph with these Keras layers, users cannot create sessions with custom configs, and as a result cannot use Horovod, set a GPU memory fraction, and so on.

    Workaround: create the session before building the graph (sketched right after this list). But this breaks the standard define-and-run paradigm of TF 1.0, and most code using TF is not written like this.
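A sketch of that workaround (TF 1.x assumed): create the session, with the desired config, before any graph construction, so that the devices are initialized with the options you actually want.

import tensorflow as tf

# Devices are initialized here, using our config.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)

# Only now build the graph containing Keras-backed layers.
a = tf.placeholder(tf.float32, [100, 100, 100, 100])
output = tf.layers.Conv2DTranspose(3, 3, data_format='channels_first').apply(a)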

My recommendations:

  • The first issue obviously needs to be addressed.
  • For backward compatibility with previous versions, add a switch so that these layers do not look at devices when called through tf.layers, but may look at devices when called through tf.keras.layers.
    I personally prefer to see the code crash (rather than silently transpose many times) when there are no appropriate kernels registered for the devices.
  • In the long run, it is best not to look at devices at all and to transform the graph at execution time instead.

@ppwwyyxx
Contributor Author

ppwwyyxx commented Mar 31, 2019

The implementation of

def _has_nchw_support():
  explicitly_on_cpu = _is_current_explicit_device('CPU')
  gpus_available = bool(_get_available_gpus())
  return not explicitly_on_cpu and gpus_available

appears to have more bugs than the ones I pointed out above: it does not handle DeviceSpec correctly. This makes valid code crash, as reported in #27259 and #23197.

These issues do not exist in TF 1.12, where the implementation of Conv2DTranspose is not backed by Keras.
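For illustration only (the exact repros are in the linked issues; this is an assumed approximation): the crashes involve valid code that builds a Keras-backed layer inside an explicit device scope, which exercises _is_current_explicit_device.

import tensorflow as tf

# Building a layer under an explicit device scope; per the linked reports,
# the device handling in keras/backend.py does not cope with this correctly.
with tf.device('/cpu:0'):
    a = tf.placeholder(tf.float32, [1, 3, 32, 32])
    out = tf.keras.layers.Conv2DTranspose(3, 3, data_format='channels_first')(a)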

@rchao rchao self-assigned this Aug 19, 2019
@rmothukuru rmothukuru added stat:awaiting tensorflower Status - Awaiting response from tensorflower stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Mar 12, 2021
@rmothukuru rmothukuru self-assigned this Mar 17, 2021
@rmothukuru
Contributor

@ppwwyyxx,
Sorry for the delayed response. When we execute the code,

import tensorflow as tf
a = tf.placeholder(tf.float32, [100, 100, 100, 100])
b = tf.layers.Conv2DTranspose(3, 3, data_format='channels_first')
output = b.apply(a)

using the latest version of TensorFlow, with slight modifications for compatibility, we see that the GPUs are no longer initialized.

Please find the gist of the working code. Thanks!
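The compatibility-adjusted repro is presumably along these lines (an assumption; the exact code is in the linked gist):

import tensorflow as tf

# Run the original TF 1.x repro under TF 2.x compatibility mode.
tf.compat.v1.disable_eager_execution()
a = tf.compat.v1.placeholder(tf.float32, [100, 100, 100, 100])
b = tf.compat.v1.layers.Conv2DTranspose(3, 3, data_format='channels_first')
output = b.apply(a)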

@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 24, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

