
tf.config.set_soft_device_placement() seems to have no effect #29342

Closed
ageron opened this issue Jun 3, 2019 · 12 comments
Labels: comp:gpu (GPU related issues) · TF 2.0 (Issues relating to TensorFlow 2.0) · type:bug (Bug)

ageron (Contributor) commented Jun 3, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    MacOSX 10.13.6
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    N/A
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    VERSION='2.0.0-dev20190527'
    GIT_VERSION='v1.12.1-2821-gc5b8e15064'
  • Python version:
    3.5
  • Bazel version (if compiling from source):
    N/A
  • GCC/Compiler version (if compiling from source):
    N/A
  • CUDA/cuDNN version:
    CUDA 10.0 (it's just a Colab GPU instance)
  • GPU model and memory:
    Tesla P4 15079MiB

Describe the current behavior
The tf.config.set_soft_device_placement() function seems to have no effect: when I create an integer variable and try to place it on a GPU, I still get an exception.

Describe the expected behavior
I expect soft placement to fall back to using the CPU, with no error.

Code to reproduce the issue

import tensorflow as tf
tf.config.set_soft_device_placement(True)
with tf.device("/gpu:0"):
    f = tf.Variable(42)

Other info / logs
The code above causes the following exception:

---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-3-1babaf613bc3> in <module>()
      2 tf.config.set_soft_device_placement(True)
      3 with tf.device("/gpu:0"):
----> 4     f = tf.Variable(42)

10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    260       return cls._variable_v1_call(*args, **kwargs)
    261     elif cls is Variable:
--> 262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
    264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in _variable_v2_call(cls, initial_value, trainable, validate_shape, caching_device, name, variable_def, dtype, import_scope, constraint, synchronization, aggregation, shape)
    254         synchronization=synchronization,
    255         aggregation=aggregation,
--> 256         shape=shape)
    257 
    258   def __call__(cls, *args, **kwargs):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in <lambda>(**kws)
    235                         shape=None):
    236     """Call on Variable class. Useful to force the signature."""
--> 237     previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
    238     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    239       previous_getter = _make_getter(getter, previous_getter)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator_v2(next_creator, **kwargs)
   2549       synchronization=synchronization,
   2550       aggregation=aggregation,
-> 2551       shape=shape)
   2552 
   2553 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
--> 264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    265 
    266 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
    462           synchronization=synchronization,
    463           aggregation=aggregation,
--> 464           shape=shape)
    465 
    466   def __repr__(self):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, shape)
    616               shared_name=shared_name,
    617               name=name,
--> 618               graph_mode=self._in_graph_mode)
    619         # pylint: disable=protected-access
    620         if (self._in_graph_mode and initial_value is not None and

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in eager_safe_variable_handle(initial_value, shape, shared_name, name, graph_mode)
    223   dtype = initial_value.dtype.base_dtype
    224   return variable_handle_from_shape_and_dtype(
--> 225       shape, dtype, shared_name, name, graph_mode, initial_value)
    226 
    227 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in variable_handle_from_shape_and_dtype(shape, dtype, shared_name, name, graph_mode, extra_handle_data)
    139                                                    shared_name=shared_name,
    140                                                    name=name,
--> 141                                                    container=container)
    142   if extra_handle_data is None:
    143     extra_handle_data = handle

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py in var_handle_op(dtype, shape, container, shared_name, name)
   1416       else:
   1417         message = e.message
-> 1418       _six.raise_from(_core._status_to_exception(e.code, message), None)
   1419   # Add nodes to the TensorFlow graph.
   1420   dtype = _execute.make_type(dtype, "dtype")

/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

NotFoundError: No registered 'VarHandleOp' OpKernel for GPU devices compatible with node {{node VarHandleOp}}
	 (OpKernel was found, but attributes didn't match) Requested Attributes: container="", dtype=DT_INT32, shape=[], shared_name="cd2c89b7-88b7-44c8-ad83-06c2a9158347"
	.  Registered:  device='GPU'; dtype in [DT_VARIANT]
  device='GPU'; dtype in [DT_INT64]
  device='GPU'; dtype in [DT_COMPLEX128]
  device='GPU'; dtype in [DT_COMPLEX64]
  device='GPU'; dtype in [DT_BOOL]
  device='GPU'; dtype in [DT_DOUBLE]
  device='GPU'; dtype in [DT_FLOAT]
  device='GPU'; dtype in [DT_HALF]
  device='CPU'
  device='XLA_CPU'
  device='XLA_GPU'
 [Op:VarHandleOp] name: Variable/
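
In the meantime, a workaround sketch, grounded in the dtype list above: VarHandleOp has no GPU kernel for int32, so either pin the variable to the CPU or pick a dtype the GPU does register (e.g. int64 or float32). This sidesteps the error but is not a fix for the soft-placement issue itself:

import tensorflow as tf

# Workaround A: pin the int32 variable to the CPU, where VarHandleOp
# is registered for all dtypes.
with tf.device("/cpu:0"):
    i = tf.Variable(42)

# Workaround B: use a dtype that does have a GPU kernel (see the
# "Registered:" list above), e.g. int64.
with tf.device("/gpu:0"):
    j = tf.Variable(42, dtype=tf.int64)
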
ageron (Contributor, Author) commented Jun 3, 2019

Also, if I activate soft placement and try to place an operation on a GPU device that does not exist, I still get an exception. I expected TF to fall back to using the CPU:

import tensorflow as tf
tf.config.set_soft_device_placement(True)
with tf.device("/gpu:1"):
    f = tf.Variable(42.5)

Raises this exception:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-3a03536f5e27> in <module>
      2 tf.config.set_soft_device_placement(True)
      3 with tf.device("/gpu:1"):
----> 4         f = tf.Variable(42.5)
      5

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    259       return cls._variable_v1_call(*args, **kwargs)
    260     elif cls is Variable:
--> 261       return cls._variable_v2_call(*args, **kwargs)
    262     else:
    263       return super(VariableMetaclass, cls).__call__(*args, **kwargs)

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in _variable_v2_call(cls, initial_value, trainable, validate_shape, caching_device, name, variable_def, dtype, import_scope, constraint, synchronization, aggregation, shape)
    253         synchronization=synchronization,
    254         aggregation=aggregation,
--> 255         shape=shape)
    256
    257   def __call__(cls, *args, **kwargs):

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in <lambda>(**kws)
    234                         shape=None):
    235     """Call on Variable class. Useful to force the signature."""
--> 236     previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
    237     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    238       previous_getter = _make_getter(getter, previous_getter)

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator_v2(next_creator, **kwargs)
   2542       synchronization=synchronization,
   2543       aggregation=aggregation,
-> 2544       shape=shape)
   2545
   2546

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    261       return cls._variable_v2_call(*args, **kwargs)
    262     else:
--> 263       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    264
    265

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
    462           synchronization=synchronization,
    463           aggregation=aggregation,
--> 464           shape=shape)
    465
    466   def __repr__(self):

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, shape)
    607             initial_value = ops.convert_to_tensor(
    608                 initial_value() if init_from_fn else initial_value,
--> 609                 name="initial_value", dtype=dtype)
    610           # Don't use `shape or initial_value.shape` since TensorShape has
    611           # overridden `__bool__`.

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype, dtype_hint)
   1095   preferred_dtype = deprecation.deprecated_argument_lookup(
   1096       "dtype_hint", dtype_hint, "preferred_dtype", preferred_dtype)
-> 1097   return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
   1098
   1099

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
   1153       name=name,
   1154       preferred_dtype=dtype_hint,
-> 1155       as_ref=False)
   1156
   1157

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
   1232
   1233     if ret is None:
-> 1234       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1235
   1236     if ret is NotImplemented:

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    303                                          as_ref=False):
    304   _ = as_ref
--> 305   return constant(v, dtype=dtype, name=name)
    306
    307

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    244   """
    245   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246                         allow_broadcast=True)
    247
    248

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    252   ctx = context.context()
    253   if ctx.executing_eagerly():
--> 254     t = convert_to_eager_tensor(value, ctx, dtype)
    255     if shape is None:
    256       return t

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    109       return ops.EagerTensor(
    110           value, handle, device, dtype, tensor)
--> 111     t = ops.EagerTensor(value, handle, device, dtype)
    112     scalar_cache[cache_key] = t
    113     return t

RuntimeError: Error copying tensor to device: /job:localhost/replica:0/task:0/device:GPU:1. /job:localhost/replica:0/task:0/device:GPU:1 unknown device.
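
A stopgap sketch for this case: catch the error and retry on the CPU by hand, which is exactly the fallback I would expect soft placement to perform automatically:

import tensorflow as tf

tf.config.set_soft_device_placement(True)
try:
    with tf.device("/gpu:1"):
        f = tf.Variable(42.5)
except RuntimeError:
    # The fallback I expected soft placement to perform for me:
    with tf.device("/cpu:0"):
        f = tf.Variable(42.5)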

lukasfolle (Contributor) commented Jun 3, 2019

Hey, have you had a look at this example from the GPU guide at tensorflow.org?

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

To me it looks like TF chooses which device is suitable to execute the op:

If enabled, an op will be placed on CPU if any of the following are true
1. there's no GPU implementation for the OP
2. no GPU devices are known or registered
3. need to co-locate with reftype input(s) which are from CPU

from TF config

ageron (Contributor, Author) commented Jun 3, 2019

Hi @lukasfolle,

Thanks for your answer. Yes, I saw that doc; it's actually why I filed this bug. I don't see what soft placement would change in this example.

I expect soft placement to change something when the user requests a specific device but the op has no kernel for that device, or the device does not exist. This would be useful if you want to write a program and deploy it on machines that may or may not have GPUs, for example. So on a machine without any GPU, I expect the following code to work without any error, and just fall back to placing the variable on the CPU:

import tensorflow as tf

tf.config.set_soft_device_placement(True)
with tf.device("/gpu:0"):
    x = tf.Variable(1.0)

Perhaps I'm misunderstanding what set_soft_device_placement() is designed for?
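
To make the use case concrete, here is a sketch of the fallback I currently have to hand-roll (pick_device is just an illustrative helper, not a TF API):

import tensorflow as tf

def pick_device(preferred="/gpu:0"):
    # Illustrative helper: fall back to the CPU when no GPU is visible.
    if tf.config.experimental.list_physical_devices("GPU"):
        return preferred
    return "/cpu:0"

with tf.device(pick_device()):
    x = tf.Variable(1.0)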

ageron (Contributor, Author) commented Jun 3, 2019

Okay, I got it: the semantics of soft placement have changed since TF 1.

In TF 1, the following code works fine on a machine without any GPU (I just tried it):

import tensorflow as tf

with tf.device("/gpu:42"): # there is no such device
    i = tf.Variable(123) # plus integers are not allowed on GPUs

# but let's make TF super soft and tolerant:
config = tf.ConfigProto()
config.allow_soft_placement = True
with tf.Session(config=config) as sess:
    sess.run(i.initializer)
    print(sess.run(i))

Prints 123, no problem. :)

That's why I was surprised that it didn't work in TF 2.0. I'm actually not sure when you would ever need to call set_soft_device_placement(False) in TF 2.0; what's the use case?

@ageron ageron closed this as completed Jun 3, 2019
@ageron ageron reopened this Jun 3, 2019
lukasfolle (Contributor) commented:

From what I understand, you either use

  • tf.config.set_soft_device_placement(True) to let TF automatically decide which device to use,

  • or tf.config.set_soft_device_placement(False) and decide yourself for each tensor where to place it, e.g. with tf.device('/CPU:0'): ...

So I guess in your case the first option would be best.

ageron (Contributor, Author) commented Jun 4, 2019

Hi @lukasfolle,

I just ran some tests, and it really seems like tf.config.set_soft_device_placement(False) makes no difference at all. Whether it's True or False, the default behavior is applied.
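
For concreteness, here is the kind of test I mean, as a sketch; both variable creations behave identically whether the flag is True or False:

import tensorflow as tf

for soft in (True, False):
    tf.config.set_soft_device_placement(soft)
    with tf.device("/gpu:0"):
        x = tf.Variable(42.0)   # works either way (float32 has a GPU kernel)
        # tf.Variable(42) raises NotFoundError either way (no int32 kernel)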

I'm quite puzzled.

@achandraa achandraa self-assigned this Jun 4, 2019
@achandraa achandraa added TF 2.0 Issues relating to TensorFlow 2.0 comp:gpu GPU related issues type:bug Bug labels Jun 4, 2019
achandraa commented:

I have tried to reproduce this on Colab with TF 2.0.0-dev20190527, with set_soft_device_placement set to both True and False, and got the same result in both scenarios, as described in the issue.

@achandraa achandraa assigned jvishnuvardhan and unassigned achandraa Jun 4, 2019
lukasfolle (Contributor) commented Jun 4, 2019

Yes @ageron, you are right.
But to be fair, the doc does not explicitly say what happens to tensor placement when tf.config.set_soft_device_placement(False) is set; it only describes the enabled case:

If enabled, an op will be placed on CPU if any of the following are true
1. there's no GPU implementation for the OP
2. no GPU devices are known or registered
3. need to co-locate with reftype input(s) which are from CPU

Maybe one could add something to the doc explaining the behaviour when tf.config.set_soft_device_placement() is set to False?

The behaviour is reproducible here.

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 4, 2019
aaroey (Member) commented Jun 4, 2019

@jaingaurav any comments?

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 5, 2019
jaingaurav (Contributor) commented:

Sorry for the delay. There have been multiple discussions about this internally. It seems that soft placement is respected by tf.function but not by eager tf.device placement. We'd like to resolve this; however, given it is a behavior change, we're unlikely to be able to change the default in 1.0.
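
For illustration, a minimal sketch of the tf.function path just described (tf.strings.as_string is used only because string ops have no GPU kernels, so it exercises the fallback; this illustrates the described behavior rather than a documented guarantee):

import tensorflow as tf

tf.config.set_soft_device_placement(True)

@tf.function
def to_string(x):
    # Inside a tf.function the graph placer honors soft placement:
    # AsString has no GPU kernel, so it falls back to the CPU
    # instead of raising.
    with tf.device("/gpu:0"):
        return tf.strings.as_string(x)

print(to_string(tf.constant(42)))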

ageron (Contributor, Author) commented Jun 13, 2019

Thanks @jaingaurav. I'm just thinking about TF 2.x; I understand that it must keep the same behavior in TF 1.x.

tensorflow-bot commented Jul 3, 2019

Are you satisfied with the resolution of your issue?
