
tf.config.set_soft_device_placement() seems to have no effect #29342

Closed
ageron opened this issue Jun 3, 2019 · 12 comments
Labels: comp:gpu (GPU related issues) · TF 2.0 (Issues relating to TensorFlow 2.0) · type:bug (Bug)

ageron (Contributor) commented Jun 3, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    MacOSX 10.13.6
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    N/A
  • TensorFlow installed from (source or binary):
    binary
  • TensorFlow version (use command below):
    VERSION='2.0.0-dev20190527'
    GIT_VERSION='v1.12.1-2821-gc5b8e15064'
  • Python version:
    3.5
  • Bazel version (if compiling from source):
    N/A
  • GCC/Compiler version (if compiling from source):
    N/A
  • CUDA/cuDNN version:
    CUDA 10.0 (it's just a Colab GPU instance)
  • GPU model and memory:
    Tesla P4 15079MiB

Describe the current behavior
The tf.config.set_soft_device_placement() function seems to have no effect: when I create an integer variable and try to place it on a GPU, I still get an exception.

Describe the expected behavior
I expect soft placement to fall back to using the CPU, with no error.

Code to reproduce the issue

import tensorflow as tf
tf.config.set_soft_device_placement(True)
with tf.device("/gpu:0"):
    f = tf.Variable(42)

Other info / logs
The code above causes the following exception:

---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-3-1babaf613bc3> in <module>()
      2 tf.config.set_soft_device_placement(True)
      3 with tf.device("/gpu:0"):
----> 4     f = tf.Variable(42)

10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    260       return cls._variable_v1_call(*args, **kwargs)
    261     elif cls is Variable:
--> 262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
    264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in _variable_v2_call(cls, initial_value, trainable, validate_shape, caching_device, name, variable_def, dtype, import_scope, constraint, synchronization, aggregation, shape)
    254         synchronization=synchronization,
    255         aggregation=aggregation,
--> 256         shape=shape)
    257 
    258   def __call__(cls, *args, **kwargs):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in <lambda>(**kws)
    235                         shape=None):
    236     """Call on Variable class. Useful to force the signature."""
--> 237     previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
    238     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    239       previous_getter = _make_getter(getter, previous_getter)

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator_v2(next_creator, **kwargs)
   2549       synchronization=synchronization,
   2550       aggregation=aggregation,
-> 2551       shape=shape)
   2552 
   2553 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    262       return cls._variable_v2_call(*args, **kwargs)
    263     else:
--> 264       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    265 
    266 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
    462           synchronization=synchronization,
    463           aggregation=aggregation,
--> 464           shape=shape)
    465 
    466   def __repr__(self):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, shape)
    616               shared_name=shared_name,
    617               name=name,
--> 618               graph_mode=self._in_graph_mode)
    619         # pylint: disable=protected-access
    620         if (self._in_graph_mode and initial_value is not None and

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in eager_safe_variable_handle(initial_value, shape, shared_name, name, graph_mode)
    223   dtype = initial_value.dtype.base_dtype
    224   return variable_handle_from_shape_and_dtype(
--> 225       shape, dtype, shared_name, name, graph_mode, initial_value)
    226 
    227 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py in variable_handle_from_shape_and_dtype(shape, dtype, shared_name, name, graph_mode, extra_handle_data)
    139                                                    shared_name=shared_name,
    140                                                    name=name,
--> 141                                                    container=container)
    142   if extra_handle_data is None:
    143     extra_handle_data = handle

/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_resource_variable_ops.py in var_handle_op(dtype, shape, container, shared_name, name)
   1416       else:
   1417         message = e.message
-> 1418       _six.raise_from(_core._status_to_exception(e.code, message), None)
   1419   # Add nodes to the TensorFlow graph.
   1420   dtype = _execute.make_type(dtype, "dtype")

/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

NotFoundError: No registered 'VarHandleOp' OpKernel for GPU devices compatible with node {{node VarHandleOp}}
	 (OpKernel was found, but attributes didn't match) Requested Attributes: container="", dtype=DT_INT32, shape=[], shared_name="cd2c89b7-88b7-44c8-ad83-06c2a9158347"
	.  Registered:  device='GPU'; dtype in [DT_VARIANT]
  device='GPU'; dtype in [DT_INT64]
  device='GPU'; dtype in [DT_COMPLEX128]
  device='GPU'; dtype in [DT_COMPLEX64]
  device='GPU'; dtype in [DT_BOOL]
  device='GPU'; dtype in [DT_DOUBLE]
  device='GPU'; dtype in [DT_FLOAT]
  device='GPU'; dtype in [DT_HALF]
  device='CPU'
  device='XLA_CPU'
  device='XLA_GPU'
 [Op:VarHandleOp] name: Variable/
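
In the meantime, a workaround sketch, grounded in the dtype list above: VarHandleOp has no GPU kernel for int32, so either pin the variable to the CPU or pick a dtype the GPU does register (e.g. int64 or float32). This sidesteps the error but is not a fix for the soft-placement issue itself:

import tensorflow as tf

# Workaround A: pin the int32 variable to the CPU, where VarHandleOp
# is registered for all dtypes.
with tf.device("/cpu:0"):
    i = tf.Variable(42)

# Workaround B: use a dtype that does have a GPU kernel (see the
# "Registered:" list above), e.g. int64.
with tf.device("/gpu:0"):
    j = tf.Variable(42, dtype=tf.int64)
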
ageron (Contributor, Author) commented Jun 3, 2019

Also, if I activate soft placement and try to place an operation on a GPU device that does not exist, I still get an exception. I expected TF to fall back to using the CPU:

import tensorflow as tf
tf.config.set_soft_device_placement(True)
with tf.device("/gpu:1"):
    f = tf.Variable(42.5)

Raises this exception:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-3a03536f5e27> in <module>
      2 tf.config.set_soft_device_placement(True)
      3 with tf.device("/gpu:1"):
----> 4         f = tf.Variable(42.5)
      5

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    259       return cls._variable_v1_call(*args, **kwargs)
    260     elif cls is Variable:
--> 261       return cls._variable_v2_call(*args, **kwargs)
    262     else:
    263       return super(VariableMetaclass, cls).__call__(*args, **kwargs)

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in _variable_v2_call(cls, initial_value, trainable, validate_shape, caching_device, name, variable_def, dtype, import_scope, constraint, synchronization, aggregation, shape)
    253         synchronization=synchronization,
    254         aggregation=aggregation,
--> 255         shape=shape)
    256
    257   def __call__(cls, *args, **kwargs):

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in <lambda>(**kws)
    234                         shape=None):
    235     """Call on Variable class. Useful to force the signature."""
--> 236     previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
    237     for _, getter in ops.get_default_graph()._variable_creator_stack:  # pylint: disable=protected-access
    238       previous_getter = _make_getter(getter, previous_getter)

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py in default_variable_creator_v2(next_creator, **kwargs)
   2542       synchronization=synchronization,
   2543       aggregation=aggregation,
-> 2544       shape=shape)
   2545
   2546

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py in __call__(cls, *args, **kwargs)
    261       return cls._variable_v2_call(*args, **kwargs)
    262     else:
--> 263       return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    264
    265

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in __init__(self, initial_value, trainable, collections, validate_shape, caching_device, name, dtype, variable_def, import_scope, constraint, distribute_strategy, synchronization, aggregation, shape)
    462           synchronization=synchronization,
    463           aggregation=aggregation,
--> 464           shape=shape)
    465
    466   def __repr__(self):

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py in _init_from_args(self, initial_value, trainable, collections, caching_device, name, dtype, constraint, synchronization, aggregation, shape)
    607             initial_value = ops.convert_to_tensor(
    608                 initial_value() if init_from_fn else initial_value,
--> 609                 name="initial_value", dtype=dtype)
    610           # Don't use `shape or initial_value.shape` since TensorShape has
    611           # overridden `__bool__`.

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor(value, dtype, name, preferred_dtype, dtype_hint)
   1095   preferred_dtype = deprecation.deprecated_argument_lookup(
   1096       "dtype_hint", dtype_hint, "preferred_dtype", preferred_dtype)
-> 1097   return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
   1098
   1099

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in convert_to_tensor_v2(value, dtype, dtype_hint, name)
   1153       name=name,
   1154       preferred_dtype=dtype_hint,
-> 1155       as_ref=False)
   1156
   1157

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/ops.py in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype, ctx, accept_symbolic_tensors, accept_composite_tensors)
   1232
   1233     if ret is None:
-> 1234       ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
   1235
   1236     if ret is NotImplemented:

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    303                                          as_ref=False):
    304   _ = as_ref
--> 305   return constant(v, dtype=dtype, name=name)
    306
    307

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in constant(value, dtype, shape, name)
    244   """
    245   return _constant_impl(value, dtype, shape, name, verify_shape=False,
--> 246                         allow_broadcast=True)
    247
    248

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in _constant_impl(value, dtype, shape, name, verify_shape, allow_broadcast)
    252   ctx = context.context()
    253   if ctx.executing_eagerly():
--> 254     t = convert_to_eager_tensor(value, ctx, dtype)
    255     if shape is None:
    256       return t

~/miniconda3/envs/tf2/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
    109       return ops.EagerTensor(
    110           value, handle, device, dtype, tensor)
--> 111     t = ops.EagerTensor(value, handle, device, dtype)
    112     scalar_cache[cache_key] = t
    113     return t

RuntimeError: Error copying tensor to device: /job:localhost/replica:0/task:0/device:GPU:1. /job:localhost/replica:0/task:0/device:GPU:1 unknown device.
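
A stopgap sketch for this case: catch the error and retry on the CPU by hand, which is exactly the fallback I would expect soft placement to perform automatically:

import tensorflow as tf

tf.config.set_soft_device_placement(True)
try:
    with tf.device("/gpu:1"):
        f = tf.Variable(42.5)
except RuntimeError:
    # The fallback I expected soft placement to perform for me:
    with tf.device("/cpu:0"):
        f = tf.Variable(42.5)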

lukasfolle (Contributor) commented Jun 3, 2019

Hey, have you had a look at this example from the GPU guide at tensorflow.org?

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)
print(c)

To me it looks like TF chooses which device is suitable to execute the op:

If enabled, an op will be placed on CPU if any of the following are true
1. there's no GPU implementation for the OP
2. no GPU devices are known or registered
3. need to co-locate with reftype input(s) which are from CPU

from TF config

ageron (Contributor, Author) commented Jun 3, 2019

Hi @lukasfolle,

Thanks for your answer. Yes, I saw that doc; it's actually why I filed this bug. I don't see what soft placement would change in this example.

I expect soft placement to change something when the user requests a specific device but the op has no kernel for that device, or the device does not exist. This would be useful if you want to write a program and deploy it on machines that may or may not have GPUs, for example. So on a machine without any GPU, I expect the following code to work without any error, and just fall back to placing the variable on the CPU:

import tensorflow as tf

tf.config.set_soft_device_placement(True)
with tf.device("/gpu:0"):
    x = tf.Variable(1.0)

Perhaps I'm misunderstanding what set_soft_device_placement() is designed for?
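
To make the use case concrete, here is a sketch of the fallback I currently have to hand-roll (pick_device is just an illustrative helper, not a TF API):

import tensorflow as tf

def pick_device(preferred="/gpu:0"):
    # Illustrative helper: fall back to the CPU when no GPU is visible.
    if tf.config.experimental.list_physical_devices("GPU"):
        return preferred
    return "/cpu:0"

with tf.device(pick_device()):
    x = tf.Variable(1.0)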

ageron (Contributor, Author) commented Jun 3, 2019

Okay, I got it: the semantics of soft placement have changed since TF 1.

In TF 1, the following code works fine on a machine without any GPU (I just tried it):

import tensorflow as tf

with tf.device("/gpu:42"): # there is no such device
    i = tf.Variable(123) # plus integers are not allowed on GPUs

# but let's make TF super soft and tolerant:
config = tf.ConfigProto()
config.allow_soft_placement = True
with tf.Session(config=config) as sess:
    sess.run(i.initializer)
    print(sess.run(i))

Prints 123, no problem. :)

That's why I was surprised that it didn't work in TF 2.0. I'm actually not sure when you would ever need to call set_soft_device_placement(False) in TF 2.0; what's the use case?

@ageron ageron closed this as completed Jun 3, 2019
@ageron ageron reopened this Jun 3, 2019
lukasfolle (Contributor) commented:

From what I understand, you either use

  • tf.config.set_soft_device_placement(True) to let TF automatically decide which device to use,

  • or tf.config.set_soft_device_placement(False) and decide yourself for each tensor where to place it, e.g. with tf.device('/CPU:0'): ...

So I guess in your case the first option would be best.

ageron (Contributor, Author) commented Jun 4, 2019

Hi @lukasfolle,

I just ran some tests, and it really seems like tf.config.set_soft_device_placement(False) makes no difference at all. Whether it's True or False, the default behavior is applied.
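
For concreteness, here is the kind of test I mean, as a sketch; both variable creations behave identically whether the flag is True or False:

import tensorflow as tf

for soft in (True, False):
    tf.config.set_soft_device_placement(soft)
    with tf.device("/gpu:0"):
        x = tf.Variable(42.0)   # works either way (float32 has a GPU kernel)
        # tf.Variable(42) raises NotFoundError either way (no int32 kernel)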

I'm quite puzzled.

@achandraa achandraa self-assigned this Jun 4, 2019
@achandraa achandraa added TF 2.0 Issues relating to TensorFlow 2.0 comp:gpu GPU related issues type:bug Bug labels Jun 4, 2019
achandraa commented:

I have tried to reproduce this on Colab with TF 2.0.0-dev20190527, with set_soft_device_placement set to both True and False, and got the same result in both scenarios, as described in the issue.

@achandraa achandraa assigned jvishnuvardhan and unassigned achandraa Jun 4, 2019
lukasfolle (Contributor) commented Jun 4, 2019

Yes @ageron, you are right.
But to be fair, the doc does not explicitly say what happens to tensor placement when tf.config.set_soft_device_placement(False) is set; it only describes the enabled case:

If enabled, an op will be placed on CPU if any of the following are true
1. there's no GPU implementation for the OP
2. no GPU devices are known or registered
3. need to co-locate with reftype input(s) which are from CPU

Maybe one could add something to the doc explaining the behaviour when tf.config.set_soft_device_placement() is set to False?

The behaviour is reproducible here.

@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 4, 2019
aaroey (Member) commented Jun 4, 2019

@jaingaurav any comments?

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 5, 2019
jaingaurav (Contributor) commented:

Sorry for the delay. There have been multiple discussions about this internally. It seems that soft placement is respected by tf.function but not by eager tf.device placement. We'd like to resolve this; however, given it is a behavior change, we're unlikely to be able to change the default in 1.0.
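
For illustration, a minimal sketch of the tf.function path just described (tf.strings.as_string is used only because string ops have no GPU kernels, so it exercises the fallback; this illustrates the described behavior rather than a documented guarantee):

import tensorflow as tf

tf.config.set_soft_device_placement(True)

@tf.function
def to_string(x):
    # Inside a tf.function the graph placer honors soft placement:
    # AsString has no GPU kernel, so it falls back to the CPU
    # instead of raising.
    with tf.device("/gpu:0"):
        return tf.strings.as_string(x)

print(to_string(tf.constant(42)))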

ageron (Contributor, Author) commented Jun 13, 2019

Thanks @jaingaurav. I'm just thinking about TF 2.x; I understand that it must keep the same behavior in TF 1.x.

tensorflow-bot commented Jul 3, 2019

Are you satisfied with the resolution of your issue?
