GPU Device Selector in TensorFlow 2.0 #26460

alsrgv · 2019-03-07T22:04:48Z

Please make sure that this is a feature request. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:feature_template

System information

TensorFlow version (you are using): 2.0
Are you willing to contribute it (Yes/No): Happy to help as much as I can!

Describe the feature and the current behavior/state.
TensorFlow 1.x support specifying GPU devices to use:

# Horovod: pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

There's no comparable API in TensorFlow 2.0. The closest option is to use the CUDA_VISIBLE_DEVICES environment variable. Unfortunately, CUDA_VISIBLE_DEVICES prevents processes from doing cudaMemcpy from/to devices not owned by the process. There's a significant performance degradation when NCCL is used with P2P communication disabled.

The ask is to add an API to TensorFlow 2.0 to enable device selection.

Will this change the current api? How?
Yes, will introduce an API to select GPU devices to use.

Who will benefit with this feature?
Users of Horovod.

Any Other info.
cc @azaks2 @alextp @jaingaurav @guptapriya

The text was updated successfully, but these errors were encountered:

jaingaurav · 2019-03-07T22:29:14Z

Duplicate of #25446

guptapriya · 2019-03-07T22:35:59Z

@jaingaurav is a replacement for setting visible_device_list on your radar? I ask because I think Alex said there are some technical difficulties in implementing it.

jaingaurav · 2019-03-07T23:01:29Z

@guptapriya: It is on my radar, but I need to still sync up with @alextp on the potential issues.

jaingaurav · 2019-04-13T02:55:32Z

A number of new API were added in tf.config namespace to support this use case. Please let me know if there is anything we missed regarding this specific issue.

alsrgv · 2019-04-19T07:28:26Z

Thanks for the update, @jaingaurav!

I have tried new functionality via tensorflow/tensorflow:nightly-gpu-py3 and have a couple of questions.

First, the API requires one to do tf.config.experimental.list_physical_devices('GPU'), filter that list and provide remnants to tf.config.experimental.set_visible_devices(physical_devices[1:], 'GPU').

During the list operation, TensorFlow creates a GPU context on every GPU, including ones that we're not planning to use. You can see how this is wasteful if we will run 8 TensorFlow processes on 8-GPU server, each taking up ~120MB of GPU memory, totaling almost 1GB of wasted GPU memory.

Could you add a way to set visible devices w/o binding GPU contexts?

Second, I noticed that our legacy usage of config.gpu_options.visible_device_list = '0' and tf.enable_eager_execution(config=config) has stopped working. Is this intentional for 1.14?

root@fc725ca05627:/# python
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> config = tf.ConfigProto()
>>> config.gpu_options.visible_device_list = '0'
>>> tf.enable_eager_execution(config=config)
>>> tf.constant(1)
2019-04-19 07:12:06.657522: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcuda.so.1
2019-04-19 07:12:13.742849: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-19 07:12:14.238725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1589] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:04.0
totalMemory: 14.73GiB freeMemory: 14.60GiB
2019-04-19 07:12:14.461207: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-19 07:12:14.463142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1589] Found device 1 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:05.0
totalMemory: 14.73GiB freeMemory: 14.60GiB
2019-04-19 07:12:14.880827: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-19 07:12:14.881876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1589] Found device 2 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:06.0
totalMemory: 14.73GiB freeMemory: 14.60GiB
2019-04-19 07:12:15.076517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-19 07:12:15.077554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1589] Found device 3 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:07.0
totalMemory: 14.73GiB freeMemory: 14.60GiB
2019-04-19 07:12:15.085892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1712] Adding visible gpu devices: 0, 1, 2, 3
2019-04-19 07:12:15.089494: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-19 07:12:15.114457: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x68e92e0 executing computations on platform CUDA. Devices:
2019-04-19 07:12:15.114490: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2019-04-19 07:12:15.114504: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2019-04-19 07:12:15.114511: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2019-04-19 07:12:15.114517: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2019-04-19 07:12:15.117618: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz
2019-04-19 07:12:15.121237: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6a80e90 executing computations on platform Host. Devices:
2019-04-19 07:12:15.121270: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-04-19 07:12:15.121415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1712] Adding visible gpu devices: 0, 1, 2, 3
2019-04-19 07:12:15.121678: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcudart.so.10.0
2019-04-19 07:12:15.127364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-19 07:12:15.127398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126]      0 1 2 3
2019-04-19 07:12:15.127407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1139] 0:   N Y N N
2019-04-19 07:12:15.127414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1139] 1:   Y N N N
2019-04-19 07:12:15.127420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1139] 2:   N N N Y
2019-04-19 07:12:15.127426: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1139] 3:   N N Y N
2019-04-19 07:12:15.128428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1260] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14202 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5)
2019-04-19 07:12:15.128847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1260] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 14202 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5)
2019-04-19 07:12:15.129209: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1260] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 14202 MB memory) -> physical GPU (device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5)
2019-04-19 07:12:15.129622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1260] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 14202 MB memory) -> physical GPU (device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5)
<tf.Tensor: id=0, shape=(), dtype=int32, numpy=1>
>>> tf.config.experimental.get_visible_devices()
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU')]
>>>

jaingaurav · 2019-04-19T08:10:03Z

@alsrgv: Are you sure that you the listing operation is causing the GPU memory to be used? The listing API was supposed to be a lightweight operation that would not involve any memory allocation. Did you experience this with the 1.0 or 2.0 nightly?

Regarding the bug you mentioned with tf.enable_eager_execution(config=config). This is a bug in the implementation. I'll look into it tomorrow and get a fix into the 1.14 branch.

alsrgv · 2019-04-19T08:29:20Z

@jaingaurav, thanks for the quick response. I did experience it in tensorflow/tensorflow:nightly-gpu-py3, which reports itself as 1.14.1-dev20190417. Do you expect 2.0-nightly behavior to be different? Is there a docker image with 2.0-gpu-nightly?

The way I verify GPU memory usage is via nvidia-smi. After running tf.config.experimental.list_physical_devices('GPU') in one of the terminals, I see memory usage on all GPUs in another:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6287      C   python                                       119MiB |
|    1      6287      C   python                                       119MiB |
|    2      6287      C   python                                       119MiB |
|    3      6287      C   python                                       119MiB |
+-----------------------------------------------------------------------------+

jaingaurav · 2019-04-19T20:39:48Z

Thanks for the details. I have a fix for the regression that I am getting through code review.

Looking into the GPU memory allocation issue now. I was able to reproduce it locally.

jaingaurav · 2019-04-22T17:27:36Z

@alsrgv: The fix for tf.enable_eager_execution(config=config) has been merged. Could you let me know if it works for you, and I'll have it cherry-picked into 1.14.

I am working on the GPU memory allocation issue. However, the fix requires quite a bit of code re-structuring. It seems we end up allocating memory as a function of querying CUDA capabilities.

alsrgv · 2019-04-23T04:09:54Z

@jaingaurav, thanks for the fix! I see the following error: ValueError: Invalid visible device index: 0 in CPU environments. Historically, using visible_device_list = '0' on CPU machine was a no-op, but now it's crashing. Is it possible to avoid the crash in this scenario?

Looking forward to the memory usage fix. Memory usage could be caused by the creation of CUDA context. If that's the case, driver API should allow querying device capabilities w/o CUDA context (and memory usage).

jaingaurav · 2019-04-24T01:46:17Z

@alsrgv: Thanks for the previous behavior. I'll ensure that I maintain compatibility. I've got the memory issue almost fixed. One last change needed.

In terms of releases, I will ensure the regression is cherry-picked into 1.14. However, for the memory issue, can it wait till 1.15 & 2.0 or would you like it for 1.14 as well? Just depends on what you'd like to support.

alsrgv · 2019-04-24T02:16:48Z

@jaingaurav, looking forward to the fixes!

It would be great if memory issue fix can be picked in 1.14 as well. 1.13 did have memory issue with XLA (it was binding memory on all devices as well), and it was causing out of memory issues with cuDNN. So there are no release w/o memory issues since 1.12.

jaingaurav · 2019-04-24T02:51:52Z

@alsrgv: The XLA issues has been fixed with a427c13 correct? If so, it'll be in the 1.14 branch.

alsrgv · 2019-04-24T05:22:57Z

@jaingaurav, yes, that's it - hence the request to pick the fix for this memory issue into 1.14 branch as well.

jaingaurav · 2019-04-26T04:19:52Z

@alsrgv: All known issues should be fixed in tonight's nightly. Once you confirm the behavior, I will speak to the release team about trying to get the memory fixes into 1.14. The changes weren't too bad, but they do incur some risk to get into the release.

alsrgv · 2019-04-26T06:27:47Z

@jaingaurav, thanks for the fixes!

I just tried https://files.pythonhosted.org/packages/e1/c6/6cde177c97e975d3c0aa36a7df87b353e8cfa26660735f6668d314106d81/tf_nightly_gpu-1.14.1.dev20190426-cp37-cp37m-manylinux1_x86_64.whl.

The memory leak with tf.config.experimental.list_physical_devices('GPU') is fixed 👍

I'm still getting an error with visible_device_list in absence of GPUs:

(env) root@153afad1fa58:/# CUDA_VISIBLE_DEVICES='' python
Python 3.7.3 (default, Mar 26 2019, 00:55:50)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> config = tf.ConfigProto()
>>> config.gpu_options.visible_device_list = '0'
>>> tf.enable_eager_execution(config=config)
>>> tf.constant(1)
2019-04-26 06:20:39.548228: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libcuda.so.1
2019-04-26 06:20:44.481123: E tensorflow/stream_executor/cuda/cuda_driver.cc:320] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-04-26 06:20:44.481316: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:166] retrieving CUDA diagnostic information for host: 153afad1fa58
2019-04-26 06:20:44.481341: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:173] hostname: 153afad1fa58
2019-04-26 06:20:44.481492: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:197] libcuda reported version is: 410.72.0
2019-04-26 06:20:44.481544: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:201] kernel reported version is: 410.72.0
2019-04-26 06:20:44.481555: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version seems to match DSO: 410.72.0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/env/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 180, in constant_v1
    allow_broadcast=False)
  File "/env/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 254, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/env/lib/python3.7/site-packages/tensorflow/python/framework/constant_op.py", line 98, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/env/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 405, in ensure_initialized
    config_str = self.config.SerializeToString()
  File "/env/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 672, in config
    self._initialize_physical_devices()
  File "/env/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 917, in _initialize_physical_devices
    self._import_config()
  File "/env/lib/python3.7/site-packages/tensorflow/python/eager/context.py", line 971, in _import_config
    raise ValueError("Invalid visible device index: %s" % index)
ValueError: Invalid visible device index: 0
>>>

I tried this on CPU build with the same outcome.

jaingaurav · 2019-04-26T06:32:31Z

Thanks @alsrgv. From the looks of it that nightly build might not have that latest change yet. We can re-verify in the next build.

alsrgv · 2019-04-26T07:53:34Z

@jaingaurav, I just built the master from source and can confirm it works, thanks!

ppwwyyxx · 2019-04-26T20:38:00Z

@jaingaurav thanks a lot for this improvement! I can verify that it also fixes another old issue at #8136 (comment).
This code:

import tensorflow as tf
print(tf.config.experimental.list_physical_devices('GPU'))
cfg = tf.ConfigProto()
cfg.gpu_options.visible_device_list = '1'
sess = tf.Session(config=cfg)   # do not fail

do not fail when running on a 2-gpu machine. But it would fail if using tf.test.is_gpu_available() instead of list_physical_devices.

Any chance we can have this great improvement into tf.test.is_gpu_available()?

jaingaurav · 2019-04-26T20:47:55Z

@ppwwyyxx: This is exactly why the new API was created. Unfortunately any changes to tf.test.is_gpu_available() would possibly break backwards compatibility. Hence, I'd prefer if you used the new APIs. Also, unless you are writing test cases, I'd probably avoid using that API.

alsrgv · 2019-04-29T19:03:19Z

@jaingaurav, any news whether these fixes can be picked into r1.14? We'd really like a release since 1.12.x that has correct GPU memory binding behavior.

cc @martinwicke

alextp · 2019-04-29T19:21:51Z

Cherry-picking these bug fixes makes sense I think if the release is still not too advanced.

…

On Mon, Apr 29, 2019 at 12:11 PM Alex Sergeev ***@***.***> wrote: @jaingaurav <https://github.com/jaingaurav>, any news whether these fixes can be picked into r1.14? We'd really like a release since 1.12.x that has correct GPU memory binding behavior. cc @martinwicke <https://github.com/martinwicke> — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#26460 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAABHRNOWJIJREFDXPMTX53PS5B43ANCNFSM4G4Q2REA> .

-- - Alex

jaingaurav · 2019-04-29T20:30:49Z

@alsrgv: We had a chat about it this morning. Given the status of the 1.14, we're going to aim to cherry-pick the memory fixes. We'd greatly appreciate any testing that you can do to help ensure that we don't incur any regressions and that we have everything you need in 1.14. Thank you for all that you have done so far!

alsrgv · 2019-04-29T21:28:32Z

@jaingaurav, perfect, thanks! I will test 1.14 RCs as they come out.

llan-ml · 2019-06-07T10:02:22Z

@jaingaurav Hi, I try to use device selector with tf2, but there are still some problems:

In [1]: import tensorflow as tf

In [2]: tf.__version__
Out[2]: '2.0.0-dev20190606'

In [3]: gpus = tf.config.experimental.list_physical_devices("GPU")

In [4]: gpus
Out[4]:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

In [5]: tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

In [6]: tf.config.experimental.set_memory_growth(gpus[0], True)

In [7]: tf.constant(1)
...
ntext.py in _compute_gpu_options(self)
    851       memory_growths = set(self._memory_growth_map.values())
    852       if len(memory_growths) > 1:
--> 853         raise ValueError("Memory growth cannot differ between GPU devices")
    854       allow_growth = memory_growths.pop()
    855     else:

ValueError: Memory growth cannot differ between GPU devices

jaingaurav · 2019-06-11T20:33:17Z

@llan-ml: Please see the updated guide at https://www.tensorflow.org/beta/guide/using_gpu#limiting_gpu_memory_growth. Currently we require the memory growth option to be uniform across all GPUs. This may change in the future if someone were to implement the changes.

llan-ml · 2019-06-12T02:38:05Z

@jaingaurav What I mean is that after I have selected a GPU by calling tf.config.experimental.set_visible_devices(gpus[0], 'GPU'), it still raises ValueError: Memory growth cannot differ between GPU devices.

For now, I still have to select a specific GPU by setting CUDA_VISIBLE_DEVICES in a multi-gpu machine.

This was discovered in Issue #26460. PiperOrigin-RevId: 253082055

jaingaurav · 2019-06-13T19:35:55Z

@llan-ml: Thanks, that is indeed a valid bug. The fix has been pushed now and I'll ensure it makes it into the upcoming 1.14 release.

pidajay · 2019-07-14T02:19:35Z

@jaingaurav I am seeing a lot of memory issues when using depthwise conv2d native. Seems like it does not respect the session config visible devices list and grabs all the GPUs. You mentioned this above - "I am working on the GPU memory allocation issue. However, the fix requires quite a bit of code re-structuring. It seems we end up allocating memory as a function of querying CUDA capabilities." I am wondering if these fixes made their way to TF 1.14. Can you point me to any PRs with these fixes? Thanks!

jaingaurav · 2019-07-14T02:25:18Z

@pidajay: How are you querying the GPUs, could you share a code snippet? Note this issue was primarily focused on querying GPUs with eager execution. If you are using sessions and experiencing issues there might be something else going on.

Here is a tutorial on how to use the new APIs:
https://www.tensorflow.org/beta/guide/using_gpu#limiting_gpu_memory_growth

pidajay · 2019-07-14T02:51:32Z

@jaingaurav Appreciate the response. I am using an estimator (TPU estimator but running on GPU). Single GPU is fine but problem shows up when distributing across multiple GPUs (I use horovod. But horovod does not seem to be the issue here). I create the visible device list as follows at the beginning of the program

sess_config=tf.ConfigProto()
sess_config.gpu_options.allow_growth = True
sess_config.gpu_options.visible_device_list = str(hvd.local_rank())

But I notice that the estimator violates the visible device list above and allocates all the GPUs. And this happens only when using depthwise conv 2D.
I guess this thread is related to eager mode and probably not the right thread. If you can't think of anything on top of your head, I will try to create a new issue with a tiny example reproducing the problem. Thanks!

llan-ml · 2019-07-27T19:17:40Z

Hi @jaingaurav It seems that disabling all GPUs does not work properly. I ran the following code

In [1]: import tensorflow as tf

In [2]: tf.__version__
Out[2]: '2.0.0-dev20190725'

In [3]: physical_devices = tf.config.experimental.list_physical_devices('GPU')
2019-07-28 03:12:44.161668: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-07-28 03:12:44.243318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:3b:00.0
2019-07-28 03:12:44.244091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties:
name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
pciBusID: 0000:af:00.0
2019-07-28 03:12:44.244418: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2019-07-28 03:12:44.246004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-07-28 03:12:44.247384: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2019-07-28 03:12:44.247752: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2019-07-28 03:12:44.249599: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2019-07-28 03:12:44.251015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2019-07-28 03:12:44.255340: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-07-28 03:12:44.258154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1

In [4]: tf.config.experimental.set_visible_devices([], 'GPU')

In [5]: visible_devices = tf.config.experimental.get_visible_devices()

In [6]: visible_devices
Out[6]: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

In [7]: x = tf.Variable(1.0)
2019-07-28 03:13:21.109555: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2019-07-28 03:13:22.043938: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e7a5774b90 executing computations on platform CUDA. Devices:
2019-07-28 03:13:22.044006: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-07-28 03:13:22.044029: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla V100-PCIE-16GB, Compute Capability 7.0
2019-07-28 03:13:22.067710: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-07-28 03:13:22.074734: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55e7a994ac40 executing computations on platform Host. Devices:
2019-07-28 03:13:22.074809: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-28 03:13:22.074977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-28 03:13:22.075003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]

In [8]: x.device
Out[8]: '/job:localhost/replica:0/task:0/device:CPU:0'

In [9]: !nvidia-smi
Sun Jul 28 03:13:56 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   30C    P0    41W / 250W |    418MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   35C    P0    42W / 250W |    418MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     52456      C   ...2.0/envs/tf-nightly-2.0-0725/bin/python   407MiB |
|    1     52456      C   ...2.0/envs/tf-nightly-2.0-0725/bin/python   407MiB |
+-----------------------------------------------------------------------------+

Although the variable x is placed on CPU, the process still occupies some gpu memory. The expected behavior is that running nvidia-smi does not show any process info.

lucasjinreal · 2019-10-12T05:42:54Z

What if users want specific 2 GPUs outof 4:

tf.config.experimental.set_visible_devices(gpus[0], 'GPU')

How to enable 2 of them.

ppwwyyxx · 2019-10-12T06:01:10Z

As the documentation said, you can give it a list of devices.

jaingaurav self-assigned this Mar 7, 2019

jaingaurav marked this as a duplicate of #25446 Mar 7, 2019

jvishnuvardhan added comp:gpu GPU related issues type:feature Feature requests labels Mar 7, 2019

alsrgv mentioned this issue Mar 12, 2019

Will Horovod work with TensorFlow 2.0 Alpha? horovod/horovod#907

Closed

tensorflow-copybara closed this as completed in 6e559b9 Apr 13, 2019

tensorflow-copybara pushed a commit that referenced this issue Jun 13, 2019

Only check memory growth for visible GPUs

2b23067

This was discovered in Issue #26460. PiperOrigin-RevId: 253082055

jaingaurav mentioned this issue Jun 13, 2019

Only check memory growth for visible GPUs #29759

Merged

llan-ml mentioned this issue Oct 9, 2019

[2.0] Disable usage of GPU using the new config APIs #33168

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU Device Selector in TensorFlow 2.0 #26460

GPU Device Selector in TensorFlow 2.0 #26460

alsrgv commented Mar 7, 2019

jaingaurav commented Mar 7, 2019

guptapriya commented Mar 7, 2019

jaingaurav commented Mar 7, 2019

jaingaurav commented Apr 13, 2019

alsrgv commented Apr 19, 2019

jaingaurav commented Apr 19, 2019

alsrgv commented Apr 19, 2019

jaingaurav commented Apr 19, 2019

jaingaurav commented Apr 22, 2019

alsrgv commented Apr 23, 2019

jaingaurav commented Apr 24, 2019

alsrgv commented Apr 24, 2019

jaingaurav commented Apr 24, 2019

alsrgv commented Apr 24, 2019

jaingaurav commented Apr 26, 2019

alsrgv commented Apr 26, 2019 •

edited

jaingaurav commented Apr 26, 2019

alsrgv commented Apr 26, 2019

ppwwyyxx commented Apr 26, 2019

jaingaurav commented Apr 26, 2019

alsrgv commented Apr 29, 2019

alextp commented Apr 29, 2019 via email

jaingaurav commented Apr 29, 2019

alsrgv commented Apr 29, 2019

llan-ml commented Jun 7, 2019

jaingaurav commented Jun 11, 2019

llan-ml commented Jun 12, 2019

jaingaurav commented Jun 13, 2019

pidajay commented Jul 14, 2019

jaingaurav commented Jul 14, 2019

pidajay commented Jul 14, 2019

llan-ml commented Jul 27, 2019 •

edited

lucasjinreal commented Oct 12, 2019

ppwwyyxx commented Oct 12, 2019

GPU Device Selector in TensorFlow 2.0 #26460

GPU Device Selector in TensorFlow 2.0 #26460

Comments

alsrgv commented Mar 7, 2019

jaingaurav commented Mar 7, 2019

guptapriya commented Mar 7, 2019

jaingaurav commented Mar 7, 2019

jaingaurav commented Apr 13, 2019

alsrgv commented Apr 19, 2019

jaingaurav commented Apr 19, 2019

alsrgv commented Apr 19, 2019

jaingaurav commented Apr 19, 2019

jaingaurav commented Apr 22, 2019

alsrgv commented Apr 23, 2019

jaingaurav commented Apr 24, 2019

alsrgv commented Apr 24, 2019

jaingaurav commented Apr 24, 2019

alsrgv commented Apr 24, 2019

jaingaurav commented Apr 26, 2019

alsrgv commented Apr 26, 2019 • edited

jaingaurav commented Apr 26, 2019

alsrgv commented Apr 26, 2019

ppwwyyxx commented Apr 26, 2019

jaingaurav commented Apr 26, 2019

alsrgv commented Apr 29, 2019

alextp commented Apr 29, 2019 via email

jaingaurav commented Apr 29, 2019

alsrgv commented Apr 29, 2019

llan-ml commented Jun 7, 2019

jaingaurav commented Jun 11, 2019

llan-ml commented Jun 12, 2019

jaingaurav commented Jun 13, 2019

pidajay commented Jul 14, 2019

jaingaurav commented Jul 14, 2019

pidajay commented Jul 14, 2019

llan-ml commented Jul 27, 2019 • edited

lucasjinreal commented Oct 12, 2019

ppwwyyxx commented Oct 12, 2019

alsrgv commented Apr 26, 2019 •

edited

llan-ml commented Jul 27, 2019 •

edited