GPU remapping using visible_device_list is broken #19083

rundembear · 2018-05-04T11:50:45Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): v1.8.0-0-g93bc2e2072 1.8.0
I have also tried this on 1.6 and 1.7; it crashes on both of them.
1.3 and 1.5 are fine.
Python version: 3.6.3
Bazel version (if compiling from source):
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version: Both 8.0, and 9.1 (with 9.0 libraries)
GPU model and memory: GeForce GTX 1080 Ti. with 11178 MiB
Exact command to reproduce:

import tensorflow as tf

G = tf.Graph()
# First session: expose only CUDA GPU 0 to TensorFlow.
sess1 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='0')))
# Second session: expose only CUDA GPU 1. This is the call that crashes on 1.6, 1.7 and 1.8.
sess2 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='1')))

Running the second tf.Session command crashes with the following error:

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1
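
To make the failure easier to reason about, here is a small pure-Python model of what the error message suggests is happening. It is only a sketch, not the actual TensorFlow implementation: `ToyGpuIdManager`, `GpuIdMappingError` and `register_visible_devices` are hypothetical names, and the assumption is that the runtime keeps one process-wide table from TF GPU id to CUDA GPU id, with each session mapping TF ids 0, 1, ... onto the entries of its visible_device_list in order.

```python
class GpuIdMappingError(RuntimeError):
    """Stands in for the fatal CHECK failure seen in the log (hypothetical)."""


class ToyGpuIdManager:
    """Toy process-wide table from TF GPU id to CUDA GPU id (hypothetical)."""

    def __init__(self):
        self._tf_to_cuda = {}

    def insert(self, tf_gpu_id, cuda_gpu_id):
        # Keep the first mapping; reject a later mapping to a different id.
        existing = self._tf_to_cuda.setdefault(tf_gpu_id, cuda_gpu_id)
        if existing != cuda_gpu_id:
            raise GpuIdMappingError(
                "Mapping the same TfGpuId to a different CUDA GPU id. "
                f"TfGpuId: {tf_gpu_id} "
                f"Existing mapped CUDA GPU id: {existing} "
                f"CUDA GPU id being tried to map to: {cuda_gpu_id}"
            )


def register_visible_devices(manager, visible_device_list):
    """Map TF ids 0, 1, ... onto the CUDA ids named in visible_device_list."""
    cuda_ids = [int(x) for x in visible_device_list.split(",") if x.strip()]
    for tf_gpu_id, cuda_gpu_id in enumerate(cuda_ids):
        manager.insert(tf_gpu_id, cuda_gpu_id)


manager = ToyGpuIdManager()              # one table shared by all sessions
register_visible_devices(manager, "0")   # like sess1: TF id 0 -> CUDA id 0
try:
    register_visible_devices(manager, "1")  # like sess2: TF id 0 -> CUDA id 1
    second_session_error = None
except GpuIdMappingError as err:
    second_session_error = str(err)

print(second_session_error)
```

In this model the second registration fails because TF id 0 is already bound to CUDA id 0, which matches the shape of the message above. Registering the same list twice is accepted, since the existing mapping is unchanged.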

Describe the problem

The GPU remapping using visible_device_list is broken. This works fine in TensorFlow 1.3 and 1.5, but crashes the program in 1.6, 1.7 and 1.8.
As far as I can tell from reading tensorflow/include/tensorflow/core/common_runtime/gpu/gpu_id.h, this mechanism is supposed to still work the same way it used to.

Source code / logs

import tensorflow as tf

G = tf.Graph()
# First session: expose only CUDA GPU 0 to TensorFlow.
sess1 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='0')))
# Second session: expose only CUDA GPU 1. This is the call that crashes on 1.6, 1.7 and 1.8.
sess2 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='1')))

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1
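
One possible way to sidestep the crash, assuming each model can live in its own process, is to pin each process to a single GPU with the CUDA_VISIBLE_DEVICES environment variable instead of using visible_device_list inside one process. This is a sketch of that idea, not something this issue or #18861 prescribes. `env_for_gpu`, `run_worker` and `WORKER_CODE` are hypothetical names, and the worker here only echoes the variable; a real worker would build its graph and session at that point.

```python
import os
import subprocess
import sys


def env_for_gpu(cuda_gpu_id, base_env=None):
    """Return a copy of the environment that exposes a single CUDA GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(cuda_gpu_id)
    return env


# Placeholder worker (hypothetical): it only reports which GPU it was given.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def run_worker(cuda_gpu_id):
    """Run the placeholder worker in a child process pinned to one GPU."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_CODE],
        env=env_for_gpu(cuda_gpu_id),
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    for gpu in (0, 1):
        print("worker pinned to CUDA GPU", run_worker(gpu))
```

Because each child process sees exactly one device, every process has its own id table and the conflicting mapping never arises. The cost is that the two models no longer share a process or a graph.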

asimshankar · 2018-05-11T19:33:23Z

See #18861 (comment)

…icting visible_devices_list. See tensorflow#19083 See tensorflow#18861 More generally, this change avoids assertion failures (that will bring the whole process down) on a few code-paths that can be triggered by user input. PiperOrigin-RevId: 196572013

rundembear · 2018-05-16T15:03:58Z

@aaroey Just in case, I am posting here since the other ticket is closed (it wasn't mine, so I don't think I can reopen it). I just added another comment with a follow-up question to #18861.

tensorflowbutler · 2018-06-02T07:12:15Z

Nagging Assignee @aaroey: It has been 16 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

aaroey · 2018-06-04T14:24:08Z

I believe the problem is solved in #18861, so I'm closing this. Please re-open if there are any other questions.

tensorflowbutler assigned skye May 4, 2018

rundembear mentioned this issue May 5, 2018

Running two Models in two GPUs for prediction in c++ #18861

Closed

asimshankar assigned aaroey and unassigned skye May 11, 2018

aaroey closed this as completed Jun 4, 2018

bioothod mentioned this issue Jun 29, 2018

BindToDevice() binds graph to specified (gpu) device which forces all its operations to be prcessed on that device. #20412

Merged

sunzhe09 mentioned this issue Dec 20, 2018

How to compress our own models using multi-GPUs Tencent/PocketFlow#129

Closed

ttdd11 mentioned this issue Mar 25, 2019

Unable to assign GPU using c_api #27114

Closed

olk mentioned this issue Dec 1, 2019

MirroredStrategy compared to OneDeviceStrategy slower and much weaker learning #33809

Closed

gowthamkpr mentioned this issue Dec 20, 2019

Error when setting up virtual devices on system that has multiple physical GPUs #35083

Closed

cmsppl mentioned this issue Jun 24, 2020

Setting GPU device SciSharp/TensorFlow.NET#557

Open

This was referenced May 31, 2021

cppflow::get_context(), get_global_context() have issue when use specific GPU. serizba/cppflow#126

Closed

Avoid default global context initialization before session options are provided. serizba/cppflow#127

Closed

jailuthra mentioned this issue Aug 6, 2021

CUDA device mapping TensorFlow error when loading DNN model livepeer/go-livepeer#1980

Open

Pekary mentioned this issue Aug 31, 2021

Is there any way to use different gpu in multiple sessions? tensorflow/java#380

Closed

SuryanarayanaY mentioned this issue Dec 29, 2022

TensorFlow device is mapped to multiple devices when using tf.estimator models and setting visible device using TF v1 api with TF2.11 #58952

Closed

luukasnik mentioned this issue Aug 2, 2023

[BUG] restarting from checkpoint doesn't work with multiple GPUs deepmodeling/deepmd-kit#2712

Closed

obriensystems mentioned this issue Oct 1, 2023

tensorflow on OSX Mac M1 pro/max silicon 32 cores and windows 11 13900k with dual RTX-A4500/A4000 workstation and dual GTX-4090 consumer ObrienlabsDev/blog#13

Open
