GPU remapping using visible_device_list is broken #19083

Closed
rundembear opened this issue May 4, 2018 · 4 comments
@rundembear

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): YES

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04

  • TensorFlow installed from (source or binary): binary

  • TensorFlow version (use command below): v1.8.0-0-g93bc2e2072 1.8.0
    I have also tried this on 1.6 and 1.7; it crashes on both.
    1.3 and 1.5 are fine.

  • Python version: 3.6.3

  • Bazel version (if compiling from source):

  • GCC/Compiler version (if compiling from source):

  • CUDA/cuDNN version: Both 8.0, and 9.1 (with 9.0 libraries)

  • GPU model and memory: GeForce GTX 1080 Ti with 11178 MiB

  • Exact command to reproduce:

import tensorflow as tf
G = tf.Graph()
# First session: only physical GPU 0 is visible.
sess1 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='0')))
# Second session in the same process: only physical GPU 1 is visible. This call crashes.
sess2 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='1')))

Running the second tf.Session command crashes with the following error:

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1
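
The message suggests that the failed check guards a table, shared by the whole process, that maps TensorFlow GPU ids to CUDA GPU ids. The sketch below is a simplified Python model of that behaviour and not TensorFlow's actual code; the class `GpuIdTable` and its method names are hypothetical, and the process-wide scope is an inference from the error text. Entry i of `visible_device_list` becomes TF GPU id 0, 1, ... in order, so the two lists '0' and '1' both claim TF GPU id 0.

```python
class GpuIdTable:
    """Toy process-wide map from TF GPU id to CUDA GPU id (hypothetical model)."""

    def __init__(self):
        self._tf_to_cuda = {}

    def insert(self, tf_gpu_id, cuda_gpu_id):
        # setdefault keeps the first mapping and returns it on later calls.
        existing = self._tf_to_cuda.setdefault(tf_gpu_id, cuda_gpu_id)
        if existing != cuda_gpu_id:
            raise RuntimeError(
                "Mapping the same TfGpuId to a different CUDA GPU id. "
                "TfGpuId: {} Existing mapped CUDA GPU id: {} "
                "CUDA GPU id being tried to map to: {}".format(
                    tf_gpu_id, existing, cuda_gpu_id))

    def register_visible_devices(self, visible_device_list):
        # Position i in the comma-separated list becomes TF GPU id i.
        for tf_gpu_id, cuda_id in enumerate(visible_device_list.split(",")):
            self.insert(tf_gpu_id, int(cuda_id))


table = GpuIdTable()
table.register_visible_devices("0")      # first session: TfGpuId 0 -> CUDA 0
try:
    table.register_visible_devices("1")  # second session: TfGpuId 0 -> CUDA 1
except RuntimeError as err:
    conflict = str(err)
    print(conflict)
```

In this model the second registration fails because TF GPU id 0 is already bound to CUDA GPU 0, which matches the ids reported in the crash message.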

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

Describe the problem

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

GPU remapping with visible_device_list is broken. It works in TensorFlow 1.3 and 1.5, but crashes the program in 1.6, 1.7 and 1.8.
As far as I can tell from reading tensorflow/include/tensorflow/core/common_runtime/gpu/gpu_id.h,
this mechanism is supposed to work the same way it used to.

Source code / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

import tensorflow as tf
G = tf.Graph()
# First session: only physical GPU 0 is visible.
sess1 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='0')))
# Second session in the same process: only physical GPU 1 is visible. This call crashes.
sess2 = tf.Session(graph=G, config=tf.ConfigProto(log_device_placement=False, gpu_options=tf.GPUOptions(allow_growth=True, visible_device_list='1')))

F tensorflow/core/common_runtime/gpu/gpu_id_manager.cc:45] Check failed: cuda_gpu_id.value() == result.first->second (1 vs. 0)Mapping the same TfGpuId to a different CUDA GPU id. TfGpuId: 0 Existing mapped CUDA GPU id: 0 CUDA GPU id being tried to map to: 1
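
One possible way to avoid conflicting mappings inside a single process is to give each GPU its own process and select the device through the CUDA_VISIBLE_DEVICES environment variable, set before TensorFlow is imported. The sketch below uses only the Python standard library; `WORKER_CODE` and `run_worker` are hypothetical names, and the child process merely echoes the variable where a real worker would import tensorflow and create its session. Whether one process per GPU fits a given workload is an assumption.

```python
import os
import subprocess
import sys

# Stand-in for a real worker script: it only reports which GPU it was given.
# A real worker would import tensorflow here and create its session.
WORKER_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def run_worker(gpu_id):
    """Run one worker process restricted to a single physical GPU."""
    # Copy the parent environment and override only the GPU selection.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    result = subprocess.run(
        [sys.executable, "-c", WORKER_CODE],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# Workers run one after another here; each sees exactly one GPU id.
seen = [run_worker(gpu_id) for gpu_id in (0, 1)]
print(seen)
```

Each worker then sees a single device, so no two sessions in the same process ask for different physical GPUs.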

@asimshankar
Contributor

See #18861 (comment)

yifeif pushed a commit to yifeif/tensorflow that referenced this issue May 15, 2018
…icting

visible_devices_list.

See tensorflow#19083
See tensorflow#18861

More generally, this change avoids assertion failures (that will bring the
whole process down) on a few code-paths that can be triggered by user input.

PiperOrigin-RevId: 196572013
@rundembear
Author

@aaroey Just in case, I am posting here since the other ticket is closed (it wasn't mine, so I don't think I can reopen it). I just added another comment with a follow-up question to #18861

@tensorflowbutler
Member

Nagging Assignee @aaroey: It has been 16 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@aaroey
Member

aaroey commented Jun 4, 2018

I believe the problem is solved in #18861, so I'm closing this. Please re-open if there are any other questions.

@aaroey aaroey closed this as completed Jun 4, 2018