Multiple GPU Memory Being Allocated for single device script #5066

Closed
acrosson opened this Issue Oct 19, 2016 · 10 comments

@acrosson

acrosson commented Oct 19, 2016

I am unable to run a TF script on a single GPU. TensorFlow fully allocates the memory of both of my GTX 1080s when the model is initialized, but only one of the GPUs is used for computation (based on what I'm seeing in nvidia-smi).

Because the memory of both GPUs is fully occupied, I cannot run two models at once.

[Screenshot: nvidia-smi output showing memory allocated on both GPUs, 2016-10-18]

What related GitHub issues or StackOverflow threads have you found by searching the web for your problem?

http://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
https://groups.google.com/a/tensorflow.org/forum/#!topic/discuss/jw4FtKOivZE

Environment info

Operating System:
Ubuntu 16.04

Installed version of CUDA and cuDNN:
Cuda Toolkit 8.0, cuDNN 5.1.5

I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcudnn.so.5.1.5 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:116] successfully opened CUDA library libcurand.so.8.0 locally
0.11.0rc0
Build label: 0.3.2
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Oct 7 17:25:10 2016 (1475861110)
Build timestamp: 1475861110
Build timestamp as int: 1475861110

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

I'm using the example from here: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py

What other attempted solutions have you tried?

CUDA_VISIBLE_DEVICES

and

config = tf.ConfigProto(device_count={'GPU': 1})
sess = tf.Session(config=config)

and

with tf.device('/gpu:0'):
...

Logs or other output that would be helpful

(If logs are large, please upload as attachment or provide link).

@asimshankar

Member

asimshankar commented Oct 19, 2016

Thanks for the report, @alexandercrosson. Just to make sure I understood correctly: you're saying that even if you set CUDA_VISIBLE_DEVICES=0 or set device_count in the ConfigProto, your program still uses memory on both GPUs?

Do you have a pointer to the code that you used? (The one you linked to above: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py - seems to be without any ConfigProto or tf.device).

@acrosson

acrosson commented Oct 19, 2016

@asimshankar. When I use device_count = {'GPU': 1} it only allows me to use the first GPU, but this isn't helpful because I can never use the second GPU.

For instance, if I use device_count = {'GPU': 1} and set tf.device('/gpu:1'), I'll get the error message:

tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device to node 'Adam/epsilon': Could not satisfy explicit device specification '/device:GPU:1' because no devices matching that specification are registered in this process; available devices: /job:localhost/replica:0/task:0/cpu:0, /job:localhost/replica:0/task:0/gpu:0
         [[Node: Adam/epsilon = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [] values: 1e-08>, _device="/device:GPU:1"]()]]

Here's a gist : https://gist.github.com/alexandercrosson/8dc970578e1d1d4e7b00b1dba63a45b4

@asimshankar

Member

asimshankar commented Oct 19, 2016

What about CUDA_VISIBLE_DEVICES? That should control which process sees what GPUs. So you'd run one process with CUDA_VISIBLE_DEVICES=0 and the other with CUDA_VISIBLE_DEVICES=1.

device_count is meant to limit the number of devices used, so it won't allocate memory on both devices. When device_count is 1, the device name will still be just /gpu:0. You might also find https://www.tensorflow.org/versions/r0.11/how_tos/using_gpu/index.html#allowing-gpu-memory-growth interesting; it talks about limiting memory use.

But long story short, does CUDA_VISIBLE_DEVICES not do the trick?
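For readers who want to keep several GPUs visible but stop TensorFlow from pre-allocating all of their memory, the page linked above describes two ConfigProto options. A minimal sketch, assuming the TF 0.x/1.x session API used throughout this thread:

```python
import tensorflow as tf

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of grabbing it all up front:
config.gpu_options.allow_growth = True
# ...or cap each process at a fixed fraction of each GPU's memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)
```

Note that this only limits how much memory is allocated; the process still claims every visible GPU, so for isolating one process per GPU, CUDA_VISIBLE_DEVICES remains the right tool.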

@acrosson

acrosson commented Oct 19, 2016

@asimshankar CUDA_VISIBLE_DEVICES works! I tested it by setting CUDA_VISIBLE_DEVICES="0" in one tmux instance and CUDA_VISIBLE_DEVICES="1" in another. Both instances are able to train.

Is memory being allocated on both GPUs by default the intended behavior of TF?

@asimshankar

Member

asimshankar commented Oct 19, 2016

Great to hear it worked.
Yes, taking over all devices that are accessible to the process is intended. Some more detail in this comment: #3644 (comment)

Since your issue has been resolved, I'm going to close this. If you run into trouble, feel free to create a new issue. Thanks

@anewlearner

anewlearner commented Oct 22, 2016

Hi, @alexandercrosson.
Thanks for your reply. You mentioned that export CUDA_VISIBLE_DEVICES="0" can be used, but I still don't know how to use it. I am using Keras with the TensorFlow backend and want to run different unrelated scripts at the same time.
Where should I add export CUDA_VISIBLE_DEVICES="0"?

I also tried with tf.device('/gpu:0'). For example,

with tf.device('/gpu:0'):
    model.fit(trainX, trainY)

But it didn't change anything.

@asimshankar

Member

asimshankar commented Oct 22, 2016

@anewlearner: CUDA_VISIBLE_DEVICES is an environment variable that you would set in your shell before invoking the Python program.
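Since the question here is about Keras scripts, the variable can also be set from inside each script rather than in the shell, as long as it is set before TensorFlow is imported, because device discovery happens when TensorFlow initializes. A minimal sketch (the commented import is just a placeholder for whatever TF/Keras code follows):

```python
import os

# Restrict GPU visibility for this process. This must happen before
# TensorFlow (or Keras with the TF backend) is imported; setting it
# after TF has initialized its devices has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # this process sees only GPU 0

# import tensorflow as tf  # import TF/Keras only after the variable is set
```

A second script would set the variable to "1", and the two processes would then train on separate GPUs without touching each other's memory.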

@yaroslavvb

Contributor

yaroslavvb commented Oct 25, 2016

You run export CUDA_VISIBLE_DEVICES="0" in the environment before running your TensorFlow script.


@zylix666

zylix666 commented Mar 5, 2018

Is this because of using SLI bridge?

@mebrar

mebrar commented Aug 30, 2018

@zylix666 I think it is definitely something about the SLI bridge. I have two GTX 1080 Tis in SLI mode, and even though I select only one device with the CUDA_VISIBLE_DEVICES environment variable, it still uses the memory of both GPUs while executing on a single GPU. Do you have any suggestions for running in SLI mode, or should I just disable SLI completely?
