Multiple tasks in the same server may cause OOM in distributed mode with GPUs #3644

Closed
tobegit3hub opened this Issue Aug 4, 2016 · 2 comments

@tobegit3hub
Contributor

tobegit3hub commented Aug 4, 2016

Following the official tutorial on distributed TensorFlow, we find that the ps and worker tasks all use the first GPU by default, which may cause OOM in distributed mode.

If we don't set CUDA_VISIBLE_DEVICES, all the ps tasks see all the GPUs and use the first one by default. When I start all the ps and worker processes, OOM occurs and the ps tasks take most of the GPU's memory even though no job is running.

Would it be possible to make the scheduler smarter, for example by running ps tasks on the CPU and placing operations on different GPUs instead of always the first one?

One solution is to specify CUDA_VISIBLE_DEVICES for each task. Alternatively, we could use with tf.device(), though that may only be useful for model parallelism in distributed mode.
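
For illustration, a minimal launcher sketch for the first workaround, assuming a single machine with two GPUs, one ps task, and two worker tasks; trainer.py and its flags are hypothetical placeholders for your own training script:

import os
import subprocess

# Hypothetical layout: 1 ps task and 2 worker tasks on one 2-GPU server.
tasks = [
    ("ps",     0, ""),   # the ps task sees no GPUs, so it allocates nothing on them
    ("worker", 0, "0"),  # worker 0 sees only GPU 0
    ("worker", 1, "1"),  # worker 1 sees only GPU 1
]

for job_name, task_index, gpus in tasks:
    # Each child process gets its own CUDA_VISIBLE_DEVICES before TensorFlow starts.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    subprocess.Popen(
        ["python", "trainer.py",
         "--job_name", job_name,
         "--task_index", str(task_index)],
        env=env)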

Environment info

Operating System:

Ubuntu 14.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

# ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root   322936 Aug 15  2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root       16 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root       19 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root   383336 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root   720192 Aug 15  2015 /usr/local/cuda/lib64/libcudart_static.a
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4.0.7
-rw-r--r-- 1 root root 62025862 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn_static.a

If installed from binary pip package, provide:

0.9.0
@poxvoculi

Member

poxvoculi commented Aug 4, 2016

Your observation is correct. It is a deliberate design choice that a TF process will, unless instructed otherwise, use visible GPUs in order, and attempt to use all of the memory on each device. This approach can work well when processes sharing a server are in a virtual environment that makes visible only the GPUs that the process is permitted to use. As you note, setting a different value for CUDA_VISIBLE_DEVICES in each process is a similar solution.
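
(As a complement, a rough sketch of reining in the up-front allocation from inside a single task via GPUOptions; the fraction, addresses, and task index here are placeholders, and option availability can vary with the TF version:)

import tensorflow as tf

# Sketch only: cap how much GPU memory this one process pre-allocates
# instead of letting it claim everything on each visible device.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
sess = tf.Session(server.target, config=config)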

Unfortunately, there's no single best solution for distribution, and TF does not attempt to provide one. Packing multiple ps shards and worker shards onto a single server is a convenient way of distributing, particularly if one has a large virtualized server farm, but if you really care about maximizing performance it may be necessary to customize your model to a particular server architecture, and run one process per server, where each process is aware of all the GPUs on that server and potentially takes advantage of local communication between them. This kind of approach would involve the with tf.device() construct you noted.
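
(A rough sketch of that one-process-per-server style, assuming two local GPUs; the shapes and ops are illustrative only, and tf.concat below uses the 0.x argument order:)

import tensorflow as tf

# One process that sees both GPUs and splits the work between them explicitly.
with tf.device("/cpu:0"):
    weights = tf.Variable(tf.zeros([784, 10]))   # shared parameters on the CPU

tower_logits = []
for gpu_id in range(2):                          # assumes two local GPUs
    with tf.device("/gpu:%d" % gpu_id):
        images = tf.placeholder(tf.float32, [None, 784])
        tower_logits.append(tf.matmul(images, weights))

with tf.device("/cpu:0"):
    merged = tf.concat(0, tower_logits)          # gather the towers locally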

It's probably feasible to do a somewhat more effective job of scheduling Ops onto devices than TF does at the moment, but doing so is actually a pretty hard problem. Although we've looked into this, we've continued to find that a little bit of attention from the programmer when setting up the model and execution plan is usually sufficient for a good solution.

@poxvoculi poxvoculi closed this Aug 4, 2016

@tobegit3hub

Contributor

tobegit3hub commented Aug 5, 2016

Thanks for the confirmation and the detailed explanation 😃

TensorFlow is flexible enough, with CUDA_VISIBLE_DEVICES and with tf.device(), to achieve any architecture. We are still looking forward to smarter scheduling, though. Maybe adding a default rule that runs ps tasks on the CPU and spreads worker operations across all GPUs could help, especially for beginners.
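
(For instance, the "ps on CPU" half of that rule can already be approximated with tf.train.replica_device_setter; a rough sketch, with placeholder host names and task index:)

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
task_index = 0  # normally parsed from a command-line flag

# Variables go to the ps job, pinned to its CPU; other ops go to this worker's GPU.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        ps_device="/job:ps/cpu:0",
        worker_device="/job:worker/task:%d/gpu:0" % task_index)):
    weights = tf.Variable(tf.zeros([784, 10]))    # placed on the ps CPU
    biases = tf.Variable(tf.zeros([10]))          # placed on the ps CPU
    images = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(images, weights) + biases  # placed on the worker GPU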
