Skip to content

Multiple tasks in the same server may cause OOM in distributed mode with GPUs #3644

@tobegit3hub

Description

@tobegit3hub

Following the official tutorial of distributed TensorFlow, we find that the ps and works will use the first GPU by default which may cause OOM in distributed mode.

If we don't set CUDA_VISIBLE_DEVICES, all the ps tasks may see all the GPUs and use the first one by default. And when I start all the ps and worker processes, OOM occurs and ps uses most of GPU's memory even though no job running.

Is that possible to optimize the algorithm of scheduler for more intelligence, such as using CPU for ps task and place operations in different GPUs instead of the first one?

One of the solution is specifying CUDA_VISIBLE_DEVICES for each task. Or we can use with tf.device() which may be only use for model parallel in distributed mode.

Environment info

Operating System:

Ubuntu 14.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

# ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root   322936 Aug 15  2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root       16 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root       19 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root   383336 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root   720192 Aug 15  2015 /usr/local/cuda/lib64/libcudart_static.a
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4.0.7
-rw-r--r-- 1 root root 62025862 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn_static.a

If installed from binary pip package, provide:

0.9.0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions