-
Notifications
You must be signed in to change notification settings - Fork 75.2k
Description
Following the official tutorial of distributed TensorFlow, we find that the ps and works will use the first GPU by default which may cause OOM in distributed mode.
If we don't set CUDA_VISIBLE_DEVICES, all the ps tasks may see all the GPUs and use the first one by default. And when I start all the ps and worker processes, OOM occurs and ps uses most of GPU's memory even though no job running.
Is that possible to optimize the algorithm of scheduler for more intelligence, such as using CPU for ps task and place operations in different GPUs instead of the first one?
One of the solution is specifying CUDA_VISIBLE_DEVICES for each task. Or we can use with tf.device() which may be only use for model parallel in distributed mode.
Environment info
Operating System:
Ubuntu 14.04
Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):
# ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root 322936 Aug 15 2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root 16 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root 19 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root 383336 Aug 15 2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root 720192 Aug 15 2015 /usr/local/cuda/lib64/libcudart_static.a
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4.0.7
-rw-r--r-- 1 root root 62025862 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn_static.a
If installed from binary pip package, provide:
0.9.0