Multiple tasks in the same server may cause OOM in distributed mode with GPUs #3644

Closed
tobegit3hub opened this Issue Aug 4, 2016 · 2 comments

@tobegit3hub
Contributor

tobegit3hub commented Aug 4, 2016

Following the official tutorial on distributed TensorFlow, we find that the ps and worker tasks all use the first GPU by default, which may cause OOM in distributed mode.

If we don't set CUDA_VISIBLE_DEVICES, all the ps tasks see all the GPUs and use the first one by default. When I start all the ps and worker processes, OOM occurs and the ps tasks take most of the GPU's memory even though no job is running.

Would it be possible to make the scheduler smarter, for example by running ps tasks on the CPU and placing operations on different GPUs instead of always the first one?

One solution is to specify CUDA_VISIBLE_DEVICES for each task. Alternatively, we could use with tf.device(), though that may only be useful for model parallelism in distributed mode.
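
For illustration, a minimal launcher sketch for the first workaround, assuming a single machine with two GPUs, one ps task, and two worker tasks; trainer.py and its flags are hypothetical placeholders for your own training script:

import os
import subprocess

# Hypothetical layout: 1 ps task and 2 worker tasks on one 2-GPU server.
tasks = [
    ("ps",     0, ""),   # the ps task sees no GPUs, so it allocates nothing on them
    ("worker", 0, "0"),  # worker 0 sees only GPU 0
    ("worker", 1, "1"),  # worker 1 sees only GPU 1
]

for job_name, task_index, gpus in tasks:
    # Each child process gets its own CUDA_VISIBLE_DEVICES before TensorFlow starts.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    subprocess.Popen(
        ["python", "trainer.py",
         "--job_name", job_name,
         "--task_index", str(task_index)],
        env=env)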

Environment info

Operating System:

Ubuntu 14.04

Installed version of CUDA and cuDNN:
(please attach the output of ls -l /path/to/cuda/lib/libcud*):

# ls -l /usr/local/cuda/lib64/libcud*
-rw-r--r-- 1 root root   322936 Aug 15  2015 /usr/local/cuda/lib64/libcudadevrt.a
lrwxrwxrwx 1 root root       16 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root       19 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root   383336 Aug 15  2015 /usr/local/cuda/lib64/libcudart.so.7.5.18
-rw-r--r-- 1 root root   720192 Aug 15  2015 /usr/local/cuda/lib64/libcudart_static.a
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4
-rwxr-xr-x 1 root root 61453024 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn.so.4.0.7
-rw-r--r-- 1 root root 62025862 Jun 30 04:17 /usr/local/cuda/lib64/libcudnn_static.a

If installed from binary pip package, provide:

0.9.0
@poxvoculi

Member

poxvoculi commented Aug 4, 2016

Your observation is correct. It is a deliberate design choice that a TF process will, unless instructed otherwise, use visible GPUs in order, and attempt to use all of the memory on each device. This approach can work well when processes sharing a server are in a virtual environment that makes visible only the GPUs that the process is permitted to use. As you note, setting a different value for CUDA_VISIBLE_DEVICES in each process is a similar solution.
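
(As a complement, a rough sketch of reining in the up-front allocation from inside a single task via GPUOptions; the fraction, addresses, and task index here are placeholders, and option availability can vary with the TF version:)

import tensorflow as tf

# Sketch only: cap how much GPU memory this one process pre-allocates
# instead of letting it claim everything on each visible device.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

cluster = tf.train.ClusterSpec({"ps": ["localhost:2222"],
                                "worker": ["localhost:2223"]})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
sess = tf.Session(server.target, config=config)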

Unfortunately, there's no single best solution for distribution, and TF does not attempt to provide one. Packing multiple ps shards and worker shards onto a single server is a convenient way of distributing, particularly if one has a large virtualized server farm, but if you really care about maximizing performance it may be necessary to customize your model to a particular server architecture, and run one process per server, where each process is aware of all the GPUs on that server and potentially takes advantage of local communication between them. This kind of approach would involve the with tf.device() construct you noted.
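
(A rough sketch of that one-process-per-server style, assuming two local GPUs; the shapes and ops are illustrative only, and tf.concat below uses the 0.x argument order:)

import tensorflow as tf

# One process that sees both GPUs and splits the work between them explicitly.
with tf.device("/cpu:0"):
    weights = tf.Variable(tf.zeros([784, 10]))   # shared parameters on the CPU

tower_logits = []
for gpu_id in range(2):                          # assumes two local GPUs
    with tf.device("/gpu:%d" % gpu_id):
        images = tf.placeholder(tf.float32, [None, 784])
        tower_logits.append(tf.matmul(images, weights))

with tf.device("/cpu:0"):
    merged = tf.concat(0, tower_logits)          # gather the towers locally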

It's probably feasible to do a somewhat more effective job of scheduling Ops onto devices than TF does at the moment, but doing so is actually a pretty hard problem. Although we've looked into this, we've continued to find that a little bit of attention from the programmer when setting up the model and execution plan is usually sufficient for a good solution.

@poxvoculi poxvoculi closed this Aug 4, 2016

@tobegit3hub

Contributor

tobegit3hub commented Aug 5, 2016

Thanks for the confirmation and the detailed explanation 😃

TensorFlow is flexible enough, with CUDA_VISIBLE_DEVICES and with tf.device(), to achieve any architecture. We are still looking forward to smarter scheduling, though. Maybe adding a default rule that runs ps tasks on the CPU and spreads worker operations across all GPUs could help, especially for beginners.
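
(For instance, the "ps on CPU" half of that rule can already be approximated with tf.train.replica_device_setter; a rough sketch, with placeholder host names and task index:)

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
task_index = 0  # normally parsed from a command-line flag

# Variables go to the ps job, pinned to its CPU; other ops go to this worker's GPU.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        ps_device="/job:ps/cpu:0",
        worker_device="/job:worker/task:%d/gpu:0" % task_index)):
    weights = tf.Variable(tf.zeros([784, 10]))    # placed on the ps CPU
    biases = tf.Variable(tf.zeros([10]))          # placed on the ps CPU
    images = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(images, weights) + biases  # placed on the worker GPU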
