Multiple tasks in the same server may cause OOM in distributed mode with GPUs #3644
Following the official tutorial on distributed TensorFlow, we find that the ps and worker tasks will use the first GPU by default, which may cause OOM in distributed mode.
If we don't set `CUDA_VISIBLE_DEVICES`, every process on the same server tries to allocate memory on that first GPU.
Would it be possible to make the scheduling algorithm more intelligent, for example using the CPU for the ps task and placing operations on different GPUs instead of only the first one?
One solution is to specify `CUDA_VISIBLE_DEVICES` separately for each process, but that has to be configured manually per task.
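One way to keep each task off the first GPU is to set `CUDA_VISIBLE_DEVICES` before TensorFlow is imported, so each process only sees the device it is allowed to use. A minimal sketch, where the helper name and the round-robin policy are illustrative rather than anything TensorFlow prescribes:

```python
import os

def gpu_for_task(job_name, task_index, num_gpus):
    """Pick a GPU (or none) for a task: ps tasks get no GPU (CPU only),
    worker tasks are assigned round-robin across the local GPUs."""
    if job_name == "ps":
        return ""  # empty string hides all GPUs, forcing CPU placement
    return str(task_index % num_gpus)

# Must run before `import tensorflow`, which scans visible GPUs at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = gpu_for_task("worker", 3, 4)
```

Each task's launcher would call this with its own job name and task index, so two workers on the same 4-GPU server end up on different devices.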
Your observation is correct. It is a deliberate design choice that a TF process will, unless instructed otherwise, use visible GPUs in order, and attempt to use all of the memory on each device. This approach can work well when processes sharing a server are in a virtual environment that makes visible only the GPUs that the process is permitted to use. As you note, setting a different value for `CUDA_VISIBLE_DEVICES` in each process's environment is the usual way to work around this.
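Besides restricting which devices a process can see, each process can also be told not to grab all of the memory on its devices up front. A sketch using the TF 1.x-era session/server configuration (the cluster addresses are placeholders):

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["host0:2222"],                    # placeholder addresses
    "worker": ["host1:2222", "host1:2223"],  # two workers on one server
})

config = tf.ConfigProto()
config.gpu_options.visible_device_list = "1"  # this process only sees GPU 1
config.gpu_options.allow_growth = True        # allocate memory as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.5  # or cap it

server = tf.train.Server(cluster, job_name="worker", task_index=1,
                         config=config)
```

With `allow_growth`, two tasks can coexist on one GPU without either pre-claiming the whole device, though they can still collide if their combined working sets exceed its memory.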
Unfortunately, there's no single best solution for distribution, and TF does not attempt to provide one. Packing multiple ps shards and worker shards onto a single server is a convenient way of distributing, particularly if one has a large virtualized server farm, but if you really care about maximizing performance it may be necessary to customize your model to a particular server architecture, and run one process per server, where each process is aware of all the GPUs on that server and potentially takes advantage of local communication between them. This kind of approach would involve explicitly placing operations on devices within a single multi-GPU process.
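The one-process-per-server pattern usually means building one model replica ("tower") per local GPU inside a single graph, with shared variables kept somewhere cheap to reach from all of them. A minimal device-assignment helper for that layout (the names and the CPU-hosted-variables choice are illustrative, not the only option):

```python
def tower_devices(num_gpus):
    """Device strings for one in-process tower per local GPU,
    with shared variables pinned to the CPU for local sharing."""
    return {
        "variables": "/cpu:0",
        "towers": ["/gpu:%d" % i for i in range(num_gpus)],
    }
```

Each tower's ops would then be built under `with tf.device(devices["towers"][i]):`, and gradients averaged locally before updating the CPU-hosted variables.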
It's probably feasible to do a somewhat more effective job of scheduling Ops onto devices than TF does at the moment, but doing so is actually a pretty hard problem. Although we've looked into this, we've continued to find that a little bit of attention from the programmer when setting up the model and execution plan is usually sufficient for a good solution.
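In practice that "little bit of attention from the programmer" often amounts to a device function: TensorFlow's `tf.train.replica_device_setter` implements the common case of spreading variables round-robin over ps tasks while leaving compute on the worker. The core idea can be sketched in plain Python (this is an illustration of the policy, not the library's implementation):

```python
def round_robin_ps_placer(num_ps):
    """Return a device function: variable ops go to ps tasks in
    round-robin order on CPU, everything else stays on the worker.
    Mirrors the idea behind tf.train.replica_device_setter."""
    counter = {"next": 0}

    def device_fn(op_type):
        if op_type in ("Variable", "VariableV2"):
            device = "/job:ps/task:%d/cpu:0" % (counter["next"] % num_ps)
            counter["next"] += 1
            return device
        return "/job:worker"

    return device_fn
```

Wrapping model construction in `with tf.device(tf.train.replica_device_setter(ps_tasks=N)):` gets the real behavior with one line, which is why explicit placement has stayed the recommended answer over a smarter automatic scheduler.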
Thanks for confirming, and for the detailed explanation.
TensorFlow is flexible enough with `tf.device` to handle this once we know where things should be placed.