Support for distributed TensorFlow: draft #296

Open
albertz opened this issue Jun 2, 2020 · 2 comments
Assignees
Labels
good first issue Should be a good starting point to get familiar with RETURNN, hopefully not too hard.

Comments

albertz commented Jun 2, 2020

See the overview of distributed TensorFlow in general (independent of RETURNN) for some background.

This issue here is about the specific implementation in RETURNN.

This is somewhat orthogonal to the current Horovod implementation (documentation). We should allow for both distributed TF and Horovod, and maybe even mix them.

This is also related to the new dataset pipeline (#292).

We collect experience with all the possible variations of hardware, cluster environment, software (Horovod, distributed TF, new/old dataset pipeline), and strategy in the wiki.


Rough draft:

  • New file TFDistributed.py with all related functionality.
  • We would not use the predefined TF strategies (tf.distribute.Strategy) by default, although we might want to allow for that as well.
  • Our standard/default strategy would be similar to what we had before (Horovod horovod_reduce_type = "param", or our Theano implementation):
    Every replica has its own copy of the variables (no parameter server). They train independently (async, between-graph replication), and at certain points (e.g. every N training steps) they synchronize by averaging the parameters (see the parameter-averaging sketch after this list).
  • Our initial implementation should work together with the SGE parallel environment (qsub -pe), which provides basic MPI communication across hosts (although we might not need the MPI communication at all, as distributed TF has its own communication).
  • As multiple processes can run on a single SGE node, we need some convention or mechanism to figure out which port to use, one that also avoids race conditions as much as possible, e.g. something like 1000 + int(os.environ["CUDA_VISIBLE_DEVICES"]) * 100. We should make sure this does not collide with other services.
    If we have MPI, we can use mpi4py to communicate this kind of information, e.g. to get the rank (MPI.COMM_WORLD.rank) and size (MPI.COMM_WORLD.size); see the cluster-setup sketch after this list.
    Otherwise, the SGE PE probably also provides some more direct way to get this information.
    Edit: This is now done via MPI. See the documentation in the code.
  • The chief worker (MPI rank 0) might spawn a dedicated subprocess as a dataset-loader producer worker for the new dataset pipeline (New dataset pipeline: draft #292). (Or can the SGE PE spawn dedicated CPU-only processes alongside the GPU processes? I don't think so...)
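
A minimal sketch of the cluster setup described above (hypothetical code, not the actual TFDistributed.py; the base port, the CUDA_VISIBLE_DEVICES handling and the cluster-spec layout are assumptions for illustration):

```python
import os
from mpi4py import MPI  # available when started via the SGE PE with MPI support

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# Assumption: every process on a node is pinned to exactly one GPU via
# CUDA_VISIBLE_DEVICES, so the GPU index gives a node-local port offset.
gpu_idx = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
port = 1000 + gpu_idx * 100  # convention from the draft; a base above 1024 would avoid privileged ports

# Exchange "host:port" addresses so that every process can build the same cluster spec.
addresses = comm.allgather("%s:%i" % (MPI.Get_processor_name(), port))
cluster_spec = {"worker": addresses}  # rank 0 would act as the chief worker
print("rank %i/%i, cluster spec: %r" % (rank, size, cluster_spec))
```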
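And a minimal sketch of the periodic parameter averaging itself, here in TF eager mode with mpi4py Allreduce as the transport, purely for illustration; the real implementation could equally use Horovod ops or distributed TF collectives, and would operate on the graph/session level:

```python
import numpy
from mpi4py import MPI

comm = MPI.COMM_WORLD

def average_params(variables):
    """Replace every tf.Variable by the element-wise mean over all replicas."""
    for var in variables:
        local_value = var.numpy()
        summed = numpy.empty_like(local_value)
        comm.Allreduce(local_value, summed, op=MPI.SUM)
        var.assign(summed / comm.size)

# E.g. in the training loop, every N steps:
#   if step % sync_every_n_steps == 0:
#       average_params(model.trainable_variables)
```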
albertz commented Jun 2, 2020

(The assignment is just so that you keep track of this, as it might be relevant for you. Of course, feel free to participate more actively as well, e.g. by stating constraints or wishes. Maybe you can comment on how the current draft is incompatible with what you need, what should be added, or what the priorities should be.)

albertz commented Jun 10, 2020

Some update: An initial implementation is there now. However, it mostly just covers the cluster setup and starting the TF servers, and it is mainly intended for between-graph replication. Beyond that, not much is implemented yet. So this is a good base for further work, but as-is it is not yet a complete solution for distributed training.
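
For illustration, a minimal sketch of what starting such a per-process TF server for between-graph replication looks like (host names, ports and the task index are placeholders, not the RETURNN code):

```python
import tensorflow as tf

# Placeholder cluster spec; in RETURNN this would be derived from the MPI setup.
cluster = tf.train.ClusterSpec({"worker": ["node01:2001", "node02:2001"]})
task_index = 0  # e.g. the MPI rank of this process

server = tf.distribute.Server(cluster, job_name="worker", task_index=task_index, start=True)
# Each worker builds its own graph and connects its session to server.target
# (between-graph replication); the chief (task_index == 0) also handles checkpointing etc.
print("TF server started, target: %s" % server.target)
```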

The main goal in any case: efficient/fast training on various kinds of hardware (single-node multi-GPU (consumer GPUs...), multi-node multi-GPU, TPU) and environments (SGE cluster with commodity hardware, AWS, or GCE). In the end it does not matter too much whether we use the Horovod ops or the distributed TF functions; we should just see what works best. The dataset pipeline might be related to this.

Spotlight0xff pushed a commit to Spotlight0xff/returnn that referenced this issue Sep 5, 2020
@albertz albertz added the good first issue Should be a good starting point to get familiar with RETURNN, hopefully not too hard. label Oct 14, 2022