Support for distributed TensorFlow: draft #296

Open
albertz opened this issue Jun 2, 2020 · 2 comments
Assignees
Labels
good first issue Should be a good starting point to get familiar with RETURNN, hopefully not too hard.

Comments

albertz commented Jun 2, 2020

See the overview of distributed TensorFlow in general (independent of RETURNN) for some background.

This issue here is about the specific implementation in RETURNN.

This is somewhat orthogonal to the current Horovod implementation (documentation). We should allow for both distributed TF and Horovod, and maybe even mix them.

This is also related to the new dataset pipeline (#292).

We collect experience with all the possible variations of hardware, cluster environment, software (Horovod, distributed TF, new/old dataset pipeline), and strategy in the wiki.


Rough draft:

  • New file TFDistributed.py with all related functionality.
  • We would not use the predefined TF strategies (tf.distribute.Strategy) by default, although we might want to allow for that as well.
  • Our standard/default strategy would be similar to what we had before (Horovod horovod_reduce_type = "param", or our Theano implementation):
    Every replica has its own copy of the variables (no parameter server). They train independently (async, between-graph replication), and at certain points (e.g. every N training steps) they synchronize by averaging the parameters (see the parameter-averaging sketch after this list).
  • Our initial implementation should work together with the SGE parallel environment (qsub -pe), which provides basic MPI communication across hosts (although we might not need the MPI communication at all, as distributed TF has its own communication).
  • As multiple processes can run on a single SGE node, we need some convention or mechanism to figure out which port to use, one that also avoids race conditions as much as possible, e.g. something like 1000 + int(os.environ["CUDA_VISIBLE_DEVICES"]) * 100. We should make sure this does not collide with other services.
    If we have MPI, we can use mpi4py to communicate this kind of information, e.g. to get the rank (MPI.COMM_WORLD.rank) and size (MPI.COMM_WORLD.size); see the cluster-setup sketch after this list.
    Otherwise, the SGE PE probably also provides some more direct way to get this information.
    Edit: This is now done via MPI. See the documentation in the code.
  • The chief worker (MPI rank 0) might spawn a dedicated subprocess as a dataset-loader producer worker for the new dataset pipeline (New dataset pipeline: draft #292). (Or can the SGE PE spawn dedicated CPU-only processes alongside the GPU processes? I don't think so...)
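
A minimal sketch of the cluster setup described above (hypothetical code, not the actual TFDistributed.py; the base port, the CUDA_VISIBLE_DEVICES handling and the cluster-spec layout are assumptions for illustration):

```python
import os
from mpi4py import MPI  # available when started via the SGE PE with MPI support

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# Assumption: every process on a node is pinned to exactly one GPU via
# CUDA_VISIBLE_DEVICES, so the GPU index gives a node-local port offset.
gpu_idx = int(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")[0])
port = 1000 + gpu_idx * 100  # convention from the draft; a base above 1024 would avoid privileged ports

# Exchange "host:port" addresses so that every process can build the same cluster spec.
addresses = comm.allgather("%s:%i" % (MPI.Get_processor_name(), port))
cluster_spec = {"worker": addresses}  # rank 0 would act as the chief worker
print("rank %i/%i, cluster spec: %r" % (rank, size, cluster_spec))
```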
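And a minimal sketch of the periodic parameter averaging itself, here in TF eager mode with mpi4py Allreduce as the transport, purely for illustration; the real implementation could equally use Horovod ops or distributed TF collectives, and would operate on the graph/session level:

```python
import numpy
from mpi4py import MPI

comm = MPI.COMM_WORLD

def average_params(variables):
    """Replace every tf.Variable by the element-wise mean over all replicas."""
    for var in variables:
        local_value = var.numpy()
        summed = numpy.empty_like(local_value)
        comm.Allreduce(local_value, summed, op=MPI.SUM)
        var.assign(summed / comm.size)

# E.g. in the training loop, every N steps:
#   if step % sync_every_n_steps == 0:
#       average_params(model.trainable_variables)
```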
albertz commented Jun 2, 2020

(The assignment is just so that you keep track of this, as it might be relevant for you. Of course, feel free to participate more actively as well, e.g. by stating constraints or wishes. Maybe you can comment on how the current draft is incompatible with what you need, what should be added, or what the priorities should be.)

albertz commented Jun 10, 2020

Some update: An initial implementation is there now. However, it mostly just covers the cluster setup and starting the TF servers, and it is mainly intended for between-graph replication. Beyond that, not much is implemented yet. So this is a good base for further work, but as-is it is not yet a complete solution for distributed training.
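
For illustration, a minimal sketch of what starting such a per-process TF server for between-graph replication looks like (host names, ports and the task index are placeholders, not the RETURNN code):

```python
import tensorflow as tf

# Placeholder cluster spec; in RETURNN this would be derived from the MPI setup.
cluster = tf.train.ClusterSpec({"worker": ["node01:2001", "node02:2001"]})
task_index = 0  # e.g. the MPI rank of this process

server = tf.distribute.Server(cluster, job_name="worker", task_index=task_index, start=True)
# Each worker builds its own graph and connects its session to server.target
# (between-graph replication); the chief (task_index == 0) also handles checkpointing etc.
print("TF server started, target: %s" % server.target)
```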

The main goal in any case: efficient/fast training on various kinds of hardware (single-node multi-GPU (consumer GPUs...), multi-node multi-GPU, TPU) and environments (SGE cluster with commodity hardware, AWS, or GCE). In the end it does not matter too much whether we use the Horovod ops or the distributed TF functions; we should just see what works best. The dataset pipeline might be related to this.

Spotlight0xff pushed a commit to Spotlight0xff/returnn that referenced this issue Sep 5, 2020
@albertz albertz added the good first issue Should be a good starting point to get familiar with RETURNN, hopefully not too hard. label Oct 14, 2022