implemented torchelastic.distributed.launch for oss #65
Conversation
Summary: Implements an elastic launcher similar in usage to `torch.distributed.launch`, with added functionality:

1. automagic `RANK`, `LOCAL_RANK`, `WORLD_SIZE` assignment
2. retries of failed workers as a group
3. support for membership changes between `min` and `max` sizes

Completely compatible with existing scripts that are compliant with `torch.distributed.launch`.

Differential Revision: D20554522

fbshipit-source-id: 3ced8b5cc5ffca03413aa8d93b84f1840cb172b0
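For context (not part of this PR), here is a minimal sketch of a worker script that relies only on the environment variables the launcher assigns; the `gloo` backend, the `env://` init method, and the toy all-reduce are illustrative assumptions:

```python
# Hypothetical worker script (not from this PR): it only reads the
# RANK, LOCAL_RANK and WORLD_SIZE that the launcher (re)assigns on
# every rendezvous, so it never hard-codes its own placement.
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # "gloo" and env:// are placeholders; any backend / init_method
    # that reads the env vars above will work.
    dist.init_process_group(backend="gloo", init_method="env://")

    tensor = torch.ones(1)
    dist.all_reduce(tensor)  # sanity check: the sum equals world_size
    print(f"rank={rank} local_rank={local_rank} "
          f"world_size={world_size} allreduce={tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Per the summary, a script written against `torch.distributed.launch` conventions like this one should also work unchanged under the elastic launcher.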
This pull request was exported from Phabricator. Differential Revision: D20554522

This pull request has been merged in be92012.
all existing workers are stopped, a new `Worker Group` is formed and all
workers are started with a new `RANK` and `WORLD_SIZE`.

2. Node arrival (scale-up) - the new node is admitted to the job,
If the number of existing workers has already surpassed the minimum `nnodes` but is still smaller than the maximum `nnodes`, will node arrival cause existing workers to stop? Which component decides node admission?
pre_initialize()
load_checkpoint(checkpoint_path)
initialize()
start_train()
Would I be correct to assume this is just a recommendation, and that the launch script only requires the user program to be either an executable or a Python script?
Yes, this is just a recommendation, but with the caveat that without checkpoints your program will start from the beginning whenever there are faults or scaling events.
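To make that caveat concrete, here is a hedged sketch of a script following the recommended structure above; the function bodies, the checkpoint format, and the `/tmp/ckpt.pt` path are assumptions for illustration, not part of this PR:

```python
# Illustrative fleshing-out of the recommended skeleton; intended to be
# run under the launcher, but the details are placeholders.
import os

import torch


def pre_initialize():
    # One-time process setup; under the launcher this is also where the
    # worker would join the process group via the env vars it exports.
    torch.manual_seed(0)


def load_checkpoint(checkpoint_path):
    # Resume if a previous run left a checkpoint behind; otherwise the
    # worker starts from scratch (the caveat mentioned above).
    if os.path.exists(checkpoint_path):
        return torch.load(checkpoint_path)
    return None


def initialize(state):
    model = torch.nn.Linear(10, 1)
    if state is not None:
        model.load_state_dict(state["model"])
    return model


def start_train(model, checkpoint_path, start_epoch=0):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(start_epoch, 3):
        loss = model(torch.randn(4, 10)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Saving each epoch is what makes resuming possible after a
        # fault or scaling event.
        torch.save({"model": model.state_dict(), "epoch": epoch},
                   checkpoint_path)


if __name__ == "__main__":
    checkpoint_path = "/tmp/ckpt.pt"  # placeholder path
    pre_initialize()
    state = load_checkpoint(checkpoint_path)
    model = initialize(state)
    start_train(model, checkpoint_path,
                start_epoch=0 if state is None else state["epoch"] + 1)
```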
"'python -m'.", | ||
) | ||
parser.add_argument( | ||
"--no_python", |
If it is an executable, how do we resume from the previous checkpoint? Is it the application developer's responsibility to make the executable recoverable? E.g., it should automatically save checkpoints, and when it is launched it should always first look for previous checkpoints?
Yes, it is up to the application to save, load, and keep track of checkpoints. In the future we will provide a mechanism for the application to call `torch.save(elastic-agent://path/checkpoint)`, which will save the checkpoint with the agent; when the worker is restarted, it can retrieve the checkpoint locally from the agent without having to pull from a persistent store.
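Until such an agent-side mechanism exists, the application would save to and restore from a persistent store on its own. A minimal sketch of that pattern, assuming a shared directory that survives node loss (the path, file name, and state layout are placeholders, not part of this PR):

```python
# Application-managed checkpointing against a persistent store;
# the shared directory and naming scheme are assumptions.
import os

import torch

SHARED_DIR = "/mnt/shared/my_job"  # placeholder: storage that survives node loss
CKPT = os.path.join(SHARED_DIR, "checkpoint.pt")


def maybe_resume(model, optimizer):
    """On every (re)launch, look for a previous checkpoint first."""
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1


def save(model, optimizer, epoch):
    os.makedirs(SHARED_DIR, exist_ok=True)
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, CKPT)  # atomic rename so a crash never leaves a torn file


if __name__ == "__main__":
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    start_epoch = maybe_resume(model, optimizer)
    for epoch in range(start_epoch, 5):
        optimizer.zero_grad()
        model(torch.randn(8, 4)).sum().backward()
        optimizer.step()
        save(model, optimizer, epoch)
```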
Summary: Pull Request resolved: pytorch#65

Implements an elastic launcher similar in usage to `torch.distributed.launch`, with added functionality:

1. automagic `RANK`, `LOCAL_RANK`, `WORLD_SIZE` assignment
2. retries of failed workers as a group
3. support for membership changes between `min` and `max` sizes

Completely compatible with existing scripts that are compliant with `torch.distributed.launch`.

Reviewed By: drdarshan

Differential Revision: D20554522

fbshipit-source-id: ea63ebe98fa9c2fd4dcecb46c0cfcc0afc65ffae