implemented torchelastic.distributed.launch for oss #65
Conversation
Summary: Implements an elastic launcher similar in usage to `torch.distributed.launch`, with added functionality:

1. automagic `RANK`, `LOCAL_RANK`, `WORLD_SIZE` assignment
2. retries of failed workers as a group
3. support for membership changes between `min` and `max` sizes

Completely compatible with existing scripts that are compliant with `torch.distributed.launch`.

Differential Revision: D20554522

fbshipit-source-id: 3ced8b5cc5ffca03413aa8d93b84f1840cb172b0
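For context (not part of this PR), here is a minimal sketch of a worker script that relies only on the environment variables the launcher assigns; the `gloo` backend, the `env://` init method, and the toy all-reduce are illustrative assumptions:

```python
# Hypothetical worker script (not from this PR): it only reads the
# RANK, LOCAL_RANK and WORLD_SIZE that the launcher (re)assigns on
# every rendezvous, so it never hard-codes its own placement.
import os

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # "gloo" and env:// are placeholders; any backend / init_method
    # that reads the env vars above will work.
    dist.init_process_group(backend="gloo", init_method="env://")

    tensor = torch.ones(1)
    dist.all_reduce(tensor)  # sanity check: the sum equals world_size
    print(f"rank={rank} local_rank={local_rank} "
          f"world_size={world_size} allreduce={tensor.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Per the summary, a script written against `torch.distributed.launch` conventions like this one should also work unchanged under the elastic launcher.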
This pull request was exported from Phabricator. Differential Revision: D20554522

This pull request has been merged in be92012.
all existing workers are stopped, a new `Worker Group` is formed and all
workers are started with a new `RANK` and `WORLD_SIZE`.

2. Node arrival (scale-up) - the new node is admitted to the job,
If the number of existing workers has already surpassed the minimum `nnodes` but is still smaller than the maximum `nnodes`, will node arrival cause existing workers to stop? Which component decides node admission?
pre_initialize()
load_checkpoint(checkpoint_path)
initialize()
start_train()
Would I be correct to assume this is just a recommendation, and that the launch script only requires the user program to be either an executable or a Python script?
Yes, this is just a recommendation, but with the caveat that without checkpoints your program will start from the beginning whenever there are faults or scaling events.
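To make that caveat concrete, here is a hedged sketch of a script following the recommended structure above; the function bodies, the checkpoint format, and the `/tmp/ckpt.pt` path are assumptions for illustration, not part of this PR:

```python
# Illustrative fleshing-out of the recommended skeleton; intended to be
# run under the launcher, but the details are placeholders.
import os

import torch


def pre_initialize():
    # One-time process setup; under the launcher this is also where the
    # worker would join the process group via the env vars it exports.
    torch.manual_seed(0)


def load_checkpoint(checkpoint_path):
    # Resume if a previous run left a checkpoint behind; otherwise the
    # worker starts from scratch (the caveat mentioned above).
    if os.path.exists(checkpoint_path):
        return torch.load(checkpoint_path)
    return None


def initialize(state):
    model = torch.nn.Linear(10, 1)
    if state is not None:
        model.load_state_dict(state["model"])
    return model


def start_train(model, checkpoint_path, start_epoch=0):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(start_epoch, 3):
        loss = model(torch.randn(4, 10)).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Saving each epoch is what makes resuming possible after a
        # fault or scaling event.
        torch.save({"model": model.state_dict(), "epoch": epoch},
                   checkpoint_path)


if __name__ == "__main__":
    checkpoint_path = "/tmp/ckpt.pt"  # placeholder path
    pre_initialize()
    state = load_checkpoint(checkpoint_path)
    model = initialize(state)
    start_train(model, checkpoint_path,
                start_epoch=0 if state is None else state["epoch"] + 1)
```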
"'python -m'.", | ||
) | ||
parser.add_argument( | ||
"--no_python", |
If it is an executable, how do we resume from the previous checkpoint? Is it the application developer's responsibility to make the executable recoverable? E.g., it should automatically save checkpoints, and when it is launched it should always first look for previous checkpoints?
Yes, it is up to the application to save, load, and keep track of checkpoints. In the future we will provide a mechanism for the application to call `torch.save(elastic-agent://path/checkpoint)`, which will save the checkpoint with the agent; when the worker is restarted, it can retrieve the checkpoint locally from the agent without having to pull from a persistent store.
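Until such an agent-side mechanism exists, the application would save to and restore from a persistent store on its own. A minimal sketch of that pattern, assuming a shared directory that survives node loss (the path, file name, and state layout are placeholders, not part of this PR):

```python
# Application-managed checkpointing against a persistent store;
# the shared directory and naming scheme are assumptions.
import os

import torch

SHARED_DIR = "/mnt/shared/my_job"  # placeholder: storage that survives node loss
CKPT = os.path.join(SHARED_DIR, "checkpoint.pt")


def maybe_resume(model, optimizer):
    """On every (re)launch, look for a previous checkpoint first."""
    if not os.path.exists(CKPT):
        return 0  # fresh start
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"] + 1


def save(model, optimizer, epoch):
    os.makedirs(SHARED_DIR, exist_ok=True)
    tmp = CKPT + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, tmp)
    os.replace(tmp, CKPT)  # atomic rename so a crash never leaves a torn file


if __name__ == "__main__":
    model = torch.nn.Linear(4, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    start_epoch = maybe_resume(model, optimizer)
    for epoch in range(start_epoch, 5):
        optimizer.zero_grad()
        model(torch.randn(8, 4)).sum().backward()
        optimizer.step()
        save(model, optimizer, epoch)
```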
Summary: Pull Request resolved: pytorch#65

Implements an elastic launcher similar in usage to `torch.distributed.launch`, with added functionality:

1. automagic `RANK`, `LOCAL_RANK`, `WORLD_SIZE` assignment
2. retries of failed workers as a group
3. support for membership changes between `min` and `max` sizes

Completely compatible with existing scripts that are compliant with `torch.distributed.launch`.

Reviewed By: drdarshan

Differential Revision: D20554522

fbshipit-source-id: ea63ebe98fa9c2fd4dcecb46c0cfcc0afc65ffae