This repository has been archived by the owner on Jan 6, 2023. It is now read-only.

implemented torchelastic.distributed.launch for oss #65

Closed · wants to merge 1 commit

Conversation

kiukchung
Contributor

Summary:
Implements an elastic launcher similar in usage to `torch.distributed.launch`, with added functionalities:

  1. automagic `RANK`, `LOCAL_RANK`, `WORLD_SIZE` assignment.
  2. retries of failed workers as a group.
  3. support for membership changes between `min` and `max` sizes.

Completely compatible with existing scripts that are compliant with `torch.distributed.launch`.
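For illustration only (this sketch is not part of the PR), a worker script that already follows the `torch.distributed.launch` conventions could look like the following. The environment variable names come from the summary above; the `gloo` backend and `env://` init method are assumptions.

```python
# Sketch of a torch.distributed.launch-compliant worker (illustrative, not from this PR).
# The launcher is expected to populate RANK, LOCAL_RANK and WORLD_SIZE, plus the usual
# MASTER_ADDR/MASTER_PORT consumed by the env:// init method.
import os

import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])              # global rank assigned by the launcher
    local_rank = int(os.environ["LOCAL_RANK"])  # rank of this worker on its node
    world_size = int(os.environ["WORLD_SIZE"])  # total number of workers

    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"worker {rank}/{world_size} (local_rank={local_rank}) is up")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script like this is what the compatibility claim above refers to: it reads only the standard environment variables, so it should not need changes to run under the elastic launcher.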

Differential Revision: D20554522

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D20554522

test/distributed/launch_test.py (review thread resolved)
torchelastic/distributed/launch.py (review thread resolved)
torchelastic/distributed/launch.py (review thread resolved)
@facebook-github-bot
Contributor

This pull request has been merged in be92012.

all existing workers are stopped, a new `Worker Group` is formed and all
workers are started with a new `RANK` and `WORLD_SIZE`.

2. Node arrival (scale-up) - the new node is admitted to the job,


If the number of existing workers has already surpassed the minimum nnodes but is smaller than the maximum nnodes, will node arrival cause existing workers to stop? Which component decides node admission?

Comment on lines +141 to +144
pre_initialize()
load_checkpoint(checkpoint_path)
initialize()
start_train()


Would I be correct to assume this is just a recommendation, and that the launch script only requires the user program to be either an executable or a Python script?

@kiukchung
Contributor Author


Yes, this is just a recommendation, but with the caveat that without checkpoints your program will start from the beginning whenever there are faults or scaling events.
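To make that concrete, here is a rough sketch (mine, not from this PR) of a user script following the structure quoted in lines +141 to +144. The function bodies, the checkpoint path, and the checkpoint format are all assumptions.

```python
# Rough sketch of the recommended structure (pre_initialize / load_checkpoint /
# initialize / start_train). Function bodies, checkpoint path and format are
# illustrative assumptions, not part of this PR.
import os

import torch

CHECKPOINT_PATH = "/tmp/checkpoint.pt"  # hypothetical location


def pre_initialize():
    # Parse arguments, set up logging, initialize the process group, etc.
    pass


def load_checkpoint(checkpoint_path):
    # If a previous checkpoint exists, resume from it; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        return torch.load(checkpoint_path)
    return {"epoch": 0}


def initialize(state):
    # Build the model/optimizer and restore their state from `state`.
    pass


def start_train(state, checkpoint_path):
    # Training loop; periodically persist progress so a restarted worker group
    # resumes here instead of from the beginning.
    for epoch in range(state["epoch"], 10):
        # ... one epoch of training ...
        state["epoch"] = epoch + 1
        torch.save(state, checkpoint_path)


if __name__ == "__main__":
    pre_initialize()
    state = load_checkpoint(CHECKPOINT_PATH)
    initialize(state)
    start_train(state, CHECKPOINT_PATH)
```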

"'python -m'.",
)
parser.add_argument(
"--no_python",


If it is an executable, how do we resume from a previous checkpoint? Is it the application developer's responsibility to make the executable recoverable? E.g., it should automatically save checkpoints and, when it is launched, always look for previous checkpoints first?

@kiukchung
Contributor Author


Yes, it is up to the application to save, load, and keep track of checkpoints. In the future we will provide a mechanism for the application to call `torch.save(elastic-agent://path/checkpoint)`, which will save the checkpoint with the agent; when the worker is restarted, it can retrieve the checkpoint locally from the agent without having to pull it from a persistent store.
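For illustration, here is one hypothetical way an application could meet that responsibility today: save checkpoints atomically, keep track of the latest one, and always look for it at startup. The directory layout and naming scheme below are assumptions, not anything prescribed by torchelastic.

```python
# Hypothetical application-managed checkpointing: save atomically, track the
# latest checkpoint, and always look for it on startup. Paths and naming are
# illustrative only.
import glob
import os

import torch

CHECKPOINT_DIR = "/tmp/ckpts"  # assumed location; in practice use shared/persistent storage


def save_checkpoint(state, step):
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    tmp_path = os.path.join(CHECKPOINT_DIR, f"ckpt_{step:08d}.pt.tmp")
    final_path = os.path.join(CHECKPOINT_DIR, f"ckpt_{step:08d}.pt")
    torch.save(state, tmp_path)
    os.rename(tmp_path, final_path)  # atomic on the same POSIX filesystem: no partial files


def load_latest_checkpoint():
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "ckpt_*.pt")))
    if not candidates:
        return None  # no previous checkpoint: start from the beginning
    return torch.load(candidates[-1])
```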

@kiukchung kiukchung deleted the export-D20554522 branch September 1, 2020 04:36
fotstrt pushed a commit to eth-easl/elastic that referenced this pull request Feb 17, 2022
Summary:
Pull Request resolved: pytorch#65

Reviewed By: drdarshan

Differential Revision: D20554522

fbshipit-source-id: ea63ebe98fa9c2fd4dcecb46c0cfcc0afc65ffae