imagenet example - add logic to broadcast most recent checkpoint from max_rank by kiukchung · Pull Request #93 · pytorch/elastic

kiukchung · 2020-04-10T00:52:22Z

Summary:
Rationale for adding checkpoint broadcasting:

In our example we don't have access to globally visible storage.
Each local rank 0 writes the checkpoint.
When a container/node dies, the replacement container has no checkpoints (they were lost with the node).
New nodes start from scratch, whereas surviving nodes are ahead.
The logic is to find the checkpoint with the max epoch and broadcast it from the rank that holds it (max_rank).
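
The selection step above can be sketched in a single process. This is only an illustration, not the code from this PR: the names `pick_max_rank`, `sync_checkpoints`, and `NO_CHECKPOINT` are hypothetical, checkpoints are assumed to be plain dicts with an `epoch` key, and the list copy stands in for a real collective broadcast between ranks.

```python
from typing import Dict, List, Optional, Tuple

NO_CHECKPOINT = -1  # hypothetical sentinel epoch for a rank with no local checkpoint


def pick_max_rank(epochs: List[int]) -> Tuple[int, int]:
    """Return (max_rank, max_epoch) for the per-rank epochs.

    Ties go to the lowest rank so every rank computes the same answer.
    """
    max_epoch = max(epochs)
    max_rank = epochs.index(max_epoch)
    return max_rank, max_epoch


def sync_checkpoints(local: List[Optional[Dict]]) -> List[Optional[Dict]]:
    """Simulate 'gather epochs, then broadcast from max_rank' in one process.

    local[i] is rank i's checkpoint (a dict with an 'epoch' key) or None.
    """
    epochs = [c["epoch"] if c is not None else NO_CHECKPOINT for c in local]
    max_rank, max_epoch = pick_max_rank(epochs)
    if max_epoch == NO_CHECKPOINT:
        return list(local)  # nobody has a checkpoint; all ranks start from scratch
    # Stand-in for a collective broadcast: every rank receives its own copy.
    return [dict(local[max_rank]) for _ in local]


# Rank 1 is a replacement node with no checkpoint; rank 2 is furthest ahead.
ranks = [{"epoch": 3, "state": "a"}, None, {"epoch": 5, "state": "b"}]
synced = sync_checkpoints(ranks)
```

Breaking ties toward the lowest rank is a design choice made here so that all ranks agree on the source without extra communication; the PR itself may resolve ties differently.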

Rationale for removing the nnode==1 assertion for the launcher with the --with_etcd option:

You can run two agents on the same node to simulate a multi-node run.
First start agent #1 with the --with_etcd option.
Then start agent #2 by copying the rdzv info from agent #1's logs and passing the same --rdzv_id, --rdzv_backend, and --rdzv_endpoint values as the first launch.
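
As a rough sketch of the second step, the helper below assembles the rendezvous flags for agent #2 from values read out of agent #1's logs. The function name `rdzv_flags` and the sample values are hypothetical; only the three flag names come from the description above, and the launcher command they would be appended to is deliberately left out.

```python
from typing import List


def rdzv_flags(rdzv_id: str, rdzv_backend: str, rdzv_endpoint: str) -> List[str]:
    """Build the rendezvous arguments that agent #2 must share with agent #1."""
    return [
        "--rdzv_id", rdzv_id,
        "--rdzv_backend", rdzv_backend,
        "--rdzv_endpoint", rdzv_endpoint,
    ]


# Placeholder values; the real ones are copied from the first agent's logs.
flags = rdzv_flags("example-run-id", "etcd", "localhost:2379")
```

Both agents must be given identical values, since the rendezvous information is what lets them join the same run.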

Differential Revision: D20956704

… max_rank
Summary: same rationale as in the pull request summary above.
Differential Revision: D20956704
fbshipit-source-id: 468d858db7fa9dfde7285270b2f0674faa52bcee

facebook-github-bot · 2020-04-10T00:52:42Z

This pull request was exported from Phabricator. Differential Revision: D20956704

facebook-github-bot · 2020-04-10T02:23:11Z

This pull request has been merged in 1465dfc.

… max_rank (pytorch#93)
Summary: Pull Request resolved: pytorch#93 (same rationale as in the pull request summary above).
Reviewed By: tierex, drdarshan
Differential Revision: D20956704
fbshipit-source-id: 3170e1bcbedf1a7522f3aeee23f0fc67cd038253


facebook-github-bot added the fb-exported label Apr 10, 2020

facebook-github-bot closed this in 1465dfc Apr 10, 2020

facebook-github-bot added the Merged label Apr 10, 2020

kiukchung deleted the export-D20956704 branch September 1, 2020 04:36

kiukchung wants to merge 1 commit into pytorch:master from kiukchung:export-D20956704
