Skip to content
This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Conversation

kiukchung
Copy link
Contributor

Summary:

  1. Adds ml_image buck macro
  2. Adds --run_path option to torch.distributed.run
  3. Adds tsm/driver/fb/test/patched/foo (for unittesting)
  4. Changes to distributed_sum to use ml_image (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Differential Revision: D28421033

Summary:
1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Differential Revision: D28421033

fbshipit-source-id: ef1e4e53a337114276d35af66b92b95dd40b11f5
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D28421033

kiukchung pushed a commit to kiukchung/pytorch that referenced this pull request May 14, 2021
…ytorch#58252)

Summary:
Pull Request resolved: pytorch#58252

Pull Request resolved: pytorch/elastic#149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Differential Revision: D28421033

fbshipit-source-id: 9c59c09157e500ed250a9a75d74c01a5b9759eb6
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 9550c08.

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request May 15, 2021
…58252)

Summary:
Pull Request resolved: #58252

Pull Request resolved: pytorch/elastic#149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
…ytorch#58252)

Summary:
Pull Request resolved: pytorch#58252

Pull Request resolved: pytorch/elastic#149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Test Plan:
## Local Testing
```
# build the two fbpkgs (base and main)
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum.base
buck run //pytorch/elastic/examples/distributed_sum/fb:torchx.examples.dist_sum

# fetch the fbpkgs
cd ~/tmp

fbpkg fetch --symlink-tags  -o -d . jetter:prod
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum.base
fbpkg fetch --symlink-tags  -o -d . torchx.examples.dist_sum

jetter/LAST/jetter apply-and-run \
  torchx.examples.dist_sum.base/LAST/torchrun \
  torchx.examples.dist_sum/LAST \
  -- \
  --as_function \
  --rdzv_id foobar \
  --nnodes 1 \
  --nproc_per_node 2 \
  --max_restarts 0 \
  --role worker \
  --no_python \
~/torchx.examples.dist_sum/LAST/pytorch/elastic/examples/distributed_sum/fb/main.py
```

## Mast Testing
```
buck-out/gen/pytorch/elastic/torchelastic/tsm/fb/cli/tsm.par run_ddp \
  --scheduler mast
  --base_fbpkg torchx.examples.dist_sum.base:78f01b5 \
  --fbpkg torchx.examples.dist_sum:f38ab46 \
  --run_cfg hpcClusterUuid=MastNaoTestCluster,hpcIdentity=pytorch_r2p,hpcJobOncall=pytorch_r2p \
  --nnodes 2 \
  --resource T1 \
  --nproc_per_node 4 \
  --name kiuk_jetter_test \
 pytorch/elastic/examples/distributed_sum/fb/main.py
```
Runs successfully: https://www.internalfb.com/mast/job/tsm_kiuk-kiuk_jetter_test_34c9f0fa?

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
fotstrt pushed a commit to eth-easl/elastic that referenced this pull request Feb 17, 2022
Summary:
Pull Request resolved: pytorch/pytorch#58252

Pull Request resolved: pytorch#149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
andrewjimpson9551 added a commit to andrewjimpson9551/Elastic that referenced this pull request Jul 29, 2025
Summary:
Pull Request resolved: pytorch/pytorch#58252

Pull Request resolved: pytorch/elastic#149

1. Adds `ml_image` buck macro
2. Adds `--run_path` option to `torch.distributed.run`
3. Adds `tsm/driver/fb/test/patched/foo` (for unittesting)
4. Changes to `distributed_sum` to use `ml_image` (see Test plan for how this was tested in local and mast)

NOTE: need to enable jetter for flow and local schedulers (will do this on a separate diff since this diff is already really big)

Reviewed By: tierex

Differential Revision: D28421033

fbshipit-source-id: 96edcecf639143e31ec6c86ec713a2e2d7790f3d
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants