
Conversation

awgu (Collaborator) commented Jun 1, 2022

Stack from ghstack:

This enables the first argument to the optimizer constructor, `optim_input`, to have different orders across ranks, e.g. if the parameters in one parameter group are permuted. This requires modifications to `full_optim_state_dict()`, `shard_full_optim_state_dict()`, and `scatter_full_optim_state_dict()`.

The high-level algorithmic change is that the state dicts stay keyed by unflattened parameter name through sharding/unsharding and flattening/unflattening, and are only rekeyed by parameter ID, according to each rank's own `optim_input`, at the very end.

Because this PR adds non-parameter-specific collectives to `full_optim_state_dict()`, it also adds a `group=None` argument to the method so that those common collectives have a default process group to run over.
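
As an illustration of what this permits, here is a minimal usage sketch (not taken from the PR; `MyModel` and the training loop are placeholders, and the process group is assumed to already be initialized): each rank passes a differently ordered parameter list to its optimizer, consolidates the optimizer state through FSDP, and later re-shards it for its own `optim_input`.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(MyModel().cuda())  # MyModel is a placeholder nn.Module

# Each rank may pass the parameters to the optimizer in a different order.
params = list(model.parameters())
optim_input = params if dist.get_rank() == 0 else list(reversed(params))
optim = torch.optim.Adam(optim_input, lr=1e-3)

# ... train for some steps ...

# Consolidate the full optimizer state dict (keyed by unflattened parameter names).
full_osd = FSDP.full_optim_state_dict(model, optim, optim_input)

# When resuming, scatter rank 0's full state dict; the result is rekeyed by
# parameter ID according to this rank's own optim_input, so it loads directly.
full_osd = full_osd if dist.get_rank() == 0 else None
sharded_osd = FSDP.scatter_full_optim_state_dict(full_osd, model, optim_input)
optim.load_state_dict(sharded_osd)
```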

facebook-github-bot (Contributor) commented Jun 1, 2022


❌ 1 New Failure

As of commit edef523 (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-focal-py3.7-gcc7 / test (backwards_compat, 1, 1, linux.2xlarge) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2022-06-03T00:31:28.4348951Z processing existing schema:  text(__torch__.torch.classes.profiling.SourceRef _0) -> str _0
2022-06-03T00:31:28.4349925Z processing existing schema:  count(__torch__.torch.classes.profiling.InstructionStats _0) -> int _0
2022-06-03T00:31:28.4351072Z processing existing schema:  duration_ns(__torch__.torch.classes.profiling.InstructionStats _0) -> int _0
2022-06-03T00:31:28.4352467Z processing existing schema:  source(__torch__.torch.classes.profiling.SourceStats _0) -> __torch__.torch.classes.profiling.SourceRef _0
2022-06-03T00:31:28.4354316Z processing existing schema:  line_map(__torch__.torch.classes.profiling.SourceStats _0) -> Dict(int, __torch__.torch.classes.profiling.InstructionStats) _0
2022-06-03T00:31:28.4355251Z processing existing schema:  __init__(__torch__.torch.classes.profiling._ScriptProfile _0) -> NoneType _0
2022-06-03T00:31:28.4356591Z processing existing schema:  enable(__torch__.torch.classes.profiling._ScriptProfile _0) -> NoneType _0
2022-06-03T00:31:28.4357508Z processing existing schema:  disable(__torch__.torch.classes.profiling._ScriptProfile _0) -> NoneType _0
2022-06-03T00:31:28.4359248Z processing existing schema:  _dump_stats(__torch__.torch.classes.profiling._ScriptProfile _0) -> __torch__.torch.classes.profiling.SourceStats[] _0
2022-06-03T00:31:28.4360379Z processing existing schema:  __init__(__torch__.torch.classes.dist_rpc.WorkerInfo _0, str _1, int _2) -> NoneType _0
2022-06-03T00:31:28.4361196Z The PR is introducing backward incompatible changes to the operator library. Please contact PyTorch team to confirm whether this change is wanted or not. 
2022-06-03T00:31:28.4361568Z 
2022-06-03T00:31:28.4361642Z Broken ops: [
2022-06-03T00:31:28.4362147Z 	aten::_linalg_svd(Tensor A, bool full_matrices=False, bool compute_uv=True, *, str? driver=None) -> (Tensor U, Tensor S, Tensor Vh)
2022-06-03T00:31:28.4362729Z 	aten::_linalg_svd.U(Tensor A, bool full_matrices=False, bool compute_uv=True, *, str? driver=None, Tensor(a!) U, Tensor(b!) S, Tensor(c!) Vh) -> (Tensor(a!) U, Tensor(b!) S, Tensor(c!) Vh)
2022-06-03T00:31:28.4363215Z 	aten::linalg_svd(Tensor A, bool full_matrices=True, *, str? driver=None) -> (Tensor U, Tensor S, Tensor Vh)
2022-06-03T00:31:28.4363774Z 	aten::linalg_svd.U(Tensor A, bool full_matrices=True, *, str? driver=None, Tensor(a!) U, Tensor(b!) S, Tensor(c!) Vh) -> (Tensor(a!) U, Tensor(b!) S, Tensor(c!) Vh)
2022-06-03T00:31:28.4364178Z 	aten::linalg_svdvals(Tensor A, *, str? driver=None) -> Tensor
2022-06-03T00:31:28.4364543Z 	aten::linalg_svdvals.out(Tensor A, *, str? driver=None, Tensor(a!) out) -> Tensor(a!)
2022-06-03T00:31:28.4364763Z ]
2022-06-03T00:31:28.5508838Z + cleanup


@facebook-github-bot added the `oncall: distributed` label Jun 1, 2022
awgu pushed a commit that referenced this pull request Jun 1, 2022
ghstack-source-id: 761c531
Pull Request resolved: #78599
@awgu changed the title from "[FSDP] Allow diff optim_input across ranks" to "[FSDP] Allow different optim_input orders across ranks" Jun 1, 2022
@rohan-varma self-requested a review June 1, 2022 15:10
rohan-varma (Contributor) left a comment:

LGTM! Let's test to ensure that it is fixed for the use case we're looking at. Thanks so much for the quick fix!

raise RuntimeError(
    "FSDP currently requires each rank to have at least the "
    "optimizer states needed by rank 0's optimizer but some ranks "
    "are missing some of those states"
)
A reviewer (Contributor) commented:

Will it also be useful to log the missing keys?

awgu (Collaborator, Author) replied:

I can add that at the cost of an `all_gather_object()`.

awgu (Collaborator, Author) commented:

Update: The error now looks like:

RuntimeError: FSDP currently requires each rank to have at least the optimizer states needed by rank 0's optimizer but some ranks are missing some of those states
Rank 1 is missing states for the parameters: [('block2.2.weight', 'block2.2.bias_module0.bias', 'block2.2.bias_module1.bias')]
Rank 2 is missing states for the parameters: [('block2.2.weight', 'block2.2.bias_module0.bias', 'block2.2.bias_module1.bias')]
Rank 3 is missing states for the parameters: [('block2.2.weight', 'block2.2.bias_module0.bias', 'block2.2.bias_module1.bias')]
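
As a rough sketch of that approach (assumed helper and variable names, not the PR's actual code), each rank could contribute its missing unflattened parameter names via `all_gather_object()`, and the error message could then be assembled from the gathered lists:

```python
import torch.distributed as dist

def _raise_missing_states_error(missing_keys, group=None):
    """Hypothetical helper: `missing_keys` is this rank's list of
    unflattened-parameter-name tuples that have no optimizer state."""
    world_size = dist.get_world_size(group)
    gathered = [None for _ in range(world_size)]
    # The extra collective mentioned above.
    dist.all_gather_object(gathered, missing_keys, group=group)
    lines = [
        f"Rank {rank} is missing states for the parameters: {keys}"
        for rank, keys in enumerate(gathered) if keys
    ]
    # Called only after a mismatch has been detected, so raise unconditionally.
    raise RuntimeError(
        "FSDP currently requires each rank to have at least the "
        "optimizer states needed by rank 0's optimizer but some ranks "
        "are missing some of those states\n" + "\n".join(lines)
    )
```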

awgu pushed a commit that referenced this pull request Jun 1, 2022
ghstack-source-id: 191ec79
Pull Request resolved: #78599
awgu pushed a commit that referenced this pull request Jun 3, 2022
ghstack-source-id: bdfdc48
Pull Request resolved: #78599
awgu (Collaborator, Author) commented Jun 3, 2022

Backward compatibility error due to broken ops seems unrelated.

awgu (Collaborator, Author) commented Jun 3, 2022

@pytorchbot merge

github-actions bot (Contributor) commented Jun 3, 2022

Hey @awgu.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

@awgu added the `release notes: distributed (fsdp)` and `topic: improvements` labels Jun 3, 2022
facebook-github-bot pushed a commit that referenced this pull request Jun 3, 2022
Summary:
Pull Request resolved: #78599

Approved by: https://github.com/rohan-varma

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/4615738a3d5aee256a9f1929c846dcae7d20a041

Reviewed By: rohan-varma

Differential Revision: D36798482

fbshipit-source-id: de05b37db8ed41b6cf11a9fc526a0e03a98f570d
zhaojuanmao (Contributor) left a comment:

cc @fegin as sharded optimizer states are built on top of it

@facebook-github-bot deleted the gh/awgu/51/head branch June 6, 2022 14:17
Labels: cla signed, Merged, oncall: distributed, release notes: distributed (fsdp), topic: improvements