[FSDP2] Factored out `MLPStack` to de-dup code (#126070)
🔗 See artifacts and rendered test results at hud.pytorch.org/pr/126070
✅ No failures as of commit 7e02f46 with merge base afda668
```
device_mesh=tp_mesh,
# Leave the layer norm as implicitly replicated
parallelize_plan={
    # Pass `use_local_output=False` to keep as DTensor to preserve
```
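For context, a minimal sketch of a TP plan along these lines (the submodule names `"in_proj"` and `"out_proj"` are assumptions for illustration, not the actual test's names):

```python
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel

# Hypothetical plan fragment; submodule names are illustrative only.
parallelize_plan = {
    # `use_local_output=False` keeps the module outputs as DTensors
    # rather than converting them back to plain local tensors.
    "in_proj": ColwiseParallel(use_local_output=False),
    "out_proj": RowwiseParallel(use_local_output=False),
    # The layer norm is omitted from the plan, so its weight stays a
    # plain torch.Tensor and is implicitly replicated.
}
```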
Maybe we should also add `SequenceParallel(sequence_dim=0)` to the parallelize_plan to ensure the model params are all 1D DTensors. @wz337 was hitting an issue when enabling the foreach optimizer by default: we would need params to be 1D DTensors on the TP mesh dim so that the foreach ops receive correctly 2D-sharded inputs.
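The suggestion above might look like the following plan fragment (a hedged sketch; the `"norm"` key is an assumed submodule name):

```python
from torch.distributed.tensor.parallel import SequenceParallel

# Hypothetical: adding the norm to the plan so its weight becomes an
# explicit 1D DTensor on the TP mesh dim (instead of an implicitly
# replicated plain tensor), which the foreach optimizer ops can consume.
extra_plan = {
    "norm": SequenceParallel(sequence_dim=0),
}
```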
I think I originally had it like this to test the implicitly replicated norm weight (as opposed to making it explicitly a DTensor). If I am following correctly, we can migrate to `SequenceParallel(sequence_dim=0)` later if needed.
I think implicit replication won't work for 2D cases, as it doesn't know how to implicitly replicate a 1D DTensor to 2D (it currently only works for `torch.Tensor` to DTensor). I will turn off foreach for the 2D test and submit a follow-up PR to update this.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
This simplifies the test a bit.

**Context**

Option 1: The ref model is data parallel. Each rank's ref model receives its local batch. We manually all-reduce the gradients and divide them by world size to match DDP/FSDP semantics.

Option 2: The ref model is not data parallel. Each rank's ref model receives the same global batch. We manually divide the ref model's gradients by world size to match DDP/FSDP semantics. (Note that all ranks have the same ref model and the same global batch.)

All of our other unit tests follow Option 1, which is simpler and a more direct comparison against our claimed semantics. This PR switches the gradient accumulation test from Option 2 to Option 1.

Pull Request resolved: #126161
Approved by: https://github.com/wanchaol
ghstack dependencies: #126067, #126070
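The Option 1 semantics can be illustrated with a small pure-Python sketch (no torch, no process group; the "ranks" are just lists): each simulated rank computes gradients on its local batch, then we all-reduce (sum) across ranks and divide by world size, matching DDP/FSDP gradient-averaging semantics.

```python
# Hypothetical simulation of Option 1's manual gradient averaging.
def all_reduce_mean(per_rank_grads):
    """Sum each parameter's gradient across ranks, then divide by world size."""
    world_size = len(per_rank_grads)
    num_params = len(per_rank_grads[0])
    return [
        sum(rank[p] for rank in per_rank_grads) / world_size
        for p in range(num_params)
    ]

# Two ranks, each holding gradients for two parameters from its local batch.
per_rank_grads = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(per_rank_grads))  # [2.0, 3.0]
```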
**Context**

For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).

- FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
- FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`).

For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).

- FSDP2 offers (1) without any intervention, as mentioned above.
- FSDP2 offers (3) via `module.set_requires_gradient_sync()`, as mentioned above.
- FSDP2 offers (2) via `module.set_requires_all_reduce()`, similar to `set_requires_gradient_sync()`.

**Overview**

For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:

```
for microbatch_idx, microbatch in enumerate(microbatches):
    is_last_microbatch = microbatch_idx == len(microbatches) - 1
    model.set_requires_all_reduce(is_last_microbatch)
    # Run forward/backward
```

This PR also makes the minor change of making the `recurse: bool` argument in these setter methods keyword-only.

**Developer Notes**

We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity: extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back into the last microbatch's reduce output.

Pull Request resolved: #126166
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #126067, #126070, #126161
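The two gradient-accumulation flavors from the Context section can be contrasted with a pure-Python sketch (no torch; the "reduction" is just a sum over simulated ranks). Both flavors produce the same final gradient; they differ only in when the communication happens.

```python
# Hypothetical simulation: each inner list is one microbatch's per-rank grad.
def reduce_each_microbatch(microbatch_grads, world_size):
    """Flavor (1): reduce after every microbatch, accumulate the reduced grads."""
    acc = 0.0
    for per_rank in microbatch_grads:
        acc += sum(per_rank) / world_size  # one "reduce-scatter" per microbatch
    return acc

def reduce_only_last(microbatch_grads, world_size):
    """Flavor (2): accumulate unreduced grads locally, reduce once at the end."""
    local_sums = [0.0] * world_size
    for per_rank in microbatch_grads:
        for r in range(world_size):
            local_sums[r] += per_rank[r]  # no communication per microbatch
    return sum(local_sums) / world_size  # single reduction at the end

# 3 microbatches, 2 ranks: both flavors yield the same final gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
assert reduce_each_microbatch(grads, 2) == reduce_only_last(grads, 2)
```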
Stack from ghstack (oldest at bottom):

- `set_all_reduce_gradients=False` for HSDP (#126166)
- Factored out `MLPStack` to de-dup code (#126070)
- `CommDebugMode` in grad acc test (#126067)

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k