[FSDP][Perf] Do not call `pad` in no-padding case #88769

Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88769
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1761e6a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: da8c8378455e723a42b6d4b23145b1536331eee9 Pull Request resolved: #88769
- Calling `F.pad()` issues a pad kernel from the CPU even if there is no padding needed, which can incur some non-negligible overhead. This PR removes that unnecessary call for the no-padding case.
- This PR also does not zero the newly allocated sharded gradient tensor before the reduce-scatter if `use_orig_params=True`, because there is no need: the reduce-scatter fills the tensor anyway, and we do not care about the values in the padding. For `use_orig_params=False`, the padding is exposed to the user, so we preserve the existing semantics of zeroing it. I left a to-do to follow up, since we may optimize that.

[ghstack-poisoned]
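To make the first point concrete, here is a minimal sketch of the guard, assuming a 1D flat tensor that must be padded to a multiple of the world size; `pad_to_world_size` and its shapes are illustrative, not the actual FSDP internals:

```python
# Minimal sketch, not the actual FSDP code: skip F.pad() entirely when the
# tensor already divides evenly across ranks.
import torch
import torch.nn.functional as F

def pad_to_world_size(tensor: torch.Tensor, world_size: int) -> torch.Tensor:
    numel_to_pad = (world_size - tensor.numel() % world_size) % world_size
    if numel_to_pad == 0:
        # No-padding case: returning the tensor as-is avoids launching a
        # pad kernel from the CPU for what would be a no-op.
        return tensor
    return F.pad(tensor, [0, numel_to_pad])
```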
ghstack-source-id: 4a1cca1ef167964173892bc16cf54a5716c0191e Pull Request resolved: #88769
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed
Reason: New commits were pushed while merging. Please rerun the merge command.
Details for Dev Infra team: raised by workflow job.
ghstack-source-id: da8c8378455e723a42b6d4b23145b1536331eee9 Pull Request resolved: pytorch#88769
ghstack-source-id: 0f8c479efccb74b0f1f0e43e940bc85de774e0a5 Pull Request resolved: pytorch#88769
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This allocation happened previously in the post-backward stream, which induced cross-stream memory fragmentation. (Only the sharded gradient needs to be allocated in the post-backward stream, not the unsharded gradient.) For T5-11B on 2 nodes and batch size 6, eliminating the unnecessary […]
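As a rough sketch of how these pieces fit together (the function and stream names are illustrative, not the actual FSDP internals): the sharded gradient can be allocated in the post-backward stream without a zero-fill when `use_orig_params=True`, since the reduce-scatter overwrites every element anyway.

```python
# Hedged sketch, not the actual FSDP code: only the sharded gradient is
# allocated in the post-backward stream, and for use_orig_params=True it can
# use torch.empty because dist.reduce_scatter_tensor writes every element.
import torch
import torch.distributed as dist

post_backward_stream = torch.cuda.Stream()

def reduce_scatter_grad(padded_unsharded_grad: torch.Tensor,
                        use_orig_params: bool) -> torch.Tensor:
    world_size = dist.get_world_size()
    shard_numel = padded_unsharded_grad.numel() // world_size
    with torch.cuda.stream(post_backward_stream):
        # For use_orig_params=False the padding is user-visible, so keep zeroing it.
        alloc = torch.empty if use_orig_params else torch.zeros
        sharded_grad = alloc(shard_numel,
                             dtype=padded_unsharded_grad.dtype,
                             device=padded_unsharded_grad.device)
        dist.reduce_scatter_tensor(sharded_grad, padded_unsharded_grad)
    return sharded_grad
```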
Pull Request resolved: pytorch#88769 Approved by: https://github.com/zhaojuanmao
Stack from ghstack:
- #88453 [Dynamo][FSDP] Migrate to `ModuleWrapPolicy`
- #88450 [FSDP] Introduce `ModuleWrapPolicy` for simplicity
- #88769 [FSDP][Perf] Do not call `pad` in no-padding case (this PR)