[FSDP][Perf] Do not call pad in no-padding case #88769

Closed
wants to merge 4 commits

Conversation

@awgu (Contributor) commented Nov 9, 2022

Stack from ghstack:

  • Calling `F.pad()` issues a pad kernel from the CPU even if no padding is needed, which can incur non-negligible overhead. This PR removes that unnecessary call in the no-padding case (see the sketch after this list).
  • This PR also no longer zeros the newly allocated sharded gradient tensor before the reduce-scatter when `use_orig_params=True`, because there is no need: the reduce-scatter fills the tensor anyway, and we do not care about the values in the padding. For `use_orig_params=False`, the padding is exposed to the user, so we preserve the existing semantics of zeroing it. I left a to-do to follow up, since we may optimize that later.
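
For illustration, a minimal sketch of both changes on a flattened 1-D gradient. The helper name, the `world_size` parameter, and the rounding are hypothetical and not the actual FSDP code:

```python
import torch
import torch.nn.functional as F

def prepare_grad_for_reduce_scatter(grad: torch.Tensor, world_size: int, use_orig_params: bool):
    """Hypothetical sketch of the two optimizations (not the actual FSDP code paths)."""
    padded_numel = -(-grad.numel() // world_size) * world_size  # round up to a multiple of world_size
    pad_numel = padded_numel - grad.numel()
    if pad_numel > 0:
        grad = F.pad(grad, [0, pad_numel])  # only issue the pad kernel when padding is needed
    # else: skip F.pad entirely -- F.pad(grad, [0, 0]) would still launch a kernel and copy
    shard_numel = padded_numel // world_size
    if use_orig_params:
        # The reduce-scatter overwrites every element and the padding is not user-visible,
        # so an uninitialized buffer suffices.
        sharded_grad = torch.empty(shard_numel, dtype=grad.dtype, device=grad.device)
    else:
        # The padding is exposed to the user, so preserve the zeroing semantics.
        sharded_grad = torch.zeros(shard_numel, dtype=grad.dtype, device=grad.device)
    return grad, sharded_grad
```

In this sketch, the returned `grad` would be the reduce-scatter input and `sharded_grad` its output buffer.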

@pytorch-bot (bot) commented Nov 9, 2022

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/88769

Note: Links to docs will display an error until the docs builds have been completed.

✅ No failures as of commit 1761e6a. 💚 Looks good so far! 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu added a commit that referenced this pull request Nov 9, 2022
ghstack-source-id: da8c8378455e723a42b6d4b23145b1536331eee9
Pull Request resolved: #88769
awgu added the `topic: not user facing` label Nov 9, 2022
awgu added the `ciflow/trunk` label Nov 10, 2022
awgu added a commit that referenced this pull request Nov 10, 2022
ghstack-source-id: 4a1cca1ef167964173892bc16cf54a5716c0191e
Pull Request resolved: #88769
@awgu (Contributor, Author) commented Nov 10, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot (Collaborator) commented
Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.


awgu added a commit to awgu/pytorch that referenced this pull request Nov 10, 2022
ghstack-source-id: da8c8378455e723a42b6d4b23145b1536331eee9
Pull Request resolved: pytorch#88769
awgu added a commit to awgu/pytorch that referenced this pull request Nov 10, 2022
ghstack-source-id: 0f8c479efccb74b0f1f0e43e940bc85de774e0a5
Pull Request resolved: pytorch#88769
@awgu (Contributor, Author) commented Nov 10, 2022

@pytorchbot merge

@pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@awgu (Contributor, Author) commented Nov 15, 2022

`F.pad(tensor, [0, 0])` actually copies the tensor:

```python
>>> import torch
>>> t = torch.randn((3, 3))
>>> t.data_ptr()
93936708252800
>>> torch.nn.functional.pad(t, [0, 0]).data_ptr()
93936640304064
```

This allocation happened previously in the post-backward stream, which induced cross-stream memory fragmentation. (Only the sharded gradient needs to be allocated in the post-backward stream, not the unsharded gradient.)
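
As a rough illustration of that point (hypothetical sizes and a simplified stream setup, not FSDP's actual code), only the sharded output buffer needs to be allocated in the post-backward stream; the no-op `F.pad` copy was an additional full-size allocation in that stream:

```python
import torch

# Requires a CUDA device; sizes are made up for illustration.
post_backward_stream = torch.cuda.Stream()
unsharded_grad = torch.randn(1 << 20, device="cuda")  # allocated in the default stream

with torch.cuda.stream(post_backward_stream):
    # Only the sharded gradient buffer needs to live in the post-backward stream.
    sharded_grad = torch.empty(unsharded_grad.numel() // 4, device="cuda")
    # Previously, F.pad(unsharded_grad, [0, 0]) here also produced a full-size copy
    # in this stream, and such cross-stream allocations fragment the caching
    # allocator's memory pool.
```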

For T5-11B on 2 nodes with batch size 6, eliminating the unnecessary `F.pad` call for the no-padding case decreases peak reserved memory by ~300 MB and reduces CUDA malloc retries from 3 to 0.
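
For reference, peak reserved memory and allocator retries can be read from `torch.cuda` memory stats; a minimal measurement sketch (the surrounding training loop is assumed):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the training iterations being measured ...

peak_reserved_mb = torch.cuda.max_memory_reserved() / 2**20
num_retries = torch.cuda.memory_stats().get("num_alloc_retries", 0)
print(f"peak reserved: {peak_reserved_mb:.0f} MiB, cudaMalloc retries: {num_retries}")
```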

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#88769
Approved by: https://github.com/zhaojuanmao
facebook-github-bot deleted the gh/awgu/196/head branch June 8, 2023 15:27
Labels: `ciflow/trunk` · `Merged` · `release notes: distributed (fsdp)` · `topic: not user facing`

3 participants