
[FSDP] Limit all gather after pre-unshard #89057

Closed · wants to merge 2 commits

Conversation

@awgu (Contributor) commented Nov 15, 2022

Stack from ghstack:

To reuse memory when allocating the unsharded `FlatParameter` in the unshard stream, we only need to block the CPU thread on the preceding free event (i.e. `event.synchronize()`) before allocating the unsharded memory, which happens in `handle.unshard()`. Notably, this can be done after the pre-unshard logic, which at most performs _sharded_ allocations (low precision shard or H2D sharded `FlatParameter` copy) in its own pre-unshard stream. This enables the pre-unshard to overlap with any pending ops.
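
To illustrate the ordering described above, here is a minimal sketch (not FSDP's actual internals; the stream objects, tensor sizes, and the `unshard_step` helper are illustrative stand-ins for FSDP's pre-unshard stream, unshard stream, and `handle.unshard()`):

```python
import torch

# Stand-ins for FSDP's pre-unshard and unshard streams (names are illustrative).
pre_unshard_stream = torch.cuda.Stream()
unshard_stream = torch.cuda.Stream()

def unshard_step(prev_free_event: torch.cuda.Event) -> torch.Tensor:
    """`prev_free_event` is recorded when the previously unsharded memory was freed."""
    # 1) Pre-unshard: only sharded-size allocations (e.g. low precision shard or
    #    H2D copy of the sharded FlatParameter), issued without blocking the CPU
    #    thread, so it can overlap with any pending GPU work.
    with torch.cuda.stream(pre_unshard_stream):
        sharded = torch.empty(1 << 20, device="cuda")  # stand-in for sharded data

    # 2) Only now block the CPU thread on the preceding free event, so the caching
    #    allocator can reuse the freed block for the unsharded allocation below.
    prev_free_event.synchronize()

    # 3) Unshard: allocate the unsharded FlatParameter and all-gather into it.
    with torch.cuda.stream(unshard_stream):
        unsharded = torch.empty(8 << 20, device="cuda")  # stand-in for unsharded memory
        # torch.distributed.all_gather_into_tensor(unsharded, sharded) would run here.
    return unsharded
```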

With this change, I believe that we should use `limit_all_gathers=True` all the time to stay true to FSDP's proposed memory semantics.

If a user wants to set `limit_all_gathers=False`, that would mean they want to overlap ops issued after the unshard logic's all-gather with ops that are pending at the time when FSDP _would_ block the CPU thread via `event.synchronize()`.

- If the user is willing to forgo reusing memory for that all-gather, then they may as well have applied `NO_SHARD` and optionally ZeRO-1 (if this niche is important, then maybe we should consider hardening ZeRO-1). This is because the unsharded memory for that all-gather now contributes additionally to peak memory, since it cannot reuse previously freed memory.
- If the user wants to reuse memory for that all-gather, then we need to block the CPU thread. There is no way around that given the caching allocator's semantics.
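
For reference, a minimal usage sketch of the `limit_all_gathers=True` recommendation above (assuming a process group has already been initialized, e.g. via `torchrun`, and the wrapped module is just a toy placeholder):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Keep the CPU-thread rate limiting on so that unsharded memory can be reused,
# staying true to FSDP's proposed memory semantics.
model = FSDP(
    nn.Linear(1024, 1024).cuda(),
    limit_all_gathers=True,
)
```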

pytorch-bot bot commented Nov 15, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/89057

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 47af6d0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Nov 15, 2022
@awgu awgu added the topic: improvements topic category label Nov 15, 2022
awgu added a commit to awgu/pytorch that referenced this pull request Nov 23, 2022
ghstack-source-id: 72e1b5f3d523e6fde6ca183af492cee947adbb43
Pull Request resolved: pytorch#89057
awgu added a commit to awgu/pytorch that referenced this pull request Nov 29, 2022
ghstack-source-id: 72e1b5f3d523e6fde6ca183af492cee947adbb43
Pull Request resolved: pytorch#89057
@mrshenli (Contributor) left a comment

LGTM

@awgu awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 29, 2022
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
Pull Request resolved: pytorch#89057
Approved by: https://github.com/mrshenli
@facebook-github-bot facebook-github-bot deleted the gh/awgu/200/head branch June 8, 2023 15:27
Labels
- ciflow/trunk
- release notes: distributed (fsdp)
- topic: improvements