[FSDP][Perf] Pre-allocate sharded grad in default stream #90617

awgu · 2022-12-10T15:55:00Z

Stack from ghstack:

#90619 [FSDP][Perf] Pre-allocate full prec sharded param
#90618 [FSDP][Easy][BE] Minor cleanup of post-backward logic
#90617 [FSDP][Perf] Pre-allocate sharded grad in default stream
#90616 [FSDP][Perf] Pre-allocate sharded grad in default stream; save one copy when downcasting grad (listed in the stack as "[FSDP][Perf] Save one copy when downcasting grad")
#90614 [FSDP][Perf] Pre-allocate padded unsharded grad in default stream
#90631 [FSDP] Sanitize HandleConfig for mixed precision
#90615 [FSDP] Tighten post-bwd cast to reduce_dtype
#90622 [FSDP][Easy] Move to _storage() in test file
#90611 [FSDP] Save _stream_to_name for debugging
#90562 [Reland][FSDP] Another fix for DTensor, use_orig_params=True

Every sharded strategy allocates a (padded) sharded gradient as the target tensor for the reduce-scatter. This PR moves that allocation from the post-backward stream to the default stream.

Minor: This PR also switches from `state.device` to `handle.device`, which makes no semantic difference; it is simply better to lower as much logic as possible to the `handle` when appropriate.
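
The description does not spell out why the allocating stream matters. One plausible reason, assuming the behavior of PyTorch's CUDA caching allocator (a freed block is reused only for later requests on the stream it was allocated on), is that memory allocated in the post-backward stream cannot be recycled by default-stream work. The sketch below is a pure-Python toy model of that rule, not FSDP code: `ToyStreamAllocator`, `reserved_after_iterations`, the stream labels, and the sizes are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


class ToyStreamAllocator:
    """Toy model (assumption): freed blocks are cached per stream and reused only there."""

    def __init__(self):
        self.free_blocks = defaultdict(list)  # stream label -> cached block sizes
        self.reserved = 0  # total bytes requested from the simulated device

    def alloc(self, stream, size):
        cached = self.free_blocks[stream]
        for i, block in enumerate(cached):
            if block >= size:
                return cached.pop(i)  # cache hit: reuse a block from this stream
        self.reserved += size  # cache miss: reserve new device memory
        return size

    def free(self, stream, block):
        # The block goes back to the cache of the stream it was allocated on.
        self.free_blocks[stream].append(block)


def reserved_after_iterations(grad_stream, iterations=3, shard_bytes=1024):
    """Simulate a few backward passes and return the total reserved bytes."""
    allocator = ToyStreamAllocator()
    for _ in range(iterations):
        # A temporary of the same size is allocated and freed in the default stream.
        tmp = allocator.alloc("default", shard_bytes)
        allocator.free("default", tmp)
        # The sharded gradient is allocated in `grad_stream` and freed later.
        grad = allocator.alloc(grad_stream, shard_bytes)
        allocator.free(grad_stream, grad)
    return allocator.reserved


print(reserved_after_iterations("default"))        # 1024
print(reserved_after_iterations("post_backward"))  # 2048
```

In this toy model, allocating the gradient in the default stream lets it reuse the block that the temporary just released, while allocating it in a separate stream forces a second reservation. Real workloads involve many block sizes and stream synchronization, so the actual savings depend on the model and are not quantified here.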

pytorch-bot · 2022-12-10T15:55:03Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/90617

❌ 1 Failures, 10 Pending

As of commit 6007ced:

The following jobs have failed:

linux-bionic-cuda11.6-py3.10-gcc7 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu)

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu · 2022-12-11T02:41:43Z

This PR was accidentally absorbed into the previous PR in the stack (#90616) while resolving rebase conflicts.

[FSDP][Perf] Pre-allocate sharded grad in default stream

528cfe3

awgu requested review from mrshenli, zhaojuanmao, pritamdamania87, rohan-varma, H-Huang, kwen2501 and wanchaol as code owners December 10, 2022 15:55

pytorch-bot bot added the `release notes: distributed (fsdp)` label (release notes category) Dec 10, 2022

Update on "[FSDP][Perf] Pre-allocate sharded grad in default stream"

542eda9

awgu mentioned this pull request Dec 10, 2022

[FSDP][Easy] Move to _storage() in test file #90622

Closed

Update on "[FSDP][Perf] Pre-allocate sharded grad in default stream"

8bd790e

awgu mentioned this pull request Dec 10, 2022

[FSDP] Sanitize HandleConfig for mixed precision #90631

Closed

awgu closed this Dec 11, 2022

facebook-github-bot deleted the gh/awgu/255/head branch June 8, 2023 15:31
