[FSDP] Use correct handle training state when prefetching #98249
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/98249
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 58bfb34. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Unittest?
```diff
@@ -1003,11 +1015,25 @@ def _prefetch_handles(
         return
     handles_to_prefetch = _get_handles_to_prefetch(state, current_handles_key)
     for handles_key in handles_to_prefetch:
+        # Temporarily emulate the training state while calling `_unshard`
```
what improvement does doing this provide if there's no functionality difference?
For `_use_unsharded_views(as_params=...)`, `as_params` depends on the handle's `_training_state`.

For example, suppose we are backward prefetching with `BACKWARD_PRE` and handle `h1` prefetches handle `h2`:
- `h2._training_state` is `IDLE`.
- `h1._training_state` is `BACKWARD_PRE`.
- With this PR's change, we set `h2._training_state = BACKWARD_PRE` while we prefetch the unshard, so that `as_params` in `_use_unsharded_views()` will correctly be `False` instead of `True`.
- Without this change, `_use_unsharded_views()` will use `as_params=True`, which is actually incorrect for reentrant checkpointing.

I need to investigate more, but I think our FSDP <> AC unit tests were too weak to catch this bug, which I introduced in #97981. Before #97981, we would simply override the prefetched unshard with a second call to `_use_unsharded_views(as_params=...)` using the correct `as_params`. After #97981, we skip that second overriding call. With this PR, we still skip the second call, but that is no longer an issue since the first, prefetched `_use_unsharded_views()` uses the correct training state and hence the correct `as_params`.
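A minimal sketch of this save/override/restore pattern (the helper `_emulate_training_state` and the local enum are hypothetical stand-ins for the private FSDP internals, not the actual implementation in `_prefetch_handles`):

```python
import contextlib
from enum import Enum, auto

class HandleTrainingState(Enum):
    # Local stand-in for FSDP's private handle-state enum.
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()

@contextlib.contextmanager
def _emulate_training_state(handle, training_state):
    # Hypothetical helper: temporarily set the handle's training state so
    # that the prefetched unshard computes `as_params` as if it were not
    # prefetched, then restore the previous state afterwards.
    prev_state = handle._training_state
    handle._training_state = training_state
    try:
        yield
    finally:
        handle._training_state = prev_state

# In the example above, `h1` (in BACKWARD_PRE) prefetches `h2` (IDLE):
#     with _emulate_training_state(h2, HandleTrainingState.BACKWARD_PRE):
#         h2.unshard()  # `_use_unsharded_views()` now sees BACKWARD_PRE
```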
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
- `_use_sharded_views()` for `SHARD_GRAD_OP` #98250
- `requires_grad` for `use_orig_params=True` #98221

This PR ensures that when prefetching a `FlatParamHandle.unshard()`, we temporarily set the `FlatParamHandle._training_state` to the expected training state as if the `unshard()` were not prefetched, since the `as_params` argument to `_use_unsharded_views()` depends on the handle's training state.
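To make that dependence concrete, here is a hedged sketch of the `as_params` predicate implied by the review thread above (a simplified, self-contained stand-in; treat the exact rule as an assumption about the private FSDP source):

```python
from enum import Enum, auto

class HandleTrainingState(Enum):
    # Local stand-in for FSDP's private handle-state enum.
    IDLE = auto()
    FORWARD = auto()
    BACKWARD_PRE = auto()
    BACKWARD_POST = auto()

def _as_params(training_state: HandleTrainingState) -> bool:
    # Consistent with the thread: an unshard that happens outside of
    # forward and pre-backward registers the unsharded views as
    # nn.Parameters (`as_params=True`); otherwise it uses plain Tensors.
    in_forward = training_state == HandleTrainingState.FORWARD
    in_pre_backward = training_state == HandleTrainingState.BACKWARD_PRE
    return not in_forward and not in_pre_backward

# A prefetched pre-backward unshard left in IDLE would wrongly get
# `as_params=True`; emulating BACKWARD_PRE yields the correct `False`.
assert _as_params(HandleTrainingState.IDLE) is True
assert _as_params(HandleTrainingState.BACKWARD_PRE) is False
```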