-
Notifications
You must be signed in to change notification settings - Fork 21.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "Simplify handle indexing (#105006)" #105984
Conversation
This reverts commit 429d45f. [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/105984
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ 1 Unrelated FailureAs of commit 33e2d2c: UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stamped to unblock
This reverts commit 429d45f. Unfortunately, #105006 broke backward prefetching (where backward prefetching working correctly was not captured in our unit tests). I need more time to dig into this (tomorrow), but I think the issue is related to: 429d45f#diff-9a6937168d232432c34c2c4605b96f3147afa2786e287f74b6074b20aa5980e6R143-R146 Follow-ups: 1. Investigate this thoroughly 2. Add unit tests to capture backward prefetch functionality [ghstack-poisoned]
This reverts commit 429d45f. ghstack-source-id: 14ee53b0ac632ccc56ebb5b544a204dba2d9f44c Pull Request resolved: pytorch#105984
Failure is unrelated: |
@pytorchbot merge -f "unrelated failure" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1 similar comment
@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: ec6826e16df19b11b7ff3c48c28c17d1a045c211 Pull Request resolved: #107058
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: f6119e94d29a8feaf6248cbe3c62a0a1df0e2387 Pull Request resolved: #107058
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: 2ee4bc10963a4bff0f39625dadee716ddaab2df0 Pull Request resolved: #107058
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: aae553db4518bb223c79a8c86c3a0767cb8e51b3 Pull Request resolved: #107058
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: 69aa41a50b24c21498c4ade9dfe47baa0b63b834 Pull Request resolved: #107058
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect params from prefetch handle to follow order DECODER_PARAM_FQNS, ENCODER_PARAM_FQNS * backward_prefetch = None: expect None prefetch handle [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: f0c358c27f89f004bba1953125ee42d7d5dad29d Pull Request resolved: #107058
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: 9926fd24f22b14e32261979b1e4a7f7cbd56e914 Pull Request resolved: #107058
…ectly with unit test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect params from prefetch handle to follow order DECODER_PARAM_FQNS, ENCODER_PARAM_FQNS * backward_prefetch = None: expect None prefetch handle [ghstack-poisoned]
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect params from prefetch handle to follow order DECODER_PARAM_FQNS, ENCODER_PARAM_FQNS * backward_prefetch = None: expect None prefetch handle [ghstack-poisoned]
…ectly with unit test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched for backward_prefetch = BackwardPrefetch.BACKWARD_PRE * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * pre-backward hook order: root -> decoder 5...0 -> encoder 5...0 * prefetch order: decoder 5...0 -> encoder 5...0 -> None * when current_handle=encoder 0, _get_handle_to_prefetch returns None for backward_prefetch = BackwardPrefetch.BACKWARD_POST * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root * prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None [ghstack-poisoned]
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched for backward_prefetch = BackwardPrefetch.BACKWARD_PRE * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * pre-backward hook order: root -> decoder 5...0 -> encoder 5...0 * prefetch order: decoder 5...0 -> encoder 5...0 -> None * when current_handle=encoder 0, _get_handle_to_prefetch returns None for backward_prefetch = BackwardPrefetch.BACKWARD_POST * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root * prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: 527d658362d51520fa30177521237cfd796f5ad0 Pull Request resolved: #107058
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get total function call and total non-None handle * BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE: expect total non-None handle > 0 * backward_prefetch = None: expect total function call = 0 ghstack-source-id: 34b907180ac5cd649aba940fa8bbdee2e5b5e57e Pull Request resolved: #107058
…ectly with unit test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched for backward_prefetch = BackwardPrefetch.BACKWARD_PRE * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * pre-backward hook order: root -> decoder 5...0 -> encoder 5...0 * prefetch order: decoder 5...0 -> encoder 5...0 -> None * when current_handle=encoder 0, _get_handle_to_prefetch returns None for backward_prefetch = BackwardPrefetch.BACKWARD_POST * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root * prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None [ghstack-poisoned]
…test" issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched for backward_prefetch = BackwardPrefetch.BACKWARD_PRE * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * pre-backward hook order: root -> decoder 5...0 -> encoder 5...0 * prefetch order: decoder 5...0 -> encoder 5...0 -> None * when current_handle=encoder 0, _get_handle_to_prefetch returns None for backward_prefetch = BackwardPrefetch.BACKWARD_POST * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root * prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None [ghstack-poisoned]
issue resolved: #105984 context: * CI did not catch the commit that breaks backward_prefetch #105006 * we had an action item to add unit test to prevent similar cases: #105984 what's included in this unit test * monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched for backward_prefetch = BackwardPrefetch.BACKWARD_PRE * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * pre-backward hook order: root -> decoder 5...0 -> encoder 5...0 * prefetch order: decoder 5...0 -> encoder 5...0 -> None * when current_handle=encoder 0, _get_handle_to_prefetch returns None for backward_prefetch = BackwardPrefetch.BACKWARD_POST * state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root * post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root * prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None Pull Request resolved: #107058 Approved by: https://github.com/awgu
Stack from ghstack (oldest at bottom):
record_function
for explicit prefetching #105985This reverts commit 429d45f.
Unfortunately, #105006 broke backward prefetching (where backward prefetching working correctly was not captured in our unit tests).
I need more time to dig into this (tomorrow), but I think the issue is related to:
429d45f#diff-9a6937168d232432c34c2c4605b96f3147afa2786e287f74b6074b20aa5980e6R143-R146
Follow-ups:
Differential Revision: D47816273