Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Simplify handle indexing (#105006)" #105984

Closed
wants to merge 2 commits into from

Conversation

awgu
Copy link
Contributor

@awgu awgu commented Jul 25, 2023

Stack from ghstack (oldest at bottom):

This reverts commit 429d45f.

Unfortunately, #105006 broke backward prefetching (where backward prefetching working correctly was not captured in our unit tests).

I need more time to dig into this (tomorrow), but I think the issue is related to:
429d45f#diff-9a6937168d232432c34c2c4605b96f3147afa2786e287f74b6074b20aa5980e6R143-R146

Follow-ups:

  1. Investigate this thoroughly
  2. Add unit tests to capture backward prefetch functionality

Differential Revision: D47816273

This reverts commit 429d45f.

[ghstack-poisoned]
awgu added a commit that referenced this pull request Jul 25, 2023
This reverts commit 429d45f.

ghstack-source-id: fdc129260f85a4f14a3a325d97625ef5a6c25f0b
Pull Request resolved: #105984
@awgu awgu requested a review from voznesenskym July 25, 2023 23:33
@pytorch-bot pytorch-bot bot added release notes: distributed (fsdp) release notes category labels Jul 25, 2023
@pytorch-bot
Copy link

pytorch-bot bot commented Jul 25, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/105984

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ 1 Unrelated Failure

As of commit 33e2d2c:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@awgu awgu added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 25, 2023
Copy link
Contributor

@fegin fegin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamped to unblock

This reverts commit 429d45f.

Unfortunately, #105006 broke backward prefetching (where backward prefetching working correctly was not captured in our unit tests).

I need more time to dig into this (tomorrow), but I think the issue is related to:
429d45f#diff-9a6937168d232432c34c2c4605b96f3147afa2786e287f74b6074b20aa5980e6R143-R146

Follow-ups:
1. Investigate this thoroughly
2. Add unit tests to capture backward prefetch functionality

[ghstack-poisoned]
@awgu awgu added the topic: not user facing topic category label Jul 26, 2023
awgu added a commit to awgu/pytorch that referenced this pull request Jul 26, 2023
This reverts commit 429d45f.

ghstack-source-id: 14ee53b0ac632ccc56ebb5b544a204dba2d9f44c
Pull Request resolved: pytorch#105984
@awgu
Copy link
Contributor Author

awgu commented Jul 26, 2023

@awgu
Copy link
Contributor Author

awgu commented Jul 26, 2023

@pytorchbot merge -f "unrelated failure"

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@awgu
Copy link
Contributor Author

awgu commented Jul 26, 2023

@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment
@awgu
Copy link
Contributor Author

awgu commented Jul 26, 2023

@awgu has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot facebook-github-bot deleted the gh/awgu/429/head branch July 29, 2023 14:16
voznesenskym added a commit that referenced this pull request Aug 1, 2023
This reverts commit a9a3c45.

fixes

[ghstack-poisoned]
voznesenskym added a commit that referenced this pull request Aug 1, 2023
This reverts commit a9a3c45.

fixes

ghstack-source-id: a025b0c5d5aeb7606266767ffaa3b6ef1e7f96a9
Pull Request resolved: #106357
voznesenskym added a commit that referenced this pull request Aug 3, 2023
This reverts commit a9a3c45.

fixes

ghstack-source-id: 8350887b112673b94b6a13f61106ad2650a7376a
Pull Request resolved: #106357

fix
voznesenskym added a commit that referenced this pull request Aug 3, 2023
This reverts commit a9a3c45.

fixes

ghstack-source-id: 0342a7c447a14d8a46f177962d2cb1ded41d0682
Pull Request resolved: #106357

fix

fix
weifengpy added a commit that referenced this pull request Aug 11, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 11, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: ec6826e16df19b11b7ff3c48c28c17d1a045c211
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 11, 2023
…test"

issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 11, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: f6119e94d29a8feaf6248cbe3c62a0a1df0e2387
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 11, 2023
…test"

issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 11, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: 2ee4bc10963a4bff0f39625dadee716ddaab2df0
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 11, 2023
…test"

issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 11, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: aae553db4518bb223c79a8c86c3a0767cb8e51b3
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 15, 2023
…test"

issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 15, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: 69aa41a50b24c21498c4ade9dfe47baa0b63b834
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 15, 2023
…test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect params from prefetch handle to follow order DECODER_PARAM_FQNS, ENCODER_PARAM_FQNS
* backward_prefetch = None: expect None prefetch handle

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 15, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: f0c358c27f89f004bba1953125ee42d7d5dad29d
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 23, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: 9926fd24f22b14e32261979b1e4a7f7cbd56e914
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 23, 2023
…ectly with unit test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect params from prefetch handle to follow order DECODER_PARAM_FQNS, ENCODER_PARAM_FQNS
* backward_prefetch = None: expect None prefetch handle

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 23, 2023
…test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect params from prefetch handle to follow order DECODER_PARAM_FQNS, ENCODER_PARAM_FQNS
* backward_prefetch = None: expect None prefetch handle

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 24, 2023
…ectly with unit test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched

for backward_prefetch = BackwardPrefetch.BACKWARD_PRE
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* pre-backward hook order: root -> decoder 5...0 -> encoder 5...0
* prefetch order: decoder 5...0 -> encoder 5...0 -> None
  * when current_handle=encoder 0, _get_handle_to_prefetch returns None

for backward_prefetch = BackwardPrefetch.BACKWARD_POST
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root
* prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None
  * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None
  * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 24, 2023
…test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched

for backward_prefetch = BackwardPrefetch.BACKWARD_PRE
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* pre-backward hook order: root -> decoder 5...0 -> encoder 5...0
* prefetch order: decoder 5...0 -> encoder 5...0 -> None
  * when current_handle=encoder 0, _get_handle_to_prefetch returns None

for backward_prefetch = BackwardPrefetch.BACKWARD_POST
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root
* prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None
  * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None
  * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 24, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: 527d658362d51520fa30177521237cfd796f5ad0
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 24, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and get
total function call and total non-None handle
* BackwardPrefetch.BACKWARD_POST and BackwardPrefetch.BACKWARD_PRE:
expect total non-None handle > 0
* backward_prefetch = None: expect total function call = 0

ghstack-source-id: 34b907180ac5cd649aba940fa8bbdee2e5b5e57e
Pull Request resolved: #107058
weifengpy added a commit that referenced this pull request Aug 24, 2023
…ectly with unit test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched

for backward_prefetch = BackwardPrefetch.BACKWARD_PRE
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* pre-backward hook order: root -> decoder 5...0 -> encoder 5...0
* prefetch order: decoder 5...0 -> encoder 5...0 -> None
  * when current_handle=encoder 0, _get_handle_to_prefetch returns None

for backward_prefetch = BackwardPrefetch.BACKWARD_POST
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root
* prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None
  * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None
  * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None

[ghstack-poisoned]
weifengpy added a commit that referenced this pull request Aug 24, 2023
…test"


issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched

for backward_prefetch = BackwardPrefetch.BACKWARD_PRE
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* pre-backward hook order: root -> decoder 5...0 -> encoder 5...0
* prefetch order: decoder 5...0 -> encoder 5...0 -> None
  * when current_handle=encoder 0, _get_handle_to_prefetch returns None

for backward_prefetch = BackwardPrefetch.BACKWARD_POST
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root
* prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None
  * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None
  * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Aug 25, 2023
issue resolved: #105984

context:
* CI did not catch the commit that breaks backward_prefetch #105006
* we had an action item to add unit test to prevent similar cases: #105984

what's included in this unit test
* monkey patch
torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched

for backward_prefetch = BackwardPrefetch.BACKWARD_PRE
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* pre-backward hook order: root -> decoder 5...0 -> encoder 5...0
* prefetch order: decoder 5...0 -> encoder 5...0 -> None
  * when current_handle=encoder 0, _get_handle_to_prefetch returns None

for backward_prefetch = BackwardPrefetch.BACKWARD_POST
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root
* prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None
  * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None
  * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch but is not needed. so returns None
Pull Request resolved: #107058
Approved by: https://github.com/awgu
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (fsdp) release notes category topic: not user facing topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants