Fix inefficient recursive update in ShardedTensor.state_dict hook #68806
Hi @awaelchli! Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!
Thanks for looking into this issue!
if submodule_name:
    _recurse_update_module(submodule, state_dict, key + '.')

for attr_name, attr in module.__dict__.items():
Shouldn't this be submodule.__dict__.items()? I'm wondering why this test didn't catch this issue: https://github.com/pytorch/pytorch/blob/master/test/distributed/_sharded_tensor/test_sharded_tensor.py#L965? Can we enhance that test so that it does catch this issue (maybe we need multiple levels of nesting)?
You are right, it should be submodule.
I'm not yet sure why the test had passed, but I'll give it a try.
@pritamdamania87 I'm a little bit stuck here. The test will pass no matter what. I'm running
python test/distributed/_sharded_tensor/test_sharded_tensor.py TestShardedTensorChunked.test_state_dict
I even added an assert False directly inside the test case and it still passed. The output is always:
.
----------------------------------------------------------------------
Ran 1 test in 1.523s
OK
This is indeed weird, can you try pytest test/distributed/_sharded_tensor/test_sharded_tensor.py -k test_state_dict?
cc @janeyx99 I was wondering if there might be something wrong here with our test infra?
Thanks for the suggestion @pritamdamania87.
The same thing happens with your command. The output looks normal, as if all tests passed.
However, I found out that the cause is the @with_comms decorator: it makes the test pass without running it. After removing the decorator, my self.assertEqual(1, 2) triggers as expected.
Any intuition how this decorator was meant to be used?
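For readers following along, here is a minimal sketch of how a decorator factory applied without parentheses can make a test body silently never run. It assumes the decorator is written as a factory that must be called; `comms_factory`, `Demo`, and `calls` are hypothetical names for illustration and are not the actual PyTorch test utilities.

```python
import functools

calls = []  # records which test bodies actually executed


def comms_factory(backend="nccl"):
    """Hypothetical decorator factory: it must be called to get a decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            return func(self, *args, **kwargs)
        return wrapper
    return decorator


class Demo:
    @comms_factory  # parentheses missing: the method is passed as `backend`
    def test_never_runs(self):
        calls.append("test_never_runs")
        raise AssertionError("1 != 2")

    @comms_factory()  # factory is called first, then the method is wrapped
    def test_runs(self):
        calls.append("test_runs")
        raise AssertionError("1 != 2")


demo = Demo()
# The attribute is now the inner `decorator`; calling it with `self`
# just builds and returns a wrapper, so nothing is raised.
returned = demo.test_never_runs()
```

A test runner treats a test method that returns without raising as a pass, which would match the behavior described above. Whether this is exactly what happened in the real `with_comms` is an assumption of this sketch.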
I signed the CLA when I opened the PR against master. The PR is now pointing to wanchaol/192/head, which must have reset the CLA bot. I submitted the CLA once again a few hours ago, but the bot hasn't updated the message.
Should I point back to master?
@awaelchli Ah sorry for the confusion, I meant just checking locally with #69493 patched to see if the unit tests work as expected. I think we should wait for #69493 to land, rebase this PR on top of that on master and then have this PR against master :)
Regarding the test failures: this issue unfortunately hid some silent failures that we couldn't catch earlier. I just updated the PR, which should fix all of them.
Regarding the CLA: as long as the state_dict related tests pass, I think you can re-point to master. I will try to merge #69493 as soon as I can. I do notice that test_load_state_dict_errors fails both on master and on my fix (it does not use the with_comms wrapper, and it looks like an RPC-related error). We can follow up on this separately, I think.
Sounds good. Thanks! Yes, the state dict tests passed when rebased on top of #69493.
ca2e618 to 7c7698c
7c7698c to f0607d3
lgtm, have one small nit
@@ -360,26 +360,18 @@ def pre_load_state_dict_hook(module, state_dict, prefix, local_metadata, strict,
    _recurse_update_module(module, state_dict, prefix)

def _recurse_update_module(module, state_dict, prefix):
    for attr_name, attr in module.__dict__.items():
nit: given that we don't need the recursion anymore, shall we remove these two functions and put the main logic in state_dict_hook and pre_load_state_dict_hook?
Sounds good to me. @pritamdamania87 do you agree?
@awaelchli Yes this makes sense
Done
When we added arguments to the `with_comms` decorator, we introduced an inner function, `with_comms_decorator`. As a result, `with_comms` itself only returned a function object, and the parentheses in `with_comms()` became necessary in test cases. This PR fixes the `with_comms` wrapper so that test cases can use it both with and without arguments:
```
@with_comms
def test_case(self):
    ...
```
or
```
@with_comms(backend="gloo")
def test_case(self):
    ...
```
Differential Revision: [D32897555](https://our.internmc.facebook.com/intern/diff/D32897555/) [ghstack-poisoned]
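One common way to write a decorator that works in both forms is sketched below. This is not the implementation from #69493; the default backend value, the `self.backend` attribute, and the `Example` class are assumptions made for illustration.

```python
import functools


def with_comms(func=None, *, backend="nccl"):  # default value is an assumption
    """Sketch of a decorator usable as @with_comms or @with_comms(backend=...)."""
    if func is None:
        # Called with arguments only: return a decorator that waits for the function.
        return functools.partial(with_comms, backend=backend)

    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # Placeholder for real process-group setup and teardown.
        self.backend = backend
        return func(self, *args, **kwargs)

    return wrapper


class Example:
    @with_comms
    def bare(self):
        return self.backend

    @with_comms(backend="gloo")
    def with_args(self):
        return self.backend
```

Making `backend` keyword-only keeps the two call forms unambiguous: a positional argument is always the function being decorated.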
f0607d3 to ba89913
@wanchaol has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
lgtm, thanks for fixing this!
Fixes #68805
The bug is described in the linked issue. This PR is an attempt to make the functions _recurse_update_dict and _recurse_update_module more efficient in how they iterate over the submodules. The previous implementation was suboptimal, as it recursively called the update method on the submodules returned by module.named_modules(), while module.named_modules() already returned all submodules including nested ones.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang