fix handling of replica parameters in DataParallel #33907
Conversation
Awesome fix! Nice to see a test go from doing nothing to actually testing something.
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
CircleCI build failures summary (Dr. CI, as of commit d70e4cd): 10 new failures recognized by patterns. These build failures do not appear to be due to upstream breakages (8 jobs were rerun to discount flakiness).
cc @suo. JIT tests are now passing because I don't delete an attribute for scripted modules. However, scripted RNN + DataParallel did not work before this PR, and is still broken after it. The reason is that scripting does not call the overridden _apply from the RNN module when moving a scripted module to CUDA, and does not call the overridden __setattr__ from the RNN module when creating replicas, and the generic implementations of those do not work for RNN (which is why those overrides were needed in the first place).
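The comment above can be illustrated with a minimal sketch (plain Python, not PyTorch internals; all class and attribute names here are hypothetical): a subclass overrides _apply so that derived state is rebuilt after its parameters move, the way nn.RNN re-flattens its weights after .cuda(). If scripting dispatches to the generic base-class _apply instead, that rebuild step is silently skipped.

```python
# Hypothetical sketch of the override pattern described above.
# Values are plain strings standing in for tensors on a device.
class Module:
    def __init__(self):
        self.params = {"w": "w_cpu"}

    def _apply(self, fn):
        # generic implementation: only transforms the raw parameters
        self.params = {k: fn(v) for k, v in self.params.items()}
        return self


class RNNLike(Module):
    def __init__(self):
        super().__init__()
        self.flat_cache = self._rebuild()

    def _rebuild(self):
        # stands in for RNN weight flattening, which depends on
        # where the parameters currently live
        return "|".join(self.params.values())

    def _apply(self, fn):
        super()._apply(fn)
        # the extra step the override exists for; calling only the
        # generic Module._apply would leave flat_cache stale
        self.flat_cache = self._rebuild()
        return self


m = RNNLike()._apply(lambda v: v.replace("cpu", "cuda"))
print(m.flat_cache)  # 'w_cuda'
```

Under scripting, per the comment, it is the base-class version that runs, so the cached state would still refer to the pre-move parameters.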
Change LGTM. Can you file an issue about the brokenness of RNN + DataParallel + JIT?
DDP was expecting that .parameters() would return all parameters of a replicated module. However, after pytorch#33907, we no longer populate _parameters for replicated modules. This PR fixes that problem by keeping the params in a _former_parameters ordered dict on every module replica.
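A minimal sketch of the lookup pattern that description implies (plain Python, not the actual torch.nn.Module code; the Replica class and its contents are hypothetical): when _parameters is intentionally left empty on a replica, parameters() can fall back to the _former_parameters ordered dict so consumers like DDP still see every replicated parameter.

```python
from collections import OrderedDict


class Replica:
    """Hypothetical stand-in for a DataParallel module replica."""

    def __init__(self, tensors):
        # after #33907 the replica's _parameters is left empty, since
        # broadcast tensors are not leaves and must not be registered
        # as real parameters
        self._parameters = OrderedDict()
        # the broadcast tensors are recorded here instead
        self._former_parameters = OrderedDict(tensors)

    def parameters(self):
        # fall back to _former_parameters when _parameters is empty,
        # so iteration still yields every replicated parameter in order
        source = self._parameters if self._parameters else self._former_parameters
        return list(source.values())


r = Replica({"weight": "w_replica", "bias": "b_replica"})
print(r.parameters())  # ['w_replica', 'b_replica']
```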
…ataParallel replicas. (pytorch/pytorch#33907)
…ataParallel replicas. (pytorch/pytorch#33907) Same issue is reported at huggingface/transformers#3936
In DataParallel, replica parameters are not leaves (because they are computed via broadcast from master parameters), and should be treated as such. Fixes #33552
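The leaf distinction can be seen with a small CPU-only autograd snippet (the multiplication below stands in for the broadcast op; it is an illustration, not the DataParallel code path): a tensor computed from a parameter has a grad_fn and is not a leaf, and backward flows gradients to the master parameter rather than to the replica.

```python
import torch

# a "master" parameter, as held by the module on the source device
master = torch.nn.Parameter(torch.randn(3))

# simulate broadcast: the replica is *computed* from the master,
# so it carries a grad_fn and is not a leaf tensor
replica = master * 1.0

print(master.is_leaf)    # True
print(replica.is_leaf)   # False

# gradients from the replica flow back to the master parameter
replica.sum().backward()
print(master.grad is not None)  # True
```

This is why registering replica tensors as real parameters is wrong: parameters are expected to be leaves, and replica tensors are not.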