Option to let DistributedDataParallel know in advance unused parameters at each forward pass #90171
Comments
Thanks for the suggestion, @netw0rkf10w! This makes sense. Something like this (looks a bit ugly but could be made a little prettier with a wrapper etc.):
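A minimal sketch of this kind of trick, assuming the idea is to fold a zero-weighted sum of the skipped parameters into the loss so that every parameter receives a (zero) gradient and `find_unused_parameters=True` is no longer needed; the helper name below is illustrative, not an existing DDP API:

```python
import torch

def touch_unused_params(loss, skipped_params):
    # Add a zero-weighted term over the parameters skipped this iteration,
    # so autograd produces a (zero) gradient for each of them and DDP sees
    # no unused parameters. Note: those zero gradients are still allocated
    # on the GPU and still take part in DDP's gradient reduction.
    return loss + 0.0 * sum(p.sum() for p in skipped_params)

# Usage inside the training loop (skipped_params sampled before forward):
# loss = criterion(ddp_model(inputs), targets)
# loss = touch_unused_params(loss, skipped_params)
# loss.backward()
```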
@aazzolini I'm really sorry for my late response. Unfortunately I was unable to reply earlier :( I have to say that what you have proposed is very clever! However, it seems to me that it's still suboptimal, because it still allocates GPU memory for the unused parameters, and gradient reduction still happens for them. Let me summarize below the different solutions in terms of running time and memory footprint. Please correct me if I'm wrong, because I don't know very well how DDP works.

1. Optimal solution: DDP knows in advance which parameters (or layers) are unused. If DDP knows in advance which layers are unused at each forward pass, then it should be able to ignore them during both the forward and backward passes. ✅ Running time: good, no forward and no reduction for unused layers.

2. First naive solution: Setting `find_unused_parameters=True`.
cc as well @mrshenli, who used to answer my questions on DDP.
@aazzolini I finally had the time to implement and benchmark your solution. Unfortunately, it's even slower than simply setting `find_unused_parameters=True`.
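For context, a comparison like this can be done by timing an average training step under each variant; a generic sketch, where `step` is any closure performing one forward/backward pass:

```python
import time
import torch

def avg_step_time(step, n_warmup=10, n_iters=50):
    # Average time per training iteration. The synchronize() calls ensure
    # queued CUDA kernels are included in the measurement.
    for _ in range(n_warmup):
        step()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters
```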
I've just realised that this was discussed in the PyTorch DDP paper by @mrshenli et al. I believe that the solution I proposed above is much simpler than any of the ones proposed in the paper. What do you think about this, @mrshenli?
🚀 The feature, motivation and pitch
Motivation: In models with stochastic depth and the like, at each forward pass some layers (or parts of them) are skipped, and thus one needs to set `find_unused_parameters=True`, which makes training much slower in general. Yet, one can implement these models in such a way that the unused parameters at each step are known in advance (e.g., the layer sampling is done before the model's forward pass and not on the fly). It would then be great if we could feed this information to the DDP model so that it doesn't need to find the unused parameters. The usage could be something like the following:
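A sketch of what the requested API could look like; `set_unused_parameters` is hypothetical (it is the feature being asked for), and `MyStochasticDepthModel`, `sample_layers_to_skip`, `criterion`, and `dataloader` are placeholders:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# find_unused_parameters stays False: DDP will be told explicitly instead.
model = DDP(MyStochasticDepthModel().cuda())

for inputs, targets in dataloader:
    # Sample the layers to skip *before* the forward pass, so the set of
    # unused parameters is known in advance.
    skipped_layers = sample_layers_to_skip(model.module)
    unused = [p for layer in skipped_layers for p in layer.parameters()]

    # Hypothetical API: tell DDP up front which parameters will not receive
    # gradients this iteration, so it can skip both the search for unused
    # parameters and their gradient reduction.
    model.set_unused_parameters(unused)

    loss = criterion(model(inputs, skipped_layers=skipped_layers), targets)
    loss.backward()
```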
Alternatives

Currently there is no option other than setting `find_unused_parameters=True`.

Additional context
No response
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu