Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSDP init can crash with shared parameters #83052

Open
rohan-varma opened this issue Aug 9, 2022 · 2 comments
Open

FSDP init can crash with shared parameters #83052

rohan-varma opened this issue Aug 9, 2022 · 2 comments
Labels
high priority module: fsdp oncall: distributed Add this issue/PR to distributed oncall triage queue triage review triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Comments

@rohan-varma
Copy link
Member

rohan-varma commented Aug 9, 2022

馃悰 Describe the bug

FSDP initialization can crash when modules with shared params are wrapped separately. For example, if wrap https://github.com/facebookresearch/multimodal/blob/679f3596e4c44b483c68d4023b24e3c7f77292b3/torchmultimodal/modules/losses/flava.py#L138 linear (decoder) separately from the main module and then wrap the main module with device_id argument, this will raise an error due to bias param being shared. The bias param would have already been moved to GPU by the linear wrapped FSDP unit, but then the higher-level wrapper would still expect it to be on CPU, resulting in this error:

f"FSDP only supports single device modules, but got params on {module_devices}"

Versions

main

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501

@rohan-varma rohan-varma added high priority oncall: distributed Add this issue/PR to distributed oncall triage queue module: fsdp labels Aug 9, 2022
@awgu
Copy link
Contributor

awgu commented Aug 9, 2022

What is the desired behavior here instead?

@rohan-varma
Copy link
Member Author

@awgu Not sure - I understand we don't support shared params wrapped in different FSDP units, but we should probably have a different error than this which makes it very hard to root cause what the issue is and provides a poor UX.

@rohan-varma rohan-varma added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Aug 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high priority module: fsdp oncall: distributed Add this issue/PR to distributed oncall triage queue triage review triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

No branches or pull requests

2 participants