FSDP init can crash with shared parameters #83052
Labels
high priority
module: fsdp
oncall: distributed
triage review
triaged
🐛 Describe the bug
FSDP initialization can crash when modules with shared parameters are wrapped separately. For example, if one wraps the linear decoder at https://github.com/facebookresearch/multimodal/blob/679f3596e4c44b483c68d4023b24e3c7f77292b3/torchmultimodal/modules/losses/flava.py#L138 separately from the main module and then wraps the main module with the `device_id` argument, an error is raised because the `bias` param is shared. The `bias` param would have already been moved to GPU by the FSDP unit wrapping the linear, but the higher-level wrapper would still expect it to be on CPU, resulting in this error:

`pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py`, line 814 (at commit 9e65e93)
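
A minimal repro sketch of the scenario described above, assuming a hypothetical `Head` module (not the exact FLAVA prediction head) whose decoder `nn.Linear` shares its bias with a module-level `nn.Parameter`, mirroring the linked torchmultimodal code:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class Head(nn.Module):
    def __init__(self, hidden_size: int = 8, vocab_size: int = 16) -> None:
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        # Share the bias between the module and the decoder Linear,
        # as in the FLAVA masked prediction head.
        self.decoder.bias = self.bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(x)


def main() -> None:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    model = Head()  # constructed on CPU
    # Wrap the inner Linear first; this moves its params (including the
    # shared bias) to rank's GPU.
    model.decoder = FSDP(model.decoder, device_id=rank)
    # Wrapping the parent with device_id then finds the shared bias already
    # on GPU while it expects the remaining params to be on CPU, which is
    # where the crash in fully_sharded_data_parallel.py is hit.
    model = FSDP(model, device_id=rank)


if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=<ngpus>`, this should exercise the same shared-parameter path as the FLAVA decoder case.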
Versions
main
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501