Can't load on rank 0 only with `set_optimizer_state_dict` #125177

mvpatel2000 · 2024-04-29T18:37:38Z

🐛 Describe the bug

To avoid CPU OOMs, our training library only loads monolithic checkpoints on rank 0 and broadcasts to all other ranks (as PyTorch checkpointing supports). When migrating to the new distributed APIs,

                set_optimizer_state_dict(
                    model=self.model,
                    optimizers=optimizer,
                    optim_state_dict=optim_state_dict,
                    options=StateDictOptions(
                        full_state_dict=self.fsdp_state_dict_type == 'full',
                        strict=strict,
                        cpu_offload=True,
                    ),
                )

we hit an error with this approach in the below function:

pytorch/torch/distributed/checkpoint/state_dict.py

Line 582 in ae13c7e

optim_state_dict = _split_optim_state_dict(model, optim, state_dict, info)

In our code, optim_state_dict is only non-None on rank-0 as rank0only is set to True in PyTorch code.

Currently, I need to do:

optim_state_dict = MagicMock() if optim_state_dict is None else optim_state_dict

which is unideal. I think the split function should just not be run if the context manager is rank0only, but I am not sure here

Versions

Pytorch 2.3

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC

The text was updated successfully, but these errors were encountered:

mvpatel2000 · 2024-04-29T18:38:00Z

@pytorchbot label "oncall: distributed"

fegin · 2024-04-30T22:34:29Z

@mvpatel2000 Curious, how many nodes/ranks are you using? Also, is the state_dict/optimizer state_dict DTensor based? I'm also implement the broadcasting feature at this moment. But it will only support DTensor based state_dict/optimizer state_dict.

mvpatel2000 · 2024-05-01T01:28:42Z

@mvpatel2000 Curious, how many nodes/ranks are you using? Also, is the state_dict/optimizer state_dict DTensor based? I'm also implement the broadcasting feature at this moment. But it will only support DTensor based state_dict/optimizer state_dict.

This is for multiple GPUs (2 is sufficient).

We currently unit test this with ShardedTensors (as this is the default unless a device mesh is passed in), but I believe it's fine if it's only DTensor

fegin · 2024-05-03T16:42:41Z

@mvpatel2000 You can try #125339 to see if that PR can help. But it only supports DTensor.

pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 29, 2024

LucasLLC self-assigned this Apr 29, 2024

fegin added the module: distributed_checkpoint label Apr 30, 2024

LucasLLC assigned fegin May 7, 2024

LucasLLC added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Can't load on rank 0 only with `set_optimizer_state_dict` #125177

Can't load on rank 0 only with `set_optimizer_state_dict` #125177

mvpatel2000 commented Apr 29, 2024 •

edited by pytorch-bot bot

Loading

mvpatel2000 commented Apr 29, 2024

fegin commented Apr 30, 2024 •

edited

Loading

mvpatel2000 commented May 1, 2024

fegin commented May 3, 2024

Can't load on rank 0 only with set_optimizer_state_dict #125177

Can't load on rank 0 only with set_optimizer_state_dict #125177

Comments

mvpatel2000 commented Apr 29, 2024 • edited by pytorch-bot bot Loading

🐛 Describe the bug

Versions

mvpatel2000 commented Apr 29, 2024

fegin commented Apr 30, 2024 • edited Loading

mvpatel2000 commented May 1, 2024

fegin commented May 3, 2024

Can't load on rank 0 only with `set_optimizer_state_dict` #125177

Can't load on rank 0 only with `set_optimizer_state_dict` #125177

mvpatel2000 commented Apr 29, 2024 •

edited by pytorch-bot bot

Loading

fegin commented Apr 30, 2024 •

edited

Loading