
FSDP device_id + CPU offload can have issues #82891


Description

@rohan-varma

🐛 Describe the bug

If we have the following setup:

import torch
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import always_wrap_policy

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.b = nn.Linear(10, 10)

    def forward(self, x):
        return self.b(self.a(x))

model = MyModel()

fsdp = FSDP(
    model,
    auto_wrap_policy=always_wrap_policy,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
)

we hit the error:

RuntimeError: Module on rank 1 is given device_id argument cuda:1, but is on cpu.  Either move module before FSDP init or omit device_id argument.

This seems to be because the root FSDP unit does not manage any parameters of its own, so when it checks whether to move the module to the given device_id, the first parameter it finds is a nested FSDP submodule's FlatParameter, which CPU offload has already placed on CPU. That trips the check

if param is not None and param.device != self.device_id:

and the error above is raised.

The proper fix should be to bypass this check when the parameter we encounter is already a FlatParameter, i.e. it belongs to a nested FSDP unit and CPU offload has intentionally placed it on CPU.
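
A minimal sketch of what that bypass could look like, purely for illustration (the helper name is hypothetical, and the FlatParameter import path is an assumption that varies across releases):

import torch
# NOTE: the import path below is an assumption; FlatParameter has lived in
# different modules across releases (e.g. flatten_params_wrapper vs. flat_param).
from torch.distributed.fsdp.flatten_params_wrapper import FlatParameter

def _param_needs_device_move(param, device_id: torch.device) -> bool:
    # Hypothetical helper mirroring the check quoted above. A FlatParameter
    # already belongs to a nested FSDP unit, and with offload_params=True it
    # is intentionally on CPU, so skip it rather than treating the module as
    # wrongly placed on CPU.
    if param is None or isinstance(param, FlatParameter):
        return False
    return param.device != device_id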

The Lightning integration has hit this issue.
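
Until a fix lands, the error message itself points at two workarounds: move the module to the target device before FSDP init, or omit the device_id argument. A sketch reusing the MyModel setup and imports from the repro above (whether each variant suffices on a given release has not been verified here):

# Option 1: move the module onto the target CUDA device before wrapping.
model = MyModel().to(torch.cuda.current_device())
fsdp = FSDP(
    model,
    auto_wrap_policy=always_wrap_policy,
    cpu_offload=CPUOffload(offload_params=True),
    device_id=torch.cuda.current_device(),
)

# Option 2: keep the module on CPU and drop the device_id argument.
model = MyModel()
fsdp = FSDP(
    model,
    auto_wrap_policy=always_wrap_policy,
    cpu_offload=CPUOffload(offload_params=True),
)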

Versions

main

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501

Labels

better-engineering, high priority, module: fsdp, oncall: distributed, triage review, triaged
