
[FSDP2] Eager-Mode Execution Tracker #120003

@awgu

Description


Work Items

  • Meta-device initialization / _apply() methods
    • Support initial meta-device initialization using swap_tensors path
    • Remove manual padding logic after [Feature][DTensor] Manage additional _padded_local_tensor attribute #113045 @wz337
    • Outcome: Once DTensor manages the padded storage, FSDP only needs to keep a reference to the DTensor, not its _local_tensor. We should be able to do meta-device initialization via nn.Module.to_empty() (using the swap_tensors path) followed by normal initialization ops (e.g. torch.nn.init.<...>) that run through DTensor's op dispatch, with no padding logic in _apply() (sketch below).
  • clip_grad_norm_() / sharded gradient scaler
    • Support DTensor-based clip_grad_norm_(), e.g. by implementing torch.linalg.vector_norm() for DTensor (sketch below)
    • Support DTensor-based clip_grad_norm_() with foreach=True
    • Support DTensor-based sharded gradient scaler (low priority since only used for fp16) Gradient scaler for DTensor #132816
    • Outcome: We can replace existing FSDP with per-parameter FSDP in torchtrain.
  • 2D sharded state dict
    • Implement a strided sharding placement to handle (Shard(0), Shard(0)) placements where both FSDP and TP shard dim-0 [1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic #126697
    • Validate 2D sharded state dict integration with distributed checkpointing, excluding checkpoint resharding (i.e. saving and loading at the same world size) (sketch below)
    • Validate 2D sharded state dict integration with distributed checkpointing including checkpoint resharding @wz337
    • Outcome: We can do 2D training including checkpointing without communication and can reshard checkpoints to different world sizes.
  • distribute_tensor()
    • Allow distribute_tensor() to accept a DTensor input to make construction of the sharded parameter and the sharded post-forward parameter simpler and more robust (sketch below) @wz337
    • Outcome: We remove all custom DTensor construction code from per-parameter FSDP (_init_sharded_param, _init_sharded_post_forward_param_metadata, to_sharded_post_forward).
  • Optimizer <> DTensor
    • Enable foreach=True by default for Adam/AdamW @wz337
    • Support fused=True for Adam/AdamW @wz337
    • Support foreach=True for all torch-native optimizers @wz337
    • Outcome: We get competitive performance for common optimizers without user-code changes (e.g. passing foreach=True explicitly), including the new CPU fused Adam/AdamW kernels (sketch below).
    • Support custom (user-defined) ops [DTensor] add support for custom op registration #131108
    • Outcome: We can support custom optimizers (e.g. Apex, low-precision, etc.).
  • Reduce-scatter copy-in kernel/fast-path
    • Add a new aten op for chunk-cat (eager reference sketch below) @BoyuanFeng
    • Outcome: We can achieve competitive performance with existing FSDP even on low-compute-density workloads (e.g. recommendation models).
  • HSDP
    • Add hybrid sharding, enabled by passing a 2D mesh arg (sketch below) @weifengpy
    • Validate HSDP sharded state dict integration with distributed checkpointing
    • Validate HSDP + TP sharded state dict integration with distributed checkpointing
    • Outcome: Existing HSDP has adoption both internally and externally; we should support it under per-parameter FSDP and migrate users.
  • CPU offloading
    • Add CPU offloading to per-parameter FSDP with async H2D parameter copies and async D2H gradient copies (at least when not accumulating gradients) (sketch below)
  • FSDP extensions
    • Add pre/post-all-gather extension support (sketch below) @awgu
    • Add pre/post-all-gather extensions for float8_experimental @awgu (dynamic scaling eager done)
    • Add pre/post all-gather extensions for QLoRA @weifengpy
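
Code Sketches

The following are hedged, non-authoritative sketches for several of the work items above; anything not yet landed is called out as an assumption.

Meta-device initialization: a minimal sketch of the intended flow, assuming the swap_tensors path and DTensor-managed padding; the fully_shard call that would swap parameters for DTensors is an assumption and is shown commented out.

```python
import torch
import torch.nn as nn

# Enable the swap_tensors path for nn.Module._apply()-based conversions (recent PyTorch).
torch.__future__.set_swap_module_params_on_conversion(True)

with torch.device("meta"):
    model = nn.Linear(1024, 1024, bias=False)
# fully_shard(model)  # assumption: would replace the parameters with (meta) DTensors here

model.to_empty(device="cpu")  # allocate uninitialized storage via swap_tensors
for param in model.parameters():
    nn.init.trunc_normal_(param)  # with DTensor params, runs through DTensor op dispatch
```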
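
clip_grad_norm_(): a small sketch of the call the work item targets, using a plain nn.Linear as a stand-in for DTensor parameters; the DTensor behavior noted in the comments is the goal, not current behavior.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in; under per-parameter FSDP the params/grads are DTensors
model(torch.randn(4, 8)).sum().backward()

# Relies on torch.linalg.vector_norm (and its foreach path) supporting DTensor inputs.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, foreach=True)
# With DTensor gradients, total_norm is expected to represent the global grad norm.
```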
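
2D sharded state dict: a sketch of the save/load round trip to validate (no resharding). The model is a stand-in; with a real 2D FSDP + TP module the state dict holds DTensors and this runs under torchrun. "ckpt" is a placeholder path, and the exact DCP entry points (dcp.save/load vs. the older save_state_dict/load_state_dict) vary by release.

```python
import torch.nn as nn
import torch.distributed.checkpoint as dcp

model = nn.Linear(8, 8)  # stand-in for a 2D (FSDP + TP) module whose state dict holds DTensors

# Save, then later load at the same world size (i.e. no checkpoint resharding).
state_dict = {"model": model.state_dict()}
dcp.save(state_dict, storage_writer=dcp.FileSystemWriter("ckpt"))

state_dict = {"model": model.state_dict()}
dcp.load(state_dict, storage_reader=dcp.FileSystemReader("ckpt"))
model.load_state_dict(state_dict["model"])
```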
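
distribute_tensor(): a sketch contrasting the existing API with the proposed DTensor-input form (run under torchrun; world_size is a placeholder, and the DTensor-input call is the work item's proposal, not an existing overload).

```python
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

world_size = 8  # placeholder for the launcher's world size
mesh = init_device_mesh("cuda", (world_size,))

full_param = torch.randn(1024, 1024, device="cuda")
sharded_param = distribute_tensor(full_param, mesh, [Shard(0)])  # existing API: regular tensor in

# Proposed (assumption per the work item): accept a DTensor input so FSDP can re-distribute
# it (e.g. for the sharded post-forward parameter) instead of hand-building DTensors.
```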
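
Optimizers: a sketch of the flags involved; the point of the work item is that these should work unchanged once the parameters are DTensors.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in; under per-parameter FSDP these parameters are DTensors

optim = torch.optim.AdamW(model.parameters(), lr=1e-3, foreach=True)  # multi-tensor path
# optim = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=True)
# ^ fused=True requires a device/build with the fused kernel (CUDA; CPU fused kernels are newer).
```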
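
Chunk-cat: an eager reference for what the reduce-scatter copy-in computes per gradient bucket; the work item's aten op would fuse these pads/copies into one kernel (this helper is illustrative, not the op itself).

```python
import torch
import torch.nn.functional as F
from typing import List

def chunk_cat_reference(grads: List[torch.Tensor], world_size: int) -> torch.Tensor:
    """Pad each flattened gradient to a multiple of world_size, chunk it, and lay out
    the reduce-scatter input so rank i's segment is the concat of every i-th chunk."""
    per_grad_chunks = []
    for grad in grads:
        flat = grad.flatten()
        pad = -flat.numel() % world_size
        if pad:
            flat = F.pad(flat, (0, pad))
        per_grad_chunks.append(flat.chunk(world_size))
    return torch.cat(
        [torch.cat([chunks[i] for chunks in per_grad_chunks]) for i in range(world_size)]
    )

out = chunk_cat_reference([torch.randn(5, 3), torch.randn(7)], world_size=4)
```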
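
HSDP: a sketch of the 2D mesh argument (run under torchrun on 8 ranks for this shape); passing the mesh to per-parameter fully_shard() is the proposed behavior, so that call is an assumption and shown commented out.

```python
from torch.distributed.device_mesh import init_device_mesh

# Outer dim replicates (like DDP), inner dim shards (like FSDP).
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
# from torch.distributed._composable.fsdp import fully_shard
# fully_shard(model, mesh=mesh)  # assumption: a 2D mesh enables hybrid sharding
```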
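
CPU offloading: a sketch of the async copies involved, using pinned CPU buffers and a side stream; the function names are illustrative, and the caller must synchronize the stream before consuming the results.

```python
import torch

def offload_grad_to_cpu(grad: torch.Tensor, stream: torch.cuda.Stream) -> torch.Tensor:
    """Async D2H gradient copy into pinned memory (sync `stream` before reading the result)."""
    cpu_grad = torch.empty(grad.shape, dtype=grad.dtype, device="cpu", pin_memory=True)
    with torch.cuda.stream(stream):
        cpu_grad.copy_(grad, non_blocking=True)
    return cpu_grad

def load_param_to_gpu(cpu_param: torch.Tensor, stream: torch.cuda.Stream) -> torch.Tensor:
    """Async H2D parameter copy; `cpu_param` should live in pinned memory."""
    with torch.cuda.stream(stream):
        return cpu_param.to("cuda", non_blocking=True)
```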
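
Pre/post-all-gather extensions: a purely illustrative skeleton of what a scaled (float8-style) parameter subclass might expose; the hook names, signatures, and reconstruction logic here are assumptions, not a finalized extension API.

```python
import torch

class ScaledTensorSketch(torch.Tensor):
    _data: torch.Tensor   # low-precision payload that FSDP would all-gather
    _scale: torch.Tensor  # metadata needed to rebuild the subclass afterward

    def fsdp_pre_all_gather(self, mesh):  # hypothetical hook name/signature
        # Hand FSDP the raw payload(s) to all-gather plus metadata to carry along.
        return (self._data,), (self._scale,)

    def fsdp_post_all_gather(self, all_gather_outputs, metadata, param_dtype, *, out=None):
        # Rebuild the unsharded subclass from the gathered payload (left abstract here).
        (data,) = all_gather_outputs
        (scale,) = metadata
        raise NotImplementedError("illustrative sketch only")
```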

References

RFC: #114299
