[FSDP2] Bug: NaN gradient when both HSDP and CPU offload are enabled #160291

@leonardo0lyj

Long time no see, Will @weifengpy, Yifan @mori360, and Andrew @awgu~

As a big fan of FSDP2, here is my first bug fix in H2 2025 😄.

Bug: When both HSDP and CPU offload are enabled, FSDP2 performs the CPU-GPU sync on the wrong stream, which can produce NaN gradients after backward().

Reason: During HSDP, post_reduce_stream is all_reduce_stream, but the CPU-GPU sync event is hard-coded to reduce_scatter_stream:
fsdp_param.grad_offload_event = reduce_scatter_stream.record_event()

Implication: This can produce non-deterministic NaN gradients, which are hard to debug and identify in production.

Why HSDP x CPU offload is useful: it helps when network bandwidth is limited and GPU memory is small, but CPU memory and PCI/NVLink bandwidth are available.

Fix: simply change reduce_scatter_stream.record_event() to post_reduce_stream.record_event() 😄
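
To illustrate why the stream that records the event matters, here is a minimal pure-Python sketch. It is a toy model, not PyTorch code: `ToyStream`, `ToyEvent`, `simulate`, and the op name `grad_d2h_copy` are all hypothetical names made up for this illustration, and the model assumes the gradient's device-to-host copy is enqueued on post_reduce_stream and that streams progress independently (cross-stream waits are ignored). Under those assumptions, an event recorded on reduce_scatter_stream does not guarantee that the copy has finished when HSDP is enabled.

```python
class ToyEvent:
    """Hypothetical stand-in for a device event: marks a position in one stream's queue."""

    def __init__(self, stream, position):
        self.stream = stream
        self.position = position

    def query(self):
        # The event is complete once its stream has finished every op
        # that was enqueued before the event was recorded.
        return self.stream.completed >= self.position

    def synchronize(self):
        # Worst case for the host: only the recorded stream makes progress.
        while not self.query():
            self.stream.run_one()


class ToyStream:
    """Hypothetical stand-in for a device stream: an in-order queue of op names."""

    def __init__(self, name):
        self.name = name
        self.ops = []        # op names in submission order
        self.completed = 0   # how many ops have finished so far

    def enqueue(self, op):
        self.ops.append(op)

    def record_event(self):
        return ToyEvent(self, len(self.ops))

    def run_one(self):
        if self.completed < len(self.ops):
            self.completed += 1

    def has_finished(self, op):
        return op in self.ops[: self.completed]


def simulate(use_hsdp, record_on_post_reduce_stream):
    """Return True if the offloaded gradient is safe to read after the host syncs on the event."""
    reduce_scatter_stream = ToyStream("reduce_scatter")
    all_reduce_stream = ToyStream("all_reduce")

    reduce_scatter_stream.enqueue("reduce_scatter")
    post_reduce_stream = reduce_scatter_stream
    if use_hsdp:
        # With HSDP, the post-reduce work moves to the all-reduce stream.
        all_reduce_stream.enqueue("all_reduce")
        post_reduce_stream = all_reduce_stream

    # Assumption: the CPU-offload copy runs on the post-reduce stream.
    post_reduce_stream.enqueue("grad_d2h_copy")

    if record_on_post_reduce_stream:
        event = post_reduce_stream.record_event()      # proposed fix
    else:
        event = reduce_scatter_stream.record_event()   # behavior described in this issue

    event.synchronize()
    return post_reduce_stream.has_finished("grad_d2h_copy")


if __name__ == "__main__":
    for use_hsdp in (False, True):
        for fixed in (False, True):
            print(f"use_hsdp={use_hsdp} fixed={fixed} safe={simulate(use_hsdp, fixed)}")
```

In this toy model the two choices only differ when HSDP is on, which is consistent with the bug showing up only for HSDP x CPU offload. The real failure is timing-dependent, so the actual NaN is non-deterministic rather than guaranteed as in the worst-case model above.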

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360
