Description
Long time no see, Will @weifengpy , Yifan @mori360 , and Andrew @awgu~
As a big fan of FSDP2, here is my first bug fix in H2 2025 😄.
Bug: When both HSDP and CPU offload are enabled, FSDP2 synchronizes the CPU against the wrong GPU stream, which can produce NaN gradients after backward().
Reason: `post_reduce_stream` is `all_reduce_stream` during HSDP, but the CPU-GPU sync is hard-coded to `reduce_scatter_stream`:
`fsdp_param.grad_offload_event = reduce_scatter_stream.record_event()`
Implication: The NaN gradients appear non-deterministically, which makes the bug hard to debug and identify in production.
Why HSDP x CPU offload is useful: it fits setups where network bandwidth is limited and GPU memory is small, but CPU memory and PCIe/NVLink bandwidth are available.
Fix: simply change `reduce_scatter_stream.record_event` to `post_reduce_stream.record_event` 😄
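To illustrate why the stream choice matters, here is a minimal pure-Python sketch of the race. It is a toy model, not FSDP2 or PyTorch code: `ToyStream`, `ToyEvent`, and `offloaded_grad` are hypothetical names, and the gradient values are made up. Work on a toy stream runs only when an event forces it, which models the worst-case timing; on a real GPU the outcome depends on timing, so the NaN shows up only sometimes.

```python
import math


class ToyStream:
    """Hypothetical stand-in for a CUDA stream: work runs in enqueue order."""

    def __init__(self):
        self.ops = []       # every callable enqueued on this stream
        self.completed = 0  # how many of them have executed so far

    def enqueue(self, op):
        self.ops.append(op)

    def record_event(self):
        # An event covers only the work already enqueued on *this* stream.
        return ToyEvent(self, len(self.ops))


class ToyEvent:
    def __init__(self, stream, position):
        self.stream = stream
        self.position = position

    def synchronize(self):
        # Block until the owning stream has executed everything up to the event.
        stream = self.stream
        while stream.completed < self.position:
            op = stream.ops[stream.completed]
            stream.completed += 1
            op()


def offloaded_grad(hsdp, record_on_post_reduce):
    """Return the CPU gradient the host sees after waiting on the offload event."""
    reduce_scatter_stream = ToyStream()
    all_reduce_stream = ToyStream()
    # NaN models an uninitialized host buffer; 1.0 and 2.0 are made-up values.
    state = {"gpu_grad": None, "cpu_grad": math.nan}

    def reduce_scatter():
        state["gpu_grad"] = 1.0

    reduce_scatter_stream.enqueue(reduce_scatter)
    post_reduce_stream = reduce_scatter_stream

    if hsdp:
        reduce_scatter_done = reduce_scatter_stream.record_event()

        def all_reduce():
            reduce_scatter_done.synchronize()  # all-reduce depends on reduce-scatter
            state["gpu_grad"] *= 2

        all_reduce_stream.enqueue(all_reduce)
        post_reduce_stream = all_reduce_stream  # as described for HSDP

    def copy_grad_to_cpu():
        state["cpu_grad"] = state["gpu_grad"]

    # The device-to-host copy is enqueued on the post-reduce stream.
    post_reduce_stream.enqueue(copy_grad_to_cpu)

    # Bug: event recorded on reduce_scatter_stream. Fix: on post_reduce_stream.
    event_stream = post_reduce_stream if record_on_post_reduce else reduce_scatter_stream
    grad_offload_event = event_stream.record_event()
    grad_offload_event.synchronize()  # the host waits here before reading the grad
    return state["cpu_grad"]
```

In this toy model, `offloaded_grad(hsdp=True, record_on_post_reduce=False)` returns NaN because the event never covers the copy, while `offloaded_grad(hsdp=True, record_on_post_reduce=True)` returns 2.0. Without HSDP both choices return 1.0, since the two streams are the same, which matches the bug appearing only when HSDP and CPU offload are combined.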
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360