
FSDP + DTensor Loss Flatlines Randomly #117471

Closed
mvpatel2000 opened this issue Jan 14, 2024 · 5 comments
Labels
module: dtensor (distributed tensor tag), module: fsdp, triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Milestone

Comments

@mvpatel2000 (Contributor) commented Jan 14, 2024

🐛 Describe the bug

We have been training with DTensor off the Torch nightly builds (in anticipation of 2.2), and we very often see the loss flatline. We do not see this at all on the current nightly (as of 4 days ago), and at this point we are confident there is a regression/bug in the current release candidate (for 2.2) that breaks FSDP training (at least with DTensor).
Our best guess is that one of the two linked PRs fixes it:

[image attached: loss curves from the affected runs]

Versions

Torch 2.2 branch

cc @zhaojuanmao @mrshenli @rohan-varma @awgu @fegin @penguinwu @kwen2501 @wanchaol @XilunWu @tianyu-l

@Skylion007 added this to the 2.2.0 milestone Jan 15, 2024
@Skylion007 (Collaborator) commented:

We confirmed this affects the 2.2.0 final RC.

@atalman (Contributor) commented Jan 15, 2024

@mvpatel2000 (Contributor, Author) commented:

@atalman unfortunately I do not have a minimal repro, nor am I able to share the code for this run at this time :(

We run a transformer model with DTensor + FSDP (passing in a device mesh). The only different thing we do is that some weights are manually wrapped with DTensor and pre-sharded before FSDP -- I'm pretty sure this won't matter, so reproducing on your end shouldn't be too hard, but I'm not 100% confident.
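For context, a minimal sketch of the kind of setup described above: a small module wrapped with FSDP on a device mesh, with one weight manually wrapped as a pre-sharded DTensor first. The 2D (data-parallel x tensor-parallel) mesh shape, the `TinyBlock` module, and all sizes are assumptions for illustration, not the reporter's actual code:

```python
# Hypothetical sketch only: module, sizes, and mesh layout are placeholders,
# not the reporter's actual training code.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# e.g. 8 GPUs split into 4-way data parallel x 2-way tensor parallel.
mesh_2d = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp", "tp"))


class TinyBlock(nn.Module):
    """Stand-in for one transformer block."""

    def __init__(self, dim: int = 512) -> None:
        super().__init__()
        self.wqkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, _, _ = self.wqkv(x).chunk(3, dim=-1)
        return self.proj(q)


model = TinyBlock().cuda()

# Manually wrap one weight as a DTensor, pre-sharded on the TP sub-mesh,
# before handing the module to FSDP -- the step described in the comment above.
# (A real TP run would also apply a parallelization plan, e.g. via
# torch.distributed.tensor.parallel, for the forward to be correct; this only
# illustrates the pre-sharding-before-FSDP ordering.)
model.wqkv.weight = nn.Parameter(
    distribute_tensor(model.wqkv.weight, mesh_2d["tp"], [Shard(0)])
)

# Pass the data-parallel sub-mesh to FSDP via the device_mesh kwarg (new in 2.2).
fsdp_model = FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)
```

The key point is only the ordering: the DTensor wrapping happens before the FSDP constructor, which is given the device mesh directly.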

@wanchaol (Contributor) commented:

@atalman I just checked our release branch; in addition to #117020, we'll also need #116122 to resolve the merge conflicts.

I can also confirm that I hit similar numeric issues (not a loss flatline, but a loss-NaN problem that looks similar to the issue @mvpatel2000 reported). These two fixes resolved the NaN problem for me; it would be great if we can include both in the release branch :)

@awgu added the triaged, module: fsdp, and module: dtensor labels Jan 16, 2024
@atalman changed the milestone from 2.2.0 to 2.2.1 Jan 18, 2024
@mvpatel2000 (Contributor, Author) commented:

Fixed in dev.
