[FSDP] fix: fix for fsdp zero2 validation error #110139
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110139
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit ef241bb with merge base 0013611: UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@Edwiv Thanks for catching this! Do you think you would be interested in adding a unit test or at least providing a repro?
@awgu sure~
@Edwiv feel free to put it in test_fsdp_misc.py:
(I do not think we have a place for eval tests right now.)
@awgu done~
@Edwiv Thanks for adding the unit test! It will run automatically in CI. Would it be possible to strengthen the unit test to check the correctness, e.g. with DDP?
thanks for the fix!
loss = loss.sum()
loss.backward()

with torch.no_grad():
are we interested in putting the model back into train mode after eval, and verifying that training still works fine?
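A sketch of the flow this is asking the test to cover (a hypothetical helper; `model`, `optim`, and `x` stand in for the FSDP model, its optimizer, and an input batch):

```python
import torch

def train_eval_train(model, optim, x):
    """Run train -> eval -> train to check that training still works after eval."""
    # Train step before eval.
    model.train()
    model(x).sum().backward()
    optim.step()
    optim.zero_grad()

    # Eval forward: no backward pass runs, so nothing resets prefetch state here.
    model.eval()
    with torch.no_grad():
        model(x)

    # Train again to confirm the eval pass did not leave parameters in a bad shape.
    model.train()
    model(x).sum().backward()
    optim.step()
    optim.zero_grad()
```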
done~
Sorry, I wanted to bump this in case you missed it @Edwiv!
@awgu Sorry, I didn't understand how to test with DDP; what should be done to strengthen the unit test?
@Edwiv You can construct an identical model with DDP applied (to implement data parallel semantics). Then, you can run the same training loop for both the DDP and FSDP models and compare each iteration's loss (for example) to check for correctness.
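Roughly, the suggested parity check could look like the following sketch. It assumes a distributed process group is already initialized (as in the FSDP test harness), one CUDA device per rank, and placeholder names such as `base_model` and `get_batch`:

```python
import copy
import torch
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.nn.parallel import DistributedDataParallel as DDP

def check_fsdp_matches_ddp(base_model, get_batch, num_iters=5):
    # Wrap two identical copies of the model so both start from the same weights.
    fsdp_model = FSDP(
        copy.deepcopy(base_model).cuda(),
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
        forward_prefetch=True,
    )
    ddp_model = DDP(copy.deepcopy(base_model).cuda())
    fsdp_optim = torch.optim.SGD(fsdp_model.parameters(), lr=1e-2)
    ddp_optim = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)

    for _ in range(num_iters):
        x, y = get_batch()  # per-rank batch, identical for both models
        fsdp_loss = F.cross_entropy(fsdp_model(x), y)
        ddp_loss = F.cross_entropy(ddp_model(x), y)
        fsdp_loss.backward()
        ddp_loss.backward()
        fsdp_optim.step()
        ddp_optim.step()
        fsdp_optim.zero_grad()
        ddp_optim.zero_grad()
        # DDP provides the reference data-parallel semantics, so the losses
        # should match on every iteration if FSDP is behaving correctly.
        torch.testing.assert_close(fsdp_loss, ddp_loss)
```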
@awgu I got it, I've added the DDP comparison to the unit test, please take a look.
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
for _ in range(5):
nit: We normally run the optimizer in the training loop. Otherwise, since the inputs `x` and `y` are not changing, the `loss` should be the same on all iterations.
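For instance, the loop could draw fresh data and step an optimizer each iteration so that the loss actually changes (a sketch; `model` is a placeholder assumed to map an `(8, 4)` batch to at least 10 class logits):

```python
import torch
import torch.nn.functional as F

# Sketch: new data and an optimizer step per iteration, so losses differ across iterations.
optim = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(5):
    x = torch.randn(8, 4, device="cuda")
    y = torch.randint(low=0, high=9, size=(8,), device="cuda")
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optim.step()
    optim.zero_grad()
```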
nit: `fsdp_loss` and `ddp_loss` might be more informative names than `loss` and `loss1`.
y = torch.randint(low=0, high=9, size=(8,), device="cuda")
x1 = x.clone().detach().requires_grad_()
y1 = y.clone().detach()
seed = 20231010
We should set the seed to be something like `self.rank + 1` (anything so that data parallel ranks have different seeds); otherwise, the gradient reduction may not be tested (something like `(x + x) / 2 = x` is the same as not summing and dividing).
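In other words, something along these lines inside the test (a sketch; `self.rank` refers to the rank attribute of the multi-process test case):

```python
# Give each data-parallel rank its own seed so ranks generate different inputs;
# with identical inputs, averaging identical gradients would hide a broken
# reduction, since (g + g) / 2 == g.
seed = self.rank + 1
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
```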
loss1.backward()

assert torch.allclose(loss, loss1)
assert torch.allclose(x.grad.data, x1.grad.data)
nit: the `.data` are not needed.
assert torch.allclose(fsdp_loss, ddp_loss)
assert torch.allclose(x.grad, x1.grad)
nit: Would it be possible to check correctness after every iteration? This lets us narrow down / catch bugs much faster.
You can reference: https://github.com/pytorch/pytorch/pull/110948/files#diff-ab5af580410c642dd66ea27656265fbbc1ec6c6713e048a0ef111573eb52286cR195
Otherwise, we are pretty much good to go.
I've polished up this UT, and by the way, the author of the PR above turned out to be a coworker of mine 😂.
Lint error looks like an infra issue. Let me rebase and merge.
@pytorchbot rebase -s
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 685ad69 to ef241bb.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Problem

When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch is turned on, validation after training fails with an incorrect weight shape.

<img width="1508" alt="image" src="https://github.com/pytorch/pytorch/assets/41232043/57a9c3bb-cb5c-46df-ac26-922740686f9e">

# Analysis

When using `SHARD_GRAD_OP`, `free_unsharded_flat_param` in `_post_forward_reshard` is often False, so the handle's `_prefetched` flag is not reset to False after the forward. The normal training phase resets this flag in `_post_backward_final_callback`, but the validation phase does not execute that hook, so after the first validation iteration finishes, the prefetched handle's flag stays True. This causes the handle to skip `_unshard` in the next `_pre_forward_unshard`, and `_prefetch_handle` does not do a prefetch, which results in an incorrect weight shape.

Pull Request resolved: pytorch#110139
Approved by: https://github.com/awgu
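A minimal reproduction of the failure mode described above could look like the following sketch. It assumes a distributed process group has already been initialized (e.g. via torchrun with the nccl backend) and one CUDA device per rank; the module and tensor sizes are placeholders:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

def repro():
    # Several FSDP-wrapped submodules so that forward prefetching of the
    # "next" handle actually happens.
    model = FSDP(
        nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)).cuda(),
        auto_wrap_policy=ModuleWrapPolicy({nn.Linear}),
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,  # zero2
        forward_prefetch=True,
    )
    optim = torch.optim.SGD(model.parameters(), lr=1e-2)
    x = torch.randn(8, 4, device="cuda")

    # Training is fine: _post_backward_final_callback resets each handle's
    # _prefetched flag after the backward pass.
    model.train()
    model(x).sum().backward()
    optim.step()
    optim.zero_grad()

    # Validation: there is no backward, so before this fix the _prefetched
    # flags set during the first eval forward were never cleared (SHARD_GRAD_OP
    # does not free the unsharded flat param in _post_forward_reshard), and the
    # second eval forward could fail with an incorrect weight shape.
    model.eval()
    with torch.no_grad():
        for _ in range(2):
            model(x)
```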