[Traceable FSDP2] Add partial-graph (graph-break) unit tests #131747

yf225 · 2024-07-25T06:39:47Z

Stack from ghstack (oldest at bottom):

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @ezyang @chauhang @penguinwu

[ghstack-poisoned]

pytorch-bot · 2024-07-25T06:39:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131747

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a35e66 with merge base 89bdd9c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

bdhirsh · 2024-07-25T14:50:50Z

test/distributed/_composable/fsdp/test_fully_shard_compile.py

-            *self._create_transformer_factory_fns(), "aot_eager", fullgraph=True
-        )
+    def test_transformer_backend_aot_eager(self):
+        for fullgraph in [True, False]:


nit: @parametrize("graph_break", [True, False]) might make debugging the test easier if e.g. the graph break path fails at some point (example: https://github.com/pytorch/pytorch/blob/main/test/autograd/test_functional.py#L684)

The issue with distributed tests is that using @parametrize directly would initialize the NCCL PG for every parametrized test, which is slow. With what @yf225 has, the same NCCL PG is reused for all subtests.

I think @kwen2501 added MultiProcContinousTest to maybe address this. I am not sure exactly what the limitations are, though.
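
The trade-off described above can be illustrated without any distributed setup. The following is a minimal sketch using only `unittest`; `ExpensiveSetup` is a hypothetical stand-in for NCCL process-group initialization, and all class and test names are made up for illustration. `SplitTest` mimics what generating one test per parameter value would produce, while `LoopedTest` keeps a single test and uses `self.subTest` so a failure still reports which value of `fullgraph` was involved.

```python
import unittest


class ExpensiveSetup:
    """Hypothetical stand-in for a costly fixture such as a process group."""

    init_count = 0

    def __init__(self):
        type(self).init_count += 1


class LoopedTest(unittest.TestCase):
    """One test method that loops over both modes, so setUp runs once."""

    def setUp(self):
        self.fixture = ExpensiveSetup()

    def test_both_modes(self):
        for fullgraph in [True, False]:
            # subTest labels a failure with the parameter value that caused it.
            with self.subTest(fullgraph=fullgraph):
                self.assertIsInstance(fullgraph, bool)


class SplitTest(unittest.TestCase):
    """One test method per mode, so setUp runs once per mode."""

    def setUp(self):
        self.fixture = ExpensiveSetup()

    def test_fullgraph_true(self):
        self.assertTrue(True)

    def test_fullgraph_false(self):
        self.assertFalse(False)


def count_setups(test_case_cls):
    """Run every test in the class and return how often the fixture was built."""
    ExpensiveSetup.init_count = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case_cls)
    result = unittest.TestResult()
    suite.run(result)
    assert result.wasSuccessful()
    return ExpensiveSetup.init_count
```

Calling `count_setups(LoopedTest)` and `count_setups(SplitTest)` shows how many times the fixture is constructed in each style. Whether the same cost model applies to a given distributed test base class is an assumption to verify against that class.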

bdhirsh · 2024-07-25T14:58:33Z

test/distributed/_composable/fsdp/test_fully_shard_compile.py

+            torch._dynamo.graph_break()
+        return orig_fn(*args, **kwargs)
+
+    def _mock_sdpa(self, fullgraph):


nit: maybe the name should imply that you're just conditionally adding a graph break to sdpa? with _maybe_add_graph_break_to_sdpa(...)
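
A minimal sketch of what a helper with the suggested name could look like. This is not the code from the PR: `fake_functional` is a hypothetical stand-in for the module that owns SDPA, and the `graph_break` argument is a placeholder for the call that forces the break (the diff above uses `torch._dynamo.graph_break()`), so the sketch runs without PyTorch installed.

```python
import contextlib
import functools
import types
from unittest import mock

# Hypothetical stand-in for the module that owns SDPA; a real test would
# patch the actual scaled_dot_product_attention function instead.
fake_functional = types.SimpleNamespace(
    scaled_dot_product_attention=lambda q, k, v: (q, k, v)
)


@contextlib.contextmanager
def maybe_add_graph_break_to_sdpa(fullgraph, graph_break):
    """Wrap SDPA so it calls `graph_break` first, but only if fullgraph is False."""
    if fullgraph:
        # Full-graph mode: leave SDPA untouched.
        yield
        return

    orig_fn = fake_functional.scaled_dot_product_attention

    @functools.wraps(orig_fn)
    def sdpa_with_graph_break(*args, **kwargs):
        graph_break()  # assumed to be torch._dynamo.graph_break in a real test
        return orig_fn(*args, **kwargs)

    # mock.patch.object restores the original attribute when the block exits.
    with mock.patch.object(
        fake_functional, "scaled_dot_product_attention", sdpa_with_graph_break
    ):
        yield
```

Usage would be `with maybe_add_graph_break_to_sdpa(fullgraph, graph_break=...):` around the code under test. The name makes it clear at the call site that the patch is conditional.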

bdhirsh · 2024-07-25T15:02:37Z

test/distributed/_composable/fsdp/test_fully_shard_compile.py

+                    ).run(code)
+            else:
+                self.assertTrue(
+                    len(triton_codes) >= 3,


Do you think it would be useful to be a bit stricter about what we assert? If I understand correctly, the expectation is something like:

3 graphs total (2 fw graphs due to the graph break, 1 unified bw graph due to compiled autograd)

fw 1: the only comm is an all_gather_out, plus one set_() (performs the weight AG, but does not free the weight)
fw 2: no comms, only contains a set_() (just finishes the fw compute and frees the weight)
bw: contains all_gather_out, reduce_scatter_out, and 2 set_() ops (fully gathers the weight, does the bw compute, and frees it)

(or at least, why is the assert >= 3 and not just == 3?)

Yes, I think we want to be stricter here. There is some recompilation happening when there is a graph break; I'll look into it with @anijain2305 and then add stricter checks.
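
One way the stricter check discussed above could be expressed, as a sketch only. It assumes the generated code for each graph is available as a string and that the op names appear in it as whole identifiers; the helper names, the `EXPECTED_OP_COUNTS` table, and the matching rule are all illustrative and not taken from the PR. The expected counts transcribe the reviewer's description, and per the reply above, the exact graph count may not hold until the recompilation is resolved.

```python
import re

# Expected op counts per graph, in order: fw 1, fw 2, bw (from the review comment).
EXPECTED_OP_COUNTS = [
    {"all_gather_out": 1, "reduce_scatter_out": 0, "set_": 1},
    {"all_gather_out": 0, "reduce_scatter_out": 0, "set_": 1},
    {"all_gather_out": 1, "reduce_scatter_out": 1, "set_": 2},
]


def count_ops(code, op_names):
    """Count whole-identifier occurrences of each op name in a code string."""
    return {
        name: len(re.findall(r"(?<!\w)" + re.escape(name) + r"(?!\w)", code))
        for name in op_names
    }


def check_graphs(codes, expected=EXPECTED_OP_COUNTS):
    """Raise AssertionError unless graph count and per-graph op counts match."""
    if len(codes) != len(expected):
        raise AssertionError(
            f"expected exactly {len(expected)} graphs, got {len(codes)}"
        )
    for index, (code, want) in enumerate(zip(codes, expected)):
        got = count_ops(code, want)
        if got != want:
            raise AssertionError(f"graph {index}: expected {want}, got {got}")
```

Compared with `len(triton_codes) >= 3`, this fails on an unexpected extra graph and also on a graph that contains the wrong collectives.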

bdhirsh

very cool to see graph breaks don't error :)

[ghstack-poisoned]

yf225 · 2024-07-25T20:32:00Z

@pytorchbot merge

pytorchmergebot · 2024-07-25T20:33:47Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

[ghstack-poisoned]

pytorchmergebot · 2024-07-25T20:38:58Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team

Raised by workflow job

yf225 · 2024-07-25T20:53:56Z

@pytorchbot merge

pytorchmergebot · 2024-07-25T20:56:09Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

[ghstack-poisoned]

pytorchmergebot · 2024-07-25T21:22:00Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team

Raised by workflow job

[ghstack-poisoned]

yf225 · 2024-07-25T22:15:46Z

@pytorchbot merge

pytorchmergebot · 2024-07-25T22:17:50Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-07-25T22:39:04Z

Merge failed

Reason: 68 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

yf225 · 2024-07-25T23:53:57Z

@pytorchbot merge

pytorchmergebot · 2024-07-25T23:55:52Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-07-25T23:56:33Z

Merge failed

Reason: 32 jobs have failed, first few of them are: trunk / linux-focal-cuda12.4-py3.10-gcc9-sm86 / build, trunk / pytorch-linux-focal-py3-clang9-android-ndk-r21e-build / build (default, 1, 1, linux.2xlarge), trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu), trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu), trunk / linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build-test / test (distributed, 3, 3, linux.8xlarge.nvidia.gpu)

Details for Dev Infra team

Raised by workflow job

yf225 · 2024-07-26T01:52:56Z

@pytorchbot merge

pytorchmergebot · 2024-07-26T01:54:48Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

Pull Request resolved: #131747 Approved by: https://github.com/bdhirsh (cherry picked from commit 236d055)

ghstack-source-id: ee7a75f Pull Request resolved: pytorch/pytorch#131747

Update

31392c7

[ghstack-poisoned]

pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jul 25, 2024

yf225 requested a review from bdhirsh July 25, 2024 06:41

yf225 mentioned this pull request Jul 25, 2024

[Traceable FSDP2] Add FSDP2 + AC unit tests #131749

Closed

bdhirsh reviewed Jul 25, 2024

View reviewed changes

bdhirsh approved these changes Jul 25, 2024

View reviewed changes

Update

14383ca

[ghstack-poisoned]

yf225 added topic: not user facing topic category oncall: pt2 and removed release notes: distributed (fsdp) release notes category labels Jul 25, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 25, 2024

pytorchmergebot added the merging label Jul 25, 2024

Update

7a96fbf

[ghstack-poisoned]

pytorchmergebot removed the merging label Jul 25, 2024

pytorchmergebot added the merging label Jul 25, 2024

Update

72b6d7c

[ghstack-poisoned]

pytorchmergebot removed the merging label Jul 25, 2024

Update

6a35e66

[ghstack-poisoned]

pytorchmergebot added the merging label Jul 25, 2024

pytorchmergebot removed the merging label Jul 25, 2024

yf225 closed this Jul 25, 2024

yf225 reopened this Jul 25, 2024

yf225 added the keep-going Don't stop on first failure, keep running tests until the end label Jul 25, 2024

pytorchmergebot added the merging label Jul 25, 2024

pytorchmergebot removed the merging label Jul 25, 2024

pytorchmergebot added the merging label Jul 26, 2024

pytorchmergebot added the Merged label Jul 26, 2024

pytorchmergebot closed this in 236d055 Jul 26, 2024

pytorchmergebot removed the merging label Jul 26, 2024

bigfootjon pushed a commit that referenced this pull request Jul 31, 2024

[Traceable FSDP2] Add partial-graph (graph-break) unit tests (#131747)

c4f53ff

Pull Request resolved: #131747 Approved by: https://github.com/bdhirsh (cherry picked from commit 236d055)

henrylhtsang mentioned this pull request Jul 31, 2024

[BE][typing] fix types in common pruning #132309

Closed

github-actions bot deleted the gh/yf225/89/head branch August 26, 2024 02:00

enter-ctrl9 pushed a commit to enter-ctrl9/pytorch11 that referenced this pull request Sep 15, 2024

[Traceable FSDP2] Add partial-graph (graph-break) unit tests

a779add

ghstack-source-id: ee7a75f Pull Request resolved: pytorch/pytorch#131747
