yf225
Contributor

@yf225 yf225 commented Jul 25, 2024

[ghstack-poisoned]

pytorch-bot bot commented Jul 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131747

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6a35e66 with merge base 89bdd9c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (fsdp) release notes category labels Jul 25, 2024
@yf225 yf225 requested a review from bdhirsh July 25, 2024 06:41
            *self._create_transformer_factory_fns(), "aot_eager", fullgraph=True
        )

    def test_transformer_backend_aot_eager(self):
        for fullgraph in [True, False]:
Contributor

nit: @parametrize("graph_break", [True, False]) might make debugging the test easier if e.g. the graph break path fails at some point (example: https://github.com/pytorch/pytorch/blob/main/test/autograd/test_functional.py#L684)

Collaborator

The issue with distributed tests is that using @parametrize directly would initialize a new NCCL process group (PG) for every parametrized test, which is slow. With the loop that @yf225 has, the same NCCL PG is reused for all subtests.

I think @kwen2501 added MultiProcContinousTest to maybe address this. I am not sure exactly what its limitations are, though.
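
To make the trade-off above concrete, here is a minimal sketch using only the standard library's unittest, not the PyTorch distributed test harness. SetupCounter is a hypothetical stand-in for NCCL process-group initialization, and the test class names are made up for illustration. It shows that a loop inside one test method pays for setup once, while separate test methods (which is roughly what per-test parametrization produces) pay once per variant. Wrapping each iteration in subTest is one possible way to keep the shared setup and still see which variant failed.

```python
import unittest


class SetupCounter:
    """Counts calls to a simulated expensive setup (hypothetical stand-in
    for NCCL process-group initialization)."""

    calls = 0

    @classmethod
    def init(cls):
        cls.calls += 1
        return object()


class LoopedVariants(unittest.TestCase):
    """One test method; both variants share a single setup."""

    def test_backend(self):
        pg = SetupCounter.init()
        for fullgraph in [True, False]:
            # subTest tags a failure with the variant that caused it.
            with self.subTest(fullgraph=fullgraph):
                self.assertIsNotNone(pg)


class SeparateVariants(unittest.TestCase):
    """One test method per variant; each pays for its own setup."""

    def test_backend_fullgraph(self):
        self.assertIsNotNone(SetupCounter.init())

    def test_backend_graph_break(self):
        self.assertIsNotNone(SetupCounter.init())


def count_setups(case):
    """Run every test in `case`; return (tests run, setup calls, success)."""
    SetupCounter.calls = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, SetupCounter.calls, result.wasSuccessful()
```

Calling count_setups(LoopedVariants) and count_setups(SeparateVariants) reports how many times the simulated setup ran in each style.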

Contributor

ahh

torch._dynamo.graph_break()
return orig_fn(*args, **kwargs)

def _mock_sdpa(self, fullgraph):
Contributor

nit: maybe the name should imply that you're just conditionally adding a graph break to sdpa? e.g. with _maybe_add_graph_break_to_sdpa(...):
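
A rough sketch of what such a conditionally patching context manager could look like, written with the standard library only. Everything here is a stand-in: ops.sdpa plays the role of the attention function being patched, and fake_graph_break is a placeholder for torch._dynamo.graph_break(), which is not imported. This is not the code from the PR, just an illustration of the naming idea.

```python
import contextlib
import types
from unittest import mock

# Stand-in namespace for the module that owns the attention function.
ops = types.SimpleNamespace(sdpa=lambda q: q * 2)
graph_breaks = []  # records each simulated graph break


def fake_graph_break():
    """Placeholder for torch._dynamo.graph_break() (not imported here)."""
    graph_breaks.append("break")


@contextlib.contextmanager
def maybe_add_graph_break_to_sdpa(fullgraph):
    """Patch ops.sdpa to break the graph first, unless fullgraph is set."""
    if fullgraph:
        # Full-graph mode: leave the function untouched.
        yield
        return
    orig_fn = ops.sdpa

    def sdpa_with_graph_break(*args, **kwargs):
        fake_graph_break()
        return orig_fn(*args, **kwargs)

    # patch.object restores the original attribute on exit.
    with mock.patch.object(ops, "sdpa", sdpa_with_graph_break):
        yield
```

With fullgraph=True the body runs against the unmodified function; with fullgraph=False every call inside the with block first records a simulated graph break and then delegates to the original.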

                ).run(code)
            else:
                self.assertTrue(
                    len(triton_codes) >= 3,
Contributor

do you think it would be useful to be a bit stricter about what we assert? If I understand properly, it is something like:

3 graphs total (2 fw graphs due to the graph break, 1 unified bw graph due to compiled autograd)

fw 1: the only comm is an all_gather_out, plus one set_() (performs the weight AG, but does not free the weight)
fw 2: no comms, only contains a set_() (just finishes the fw compute and frees the weight)
bw: contains all_gather_out, reduce_scatter_out, and 2 set_() ops (fully gathers the weight, does the bw compute, and frees)

(or at least, why is the assert >=3 and not just == 3?)
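
For illustration, a stricter check along the lines of the breakdown above could be sketched as follows. This assumes the generated code for each graph is available as a string and simply counts substrings; the helper name check_graph_ops, the EXPECTED_OPS table, and the toy strings are all hypothetical, and the actual test would presumably use its existing checking machinery on real Triton output instead.

```python
# Expected op counts per graph, taken from the breakdown above.
EXPECTED_OPS = [
    {"all_gather_out": 1, "reduce_scatter_out": 0, "set_(": 1},  # fw 1
    {"all_gather_out": 0, "reduce_scatter_out": 0, "set_(": 1},  # fw 2
    {"all_gather_out": 1, "reduce_scatter_out": 1, "set_(": 2},  # bw
]


def check_graph_ops(codes, expected=EXPECTED_OPS):
    """Raise AssertionError unless `codes` matches `expected` exactly."""
    if len(codes) != len(expected):
        raise AssertionError(
            f"expected {len(expected)} graphs, got {len(codes)}"
        )
    for index, (code, wanted) in enumerate(zip(codes, expected)):
        for op, count in wanted.items():
            found = code.count(op)
            if found != count:
                raise AssertionError(
                    f"graph {index}: expected {count} x {op!r}, found {found}"
                )
    return True


# Toy stand-ins for generated code; real Triton output looks different.
TOY_CODES = [
    "all_gather_out(w); set_(w)",
    "set_(w)",
    "all_gather_out(w); set_(w); reduce_scatter_out(g); set_(w)",
]
```

An exact-count table like this fails on an extra graph from a recompile, which is the case the >= 3 assertion currently tolerates.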

Contributor Author

yes, I think we want to be stricter here. There is some recompilation happening when there is a graph break; I'll look into it with @anijain2305 and then add stricter checks.

Contributor

@bdhirsh bdhirsh left a comment

very cool to see graph breaks don't error :)

[ghstack-poisoned]
@yf225 yf225 added topic: not user facing topic category oncall: pt2 and removed release notes: distributed (fsdp) release notes category labels Jul 25, 2024
@yf225
Contributor Author

yf225 commented Jul 25, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 25, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

Details for Dev Infra team Raised by workflow job

@yf225
Contributor Author

yf225 commented Jul 25, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

[ghstack-poisoned]
@yf225
Contributor Author

yf225 commented Jul 25, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

@pytorchmergebot
Collaborator

@yf225 yf225 closed this Jul 25, 2024
@yf225 yf225 reopened this Jul 25, 2024
@yf225 yf225 added the keep-going Don't stop on first failure, keep running tests until the end label Jul 25, 2024
@yf225
Contributor Author

yf225 commented Jul 25, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

@yf225
Contributor Author

yf225 commented Jul 26, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

bigfootjon pushed a commit that referenced this pull request Jul 31, 2024
@github-actions github-actions bot deleted the gh/yf225/89/head branch August 26, 2024 02:00
enter-ctrl9 pushed a commit to enter-ctrl9/pytorch11 that referenced this pull request Sep 15, 2024
Labels

ciflow/trunk Trigger trunk jobs on your pull request keep-going Don't stop on first failure, keep running tests until the end Merged oncall: distributed Add this issue/PR to distributed oncall triage queue oncall: pt2 topic: not user facing topic category

4 participants