Separate profiling tests from p2p tests #56412
We are investigating some flaky profiling tests such as #56146. One issue is that the profiling tests are tightly coupled to the send/recv tests, so if one of these tests is disabled, we lose signal on the send/recv collective tests. To mitigate this, separate the tests into ones that only test send/recv and ones that test it with profiling. This way, flakiness in profiling should not result in the send/recv-only tests being disabled. Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)
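The split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `_FakeProfiler`, the in-memory `mailbox`, and the class and method names are hypothetical stand-ins so that the example runs without a distributed setup. It shows a shared test body driven by a `profiler_enabled` flag, with two thin test methods, so that skipping the profiling variant leaves the send/recv-only variant running.

```python
import contextlib
import unittest


class _FakeProfiler:
    """Hypothetical stand-in for a profiler; records names of profiled ops."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # do not swallow exceptions


class SendRecvTest(unittest.TestCase):
    def _test_send_recv(self, profiler_enabled):
        # Shared body: the p2p checks are identical in both variants.
        ctx = _FakeProfiler() if profiler_enabled else contextlib.nullcontext()
        mailbox = []  # stand-in for the transport between two ranks
        with ctx as prof:
            mailbox.append([1.0, 2.0])  # "send"
            received = mailbox.pop(0)   # "recv"
            if prof is not None:
                prof.events.extend(["send", "recv"])
        self.assertEqual(received, [1.0, 2.0])
        if profiler_enabled:
            # Profiling-specific assertions live only on this path.
            self.assertEqual(prof.events, ["send", "recv"])

    def test_send_recv(self):
        self._test_send_recv(profiler_enabled=False)

    # Simulates the flaky profiling variant being disabled on its own.
    @unittest.skip("flaky profiling test disabled; p2p-only test still runs")
    def test_send_recv_with_profiler(self):
        self._test_send_recv(profiler_enabled=True)


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SendRecvTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_suite()` reports one skipped test while the send/recv-only test still executes, which is the signal the description wants to preserve.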
💊 CI failures summary and remediations
As of commit 6064935 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patterns
The following CI failures do not appear to be due to upstream breakages: pytorch_windows_vs2019_py36_cuda11.1_build (1/1), step: "Install Cuda"
rank = dist.get_rank()
send_size = rank + 1
tensor = _build_tensor(send_size)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
profile_ctx = (
this occurs multiple times, shall we dedup this?
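One way the repeated `profile_ctx = (...)` construction could be deduplicated is a small helper, sketched below. The name `get_profile_ctx` is hypothetical and not taken from this PR; the only library calls assumed are `contextlib.nullcontext` and `torch.autograd.profiler.profile(record_shapes=True)`, the latter as it appears in the diff.

```python
import contextlib


def get_profile_ctx(profiler_enabled, record_shapes=True):
    """Return a profiler context if enabled, otherwise a no-op context.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not profiler_enabled:
        # nullcontext() yields None, so `with ... as prof` gives prof = None.
        return contextlib.nullcontext()
    # Imported lazily so the disabled path does not depend on torch.
    import torch

    return torch.autograd.profiler.profile(record_shapes=record_shapes)
```

Each test body can then write `with get_profile_ctx(profiler_enabled) as prof:` and guard profiling assertions with `if prof is not None:`.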
frequency = [len(list(group)) for key, group in groupby(global_recv_ranks_list)]
self.assertEqual(dist.get_world_size(), len(frequency))
self.assertEqual([2 * (dist.get_world_size() - 1)] * dist.get_world_size(), frequency)
self._barrier()
this barrier is only needed for the profiler enabled case?
It looks like test_send_recv_nccl has this without profiling tests, hence thought we should keep it here as well. Although maybe that test copied the barrier from this test and it's not needed when profiler is not enabled.
Actually, looking at this a bit more, I don't think _barrier() is needed at all, since send/recv are issued in a blocking fashion and _barrier doesn't provide anything extra here. I think it's better to just remove it.
Stamp to unblock.
This pull request has been merged in 04de24d.
Summary:
Pull Request resolved: pytorch#56412

We are investigating some flaky profiling tests such as pytorch#56146. One issue is that the profiling tests are tightly coupled to the send/recv tests, so if one of these tests is disabled, we lose signal on the send/recv collective tests.

To mitigate this, separate the tests into ones that only test send/recv and ones that test it with profiling. This way, flakiness in profiling should not result in the send/recv-only tests being disabled.

ghstack-source-id: 126920867
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D27864845
fbshipit-source-id: 01f04a884482ec7741323218a7f8f4a8451eb4ae