[Dist profiling] Fix flaky tests #56963
Tests such as #50840 and #56690 have been flaky because of their profiling checks. In these tests, the number of distributed collective events recorded by the profiler was occasionally off by one from the expected count.

The root cause was hard to find because the failure was extremely rare. Digging into the profiler's `parse_legacy_records`, I verified that it always returned the correct number of events. However, when `EventList` was constructed from the parsed records, distributed collective events were sometimes filtered out, which reduced their count.

They were filtered out because `_remove_dup_nodes` drops parent/child events that appear, based on some heuristics, to be the same event. The distributed collectives were incorrectly picked up as parent/child events because their time ranges can overlap due to their async nature.

To fix, add a blocklist and bypass the filtering if the event is a distributed collective.

Differential Revision: [D28013540](https://our.internmc.facebook.com/intern/diff/D28013540/)
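To make the failure mode concrete, here is a minimal toy model of the behavior described above. It is not the profiler's code: `Event`, `build_tree`, `remove_dup_nodes`, and the `blocklist` parameter are hypothetical names, and the heuristic shown (collapse a child whose parent has the same name and no other children) is an assumed simplification of what `_remove_dup_nodes` does.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent/children
class Event:
    name: str
    start: int
    end: int
    parent: Optional["Event"] = field(default=None, repr=False)
    children: List["Event"] = field(default_factory=list, repr=False)


def build_tree(events: List[Event]) -> List[Event]:
    """Link each event to an enclosing event, judged purely by time intervals."""
    stack: List[Event] = []
    for ev in sorted(events, key=lambda e: (e.start, -e.end)):
        # Drop stack entries whose interval does not enclose this event.
        while stack and stack[-1].end < ev.end:
            stack.pop()
        if stack:
            ev.parent = stack[-1]
            stack[-1].children.append(ev)
        stack.append(ev)
    return events


def remove_dup_nodes(events: List[Event],
                     blocklist: FrozenSet[str] = frozenset()) -> List[Event]:
    """Collapse a child into its parent when both look like the same event.

    Names listed in `blocklist` (hypothetical parameter) skip the heuristic.
    """
    kept = list(events)
    while True:
        dups = [
            ev for ev in kept
            if ev.name not in blocklist
            and ev.parent is not None
            and ev.parent.name == ev.name
            and len(ev.parent.children) == 1
        ]
        if not dups:
            return kept
        for ev in dups:
            # Splice the duplicate out of the tree.
            ev.parent.children = ev.children
            for child in ev.children:
                child.parent = ev.parent
        dup_ids = {id(ev) for ev in dups}
        kept = [ev for ev in kept if id(ev) not in dup_ids]


def make_events() -> List[Event]:
    # Two independent async "send" calls whose time ranges happen to nest.
    return [Event("send", 0, 10), Event("send", 2, 8)]


without_fix = remove_dup_nodes(build_tree(make_events()))
with_fix = remove_dup_nodes(build_tree(make_events()), frozenset({"send"}))
print(len(without_fix))  # 1: one of the two sends was dropped
print(len(with_fix))     # 2: both sends survive
```

In this model the off-by-one appears only when the two intervals happen to nest, which is consistent with the failure being rare and timing-dependent.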
Thanks, though it would definitely be better to mark these events as async. We have logic that specifically exempts async events from parent-child analysis. Because these events are not marked as async, we recognize them as parent-child, and the rest follows. So I think the right fix is not to add this kind of hacky exception (it is also fragile, in the sense that the list of exemptions has to be maintained), but to address the root cause.
@ilia-cher Sure, that approach is probably cleaner. Although, currently we check |
Tests such as #50840 and #56690 have been flaky because of their profiling checks. In these tests, the number of distributed collective events recorded by the profiler was occasionally off by one from the expected count.

The root cause was hard to find because the failure was extremely rare. Digging into the profiler's `parse_legacy_records`, I verified that it always returned the correct number of events. However, when `EventList` was constructed from the parsed records, distributed collective events were sometimes filtered out, which reduced their count.

They were filtered out because `_remove_dup_nodes` drops parent/child events that appear, based on some heuristics, to be the same event. The distributed collectives were incorrectly picked up as parent/child events because their time ranges can overlap due to their async nature.

To fix, add a blocklist and bypass the filtering if the event is a distributed collective.

Testplan:
To reproduce:
1) SSH into the CircleCI instance that runs distributed_test.py with 3 GPUs.
2) Modify `test_send_recv_autograd_profiler` to run 100+ send/recv calls. Without this patch the issue reproduces; verify that this patch fixes it.

I was able to reproduce the issue more often by doing 100+ send/recv calls in `_test_send_recv`. I verified the fix on a CI machine by running the test thousands of times.

An alternative fix is to mark these distributed collectives as async via the `is_async` flag (because the work/wait() paradigm is basically async). This might be a bit cleaner, as it also bypasses the parent/child heuristics.

Differential Revision: [D28013540](https://our.internmc.facebook.com/intern/diff/D28013540/)
Closing in favor of a new approach per offline discussion.
This PR:
1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)
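The flag-based check in item 3 can be pictured with a small sketch. Everything here is illustrative rather than the profiler's real code: `Record`, `treated_as_async`, and `nesting_candidates` are hypothetical names, and the "old thread check" is assumed, for the sake of the example, to mean that an event started and ended on different threads.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    name: str
    start_thread: int = 0
    end_thread: int = 0
    is_async: bool = False  # explicit flag, set by whoever records the event


def treated_as_async(rec: Record) -> bool:
    """An event is async if it is flagged as such OR the thread check fires.

    The thread comparison is an assumption about what the older check does.
    """
    return rec.is_async or rec.start_thread != rec.end_thread


def nesting_candidates(records: List[Record]) -> List[Record]:
    """Only non-async events take part in parent/child analysis."""
    return [rec for rec in records if not treated_as_async(rec)]


records = [
    Record("aten::add"),                            # ordinary synchronous op
    Record("send", is_async=True),                  # collective flagged async
    Record("callback", start_thread=1, end_thread=2),  # caught by thread check
]
print([rec.name for rec in nesting_candidates(records)])  # ['aten::add']
```

Compared with a name-based blocklist, a flag carried on the event needs no list of exemptions to be maintained, which is the concern raised in review.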
Summary:
Pull Request resolved: #57253

This PR:
1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29