Skip to content

Conversation

rohan-varma
Copy link
Contributor

@rohan-varma rohan-varma commented Apr 26, 2021

Stack from ghstack:

Tests such as #50840 and #56690 have been flaky due to the profiling part.

What was happening is that in these tests, we occasionally had inaccurate (off by one) counts of the expected no. of distributed collectives events expected to be profiled.

It was challenging to find the root cause of this as the issue was extremely rare. Digging into profiler's parse_legacy_records, I verified the correct no. of events were alway returned. However, when constructing EventList after parsing records, sometimes distributed collective events were getting filtered out leading to a reduction in their count.

The reason they were getting filtered out is because _remove_dup_nodes filters out parent/child events that appear to be the same event based on some heuristics. However these distributed collectives were incorrectly getting picked up as parent/child events since their times can overlap due to their async nature. To fix, add a blocklist and bypass the filtering if it is a distributed collective.

Testplan:

To reproduce:

  1. SSH into circleCI instance that runs distributed_test.py with 3 GPUs
  2. Modify test_send_recv_autograd_profiler to run 100+ send/recv calls. Without this patch the issue reproduces, verify that this patch fixes the issue.
    Was able to reproduce the issue more often by doing 100+ send/recv calls in _test_send_recv. Verified this fix on CI machine by running the test thousands of times.

An alternative fix is to mark these distributed collectives as async via the is_async flag (because work/wait() paradigm is basically async). This might be a bit cleaner as it also bypasses the parent/child heuristics.

Differential Revision: D28013540

Tests such as #50840 and #56690 have been flaky due to the profiling part.

What was happening is that in these tests, we occasionally had inaccurate (off by one) counts of the expected no. of distributed collectives events expected to be profiled.

It was challenging to find the root cause of this as the issue was extremely rare. Digging into profiler's `parse_legacy_records`, I verified the correct no. of events were alway returned. However, when constructing `EventList` after parsing records, sometimes distributed collective events were getting filtered out leading to a reduction in their count.

The reason they were getting filtered out is because `_remove_dup_nodes` filters out parent/child events that appear to be the same event based on some heuristics. However these distributed collectives were incorrectly getting picked up as parent/child events since their times can overlap due to their async nature. To fix, add a blocklist and bypass the filtering if it is a distributed collective.

Differential Revision: [D28013540](https://our.internmc.facebook.com/intern/diff/D28013540/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Apr 26, 2021

💊 CI failures summary and remediations

As of commit 278a581 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job Step Action
CircleCI binary_linux_libtorch_3_7m_cpu_devtoolset7_shared-with-deps_build Build 🔁 rerun

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

rohan-varma added a commit that referenced this pull request Apr 26, 2021
Tests such as #50840 and #56690 have been flaky due to the profiling part.

What was happening is that in these tests, we occasionally had inaccurate (off by one) counts of the expected no. of distributed collectives events expected to be profiled.

It was challenging to find the root cause of this as the issue was extremely rare. Digging into profiler's `parse_legacy_records`, I verified the correct no. of events were alway returned. However, when constructing `EventList` after parsing records, sometimes distributed collective events were getting filtered out leading to a reduction in their count.

The reason they were getting filtered out is because `_remove_dup_nodes` filters out parent/child events that appear to be the same event based on some heuristics. However these distributed collectives were incorrectly getting picked up as parent/child events since their times can overlap due to their async nature. To fix, add a blocklist and bypass the filtering if it is a distributed collective.

Differential Revision: [D28013540](https://our.internmc.facebook.com/intern/diff/D28013540/)

ghstack-source-id: 127450631
Pull Request resolved: #56963
@ilia-cher
Copy link
Contributor

thanks, though it would be def. better to mark these events as async

We have logic that specifically exempts async events from parent-child analysis. Since these events are not marked as async, we recognize them as parent-child and the rest follows.

So, i think the right fix is not to add these kinds of hacky exceptions (it's also fragile in a sense that this list of exemptions needs to be maintained), but to address the root cause.

@rohan-varma
Copy link
Contributor Author

@ilia-cher Sure, that approach is probably cleaner. Although, currently we check is_async by checking if the record function start/end callbacks have different thread IDs. For profiling these distributed collectives that is not the case (its only true for RPC). So I guess we would still have some sort of list needed to mark these events as async?

Tests such as #50840 and #56690 have been flaky due to the profiling part.

What was happening is that in these tests, we occasionally had inaccurate (off by one) counts of the expected no. of distributed collectives events expected to be profiled.

It was challenging to find the root cause of this as the issue was extremely rare. Digging into profiler's `parse_legacy_records`, I verified the correct no. of events were alway returned. However, when constructing `EventList` after parsing records, sometimes distributed collective events were getting filtered out leading to a reduction in their count.

The reason they were getting filtered out is because `_remove_dup_nodes` filters out parent/child events that appear to be the same event based on some heuristics. However these distributed collectives were incorrectly getting picked up as parent/child events since their times can overlap due to their async nature. To fix, add a blocklist and bypass the filtering if it is a distributed collective.

Testplan:

To reproduce:
1) SSH into circleCI instance that runs distributed_test.py with 3 GPUs
2) Modify `test_send_recv_autograd_profiler` to run 100+ send/recv calls. Without this patch the issue reproduces, verify that this patch fixes the issue. 
Was able to reproduce the issue more often by doing 100+ send/recv calls in _test_send_recv. Verified this fix on CI machine by running the test thousands of times.

An alternative fix is to mark these distributed collectives as async via the `is_async` flag (because work/wait() paradigm is basically async). This might be a bit cleaner as it also bypasses the parent/child heuristics. 

Differential Revision: [D28013540](https://our.internmc.facebook.com/intern/diff/D28013540/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 28, 2021
Pull Request resolved: #56963

Tests such as #50840 and #56690 have been flaky due to the profiling part.

What was happening is that in these tests, we occasionally had inaccurate (off by one) counts of the expected no. of distributed collectives events expected to be profiled.

It was challenging to find the root cause of this as the issue was extremely rare. Digging into profiler's `parse_legacy_records`, I verified the correct no. of events were alway returned. However, when constructing `EventList` after parsing records, sometimes distributed collective events were getting filtered out leading to a reduction in their count.

The reason they were getting filtered out is because `_remove_dup_nodes` filters out parent/child events that appear to be the same event based on some heuristics. However these distributed collectives were incorrectly getting picked up as parent/child events since their times can overlap due to their async nature. To fix, add a blocklist and bypass the filtering if it is a distributed collective.

Testplan:
Was able to reproduce the issue more often by doing 100+ send/recv calls in `_test_send_recv`. Verified this fix on CI machine by running the test thousands of times.
ghstack-source-id: 127677260

Differential Revision: [D28013540](https://our.internmc.facebook.com/intern/diff/D28013540/)
@rohan-varma rohan-varma requested a review from ilia-cher April 28, 2021 22:34
@rohan-varma
Copy link
Contributor Author

Closing in favor of a new approach per offline discussion

rohan-varma added a commit that referenced this pull request Apr 29, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 29, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

ghstack-source-id: 127712736
Pull Request resolved: #57253
rohan-varma added a commit that referenced this pull request Apr 29, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 29, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 29, 2021
Pull Request resolved: #57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).
ghstack-source-id: 127751594

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)
rohan-varma added a commit that referenced this pull request Apr 30, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 30, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 30, 2021
Pull Request resolved: #57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).
ghstack-source-id: 127888967

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)
rohan-varma added a commit that referenced this pull request May 2, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 2, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 2, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 2, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 2, 2021
Pull Request resolved: #57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).
ghstack-source-id: 127937472

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)
rohan-varma added a commit that referenced this pull request May 3, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 3, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 4, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 4, 2021
This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 4, 2021
Pull Request resolved: #57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Differential Revision: [D28086719](https://our.internmc.facebook.com/intern/diff/D28086719/)
facebook-github-bot pushed a commit that referenced this pull request May 5, 2021
Summary:
Pull Request resolved: #57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as #50840 and #56690 which have been flaky due to the profiling part (#56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#57253

This PR:

1. Adds is_async getter/setter to RecordFunction
2. Adds is_async field to LegacyEvent and KinetoEvent, read from RecordFunction
3. Modifies python profiler code to check is_async via this flag (and keeps the old thread check as well)
4. Sets profiling of c10d collectives as async in ProcessGroup.cpp
5. Modifies tests to ensure is_async is set

This also fixes flaky tests such as pytorch#50840 and pytorch#56690 which have been flaky due to the profiling part (pytorch#56963 tried to do so as well but this is a better approach).
ghstack-source-id: 128021158

Test Plan: CI

Reviewed By: walterddr, ilia-cher

Differential Revision: D28086719

fbshipit-source-id: 4473db4aed939a71fbe9db5d6655f3008347cb29
@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/300/head branch May 29, 2021 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants