Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate profiling tests from p2p tests #56412

Closed
wants to merge 5 commits into from

Conversation

rohan-varma
Copy link
Member

@rohan-varma rohan-varma commented Apr 19, 2021

Stack from ghstack:

We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.

Differential Revision: D27864845

We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Apr 19, 2021

💊 CI failures summary and remediations

As of commit 6064935 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda11.1_build (1/1)

Step: "Install Cuda" (full log | diagnosis details | 🔁 rerun)

ls: cannot access '/c/Program Files/NVIDIA GPU ...UDA/v11.1/bin/nvcc.exe': No such file or directory

Folders: 11
Files: 130
Size:       907512
Compressed: 111420
+ mkdir -p 'C:/Program Files/NVIDIA Corporation/NvToolsExt'
+ cp -r NvToolsExt/bin NvToolsExt/docs NvToolsExt/include NvToolsExt/lib NvToolsExt/samples 'C:/Program Files/NVIDIA Corporation/NvToolsExt/'
+ export 'NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\'
+ NVTOOLSEXT_PATH='C:\Program Files\NVIDIA Corporation\NvToolsExt\'
+ ls '/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/bin/nvcc.exe'
ls: cannot access '/c/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v11.1/bin/nvcc.exe': No such file or directory
+ echo 'CUDA installation failed'
CUDA installation failed
+ mkdir -p /c/w/build-results
+ 7z a 'c:\w\build-results\cuda_install_logs.7z' cuda_install_logs

7-Zip 19.00 (x64) : Copyright (c) 1999-2018 Igor Pavlov : 2019-02-21

Scanning the drive:
1 folder, 2 files, 3611791 bytes (3528 KiB)


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Apr 19, 2021
We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)

[ghstack-poisoned]
rank = dist.get_rank()
send_size = rank + 1
tensor = _build_tensor(send_size)
with torch.autograd.profiler.profile(record_shapes=True) as prof:
profile_ctx = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this occurs multiple times, shall we dedup this?

frequency = [len(list(group)) for key, group in groupby(global_recv_ranks_list)]
self.assertEqual(dist.get_world_size(), len(frequency))
self.assertEqual([2 * (dist.get_world_size() - 1)] * dist.get_world_size(), frequency)
self._barrier()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this barrier is only needed for the profiler enabled case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like test_send_recv_nccl has this without profiling tests, hence thought we should keep it here as well. Although maybe that test copied the barrier from this test and it's not needed when profiler is not enabled.

Actually looking at this a bit more, I don't think _barrier() is needed at all since send/recv are issued in blocking fashion and _barrier doesn't provide anything extra here. I think it's better to just remove it.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stamp to unblock.

We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)

[ghstack-poisoned]
We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 19, 2021
Pull Request resolved: #56412

We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.
ghstack-source-id: 126894186

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)
We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 20, 2021
Pull Request resolved: #56412

We are investigating some flaky profiiling tests such as #56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.
ghstack-source-id: 126920867

Differential Revision: [D27864845](https://our.internmc.facebook.com/intern/diff/D27864845/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 04de24d.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/294/head branch April 24, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56412

We are investigating some flaky profiiling tests such as pytorch#56146. One issue is that the profiling tests are tightly coupled to these send/recv tests, hence if this test is disabled, we lose signal round send/recv collectives tests.

To mitigate this, separate the tests into ones that only test send/recv, and ones that test it with profiling. This way flakiness should not result in the send/recv only tests being disabled.
ghstack-source-id: 126920867

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27864845

fbshipit-source-id: 01f04a884482ec7741323218a7f8f4a8451eb4ae
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants