[ROCm] Enable test_multiprocessing tests #82356
Conversation
If world_size is less than or equal to the number of GPUs available, each rank can be mapped directly to the corresponding GPU. This fixes the issue referenced in pytorch#45435 and pytorch#47629.

For world_size = 3 and 8 GPUs, the current rank-to-GPU mapping is 0,2,4. With the introduction of the barrier (refer to pytorch#45181), the barrier tensors are mapped to cuda:0,1,2 while the tensors in the actual test cases are mapped to cuda:0,2,4, resulting in different streams and leading to a timeout. The issue is specific to the default process group; it is not observed in a new process group since the streams are created again after the initial barrier call. This patch maps each rank to the corresponding GPU when world_size is less than or equal to the number of GPUs, in this case 0,1,2 (see the sketch after the sign-off lines below).

Note: The barrier function in distributed_c10d.py should take a new parameter to specify the tensor or rank-to-GPU mapping. In that case, this patch would be redundant but harmless, since the tests could specify tensors with the appropriate GPU rankings.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
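For illustration, here is a minimal sketch of the rank-to-device mapping described above. The helper name `device_for_rank` is hypothetical and this is not the code in the PR, just an assumption-laden restatement of the idea:

```python
# Minimal sketch (hypothetical helper, not the PR's code) of the mapping
# described in the PR description: when world_size is less than or equal to
# the number of visible GPUs, rank i simply uses cuda:i, so the barrier
# tensors and the test tensors land on the same devices and streams.
import torch

def device_for_rank(rank: int, world_size: int) -> torch.device:
    num_gpus = torch.cuda.device_count()
    if num_gpus == 0:
        # No GPUs visible; fall back to CPU (sketch only).
        return torch.device("cpu")
    if world_size <= num_gpus:
        # Direct mapping: rank 0 -> cuda:0, rank 1 -> cuda:1, rank 2 -> cuda:2.
        return torch.device("cuda", rank)
    # More ranks than GPUs: wrap ranks around the available devices.
    return torch.device("cuda", rank % num_gpus)
```

With world_size = 3 and 8 GPUs this yields cuda:0, cuda:1 and cuda:2, matching the devices used by the barrier tensors.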
❌ 4 New Failures as of commit c79240a (more details on the Dr. CI page).
🕵️ 1 new failure recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
2022-07-27T21:39:35.4539856Z test_event_handle_exporter (__main__.TestMultiprocessing) ... ok (5.014s)
LGTM. ROCm CI is green. 4 test failures are unrelated to this PR.
We still need upstream approval.
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here.
Merge failed
…in/enable_multiprocessing
2022-08-17T19:34:08.6449003Z test_event_handle_exporter (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here.
Merge failed
…in/enable_multiprocessing
@pytorchbot merge
@pytorchbot successfully started a merge job. Check the current status here.
Merge failed
I see 4 failing checks in the report. Three of the jobs are failing at the "Install nvidia driver" step. macos-12-py3-x86-64 / test (default, 2, 2, macos-12) --> The hosted runner: GitHub Actions 50 lost communication with the server. I do not see ROCm related failures.
@pytorchbot merge -f "unrelated CI failures"
@pytorchbot successfully started a merge job. Check the current status here.
Hey @jaglinux. |
Summary:
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
Issue fixed in ROCm 5.2 user space.
Pull Request resolved: #82356
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/huydhn
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f5bfa4d0888e6cd5984092b38cb8b10609558d05
Reviewed By: weiwangmeta
Differential Revision: D39008147
Pulled By: weiwangmeta
fbshipit-source-id: 39e3aa6cb6329bb3c2a53c0ddbe71a084dc1e55e