
Conversation

@jaglinux (Contributor) commented Jul 27, 2022

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Issue fixed in ROCm 5.2 user space.

jaglinux and others added 12 commits November 13, 2020 04:45
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issues referenced in pytorch#45435 and pytorch#47629.

For world_size = 3 and 8 GPUs, the rank-to-GPU mapping
will be 0, 2, 4. This is due to the barrier introduced in pytorch#45181:
the barrier tensors are mapped to cuda:0, cuda:1, cuda:2, while the tensors in the
actual test cases are mapped to cuda:0, cuda:2, cuda:4, resulting in different streams and
leading to a timeout. This issue is specific to the default process group.
The issue is not observed in a new process group, since the streams are created again
after the initial barrier call.

This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0, 1, 2.
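
For illustration only, here is a minimal sketch of the mapping described above; it is not the
actual patch, and the helper name (rank_to_gpu) and the wrap-around fallback are assumptions:

    import torch

    def rank_to_gpu(rank: int, world_size: int) -> int:
        """Illustrative only: map a process rank to a CUDA/ROCm device index."""
        num_gpus = torch.cuda.device_count()
        if num_gpus == 0:
            raise RuntimeError("no CUDA/ROCm devices available")
        if world_size <= num_gpus:
            # Behaviour described by the patch: rank i uses cuda:i, so for
            # world_size = 3 the test tensors land on cuda:0, cuda:1, cuda:2 --
            # the same devices the default-process-group barrier uses.
            return rank
        # Assumed fallback when there are more ranks than GPUs: wrap around.
        return rank % num_gpus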

Note: The barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or rank-to-GPU mapping (see the sketch after the sign-offs below).
In that case, this patch would be redundant but harmless, since the tests could place
their tensors on the appropriate GPUs.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
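
As a rough illustration of that note: newer PyTorch releases expose a device_ids argument on
torch.distributed.barrier (NCCL backend only). Whether it is available depends on the PyTorch
version, and the surrounding setup below (rank retrieval, device selection) is assumed:

    import torch
    import torch.distributed as dist

    # Assumed setup: an NCCL process group has already been initialized.
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Pinning the barrier to this rank's GPU keeps the barrier tensor on the
    # same device (and stream) as the test tensors, avoiding the mismatch
    # described above.
    dist.barrier(device_ids=[rank])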
@pytorch-bot added the module: rocm (AMD GPU support for Pytorch) label on Jul 27, 2022
@facebook-github-bot commented Jul 27, 2022

❌ 4 New Failures

As of commit c79240a (more details on the Dr. CI page):

  • 4/4 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / macos-12-py3-x86-64 / test (default, 2, 2, macos-12) (1/1)

Step: "Unknown" (full log | diagnosis details)

2022-08-23T02:31:15.2693800Z   test_attribute_serialization (__main__.TestScript) ... ok (0.018s)
2022-08-23T02:31:15.2874590Z   test_attribute_unpickling (__main__.TestScript) ... ok (0.018s)
2022-08-23T02:31:15.2984480Z   test_augmented_assign (__main__.TestScript) ... ok (0.011s)
2022-08-23T02:31:15.3005680Z   test_autodiff_complex (__main__.TestScript) ... skip: no CUDA (0.002s)
2022-08-23T02:31:15.3049190Z   test_backend_cudnn_enabled (__main__.TestScript) ... ok (0.004s)
2022-08-23T02:31:15.3089080Z   test_bad_multiline_annotations (__main__.TestScript) ... ok (0.004s)
2022-08-23T02:31:15.3234960Z   test_bailout_loop_carried_deps_name_clash (__main__.TestScript) ... ok (0.014s)
2022-08-23T02:31:15.3367630Z   test_bailout_loop_counter_transition (__main__.TestScript) ... ok (0.013s)
2022-08-23T02:31:15.3404680Z   test_batch_norm_inference_backward_cuda (__main__.TestScript) ... skip: running tests on cuda to verify cudnn fix (0.004s)
2022-08-23T02:31:15.4639380Z   test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
2022-08-23T02:31:15.4641770Z clang: error: unsupported option '-fopenmp'
2022-08-23T02:31:15.4768450Z warning: pytorch jit fuser failed to compile with openmp, trying without it...
2022-08-23T02:31:15.4770800Z You have not run this instance of FileCheck!
2022-08-23T02:31:15.4771820Z FileCheck checks:
2022-08-23T02:31:17.0515000Z ok (1.711s)
2022-08-23T02:31:17.0572340Z   test_big_float_literals (__main__.TestScript) ... ok (0.006s)
2022-08-23T02:31:17.0695190Z   test_big_int_literals (__main__.TestScript) ... ok (0.012s)
2022-08-23T02:31:17.3916400Z   test_binary_op_shape (__main__.TestScript) ... ok (0.322s)
2022-08-23T02:31:17.4276640Z   test_bitwise_ops (__main__.TestScript) ... ok (0.036s)
2022-08-23T02:31:17.4387250Z   test_block_input_grad_in_loop (__main__.TestScript) ... ok (0.011s)
2022-08-23T02:31:17.4509620Z   test_bool_augassign_bitwise_and (__main__.TestScript) ... ok (0.012s)

🕵️‍♀️ 3 failures not recognized by patterns:

The following CI failures may be due to changes from the PR
Job | Step
GitHub Actions periodic / linux-bionic-cuda10.2-py3.9-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu) | Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
GitHub Actions trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu) | Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
GitHub Actions trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu) | Install nvidia driver, nvidia-docker runtime, set GPU_FLAG

This comment was automatically generated by Dr. CI.

@jaglinux (Contributor, Author) commented:

@jeffdaily added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jul 27, 2022
@pruthvistony added the ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR) label on Jul 27, 2022
@jaglinux (Contributor, Author) commented:

2022-07-27T21:39:35.4539856Z test_event_handle_exporter (__main__.TestMultiprocessing) ... ok (5.014s)
2022-07-27T21:39:40.2939614Z test_event_handle_importer (__main__.TestMultiprocessing) ... ok (4.840s)
2022-07-27T21:39:44.1968841Z test_event_multiprocess (__main__.TestMultiprocessing) ... ok (3.898s)

@jeffdaily (Collaborator) left a comment:

LGTM. ROCm CI is green. 4 test failures are unrelated to this PR.

We still need upstream approval.

@jeffdaily requested a review from janeyx99 on August 1, 2022 15:21
@jithunnair-amd (Collaborator) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot (Collaborator) commented:

Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2876180459

@jaglinux (Contributor, Author) commented:

2022-08-17T19:34:08.6449003Z test_event_handle_exporter (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
2022-08-17T19:34:08.6449546Z warnings.warn(f"loaded {len(slow_tests_dict)} slow tests")
2022-08-17T19:34:08.6450211Z /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:127: UserWarning: loaded 238 disabled tests
2022-08-17T19:34:08.6450760Z warnings.warn(f"loaded {len(disabled_tests_dict)} disabled tests")
2022-08-17T19:34:12.3298069Z ok (5.001s)
2022-08-17T19:34:13.7041800Z test_event_handle_importer (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
2022-08-17T19:34:13.7043275Z warnings.warn(f"loaded {len(slow_tests_dict)} slow tests")
2022-08-17T19:34:13.7044910Z /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:127: UserWarning: loaded 238 disabled tests
2022-08-17T19:34:13.7046187Z warnings.warn(f"loaded {len(disabled_tests_dict)} disabled tests")
2022-08-17T19:34:16.8561133Z ok (4.526s)
2022-08-17T19:34:16.8598343Z test_event_handle_multi_gpu (__main__.TestMultiprocessing) ... skip: found only 1 GPU (0.004s)
2022-08-17T19:34:18.2344205Z test_event_multiprocess (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
2022-08-17T19:34:18.2345403Z warnings.warn(f"loaded {len(slow_tests_dict)} slow tests")
2022-08-17T19:34:18.2346859Z /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:127: UserWarning: loaded 238 disabled tests
2022-08-17T19:34:18.2348050Z warnings.warn(f"loaded {len(disabled_tests_dict)} disabled tests")
2022-08-17T19:34:20.7152070Z ok (3.855s)

cc @jithunnair-amd

@jithunnair-amd (Collaborator) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot (Collaborator) commented:

Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2907508911

@jithunnair-amd (Collaborator) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot (Collaborator) commented:

@jaglinux (Contributor, Author) commented:

I see 4 failing checks in the report.

The 3 jobs below are failing at the "Install nvidia driver" step:
linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)
linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)
linux-bionic-cuda10.2-py3.9-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu)

macos-12-py3-x86-64 / test (default, 2, 2, macos-12) --> The hosted runner: GitHub Actions 50 lost communication with the
server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I do not see any ROCm-related failures.

@jithunnair-amd (Collaborator) commented:

@pytorchbot merge -f "unrelated CI failures"

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the force (-f) flag. This means your change will be merged immediately, bypassing any CI checks (ETA: 1-5 minutes). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions (Contributor) commented:

Hey @jaglinux.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 26, 2022
Summary:
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Issue fixed in ROCm 5.2 user space.

Pull Request resolved: #82356
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/huydhn

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f5bfa4d0888e6cd5984092b38cb8b10609558d05

Reviewed By: weiwangmeta

Differential Revision: D39008147

Pulled By: weiwangmeta

fbshipit-source-id: 39e3aa6cb6329bb3c2a53c0ddbe71a084dc1e55e

Labels

ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
cla signed
Merged
module: rocm (AMD GPU support for Pytorch)
open source
