
Conversation

@jaglinux (Contributor) commented Jul 27, 2022

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Issue fixed in ROCm 5.2 user space.

jaglinux and others added 12 commits November 13, 2020 04:45
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issues referenced in pytorch#45435 and pytorch#47629.

For world_size = 3 and 8 GPUs, the rank-to-GPU mapping
will be 0, 2, 4. This is due to the barrier introduced in pytorch#45181:
the barrier tensors are mapped to cuda:0, cuda:1, cuda:2, while the tensors in the
actual test cases are mapped to cuda:0, cuda:2, cuda:4, resulting in different streams and
leading to a timeout. This issue is specific to the default process group.
The issue is not observed in a new process group, since the streams are created again
after the initial barrier call.

This patch maps each rank to the corresponding GPU when world_size is
less than or equal to the number of GPUs, in this case 0, 1, 2.
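
For illustration only, here is a minimal sketch of the mapping described above; it is not the
actual patch, and the helper name (rank_to_gpu) and the wrap-around fallback are assumptions:

    import torch

    def rank_to_gpu(rank: int, world_size: int) -> int:
        """Illustrative only: map a process rank to a CUDA/ROCm device index."""
        num_gpus = torch.cuda.device_count()
        if num_gpus == 0:
            raise RuntimeError("no CUDA/ROCm devices available")
        if world_size <= num_gpus:
            # Behaviour described by the patch: rank i uses cuda:i, so for
            # world_size = 3 the test tensors land on cuda:0, cuda:1, cuda:2 --
            # the same devices the default-process-group barrier uses.
            return rank
        # Assumed fallback when there are more ranks than GPUs: wrap around.
        return rank % num_gpus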

Note: The barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or rank-to-GPU mapping (see the sketch after the sign-offs below).
In that case, this patch would be redundant but harmless, since the tests could place
their tensors on the appropriate GPUs.
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>
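
As a rough illustration of that note: newer PyTorch releases expose a device_ids argument on
torch.distributed.barrier (NCCL backend only). Whether it is available depends on the PyTorch
version, and the surrounding setup below (rank retrieval, device selection) is assumed:

    import torch
    import torch.distributed as dist

    # Assumed setup: an NCCL process group has already been initialized.
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Pinning the barrier to this rank's GPU keeps the barrier tensor on the
    # same device (and stream) as the test tensors, avoiding the mismatch
    # described above.
    dist.barrier(device_ids=[rank])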
@pytorch-bot added the module: rocm (AMD GPU support for Pytorch) label on Jul 27, 2022
@facebook-github-bot commented Jul 27, 2022

❌ 4 New Failures

As of commit c79240a (more details on the Dr. CI page):

  • 4/4 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build trunk / macos-12-py3-x86-64 / test (default, 2, 2, macos-12) (1/1)

Step: "Unknown" (full log | diagnosis details)

2022-08-23T02:31:15.2693800Z   test_attribute_serialization (__main__.TestScript) ... ok (0.018s)
2022-08-23T02:31:15.2874590Z   test_attribute_unpickling (__main__.TestScript) ... ok (0.018s)
2022-08-23T02:31:15.2984480Z   test_augmented_assign (__main__.TestScript) ... ok (0.011s)
2022-08-23T02:31:15.3005680Z   test_autodiff_complex (__main__.TestScript) ... skip: no CUDA (0.002s)
2022-08-23T02:31:15.3049190Z   test_backend_cudnn_enabled (__main__.TestScript) ... ok (0.004s)
2022-08-23T02:31:15.3089080Z   test_bad_multiline_annotations (__main__.TestScript) ... ok (0.004s)
2022-08-23T02:31:15.3234960Z   test_bailout_loop_carried_deps_name_clash (__main__.TestScript) ... ok (0.014s)
2022-08-23T02:31:15.3367630Z   test_bailout_loop_counter_transition (__main__.TestScript) ... ok (0.013s)
2022-08-23T02:31:15.3404680Z   test_batch_norm_inference_backward_cuda (__main__.TestScript) ... skip: running tests on cuda to verify cudnn fix (0.004s)
2022-08-23T02:31:15.4639380Z   test_batchnorm_fuser_cpu (__main__.TestScript) ... clang: error: unsupported option '-fopenmp'
2022-08-23T02:31:15.4641770Z clang: error: unsupported option '-fopenmp'
2022-08-23T02:31:15.4768450Z warning: pytorch jit fuser failed to compile with openmp, trying without it...
2022-08-23T02:31:15.4770800Z You have not run this instance of FileCheck!
2022-08-23T02:31:15.4771820Z FileCheck checks:
2022-08-23T02:31:17.0515000Z ok (1.711s)
2022-08-23T02:31:17.0572340Z   test_big_float_literals (__main__.TestScript) ... ok (0.006s)
2022-08-23T02:31:17.0695190Z   test_big_int_literals (__main__.TestScript) ... ok (0.012s)
2022-08-23T02:31:17.3916400Z   test_binary_op_shape (__main__.TestScript) ... ok (0.322s)
2022-08-23T02:31:17.4276640Z   test_bitwise_ops (__main__.TestScript) ... ok (0.036s)
2022-08-23T02:31:17.4387250Z   test_block_input_grad_in_loop (__main__.TestScript) ... ok (0.011s)
2022-08-23T02:31:17.4509620Z   test_bool_augassign_bitwise_and (__main__.TestScript) ... ok (0.012s)

🕵️‍♀️ 3 failures not recognized by patterns:

The following CI failures may be due to changes from the PR
Job | Step
GitHub Actions periodic / linux-bionic-cuda10.2-py3.9-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu) | Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
GitHub Actions trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu) | Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
GitHub Actions trunk / linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu) | Install nvidia driver, nvidia-docker runtime, set GPU_FLAG

This comment was automatically generated by Dr. CI.

@jaglinux (Contributor, Author) commented:

@jeffdaily added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jul 27, 2022
@pruthvistony added the ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR) label on Jul 27, 2022
@jaglinux (Contributor, Author) commented:

2022-07-27T21:39:35.4539856Z test_event_handle_exporter (__main__.TestMultiprocessing) ... ok (5.014s)
2022-07-27T21:39:40.2939614Z test_event_handle_importer (__main__.TestMultiprocessing) ... ok (4.840s)
2022-07-27T21:39:44.1968841Z test_event_multiprocess (__main__.TestMultiprocessing) ... ok (3.898s)

@jeffdaily (Collaborator) left a comment:

LGTM. ROCm CI is green. 4 test failures are unrelated to this PR.

We still need upstream approval.

@jeffdaily requested a review from janeyx99 on August 1, 2022 15:21
@jithunnair-amd (Collaborator) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot (Collaborator) commented:

Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2876180459

@jaglinux (Contributor, Author) commented:

2022-08-17T19:34:08.6449003Z test_event_handle_exporter (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
2022-08-17T19:34:08.6449546Z warnings.warn(f"loaded {len(slow_tests_dict)} slow tests")
2022-08-17T19:34:08.6450211Z /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:127: UserWarning: loaded 238 disabled tests
2022-08-17T19:34:08.6450760Z warnings.warn(f"loaded {len(disabled_tests_dict)} disabled tests")
2022-08-17T19:34:12.3298069Z ok (5.001s)
2022-08-17T19:34:13.7041800Z test_event_handle_importer (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
2022-08-17T19:34:13.7043275Z warnings.warn(f"loaded {len(slow_tests_dict)} slow tests")
2022-08-17T19:34:13.7044910Z /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:127: UserWarning: loaded 238 disabled tests
2022-08-17T19:34:13.7046187Z warnings.warn(f"loaded {len(disabled_tests_dict)} disabled tests")
2022-08-17T19:34:16.8561133Z ok (4.526s)
2022-08-17T19:34:16.8598343Z test_event_handle_multi_gpu (__main__.TestMultiprocessing) ... skip: found only 1 GPU (0.004s)
2022-08-17T19:34:18.2344205Z test_event_multiprocess (__main__.TestMultiprocessing) ... /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:123: UserWarning: loaded 39 slow tests
2022-08-17T19:34:18.2345403Z warnings.warn(f"loaded {len(slow_tests_dict)} slow tests")
2022-08-17T19:34:18.2346859Z /opt/conda/lib/python3.7/site-packages/torch/testing/_internal/common_utils.py:127: UserWarning: loaded 238 disabled tests
2022-08-17T19:34:18.2348050Z warnings.warn(f"loaded {len(disabled_tests_dict)} disabled tests")
2022-08-17T19:34:20.7152070Z ok (3.855s)

cc @jithunnair-amd

@jithunnair-amd (Collaborator) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot (Collaborator) commented:

Merge failed
Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again.
Raised by https://github.com/pytorch/pytorch/actions/runs/2907508911

@jithunnair-amd (Collaborator) commented:

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered without a flag. This means that your change will be merged once all checks on your PR have passed (ETA: 0-4 Hours). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@pytorchmergebot (Collaborator) commented:

@jaglinux (Contributor, Author) commented:

I see 4 failing checks in the report.

The 3 jobs below are failing at the "Install nvidia driver" step:
linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 1, 2, linux.8xlarge.nvidia.gpu)
linux-bionic-cuda10.2-py3.9-gcc7 / test (distributed, 2, 2, linux.8xlarge.nvidia.gpu)
linux-bionic-cuda10.2-py3.9-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu)

macos-12-py3-x86-64 / test (default, 2, 2, macos-12) --> The hosted runner: GitHub Actions 50 lost communication with the
server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

I do not see any ROCm-related failures.

@jithunnair-amd (Collaborator) commented:

@pytorchbot merge -f "unrelated CI failures"

@pytorchmergebot (Collaborator) commented:

@pytorchbot successfully started a merge job. Check the current status here.
The merge job was triggered with the force (-f) flag. This means your change will be merged immediately, bypassing any CI checks (ETA: 1-5 minutes). If this is not the intended behavior, feel free to use some of the other merge options in the wiki.
Please reach out to the PyTorch DevX Team with feedback or questions!

@github-actions (Contributor) commented:

Hey @jaglinux.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

facebook-github-bot pushed a commit that referenced this pull request Aug 26, 2022
Summary:
Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Issue fixed in ROCm 5.2 user space.

Pull Request resolved: #82356
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/huydhn

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f5bfa4d0888e6cd5984092b38cb8b10609558d05

Reviewed By: weiwangmeta

Differential Revision: D39008147

Pulled By: weiwangmeta

fbshipit-source-id: 39e3aa6cb6329bb3c2a53c0ddbe71a084dc1e55e

Labels

ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
cla signed
Merged
module: rocm (AMD GPU support for Pytorch)
open source
