Open
Labels
triaged: This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Description
🐛 Describe the bug
Something about the new Amazon2023 AMI is causing some distributed tests to fail, particularly tests that take NCCL dumps during timeouts.
Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175
FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6
Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494
____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
self.assertEqual(self._wait_process(0, timeout=90), -6)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6
Failure 3:
#129539
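For anyone reading the assertion message: `-6` is the exit code Python reports for a child killed by SIGABRT, and `None` means no exit code was available. The sketch below illustrates the two outcomes with plain `subprocess` on a POSIX machine. It is not the test's own code: `wait_process` is a hypothetical stand-in, and it assumes `_wait_process` returns `None` when the rank has not exited within the timeout.

```python
import signal
import subprocess
import sys


def wait_process(cmd, timeout):
    """Hypothetical helper: return the child's exit code, or None if it
    is still running after `timeout` seconds."""
    proc = subprocess.Popen(cmd, stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Clean up the stuck child so the sketch does not leak a process.
        proc.kill()
        proc.wait()
        return None


# Child that aborts itself, standing in for a rank torn down after a timeout.
aborting = [sys.executable, "-c", "import os; os.abort()"]
# Child that outlives the wait window, standing in for a rank that never exits.
stuck = [sys.executable, "-c", "import time; time.sleep(60)"]

aborted_code = wait_process(aborting, timeout=30)
stuck_code = wait_process(stuck, timeout=1)

print(aborted_code == -signal.SIGABRT)  # exit code is the negated signal number
print(stuck_code)
```

If that assumption holds, the `None is not -6` failures would mean the rank was still alive when the 90-second wait expired, rather than that it exited with a different code.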
Repro steps
Here’s the minimal PR to repro (just 3 lines). With it you can see the results in CI and ssh into those machines.