Distributed tests failing on Amazon2023 AMI #131962

@ZainRizvi

🐛 Describe the bug

Something about the new Amazon2023 AMI is causing some distributed tests to fail, particularly tests that take NCCL dumps during timeouts.

Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175

FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6

Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494

____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
    self.assertEqual(self._wait_process(0, timeout=90), -6)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6

Failure 3:
#129539

Repro steps

Here’s a minimal PR that reproduces the failures (just 3 lines). With it you can see the results in CI and SSH into those machines.

https://github.com/pytorch/pytorch/pull/131479/files

Metadata

Labels

triaged
