test_zero_redundancy_optimizer.py fails when run on a setup with more than 4 GPUs #53322

@arindamroy-eng

Description

🐛 Bug

To Reproduce

Steps to reproduce the behavior:
On a setup with more than 4 GPUs:

  1. CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3.6 test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_sharding -v
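For reference, here is a minimal CPU-only sketch that mirrors the failing check. This is not the test itself: the parameter sizes [9, 7, 5, 3] are only a guess chosen to match the expected total of 24 in the assertion below, and the gloo backend with CPU tensors stands in for the NCCL/GPU setup, so it may not exercise exactly the same code path.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.optim import ZeroRedundancyOptimizer

def check_sharding(rank, world_size, sizes):
    # Single-machine rendezvous; the address and port are arbitrary.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank builds the same flat parameters; ZeroRedundancyOptimizer
    # shards the optimizer state so each rank only keeps a slice of them.
    params = [torch.rand(size, 1, requires_grad=True) for size in sizes * world_size]
    o = ZeroRedundancyOptimizer(params, torch.optim.SGD, lr=0.01)

    # Mirrors the assertion in test_sharding: the element count of the
    # local shard is compared against sum(sizes).
    local_numel = sum(p.numel() for p in o.optim.param_groups[0]["params"])
    print(f"rank {rank}: local shard numel = {local_numel}, expected {sum(sizes)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 5  # mirrors the >4-GPU configuration, but with CPU processes
    mp.spawn(check_sharding, args=(world_size, [9, 7, 5, 3]), nprocs=world_size)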

Error:
ERROR: test_sharding (__main__.TestZeroRedundancyOptimizerDistributed)
Check the sharding at construction time

Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 286, in wrapper
self._join_processes(fn)
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 418, in _join_processes
self._check_return_codes(elapsed_time)
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 461, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 366, in run_test
getattr(self, test_name)()
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "/var/lib/jenkins/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 289, in test_sharding
self.assertEqual(sum([x.numel() for x in o.optim.param_groups[0]["params"]]), sum(sizes))
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1234, in assertEqual
super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
File "/opt/conda/lib/python3.6/unittest/case.py", line 682, in assertTrue
raise self.failureException(msg)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 21 and 24 gives a difference of 3, but the allowed difference with rtol=0 and atol=0 is only 0!


Ran 1 test in 2.194s

FAILED (errors=1)

Expected behavior

test_sharding (__main__.TestZeroRedundancyOptimizerDistributed)
Check the sharding at construction time ... ok

The test should pass.

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+gitb4395b0
Is debug build: False
CUDA used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 12.0.0 (/src/external/llvm-project/clang 0bebc949e017e721b7cf4836a8b9a25fd28e0367)
CMake version: version 3.19.6

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Device 66a1
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.9.0a0+git792a7db
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-include 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.18.5 py36ha1c710e_0
[conda] numpy-base 1.18.5 py36hde5b4d6_0
[conda] torch 1.9.0a0+git792a7db pypi_0 pypi


Additional context
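One hypothesis that might help triage, not verified against the ZeroRedundancyOptimizer sources: test_sharding seems to assume that every rank's shard holds exactly sum(sizes) elements, i.e. a perfectly even partition, and that assumption can break for some world sizes if parameters are assigned greedily in arrival order. The toy sketch below shows a greedy "least-loaded rank" partition (not necessarily the algorithm PyTorch actually uses) producing uneven per-rank totals for world_size = 5; the sizes [9, 7, 5, 3] are the same hypothetical values as in the sketch above.

# Toy greedy partition: each parameter goes, in order, to the rank with
# the smallest running total. Shown only to illustrate how per-rank
# totals can deviate from sum(sizes); it is not claimed to be the exact
# partitioning logic inside ZeroRedundancyOptimizer.
world_size = 5
sizes = [9, 7, 5, 3]            # hypothetical; sum(sizes) == 24
param_sizes = sizes * world_size

totals = [0] * world_size
for numel in param_sizes:
    target = totals.index(min(totals))  # least-loaded rank so far
    totals[target] += numel

print(totals)      # [27, 23, 20, 26, 24] under this toy policy, not 24 everywhere
print(sum(sizes))  # 24, which the test expects on every rank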

cc @vincentqb @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

Labels

  • module: flaky-tests (problem is a flaky test in CI)
  • module: optimizer (related to torch.optim)
  • oncall: distributed (add this issue/PR to the distributed oncall triage queue)
  • triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
