test_zero_redundancy_optimizer.py fails when run on a setup with more than 4 GPUs #53322

@arindamroy-eng

Description

🐛 Bug

To Reproduce

Steps to reproduce the behavior:
On a setup with more than 4 GPUs:

  1. CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3.6 test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_sharding -v
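For reference, here is a minimal CPU-only sketch that mirrors the failing check. This is not the test itself: the parameter sizes [9, 7, 5, 3] are only a guess chosen to match the expected total of 24 in the assertion below, and the gloo backend with CPU tensors stands in for the NCCL/GPU setup, so it may not exercise exactly the same code path.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.optim import ZeroRedundancyOptimizer

def check_sharding(rank, world_size, sizes):
    # Single-machine rendezvous; the address and port are arbitrary.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every rank builds the same flat parameters; ZeroRedundancyOptimizer
    # shards the optimizer state so each rank only keeps a slice of them.
    params = [torch.rand(size, 1, requires_grad=True) for size in sizes * world_size]
    o = ZeroRedundancyOptimizer(params, torch.optim.SGD, lr=0.01)

    # Mirrors the assertion in test_sharding: the element count of the
    # local shard is compared against sum(sizes).
    local_numel = sum(p.numel() for p in o.optim.param_groups[0]["params"])
    print(f"rank {rank}: local shard numel = {local_numel}, expected {sum(sizes)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 5  # mirrors the >4-GPU configuration, but with CPU processes
    mp.spawn(check_sharding, args=(world_size, [9, 7, 5, 3]), nprocs=world_size)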

Error:
ERROR: test_sharding (__main__.TestZeroRedundancyOptimizerDistributed)
Check the sharding at construction time

Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 286, in wrapper
self._join_processes(fn)
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 418, in _join_processes
self._check_return_codes(elapsed_time)
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 461, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 366, in run_test
getattr(self, test_name)()
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "/var/lib/jenkins/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 289, in test_sharding
self.assertEqual(sum([x.numel() for x in o.optim.param_groups[0]["params"]]), sum(sizes))
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1234, in assertEqual
super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
File "/opt/conda/lib/python3.6/unittest/case.py", line 682, in assertTrue
raise self.failureException(msg)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 21 and 24 gives a difference of 3, but the allowed difference with rtol=0 and atol=0 is only 0!


Ran 1 test in 2.194s

FAILED (errors=1)

Expected behavior

test_sharding (__main__.TestZeroRedundancyOptimizerDistributed)
Check the sharding at construction time ... ok

The test should pass.

Environment

Collecting environment information...
PyTorch version: 1.9.0a0+gitb4395b0
Is debug build: False
CUDA used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 12.0.0 (/src/external/llvm-project/clang 0bebc949e017e721b7cf4836a8b9a25fd28e0367)
CMake version: version 3.19.6

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Device 66a1
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.9.0a0+git792a7db
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-include 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.18.5 py36ha1c710e_0
[conda] numpy-base 1.18.5 py36hde5b4d6_0
[conda] torch 1.9.0a0+git792a7db pypi_0 pypi


Additional context
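One hypothesis that might help triage, not verified against the ZeroRedundancyOptimizer sources: test_sharding seems to assume that every rank's shard holds exactly sum(sizes) elements, i.e. a perfectly even partition, and that assumption can break for some world sizes if parameters are assigned greedily in arrival order. The toy sketch below shows a greedy "least-loaded rank" partition (not necessarily the algorithm PyTorch actually uses) producing uneven per-rank totals for world_size = 5; the sizes [9, 7, 5, 3] are the same hypothetical values as in the sketch above.

# Toy greedy partition: each parameter goes, in order, to the rank with
# the smallest running total. Shown only to illustrate how per-rank
# totals can deviate from sum(sizes); it is not claimed to be the exact
# partitioning logic inside ZeroRedundancyOptimizer.
world_size = 5
sizes = [9, 7, 5, 3]            # hypothetical; sum(sizes) == 24
param_sizes = sizes * world_size

totals = [0] * world_size
for numel in param_sizes:
    target = totals.index(min(totals))  # least-loaded rank so far
    totals[target] += numel

print(totals)      # [27, 23, 20, 26, 24] under this toy policy, not 24 everywhere
print(sum(sizes))  # 24, which the test expects on every rank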

cc @vincentqb @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

Labels

  • module: flaky-tests (problem is a flaky test in CI)
  • module: optimizer (related to torch.optim)
  • oncall: distributed (add this issue/PR to the distributed oncall triage queue)
  • triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
