
Multiple failures in TestZeroRedundancyOptimizerDistributed when not run with 2 or 4 GPUs #59548

@Flamefire


🐛 Bug

The tests in distributed/optim/test_zero_redundancy_optimizer.py fail when run on a node with any GPU count other than 2 or 4.

Specifically, these tests fail:

  • test_add_param_group, test_sharding: assertion errors
  • test_step, test_step_with_closure: hang indefinitely

To Reproduce

Steps to reproduce the behavior:

  1. CUDA_VISIBLE_DEVICES=0,1,2 python distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group

This triggers the assertion error:

ERROR:root:Caught exception: 
Traceback (most recent call last):
  File "/tmp/install_pt/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 285, in wrapper
    fn()
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.8.1/fosscuda-2020b/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 369, in test_add_param_group
    all_trainable()
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.8.1/fosscuda-2020b/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 348, in all_trainable
    assert sum([x.numel() for g in o.optim.param_groups for x in g["params"]]) == sum(sizes)

Printing x.numel() and sizes yields, e.g., [9, 5, 3, 5, 3] vs. [9, 7, 5, 3] on one rank and [7, 9, 7] vs. [9, 7, 5, 3] on the other two ranks.
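This is consistent with the parameters being partitioned across ranks, so each rank's local shard depends on the world size. A minimal sketch of a greedy size-based partition (illustrative only, not ZeroRedundancyOptimizer's actual algorithm) shows why a fixed-world-size assertion breaks:

```python
def greedy_shard(sizes, world_size):
    """Assign each parameter size to the currently least-loaded rank
    (a common sharding heuristic; illustrative only)."""
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for s in sizes:
        rank = loads.index(min(loads))  # least-loaded rank so far
        shards[rank].append(s)
        loads[rank] += s
    return shards

sizes = [9, 7, 5, 3]
print(greedy_shard(sizes, 2))  # [[9, 3], [7, 5]] -> per-rank sums [12, 12]
print(greedy_shard(sizes, 3))  # [[9], [7], [5, 3]] -> per-rank sums [9, 7, 8]
```

With 2 ranks the shards balance to [12, 12], but with 3 ranks they come out as [9, 7, 8], so any assertion about per-rank parameter totals that assumes a particular world size will fail on other GPU counts.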

The hang occurs when the tests are run with any number of visible GPUs other than 2.
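Until the tests are parameterized over the world size, one workaround is to skip them when the visible GPU count is unsupported. A hedged sketch (the decorator and SUPPORTED_WORLD_SIZES are hypothetical helpers, not part of PyTorch's test utilities; in the real tests get_count would be torch.cuda.device_count):

```python
import unittest

# Assumption based on the observed failures: only these counts pass.
SUPPORTED_WORLD_SIZES = (2, 4)

def require_gpu_count(get_count, supported=SUPPORTED_WORLD_SIZES):
    """Decorator factory: skip the wrapped test unless get_count()
    returns a supported GPU count."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            n = get_count()
            if n not in supported:
                raise unittest.SkipTest(
                    f"requires a GPU count in {supported}, found {n}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Alternatively, pinning CUDA_VISIBLE_DEVICES to exactly 2 or 4 devices before launching the test file avoids the failures without code changes.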

Environment

  • PyTorch Version (e.g., 1.0): 1.8.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): CMAKE_BUILD_TYPE=Release BUILD_TEST=0 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=$(nproc) BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=$EBROOTCUDNN/lib64 CUDNN_INCLUDE_DIR=$EBROOTCUDNN/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=$EBROOTNCCL/include USE_METAL=0 USE_KINETO=0 python setup.py install
  • Python version: 3.8
  • CUDA/cuDNN version: 11.1.1
  • GPU models and configuration: 8x A100

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23
