
Multiple failures in TestZeroRedundancyOptimizerDistributed when not run with 2 or 4 GPUs #59548

@Flamefire


🐛 Bug

The tests in distributed/optim/test_zero_redundancy_optimizer.py fail when run on a node with any GPU count other than 2 or 4.

Specifically, these tests fail:

  • test_add_param_group, test_sharding: assertion errors
  • test_step, test_step_with_closure: hang indefinitely

To Reproduce

Steps to reproduce the behavior:

  1. CUDA_VISIBLE_DEVICES=0,1,2 python distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group

This triggers the assertion error:

ERROR:root:Caught exception: 
Traceback (most recent call last):
  File "/tmp/install_pt/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 285, in wrapper
    fn()
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.8.1/fosscuda-2020b/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 369, in test_add_param_group
    all_trainable()
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.8.1/fosscuda-2020b/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 348, in all_trainable
    assert sum([x.numel() for g in o.optim.param_groups for x in g["params"]]) == sum(sizes)

Printing x.numel() and sizes yields, e.g., [9, 5, 3, 5, 3] vs. [9, 7, 5, 3] on one rank and [7, 9, 7] vs. [9, 7, 5, 3] on the other two ranks.
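This is consistent with the parameters being partitioned across ranks, so each rank's local shard depends on the world size. A minimal sketch of a greedy size-based partition (illustrative only, not ZeroRedundancyOptimizer's actual algorithm) shows why a fixed-world-size assertion breaks:

```python
def greedy_shard(sizes, world_size):
    """Assign each parameter size to the currently least-loaded rank
    (a common sharding heuristic; illustrative only)."""
    shards = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for s in sizes:
        rank = loads.index(min(loads))  # least-loaded rank so far
        shards[rank].append(s)
        loads[rank] += s
    return shards

sizes = [9, 7, 5, 3]
print(greedy_shard(sizes, 2))  # [[9, 3], [7, 5]] -> per-rank sums [12, 12]
print(greedy_shard(sizes, 3))  # [[9], [7], [5, 3]] -> per-rank sums [9, 7, 8]
```

With 2 ranks the shards balance to [12, 12], but with 3 ranks they come out as [9, 7, 8], so any assertion about per-rank parameter totals that assumes a particular world size will fail on other GPU counts.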

The hang occurs when the tests are run with any number of visible GPUs other than 2.
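Until the tests are parameterized over the world size, one workaround is to skip them when the visible GPU count is unsupported. A hedged sketch (the decorator and SUPPORTED_WORLD_SIZES are hypothetical helpers, not part of PyTorch's test utilities; in the real tests get_count would be torch.cuda.device_count):

```python
import unittest

# Assumption based on the observed failures: only these counts pass.
SUPPORTED_WORLD_SIZES = (2, 4)

def require_gpu_count(get_count, supported=SUPPORTED_WORLD_SIZES):
    """Decorator factory: skip the wrapped test unless get_count()
    returns a supported GPU count."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            n = get_count()
            if n not in supported:
                raise unittest.SkipTest(
                    f"requires a GPU count in {supported}, found {n}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

Alternatively, pinning CUDA_VISIBLE_DEVICES to exactly 2 or 4 devices before launching the test file avoids the failures without code changes.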

Environment

  • PyTorch Version (e.g., 1.0): 1.8.1
  • OS (e.g., Linux): Linux
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): CMAKE_BUILD_TYPE=Release BUILD_TEST=0 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=$(nproc) BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=$EBROOTCUDNN/lib64 CUDNN_INCLUDE_DIR=$EBROOTCUDNN/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=$EBROOTNCCL/include USE_METAL=0 USE_KINETO=0 python setup.py install
  • Python version: 3.8
  • CUDA/cuDNN version: 11.1.1
  • GPU models and configuration: 8x A100

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23
