Closed
Labels
oncall: distributed
Description
🐛 Bug
The tests in distributed/optim/test_zero_redundancy_optimizer.py fail when run on a node with any number of GPUs other than 2 or 4.
Specifically, these tests fail:
- test_add_param_group, test_sharding: Assertion errors
- test_step, test_step_with_closure: Hang indefinitely
To Reproduce
Steps to reproduce the behavior:
CUDA_VISIBLE_DEVICES=0,1,2 python distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
This triggers the assertion error:
ERROR:root:Caught exception:
Traceback (most recent call last):
  File "/tmp/install_pt/lib/python3.8/site-packages/torch/testing/_internal/common_distributed.py", line 285, in wrapper
    fn()
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.8.1/fosscuda-2020b/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 369, in test_add_param_group
    all_trainable()
  File "/dev/shm/s3248973-EasyBuild/PyTorch/1.8.1/fosscuda-2020b/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 348, in all_trainable
    assert sum([x.numel() for g in o.optim.param_groups for x in g["params"]]) == sum(sizes)
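Printing the x.numel() values and sizes yields, for example, [9, 5, 3, 5, 3] and [9, 7, 5, 3] on one rank, and [7, 9, 7] and [9, 7, 5, 3] on the other two. The per-rank sums are therefore 25 and 23, while sum(sizes) is 24, so the assertion fails.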
The hang in test_step and test_step_with_closure occurs with any number of visible GPUs other than 2.
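To illustrate why the assertion can hold for 2 or 4 ranks but not for 3, here is a minimal sketch that needs neither GPUs nor torch. It rests on two assumptions that this report does not confirm: that the test creates parameters of sizes [9, 7, 5, 3] repeated once per rank with the last one dropped and then adds one 3-element group, and that each parameter is assigned greedily to the rank that currently holds the fewest elements. The function name simulate_add_param_group is hypothetical and exists only for this illustration.

```python
# Sketch only: a pure-Python model of size-based sharding, NOT the actual
# ZeroRedundancyOptimizer code. The test setup and the greedy rule below
# are assumptions made for illustration.

SIZES = (9, 7, 5, 3)


def simulate_add_param_group(world_size, sizes=SIZES):
    """Return, per rank, the list of parameter sizes that rank would own."""
    per_rank = [[] for _ in range(world_size)]

    def assign(numel):
        # Greedy rule (assumed): give the parameter to the least-loaded rank,
        # breaking ties in favour of the lowest rank index.
        loads = [sum(owned) for owned in per_rank]
        per_rank[loads.index(min(loads))].append(numel)

    # Assumed setup: sizes repeated once per rank, last parameter dropped ...
    for numel in (list(sizes) * world_size)[:-1]:
        assign(numel)
    # ... then one extra group with 3 elements added afterwards.
    assign(3)
    return per_rank


for world_size in (2, 3, 4):
    shards = simulate_add_param_group(world_size)
    totals = [sum(owned) for owned in shards]
    # The failing assertion compares each rank's total with sum(sizes).
    print(world_size, totals, "expected per rank:", sum(SIZES))
```

Under these assumptions every rank ends up with 24 elements for 2 and 4 ranks, but not for 3 ranks, where the sketch produces the [9, 5, 3, 5, 3] and [7, 9, 7] lists shown above. It is not meant to match the real partitioning on every rank, only to show that a balanced split is not guaranteed for arbitrary world sizes.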
Environment
- PyTorch Version (e.g., 1.0): 1.8.1
- OS (e.g., Linux): Linux
- How you installed PyTorch (conda, pip, source): source
- Build command you used (if compiling from source):
CMAKE_BUILD_TYPE=Release BUILD_TEST=0 PYTORCH_BUILD_VERSION=1.8.1 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=$(nproc) BLAS=Eigen USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=$EBROOTCUDNN/lib64 CUDNN_INCLUDE_DIR=$EBROOTCUDNN/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=$EBROOTNCCL/include USE_METAL=0 USE_KINETO=0 python setup.py install
- Python version: 3.8
- CUDA/cuDNN version: 11.1.1
- GPU models and configuration: 8x A100
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23