🐛 Bug
The following two tests from distributed/optim/test_zero_redundancy_optimizer fail:
ERROR: test_zero_model_parallel_with_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
ERROR: test_zero_model_parallel_without_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
With the following error:
ERROR: test_zero_model_parallel_without_bucket_view (distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed)
Check that ZeRO works with model parallelism where layers are sharded
----------------------------------------------------------------------
Traceback (most recent call last):
File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
self._join_processes(fn)
File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
self._check_return_codes(elapsed_time)
File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 682, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 536, in run_test
getattr(self, test_name)()
File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 420, in wrapper
fn()
File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 111, in wrapper
return func(*args, **kwargs)
File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 913, in test_zero_model_parallel_without_bucket_view
self._test_zero_model_parallel(parameters_as_bucket_view=False)
File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 885, in _test_zero_model_parallel
assert torch.allclose(
AssertionError: Losses differ between local optim and ZeroRedundancyOptimizer)
To Reproduce
Steps to reproduce the behavior:
- Build PyTorch from source (see my build command below)
- Run
python -m unittest distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed.test_zero_model_parallel_without_bucket_view -v
or
python -m unittest distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed.test_zero_model_parallel_with_bucket_view -v
Expected behavior
I expected the tests to pass.
Environment
Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
- PyTorch Version (e.g., 1.0): PyTorch 1.10.0
- OS (e.g., Linux): CentOS 8
- How you installed PyTorch (conda, pip, source): source
- Build command you used (if compiling from source):
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.10.0 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=72 BLAS=Eigen WITH_BLAS=open USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/lib64 CUDNN_INCLUDE_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/tmp/sw_stack_gpu/software/NCCL/2.10.3-GCCcore-10.3.0-CUDA-11.3.1/include USE_METAL=0 /tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python setup.py build
- Python version: 3.9.5
- CUDA/cuDNN version: CUDA: 11.3.1, cuDNN: 8.2.1.32
- GPU models and configuration: N/A
- Any other relevant information: Ran the tests on a node with 4x A100 GPUs
Additional context
I modified the test to print local_loss and ddp_loss, which are compared via an assert on torch.allclose(...), to check how different they really were. They turned out to be very close:
Local_loss: 27.58694839477539, ddp_loss: 27.58181381225586
However, the difference is still larger than the default relative tolerance of torch.allclose(...).
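For reference, a quick check of those two values against the default tolerances of torch.allclose (rtol=1e-5, atol=1e-8) reproduces the failure; this snippet just redoes the arithmetic and is not part of the test:

```python
import torch

# Values reported by the modified test (copied from the output above).
local_loss = torch.tensor(27.58694839477539)
ddp_loss = torch.tensor(27.58181381225586)

# With the defaults (rtol=1e-5, atol=1e-8) the allowed deviation is
# atol + rtol * |ddp_loss| ≈ 2.8e-4, while the actual difference is ≈ 5.1e-3.
print(torch.allclose(local_loss, ddp_loss))             # False
print(torch.allclose(local_loss, ddp_loss, rtol=1e-3))  # True with a looser rtol
```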
The test passes if I set
export NVIDIA_TF32_OVERRIDE=0
before running it. Thus, the cause of the small deviation seems to be the use of TensorFloat32.
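For anyone reproducing this without the environment variable, roughly the same effect can be obtained from Python before running the test (this is just a sketch using the existing torch.backends flags, not part of the proposed fix):

```python
import torch

# Disable TF32 for matmuls and cuDNN convolutions so that full float32
# arithmetic is used; this roughly mirrors NVIDIA_TF32_OVERRIDE=0.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```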
I see three options for a fix:
1. Detect if TF32 is supported/used, and if so, increase the tolerance (a rough sketch of such a check follows this list).
2. Increase the tolerance for this test in general, so that it passes regardless of whether TF32 is used.
3. Disable the use of TF32 before running this test.
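To make option (1) concrete, here is a minimal sketch of how the test could detect whether TF32 matmuls are active, assuming the existing torch.backends.cuda.matmul.allow_tf32 flag plus an Ampere-or-newer capability check; the tolerance value is purely illustrative:

```python
import torch

def tf32_is_active() -> bool:
    """Best-effort check: TF32 matmuls are enabled and the GPU supports them (Ampere+)."""
    if not torch.cuda.is_available():
        return False
    major, _ = torch.cuda.get_device_capability()
    return major >= 8 and torch.backends.cuda.matmul.allow_tf32

# Illustrative only: pick a looser relative tolerance when TF32 is active.
rtol = 1e-3 if tf32_is_active() else 1e-5
```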
I personally think option (3) is not very good, because it doesn't reflect how a user would typically run PyTorch. I think (2) is the easiest to implement, and I would still consider it pretty safe: the tolerance doesn't need to be very large for this case to pass successfully. I'll create a PR based on (2) (a possible shape of the change is sketched below), and we can discuss in the PR whether that's OK.
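For illustration only (not the actual PR), the failing assert in _test_zero_model_parallel could be relaxed along these lines; the rtol/atol values here are my own assumptions, not values taken from the test suite:

```python
import torch

# Stand-ins for the tensors compared inside _test_zero_model_parallel,
# using the values printed above.
local_loss = torch.tensor(27.58694839477539)
ddp_loss = torch.tensor(27.58181381225586)

# Hypothetical relaxed check: with rtol=1e-3 the ~5e-3 TF32-induced
# deviation is within tolerance, while large real divergences still fail.
assert torch.allclose(
    local_loss, ddp_loss, rtol=1e-3, atol=1e-5
), "Losses differ between local optim and ZeroRedundancyOptimizer"
```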
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang