PyTorch 1.10.0, distributed/optim/test_zero_redundancy_optimizer fails on A100 #67764

🐛 Bug

The following two tests from distributed/optim/test_zero_redundancy_optimizer fail:

ERROR: test_zero_model_parallel_with_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
ERROR: test_zero_model_parallel_without_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)

With the following error:

ERROR: test_zero_model_parallel_without_bucket_view (distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed)
Check that ZeRO works with model parallelism where layers are sharded
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 682, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 536, in run_test
    getattr(self, test_name)()
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 420, in wrapper
    fn()
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 111, in wrapper
    return func(*args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 913, in test_zero_model_parallel_without_bucket_view
    self._test_zero_model_parallel(parameters_as_bucket_view=False)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 885, in _test_zero_model_parallel
    assert torch.allclose(
AssertionError: Losses differ between local optim and ZeroRedundancyOptimizer)

To Reproduce

Steps to reproduce the behavior:

  1. Build PyTorch from source (see my build command below)
  2. Run either of the following:
     python -m unittest distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed.test_zero_model_parallel_without_bucket_view -v
     python -m unittest distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed.test_zero_model_parallel_with_bucket_view -v

Expected behavior

I expected the tests to pass.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

  • PyTorch Version (e.g., 1.0): PyTorch 1.10.0
  • OS (e.g., Linux): CentOS 8
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source):
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.10.0 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=72 BLAS=Eigen WITH_BLAS=open USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/lib64 CUDNN_INCLUDE_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/tmp/sw_stack_gpu/software/NCCL/2.10.3-GCCcore-10.3.0-CUDA-11.3.1/include USE_METAL=0   /tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python setup.py build
  • Python version: 3.9.5
  • CUDA/cuDNN version: CUDA: 11.3.1, cuDNN: 8.2.1.32
  • GPU models and configuration: N/A
  • Any other relevant information: Ran the tests on a node with 4x A100 GPUs

Additional context

I modified the test to print local_loss and ddp_loss, the two values compared via the assert on torch.allclose(...), to check how different they really were. They turned out to be very close:

Local_loss: 27.58694839477539, ddp_loss: 27.58181381225586

However, that difference is still larger than the default relative tolerance of torch.allclose(...) (rtol=1e-5).
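
For illustration, here is a minimal check of those two printed values against the default and a loosened tolerance (the numbers are copied from the output above; rtol=1e-3 is just an example, not a proposed value):

import torch

local_loss = torch.tensor(27.58694839477539)
ddp_loss = torch.tensor(27.58181381225586)

# The relative difference is roughly 1.9e-4 ...
print(((local_loss - ddp_loss).abs() / ddp_loss.abs()).item())

# ... which exceeds the default tolerances (rtol=1e-5, atol=1e-8), so this prints False:
print(torch.allclose(local_loss, ddp_loss))

# A slightly looser relative tolerance is enough to make the comparison pass:
print(torch.allclose(local_loss, ddp_loss, rtol=1e-3))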

The test passes if I set

export NVIDIA_TF32_OVERRIDE=0

before running it. Thus, the cause of the small deviation seems to be the use of TensorFloat32.
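
For completeness: I believe the same effect can be achieved from within Python, rather than via the environment variable, using the standard TF32 switches (I have not verified this for this exact test, so treat it as an assumption):

import torch

# Assumption: disabling TF32 for matmuls and cuDNN convolutions here should
# mirror what NVIDIA_TF32_OVERRIDE=0 does for the operations used in this test.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False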

I see three options for a fix:

  1. Detect whether TF32 is supported/used, and if so, increase the tolerance.
  2. Increase the tolerance for this test in general, so that it passes regardless of whether TF32 is used.
  3. Disable the use of TF32 before running this test.

I personally think option (3) is not very good, because it doesn't represent how a user would run PyTorch. I think (2) is the easiest to implement, and I would still consider it pretty safe (the tolerance doesn't need to be all that large for this case to pass successfully). I'll create a PR based on (2) and we can discuss whether that's ok in the PR.
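
As a rough sketch, option (2) could amount to something like the following change to the assert in _test_zero_model_parallel (the rtol/atol values below are placeholders I'd want to discuss in the PR, not a definitive choice):

# Current check, which fails under TF32 because of the ~2e-4 relative difference:
#   assert torch.allclose(local_loss, ddp_loss), \
#       "Losses differ between local optim and ZeroRedundancyOptimizer"

# Possible relaxed check with explicit tolerances (placeholder values):
assert torch.allclose(local_loss, ddp_loss, rtol=1e-3, atol=1e-5), (
    "Losses differ between local optim and ZeroRedundancyOptimizer"
)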

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
