PyTorch 1.10.0, distributed/optim/test_zero_redundancy_optimizer fails on A100 #67764

🐛 Bug

The following two tests from distributed/optim/test_zero_redundancy_optimizer fail:

ERROR: test_zero_model_parallel_with_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)
ERROR: test_zero_model_parallel_without_bucket_view (__main__.TestZeroRedundancyOptimizerDistributed)

With the following error:

ERROR: test_zero_model_parallel_without_bucket_view (distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed)
Check that ZeRO works with model parallelism where layers are sharded
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 418, in wrapper
    self._join_processes(fn)
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 637, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 682, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 536, in run_test
    getattr(self, test_name)()
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 420, in wrapper
    fn()
  File "/tmp/casparl/eb_tmp/eb-goi82_nt/tmp69ke_p3z/lib/python3.9/site-packages/torch/testing/_internal/common_distributed.py", line 111, in wrapper
    return func(*args, **kwargs)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 913, in test_zero_model_parallel_without_bucket_view
    self._test_zero_model_parallel(parameters_as_bucket_view=False)
  File "/tmp/casparl/eb_build/PyTorch/1.10.0/foss-2021a-CUDA-11.3.1/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 885, in _test_zero_model_parallel
    assert torch.allclose(
AssertionError: Losses differ between local optim and ZeroRedundancyOptimizer)

To Reproduce

Steps to reproduce the behavior:

  1. Build PyTorch from source (see my build command below)
  2. Run either of the following:
     python -m unittest distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed.test_zero_model_parallel_without_bucket_view -v
     python -m unittest distributed.optim.test_zero_redundancy_optimizer.TestZeroRedundancyOptimizerDistributed.test_zero_model_parallel_with_bucket_view -v

Expected behavior

I expected the tests to pass.

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

  • PyTorch Version (e.g., 1.0): PyTorch 1.10.0
  • OS (e.g., Linux): CentOS 8
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source):
USE_CUPTI_SO=1 PYTORCH_BUILD_VERSION=1.10.0 PYTORCH_BUILD_NUMBER=1 MAX_JOBS=72 BLAS=Eigen WITH_BLAS=open USE_FFMPEG=1 BUILD_CUSTOM_PROTOBUF=0 USE_IBVERBS=1 USE_CUDA=1 CUDNN_LIB_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/lib64 CUDNN_INCLUDE_DIR=/tmp/sw_stack_gpu/software/cuDNN/8.2.1.32-CUDA-11.3.1/include USE_SYSTEM_NCCL=1 NCCL_INCLUDE_DIR=/tmp/sw_stack_gpu/software/NCCL/2.10.3-GCCcore-10.3.0-CUDA-11.3.1/include USE_METAL=0   /tmp/sw_stack_gpu/software/Python/3.9.5-GCCcore-10.3.0/bin/python setup.py build
  • Python version: 3.9.5
  • CUDA/cuDNN version: CUDA: 11.3.1, cuDNN: 8.2.1.32
  • GPU models and configuration: N/A
  • Any other relevant information: Ran the tests on a node with 4x A100 GPUs

Additional context

I modified the test to print local_loss and ddp_loss, the two values compared via the assert on torch.allclose(...), to check how different they really were. They turned out to be very close:

Local_loss: 27.58694839477539, ddp_loss: 27.58181381225586

However, that difference is still larger than the default relative tolerance of torch.allclose(...) (rtol=1e-5).
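
For illustration, here is a minimal check of those two printed values against the default and a loosened tolerance (the numbers are copied from the output above; rtol=1e-3 is just an example, not a proposed value):

import torch

local_loss = torch.tensor(27.58694839477539)
ddp_loss = torch.tensor(27.58181381225586)

# The relative difference is roughly 1.9e-4 ...
print(((local_loss - ddp_loss).abs() / ddp_loss.abs()).item())

# ... which exceeds the default tolerances (rtol=1e-5, atol=1e-8), so this prints False:
print(torch.allclose(local_loss, ddp_loss))

# A slightly looser relative tolerance is enough to make the comparison pass:
print(torch.allclose(local_loss, ddp_loss, rtol=1e-3))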

The test passes if I set

export NVIDIA_TF32_OVERRIDE=0

before running it. Thus, the cause of the small deviation seems to be the use of TensorFloat32.
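
For completeness: I believe the same effect can be achieved from within Python, rather than via the environment variable, using the standard TF32 switches (I have not verified this for this exact test, so treat it as an assumption):

import torch

# Assumption: disabling TF32 for matmuls and cuDNN convolutions here should
# mirror what NVIDIA_TF32_OVERRIDE=0 does for the operations used in this test.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False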

I see three options for a fix:

  1. Detect whether TF32 is supported/used, and if so, increase the tolerance.
  2. Increase the tolerance for this test in general, so that it passes regardless of whether TF32 is used.
  3. Disable the use of TF32 before running this test.

I personally think option (3) is not very good, because it doesn't represent how a user would run PyTorch. I think (2) is the easiest to implement, and I would still consider it pretty safe (the tolerance doesn't need to be all that large for this case to pass successfully). I'll create a PR based on (2) and we can discuss whether that's ok in the PR.
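
As a rough sketch, option (2) could amount to something like the following change to the assert in _test_zero_model_parallel (the rtol/atol values below are placeholders I'd want to discuss in the PR, not a definitive choice):

# Current check, which fails under TF32 because of the ~2e-4 relative difference:
#   assert torch.allclose(local_loss, ddp_loss), \
#       "Losses differ between local optim and ZeroRedundancyOptimizer"

# Possible relaxed check with explicit tolerances (placeholder values):
assert torch.allclose(local_loss, ddp_loss, rtol=1e-3, atol=1e-5), (
    "Losses differ between local optim and ZeroRedundancyOptimizer"
)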

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
