🐛 Bug
To Reproduce
Steps to reproduce the behavior:
On a setup with more than 4 GPUs, run (a standalone CPU sketch follows the command below):
- CUDA_VISIBLE_DEVICES=0,1,2,3,4 python3.6 test/distributed/optim/test_zero_redundancy_optimizer.py TestZeroRedundancyOptimizerDistributed.test_sharding -v
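For reference, here is a minimal standalone sketch approximating what test_sharding exercises. Assumptions for illustration: the gloo backend with CPU tensors (so it runs without GPUs; the real test uses NCCL on CUDA), a WORLD_SIZE of 5 to mimic the 5 visible devices, and the parameter construction of sizes * world_size, which mirrors the test but is not visible in the traceback below. The optimizer_class keyword matches the released API; the test snapshot here may use the older optim= spelling.

# Minimal sketch approximating test_sharding (assumptions: gloo backend,
# CPU tensors, WORLD_SIZE=5 to mimic CUDA_VISIBLE_DEVICES=0,1,2,3,4).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.optim import ZeroRedundancyOptimizer

WORLD_SIZE = 5
SIZES = [9, 7, 5, 3]

def worker(rank):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=WORLD_SIZE)
    # One group of parameters per rank, mirroring the test's sizes * world_size.
    params = [torch.rand(size, 1) for size in SIZES * WORLD_SIZE]
    opt = ZeroRedundancyOptimizer(params, optimizer_class=torch.optim.SGD, lr=0.01)
    # The failing assertion checks that each rank's local shard holds
    # exactly sum(SIZES) == 24 elements.
    local_numel = sum(p.numel() for p in opt.optim.param_groups[0]["params"])
    print(f"rank {rank}: shard numel = {local_numel}, expected {sum(SIZES)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=WORLD_SIZE)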
Error:
ERROR: test_sharding (__main__.TestZeroRedundancyOptimizerDistributed)
Check the sharding at construction time
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 286, in wrapper
self._join_processes(fn)
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 418, in _join_processes
self._check_return_codes(elapsed_time)
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 461, in _check_return_codes
raise RuntimeError(error)
RuntimeError: Process 3 exited with error code 10 and exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 366, in run_test
getattr(self, test_name)()
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 288, in wrapper
fn()
File "/var/lib/jenkins/pytorch/test/distributed/optim/test_zero_redundancy_optimizer.py", line 289, in test_sharding
self.assertEqual(sum([x.numel() for x in o.optim.param_groups[0]["params"]]), sum(sizes))
File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1234, in assertEqual
super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
File "/opt/conda/lib/python3.6/unittest/case.py", line 682, in assertTrue
raise self.failureException(msg)
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 21 and 24 gives a difference of 3, but the allowed difference with rtol=0 and atol=0 is only 0!
Ran 1 test in 2.194s
FAILED (errors=1)
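For context on what the assertion measures: ZeroRedundancyOptimizer partitions the parameters across ranks at construction time with a greedy strategy, assigning each parameter (largest first) to the rank with the smallest running total. Below is a plain-Python approximation of that strategy, not torch's actual _partition_parameters implementation. Note that for these inputs the idealized version balances to exactly 24 per rank at both world sizes, so it illustrates the expected shard totals rather than reproducing the 21 vs. 24 imbalance seen above.

# Plain-Python sketch of greedy, size-based sharding (an approximation,
# not torch's actual _partition_parameters).
def partition_totals(sizes, world_size):
    totals = [0] * world_size
    # Largest parameter first, each assigned to the least-loaded rank.
    for size in sorted(sizes, reverse=True):
        totals[totals.index(min(totals))] += size
    return totals

sizes = [9, 7, 5, 3]
for world_size in (4, 5):
    # The test expects every rank's shard to total sum(sizes) == 24.
    print(world_size, partition_totals(sizes * world_size, world_size))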
Expected behavior
test_sharding (__main__.TestZeroRedundancyOptimizerDistributed)
Check the sharding at construction time ... ok
The test should pass.
Environment
Collecting environment information...
PyTorch version: 1.9.0a0+gitb4395b0
Is debug build: False
CUDA used to build PyTorch: N/A
OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 12.0.0 (/src/external/llvm-project/clang 0bebc949e017e721b7cf4836a8b9a25fd28e0367)
CMake version: version 3.19.6
Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: Device 66a1
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.9.0a0+git792a7db
[conda] blas 1.0 mkl
[conda] mkl 2020.2 256
[conda] mkl-include 2020.2 256
[conda] mkl-service 2.3.0 py36he8ac12f_0
[conda] mkl_fft 1.2.0 py36h23d657b_0
[conda] mkl_random 1.1.1 py36h0573a6f_0
[conda] numpy 1.18.5 py36ha1c710e_0
[conda] numpy-base 1.18.5 py36hde5b4d6_0
[conda] torch 1.9.0a0+git792a7db pypi_0 pypi
Additional context
cc @vincentqb @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu