🐛 Describe the bug

A PyTorch model with a no-op `nn.Linear` layer (zero output features) works in the default single-GPU setting but fails under DistributedDataParallel (DDP) at `_sync_module_states`.
```python
import os

import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 0)  # no-op layer: zero output features


def setup(rank, size, backend='nccl'):
    """Initialize the distributed environment."""
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)


def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = ToyModel().to(rank)
    # Fails here: DDP.__init__ calls _sync_module_states, which trips over
    # the zero-element parameters of net2.
    ddp_model = DDP(model, device_ids=[rank])


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 2)
```
Running the script on two GPUs fails during DDP construction with:

```
RuntimeError: The size of tensor a (10) must match the size of tensor b (0) at non-singleton dimension 1
```
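For context, the zero-output layer itself is legitimate in eager mode; only DDP's constructor-time state sync chokes on it. A quick single-process sanity check (a sketch added here, not part of the original report):

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 0)      # weight: (0, 10), bias: (0,)
x = torch.randn(4, 10)
print(layer(x).shape)         # torch.Size([4, 0]) -- forward works fine
print(layer.weight.numel())   # 0 -- the zero-element parameter DDP trips over
```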
Versions

```
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 20 2022, 15:58:20) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.107-flatcar-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.152
GPU models and configuration:
GPU 0: NVIDIA A100-SXM-80GB
GPU 1: NVIDIA A100-SXM-80GB
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.7.7
[pip3] pytorch-memlab==0.2.4
[pip3] torch==1.12.1+cu116
[pip3] torch-dct==0.1.5
[pip3] torchmetrics==0.10.0
```
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu
I can reproduce the error; debugging.
Fix: [BE] fix DDP when the number of output features is zero (pytorch#87793)
Fixes pytorch#87280. Pull Request resolved: pytorch#87793. Approved by: https://github.com/rohan-varma
Commits referencing this issue: 44f8efd, f1db5bc, 8339742
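For anyone pinned to an affected release (1.12 and earlier), one possible stopgap is to tell DDP to skip the zero-element parameters during state sync. The sketch below is an assumption on my part, not from this thread: it uses the private static method `DistributedDataParallel._set_params_and_buffers_to_ignore_for_model` and a hypothetical helper name, so treat it as unsupported; upgrading to a build that includes pytorch#87793 is the real fix.

```python
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_skipping_empty_params(model, rank):
    # Collect fully-qualified names of zero-element parameters
    # (e.g. the weight and bias of nn.Linear(10, 0)).
    empty = [name for name, p in model.named_parameters() if p.numel() == 0]
    # Private, unsupported API: DDP will neither sync nor reduce these,
    # so _sync_module_states never touches the zero-sized tensors.
    DDP._set_params_and_buffers_to_ignore_for_model(model, empty)
    return DDP(model, device_ids=[rank])
```

Note the ignored parameters also stop receiving gradient synchronization, which is harmless here only because they have no elements to train.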