
DDP cannot handle Linear(output_features=0) #87280

Closed

nihir27 opened this issue Oct 19, 2022 · 1 comment

Labels: oncall: distributed (add this issue/PR to distributed oncall triage queue)

nihir27 commented Oct 19, 2022

🐛 Describe the bug

A PyTorch model with a no-op Linear layer (out_features=0) works in the default single-GPU setting but fails under DDP at _sync_module_states.

import os

import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        # No-op layer: out_features=0, so weight is (0, 10) and bias is (0,)
        self.net2 = nn.Linear(10, 0)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def setup(rank, size, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)

def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = ToyModel().to(rank)
    # Fails here: DDP's _sync_module_states broadcasts the zero-sized params
    ddp_model = DDP(model, device_ids=[rank])

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == "__main__":
    run_demo(demo_basic, 2)

Running the script with two GPUs fails while constructing DDP:

RuntimeError: The size of tensor a (10) must match the size of tensor b (0) at non-singleton dimension 1
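
For reference, the degenerate layer is fine on its own; a minimal single-process check (no process group, plain CPU):

import torch
import torch.nn as nn

# A zero-output Linear constructs normally: weight has shape (0, 10),
# bias has shape (0,), and the forward pass yields an empty last dim.
layer = nn.Linear(10, 0)
out = layer(torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 0])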

Versions

PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 20 2022, 15:58:20) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.107-flatcar-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.152
GPU models and configuration:
GPU 0: NVIDIA A100-SXM-80GB
GPU 1: NVIDIA A100-SXM-80GB

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.7.7
[pip3] pytorch-memlab==0.2.4
[pip3] torch==1.12.1+cu116
[pip3] torch-dct==0.1.5
[pip3] torchmetrics==0.10.0

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

albanD added the oncall: distributed label on Oct 21, 2022
zhaojuanmao (Contributor) commented

I can reproduce the error; debugging.
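
Until a fix lands, one possible user-side mitigation is to keep DDP from syncing the zero-sized parameters at all; a minimal sketch using DDP's private ignore-list hook (a private API that may change between releases; the parameter names come from the repro above):

from torch.nn.parallel import DistributedDataParallel as DDP

model = ToyModel().to(rank)
# Private API (subject to change): exclude net2's empty parameters from
# DDP's module-state broadcast and gradient reduction.
DDP._set_params_and_buffers_to_ignore_for_model(
    model, ["net2.weight", "net2.bias"]
)
ddp_model = DDP(model, device_ids=[rank])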
