
DDP cannot handle Linear(output_features=0) #87280

Closed

nihir27 opened this issue Oct 19, 2022 · 1 comment

Labels: oncall: distributed (add this issue/PR to distributed oncall triage queue)

nihir27 commented Oct 19, 2022

🐛 Describe the bug

A PyTorch model with a no-op Linear layer (out_features=0) works in the default single-GPU setting but fails under DDP at _sync_module_states.

import os

import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        # No-op layer: out_features=0, so weight is (0, 10) and bias is (0,)
        self.net2 = nn.Linear(10, 0)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def setup(rank, size, backend='nccl'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)

def demo_basic(rank, world_size):
    setup(rank, world_size)
    model = ToyModel().to(rank)
    # Fails here: DDP's _sync_module_states broadcasts the zero-sized params
    ddp_model = DDP(model, device_ids=[rank])

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)

if __name__ == "__main__":
    run_demo(demo_basic, 2)

Running the script with two GPUs fails while constructing DDP:

RuntimeError: The size of tensor a (10) must match the size of tensor b (0) at non-singleton dimension 1
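
For reference, the degenerate layer is fine on its own; a minimal single-process check (no process group, plain CPU):

import torch
import torch.nn as nn

# A zero-output Linear constructs normally: weight has shape (0, 10),
# bias has shape (0,), and the forward pass yields an empty last dim.
layer = nn.Linear(10, 0)
out = layer(torch.randn(4, 10))
print(out.shape)  # torch.Size([4, 0])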

Versions

PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.4 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 20 2022, 15:58:20) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.107-flatcar-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.152
GPU models and configuration:
GPU 0: NVIDIA A100-SXM-80GB
GPU 1: NVIDIA A100-SXM-80GB

Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.4
[pip3] pytorch-lightning==1.7.7
[pip3] pytorch-memlab==0.2.4
[pip3] torch==1.12.1+cu116
[pip3] torch-dct==0.1.5
[pip3] torchmetrics==0.10.0

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu

albanD added the oncall: distributed label on Oct 21, 2022
zhaojuanmao (Contributor) commented

I can reproduce the error; debugging.
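
Until a fix lands, one possible user-side mitigation is to keep DDP from syncing the zero-sized parameters at all; a minimal sketch using DDP's private ignore-list hook (a private API that may change between releases; the parameter names come from the repro above):

from torch.nn.parallel import DistributedDataParallel as DDP

model = ToyModel().to(rank)
# Private API (subject to change): exclude net2's empty parameters from
# DDP's module-state broadcast and gradient reduction.
DDP._set_params_and_buffers_to_ignore_for_model(
    model, ["net2.weight", "net2.bias"]
)
ddp_model = DDP(model, device_ids=[rank])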
