
Calling backward with create_graph on the output of a DistributedDataParallel throws error #63812


Description

@carlosgmartin

🐛 Bug

Calling backward with create_graph=True on the output of a DistributedDataParallel throws a RuntimeError.

To Reproduce

import torch
from torch import nn
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(1.))
    def forward(self):
        return self.p.pow(2)

model = Model()
init_process_group(
    'gloo',
    'tcp://localhost:12355',
    rank=0,
    world_size=1,
)
ddp_model = DDP(model)
ddp_model().backward(create_graph=True)

This raises:
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    ddp_model().backward(create_graph=True)
  File "/usr/local/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: mul(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.

Expected behavior

No error.
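For comparison, the same second-order gradient goes through without error on the unwrapped module, which suggests the failure is specific to the DDP wrapper. A minimal sketch (not part of the original report), reusing the Model class from the reproduction above:

import torch
from torch import nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(1.))
    def forward(self):
        return self.p.pow(2)

model = Model()
out = model()
# First derivative, kept in the graph so it can be differentiated again.
(grad,) = torch.autograd.grad(out, model.p, create_graph=True)  # d(p^2)/dp = 2p
# Second derivative of p^2 with respect to p.
(grad2,) = torch.autograd.grad(grad, model.p)                   # d(2p)/dp = 2
print(grad, grad2)  # tensor(2., grad_fn=...) tensor(2.)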

Environment

Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.5.2 (x86_64)
GCC version: Could not collect
Clang version: 12.0.5 (clang-1205.0.22.9)
CMake version: version 3.20.1
Libc version: N/A

Python version: 3.8.8 (default, Feb 27 2021, 02:19:17)  [Clang 12.0.0 (clang-1200.0.32.29)] (64-bit runtime)
Python platform: macOS-11.5.2-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.812
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.2.10
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0.dev20210313
[pip3] torchmetrics==0.2.0
[pip3] torchvision==0.10.0
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23

Labels

module: ddp (Issues/PRs related to distributed data parallel training)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
