Status: Open
Labels: module: ddp, oncall: distributed, triaged
## 🐛 Bug

Calling `backward` with `create_graph=True` on the output of a `DistributedDataParallel` module throws a `RuntimeError`.
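For context, `create_graph=True` makes the backward pass itself differentiable, which is how second-order gradients are computed. A minimal sketch of the intended usage with plain autograd (no DDP involved):

```python
import torch

# Plain autograd: backward with create_graph=True retains the graph,
# so p.grad is itself differentiable.
p = torch.tensor(1., requires_grad=True)
p.pow(2).backward(create_graph=True)       # p.grad == 2 * p, graph retained
second, = torch.autograd.grad(p.grad, p)   # d²(p²)/dp² == 2
print(second)                              # tensor(2.)
```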
## To Reproduce
```python
import torch
from torch import nn
from torch.distributed import init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP


class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(1.))

    def forward(self):
        return self.p.pow(2)


model = Model()
init_process_group(
    'gloo',
    init_method='tcp://localhost:12355',
    rank=0,
    world_size=1,
)
ddp_model = DDP(model)
ddp_model().backward(create_graph=True)
```
```
Traceback (most recent call last):
  File "test.py", line 21, in <module>
    ddp_model().backward(create_graph=True)
  File "/usr/local/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: mul(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```
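The error appears to come from DDP's gradient-synchronization hooks, which run during `backward`, rather than from autograd itself. One possible workaround sketch (untested here, and note it skips gradient averaging across ranks, so it changes DDP semantics) is to run the double-backward pass under `no_sync()`:

```python
# Possible workaround sketch: DDP's no_sync() disables the gradient
# allreduce for passes run inside the context, which may avoid the
# failing out=... op. Gradients are NOT averaged across ranks here.
with ddp_model.no_sync():
    ddp_model().backward(create_graph=True)
```

Alternatively, calling the wrapped module directly via `ddp_model.module` bypasses DDP entirely, at the cost of no synchronization at all.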
## Expected behavior

No error: `backward(create_graph=True)` should build the double-backward graph on a DDP-wrapped module just as it does on the unwrapped module.
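The same double backward succeeds on the unwrapped module, which suggests the failure is specific to the DDP wrapper. A minimal check, reusing `model` from the reproduction above:

```python
# Calling the plain module (not the DDP wrapper) works as expected.
out = model()
out.backward(create_graph=True)
second, = torch.autograd.grad(model.p.grad, model.p)
print(second)  # tensor(2.)
```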
## Environment
```
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 11.5.2 (x86_64)
GCC version: Could not collect
Clang version: 12.0.5 (clang-1205.0.22.9)
CMake version: version 3.20.1
Libc version: N/A

Python version: 3.8.8 (default, Feb 27 2021, 02:19:17) [Clang 12.0.0 (clang-1200.0.32.29)] (64-bit runtime)
Python platform: macOS-11.5.2-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] mypy==0.812
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.2.10
[pip3] torch==1.9.0
[pip3] torchaudio==0.9.0.dev20210313
[pip3] torchmetrics==0.2.0
[pip3] torchvision==0.10.0
[conda] Could not collect
```
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu @gcramer23