Forward/backward hooks support in DistributedDataParallel #35191
Comments
I think this warning about forward and backward hooks in the documentation may be outdated. In single-GPU-per-process mode the whole model is created from scratch on every node, and forward and backward hooks work just fine (we didn't experience any problems with hooks in such a setup). |
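A minimal sketch of the setup described in that comment, assuming the process group has already been initialized in each worker process (the model and the metric hook are illustrative, not from the original code): the hook is registered on the local model before it is wrapped with DDP, so every process carries its own copy of the hook.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

activation_norms = []  # per-process metric buffer (illustrative)

def record_activation_norm(module, inputs, output):
    # Forward hook: record the norm of this layer's output on the local rank.
    activation_norms.append(output.detach().norm().item())

def build_wrapped_model(rank):
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(rank)
    # Register the hook on the plain module *before* wrapping with DDP.
    model[0].register_forward_hook(record_activation_norm)
    return DDP(model, device_ids=[rank])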
Hi,
Thanks for analyzing this issue.
I don't have a simple snippet of code that shows that hooks work normally
with DDP, but I can confirm that in our code we have a lot of hooks (for
example for metrics) and with DDP (run on multiple machines with multiple
GPUs) everything works as expected. Our hooks are all registered before
wrapping the model with DDP as in your example with `hook_before = True`.
So if your example code shows that it works, then it probably just works
now, and the warning in the docs can be clarified.
…On Tue, 3 Nov 2020 at 22:00, h6197627 ***@***.***> wrote:
Hi, @jbojar <https://github.com/jbojar>,
I really need to try it more thoroughly. I remember that at the time when
DDP was introduced as a better alternative to DataParallel (quite long
ago) I was able to reproduce this limiting behavior, but now, after your
comment, I tried a simple script and it looks like it works:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

class Model(nn.Module):
    def __init__(self, input_dim=(32, 32, 3)):
        super(Model, self).__init__()
        assert (len(input_dim) == 3)
        conv_channels = 16
        self.conv = nn.Conv2d(in_channels=input_dim[-1], out_channels=conv_channels, kernel_size=3, stride=1, padding=1)
        self.linear = nn.Linear(in_features=input_dim[0]*input_dim[1]*conv_channels, out_features=128)
        nn.init.kaiming_uniform_(self.conv.weight, a=1)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_uniform_(self.linear.weight, a=1)
        nn.init.constant_(self.linear.bias, 0)

    def forward(self, input):
        print('Forward: {}'.format(torch.distributed.get_rank()))
        out = self.conv(input)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

def test_ddp_fn(rank, world_size, hook_before):
    setup(rank, world_size)
    model = Model().to(rank)

    # Dummy hook function imitating ReLU activation function
    def relu_act(module, input):
        print('Hook: {}'.format(torch.distributed.get_rank()))
        return F.relu(input[0])

    if hook_before:
        model.linear.register_forward_pre_hook(lambda module, input: relu_act(module, input))
        print('Hook registered before wrapping with DDP: {}'.format(torch.distributed.get_rank()))

    model = DDP(model, device_ids=[rank])

    if not hook_before:
        model.module.linear.register_forward_pre_hook(lambda module, input: relu_act(module, input))
        print('Hook registered after wrapping with DDP: {}'.format(torch.distributed.get_rank()))

    input_data = torch.rand(1, 3, 32, 32)
    model(input_data)
    cleanup()

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    hook_before = True
    mp.spawn(test_ddp_fn, args=(world_size, hook_before), nprocs=world_size, join=True)
Prints
Hook registered before wrapping with DDP: 1
Hook registered before wrapping with DDP: 0
Forward: 0
Forward: 1
Hook: 1
Hook: 0
|
I'm actually quite confused. The language is still present in the documentation.
So what is the correct way to register forward/backward hooks when using DDP? |
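For reference, the experiment quoted above exercised two registration points; using the names from that script, the pattern it showed working looks like the sketch below. This only mirrors that snippet and is not an official recommendation.

# Variant 1 (hook_before = True): register on the plain module, then wrap.
model.linear.register_forward_pre_hook(relu_act)
ddp_model = DDP(model, device_ids=[rank])

# Variant 2 (hook_before = False): wrap first, then reach the underlying
# module through the .module attribute of the DDP wrapper.
ddp_model = DDP(model, device_ids=[rank])
ddp_model.module.linear.register_forward_pre_hook(relu_act)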
Closed in #74063 |
🚀 Feature
PyTorch now recommends using DistributedDataParallel over DataParallel for all kinds of multi-GPU training (#35063). However, it has one limitation compared to the old DataParallel module: it currently cannot handle forward/backward hooks in a user-convenient way.
The proposed workaround (pytorch/torch/nn/parallel/distributed.py, lines 146 to 149 at 95ad94c) requires users to edit each model's forward-propagation code in order to use hooks with a model wrapped in DDP.
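For illustration, a hedged sketch of what that workaround amounts to (the model and the stashed statistic are made up for this example): instead of relying on register_forward_hook, the hook logic is written directly into the model's forward method.

import torch
import torch.nn as nn

class ModelWithInlineHook(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.linear = nn.Linear(16 * 32 * 32, 128)

    def forward(self, x):
        out = self.conv(x)
        # Inlined "hook": the statistic is computed inside forward() itself,
        # so it runs in every DDP process without a registered hook.
        self.last_conv_norm = out.detach().norm()
        out = out.view(out.size(0), -1)
        return self.linear(out)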
As I understand it, DDP wasn't initially designed with this limitation in mind; it was discovered while fixing another issue, #5061. So I am wondering whether it would be possible to implement some sort of hook synchronization mechanism across the distributed model replicas?
Motivation
Also, with the current workaround, the possibility of using hooks dynamically is lost for the DistributedDataParallel module. For example, in my current code with DataParallel I am able to place and remove hooks dynamically: during the validation phase of training I place hooks to extract additional bottleneck features and compute some complementary evaluation metrics that are not calculated during the training phase.
In general, the current hooking mechanism does not look fully compatible with DDP.
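A minimal sketch of the dynamic placement/removal pattern described above, using DataParallel (the model and the captured features are illustrative): register_forward_hook returns a handle, and calling remove() on it detaches the hook again once validation is done.

import torch
import torch.nn as nn

model = nn.DataParallel(nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2)))
captured = []

def capture_bottleneck(module, inputs, output):
    # Store bottleneck activations for complementary validation metrics.
    captured.append(output.detach().cpu())

# Validation phase: attach the hook, run evaluation, then detach it.
handle = model.module[0].register_forward_hook(capture_bottleneck)
with torch.no_grad():
    model(torch.randn(16, 8))
handle.remove()  # training continues without the hook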
Pitch
A hooking mechanism for the DistributedDataParallel module that, from the user's perspective, works the same way as in the DataParallel module.