Forward/backward hooks support in DistributedDataParallel #35191

Closed
h6197627 opened this issue Mar 22, 2020 · 5 comments
Labels
feature A request for a proper, new feature. module: data parallel triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@h6197627
Contributor

h6197627 commented Mar 22, 2020

🚀 Feature

PyTorch now recommends using DistributedDataParallel over DataParallel for all kinds of multi-GPU training (#35063). However, it has one limitation compared to the old DataParallel module: it currently cannot handle forward/backward hooks in a user-convenient way.
Proposed workaround

.. warning::
Forward and backward hooks defined on :attr:`module` and its submodules
won't be invoked anymore, unless the hooks are initialized in the
:meth:`forward` method.

This workaround requires users to edit each model's forward-propagation code in order to use hooks with a model wrapped in DDP.
As I understand it, this limitation was not part of the original design and was discovered while fixing another issue (#5061). So I am wondering whether it would be possible to implement some sort of hook synchronization mechanism across distributed model replicas.
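As a rough illustration of what the documented workaround looks like in practice, here is a minimal sketch (the Net module, layer, and hook body are made up for the example, not taken from real code):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 8)

    def forward(self, x):
        # Register the hook inside forward() so it also exists on any replica
        # DDP creates for this pass; remove the handle right away so repeated
        # calls do not accumulate hooks.
        handle = self.linear.register_forward_hook(
            lambda module, inp, out: print('linear output shape:', tuple(out.shape)))
        out = self.linear(x)
        handle.remove()
        return out

if __name__ == '__main__':
    Net()(torch.randn(2, 16))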

Motivation

Also, with the current workaround, the ability to use hooks dynamically is lost for the DistributedDataParallel module. For example, in my current code with DataParallel I can place and remove hooks dynamically: during the validation phase of training I place hooks to extract additional bottleneck features and compute some complementary evaluation metrics that are not calculated during the training phase (sketched below).
In general, the current hooking mechanism does not look fully compatible with DDP.
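A rough sketch of this dynamic-hook pattern (the toy model, layer choice, and hook body are hypothetical, just to show the place-and-remove flow):

import torch
import torch.nn as nn

# Toy model standing in for the real network; treat the first linear layer as
# the "bottleneck" whose activations we want during validation.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))

captured = {}

def save_bottleneck(module, inp, out):
    # Stash the intermediate activation so complementary validation metrics
    # can be computed from it later.
    captured['bottleneck'] = out.detach()

# Validation phase: place the hook, run the model, then remove the hook so it
# does not fire during the training phase.
handle = model[0].register_forward_hook(save_bottleneck)
with torch.no_grad():
    model(torch.randn(2, 16))
handle.remove()

print(captured['bottleneck'].shape)  # torch.Size([2, 8])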

Pitch

A hooking mechanism for the DistributedDataParallel module that, from the user's perspective, works the same way as in the DataParallel module.

@pbelevich pbelevich added feature A request for a proper, new feature. module: data parallel triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module triage review and removed triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Mar 23, 2020
@ailzhang ailzhang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Mar 30, 2020
@jbojar

jbojar commented Sep 29, 2020

I think this warning about forward and backward hooks in DistributedDataParallel applies only when the model is replicated by the DistributedDataParallel.replicate method, which probably only happens when using multiple GPUs in a single process.

In single-GPU-per-process mode the whole model is created from scratch in every process, and forward and backward hooks work just fine (we haven't experienced any problems with hooks in such a setup).

@h6197627
Contributor Author

h6197627 commented Nov 3, 2020

Hi @jbojar,
I need to test this more thoroughly. I remember that back when DDP was introduced as the preferred alternative to DataParallel (quite a long time ago) I was able to reproduce this limiting behavior, but after your comment I tried a simple script and it looks like it works:

import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

class Model(nn.Module):
    def __init__(self, input_dim=(32, 32, 3)):
        super(Model, self).__init__()
        assert (len(input_dim) == 3)
        conv_channels = 16
        self.conv = nn.Conv2d(in_channels=input_dim[-1], out_channels=conv_channels, kernel_size=3, stride=1, padding=1)
        self.linear = nn.Linear(in_features=input_dim[0]*input_dim[1]*conv_channels, out_features=128)
        nn.init.kaiming_uniform_(self.conv.weight, a=1)
        nn.init.constant_(self.conv.bias, 0)
        nn.init.kaiming_uniform_(self.linear.weight, a=1)
        nn.init.constant_(self.linear.bias, 0)

    def forward(self, input):
        print('Forward: {}'.format(torch.distributed.get_rank()))
        out = self.conv(input)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out


def test_ddp_fn(rank, world_size, hook_before):
    setup(rank, world_size)

    model = Model().to(rank)
    # Dummy hook function imitating ReLU activation function
    def relu_act(module, input):
        print('Hook: {}'.format(torch.distributed.get_rank()))
        return F.relu(input[0])

    if hook_before:
        model.linear.register_forward_pre_hook(lambda module, input: relu_act(module, input))
        print('Hook registered before wrapping with DDP: {}'.format(torch.distributed.get_rank()))
    model = DDP(model, device_ids=[rank])
    if not hook_before:
        model.module.linear.register_forward_pre_hook(lambda module, input: relu_act(module, input))
        print('Hook registered after wrapping with DDP: {}'.format(torch.distributed.get_rank()))

    input_data = torch.rand(1, 3, 32, 32)
    model(input_data)

    cleanup()


def setup(rank, world_size):
    # Initialize the default process group (one process per rank)
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend='gloo', rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2
    hook_before = True
    mp.spawn(test_ddp_fn, args=(world_size, hook_before), nprocs=world_size, join=True)

Prints

Hook registered before wrapping with DDP: 1
Hook registered before wrapping with DDP: 0
Forward: 0
Forward: 1
Hook: 1
Hook: 0

@jbojar

jbojar commented Nov 4, 2020 via email

@aluo-x

aluo-x commented Aug 25, 2021

I'm actually quite confused. The language is still present in the documentation:

Forward and backward hooks defined on module and its submodules won’t be invoked anymore, unless the hooks are initialized in the forward() method.

So what is the correct way to register forward/backward hooks when using DDP?

@h6197627
Contributor Author

Closed in #74063
