
Can't get module gradient in autograd.Function's custom backward when DataParallel is used #33800

Open
jerrybai1995 opened this issue Feb 26, 2020 · 2 comments
Labels
module: data parallel · triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


jerrybai1995 commented Feb 26, 2020

I noticed some strange behavior when using DataParallel with a custom backward pass in an autograd.Function. Here is an example:

import torch
import torch.nn as nn
from torch.autograd import Function

torch.set_default_tensor_type('torch.cuda.FloatTensor')

class Combo(nn.Module):
    def __init__(self):
        super(Combo, self).__init__()
        self.func = nn.Conv1d(3, 3, 3, padding=1)

    def forward(self, x):
        # Route the input through the custom Function, handing it the
        # conv module so that backward can call it again.
        z = Debug.apply(self.func, x)
        return z

class Debug(Function):
    @staticmethod
    def forward(ctx, f, z):
        ctx.save_for_backward(z)
        ctx.f = f                       # stash the module for use in backward
        return z

    @staticmethod
    def backward(ctx, grad):
        grad = grad.clone()
        f = ctx.f
        z, = ctx.saved_tensors
        z = z.clone().detach().requires_grad_()
        # Re-run f on z under enable_grad so we can backprop through it
        # inside this custom backward.
        with torch.enable_grad():
            y = f(z)
        y.backward(torch.randn(z.shape), retain_graph=False)
        print(f.weight.grad)                               # <------------------ HERE
        return None, grad

net = Combo()
para_net = nn.DataParallel(net)

xx = torch.randn(4, 3, 7).requires_grad_()    # batch size 4
yy = para_net(xx)
loss = yy.mean()
loss.backward()

I want to compute and update f.weight.grad inside the custom backward function (see "<----- HERE" in the code). When CUDA_VISIBLE_DEVICES=0 (i.e., only one GPU is used), this works fine; but with CUDA_VISIBLE_DEVICES=0,1,2,3, the printed f.weight.grad is None on every GPU.
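For reference, this is how I toggle the two cases. CUDA_VISIBLE_DEVICES has to be set before torch initializes CUDA, so I set it at the very top of the script (setting it on the command line works equally well):

import os

# Set before importing torch, so CUDA only sees the listed devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"          # 1 GPU: f.weight.grad is populated
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # 4 GPUs: f.weight.grad prints None

import torch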

My guess is that when multiple GPUs are used, DataParallel replicates f onto each device, so the ctx.f seen in backward is a per-device replica rather than the original module.
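One way to test this guess is to check whether a replica's parameters are still leaf tensors, since .backward() only populates .grad on leaf tensors. A minimal diagnostic sketch (it assumes at least two visible GPUs and uses torch.nn.parallel.replicate, which DataParallel calls internally on every forward pass):

import torch
import torch.nn as nn
from torch.nn.parallel import replicate

m = nn.Conv1d(3, 3, 3, padding=1).cuda()
print(m.weight.is_leaf)             # True: the original nn.Parameter

# Replicas receive their parameters through a differentiable broadcast,
# so those tensors are non-leaf and .backward() leaves their .grad as None.
replicas = replicate(m, [0, 1])     # assumes GPUs 0 and 1 are visible
print(replicas[1].weight.is_leaf)   # False: output of the broadcast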

The desired behavior is for each device to compute its own f.weight.grad and for these gradients to be summed when they are eventually collected on GPU 0. Is there any way to resolve this? One idea I have been sketching is shown below.
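The idea (not fully verified) is to pass the weight and bias to the Function as explicit tensor inputs and to return their gradients from backward: since the broadcast that DataParallel uses to replicate parameters is differentiable, the gradients returned on each replica should then be summed back onto the original parameters on the source device. A minimal sketch under that assumption (DebugV2 and ComboV2 are just names for this variant; I pass the tensors by attribute access because .parameters() can come back empty on a replica):

import torch
import torch.nn as nn
from torch.autograd import Function

class DebugV2(Function):
    @staticmethod
    def forward(ctx, z, f, weight, bias):
        # weight and bias are f's own tensors, listed as explicit inputs
        # so that autograd (and the broadcast behind DataParallel) tracks them.
        ctx.save_for_backward(z, weight, bias)
        ctx.f = f
        return z

    @staticmethod
    def backward(ctx, grad):
        f = ctx.f
        z, weight, bias = ctx.saved_tensors
        z = z.clone().detach().requires_grad_()
        with torch.enable_grad():
            y = f(z)
        # Compute the parameter gradients explicitly instead of reading
        # .grad fields, which stay None on the replicas' non-leaf tensors.
        w_grad, b_grad = torch.autograd.grad(
            y, (weight, bias), grad_outputs=torch.randn_like(y))
        # One gradient per forward input: z, f (a module, so None), weight, bias.
        return grad.clone(), None, w_grad, b_grad

class ComboV2(nn.Module):
    def __init__(self):
        super(ComboV2, self).__init__()
        self.func = nn.Conv1d(3, 3, 3, padding=1)

    def forward(self, x):
        # Attribute access (self.func.weight) resolves to the replica's
        # broadcast tensor under DataParallel, so the same code should
        # run on one GPU or several.
        return DebugV2.apply(x, self.func, self.func.weight, self.func.bias)

If this works as I expect, after loss.backward() the summed gradients land in net.func.weight.grad on GPU 0 instead of staying on the replicas, but I have not verified this beyond my own setup.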

Thanks a lot!

@agolynski added the module: data parallel and triaged labels on Feb 26, 2020
@ruoshiliu

I'm getting a similar error. Has this issue been resolved yet?

@amrhamedp

same here
