
Gradient of zero norm is nan #2421

Closed
zkolter opened this issue Aug 15, 2017 · 16 comments · Fixed by #2775

Comments

@zkolter
Contributor

zkolter commented Aug 15, 2017

If a norm is zero, its gradient returns nan:

x = Variable(torch.zeros(1), requires_grad=True)
x.norm().backward()
print(x.grad)

# Variable containing:
# nan
# [torch.FloatTensor of size 1]

Obviously this is just happening because the gradient divides by the norm, but the (sub)gradient here should probably be zero, or at least not nan, since the nan will propagate and make all updates nan. Probably low priority, as it won't be an issue in 99% of cases, but we're doing a few things with (exact) line searches where this caused a nan to appear, breaking everything downstream.
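For reference, the analytic gradient of the 2-norm is x / ‖x‖, which is 0/0 at the origin. A minimal plain-Python sketch of this failure mode (the helper `norm_grad_1d` is hypothetical, just illustrating the formula autograd evaluates):

```python
import math

def norm_grad_1d(x):
    # Analytic gradient of the 1-D 2-norm: d|x|/dx = x / |x|.
    # At x = 0 this is 0/0, which IEEE arithmetic turns into nan,
    # mirroring what autograd computes for x.norm() at zero.
    n = math.sqrt(x * x)
    try:
        return x / n
    except ZeroDivisionError:
        return float("nan")  # Python raises on 0.0 / 0.0; emulate IEEE nan

print(norm_grad_1d(2.0))  # 1.0
print(norm_grad_1d(0.0))  # nan
```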

@cai-lw

cai-lw commented Aug 17, 2017

I'm encountering exactly the same issue! I spent hours debugging, only to find that PyTorch has a bug in such a basic operation.

@el3ment

el3ment commented Sep 6, 2017

+1 just found this bug too

@dannysdeng

dannysdeng commented Sep 7, 2017

+1 for this bug. Temporarily changing my code to something like the following for the sake of debugging.

x = Variable(torch.zeros(1), requires_grad=True)
y = x + 1e-16
y.norm().backward()
print(x.grad)

@albanD
Collaborator

albanD commented Sep 7, 2017

The thing is that the 2-norm contains a square root, which has a gradient of +infinity at 0.
The gradient gives you nan because you then multiply 0 by infinity during the backward pass.
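That 0 × ∞ product can be reproduced with plain floats; nothing PyTorch-specific is involved, just IEEE 754 arithmetic:

```python
import math

# Local gradient of sqrt(z) at z = 0 is 1 / (2 * sqrt(0)) = +inf;
# the incoming gradient from the rest of the graph at x = 0 is 0.
# IEEE 754 defines 0 * inf as nan, which is what lands in x.grad.
local_grad = float("inf")
incoming_grad = 0.0
print(local_grad * incoming_grad)  # nan
```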

@ruotianluo
Contributor

For a scalar, the 2-norm is basically abs, but x.abs().backward() gives you a gradient of 0. In this sense, the behavior is not coherent.

@JianboTang

I found this error, too

@soumith
Member

soumith commented Sep 18, 2017

Alban fixed this behavior in #2775

@D-X-Y

D-X-Y commented Dec 31, 2017

@soumith Hi, the norm function gives us a 0 gradient now.
However, the following code still has the nan gradient problem:

x = torch.autograd.Variable(torch.zeros(1), requires_grad=True)
y = torch.sqrt( x * x )
y.backward()
print(x.grad)

@albanD
Collaborator

albanD commented Dec 31, 2017

Oh, the square root has no gradient at 0. This is expected behavior.

@D-X-Y

D-X-Y commented Dec 31, 2017

Hi @albanD, shouldn't the sub-gradient of the square root be zero?
Also, y = torch.sqrt(x * x) should be equal to x.norm(), so why do they have different gradients (0 and nan)?

@cai-lw

cai-lw commented Dec 31, 2017

@D-X-Y I think @albanD was right. The left-side derivative of sqrt(x) at x=0 is undefined, so it doesn't even have a subgradient at x=0.

@albanD
Collaborator

albanD commented Jan 5, 2018

@D-X-Y square root has no subgradient at 0. You could define a gradient by continuity, but then it would be +inf...
Given that PyTorch is using autograd, x.norm() and x.pow(2).sqrt() (equivalent to your torch.sqrt(x*x)) are completely different:

  • The first one is a single function that is convex and defined on R; it has a subgradient of 0 at 0.
  • The second one is composed of two functions. The first is the square function, which is differentiable and outputs values in [0, +inf[. The second is the square root, which is not convex; even though it is defined on [0, +inf[, it is only differentiable on ]0, +inf[, and its gradient at 0 is undefined.

Given that, even though x.norm() and x.pow(2).sqrt() return the same value, their gradients may differ at points where the function is not differentiable. This is because automatic differentiation looks at each step of the computation one by one: a subgradient may exist when we view multiple operations as a single function, but it does not always exist for each step taken separately, and in that case the gradient remains undefined.
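The distinction can be sketched without autograd at all: differentiate the norm as one primitive versus sqrt(x*x) op by op. This is a hand-rolled illustration, not PyTorch's actual code:

```python
import math

def grad_norm(x):
    # x.norm() differentiated as a single primitive; PyTorch (after #2775)
    # picks the subgradient 0 at x = 0.
    return x / abs(x) if x != 0 else 0.0

def grad_sqrt_of_square(x):
    # torch.sqrt(x * x) differentiated step by step via the chain rule:
    # d sqrt(z)/dz is +inf at z = 0, and dz/dx = 2x is 0 at x = 0,
    # so the product is 0 * inf = nan.
    z = x * x
    dz_dx = 2.0 * x
    dsqrt_dz = float("inf") if z == 0 else 0.5 / math.sqrt(z)
    return dsqrt_dz * dz_dx

print(grad_norm(0.0))            # 0.0
print(grad_sqrt_of_square(0.0))  # nan
print(grad_sqrt_of_square(3.0))  # 1.0
```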

@xforceco

xforceco commented Mar 17, 2018

I think math is math. Any root's gradient at zero is either infinite or undefined.
This issue should be handled by users themselves by adding a small value (as @dannysdeng did), but an error (or warning) message may be helpful, since it is pretty hard to debug.

Say: "Infinite/undefined gradient detected in X_function_X at line Y. Exiting."
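The epsilon pattern mentioned above, sketched in plain Python (`EPS` is an assumed value you would tune to your data's scale; the trade-off is that the stabilized norm is biased, returning sqrt(eps) rather than 0 at the origin):

```python
import math

EPS = 1e-12  # assumed stabilizer; tune relative to your data's scale

def stabilized_norm(xs, eps=EPS):
    # sqrt(sum(x^2) + eps) instead of sqrt(sum(x^2)):
    # the gradient x_i / sqrt(sum(x^2) + eps) is finite everywhere.
    return math.sqrt(sum(x * x for x in xs) + eps)

def stabilized_norm_grad(xs, eps=EPS):
    n = stabilized_norm(xs, eps)
    return [x / n for x in xs]

print(stabilized_norm_grad([0.0, 0.0]))  # [0.0, 0.0], no nan
print(stabilized_norm_grad([3.0, 4.0]))  # ~[0.6, 0.8]
```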

@ngimel
Collaborator

ngimel commented Mar 17, 2018

Agree, norm is not differentiable at 0 (https://math.stackexchange.com/questions/310325/is-the-euclidean-norm-differentiable-at-0). The band-aid that Alban put there in #2775 is wrong (even in the limit sense, the gradient at 0 should be 1, not 0), but it should not have been there at all. Norm is norm; if someone wants to add epsilons to their norms (like batchnorm does, e.g.), they are welcome to do so in user code. What would numpy do?
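On the limit question, in 1-D the limit of the gradient depends on the approach direction (it is sign(x)), which a quick difference-quotient check shows:

```python
def abs_slope(x0, h):
    # One-sided difference quotient of |x| at x0 with step h.
    return (abs(x0 + h) - abs(x0)) / h

# The two one-sided limits at 0 disagree, so no single limiting
# gradient exists there; 0 is the symmetric choice between them.
print(abs_slope(0.0, 1e-6))   # 1.0  (approaching from the right)
print(abs_slope(0.0, -1e-6))  # -1.0 (approaching from the left)
```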

@asford
Contributor

asford commented May 10, 2018

@ngimel @albanD

I've also run into a number of problems related to the change introduced in #2775. Is there a reason the subgradient is set to 0, rather than 1 (the limit as norm -> 0)?

As a minimal example:

import torch
from IPython.display import display  # display() is notebook-only; import it for plain Python

v = torch.linspace(0, 1e-6, steps=10).requires_grad_()

def bnorm(val):
    n = val.detach().clone().requires_grad_(True)
    c = n.reshape(-1, 1)
    
    nn = c.norm(dim=-1)
    torch.autograd.backward(nn, torch.ones_like(nn))
    return n.grad

def snorm(val):
    n = val.detach().clone().requires_grad_(True)
    c = n.reshape(-1, 1)
    
    nn = (c ** 2).sum(dim=-1).sqrt()
    
    torch.autograd.backward(nn, torch.ones_like(nn))
    return n.grad

print("torch.__version__:")
display(torch.__version__)

print("vals:")
display(v)

print("torch.norm(v, dim=-1) grad:")
display(bnorm(v))

print("(v**2).sum(dim=-1).sqrt() grad:")
display(snorm(v))

Produces

torch.__version__:
'0.4.0'
vals:
tensor(1.00000e-07 *
       [ 0.0000,  1.1111,  2.2222,  3.3333,  4.4444,  5.5556,  6.6667,
         7.7778,  8.8889, 10.0000])
torch.norm(v, dim=-1) grad:
tensor([ 0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
(v**2).sum(dim=-1).sqrt() grad:
tensor([nan,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

@albanD
Collaborator

albanD commented May 11, 2018

Well, any value in [-1, 1] is a valid subgradient for the 2-norm.
More generally, any vector in the unit ball of the dual norm is a valid subgradient.
This means that 0 is always going to be a subgradient, while 1 will not be for all p.

Anyway, the theory says that any of them can be taken and subgradient descent will work. I'm sure that depending on the application, one will be better than the other.
For example, the relu function also gives a 0 subgradient at 0; you could have chosen 1.
The main point here was to remove nans that make your network return nan for everything, which is not convenient.
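The relu comparison above, as a plain-Python sketch of the convention (the 0-at-0 choice is what the comment describes; any value in [0, 1] would be a mathematically valid subgradient at the kink):

```python
def relu(x):
    return max(x, 0.0)

def relu_subgrad(x):
    # Convention described above: take subgradient 0 at the kink x = 0,
    # even though any value in [0, 1] is mathematically valid there.
    return 1.0 if x > 0 else 0.0

print(relu(-1.0))          # 0.0
print(relu_subgrad(0.0))   # 0.0
print(relu_subgrad(2.0))   # 1.0
```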

imgemp added a commit to imgemp/Lyapunov-GANs that referenced this issue Sep 22, 2018
gradient of sqrt(x^2) near zero is infinite due to chain rule decomposition: 1/2*z^(-1/2)*2x where z = x^2.
pytorch/pytorch#2421