Gradient of zero norm is nan #2421
If a norm is zero, its gradient returns nan:

Obviously this just happens because the gradient divides by the norm, but the (sub)gradient here should probably be zero, or at least not nan, since a nan will propagate and make all subsequent updates nan. Probably low priority, as it's not going to be an issue in 99% of cases, but we're doing a few things with (exact) line searches where this caused a nan to appear, breaking everything downstream.
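The snippet that followed in the original report was not captured here; a minimal sketch of the reported behavior on releases predating the fix discussed in the comments (later versions return a zero gradient instead):

```python
import torch

# the 2-norm of an all-zero vector is 0, and norm's backward divides by it
x = torch.zeros(3, requires_grad=True)
x.norm().backward()
print(x.grad)  # tensor([nan, nan, nan]) on affected versions
```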
I'm encountering exactly the same issue! I spent hours debugging, only to find that PyTorch has a bug in such a basic operation.
+1, just found this bug too.
+1 for this bug. I'm temporarily changing my code to something like the following for the sake of debugging.
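The snippet itself did not survive extraction; a plausible stand-in puts a small epsilon inside the square root so the backward pass never divides by zero (the name `safe_norm` and the eps value are illustrative):

```python
import torch

def safe_norm(x, dim=-1, eps=1e-12):
    # eps keeps the sqrt input strictly positive, so the gradient
    # x / sqrt(sum(x**2) + eps) stays finite even when x is all zeros
    return (x * x).sum(dim=dim).add(eps).sqrt()

v = torch.zeros(3, requires_grad=True)
safe_norm(v).backward()
print(v.grad)  # tensor([0., 0., 0.]) instead of nan
```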
The thing is that the 2-norm contains a square root, whose gradient 1/(2*sqrt(x)) goes to infinity at 0.
For a scalar, the 2-norm is basically abs, but `x.abs().backward()` gives you a 0 gradient at 0. In this sense, it's not coherent.
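A quick check of that `abs` behavior (a sketch; PyTorch uses `sign(x)` as the gradient of `abs`, and `sign(0) == 0`):

```python
import torch

x = torch.zeros(1, requires_grad=True)
x.abs().sum().backward()
print(x.grad)  # tensor([0.]): abs picks the subgradient 0 at the origin
```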
I found this error, too.
Alban fixed this behavior in #2775 |
@soumith Hi, the norm function gives us 0 gradients now.
Hi @albanD, but should the sub-gradient of the square root be zero?
@D-X-Y The square root has no subgradient at 0. You could define a gradient by continuity, but then it would be +inf, not 0.
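That continuity limit is easy to see directly (a sketch): `sqrt`'s backward computes `1/(2*sqrt(x))`, which blows up at 0:

```python
import torch

x = torch.zeros(1, requires_grad=True)
x.sqrt().sum().backward()
print(x.grad)  # tensor([inf]): the limit of 1/(2*sqrt(x)) as x -> 0+
```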
I think math is math. Any root's gradient at zero is either infinite or undefined. PyTorch should just say so, e.g. "Infinite/undefined gradient detected in X_function_X at line Y", and exit.
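Later PyTorch releases grew something close to this in the form of anomaly detection, which raises an error naming the backward function that produced a nan (a sketch; the exact message varies by version):

```python
import torch

torch.autograd.set_detect_anomaly(True)

x = torch.zeros(3, requires_grad=True)
y = (x ** 2).sum().sqrt()
y.backward()  # RuntimeError: one of the backward functions returned nan values
```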
Agreed, the norm is not differentiable at 0 (https://math.stackexchange.com/questions/310325/is-the-euclidean-norm-differentiable-at-0). The band-aid that Alban put there in #2775 is wrong (even in the limit sense the gradient at 0 should be 1, not 0), but it should not have been there at all. A norm is a norm; if someone wants to add epsilons to their norms (like batch norm does, e.g.) they are welcome to do so in user code. What would numpy do?
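For a user-code fix that does not bias nonzero values the way an epsilon does, one option is a two-step masking pattern (a sketch; note that the naive `torch.where(n > 0, n.sqrt(), zeros)` still yields nan, because the untaken `sqrt` branch is evaluated at 0 during backward):

```python
import torch

def masked_norm(x, dim=-1):
    n = (x * x).sum(dim=dim)
    mask = n > 0
    # substitute a safe value *before* the sqrt so the untaken branch
    # has a finite gradient, then zero out the masked outputs
    safe = torch.where(mask, n, torch.ones_like(n))
    return torch.where(mask, safe.sqrt(), torch.zeros_like(n))

v = torch.zeros(2, 3, requires_grad=True)
masked_norm(v).sum().backward()
print(v.grad)  # all zeros, no nan
```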
I've also run into a number of problems related to the change introduced in #2775. Is there a reason the subgradient is set to 0, rather than 1 (the limit as norm -> 0)? As a minimal example:

```python
import torch

v = torch.linspace(0, 1e-6, steps=10).requires_grad_()

def bnorm(val):
    # gradient of the built-in norm
    n = val.detach().clone().requires_grad_(True)
    c = n.reshape(-1, 1)
    nn = c.norm(dim=-1)
    torch.autograd.backward(nn, torch.ones_like(nn))
    return n.grad

def snorm(val):
    # gradient of the manually decomposed norm, sqrt(sum(x**2))
    n = val.detach().clone().requires_grad_(True)
    c = n.reshape(-1, 1)
    nn = (c ** 2).sum(dim=-1).sqrt()
    torch.autograd.backward(nn, torch.ones_like(nn))
    return n.grad

print("torch.__version__:")
print(torch.__version__)
print("vals:")
print(v)
print("torch.norm(v, dim=-1) grad:")
print(bnorm(v))
print("(v**2).sum(dim=-1).sqrt() grad:")
print(snorm(v))
```

This produces a 0 gradient for the zero entry from the built-in `norm`, but nan from the explicit `sqrt` version.
Well, any value in [-1, 1] is a valid subgradient of the scalar 2-norm at 0. Anyway, the theory says that any of them can be taken and subgradient descent will still work. I'm sure that depending on the application, one choice will be better than another.
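For reference, the full set of valid subgradients of the Euclidean norm at the origin is the closed unit ball (a standard fact from convex analysis):

```latex
\partial \|x\|_2 \,\big|_{x=0} = \{\, g \in \mathbb{R}^n : \|g\|_2 \le 1 \,\}
```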
At zero, the gradient of sqrt(x^2) comes out as nan from the chain-rule decomposition 1/2 * z^(-1/2) * 2x where z = x^2: the first factor is infinite and the second is zero. pytorch/pytorch#2421
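A minimal demonstration of that inf * 0 (a sketch):

```python
import torch

x = torch.zeros(1, requires_grad=True)
(x ** 2).sqrt().sum().backward()
print(x.grad)  # tensor([nan]): inf from sqrt'(0) times 0 from d(x**2)/dx at 0
```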