x.grad should be 0 but get NaN after x/0 #4132
Comments
I think the reason why this is happening is that the backwards pass for indexing returns a zero gradient for the unselected elements, and the division backward then multiplies that zero by 1/0 = inf, giving 0 * inf = NaN.
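A quick illustration of how those two facts combine (my example, not from the thread):

import torch

# Indexing backward fills the unselected positions with zero gradient:
y = torch.arange(3.0, requires_grad=True)
y[1].backward()
print(y.grad)  # tensor([0., 1., 0.])

# The division backward multiplies the incoming gradient by 1/div, so for a
# zero denominator the zero gradient meets inf, and 0 * inf = nan:
print(torch.tensor(0.0) * torch.tensor(float("inf")))  # tensor(nan)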
@zou3519 I agree with that reason. Key to avoid NaN: your code shouldn't generate any inf in the forward pass, which is often produced by torch.log(0) and x/0. That means zeros should be filtered out before doing torch.log(x) or x/div.
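One way to follow that advice (my sketch, not code from the thread): make the denominator safe before dividing, so no inf is ever created in the forward pass. Masking the result only after the division is not enough, because the inf stays in the graph:

import torch

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
mask = div != 0

# Replace zero denominators *before* dividing, then zero out invalid entries.
safe_div = torch.where(mask, div, torch.ones_like(div))
y = torch.where(mask, x / safe_div, torch.zeros_like(x))
y.sum().backward()
print(x.grad)  # tensor([0., 1.]) -- no NaN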
@DIYer22 Did you release the complete code for this project on GitHub?
@pranerd I didn't; the project is all dirty code right now.
@DIYer22 Will you release it after you finish it?
My solution, beyond the owner's advice:
… the gradient. This link was useful in finally getting the right insight (pytorch/pytorch#4132): "Key to avoid NaN: your code shouldn't generate any inf in forward, which is often produced by torch.log(0) and x/0. That means 0 should be filtered out before doing torch.log(x) and x/div."

As it turned out, it was not clear that my forward code was causing infinity. However, putting some prints on the maximum values of tensors in the MDLSTM computation revealed a clear problem:

1. Earlier it was revealed that "output_gate_weighted_states_plus_input" was the first place where NaN gradients were observed. This was found by registering a gradient-clamping backward hook on this variable in the function "compute_multi_dimensional_lstm_one_direction" in multi_dimensional_lstm.py. This backward hook checked for bad gradients (including NaNs) and raised a runtime error once it found one.

2. Second, I observed that in fact none of the predecessor gradients computed before this gradient (the gradient from the convolution layer after the first MDLSTM layer, for example) seemed to be problematic. This seemed odd, as it was hard to imagine how the computation of the gradient of a sigmoid function could lead to problems.

3. Adding information about the layer index and column number to the bad-gradient-identifying hook revealed that the problem appeared in the last column of the first MDLSTM layer, that is, at the start of the backward pass for that layer.

4. At the point where I was almost stuck with this bug, I decided to take the possibility of problematic values in the forward computation itself more seriously, so I added printouts of the maximum values of tensors. It turned out that the variables "output_gate_weighted_states_plus_input" and, at the root of that, "new_memory_state" were growing unboundedly, with a growth rate of approximately 2 or less per column iteration in the MDLSTM forward computation, reaching values greater than e.g. 1.4322e+10 after a while.

I realized that new_memory_state would not be gated anymore, but directly saved for the next column iteration. Hence it seemed to make sense that the old computation

    new_memory_state = input_and_input_gate_combined + \
        forget_gate_two_activation_multiplied_with_previous_memory_state + \
        forget_gate_one_activation_multiplied_with_previous_memory_state
    # + forget_gates_combined_activation_multiplied_with_previous_memory_state

could indeed cause the memory state to grow over time, since it effectively replaced new_memory_state with two functions of the old memory state in addition to the function of the input. Importantly, both "forget_gate_two_activation_multiplied_with_previous_memory_state" and "forget_gate_one_activation_multiplied_with_previous_memory_state" in the above formula have the potential to become nearly as large as the memory state itself, as the activities of their gating components (forget gate 1, forget gate 2) go to 1.

A simple fix that is added in this commit, and that indeed stabilizes the values, is:

    new_memory_state = input_and_input_gate_combined + \
        0.5 * forget_gate_two_activation_multiplied_with_previous_memory_state + \
        0.5 * forget_gate_one_activation_multiplied_with_previous_memory_state

This fix also removes the need to do gradient clamping during the backward computation; norm-based gradient clipping now suffices, which is computationally cheaper and has nicer theoretical properties (it preserves the direction of the original gradient, which value-based clamping/clipping cannot achieve).
Note, however, that another way to fix the problem would be to initialize the forget gates differently; they are currently initialized to 1. While the latter may be advisable for a single-dimensional LSTM, it may lead to unnecessary instability for the MultiDimensionalLSTM. TODO: Next I will look at whether the scaling of the components by 0.5 can be gotten rid of by changing the bias of the forget gates instead.

modified: modules/block_strided_convolution.py
modified: modules/gradient_clamped_module.py
modified: modules/inside_model_gradient_clipping.py
modified: modules/mdlstm_layer_block_strided_convolution_layer_pair.py
modified: modules/multi_dimensional_lstm.py
modified: modules/multi_dimensional_lstm_layer_pair_stacking.py
modified: modules/multi_dimensional_lstm_parameters.py
modified: modules/multi_dimensional_rnn.py
modified: modules/network_to_softmax_network.py
modified: modules/parallel_multiple_state_weightings_computation.py
modified: modules/train_multi_dimensional_rnn_ctc.py
modified: modules/trainer.py
modified: util/tensor_utils.py
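To see why the unscaled update grows without bound and why the 0.5 scaling stabilizes it, here is a small numeric sketch of the recurrence (my illustration, not code from the commit). With both forget-gate activations near 1, the update c_t = i_t + f1 * c_{t-1} + f2 * c_{t-1} has an effective recurrence factor near 2, while scaling each term by 0.5 keeps the factor at most 1:

# Unscaled update: two nearly-ungated copies of the previous memory state.
c = 1.0
for _ in range(40):
    c = 1.0 + 0.95 * c + 0.95 * c  # both forget-gate activations near 1
print(c)  # about 1.9**40, on the order of 1e11: unbounded geometric growth

# With the 0.5 scaling from the fix, the recurrence factor stays <= 1:
c = 1.0
for _ in range(40):
    c = 1.0 + 0.5 * 0.95 * c + 0.5 * 0.95 * c
print(c)  # converges towards 1 / (1 - 0.95) = 20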
If you add a small value, such as 1e-30, to the zero items, will this still happen?
I think this theme (getting NaNs) comes up a lot. Maybe we should have some documentation around NaN best practices, if we don't have that already.
If I add an eps=1e-15, will this NaN problem caused by x/0 be avoided? Thank you!
This is actually not a bug: 1/0 = inf. The easy solution is to use an eps instead of 0.
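A sketch of that eps workaround (my example, not from the thread). It avoids the NaN here because 0 / eps is exactly 0; the caveat is that if y[0] itself entered the loss, the 1/eps factor would blow its gradient up to 1e15 instead of giving 0:

import torch

eps = 1e-15
x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x / (div + eps)  # no inf is created in the forward pass
y[1].backward()
print(x.grad)  # tensor([0., 1.]) -- the NaN is gone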
It is a bug. Another example: import torch
a = torch.tensor([1.], requires_grad=True)
b = a ** 2
b.backward()
# `a.grad` is `2` as expected
print(a.grad)
a = torch.tensor([1.], requires_grad=True)
b = a ** 2
b = b.clone()
b[0] = 0
b.backward()
# `a.grad` is `0` as expected
print(a.grad)
a = torch.tensor([0.], requires_grad=True)
b = a.sqrt()
b.backward()
# `a.grad` is `inf` as expected
print(a.grad)
a = torch.tensor([0.], requires_grad=True)
b = a.sqrt()
b = b.clone()
b[0] = 0
b.backward()
# `a.grad` is `nan` but expected `0`
print(a.grad)
This still reproduces on master.
At the triage review meeting, we decided: this is expected behavior (for now at least). It is a legitimate request to distinguish "zero gradient" and "no gradient". However, we may need a first-class concept of masking in PyTorch to accomplish this, so this won't happen anytime soon. |
Hi all, just wanted to draw some attention to MaskedTensor, a tensor extension library whose prototype we just launched alongside PyTorch 1.11. It should help solve many existing problems related to masked semantics, such as distinguishing between 0 and undefined gradients. E.g. for this example:

import torch
# Assumed import: in the 1.11-era prototype this comes from the standalone
# maskedtensor package; in later PyTorch releases it lives in torch.masked.
from maskedtensor import as_masked_tensor

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x/div # => y is [inf, 1]
mask = (div != 0) # => mask is [0, 1]
loss = as_masked_tensor(y, mask)
loss = loss.sum()
loss.backward()
x.grad
masked_tensor(
  [      --,   1.0000]
)

For more details, you can check out the website or GitHub. We're actively looking for new users, so if you're interested in using it or have any suggestions on things we should build out, please reach out to me or @cpuhrsch. Thanks!!
Reproduction BUG code:
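The snippet itself did not survive; the following is a minimal sketch reconstructed from the description below and the MaskedTensor comment above:

import torch

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x / div    # y is [inf, 1]
loss = y[1]    # loss depends only on x[1]
loss.backward()
print(x.grad)  # tensor([nan, 1.]) -- expected tensor([0., 1.])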
The computational graph of loss does not include x[0]. So the gradient of x[0] should be 0, but it gets NaN.
A simpler reproduction:
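The original snippet is also missing here; a plausible minimal sketch in the same spirit:

import torch

x = torch.tensor([1.], requires_grad=True)
y = x / 0.0           # y is inf
(y * 0.0).backward()  # d(y*0)/dy is 0, and 0 * inf = nan in the backward
print(x.grad)         # tensor([nan]) -- expected tensor([0.])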
Versions:
cc @svekars @holly1238 @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @brianjo @mruberry @gchanan @bdhirsh @jbschlosser @anjali411 @jlin27