
x.grad should be 0 but gets NaN after x/0 #4132

Open
DIYer22 opened this issue Dec 12, 2017 · 16 comments
Labels
  • has workaround
  • module: autograd (Related to torch.autograd, and the autograd engine in general)
  • module: docs (Related to our documentation, both in docs/ and docblocks)
  • module: NaNs and Infs (Problems related to NaN and Inf handling in floating point)
  • needs design
  • triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

DIYer22 commented Dec 12, 2017

x.grad should be 0 but gets NaN after x/0

Bug reproduction code

import torch
from torch.autograd import Variable

x = Variable(torch.FloatTensor([1.,1]), requires_grad=True)
div = Variable(torch.FloatTensor([0.,1]))

y = x/div # => y is [inf, 1]

zero_mask = (div==0) # => zero_mask is [1, 0]
y[zero_mask] = 0  # => y is [0, 1]

loss = y.sum()
loss.backward()
print(x.grad) # grad is [nan, 1], but expected [0, 1]

The computational graph of loss does not include x[0], so the gradient of x[0] should be 0, but it comes out as NaN.

Simpler reproduction

x = Variable(torch.FloatTensor([1.,1]), requires_grad=True)
div = Variable(torch.FloatTensor([0.,1]))

y = x/div # => y is [inf, 1]

mask = (div!=0) # => mask is [0, 1]
loss = y[mask]

loss.backward()
print(x.grad) # grad is [nan, 1], but expected [0, 1]

Versions:

  • Python: 2.7
  • PyTorch: 0.3.0.post4

cc @svekars @holly1238 @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @brianjo @mruberry @gchanan @bdhirsh @jbschlosser @anjali411 @jlin27

zou3519 (Contributor) commented Dec 12, 2017

I think the reason this is happening is that the backward pass for indexing (y[mask]) returns a tensor that has 0s in it for the masked-out indices. 0 * float('inf') gives nan, which is why nan shows up instead of 0.
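A minimal sketch of the mechanism described above (hand-simulated, not the actual autograd internals): the backward of y[mask] scatters the incoming gradient into a zero-filled tensor, and that zero then meets the inf coming from the division's local gradient, giving 0 * inf = nan.

import torch

grad_from_indexing = torch.tensor([0., 1.])       # zero for the masked-out element
local_grad_of_div = 1.0 / torch.tensor([0., 1.])  # d(x/div)/dx = 1/div = [inf, 1]
print(grad_from_indexing * local_grad_of_div)     # tensor([nan, 1.])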

DIYer22 (Author) commented Dec 13, 2017

@zou3519 I agree with that reasoning. But is it a bug? Indexing should perform a selection rather than a multiplication by a mask, and x[0] should not be in the computational graph.

DIYer22 (Author) commented Dec 13, 2017

Key to avoiding nan

Your code shouldn't generate any inf in the forward pass; inf is often produced by torch.log(0) and by dividing by a tensor that contains 0. That means zeros should be filtered out before calling torch.log(x) or computing x/div.

My example

Variables

└─ /: 4
    ├── gtind: torch.Size([1, 2, 300, 400]) torch.cuda.FloatTensor
    ├── edge: torch.Size([1, 300, 400]) torch.cuda.ByteTensor
    ├── probnb: torch.Size([1, 8, 2, 300, 400]) torch.cuda.FloatTensor
    └── gtdf: torch.Size([1, 8, 300, 400]) torch.cuda.FloatTensor
th = torch
tots = lambda x: x.data  # shorthand: get the underlying tensor (no grad tracking)

code(before)

    otherSideEdgeLossMap = -th.log(((probnb*gtind).sum(-3)*gtdf).sum(-3)/gtdf.sum(-3))
    otherSideEdgeLossMap[~tots(edge)] = 0

code(after)

    numerator = ((probnb*gtind).sum(-3)*gtdf).sum(-3)
    numerator[tots(edge)] /= gtdf.sum(-3)[tots(edge)]
    numerator[tots(edge)] = -th.log(numerator[tots(edge)])
    otherSideEdgeLossMap = (numerator)
    otherSideEdgeLossMap[~tots(edge)] = 0

Both code(before) and code(after) produce exactly the same otherSideEdgeLossMap.
After otherSideEdgeLossMap[edge].mean().backward(), the simpler code(before) gives a gradient with lots of nan.

After many tries, code(after) gets the right gradient!

In my opinion it's a bug, because an indexing operation should completely isolate the gradients of the elements that are not indexed!
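A minimal, generic sketch of the "filter before the op" pattern described above (the tensor names are made up, not the project's code): apply the risky operation only to the safe elements, so no inf ever enters the graph.

import torch

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])

mask = div != 0
y = torch.zeros_like(x)        # masked-out entries stay 0 and never touch the division
y[mask] = x[mask] / div[mask]  # the division is only evaluated where div != 0

y.sum().backward()
print(x.grad)                  # tensor([0., 1.])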

pranerd commented Dec 23, 2017

@DIYer22 Did you release the complete code for this project on GitHub?

DIYer22 (Author) commented Dec 23, 2017

@pranerd I didn't; the project is still messy code right now.

pranerd commented Dec 24, 2017

@DIYer22 Will you release it after you finish it?

yawudede commented Jan 17, 2018

My solution, following the owner's advice above:

import torch
from torch.autograd import Variable

x = Variable(torch.FloatTensor([1., 1]), requires_grad=True)
div = Variable(torch.FloatTensor([0., 1]))
numerator = Variable(torch.FloatTensor([0., 1]))
one_mask = (div != 0)  # => one_mask is [0, 1]
numerator[one_mask] = x[one_mask] / div[one_mask]  # divide only where div != 0
y = numerator
zero_mask = (div == 0)  # => zero_mask is [1, 0]
y[zero_mask] = 0  # => y is [0, 1]
print(y.data)
loss = y.sum()
loss.backward()
print(x.grad)  # grad is [0, 1], as expected (print gradient dy/dx)

gwenniger added a commit to gwenniger/multi-hare that referenced this issue Jun 25, 2019
… the gradient.

This link was useful in finally getting the right insight:
pytorch/pytorch#4132

"Key to avoid nan

Your code should't generat any inf in forward, which often produce by torch.log(0) and x/[0, ]
That means 0 should be filtered before do torch.log(x) and x/div
"

As it turned out, it was not initially obvious that my forward code was producing infinities. However, printing the maximum values of tensors in the MDLSTM computation revealed a clear problem:

1. Earlier it was found that "output_gate_weighted_states_plus_input"
was the first place where nan gradients were observed. This was found by
registering a gradient-clamping backward hook on this variable in the function
"compute_multi_dimensional_lstm_one_direction" in multi_dimensional_lstm.py.
This backward hook checked for bad gradients (including nans) and raised
a runtime error once it found one (a sketch of such a hook is given after this list).

2. Second, I observed that in fact none of the predecessor gradients that
were computed before this gradient (for example, the gradient from the convolution layer
after the first MDLSTM layer) seemed to be problematic.
This seemed odd, as it was hard to imagine how the computation of the gradient
of a sigmoid function could lead to problems.

3. Adding information about the layer index and column number to the bad-gradient-identifying
hook revealed that the problem appeared in the last column of the
first MDLSTM layer, that is, at the start of the backward pass for that layer.

4. At the point where I was almost stuck on this bug, I decided to take the
problematic values in the forward computation itself more seriously. So I added
printouts of the maximum values of tensors. It then turned out that the
variable "output_gate_weighted_states_plus_input", and at the root of it
"new_memory_state", were growing unboundedly, with a growth rate of approximately 2
or less per column iteration in the MDLSTM forward computation, reaching values
greater than e.g. 1.4322e+10 after a while.
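A sketch of the kind of nan-detecting backward hook described in step 1 above (the function and variable names here are placeholders, not the actual multi-hare code):

import torch

def make_bad_gradient_hook(name):
    def hook(grad):
        # fires during the backward pass with the gradient w.r.t. the hooked tensor
        if torch.isnan(grad).any() or torch.isinf(grad).any():
            raise RuntimeError("bad gradient (nan/inf) flowing into " + name)
        return grad
    return hook

x = torch.tensor([1., 1.], requires_grad=True)
x.register_hook(make_bad_gradient_hook("x"))

y = x / torch.tensor([0., 1.])   # the forward pass already produces an inf
try:
    y[torch.tensor([False, True])].sum().backward()
except RuntimeError as e:
    print(e)                     # bad gradient (nan/inf) flowing into x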

I realized that new_memory_state would no longer be gated, but would be saved directly
for the next column iteration. Hence it seemed to make sense that the
old computation:

"
new_memory_state = input_and_input_gate_combined + \
                forget_gate_two_activation_multiplied_with_previous_memory_state + \
                forget_gate_one_activation_multiplied_with_previous_memory_state # + \
                # forget_gates_combined_activation_multiplied_with_previous_memory_state \
"

could indeed cause the memory state to grow over time, since it effectively replaces new_memory_state
with two functions of the old memory state in addition to the function of the input. Importantly,
both "forget_gate_two_activation_multiplied_with_previous_memory_state" and
"forget_gate_one_activation_multiplied_with_previous_memory_state" in the above formula have the
potential to become nearly as large as the memory state, as the activities of their gating components
(memory gate 1, memory gate 2) go to 1.

A simple fix that is added in this commit, and that indeed stabilizes the values, is:

"
            new_memory_state = input_and_input_gate_combined + \
                               0.5 * forget_gate_two_activation_multiplied_with_previous_memory_state + \
                               0.5 * forget_gate_one_activation_multiplied_with_previous_memory_state  # + \
"

This fix also removes the need to do gradient clamping during the backward computation; norm-based gradient
clipping can now be used instead, which is computationally cheaper and has nicer theoretical properties
(it preserves the direction of the original gradient, which value-based clamping/clipping cannot achieve).
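A minimal sketch of the two clipping styles contrasted above, using the standard torch.nn.utils helpers (the model here is just a stand-in):

import torch

model = torch.nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Norm-based clipping rescales the whole gradient vector, preserving its direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value-based clamping clips each element independently, which can change the direction:
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)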

Note, however, that another way to fix the problem would be to initialize the forget gates differently; they are
currently initialized to 1. While the latter may be advisable for a single-dimensional LSTM, it may lead to unnecessary
instability for MultiDimensionalLSTM.

TODO: Next I will look into whether the scaling of the components by 0.5 can be removed by changing the bias of the forget
gates instead.

	modified:   modules/block_strided_convolution.py
	modified:   modules/gradient_clamped_module.py
	modified:   modules/inside_model_gradient_clipping.py
	modified:   modules/mdlstm_layer_block_strided_convolution_layer_pair.py
	modified:   modules/multi_dimensional_lstm.py
	modified:   modules/multi_dimensional_lstm_layer_pair_stacking.py
	modified:   modules/multi_dimensional_lstm_parameters.py
	modified:   modules/multi_dimensional_rnn.py
	modified:   modules/network_to_softmax_network.py
	modified:   modules/parallel_multiple_state_weightings_computation.py
	modified:   modules/train_multi_dimensional_rnn_ctc.py
	modified:   modules/trainer.py
	modified:   util/tensor_utils.py

Rpersie commented Jul 15, 2019

If you add a small value, such as 1e-30, to the zero items, will this still happen?

@zou3519 added the triaged and module: docs labels on Jul 16, 2019

zou3519 (Contributor) commented Jul 16, 2019

I think this theme (getting NaNs) comes up a lot. Maybe we should have some documentation around NaN best practices, if we don't have that already.

Rpersie commented Jul 18, 2019

I think this theme (getting NaNs) comes up a lot. Maybe we should have some documentation around NaN best practices, if we don't have that already.

If I add an eps=1e-15, will the nan problem caused by x/0 be avoided? Thank you!

chengxuanying commented:

This is actually not a bug: 1/0 = inf. The easy workaround is to use an eps instead of 0.
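A minimal sketch of that eps workaround; it does avoid nan, but note that the gradient at the former-zero position becomes 1/eps rather than the 0 the original poster expected:

import torch

eps = 1e-15
x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])

y = x / (div + eps)
y.sum().backward()
print(x.grad)  # tensor([1.0000e+15, 1.0000e+00]) -- finite, but huge at the former-zero position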

saluto commented Jun 3, 2020

It is a bug. The fact that in PyTorch 1/0 = inf (which is, by the way, mathematically wrong) has nothing to do with the fact that y is independent of x[0]; therefore y's gradient w.r.t. x[0] should be 0. Masking does not solve the bug, it only avoids triggering it if done right. It is a pain, inefficient, and should not be necessary. Please, dear PyTorch team, fix this. Thank you!


Another example:

import torch


a = torch.tensor([1.], requires_grad=True)
b = a ** 2
b.backward()
# `a.grad` is `2` as expected
print(a.grad)

a = torch.tensor([1.], requires_grad=True)
b = a ** 2
b = b.clone()
b[0] = 0
b.backward()
# `a.grad` is `0` as expected
print(a.grad)


a = torch.tensor([0.], requires_grad=True)
b = a.sqrt()
b.backward()
# `a.grad` is `inf` as expected
print(a.grad)

a = torch.tensor([0.], requires_grad=True)
b = a.sqrt()
b = b.clone()
b[0] = 0
b.backward()
# `a.grad` is `nan` but expected `0`
print(a.grad)
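For reference, a common workaround for the sqrt example above (a sketch of the masking pattern, not a fix to autograd itself): substitute a safe input value before the op, so the inf never appears in the forward graph and cannot be multiplied into the gradient.

import torch

a = torch.tensor([0.], requires_grad=True)
mask = a > 0
safe_a = torch.where(mask, a, torch.ones_like(a))  # 1. is an arbitrary safe placeholder
b = torch.where(mask, safe_a.sqrt(), torch.zeros_like(a))
b.backward()
print(a.grad)  # tensor([0.]) -- the masked-out element no longer poisons the gradient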

@heitorschueroff added the module: NaNs and Infs label on Oct 19, 2020

ezyang (Contributor) commented Aug 5, 2021

This still reproduces on master

ppwwyyxx (Contributor) commented Aug 5, 2021

The repro at the top looks the same as #15506 and #12986. It would be difficult to solve properly unless there is a way for autograd to distinguish "zero gradient" from "no gradient" on a per-element basis.

zou3519 (Contributor) commented Aug 9, 2021

At the triage review meeting, we decided: this is expected behavior (for now at least). It is a legitimate request to distinguish "zero gradient" and "no gradient". However, we may need a first-class concept of masking in PyTorch to accomplish this, so this won't happen anytime soon.

@zou3519 added the module: autograd label on Aug 9, 2021

george-qi (Contributor) commented:
Hi all, just wanted to surface some attention for MaskedTensor, a tensor extension library whose prototype we just launched alongside PyTorch 1.11. It should help solve many existing problems related to masked semantics, such as distinguishing between 0 and undefined gradients.

e.g. for this example:

# assumes the MaskedTensor prototype is installed; in later releases it is
# available as: from torch.masked import as_masked_tensor
import torch

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x / div  # => y is [inf, 1]

mask = (div != 0)  # => mask is [0, 1]
loss = as_masked_tensor(y, mask)
loss = loss.sum()
loss.backward()

x.grad
masked_tensor(
  [      --,   1.0000]
)

For more details, you can check out the website or GitHub. We're actively looking for new users, so if you're interested in using it or have any suggestions on things we should build out, please reach out to me or @cpuhrsch. Thanks!
