x.grad should be 0 but get NaN after x/0 #4132
Comments
I think the reason why this is happening is that the backwards pass for indexing returns a zero gradient for the unselected elements, and the division backward then multiplies that zero by 1/0 = inf, giving 0 * inf = NaN.
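A quick illustration of how those two facts combine (my example, not from the thread):

import torch

# Indexing backward fills the unselected positions with zero gradient:
y = torch.arange(3.0, requires_grad=True)
y[1].backward()
print(y.grad)  # tensor([0., 1., 0.])

# The division backward multiplies the incoming gradient by 1/div, so for a
# zero denominator the zero gradient meets inf, and 0 * inf = nan:
print(torch.tensor(0.0) * torch.tensor(float("inf")))  # tensor(nan)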
@zou3519 I agree with that reason. Key to avoid NaN: your code shouldn't generate any inf in the forward pass, which is often produced by torch.log(0) and x/0. That means zeros should be filtered out before doing torch.log(x) or x/div.
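One way to follow that advice (my sketch, not code from the thread): make the denominator safe before dividing, so no inf is ever created in the forward pass. Masking the result only after the division is not enough, because the inf stays in the graph:

import torch

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
mask = div != 0

# Replace zero denominators *before* dividing, then zero out invalid entries.
safe_div = torch.where(mask, div, torch.ones_like(div))
y = torch.where(mask, x / safe_div, torch.zeros_like(x))
y.sum().backward()
print(x.grad)  # tensor([0., 1.]) -- no NaN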
@DIYer22 Did you release the complete code for this project on GitHub?
@pranerd I didn't; the project is all dirty code right now.
@DIYer22 Will you release it after you finish it?
My solution, beyond the owner's advice:
… the gradient. This link was useful in finally getting the right insight (pytorch/pytorch#4132): "Key to avoid NaN: your code shouldn't generate any inf in forward, which is often produced by torch.log(0) and x/0. That means 0 should be filtered out before doing torch.log(x) and x/div."

As it turned out, it was not clear that my forward code was causing infinity. However, putting some prints on the maximum values of tensors in the MDLSTM computation revealed a clear problem:

1. Earlier it was revealed that "output_gate_weighted_states_plus_input" was the first place where NaN gradients were observed. This was found by registering a gradient-clamping backward hook on this variable in the function "compute_multi_dimensional_lstm_one_direction" in multi_dimensional_lstm.py. This backward hook checked for bad gradients (including NaNs) and raised a runtime error once it found one.

2. Second, I observed that in fact none of the predecessor gradients computed before this gradient (the gradient from the convolution layer after the first MDLSTM layer, for example) seemed to be problematic. This seemed odd, as it was hard to imagine how the computation of the gradient of a sigmoid function could lead to problems.

3. Adding information about the layer index and column number to the bad-gradient-identifying hook revealed that the problem appeared in the last column of the first MDLSTM layer, that is, at the start of the backward pass for that layer.

4. At the point where I was almost stuck with this bug, I decided to take the possibility of problematic values in the forward computation itself more seriously, so I added printouts of the maximum values of tensors. It turned out that the variables "output_gate_weighted_states_plus_input" and, at the root of that, "new_memory_state" were growing unboundedly, with a growth rate of approximately 2 or less per column iteration in the MDLSTM forward computation, reaching values greater than e.g. 1.4322e+10 after a while.

I realized that new_memory_state would not be gated anymore, but directly saved for the next column iteration. Hence it seemed to make sense that the old computation

    new_memory_state = input_and_input_gate_combined + \
        forget_gate_two_activation_multiplied_with_previous_memory_state + \
        forget_gate_one_activation_multiplied_with_previous_memory_state
    # + forget_gates_combined_activation_multiplied_with_previous_memory_state

could indeed cause the memory state to grow over time, since it effectively replaced new_memory_state with two functions of the old memory state in addition to the function of the input. Importantly, both "forget_gate_two_activation_multiplied_with_previous_memory_state" and "forget_gate_one_activation_multiplied_with_previous_memory_state" in the above formula have the potential to become nearly as large as the memory state itself, as the activities of their gating components (forget gate 1, forget gate 2) go to 1.

A simple fix that is added in this commit, and that indeed stabilizes the values, is:

    new_memory_state = input_and_input_gate_combined + \
        0.5 * forget_gate_two_activation_multiplied_with_previous_memory_state + \
        0.5 * forget_gate_one_activation_multiplied_with_previous_memory_state

This fix also removes the need to do gradient clamping during the backward computation; norm-based gradient clipping now suffices, which is computationally cheaper and has nicer theoretical properties (it preserves the direction of the original gradient, which value-based clamping/clipping cannot achieve).
Note, however, that another way to fix the problem would be to initialize the forget gates differently; they are currently initialized to 1. While the latter may be advisable for a single-dimensional LSTM, it may lead to unnecessary instability for the MultiDimensionalLSTM. TODO: Next I will look at whether the scaling of the components by 0.5 can be gotten rid of by changing the bias of the forget gates instead.

modified: modules/block_strided_convolution.py
modified: modules/gradient_clamped_module.py
modified: modules/inside_model_gradient_clipping.py
modified: modules/mdlstm_layer_block_strided_convolution_layer_pair.py
modified: modules/multi_dimensional_lstm.py
modified: modules/multi_dimensional_lstm_layer_pair_stacking.py
modified: modules/multi_dimensional_lstm_parameters.py
modified: modules/multi_dimensional_rnn.py
modified: modules/network_to_softmax_network.py
modified: modules/parallel_multiple_state_weightings_computation.py
modified: modules/train_multi_dimensional_rnn_ctc.py
modified: modules/trainer.py
modified: util/tensor_utils.py
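To see why the unscaled update grows without bound and why the 0.5 scaling stabilizes it, here is a small numeric sketch of the recurrence (my illustration, not code from the commit). With both forget-gate activations near 1, the update c_t = i_t + f1 * c_{t-1} + f2 * c_{t-1} has an effective recurrence factor near 2, while scaling each term by 0.5 keeps the factor at most 1:

# Unscaled update: two nearly-ungated copies of the previous memory state.
c = 1.0
for _ in range(40):
    c = 1.0 + 0.95 * c + 0.95 * c  # both forget-gate activations near 1
print(c)  # about 1.9**40, on the order of 1e11: unbounded geometric growth

# With the 0.5 scaling from the fix, the recurrence factor stays <= 1:
c = 1.0
for _ in range(40):
    c = 1.0 + 0.5 * 0.95 * c + 0.5 * 0.95 * c
print(c)  # converges towards 1 / (1 - 0.95) = 20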
If you add a small value, such as 1e-30, to the zero items, will this still happen?
I think this theme (getting NaNs) comes up a lot. Maybe we should have some documentation around NaN best practices, if we don't have that already.
If I add an eps=1e-15, will this NaN problem caused by x/0 be avoided? Thank you!
This is actually not a bug: 1/0 = inf. The easy solution is to use an eps instead of 0.
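A sketch of that eps workaround (my example, not from the thread). It avoids the NaN here because 0 / eps is exactly 0; the caveat is that if y[0] itself entered the loss, the 1/eps factor would blow its gradient up to 1e15 instead of giving 0:

import torch

eps = 1e-15
x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x / (div + eps)  # no inf is created in the forward pass
y[1].backward()
print(x.grad)  # tensor([0., 1.]) -- the NaN is gone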
It is a bug. Another example: import torch
a = torch.tensor([1.], requires_grad=True)
b = a ** 2
b.backward()
# `a.grad` is `2` as expected
print(a.grad)
a = torch.tensor([1.], requires_grad=True)
b = a ** 2
b = b.clone()
b[0] = 0
b.backward()
# `a.grad` is `0` as expected
print(a.grad)
a = torch.tensor([0.], requires_grad=True)
b = a.sqrt()
b.backward()
# `a.grad` is `inf` as expected
print(a.grad)
a = torch.tensor([0.], requires_grad=True)
b = a.sqrt()
b = b.clone()
b[0] = 0
b.backward()
# `a.grad` is `nan` but expected `0`
print(a.grad)
This still reproduces on master.
At the triage review meeting, we decided: this is expected behavior (for now at least). It is a legitimate request to distinguish "zero gradient" and "no gradient". However, we may need a first-class concept of masking in PyTorch to accomplish this, so this won't happen anytime soon. |
Hi all, just wanted to draw some attention to MaskedTensor, a tensor extension library whose prototype we just launched alongside PyTorch 1.11. It should help solve many existing problems related to masked semantics, such as distinguishing between 0 and undefined gradients. E.g. for this example:

import torch
# Assumed import: in the 1.11-era prototype this comes from the standalone
# maskedtensor package; in later PyTorch releases it lives in torch.masked.
from maskedtensor import as_masked_tensor

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x/div # => y is [inf, 1]
mask = (div != 0) # => mask is [0, 1]
loss = as_masked_tensor(y, mask)
loss = loss.sum()
loss.backward()
x.grad
masked_tensor(
  [      --,   1.0000]
)

For more details, you can check out the website or GitHub. We're actively looking for new users, so if you're interested in using it or have any suggestions on things we should build out, please reach out to me or @cpuhrsch. Thanks!!
Reproduction BUG code:
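The snippet itself did not survive; the following is a minimal sketch reconstructed from the description below and the MaskedTensor comment above:

import torch

x = torch.tensor([1., 1.], requires_grad=True)
div = torch.tensor([0., 1.])
y = x / div    # y is [inf, 1]
loss = y[1]    # loss depends only on x[1]
loss.backward()
print(x.grad)  # tensor([nan, 1.]) -- expected tensor([0., 1.])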
The computational graph of loss does not include x[0]. So the gradient of x[0] should be 0, but it gets NaN.
A simpler reproduction:
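The original snippet is also missing here; a plausible minimal sketch in the same spirit:

import torch

x = torch.tensor([1.], requires_grad=True)
y = x / 0.0           # y is inf
(y * 0.0).backward()  # d(y*0)/dy is 0, and 0 * inf = nan in the backward
print(x.grad)         # tensor([nan]) -- expected tensor([0.])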
Versions:
cc @svekars @holly1238 @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @brianjo @mruberry @gchanan @bdhirsh @jbschlosser @anjali411 @jlin27