
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output. #38

Open
m1kit opened this issue May 19, 2021 · 12 comments

@m1kit

m1kit commented May 19, 2021

During training, the following error occurs and training is interrupted.

[TRAIN] Iter: 40300 Loss: 0.011321269907057285  PSNR: 23.059185028076172
 11%|██▎       | 20356/180001 [1:25:30<11:07:17,  3.99it/s][W python_anomaly_mode.cpp:104] Warning: Error detected in PowBackward0. Traceback of forward call that caused the error:
  File "run_nerf.py", line 858, in <module>
    train()
  File "run_nerf.py", line 751, in train
    img_loss0 = img2mse(extras['rgb0'], target_s)
  File "/app/nerf/run_nerf_helpers.py", line 12, in <lambda>
    img2mse = lambda x, y : torch.mean((x - y) ** 2)
 (function _print_stack)
 11%|██▎       | 20356/180001 [1:25:30<11:10:36,  3.97it/s]
Traceback (most recent call last):
  File "run_nerf.py", line 858, in <module>
    train()
  File "run_nerf.py", line 755, in train
    loss.backward()
  File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

Here's my configuration.

expname = mydata_test
basedir = ./logs
datadir = ./data/nerf_llff_data/mydata
dataset_type = llff

factor = 8
llffhold = 8

N_rand = 1024
N_samples = 64
N_importance = 64

use_viewdirs = True
raw_noise_std = 1e0
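
The "Error detected in PowBackward0" warning above comes from PyTorch's autograd anomaly detection. A minimal sketch for checking whether the NaN already exists in the forward pass, using the variable names from the traceback; the assert_finite helper is made up for illustration:

import torch

# Makes autograd report which forward op produced the NaN (slow; debug only).
torch.autograd.set_detect_anomaly(True)

def assert_finite(name, t):
    # Fail fast, naming the tensor, if it already contains NaN/Inf.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"{name} contains NaN/Inf")

# Usage just before the loss in the training loop (names from the traceback above):
#   assert_finite("rgb0", extras['rgb0'])
#   assert_finite("target_s", target_s)
#   img_loss0 = img2mse(extras['rgb0'], target_s)
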
@m1kit
Author

m1kit commented May 19, 2021

Maybe this can be fixed by adding eps here?

@xiaohulihutu

Have the same issue here. Any solutions?

@xiaohulihutu

> Maybe this can be fixed by adding eps here?

Sorry sir, what is eps?

@m1kit
Author

m1kit commented Sep 23, 2021

eps means epsilon (ε). It is a very small value, like 0.0000001.
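
In this context, "adding an eps" usually means keeping the argument of an operation whose gradient blows up near zero (square roots, divisions, fractional powers) away from zero. A generic sketch of the pattern, not tied to any specific line of this repository:

import torch

eps = 1e-8  # a small constant; the exact value is a judgment call

x = torch.rand(4, 3, requires_grad=True)

# d/dx sqrt(x) = 0.5 / sqrt(x), which becomes inf/NaN as x -> 0,
# so clamp the argument away from zero before taking the root.
safe_norm = torch.sqrt(torch.clamp((x ** 2).sum(-1), min=eps))

# Same idea for a division whose denominator can hit zero.
safe_dirs = x / (x.norm(dim=-1, keepdim=True) + eps)

safe_norm.sum().backward()  # gradients stay finite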

@yenchenlin
Owner

Do you have code to reproduce the error?

@xiaohulihutu

I saw that m1kit was testing his own dataset, and my error popped up while training on my own dataset as well. I used COLMAP to get the camera position info. Training can run for about 20k iterations, but it stops randomly at some point with RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.

I tried the fern sample set and got no error at all (sometimes it runs out of GPU memory, but there is no error once I reduce the settings). I did not change much in the code except adding DataParallel to use all four GPUs at the same time.

I am just wondering what m1kit did, and I am waiting for his response.

@m1kit
Author

m1kit commented Sep 24, 2021

Unfortunately, for personal reasons, I cannot provide the dataset that caused this error. To be honest, it was 4 months ago, so it's hard to remember how to reproduce it in detail. I apologize for not being able to help you.

@AugustasMacijauskas

Hello, I encountered the same problem when using SCNeRF, which borrows heavily from this repository, to train on custom data.

Data

The data can be accessed through this Google Drive link: https://drive.google.com/drive/folders/1SUzKMn6oD4inzN-m7RmHVl7gGEnq-Iv4?usp=sharing

Logs

[TRAIN] Iter: 209100 Loss: 0.006338230334222317  PSNR: 25.197158813476562
[TRAIN] Iter: 209200 Loss: 0.007395393215119839  PSNR: 24.48368263244629
[TRAIN] Iter: 209300 Loss: 0.007888318039476871  PSNR: 24.342876434326172
[TRAIN] Iter: 209400 Loss: 0.00826267059892416  PSNR: 24.05372428894043
[TRAIN] Iter: 209500 Loss: 0.0067442795261740685  PSNR: 24.944828033447266
Starts Validation Rendering
VAL PSNR 144: 22.382625579833984
Validation PRD : 0.4792793095111847
  File "run_nerf.py", line 1052, in <module>
    train()
  File "run_nerf.py", line 506, in train
    train_loss_0 = img2mse(extras['rgb0'], target_s)
  File "/home/julius_m/code/SCNeRF/NeRF/run_nerf_helpers.py", line 10, in <lambda>
    img2mse = lambda x, y : torch.mean((x - y) ** 2)
 (function _print_stack)
 26%|██████████████████████████████████████████▋                                                                                                                        | 209573/800000 [8:00:36<22:34:01,  7.27it/s]
Traceback (most recent call last):
  File "run_nerf.py", line 1052, in <module>
    train()
  File "run_nerf.py", line 606, in train
    train_loss.backward()
  File "/home/julius_m/miniconda3/envs/icn/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/julius_m/miniconda3/envs/icn/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function 'PowBackward0' returned nan values in its 0th output.
! [Numerical Error] rgb_map contains nan or inf.
! [Numerical Error] disp_map contains nan or inf.
! [Numerical Error] acc_map contains nan or inf.
! [Numerical Error] raw contains nan or inf.
! [Numerical Error] rgb0 contains nan or inf.
! [Numerical Error] disp0 contains nan or inf.
! [Numerical Error] acc0 contains nan or inf.
! [Numerical Error] z_std contains nan or inf.

Launch script

cd NeRF

python run_nerf.py \
    --config configs/llff_data/lamp.txt \
    --expname lamp \
    --chunk 8192 \
    --N_rand 1024 \
    --camera_model pinhole_rot_noise_10k_rayo_rayd \
    --ray_loss_type proj_ray_dist \
    --multiplicative_noise True \
    --i_ray_dist_loss 10 \
    --grid_size 10 \
    --ray_dist_loss_weight 0.0001 \
    --N_iters 800001 \
    --use_custom_optim True \
    --ray_o_noise_scale 1e-3 \
    --ray_d_noise_scale 1e-3 \
    --non_linear_weight_decay 0.1 \
    --add_ie 200000 \
    --add_od 400000 \
    --add_prd 600000

Config

Note: make sure to change the datadir to where you downloaded the above data.

configs/llff_data/lamp.txt

expname = lamp
basedir = ./logs
datadir = <path_to_lamp_dir>/lamp
dataset_type = llff

factor = 8
llffhold = 8

N_rand = 1024
N_samples = 64
N_importance = 64

use_viewdirs = True
raw_noise_std = 1e0

@cduguet

cduguet commented Nov 8, 2021

I can confirm this problem is happening to me on https://github.com/apchenstu/mvsnerf, trying it out with either the lego synthetic dataset or the orchid llff dataset.

I'll try to see how to make this reproducible.

@davodogster

davodogster commented Jan 19, 2022

> Hello, I encountered the same problem when using SCNeRF, which borrows heavily from this repository, to train on custom data.
>
> Data
>
> The data can be accessed through this Google Drive link: https://drive.google.com/drive/folders/1SUzKMn6oD4inzN-m7RmHVl7gGEnq-Iv4?usp=sharing

Hi @AugustasMacijauskas, did you have any success training with your custom dataset?

@AugustasMacijauskas

@davodogster No, I lost my patience and moved on to other things. I was also having a hard time figuring out how to debug this efficiently, since training for a few hours until it crashes, changing one line of code, and then waiting to see whether that helps is not a workable loop.
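
One way to shorten that feedback loop is to dump the offending state as soon as a non-finite loss shows up, so the failing step can be replayed immediately instead of after hours of retraining. A rough sketch; the model, batch, and path arguments are placeholders, not SCNeRF or nerf-pytorch API:

import torch

def dump_nan_repro(loss, model, batch, path="nan_repro.pt"):
    # Save the current weights and the batch that produced a non-finite loss,
    # then stop, so the failing step can be re-run in isolation.
    if not torch.isfinite(loss):
        torch.save({"model": model.state_dict(), "batch": batch}, path)
        raise RuntimeError(f"Non-finite loss, repro saved to {path}")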

@snknitin

snknitin commented Nov 1, 2022

If it is an error in the 0th output, that means your weights are not fully updated yet, so some values in some batch's predictions during your first epoch are NaNs. So it is not your inputs but your model predictions that are NaNs. It could be an overflow or underflow error. This will make any loss function give you tensor(nan). What you can do is put in a check for when the loss is NaN and let the weights adjust themselves:

import torch

criterion = SomeLossFunc()
eps = 1e-6

loss = criterion(preds, targets)
# Replace a NaN loss with a tiny constant tensor so the backward pass
# does not push NaNs into the weights for this batch.
if torch.isnan(loss):
    loss = torch.full_like(loss, eps)
loss = loss + L1_loss  # + ... any other loss terms
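
An alternative to overwriting the loss value is to skip the optimizer step entirely for the offending batch. A minimal sketch, assuming a standard PyTorch training loop (the helper and its arguments are placeholders):

import torch

def safe_step(loss, optimizer):
    # Backpropagate and step only when the loss is finite;
    # otherwise drop the batch so NaNs never reach the weights.
    if not torch.isfinite(loss):
        optimizer.zero_grad()
        return False  # caller can count/log skipped batches
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return True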
