RuntimeError: Function 'PowBackward0' returned nan values in its 0th output. #38
Comments
Maybe this can be fixed by adding eps here?
Have the same issue here, any solutions?
eps means epsilon (ε), i.e. a very small value such as 0.0000001.
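For context, PowBackward0 returns nan when the gradient of a fractional power is evaluated at zero (for example the ** 0.5 used to compute a vector norm), and adding a small eps under the power keeps the gradient finite. Below is a minimal stand-alone PyTorch sketch of the failure and the fix; it is a generic illustration, not the actual line in this repository that the comment above points at:

import torch

x = torch.zeros(3, requires_grad=True)
# d/ds of s ** 0.5 is 0.5 * s ** -0.5, which is inf at s = 0; chained with ds/dx = 2x = 0 it gives nan
bad = (x ** 2).sum() ** 0.5
bad.backward()
print(x.grad)    # tensor([nan, nan, nan])

x.grad = None
eps = 1e-7
good = ((x ** 2).sum() + eps) ** 0.5    # eps keeps PowBackward0 finite
good.backward()
print(x.grad)    # tensor([0., 0., 0.])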
Do you have code to reproduce the error?
I saw m1kit was testing his own dataset, and my error popped up when I was training my own dataset. I used Colmap to get the camera position info, and training can run for about 20k iterations, but it stops randomly at some point with RuntimeError: Function 'PowBackward0' returned nan values in its 0th output. I tried the fern sample set and got no error at all (sometimes it hits GPU OOM, but no error once I reduced the settings). I did not change much in the code except adding DataParallel to use all four GPUs at the same time. I'm just wondering what m1kit did and am waiting for his response.
Unfortunately, for personal reasons, I cannot provide the dataset that caused this error. To be honest, it was 4 months ago, so it's hard to remember how to reproduce it in detail. I apologize for not being able to help you.
Hello, I encountered the same problem when using SCNeRF, which borrows heavily from this repository, to train on custom data.

Data
The data can be accessed through this Google Drive link: https://drive.google.com/drive/folders/1SUzKMn6oD4inzN-m7RmHVl7gGEnq-Iv4?usp=sharing

Logs

Launch script
cd NeRF
python run_nerf.py \
--config configs/llff_data/lamp.txt \
--expname lamp \
--chunk 8192 \
--N_rand 1024 \
--camera_model pinhole_rot_noise_10k_rayo_rayd \
--ray_loss_type proj_ray_dist \
--multiplicative_noise True \
--i_ray_dist_loss 10 \
--grid_size 10 \
--ray_dist_loss_weight 0.0001 \
--N_iters 800001 \
--use_custom_optim True \
--ray_o_noise_scale 1e-3 \
--ray_d_noise_scale 1e-3 \
--non_linear_weight_decay 0.1 \
--add_ie 200000 \
--add_od 400000 \
--add_prd 600000

Config
Note: make sure to change the configs/llff_data/lamp.txt
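For reference, the configs/llff_data/*.txt files in NeRF-style repositories are plain key = value files; the exact contents of lamp.txt are not shown above, and the fields that usually have to be edited for a custom scene are the data path and the downsample factor. All values below are illustrative placeholders, not the real config:

expname = lamp
basedir = ./logs
datadir = ./data/lamp
dataset_type = llff
factor = 4
llffhold = 8
N_rand = 1024
N_samples = 64
N_importance = 64
use_viewdirs = True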
I can confirm this problem is happening to me on https://github.com/apchenstu/mvsnerf, trying it out with either the lego synthetic dataset or the orchid llff dataset. I'll try to see how to make this reproducible.
Hi @AugustasMacijauskas, did you have any success training with your custom dataset?
@davodogster No, I lost my patience and moved on to other things. I was also having a hard time figuring out how to debug this efficiently, since training for a few hours before it crashes, then changing one line of code and seeing if that helps, is not going to work.
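For anyone hitting the same wall: the "Function 'PowBackward0' returned nan" message itself comes from PyTorch's autograd anomaly mode, so one cheaper way to debug is to keep anomaly detection on and dump the offending batch the moment anything goes non-finite, rather than restarting multi-hour runs. A self-contained sketch with a stand-in model (none of this is code from the repository; the linear layer and random batch are placeholders):

import torch
import torch.nn as nn

# anomaly mode produces the 'PowBackward0 returned nan' stack trace; it is slow,
# so enable it only while hunting the bug
torch.autograd.set_detect_anomaly(True)

model = nn.Linear(3, 1)                               # stand-in for the NeRF MLP
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for step in range(1000):
    batch = torch.randn(1024, 3)                      # stand-in for a batch of rays
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()

    # stop and save state as soon as anything goes non-finite,
    # instead of noticing hours later when the run crashes
    bad_grad = any(p.grad is not None and not torch.isfinite(p.grad).all()
                   for p in model.parameters())
    if not torch.isfinite(loss) or bad_grad:
        torch.save({"step": step, "batch": batch, "model": model.state_dict()}, "nan_dump.pt")
        raise RuntimeError(f"non-finite loss/grad at step {step}")

    optimizer.step()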
If it is an error in the 0th output, that means your weights are still not fully updated, so some values in some batch's predictions during your first epoch are nans. So it's not your inputs, but your model predictions, that are nans. It could be an overflow or underflow error. This will make any loss function give you a nan as well. One way to guard against it:

criterion = SomeLossFunc()
eps = 1e-6
loss = criterion(preds, targets)
if torch.isnan(loss):
    # replace a nan loss with a tiny constant instead of backpropagating nans
    loss = torch.tensor(eps, device=loss.device, requires_grad=True)
loss = loss + L1_loss + ...
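A related workaround is to skip the optimizer step entirely whenever the loss or any gradient is non-finite, so a single bad batch does not corrupt the weights. This is a generic PyTorch sketch, not code from this repository; the linear model, criterion, and random data are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)                               # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.MSELoss()

preds, targets = model(torch.randn(8, 3)), torch.randn(8, 1)
loss = criterion(preds, targets)
optimizer.zero_grad()
loss.backward()

# only update when the loss and every gradient are finite
grads_ok = all(p.grad is None or torch.isfinite(p.grad).all() for p in model.parameters())
if torch.isfinite(loss) and grads_ok:
    optimizer.step()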
During training, the following error occurs and training is interrupted.
Here's my configuration.