
NaN in training #17

Closed
chky1997 opened this issue Jan 17, 2023 · 8 comments
@chky1997

Hi, when I train on my own dataset, the error below occurs:
[screenshot of the error, removed]
I set 'shuffle' to False to check whether particular images in my dataset cause this error, but it still occurs randomly (mostly in the first epoch, though it once occurred in the second epoch while the first epoch seemed fine).
Do you have any idea? Thank you for your help!

@haotongl
Member

Hi, I have encountered NaN during training in the following situations; each can be avoided with the corresponding fix:

  1. PyTorch 1.7 or 1.8 may have NaN bugs. Using 1.9.0 or 1.6.0 has always solved the problem.
  2. Change this line:
    std = var.sqrt()
    to:
    std = torch.clamp_min(var, 1e-8).sqrt()
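
To make the second fix concrete, here is a self-contained illustration of why the clamp matters (the variable name mirrors the snippet above; this is a sketch, not the repo's actual code): the gradient of sqrt diverges as the variance approaches zero, so clamping the variance to a small positive floor keeps both the value and the gradient finite.

```python
import torch

var = torch.zeros(4, requires_grad=True)   # degenerate case: zero variance

# std = var.sqrt()                         # original line: gradient of sqrt at 0 is inf,
#                                          # which then turns into NaNs during training
std = torch.clamp_min(var, 1e-8).sqrt()    # suggested fix: finite value and gradient
std.sum().backward()
print(var.grad)                            # tensor([0., 0., 0., 0.]), no inf/NaN
```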

@chky1997
Author

Solved! Thank you for your advice!

chky1997 reopened this Jan 18, 2023
@chky1997
Author

chky1997 commented Jan 23, 2023

Hi, I ran into another problem during training.

  1. The network does not seem to converge.
    At the beginning of training, the PSNR on the training set is only about 10 and ends up at about 11 after epoch 0. With the ZJU-MoCap data, the PSNR is above 20 at first and quickly rises to about 25 after epoch 0.
  2. The PSNR on the test set is much higher than on the training set.
    After several epochs, the PSNR on the training set remains at 11-12, while the PSNR on the test set increases to about 22.
    [screenshot of the training log, removed]
    I wonder whether I should adjust the config or whether something is wrong with my dataset (I followed the EasyMocap tutorial). Thank you for your help!
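
For reference on the numbers being compared, here is a minimal PSNR sketch for images scaled to [0, 1] (an assumed convention, not code from this repo):

```python
import torch

def psnr(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # PSNR = -10 * log10(MSE) for a peak value of 1.0
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse)
```

A PSNR of 10-11 corresponds to an MSE of roughly 0.08-0.1, i.e. renders that are still far from the ground truth, which is consistent with the non-convergence described above.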

@haotongl
Member

haotongl commented Jan 23, 2023

Hi, what kind of data did you use? Is it similar to ZJU-MoCap (a multi-view dynamic dataset that includes mask information)?
Could you provide more information so I can locate the problem?

@chky1997
Author

chky1997 commented Jan 24, 2023

I use 4 cameras to record and extract images (1280x720, padded to 1280x1280 and resized to 1024x1024).

I also use a segmentation model to get the mask information.

I set the train input views in the yaml and the length of source views in enerf.py to avoid an out-of-range error.

[screenshots of the setup, removed]

At first I thought the calibration from EasyMocap was not correct, but I double-checked it with MATLAB and the two results are close.
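
For completeness, a minimal sketch of the padding/resizing described above together with the matching intrinsics update (the bottom/right padding and the function name are my assumptions, not code from this repo or EasyMocap):

```python
import cv2
import numpy as np

def pad_and_resize(img: np.ndarray, K: np.ndarray, out_size: int = 1024):
    """Pad a 1280x720 image to a 1280x1280 square, resize to out_size, update K."""
    h, w = img.shape[:2]
    side = max(h, w)                                    # 1280
    padded = np.zeros((side, side, 3), dtype=img.dtype)
    padded[:h, :w] = img                                # assumption: pad at bottom/right
    resized = cv2.resize(padded, (out_size, out_size))
    K_new = K.astype(np.float64)
    K_new[:2] *= out_size / side                        # rescale fx, fy, cx, cy
    return resized, K_new
```

If the intrinsics (and masks) are not padded and rescaled consistently with the images, the rays will not line up with the pixels, which by itself can keep the PSNR stuck around 10-11.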

@haotongl
Member

Try using more source views (one input view is not enough):
train_input_views: [3, 4]
train_input_views_prob: [0.5, 0.5]
test_input_views: 3 (or 4)

I have some other suggestions:

  1. Try rendering a test image from the remaining 3 views instead of using the GUI to render a novel view. This helps avoid extrapolation, which image-based rendering is not good at.
  2. Use all 4 views as training views: change these two lines to [0,-1,1]: https://github.com/zju3dv/ENeRF/blob/master/configs/enerf/zjumocap/zjumocap_train.yaml#L28

@chky1997
Author

chky1997 commented Jan 24, 2023

I followed your advice and tried it, but the result looks much the same; both the PSNR and the visualization are still bad.

  1. When I use the same yaml to train on both ZJU-MoCap (using only 4 cameras' data) and my own dataset, the ZJU-MoCap result looks good: after epoch 0 its PSNR rises from 20 to 25, while on my dataset it starts at 10 and only reaches 13. Since the same initial network and training strategy give such different results, it has to be my data or my calibration that causes the difference, right?

  2. I also checked the depth_range computed from the 3D bbox. In CoreView_313 it usually ranges from 2 to 4.5, but for my dataset it ranges from 0.1 to 0.5, and sometimes the values are negative. Is that a problem? (See the sketch below.)
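
A minimal sketch of how such a per-view depth range can be obtained from a 3D bounding box (this shows the general idea only; the function and variable names are mine, not necessarily ENeRF's exact code):

```python
import numpy as np

def depth_range_from_bbox(corners_world: np.ndarray, R: np.ndarray, T: np.ndarray):
    """corners_world: (8, 3) bbox corners; R: (3, 3), T: (3,) world-to-camera."""
    z = (corners_world @ R.T + T)[:, 2]     # depth of each corner in the camera frame
    return float(z.min()), float(z.max())   # (near, far)
```

A negative near value means part of the bbox lies behind the camera, which usually points to wrong extrinsics or an inconsistent world scale rather than a config issue.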

Thank you so much for your help these days!

@chky1997
Author

As far as I can tell, it was the normalization of the extrinsics that caused the problem.
[screenshot of the modification, removed]
After the modification, everything looks good.
I have deleted the images above for privacy.
Thank you for your help and your great work!
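
(The screenshot with the exact modification was removed, so the following is only a hypothetical illustration of what rescaling/normalizing the extrinsics can involve; the function name and the uniform-scale assumption are mine, not the author's actual fix.)

```python
import numpy as np

# Hypothetical sketch only: if the calibration is defined up to scale, multiplying the
# camera translations and the 3D bbox by a common factor moves the per-view depth
# range into a sensible interval (e.g. roughly 2-4.5, as in ZJU-MoCap).
def rescale_scene(Ts: np.ndarray, bbox: np.ndarray, scale: float):
    """Ts: (N, 3) world-to-camera translations; bbox: (2, 3) min/max corners."""
    return Ts * scale, bbox * scale
```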
