Training is stuck for pretraining #3

Closed · mu-cai opened this issue Jul 1, 2022 · 8 comments

Comments

mu-cai commented Jul 1, 2022

I found an error around line 200 in transforms.py: for some samples the loop never exits, which makes training hang. In those cases len_indexes = sum_indexes = 0.
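To illustrate, here is a minimal sketch of the rejection-sampling pattern I am describing (not the exact code from transforms.py; the threshold and the crop sampling below are illustrative):

```python
import random
import torch

def resized_crop_with_mask(image, mask, min_valid=1024, max_tries=100):
    """Sketch only: resample crops until the cropped mask keeps enough
    valid pixels. With an unbounded `while True`, a sample whose mask is
    all zero never satisfies the condition and the dataloader worker
    hangs; bounding the retries and falling back avoids that."""
    h, w = mask.shape
    for _ in range(max_tries):
        ch, cw = random.randint(h // 2, h), random.randint(w // 2, w)
        top, left = random.randint(0, h - ch), random.randint(0, w - cw)
        crop = mask[top:top + ch, left:left + cw]
        if crop.sum().item() >= min_valid:
            return image[..., top:top + ch, left:left + cw], crop
    # Fallback: return the unmodified sample rather than looping forever.
    return image, mask
```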

CSautier (Collaborator) commented Jul 1, 2022

Hi, thank you for the notice.
Did you actually hit the bug in practice on a nuScenes training? While it's true that the code might fail under some specific conditions, the resized crop has a crop ratio that can reach 1 and allows the original image aspect ratio, so it has a non-zero probability of outputting the unmodified input image. The loop should therefore be non-blocking, unless there is a problem I am not seeing. In all my pre-training experiments on nuScenes, this transform was never stuck indefinitely.
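In other words, the ratio-1 "identity" crop can always exit the loop, provided the uncropped mask itself passes the validity check. A rough illustration (example sizes and threshold, not the exact values in transforms.py):

```python
import torch

min_valid = 1024                                  # example threshold

# A healthy pairing mask: the identity crop passes, so the loop can exit.
mask = torch.ones(900, 1600, dtype=torch.bool)
print(mask.sum().item() >= min_valid)             # True

# A fully empty mask: no crop, not even the identity one, can ever pass.
empty_mask = torch.zeros(900, 1600, dtype=torch.bool)
print(empty_mask.sum().item() >= min_valid)       # False
```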

mu-cai commented Jul 2, 2022

Thank you for your response!
However, for a certain sample, specifically idx = 17903 with camera_id = 3, the mask can be all zero:
[screenshot attached]

CSautier (Collaborator) commented Jul 5, 2022

I just re-trained for a few epochs with no workers getting stuck there. Have you modified anything in the config? Are you running with slidr_minkunet.yaml?
I have looked specifically at dataset index 17903 and I don't see a fully empty mask: mask.sum() gives me 1115 with camera_id=3 before any augmentation. Then there is DropCuboid, which cannot reduce that below 1024 (line 252), so on line 150, where I assume you put your breakpoint, there is no reason for the mask to be empty unless it was empty from the start (which is not the case for me). This could be either corrupted data in your dataset or a modification you made to dataloader_nuscenes.py.
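For reference, this is roughly the kind of check I ran; the accessor name below is only illustrative, not the actual API of dataloader_nuscenes.py:

```python
# Scan the pre-augmentation pairing masks and report any (sample, camera)
# pair whose mask is empty. nuScenes has 6 cameras, hence range(6).
def find_empty_masks(dataset, camera_ids=range(6)):
    empty = []
    for idx in range(len(dataset)):
        for cam in camera_ids:
            mask = dataset.load_pairing_mask(idx, cam)  # hypothetical accessor
            if mask.sum() == 0:
                empty.append((idx, cam))
    return empty

# On healthy data this should return an empty list.
```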

mu-cai commented Jul 9, 2022

Thank you! I just found that my images were indeed corrupted. Thank you for the notice, really helpful!
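For anyone hitting the same symptom, a generic way to spot the corrupted files (not SLidR-specific; the path in the example is only illustrative):

```python
from pathlib import Path
from PIL import Image

def find_corrupted_images(root):
    """Return image files that PIL cannot verify (truncated or invalid)."""
    bad = []
    for path in sorted(Path(root).rglob("*.jpg")):
        try:
            with Image.open(path) as img:
                img.verify()  # raises on truncated or corrupted files
        except Exception:
            bad.append(path)
    return bad

# Example: find_corrupted_images("data/sets/nuscenes/samples")
```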
Besides, for the downstream object detection task, do you simply transfer the pretrained weights from this repo to OpenPCDet? If so, how do you conduct the experiments for "Results on the validation set using Minkowski SR-Unet with a fraction of the training labels"?
Thank you!

CSautier (Collaborator) commented
Hi, I'm sorry for the late reply.
Glad I could be of help. As for the Minkowski SR-Unet, we modified PointRCNN to use our Minkowski-UNet as its backbone and loaded the pretrained weights into it. We also modified the KITTI dataloader to keep only 1 in n training samples, so as to train on a fraction of the training data; this can be done by subsampling self.kitti_infos in kitti_dataset.py of OpenPCDet (see the sketch below).
We did not provide this code because it does not belong to us, and it would have meant adding an entire fork for this one experiment, but I can send it to you by email if you need it.
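A minimal sketch of that subsampling; self.kitti_infos is the list of annotated frames in OpenPCDet's KittiDataset, while the helper name and the value of n below are illustrative:

```python
def subsample_infos(kitti_infos, keep_one_in_n):
    """Keep only 1 frame in n, to train on a fraction of the labels."""
    return kitti_infos[::keep_one_in_n] if keep_one_in_n > 1 else kitti_infos

# e.g. in KittiDataset.__init__ (kitti_dataset.py), after the infos are loaded:
# self.kitti_infos = subsample_infos(self.kitti_infos, 20)  # ~5% of the frames
```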

mu-cai commented Jul 24, 2022 via email

CSautier (Collaborator) commented
I sent you the files by email and am closing this issue. Let me know by mail if you need help using them.

mu-cai commented Jul 28, 2022

I wonder whether there is a bug around this line: https://github.com/valeoai/SLidR/blob/main/pretrain/dataloader_nuscenes_spconv.py#L321 (I think this line should not be commented out), because in your object detection code base you use 4 features for detection.
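To be concrete about the feature count (illustrative shapes only, not the actual loading code):

```python
import numpy as np

# A nuScenes lidar scan stores x, y, z, intensity, ring index per point;
# keeping 3 vs. 4 features changes the input width the detector expects.
points = np.random.rand(4096, 5).astype(np.float32)
feats_xyz = points[:, :3]    # x, y, z only (3 features)
feats_xyzi = points[:, :4]   # x, y, z, intensity (4 features)
print(feats_xyz.shape, feats_xyzi.shape)  # (4096, 3) (4096, 4)
```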
