Training is stuck for pretraining #3

Closed · mu-cai opened this issue Jul 1, 2022 · 8 comments

Comments

mu-cai commented Jul 1, 2022

I found an error around line 200 in transforms.py: for some samples the loop never exits, which makes training hang. In those cases len_indexes = sum_indexes = 0.
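To illustrate, here is a minimal sketch of the rejection-sampling pattern I am describing (not the exact code from transforms.py; the threshold and the crop sampling below are illustrative):

```python
import random
import torch

def resized_crop_with_mask(image, mask, min_valid=1024, max_tries=100):
    """Sketch only: resample crops until the cropped mask keeps enough
    valid pixels. With an unbounded `while True`, a sample whose mask is
    all zero never satisfies the condition and the dataloader worker
    hangs; bounding the retries and falling back avoids that."""
    h, w = mask.shape
    for _ in range(max_tries):
        ch, cw = random.randint(h // 2, h), random.randint(w // 2, w)
        top, left = random.randint(0, h - ch), random.randint(0, w - cw)
        crop = mask[top:top + ch, left:left + cw]
        if crop.sum().item() >= min_valid:
            return image[..., top:top + ch, left:left + cw], crop
    # Fallback: return the unmodified sample rather than looping forever.
    return image, mask
```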

CSautier (Collaborator) commented Jul 1, 2022

Hi, thank you for the notice.
Did you actually hit the bug in practice on a nuScenes training? While it's true that the code might fail under some specific conditions, the resized crop has a crop ratio that can reach 1 and allows the original image aspect ratio, so it has a non-zero probability of outputting the unmodified input image. The loop should therefore be non-blocking, unless there is a problem I am not seeing. In all my pre-training experiments on nuScenes, this transform was never stuck indefinitely.
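In other words, the ratio-1 "identity" crop can always exit the loop, provided the uncropped mask itself passes the validity check. A rough illustration (example sizes and threshold, not the exact values in transforms.py):

```python
import torch

min_valid = 1024                                  # example threshold

# A healthy pairing mask: the identity crop passes, so the loop can exit.
mask = torch.ones(900, 1600, dtype=torch.bool)
print(mask.sum().item() >= min_valid)             # True

# A fully empty mask: no crop, not even the identity one, can ever pass.
empty_mask = torch.zeros(900, 1600, dtype=torch.bool)
print(empty_mask.sum().item() >= min_valid)       # False
```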

mu-cai commented Jul 2, 2022

Thank you for your response!
However, for a certain sample, specifically idx = 17903 with camera_id = 3, the mask can be all zero:
[screenshot attached]

CSautier (Collaborator) commented Jul 5, 2022

I just re-trained for a few epochs with no workers getting stuck there. Have you modified anything in the config? Are you running with slidr_minkunet.yaml?
I have looked specifically at dataset index 17903 and I don't see a fully empty mask: mask.sum() gives me 1115 with camera_id=3 before any augmentation. Then there is DropCuboid, which cannot reduce that below 1024 (line 252), so on line 150, where I assume you put your breakpoint, there is no reason for the mask to be empty unless it was empty from the start (which is not the case for me). This could be either corrupted data in your dataset or a modification you made to dataloader_nuscenes.py.
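For reference, this is roughly the kind of check I ran; the accessor name below is only illustrative, not the actual API of dataloader_nuscenes.py:

```python
# Scan the pre-augmentation pairing masks and report any (sample, camera)
# pair whose mask is empty. nuScenes has 6 cameras, hence range(6).
def find_empty_masks(dataset, camera_ids=range(6)):
    empty = []
    for idx in range(len(dataset)):
        for cam in camera_ids:
            mask = dataset.load_pairing_mask(idx, cam)  # hypothetical accessor
            if mask.sum() == 0:
                empty.append((idx, cam))
    return empty

# On healthy data this should return an empty list.
```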

mu-cai commented Jul 9, 2022

Thank you! I just found that my images were indeed corrupted. Thank you for the notice, really helpful!
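For anyone hitting the same symptom, a generic way to spot the corrupted files (not SLidR-specific; the path in the example is only illustrative):

```python
from pathlib import Path
from PIL import Image

def find_corrupted_images(root):
    """Return image files that PIL cannot verify (truncated or invalid)."""
    bad = []
    for path in sorted(Path(root).rglob("*.jpg")):
        try:
            with Image.open(path) as img:
                img.verify()  # raises on truncated or corrupted files
        except Exception:
            bad.append(path)
    return bad

# Example: find_corrupted_images("data/sets/nuscenes/samples")
```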
Besides, for the downstream object detection task, do you simply transfer the pretrained weights from this repo to OpenPCDet? If so, how do you conduct the experiments for "Results on the validation set using Minkowski SR-Unet with a fraction of the training labels"?
Thank you!

CSautier (Collaborator) commented
Hi, I'm sorry for the late reply.
Glad I could be of help. As for the Minkowski SR-Unet, we modified PointRCNN to use our Minkowski-UNet as its backbone and loaded the pretrained weights into it. We also modified the KITTI dataloader to keep only 1 in n training samples, so as to train on a fraction of the training data; this can be done by subsampling self.kitti_infos in kitti_dataset.py of OpenPCDet (see the sketch below).
We did not provide this code because it does not belong to us, and it would have meant adding an entire fork for this one experiment, but I can send it to you by email if you need it.
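A minimal sketch of that subsampling; self.kitti_infos is the list of annotated frames in OpenPCDet's KittiDataset, while the helper name and the value of n below are illustrative:

```python
def subsample_infos(kitti_infos, keep_one_in_n):
    """Keep only 1 frame in n, to train on a fraction of the labels."""
    return kitti_infos[::keep_one_in_n] if keep_one_in_n > 1 else kitti_infos

# e.g. in KittiDataset.__init__ (kitti_dataset.py), after the infos are loaded:
# self.kitti_infos = subsample_infos(self.kitti_infos, 20)  # ~5% of the frames
```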

mu-cai commented Jul 24, 2022 via email

CSautier (Collaborator) commented
I sent you the files by email and am closing this issue. Let me know by mail if you need help using them.

mu-cai commented Jul 28, 2022

I wonder whether there is a bug around this line: https://github.com/valeoai/SLidR/blob/main/pretrain/dataloader_nuscenes_spconv.py#L321 (I think this line should not be commented out), because in your object detection code base you use 4 features for detection.
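To be concrete about the feature count (illustrative shapes only, not the actual loading code):

```python
import numpy as np

# A nuScenes lidar scan stores x, y, z, intensity, ring index per point;
# keeping 3 vs. 4 features changes the input width the detector expects.
points = np.random.rand(4096, 5).astype(np.float32)
feats_xyz = points[:, :3]    # x, y, z only (3 features)
feats_xyzi = points[:, :4]   # x, y, z, intensity (4 features)
print(feats_xyz.shape, feats_xyzi.shape)  # (4096, 3) (4096, 4)
```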
