Spconv and Cuda error while training on my own dataset #56
Comments
You may set num_gpu to 1 and add a pdb debug checkpoint to
Thanks for your quick reply. I am getting CUDA error 700, which arises when I pass the output data to the UNet (I have highlighted this below), but I am not sure what causes it. Since my dataset has only 3 semantic classes, could I have forgotten to change something in the config file that is required when there are only 3 classes? In addition, I changed the GPU number to 1 in dist_train.sh, but I get the value error 'Unsupported nproc_per_node value: configs/softgroup_s3dis_backbone_fold5_mintA.yaml'. Am I supposed to set num_gpu anywhere else?
@SijanNeupane49 Hi, I have the same problem.
@SijanNeupane49 Since your bug is in the UNet, I think your data after voxelization is not suitable for it. Can you print out
train.py
data:
dataloader:
optimizer:
save_cfg:
fp16: False
2022-04-29 16:58:26,249 - INFO - Distributed: False
2022-04-29 16:58:28,973 - INFO - Load train dataset: 580 scans
Process finished with exit code 1
Your input spatial_shape is too big.
I don't think I made any changes to the dataset.
I think the default voxel size of 0.02 m is too small for your data, which makes spatial_shape too big. Please check the config for the new dataset stpls3d here for tips on adapting to a new dataset.
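To see why a small voxel size inflates the spatial shape, here is a minimal sketch of the arithmetic. It assumes the grid size along each axis is simply the scan extent multiplied by the number of voxels per unit length (50 voxels per unit corresponds to a voxel size of 0.02). The function name, the scan extents, and the coarser setting of 3 voxels per unit are illustrative assumptions, not values taken from this thread.

```python
import math


def voxel_grid_shape(extent, voxels_per_unit):
    """Return the number of voxels along each axis of a scan.

    extent: (x, y, z) size of the scan in the dataset's length unit.
    voxels_per_unit: voxels per unit length, i.e. 1 / voxel_size.
    """
    return tuple(math.ceil(e * voxels_per_unit) for e in extent)


# Hypothetical scan extents, chosen only for illustration.
scans = {
    "indoor room": (8.0, 6.0, 3.0),
    "outdoor tile": (250.0, 250.0, 40.0),
}

for name, extent in scans.items():
    for voxels_per_unit in (50, 3):  # voxel sizes of 0.02 and about 0.33
        shape = voxel_grid_shape(extent, voxels_per_unit)
        print(f"{name}: {voxels_per_unit} voxels/unit -> {shape}")
```

With these made-up numbers, a room-sized scan stays within a few hundred voxels per axis at 50 voxels per unit, while a large outdoor tile reaches tens of thousands, which is the kind of growth a coarser voxel size avoids.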
Thank you for your reply. I was able to fine-tune and to train the model from the frozen backbone on my dataset for the first time. I changed the following in both config files:
However, while training the model from the frozen backbone, no epochs were saved, whereas during fine-tuning every second epoch was saved. I then ran the training from the frozen backbone again, but got the following value error:
The error indicates that your entire model is frozen, so the optimizer has no model parameters left to optimize.
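One way to catch this situation before the optimizer is built is to list which parameters remain trainable once the frozen modules are excluded. The sketch below works on plain parameter names so that it runs without any framework; the module names, parameter names, and helper functions are hypothetical and are not taken from the SoftGroup code. In a PyTorch model the same check is usually done by filtering on each parameter's `requires_grad` flag.

```python
def trainable_parameter_names(param_names, frozen_modules):
    """Return the parameter names that do not belong to a frozen module."""
    frozen_prefixes = tuple(f"{m}." for m in frozen_modules)
    return [n for n in param_names if not n.startswith(frozen_prefixes)]


def require_trainable(param_names, frozen_modules):
    """Fail early, with a readable message, if nothing is left to train."""
    names = trainable_parameter_names(param_names, frozen_modules)
    if not names:
        raise ValueError(
            "All parameters are frozen: " + ", ".join(frozen_modules)
        )
    return names


# Hypothetical parameter and module names, for illustration only.
params = ["backbone.conv.weight", "backbone.conv.bias", "head.linear.weight"]

print(require_trainable(params, ["backbone"]))

try:
    require_trainable(params, ["backbone", "head"])
except ValueError as err:
    print("ValueError:", err)
```

If the list comes back empty for your config, the modules that are supposed to be trained are either not being built or are included in the frozen set.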
What could be the possible cause of this? I'm getting the same error when I train with my own dataset. When I train with the S3DIS dataset, I do not get this issue.
@theshapguy Can you share the config file?
This is my config file for softgroup_s3dis_fold5_mintA.yaml
Also, it would be great if you could explain what spatial_shape inside voxel_cfg is. Do I need to change it for a custom dataset? The same question applies to grouping_cfg -> class_numpoint_mean. I think once I understand the config file better, I could easily solve my problem. FYI: I am getting the above error only when I train from the frozen backbone; the fine-tuning part is fine.
The spatial shape is the [min, max] dimension of the cropped scan, measured in voxels. See here and here. It is weird that your model does not have any parameters. I notice that the model is not run in distributed mode. Could you run the model in distributed mode using ./dist_train.sh with NUM_GPU = 1 and see if the problem happens again?
I think the problem is that you are not running in distributed mode. It is fixed in the latest commit.
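As a rough illustration of how a [min, max] spatial shape constrains a scan, the sketch below clamps a scan's per-axis voxel extent to that range and reports the axes that exceed the maximum and would therefore need cropping. It assumes the minimum acts as a floor and the maximum as a crop limit; the function name and the example limits of 128 and 512 are assumptions made for this sketch, not values confirmed in this thread.

```python
def fit_spatial_shape(voxel_extent, min_dim, max_dim):
    """Clamp each axis to [min_dim, max_dim]; report axes that need cropping."""
    oversized = [axis for axis, v in zip("xyz", voxel_extent) if v > max_dim]
    shape = tuple(min(max(v, min_dim), max_dim) for v in voxel_extent)
    return shape, oversized


MIN_DIM, MAX_DIM = 128, 512  # example limits, not taken from the thread

# A scan that fits: only the short z axis is raised to the minimum.
print(fit_spatial_shape((400, 300, 90), MIN_DIM, MAX_DIM))

# A scan that is far too large on every axis and would have to be cropped.
print(fit_spatial_shape((12500, 12500, 2000), MIN_DIM, MAX_DIM))
```

A scan whose voxel extent is far above the maximum on every axis is a sign that the voxel size, rather than spatial_shape, is what needs adjusting for the new dataset.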
When I do distributed training, I get the following error:
When I change NUM_GPU = 1 in dist_train.sh I get the following error:
You need to retrain the pretrained model. Your pretrained model is not correct.
The following is how I got my pretrained model "work_dirs/softgroup_s3dis_backbone_fold5_mintA/latest.pth". Should I change something in my config file to retrain the model?
Hi, I got the following error when training on my own dataset. However, I could reproduce the results on the S3DIS dataset without any error. Could you please look into my error and suggest what might be wrong? My dataset is in the same xyzrgb format as the S3DIS dataset.
I have the following configuration:
CUDA = V10.2.89
torch = 1.11.0
spconv-cu102 = 2.1.21
Ubuntu 18.04
I am really sorry for the long error message.
Thank you for your help!