spconv/src/spconv/indice.cu 125 #2
Comments
Can you tell me your torch, CUDA, and spconv versions? Also, what other changes (if any) did you make to the code? Unfortunately, I can't reproduce this error. (I guess it happens a few hours into training?)
Sorry for the late reply. Apart from this change, there is no change in the code.
You don't need to comment out the file loading. Just change the config nsweep field to 1. Also, from the error log I suspect this is a GPU out-of-memory issue. Can you check this?
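One way to check the out-of-memory suspicion is to print PyTorch's own memory counters around the failing step. The sketch below is only an illustration: the helper names `format_mib` and `gpu_memory_report` are hypothetical, and the counters cover PyTorch's caching allocator only, so memory that a CUDA extension allocates on its own may not show up. Watching `nvidia-smi` alongside is a reasonable cross-check.

```python
def format_mib(num_bytes):
    """Render a byte count as whole mebibytes."""
    return "{:.0f} MiB".format(num_bytes / 2**20)


def gpu_memory_report(device=0):
    """Summarise PyTorch's view of GPU memory, or say why it is unavailable."""
    try:
        import torch  # imported lazily so format_mib works without torch
    except ImportError:
        return "torch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    total = torch.cuda.get_device_properties(device).total_memory
    allocated = torch.cuda.memory_allocated(device)   # currently held by tensors
    peak = torch.cuda.max_memory_allocated(device)    # high-water mark so far
    return "allocated {}, peak {}, device total {}".format(
        format_mib(allocated), format_mib(peak), format_mib(total)
    )


# Hypothetical usage: call this just before and after the forward pass.
print(gpu_memory_report())
```

If the peak value sits close to the device total shortly before the crash, that supports the out-of-memory explanation.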
How can I reduce memory usage in the test phase?
@AbdeslemSmahi The simplest way is to add the --speed_test flag during testing. This uses batch size 1 by default. I'm not sure how to go beyond this.
Even that didn't work.
You probably need a larger GPU then... or try the PointPillars model, which takes less memory.
Closing for now. Feel free to reopen if you still have questions.
@AbdeslemSmahi @tianweiy |
Hi, the 0.075 voxelnet will definitely take much more memory than pp. Can you train the 0.1 voxel size model? You can also decrease the batch size a bit; I don't think this matters much for performance. For the distributed data parallel stuff, does your model work with a single GPU? Also, spconv (voxelnet) seems to behave quite weirdly on Titan V. I have tried training voxelnet on Titan Xp, Titan RTX, 2070/2080, V100, and Titan V. All the other GPUs work, but on Titan V I can't even use batch size 2 for a KITTI model. I feel this is a bug in spconv. Do let me know if your Titan V works well with spconv.
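To give a rough feel for why the smaller voxel size costs more memory, the sketch below compares dense grid sizes for the two settings. The point cloud ranges and voxel sizes are assumed values for illustration and are not taken from this thread; check them against the actual config files. The function name `grid_shape` is hypothetical.

```python
def grid_shape(pc_range, voxel_size):
    """Number of voxels along x, y, z for a point cloud range and voxel size."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    extents = (x_max - x_min, y_max - y_min, z_max - z_min)
    # round() guards against floating point noise such as 1440.0000000000002
    return tuple(int(round(e / v)) for e, v in zip(extents, voxel_size))


# Assumed settings, for illustration only.
fine = grid_shape([-54.0, -54.0, -5.0, 54.0, 54.0, 3.0], [0.075, 0.075, 0.2])
coarse = grid_shape([-51.2, -51.2, -5.0, 51.2, 51.2, 3.0], [0.1, 0.1, 0.2])

# Ratio of total grid cells between the two settings.
ratio = (fine[0] * fine[1] * fine[2]) / (coarse[0] * coarse[1] * coarse[2])
print(fine, coarse, round(ratio, 2))
```

Sparse convolution only stores occupied voxels, so this ratio is not a direct memory estimate. It does indicate how much larger the index space and the dense bird's-eye-view feature maps become under the assumed settings.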
@tianweiy Thank you for your reply. I followed your advice and the results are:
It seems that spconv cannot work on Titan V (when voxelnet is involved), and it indeed takes a large amount of memory to run the config with the small voxel size. But now I have reduced the batch size to 2 and it worked, so nothing weird has happened for the moment.
Sure, good luck with your project.
Hi,
I was trying to train with the config file "nusc_centerpoint_voxelnet_01voxel.py" with 1 GPU and sweep=1.
I encountered a crash during training. Kindly help.
File "/home/Nuscene_Top/CenterPoint/tools/train.py", line 128, in main
logger=logger,
File "/home/Nuscene_Top/CenterPoint/det3d/torchie/apis/train.py", line 381, in train_detector
trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
File "/home/Nuscene_Top/CenterPoint/det3d/torchie/trainer/trainer.py", line 538, in run
epoch_runner(data_loaders[i], self.epoch, **kwargs)
File "/home/Nuscene_Top/CenterPoint/det3d/torchie/trainer/trainer.py", line 405, in train
self.model, data_batch, train_mode=True, **kwargs
File "/home/Nuscene_Top/CenterPoint/det3d/torchie/trainer/trainer.py", line 363, in batch_processor_inline
losses = model(example, return_loss=True)
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/Nuscene_Top/CenterPoint/det3d/models/detectors/voxelnet.py", line 47, in forward
x = self.extract_feat(data)
File "/home/Nuscene_Top/CenterPoint/det3d/models/detectors/voxelnet.py", line 24, in extract_feat
input_features, data["coors"], data["batch_size"], data["input_shape"]
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/Nuscene_Top/CenterPoint/det3d/models/backbones/scn.py", line 364, in forward
ret = self.middle_conv(ret)
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/spconv/modules.py", line 123, in forward
input = module(input)
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/spconv/conv.py", line 155, in forward
self.stride, self.padding, self.dilation, self.output_padding, self.subm, self.transposed, grid=input.grid)
File "/home/anaconda3/envs/centerpoint/lib/python3.6/site-packages/spconv/ops.py", line 89, in get_indice_pairs
stride, padding, dilation, out_padding, int(subm), int(transpose))
RuntimeError: /home/Nuscene_Top/spconv/src/spconv/indice.cu 125
cuda execution failed with error 2
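The number in the last line is a CUDA runtime error code, and code 2 corresponds to `cudaErrorMemoryAllocation`, which is consistent with an out-of-memory failure. A minimal sketch for decoding such messages follows; the table is a small excerpt rather than the full enum, and the helper name `explain_cuda_error` is hypothetical.

```python
import re

# Excerpt of CUDA runtime error codes; not the complete enum.
CUDA_ERRORS = {
    0: "cudaSuccess",
    2: "cudaErrorMemoryAllocation",
}


def explain_cuda_error(message):
    """Return the symbolic name for the error code in an spconv-style message."""
    match = re.search(r"cuda execution failed with error (\d+)", message)
    if match is None:
        return None  # the message does not follow the expected pattern
    code = int(match.group(1))
    return CUDA_ERRORS.get(code, "unknown code {}".format(code))


print(explain_cuda_error("cuda execution failed with error 2"))
```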