Invalid in_feat_size 0 with Cuda 11 #1

Closed
zgojcic opened this issue Mar 14, 2021 · 9 comments

Comments

@zgojcic
Owner

zgojcic commented Mar 14, 2021

When using CUDA 11, our model fails with the following error:

File "/home/zgojcic/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine-0.5.1-py3.7-linux-x86_64.egg/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
    coordinate_manager._manager,
RuntimeError: /home/zgojcic/Documents/Rigid3DSceneFlow/MinkowskiEngine/src/convolution_gpu.cu:85, assertion (in_feat.size(0) == p_map_manager->size(in_key)) failed. Invalid in_feat size 0 != 5296

It seems that this is due to the combination of CUDA 11 with MinkowskiEngine. The issue is currently under investigation in NVIDIA/MinkowskiEngine#330.

Until it is solved, we suggest using CUDA 10.2 or 10.1.
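
For readers unfamiliar with the assertion: it requires the input feature matrix to have one row per coordinate registered under the input key. The following is a minimal pure-Python sketch of that invariant. `check_feature_rows` is a hypothetical helper written for illustration only; it is not part of MinkowskiEngine, where the real check lives in `convolution_gpu.cu`.

```python
def check_feature_rows(num_feature_rows: int, num_coords: int) -> None:
    """Raise if the feature matrix does not have one row per coordinate.

    Hypothetical helper that mirrors the failed assertion; not a MinkowskiEngine API.
    """
    if num_feature_rows != num_coords:
        raise RuntimeError(
            f"Invalid in_feat size {num_feature_rows} != {num_coords}"
        )


# The values from the error above: 0 feature rows against 5296 coordinates.
try:
    check_feature_rows(0, 5296)
except RuntimeError as err:
    message = str(err)
```

A check like this on the Python side would only make the mismatch visible earlier; it does not address the underlying cause.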

@zmlshiwo

Hi, Zan. Nice work! I want to run your code. My GPU is an RTX 3090, so it only supports CUDA 11. Has this problem been solved? Thank you.

@zgojcic
Owner Author

zgojcic commented Mar 19, 2021

Hi, yeah, this is actually how we first saw the problem (with a 3090). I think it is not solved yet, but Chris is usually very fast with these things, so it should be quick. I will update this issue once it is solved.

@zmlshiwo

OK, great, thank you very much.

@zmlshiwo

Hi, Zan. I have tested the code on my computer with Ubuntu 20.04, CUDA 11.1, an RTX 3090 GPU, MinkowskiEngine 0.5.2, PyTorch 1.8, and Python 3.7. When I run the training code, it fails with the following error.

python train.py ./configs/train/train_fully_supervised.yaml
/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/MinkowskiEngine/__init__.py:42: UserWarning: The environment variable OMP_NUM_THREADS not set. MinkowskiEngine will automatically set OMP_NUM_THREADS=16. If you want to set OMP_NUM_THREADS manually, please export it on the command line before running a python script. e.g. export OMP_NUM_THREADS=12; python your_program.py. It is recommended to set it below 24.
"It is recommended to set it below 24.",
Using /home/ps/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ps/.cache/torch_extensions/cd/build.ninja...
Building extension module cd...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cd...
2021-03-29 10:21:52 zml root[103079] INFO Command: train.py ./configs/train/train_fully_supervised.yaml
2021-03-29 10:21:52 zml root[103079] INFO Arguments: method_backbone: ME, method_flow: True, method_ego_motion: False, method_semantic: False, method_clustering: False, misc_voxel_size: 0.1, misc_num_points: 8192, misc_trainer: FlowTrainer, misc_use_gpu: True, misc_log_dir: ./logs/, misc_run_mode: train, data_input_features: absolute_coords, data_only_near_points: True, data_dataset: FlyingThings3D_ME, data_root: /media/ps/data/rigid_scene_flow_dataset/flying_things_3d/, data_remove_ground: False, data_augment_data: False, train_batch_size: 8, train_acc_iter_size: 1, train_num_workers: 6, train_max_epoch: 50, train_stat_interval: 5, train_chkpt_interval: 40, train_val_interval: 20, train_weighted_seg_loss: True, val_batch_size: 8, val_num_workers: 6, test_results_dir: ./eval/, test_batch_size: 1, test_num_workers: 1, loss_bg_loss_w: 1.0, loss_fg_loss_w: 1.0, loss_flow_loss_w: 1.0, loss_ego_loss_w: 1.0, loss_inlier_loss_w: 0.005, loss_cd_loss_w: 0.5, loss_rigid_loss_w: 1.0, loss_background_loss: False, loss_flow_loss: True, loss_ego_loss: False, loss_foreground_loss: False, optimizer_alg: Adam, optimizer_learning_rate: 0.001, optimizer_weight_decay: 0.0, optimizer_momentum: 0.8, optimizer_scheduler: ExponentialLR, optimizer_exp_gamma: 0.98, network_normalize_features: True, network_norm_type: IN, network_in_kernel_size: 7, network_feature_dim: 64, network_use_pretrained: True, network_pretrained_path: , metrics_flow: True, metrics_ego_motion: False, metrics_semantic: False
2021-03-29 10:21:52 zml root[103079] INFO Output and logs will be saved to ./logs/logs_FlyingThings3D_ME/21_03_29-10_21_52_431116__Method_ME__Flow___VoxSize_0.1__Pts_8192
2021-03-29 10:21:52 zml root[103079] INFO Parameter Count: 8073729
2021-03-29 10:21:52 zml root[103079] INFO Torch version: 1.8.0
2021-03-29 10:21:52 zml root[103079] INFO CUDA version: 11.1
2021-03-29 10:21:52 zml root[103079] INFO Training epoch: 0, LR: [0.001]
0%| | 0/1963 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 243, in <module>
    main(cfg, args.config)
  File "train.py", line 136, in main
    losses, metrics, total_loss = trainer.train_step(batch)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/trainer.py", line 45, in train_step
    losses, metrics = self._compute_loss_metrics(data)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/trainer.py", line 152, in _compute_loss_metrics
    inferred_values = self.model(sinput1, sinput2, input_dict['pcd_eval_s'], input_dict['pcd_eval_t'], input_dict['fg_labels_s'], input_dict['fg_labels_t'])
  File "/home/ps/anaconda3/envs/rigid_3dsf/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/model/rigid_3d_sf.py", line 308, in forward
    self._infer_flow(dec_feat_1, dec_feat_2)
  File "/home/ps/zml_3D_scene_flow/Rigid3DSceneFlow-master/lib/model/rigid_3d_sf.py", line 102, in _infer_flow
    feat_s = flow_f_1.F[flow_f_1.C[:,0] == b_idx]
RuntimeError: invalid shape dimension -1122459258
0%| | 0/1963 [00:02<?, ?it/s]

So it fails with an invalid shape dimension. I also suspect that this is a MinkowskiEngine problem.
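
For context, the line that fails in the traceback, `flow_f_1.F[flow_f_1.C[:,0] == b_idx]`, picks out the feature rows that belong to one batch element. Below is a minimal sketch of that selection using plain Python lists instead of tensors. It assumes that column 0 of the coordinates holds the batch index; `select_batch_features` and the sample data are hypothetical and only serve as illustration.

```python
from typing import List, Sequence


def select_batch_features(
    coords: Sequence[Sequence[int]],
    feats: Sequence[Sequence[float]],
    b_idx: int,
) -> List[Sequence[float]]:
    """Return the feature rows whose coordinate row starts with b_idx.

    Assumption: column 0 of each coordinate row is the batch index.
    """
    if len(coords) != len(feats):
        # Same invariant as in the first error: one feature row per coordinate.
        raise ValueError(
            f"{len(feats)} feature rows for {len(coords)} coordinates"
        )
    return [f for c, f in zip(coords, feats) if c[0] == b_idx]


# Hypothetical data: two points in batch 0 and one point in batch 1.
coords = [[0, 1, 2, 3], [1, 4, 5, 6], [0, 7, 8, 9]]
feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
batch0 = select_batch_features(coords, feats, 0)
```

With valid inputs the selection can never produce a negative size, which suggests that the negative dimension in the error comes from corrupted values rather than from the indexing logic itself.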

@zgojcic
Owner Author

zgojcic commented Mar 29, 2021

Hi, yes, it seems that this is the same error. In the ME thread there was a comment that the code should work with ME 0.5 (even with CUDA 11.x), so maybe you can try that.

@zmlshiwo

OK, thank you. I will try it.

@zgojcic
Owner Author

zgojcic commented Apr 8, 2021

It seems that this is a problem with the combination of PyTorch 1.8.x and CUDA 11.x, and not an ME bug. Until it is fixed, I suggest using PyTorch 1.7.1 with CUDA 11.x. Updates can also be followed in the issue referenced in the first post of this thread.
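
A minimal sketch of a startup guard for this combination, in case it helps to fail early with a clear message. The helper names `parse_major_minor` and `is_affected` are hypothetical, and the rule encodes only what is reported in this thread (PyTorch 1.8.x together with CUDA 11.x), not an official compatibility matrix.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Parse strings such as '1.8.0', '1.8.1+cu111' or '11.1' into (major, minor)."""
    core = version.split("+", 1)[0]  # drop a local suffix such as '+cu111'
    parts = core.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def is_affected(torch_version: str, cuda_version: str) -> bool:
    """True for PyTorch 1.8.x with CUDA 11.x, the combination reported in this thread."""
    torch_is_1_8 = parse_major_minor(torch_version) == (1, 8)
    cuda_is_11 = parse_major_minor(cuda_version)[0] == 11
    return torch_is_1_8 and cuda_is_11
```

In a real environment the two strings would typically come from `torch.__version__` and `torch.version.cuda`; the guard takes them as plain strings so that it can be tried without PyTorch installed.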

@zgojcic
Owner Author

zgojcic commented Nov 18, 2021

Closing due to inactivity. Please open a new issue if you have further questions.

@zgojcic zgojcic closed this as completed Nov 18, 2021
@Alt216

Alt216 commented Jan 15, 2022

Hi @zgojcic, from my understanding, 30-series GPUs only work with CUDA 11.1 and up, but PyTorch 1.7.1 only works with CUDA 11.0. How did you get PyTorch 1.7.1 to work with a CUDA version that is compatible with 30-series GPUs? Thanks in advance.
