RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #8

Closed
jaycheney opened this issue Aug 15, 2022 · 13 comments

Comments

@jaycheney

This problem occurs when I run pretrain.py. I have tried a lot of things but don't know how to fix it. Could you help me?

I am running with 1 GPU, Ubuntu 18.04, cuDNN 8, CUDA 11.1, and the other requirements match requirements.txt.

Training: -1it [00:00, ?it/s]
Training: 0%| | 0/7033 [00:00<00:00, 22671.91it/s]
Epoch 0: 0%| | 0/7033 [00:00<00:01, 3584.88it/s] Traceback (most recent call last):
File "pretrain.py", line 61, in
main()
File "pretrain.py", line 57, in main
trainer.fit(module, dm)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 131, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
super().run(batch, batch_idx, dataloader_idx)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 402, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1593, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/optimizer.py", line 89, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/sgd.py", line 87, in step
loss = closure()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 533, in training_step_and_backward
result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 306, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 386, in training_step
return self.model(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/user/SLidR/pretrain/lightning_trainer.py", line 62, in training_step
for loss in self.losses
File "/user/SLidR/pretrain/lightning_trainer.py", line 62, in
for loss in self.losses
File "/user/SLidR/pretrain/lightning_trainer.py", line 124, in loss_superpixels_average
k = one_hot_P @ output_points[batch["pairing_points"]]
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Epoch 0: 0%| | 0/7033 [00:37<74:08:49, 37.95s/it]

@CSautier
Collaborator

Hi, have you modified the code? Line 124 in "SLidR/pretrain/lightning_trainer.py" isn't supposed to be
"k = one_hot_P @ output_points[batch["pairing_points"]]".
This error is typical of incorrect indexing. That can happen in the superpixel indices, the pairing of the points, or the creation of the sparse matrices.
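A minimal sketch of such a check (a hypothetical helper, not part of SLidR; the tensor names are taken from the traceback). Because CUDA reports illegal accesses asynchronously, the kernel named in the error (merge_sort) is often not the one that actually faulted; running with CUDA_LAUNCH_BLOCKING=1, or calling a bounds check like this right before the matrix product, usually narrows it down:

import torch

def check_pairing_indices(output_points: torch.Tensor, pairing_points: torch.Tensor) -> None:
    # Raise a readable error instead of a CUDA illegal-address fault
    # if any pairing index falls outside the rows of output_points.
    if pairing_points.numel() == 0:
        return
    lo, hi = int(pairing_points.min()), int(pairing_points.max())
    n_points = output_points.shape[0]
    if lo < 0 or hi >= n_points:
        raise IndexError(
            f"pairing_points spans [{lo}, {hi}] but output_points has only {n_points} rows"
        )

# e.g. check_pairing_indices(output_points, batch["pairing_points"]) before computing k.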

@jaycheney
Author

Thank you for your reply. I didn't modify the original code (only added some comments), but when I run pretrain.py with slidr_minkunet.yaml, the error RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered still occurs. I checked k = one_hot_P @ output_points[batch["pairing_points"]]; the shapes are one_hot_P: (3600, 45771), output_points: (44516, 64), and pairing_points: (45771,).
But when I run pretrain.py with config/slidr_voxelnet.yaml, there is no problem.

BTW, could you kindly share by email the code used to fine-tune object detection models from OpenPCDet? I'm still reproducing the results of SLidR. It would help me a lot.
Originally posted by @CSautier in #3 (comment)

@ZhengLeon

ZhengLeon commented Sep 1, 2022

I ran into this problem too. Did you solve it? @JakeVander

@andrewcaunes

Same problem here.

@modifierT

Same problem here.

CSautier reopened this Sep 25, 2023
@CSautier
Collaborator

Could you please tell me exactly what you are running? I will try to add a Dockerfile to set up a working environment, and see if I can either reproduce the issue or specify an environment that avoids it. However, I'm not sure a single Dockerfile will suit every configuration, since MinkowskiEngine can be particularly painful to compile.
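For reference, a small sketch (not from the repo) that gathers the relevant versions in one place, which is the information that matters most here:

import torch
import MinkowskiEngine as ME
import pytorch_lightning as pl

# Print the exact library, CUDA, and GPU versions to include in a report.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| cuDNN:", torch.backends.cudnn.version())
print("MinkowskiEngine:", ME.__version__)
print("pytorch_lightning:", pl.__version__)
print("GPU:", torch.cuda.get_device_name(0), "| compute capability:", torch.cuda.get_device_capability(0))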

@andrewcaunes

andrewcaunes commented Sep 25, 2023

I managed to fix this error, but I don't remember exactly how, sorry.
I'm pretty sure it was by changing the PyTorch version; I ended up using:
pytorch=1.12.1
MinkowskiEngine=0.5.4
along with an NVIDIA V100 with CUDA 12.0.

Thanks for the amazing work, by the way!

@modifierT

Thanks for reopening the issue!
I ran
python pretrain.py --cfg config/slidr_minkunet.yaml
in the terminal and got
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
almost the same as the original report quoted below:

[The quoted report repeats the first post verbatim, including the full traceback ending in the same RuntimeError at lightning_trainer.py line 124.]

in which

k = one_hot_P @ output_points[pairing_points]
one_hot_P: torch.Size([7200, 104372])
output_points[pairing_points]: torch.Size([104372, 64])

And this problem does not occur when running python pretrain.py --cfg config/slidr_voxelnet.yaml

My environment is

torch == 1.10.0+cu113
numpy == 1.24.1
MinkowskiEngine == 0.5.4
nuscenes-devkit == 1.1.10
pytorch_lightning == 1.4.0
multiprocess == 0.70.15
scikit-image == 0.21.0
torchvision == 0.11.1+cu113
spconv == 2.3.6
torchmetrics == 0.4.0

along with an NVIDIA A800

@CSautier
Collaborator

Ok, I was actually able to reproduce the issue now. I'll see what I can do.

@CSautier
Collaborator

The issue is indeed a compatibility issue between MinkowskiEngine and some versions of PyTorch+CUDA (see for instance NVIDIA/MinkowskiEngine#299).

I was able to run the code again using CUDA 11.3, torch 1.12.0, cuDNN 8, the latest commit of MinkowskiEngine, and pytorch_lightning 1.6.0.

I will add a Dockerfile with this config to the repo.
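For anyone verifying their own setup, here is a minimal smoke test (a sketch, not part of SLidR) that runs a single MinkowskiEngine convolution on the GPU; with an incompatible PyTorch/CUDA/MinkowskiEngine combination it tends to fail with the same kind of illegal memory access before any SLidR code is involved:

import torch
import MinkowskiEngine as ME

def minkowski_smoke_test(device="cuda:0"):
    # Random integer voxel coordinates; duplicates are removed so every voxel is unique.
    coords = torch.randint(0, 100, (2048, 3), dtype=torch.int32)
    coords = torch.unique(coords, dim=0)
    # Prepend the batch-index column expected by MinkowskiEngine (all points in batch 0).
    bcoords = torch.cat([torch.zeros(coords.shape[0], 1, dtype=torch.int32), coords], dim=1)
    feats = torch.rand(coords.shape[0], 4)
    x = ME.SparseTensor(features=feats.to(device), coordinates=bcoords.to(device))
    conv = ME.MinkowskiConvolution(in_channels=4, out_channels=8, kernel_size=3, dimension=3).to(device)
    y = conv(x)
    torch.cuda.synchronize()
    print("MinkowskiEngine OK, output feature shape:", tuple(y.F.shape))

if __name__ == "__main__":
    minkowski_smoke_test()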

CSautier added a commit that referenced this issue Sep 26, 2023
@modifierT

It works! Thanks a lot!

@CSautier
Collaborator

Ok, I'm closing the issue. For future reference, if problems with MinkowskiEngine reappear, it could be easier to switch to torchsparse (see for instance).
That would require some rewriting, especially in the dataloader, and it would break compatibility with the published results' weights.

@Eaphan
Contributor

Eaphan commented Mar 2, 2024

I had the same error when using MinkowskiEngine at tag v0.5.4.
Then I checked out commit 02fc608bea4c0549b0a7b00ca1bf15dee4a0b228, reinstalled MinkowskiEngine, and the error disappeared.
