RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered #8

Closed
jaycheney opened this issue Aug 15, 2022 · 13 comments

Comments

@jaycheney

This problem occurs when I run pretrain.py. I have tried a lot of things but don't know how to fix it. Could you help me?

I am running with 1 GPU, Ubuntu 18.04, cuDNN 8, CUDA 11.1, and the other requirements match requirements.txt.

Training: -1it [00:00, ?it/s]
Training: 0%| | 0/7033 [00:00<00:00, 22671.91it/s]
Epoch 0: 0%| | 0/7033 [00:00<00:01, 3584.88it/s] Traceback (most recent call last):
File "pretrain.py", line 61, in
main()
File "pretrain.py", line 57, in main
trainer.fit(module, dm)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 131, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
super().run(batch, batch_idx, dataloader_idx)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 402, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1593, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/optimizer.py", line 89, in wrapper
return func(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/optim/sgd.py", line 87, in step
loss = closure()
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 533, in training_step_and_backward
result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 306, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 386, in training_step
return self.model(*args, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/SLidR/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 82, in forward
output = self.module.training_step(*inputs, **kwargs)
File "/user/SLidR/pretrain/lightning_trainer.py", line 62, in training_step
for loss in self.losses
File "/user/SLidR/pretrain/lightning_trainer.py", line 62, in
for loss in self.losses
File "/user/SLidR/pretrain/lightning_trainer.py", line 124, in loss_superpixels_average
k = one_hot_P @ output_points[batch["pairing_points"]]
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Epoch 0: 0%| | 0/7033 [00:37<74:08:49, 37.95s/it]

@CSautier
Collaborator

Hi, have you modified the code? Line 124 in "SLidR/pretrain/lightning_trainer.py" isn't supposed to be
"k = one_hot_P @ output_points[batch["pairing_points"]]".
This error is typical of incorrect indexing. That can happen in the superpixel indices, the pairing of the points, or the creation of the sparse matrices.
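A minimal sketch of such a check (a hypothetical helper, not part of SLidR; the tensor names are taken from the traceback). Because CUDA reports illegal accesses asynchronously, the kernel named in the error (merge_sort) is often not the one that actually faulted; running with CUDA_LAUNCH_BLOCKING=1, or calling a bounds check like this right before the matrix product, usually narrows it down:

import torch

def check_pairing_indices(output_points: torch.Tensor, pairing_points: torch.Tensor) -> None:
    # Raise a readable error instead of a CUDA illegal-address fault
    # if any pairing index falls outside the rows of output_points.
    if pairing_points.numel() == 0:
        return
    lo, hi = int(pairing_points.min()), int(pairing_points.max())
    n_points = output_points.shape[0]
    if lo < 0 or hi >= n_points:
        raise IndexError(
            f"pairing_points spans [{lo}, {hi}] but output_points has only {n_points} rows"
        )

# e.g. check_pairing_indices(output_points, batch["pairing_points"]) before computing k.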

@jaycheney
Author

Thank you for your reply. I didn't modify the original code (only added some comments), but when I run pretrain.py with slidr_minkunet.yaml, the error RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered still occurs. I checked k = one_hot_P @ output_points[batch["pairing_points"]]; the shapes are one_hot_P: (3600, 45771), output_points: (44516, 64), and pairing_points: (45771,).
But when I run pretrain.py with config/slidr_voxelnet.yaml, there is no problem.

BTW, could you kindly share by email the code used to fine-tune object detection models from OpenPCDet? I'm still reproducing the results of SLidR. It would help me a lot.
Originally posted by @CSautier in #3 (comment)

@ZhengLeon

ZhengLeon commented Sep 1, 2022

I ran into this problem too. Did you solve it? @JakeVander

@andrewcaunes

Same problem here.

@modifierT

Same problem here.

CSautier reopened this Sep 25, 2023
@CSautier
Collaborator

Could you please tell me exactly what you are running? I will try to add a Dockerfile to set up a working environment, and see if I can either reproduce the issue or specify an environment that avoids it. However, I'm not sure a single Dockerfile will suit every configuration, since MinkowskiEngine can be particularly painful to compile.
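For reference, a small sketch (not from the repo) that gathers the relevant versions in one place, which is the information that matters most here:

import torch
import MinkowskiEngine as ME
import pytorch_lightning as pl

# Print the exact library, CUDA, and GPU versions to include in a report.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda, "| cuDNN:", torch.backends.cudnn.version())
print("MinkowskiEngine:", ME.__version__)
print("pytorch_lightning:", pl.__version__)
print("GPU:", torch.cuda.get_device_name(0), "| compute capability:", torch.cuda.get_device_capability(0))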

@andrewcaunes

andrewcaunes commented Sep 25, 2023

I managed to fix this error, but I don't remember exactly how, sorry.
I'm pretty sure it was by changing the PyTorch version; I ended up using:
pytorch=1.12.1
MinkowskiEngine=0.5.4
along with an NVIDIA V100 with CUDA 12.0.

Thanks for the amazing work, by the way!

@modifierT

Thanks for reopening the issue!
I ran
python pretrain.py --cfg config/slidr_minkunet.yaml
in the terminal and got
RuntimeError: merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
almost the same as the original report quoted below:

[The quoted report repeats the first post verbatim, including the full traceback ending in the same RuntimeError at lightning_trainer.py line 124.]

in which

k = one_hot_P @ output_points[pairing_points]
one_hot_P: torch.Size([7200, 104372])
output_points[pairing_points]: torch.Size([104372, 64])

And this problem does not occur when running python pretrain.py --cfg config/slidr_voxelnet.yaml

My environment is

torch == 1.10.0+cu113
numpy == 1.24.1
MinkowskiEngine == 0.5.4
nuscenes-devkit == 1.1.10
pytorch_lightning == 1.4.0
multiprocess == 0.70.15
scikit-image == 0.21.0
torchvision == 0.11.1+cu113
spconv == 2.3.6
torchmetrics == 0.4.0

along with an NVIDIA A800

@CSautier
Collaborator

Ok, I was actually able to reproduce the issue now. I'll see what I can do.

@CSautier
Collaborator

The issue is indeed a compatibility issue between MinkowskiEngine and some versions of PyTorch+CUDA (see for instance NVIDIA/MinkowskiEngine#299).

I was able to run the code again using CUDA 11.3, torch 1.12.0, cuDNN 8, the latest commit of MinkowskiEngine, and pytorch_lightning 1.6.0.

I will add a Dockerfile with this config to the repo.
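For anyone verifying their own setup, here is a minimal smoke test (a sketch, not part of SLidR) that runs a single MinkowskiEngine convolution on the GPU; with an incompatible PyTorch/CUDA/MinkowskiEngine combination it tends to fail with the same kind of illegal memory access before any SLidR code is involved:

import torch
import MinkowskiEngine as ME

def minkowski_smoke_test(device="cuda:0"):
    # Random integer voxel coordinates; duplicates are removed so every voxel is unique.
    coords = torch.randint(0, 100, (2048, 3), dtype=torch.int32)
    coords = torch.unique(coords, dim=0)
    # Prepend the batch-index column expected by MinkowskiEngine (all points in batch 0).
    bcoords = torch.cat([torch.zeros(coords.shape[0], 1, dtype=torch.int32), coords], dim=1)
    feats = torch.rand(coords.shape[0], 4)
    x = ME.SparseTensor(features=feats.to(device), coordinates=bcoords.to(device))
    conv = ME.MinkowskiConvolution(in_channels=4, out_channels=8, kernel_size=3, dimension=3).to(device)
    y = conv(x)
    torch.cuda.synchronize()
    print("MinkowskiEngine OK, output feature shape:", tuple(y.F.shape))

if __name__ == "__main__":
    minkowski_smoke_test()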

CSautier added a commit that referenced this issue Sep 26, 2023
@modifierT

It works! Thanks a lot!

@CSautier
Collaborator

Ok, I'm closing the issue. For future reference, if problems with MinkowskiEngine reappear, it could be easier to switch to torchsparse (see for instance).
That would require some rewriting, especially in the dataloader, and it would break compatibility with the published results' weights.

@Eaphan
Contributor

Eaphan commented Mar 2, 2024

I had the same error when using MinkowskiEngine at tag v0.5.4.
Then I checked out commit 02fc608bea4c0549b0a7b00ca1bf15dee4a0b228, reinstalled MinkowskiEngine, and the error disappeared.
