Unexpected bus error encountered #4

Closed
zhm1211 opened this issue Apr 30, 2022 · 4 comments


zhm1211 commented Apr 30, 2022

Hi, I ran into a problem while following your README to run the code. I wonder whether it is because I ran it on CUDA 11.3: perhaps I ran the render_mano_ih step incorrectly and there is a problem with the resulting lmdb file.

When I train or test DIGIT, I hit the following problem in the dataloader:

Number of annotations in single hand sequences: 58463
Number of annotations in interacting hand sequences: 36568
Total number of annotations: 95031

Defrost ALL
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.

| Name | Type | Params
0 | model | Model | 40 M

Validation sanity check: 0it [00:00, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4844) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train.py", line 56, in
run_exp(cfg)
File "train.py", line 51, in run_exp
trainer.fit(model, train_loader, [val_loader])
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in run_pretrain_routine
self._run_sanity_check(ref_model, model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1257, in _run_sanity_check
eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 305, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in next
data = self._next_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1024, in _get_data
success, data = self._try_get_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 4844) exited unexpectedly.

However, when I change num_workers to 0, I run into another problem:

Using native 16bit precision.

| Name | Type | Params
0 | model | Model | 40 M

/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 36 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]Bus error (core dumped)
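
A minimal diagnostic sketch for the shared-memory hypothesis named in the error message above (the /dev/shm path assumes Linux; nothing here is part of this repo):

```python
# Minimal diagnostic for the "bus error in worker" symptom: DataLoader workers
# exchange batches through shared memory, so a small /dev/shm mount (common in
# default Docker containers) is a frequent cause. Assumes Linux.
import shutil
import torch.multiprocessing as mp

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.2f} GiB, free: {free / 2**30:.2f} GiB")
print(f"PyTorch tensor sharing strategy: {mp.get_sharing_strategy()}")
```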

zc-alexfan (Owner) commented

Hi, I haven't encountered this issue before. However, I think it is related to how the environment is installed and to your operating system. Could you check whether the versions of your packages match the ones in my requirements file?

Training with 0 workers is generally not recommended, as it will slow down your data loading.
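
A minimal sketch of the two usual workarounds, assuming the crash really is shared-memory related (the TensorDataset below is only a stand-in for the repo's actual dataset, and batch_size/num_workers are illustrative):

```python
# Sketch of the two usual workarounds when DataLoader workers die with a bus
# error: switch tensor sharing to the file-backed strategy, or lower
# num_workers. The TensorDataset is a placeholder, not the repo's dataset.
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Workaround 1: file-backed sharing avoids /dev/shm; call this before any
# DataLoader is created.
mp.set_sharing_strategy("file_system")

dataset = TensorDataset(torch.randn(256, 3, 128, 128))  # placeholder data

# Workaround 2: fewer workers keep fewer shared-memory segments in flight.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    pin_memory=torch.cuda.is_available(),
)

for (batch,) in loader:  # iterate once to confirm the workers survive
    pass
```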


zhm1211 commented May 1, 2022

I greatly appreciate your prompt reply. I don't currently have a server with CUDA 10.1, and installing a lower PyTorch version on my current machine gives a warning. Does this code have strict environment requirements? If so, I will have to wait until I get access to a server with a lower CUDA version before trying again. Thanks again for your patient answer.

zhm1211 closed this as completed May 1, 2022
zc-alexfan (Owner) commented

I don't think the code is version-specific. Maybe you can try a different PyTorch version?

zhm1211 reopened this May 3, 2022

zhm1211 commented May 3, 2022

Hi, I recently tried running your code on CUDA 10.1 and configured the environment exactly as you recommended. Unfortunately, when I ran main.py in render_mano_ih, I hit another problem:

Fitting error: 4.7883463 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 6.800863 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 4.305114 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 4.323252 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 8.250234 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]

Incidentally, I have also tried a higher PyTorch version on CUDA 11.3, but I still encountered the first problem above when running train.py in digit-interacting. To avoid dropping into the PDB debugger, I had earlier deleted the import pdb code when running package_images_lmdb.py; I don't know whether that caused my failure. In any case, the problems I'm having on CUDA 10.1 are probably unrelated to that change.
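
The repeated `invalid device function` messages typically mean that a compiled CUDA extension (presumably the neural renderer used by render_mano_ih) was built for a GPU architecture other than the one it runs on. A minimal sketch for checking the versions involved (the TORCH_CUDA_ARCH_LIST remark is a general PyTorch-extension convention, not something specific to this repo):

```python
# "invalid device function" usually means a CUDA kernel was compiled for a
# compute capability that does not match the GPU it runs on. This sketch only
# prints the relevant versions so the build can be compared against the GPU.
import torch

print("torch version:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"GPU 0: {name}, compute capability {major}.{minor}")
    # If the renderer's CUDA extension was built for a different architecture,
    # rebuilding it with TORCH_CUDA_ARCH_LIST set to this capability (e.g.
    # "7.5") before reinstalling is the usual fix.
```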
