Unexpected bus error encountered #4

Closed
zhm1211 opened this issue Apr 30, 2022 · 4 comments


zhm1211 commented Apr 30, 2022

Hi, I ran into a problem while following your README to run the code. I wonder whether it is because I ran it on CUDA 11.3: perhaps I ran the render_mano_ih step incorrectly and there is a problem with the resulting lmdb file.

When I train or test DIGIT, I hit the following problem in the dataloader:

Number of annotations in single hand sequences: 58463
Number of annotations in interacting hand sequences: 36568
Total number of annotations: 95031

Defrost ALL
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
Using native 16bit precision.

| Name | Type | Params
0 | model | Model | 40 M

Validation sanity check: 0it [00:00, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4844) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "train.py", line 56, in
run_exp(cfg)
File "train.py", line 51, in run_exp
trainer.fit(model, train_loader, [val_loader])
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in run_pretrain_routine
self._run_sanity_check(ref_model, model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1257, in _run_sanity_check
eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 305, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in next
data = self._next_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1024, in _get_data
success, data = self._try_get_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 4844) exited unexpectedly.

However, when I change num_workers to 0, I run into another problem:

Using native 16bit precision.

| Name | Type | Params
0 | model | Model | 40 M

/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 36 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]Bus error (core dumped)
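
A minimal diagnostic sketch for the shared-memory hypothesis named in the error message above (the /dev/shm path assumes Linux; nothing here is part of this repo):

```python
# Minimal diagnostic for the "bus error in worker" symptom: DataLoader workers
# exchange batches through shared memory, so a small /dev/shm mount (common in
# default Docker containers) is a frequent cause. Assumes Linux.
import shutil
import torch.multiprocessing as mp

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.2f} GiB, free: {free / 2**30:.2f} GiB")
print(f"PyTorch tensor sharing strategy: {mp.get_sharing_strategy()}")
```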

zc-alexfan (Owner) commented

Hi, I haven't encountered this issue before. However, I think it is related to how the environment is installed and to your operating system. Could you check whether the versions of your packages match the ones in my requirements file?

Training with 0 workers is generally not recommended, as it will slow down your data loading.
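
A minimal sketch of the two usual workarounds, assuming the crash really is shared-memory related (the TensorDataset below is only a stand-in for the repo's actual dataset, and batch_size/num_workers are illustrative):

```python
# Sketch of the two usual workarounds when DataLoader workers die with a bus
# error: switch tensor sharing to the file-backed strategy, or lower
# num_workers. The TensorDataset is a placeholder, not the repo's dataset.
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Workaround 1: file-backed sharing avoids /dev/shm; call this before any
# DataLoader is created.
mp.set_sharing_strategy("file_system")

dataset = TensorDataset(torch.randn(256, 3, 128, 128))  # placeholder data

# Workaround 2: fewer workers keep fewer shared-memory segments in flight.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=2,
    pin_memory=torch.cuda.is_available(),
)

for (batch,) in loader:  # iterate once to confirm the workers survive
    pass
```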


zhm1211 commented May 1, 2022

I greatly appreciate your prompt reply. I don't currently have a server with CUDA 10.1, and installing a lower PyTorch version on my current machine gives a warning. Does this code have strict environment requirements? If so, I will have to wait until I get access to a server with a lower CUDA version before trying again. Thanks again for your patient answer.

zhm1211 closed this as completed May 1, 2022
zc-alexfan (Owner) commented

I don't think the code is version-specific. Maybe you can try a different PyTorch version?

zhm1211 reopened this May 3, 2022

zhm1211 commented May 3, 2022

Hi, I recently tried running your code on CUDA 10.1 and configured the environment exactly as you recommended. Unfortunately, when I ran main.py in render_mano_ih, I hit another problem:

Fitting error: 4.7883463 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 6.800863 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 4.305114 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 4.323252 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Error in forward_face_index_map_2: invalid device function
Error in forward_texture_sampling: invalid device function
Fitting error: 8.250234 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]

Incidentally, I have also tried a higher PyTorch version on CUDA 11.3, but I still encountered the first problem above when running train.py in digit-interacting. To avoid dropping into the PDB debugger, I had earlier deleted the import pdb code when running package_images_lmdb.py; I don't know whether that caused my failure. In any case, the problems I'm having on CUDA 10.1 are probably unrelated to that change.
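
The repeated `invalid device function` messages typically mean that a compiled CUDA extension (presumably the neural renderer used by render_mano_ih) was built for a GPU architecture other than the one it runs on. A minimal sketch for checking the versions involved (the TORCH_CUDA_ARCH_LIST remark is a general PyTorch-extension convention, not something specific to this repo):

```python
# "invalid device function" usually means a CUDA kernel was compiled for a
# compute capability that does not match the GPU it runs on. This sketch only
# prints the relevant versions so the build can be compared against the GPU.
import torch

print("torch version:", torch.__version__)
print("torch built against CUDA:", torch.version.cuda)

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print(f"GPU 0: {name}, compute capability {major}.{minor}")
    # If the renderer's CUDA extension was built for a different architecture,
    # rebuilding it with TORCH_CUDA_ARCH_LIST set to this capability (e.g.
    # "7.5") before reinstalling is the usual fix.
```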
