Unexpected bus error encountered #4
Hi, I haven't encountered this issue before. However, I think it is related to how the environment is installed and to your operating system. Could you check whether your package versions match the ones in my requirements file? Training with 0 workers is generally not recommended, as it will slow down your data loading.
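To make the version comparison concrete, here is a minimal stdlib-only sketch for checking installed packages against a pinned list. The package names and versions below are placeholders, not the repo's actual pins; substitute the entries from the maintainer's requirements file.

```python
# Hedged sketch: compare installed package versions against a pinned list.
# The names/versions in the example call are illustrative placeholders.
from importlib.metadata import version, PackageNotFoundError

def check_versions(expected):
    """Return {name: (expected, installed-or-None)} for every mismatch."""
    report = {}
    for name, want in expected.items():
        try:
            have = version(name)
        except PackageNotFoundError:
            have = None  # package not installed at all
        if have != want:
            report[name] = (want, have)
    return report

# Example usage (pins are illustrative, not the repo's actual requirements):
mismatches = check_versions({"torch": "1.7.1", "pytorch-lightning": "0.9.0"})
for name, (want, have) in mismatches.items():
    print(f"{name}: expected {want}, found {have or 'not installed'}")
```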
I greatly appreciate your prompt reply. I didn't have access to a server with CUDA 10.1 before, and a lower PyTorch version produces a warning. Does this algorithm have strict environment requirements? If so, I'll have to wait until I get permissions on a server with a lower CUDA version before trying again. Thanks again for your patient answer.
I don't think the code is version-specific. Maybe you can try a different PyTorch version?
Hi, I recently tried running your code on CUDA 10.1 and configured the environment exactly as you recommended. Unfortunately, when I ran main.py in render_mano_ih, I hit another problem:
Fitting error: 4.7883463 mm: 0%|▍ | 1195/380125 [00:10<52:05, 121.25it/s]Error in forward_face_index_map_1: invalid device function
Incidentally, I also tried a higher PyTorch version on CUDA 11.3, but I still ran into the first problem mentioned above when running train.py in digit-interacting. To avoid dropping into PDB, I had deleted the `import pdb` lines when running package_images_lmdb.py earlier. I don't know whether that could be the cause of my failure, but the problem I'm having on CUDA 10.1 is probably unrelated to that change.
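For what it's worth, an "invalid device function" from a compiled CUDA extension usually means the extension was built for a different GPU architecture than the one it runs on. A common workaround is to pin `TORCH_CUDA_ARCH_LIST` to your card's compute capability before rebuilding the extension. The capability value below is an example, not a known fact about this machine:

```python
# Hedged sketch: pin the CUDA arch list before rebuilding a compiled
# extension. (7, 5) is an example compute capability (e.g. RTX 20xx);
# replace it with your GPU's actual value.
import os

def arch_list_for(capability):
    """Format a (major, minor) compute capability for TORCH_CUDA_ARCH_LIST."""
    major, minor = capability
    return f"{major}.{minor}"

os.environ["TORCH_CUDA_ARCH_LIST"] = arch_list_for((7, 5))
print(os.environ["TORCH_CUDA_ARCH_LIST"])
# Then reinstall the extension in its source folder, e.g. `pip install -e .`
```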
Hi, I ran into a problem while following your README to run the code. I wonder whether running the render_mano_ih step on CUDA 11.3 was incorrect and produced a broken lmdb file.
When I train or test DIGIT, I hit this dataloader problem:
Number of annotations in single hand sequences: 58463
Number of annotations in interacting hand sequences: 36568
Total number of annotations: 95031
| Name | Type | Params
0 | model | Model | 40 M
Validation sanity check: 0it [00:00, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
(the same error repeats once per dataloader worker)
Traceback (most recent call last):
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4844) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 56, in &lt;module&gt;
run_exp(cfg)
File "train.py", line 51, in run_exp
trainer.fit(model, train_loader, [val_loader])
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1224, in run_pretrain_routine
self._run_sanity_check(ref_model, model)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1257, in _run_sanity_check
eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 305, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
idx, data = self._get_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1024, in _get_data
success, data = self._try_get_data()
File "/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 4844) exited unexpectedly.
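As the traceback suggests, the workers are likely exhausting shared memory. Here is a minimal stdlib-only sketch for checking the headroom in /dev/shm before training; /dev/shm is Linux-specific, and the 8 GiB threshold is a guess, not a requirement of this repo. The commented-out `file_system` sharing strategy is a known PyTorch workaround that avoids shm-backed tensors at the cost of file descriptors and disk I/O:

```python
# Hedged sketch: diagnose shared-memory headroom before training.
# /dev/shm is Linux-specific; the threshold below is a guess to tune
# for your batch size and number of workers.
import os
import shutil

def shm_free_gb(path="/dev/shm"):
    """Free space in GiB at `path`, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3

free = shm_free_gb()
if free is not None and free < 8:
    print(f"Only {free:.1f} GiB free in /dev/shm; workers may hit bus errors.")
    # Known PyTorch workaround (trades shm for file descriptors/disk):
    # import torch.multiprocessing as mp
    # mp.set_sharing_strategy("file_system")
```

In containers, the equivalent fix is raising the shm allocation itself (e.g. Docker's `--shm-size` flag) rather than working around it in code.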
However, when I change num_workers to 0, I hit another problem:
Using native 16bit precision.
| Name | Type | Params
0 | model | Model | 40 M
/home/zhenghanmo/anaconda3/envs/digit/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 36 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
Validation sanity check: 0it [00:00, ?it/s]Bus error (core dumped)
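Rather than jumping between 0 workers and the warning's suggested 36, a middle ground is to start with a small worker count and raise it only while /dev/shm holds up. A minimal sketch, with the cap of 4 being an arbitrary starting point rather than a recommendation from this repo:

```python
# Hedged sketch: pick a conservative DataLoader worker count.
# The cap of 4 is an arbitrary starting point; raise it gradually
# while monitoring shared-memory usage.
import os

def conservative_workers(cap=4):
    """Smallest of `cap` and the CPU count, but at least 1."""
    cpus = os.cpu_count() or 1
    return max(1, min(cap, cpus))

print(conservative_workers())
# e.g. DataLoader(dataset, batch_size=..., num_workers=conservative_workers())
```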