one GPU #39

Open
gh2517956473 opened this issue May 14, 2019 · 5 comments

Comments

@gh2517956473

Can I train with a single GPU with 12 GB of memory? Where does the code need to change?
Thank you very much!

@YuwenXiong
Contributor

Please follow this: #36 (comment)

@gh2517956473
Author

Thank you!

@lfdeep

lfdeep commented Jun 10, 2019

> Thank you!

Hello, I use one GPU, but this error occurred:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "upsnet/upsnet_end2end_train.py", line 414, in <module>
    upsnet_train()
  File "upsnet/upsnet_end2end_train.py", line 268, in upsnet_train
    data, label, _ = train_iterator.next()
  File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 330, in __next__
    idx, batch = self._get_batch()
  File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 309, in _get_batch
    return self.data_queue.get()
  File "/root/anaconda3/lib/python3.7/multiprocessing/queues.py", line 352, in get
    res = self._reader.recv_bytes()
  File "/root/anaconda3/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/anaconda3/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/root/anaconda3/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
  File "/root/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 227, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 31613) is killed by signal: Bus error. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
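
For anyone hitting the same bus error: it usually means the DataLoader worker processes ran out of shared memory (/dev/shm), which is especially common inside Docker containers started without a larger --shm-size. Besides enlarging /dev/shm, a minimal sketch of two standard PyTorch-level workarounds (plain PyTorch API, not specific to UPSNet; the dummy dataset below is only a stand-in for the real one):

import torch
import torch.multiprocessing
from torch.utils.data import DataLoader, TensorDataset

# Workaround 1: share tensors between worker processes via the file system
# instead of shared-memory segments, which avoids exhausting /dev/shm.
torch.multiprocessing.set_sharing_strategy('file_system')

# Workaround 2: disable worker processes entirely (num_workers=0); data loading
# runs in the main process, uses no shared memory, and gives readable tracebacks.
dummy = TensorDataset(torch.zeros(8, 3, 32, 32))  # stand-in for the real dataset
loader = DataLoader(dummy, batch_size=1, num_workers=0)
for (batch,) in loader:
    pass  # the training step would go here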

@lfdeep

lfdeep commented Jun 11, 2019

> Can I train with a single GPU with 12 GB of memory? Where does the code need to change?
> Thank you very much!

Hello, can you run the code successfully on a single GPU?

@pkuCactus

pkuCactus commented Aug 6, 2019

Thank you for the great work. What happens if I use Horovod on a single-GPU machine? I tried it and found it faster than not using Horovod; is there any problem with that? Also, how could I run multiple Horovod workers to mimic multiple GPUs on a single-GPU machine? Thanks a lot, I look forward to your reply.
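
On the single-GPU Horovod question: with only one worker, Horovod's allreduce is essentially a no-op, so a modest speed difference is plausible but worth profiling rather than taking for granted. To mimic several workers on one physical GPU you can launch multiple processes with horovodrun and pin them all to device 0; a rough sketch using the standard Horovod PyTorch API (not UPSNet-specific, and the script name in the comment is hypothetical):

import horovod.torch as hvd
import torch

# Launch with, e.g.:  horovodrun -np 2 python train_sketch.py
hvd.init()

# On a real multi-GPU box each process would use hvd.local_rank() to pick its
# own device; on a single-GPU machine all processes share device 0 instead.
if torch.cuda.is_available():
    torch.cuda.set_device(0)

print(f"Horovod rank {hvd.rank()} of {hvd.size()} pinned to GPU 0")

Keep in mind that each process holds its own copy of the model, so GPU memory is the limiting factor, and sharing one GPU will not reproduce real multi-GPU throughput.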
