
Deadlock when running PyTorch 1.9.0 with CUDA 11.2 on Xeon Gold 6326 CPU #68931

Open
Heermosi opened this issue Nov 26, 2021 · 1 comment
Labels
module: data parallel, oncall: distributed, triaged

Comments


Heermosi commented Nov 26, 2021

🐛 Bug

  1. Training with DataParallel on GPUs 0 & 1 fails: the code hangs at module.forward.
  2. Training on GPUs 0 & 3 runs, but the gradients are abnormal.
  3. Training on a single GPU runs normally.

To Reproduce

Steps to reproduce the behavior:

1. Run a training script with DataParallel on GPUs 0 and 1: the run hangs.

The process only hangs; there is no error message in dmesg, the kernel log, nvidia-smi, the shell, or the application log.

2. Run the same training script with DataParallel on GPUs 0 and 3: it does not hang, but the gradients are abnormal.

In fact, a run appears successful when the GPUs used belong to different CPU sockets (0,2 or 0,3). As soon as two of the GPUs belong to the same CPU socket (0,1; 0,2,3; 0,1,2; 0,1,3; 0,1,2,3), the run blocks in the forward pass. A minimal sketch of such a script is given below.
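As a stand-in for the actual training code (the report used the CRAFT network, which is not included here), the following minimal DataParallel sketch illustrates the setup; the tiny convolutional model, batch size, and input shape are placeholders, not the real workload. The hang is reported to occur in the first forward pass when the selected GPUs share a CPU socket.

```python
# Minimal repro sketch (stand-in model; assumptions noted above).
import torch
import torch.nn as nn

def main():
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 3, 3, padding=1),
    )
    # device_ids=[0, 1] reportedly hangs on this machine; [0, 3] runs but
    # produces abnormal gradients.
    model = nn.DataParallel(model.cuda(0), device_ids=[0, 1])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(8, 3, 224, 224, device="cuda:0")
    out = model(x)            # reported hang occurs in this forward pass
    loss = out.mean()
    loss.backward()
    opt.step()
    print("one training step completed")

if __name__ == "__main__":
    main()
```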

Expected behavior

The training should fully utilize all 4 GPUs with DataParallel.

Environment

CPU: Xeon Gold 6326 x2
CUDA: 11.2
PyTorch: 1.9.0, compiled from source (following the build instructions, except for MAGMA)
Python: 3.9 (pip)
GPU: A40 x4
RAM: 256 GB
OS: Ubuntu 20.04 LTS Server
Machine: Dell R750xa
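For completeness, the same environment details can be gathered with PyTorch's bundled collect_env utility (the standard step in PyTorch bug reports); this is only a convenience snippet, not output captured from the affected machine.

```python
# Prints PyTorch, CUDA, driver and OS details to stdout using the
# collect_env script that ships with PyTorch.
from torch.utils.collect_env import main as collect_env

if __name__ == "__main__":
    collect_env()
```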

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

samdow added the module: data parallel, oncall: distributed, and triaged labels on Nov 30, 2021

Heermosi commented Dec 1, 2021

For a simple test, you may refer to the CRAFT project.
We were working on retraining it; it is compatible with PyTorch 1.9.0.
