
Deadlock when running PyTorch 1.9.0 with CUDA 11.2 on Xeon Gold 6326 CPU #68931

Open
Heermosi opened this issue Nov 26, 2021 · 1 comment
Labels
module: data parallel, oncall: distributed, triaged

Comments


Heermosi commented Nov 26, 2021

🐛 Bug

  1. Training with DataParallel on GPUs 0 & 1 fails: the code hangs at module.forward.
  2. Training on GPUs 0 & 3 runs, but the gradients are abnormal.
  3. Training on a single GPU runs normally.

To Reproduce

Steps to reproduce the behavior:

1. Run a training script with DataParallel on GPUs 0 and 1: the run hangs.

The process only hangs; there is no error message in dmesg, the kernel log, nvidia-smi, the shell, or the application log.

2. Run the same training script with DataParallel on GPUs 0 and 3: it does not hang, but the gradients are abnormal.

In fact, a run appears successful when the GPUs used belong to different CPU sockets (0,2 or 0,3). As soon as two of the GPUs belong to the same CPU socket (0,1; 0,2,3; 0,1,2; 0,1,3; 0,1,2,3), the run blocks in the forward pass. A minimal sketch of such a script is given below.
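As a stand-in for the actual training code (the report used the CRAFT network, which is not included here), the following minimal DataParallel sketch illustrates the setup; the tiny convolutional model, batch size, and input shape are placeholders, not the real workload. The hang is reported to occur in the first forward pass when the selected GPUs share a CPU socket.

```python
# Minimal repro sketch (stand-in model; assumptions noted above).
import torch
import torch.nn as nn

def main():
    model = nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(16, 3, 3, padding=1),
    )
    # device_ids=[0, 1] reportedly hangs on this machine; [0, 3] runs but
    # produces abnormal gradients.
    model = nn.DataParallel(model.cuda(0), device_ids=[0, 1])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    x = torch.randn(8, 3, 224, 224, device="cuda:0")
    out = model(x)            # reported hang occurs in this forward pass
    loss = out.mean()
    loss.backward()
    opt.step()
    print("one training step completed")

if __name__ == "__main__":
    main()
```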

Expected behavior

The training should fully utilize all 4 GPUs with DataParallel.

Environment

CPU: Xeon Gold 6326 x2
CUDA: 11.2
PyTorch: 1.9.0, compiled from source (following the build instructions, except for MAGMA)
Python: 3.9 (pip)
GPU: A40 x4
RAM: 256 GB
OS: Ubuntu 20.04 LTS Server
Machine: Dell R750xa
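For completeness, the same environment details can be gathered with PyTorch's bundled collect_env utility (the standard step in PyTorch bug reports); this is only a convenience snippet, not output captured from the affected machine.

```python
# Prints PyTorch, CUDA, driver and OS details to stdout using the
# collect_env script that ships with PyTorch.
from torch.utils.collect_env import main as collect_env

if __name__ == "__main__":
    collect_env()
```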

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

samdow added the module: data parallel, oncall: distributed, and triaged labels on Nov 30, 2021

Heermosi commented Dec 1, 2021

For a simple test, you may refer to the CRAFT project.
We were working on retraining it; it is compatible with PyTorch 1.9.0.
