Deadlock when running PyTorch 1.9.0 with CUDA 11.2 on Xeon Gold 6326 CPUs #68931
Labels
module: data parallel
oncall: distributed
triaged
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
1. Run a training script with DataParallel on GPUs 0 and 1; the run hangs.
The process only hangs: there is no error message in dmesg, the kernel log, nvidia-smi, the shell, or any log file.
Runs on GPUs that belong to different CPU sockets (0,2 or 0,3) appear to succeed.
When two GPUs attached to the same CPU socket are used, the run blocks during the forward pass (GPU sets 0,1; 0,2,3; 0,1,2; 0,1,3; 0,1,2,3); a minimal sketch follows below.
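A minimal sketch of the kind of training loop that triggers the hang. The model, data, and hyperparameters here are placeholders rather than the original training code; the relevant part is wrapping the model in nn.DataParallel over two GPUs on the same CPU socket.

```python
# Minimal DataParallel repro sketch (hypothetical model/data, not the original script).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda(0)

x = torch.randn(64, 1024).cuda(0)
y = torch.randint(0, 10, (64,)).cuda(0)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    optimizer.zero_grad()
    out = model(x)          # hangs here when GPUs 0 and 1 share a CPU socket
    loss = criterion(out, y)
    loss.backward()
    optimizer.step()
    print("step", step, "loss", loss.item())
```

GPU-to-socket affinity can be checked with `nvidia-smi topo -m`; in this setup the hang only appears when the selected GPUs include two that share a socket.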
Expected behavior
Expect all four GPUs to be fully utilized under DataParallel.
Environment
cpu: Xeon Gold 6326 x2
cuda: 11.2
pytorch: 1.9.0, compiled from source (following the build instructions, except for MAGMA)
python: 3.9, packages installed via pip
gpu: A40 x4
ram: 256 GB
os: Ubuntu 20.04 LTS server
machine: Dell R750xa
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang