During training with examples/imagenet/main.py, the run starts 5 processes on the system, and 1 main process appears in nvidia-smi.
Most of the time (roughly 90%), after I kill the main process first, GPU usage drops to 0% and I can then kill the other 4 processes to release the GPU memory and start a new training task. Occasionally (the other 10%), after I have killed all 5 processes, the main process remains as "python [defunct]" and cannot be killed even with sudo kill -s 9; neither the GPU utilization nor the GPU memory is released.
Multi-GPU training happens where I use the following line in my code:
model = torch.nn.DataParallel(model).cuda()
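For context, here is a minimal, self-contained sketch of that setup; the toy model and dummy batch below are placeholders, not the actual network from examples/imagenet/main.py. DataParallel itself runs its GPU replicas in threads inside the main process, so the extra system processes most likely come from the DataLoader workers.

import torch
import torch.nn as nn

# Placeholder model standing in for the real network built in main.py.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Replicate the model across all visible GPUs. Each forward call scatters the
# batch along dim 0, runs the replicas in parallel threads within this single
# process, and gathers the outputs back onto the default GPU.
model = torch.nn.DataParallel(model).cuda()

inputs = torch.randn(32, 128).cuda()   # dummy input batch
outputs = model(inputs)                # forward pass split across the GPUs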
Please give some hints on how to correctly kill multi-GPU PyTorch training processes.
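For reference, one approach I have seen (not an official PyTorch recipe, just a sketch) is to kill the whole process group so the DataLoader workers die together with the main process; the PID below is hypothetical:

import os
import signal

main_pid = 12345  # hypothetical PID of the main python process shown in nvidia-smi

# SIGKILL the entire process group; the DataLoader worker processes are
# normally in the same group as the main training process, so they go
# down with it instead of being orphaned.
os.killpg(os.getpgid(main_pid), signal.SIGKILL)

A "python [defunct]" entry is a zombie that has already exited, so no signal can remove it; it only disappears once its parent reaps it or the parent itself exits.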
Thanks.