How To Correctly Kill Multiple Processes During Multi-GPU Training #137

Closed
catalystfrank opened this issue Apr 10, 2017 · 1 comment

@catalystfrank

When training with examples/imagenet/main.py, I used the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python main.py [options] path/to/imagenetdir 1>a.log 2>a.err &

This starts 5 processes on the system, and 1 main process appears in nvidia-smi.

Most of the time (about 90%), after I first kill the main process, GPU usage drops to 0%, so I can kill the other 4 processes to release GPU memory and start a new training task. Sometimes (about 10%), after I have killed all 5 processes, the main process remains as "python [defunct]" and cannot be killed even with sudo kill -s 9; neither the GPU usage nor the GPU memory is released.
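(Not from the thread, but a minimal sketch of how this manual cleanup could be made a single step: launch the training script in its own process group so the main process and all of its workers can be killed together. The script path and arguments below are placeholders.)

import os
import signal
import subprocess

# Launch main.py in a new session, so it and every worker process it
# forks share one process group separate from the shell's.
proc = subprocess.Popen(
    ["python", "main.py", "path/to/imagenetdir"],
    start_new_session=True,
)

# Later, to stop training and all of its worker processes at once:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # escalate to SIGKILL if it hangs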

Multi-GPU training happens where I use the following line in my code:

model = torch.nn.DataParallel(model).cuda()

Please give some hints on how to correctly kill multi-GPU PyTorch training processes.

Thanks.

@fehiepsi
Contributor

I usually kill a [defunct] process by killing its parent process. This thread might help you: https://askubuntu.com/questions/201303/what-is-a-defunct-process-and-why-doesnt-it-get-killed
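A minimal sketch of that approach, assuming a Linux system; the zombie PID below is a placeholder taken from ps/top output:

import os
import signal

zombie_pid = 12345  # placeholder: PID of the "python [defunct]" entry

# A defunct (zombie) process has already exited; it lingers only because
# its parent has not reaped it, so signals sent to the zombie itself do nothing.
with open(f"/proc/{zombie_pid}/status") as f:
    parent_pid = next(
        int(line.split()[1]) for line in f if line.startswith("PPid:")
    )

# Terminate the parent; init then adopts and reaps the zombie entry.
os.kill(parent_pid, signal.SIGTERM)

From the shell, ps -o ppid= -p <zombie_pid> gives the same parent PID.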

@subramen closed this as completed on Mar 9, 2022