How To Correctly Kill Multiple Processes During Multi-GPU Training #137

Closed
catalystfrank opened this issue Apr 10, 2017 · 1 comment

@catalystfrank

When training with examples/imagenet/main.py, I used the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python main.py [options] path/to/imagenetdir 1>a.log 2>a.err &

This starts 5 processes on the system, and 1 main process appears in nvidia-smi.

Most of the time (about 90%), after I first kill the main process, GPU usage drops to 0%, so I can kill the other 4 processes to release GPU memory and start a new training task. Sometimes (about 10%), after I have killed all 5 processes, the main process remains as "python [defunct]" and cannot be killed even with sudo kill -s 9; neither the GPU usage nor the GPU memory is released.
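(Not from the thread, but a minimal sketch of how this manual cleanup could be made a single step: launch the training script in its own process group so the main process and all of its workers can be killed together. The script path and arguments below are placeholders.)

import os
import signal
import subprocess

# Launch main.py in a new session, so it and every worker process it
# forks share one process group separate from the shell's.
proc = subprocess.Popen(
    ["python", "main.py", "path/to/imagenetdir"],
    start_new_session=True,
)

# Later, to stop training and all of its worker processes at once:
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)  # escalate to SIGKILL if it hangs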

Multi-GPU training happens where I use the following line in my code:

model = torch.nn.DataParallel(model).cuda()

Please give some hints on how to correctly kill multi-GPU PyTorch training processes.

Thanks.

@fehiepsi
Contributor

I usually kill a [defunct] process by killing its parent process. This thread might help you: https://askubuntu.com/questions/201303/what-is-a-defunct-process-and-why-doesnt-it-get-killed
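A minimal sketch of that approach, assuming a Linux system; the zombie PID below is a placeholder taken from ps/top output:

import os
import signal

zombie_pid = 12345  # placeholder: PID of the "python [defunct]" entry

# A defunct (zombie) process has already exited; it lingers only because
# its parent has not reaped it, so signals sent to the zombie itself do nothing.
with open(f"/proc/{zombie_pid}/status") as f:
    parent_pid = next(
        int(line.split()[1]) for line in f if line.startswith("PPid:")
    )

# Terminate the parent; init then adopts and reaps the zombie entry.
os.kill(parent_pid, signal.SIGTERM)

From the shell, ps -o ppid= -p <zombie_pid> gives the same parent PID.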

@subramen closed this as completed on Mar 9, 2022