New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: clean up inactive/failed {service|scope}'s cgroups when the last process exits #17422
Conversation
I think this might help with #16350 but I only tested minimally with cgroups2 |
Hmm, I'd do this differently. Maybe introduce Calling that function should then be safe and robust from pretty much any suitable place. |
d0f4108
to
141fb00
Compare
Updated with a wrapper. But I think it's only applicable in |
…t process exits If processes remain in the unit's cgroup after the final SIGKILL is sent and the unit has exceeded stop timeout, don't release the unit's cgroup information. Pid1 will have failed to `rmdir` the cgroup path due to processes remaining in the cgroup and releasing would leave the cgroup path on the file system with no tracking for pid1 to clean it up. Instead, keep the information around until the last process exits and pid1 sends the cgroup empty notification. The service/scope can then prune the cgroup if the unit is inactive/failed.
141fb00
to
fe46a98
Compare
Updated according to comments |
LGTM. |
lgtm, too |
Summary: - Backport PR systemd/systemd#17495 to fix BPF program lifecycle - Backport PR systemd/systemd#17422 to clean up cgroups more reliably after exit - Backport PR systemd/systemd#17497 to add FixedRandomDelay= support systemd/systemd#17495 is not fully upstream yet, but this is enough to fix our existing issues around BPF program lifecycles for device isolation on cgroup v2. Reviewed By: davide125, nikitakoshikov Differential Revision: D25090420 fbshipit-source-id: 37aef957e68fee17c4250ba7f49937d9e12de816
So you have a service/scope and one of its processes is stuck in D state.
The unit is sent a
systemctl stop
and the process is sent the SIGTERMthen SIGKILL. It doesn't respond and after
TimeoutStopSec
the unit ismarked as failed because it times out while processes are still around.
The unit's cgroup remains in the file system (for example,
/system.slice/test.service
).When the process comes out of D state it processes the signals and exits.
Pid1 gets the SIGCHLD and tries to figure out what unit this process belongs
to based on what cgroup it's in, but since the unit's cgroup path information
was released earlier, pid1 thinks the process belongs to the unit's parent
(from the previous example, pid1 thinks the process belongs to system.slice).
This parent cgroup probably has many subgroups and processes under it
and cannot be removed, leaving the initial
/system.slice/test.service
there until something is started and stopped under it again or
system.slice
is stopped.D state is one example but this happens to any process that happens to
stick around after the final kill signal state.
The fix I'm proposing is to keep the unit's cgroup path information
around if pid1 fails to
rmdir
the cgroup. When the last process exitsand sends the cgroup empty notification, the service/scope can try to
prune the cgroup if the unit is marked as inactive/failed.