
RFC: clean up inactive/failed {service|scope}'s cgroups when the last process exits #17422

Merged
merged 1 commit into systemd:master on Oct 27, 2020

Conversation

@anitazha (Member) commented on Oct 23, 2020

So you have a service/scope and one of its processes is stuck in D state.
The unit is sent `systemctl stop` and the process is sent SIGTERM and
then SIGKILL. It doesn't respond, and after TimeoutStopSec the unit is
marked as failed because the stop times out while processes are still around.
The unit's cgroup remains in the file system (for example,
/system.slice/test.service).

When the process comes out of D state, it handles the pending signals and exits.
Pid1 gets the SIGCHLD and tries to figure out which unit the process belongs
to based on the cgroup it's in, but since the unit's cgroup path information
was released earlier, pid1 attributes the process to the unit's parent
(in the example above, system.slice). That parent cgroup usually has many
subgroups and processes under it and cannot be removed, so the stale
/system.slice/test.service lingers until something is started and stopped
under it again or system.slice itself is stopped.

D state is one example, but the same thing happens with any process that
sticks around after the final kill signal is sent.

The fix I'm proposing is to keep the unit's cgroup path information
around if pid1 fails to rmdir the cgroup. When the last process exits
and the cgroup empty notification arrives, the service/scope can then try
to prune the cgroup if the unit is marked as inactive/failed.
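A minimal standalone sketch of that pruning decision (not systemd source; it assumes cgroup v2 mounted at /sys/fs/cgroup and uses the hypothetical unit path from the example): check cgroup.procs for emptiness before attempting the rmdir that would otherwise fail while a straggler remains.

```c
/* Standalone sketch, not systemd code: decide whether a leftover unit cgroup
 * can be pruned. Assumes cgroup v2 mounted at /sys/fs/cgroup; the unit path
 * below is the hypothetical one from the example. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* A cgroup with no PIDs has an empty cgroup.procs file. (systemd's real
 * check, cg_is_empty_recursive(), also walks child cgroups.) */
static bool cgroup_is_empty(const char *cg_path) {
        char procs[4096];
        snprintf(procs, sizeof(procs), "%s/cgroup.procs", cg_path);

        FILE *f = fopen(procs, "re");
        if (!f)
                return errno == ENOENT;   /* already removed counts as empty */

        int c = fgetc(f);                 /* any byte means at least one PID */
        fclose(f);
        return c == EOF;
}

int main(void) {
        const char *cg = "/sys/fs/cgroup/system.slice/test.service";

        if (!cgroup_is_empty(cg)) {
                /* The D-state straggler is still inside: rmdir() would fail,
                 * so keep tracking the path and retry when the cgroup empty
                 * notification arrives. */
                fprintf(stderr, "%s not empty, keeping it tracked\n", cg);
                return 0;
        }

        if (rmdir(cg) < 0 && errno != ENOENT)
                fprintf(stderr, "rmdir %s: %s\n", cg, strerror(errno));
        else
                printf("pruned %s\n", cg);
        return 0;
}
```

On cgroup v2, rmdir() on a populated cgroup fails with EBUSY (or ENOTEMPTY if child cgroups remain), which is exactly why the stale directory survives the failed stop.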

@anitazha anitazha added the pid1 label Oct 23, 2020
@anitazha (Member, Author) commented

I think this might help with #16350, but I've only tested minimally with cgroup v2.

@poettering (Member) commented

Hmm, I'd do this differently. Maybe introduce unit_maybe_release_cgroup(), which is a wrapper around unit_release_cgroup() and checks first whether the cgroup is actually empty. If it's not, it returns immediately; otherwise it calls the real function.

Calling that function should then be safe and robust from pretty much any suitable place.
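A rough sketch of the shape such a wrapper could take (all types and helpers are stubbed here for illustration; this is not the merged systemd code): release the unit's cgroup bookkeeping only if the cgroup is already empty, otherwise keep it so the later empty notification can still find and prune it.

```c
/* Sketch only: Unit, the emptiness check and unit_release_cgroup() are
 * stubbed so this compiles standalone; in systemd the real counterparts
 * live in src/core and src/basic and look different. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Unit {
        char *cgroup_path;                 /* e.g. "/system.slice/test.service" */
} Unit;

/* Stub for a recursive "is this cgroup empty?" check. */
static bool cgroup_is_empty_recursive(const char *path) {
        (void) path;
        return false;                      /* pretend a straggler remains */
}

/* Stub for unit_release_cgroup(): forget the tracked path. */
static void unit_release_cgroup(Unit *u) {
        free(u->cgroup_path);
        u->cgroup_path = NULL;
}

/* The wrapper: only release if the cgroup is actually empty. */
static bool unit_maybe_release_cgroup(Unit *u) {
        if (!u->cgroup_path)
                return true;               /* nothing tracked, nothing to do */

        if (!cgroup_is_empty_recursive(u->cgroup_path))
                return false;              /* keep tracking for the empty notification */

        unit_release_cgroup(u);
        return true;
}

int main(void) {
        Unit u = { .cgroup_path = strdup("/system.slice/test.service") };
        printf("released: %s\n", unit_maybe_release_cgroup(&u) ? "yes" : "no");
        free(u.cgroup_path);
        return 0;
}
```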

@anitazha anitazha changed the title RFC: clean up failed {service|scope}'s cgroups when the last process exits RFC: clean up inactive/failed {service|scope}'s cgroups when the last process exits Oct 24, 2020
@anitazha
Copy link
Member Author

Updated with a wrapper. But I think it's only applicable in unit_prune_cgroup, where my original check was; the other two call sites (unit_free and unit_set_cgroup_path) don't use the wrapper because there the cgroup should be released regardless of whether it is empty.

Review comments on src/core/cgroup.c (three threads, two now outdated) and src/core/service.c (one thread), all resolved.
@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 26, 2020
Commit (title truncated in the page): …t process exits

If processes remain in the unit's cgroup after the final SIGKILL is
sent and the unit has exceeded its stop timeout, don't release the unit's
cgroup information. Pid1 will have failed to `rmdir` the cgroup path because
processes remain in the cgroup, and releasing would leave the cgroup
path on the file system with no tracking for pid1 to clean it up.

Instead, keep the information around until the last process exits and pid1
gets the cgroup empty notification. The service/scope can then prune
the cgroup if the unit is inactive/failed.
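A small sketch of the service/scope side of that description (hypothetical handler and state names, not the merged diff): when the cgroup empty notification arrives, pruning is attempted only if the unit has already ended up inactive or failed.

```c
/* Sketch with a hypothetical handler, not the merged diff: prune the leftover
 * cgroup on the empty notification only for inactive/failed units. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

typedef enum UnitActiveState {
        UNIT_ACTIVE,
        UNIT_DEACTIVATING,
        UNIT_INACTIVE,
        UNIT_FAILED,
} UnitActiveState;

static void on_cgroup_empty(UnitActiveState state, const char *cgroup_fs_path) {
        /* A unit that is still running (or still deactivating) keeps its cgroup. */
        if (state != UNIT_INACTIVE && state != UNIT_FAILED)
                return;

        /* The last process is gone, so the rmdir that failed at stop time
         * should now succeed. */
        if (rmdir(cgroup_fs_path) < 0 && errno != ENOENT)
                fprintf(stderr, "failed to prune %s: %s\n", cgroup_fs_path, strerror(errno));
}

int main(void) {
        /* The straggler from the example has just exited and the unit is failed. */
        on_cgroup_empty(UNIT_FAILED, "/sys/fs/cgroup/system.slice/test.service");
        return 0;
}
```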
@anitazha (Member, Author) commented

Updated according to comments

@anitazha anitazha removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 27, 2020
@keszybz (Member) commented on Oct 27, 2020

LGTM.

@keszybz keszybz added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Oct 27, 2020
@poettering (Member) commented

lgtm, too

@keszybz keszybz merged commit e08dabf into systemd:master Oct 27, 2020
@keszybz keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Oct 27, 2020
@anitazha anitazha deleted the clean_stray_cgroups branch November 18, 2020 22:50
facebook-github-bot pushed a commit to facebookarchive/rpm-backports that referenced this pull request Nov 19, 2020
Summary:
- Backport PR systemd/systemd#17495 to fix BPF program lifecycle
- Backport PR systemd/systemd#17422 to clean up cgroups more reliably after exit
- Backport PR systemd/systemd#17497 to add FixedRandomDelay= support

systemd/systemd#17495 is not fully upstream yet, but this is enough to fix our existing issues around BPF program lifecycles for device isolation on cgroup v2.

Reviewed By: davide125, nikitakoshikov

Differential Revision: D25090420

fbshipit-source-id: 37aef957e68fee17c4250ba7f49937d9e12de816