RFC: clean up inactive/failed {service|scope}'s cgroups when the last process exits #17422

anitazha · 2020-10-23T06:18:45Z

Suppose a service or scope has a process stuck in D state.
When the unit is stopped with systemctl stop, the process is sent SIGTERM
and then SIGKILL. It doesn't respond, and after TimeoutStopSec the unit is
marked as failed because the stop timed out while processes were still around.
The unit's cgroup remains in the file system (for example,
/system.slice/test.service).

When the process comes out of D state it processes the signals and exits.
Pid1 gets the SIGCHLD and tries to figure out what unit this process belongs
to based on what cgroup it's in, but since the unit's cgroup path information
was released earlier, pid1 thinks the process belongs to the unit's parent
(from the previous example, pid1 thinks the process belongs to system.slice).
This parent cgroup probably has many subgroups and processes under it
and cannot be removed, leaving the initial /system.slice/test.service
there until something is started and stopped under it again or
system.slice is stopped.
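
To illustrate the misattribution described above, here is a toy model in Python. It is not systemd's C code: the `tracked` dictionary and the `unit_for_cgroup` helper are hypothetical stand-ins for pid1's cgroup-path-to-unit bookkeeping, and the walk-up-to-parent lookup is an assumption based on the behaviour described in this PR.

```python
from typing import Dict, Optional


def unit_for_cgroup(tracked: Dict[str, str], path: str) -> Optional[str]:
    """Return the unit tracking `path`, falling back to parent cgroups."""
    while path:
        if path in tracked:
            return tracked[path]
        # Drop the last path component and retry with the parent cgroup.
        path = path.rsplit("/", 1)[0]
    return None


# Hypothetical bookkeeping: cgroup path -> owning unit.
tracked = {
    "/system.slice": "system.slice",
    "/system.slice/test.service": "test.service",
}

before = unit_for_cgroup(tracked, "/system.slice/test.service")

# The stop timeout fires and the unit's cgroup path information is released
# while the directory still exists on disk.
del tracked["/system.slice/test.service"]

after = unit_for_cgroup(tracked, "/system.slice/test.service")
```

In this model, `before` resolves to the unit itself, while `after` resolves to the parent slice, so nothing is left that would retry removing the stray directory.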

D state is one example, but this can happen to any process that
sticks around after the final kill signal has been sent.

The fix I'm proposing is to keep the unit's cgroup path information
around if pid1 fails to rmdir the cgroup. When the last process exits
and sends the cgroup empty notification, the service/scope can try to
prune the cgroup if the unit is marked as inactive/failed.
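
The proposed flow can be sketched as a small simulation. This is a sketch under stated assumptions, not the actual patch: the `Unit` class, the `cgroups` dictionary standing in for the cgroup file system, and the `prune_cgroup` and `on_cgroup_empty` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Unit:
    name: str
    state: str                    # "active", "inactive", or "failed"
    cgroup_path: Optional[str]    # None once the path has been released


def prune_cgroup(unit: Unit, cgroups: Dict[str, Set[int]]) -> bool:
    """Try to remove the unit's cgroup; forget the path only on success."""
    path = unit.cgroup_path
    if path is None:
        return False
    if cgroups.get(path):
        # Stand-in for rmdir failing on a populated cgroup:
        # keep unit.cgroup_path so a later attempt can find it.
        return False
    cgroups.pop(path, None)
    unit.cgroup_path = None
    return True


def on_cgroup_empty(unit: Unit, cgroups: Dict[str, Set[int]]) -> None:
    """Handle the cgroup empty notification for a unit."""
    if unit.state in ("inactive", "failed"):
        prune_cgroup(unit, cgroups)


# Hypothetical scenario: PID 4242 is stuck when the stop timeout fires.
cgroups = {"/system.slice/test.service": {4242}}
unit = Unit("test.service", "failed", "/system.slice/test.service")

first_try = prune_cgroup(unit, cgroups)   # fails, path is kept
cgroups[unit.cgroup_path].clear()         # the process finally exits
on_cgroup_empty(unit, cgroups)            # second attempt succeeds
```

The point of the design is that the path is forgotten only after the removal succeeds, so the empty notification still has something to act on.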

anitazha · 2020-10-23T06:21:11Z

I think this might help with #16350, but I only tested minimally with cgroup v2.

poettering · 2020-10-23T13:40:00Z

Hmm, I'd do this differently. Maybe introduce unit_maybe_release_cgroup(), which is a wrapper around unit_release_cgroup() and checks if the cgroup is actually empty first. If it's not, it returns immediately; otherwise it calls the real function.

Calling that function should then be safe and robust from pretty much any suitable place.
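
A minimal sketch of the suggested wrapper shape, modelled in Python rather than systemd's C. Only the two function names come from the comment above; the `Unit` class, the `is_empty` callback, and the boolean return value are assumptions made for illustration.

```python
from typing import Callable, Optional


class Unit:
    def __init__(self, cgroup_path: Optional[str]) -> None:
        self.cgroup_path = cgroup_path


def unit_release_cgroup(u: Unit) -> None:
    """Unconditionally forget the unit's cgroup path."""
    u.cgroup_path = None


def unit_maybe_release_cgroup(u: Unit, is_empty: Callable[[str], bool]) -> bool:
    """Release the cgroup path only if the cgroup is actually empty."""
    if u.cgroup_path is None:
        return True               # nothing left to release
    if not is_empty(u.cgroup_path):
        return False              # still populated: return immediately
    unit_release_cgroup(u)
    return True


busy = Unit("/system.slice/test.service")
kept = unit_maybe_release_cgroup(busy, lambda path: False)

idle = Unit("/system.slice/test.service")
released = unit_maybe_release_cgroup(idle, lambda path: True)
```

Because the guard makes the call a no-op while processes remain, the wrapper can be invoked repeatedly without losing track of a populated cgroup.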

anitazha · 2020-10-24T01:14:46Z

Updated with a wrapper. But I think it's only applicable in unit_prune_cgroup where my original check was; the other 2 uses don't use the wrapper because the cgroup should be released regardless of whether it is empty or not (they're unit_free and unit_set_cgroup_path).

src/core/cgroup.c

src/core/service.c

…t process exits

If processes remain in the unit's cgroup after the final SIGKILL is sent and the unit has exceeded stop timeout, don't release the unit's cgroup information. Pid1 will have failed to `rmdir` the cgroup path due to processes remaining in the cgroup and releasing would leave the cgroup path on the file system with no tracking for pid1 to clean it up.

Instead, keep the information around until the last process exits and pid1 sends the cgroup empty notification. The service/scope can then prune the cgroup if the unit is inactive/failed.

anitazha · 2020-10-27T00:58:07Z

Updated according to comments

keszybz · 2020-10-27T07:40:29Z

LGTM.

poettering · 2020-10-27T08:02:55Z

lgtm, too

Summary:
- Backport PR systemd/systemd#17495 to fix BPF program lifecycle
- Backport PR systemd/systemd#17422 to clean up cgroups more reliably after exit
- Backport PR systemd/systemd#17497 to add FixedRandomDelay= support

systemd/systemd#17495 is not fully upstream yet, but this is enough to fix our existing issues around BPF program lifecycles for device isolation on cgroup v2.

Reviewed By: davide125, nikitakoshikov

Differential Revision: D25090420

fbshipit-source-id: 37aef957e68fee17c4250ba7f49937d9e12de816

anitazha added the pid1 label Oct 23, 2020

anitazha force-pushed the clean_stray_cgroups branch from d0f4108 to 141fb00 Compare October 24, 2020 01:07

anitazha changed the title ~~RFC: clean up failed {service|scope}'s cgroups when the last process exits~~ RFC: clean up inactive/failed {service|scope}'s cgroups when the last process exits Oct 24, 2020

poettering requested changes Oct 26, 2020

poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 26, 2020

anitazha force-pushed the clean_stray_cgroups branch from 141fb00 to fe46a98 Compare October 27, 2020 00:57

anitazha removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 27, 2020

keszybz added the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Oct 27, 2020

keszybz merged commit e08dabf into systemd:master Oct 27, 2020

keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Oct 27, 2020

anitazha deleted the clean_stray_cgroups branch November 18, 2020 22:50

huww98 mentioned this pull request Jun 13, 2021

cephadm: workaround unit replace failure ceph/ceph#41829

Merged

3 tasks
