Using the unified hierarchy for /sys/fs/cgroup/systemd when legacy hierarchies are being used #3965
Conversation
A following patch will update cgroup handling so that the systemd controller (/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel resource controllers are on the legacy hierarchies. This would require distinguishing whether all controllers are on cgroup v2 or only the systemd controller is. In preparation, this patch renames cg_unified() to cg_all_unified(). This patch doesn't cause any functional changes.
if (r < 0)
        return r;

if (controller && streq(controller, SYSTEMD_CGROUP_CONTROLLER))
streq_ptr() seems a good fit.
Ah, indeed. I was actually reading the streq() implementation, wondering whether it'd do the pointer test first. Will switch to streq_ptr().
I didn't go through all the details, but the general idea is sound and the patch lgtm.
(force-pushed from f885b7d to 0b96748)
lol forgot to do "make install" before testing the updates. There was a silly bug in detect_unified_cgroup_hierarchy().
(force-pushed from 0b96748 to f7724eb)
The error reported by the CentOS CI seems legit, btw:
Hmm... nspawn worked fine in all three cgroup modes here. No idea what the difference is from the first revision. Digging into it.
@htejun ,

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 0c1c21d..f6af953 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -326,7 +326,9 @@ static int detect_unified_cgroup_hierarchy(void) {
r = parse_boolean(e);
if (r < 0)
return log_error_errno(r, "Failed to parse $UNIFIED_CGROUP_HIERARCHY.");
- if (r > 0)
+ else if (r == 0)
+ arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_NONE;
+ else
arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_ALL;
return 0;
}
I see, the same bug on the env path. I was scratching my head viciously wondering why I hadn't been able to reproduce it. Yeap, UNIFIED_CGROUP_HIERARCHY=0 reproduces it. Fixing it. Thanks!
…rarchy

Currently, systemd uses either the legacy hierarchies or the unified hierarchy. When the legacy hierarchies are used, systemd uses a named legacy hierarchy mounted on /sys/fs/cgroup/systemd without any kernel controllers for process management. Due to the shortcomings of the legacy hierarchy, this involves a lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy hierarchies, there's no reason for systemd to use a legacy hierarchy for management even if the kernel resource controllers need to be mounted on legacy hierarchies. It can simply mount the unified hierarchy under /sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies. This disables a significant amount of fragile workaround logic and allows using features which depend on unified-hierarchy membership, such as the bpf cgroup v2 membership test. In time, this would also allow deleting the said complexities.

This patch updates systemd so that it prefers the unified hierarchy for the systemd cgroup controller hierarchy when legacy hierarchies are used for kernel resource controllers.

* cg_unified(@controller) is introduced, which tests whether the specified controller is on the unified hierarchy; it is used to choose the unified-hierarchy code path for process and service management when available. Kernel-controller-specific operations remain gated by cg_all_unified().

* The "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to force the use of the legacy hierarchy for the systemd cgroup controller.

* nspawn: By default, nspawn uses the same hierarchies as the host. If UNIFIED_CGROUP_HIERARCHY is set to 1, the unified hierarchy is used for all; if 0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of three options - legacy, only the systemd controller on unified, and fully unified. The value is passed into the mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER into the actual mount option is moved to mount_legacy_cgroup_hierarchy() so that it can take an appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open-coded integer values to indicate the cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy-container-on-unified-host support and fixed another bug in detect_unified_cgroup_hierarchy().
(force-pushed from f7724eb to 5da38d0)
I'm trying to pass

-bash-4.3# cat /proc/cmdline
root=/dev/sda1 raid=noautodetect loglevel=2 init=/usr/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no systemd.unit=multi-user.target
-bash-4.3# grep cgroup /proc/self/mountinfo
24 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:21 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
27 24 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,cpu,cpuacct
28 24 0:24 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,perf_event
29 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
30 24 0:26 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
31 24 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,blkio
33 24 0:29 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,cpuset
34 24 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,freezer
35 24 0:31 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,hugetlb
36 24 0:32 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory

I think we shouldn't mount
So, the intention was systemd.unified_cgroup_hierarchy indicating whether the controllers are on unified or not, and systemd.legacy_systemd_cgroup_controller indicating whether the systemd controller is on legacy or not. There are three modes to select from, after all. The other thing is that whether the systemd controller is on unified or legacy doesn't matter all that much to users, so I thought it'd make sense to put the selection behind another flag. If you have better ideas, please let me know.
Well, I think this matters. See #3388
I think I should reread your commit message :)
r = get_proc_cmdline_key("systemd.legacy_systemd_cgroup_controller", NULL);
if (r > 0) {
        wanted = false;
} else {
nitpick: no {} for single-line if blocks
Oh, wow, interesting idea, never thought about that. Looks OK though, even though I don't like the complexity this adds to already complex code... But it is definitely an OK way to deal with the political issues... Would have merged, but it needs a rebase first; github doesn't let me merge it...
Seems to work nicely here. Merged.
It works for all hierarchies. Again, the only restriction is to which hierarchy a controller is attached in the host and whether that coincides with what the container expects. Other than that, you can do whatever you want to do (for example, detach one controller from the v2 hierarchy, mount it on a separate v1 hierarchy and expose that to a container); however, please note that controllers may be pinned down by internal refs and thus may not be able to detach from the current hierarchy. You gotta figure out what controllers should go where on boot.
Quoting Tejun Heo (notifications@github.com):
Right, that's not going to be enough. The controller would need to be in both

Thanks for the information.
@htejun, can named controllers like
@brauner, a hierarchy can be mounted multiple times, but that would just make the same hierarchy show up in multiple places, and I don't think what's implemented in systemd-nspawn now is correct with multiple containers. When the container cgroup type is different from the host, the host side should set the other type for the container. I think what we should do for the hybrid mode is simply creating a v1 name=systemd hierarchy in parallel which systemd itself won't use but which can help other tools expecting that hierarchy.
@htejun, basically we run something like this:

our_v2_cgroup=$(get_our_v2_cgroup)
mkdir -p /tmp/v1
mount -t cgroup -o none,name=systemd,xattr cgroup /tmp/v1
mkdir -p "/tmp/v1/${our_v2_cgroup}"
echo "PID1-of-the-container" > "/tmp/v1/${our_v2_cgroup}/cgroup.procs"

This works because we spawn every container inside its own cgroup
@htejun How does mounting a

As an aside, in
@cyphar ,
@htejun , oh, I got it :-) i.e.
or so.
@evverx That's what my question was -- I didn't see how us creating a
@cyphar, sorry, I misunderstood. I mean that
The question is "will systemd do unholy things to our processes if we don't touch the v2 hierarchy?". Because if systemd's v1 interface is only going to be skin-deep (just faking it for container runtimes), then we still have the same problem as if we just mounted

Note: "unholy things" involves systemd reorganising the cgroup associations of processes that I created and am attempting to control. This happened quite a lot in older releases (and because of our issues with the
@cyphar , I need some context. I'll read moby/moby#23374, moby/moby#17704
@evverx You can also look at opencontainers/runc#325 too (and things that link to it). Unfortunately most of the actual bug reports I've seen are on an internal bug tracker.
@cyphar , thanks for the link! I'll take a look.

I "fixed" the "no subsystem for mount" error:

--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -149,7 +149,7 @@ func getCgroupMountsHelper(ss map[string]bool, mi io.Reader, all bool) ([]Mount,
if sepIdx == -1 {
return nil, fmt.Errorf("invalid mountinfo format")
}
- if txt[sepIdx+3:sepIdx+9] != "cgroup" {
+ if txt[sepIdx+3:sepIdx+10] == "cgroup2" || txt[sepIdx+3:sepIdx+9] != "cgroup" {
continue
}
fields := strings.Split(txt, " ")

So, I can run

-bash-4.3# grep cgroup2 /proc/self/mountinfo
25 24 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
-bash-4.3# runc run hola
/ # cat /proc/self/cgroup
10:freezer:/hola
9:pids:/user.slice/user-0.slice/session-1.scope/hola
8:net_cls,net_prio:/hola
7:memory:/user.slice/user-0.slice/session-1.scope/hola
6:cpuset:/hola
5:hugetlb:/hola
4:perf_event:/hola
3:devices:/user.slice/user-0.slice/session-1.scope/hola
2:cpu,cpuacct:/user.slice/user-0.slice/session-1.scope/hola
1:blkio:/user.slice/user-0.slice/session-1.scope/hola
0::/user.slice/user-0.slice/session-1.scope
/ # ls /sys/fs/cgroup/systemd
ls: /sys/fs/cgroup/systemd: No such file or directory

I'm trying to understand
Sadly, I'm not sure that the parallel
@evverx, ah yes, you're right. I missed mkdir_parents(). What systemd-nspawn does is correct, and I think it is a robust way to deal with the situation.
@brauner, the term "named controller" is a bit misleading. The name= option specifies the identifier for the hierarchy, as without actual controllers there's no other way of specifying that hierarchy; accordingly, name= is only allowed on hierarchies which do not have any controllers attached to them. As such, there can be only one hierarchy with a given name; however, the hierarchy can be mounted multiple times just like any other hierarchy or filesystem can be. It'll just show the same hierarchy on different mount points. As long as the container preparation sets up its cgroup as systemd-nspawn does, everything should be fine.
@evverx, and yes again, we're just setting up the name=systemd hierarchy in parallel to the cgroup v2 hierarchy in both hybrid and v2 modes for backward compatibility, without actually using it for anything.

@cyphar, I can't tell much without specifics, but if you aren't using "Delegate=", problems are expected with controllers. I can't see how cooperation between two managing agents can be achieved without one telling the other that it's gonna take over some part of the hierarchy. I don't understand why specifying "Delegate=" is a huge problem, but one way or the other you'll have to do it. However, that doesn't have anything to do with the "name=systemd" hierarchy, as that hierarchy will always be fully instantiated and thus processes won't be relocated once set up.
@htejun, yes, but the
Nevermind, I got mixed up here.
It's still going to be a problem though. We're taking care of placing users into writeable
@brauner, I'm working on a change to make the v1 name=systemd hierarchy always available in hybrid mode. That should work for your case, right?
That depends on what that means. Does it mean in addition to
The cgroup2 one would have a different name, obviously.
@htejun , Currently

-bash-4.3# systemd-run --setenv UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn -D /var/lib/machines/nspawn-root/ -b systemd.unit=multi-user.target
-bash-4.3# systemd-cgls
...
-.slice
├─machine.slice
│ └─machine-nspawn\x2droot.scope
│ ├─356 /usr/lib/systemd/systemd systemd.unit=multi-user.target
│ ├─380 /usr/lib/systemd/systemd-journald
│ ├─387 /usr/lib/systemd/systemd-logind
│ ├─389 /usr/lib/systemd/systemd-resolved
│ ├─391 /sbin/agetty --noclear --keep-baud console 115200,38400,9600 vt220
│ └─392 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfi...
The "real" layout is
Also, should all other tools manually clean up their
Just posted #4670, which puts the cgroup v2 systemd hierarchy on /sys/fs/cgroup/systemd-cgroup2 (any idea for a better name?) while maintaining the "name=systemd" hierarchy on /sys/fs/cgroup/systemd in parallel. This should avoid issues with most tools. For the ones which fail to parse if there's an entry for the v2 hierarchy in /proc/$PID/cgroup, I have no idea yet.

@evverx, yeah, systemd-cgls would need to select the mode per machine. I think we have the same problem without the hybrid mode tho. It shouldn't be too difficult to teach systemd-cgls to switch modes per-machine, right?
@evverx, as for cleaning up afterwards, cg_trim() taking care of the v1 hierarchy should be enough, right?
@htejun good idea to make the
These two patches make systemd prefer the unified hierarchy for /sys/fs/cgroup/systemd/ when the kernel controllers are on legacy hierarchies. While this adds yet another cgroup setup variant, it allows configurations which require cgroup v1 (e.g. for the cpu controller or for backward compatibility) to avoid the nasty complications of using a legacy hierarchy for process management. This also opens the door for eventually removing legacy-hierarchy-based process management, as the kernel has been shipping with cgroup v2 support for quite a while now; that should get rid of a lot of workarounds for cgroup v1's shortcomings while not impacting compatibility in any noticeable way.