Using the unified hierarchy for /sys/fs/cgroup/systemd when legacy hierarchies are being used #3965
Conversation
zonque reviewed Aug 16, 2016
+        if (r < 0)
+                return r;
+
+        if (controller && streq(controller, SYSTEMD_CGROUP_CONTROLLER))
htejun (Contributor) Aug 16, 2016
Ah, indeed. I was actually reading streq() implementation wondering whether it'd first do the pointer test. Will switch to streq_ptr().
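For context, a sketch of the distinction (illustrative, not the verbatim string-util.h source; the streq() macro here is an assumption matching systemd's usual strcmp() wrapper):

#include <stdbool.h>
#include <string.h>

#define streq(a, b) (strcmp((a), (b)) == 0)   /* not NULL-safe */

/* NULL-safe equality in the spirit of streq_ptr(): two NULL pointers
 * compare equal, and NULL never equals a real string. */
static inline bool streq_ptr(const char *a, const char *b) {
        if (a && b)
                return streq(a, b);
        return a == b; /* true only when both are NULL */
}

With streq_ptr(), the explicit `controller &&` guard in the quoted hunk becomes unnecessary, since SYSTEMD_CGROUP_CONTROLLER is never NULL.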
keszybz reviewed Aug 16, 2016
+/* -1: unknown
+ *  0: both systemd and controller hierarchies on legacy
+ *  1: only systemd hierarchy on unified
+ *  2: both systemd and controller hierarchies on unified */
keszybz (Owner) Aug 16, 2016
This should be an anonymous enum:
enum {
CGROUP_UNIFIED_NONE = 0,
CGROUP_UNIFIED_SYSTEMD = 1,
CGROUP_UNIFIED_ALL = 2
};
keszybz reviewed Aug 16, 2016
 static thread_local int unified_cache = -1;
-int cg_all_unified(void) {
+static int cg_update_unified(void)
+{
keszybz reviewed Aug 16, 2016
+        if (r < 0)
+                return true;
+        if (r == 0)
+                return (wanted = true);
keszybz (Owner) Aug 16, 2016
Nah, please don't do inline assignment like that unless absolutely necessary.
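In the spirit of this comment, the quoted branch could instead read (a fragment in the patch's own context, where r and wanted come from the surrounding function):

if (r < 0)
        return true;
if (r == 0) {
        wanted = true;
        return true;
}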
keszybz (Owner) Aug 16, 2016
The style changed a bit over the years. Some of the older code doesn't match, but we usually only update it when changing for other reasons.
keszybz reviewed Aug 16, 2016
+
+        if (all_unified < 0 || systemd_unified < 0)
+                return log_error_errno(all_unified < 0 ? all_unified : systemd_unified,
+                                       "Couldn't determine if we are running in the unified hierarchy: %m");
keszybz (Owner) Aug 16, 2016
Wouldn't it be better to assume some default value in this case and continue? Failing fatally in the setup code is something best avoided...
htejun (Contributor) Aug 16, 2016
We have multiple places where we fail fatally if cg_[all_]unified() fail. It indicates that the basic mount setup went horribly wrong and once the results are established they never fail. I can't think of a good recovery action here as it'd mean that process management won't work.
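For reference, this kind of detection typically boils down to a statfs() magic-number check; a minimal sketch (illustrative only — systemd's real logic lives in src/basic/cgroup-util.c and caches the result):

#include <errno.h>
#include <linux/magic.h>
#include <sys/statfs.h>

/* Returns >0 if path is on cgroup2, 0 if not, -errno on error. */
static int path_is_cgroup2(const char *path) {
        struct statfs fs;

        if (statfs(path, &fs) < 0)
                return -errno;

        return fs.f_type == CGROUP2_SUPER_MAGIC;
}

Once the mount setup has been validated at boot, a check like this can only fail if /sys/fs/cgroup itself is broken, which is why failing fatally is considered reasonable here.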
keszybz reviewed Aug 16, 2016
-        else
-                log_debug("Using cgroup controller " SYSTEMD_CGROUP_CONTROLLER ". File system hierarchy is at %s.", path);
+        else {
+                if (systemd_unified > 0)
I didn't go through all the details, but the general idea is sound and the patch lgtm.
lol forgot to do "make install" before testing the updates. There was a silly bug in detect_unified_cgroup_hierarchy().
The error reported by the CentOS CI seems legit, btw:
Hmm... nspawn worked fine in all three cgroup modes here. No idea what the difference is from the first revision. Digging into it.
@htejun ,

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 0c1c21d..f6af953 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -326,7 +326,9 @@ static int detect_unified_cgroup_hierarchy(void) {
         r = parse_boolean(e);
         if (r < 0)
                 return log_error_errno(r, "Failed to parse $UNIFIED_CGROUP_HIERARCHY.");
-        if (r > 0)
+        else if (r == 0)
+                arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_NONE;
+        else
                 arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_ALL;
         return 0;
 }
I see, the same bug on the env path. I was scratching my head viciously wondering why I hadn't been able to reproduce it. Yeap, UNIFIED_CGROUP_HIERARCHY=0 reproduces it. Fixing it. Thanks!
I'm trying to pass systemd.unified_cgroup_hierarchy=no:

-bash-4.3# cat /proc/cmdline
root=/dev/sda1 raid=noautodetect loglevel=2 init=/usr/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no systemd.unit=multi-user.target
-bash-4.3# grep cgroup /proc/self/mountinfo
24 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:21 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
27 24 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,cpu,cpuacct
28 24 0:24 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,perf_event
29 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
30 24 0:26 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
31 24 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,blkio
33 24 0:29 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,cpuset
34 24 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,freezer
35 24 0:31 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,hugetlb
36 24 0:32 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory

I think we shouldn't mount cgroup2 on /sys/fs/cgroup/systemd when systemd.unified_cgroup_hierarchy=no is passed.
So, the intention was for systemd.unified_cgroup_hierarchy to indicate whether the controllers are on unified, and systemd.legacy_systemd_cgroup_controller to indicate whether the systemd controller is on legacy. There are three modes to select from, after all. The other thing is that whether the systemd controller is on unified or legacy doesn't matter all that much to users, so I thought it'd make sense to put the selection behind another flag. If you have better ideas, please let me know.
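To make the three-way selection concrete, here is a sketch of the decision just described (the function name and parameters are illustrative, not systemd's actual code; the enum mirrors the one suggested earlier in the review):

#include <stdbool.h>

typedef enum {
        CGROUP_UNIFIED_NONE,    /* both systemd and controllers on legacy (v1) */
        CGROUP_UNIFIED_SYSTEMD, /* systemd hierarchy on unified, controllers on legacy */
        CGROUP_UNIFIED_ALL,     /* everything on unified (v2) */
} CGroupUnified;

/* unified_requested:        systemd.unified_cgroup_hierarchy on the kernel cmdline
 * legacy_systemd_requested: systemd.legacy_systemd_cgroup_controller */
static CGroupUnified pick_mode(bool unified_requested, bool legacy_systemd_requested) {
        if (unified_requested)
                return CGROUP_UNIFIED_ALL;
        if (legacy_systemd_requested)
                return CGROUP_UNIFIED_NONE;
        return CGROUP_UNIFIED_SYSTEMD; /* the hybrid mode this PR introduces */
}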
Well, I think this matters. See #3388
I think I should reread your commit message :)
keszybz added the nspawn, cgroups, pid1 labels Aug 19, 2016
poettering reviewed Aug 19, 2016
+        r = get_proc_cmdline_key("systemd.legacy_systemd_cgroup_controller", NULL);
+        if (r > 0) {
+                wanted = false;
+        } else {
Oh, wow, interesting idea, never thought about that. Looks OK though, even though I don't like the complexity this adds to already complex code... But it is definitely an OK way to deal with the political issues... Would have merged, but it needs a rebase first; github doesn't let me merge it...
poettering added the looks-good/needs-rebase label Aug 19, 2016
keszybz removed the looks-good/needs-rebase label Aug 19, 2016
keszybz merged commit 5da38d0 into systemd:master Aug 20, 2016
added a commit that referenced this pull request Aug 20, 2016
Seems to work nicely here. Merged.
phedders commented Nov 8, 2016
Is this the change that snuck into 232 that has broken LXC, Docker and RKT?
phedders commented Nov 8, 2016
Seems a bit unfortunate to push that out before the dependents were ready for it. Caused a lot of pain for many people, I suspect - if my experience is anything to go by.
Oops, sorry about that. Didn't foresee failures with other tools. In principle, as long as "Delegate=" is used and the tools don't mangle the systemd hierarchy, it should all work, but yeap, reality is always harder. Should the hybrid mode default to off, at least for now?
Do we know which other tools are affected by this? @evverx mentioned runc, rkt and lxc. What about docker? Do bug reports exist to notify them about this issue?
Docker uses
(@brauner , @stgraber , why not reopen lxc/lxc#1280?)
@htejun , yes, I think the hybrid mode should be disabled by default. @poettering, @keszybz?
Welp ;) Seems we have no choice.
Yes, we can certainly re-open lxc/lxc#1280 to track this. It would be good if the hybrid mode were disabled by default for now. @htejun as we've discussed with @serge and @stgraber during Plumbers, when a
I sent PR #4628 to invert the default for now, but still keep an opt-in for LXC/docker/rkt/etc. developers. I will apply this downstream at least, but now that we have issues filed for affected software I think we should revert this upstream for now too (and maybe push out 233 relatively quickly).
@martinpitt If a v233 is made quickly, it should also address #4575 I think
@brauner, I'm still not quite understanding why this is a fundamental problem. If the host has its systemd hierarchy on v2 and other controllers on v1, lxc should still be able to namespace the v1 controller hierarchies and create a new named systemd mount. From inside the namespace, this shouldn't make any difference. The only thing which is affected by systemd's use of v1 or v2 is what controllers are available on what hierarchies. Outside of that, lxc (or whatever) should be free to set up whatever cgroup hierarchy it wants for the container. What am I missing here?
So, for example, systemd-nspawn can use either v1 or v2 regardless of the host mode. While not implemented, for simplicity's sake, we could make it choose any of the three modes regardless of the host mode. What hierarchies a namespace can use is not restricted by what the host is doing at all. The only thing which gets affected is what controllers are available on which hierarchies.
hallyn commented Nov 9, 2016
Nested legacy containers. Can a newer host which has name=systemd mounted on v2 start a container with an older distribution release whose systemd uses v1? It's possible that in fact systemd doesn't use any v1-only features (i.e. the tasks file, or tasks in a non-leaf node) and that this would in fact just work; I've not tested. @brauner, could you test, say, a centos or ubuntu 16.04 container under lxc on a host with systemd-on-v2?
@hallyn, if you're talking about putting a v1 distro on top of the v2 hierarchy directly, it's highly unlikely to work, but why would you do that in the first place when you can simply create a new v1 hierarchy?
hallyn commented Nov 9, 2016
@htejun oh that would require updates to all the container drivers, but would be neat and solve this (temporarily) - you're saying the container could mount a v1 version of name=systemd?
@hallyn, yeah. That's what systemd-nspawn does and it works fine.
hallyn commented Nov 9, 2016
Cool, thanks. I didn't think that was possible. So lxc should implement that asap. I assume that only works for named controllers?
It works for all hierarchies. Again, the only restriction is which hierarchy a controller is attached to in the host and whether that coincides with what the container expects. Other than that, you can do whatever you want (for example, detach one controller from the v2 hierarchy, mount it on a separate v1 hierarchy and expose that to a container); however, please note that controllers may be pinned down by internal refs and thus may not be able to detach from their current hierarchy. You gotta figure out what controllers should go where on boot.
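As a concrete illustration of "mount it on a separate v1 hierarchy": binding a single controller to its own v1 hierarchy comes down to a mount(2) call like the following (the path and the choice of controller are illustrative; the call fails with EBUSY while the controller is still pinned to another hierarchy, and it needs root):

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

int main(void) {
        /* A v1 hierarchy is requested by mounting fstype "cgroup" with the
         * desired controller names passed as mount options. */
        mkdir("/tmp/freezer", 0755);
        if (mount("cgroup", "/tmp/freezer", "cgroup",
                  MS_NOSUID | MS_NODEV | MS_NOEXEC, "freezer") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}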
hallyn commented Nov 9, 2016
Quoting Tejun Heo (notifications@github.com):
Right, that's not going to be enough. The controller would need to be in both
Thanks for the information.
@htejun, can named controllers like name=systemd be mounted multiple times?
@brauner, a hierarchy can be mounted multiple times, but that would just make the same hierarchy show up in multiple places, and I don't think what's implemented in systemd-nspawn now is correct with multiple containers. When the container cgroup type is different from the host's, the host side should set up the other type for the container. I think what we should do for the hybrid mode is simply create a v1 name=systemd hierarchy in parallel which systemd itself won't use but which can help other tools expecting that hierarchy.
@htejun, basically we run something like this:

our_v2_cgroup=$(get_our_v2_cgroup)
mkdir -p /tmp/v1
mount -t cgroup -o none,name=systemd,xattr cgroup /tmp/v1
mkdir -p "/tmp/v1/${our_v2_cgroup}"
echo "PID1-of-the-container" > "/tmp/v1/${our_v2_cgroup}/cgroup.procs"

This works because we spawn every container inside its own cgroup.
cyphar commented Nov 10, 2016
@htejun How does mounting a
As an aside, in
@cyphar ,
@htejun , oh, I got it :-) i.e.
or so.
cyphar commented Nov 10, 2016
@evverx That's what my question was -- I didn't see how us creating a
@cyphar, sorry, I misunderstood. I mean that
cyphar commented Nov 10, 2016
The question is "will systemd do unholy things to our processes if we don't touch the v2 hierarchy?". Because if systemd's v1 interface is only going to be skin-deep (just faking it for container runtimes), then we still have the same problem as if we just mounted
Note: "unholy things" involves systemd reorganising the cgroup associations of processes that I created and am attempting to control. This happened quite a lot in older releases (and because of our issues with the
@cyphar , I need some context. I'll read docker/docker#23374, docker/docker#17704
cyphar commented Nov 10, 2016
@evverx You can also look at opencontainers/runc#325 (and things that link to it). Unfortunately most of the actual bug reports I've seen are on an internal bug tracker.
@cyphar , thanks for the link! I'll take a look. I "fixed" the "no subsystem for mount" error:

--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -149,7 +149,7 @@ func getCgroupMountsHelper(ss map[string]bool, mi io.Reader, all bool) ([]Mount,
 		if sepIdx == -1 {
 			return nil, fmt.Errorf("invalid mountinfo format")
 		}
-		if txt[sepIdx+3:sepIdx+9] != "cgroup" {
+		if txt[sepIdx+3:sepIdx+10] == "cgroup2" || txt[sepIdx+3:sepIdx+9] != "cgroup" {
 			continue
 		}
 		fields := strings.Split(txt, " ")

So, I can run runc:

-bash-4.3# grep cgroup2 /proc/self/mountinfo
25 24 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
-bash-4.3# runc run hola
/ # cat /proc/self/cgroup
10:freezer:/hola
9:pids:/user.slice/user-0.slice/session-1.scope/hola
8:net_cls,net_prio:/hola
7:memory:/user.slice/user-0.slice/session-1.scope/hola
6:cpuset:/hola
5:hugetlb:/hola
4:perf_event:/hola
3:devices:/user.slice/user-0.slice/session-1.scope/hola
2:cpu,cpuacct:/user.slice/user-0.slice/session-1.scope/hola
1:blkio:/user.slice/user-0.slice/session-1.scope/hola
0::/user.slice/user-0.slice/session-1.scope
/ # ls /sys/fs/cgroup/systemd
ls: /sys/fs/cgroup/systemd: No such file or directory

I'm trying to understand
Sadly, I'm not sure that the parallel
@evverx, ah yes, you're right. I missed mkdir_parents(). What systemd-nspawn does is correct and I think is a robust way to deal with the situation.
@brauner, the term named controller is a bit misleading. The name= option specifies the identifier for the hierarchy; without actual controllers there's no other way of specifying that hierarchy, so name= is only allowed on hierarchies which do not have any controllers attached to them. As such, there can be only one hierarchy with a given name; however, the hierarchy can be mounted multiple times just like any other hierarchy or filesystem can be. It'll just show the same hierarchy on different mount points. As long as the container preparation sets up its cgroup as systemd-nspawn does, everything should be fine.
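In other words, two mounts with the same name= option attach one and the same hierarchy at both mount points; a sketch (the mount points are illustrative, and this needs root):

#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

int main(void) {
        mkdir("/mnt/a", 0755);
        mkdir("/mnt/b", 0755);

        /* Both calls attach the same named hierarchy: a cgroup created
         * under /mnt/a immediately shows up under /mnt/b as well. */
        if (mount("cgroup", "/mnt/a", "cgroup", 0, "none,name=systemd") < 0 ||
            mount("cgroup", "/mnt/b", "cgroup", 0, "none,name=systemd") < 0) {
                perror("mount");
                return 1;
        }
        return 0;
}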
@evverx, and yes again: just setting up the name=systemd hierarchy in parallel to the cgroup v2 hierarchy, in both hybrid and v2 modes, for backward compatibility, without actually using it for anything. @cyphar, I can't tell much without specifics, but if you aren't using "Delegate=", problems are expected with controllers. I can't see how cooperation between two managing agents can be achieved without one telling the other that it's gonna take over some part of the hierarchy. I don't understand why specifying "Delegate=" is a huge problem, but one way or the other you'll have to do it. However, that doesn't have anything to do with the "name=systemd" hierarchy, as that hierarchy will always be fully instantiated and thus processes won't be relocated once set up.
@htejun, yes, but the
Never mind, I got mixed up here.
It's still going to be a problem though. We're taking care of placing users into writable
@brauner, I'm working on a change to make the v1 name=systemd hierarchy always available in hybrid mode. That should work for your case, right?
That depends on what that means. Does it mean in addition to
The cgroup2 one would have a different name, obviously.
@htejun , currently:

-bash-4.3# systemd-run --setenv UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn -D /var/lib/machines/nspawn-root/ -b systemd.unit=multi-user.target
-bash-4.3# systemd-cgls
...
-.slice
├─machine.slice
│ └─machine-nspawn\x2droot.scope
│   ├─356 /usr/lib/systemd/systemd systemd.unit=multi-user.target
│   ├─380 /usr/lib/systemd/systemd-journald
│   ├─387 /usr/lib/systemd/systemd-logind
│   ├─389 /usr/lib/systemd/systemd-resolved
│   ├─391 /sbin/agetty --noclear --keep-baud console 115200,38400,9600 vt220
│   └─392 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfi...

The "real" layout is
Also, should all other tools manually clean up their v1 hierarchies?
Just posted #4670, which puts the cgroup v2 systemd hierarchy on /sys/fs/cgroup/systemd-cgroup2 (any idea for a better name?) while maintaining the "name=systemd" hierarchy on /sys/fs/cgroup/systemd in parallel. This should avoid issues with most tools. For the ones which fail to parse /proc/$PID/cgroup if there's an entry for the v2 hierarchy, I have no idea yet. @evverx, yeah, systemd-cgls would need to select the mode per machine. I think we have the same problem without the hybrid mode though. It shouldn't be too difficult to teach systemd-cgls to switch modes per-machine, right?
@evverx, as for cleaning up afterwards, cg_trim() taking care of the v1 hierarchy should be enough, right?
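For what it's worth, trimming an empty v1 cgroup is just an rmdir(2) of its directory, applied leaves-first; a sketch with a hypothetical path (systemd's cg_trim() does the recursive walk):

#include <stdio.h>
#include <unistd.h>

int main(void) {
        /* rmdir() only succeeds once the cgroup contains no processes and
         * no child cgroups, so a real trim removes leaf directories first. */
        if (rmdir("/sys/fs/cgroup/systemd/machine.slice/machine-foo.scope") < 0) {
                perror("rmdir");
                return 1;
        }
        return 0;
}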
@htejun good idea to make the
htejun commented Aug 15, 2016
These two patches make systemd prefer the unified hierarchy for /sys/fs/cgroup/systemd/ when the kernel controllers are on legacy hierarchies. While this adds yet another cgroup setup variant, it allows configurations which require cgroup v1 (e.g. for the cpu controller, or for backward compatibility) to avoid the nasty complications of using the legacy hierarchy for process management. It also opens the door to eventually removing legacy-hierarchy-based process management: the kernel has been shipping cgroup v2 support for quite a while now, and dropping the v1 setup would get rid of a lot of workarounds for cgroup v1's shortcomings while not impacting compatibility in any noticeable way.