Using the unified hierarchy for /sys/fs/cgroup/systemd when legacy hierarchies are being used #3965

Merged
keszybz merged 2 commits into systemd:master on Aug 20, 2016

Conversation

Contributor

htejun commented Aug 15, 2016

These two patches make systemd prefer the unified hierarchy for /sys/fs/cgroup/systemd/ when the kernel controllers are on legacy hierarchies. While this adds yet another cgroup setup variant, it allows configurations which require cgroup v1 (e.g. for the cpu controller, or for backward compatibility) to avoid the nasty complications of using a legacy hierarchy for process management. It also opens the door to eventually removing legacy-hierarchy-based process management, as the kernel has been shipping cgroup v2 support for quite a while now; that would get rid of a lot of workarounds for cgroup v1's shortcomings without impacting compatibility in any noticeable way.

core: rename cg_unified() to cg_all_unified()
A follow-up patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies.  This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is.  In preparation, this patch renames cg_unified() to
cg_all_unified().

This patch doesn't cause any functional changes.
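
For orientation, the three setups this distinction covers can also be told apart from a shell. A minimal sketch (not systemd code), assuming GNU stat and the conventional mount points:

#!/bin/sh
# cgroup2 mounts report filesystem type "cgroup2fs"; v1 mounts report "cgroupfs"
if [ "$(stat -fc %T /sys/fs/cgroup)" = "cgroup2fs" ]; then
        echo "unified: everything on cgroup v2"
elif [ "$(stat -fc %T /sys/fs/cgroup/systemd)" = "cgroup2fs" ]; then
        echo "hybrid: only the systemd hierarchy on cgroup v2"
else
        echo "legacy: everything on cgroup v1"
fi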
src/basic/cgroup-util.c
+        if (r < 0)
+                return r;
+
+        if (controller && streq(controller, SYSTEMD_CGROUP_CONTROLLER))
@zonque

zonque Aug 16, 2016

Owner

streq_ptr() seems a good fit.

@htejun

htejun Aug 16, 2016

Contributor

Ah, indeed. I was actually reading the streq() implementation, wondering whether it'd do the pointer test first. Will switch to streq_ptr().

src/basic/cgroup-util.c
+/* -1: unknown
+ * 0: both systemd and controller hierarchies on legacy
+ * 1: only systemd hierarchy on unified
+ * 2: both systemd and controller hierarchies on unified */
@keszybz

keszybz Aug 16, 2016

Owner

This should be an anonymous enum:

enum {
        CGROUP_UNIFIED_NONE = 0,
        CGROUP_UNIFIED_SYSTEMD = 1,
        CGROUP_UNIFIED_ALL = 2
};
@htejun

htejun Aug 16, 2016

Contributor

Will update.

src/basic/cgroup-util.c
static thread_local int unified_cache = -1;
-int cg_all_unified(void) {
+static int cg_update_unified(void)
+{
@keszybz

keszybz Aug 16, 2016

Owner

That's kernel style. We don't use a separate line for the brace.

@htejun

htejun Aug 16, 2016

Contributor

Right, will update.

src/basic/cgroup-util.c
+        if (r < 0)
+                return true;
+        if (r == 0)
+                return (wanted = true);
@keszybz

keszybz Aug 16, 2016

Owner

Nah, please don't do inline assignment like that unless absolutely necessary.

@htejun

htejun Aug 16, 2016

Contributor

Was following the existing style. Will update.

@keszybz

keszybz Aug 16, 2016

Owner

The style changed a bit over the years. Some of the older code doesn't match, but we usually only update it when changing for other reasons.

+
+        if (all_unified < 0 || systemd_unified < 0)
+                return log_error_errno(all_unified < 0 ? all_unified : systemd_unified,
+                                       "Couldn't determine if we are running in the unified hierarchy: %m");
@keszybz

keszybz Aug 16, 2016

Owner

Wouldn't it be better to assume some default value in this case and continue? Failing fatally in the setup code is something best avoided...

@htejun

htejun Aug 16, 2016

Contributor

We have multiple places where we fail fatally if cg_[all_]unified() fails. It indicates that the basic mount setup went horribly wrong, and once the results are established they never fail. I can't think of a good recovery action here, as failure would mean that process management won't work.

src/core/cgroup.c
-        else
-                log_debug("Using cgroup controller " SYSTEMD_CGROUP_CONTROLLER ". File system hierarchy is at %s.", path);
+        else {
+                if (systemd_unified > 0)
@htejun

htejun Aug 16, 2016

Contributor

Will update.

Owner

keszybz commented Aug 16, 2016

I didn't go through all the details, but the general idea is sound and the patch lgtm.

Contributor

htejun commented Aug 16, 2016

lol forgot to do "make install" before testing the updates. There was a silly bug in detect_unified_cgroup_hierarchy().

Owner

zonque commented Aug 17, 2016

The error reported by the CentOS CI seems legit, btw: Failed to mount unified hierarchy: No such device

Contributor

htejun commented Aug 17, 2016

Hmm... nspawn worked fine in all three cgroup modes here. No idea what the difference is from the first revision. Digging into it.

Member

evverx commented Aug 17, 2016

@htejun ,

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 0c1c21d..f6af953 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -326,7 +326,9 @@ static int detect_unified_cgroup_hierarchy(void) {
                 r = parse_boolean(e);
                 if (r < 0)
                         return log_error_errno(r, "Failed to parse $UNIFIED_CGROUP_HIERARCHY.");
-                if (r > 0)
+                else if (r == 0)
+                        arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_NONE;
+                else
                         arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_ALL;
                 return 0;
         }
Contributor

htejun commented Aug 17, 2016

I see, the same bug on the env path. I was scratching my head viciously wondering why I hadn't been able to reproduce it. Yeap, UNIFIED_CGROUP_HIERARCHY=0 reproduces it. Fixing it. Thanks!

core: use the unified hierarchy for the systemd cgroup controller hierarchy

Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logic and would allow
using features which depend on unified hierarchy membership, such as the bpf
cgroup v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced which tests whether the specific
  controller is on the unified hierarchy, and is used to choose the unified
  hierarchy code path for process and service management when available.
  Kernel controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, the unified hierarchy is used for
  everything; if set to 0, legacy for everything (see the invocation examples
  after this message).

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Translation of SYSTEMD_CGROUP_CONTROLLER into the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
Member

evverx commented Aug 18, 2016

I'm trying to pass systemd.unified_cgroup_hierarchy=no

-bash-4.3# cat /proc/cmdline
root=/dev/sda1 raid=noautodetect loglevel=2 init=/usr/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no systemd.unit=multi-user.target

-bash-4.3# grep cgroup /proc/self/mountinfo
24 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:21 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
27 24 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,cpu,cpuacct
28 24 0:24 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,perf_event
29 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
30 24 0:26 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
31 24 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,blkio
33 24 0:29 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,cpuset
34 24 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,freezer
35 24 0:31 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,hugetlb
36 24 0:32 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory

I think we shouldn't mount cgroup2 in this case.

Contributor

htejun commented Aug 18, 2016

So, the intention was for systemd.unified_cgroup_hierarchy to indicate whether the controllers are on unified or not, and for systemd.legacy_systemd_cgroup_controller to indicate whether the systemd controller is on legacy or not. There are three modes to select from, after all. The other thing is that whether the systemd controller is on unified or legacy doesn't matter all that much to users, so I thought it'd make sense to put the selection behind another flag.

If you have better ideas, please let me know.
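
For reference, the combinations work out to roughly this (a summary inferred from this thread, not quoted from documentation):

systemd.unified_cgroup_hierarchy=1           -> everything on cgroup v2
(neither parameter set)                      -> hybrid: controllers on v1, systemd hierarchy on v2
systemd.legacy_systemd_cgroup_controller     -> everything on cgroup v1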

Member

evverx commented Aug 18, 2016

The other thing is that whether systemd controller is on unified or legacy doesn't matter all that much to users

Well, I think this matters. See #3388

-bash-4.3$ grep cgroup /proc/self/mountinfo
24 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:21 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
27 24 0:23 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,blkio
28 24 0:24 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,perf_event
29 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
30 24 0:26 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,freezer
31 24 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:28 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,cpu,cpuacct
33 24 0:29 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,memory
34 24 0:30 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
35 24 0:31 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,hugetlb
36 24 0:32 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,cpuset

-bash-4.3$ systemd-run --user --scope bash -c 'sleep 10'
Job for run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope failed.
See "systemctl status run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope" and "journalctl -xe" for details.

-bash-4.3$ journalctl | grep run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
      Pass -q to turn off this notice.
Aug 16 01:30:49 systemd-testsuite systemd[194]: run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope: Failed to add PIDs to scope's control group: Permission denied
Aug 16 01:30:49 systemd-testsuite systemd[194]: run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope: Unit entered failed state.

If you have better ideas, please let me know.

I think I should reread your commit message :)
Actually, I missed systemd.legacy_systemd_cgroup_controller

+        r = get_proc_cmdline_key("systemd.legacy_systemd_cgroup_controller", NULL);
+        if (r > 0) {
+                wanted = false;
+        } else {
@poettering

poettering Aug 19, 2016

Owner

nitpick: no {} for single-line if blocks

Owner

poettering commented Aug 19, 2016

Oh, wow, interesting idea, never thought about that. Looks OK, though, even if I don't like the complexity this adds to already complex code... But it is definitely an OK way to deal with the political issues...

Would have merged, but it needs a rebase first; GitHub doesn't let me merge it...

@keszybz keszybz merged commit 5da38d0 into systemd:master Aug 20, 2016

5 checks passed

default: Build finished.
semaphoreci: The build passed on Semaphore.
ubuntu-amd64: autopkgtest finished (success)
ubuntu-i386: autopkgtest finished (success)
ubuntu-s390x: autopkgtest finished (success)
Owner

keszybz commented Aug 20, 2016

Seems to work nicely here. Merged.

phedders commented Nov 8, 2016

Is this the change that snuck into 232 that has broken LXC, Docker and RKT?

Member

evverx commented Nov 8, 2016

Indeed. runc, rkt and lxc can't handle 0::/some/cgroup/path

BTW: rkt supports two modes: legacy and "full" unified (but not mixed)

And, yes, systemd.legacy_systemd_cgroup_controller=yes "fixes" the issue

phedders commented Nov 8, 2016

Seems a bit unfortunate to push that out before the dependents were ready for it. It caused a lot of pain for many people, I suspect, if my experience is anything to go by.

Contributor

htejun commented Nov 8, 2016

Oops, sorry about that. I didn't foresee failures with other tools. In principle, as long as "Delegate=" is used and the tools don't tamper with the systemd hierarchy, it should all work, but yeap, reality is always harder. Should the hybrid mode default to off, at least for now?

Contributor

mbiebl commented Nov 8, 2016

Do we know which other tools are affected by this? @evverx mentioned runc, rkt and lxc. What about docker? Do bug reports exist to notify them about this issue?

Member

evverx commented Nov 8, 2016

@mbiebl

What about docker?

Docker uses runc. So, this affects Docker too.

Do bug reports exist to notify them about this issue?

There is nothing LXC can currently do about this. So closing this.
You should make sure that systemd is mounted into a cgroup v1 hierarchy.

(@brauner , @stgraber , why not to reopen lxc/lxc#1280?)

Should the hybrid mode default to off at least for now?

@htejun , yes, I think the hybrid mode should be disabled by default. @poettering, @keszybz?

Owner

keszybz commented Nov 9, 2016

Welp ;) Seems we have no choice.

Contributor

brauner commented Nov 9, 2016

Yes, we can certainly re-open lxc/lxc#1280 to track this. It would be good if the hybrid mode were disabled by default for now. @htejun, as we discussed with @serge and @stgraber during Plumbers, when a v2 hierarchy is mounted we run into trouble as soon as we run a distro in a container that only knows how to deal with a v1 hierarchy. Having a functional userspace experience wrt different hierarchies on the host and inside the container is what we care about. Functional limits are not even important at this point.

Contributor

martinpitt commented Nov 9, 2016

I sent PR #4628 to invert the default for now, but still keep an opt-in for LXC/docker/rkt/etc. developers. I will apply this downstream at least, but now that we have issues filed for affected software I think we should revert this upstream for now too (and maybe push out 233 relatively quickly).

Contributor

mbiebl commented Nov 9, 2016

@martinpitt If a v233 is made quickly, it should also address #4575 I think

Contributor

htejun commented Nov 9, 2016

@brauner, I'm still not quite understanding why this is a fundamental problem. If the host has its systemd hierarchy on v2 and other controllers on v1, lxc should still be able to namespace the v1 controller hierarchies and create a new named systemd mount. From inside the namespace, this shouldn't make any difference. The only thing which is affected by systemd's use of v1 or v2 is what controllers are available on what hierarchies. Outside of that, lxc (or whatever) should be free to set up whatever cgroup hierarchy it wants for the container. What am I missing here?

Contributor

htejun commented Nov 9, 2016

So, for example, systemd-nspawn can follow the host or use either v1 or v2 regardless of the host mode. While it's not implemented (for simplicity), we could make it choose any of the three modes regardless of the host mode. What hierarchies a namespace can use is not restricted by what the host is doing at all. The only thing which gets affected is what controllers are available on which hierarchies.

hallyn commented Nov 9, 2016

Nested legacy containers. Can a newer host which has name=systemd mounted on v2 start a container with an older distribution release whose systemd uses v1? It's possible that systemd doesn't in fact use any v1-only features (e.g. the tasks file, or tasks in a non-leaf node) and that this would just work; I've not tested. @brauner, could you test, say, a CentOS or Ubuntu 16.04 container under lxc on a host with systemd-on-v2?

Contributor

htejun commented Nov 9, 2016

@hallyn, if you're talking about putting a v1 distro on top of a v2 hierarchy directly, it's highly unlikely to work, but why would you do that in the first place when you can simply create a new v1 hierarchy?

hallyn commented Nov 9, 2016

@htejun oh, that would require updates to all the container drivers, but it would be neat and would solve this (temporarily) - you're saying the container could mount a v1 version of name=systemd?

Contributor

htejun commented Nov 9, 2016

@hallyn, yeah. That's what systemd-nspawn does and it works fine.

hallyn commented Nov 9, 2016

Cool, thanks. I didn't think that was possible. So lxc should implement that asap.

I assume that only works for named controllers?

Contributor

htejun commented Nov 9, 2016

It works for all hierarchies. Again, the only restriction is to which hierarchy a controller is attached in the host and whether that coincides with what the container expects. Other than that, you can do whatever you want to do (for example, detach one controller from v2 hierarchy, mount it on a separate v1 hierarchy and expose that to a container); however, please note that controllers may be pinned down by internal refs and thus may not be able to detach from the current hierarchy. You gotta figure out what controllers should go where on boot.
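
For concreteness, the detach-and-remount idea could look like this (a sketch, assuming the freezer controller is not currently attached to any hierarchy; the mount fails with EBUSY if it is pinned elsewhere):

mkdir -p /sys/fs/cgroup/freezer
# bind just the freezer controller to a dedicated v1 hierarchy
mount -t cgroup -o freezer cgroup /sys/fs/cgroup/freezer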

hallyn commented Nov 9, 2016

Right, that's not going to be enough. The controller would need to be in both
v1 and v2 at the same time, for the general case. For specific little applications,
detaching will be an option, but not in general.

Thanks for the information.

Contributor

brauner commented Nov 9, 2016

@htejun, can named controllers like name=systemd be attached to two hierarchies at the same time? It seems like this is what nspawn is doing or am I missing something obvious?

Contributor

htejun commented Nov 9, 2016

@brauner, a hierarchy can be mounted multiple times, but that would just make the same hierarchy show up in multiple places, and I don't think what's implemented in systemd-nspawn now is correct with multiple containers. When the container's cgroup type is different from the host's, the host side should set up the other type for the container. I think what we should do for the hybrid mode is simply create a v1 name=systemd hierarchy in parallel, which systemd itself won't use but which can help other tools expecting that hierarchy.

Member

evverx commented Nov 10, 2016

@htejun, basically we run something like this:

# get-our-v2-cgroup stands in for however the container's cgroup path in the
# v2 hierarchy is determined
our_v2_cgroup=$(get-our-v2-cgroup)
mkdir -p /tmp/v1
mount -t cgroup -o none,name=systemd,xattr cgroup /tmp/v1
mkdir -p "/tmp/v1/$our_v2_cgroup"
echo "PID1-of-the-container" > "/tmp/v1/$our_v2_cgroup/cgroup.procs"

This works because we spawn every container inside its own cgroup

cyphar commented Nov 10, 2016

@htejun How does mounting a v1 name=systemd not break the whole point of the named systemd cgroup? Isn't the whole idea of that cgroup that systemd can track services even if the service creates new cgroups?

As an aside, in runc we've had our fair share of issues with systemd's transient service APIs (so much so that we don't use them by default -- you have to specify a flag to use them; we use the filesystem API directly by default). That means that we can't set Delegate=true, because we don't create a service. I'm assuming that means that runc is screwed because of all of these issues within systemd that keep cropping up?

Member

evverx commented Nov 10, 2016

How does mounting a v1 name=systemd not break the whole point of the named systemd cgroup?

@cyphar , systemd doesn't use the named systemd cgroup(v1) in "mixed/full-unified"-mode.

Member

evverx commented Nov 10, 2016

I think what we should do for the hybrid mode is simply creating v1 name=systemd hierarchy in parallel which systemd itself won't use but can help other tools expecting that hierarchy.

@htejun , oh, I got it :-) i.e. systemd should create (and maintain) the parallel hierarchy. For every process:

$ cat /proc/self/cgroup
11:name=systemd:/user.slice/user-1001.slice/session-1.scope
10:pids:/user.slice/user-1001.slice/session-1.scope
...
0::/user.slice/user-1001.slice/session-1.scope

or so.
And this doesn't break the /proc/self/cgroup-parsers.
Right?

cyphar commented Nov 10, 2016

@evverx That was my question -- I didn't see how our creating a v1 name=systemd cgroup would help, because systemd won't be aware of it (so it will mess around with our processes just like it normally does).

Member

evverx commented Nov 10, 2016

@cyphar, sorry, I misunderstood.

I mean that systemd tracks all processes via the v2-hierarchy (and the named v1-hierarchy doesn't really matter). So, the question is "how does it work"? Right?

cyphar commented Nov 10, 2016

The question is "will systemd do unholy things to our processes if we don't touch the v2 hierarchy?". Because if systemd's v1 interface is only going to be skin-deep (just faking it for container runtimes), then we still have the same problem as if we had just mounted a tmpfs at /sys/fs/cgroup/systemd. That is, container runtimes will still be broken (though not as badly as they are now) unless they support cgroup v2, because otherwise they won't be able to coax systemd into not terrorising our processes -- which is the exact problem we have right now (except right now the error is very loud and clear, not silent).

Note: "unholy things" involves systemd reorganising the cgroup associations of processes that I created and am attempting to control. This happened quite a lot in older releases (and because of our issues with the TransientUnit API we can't use Delegate all the time).

Member

evverx commented Nov 10, 2016

will systemd do unholy things to our processes

@cyphar , I need some context. I'll read docker/docker#23374, docker/docker#17704

cyphar commented Nov 10, 2016

@evverx You can also look at opencontainers/runc#325 too (and things that link to it). Unfortunately most of the actual bug reports I've seen are on an internal bug tracker.

Member

evverx commented Nov 10, 2016

@cyphar , thanks for the link! I'll take a look.

I "fixed" the "no subsystem for mount"-error:

--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -149,7 +149,7 @@ func getCgroupMountsHelper(ss map[string]bool, mi io.Reader, all bool) ([]Mount,
                if sepIdx == -1 {
                        return nil, fmt.Errorf("invalid mountinfo format")
                }
-               if txt[sepIdx+3:sepIdx+9] != "cgroup" {
+               if txt[sepIdx+3:sepIdx+10] == "cgroup2" || txt[sepIdx+3:sepIdx+9] != "cgroup" {
                        continue
                }
                fields := strings.Split(txt, " ")

So, I can run runc:

-bash-4.3# grep cgroup2 /proc/self/mountinfo
25 24 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw

-bash-4.3# runc run hola
/ # cat /proc/self/cgroup
10:freezer:/hola
9:pids:/user.slice/user-0.slice/session-1.scope/hola
8:net_cls,net_prio:/hola
7:memory:/user.slice/user-0.slice/session-1.scope/hola
6:cpuset:/hola
5:hugetlb:/hola
4:perf_event:/hola
3:devices:/user.slice/user-0.slice/session-1.scope/hola
2:cpu,cpuacct:/user.slice/user-0.slice/session-1.scope/hola
1:blkio:/user.slice/user-0.slice/session-1.scope/hola
0::/user.slice/user-0.slice/session-1.scope

/ # ls /sys/fs/cgroup/systemd
ls: /sys/fs/cgroup/systemd: No such file or directory

I'm trying to understand

  • what to do about /sys/fs/cgroup/systemd
  • what to do about --systemd-cgroup
  • how to fix something like #4008
  • does the Delegate= setting really work?

Sadly, I'm not sure that the parallel v1-hierarchy will help

Contributor

htejun commented Nov 10, 2016

@evverx, ah yes, you're right. I missed mkdir_parents(). What systemd-nspawn does is correct and I think is a robust way to deal with the situation.

Contributor

htejun commented Nov 10, 2016

@brauner, the term "named controller" is a bit misleading. The name= option specifies an identifier for the hierarchy, since without actual controllers there's no other way of specifying that hierarchy; accordingly, name= is only allowed on hierarchies which do not have any controllers attached to them.

As such, there can be only one hierarchy with a given name; however, the hierarchy can be mounted multiple times just like any other hierarchies or filesystems can be. It'll just show the same hierarchy on different mount points. As long as the container preparation sets up its cgroup as systemd-nspawn does, everything should be fine.
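
A quick way to see the multiple-mount behaviour (a sketch; the mount points are arbitrary):

mkdir -p /mnt/a /mnt/b
mount -t cgroup -o none,name=systemd cgroup /mnt/a
mount -t cgroup -o none,name=systemd cgroup /mnt/b   # attaches to the same hierarchy
mkdir /mnt/a/test
ls -d /mnt/b/test   # the new group is visible through both mounts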

Contributor

htejun commented Nov 10, 2016

@evverx, and yes again: just set up the name=systemd hierarchy in parallel to the cgroup v2 hierarchy, in both hybrid and v2 modes, for backward compatibility, without actually using it for anything.

@cyphar, I can't tell much without specifics, but if you aren't using "Delegate=", problems are expected with controllers. I can't see how cooperation between two managing agents can be achieved without one telling the other that it's gonna take over some part of the hierarchy. I don't understand why specifying "Delegate=" is a huge problem, but one way or the other you'll have to do it. However, that doesn't have anything to do with the "name=systemd" hierarchy, as that hierarchy will always be fully instantiated and thus processes won't be relocated once set up.
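
As a point of reference, delegation can be requested without writing a unit file, e.g. (the payload name is a placeholder; Delegate= is the real unit property):

# run a manager inside its own scope and ask systemd not to touch
# anything below that scope's cgroup
systemd-run --scope -p Delegate=yes my-container-manager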

Contributor

brauner commented Nov 10, 2016

@htejun, yes, but the nspawn trick of temporarily mounting the systemd controller somewhere only works for privileged containers. If you're an unprivileged user and want to start a container, that trick won't work, because you're not allowed to mount cgroupfs even if you're root in a new CLONE_NEWUSER | CLONE_NEWNS.

Contributor

brauner commented Nov 10, 2016

Nevermind, I got mixed-up here.

Contributor

brauner commented Nov 10, 2016

It's still going to be a problem, though. We take care of placing users into writable cgroups for some crucial subsystems, but if the v1 systemd controller is not mounted then we can't do that. So, if you haven't already been placed into a writable cgroup for the systemd v1 controller as an unprivileged user, you can certainly mount it by using a trick like CLONE_NEWUSER | CLONE_NEWNS, mapping the root uid, setuid() and then mount(), but you won't be able to create the necessary cgroup by copying over the path from the v2 mount under /sys/fs/cgroup/systemd, because you lack the necessary privileges.

Contributor

htejun commented Nov 10, 2016

@brauner, I'm working on a change to make v1 name=systemd hierarchy always available in hybrid mode. That should work for your case, right?

Contributor

brauner commented Nov 10, 2016

That depends on what that means. Does it mean that, in addition to /sys/fs/cgroup/systemd being mounted as an empty v2 hierarchy, a name=systemd hierarchy is mounted as well? Where would that additional v1 systemd controller be mounted? :)

Contributor

htejun commented Nov 10, 2016

The cgroup2 one would have a different name, obviously.

Member

evverx commented Nov 11, 2016

@htejun , systemd-cgls (and other tools) will print the v2-layout. Is this correct from the application's point of view?

Currently

-bash-4.3# systemd-run --setenv UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn -D /var/lib/machines/nspawn-root/ -b systemd.unit=multi-user.target

-bash-4.3# systemd-cgls
...
-.slice
├─machine.slice
│ └─machine-nspawn\x2droot.scope
│   ├─356 /usr/lib/systemd/systemd systemd.unit=multi-user.target
│   ├─380 /usr/lib/systemd/systemd-journald
│   ├─387 /usr/lib/systemd/systemd-logind
│   ├─389 /usr/lib/systemd/systemd-resolved
│   ├─391 /sbin/agetty --noclear --keep-baud console 115200,38400,9600 vt220
│   └─392 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfi...

The "real" layout is

├─machine.slice
│ └─machine-nspawn\x2droot.scope
│   ├─init.scope
│   │ └─356 /usr/lib/systemd/systemd systemd.unit=multi-user.target
│   └─system.slice
│     ├─dbus.service
│     │ └─392 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nop...
│     ├─systemd-logind.service
│     │ └─387 /usr/lib/systemd/systemd-logind
│     ├─console-getty.service
│     │ └─391 (agetty)
│     ├─systemd-resolved.service
│     │ └─389 /usr/lib/systemd/systemd-resolved
│     └─systemd-journald.service
│       └─380 /usr/lib/systemd/systemd-journald
Member

evverx commented Nov 11, 2016

Also, systemd-nspawn manually cleans up the v1 hierarchy (see #4223 (comment))

Should all other tools manually clean up their v1 sub-hierarchies?

Contributor

htejun commented Nov 14, 2016

Just posted #4670, which puts the cgroup v2 systemd hierarchy on /sys/fs/cgroup/systemd-cgroup2 (any idea for a better name?) while maintaining the "name=systemd" hierarchy on /sys/fs/cgroup/systemd in parallel. This should avoid issues with most tools. For the ones which fail to parse /proc/$PID/cgroup if there's an entry for the v2 hierarchy, I have no idea yet.

@evverx, yeah, systemd-cgls would need to select mode per machine. I think we have the same problem without the hybrid mode tho. It shouldn't be too difficult to teach systemd-cgls to switch modes per-machine, right?

Contributor

htejun commented Nov 14, 2016

@evverx, as for cleaning up afterwards, cg_trim() taking care of the v1 hierarchy should be enough, right?

Contributor

brauner commented Nov 14, 2016

@htejun good idea to make the v1 systemd controller available by default at /sys/fs/cgroup/systemd. Hm, maybe we can just name the cgroup v2 mountpoint /sys/fs/cgroup/unified; that would also make the intentions quite clear. :)
