
Using the unified hierarchy for /sys/fs/cgroup/systemd when legacy hierarchies are being used #3965

Merged: 2 commits into systemd:master on Aug 20, 2016

Conversation

@htejun (Contributor) commented Aug 15, 2016

These two patches make systemd prefer the unified hierarchy for /sys/fs/cgroup/systemd/ when the kernel controllers are on legacy hierarchies. While this adds yet another cgroup setup variant, it allows configurations which require cgroup v1 (e.g. for the cpu controller, or for backward compatibility) to avoid the nasty complications of using a legacy hierarchy for process management. This also opens the door to eventually removing legacy-hierarchy-based process management, as the kernel has been shipping with cgroup v2 support for quite a while now; that should get rid of a lot of workarounds for cgroup v1's shortcomings without noticeably impacting compatibility.

A following patch will update cgroup handling so that the systemd controller
(/sys/fs/cgroup/systemd) can use the unified hierarchy even if the kernel
resource controllers are on the legacy hierarchies.  This would require
distinguishing whether all controllers are on cgroup v2 or only the systemd
controller is.  In preparation, this patch renames cg_unified() to
cg_all_unified().

This patch doesn't cause any functional changes.
if (r < 0)
        return r;

if (controller && streq(controller, SYSTEMD_CGROUP_CONTROLLER))

@zonque (Member) commented Aug 16, 2016

streq_ptr() seems a good fit.

@htejun (Author) commented Aug 16, 2016

Ah, indeed. I was actually reading the streq() implementation, wondering whether it'd do the pointer test first. Will switch to streq_ptr().
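
For reference, a minimal sketch of the assumed semantics (systemd's actual definitions live in its string utility headers; this is an illustration, not the real code):

#include <stdbool.h>
#include <string.h>

/* plain equality check: passing NULL to strcmp() is undefined behavior */
static inline bool streq(const char *a, const char *b) {
        return strcmp(a, b) == 0;
}

/* NULL-safe variant: two NULLs compare equal, NULL vs. a string does not */
static inline bool streq_ptr(const char *a, const char *b) {
        if (a && b)
                return strcmp(a, b) == 0;
        return a == b;
}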

/* -1: unknown
* 0: both systemd and controller hierarchies on legacy
* 1: only systemd hierarchy on unified
* 2: both systemd and controller hierarchies on unified */

@keszybz (Member) commented Aug 16, 2016

This should be an anonymous enum:

enum {
   CGROUP_UNIFIED_NONE = 0,
   CGROUP_UNIFIED_SYSTEMD = 1,
   CGROUP_UNIFIED_ALL = 2
};

@htejun (Author) commented Aug 16, 2016

Will update.

static thread_local int unified_cache = -1;

int cg_all_unified(void) {
static int cg_update_unified(void)
{

@keszybz (Member) commented Aug 16, 2016

That's kernel style. We don't use a separate line for the brace.

@htejun (Author) commented Aug 16, 2016

Right, will update.

if (r < 0)
        return true;
if (r == 0)
        return (wanted = true);

@keszybz (Member) commented Aug 16, 2016

Nah, please don't do inline assignment like that unless absolutely necessary.

@htejun (Author) commented Aug 16, 2016

Was following the existing style. Will update.

@keszybz (Member) commented Aug 16, 2016

The style changed a bit over the years. Some of the older code doesn't match, but we usually only update it when changing for other reasons.


if (all_unified < 0 || systemd_unified < 0)
        return log_error_errno(all_unified < 0 ? all_unified : systemd_unified,
                               "Couldn't determine if we are running in the unified hierarchy: %m");

@keszybz (Member) commented Aug 16, 2016

Wouldn't it be better to assume some default value in this case and continue? Failing fatally in setup code is something best avoided...

@htejun (Author) commented Aug 16, 2016

We have multiple places where we fail fatally if cg_[all_]unified() fails. It indicates that the basic mount setup went horribly wrong, and once the results are established they never fail. I can't think of a good recovery action here, as failure would mean that process management won't work.


else
        log_debug("Using cgroup controller " SYSTEMD_CGROUP_CONTROLLER ". File system hierarchy is at %s.", path);
else {
        if (systemd_unified > 0)

@keszybz (Member) commented Aug 16, 2016

else if

@htejun (Author) commented Aug 16, 2016

Will update.

@keszybz (Member) commented Aug 16, 2016

I didn't go through all the details, but the general idea is sound and the patch lgtm.

@htejun htejun force-pushed the htejun:systemd-controller-on-unified branch from f885b7d to 0b96748 Aug 16, 2016
@htejun (Author) commented Aug 16, 2016

lol forgot to do "make install" before testing the updates. There was a silly bug in detect_unified_cgroup_hierarchy().

@htejun htejun force-pushed the htejun:systemd-controller-on-unified branch from 0b96748 to f7724eb Aug 16, 2016
@zonque (Member) commented Aug 17, 2016

The error reported by the CentOS CI seems legit, btw: Failed to mount unified hierarchy: No such device

@htejun (Author) commented Aug 17, 2016

Hmm... nspawn worked fine in all three cgroup modes here. No idea what the difference is from the first revision. Digging into it.

@evverx (Member) commented Aug 17, 2016

@htejun ,

diff --git a/src/nspawn/nspawn.c b/src/nspawn/nspawn.c
index 0c1c21d..f6af953 100644
--- a/src/nspawn/nspawn.c
+++ b/src/nspawn/nspawn.c
@@ -326,7 +326,9 @@ static int detect_unified_cgroup_hierarchy(void) {
                 r = parse_boolean(e);
                 if (r < 0)
                         return log_error_errno(r, "Failed to parse $UNIFIED_CGROUP_HIERARCHY.");
-                if (r > 0)
+                else if (r == 0)
+                        arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_NONE;
+                else
                         arg_unified_cgroup_hierarchy = CGROUP_UNIFIED_ALL;
                 return 0;
         }
@htejun (Author) commented Aug 17, 2016

I see, the same bug on the env path. I was scratching my head viciously wondering why I hadn't been able to reproduce it. Yeap, UNIFIED_CGROUP_HIERARCHY=0 reproduces it. Fixing it. Thanks!

…rarchy

Currently, systemd uses either the legacy hierarchies or the unified hierarchy.
When the legacy hierarchies are used, systemd uses a named legacy hierarchy
mounted on /sys/fs/cgroup/systemd without any kernel controllers for process
management.  Due to the shortcomings in the legacy hierarchy, this involves a
lot of workarounds and complexities.

Because the unified hierarchy can be mounted and used in parallel to legacy
hierarchies, there's no reason for systemd to use a legacy hierarchy for
management even if the kernel resource controllers need to be mounted on legacy
hierarchies.  It can simply mount the unified hierarchy under
/sys/fs/cgroup/systemd and use it without affecting other legacy hierarchies.
This disables a significant amount of fragile workaround logic and would allow
using features which depend on unified hierarchy membership, such as the bpf
cgroup v2 membership test.  In time, this would also allow deleting the said
complexities.

This patch updates systemd so that it prefers the unified hierarchy for the
systemd cgroup controller hierarchy when legacy hierarchies are used for kernel
resource controllers.

* cg_unified(@controller) is introduced which tests whether the specific
  controller is on the unified hierarchy and is used to choose the unified
  hierarchy code path for process and service management when available.
  Kernel controller specific operations remain gated by cg_all_unified().

* "systemd.legacy_systemd_cgroup_controller" kernel argument can be used to
  force the use of legacy hierarchy for systemd cgroup controller.

* nspawn: By default nspawn uses the same hierarchies as the host.  If
  UNIFIED_CGROUP_HIERARCHY is set to 1, unified hierarchy is used for all.  If
  0, legacy for all.

* nspawn: arg_unified_cgroup_hierarchy is made an enum and now encodes one of
  three options - legacy, only systemd controller on unified, and unified.  The
  value is passed into mount setup functions and controls cgroup configuration.

* nspawn: Interpretation of SYSTEMD_CGROUP_CONTROLLER to the actual mount
  option is moved to mount_legacy_cgroup_hierarchy() so that it can take an
  appropriate action depending on the configuration of the host.

v2: - CGroupUnified enum replaces open coded integer values to indicate the
      cgroup operation mode.
    - Various style updates.

v3: Fixed a bug in detect_unified_cgroup_hierarchy() introduced during v2.

v4: Restored legacy container on unified host support and fixed another bug in
    detect_unified_cgroup_hierarchy().
@htejun htejun force-pushed the htejun:systemd-controller-on-unified branch from f7724eb to 5da38d0 Aug 17, 2016
@evverx (Member) commented Aug 18, 2016

I'm trying to pass systemd.unified_cgroup_hierarchy=no

-bash-4.3# cat /proc/cmdline
root=/dev/sda1 raid=noautodetect loglevel=2 init=/usr/lib/systemd/systemd ro console=ttyS0 selinux=0 systemd.unified_cgroup_hierarchy=no systemd.unit=multi-user.target

-bash-4.3# grep cgroup /proc/self/mountinfo
24 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:21 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
27 24 0:23 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,cpu,cpuacct
28 24 0:24 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,perf_event
29 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
30 24 0:26 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,pids
31 24 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,blkio
33 24 0:29 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,cpuset
34 24 0:30 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,freezer
35 24 0:31 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,hugetlb
36 24 0:32 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,memory

I think we shouldn't mount cgroup2 in this case.

@htejun (Author) commented Aug 18, 2016

So the intention was for systemd.unified_cgroup_hierarchy to indicate whether the kernel controllers are on unified, and for systemd.legacy_systemd_cgroup_controller to indicate whether the systemd controller is on legacy. There are three modes to select from, after all. The other thing is that whether the systemd controller is on unified or legacy doesn't matter all that much to users, so I thought it'd make sense to put that selection behind another flag.

If you have better ideas, please let me know.
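
For illustration, a minimal sketch of how the two switches could select among the three modes, using the CGroupUnified names from the commit message (the helper and its parameters are hypothetical, not systemd's actual code):

#include <stdbool.h>

typedef enum CGroupUnified {
        CGROUP_UNIFIED_NONE,     /* both systemd and kernel controllers on legacy */
        CGROUP_UNIFIED_SYSTEMD,  /* only the systemd hierarchy on unified */
        CGROUP_UNIFIED_ALL,      /* everything on unified */
} CGroupUnified;

/* each flag stands for the presence of the corresponding kernel command-line
 * switch; kernel cgroup v2 support is assumed to be available */
static CGroupUnified pick_cgroup_mode(bool unified_wanted, bool legacy_systemd_wanted) {
        if (unified_wanted)
                return CGROUP_UNIFIED_ALL;       /* systemd.unified_cgroup_hierarchy */
        if (legacy_systemd_wanted)
                return CGROUP_UNIFIED_NONE;      /* systemd.legacy_systemd_cgroup_controller */
        return CGROUP_UNIFIED_SYSTEMD;           /* default: hybrid */
}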

@evverx (Member) commented Aug 18, 2016

The other thing is that whether systemd controller is on unified or legacy doesn't matter all that much to users

Well, I think this matters. See #3388

-bash-4.3$ grep cgroup /proc/self/mountinfo
24 18 0:20 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:7 - tmpfs tmpfs ro,mode=755
25 24 0:21 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw
27 24 0:23 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,blkio
28 24 0:24 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,perf_event
29 24 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,net_cls,net_prio
30 24 0:26 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,freezer
31 24 0:27 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,devices
32 24 0:28 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,cpu,cpuacct
33 24 0:29 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,memory
34 24 0:30 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,pids
35 24 0:31 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:20 - cgroup cgroup rw,hugetlb
36 24 0:32 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,cpuset

-bash-4.3$ systemd-run --user --scope bash -c 'sleep 10'
Job for run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope failed.
See "systemctl status run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope" and "journalctl -xe" for details.

-bash-4.3$ journalctl | grep run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal', 'wheel' can see all messages.
      Pass -q to turn off this notice.
Aug 16 01:30:49 systemd-testsuite systemd[194]: run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope: Failed to add PIDs to scope's control group: Permission denied
Aug 16 01:30:49 systemd-testsuite systemd[194]: run-rcb2d58d5d78d40c2bc88e62f4c7e2b58.scope: Unit entered failed state.

If you have better ideas, please let me know.

I think I should reread your commit message :)
Actually, I missed systemd.legacy_systemd_cgroup_controller

r = get_proc_cmdline_key("systemd.legacy_systemd_cgroup_controller", NULL);
if (r > 0) {
        wanted = false;
} else {

@poettering (Member) commented Aug 19, 2016

nitpick: no {} for single-line if blocks

@poettering (Member) commented Aug 19, 2016

Oh, wow, interesting idea, never thought about that. Looks OK, even though I don't like the complexity this adds to already complex code... But it is definitely an OK way to deal with the political issues...

Would have merged, but needs a rebase first, github doesn't let me merge it...

@keszybz keszybz removed the needs-rebase label Aug 19, 2016
@keszybz keszybz merged commit 5da38d0 into systemd:master Aug 20, 2016
5 checks passed:
default: Build finished.
semaphoreci: The build passed on Semaphore.
ubuntu-amd64: autopkgtest finished (success)
ubuntu-i386: autopkgtest finished (success)
ubuntu-s390x: autopkgtest finished (success)
keszybz added a commit that referenced this pull request Aug 20, 2016
@keszybz (Member) commented Aug 20, 2016

Seems to work nicely here. Merged.

@htejun (Author) commented Nov 9, 2016

It works for all hierarchies. Again, the only restriction is to which hierarchy a controller is attached in the host and whether that coincides with what the container expects. Other than that, you can do whatever you want to do (for example, detach one controller from v2 hierarchy, mount it on a separate v1 hierarchy and expose that to a container); however, please note that controllers may be pinned down by internal refs and thus may not be able to detach from the current hierarchy. You gotta figure out what controllers should go where on boot.

@hallyn commented Nov 9, 2016


Right, that's not going to be enough. The controller would need to be in both
v1 and v2 at the same time, for the general case. For specific little applications,
detaching will be an option, but not in general.

Thanks for the information.

@brauner (Contributor) commented Nov 9, 2016

@htejun, can named controllers like name=systemd be attached to two hierarchies at the same time? It seems like this is what nspawn is doing or am I missing something obvious?

@htejun (Author) commented Nov 9, 2016

@brauner, a hierarchy can be mounted multiple times but that would just make the same hierarchy show up in multiple places and I don't think what's implemented in systemd-nspawn now is correct with multiple containers. When the container cgroup type is different from the host, the host side should set the other type for the container. I think what we should do for the hybrid mode is simply creating v1 name=systemd hierarchy in parallel which systemd itself won't use but can help other tools expecting that hierarchy.

@evverx (Member) commented Nov 10, 2016

@htejun, basically we run something like this:

our_v2_cgroup=$(get-our-v2-cgroup)   # placeholder for "read our own cgroup v2 path"
mkdir -p /tmp/v1
mount -t cgroup -o none,name=systemd,xattr cgroup /tmp/v1
mkdir -p "/tmp/v1/$our_v2_cgroup"
echo "PID1-of-the-container" > "/tmp/v1/$our_v2_cgroup/cgroup.procs"

This works because we spawn every container inside its own cgroup.

@cyphar commented Nov 10, 2016

@htejun How does mounting a v1 name=systemd hierarchy not break the whole point of the named systemd cgroup? Isn't the whole idea of that cgroup that systemd can track services even if the service creates new cgroups?

As an aside, in runc we've had our fair share of issues with systemd's transient service APIs (so much so that we don't use them by default -- you have to specify a flag to use them, we use the filesystem API directly by default). That means that we can't set Delegate=true, because we don't create a service. I'm assuming that means that runc is screwed because of all of these issues within systemd that keep cropping up?

@evverx (Member) commented Nov 10, 2016

How does mounting a v1 name=systemd hierarchy not break the whole point of the named systemd cgroup?

@cyphar , systemd doesn't use the named systemd cgroup(v1) in "mixed/full-unified"-mode.

@evverx (Member) commented Nov 10, 2016

I think what we should do for the hybrid mode is simply creating v1 name=systemd hierarchy in parallel which systemd itself won't use but can help other tools expecting that hierarchy.

@htejun , oh, I got it :-) i.e. systemd should create (and maintain) the parallel hierarchy. For every process:

$ cat /proc/self/cgroup
11:name=systemd:/user.slice/user-1001.slice/session-1.scope
10:pids:/user.slice/user-1001.slice/session-1.scope
...
0::/user.slice/user-1001.slice/session-1.scope

or so.
And this doesn't break the /proc/self/cgroup-parsers.
Right?

@cyphar commented Nov 10, 2016

@evverx That's what my question was -- I didn't see how us creating a v1 name=systemd cgroup will help because systemd won't be aware of it (so it will mess around with our processes just like it normally does).

@evverx (Member) commented Nov 10, 2016

@cyphar, sorry, I misunderstood.

I mean that systemd tracks all processes via the v2-hierarchy (and the named v1-hierarchy doesn't really matter). So, the question is "how does it work"? Right?

@cyphar commented Nov 10, 2016

The question is "will systemd do unholy things to our processes if we don't touch the v2 hierarchy?". Because if systemd's v1 interface is only going to be skin-deep (just faking it for container runtimes), then we still have the same problem as if we just mounted tmpfs at /sys/fs/cgroup/systemd. That is, container runtimes will still be broken (though not as badly as they are now) unless they support cgroupv2 because otherwise they won't be able to coax systemd into not terrorising our processes -- which is the exact problem we have right now (except right now the error is very loud and clear, not silent).

Note: "unholy things" involves systemd reorganising the cgroup associations of processes that I created and am attempting to control. This happened quite a lot in older releases (and because of our issues with the TransientUnit API we can't use Delegate all the time).

@evverx (Member) commented Nov 10, 2016

will systemd do unholy things to our processes

@cyphar , I need some context. I'll read moby/moby#23374, moby/moby#17704

@cyphar commented Nov 10, 2016

@evverx You can also look at opencontainers/runc#325 too (and things that link to it). Unfortunately most of the actual bug reports I've seen are on an internal bug tracker.

@evverx (Member) commented Nov 10, 2016

@cyphar , thanks for the link! I'll take a look.

I "fixed" the "no subsystem for mount"-error:

--- a/libcontainer/cgroups/utils.go
+++ b/libcontainer/cgroups/utils.go
@@ -149,7 +149,7 @@ func getCgroupMountsHelper(ss map[string]bool, mi io.Reader, all bool) ([]Mount,
                if sepIdx == -1 {
                        return nil, fmt.Errorf("invalid mountinfo format")
                }
-               if txt[sepIdx+3:sepIdx+9] != "cgroup" {
+               if txt[sepIdx+3:sepIdx+10] == "cgroup2" || txt[sepIdx+3:sepIdx+9] != "cgroup" {
                        continue
                }
                fields := strings.Split(txt, " ")

So, I can run runc:

-bash-4.3# grep cgroup2 /proc/self/mountinfo
25 24 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:8 - cgroup2 cgroup rw

-bash-4.3# runc run hola
/ # cat /proc/self/cgroup
10:freezer:/hola
9:pids:/user.slice/user-0.slice/session-1.scope/hola
8:net_cls,net_prio:/hola
7:memory:/user.slice/user-0.slice/session-1.scope/hola
6:cpuset:/hola
5:hugetlb:/hola
4:perf_event:/hola
3:devices:/user.slice/user-0.slice/session-1.scope/hola
2:cpu,cpuacct:/user.slice/user-0.slice/session-1.scope/hola
1:blkio:/user.slice/user-0.slice/session-1.scope/hola
0::/user.slice/user-0.slice/session-1.scope

/ # ls /sys/fs/cgroup/systemd
ls: /sys/fs/cgroup/systemd: No such file or directory

I'm trying to understand

  • what to do about /sys/fs/cgroup/systemd
  • what to do about --systemd-cgroup
  • how to fix something like #4008
  • does the Delegate= setting really work?

Sadly, I'm not sure that the parallel v1-hierarchy will help

@htejun (Author) commented Nov 10, 2016

@evverx, ah yes, you're right. I missed mkdir_parents(). What systemd-nspawn does is correct and I think is a robust way to deal with the situation.

@htejun (Author) commented Nov 10, 2016

@brauner, the term named controller is a bit misleading. The name= option specifies the identifier for the hierarchy as without the actual controllers there's no other way of specifying that hierarchy, so name= is only allowed on hierarchies which do not have any controllers attached to them.

As such, there can be only one hierarchy with a given name; however, the hierarchy can be mounted multiple times just like any other hierarchies or filesystems can be. It'll just show the same hierarchy on different mount points. As long as the container preparation sets up its cgroup as systemd-nspawn does, everything should be fine.
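
As an illustration of this point (paths made up, directories assumed to exist, error handling omitted), a second mount of the same named v1 hierarchy just exposes the one hierarchy at another mount point:

#include <sys/mount.h>

int main(void) {
        /* both mounts attach the single "name=systemd" hierarchy; a cgroup
         * created under /tmp/a is immediately visible under /tmp/b too */
        mount("cgroup", "/tmp/a", "cgroup", MS_NOSUID | MS_NODEV | MS_NOEXEC,
              "none,name=systemd");
        mount("cgroup", "/tmp/b", "cgroup", MS_NOSUID | MS_NODEV | MS_NOEXEC,
              "none,name=systemd");
        return 0;
}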

@htejun (Author) commented Nov 10, 2016

@evverx, and yes again: just setting up the name=systemd hierarchy in parallel to the cgroup v2 hierarchy in both hybrid and v2 modes, for backward compatibility, without actually using it for anything.

@cyphar, can't tell much without specifics but if you aren't using "Delegate=", problems are expected with controllers. I can't see how cooperation between two managing agents can be achieved without one telling the other that it's gonna take over some part of the hierarchy. I don't understand why specifying "Delegate=" is a huge problem but one way or the other you'll have to do it. However, that doesn't have anything to do with "name=systemd" hierarchy as that hierarchy will always be fully instantiated and thus processes won't be relocated once set up.

@brauner (Contributor) commented Nov 10, 2016

@htejun, yes, but the nspawn trick of temporarily mounting the systemd controller somewhere only works for privileged containers. If you're an unprivileged user and want to start a container that trick won't work because you're not allowed to mount cgroupfs even if you're root in a new CLONE_NEWUSER | CLONE_NEWNS.

@brauner (Contributor) commented Nov 10, 2016

Nevermind, I got mixed-up here.

@brauner (Contributor) commented Nov 10, 2016

It's still going to be a problem though. We're taking care of placing users into writable cgroups for some crucial subsystems, but if the v1 systemd controller is not mounted then we can't do that. So, if you haven't already been placed into a writable cgroup for the systemd v1 controller as an unprivileged user, you can sure mount it by using a trick like CLONE_NEWUSER | CLONE_NEWNS, map the root uid, setuid() and then mount(), but you won't be able to create the necessary cgroup by copying over the path from the v2 mount under /sys/fs/cgroup/systemd, because you lack the necessary privileges.
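
A minimal sketch of that trick (hypothetical, error handling stripped; assumes /tmp/v1 exists and a kernel that permits cgroup v1 mounts inside user namespaces):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

static void write_string(const char *path, const char *s) {
        int fd = open(path, O_WRONLY);
        write(fd, s, strlen(s));
        close(fd);
}

int main(void) {
        uid_t uid = getuid();
        gid_t gid = getgid();
        char buf[64];

        unshare(CLONE_NEWUSER | CLONE_NEWNS);

        /* map our real uid/gid to root inside the new user namespace */
        write_string("/proc/self/setgroups", "deny");
        snprintf(buf, sizeof buf, "0 %u 1", (unsigned) gid);
        write_string("/proc/self/gid_map", buf);
        snprintf(buf, sizeof buf, "0 %u 1", (unsigned) uid);
        write_string("/proc/self/uid_map", buf);

        /* the mount itself now succeeds... */
        mount("cgroup", "/tmp/v1", "cgroup", 0, "none,name=systemd");

        /* ...but creating the cgroup copied over from the v2 mount still
         * fails unless that cgroup was delegated to our real uid */
        return 0;
}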

@htejun (Author) commented Nov 10, 2016

@brauner, I'm working on a change to make v1 name=systemd hierarchy always available in hybrid mode. That should work for your case, right?

@brauner (Contributor) commented Nov 10, 2016

That depends on what that means. Does it mean in addition to /sys/fs/cgroup/systemd being mounted as an empty v2 hierarchy a named name=systemd controller is mounted as well? Where would that additional v1 systemd controller be mounted? :)

@htejun (Author) commented Nov 10, 2016

The cgroup2 one would have a different name, obviously.

@evverx (Member) commented Nov 11, 2016

@htejun , systemd-cgls (and other tools) will print the v2-layout. Is this correct from the application's point of view?

Currently

-bash-4.3# systemd-run --setenv UNIFIED_CGROUP_HIERARCHY=no systemd-nspawn -D /var/lib/machines/nspawn-root/ -b systemd.unit=multi-user.target

-bash-4.3# systemd-cgls
...
-.slice
├─machine.slice
│ └─machine-nspawn\x2droot.scope
│   ├─356 /usr/lib/systemd/systemd systemd.unit=multi-user.target
│   ├─380 /usr/lib/systemd/systemd-journald
│   ├─387 /usr/lib/systemd/systemd-logind
│   ├─389 /usr/lib/systemd/systemd-resolved
│   ├─391 /sbin/agetty --noclear --keep-baud console 115200,38400,9600 vt220
│   └─392 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfi...

The "real" layout is

├─machine.slice
│ └─machine-nspawn\x2droot.scope
│   ├─init.scope
│   │ └─356 /usr/lib/systemd/systemd systemd.unit=multi-user.target
│   └─system.slice
│     ├─dbus.service
│     │ └─392 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nop...
│     ├─systemd-logind.service
│     │ └─387 /usr/lib/systemd/systemd-logind
│     ├─console-getty.service
│     │ └─391 (agetty)
│     ├─systemd-resolved.service
│     │ └─389 /usr/lib/systemd/systemd-resolved
│     └─systemd-journald.service
│       └─380 /usr/lib/systemd/systemd-journald
@evverx (Member) commented Nov 11, 2016

Also, systemd-nspawn manually cleans up the v1 hierarchy (see #4223 (comment))

Should all other tools manually clean up their v1 sub-hierarchies?

@htejun (Author) commented Nov 14, 2016

Just posted #4670 which puts cgroup v2 systemd hierarchy on /sys/fs/cgroup/systemd-cgroup2 (any idea for a better name?) while maintaining the "name=systemd" hierarchy on /sys/fs/cgroup/systemd in parallel. This should avoid issues with most tools. For the ones which fail to parse if there's an entry for the v2 hierarchy in /proc/$PID/cgroup, I have no idea yet.

@evverx, yeah, systemd-cgls would need to select mode per machine. I think we have the same problem without the hybrid mode tho. It shouldn't be too difficult to teach systemd-cgls to switch modes per-machine, right?

@htejun (Author) commented Nov 14, 2016

@evverx, as for cleaning up afterwards, cg_trim() taking care of the v1 hierarchy should be enough, right?

@brauner (Contributor) commented Nov 14, 2016

@htejun good idea to make the v1 systemd controller available by default at /sys/fs/cgroup/systemd. Hm, maybe we can just name the cgroupfs v2 mountpoint /sys/fs/cgroup/unified; that would also make the intentions quite clear. :)
