cgroup: try creating a temporary directory after mounting `/sys/fs/cgroup/unified` #7401

evverx · 2017-11-21T00:27:05Z

It's possible for systemd inside an unprivileged user namespace container
to be able to mount cgroup2 on /sys/fs/cgroup/unified without being able
to create directories there. When this happens, systemd fails to boot, making
it impossible to reexecute itself without restarting the container runtime.

In this patch the issue is avoided by trying to create a temporary directory
after mounting cgroup2 and falling back to v1 if mkdir fails.

Closes #6408 and lxc/lxc#1678.
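The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this patch: `probe_mkdir()` and `choose_cgroup_version()` are hypothetical helper names, and the PID-based probe directory name is an assumption made for the example.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not systemd's actual code: returns 0 if a directory
 * can be created (and removed again) below `dir`, or a negative errno. */
static int probe_mkdir(const char *dir) {
        char path[4096];
        int n;

        /* Use a PID-based name so that concurrent probes do not collide. */
        n = snprintf(path, sizeof(path), "%s/.probe-%ld", dir, (long) getpid());
        if (n < 0 || (size_t) n >= sizeof(path))
                return -ENAMETOOLONG;

        if (mkdir(path, 0755) < 0)
                return -errno;

        /* Clean up right away; the directory only served as a write test. */
        if (rmdir(path) < 0)
                return -errno;

        return 0;
}

/* Hypothetical decision helper: 2 means "use cgroup v2", 1 means "fall back to v1". */
static int choose_cgroup_version(const char *unified_dir) {
        return probe_mkdir(unified_dir) == 0 ? 2 : 1;
}
```

A caller would run `choose_cgroup_version("/sys/fs/cgroup/unified")` right after the mount succeeds and continue with the v1 setup if it returns 1.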


brauner · 2017-11-21T11:10:00Z

Thanks man! :)

poettering · 2017-11-21T14:08:29Z

Hmpf, I really don't like that this actually creates something in the cgroupfs... Are you sure that access(W_OK) wouldn't detect this case too, without actually changing things on disk?

What exactly does the non-writable cgroupfs look like when this fails? Who owns the dirs in there? If access(W_OK) doesn't work, maybe we can filter this out if the owner of the dir is "nobody" or so?

evverx · 2017-11-21T15:31:20Z

access(W_OK) won't work if something like AppArmor is used to block access to /sys/fs/cgroup/unified. That is, it is quite easy to get the following:

access("/sys/fs/cgroup/unified", W_OK)            = 0
mkdir("/sys/fs/cgroup/unified/init.scope", 0777)         = -1 EACCES (Permission denied)

/sys/fs/cgroup/unified is likely to be owned by nobody:nogroup, but it really depends on a few different factors.

I'm not sure what exactly is wrong with mkdir, because I think that trying to create something is the only reliable way to check whether something can be created.

brauner · 2017-11-21T16:22:22Z

> What exactly does the non-writable cgroupfs look like when this fails? Who owns the dirs in there? If access(W_OK) doesn't work, maybe we can filter this out if the owner of the dir is "nobody" or so?

So for most controllers liblxc will opportunistically use them if they are writable. It only has very few requirements on what cgroups should be writable. I think the only one right now is freezer for legacy reasons (Might rework that one soon.). So a cgroup tree for an unprivileged container might look like this:

root@a1:/# ls -al /sys/fs/cgroup/
total 0
drwxr-xr-x 14 root   root    360 Nov 21 16:17 .
drwxr-xr-x 10 nobody nogroup   0 Nov 14 17:10 ..
drwxr-xr-x  2 nobody nogroup   0 Nov 21 16:15 blkio
lrwxrwxrwx  1 root   root     11 Nov 21 16:17 cpu -> cpu,cpuacct
drwxr-xr-x  2 nobody nogroup   0 Nov 21 16:15 cpu,cpuacct
lrwxrwxrwx  1 root   root     11 Nov 21 16:17 cpuacct -> cpu,cpuacct
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 cpuset
drwxr-xr-x  2 nobody nogroup   0 Nov 10 15:18 devices
drwxrwxr-x  2 nobody root      0 Nov 21 16:17 freezer
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 hugetlb
drwxrwxr-x  2 nobody root      0 Nov 21 16:17 memory
lrwxrwxrwx  1 root   root     16 Nov 21 16:17 net_cls -> net_cls,net_prio
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 net_cls,net_prio
lrwxrwxrwx  1 root   root     16 Nov 21 16:17 net_prio -> net_cls,net_prio
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 perf_event
drwxr-xr-x  2 nobody nogroup   0 Nov 21 16:15 pids
drwxrwxr-x  5 nobody root      0 Nov 21 16:17 systemd
drwxr-xr-x  2 root   root     40 Nov 21 16:17 unified

Writable cgroups will have their gid chown()ed to the container's root user. Non-writable cgroups will belong to nobody:nogroup, which is the overflow {g,u}id. Iirc the overflow id can be read from /proc/sys/kernel/overflow{g,u}id.
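The ownership check suggested here could look roughly like the sketch below. Both function names are hypothetical, the fallback to 65534 when procfs is unreadable is an assumption, and ownership is only a heuristic: as noted earlier in the thread, who owns the directory depends on how the container was set up.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Read the overflow UID from procfs; fall back to 65534 if it cannot be read. */
static uid_t read_overflow_uid(void) {
        unsigned long v = 65534;
        FILE *f = fopen("/proc/sys/kernel/overflowuid", "r");

        if (f) {
                if (fscanf(f, "%lu", &v) != 1)
                        v = 65534;
                fclose(f);
        }

        return (uid_t) v;
}

/* Returns 1 if `path` is owned by the overflow UID, 0 if it is not,
 * and -1 if stat() fails. */
static int owned_by_overflow_uid(const char *path) {
        struct stat st;

        if (stat(path, &st) < 0)
                return -1;

        return st.st_uid == read_overflow_uid();
}
```

With the listing above, `owned_by_overflow_uid("/sys/fs/cgroup/cpuset")` would be expected to return 1 and the same call on `/sys/fs/cgroup/unified` 0, although the group ownership (as for `freezer` and `memory`) would need a separate check on `st_gid`.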

evverx · 2017-11-21T16:42:32Z

@brauner, /sys/fs/cgroup/unified owned by root:root suggests that you have already updated and restarted the container runtime. This patch is supposed to prevent you from doing this :-)

brauner · 2017-11-21T16:44:27Z

@evverx, I've delegated the unified hierarchy to my unprivileged user before so systemd inside the container is free to create and mount. :)

chb@conventiont|~
> cat /proc/self/cgroup
11:hugetlb:/
10:cpuset:/
9:perf_event:/
8:blkio:/user.slice
7:freezer:/user/chb/0
6:pids:/user.slice/user-1000.slice/session-1.scope
5:net_cls,net_prio:/
4:devices:/user.slice
3:cpu,cpuacct:/user.slice
2:memory:/user/chb/0
1:name=systemd:/user.slice/user-1000.slice/session-1.scope
0::/user.slice/user-1000.slice/session-1.scope

chb@conventiont|~
> ls -al /sys/fs/cgroup/unified/user.slice/user-1000.slice/session-1.scope/
total 0
drwxr-xr-x  4 chb  chb  0 Nov  3 13:32 .
drwxr-xr-x  4 root root 0 Nov  2 22:14 ..
-r--r--r--  1 root root 0 Nov  2 22:14 cgroup.controllers
-r--r--r--  1 root root 0 Nov  2 22:14 cgroup.events
-rw-r--r--  1 root root 0 Nov  2 22:14 cgroup.max.depth
-rw-r--r--  1 root root 0 Nov  2 22:14 cgroup.max.descendants
-rw-r--r--  1 chb  chb  0 Nov 21 17:17 cgroup.procs
-r--r--r--  1 root root 0 Nov  2 22:14 cgroup.stat
-rw-r--r--  1 chb  chb  0 Nov  2 22:14 cgroup.subtree_control
-rw-r--r--  1 chb  chb  0 Nov  2 22:14 cgroup.threads
-rw-r--r--  1 root root 0 Nov  2 22:14 cgroup.type
drwxr-xr-x 20 chb  chb  0 Nov 21 17:35 lxc
drwxr-xr-x  3 root root 0 Nov  4 07:50 user

brauner · 2017-11-21T16:45:21Z

I just wanted to illustrate how you can recognize unwritable cgroups in our case if that helps you. :)

poettering · 2017-11-22T13:46:13Z

> access(W_OK) won't work if something like AppArmor is used to block access to /sys/fs/cgroup/unified,

Well, but I'd call that an AA misconfiguration. I mean, if AA is misconfigured, and doesn't allow access to something that is there, then that's something to fix in the AA policy. I mean, we really don't need to support all possible security policies in the world. It's fine to support kernel concepts such as unpriv userns (even though I think userns is a deeply flawed concept) but I am not convinced we should tape over all kinds of bad MAC setups really...

poettering · 2017-11-22T13:47:13Z

> /sys/fs/cgroup/unified is likely to be owned by nobody:nogroup, but it really depends on a few different factors.

If so, then that's what I'd prefer. (But W_OK would be much better even)

poettering · 2017-11-22T13:49:48Z

> I'm not sure what exactly is wrong with mkdir, because I think that trying to create something is the only reliable way to check whether something can be created.

Well, creating something means inotify is generated, mtimes are bumped yadda yada, and means we might leave useless stuff around when we hit some issue. And cgroupsv2 kinda pushes people to use inotify on cgroupfs (for getting events), hence I'd very much prefer we'd avoid creating any objects we aren't really intending to keep.
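The side effect mentioned here can be observed with a small sketch like the one below. It is written against an ordinary directory; the helper name `probe_and_collect_events()` and the probe directory name are made up for the example, and nothing here is taken from systemd.

```c
#include <stdio.h>
#include <sys/inotify.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Watches `dir`, creates and removes a probe directory inside it, and
 * returns the OR of all inotify event masks seen (0 on setup failure). */
static unsigned probe_and_collect_events(const char *dir) {
        char path[4096];
        char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
        unsigned seen = 0;
        ssize_t len;
        int fd, n;

        n = snprintf(path, sizeof(path), "%s/.probe-%ld", dir, (long) getpid());
        if (n < 0 || (size_t) n >= sizeof(path))
                return 0;

        fd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
        if (fd < 0)
                return 0;

        if (inotify_add_watch(fd, dir, IN_CREATE | IN_DELETE) < 0) {
                close(fd);
                return 0;
        }

        /* The write test itself: this is what any other watcher gets to see. */
        if (mkdir(path, 0755) == 0)
                rmdir(path);

        /* Drain whatever is queued; the non-blocking read ends the loop. */
        while ((len = read(fd, buf, sizeof(buf))) > 0) {
                for (char *p = buf; p < buf + len; ) {
                        struct inotify_event *ev = (struct inotify_event *) p;

                        seen |= ev->mask;
                        p += sizeof(struct inotify_event) + ev->len;
                }
        }

        close(fd);
        return seen;
}
```

On a regular filesystem the returned mask contains both `IN_CREATE` and `IN_DELETE`, which is the kind of noise a mkdir-based probe produces for anyone watching the directory.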

poettering · 2017-11-22T13:51:44Z

> Non-writable cgroups will belong to nobody:nogroup which is the overflow {g,u}id. Iirc the overflow id can be grabbed from /proc/sys/kernel/overflow{g,u}id

(Side note: in systemd, we do not support using an overflow UID/GID that is not 65534. If you define it to anything else, you are on your own... Quite frankly I find it really strange that it is configurable in the kernel in the first place. It's like making the root user's UID configurable, or the PID of the init system...)

brauner · 2017-11-22T13:55:22Z

On Wed, Nov 22, 2017 at 01:51:52PM +0000, Lennart Poettering wrote:
> > Non-writable cgroups will belong to nobody:nogroup which is the overflow {g,u}id. Iirc the overflow id can be grabbed from /proc/sys/kernel/overflow{g,u}id
>
> (Side note: in systemd, we do not support using an overflow UID/GID that is not 65534. If you define it to anything else, you are on your own... Quite frankly I find it really strange that it is configurable in the kernel in the first place. It's like making the root user's UID configurable, or the PID of the init system...)

I don't know what you mean by "support". liblxc will not interact with these files at all. But system administrators will be free to change this value behind systemd's and liblxc's back nonetheless.


brauner · 2017-11-22T14:00:09Z

On Wed, Nov 22, 2017 at 01:46:20PM +0000, Lennart Poettering wrote:
> > access(W_OK) won't work if something like AppArmor is used to block access to /sys/fs/cgroup/unified,
>
> Well, but I'd call that an AA misconfiguration. I mean, if AA is misconfigured, and doesn't allow access to something that is there, then that's something to fix in the AA policy. I mean, we really don't need to support all possible security policies in the world. It's fine to support kernel concepts such as unpriv userns (even though I think userns is a deeply flawed concept) but I am not convinced we should tape over all kinds of bad MAC setups really...

I'm currently not sure how AppArmor and access() interact. In any case, iirc AppArmor didn't know about cgroup2 being a valid filesystem in the first place and so denied the mount by default. I've filed a bug and the AppArmor guys are fixing this.


evverx · 2017-11-22T14:01:21Z

@poettering, I agree with you to some extent, but what you are suggesting doesn't work for me, so I'm going to use this patch locally as I've been doing for a couple of months.

@brauner, would you mind sending another PR where access(W_OK) would be used?

brauner · 2017-11-22T14:06:13Z

@evverx, have you ever tried running the container without AppArmor? I can reproduce this behavior without AppArmor. So yeah, having an access(W_OK) check in there might actually help.

evverx · 2017-11-22T14:17:30Z

@brauner, I used AppArmor as an example of why access(W_OK) is not as useful as it might seem to be, but actually I don't use AppArmor at all.

> AppArmor didn't know about cgroup2 being a valid filesystem in the first place
> and so denied the mount by default. I've filed a bug and the AppArmor guys are
> fixing this.

If I remember correctly, only this prevents systemd from mass freezing on Ubuntu.

brauner · 2017-11-22T14:30:28Z

On Wed, Nov 22, 2017 at 02:17:38PM +0000, Evgeny Vereshchagin wrote:
> @brauner, I used `AppArmor` as an example of why `access(W_OK)` is not as useful as it might seem to be, but actually I don't use `AppArmor` at all.

Oh, that's weird though... I just wrote a little stupid test program

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        printf("Is \"/sys/fs/cgroup/unified\" writable: %d\n", access("/sys/fs/cgroup/unified", W_OK));
        return 0;
}

and ran it in an AppArmor confined and in an unconfined container:

1. confined:

root@a1:/# cat /proc/self/attr/current
lxc-container-default-cgns (enforce)
root@a1:/# ./test
Is "/sys/fs/cgroup/unified" writable: 1

2. unconfined:

root@a1:/# cat /proc/self/attr/current
unconfined
root@a1:/# ./test
Is "/sys/fs/cgroup/unified" writable: 0

evverx · 2017-11-22T14:56:51Z

It seems that writable: 1 should be writable: -1 when lxc is confined. Could a minus sign have been lost while being pasted?

brauner · 2017-11-22T14:59:27Z

On Wed, Nov 22, 2017 at 02:57:01PM +0000, Evgeny Vereshchagin wrote:
> It seems that `writable: 1` should be `writable: -1` when `lxc` is confined. Could a minus sign have been lost while pasting?

The test program checks whether access(path, W_OK) == 0


evverx · 2017-11-22T15:16:31Z

I'm not sure what is going on. Could it be that /sys/fs/cgroup/unified is set up differently depending on whether lxc.aa_profile=unconfined is used or not?

brauner · 2017-11-22T15:34:40Z

On Wed, Nov 22, 2017 at 03:16:37PM +0000, Evgeny Vereshchagin wrote:
> I'm not sure what is going on. Could it be that `/sys/fs/cgroup/unified` is set up differently depending on whether `lxc.aa_profile=unconfined` is used or not?

So what I understand so far - and I haven't been able to dig into this issue deep enough - is:

1. confined container: AppArmor will refuse the mount for a cgroup2 filesystem on /sys/fs/cgroup because our current AppArmor rules disallow it. In which case systemd will fail to mount the unified cgroup hierarchy and happily move on.
2. unconfined container: Here AppArmor will not refuse the mount and systemd will be able to mount the unified cgroup hierarchy. But it also expects it to be writable, which it isn't, and so the boot freezes.

I've tested a systemd patch with access(W_OK) which I'm going to send soon, and with this patch applied systemd boots just fine even if the unified cgroup hierarchy is mountable but not writable.

Christian

evverx · 2017-11-22T15:36:20Z

@brauner, could you check what will happen if you mount cgroup2 on /sys/fs/cgroup/unified before calling access when lxc is unconfined?

evverx · 2017-11-22T15:37:50Z

@brauner, sorry, I left my comment before yours appeared.

brauner · 2017-11-22T16:02:34Z

On Wed, Nov 22, 2017 at 03:37:56PM +0000, Evgeny Vereshchagin wrote:
> @brauner, sorry, I left my comment before yours appeared.

No problem at all. What I'm currently thinking about is whether we should still mount the cgroup even though it's not writable. I'm going to send a first version of the patch that umounts() the unified hierarchy again when it detects that it isn't writable. I think this behavior is correct.
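The behavior described here could be sketched as below. This is an illustration only, not the patch that was later sent: `mount_unified_if_writable()` is a hypothetical name, and the mount flags are an assumption about what a typical cgroupfs mount would use.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>
#include <unistd.h>

/* Hypothetical helper: mounts cgroup2 on `where` and keeps the mount only if
 * access(W_OK) reports it as writable. Returns 0 on success, or a negative
 * errno if the mount failed or the hierarchy was unwritable and unmounted. */
static int mount_unified_if_writable(const char *where) {
        /* The flags are an assumption for this sketch. */
        if (mount("cgroup2", where, "cgroup2", MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL) < 0)
                return -errno;

        if (access(where, W_OK) < 0) {
                int r = -errno;

                /* Not usable: undo the mount so that later setup steps do
                 * not mistake it for a working unified hierarchy. */
                (void) umount(where);
                return r;
        }

        return 0;
}
```

A caller would treat any negative return value as "no usable unified hierarchy" and continue with the legacy setup.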

evverx · 2017-11-22T16:06:45Z

I don't think that cgroup2 should be mounted if it is not writable, because systemd currently expects that anything that is mounted can be used, and fails to boot with the same "Failed to create /init.scope control group" error when it cannot create init.scope.

evverx · 2017-11-22T16:11:47Z

It seems that I misunderstood what you had written. I think that you are right that cgroup should be unmounted if it is not writable.

evverx · 2017-11-22T16:16:44Z

In fact, if cgroup2 is not unmounted, systemd assumes during the second attempt to mount_one(/sys/fs/cgroup/unified) that the hybrid hierarchy is used, and fails to boot.

evverx added cgroups pid1 labels Nov 21, 2017

evverx mentioned this pull request Nov 21, 2017

lxc: Failed to activate service 'org.freedesktop.systemd1': timed out after update to 233.75-3 from 232-8 #6408

Closed

evverx force-pushed the try-creating-tmp-dir branch from 7503e86 to ce7b4d9 Compare November 21, 2017 00:59

evverx force-pushed the try-creating-tmp-dir branch from ce7b4d9 to 461ef01 Compare November 21, 2017 11:51

poettering added the needs-discussion 🤔 label Nov 21, 2017

evverx closed this Nov 22, 2017

evverx deleted the try-creating-tmp-dir branch November 22, 2017 16:23

brauner mentioned this pull request Nov 22, 2017

cgroup: skip unwritable cgroups #7420

Merged
