cgroup: try creating a temporary directory after mounting /sys/fs/cgroup/unified #7401

Closed

evverx
Member

@evverx evverx commented Nov 21, 2017

It's possible for systemd inside an unprivileged user namespace container
to be able to mount cgroup2 on /sys/fs/cgroup/unified without being able
to create directories there. When this happens, systemd fails to boot, making
it impossible to reexecute itself without restarting the container runtime.

In this patch the issue is avoided by trying to create a temporary directory
after mounting cgroup2 and falling back to v1 if mkdir fails.
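
The probe described above can be illustrated outside of systemd. The following is a minimal Python sketch, not the C code from the patch: the helper name `can_create_directories` and the `.probe-` prefix are made up for illustration, and the set of errno values treated as "not writable" is an assumption.

```python
import errno
import os
import tempfile


def can_create_directories(parent):
    """Return True if a directory can be created (and removed) under parent."""
    try:
        # mkdtemp picks a unique name, so the probe cannot collide
        # with an existing entry such as a real cgroup.
        probe = tempfile.mkdtemp(prefix=".probe-", dir=parent)
    except OSError as exc:
        # Assumption: these errors mean "mounted but not usable".
        if exc.errno in (errno.EACCES, errno.EPERM, errno.EROFS):
            return False
        raise
    # Clean up immediately so nothing is left behind.
    os.rmdir(probe)
    return True
```

A caller would run `can_create_directories("/sys/fs/cgroup/unified")` right after the mount and fall back to v1 when it returns `False`.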

Closes #6408 and lxc/lxc#1678.

@brauner
Contributor

brauner commented Nov 21, 2017

Thanks man! :)

@poettering
Member

Hmpf, I really don't like that this actually creates something in the cgroupfs... Are you sure that access(W_OK) wouldn't detect this case too, without actually changing things on disk?

What exactly does the non-writable cgroupfs look like when this fails? Who owns the dirs in there? If access(W_OK) doesn't work, maybe we can filter this out if the owner of the dir is "nobody" or so?

@evverx
Member Author

evverx commented Nov 21, 2017

access(W_OK) won't work if something like AppArmor is used to block access to /sys/fs/cgroup/unified. That is, it is quite easy to get the following:

access("/sys/fs/cgroup/unified", W_OK)            = 0
mkdir("/sys/fs/cgroup/unified/init.scope", 0777)         = -1 EACCES (Permission denied)

/sys/fs/cgroup/unified is likely to be owned by nobody:nogroup, but it really depends on a few different factors.

I'm not sure what exactly is wrong with mkdir, because I think that trying to create something is the only reliable way to check whether something can be created.

@brauner
Contributor

brauner commented Nov 21, 2017

> What exactly does the non-writable cgroupfs look like when this fails? Who owns the dirs in there? If access(W_OK) doesn't work, maybe we can filter this out if the owner of the dir is "nobody" or so?

So for most controllers liblxc will opportunistically use them if they are writable. It only has very few requirements on what cgroups should be writable. I think the only one right now is freezer for legacy reasons (Might rework that one soon.). So a cgroup tree for an unprivileged container might look like this:

root@a1:/# ls -al /sys/fs/cgroup/
total 0
drwxr-xr-x 14 root   root    360 Nov 21 16:17 .
drwxr-xr-x 10 nobody nogroup   0 Nov 14 17:10 ..
drwxr-xr-x  2 nobody nogroup   0 Nov 21 16:15 blkio
lrwxrwxrwx  1 root   root     11 Nov 21 16:17 cpu -> cpu,cpuacct
drwxr-xr-x  2 nobody nogroup   0 Nov 21 16:15 cpu,cpuacct
lrwxrwxrwx  1 root   root     11 Nov 21 16:17 cpuacct -> cpu,cpuacct
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 cpuset
drwxr-xr-x  2 nobody nogroup   0 Nov 10 15:18 devices
drwxrwxr-x  2 nobody root      0 Nov 21 16:17 freezer
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 hugetlb
drwxrwxr-x  2 nobody root      0 Nov 21 16:17 memory
lrwxrwxrwx  1 root   root     16 Nov 21 16:17 net_cls -> net_cls,net_prio
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 net_cls,net_prio
lrwxrwxrwx  1 root   root     16 Nov 21 16:17 net_prio -> net_cls,net_prio
dr-xr-xr-x  3 nobody nogroup   0 Nov  4 07:27 perf_event
drwxr-xr-x  2 nobody nogroup   0 Nov 21 16:15 pids
drwxrwxr-x  5 nobody root      0 Nov 21 16:17 systemd
drwxr-xr-x  2 root   root     40 Nov 21 16:17 unified

Writable cgroups will have their gid chown()ed to the container's root user. Non-writable cgroups will belong to nobody:nogroup, which is the overflow {g,u}id. IIRC the overflow id can be read from /proc/sys/kernel/overflow{g,u}id.
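
The ownership heuristic described above could be sketched as follows. This is a Python illustration only: the helper names are hypothetical, and the fallback to 65534 when the procfs files cannot be read is an assumption based on the kernel's default overflow ID.

```python
import os


def read_overflow_id(path, default=65534):
    """Read an overflow ID from procfs, falling back to the default 65534."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def owned_by_overflow_ids(directory,
                          uid_file="/proc/sys/kernel/overflowuid",
                          gid_file="/proc/sys/kernel/overflowgid"):
    """Return (uid_is_overflow, gid_is_overflow) for directory."""
    st = os.stat(directory)
    return (st.st_uid == read_overflow_id(uid_file),
            st.st_gid == read_overflow_id(gid_file))
```

In the listing above, the writable hierarchies (freezer, memory, systemd) would yield `(True, False)` because only their gid was changed, while the non-writable ones would yield `(True, True)`.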

@evverx
Member Author

evverx commented Nov 21, 2017

@brauner, /sys/fs/cgroup/unified owned by root:root suggests that you have already updated and restarted the container runtime. This patch is supposed to prevent you from doing this :-)

@brauner
Contributor

brauner commented Nov 21, 2017

@evverx, I've delegated the unified hierarchy to my unprivileged user before so systemd inside the container is free to create and mount. :)

chb@conventiont|~
> cat /proc/self/cgroup
11:hugetlb:/
10:cpuset:/
9:perf_event:/
8:blkio:/user.slice
7:freezer:/user/chb/0
6:pids:/user.slice/user-1000.slice/session-1.scope
5:net_cls,net_prio:/
4:devices:/user.slice
3:cpu,cpuacct:/user.slice
2:memory:/user/chb/0
1:name=systemd:/user.slice/user-1000.slice/session-1.scope
0::/user.slice/user-1000.slice/session-1.scope
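
For readers unfamiliar with this format: each line is `hierarchy-ID:controller-list:path`, and the cgroup v2 (unified) entry is the one with hierarchy ID 0 and an empty controller list. A small Python sketch of picking it out (the function names are made up for illustration):

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def unified_path(entries):
    """Return the cgroup v2 path (hierarchy ID 0, no controllers), or None."""
    for hierarchy_id, controllers, path in entries:
        if hierarchy_id == 0 and not controllers:
            return path
    return None
```

Applied to the output above, `unified_path` would return `/user.slice/user-1000.slice/session-1.scope`, which is the directory listed next.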

chb@conventiont|~
> ls -al /sys/fs/cgroup/unified/user.slice/user-1000.slice/session-1.scope/
total 0
drwxr-xr-x  4 chb  chb  0 Nov  3 13:32 .
drwxr-xr-x  4 root root 0 Nov  2 22:14 ..
-r--r--r--  1 root root 0 Nov  2 22:14 cgroup.controllers
-r--r--r--  1 root root 0 Nov  2 22:14 cgroup.events
-rw-r--r--  1 root root 0 Nov  2 22:14 cgroup.max.depth
-rw-r--r--  1 root root 0 Nov  2 22:14 cgroup.max.descendants
-rw-r--r--  1 chb  chb  0 Nov 21 17:17 cgroup.procs
-r--r--r--  1 root root 0 Nov  2 22:14 cgroup.stat
-rw-r--r--  1 chb  chb  0 Nov  2 22:14 cgroup.subtree_control
-rw-r--r--  1 chb  chb  0 Nov  2 22:14 cgroup.threads
-rw-r--r--  1 root root 0 Nov  2 22:14 cgroup.type
drwxr-xr-x 20 chb  chb  0 Nov 21 17:35 lxc
drwxr-xr-x  3 root root 0 Nov  4 07:50 user

@brauner
Contributor

brauner commented Nov 21, 2017

I just wanted to illustrate how you can recognize unwritable cgroups in our case if that helps you. :)

@poettering
Member

> access(W_OK) won't work if something like AppArmor is used to block access to /sys/fs/cgroup/unified,

Well, but I'd call that an AA misconfiguration. I mean, if AA is misconfigured, and doesn't allow access to something that is there, then that's something to fix in the AA policy. I mean, we really don't need to support all possible security policies in the world. It's fine to support kernel concepts such as unpriv userns (even though I think userns is a deeply flawed concept) but I am not convinced we should tape over all kinds of bad MAC setups really...

@poettering
Member

> /sys/fs/cgroup/unified is likely to be owned by nobody:nogroup, but it really depends on a few different factors.

If so, then that's what I'd prefer. (But W_OK would be even better.)

@poettering
Member

poettering commented Nov 22, 2017

> I'm not sure what exactly is wrong with mkdir, because I think that trying to create something is the only reliable way to check whether something can be created.

Well, creating something means inotify events are generated, mtimes are bumped, yadda yadda, and it means we might leave useless stuff around when we hit some issue. And cgroup v2 kinda pushes people to use inotify on cgroupfs (for getting events), hence I'd very much prefer we avoid creating any objects we aren't really intending to keep.
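
The mtime side effect mentioned here is easy to observe on an ordinary filesystem. The Python sketch below uses a hypothetical helper name, and whether cgroupfs behaves identically is assumed rather than verified. It creates and removes a probe directory and reports whether the parent's modification time changed as a result.

```python
import os
import tempfile


def mkdir_probe_changes_mtime(parent):
    """Create and remove a directory in parent; report whether parent's mtime changed."""
    before = os.stat(parent).st_mtime_ns
    # The probe directory is removed again, but the parent has still been modified.
    probe = tempfile.mkdtemp(dir=parent)
    os.rmdir(probe)
    return os.stat(parent).st_mtime_ns != before
```

Even though the probe leaves no directory behind, the parent is not left exactly as it was, which is the concern raised above.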

@poettering
Member

> Non-writable cgroups will belong to nobody:nogroup which is the overflow {g,u}id. IIRC the overflow id can be grabbed from /proc/sys/kernel/overflow{g,u}id

(Side note: in systemd, we do not support using an overflow UID/GID that is not 65534. If you define it to anything else, you are on your own... Quite frankly I find it really strange that it is configurable in the kernel in the first place. It's like making the root user's UID configurable, or the PID of the init system...)

@brauner
Contributor

brauner commented Nov 22, 2017 via email

@brauner
Contributor

brauner commented Nov 22, 2017 via email

@evverx
Member Author

evverx commented Nov 22, 2017

@poettering, I agree with you to some extent, but what you are suggesting doesn't work for me, so I'm going to use this patch locally as I've been doing for a couple of months.

@brauner, would you mind sending another PR where access(W_OK) would be used?

@evverx evverx closed this Nov 22, 2017
@brauner
Contributor

brauner commented Nov 22, 2017

@evverx, have you ever tried running the container without AppArmor? I ask because I can reproduce this behavior without AppArmor. So yeah, having an access(W_OK) check in there might actually help.

@evverx
Member Author

evverx commented Nov 22, 2017

@brauner, I used AppArmor as an example of why access(W_OK) is not as useful as it might seem to be, but actually I don't use AppArmor at all.

AppArmor didn't know about cgroup2 being a valid filesystem in the first place
and so denied the mount by default. I've filed a bug and the AppArmor guys are
fixing this.

If I remember correctly, only this prevents systemd from mass freezing on Ubuntu.

@brauner
Contributor

brauner commented Nov 22, 2017 via email

@evverx
Member Author

evverx commented Nov 22, 2017

It seems that writable: 1 should be writable: -1 when lxc is confined. Could a minus sign have been lost while being pasted?

@brauner
Contributor

brauner commented Nov 22, 2017 via email

@evverx
Member Author

evverx commented Nov 22, 2017

I'm not sure what is going on. Could it be that /sys/fs/cgroup/unified is set up differently depending on whether lxc.aa_profile=unconfined is used or not?

@brauner
Contributor

brauner commented Nov 22, 2017 via email

@evverx
Member Author

evverx commented Nov 22, 2017

@brauner, could you check what will happen if you mount cgroup2 on /sys/fs/cgroup/unified before calling access when lxc is unconfined?

@evverx
Member Author

evverx commented Nov 22, 2017

@brauner, sorry, I left my comment before yours appeared.

@brauner
Contributor

brauner commented Nov 22, 2017 via email

@evverx
Member Author

evverx commented Nov 22, 2017

I don't think that cgroup2 should be mounted if it is not writable, because systemd currently expects that anything that is mounted can be used, and it fails to boot with the same "Failed to create /init.scope control group" error when it cannot create init.scope.

@evverx
Member Author

evverx commented Nov 22, 2017

It seems that I misunderstood what you had written. I think you are right that cgroup2 should be unmounted if it is not writable.

@evverx
Member Author

evverx commented Nov 22, 2017

In fact, if cgroup2 is not unmounted, systemd assumes during the second attempt to mount_one(/sys/fs/cgroup/unified) that the hybrid hierarchy is in use, and fails to boot.
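
The ordering discussed in the last few comments (mount, check, unmount on failure, then fall back) could be sketched like this. This is a hypothetical Python outline, not systemd's actual logic: all four callables are placeholders supplied by the caller, and none of them corresponds to a real systemd function.

```python
def set_up_cgroup_hierarchy(mount_unified, is_writable, unmount_unified, mount_legacy):
    """Try the unified hierarchy first; undo the mount and fall back if it is unusable."""
    if not mount_unified():
        mount_legacy()
        return "legacy"
    if is_writable():
        return "unified"
    # Leaving a non-writable cgroup2 mounted would confuse later detection,
    # so it is unmounted before falling back.
    unmount_unified()
    mount_legacy()
    return "legacy"
```

The point of the sketch is only the ordering: the unmount happens before the fallback, so a later check does not find a leftover cgroup2 mount.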

@evverx evverx deleted the try-creating-tmp-dir branch November 22, 2017 16:23