-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cgroup: try creating a temporary directory after mounting /sys/fs/cgroup/unified
#7401
Conversation
7503e86
to
ce7b4d9
Compare
…roup/unified` It's possible for `systemd` inside an unprivileged user namespace container to be able to mount `cgroup2` on `/sys/fs/cgroup/unified` without being able to create directories there. When this happens, `systemd` fails to boot, making it impossible to reexecute itself without restarting the container runtime. In this patch the issue is avoided by trying creating a temporary directory after mounting `cgroup2` and falling back to `v1` if `mkdir` fails. Closes systemd#6408 and lxc/lxc#1678.
Thanks man! :) |
ce7b4d9
to
461ef01
Compare
Hmpf, I really don't like that this actually creates something in the cgroupfs... Are you sure that access(W_OK) wouldn't detect this case too, without actually changing things on disk? How exactly does the non-writable cgroupfs look like, when this fails? who owns the dirs in there? If the access(W_OK) doesn't work, maybe we can filter this out if the owner of the dir is "nobody" or so? |
access("/sys/fs/cgroup/unified", W_OK) = 0
mkdir("/sys/fs/cgroup/unified/init.scope", 0777) = -1 EACCES (Permission denied)
I'm not sure what exactly is wrong with |
So for most controllers
Writable cgroups will have their gid chow()ned to the container's root user. Non-writable cgroups will belong to |
@brauner, |
@evverx, I've delegated the unified hierarchy to my unprivileged user before so systemd inside the container is free to create and mount. :)
|
I just wanted to illustrate how you can recognize unwritable cgroups in our case if that helps you. :) |
Well, but I'd call that an AA misconfiguration. I mean, if AA is misconfigured, and doesn't allow access to something that is there, then that's something to fix in the AA policy. I mean, we really don't need to support all possible security policies in the world. It's fine to support kernel concepts such as unpriv userns (even though I think userns is a deeply flawed concept) but I am not convinced we should tape over all kinds of bad MAC setups really... |
If so, then that's what I'd prefer. (But W_OK would be much better even) |
Well, creating something means inotify is generated, mtimes are bumped yadda yada, and means we might leave useless stuff around when we hit some issue. And cgroupsv2 kinda pushes people to use inotify on cgroupfs (for getting events), hence I'd very much prefer we'd avoid creating any objects we aren't really intending to keep. |
(Side note: in systemd, we do not support using overflow UID/GID that is not 65534. If you define it to anything else, you are on your own... Quite frankly I find it really strange that is is configurable in the kernel in the first place. It's like making the root user's UID configurable, or the PID of the init system...) |
On Wed, Nov 22, 2017 at 01:51:52PM +0000, Lennart Poettering wrote:
> Non-writable cgroups will belong to nobody:nogroup which is the overflow {g,u}id. Iirc the overflow id can be grabbed from /proc/sys/kernel/overflow{g,u}id
(Side note: in systemd, we do not support using overflow UID/GID that is not 65534. If you define it to anything else, you are on your own... Quite frankly I find it really strange that is is configurable in the kernel in the first place. It's like making the root user's UID configurable, or the PID of the init system...)
I don't know what you mean by "support". liblxc will not interact with these
files at all. But system administrators will be free to change this value behind
systemd's and liblxc's back nonetheless.
…
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#7401 (comment)
|
On Wed, Nov 22, 2017 at 01:46:20PM +0000, Lennart Poettering wrote:
> access(W_OK) won't work if something like AppArmor is used to block access to /sys/fs/cgroup/unified,
Well, but I'd call that an AA misconfiguration. I mean, if AA is misconfigured, and doesn't allow access to something that is there, then that's something to fix in the AA policy. I mean, we really don't need to support all possible security policies in the world. It's fine to support kernel concepts such as unpriv userns (even though I think userns is a deeply flawed concept) but I am not convinced we should tape over all kinds of bad MAC setups really...
I'm currently not sure how AppArmor and access() interact. In any case, iirc
AppArmor didn't know about cgroup2 being a valid filesystem in the first place
and so denied the mount by default. I've filed a bug and the AppArmor guys are
fixing this.
…
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#7401 (comment)
|
@poettering, I agree with you to some extent, but what you are suggesting doesn't work for me, so I'm going to use this patch locally as I've been doing for a couple of months. @brauner, would you mind sending another PR where |
@evverx, have you ever tried running the container without AppArmor because I can reproduce this behavior without AppArmor. So yeah, having an |
@brauner, I used
If I remember correctly, only this prevents |
On Wed, Nov 22, 2017 at 02:17:38PM +0000, Evgeny Vereshchagin wrote:
@brauner, I used `AppArmor` as an example of why `access(W_OK)` is not as useful as it might seem to be, but actually I don't use `AppArmor` at all.
Oh, that's weird though... I just wrote a little stupid test program
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
printf("Is \"\/sys/fs/cgroup/unified\" writable: %d\n", access("/sys/fs/cgroup/unified", W_OK);
return 0;
}
and ran it
in a AppArmor confined and in an unconfined container:
1. confined:
root@a1:/# cat /proc/self/attr/current
lxc-container-default-cgns (enforce)
root@a1:/# ./test
Is "/sys/fs/cgroup/unified" writable: 1
2. unconfined:
root@a1:/# cat /proc/self/attr/current
unconfined
root@a1:/# ./test
Is "/sys/fs/cgroup/unified" writable: 0
|
It seems that |
On Wed, Nov 22, 2017 at 02:57:01PM +0000, Evgeny Vereshchagin wrote:
It seems that `writable: 1` should be `writable: -1` when `lxc` is confined. Could a minus sign have lost while pasting?
The test program checks whether
access(path, W_OK) == 0
…
--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
#7401 (comment)
|
I'm not sure what is going on. Could it be that |
On Wed, Nov 22, 2017 at 03:16:37PM +0000, Evgeny Vereshchagin wrote:
I'm not sure what is going on. Could it be that `/sys/fs/cgroup/unified` is set up differently depending on whether `lxc.aa_profile=unconfined` is used or not?
So what I understand so far - and I haven't been able to dig into this issue
deep enough - is:
1. confined container:
AppArmor will refuse the mount for a cgroup2 filesystem on /sys/fs/cgroup
because our current AppArmor rules disallow it. In which case systemd will fail
to mount the unified cgroup hierarchy and happily move on.
2. unconfined container:
Here AppArmor will not refuse the mount and systemd will be able to mount the
unified cgroup hierachy. But it also expects it to be writable which it isn't
and so the boot freezes.
I've tested a systemd patch with access(W_OK) which I'm going to send soon and
with this patch applied systemd boots just fine even if the unified cgroup
hierarchy is mountable but not writable.
Christian
|
@brauner, could you check what will happen if you mount |
@brauner, sorry, I left my comment before yours appeared. |
On Wed, Nov 22, 2017 at 03:37:56PM +0000, Evgeny Vereshchagin wrote:
@brauner, sorry, I left my comment before yours appeared.
No problem at all. What I'm currently thinking about is whether we should still
mount the cgroup even though it's not writable. I'm going to send a first
version of the patch that umounts() the unified hierarchy again when it detects
that it isn't writable. I think this behavior is correct.
|
I don't think that |
It seems that I misunderstood what you had written. I think that you are right that |
In fact, if |
It's possible for
systemd
inside an unprivileged user namespace containerto be able to mount
cgroup2
on/sys/fs/cgroup/unified
without being ableto create directories there. When this happens,
systemd
fails to boot, makingit impossible to reexecute itself without restarting the container runtime.
In this patch the issue is avoided by trying creating a temporary directory
after mounting
cgroup2
and falling back tov1
ifmkdir
fails.Closes #6408 and lxc/lxc#1678.