[RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels #9483
On the way to unprivileged filesystem mounts, starting with commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems."), we've enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since

    static struct super_block *alloc_super(struct file_system_type *type, int flags,
                                           struct user_namespace *user_ns)
    {
            /* <snip> */
            if (s->s_user_ns != &init_user_ns)
                    s->s_iflags |= SB_I_NODEV;
            /* <snip> */
    }

will set the SB_I_NODEV flag on the filesystem. When a device node created in a non-init userns is open()ed, the call chain will hit:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                   !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

which will cause an EPERM because the device node is located on an fs owned by a non-init userns and thus doesn't grant access to device nodes due to SB_I_NODEV.

It seems that to gracefully handle this case we should bind-mount device nodes unless we're real root.
//Cc @poettering, this would be my suggested fix. |
Hmm, but as I understand this is a major compatibility breakage of the kernel, no? Code that worked previously OK will break on current kernels if userns is used, no? I mean, we can patch around in current system as much as we want, this wouldn't suddenly fix things on already released systemd version would it? Don't the Linux kernel folks keep repeating that mantra of not breaking userspace? This appears to be quite some breakage no? Or what am I missing? |
(specific to the patch: In general, we try to avoid generic checks like what you are proposing as much as we can, and instead try to test for actual behaviour. We use "am i running in a container" checks, and "am i running under userns" only as very last resort. Are we sure this is one of those cases? To me this appears like something that should be reverted in the kernel, and they should find a different approach. For example, it would already be sufficient if the API device nodes, such as /dev/null, /dev/zero or /dev/urandom would just be whitelisted in the kernel so that they can not only be mknod()ed, but also be open()ed afterwards.) |
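To make the "test for actual behaviour" idea concrete, here is a minimal sketch in C. It is not the code from this PR or from systemd: the function names `bind_mount_device` and `make_device_node` are hypothetical, and the choice to treat EPERM/EACCES as "fall back to a bind mount" is an assumption. The sketch tries mknod(), verifies the node by actually opening it, and only bind-mounts the host's node if that fails.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: bind-mount an existing device node `src` onto `dst`.
 * Returns 0 on success, -errno on failure. */
int bind_mount_device(const char *src, const char *dst)
{
    /* A bind mount needs an existing target, so create an empty file first. */
    int fd = open(dst, O_WRONLY | O_CREAT | O_CLOEXEC | O_NOCTTY, 0644);
    if (fd < 0)
        return -errno;
    close(fd);

    if (mount(src, dst, NULL, MS_BIND, NULL) < 0) {
        int saved = errno;
        unlink(dst);
        return -saved;
    }
    return 0;
}

/* Hypothetical helper: create `dst` with mknod() and check that it can
 * really be opened; otherwise fall back to bind-mounting `src`.
 * Returns 0 on success, -errno on failure. */
int make_device_node(const char *src, const char *dst, mode_t mode, dev_t dev)
{
    if (mknod(dst, mode, dev) == 0) {
        int fd = open(dst, O_RDONLY | O_CLOEXEC | O_NOCTTY | O_NONBLOCK);
        if (fd >= 0) {
            close(fd);
            return 0;           /* node exists and is usable */
        }
        unlink(dst);            /* created, but cannot be opened: discard it */
    } else if (errno != EPERM && errno != EACCES) {
        return -errno;          /* unrelated failure, e.g. ENOENT */
    }
    return bind_mount_device(src, dst);
}
```

A caller would use it along the lines of `make_device_node("/dev/null", "/run/private-dev/null", S_IFCHR | 0666, makedev(1, 3))`, where the target path is purely illustrative. Opening a device node as a probe is only harmless for nodes such as /dev/null or /dev/zero; that restriction is part of the assumption here.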
So for The whitelist approach is something I planned on discussing anyway. But I don't think we want to revert enabling |
hmm, interesting, in really old versions of this there actually was some gracefulness in that code (7f112f50fea585411ea2d493b3582bea77eb4d6e), but it was dropped eventually, probably by accident.

Either way, I know some distros have stabilized on 238, i.e. the kernel API compat breakage will hit people.

So far the general assumption was that mknod() was "more privileged" than open(), and thus if I can mknod() the thing I can totally also open it. This is in line with UNIX tradition (as any file node I create I also own, hence I should be able to open it. the concept that something i own cannot be opened by me appears quite surprising and simply wrong to me). But this concept is not just regular UNIX, it also leaked into various Linux concepts. For example, there's a reason why there is CAP_MKNOD, but no CAP_DEVICE_OPEN...

I mean, I am not against making mknod work in userns, but if so you should really make sure that you don't blanket allow mknod() if you then prohibit opening it. That's just conceptually very wrong. Either do it properly (i.e. allow only mknod() on nodes you can also open()) or not at all, but the kernel's behaviour of this right now is just bogus: it allows you to create unusable objects (which userspace then has to work around), and it's already clear that this is only a stopgap, and you are working on fixing this properly.

Also, if there needs to be a workaround, I can see at least three ways how the container manager could work around this more safely and in a way that's compatible with 238 too:
1. on cgroupsv1 the container manager could use the "devices" cgroup controller to only whitelist mknod() for device nodes that can actually work
2. the container manager could simply drop CAP_MKNOD in the container, so that mknod() fails
3. the container manager could install a seccomp filter to filter out mknod() for all devices that cannot be used anyway

Hence, what's the rationale for working around this in the payload, when it could easily be worked around in the container manager instead? I mean, it's a lot easier to fix a few container managers than to fix all images already created out there... That said, the best approach appears to be to fix the kernel instead. They should really be held to their mantra of not breaking userspace... |
On Mon, Jul 02, 2018 at 04:42:51AM -0700, Lennart Poettering wrote:
hmm, interesting, in really old versions of this there actually was some gracefulness in that code (7f112f50fea585411ea2d493b3582bea77eb4d6e), but it was dropped eventually, probably by accident.
Either way, I know some distros have stabilized on 238, i.e. the kernel API compat breakage will hit people.
So far the general assumption was that mknod() was "more privileged" than open(), and thus if I can mknod() the thing I can totally also open it. This is in line with UNIX tradition (as any file node I create I also own, hence I should be able to open it. the concept that something i own cannot be opened by me appears quite surprising and simply wrong to me). But this concept is not just regular UNIX, it also leaked into various Linux concepts. For example, there's a reason why there is CAP_MKNOD, but no CAP_DEVICE_OPEN...
I mean, I am not against making mknod work in userns, but if so you should really make sure that you don't blanket allow mknod() if you then prohibit opening it. That's just conceptually very wrong. Either do it properly (i.e. allow only mknod() on nodes you can also open()) or not at all, but the kernel's behaviour of this right now is just bogus: it allows you to create unusable objects (which userspace then has to work around), and it's already clear that this is only a stopgap, and you are working on fixing this properly.
Actually, you can already end up in a similar situation quite easily even
before that change. Mount options such as MS_NOEXEC and MS_NODEV on any
filesystem will still allow you to set the exec bit or create device
nodes, but not to execute the files in question or open them. So there's
precedent. The counter argument - with which I sympathize - is that
there's a difference between explicitly mounting an fs with MS_NODEV and
having the kernel implicitly mark the filesystem as SB_I_NODEV
internally. In the former case the open() refusal is somewhat
transparent in that you can discover whether the fs was mounted with
MS_NODEV, whereas in the latter case it is opaque to userspace.
Another argument I see is that this logic is broken for MS_NOEXEC and
MS_NODEV. I'd rather have them both refuse the corresponding operations.
In any case I'll be debating this with a bunch of people.
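The "transparent" half of that distinction can be sketched in C as follows. This is an illustration only: the helper name `mount_has_nodev` is hypothetical, and it relies on glibc exposing `ST_NODEV` in `<sys/statvfs.h>` when `_GNU_SOURCE` is defined. It reports the explicit nodev mount option; as described above, the kernel-internal SB_I_NODEV flag is opaque to userspace, so this check is not expected to reveal it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/statvfs.h>

/* Hypothetical helper: returns 1 if the filesystem containing `path` was
 * mounted with the nodev option, 0 if not, and -errno on failure. */
int mount_has_nodev(const char *path)
{
    struct statvfs st;

    if (statvfs(path, &st) < 0)
        return -errno;

    /* ST_NODEV mirrors the explicit mount option (MS_NODEV). */
    return (st.f_flag & ST_NODEV) != 0;
}
```

A program could call `mount_has_nodev("/some/dir")` before mknod() to learn about an explicit nodev mount; for the implicit case the only reliable signal remains the failing open() itself.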
…
Also, if there needs to be a workaround, I can see at least three ways how the container manager could work around this more safely and in a way that's compatible with 238 too:
1. on cgroupsv1 the container manager could use the "devices" cgroup controller to only whitelist mknod() for device nodes that can actually work
2. the container manager could simply drop CAP_MKNOD in the container, so that mknod() fails
3. the container manager could install a seccomp filter to filter out mknod() for all devices that cannot be used anyway
Hence, what's the rationale for working around this in the payload, when it could easily be worked around in the container manager instead? I mean, it's a lot easier to fix a few container managers than to fix all images already created out there...
That said, the best approach appears to be to fix the kernel instead. They should really be held to their mantra of not breaking userspace...
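As an illustration of the second workaround, a container manager could remove CAP_MKNOD from the capability bounding set before exec'ing the payload. The following C sketch is not taken from any particular container manager, and the function name `drop_mknod_capability` is hypothetical. It uses the `prctl()` operations `PR_CAPBSET_READ` and `PR_CAPBSET_DROP`; dropping requires CAP_SETPCAP, and it only limits what the process can hold after the next execve().

```c
#include <errno.h>
#include <linux/capability.h>
#include <sys/prctl.h>

/* Hypothetical helper: remove CAP_MKNOD from the bounding set.
 * Returns 1 if it was dropped, 0 if it was already absent, and -errno on
 * failure (for example -EPERM without CAP_SETPCAP). */
int drop_mknod_capability(void)
{
    int have = prctl(PR_CAPBSET_READ, CAP_MKNOD, 0, 0, 0);
    if (have < 0)
        return -errno;
    if (have == 0)
        return 0;

    if (prctl(PR_CAPBSET_DROP, CAP_MKNOD, 0, 0, 0) < 0)
        return -errno;
    return 1;
}
```

With the capability gone from the bounding set, mknod() of device nodes in the payload is expected to fail outright, which existing code already handles, instead of succeeding and producing nodes that cannot be opened.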
|
This reverts commit 55956b5.

commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node is created in a non-initial user namespace, a call to open() on said device node will fail due to:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                   !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occurred when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are usable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces.

A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container, systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace.

[1]: systemd/systemd#9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
What can be done to get this to move forward? Now that it's been almost half a year since 239, I'd really like it if this came with 240 (so that I don't have to use an old kernel). |
Well, the proposed fix turns off all device node creation in userns. But it was my understanding that the kernel folks want to open up /dev/null sooner or later for that fully, and hence I am not convinced we should merge this, given that it really is not right in the long run, and just is glue that works around temporary kernel compat breakage, that will bite us in the long run. It's kinda disappointing that some kernel devs knowingly break userspace with this. After all @brauner did mention the problem, but he was ignored. Quite frankly, I think the right approach is probably to make noise about this, so that the kernel devs revert this. They can't claim their motto was never to break userspace if they gallantly break userspace like this and don't care at all about the effects, even when brought to their attention. That said, I must say I am not sure I care enough about this to be willing to wrestle with the userns maintainer about this myself. Quite frankly, they should either go all the way (which means, allow /dev/null and friends to be created and opened in userns), or not do this at all. The mess they are making now in the kernel is just broken. From my perspective, I'd probably just add a NEWS entry telling people that due to a kernel compat breakage they need to update their userns-using container managers to install seccomp filters on mknod (or take away CAP_MKNOD in it) if they want to run systemd inside of the containers. |
well, but this specific code is written in the knowledge that the file system mknod() is invoked on was just mounted a few syscalls earlier as tmpfs with mount options the code selected itself. So I think we can reasonably assume that on a file system we create and then do mknod() on things are actually usable. |
I can tell you right away that this won't be reverted upstream and will be status quo going forward. I have had no luck in making a case for reverting it when I sent the revert since no one cared enough to point out in the thread that it actually breaks them: basically, my revert was rejected with "this only breaks container runtimes and the logic they follow is buggy". Sorry that I can't be more helpful as a kernel dev here. |
Note that the tmpfs is mounted in a private fs namespace, right after creating it. It's not easy to get access to that mount point in the short time window, and given the hoops you have to jump through to get there, in order to remount it. This is certainly not going to be remounted "by accident", and does require privileges. |
See https://lkml.org/lkml/2018/12/22/221 , I've added Linus to CC on that problem also. |
I responded pointing out that I sent a revert for this in July: https://lists.linuxfoundation.org/pipermail/containers/2018-July/039182.html |
I never hit this issue and didn't see the revert you sent for that on LKML. However, on things breaking userspace I always CC Linus. BR |
Ignoring the slightly patronizing tone, I have cc'ed and notified systemd people while sending the revert and someone should've said "this is an issue for us". Part of why Eric opposed me was that he said he doesn't see the broken users. Now, that's not a good argument, but an easy one to refute if people had opened their mouths. But we settled it so we can move on, hopefully. :)
|
commit 94f8200 upstream.

This reverts commit 55956b5.

commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node is created in a non-initial user namespace, a call to open() on said device node will fail due to:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                   !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occurred when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are usable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces.

A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container, systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace.

[1]: systemd/systemd#9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
I think we can close this bug now that my revert has been merged upstream. |
This reverts commit 55956b5.

commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node is created in a non-initial user namespace, a call to open() on said device node will fail due to:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                   !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occurred when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are usable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces.

A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container, systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace.

[1]: systemd/systemd#9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

https://jira.sw.ru/browse/PSBM-100581

Warning: next patch will allow mknod in CT init userns. The explanation above is good, but we always had such a behavior (mknod succeeds but later open() can fail) and never had a problem because of that, so let it be the same until we face the problem.

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
This reverts commit 55956b5. commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). 
Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: systemd/systemd#9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> https://jira.sw.ru/browse/PSBM-100581 Warning: next patch will allow mknod in CT init userns. The explanation above is good, but we always had such a behavior (mknod succeeds but later open() can fail) and never had a problem because of that, so let it be the same until we face the problem. Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
This reverts commit 55956b5.

Commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node is created in a non-initial user namespace, a call to open() on said device node will fail due to:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                   !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occurred when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are usable, irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces.

A specific problem people are already running into with 4.18 rc kernels is failing systemd services. For any distro that is run in a container, systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference from the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option, while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace.

[1]: systemd/systemd#9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

https://jira.sw.ru/browse/PSBM-100581

Warning: next patch will allow mknod in CT init userns. The explanation above is good, but we always had such a behavior (mknod succeeds but later open() can fail) and never had a problem because of that, so let it be the same until we face the problem.

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
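The asymmetry described in the commit message above can be illustrated with a small userspace model. This is only a sketch: the `model_` names, the `struct model_mount` type, and the flag values are made up for illustration and are not kernel code. It assumes nothing beyond the two rules quoted above, namely that a filesystem mounted from a non-initial user namespace gets SB_I_NODEV and that open() on a device node consults both flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values for this model only; the kernel defines the
 * real MNT_NODEV and SB_I_NODEV in its own headers. */
#define MODEL_MNT_NODEV  0x02u /* per-mount, set explicitly via "nodev" */
#define MODEL_SB_I_NODEV 0x04u /* per-superblock, set implicitly by the kernel */

/* Hypothetical stand-in for the two fields may_open_dev() reads. */
struct model_mount {
    unsigned int mnt_flags;  /* models path->mnt->mnt_flags */
    unsigned long s_iflags;  /* models path->mnt->mnt_sb->s_iflags */
};

/* Same boolean logic as the may_open_dev() quoted in the commit message. */
static bool model_may_open_dev(const struct model_mount *m)
{
    assert(m);
    return !(m->mnt_flags & MODEL_MNT_NODEV) &&
           !(m->s_iflags & MODEL_SB_I_NODEV);
}

/* Models mount setup: "nodev" is an explicit option, while SB_I_NODEV is
 * applied whenever the filesystem is owned by a non-initial userns. */
static struct model_mount model_mount_in_userns(bool init_userns,
                                                bool nodev_option)
{
    struct model_mount m = { 0, 0 };

    if (nodev_option)
        m.mnt_flags |= MODEL_MNT_NODEV;
    if (!init_userns)
        m.s_iflags |= MODEL_SB_I_NODEV;
    return m;
}
```

In this model, a mount created in a non-initial user namespace can never pass the open() check, even without the nodev option, which is the "implicit and opaque" case the message distinguishes from the explicit MS_NODEV case.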
On the way to unprivileged filesystem mounts, starting with
commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems."),
we've enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since

    static struct super_block *alloc_super(struct file_system_type *type, int flags,
                                           struct user_namespace *user_ns)
    {
            /* <snip> */
            if (s->s_user_ns != &init_user_ns)
                    s->s_iflags |= SB_I_NODEV;
            /* <snip> */
    }

will set the SB_I_NODEV flag on the filesystem. When a device node created in
non-init userns is open()ed the call chain will hit:

    bool may_open_dev(const struct path *path)
    {
            return !(path->mnt->mnt_flags & MNT_NODEV) &&
                   !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
    }

which will cause an EPERM because the device node is located on an fs owned by
non-init-userns and thus doesn't grant access to device nodes due to
SB_I_NODEV.
It seems that to gracefully handle this case we should bind-mount device nodes
unless we're real root.
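
A minimal sketch of the fallback suggested above, not the code from this PR. It assumes that "real root" can be approximated by checking whether /proc/self/uid_map shows the identity mapping of the initial user namespace; the function names `uid_map_line_is_initial`, `in_initial_userns`, and `clone_device_node_sketch` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if a uid_map line maps the full 32-bit ID range onto itself,
 * which is what the initial user namespace reports. */
static bool uid_map_line_is_initial(const char *line)
{
    unsigned long long inside, outside, count;

    assert(line);
    if (sscanf(line, "%llu %llu %llu", &inside, &outside, &count) != 3)
        return false;
    return inside == 0 && outside == 0 && count == 4294967295ULL;
}

/* Assumption: a missing uid_map means no userns support, so treat the
 * caller as running in the initial user namespace. */
static bool in_initial_userns(void)
{
    char line[128];
    FILE *f = fopen("/proc/self/uid_map", "r");
    bool initial;

    if (!f)
        return true;
    initial = fgets(line, sizeof(line), f) && uid_map_line_is_initial(line);
    fclose(f);
    return initial;
}

/* Make the device at src available at dst: mknod() in the initial
 * userns, bind mount of the existing node otherwise. */
static int clone_device_node_sketch(const char *src, const char *dst)
{
    struct stat st;
    int fd;

    if (stat(src, &st) < 0)
        return -errno;
    if (!S_ISCHR(st.st_mode) && !S_ISBLK(st.st_mode))
        return -EINVAL;

    if (in_initial_userns())
        return mknod(dst, st.st_mode, st.st_rdev) < 0 ? -errno : 0;

    /* A bind mount needs an existing target; an empty file is enough. */
    fd = open(dst, O_WRONLY | O_CREAT | O_CLOEXEC, 0600);
    if (fd < 0)
        return -errno;
    close(fd);
    return mount(src, dst, NULL, MS_BIND, NULL) < 0 ? -errno : 0;
}
```

The bind mount keeps the node on the host's devtmpfs superblock, which is not flagged SB_I_NODEV, so open() on it can still succeed inside the container.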