[RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels #9483

brauner · 2018-07-02T10:57:49Z

On the way to unprivileged filesystem mounts starting with
commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
we've enable mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since

static struct super_block *alloc_super(struct file_system_type *type, int flags,
struct user_namespace user_ns)
{
/ */

    if (s->s_user_ns != &init_user_ns)
            s->s_iflags |= SB_I_NODEV;

    /* <snip> */

}

will set the SB_I_NODEV flag on the filesystem. When a device node created in
non-init userns is open()ed the call chain will hit:

bool may_open_dev(const struct path *path)
{
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

which will cause an EPERM because the device node is located on an fs owned by
non-init-userns and thus doesn't grant access to device nodes due to
SB_I_NODEV.

It seems that to gracefully handle this case we should bind-mount device nodes
unless we're real root.

On the way to unprivileged filesystem mounts starting with commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.") we've enable mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since static struct super_block *alloc_super(struct file_system_type *type, int flags, struct user_namespace *user_ns) { /* <snip> */ if (s->s_user_ns != &init_user_ns) s->s_iflags |= SB_I_NODEV; /* <snip> */ } will set the SB_I_NODEV flag on the filesystem. When a device node created in non-init userns is open()ed the call chain will hit: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } which will cause an EPERM because the device node is located on an fs owned by non-init-userns and thus doesn't grant access to device nodes due to SB_I_NODEV. It seems that to gracefully handle this case we should bind-mount device nodes unless we're real root.

brauner · 2018-07-02T10:58:31Z

//Cc @poettering, this would by may suggested fix.

poettering · 2018-07-02T11:00:39Z

Hmm, but as I understand this is a major compatibility breakage of the kernel, no? Code that worked previously OK will break on current kernels if userns is used, no?

I mean, we can patch around in current system as much as we want, this wouldn't suddenly fix things on already released systemd version would it?

Don't the Linux kernel folks keep repeating that mantra of not breaking userspace? This appears to be quite some breakage no? Or what am I missing?

poettering · 2018-07-02T11:04:02Z

(specific to the patch: In general, we try to to avoid generic checks like what you are proposing as much as we can, and instead try to test for actual behaviour. We use "am i running in a container" checks, and "am i running under userns" only as very last resort. Are we sure this is one of those case?

To me this appears like something that should be reverted in the kernel, and they should find a different approach. For example, it would already be sufficient if the API device nodes, such as /dev/null, /dev/zero or /dev/urandom would just be whitelisted in the kernel to not onyl allow to be mknod()ed, but also the open()ed if they are.)

brauner · 2018-07-02T11:07:17Z

So for PrivateDevices this is actually not true I think. We had various reports of users before I added the fallback code in v238 that PrivateDevices would fail in user namespaces and looking at v237 there wasn't fallback code. When creating actual device nodes failed the service would fail to start.

The whitelist approach is something I planned on discussing anyway. But I don't think we want to revert enabling mknod() we always expected mknod() to work in user namespaces at some point.

poettering · 2018-07-02T11:42:37Z

hmm, interesting, in really old versions of this there actually was some gracefulness in that code7f112f50fea585411ea2d493b3582bea77eb4d6e), but it was dropped eventually, probably by accident.

Either way, I know some distros have stabilized on 238, i.e. the kernel API compat breakage will hit people.

So far the general assumption was that mknod() was "more privileged" than open(), and thus if I can mknod() the thing I can totally also open it. This is in line with UNIX tradition (as any file node I create I also own, hence I should be able to open it. the concept that something i own cannot opened by me appears quite surprising and simple wrong to me). But this concept is not just regular UNIX, it also leaked into various Linux concepts. For example, there's a reason why there is CAP_MKNOD, but no CAP_DEVICE_OPEN...

I mean, I am not against making mknod work in userns, but if so you should really make sure that you don't blanket allow mknod() if you then prohibit opening it. That's just conceptually very wrong. Either do it properly (i.e. allow only mknod( ) on nodes you can also open()) or not at all, but the kernel's behaviour of this right now is just bogus: it allows you to create unusable objects (around which userspace then shall work around), and it's already clear that this is only stopgap, and you are working on fixing this properly.

Also, if there needs to be a workaround, I can see at least three ways how the container manager could work around this more safely and in a way that's compatible with 238 too:

on cgroupsv1 the container manager could use the "devices" cgroup controller to only whitelist mknod() for device nodes that can actually work
the container manager could simply drop CAP_SYS_MKNOD in the container, so that mknod() fails
the container manager could install a seccomp filter to filter out mknod() for all devices that cannot be used anyway

Hence, what's the rationale for working around this in the payload, when it could easily be worked around in the container manager instead? I mean, it's a lot easier to fix a few container managers than to fix all images already created out there...

That said, the best appears would be to fix the kernel instead. They should really be held to their mantra of not breaking userspace...

brauner · 2018-07-02T13:43:39Z

On Mon, Jul 02, 2018 at 04:42:51AM -0700, Lennart Poettering wrote: hmm, interesting, in really old versions of this there actually was some gracefulness in that code7f112f50fea585411ea2d493b3582bea77eb4d6e), but it was dropped eventually, probably by accident. Either way, I know some distros have stabilized on 238, i.e. the kernel API compat breakage will hit people. So far the general assumption was that mknod() was "more privileged" than open(), and thus if I can mknod() the thing I can totally also open it. This is in line with UNIX tradition (as any file node I create I also own, hence I should be able to open it. the concept that something i own cannot opened by me appears quite surprising and simple wrong to me). But this concept is not just regular UNIX, it also leaked into various Linux concepts. For example, there's a reason why there is CAP_MKNOD, but no CAP_DEVICE_OPEN... I mean, I am not against making mknod work in userns, but if so you should really make sure that you don't blanket allow mknod() if you then prohibit opening it. That's just conceptually very wrong. Either do it properly (i.e. allow only mknod( ) on nodes you can also open()) or not at all, but the kernel's behaviour of this right is just bogus: it allows you to create unusable objects (around which userspace then shall work around), and it's already clear that this is only stopgap, and you are working on fixing this properly.

Actually you can already end up in similar situation quite easily even before that change. Mount options such as MS_NOEXEC and MS_NODEV on any filesystem will allow you to still set the exec bit or create device nodes but not execute the files in question or open them. So there's precedence. The counter argument - with which I sympathize - is that there's a difference between explicitly mounting an fs with MS_NODEV and having the kernel implicitly mark the filesystem as SB_I_NODEV internally. In the former case the open() refusal is somewhat transparent in that you can discover whether the fs was mounted with MS_NODEV whereas in the latter case it is opaque to userspace. Another argument I see is that this logic is broken for MS_NOEXEC and MS_NODEV. I'd rather have them both refuse the corresponding operations. In any case I'll be debating this with a bunch of people.

…

Also, if there needs to be a workaround, I can see at least three ways how the container manager could work around this more safely and in a way that's compatible with 238 too: 1. on cgroupsv1 the container manager could use the "devices" cgroup controller to only whitelist mknod() for device nodes that can actually work 2. the container manager could simply drop CAP_SYS_MKNOD in the container, so that mknod() fails 3. the container manager could install a seccomp filter to filter out mknod() for all devices that cannot be used anyway Hence, what's the rationale for working around this in the payload, when it could easily be worked around in the container manager instead? I mean, it's a lot easier to fix a few container managers than to fix all images already created out there... That said, the best appears would be to fix the kernel instead. They should really be held to their mantra of not breaking userspace... -- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: #9483 (comment)

This reverts commit 55956b5. commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: systemd/systemd#9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com>

bl33pbl0p · 2018-12-09T22:57:01Z

What can be done to get this to move forward? Now that it's been almost half a year since 239, I'd really like if this comes with 240 (and I don't have to use an old kernel).

poettering · 2018-12-10T12:44:15Z

What can be done to get this to move forward? Now that it's been almost half a year since 239, I'd really like if this comes with 240 (and I don't have to use an old kernel).

Well, the proposed fix turns off all device node creation in userns. But it was my understanding that the kernel folks want to open up /dev/null sooner or later for that fully, and hence I am not convinced we should merge this, given that it really is not right in the long run, and just is glue that works around temporary kernel compat breakage, that will bite is in the long run.

It's kinda disappointing that some kernel devs knowingly break userspace with this. After all @brauner did mention the problem, but he was ignored.

Quite frankly, I think the right approach is probably to make noise about this, so that the kernel devs revert this. They can't claim their motto was never to break userspace if they galantly break userspace like this and don't care at all about the effects, even when brought to their attention. That said, I am am not sure I care enough about this I must say to be willing to wrestle with the userns maintainer about this myself.

Quite frankly, they should either go all the way (which means, allow /dev/null and friends to be created and opened in userns), or not do this at all. The mess the are doing now in the kernel is just broken.

From my perspective, I'd probably just add a NEWS entry telling people that due to a kernel compat breakage they need to update their userns using container managers to install seccomp filters on mknod (or take away CAP_SYS_MKNOD in it) if they want to run systemd inside of the containers if they want to use systemd inside of it.

poettering · 2018-12-10T12:46:54Z

Actually you can already end up in similar situation quite easily even before that change. Mount options such as MS_NOEXEC and MS_NODEV on any filesystem will allow you to still set the exec bit or create device nodes but not execute the files in question or open them. So there's precedence.

well, but this specific code is written in the knowledge that the file system mknod() is invoked on was just mounted a few syscalls earlier as tmpfs with mount options the code selected itself. So I think we can reasonably assume that on a file system we create and then do mknod() on things are actually usable.

brauner · 2018-12-11T08:51:48Z

Actually you can already end up in similar situation quite easily even before that change. Mount options such as MS_NOEXEC and MS_NODEV on any filesystem will allow you to still set the exec bit or create device nodes but not execute the files in question or open them. So there's precedence.

well, but this specific code is written in the knowledge that the file system mknod() is invoked on was just mounted a few syscalls earlier as tmpfs with mount options the code selected itself. So I think we can reasonably assume that on a file system we create and then do mknod() on things are actually usable.

I can tell you right away that this won't be reverted upstream and will be status quo going forward. I have had no luck in making a case for reverting it when I sent the revert since no one cared enough to point out in the thread that it actually breaks them: basically, my revert was rejected with "this only breaks container runtimes and the logic they follow is buggy". Sorry that I can't be more helpful as a kernel dev here.

poettering · 2018-12-11T09:32:30Z

On second thought, that is very very unlikely (and also a bit incorrect). Sorry for the noise.

Note that the tmpfs is mounted in a private fs namespace, right after creating it. It's not easy to get access to that mount point in the short time window, and given the hoops you have to jump through to get there, in order to remount it. This is certainly not going to be remounted "by accident", and does require privileges.

abucodonosor · 2018-12-22T22:47:18Z

@brauner @poettering

See https://lkml.org/lkml/2018/12/22/221 , I've added Linus to CC on that problem also.

brauner · 2018-12-22T22:50:35Z

@brauner @poettering

See https://lkml.org/lkml/2018/12/22/221 , I've added Linus to CC on that problem also.

I responded pointing out that I sent a revert for this in July:

https://lists.linuxfoundation.org/pipermail/containers/2018-July/039182.html
https://lists.linuxfoundation.org/pipermail/containers/2018-July/039183.html

abucodonosor · 2018-12-22T22:58:56Z

@brauner

I never hit this issue and didn't saw the revert you send for that on LKML.

However on things breaking userpace I always CC Linus..
Next time you should do that also. ( NO matter what the patch author telling you).

BR

brauner · 2018-12-23T10:52:47Z

@brauner

I never hit this issue and didn't saw the revert you send for that on LKML.

However on things breaking userpace I always CC Linus..
Next time you should do that also. ( NO matter what the patch author telling you).

Ignoring the slightly patronizing tone, I have cc'ed and notified systemd people while sending the revert and someone should've said "this is an issue for us". Part why Eric opposed me was that he said he doesn't see the broken users. Now, that's not good argument but an easy one to refute if people would've opened their mouths. But we settled it so we can move on, hopefully. :)

BR

commit 94f8200 upstream. This reverts commit 55956b5. commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: systemd/systemd#9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

brauner · 2018-12-29T23:16:06Z

I think we can close this bug now that my revert has been merged upstream.

commit 94f8200 upstream. This reverts commit 55956b5. commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: systemd/systemd#9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

This reverts commit 55956b5. commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.") enabled mknod() in user namespaces for userns root if CAP_MKNOD is available. However, these device nodes are useless since any filesystem mounted from a non-initial user namespace will set the SB_I_NODEV flag on the filesystem. Now, when a device node s created in a non-initial user namespace a call to open() on said device node will fail due to: bool may_open_dev(const struct path *path) { return !(path->mnt->mnt_flags & MNT_NODEV) && !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV); } The problem with this is that as of the aforementioned commit mknod() creates partially functional device nodes in non-initial user namespaces. In particular, it has the consequence that as of the aforementioned commit open() will be more privileged with respect to device nodes than mknod(). Before it was the other way around. Specifically, if mknod() succeeded then it was transparent for any userspace application that a fatal error must have occured when open() failed. All of this breaks multiple userspace workloads and a widespread assumption about how to handle mknod(). Basically, all container runtimes and systemd live by the slogan "ask for forgiveness not permission" when running user namespace workloads. For mknod() the assumption is that if the syscall succeeds the device nodes are useable irrespective of whether it succeeds in a non-initial user namespace or not. This logic was chosen explicitly to allow for the glorious day when mknod() will actually be able to create fully functional device nodes in user namespaces. A specific problem people are already running into when running 4.18 rc kernels are failing systemd services. For any distro that is run in a container systemd services started with the PrivateDevices= property set will fail to start since the device nodes in question cannot be opened (cf. the arguments in [1]). Full disclosure, Seth made the very sound argument that it is already possible to end up with partially functional device nodes. Any filesystem mounted with MS_NODEV set will allow mknod() to succeed but will not allow open() to succeed. The difference to the case here is that the MS_NODEV case is transparent to userspace since it is an explicitly set mount option while the SB_I_NODEV case is an implicit property enforced by the kernel and hence opaque to userspace. [1]: systemd/systemd#9483 Signed-off-by: Christian Brauner <christian@brauner.io> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Serge Hallyn <serge@hallyn.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> https://jira.sw.ru/browse/PSBM-100581 Warning: next patch will allow mknod in CT init userns. The explanation above is good, but we always had such a behavior (mknod succeeds but later open() can fail) and never had a problem because of that, so let it be the same until we face the problem. Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>

brauner changed the title ~~core: adapt PrivateDevices= to changed behavior on 4.18 kernels~~ [RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels Jul 2, 2018

poettering added needs-discussion 🤔 kernel-api-breakage pid1 labels Jul 4, 2018

brauner mentioned this pull request Aug 19, 2018

[Arch|AUR] After upgrading to 3.4 nginx won't start on container because it can't open /dev/null canonical/lxd#4950

Closed

brauner closed this Dec 29, 2018

This was referenced Jun 26, 2022

RFE: nspawn: add native options to create extra devices #23831

Closed

nspawn: --bind for device node does not work #21987

Open

Uh oh!

[RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels #9483

[RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels #9483

Uh oh!

Conversation

brauner commented Jul 2, 2018

Uh oh!

brauner commented Jul 2, 2018

Uh oh!

poettering commented Jul 2, 2018

Uh oh!

poettering commented Jul 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brauner commented Jul 2, 2018

Uh oh!

poettering commented Jul 2, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brauner commented Jul 2, 2018 via email

Uh oh!

bl33pbl0p commented Dec 9, 2018

Uh oh!

poettering commented Dec 10, 2018

Uh oh!

poettering commented Dec 10, 2018

Uh oh!

brauner commented Dec 11, 2018

Uh oh!

poettering commented Dec 11, 2018

Uh oh!

abucodonosor commented Dec 22, 2018

Uh oh!

brauner commented Dec 22, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abucodonosor commented Dec 22, 2018

Uh oh!

brauner commented Dec 23, 2018

Uh oh!

brauner commented Dec 29, 2018

Uh oh!

Uh oh!

poettering commented Jul 2, 2018 •

edited

Loading

poettering commented Jul 2, 2018 •

edited

Loading

brauner commented Dec 22, 2018 •

edited

Loading