New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels #9483

Open
wants to merge 1 commit into
base: master
from

Conversation

3 participants
@brauner
Contributor

brauner commented Jul 2, 2018

On the way to unprivileged filesystem mounts starting with
commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
we've enable mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since

static struct super_block *alloc_super(struct file_system_type *type, int flags,
struct user_namespace user_ns)
{
/
*/

    if (s->s_user_ns != &init_user_ns)
            s->s_iflags |= SB_I_NODEV;

    /* <snip> */

}

will set the SB_I_NODEV flag on the filesystem. When a device node created in
non-init userns is open()ed the call chain will hit:

bool may_open_dev(const struct path *path)
{
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

which will cause an EPERM because the device node is located on an fs owned by
non-init-userns and thus doesn't grant access to device nodes due to
SB_I_NODEV.

It seems that to gracefully handle this case we should bind-mount device nodes
unless we're real root.

core: adapt PrivateDevices= to changed behavior on 4.18 kernels
On the way to unprivileged filesystem mounts starting with
commit 55956b59df33 ("vfs: Allow userns root to call mknod on owned filesystems.")
we've enable mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since

static struct super_block *alloc_super(struct file_system_type *type, int flags,
                                       struct user_namespace *user_ns)
{
        /* <snip> */

        if (s->s_user_ns != &init_user_ns)
                s->s_iflags |= SB_I_NODEV;

        /* <snip> */
}

will set the SB_I_NODEV flag on the filesystem. When a device node created in
non-init userns is open()ed the call chain will hit:

bool may_open_dev(const struct path *path)
{
        return !(path->mnt->mnt_flags & MNT_NODEV) &&
                !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

which will cause an EPERM because the device node is located on an fs owned by
non-init-userns and thus doesn't grant access to device nodes due to
SB_I_NODEV.

It seems that to gracefully handle this case we should bind-mount device nodes
unless we're real root.

@brauner brauner changed the title from core: adapt PrivateDevices= to changed behavior on 4.18 kernels to [RFC] core: adapt PrivateDevices= to changed behavior on 4.18 kernels Jul 2, 2018

@brauner

This comment has been minimized.

Contributor

brauner commented Jul 2, 2018

//Cc @poettering, this would by may suggested fix.

@poettering

This comment has been minimized.

Member

poettering commented Jul 2, 2018

Hmm, but as I understand this is a major compatibility breakage of the kernel, no? Code that worked previously OK will break on current kernels if userns is used, no?

I mean, we can patch around in current system as much as we want, this wouldn't suddenly fix things on already released systemd version would it?

Don't the Linux kernel folks keep repeating that mantra of not breaking userspace? This appears to be quite some breakage no? Or what am I missing?

@poettering

This comment has been minimized.

Member

poettering commented Jul 2, 2018

(specific to the patch: In general, we try to to avoid generic checks like what you are proposing as much as we can, and instead try to test for actual behaviour. We use "am i running in a container" checks, and "am i running under userns" only as very last resort. Are we sure this is one of those case?

To me this appears like something that should be reverted in the kernel, and they should find a different approach. For example, it would already be sufficient if the API device nodes, such as /dev/null, /dev/zero or /dev/urandom would just be whitelisted in the kernel to not onyl allow to be mknod()ed, but also the open()ed if they are.)

@brauner

This comment has been minimized.

Contributor

brauner commented Jul 2, 2018

So for PrivateDevices this is actually not true I think. We had various reports of users before I added the fallback code in v238 that PrivateDevices would fail in user namespaces and looking at v237 there wasn't fallback code. When creating actual device nodes failed the service would fail to start.

The whitelist approach is something I planned on discussing anyway. But I don't think we want to revert enabling mknod() we always expected mknod() to work in user namespaces at some point.

@poettering

This comment has been minimized.

Member

poettering commented Jul 2, 2018

hmm, interesting, in really old versions of this there actually was some gracefulness in that code7f112f50fea585411ea2d493b3582bea77eb4d6e), but it was dropped eventually, probably by accident.

Either way, I know some distros have stabilized on 238, i.e. the kernel API compat breakage will hit people.

So far the general assumption was that mknod() was "more privileged" than open(), and thus if I can mknod() the thing I can totally also open it. This is in line with UNIX tradition (as any file node I create I also own, hence I should be able to open it. the concept that something i own cannot opened by me appears quite surprising and simple wrong to me). But this concept is not just regular UNIX, it also leaked into various Linux concepts. For example, there's a reason why there is CAP_MKNOD, but no CAP_DEVICE_OPEN...

I mean, I am not against making mknod work in userns, but if so you should really make sure that you don't blanket allow mknod() if you then prohibit opening it. That's just conceptually very wrong. Either do it properly (i.e. allow only mknod( ) on nodes you can also open()) or not at all, but the kernel's behaviour of this right now is just bogus: it allows you to create unusable objects (around which userspace then shall work around), and it's already clear that this is only stopgap, and you are working on fixing this properly.

Also, if there needs to be a workaround, I can see at least three ways how the container manager could work around this more safely and in a way that's compatible with 238 too:

  1. on cgroupsv1 the container manager could use the "devices" cgroup controller to only whitelist mknod() for device nodes that can actually work
  2. the container manager could simply drop CAP_SYS_MKNOD in the container, so that mknod() fails
  3. the container manager could install a seccomp filter to filter out mknod() for all devices that cannot be used anyway

Hence, what's the rationale for working around this in the payload, when it could easily be worked around in the container manager instead? I mean, it's a lot easier to fix a few container managers than to fix all images already created out there...

That said, the best appears would be to fix the kernel instead. They should really be held to their mantra of not breaking userspace...

@brauner

This comment has been minimized.

Contributor

brauner commented Jul 2, 2018

brauner added a commit to brauner/linux that referenced this pull request Jul 5, 2018

Revert "vfs: Allow userns root to call mknod on owned filesystems."
This reverts commit 55956b5.

commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:

bool may_open_dev(const struct path *path)
{
        return !(path->mnt->mnt_flags & MNT_NODEV) &&
                !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.

[1]: systemd/systemd#9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>

fengguang pushed a commit to 0day-ci/linux that referenced this pull request Jul 5, 2018

Revert "vfs: Allow userns root to call mknod on owned filesystems."
This reverts commit 55956b5.

commit 55956b5 ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:

bool may_open_dev(const struct path *path)
{
        return !(path->mnt->mnt_flags & MNT_NODEV) &&
                !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}

The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.

All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).

Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.

[1]: systemd/systemd#9483

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
@bl33pbl0p

This comment has been minimized.

Contributor

bl33pbl0p commented Dec 9, 2018

What can be done to get this to move forward? Now that it's been almost half a year since 239, I'd really like if this comes with 240 (and I don't have to use an old kernel).

@poettering

This comment has been minimized.

Member

poettering commented Dec 10, 2018

What can be done to get this to move forward? Now that it's been almost half a year since 239, I'd really like if this comes with 240 (and I don't have to use an old kernel).

Well, the proposed fix turns off all device node creation in userns. But it was my understanding that the kernel folks want to open up /dev/null sooner or later for that fully, and hence I am not convinced we should merge this, given that it really is not right in the long run, and just is glue that works around temporary kernel compat breakage, that will bite is in the long run.

It's kinda disappointing that some kernel devs knowingly break userspace with this. After all @brauner did mention the problem, but he was ignored.

Quite frankly, I think the right approach is probably to make noise about this, so that the kernel devs revert this. They can't claim their motto was never to break userspace if they galantly break userspace like this and don't care at all about the effects, even when brought to their attention. That said, I am am not sure I care enough about this I must say to be willing to wrestle with the userns maintainer about this myself.

Quite frankly, they should either go all the way (which means, allow /dev/null and friends to be created and opened in userns), or not do this at all. The mess the are doing now in the kernel is just broken.

From my perspective, I'd probably just add a NEWS entry telling people that due to a kernel compat breakage they need to update their userns using container managers to install seccomp filters on mknod (or take away CAP_SYS_MKNOD in it) if they want to run systemd inside of the containers if they want to use systemd inside of it.

@poettering

This comment has been minimized.

Member

poettering commented Dec 10, 2018

Actually you can already end up in similar situation quite easily even before that change. Mount options such as MS_NOEXEC and MS_NODEV on any filesystem will allow you to still set the exec bit or create device nodes but not execute the files in question or open them. So there's precedence.

well, but this specific code is written in the knowledge that the file system mknod() is invoked on was just mounted a few syscalls earlier as tmpfs with mount options the code selected itself. So I think we can reasonably assume that on a file system we create and then do mknod() on things are actually usable.

@bl33pbl0p

This comment has been minimized.

Contributor

bl33pbl0p commented Dec 10, 2018

well, but this specific code is written in the knowledge that the file system mknod() is invoked on was just mounted a few syscalls earlier as tmpfs with mount options the code selected itself. So I think we can reasonably assume that on a file system we create and then do mknod() on things are actually usable.

What when somebody in the race between mounting and doing mknod sets nodev and remounts the file system? (I know this is a bit hyperbole, but I'm wanting this to be merged, because it is unlikely the kernel will change as that, from my assessment, makes things unsafe/not usable anyway).

If that is something that should be handled gracefully, then the patch should go in.

On second thought, that is very very unlikely (and also a bit incorrect). Sorry for the noise.

@brauner

This comment has been minimized.

Contributor

brauner commented Dec 11, 2018

Actually you can already end up in similar situation quite easily even before that change. Mount options such as MS_NOEXEC and MS_NODEV on any filesystem will allow you to still set the exec bit or create device nodes but not execute the files in question or open them. So there's precedence.

well, but this specific code is written in the knowledge that the file system mknod() is invoked on was just mounted a few syscalls earlier as tmpfs with mount options the code selected itself. So I think we can reasonably assume that on a file system we create and then do mknod() on things are actually usable.

I can tell you right away that this won't be reverted upstream and will be status quo going forward. I have had no luck in making a case for reverting it when I sent the revert since no one cared enough to point out in the thread that it actually breaks them: basically, my revert was rejected with "this only breaks container runtimes and the logic they follow is buggy". Sorry that I can't be more helpful as a kernel dev here.

@poettering

This comment has been minimized.

Member

poettering commented Dec 11, 2018

On second thought, that is very very unlikely (and also a bit incorrect). Sorry for the noise.

Note that the tmpfs is mounted in a private fs namespace, right after creating it. It's not easy to get access to that mount point in the short time window, and given the hoops you have to jump through to get there, in order to remount it. This is certainly not going to be remounted "by accident", and does require privileges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment