Rework mount namespace support #168

Merged
merged 23 commits into from Oct 14, 2016

4 participants
Collaborator

zyga commented Oct 7, 2016

This patch changes the contents of the initialized mount namespace. The
main motivator is to allow /media to be shared amongst the initial mount
namespace and the derivative namespaces created by snap-confine.

The code is unified and simplified around the new bootstrap function. It
now takes a few simple arguments through a configuration structure. The
main argument is the location of the starting point of the desired root
filesystem. On classic this is the core snap (or ubuntu-core if core is
not available). On an all-snap system this is the root filesystem.

The bootstrap function puts the desired root filesystem into a new
temporary directory (using recursive bind mount) paying special
attention to avoid creating bind mount loops. This directory is now
called the constructed root filesystem and eventually, this is where we
pivot_root into.

Since we still have access to the initial root filesystem under the
initial sharing parameters as well as to the new constructed root
filesystem, we can choose to bind mount arbitrary directories from the
former over the latter.

The code now supports a list of directories and a flag indicating if
the mount event propagation should be bidirectional or not.

In this patch /media is on the bidirectional list. On classic a large
list of directories is on the unidirectional list. On all-snap systems
the unidirectional list is empty: the root filesystem itself is shared
and is already correctly populated by the all-snap initrd.

The existing code for special treatment of /snap and /etc/alternatives
was retained as is (semantically) but has been folded
into the bootstrap function for simplicity. This was not yet done for /tmp
and /dev/pts.

The code for nvidia support is now also handled by the bootstrap
function but this should be moved outside later. This can be done only
after nvidia code is adjusted to assume it executes after pivot root
(this will simplify it and allow us to do something sensible in all-snap
system later). I chose not to do this to simplify the change and review
process. To repeat, nvidia support is exactly as it was before.

The apparmor profile was adjusted to take account of all the new
(numerous) operations.

Signed-off-by: Zygmunt Krynicki zygmunt.krynicki@canonical.com

src/mount-support.c
+ die("cannot perform operation: mount --make-rslave %s",
+ SC_HOSTFS_DIR);
+ }
+#if 0
@zyga

zyga Oct 7, 2016

Collaborator

FYI, I was split on this (advice from stgraber earlier) but it seems to actually do the right thing (to be clear, the code as-is, with #if 0). I didn't see any adverse effects either.

@Conan-Kudo

Conan-Kudo Oct 10, 2016

Please provide a comment on why #if 0

@zyga

zyga Oct 11, 2016

Collaborator

I dropped that part entirely now

zyga added some commits Oct 11, 2016

Add kernel patch for pivot_root debugging
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Rework mount namespace support
This patch changes the contents of the initialized mount namespace. The
main motivator is to allow /media to be shared amongst the initial mount
namespace and the derivative namespaces created by snap-confine.

The code is unified and simplified around the new bootstrap function. It
now takes a few simple arguments through a configuration structure. The
main argument is the location of the starting point of the desired root
filesystem. On classic this is the core snap (or ubuntu-core if core is
not available). On an all-snap system this is the root filesystem.

The bootstrap function puts the desired root filesystem into a new
temporary directory (using recursive bind mount) paying special
attention to avoid creating bind mount loops. This directory is now
called the constructed root filesystem and eventually, this is where we
pivot_root into.

Since we still have access to the initial root filesystem under the
initial sharing parameters as well as to the new constructed root
filesystem, we can choose to bind mount arbitrary directories from the
former over the latter.

The code now supports two lists of directories, one that is bind-mounted
rslave (called unidirectional because events propagate only in one
direction in the peer group) and one that is mounted rshared (called
bidirectional for the obvious reason).

In this patch /media is on the bidirectional list. On classic a large
list of directories is on the unidirectional list. On all-snap systems
the unidirectional list is empty: the root filesystem itself is shared
and is already correctly populated by the all-snap initrd.

The existing code for special treatment of /snap, /etc/alternatives,
/tmp and /dev/pts is retained as is (semantically) but has been folded
into the bootstrap function for simplicity.

The code for nvidia support is now also handled by the bootstrap
function but this should be moved outside later. This can be done only
after nvidia code is adjusted to assume it executes after pivot root
(this will simplify it and allow us to do something sensible in all-snap
system later). I chose not to do this to simplify the change and review
process. To repeat, nvidia support is exactly as it was before.

The apparmor profile was adjusted to take account of all the new
(numerous) operations.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Ignore core snap in layout test
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Drop mount_src from layout test (too unpredictable)
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Update expected data and processing script
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

This is a very complex change that is difficult to review in a short amount of time. However, the tests that you created are a big help in trusting the changes.

Most of my comments are suggestions and are not blockers. There are only two blockers: the /run unidirectional mount happening after, and breaking, the /run/media bidirectional mount, and the missing AppArmor profile changes.

src/mount-support.c
+ * (end of quote).
+ *
+ * The main idea is to setup a mount namespace that has a root filesystem with
+ * vfsmounts and peer groups that, depending on the location, either isolated
@tyhicks

tyhicks Oct 11, 2016

Collaborator

s/isolated/isolate/

src/mount-support.c
+ * Selected directories (today just /media) can be shared in both directions.
+ * This allows snaps with sufficient privileges to create additional mount
+ * points that are visible by the rest of the system (both the main mount
+ * namespace and namespaces of individual snaps).
@tyhicks

tyhicks Oct 11, 2016

Collaborator

Please also point out that it allows snaps with sufficient privs to remove mount points that were visible to the rest of the system.

src/mount-support.c
+ **/
+static void sc_bootstrap_mount_namespace(const struct sc_mount_config *config)
+{
+ char scratch_dir[PATH_MAX] = "/tmp/snap.rootfs_XXXXXX";
@tyhicks

tyhicks Oct 11, 2016

Collaborator

Optional suggestion: save a little stack space and declare scratch_dir like so:

char scratch_dir[] = "/tmp/snap.rootfs_XXXXXX";
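
To illustrate the suggestion, here is a minimal sketch (not the snap-confine code). The helper name `sc_make_scratch_dir` and the `SCRATCH_TEMPLATE` macro are hypothetical. It shows that an array sized by its initializer holds exactly the template plus the terminating NUL, which is all `mkdtemp()` needs because it rewrites the trailing `XXXXXX` in place:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SCRATCH_TEMPLATE "/tmp/snap.rootfs_XXXXXX"

/* Hypothetical helper: creates a unique scratch directory and copies its
 * name into out, which must hold at least sizeof(SCRATCH_TEMPLATE) bytes.
 * Returns 0 on success, -1 on failure. */
int sc_make_scratch_dir(char *out, size_t out_size)
{
	/* Sized by the initializer: strlen(template) + 1 bytes, not PATH_MAX. */
	char scratch_dir[] = SCRATCH_TEMPLATE;

	assert(sizeof scratch_dir == strlen(SCRATCH_TEMPLATE) + 1);
	assert(sizeof scratch_dir < PATH_MAX);

	if (out_size < sizeof scratch_dir)
		return -1;
	/* mkdtemp() replaces the trailing XXXXXX in place. */
	if (mkdtemp(scratch_dir) == NULL)
		return -1;
	memcpy(out, scratch_dir, sizeof scratch_dir);
	return 0;
}
```

The saving is only the unused tail of a `PATH_MAX`-sized buffer on the stack; behaviour is otherwise unchanged.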

src/mount-support.c
+ die("cannot perform operation: mount --make-unbindable %s",
+ scratch_dir);
+ }
+ // Recursively bind mount desired root filesystem directory over of the
@tyhicks

tyhicks Oct 11, 2016

Collaborator

s/over of the/over the/

src/mount-support.c
+ // scratch directory. This puts the initial content into the scratch space
+ // and serves as a foundation for all subsequent operations below.
+ //
+ // The mount is recursive because it can either be applied the root
@tyhicks

tyhicks Oct 11, 2016

Collaborator

s/applied the root/applied to the root/

src/mount-support.c
+ die("cannot perform operation: mount --make-rslave %s",
+ dst);
+ }
+ }
@tyhicks

tyhicks Oct 11, 2016

Collaborator

I'm a little worried that bidirectional_mounts and unidirectional_mounts are two different lists and that bidirectional is also processed before unidirectional. The concern comes from the possible need to set up a unidirectional mount before a bidirectional. This could be solved by combining them into one list of something like:

struct sc_mounts {
    const char *path;
    bool is_bidirectional;
};

This is only an observation and not a blocker for this PR.
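
A minimal sketch of how such a combined list might be consumed, assuming entries are processed strictly in array order. The names `sc_mount`, `sc_propagation_flags` and `example_mounts` are illustrative assumptions, not the code that was eventually merged. The rshared/rslave mapping follows the commit message's description of the bidirectional and unidirectional lists:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mount.h>

/* Illustrative only: one entry per directory, processed in array order. */
struct sc_mount {
	const char *path;
	bool is_bidirectional;
};

/* Flags for the propagation change that would follow the recursive bind
 * mount: rshared lets events flow both ways, rslave only inward. */
unsigned long sc_propagation_flags(const struct sc_mount *mnt)
{
	assert(mnt != NULL && mnt->path != NULL);
	return MS_REC | (mnt->is_bidirectional ? MS_SHARED : MS_SLAVE);
}

/* Example ordering (an assumption made for this sketch). */
const struct sc_mount example_mounts[] = {
	{"/run", false},
	{"/media", true},
};
```

With a single array the relative order of unidirectional and bidirectional entries is explicit in the source, rather than implied by which list is walked first.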

@zyga

zyga Oct 13, 2016

Collaborator

I'll do this, good idea.

@zyga

zyga Oct 13, 2016

Collaborator

Done, please have a look

src/mount-support.c
+ etc_alternatives);
+ // NOTE: MS_SLAVE so that the started process cannot maliciously mount
+ // anything into those places and affect the system on the outside.
+ debug("performing operation: mount --bind -o slave %s %s", src,
@tyhicks

tyhicks Oct 11, 2016

Collaborator

I think "-o slave" works but we should change "-o slave" to "--make-slave" to match the rest of the debugging output.

@zyga

zyga Oct 13, 2016

Collaborator

This is subtly different. When you call mount --make-slave you get different arguments to the system call:

mount("none", "/some/path", NULL, MS_SLAVE, NULL)
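
The distinction can be sketched as follows. The mount(2) manual page describes the call as picking one operation per invocation by testing flag bits in a fixed order (remount, bind, propagation change, move, new mount). Under that reading, a call with `MS_BIND | MS_SLAVE` is treated as a bind mount, and changing propagation takes a second call with `MS_SLAVE` alone (optionally with `MS_REC`). The helper `classify_mount_op` and its enum are hypothetical, written to mirror that documented order rather than taken from the kernel or from snap-confine:

```c
#include <assert.h>
#include <sys/mount.h>

/* Hypothetical labels for the operations mount(2) can perform. */
enum mount_op {
	OP_REMOUNT,
	OP_BIND,
	OP_CHANGE_PROPAGATION,
	OP_MOVE,
	OP_NEW_MOUNT,
};

/* Mirrors the order in which mount(2) is documented to test its flags. */
enum mount_op classify_mount_op(unsigned long flags)
{
	const unsigned long propagation =
	    MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE;

	if (flags & MS_REMOUNT)
		return OP_REMOUNT;
	if (flags & MS_BIND)
		return OP_BIND;	/* checked before the propagation bits */
	if (flags & propagation)
		return OP_CHANGE_PROPAGATION;
	if (flags & MS_MOVE)
		return OP_MOVE;
	return OP_NEW_MOUNT;
}
```

For example, `classify_mount_op(MS_BIND | MS_SLAVE)` yields `OP_BIND`, while `classify_mount_op(MS_SLAVE)` yields `OP_CHANGE_PROPAGATION`, which is why the debug text for the two calls should differ.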

src/mount-support.c
+ debug("performing operation: mount --bind -o slave %s %s", src,
+ dst);
+ if (mount(src, dst, NULL, MS_BIND | MS_SLAVE, NULL) != 0) {
+ die("cannot perform operation: mount --bind -o slave %s %s", src, dst);
@tyhicks

tyhicks Oct 11, 2016

Collaborator

Here, too.

+ // directory is always /snap. On the host it is a build-time configuration
+ // option stored in SNAP_MOUNT_DIR.
+ must_snprintf(dst, sizeof dst, "%s/snap", scratch_dir);
+ debug("performing operation: mount --rbind %s %s", SNAP_MOUNT_DIR, dst);
@tyhicks

tyhicks Oct 11, 2016

Collaborator

Add --make-rslave.

@zyga

zyga Oct 13, 2016

Collaborator

Ah, I see. Will do

@zyga

zyga Oct 13, 2016

Collaborator

I take that back, --make-rslave is directly below this code

+ debug("performing operation: mount --rbind %s %s", SNAP_MOUNT_DIR, dst);
+ if (mount(SNAP_MOUNT_DIR, dst, NULL, MS_BIND | MS_REC | MS_SLAVE, NULL)
+ < 0) {
+ die("cannot perform operation: mount --rbind -o slave %s %s",
@tyhicks

tyhicks Oct 11, 2016

Collaborator

s/-o slave/--make-rslave/

src/mount-support.c
+ die("cannot perform operation: mount --rbind -o slave %s %s",
+ SNAP_MOUNT_DIR, dst);
+ }
+ debug("performing operation: mount --make-rslave slave %s", dst);
@tyhicks

tyhicks Oct 11, 2016

Collaborator

There's an extra "slave" in there.

src/mount-support.c
+ "/var/snap", // to get access to global snap data
+ "/var/lib/snapd", // to get access to snapd state and seccomp profiles
+ "/var/tmp", // to get access to the other temporary directory
+ "/run", // to get /run with sockets and what not
@tyhicks

tyhicks Oct 11, 2016

Collaborator

Huh, here's what I was talking about earlier in the review where I was worried about bidirectional_mounts being processed before unidirectional_mounts.

If the MERGED_USR macro is set, then the /run/media bidirectional mount is performed. The problem is that the /run unidirectional mount will then clobber it and break the sharing of /run/media across snaps and the host.

@zyga

zyga Oct 11, 2016

Collaborator

Ah, correct. I was not testing this on Fedora (no Fedora CI yet) and that's where we use merged /usr. Nice catch!

@zyga

zyga Oct 13, 2016

Collaborator

Fixed.
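
The ordering problem discussed in this thread can be checked mechanically. Here is a minimal sketch, assuming mounts are applied in array order and paths are normalized (absolute, no trailing slash). `path_is_under` and `first_clobbering_mount` are hypothetical names, not functions from snap-confine:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if child lies strictly below parent, e.g. /run/media under /run. */
bool path_is_under(const char *child, const char *parent)
{
	size_t n = strlen(parent);

	assert(n > 0);
	if (strcmp(parent, "/") == 0)
		return child[0] == '/' && child[1] != '\0';
	return strncmp(child, parent, n) == 0 && child[n] == '/';
}

/* Returns the index of the first entry that would be mounted over an
 * earlier, more specific entry (hiding it), or -1 if the order is safe. */
int first_clobbering_mount(const char *const *paths, size_t count)
{
	for (size_t later = 1; later < count; later++) {
		for (size_t earlier = 0; earlier < later; earlier++) {
			if (path_is_under(paths[earlier], paths[later]))
				return (int)later;
		}
	}
	return -1;
}
```

Given the order `/run/media`, `/var/tmp`, `/run`, the check flags the last entry; with `/run` placed before `/run/media` it reports no problem.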

src/snap-confine.apparmor.in
+ umount /var/lib/snapd/hostfs/tmp/snap.rootfs_*/,
+ mount options=(rw rslave) -> /var/lib/snapd/hostfs/,
+ mount options=(rw rprivate) -> /var/lib/snapd/hostfs/,
+
@tyhicks

tyhicks Oct 11, 2016

Collaborator

The AppArmor profile changes look incomplete. These are mostly all mounts that were already being performed and would therefore need AppArmor rules allowing them. It looks like this PR should either be reusing existing rules or removing the old rules that no longer suffice.

I'm holding off on reviewing the profile changes for now.

@zyga

zyga Oct 13, 2016

Collaborator

I've simplified the AA profile considerably. Please have a look

zyga added some commits Oct 13, 2016

Update expected layouts for core and ubuntu-core
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Fix typo
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Clarify that /media also propagates unmount events
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Save some stack space
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Fix typo
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Improve grammar
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Remove uneeded word
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Unify bidirectional and unidirectional mount directives
This fixes a bug found by Tyler Hicks where if MERGED_USR configuration
option was enabled then /run/ would clobber the correctly set up
/run/media. Now processing of all mount directives is unified and the
both directives are set up to be in the right order (/run/media after
/run).

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Clean up apparmor profile
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Collaborator

zyga commented Oct 13, 2016

I ran this branch against tests in snapd and ... crashed the kernel:

linode:ubuntu-16.04-64 .../tests/main/snap-set# journalctl -xe
Oct 13 10:59:18 ubuntu kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
Oct 13 10:59:18 ubuntu kernel: task: ffff880037b4d280 ti: ffff88007c344000 task.ti: ffff88007c344000
Oct 13 10:59:18 ubuntu kernel: RIP: 0010:[<ffffffff8123cc9e>]  [<ffffffff8123cc9e>] propagate_one+0xbe/0x1c0
Oct 13 10:59:18 ubuntu kernel: RSP: 0018:ffff88007c347d68  EFLAGS: 00010297
Oct 13 10:59:18 ubuntu kernel: RAX: ffff880037623a80 RBX: ffff88004f8d3480 RCX: ffff88004f8d3180
Oct 13 10:59:18 ubuntu kernel: RDX: 0000000000000000 RSI: 0000000000000076 RDI: 0000000000000000
Oct 13 10:59:18 ubuntu kernel: RBP: ffff88007c347d78 R08: ffff88007b297000 R09: ffffffff813eadcc
Oct 13 10:59:18 ubuntu kernel: R10: ffffea0001ec0c00 R11: 0000000000003f91 R12: ffff8800375f4900
Oct 13 10:59:18 ubuntu kernel: R13: ffff88007c347dc0 R14: ffff88004f8d3480 R15: 0000000000000000
Oct 13 10:59:18 ubuntu kernel: FS:  00007f595e1ca840(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Oct 13 10:59:18 ubuntu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 13 10:59:18 ubuntu kernel: CR2: 0000000000000010 CR3: 0000000057974000 CR4: 00000000001406f0
Oct 13 10:59:18 ubuntu kernel: Stack:
Oct 13 10:59:18 ubuntu kernel:  ffff88004f8d3480 ffff8800375f4900 ffff88007c347db0 ffffffff8123d1c0
Oct 13 10:59:18 ubuntu kernel:  ffff880037623a80 ffff8800375f4900 ffff88003776e380 0000000000000000
Oct 13 10:59:18 ubuntu kernel:  ffff88007c347e98 ffff88007c347df8 ffffffff8122def7 ffff88007b0bea80
Oct 13 10:59:18 ubuntu kernel: Call Trace:
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8123d1c0>] propagate_mnt+0x120/0x150
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8122def7>] attach_recursive_mnt+0x147/0x230
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8122e038>] graft_tree+0x58/0x90
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8122e0fe>] do_add_mount+0x8e/0xd0
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8122eed0>] do_mount+0x2c0/0xe00
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8122d064>] ? mntput+0x24/0x40
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff811eebf4>] ? __kmalloc_track_caller+0x1b4/0x250
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8120e5c0>] ? __fput+0x190/0x220
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff811abda2>] ? memdup_user+0x42/0x70
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff8122fd3f>] SyS_mount+0x9f/0x100
Oct 13 10:59:18 ubuntu kernel:  [<ffffffff818244f2>] entry_SYSCALL_64_fastpath+0x16/0x71
Oct 13 10:59:18 ubuntu kernel: Code: 39 90 d8 00 00 00 75 ec 8b b0 10 01 00 00 48 89 3d 20 e0 f8 00 48 89 05 21 e0 f8 00 39 b1 10 01 00 00 74 19 48 8b bf d8 00 00 00 <48> 8b 47 10
Oct 13 10:59:18 ubuntu kernel: RIP  [<ffffffff8123cc9e>] propagate_one+0xbe/0x1c0
Oct 13 10:59:18 ubuntu kernel:  RSP <ffff88007c347d68>
Oct 13 10:59:18 ubuntu kernel: CR2: 0000000000000010
Oct 13 10:59:18 ubuntu kernel: ---[ end trace f43e7d84ab4ddab3 ]---
Oct 13 10:59:18 ubuntu systemd[1]: snap-failing\x2dconfig\x2dhooks-x1.mount: Mount process exited, code=killed status=9
Oct 13 10:59:18 ubuntu systemd[1]: Failed to mount Mount unit for failing-config-hooks.

Looks like the problem is an outdated kernel. This was fixed in the Ubuntu kernels in May. The original fix can be found here: https://marc.info/?l=linux-fsdevel&m=146246187014403 and the LP bug is https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1572316

This gets my ack. Thanks for the fixups!

I do ask for you to clarify and fix, if needed, the profile question that I had around the potentially duplicate /snap/ and @SNAP_MOUNT_DIR@/ rules.

Thanks @zyga!

src/snap-confine.apparmor.in
+ # all of the constructed rootfs is a rslave
+ mount options=(rw rslave) -> /tmp/snap.rootfs_*/,
+ # bidirectional mounts
+ mount options=(rw rbind) /media/ -> /tmp/snap.rootfs_*/media/,
@tyhicks

tyhicks Oct 13, 2016

Collaborator

There's another bidirectional mount when MERGED_USR is set:

mount options=(rw rbind) /run/media/ -> /tmp/snap.rootfs_*/run/media/,

However, I think you said that MERGED_USR is only used on Fedora and we know that SELinux is used there instead of AppArmor so, in practice, this isn't an issue.

@zyga

zyga Oct 13, 2016

Collaborator

Exactly :)

+ /snap/ r,
+ /snap/** r,
+ @SNAP_MOUNT_DIR@/ r,
+ @SNAP_MOUNT_DIR@/** r,
@tyhicks

tyhicks Oct 13, 2016

Collaborator

Isn't @SNAP_MOUNT_DIR@ set to /snap in most situations? I don't think we need both sets of rules here.

@zyga

zyga Oct 13, 2016

Collaborator

We need both for before pivot @SNAP_MOUNT_DIR@/ and after pivot /snap/

@tyhicks

tyhicks Oct 13, 2016

Collaborator

Thanks for the clarification.

Collaborator

zyga commented Oct 13, 2016

This now passed snapd tests in classic using zyga/snapd@06e5942#diff-556bb7431481e375713ea3e0883a771aR82 (just modified the branch name to test against this branch). Once I can repeat that with core (aka all-snap) I'll press the big green merge button.

Ensure that /etc/alternatives is not shared
This is just making sure that we don't ever propagate events from
/etc/alternatives outside our namespace.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Collaborator

tyhicks commented on src/mount-support.c in a6bedac Oct 13, 2016

That should be --make-slave not rslave

Collaborator

tyhicks commented Oct 13, 2016

Thanks for fixing the typo. Looks good to me.

zyga added some commits Oct 13, 2016

Fix typo in debug message and add one more debug message
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Fix apparmor profile for /etc/alternatives on core
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Indicate that expected responses are for classic
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Add expected file from core system (not tested yet)
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Document merged /usr and /media handling in apparmor
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

@zyga zyga merged commit dbbf80d into master Oct 14, 2016

1 check was pending

continuous-integration/travis-ci/pr The Travis CI build is in progress

@zyga zyga deleted the media-sharing branch Oct 14, 2016
