Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Rework mount namespace support #168
Conversation
| + die("cannot perform operation: mount --make-rslave %s", | ||
| + SC_HOSTFS_DIR); | ||
| + } | ||
| +#if 0 |
zyga
Oct 7, 2016
Collaborator
FYI, I was split on this (advice from stgraber earlier) but it seems to actually do the right thing (to be clear, the code as-is, with #if 0). I didn't see any adverse effects either.
zyga
added some commits
Oct 11, 2016
tyhicks
requested changes
Oct 11, 2016
This is a very complex change that is difficult to review in a short amount of time. However, the tests that you created are a big help in trusting the changes.
Most of my comments are suggestions and are not blockers. The only two blockers are the /run unidirectional mount happening after, and breaking, the /run/media unidirectional mount and the missing AppArmor profile changes.
| + * (end of quote). | ||
| + * | ||
| + * The main idea is to setup a mount namespace that has a root filesystem with | ||
| + * vfsmounts and peer groups that, depending on the location, either isolated |
| + * Selected directories (today just /media) can be shared in both directions. | ||
| + * This allows snaps with sufficient privileges to create additional mount | ||
| + * points that are visible by the rest of the system (both the main mount | ||
| + * namespace and namespaces of individual snaps). |
tyhicks
Oct 11, 2016
Collaborator
Please also point out that it allows snaps with sufficient privs to remove mount points that were visible to the rest of the system.
| + **/ | ||
| +static void sc_bootstrap_mount_namespace(const struct sc_mount_config *config) | ||
| +{ | ||
| + char scratch_dir[PATH_MAX] = "/tmp/snap.rootfs_XXXXXX"; |
tyhicks
Oct 11, 2016
Collaborator
Optional suggestion: save a little stack space and declare scratch_dir like so:
char scratch_dir[] = "/tmp/snap.rootfs_XXXXXX";
| + die("cannot perform operation: mount --make-unbindable %s", | ||
| + scratch_dir); | ||
| + } | ||
| + // Recursively bind mount desired root filesystem directory over of the |
| + // scratch directory. This puts the initial content into the scratch space | ||
| + // and serves as a foundation for all subsequent operations below. | ||
| + // | ||
| + // The mount is recursive because it can either be applied the root |
| + die("cannot perform operation: mount --make-rslave %s", | ||
| + dst); | ||
| + } | ||
| + } |
tyhicks
Oct 11, 2016
Collaborator
I'm a little worried that bidirectional_mounts and unidirectional_mounts are two different lists and that bidirectional is also processed before unidirectional. The concern comes from the possible need to set up a unidirectional mount before a bidirectional. This could be solved by combining them into one list of something like:
struct sc_mounts {
const char *path;
bool is_bidirectional;
};
This is only an observation and not a blocker for this PR.
| + etc_alternatives); | ||
| + // NOTE: MS_SLAVE so that the started process cannot maliciously mount | ||
| + // anything into those places and affect the system on the outside. | ||
| + debug("performing operation: mount --bind -o slave %s %s", src, |
tyhicks
Oct 11, 2016
Collaborator
I think "-o slave" works but we should change "-o slave" to "--make-slave" to match the rest of the debugging output.
zyga
Oct 13, 2016
Collaborator
This is subtly different. When you call mount --make-slave you get different arguments to the system call:
mount("none", "/some/path", NULL, MS_SLAVE, NULL)
| + debug("performing operation: mount --bind -o slave %s %s", src, | ||
| + dst); | ||
| + if (mount(src, dst, NULL, MS_BIND | MS_SLAVE, NULL) != 0) { | ||
| + die("cannot perform operation: mount --bind -o slave %s %s", src, dst); |
| + // directory is always /snap. On the host it is a build-time configuration | ||
| + // option stored in SNAP_MOUNT_DIR. | ||
| + must_snprintf(dst, sizeof dst, "%s/snap", scratch_dir); | ||
| + debug("performing operation: mount --rbind %s %s", SNAP_MOUNT_DIR, dst); |
| + debug("performing operation: mount --rbind %s %s", SNAP_MOUNT_DIR, dst); | ||
| + if (mount(SNAP_MOUNT_DIR, dst, NULL, MS_BIND | MS_REC | MS_SLAVE, NULL) | ||
| + < 0) { | ||
| + die("cannot perform operation: mount --rbind -o slave %s %s", |
| + die("cannot perform operation: mount --rbind -o slave %s %s", | ||
| + SNAP_MOUNT_DIR, dst); | ||
| + } | ||
| + debug("performing operation: mount --make-rslave slave %s", dst); |
| + "/var/snap", // to get access to global snap data | ||
| + "/var/lib/snapd", // to get access to snapd state and seccomp profiles | ||
| + "/var/tmp", // to get access to the other temporary directory | ||
| + "/run", // to get /run with sockets and what not |
tyhicks
Oct 11, 2016
•
Collaborator
Huh, here's what I was talking about earlier in the review where I was worried about bidirectional_mounts being processed before unidirectional_mounts.
If the MERGED_USR macro is set, then the /run/media bidirectional mount is performed. The problem is that the /run unidirectional mount will then clobber it and break the sharing of /run/media across snaps and the host.
zyga
Oct 11, 2016
Collaborator
Ah, correct. I was not testing this on fedora (no fedora CI yet) and that's where we use merged user. Nice catch!
| + umount /var/lib/snapd/hostfs/tmp/snap.rootfs_*/, | ||
| + mount options=(rw rslave) -> /var/lib/snapd/hostfs/, | ||
| + mount options=(rw rprivate) -> /var/lib/snapd/hostfs/, | ||
| + |
tyhicks
Oct 11, 2016
Collaborator
The AppArmor profile changes look incomplete. These are mostly all mounts that were already being performed and would therefore need AppArmor rules allowing them. It looks like this PR should either be reusing existing rules or removing the old rules that no longer suffice.
I'm holding off on reviewing the profile changes for now.
zyga
added some commits
Oct 13, 2016
|
I ran this branch against tests in snapd and ... crashed the kernel:
|
jamiedbennett
commented
Oct 13, 2016
|
Looks like the problem is an outdated kernel. This was fixed in May in the Ubuntu kernels, the original fix can be found here: https://marc.info/?l=linux-fsdevel&m=146246187014403 and the LP bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1572316 |
tyhicks
approved these changes
Oct 13, 2016
This gets my ack. Thanks for the fixups!
I do ask for you to clarify and fix, if needed, the profile question that I had around the potentially duplicate /snap/ and @SNAP_MOUNT_DIR@/ rules.
Thanks @zyga!
| + # all of the constructed rootfs is a rslave | ||
| + mount options=(rw rslave) -> /tmp/snap.rootfs_*/, | ||
| + # bidirectional mounts | ||
| + mount options=(rw rbind) /media/ -> /tmp/snap.rootfs_*/media/, |
tyhicks
Oct 13, 2016
Collaborator
There's another bidirectional mount when MERGED_USR is set:
mount options=(rw rbind) /run/media/ -> /tmp/snap.rootfs_*/run/media/,
However, I think you said that MERGED_USR is only used on Fedora and we know that SELinux is used there instead of AppArmor so, in practice, this isn't an issue.
| + /snap/ r, | ||
| + /snap/** r, | ||
| + @SNAP_MOUNT_DIR@/ r, | ||
| + @SNAP_MOUNT_DIR@/** r, |
tyhicks
Oct 13, 2016
Collaborator
Isn't @SNAP_MOUNT_DIR@ set to /snap in most situations? I don't think we need both sets of rules here.
zyga
Oct 13, 2016
Collaborator
We need both for before pivot @SNAP_MOUNT_DIR@/ and after pivot /snap/
|
This now passed snapd tests in classic using zyga/snapd@06e5942#diff-556bb7431481e375713ea3e0883a771aR82 (just modified the branch name to test against this branch). Once I can repeat that with core (aka all-snap) I'll press the big green merge button. |
|
That should be --make-slave not rslave |
|
Thanks for fixing the typo. Looks good to me. |
zyga commentedOct 7, 2016
•
Edited 1 time
-
zyga
Oct 14, 2016
This patch changes the contents of the initialized mount namespace. The
main motivator is to allow /media to be shared amongst the initial mount
namespace and the derivative namespaces created by snap-confine.
The code is unified and simplified around the new boostrap function. It
now takes a few simple arguments through a configuration structure. The
main argument is the location of the starting point of the desired root
filesystem. On classic this is the core snap (or ubuntu-core if core is
not available). On an all-snap system this is the root filesystem.
The bootstrap function puts the desired root filesystem into a new
temporary directory (using recursive bind mount) paying special
attention to avoid creating bind mount loops. This directory is now
called the constructed root filesystem and eventually, this is where we
pivot_root into.
Since we still have access to the initial root filesystem under the
initial sharing parameters as well as to the new constructed root
filesystem, we can choose to bind mount arbitrary directories over.
The code now supports a list of directories and a flag indicating if
the mount event propagation should be bidirectional or not.
In this patch /media is on the bidirectional list. On classic a large
list of directories is on the unidirectional list. On all snap the
unidirectional list is empty because we don't need anything as we're
sharing the root filesystem which is already correctly populated by the
all-snap initrd.
The existing code for special treatment of /snap and /etc/alternatives
was retained as is (semantically) but has been folded
into the bootstrap function for simplicity. This was not yet done for /tmp
and /dev/pts.
The code for nvidia support is now also handled by the bootstrap
function but this should be moved outside later. This can be done only
after nvidia code is adjusted to assume it executes after pivot root
(this will simplify it and allow us to do something sensible in all-snap
system later). I chose not to do this to simplify the change and review
process. To repeat, nvidia support is exactly as it was before.
The apparmor profile was adjusted to take account of all the new
(numerous) operations.
Signed-off-by: Zygmunt Krynicki zygmunt.krynicki@canonical.com