Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
Add support module for namespace sharing #126
Conversation
|
I spent some time looking at this today and will comment inline. I can say that I may have further comments once the PR for using these functions is in place. There is a lot of 'you should only do this if that' language in the header that I think makes this module somewhat error prone to use. The header does do a good job of explaining each function though. Perhaps it should at the top also give steps on the ordering of how the functions are intended to be used? |
jdstrand
reviewed
Sep 6, 2016
| + | ||
| +As a part of application startup `snap-confine` will move the process to a new | ||
| +mount namespace. Since version 1.0.41 all the applications belonging to a given | ||
| +snap will share the mount namespace amongst them. |
jdstrand
Sep 6, 2016
Contributor
Perhaps: "As of version 1.0.41 all the applications from the same snap will share the same mount namespace. Applications from different snaps continue to use separate mount namespaces."
jdstrand
reviewed
Sep 6, 2016
| @@ -126,6 +133,18 @@ FILES | ||
| Description of the seccomp profile. | ||
| +`/run/snapd/ns/.lock`: | ||
| + | ||
| + Global lock acquired to initialize namespace sharing feature |
jdstrand
Sep 6, 2016
Contributor
FYI, I found the .lock file under-documented. I'm not sure where it would be best to document it, but it wasn't part of the google doc spec and I had to put it together myself why it was needed.
jdstrand
reviewed
Sep 6, 2016
| + g_test_queue_destroy((GDestroyNotify) sc_set_ns_dir, SC_NS_DIR); | ||
| + g_test_queue_free(ns_dir); | ||
| + sc_set_ns_dir(ns_dir); | ||
| + // TODO: queue something that rm -rf's the directory tree |
zyga
Sep 6, 2016
•
Collaborator
Yes, I just found glib lacking in performing this thing in one, nice call.
jdstrand
reviewed
Sep 6, 2016
| + g_assert_cmpint(group->child, ==, 0); | ||
| + g_assert_cmpint(group->should_populate, ==, false); | ||
| + g_assert_null(group->name); | ||
| +} |
jdstrand
reviewed
Sep 6, 2016
| + sc_unlock_ns_mutex(group); | ||
| + // Re-attempt the locking operation. This time it shoud succeed. | ||
| + err = flock(lock_fd, LOCK_EX | LOCK_NB); | ||
| + g_assert_cmpint(err, ==, 0); |
zyga
Sep 6, 2016
Collaborator
Closing the file descriptor unlocks the lock. We could do it but at some point there's no need to do it.
jdstrand
reviewed
Sep 6, 2016
| +#include <sys/prctl.h> | ||
| +#include <sys/stat.h> | ||
| +#include <sys/types.h> | ||
| +#include <sys/types.h> |
jdstrand
reviewed
Sep 6, 2016
| +/** | ||
| + * Effective value of SC_NS_DIR. | ||
| + * | ||
| + * This is only altered for testing. |
jdstrand
Sep 6, 2016
Contributor
Perhaps change this to "We use 'const char *' so we can update sc_ns_dir in the testsuite"
jdstrand
reviewed
Sep 6, 2016
| +// | ||
| +// That is, it cannot be shared with any other peer as defined by kernel | ||
| +// documentation listed here: | ||
| +// https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt |
jdstrand
Sep 6, 2016
Contributor
I think this could be worded a bit better. Eg:
// Read /proc/self/mountinfo and check if /run/snapd/ns is a private bind mount.
//
// We do this because /run/snapd/ns cannot be shared with any other peers as per:
// https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt
Also, why // here and /** up above?
zyga
Sep 6, 2016
Collaborator
I moved from vm to vm and I used misconfigure vim. I'll update the docs to match.
jdstrand
reviewed
Sep 6, 2016
| + mkpath(sc_ns_dir); | ||
| + debug("opening namespace group directory %s", sc_ns_dir); | ||
| + dir_fd = open(sc_ns_dir, O_DIRECTORY | O_PATH | O_CLOEXEC | O_NOFOLLOW); | ||
| + if (dir_fd == -1) { |
jdstrand
Sep 6, 2016
Contributor
I tend to prefer < 0 checks instead of == -1. I won't block if this stays as is. Note, you use == -1 in quite a few places.
zyga
Sep 6, 2016
Collaborator
I realised I made this mistake just before pushing this. I was on edge between < 0 and -1 and I think I'll go back to < 0 as they cannot be mistyped to = -1 and make sense in any way.
jdstrand
reviewed
Sep 6, 2016
| + | ||
| +void sc_initialize_ns_groups() | ||
| +{ | ||
| + int dir_fd __attribute__ ((cleanup(sc_cleanup_close))) = -1; |
jdstrand
Sep 6, 2016
Contributor
Moving this down above dir_fd = open(... would be slightly more clear.
jdstrand
reviewed
Sep 6, 2016
| + } | ||
| +} | ||
| + | ||
| +static bool sc_is_ns_group_dir_private() |
jdstrand
Sep 6, 2016
Contributor
I would probably prefer this to be declared above sc_initialize_ns_groups(), but I won't block on it.
jdstrand
reviewed
Sep 6, 2016
| + debug("unlocking the namespace group directory"); | ||
| + if (flock(lock_fd, LOCK_UN) < 0) { | ||
| + die("cannot release lock for namespace control directory"); | ||
| + } |
jdstrand
Sep 6, 2016
Contributor
Perhaps a comment stating that the caller must call sc_close_ns_group() to close dir_fd and lock_fd would be clearer.
zyga
Sep 8, 2016
Collaborator
Hmm? Wait, this is not the same lock file. This is the global lock file that only this function ever touches.
jdstrand
reviewed
Sep 6, 2016
| +// https://www.kernel.org/doc/Documentation/filesystems/sharedsubtree.txt | ||
| +static bool sc_is_ns_group_dir_private(); | ||
| + | ||
| +void sc_initialize_ns_groups() |
jdstrand
Sep 6, 2016
•
Contributor
I'm not sure there is anything to be done about it, but is seems that sc_initialize_ns_groups() is going to be called on every app invocation, which means every invocation will try to create sc_ns_dir, create the lock file and parse the mountinfo table. Since /run is tmpfs we don't have to worry about media wear and performance should be ok, plus this only has to be done when the app is launched, not for every fork/exec the app does, but thought I'd at least point it out explicitly.
zyga
Sep 6, 2016
Collaborator
I'm aware of this and I was wondering if there's any way that we can avoid this without introducing some kind of raciness. I guess we could do it in a job that runs before we mount, say, the core snap, but at some point this becomes a bit distant and magic and I was worried that the essential property for being able to preserve the mount namespace would get lost.
jdstrand
reviewed
Sep 6, 2016
| + if (strcmp(mount_dir, sc_ns_dir) == 0 | ||
| + && strcmp(optional_fields, "") == 0) { | ||
| + // If the /run/snapd/ns directory is mounted and has no optional | ||
| + // fields (i.e. the shared:xxx is not there) then we are all set. |
jdstrand
Sep 6, 2016
Contributor
Can have more than just 'shared:xxx'. Perhaps say: // If /run/snapd/ns has no optional fields, we know it is mounted private and there is nothing else to do.
jdstrand
reviewed
Sep 6, 2016
| + const char *optional_fields = | ||
| + mountinfo_entry_optional_fields(entry); | ||
| + if (strcmp(mount_dir, sc_ns_dir) == 0 | ||
| + && strcmp(optional_fields, "") == 0) { |
jdstrand
Sep 6, 2016
Contributor
This strcmp() is perfectly correct for today's parse_mountinfo_entry(), but if parse_mountinfo_entry() ever changes, we'll have a bug here. Can you add a comment to parse_mountinfo_entry() in mountinfo.c circa line 238 that we 'strcpy(entry->optional_fields, "");' to guarantee consumers have a non-null string?
jdstrand
reviewed
Sep 6, 2016
| + int lock_fd; | ||
| + // Descriptor to an eventfd that is used to notify the child that it can | ||
| + // now complete its job and exit. | ||
| + int event_fd; |
jdstrand
Sep 6, 2016
Contributor
I was surprised to see event_fd based on the spec, but I don't think it is a problem.
zyga
Sep 6, 2016
Collaborator
It is easier than a pipe and more reliable (race free) than a signal handler.
jdstrand
reviewed
Sep 6, 2016
| + int event_fd; | ||
| + // Identifier of the child process that is used during the one-time (per | ||
| + // group) initialization and capture process. | ||
| + pid_t child; |
jdstrand
Sep 6, 2016
Contributor
I know that you calloc'd this and so child is already zero, but an explicit assignment with a comment that you check for non-zero in other functions to verify the mount namespace capture process' status would make this a bit clearer.
tyhicks
reviewed
Sep 6, 2016
| + int lock_fd __attribute__ ((cleanup(sc_cleanup_close))) = -1; | ||
| + lock_fd = openat(dir_fd, | ||
| + SC_NS_LOCK_FILE, | ||
| + O_CREAT | O_RDWR | O_CLOEXEC | O_NOFOLLOW, 0600); |
tyhicks
Sep 6, 2016
Collaborator
On 09/06/2016 04:17 PM, jdstrand wrote:
In src/ns-support.c
#126 (comment):
+void sc_initialize_ns_groups()
+{
- int dir_fd attribute ((cleanup(sc_cleanup_close))) = -1;
- debug("creating namespace group directory %s", sc_ns_dir);
- mkpath(sc_ns_dir);
- debug("opening namespace group directory %s", sc_ns_dir);
- dir_fd = open(sc_ns_dir, O_DIRECTORY | O_PATH | O_CLOEXEC | O_NOFOLLOW);
- if (dir_fd == -1) {
die("cannot open namespace group directory");- }
- debug("opening lock file for group directory");
- int lock_fd attribute ((cleanup(sc_cleanup_close))) = -1;
- lock_fd = openat(dir_fd,
SC_NS_LOCK_FILE,O_CREAT | O_RDWR | O_CLOEXEC | O_NOFOLLOW, 0600);Please also add O_EXCL to this openat() call (at which point, you don't
need to specify O_NOFOLLOW).
I don't think that a lockfile should be opened with O_EXCL. Requiring
O_EXCL defeats the purpose of the lockfile since the exclusivity
provided by the lockfile is enforced by flock() once you've opened the file.
(Note that I have not reviewed the code in this PR at all but saw this
review comment fly by and thought I'd comment on it)
jdstrand
Sep 6, 2016
•
Contributor
Heh, of course you are right there. :) I've deleted those two comments.
jdstrand
reviewed
Sep 6, 2016
| + if (group->dir_fd == -1) { | ||
| + die("cannot open directory for namespace group %s", group_name); | ||
| + } | ||
| + char lock_fname[PATH_MAX]; |
jdstrand
Sep 6, 2016
Contributor
I fear we may be in an 'undefined' area here. The actual full path is sc_ns_dir + '/' + lock_fname so I'm not sure what side-effects there might be if lock_name actually was PATH_MAX - 1. Really this need only be the maximum of SNAP_NAME + .lock + '\0', but I realize we don't want to be that tightly coupled with snapd's definition of how long SNAP_NAME can be. Perhaps define this length as PATH_MAX - length of sc_ns_dir + '/' + '\0'?
jdstrand
reviewed
Sep 6, 2016
| + group->name); | ||
| + close(group->dir_fd); | ||
| + close(group->lock_fd); | ||
| + close(group->event_fd); |
jdstrand
Sep 6, 2016
•
Contributor
Might be nice to check if dir_fd, lock_fd and event_fd are all '>= 0' here. I like to also check the return code of close(), but the dir_fd and the lock_fd aren't being written to and event_fd is only written to trigger an event, so not checking should be safe.
zyga
Sep 8, 2016
Collaborator
They don't have to be >= though, e.g event_fd is only used if required.
jdstrand
reviewed
Sep 6, 2016
| +} | ||
| + | ||
| +void sc_lock_ns_mutex(struct sc_ns_group *group) | ||
| +{ |
jdstrand
Sep 6, 2016
Contributor
It would be nice if sc_lock_ns_mutex() and sc_unlock_ns_mutex() both checked if lock_fd was '>=0'.
jdstrand
reviewed
Sep 6, 2016
| + int mnt_fd __attribute__ ((cleanup(sc_cleanup_close))) = -1; | ||
| + mnt_fd = | ||
| + openat(group->dir_fd, mnt_fname, | ||
| + O_CREAT | O_RDONLY | O_CLOEXEC | O_NOFOLLOW, 0600); |
jdstrand
Sep 6, 2016
Contributor
Can you add a comment here on why you are not using O_EXCL? AIUI, you want to either open the nsfs mounted file or open the newly created file. It would be good to document that for auditability.
zyga
Sep 6, 2016
Collaborator
Yes, you are right both in the deduction of the intent and the suggestion to document this better.
jdstrand
reviewed
Sep 6, 2016
| + // killed before it managed to complete the process. | ||
| + if (errno != EINVAL) { | ||
| + die("cannot re-associate the mount namespace with namespace group %s", group->name); | ||
| + } |
jdstrand
Sep 6, 2016
Contributor
Per the setns() man page, EINVAL might be returned for a bunch of reasons. The comment should be updated and the code reviewed to make sure the assumptions regarding EINVAL are still valid.
zyga
Sep 8, 2016
Collaborator
You are right. I plan to change this a little to call fstatfs() to determine if the opened file descriptor belongs to nsfs or not. This will let us predictably die() when anything goes wrong.
jdstrand
reviewed
Sep 6, 2016
| + if (pid == 0) { | ||
| + // This is the child process which will capture the mount namespace. | ||
| + // | ||
| + // It will do so by bind-mounting it the SC_NS_MNT_FILE after |
jdstrand
reviewed
Sep 6, 2016
| + ("changing apparmor hat of the support process for mount namespace capture"); | ||
| + if (aa_change_hat("mount-namespace-capture-helper", 0) < 0) { | ||
| + die("cannot change apparmor hat of the support process for mount namespace capture"); | ||
| + } |
jdstrand
Sep 6, 2016
Contributor
Why aa_change_hat() instead of aa_change_profile()? This seems like a one way operation for the child so aa_change_profile() may be better. Also, I think you can drop -helper since it implies a helper binary (at least to me anyway).
zyga
Sep 8, 2016
Collaborator
No particular reason, the profile name looked nice.
What should be the profile name here? snap-confine//mount-namespace-capture or just mount-namespace-capture?
jdstrand
Sep 8, 2016
•
Contributor
Depends on the policy. I suggest you use a child profile so it is clear this is for snap-confine. Eg:
@LIBEXECDIR@/snap-confine (attach_disconnected) {
...
profile mount-namespace-capture (attach_disconnected) {
...
}
}
With this you would use @LIBEXECDIR@/snap-confine//mount-namespace-capture.
jdstrand
reviewed
Sep 6, 2016
| + } | ||
| + debug("forking support process for mount namespace capture"); | ||
| + pid_t pid = fork(); | ||
| + debug("forked support process has pid %d", (int)pid); |
jdstrand
Sep 6, 2016
Contributor
A comment on how casting pid to int is ok with glibc would be good here.
jdstrand
reviewed
Sep 6, 2016
| + if (prctl(PR_SET_PDEATHSIG, SIGINT, 0, 0, 0) < 0) { | ||
| + die("cannot set parent process death notification signal to SIGINT"); | ||
| + } | ||
| + if (fchdir(group->dir_fd) < -1) { |
jdstrand
reviewed
Sep 6, 2016
| + } | ||
| + } | ||
| + // Get back to the original directory | ||
| + if (fchdir(old_dir_fd) < -1) { |
jdstrand
reviewed
Sep 6, 2016
| + die("cannot open current directory"); | ||
| + } | ||
| + // Move to the mount namespace directory (/run/snapd/ns) | ||
| + if (fchdir(group->dir_fd) < -1) { |
jdstrand
reviewed
Sep 6, 2016
| + * used for storing preserved namespaces as bind-mounted files from the nsfs | ||
| + * filesystem (namespace filesystem). | ||
| + * | ||
| + * This function acquires a floc(2)-based lock to ensure that no other instance |
jdstrand
reviewed
Sep 6, 2016
| + * | ||
| + * This will open and keep file descriptors for /run/snapd/ns/ as well as for | ||
| + * /run/snapd/ns/${group_name}.lock. The lock file is created if necessary but | ||
| + * is not locked in any way. |
jdstrand
Sep 6, 2016
Contributor
'is not locked in any way' is confusing since this is a lock file. Perhaps say, 'is not locked until sc_lock_ns_mutex() is called'. Also, it seems a bit odd that the lock file is created here but not locked until another function. Perhaps document why that is?
jdstrand
reviewed
Sep 6, 2016
| + * | ||
| + * If the return value is true then at this stage the namespace is already | ||
| + * unshared. The caller should perform any mount operations that are desired | ||
| + * and then proceed to call sc_preserve_ns_group(). |
jdstrand
Sep 6, 2016
Contributor
sc_preserve_ns_group() is not a thing, I think you mean sc_preserve_populated_ns_group()
|
For completeness, please also add spread tests once there are consumers for |
jdstrand
reviewed
Sep 6, 2016
| + * can be done by calling sc_preserve_populated_ns_group()). In the meantime | ||
| + * the parent process unshares the mount namespace() and sets a flag so that | ||
| + * sc_should_populate_ns_group() returns true. | ||
| + * |
|
Thank you for the review. It's a bit too late to do this responsibly but I will go over all the feedback and ensure it is addressed. Depending on how you feel I can propose a sister branch that uses this module to implement fully-working mount namespace sharing, along with a few spread tests for the basic property. |
jdstrand
reviewed
Sep 6, 2016
| + // task or wait for anything. | ||
| + if (prctl(PR_SET_PDEATHSIG, SIGINT, 0, 0, 0) < 0) { | ||
| + die("cannot set parent process death notification signal to SIGINT"); | ||
| + } |
jdstrand
Sep 6, 2016
Contributor
This may make sense to move above the aa_change_* call, but I'm curious as to the affect on policy it will have on the parent and child profiles. I'm not terribly worried about the timing of the call-- in the unlikely event the parent dies before this call (which is not attacker controlled), the child will become a zombie and at some point init should reap it so even if the parent is killed before prctl, we should be ok.
zyga
Sep 8, 2016
Collaborator
Let's postpone changing this until the next pull request where we actually start to use this (and we can see the apparmor profile).
zyga
added some commits
Sep 7, 2016
jdstrand
reviewed
Sep 8, 2016
| + // print pid's portably so this is the best we can do. | ||
| + pid_t pid = fork(); | ||
| + debug("forked support process has pid %d", (int)pid); | ||
| + if (pid == -1) { |
jdstrand
reviewed
Sep 8, 2016
| @@ -112,7 +112,7 @@ void sc_create_or_join_ns_group(struct sc_ns_group *group); | ||
| * | ||
| * If the return value is true then at this stage the namespace is already | ||
| * unshared. The caller should perform any mount operations that are desired | ||
| - * and then proceed to call sc_preserve_ns_group(). | ||
| + * and then proceed to call sc_preserve_populated_ns_group(). |
jdstrand
reviewed
Sep 8, 2016
| +static void rm_rf(const char *dir) | ||
| +{ | ||
| + gint exit_status = 0; | ||
| + gchar *cmd = g_strdup_printf("rm -rf %s", dir); |
jdstrand
Sep 8, 2016
•
Contributor
Oh, scary. I hope there are no bugs in the test code otherwise the developer might have severe data loss....
jdstrand
Sep 8, 2016
Contributor
Can we do some checks on dir to make sure it is inside the test dir?
jdstrand
Sep 8, 2016
Contributor
Alternatively remove the files we created and them rmdir the directory. If we make rmdir() fail the test, then we can catch stray files added via code changes easily. I'd prefer this approach.
zyga
Sep 8, 2016
Collaborator
I tried to use the function that doesn't do shell expansion but glib is just annoying here. I'll try again.
zyga
Sep 12, 2016
Collaborator
I've implemented this a little bit better now. I think it good to merge now.
jdstrand
reviewed
Sep 8, 2016
| @@ -235,6 +235,7 @@ static struct mountinfo_entry *parse_mountinfo_entry(const char *line) | ||
| if ((entry->mount_opts = parse_next_string_field()) == NULL) | ||
| goto fail; | ||
| entry->optional_fields = &entry->line_buf[0] + total_used++; | ||
| + // NOTE: This ensure s that optional_fields is never NULL. |
jdstrand
Sep 8, 2016
Contributor
There is a typo here. Perhaps be a little more explicit though. Eg: // NOTE: This ensures that optional_fields is never NULL. If this changes, must adjust all callers of parse_mountinfo_entry() accordingly.
|
Thanks for all the changes! I think this is ok to merge provided:
Don't feel like you need to implement the apparmor policy changes just yet as you mentioned, but let's not cut a new version of snap-confine until the other bits are in place. |
zyga
added some commits
Sep 8, 2016
jdstrand
reviewed
Sep 8, 2016
| - debug | ||
| - ("cannot re-associate the mount namespace with namespace group %s, falling back to initialization", | ||
| - group->name); | ||
| + debug("initializing new namespace group %s", group->name); |
jdstrand
Sep 8, 2016
Contributor
This is soo much nicer, thanks! :)
It's interesting that NSFS_MAGIC is not included in man fstatfs. Note that this was added only in http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/include/uapi/linux/magic.h?id=e149ed2b805fefdccf7ccdfc19eca22fdd4514ac so this will likely need to be added to the backports list if it isn't already. It does seem present in the 14.04 Ubuntu kernel.
zyga
Sep 12, 2016
Collaborator
I've informed tvoss about this. With the sanity check we'll know if tests are all green. Thanks for making me aware of this :)
jdstrand
reviewed
Sep 8, 2016
| + int err = statfs("/proc/self/ns/mnt", &buf); | ||
| + g_assert_cmpint(err, ==, 0); | ||
| + g_assert_cmpint(buf.f_type, ==, NSFS_MAGIC); | ||
| +} |
zyga commentedSep 6, 2016
•
Edited 1 time
-
zyga
Sep 6, 2016
This patch adds a module that implements mount namespace sharing. It is
not yet used by snap-confine. The module implements a set of library
functions that can perform the necessary magic.
As the topic is a bit unexpected a brief description follows.
Linux namespaces can be created with the unshare(2) call. They are
represented as files in the nsfs filesystem under /proc/$pid/ns/. The
mount namespace is represented as /proc/$pid/ns/mnt. An open file
descriptor to a file like that can be used along with setns(2) to move a
thread to a different namespace.
While namespaces are normally bound to the lifetime of a given process
they can be preserved by bind-mounting the appropriate namespace file to
another location. This feature serves as the basis for the namespace
sharing feature.
The mount namespace is a little bit special as it requires some
additional things to be in place before the bind mount can happen. All
violations of the rules listed below results in mount failing with
EINVAL.
In order to preserve the mount namespace the calling process must be in
a different mount namespace (snap-confine just uses the original namespace in
which it executes). This can be achieved by forking a child process that
can see its parent mount namespace file in /proc/$ppid/ns/mnt and having
the parent process move to a new namespace by calling unshare(). The
destination of the bind mount must be on a filesystem that is mounted
without any peers (in the sense of shared subtrees). This can be checked
by inspecting and parsing the /proc/self/mountinfo file. The new module
includes support function that bind mounts the target directory over
itself and the converts it to a private mount.
Actual mount namespaces are kept in /run/snapd/ns (this is also the
directory that gets converted to a private mount). The actual namespaces
are put in "groups" with names identical to the snap name. In practice
the preserved namespaces are in /run/snapd/ns/$SNAP_NAME.mnt
In order to make everything race-free the library uses locking based on
flock(2). There are two lock files. One global, required to make
/run/snapd/ns a private mount, and one local, required to manipulate a
particular mount namespace. The locks are /run/snapd/ns/.lock and
/run/snaps/ns/$SNAP_NAME.lock respectively.
Everything is coded defensively, terminating the process in case
something bad happens. The child process of snap-confine is using
prctl() to ensure it gets killed when the parent process dies for any
reason.
One notable thing that is not present in this patch is the adjusted
apparmor profile for snap-confine and for its child process (it switches
hats to a new profile for added paranoia and security). It will be
presented along with a, hopefully small, patch that enables namespace
sharing in practice.
Signed-off-by: Zygmunt Krynicki zygmunt.krynicki@canonical.com