Add support module for namespace sharing #126
This patch adds a module that implements mount namespace sharing. It is not yet used by snap-confine. The module implements a set of library functions that can perform the necessary magic.

As the topic is a bit unexpected, a brief description follows.

Linux namespaces can be created with the unshare(2) call. They are represented as files in the nsfs filesystem under /proc/$pid/ns/. The mount namespace is represented as /proc/$pid/ns/mnt. An open file descriptor to such a file can be used along with setns(2) to move a thread to a different namespace.

While namespaces are normally bound to the lifetime of a given process, they can be preserved by bind-mounting the appropriate namespace file to another location. This serves as the basis for the namespace sharing feature.

The mount namespace is a little bit special as it requires some additional things to be in place before the bind mount can happen. Violating any of the rules listed below results in mount failing with EINVAL.

In order to preserve the mount namespace the calling process must be in a different mount namespace (snap-confine just uses the original namespace in which it executes). This can be achieved by forking a child process that can see its parent's mount namespace file in /proc/$ppid/ns/mnt and having the parent process move to a new namespace by calling unshare().

The destination of the bind mount must be on a filesystem that is mounted without any peers (in the sense of shared subtrees). This can be checked by inspecting and parsing the /proc/self/mountinfo file. The new module includes a support function that bind mounts the target directory over itself and then converts it to a private mount.

Actual mount namespaces are kept in /run/snapd/ns (this is also the directory that gets converted to a private mount). The actual namespaces are put in "groups" with names identical to the snap name. In practice the preserved namespaces are in /run/snapd/ns/$SNAP_NAME.mnt

In order to make everything race-free the library uses locking based on flock(2). There are two lock files: one global, required to make /run/snapd/ns a private mount, and one local, required to manipulate a particular mount namespace. The locks are /run/snapd/ns/.lock and /run/snapd/ns/$SNAP_NAME.lock respectively.

Everything is coded defensively, terminating the process in case something bad happens. The child process of snap-confine uses prctl() to ensure it gets killed when the parent process dies for any reason.

One notable thing that is *not* present in this patch is the adjusted apparmor profile for snap-confine and for its child process (it switches hats to a new profile for added paranoia and security). It will be presented along with a, hopefully small, patch that enables namespace sharing in practice.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
I spent some time looking at this today and will comment inline. I can say that I may have further comments once the PR for using these functions is in place. There is a lot of 'you should only do this if that' language in the header that I think makes this module somewhat error prone to use. The header does do a good job of explaining each function though. Perhaps it should at the top also give steps on the ordering of how the functions are intended to be used?
As a part of application startup `snap-confine` will move the process to a new
mount namespace. Since version 1.0.41 all the applications belonging to a given
snap will share the mount namespace amongst them.
Perhaps: "As of version 1.0.41 all the applications from the same snap will share the same mount namespace. Applications from different snaps continue to use separate mount namespaces."
Done
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
// print pid's portably so this is the best we can do.
pid_t pid = fork();
debug("forked support process has pid %d", (int)pid);
if (pid == -1) {
pid < 0?
Fixed
Fixed
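For context, the fork() error check discussed in this thread, combined with the prctl() behaviour mentioned in the description, might look roughly like the sketch below. The function name `run_support_process` and the exit codes are hypothetical and are not taken from the patch. The sketch treats any negative return from fork() as failure and has the child request SIGKILL on parent death via PR_SET_PDEATHSIG.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch.
 * Forks a child that arms a parent-death signal, then waits for it.
 * Returns the child's exit status, or -1 if fork/waitpid failed.
 */
int run_support_process(void)
{
    pid_t parent = getpid();
    pid_t pid = fork();
    int status = 0;

    if (pid < 0)
        return -1;      /* fork failed, no child was created */

    if (pid == 0) {
        int sig = 0;
        /* Ask the kernel to send SIGKILL when the parent dies. */
        if (prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) < 0)
            _exit(1);
        /* The parent may have died before prctl() took effect. */
        if (getppid() != parent)
            _exit(2);
        /* Read the setting back to confirm it is in place. */
        if (prctl(PR_GET_PDEATHSIG, &sig, 0, 0, 0) < 0 || sig != SIGKILL)
            _exit(3);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In a real support process the child would go on to do its work instead of exiting immediately. The early exit here only keeps the sketch easy to check.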
Thanks for all the changes! I think this is ok to merge provided:
Don't feel like you need to implement the apparmor policy changes just yet as you mentioned, but let's not cut a new version of snap-confine until the other bits are in place.
This patch fixes potentially iffy code that used to open the mount namespace file /run/snapd/ns/$SNAP_NAME.mnt and then blindly call setns() on the file descriptor if the open call succeeded. Instead, we now fstatfs() the file descriptor and check the resulting f_type field. If it is NSFS_MAGIC then we know that we should use setns(). If not, we fall back to regular initialization. All errors from setns() now result in snap-confine death. This should be more reliable against unexpected bugs. Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
debug
    ("cannot re-associate the mount namespace with namespace group %s, falling back to initialization",
     group->name);
debug("initializing new namespace group %s", group->name);
This is soo much nicer, thanks! :)
It's interesting that NSFS_MAGIC is not included in man fstatfs. Note that this was added only in http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/include/uapi/linux/magic.h?id=e149ed2b805fefdccf7ccdfc19eca22fdd4514ac so this will likely need to be added to the backports list if it isn't already. It does seem present in the 14.04 Ubuntu kernel.
I've informed tvoss about this. With the sanity check we'll know if tests are all green. Thanks for making me aware of this :)
This patch makes the cleanup function a little bit more paranoid by not expanding shell arguments and by ensuring that what is removed is in /tmp/. Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>