Add support module for namespace sharing #126

zyga · 2016-09-06T12:34:52Z

This patch adds a module that implements mount namespace sharing. It is
not yet used by snap-confine. The module implements a set of library
functions that can perform the necessary magic.

As the topic is a bit unexpected, a brief description follows.

Linux namespaces can be created with the unshare(2) call. They are
represented as files in the nsfs filesystem under /proc/$pid/ns/. The
mount namespace is represented as /proc/$pid/ns/mnt. An open file
descriptor to a file like that can be used along with setns(2) to move a
thread to a different namespace.
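
A minimal sketch of the setns(2) step described above. The helper name
join_mount_ns is made up for illustration and is not part of this patch;
the call needs privileges (CAP_SYS_ADMIN) to succeed for a mount
namespace.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

// Hypothetical helper: move the calling thread into the mount namespace
// referred to by ns_path (for example "/proc/1234/ns/mnt").
// Returns 0 on success and -1 on failure, with errno set by open or setns.
static int join_mount_ns(const char *ns_path)
{
	assert(ns_path != NULL);
	int fd = open(ns_path, O_RDONLY | O_CLOEXEC);
	if (fd < 0) {
		return -1;
	}
	// Passing CLONE_NEWNS makes setns fail unless fd really refers
	// to a mount namespace.
	int rc = setns(fd, CLONE_NEWNS);
	// Preserve the errno from setns across close.
	int saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return rc;
}
```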

While namespaces are normally bound to the lifetime of a given process
they can be preserved by bind-mounting the appropriate namespace file to
another location. This feature serves as the basis for the namespace
sharing feature.
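
As a rough sketch, preserving a namespace boils down to a single bind
mount. The helper name preserve_ns is an assumption made for this
example; the destination must already exist as a regular file and the
caller needs the privileges required by mount(2).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mount.h>

// Hypothetical helper: keep the namespace behind ns_file (for example
// "/proc/1234/ns/net") alive by bind mounting it over dst.
// Returns 0 on success and -1 on failure with errno set by mount.
static int preserve_ns(const char *ns_file, const char *dst)
{
	assert(ns_file != NULL && dst != NULL);
	return mount(ns_file, dst, NULL, MS_BIND, NULL);
}
```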

The mount namespace is a little bit special as it requires some
additional things to be in place before the bind mount can happen. Any
violation of the rules listed below results in mount(2) failing with
EINVAL.

In order to preserve the mount namespace the calling process must be in
a different mount namespace (snap-confine just uses the original namespace in
which it executes). This can be achieved by forking a child process that
can see its parent mount namespace file in /proc/$ppid/ns/mnt and having
the parent process move to a new namespace by calling unshare(). The
destination of the bind mount must be on a filesystem that is mounted
without any peers (in the sense of shared subtrees). This can be checked
by inspecting and parsing the /proc/self/mountinfo file. The new module
includes a support function that bind mounts the target directory over
itself and then converts it to a private mount.
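
The two pieces mentioned above could look roughly like this. Both
function names are invented for the example and the code in the patch
may differ. The check assumes the documented mountinfo layout, where
optional fields such as "shared:N" sit between the mount options and a
lone "-" separator.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/mount.h>

// Return true if a single /proc/self/mountinfo line carries a "shared:N"
// optional field, meaning the mount belongs to a peer group.
static bool mountinfo_line_is_shared(const char *line)
{
	assert(line != NULL);
	// Skip the six fixed fields: mount ID, parent ID, major:minor,
	// root, mount point and mount options.
	const char *p = line;
	for (int i = 0; i < 6; i++) {
		p = strchr(p, ' ');
		if (p == NULL) {
			return false;
		}
		p++;
	}
	// Optional fields run until the lone "-" separator.
	while (*p != '\0' && !(p[0] == '-' && (p[1] == ' ' || p[1] == '\0'))) {
		if (strncmp(p, "shared:", 7) == 0) {
			return true;
		}
		const char *next = strchr(p, ' ');
		if (next == NULL) {
			break;
		}
		p = next + 1;
	}
	return false;
}

// Bind mount dir over itself and then mark the new mount as private.
// Returns 0 on success and -1 on failure with errno set by mount.
static int make_private_bind(const char *dir)
{
	assert(dir != NULL);
	if (mount(dir, dir, NULL, MS_BIND | MS_REC, NULL) < 0) {
		return -1;
	}
	return mount("none", dir, NULL, MS_PRIVATE, NULL);
}
```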

Actual mount namespaces are kept in /run/snapd/ns (this is also the
directory that gets converted to a private mount). The actual namespaces
are put in "groups" with names identical to the snap name. In practice
the preserved namespaces are in /run/snapd/ns/$SNAP_NAME.mnt.

In order to make everything race-free the library uses locking based on
flock(2). There are two lock files. One global, required to make
/run/snapd/ns a private mount, and one local, required to manipulate a
particular mount namespace. The locks are /run/snapd/ns/.lock and
/run/snapd/ns/$SNAP_NAME.lock respectively.
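
A small sketch of flock(2)-based locking on a lock file. The names
lock_file and unlock_file are assumptions for this example, not the
functions added by the patch, and unlike the real module the sketch
returns an error instead of terminating the process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

// Open (creating if needed) a lock file and take an exclusive flock(2)
// lock on it, blocking until the lock is available.
// Returns the locked file descriptor, or -1 on failure.
static int lock_file(const char *path)
{
	assert(path != NULL);
	int fd = open(path, O_CREAT | O_RDWR | O_CLOEXEC | O_NOFOLLOW, 0600);
	if (fd < 0) {
		return -1;
	}
	if (flock(fd, LOCK_EX) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

// Release the lock and close the descriptor.
static void unlock_file(int fd)
{
	flock(fd, LOCK_UN);
	close(fd);
}
```

The global lock would be taken first, and only for as long as it takes
to make the directory private, before taking the per-snap lock.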

Everything is coded defensively, terminating the process in case
something bad happens. The child process of snap-confine is using
prctl() to ensure it gets killed when the parent process dies for any
reason.
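
The prctl() part can be sketched as follows. The function name is made
up for illustration, and the child here only reads the setting back
instead of doing any real work.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a child that asks the kernel to send it SIGKILL when its parent
// dies. For illustration the child only reads the setting back and exits
// with 0 if it took effect. Returns the child's exit status, or -1 on
// error.
static int run_child_with_pdeathsig(void)
{
	pid_t parent = getpid();
	pid_t pid = fork();
	if (pid < 0) {
		return -1;
	}
	if (pid == 0) {
		if (prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) < 0) {
			_exit(1);
		}
		// The parent may have died before prctl took effect, in
		// which case no signal would ever arrive.
		if (getppid() != parent) {
			_exit(2);
		}
		int sig = 0;
		if (prctl(PR_GET_PDEATHSIG, &sig, 0, 0, 0) < 0) {
			_exit(3);
		}
		_exit(sig == SIGKILL ? 0 : 4);
	}
	assert(pid > 0);
	int status = 0;
	if (waitpid(pid, &status, 0) < 0) {
		return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```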

One notable thing that is not present in this patch is the adjusted
apparmor profile for snap-confine and for its child process (it switches
hats to a new profile for added paranoia and security). It will be
presented along with a, hopefully small, patch that enables namespace
sharing in practice.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>


jdstrand · 2016-09-06T20:48:42Z

I spent some time looking at this today and will comment inline. I can say that I may have further comments once the PR for using these functions is in place. There is a lot of 'you should only do this if that' language in the header that I think makes this module somewhat error prone to use. The header does do a good job of explaining each function though. Perhaps it should at the top also give steps on the ordering of how the functions are intended to be used?

jdstrand · 2016-09-06T20:51:23Z

docs/snap-confine.rst

+
+As a part of application startup `snap-confine` will move the process to a new
+mount namespace. Since version 1.0.41 all the applications belonging to a given
+snap will share the mount namespace amongst them.


Perhaps: "As of version 1.0.41 all the applications from the same snap will share the same mount namespace. Applications from different snaps continue to use separate mount namespaces."


jdstrand · 2016-09-08T16:38:36Z

src/ns-support.c

+	// print pid's portably so this is the best we can do.
+	pid_t pid = fork();
+	debug("forked support process has pid %d", (int)pid);
+	if (pid == -1) {


jdstrand · 2016-09-08T17:05:44Z

Thanks for all the changes! I think this is ok to merge provided:

- my feedback from today is addressed (i.e. pid < 0, rm_rf, comment changes)
- your proposed fstatfs() is implemented

Don't feel like you need to implement the apparmor policy changes just yet as you mentioned, but let's not cut a new version of snap-confine until the other bits are in place.


This patch fixes potentially iffy code that used to open the mount namespace file /run/snapd/ns/$SNAP_NAME.mnt and then blindly call setns() on the file descriptor if the open call succeeded. Instead, we now fstatfs() the file descriptor and check the resulting f_type field. If it is NSFS_MAGIC then we know that we should use setns(). If not, we fall back to regular initialization. All errors from setns() now result in snap-confine death. This should be more reliable against unexpected bugs. Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
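
A rough sketch of the f_type check described above. The function names
are invented for the example, and the fallback definition of NSFS_MAGIC
is only there for headers that predate it. Unlike snap-confine, which
dies on errors, this sketch simply reports any failure as "not nsfs".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/magic.h>
#include <stdbool.h>
#include <sys/vfs.h>
#include <unistd.h>

#ifndef NSFS_MAGIC
#define NSFS_MAGIC 0x6e736673	// value from linux/magic.h
#endif

// Return true if fd refers to a file on nsfs, that is, a namespace file.
// Any fstatfs failure is reported as "not nsfs".
static bool fd_is_nsfs(int fd)
{
	struct statfs buf;
	if (fstatfs(fd, &buf) < 0) {
		return false;
	}
	return buf.f_type == NSFS_MAGIC;
}

// Convenience wrapper: open path and check whether it lives on nsfs.
static bool path_is_nsfs(const char *path)
{
	assert(path != NULL);
	int fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0) {
		return false;
	}
	bool result = fd_is_nsfs(fd);
	close(fd);
	return result;
}
```

A caller would only go on to setns() when the check returns true and
would otherwise initialize a fresh namespace.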

jdstrand · 2016-09-08T19:53:33Z

src/ns-support.c

-	debug
-	    ("cannot re-associate the mount namespace with namespace group %s, falling back to initialization",
-	     group->name);
+	debug("initializing new namespace group %s", group->name);


This is soo much nicer, thanks! :)

It's interesting that NSFS_MAGIC is not included in man fstatfs. Note that this was added only in http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/include/uapi/linux/magic.h?id=e149ed2b805fefdccf7ccdfc19eca22fdd4514ac so this will likely need to be added to the backports list if it isn't already. It does seem present in the 14.04 Ubuntu kernel.

I've informed tvoss about this. With the sanity check we'll know if tests are all green. Thanks for making me aware of this :)

This patch makes the cleanup function a little bit more paranoid by not expanding shell arguments and by ensuring that what is removed is in /tmp/. Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

jdstrand reviewed Sep 6, 2016

zyga force-pushed the ns-sharing branch from 3bf9141 to e6ce2f9 Compare September 8, 2016 11:03

Comment that optional_fields is never NULL

d20bc45

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

zyga force-pushed the ns-sharing branch from 8fae44a to d20bc45 Compare September 8, 2016 11:05

zyga added 7 commits September 8, 2016 13:06

Tweak comment

a890dbe

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Move function for better readability

524ce1a

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Move variable declaration closer to use

598b31d

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Tweak comment and unify comment style

0f51648

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Tweak comment

aa074ed

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Remove duplicate include

ab3c245

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Document why casting pid_t to int is safe

50ebef9

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

jdstrand reviewed Sep 8, 2016

zyga added 7 commits September 8, 2016 19:29

Fix function name

a4cc855

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Fix typo

b2fc050

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Reword comment

9123679

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Fix typo

6afb53b

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Test for '< 0' instead of '== -1'

afc1cc9

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

Add sanity check for nsfs

db1a076

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

jdstrand reviewed Sep 8, 2016

tianon mentioned this pull request Sep 8, 2016

Replace pivot_root implementation by LXC's #122

Closed

zyga added the Reviewed label Sep 12, 2016

Add some safeguards to rm_rf_tmp

d9adc64

This patch makes the cleanup function a little bit more paranoid by not expanding shell arguments and by ensuring that what is removed is in /tmp/. Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

zyga force-pushed the ns-sharing branch from d3b04f5 to d9adc64 Compare September 12, 2016 11:06

zyga merged commit 7f07af2 into master Sep 12, 2016

zyga deleted the ns-sharing branch September 12, 2016 11:18
