Skip to content
This repository has been archived by the owner on Oct 4, 2023. It is now read-only.

Add support module for namespace sharing #126

Merged
merged 35 commits into from Sep 12, 2016
Merged

Add support module for namespace sharing #126

merged 35 commits into from Sep 12, 2016

Conversation

zyga
Copy link
Contributor

@zyga zyga commented Sep 6, 2016

This patch adds a module that implements mount namespace sharing. It is
not yet used by snap-confine. The module implements a set of library
functions that can perform the necessary magic.

As the topic is a bit unexpected a brief description follows.

Linux namespaces can be created with the unshare(2) call. They are
represented as files in the nsfs filesystem under /proc/$pid/ns/. The
mount namespace is represented as /proc/$pid/ns/mnt. An open file
descriptor to a file like that can be used along with setns(2) to move a
thread to a different namespace.

While namespaces are normally bound to the lifetime of a given process
they can be preserved by bind-mounting the appropriate namespace file to
another location. This feature serves as the basis for the namespace
sharing feature.

The mount namespace is a little bit special as it requires some
additional things to be in place before the bind mount can happen. All
violations of the rules listed below results in mount failing with
EINVAL.

In order to preserve the mount namespace the calling process must be in
a different mount namespace (snap-confine just uses the original namespace in
which it executes). This can be achieved by forking a child process that
can see its parent mount namespace file in /proc/$ppid/ns/mnt and having
the parent process move to a new namespace by calling unshare(). The
destination of the bind mount must be on a filesystem that is mounted
without any peers (in the sense of shared subtrees). This can be checked
by inspecting and parsing the /proc/self/mountinfo file. The new module
includes support function that bind mounts the target directory over
itself and the converts it to a private mount.

Actual mount namespaces are kept in /run/snapd/ns (this is also the
directory that gets converted to a private mount). The actual namespaces
are put in "groups" with names identical to the snap name. In practice
the preserved namespaces are in /run/snapd/ns/$SNAP_NAME.mnt

In order to make everything race-free the library uses locking based on
flock(2). There are two lock files. One global, required to make
/run/snapd/ns a private mount, and one local, required to manipulate a
particular mount namespace. The locks are /run/snapd/ns/.lock and
/run/snaps/ns/$SNAP_NAME.lock respectively.

Everything is coded defensively, terminating the process in case
something bad happens. The child process of snap-confine is using
prctl() to ensure it gets killed when the parent process dies for any
reason.

One notable thing that is not present in this patch is the adjusted
apparmor profile for snap-confine and for its child process (it switches
hats to a new profile for added paranoia and security). It will be
presented along with a, hopefully small, patch that enables namespace
sharing in practice.

Signed-off-by: Zygmunt Krynicki zygmunt.krynicki@canonical.com

This patch adds a module that implements mount namespace sharing. It is
not yet used by snap-confine. The module implements a set of library
functions that can perform the necessary magic.

As the topic is a bit unexpected a brief description follows.

Linux namespaces can be created with the unshare(2) call. They are
represented as files in the nsfs filesystem under /proc/$pid/ns/. The
mount namespace is represented as /proc/$pid/ns/mnt. An open file
descriptor to a file like that can be used along with setns(2) to move a
thread to a different namespace.

While namespaces are normally bound to the lifetime of a given process
they can be preserved by bind-mounting the appropriate namespace file to
another location. This feature serves as the basis for the namespace
sharing feature.

The mount namespace is a little bit special as it requires some
additional things to be in place before the bind mount can happen. All
violations of the rules listed below results in mount failing with
EINVAL.

In order to preserve the mount namespace the calling process must be in
a different namespace (snap-confine just uses the original namespace in
which it executes). This can be achieved by forking a child process that
can see its parent mount namespace file in /proc/$ppid/ns/mnt and having
the parent process moves to a new namespace by calling unshare(). The
destination of the bind mount must be on a filesystem that is mounted
without any peers (in the sense of shared subtrees). This can be checked
by inspecting and parsing the /proc/self/mountinfo file. The new module
includes support function that bind mounts the target directory over
itself and the converts it to a private mount.

Actual mount namespaces are kept in /run/snapd/ns (this is also the
directory that gets converted to a private mount). The actual namespaces
are put in "groups" with names identical to the snap name. In practice
the preserved namespaces are in /run/snapd/ns/$SNAP_NAME.mnt

In order to make everything race-free the library uses locking based on
flock(2). There are two lock files. One global, required to make
/run/snapd/ns a private mount, and one local, required to manipulate a
particular mount namespace. The locks are /run/snapd/ns/.lock and
/run/snaps/ns/$SNAP_NAME.lock respectively.

Everything is coded defensively, terminating the process in case
something bad happens. The child process of snap-confine is using
prctl() to ensure it gets killed when the parent process dies for any
reason.

One notable thing that is *not* present in this patch is the adjusted
apparmor profile for snap-confine and for its child process (it switches
hats to a new profile for added paranoia and security). It will be
presented along with a, hopefully small, patch that enables namespace
sharing in practice.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
@jdstrand
Copy link
Collaborator

jdstrand commented Sep 6, 2016

I spent some time looking at this today and will comment inline. I can say that I may have further comments once the PR for using these functions is in place. There is a lot of 'you should only do this if that' language in the header that I think makes this module somewhat error prone to use. The header does do a good job of explaining each function though. Perhaps it should at the top also give steps on the ordering of how the functions are intended to be used?


As a part of application startup `snap-confine` will move the process to a new
mount namespace. Since version 1.0.41 all the applications belonging to a given
snap will share the mount namespace amongst them.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps: "As of version 1.0.41 all the applications from the same snap will share the same mount namespace. Applications from different snaps continue to use separate mount namespaces."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
// print pid's portably so this is the best we can do.
pid_t pid = fork();
debug("forked support process has pid %d", (int)pid);
if (pid == -1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pid < 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@jdstrand
Copy link
Collaborator

jdstrand commented Sep 8, 2016

Thanks for all the changes! I think this is ok to merge provided:

  • my feedback from today is addressed (ie, pid < 0, rm_rf, comment changes)
  • your proposed fstatfs() is implemented

Don't feel like you need to implement the apparmor policy changes just yet as you mentioned, but let's not cut a new version of snap-confine until the other bits are in place.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
This patch fixes a potentially iffy code that used to open the mount
namespace file /run/snapd/ns/$SNAP_NAME.mnt and then blindly call
sents() on the file descriptor, if the open call succeeded.

Instead, we now fstatfs() the file descriptor and check the resulting
f_type field. If it is NSFS_MAGIC then we know that we should use
setns(). If not we fall back to regular initialization.

All errors from setns() now result in snap-confine death. This should be
more reliable against unexpected bugs.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
debug
("cannot re-associate the mount namespace with namespace group %s, falling back to initialization",
group->name);
debug("initializing new namespace group %s", group->name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is soo much nicer, thanks! :)

It's interesting that NSFS_MAGIC is not included in man fstatfs. Note that this was added only in http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/include/uapi/linux/magic.h?id=e149ed2b805fefdccf7ccdfc19eca22fdd4514ac so this will likely need to be added to the backports list if it isn't already. It does seem present in the 14.04 Ubuntu kernel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've informed tvoss about this. With the sanity check we'll know if tests are all green. Thanks for making me aware of this :)

This patch make the cleanup function a little bit more paranoid by not
expanding shell arguments and by ensuring that what is removed is in
/tmp/.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
@zyga zyga merged commit 7f07af2 into master Sep 12, 2016
@zyga zyga deleted the ns-sharing branch September 12, 2016 11:18
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants