cmd/snap-confine: discard stale mount namespaces (v2) #4329

Open
wants to merge 7 commits into
from

Conversation

Projects
None yet
4 participants
Contributor

zyga commented Nov 30, 2017

This is the second iteration of this patch.

This patch enables snap-confine to discard stale mount namespaces. The
code already contained to logic to detect a stale namespace. The patch
introduces an additional check. Once we know of a stale namespace we
check if it would be safe to discard it by looking at the processes that
inhabit it. This can be done reliably by enumerating the freezer group.

If we find any process we consider it unsafe for the mount namespace to
be discarded but we log a diagnostic message that system administrators
can see (it can be an important security fact that a particular snap is
using an older revision of the base snap).

The code is made a little bit more generic so that we can also filter by
user identifier. This is likely to be used by the upcoming per-user
mount namespace feature.

The apparmor profile is extended slightly to be able to read the
cgroup.procs file and to unmount existing namespaces.

The code is structured specially so that we call setns only once
so that we stay within the limits of what the kernel apparmor
implementation currently supports. See patch description for details.

Signed-off-by: Zygmunt Krynicki zygmunt.krynicki@canonical.com

zyga added some commits Nov 29, 2017

cmd/snap-confine: discard stale mount namespaces
This patch enables snap-confine to discard stale mount namespaces. The
code already contained to logic to detect a stale namespace. The patch
introduces an additional check. Once we know of a stale namespace we
check if it would be safe to discard it by looking at the processes that
inhabit it. This can be done reliably by enumerating the freezer group.

If we find any process we consider it unsafe for the mount namespace to
be discarded but we log a diagnostic message that system administrators
can see (it can be an important security fact that a particular snap is
using an older revision of the base snap).

The code is made a little bit more generic so that we can also filter by
user identifier. This is likely to be used by the upcoming per-user
mount namespace feature.

The apparmor profile is extended slightly to be able to read the
cgroup.procs file and to unmount existing namespaces. That last thing is
a bit of a magic / broken rule as apparmor has some issues that we don't
fully understand whenever setns is called.

< jjohansen> uh, setns is a known problem, double setns doesn't
change that
< jjohansen> hey zyga-ubuntu, setns won't necessarily break
everything dependent on several things. Partly to due with how mounts
are shared or whether they are cloned ..

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
cmd/snap-confine: use helper process, setns at most once
This patch is a follow-up from the discussion between me, jdstrand and
jjohansen. Due to some constraints we must call setns at most once. This
means that we cannot jump in and out of a mount namespace. This patch
adjust ts the logic to have a helper process jump in, look around and
report back.

EDIT: Even with one process the bug looks the same and the same
workaround is kept as a remedy.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>

codecov-io commented Nov 30, 2017

Codecov Report

Merging #4329 into master will decrease coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4329      +/-   ##
==========================================
- Coverage   78.06%   78.05%   -0.01%     
==========================================
  Files         449      449              
  Lines       30951    30972      +21     
==========================================
+ Hits        24161    24176      +15     
- Misses       4777     4781       +4     
- Partials     2013     2015       +2
Impacted Files Coverage Δ
cmd/snap/cmd_info.go 50.95% <0%> (+0.44%) ⬆️
cmd/snap/cmd_run.go 68.32% <0%> (+1.03%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5e7630...580cc3a. Read the comment docs.

+ # Nov 30 10:26:00 autopkgtest audit[16159]: AVC apparmor="DENIED"
+ # operation="umount" profile="/snap/core/x1/usr/lib/snapd/snap-confine"
+ #
+ umount /,
@jdstrand

jdstrand Nov 30, 2017

Contributor

I haven't performed a code review for this PR yet, but did look at the approach and discussed with @tyhicks and @jrjohansen (please correct as needed). Here are some important things to note about how snapd is managing namespaces:

  • namespaces are not hiearchical and are all single level and live in parallel. snap-confine starts in the default namespace and ultimately uses setns() into the per-snap mount namespace
  • using setns() to jump in and out of namespaces (the prior approach) is not how namespaces are meant to be used. setns() is really meant to be one direction
  • manipulating mount namespaces can be a very dangerous operation if you can't trust the filesystem inside the mount namespace. Eg, there have been LXC/LXD bugs surrounding this
  • manipulating mount namespaces can be tricky if there are processes running inside the mount namespace
  • there are currently no LSM hooks for setns()
  • the process manipulating mount namespaces (ie, the setuid snap-confine) is by definition very privileged and so the security policy confining it can only be so strong (eg, all the apparmor mount rules that are currently allowed on '/' (and others), there is no setns() mediation, lack of seccomp filter, ...)

In short, the general approach that this PR (and the subsequent PRs leading up to it) has been discussed with the security team and works within the current limitations of the kernel and keeps the above in mind. Specifically:

  • strict confinement snaps running in the namespace are not allowed to use mount or access sensitive files in /proc (eg, nsfs). Importantly, devmode, classic snaps and snaps running on distros without full confinemnet have full access to everything, but they have full root anyway and must be trusted, so no privilege escalation
  • nothing is done on the namespace if processes are running within it
  • snap-confine has a restrictive AppArmor profile to reduce the attack surface as much as possible. An attacker with total control of snap-confine would be able to exercise significant privilege even under confinement, but the confinement makes attacks and gaining control more difficult
  • snap-confine is defensively coded, undergoes in depth code reviews and is coded to fail closed
  • snap-confine doesn't use setns() to jump back and forth within the same process (devmode notwithstanding, but we can look at fixing this separately-- let's not be distracted by that here). Instead it:
    1. starts in the default namespace
    2. forks() a child which does setns() to the snap-specific namespace. The child fork() without exec() is considered sufficient for this specific use case (but exec() is an important security barrier for many other use cases).
    3. the child interrogates the mount namespace, communicates if changes are needed to the parent and exits
    4. the parent checks if the namespace is in use (via the per-snap freezer cgroup where all the snap's processes are). If occupied, does nothing, otherwise discards (tears down) the namespace and reconstructs it

So long as snap-confine continues to setns() in one direction, we severely limit what confined snaps are allowed to do (ie with mount, CLONE_NEWNS, /proc, etc), we correctly detect if processes are running with the mount namespace in a race-free manner, then this is the best we can do.

Contributor

zyga commented Nov 30, 2017

Thank you for the very detailed write-up @jdstrand. I wanted to clarify two points if they happen to be important.

  • When we look at the inhabitants of the mount namespace we take advantage of the freezer cgroup we create and maintain, for each snap separately, in a similar non-hierarchical, parallel manner, we are not freezing that cgroup. If this is unreliable and we have to freeze the cgroup that is something we should do before landing this patch.
  • When we construct a fresh mount namespace we also fork for a second time, before we unshare the mount namespace. This is done to perform the bind mount from /proc/$ppid/mnt/ns to /run/snapd/ns/$SNAP_NAME.mnt. The directory /run/snapd/ns has private mount event propagation (i.e. none).

This should help other reviewers make decisions about the validity of the approach.

Contributor

jdstrand commented Nov 30, 2017

@zyga - we shouldn't have to freeze the cgroup if our check/logic is race-free since nothing will be in there to freeze. This was one of the things I wanted to look at in the code review. I also updated the previous comment to clarify the freezer cgroup point.

Contributor

zyga commented Dec 4, 2017

This PR is consistently failing on core systems. I had a look and the reason is that we are always in a situation where the root filesystem is stale and thus snap-confine will always try to "rebuild" the namespace, whenever possible.

On core systems, unless we a using a custom base snap, we are not using pivot_root and instead rely on a simplified logic for building the mount namespace - taking advantage of the fact that ther roof filesystem is already the "right one".

This doesn't work in our tests though because the root filesystem is mounted from a snap that is now gone. As a proof this is the output of losetup --all shortly after one of the affected tests has failed:

linode:ubuntu-core-16-64 .../tests/main/snap-update-ns# losetup --all
/dev/loop0: [2051]:20 (/writable/system-data/var/lib/snapd/snaps/core_x1.snap (deleted))
/dev/loop1: [2051]:21 (/writable/system-data/var/lib/snapd/snaps/pc-kernel_93.snap (deleted))
/dev/loop2: [2051]:20 (/var/lib/snapd/snaps/core_x1.snap (deleted))
/dev/loop3: [2051]:21 (/var/lib/snapd/snaps/pc-kernel_93.snap (deleted))
/dev/loop4: [2051]:33 (/var/lib/snapd/snaps/pc_22.snap (deleted))
/dev/loop5: [2051]:114388 (/var/lib/snapd/snaps/test-snapd-content-slot_x1.snap)
/dev/loop6: [2051]:114397 (/var/lib/snapd/snaps/test-snapd-content-plug_x1.snap)

As you can see /dev/loop0 which is revision x1 of the core snap has been deleted. The file is still mounted though as we can see in /proc/1/mountinfo

linode:ubuntu-core-16-64 .../tests/main/snap-update-ns# grep 7:0 /proc/1/mountinfo 
25 0 7:0 / / ro,relatime shared:1 - squashfs /dev/loop0 ro

Inside each per-snap mount namespace we see that the root filesystem is indeed the same / as on the outside.

linode:ubuntu-core-16-64 .../tests/main/snap-update-ns# nsenter -m/run/snapd/ns/test-snapd-content-plug.mnt
linode:ubuntu-core-16-64 /# grep 7:0 /proc/self/mountinfo
483 740 7:0 / /var/lib/snapd/hostfs ro,relatime master:1 - squashfs /dev/loop0 ro
618 482 7:0 / / ro,relatime master:1 - squashfs /dev/loop0 ro

However /snap/core/x1 is actually another thing, as we can see

linode:ubuntu-core-16-64 .../tests/main/snap-update-ns# grep 7:2 /proc/self/mountinfo 
448 105 7:2 / /snap/core/x1 ro,nodev,relatime shared:400 - squashfs /dev/loop2 ro
Contributor

zyga commented Dec 4, 2017

Ok, I think I found the smoking gun now:

As you can see here https://github.com/snapcore/snapd/blob/master/tests/lib/reset.sh#L95 the prepare-each for the main suite is removing all the snaps and the unpacks the special tarball we made soon after preparing the project. Now that we understand it we need to figure out what to do. I'm checking if we could just treat core systems the same as classic where we would be able to reconstruct the namespace correctly.

Contributor

jdstrand commented Dec 4, 2017

@zyga - your comment suggests that you want to change snap-confine to accommodate the testsuite, but shouldn't the testsuite be adjusted to accommodate snapd? (Note, I haven't looked at this at all, just pointing out this struck me as the wrong way around).

Contributor

zyga commented Dec 4, 2017

@jdstrand I think snap-confine is correct but could be "more correct". The problem with the test suite is that this is hard to change (immediately). I'm looking at this and still haven't made up my mind.

Note that if core was treated the same as classic we'd get one clear benefit: various hooks could run with the new snap as soon as it is available, before the system reboots. Removing this distinction would also considerably simplify the mount logic in snap-confine.

cmd/snap-confine: reference bug on apparmor and nsfs
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Contributor

zyga commented Dec 5, 2017

14:52 < zyga-ubuntu> ogra_: do you have a moment:
14:52 < ogra_> zyga-ubuntu, sure
14:53 < zyga-ubuntu> ogra_: can you grab any core device you have around you and run losetup --all
14:56 < zyga-ubuntu> ogra_: how many times is core attached?
14:56 < ogra_> zyga-ubuntu, http://paste.ubuntu.com/26118700/
14:57 < ogra_> zyga-ubuntu, on some i had run snap abort for a hanging core already (pi3 and joule)
14:57 < zyga-ubuntu> /dev/loop0: []: (/writable/system-data/var/lib/snapd/snaps/core_3587.snap)
14:57 < zyga-ubuntu> /dev/loop3: []: (/var/lib/snapd/snaps/core_3587.snap)
14:57 < zyga-ubuntu> why is this happening?
14:58 < zyga-ubuntu> ogra_: something is not detaching those
14:58 < ogra_> zyga-ubuntu, you tell me :)
14:59 < ogra_> zyga-ubuntu, they are double mounted ... thats the design of assembling the rootfs i think
14:59 < zyga-ubuntu> ogra_: they are not only double mounted, the same core is loop-attached twice to separate loop devices
14:59 < ogra_> zyga-ubuntu, i.e. we mount /writable from initrd, then mount the snap ... then do all the bind mounting back into /writable for rw stuff and then use run-init to switch the rootfs
15:00 < ogra_> the original snap mount will stay
15:00 < zyga-ubuntu> ogra_: mmmm
15:00 < ogra_> zyga-ubuntu, and then i guess systemd mounts them again due to having a .mount unit
15:00 < zyga-ubuntu> ogra_: aha
15:00 < zyga-ubuntu> ogra_: I see
15:00 < zyga-ubuntu> ogra_: thanks!
15:01  * zyga-ubuntu knows more but needs to see why his test is still failing
15:01 < ogra_> we could perhaps make these mount usingts skipped ... based on /proc/cmdline or so
15:01 < ogra_> since we dont really need them mounted twice
15:01 < ogra_> (but can not "not mount" them when assembling the rootfs)
15:02 < zyga-ubuntu> ogra_: right
15:02 < ogra_> i'm not sure if "mount --move" would work here to move the existing mount to a new place ...
15:02 < zyga-ubuntu> ogra_: ideally we'd mount them but reuse the loopback device
15:02 < ogra_> sounds risky in any case though
15:02 < ogra_> yeah or that
15:03 < ogra_> will need a *lot* of testing though ... after all you suffle stuff around unterneath your rootfs /
15:03 < ogra_> *shuffle
15:05  * zyga-ubuntu analyzes http://pastebin.ubuntu.com/26118742/
15:16 < zyga-ubuntu> ha
15:16 < zyga-ubuntu> ok
15:17 < zyga-ubuntu> niemeyer: that issue I talked about in the call, it's actually trickier
15:17 < zyga-ubuntu> niemeyer: but I now have a full(er) picture
15:17 < zyga-ubuntu> niemeyer: it relates to those loop devices I spoke with ogra above
15:17 < zyga-ubuntu> niemeyer: I think I just need to adjust logic to check that the / inside matches / outside (when snap name == "core")
15:18 < zyga-ubuntu> niemeyer: because on core / is mounted in initrd and it is a spearate mount of the same snap (but has different loopback device)
15:18 < zyga-ubuntu> niemeyer: alternatively we could not perform this detection when running on core and when the base snap is "core"
15:18 < zyga-ubuntu> niemeyer: as any change would warrant a reboot 

EDIT: http://pastebin.ubuntu.com/26118789/ has more details, including the major:minor numbers

+ }
+ // Open the hierarchy directory for the given snap.
+ int hierarchy_fd SC_CLEANUP(sc_cleanup_close) = -1;
+ hierarchy_fd = openat(cgroup_fd, buf,
@bboozzoo

bboozzoo Dec 7, 2017

Contributor

Why not just open /sys/fs/cgroup/freeezer/snap.%s/cgroup.procs instead of opening freezer, snap.%s and then cgroup.procs ?

@zyga

zyga Dec 7, 2017

Contributor

A bit of a paranoia wrt symlinks. This ensures that, from to a certain point, we are immune to symlink attacks.

cmd/snap-confine/ns-support.c
+
+ // Send this back to the parent: 2 - discard, 1 - keep.
+ // Note that we cannot just use 0 and 1 because of the semantics of eventfd(2).
+ debug
@bboozzoo

bboozzoo Dec 7, 2017

Contributor

This is some weird indentation employed by indent

@zyga

zyga Dec 7, 2017

Contributor

Yes, indent is now disabled (as an enforced check) and I will migrate the code away to something saner.

cmd/snap-confine/ns-support.c
+ // Note that we cannot just use 0 and 1 because of the semantics of eventfd(2).
+ debug
+ ("sending information about the state of the mount namespace");
+ if (eventfd_write(event_fd, should_discard ? 2 : 1) < 0) {
@bboozzoo

bboozzoo Dec 7, 2017

Contributor

I think I would prefer an enum here, so that it's easier to jump around using tags, eg:

typedef enum {
    SC_NS_INVALID = 0,
    SC_NS_KEEP,
    SC_NS_DISCARD, 
} sc_ns_discard_t
@zyga

zyga Dec 7, 2017

Contributor

I did something like that now.

zyga added some commits Dec 7, 2017

cmd/snap-confine: use enum instead of magic values
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
cmd/snap-confine: limit stale namespace detection to classic
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Contributor

zyga commented Dec 7, 2017

For the sake of landing this I constrained the change to classic systems.

zyga added some commits Jan 2, 2018

cmd/snap-confine: fix detection of stale mount ns
This patch fixes incorrect check for classic vs core that happened
inside the mount namespace, not outside of it.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>
Contributor

zyga commented Jan 3, 2018

I just pushed a patch that fixes it. I mistakenly tested "is-classic" inside the mount namespace which was never true. This will now work :-)

@zyga zyga added this to the 2.31 milestone Jan 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment