cmd/snap-confine: discard stale mount namespaces (v2) #4329
zyga
added some commits
Nov 29, 2017
codecov-io
commented
Nov 30, 2017
Codecov Report
@@ Coverage Diff @@
## master #4329 +/- ##
==========================================
- Coverage 78.06% 78.05% -0.01%
==========================================
Files 449 449
Lines 30951 30972 +21
==========================================
+ Hits 24161 24176 +15
- Misses 4777 4781 +4
- Partials 2013 2015 +2
+ # Nov 30 10:26:00 autopkgtest audit[16159]: AVC apparmor="DENIED"
+ # operation="umount" profile="/snap/core/x1/usr/lib/snapd/snap-confine"
+ #
+ umount /,
jdstrand
Nov 30, 2017
Contributor
I haven't performed a code review for this PR yet, but did look at the approach and discussed with @tyhicks and @jrjohansen (please correct as needed). Here are some important things to note about how snapd is managing namespaces:
- namespaces are not hierarchical; they are all single level and live in parallel. snap-confine starts in the default namespace and ultimately uses setns() into the per-snap mount namespace
- using setns() to jump in and out of namespaces (the prior approach) is not how namespaces are meant to be used. setns() is really meant to be one direction
- manipulating mount namespaces can be a very dangerous operation if you can't trust the filesystem inside the mount namespace. Eg, there have been LXC/LXD bugs surrounding this
- manipulating mount namespaces can be tricky if there are processes running inside the mount namespace
- there are currently no LSM hooks for setns()
- the process manipulating mount namespaces (ie, the setuid snap-confine) is by definition very privileged and so the security policy confining it can only be so strong (eg, all the apparmor mount rules that are currently allowed on '/' (and others), the absence of setns() mediation, the lack of a seccomp filter, ...)
In short, the general approach of this PR (and the PRs leading up to it) has been discussed with the security team, works within the current limitations of the kernel and keeps the above in mind. Specifically:
- strict confinement snaps running in the namespace are not allowed to use mount or access sensitive files in /proc (eg, nsfs). Importantly, devmode, classic snaps and snaps running on distros without full confinement have full access to everything, but they have full root anyway and must be trusted, so there is no privilege escalation
- nothing is done on the namespace if processes are running within it
- snap-confine has a restrictive AppArmor profile to reduce the attack surface as much as possible. An attacker with total control of snap-confine would be able to exercise significant privilege even under confinement, but the confinement makes attacks and gaining control more difficult
- snap-confine is defensively coded, undergoes in depth code reviews and is coded to fail closed
- snap-confine doesn't use setns() to jump back and forth within the same process (devmode notwithstanding, but we can look at fixing this separately; let's not be distracted by that here). Instead it:
  - starts in the default namespace
  - fork()s a child which does setns() to the snap-specific namespace. The child fork() without exec() is considered sufficient for this specific use case (but exec() is an important security barrier for many other use cases).
  - the child interrogates the mount namespace, communicates to the parent whether changes are needed, and exits
  - the parent checks if the namespace is in use (via the per-snap freezer cgroup where all the snap's processes are). If occupied, it does nothing; otherwise it discards (tears down) the namespace and reconstructs it
So long as snap-confine continues to setns() in one direction, we severely limit what confined snaps are allowed to do (ie with mount, CLONE_NEWNS, /proc, etc), and we correctly detect if processes are running within the mount namespace in a race-free manner, then this is the best we can do.
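The fork-then-setns() flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `inspect_namespace_in_child`, the `VERDICT_*` values, the `stale_check_fn` callback and the two stand-in checks are hypothetical names. Passing a negative `ns_fd` skips setns() so the sketch can be exercised without privileges.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical verdicts; KEEP and DISCARD are non-zero because an eventfd
 * cannot usefully transport the value 0. */
enum { VERDICT_ERROR = -1, VERDICT_KEEP = 1, VERDICT_DISCARD = 2 };

/* Hypothetical callback: true if the current mount namespace looks stale. */
typedef bool (*stale_check_fn)(void);

/* Stand-in checks used only to exercise the sketch. */
static bool report_stale(void) { return true; }
static bool report_fresh(void) { return false; }

/*
 * Fork a child that joins the namespace behind ns_fd (skipped when ns_fd < 0),
 * inspects it and reports a verdict through an eventfd. The parent never
 * calls setns(), so it stays in the namespace it started in.
 */
static int inspect_namespace_in_child(int ns_fd, stale_check_fn is_stale)
{
    assert(is_stale != NULL);
    int event_fd = eventfd(0, EFD_CLOEXEC);
    if (event_fd < 0)
        return VERDICT_ERROR;

    pid_t child = fork();
    if (child < 0) {
        close(event_fd);
        return VERDICT_ERROR;
    }
    if (child == 0) {
        /* Child: one-way trip into the per-snap mount namespace. */
        if (ns_fd >= 0 && setns(ns_fd, CLONE_NEWNS) < 0)
            _exit(EXIT_FAILURE);
        uint64_t verdict = is_stale() ? VERDICT_DISCARD : VERDICT_KEEP;
        if (eventfd_write(event_fd, verdict) < 0)
            _exit(EXIT_FAILURE);
        _exit(EXIT_SUCCESS);
    }

    /* Parent: reap the child first, then read the verdict only if the child
     * exited cleanly, so the read cannot block on an empty counter. */
    int status = 0;
    int result = VERDICT_ERROR;
    if (waitpid(child, &status, 0) == child
        && WIFEXITED(status) && WEXITSTATUS(status) == EXIT_SUCCESS) {
        eventfd_t value = 0;
        if (eventfd_read(event_fd, &value) == 0)
            result = (int)value;
    }
    close(event_fd);
    return result;
}
```

In this sketch a caller would act on `VERDICT_DISCARD` only after separately confirming that the per-snap freezer cgroup is empty, as described in the last bullet above.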
Thank you for the very detailed write-up @jdstrand. I wanted to clarify two points if they happen to be important.
This should help other reviewers make decisions about the validity of the approach.
@zyga - we shouldn't have to freeze the cgroup if our check/logic is race-free since nothing will be in there to freeze. This was one of the things I wanted to look at in the code review. I also updated the previous comment to clarify the freezer cgroup point.
This PR is consistently failing on core systems. I had a look and the reason is that we are always in a situation where the root filesystem is stale and thus snap-confine will always try to "rebuild" the namespace, whenever possible. On core systems, unless we are using a custom base snap, we are not using pivot_root and instead rely on a simplified logic for building the mount namespace, taking advantage of the fact that the root filesystem is already the "right one". This doesn't work in our tests though because the root filesystem is mounted from a snap that is now gone. As proof, this is the output of
As you can see
Inside each per-snap mount namespace we see that the root filesystem is indeed the same / as on the outside.
However
Ok, I think I found the smoking gun now: as you can see here https://github.com/snapcore/snapd/blob/master/tests/lib/reset.sh#L95 the prepare-each for the main suite removes all the snaps and then unpacks the special tarball we made soon after preparing the project. Now that we understand it we need to figure out what to do. I'm checking if we could just treat core systems the same as classic where we would be able to reconstruct the namespace correctly.
@zyga - your comment suggests that you want to change snap-confine to accommodate the testsuite, but shouldn't the testsuite be adjusted to accommodate snapd? (Note, I haven't looked at this at all, just pointing out this struck me as the wrong way around).
@jdstrand I think snap-confine is correct but could be "more correct". The problem with the test suite is that this is hard to change (immediately). I'm looking at this and still haven't made up my mind. Note that if core was treated the same as classic we'd get one clear benefit: various hooks could run with the new snap as soon as it is available, before the system reboots. Removing this distinction would also considerably simplify the mount logic in snap-confine.
EDIT: http://pastebin.ubuntu.com/26118789/ has more details, including the major:minor numbers |
+ }
+ // Open the hierarchy directory for the given snap.
+ int hierarchy_fd SC_CLEANUP(sc_cleanup_close) = -1;
+ hierarchy_fd = openat(cgroup_fd, buf,
bboozzoo
Dec 7, 2017
Contributor
Why not just open /sys/fs/cgroup/freezer/snap.%s/cgroup.procs instead of opening freezer, snap.%s and then cgroup.procs?
zyga
Dec 7, 2017
Contributor
A bit of paranoia wrt symlinks. This ensures that, up to a certain point, we are immune to symlink attacks.
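A minimal sketch of the component-by-component approach discussed here, assuming each path element is opened with openat() and O_NOFOLLOW so that a symlink planted at any step makes the call fail instead of being followed. The helper name `open_without_symlinks` and its parameters are hypothetical and are not taken from snap-confine. Note that the base directory itself is opened with ordinary path resolution, which is one reason the protection only holds from a certain point onward.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open base_dir, descend through each entry of dirs[] one component at a
 * time, then open file_name in the final directory. Every step below
 * base_dir uses O_NOFOLLOW, so it fails if that component is a symlink.
 * Each entry is expected to be a single name without '/' separators.
 * Returns a read-only file descriptor, or -1 on any failure.
 */
static int open_without_symlinks(const char *base_dir,
                                 const char *const dirs[], size_t dir_count,
                                 const char *file_name)
{
    assert(base_dir != NULL && file_name != NULL);

    int dir_fd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    for (size_t i = 0; i < dir_count && dir_fd >= 0; i++) {
        int next_fd = openat(dir_fd, dirs[i],
                             O_RDONLY | O_DIRECTORY | O_NOFOLLOW | O_CLOEXEC);
        close(dir_fd);
        dir_fd = next_fd;
    }
    if (dir_fd < 0)
        return -1;

    int file_fd = openat(dir_fd, file_name,
                         O_RDONLY | O_NOFOLLOW | O_CLOEXEC);
    close(dir_fd);
    return file_fd;
}
```

With a base of `/sys/fs/cgroup`, directories `freezer` and `snap.<name>`, and the file `cgroup.procs`, this would mirror the three-step open asked about above, whereas a single open() of the full path would follow a symlink in any intermediate component.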
+
+ // Send this back to the parent: 2 - discard, 1 - keep.
+ // Note that we cannot just use 0 and 1 because of the semantics of eventfd(2).
+ debug
zyga
Dec 7, 2017
Contributor
Yes, indent is now disabled (as an enforced check) and I will migrate the code away to something saner.
+ // Note that we cannot just use 0 and 1 because of the semantics of eventfd(2).
+ debug
+ ("sending information about the state of the mount namespace");
+ if (eventfd_write(event_fd, should_discard ? 2 : 1) < 0) {
bboozzoo
Dec 7, 2017
Contributor
I think I would prefer an enum here, so that it's easier to jump around using tags, eg:
typedef enum {
    SC_NS_INVALID = 0,
    SC_NS_KEEP,
    SC_NS_DISCARD,
} sc_ns_discard_t;
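To illustrate why 0 cannot serve as one of the two signals: an eventfd holds a counter, a write adds to it, and a read only succeeds once the counter is non-zero. Writing 0 therefore leaves a blocking reader waiting forever. The sketch below reuses the enum suggested above; the helper `eventfd_value_is_readable` is a hypothetical name, and it uses EFD_NONBLOCK so the empty-counter case shows up as a failed read rather than a hang.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* The enum suggested in the review comment, reused for illustration. */
typedef enum {
    SC_NS_INVALID = 0,
    SC_NS_KEEP,
    SC_NS_DISCARD,
} sc_ns_discard_t;

/*
 * Write `value` to a fresh non-blocking eventfd and report whether a reader
 * can get it back. On success the value read is stored in *out.
 */
static bool eventfd_value_is_readable(sc_ns_discard_t value, eventfd_t *out)
{
    assert(out != NULL);
    int fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (fd < 0)
        return false;

    bool readable = false;
    /* Writing 0 succeeds but adds nothing to the counter, so the read that
     * follows fails (it would block on a blocking descriptor). */
    if (eventfd_write(fd, (eventfd_t)value) == 0
        && eventfd_read(fd, out) == 0)
        readable = true;

    close(fd);
    return readable;
}
```

Reserving the zero value as `SC_NS_INVALID` keeps both meaningful verdicts non-zero, which matches the 1/2 encoding used in the diff.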
zyga
added some commits
Dec 7, 2017
For the sake of landing this I constrained the change to classic systems.
zyga
added some commits
Jan 2, 2018
I just pushed a patch that fixes it. I mistakenly tested "is-classic" inside the mount namespace which was never true. This will now work :-)
zyga commented Nov 30, 2017
This is the second iteration of this patch.
This patch enables snap-confine to discard stale mount namespaces. The
code already contained the logic to detect a stale namespace. The patch
introduces an additional check. Once we know of a stale namespace we
check if it would be safe to discard it by looking at the processes that
inhabit it. This can be done reliably by enumerating the freezer group.
If we find any process we consider it unsafe for the mount namespace to
be discarded but we log a diagnostic message that system administrators
can see (it can be an important security fact that a particular snap is
using an older revision of the base snap).
The code is made a little bit more generic so that we can also filter by
user identifier. This is likely to be used by the upcoming per-user
mount namespace feature.
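As a rough illustration of the occupancy check and the optional user filter described above, the sketch below counts the PIDs listed in an already opened cgroup.procs stream. It is written under assumptions and is not the code from this patch: the function name `count_cgroup_procs` is hypothetical, and treating the owner of `/proc/<pid>` as the process owner is a simplification chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the PIDs in a cgroup.procs stream (one decimal PID per line).
 * When filter_by_uid is set, only processes whose /proc/<pid> directory is
 * owned by `uid` are counted (an assumption made for this sketch).
 * Returns the count, or -1 on malformed input or a read error.
 */
static int count_cgroup_procs(FILE *procs, bool filter_by_uid, uid_t uid)
{
    assert(procs != NULL);
    int count = 0;
    int pid = 0;
    int rc;

    while ((rc = fscanf(procs, "%d", &pid)) == 1) {
        if (filter_by_uid) {
            char path[64];
            struct stat st;
            snprintf(path, sizeof path, "/proc/%d", pid);
            /* The process may have exited since it was listed; skip it. */
            if (stat(path, &st) < 0 || st.st_uid != uid)
                continue;
        }
        count++;
    }
    /* fscanf() returns EOF at a clean end of input; anything else means
     * the stream contained something that is not a PID. */
    return (rc == EOF && !ferror(procs)) ? count : -1;
}
```

Under this sketch a non-zero count would mean the namespace is still inhabited and should be left alone, with a diagnostic logged as the description says.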
The apparmor profile is extended slightly to be able to read the
cgroup.procs file and to unmount existing namespaces.
The code is structured specially so that we call setns only once
so that we stay within the limits of what the kernel apparmor
implementation currently supports. See patch description for details.
Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@canonical.com>