interfaces: simplify snap-confine by just loading pre-generated bpf code #3431

Merged
merged 95 commits into from Jun 22, 2017

Conversation

Projects
None yet
4 participants
Collaborator

mvo5 commented Jun 2, 2017

We currently rely on snap-confine to generate the seccomp filter based on a textual filter. This approach has some drawbacks:

  • the parser is written in a suid root C program
  • tricky to add new symbols and new syntax, i.e. we always need to ensure that the snap-confine that runs can parse the input
  • the parser is written in C

This branch changes this so that all the parsing is done in go. This will generate a binary bpf program. This program is loaded via snap-confine into the kernel.

Note that the tests will not work just yet, the testing is done by compiling the bpf program and running it in a bpf VM with the inputs that the kernel would feed into the bpf program. However there is a bug/misfeature in x/net/bpf that hardcodes the byte order to big-endian. Which will fail on x86. I reported this (with a patch) upstream at golang/go#20556

@mvo5 mvo5 requested review from zyga and jdstrand Jun 2, 2017

mvo5 added some commits Jun 2, 2017

I just started looking at this and I wanted to quickly mention before any other review comments that I noticed that the tests in cmd/snap-confine/tests/* have not been ported. These are run as part of 'make check' and many verify syntax parsing and some make sure the sandbox's profile is loaded and in effect. I suppose the sandbox tests could be moved out to spread but all the parsing tests should be reimplemented in main_test.go.

I've performed a first pass at the snap-confine C changes. I'll take a look at snap-seccomp next.

cmd/snap-confine/seccomp-support.c
@@ -20,6 +20,7 @@
#include <asm/ioctls.h>
#include <ctype.h>
#include <errno.h>
+#include <fcntl.h>
#include <linux/can.h> // needed for search mappings
#include <linux/netlink.h>
#include <sched.h>
@jdstrand

jdstrand Jun 12, 2017

Contributor

You can (and should) remove a lot of includes that were only used for search mappings.

@mvo5

mvo5 Jun 13, 2017

Collaborator

Thanks, this is done now.

cmd/snap-confine/seccomp-support.c
+ die("seteuid failed");
+ if (geteuid() != 0)
+ die("raising privs before seccomp_load did not work");
+ }
@jdstrand

jdstrand Jun 12, 2017

Contributor

Instead of raising here, can you raise just before the prctl call instead? I don't see a reason to perform the parsing and loading into memory as root.

@mvo5

mvo5 Jun 13, 2017

Collaborator

Thanks, indeed! This makes a lot of sense.

cmd/snap-confine/seccomp-support.c
- filter_profile);
+ char profile_path[512]; // arbitrary path name limit
+ sc_must_snprintf(profile_path, sizeof(profile_path), "%s/%s.bpf",
+ filter_profile_dir, filter_profile);
@jdstrand

jdstrand Jun 12, 2017

Contributor

In the current non-bpf caching implementation, filter_profile_dir can be set to SNAPPY_LAUNCHER_SECCOMP_PROFILE_DIR. Was this functionality intentionally dropped from this PR?

@mvo5

mvo5 Jun 13, 2017

Collaborator

I will have a look at this, we may not need it anymore with real spread tests and with the bpf VM unit tests. But if its easier with it I will bring it back.

cmd/snap-confine/seccomp-support.c
- // initialize hsearch map
- sc_map_init();
+ // load bpf
+ char bpf[32 * 1024];
@jdstrand

jdstrand Jun 12, 2017

Contributor

Where did you get these magic values? I looked around and I found:

Looking at that, it seems that the size you picked is rather arbitrary. Is this true? If so, please comment this is an arbitrary upper bound.

@mvo5

mvo5 Jun 13, 2017

Collaborator

Sorry, my bad. This is indeed an arbitrary limit in my code. I think we want some limit but the question of course is how big it should be. Maybe something like 640kb? or multiple megabyte? Suggestions welcome.

@jdstrand

jdstrand Jun 14, 2017

Contributor

I mentioned elsewhere, I think 16kb is fine.

cmd/snap-confine/seccomp-support.c
- if (ctx == NULL) {
- errno = ENOMEM;
- die("seccomp_init() failed");
+ ssize_t num_read = read(fd, bpf, sizeof bpf);
@jdstrand

jdstrand Jun 12, 2017

Contributor

Note that read() might be interrupted for a number of reasons which could potentially result in a partial bpf cache file read that could be continued later. It's probably fine to just error out here like you are and let systemd/the user retry, but thought I'd mention it. See 'man 2 read' for details.

@mvo5

mvo5 Jun 13, 2017

Collaborator

Indeed, I reworked this code a little bit and error for now, I think I need to get back to this though, I think we can't land it without a proper read() implementation that deals with interruptions :/

cmd/snap-confine/seccomp-support.c
+ if (num_read < 0) {
+ die("cannot read bpf %s", profile_path);
+ } else if (num_read == sizeof bpf) {
+ die("cannot fit bpf %s into buffer", profile_path);
@jdstrand

jdstrand Jun 12, 2017

Contributor

This isn't quite right-- a file of exactly size '32 * 1024' fits into bpf. Since you are reading binary data, you don't need the '\0' at the end. Your code for the size check is fine and safe, it is just a bit confusing. I also don't know if '32 * 1024' is a magic value. Perhaps you want '(32 * 1024) + 1'?

@mvo5

mvo5 Jun 13, 2017

Collaborator

Indeed, thank you! I changed the code slightly now to check the limit directly via fstat instead via this (indirect and confusing) loading.

cmd/snap-confine/seccomp-support.c
@@ -775,126 +107,12 @@ scmp_filter_ctx sc_prepare_seccomp_context(const char *filter_profile)
// - capability sys_admin in AppArmor
// Note that with NO_NEW_PRIVS disabled, CAP_SYS_ADMIN is required to
// change the seccomp sandbox.
@jdstrand

jdstrand Jun 12, 2017

Contributor

Please adjust this comment to now be:

// Load filter into the kernel. Importantly we are intentionally *not* setting
// NO_NEW_PRIVS because it interferes with exec transitions in AppArmor with
// certain snappy interfaces. Not setting NO_NEW_PRIVS does mean that
// applications can adjust their sandbox if they have CAP_SYS_ADMIN or, if
// running on < 4.8 kernels, break out of the seccomp via ptrace. Both
// CAP_SYS_ADMIN and 'ptrace (trace)' are blocked by AppArmor with typical
// snappy interfaces.
@mvo5

mvo5 Jun 13, 2017

Collaborator

done

+ .len = num_read / sizeof(struct sock_filter),
+ .filter = (struct sock_filter *)bpf,
+ };
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
@jdstrand

jdstrand Jun 12, 2017

Contributor

I'm not terribly excited that we are just dumping into the kernel whatever is on disk since that could expose kernel implementation bugs but some light fuzzing does show that the kernel is verifying it is a valid bpf, but snap-confine seems like it would happily load an arbitrary non-seccomp bpf. snap-confine should be modified to check that the '/' all the way down to the bpf cache file are root:root owned and all writable only by the root user, with no writes for group or other. In this manner, only root-owned processes (eg, snapd) can write the bpf cache files, thus mitigating issues with loading arbitrary content into the kernel. This would guard against situations where the permissions of these files somehow become unsafe.

The most correct implementation would also have snap-confine perform (as non-root) validation on the input to make sure it is structured correctly.

@mvo5

mvo5 Jun 19, 2017

Collaborator

As discussed I added opcode based validation and removed it again because the consensus is that this kind of validation is too brittle. The check for /var/lib/snapd/seccomp/bpf all the way down to "/" is implemented (and uncovered a bug in the current snap-confine code).

cmd/snap-confine/seccomp-support.c
+ .filter = (struct sock_filter *)bpf,
+ };
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ die("prctl(SECCOMP)");
@jdstrand

jdstrand Jun 12, 2017

Contributor

We had better error reporting before this PR. Ie, we had:

rc = seccomp_load(ctx);
if (rc != 0) {
        fprintf(stderr, "seccomp_load failed with %i\n", rc);
        die("aborting");
}

Can you do something similar, eg:

if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) failed");
        die("aborting");
}
cmd/snap-seccomp/main.go
+ cmd := os.Args[1]
+ switch cmd {
+ case "compile":
+ content, err := ioutil.ReadFile(os.Args[2])
@jdstrand

jdstrand Jun 12, 2017

Contributor

Your use of := here declares err local to this statement such that err is always nil outside of this statement which means that compile(content, os.Args[3]) errors end up being ignored. Eg:

$ mkdir /tmp/dir
$ snap-seccomp compile /var/lib/snapd/seccomp/profiles/snap.hello-world.sh /tmp/dir && echo yes
yes
$

That should have errored out. Please correct this to use = instead of := up above and add a test for os.Args[3] pointing at an existing directory and that it returns an error.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thank you, very good catch. This is fixed now.

mvo5 added some commits Jun 13, 2017

Per the discussion today in the call, it seems pretty strange that we're getting out of parsing seccomp syntax after suffering with incompatibilities, and then we implement a parser for BPF code? Why would we do that? The tooling that builds those files is trusted, and if there's any good place to do validation of this content, that place is the kernel. We shouldn't be in the business of doing that I think.

Commenting only on the read() part in the interest of time.

cmd/snap-confine/seccomp-support.c
+ die("cannot stat %s", profile_path);
+ if (stat_buf.st_size > MAX_BPF_SIZE)
+ die("profile %s is too big %lu", profile_path,
+ stat_buf.st_size);
@jdstrand

jdstrand Jun 13, 2017

Contributor

Note, stat() is a little weird in that while you pass it an fd, the size is checked at the time of the call and does not return the size from when the file was open()ed. Therefore, this is technically racy with the open but that's ok with how you are using it (ie, as a quick check for convenient error reporting). You are correctly giving 'read()' the maximum size which is MAX_BPF_SIZE so there is no need to check num_read is too big (though note that between the stat and the read the file could've been made bigger and therefore the read truncated).

In thinking about this and the issues with read() I suggest:

FILE *fp;
unsigned char bpf[MAX_BPF_SIZE + 1]; // acount for EOF

fp = fopen(profile_path, "rb");
if (fp == NULL) {
	die("cannot read %s", profile_path);

// set 'size' to 1 to get bytes transferred
size_t num_read = fread(bpf, 1, sizeof(bpf), fp);

if (ferror(fp) != 0)
	die("cannot fread() %s", profile_name);
else if (feof(fp) == 0)
	die("profile %s exceeds %zu bytes", profile_path, sizeof(bpf));

fclose(fp);
debug("read %zu bytes from %s", num_read, profile_name);
@mvo5

mvo5 Jun 19, 2017

Collaborator

Thank you, the code now uses your suggestion.

mvo5 added some commits Jun 13, 2017

Contributor

jdstrand commented Jun 13, 2017

@niemeyer - regarding parsing the content of the bpf, I agree we shouldn't introduce brittle bpf parsing checks.

My comment regarding 'most correct' was coming from the POV that snap-confine is setuid and is therefore able to load bpf programs into the kernel, so we need to understand that coding errors could result in non-root being able to load arbitrary data into the kernel via the bpf interface. The most correct implementation from a security perspective is to sanitize that data, and if there are non-brittle checks, that would be fine, but looking at the opcode checks, I worry they might break or be different between different kernels.

For snappy, we can mitigate that concern and avoid the brittle checks if we make sure snap-confine only reads from protected directories that snapd manages (ie, you have to be root to modify the bpf programs, so there is no privilege escalation in exploiting a bug in snap-confine for loading bpf programs).

cmd/snap-confine/seccomp-support.c
-{
- int rc;
+ // validate profile-path and all parents is all root owned
+ // (TOCTTOU sadly).
@jdstrand

jdstrand Jun 13, 2017

Contributor

As implemented, this is a TOCTOU because you are starting at the filename and going to '/'. If you go the other way (I recommended going 'down'), ie from '/' down to the filename, there is no exploitable TOCTOU because only root would be able to change the paths after they've been checked and if you are root, you can just change the bpf anyway. I suggest something along the lines of:

static void validate_path_has_strict_perms(const char *path)
{
	struct stat stat_buf;
	if (stat(path, &stat_buf) < 0)
		die("cannot stat %s", path);

	if (stat_buf.st_uid != 0 || stat_buf.st_gid != 0)
		die("%s not root-owned", path);

	if (stat_buf.st_mode & S_IWOTH)
		die("%s has 'other' write", path);
}


static void validate_bpfpath_is_safe(const char *path)
{
	// strtok_r() modifies its first argument, so work on a copy
	char *tokenized = strdup(path);

	// allocate a string large enough to hold path, and initialize it to
	// '/'
	size_t checked_path_size = sizeof (char) * strlen(path) + 1;
	char *checked_path = malloc(checked_path_size);
	if (checked_path == NULL)
		die("Out of memory creating checked_path");

	checked_path[0] = '/';
	checked_path[1] = '\0';

	// validate '/'
	validate_path_has_strict_perms(checked_path);

	// strtok_r needs a pointer to keep track of where it is in the
	// string.
	char *buf_saveptr;

	// reconstruct the path from '/' down to profile_name
	char *buf_token = strtok_r(tokenized, "/", &buf_saveptr);
	while (buf_token != NULL) {
		char *prev = strdup(checked_path); // needed by vsnprintf in sc_must_snprintf
		// append '<buf_token>' if checked_path is '/', otherwise '/<buf_token>'
		if (strlen(checked_path) == 1)
			sc_must_snprintf(checked_path, checked_path_size, "%s%s", prev, buf_token);
		else
			sc_must_snprintf(checked_path, checked_path_size, "%s%s", prev, buf_token);

		free(prev);
		validate_path_has_strict_perms(checked_path);

		buf_token = strtok_r(NULL, "/", &buf_saveptr);
	}

	free(tokenized);
	free(checked_path);
}

int sc_apply_seccomp_bpf(const char *filter_profile)
{
	...
	// validate '/' down to profile_path are root-owned and not 'other' writable
	// to avoid possibility of privilege escalation via bpf program load when
	// paths are incorrectly set on the system.
	validate_bpfpath_is_safe(profile_name);
	...
@jdstrand

jdstrand Jun 13, 2017

Contributor

Note, that validate_bpfpath_is_safe() assumes an absolute path, which is guaranteed in sc_apply_seccomp_bpf() but a quick check for:

if (path == NULL || strlen(path) == 0 || path[0] != '/')
    die(...)

would be good in the name of defensive coding and future proofing.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks, both suggestions are in the code now.

tests/main/snap-confine/task.yaml
+ echo "snap-seccomp should have failed to compile this"
+ exit 1
+ fi
+ done
@jdstrand

jdstrand Jun 13, 2017

Contributor

This is a pure syntax check and therefore should be covered by the unit tests. All pure syntax checks should be ported to cmd/snap-seccomp/main_test.go while ones testing if confinement is in place should be in spread tests. To make this easier, here are the syntax tests to be ported to main_export.go:

  • test_bad_seccomp_filter_*
  • test_restrictions_working_args_clone
  • test_restrictions_working_args_mknod
  • test_restrictions_working_args_prctl
  • test_restrictions_working_args_prio
  • test_restrictions_working_args_quotactl
  • test_restrictions_working_args_socket
  • test_restrictions_working_args_termios

and here are the ones to be ported to spread:

  • test_complain
  • test_complain_missed
  • test_unrestricted
  • test_unrestricted_missed
  • test_noprofile
  • test_whitelist (this may already be sufficiently covered by other spread tests)
  • test_restrictions (this may already be sufficiently covered by other spread tests)
  • test_restrictions_working (this may already be sufficiently covered by other spread tests)
  • test_restrictions_working_args

All that said, having one spread test with busted syntax is valuable to make sure things are functionally sane.

Contributor

niemeyer commented Jun 13, 2017

The most correct implementation from a security perspective is to sanitize that data, and if there are non-brittle checks, that would be fine, but looking at the opcode checks, I worry they might break or be different between different kernels.

I still don't understand the perspective, Jamie. Who produced the data in the first place? If we don't trust the data we're loading into the kernel, we have much larger issues.

For snappy, we can mitigate that concern and avoid the brittle checks if we make sure snap-confine only reads from protected directories that snapd manages (ie, you have to be root to modify the bpf programs, so there is no privilege escalation in exploiting a bug in snap-confine for loading bpf programs).

Yes, exactly. If snapd or its data directories can be arbitrarily modified, we're in deep trouble already.

Contributor

jdstrand commented Jun 13, 2017

@niemeyer - the POV is me with my security hat on and it's just about security in depth and hardening. If through user, admin or packaging error a directory or file perms are wrong somewhere, then it is good to put roadblocks in place to block privilege escalation exploiting that.

With my snappy hat on, to be super clear, I am not advocating expensive or brittle checks and would prefer that the opcode checks be removed cause they look brittle. If there are bpf checks that are non-brittle and cheap, perhaps we could consider them, but I don't think they're necessary for this PR if we're happy with the file perms.

Contributor

niemeyer commented Jun 13, 2017

@jdstrand Thanks, I think we're aligned then. As a suggestion, if there are good checks for BPF codes, they ideally would be in the kernel so random data isn't being loaded in the first place.

+ "AF_IB": C.AF_IB, // 27
+ "PF_IB": C.PF_IB,
+ "AF_MPLS": C.AF_MPLS, // 28
+ "PF_MPLS": C.PF_MPLS,
@jdstrand

jdstrand Jun 13, 2017

Contributor

This doesn't seem to represent what the C implementation is doing. In the C implementation, we had:

#ifndef AF_IB
#define AF_IB 27
#define PF_IB AF_IB
#endif				// AF_IB
	sc_map_add(AF_IB);
	sc_map_add(PF_IB);

but the go implementation is relying on these being available in C, which iirc they weren't available on 14.04.

@jdstrand

jdstrand Jun 14, 2017

Contributor

Oh, I missed that this and the ones I commented on later are handled by the preamble before import "C". I'll leave this comment for posterity, but delete the others for clarity.

@mvo5

mvo5 Jun 19, 2017

Collaborator

The code now follows the original C code closer and builds fine on 14.04 now.

+ "NETLINK_RDMA": C.NETLINK_RDMA,
+ "NETLINK_CRYPTO": C.NETLINK_CRYPTO,
+ "NETLINK_INET_DIAG": C.NETLINK_INET_DIAG, // synonymous with NETLINK_SOCK_DIAG
+}
@jdstrand

jdstrand Jun 13, 2017

Contributor

Comments regrading ifndefs notwithstanding, the seccompResolver map looks good. When the tests are ported, they will verify all of these are correct.

UPDATE: ignore the "Comments regrading ifndefs notwithstanding" part of the above.

cmd/snap-seccomp/main.go
+ return value, nil
+ }
+
+ return strconv.ParseUint(token, 10, 64)
@jdstrand

jdstrand Jun 13, 2017

Contributor

Can you add a comment here:

// Negative numbers are not supported yet, but when they are, adjust this
// accordingly
+ }
+
+ // regular line
+ tokens := strings.Fields(line)
@jdstrand

jdstrand Jun 13, 2017

Contributor

In the C implementation we are checking for SC_ARGS_MAXLENGTH elsewhere, but it would be more efficient to check this here:

if len(tokens[1:] > ScArgsMaxlength) {
	return fmt.Errorf("Too many arguments specified for syscall '%s'", tokens[0])
}
cmd/snap-seccomp/main.go
+ // fish out syscall
+ secSyscall, err := seccomp.GetSyscallFromName(tokens[0])
+ if err != nil {
+ // FIXME: add structed error to libseccomp-golang
@jdstrand

jdstrand Jun 13, 2017

Contributor

Can you adjust this to be:

// FIXME: add structed error to libseccomp-golang. For now, ignore
// unknown syscalls
cmd/snap-seccomp/main.go
+ if err.Error() == "could not resolve name to syscall" {
+ return nil
+ }
+ return fmt.Errorf("cannot resolve name: %s", err)
@jdstrand

jdstrand Jun 13, 2017

Contributor

This isn't exactly the same as the C implementation as I read it. In the C implementation if seccomp_syscall_resolve_name() gives back __NR_SCMP_ERROR then we ignore it. I don't know what seccomp.GetSyscallFromName can return as error, but depending on what it does we might have errors returns in snap-seccomp when we didn't in the C implementation.

It also seems unsafe to check for string equality on err.Error(). How about:

if err != nil {
	// FIXME: add structed error to libseccomp-golang. For now, ignore
	// unknown syscalls
	return nil
}
@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks, this is implemented now as suggested.

cmd/snap-seccomp/main.go
+ var conds []seccomp.ScmpCondition
+ if err != nil {
+ return fmt.Errorf("cannot create new filter: %s", err)
+ }
@jdstrand

jdstrand Jun 13, 2017

Contributor

I'm not clear on this err check-- we already checked it above....

@mvo5

mvo5 Jun 14, 2017

Collaborator

Indeed, sorry, that is a leftover, I removed it.

cmd/snap-seccomp/main.go
+
+ if pos >= ScArgsMaxlength {
+ return fmt.Errorf("too many tokens (%d) in line %q", pos, line)
+ }
@jdstrand

jdstrand Jun 13, 2017

Contributor

We can check for this more efficiently up above, so if you implement that, feel free to remove this.

cmd/snap-seccomp/main.go
+ return fmt.Errorf("too many tokens (%d) in line %q", pos, line)
+ }
+
+ if arg == "-" {
@jdstrand

jdstrand Jun 13, 2017

Contributor

Can you add this for readability:

if arg == "-" {  // skip arg
cmd/snap-seccomp/main.go
+ conds = append(conds, scmpCond)
+ }
+
+ // FIXME: why do we need this fallback?
@jdstrand

jdstrand Jun 13, 2017

Contributor

seccomp_rule_add_exact_array is more precise than seccomp_rule_add_array (see man seccomp_rule_add). I suggest adding a comment:

// Default to adding a precise match if possible. Otherwise let seccomp figure
// out the architecture specifics.
cmd/snap-seccomp/main.go
+// the kernel arch to support the kernel's arch (eg, 64bit kernels with
+// 32bit userspace).
+func addSecondaryArches(secFilter *seccomp.ScmpFilter) error {
+ switch arch.UbuntuArchitecture() {
@jdstrand

jdstrand Jun 13, 2017

Contributor

Why is this named UbuntuArchitecture?

@mvo5

mvo5 Jun 14, 2017

Collaborator

This is a idiosyncrasy of our "arch" code. It would probably better be called "DpkgArchitecture()" and it maps from the go-architecture back to the dpkg architecture.

cmd/snap-seccomp/main.go
+ case "ppc64el":
+ return secFilter.AddArch(seccomp.ArchPPC)
+ }
+ return nil
@jdstrand

jdstrand Jun 13, 2017

Contributor

This code is different than the C implementation and doesn't appear to account for the difference between kernel arch and userspace arch. What you have is only valid if the kernel and the userspace has the same arch (eg, kernel and userspace are both amd64). If they are different, you need to add the kernel arch (eg, userspace is i386 (presumably what arch.UbuntuArchitecture() returns) but kernel is amd64), but this code doesn't seem to do that. Please see sc_add_seccomp_archs() in the C implementation.

@mvo5

mvo5 Jun 14, 2017

Collaborator

I'm curious to learn more about this. If the userspace is i386 but the kernel is amd64, why do we need add the amd64 syscalls in the seccomp then? Is that for the case that someone does apt install snapd:i386 on an amd64 system? But even with that we would only be able to install i386 snaps. I'm sure I'm missing something and I'm curious to learn what.

@jdstrand

jdstrand Jun 14, 2017

Contributor

64 bit kernels support filtering both their native syscalls and compat syscalls. Consider the following seccomp denial:

kernel: [1288522.491300] audit: type=1326 audit(1497441564.893:222342): auid=1000 uid=0 gid=0 ses=3 pid=30492 comm="curl" exe="/usr/bin/curl" sig=31 arch=c000003e syscall=41 compat=0 ip=0x7f2f1a8ec387 code=0x0

The compat=0 indicates that the syscall is using the kernel's native syscall. If you run an 32 bit x86 binary on amd64, this would be compat=1. To support syscall filtering for compat architectures, we need to load the compat architecture for that kernel. libseccomp will by default only load the architecture that matches its architecture (eg, if libseccomp is 32 bit, it loads that, if 64, it loads that).

Therefore to fully support compat syscalls, we need to handle:

  1. kernel and userspace are both 32 bit (no compat)
  2. kernel and userspace are both 64 bit (need to load 32 bit arch for 32 bit compat)
  3. kernel is 64 bit and userspace is 32 bit (need to load 64 bit arch for native syscalls)

32 bit kernel and 64 bit userspace is of course not supported. The C implementation handles all 3 of the above cases, this PR currently only handles the first two. I'm told the 3rd case is interesting in embedded where it is not uncommon to run a 64 bit kernel but otherwise use a 32 bit userspace (it isn't a typical general Linux distro combination, but snapd is going everywhere, so we need to support it).

@jdstrand

jdstrand Jun 14, 2017

Contributor

To give a practical example, most of the time snaps specify the architecture for the system they are destined for and the normal case is kernel, userspace and snap all match (ie, i386 or amd64) and there are no compat syscalls. Indeed, this was all snappy supported at first.

Some 64 bit systems support running 32 bit code on 64 bit systems, eg, a wine snap might ship in its amd64 snap both 64 bit and 32 bit binaries so that users can run old 32 bit Windows software, and the 32 bit binaries will use compat syscalls. This is case '2'.

A brand may decide to create a model with a 64 bit kernel and a 32 bit userspace. While most snaps would simply use the 32bit architecture (eg, i386), a specialized (perhaps brand) snap on this system might contain some code that is 64 bit to take advantage of the kernel capabilities and this code would need to use the syscalls for the kernel architecture. This is case '3'.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks for this feedback. I implemented support for (3) now as well.

cmd/snap-seccomp/main.go
+ // FIXME: right now complain mode is the equivalent to unrestricted.
+ // We'll want to change this once we seccomp logging is in order.
+
+ if bytes.Contains(content, []byte("@unrestricted")) || bytes.Contains(content, []byte("@complain")) {
@jdstrand

jdstrand Jun 13, 2017

Contributor

This will catch these even if they are in comments, which I don't think we want.

@mvo5

mvo5 Jun 19, 2017

Collaborator

This is fixed now, only complete non-comment lines are considered now.

interfaces/seccomp/backend.go
+
+ for baseName := range content {
+ in := filepath.Join(dirs.SnapSeccompDir, baseName)
+ out := filepath.Join(dirs.SnapSeccompDir, baseName[:len(baseName)-3]+".bpf")
@jdstrand

jdstrand Jun 13, 2017

Contributor

I thought that snapd was particular about what was in dirs.SnapSeccompDir and it would remove things it didn't know about whenever profile regeneration happened. Has this changed?

@mvo5

mvo5 Jun 16, 2017

Collaborator

Thanks for bringing this up, this is still the case and a really important point. In order to provide reverts that work we will need to change the dirs.SnapSeccompDir in this PR. Otherwise the old profiles will get removed and we are back to the race where snapd is not quick enough to restore the profiles on system-boot. I'm open for suggestions, instead of '/var/lib/snapd/profileswe could use/var/lib/snapd/seccomp/bpfor just/var/lib/snapd/seccompor/var/lib/snapd/security/seccomp` ?

@jdstrand

jdstrand Jun 16, 2017

Contributor

I don't have a strong opinion except to say that I think it should be consistent with what we have. That suggests to me one of the following (not in order of preference):

  • /var/cache/snapd/seccomp (similar to /var/cache/apparmor for snappy apparmor cache files)
  • /var/lib/snapd/seccomp/cache (ie, put it alongside /var/lib/snapd/seccomp/profiles; if you prefer /var/lib/snapd/seccomp/bpf instead, that's fine too)
@mvo5

mvo5 Jun 19, 2017

Collaborator

I picked /var/lib/snapd/seccomp/bpf for now, happy to change if someone else has a strong preference.

interfaces/seccomp/backend_test.go
+ // and got compiled
+ c.Check(s.snapSeccomp.Calls(), DeepEquals, [][]string{
+ {"snap-seccomp", "compile", profile + ".in", profile + ".bpf"},
+ })
@jdstrand

jdstrand Jun 13, 2017

Contributor

The use of ".in" here doesn't seem to match up with backend.go which doesn't reference ".in" that I can see. What's happening here?

@mvo5

mvo5 Jun 19, 2017

Collaborator

In backend.go the code is using: path := fmt.Sprintf("%s.in", securityTag) - now that we use the directoy I can change things back to not use .in if you prefer that btw.

Contributor

jdstrand commented Jun 13, 2017

@niemeyer - re bpf checks in kernel: yeah, I mentioned elsewhere that the kernel appears to be doing some of that already, but I have a TODO to talk to cking about fuzzing the interface and seeing if improvements could be made there.

cmd/snap-confine/seccomp-support.c
- if (node->e == NULL)
- die("Out of memory creating ENTRY");
+// MAX_BPF_SIZE is an arbitrary limit.
+const int MAX_BPF_SIZE = 640 * 1024;
@jdstrand

jdstrand Jun 13, 2017

Contributor

I gave this some thought and I created filters with all the syscalls we have with a bunch of arg filtering calls and was able to create a profile that was about 8.5k. We can always increase this, but I think 16k minimum is reasonable. I suggest:

// Put an upper bound on how big the seccomp BPF can be
#define MAX_BPF_SIZE 16 * 1024
@jdstrand

jdstrand Jun 13, 2017

Contributor

I'd rather keep in on the smallishly reasonable side cause if things all of a sudden get big, that would indicate something we'd want to look into. Note, I only tested this on amd64.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Ok, sounds good. How about 32k just to be on the (very) safe side?

mvo5 added some commits Jun 14, 2017

disable check for stat_buf.st_mode & S_IWOTH in validate_path_has_str…
…ict_perms() for now, the bind mount magic gives us a 1777 dir apparently there :(
cmd/snap-confine/seccomp-support.c
+
+ // paranoid, but only helps a bit
+ if (stat_buf.st_uid != 0 || stat_buf.st_gid != 0)
+ die("cannot use %s: must be root:root owned", profile_path);
@jdstrand

jdstrand Jun 14, 2017

Contributor

This is already handled by validate_bpfpath_is_safe(profile_path), above.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks, indeed. I removed it now.

mvo5 added some commits Jun 19, 2017

reshuffle and create TestRestrictionsWorkingArgsClone,TestRestriction…
…sWorkingArgsMknod,TestRestrictionsWorkingArgsPrio,TestRestrictionsWorkingArgsTermios
+ "aarch64": "arm64",
+ "ppc64le": "ppc64el",
+ "s390x": "s390x",
+ "ppc": "powerpc",
@jdstrand

jdstrand Jun 19, 2017

Contributor

Note that the C code also has ppc64 that in seccomp maps to SCMP_ARCH_PPC64. I realize that ppc64 is not a thing in Ubuntu (so ubuntuArchFromKernelArch understandably doesn't have it), but I wonder about other distros like Fedora, CentOS, Suse, etc. Is ppc64 supported anywhere in cross-distro? ppc64 is a thing in Debian: https://wiki.debian.org/PPC64

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks, nice catch. I added it for completeness (with a comment).

cmd/snap-seccomp/main.go
+ // case snapd would also only request i386 snaps. So
+ // the use-case is even stranger, a i386 snap would
+ // have to ship some 64bit code that would have to
+ // detect at runtime if it can be used or not.
@jdstrand

jdstrand Jun 19, 2017

Contributor

This comment may not be totally accurate. At one point in time we had the concept of fat packages so I don't think it is all that strange for a snap to have code for multiple architectures. For example, an arch all snap that has binaries for everything or a snap that has mostly 32 bit intended for i386 systems with some dashes of 64 bit code that is conditionally used if running on a 64 bit kernel. I'm told this is not rare with certain classes of embedded devices.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks, I realize the comment is a bit snarky, I changed it a bit to reflect that apparently embedded devices use this. I still find it strange but maybe I just need to get used to it :)

cmd/snap-seccomp/main.go
+ compatArch = "armhf"
+ case "ppc64el":
+ compatArch = "powerpc"
+ }
@jdstrand

jdstrand Jun 19, 2017

Contributor

powerpc is only valid with ppc64 because they are both big endian, not ppc64el which is little endian, but see my comments above where you omit ppc64 as a valid arch.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Nice catch, thank you! I fixed this now as well.

+ {"setpriority 0-10", `cannot parse line: cannot parse token "0-10" .*`},
+ {"setpriority 0,1", `cannot parse line: cannot parse token "0,1" .*`},
+ {"setpriority 0x0", `cannot parse line: cannot parse token "0x0" .*`},
+ {"setpriority a1", `cannot parse line: cannot parse token "a1" .*`},
@jdstrand

jdstrand Jun 19, 2017

Contributor

You skipped 1a. It was there since our syntax parser looks at the first character and the first character of 1a is 1 which is valid, but 1a should fail to parse.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Sorry for that, I readded the test for 1a.

cmd/snap-seccomp/main_test.go
+ {"setpriority 0x0", `cannot parse line: cannot parse token "0x0" .*`},
+ {"setpriority a1", `cannot parse line: cannot parse token "a1" .*`},
+ {"setpriority 1-", `cannot parse line: cannot parse token "1-" .*`},
+ {"setpriority 1\\ 2", `cannot parse line: cannot parse token "1\\\\" .*`},
@jdstrand

jdstrand Jun 19, 2017

Contributor

In the shell tests, this was meant to test an embedded newline (this tests the backslash, which I'm fine with you keeping, but a test with an embedded newline would be good). Ie, 1\n2.

UPDATE: the shell scripts were testing for 1\ 2 in the profile and 1\n2 in the profile.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Hm, I think setpriorty 1\n2 is currently legal, it would get parsed as two rules, the first setpriority with first argument equals 1. The second syscall number 2 with any arguments. Am I missing something?

@jdstrand

jdstrand Jun 19, 2017

Contributor

setpriority 1\n2 would parse as two different lines:

  • setpriority 1 (legal)
  • 2 (invalid syscall, but skipped)

I used the wrong term. I said 'embedded newline' when I actually meant the character sequence '\n' is in the policy (sorry for not saying that right).

I if this is literally in /var/lib/snapd/seccomp/profiles/...:

setpriority 1\ 2
setpriority 1\n2

then both should fail to parse. The 1\\ 2 and 1\\n2 in the shell test script was just to make sure we put the right thing in the test profile. Sorry for the confusion.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Thanks for helping me here. I got it and updated the test.

tests/main/snap-seccomp/task.yaml
+ $SNAP_SECCOMP compile ${PROFILE}.in ${PROFILE}.bpf
+ echo "Ensure the code cannot not run due to impossible filtering"
+ if test-snapd-tools.echo hello; then
+ echo "filtering broken: program should have failed to run: program should have failed to run"
@jdstrand

jdstrand Jun 19, 2017

Contributor

You have an extra ': program should have failed to run' here.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Ups, sorry for that. Thanks you!

+ exit 1
+ fi
+ echo $output | MATCH "profile .* exceeds .* bytes"
+
@jdstrand

jdstrand Jun 19, 2017

Contributor

Can you add additional tests for:

  • profile is missing, you run snap-seccomp and therefore snap-seccomp fails
  • profile is empty, you run snap-seccomp and therefore the snap fails to start (default closed/restrictive policy)
@mvo5

mvo5 Jun 19, 2017

Collaborator

I added those two tests now.

cmd/snap-confine/snap-confine.rst
@@ -129,7 +129,7 @@ FILES
Description of the mount profile.
-`/var/lib/snapd/seccomp/profiles/*`:
+`/var/lib/snapd/seccomp/bpf/*`:
@jdstrand

jdstrand Jun 19, 2017

Contributor

We need to list both /var/lib/snapd/seccomp/profiles/* and /var/lib/snapd/seccomp/bpf/* as well as the snap-seccomp tool. /var/lib/snapd/seccomp/profiles/* is all users will every look at and /var/lib/snapd/seccomp/bpf/* is all snap-confine will now ever look at. A quick note that the files in /var/lib/snapd/seccomp/bpf/* are generated by snapd and that modifications to /var/lib/snapd/seccomp/profiles/* need to then be compiled to /var/lib/snapd/seccomp/bpf/*.

@mvo5

mvo5 Jun 19, 2017

Collaborator

I can add this, strictly speaking snap-confine does not care for the .in file though. What do you think about:

`/var/lib/snapd/seccomp/bpf/*.bpf`:

	Compiled seccomp profile program.

`/var/lib/snapd/seccomp/bpf/*.in`:

	Input for the /usr/lib/snapd/snap-seccomp profile compiler.

Note that we currently have both the .bpf and the .in in the same directory. The reason is mostly that we need a different place than /var/lib/snapd/profiles. This directory must be kept untouched or will will loose the ability to rollback to older versions. The code in EnsureDirState will clean all snap.$snap-name files. So in order to be able to rollback to version prior to this change we need to be able to revert to this previous place. If there is a strong desire to keep the .in and the .bpf in separate dirs, we could use something like /var/lib/snapd/seccomp/{input,bpf} ?

@jdstrand

jdstrand Jun 19, 2017

Contributor

I do want something about snap-seccomp because while snap-confine may not care about the input files, people developing and auditing policy definitely do and they need to know that they have to compile policy for snap-confine to pick it up.

As for the separate directory, this is unfortunate and I was hoping the bpf would obviate the need for a separate directory for input files; and it does, but I see now only for rollbacks to versions of snapd with bpf.

As discussed on IRC, let's rename things again (sorry) so they are comfortable for users familiar with the existing system and somewhat discoverable. Ie:

snap-seccomp /var/lib/snapd/seccomp/profiles.bpf/snap.foo.bar /var/lib/snapd/seccomp/profiles.bpf/snap.foo.bar.bpf

In other words, rename the dir to /var/lib/snapd/seccomp/profiles.bpf/, don't use the .in extension for input files, but do use .bpf extension for output.

mvo5 added some commits Jun 19, 2017

cmd/snap-seccomp/main.go
+ return secFilter.AddArch(seccomp.ArchARM)
+ case "powerpc":
+ return secFilter.AddArch(seccomp.ArchPPC)
+ }
@jdstrand

jdstrand Jun 19, 2017

Contributor

In the C implementation we set initialized compat to 0 and then set it to SCMP_ARCH_... in the if/else up above and then check for if compat > 0. It would be easier to read if in the if/else here you did something similar and use var compatArch ScmpArch and set it to seccomp.ArchWHATEVER in the if/else. I think this would make it easier to read but it also means that this section of code is easier to maintain (just update the if/else as opposed to both the if/else and the switch). Indeed, you forgot to add ppc64 to the switch even though it can be set in the if/else.

@mvo5

mvo5 Jun 19, 2017

Collaborator

Sure, I updated the code now, I hope its doing what you suggested now.

cmd/snap-confine/snap-confine.rst
+
+`/var/lib/snapd/seccomp/bpf/*.in`:
+
+ Input for the /usr/lib/snapd/snap-seccomp profile compiler.
@jdstrand

jdstrand Jun 19, 2017

Contributor

You asked this elsewhere, but I'll answer here since it is clearer what is happening. Today we have text profiles in /var/lib/snapd/seccomp/profiles. This PR creates compiled cache files in /var/lib/snapd/seccomp/bpf/*.bpf. Why are we using .in files at all and not just using what is in /var/lib/snapd/seccomp/profiles as the input to snap-seccomp with the output in /var/lib/snapd/seccomp/bpf/*.bpf? Ie, snap-seccomp /var/lib/snapd/seccomp/profiles/snap.foo.bar /var/lib/snapd/seccomp/bpf/snap.foo.bar.bpf (and why is the .bpf extension even needed at all if in another directory?)?

@mvo5

mvo5 Jun 19, 2017

Collaborator

In a better world we would do exactly what you describe. Just use /var/lib/snapd/seccomp/profiles/snap.$security_tag` as the input and compile to the bpf in a cache dir. However we also need to deal with https://forum.snapcraft.io/t/snapd-2-25-blocked-because-of-revert-race-condition/ - so we cannot write an incompatible new snap.$security_tag because in the revert case we will fail in the same way as outlined in the forum post. The only way around this is a new directory to ensure we keep a compatible seccomp profile for old snapd around.

cmd/snap-seccomp/main.go
+ case "powerpc":
+ compatArch = seccomp.ArchPPC
+
+ }
@jdstrand

jdstrand Jun 19, 2017

Contributor

Thanks for updating this. Not saying this is needed for this PR, but refactoring this bit into another function would make this clearer. Eg, the whole switch here goes away with:

compatArch = something.GetSeccompArch(arch.UbuntuKernelArchitecture())
@jdstrand

jdstrand Jun 19, 2017

Contributor

Ah yes, similar to what you do in goArchToScmpArch() in main_test.go. Perhaps since it is listed in two different places that is reason enough to factor this out now. Note that main_test.go is missing ppc64.

@mvo5

mvo5 Jun 20, 2017

Collaborator

Indeed, I added code for this now and also added tests for the compat architecture handling (unfortunately the tests are a bit tricky because of idiosyncrasies of libseccomp, see seccomp/libseccomp#86. This is also further complicated by the fact that libseccomp-golang has two architecture ints: one is a iota that is only used in libseccomp-golang. The other is the native C/kernel architecture (AUDIT_ARCH_*). There is no public API to map those. We now have 4 different arch mapping tables in the code :(

mvo5 added some commits Jun 20, 2017

cmd/snap-seccomp/main_test.go
+ case "s390x":
+ return main.ScmpArchS390X
+ case "ppc":
+ return main.ScmpArchPPC
@jdstrand

jdstrand Jun 20, 2017

Contributor

You're missing ppc64.

cmd/snap-seccomp/main_test.go
+ }
+}
+
+func simulateBpf(c *C, seccompWhitelist, bpfInput string, expected int) {
@jdstrand

jdstrand Jun 20, 2017

Contributor

Perhaps add a comment here to help explain things. Eg:

// simulateBpf:
//  1. runs main.Compile() which will catch syntax errors and output to a file
//  2. takes the output file from main.Compile and loads it via decodeBpfFromFIle
//  3. parses the decoded bpf
//  4. runs the parsed bpf through a bpf VM
//
// In this manner, in addition to verifying policy syntax we are able to
// unit test the resulting bpf in several rudimentary ways (rudimentary
// because this parser is (intentionally) not a complete seccomp parser.
//
// Full testing of applied policy is done elsewhere via spread tests.
@jdstrand

jdstrand Jun 20, 2017

Contributor

In writing the above comment it occurred to me that it would be nice if we could load this policy into the kernel via prctl(). This should be possible with a small loader program that you exec(). Eg, here is a small C program to do that:

/*
 * gcc -Wall ./load-bpf-with-nnp.c -o load-bpf-with-nnp
 * /tmp/snap-seccomp compile ./test-nonet.filter ./test-nonet.filter.bpf
 *
 * Assuing test-nonet.filter doesn't allow 'socket', this should load the
 * bpf into the kernel and run ping under it, which should fail:
 * ./load-bpf-with-nnp ./test-nonet.filter.bpf /bin/ping -c 1 www.ubuntu.com
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdint.h>
#include <string.h>
#include <stdint.h>
#include <inttypes.h>
#include <fcntl.h>
#include <sys/prctl.h>

#include <linux/filter.h>
#include <linux/seccomp.h>

#define MAX_BPF_SIZE 32 * 1024

int sc_apply_seccomp_bpf(const char *profile_path)
{
	unsigned char bpf[MAX_BPF_SIZE + 1]; // account for EOF
	FILE *fp;
	fp = fopen(profile_path, "rb");
	if (fp == NULL) {
		fprintf(stderr, "cannot read %s\n", profile_path);
		exit(1);
	}

	// set 'size' to '1; to get bytes transferred
	size_t num_read = fread(bpf, 1, sizeof(bpf), fp);

	if (ferror(fp) != 0) {
		perror("fread()");
		exit(1);
	} else if (feof(fp) == 0) {
		fprintf(stderr, "file too big\n");
		exit(1);
	}
	fclose(fp);

	struct sock_fprog prog = {
		.len = num_read / sizeof(struct sock_filter),
		.filter = (struct sock_filter *)bpf,
	};

	// Set NNP to allow loading seccomp policy into the kernel without
	// root
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		perror("prctl(PR_NO_NEW_PRIVS, 1, 0, 0, 0)");
		exit(1);
	}

	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
		perror("prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) failed");
		exit(1);
	}
	return 0;
}

int main(int argc, char *argv[])
{
	int rc = 0;
	if (argc < 2) {
		fprintf(stderr, "Usage: %s <bpf file> [prog ...]\n", argv[0]);
		return 1;
	}

	rc = sc_apply_seccomp_bpf(argv[1]);
	if (rc || argc == 2)
		return rc;

	execv(argv[2], (char *const *)&argv[2]);
	perror("execv failed");
	return 1;
}

Now, you don't have to use C, the important parts are:

	// Set NNP to allow loading seccomp policy into the kernel without
	// root
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		perror("prctl(PR_NO_NEW_PRIVS, 1, 0, 0, 0)");
		exit(1);
	}

	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
		perror("prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) failed");
		exit(1);
	}

so, call prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) then you can load the policy into the kernel with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) without being root. If that doesn't exit with error then that is an even better test than the bpf VM.

The shell scripts for the C code would take this a small step further and typically simply run something basic like /bin/true with the default policy and any tested rules to see that after the policy was loaded, it was usable to some degree (ie, enough to run /bin/true; spread tests continue to be the best way to test specific rule functionality of course).

@mvo5

mvo5 Jun 20, 2017

Collaborator

Thanks a lot! I spend a bit of time tying to do this natively in go. Loading is fine, however simulating a seccomp killed app is tricky with go because of its runtime. I end up with hanging code and zombies when trying to do this with cgo from inside go. It seems to do these kinds of tests we will need a C helper (as the one you outlined above).

@mvo5

mvo5 Jun 20, 2017

Collaborator

I played with this a bit and got to http://paste.ubuntu.com/24909258/ - it does feel a bit ugly and takes a long time (45sec on my relatively fast workstation) because of intensive tests like snapSeccompSuite.TestRestrictionsWorkingArgsSocket - I can explore further tomorrow if we want to keep this direction.

@mvo5

mvo5 Jun 20, 2017

Collaborator

This is a fully working version: http://paste.ubuntu.com/24909351/ using the C helper/in-kernel-execute approach.

@mvo5

mvo5 Jun 20, 2017

Collaborator

The helper is only compiled once, but I compile custom test-binaries:

content := fmt.Sprintf(`
#define _GNU_SOURCE
#include<unistd.h>
#include<sys/syscall.h>
int main(int argc, char **argv) {
syscall(SYS_%v, %v, %v, %v, %v, %v, %v);
}
`, syscallname, syscallArgs[0], syscallArgs[1], syscallArgs[2], syscallArgs[3], syscallArgs[4], syscallArgs[5])

like this to test the filtering with both Allow and Kill.

@jdstrand

jdstrand Jun 20, 2017

Contributor

That's pretty cool actually and a lot more than what the old shell tests were doing and way more comprehensive. To me, the 45 seconds would be worth it. :)

That said, I think that test binary could be made more generic so that it only would need to compile it once. You shouldn't have to use syscall(SYS_%v, ...); you can just feed ints into syscall() for arg1 and all the other args (man syscall). So compile once, then the tests call your test-binary with different arguments. This should get you the speed you want and give us very comprehensive testing.

@mvo5

mvo5 Jun 20, 2017

Collaborator

I got this down to ~30s on my system and also to a much narrower set of syscalls: http://paste.ubuntu.com/24911184/ - it still feels strange to have gcc in there but this is really as close as we can get to when it comes to testing.

@jdstrand

jdstrand Jun 20, 2017

Contributor

Personally, I think 30 seconds is acceptable, but note my last comment-- I think we can make it faster.

What is neat about your approach is 'yes' it is very close and we're doing it all in unit tests rather than relying on this only in the integration spread tests. Yay! :)

@mvo5

mvo5 Jun 21, 2017

Collaborator

@jdstrand I pushed #3502 with the suggested single syscall-runner binary. I did the original per-test compile mostly because it allowed me to have a super narrow set of syscalls by avoiding to link against most of libc. Once I get argument parsing I need to accept the libc startupfiles which means more syscalls (like brk, arch_prctl, access, sysinfo). But given the win in test runtime I think its better to have the compile only once (even if it means some more allowed syscalls by default).

mvo5 added some commits Jun 20, 2017

cmd/snap-seccomp/main.go
+ panic(fmt.Sprintf("cannot map ubuntu arch %q to a seccomp arch", ubuntuArch))
+}
+
+// ScmpArchToSeccompAnativeArch takes a seccomp.ScmpArch and converts
@jdstrand

jdstrand Jun 20, 2017

Contributor

s/ScmpArchToSeccompAnativeArch/ScmpArchToSeccompNativeArch/

I went through the PR top to bottom and believe all the most important parts of my comments have been addressed (excepting the pending loading the bpf into the kernel patches). What remains is all comment changes, nit-picky stuff, a small apparmor profile change, some observations and a few easily added extra tests.

@@ -62,11 +63,58 @@ func ubuntuArchFromGoArch(goarch string) string {
"ppc64le": "ppc64el",
"s390x": "s390x",
"ppc": "powerpc",
+ // available in debian and other distros
+ "ppc64": "ppc64",
@jdstrand

jdstrand Jun 20, 2017

Contributor

I just noticed that arch_test.go is missing s390x and ppc64.

@mvo5

mvo5 Jun 21, 2017

Collaborator

Thanks, added.

+ "s390x": "s390x",
+ "ppc": "powerpc",
+ // available in debian and other distros
+ "ppc64": "ppc64",
@jdstrand

jdstrand Jun 20, 2017

Contributor

I think you need to go fmt for the spacing here.

@mvo5

mvo5 Jun 21, 2017

Collaborator

The comment is the reason why the spacing here is different, go fmt was applied :)

@jdstrand

jdstrand Jun 21, 2017

Contributor

Ah :)

cmd/snap-confine/quirks.c
@@ -116,6 +116,18 @@ static void sc_quirk_create_writable_mimic(const char *mimic_dir,
debug("creating writable mimic directory %s based on %s", mimic_dir,
ref_dir);
sc_quirk_setup_tmpfs(mimic_dir);
+
+ struct stat stat_buf;
@jdstrand

jdstrand Jun 20, 2017

Contributor

Not strictly required cause the code is obvious, but adding a comment to the effect of this might be nice:

// Now copy the ownership and permissions of the mimicked directory
@@ -89,11 +89,14 @@
# change_profile unsafe /** -> **,
# reading seccomp filters
- /{tmp/snap.rootfs_*/,}var/lib/snapd/seccomp/profiles/* r,
+ /{tmp/snap.rootfs_*/,}var/lib/snapd/seccomp/profiles.bpf/* r,
@jdstrand

jdstrand Jun 20, 2017

Contributor

Let's add the .bpf extension here so if for some reason snap-confine ends up trying to read the wrong thing, we have a denial to indicate that. Eg, instead of what you have, use:

/{tmp/snap.rootfs_*/,}var/lib/snapd/seccomp/profiles.bpf/*.bpf r,
+// sc_maybe_fixup_permissions fixes incorrect permissions
+// inside the mount namespace for /var/lib. Before 1ccce4
+// this directory was created with permissions 1777.
+void sc_maybe_fixup_permissions()
@jdstrand

jdstrand Jun 20, 2017

Contributor

I'm 'ok' with this function here since it is a fixup function, but I suspect there might be a better location. @zyga, do you have an opinion?

cmd/snap-confine/snap-confine.rst
@@ -47,7 +47,7 @@ extensive dbus mediation. Refer to apparmor documentation for more details.
Seccomp profiles
----------------
-`snap-confine` looks for the `/var/lib/snapd/seccomp/profiles/$SECURITY_TAG`
+`snap-confine` looks for the `/var/lib/snapd/seccomp/profiles.bpf/$SECURITY_TAG`
@jdstrand

jdstrand Jun 20, 2017

Contributor

This should be:

`snap-confine` looks for the `/var/lib/snapd/seccomp/profiles.bpf/$SECURITY_TAG.bpf`
cmd/snap-confine/snap-confine.rst
- Description of the seccomp profile.
+ Input for the /usr/lib/snapd/snap-seccomp profile compiler.
+
+ Compiled seccomp profile program have the extension .bpf.
@jdstrand

jdstrand Jun 20, 2017

Contributor

Compiled... should start with a tab, no?

cmd/snap-seccomp/main.go
+ case "powerpc":
+ return seccomp.ArchPPC
+ }
+ panic(fmt.Sprintf("cannot map ubuntu arch %q to a seccomp arch", ubuntuArch))
@jdstrand

jdstrand Jun 20, 2017

Contributor

Nit-pick: having these alphabetized would be nice.

cmd/snap-seccomp/main.go
+ return C.SCMP_ARCH_S390X
+ case seccomp.ArchPPC:
+ return C.SCMP_ARCH_PPC
+ }
@jdstrand

jdstrand Jun 20, 2017

Contributor

Nit-pick: having these alphabetized too would be nice.

cmd/snap-seccomp/main.go
+ // perspective), a i386 snap would have to ship some
+ // 64bit code that would have to detect at runtime if
+ // it can be used or not. But we are told this is not
+ // rare with certain classes of embedded devices.
@jdstrand

jdstrand Jun 20, 2017

Contributor

Nit-pick: this still isn't worded as clearly as it could be. I suggest:

		// less common case: kernel and userspace have different archs
		// so add a compat architecture that matches the kernel. E.g.
		// an amd64 kernel with i386 userland needs the amd64 secondary
		// arch added to support specialized snaps that might
		// conditionally call 64bit code when the kernel supports it.
		// Note that in this case snapd requests i386 (or arch 'all')
		// snaps. While unusual from a traditional Linux distribution
		// perspective, certain classes of embedded devices are known
		// to use this configuration.
+
+ if compatArch != seccomp.ArchInvalid {
+ return secFilter.AddArch(compatArch)
+ }
@jdstrand

jdstrand Jun 20, 2017

Contributor

I wonder if it would be useful to add a warning that we are not adding a compat arch when seccomp.ArchInvalid.

@mvo5

mvo5 Jun 21, 2017

Collaborator

This would always trigger a warning on e.g. i386. The "0" value of seccomp.ScmpArch is "0", so this is really the equivalent of the C code if (compatArch > 0)

cmd/snap-seccomp/main.go
+ for _, line := range bytes.Split(content, []byte("\n")) {
+ if bytes.Equal(bytes.TrimSpace(line), []byte(needle)) {
+ return true
+
@jdstrand

jdstrand Jun 20, 2017

Contributor

You have an extra newline here.

cmd/snap-seccomp/main.go
+
+ if err := addSecondaryArches(secFilter); err != nil {
+ return err
+ }
@jdstrand

jdstrand Jun 20, 2017

Contributor

A newline here would be nice.

cmd/snap-seccomp/main.go
+ }
+ defer fout.Close()
+
+ //secFilter.ExportPFC(os.Stdout)
@jdstrand

jdstrand Jun 20, 2017

Contributor

You can remove this now.

@mvo5

mvo5 Jun 21, 2017

Collaborator

It is useful for debugging sometimes, I put a:

if osutil.GetenvBool("SNAP_SECCOMP_DEBUG") {
    secFilter.ExportPFC(os.Stdout)
}

around it.

cmd/snap-seccomp/main.go
+ return secFilter.ExportBPF(fout)
+}
+
+func showVersion() error {
@jdstrand

jdstrand Jun 20, 2017

Contributor

I wonder if this would be better named 'showSeccompLibraryVersion'.

cmd/snap-seccomp/main.go
+ break
+ }
+ err = compile(content, os.Args[3])
+ case "version":
@jdstrand

jdstrand Jun 20, 2017

Contributor

and this library-version

cmd/snap-seccomp/main_test.go
+
+// simulateBpf:
+// 1. runs main.Compile() which will catch syntax errors and output to a file
+// 2. takes the output file from main.Compile and loads it via decodeBpfFromFIle
@jdstrand

jdstrand Jun 20, 2017

Contributor

s/decodeBpfFromFIle/decodeBpfFromFile/

cmd/snap-seccomp/main_test.go
+// simulateBpf:
+// 1. runs main.Compile() which will catch syntax errors and output to a file
+// 2. takes the output file from main.Compile and loads it via decodeBpfFromFIle
+// 3. parses the decoded bpf
@jdstrand

jdstrand Jun 20, 2017

Contributor

Let's be a little more specific:

//  3. parses the decoded bpf using the seccomp library and various snapd functions
cmd/snap-seccomp/main_test.go
+// 4. runs the parsed bpf through a bpf VM
+//
+// In this manner, in addition to verifying policy syntax we are able to
+// unit test the resulting bpf in several ways approximating the kernels
@jdstrand

jdstrand Jun 20, 2017

Contributor

s/kernel/kernel's/

cmd/snap-seccomp/main_test.go
+// In this manner, in addition to verifying policy syntax we are able to
+// unit test the resulting bpf in several ways approximating the kernels
+// behaviour (approximating because this parser is not the kernel seccomp
+// parser.
@jdstrand

jdstrand Jun 20, 2017

Contributor

s/parser/parser)/

cmd/snap-seccomp/main_test.go
+// TestCompile will test the input from our textual seccomp whitelist
+// against a kernel syscall input that may contain arguments. The test
+// is performed by running the compiled bpf program on a virtual bpf
+// machine. Each test needs to declare what output from the VM it expects.
@jdstrand

jdstrand Jun 20, 2017

Contributor

Let's make this easier for people adding tests and rephrase to:

// TestCompile iterates over a range of textual seccomp whitelist rules and
// mocked kernel syscall input. For each rule, the test consists of compiling
// the rule into a bpf program and then running that program on a virtual bpf
// machine and comparing the bpf machine output to the specified expected
// output and seccomp operation. Eg:
//    {"<rule>", "<mocked kernel input>", <seccomp result>}
//
// Eg to test that the rule 'read >=2' is allowed with 'read(2)' and 'read(3)'
// and denied with 'read(1)' and 'read(0)', add the following tests:
//    {"read >=2", "read;native;2", main.SeccompRetAllow},
//    {"read >=2", "read;native;3", main.SeccompRetAllow},
//    {"read >=2", "read;native;1", main.SeccompRetKill},
//    {"read >=2", "read;native;0", main.SeccompRetKill},
+ // test_bad_seccomp_filter_args_termios
+ {"ioctl - TIOCST", `cannot parse line: cannot parse token "TIOCST" .*`},
+ {"ioctl - TIOCSTII", `cannot parse line: cannot parse token "TIOCSTII" .*`},
+ {"ioctl - TIOCST1", `cannot parse line: cannot parse token "TIOCST1" .*`},
@jdstrand

jdstrand Jun 20, 2017

Contributor

I thought of a few more useful tests for making sure a number is specified after the operator:

{"setpriority >", `strconv.ParseUint: parsing ".*": invalid syntax`},
{"setpriority >=", `strconv.ParseUint: parsing ".*": invalid syntax`},
{"setpriority <", `strconv.ParseUint: parsing ".*": invalid syntax`},
{"setpriority <=", `strconv.ParseUint: parsing ".*": invalid syntax`},
{"setpriority |", `strconv.ParseUint: parsing ".*": invalid syntax`},
{"setpriority !", `strconv.ParseUint: parsing ".*": invalid syntax`},

mvo5 added some commits Jun 21, 2017

Contributor

jdstrand commented Jun 21, 2017

Thanks for addressing all the feedback @mvo5! :)

@mvo5 mvo5 added the Critical label Jun 21, 2017

Assuming you're happy about these comments, LGTM.

The ".bpf" directory prefix is the only point I'd like to talk about first in case you disagree with the points made.

arch/arch.go
+ log.Panicf("cannot get kernel architecture: %v", err)
+ }
+
+ kernelArch := make([]byte, 0, len(utsname.Machine))
@niemeyer

niemeyer Jun 21, 2017

Contributor

Isn't the logic below basically

kernelArch := strings.SplitN(utsname.Machine, "\x00", 2)[0]

that's pedantic, but besides being simpler it means no need to convert and allocate back and forth into byte arrays. It's just a string being sliced.

@mvo5

mvo5 Jun 22, 2017

Collaborator

utsname.Machine is defined as [65]int8 so doing the above is slightly more involved. The following will work:

	p := (*[len(utsname.Machine)]byte)(unsafe.Pointer(&utsname.Machine))
	kernelArch := bytes.SplitN(p[:], []byte("\x00"), 2)[0]

If you prefer that, I'm happy to use that instead.

cmd/snap-seccomp/main.go
+ "strings"
+ "syscall"
+
+ // FIXME: we want github.com/mvo5/libseccomp-golang but that
@niemeyer

niemeyer Jun 21, 2017

Contributor

Did you mean seccomp instead of mvo5 here?

+
+// used to mock in tests
+var (
+ archUbuntuArchitecture = arch.UbuntuArchitecture
@niemeyer

niemeyer Jun 21, 2017

Contributor

All of these "Ubuntu" names should at some point probably be replaced by something that reflects better the cross-distribution nature of this code.

@mvo5

mvo5 Jun 22, 2017

Collaborator

Indeed, I added a FIXME for this. It seems like SnapdArchitecture or simply Architecture is reasonable. It is slightly sad that (by now) we have three different ways of describing the architecture. There is the "dpkg" architecture (and snaps use, e.g. amd64). There is the kernel architecture name (e.g. X86_64) and the go architecture (e.g. amd64). And there is the seccomp architecture which is two integers. One in the go code (iota) and one the kernel architecture uint32 (AUDIT_ARCH_X86_64). But I digress :)

cmd/snap-seccomp/main.go
+ // add a compat architecture for some architectures that
+ // support it, e.g. on amd64 kernel and userland, we add
+ // compat i386 syscalls.
+ if archUbuntuArchitecture() == archUbuntuKernelArchitecture() {
@niemeyer

niemeyer Jun 21, 2017

Contributor

Both of these should go into variables so we're not rebuilding them repeatedly below.

cmd/snap-seccomp/main.go
+
+ // FIXME: right now complain mode is the equivalent to unrestricted.
+ // We'll want to change this once we seccomp logging is in order.
+ if hasLine(content, "@unrestricted") || hasLine(content, "@complain") {
@niemeyer

niemeyer Jun 21, 2017

Contributor

Is the position of these lines really irrelevant?

@jdstrand

jdstrand Jun 21, 2017

Contributor

Yes. Since the seccomp profile is just an unordered list of syscalls, @unrestricted and @complain are treated similarly and may appear anywhere. By convention in snapd we put them at the top of the file, but anywhere in the file is fine and consistent with the C implementation. We could change this of course if it is deemed a problem. I like the property because it allows for these or future directives to potentially be composed (eg, an interface that adds @something).

cmd/snap-seccomp/main.go
+ }
+
+ scanner := bufio.NewScanner(bytes.NewBuffer(content))
+ for scanner.Scan() {
@niemeyer

niemeyer Jun 21, 2017

Contributor

We just took that exact same text and split it out entirely in memory above under hasLine and threw away the content. It feels a bit awkward to then take the care of streaming it line by line once more and over a much more complex machinery. If we don't care, then let's just split it up above once and ahead of time, and user the same slice of lines in hasLine (renaming it to contains or something) and here.

@mvo5

mvo5 Jun 22, 2017

Collaborator

Indeed, I modeled this after the C implementation, I reworked it now and it is only doing a single pass over content now.

@jdstrand

jdstrand Jun 22, 2017

Contributor

The C implementation algorithm was intentionally 2 pass with a line by line preprocess for @unrestricted/@complain and then another pass to process line by line because it resulted to simpler C setuid code.

The go code doesn't have to be 2-pass.

cmd/snap-seccomp/main.go
+
+func showSeccompLibraryVersion() error {
+ major, minor, micro := seccomp.GetLibraryVersion()
+ fmt.Fprintf(os.Stdout, "seccomp version: %d.%d.%d\n", major, minor, micro)
@niemeyer

niemeyer Jun 21, 2017

Contributor

I think we can print just the version itself here, since the command is already super specific (library-version).

+ if err != nil {
+ break
+ }
+ err = compile(content, os.Args[3])
@niemeyer

niemeyer Jun 21, 2017

Contributor

len(os.Args) == 3 means harsh crash here.
len(os.Args) == 2 means harsh crash four lines above.
len(os.Args) == 1 means harsh crash seven lines above.

interfaces/seccomp/backend.go
+
+ for baseName := range content {
+ in := filepath.Join(dirs.SnapSeccompDir, baseName)
+ out := filepath.Join(dirs.SnapSeccompDir, baseName+".bpf")
@niemeyer

niemeyer Jun 21, 2017

Contributor

If the file names are being extended with .bpf, is there any benefit in changing SnapSeccompDir to have .bpf in the directory itself, or could we simply keep the same directory and have .bpf in the filenames alone? This has the added benefit that the old profiles will be properly collected once new profiles compile, instead of being left around forever once this patch lands.

@mvo5

mvo5 Jun 21, 2017

Collaborator

My original approach was to use the same dir, however there are some complications. We use EnsureDirState() in the code that generates the seccomp profiles. This code will remove all of the old seccomp profiles (written by the old 2.24 code). So when we revert we run into the same problem that we need to solve, there will be new incompatible profiles at startup until 2.24 ran and generated valid profiles for 2.24. The easiest way out of this was to use a new dir. We can add a upcoming task to the forum to ensure we garbage collect the now-longer-used directory. We can of course still keep the /var/lib/snapd/seccomp/profiles/ directory, when we do this, we need to tweak the profile names so that the glob used in EnsureDirState() does not remove the old 2.24 seccomp profiles. If that is more desirable than the new directory in can investigate this in my morning.

@niemeyer

niemeyer Jun 22, 2017

Contributor

Thanks for the explanation. I forgot we'll always rebuild the profile on startup due to the binary change. It doesn't seem worth adding extra complexity to preserve the directory when something simpler along the lines of what you've done will sort it out just fine.

Here is an alternative suggestion: seccomp/bpf/*.src for the source files and .../*.bin for the binary ones. The .src is just a trivial hint to suggest that this file won't really be used by the runtime until it is compiled. The .bin is a way to avoid the .../bpf/*.bpf duplicity, and only really recommended if there's no standard suffix for the BPF bytecode (I couldn't find .bpf as a traditional suffix.. is it?).

@jdstrand

jdstrand Jun 22, 2017

Contributor

I like how /var/lib/snapd/seccomp is preserved in @niemeyer's suggestion to use /var/lib/snapd/seccomp/bpf/... since finding the subdir bpf/ is more discoverable than the sibling dir seccomp.bpf.

As mentioned in previous days, I don't have a strong preference on the naming approach, so what is proposed is fine with me, but I got to wondering if we have the bpf/ subdir and both the input and output files are there, we don't really need both the input and ouput files to have a suffix. Eg, could simply use seccomp/bpf/snap.name.cmd.src (or seccomp/bpf/snap.name.cmd.in) and seccomp/bpf/snap.name.cmd (ie, no extension for the output file). Or flip it and have no extension for the input file and use .cache or .bin for the output. I suspect that the former (have an extension on the input file) is preferred since it retains the hint @niemeyer was after.

Not blocking on this if you're happy with *.src and *.bin, just thought it was potentially cleaner to use an extension on only one of input or output.

@niemeyer

niemeyer Jun 22, 2017

Contributor

Problem with not having .bin is that everywhere else those are text files, and tab-completing them will hit the naked file before the .src, so people will tend to have an editor full of garbage. The .src and .bin make it very easy to grasp what's the content in each case without having to open it up.

@jdstrand

jdstrand Jun 22, 2017

Contributor

That makes sense especially if the hint is important.

mvo5 added some commits Jun 22, 2017

Use /var/lib/snapd/bpf/*.{src,bin} for the seccomp profiles
The binary bpf seccomp profiles will be stored in the .bin file,
the source in the .src file.
@@ -33,7 +33,7 @@
#include "../libsnap-confine-private/string-utils.h"
#include "../libsnap-confine-private/utils.h"
-static char *filter_profile_dir = "/var/lib/snapd/seccomp/profiles.bpf/";
+static char *filter_profile_dir = "/var/lib/snapd/seccomp/bpf/";
@jdstrand

jdstrand Jun 22, 2017

Contributor

Meh, I mentally misread all the other stuff as /var/lib/snapd/seccomp/profiles/bpf when I commented that I liked it. I don't want to belabor this point cause like I said before, I don't have a strong opinion. This is fine.

@mvo5 mvo5 merged commit 3247e95 into snapcore:master Jun 22, 2017

5 of 7 checks passed

xenial-i386 autopkgtest finished (failure)
Details
yakkety-amd64 autopkgtest finished (failure)
Details
artful-amd64 autopkgtest finished (success)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
xenial-amd64 autopkgtest finished (success)
Details
xenial-ppc64el autopkgtest finished (success)
Details
zesty-amd64 autopkgtest finished (success)
Details

mvo5 added a commit that referenced this pull request Jun 26, 2017

Merge pull request #3507 from mvo5/seccomp-bpf-2.26
many: backport of seccomp-bpf branch (#3431) to the 2.26 release
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment