more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483

poettering · 2016-10-25T14:21:35Z

This one might break stuff, see commit msgs.

@evverx please have a look!

ronnychevalier · 2016-10-25T14:45:35Z

man/systemd.exec.xml

@@ -1342,13 +1342,13 @@
              </row>
              <row>
                <entry>@raw-io</entry>
-                <entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pciconfig_read()</function>, …</entry>
+                <entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pcaiconfig_read()</function>, …)</entry>


You added an a to pciconfig_read by mistake

ronnychevalier · 2016-10-25T14:52:28Z

man/systemd.exec.xml

+              </row>
+              <row>
+                <entry>@file-system-ops</entry>
+                <entry>File system operations: more advanced file system operations, such as renaming or removing files and directories, or creating hard and symbolic links.</entry>


While reading this, I was thinking about some sandbox architectures where the sandboxed application only have access to syscalls that can operate on a file descriptor, but no access to syscall that take a path. An external privileged agent gives all the file descriptors to the sandboxed application in order for it to work.

Maybe a syscall group with only filesytem syscalls using a file descriptor could be interesting?

@ronnychevalier yes this can be interesting and it goes into the direction of capsicum of FreeBSD, however here not sure if this makes sense, but instead of emulating the capabilities mode of the kernel, one can mimic partial functionality at userspace level using the file descriptor that you mention coupled with a mount namespace setup where only the fds that maps to resources and desired capabilities are available...

evverx · 2016-10-25T15:14:17Z

@poettering , thanks! I'll take a detailed look tomorrow.

This one might break stuff, see commit msgs.

I'm not sure about aa. But IIRC the order doesn't really matter for SELinux: #3852 (comment)

Also, I think we should somehow document this:

Be careful, though: LSMs might also not tighten constraints on exec
in no_new_privs mode. (This means that setting up a general-purpose
service launcher to set no_new_privs before execing daemons may
interfere with LSM-based sandboxing.)

But I don't know how to express this user-friendly :-(

evverx · 2016-10-25T15:31:42Z

#3970 (comment)

Ideally there should be no system calls between the final seccomp() and execve() to minimize restrictions and the default set. I'd suppose SELinux and AppArmor also would like to be the last ones too, so there's conflict. It's also not a good idea to block system calls that setexeccon() and aa_change_onexec happen to use. This could be avoided if the kernel provided a new way to install seccomp filters so that they only take effect after execve(), much like the context set by setexeccon() applies after execve(). Then also execve() could be allowed to be blocked by a filter to prevent the service from executing anything else, which would be very nice.

Seems like this PR will improve the current situation. @topimiettinen , what do you think?

topimiettinen · 2016-10-25T19:47:49Z

Looks fine. aa_change_onexec() also takes effect on execve():

https://bazaar.launchpad.net/~apparmor-dev/apparmor/master/view/head:/libraries/libapparmor/src/kernel.c#L496

https://github.com/torvalds/linux/blob/d011a4d861ce583466a8ae72a0c8e7f51c8cba4e/security/apparmor/lsm.c#L550

tixxdz · 2016-10-25T20:32:52Z

Lgtm in general for seccomp ordering , others have already defined layers https://chromium.googlesource.com/chromium/src/+/master/docs/linux_sandboxing.md .

keszybz · 2016-10-26T00:59:35Z

TODO

@@ -32,6 +32,11 @@ Janitorial Clean-ups:

 Features:

+* drop nss-myhostname in favour of nss-resolve?
+


I don't think so. Not everybody is using resolve and not everybody should be using it. For example in aa mock chroot I don't want systemd-resolve to run, but I still want localhost to be resolvable. myhostname provides useful functionality that is partially independent of resolve and I think we should keep it, because of this, and because of backwards compatibility.

well i added the question mark to this entry, since I am not entirely sure about it either...

keszybz · 2016-10-26T01:00:03Z

TODO

+* drop nss-myhostname in favour of nss-resolve?
+
+* drop internal dlopen() based nss-dns fallback in nss-resolve, and rely on the
+  external nsswitch.conf based one


This should go through a long deprecation period though.

keszybz · 2016-10-26T01:10:26Z

src/shared/seccomp-util.c

+        [SYSCALL_FILTER_SET_FILE_SYSTEM] = {
+                /* Basic file system access (reading file properties, opening/creating files and directories for
+                 * read+write) */
+                .name = "@file-system",


I don't really see the point of the split betwen @File-system and @file-system-ops. (And I also think no dash would be better: @filesystem...). "Advanced" is in the eye of the beholder. I'd rather split this into @filesystem-query (access, chdir, various getattr, stat, inotify, mmap, readlink).

fcntl, close can be moved to @basic-io.

Can we filter on open(…, O_CREATE) and mmap(…, PROT_WRITE) and move those to the latter group?

@keszybz

Can we filter on open(…, O_CREATE) and mmap(…, PROT_WRITE) and move those to the latter group?

So the @basic-io group may make sense, however IMHO we should wait on open(..., O_CREATE) and writable operations... , and better figure out a setup where it is simple for users to use more simple switches and maybe embed the open(O_CREATE) inside without exposing it, or even better following these choices:

ProtectSystem=strict

RestrictMount=no|rdonly_devices|strict

strict: disables mount,umount,chroot and pivot_root
rdonly_devices: if possible, allows chroot, and may imply some semantics of PrivateDevices. The idea is to not allow remounts, changing mount flags, ms_move etc on real devices, mount rw virtual filesystems can be ok, but we should prohibit write access to real devices, this way it will cover all the write syscalls on real hardware in one option, other stuff can be classified as memory...

I think similar split to writable and read/other operations was also discussed in early version of the grouping patches (#3053 and #3157). But it's very rare that a program only reads or receives from network and never writes or sends anything.

Blocking file creation and some mount operations may make sense. BTW, there's also creat().

Hmm, I figure I'll drop the file-system-access bit for now, and split it into a separate PR. I'll leave the basic IO stuff in the patch however.

@keszybz close() already is in the @basic-io group in this patch. And I was afraid of adding fcntl since it's one of those multplixed calls, and we'd blanket allow quite a few ops that way. I sounded better to me that people needing it would explicitly say "SystemCallFilter=@basic-io fcntl" or so...

(i think we should extend SystemCallFilters= eventually, so that besides syscalls we can also recognize the different commands of multiplexed syscalls, such as prctl(), setsockopt() and fcntl(). I.e. that you could write "SystemCallFilters=read write SO_PASSCRED socket PR_SET_NAME" to match three syscalls and the two specified subcommands of setsockopt and prctl... If we have that we could add the safe fcntl() subcommands to the @basic-io group...)

keszybz · 2016-10-26T01:13:03Z

I think that the SMACK setup should be moved too. If we're going to break setups, it's better to do it once, in a bigger step.

keszybz · 2016-10-26T01:14:00Z

man/systemd.exec.xml

@@ -1261,6 +1261,14 @@
        filter is reset, all prior assignments will have no effect. This does not affect commands prefixed with
        <literal>+</literal>.</para>

+        <para>Note that strict system call filters may impact execution and error handling code paths of the service
+        invocation. Specifically, access to the <function>execve</function> system call is required for the execution
+        of the service binary – if it is blocked service invocation will necessarily fail. Also, if execution of the


This should be an m-dash, not an n-dash.

hmm, how do i type that on my german X11 kbd? I can type these three characters easily here:

- (plain "-" key) – ("-" key combined with AltGr) — ("-" key combined with Shift-AltGr)

Which one is the right one in your opinion?

tixxdz · 2016-10-26T11:51:39Z

src/shared/seccomp-util.c

+                "statfs\0"
+        },
+        [SYSCALL_FILTER_SET_FILE_SYSTEM_OPS] = {
+                /* More advanced file system operations */


May be it will be better if it is defined as a set of syscalls that update metadata attributes etc of already created files or so ... ?! I mean a better wording or description.

keszybz · 2016-11-01T15:44:43Z

On Tue, Nov 01, 2016 at 08:43:15AM -0700, Lennart Poettering wrote:

well i added the question mark to this entry, since I am not entirely sure about it either...

I know ;) I was just providing my two cents.

keszybz · 2016-11-01T16:05:49Z

On Tue, Nov 01, 2016 at 08:50:04AM -0700, Lennart Poettering wrote:

(plain "-" key)
– ("-" key combined with AltGr)
— ("-" key combined with Shift-AltGr)

Try copy and paste into python3 -c 'import unicodedata;print(unicodedata.name("—"))'
This returns HYPHEN-MINUS, EN DASH, EM DASH for the three cases above, i.e. the
last one is good.

Timing and sleep are so basic operations, it makes very little sense to ever block them, hence don't.

@default

The system call is already part in @default hence implicitly allowed anyway. Also, if it is actually blocked then systemd couldn't execute the service in question anymore, since the application of seccomp is immediately followed by it.

@ipc

These system calls clearly fall in the @ipc category, hence should be listed there, simply to avoid confusion and surprise by the user.

@resources contains various syscalls that alter resource limits and memory and scheduling parameters of processes. As such they are good candidates to block for most services. @basic-io contains a number of basic syscalls for I/O, similar to the list seccomp v1 permitted but slightly more complete. It should be useful for building basic whitelisting for minimal sandboxes

Seccomp is generally an unprivileged operation, changing security contexts is most likely associated with some form of policy. Moreover, while seccomp may influence our own flow of code quite a bit (much more than the security context change) make sure to apply the seccomp filters immediately before executing the binary to invoke. This also moves enforcement of NNP after the security context change, so that NNP cannot affect it anymore. (However, the security policy now has to permit the NNP change). This change has a good chance of breaking current SELinux/AA/SMACK setups, because the policy might not expect this change of behaviour. However, it's technically the better choice I think and should hence be applied. Fixes: systemd#3993

…ice manager If execve() or socket() is filtered the service manager might get into trouble executing the service binary, or handling any failures when this fails. Mention this in the documentation. The other option would be to implicitly whitelist all system calls that are required for these codepaths. However, that appears less than desirable as this would mean socket() and many related calls have to be whitelisted unconditionally. As writing system call filters requires a certain level of expertise anyway it sounds like the better option to simply document these issues and suggest that the user disables system call filters in the service temporarily in order to debug any such failures. See: systemd#3993.

This is a follow-up for 6309e51 and makes sure we compare test results with the right user identifier.

poettering · 2016-11-02T14:57:30Z

I force pushed a new version now, with the @filesystem groups dropped for now. All pointed out issues should be fixed, and another commit was added, that fixes "sudo make check", and is hence kinda important. Please review and merge!

poettering · 2016-11-02T17:53:15Z

hmm, the ubuntu CI fails at:

[�[0;32m  OK  �[0m] Stopped Create Static Device Nodes in /dev.
[�[0;32m  OK  �[0m] Reached target Shutdown.
[�[0;32m  OK  �[0m] Reached target Final Step.
         Starting Power-Off...
[   83.411099] reboot: Power down
-rw-r-----+ 1 root systemd-journal 4845568 Nov  2 16:56 /var/tmp/systemd-test.q4eyN6/journal/d617569fc6e8437db569a22e32336c92/system.journal
-rw-r-----+ 1 root systemd-journal 4845568 Nov  2 16:56 /var/tmp/systemd-test.q4eyN6/journal/d617569fc6e8437db569a22e32336c92/system@a410db085f704dfb8dcaddfab0fdb458-0000000000000001-0005405450ea6bb0.journal
TEST RUN: Job-related tests [FAILED]

anyone has an idea what that means and what precisely fails here? does this have anything to do with the patch at hand? @martinpitt, @evverx, @keszyb, any idea?

evverx · 2016-11-02T18:01:37Z

anyone has an idea what that means and what precisely fails here?

I have no idea what is going on. And I can't reproduce this locally.

But we have already seen that error:
#4331 (comment)
#4414 (comment)

evverx · 2016-11-02T18:21:11Z

And I can't reproduce this locally.

Well, I've reproduced

Nov 02 09:24:55 systemd-testsuite test-jobs.sh[164]: + START_SEC=1478078695
Nov 02 09:24:55 systemd-testsuite test-jobs.sh[164]: + systemctl start --wait wait2.service
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: ++ date -u +%s
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + END_SEC=1478078699
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + ELAPSED=4
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + [[ 4 -ge 2 ]]
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + [[ 4 -le 3 ]]
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + exit 1

93a0884#diff-5a6a04288fc1ba785cfd801b4d20e147R52
seems like we should fix the test. @martinpitt ?

poettering · 2016-11-02T22:09:36Z

So I understand the issues is unrelated to the PR at hend? I figure I can merge this then!

martinpitt · 2016-11-03T05:19:52Z

seems like we should fix the test

Agreed, this seems to be a too tight race. Will look at that next week, I'm at a company sprint this week.

poettering added documentation pid1 labels Oct 25, 2016

poettering mentioned this pull request Oct 25, 2016

fix handling/ordering of seccomp(), selinux/aa during service exec #3993

Closed

ronnychevalier reviewed Oct 25, 2016

View reviewed changes

keszybz reviewed Oct 26, 2016

View reviewed changes

tixxdz reviewed Oct 26, 2016

View reviewed changes

keszybz added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 27, 2016

poettering added 9 commits November 2, 2016 08:49

update TODO

67234d2

seccomp: add clock query and sleeping syscalls to "@default" group

c79aff9

Timing and sleep are so basic operations, it makes very little sense to ever block them, hence don't.

seccomp: drop execve() from @process list

a8c157f

The system call is already part in @default hence implicitly allowed anyway. Also, if it is actually blocked then systemd couldn't execute the service in question anymore, since the application of seccomp is immediately followed by it.

seccomp: include pipes and memfd in @ipc

cd5bfd7

These system calls clearly fall in the @ipc category, hence should be listed there, simply to avoid confusion and surprise by the user.

man: two minor fixes

aa6b9ce

tests: make sure tests pass when invoked in "sudo"

999a6c5

This is a follow-up for 6309e51 and makes sure we compare test results with the right user identifier.

poettering force-pushed the exec-order branch from aafd190 to 999a6c5 Compare November 2, 2016 15:06

poettering mentioned this pull request Nov 2, 2016

add @filesystem and @filesystem-ops syscall filter groups #4537

Merged

poettering changed the title ~~more seccomp fixes, and change of order of selinux/aa and seccomp application on exec~~ more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec Nov 2, 2016

keszybz added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Nov 2, 2016

poettering merged commit 32e134c into systemd:master Nov 2, 2016

keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Nov 3, 2016

evverx mentioned this pull request Nov 5, 2016

TEST-03-JOBS fails sometimes #4582

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483

more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483

poettering commented Oct 25, 2016

ronnychevalier Oct 25, 2016

ronnychevalier Oct 25, 2016

tixxdz Oct 25, 2016

evverx commented Oct 25, 2016

evverx commented Oct 25, 2016

topimiettinen commented Oct 25, 2016

tixxdz commented Oct 25, 2016

keszybz Oct 26, 2016

poettering Nov 1, 2016

keszybz Oct 26, 2016

keszybz Oct 26, 2016

tixxdz Oct 26, 2016

topimiettinen Oct 26, 2016

poettering Nov 2, 2016

poettering Nov 2, 2016

keszybz commented Oct 26, 2016

keszybz Oct 26, 2016

poettering Nov 1, 2016 •

edited

tixxdz Oct 26, 2016

keszybz commented Nov 1, 2016

keszybz commented Nov 1, 2016

poettering commented Nov 2, 2016

poettering commented Nov 2, 2016

evverx commented Nov 2, 2016

evverx commented Nov 2, 2016

poettering commented Nov 2, 2016

martinpitt commented Nov 3, 2016

		@@ -32,6 +32,11 @@ Janitorial Clean-ups:

		Features:

		* drop nss-myhostname in favour of nss-resolve?

more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483

more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483

Conversation

poettering commented Oct 25, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

evverx commented Oct 25, 2016

evverx commented Oct 25, 2016

topimiettinen commented Oct 25, 2016

tixxdz commented Oct 25, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

keszybz commented Oct 26, 2016

Choose a reason for hiding this comment

poettering Nov 1, 2016 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

keszybz commented Nov 1, 2016

keszybz commented Nov 1, 2016

poettering commented Nov 2, 2016

poettering commented Nov 2, 2016

evverx commented Nov 2, 2016

evverx commented Nov 2, 2016

poettering commented Nov 2, 2016

martinpitt commented Nov 3, 2016

poettering Nov 1, 2016 •

edited