Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483

Merged
merged 9 commits into from Nov 2, 2016

Conversation

poettering
Copy link
Member

This one might break stuff, see commit msgs.

@evverx please have a look!

@@ -1342,13 +1342,13 @@
</row>
<row>
<entry>@raw-io</entry>
<entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pciconfig_read()</function>, …</entry>
<entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pcaiconfig_read()</function>, …)</entry>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You added an a to pciconfig_read by mistake

</row>
<row>
<entry>@file-system-ops</entry>
<entry>File system operations: more advanced file system operations, such as renaming or removing files and directories, or creating hard and symbolic links.</entry>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reading this, I was thinking about some sandbox architectures where the sandboxed application only have access to syscalls that can operate on a file descriptor, but no access to syscall that take a path. An external privileged agent gives all the file descriptors to the sandboxed application in order for it to work.

Maybe a syscall group with only filesytem syscalls using a file descriptor could be interesting?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronnychevalier yes this can be interesting and it goes into the direction of capsicum of FreeBSD, however here not sure if this makes sense, but instead of emulating the capabilities mode of the kernel, one can mimic partial functionality at userspace level using the file descriptor that you mention coupled with a mount namespace setup where only the fds that maps to resources and desired capabilities are available...

@evverx
Copy link
Member

evverx commented Oct 25, 2016

@poettering , thanks! I'll take a detailed look tomorrow.

This one might break stuff, see commit msgs.

I'm not sure about aa. But IIRC the order doesn't really matter for SELinux: #3852 (comment)

Also, I think we should somehow document this:

Be careful, though: LSMs might also not tighten constraints on exec
in no_new_privs mode. (This means that setting up a general-purpose
service launcher to set no_new_privs before execing daemons may
interfere with LSM-based sandboxing.)

But I don't know how to express this user-friendly :-(

@evverx
Copy link
Member

evverx commented Oct 25, 2016

#3970 (comment)

Ideally there should be no system calls between the final seccomp() and execve() to minimize restrictions and the default set. I'd suppose SELinux and AppArmor also would like to be the last ones too, so there's conflict. It's also not a good idea to block system calls that setexeccon() and aa_change_onexec happen to use. This could be avoided if the kernel provided a new way to install seccomp filters so that they only take effect after execve(), much like the context set by setexeccon() applies after execve(). Then also execve() could be allowed to be blocked by a filter to prevent the service from executing anything else, which would be very nice.

Seems like this PR will improve the current situation. @topimiettinen , what do you think?

@tixxdz
Copy link
Member

tixxdz commented Oct 25, 2016

Lgtm in general for seccomp ordering , others have already defined layers https://chromium.googlesource.com/chromium/src/+/master/docs/linux_sandboxing.md .

@@ -32,6 +32,11 @@ Janitorial Clean-ups:

Features:

* drop nss-myhostname in favour of nss-resolve?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Not everybody is using resolve and not everybody should be using it. For example in aa mock chroot I don't want systemd-resolve to run, but I still want localhost to be resolvable. myhostname provides useful functionality that is partially independent of resolve and I think we should keep it, because of this, and because of backwards compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well i added the question mark to this entry, since I am not entirely sure about it either...

* drop nss-myhostname in favour of nss-resolve?

* drop internal dlopen() based nss-dns fallback in nss-resolve, and rely on the
external nsswitch.conf based one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go through a long deprecation period though.

[SYSCALL_FILTER_SET_FILE_SYSTEM] = {
/* Basic file system access (reading file properties, opening/creating files and directories for
* read+write) */
.name = "@file-system",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really see the point of the split betwen @File-system and @file-system-ops. (And I also think no dash would be better: @filesystem...). "Advanced" is in the eye of the beholder. I'd rather split this into @filesystem-query (access, chdir, various getattr, stat, inotify, mmap, readlink).

fcntl, close can be moved to @basic-io.

Can we filter on open(…, O_CREATE) and mmap(…, PROT_WRITE) and move those to the latter group?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keszybz

Can we filter on open(…, O_CREATE) and mmap(…, PROT_WRITE) and move those to the latter group?

So the @basic-io group may make sense, however IMHO we should wait on open(..., O_CREATE) and writable operations... , and better figure out a setup where it is simple for users to use more simple switches and maybe embed the open(O_CREATE) inside without exposing it, or even better following these choices:

  • ProtectSystem=strict

  • RestrictMount=no|rdonly_devices|strict

    strict: disables mount,umount,chroot and pivot_root
    rdonly_devices: if possible, allows chroot, and may imply some semantics of PrivateDevices. The idea is to not allow remounts, changing mount flags, ms_move etc on real devices, mount rw virtual filesystems can be ok, but we should prohibit write access to real devices, this way it will cover all the write syscalls on real hardware in one option, other stuff can be classified as memory...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think similar split to writable and read/other operations was also discussed in early version of the grouping patches (#3053 and #3157). But it's very rare that a program only reads or receives from network and never writes or sends anything.

Blocking file creation and some mount operations may make sense. BTW, there's also creat().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I figure I'll drop the file-system-access bit for now, and split it into a separate PR. I'll leave the basic IO stuff in the patch however.

@keszybz close() already is in the @basic-io group in this patch. And I was afraid of adding fcntl since it's one of those multplixed calls, and we'd blanket allow quite a few ops that way. I sounded better to me that people needing it would explicitly say "SystemCallFilter=@basic-io fcntl" or so...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(i think we should extend SystemCallFilters= eventually, so that besides syscalls we can also recognize the different commands of multiplexed syscalls, such as prctl(), setsockopt() and fcntl(). I.e. that you could write "SystemCallFilters=read write SO_PASSCRED socket PR_SET_NAME" to match three syscalls and the two specified subcommands of setsockopt and prctl... If we have that we could add the safe fcntl() subcommands to the @basic-io group...)

@keszybz
Copy link
Member

keszybz commented Oct 26, 2016

I think that the SMACK setup should be moved too. If we're going to break setups, it's better to do it once, in a bigger step.

@@ -1261,6 +1261,14 @@
filter is reset, all prior assignments will have no effect. This does not affect commands prefixed with
<literal>+</literal>.</para>

<para>Note that strict system call filters may impact execution and error handling code paths of the service
invocation. Specifically, access to the <function>execve</function> system call is required for the execution
of the service binary – if it is blocked service invocation will necessarily fail. Also, if execution of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an m-dash, not an n-dash.

Copy link
Member Author

@poettering poettering Nov 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, how do i type that on my german X11 kbd? I can type these three characters easily here:

- (plain "-" key)
– ("-" key combined with AltGr)
— ("-" key combined with Shift-AltGr)

Which one is the right one in your opinion?

"statfs\0"
},
[SYSCALL_FILTER_SET_FILE_SYSTEM_OPS] = {
/* More advanced file system operations */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be it will be better if it is defined as a set of syscalls that update metadata attributes etc of already created files or so ... ?! I mean a better wording or description.

@keszybz keszybz added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 27, 2016
@keszybz
Copy link
Member

keszybz commented Nov 1, 2016

On Tue, Nov 01, 2016 at 08:43:15AM -0700, Lennart Poettering wrote:

well i added the question mark to this entry, since I am not entirely sure about it either...

I know ;) I was just providing my two cents.

@keszybz
Copy link
Member

keszybz commented Nov 1, 2016

On Tue, Nov 01, 2016 at 08:50:04AM -0700, Lennart Poettering wrote:

  • (plain "-" key)
    – ("-" key combined with AltGr)
    — ("-" key combined with Shift-AltGr)

Try copy and paste into python3 -c 'import unicodedata;print(unicodedata.name("—"))'
This returns HYPHEN-MINUS, EN DASH, EM DASH for the three cases above, i.e. the
last one is good.

Timing and sleep are so basic operations, it makes very little sense to ever
block them, hence don't.
The system call is already part in @default hence implicitly allowed anyway.
Also, if it is actually blocked then systemd couldn't execute the service in
question anymore, since the application of seccomp is immediately followed by
it.
These system calls clearly fall in the @ipc category, hence should be listed
there, simply to avoid confusion and surprise by the user.
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.

@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes
Seccomp is generally an unprivileged operation, changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.

This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).

This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.

Fixes: systemd#3993
…ice manager

If execve() or socket() is filtered the service manager might get into trouble
executing the service binary, or handling any failures when this fails. Mention
this in the documentation.

The other option would be to implicitly whitelist all system calls that are
required for these codepaths. However, that appears less than desirable as this
would mean socket() and many related calls have to be whitelisted
unconditionally. As writing system call filters requires a certain level of
expertise anyway it sounds like the better option to simply document these
issues and suggest that the user disables system call filters in the service
temporarily in order to debug any such failures.

See: systemd#3993.
This is a follow-up for 6309e51 and makes sure
we compare test results with the right user identifier.
@poettering
Copy link
Member Author

I force pushed a new version now, with the @filesystem groups dropped for now. All pointed out issues should be fixed, and another commit was added, that fixes "sudo make check", and is hence kinda important. Please review and merge!

@poettering poettering changed the title more seccomp fixes, and change of order of selinux/aa and seccomp application on exec more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec Nov 2, 2016
@keszybz keszybz added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Nov 2, 2016
@poettering
Copy link
Member Author

hmm, the ubuntu CI fails at:

[�[0;32m  OK  �[0m] Stopped Create Static Device Nodes in /dev.
[�[0;32m  OK  �[0m] Reached target Shutdown.
[�[0;32m  OK  �[0m] Reached target Final Step.
         Starting Power-Off...
[   83.411099] reboot: Power down
-rw-r-----+ 1 root systemd-journal 4845568 Nov  2 16:56 /var/tmp/systemd-test.q4eyN6/journal/d617569fc6e8437db569a22e32336c92/system.journal
-rw-r-----+ 1 root systemd-journal 4845568 Nov  2 16:56 /var/tmp/systemd-test.q4eyN6/journal/d617569fc6e8437db569a22e32336c92/system@a410db085f704dfb8dcaddfab0fdb458-0000000000000001-0005405450ea6bb0.journal
TEST RUN: Job-related tests [FAILED]

anyone has an idea what that means and what precisely fails here? does this have anything to do with the patch at hand? @martinpitt, @evverx, @keszyb, any idea?

@evverx
Copy link
Member

evverx commented Nov 2, 2016

anyone has an idea what that means and what precisely fails here?

I have no idea what is going on. And I can't reproduce this locally.

But we have already seen that error:
#4331 (comment)
#4414 (comment)

@evverx
Copy link
Member

evverx commented Nov 2, 2016

And I can't reproduce this locally.

Well, I've reproduced

Nov 02 09:24:55 systemd-testsuite test-jobs.sh[164]: + START_SEC=1478078695
Nov 02 09:24:55 systemd-testsuite test-jobs.sh[164]: + systemctl start --wait wait2.service
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: ++ date -u +%s
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + END_SEC=1478078699
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + ELAPSED=4
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + [[ 4 -ge 2 ]]
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + [[ 4 -le 3 ]]
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + exit 1

93a0884#diff-5a6a04288fc1ba785cfd801b4d20e147R52
seems like we should fix the test. @martinpitt ?

@poettering
Copy link
Member Author

So I understand the issues is unrelated to the PR at hend? I figure I can merge this then!

@poettering poettering merged commit 32e134c into systemd:master Nov 2, 2016
@martinpitt
Copy link
Contributor

seems like we should fix the test

Agreed, this seems to be a too tight race. Will look at that next week, I'm at a company sprint this week.

@keszybz keszybz removed the good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed label Nov 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

7 participants