New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
more seccomp fixes, and change of order of selinux/aa/smack and seccomp application on exec #4483
Conversation
@@ -1342,13 +1342,13 @@ | |||
</row> | |||
<row> | |||
<entry>@raw-io</entry> | |||
<entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pciconfig_read()</function>, …</entry> | |||
<entry>Raw I/O port access (<citerefentry project='man-pages'><refentrytitle>ioperm</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <citerefentry project='man-pages'><refentrytitle>iopl</refentrytitle><manvolnum>2</manvolnum></citerefentry>, <function>pcaiconfig_read()</function>, …)</entry> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You added an a
to pciconfig_read
by mistake
</row> | ||
<row> | ||
<entry>@file-system-ops</entry> | ||
<entry>File system operations: more advanced file system operations, such as renaming or removing files and directories, or creating hard and symbolic links.</entry> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reading this, I was thinking about some sandbox architectures where the sandboxed application only have access to syscalls that can operate on a file descriptor, but no access to syscall that take a path. An external privileged agent gives all the file descriptors to the sandboxed application in order for it to work.
Maybe a syscall group with only filesytem syscalls using a file descriptor could be interesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ronnychevalier yes this can be interesting and it goes into the direction of capsicum of FreeBSD, however here not sure if this makes sense, but instead of emulating the capabilities mode of the kernel, one can mimic partial functionality at userspace level using the file descriptor that you mention coupled with a mount namespace setup where only the fds that maps to resources and desired capabilities are available...
@poettering , thanks! I'll take a detailed look tomorrow.
I'm not sure about aa. But IIRC the order doesn't really matter for SELinux: #3852 (comment) Also, I think we should somehow document this:
But I don't know how to express this user-friendly :-( |
Seems like this PR will improve the current situation. @topimiettinen , what do you think? |
Looks fine. aa_change_onexec() also takes effect on execve(): |
Lgtm in general for seccomp ordering , others have already defined layers https://chromium.googlesource.com/chromium/src/+/master/docs/linux_sandboxing.md . |
@@ -32,6 +32,11 @@ Janitorial Clean-ups: | |||
|
|||
Features: | |||
|
|||
* drop nss-myhostname in favour of nss-resolve? | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Not everybody is using resolve and not everybody should be using it. For example in aa mock chroot I don't want systemd-resolve to run, but I still want localhost
to be resolvable. myhostname
provides useful functionality that is partially independent of resolve
and I think we should keep it, because of this, and because of backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well i added the question mark to this entry, since I am not entirely sure about it either...
* drop nss-myhostname in favour of nss-resolve? | ||
|
||
* drop internal dlopen() based nss-dns fallback in nss-resolve, and rely on the | ||
external nsswitch.conf based one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go through a long deprecation period though.
[SYSCALL_FILTER_SET_FILE_SYSTEM] = { | ||
/* Basic file system access (reading file properties, opening/creating files and directories for | ||
* read+write) */ | ||
.name = "@file-system", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really see the point of the split betwen @File-system and @file-system-ops. (And I also think no dash would be better: @filesystem...). "Advanced" is in the eye of the beholder. I'd rather split this into @filesystem-query (access, chdir, various getattr, stat, inotify, mmap, readlink).
fcntl, close can be moved to @basic-io.
Can we filter on open(…, O_CREATE) and mmap(…, PROT_WRITE) and move those to the latter group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we filter on open(…, O_CREATE) and mmap(…, PROT_WRITE) and move those to the latter group?
So the @basic-io group may make sense, however IMHO we should wait on open(..., O_CREATE)
and writable operations... , and better figure out a setup where it is simple for users to use more simple switches and maybe embed the open(O_CREATE)
inside without exposing it, or even better following these choices:
-
ProtectSystem=strict
-
RestrictMount=no|rdonly_devices|strict
strict: disables mount,umount,chroot and pivot_root
rdonly_devices: if possible, allows chroot, and may imply some semantics of PrivateDevices. The idea is to not allow remounts, changing mount flags, ms_move etc on real devices, mount rw virtual filesystems can be ok, but we should prohibit write access to real devices, this way it will cover all the write syscalls on real hardware in one option, other stuff can be classified as memory...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think similar split to writable and read/other operations was also discussed in early version of the grouping patches (#3053 and #3157). But it's very rare that a program only reads or receives from network and never writes or sends anything.
Blocking file creation and some mount operations may make sense. BTW, there's also creat().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I figure I'll drop the file-system-access bit for now, and split it into a separate PR. I'll leave the basic IO stuff in the patch however.
@keszybz close() already is in the @basic-io group in this patch. And I was afraid of adding fcntl since it's one of those multplixed calls, and we'd blanket allow quite a few ops that way. I sounded better to me that people needing it would explicitly say "SystemCallFilter=@basic-io fcntl" or so...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i think we should extend SystemCallFilters= eventually, so that besides syscalls we can also recognize the different commands of multiplexed syscalls, such as prctl(), setsockopt() and fcntl(). I.e. that you could write "SystemCallFilters=read write SO_PASSCRED socket PR_SET_NAME" to match three syscalls and the two specified subcommands of setsockopt and prctl... If we have that we could add the safe fcntl() subcommands to the @basic-io
group...)
I think that the SMACK setup should be moved too. If we're going to break setups, it's better to do it once, in a bigger step. |
@@ -1261,6 +1261,14 @@ | |||
filter is reset, all prior assignments will have no effect. This does not affect commands prefixed with | |||
<literal>+</literal>.</para> | |||
|
|||
<para>Note that strict system call filters may impact execution and error handling code paths of the service | |||
invocation. Specifically, access to the <function>execve</function> system call is required for the execution | |||
of the service binary – if it is blocked service invocation will necessarily fail. Also, if execution of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an m-dash, not an n-dash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, how do i type that on my german X11 kbd? I can type these three characters easily here:
- (plain "-" key)
– ("-" key combined with AltGr)
— ("-" key combined with Shift-AltGr)
Which one is the right one in your opinion?
"statfs\0" | ||
}, | ||
[SYSCALL_FILTER_SET_FILE_SYSTEM_OPS] = { | ||
/* More advanced file system operations */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be it will be better if it is defined as a set of syscalls that update metadata attributes etc of already created files or so ... ?! I mean a better wording or description.
On Tue, Nov 01, 2016 at 08:43:15AM -0700, Lennart Poettering wrote:
I know ;) I was just providing my two cents. |
On Tue, Nov 01, 2016 at 08:50:04AM -0700, Lennart Poettering wrote:
Try copy and paste into python3 -c 'import unicodedata;print(unicodedata.name("—"))' |
Timing and sleep are so basic operations, it makes very little sense to ever block them, hence don't.
The system call is already part in @default hence implicitly allowed anyway. Also, if it is actually blocked then systemd couldn't execute the service in question anymore, since the application of seccomp is immediately followed by it.
These system calls clearly fall in the @ipc category, hence should be listed there, simply to avoid confusion and surprise by the user.
@resources contains various syscalls that alter resource limits and memory and scheduling parameters of processes. As such they are good candidates to block for most services. @basic-io contains a number of basic syscalls for I/O, similar to the list seccomp v1 permitted but slightly more complete. It should be useful for building basic whitelisting for minimal sandboxes
Seccomp is generally an unprivileged operation, changing security contexts is most likely associated with some form of policy. Moreover, while seccomp may influence our own flow of code quite a bit (much more than the security context change) make sure to apply the seccomp filters immediately before executing the binary to invoke. This also moves enforcement of NNP after the security context change, so that NNP cannot affect it anymore. (However, the security policy now has to permit the NNP change). This change has a good chance of breaking current SELinux/AA/SMACK setups, because the policy might not expect this change of behaviour. However, it's technically the better choice I think and should hence be applied. Fixes: systemd#3993
…ice manager If execve() or socket() is filtered the service manager might get into trouble executing the service binary, or handling any failures when this fails. Mention this in the documentation. The other option would be to implicitly whitelist all system calls that are required for these codepaths. However, that appears less than desirable as this would mean socket() and many related calls have to be whitelisted unconditionally. As writing system call filters requires a certain level of expertise anyway it sounds like the better option to simply document these issues and suggest that the user disables system call filters in the service temporarily in order to debug any such failures. See: systemd#3993.
This is a follow-up for 6309e51 and makes sure we compare test results with the right user identifier.
I force pushed a new version now, with the |
aafd190
to
999a6c5
Compare
hmm, the ubuntu CI fails at:
anyone has an idea what that means and what precisely fails here? does this have anything to do with the patch at hand? @martinpitt, @evverx, @keszyb, any idea? |
I have no idea what is going on. And I can't reproduce this locally. But we have already seen that error: |
Well, I've reproduced Nov 02 09:24:55 systemd-testsuite test-jobs.sh[164]: + START_SEC=1478078695
Nov 02 09:24:55 systemd-testsuite test-jobs.sh[164]: + systemctl start --wait wait2.service
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: ++ date -u +%s
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + END_SEC=1478078699
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + ELAPSED=4
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + [[ 4 -ge 2 ]]
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + [[ 4 -le 3 ]]
Nov 02 09:24:59 systemd-testsuite test-jobs.sh[164]: + exit 1 93a0884#diff-5a6a04288fc1ba785cfd801b4d20e147R52 |
So I understand the issues is unrelated to the PR at hend? I figure I can merge this then! |
Agreed, this seems to be a too tight race. Will look at that next week, I'm at a company sprint this week. |
This one might break stuff, see commit msgs.
@evverx please have a look!