-
-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core:sandbox: Add ProtectKernelModules= and some fixes #4243
core:sandbox: Add ProtectKernelModules= and some fixes #4243
Conversation
if (streq(set->set_name, "@module")) { | ||
syscalls_found = true; | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I don't think it is worth relying on the syscall_filter_sets structure for this. I'd probably just hardcode the three syscalls here, it's not that many, and it's very unlikely it will change anytime soon. Also, the linear search goes away then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok done, but I think we may redo it later...
- we may want to show filtered syscall through D-Bus per group , 2) currently we don't show system calls that are filtered through other options in D-Bus systemcallfilter property, we only show the ones that were explicitly provided, not the syscalls that we have added... I already have a fix for this one, this way it will be easy to debug since at linuxcon everyone was asking how to debug... but lets delay it for now. I only want to be consistent. ok.
@@ -3364,6 +3364,11 @@ int unit_patch_contexts(Unit *u) { | |||
if (ec->private_devices) | |||
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_MKNOD); | |||
|
|||
if (ec->protect_kernel_modules) { | |||
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_SYS_MODULE); | |||
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_NET_ADMIN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not combine this into one line:
ec->capability_bounding_set &= ~((UINT64_C(1) << CAP_SYS_MODULE) | (UINT64_C(1) << CAP_NET_ADMIN));
looks pretty good already, thanks for looking into this! I think the kernel's own module auto-loading thing is something we can ignore (well, we should document that this exist, but it shouldn't affec this new option), simply as this is configured strictly within the domain of the kernel, and not of the service. Or to say this differently: if a service manages to somehow trigger module auto-loading of a module it should not have been able to then this indicates a general kernel problem that needs to be fixed in the kernel and its mapping files, not in the service. This new option should primarily protect that services cannot insert arbitrary code of their own choice I think. If the kernel auto-loads some specifically selected stuff I think this may be considered OK and orthogonal to this new option. hence: a comment in the man page about this should be added, but otherwise we should just accept it. |
hmm, how did you come to the conclusion that we need to block CAP_NET_ADMIN too? have a link about that? |
regarding kexec. i'd claim that's a different concept, unrelated to kmod. It might make sense to add another option for it though. Maybe a generic RestrictReboot= which would make the reboot() and kexec_load() syscalls unavailable, as well as related files in /proc and /sys which may be used to trigger reboots, plus maybe kill(1, SIGINT) (and a couple of other signals). It wouldn't be perfect (dbus, ...) but I think it could get us quite far... |
abad134
to
c0d2516
Compare
Updated.
Ok done, after some digging I'm not sure how hard it will be to make usermodhelper of the kernel understand namespaces... at least that way we can protect and bind mount inaccessible files over modprobe path... anyway documented now.
http://lxr.free-electrons.com/source/net/core/dev_ioctl.c#L359 netdev aliases were added later to cap_net_admin so you can't load arbitrary modules with it IIRC ...
Ok, will try to see, and for this PR if you agree on the I have added some other changes on top of this PR. Thanks for the review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good to me, I like the idea of blocking /usr/lib/modules a lot.
I'd really would like to avoid the CAP_NET_ADMIN blockage however, I have the suspicion that this is unnecessary (if documented properly) and overreaching for many cases, as this would mean that a lot of network-facing code that would benefit from this option couldn't use it. See my comments above.
@@ -3369,6 +3369,9 @@ int unit_patch_contexts(Unit *u) { | |||
if (ec->private_devices) | |||
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_MKNOD); | |||
|
|||
if (ec->protect_kernel_modules) | |||
ec->capability_bounding_set &= ~((UINT64_C(1) << CAP_SYS_MODULE) | (UINT64_C(1) << CAP_NET_ADMIN)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, I think we should not take CAP_NET_ADMIN away here, but only CAP_SYS_MODULE.
The rationale for this: if you have CAP_NET_ADMIN, then you can only load modules prefixed with netdev-
but not arbitrary modules. And as such it really is more in the territory of module auto-loading, not of explicit loading. I think this patch should not bother window module auto-loading, as problems with that are limited to specific sets of kernel module names, that usually share a common prefix, and are always subject to configuration in the modules configuration or mapping files.
Yes, this deserves documentation, but only in the context of "this setting only blocks explicit module loading. Limited automatic module loading because of user configuration or kernel mapping tables might still happen as side effect of requested user operations, both privileged and unprivileged." or something like that.
@@ -3367,7 +3367,7 @@ int unit_patch_contexts(Unit *u) { | |||
ec->no_new_privileges = true; | |||
|
|||
if (ec->private_devices) | |||
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_MKNOD); | |||
ec->capability_bounding_set &= ~((UINT64_C(1) << CAP_MKNOD) | (UINT64_C(1) << CAP_SYS_RAWIO)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one makes a ton of sense indeed!
@@ -1035,7 +1035,10 @@ | |||
boot-time, with the <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry> | |||
mechanism. Almost no services need to write to these at runtime; it is hence recommended to turn this on for | |||
most services. For this setting the same restrictions regarding mount propagation and privileges apply as for | |||
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.</para></listitem> | |||
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. | |||
Note that this option does not prevent kernel tunning through IPC interfaces and exeternal programs. However |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tunning → tuning
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.</para></listitem> | ||
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. | ||
Note that this option does not prevent kernel tunning through IPC interfaces and exeternal programs. However | ||
<varname>InaccessiblePaths=</varname> can be used to make IPC file system objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make IPC file system objects → make some IPC file system objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -45,6 +45,14 @@ int main(int argc, char *argv[]) { | |||
"/home/lennart/projects", | |||
NULL | |||
}; | |||
|
|||
const NameSpaceContext ns_ctx = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could also be "static" I figure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -40,15 +43,20 @@ typedef enum ProtectSystem { | |||
_PROTECT_SYSTEM_INVALID = -1 | |||
} ProtectSystem; | |||
|
|||
struct NameSpaceContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe call this NamespaceExtras or so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NameSpaceInfo
bool private_dev; | ||
bool protect_control_groups; | ||
bool protect_kernel_tunables; | ||
bool protect_kernel_modules; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these could become bitflags? i.e. gain a :1
each?
<filename>/usr/lib/modules</filename> and other related module directories | ||
are made inaccessible. For this setting the same restrictions regarding mount | ||
propagation and privileges apply as for <varname>ReadOnlyPaths=</varname> | ||
and related calls, see above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the concept! Makes sense to block this, indeed!
{ "/lib/modules", INACCESSIBLE, true }, | ||
{ "/lib64/modules", INACCESSIBLE, true }, | ||
{ "/usr/lib/modules", INACCESSIBLE, true }, | ||
{ "/usr/lib64/modules", INACCESSIBLE, true }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so, the kernel modules are never in /usr/lib64. I am pretty sure we should only look for /usr/lib here, not the 64bit counterpart, after all the lib64 thing is only necessary for multilib implementations, and lib64 should pretty much only include .so libraries, and nothing else, in particular not kernels or their kmods.
I also think we should not add /lib/modules to the block list here, except if HAVE_SPLIT_USR is undefined. i.e. on joined /usr, we shouldn't bother with the dir at all...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
This is useful to turn off explicit module load and unload operations on modular kernels. This option removes CAP_SYS_MODULE from the capability bounding set for the unit, and installs a system call filter to block module system calls. This option will not prevent the kernel from loading modules using the module auto-load feature which is a system wide operation.
This just adds capabilities test.
The rawio system calls were filtered, but CAP_SYS_RAWIO allows to access raw data through /proc, ioctl and some other exotic system calls...
…dules= Lets go further and make /lib/modules/ inaccessible for services that do not have business with modules, this is a minor improvment but it may help on setups with custom modules and they are limited... in regard of kernel auto-load feature. This change introduce NameSpaceInfo struct which we may embed later inside ExecContext but for now lets just reduce the argument number to setup_namespace() and merge ProtectKernelModules feature.
…mount propagation
596c6f1
to
4982dbc
Compare
@poettering updated thanks for the review. So still todo I guess on top of this PR:
Thanks |
On Wed, Oct 12, 2016 at 3:51 PM, Djalal Harouni notifications@github.com
The tests related to For the moment, we did not have to run specific C programs to execute |
Ok so current tests do that, thanks
I see, however some tests can be a bit complex so I was wondering what's the best location where we can put python script files ? Thank you! |
Currently, we can't do that. I just looked at |
Oh sorry, I misread you question. I thought you asked how to execute complex python scripts. I think |
@ronnychevalier ok thank you ! |
patch looks good, thanks for reworking this! Shall we merge this as is, or do you want to add tests before we do so? |
Ok, let me add tests later , thanks! |
mount("/run/systemd/inaccessible/dir", "/usr/lib/modules", ...) = -1 ENOENT 12037 unshare(CLONE_NEWNS) = 0
12037 mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) = 0
12037 open("/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 3</proc/12037/mountinfo>
12037 fstat(3</proc/12037/mountinfo>, {st_dev=makedev(0, 4), st_ino=605828, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2016/10/13-23:27:03.204544645, st_mtime=2016/10/13-23:27:03.204544645, st_ctime=2016/10/13-23:27:03.204544645}) = 0
12037 read(3</proc/12037/mountinfo>, "74 73 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,seclabel,data=ordered\n75 74 0:6 / /dev rw,nosuid master:2 - devtmpfs devtmpfs rw,seclabel,size=1013664k,nr_inodes=253416,mode=755\n76 75 0:18 / /dev/shm rw,nosuid,nodev master:3 - tmpfs tmpfs rw,seclabel\n77 75 0:19 / /dev/pts rw,nosuid,noexec,relatime master:4 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000\n78 75 0:36 / /dev/hugepages rw,relatime master:25 - hugetlbfs hugetlbfs rw,seclabel\n79 75 0:14 / /dev/mqueue rw,relatime master:2"..., 1024) = 1024
12037 read(3</proc/12037/mountinfo>, "l,mode=755\n85 84 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:9 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd\n86 84 0:24 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,cpu,cpuacct\n87 84 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,net_cls,net_prio\n88 84 0:26 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:12 - cgroup "..., 1024) = 1024
12037 read(3</proc/12037/mountinfo>, " rw,perf_event\n94 84 0:32 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,hugetlb\n95 84 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,memory\n96 82 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:20 - pstore pstore rw,seclabel\n97 82 0:34 / /sys/kernel/config rw,relatime master:21 - configfs configfs rw\n98 82 0:15 / /sys/fs/selinux rw,relatime master:22 - selinuxfs selinuxfs rw\n99 82 0:7 / /sys/kernel/de"..., 1024) = 767
12037 read(3</proc/12037/mountinfo>, "", 1024) = 0
12037 close(3</proc/12037/mountinfo>) = 0
12037 lstat("/usr/lib/modules", {st_dev=makedev(8, 1), st_ino=1575640, st_mode=S_IFDIR|0755, st_nlink=5, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2016/10/13-23:24:06.972930969, st_mtime=2016/10/09-16:58:41.408993736, st_ctime=2016/10/09-16:58:41.408993736}) = 0
12037 mount("/run/systemd/inaccessible/dir", "/usr/lib/modules", NULL, MS_BIND|MS_REC, NULL) = -1 ENOENT (No such file or directory)
12037 writev(2</dev/pts/0>, [{iov_base="exec-protectkernelmodules-yes-capabilities.service: Failed at step NAMESPACE spawning /bin/sh: No such file or directory", iov_len=120}, {iov_base="\n", iov_len=1}], 2) = 121 Seems like this is the root cause:
|
@evverx hey thanks for the report, I guess you are missing /run/systemd/inaccessible/dir due to systemd version ? so if you manually add |
@tixxdz , sure :) But I think we should fix the test. i.e. we should skip the test if
Anyway, this is not critical. |
@evverx aah hehe ok I see :-) , when time permits later, noted. |
In case of running test-execute on systems with systemd < v232, several tests like privatedevices or protectkernelmodules fail because /run/systemd/inaccessible/ is empty. In these cases, we should skip tests to avoid unnecessary errors. See also systemd#4243 (comment)
In case of running test-execute on systems with systemd < v232, several tests like privatedevices or protectkernelmodules fail because /run/systemd/inaccessible/ is empty. In these cases, we should skip tests to avoid unnecessary errors. See also systemd#4243 (comment)
In case of running test-execute on systems with systemd < v232, several tests like privatedevices or protectkernelmodules fail because /run/systemd/inaccessible/ doesn't exist. In these cases, we should skip tests to avoid unnecessary errors. See also systemd#4243 (comment)
In case of running test-execute on systems with systemd < v232, several tests like privatedevices or protectkernelmodules fail because /run/systemd/inaccessible/ doesn't exist. In these cases, we should skip tests to avoid unnecessary errors. See also #4243 (comment)
This is a work in progress to add ProtectKernelModules=, not ready to be
merged.
We need to find a solution for kernel module autoloading... /proc/sys/kernel/modprobe
That file is shared between all mount namespace since it's backed by
kernel.
modprobe can point to any exe that will be loaded by the usermodehelper
kernel thread, and no special capabilities are needed to write to
/proc/sys/kernel/modprobe... later the thread is started in the init mount
namespace context, which means right now we have to make a system wide
change to prevent kernel from calling /proc/sys/kernel/modprobe... A
first thought suggests we may check current before dispatching to the
kthread to check if seccomp blocks init_module() ? at least that should
work for a namespace setup without adding new fields to task_struct...
As a side note: we have to drop CAP_SYS_MODULE and CAP_NET_ADMIN which
also allow auto loading of netdev modules.
Another note: what to do about kexec ?
Anyway let this here for now...