Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core:sandbox: Add ProtectKernelModules= and some fixes #4243

Conversation

tixxdz
Copy link
Member

@tixxdz tixxdz commented Sep 29, 2016

This is a work in progress to add ProtectKernelModules=, not ready to be
merged.

We need to find a solution for kernel module autoloading... /proc/sys/kernel/modprobe
That file is shared between all mount namespace since it's backed by
kernel.

modprobe can point to any exe that will be loaded by the usermodehelper
kernel thread, and no special capabilities are needed to write to
/proc/sys/kernel/modprobe... later the thread is started in the init mount
namespace context, which means right now we have to make a system wide
change to prevent kernel from calling /proc/sys/kernel/modprobe... A
first thought suggests we may check current before dispatching to the
kthread to check if seccomp blocks init_module() ? at least that should
work for a namespace setup without adding new fields to task_struct...

As a side note: we have to drop CAP_SYS_MODULE and CAP_NET_ADMIN which
also allow auto loading of netdev modules.

Another note: what to do about kexec ?

Anyway let this here for now...

@tixxdz tixxdz changed the title core:sandbox: Add ProtectKernelModules= [WIP] core:sandbox: Add ProtectKernelModules= Sep 29, 2016
@martinpitt martinpitt added pid1 reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Sep 30, 2016
if (streq(set->set_name, "@module")) {
syscalls_found = true;
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I don't think it is worth relying on the syscall_filter_sets structure for this. I'd probably just hardcode the three syscalls here, it's not that many, and it's very unlikely it will change anytime soon. Also, the linear search goes away then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok done, but I think we may redo it later...

  1. we may want to show filtered syscall through D-Bus per group , 2) currently we don't show system calls that are filtered through other options in D-Bus systemcallfilter property, we only show the ones that were explicitly provided, not the syscalls that we have added... I already have a fix for this one, this way it will be easy to debug since at linuxcon everyone was asking how to debug... but lets delay it for now. I only want to be consistent. ok.

@@ -3364,6 +3364,11 @@ int unit_patch_contexts(Unit *u) {
if (ec->private_devices)
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_MKNOD);

if (ec->protect_kernel_modules) {
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_SYS_MODULE);
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_NET_ADMIN);
Copy link
Member

@poettering poettering Oct 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not combine this into one line:

ec->capability_bounding_set &= ~((UINT64_C(1) << CAP_SYS_MODULE) | (UINT64_C(1) << CAP_NET_ADMIN));

@poettering
Copy link
Member

looks pretty good already, thanks for looking into this!

I think the kernel's own module auto-loading thing is something we can ignore (well, we should document that this exist, but it shouldn't affec this new option), simply as this is configured strictly within the domain of the kernel, and not of the service. Or to say this differently: if a service manages to somehow trigger module auto-loading of a module it should not have been able to then this indicates a general kernel problem that needs to be fixed in the kernel and its mapping files, not in the service.

This new option should primarily protect that services cannot insert arbitrary code of their own choice I think. If the kernel auto-loads some specifically selected stuff I think this may be considered OK and orthogonal to this new option.

hence: a comment in the man page about this should be added, but otherwise we should just accept it.

@poettering
Copy link
Member

hmm, how did you come to the conclusion that we need to block CAP_NET_ADMIN too? have a link about that?

@poettering
Copy link
Member

regarding kexec. i'd claim that's a different concept, unrelated to kmod. It might make sense to add another option for it though. Maybe a generic RestrictReboot= which would make the reboot() and kexec_load() syscalls unavailable, as well as related files in /proc and /sys which may be used to trigger reboots, plus maybe kill(1, SIGINT) (and a couple of other signals). It wouldn't be perfect (dbus, ...) but I think it could get us quite far...

@tixxdz tixxdz force-pushed the djalal/sandbox-first-protection-kernelmodules-v1 branch from abad134 to c0d2516 Compare October 8, 2016 15:55
@tixxdz tixxdz changed the title [WIP] core:sandbox: Add ProtectKernelModules= core:sandbox: Add ProtectKernelModules= and some fixes Oct 8, 2016
@tixxdz
Copy link
Member Author

tixxdz commented Oct 8, 2016

@poettering

looks pretty good already, thanks for looking into this!

Updated.

I think the kernel's own module auto-loading thing is something we can ignore (well, we should document that this exist, but it shouldn't affec this new option), simply as this is configured strictly within the domain of the kernel, and not of the service. Or to say this differently: if a service manages to somehow trigger module auto-loading of a module it should not have been able to then this indicates a general kernel problem that needs to be fixed in the kernel and its mapping files, not in the service.
This new option should primarily protect that services cannot insert arbitrary code of their own choice I think. If the kernel auto-loads some specifically selected stuff I think this may be considered OK and orthogonal to this new option.
hence: a comment in the man page about this should be added, but otherwise we should just accept it.

Ok done, after some digging I'm not sure how hard it will be to make usermodhelper of the kernel understand namespaces... at least that way we can protect and bind mount inaccessible files over modprobe path... anyway documented now.

hmm, how did you come to the conclusion that we need to block CAP_NET_ADMIN too? have a link about that?

http://lxr.free-electrons.com/source/net/core/dev_ioctl.c#L359

netdev aliases were added later to cap_net_admin so you can't load arbitrary modules with it IIRC ...

regarding kexec. i'd claim that's a different concept, unrelated to kmod. It might make sense to add another option for it though. Maybe a generic RestrictReboot= which would make the reboot() and kexec_load() syscalls unavailable, as well as related files in /proc and /sys which may be used to trigger reboots, plus maybe kill(1, SIGINT) (and a couple of other signals). It wouldn't be perfect (dbus, ...) but I think it could get us quite far...

Ok, will try to see, and for this PR if you agree on the ProtectKernelModules= name then lets merge it if code is ok, otherwise I'm ok with DenyKernelModules= or anything.

I have added some other changes on top of this PR. Thanks for the review!

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good to me, I like the idea of blocking /usr/lib/modules a lot.

I'd really would like to avoid the CAP_NET_ADMIN blockage however, I have the suspicion that this is unnecessary (if documented properly) and overreaching for many cases, as this would mean that a lot of network-facing code that would benefit from this option couldn't use it. See my comments above.

@@ -3369,6 +3369,9 @@ int unit_patch_contexts(Unit *u) {
if (ec->private_devices)
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_MKNOD);

if (ec->protect_kernel_modules)
ec->capability_bounding_set &= ~((UINT64_C(1) << CAP_SYS_MODULE) | (UINT64_C(1) << CAP_NET_ADMIN));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, I think we should not take CAP_NET_ADMIN away here, but only CAP_SYS_MODULE.

The rationale for this: if you have CAP_NET_ADMIN, then you can only load modules prefixed with netdev- but not arbitrary modules. And as such it really is more in the territory of module auto-loading, not of explicit loading. I think this patch should not bother window module auto-loading, as problems with that are limited to specific sets of kernel module names, that usually share a common prefix, and are always subject to configuration in the modules configuration or mapping files.

Yes, this deserves documentation, but only in the context of "this setting only blocks explicit module loading. Limited automatic module loading because of user configuration or kernel mapping tables might still happen as side effect of requested user operations, both privileged and unprivileged." or something like that.

@@ -3367,7 +3367,7 @@ int unit_patch_contexts(Unit *u) {
ec->no_new_privileges = true;

if (ec->private_devices)
ec->capability_bounding_set &= ~(UINT64_C(1) << CAP_MKNOD);
ec->capability_bounding_set &= ~((UINT64_C(1) << CAP_MKNOD) | (UINT64_C(1) << CAP_SYS_RAWIO));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one makes a ton of sense indeed!

@@ -1035,7 +1035,10 @@
boot-time, with the <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry>
mechanism. Almost no services need to write to these at runtime; it is hence recommended to turn this on for
most services. For this setting the same restrictions regarding mount propagation and privileges apply as for
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.</para></listitem>
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.
Note that this option does not prevent kernel tunning through IPC interfaces and exeternal programs. However
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tunning → tuning

<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.</para></listitem>
<varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off.
Note that this option does not prevent kernel tunning through IPC interfaces and exeternal programs. However
<varname>InaccessiblePaths=</varname> can be used to make IPC file system objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make IPC file system objects → make some IPC file system objects

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -45,6 +45,14 @@ int main(int argc, char *argv[]) {
"/home/lennart/projects",
NULL
};

const NameSpaceContext ns_ctx = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also be "static" I figure?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -40,15 +43,20 @@ typedef enum ProtectSystem {
_PROTECT_SYSTEM_INVALID = -1
} ProtectSystem;

struct NameSpaceContext {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe call this NamespaceExtras or so?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NameSpaceInfo

bool private_dev;
bool protect_control_groups;
bool protect_kernel_tunables;
bool protect_kernel_modules;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these could become bitflags? i.e. gain a :1 each?

<filename>/usr/lib/modules</filename> and other related module directories
are made inaccessible. For this setting the same restrictions regarding mount
propagation and privileges apply as for <varname>ReadOnlyPaths=</varname>
and related calls, see above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the concept! Makes sense to block this, indeed!

{ "/lib/modules", INACCESSIBLE, true },
{ "/lib64/modules", INACCESSIBLE, true },
{ "/usr/lib/modules", INACCESSIBLE, true },
{ "/usr/lib64/modules", INACCESSIBLE, true },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so, the kernel modules are never in /usr/lib64. I am pretty sure we should only look for /usr/lib here, not the 64bit counterpart, after all the lib64 thing is only necessary for multilib implementations, and lib64 should pretty much only include .so libraries, and nothing else, in particular not kernels or their kmods.

I also think we should not add /lib/modules to the block list here, except if HAVE_SPLIT_USR is undefined. i.e. on joined /usr, we shouldn't bother with the dir at all...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

This is useful to turn off explicit module load and unload operations on modular
kernels. This option removes CAP_SYS_MODULE from the capability bounding set for
the unit, and installs a system call filter to block module system calls.

This option will not prevent the kernel from loading modules using the module
auto-load feature which is a system wide operation.
This just adds capabilities test.
The rawio system calls were filtered, but CAP_SYS_RAWIO allows to access raw
data through /proc, ioctl and some other exotic system calls...
…dules=

Lets go further and make /lib/modules/ inaccessible for services that do
not have business with modules, this is a minor improvment but it may
help on setups with custom modules and they are limited... in regard of
kernel auto-load feature.

This change introduce NameSpaceInfo struct which we may embed later
inside ExecContext but for now lets just reduce the argument number to
setup_namespace() and merge ProtectKernelModules feature.
@tixxdz tixxdz force-pushed the djalal/sandbox-first-protection-kernelmodules-v1 branch from 596c6f1 to 4982dbc Compare October 12, 2016 13:48
@tixxdz
Copy link
Member Author

tixxdz commented Oct 12, 2016

@poettering updated thanks for the review. So still todo I guess on top of this PR:

  1. Document CAP_NET_ADMIN and netdev prefix that we consider them part of implicit or kernel module autoload feature ? I could not come up with the correct wording for now... will see later.

  2. question: how to test seccomp filters ? I mean where to put a simple c or python program that do syscalls ? and how to call it using unit files or transient units which is the best way ?

Thanks

@ronnychevalier
Copy link
Member

ronnychevalier commented Oct 12, 2016

On Wed, Oct 12, 2016 at 3:51 PM, Djalal Harouni notifications@github.com
wrote:

  1. question: how to test seccomp filters ? I mean where to put a simple c
    or python program that do syscalls ? and how to call it using unit files or
    transient units which is the best way ?

The tests related to systemd.exec, like system call filters, are located
in src/test/test-execute.c. All the unit files for test-execute are
located in test/test-execute/. So add your unit file that needs to be
tested in test/test-execute/ and add a line in src/test/test-execute.c
in the test_exec_systemcallfilter function.

For the moment, we did not have to run specific C programs to execute
particular syscalls, we used common programs. But, I guess that calling
/usr/bin/python -c "a = 1" in ExecStart should not be an issue since we
already use python for some tests.

@tixxdz
Copy link
Member Author

tixxdz commented Oct 12, 2016

@ronnychevalier

The tests related to systemd.exec, like system call filters, are located
in src/test/test-execute.c. All the unit files for test-execute are
located in test/test-execute/. So add your unit file that needs to be
tested in test/test-execute/ and add a line in src/test/test-execute.c
in the test_exec_systemcallfilter function.

Ok so current tests do that, thanks

For the moment, we did not have to run specific C programs to execute
particular syscalls, we used common programs. But, I guess that calling
/usr/bin/python -c "a = 1" in ExecStart should not be an issue since we
already use python for some tests.

I see, however some tests can be a bit complex so I was wondering what's the best location where we can put python script files ?

Thank you!

@ronnychevalier
Copy link
Member

I see, however some tests can be a bit complex so I was wondering what's the best location where we can put python script files ?

Currently, we can't do that.
I guess a solution would be to put the path of your python scripts (located in test/test-execute/ probably) in your unit files with a .in file first: ExecStart=/usr/bin/python @TEST_DIR@/test-execute/myscript.py
Like how it is done for the unit files in units/.

I just looked at Makefile.am and TEST_DIR is not in the substitutions, so if you add $(abs_top_srcdir)/test for TEST_DIR, it should work.

@ronnychevalier
Copy link
Member

Oh sorry, I misread you question. I thought you asked how to execute complex python scripts.

I think test/text-execute/ is the best location. Its goal is to store any data needed by the tests related to test-execute. Since the python script will be bound with the unit file it should be in the same directory or subdir?

@tixxdz
Copy link
Member Author

tixxdz commented Oct 12, 2016

@ronnychevalier ok thank you !

@poettering
Copy link
Member

patch looks good, thanks for reworking this! Shall we merge this as is, or do you want to add tests before we do so?

@poettering poettering removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Oct 13, 2016
@tixxdz
Copy link
Member Author

tixxdz commented Oct 13, 2016

Ok, let me add tests later , thanks!

@poettering poettering merged commit 8bfdf29 into systemd:master Oct 13, 2016
@evverx
Copy link
Member

evverx commented Oct 13, 2016

@tixxdz

$ sudo ./libtool --mode=execute ./test-execute
...
exec-protectkernelmodules-yes-capabilities.service: Failed at step NAMESPACE spawning /bin/sh: No such file or directory
exec-protectkernelmodules-yes-capabilities.service: Main process exited, code=exited, status=226/NAMESPACE
exec-protectkernelmodules-yes-capabilities.service: Unit entered failed state.
exec-protectkernelmodules-yes-capabilities.service: Failed with result 'exit-code'.
Assertion 'service->main_exec_status.status == status_expected' failed at src/test/test-execute.c:69, function check(). Aborting.
...

$ uname -r
4.7.6-200.fc24.x86_64

mount("/run/systemd/inaccessible/dir", "/usr/lib/modules", ...) = -1 ENOENT

12037 unshare(CLONE_NEWNS)              = 0
12037 mount(NULL, "/", NULL, MS_REC|MS_SLAVE, NULL) = 0
12037 open("/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 3</proc/12037/mountinfo>
12037 fstat(3</proc/12037/mountinfo>, {st_dev=makedev(0, 4), st_ino=605828, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=1024, st_blocks=0, st_size=0, st_atime=2016/10/13-23:27:03.204544645, st_mtime=2016/10/13-23:27:03.204544645, st_ctime=2016/10/13-23:27:03.204544645}) = 0
12037 read(3</proc/12037/mountinfo>, "74 73 8:1 / / rw,relatime master:1 - ext4 /dev/sda1 rw,seclabel,data=ordered\n75 74 0:6 / /dev rw,nosuid master:2 - devtmpfs devtmpfs rw,seclabel,size=1013664k,nr_inodes=253416,mode=755\n76 75 0:18 / /dev/shm rw,nosuid,nodev master:3 - tmpfs tmpfs rw,seclabel\n77 75 0:19 / /dev/pts rw,nosuid,noexec,relatime master:4 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000\n78 75 0:36 / /dev/hugepages rw,relatime master:25 - hugetlbfs hugetlbfs rw,seclabel\n79 75 0:14 / /dev/mqueue rw,relatime master:2"..., 1024) = 1024
12037 read(3</proc/12037/mountinfo>, "l,mode=755\n85 84 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime master:9 - cgroup cgroup rw,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd\n86 84 0:24 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime master:10 - cgroup cgroup rw,cpu,cpuacct\n87 84 0:25 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime master:11 - cgroup cgroup rw,net_cls,net_prio\n88 84 0:26 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime master:12 - cgroup "..., 1024) = 1024
12037 read(3</proc/12037/mountinfo>, " rw,perf_event\n94 84 0:32 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime master:18 - cgroup cgroup rw,hugetlb\n95 84 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime master:19 - cgroup cgroup rw,memory\n96 82 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime master:20 - pstore pstore rw,seclabel\n97 82 0:34 / /sys/kernel/config rw,relatime master:21 - configfs configfs rw\n98 82 0:15 / /sys/fs/selinux rw,relatime master:22 - selinuxfs selinuxfs rw\n99 82 0:7 / /sys/kernel/de"..., 1024) = 767
12037 read(3</proc/12037/mountinfo>, "", 1024) = 0
12037 close(3</proc/12037/mountinfo>)   = 0
12037 lstat("/usr/lib/modules", {st_dev=makedev(8, 1), st_ino=1575640, st_mode=S_IFDIR|0755, st_nlink=5, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=4096, st_atime=2016/10/13-23:24:06.972930969, st_mtime=2016/10/09-16:58:41.408993736, st_ctime=2016/10/09-16:58:41.408993736}) = 0
12037 mount("/run/systemd/inaccessible/dir", "/usr/lib/modules", NULL, MS_BIND|MS_REC, NULL) = -1 ENOENT (No such file or directory)
12037 writev(2</dev/pts/0>, [{iov_base="exec-protectkernelmodules-yes-capabilities.service: Failed at step NAMESPACE spawning /bin/sh: No such file or directory", iov_len=120}, {iov_base="\n", iov_len=1}], 2) = 121

Seems like this is the root cause:

$ systemctl --version
systemd 229
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN

@tixxdz
Copy link
Member Author

tixxdz commented Oct 14, 2016

@evverx hey thanks for the report, I guess you are missing /run/systemd/inaccessible/dir due to systemd version ? so if you manually add /run/systemd/inaccessible/dir and maybe make the appropriate permissions then it should work ?

@evverx
Copy link
Member

evverx commented Oct 14, 2016

@tixxdz , sure :) But I think we should fix the test. i.e. we should skip the test if

  • systemd is not running
  • /run/systemd/inaccessible/dir doesn't exists

Anyway, this is not critical.
Thanks!

@tixxdz
Copy link
Member Author

tixxdz commented Oct 14, 2016

@evverx aah hehe ok I see :-) , when time permits later, noted.

dongsupark pushed a commit to endocode/systemd that referenced this pull request Oct 24, 2016
In case of running test-execute on systems with systemd < v232, several
tests like privatedevices or protectkernelmodules fail because
/run/systemd/inaccessible/ is empty. In these cases, we should skip
tests to avoid unnecessary errors.

See also systemd#4243 (comment)
dongsupark pushed a commit to endocode/systemd that referenced this pull request Oct 24, 2016
In case of running test-execute on systems with systemd < v232, several
tests like privatedevices or protectkernelmodules fail because
/run/systemd/inaccessible/ is empty. In these cases, we should skip
tests to avoid unnecessary errors.

See also systemd#4243 (comment)
dongsupark pushed a commit to endocode/systemd that referenced this pull request Oct 25, 2016
In case of running test-execute on systems with systemd < v232, several
tests like privatedevices or protectkernelmodules fail because
/run/systemd/inaccessible/ doesn't exist. In these cases, we should skip
tests to avoid unnecessary errors.

See also systemd#4243 (comment)
evverx pushed a commit that referenced this pull request Oct 25, 2016
In case of running test-execute on systems with systemd < v232, several
tests like privatedevices or protectkernelmodules fail because
/run/systemd/inaccessible/ doesn't exist. In these cases, we should skip
tests to avoid unnecessary errors.

See also #4243 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

None yet

5 participants