core: add new RestrictNamespaces= unit file setting #4536

Merged
merged 1 commit into from Nov 9, 2016

Conversation

3 participants
@poettering
Member

poettering commented Nov 2, 2016

This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.

(This could be delayed for v233, but otoh hand is pretty self-contained, might be good enough for v232, too)

@keszybz

Looks good, and as a you wrote, this is pretty self-contained. But I think it'd be better to leave this for the next cycle, to give a chance to test this and shake out any wrinkles. If we merge this now, it'll be released without nay wider testing.

man/systemd.exec.xml
+ <varname>RestrictAddressFamilies=</varname>, <varname>RestrictNamespaces=</varname>,
+ <varname>PrivateDevices=</varname>, <varname>ProtectKernelTunables=</varname>,
+ <varname>ProtectKernelModules=</varname>, <varname>MemoryDenyWriteExecute=</varname>, or
+ <varname>RestrictRealtime=</varname> are specified. </para></listitem>

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

Spurious space after ".".

@keszybz

keszybz Nov 3, 2016

Member

Spurious space after ".".

man/systemd.exec.xml
+ about Linux namespaces, see
+ <citerefentry><refentrytitle>namespaces</refentrytitle><manvolnum>7</manvolnum></citerefentry>. Either takes a
+ boolean argument, or a space-separated list of namespace type identifiers. If false (the default), no
+ restrictions on namespaces are made. If true, access to any kind of namespacing is prohibited. If neither, a

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

"restrictions on namespace creation or switching"?

@keszybz

keszybz Nov 3, 2016

Member

"restrictions on namespace creation or switching"?

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

If neither → Otherwise

@keszybz

keszybz Nov 3, 2016

Member

If neither → Otherwise

man/systemd.exec.xml
+ restrictions on namespaces are made. If true, access to any kind of namespacing is prohibited. If neither, a
+ space-separated list of namespace type identifiers must be specified, consisting of any combination of:
+ <constant>cgroup</constant>, <constant>ipc</constant>, <constant>net</constant>, <constant>mnt</constant>,
+ <constant>pid</constant>, <constant>user</constant> and <constant>uts</constant>. Any namespace type listed is

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

Here too, what we are actually restricting, is the creation or switching of namespaces, not access to them.

@keszybz

keszybz Nov 3, 2016

Member

Here too, what we are actually restricting, is the creation or switching of namespaces, not access to them.

+
+ flag &= NAMESPACE_FLAGS_ALL;
+
+ for (i = 0; namespace_flag_map[i].name; i++)

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

Why not use ELEMENTSOF instead (and drop the terminating entry)?

@keszybz

keszybz Nov 3, 2016

Member

Why not use ELEMENTSOF instead (and drop the terminating entry)?

This comment has been minimized.

@poettering

poettering Nov 3, 2016

Member

i did that initially, but the array is exported, since the seccomp_restrict_namespaces() call from seccomp-util.c needs it. And for exported arrays the size cannot be determined via ELEMENTSOF() that easily, unless it is either specified explicitly in the header file (which I didn't like since it might very well get out of date). The other option is to export the size as a second exported (const) variable, but that didn't sound pretty either to me. Hence I opted for a "null" terminated list here, as this allows us to export this without specifiying the size in the header, and doesn't require us to add a second exported variable. And given that all algorithms using this iterate linearly through the list anyway it doesn't really matter anyway if they know the precise size in advance or if they just look for the final NULL...

(of course, another option would have been to move seccomp_restrict_namepsaces() into nsflags.c, but that's nasty due to the fact that seccomp-util.c is currently conditionally compiled depending on ehwther the libseccomp dep is on, and we'd then have to add something similar to nsflags.c which i really didn't like)

@poettering

poettering Nov 3, 2016

Member

i did that initially, but the array is exported, since the seccomp_restrict_namespaces() call from seccomp-util.c needs it. And for exported arrays the size cannot be determined via ELEMENTSOF() that easily, unless it is either specified explicitly in the header file (which I didn't like since it might very well get out of date). The other option is to export the size as a second exported (const) variable, but that didn't sound pretty either to me. Hence I opted for a "null" terminated list here, as this allows us to export this without specifiying the size in the header, and doesn't require us to add a second exported variable. And given that all algorithms using this iterate linearly through the list anyway it doesn't really matter anyway if they know the precise size in advance or if they just look for the final NULL...

(of course, another option would have been to move seccomp_restrict_namepsaces() into nsflags.c, but that's nasty due to the fact that seccomp-util.c is currently conditionally compiled depending on ehwther the libseccomp dep is on, and we'd then have to add something similar to nsflags.c which i really didn't like)

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

Ack.

@@ -1457,6 +1451,29 @@
</varlistentry>
+ <varlistentry>
+ <term><varname>RestrictNamespaces=</varname></term>

This comment has been minimized.

@keszybz

keszybz Nov 3, 2016

Member

I think there's potential for semantic confusion here. The options says "Restrict", and "yes" is about restricting, but individual values are about allowing. Let's compare this with DHCP: one could say that DHCP=no < DHCP=ipv4/6 < DHCP=yes, but here "yes" is not a superset of options. I see the appeal of calling this "Restrict*", but maybe "AllowNamespaces=" would be better (with a default of "yes", and "no" meaning complete refusal).

@keszybz

keszybz Nov 3, 2016

Member

I think there's potential for semantic confusion here. The options says "Restrict", and "yes" is about restricting, but individual values are about allowing. Let's compare this with DHCP: one could say that DHCP=no < DHCP=ipv4/6 < DHCP=yes, but here "yes" is not a superset of options. I see the appeal of calling this "Restrict*", but maybe "AllowNamespaces=" would be better (with a default of "yes", and "no" meaning complete refusal).

This comment has been minimized.

@poettering

poettering Nov 4, 2016

Member

I see your point but then again, I modelled this after RestrictAddressFamily= which implements a whitelist, too. Sure, this one also is a boolean, but the precedent of making a setting called "RestrictXYZ=" implementing a whitelist is already set...

The reason I used the word "restrict" for this option (and the realtime + address family ones) was because i wanted to clarify that this is opt-in stuff, and the default is the non-restrict one. An option called AllowNamespaces= OTOH would be an option that defaults to "true", which I always try to avoid I must say...

Not sure what we can do about this now. I think RestrictAF= should be a much stronger influence to this than the DHCP= stanza...

@poettering

poettering Nov 4, 2016

Member

I see your point but then again, I modelled this after RestrictAddressFamily= which implements a whitelist, too. Sure, this one also is a boolean, but the precedent of making a setting called "RestrictXYZ=" implementing a whitelist is already set...

The reason I used the word "restrict" for this option (and the realtime + address family ones) was because i wanted to clarify that this is opt-in stuff, and the default is the non-restrict one. An option called AllowNamespaces= OTOH would be an option that defaults to "true", which I always try to avoid I must say...

Not sure what we can do about this now. I think RestrictAF= should be a much stronger influence to this than the DHCP= stanza...

@keszybz keszybz added this to the v233 milestone Nov 3, 2016

core: add new RestrictNamespaces= unit file setting
This new setting permits restricting whether namespaces may be created and
managed by processes started by a unit. It installs a seccomp filter blocking
certain invocations of unshare(), clone() and setns().

RestrictNamespaces=no is the default, and does not restrict namespaces in any
way. RestrictNamespaces=yes takes away the ability to create or manage any kind
of namspace. "RestrictNamespaces=mnt ipc" restricts the creation of namespaces
so that only mount and IPC namespaces may be created/managed, but no other
kind of namespaces.

This setting should be improve security quite a bit as in particular user
namespacing was a major source of CVEs in the kernel in the past, and is
accessible to unprivileged processes. With this setting the entire attack
surface may be removed for system services that do not make use of namespaces.
@poettering

This comment has been minimized.

Show comment
Hide comment
@poettering

poettering Nov 4, 2016

Member

I have force pushed a new version now, and fixed the issues pointed out, except for the naming of the option and semantics thing you raised... I am note quite sure what the best approach here is on that... I see your point, but I don't like AllowXYZ= (i.e. an option that defaults to on) either...

Any more suggestions or ideas?

Member

poettering commented Nov 4, 2016

I have force pushed a new version now, and fixed the issues pointed out, except for the naming of the option and semantics thing you raised... I am note quite sure what the best approach here is on that... I see your point, but I don't like AllowXYZ= (i.e. an option that defaults to on) either...

Any more suggestions or ideas?

@jcrites

This comment has been minimized.

Show comment
Hide comment
@jcrites

jcrites Nov 8, 2016

For an alternative idea, from the peanut gallery: RestrictNamespacesTo=mnt ipc would convey the idea that mnt and ipc are allowed, and that by default nothing is restricted. Under this model, RestrictNamespacesTo=none could take away the ability to manage any kind of namespace, while RestrictNamespacesTo=all could be the default option. The downside is that this doesn't follow the convention established by RestrictAddressFamilies=. I'd agree that RestrictAddressFamilies= is the relevant precedent.

jcrites commented Nov 8, 2016

For an alternative idea, from the peanut gallery: RestrictNamespacesTo=mnt ipc would convey the idea that mnt and ipc are allowed, and that by default nothing is restricted. Under this model, RestrictNamespacesTo=none could take away the ability to manage any kind of namespace, while RestrictNamespacesTo=all could be the default option. The downside is that this doesn't follow the convention established by RestrictAddressFamilies=. I'd agree that RestrictAddressFamilies= is the relevant precedent.

@keszybz

This comment has been minimized.

Show comment
Hide comment
@keszybz

keszybz Nov 9, 2016

Member

I thought about appending "To", but I'd reject it for the same reason you mention: it's does not follow the precedent anyway, and if we break with precedent, AllowNamespaces= would be a better option.

Member

keszybz commented Nov 9, 2016

I thought about appending "To", but I'd reject it for the same reason you mention: it's does not follow the precedent anyway, and if we break with precedent, AllowNamespaces= would be a better option.

@keszybz

keszybz approved these changes Nov 9, 2016

This looks good code-wise, so let's merge. Thorough testing is advised ;)

+static inline bool exec_context_restrict_namespaces_set(const ExecContext *c) {
+ assert(c);
+
+ return (c->restrict_namespaces & NAMESPACE_FLAGS_ALL) != NAMESPACE_FLAGS_ALL;

This comment has been minimized.

@keszybz

keszybz Nov 9, 2016

Member

c->restrict_namespaces != NAMESPACE_FLAGS_ALL should be equivalent, but maybe this version is better against future changes...

@keszybz

keszybz Nov 9, 2016

Member

c->restrict_namespaces != NAMESPACE_FLAGS_ALL should be equivalent, but maybe this version is better against future changes...

+ if (!name) {
+ *ret = 0;
+ return 0;
+ }

This comment has been minimized.

@keszybz

keszybz Nov 9, 2016

Member

This part is not necessary, extract_first_word is fine with a NULL arg.

@keszybz

keszybz Nov 9, 2016

Member

This part is not necessary, extract_first_word is fine with a NULL arg.

@keszybz keszybz merged commit d85a0f8 into systemd:master Nov 9, 2016

5 checks passed

default Build finished.
Details
semaphoreci The build passed on Semaphore.
Details
xenial-amd64 autopkgtest finished (success)
Details
xenial-i386 autopkgtest finished (success)
Details
xenial-s390x autopkgtest finished (success)
Details
@poettering

This comment has been minimized.

Show comment
Hide comment
@poettering

poettering Nov 23, 2016

Member

This fixed #4043 btw.

Member

poettering commented Nov 23, 2016

This fixed #4043 btw.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment