network: support network namespace #14915

Draft: yuwata wants to merge 17 commits into main

@yuwata (Member) commented Feb 21, 2020

Closes #11103.

@yuwata yuwata added the network label Feb 21, 2020
@yuwata yuwata force-pushed the network-netns branch 2 times, most recently from a57efcd to 287960b Compare February 21, 2020 10:33
@yuwata (Member, Author) commented Feb 21, 2020

TODO:

@DaanDeMeyer (Contributor) commented

First of all, great to see support for this being added!

I haven't looked in detail, but I'm wondering if we can't make this stuff a little more generic. Why limit the namespace stuff to systemd-networkd and networkctl? Shouldn't we consider making separate namespace units instead (ipc, mount, future namespaces)? Of course, adding those will require extra work compared to this solution but I can come up with some use cases where we want to run programs in a separate mount or ipc namespace as well (aside from only network).

@yuwata yuwata force-pushed the network-netns branch 8 times, most recently from 2019173 to f366c1d Compare February 25, 2020 17:26
@yuwata (Member, Author) commented Feb 25, 2020

@DaanDeMeyer Could you elaborate? I think supporting network namespaces in networkd is non-trivial and needs a fair amount of work, whereas the other namespace types are trivial: just modifying the unit file should be enough, though I have not tested that.

@DaanDeMeyer (Contributor) commented Feb 25, 2020

I wrote out a massive wall of text but it comes down to this:

Instead of adding systemd-networkd-create-netns@ and networkctl netns-create, can't we simply modify NetworkNamespacePath to create the network namespace if it doesn't exist? It could use the same method you used in verb_netns_create. The only thing I'm not sure of is when to remove the bind mount to make sure the namespace gets cleaned up when all processes in it exit. Maybe each process could add a bind mount on top of the existing ones so you have a stack of bind mounts that gets unwound as processes exit one by one? (Although I doubt that's a good approach). Aside from NetworkNamespacePath, other namespace types could have a similar setting.

If we can't manage the lifetime of namespaces without a separate unit like systemd-networkd-create-netns, maybe this needs to be a general systemd concept instead of a one-off implementation in systemd-networkd-create-netns@ and networkctl?

@yuwata yuwata force-pushed the network-netns branch 2 times, most recently from ddf91d3 to 208cf89 Compare February 27, 2020 11:35
@yuwata (Member, Author) commented Feb 28, 2020

> Instead of adding systemd-networkd-create-netns@ and networkctl netns-create, can't we simply modify NetworkNamespacePath to create the network namespace if it doesn't exist?

I like that idea. I will try to implement that. Thanks.

@yuwata yuwata force-pushed the network-netns branch 3 times, most recently from e338f43 to b5db70a Compare March 2, 2020 16:52
@yuwata yuwata marked this pull request as ready for review March 2, 2020 17:00
@yuwata yuwata changed the title from "WIP: network: support network namespace" to "network: support network namespace" Mar 2, 2020
@DaanDeMeyer (Contributor) left a comment

I reviewed the parts related to my earlier suggestion since I'm somewhat familiar with those.

src/core/namespace.c (outdated, resolved)
@@ -294,6 +294,29 @@ s - Service VLAN, m - Two-port MAC Relay (TPMR)
which match the file are reconfigured.</para></listitem>
</varlistentry>

<varlistentry>
<term>
<command>netns-create</command>
Contributor

Do we still need these commands if NetworkNamespacePath creates the network namespace automatically?

@yuwata (Member, Author), Mar 16, 2020

Maybe useful for systemd-nspawn --network-namespace-path=?

Contributor

I'm not sure; I think the existing tools unshare, lsns and umount cover these use cases well enough. Of course, if you or someone else feels we should have these, then by all means let's add them. It doesn't hurt anyone if networkctl has a few extra commands.

Member

I'd drop them, really.

man/systemd.exec.xml (outdated, resolved)
man/systemd.network.xml (outdated, resolved)
src/basic/fd-util.c (resolved)
src/core/namespace.c (outdated, resolved)
(void) loopback_setup();

/* Mount the new netns onto the path. */
if (mount("/proc/self/ns/net", path, "none", MS_BIND, NULL) < 0) {
Contributor

Do we ever clean this up somewhere? Won't we end up with a namespace leak if we keep the file bind-mounted somewhere?

Member Author

We cannot know when the file is no longer in use, so I think we must 'leak' it. (It can be removed manually with the networkctl remove-netns command.)

Contributor

Maybe there's a way to keep track of whether there are any services running in the network namespace (or just a namespace in general to be a little more generic)? As soon as the count reaches zero, we unmount the file. Of course, that does not keep track of any processes not spawned by systemd but that seems like a tradeoff we could make (and explicitly document).

The idea is that we only need to have the file mounted if there is at least one other service running in the namespace. If the last service exits, we can safely remove the mount and create the mount again once another service starts with the same namespace path.
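The counting idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `NetnsTable`, `netns_ref()` and `netns_unref()` are hypothetical names that do not exist in systemd, the table is a fixed-size array for brevity, and the actual mount()/umount() calls are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NETNS_MAX_ENTRIES 16
#define NETNS_MAX_PATH 256

/* Hypothetical bookkeeping, not systemd code: one entry per namespace
 * file that the manager bind-mounted itself. */
typedef struct NetnsRef {
        char path[NETNS_MAX_PATH];
        unsigned n_users;
} NetnsRef;

typedef struct NetnsTable {
        NetnsRef entries[NETNS_MAX_ENTRIES];
        size_t n_entries;
} NetnsTable;

static NetnsRef *netns_find(NetnsTable *t, const char *path) {
        for (size_t i = 0; i < t->n_entries; i++)
                if (strcmp(t->entries[i].path, path) == 0)
                        return &t->entries[i];
        return NULL;
}

/* A service using 'path' starts. Returns 1 if it is the first user (the
 * caller would create the bind mount now), 0 if the mount is already
 * tracked, -1 if the table is full or the path is too long. */
static int netns_ref(NetnsTable *t, const char *path) {
        NetnsRef *e;

        assert(t);
        assert(path);

        e = netns_find(t, path);
        if (e) {
                e->n_users++;
                return 0;
        }

        if (t->n_entries >= NETNS_MAX_ENTRIES || strlen(path) >= NETNS_MAX_PATH)
                return -1;

        e = &t->entries[t->n_entries++];
        strcpy(e->path, path);
        e->n_users = 1;
        return 1;
}

/* A service using 'path' stops. Returns 1 if it was the last user (the
 * caller would umount() the file now), 0 if others remain, -1 if the
 * path is not tracked. */
static int netns_unref(NetnsTable *t, const char *path) {
        NetnsRef *e;

        assert(t);
        assert(path);

        e = netns_find(t, path);
        if (!e)
                return -1;

        if (--e->n_users > 0)
                return 0;

        /* Drop the entry by moving the last one into its slot. */
        *e = t->entries[--t->n_entries];
        return 1;
}
```

As noted in the comment, such a count only covers services started by the manager; processes that join the namespace by other means would not be tracked.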

The only thing I'm not sure about is how hard it is to add the machinery required to keep track of which services are running in which persistent namespaces.

On the other hand, leaking the file might not be a huge problem. Although I'm a bit worried that if many unique instances of a template service are spawned and they use the template parameter in the network namespace path that we get a lot of namespace mounts lying around. I have a feeling someone would hit some kind of limit sooner or later and get a rather obscure error if we don't clean the mounts ourselves.

Maybe @poettering knows more?

@yuwata (Member, Author), Mar 17, 2020

I noticed that removing the netns file causes a backward compatibility issue. Previously, when a service with NetworkNamespacePath= was stopped, the netns file still existed and could be used by any later commands or services. That behavior must be preserved. So, even if pid1 creates the file, the file must be 'leaked'.

Contributor

Well yes, but that's only because the netns file was created explicitly by the user beforehand no? It makes sense then that they can rely on the file being there after the service stops.

However, if the file doesn't exist, I think we currently hard-fail so I think that means we have an avenue for adding new behaviour any way we want to without having to worry about backwards compatibility. In other words, there's no way users can depend on netns files created by systemd being there after the service stops since at the moment systemd never creates any netns files in the first place.

If the user currently creates the netns file, of course he's responsible for cleaning it up afterwards. However, if we now start creating the netns files ourselves, I think that makes us responsible for cleaning them up as well.

Contributor

I wonder if /var/run/netns mountpoint could be handled by systemd .mount & service lifetime/dependencies.

Member

I'd probably not actively use .mount unit for this kind of mount point. It's not really a file system after all, but more a weird API of the kernel to make namespaces persistent...

I'd just add NetworkNamespaceMode=join|create|sticky instead, as suggested elsewhere.

}

/* Restore the original netns. */
if (setns(old_netns, CLONE_NEWNET) < 0)
Contributor

I know @poettering asked me to do namespace changes in a child process to make cleanup easier in an nspawn PR review. I'm not sure if this is a code path where we can afford to fork a child process but if we can, it might be easier to fork, unshare and mount instead of having logic everywhere to move back to the original network namespace.

Member Author

I will re-consider this point later. Thanks.

Member

yes, as mentioned elsewhere, let's do that in a child process. i.e. clone() with CLONE_NEWNET, and pass netns fd up to parent again
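A rough sketch of the child-process approach, so that the caller never has to switch back to its original namespace. It differs from the wording above in two labeled ways: it uses fork() plus unshare(CLONE_NEWNET) rather than clone() with CLONE_NEWNET, and instead of passing the fd up over a socket the parent opens the child's /proc/PID/ns/net while the child is kept alive. The function name `netns_new_via_child()` is hypothetical, and creating the namespace normally requires CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void close_pair(int p[2]) {
        close(p[0]);
        close(p[1]);
}

/* Returns an O_CLOEXEC fd referring to a freshly created network
 * namespace, or a negative errno-style value. The calling process never
 * leaves its own namespace. */
static int netns_new_via_child(void) {
        int to_parent[2], to_child[2], status = 0, fd = -1;
        char path[64], c;
        pid_t pid;

        if (pipe(to_parent) < 0)
                return -errno;
        if (pipe(to_child) < 0) {
                status = errno;
                close_pair(to_parent);
                return -status;
        }

        pid = fork();
        if (pid < 0) {
                status = errno;
                close_pair(to_parent);
                close_pair(to_child);
                return -status;
        }

        if (pid == 0) {
                /* Child: enter a new netns and report the result. */
                close(to_parent[0]);
                close(to_child[1]);

                if (unshare(CLONE_NEWNET) < 0)
                        status = errno;
                if (write(to_parent[1], &status, sizeof status) != (ssize_t) sizeof status)
                        _exit(1);

                /* Keep the namespace alive until the parent closes its end. */
                while (read(to_child[0], &c, 1) > 0)
                        ;
                _exit(status == 0 ? 0 : 1);
        }

        close(to_parent[1]);
        close(to_child[0]);

        if (read(to_parent[0], &status, sizeof status) != (ssize_t) sizeof status)
                status = EIO;

        if (status == 0) {
                snprintf(path, sizeof path, "/proc/%d/ns/net", (int) pid);
                fd = open(path, O_RDONLY | O_CLOEXEC);
                if (fd < 0)
                        status = errno;
        }

        close(to_parent[0]);
        close(to_child[1]); /* EOF for the child, which then exits */
        (void) waitpid(pid, NULL, 0);

        assert(status != 0 || fd >= 0);
        return status == 0 ? fd : -status;
}
```

The returned fd keeps the namespace alive after the child exits; the caller could then bind-mount it onto a path or hand it to setns() in whichever process is supposed to join it.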

<varlistentry>
<term><varname>Namespace=</varname></term>
<listitem>
<para>Takes a name or an abosolute file system path referring to a Linux network namespace

Suggested change
- <para>Takes a name or an abosolute file system path referring to a Linux network namespace
+ <para>Takes a name or an absolute file system path referring to a Linux network namespace

@poettering (Member) left a comment

I love this a lot! Excellent work!

r = netlink_message_read_internal(m, type, &attr_data, NULL);
if (r < 0)
return r;
else if ((size_t) r < sizeof(int8_t))
Member

doesn't matter much, but you could just drop the word else here

r = netlink_message_read_internal(m, type, &attr_data, &net_byteorder);
if (r < 0)
return r;
else if ((size_t) r < sizeof(int16_t))
Member

as above

</term>
<listitem><para>List files referring to Linux network namespace pseudo-files. If no path is
specified, then the files in <filename>/var/run/netns</filename> are shown.</para></listitem>
</varlistentry>
Member

hmm, do we want these really? I think it would be much nicer if you allocate a new netns simply by starting another networkd instance. i.e.

"systemctl start networkd@foobar.service" starts a new netns called foobar

Contributor

These commands basically mirror the ip netns list, ip netns add and ip netns delete commands. Not sure if they bring additional value. One could argue networkctl list also is just a more fancy ip l, and we could show more information here, too.

I don't think there always needs to be a separate networkd instance for each namespace "managed" by networkd.

There are two use cases here:

  1. In addition to the default network namespace and main networkd, having a separate networkd running in a separate network namespace with interfaces in that namespace.
    Don't we have systemd-nspawn for this?

  2. Main networkd in the default network namespace sets up a tunnel interface (wireguard, openvpn), but moves the created interface into a separate network namespace (creating if it doesn't exist already), possibly configuring addresses/(default) route(s) on that interface as well.
    People that want to pipe traffic "through that tunnel" invoke applications (ip netns exec $application) or whole graphical sessions in that namespace, but they don't really care about configuring more interfaces inside this namespace.

@g0tar (Contributor), Mar 3, 2021

> hmm, do we want these really? I think it would be much nicer if you allocate a new netns simply by starting another networkd instance. i.e.
>
> "systemctl start networkd@foobar.service" starts a new netns called foobar

Let's remember that this really should fit well into regular workflow with "personalities". I would expect that:

ip netns exec alt-namespace-2 /bin/zsh
networkctl

returns the network status of the namespace one is currently "logged in" to, just as the ip, iptables or tc tools do. Consider a scenario where an entire ssh server runs within some netns (in general: one ssh server per netns) and one logs directly into that netns. It should be possible to manage this "internal" (separated) network, including stopping it, without interfering with other namespaces.

This makes me think that maybe the entire networkd@some-netns.service should be started inside some-netns (just like the ssh instance in the example above), and the master (host) networkd.service should set up the namespaces and interfaces itself (like creating a veth pair and moving one end into the specified netns). This seems to mean that a .netdev with NetNS=x would be picked up by the main systemd.service (in order to set up the device+netns assignment) and by networkd@x.service (in order to match the device), while a .network with NetNS=x would belong to networkd@x.service only.

Then, the namespaced networkctl should display and manage only its own parts, while the external one could give access to all the children. Also, some day, systemctl could gain a feature to filter out namespace-bound services and disallow messing with the master or siblings - this is the way of partitioning a system and creating network-admin roles (known from Big Telco solutions; think of apache running within the web1 netns and user/group X allowed to restart every service within the web1 netns).

As for the (simplified) use case (long): I need to set up 4 netnses: bgp1, bgp2, lan1, lan2, interconnected with veth pairs. The bgp* personalities would run separate BGP instances (bird) on assigned VLANs, exchanging traffic with their respective lan* internal network BRASes. Each bgp* uses its own uplinks by default and provides backup transmission for the other, so in general traffic can go through 3 namespaces (lan1-bgp1-bgp2), while in normal conditions the flows are separated and can be accounted per instance. In a general setup there might be more than 2 "entity sets" and a "set" might contain more "boxes" (BGP, NAT/log, QoS/BRAS, IDS), while the "interface/VLAN" is stacked over some QinQ or LACP (802.1ad), VXLAN etc. Most of this can be accomplished by using "traditional" methods, like policy based routing, separate routing tables etc., but as the number of entities grows the configuration complicates exponentially as they start to heavily interfere. Using separate network namespaces, with their own routing tables, filtering rules (iptables), conntrack, DHCP servers etc. trades some performance for simplicity and manageability (like having different network-admins for different "sets"). Using full virtualization would result in a much larger performance penalty (and require an additional layer of OpenVSwitch). Containers are functionally equivalent, but require massive system (fs tree) duplication and would become upgrade management hell really fast; note the network-admins should only care about the rules (BGP, iptables, ipsets, QoS), not the system ("firmware") itself. Also, I do trust network-admins that they are not malicious and just want to make their life easier - working with hundreds of interfaces stacked in several layers is error-prone.

Currently systemd handles my (over?)complicated block-devices scheme well (drive-partition-LVM-integrity-mdadm-LUKS-VDO-LVM/thin_pool(snapshots)-mdadm (with top-level mdadm used for bitmap-assisted on-line backups of either snapshots or fsfrozen data via ISCSI@ipsec to a remote location)), and it would be really nice if the network stack could be handled as well.

</term>
<listitem><para>Create a file referring to a Linux network namespace pseudo-file. Takes a name
or an absolute path. If a name is provided, then the file is created at
<filename>/var/run/netns/<replaceable>NAME</replaceable></filename>.</para></listitem>
Member

I am not sure we should step into that directory, i.e. territory of iproute? is that public API even?

Also /var/run is a legacy alias for /run. Let's not use legacy names.
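For illustration, the "name or absolute path" rule from the documentation hunk above could look roughly like this once the non-legacy /run prefix is used. This is a sketch under assumptions: `netns_resolve_path()` and `NETNS_RUN_DIR` are hypothetical names, and whether this directory should be used at all is exactly the open question raised in this comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Assumed default directory: /run rather than the legacy /var/run alias. */
#define NETNS_RUN_DIR "/run/netns"

/* Resolves a namespace name or absolute path into 'ret'.
 * Returns 0 on success or a negative errno-style value. */
static int netns_resolve_path(const char *name_or_path, char *ret, size_t size) {
        int n;

        assert(name_or_path);
        assert(ret);

        if (name_or_path[0] == '\0')
                return -EINVAL;

        if (name_or_path[0] == '/')
                /* Absolute paths are taken as they are. */
                n = snprintf(ret, size, "%s", name_or_path);
        else {
                /* A bare name must be a single path component. */
                if (strchr(name_or_path, '/') ||
                    strcmp(name_or_path, ".") == 0 ||
                    strcmp(name_or_path, "..") == 0)
                        return -EINVAL;

                n = snprintf(ret, size, NETNS_RUN_DIR "/%s", name_or_path);
        }

        if (n < 0 || (size_t) n >= size)
                return -ENAMETOOLONG;

        return 0;
}
```

With this, a name such as foobar maps to /run/netns/foobar, while an absolute path is passed through unchanged.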

Reply:

It allows users to interact with the netns from userspace, which would be much appreciated.

(void) umount(mount_entry_path(m));

if (mount("sysfs", mount_entry_path(m), "sysfs", 0, NULL) < 0)
return log_debug_errno(errno, "Failed to mount %s: %m", mount_entry_path(m));
Member

should probably use MS_NOSUID|MS_NOEXEC|MS_NODEV as mount flags, since we mount the host sysfs like that
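A minimal sketch of the suggested flags, pulled out into a standalone helper. The helper name `mount_sysfs_hardened()` is hypothetical; the mount() call itself mirrors the one in the hunk above, with only the flags argument changed. It needs CAP_SYS_ADMIN to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Mounts a fresh sysfs instance on 'where' with the restrictive flags
 * suggested in the review. Returns 0 on success or a negative
 * errno-style value. */
static int mount_sysfs_hardened(const char *where) {
        assert(where);

        if (mount("sysfs", where, "sysfs", MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL) < 0)
                return -errno;

        return 0;
}
```

In the PR's code path the failure branch would presumably keep using log_debug_errno() as in the original hunk; the plain return value here only keeps the sketch self-contained.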

if (r < 0)
return r;

if (link->dhcp_server) {
Member

as above

if (r < 0)
return log_error_errno(r, "Failed to renew dynamic configuration of interface %s: %s",
name, bus_error_message(&error, r));
}
Member

hmm, i'd probably also go via varlink if client is root, since then you can issue this before dbus is up.

dbus gives us powerful access control, which varlink does not have. But if we are root anyway, then there's little point in bothering with dbus for this, I think.

static int link_renew_one(sd_bus *bus, Varlink *varlink, int index, const char *name) {
int r;

if (arg_namespace) {
Member

I'd change this check to if (varlink), i.e. instead of checking each time why we use varlink, just check if we can use varlink.

@@ -111,6 +111,13 @@
<listitem><para>Suppress log messages.</para></listitem>
</varlistentry>

<varlistentry>
<term><option>-N</option></term>
<term><option>--namespace</option></term>
Member

--namespace=, i.e. with trailing = since it expects an argument

@@ -44,6 +45,7 @@ static int help(void) {
" Required operational state\n"
" --any Wait until at least one of the interfaces is online\n"
" --timeout=SECS Maximum time to wait for network connectivity\n"
" -N --namespace Network namespace\n"
Member

--namespace=NAME
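The trailing = in the two remarks above reflects that the option takes a mandatory argument. A small getopt_long() sketch of such an option follows; `parse_namespace_option()` is a hypothetical helper, not the parser used in the PR, and resetting optind to 0 relies on glibc/musl behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <getopt.h>
#include <stddef.h>
#include <string.h>

/* Parses -N NAME / --namespace=NAME. On success returns 0 and stores the
 * argument (or NULL if the option was not given) in *ret. */
static int parse_namespace_option(int argc, char *argv[], const char **ret) {
        static const struct option options[] = {
                { "namespace", required_argument, NULL, 'N' },
                { NULL,        0,                 NULL, 0   },
        };
        int c;

        assert(argc >= 0);
        assert(argv);
        assert(ret);

        *ret = NULL;
        optind = 0; /* glibc/musl: reinitialize the scanner so the helper can be called repeatedly */
        opterr = 0; /* no diagnostics on stderr, errors are reported via the return value */

        while ((c = getopt_long(argc, argv, "N:", options, NULL)) >= 0)
                switch (c) {
                case 'N':
                        *ret = optarg;
                        break;
                default:
                        /* Unknown option or missing argument. */
                        return -EINVAL;
                }

        return 0;
}
```

The colon after N in the short option string and required_argument in the long option table are what the = in the documentation and help text advertise.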

@keszybz (Member) commented Jul 24, 2020

@yuwata update?

@yuwata (Member, Author) commented Jul 24, 2020

Will update after v246 is released.

@poettering (Member) commented

would still love this to materialize!

@kjander0

I've noticed different behaviour between 'ip netns exec' and 'systemd's NetworkNamespacePath'. The former remounts /sys/ when switching the namespace of a process. Systemd doesn't seem to do this and therefore the process running in the namespace sees the wrong network devices in /sys/class/net, etc. Is this expected behaviour?

@poettering (Member) commented

@kjander0 please file a separate issue about this. Comments on unrelated PRs are not a great way to report a bug.

@LaserEyess (Contributor) commented

Any chance at a rebase of this? I am very interested in this functionality and I would like to test it, but I am not skilled enough nor familiar enough with the codebase to rebase.

Base automatically changed from master to main January 21, 2021 11:54
@codepeon

Here are some rebased versions of this:
Based on v248, Builds and seems to work: https://github.com/codepeon/systemd/tree/netns-v248
Based on main, not tested: https://github.com/codepeon/systemd/tree/netns-main

@agowa commented May 17, 2021

What's the current status? Is this PR still active or is it superseded by some other PR?

@herbetom commented Jul 11, 2021

Since I'm once again facing the situation where it would be handy to have network namespace support in systemd-networkd I would like to repeat the question of @agowa338 about the current status. I would be really happy if this functionality would exist one day.

@yuwata

@Manouchehri

Any chance of this being merged in?

@OJFord commented Sep 23, 2021

Does this support (or not complicate supporting in the future) differing 'current' and 'birth' namespaces?

This can be useful because when the interface is moved to another namespace, its sockets are still attached to (? created in? I'm fuzzy on the specifics) its birth namespace. WireGuard describes a use case in more detail here (under heading 'the new namespace solution'): https://www.wireguard.com/netns/

@jmpolom commented Nov 22, 2021

This would be a pretty amazing thing to have support for in networkd.

@andrewgdunn

There was an interesting discussion in the podman mailing lists where someone is trying to do these things for quadlet (a systemd/podman sort of bridge prototype).

@DemiMarie

@yuwata do you plan to finish this PR, or should someone else do so?

@Torxed commented Apr 20, 2022

@yuwata added the new-feature label on Oct 29, 2020

There's been no meaningful update since 2020-10-29.
The overall code of systemd has evolved quite significantly since this PR. And rebasing is no longer trivial.

Perhaps split this PR into smaller PR's for individual functionalities such as creating namespaces and managing interfaces?
I'd love to see it all go in but if it's unmaintainable to introduce this somewhat large PR of ~1.5k line changes, then perhaps introduce features in steps?

I get that everyone is a volunteer and that we're welcome to submit PR's, but this one felt somewhat complete.
Would be nice with a status update of any kind.

Labels
needs-rebase network new-feature reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks
Development

Successfully merging this pull request may close these issues.

[RFE] systemd-networkd set netns for netdev