network: support network namespace #14915
a57efcd to
287960b
Compare
|
TODO:
|
0baadce to
a4e57d4
Compare
|
First of all, great to see support for this being added! I haven't looked in detail, but I'm wondering if we can't make this stuff a little more generic. Why limit the namespace stuff to systemd-networkd and networkctl? Shouldn't we consider making separate namespace units instead (ipc, mount, future namespaces)? Of course, adding those will require extra work compared to this solution but I can come up with some use cases where we want to run programs in a separate mount or ipc namespace as well (aside from only network). |
2019173 to
f366c1d
Compare
|
@DaanDeMeyer Could you elaborate? For networkd, I think supporting network namespaces is non-trivial and needs a fair amount of work, whereas other namespaces are trivial: just modifying the unit file seems to be enough, though I have not tested that. |
|
I wrote out a massive wall of text but it comes down to this: Instead of adding systemd-networkd-create-netns@ and networkctl netns-create, can't we simply modify If we can't manage the lifetime of namespaces without a separate unit like systemd-networkd-create-netns, maybe this needs to be a general systemd concept instead of a one time off implementation in systemd-networkd-create-netns@ and networkctl? |
ddf91d3 to
208cf89
Compare
I like that idea. I will try to implement that. Thanks. |
e338f43 to
b5db70a
Compare
DaanDeMeyer
left a comment
I reviewed the parts related to my earlier suggestion since I'm somewhat familiar with those.
man/networkctl.xml
Outdated
|
|
||
| <varlistentry> | ||
| <term> | ||
| <command>netns-create</command> |
Do we still need these commands if NetworkNamespacePath creates the network namespace automatically?
Maybe useful for systemd-nspawn --network-namespace-path=?
I'm not sure. I think the existing tools unshare, lsns and umount cover these use cases well enough. Of course, if you or someone else feels we should have these, then by all means let's add them. It doesn't hurt anyone if networkctl has a few extra commands.
I'd drop them, really.
| (void) loopback_setup(); | ||
|
|
||
| /* Mount the new netns onto the path. */ | ||
| if (mount("/proc/self/ns/net", path, "none", MS_BIND, NULL) < 0) { |
Do we ever clean this up somewhere? Won't we end up with a namespace leak if we keep the file bind-mounted somewhere?
We do not know when the file is no longer in use, so I think we must 'leak' the file. (It can be removed manually with the networkctl remove-netns command.)
Maybe there's a way to keep track of whether there are any services running in the network namespace (or just a namespace in general to be a little more generic)? As soon as the count reaches zero, we unmount the file. Of course, that does not keep track of any processes not spawned by systemd but that seems like a tradeoff we could make (and explicitly document).
The idea is that we only need to have the file mounted if there is at least one other service running in the namespace. If the last service exits, we can safely remove the mount and create the mount again once another service starts with the same namespace path.
The only thing I'm not sure about is how hard it is to add the machinery required to keep track of which services are running in which persistent namespaces.
On the other hand, leaking the file might not be a huge problem. Although I'm a bit worried that if many unique instances of a template service are spawned and they use the template parameter in the network namespace path that we get a lot of namespace mounts lying around. I have a feeling someone would hit some kind of limit sooner or later and get a rather obscure error if we don't clean the mounts ourselves.
Maybe @poettering knows more?
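To make the counting idea above concrete, here is a minimal sketch of the bookkeeping, assuming a small fixed-size table keyed by namespace path. The names `netns_acquire()` and `netns_release()` are hypothetical and are not part of this PR or of systemd; the sketch only decides when a mount would have to be created or removed and does not touch the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NETNS_MAX 16        /* hypothetical limit, chosen for the sketch */
#define NETNS_PATH_MAX 256

/* One tracked namespace file and the number of services currently using it. */
struct netns_ref {
        char path[NETNS_PATH_MAX];
        unsigned n_users;
};

static struct netns_ref netns_table[NETNS_MAX];

static struct netns_ref *netns_find(const char *path) {
        for (size_t i = 0; i < NETNS_MAX; i++)
                if (netns_table[i].n_users > 0 && strcmp(netns_table[i].path, path) == 0)
                        return &netns_table[i];
        return NULL;
}

/* Called when a service using 'path' starts. Returns 1 if this is the first
 * user (the caller has to create the bind mount), 0 if the namespace is
 * already tracked, or a negative errno. */
int netns_acquire(const char *path) {
        struct netns_ref *e;

        assert(path);
        if (strlen(path) >= NETNS_PATH_MAX)
                return -ENAMETOOLONG;

        e = netns_find(path);
        if (e) {
                e->n_users++;
                return 0;
        }

        for (size_t i = 0; i < NETNS_MAX; i++)
                if (netns_table[i].n_users == 0) {
                        strcpy(netns_table[i].path, path);
                        netns_table[i].n_users = 1;
                        return 1;
                }

        return -ENOSPC;
}

/* Called when a service using 'path' stops. Returns 1 if the last user is
 * gone (the caller may unmount the file), 0 if other users remain, or
 * -ENOENT if the path is not tracked. */
int netns_release(const char *path) {
        struct netns_ref *e;

        assert(path);
        e = netns_find(path);
        if (!e)
                return -ENOENT;

        e->n_users--;
        return e->n_users == 0;
}
```

As noted in the comment above, this only counts services started by the manager itself; processes that enter the namespace by other means would not be seen.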
I noticed that removing the netns file causes a backward compatibility issue. Previously, when a service with NetworkNamespacePath= was stopped, the netns file still existed and could be used by any later commands or services. That behavior must be preserved. So, even if pid1 creates the file, the file must be 'leaked'.
Well yes, but that's only because the netns file was created explicitly by the user beforehand no? It makes sense then that they can rely on the file being there after the service stops.
However, if the file doesn't exist, I think we currently hard-fail so I think that means we have an avenue for adding new behaviour any way we want to without having to worry about backwards compatibility. In other words, there's no way users can depend on netns files created by systemd being there after the service stops since at the moment systemd never creates any netns files in the first place.
If the user currently creates the netns file, of course he's responsible for cleaning it up afterwards. However, if we now start creating the netns files ourselves, I think that makes us responsible for cleaning them up as well.
I wonder if /var/run/netns mountpoint could be handled by systemd .mount & service lifetime/dependencies.
I'd probably not actively use .mount unit for this kind of mount point. It's not really a file system after all, but more a weird API of the kernel to make namespaces persistent...
I'd just add NetworkNamespaceMode=join|create|sticky instead, as suggested elsewhere.
| } | ||
|
|
||
| /* Restore the original netns. */ | ||
| if (setns(old_netns, CLONE_NEWNET) < 0) |
I know @poettering asked me to do namespace changes in a child process to make cleanup easier in an nspawn PR review. I'm not sure if this is a code path where we can afford to fork a child process but if we can, it might be easier to fork, unshare and mount instead of having logic everywhere to move back to the original network namespace.
I will re-consider this point later. Thanks.
yes, as mentioned elsewhere, let's do that in a child process. i.e. clone() with CLONE_NEWNET, and pass netns fd up to parent again
poettering
left a comment
I love this a lot! Excellent work!
| r = netlink_message_read_internal(m, type, &attr_data, NULL); | ||
| if (r < 0) | ||
| return r; | ||
| else if ((size_t) r < sizeof(int8_t)) |
doesn't matter much, but you could just drop the word else here
| r = netlink_message_read_internal(m, type, &attr_data, &net_byteorder); | ||
| if (r < 0) | ||
| return r; | ||
| else if ((size_t) r < sizeof(int16_t)) |
as above
| </term> | ||
| <listitem><para>List files referring to Linux network namespace pseudo-files. If no path is | ||
| specified, then the files in <filename>/var/run/netns</filename> are shown.</para></listitem> | ||
| </varlistentry> |
hmm, do we want these really? I think it would be much nicer if you allocate a new netns simply by starting another networkd instance. i.e.
"systemctl start networkd@foobar.service" starts a new netns called foobar
These commands basically mirror the ip netns list, ip netns add and ip netns delete commands. Not sure if they bring additional value. One could argue networkctl list also is just a more fancy ip l, and we could show more information here, too.
I don't think there always needs to be a separate networkd instance for each namespace "managed" by networkd.
There's two usecases here:
- In addition to the default network namespace and main networkd, having a separate networkd running in a separate network namespace with interfaces in that namespace.
  Don't we have systemd-nspawn for this?
- Main networkd in the default network namespace sets up a tunnel interface (wireguard, openvpn), but moves the created interface into a separate network namespace (creating if it doesn't exist already), possibly configuring addresses/(default) route(s) on that interface as well.
  People that want to pipe traffic "through that tunnel" invoke applications (ip netns exec $application) or whole graphical sessions in that namespace, but they don't really care about configuring more interfaces inside this namespace.
hmm, do we want these really? I think it would be much nicer if you allocate a new netns simply by starting another networkd instance. i.e.
"systemctl start networkd@foobar.service" starts a new netns called foobar
Let's remember that this really should fit well into regular workflow with "personalities". I would expect that:
ip netns exec alt-namespace-2 /bin/zsh
networkctl
returns the currently "logged in" network status, just like the ip, iptables or tc tools do. Consider a scenario where an entire ssh server runs within some netns (in general: one ssh server per netns) and one logs directly into the netns. It should be possible to manage this "internal" (separated) network, including stopping it, without interfering with other namespaces.
This makes me think that maybe the entire networkd@some-netns.service should be started inside some-netns (just like the ssh instance from the example above), and the master (host) networkd.service should set up the namespaces and interfaces itself (like creating a veth pair and moving one end into the specified netns). This seems to mean that a .netdev with NetNS=x would be picked up by the main systemd.service (in order to set up the device+netns assignment) and by networkd@x.service (in order to match the device), while a .network with NetNS=x would belong to networkd@x.service only.
Then, the namespaced networkctl should display and manage only its own parts, while the external one could give access to all the children. Also, some day, systemctl could gain a feature to filter out namespace-bound services and disallow messing with the master or siblings - this is the way of partitioning a system and creating network-admin roles (known from Big Telco solutions; think of apache running within the web1 netns and user/group X allowed to restart every service within the web1 netns).
As for the (simplified) use-case (italic, long): I need to set up 4 netnses: bgp1, bgp2, lan1, lan2 interconnected with veth pairs. The bgp* personalities would run separate BGP instances (bird) on assigned VLANs exchanging traffic with their respective lan* internal network BRASes. Each bgp* uses its own uplinks by default and provides backup transmission for the other, so in general traffic can go through 3 namespaces (lan1-bgp1-bgp2), while in normal conditions the flows are separated and can be accounted per instance. In a general setup there might be more than 2 "entity sets" and a "set" might contain more "boxes" (BGP, NAT/log, QoS/BRAS, IDS), while the "interface/VLAN" is stacked over some QinQ or LACP (802.1ad), VXLAN etc. Most of this can be accomplished by using "traditional" methods, like policy based routing, separate routing tables etc., but as the number of entities grows the configuration gets exponentially more complicated as they start to interfere heavily. Using separate network namespaces, with their own routing tables, filtering rules (iptables), conntrack, DHCP servers etc. trades some performance for simplicity and manageability (like having different network-admins for different "sets"). Using full virtualization would result in a much bigger performance penalty (and require an additional layer of OpenVSwitch). Containers are functionally equivalent, but require massive system (fs tree) duplication and would become upgrade management hell really fast; note the network-admins should only care about the rules (BGP, iptables, ipsets, QoS), not the system ("firmware") itself. Also, I do trust that the network-admins are not malicious and just want to make their life easier - working with hundreds of interfaces stacked in several layers is error-prone.
Currently systemd handles well my (over?)complicated block-devices scheme (drive-partition-LVM-integrity-mdadm-LUKS-VDO-LVM/thin_pool(snapshots)-mdadm (with top-level mdadm used for bitmap-assisted on-line backups of either snapshots or fsfrozen data via ISCSI@ipsec to remote location)) and it would be really nice if the network stack could be handled as well.
| </term> | ||
| <listitem><para>Create a file referring to a Linux network namespace pseudo-file. Takes a name | ||
| or an absolute path. If a name is provided, then the file is created at | ||
| <filename>/var/run/netns/<replaceable>NAME</replaceable></filename>.</para></listitem> |
I am not sure we should step into that directory, i.e. territory of iproute? is that public API even?
Also /var/run is a legacy alias for /run. Let's not use legacy names.
It allows user to interact with the netns from userspace, which would be much appreciated.
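For illustration, the "name or absolute path" rule from the man page excerpt could be resolved roughly as follows. This is a sketch only: `netns_resolve_path()` is a hypothetical helper, and it assumes `/run/netns` (rather than the legacy `/var/run/netns` alias) as the base directory, which is exactly the point still under discussion above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Assumption for this sketch: use /run instead of the legacy /var/run alias. */
#define NETNS_RUN_DIR "/run/netns"

/* Resolve a user-supplied argument to the path of a namespace file.
 * Absolute paths are used as-is, bare names are placed below NETNS_RUN_DIR.
 * Returns 0 on success or a negative errno. */
int netns_resolve_path(const char *arg, char *ret, size_t size) {
        int n;

        assert(arg);
        assert(ret);

        if (arg[0] == '\0')
                return -EINVAL;

        if (arg[0] == '/')
                n = snprintf(ret, size, "%s", arg);
        else {
                /* A bare name must be a single path component. */
                if (strchr(arg, '/') || strcmp(arg, ".") == 0 || strcmp(arg, "..") == 0)
                        return -EINVAL;
                n = snprintf(ret, size, NETNS_RUN_DIR "/%s", arg);
        }

        if (n < 0 || (size_t) n >= size)
                return -ENAMETOOLONG;
        return 0;
}
```

Keeping the base directory in one place would make it easy to change if sharing the iproute2 directory turns out not to be acceptable.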
| (void) umount(mount_entry_path(m)); | ||
|
|
||
| if (mount("sysfs", mount_entry_path(m), "sysfs", 0, NULL) < 0) | ||
| return log_debug_errno(errno, "Failed to mount %s: %m", mount_entry_path(m)); |
should probably use MS_NOSUID|MS_NOEXEC|MS_NODEV as mount flags, since we mount the host sysfs like that
| if (r < 0) | ||
| return r; | ||
|
|
||
| if (link->dhcp_server) { |
as above
| if (r < 0) | ||
| return log_error_errno(r, "Failed to renew dynamic configuration of interface %s: %s", | ||
| name, bus_error_message(&error, r)); | ||
| } |
hmm, i'd probably also go via varlink if client is root, since then you can issue this before dbus is up.
dbus gives us powerful access control, which varlink lacks. But if we are root anyway, then there's little point in bothering with dbus for this, I think.
| static int link_renew_one(sd_bus *bus, Varlink *varlink, int index, const char *name) { | ||
| int r; | ||
|
|
||
| if (arg_namespace) { |
I'd change this check to if (varlink), i.e. instead of checking each time why we use varlink, just check if we can use varlink.
|
|
||
| <varlistentry> | ||
| <term><option>-N</option></term> | ||
| <term><option>--namespace</option></term> |
→ --namespace=, i.e. with trailing = since it expects an argument
| " Required operational state\n" | ||
| " --any Wait until at least one of the interfaces is online\n" | ||
| " --timeout=SECS Maximum time to wait for network connectivity\n" | ||
| " -N --namespace Network namespace\n" |
→ --namespace=NAME
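The reason for the trailing `=` and the `NAME` placeholder is that the option takes a mandatory argument. A minimal getopt_long() sketch of such an option follows; `parse_namespace_option()` is a hypothetical stand-in, not the actual parse_argv() of networkctl or systemd-networkd-wait-online.

```c
#include <assert.h>
#include <errno.h>
#include <getopt.h>
#include <stddef.h>

/* Parse -N NAME / --namespace=NAME. On success returns 0 and stores the
 * namespace name in *ret (NULL if the option was not given). Returns
 * -EINVAL for unknown options or a missing argument. */
int parse_namespace_option(int argc, char *argv[], const char **ret) {
        static const struct option options[] = {
                /* required_argument is why the help text should read --namespace=NAME */
                { "namespace", required_argument, NULL, 'N' },
                { NULL, 0, NULL, 0 }
        };
        const char *ns = NULL;
        int c;

        assert(argv);
        assert(ret);

        optind = 0; /* restart the scanner so the function can be called repeatedly */
        opterr = 0; /* let the caller report errors */

        while ((c = getopt_long(argc, argv, "N:", options, NULL)) >= 0)
                switch (c) {
                case 'N':
                        ns = optarg;
                        break;
                default:
                        return -EINVAL;
                }

        *ret = ns;
        return 0;
}
```

With this table both `-N foobar` and `--namespace=foobar` are accepted, while a bare `--namespace` is rejected.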
|
@yuwata update? |
|
Will update after v246 is released. |
|
would still love this to materialize! |
|
I've noticed different behaviour between 'ip netns exec' and 'systemd's NetworkNamespacePath'. The former remounts /sys/ when switching the namespace of a process. Systemd doesn't seem to do this and therefore the process running in the namespace sees the wrong network devices in /sys/class/net, etc. Is this expected behaviour? |
|
@kjander0 please file a separate issue about this. Comments on unrelated PRs are not a great way to report a bug. |
|
Any chance at a rebase of this? I am very interested in this functionality and I would like to test it, but I am not skilled enough nor familiar enough with the codebase to rebase. |
|
Here are some rebased versions of this: |
|
What's the current status? Is this PR still active or is it superseded by some other PR? |
|
Any chance of this being merged in? |
|
Does this support (or not complicate supporting in the future) differing 'current' and 'birth' namespaces? This can be useful because when the interface is moved to another namespace, its sockets are still attached to (? created in? I'm fuzzy on the specifics) its birth namespace. WireGuard describes a use case in more detail here (under heading 'the new namespace solution'): https://www.wireguard.com/netns/ |
|
This would be a pretty amazing thing to have support for in networkd. |
|
There was an interesting discussion in the podman mailing lists where someone is trying to do these things for quadlet (a systemd/podman sort of bridge prototype). |
|
@yuwata do you plan to finish this PR, or should someone else do so? |
There's been no meaningful update since 2020-10-29. Perhaps split this PR into smaller PRs for individual functionalities such as creating namespaces and managing interfaces? I get that everyone is a volunteer and that we're welcome to submit PRs, but this one felt somewhat complete. |
|
It is possible to achieve running systemd-networkd per namespace using the tools systemd already provides with no code changes (this may have only become possible since this draft was first created). It's not trivial, but neither is it incredibly complicated. It comes down to setting up a namespace, setting up a template for the systemd-networkd service to use the targeted namespace, and setting up a dbus instance in that namespace as well. You'll also need to mount per-namespace directories over the You would use the Perhaps this is a better path forward than trying to revive this |
|
I strongly disagree with this @kniteli. In particular, I would like the ability to set up all physical devices in a separate network namespace, so that the root namespace can be reserved for a WireGuard VPN tunnel. |
|
@kniteli I also considered something like this in the past. However, in practice it interferes way too much with systemd-networkd, systemd-udevd, and anything that already ships with a configuration for either What I tried so far (and thereby you also see how hacky working around the missing network namespace support in systemd-networkd is):
Instead of 6 I also tried hacking it into the initrd systemd startup and hackishly insert a So yeah, it can be hacked but that's mostly working around systemd instead of working with systemd... |
Closes #11103.