Parsing thousands of mounts blocks processing of systemd communication leading to dhcp lease expiry and more #31137
Yeah, the Linux mount API is awful; we have to consistently rescan it after getting a notification about a change. Generally, software that installs thousands of mounts in the host mount namespace is broken, it simply shouldn't do that. Linux mount APIs simply do not scale well enough for that. And there's really no reason why such software would install so many mounts on the host, because it can just open a mount namespace. There's some work going on to improve the kernel APIs so that we get notification messages with the actual changes and don't have to rescan all the time, but this isn't complete yet.
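The rescan behaviour described above can be illustrated with a short sketch (an editor's illustration in Python, not systemd's actual code): the kernel flags a change on /proc/self/mountinfo as an exceptional poll condition, but delivers no delta, so every listener must re-read and re-parse the whole file on each wakeup.

```python
import select

def rescan():
    # Full re-read and re-parse: cost is proportional to the number of
    # mounts, and it is paid on *every* change notification.
    with open("/proc/self/mountinfo") as f:
        return [line.split() for line in f]

# Mount-table changes are signalled as POLLPRI/POLLERR on the open file;
# the event itself says nothing about which mount changed.
f = open("/proc/self/mountinfo")
poller = select.poll()
poller.register(f.fileno(), select.POLLPRI | select.POLLERR)

mounts = rescan()
print(f"parsed {len(mounts)} mount entries")
# A real event loop would block in poller.poll() and call rescan() on every
# wakeup; with thousands of mounts changing frequently, that rescan is
# where the CPU time goes.
```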
One side effect of this issue that I want to call attention to is that systemd stops consuming watchdog notification messages when it is being DoS'd in this way. Combine this with the fact that, by default, the […] Even if nothing is done to optimize the mount point scanning, systemd should occasionally defer processing the mount events to consume watchdog notifications within some reasonable time frame (1 minute?), or libsystemd should perform a non-blocking […]
@droyo that's not true though. Current systemd versions (since d586f64) actually stop processing /proc/self/mountinfo events for a while if triggered too often. Maybe you are just using too old a systemd version? Basically we have three options with the current kernel API.
We opted for option 3. But in the meantime, the best approach is certainly to fix the software in question to not have such massive numbers of mounts in the root mount namespace; Linux is simply not ready for that yet. It's not just systemd that has to deal with this mess, it's any program that uses the kernel APIs, i.e. libmount and such. Frankly: fix your software. And then one day, when the kernel adds a better API, we can fix this properly. But even then, having thousands of mounts in the root mount namespace is still sucky, because it's very unfriendly to admins. Better fix your software to not do that.
@poettering This was observed on systemd 249 on Ubuntu 22.04. When was this rate limiting introduced? Are you referring to sd_event_source_set_ratelimit here? (Line 2091 in 556d2bc)
@matthewruffell in your reproduction on Ubuntu noble, do you see watchdog notifications getting consumed in a reasonable time?
The event ratelimiting helps when there is a high rate of mount space changes (not necessarily a large number of mounts). It won't help in the reported case, but I invite the audience to review #29821 (for user units). :-) BTW, how many CPUs does your machine have and how many user instances are running?
systemd version the issue has been seen with
255.2-3ubuntu2
Used distribution
Ubuntu Noble 24.04
Linux kernel version used
6.6.0-14-generic
CPU architectures issue was seen on
x86_64
Component
systemd, systemd-networkd, systemd-udevd
Expected behaviour you didn't see
When mounting a filesystem or creating a tmpfs mount, systemd should be responsive and service important messages on its sockets, such as restarting services or renewing dhcp leases in a timely fashion.
Unexpected behaviour you saw
This issue was first seen on a cloud instance that was mounting and unmounting tmpfs filesystems every second or so, by containerd collecting telemetry.
The system had about 3000 stale mountpoints of the root filesystem under /home, due to a mountpoint leak by osqueryd. When the new tmpfs mounts were created, systemd sat at 100% CPU and was completely blocked. It appears to parse each of the 3000 mountpoints under the udev filesystem during the tmpfs mount, which takes considerable time. While systemd is blocked, you cannot restart or query the status of services, and the DHCP lease can expire and not be renewed, leading to loss of network connectivity.
This isn't a bug per se, given the large number of mounts involved; the issue is more that systemd is blocked and communication on systemd's sockets is not consumed, leading to problems such as DHCP lease expiry.
Is it strictly necessary to parse all these udev entries per tmpfs mount?
Is it possible to optimise the mount path to scan the udev entries in chunks, and break in between to let systemd service events like DHCP lease renewal in a timely fashion? It could then return to the ongoing mount.
Steps to reproduce the problem
Start a VM, easier to see the impact if you select 1 vcpu.
Create a file, repro.sh, with the following contents (adjust /dev/sda1 to your system, e.g. /dev/vda1):
Run repro.sh, it will create 3000 mounts of the primary filesystem, and then
once per second, create a tmpfs, wait a second, and then unmount the tmpfs.
Once all 3000 mounts have been created, if you then restart systemd-networkd
to simulate a DHCP lease expiry, you can see that it is blocked waiting for 1.5 to
2.5 minutes, and during this time there is no DHCP lease and networking is down.
Additional program output to the terminal or log subsystem illustrating the issue
htop shows systemd at 100% CPU: