Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing thousands of mounts blocks processing of systemd communication leading to dhcp lease expiry and more #31137

Open
matthewruffell opened this issue Jan 31, 2024 · 6 comments

Comments

@matthewruffell
Copy link

systemd version the issue has been seen with

255.2-3ubuntu2

Used distribution

Ubuntu Noble 24.04

Linux kernel version used

6.6.0-14-generic

CPU architectures issue was seen on

x86_64

Component

systemd, systemd-networkd, systemd-udevd

Expected behaviour you didn't see

When mounting a filesystem or creating a tmpfs mount, systemd should be responsive and service important messages on its sockets, such as restarting services or renewing dhcp leases in a timely fashion.

Unexpected behaviour you saw

This issue was first seen on a cloud instance that was mounting and unmounting tmpfs filesystems every second or so by containerd collecting telemetry.

The system had about 3000 stale mountpoints of the root filesystem under /home due to a mountpoint leak by osqueryd. When the new tmpfs mounts were created, systemd is at 100% cpu and is completely blocked. It appears to parse each of the 3000 mountpoints under the udev filesystem during the tmpfs mount, which takes considerable time. While systemd is blocked, you cannot restart or query the status of services, and the dhcp lease can expire, and not be renewed leading to loss of network connectivity.

This isn't a bug per se, due to the large amount of mounts involved, but the issue is more systemd is blocked and communication to systemd sockets is not consumed, leading to issues such as dhcp lease expiry.

Is it strictly necessary to parse all these udev entries per tmpfs mount?

Is it possible to optimise the mount path to scan the udev entries in chunks, and break inbetween to let systemd service events like dhcp lease renewal etc in a timely fashion? Then it can return to the ongoing mount.

Steps to reproduce the problem

Start a VM, easier to see the impact if you select 1 vcpu.

Create a file, repro.sh with the following contents:

#!/bin/bash

NUMDIRS=3000
DIRPREFIX="mnt"
NUM_MOUNTS=0
TDIR=""

cleanup() {
	for idx in $(seq 1 $NUM_MOUNTS); do
		DIRNAME="$DIRPREFIX/$idx"
		sudo umount "$DIRNAME"
		rmdir $DIRNAME
		echo "Removed $DIRNAME"
	done
	if [ -n "$TDIR" -a -e "$TDIR" ]; then 
	   sudo umount $TDIR
	fi
	return
}

trap "cleanup" EXIT

for idx in $(seq 1 $NUMDIRS); do
	DIRNAME="$DIRPREFIX/$idx"
	mkdir -p "$DIRNAME"
	sudo mount /dev/sda1 "$DIRNAME"
	NUM_MOUNTS=$idx
	echo "Created $DIRNAME mount"
done

for idx in $(seq 1 1000); do
	TDIR=$(mktemp -d); 
	sudo mount -o size=1k -t tmpfs tmpfs $TDIR
	sleep 1
	sudo umount "$TDIR"
done

Adjust /dev/sda1 to your system, e.g. /dev/vda1.

Run repro.sh, it will create 3000 mounts of the primary filesystem, and then
once per second, create a tmpfs, wait a second, and then unmount the tmpfs.

$ chmod +x repro.sh
$ ./repro.sh
Created mnt/1
...

Once all 3000 mounts have been created, if you then restart systemd-networkd
to simulate a dhcp lease expiry, you can see that it is blocked waiting for 1.5 to
2.5 minutes, and during this time, there is no dhcp lease, and networking is down.

$ time sudo systemctl restart systemd-networkd

real	2m28.361s
user	0m0.004s
sys	0m0.000s
$ time sudo systemctl restart systemd-networkd
Failed to restart systemd-networkd.service: Transport endpoint is not connected
See system logs and 'systemctl status systemd-networkd.service' for details.

real	1m30.145s
user	0m0.002s
sys	0m0.003s
$ time sudo systemctl restart systemd-networkd

real	2m17.425s
user	0m0.001s
sys	0m0.005s
$ ping github.com
ping: github.com: Temporary failure in name resolution
$ ping github.com
ping: github.com: Temporary failure in name resolution
$ ping github.com
ping: github.com: Temporary failure in name resolution

Additional program output to the terminal or log subsystem illustrating the issue

Running an strace during mount shows many udev entries being parsed for each
mountpoint, which takes time.


19:55:06 newfstatat(AT_FDCWD, "/dev/sda1", {st_mode=S_IFBLK|0660, st_rdev=makedev(0x8, 0x1), ...}, 0) = 0 <0.000061>
19:55:06 openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = 52 <0.000034>
19:55:06 openat(52, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000038>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0555, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000034>
19:55:06 close(52)                      = 0 <0.000032>
19:55:06 openat(53, "dev", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000043>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000222>
19:55:06 close(53)                      = 0 <0.000021>
19:55:06 openat(52, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000042>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000034>
19:55:06 close(52)                      = 0 <0.000030>
19:55:06 openat(53, "8:1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000048>
19:55:06 newfstatat(52, "", {st_mode=S_IFLNK|0777, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 readlinkat(53, "8:1", "../../devices/pci0000:00/0000:00"..., 4096) = 86 <0.000037>
19:55:06 close(52)                      = 0 <0.000030>
19:55:06 openat(53, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000037>
19:55:06 close(53)                      = 0 <0.000037>
19:55:06 openat(52, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000179>
19:55:06 close(52)                      = 0 <0.000060>
19:55:06 openat(53, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000043>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000033>
19:55:06 close(53)                      = 0 <0.000030>
19:55:06 openat(52, "pci0000:00", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000040>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 close(52)                      = 0 <0.000031>
19:55:06 openat(53, "0000:00:03.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000053>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000032>
19:55:06 close(53)                      = 0 <0.000108>
19:55:06 openat(52, "virtio0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000089>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000064>
19:55:06 close(52)                      = 0 <0.000030>
19:55:06 openat(53, "host0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000037>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000029>
19:55:06 close(53)                      = 0 <0.000041>
19:55:06 openat(52, "target0:0:1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000042>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000027>
19:55:06 close(52)                      = 0 <0.000044>
19:55:06 openat(53, "0:0:1:0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000044>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000036>
19:55:06 close(53)                      = 0 <0.000028>
19:55:06 openat(52, "block", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000040>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000104>
19:55:06 close(52)                      = 0 <0.000034>
19:55:06 openat(53, "sda", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 52 <0.000042>
19:55:06 newfstatat(52, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000033>
19:55:06 close(53)                      = 0 <0.000033>
19:55:06 openat(52, "sda1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 53 <0.000043>
19:55:06 newfstatat(53, "", {st_mode=S_IFDIR|0755, st_size=0, ...}, AT_EMPTY_PATH) = 0 <0.000062>
19:55:06 close(52)                      = 0 <0.000051>
19:55:06 close(53)                      = 0 <0.000038>
19:55:06 access("/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/uevent", F_OK) = 0 <0.000107>
19:55:06 openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/uevent", O_RDONLY|O_NOCTTY|O_CLOEXEC) = 52 <0.000064>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=4096, ...}, AT_EMPTY_PATH) = 0 <0.000038>
19:55:06 read(52, "MAJOR=8\nMINOR=1\nDEVNAME=sda1\nDEV"..., 4104) = 65 <0.000055>
19:55:06 close(52)                      = 0 <0.000068>
19:55:06 readlinkat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:03.0/virtio0/host0/target0:0:1/0:0:1:0/block/sda/sda1/subsystem", "../../../../../../../../../../cl"..., 4096) = 41 <0.000051>
19:55:06 openat(AT_FDCWD, "/run/udev/data/b8:1", O_RDONLY|O_CLOEXEC) = 52 <0.000063>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=1509, ...}, AT_EMPTY_PATH) = 0 <0.000073>
19:55:06 newfstatat(52, "", {st_mode=S_IFREG|0644, st_size=1509, ...}, AT_EMPTY_PATH) = 0 <0.000043>
19:55:06 read(52, "S:disk/by-partuuid/04a52087-8ae8"..., 4096) = 1509 <0.000038>
19:55:06 read(52, "", 4096)             = 0 <0.000056>
19:55:06 close(52)                      = 0 <0.000036>

htop shows systemd at 100% cpu:

PID	USER	PRI	NI	VIRT	RES		SHR		S	CPU%	MEM%	TIME+	Command
1	root	20	0	40004	30796	9548	R	32.5	0.8	1:18.35		/sbin/init
792	ubuntu	20	0	36580	26880	9472	R	31.8	0.7	0:57.32		/usr/lib/systemd/systemd --user
1351root	20	0	23112	12544	8064	R	20.1	0.3	0:15.57		(networkd)
637	root	20	0	460M	16132	11392	R	13.6	0.4	0:53.56		/usr/libexec/udisks2/udisksd
@matthewruffell matthewruffell added the bug 🐛 Programming errors, that need preferential fixing label Jan 31, 2024
@poettering
Copy link
Member

Yeah, the Linux mount API is awful, we have to consistently rescan it, after getting a notification about a change.

generally. software that installs thousands of mounts in the host mount fs is broken, it simply shouldn#t do that. Linux mount APIs simply do not scale good enough for that. And there's really no reason why such software would install so many mounts on the host, because they can just open a mount namespace.

There's some work going on in improving the kernel APIs, so that we got notification messages with the actual changes so that we don#t have to rescan all the time, but this isn#t complete yet.

@droyo
Copy link

droyo commented Feb 1, 2024

One side effect of this issue that I want to call attention to is that systemd stops consuming watchdog notification messages when it is being DOS'd in this way. Combine this with the fact that, by default, the sd_notify routine in libsystemd performs a blocking send, and this means that daemons which use sd_notify to report their health to systemd will eventually get blocked on the sendmsg of the notification socket once it becomes full. In the case of systemd-networkd, the full socket will cause DHCP lease expiration even without the forced restart suggested in the bug report.

Even if nothing is done to optimize the mount point scanning, systemd should occasionally defer processing the mount events to consume watchdog notifications within some reasonable time frame (1 minute)? or libsystemd should perform a non-blocking sendmsg in sd_notify -- though that might be a more controversial change.

@poettering
Copy link
Member

poettering commented Feb 1, 2024

@droyo that's not true though. Current systemd versions (since d586f64) actually stop processing /proc/self/mountinfo events if triggered too often, for a while. Maybe you are just using a too old a systemd version?

Basically we have three options with current kernel API.

  1. Drown in the flood of events and waste time constantly reparsing /p/s/mi,
  2. Don't process events on /p/s/mi instantly, and thus introduce latency but not drown in work
  3. Find a middle ground: accept the flood for a while, but once we hit some ratelimit stop processing and accept the extra latency

We opted for option 3.

But in the meantime, the best approach is certainly to fix the software in question to not have such massive numbers of mounts in the root mount namespace, Linux is simply not ready for that yet. It's not just systemd that has to deal with this mess, it's any program that uses the kernel APIs, i.e. libmount and stuff.

Frankly: fix your software. And then one day, when the kernel added a better api we cn fix this properly. But even then, having thousands of mounts in the root mount is still sucky, because it's very unfriendly to admins. Better fix your software to not do that.

@droyo
Copy link

droyo commented Feb 1, 2024

@poettering This was observed on systemd 249, ubuntu 22.04. When was this rate limiting introduced? Are you referring to sd_event_source_set_ratelimit here?

r = sd_event_source_set_ratelimit(m->mount_event_source, 1 * USEC_PER_SEC, mount_rate_limit_burst);

@droyo
Copy link

droyo commented Feb 1, 2024

@matthewruffell in your reproduction on Ubuntu noble, do you see watchdog notifications getting consumed in a reasonable time?

@Werkov
Copy link
Contributor

Werkov commented Feb 2, 2024

The event ratelimiting helps when there is high-rate of mount space changes (not necessarily large number of mounts).
But it may still take only a single event to hit the restart/watchdog expiration window and it triggers full parsing and timeout threatens when the mountinfo is large.
It's a suckage you can get away with by reducing number of (visible) mounts :-/

In won't help in the reported case but I invite the audience to review #29821 (for user units). :-)

BTW how many CPUs does your machine have and how many user instances are running?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

6 participants