RFC manager: Add deferred parsing of mountinfo #29821

Werkov · 2023-11-01T17:52:40Z

Manager start (measured via readiness notification) takes time that is proportionate to number of entries in /proc/self/mountinfo since all mount units must be enumerated (CPU profile of a starting user instance).

This is problematic for user instances because:

they can start at times when lots of entries are created already,
their start latency propagates directly to user sessions establishment via pam_systemd.

Alleviate this latency by postponing full mountinfo parsing after the manager notifies its readiness. The reasoning is that mount units that are somehow relevant during startup of the user instance are referenced by other units and they would be loaded on demand anyway. [1] The complete set of all mounts can be loaded later.

Use 'post' event source (instead of 'defer') since there is no point in checking readiness in event loop when no other event was processed.

If a mount happens concurrently to the startup (mountinfo event), manager will parse it completely, that is life. Fortunately, user-runtime-dir@.service is ordered before user@.service startup so it is safe from self-notifications.

Rough measurements with 3000 mount units:

time systemctl start user@1000

before
real 0m1.358s
user 0m0.005s
sys 0m0.013s

after
real 0m0.351s
user 0m0.008s
sys 0m0.010s

[1] There is a catch though (thus RFC) -- such mount units may be loaded on demand but their state may be outdated (I guess). The solution could be deferring parsing the mountinfo only for certain time and then launch in unconditionally. (Ratelimited event source could come handy?)

yuwata

Instead, maybe it is better to set $SYSTEMD_SUPPORT_MOUNT=no for such systems or setups.

src/core/manager.h

yuwata

Is it safe to call systemctl --user daemon-reexec or so with this change?

Still I think $SYSTEMD_SUPPORT_MOUNT=no should be used for such cases.

src/core/mount.c

keszybz

I think this sounds reasonable. Essentially, we treat mounts as asynchronous and order them a bit later. This changes semantics a bit, but I think it should be fine for the user manager.

keszybz · 2023-11-08T08:15:17Z

Instead, maybe it is better to set $SYSTEMD_SUPPORT_MOUNT=no for such systems or setups.

$SYSTEMD_SUPPORT_MOUNT=no has different semantics: .mount units cannot be used at all. I can imagine that in such a setup with many disk, users might actually depend on the mount units to trigger actions.

yuwata

Several more superficial comments.

src/core/mount.c

Manager start (measured via readiness notification) takes time that is proportionate to number of entries in /proc/self/mountinfo since all mount units must be enumerated. This is problematic for user instances because: a) they can start at times when lots of entries are created already, b) their start latency propagates directly to user sessions establishment via pam_systemd. Alleviate this latency by postponing full mountinfo parsing after the manager notifies its readiness. The reasoning is that mount units that are somehow relevant during startup of the user instance are referenced by other units and they would be loaded on demand anyway. The complete set of all mounts can be loaded later. If we used plain 'defer' source, it would be fetched from event queue before manager_check_finished() can be called (hence mountinfo parsing still delays manager's apparent readiness). Use 'post' source that can trigger after readiness notification. If a mount happens concurrently to the startup, manager will parse it completely, that is life. Fortunately, user-runtime-dir@.service is ordered before user@.service startup so it is safe from self (mount) notifications. Rough measurements with 3000 mount units: (TODO re-measure on main) > time systemctl start user@1000 before real 0m1.358s user 0m0.005s sys 0m0.013s after real 0m0.351s user 0m0.008s sys 0m0.010s

A thottled event source will eventually invoke an expiration callback. The current implementation sets expiration equal to the rate limiting interval. Make that variable configuratble, so the expiration timer can be controller too.

This complements commit 2fdc274 ("sd-event: add an explicit API for leaving the ratelimit state") so that users can create a rate limited from the beginning.

…rrent mounts When user instance starts on a busy system, its start would be delayed by parsing large mountinfo upon processing mount events. Most such events are irrelevant for user instance's startup, therefore postpone their processing with event rate-limiting mechanism. Thanks to the commit edc027b ("mount: retrigger run queue after ratelimit expired to run delayed mount start jobs") damped relevant events will not hold back indefinitely the startup process.

Werkov · 2024-01-26T18:34:23Z

@yuwata thanks for the review.
1st force push -- rebase on main
2nd force push -- your review fixups
3nd push -- another correction to handle not only high number of mounts but also high "traffic" of mounts (which could have triggered another long mountinfo parsing too).

Reasons for still RFC: (ab)use of rate limiting during boot, new APIs need docs, needs renewed measurement.

poettering · 2024-02-15T11:02:16Z

src/core/mount.c

+                                goto fail;
+                        }
+
+                        r = sd_event_source_set_priority(m->mount_enumerate_source, SD_EVENT_PRIORITY_IDLE);


idle seems very low for this. also, please use the EVENT_PRIORITY_XYZ enum that now is add the end of manager.h

poettering · 2024-02-15T11:02:28Z

src/core/mount.c

@@ -2106,11 +2124,27 @@ static void mount_enumerate(Manager *m) {
                }

                (void) sd_event_source_set_description(m->mount_event_source, "mount-monitor-dispatch");
+
+                if (!MANAGER_IS_SYSTEM(m)) {
+                        r = sd_event_add_post(m->event, &m->mount_enumerate_source, mount_enumerate_late, m);


there should be an explanation here that explains the why

poettering · 2024-02-15T11:03:49Z

src/core/mount.c

-        r = mount_load_proc_self_mountinfo(m, false);
-        if (r < 0)
-                goto fail;
+        if (MANAGER_IS_SYSTEM(m)) {


this conditionalization sounds really weird to me. if this is safe to do, why wouldn't this be safe for the system too?

poettering · 2024-02-15T11:04:14Z

src/core/mount.c

+                                log_debug("Invalid value in $SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_INTERVAL, ignoring: %s", e);
+                }
+
+


spurious double empty line

poettering · 2024-02-15T11:04:47Z

src/core/mount.c

@@ -2063,7 +2063,9 @@ static void mount_enumerate(Manager *m) {
        mnt_init_debug(0);


this 2nd commit should really be a separate PR. it should be fine to merge independently.

poettering · 2024-02-15T11:05:27Z

src/libsystemd/sd-event/sd-event.c

@@ -5214,6 +5219,27 @@ _public_ int sd_event_source_leave_ratelimit(sd_event_source *s) {
        return 1; /* tell caller that we indeed just left the ratelimit state */
 }

+_public_ int sd_event_source_enter_ratelimit(sd_event_source *s) {


maybe call this "force_ratelimit"?

poettering · 2024-02-15T11:07:05Z

src/libsystemd/sd-event/sd-event.c

@@ -3354,6 +3354,10 @@ static int event_source_enter_ratelimited(sd_event_source *s) {
        if (s->ratelimited)
                return 0; /* Already ratelimited, this is a NOP hence */

+        if (!exceeded) {
+                s->rate_limit.begin = now(CLOCK_MONOTONIC);
+        }


coding style, no {} arund single line if block

also, why not just do this in sd_event_source_enter_ratelimit(), i.e. in the caller?

poettering · 2024-02-15T11:07:17Z

src/libsystemd/sd-event/sd-event.c

@@ -3379,7 +3383,8 @@ static int event_source_enter_ratelimited(sd_event_source *s) {

        event_source_pp_prioq_reshuffle(s);

-        log_debug("Event source %p (%s) entered rate limit state.", s, strna(s->description));
+        if (exceeded)
+                log_debug("Event source %p (%s) entered rate limit state.", s, strna(s->description));


why conditionalize this?

That's to distinguish source limiting based on real even rate vs artificial via explicit call.

poettering · 2024-02-15T11:07:27Z

src/libsystemd/libsystemd.sym

+
+LIBSYSTEMD_256 {
+global:
+        sd_event_source_enter_ratelimit;


also needs man page addition

poettering · 2024-02-15T11:18:27Z

So I am a bit puzzled about this: this is hacking around shitty kernel interfaces, on systems that are just terribly set up: have huge mount tables in the main mount namespace.

What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace? And how is that defensible? The linux kernel apis are just too bad with this, /proc/self/mountinfo is the best thing we got after all, and it just doesn't scale with such a setup. Hence the right approach to this is always to me: don#t do this! just don't.

must of the time similar issues came up before this was just crappy cloud software that didn't know how to do mount namespacing properly, and just littered the root mount namespace with an essentially unbounded number of mounts.

I mean, there's work going on in the kernel to fix this properly, i.e. get better APIs for enumeration and for notification of mount table changes. With that in place we can fix things properly, but until then this seems like working around bad apis in light of just a completely fucked setup in the first place.

anyway, trying to understand where this comes from?

or to say this differently: instead of adding hacks like this, maybe let's first port things over to the new listmount() syscall in 6.8. And only when that doesn't help revisit hacks like this?

I am also very conservative with hacks like this that only affect the --user instance, and aren't applied to --system too. This raises alarm bells to me, as this means that behaviour of the two insrances will quite needlessly deviate, and be tested differently. And possibly break, since people will make changes testing things in testing mode and not notice that user mode operates quite differently.

hence to summarize:

a hack like this needs a really strong reason
it needs to be generic
should only be a last resort, i.e. after listmount() has been proven to not suffice. given that listmount() is released now I think this is definitely where we should put the focus on for now first. Would probably mean prepping a PR for libmount first.

i think the env var thing and the force ratelimit stuff have merit on their own. if you want to clean those up, would be willing to merge that independently.

Werkov · 2024-02-15T16:28:57Z

What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace?

It's a system with hundreds of containers, a few mounts per container which adds up.

And how is that defensible?

(It's customer's system.) I don't have details whether all the mounts can be mnt ns privatized/hidden.

instead of adding hacks like this, maybe let's first port things over to the new listmount() syscall in 6.8. And only when that doesn't help revisit hacks like this?

Ah, I didn't notice that work has landed already. Let me check how it could work (and postpone this RFC till then.)

only affect the --user instance, and aren't applied to --system too. This raises alarm bells to me, as this means that behaviour of the two insrances will quite needlessly deviate

IIUC, user instance only watches mounts and it creates .mount unit representations of global (system instance) mounts but it is not an active party wrt mounts setup. I added the condition to exclude system instance as I was afraid of possible deadlocks. (Which is possibly tackled in current version with time-limited deferal of mounts processing.)

poettering · 2024-02-16T16:30:35Z

What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace?

It's a system with hundreds of containers, a few mounts per container which adds up.

but why would those leak into the host?

poettering · 2024-02-16T16:31:57Z

IIUC, user instance only watches mounts and it creates .mount unit representations of global (system instance) mounts but it is not an active party wrt mounts setup. I added the condition to exclude system instance as I was afraid of possible deadlocks. (Which is possibly tackled in current version with time-limited deferal of mounts processing.)

user instances can add their own mounts if they want (if /bin/mount allows them to be established by unpriv clients), and they can wait for system mounts too if they like or act on them.

Of course, this is typically not done, but it's certainly possible.

Werkov added the scalability label Nov 1, 2023

yuwata requested changes Nov 2, 2023

View reviewed changes

src/core/manager.h Show resolved Hide resolved

yuwata added pid1 user-session ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Nov 2, 2023

Werkov force-pushed the deferred-mountinfo branch from a603be2 to d105b40 Compare November 2, 2023 15:38

github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Nov 2, 2023

Werkov force-pushed the deferred-mountinfo branch from d105b40 to a7f312f Compare November 2, 2023 15:40

yuwata reviewed Nov 6, 2023

View reviewed changes

src/core/mount.c Outdated Show resolved Hide resolved

src/core/mount.c Outdated Show resolved Hide resolved

keszybz reviewed Nov 8, 2023

View reviewed changes

keszybz requested a review from poettering November 8, 2023 08:15

yuwata requested changes Nov 11, 2023

View reviewed changes

src/core/mount.c Outdated Show resolved Hide resolved

src/core/mount.c Outdated Show resolved Hide resolved

yuwata added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels Nov 11, 2023

Werkov added 4 commits January 26, 2024 19:28

sd-event: Add explicit API for entering ratelimit state

c3d270a

This complements commit 2fdc274 ("sd-event: add an explicit API for leaving the ratelimit state") so that users can create a rate limited from the beginning.

Werkov force-pushed the deferred-mountinfo branch from a7f312f to 502f8d3 Compare January 26, 2024 18:29

github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Jan 26, 2024

Werkov force-pushed the deferred-mountinfo branch from 502f8d3 to 12e8a2a Compare January 26, 2024 18:30

github-actions bot added the sd-event label Jan 26, 2024

Werkov mentioned this pull request Feb 2, 2024

Parsing thousands of mounts blocks processing of systemd communication leading to dhcp lease expiry and more #31137

Open

poettering added the needs-rebase label Feb 15, 2024

poettering requested changes Feb 15, 2024

View reviewed changes

poettering added needs-discussion 🤔 reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks postponed and removed please-review PR is ready for (re-)review by a maintainer labels Feb 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC manager: Add deferred parsing of mountinfo #29821

RFC manager: Add deferred parsing of mountinfo #29821

Werkov commented Nov 1, 2023

yuwata left a comment

yuwata left a comment

keszybz left a comment

keszybz commented Nov 8, 2023

yuwata left a comment

Werkov commented Jan 26, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

poettering Feb 15, 2024

Werkov Feb 15, 2024

poettering Feb 15, 2024

poettering commented Feb 15, 2024

Werkov commented Feb 15, 2024 •

edited

poettering commented Feb 16, 2024

poettering commented Feb 16, 2024

		log_debug("Invalid value in $SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_INTERVAL, ignoring: %s", e);
		}

		@@ -2063,7 +2063,9 @@ static void mount_enumerate(Manager *m) {
		mnt_init_debug(0);

RFC manager: Add deferred parsing of mountinfo #29821

Are you sure you want to change the base?

RFC manager: Add deferred parsing of mountinfo #29821

Conversation

Werkov commented Nov 1, 2023

yuwata left a comment

Choose a reason for hiding this comment

yuwata left a comment

Choose a reason for hiding this comment

keszybz left a comment

Choose a reason for hiding this comment

keszybz commented Nov 8, 2023

yuwata left a comment

Choose a reason for hiding this comment

Werkov commented Jan 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poettering commented Feb 15, 2024

Werkov commented Feb 15, 2024 • edited

poettering commented Feb 16, 2024

poettering commented Feb 16, 2024

Werkov commented Feb 15, 2024 •

edited