Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC manager: Add deferred parsing of mountinfo #29821

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Werkov
Copy link
Contributor

@Werkov Werkov commented Nov 1, 2023

Manager start (measured via readiness notification) takes time that is proportionate to number of entries in /proc/self/mountinfo since all mount units must be enumerated (CPU profile of a starting user instance).

This is problematic for user instances because:

  • they can start at times when lots of entries are created already,
  • their start latency propagates directly to user sessions establishment via pam_systemd.

Alleviate this latency by postponing full mountinfo parsing after the manager notifies its readiness. The reasoning is that mount units that are somehow relevant during startup of the user instance are referenced by other units and they would be loaded on demand anyway. [1] The complete set of all mounts can be loaded later.

Use 'post' event source (instead of 'defer') since there is no point in checking readiness in event loop when no other event was processed.

If a mount happens concurrently to the startup (mountinfo event), manager will parse it completely, that is life. Fortunately, user-runtime-dir@.service is ordered before user@.service startup so it is safe from self-notifications.

Rough measurements with 3000 mount units:

time systemctl start user@1000

before
real 0m1.358s
user 0m0.005s
sys 0m0.013s

after
real 0m0.351s
user 0m0.008s
sys 0m0.010s

[1] There is a catch though (thus RFC) -- such mount units may be loaded on demand but their state may be outdated (I guess). The solution could be deferring parsing the mountinfo only for certain time and then launch in unconditionally. (Ratelimited event source could come handy?)

Copy link
Member

@yuwata yuwata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, maybe it is better to set $SYSTEMD_SUPPORT_MOUNT=no for such systems or setups.

src/core/manager.h Show resolved Hide resolved
@yuwata yuwata added pid1 user-session ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Nov 2, 2023
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed ci-fails/needs-rework 🔥 Please rework this, the CI noticed an issue with the PR labels Nov 2, 2023
Copy link
Member

@yuwata yuwata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to call systemctl --user daemon-reexec or so with this change?

Still I think $SYSTEMD_SUPPORT_MOUNT=no should be used for such cases.

src/core/mount.c Outdated Show resolved Hide resolved
src/core/mount.c Outdated Show resolved Hide resolved
Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sounds reasonable. Essentially, we treat mounts as asynchronous and order them a bit later. This changes semantics a bit, but I think it should be fine for the user manager.

@keszybz
Copy link
Member

keszybz commented Nov 8, 2023

Instead, maybe it is better to set $SYSTEMD_SUPPORT_MOUNT=no for such systems or setups.

$SYSTEMD_SUPPORT_MOUNT=no has different semantics: .mount units cannot be used at all. I can imagine that in such a setup with many disk, users might actually depend on the mount units to trigger actions.

Copy link
Member

@yuwata yuwata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several more superficial comments.

src/core/mount.c Outdated Show resolved Hide resolved
src/core/mount.c Outdated Show resolved Hide resolved
@yuwata yuwata added reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks and removed please-review PR is ready for (re-)review by a maintainer labels Nov 11, 2023
Manager start (measured via readiness notification) takes time that is
proportionate to number of entries in /proc/self/mountinfo since all
mount units must be enumerated.

This is problematic for user instances because:
a) they can start at times when lots of entries are created already,
b) their start latency propagates directly to user sessions
   establishment via pam_systemd.

Alleviate this latency by postponing full mountinfo parsing after the
manager notifies its readiness. The reasoning is that mount units that
are somehow relevant during startup of the user instance are referenced
by other units and they would be loaded on demand anyway. The complete
set of all mounts can be loaded later.

If we used plain 'defer' source, it would be fetched from event queue
before manager_check_finished() can be called (hence mountinfo parsing
still delays manager's apparent readiness). Use 'post' source that can
trigger after readiness notification.

If a mount happens concurrently to the startup, manager will parse it
completely, that is life. Fortunately, user-runtime-dir@.service is
ordered before user@.service startup so it is safe from self (mount)
notifications.

Rough measurements with 3000 mount units: (TODO re-measure on main)
> time systemctl start user@1000
before
        real	0m1.358s
        user	0m0.005s
        sys	0m0.013s

after
        real	0m0.351s
        user	0m0.008s
        sys	0m0.010s
A thottled event source will eventually invoke an expiration callback.
The current implementation sets expiration equal to the rate limiting
interval. Make that variable configuratble, so the expiration timer can
be controller too.
This complements commit 2fdc274 ("sd-event: add an explicit API for
leaving the ratelimit state") so that users can create a rate limited
from the beginning.
…rrent mounts

When user instance starts on a busy system, its start would be delayed
by parsing large mountinfo upon processing mount events. Most such
events are irrelevant for user instance's startup, therefore postpone
their processing with event rate-limiting mechanism. Thanks to the
commit edc027b ("mount: retrigger run queue after ratelimit expired
to run delayed mount start jobs") damped relevant events will not hold
back indefinitely the startup process.
@github-actions github-actions bot added please-review PR is ready for (re-)review by a maintainer and removed reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Jan 26, 2024
@Werkov
Copy link
Contributor Author

Werkov commented Jan 26, 2024

@yuwata thanks for the review.
1st force push -- rebase on main
2nd force push -- your review fixups
3nd push -- another correction to handle not only high number of mounts but also high "traffic" of mounts (which could have triggered another long mountinfo parsing too).

Reasons for still RFC: (ab)use of rate limiting during boot, new APIs need docs, needs renewed measurement.

goto fail;
}

r = sd_event_source_set_priority(m->mount_enumerate_source, SD_EVENT_PRIORITY_IDLE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idle seems very low for this. also, please use the EVENT_PRIORITY_XYZ enum that now is add the end of manager.h

@@ -2106,11 +2124,27 @@ static void mount_enumerate(Manager *m) {
}

(void) sd_event_source_set_description(m->mount_event_source, "mount-monitor-dispatch");

if (!MANAGER_IS_SYSTEM(m)) {
r = sd_event_add_post(m->event, &m->mount_enumerate_source, mount_enumerate_late, m);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be an explanation here that explains the why

r = mount_load_proc_self_mountinfo(m, false);
if (r < 0)
goto fail;
if (MANAGER_IS_SYSTEM(m)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this conditionalization sounds really weird to me. if this is safe to do, why wouldn't this be safe for the system too?

log_debug("Invalid value in $SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_INTERVAL, ignoring: %s", e);
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious double empty line

@@ -2063,7 +2063,9 @@ static void mount_enumerate(Manager *m) {
mnt_init_debug(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this 2nd commit should really be a separate PR. it should be fine to merge independently.

@@ -5214,6 +5219,27 @@ _public_ int sd_event_source_leave_ratelimit(sd_event_source *s) {
return 1; /* tell caller that we indeed just left the ratelimit state */
}

_public_ int sd_event_source_enter_ratelimit(sd_event_source *s) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this "force_ratelimit"?

@@ -3354,6 +3354,10 @@ static int event_source_enter_ratelimited(sd_event_source *s) {
if (s->ratelimited)
return 0; /* Already ratelimited, this is a NOP hence */

if (!exceeded) {
s->rate_limit.begin = now(CLOCK_MONOTONIC);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding style, no {} arund single line if block

also, why not just do this in sd_event_source_enter_ratelimit(), i.e. in the caller?

@@ -3379,7 +3383,8 @@ static int event_source_enter_ratelimited(sd_event_source *s) {

event_source_pp_prioq_reshuffle(s);

log_debug("Event source %p (%s) entered rate limit state.", s, strna(s->description));
if (exceeded)
log_debug("Event source %p (%s) entered rate limit state.", s, strna(s->description));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why conditionalize this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's to distinguish source limiting based on real even rate vs artificial via explicit call.


LIBSYSTEMD_256 {
global:
sd_event_source_enter_ratelimit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also needs man page addition

@poettering
Copy link
Member

So I am a bit puzzled about this: this is hacking around shitty kernel interfaces, on systems that are just terribly set up: have huge mount tables in the main mount namespace.

What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace? And how is that defensible? The linux kernel apis are just too bad with this, /proc/self/mountinfo is the best thing we got after all, and it just doesn't scale with such a setup. Hence the right approach to this is always to me: don#t do this! just don't.

must of the time similar issues came up before this was just crappy cloud software that didn't know how to do mount namespacing properly, and just littered the root mount namespace with an essentially unbounded number of mounts.

I mean, there's work going on in the kernel to fix this properly, i.e. get better APIs for enumeration and for notification of mount table changes. With that in place we can fix things properly, but until then this seems like working around bad apis in light of just a completely fucked setup in the first place.

anyway, trying to understand where this comes from?

or to say this differently: instead of adding hacks like this, maybe let's first port things over to the new listmount() syscall in 6.8. And only when that doesn't help revisit hacks like this?

I am also very conservative with hacks like this that only affect the --user instance, and aren't applied to --system too. This raises alarm bells to me, as this means that behaviour of the two insrances will quite needlessly deviate, and be tested differently. And possibly break, since people will make changes testing things in testing mode and not notice that user mode operates quite differently.

hence to summarize:

  1. a hack like this needs a really strong reason
  2. it needs to be generic
  3. should only be a last resort, i.e. after listmount() has been proven to not suffice. given that listmount() is released now I think this is definitely where we should put the focus on for now first. Would probably mean prepping a PR for libmount first.

i think the env var thing and the force ratelimit stuff have merit on their own. if you want to clean those up, would be willing to merge that independently.

@poettering poettering added needs-discussion 🤔 reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks postponed and removed please-review PR is ready for (re-)review by a maintainer labels Feb 15, 2024
@Werkov
Copy link
Contributor Author

Werkov commented Feb 15, 2024

What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace?

It's a system with hundreds of containers, a few mounts per container which adds up.

And how is that defensible?

(It's customer's system.) I don't have details whether all the mounts can be mnt ns privatized/hidden.

instead of adding hacks like this, maybe let's first port things over to the new listmount() syscall in 6.8. And only when that doesn't help revisit hacks like this?

Ah, I didn't notice that work has landed already. Let me check how it could work (and postpone this RFC till then.)

only affect the --user instance, and aren't applied to --system too. This raises alarm bells to me, as this means that behaviour of the two insrances will quite needlessly deviate

IIUC, user instance only watches mounts and it creates .mount unit representations of global (system instance) mounts but it is not an active party wrt mounts setup. I added the condition to exclude system instance as I was afraid of possible deadlocks. (Which is possibly tackled in current version with time-limited deferal of mounts processing.)

@poettering
Copy link
Member

What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace?

It's a system with hundreds of containers, a few mounts per container which adds up.

but why would those leak into the host?

@poettering
Copy link
Member

IIUC, user instance only watches mounts and it creates .mount unit representations of global (system instance) mounts but it is not an active party wrt mounts setup. I added the condition to exclude system instance as I was afraid of possible deadlocks. (Which is possibly tackled in current version with time-limited deferal of mounts processing.)

user instances can add their own mounts if they want (if /bin/mount allows them to be established by unpriv clients), and they can wait for system mounts too if they like or act on them.

Of course, this is typically not done, but it's certainly possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

4 participants