-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC manager: Add deferred parsing of mountinfo #29821
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead, maybe it is better to set $SYSTEMD_SUPPORT_MOUNT=no
for such systems or setups.
a603be2
to
d105b40
Compare
d105b40
to
a7f312f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it safe to call systemctl --user daemon-reexec
or so with this change?
Still I think $SYSTEMD_SUPPORT_MOUNT=no
should be used for such cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this sounds reasonable. Essentially, we treat mounts as asynchronous and order them a bit later. This changes semantics a bit, but I think it should be fine for the user manager.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several more superficial comments.
Manager start (measured via readiness notification) takes time that is proportionate to number of entries in /proc/self/mountinfo since all mount units must be enumerated. This is problematic for user instances because: a) they can start at times when lots of entries are created already, b) their start latency propagates directly to user sessions establishment via pam_systemd. Alleviate this latency by postponing full mountinfo parsing after the manager notifies its readiness. The reasoning is that mount units that are somehow relevant during startup of the user instance are referenced by other units and they would be loaded on demand anyway. The complete set of all mounts can be loaded later. If we used plain 'defer' source, it would be fetched from event queue before manager_check_finished() can be called (hence mountinfo parsing still delays manager's apparent readiness). Use 'post' source that can trigger after readiness notification. If a mount happens concurrently to the startup, manager will parse it completely, that is life. Fortunately, user-runtime-dir@.service is ordered before user@.service startup so it is safe from self (mount) notifications. Rough measurements with 3000 mount units: (TODO re-measure on main) > time systemctl start user@1000 before real 0m1.358s user 0m0.005s sys 0m0.013s after real 0m0.351s user 0m0.008s sys 0m0.010s
A thottled event source will eventually invoke an expiration callback. The current implementation sets expiration equal to the rate limiting interval. Make that variable configuratble, so the expiration timer can be controller too.
This complements commit 2fdc274 ("sd-event: add an explicit API for leaving the ratelimit state") so that users can create a rate limited from the beginning.
…rrent mounts When user instance starts on a busy system, its start would be delayed by parsing large mountinfo upon processing mount events. Most such events are irrelevant for user instance's startup, therefore postpone their processing with event rate-limiting mechanism. Thanks to the commit edc027b ("mount: retrigger run queue after ratelimit expired to run delayed mount start jobs") damped relevant events will not hold back indefinitely the startup process.
a7f312f
to
502f8d3
Compare
502f8d3
to
12e8a2a
Compare
@yuwata thanks for the review. Reasons for still RFC: (ab)use of rate limiting during boot, new APIs need docs, needs renewed measurement. |
goto fail; | ||
} | ||
|
||
r = sd_event_source_set_priority(m->mount_enumerate_source, SD_EVENT_PRIORITY_IDLE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idle seems very low for this. also, please use the EVENT_PRIORITY_XYZ enum that now is add the end of manager.h
@@ -2106,11 +2124,27 @@ static void mount_enumerate(Manager *m) { | |||
} | |||
|
|||
(void) sd_event_source_set_description(m->mount_event_source, "mount-monitor-dispatch"); | |||
|
|||
if (!MANAGER_IS_SYSTEM(m)) { | |||
r = sd_event_add_post(m->event, &m->mount_enumerate_source, mount_enumerate_late, m); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should be an explanation here that explains the why
r = mount_load_proc_self_mountinfo(m, false); | ||
if (r < 0) | ||
goto fail; | ||
if (MANAGER_IS_SYSTEM(m)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this conditionalization sounds really weird to me. if this is safe to do, why wouldn't this be safe for the system too?
log_debug("Invalid value in $SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_INTERVAL, ignoring: %s", e); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spurious double empty line
@@ -2063,7 +2063,9 @@ static void mount_enumerate(Manager *m) { | |||
mnt_init_debug(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this 2nd commit should really be a separate PR. it should be fine to merge independently.
@@ -5214,6 +5219,27 @@ _public_ int sd_event_source_leave_ratelimit(sd_event_source *s) { | |||
return 1; /* tell caller that we indeed just left the ratelimit state */ | |||
} | |||
|
|||
_public_ int sd_event_source_enter_ratelimit(sd_event_source *s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this "force_ratelimit"?
@@ -3354,6 +3354,10 @@ static int event_source_enter_ratelimited(sd_event_source *s) { | |||
if (s->ratelimited) | |||
return 0; /* Already ratelimited, this is a NOP hence */ | |||
|
|||
if (!exceeded) { | |||
s->rate_limit.begin = now(CLOCK_MONOTONIC); | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coding style, no {} arund single line if block
also, why not just do this in sd_event_source_enter_ratelimit(), i.e. in the caller?
@@ -3379,7 +3383,8 @@ static int event_source_enter_ratelimited(sd_event_source *s) { | |||
|
|||
event_source_pp_prioq_reshuffle(s); | |||
|
|||
log_debug("Event source %p (%s) entered rate limit state.", s, strna(s->description)); | |||
if (exceeded) | |||
log_debug("Event source %p (%s) entered rate limit state.", s, strna(s->description)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why conditionalize this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's to distinguish source limiting based on real even rate vs artificial via explicit call.
|
||
LIBSYSTEMD_256 { | ||
global: | ||
sd_event_source_enter_ratelimit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also needs man page addition
So I am a bit puzzled about this: this is hacking around shitty kernel interfaces, on systems that are just terribly set up: have huge mount tables in the main mount namespace. What kind of a system are we even talking about here? What kind of a system has 3000 mounts in the root mount namespace? And how is that defensible? The linux kernel apis are just too bad with this, /proc/self/mountinfo is the best thing we got after all, and it just doesn't scale with such a setup. Hence the right approach to this is always to me: don#t do this! just don't. must of the time similar issues came up before this was just crappy cloud software that didn't know how to do mount namespacing properly, and just littered the root mount namespace with an essentially unbounded number of mounts. I mean, there's work going on in the kernel to fix this properly, i.e. get better APIs for enumeration and for notification of mount table changes. With that in place we can fix things properly, but until then this seems like working around bad apis in light of just a completely fucked setup in the first place. anyway, trying to understand where this comes from? or to say this differently: instead of adding hacks like this, maybe let's first port things over to the new listmount() syscall in 6.8. And only when that doesn't help revisit hacks like this? I am also very conservative with hacks like this that only affect the --user instance, and aren't applied to --system too. This raises alarm bells to me, as this means that behaviour of the two insrances will quite needlessly deviate, and be tested differently. And possibly break, since people will make changes testing things in testing mode and not notice that user mode operates quite differently. hence to summarize:
i think the env var thing and the force ratelimit stuff have merit on their own. if you want to clean those up, would be willing to merge that independently. |
It's a system with hundreds of containers, a few mounts per container which adds up.
(It's customer's system.) I don't have details whether all the mounts can be mnt ns privatized/hidden.
Ah, I didn't notice that work has landed already. Let me check how it could work (and postpone this RFC till then.)
IIUC, user instance only watches mounts and it creates .mount unit representations of global (system instance) mounts but it is not an active party wrt mounts setup. I added the condition to exclude system instance as I was afraid of possible deadlocks. (Which is possibly tackled in current version with time-limited deferal of mounts processing.) |
but why would those leak into the host? |
user instances can add their own mounts if they want (if /bin/mount allows them to be established by unpriv clients), and they can wait for system mounts too if they like or act on them. Of course, this is typically not done, but it's certainly possible. |
Manager start (measured via readiness notification) takes time that is proportionate to number of entries in
/proc/self/mountinfo
since all mount units must be enumerated (CPU profile of a starting user instance).This is problematic for user instances because:
Alleviate this latency by postponing full mountinfo parsing after the manager notifies its readiness. The reasoning is that mount units that are somehow relevant during startup of the user instance are referenced by other units and they would be loaded on demand anyway. [1] The complete set of all mounts can be loaded later.
Use 'post' event source (instead of 'defer') since there is no point in checking readiness in event loop when no other event was processed.
If a mount happens concurrently to the startup (mountinfo event), manager will parse it completely, that is life. Fortunately,
user-runtime-dir@.service
is ordered beforeuser@.service
startup so it is safe from self-notifications.Rough measurements with 3000 mount units:
before
real 0m1.358s
user 0m0.005s
sys 0m0.013s
after
real 0m0.351s
user 0m0.008s
sys 0m0.010s
[1] There is a catch though (thus RFC) -- such mount units may be loaded on demand but their state may be outdated (I guess). The solution could be deferring parsing the mountinfo only for certain time and then launch in unconditionally. (Ratelimited event source could come handy?)