-
-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core/mount: minimize impact on mount storm. #10268
Conversation
The description sounds like it fixes #8703 |
hmm, i wish the kernel had better APIs for this, i.e. proper notifications for mounts being added or going away. I am a bit puzzled by this though: why would this take 20s to process for systemd? i mean, the poll events of /proc/self/mountinfo should be coalesced, so that after the last mount is established it should take only a short time until PID 1 caught up. i.e. making 2,000 changes or 2,000,000 in a short time window should not really matter much as the kernel API should coalesce the POLLIN events for those events and systemd should process /proc/self/mountinfo only a few times while the mounts are being established and a last time immediately after the last was added. Or is this not so much about the mount events being triggered but about the large size of mount table after all mounts have been established? If that's the case then this suggests that some processing logic when reading the mount table in systemd is O(n^2) or so (instead of the expected O(n)). Quite frankly that's totally possible, but I think we should really work on optimizing that then... or are you saying that because systemd needs to process the mount table on each POLLIN with O(n) this simply takes away CPU time so much that your loop around mount() gets too little CPU time? i.e. that this is caused by CPU time starvation in the loop? |
I agree about the poor api. Improvements might be coming (https://lwn.net/Articles/760714/) but they are still a long way off I think. Parsing mountinfo does take longer as the file grows, but it looks linear. It does seem to do quite a lot of work (from eyeballing strace output) so there probably is room for improvement there. I can see it taking 4 seconds to parse when getting close to 2000 entries. I think the problem is a combination of the parsing taking quite a lot of time, and the high rate of mount events making it that systemd is always runnable. If the mount loop is run with a 'nice' of -20 it runs much faster as systemd doesn't keep interrupting it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conceptually lgtm, but a few comments
(and yupp, this all feels like a big hack around crappy kernel APIs and crappy kernel CPU scheduling, if it's so easy to starve out other processes...)
src/core/manager.h
Outdated
@@ -224,6 +224,9 @@ struct Manager { | |||
/* Data specific to the mount subsystem */ | |||
struct libmnt_monitor *mount_monitor; | |||
sd_event_source *mount_event_source; | |||
sd_event_source *mount_timeout_source; | |||
uint64_t mount_last_read; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please uniformly use "usec_t" for time values. also consider suffixing the variable names with _usec
or so, that it's always clear it's a time value and in which unit is used...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used uint64_t for absolute times because that it what sd_event_add_time() expects. But I'll change to usec_t.
src/core/mount.c
Outdated
@@ -53,6 +53,7 @@ static const UnitActiveState state_translation_table[_MOUNT_STATE_MAX] = { | |||
|
|||
static int mount_dispatch_timer(sd_event_source *source, usec_t usec, void *userdata); | |||
static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, void *userdata); | |||
static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's a timeout , then it's not "io", hence shouldn't carry the name "io"...
I think a better name would be "mount_dispatch_proc_self_mountinfo_timer()" or so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are going to do some io (exactly the same way that mount_dispatch_io does) because there was a timeout... But I can also see that value of your suggestion - I'll go with that.
src/core/mount.c
Outdated
@@ -1765,6 +1766,27 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, | |||
Iterator i; | |||
Unit *u; | |||
int r; | |||
uint64_t this_run = now(CLOCK_MONOTONIC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above: please use usec_t...
Also, according to our CODING_STYLE we don't like mixing variable declarations and function invocations in one line...
src/core/mount.c
Outdated
@@ -1765,6 +1766,27 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, | |||
Iterator i; | |||
Unit *u; | |||
int r; | |||
uint64_t this_run = now(CLOCK_MONOTONIC); | |||
|
|||
if (source && this_run < m->mount_last_read + m->mount_last_duration * 10) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is going to explode if mount_last_duration has not been initialized yet, no?
src/core/mount.c
Outdated
/* If we run too often we can steal CPU from | ||
* the process that is mounting/unmounting things. | ||
*/ | ||
sd_event_source_set_enabled(source, SD_EVENT_OFF); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs some error handling (even if it is just logging someithing at log_warning() level and then ignoring it
src/core/mount.c
Outdated
} | ||
|
||
/* If an error occurs, assume 10ms */ | ||
m->mount_last_duration = 10000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please write as 10 * USEC_PER_MSEC
src/core/mount.c
Outdated
@@ -1891,9 +1913,21 @@ static int mount_dispatch_io(sd_event_source *source, int fd, uint32_t revents, | |||
device_found_node(m, what, 0, DEVICE_FOUND_MOUNT); | |||
} | |||
|
|||
m->mount_last_read = this_run; | |||
m->mount_last_duration = now(CLOCK_MONOTONIC) - this_run; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usec_sub()
src/core/mount.c
Outdated
static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata) { | ||
Manager *m = userdata; | ||
|
||
sd_event_source_set_enabled(m->mount_event_source, SD_EVENT_ON); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
error handling
src/core/mount.c
Outdated
sd_event_source_set_enabled(m->mount_event_source, SD_EVENT_ON); | ||
m->mount_timeout_source = sd_event_source_unref(source); | ||
return mount_dispatch_io(NULL, mnt_monitor_get_fd(m->mount_monitor), | ||
EPOLLIN, userdata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't say i am fan of invoking IO handlers with a NULL source here. that's just ugly... can't we find a different way? maybe just move the processing of the /proc/self/mountinfo changes into a new call of its own, that gets clean arguments
src/core/mount.c
Outdated
return 0; | ||
} | ||
|
||
static int mount_dispatch_timeout_io(sd_event_source *source, usec_t usec, void *userdata) { | ||
Manager *m = userdata; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
somewhere here there should be a big comment explaining the situation and why we do this all. i.e. that this is about CPU starving out other processes when a mount storm takes place. And that this is a hack around bad kernel APIs and CPU scheduling
ce6d415
to
463bd36
Compare
Thanks for the thorough review. I think I have addressed everything you raised. |
src/core/mount.c
Outdated
@@ -1734,6 +1735,8 @@ static void mount_enumerate(Manager *m) { | |||
goto fail; | |||
} | |||
|
|||
m->mount_last_read_usec = 0; | |||
m->mount_last_duration_usec = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be necessary. All our objects when allocated are memset() to zero. Hence the Manager object is too
src/core/mount.c
Outdated
int r; | ||
|
||
if (now(CLOCK_MONOTONIC) < usec_add(m->mount_last_read_usec, | ||
m->mount_last_duration_usec * 10)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not that it matters, but given that we do the same usec_add() invocation again a few lines down, maybe just store the result in a variable here and reuse it?
CI failed, and it looks like this is caused by this commit:
|
That CI failure must have been because mount_dispatch_proc_self_mountinfo_timer() ran after mount_shutdown(). I need to sd_event_source_unref(m->mount_time_source) in mount_shutdown(). |
463bd36
to
86d3296
Compare
Patch looks great, but unfortunately it needs a rebase right now to be mergeable. Can you rebase, please? |
If we create 2000 mounts (on a 1-CPU qemu VM) with mkdir -p /MNT/{1..2000} time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done it takes around 20 seconds to complete. Much of this time is taken up by systemd repeatedly processing /proc/self/mountinfo. If I disable the processing, the time drops to about 4 seconds. I have reports that on a larger system with multiple active user sessions, each with it's own systemd, the impact can be higher. One particular use-case where a large number of mounts can be expected in quick succession is when the "clearcase" SCM starts up. This patch modifies the handling up events from /proc/self/mountinfo so that systemd backs off when a storm is detected. Specifically the time to process mountinfo is measured, and the process will not be repeated until 10 times that duration has passed. This ensures systemd won't use more than 10% of real time processing mountinfo. With this patch, my test above takes about 5 seconds.
86d3296
to
314515a
Compare
@neilbrown This is bad, like TOTALLY bad, results in unbootable system, maybe something went wrong with the rebase? systemd falls back to an emergency shell.
/etc/fstab is simple:
Reverting this PR results in a working system
@poettering please consider reverting this unless there is a trivial fixas this PR is broken in runtime. edit: typos |
I suspect the patch subtly breaks this assumption: |
Nope, won't even build, in mount_spawn m is Mount , not Manager
If you could provide patches I would gladly test those. |
m->meta.manager->mount_last_read_usec = 0; should compile. |
I've just added a patch to my neilbrown/systemd/mount-storm branch. It is commit bc59f84 |
I'm surprised this wasn't caught by the CI... |
@neilbrown Nice, that patch does fix the issue. Tested 3 times, clean boot. TYVM @mbiebl maybe it triggered local-fs.target failure due to the fact that it's an UEFI system with /boot/efi mount forced in fstab and that I have /home as a separate fs, dunno didn't check what CI does. In case someone is interested - full log from a failing boot (without the patch linked by neilbrown) |
Hmmm... I'm guessing I need to open a new pull request... |
If create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.
I have reports that on a larger system with multiple active user sessions, each
with it's own systemd, the impact can be higher.
One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.
This patch modifies the handling up events from /proc/self/mountinfo so
that systemd backs of when a storm is detected. Specifically the time to process
mountinfo is measured, and the process will not be repeated until 10 times
that duration has passed. This ensure systemd won't use more than 10% of
a CPU processing mountinfo.
With this patch, my test aboce takes about 5 seconds.