Many schedulers seem to be heavily penalized by heavy I/O, at least with bcachefs #96
I'm wondering if this happens because we are giving too much priority to per-CPU kthreads: during a massive I/O workload with fast storage drives, we may have per-CPU kernel workers that are actually more CPU-bound than IO-bound. In rustland, for example, per-CPU kthreads are directly dispatched from the kernel to the local DSQ, bypassing the user-space scheduler. So they always win over any other task, and potentially they can end up monopolizing the CPUs. Maybe something similar is happening with the other schedulers as well? Anyway, I was actually working on a patch to use the global DSQ for per-CPU kthreads (or better, for all per-CPU threads in general). I'll do some tests and create a new PR; I'll update this thread when it's ready.
@arighi FYI I'm not sure how much using the global DSQ will help, unfortunately. In the core ext.c code we automatically first try to dispatch from the global DSQ, and if we find a task that can run there we use it without invoking ops.dispatch(). What might work a bit better is to instead create a custom per-CPU DSQ that you dispatch the kthreads to, and then consume from that in ops.dispatch(), so you still have a chance to consume the remaining tasks from the
@Decave from Documentation/scheduler/sched-ext.rst:
IIUC from the doc it seems that the local DSQ wins over the global DSQ. So, let's say I have a single CPU: if I'm dispatching task1 with
TL;DR: I think you're right that dispatching to

So there are a couple things to clarify here:
Yes, that is correct. A core will first check its local DSQ to see if there are any tasks. Then it will check the global DSQ, and only after that will it invoke ops.dispatch() to ask the BPF scheduler for more tasks.
So there is a concept in sched_ext called direct dispatch. This refers to when a task is dispatched "directly" from either ops.select_cpu() or ops.enqueue().

So going back to your example, if you were to do a direct dispatch of

That said -- to tie all of that back to the example at hand -- you're correct that a task dispatched with

The crux of the issue is similar to what I alluded to above -- we're not actually getting to do things in ops.dispatch().
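To make that consume order concrete, here is a toy model in Python (purely illustrative; this is not the kernel code, and the task names are made up). It shows why a task sitting in the global DSQ gets picked before the BPF scheduler's ops.dispatch() callback is ever invoked:

```python
from collections import deque

# Toy model of the pick order described above: a CPU first looks at its
# local DSQ, then at the global DSQ, and only then invokes the BPF
# scheduler's ops.dispatch() callback.
local_dsq = deque()
global_dsq = deque()

def ops_dispatch():
    """Stand-in for ops.dispatch(): the real callback would let the BPF
    scheduler pick a task; here it just enqueues a placeholder."""
    local_dsq.append("task-from-scheduler")

def pick_next_task():
    if local_dsq:                      # 1. local DSQ always wins
        return local_dsq.popleft()
    if global_dsq:                     # 2. global DSQ is consumed next,
        return global_dsq.popleft()    #    *before* ops.dispatch() runs
    ops_dispatch()                     # 3. only now is the scheduler asked
    return local_dsq.popleft() if local_dsq else None

local_dsq.append("task1")
global_dsq.append("kthread")

print(pick_next_task())  # task1: local DSQ wins
print(pick_next_task())  # kthread: consumed from the global DSQ without
                         # ever invoking ops.dispatch()
print(pick_next_task())  # task-from-scheduler: ops.dispatch() finally runs
```

This is the starvation scenario in miniature: as long as someone keeps feeding the global DSQ, the BPF scheduler never gets a say.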
That would allow us to always call ops.dispatch().
This doesn't perform very well, but showing an example of what I meant in #96. Signed-off-by: David Vernet <void@manifault.com>
Here's an example of what I was talking about: f4a7cb2. It doesn't perform very well though :-(
@Decave thank you so much for all the details and the example! It's all clear now. About the tasks dispatched to the global DSQ, yes, we may have starvation. And I like your idea about using a per-CPU DSQ. About the poor performance, how about kicking the CPU when we're sending a task to a different one? Something like this:
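(The original snippet didn't survive the page export; below is a best-effort sketch of the idea using the sched_ext kfuncs of that era, scx_bpf_dispatch() and scx_bpf_kick_cpu(). Treat names and flags as approximate, not the exact patch.)

```c
/* Sketch: when dispatching @p to the local DSQ of another CPU, kick
 * that CPU so it reschedules right away instead of waiting for its
 * next timer tick or wakeup event.
 */
s32 cpu = scx_bpf_task_cpu(p);

scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, 0);
if (cpu != bpf_get_smp_processor_id())
        scx_bpf_kick_cpu(cpu, 0);
```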
Well, that makes the scheduler work great!
@Decave I've done some quick tests, and with the extra kick the scheduler doesn't seem bad at all. I'll do more tests tomorrow morning, but I think we may have a solution, at least for my other cpumask/affinity issue,
This doesn't perform very well, but showing an example of what I meant in sched-ext#96. Signed-off-by: David Vernet <void@manifault.com> [ add kick cpu to improve responsiveness ] Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
I have tried out #99 against my system, and it does seem to work for a while, but then randomly devolves into horrible stuttering, both moving my mouse cursor and moving windows around the desktop.
Sorry @kode54, #99 doesn't yet address any IO pressure issue, but it was required for a follow-up patch that I'm planning to test (the idea is to send all kthreads to the user-space scheduler, just like the regular tasks), which may help handle the situation. But thanks for testing this one, and good to know it helped a bit with this issue. I'll keep you informed when I have something ready for testing.
@kode54 can you test #99 now? I added more changes that seem to help under IO pressure conditions (as a stress test I'm running a bunch of
Still stuttering, but in this case, the load is OBS Studio running a virtual camera and using 60-80% of a core for that, and Discord video chat using the same amount on another core, and my compositor using a steady 10% of a core or so, or slightly more if I start moving windows around.
Hm.. ok, so in this case you don't even have any heavy IO-bound task. I'm curious: do you have SMT enabled? If possible, can you try to add
SMT is enabled; disabling it did not have any appreciable effect, other than reducing me to 8 threads. Still stuttering while under CPU load. On login, Steam downloads and compiles shaders for about 20 different installed games. It seems to do this about every day.
OK, good! At least we know it's not SMT-related. With the same workload, does the stuttering also happen with other schedulers, like scx_simple, scx_rusty, etc., or is it just scx_rustland?
Under certain loads, I experienced stutter with scx_rusty and scx_nest. I can try scx_simple as well. Does it matter that the kernel I am using the patch set with is linux-cachyos? It is built using the default settings, except for using a generic CPU target instead of autodetect, and adding a bcachefs update patch set that shouldn’t be incurring a significant load versus plain 6.7.1.
It shouldn't matter: if the default scheduler works fine, then the sched-ext schedulers should also work fine to a large degree; maybe not for everything, but they shouldn't show obvious lag or stuttering, especially with workloads that are not super intense. But if even scx_simple clearly shows this problem, then maybe something odd is happening in the kernel.
scx_simple seems to be fine, so far.
ok, in this case it seems reasonable to assume that this issue has nothing to do with the kernel or sched-ext in general. So, I would suggest another test: can you try #99 again (I pushed/updated stuff, so make sure you refresh the repo) and start scx_rustland with

This would tell us whether the problem (for rustland at least) is in the priority boost part or somewhere else.
@kode54 I think I was able to reproduce the problem on my side, by starting OBS and recording my session. I've updated #99 again, adding a WIP patch (make sure to git reset the repo). I'm still not happy with this patch (hence the WIP), but it seems to make a significant difference in my case and it might fix the problem in your case as well. Can you do another test? Thanks.
I'll test shortly. Sorry I wasn't much help with the first post, I wasn't up yet.

Edit: Thought I'd drop a mention: scx_simple bogs down horribly under the load of building this package. Using the AUR package for scx-scheds-git, and my system's -j$(nproc), it builds all the Rust packages simultaneously, and each one invokes up to 16 threads of rustc at once. Will test scx_rustland now.
Was running your latest #99 of scx_rustland, and it was working mostly fine, until I attempted to rebuild the scx-scheds package again, which with my AUR defaults attempts 30-40 threads of rustc in the final stage. This led to my GPU timing out and resetting, and the reset took so long that my desktop session crashed back to login, terminating the build. Next time, I'll try forcing
hm... it should survive a 30-40 thread build; I'll try to run some parallel rustc builds on my box as well.
I will leave this issue open, but I'm ceasing testing for now, until sched-ext makes it into the upstream kernel tree in a stable release. I won't be testing any 6.8 kernels until Arch has a 6.8 kernel in the
Okay, I've returned to continue testing this, because I overcame my stupid problem. I will simply live with the issues of rolling new kernels.

Anyway: I encountered further bugs with scx_rustland, but not with scx_rusty or other BPF schedulers that aren't entirely userspace. Basically, if I have the

This build queue has immediate problems with my Radeon RX 6700-XT on Wayfire, but only with scx_rustland. It will almost immediately result in a GPU reset, which will end up failing, leaving the GPU broken until the machine is soft rebooted using SSH to log in remotely. scx_rusty, the default in

Here is a dmesg dump of the failing scx_rustland session:

And here is a dmesg dump of the successful session which followed, running on scx_rusty:
@kode54 thank you for sharing this, it's very useful for understanding what's happening! It looks like you hit a sched-ext issue rather than a rustland issue; more precisely, this warning:
I'm wondering if we need to exclude exiting / autogroup tasks in this logic, something similar to what we did in sched-ext/sched_ext@6b747e0. I'm not sure if the following patch makes any sense at all (posting here just for discussion):

Maybe if you have the time / possibility to recompile the kernel and do more tests, you can check if the problem is still happening with this patch applied. Otherwise, let's wait for the opinion of people more experienced than me in this topic, such as @htejun or @Decave.
It still reset my GPU with that patch applied. scx_rusty did not, once again. scx_rustland: dmesg.3.txt
@kode54 about rustland: I pushed some improvements yesterday that should reduce the CPU usage and improve responsiveness in general. Do you mind trying again with the latest version from the
I tried with both rusty and rustland; stuttering is pretty bad, and the wineserver process is using 20-30% more CPU according to

I don't know if it's worth noting that I was using the Steam version of the game, which is no longer available for purchase. I can try the Epic Games Store version under Heroic Games Launcher, if you think that will help.
Hm.. I've tried a bunch of games with rustland and I can't reproduce any stuttering; fps is always really close to the default scheduler (or even better if the system is busy). @kode54 maybe you can try to run

Even better, you could try to generate a more detailed profile using this command (maybe run it for around 30 sec while the stuttering is happening, then ctrl+c and post the output):
This would tell us where the system is spending most of its time, showing both the kernel and the user stack traces of all the running processes.
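The command itself was lost in the page export; a bpftrace invocation matching that description (a reconstruction, not necessarily the exact original; the 99Hz sampling frequency is my choice) would look something like:

```sh
sudo bpftrace -e 'profile:hz:99 { @[kstack, ustack, comm] = count(); }' > bpftrace.txt
```

On ctrl+c, bpftrace prints the sampled kernel and user stacks per process, sorted by sample count.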
I have generated a lengthy log of tracing from roughly before launching the game, to launching it, which stutters on loading. It also appears my machine is using 7GiB of ZSWAP, ZSTD compressed. Even though it is only using 8 GiB of application memory, it's not relinquishing much of its 21 GiB of cached file data. Could you tell me how I should analyze this data, if I want to inspect it myself?
The idea is to look at the stack traces (at the bottom you find those with the highest number, meaning that the CPUs were spending most of their time there). To have a more "visual" overview of what is happening, you can feed this data to something like FlameGraph (https://github.com/brendangregg/FlameGraph/), for example:
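The example command was also dropped by the export; with the scripts from the FlameGraph repo the pipeline looks roughly like this (stackcollapse-bpftrace.pl is the collapser for bpftrace stack output; the input file name is a placeholder):

```sh
git clone https://github.com/brendangregg/FlameGraph
./FlameGraph/stackcollapse-bpftrace.pl bpftrace.txt > out.folded
./FlameGraph/flamegraph.pl out.folded > /tmp/out.svg
```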
Then open /tmp/out.svg in your browser and you get a nice graphical overview of the stack trace samples (represented as a flame graph: on the y axis you see the stack trace, on the x axis the number of samples). The bigger horizontal blocks represent where your CPUs are spending most of their time.

In your specific case it seems that most of the time the CPUs are doing syscalls (I guess your %sys time should be pretty high). I see a big chunk of

This was with rustland running, right? It would be interesting to get a bpftrace.txt also with the default scheduler and compare the traces.
Also, as you mentioned swap, can you record the output of
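The file being asked for got cut off; Linux exposes memory pressure stall information (PSI) in /proc/pressure/memory, which is presumably what was meant here. Its two-line format is easy to parse; a small illustrative helper (my own sketch, not part of the thread):

```python
# Illustrative parser for Linux PSI output (e.g. /proc/pressure/memory).
# The kernel's format is:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=0
#   full avg10=0.00 avg60=0.00 avg300=0.00 total=0
def parse_psi(text):
    """Return {"some": {...}, "full": {...}} with floats for the running
    averages and an int for the cumulative stall time (microseconds)."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        stats = {}
        for field in fields:
            key, value = field.split("=")
            stats[key] = int(value) if key == "total" else float(value)
        result[kind] = stats
    return result

# Example input, with made-up numbers:
sample = (
    "some avg10=0.00 avg60=1.25 avg300=0.40 total=123456\n"
    "full avg10=0.00 avg60=0.16 avg300=0.04 total=23456\n"
)
print(parse_psi(sample)["some"]["avg60"])  # 1.25
```

Sampling this file periodically (say, once per second) while reproducing the stutter gives exactly the kind of memory pressure log discussed below.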
Here's a memory pressure log. The increase in 10 second average levels was right around where the game was loading. I do not know why it is swapping so much when I have 32GB of RAM.
It's not necessarily swapping. If it mmapped large memory areas and is faulting them in during loading, they'd show up as memory pressure, which isn't too surprising while loading. Just so that we can rule out memory/io issues: do you still see stuttering problems after the pressure spike from loading subsides?
Periodically, every time a resource seems to load.
So, the stuttering problems are associated with memory pressure? If so, I wonder whether this is from the schedulers always dispatching per-cpu kworkers directly to the local DSQ, prioritizing them over everything else.
@htejun I'm also wondering if we can still hit some page faults under memory pressure. Despite using the custom allocator and mlock-ing all the memory, a shared library, for example, can be unmapped under memory pressure; in that case I think scx_rustland may still hit a page fault, causing the stuttering. @kode54 if you look at the scx_rustland output, do you see a value >0 in
ok Listing
Here's a sample of output with a queued task:
That looks correct, things can be problematic if
I attempted to run it with
I doubled my RAM capacity, and it's still happening. No swapping going on. I checked Netdata. Average CPU Pressure hits about 17% at its peak while the game is loading, then drops to about 10% while it's running idle. CPU utilization hits 38% or so while it's loading, and that's across all 16 threads.
Further checking: I'm compiling a kernel now, while a btrfs scrub is running on my large storage array, which isn't involved in the compilation or its backing. I was hitting 98-100% CPU usage and load averages of 19 or so for my 16 threads. Then I loaded scx_rustland. CPU usage dropped to 78% and load averages shot up to 30. I should also mention that switching Fall Guys to Proton 9.0 beta also seems to have alleviated most of the problems it had with sched-ext.
Alright, I'm re-reading this thread; let's try to tackle this one and see if we can figure out what's going on. IIUC you are still experiencing some stuttering if you start any sched-ext scheduler (either

With

With

With
That said, my assumption is that your workload consists of multiple tasks contending for the CPUs at the same time (many more tasks than the amount of cores); some of these tasks are CPU bound, others are I/O bound. Can you confirm that this is the case? (My assumption is based on the fact that your load increases in some cases.) If that's the case, then we should probably focus either at

At some point you mentioned:
Are you able to reproduce this? If you can, it'd be interesting to check if all these queued tasks are reported in

Another aspect that may impact system responsiveness is the time slice (how much time each task is allowed to run before the scheduler reclaims its CPU). For this, have you tried to use a smaller time slice (like starting the scheduler - either

So, to recap: first of all, I think we should better understand the nature of your workload (lots of tasks vs few tasks running - at least now we know that we're not dealing with memory pressure conditions), then understand whether the scheduler is massively overloaded (for some reason) or not, then understand whether the default time slice is appropriate for your responsiveness expectations. Once we understand all of this, we can try to refine our analysis by doing some targeted profiling.

And thanks tons for all your updates and for sharing all these details with us!
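To illustrate why a smaller slice improves responsiveness under load, here is a toy model (my own sketch, not rustland's actual implementation, and the constants are assumptions): shrinking the slice as the number of waiting tasks grows bounds the worst-case wait before a newly woken interactive task gets a CPU.

```python
# Toy model: scale the time slice down with the number of tasks waiting
# for a CPU. Constants are illustrative, not rustland's real defaults.
BASE_SLICE_US = 20000  # assumed base slice: 20 ms, in microseconds
MIN_SLICE_US = 1000    # floor so every task still makes forward progress

def effective_slice_us(nr_waiting: int) -> int:
    """Shrink the slice as the run queue grows: with N tasks waiting,
    the worst-case wait for a CPU stays around one base slice."""
    return max(BASE_SLICE_US // max(nr_waiting, 1), MIN_SLICE_US)

print(effective_slice_us(1))    # near-idle system: full 20000 us slice
print(effective_slice_us(8))    # 8 waiters: 2500 us each
print(effective_slice_us(100))  # heavily overloaded: clamped to 1000 us
```

The trade-off mentioned later in the thread falls out of this directly: a smaller slice means more context switches, which costs throughput for CPU-bound batch work.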
It seems to be working fine so far with
Sometimes I also need to use

I'm really considering setting the default time slice to 5000 (or 10000) for
In line with rustland's focus on prioritizing interactive tasks, set the default base time slice to 5ms. This helps mitigate potential audio crackling issues or system lags when the system is overloaded or under memory pressure conditions (i.e., #96 (comment)). A downside of this change is to introduce potential regressions in the throughput of CPU-intensive workloads, but in such scenarios rustland may not be the optimal choice and alternative schedulers may be preferred. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
I think we can close this one for now; after the latest changes to the default settings in scx_rustland we should be able to mitigate these performance issues. @kode54 do you agree, or do you think there's something else that we should address/improve?
scx_rusty needs to be 5ms too, unless low latency isn’t its aim.
scx_rusty is a more general-purpose scheduler, so it also needs to take throughput into account. Do you see some improvements in your case if you run

I know @Byte-Lab had some plans to apply the same "dynamic time slice" concept also to scx_rusty, which should help in cases like this. I'm not sure if there's some work in progress for this, otherwise I can take a look. So, let's keep this open for now. Thanks @kode54.
I’m not sure if it’s kernel updates or using
Nice, glad to hear. We still have a lot more we can do to make rusty more interactive, but glad to hear things seem to be going in the right direction.
About rustland this is kind of expected, the scheduler is not really nice with non-interactive cpu-intensive tasks. 😄
Adding another weird case: Running an OBS Studio virtual camera with v4l2loopback, feeding it a combined PipeWire desktop capture and webcam capture overlay, and it becomes incredibly laggy when the CPU is fully loaded, such as from building a package or a kernel.
I am using Arch, with linux-cachyos 6.7.0-4, with every upstream bcachefs patch from the 2024-01-01 tag up to 2bf0b0a9dff974cac259ce92d146e7142f472496 applied on top, with a bcachefs rootfs on a WD SN750 SSD, another bcachefs filesystem on a Samsung 980 Pro, and finally, my major storage on two 18TB WD Red Pro drives with 2 replicas enabled for both metadata and user data.
For heavy I/O, I am either running qBittorrent downloading some rather large (200GiB+) data sets at 1-2MB/s, plus some smaller ones (10-30GiB) at 20MB/s, to the two replicas hard drives. Or I am building a kernel on the rootfs with all 16 threads of my CPU.
Either of those tasks causes my compositor, Wayfire, to bog down heavily if I happen to be using scx_rustland, scx_rusty, or scx_nest. If I disable the sched-ext schedulers, the compositor becomes performant again under the same conditions.