
Many schedulers seem to be heavily penalized by heavy I/O, at least with bcachefs #96

Open · kode54 opened this issue Jan 19, 2024 · 67 comments

@kode54 commented Jan 19, 2024

I am using Arch with linux-cachyos 6.7.0-4, with every upstream bcachefs patch from the 2024-01-01 tag up to 2bf0b0a9dff974cac259ce92d146e7142f472496 applied on top. The rootfs is bcachefs on a WD SN750 SSD, there is another bcachefs filesystem on a Samsung 980 Pro, and my main storage lives on two 18TB WD Red Pro drives with 2 replicas enabled for both metadata and user data.

For heavy I/O, I am either running qBittorrent, downloading some rather large (200GiB+) data sets at 1-2MB/s plus some smaller ones (10-30GiB) at 20MB/s to the two-replica hard drives, or I am building a kernel on the rootfs with all 16 threads of my CPU.

Either of those tasks causes my compositor, Wayfire, to bog down heavily if I happen to be using scx_rustland, scx_rusty, or scx_nest. If I disable the sched-ext scheduler, the compositor becomes responsive again under the same conditions.

@arighi (Collaborator) commented Jan 19, 2024

I'm wondering if this happens because we are giving too much priority to per-CPU kthreads: during a massive I/O workload on fast storage drives, we may have per-CPU kernel workers that are actually more CPU-bound than I/O-bound.

In rustland, for example, per-CPU kthreads are directly dispatched from the kernel to the local DSQ, bypassing the user-space scheduler. So they always win over any other task and can potentially end up monopolizing the CPUs. Maybe something similar is happening with the other schedulers as well?

Anyway, I was actually working on a patch to use the global DSQ for per-CPU kthreads (or better, for all the per-CPU threads in general). I'll do some tests and will create a new PR, I'll update this thread when it's ready.

@Byte-Lab (Contributor) commented Jan 19, 2024

Anyway, I was actually working on a patch to use the global DSQ for per-CPU kthreads (or better, for all the per-CPU threads in general). I'll do some tests and will create a new PR, I'll update this thread when it's ready.

@arighi FYI I'm not sure how much using the global DSQ will help, unfortunately. In the core ext.c code we automatically first try to consume from the global DSQ, and if we find a task that can run there we use it without invoking ops.dispatch(). That also has the downside of incurring the overhead of walking the DSQ until we find the per-CPU task.

What might work a bit better is to instead create a custom DSQ per-CPU that you dispatch the kthreads to, and then consume from that in ops.dispatch() so you still have a chance to consume the remaining tasks from the dispatched list.

@arighi (Collaborator) commented Jan 19, 2024

@Decave from Documentation/scheduler/sched-ext.rst:

A CPU always executes a task from its local DSQ. A task is "dispatched" to a
DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to consume the
global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
is invoked.

IIUC from doc it seems that local DSQ wins over global DSQ. So, let's say I have a single CPU, if I'm dispatching task1 with DSQ_LOCAL_ON | cpu and task2 with DSQ_GLOBAL, I would expect that task1 runs before task2. What am I missing?

@Byte-Lab (Contributor) commented:

TL;DR:

I think you're right that dispatching to SCX_DSQ_GLOBAL will in fact throttle the pcpu kthread, but here is some more relevant context:

So there are a couple of things to clarify here:

IIUC from doc it seems that local DSQ wins over global DSQ.

Yes, that is correct. A core will first check its local DSQ to see if there are any tasks. Then it will check SCX_DSQ_GLOBAL, and only then will it call ops.dispatch().

So, let's say I have a single CPU, if I'm dispatching task1 with DSQ_LOCAL_ON | cpu and task2 with DSQ_GLOBAL, I would expect that task1 runs before task2.

So there is a concept in sched_ext called direct dispatch. This refers to when a task is dispatched "directly" from either ops.select_cpu() or ops.enqueue(), rather than waiting to be dispatched until ops.dispatch(). If you're doing direct dispatch, then you can't dispatch to SCX_DSQ_LOCAL_ON | cpu because we can't drop the rq lock on the enqueue path. You can dispatch directly to the local DSQ with SCX_DSQ_LOCAL, or you can dispatch directly to any other non-local DSQ.

So going back to your example, if you were to do a direct dispatch of task1 with SCX_DSQ_LOCAL, then it would indeed be chosen before task2. Otherwise, if you were to wait to dispatch task1 until ops.dispatch(), and you instead did a direct dispatch of task2, task2 would be chosen first because it would be consumed before ops.dispatch() is invoked, and thus before you can use SCX_DSQ_LOCAL_ON | cpu.

That said -- to tie all of that back to the example at hand -- you're correct that a task dispatched with SCX_DSQ_LOCAL_ON | cpu will take precedence over a task dispatched to SCX_DSQ_GLOBAL, so I actually do think what you're proposing should work to throttle the pcpu kthread. The only potential caveat is that if tasks keep getting dispatched to that CPU from other CPUs (using SCX_DSQ_LOCAL_ON from ops.dispatch()), I think there's a possibility that the pcpu kthread could starve and never actually get to run, given that we'll always see that there's a task available on the local DSQ.

The crux of the issue is similar to what I alluded to above -- we're not actually getting to do things in ops.dispatch(). Perhaps the correct thing to do is to create per-cpu DSQs which we dispatch everybody to instead of using SCX_DSQ_LOCAL_ON? That would look something like this:

  • In rustland_init(), we create a custom DSQ per CPU
  • In rustland_enqueue(), rather than dispatching to SCX_DSQ_LOCAL for pcpu kthreads, we dispatch to RUSTLAND_DSQ_N (N == cpu where task is being enqueued).
  • In rustland_dispatch(), rather than dispatching a task to SCX_DSQ_LOCAL_ON | task.cpu, we instead dispatch to RUSTLAND_DSQ_N where N == task.cpu. After dispatching, we consume all of the tasks from RUSTLAND_DSQ_N with scx_bpf_consume().

That would allow us to always call dispatch_user_scheduler() from rustland_dispatch(), while also giving us the same FIFO semantics we already get with SCX_DSQ_LOCAL{_ON}. In other words, we'll be calling rustland_dispatch() a bit more, but the overhead should be minimal and it also avoids the issue you pointed out. Let me try putting together a PR for this.
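
In BPF terms, a rough and untested sketch of those three steps could look like the following (names such as RUSTLAND_DSQ_BASE and the nr_cpu_ids rodata variable are illustrative, not necessarily what the actual PR would use):

#include <scx/common.bpf.h>

#define PF_KTHREAD		0x00200000	/* from include/linux/sched.h */
#define RUSTLAND_DSQ_BASE	0x100		/* illustrative base ID for the per-CPU DSQs */

const volatile u32 nr_cpu_ids;	/* assumed to be filled in from user space at load time */

s32 BPF_STRUCT_OPS_SLEEPABLE(rustland_init)
{
	u32 cpu;

	/* Step 1: create one custom DSQ per CPU. */
	bpf_for(cpu, 0, nr_cpu_ids) {
		s32 err = scx_bpf_create_dsq(RUSTLAND_DSQ_BASE + cpu, -1);
		if (err)
			return err;
	}
	return 0;
}

void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Step 2: per-CPU kthreads go to their CPU's custom DSQ instead of
	 * SCX_DSQ_LOCAL, so they no longer jump ahead of every other dispatched task. */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dispatch(p, RUSTLAND_DSQ_BASE + scx_bpf_task_cpu(p),
				 SCX_SLICE_DFL, enq_flags);
		return;
	}
	/* ... everything else keeps going to the user-space scheduler ... */
}

void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Step 3: tasks picked by the user-space scheduler would be dispatched
	 * to RUSTLAND_DSQ_BASE + task.cpu instead of SCX_DSQ_LOCAL_ON | task.cpu
	 * (omitted here), and then each CPU drains its own queue into its local DSQ. */
	scx_bpf_consume(RUSTLAND_DSQ_BASE + cpu);
}

char _license[] SEC("license") = "GPL";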

Byte-Lab added a commit that referenced this issue Jan 19, 2024
This doesn't perform very well, but showing an example of what I meant
in #96.

Signed-off-by: David Vernet <void@manifault.com>
@Byte-Lab (Contributor) commented:

Here's an example of what I was talking about: f4a7cb2. It doesn't perform very well though :-(

@arighi (Collaborator) commented Jan 19, 2024

@Decave thank you so much for all the details and the example! It's all clear now.

About the tasks dispatched to the global DSQ, yes, we may have starvation. And I like your idea of using a per-CPU DSQ. About the poor performance, how about kicking the CPU when we're sending a task to a different one?

Something like this:

--- a/scheds/rust/scx_rustland/src/bpf/main.bpf.c
+++ b/scheds/rust/scx_rustland/src/bpf/main.bpf.c
@@ -515,6 +515,9 @@ void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
                dbg_msg("usersched: pid=%d cpu=%d payload=%llu",
                        task.pid, task.cpu, task.payload);
                dispatch_task(p, task.cpu, 0);
+               if (cpu != task.cpu)
+                       scx_bpf_kick_cpu(task.cpu, 0);
+
                __sync_fetch_and_add(&nr_user_dispatches, 1);
                bpf_task_release(p);

@Byte-Lab (Contributor) commented:

Well, that makes the scheduler work great!

@arighi (Collaborator) commented Jan 19, 2024

@Decave I've done some quick tests and with the extra kick the scheduler doesn't seem bad at all. I'll do more tests tomorrow morning, but I think we may have a solution; at least for my other cpumask/affinity issue, stress-ng --race-sched N doesn't crash the scheduler now.

arighi pushed a commit to arighi/scx that referenced this issue Jan 20, 2024
This doesn't perform very well, but showing an example of what I meant
in sched-ext#96.

Signed-off-by: David Vernet <void@manifault.com>
[ add kick cpu to improve responsiveness ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@kode54 (Author) commented Jan 21, 2024

I have tried out #99 against my system, and it does seem to work for a while, but then randomly devolves into horrible stuttering, both moving my mouse cursor and moving windows around the desktop.

@arighi (Collaborator) commented Jan 21, 2024

I have tried out #99 against my system, and it does seem to work for a while, but then randomly devolves into horrible stuttering, both moving my mouse cursor and moving windows around the desktop.

Sorry @kode54, #99 doesn't yet address any I/O pressure issue, but it was required for a follow-up patch that I'm planning to test (the idea is to send all kthreads to the user-space scheduler, just like regular tasks), which may help handle this situation. But thanks for testing this one, and it's good to know it helped a bit with this issue. I'll keep you informed when I have something ready for testing.

@arighi (Collaborator) commented Jan 21, 2024

@kode54 can you test #99 now? I added more changes that seem to help under I/O pressure conditions (as a stress test I'm running a bunch of fio writers on an encrypted dm-crypt volume). I'm not sure I'm reproducing exactly your problem, but in my case the system seems more responsive now. Thanks!

@kode54 (Author) commented Jan 21, 2024

Still stuttering, but in this case, the load is OBS Studio running a virtual camera and using 60-80% of a core for that, and Discord video chat using the same amount on another core, and my compositor using a steady 10% of a core or so, or slightly more if I start moving windows around.

@arighi (Collaborator) commented Jan 21, 2024

Still stuttering, but in this case, the load is OBS Studio running a virtual camera and using 60-80% of a core for that, and Discord video chat using the same amount on another core, and my compositor using a steady 10% of a core or so, or slightly more if I start moving windows around.

Hm.. ok, so in this case you don't even have any heavy I/O-bound task. I'm curious: do you have SMT enabled? If possible, can you try adding nosmt to your grub config, rebooting, and seeing if it's better (or worse)?

@kode54 (Author) commented Jan 22, 2024

SMT is enabled, disabling it did not have any appreciable effect, other than reducing me to 8 threads. Still stuttering while under CPU load.

On login, Steam downloads and compiles shaders for about 20 different installed games. It seems to do this about every day.

@arighi (Collaborator) commented Jan 22, 2024

SMT is enabled, disabling it did not have any appreciable effect, other than reducing me to 8 threads. Still stuttering while under CPU load.

On login, Steam downloads and compiles shaders for about 20 different installed games. It seems to do this about every day.

OK, good! At least we know it's not SMT-related. With the same workload, does the stuttering also happen with other schedulers, like scx_simple, scx_rusty, etc., or is it just scx_rustland?

@kode54 (Author) commented Jan 22, 2024

Under certain loads, I experienced stutter with scx_rusty and scx_nest. I can try scx_simple as well.

Does it matter that the kernel I am using the patch set with is linux-cachyos? It's built using the default settings, except for using a generic CPU target instead of autodetect, and adding a bcachefs update patch set that shouldn't incur a significant load versus plain 6.7.1.

@arighi (Collaborator) commented Jan 22, 2024

It shouldn't matter: if the default scheduler works fine, then the sched-ext schedulers should also work fine, to a large degree. Not necessarily for everything, but they shouldn't show obvious lag or stuttering, especially with workloads that are not super intense.

But if even scx_simple clearly shows this problem, then maybe something odd is happening in the kernel.

@kode54 (Author) commented Jan 22, 2024

scx_simple seems to be fine, so far.

@arighi (Collaborator) commented Jan 22, 2024

scx_simple seems to be fine, so far.

ok, in this case it seems reasonable to assume that this issue has nothing to do with the kernel or sched-ext in general.

So, I would suggest another test: can you try #99 again (I pushed/updated stuff, so make sure you refresh the repo) and start scx_rustland with -b 0? This disables the priority boost logic for interactive tasks and the scheduler becomes pretty much a vruntime-based scheduler with a variable time slice, which is very similar to what scx_simple does.

This would tell us whether the problem (for rustland at least) is in the priority boost part or somewhere else.

@arighi (Collaborator) commented Jan 22, 2024

@kode54 I think I was able to reproduce the problem on my side by starting OBS and recording my session. I've updated #99 again, adding a WIP patch (make sure to git reset the repo). I'm still not happy with this patch (hence the WIP), but it seems to make a significant difference in my case and it might fix the problem in your case as well.

Can you do another test? Thanks.

@kode54 (Author) commented Jan 22, 2024

I'll test shortly. Sorry if I wasn't much help in the first post; I wasn't up yet.

Edit: Thought I'd drop a mention: scx_simple bogs down horribly under the load of building this package. Using the AUR package for scx-scheds-git, with my makepkg.conf set to -j$(nproc), it builds all the Rust packages simultaneously, and each one invokes up to 16 rustc threads at once.

Will test scx_rustland now.

@kode54 (Author) commented Jan 22, 2024

I was running your latest #99 of scx_rustland, and it was working mostly fine until I attempted to rebuild the scx-scheds package again, which, with my AUR defaults, spawns upwards of 30-40 rustc threads in the final stage. This led to my GPU timing out and resetting, and the reset took so long that my desktop session crashed back to the login screen, terminating the build. Next time, I'll try forcing -j1 for the package, which should limit it to ~16 rustc threads.

@arighi (Collaborator) commented Jan 23, 2024

hm... it should survive a 30-40 thread build. I'll try to run some parallel rustc builds on my box as well.

@kode54 (Author) commented Jan 24, 2024

I will leave this issue open, but I'm ceasing testing for now, until sched-ext makes it into the upstream kernel tree in a stable release. I won't be testing any 6.8 kernels until Arch has a 6.8 kernel in the linux package, and I will only be using official binaries for this, since I can't otherwise report a userland package being incompatible.

@kode54 (Author) commented Jan 29, 2024

Okay, I've returned to continue testing this, because I overcame my stupid problem. I will simply live with the issues of rolling new kernels.

Anyway:

I encountered further bugs with scx_rustland, but not with scx_rusty or other BPF schedulers that aren't entirely userspace. Basically, if I build the scx-scheds-git AUR PKGBUILD with a makepkg.conf set to use -j$(nproc) on my 16-thread CPU, it will queue up all four of the Rust build jobs at once, and each of them will use up to 16 threads automatically, without regard for the others.

This build queue has immediate problems with my Radeon RX 6700-XT on Wayfire, but only with scx_rustland. It will almost immediately result in a GPU reset, which will end up failing, leaving the GPU broken until the machine is soft-rebooted by logging in remotely over SSH.

scx_rusty, the default in /etc/default/scx shipped with the current master, survives the build process. It is slightly stuttery, but so is the kernel built-in scheduler.

Here is a dmesg dump of the failing scx_rustland session:
dmesg.1.txt

And here is a dmesg dump of the successful session which followed, running on scx_rusty:
dmesg.2.txt

@arighi (Collaborator) commented Jan 29, 2024

@kode54 thank you for sharing this, it's very useful for understanding what's happening! It looks like you hit a sched-ext issue rather than a rustland issue; more precisely, this warning:

int scx_cgroup_can_attach(struct cgroup_taskset *tset)
{
...
                WARN_ON_ONCE(p->scx.cgrp_moving_from);
...

I'm wondering if we need to exclude exiting / autogroup tasks in this logic, something similar to what we did in sched-ext/sched_ext@6b747e0.

I'm not sure if the following patch makes any sense at all (posting here just for discussion):
arighi/sched_ext@6f51182

If you have the time and possibility to recompile the kernel and do more tests, you could check whether the problem still happens with this patch applied. Otherwise, let's wait for the opinion of people more experienced than me in this area, such as @htejun or @Decave.

@kode54 (Author) commented Jan 29, 2024

It still reset my GPU with that patch applied. scx_rusty did not, once again.

scx_rustland: dmesg.3.txt
scx_rusty: dmesg.4.txt

@arighi (Collaborator) commented Feb 21, 2024

@kode54 about rustland: I pushed some improvements yesterday that should reduce CPU usage and improve responsiveness in general. Do you mind trying again with the latest version from the main branch and seeing if the stuttering is still the same / better / worse? Thanks!

@kode54 (Author) commented Feb 22, 2024

I tried with both rusty and rustland; stuttering is pretty bad, and the wineserver process is using 20-30% more CPU according to top, compared to the kernel scheduler. I can't reproduce the lockup, but maybe that was because I started rusty while the game was already running.

I don't know if it's worth noting that I was using the Steam version of the game, which is no longer available for purchase. I can try the Epic Games Store version under Heroic Games Launcher, if you think that will help.

@Byte-Lab (Contributor) commented:

rusty still has a lot of room for improvement -- it's mostly been targeted towards server workloads so far. My plan is to start looking at making it more interactive in the near future.

@arighi (Collaborator) commented Feb 22, 2024

Hm.. I've tried a bunch of games with rustland and I can't reproduce any stuttering; fps is always really close to the default scheduler (or even better if the system is busy).

@kode54 maybe you can try running perf top when the stuttering is happening; that should help identify the bottleneck.

Even better, you could try to generate a more detailed profile using this command (run it for about 30 seconds while the stuttering is happening, then ctrl+c and post the output):

sudo bpftrace -e 'profile:hz:99 { @[ustack, kstack] = count(); }'

This would tell us where the system is spending most of the time, showing both the kernel and the user stack trace of all the running processes.

@kode54 (Author) commented Feb 22, 2024

I have generated a lengthy trace log, from roughly before launching the game through launching it; the game stutters while loading.

It also appears my machine is using 7GiB of ZSWAP, ZSTD compressed. Even though it is only using 8 GiB of application memory, it's not relinquishing much of its 21 GiB of cached file data.

Could you tell me how I should analyze this data, if I want to inspect it myself?

bpftrace.txt

@arighi (Collaborator) commented Feb 22, 2024

The idea is to look at the stack traces (at the bottom you find those with the highest counts, meaning that the CPUs were spending most of their time there).

To have a more "visual" overview of what is happening you need to feed this data to something like flamegraph (https://github.com/brendangregg/FlameGraph/), for example:

cat bpftrace.txt | ./stackcollapse-bpftrace.pl | ./flamegraph.pl > /tmp/out.svg

Then open /tmp/out.svg in your browser and you get a nice graphical overview of the stack trace samples (represented as a flame graph: on the y axis you see the stack trace, on the x axis the number of samples). The bigger horizontal blocks represent where your CPUs are spending most of their time.

In your specific case it seems that most of the time the CPUs are doing syscalls (I guess your %sys time should be pretty high). I see a big chunk of sys_epoll and sched_yield, with some sys_getsockopt / sys_recv.

This was with rustland running, right? It would be interesting to also get a bpftrace.txt with the default scheduler and compare the traces.

@htejun (Contributor) commented Feb 22, 2024

Also, as you mentioned swap, can you record the output of /proc/pressure/memory every 10 or so seconds and see what it says?
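
Something simple like this from a shell would do (an assumed one-liner, adjust as needed):

while true; do date; cat /proc/pressure/memory; echo; sleep 10; done | tee memorylog.txt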

@kode54 (Author) commented Feb 23, 2024

Here's a memory pressure log. The increase in the 10-second average levels was right around where the game was loading. I do not know why it is swapping so much when I have 32GB of RAM.
memorylog.txt

@htejun (Contributor) commented Feb 23, 2024

It's not necessarily swapping. If it mmapped large memory areas and is faulting them in for loading, they'd show up as memory pressure, which isn't too surprising while loading. Just so we can rule out memory/IO issues: do you still see stuttering problems after the pressure spike from loading subsides?

@kode54 (Author) commented Feb 24, 2024

Periodically, every time a resource seems to load.

@htejun (Contributor) commented Feb 24, 2024

So, the stuttering problems are associated with memory pressure? If so, I wonder whether this comes from the schedulers always dispatching per-cpu kworkers directly to the local DSQ, prioritizing them over everything else.

@arighi (Collaborator) commented Feb 25, 2024

@htejun I'm also wondering if we can still hit some page faults under memory pressure. Despite using the custom allocator and mlock-ing all the memory, a shared library, for example, can be unmapped under memory pressure; in that case I think scx_rustland may still hit a page fault, causing the stuttering.

@kode54 if you look at the scx_rustland output, do you see a value >0 in nr_page_faults? Thanks.

@kode54 (Author) commented Feb 26, 2024

nr_page_faults never exceeds 0. Though when I launched scx_rustland, tasks was about 60-70, then increased to about 700 by the time I had the game running. Also, most of the cores were listing pid=0 most of the time, except when 4 processes from other PIDs cycled around the various cores every second of output.

@arighi (Collaborator) commented Feb 26, 2024

ok, nr_page_faults=0 is good; 700 waiting tasks, instead, is not really good. Are they all listed in nr_queued, in nr_scheduled, or both? If they're all in nr_queued it means that the scheduler is not awakened fast enough; if they're in nr_scheduled it means that the scheduler should be more aggressive at dispatching them.

Listing pid=0 for most of the cores is normal (unless the system is massively overloaded), because the scheduler runs when some tasks expire their time slice, and in order to be able to dispatch more tasks at least some cores need to be free.

@kode54 (Author) commented Feb 27, 2024

Here's a sample of output with a queued task:

Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] vruntime=185057608512
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   tasks=988
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_user_dispatches=3077278 nr_kernel_dispatches=57754827
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_cancel_dispatches=0 nr_bounce_dispatches=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_waiting=1 [nr_queued=1 + nr_scheduled=0]
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_failed_dispatches=0 nr_sched_congested=0 nr_page_faults=0 [OK]
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] time slice = 20000 us
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] slice boost = 1600
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] Running tasks:
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  0 cpu  0 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  0 cpu  8 pid=184230
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  1 cpu  1 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  1 cpu  9 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  2 cpu  2 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  2 cpu 10 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  3 cpu  3 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  3 cpu 11 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  4 cpu  4 pid=686362
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  4 cpu 12 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  5 cpu  5 pid=1289884
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  5 cpu 13 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  6 cpu  6 pid=[self]
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  6 cpu 14 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  7 cpu  7 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  7 cpu 15 pid=0

@arighi (Collaborator) commented Feb 27, 2024

That looks correct; things can become problematic if nr_queued or nr_scheduled grow into big numbers.

@kode54 (Author) commented Feb 28, 2024

I attempted to run it with PROTON_LOG=1 to see if there would be anything useful. Just 375MB of log spew over 100 seconds. Most of it is exceptions being thrown and stack unwind being logged, multiple times per millisecond.

@kode54 (Author) commented Mar 1, 2024

I doubled my RAM capacity, and it's still happening. No swapping going on.

I checked Netdata. Average CPU Pressure hits about 17% at its peak while the game is loading, then drops to about 10% while it's running idle. CPU utilization hits 38% or so while it's loading, and that's across all 16 threads.

@kode54 (Author) commented Mar 2, 2024

Further checking: I'm compiling a kernel now, while a btrfs scrub is running on my large storage array, which isn't involved in the compilation or its backing. I was hitting 98-100% CPU usage and load averages of 19 or so for my 16 threads.

Then I loaded scx_rustland. CPU usage dropped to 78% and load averages shot up to 30.

I should also mention that switching Fall Guys to Proton 9.0 beta also seems to have alleviated most of the problems it had with sched-ext.

@arighi (Collaborator) commented Mar 3, 2024

Alright, I'm re-reading this thread, let's try to tackle this one and see if we can figure out what's going on.

IIUC you are still experiencing some stuttering if you start any sched-ext scheduler (scx_rusty, scx_rustland, scx_nest or scx_simple). IMHO we should try to focus on one scheduler, because it seems unlikely that sched-ext itself can cause performance issues; it's more likely that the bottleneck is in a particular scheduler's code.

With scx_simple I'd expect poor responsiveness if you have a lot of tasks running in your system. It uses vruntime-based scheduling by default, but apart from that there's not much going on, so when there are lots of tasks running the average wait time can naturally increase, due to the single-queue FIFO ordering (and the constant time slice assigned to all tasks), causing the stuttering.
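
For reference, the core of that single-queue approach boils down to something like this (a simplified sketch in the spirit of scx_simple, not its actual source; SHARED_DSQ is just an illustrative name):

#include <scx/common.bpf.h>

#define SHARED_DSQ 0	/* one global queue shared by all CPUs */

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Every task gets the same fixed slice and is ordered only by its
	 * vruntime in the single shared queue. */
	scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL,
			       p->scx.dsq_vtime, enq_flags);
}

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Each CPU pulls the lowest-vruntime task from the shared queue; with
	 * many runnable tasks the time spent waiting behind that queue grows. */
	scx_bpf_consume(SHARED_DSQ);
}

char _license[] SEC("license") = "GPL";

In other words, nothing in this model privileges the compositor or other latency-sensitive tasks when the queue gets long.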

With scx_rusty things can be much better, even in the presence of lots of running tasks, because of its multi-domain nature, the load balancing, and many other things.

With scx_rustland we have the user-space overhead, but there's also the logic to boost interactive tasks, which should compensate for the overhead and improve responsiveness even in the presence of many running tasks.

scx_nest in this context is probably not the best choice, considering its approach of keeping tasks together on warm cores (when the system is massively overloaded, we may want to do the opposite and try to spread tasks among the available cores as much as possible, since caches will be thrashed anyway).

That said, my assumption is that your workload consists of multiple tasks contending for the CPUs at the same time (many more tasks than cores), some of them CPU-bound, others I/O-bound. Can you confirm that this is the case? (My assumption is based on the fact that your load increases in some cases.) If so, then we should probably focus on either scx_rusty or scx_rustland.

At some point you mentioned:

when I launched scx_rustland, tasks was about 60-70, then increased to about 700 by the time I had the game running.

Are you able to reproduce this? If you can, it'd be interesting to check whether all these queued tasks are reported in nr_queued or nr_scheduled (or both). The former means that tasks keep piling up in the queue and we don't wake up the user-space scheduler fast enough; the latter means that the user-space scheduler is awakened but fails to dispatch tasks, because the CPUs are busy or for other reasons (then we'd need to figure out what's going on, but one step at a time...).

Another aspect that may impact system responsiveness is the time slice (how much time each task is allowed to run before the scheduler reclaims its CPU). For this, have you tried using a smaller time slice, for example starting the scheduler (either scx_rusty or scx_rustland) with -s 5000? Does it make the system more responsive or not?
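
For example (the value is in microseconds, so 5000 corresponds to a 5ms slice):

sudo scx_rustland -s 5000
sudo scx_rusty -s 5000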

So, to recap: first of all I think we should better understand the nature of your workload (lots of tasks vs few tasks running; at least now we know that we're not dealing with memory pressure conditions), then understand whether the scheduler is massively overloaded (for some reason) or not, then understand whether the default time slice is appropriate for your responsiveness expectations.

Once we understand all of this we can try to refine our analysis by doing some targeted profiling.

And thanks a ton for all your updates and for sharing all these details with us!

@kode54 (Author) commented Mar 5, 2024

It seems to be working fine so far with -s 5000 on scx_rustland. I'm not sure if the previous updates from 6.7.7 affecting zswap had any effect on this. Also, the game seems to use more CPU time if I have vsync enabled. I'm also using variable refresh rate, 40-60Hz.

@arighi (Collaborator) commented Mar 5, 2024

Sometimes I also need to use -s 5000 to prevent some audio crackling when the system is massively loaded. Or (not sure why) I need to change the scheduling class of pipewire/pipewire-pulse/wireplumber from real-time to normal (so that they'll be scheduled by rustland as well). Either way prevents the audio crackling.
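
For the record, I demote them with something along these lines (assuming chrt from util-linux; the daemons may re-acquire their real-time priority after a restart):

for pid in $(pidof pipewire pipewire-pulse wireplumber); do sudo chrt --other --pid 0 "$pid"; done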

I'm really considering setting the default time slice to 5000 (or 10000) for scx_rustland; given that the main goal of this scheduler is to prioritize low-latency workloads, it probably makes sense to use a shorter time slice by default.

arighi added a commit that referenced this issue Mar 10, 2024
In line with rustland's focus on prioritizing interactive tasks, set the
default base time slice to 5ms.

This allows to mitigate potential audio crackling issues or system lags
when the system is overloaded or under memory pressure condition (i.e.,
#96 (comment)).

A downside of this change is to introduce potential regressions in the
throughput of CPU-intensive workloads, but in such scenarios rustland
may not be the optimal choice and alternative schedulers may be
preferred.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi (Collaborator) commented Mar 14, 2024

I think we can close this one for now; after the latest changes to the default settings in scx_rustland, we should be able to mitigate these performance issues.

@kode54 do you agree or do you think there's something else that we should address/improve?

@kode54 (Author) commented Mar 14, 2024

scx_rusty needs to be 5ms too, unless low latency isn’t its aim.

@arighi (Collaborator) commented Mar 14, 2024

scx_rusty is a more general-purpose scheduler, so it also needs to take throughput into account. Do you see some improvements in your case if you run scx_rusty -s 5000?

I know @Byte-Lab had some plans to apply the same "dynamic time slice" concept to scx_rusty as well, which should help in cases like this. I'm not sure if there's already some work in progress for this; otherwise I can take a look.

So, let's keep this open for now. Thanks @kode54 .

@kode54 (Author) commented Mar 15, 2024

I’m not sure if it’s kernel updates or using -s 5000, but I can now play Fall Guys without terrible lag. Also, unlike rustland, rusty doesn’t impact my Geekbench score by over 3000 points on the multi core result. My multi core score is now more than the average score for this model CPU.

@Byte-Lab (Contributor) commented:

Nice, glad to hear. We still have a lot more we can do to make rusty more interactive, but glad to hear things seem to be going in the right direction.

@arighi (Collaborator) commented Mar 15, 2024

I’m not sure if it’s kernel updates or using -s 5000, but I can now play Fall Guys without terrible lag. Also, unlike rustland, rusty doesn’t impact my Geekbench score by over 3000 points on the multi core result. My multi core score is now more than the average score for this model CPU.

About rustland this is kind of expected; the scheduler is not really nice to non-interactive CPU-intensive tasks. 😄

@kode54 (Author) commented Mar 30, 2024

Adding another weird case: running an OBS Studio virtual camera with v4l2loopback, feeding it a combined PipeWire desktop capture and webcam capture overlay, becomes incredibly laggy when the CPU is fully loaded, such as when building a package or a kernel.
