
Many schedulers seem to be heavily penalized by heavy I/O, at least with bcachefs #96

Open · kode54 opened this issue Jan 19, 2024 · 67 comments

@kode54 commented Jan 19, 2024

I am using Arch with linux-cachyos 6.7.0-4, with every upstream bcachefs patch from the 2024-01-01 tag up to 2bf0b0a9dff974cac259ce92d146e7142f472496 applied on top. The rootfs is bcachefs on a WD SN750 SSD, there is another bcachefs filesystem on a Samsung 980 Pro, and my main storage lives on two 18TB WD Red Pro drives with 2 replicas enabled for both metadata and user data.

For heavy I/O, I am either running qBittorrent, downloading some rather large (200GiB+) data sets at 1-2MB/s plus some smaller ones (10-30GiB) at 20MB/s to the two-replica hard drives, or I am building a kernel on the rootfs with all 16 threads of my CPU.

Either of those tasks causes my compositor, Wayfire, to bog down heavily if I happen to be using scx_rustland, scx_rusty, or scx_nest. If I disable the sched-ext scheduler, the compositor becomes responsive again under the same conditions.

@arighi (Collaborator) commented Jan 19, 2024

I'm wondering if this happens because we are giving too much priority to per-CPU kthreads: during a massive I/O workload on fast storage drives, we may have per-CPU kernel workers that are actually more CPU-bound than I/O-bound.

In rustland, for example, per-CPU kthreads are directly dispatched from the kernel to the local DSQ, bypassing the user-space scheduler. So they always win over any other task and can potentially end up monopolizing the CPUs. Maybe something similar is happening with the other schedulers as well?

Anyway, I was actually working on a patch to use the global DSQ for per-CPU kthreads (or better, for all the per-CPU threads in general). I'll do some tests and will create a new PR, I'll update this thread when it's ready.

@Byte-Lab (Contributor) commented Jan 19, 2024

Anyway, I was actually working on a patch to use the global DSQ for per-CPU kthreads (or better, for all the per-CPU threads in general). I'll do some tests and will create a new PR, I'll update this thread when it's ready.

@arighi FYI I'm not sure how much using the global DSQ will help, unfortunately. In the core ext.c code we automatically first try to consume from the global DSQ, and if we find a task that can run there we use it without invoking ops.dispatch(). That also has the downside of incurring the overhead of walking the DSQ until we find the per-CPU task.

What might work a bit better is to instead create a custom DSQ per-CPU that you dispatch the kthreads to, and then consume from that in ops.dispatch() so you still have a chance to consume the remaining tasks from the dispatched list.

@arighi (Collaborator) commented Jan 19, 2024

@Decave from Documentation/scheduler/sched-ext.rst:

A CPU always executes a task from its local DSQ. A task is "dispatched" to a
DSQ. A non-local DSQ is "consumed" to transfer a task to the consuming CPU's
local DSQ.

When a CPU is looking for the next task to run, if the local DSQ is not
empty, the first task is picked. Otherwise, the CPU tries to consume the
global DSQ. If that doesn't yield a runnable task either, ``ops.dispatch()``
is invoked.

IIUC from doc it seems that local DSQ wins over global DSQ. So, let's say I have a single CPU, if I'm dispatching task1 with DSQ_LOCAL_ON | cpu and task2 with DSQ_GLOBAL, I would expect that task1 runs before task2. What am I missing?

@Byte-Lab (Contributor) commented:

TL;DR:

I think you're right that dispatching to SCX_DSQ_GLOBAL will in fact throttle the pcpu kthread, but here is some more relevant context:

So there are a couple of things to clarify here:

IIUC from doc it seems that local DSQ wins over global DSQ.

Yes, that is correct. A core will first check its local DSQ to see if there are any tasks. Then it will check SCX_DSQ_GLOBAL, and only then will it call ops.dispatch().

So, let's say I have a single CPU, if I'm dispatching task1 with DSQ_LOCAL_ON | cpu and task2 with DSQ_GLOBAL, I would expect that task1 runs before task2.

So there is a concept in sched_ext called direct dispatch. This refers to when a task is dispatched "directly" from either ops.select_cpu() or ops.enqueue(), rather than waiting to be dispatched until ops.dispatch(). If you're doing direct dispatch, then you can't dispatch to SCX_DSQ_LOCAL_ON | cpu because we can't drop the rq lock on the enqueue path. You can dispatch directly to the local DSQ with SCX_DSQ_LOCAL, or you can dispatch directly to any other non-local DSQ.

So going back to your example, if you were to do a direct dispatch of task1 with SCX_DSQ_LOCAL, then it would indeed be chosen before task2. Otherwise, if you were to wait to dispatch task1 until ops.dispatch(), and you instead did a direct dispatch of task2, task2 would be chosen first because it would be consumed before ops.dispatch() is invoked, and thus before you can use SCX_DSQ_LOCAL_ON | cpu.

That said -- to tie all of that back to the example at hand -- you're correct that a task dispatched with SCX_DSQ_LOCAL_ON | cpu will take precedence over a task dispatched to SCX_DSQ_GLOBAL, so I actually do think what you're proposing should work to throttle the pcpu kthread. The only potential caveat is that if tasks keep getting dispatched to that CPU from other CPUs (using SCX_DSQ_LOCAL_ON from ops.dispatch()), I think there's a possibility that the pcpu kthread could starve and never actually get to run, given that we'll always see that there's a task available on the local DSQ.

The crux of the issue is similar to what I alluded to above -- we're not actually getting to do things in ops.dispatch(). Perhaps the correct thing to do is to create per-cpu DSQs which we dispatch everybody to instead of using SCX_DSQ_LOCAL_ON? That would look something like this:

  • In rustland_init(), we create a custom DSQ per CPU
  • In rustland_enqueue(), rather than dispatching to SCX_DSQ_LOCAL for pcpu kthreads, we dispatch to RUSTLAND_DSQ_N (N == cpu where task is being enqueued).
  • In rustland_dispatch(), rather than dispatching a task to SCX_DSQ_LOCAL_ON | task.cpu, we instead dispatch to RUSTLAND_DSQ_N where N == task.cpu. After dispatching, we consume all of the tasks from RUSTLAND_DSQ_N with scx_bpf_consume().

That would allow us to always call dispatch_user_scheduler() from rustland_dispatch(), while also giving us the same FIFO semantics we already get with SCX_DSQ_LOCAL{_ON}. In other words, we'll be calling rustland_dispatch() a bit more, but the overhead should be minimal and it also avoids the issue you pointed out. Let me try putting together a PR for this.
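
In BPF terms, a rough and untested sketch of those three steps could look like the following (names such as RUSTLAND_DSQ_BASE and the nr_cpu_ids rodata variable are illustrative, not necessarily what the actual PR would use):

#include <scx/common.bpf.h>

#define PF_KTHREAD		0x00200000	/* from include/linux/sched.h */
#define RUSTLAND_DSQ_BASE	0x100		/* illustrative base ID for the per-CPU DSQs */

const volatile u32 nr_cpu_ids;	/* assumed to be filled in from user space at load time */

s32 BPF_STRUCT_OPS_SLEEPABLE(rustland_init)
{
	u32 cpu;

	/* Step 1: create one custom DSQ per CPU. */
	bpf_for(cpu, 0, nr_cpu_ids) {
		s32 err = scx_bpf_create_dsq(RUSTLAND_DSQ_BASE + cpu, -1);
		if (err)
			return err;
	}
	return 0;
}

void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Step 2: per-CPU kthreads go to their CPU's custom DSQ instead of
	 * SCX_DSQ_LOCAL, so they no longer jump ahead of every other dispatched task. */
	if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1) {
		scx_bpf_dispatch(p, RUSTLAND_DSQ_BASE + scx_bpf_task_cpu(p),
				 SCX_SLICE_DFL, enq_flags);
		return;
	}
	/* ... everything else keeps going to the user-space scheduler ... */
}

void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Step 3: tasks picked by the user-space scheduler would be dispatched
	 * to RUSTLAND_DSQ_BASE + task.cpu instead of SCX_DSQ_LOCAL_ON | task.cpu
	 * (omitted here), and then each CPU drains its own queue into its local DSQ. */
	scx_bpf_consume(RUSTLAND_DSQ_BASE + cpu);
}

char _license[] SEC("license") = "GPL";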

Byte-Lab added a commit that referenced this issue Jan 19, 2024
This doesn't perform very well, but showing an example of what I meant
in #96.

Signed-off-by: David Vernet <void@manifault.com>
@Byte-Lab (Contributor) commented:

Here's an example of what I was talking about: f4a7cb2. It doesn't perform very well though :-(

@arighi (Collaborator) commented Jan 19, 2024

@Decave thank you so much for all the details and the example! It's all clear now.

About the tasks dispatched to the global DSQ, yes, we may have starvation. And I like your idea of using a per-CPU DSQ. About the poor performance, how about kicking the CPU when we're sending a task to a different one?

Something like this:

--- a/scheds/rust/scx_rustland/src/bpf/main.bpf.c
+++ b/scheds/rust/scx_rustland/src/bpf/main.bpf.c
@@ -515,6 +515,9 @@ void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
                dbg_msg("usersched: pid=%d cpu=%d payload=%llu",
                        task.pid, task.cpu, task.payload);
                dispatch_task(p, task.cpu, 0);
+               if (cpu != task.cpu)
+                       scx_bpf_kick_cpu(task.cpu, 0);
+
                __sync_fetch_and_add(&nr_user_dispatches, 1);
                bpf_task_release(p);

@Byte-Lab (Contributor) commented:

Well, that makes the scheduler work great!

@arighi (Collaborator) commented Jan 19, 2024

@Decave I've done some quick tests and with the extra kick the scheduler doesn't seem bad at all. I'll do more tests tomorrow morning, but I think we may have a solution; at least for my other cpumask/affinity issue, stress-ng --race-sched N doesn't crash the scheduler now.

arighi pushed a commit to arighi/scx that referenced this issue Jan 20, 2024
This doesn't perform very well, but showing an example of what I meant
in sched-ext#96.

Signed-off-by: David Vernet <void@manifault.com>
[ add kick cpu to improve responsiveness ]
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@kode54 (Author) commented Jan 21, 2024

I have tried out #99 against my system, and it does seem to work for a while, but then randomly devolves into horrible stuttering, both moving my mouse cursor and moving windows around the desktop.

@arighi (Collaborator) commented Jan 21, 2024

I have tried out #99 against my system, and it does seem to work for a while, but then randomly devolves into horrible stuttering, both moving my mouse cursor and moving windows around the desktop.

Sorry @kode54, #99 doesn't yet address any I/O pressure issue, but it was required for a follow-up patch that I'm planning to test (the idea is to send all kthreads to the user-space scheduler, just like regular tasks), which may help handle this situation. But thanks for testing this one, and it's good to know it helped a bit with this issue. I'll keep you informed when I have something ready for testing.

@arighi (Collaborator) commented Jan 21, 2024

@kode54 can you test #99 now? I added more changes that seem to help under I/O pressure conditions (as a stress test I'm running a bunch of fio writers on an encrypted dm-crypt volume). I'm not sure I'm reproducing exactly your problem, but in my case the system seems more responsive now. Thanks!

@kode54 (Author) commented Jan 21, 2024

Still stuttering, but in this case, the load is OBS Studio running a virtual camera and using 60-80% of a core for that, and Discord video chat using the same amount on another core, and my compositor using a steady 10% of a core or so, or slightly more if I start moving windows around.

@arighi (Collaborator) commented Jan 21, 2024

Still stuttering, but in this case, the load is OBS Studio running a virtual camera and using 60-80% of a core for that, and Discord video chat using the same amount on another core, and my compositor using a steady 10% of a core or so, or slightly more if I start moving windows around.

Hm.. ok, so in this case you don't even have any heavy I/O-bound task. I'm curious: do you have SMT enabled? If possible, can you try adding nosmt to your grub config, rebooting, and seeing if it's better (or worse)?

@kode54 (Author) commented Jan 22, 2024

SMT is enabled, disabling it did not have any appreciable effect, other than reducing me to 8 threads. Still stuttering while under CPU load.

On login, Steam downloads and compiles shaders for about 20 different installed games. It seems to do this about every day.

@arighi (Collaborator) commented Jan 22, 2024

SMT is enabled, disabling it did not have any appreciable effect, other than reducing me to 8 threads. Still stuttering while under CPU load.

On login, Steam downloads and compiles shaders for about 20 different installed games. It seems to do this about every day.

OK, good! At least we know it's not SMT-related. With the same workload, does the stuttering also happen with other schedulers, like scx_simple, scx_rusty, etc., or is it just scx_rustland?

@kode54 (Author) commented Jan 22, 2024

Under certain loads, I experienced stutter with scx_rusty and scx_nest. I can try scx_simple as well.

Does it matter that the kernel I am using the patch set with is linux-cachyos? It's built using the default settings, except for using a generic CPU target instead of autodetect, and adding a bcachefs update patch set that shouldn't incur a significant load versus plain 6.7.1.

@arighi (Collaborator) commented Jan 22, 2024

It shouldn't matter: if the default scheduler works fine, then the sched-ext schedulers should also work fine, to a large degree. Not necessarily for everything, but they shouldn't show obvious lag or stuttering, especially with workloads that are not super intense.

But if even scx_simple clearly shows this problem, then maybe something odd is happening in the kernel.

@kode54 (Author) commented Jan 22, 2024

scx_simple seems to be fine, so far.

@arighi (Collaborator) commented Jan 22, 2024

scx_simple seems to be fine, so far.

ok, in this case it seems reasonable to assume that this issue has nothing to do with the kernel or sched-ext in general.

So, I would suggest another test: can you try #99 again (I pushed/updated stuff, so make sure you refresh the repo) and start scx_rustland with -b 0? This disables the priority boost logic for interactive tasks and the scheduler becomes pretty much a vruntime-based scheduler with a variable time slice, which is very similar to what scx_simple does.

This would tell us whether the problem (for rustland at least) is in the priority boost part or somewhere else.

@arighi (Collaborator) commented Jan 22, 2024

@kode54 I think I was able to reproduce the problem on my side by starting OBS and recording my session. I've updated #99 again, adding a WIP patch (make sure to git reset the repo). I'm still not happy with this patch (hence the WIP), but it seems to make a significant difference in my case and it might fix the problem in your case as well.

Can you do another test? Thanks.

@kode54 (Author) commented Jan 22, 2024

I'll test shortly. Sorry if I wasn't much help in the first post; I wasn't up yet.

Edit: Thought I'd drop a mention: scx_simple bogs down horribly under the load of building this package. Using the AUR package for scx-scheds-git, with my makepkg.conf set to -j$(nproc), it builds all the Rust packages simultaneously, and each one invokes up to 16 rustc threads at once.

Will test scx_rustland now.

@kode54 (Author) commented Jan 22, 2024

I was running your latest #99 of scx_rustland, and it was working mostly fine until I attempted to rebuild the scx-scheds package again, which, with my AUR defaults, spawns upwards of 30-40 rustc threads in the final stage. This led to my GPU timing out and resetting, and the reset took so long that my desktop session crashed back to the login screen, terminating the build. Next time, I'll try forcing -j1 for the package, which should limit it to ~16 rustc threads.

@arighi (Collaborator) commented Jan 23, 2024

hm... it should survive a 30-40 thread build. I'll try to run some parallel rustc builds on my box as well.

@kode54 (Author) commented Jan 24, 2024

I will leave this issue open, but I'm ceasing testing for now, until sched-ext makes it into the upstream kernel tree in a stable release. I won't be testing any 6.8 kernels until Arch has a 6.8 kernel in the linux package, and I will only be using official binaries for this, since I can't otherwise report a userland package being incompatible.

@kode54 (Author) commented Jan 29, 2024

Okay, I've returned to continue testing this, because I overcame my stupid problem. I will simply live with the issues of rolling new kernels.

Anyway:

I encountered further bugs with scx_rustland, but not with scx_rusty or other BPF schedulers that aren't entirely userspace. Basically, if I build the scx-scheds-git AUR PKGBUILD with a makepkg.conf set to use -j$(nproc) on my 16-thread CPU, it will queue up all four of the Rust build jobs at once, and each of them will use up to 16 threads automatically, without regard for the others.

This build queue has immediate problems with my Radeon RX 6700-XT on Wayfire, but only with scx_rustland. It will almost immediately result in a GPU reset, which will end up failing, leaving the GPU broken until the machine is soft-rebooted by logging in remotely over SSH.

scx_rusty, the default in /etc/default/scx shipped with the current master, survives the build process. It is slightly stuttery, but so is the kernel built-in scheduler.

Here is a dmesg dump of the failing scx_rustland session:
dmesg.1.txt

And here is a dmesg dump of the successful session which followed, running on scx_rusty:
dmesg.2.txt

@arighi (Collaborator) commented Jan 29, 2024

@kode54 thank you for sharing this, it's very useful for understanding what's happening! It looks like you hit a sched-ext issue rather than a rustland issue; more precisely, this warning:

int scx_cgroup_can_attach(struct cgroup_taskset *tset)
{
...
                WARN_ON_ONCE(p->scx.cgrp_moving_from);
...

I'm wondering if we need to exclude exiting / autogroup tasks in this logic, something similar to what we did in sched-ext/sched_ext@6b747e0.

I'm not sure if the following patch makes any sense at all (posting here just for discussion):
arighi/sched_ext@6f51182

If you have the time and possibility to recompile the kernel and do more tests, you could check whether the problem still happens with this patch applied. Otherwise, let's wait for the opinion of people more experienced than me in this area, such as @htejun or @Decave.

@kode54 (Author) commented Jan 29, 2024

It still reset my GPU with that patch applied. scx_rusty did not, once again.

scx_rustland: dmesg.3.txt
scx_rusty: dmesg.4.txt

@arighi (Collaborator) commented Feb 21, 2024

@kode54 about rustland: I pushed some improvements yesterday that should reduce CPU usage and improve responsiveness in general. Do you mind trying again with the latest version from the main branch and seeing if the stuttering is still the same / better / worse? Thanks!

@kode54 (Author) commented Feb 22, 2024

I tried with both rusty and rustland; stuttering is pretty bad, and the wineserver process is using 20-30% more CPU according to top, compared to the kernel scheduler. I can't reproduce the lockup, but maybe that was because I started rusty while the game was already running.

I don't know if it's worth noting that I was using the Steam version of the game, which is no longer available for purchase. I can try the Epic Games Store version under Heroic Games Launcher, if you think that will help.

@Byte-Lab (Contributor) commented:

rusty still has a lot of room for improvement -- it's mostly been targeted towards server workloads so far. My plan is to start looking at making it more interactive in the near future.

@arighi (Collaborator) commented Feb 22, 2024

Hm.. I've tried a bunch of games with rustland and I can't reproduce any stuttering; fps is always really close to the default scheduler (or even better if the system is busy).

@kode54 maybe you can try running perf top when the stuttering is happening; that should help identify the bottleneck.

Even better, you could try to generate a more detailed profile using this command (run it for about 30 seconds while the stuttering is happening, then ctrl+c and post the output):

sudo bpftrace -e 'profile:hz:99 { @[ustack, kstack] = count(); }'

This would tell us where the system is spending most of the time, showing both the kernel and the user stack trace of all the running processes.

@kode54 (Author) commented Feb 22, 2024

I have generated a lengthy trace log, from roughly before launching the game through launching it; the game stutters while loading.

It also appears my machine is using 7GiB of ZSWAP, ZSTD compressed. Even though it is only using 8 GiB of application memory, it's not relinquishing much of its 21 GiB of cached file data.

Could you tell me how I should analyze this data, if I want to inspect it myself?

bpftrace.txt

@arighi (Collaborator) commented Feb 22, 2024

The idea is to look at the stack traces (at the bottom you find those with the highest counts, meaning that the CPUs were spending most of their time there).

To have a more "visual" overview of what is happening you need to feed this data to something like flamegraph (https://github.com/brendangregg/FlameGraph/), for example:

cat bpftrace.txt | ./stackcollapse-bpftrace.pl | ./flamegraph.pl > /tmp/out.svg

Then open /tmp/out.svg in your browser and you get a nice graphical overview of the stack trace samples (represented as a flame graph: on the y axis you see the stack trace, on the x axis the number of samples). The bigger horizontal blocks represent where your CPUs are spending most of their time.

In your specific case it seems that most of the time the CPUs are doing syscalls (I guess your %sys time should be pretty high). I see a big chunk of sys_epoll and sched_yield, with some sys_getsockopt / sys_recv.

This was with rustland running, right? It would be interesting to also get a bpftrace.txt with the default scheduler and compare the traces.

@htejun (Contributor) commented Feb 22, 2024

Also, as you mentioned swap, can you record the output of /proc/pressure/memory every 10 or so seconds and see what it says?
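
Something simple like this from a shell would do (an assumed one-liner, adjust as needed):

while true; do date; cat /proc/pressure/memory; echo; sleep 10; done | tee memorylog.txt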

@kode54 (Author) commented Feb 23, 2024

Here's a memory pressure log. The increase in the 10-second average levels was right around where the game was loading. I do not know why it is swapping so much when I have 32GB of RAM.
memorylog.txt

@htejun (Contributor) commented Feb 23, 2024

It's not necessarily swapping. If it mmapped large memory areas and is faulting them in for loading, they'd show up as memory pressure, which isn't too surprising while loading. Just so we can rule out memory/IO issues: do you still see stuttering problems after the pressure spike from loading subsides?

@kode54 (Author) commented Feb 24, 2024

Periodically, every time a resource seems to load.

@htejun (Contributor) commented Feb 24, 2024

So, the stuttering problems are associated with memory pressure? If so, I wonder whether this comes from the schedulers always dispatching per-cpu kworkers directly to the local DSQ, prioritizing them over everything else.

@arighi (Collaborator) commented Feb 25, 2024

@htejun I'm also wondering if we can still hit some page faults under memory pressure. Despite using the custom allocator and mlock-ing all the memory, a shared library, for example, can be unmapped under memory pressure; in that case I think scx_rustland may still hit a page fault, causing the stuttering.

@kode54 if you look at the scx_rustland output, do you see a value >0 in nr_page_faults? Thanks.

@kode54 (Author) commented Feb 26, 2024

nr_page_faults never exceeds 0. Though when I launched scx_rustland, tasks was about 60-70, then increased to about 700 by the time I had the game running. Also, most of the cores were listing pid=0 most of the time, except when 4 processes from other PIDs cycled around the various cores every second of output.

@arighi (Collaborator) commented Feb 26, 2024

ok, nr_page_faults=0 is good; 700 waiting tasks, instead, is not really good. Are they all listed in nr_queued, in nr_scheduled, or both? If they're all in nr_queued it means that the scheduler is not awakened fast enough; if they're in nr_scheduled it means that the scheduler should be more aggressive at dispatching them.

Listing pid=0 for most of the cores is normal (unless the system is massively overloaded), because the scheduler runs when some tasks expire their time slice, and in order to be able to dispatch more tasks at least some cores need to be free.

@kode54 (Author) commented Feb 27, 2024

Here's a sample of output with a queued task:

Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] vruntime=185057608512
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   tasks=988
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_user_dispatches=3077278 nr_kernel_dispatches=57754827
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_cancel_dispatches=0 nr_bounce_dispatches=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_waiting=1 [nr_queued=1 + nr_scheduled=0]
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   nr_failed_dispatches=0 nr_sched_congested=0 nr_page_faults=0 [OK]
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] time slice = 20000 us
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] slice boost = 1600
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO] Running tasks:
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  0 cpu  0 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  0 cpu  8 pid=184230
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  1 cpu  1 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  1 cpu  9 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  2 cpu  2 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  2 cpu 10 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  3 cpu  3 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  3 cpu 11 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  4 cpu  4 pid=686362
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  4 cpu 12 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  5 cpu  5 pid=1289884
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  5 cpu 13 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  6 cpu  6 pid=[self]
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  6 cpu 14 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  7 cpu  7 pid=0
Feb 27 02:05:53 mrgency bash[1286191]: 10:05:53 [INFO]   core  7 cpu 15 pid=0

@arighi (Collaborator) commented Feb 27, 2024

That looks correct; things can become problematic if nr_queued or nr_scheduled grow into big numbers.

@kode54 (Author) commented Feb 28, 2024

I attempted to run it with PROTON_LOG=1 to see if there would be anything useful. Just 375MB of log spew over 100 seconds. Most of it is exceptions being thrown and stack unwind being logged, multiple times per millisecond.

@kode54 (Author) commented Mar 1, 2024

I doubled my RAM capacity, and it's still happening. No swapping going on.

I checked Netdata. Average CPU Pressure hits about 17% at its peak while the game is loading, then drops to about 10% while it's running idle. CPU utilization hits 38% or so while it's loading, and that's across all 16 threads.

@kode54 (Author) commented Mar 2, 2024

Further checking: I'm compiling a kernel now, while a btrfs scrub is running on my large storage array, which isn't involved in the compilation or its backing. I was hitting 98-100% CPU usage and load averages of 19 or so for my 16 threads.

Then I loaded scx_rustland. CPU usage dropped to 78% and load averages shot up to 30.

I should also mention that switching Fall Guys to Proton 9.0 beta also seems to have alleviated most of the problems it had with sched-ext.

@arighi (Collaborator) commented Mar 3, 2024

Alright, I'm re-reading this thread, let's try to tackle this one and see if we can figure out what's going on.

IIUC you are still experiencing some stuttering if you start any sched-ext scheduler (scx_rusty, scx_rustland, scx_nest or scx_simple). IMHO we should try to focus on one scheduler, because it seems unlikely that sched-ext itself can cause performance issues; it's more likely that the bottleneck is in a particular scheduler's code.

With scx_simple I'd expect poor responsiveness if you have a lot of tasks running in your system. It uses vruntime-based scheduling by default, but apart from that there's not much going on, so when there are lots of tasks running the average wait time can naturally increase, due to the single-queue FIFO ordering (and the constant time slice assigned to all tasks), causing the stuttering.
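
For reference, the core of that single-queue approach boils down to something like this (a simplified sketch in the spirit of scx_simple, not its actual source; SHARED_DSQ is just an illustrative name):

#include <scx/common.bpf.h>

#define SHARED_DSQ 0	/* one global queue shared by all CPUs */

s32 BPF_STRUCT_OPS_SLEEPABLE(simple_init)
{
	return scx_bpf_create_dsq(SHARED_DSQ, -1);
}

void BPF_STRUCT_OPS(simple_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Every task gets the same fixed slice and is ordered only by its
	 * vruntime in the single shared queue. */
	scx_bpf_dispatch_vtime(p, SHARED_DSQ, SCX_SLICE_DFL,
			       p->scx.dsq_vtime, enq_flags);
}

void BPF_STRUCT_OPS(simple_dispatch, s32 cpu, struct task_struct *prev)
{
	/* Each CPU pulls the lowest-vruntime task from the shared queue; with
	 * many runnable tasks the time spent waiting behind that queue grows. */
	scx_bpf_consume(SHARED_DSQ);
}

char _license[] SEC("license") = "GPL";

In other words, nothing in this model privileges the compositor or other latency-sensitive tasks when the queue gets long.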

With scx_rusty things can be much better, even in the presence of lots of running tasks, because of its multi-domain nature, the load balancing, and many other things.

With scx_rustland we have the user-space overhead, but there's also the logic to boost interactive tasks, which should compensate for the overhead and improve responsiveness even in the presence of many running tasks.

scx_nest in this context is probably not the best choice, considering its approach of keeping tasks together on warm cores (when the system is massively overloaded, we may want to do the opposite and try to spread tasks among the available cores as much as possible, since caches will be thrashed anyway).

That said, my assumption is that your workload consists of multiple tasks contending for the CPUs at the same time (many more tasks than cores), some of them CPU-bound, others I/O-bound. Can you confirm that this is the case? (My assumption is based on the fact that your load increases in some cases.) If so, then we should probably focus on either scx_rusty or scx_rustland.

At some point you mentioned:

when I launched scx_rustland, tasks was about 60-70, then increased to about 700 by the time I had the game running.

Are you able to reproduce this? If you can, it'd be interesting to check whether all these queued tasks are reported in nr_queued or nr_scheduled (or both). The former means that tasks keep piling up in the queue and we don't wake up the user-space scheduler fast enough; the latter means that the user-space scheduler is awakened but fails to dispatch tasks, because the CPUs are busy or for other reasons (then we'd need to figure out what's going on, but one step at a time...).

Another aspect that may impact system responsiveness is the time slice (how much time each task is allowed to run before the scheduler reclaims its CPU). For this, have you tried using a smaller time slice, for example starting the scheduler (either scx_rusty or scx_rustland) with -s 5000? Does it make the system more responsive or not?
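
For example (the value is in microseconds, so 5000 corresponds to a 5ms slice):

sudo scx_rustland -s 5000
sudo scx_rusty -s 5000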

So, to recap: first of all I think we should better understand the nature of your workload (lots of tasks vs few tasks running; at least now we know that we're not dealing with memory pressure conditions), then understand whether the scheduler is massively overloaded (for some reason) or not, then understand whether the default time slice is appropriate for your responsiveness expectations.

Once we understand all of this we can try to refine our analysis by doing some targeted profiling.

And thanks a ton for all your updates and for sharing all these details with us!

@kode54 (Author) commented Mar 5, 2024

It seems to be working fine so far with -s 5000 on scx_rustland. I'm not sure if the previous updates from 6.7.7 affecting zswap had any effect on this. Also, the game seems to use more CPU time if I have vsync enabled. I'm also using variable refresh rate, 40-60Hz.

@arighi (Collaborator) commented Mar 5, 2024

Sometimes I also need to use -s 5000 to prevent some audio crackling when the system is massively loaded. Or (not sure why) I need to change the scheduling class of pipewire/pipewire-pulse/wireplumber from real-time to normal (so that they'll be scheduled by rustland as well). Either way prevents the audio crackling.
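
For the record, I demote them with something along these lines (assuming chrt from util-linux; the daemons may re-acquire their real-time priority after a restart):

for pid in $(pidof pipewire pipewire-pulse wireplumber); do sudo chrt --other --pid 0 "$pid"; done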

I'm really considering setting the default time slice to 5000 (or 10000) for scx_rustland; given that the main goal of this scheduler is to prioritize low-latency workloads, it probably makes sense to use a shorter time slice by default.

arighi added a commit that referenced this issue Mar 10, 2024
In line with rustland's focus on prioritizing interactive tasks, set the
default base time slice to 5ms.

This allows to mitigate potential audio crackling issues or system lags
when the system is overloaded or under memory pressure condition (i.e.,
#96 (comment)).

A downside of this change is to introduce potential regressions in the
throughput of CPU-intensive workloads, but in such scenarios rustland
may not be the optimal choice and alternative schedulers may be
preferred.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi (Collaborator) commented Mar 14, 2024

I think we can close this one for now; after the latest changes to the default settings in scx_rustland, we should be able to mitigate these performance issues.

@kode54 do you agree or do you think there's something else that we should address/improve?

@kode54 (Author) commented Mar 14, 2024

scx_rusty needs to be 5ms too, unless low latency isn’t its aim.

@arighi (Collaborator) commented Mar 14, 2024

scx_rusty is a more general-purpose scheduler, so it also needs to take throughput into account. Do you see some improvements in your case if you run scx_rusty -s 5000?

I know @Byte-Lab had some plans to apply the same "dynamic time slice" concept to scx_rusty as well, which should help in cases like this. I'm not sure if there's already some work in progress for this; otherwise I can take a look.

So, let's keep this open for now. Thanks @kode54 .

@kode54 (Author) commented Mar 15, 2024

I’m not sure if it’s kernel updates or using -s 5000, but I can now play Fall Guys without terrible lag. Also, unlike rustland, rusty doesn’t impact my Geekbench score by over 3000 points on the multi core result. My multi core score is now more than the average score for this model CPU.

@Byte-Lab (Contributor) commented:

Nice, glad to hear. We still have a lot more we can do to make rusty more interactive, but glad to hear things seem to be going in the right direction.

@arighi (Collaborator) commented Mar 15, 2024

I’m not sure if it’s kernel updates or using -s 5000, but I can now play Fall Guys without terrible lag. Also, unlike rustland, rusty doesn’t impact my Geekbench score by over 3000 points on the multi core result. My multi core score is now more than the average score for this model CPU.

About rustland this is kind of expected; the scheduler is not really nice to non-interactive CPU-intensive tasks. 😄

@kode54 (Author) commented Mar 30, 2024

Adding another weird case: running an OBS Studio virtual camera with v4l2loopback, feeding it a combined PipeWire desktop capture and webcam capture overlay, becomes incredibly laggy when the CPU is fully loaded, such as when building a package or a kernel.
