scx_rustland: per-CPU DSQs + global shared DSQ #110

Merged: 7 commits merged into main on Feb 2, 2024

Conversation

@arighi arighi (Collaborator) commented Jan 27, 2024

Opening this PR mostly as an open discussion.

Main topic
Designing a universal BPF layer that allows implementing totally generic schedulers in user-space.

Specific goal addressed by this PR
The BPF part should provide an interface that allows the user-space scheduler to select a specific CPU on which to dispatch a task, and it should also allow not specifying any CPU at all (in which case the task can be dispatched on any CPU, e.g. the first one that becomes available).

Implementation
The idea is to adapt @Decave's suggestion of introducing per-CPU DSQs: the user-space scheduler sends the BPF dispatcher a PID with an optional target CPU; the BPF part dispatches the task to the DSQ associated with the target CPU, if specified, otherwise the task is dispatched to a global shared DSQ. Per-CPU DSQs are consumed from the dispatch() callback of their corresponding CPU; the shared DSQ is consumed from the dispatch() callback of any CPU.

In this way we can provide a totally generic interface that any user-space scheduler can use to implement any type of CPU selection logic.
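
To make the scheme concrete, here is a minimal BPF-side sketch of the idea (not the actual PR code: SHARED_DSQ, NO_CPU and cpu_to_dsq() are illustrative names, while scx_bpf_dispatch() and scx_bpf_consume() are the sched_ext kfuncs available at the time):

  #include <scx/common.bpf.h>

  #define SHARED_DSQ  0     /* global shared DSQ id */
  #define NO_CPU      (-1)  /* user-space did not pick a target CPU */

  /* Per-CPU DSQ ids follow the shared one: CPU n -> DSQ n + 1. */
  static u64 cpu_to_dsq(s32 cpu)
  {
      return (u64)cpu + 1;
  }

  /* Dispatch a task to the DSQ chosen by the user-space scheduler. */
  static void dispatch_task(struct task_struct *p, s32 cpu, u64 slice_ns)
  {
      if (cpu == NO_CPU)
          scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
      else
          scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);
  }

  void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
  {
      /* First serve the tasks bound to this CPU... */
      if (scx_bpf_consume(cpu_to_dsq(cpu)))
          return;
      /* ...then fall back to the global shared DSQ. */
      scx_bpf_consume(SHARED_DSQ);
  }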

Problem
A task is dispatched to a specific per-CPU DSQ, but the dispatch() callback is never called on that CPU => starvation.

I haven't been able to find a nice way to force the dispatch() callback to be invoked on a specific CPU, so that tasks queued there can be consumed. I was assuming that scx_bpf_kick_cpu() could be used for this, but it doesn't seem to be the case.

Apparently all this logic works (no starvation) if the target CPU of a task is always the same as the one determined in the select_cpu callback. Basically the user-space scheduler can only "acknowledge" select_cpu. What I would like to do, instead, is give the user-space scheduler the ability to potentially override the choice made in select_cpu.

So, at the moment, as a workaround, when the user-space scheduler tries to dispatch a task on a CPU that is different from the one selected in select_cpu, I simply redirect the task to the shared DSQ (which is consumed from all the CPUs) => no starvation. But I would definitely prefer it if we could honor the decision made by the user-space scheduler.

Does all of this make sense? Any suggestion? Thanks!

@arighi arighi force-pushed the scx-rustland-pcpu-dsq branch 2 times, most recently from e4479e0 to f615e5b on January 30, 2024 at 17:24

@arighi arighi (Collaborator, Author) commented Jan 30, 2024

@Decave @htejun I think I figured out my starvation issue! It looks like the problem was that I still needed to check the cpumask before dispatching a task to a per-CPU DSQ.

If I dispatch (and consume) a task on an "invalid" CPU, I can immediately trigger the starvation issue. So the fix is to check the cpumask before calling scx_bpf_dispatch(): if the CPU picked by the user-space scheduler is invalid, the BPF layer redirects the task to the SHARED_DSQ, kicking the CPU selected by the built-in idle selection logic, so the task runs there, on a valid CPU. All good.

In this way the BPF layer provides a generic interface to the user-space scheduler, which can pick any arbitrary CPU when dispatching a task, or not specify any CPU at all (task.cpu = NO_CPU); in the latter case the BPF dispatcher relies on the built-in idle selection logic. If user-space selects an invalid CPU, the BPF layer fixes that transparently, redirecting the task to a valid CPU.
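
Roughly, the check looks like this (a sketch building on the hypothetical helpers above; the real PR code differs in details, e.g. how the CPU chosen by the built-in idle selection logic is remembered):

  /*
   * Validate the CPU chosen by user-space before dispatching. If it is
   * not in the task's cpumask, bounce the task to the shared DSQ and
   * kick the CPU picked by the built-in idle selection logic (idle_cpu,
   * assumed to be saved earlier in select_cpu()), so the task runs on a
   * valid CPU instead of sitting in a per-CPU DSQ that may never be
   * consumed.
   */
  static void dispatch_task_checked(struct task_struct *p, s32 cpu,
                                    s32 idle_cpu, u64 slice_ns)
  {
      if (cpu != NO_CPU && bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
          /* User-space picked a valid CPU: honor it. */
          scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);
          scx_bpf_kick_cpu(cpu, 0);
          return;
      }
      /* Invalid (or unspecified) CPU: use the shared DSQ instead. */
      scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
      if (idle_cpu != NO_CPU)
          scx_bpf_kick_cpu(idle_cpu, 0);
  }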

I've updated this PR with all of the above, plus some low-latency improvements on top, because I've been testing the scheduler a lot today with all these changes applied and it seems to work pretty well.

All the games that I've tested seem to be responsive: no starvation, no stalls, both with and without a background make -j N (varying N across 8, 16, 32, 64). I also ran stress-ng --race-sched N, as well as stress-ng -c N. I even did a video call while building the kernel (make -j 32) and the audio/video was really smooth.

In conclusion, everything seems to work as expected now (even better than expected I'd say... at least on my desktop and my laptop).

@htejun htejun (Contributor) left a comment:

It might be worthwhile to see whether just ignoring the task targeting the wrong dsq works. Other than that, looks good to me.

*
* In the future we may want to provide a way to check the
* cpumask in advance from user-space in a proper synchronized
* way, instead of bouncing the task somewhere else, but for

@htejun (Contributor) commented:

So, this should only be a problem when cpus_allowed changes. That synchronization is performed through .dequeue(), and the problem is that that path can't sleep, so implementing an actual handshake with userland is likely not possible. The sequence of events is likely as follows:

  1. Task gets enqueued and pushed to userspace and queued there.
  2. The task's cpus_allowed is going to be updated, so the task gets dequeued (ignored by rustland) and the cpumask is being updated.
  3. Userland sched pushes the task to the kernel for dispatching. Actual dispatch hasn't happened yet.
  4. The task's cpus_allowed is updated and the task gets reenqueued.
  5. The task gets dispatched targeting now wrong CPU.

So, in scenarios like this, it should be safe to just ignore the task. The task is already re-enqueued, so it will get re-dispatched. The problem with the earlier attempt likely was that it was dispatched to the wrong dsq. Then, later when the re-enqueued task gets dispatched, it's ignored as it's already on another dsq.

@arighi (Collaborator, Author) replied:

Thanks tons for looking at this @htejun !

I've tried to ignore the task completely when the cpumask is not valid, but I still get the failed to run for 5s error. To reproduce this, I always dispatch to CPU#2 from the user-space scheduler and then run stress-ng --race-sched 16, which spawns 16 tasks that constantly bounce around using sched_setaffinity() in a busy loop.

However, it looks like we can actually ignore the task if the target CPU is not valid and the CPU currently assigned to the task (scx_bpf_task_cpu()) is also not valid, see d13ed5c. Does it make sense to you? With this in place everything seems to work and I can't break the scheduler. So I added this change to the PR and updated some comments here and there.

I'm currently stress testing this, but so far so good.
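
For reference, the condition described above boils down to something like this (a sketch, not the exact code from d13ed5c, reusing the hypothetical helpers from the earlier sketches):

  /*
   * If the CPU picked by user-space is not in the task's cpumask, fall
   * back to the CPU the task is currently assigned to; if that is
   * invalid too (cpus_allowed changed while the task was queued in
   * user-space), skip the dispatch entirely: the task has been dequeued
   * and re-enqueued, so it will be dispatched again later with an
   * up-to-date cpumask.
   */
  static void dispatch_on_cpu(struct task_struct *p, s32 cpu, u64 slice_ns)
  {
      if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
          cpu = scx_bpf_task_cpu(p);
          if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
              return; /* both CPUs invalid: ignore the task */
      }
      scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);
      scx_bpf_kick_cpu(cpu, 0);
  }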

No functional change, just some refactoring to make the code more clear.

We have is_usersched_needed() and set_usersched_needed() that do
different things (the former checks if there are pending tasks for the
scheduler, the latter sets the usersched_needed flag to activate the
dispatch of the user-space scheduler).

Rename is_usersched_needed() to usersched_has_pending_tasks() to make
the code more clear and understandable.

Also move dispatch_user_scheduler() closer to the other dispatch-related
helper functions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

With commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead
of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks.

This was required because dispatching tasks using SCX_DSQ_LOCAL_ON
doesn't guarantee that the cpumask, checked at dispatch time to
determine the validity of a target CPU, remains valid.

This method solved the cpumask validity issue, but unfortunately it
introduced a noticeable performance regression and a potential
starvation issue (that were probably caused by the same problem): if a
task is assigned to a CPU in select_cpu() and the scheduler decides to
dispatch it on a different CPU, the task will be added to the new CPU's
DSQ, but if no dispatch event happens there, the task may remain stuck
in the per-CPU DSQ for a long time, triggering the sched-ext watchdog
timeout that would kick out the scheduler, for example:

  12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026)
  12:53:28 [INFO] Unregister RustLand scheduler

Therefore, we reverted this change with 6d89ece ("scx_rustland: dispatch
tasks only on the global DSQ"), dispatching all tasks to the global DSQ
and leaving it entirely to the kernel to distribute tasks among the
available CPUs.

This is not the ideal solution, because we still want to give the
user-space scheduler the ability to assign tasks to specific CPUs.

Therefore, re-introduce distinct per-CPU DSQs, but also provide a global
shared DSQ. Tasks dispatched to the per-CPU DSQs are consumed from the
dispatch() callback of their corresponding CPU; tasks dispatched to the
global shared DSQ are consumed from any CPU.

In this way the BPF layer provides an interface that gives user-space
the flexibility to dispatch a task on a specific CPU or on the first
CPU available, depending on the particular scheduler's needs.

If an invalid CPU (according to the cpumask) is selected the BPF
dispatcher will transparently redirect the task to a valid CPU, selected
using the built-in idle selection logic.

In the future we may want to improve this part, giving user-space
visibility of the cpumask, so that it can pick a valid CPU in advance
in a properly synchronized way.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
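
For completeness, the DSQs described here could be created at init time along these lines (a sketch with illustrative names, reusing cpu_to_dsq() and SHARED_DSQ from the earlier sketches; nr_possible_cpus is assumed to be filled in by user-space before loading):

  /* Number of possible CPUs, set by user-space at load time (assumed). */
  const volatile s32 nr_possible_cpus = 8;

  s32 BPF_STRUCT_OPS_SLEEPABLE(rustland_init)
  {
      s32 err, cpu;

      /* Global shared DSQ, consumable from any CPU. */
      err = scx_bpf_create_dsq(SHARED_DSQ, -1);
      if (err)
          return err;

      /* One DSQ per CPU, consumed only from that CPU's dispatch(). */
      bpf_for(cpu, 0, nr_possible_cpus) {
          err = scx_bpf_create_dsq(cpu_to_dsq(cpu), -1);
          if (err)
              return err;
      }
      return 0;
  }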

When the user-space scheduler dispatches a task on a specific CPU, that
CPU might not be valid, since the user-space doesn't have visibility of
the task's cpumask.

When this happens the BPF dispatcher (which has direct visibility of the
cpumask) should automatically redirect the task to a valid CPU; but
instead of bouncing the task to the shared DSQ, we should try to use the
CPU assigned by the built-in idle selection logic.

If this CPU is also not valid, then we can simply ignore the task: it
has been dequeued and re-enqueued, so a valid CPU will naturally be
re-selected at a later time.

Moreover, avoid kicking any specific CPU when the task is dispatched to
the shared DSQ, since the task can be consumed on any CPU and the
additional kick would simply add more overhead.

Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to
cpu_to_dsq() for more clarity.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

No functional change, just move code around to make it more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Simplify the idle selection logic by relying only on the built-in idle
selection performed in the BPF layer.

When there are idle CPUs available in the system, tasks are dispatched
directly by the BPF dispatcher without invoking the user-space
scheduler. This avoids the user-space overhead and gives the best
system performance when CPU resources are not overcommitted.

Once the number of tasks exceeds the available CPUs, the user-space
scheduler takes over. However, by this time, the system is already
overcommitted, so there's little advantage in attempting to pinpoint the
optimal idle CPU through the user-space scheduler. Instead, tasks can be
executed on the first available CPU, consistently dispatching them to
the shared DSQ.

This allows optimal performance to be achieved both when the system is
under-utilized and when it is over-utilized.

With this change in place the user-space scheduler won't dispatch tasks
directly to specific CPUs, but we still want to keep this as a generic
feature in the BPF layer, so that it can potentially be used in the
future by this scheduler or even by other user-space schedulers (once
the BPF layer is moved to a more generic place).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
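
A simplified sketch of that split, using scx_bpf_select_cpu_dfl() from the sched_ext API (the actual scheduler handles more details than this):

  /*
   * Rely on the built-in idle selection: if an idle CPU is found,
   * dispatch the task directly to that CPU's local DSQ and skip the
   * user-space scheduler entirely; otherwise the task will be queued to
   * user-space and end up on the shared DSQ.
   */
  s32 BPF_STRUCT_OPS(rustland_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
      bool is_idle = false;
      s32 cpu;

      cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
      if (is_idle)
          /* Idle CPU available: direct dispatch, bypass user-space. */
          scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

      return cpu;
  }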

Evaluate the number of voluntary context switches per second (nvcsw/sec)
for each task using an exponentially weighted moving average (EWMA) with
weight 0.5, which allows interactive tasks to be classified more
accurately.

Using a simple average over a 10 sec period can introduce small lags
every 10 sec, as the statistics for the number of voluntary context
switches are refreshed. This can result in interactive tasks taking a
brief time to catch up before being accurately classified as such,
causing, for example, short audio crackling or a small 5-10 fps drop in
games.

Using an EWMA smooths the average of nvcsw/sec, preventing short lags
for interactive tasks, while also avoiding incorrectly classifying as
interactive those tasks that only experience an isolated short burst of
voluntary context switches.

This patch has been tested with the usual test case of playing a
videogame while running a parallel kernel build in the background.

Without this patch the short lag every 10 sec is clearly noticeable,
with this patch applied the game and audio run smoothly.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
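
The update itself is simple: with weight 0.5 the EWMA reduces to averaging the latest sample with the previous estimate. A sketch (names are illustrative, not the scheduler's actual code):

  /*
   * Exponentially weighted moving average of the voluntary context
   * switch rate: avg = 0.5 * sample + 0.5 * prev_avg, i.e. the mean of
   * the previous estimate and the latest measurement.
   */
  static u64 update_nvcsw_avg(u64 prev_avg, u64 delta_nvcsw, u64 delta_ns)
  {
      u64 sample;

      if (!delta_ns)
          return prev_avg;
      /* Voluntary context switches per second over the last interval. */
      sample = delta_nvcsw * 1000000000ULL / delta_ns;
      /* EWMA with weight 0.5. */
      return (prev_avg + sample) / 2;
  }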

Allow the effective time slice to scale down to 250 us. This can help
maintain good audio quality even when the system is overloaded by
multiple CPU-intensive tasks.

Moreover, always round up the time slice scaling factor, to be a little
more aggressive at scaling down the time slice, so that low-latency
tasks are prioritized even more.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
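
Purely as an illustration of the idea (the formula and constants below are assumptions, only the 250 us lower bound comes from this commit), the scaling could look like:

  #define SLICE_NS_DEFAULT  (5ULL * 1000 * 1000)  /* assumed base slice: 5 ms */
  #define SLICE_NS_MIN      (250ULL * 1000)       /* 250 us floor from this commit */

  /*
   * Scale the base time slice down as the number of waiting tasks grows.
   * Rounding the scaling factor up makes the scaling slightly more
   * aggressive; clamping at 250 us keeps latency-sensitive tasks
   * scheduled frequently even under heavy load.
   */
  static u64 effective_slice(u64 nr_waiting, u64 nr_cpus)
  {
      u64 scale = (nr_waiting + nr_cpus - 1) / nr_cpus;  /* round up */
      u64 slice;

      if (!scale)
          scale = 1;
      slice = SLICE_NS_DEFAULT / scale;
      return slice < SLICE_NS_MIN ? SLICE_NS_MIN : slice;
  }
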
@htejun htejun merged commit 7ffdcc1 into main Feb 2, 2024
2 checks passed
@Byte-Lab Byte-Lab deleted the scx-rustland-pcpu-dsq branch March 14, 2024 18:20