scx_rustland: per-CPU DSQs + global shared DSQ #110

Merged: 7 commits merged into main on Feb 2, 2024

Conversation

@arighi arighi (Collaborator) commented Jan 27, 2024

Opening this PR mostly as an open discussion.

Main topic
Designing a universal BPF layer that allows implementing totally generic schedulers in user-space.

Specific goal addressed by this PR
The BPF part should provide an interface that allows the user-space scheduler to select a specific CPU on which to dispatch a task, and it should also allow not specifying any CPU at all (in which case the task can be dispatched on any CPU, e.g. the first one that becomes available).

Implementation
The idea is to adapt @Decave's suggestion of introducing per-CPU DSQs: the user-space scheduler sends the BPF dispatcher a PID with an optional target CPU; the BPF part dispatches the task to the DSQ associated with the target CPU, if specified, otherwise the task is dispatched to a global shared DSQ. Per-CPU DSQs are consumed from the dispatch() callback of their corresponding CPU; the shared DSQ is consumed from the dispatch() callback of any CPU.

In this way we can provide a totally generic interface that any user-space scheduler can use to implement any type of CPU selection logic.
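
To make the scheme concrete, here is a minimal BPF-side sketch of the idea (not the actual PR code: SHARED_DSQ, NO_CPU and cpu_to_dsq() are illustrative names, while scx_bpf_dispatch() and scx_bpf_consume() are the sched_ext kfuncs available at the time):

  #include <scx/common.bpf.h>

  #define SHARED_DSQ  0     /* global shared DSQ id */
  #define NO_CPU      (-1)  /* user-space did not pick a target CPU */

  /* Per-CPU DSQ ids follow the shared one: CPU n -> DSQ n + 1. */
  static u64 cpu_to_dsq(s32 cpu)
  {
      return (u64)cpu + 1;
  }

  /* Dispatch a task to the DSQ chosen by the user-space scheduler. */
  static void dispatch_task(struct task_struct *p, s32 cpu, u64 slice_ns)
  {
      if (cpu == NO_CPU)
          scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
      else
          scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);
  }

  void BPF_STRUCT_OPS(rustland_dispatch, s32 cpu, struct task_struct *prev)
  {
      /* First serve the tasks bound to this CPU... */
      if (scx_bpf_consume(cpu_to_dsq(cpu)))
          return;
      /* ...then fall back to the global shared DSQ. */
      scx_bpf_consume(SHARED_DSQ);
  }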

Problem
A task is dispatched to a specific per-CPU DSQ, but the dispatch() callback is never called on that CPU => starvation.

I haven't been able to find a nice way to force the dispatch() callback to be invoked on a specific CPU, so that tasks queued there can be consumed. I was assuming that scx_bpf_kick_cpu() could be used for this, but it doesn't seem to be the case.

Apparently all this logic works (no starvation) if the target CPU of a task is always the same as the one determined in the select_cpu callback. Basically the user-space scheduler can only "acknowledge" select_cpu. What I would like to do, instead, is give the user-space scheduler the ability to potentially override the choice made in select_cpu.

So, at the moment, as a workaround, when the user-space scheduler tries to dispatch a task on a CPU that is different from the one selected in select_cpu, I simply redirect the task to the shared DSQ (which is consumed from all the CPUs) => no starvation. But I would definitely prefer it if we could honor the decision made by the user-space scheduler.

Does all of this make sense? Any suggestion? Thanks!

@arighi arighi force-pushed the scx-rustland-pcpu-dsq branch 2 times, most recently from e4479e0 to f615e5b on January 30, 2024 at 17:24

@arighi arighi (Collaborator, Author) commented Jan 30, 2024

@Decave @htejun I think I figured out my starvation issue! It looks like the problem was that I still needed to check the cpumask before dispatching a task to a per-CPU DSQ.

If I dispatch (and consume) a task on an "invalid" CPU, I can immediately trigger the starvation issue. So the fix is to check the cpumask before calling scx_bpf_dispatch(): if the CPU picked by the user-space scheduler is invalid, the BPF layer redirects the task to the SHARED_DSQ, kicking the CPU selected by the built-in idle selection logic, so the task runs there, on a valid CPU. All good.

In this way the BPF layer provides a generic interface to the user-space scheduler, which can pick any arbitrary CPU when dispatching a task, or not specify any CPU at all (task.cpu = NO_CPU); in the latter case the BPF dispatcher relies on the built-in idle selection logic. If user-space selects an invalid CPU, the BPF layer fixes that transparently, redirecting the task to a valid CPU.
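
Roughly, the check looks like this (a sketch building on the hypothetical helpers above; the real PR code differs in details, e.g. how the CPU chosen by the built-in idle selection logic is remembered):

  /*
   * Validate the CPU chosen by user-space before dispatching. If it is
   * not in the task's cpumask, bounce the task to the shared DSQ and
   * kick the CPU picked by the built-in idle selection logic (idle_cpu,
   * assumed to be saved earlier in select_cpu()), so the task runs on a
   * valid CPU instead of sitting in a per-CPU DSQ that may never be
   * consumed.
   */
  static void dispatch_task_checked(struct task_struct *p, s32 cpu,
                                    s32 idle_cpu, u64 slice_ns)
  {
      if (cpu != NO_CPU && bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
          /* User-space picked a valid CPU: honor it. */
          scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);
          scx_bpf_kick_cpu(cpu, 0);
          return;
      }
      /* Invalid (or unspecified) CPU: use the shared DSQ instead. */
      scx_bpf_dispatch(p, SHARED_DSQ, slice_ns, 0);
      if (idle_cpu != NO_CPU)
          scx_bpf_kick_cpu(idle_cpu, 0);
  }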

I've updated this PR with all of the above, plus some low-latency improvements on top, because I've been testing the scheduler a lot today with all these changes applied and it seems to work pretty well.

All the games that I've tested seem to be responsive: no starvation, no stalls, both with and without a background make -j N (varying N across 8, 16, 32, 64). I also ran stress-ng --race-sched N, as well as stress-ng -c N. I even did a video call while building the kernel (make -j 32) and the audio/video was really smooth.

In conclusion, everything seems to work as expected now (even better than expected I'd say... at least on my desktop and my laptop).

@htejun htejun (Contributor) left a comment:

It might be worthwhile to see whether just ignoring the task targeting the wrong dsq works. Other than that, looks good to me.

*
* In the future we may want to provide a way to check the
* cpumask in advance from user-space in a proper synchronized
* way, instead of bouncing the task somewhere else, but for

@htejun (Contributor) commented:

So, this should only be a problem when cpus_allowed changes. That synchronization is performed through .dequeue(), and the problem is that that path can't sleep, so implementing an actual handshake with userland is likely not possible. The sequence of events is likely as follows:

  1. Task gets enqueued and pushed to userspace and queued there.
  2. The task's cpus_allowed is going to be updated, so the task gets dequeued (ignored by rustland) and the cpumask is being updated.
  3. Userland sched pushes the task to the kernel for dispatching. Actual dispatch hasn't happened yet.
  4. The task's cpus_allowed is updated and the task gets reenqueued.
  5. The task gets dispatched targeting now wrong CPU.

So, in scenarios like this, it should be safe to just ignore the task. The task is already re-enqueued, so it will get re-dispatched. The problem with the earlier attempt likely was that it was dispatched to the wrong dsq. Then, later when the re-enqueued task gets dispatched, it's ignored as it's already on another dsq.

@arighi (Collaborator, Author) replied:

Thanks tons for looking at this @htejun !

I've tried to ignore the task completely when the cpumask is not valid, but I still get the failed to run for 5s error. To reproduce this, I always dispatch to CPU#2 from the user-space scheduler and then run stress-ng --race-sched 16, which spawns 16 tasks that constantly bounce around using sched_setaffinity() in a busy loop.

However, it looks like we can actually ignore the task if the target CPU is not valid and the CPU currently assigned to the task (scx_bpf_task_cpu()) is also not valid, see d13ed5c. Does it make sense to you? With this in place everything seems to work and I can't break the scheduler. So I added this change to the PR and updated some comments here and there.

I'm currently stress testing this, but so far so good.
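
For reference, the condition described above boils down to something like this (a sketch, not the exact code from d13ed5c, reusing the hypothetical helpers from the earlier sketches):

  /*
   * If the CPU picked by user-space is not in the task's cpumask, fall
   * back to the CPU the task is currently assigned to; if that is
   * invalid too (cpus_allowed changed while the task was queued in
   * user-space), skip the dispatch entirely: the task has been dequeued
   * and re-enqueued, so it will be dispatched again later with an
   * up-to-date cpumask.
   */
  static void dispatch_on_cpu(struct task_struct *p, s32 cpu, u64 slice_ns)
  {
      if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
          cpu = scx_bpf_task_cpu(p);
          if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
              return; /* both CPUs invalid: ignore the task */
      }
      scx_bpf_dispatch(p, cpu_to_dsq(cpu), slice_ns, 0);
      scx_bpf_kick_cpu(cpu, 0);
  }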

No functional change, just some refactoring to make the code more clear.

We have is_usersched_needed() and set_usersched_needed() that do
different things (the former checks if there are pending tasks for the
scheduler, the latter sets the usersched_needed flag to activate the
dispatch of the user-space scheduler).

Rename is_usersched_needed() to usersched_has_pending_tasks() to make
the code more clear and understandable.

Also move dispatch_user_scheduler() closer to the other dispatch-related
helper functions.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

With commit c6ada25 ("scx_rustland: use custom pcpu DSQ instead of
SCX_DSQ_LOCAL{_ON}") we tried to introduce custom per-CPU DSQs, instead
of using SCX_DSQ_LOCAL and SCX_DSQ_LOCAL_ON to dispatch tasks.

This was required because dispatching tasks using SCX_DSQ_LOCAL_ON
doesn't guarantee that the cpumask, checked at dispatch time to
determine the validity of a target CPU, remains valid.

This method solved the cpumask validity issue, but unfortunately it
introduced a noticeable performance regression and a potential
starvation issue (that were probably caused by the same problem): if a
task is assigned to a CPU in select_cpu() and the scheduler decides to
dispatch it on a different CPU, the task will be added to the new CPU's
DSQ, but if no dispatch event happens there, the task may remain stuck
in the per-CPU DSQ for a long time, triggering the sched-ext watchdog
timeout that would kick out the scheduler, for example:

  12:53:28 [WARN] FAIL: IPC:CSteamEngin[7217] failed to run for 6.482s (err=1026)
  12:53:28 [INFO] Unregister RustLand scheduler

Therefore, we reverted this change with 6d89ece ("scx_rustland: dispatch
tasks only on the global DSQ"), dispatching all tasks to the global DSQ
and leaving it entirely to the kernel to distribute tasks among the
available CPUs.

This is not the ideal solution, because we still want to give the
user-space scheduler the ability to assign tasks to specific CPUs.

Therefore, re-introduce distinct per-CPU DSQs, but also provide a global
shared DSQ. Tasks dispatched to the per-CPU DSQs are consumed from the
dispatch() callback of their corresponding CPU; tasks dispatched to the
global shared DSQ are consumed from any CPU.

In this way the BPF layer provides an interface that gives user-space
the flexibility to dispatch a task on a specific CPU or on the first
CPU available, depending on the particular scheduler's needs.

If an invalid CPU (according to the cpumask) is selected the BPF
dispatcher will transparently redirect the task to a valid CPU, selected
using the built-in idle selection logic.

In the future we may want to improve this part, giving user-space
visibility of the cpumask, so that it can pick a valid CPU in advance
in a properly synchronized way.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
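
For completeness, the DSQs described here could be created at init time along these lines (a sketch with illustrative names, reusing cpu_to_dsq() and SHARED_DSQ from the earlier sketches; nr_possible_cpus is assumed to be filled in by user-space before loading):

  /* Number of possible CPUs, set by user-space at load time (assumed). */
  const volatile s32 nr_possible_cpus = 8;

  s32 BPF_STRUCT_OPS_SLEEPABLE(rustland_init)
  {
      s32 err, cpu;

      /* Global shared DSQ, consumable from any CPU. */
      err = scx_bpf_create_dsq(SHARED_DSQ, -1);
      if (err)
          return err;

      /* One DSQ per CPU, consumed only from that CPU's dispatch(). */
      bpf_for(cpu, 0, nr_possible_cpus) {
          err = scx_bpf_create_dsq(cpu_to_dsq(cpu), -1);
          if (err)
              return err;
      }
      return 0;
  }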

When the user-space scheduler dispatches a task on a specific CPU, that
CPU might not be valid, since the user-space doesn't have visibility of
the task's cpumask.

When this happens the BPF dispatcher (which has direct visibility of the
cpumask) should automatically redirect the task to a valid CPU; but
instead of bouncing the task to the shared DSQ, we should try to use the
CPU assigned by the built-in idle selection logic.

If this CPU is also not valid, then we can simply ignore the task: it
has been dequeued and re-enqueued, so a valid CPU will naturally be
re-selected at a later time.

Moreover, avoid kicking any specific CPU when the task is dispatched to
the shared DSQ, since the task can be consumed on any CPU and the
additional kick would simply add more overhead.

Lastly, rename dsq_id_to_cpu() to dsq_to_cpu() and cpu_to_dsq_id() to
cpu_to_dsq() for more clarity.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

No functional change, just move code around to make it more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Simplify the idle selection logic by relying only on the built-in idle
selection performed in the BPF layer.

When there are idle CPUs available in the system, tasks are dispatched
directly by the BPF dispatcher without invoking the user-space
scheduler. This avoids the user-space overhead and gives the best
system performance when CPU resources are not overcommitted.

Once the number of tasks exceeds the available CPUs, the user-space
scheduler takes over. However, by this time, the system is already
overcommitted, so there's little advantage in attempting to pinpoint the
optimal idle CPU through the user-space scheduler. Instead, tasks can be
executed on the first available CPU, consistently dispatching them to
the shared DSQ.

This allows optimal performance to be achieved both when the system is
under-utilized and when it is over-utilized.

With this change in place the user-space scheduler won't dispatch tasks
directly to specific CPUs, but we still want to keep this as a generic
feature in the BPF layer, so that it can potentially be used in the
future by this scheduler or even by other user-space schedulers (once
the BPF layer is moved to a more generic place).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
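
A simplified sketch of that split, using scx_bpf_select_cpu_dfl() from the sched_ext API (the actual scheduler handles more details than this):

  /*
   * Rely on the built-in idle selection: if an idle CPU is found,
   * dispatch the task directly to that CPU's local DSQ and skip the
   * user-space scheduler entirely; otherwise the task will be queued to
   * user-space and end up on the shared DSQ.
   */
  s32 BPF_STRUCT_OPS(rustland_select_cpu, struct task_struct *p,
                     s32 prev_cpu, u64 wake_flags)
  {
      bool is_idle = false;
      s32 cpu;

      cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
      if (is_idle)
          /* Idle CPU available: direct dispatch, bypass user-space. */
          scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);

      return cpu;
  }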

Evaluate the number of voluntary context switches per second (nvcsw/sec)
for each task using an exponentially weighted moving average (EWMA) with
weight 0.5, which allows interactive tasks to be classified more
accurately.

Using a simple average over a 10 sec period can introduce small lags
every 10 sec, as the statistics for the number of voluntary context
switches are refreshed. This can result in interactive tasks taking a
brief time to catch up before being accurately classified as such,
causing, for example, short audio crackling or a small 5-10 fps drop in
games.

Using an EWMA smooths the average of nvcsw/sec, preventing short lags
for interactive tasks, while also avoiding incorrectly classifying as
interactive those tasks that only experience an isolated short burst of
voluntary context switches.

This patch has been tested with the usual test case of playing a
videogame while running a parallel kernel build in the background.

Without this patch the short lag every 10 sec is clearly noticeable,
with this patch applied the game and audio run smoothly.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
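
The update itself is simple: with weight 0.5 the EWMA reduces to averaging the latest sample with the previous estimate. A sketch (names are illustrative, not the scheduler's actual code):

  /*
   * Exponentially weighted moving average of the voluntary context
   * switch rate: avg = 0.5 * sample + 0.5 * prev_avg, i.e. the mean of
   * the previous estimate and the latest measurement.
   */
  static u64 update_nvcsw_avg(u64 prev_avg, u64 delta_nvcsw, u64 delta_ns)
  {
      u64 sample;

      if (!delta_ns)
          return prev_avg;
      /* Voluntary context switches per second over the last interval. */
      sample = delta_nvcsw * 1000000000ULL / delta_ns;
      /* EWMA with weight 0.5. */
      return (prev_avg + sample) / 2;
  }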

Allow the effective time slice to scale down to 250 us. This can help
maintain good audio quality even when the system is overloaded by
multiple CPU-intensive tasks.

Moreover, always round up the time slice scaling factor, to be a little
more aggressive at scaling down the time slice, so that low-latency
tasks are prioritized even more.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
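
Purely as an illustration of the idea (the formula and constants below are assumptions, only the 250 us lower bound comes from this commit), the scaling could look like:

  #define SLICE_NS_DEFAULT  (5ULL * 1000 * 1000)  /* assumed base slice: 5 ms */
  #define SLICE_NS_MIN      (250ULL * 1000)       /* 250 us floor from this commit */

  /*
   * Scale the base time slice down as the number of waiting tasks grows.
   * Rounding the scaling factor up makes the scaling slightly more
   * aggressive; clamping at 250 us keeps latency-sensitive tasks
   * scheduled frequently even under heavy load.
   */
  static u64 effective_slice(u64 nr_waiting, u64 nr_cpus)
  {
      u64 scale = (nr_waiting + nr_cpus - 1) / nr_cpus;  /* round up */
      u64 slice;

      if (!scale)
          scale = 1;
      slice = SLICE_NS_DEFAULT / scale;
      return slice < SLICE_NS_MIN ? SLICE_NS_MIN : slice;
  }
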
@htejun htejun merged commit 7ffdcc1 into main Feb 2, 2024
2 checks passed
@Byte-Lab Byte-Lab deleted the scx-rustland-pcpu-dsq branch March 14, 2024 18:20