
scx_rustland: fix cpumask stall and prevent stuttery behavior #132

Merged
merged 4 commits into main from rustland-fix-cpumask-stall
Feb 8, 2024

Conversation

arighi
Collaborator

@arighi arighi commented Feb 8, 2024

A set of improvements/fixes for rustland.

In particular:

scx_rustland: use scx_bpf_dispatch_cancel()

This uses the new scx_bpf_dispatch_cancel() API to prevent the cpumask-related stall conditions. I still can't completely ignore the task when the cpumask is not valid anymore, because otherwise there are cases where a task is never dispatched to any DSQ. But I can use the cancel operation to reliably redirect the task to the CPU assigned during select_cpu, or to the shared DSQ if the prev CPU is also not available. With this logic applied I haven't been able to stall the scheduler using stress-ng --race-sched, so it is definitely an improvement and a reasonable fix IMHO.
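
For reference, a minimal sketch of what this ends up looking like in the dispatch path (in its final form the fallback always goes to the shared DSQ, see the review discussion below; the helper names mirror the snippets quoted in this thread, and the exact surrounding code differs):

scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (dsq_id != SHARED_DSQ &&
    !bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr)) {
        /*
         * The task's cpumask changed after the target CPU was picked:
         * cancel the dispatch and bounce the task to the shared DSQ, so
         * that it always ends up in some DSQ and can't stall.
         */
        scx_bpf_dispatch_cancel();
        scx_bpf_dispatch(p, SHARED_DSQ, slice, enq_flags);
}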

This also allows enabling the following:

scx_rustland: keep default CPU selection when idle

This uses a more reasonable CPU assignment logic in the user-space scheduler, which redirects tasks to the shared DSQ only when the CPU assigned by the BPF part is not idle anymore.

I've also included a fix that prevents stuttery behavior (I can systematically reproduce this problem while playing Team Fortress 2):

scx_rustland: kick user-space scheduler when a CPU is released

The idea is to kick the user-space scheduler immediately in the .stopping() callback, speculating that most of the time there is another task ready to run. This prevents evident lag issues with Team Fortress 2 and also sporadic lags with Counter-Strike 2. The extra overhead added by potential unnecessary wake-up events seems to be negligible compared to the benefits.
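
For illustration, a minimal sketch of that callback, assuming a usersched_needed flag that the dispatch/idle paths consume to activate the user-space scheduler (the flag name and the surrounding plumbing are assumptions, not the exact rustland code):

#include <scx/common.bpf.h>

/* Hypothetical flag checked elsewhere to trigger a user-space scheduler
 * activation. */
static volatile bool usersched_needed;

void BPF_STRUCT_OPS(rustland_stopping, struct task_struct *p, bool runnable)
{
        /*
         * A CPU is being released: speculate that another task is usually
         * ready to run and request an immediate user-space scheduler
         * activation, instead of waiting for the periodic timer (which can
         * introduce up to ~1 sec of lag).
         */
        usersched_needed = true;
}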

Last but not least, a small change that can be really useful for debugging:

scx_rustland: dump scheduler statistics before exiting

Print all the scheduler statistics before exiting. Reporting the very
last state of the scheduler can help debug events that could trigger
error conditions (such as page faults, scheduler congestion, etc.).

While at it, also fix some minor coding style issues (tabs vs spaces).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi
Collaborator Author

arighi commented Feb 8, 2024

...and I think we need to update https://github.com/sched-ext/sched_ext/tree/sched_ext-ci, otherwise the CI will fail because it's still using the old API.

dsq_id = dsq_to_cpu(cpu);
dsq_id = SHARED_DSQ;
else
dsq_id = cpu_to_dsq(cpu);
Contributor

In this path, it probably should always dispatch to SHARED_DSQ because there's no guarantee that there isn't going to be another racing set_cpumask() between the second bpf_cpumask_test_cpu() and the cancel/dispatch. That said, do you still see stalls if you simply do cancel as follows?

scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (!bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr))
        scx_bpf_dispatch_cancel();

Contributor

It'd also be useful to keep a counter for this event.

Collaborator Author

In this path, it probably should always dispatch to SHARED_DSQ because there's no guarantee that there isn't going to be another racing set_cpumask() between the second bpf_cpumask_test_cpu() and the cancel/dispatch. That said, do you still see stalls if you simply do cancel as follows?

scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (!bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr))
        scx_bpf_dispatch_cancel();

Hm.. that's true, I'm surprised that I haven't triggered this condition yet, even with massive stress-ng race-sched runs. But it definitely seems safer to just bounce the task to the shared DSQ. I'll change this.

About the stall: yes, if I just return after the cancel I can easily trigger stalls, and in the trace I can see that the stalling task is not dispatched to any DSQ.

Collaborator Author

It'd also be useful to keep a counter for this event.

This is a good idea, I'll add it!
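
For instance, something along these lines in the cancel/bounce path (the counter name here is just an assumption; the actual statistic may end up being named differently):

volatile u64 nr_cancel_dispatches; /* hypothetical name, read by user space */

/* ...in the cancel/bounce path: */
scx_bpf_dispatch_cancel();
__sync_fetch_and_add(&nr_cancel_dispatches, 1);
scx_bpf_dispatch(p, SHARED_DSQ, slice, enq_flags);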

Contributor

About the stall: yes, if I just return after the cancel I can easily trigger stalls, and in the trace I can see that the stalling task is not dispatched to any DSQ.

Hmm... that shouldn't be the case, because the task gets requeued after the cpumask is updated and thus should be re-dispatched. Once you land this PR, I'll test and see what's going on.

@htejun
Contributor

htejun commented Feb 8, 2024

Oh, lemme delete the ci branch. We can track the main branch as there's no API breakage.

@htejun
Contributor

htejun commented Feb 8, 2024

Oops, the branch is protected. Just updated it.

@arighi
Collaborator Author

arighi commented Feb 8, 2024

Oops, the branch is protected. Just updated it.

It looks like scx_bpf_dispatch_cancel() is still missing in the sched_ext-ci:

libbpf: extern (func ksym) 'scx_bpf_dispatch_cancel': not found in kernel or module BTFs

Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong
per-CPU DSQ, due to cpumask race conditions, and redirect them to the
shared DSQ.

This prevents dispatching tasks to CPUs that cannot be used according to
the task's cpumask.

With this applied the scheduler passed all the `stress-ng --race-sched`
stress tests.

Moreover, introduce a counter that is periodically reported to stdout as
an additional statistic, which can be helpful for debugging.

Link: sched-ext/sched_ext#135
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When the system is not fully utilized there may be delays in promptly
awakening the user-space scheduler.

This can happen, for example, when some CPU-intensive tasks are
constantly dispatched bypassing the user-space scheduler (e.g., using
SCX_DSQ_LOCAL) and other CPUs are completely idle.

Under this condition, update_idle() can fail to activate the
user-space scheduler, because there are no pending events, and only the
periodic timer will wake up the scheduler, potentially introducing lags
of up to 1 sec.

This can be reproduced, for example, by running a video game that doesn't
use all the CPUs available in the system (e.g., Team Fortress 2). With
this game it is pretty easy to notice sporadic lags that go away after
~1 sec, when the periodic timer kicks the scheduler.

To prevent this from happening, wake up the user-space scheduler as soon
as a CPU is released, speculating that most of the time there will be
another task ready to run.

This can introduce a little more overhead in the scheduler (due to
potential unnecessary wake-up events), but it also prevents stuttery
behavior and makes the system much smoother and more responsive,
especially with video games.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not
idle anymore; otherwise keep the CPU that was assigned by the BPF layer.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@htejun
Contributor

htejun commented Feb 8, 2024

It looks like scx_bpf_dispatch_cancel() is still missing in the sched_ext-ci:

Sorry, pushed to the wrong tree. Pushed now. Will give you write perm so that you can update it too.

@arighi arighi merged commit a4ff395 into main Feb 8, 2024
1 check passed
@Byte-Lab Byte-Lab deleted the rustland-fix-cpumask-stall branch March 14, 2024 18:20