
scx_rustland: fix cpumask stall and prevent stuttery behavior #132

Merged
merged 4 commits into main from rustland-fix-cpumask-stall
Feb 8, 2024

Conversation

arighi
Collaborator

@arighi arighi commented Feb 8, 2024

A set of improvements/fixes for rustland.

In particular:

scx_rustland: use scx_bpf_dispatch_cancel()

This uses the new scx_bpf_dispatch_cancel() API to prevent the cpumask-related stall conditions. I still can't completely ignore the task when the cpumask is not valid anymore, because otherwise there are cases where a task is never dispatched to any DSQ. But I can use the cancel operation to reliably redirect the task to the CPU assigned during select_cpu, or to the shared DSQ if the prev CPU is also not available. With this logic applied I haven't been able to stall the scheduler using stress-ng --race-sched, so it is definitely an improvement and a reasonable fix IMHO.
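
For reference, a minimal sketch of what this ends up looking like in the dispatch path (in its final form the fallback always goes to the shared DSQ, see the review discussion below; the helper names mirror the snippets quoted in this thread, and the exact surrounding code differs):

scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (dsq_id != SHARED_DSQ &&
    !bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr)) {
        /*
         * The task's cpumask changed after the target CPU was picked:
         * cancel the dispatch and bounce the task to the shared DSQ, so
         * that it always ends up in some DSQ and can't stall.
         */
        scx_bpf_dispatch_cancel();
        scx_bpf_dispatch(p, SHARED_DSQ, slice, enq_flags);
}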

This also allows enabling the following:

scx_rustland: keep default CPU selection when idle

This uses a more reasonable CPU assignment logic in the user-space scheduler, which redirects tasks to the shared DSQ only when the CPU assigned by the BPF part is not idle anymore.

I've also included a fix that prevents stuttery behavior (I can systematically reproduce this problem while playing Team Fortress 2):

scx_rustland: kick user-space scheduler when a CPU is released

The idea is to kick the user-space scheduler immediately in the .stopping() callback, speculating that most of the time there is another task ready to run. This prevents evident lag issues with Team Fortress 2 and also sporadic lags with Counter-Strike 2. The extra overhead added by potential unnecessary wake-up events seems to be negligible compared to the benefits.
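
For illustration, a minimal sketch of that callback, assuming a usersched_needed flag that the dispatch/idle paths consume to activate the user-space scheduler (the flag name and the surrounding plumbing are assumptions, not the exact rustland code):

#include <scx/common.bpf.h>

/* Hypothetical flag checked elsewhere to trigger a user-space scheduler
 * activation. */
static volatile bool usersched_needed;

void BPF_STRUCT_OPS(rustland_stopping, struct task_struct *p, bool runnable)
{
        /*
         * A CPU is being released: speculate that another task is usually
         * ready to run and request an immediate user-space scheduler
         * activation, instead of waiting for the periodic timer (which can
         * introduce up to ~1 sec of lag).
         */
        usersched_needed = true;
}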

Last but not least, a small change that can be really useful for debugging:

scx_rustland: dump scheduler statistics before exiting

Print all the scheduler statistics before exiting. Reporting the very
last state of the scheduler can help debug events that could trigger
error conditions (such as page faults, scheduler congestion, etc.).

While at it, also fix some minor coding style issues (tabs vs spaces).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi
Collaborator Author

arighi commented Feb 8, 2024

...and I think we need to update https://github.com/sched-ext/sched_ext/tree/sched_ext-ci, otherwise the CI will fail because it's still using the old API.

dsq_id = dsq_to_cpu(cpu);
dsq_id = SHARED_DSQ;
else
dsq_id = cpu_to_dsq(cpu);
Contributor

In this path, it probably should always dispatch to SHARED_DSQ because there's no guarantee that there isn't going to be another racing set_cpumask() between the second bpf_cpumask_test_cpu() and the cancel/dispatch. That said, do you still see stalls if you simply do cancel as follows?

scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (!bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr))
        scx_bpf_dispatch_cancel();

Contributor

It'd also be useful to keep a counter for this event.

Collaborator Author

In this path, it probably should always dispatch to SHARED_DSQ because there's no guarantee that there isn't going to be another racing set_cpumask() between the second bpf_cpumask_test_cpu() and the cancel/dispatch. That said, do you still see stalls if you simply do cancel as follows?

scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (!bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr))
        scx_bpf_dispatch_cancel();

Hm.. that's true, I'm surprised that I haven't triggered this condition yet, even with massive stress-ng race-sched runs. But it definitely seems safer to just bounce the task to the shared DSQ. I'll change this.

About the stall: yes, if I just return after the cancel I can easily trigger stalls, and in the trace I can see that the stalling task is not dispatched to any DSQ.

Collaborator Author

It'd also be useful to keep a counter for this event.

This is a good idea, I'll add it!
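
For instance, something along these lines in the cancel/bounce path (the counter name here is just an assumption; the actual statistic may end up being named differently):

volatile u64 nr_cancel_dispatches; /* hypothetical name, read by user space */

/* ...in the cancel/bounce path: */
scx_bpf_dispatch_cancel();
__sync_fetch_and_add(&nr_cancel_dispatches, 1);
scx_bpf_dispatch(p, SHARED_DSQ, slice, enq_flags);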

Contributor

About the stall: yes, if I just return after the cancel I can easily trigger stalls, and in the trace I can see that the stalling task is not dispatched to any DSQ.

Hmm... that shouldn't be the case, because the task gets requeued after the cpumask is updated and thus should be re-dispatched. Once you land this PR, I'll test and see what's going on.

@htejun
Contributor

htejun commented Feb 8, 2024

Oh, lemme delete the ci branch. We can track the main branch as there's no API breakage.

@htejun
Contributor

htejun commented Feb 8, 2024

Oops, the branch is protected. Just updated it.

@arighi
Collaborator Author

arighi commented Feb 8, 2024

Oops, the branch is protected. Just updated it.

It looks like scx_bpf_dispatch_cancel() is still missing in the sched_ext-ci:

libbpf: extern (func ksym) 'scx_bpf_dispatch_cancel': not found in kernel or module BTFs

Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong
per-CPU DSQ, due to cpumask race conditions, and redirect them to the
shared DSQ.

This prevents dispatching tasks to CPUs that cannot be used according to
the task's cpumask.

With this applied the scheduler passed all the `stress-ng --race-sched`
stress tests.

Moreover, introduce a counter that is periodically reported to stdout as
an additional statistic, which can be helpful for debugging.

Link: sched-ext/sched_ext#135
Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When the system is not fully utilized there may be delays in promptly
awakening the user-space scheduler.

This can happen, for example, when some CPU-intensive tasks are
constantly dispatched bypassing the user-space scheduler (e.g., using
SCX_DSQ_LOCAL) and other CPUs are completely idle.

Under this condition, update_idle() can fail to activate the
user-space scheduler, because there are no pending events, and only the
periodic timer will wake up the scheduler, potentially introducing lags
of up to 1 sec.

This can be reproduced, for example, by running a video game that doesn't
use all the CPUs available in the system (e.g., Team Fortress 2). With
this game it is pretty easy to notice sporadic lags that go away after
~1 sec, when the periodic timer kicks the scheduler.

To prevent this from happening, wake up the user-space scheduler as soon
as a CPU is released, speculating that most of the time there will be
another task ready to run.

This can introduce a little more overhead in the scheduler (due to
potential unnecessary wake-up events), but it also prevents stuttery
behavior and makes the system much smoother and more responsive,
especially with video games.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not
idle anymore; otherwise keep the CPU that was assigned by the BPF layer.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@htejun
Contributor

htejun commented Feb 8, 2024

It looks like scx_bpf_dispatch_cancel() is still missing in the sched_ext-ci:

Sorry, pushed to the wrong tree. Pushed now. Will give you write perm so that you can update it too.

@arighi arighi merged commit a4ff395 into main Feb 8, 2024
1 check passed
@Byte-Lab Byte-Lab deleted the rustland-fix-cpumask-stall branch March 14, 2024 18:20