scx_rustland: fix cpumask stall and prevent stuttery behavior #132
Conversation
Print all the scheduler statistics before exiting. Reporting the very last state of the scheduler can help to debug events that could trigger error conditions (such as page faults, scheduler congestion, etc.). While at it, also fix some minor coding-style issues (tabs vs. spaces). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
...and I think we need to update https://github.com/sched-ext/sched_ext/tree/sched_ext-ci, otherwise the CI will fail because it's still using the old API.
```diff
-	dsq_id = dsq_to_cpu(cpu);
+	dsq_id = SHARED_DSQ;
+else
+	dsq_id = cpu_to_dsq(cpu);
```
In this path, it probably should always dispatch to `SHARED_DSQ`, because there's no guarantee that there won't be another racing `set_cpumask()` between the second `bpf_cpumask_test_cpu()` and the cancel/dispatch. That said, do you still see stalls if you simply do cancel as follows?

```c
scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
if (!bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr))
	scx_bpf_dispatch_cancel();
```
It'd also be useful to keep a counter for this event.
> In this path, it probably should always dispatch to `SHARED_DSQ`, because there's no guarantee that there won't be another racing `set_cpumask()` between the second `bpf_cpumask_test_cpu()` and the cancel/dispatch. That said, do you still see stalls if you simply do cancel as follows?
> ```c
> scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
> if (!bpf_cpumask_test_cpu(dsq_to_cpu(dsq_id), p->cpus_ptr))
> 	scx_bpf_dispatch_cancel();
> ```
Hm... that's true, I'm surprised that I haven't triggered this condition yet, even with massive `stress-ng --race-sched` runs. But it definitely seems safer to just bounce the task to the shared DSQ. I'll change this.

About the stall: yes, if I just return after cancel I can easily trigger stalls, and in the trace I can see that the stalling task is never dispatched to any DSQ.
> It'd also be useful to keep a counter for this event.
This is a good idea, I'll add it!
> About the stall: yes, if I just return after cancel I can easily trigger stalls, and in the trace I can see that the stalling task is never dispatched to any DSQ.
Hmm... that shouldn't be the case, because the task gets requeued after the cpumask is updated, which should re-dispatch it. Once you land this PR, I'll test and see what's going on.
Oh, lemme delete the ci branch. We can track the main branch as there's no API breakage.
Oops, the branch is protected. Just updated it.
Force-pushed from 2cc3934 to 3faabde
Use scx_bpf_dispatch_cancel() to invalidate dispatches to the wrong per-CPU DSQ, due to cpumask race conditions, and redirect them to the shared DSQ. This prevents dispatching tasks to CPUs that cannot be used according to the task's cpumask. With this applied the scheduler passed all the `stress-ng --race-sched` stress tests. Moreover, introduce a counter that is periodically reported to stdout as an additional statistic, which can be helpful for debugging. Link: sched-ext/sched_ext#135 Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
When the system is not fully utilized there may be delays in promptly awakening the user-space scheduler. This can happen, for example, when some CPU-intensive tasks are constantly dispatched bypassing the user-space scheduler (e.g., using SCX_DSQ_LOCAL) and other CPUs are completely idle. Under this condition update_idle() can fail to activate the user-space scheduler, because there are no pending events, and only the periodic timer will wake up the scheduler, potentially introducing lags of up to 1 sec. This can be reproduced, for example, by running a video game that doesn't use all the CPUs available in the system (e.g., Team Fortress 2). With this game it is pretty easy to notice sporadic lags that resolve after ~1 sec, due to the periodic timer kicking the scheduler. To prevent this from happening, wake up the user-space scheduler immediately as soon as a CPU is released, speculating on the fact that most of the time there will always be another task ready to run. This can introduce a little more overhead in the scheduler (due to potentially unnecessary wake-up events), but it also prevents stuttery behavior and makes the system much smoother and more responsive, especially with video games. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Dispatch to the shared DSQ (NO_CPU) only when the assigned CPU is not idle anymore, otherwise maintain the same CPU that has been assigned by the BPF layer. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Force-pushed from 3faabde to 8e47602
Sorry, pushed to the wrong tree. Pushed now. Will give you write perm so that you can update it too.
A set of improvements/fixes for rustland. In particular:

scx_rustland: use scx_bpf_dispatch_cancel()

This uses the new `scx_bpf_dispatch_cancel()` API to prevent the cpumask-related stall conditions. I still can't completely ignore the task when the cpumask is not valid anymore, because there are cases where a task is never dispatched to any DSQ. But I can use the cancel operation to reliably redirect the task to the CPU assigned during select_cpu, or to the shared DSQ if the prev CPU is also not available. With this logic applied I haven't been able to stall the scheduler using `stress-ng --race-sched`, so in any case it is definitely an improvement and a reasonable fix IMHO. And this also allows enabling the following:

scx_rustland: keep default CPU selection when idle

This uses a more reasonable CPU-assignment logic in the user-space scheduler, which redirects tasks to the shared DSQ only when the CPU assigned by the BPF part is not idle anymore.

I've also included a fix that prevents stuttery behavior (I can systematically reproduce this problem playing Team Fortress 2):

scx_rustland: kick user-space scheduler when a CPU is released

The idea is to kick the user-space scheduler immediately in the `.stopping()` callback, speculating on the fact that most of the time there's always another task ready to run. This prevents evident lag issues with Team Fortress 2 and also sporadic lags with Counter-Strike 2. The extra overhead added due to potentially unnecessary wake-up events seems to be negligible compared to the benefits.

Last but not least, a small change that can be really useful for debugging:

scx_rustland: dump scheduler statistics before exiting