scx_rustland: mitigate sub-optimal performance with offline CPUs #189
Most of the schedulers assume that the number of possible CPUs in the system represents the actual number of CPUs available.
This is not always true: some CPUs may be offline, or certain CPU models (AMD CPUs, for example) may include unavailable CPUs in this number.
This can lead to sub-optimal performance or even errors in the scheduler (see for example [1][2]).
Ideally, we should address this issue in a more generic way, such as a proper API provided by a C library that can be used by all schedulers and by the topology Rust module (scx_utils crate).
But for now, let's try to mitigate most of the common sub-optimal cases separately inside each scheduler.
For rustland we can apply some mitigations both in select_cpu() (for the BPF part) and in the user-space part:

- the former is fixed in the sched-ext kernel by commit 94dc0c01b957 ("scx: Use cpu_online_mask when resetting idle masks"). However, adding an extra check cpu < num_possible_cpus in select_cpu() makes it possible to properly support AMD CPUs, even with kernels that don't have the cpu_online_mask fix yet (this doesn't always guarantee the validity of cpu, but it should be enough to mitigate the majority of the potential sub-optimal cases, without introducing any significant overhead)

- the latter can be fixed by relying on topology.span(), instead of topology.nr_cpus(), to count the number of available CPUs in the system.
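
To illustrate the two mitigations described above, here is a minimal Rust sketch. It is not the code from this change and does not use the scx_utils API: parse_cpu_list() and cpu_in_range() are hypothetical helpers written only for this example. The first shows why counting the CPUs that are actually listed (for instance in the kernel's "0-3,8-11" CPU list format) can differ from assuming every id below the possible-CPU bound is usable; the second shows the kind of cheap range check that cpu < num_possible_cpus expresses.

```rust
/// Parse a kernel-style CPU list such as "0-3,8-11" into CPU ids.
/// Hypothetical helper for illustration only; not part of scx_utils.
fn parse_cpu_list(list: &str) -> Vec<usize> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',').filter(|p| !p.is_empty()) {
        // Each entry is either a single id ("5") or an inclusive range ("0-3").
        let mut bounds = part.splitn(2, '-');
        let start = bounds.next().and_then(|s| s.trim().parse::<usize>().ok());
        let end = match bounds.next() {
            Some(s) => s.trim().parse::<usize>().ok(),
            None => start,
        };
        // Entries that fail to parse are skipped rather than treated as errors.
        if let (Some(start), Some(end)) = (start, end) {
            cpus.extend(start..=end);
        }
    }
    cpus
}

/// Cheap range check in the spirit of `cpu < num_possible_cpus`.
/// It rejects out-of-range ids but does not prove the CPU is online.
fn cpu_in_range(cpu: i32, num_possible_cpus: usize) -> bool {
    cpu >= 0 && (cpu as usize) < num_possible_cpus
}
```

For example, with an online list of "0-3,8-11" the highest CPU id is 11, so a count derived from the id range would be 12, while parse_cpu_list() reports 8 usable CPUs. On Linux, such a list can be read from /sys/devices/system/cpu/online with std::fs::read_to_string.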
[1] sched-ext/sched_ext#69
[2] #147
Link: sched-ext/sched_ext@94dc0c0