scx-rustland: SMT improvements #94

arighi · 2024-01-17T16:32:22Z

This PR contains an improvement in the SMT logic, it's actually a simplification and, performance-wise, it seem to deliver better results, so it's a win-win IMHO.

As mentioned in the git commit:

    Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled):
    running my usual benchmark of measuring the fps of a videogame
    (Counter-Strike 2) during a parallel kernel build-induced system
    overload, shows an improvement of approximately 2x (from 8-10fps to
    15-25fps vs 1-2fps with EEVDF).

The other commit is more like a clean-up, apparently now we can drop the scx_bpf_kick_cpu() from update_idle() without any noticeable performance drop, so let's just get rid of it.

htejun · 2024-01-17T16:49:05Z

scheds/rust/scx_rustland/src/bpf/main.bpf.c

-		 * Kick the CPU to make it immediately ready to accept
-		 * dispatched tasks.
-		 */
-		scx_bpf_kick_cpu(cpu, 0);


Hmm... I wonder why this is. It might be useful to add an option to re-enable this, so that if ppl report stalls, they can try the option as a debug step.

Sure, looks like a good idea, and I'd also like to bisect it a bit and figure out when exactly this stopped being useful.

'Hm... even with v0.1.5 dropping scx_bpf_kick_cpu() doesn't seem to make any difference, any chance that maybe something changed in the kernel? (Going before v0.1.5 is a bit tricky because of the ABI change)

I'm also wondering if the fact that now I'm always dispatching from .dispatch() with SCX_DSQ_LOCAL_ON | cpu (using the cpu selected by the user-space scheduler) implicitly kicks the target cpu, therefore the kick in .update_idle() is redundant.

Kernel side hasn't changed. Yeah, LOCAL_ON|cpu will wake up the CPU.

ok, I think everything makes sense now. I was using SCX_DSQ_GLOBAL for the scheduler and it wasn't waking up without the explicit kick cpu. Then I changed the scheduler to also use LOCAL_ON|cpu, so if that implicitly kicks the cpu we don't need the kick in update_idle anymore.

Sounds great. No need to add the option then. That sounds like it should be reliable. Thanks.

I'll update the commit message, so it's more clear why we dropped that.

Prior to commit 676bd88 ("bpf_rustland: do not dispatch the scheduler to the global DSQ"), the user-space scheduler was dispatched using SCX_DSQ_GLOBAL and we needed to explicitly kick idle CPUs from update_idle() to ensure that at least one CPU was available to run the user-space scheduler. Now that we are using SCX_DSQ_LOCAL_ON|cpu to dispatch the user-space scheduler, the target CPU is implicitly kicked. Therefore, the call to scx_bpf_kick_cpu() within .update_idle() becomes redundant and we can get rid of it. Fixes: 676bd88 ("bpf_rustland: do not dispatch the scheduler to the global DSQ") Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

The user-space scheduler dispatches tasks in batches, with the batch size matching the number of idle CPUs. Commit 791bdbe ("scx_rustland: introduce SMT support") changed the order of idle CPUs, prioritizing dispatching tasks on the least busy cores (those with the most idle CPUs) before moving on to busier cores (those with the least idle CPUs). While this approach works well for a small number of tasks, it can lead to uneven performance as the number of tasks increases and all cores are saturated. Such uneven performance can be attributed to SMT interactions causing potential short lags and erratic system performance. In some cases, disabling SMT entirely results in better system responsiveness. To address this issue, instruct the scheduler to implicitly disable SMT and consistently dispatch tasks only on the first (or last) CPU of each core. This approach ensures an equal distribution of tasks among the available cores, preventing SMT disturbances and aligning with non-SMT performance, also when a significant amount of tasks are running. Additionally, the unused sibling CPUs within each core can be used as "spare" CPUs for the BPF dispatcher. This is particularly beneficial for tasks that cannot be dispatched on the target CPU selected by the scheduler, due to cpumask restrictions or congestion conditions. Therefore, this new approach allows to enhance system responsiveness on SMT systems, while simultaneously improving scheduler stability. Some preliminary results on an AMD Ryzen 7 5800X 8-Cores (SMT enabled): running my usual benchmark of measuring the fps of a videogame (Counter-Strike 2) during a parallel kernel build-induced system overload, shows an improvement of approximately 2x (from 8-10fps to 15-25fps vs 1-2fps with EEVDF). Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

htejun approved these changes Jan 17, 2024

View reviewed changes

arighi added 2 commits January 17, 2024 20:49

arighi force-pushed the scx-rustland-smt-improvements branch from 770e9b5 to be1cb87 Compare January 17, 2024 19:49

arighi merged commit 24ef0f6 into main Jan 17, 2024
1 of 2 checks passed

Byte-Lab deleted the scx-rustland-smt-improvements branch March 14, 2024 18:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scx-rustland: SMT improvements #94

scx-rustland: SMT improvements #94

arighi commented Jan 17, 2024

htejun Jan 17, 2024

arighi Jan 17, 2024

arighi Jan 17, 2024

arighi Jan 17, 2024

htejun Jan 17, 2024

arighi Jan 17, 2024

htejun Jan 17, 2024

arighi Jan 17, 2024

scx-rustland: SMT improvements #94

scx-rustland: SMT improvements #94

Conversation

arighi commented Jan 17, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment