
rustland: reduce scheduling overhead #272

Merged
merged 4 commits into main from rustland-reduce-scheduling-overhead on May 11, 2024

Conversation

@arighi arighi commented May 9, 2024

This PR contains a set of changes that I've been experimenting with for some time.

Given the optimized user-space -> kernel communication (#270), now seems like a good time to apply these changes as well and potentially elevate scx_rustland to a more production-ready scheduler.

With these changes in place we can reduce the CPU time used by the scheduler and improve responsiveness whether the system is under-committed or over-committed.

In terms of numbers, the improvement can be quantified as roughly a +4-5% boost in fps in games such as Baldur's Gate 3, Counter-Strike 2 and Cyberpunk 2077, with similar results in both the under-committed and over-committed scenarios.

Here is a short summary of the changes:

  • scx_rustland_core: implement effective time slice on a per-task basis: remove the global dynamic slice_ns and set a time slice on a per-task basis
  • scx_rustland_core: switch to FIFO when the system is underutilized: automatically switch to a simple FIFO when the scheduler is not really needed (under-committed system)
  • scx_rustland_core: refine built-in CPU idle selection logic: improve the BPF select_cpu() logic by differentiating between tasks waking up from a wait state and tasks not coming from a wait state
  • scx_rustland_core: optimize update_idle(): do not re-kick the current CPU when entering an idle state (since it seems to introduce more overhead than benefit with all the recent changes applied)

Edit: do not merge yet, because I would like to do more tests with the last change (the one about update_idle()), since it could regress some workloads.

Edit: dropped the update_idle() change, which actually seems to introduce big regressions with CPU-intensive workloads (making a kernel build much slower, for example).

Edit: update description with fps results.

Edit: temporarily dropped the FIFO mode change from this PR, since it requires more testing. The switch to FIFO mode makes the system quite unresponsive if a bunch of tasks are suddenly spawned and the scheduler doesn't react fast enough to exit FIFO mode. Even if it can provide a slight boost when the system is under-committed, for now it's preferable to maintain the scheduler's reliability at providing system responsiveness in all possible scenarios. I'll re-introduce the FIFO mode in a separate PR after more tweaking and testing.

@arighi arighi requested review from Byte-Lab and htejun May 9, 2024 10:13
@arighi arighi force-pushed the rustland-reduce-scheduling-overhead branch 2 times, most recently from a74ba86 to 6705495 Compare May 9, 2024 17:25
* scheduler and check if we need to switch to FIFO mode or regular
* scheduling.
*/
#define USERSCHED_TIMER_NS NSEC_PER_SEC
Contributor

It's just my intuition, but if the overhead isn't too high, it might be worthwhile to keep this duration at something people usually don't notice - say, something under 100ms. That's probably best decided with experiments.

Contributor Author

Yeah, I was also thinking about this; the overhead shouldn't be much, since we're mostly accessing local variables in the timer callback.

I think a potential issue with reducing the timer too much is that we may get a lot of false positives and start jumping back and forth between the two modes. In that case we may want to refine the moving average to better "smooth" out those false positives, because ideally we want to act promptly when exiting FIFO mode, while entering FIFO mode is not so critical.

I'll run more tests trying different values of USERSCHED_TIMER_NS, potentially also with a longer time window for the nr_active_avg moving average.
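For reference, a minimal sketch of the kind of smoothing being discussed, run from the periodic timer (the variable names and the 3/4 weight are illustrative assumptions, not the actual scx_rustland_core code):

/*
 * Illustrative only: keep an exponential moving average of the number of
 * active tasks and derive the FIFO decision from it, so that a shorter
 * USERSCHED_TIMER_NS period doesn't immediately translate into mode
 * flapping on short-lived spikes.
 */
static u64 nr_active_avg;   /* smoothed number of active tasks */
static bool fifo_enabled;   /* true when FIFO mode should be used */

static void update_fifo_state(u64 nr_running, u64 num_possible_cpus)
{
        /* EMA with a 3/4 weight on history: react slowly to spikes */
        nr_active_avg = (nr_active_avg * 3 + nr_running) / 4;

        /* Enable FIFO mode only when the system is clearly under-committed */
        fifo_enabled = nr_active_avg <= (num_possible_cpus >> 1);
}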

@@ -521,6 +590,16 @@ void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
return;
}

/*
* Dispatch directly to the shared DSQif the scheduler is set to FIFO
Contributor

A missing space between "DSQ" and "if".

nr_active = nr_active_avg;

/* Update flag that determines if FIFO needs to be enabled */
return nr_active <= (num_possible_cpus >> 1);
Contributor

I wonder whether having separate enable and disable thresholds would be useful so that a system which is sitting right on the edge doesn't keep flipping between the two modes. Or maybe that's okay?

Contributor Author

Hm... good point. The moving average is already trying to smooth nr_active a bit; adding a separate threshold on top of that could help reduce the risk of flipping between the two modes.
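To make the hysteresis idea concrete, a rough sketch with separate enter/exit thresholds could look like this (the threshold values and names are hypothetical, not the actual implementation):

/*
 * Hypothetical hysteresis: enter FIFO mode only when the system is well
 * below half capacity, but stay in it up to half capacity, so a system
 * sitting right on the edge doesn't flip between the two modes at every
 * timer tick.
 */
static bool update_fifo_mode(bool cur_fifo, u64 nr_active, u64 num_possible_cpus)
{
        const u64 enter_thresh = num_possible_cpus / 4;  /* conservative entry */
        const u64 exit_thresh = num_possible_cpus / 2;   /* prompt exit when busy */

        if (cur_fifo)
                return nr_active <= exit_thresh;  /* leave FIFO as soon as load grows */
        return nr_active <= enter_thresh;         /* enter FIFO only when clearly idle */
}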

Also, flipping between the two modes is not dramatic; the downside is that "FIFO" tasks get more priority than "non-FIFO" tasks, because the former are dispatched using SCX_DSQ_LOCAL. Now that I think about it, that's not really needed; they should actually be dispatched to the shared DSQ. I'll run some tests with this change.

Contributor Author
@arighi arighi May 9, 2024

Actually no... I take it back: using the shared DSQ would be bad, because tasks could be bounced to the first available CPU instead of running on their local one, so SCX_DSQ_LOCAL is better for keeping them on the CPU selected during select_cpu.

Edit: ...but, we can use the per-CPU DSQ (without using dispatch_task(), because we can't apply the whole scx_bpf_dispatch_cancel() logic in enqueue).

Contributor Author

Edit: ...but, we can use the per-CPU DSQ (without using dispatch_task(), because we can't apply the whole scx_bpf_dispatch_cancel() logic in enqueue).

Basically 4b2defc, which I like better than using SCX_DSQ_LOCAL, and it also seems to perform slightly better.
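In rough terms, the approach in that commit amounts to something like this in rustland_enqueue() (sketch only: fifo_enabled and cpu_to_dsq() are illustrative stand-ins for however the FIFO flag and per-CPU DSQ ids are handled in scx_rustland_core, assuming the usual scx BPF headers):

/*
 * Sketch: in FIFO mode, dispatch directly from .enqueue() to the per-CPU
 * DSQ of the CPU picked in .select_cpu(), instead of SCX_DSQ_LOCAL, so
 * the task keeps running on that CPU without jumping ahead of the tasks
 * already queued there.
 */
void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
{
        if (fifo_enabled) {
                s32 cpu = scx_bpf_task_cpu(p);  /* CPU chosen in select_cpu() */

                scx_bpf_dispatch(p, cpu_to_dsq(cpu), SCX_SLICE_DFL, enq_flags);
                return;
        }
        /* ... otherwise hand the task over to the user-space scheduler ... */
}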

@arighi arighi force-pushed the rustland-reduce-scheduling-overhead branch 2 times, most recently from 2ccc352 to abaf4be Compare May 9, 2024 19:57
arighi commented May 9, 2024

...

scx_rustland_core: optimize update_idle(): do not re-kick the current CPU when entering an idle state (since it seems to introduce more overhead than benefits with all the recent changes applied)
...
Edit: dropped the update_idle() change, which actually seems to introduce big regressions with CPU-intensive workloads (making a kernel build much slower, for example).

@htejun this change that I dropped sounds like a bad idea and a good idea at the same time.

Basically, if we don't re-kick the CPU from update_idle(), the CPUs in the system are not used at their full capacity, hence the regression with CPU-bound workloads. However, the scheduler then acts in a more non-work-conserving way, making it even more effective at prioritizing interactive workloads. In fact, I experience better responsiveness / better fps during the usual "play a game while recompiling the kernel" benchmark.

In terms of throughput this is bad, because we want to use the CPUs at their full capacity. But in terms of power consumption, this could be an interesting way of enforcing a "low power" mode, where the CPU throttling is applied mostly to the CPU-bound tasks (e.g., a kernel build) while the interactive tasks actually perform better.
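As a rough illustration of the knob being described (low_power is a hypothetical opt-in flag, not an existing option, and the callback body is a simplified sketch):

/*
 * Sketch: only re-kick the idle CPU when not in a hypothetical "low power"
 * mode, so CPU-bound tasks get implicitly throttled while interactive
 * tasks keep being scheduled promptly.
 */
const volatile bool low_power;  /* hypothetical opt-in knob */

void BPF_STRUCT_OPS(rustland_update_idle, s32 cpu, bool idle)
{
        if (!idle || low_power)
                return;

        /* Wake the CPU back up so it keeps consuming queued tasks */
        scx_bpf_kick_cpu(cpu, 0);
}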

In my specific case, I can apply this "low power" mode, keep building the kernel while gaming, and I won't notice any slowness / stuttering in the game.

To give some numbers, testing this scenario on my laptop I can go from ~15W to ~10W (roughly a 33% saving) without noticing any performance degradation in the game (the kernel build is noticeably slower, of course).

I'm probably missing something obvious, but it could be an intriguing aspect to explore further.

Andrea Righi added 2 commits May 10, 2024 17:24
Drop the global effective time-slice and use the more fine-grained
per-task time-slice to implement the dynamic time-slice capability.

This allows reducing the scheduler's overhead (dropping the global volatile
time-slice variable shared between user-space and BPF) and provides more
fine-grained control over each task's time slice.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
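As a sketch of the idea (the helper name and the task_slice_ns argument are illustrative; in practice the per-task slice comes from the user-space scheduler through scx_rustland_core's dispatch path):

/*
 * Sketch: instead of reading a global slice_ns shared with user-space,
 * dispatch each task with the time slice assigned to it by the
 * user-space scheduler.
 */
static void dispatch_with_task_slice(struct task_struct *p, u64 dsq_id,
                                     u64 task_slice_ns, u64 enq_flags)
{
        /* Fall back to the default slice if no per-task value was set */
        u64 slice = task_slice_ns ? task_slice_ns : SCX_SLICE_DFL;

        scx_bpf_dispatch(p, dsq_id, slice, enq_flags);
}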
Add a method to TopologyMap to get the number of online CPUs.

Considering that most schedulers do not handle CPU hotplugging, it can be
useful to also expose this metric, in addition to the number of available
CPUs in the system.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi arighi force-pushed the rustland-reduce-scheduling-overhead branch from abaf4be to ca32173 Compare May 10, 2024 19:50
arighi commented May 10, 2024

Alright, I pushed a few more changes:

  • reduced USERSCHED_TIMER_NS (to 250ms, which seems to be the optimal value according to the tests I've done)
  • changed the select_cpu logic to only distinguish between tasks waking up from a wait event vs tasks not coming from a wait (similar to what scx_lavd does in select_cpu)
  • refined the logic to switch to/from FIFO mode to be a little less "bouncy"

In terms of numbers, I can see some fps improvements (around +4-5%) with pretty much all the games I've tested (Baldur's Gate 3, Counter-Strike 2, Cyberpunk 2077, Terraria), in both the over-committed and under-committed scenarios.

There might still be some regressions and we can probably do better, but according to all the tests that I've done it looks like there are pretty consistent improvements.

I'll re-run another set of tests and, unless I see obvious issues, I'll probably merge this one.

Andrea Righi added 2 commits May 11, 2024 07:37
Change the BPF CPU selection logic as follows:

 - if the previously used CPU is idle, keep using it
 - if the task is not coming from a wait state, try to stick as much as
   possible to the same CPU (for better cache usage)
 - if the task is waking up from a wait state, rely on the sched_ext
   built-in idle selection logic

This logic can be completely disabled when the full user-space mode is
enabled. In this case tasks will always be assigned to the previously
used CPU and the user-space scheduler should take care of distributing
them among the available CPUs.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
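Sketched out, that selection logic looks roughly like the following (simplified: the wait-state check is shown as a hypothetical helper, and the full user-space mode path is omitted):

s32 BPF_STRUCT_OPS(rustland_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
        bool is_idle = false;

        /* 1) If the previously used CPU is idle, keep using it */
        if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
                return prev_cpu;

        /* 2) Not coming from a wait state: stick to the same CPU for cache locality */
        if (!task_is_waking_from_wait(p, wake_flags))  /* hypothetical helper */
                return prev_cpu;

        /* 3) Waking from a wait state: rely on the built-in idle selection logic */
        return scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);
}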
The comment that describes rustland_update_idle() is still incorrectly
reporting an old implementation detail. Update its description for better
clarity.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi arighi force-pushed the rustland-reduce-scheduling-overhead branch from ca32173 to 209c454 Compare May 11, 2024 05:38
arighi commented May 11, 2024

As reported in the description at the top, I've temporarily dropped the FIFO mode change from this PR. It can introduce regressions in some cases (like the system becoming unresponsive if a lot of tasks are suddenly spawned - e.g., stress-ng -c 128 - and the scheduler isn't fast enough at exiting FIFO mode).

This probably needs more testing and tweaking. I'll re-introduce this feature in a separate PR. The other changes can still provide benefits in the over-committed scenario.

arighi commented May 11, 2024

Moreover, since the FIFO mode is the only change I still wasn't 100% convinced about, and the other changes look pretty safe, I'm going to merge this PR now.

As mentioned in the previous comment, I will re-introduce the FIFO mode in another PR after doing more tests and experiments. I'm actually considering creating a separate rustland-next branch, so that I can easily share a more "experimental" branch of rustland with others.

@arighi arighi merged commit 5f9ce3b into main May 11, 2024
1 check passed
@arighi arighi deleted the rustland-reduce-scheduling-overhead branch May 11, 2024 08:17