rustland: reduce scheduling overhead #272
Force-pushed from a74ba86 to 6705495.
```c
 * scheduler and check if we need to switch to FIFO mode or regular
 * scheduling.
 */
#define USERSCHED_TIMER_NS NSEC_PER_SEC
```
It's just my intuition, but if the overhead isn't too high, it might be worthwhile to try to keep this duration something which people usually don't notice - say, something under 100ms. Something which is best decided with experiments, probably.
Yeah, I was also thinking about this; the overhead shouldn't be that much, since we're mostly accessing local variables in the timer callback.

I think a potential issue of reducing the timer period too much is that we may get a lot of false positives and start jumping back and forth into FIFO mode. In that case we may want to refine the moving average to better "smooth" the potential false positives, because ideally we want to act promptly when exiting FIFO mode, while entering FIFO mode is not so critical.

I'll run more tests trying different values of `USERSCHED_TIMER_NS`, potentially also with a longer time window for the `nr_active_avg` moving average.
```c
@@ -521,6 +590,16 @@ void BPF_STRUCT_OPS(rustland_enqueue, struct task_struct *p, u64 enq_flags)
		return;
	}

	/*
	 * Dispatch directly to the shared DSQif the scheduler is set to FIFO
```
A missing space between "DSQ" and "if".
```c
	nr_active = nr_active_avg;

	/* Update flag that determines if FIFO needs to be enabled */
	return nr_active <= (num_possible_cpus >> 1);
```
I wonder whether having separate enable and disable thresholds would be useful so that a system which is sitting right on the edge doesn't keep flipping between the two modes. Or maybe that's okay?
Hm... good point. There's the moving average that is trying to smooth `nr_active` a bit; maybe adding a separate threshold as well can help reduce the risk of flipping between the two modes.

Also, flipping between the two modes is not dramatic; the downside is that "FIFO" tasks will get more priority than the "non-FIFO" tasks, because the former are dispatched using `SCX_DSQ_LOCAL`. Now that I think about it, that's not really needed; they should actually be dispatched to the shared DSQ. I'll run some tests with this change.
Actually no... I take it back: using the shared DSQ would be bad, because tasks could be bounced to the first available CPU instead of running on their local one, so `SCX_DSQ_LOCAL` is better to keep them running on the CPU selected during `select_cpu()`.

Edit: ...but we can use the per-CPU DSQ (without using `dispatch_task()`, because we can't apply the whole `scx_bpf_dispatch_cancel()` logic in `enqueue()`).
> Edit: ...but we can use the per-CPU DSQ (without using `dispatch_task()`, because we can't apply the whole `scx_bpf_dispatch_cancel()` logic in `enqueue()`).

Basically 4b2defc, which I like better than using `SCX_DSQ_LOCAL`, and it also seems to perform slightly better.
Force-pushed from 2ccc352 to abaf4be.
...
@htejun this change that I dropped sounds like a bad idea and a good idea at the same time. Basically, if we don't re-kick the CPU from `update_idle()`, in terms of throughput this is bad, because we want to use the CPUs at their full capacity. But in terms of power consumption, this could be an interesting way of enforcing a "low power" mode, where the CPU throttling is applied mostly to the CPU-bound tasks (e.g., a kernel build) while the interactive tasks actually perform better.

In my specific case, I can apply this "low power" mode, keep building the kernel while gaming, and I won't notice any slowness / stuttering in the game. To give some numbers, testing this scenario on my laptop, I can go from ~15W to ~10W (roughly a 33% saving) without noticing any performance degradation in the game (the kernel build is noticeably slower, of course).

I'm probably missing something obvious, but it could be an intriguing aspect to explore further.
Drop the global effective time-slice and use the more fine-grained per-task time-slice to implement the dynamic time-slice capability. This allows reducing the scheduler's overhead (dropping the global time-slice volatile variable shared between user space and BPF) and provides more fine-grained control over the per-task time slice.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Add a method to TopologyMap to get the number of online CPUs. Considering that most of the schedulers do not handle CPU hotplugging, it can be useful to expose this metric as well, in addition to the number of available CPUs in the system.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Force-pushed from abaf4be to ca32173.
Alright, I pushed a few more changes:

In terms of numbers I can see improvements of around +4-5% in fps with pretty much all the games that I've tested (Baldur's Gate 3, Counter-Strike 2, Cyberpunk 2077, Terraria), in both the over-commissioned and under-commissioned scenarios. There might still be some regressions and we can probably do better, but according to all the tests that I've done there are pretty consistent improvements. I'll re-run another set of tests and, unless I see obvious issues, I'll probably merge this one.
Change the BPF CPU selection logic as follows:

- if the previously used CPU is idle, keep using it
- if the task is not coming from a wait state, try to stick as much as possible to the same CPU (for better cache usage)
- if the task is waking up from a wait state, rely on the sched_ext built-in idle selection logic

This logic can be completely disabled when the full user-space mode is enabled. In this case tasks will always be assigned to the previously used CPU and the user-space scheduler should take care of distributing them among the available CPUs.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The comment that describes rustland_update_idle() still incorrectly reports an old implementation detail. Update its description for better clarity.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Force-pushed from ca32173 to 209c454.
As reported in the description at the top, I've temporarily dropped the FIFO mode change from this PR. It can introduce potential regressions in some cases (like the system becoming unresponsive if a lot of tasks are suddenly spawned). This probably needs more testing and tweaking; I'll re-introduce this feature in a separate PR. The other changes can still provide benefits for the over-commissioned scenario.
Moreover, since the only change I still wasn't 100% convinced about is the FIFO mode, I'm going to merge this PR now; the other changes look pretty safe. As mentioned in the previous comment, I will re-introduce the FIFO mode in another PR after doing more tests and experiments. I'm actually considering creating a separate ...
This PR contains a set of changes that I've been experimenting with for some time.

Given the optimized user-space → kernel communication (#270), it seems like a good time to also apply these changes and potentially elevate scx_rustland to a more production-ready scheduler.

With these changes in place we can reduce the CPU time used by the scheduler and improve responsiveness, whether the system is under- or over-commissioned.

In terms of numbers, the improvement can be quantified as roughly a +4-5% boost in fps in games such as Baldur's Gate 3, Counter-Strike 2 and Cyberpunk 2077 when the system is over-commissioned. The same results can be obtained in both the under-commissioned and over-commissioned scenarios.

Here is a short summary of the changes:

- `scx_rustland_core: implement effective time slice on a per-task basis`: remove the global dynamic `slice_ns` and set a time slice on a per-task basis
- `scx_rustland_core: switch to FIFO when system is underutilized`: automatically switch to a simple FIFO when the scheduler is not really needed (under-commissioned system)
- `scx_rustland_core: refine built-in CPU idle selection logic`: improve the BPF `select_cpu()` logic, differentiating tasks waking up from a wait state vs tasks not coming from a wait state
- `scx_rustland_core: optimize update_idle()`: do not re-kick the current CPU when entering an idle state (since it seems to introduce more overhead than benefits with all the recent changes applied)

Edit: do not merge yet, because I would like to do more tests with the last change (the one about `update_idle()`), since it could regress some workloads.

Edit: dropped the `update_idle()` change, which actually seems to introduce big regressions with CPU-intensive workloads (making a kernel build much slower, for example).

Edit: updated description with fps results.

Edit: temporarily dropped the FIFO mode change from this PR, since it requires more testing. The switch to FIFO mode makes the system quite unresponsive if a bunch of tasks are suddenly spawned and the scheduler doesn't react fast enough to exit FIFO mode. Even if it can provide a slight boost when the system is under-commissioned, it's preferable for now to maintain the scheduler's reliability at providing system responsiveness in all possible scenarios. I'll re-introduce the FIFO mode in a separate PR after more tweaking and testing.