rustland: enable preemption #235

Merged: 5 commits merged into main on Apr 23, 2024

Conversation

@arighi (Collaborator) commented Apr 22, 2024

Overview

Provide preemption capability in scx_rustland_core and use this feature in scx_rustland to improve the responsiveness of interactive tasks.

Design

For now the design is pretty simple:

  • scx_rustland_core provides a new dispatch flag RL_PREEMPT_CPU that is mapped to SCX_ENQ_PREEMPT
  • scx_rustland uses this flag when dispatching tasks that are classified as "interactive", so that they can preempt other tasks before their assigned time slice expires

This implies that interactive tasks can also preempt each other, potentially causing an excessive number of preemption events if a significant number of tasks is classified as "interactive". This can be improved in the future by limiting the number of preemption events per second, similar to what scx_lavd does.
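
A minimal user-space sketch of the idea, assuming illustrative names and flag values rather than the exact scx_rustland_core API:

```rust
// Illustrative constants and task descriptor; the real definitions live in
// scx_rustland_core and may use different names and bit values.
const RL_CPU_ANY: u64 = 1 << 0;     // run on the first CPU that becomes available
const RL_PREEMPT_CPU: u64 = 1 << 1; // BPF side dispatches with SCX_ENQ_PREEMPT

struct DispatchedTask {
    pid: i32,
    cpu: i32,
    cpumask_cnt: u64,
    slice_ns: u64,
    flags: u64,
}

// Interactive tasks get RL_PREEMPT_CPU, so they can kick out the task
// currently running on the target CPU instead of waiting for its time
// slice to expire.
fn set_dispatch_flags(task: &mut DispatchedTask, interactive: bool) {
    if interactive {
        task.flags |= RL_PREEMPT_CPU;
    }
}

fn main() {
    let mut task = DispatchedTask {
        pid: 1234,
        cpu: 0,
        cpumask_cnt: 0,
        slice_ns: 5_000_000, // 5 ms
        flags: RL_CPU_ANY,
    };
    set_dispatch_flags(&mut task, true);
    assert!((task.flags & RL_PREEMPT_CPU) != 0);
}
```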

Results

Measuring the performance of the usual benchmark "playing a video game while running a parallel kernel build in background" seems to give a 2-10% fps boost with preemption enabled, depending on the particular video game.

Results were obtained running a `make -j32` kernel build on an AMD Ryzen 7 5800X (8 cores, 16 GB RAM), while testing video games such as Baldur's Gate 3 (a solid +10% fps), Counter Strike 2 (around +5%) and Team Fortress 2 (+2%).

Moreover, some WebGL applications (such as https://webglsamples.org/aquarium/aquarium.html) seem to benefit even more with preemption enabled, providing up to a +15% fps boost.

@arighi requested review from Byte-Lab and htejun on April 22, 2024 10:27
@htejun (Contributor) left a comment:

LGTM. Overloading cpu field to carry flags seems a bit unnecessarily complicated tho. Adding another field wouldn't make any practical difference, right?

@multics69 (Contributor) commented:

Looks great to me. Simple and elegant!

It would be worth considering the worst-case behavior: When (almost) all tasks are interactive, will they preempt each other with little actual progress? It would require some sort of rate limiting.

Reserve some bits of the `cpu` attribute of a task to store special
dispatch flags.

Initially, let's introduce just RL_CPU_ANY to replace the special value
NO_CPU, indicating that the task can be dispatched on any CPU,
specifically the first CPU that becomes available.

This makes it possible to keep the CPU value assigned by the built-in
idle selection logic, which can potentially be used later for further
optimizations.

Moreover, being able to specify dispatch flags gives more flexibility
and allows mapping new scheduling features to such flags.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
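
For illustration only, the bit-packing described in this commit could look roughly like the sketch below; the actual bit layout in scx_rustland_core is not shown here, and this encoding was later replaced by a dedicated flags field (see the follow-up commit further down).

```rust
// Hypothetical layout: lower 20 bits hold the CPU id, upper bits hold
// dispatch flags such as RL_CPU_ANY. Purely illustrative values.
const CPU_MASK: u64 = 0x000f_ffff;
const RL_CPU_ANY: u64 = 1 << 20;

fn encode_cpu(cpu: i32, flags: u64) -> u64 {
    ((cpu as u32 as u64) & CPU_MASK) | flags
}

fn decode_cpu(encoded: u64) -> (i32, u64) {
    ((encoded & CPU_MASK) as i32, encoded & !CPU_MASK)
}

fn main() {
    // Dispatch on any CPU: keep the CPU hint from the idle selection logic
    // in the low bits, set RL_CPU_ANY in the high bits.
    let enc = encode_cpu(3, RL_CPU_ANY);
    let (cpu, flags) = decode_cpu(enc);
    assert_eq!((cpu, flags), (3, RL_CPU_ANY));
}
```
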
Introduce the new dispatch flag RL_PREEMPT_CPU that can be used to
dispatch tasks that can preempt others.

Tasks with this flag set will be dispatched by the BPF part using
SCX_ENQ_PREEMPT, so they can potentially preempt any other task running
on the target CPU.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
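
Expressed in Rust purely for illustration (the real logic lives in the BPF C part, and both flag values below are placeholders), the translation this commit describes amounts to:

```rust
const RL_PREEMPT_CPU: u64 = 1 << 1;   // illustrative value
const SCX_ENQ_PREEMPT: u64 = 1 << 32; // placeholder for the kernel flag

// Map the user-space dispatch flags to the enqueue flags handed to the
// kernel: tasks flagged with RL_PREEMPT_CPU are enqueued with
// SCX_ENQ_PREEMPT, everything else is enqueued normally.
fn enq_flags(dispatch_flags: u64) -> u64 {
    if (dispatch_flags & RL_PREEMPT_CPU) != 0 {
        SCX_ENQ_PREEMPT
    } else {
        0
    }
}

fn main() {
    assert_eq!(enq_flags(RL_PREEMPT_CPU), SCX_ENQ_PREEMPT);
    assert_eq!(enq_flags(0), 0);
}
```
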
Use the new scx_rustland_core dispatch flag RL_PREEMPT_CPU to allow
interactive tasks to preempt other tasks with scx_rustland.

If the built-in idle selection logic is enforced (option `-i`), the
scheduler prioritizes keeping tasks on the target CPU designated by this
logic. With preemption enabled, these tasks have a higher likelihood of
reusing their cached working set, potentially improving performance.

Alternatively, when tasks are dispatched to the first available CPU
(default behavior), interactive tasks benefit from running more promptly
by kicking out other tasks before their assigned time slice expires.

This potentially makes it possible to increase the default time slice
in the future, improving the overall throughput of the system while
still maintaining a good level of responsiveness, because interactive
tasks can now run almost immediately, independently of the remaining
time slice of the other tasks contending for the CPUs in the system.

= Results =

Measuring the performance of the usual benchmark "playing a video game
while running a parallel kernel build in background" seems to give a
2-10% fps boost with preemption enabled, depending on the particular
video game.

Results were obtained running a `make -j32` kernel build on an AMD
Ryzen 7 5800X (8 cores, 16 GB RAM), while testing video games such as
Baldur's Gate 3 (a solid +10% fps), Counter Strike 2 (around +5%) and
Team Fortress 2 (+2%).

Moreover, some WebGL applications (such as
https://webglsamples.org/aquarium/aquarium.html) seem to benefit even
more with preemption enabled, providing up to a +15% fps boost.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

Provide a run-time option to disable task preemption.

This option can be used to improve the throughput of CPU-intensive
tasks while still providing a good level of responsiveness in the
system.

By default preemption is enabled, to provide a higher level of
responsiveness to interactive tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

@arighi (Collaborator, Author) commented Apr 23, 2024

> LGTM. Overloading cpu field to carry flags seems a bit unnecessarily complicated tho. Adding another field wouldn't make any practical difference, right?

I agree. Moreover, adding another u64 increases the struct from 24 bytes to 32 bytes, which might actually be better in terms of cacheline usage / performance:

Before:

struct dispatched_task_ctx {
	s32                        pid;                  /*     0     4 */
	s32                        cpu;                  /*     4     4 */
	u64                        cpumask_cnt;          /*     8     8 */
	u64                        slice_ns;             /*    16     8 */

	/* size: 24, cachelines: 1, members: 4 */
	/* last cacheline: 24 bytes */
};

After:

struct dispatched_task_ctx {
	s32                        pid;                  /*     0     4 */
	s32                        cpu;                  /*     4     4 */
	u64                        cpumask_cnt;          /*     8     8 */
	u64                        slice_ns;             /*    16     8 */
	u64                        flags;                /*    24     8 */

	/* size: 32, cachelines: 1, members: 5 */
	/* last cacheline: 32 bytes */
};

I'll change this and update the PR. Thanks!

@arighi (Collaborator, Author) commented Apr 23, 2024

> Looks great to me. Simple and elegant!
>
> It would be worth considering the worst-case behavior: When (almost) all tasks are interactive, will they preempt each other with little actual progress? It would require some sort of rate limiting.

Right, rate-limiting the RL_PREEMPT_CPU dispatches seems the easiest way to mitigate potential storms of unnecessary preemption events, and it's probably good enough in practice. That's the first improvement I'm planning to do, and I'll try to reuse some of the logic from scx_lavd. :)
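
A minimal sketch of what such rate limiting could look like, assuming a simple fixed one-second window (not the scx_lavd approach, just an illustration):

```rust
use std::time::{Duration, Instant};

// Allow at most `max_per_sec` preemption dispatches per one-second window;
// once the budget is used up, tasks fall back to a regular dispatch.
struct PreemptLimiter {
    max_per_sec: u32,
    used: u32,
    window_start: Instant,
}

impl PreemptLimiter {
    fn new(max_per_sec: u32) -> Self {
        Self { max_per_sec, used: 0, window_start: Instant::now() }
    }

    // Returns true if a preemption dispatch is still allowed in the
    // current window.
    fn allow_preempt(&mut self) -> bool {
        if self.window_start.elapsed() >= Duration::from_secs(1) {
            // Start a new window and reset the budget.
            self.window_start = Instant::now();
            self.used = 0;
        }
        if self.used < self.max_per_sec {
            self.used += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limiter = PreemptLimiter::new(2);
    assert!(limiter.allow_preempt());
    assert!(limiter.allow_preempt());
    // Third preemption within the same second is denied.
    assert!(!limiter.allow_preempt());
}
```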

Do not encode dispatch flags in the cpu field, but simply use a separate
"flags" field.

This makes the code much simpler and increases the size of
dispatched_task_ctx from 24 to 32 bytes, which is probably better in
terms of cacheline allocation / performance.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

@arighi merged commit a8226f0 into main on Apr 23, 2024
1 check passed
@arighi deleted the rustland-preemption branch on April 23, 2024 15:20