This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: Update conditions for WAKE_SYNC migration #167

Merged
merged 1 commit into sched_ext from select_dfl_wake_sync on Mar 28, 2024

Conversation

@Byte-Lab (Collaborator) commented Mar 28, 2024

In scx_select_cpu_dfl(), we currently migrate the waking task to the CPU of the waker when all of the following hold (see the sketch after this list):

  1. WAKE_SYNC is specified in wake_flags
  2. There is at least one idle core in the system
  3. The wakee can run on the waker CPU
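
Concretely, the pre-patch check looks roughly like this (a simplified sketch: the condition lines mirror the diff quoted in the review below, while the cpumask_test_cpu() call and the cpu_found label are my assumption about the surrounding code):

  /* (1) sync wakeup; (2) at least one idle core; waker isn't exiting */
  if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
      !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING)) {
          cpu = smp_processor_id();
          /* (3) the wakee is allowed to run on the waker's CPU */
          if (cpumask_test_cpu(cpu, p->cpus_ptr))
                  goto cpu_found;
  }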

The assumption implicit in (2) is that the system is undersaturated, and that the wakee's runqueue delay would therefore not be impacted by migrating to the waker's CPU rather than to an idle core. This doesn't always hold in practice, though. Consider the following scenario:

  1. The system is overloaded, and at least one core becomes idle
  2. Some groups of pairs of tasks that communicate over IPC are spawned.
  3. Sender tasks are running on cores that still have enqueued tasks from when the system was overloaded, and they repeatedly wake wakee tasks with WAKE_SYNC.
  4. The waker tasks observe that the system is underloaded, and so conclude that it's optimal to migrate the wakee to their CPU, even though that CPU has a deep runqueue.

This can cause serious performance regressions for such workloads. For example, hackbench regresses by nearly 10x relative to EEVDF:

[1]+ ./scx_simple > /dev/null 2> /dev/null &
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 2.944
[root@virtme-ng bin]# fg
./scx_simple > /dev/null 2> /dev/null
^C
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.345

What we really want is to migrate to the waker's CPU only if nobody else is already enqueued there. This will cause tasks to fan out over any available idle CPUs while the waker's rq is overloaded, and then to start enjoying sync wakeups on the waker's CPU once load has been distributed and tasks are no longer piling up on a subset of cores.
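
In code, this amounts to replacing the idle-cpumask test with a check that the waker's local DSQ is empty, roughly (a sketch mirroring the diff in the review below):

  cpu = smp_processor_id();
  if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
      /* migrate only if nothing is already queued on the waker's CPU */
      cpu_rq(cpu)->scx.local_dsq.nr == 0 && !(current->flags & PF_EXITING)) {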

With this patch, the regression is addressed:

[root@virtme-ng bin]# ./scx_simple > /dev/null &
[1] 336
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.348
[root@virtme-ng bin]# fg
./scx_simple > /dev/null
^CEXIT: BPF scheduler unregistered
[root@virtme-ng bin]# hackbench --loops 1000
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 1000 messages of 100 bytes
Time: 0.352

@Byte-Lab requested a review from htejun on March 28, 2024 at 15:11
@htejun (Collaborator) left a comment

That analysis makes perfect sense. Thanks for fixing it. I think rusty has the same logic and should probably be updated together. I left a comment re. removal of the idle test. I think we should keep it. Otherwise, looks great.

  if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
-     !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING)) {
-         cpu = smp_processor_id();
+     cpu_rq(cpu)->scx.local_dsq.nr == 0 && !(current->flags & PF_EXITING)) {
@htejun (Collaborator)

The !cpumask_empty(idle_masks.cpu) test is there for fairness, as the wakee would get an unfair advantage if directly dispatched to the local dsq while the system is saturated. E.g. in scx_simple's case, the task would go to the local dsq when it should go to the end of the fallback dsq.

So, maybe the solution is to keep both conditions? The idle test for fairness, and the local dsq test so that we don't end up queueing multiple tasks on the same dsq?
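
Keeping both would make the check look something like this (a sketch only, not the final code; the ordering of the tests is arbitrary):

  cpu = smp_processor_id();
  if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
      !cpumask_empty(idle_masks.cpu) &&      /* fairness: only while there's idle capacity */
      cpu_rq(cpu)->scx.local_dsq.nr == 0 &&  /* don't stack wakees on an occupied local DSQ */
      !(current->flags & PF_EXITING)) {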

@Byte-Lab (Collaborator, Author)

Makes sense, I'll put the idle mask check back in and augment the comment a bit

   * local DSQ of the waker.
   */
+ cpu = smp_processor_id();
@htejun (Collaborator)

Oh, also, what does this relocation do?

@Byte-Lab (Collaborator, Author) commented Mar 28, 2024

We need it for the cpu_rq(cpu) call. Maybe it's a bit better to just use this_rq()?
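
(For context, not from the thread: in kernel/sched/sched.h the two spellings are equivalent on the local CPU, roughly

  #define cpu_rq(cpu)  (&per_cpu(runqueues, (cpu)))  /* rq of an arbitrary CPU */
  #define this_rq()    this_cpu_ptr(&runqueues)      /* rq of the current CPU */

so this_rq() would just be shorthand for cpu_rq(smp_processor_id()).)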

@htejun (Collaborator)

Oh, I mean, moving it out is completely fine. I just missed where it's used.

@Byte-Lab (Collaborator, Author) commented Mar 28, 2024

Applied @htejun's suggestion. Will fix rusty in a follow-up.

@Byte-Lab merged commit 2fc287e into sched_ext on Mar 28, 2024
1 check passed
@Byte-Lab deleted the select_dfl_wake_sync branch on March 28, 2024 at 17:58