This repository has been archived by the owner on Jun 18, 2024. It is now read-only.

scx: Pick an idle cpu when select runqueue with WF_EXEC #221

Closed · wants to merge 1 commit

Conversation

@vax-r (Contributor) commented Jun 13, 2024

Summary

If the caller of select_task_rq_scx() calls with WF_EXEC, it's a valuable balancing opportunity to perform task migration. Although the runqueue selection can't dictate where the task is ultimately going to run, we can still take advantage of this opportunity by selecting an idle core rather than doing nothing.

Note

This might be an over-cautious concern for the current implementation of sched_ext, since there is no caller like sched_exec() under kernel/sched/core.c that calls select_task_rq() with WF_EXEC. However, the change is still worth considering, and a proper caller with WF_EXEC could perhaps be added.


Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
@Byte-Lab (Collaborator) left a comment


Thanks for the patch! Left a comment for some discussion.

-	if (unlikely(wake_flags & WF_EXEC))
-		return prev_cpu;
+	if (unlikely(wake_flags & WF_EXEC)) {
+		int idle_cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
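The added branch in the hunk above is cut off. A minimal sketch of how it might continue (not necessarily the author's exact code), assuming scx_pick_idle_cpu() returns a negative value when it can't claim an idle core:

	if (unlikely(wake_flags & WF_EXEC)) {
		int idle_cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);

		/* fall back to the previous CPU if no idle core was found */
		return idle_cpu >= 0 ? idle_cpu : prev_cpu;
	}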
Collaborator

Hmm, if we're going to do this we should probably use scx_select_cpu_dfl(). But we have to be a bit careful here: not every scheduler will use built-in idle tracking, so that may actually cause a scheduler that implements ops.update_idle() (and doesn't specify SCX_OPS_KEEP_BUILTIN_IDLE) to be evicted. Maybe we want something like this:

if (unlikely(wake_flags & WF_EXEC)) {
        if (static_branch_likely(&scx_builtin_idle_enabled)) {
                bool found;

                return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
        } else {
                return prev_cpu;
        }
}
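For context, here is a minimal, hypothetical sketch (not any in-tree scheduler) of the kind of BPF scheduler that caution is about: it implements ops.update_idle() and does not set SCX_OPS_KEEP_BUILTIN_IDLE, so built-in idle tracking is disabled and an unconditional kernel-side call into scx_select_cpu_dfl()/scx_pick_idle_cpu() would trip the built-in-idle check. It assumes the usual scx BPF conventions (common.bpf.h, BPF_STRUCT_OPS, SEC(".struct_ops.link")):

#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

/* Track idle state in the scheduler's own data structures instead of the
 * kernel's built-in idle cpumasks. */
void BPF_STRUCT_OPS(myidle_update_idle, s32 cpu, bool idle)
{
	/* e.g. update a per-CPU map entry recording @cpu's idle state */
}

SEC(".struct_ops.link")
struct sched_ext_ops myidle_ops = {
	.update_idle	= (void *)myidle_update_idle,
	/* SCX_OPS_KEEP_BUILTIN_IDLE deliberately not set in .flags */
	.name		= "myidle",
};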

A couple more thoughts:

The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite some performance implications, so that we never skip running ops.select_cpu() on the wakeup path.

So if we decide to do this, we might also want to introduce a new op flag for this which controls whether it's enabled (e.g. SCX_OPS_AUTO_IDLE_EXEC) so that implementations can choose to opt in or out of letting the core scheduler reserve idle CPUs on the exec path.
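As a rough illustration of that opt-in idea (SCX_OPS_AUTO_IDLE_EXEC is purely hypothetical, and the flag check just mirrors how other SCX_OPS_* flags are tested in kernel/sched/ext.c), the exec-path branch could look something like:

if (unlikely(wake_flags & WF_EXEC)) {
	/* Only reserve an idle CPU if the scheduler opted in and
	 * built-in idle tracking is available. */
	if ((scx_ops.flags & SCX_OPS_AUTO_IDLE_EXEC) &&
	    static_branch_likely(&scx_builtin_idle_enabled)) {
		bool found;

		return scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
	}
	return prev_cpu;
}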

@htejun wdyt?

Collaborator

Yes, at WF_EXEC the cache footprint is low; however, unlike in the kernel scheduler, migration is often frequent and easy for SCX scheds. There is no built-in resistance to migration in SCX, so the benefit of an extra migration opportunity is questionable. Introducing unconditional migration to an idle CPU as the default behavior therefore doesn't make much sense; it's likely to just add unnecessary migration overhead in the exec path.

However, if an SCX scheduler implements high CPU stickiness the way the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying. If such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.

Collaborator

Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler does too. Imagine a scheduler with high CPU stickiness and two CPUs: a task is exec'ing on the CPU that is less loaded overall, while the other, more loaded CPU happens to be idle at that exact moment. If the task gets migrated to that more loaded CPU, it will just create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.

Collaborator

> However, if an SCX scheduler implements high CPU stickiness the way the built-in scheduler does, the cheap migration opportunity may be interesting, which is what the comment above is saying. If such cases arise, we can add another operation to ask the BPF scheduler whether it wants to preempt itself so that it goes through the scheduling path again.

Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.

Contributor Author

> A couple more thoughts:
>
> The potential downside of doing this is that the core scheduler is now doing stuff in the background that the BPF scheduler doesn't see, which could lead to confusing observations or unwanted behavior. For example, perhaps a BPF scheduler explicitly doesn't want to pick a new idle CPU here so that it can keep power consumption low. We already disable built-in behavior elsewhere in the kernel to give the BPF scheduler more control, such as disabling TTWU_QUEUE when sched_ext is enabled, despite some performance implications, so that we never skip running ops.select_cpu() on the wakeup path.

Interesting. So the ext scheduling class is isolated from the rest of the scheduling classes like fair, deadline and so on; I suppose it should at least have the same priority level as the normal class? If that's the case, then when we load a BPF scheduler into the kernel, are sched_entities in the kernel still scheduled by the default scheduler (CFS/EEVDF)?

Contributor Author

> Also note that an ops flag won't do here. It's highly unlikely that you'd always want to migrate to an idle CPU; there's no scenario where that's always a win. The scheduler must evaluate whether each relatively low-cost migration opportunity is worth taking, which is what the kernel scheduler does too. Imagine a scheduler with high CPU stickiness and two CPUs: a task is exec'ing on the CPU that is less loaded overall, while the other, more loaded CPU happens to be idle at that exact moment. If the task gets migrated to that more loaded CPU, it will just create more future need for migrations in the other direction, and situations like this are quite likely in schedulers with high CPU stickiness.

Then that's why the BPF scheduler has an advantage: it can exercise more control over migrations according to its own scheduling policy. We should leave this decision to the per-scheduler implementation rather than perform an unconditional migration whenever the migration cost is cheap. Thanks for providing this great insight.

Contributor Author

> Yeah fair enough, if we want to support migrations on the exec path this is the way. So @vax-r, we'll probably drop this patch for the reasons Tejun mentioned.

No problem! Thanks for sharing these insights and this feedback. I'll close the PR later.

@vax-r closed this Jun 14, 2024