scx_rustland: enhancements #72

Merged (7 commits into main) on Jan 7, 2024

Conversation

@arighi (Contributor) commented Jan 6, 2024

Some fixes and enhancements to make scx_rustland more usable on a desktop / interactive system.

@Byte-Lab (Contributor) left a comment

Overall looks good to me. Left one optional nit, and one suggestion for how we can make _get_task_cpu_slow() a lot faster.

Comment on lines 294 to 302
int i;

for (i = 0; i < num_possible_cpus; i++) {
cpu = (cpu + i) % num_possible_cpus;
if (bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
return cpu;
}

return -1;

@Byte-Lab (Contributor)

Something else you could do is call bpf_cpumask_any_distribute(p->cpus_ptr) to choose a random CPU from p->cpus_ptr? Note that it returns a cpu >= num_cpus if none of the bits are set, so you'd have to do something like:

cpu = bpf_cpumask_any_distribute(p->cpus_ptr);
return cpu < num_possible_cpus ? cpu : -1;

@arighi (Contributor, Author)

Nice suggestion! Thanks! I'll apply it and run some tests. I think we should still keep the logic that tries prev_cpu before picking a random one with bpf_cpumask_any_distribute(); that should reduce unnecessary migrations. I'll test and measure both and check whether one shows a clear benefit over the other.

@Byte-Lab (Contributor)

Yep, agreed. I was thinking that bpf_cpumask_any_distribute() could replace that for loop. But trying to choose prev_cpu first seems like a good idea.
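
A minimal sketch combining the two ideas discussed above: try prev_cpu first to reduce migrations, then fall back to bpf_cpumask_any_distribute(). The helper name is hypothetical and this is not the PR's actual code; num_possible_cpus is assumed to be the same global used by the loop quoted earlier.

static s32 pick_task_cpu(struct task_struct *p, s32 prev_cpu)
{
        s32 cpu;

        /* Prefer the previously used CPU, if still allowed, to reduce
         * unnecessary task migrations. */
        if (bpf_cpumask_test_cpu(prev_cpu, p->cpus_ptr))
                return prev_cpu;

        /* Otherwise pick a random CPU from the task's cpumask. */
        cpu = bpf_cpumask_any_distribute(p->cpus_ptr);

        /* bpf_cpumask_any_distribute() returns a CPU >= the number of
         * possible CPUs if no bit is set in the cpumask. */
        return cpu < num_possible_cpus ? cpu : -1;
}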

@@ -375,6 +411,14 @@ static bool is_task_cpu_available(struct task_struct *p, u64 enq_flags)
if (is_kthread(p) && p->nr_cpus_allowed == 1)
return true;

/*
* Do not bypass the user-space scheduler if there are pending
* activity, otherwise, we may risk to disrupt the scheduler's

@Byte-Lab (Contributor)

optional suggestion:

s/we may risk to disrupt/we may risk disrupting

@arighi (Contributor, Author) commented Jan 6, 2024

Moreover, I'm not yet fully convinced about the last 2 commits (changing the default timeout_ms and slice_ns).

While they can definitely provide some benefits for interactive workloads, I think there's also value in keeping the defaults as they are, because that can help to better measure future improvements of the scheduler.

Maybe I can make those changes opt-in, applied only when a certain command line option is passed...

Andrea Righi added 5 commits January 6, 2024 11:06
Change TaskTree.push() to accept a Task object directly, rather than
each individual attribute. Moreover, Task attributes don't need to be
public, since both TaskTree and Task are only used locally.

This makes the code more elegant and more readable.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Assign min_vruntime to the task before the weighted time slice is
evaluated, then add the time slice.

In this way we still ensure that the task's vruntime is in the range
(min_vruntime + 1, min_vruntime + max_slice_ns], but we don't nullify
the effect of the evaluated time slice if the starting vruntime of the
task is too small.

Also change update_enqueued() to return the evaluated weighted time
slice (that can be used in the future).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
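
As a rough illustration of the ordering described above, here is a hedged sketch; the struct, the function signature, and the weighted-slice formula are assumptions for illustration, not the PR's actual code:

struct task_ctx {
        u64 vruntime;
        u64 weight;     /* 100 == default task weight */
};

static u64 update_enqueued(struct task_ctx *task, u64 min_vruntime,
                           u64 max_slice_ns)
{
        /* Evaluate the weighted time slice (assumed formula: a lower
         * weight produces a larger vruntime charge). */
        u64 slice = max_slice_ns * 100 / task->weight;

        if (slice > max_slice_ns)
                slice = max_slice_ns;

        /* First clamp the task's vruntime to the global minimum... */
        if (task->vruntime < min_vruntime)
                task->vruntime = min_vruntime;

        /* ...then add the evaluated slice, so the result always falls
         * in (min_vruntime, min_vruntime + max_slice_ns]. */
        task->vruntime += slice;

        /* Return the evaluated weighted time slice. */
        return slice;
}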
Instead of just trying the target CPU and the previously used CPU,
cycle among all the available CPUs (when neither of those two can be
used), before falling back to the global DSQ.

This avoids excessively de-prioritizing tasks that can't be scheduled
on the CPU selected by the scheduler (or on their previously used CPU),
since we can still dispatch them using SCX_DSQ_LOCAL_ON, like any other
task.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
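
A condensed sketch of this fallback chain (target CPU, then previously used CPU, then any allowed CPU, then the global DSQ); the helper names, including get_task_cpu_slow() echoing the _get_task_cpu_slow() mentioned in the review, are assumptions, not the PR's exact code:

static void dispatch_task(struct task_struct *p, s32 cpu, u64 slice_ns)
{
        /* Try the CPU selected by the user-space scheduler first. */
        if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
                /* Fall back to the task's previously used CPU. */
                cpu = scx_bpf_task_cpu(p);
                if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
                        /* Cycle among all the CPUs allowed by the
                         * task's cpumask. */
                        cpu = get_task_cpu_slow(p, cpu);
        }

        if (cpu >= 0) {
                /* Dispatch to the local DSQ of the selected CPU. */
                scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, 0);
                return;
        }

        /* No usable CPU found: fall back to the global DSQ. */
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, 0);
}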
When we fail to push a task to the queued BPF map we fall back to
direct dispatch, but we can't use SCX_DSQ_LOCAL_ON in that path. So,
make sure to use SCX_DSQ_GLOBAL in this case, to prevent scheduler
crashes.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
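
A minimal sketch of this error path, assuming a BPF queue map named queued and the surrounding enqueue-path variables:

/* Try to queue the task to the user-space scheduler. */
if (bpf_map_push_elem(&queued, &task, 0)) {
        /*
         * The queued map is full: dispatch the task directly.
         * SCX_DSQ_LOCAL_ON cannot be used from this path, so use
         * SCX_DSQ_GLOBAL to avoid crashing the scheduler.
         */
        scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, enq_flags);
        return;
}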
We allow tasks to bypass the user-space scheduler and be dispatched
directly using a shortcut in the enqueue path, if their running CPU is
immediately available or if the task is a per-CPU kthread.

However, the shortcut is disabled if the user-space scheduler has some
pending activity to do (to avoid disrupting its decisions too much).

In this case the shortcut is disabled also for per-CPU kthreads, which
may cause priority-inversion problems in the system, triggering stalls
of some per-CPU kthreads (such as rcuog/N) and short system lockups, if
the system is overloaded.

Prevent this by always enabling the dispatch shortcut for per-CPU
kthreads.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
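
A condensed sketch of the resulting check order (variable and helper names are assumptions based on the hunk quoted earlier, and the final idle check is illustrative):

static bool is_task_cpu_available(struct task_struct *p, u64 enq_flags)
{
        /* Per-CPU kthreads always take the direct-dispatch shortcut,
         * even when the user-space scheduler has pending activity. */
        if (is_kthread(p) && p->nr_cpus_allowed == 1)
                return true;

        /*
         * Do not bypass the user-space scheduler if it has pending
         * activity, to avoid disrupting its decisions.
         */
        if (nr_queued || nr_scheduled)
                return false;

        /* Otherwise allow the shortcut if the task's previously used
         * CPU is currently idle. */
        return scx_bpf_test_and_clear_cpu_idle(scx_bpf_task_cpu(p));
}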
@arighi (Contributor, Author) commented Jan 6, 2024

Alright, I dropped some commits and updated this PR only with the "safe" fixes.

For the more experimental ones I'd like to do more testing (even if those changes make the scheduler more responsive, they also make it more unpredictable sometimes, so I would like to investigate that further).

Andrea Righi added 2 commits January 7, 2024 16:14
Always report nr_queued and nr_scheduled in the log messages. Moreover,
also report the task's name (comm) and CPU when possible.

All this extra information can be really helpful to trace and debug
scheduling issues.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Cache slice_ns into the main scheduler class to avoid accessing it via
self.bpf.skel.rodata().slice_ns every single time.

This also makes the scheduler code clearer and better abstracted from
the BPF details.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@htejun merged commit 63fe690 into main on Jan 7, 2024 (2 checks passed)
@htejun deleted the scx-rustland-enhancements branch on January 10, 2024 19:51