scx_rustland: refactoring #68
Conversation
No functional change, only a little polishing, including updates to comments and documentation to align with the latest changes in the code. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
In the dispatch callback we can dispatch tasks to any CPU, according to the scheduler decisions, so there's no reason to check for the available dispatch slots in the current CPU only, to determine if we need to stop dispatching tasks. Since the scheduler is aware of the idle state of the CPUs (via the CPU ownership map) it has all the information to automatically regulate the flow of dispatched tasks and not overflow the dispatch slots, therefore it is safe to remove this check. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
We always try to use the current CPU (from the .dispatch() callback) to run the user-space scheduler itself, and if the current CPU is not usable (according to the cpumask) we just re-use the previously used CPU. However, if the previously used CPU is also not usable, we may trigger the following error:

sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])

Potentially this can also happen with any task, so improve the dispatch logic as follows:
- dispatch on the target CPU, if usable
- otherwise dispatch on the previously used CPU, if usable
- otherwise dispatch on the global DSQ

Moreover, rename dispatch_on_cpu() -> dispatch_task() for better clarity. This should be enough to handle all the possible decisions made by the user-space scheduler, making the dispatcher more robust. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
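The three-step fallback described in this commit can be sketched as a small decision function. This is only an illustrative model, not the actual BPF code: Cpumask, DispatchTarget, and pick_dispatch_target are hypothetical names introduced here for the sketch.

```rust
use std::collections::HashSet;

/// Illustrative stand-in for a task's allowed-CPU mask (hypothetical type).
struct Cpumask(HashSet<u32>);

impl Cpumask {
    fn test_cpu(&self, cpu: u32) -> bool {
        self.0.contains(&cpu)
    }
}

/// Where the dispatcher finally sends the task.
#[derive(Debug, PartialEq)]
enum DispatchTarget {
    /// Local DSQ of a specific CPU.
    Cpu(u32),
    /// Shared global DSQ (safe fallback).
    Global,
}

/// Pick a dispatch target following the fallback order above:
/// the scheduler's chosen CPU if usable, else the previously used
/// CPU, else the global DSQ.
fn pick_dispatch_target(mask: &Cpumask, target: u32, prev: u32) -> DispatchTarget {
    if mask.test_cpu(target) {
        DispatchTarget::Cpu(target)
    } else if mask.test_cpu(prev) {
        DispatchTarget::Cpu(prev)
    } else {
        DispatchTarget::Global
    }
}
```

The global DSQ is always a valid destination, which is why the chain can end there without erroring out the scheduler.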
Looks great, thanks! I left a couple of nits, but this LGTM. Please feel free to merge once they're addressed.
dbg_msg("%s: pid=%d global", __func__, p->pid);
scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, enq_flags);
I wonder if this should eventually be an error condition in the scheduler? Unfortunately we don't yet have an ergonomic API for looking at a task's cpumask in user space so it might be a bit premature.
Also, on the other hand, maybe it would make the scheduler unnecessarily brittle in that a task could exit or something while it's being dispatched from user space. So this fallback is probably fine, but maybe we should add a comment explaining why this could happen?
I wonder if this should eventually be an error condition in the scheduler? Unfortunately we don't yet have an ergonomic API for looking at a task's cpumask in user space so it might be a bit premature.
Oh yes! Having an API to access the cpumask would be super useful... we could use that in the scheduler to avoid dispatching tasks on CPUs that can't be used.
Right now we could report that as a warning maybe? I don't think we should error the scheduler, it doesn't seem to be a too critical condition, considering that in the worst case scenario we can simply use the global DSQ and everything will work just fine.
Moreover even if we have access to the cpumask in the future we may still do something similar, unless we also provide some mechanisms to lock the cpumask.
Also, on the other hand, maybe it would make the scheduler unnecessarily brittle in that a task could exit or something while it's being dispatched from user space. So this fallback is probably fine, but maybe we should add a comment explaining why this could happen?
Absolutely, I'll add a comment. Maybe even introduce an extra counter, like nr_dispatch_failures or something similar?
Right now we could report that as a warning maybe? I don't think we should error the scheduler, it doesn't seem to be a too critical condition, considering that in the worst case scenario we can simply use the global DSQ and everything will work just fine.
Yeah, agreed. And I don't think it necessarily indicates a bad / corrupted scheduler or anything anyways given that everything is async, so this is inherently a bit racy.
Absolutely, I'll add a comment. Maybe even introduce an extra counter, like nr_dispatch_failures or something similar?
SGTM!
@@ -596,6 +624,12 @@ void BPF_STRUCT_OPS(rustland_disable, struct task_struct *p)

/*
 * Heartbeat scheduler timer callback.
 *
 * If the system is completely idle the sched-ext wathcdog may incorrectly
watchdog
Move the code responsible for interfacing with the BPF component into its own module and provide high-level abstractions for the user-space scheduler, hiding all the internal BPF implementation details. This makes the user-space scheduler code much more readable and it allows potential developers/contributors that want to focus on the pure scheduling details to modify the scheduler in a generic way, without having to worry about the internal BPF details.

In the future we may even decide to provide the BPF abstraction as a separate crate, that could be used as a baseline to implement user-space schedulers in Rust.

API overview
============

The main BPF interface is provided by BpfScheduler(). When this object is initialized it will take care of registering and initializing the BPF component. Then the scheduler can use the BpfScheduler() instance to receive tasks (in the form of QueuedTask objects) and dispatch tasks (in the form of DispatchedTask objects), using respectively the methods dequeue_task() and dispatch_task().

The CPU ownership map can be accessed using the method get_cpu_pid(); this also allows keeping track of the idle and busy CPUs, with the corresponding PIDs associated to them.

BPF counters and statistics can be accessed using the methods nr_*_mut(); in particular nr_queued_mut() and nr_scheduled_mut() can be updated to notify the BPF component whether the user-space scheduler has some pending work to do or not.

Finally, the methods read_bpf_exit_kind() and report_bpf_exit_kind() can be used respectively to read the exit code and exit message from the BPF component, when the scheduler is unregistered.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
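To show the call shape this API implies, here is a minimal mock of the dequeue/dispatch loop. MockBpfScheduler, QueuedTask, and DispatchedTask are stand-in types written for this sketch; they mirror the method names described above (dequeue_task(), dispatch_task(), nr_queued_mut()) but are not the real scx_rustland structs.

```rust
use std::collections::VecDeque;

/// Stand-in for a task queued by the BPF component (hypothetical fields).
struct QueuedTask { pid: i32, cpu: i32 }

/// Stand-in for a scheduling decision sent back to the BPF component.
struct DispatchedTask { pid: i32, cpu: i32 }

/// Mock of the BpfScheduler interface described in the commit message.
struct MockBpfScheduler {
    queued: VecDeque<QueuedTask>,
    dispatched: Vec<DispatchedTask>,
    nr_queued: u64,
}

impl MockBpfScheduler {
    fn new(tasks: Vec<QueuedTask>) -> Self {
        let nr = tasks.len() as u64;
        Self { queued: tasks.into(), dispatched: Vec::new(), nr_queued: nr }
    }

    /// Receive the next task queued by the BPF component, if any.
    fn dequeue_task(&mut self) -> Option<QueuedTask> {
        self.queued.pop_front()
    }

    /// Send a scheduling decision back to the BPF component.
    fn dispatch_task(&mut self, task: DispatchedTask) {
        self.dispatched.push(task);
    }

    /// Counter the scheduler updates to signal pending work.
    fn nr_queued_mut(&mut self) -> &mut u64 {
        &mut self.nr_queued
    }
}

/// One scheduling pass: drain queued tasks, dispatch each one on its
/// last used CPU, and update the pending-work counter as we go.
fn schedule_once(bpf: &mut MockBpfScheduler) {
    while let Some(task) = bpf.dequeue_task() {
        bpf.dispatch_task(DispatchedTask { pid: task.pid, cpu: task.cpu });
        let remaining = bpf.queued.len() as u64;
        *bpf.nr_queued_mut() = remaining;
    }
}
```

A real scheduler would apply an actual policy between dequeue and dispatch; the point here is only how the abstraction hides the BPF plumbing behind a handful of methods.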
d5865e2 to 50735e4
LGTM! Just one small typo
cpu = scx_bpf_task_cpu(p);
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
        /*
         * Both the designaged CPU and the previously used CPU
designated
...and fixed! I should really run a syntax checker. And something that checks when I confuse scx_rustland with scx_userland as well. 🤣
Thanks!
Introduce a new counter to report the amount of failed dispatches: if the scheduler designates a target CPU for a task, and both the chosen CPU and the previously utilized one are unavailable when the task is dispatched, the task will be sent to the global DSQ, and the counter will be incremented. Also mark all the methods to access these statistics counters as optional. In the future we may also provide a "verbose" option and show these statistics only when the scheduler runs in verbose mode. Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
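The accounting this commit describes can be modeled with a tiny helper: the counter is bumped only when both candidate CPUs are unusable and the task falls back to the global DSQ. account_dispatch and nr_failed_dispatches are hypothetical names used for this sketch; the real counter lives in the BPF component.

```rust
/// Illustrative model of failed-dispatch accounting (hypothetical helper):
/// returns true when the task had to fall back to the global DSQ because
/// neither the designated CPU nor the previously used CPU was usable,
/// incrementing the failure counter in that case.
fn account_dispatch(target_usable: bool, prev_usable: bool,
                    nr_failed_dispatches: &mut u64) -> bool {
    let failed = !target_usable && !prev_usable;
    if failed {
        // Task goes to the global DSQ; record the fallback.
        *nr_failed_dispatches += 1;
    }
    failed
}
```

Since the fallback is harmless (the task still runs via the global DSQ), exposing this as an optional statistic rather than an error matches the discussion above.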
50735e4 to 7813992
Minor improvement:
Code refactoring:
There is basically no functional change in this pull request, just some code refactoring. The important part is the new abstraction to the BPF component that makes the scheduler code much more readable and understandable.
Maybe in the future we could also consider moving this BPF abstraction to a more generic place, with proper API documentation, to make it possible to implement scx user-space schedulers in a more high-level, abstracted way.