
scx_rustland: refactoring #68

Merged: 5 commits into main on Jan 4, 2024

Conversation

@arighi (Collaborator) commented on Jan 4, 2024

Minor improvements:

  • scx_rustland: make dispatcher more robust
  • scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check
  • scx_rustland: introduce nr_failed_dispatches

Code refactoring:

  • scx_rustland: provide an abstraction layer for the BPF component
  • scx_rustland: update comments and documentation in the BPF part

There is basically no functional change in this pull request, just some code refactoring. The important part is the new abstraction over the BPF component, which makes the scheduler code much more readable and understandable.

In the future we could also consider moving this BPF abstraction to a more generic place, with proper API documentation, to make it possible to implement scx user-space schedulers in a more high-level, abstracted way.

scx_rustland: update comments and documentation in the BPF part

No functional change, only a little polishing, including updates to
comments and documentation to align with the latest changes in the code.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
scx_rustland: remove unnecessary scx_bpf_dispatch_nr_slots() check

In the dispatch callback we can dispatch tasks to any CPU, according to
the scheduler's decisions, so there is no reason to check the available
dispatch slots only on the current CPU to determine whether we need to
stop dispatching tasks.

Since the scheduler is aware of the idle state of the CPUs (via the CPU
ownership map) it has all the information to automatically regulate the
flow of dispatched tasks and not overflow the dispatch slots, therefore
it is safe to remove this check.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
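
As an aside, here is a minimal Rust sketch (not code from this PR) of how the user-space side can use the CPU ownership map to bound how many tasks it dispatches per iteration; it assumes the get_cpu_pid() accessor introduced later in this series, that a returned PID of 0 means the CPU is idle, and a caller-provided nr_cpus count:

```rust
/// Count the CPUs currently not owned by any task, according to the CPU
/// ownership map exposed by the BPF component (assumption: get_cpu_pid()
/// returns 0 for an idle CPU).
fn count_idle_cpus(bpf: &BpfScheduler, nr_cpus: i32) -> usize {
    (0..nr_cpus).filter(|&cpu| bpf.get_cpu_pid(cpu) == 0).count()
}

/// Upper bound on how many tasks to dispatch in one batch: never push more
/// tasks than there are idle CPUs, so the per-CPU dispatch slots in the BPF
/// component are not overflowed.
fn dispatch_budget(bpf: &BpfScheduler, nr_cpus: i32) -> usize {
    count_idle_cpus(bpf, nr_cpus).max(1)
}
```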
scx_rustland: make dispatcher more robust

We always try to use the current CPU (from the .dispatch() callback) to
run the user-space scheduler itself, and if the current CPU is not usable
(according to the cpumask) we just re-use the previously used CPU.

However, if the previously used CPU is also not usable, we may trigger
the following error:

 sched_ext: runtime error (SCX_DSQ_LOCAL[_ON] verdict target cpu 4 not allowed for scx_rustland[256201])

Potentially this can also happen with any task, so improve the dispatch
logic as follows:

 - dispatch on the target CPU, if usable
 - otherwise dispatch on the previously used CPU, if usable
 - otherwise dispatch on the global DSQ

Moreover, rename dispatch_on_cpu() -> dispatch_task() for better
clarity.

This should be enough to handle all the possible decisions made by the
user-space scheduler, making the dispatcher more robust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>

@Byte-Lab (Contributor) left a comment:

Looks great, thanks! I left a couple of nits, but this LGTM. Please feel free to merge once they're addressed.

Comment on lines 294 to 310
dbg_msg("%s: pid=%d global", __func__, p->pid);
scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, slice_ns, enq_flags);

@Byte-Lab (Contributor) commented:

I wonder if this should eventually be an error condition in the scheduler? Unfortunately we don't yet have an ergonomic API for looking at a task's cpumask in user space so it might be a bit premature.

Also, on the other hand, maybe it would make the scheduler unnecessarily brittle in that a task could exit or something while it's being dispatched from user space. So this fallback is probably fine, but maybe we should add a comment explaining why this could happen?

@arighi (Collaborator, Author) replied:

> I wonder if this should eventually be an error condition in the scheduler? Unfortunately we don't yet have an ergonomic API for looking at a task's cpumask in user space so it might be a bit premature.

Oh yes! Having an API to access the cpumask would be super useful... we could use that in the scheduler to avoid dispatching tasks on CPUs that can't be used.

Right now we could report that as a warning, maybe? I don't think we should error out the scheduler; it doesn't seem to be too critical a condition, considering that in the worst-case scenario we can simply use the global DSQ and everything will work just fine.

Moreover, even if we have access to the cpumask in the future, we may still do something similar, unless we also provide some mechanism to lock the cpumask.

> Also, on the other hand, maybe it would make the scheduler unnecessarily brittle in that a task could exit or something while it's being dispatched from user space. So this fallback is probably fine, but maybe we should add a comment explaining why this could happen?

Absolutely, I'll add a comment. Maybe even introduce an extra counter, like nr_dispatch_failures or something similar?

@Byte-Lab (Contributor) replied:

> Right now we could report that as a warning, maybe? I don't think we should error out the scheduler; it doesn't seem to be too critical a condition, considering that in the worst-case scenario we can simply use the global DSQ and everything will work just fine.

Yeah, agreed. And I don't think it necessarily indicates a bad / corrupted scheduler or anything anyways given that everything is async, so this is inherently a bit racy.

> Absolutely, I'll add a comment. Maybe even introduce an extra counter, like nr_dispatch_failures or something similar?

SGTM!

@@ -596,6 +624,12 @@ void BPF_STRUCT_OPS(rustland_disable, struct task_struct *p)

/*
* Heartbeat scheduler timer callback.
*
* If the system is completely idle the sched-ext wathcdog may incorrectly

@Byte-Lab (Contributor) commented:

watchdog

scx_rustland: provide an abstraction layer for the BPF component

Move the code responsible for interfacing with the BPF component into
its own module and provide high-level abstractions for the user-space
scheduler, hiding all the internal BPF implementation details.

This makes the user-space scheduler code much more readable and allows
potential developers/contributors who want to focus on the pure
scheduling details to modify the scheduler in a generic way, without
having to worry about the internal BPF details.

In the future we may even decide to provide the BPF abstraction as a
separate crate that could be used as a baseline to implement user-space
schedulers in Rust.

API overview
============

The main BPF interface is provided by BpfScheduler(). When this object
is initialized it will take care of registering and initializing the BPF
component.

Then the scheduler can use the BpfScheduler() instance to receive tasks
(in the form of QueuedTask objects) and dispatch tasks (in the form of
DispatchedTask objects), using respectively the methods dequeue_task()
and dispatch_task().

The CPU ownership map can be accessed using the method get_cpu_pid();
this also makes it possible to keep track of the idle and busy CPUs,
with the corresponding PIDs associated to them.

BPF counters and statistics can be accessed using the nr_*_mut()
methods; in particular, nr_queued_mut() and nr_scheduled_mut() can be
updated to notify the BPF component whether the user-space scheduler has
pending work to do.

Finally, the methods read_bpf_exit_kind() and report_bpf_exit_kind() can
be used, respectively, to read the exit code and the exit message from
the BPF component when the scheduler is unregistered.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
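
For illustration only, here is a minimal Rust sketch of how a scheduler might drive the abstraction described above; BpfScheduler, QueuedTask, DispatchedTask and the method names come from this commit message, while the exact signatures, return types and field names below are assumptions rather than the real API:

```rust
// Hypothetical driver loop for the BpfScheduler abstraction (sketch only).
fn schedule_loop(bpf: &mut BpfScheduler) {
    loop {
        // Drain the tasks queued by the BPF component (assumed signature:
        // dequeue_task() -> Result<Option<QueuedTask>, ...>).
        while let Ok(Some(task)) = bpf.dequeue_task() {
            // Turn the QueuedTask into a DispatchedTask, pick a target CPU
            // (here we simply keep the task's previous CPU) and hand it back
            // to the BPF component.
            let mut dispatched = DispatchedTask::new(&task); // assumed constructor
            dispatched.cpu = task.cpu;                       // assumed fields
            bpf.dispatch_task(&dispatched).expect("dispatch failed");
        }

        // Tell the BPF component whether the user-space scheduler still has
        // pending work, so it knows when the scheduler can be parked.
        *bpf.nr_queued_mut() = 0;
        *bpf.nr_scheduled_mut() = 0;

        // Stop once the BPF component has been unregistered (assumption:
        // read_bpf_exit_kind() returns a non-zero exit code at that point).
        if bpf.read_bpf_exit_kind() != 0 {
            break;
        }
    }
}
```

In a real scheduler the target CPU would of course be chosen using the CPU ownership map (get_cpu_pid()) and the scheduling policy, rather than by blindly reusing the previous CPU.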
Copy link
Contributor

@Byte-Lab Byte-Lab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just one small typo

cpu = scx_bpf_task_cpu(p);
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr)) {
/*
* Both the designaged CPU and the previously used CPU

@Byte-Lab (Contributor) commented:

designated

@arighi (Collaborator, Author) replied:

...and fixed! I should really run a syntax checker. And something that checks when I confuse scx_rustland with scx_userland as well. 🤣

Thanks!

scx_rustland: introduce nr_failed_dispatches

Introduce a new counter to report the number of failed dispatches: if
the scheduler designates a target CPU for a task, and both the chosen
CPU and the previously utilized one are unavailable when the task is
dispatched, the task will be sent to the global DSQ, and the counter
will be incremented.

Also mark all the methods to access these statistics counters as
optional. In the future we may also provide a "verbose" option and show
these statistics only when the scheduler runs in verbose mode.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
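
As a rough sketch (not part of this PR), reporting the new counter alongside the existing ones could look like the following; the accessor name nr_failed_dispatches_mut() simply mirrors the nr_*_mut() convention described above and is an assumption:

```rust
// Sketch: print scheduler statistics, e.g. only when running in a
// hypothetical verbose mode; nr_failed_dispatches_mut() is an assumed
// accessor named after the nr_*_mut() convention.
fn print_stats(bpf: &mut BpfScheduler, verbose: bool) {
    if !verbose {
        return;
    }
    println!(
        "nr_queued={} nr_scheduled={} nr_failed_dispatches={}",
        *bpf.nr_queued_mut(),
        *bpf.nr_scheduled_mut(),
        *bpf.nr_failed_dispatches_mut(),
    );
}
```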

@arighi merged commit 96f3eb4 into main on Jan 4, 2024 (2 checks passed).
@htejun deleted the scx-rustland-refactoring branch on January 10, 2024.