
scx_rustland: prevent starvation and improve responsiveness #60

Merged: 8 commits from the scx-rustland-prevent-starvation branch merged into sched-ext:main on Jan 2, 2024

Conversation

@arighi (Collaborator) commented on Jan 1, 2024

Improvements and fixes to scx_rustland to prevent some cases of starvation and enhance responsiveness:

  • scx_rustland: evaluate the proper vruntime delta
  • scx_rustland: prevent starvation handling short-lived tasks properly
  • bpf_rustland: do not dispatch the scheduler to the global DSQ
  • scx_rustland: remove SCX_ENQ_LAST check in is_task_cpu_available()
  • scx_rustland: never account more than slice_ns to vruntime
  • scx_rustland: clean up old entries in the task map

Small code readability improvement:

  • scx_rustland: small code refactoring
  • scx_rustland: rename variable id -> pos for better clarity

With all these changes in place the scheduler can properly handle, on my 8-core laptop, a parallel kernel build (make -j32), a stress test (stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4), and a YouTube music video playing in the background, all while reading emails and writing the text of this PR.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
The user-space scheduler maintains an internal hash map of task
information (indexed by pid). Tasks are only added to this hash map,
never removed. After running the scheduler for a while we may
experience a performance degradation, because the hash map keeps growing.

Therefore, implement a garbage collection mechanism that removes stale
entries from the task map (periodically dropping pids that no longer
exist).

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
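As a rough illustration of the idea, here is a minimal sketch of such a garbage-collection pass (the struct and method names are hypothetical, not the actual scx_rustland code): it periodically retains only the entries whose pid still exists under /proc. The discussion further below also considers using the .disable callback to remove entries synchronously instead.

```rust
use std::collections::HashMap;
use std::path::Path;

// Hypothetical per-task bookkeeping kept by the user-space scheduler.
struct TaskInfo {
    sum_exec_runtime: u64,
    vruntime: u64,
}

struct Scheduler {
    // Task map indexed by pid; entries are added as tasks are scheduled.
    task_map: HashMap<i32, TaskInfo>,
}

impl Scheduler {
    // Periodically drop entries whose pid no longer exists, so the map
    // doesn't keep growing while the scheduler runs for a long time.
    fn gc_task_map(&mut self) {
        self.task_map
            .retain(|pid, _| Path::new(&format!("/proc/{}", pid)).exists());
    }
}

fn main() {
    let mut sched = Scheduler { task_map: HashMap::new() };
    // pid 1 (init) always exists on Linux; a negative pid never does.
    sched.task_map.insert(1, TaskInfo { sum_exec_runtime: 0, vruntime: 0 });
    sched.task_map.insert(-1, TaskInfo { sum_exec_runtime: 0, vruntime: 0 });
    sched.gc_task_map();
    assert_eq!(sched.task_map.len(), 1);
}
```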
In any case, make sure that we never account more than the maximum
slice_ns to a task's vruntime.

This helps prevent a task from being starved for too long in the
user-space scheduler.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Commit 49f2e7c ("scx_rustland: enable SCX_OPS_ENQ_LAST") enabled
SCX_OPS_ENQ_LAST, which seems to save some unnecessary user-space
scheduler activations when the system is mostly idle.

We were also checking for SCX_ENQ_LAST in the enqueue flags, but that
check is apparently not needed: we can achieve the same behavior by
dropping it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Never dispatch the user-space scheduler to the global DSQ while all
the other tasks are dispatched to the local per-CPU DSQs.

Since tasks are consumed from the local DSQ first and then from the
global DSQ, we may end up starving the scheduler if it is the only
task dispatched to the global DSQ.

In fact, it is really easy to trigger a stall with a workload that
generates many context switches in the system, for example (on an
8-core system):

 $ stress-ng --cpu 32 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 30s

 ...
 09:28:11 [WARN] EXIT: scx_rustland[1455943] failed to run for 5.275s
 09:28:11 [INFO] Unregister RustLand scheduler

To prevent this from happening, dispatch the user-space scheduler to
the local DSQ as well, using the current CPU where .dispatch() is
called if possible, or the previously used CPU otherwise.

Apply the same logic when the scheduler is congested: dispatch to the
local DSQ of the previously used CPU.

In this way all tasks always get the same "dispatch priority" and we
prevent the scheduler starvation issue.

Note that with this change in place dispatch_global() is never used,
so we can get rid of it.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
Prevent newly created short-lived tasks from starving the other tasks
sitting in the user-space scheduler.

This can be done by assigning newly scheduled tasks an initial vruntime
of (min_vruntime + 1) instead of min_vruntime: this ensures a
progressing global vruntime during each scheduler run, still providing
a priority boost to newer tasks (which is beneficial for potential
short-lived tasks), while preventing excessive starvation of the other
tasks sitting in the user-space scheduler, waiting to be dispatched.

Without this change it is really easy to create a stall condition
simply by forking a bunch of short-lived tasks in a busy loop. With
this change applied, the scheduler properly handles a constant flow of
newly created short-lived tasks without introducing any stall.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
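As a tiny, hedged sketch of the mechanism described above (the function name and values are illustrative, not the actual scx_rustland code), the whole change boils down to the +1 applied when a task is seen for the first time:

```rust
// Pick the initial vruntime for a task the user-space scheduler has not
// seen before. Using min_vruntime + 1 (instead of min_vruntime) still
// boosts the new task ahead of already-queued work, but also keeps the
// global vruntime progressing, so a busy loop forking short-lived tasks
// cannot indefinitely starve the tasks waiting to be dispatched.
fn vruntime_for_new_task(min_vruntime: u64) -> u64 {
    min_vruntime + 1
}

fn main() {
    let min_vruntime = 1_000u64;
    // The newly forked task is queued just ahead of the current global
    // minimum rather than exactly at it.
    assert_eq!(vruntime_for_new_task(min_vruntime), 1_001);
}
```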
@arighi force-pushed the scx-rustland-prevent-starvation branch from 719d093 to 90e92ac on January 1, 2024 at 15:58
The formula used to evaluate the weighted time delta is not correct:
it does not treat the weight as a percentage. Fix this by using the
proper formula.

Moreover, also take the task weight into account when evaluating the
maximum time delta to charge to vruntime, and make sure that we never
charge a task more than slice_ns.

This helps to prevent starvation of low-priority tasks.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
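A simplified, hedged sketch of what "treating the weight as a percentage" and capping the charge could look like (the default weight of 100, the helper name, and the slice value are assumptions; the actual commit may apply the cap differently):

```rust
// Default sched_ext task weight: a task's weight is expressed relative to
// this value (weight 200 means "twice the default priority").
const DEFAULT_WEIGHT: u64 = 100;

// Charge an execution-time delta to a task's vruntime: scale it by the
// inverse of the weight (treated as a percentage of the default) and never
// charge more than one full slice, to avoid starving low-priority tasks.
fn weighted_vruntime_delta(delta_ns: u64, weight: u64, slice_ns: u64) -> u64 {
    let weighted = delta_ns * DEFAULT_WEIGHT / weight.max(1);
    weighted.min(slice_ns)
}

fn main() {
    let slice_ns = 5_000_000; // assume a 5 ms slice for the example
    // A weight-200 task is charged half of the wall-clock delta...
    assert_eq!(weighted_vruntime_delta(2_000_000, 200, slice_ns), 1_000_000);
    // ...while a weight-50 task is charged double, capped at slice_ns.
    assert_eq!(weighted_vruntime_delta(4_000_000, 50, slice_ns), 5_000_000);
}
```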
No functional change: make the user-space scheduler code a bit more
readable and more idiomatic Rust.

Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
@arighi force-pushed the scx-rustland-prevent-starvation branch from 5151117 to 280796c on January 1, 2024 at 18:48
 * previously used one.
 */
if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
	cpu = scx_bpf_task_cpu(p);
Contributor:
I think this should work fine, but can you please verify that changing the CPU affinity of a constantly running task that is bound to one CPU works? I.e., have one thread that busy-loops, bind it to one CPU using taskset, and then change it to another CPU.

arighi (Collaborator, Author):
Yep, it seems to work:

09:10 arighi@gpd$ yes >/dev/null &
[1] 1797739
6.7.0-3-generic ~
09:10 arighi@gpd$ taskset -pc 2 $!
pid 1797739's current affinity list: 0-7
pid 1797739's new affinity list: 2
6.7.0-3-generic ~
09:10 arighi@gpd$ cut -d' ' -f39 /proc/$!/stat
2
6.7.0-3-generic ~
09:11 arighi@gpd$ taskset -pc 5 $!
pid 1797739's current affinity list: 2
pid 1797739's new affinity list: 5
6.7.0-3-generic ~
09:11 arighi@gpd$ cut -d' ' -f39 /proc/$!/stat
5

Of course it changes CPU only if the task is doing something (i.e., a sleep infinity won't change CPU), but the same happens with the default scheduler, so we're not changing any behavior here.

let mut delta = sum_exec_runtime - task_info.sum_exec_runtime;
// Never account more than max_slice_ns. This helps to prevent starving a task for too
// long in the scheduler task pool.
delta = delta.min(max_slice_ns);
Contributor:
I'm curious why this is necessary. Maybe this is needed to mask a different issue such as failing to wind current vtime forward when it should be? Or is this necessary because sometimes the scheduler takes a bit too long to kick them off cpu and their execution pattern becomes too chunky?

arighi (Collaborator, Author):
> I'm curious why this is necessary. Maybe this is needed to mask a different issue such as failing to wind current vtime forward when it should be? Or is this necessary because sometimes the scheduler takes a bit too long to kick them off cpu and their execution pattern becomes too chunky?

I think I was just trying to fix here what was later fixed by 2900b20 ("scx_rustland: evaluate the proper vruntime delta"); I should have squashed the commits together...

However, I can now see a case where the delta can be bigger than slice_ns, but that's a side effect of the fact that we don't clean entries synchronously from the self.task_info HashMap when the corresponding tasks exit.

If a pid is reused, we would (incorrectly) get the previous TaskInfo entry, with the sum_exec_runtime of the previous task.

Do we have a callback in sched_ext_ops that we can use to detect synchronously when a task exits? I was trying to use .disable, but it seems to be called only when the scheduler exits, not when the task exits... is this the expected behavior? Am I missing something?

Contributor:
.disable should be called when a task gets freed, which may take place a bit after the task exits.

arighi (Collaborator, Author):
> .disable should be called when a task gets freed, which may take place a bit after the task exits.

OK, I'll try again with the .disable callback to notify the user-space scheduler when tasks exit; it's more elegant than running the periodic garbage collector.

(Maybe we still need the garbage collector as a fallback in case we fail to notify the user-space scheduler, but I don't think that's a big problem for now.)

Thanks!

@htejun merged commit 6c437ae into sched-ext:main on Jan 2, 2024
1 check passed