
@daidavid
Contributor

The look-back period of the current utilization-based migration scheme is too short, which results in overly aggressive load balancing. For example, in certain workloads on a machine with an average utilization of ~90%, sampled load deltas between compute domains still showed > 50% differences in load.

Allow users to define a threshold for task stealing that is based on the sum of util_avg of the CPUs in a compute domain instead, to smooth out load tracking and reduce overall task migrations.
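To illustrate the idea, here is a minimal sketch of a threshold-gated steal decision. It is not the scx_lavd implementation: the `ComputeDomain` class, the `should_steal` function, the normalization by CPU count, and the way the threshold is compared are all assumptions made for this example. Only the notion of summing per-CPU util_avg over a compute domain comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComputeDomain:
    """Hypothetical model of a compute domain (e.g. one CCX/LLC)."""
    name: str
    cpu_util_avg: List[float]  # per-CPU util_avg, each assumed to be in [0.0, 1.0]

    def load(self) -> float:
        """Sum of util_avg over the domain's CPUs, normalized by CPU count."""
        if not self.cpu_util_avg:
            return 0.0
        return sum(self.cpu_util_avg) / len(self.cpu_util_avg)


def should_steal(stealer: ComputeDomain, victim: ComputeDomain,
                 threshold: float) -> bool:
    """Allow a cross-domain steal only if the victim is busier than the
    stealer by more than `threshold` (assumed to be a fraction of capacity)."""
    return victim.load() - stealer.load() > threshold


if __name__ == "__main__":
    busy = ComputeDomain("llc0", [0.90, 0.90, 0.90, 0.90])
    similar = ComputeDomain("llc1", [0.85, 0.85, 0.85, 0.85])
    idle = ComputeDomain("llc2", [0.30, 0.30, 0.30, 0.30])
    print(should_steal(similar, busy, threshold=0.10))
    print(should_steal(idle, busy, threshold=0.10))
```

Because util_avg is a smoothed signal, a short-lived imbalance between two nearly equally loaded domains stays below the threshold in this sketch, while a sustained gap still permits stealing.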

The numbers below were sampled on a workload at ~90% utilization to measure the cost of cross-CCX/LLC domain migrations.

bpftop sample:
+-----------------+--------------+---------------+
| lavd_dispatch   | forced_steal | no_task_steal |
+-----------------+--------------+---------------+
| avg_runtime(ns) | 3593         | 762           |
| cputime(%)      | 250.99       | 40.87         |
| events/s        | 704720       | 746275        |
+-----------------+--------------+---------------+

latencies(ns):
+-----+---------------+--------------------+----------------------+
|     | dispatch(avg) | consume(local_llc) | consume(remote_llc)  |
+-----+---------------+--------------------+----------------------+
| avg | 3884          | 595.9              | 2216.6               |
| p90 | 7721          | 1952.3             | 5709.0               |
| p99 | 14715         | 6900.0             | 13023.5              |
+-----+---------------+--------------------+----------------------+
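
The cost ratios implied by the two samples above can be worked out directly. The snippet below is only an arithmetic aid; the dictionary layout and variable names are made up for this example, and the values are copied from the tables.

```python
# Values copied from the bpftop and latency samples in this description.
bpftop = {
    "avg_runtime_ns": {"forced_steal": 3593, "no_task_steal": 762},
    "cputime_pct": {"forced_steal": 250.99, "no_task_steal": 40.87},
    "events_per_s": {"forced_steal": 704720, "no_task_steal": 746275},
}

consume_ns = {
    "avg": {"local_llc": 595.9, "remote_llc": 2216.6},
    "p90": {"local_llc": 1952.3, "remote_llc": 5709.0},
    "p99": {"local_llc": 6900.0, "remote_llc": 13023.5},
}

# How much slower a remote-LLC consume is than a local one, per percentile.
remote_vs_local = {
    pct: row["remote_llc"] / row["local_llc"] for pct, row in consume_ns.items()
}

# How lavd_dispatch metrics change with forced stealing versus no stealing.
forced_vs_none = {
    metric: row["forced_steal"] / row["no_task_steal"]
    for metric, row in bpftop.items()
}

if __name__ == "__main__":
    for pct, r in remote_vs_local.items():
        print(f"consume remote/local {pct}: {r:.2f}x")
    for metric, r in forced_vs_none.items():
        print(f"{metric} forced/none: {r:.2f}x")
```

On these samples, a remote-LLC consume is roughly 3.7x slower than a local one on average (about 2.9x at p90 and 1.9x at p99), and lavd_dispatch uses about 6.1x more CPU time with forced stealing while handling slightly fewer events per second.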

In general, cross-CCX/LLC domain migrations are expensive at high utilization and result in lower IPC due to reduced cache locality. Disabling forced task stealing lets us trade some work conservation for better cache behavior and lower scheduler overhead.

@daidavid daidavid requested review from htejun and multics69 October 17, 2025 23:52
Co-authored-by: Claude <claude@anthropic.com>
Signed-off-by: David Dai <david.dai@linux.dev>
@daidavid daidavid force-pushed the lavd_config_mig_threshold branch from 540ba8e to c3ef934 Compare October 22, 2025 03:28
@multics69 multics69 added this pull request to the merge queue Oct 22, 2025
Merged via the queue into sched-ext:main with commit ca9c03d Oct 22, 2025
21 checks passed