scx_bpfland: topology awareness #509
Conversation
Force-pushed from ccc3235 to 9f31f39
While the system is not saturated, the scheduler will use the following strategy to select the next CPU for a task:
- pick the same CPU if it's a full-idle SMT core
- pick any full-idle SMT core in the primary scheduling group that shares the same L2 cache
- pick any full-idle SMT core in the primary scheduling group that shares the same L3 cache
- pick the same CPU (ignoring SMT)
- pick any idle CPU in the primary scheduling group that shares the same L2 cache
- pick any idle CPU in the primary scheduling group that shares the same L3 cache
- pick any idle CPU in the system

When the system is completely saturated (no idle CPUs available), tasks will be dispatched on the first CPU that becomes available.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
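For a concrete picture of the cascade, here is a minimal, self-contained Rust sketch of the selection order described in the commit message above. The actual logic lives in the scheduler's BPF code; every helper used here (smt_core_is_idle, cpu_is_idle, shares_l2, shares_l3, first_idle_cpu) is an illustrative stub, not an scx_bpfland or scx_utils API.

// Minimal sketch of the idle CPU selection cascade described above; all
// helpers are illustrative stubs (not scx_bpfland or scx_utils APIs).
fn smt_core_is_idle(_cpu: usize) -> bool { false }   // stub: is the whole SMT core idle?
fn cpu_is_idle(_cpu: usize) -> bool { false }        // stub: is this CPU idle (ignoring SMT)?
fn shares_l2(_a: usize, _b: usize) -> bool { false } // stub: same L2 cache?
fn shares_l3(_a: usize, _b: usize) -> bool { false } // stub: same L3 cache?
fn first_idle_cpu() -> Option<usize> { None }        // stub: any idle CPU in the system

fn pick_idle_cpu(prev_cpu: usize, primary: &[usize]) -> Option<usize> {
    // 1. Same CPU, if its whole SMT core is idle.
    if smt_core_is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    // 2./3. A full-idle SMT core in the primary domain sharing L2, then L3, with prev_cpu.
    for same_cache in [shares_l2 as fn(usize, usize) -> bool, shares_l3] {
        if let Some(&cpu) = primary.iter().find(|&&c| same_cache(c, prev_cpu) && smt_core_is_idle(c)) {
            return Some(cpu);
        }
    }
    // 4. Same CPU, ignoring its SMT sibling.
    if cpu_is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    // 5./6. An idle CPU in the primary domain sharing L2, then L3, with prev_cpu.
    for same_cache in [shares_l2 as fn(usize, usize) -> bool, shares_l3] {
        if let Some(&cpu) = primary.iter().find(|&&c| same_cache(c, prev_cpu) && cpu_is_idle(c)) {
            return Some(cpu);
        }
    }
    // 7. Any idle CPU in the system; if none, the task waits for the first CPU
    //    that becomes available.
    first_idle_cpu()
}

fn main() {
    // Toy invocation: with the stubs above nothing is ever idle, so this prints None.
    println!("{:?}", pick_idle_cpu(0, &[0, 1, 2, 3]));
}

The key point is the ordering: SMT-idle cores are preferred over merely idle CPUs, and L2 locality is preferred over L3 locality, before falling back to anything idle in the system.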
The primary scheduling domain represents a group of CPUs in the system where the scheduler will initially attempt to assign tasks. Tasks will only be dispatched to CPUs within this primary domain until they are fully utilized, after which tasks may overflow to other available CPUs.

The primary scheduling domain can be defined using the option `--primary-domain CPUMASK` (by default all the CPUs in the system are used as the primary domain). This change introduces two new special values for the CPUMASK argument:
- `performance`: automatically detect the fastest CPUs in the system and use them as the primary scheduling domain,
- `powersave`: automatically detect the slowest CPUs in the system and use them as the primary scheduling domain.

The current logic only supports creating two groups: fast and slow CPUs. The fast CPU group is created by excluding the CPUs with the lowest frequency from the overall set, which means that within the fast CPU group, CPUs may have different maximum frequencies. When using the `performance` mode the fast CPUs will be used as the primary domain, whereas in `powersave` mode the slow CPUs will be used instead.

This option is particularly useful in hybrid architectures (with P-cores and E-cores), as it allows the use of bpfland to prioritize task scheduling on either P-cores or E-cores, depending on the desired performance profile.

Example:
- Dell Precision 5480
- CPU: 13th Gen Intel(R) Core(TM) i7-13800H
- P-cores: 0-11 / max freq: 5.2GHz
- E-cores: 12-19 / max freq: 4.0GHz

$ scx_bpfland --primary-domain performance

  0[||||||||| 24.5%]     10[|||||||| 22.8%]
  1[|||||| 14.9%]        11[||||||||||||| 36.9%]
  2[|||||| 16.2%]        12[ 0.0%]
  3[||||||||| 25.3%]     13[ 0.0%]
  4[||||||||||| 33.3%]   14[ 0.0%]
  5[|||| 9.9%]           15[ 0.0%]
  6[||||||||||| 31.5%]   16[ 0.0%]
  7[||||||| 17.4%]       17[ 0.0%]
  8[|||||||| 23.4%]      18[ 0.0%]
  9[||||||||| 26.1%]     19[ 0.0%]

Avg power consumption: 3.29W

$ scx_bpfland --primary-domain powersave

  0[| 2.5%]              10[ 0.0%]
  1[ 0.0%]               11[ 0.0%]
  2[ 0.0%]               12[|||| 8.0%]
  3[ 0.0%]               13[||||||||||||||||||||| 64.2%]
  4[ 0.0%]               14[|||||||||| 29.6%]
  5[ 0.0%]               15[||||||||||||||||| 52.5%]
  6[ 0.0%]               16[||||||||| 24.7%]
  7[ 0.0%]               17[|||||||||| 30.4%]
  8[ 0.0%]               18[||||||| 22.4%]
  9[ 0.0%]               19[||||| 12.4%]

Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
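As a rough illustration of how such a fast/slow split can be derived, the standalone sketch below reads each CPU's cpuinfo_max_freq from sysfs (as the earlier custom implementation in this PR did, before the switch to scx_utils::Topology) and puts the lowest-frequency CPUs in the slow group and everything else in the fast group. It is a sketch under those assumptions, not the actual scx_bpfland code.

use std::fs;

// Collect (cpu_id, max_freq) pairs by walking /sys/devices/system/cpu/cpu*/cpufreq.
fn cpu_max_freqs() -> std::io::Result<Vec<(usize, u64)>> {
    let mut freqs = Vec::new();
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        // Only look at cpu0, cpu1, ... directories (skip cpufreq, cpuidle, ...).
        if let Some(id) = name.strip_prefix("cpu").and_then(|s| s.parse::<usize>().ok()) {
            if let Ok(s) = fs::read_to_string(path.join("cpufreq/cpuinfo_max_freq")) {
                if let Ok(freq) = s.trim().parse::<u64>() {
                    freqs.push((id, freq));
                }
            }
        }
    }
    Ok(freqs)
}

fn main() -> std::io::Result<()> {
    let freqs = cpu_max_freqs()?;
    // "Slow" CPUs are the ones at the lowest max frequency; everything else is "fast",
    // so the fast group may still contain CPUs with different maximum frequencies.
    let min_freq = freqs.iter().map(|&(_, f)| f).min().unwrap_or(0);
    let slow: Vec<usize> = freqs.iter().filter(|&&(_, f)| f == min_freq).map(|&(id, _)| id).collect();
    let fast: Vec<usize> = freqs.iter().filter(|&&(_, f)| f > min_freq).map(|&(id, _)| id).collect();
    // --primary-domain performance would use the fast set as the primary domain,
    // --primary-domain powersave the slow set.
    println!("fast (performance): {:?}", fast);
    println!("slow (powersave):   {:?}", slow);
    Ok(())
}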
Force-pushed from ccc3235 to f8a2445
scheds/rust/scx_bpfland/src/main.rs (Outdated)
    std::fs::read_to_string(path).map(|content| content.trim().to_string())
}

fn read_cpu_ids(sysfs_path: &str) -> Result<Vec<usize>, std::io::Error> {
Can't it use scx_utils::topology? Is something missing there?
scheds/rust/scx_bpfland/src/main.rs (Outdated)
if path.is_dir() && path.file_name().unwrap().to_str().unwrap_or("").starts_with("cpu") {
    if let Some(cpu_id_str) = path.file_name().unwrap().to_str().unwrap_or("").strip_prefix("cpu") {
        if let Ok(cpu_id) = cpu_id_str.parse::<usize>() {
            let max_freq_path = path.join("cpufreq/cpuinfo_max_freq");
Ditto, I think this is already available in the topology.
yep, I'm planning to use scx_utils::topology here, I just need to repeat some stress tests with cpu hotplugging, because I was able to trigger a sysfs read failure (or similar) with it. I'm going to reproduce, fix, and then rewrite this code to use topology.
Add the L2 / L3 cache id to the Cpu struct, to quickly determine the cache nodes associated with each CPU.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Topology to get CPU and cache information, instead of re-implementing custom methods.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
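To illustrate why carrying the cache ids directly on the per-CPU record is convenient, here is a small self-contained sketch using a mock Cpu type (not the real scx_utils::topology::Cpu): once each CPU record knows its L2 / L3 id, the "CPUs sharing the same cache" sets used by the selection cascade become simple map lookups instead of repeated sysfs walks.

use std::collections::BTreeMap;

// Illustrative mock of a per-CPU record carrying its cache ids, in the spirit
// of the commits above; this is NOT the real scx_utils::topology::Cpu type.
struct Cpu {
    id: usize,
    l2_id: usize,
    l3_id: usize,
}

// Group CPU ids by an arbitrary key (here: L2 or L3 cache id).
fn group_by<K: Ord + Copy>(cpus: &[Cpu], key: impl Fn(&Cpu) -> K) -> BTreeMap<K, Vec<usize>> {
    let mut map: BTreeMap<K, Vec<usize>> = BTreeMap::new();
    for cpu in cpus {
        map.entry(key(cpu)).or_default().push(cpu.id);
    }
    map
}

fn main() {
    // Toy topology: one L3 domain with two L2 domains, plus a second L3 domain.
    let cpus = vec![
        Cpu { id: 0, l2_id: 0, l3_id: 0 },
        Cpu { id: 1, l2_id: 0, l3_id: 0 },
        Cpu { id: 2, l2_id: 1, l3_id: 0 },
        Cpu { id: 3, l2_id: 1, l3_id: 0 },
        Cpu { id: 4, l2_id: 2, l3_id: 1 },
        Cpu { id: 5, l2_id: 2, l3_id: 1 },
    ];
    println!("CPUs by L2: {:?}", group_by(&cpus, |c| c.l2_id));
    println!("CPUs by L3: {:?}", group_by(&cpus, |c| c.l3_id));
}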
@htejun just added a couple of commits on top to use scx_utils::Topology.

@hodgesds FYI, I added two new methods to struct Cpu.
Tested on 9950. Works fine, no regressions found.
No regressions during heavy load
Introduce some concepts of topology awareness to scx_bpfland:
- a configurable primary scheduling domain (`--primary-domain CPUMASK`, including the new automatic `performance` and `powersave` modes)
- L2 / L3 cache awareness when selecting an idle CPU for a task
These changes enable better utilization of hybrid architectures (P-cores / E-cores) and generally improve performance by keeping tasks running on the same cores and caches, thereby enhancing the reuse of their working set.