scx_bpfland: topology awareness #509

Merged: 4 commits into main, Aug 20, 2024

Conversation

arighi (Contributor) commented Aug 19, 2024

Introduce some concepts of topology awareness to scx_bpfland:

  • L2 / L3 cache awareness
  • CPU frequency awareness (via primary scheduling domain)

These changes enable better utilization of hybrid architectures (P-cores / E-cores) and generally improve performance by keeping tasks running on the same cores and caches, thereby enhancing the reuse of their working set.

While the system is not saturated, the scheduler uses the following strategy to select the next CPU for a task:
  - pick the same CPU if it's a full-idle SMT core
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L2 cache
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L3 cache
  - pick the same CPU (ignoring SMT)
  - pick any idle CPU in the primary scheduling group that shares the
    same L2 cache
  - pick any idle CPU in the primary scheduling group that shares the
    same L3 cache
  - pick any idle CPU in the system

When the system is completely saturated (no idle CPUs available), tasks are dispatched to the first CPU that becomes available.
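
As a rough illustration of this fallback order (a sketch only, not the scheduler's actual BPF implementation; all helper and parameter names below are hypothetical), the selection could be expressed as:

/// Hypothetical sketch of the idle-CPU selection order described above.
/// The `core_is_idle` (full-idle SMT core) and `cpu_is_idle` predicates
/// are assumed to be provided by the caller, e.g. from the kernel's idle
/// cpumasks.
fn pick_idle_cpu(
    prev_cpu: usize,
    l2_siblings: &[usize],
    l3_siblings: &[usize],
    primary: &[usize],
    all_cpus: &[usize],
    core_is_idle: impl Fn(usize) -> bool,
    cpu_is_idle: impl Fn(usize) -> bool,
) -> Option<usize> {
    // 1) Same CPU, if its whole SMT core is idle.
    if core_is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    // 2) / 3) Any full-idle SMT core in the primary domain sharing L2, then L3.
    for siblings in [l2_siblings, l3_siblings] {
        if let Some(&cpu) = siblings
            .iter()
            .find(|&&c| primary.contains(&c) && core_is_idle(c))
        {
            return Some(cpu);
        }
    }
    // 4) Same CPU, ignoring the state of its SMT sibling.
    if cpu_is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    // 5) / 6) Any idle CPU in the primary domain sharing L2, then L3.
    for siblings in [l2_siblings, l3_siblings] {
        if let Some(&cpu) = siblings
            .iter()
            .find(|&&c| primary.contains(&c) && cpu_is_idle(c))
        {
            return Some(cpu);
        }
    }
    // 7) Any idle CPU in the system.
    all_cpus.iter().copied().find(|&c| cpu_is_idle(c))
}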

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.

The primary scheduling domain can be defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as the primary domain).
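
A hedged example, assuming the CPUMASK argument is given as a hex bitmask (check the scheduler's help output for the exact accepted format): restricting the primary domain to CPUs 0-3 would then look like

 $ scx_bpfland --primary-domain 0xf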

This change introduces two new special values for the CPUMASK argument:
 - `performance`: automatically detect the fastest CPUs in the system
   and use them as primary scheduling domain,
 - `powersave`: automatically detect the slowest CPUs in the system and
   use them as primary scheduling domain.

The current logic only supports creating two groups: fast and slow CPUs.

The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.

When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.

This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.
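
As a rough sketch of this grouping rule (not the actual bpfland implementation; paths and error handling are simplified, and the corner case where every CPU reports the same maximum frequency is ignored), the split could be derived from sysfs like this:

use std::fs;

/// Illustrative sketch: build the "fast" and "slow" CPU groups from
/// sysfs, putting every CPU whose cpuinfo_max_freq is above the minimum
/// observed value into the fast group.
fn split_by_max_freq() -> std::io::Result<(Vec<usize>, Vec<usize>)> {
    let mut freqs: Vec<(usize, u64)> = Vec::new();

    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        // Only consider the per-CPU directories (cpu0, cpu1, ...).
        if let Some(id) = name.strip_prefix("cpu").and_then(|s| s.parse::<usize>().ok()) {
            if let Ok(s) = fs::read_to_string(path.join("cpufreq/cpuinfo_max_freq")) {
                if let Ok(khz) = s.trim().parse::<u64>() {
                    freqs.push((id, khz));
                }
            }
        }
    }

    // CPUs running at the lowest maximum frequency form the slow group;
    // everything else (possibly with differing max frequencies) is fast.
    let min_freq = freqs.iter().map(|&(_, f)| f).min().unwrap_or(0);
    let (slow, fast): (Vec<_>, Vec<_>) = freqs.iter().partition(|&&(_, f)| f == min_freq);
    Ok((
        fast.into_iter().map(|&(id, _)| id).collect(),
        slow.into_iter().map(|&(id, _)| id).collect(),
    ))
}

As discussed further below, this kind of ad-hoc sysfs parsing was later replaced by scx_utils::Topology in the final version of the PR.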

Example:

 - Dell Precision 5480
   - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
     - P-cores:  0-11 / max freq: 5.2GHz
     - E-cores: 12-19 / max freq: 4.0GHz

 $ scx_bpfland --primary-domain performance

  0[|||||||||                24.5%]  10[||||||||                  22.8%]
  1[||||||                   14.9%]  11[|||||||||||||             36.9%]
  2[||||||                   16.2%]  12[                           0.0%]
  3[|||||||||                25.3%]  13[                           0.0%]
  4[|||||||||||              33.3%]  14[                           0.0%]
  5[||||                      9.9%]  15[                           0.0%]
  6[|||||||||||              31.5%]  16[                           0.0%]
  7[|||||||                  17.4%]  17[                           0.0%]
  8[||||||||                 23.4%]  18[                           0.0%]
  9[|||||||||                26.1%]  19[                           0.0%]

  Avg power consumption: 3.29W

 $ scx_bpfland --primary-domain powersave

  0[|                         2.5%]  10[                           0.0%]
  1[                          0.0%]  11[                           0.0%]
  2[                          0.0%]  12[||||                       8.0%]
  3[                          0.0%]  13[|||||||||||||||||||||     64.2%]
  4[                          0.0%]  14[||||||||||                29.6%]
  5[                          0.0%]  15[|||||||||||||||||         52.5%]
  6[                          0.0%]  16[|||||||||                 24.7%]
  7[                          0.0%]  17[||||||||||                30.4%]
  8[                          0.0%]  18[|||||||                   22.4%]
  9[                          0.0%]  19[|||||                     12.4%]

  Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Review comment (Contributor), on read_cpu_ids():

std::fs::read_to_string(path).map(|content| content.trim().to_string())
}

fn read_cpu_ids(sysfs_path: &str) -> Result<Vec<usize>, std::io::Error> {

Can't it use scx_utils::topology? Is something missing there?

Review comment (Contributor), on the per-CPU max frequency parsing:

if path.is_dir() && path.file_name().unwrap().to_str().unwrap_or("").starts_with("cpu") {
    if let Some(cpu_id_str) = path.file_name().unwrap().to_str().unwrap_or("").strip_prefix("cpu") {
        if let Ok(cpu_id) = cpu_id_str.parse::<usize>() {
            let max_freq_path = path.join("cpufreq/cpuinfo_max_freq");

Ditto, I think this is already available in the topology.

Author reply (arighi):

Yep, I'm planning to use scx_utils::topology here; I just need to repeat some stress tests with CPU hotplugging, because I was able to trigger a sysfs read failure (or similar) with it. I'm going to reproduce, fix, and then rewrite this code to use topology.

Add the L2 / L3 cache IDs to the Cpu struct, to quickly determine the
cache nodes associated with each CPU.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
arighi requested a review from hodgesds on August 20, 2024 at 08:29.

arighi (Contributor, author) commented Aug 20, 2024

@htejun I just added a couple of commits on top to use scx_utils::Topology, instead of re-implementing the same methods in bpfland. I repeated my CPU hotplugging stress tests and I couldn't trigger any error (probably the issue was in my old code, not in Topology).

@hodgesds FYI, I added two new methods to struct Cpu, l2_id() and l3_id(), to get the L2 and L3 cache IDs associated with the CPU, respectively. It shouldn't break any of your stuff, but maybe take a quick look and see if it makes sense. Thanks!
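
For reference, a minimal sketch of how these accessors might be consumed (Topology::new() and the cpus() iteration below are assumptions about the scx_utils API; only l2_id() and l3_id() are taken from this PR):

use anyhow::Result;
use scx_utils::Topology;

// Hedged sketch: print the L2/L3 cache node of every CPU in the system.
// The accessor used to walk the CPUs (cpus() here) is an assumption.
fn print_cache_ids() -> Result<()> {
    let topo = Topology::new()?;
    for (cpu_id, cpu) in topo.cpus() {
        println!("cpu{}: L2 id {}, L3 id {}", cpu_id, cpu.l2_id(), cpu.l3_id());
    }
    Ok(())
}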

@ptr1337 (Contributor) left a comment

Tested on 9950. Works fine, no regressions found.

@sirlucjan (Contributor) left a comment

No regressions during heavy load

arighi merged commit 33b6ada into main on Aug 20, 2024 (2 checks passed).
arighi deleted the bpfland-topology branch on August 20, 2024 at 12:37.