scx_bpfland: topology awareness #509

Merged: 4 commits into main, Aug 20, 2024

Conversation

arighi (Contributor) commented Aug 19, 2024

Introduce some concepts of topology awareness to scx_bpfland:

  • L2 / L3 cache awareness
  • CPU frequency awareness (via primary scheduling domain)

These changes enable better utilization of hybrid architectures (P-cores / E-cores) and generally improve performance by keeping tasks running on the same cores and caches, thereby enhancing the reuse of their working set.

While the system is not saturated, the scheduler uses the following strategy to select the next CPU for a task:
  - pick the same CPU if it's a full-idle SMT core
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L2 cache
  - pick any full-idle SMT core in the primary scheduling group that
    shares the same L3 cache
  - pick the same CPU (ignoring SMT)
  - pick any idle CPU in the primary scheduling group that shares the
    same L2 cache
  - pick any idle CPU in the primary scheduling group that shares the
    same L3 cache
  - pick any idle CPU in the system

When the system is completely saturated (no idle CPUs available), tasks are dispatched to the first CPU that becomes available.
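
As a rough illustration of this fallback order (a sketch only, not the scheduler's actual BPF implementation; all helper and parameter names below are hypothetical), the selection could be expressed as:

/// Hypothetical sketch of the idle-CPU selection order described above.
/// The `core_is_idle` (full-idle SMT core) and `cpu_is_idle` predicates
/// are assumed to be provided by the caller, e.g. from the kernel's idle
/// cpumasks.
fn pick_idle_cpu(
    prev_cpu: usize,
    l2_siblings: &[usize],
    l3_siblings: &[usize],
    primary: &[usize],
    all_cpus: &[usize],
    core_is_idle: impl Fn(usize) -> bool,
    cpu_is_idle: impl Fn(usize) -> bool,
) -> Option<usize> {
    // 1) Same CPU, if its whole SMT core is idle.
    if core_is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    // 2) / 3) Any full-idle SMT core in the primary domain sharing L2, then L3.
    for siblings in [l2_siblings, l3_siblings] {
        if let Some(&cpu) = siblings
            .iter()
            .find(|&&c| primary.contains(&c) && core_is_idle(c))
        {
            return Some(cpu);
        }
    }
    // 4) Same CPU, ignoring the state of its SMT sibling.
    if cpu_is_idle(prev_cpu) {
        return Some(prev_cpu);
    }
    // 5) / 6) Any idle CPU in the primary domain sharing L2, then L3.
    for siblings in [l2_siblings, l3_siblings] {
        if let Some(&cpu) = siblings
            .iter()
            .find(|&&c| primary.contains(&c) && cpu_is_idle(c))
        {
            return Some(cpu);
        }
    }
    // 7) Any idle CPU in the system.
    all_cpus.iter().copied().find(|&c| cpu_is_idle(c))
}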

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
The primary scheduling domain represents a group of CPUs in the system
where the scheduler will initially attempt to assign tasks. Tasks will
only be dispatched to CPUs within this primary domain until they are
fully utilized, after which tasks may overflow to other available CPUs.

The primary scheduling domain can be defined using the option
`--primary-domain CPUMASK` (by default all the CPUs in the system are
used as the primary domain).
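
A hedged example, assuming the CPUMASK argument is given as a hex bitmask (check the scheduler's help output for the exact accepted format): restricting the primary domain to CPUs 0-3 would then look like

 $ scx_bpfland --primary-domain 0xf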

This change introduces two new special values for the CPUMASK argument:
 - `performance`: automatically detect the fastest CPUs in the system
   and use them as primary scheduling domain,
 - `powersave`: automatically detect the slowest CPUs in the system and
   use them as primary scheduling domain.

The current logic only supports creating two groups: fast and slow CPUs.

The fast CPU group is created by excluding CPUs with the lowest
frequency from the overall set, which means that within the fast CPU
group, CPUs may have different maximum frequencies.

When using the `performance` mode the fast CPUs will be used as primary
domain, whereas in `powersave` mode, the slow CPUs will be used instead.

This option is particularly useful in hybrid architectures (with P-cores
and E-cores), as it allows the use of bpfland to prioritize task
scheduling on either P-cores or E-cores, depending on the desired
performance profile.
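
As a rough sketch of this grouping rule (not the actual bpfland implementation; paths and error handling are simplified, and the corner case where every CPU reports the same maximum frequency is ignored), the split could be derived from sysfs like this:

use std::fs;

/// Illustrative sketch: build the "fast" and "slow" CPU groups from
/// sysfs, putting every CPU whose cpuinfo_max_freq is above the minimum
/// observed value into the fast group.
fn split_by_max_freq() -> std::io::Result<(Vec<usize>, Vec<usize>)> {
    let mut freqs: Vec<(usize, u64)> = Vec::new();

    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        // Only consider the per-CPU directories (cpu0, cpu1, ...).
        if let Some(id) = name.strip_prefix("cpu").and_then(|s| s.parse::<usize>().ok()) {
            if let Ok(s) = fs::read_to_string(path.join("cpufreq/cpuinfo_max_freq")) {
                if let Ok(khz) = s.trim().parse::<u64>() {
                    freqs.push((id, khz));
                }
            }
        }
    }

    // CPUs running at the lowest maximum frequency form the slow group;
    // everything else (possibly with differing max frequencies) is fast.
    let min_freq = freqs.iter().map(|&(_, f)| f).min().unwrap_or(0);
    let (slow, fast): (Vec<_>, Vec<_>) = freqs.iter().partition(|&&(_, f)| f == min_freq);
    Ok((
        fast.into_iter().map(|&(id, _)| id).collect(),
        slow.into_iter().map(|&(id, _)| id).collect(),
    ))
}

As discussed further below, this kind of ad-hoc sysfs parsing was later replaced by scx_utils::Topology in the final version of the PR.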

Example:

 - Dell Precision 5480
   - CPU: 13th Gen Intel(R) Core(TM) i7-13800H
     - P-cores:  0-11 / max freq: 5.2GHz
     - E-cores: 12-19 / max freq: 4.0GHz

 $ scx_bpfland --primary-domain performance

  0[|||||||||                24.5%]  10[||||||||                  22.8%]
  1[||||||                   14.9%]  11[|||||||||||||             36.9%]
  2[||||||                   16.2%]  12[                           0.0%]
  3[|||||||||                25.3%]  13[                           0.0%]
  4[|||||||||||              33.3%]  14[                           0.0%]
  5[||||                      9.9%]  15[                           0.0%]
  6[|||||||||||              31.5%]  16[                           0.0%]
  7[|||||||                  17.4%]  17[                           0.0%]
  8[||||||||                 23.4%]  18[                           0.0%]
  9[|||||||||                26.1%]  19[                           0.0%]

  Avg power consumption: 3.29W

 $ scx_bpfland --primary-domain powersave

  0[|                         2.5%]  10[                           0.0%]
  1[                          0.0%]  11[                           0.0%]
  2[                          0.0%]  12[||||                       8.0%]
  3[                          0.0%]  13[|||||||||||||||||||||     64.2%]
  4[                          0.0%]  14[||||||||||                29.6%]
  5[                          0.0%]  15[|||||||||||||||||         52.5%]
  6[                          0.0%]  16[|||||||||                 24.7%]
  7[                          0.0%]  17[||||||||||                30.4%]
  8[                          0.0%]  18[|||||||                   22.4%]
  9[                          0.0%]  19[|||||                     12.4%]

  Avg power consumption: 2.17W

(Info collected from htop and turbostat)

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Review comment (Contributor), on read_cpu_ids():

std::fs::read_to_string(path).map(|content| content.trim().to_string())
}

fn read_cpu_ids(sysfs_path: &str) -> Result<Vec<usize>, std::io::Error> {

Can't it use scx_utils::topology? Is something missing there?

Review comment (Contributor), on the per-CPU max frequency parsing:

if path.is_dir() && path.file_name().unwrap().to_str().unwrap_or("").starts_with("cpu") {
    if let Some(cpu_id_str) = path.file_name().unwrap().to_str().unwrap_or("").strip_prefix("cpu") {
        if let Ok(cpu_id) = cpu_id_str.parse::<usize>() {
            let max_freq_path = path.join("cpufreq/cpuinfo_max_freq");

Ditto, I think this is already available in the topology.

Author reply (arighi):

Yep, I'm planning to use scx_utils::topology here; I just need to repeat some stress tests with CPU hotplugging, because I was able to trigger a sysfs read failure (or similar) with it. I'm going to reproduce, fix, and then rewrite this code to use topology.

Add the L2 / L3 cache IDs to the Cpu struct, to quickly determine the
cache nodes associated with each CPU.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Rely on scx_utils::Topology to get CPU and cache information, instead of
re-implementing custom methods.

Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
arighi requested a review from hodgesds on August 20, 2024 at 08:29.

arighi (Contributor, author) commented Aug 20, 2024

@htejun I just added a couple of commits on top to use scx_utils::Topology, instead of re-implementing the same methods in bpfland. I repeated my CPU hotplugging stress tests and I couldn't trigger any error (probably the issue was in my old code, not in Topology).

@hodgesds FYI, I added two new methods to struct Cpu, l2_id() and l3_id(), to get the L2 and L3 cache IDs associated with the CPU, respectively. It shouldn't break any of your stuff, but maybe take a quick look and see if it makes sense. Thanks!
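
For reference, a minimal sketch of how these accessors might be consumed (Topology::new() and the cpus() iteration below are assumptions about the scx_utils API; only l2_id() and l3_id() are taken from this PR):

use anyhow::Result;
use scx_utils::Topology;

// Hedged sketch: print the L2/L3 cache node of every CPU in the system.
// The accessor used to walk the CPUs (cpus() here) is an assumption.
fn print_cache_ids() -> Result<()> {
    let topo = Topology::new()?;
    for (cpu_id, cpu) in topo.cpus() {
        println!("cpu{}: L2 id {}, L3 id {}", cpu_id, cpu.l2_id(), cpu.l3_id());
    }
    Ok(())
}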

@ptr1337 (Contributor) left a comment

Tested on 9950. Works fine, no regressions found.

@sirlucjan (Contributor) left a comment

No regressions during heavy load

arighi merged commit 33b6ada into main on Aug 20, 2024 (2 checks passed).
arighi deleted the bpfland-topology branch on August 20, 2024 at 12:37.