Add CPU frequency related helpers and extend scx_layered #239

hodgesds · 2024-04-24T01:11:50Z

This change adds scx_bpf_cpuperf_cap, scx_bpf_cpuperf_cur and scx_bpf_cpuperf_set definitions that were recently introduced into sched_ext. It adds a perf field to scx_layered to allow for controlling performance per layer.

hodgesds · 2024-04-24T01:13:39Z

scheds/rust/scx_layered/src/bpf/main.bpf.c

@@ -560,30 +560,42 @@ void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)

 	/* consume preempting layers first */
 	bpf_for(idx, 0, nr_layers)
-		if (layers[idx].preempt && scx_bpf_consume(idx))


I'm not sure if layered_dispatch is the right place to put this, so please let me know if there is a more suitable place. Probably could have cleaned up some of the redundant checks as well.

I think we'll want to put this in layered_running(). That callback is invoked just before a task starts running on a CPU, so we can look at what layer it's in and apply the frequency tuning accordingly.

As an aside, layered_dispatch() isn't the only place where we can choose where a task will run. layered_select_cpu(), which is called on the task wakeup (and task fork) path, decide which CPU a task should be migrated to before being enqueued. We can "directly dispatch" to local DSQs from there, which is conceptually the same thing as calling scx_bpf_consume(). See scx_bpf_dispatch(p, SCX_DSQ_LOCAL, ...). In the former case, we're putting a task directly on the local DSQ where the core kernel will pull the task from to run, and in the latter case, we're taking a task from another DSQ and putting it on the local DSQ. The end result is the same -- we're specifying a task that should run next on that CPU. By instead just looking up a task's layer from layered_running(), we're guaranteed to hit all the cases.

If we want to speed up the context switch path, we could also reset the frequency in layered_stopping(), but that could have the downside of accidentally throttling a CPU that's running a high-pri layer task.

hodgesds · 2024-04-24T01:21:59Z

Test setup:

$ cat /proc/cpuinfo | grep 'model name' | head -n 1
model name      : Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c 
     80 schedutil

Config:

[
  {
    "name": "batch",
    "comment": "tasks under system.slice or tasks with nice value > 0",
    "matches": [
      [
        {
          "CgroupPrefix": "system.slice/"
        }
      ],
      [
        {
          "NiceAbove": 0
        }
      ]
    ],
    "kind": {
      "Confined": {
        "util_range": [
          0.01,
          1.0
        ],
        "cpus_range": [
          0,
          80
        ],
        "min_exec_us": 1000,
        "perf": 128
      }
    }
  },
  {
    "name": "normal",
    "comment": "the rest",
    "matches": [
      []
    ],
    "kind": {
      "Open": {
        "min_exec_us": 100,
        "preempt": true,
        "exclusive": true,
        "perf": 128
      }
    }
  }
]

The image shows it was able to clock down the frequency effectively.

Byte-Lab · 2024-04-24T01:51:35Z

scheds/rust/scx_layered/src/bpf/main.bpf.c

+			if (layer->perf > 0)
+				scx_bpf_cpuperf_set(cpu, layer->perf);


So in lookup_layer_cpumask(), we invoke scx_bpf_error() if we fail to lookup the layer's cpumask. scx_bpf_error() will cause the main kernel to boot out the scheduler, so in general it's not typically useful to do any extra work on error paths, such as e.g. setting the frequency here.

Byte-Lab · 2024-04-24T02:15:25Z

scheds/rust/scx_layered/src/main.rs

+        #[serde(default)]
+        perf: u64,


So this is certainly one option for how to expose this knob to configurations. Another option would be to expose this as something like QoS, and have the scheduler implementation abstract how that's eventually applied with scx_bpf_cpuperf_set(). A few things to consider:

Say we have a layer that has very relaxed runtime requirements, and one with very sensitive latency requirements. If both layers are occupying CPUs on the system, it may be desirable to run the low-pri layer at a low frequency if the goal is to allow the high-pri layer running on other cores to run hotter. On the other hand, if the high-pri layer isn't running at that moment, we'd ideally like to run the lower-pri layer at a higher frequency given that it won't affect the high-pri layer. If we directly passed perf through the config like this, it would cause a low-pri layer to always run with low perf even if there's nothing else on the system for it to contend with. Maybe that's OK (or possibly even desirable given how predictable the behavior would be) for a configuration-driven scheduler which really isn't meant to be a general purpose scheduling solution, but it also does feel a bit suboptimal.

I might be getting a bit ahead of myself here, but if we ever want to have layered run on asym chips like ARM big.LITTLE, just specifying perf on its own wouldn't be sufficient. Maybe having QoS would be, and the scheduler could internally use scx_bpf_cpuperf_set(), scx_bpf_cpuperf_cap(), and scx_bpf_cpuperf_cur() to find the right CPU and frequency for it? Not sure, maybe we'd need new configs.

I expect it would be difficult for a user to know what value to choose for perf. It seems like what a user really wants to say is, "this layer is super important and should be boosted". Maybe it would even be as simple as keeping a cpumask that has bits set for every CPU running a high-pri task. If the weight of that cpumask is nonzero, then we'd know that we (may) need to throttle any non-high-pri tasks.

What do you think?

I think it'd be better to expose the raw number. Providing an interface which has higher level semantics requires a lot more knowledge than is currently available in the stack. We just don't have enough context to determine what a semantical QoS level should map to on a given hardware (e.g. the desktop processors I'm testing have only three or so perf levels and they aren't evenly spaced and I have no idea what the perf / power ratio is at each point).

Also, given how scx_layered is for hand crafted boutique setups, abstraction here can be more of a hurdle than being useful.

Yeah, I mostly went with the raw number out of pure laziness of being able to quickly test things without thinking about the API. At least with the uclamp API there are buckets, which could be interesting think about. Maybe another higher level abstraction would help with that, I'm not sure.

My only other thought is what is a reasonable default? I was thinking if the (default) value is 0 then don't apply any frequency scaling and let the frequency governor/driver handle scaling.

Byte-Lab · 2024-04-24T02:27:23Z

scheds/rust/scx_layered/src/bpf/main.bpf.c

@@ -560,30 +560,42 @@ void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)

 	/* consume preempting layers first */
 	bpf_for(idx, 0, nr_layers)
-		if (layers[idx].preempt && scx_bpf_consume(idx))


I think we'll want to put this in layered_running(). That callback is invoked just before a task starts running on a CPU, so we can look at what layer it's in and apply the frequency tuning accordingly.

As an aside, layered_dispatch() isn't the only place where we can choose where a task will run. layered_select_cpu(), which is called on the task wakeup (and task fork) path, decide which CPU a task should be migrated to before being enqueued. We can "directly dispatch" to local DSQs from there, which is conceptually the same thing as calling scx_bpf_consume(). See scx_bpf_dispatch(p, SCX_DSQ_LOCAL, ...). In the former case, we're putting a task directly on the local DSQ where the core kernel will pull the task from to run, and in the latter case, we're taking a task from another DSQ and putting it on the local DSQ. The end result is the same -- we're specifying a task that should run next on that CPU. By instead just looking up a task's layer from layered_running(), we're guaranteed to hit all the cases.

If we want to speed up the context switch path, we could also reset the frequency in layered_stopping(), but that could have the downside of accidentally throttling a CPU that's running a high-pri layer task.

This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and `scx_bpf_cpuperf_set` definitions that were recently introduced into [`sched_ext`](sched-ext/sched_ext#180). It adds a `perf` field to `scx_layered` to allow for controlling performance per layer. Signed-off-by: Daniel Hodges <hodges.daniel.scott@gmail.com>

Byte-Lab · 2024-04-24T18:36:29Z

Should we add logic to account for running on kernels that don't have scx_bpf_cpuperf_set()?

Edit: Working on this now

hodgesds commented Apr 24, 2024

View reviewed changes

hodgesds changed the title ~~Adds CPU frequency related helpers and extend scx_layered~~ Add CPU frequency related helpers and extend scx_layered Apr 24, 2024

Byte-Lab reviewed Apr 24, 2024

View reviewed changes

hodgesds force-pushed the cpufreq_helpers branch from 7ed200b to 32e97bf Compare April 24, 2024 14:28

htejun approved these changes Apr 24, 2024

View reviewed changes

htejun merged commit 9a9b4dd into sched-ext:main Apr 24, 2024

hodgesds deleted the cpufreq_helpers branch May 1, 2024 22:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CPU frequency related helpers and extend scx_layered #239

Add CPU frequency related helpers and extend scx_layered #239

hodgesds commented Apr 24, 2024

hodgesds Apr 24, 2024

Byte-Lab Apr 24, 2024

hodgesds commented Apr 24, 2024

Byte-Lab Apr 24, 2024

Byte-Lab Apr 24, 2024

htejun Apr 24, 2024

hodgesds Apr 24, 2024

Byte-Lab Apr 24, 2024

Byte-Lab commented Apr 24, 2024 •

edited

Add CPU frequency related helpers and extend scx_layered #239

Add CPU frequency related helpers and extend scx_layered #239

Conversation

hodgesds commented Apr 24, 2024

hodgesds Apr 24, 2024

Choose a reason for hiding this comment

Byte-Lab Apr 24, 2024

Choose a reason for hiding this comment

hodgesds commented Apr 24, 2024

Byte-Lab Apr 24, 2024

Choose a reason for hiding this comment

Byte-Lab Apr 24, 2024

Choose a reason for hiding this comment

htejun Apr 24, 2024

Choose a reason for hiding this comment

hodgesds Apr 24, 2024

Choose a reason for hiding this comment

Byte-Lab Apr 24, 2024

Choose a reason for hiding this comment

Byte-Lab commented Apr 24, 2024 • edited

Byte-Lab commented Apr 24, 2024 •

edited