layered: Fix static configuration, and dispatch for Grouped layers #148

Merged 3 commits from the layered_fixes branch into main on Feb 21, 2024

Conversation

Byte-Lab (Contributor)

  • For static configurations of confined or grouped layers, we weren't properly propagating them to the kernel.
  • We weren't trying to consume from the proper layer for grouped layers in layered_dispatch.

We currently have a bug in layered wherein we could fail to propagate
layer updates from user space to kernel space if a layer is never
adjusted after it's first initialized. For example, in the following
configuration:

[
	{
		"name": "workload.slice",
		"comment": "main workload slice",
		"matches": [
			[
				{
					"CgroupPrefix": "workload.slice/"
				}
			]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [30, 30],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	},
	{
		"name": "normal",
		"comment": "the rest",
		"matches": [
			[]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [2, 2],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	}
]

Both layers are static, and need only be resized a single time, so the
configuration would never be propagated to the kernel due to us never
calling update_bpf_layer_cpumask(). Let's instead have the
initialization propagate changes to the skeleton before we attach the
scheduler.

This has two advantages: it fixes the bug mentioned above where a static
configuration is never propagated to the kernel, and it avoids the short
window right after the scheduler is attached during which we would make
suboptimal scheduling decisions because the layer resizing had not yet
been propagated.

Signed-off-by: David Vernet <void@manifault.com>
In layered_init, we're currently setting all bits in every layer's
cpumask, and then asynchronously updating the cpumasks at a later time to
reflect their actual values at runtime. Now that we're updating the
layered code to initialize the cpumasks before we attach the scheduler,
we can instead have the init path refresh and initialize the cpumasks
directly.

Signed-off-by: David Vernet <void@manifault.com>
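To make the shape of that change concrete, here is a minimal BPF-side sketch of per-layer cpumask storage and the init-time allocation. The layer_cpumasks map name and the cpumask-wrapper pattern are taken from the diff context and verifier log further down; the header, the wrapper struct name, MAX_LAYERS, and the helper name are illustrative assumptions rather than the exact scx_layered code.

/*
 * Minimal sketch (not the exact scx_layered code): one struct bpf_cpumask
 * kptr per layer, stored in an array map and allocated from the init path.
 */
#include <scx/common.bpf.h>	/* assumed sched_ext BPF header */

#define MAX_LAYERS 16		/* illustrative bound */

struct layer_cpumask_wrapper {	/* hypothetical wrapper name */
	struct bpf_cpumask __kptr *cpumask;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_LAYERS);
	__type(key, u32);
	__type(value, struct layer_cpumask_wrapper);
} layer_cpumasks SEC(".maps");

/*
 * Called from the init path: allocate the layer's cpumask up front so its
 * bits can be populated from the configuration before the scheduler is
 * attached, instead of setting all bits and waiting for a later refresh.
 */
static int init_layer_cpumask(u32 idx)
{
	struct layer_cpumask_wrapper *cpumaskw;
	struct bpf_cpumask *cpumask;

	cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);
	if (!cpumaskw)
		return -ENOENT;

	cpumask = bpf_cpumask_create();
	if (!cpumask)
		return -ENOMEM;

	/* Publish the new cpumask; release any previously stored one. */
	cpumask = bpf_kptr_xchg(&cpumaskw->cpumask, cpumask);
	if (cpumask)
		bpf_cpumask_release(cpumask);

	return 0;
}

Populating the bits from the layer configuration is sketched in the RCU discussion further below.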
Currently, in layered_dispatch, we do the following:

1. Iterate over all preempt=true layers, and first try to consume from
   them.

2. Iterate over all confined layers, and consume from them if we find a
   layer with a cpumask that contains the consuming CPU.

3. Iterate over all grouped and open layers and consume from them in
   ordered sequence.

In (2), we're only iterating over confined layers, but we should also be
iterating over grouped layers. Otherwise, despite a consuming CPU being
allocated to a specific grouped layer, the CPU will consume from
whichever grouped or open layer has a task that's ready to run.

Signed-off-by: David Vernet <void@manifault.com>
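A rough sketch of that consume order, building on the per-layer cpumask map sketched above: scx_bpf_consume(), bpf_for(), bpf_cpumask_test_cpu(), and BPF_STRUCT_OPS() are real sched_ext/BPF interfaces, but the layer struct, the layers[] array, nr_layers, the dispatch function name, and the cpumask lookup helper are simplified stand-ins for the actual scx_layered data structures.

/*
 * Sketch of the dispatch ordering described above. Per-layer DSQ ids are
 * assumed to equal the layer index, which is a simplification.
 */
struct layer {
	bool preempt;
	bool open;	/* Open layers; Confined and Grouped are !open */
};

struct layer layers[MAX_LAYERS];	/* populated by user space via the skeleton */
u32 nr_layers;

static const struct cpumask *lookup_layer_cpumask(u32 idx)
{
	struct layer_cpumask_wrapper *cpumaskw;

	cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);
	/* Cast of the kptr is simplified here; see the RCU discussion below
	 * for the care this access actually needs. */
	return cpumaskw ? (const struct cpumask *)cpumaskw->cpumask : NULL;
}

void BPF_STRUCT_OPS(sketch_dispatch, s32 cpu, struct task_struct *prev)
{
	int idx;

	/* 1. Preempting layers get first shot at the CPU. */
	bpf_for(idx, 0, nr_layers) {
		if (idx >= MAX_LAYERS)
			break;
		if (layers[idx].preempt && scx_bpf_consume(idx))
			return;
	}

	/*
	 * 2. Confined *and* grouped layers that own this CPU: if the CPU is
	 *    in the layer's cpumask, consume from that layer. Previously
	 *    only confined layers were considered here.
	 */
	bpf_for(idx, 0, nr_layers) {
		const struct cpumask *layer_cpumask;

		if (idx >= MAX_LAYERS || layers[idx].open)
			continue;
		layer_cpumask = lookup_layer_cpumask(idx);
		if (layer_cpumask && bpf_cpumask_test_cpu(cpu, layer_cpumask) &&
		    scx_bpf_consume(idx))
			return;
	}

	/* 3. Finally, grouped and open layers in ordered sequence. */
	bpf_for(idx, 0, nr_layers) {
		if (scx_bpf_consume(idx))
			return;
	}
}

The key change is in step 2: grouped layers are now checked alongside confined ones, so a CPU owned by a grouped layer prefers that layer's queue before falling through to step 3.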
Byte-Lab requested a review from htejun on February 21, 2024 at 21:40.
return;

cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);

bpf_rcu_read_lock();
htejun (Contributor)
What's this protecting?

Byte-Lab (Contributor, Author), Feb 21, 2024

Without the rcu region, the verifier doesn't like the cpumask kfunc arg:

; bpf_for(cpu, 0, nr_possible_cpus) {                                                                                                                                                    
441: (3e) if w1 >= w2 goto pc+23      ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R2_w=32 refs=9,28                                              
; if (!cpumaskw || !cpumaskw->cpumask) {                                                                                                                                                 
442: (15) if r6 == 0x0 goto pc+46     ; frame1: R6=map_value(off=0,ks=4,vs=8,imm=0) refs=9,28                                                                                            
; if (!cpumaskw || !cpumaskw->cpumask) {
443: (79) r2 = *(u64 *)(r6 +0)        ; frame1: R2_w=untrusted_ptr_or_null_bpf_cpumask(id=30,off=0,imm=0) R6=map_value(off=0,ks=4,vs=8,imm=0) refs=9,28
; if (!cpumaskw || !cpumaskw->cpumask) {
444: (15) if r2 == 0x0 goto pc+44     ; frame1: R2_w=untrusted_ptr_bpf_cpumask(off=0,imm=0) refs=9,28
; 
445: (bc) w3 = w1                     ; frame1: R1_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R3_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) refs=9,28
446: (74) w3 >>= 3                    ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3)) refs=9,28
447: (bf) r4 = r7                     ; frame1: R4_w=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055280,var_off=(0x0; 0x183f78)) R7=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055280,var_off=(0x0; 0x183f78)) refs=9,28
448: (0f) r4 += r3                    ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3)) R4_w=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055283,var_off=(0x0; 0x183f7b)) refs=9,28
449: (07) r4 += 527572                ; frame1: R4_w=map_value(off=527572,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055283,var_off=(0x0; 0x183f7b)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
450: (bc) w3 = w1                     ; frame1: R1_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R3_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) refs=9,28
451: (54) w3 &= 7                     ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
452: (71) r4 = *(u8 *)(r4 +0)         ; frame1: R4_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
453: (7c) w4 >>= w3                   ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) R4_w=scalar() refs=9,28
454: (54) w4 &= 1                     ; frame1: R4=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=9,28
455: (16) if w4 == 0x0 goto pc+7      ; frame1: R4=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=9,28
; bpf_cpumask_set_cpu(cpu, cpumaskw->cpumask); 
456: (85) call bpf_cpumask_set_cpu#36671
R2 must be a rcu pointer
processed 237 insns (limit 1000000) max_states_per_insn 1 total_states 24 peak_states 24 mark_read 9
-- END PROG LOAD LOG --

Otherwise, cpumaskw->cpumask could be freed out from under it on the ops.init() path. I'm kinda confused though why it's safe to pass cpumaskw->cpumask to the kfunc directly. I'd have expected us to need to put that into a local variable.
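
For reference, here is an approximate reconstruction of the loop those log lines correspond to, reusing the map and wrapper sketched earlier. The function name and parameter shape are assumptions based on the source lines echoed in the log (nr_possible_cpus, the per-layer u8 bytemask, and the null checks); the point is that the cpumask kptr loaded from the map is only dereferenced and passed to the cpumask kfuncs inside bpf_rcu_read_lock()/bpf_rcu_read_unlock(), which the sleepable ops.init() path requires.

/*
 * Approximate reconstruction of the cpumask refresh loop. Since ops.init()
 * is sleepable, cpumaskw->cpumask is only valid as an RCU pointer, so the
 * whole update runs inside an explicit RCU read-side critical section.
 */
static void refresh_layer_cpumask(u32 idx, const u8 *layer_cpus)
{
	struct layer_cpumask_wrapper *cpumaskw;
	int cpu;

	cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);

	bpf_rcu_read_lock();
	if (!cpumaskw || !cpumaskw->cpumask) {
		bpf_rcu_read_unlock();
		return;
	}

	bpf_for(cpu, 0, nr_possible_cpus) {
		const u8 *u8_ptr = &layer_cpus[cpu / 8];

		/* Mirror the configured per-layer CPU bytemask into the
		 * struct bpf_cpumask. */
		if (*u8_ptr & (1 << (cpu % 8)))
			bpf_cpumask_set_cpu(cpu, cpumaskw->cpumask);
		else
			bpf_cpumask_clear_cpu(cpu, cpumaskw->cpumask);
	}
	bpf_rcu_read_unlock();
}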

htejun (Contributor)

I see. The existing code is lacking the needed protection then. Hmm... why was it even loading?

Byte-Lab (Contributor, Author)

I believe it's because it was previously only being called on paths with interrupts disabled. ops.init() is sleepable, so it explicitly needs the extra RCU protection.

Byte-Lab merged commit ebce76b into main on Feb 21, 2024 (1 check passed).
Byte-Lab deleted the layered_fixes branch on February 21, 2024 at 22:12.