
scx_flatcg: introduce CGROUP_MAX_RETRIES #80

Merged · 1 commit into main · Jan 10, 2024
Conversation

arighi (Collaborator) commented Jan 10, 2024

We may end up stalling for too long in fcg_dispatch() if try_pick_next_cgroup() doesn't find another valid cgroup to pick. This can be quite risky, considering that we are holding the rq lock in dispatch().

This condition can be reproduced easily in our CI, where we can trigger stalling softirq works:

[ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Or rcu stalls:

[ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[ 47.731900] Sending NMI from CPU 0 to CPUs 1:

To mitigate this issue, reduce the number of try_pick_next_cgroup() retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).


Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
htejun (Contributor) commented Jan 10, 2024

This shouldn't have happened, but yeah, MAX_LOOPS is dangerous there. I'll follow up with a root-cause fix.

@htejun htejun merged commit ae50b15 into main Jan 10, 2024
2 checks passed
@htejun htejun deleted the scx-flatcg-mitigate-stall branch January 10, 2024 19:51
arighi (Collaborator, Author) commented Jan 10, 2024

> This shouldn't have happened but yeah MAX_LOOP is dangerous there. I'll follow up with root cause fix.

Maybe it only happens in our particular CI environment, which has just the root cgroup, and on a more "regular" system everything is fine. But it's still good to prevent the issue from happening.
