
scx_flatcg: introduce CGROUP_MAX_RETRIES #80

Merged · 1 commit into main · Jan 10, 2024
Conversation

arighi (Collaborator) commented Jan 10, 2024

We may end up stalling for too long in fcg_dispatch() if try_pick_next_cgroup() doesn't find another valid cgroup to pick. This can be quite risky, considering that we are holding the rq lock in dispatch().

This condition can be reproduced easily in our CI, where we can trigger stalling softirq works:

[ 4.972926] NOHZ tick-stop error: local softirq work is pending, handler #200!!!

Or rcu stalls:

[ 47.731900] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 47.731900] rcu: 1-...!: (0 ticks this GP) idle=b29c/1/0x4000000000000000 softirq=2204/2204 fqs=0
[ 47.731900] rcu: 3-...!: (0 ticks this GP) idle=db74/1/0x4000000000000000 softirq=2286/2286 fqs=0
[ 47.731900] rcu: (detected by 0, t=26002 jiffies, g=6029, q=54 ncpus=4)
[ 47.731900] Sending NMI from CPU 0 to CPUs 1:

To mitigate this issue, reduce the number of try_pick_next_cgroup() retries from BPF_MAX_LOOPS (8M) to CGROUP_MAX_RETRIES (1024).


Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
htejun (Contributor) commented Jan 10, 2024

This shouldn't have happened, but yeah, MAX_LOOPS is dangerous there. I'll follow up with a root-cause fix.

@htejun htejun merged commit ae50b15 into main Jan 10, 2024
2 checks passed
@htejun htejun deleted the scx-flatcg-mitigate-stall branch January 10, 2024 19:51
arighi (Collaborator, Author) commented Jan 10, 2024

> This shouldn't have happened but yeah MAX_LOOP is dangerous there. I'll follow up with root cause fix.

Maybe it only happens in our particular CI environment, which has just the root cgroup, and on a more "regular" system everything is fine. But it's still good to prevent the issue from happening.
