rusty: Rework deadline as a signed sum #309

Merged · 2 commits merged into main from rusty_improved_dl on Jul 25, 2024

Conversation

@Byte-Lab (Contributor) commented May 23, 2024

Currently, a task's deadline is computed as its vtime + a scaled function of its average runtime (with its deadline being scaled down if it's more interactive). This makes sense intuitively, as we do want an interactive task to have an earlier deadline, but it also has some flaws.

For one thing, we're currently ignoring duty cycle when determining a task's deadline. This has a few implications. Firstly, because we reward tasks with higher waker and blocked frequencies on the assumption that they're part of a work chain, we implicitly penalize tasks that rarely use the CPU, because their frequencies are low. While those tasks are likely not part of a work chain, they should still get an interactivity boost simply by virtue of not using the CPU very often. This should in theory be addressed by vruntime, but because we cap the amount of vtime that a task can accumulate to one slice, it may not be adequately reflected after a task runs for the first time.

Another problem is that while we minimize a task's deadline if it's interactive, we don't really penalize a task that's a heavy CPU hog by increasing its deadline. We do a bit of this by applying a higher niceness, which gives the task a higher deadline for a lower weight, but the effect is somewhat minimal considering that we're using niceness, and that the best an interactive task can do is minimize its deadline to near zero relative to its vtime.

What we really want to do is "negatively" scale an interactive task's deadline with the same magnitude as we "positively" scale a CPU-hogging task's deadline. To do this, we make two major changes to how we compute deadline:

  1. Instead of using niceness, we now use our own straightforward scaling factor. This was chosen somewhat arbitrarily (by experimenting with some games) to be a scaling by 1000, but we can and should improve this in the future.

  2. We now create a signed linear latency priority factor as a sum of the three following inputs:

    • Work-chain factor (log_2 of product of blocked freq and waker freq)
    • Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle -- higher duty cycle means lower factor)
    • Average runtime factor (Higher avg runtime means higher average runtime factor)

We then compute the latency priority as:

lat_prio := average runtime factor - (work-chain factor + inverse duty cycle factor)

This gives us a signed value that can be negative. With this, we can compute a non-negative weight value by calculating a weight from the absolute value of lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive, it's the task's vtime PLUS scaled slice_ns.

This ends up working well because you get a higher weight both for highly interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets you scale a task's deadline "more negatively" for interactive tasks, and "more positively" for the CPU hogs.
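To make the arithmetic above concrete, here is a minimal sketch in plain C rather than the actual scx_rusty BPF code; the helper names, the weight mapping, and the normalization constants are illustrative assumptions, since the PR only states that a non-negative weight is derived from the absolute value of lat_prio and used to scale slice_ns.

#include <stdint.h>
#include <stdlib.h>

/* Integer log2, standing in for the scheduler's bpf_log2l() helper. */
int64_t ilog2_u64(uint64_t v)
{
	int64_t r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Compute a deadline from the three factors described above. The weight
 * mapping and the normalization below are illustrative assumptions.
 */
uint64_t task_deadline(uint64_t vtime, uint64_t slice_ns,
		       uint64_t blocked_freq, uint64_t waker_freq,
		       uint64_t inv_dcycle, int64_t avg_run_factor)
{
	uint64_t prod = blocked_freq * waker_freq;
	/* Work-chain factor: log2 of the product of blocked and waker freq. */
	int64_t chain_factor = ilog2_u64(prod ? prod : 1);
	/* Inverse duty cycle factor: higher duty cycle means a lower factor. */
	int64_t inv_dc_factor = ilog2_u64(inv_dcycle ? inv_dcycle : 1);

	/* Signed latency priority: long runtime pushes the deadline later,
	 * interactivity (work chains, low duty cycle) pulls it earlier. */
	int64_t lat_prio = avg_run_factor - (chain_factor + inv_dc_factor);

	/* Non-negative weight from |lat_prio|, used to scale slice_ns. */
	uint64_t weight = (uint64_t)llabs(lat_prio) * 1000;
	uint64_t scaled = slice_ns * weight / 100000;

	/* Interactive (negative lat_prio): earlier deadline; CPU hog: later.
	 * Underflow clamping is omitted for brevity. */
	return lat_prio < 0 ? vtime - scaled : vtime + scaled;
}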

With this change, we get a significant improvement in FPS. On a 7950X, if I run the following workload:

$ stress-ng -c $((8 * $(nproc)))

  1. I get 60 FPS when playing Stellaris (while time is progressing at max speed), whereas EEVDF gets 6-7 FPS.

  2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. With EEVDF, the Civ6 benchmark doesn't even get past the initial frame after over 4 minutes, whereas rusty gets 13s / turn.

  3. It seems that EEVDF has improved with Terraria in v6.9. It was able to maintain ~30-55 FPS, as opposed to the ~5-10 FPS we've seen in the past. rusty is still able to maintain a solid 60-62 FPS consistently with no problem, however.

@Byte-Lab force-pushed the rusty_improved_dl branch 2 times, most recently from 8a22d24 to 9198488 on May 23, 2024 05:27
@Byte-Lab (Contributor, Author) commented

The multiple pushes were from adjusting the sched_prio_to_latency_weight() multiplier. 1000 seems to work best for now, but it's definitely something to improve further.

@arighi (Collaborator) left a comment

Nice work, I see a massive improvement in responsiveness when the system is overloaded!

I left a couple of comments, mostly about clarifying the code, but definitely approved!

task_dcycle = 0;
else
task_dcycle = DL_FULL_DCYCLE - task_dcycle;
dcycle_linear = bpf_log2l(max(task_dcycle, 1));

task_dcycle is actually the inverse duty cycle here, or the "idle duty cycle" IIUC, maybe we should rename the variable (i.e., inv_dcycle_linear?) or add a comment to clarify this?
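For illustration only, here is a plain-C sketch of what the suggested rename could look like; the value of DL_FULL_DCYCLE and the guard condition are assumptions (the quoted hunk cuts off the original if condition), and the open-coded log2 stands in for bpf_log2l().

#include <stdint.h>

#define DL_FULL_DCYCLE 1024ULL	/* assumed fixed-point value for a 100% duty cycle */

/* log2 of the *idle* share of a task's duty cycle: higher duty cycle, lower factor. */
uint64_t inv_dcycle_factor(uint64_t task_dcycle)
{
	uint64_t inv_dcycle, inv_dcycle_linear = 0;

	if (task_dcycle >= DL_FULL_DCYCLE)
		inv_dcycle = 0;				/* fully busy task: no idle share */
	else
		inv_dcycle = DL_FULL_DCYCLE - task_dcycle;

	if (inv_dcycle < 1)				/* max(inv_dcycle, 1), as in the hunk */
		inv_dcycle = 1;
	while (inv_dcycle >>= 1)			/* open-coded bpf_log2l() */
		inv_dcycle_linear++;

	return inv_dcycle_linear;
}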

@multics69 (Contributor) left a comment

Thank you for the fantastic work! Reflecting duty cycles matches well with my observations in EEVDF and LAVD. While LAVD does not directly consider the duty cycle, after adding eligibility I found that LAVD works better under a highly loaded system. Since CPU hoggers are mostly ineligible, eligibility might have a similar effect to duty cycle. However, the duty cycle is a more direct measurement, so I like it!

@multics69 (Contributor) commented

I ran the changes in my usual test case. It seems to introduce a regression.
Here are the debug dumps.

DEBUG DUMP                                                                                                                                                                                                           
================================================================================                                                                                                                                     
                                                                                                                                                                                                                     
kworker/u64:0[43915] triggered exit kind 1026:                                                                                                                                                                       
  runnable task stall (cc1[73847] failed to run for 11.390s)                                                                                                                                                         
                                                                                                                                                                                                                     
Backtrace:                                                                                                                                                                                                           
  scx_watchdog_workfn+0x146/0x1d0                                                                                                                                                                                    
  process_one_work+0x193/0x3c0
  worker_thread+0x3a9/0x4f0
  kthread+0xd2/0x100                                 
  ret_from_fork+0x34/0x50                            
  ret_from_fork_asm+0x1a/0x30

Runqueue states                                      
---------------                                      

CPU 0   : nr_run=7 flags=0x0 cpu_rel=0 ops_qseq=150024 pnt_seq=208897
          curr=dav1d-worker[44211] class=ext_sched_class

 *R dav1d-worker[44211] -4ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      cpus=ffff                                      


  R cc1[73847] -11390ms                              
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[73793] -2910ms                               
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[74207] -2904ms                               
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20
DEBUG DUMP                                                                                                                                                                                                           
================================================================================                                                                                                                                     
                                                                                                                                                                                                                     
kworker/u64:5[13216] triggered exit kind 1026:                                                                                                                                                                       
  runnable task stall (cc1[40964] failed to run for 11.790s)                                                                                                                                                         
                                                                                                                                                                                                                     
Backtrace:                                                                                                                                                                                                           
  scx_watchdog_workfn+0x146/0x1d0                                                                                                                                                                                    
  process_one_work+0x193/0x3c0                                                                                                                                                                                       
  worker_thread+0x3a9/0x4f0                                                                                                                                                                                          
  kthread+0xd2/0x100                                                                                                                                                                                                 
  ret_from_fork+0x34/0x50                                                                                                                                                                                            
  ret_from_fork_asm+0x1a/0x30                                                                                                                                                                                        
                                                                                                                                                                                                                     
Runqueue states                                                                                                                                                                                                      
---------------                                                                                                                                                                                                      
                                                                                                                                                                                                                     
CPU 0   : nr_run=16 flags=0x0 cpu_rel=0 ops_qseq=36886 pnt_seq=71515                                                                                                                                                 
          curr=cc1[43395] class=ext_sched_class                                                                                                                                                                      
                                                                                                                                                                                                                     
 *R cc1[43395] -4ms                                                                                                                                                                                                  
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      cpus=ffff                                      


  R cc1[40964] -11790ms                              
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[41146] -11124ms                              
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[41695] -7540ms                               
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

@multics69 (Contributor) commented May 23, 2024

What I did to trigger the above is as follows:

  1. Run scx_rusty
  2. Play a YouTube stream in a browser (in my case, I played it in the Edge browser).
  3. Run glxgears
  4. Kernel compile with make -j$((8 * $(nproc)))
  5. Keep all four on the screen

@Byte-Lab (Contributor, Author) commented May 23, 2024

Pushed code that addresses @arighi's and @multics69's feedback. Going to take a look at the regression that Changwoo pointed out before merging.

@vax-r (Contributor) commented May 26, 2024

Incredible work you guys are doing; I'm just here to learn what you changed and why.
I must say I learned a lot from this PR, it's amazing.

@ptr1337 (Contributor) commented May 29, 2024

Actually, this PR seems to introduce a regression for some users in a specific workload.

The user did the following:

  • Kernel Compilation running in background with all cores
  • Rocket League with Metrics

The following versions have been used for testing:

  • scx-scheds 0.19-5 (CachyOS repository, based on the latest commit of this repository right now)
  • scx-scheds-git (latest commit, including this PR)

The video can be found here; there is quite a large interactivity/FPS drop visible in Rocket League:
https://streamable.com/0jz2zu

Second commit (title truncated): …e tasks

In some scenarios, a CPU-intensive task may be on the critical path for
interactive workloads. For example, you may have a game with CPU-intensive
tasks that are crunching the logic for the game, and that's required for the
game to proceed without being choppy.

To support such workflows, this change adds logic to allow a non-interactive
task to inherit the lower (i.e. stronger) latency priority of another task if
it wakes or is woken by that task.

Signed-off-by: David Vernet <void@manifault.com>
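Below is a minimal plain-C sketch of the inheritance rule described in this commit message; the struct and the function are illustrative assumptions, not the actual scx_rusty types or helpers.

#include <stdint.h>

/* Illustrative task context; the real scx_rusty task_ctx differs. */
struct task_ctx {
	int64_t lat_prio;	/* signed latency priority: lower is more latency-critical */
};

/*
 * On a wake event, let the less latency-critical side of the waker/wakee pair
 * inherit the lower (stronger) latency priority of the other, so a CPU-heavy
 * task on the critical path of an interactive chain is not left behind.
 */
void inherit_lat_prio(struct task_ctx *waker, struct task_ctx *wakee)
{
	if (waker->lat_prio < wakee->lat_prio)
		wakee->lat_prio = waker->lat_prio;
	else if (wakee->lat_prio < waker->lat_prio)
		waker->lat_prio = wakee->lat_prio;
}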
@Byte-Lab (Contributor, Author) commented
Alright, this seems to beat the old rusty on Rocket League now as well. Interestingly, EEVDF on v6.10 beats both versions of rusty on Rocket League, but still gets trounced on Terraria and Civ6. On Civ6 especially, the game becomes unplayable with EEVDF under heavy background load, but is still fairly responsive with rusty.

This isn't perfect, but I'm going to merge it for now and we can continue to iterate in tree.

@Byte-Lab merged commit 09536aa into main on Jul 25, 2024 (1 check passed)
@Byte-Lab deleted the rusty_improved_dl branch on July 25, 2024 18:44
@Byte-Lab (Contributor, Author) commented Jul 25, 2024

Getting reports that this causes some regressions in cachy, so lemme revert while we investigate.
