rusty: Rework deadline as a signed sum #309

Merged · 2 commits merged into main from rusty_improved_dl on Jul 25, 2024

Conversation

@Byte-Lab (Contributor) commented May 23, 2024

Currently, a task's deadline is computed as its vtime + a scaled function of its average runtime (with its deadline being scaled down if it's more interactive). This makes sense intuitively, as we do want an interactive task to have an earlier deadline, but it also has some flaws.

For one thing, we're currently ignoring duty cycle when determining a task's deadline. This has a few implications. Firstly, because we reward tasks with higher waker and blocked frequencies on the assumption that they're part of a work chain, we implicitly penalize tasks that rarely use the CPU, because their frequencies are low. While those tasks are likely not part of a work chain, they should still get an interactivity boost simply by virtue of not using the CPU very often. This should in theory be addressed by vruntime, but because we cap the amount of vtime that a task can accumulate to one slice, it may not be adequately reflected after a task runs for the first time.

Another problem is that while we minimize a task's deadline if it's interactive, we don't really penalize a task that's a heavy CPU hog by increasing its deadline. We do a bit of this by applying a higher niceness, which gives the task a higher deadline for a lower weight, but the effect is somewhat minimal considering that we're using niceness, and that the best an interactive task can do is minimize its deadline to near zero relative to its vtime.

What we really want to do is "negatively" scale an interactive task's deadline with the same magnitude as we "positively" scale a CPU-hogging task's deadline. To do this, we make two major changes to how we compute deadline:

  1. Instead of using niceness, we now use our own straightforward scaling factor. This was chosen somewhat arbitrarily (by experimenting with some games) to be a scaling by 1000, but we can and should improve this in the future.

  2. We now create a signed linear latency priority factor as a sum of the three following inputs:

    • Work-chain factor (log_2 of product of blocked freq and waker freq)
    • Inverse duty cycle factor (log_2 of the inverse of a task's duty cycle -- higher duty cycle means lower factor)
    • Average runtime factor (Higher avg runtime means higher average runtime factor)

We then compute the latency priority as:

lat_prio := average runtime factor - (work-chain factor + inverse duty cycle factor)

This gives us a signed value that can be negative. With this, we can compute a non-negative weight value by calculating a weight from the absolute value of lat_prio, and use this to scale slice_ns. If lat_prio is negative we calculate a task's deadline as its vtime MINUS its scaled slice_ns, and if it's positive, it's the task's vtime PLUS scaled slice_ns.

This ends up working well because you get a higher weight both for highly interactive tasks, and highly CPU-hogging / non-interactive tasks, which lets you scale a task's deadline "more negatively" for interactive tasks, and "more positively" for the CPU hogs.
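To make the arithmetic above concrete, here is a minimal sketch in plain C rather than the actual scx_rusty BPF code; the helper names, the weight mapping, and the normalization constants are illustrative assumptions, since the PR only states that a non-negative weight is derived from the absolute value of lat_prio and used to scale slice_ns.

#include <stdint.h>
#include <stdlib.h>

/* Integer log2, standing in for the scheduler's bpf_log2l() helper. */
int64_t ilog2_u64(uint64_t v)
{
	int64_t r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/*
 * Compute a deadline from the three factors described above. The weight
 * mapping and the normalization below are illustrative assumptions.
 */
uint64_t task_deadline(uint64_t vtime, uint64_t slice_ns,
		       uint64_t blocked_freq, uint64_t waker_freq,
		       uint64_t inv_dcycle, int64_t avg_run_factor)
{
	uint64_t prod = blocked_freq * waker_freq;
	/* Work-chain factor: log2 of the product of blocked and waker freq. */
	int64_t chain_factor = ilog2_u64(prod ? prod : 1);
	/* Inverse duty cycle factor: higher duty cycle means a lower factor. */
	int64_t inv_dc_factor = ilog2_u64(inv_dcycle ? inv_dcycle : 1);

	/* Signed latency priority: long runtime pushes the deadline later,
	 * interactivity (work chains, low duty cycle) pulls it earlier. */
	int64_t lat_prio = avg_run_factor - (chain_factor + inv_dc_factor);

	/* Non-negative weight from |lat_prio|, used to scale slice_ns. */
	uint64_t weight = (uint64_t)llabs(lat_prio) * 1000;
	uint64_t scaled = slice_ns * weight / 100000;

	/* Interactive (negative lat_prio): earlier deadline; CPU hog: later.
	 * Underflow clamping is omitted for brevity. */
	return lat_prio < 0 ? vtime - scaled : vtime + scaled;
}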

With this change, we get a significant improvement in FPS. On a 7950X, if I run the following workload:

$ stress-ng -c $((8 * $(nproc)))

  1. I get 60 FPS when playing Stellaris (while time is progressing at max speed), whereas EEVDF gets 6-7 FPS.

  2. I get ~15-40 FPS while playing Civ6, whereas EEVDF seems to get < 1 FPS. With EEVDF, the Civ6 benchmark doesn't even get past the initial frame after over 4 minutes, whereas rusty gets 13s / turn.

  3. It seems that EEVDF has improved with Terraria in v6.9. It was able to maintain ~30-55 FPS, as opposed to the ~5-10 FPS we've seen in the past. rusty is still able to maintain a solid 60-62 FPS consistently with no problem, however.

@Byte-Lab force-pushed the rusty_improved_dl branch 2 times, most recently from 8a22d24 to 9198488 on May 23, 2024 05:27
@Byte-Lab (Contributor, Author) commented

The multiple pushes were from adjusting the sched_prio_to_latency_weight() multiplier. 1000 seems to work best for now, but it's definitely something to improve further.

@arighi (Collaborator) left a comment

Nice work, I see a massive improvement in responsiveness when the system is overloaded!

I left a couple of comments, mostly about clarifying the code, but definitely approved!

task_dcycle = 0;
else
task_dcycle = DL_FULL_DCYCLE - task_dcycle;
dcycle_linear = bpf_log2l(max(task_dcycle, 1));

task_dcycle is actually the inverse duty cycle here, or the "idle duty cycle" IIUC, maybe we should rename the variable (i.e., inv_dcycle_linear?) or add a comment to clarify this?
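For illustration only, here is a plain-C sketch of what the suggested rename could look like; the value of DL_FULL_DCYCLE and the guard condition are assumptions (the quoted hunk cuts off the original if condition), and the open-coded log2 stands in for bpf_log2l().

#include <stdint.h>

#define DL_FULL_DCYCLE 1024ULL	/* assumed fixed-point value for a 100% duty cycle */

/* log2 of the *idle* share of a task's duty cycle: higher duty cycle, lower factor. */
uint64_t inv_dcycle_factor(uint64_t task_dcycle)
{
	uint64_t inv_dcycle, inv_dcycle_linear = 0;

	if (task_dcycle >= DL_FULL_DCYCLE)
		inv_dcycle = 0;				/* fully busy task: no idle share */
	else
		inv_dcycle = DL_FULL_DCYCLE - task_dcycle;

	if (inv_dcycle < 1)				/* max(inv_dcycle, 1), as in the hunk */
		inv_dcycle = 1;
	while (inv_dcycle >>= 1)			/* open-coded bpf_log2l() */
		inv_dcycle_linear++;

	return inv_dcycle_linear;
}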

@multics69 (Contributor) left a comment

Thank you for the fantastic work! Reflecting duty cycles matches well with my observations in EEVDF and LAVD. While LAVD does not directly consider the duty cycle, after adding eligibility I found that LAVD works better under a highly loaded system. Since CPU hoggers are mostly ineligible, eligibility might have a similar effect to duty cycle. However, the duty cycle is a more direct measurement, so I like it!

@multics69 (Contributor) commented

I ran the changes in my usual test case. It seems to introduce a regression.
Here are the debug dumps.

DEBUG DUMP                                                                                                                                                                                                           
================================================================================                                                                                                                                     
                                                                                                                                                                                                                     
kworker/u64:0[43915] triggered exit kind 1026:                                                                                                                                                                       
  runnable task stall (cc1[73847] failed to run for 11.390s)                                                                                                                                                         
                                                                                                                                                                                                                     
Backtrace:                                                                                                                                                                                                           
  scx_watchdog_workfn+0x146/0x1d0                                                                                                                                                                                    
  process_one_work+0x193/0x3c0
  worker_thread+0x3a9/0x4f0
  kthread+0xd2/0x100                                 
  ret_from_fork+0x34/0x50                            
  ret_from_fork_asm+0x1a/0x30

Runqueue states                                      
---------------                                      

CPU 0   : nr_run=7 flags=0x0 cpu_rel=0 ops_qseq=150024 pnt_seq=208897
          curr=dav1d-worker[44211] class=ext_sched_class

 *R dav1d-worker[44211] -4ms
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      cpus=ffff                                      


  R cc1[73847] -11390ms                              
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[73793] -2910ms                               
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[74207] -2904ms                               
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20
DEBUG DUMP                                                                                                                                                                                                           
================================================================================                                                                                                                                     
                                                                                                                                                                                                                     
kworker/u64:5[13216] triggered exit kind 1026:                                                                                                                                                                       
  runnable task stall (cc1[40964] failed to run for 11.790s)                                                                                                                                                         
                                                                                                                                                                                                                     
Backtrace:                                                                                                                                                                                                           
  scx_watchdog_workfn+0x146/0x1d0                                                                                                                                                                                    
  process_one_work+0x193/0x3c0                                                                                                                                                                                       
  worker_thread+0x3a9/0x4f0                                                                                                                                                                                          
  kthread+0xd2/0x100                                                                                                                                                                                                 
  ret_from_fork+0x34/0x50                                                                                                                                                                                            
  ret_from_fork_asm+0x1a/0x30                                                                                                                                                                                        
                                                                                                                                                                                                                     
Runqueue states                                                                                                                                                                                                      
---------------                                                                                                                                                                                                      
                                                                                                                                                                                                                     
CPU 0   : nr_run=16 flags=0x0 cpu_rel=0 ops_qseq=36886 pnt_seq=71515                                                                                                                                                 
          curr=cc1[43395] class=ext_sched_class                                                                                                                                                                      
                                                                                                                                                                                                                     
 *R cc1[43395] -4ms                                                                                                                                                                                                  
      scx_state/flags=3/0x5 dsq_flags=0x0 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=(n/a)
      cpus=ffff                                      


  R cc1[40964] -11790ms                              
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[41146] -11124ms                              
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

  R cc1[41695] -7540ms                               
      scx_state/flags=3/0x1 dsq_flags=0x1 ops_state/qseq=0/0
      sticky/holding_cpu=-1/-1 dsq_id=0x0
      cpus=ffff                                      

    asm_sysvec_apic_timer_interrupt+0x1a/0x20

@multics69 (Contributor) commented May 23, 2024

What I did to trigger the above is as follows:

  1. Run scx_rusty
  2. Play a YouTube stream in a browser (in my case, I played it in the Edge browser).
  3. Run glxgears
  4. Kernel compile with make -j$((8 * $(nproc)))
  5. Keep all four on the screen

@Byte-Lab (Contributor, Author) commented May 23, 2024

Pushed code that addresses @arighi's and @multics69's feedback. Going to take a look at the regression that Changwoo pointed out before merging.

@vax-r (Contributor) commented May 26, 2024

Incredible work you guys are doing; I'm just here to learn what you changed and why.
I must say I learned a lot from this PR, it's amazing.

@ptr1337 (Contributor) commented May 29, 2024

Actually, this PR seems to introduce a regression for some users in a specific workload.

The user did the following:

  • Kernel Compilation running in background with all cores
  • Rocket League with Metrics

The following versions have been used for testing:

  • scx-scheds 0.19-5 (CachyOS repository, based on the latest commit of this repository right now)
  • scx-scheds-git (latest commit, including this PR)

The video can be found here; there is quite a large interactivity/FPS drop visible in Rocket League:
https://streamable.com/0jz2zu

Second commit (title truncated): …e tasks

In some scenarios, a CPU-intensive task may be on the critical path for
interactive workloads. For example, you may have a game with CPU-intensive
tasks that are crunching the logic for the game, and that's required for the
game to proceed without being choppy.

To support such workflows, this change adds logic to allow a non-interactive
task to inherit the lower (i.e. stronger) latency priority of another task if
it wakes or is woken by that task.

Signed-off-by: David Vernet <void@manifault.com>
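Below is a minimal plain-C sketch of the inheritance rule described in this commit message; the struct and the function are illustrative assumptions, not the actual scx_rusty types or helpers.

#include <stdint.h>

/* Illustrative task context; the real scx_rusty task_ctx differs. */
struct task_ctx {
	int64_t lat_prio;	/* signed latency priority: lower is more latency-critical */
};

/*
 * On a wake event, let the less latency-critical side of the waker/wakee pair
 * inherit the lower (stronger) latency priority of the other, so a CPU-heavy
 * task on the critical path of an interactive chain is not left behind.
 */
void inherit_lat_prio(struct task_ctx *waker, struct task_ctx *wakee)
{
	if (waker->lat_prio < wakee->lat_prio)
		wakee->lat_prio = waker->lat_prio;
	else if (wakee->lat_prio < waker->lat_prio)
		waker->lat_prio = wakee->lat_prio;
}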
@Byte-Lab (Contributor, Author) commented
Alright, this seems to beat the old rusty on Rocket League now as well. Interestingly, EEVDF on v6.10 beats both versions of rusty on Rocket League, but still gets trounced on Terraria and Civ6. On Civ6 especially, the game becomes unplayable with EEVDF under heavy background load, but is still fairly responsive with rusty.

This isn't perfect, but I'm going to merge it for now and we can continue to iterate in tree.

@Byte-Lab merged commit 09536aa into main on Jul 25, 2024 (1 check passed)
@Byte-Lab deleted the rusty_improved_dl branch on July 25, 2024 18:44
@Byte-Lab (Contributor, Author) commented Jul 25, 2024

Getting reports that this causes some regressions in cachy, so lemme revert while we investigate.
