
Specify thread types using Quality of Service API #14859

Merged
merged 6 commits into rust-lang:master from lunacookies:qos on May 26, 2023

Conversation

lunacookies
Contributor

@lunacookies lunacookies commented May 20, 2023

Some background (in case you haven’t heard of QoS before)

Heterogeneous multi-core CPUs are increasingly found in laptops and desktops (e.g. Alder Lake, Snapdragon 8cx Gen 3, M1). To maximize efficiency on this kind of hardware, it is important to provide the operating system with more information so threads can be scheduled on different core types appropriately.

The approach that XNU (the kernel of macOS, iOS, etc) and Windows have taken is to provide a high-level semantic API – quality of service, or QoS – which informs the OS of the program’s intent. For instance, you might specify that a thread is running a render loop for a game. This makes the OS provide this thread with as large a share of the system’s resources as possible. Specifying a thread is running an unimportant background task, on the other hand, is cause for it to be scheduled exclusively on high-efficiency cores instead of high-performance cores.

QoS APIs allow easy configuration of many different parameters at once; for instance, setting QoS on XNU affects scheduling, timer latency, I/O priorities, and of course which core type the thread in question should run on. I don’t know the details of how QoS works on Windows, but I would guess it’s similar.
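On XNU the semantic API boils down to a couple of pthread extensions. Here is a minimal sketch (my own illustration, not code from this PR) of opting the current thread into the Utility class; the constant mirrors the value in Apple’s `<sys/qos.h>`, and on non-Apple targets the whole thing degrades to a no-op:

```rust
// Darwin-only pthread extension from <pthread/qos.h>; compiled out elsewhere.
#[cfg(target_vendor = "apple")]
extern "C" {
    fn pthread_set_qos_class_self_np(qos_class: u32, relative_priority: i32) -> i32;
}

// Value of QOS_CLASS_UTILITY in <sys/qos.h>.
#[cfg(target_vendor = "apple")]
const QOS_CLASS_UTILITY: u32 = 0x11;

/// Ask the OS to treat the current thread as energy-efficient "utility" work.
/// Returns true if the request was made and accepted on this platform.
#[cfg(target_vendor = "apple")]
fn set_current_thread_utility() -> bool {
    // relative_priority 0 keeps the class's default priority; 0 means success.
    unsafe { pthread_set_qos_class_self_np(QOS_CLASS_UTILITY, 0) == 0 }
}

#[cfg(not(target_vendor = "apple"))]
fn set_current_thread_utility() -> bool {
    false // no QoS API on this platform
}

fn main() {
    // On Linux/Windows this prints false; on macOS it should print true.
    println!("QoS request honored: {}", set_current_thread_utility());
}
```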

Hypothetically, taking advantage of these APIs would improve power consumption, thermals, battery life if applicable, etc.

Relevance to rust-analyzer

From what I can tell the philosophy behind both the XNU and Windows QoS APIs is that user interfaces should never stutter under any circumstances. You can see this in the array of QoS classes which are available: the highest QoS class in both APIs is one intended explicitly for UI render loops.

Imagine rust-analyzer is performing CPU-intensive background work – maybe you just invoked Find Usages on usize or opened a large project – in this scenario the editor’s render loop should absolutely get higher priority than rust-analyzer, no matter what. You could view it in terms of “realtime-ness”: flight control software is hard realtime, audio software is soft realtime, GUIs are softer realtime, and rust-analyzer is not realtime at all. Of course, maximizing responsiveness is important, but respecting the rest of the system is more important.

Implementation

I’ve tried my best to unify thread creation in stdx, where the new API I’ve introduced requires specifying a QoS class. Different points along the performance/efficiency curve can make a great difference; the M1’s e-cores use roughly a third of the power of the p-cores, so putting in this effort is worthwhile IMO.
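The shape of that unified API can be sketched roughly like this (names such as `QoSClass` and `spawn_with_qos` are illustrative, not the PR’s exact identifiers): every spawn goes through one wrapper that forces the caller to pick a class up front.

```rust
use std::thread;

// Illustrative sketch: thread creation funneled through one function that
// *requires* a QoS class, so no thread is spawned without the caller stating
// its scheduling intent.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum QoSClass {
    UserInteractive,
    UserInitiated,
    Default,
    Utility,
    Background,
}

fn spawn_with_qos<F, T>(qos: QoSClass, f: F) -> thread::JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(move || {
        // On macOS this is where pthread_set_qos_class_self_np would be
        // called from within the new thread; elsewhere it is a no-op.
        apply_qos(qos);
        f()
    })
}

fn apply_qos(_qos: QoSClass) {
    // Platform-specific implementation elided; no-op on non-Darwin targets.
}

fn main() {
    let handle = spawn_with_qos(QoSClass::Utility, || 2 + 2);
    assert_eq!(handle.join().unwrap(), 4);
}
```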

It’s worth mentioning that Linux does not [yet](https://youtu.be/RfgPWpTwTQo) have a QoS API. Maybe translating QoS into regular thread priorities would be acceptable? From what I can tell the only scheduling-related code in rust-analyzer is Windows-specific, so ignoring QoS entirely on Linux shouldn’t cause any new issues. Also, I haven’t implemented support for the Windows QoS APIs because I don’t have a Windows machine to test on, and because I’m completely unfamiliar with Windows APIs :)
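To make that suggestion concrete, here is one hypothetical mapping from QoS classes to POSIX nice values. The numbers are assumptions for illustration only; the PR itself leaves Linux untouched.

```rust
// Hypothetical Linux fallback: translate QoS classes into nice values.
// These particular numbers are an assumption, not anything the PR implements.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum QoS {
    UserInteractive,
    UserInitiated,
    Default,
    Utility,
    Background,
}

/// Nice values range from -20 (highest priority) to 19 (lowest). Negative
/// values need CAP_SYS_NICE, so an unprivileged mapping stays at 0 or above.
fn qos_to_nice(qos: QoS) -> i32 {
    match qos {
        QoS::UserInteractive | QoS::UserInitiated | QoS::Default => 0,
        QoS::Utility => 5,
        QoS::Background => 10,
    }
}

fn main() {
    println!("Utility -> nice {}", qos_to_nice(QoS::Utility));
}
```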

I noticed that rust-analyzer handles some requests on the main thread (using .on_sync()) and others on a threadpool (using .on()). I think it would make sense to run the main thread at the User Initiated QoS and the threadpool at Utility, but only if all requests caused by typing use .on_sync() and all others use .on(). I don’t understand how the current .on_sync()/.on() split was chosen, so I’ve left this code alone for the moment. Let me know if changing this to what I proposed makes any sense.
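The proposed rule could look something like this sketch; the string-based dispatcher is a heavy simplification of rust-analyzer’s actual `.on()`/`.on_sync()` machinery, and the particular method names routed to the main thread are just examples:

```rust
// Simplified routing rule: typing-driven, latency-critical, cheap requests go
// to the main loop (.on_sync()); everything else goes to the thread pool
// (.on()), since it may kick off expensive semantic analysis.
#[derive(Debug)]
enum Route {
    MainThread, // .on_sync(): runs at a higher QoS, must stay fast
    ThreadPool, // .on(): runs at Utility QoS, may be slow
}

fn route(method: &str) -> Route {
    match method {
        "textDocument/didChange" | "textDocument/onTypeFormatting" => Route::MainThread,
        _ => Route::ThreadPool,
    }
}

fn main() {
    for m in ["textDocument/didChange", "textDocument/references"] {
        println!("{m} -> {:?}", route(m));
    }
}
```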

To avoid having to change everything back in case I’ve misunderstood something, I’ve left all threads at the Utility QoS for now. Of course, this isn’t what I hope the code will look like in the end, but I figured I have to start somewhere :P

References

  • [Apple documentation related to QoS](https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html)
  • [pthread API for setting QoS on XNU](https://github.com/apple-oss-distributions/libpthread/blob/67e155c94093be9a204b69637d198eceff2c7c46/include/pthread/qos.h)
  • [Windows’s QoS classes](https://learn.microsoft.com/en-us/windows/win32/procthread/quality-of-service)
  • Full documentation of XNU QoS classes. This documentation is only available as a huge not-very-readable comment in a header file, so I’ve reformatted it and put it here for reference.
    • QOS_CLASS_USER_INTERACTIVE: A QOS class which indicates work performed by this thread is interactive with the user.

      Such work is requested to run at high priority relative to other work on the system. Specifying this QOS class is a request to run with nearly all available system CPU and I/O bandwidth even under contention. This is not an energy-efficient QOS class to use for large tasks. The use of this QOS class should be limited to critical interaction with the user such as handling events on the main event loop, view drawing, animation, etc.

    • QOS_CLASS_USER_INITIATED: A QOS class which indicates work performed by this thread was initiated by the user and that the user is likely waiting for the results.

      Such work is requested to run at a priority below critical user-interactive work, but relatively higher than other work on the system. This is not an energy-efficient QOS class to use for large tasks. Its use should be limited to operations of short enough duration that the user is unlikely to switch tasks while waiting for the results. Typical user-initiated work will have progress indicated by the display of placeholder content or modal user interface.

    • QOS_CLASS_DEFAULT: A default QOS class used by the system in cases where more specific QOS class information is not available.

      Such work is requested to run at a priority below critical user-interactive and user-initiated work, but relatively higher than utility and background tasks. Threads created by pthread_create() without an attribute specifying a QOS class will default to QOS_CLASS_DEFAULT. This QOS class value is not intended to be used as a work classification, it should only be set when propagating or restoring QOS class values provided by the system.

    • QOS_CLASS_UTILITY: A QOS class which indicates work performed by this thread may or may not be initiated by the user and that the user is unlikely to be immediately waiting for the results.

      Such work is requested to run at a priority below critical user-interactive and user-initiated work, but relatively higher than low-level system maintenance tasks. The use of this QOS class indicates the work should be run in an energy and thermally-efficient manner. The progress of utility work may or may not be indicated to the user, but the effect of such work is user-visible.

    • QOS_CLASS_BACKGROUND: A QOS class which indicates work performed by this thread was not initiated by the user and that the user may be unaware of the results.

      Such work is requested to run at a priority below other work. The use of this QOS class indicates the work should be run in the most energy and thermally-efficient manner.

    • QOS_CLASS_UNSPECIFIED: A QOS class value which indicates the absence or removal of QOS class information.

      As an API return value, may indicate that threads or pthread attributes were configured with legacy API incompatible or in conflict with the QOS class system.

@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label May 20, 2023
@lunacookies
Contributor Author

It’s worth noting that I haven’t observed much difference in my own benchmarks between Utility, User Initiated and User Interactive when the system is in a nominal state; rather, those QoS classes have more of an effect when the system is under load.

Explicitly specifying QoS using appropriate classes would mean, if the system is under heavy load, rust-analyzer’s large threadpool-based tasks (like find usages) would be de-prioritized by the scheduler in favor of high QoS tasks (like rust-analyzer’s main loop or VS Code’s rendering thread). If all p-cores are occupied the scheduler tends to move Utility threads onto the e-cores. This is the exact behavior we want in this scenario.

I don’t know about the specifics of XNU’s scheduler, but I wouldn’t be surprised if lower QoS tasks are moved onto e-cores if the battery is low, too. XNU is known to do things in that vein. For example, the libdispatch concurrency framework integrates with XNU so it can make optimal scheduling decisions. It takes a number of factors into account when choosing how many threads to allocate to the thread pools of each application, including the ratio of performance to efficiency cores and the heat dissipation capacity of the machine (!!).

@lunacookies lunacookies changed the title Specify thread priorities using Quality of Service API Specify thread types using Quality of Service API May 21, 2023
@Veykril
Member

Veykril commented May 24, 2023

Are there no crates wrapping this yet? I'd expect that to be a thing already.

@lunacookies
Contributor Author

I had a quick look, but there’s nothing I could find.

I think it’s mostly alright, since on XNU it’s a few calls to non-standard pthreads functions without any sketchy raw pointer shenanigans or anything. The Windows QoS APIs are accessible through winapi, which r-a already uses in the main loop.

@Veykril
Member

Veykril commented May 24, 2023

That's fine, I was just curious if there was. Then we wouldn't have to duplicate this here.
@bors r+

@bors
Collaborator

bors commented May 24, 2023

📌 Commit a416248 has been approved by Veykril

It is now in the queue for this repository.

@bors
Collaborator

bors commented May 24, 2023

⌛ Testing commit a416248 with merge e93bcac...

bors added a commit that referenced this pull request May 24, 2023
Specify thread types using Quality of Service API

@lunacookies
Contributor Author

Oh wait is it possible to cancel bors? I’d still like to figure out what an appropriate QoS for the main loop would be …

@Veykril
Member

Veykril commented May 24, 2023

@bors r-

@Veykril
Member

Veykril commented May 24, 2023

Was about to comment on the on vs on_sync matter afterwards (figured that could be done as a follow-up).

The only real difference is that on_sync handlers block the main loop, so anything that needs to happen immediately (that is, has to be done fast) usually goes there: text document changes, on-enter handling, matching-brace handling, and the like.

@Veykril
Member

Veykril commented May 24, 2023

In general, on_sync{_mut} handlers should not block.

@lnicola
Member

lnicola commented May 24, 2023

I think completion would also be latency-sensitive.

@bors
Collaborator

bors commented May 24, 2023

☀️ Try build successful - checks-actions
Build commit: e93bcac (e93bcac1442871a1b7664c4adeaa695258a5e14e)

@lunacookies
Contributor Author

I’d suggest also adding on-type formatting and possibly even syntax highlighting to that list? I’m not sure if syntax highlighting is slow enough for this to be a bad idea though …

I would also mention inlay hints since those update while typing, but since VS Code adds in that horrible artificial delay the latency doesn’t matter.

In terms of QoS I didn’t spot any “special” threads in rust-analyzer that would need something other than Utility, except for the main loop and the cache priming threads. Main loop activity is primarily caused by the user’s actions and is required for the UI to respond, so I would suggest making the main loop User Initiated. I’m not so clear about cache priming, though: can rust-analyzer respond to LSP messages while the caches are being primed?

@Veykril
Member

Veykril commented May 24, 2023

> on-type formatting

not being run on the main thread I'd consider a bug actually.

Syntax highlighting we can’t put there; it kicks off analysis, so it can block for a while. Inlay hints likewise. Anything doing semantic things is a no-go. Cache priming runs on the thread pool; it basically just populates some salsa queries, so it doesn’t block anything I think (it supports query cancellation). So yes, it can respond to LSP things while cache priming.

@lnicola
Member

lnicola commented May 24, 2023

> Anything doing semantic things is a no-go.

Even completion? 😞

@Veykril
Member

Veykril commented May 24, 2023

Especially completions 😅 While we want to handle those fast, they can take a lot of time so they should not be allowed to block the main thread.

@lunacookies
Contributor Author

Alright, so in that case cache priming to me seems like more of a “this runs silently in the background to transparently make future things run faster” kind of deal. Does it make sense to mark that as Background instead?

I restarted VS Code and measured using the stopwatch on my phone how long it took for rust-analyzer to index the rust-analyzer repo. I ran five trials for each row in the table. (There are so many problems with everything about this but it gives a rough estimation of what to expect.)

| QoS | threads | median | stddev |
| --- | ------- | ------ | ------ |
| UT  | 4       | 6.22   | 0.155  |
| UT  | 8       | 5.78   | 0.066  |
| UT  | 10      | 5.71   | 0.048  |
| BG  | 2       | 48.99  | 3.372  |
| BG  | 10      | 45.14  | 2.914  |

This computer has 8 p-cores and 2 e-cores. Utility can run on any core type while Background is restricted to the e-cores.

Evidently, with this core configuration, using Background for indexing is a massive performance regression. Is this worth the potential battery saving? clangd used to use Background for indexing, but then switched to Utility to make indexing faster. Apple’s guidelines recommend using Background for tasks that take between minutes and hours, too.

Even though it’s likely that Background would save power here I’m not sure if the tradeoff is worth it.

@lunacookies
Contributor Author

lunacookies commented May 24, 2023

For requests that are triggered by the user typing but which require semantic analysis (like completion and syntax highlighting!), we could have a secondary thread pool at a higher QoS. That way, these requests remain quick when the system is under load without blocking the main thread and without unnecessarily making other requests use more energy than needed. This multi-pool approach is the standard in the Apple concurrency APIs (libdispatch), where you have a separate thread pool (well, “concurrent dispatch queue” in libdispatch-speak) for each QoS level.

Implementing this is trivial (just make TaskPool hold two threadpool::ThreadPools internally and add a .spawn_user_interactive() method), but I’m not sure if the difference is worth it. Either way I’d leave that to a follow-up PR; I don’t have anything to add on this PR :)
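A toy version of that two-pool TaskPool, using plain std threads and channels in place of `threadpool::ThreadPool` and with the actual QoS calls elided. `spawn_user_interactive` matches the method name proposed above; everything else (pool sizes, names) is an assumption for illustration:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

// One fixed-size worker pool. In the real design each pool's workers would be
// created at a different QoS class via the platform API.
struct Pool {
    sender: mpsc::Sender<Job>,
}

impl Pool {
    fn new(workers: usize) -> Pool {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        for _ in 0..workers {
            let receiver = Arc::clone(&receiver);
            thread::spawn(move || loop {
                // The lock is only held while fetching a job, not while
                // running it.
                let job = match receiver.lock().unwrap().recv() {
                    Ok(job) => job,
                    Err(_) => break, // all senders dropped: shut down
                };
                job();
            });
        }
        Pool { sender }
    }

    fn spawn(&self, f: impl FnOnce() + Send + 'static) {
        self.sender.send(Box::new(f)).unwrap();
    }
}

// TaskPool holds two pools: ordinary requests go to the Utility-class pool,
// latency-sensitive ones to the higher-QoS pool.
struct TaskPool {
    utility: Pool,
    user_interactive: Pool,
}

impl TaskPool {
    fn new() -> TaskPool {
        TaskPool { utility: Pool::new(4), user_interactive: Pool::new(2) }
    }
    fn spawn(&self, f: impl FnOnce() + Send + 'static) {
        self.utility.spawn(f);
    }
    fn spawn_user_interactive(&self, f: impl FnOnce() + Send + 'static) {
        self.user_interactive.spawn(f);
    }
}

fn main() {
    let pool = TaskPool::new();
    let (tx, rx) = mpsc::channel();
    let tx2 = tx.clone();
    pool.spawn(move || tx.send("background").unwrap());
    pool.spawn_user_interactive(move || tx2.send("interactive").unwrap());
    // Both tasks complete regardless of which pool ran them.
    let mut results = vec![rx.recv().unwrap(), rx.recv().unwrap()];
    results.sort();
    assert_eq!(results, ["background", "interactive"]);
}
```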

@Veykril
Member

Veykril commented May 24, 2023

I think having cache priming on Utility should be fine; the slowdown is too high otherwise. People are free to turn it off for power-saving reasons since there is a config for it.

@lunacookies
Contributor Author

lunacookies commented May 25, 2023

I ran some benchmarks and found that running all requests under Utility makes performance fall off a cliff when threads at a QoS higher than Utility are swamping the system (sounds a lot like a rustc build …). #14888 fixes this.

I simulated such a situation and performed the same edit many times. I used rust-analyzer’s profiling infrastructure (which is awesome!) to measure the processing time for syntax highlighting and completion requests:

| configuration           | median (ms) | slowest (ms) | overly long loop turns |
| ----------------------- | ----------- | ------------ | ---------------------- |
| before QoS              | 79          | 261          | 17                     |
| all Utility             | 234         | 12091        | 80                     |
| tweaked QoS per request | 83          | 218          | 13                     |

This is bad enough that I think we should make sure the two PRs don’t get split across releases.

@Veykril
Member

Veykril commented May 26, 2023

@bors delegate+

@lunacookies
Contributor Author

@bors r+

@bors
Collaborator

bors commented May 26, 2023

📌 Commit 430bdd3 has been approved by lunacookies

It is now in the queue for this repository.

@bors
Collaborator

bors commented May 26, 2023

⌛ Testing commit 430bdd3 with merge 6bca9f2...

@bors
Collaborator

bors commented May 26, 2023

☀️ Test successful - checks-actions
Approved by: lunacookies
Pushing 6bca9f2 to master...


@bors
Collaborator

bors commented May 26, 2023

👀 Test was successful, but fast-forwarding failed: 422 Changes must be made through a pull request.

@bors bors merged commit 6bca9f2 into rust-lang:master May 26, 2023
9 of 10 checks passed
@lunacookies lunacookies deleted the qos branch May 26, 2023 17:36
bors added a commit that referenced this pull request May 31, 2023
Prioritize threads affected by user typing

To this end I’ve introduced a new custom thread pool type which can spawn threads using each QoS class. This way we can run latency-sensitive requests under one QoS class and everything else under another QoS class. The implementation is very similar to that of the `threadpool` crate (which is currently used by rust-analyzer) but with unused functionality stripped out.

I’ll have to rebase on master once #14859 is merged but I think everything else is alright :D