Specify thread types using Quality of Service API #14859
It’s worth noting that I haven’t observed much difference in my own benchmarks between Utility, User Initiated and User Interactive when the system is in a nominal state; rather, those QoS classes have more of an effect when the system is under load. Explicitly specifying QoS using appropriate classes would mean that, if the system is under heavy load, rust-analyzer’s large threadpool-based tasks (like find usages) would be de-prioritized by the scheduler in favor of high-QoS tasks (like rust-analyzer’s main loop or VS Code’s rendering thread). If all p-cores are occupied, the scheduler tends to move Utility threads onto the e-cores. This is the exact behavior we want in this scenario. I don’t know the specifics of XNU’s scheduler, but I wouldn’t be surprised if lower-QoS tasks are moved onto e-cores when the battery is low, too. XNU is known to do things in that vein. For example, the libdispatch concurrency framework integrates with XNU so it can make optimal scheduling decisions. It takes a number of factors into account when choosing how many threads to allocate to the thread pools of each application, including the ratio of performance to efficiency cores and the heat dissipation capacity of the machine (!!).
Are there no crates wrapping this yet? I'd expect that to be a thing already.
I had a quick look, but there’s nothing I could find. I think it’s mostly alright, since on XNU it’s a few calls to non-standard pthreads functions without any sketchy raw pointer shenanigans or anything. The Windows QoS APIs are accessible through …
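For reference, the XNU side really is just a couple of pthread calls. Here’s a rough sketch of what wrapping them in Rust could look like. The constants are copied from `<pthread/qos.h>`; the wrapper names (`QoSClass`, `set_current_thread_qos`) are mine for illustration, not rust-analyzer’s actual API, and the non-macOS fallback simply does nothing:

```rust
/// QoS classes as defined in XNU's <pthread/qos.h>.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u32)]
pub enum QoSClass {
    UserInteractive = 0x21,
    UserInitiated = 0x19,
    Default = 0x15,
    Utility = 0x11,
    Background = 0x09,
}

#[cfg(target_os = "macos")]
mod imp {
    use super::QoSClass;
    extern "C" {
        // int pthread_set_qos_class_self_np(qos_class_t qos, int relative_priority);
        fn pthread_set_qos_class_self_np(class: u32, relative_priority: i32) -> i32;
    }
    /// Set the QoS class of the calling thread. Returns true on success.
    pub fn set_current_thread_qos(class: QoSClass) -> bool {
        // relative_priority must be <= 0; 0 means "no offset within the class".
        unsafe { pthread_set_qos_class_self_np(class as u32, 0) == 0 }
    }
}

#[cfg(not(target_os = "macos"))]
mod imp {
    use super::QoSClass;
    /// No QoS API on this platform; do nothing and report that.
    pub fn set_current_thread_qos(_class: QoSClass) -> bool {
        false
    }
}

pub use imp::set_current_thread_qos;

fn main() {
    let applied = set_current_thread_qos(QoSClass::Utility);
    println!("QoS applied: {applied}");
}
```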
That's fine, I was just curious if there was. Then we wouldn't have to duplicate this here.
<details>
<summary>Some background (in case you haven’t heard of QoS before)</summary>

Heterogeneous multi-core CPUs are increasingly found in laptops and desktops (e.g. Alder Lake, Snapdragon 8cx Gen 3, M1). To maximize efficiency on this kind of hardware, it is important to give the operating system more information so threads can be scheduled appropriately across core types.

The approach that XNU (the kernel of macOS, iOS, etc.) and Windows have taken is to provide a high-level semantic API – quality of service, or QoS – which informs the OS of the program’s intent. For instance, you might specify that a thread is running a render loop for a game. This makes the OS give that thread as large a share of the system’s resources as possible. Specifying that a thread is running an unimportant background task, on the other hand, causes it to be scheduled exclusively on high-efficiency cores instead of high-performance cores.

QoS APIs allow many different parameters to be configured at once; for instance, setting QoS on XNU affects scheduling, timer latency, I/O priorities, and of course which core type the thread in question should run on. I don’t know the details of how QoS works on Windows, but I would guess it’s similar.

Hypothetically, taking advantage of these APIs would improve power consumption, thermals, battery life if applicable, etc.

</details>

# Relevance to rust-analyzer

From what I can tell, the philosophy behind both the XNU and Windows QoS APIs is that _user interfaces should never stutter under any circumstances._ You can see this in the array of QoS classes which are available: the highest QoS class in both APIs is one intended explicitly for UI render loops.
Imagine rust-analyzer is performing CPU-intensive background work – maybe you just invoked Find Usages on `usize` or opened a large project. In this scenario the editor’s render loop should absolutely get higher priority than rust-analyzer, no matter what. You could view it in terms of “realtime-ness”: flight control software is hard realtime, audio software is soft realtime, GUIs are softer realtime, and rust-analyzer is not realtime at all. Of course, maximizing responsiveness is important, but respecting the rest of the system is more important.

# Implementation

I’ve tried my best to unify thread creation in `stdx`, where the new API I’ve introduced _requires_ specifying a QoS class. Different points along the performance/efficiency curve can make a great difference; the M1’s e-cores use around three times less power than the p-cores, so putting in this effort is worthwhile IMO.

It’s worth mentioning that Linux does not [yet](https://youtu.be/RfgPWpTwTQo) have a QoS API. Maybe translating QoS into regular thread priorities would be acceptable? From what I can tell, the only scheduling-related code in rust-analyzer is Windows-specific, so ignoring QoS entirely on Linux shouldn’t cause any new issues. Also, I haven’t implemented support for the Windows QoS APIs because I don’t have a Windows machine to test on, and because I’m completely unfamiliar with Windows APIs :)

I noticed that rust-analyzer handles some requests on the main thread (using `.on_sync()`) and others on a threadpool (using `.on()`). I think it would make sense to run the main thread at the User Initiated QoS and the threadpool at Utility, but only if all requests that are caused by typing use `.on_sync()` and all that aren’t use `.on()`. I don’t understand how the current `.on_sync()`/`.on()` split was chosen, so I’ve left that code alone for the moment. Let me know whether changing this to what I proposed makes sense.
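To make the “spawning requires a QoS class” idea concrete, here is a minimal sketch of what such a wrapper could look like. The `ThreadIntent` and `spawn` names are illustrative rather than the actual `stdx` API, and the platform-specific QoS call is stubbed out:

```rust
use std::thread;

/// Intent classes, to be mapped onto platform QoS where available.
#[derive(Clone, Copy, Debug)]
pub enum ThreadIntent {
    /// The user is waiting on this work (e.g. the main loop).
    LatencySensitive,
    /// Long-running work the user is not blocked on (e.g. Find Usages).
    Worker,
}

/// Spawn a thread, *requiring* the caller to state its intent up front.
pub fn spawn<F, T>(intent: ThreadIntent, f: F) -> thread::JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(move || {
        apply_intent(intent);
        f()
    })
}

fn apply_intent(intent: ThreadIntent) {
    // On macOS this would call pthread_set_qos_class_self_np with
    // USER_INITIATED / UTILITY respectively; elsewhere it is a no-op.
    let _ = intent;
}

fn main() {
    let handle = spawn(ThreadIntent::Worker, || 2 + 2);
    println!("worker returned {}", handle.join().unwrap());
}
```

Making the intent a required parameter means no spawn site can silently fall through to a default class.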
To avoid having to change everything back in case I’ve misunderstood something, I’ve left all threads at the Utility QoS for now. Of course, this isn’t what I hope the code will look like in the end, but I figured I have to start somewhere :P

# References

- [Apple documentation related to QoS](https://developer.apple.com/library/archive/documentation/Performance/Conceptual/power_efficiency_guidelines_osx/PrioritizeWorkAtTheTaskLevel.html)
- [pthread API for setting QoS on XNU](https://github.com/apple-oss-distributions/libpthread/blob/67e155c94093be9a204b69637d198eceff2c7c46/include/pthread/qos.h)
- [Windows’s QoS classes](https://learn.microsoft.com/en-us/windows/win32/procthread/quality-of-service)

<details>
<summary>Full documentation of XNU QoS classes. This documentation is only available as a huge not-very-readable comment in a header file, so I’ve reformatted it and put it here for reference.</summary>

- **`QOS_CLASS_USER_INTERACTIVE`: A QOS class which indicates work performed by this thread is interactive with the user.**

  Such work is requested to run at high priority relative to other work on the system. Specifying this QOS class is a request to run with nearly all available system CPU and I/O bandwidth even under contention. This is not an energy-efficient QOS class to use for large tasks. The use of this QOS class should be limited to critical interaction with the user such as handling events on the main event loop, view drawing, animation, etc.

- **`QOS_CLASS_USER_INITIATED`: A QOS class which indicates work performed by this thread was initiated by the user and that the user is likely waiting for the results.**

  Such work is requested to run at a priority below critical user-interactive work, but relatively higher than other work on the system. This is not an energy-efficient QOS class to use for large tasks. Its use should be limited to operations of short enough duration that the user is unlikely to switch tasks while waiting for the results. Typical user-initiated work will have progress indicated by the display of placeholder content or modal user interface.

- **`QOS_CLASS_DEFAULT`: A default QOS class used by the system in cases where more specific QOS class information is not available.**

  Such work is requested to run at a priority below critical user-interactive and user-initiated work, but relatively higher than utility and background tasks. Threads created by `pthread_create()` without an attribute specifying a QOS class will default to `QOS_CLASS_DEFAULT`. This QOS class value is not intended to be used as a work classification, it should only be set when propagating or restoring QOS class values provided by the system.

- **`QOS_CLASS_UTILITY`: A QOS class which indicates work performed by this thread may or may not be initiated by the user and that the user is unlikely to be immediately waiting for the results.**

  Such work is requested to run at a priority below critical user-interactive and user-initiated work, but relatively higher than low-level system maintenance tasks. The use of this QOS class indicates the work should be run in an energy and thermally-efficient manner. The progress of utility work may or may not be indicated to the user, but the effect of such work is user-visible.

- **`QOS_CLASS_BACKGROUND`: A QOS class which indicates work performed by this thread was not initiated by the user and that the user may be unaware of the results.**

  Such work is requested to run at a priority below other work. The use of this QOS class indicates the work should be run in the most energy and thermally-efficient manner.

- **`QOS_CLASS_UNSPECIFIED`: A QOS class value which indicates the absence or removal of QOS class information.**

  As an API return value, may indicate that threads or pthread attributes were configured with legacy API incompatible or in conflict with the QOS class system.

</details>
Oh wait, is it possible to cancel bors? I’d still like to figure out what an appropriate QoS for the main loop would be …
@bors r-
Was about to comment on the … The only difference really is that …
In general, …
I think completion would also be latency-sensitive.
☀️ Try build successful - checks-actions |
I’d suggest also adding on-type formatting and possibly even syntax highlighting to that list? I’m not sure if syntax highlighting is slow enough for this to be a bad idea, though … I would also mention inlay hints, since those update while typing, but since VS Code adds that horrible artificial delay, the latency doesn’t matter. In terms of QoS, I didn’t spot any “special” threads in rust-analyzer that would need something other than Utility, except for the main loop and the cache-priming threads. Main loop activity is primarily caused by the user’s actions and is required for the UI to respond, so I would suggest making the main loop User Initiated. I’m not so clear about cache priming, though: can rust-analyzer respond to LSP messages while the caches are being primed?
… not being run on the main thread I’d consider a bug, actually. Syntax highlighting we can’t put there: it kicks off analysis, so it can block for a while. Inlay hints likewise. Anything doing semantic things is a no-go. Cache priming runs on the thread pool; it basically just populates some salsa queries, so it doesn’t block anything, I think (it supports query cancellation) – so yes, it can respond to LSP things while cache priming.
Even completion? 😞
Especially completions 😅 While we want to handle those fast, they can take a lot of time, so they should not be allowed to block the main thread.
Alright, so in that case cache priming seems to me more of a “this runs silently in the background to transparently make future things run faster” kind of deal. Does it make sense to mark that as Background instead? I restarted VS Code and used the stopwatch on my phone to measure how long it took rust-analyzer to index the rust-analyzer repo, running five trials for each row in the table. (There are plenty of problems with this methodology, but it gives a rough estimation of what to expect.)
This computer has 8 p-cores and 2 e-cores. Utility can run on any core type, while Background is restricted to the e-cores. Evidently, with this core configuration, using Background for indexing is a massive performance regression. Is this worth the potential battery saving? clangd used to use Background for indexing, but then switched to Utility to make indexing faster. Apple’s guidelines also recommend using Background for tasks that take between minutes and hours. So even though it’s likely that Background would save power here, I’m not sure the tradeoff is worth it.
For requests that are triggered by the user typing but which require semantic analysis (like completion and syntax highlighting!), we could have a secondary thread pool at a higher QoS. That way, these requests remain quick when the system is under load, without blocking the main thread and without unnecessarily making other requests use more energy than needed. This multi-pool approach is the standard in the Apple concurrency APIs (libdispatch), where you have a separate thread pool (well, “concurrent dispatch queue” in libdispatch-speak) for each QoS level. Implementing this is trivial (just make …
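As a sketch of that routing, here’s how the classification could look if each LSP method were mapped to the pool it should run on. The method names are real LSP methods, but this particular classification is just the proposal from this thread, not shipped behavior:

```rust
/// Which pool (and therefore which QoS class) a request should run under.
#[derive(Debug, PartialEq, Eq)]
enum QoS {
    /// Secondary pool for typing-driven requests.
    UserInitiated,
    /// Default pool for everything else.
    Utility,
}

/// Hypothetical routing: requests triggered directly by typing go to the
/// higher-QoS pool; everything else goes to the Utility pool.
fn pool_for(method: &str) -> QoS {
    match method {
        "textDocument/completion"
        | "textDocument/onTypeFormatting"
        | "textDocument/semanticTokens/full" => QoS::UserInitiated,
        _ => QoS::Utility,
    }
}

fn main() {
    println!("{:?}", pool_for("textDocument/completion"));
    println!("{:?}", pool_for("textDocument/references"));
}
```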
I think having prime caching on Utility should be fine; the slowdown is too high otherwise. People are free to turn it off for power-saving reasons, since there is a config for it.
I ran some benchmarks and found that running all requests under Utility makes performance fall off a cliff when threads at a QoS higher than Utility are swamping the system (sounds a lot like a rustc build …). #14888 fixes this. I simulated such a situation and performed the same edit many times. I used rust-analyzer’s profiling infrastructure (which is awesome!) to measure the processing time for syntax highlighting and completion requests:
This is bad enough that we should make sure the PRs don’t get split across releases, I think.
@bors delegate+ |
@bors r+ |
☀️ Test successful - checks-actions |
👀 Test was successful, but fast-forwarding failed: 422 Changes must be made through a pull request. |
Prioritize threads affected by user typing

To this end I’ve introduced a new custom thread pool type which can spawn threads using each QoS class. This way we can run latency-sensitive requests under one QoS class and everything else under another QoS class. The implementation is very similar to that of the `threadpool` crate (which is currently used by rust-analyzer) but with unused functionality stripped out. I’ll have to rebase on master once #14859 is merged, but I think everything else is alright :D
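A minimal sketch of such a pool, assuming the same shape as the `threadpool` crate with everything non-essential stripped out: a channel of boxed jobs drained by N workers. The comment marks where each worker would set its QoS class; names here are illustrative, not the PR’s actual code:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

pub struct Pool {
    sender: mpsc::Sender<Job>,
    handles: Vec<thread::JoinHandle<()>>,
}

impl Pool {
    /// Spawn `threads` workers sharing one job queue.
    pub fn new(threads: usize) -> Pool {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let handles = (0..threads)
            .map(|_| {
                let receiver = Arc::clone(&receiver);
                thread::spawn(move || {
                    // In the real version, the worker would set its QoS
                    // class here, before entering the job loop.
                    loop {
                        // The lock is released before the job runs, since the
                        // guard is a temporary within the `let` statement.
                        let job = match receiver.lock().unwrap().recv() {
                            Ok(job) => job,
                            Err(_) => break, // channel closed: shut down
                        };
                        job();
                    }
                })
            })
            .collect();
        Pool { sender, handles }
    }

    pub fn spawn(&self, job: impl FnOnce() + Send + 'static) {
        self.sender.send(Box::new(job)).unwrap();
    }

    /// Close the channel and wait for all workers to finish.
    pub fn join(self) {
        drop(self.sender);
        for handle in self.handles {
            handle.join().unwrap();
        }
    }
}

fn main() {
    let pool = Pool::new(4);
    let (tx, rx) = mpsc::channel();
    for i in 0..8 {
        let tx = tx.clone();
        pool.spawn(move || tx.send(i).unwrap());
    }
    drop(tx);
    let sum: i32 = rx.iter().sum();
    pool.join();
    println!("sum of results: {sum}");
}
```

Building one pool per QoS class on top of this is then just a matter of constructing two `Pool`s with different intents and routing requests to the right one.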