[Optim] Runtime detection of MULX and ADX support #10
Yes, and it's argued that they are so widespread that non-ADX code is on the brink of irrelevance. At least close enough to warrant compile-time detection, especially since more and more software is distributed as source code.
And? :-) :-) :-) On a serious note though, performance is not all about benchmarking a tight loop around a short subroutine. What I'm trying to say is that a conditional branch in a short subroutine is not an optimal placement. It surely looks adequate in a controlled benchmark environment, but in real life, where the branch predictor is a limited resource, you risk misprediction penalties, and the said penalties are comparable to the subroutine's execution time. It can even be affected by linking order and other link-time optimizations. A run-time switch is better placed as high in the call hierarchy as possible, say at point-operation level. Or at whole-library level, i.e. at compile time ;-)
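The placement argument can be sketched in C: hoist the switch to point-operation level so a single branch covers the whole inner loop. All names below are illustrative stand-ins, not blst's real API.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the ADX and portable field multiplications. */
static int adx_calls, portable_calls;
static void fp_mul_adx(void)      { adx_calls++; }
static void fp_mul_portable(void) { portable_calls++; }

/* Stand-in for a CPUID query; a real build would probe CPUID leaf 7. */
static bool cpu_has_adx(void) { return false; }

/* The run-time switch sits at point-operation level: the backend is
 * chosen once per point operation, not once per field multiplication. */
static void point_double(int n_muls) {
    void (*fp_mul)(void) = cpu_has_adx() ? fp_mul_adx : fp_mul_portable;
    for (int i = 0; i < n_muls; i++)
        fp_mul();
}
```

The branch is still there, but it executes once per point operation instead of thousands of times per pairing, which is the placement being argued for.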
At the end of the day performance is not necessarily about doing things in the shortest possible time, but rather about meeting some kind of a deadline [preferably with confidence]. In other words, if a user-facing or "desktop-class" application is not dependent on 100% availability of CPU time 24 hours a day, one doesn't have to feel bad about not using the latest instruction extension over a not-that-big margin. As long as the application actually does the job 100%. Well, one can still make the case that faster code would consume less power. Though one has to recognize that the relation is far from linear, in the sense that 20% faster code won't yield 20% savings, because a sizable portion of power is spent on…
I'm also in favor of auto-detection.
This is what we do presently. Here are some pain points:
Is it really best to do this at the layer above BLST (i.e., the application that imports BLST)? As I understand it, using the faster CPU instructions is always better if you support them and always worse if you don't. In my experience of using this library for several months, the only thing the application gets from the ability to choose is the risk of making a bad decision. If it turns out that some user really wants to ignore runtime detection for whatever reason, then an "ignore runtime detection" runtime/compile flag could be used, ultimately placing us in a position of being "as fast as is safe" out-of-the-box. Evaluating processor capability inside BLST means less code duplication across applications, and it keeps the "which instruction set to use" concerns contained. If BLST finds benefit from some new CPU instruction, then with self-contained runtime detection it can be implemented without breaking upstream apps (i.e., most of Eth2).
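An "ignore runtime detection" escape hatch with detection kept inside the library could look like the following sketch. The env var name and both functions are hypothetical, not real blst knobs.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a CPUID probe; assume the CPU supports ADX here. */
static bool cpu_has_adx(void) { return true; }

/* Library-internal policy: fast path by default, with a user-visible
 * opt-out. MYLIB_PORTABLE is a made-up knob for illustration. */
static bool use_adx_path(void) {
    const char *force = getenv("MYLIB_PORTABLE");
    if (force != NULL && strcmp(force, "0") != 0)
        return false;   /* user explicitly opted out of the fast path */
    return cpu_has_adx();
}
```

Applications then never have to decide anything: the default is "as fast as is safe", and the flag exists only for the rare user who wants to force the portable path.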
As above.
I'll try to substantiate the feature request.

### Context

#### Performance

Current clients are squashing bottlenecks in sync, IO and hashing via various techniques, including caching, in-memory DBs and in general avoiding work. For example, when syncing, the bottlenecks are disk speed, network speed and cryptography; on a fast network with an SSD, only cryptography remains. 20% faster verification translates to a couple of hours' difference in syncing speed, and will translate to days as the blockchain grows.

#### Ergonomics

Clients already have a difficult time making blockchain software safe and ergonomic to use; they try to abstract theoretical and low-level concepts away from the uninitiated. Furthermore, consumers of clients like staking pools, or packagers like DappNode, might run on cloud services where CPUs are not stable, and would be forced to use the portable build. A 20% performance difference would translate to significant cost savings. Runtime detection would also save significant maintenance, documentation and debugging headaches: I expect all clients have "What to do if you receive SIGILL" in their documentation at the moment.

### Analysis of runtime detection

#### Using branches

I'd argue that even if CPU runtime detection is done via a branch at the lowest possible level, the cost is negligible. Case in point: both Gurvy and my own library, Constantine, use runtime detection just before fp_mul and are faster than BLST on G1 doubling. Source: eccbench 2021-01-29 (https://hackmd.io/@zkteam/eccbench#G1-mixed-additiondoubling2):
Hardware predictors since Haswell have been significantly improved; so much so that for interpreter dispatch (unpredictable, with up to 255 opcodes!) it's better to use switch dispatching than a table of function pointers:
Furthermore, while it's true that hardware predictors are limited in space, there are still 4096 Branch Target Buffer entries (https://www.agner.org/optimize/microarchitecture.pdf, https://xania.org/201602/bpu-part-one, https://stackoverflow.com/questions/16513943/how-can-i-get-my-cpus-branch-target-bufferbtb-size). As the library is fully constant-time with almost no branches otherwise, the predictor is almost completely unused by it. I'm not sure what the situation on ARM is, but BLST doesn't use branch predictors due to its constant-time nature. Also, a branch misprediction can only happen on the first call to multiplication and would cost at most ~10 cycles of misprediction/pipeline flush (https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/); every other call would be free. An fp_mul operation is about 100 cycles with ADCX/ADOX or 120 cycles without, so across the 12000+ multiplications needed for a full pairing, a 10-cycle misprediction cost doesn't even register at 0.0001%. Lastly, it's likely that the branch misprediction, if it happens at all, is hidden in the cache miss to fetch the Fp element data from memory (which is needed whichever branch is chosen).

#### Not using branches

Alternatively, if branches need to be avoided for other reasons, the technique from GCC function multiversioning can be used: an initial dispatch function, stored in global memory, replaces itself with the proper function for the arch on the first call. This is detailed in https://www.agner.org/optimize/optimizing_cpp.pdf

#### Detecting at a high level

We can indeed do CPU detection at a higher level. However, given the previous sections, I think this is extra code that saves a couple of nanoseconds on a full pairing and significantly complicates builds.
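The self-replacing dispatcher described in the GCC/Agner Fog technique can be sketched like this. Names are illustrative, and a production version would want an atomic pointer store for thread safety.

```c
#include <stdbool.h>

/* Illustrative backends standing in for the ADX and portable paths. */
static int last_backend;                    /* 1 = ADX, 2 = portable */
static void fp_mul_adx(void)      { last_backend = 1; }
static void fp_mul_portable(void) { last_backend = 2; }

static bool cpu_has_adx(void) { return false; }   /* stand-in for CPUID */

/* The pointer initially targets the resolver; the first call rewrites
 * it, so every later call jumps straight to the chosen backend with no
 * conditional branch on the hot path. */
static void fp_mul_resolve(void);
static void (*fp_mul)(void) = fp_mul_resolve;

static void fp_mul_resolve(void) {
    fp_mul = cpu_has_adx() ? fp_mul_adx : fp_mul_portable;
    fp_mul();
}
```

This is essentially what the dynamic linker does for GCC's `ifunc`/function-multiversioning resolvers, only done by hand and therefore portable across toolchains.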
I'll be replying in smaller "bites." (Unfortunately maybe not as fast as you'd wish/expect;-) Also, these won't be categorical "no"-s, but we have to recognize certain things.
Rendered unacceptable for cryptographic applications in January 2018. On a related note, here is a homework assignment. Compile all the different things with
Constant-time-ness doesn't mean that there are no conditional branches, only that they are not dependent on secret data.
Going back to the beginning of this remark, an excessive amount of branch prediction logic is actually bad for security. It doesn't matter how well the processor can predict branches in a controlled[!] environment; as long as one can indirectly affect the logic in some way, the only viable strategy for cryptographic code is to have as few conditional branches as possible, especially on critical paths. You might say "you're being over-paranoid." Yeah, but unfortunately such statements work only till somebody figures out something...
I think most of us think of many of these functions at a higher level than the individual math operations - we sign and verify things - thus, one possible way forward would be to have each of the specialized implementations for a "family" of instruction sets exported, and allow both of them to be built into a single "build" by creating several "namespaces":
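The namespacing idea could be approximated with plain preprocessor prefixing: compile the same sources twice with different suffixes so both variants coexist in one binary, then dispatch between the resulting symbols. This is a sketch with made-up names; blst's real build does not expose such a macro.

```c
/* Build the same translation unit twice, e.g. with
 *   -DNS_SUFFIX=_adx      (ADX-enabled object files)
 *   -DNS_SUFFIX=_portable (portable object files)
 * so both variants end up in one binary under distinct names. */
#ifndef NS_SUFFIX
#define NS_SUFFIX _portable
#endif

#define NS_CAT_(a, b) a##b
#define NS_CAT(a, b)  NS_CAT_(a, b)
#define NS(name)      NS_CAT(name, NS_SUFFIX)

/* A toy exported function; the macro turns it into blst_sign_portable
 * or blst_sign_adx depending on the build flags above. */
int NS(blst_sign)(int x) { return x + 1; }
```

A thin dispatcher at the sign/verify level would then pick `_adx` or `_portable` symbols once, based on a CPUID probe.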
Hey hey heeey :) @dot-asm We've hit this issue a few days ago and have been thinking a lot about how to work around it. For us it's kind of important to support pre-compiled binaries, since most people prefer docker workflows; but even those who do not generally have different build and execution environments. Currently the only two options we have are to either blindly disable the ADX optimisations and make it slower for everybody, or to leave the optimisation in and hope for the best. The former seems to entail a significant hit on processing performance (measured in hours of extra processing time); whereas the latter is problematic in cloud environments where there's no direct control over the CPU allocation (or cross-cloud deployments, where which CPU we get is even more of a lottery). Would it be possible to pick up this idea of runtime detection of ADX instruction sets and run the faster or slower algorithm based on that? If you don't have the time to dive into the implementation yourself, would you be open to us sending in a PR to do it? Thanks, Peter
I agree that having runtime detection would be hugely beneficial for consumers of this library, and would love to see this issue get resolved, the sooner the better.
The third option is to load two copies of blst (one with ADX and one without) and at runtime check the CPU and call the right copy. It's ugly, but we are going to try this approach in Grandine unless a proper solution is in place by then.
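The two-copies approach amounts to dlopen/dlsym dispatch. A minimal sketch follows; the library names are hypothetical (no such .so files ship today), and the function simply returns NULL when the chosen library or symbol is missing.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdbool.h>

static bool cpu_has_adx(void) { return false; }   /* stand-in for CPUID */

/* Open whichever build matches the CPU and resolve one symbol from it.
 * libblst_adx.so / libblst_portable.so are made-up names for the two
 * hypothetical builds of the library. */
static void *load_backend_symbol(const char *sym) {
    const char *lib = cpu_has_adx() ? "libblst_adx.so"
                                    : "libblst_portable.so";
    void *h = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
    return h != NULL ? dlsym(h, sym) : NULL;
}
```

The ugliness Saulius mentions is real: both copies occupy memory, symbol lookups are stringly-typed, and every exported function needs a trampoline.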
How easy is it to auto-namespace the library? Is there a compiler flag we can turn on for this, or does it require a fork?
Given the significant speed increase (~20% for high level signing/verifying) provided by BMI2 and ADX support (for MULX, ADCX, ADOX instructions) (bench: status-im/nim-blst#1)
it would probably be worthwhile to autodetect CPU support for those instructions at runtime.
This is especially interesting because CPUs with these instructions are now widespread (Broadwell, 2015, for Intel; Ryzen, 2017, for AMD).
Cloudflare BN256 does that for example https://github.com/cloudflare/bn256/blob/3cac37b6/gfp_amd64.s#L108-L129
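On x86-64 with GCC or Clang, the check Cloudflare does in assembly can be done from C via `<cpuid.h>`: ADX and BMI2 are reported in CPUID leaf 7, sub-leaf 0, EBX bits 19 and 8 respectively. A sketch, not blst's actual detection code:

```c
#include <cpuid.h>
#include <stdbool.h>

/* CPUID.(EAX=7,ECX=0):EBX, per the Intel SDM:
 *   bit 8  = BMI2 (MULX)
 *   bit 19 = ADX  (ADCX/ADOX) */
static bool cpu_has_bmi2_adx(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;                /* CPU doesn't report leaf 7 */
    return (ebx & (1u << 8)) != 0 && (ebx & (1u << 19)) != 0;
}
```

Calling this once at startup and caching the result is enough; the answer cannot change while the process runs on the same machine.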