[Optim] Runtime detection of MULX and ADX support #10
Yes, and it's argued that they are so widespread that non-ADX code is on the brink of irrelevance. At least close enough to warrant compile-time detection, especially since more and more software is distributed as source code.
And? :-) :-) :-) On a serious note though, performance is not all about benchmarking a tight loop around a short subroutine. What I'm trying to say is that a conditional branch in a short subroutine is not an optimal placement. It surely looks adequate in a controlled benchmark environment, but in real life, where the branch predictor is a limited resource, you risk misprediction penalties, and the said penalties are comparable to the subroutine's execution time. It can even be affected by linking order and other link-time optimizations. A run-time switch is better placed as high in the call hierarchy as possible, say at point-operation level. Or at whole-library level, i.e. at compile time ;-)
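The placement argument can be sketched in C: hoist the switch to point-operation level so a single branch covers the whole inner loop. All names below are illustrative stand-ins, not blst's real API.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the ADX and portable field multiplications. */
static int adx_calls, portable_calls;
static void fp_mul_adx(void)      { adx_calls++; }
static void fp_mul_portable(void) { portable_calls++; }

/* Stand-in for a CPUID query; a real build would probe CPUID leaf 7. */
static bool cpu_has_adx(void) { return false; }

/* The run-time switch sits at point-operation level: the backend is
 * chosen once per point operation, not once per field multiplication. */
static void point_double(int n_muls) {
    void (*fp_mul)(void) = cpu_has_adx() ? fp_mul_adx : fp_mul_portable;
    for (int i = 0; i < n_muls; i++)
        fp_mul();
}
```

The branch is still there, but it executes once per point operation instead of thousands of times per pairing, which is the placement being argued for.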
At the end of the day performance is not necessarily about doing things in the shortest possible time, but rather about meeting some kind of a deadline [preferably with confidence]. In other words, if a user-facing or "desktop-class" application is not dependent on 100% availability of CPU time 24 hours a day, one doesn't have to feel bad about not using the latest instruction extension over a not-that-big margin. As long as the application actually does the job 100%. Well, one can still make the case that faster code would consume less power. Though one has to recognize that the relation is far from linear, in the sense that 20% faster code won't yield 20% savings, because a sizable portion of power is spent on…
I'm also in favor of auto-detection.
This is what we do presently. Here are some pain points:
Is it really best to do this at the layer above BLST (i.e., the application that imports BLST)? As I understand it, using the faster CPU instructions is always better if you support them and always worse if you don't. In my experience of using this library for several months, the only thing the application gets from the ability to choose is the risk of making a bad decision. If it turns out that some user really wants to ignore runtime detection for whatever reason, then an "ignore runtime detection" runtime/compile flag could be used, ultimately placing us in a position of being "as fast as is safe" out-of-the-box. Evaluating processor capability inside BLST means less code duplication across applications, and it keeps the "which instruction set to use" concerns contained. If BLST finds benefit from some new CPU instruction, then with self-contained runtime detection it can be implemented without breaking upstream apps (i.e., most of Eth2).
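An "ignore runtime detection" escape hatch with detection kept inside the library could look like the following sketch. The env var name and both functions are hypothetical, not real blst knobs.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a CPUID probe; assume the CPU supports ADX here. */
static bool cpu_has_adx(void) { return true; }

/* Library-internal policy: fast path by default, with a user-visible
 * opt-out. MYLIB_PORTABLE is a made-up knob for illustration. */
static bool use_adx_path(void) {
    const char *force = getenv("MYLIB_PORTABLE");
    if (force != NULL && strcmp(force, "0") != 0)
        return false;   /* user explicitly opted out of the fast path */
    return cpu_has_adx();
}
```

Applications then never have to decide anything: the default is "as fast as is safe", and the flag exists only for the rare user who wants to force the portable path.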
As above.
I'll try to substantiate the feature request.

### Context

#### Performance

Current clients are squashing bottlenecks in sync, IO and hashing via various techniques, including caching, in-memory DBs and in general avoiding work. For example, when syncing, the bottlenecks are disk speed, network speed and cryptography; on a fast network with an SSD, only cryptography remains. 20% faster verification translates to a couple of hours' difference in syncing speed, and will translate to days as the blockchain grows.

#### Ergonomics

Clients already have a difficult time making blockchain software safe and ergonomic to use; they try to abstract theoretical and low-level concepts away from the uninitiated. Furthermore, consumers of clients like staking pools, or packagers like DappNode, might run on cloud services where CPUs are not stable, and would be forced to use the portable build. A 20% performance difference would translate to significant cost savings. Runtime detection would also save significant maintenance, documentation and debugging headaches: I expect all clients have "What to do if you receive SIGILL" in their documentation at the moment.

### Analysis of runtime detection

#### Using branches

I'd argue that even if CPU runtime detection is done via a branch at the lowest possible level, the cost is negligible. Case in point: both Gurvy and my own library, Constantine, use runtime detection just before fp_mul and are faster than BLST on G1 doubling. Source: eccbench 2021-01-29 (https://hackmd.io/@zkteam/eccbench#G1-mixed-additiondoubling2):
Hardware predictors since Haswell have been significantly improved; so much so that for interpreter dispatch (unpredictable, with up to 255 opcodes!) it's better to use switch dispatching than a table of function pointers:
Furthermore, while it's true that hardware predictors are limited in space, there are still 4096 Branch Target Buffer entries (https://www.agner.org/optimize/microarchitecture.pdf, https://xania.org/201602/bpu-part-one, https://stackoverflow.com/questions/16513943/how-can-i-get-my-cpus-branch-target-bufferbtb-size). As the library is fully constant-time with almost no branches otherwise, the predictor is almost completely unused by it. I'm not sure what the situation on ARM is, but BLST doesn't use branch predictors due to its constant-time nature. Also, a branch misprediction can only happen on the first call to multiplication and would cost at most ~10 cycles of misprediction/pipeline flush (https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/); every other call would be free. An fp_mul operation is about 100 cycles with ADCX/ADOX or 120 cycles without, so across the 12000+ multiplications needed for a full pairing, a 10-cycle misprediction cost doesn't even register at 0.0001%. Lastly, it's likely that the branch misprediction, if it happens at all, is hidden in the cache miss to fetch the Fp element data from memory (which is needed whichever branch is chosen).

#### Not using branches

Alternatively, if branches need to be avoided for other reasons, the technique from GCC function multiversioning can be used: an initial dispatch function, stored in global memory, replaces itself with the proper function for the arch on the first call. This is detailed in https://www.agner.org/optimize/optimizing_cpp.pdf

#### Detecting at a high level

We can indeed do CPU detection at a higher level. However, given the previous sections, I think this is extra code that saves a couple of nanoseconds on a full pairing and significantly complicates builds.
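The self-replacing dispatcher described in the GCC/Agner Fog technique can be sketched like this. Names are illustrative, and a production version would want an atomic pointer store for thread safety.

```c
#include <stdbool.h>

/* Illustrative backends standing in for the ADX and portable paths. */
static int last_backend;                    /* 1 = ADX, 2 = portable */
static void fp_mul_adx(void)      { last_backend = 1; }
static void fp_mul_portable(void) { last_backend = 2; }

static bool cpu_has_adx(void) { return false; }   /* stand-in for CPUID */

/* The pointer initially targets the resolver; the first call rewrites
 * it, so every later call jumps straight to the chosen backend with no
 * conditional branch on the hot path. */
static void fp_mul_resolve(void);
static void (*fp_mul)(void) = fp_mul_resolve;

static void fp_mul_resolve(void) {
    fp_mul = cpu_has_adx() ? fp_mul_adx : fp_mul_portable;
    fp_mul();
}
```

This is essentially what the dynamic linker does for GCC's `ifunc`/function-multiversioning resolvers, only done by hand and therefore portable across toolchains.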
I'll be replying in smaller "bites." (Unfortunately maybe not as fast as you'd wish/expect;-) Also, these won't be categorical "no"-s, but we have to recognize certain things.
Rendered unacceptable for cryptographic applications in January 2018. On a related note, here is a homework assignment. Compile all the different things with
Constant-time-ness doesn't mean that there are no conditional branches, only that they are not dependent on secret data.
Going back to the beginning of this remark, an excessive amount of branch prediction logic is actually bad for security. It doesn't matter how well the processor can predict branches in a controlled[!] environment; as long as one can indirectly affect the logic in some way, the only viable strategy for cryptographic code is to have as few conditional branches as possible, especially on critical paths. You might say "you're being over-paranoid." Yeah, but unfortunately such statements work only till somebody figures out something...
I think most of us think of many of these functions at a higher level than the individual math operations - we sign and verify things - thus, one possible way forward would be to have each of the specialized implementations for a "family" of instruction sets exported, and allow both of them to be built into a single "build" by creating several "namespaces":
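The namespacing idea could be approximated with plain preprocessor prefixing: compile the same sources twice with different suffixes so both variants coexist in one binary, then dispatch between the resulting symbols. This is a sketch with made-up names; blst's real build does not expose such a macro.

```c
/* Build the same translation unit twice, e.g. with
 *   -DNS_SUFFIX=_adx      (ADX-enabled object files)
 *   -DNS_SUFFIX=_portable (portable object files)
 * so both variants end up in one binary under distinct names. */
#ifndef NS_SUFFIX
#define NS_SUFFIX _portable
#endif

#define NS_CAT_(a, b) a##b
#define NS_CAT(a, b)  NS_CAT_(a, b)
#define NS(name)      NS_CAT(name, NS_SUFFIX)

/* A toy exported function; the macro turns it into blst_sign_portable
 * or blst_sign_adx depending on the build flags above. */
int NS(blst_sign)(int x) { return x + 1; }
```

A thin dispatcher at the sign/verify level would then pick `_adx` or `_portable` symbols once, based on a CPUID probe.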
Hey hey heeey :) @dot-asm We've hit this issue a few days ago and have been thinking a lot about how to work around it. For us it's kind of important to support pre-compiled binaries, since most people prefer docker workflows; but even those who do not generally have different build and execution environments. Currently the only two options we have are to either blindly disable the ADX optimisations and make it slower for everybody, or to leave the optimisation in and hope for the best. The former seems to entail a significant hit on processing performance (measured in hours of extra processing time); whereas the latter is problematic in cloud environments where there's no direct control over the CPU allocation (or cross-cloud deployments, where which CPU we get is even more of a lottery). Would it be possible to pick up this idea of runtime detection of ADX instruction sets and run the faster or slower algorithm based on that? If you don't have the time to dive into the implementation yourself, would you be open to us sending in a PR to do it? Thanks, Peter
I agree that having runtime detection would be hugely beneficial for consumers of this library, and would love to see this issue get resolved, the sooner the better.
The third option is to load two copies of blst (one with ADX and one without) and at runtime check the CPU and call the right copy. It's ugly, but we are going to try this approach in Grandine unless a proper solution is in place by then.
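The two-copies approach amounts to dlopen/dlsym dispatch. A minimal sketch follows; the library names are hypothetical (no such .so files ship today), and the function simply returns NULL when the chosen library or symbol is missing.

```c
#include <dlfcn.h>
#include <stddef.h>
#include <stdbool.h>

static bool cpu_has_adx(void) { return false; }   /* stand-in for CPUID */

/* Open whichever build matches the CPU and resolve one symbol from it.
 * libblst_adx.so / libblst_portable.so are made-up names for the two
 * hypothetical builds of the library. */
static void *load_backend_symbol(const char *sym) {
    const char *lib = cpu_has_adx() ? "libblst_adx.so"
                                    : "libblst_portable.so";
    void *h = dlopen(lib, RTLD_NOW | RTLD_LOCAL);
    return h != NULL ? dlsym(h, sym) : NULL;
}
```

The ugliness Saulius mentions is real: both copies occupy memory, symbol lookups are stringly-typed, and every exported function needs a trampoline.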
How easy is it to auto-namespace the library? Is there a compiler flag we can turn on for this, or does it require a fork?
Given the significant speed increase (~20% for high level signing/verifying) provided by BMI2 and ADX support (for MULX, ADCX, ADOX instructions) (bench: status-im/nim-blst#1)
it would probably be worthwhile to autodetect CPU support for those instructions at runtime.
This is especially interesting because CPUs with these instructions are now widespread (Broadwell, 2015, for Intel; Ryzen, 2017, for AMD).
Cloudflare BN256 does that for example https://github.com/cloudflare/bn256/blob/3cac37b6/gfp_amd64.s#L108-L129
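On x86-64 with GCC or Clang, the check Cloudflare does in assembly can be done from C via `<cpuid.h>`: ADX and BMI2 are reported in CPUID leaf 7, sub-leaf 0, EBX bits 19 and 8 respectively. A sketch, not blst's actual detection code:

```c
#include <cpuid.h>
#include <stdbool.h>

/* CPUID.(EAX=7,ECX=0):EBX, per the Intel SDM:
 *   bit 8  = BMI2 (MULX)
 *   bit 19 = ADX  (ADCX/ADOX) */
static bool cpu_has_bmi2_adx(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return false;                /* CPU doesn't report leaf 7 */
    return (ebx & (1u << 8)) != 0 && (ebx & (1u << 19)) != 0;
}
```

Calling this once at startup and caching the result is enough; the answer cannot change while the process runs on the same machine.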