Add `f16b` floating-point type for native support of `bfloat16` #2690
After #2686 exploded overnight to the point where I had trouble even following it, I'm almost afraid to leave the first comment here too.
I'm also curious how this interacts with SIMD. Many of the domains where I see bfloats being proposed also benefit greatly from data parallelism and low-level performance tuning, so my (possibly inaccurate) mental picture was that bfloat-using software heavily uses intrinsics rather than scalar code, or possibly auto-generates the instructions from DSLs.
Why can't this be implemented in a crate on crates.io? The RFC does not mention this, and that would seem like the most important design alternative to adding a new primitive type.
AFAICT, the only proposed API that a crate can't support is literals.
Also, how does this fit in the 2019 Roadmap?
The justification for having native support, rather than a crate, is that this needs dedicated code generation in LLVM to make use of hardware support for bfloat16 on specific platforms. Falling back to software emulation would remain possible on platforms without such support.
My expectation would be that code using bfloat16 may use intrinsics, carefully written code, or both. Intrinsics force the compiler to generate specific instructions, rather than giving it the flexibility to group operations and take advantage of whatever instructions the native platform has.
bfloat16 is heavily used in matrix operations, especially in neural network implementations. That includes both writing kernels and moving data around, and for both purposes I'd like to have native support in Rust.
If this were a purely software format, I wouldn't suggest giving it native support, and a separate crate would make perfect sense. However, we're talking about a new floating-point format that has hardware support in CPUs and coprocessors, and is extensively used by prominent floating-point-crunching software.
Keep in mind that right now I'm proposing support in nightly, and I fully expect this to bake for a while before becoming stable.
I'll make several updates to the RFC to address the various concerns raised, including some clear discussion of the idea of a separate crate.
@darconeous As the RFC mentions, it could certainly be considered. I would suggest filing a separate RFC.
The proposal in this RFC is to not have any type be IEEE binary16.
One notable downside of the IEEE binary16 type, however: if it isn't supported in hardware, it's far more expensive to emulate, requiring either software implementation or conversions, along with handling for overflow and similar. That makes it hard to support universally.
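To make that emulation-cost contrast concrete, here's a minimal sketch (not part of the RFC; the helper names are mine): bfloat16 is just the top 16 bits of an IEEE f32, so widening is a shift and narrowing is at worst a round-and-truncate, with no new exponent or overflow handling. Emulating IEEE binary16 instead would require a genuine format conversion, since its exponent field is a different width.

```rust
/// Widen a bfloat16 (stored as its raw bits) to f32: just a left shift.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Narrow an f32 to bfloat16, rounding the discarded low 16 bits
/// to nearest, ties to even. (NaN payloads are ignored in this sketch.)
fn f32_to_bf16(x: f32) -> u16 {
    let bits = x.to_bits();
    let round = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(round) >> 16) as u16
}

fn main() {
    // 1.0f32 is 0x3F80_0000, so its bfloat16 bits are 0x3F80.
    assert_eq!(f32_to_bf16(1.0), 0x3F80);
    assert_eq!(bf16_to_f32(0x3F80), 1.0);
    // A round trip loses only low mantissa bits.
    let x = 3.14159_f32;
    let y = bf16_to_f32(f32_to_bf16(x));
    assert!((x - y).abs() < 0.01);
}
```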
Which platforms? Could you be more precise about which code generation is needed? And repeating @rkruppe's question: why aren't intrinsics enough?
How is the LLVM-IR emitted using intrinsics any different from "whatever else" this RFC is proposing? (It is completely unclear to me at this point what exactly this RFC is proposing).
Expanding on what @gnzlbg wrote: a "native" type in Rust will still call LLVM intrinsics. If we expose those via Rust intrinsics and then higher-level functions, you will lose nothing (even Rust's checked addition uses intrinsics). IIRC, exposing more unstable intrinsics in `core::intrinsics` is not a significant burden.
Once the support crate has given us enough data to know what to expose in order to make it stable, an RFC can be opened to propose that stable API.
Expanding on what @oli-obk said a bit more: if we wanted to support IEEE-754 binary16, the same approach would apply.
But this RFC is not about binary16.
The only argument in favor of this RFC with some weight is @joshtriplett's: "this needs dedicated code generation in LLVM to make use of hardware support for bfloat16 on specific platforms".
Yet AFAICT from reading the LangRef (which is admittedly often incomplete), LLVM does not have data types and/or intrinsics for bfloat16.
For some targets, some intrinsics for operating on bfloat16 values do exist.
I need to add that I don't think these RFCs by the @rust-lang/lang team are a good use of our time.
It took less time to land some of these intrinsics than it is taking to discuss this RFC.
It also feels as though the RFC was written just because it could be written, not because it directly improves or solves a problem that actually needs solving, or because it is part of the roadmap.
All of this would have been completely avoidable if somebody had reviewed this beforehand, or if there had been some discussion about it somewhere (an rfcs-repo issue, an internals thread, a pre-RFC, etc.). If I could vote, I'd say let's close this, open an issue or an internals thread, and start by exploring basic design questions, like "why can't this be in a library?".
E.g. "a crates.io crate already exists to solve this problem, but it does not work well because of X, therefore we need a new primitive type". Such a crate does exist for IEEE-754 binary16.
As called out in the RFC itself, LLVM does not currently have support for bfloat16.
"yet" would be the operative word. Support in toolchains (gcc, binutils, LLVM, etc) is in progress right now. The intention would be to add type-level support in LLVM (not just intrinsics) and then pipe that through to rustc.
I certainly would agree that without such native support, this could just as easily be a crate. The whole point of this RFC is to enable native support.
Assume for the moment the development of such support in LLVM. Would you agree that enabling such support in Rust would require native support, not just intrinsics and a crate?
This is precisely the argument that motivated submitting this RFC in the first place.
As I understand it, types natively supported by LLVM get handled directly rather than via intrinsics. I certainly want to see the intrinsics as well, but the point of this RFC is to work towards native support in LLVM (and future backends).
This RFC isn't asking for intrinsic support. Again, the whole point of this RFC is to enable native support, ideally in time for an upcoming processor. I'd like to make Rust an obvious first-class choice for development on that platform.
I'd really appreciate it if you would stop repeatedly taking insulting approaches like this, such as treating your own feeling that a problem doesn't need solving as an authoritative pronouncement that it doesn't need solving.
I'm certainly interested in discussing the issue and approach. If your position is "we don't need this", that position has been heard.
That question has been asked, and answered (as has the question of why intrinsics don't suffice).
True, regular integer addition and such work without intrinsics, but AFAIK the optimizer understands a lot of the intrinsics. That is not a guarantee, though, especially with the introduction of a new type and new intrinsics that would first need support in the optimizer.
One thing still speaking against making it a native type: it's much easier to implement the fallback (for unsupported platforms) for a non-native type, since you can just `cfg` things in the Rust source. There are some non-fun workarounds in the compiler for `u128`, for example.
As @oli-obk described from various angles above, whether we present bfloat16 operations to Rust programmers as first-class operations on a new primitive type or as intrinsics does not make an iota of difference for the optimization and code-generation pipeline. And in any case, what LLVM IR we need to generate for these operations is the same either way.
So the decision should be based solely on source-level factors, such as ease of use and portability. @joshtriplett rightly points out that platform-specific intrinsics would be a burden for portability, and that the type can be useful for its compact memory footprint even if the target hardware doesn't specifically support it. So if we go with intrinsics in rustc, we may want to make them portable (just like many other existing intrinsics are portable). However, if we're going to commit to providing these intrinsics portably and on stable, then I don't see much justification for withholding the significant usability boost of a proper type with literals, arithmetic operators, etc. from users.
As pointed out before, such a type could be built in library code, even as a third-party crate, on top of intrinsics. What's more, this library could be built even on platform-specific intrinsics, by judicious use of `cfg`.
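As a sketch of what that could look like (all of the function names here are invented, and `avx512bf16` is used only as an example of a target feature a crate might gate on), a library could select a hardware path where one exists and otherwise fall back to portable emulation:

```rust
/// Portable fallback: widen both operands to f32, add, narrow again
/// (truncating for brevity; a real crate would round).
fn bf16_add_soft(a: u16, b: u16) -> u16 {
    let wa = f32::from_bits((a as u32) << 16);
    let wb = f32::from_bits((b as u32) << 16);
    ((wa + wb).to_bits() >> 16) as u16
}

// On a target with hardware support, this arm would call a
// platform-specific intrinsic; in this sketch it just delegates
// to the software path so the example compiles everywhere.
#[cfg(all(target_arch = "x86_64", target_feature = "avx512bf16"))]
fn bf16_add(a: u16, b: u16) -> u16 {
    bf16_add_soft(a, b)
}

#[cfg(not(all(target_arch = "x86_64", target_feature = "avx512bf16")))]
fn bf16_add(a: u16, b: u16) -> u16 {
    bf16_add_soft(a, b)
}

fn main() {
    // 1.0 (0x3F80) + 1.0 (0x3F80) = 2.0 (0x4000) in bfloat16 bits.
    assert_eq!(bf16_add(0x3F80, 0x3F80), 0x4000);
}
```

Callers only ever see `bf16_add`; which implementation they get is decided at compile time by `cfg`, which is exactly the kind of dispatch a crate can do without compiler support.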
Another downside of implementing the "extend to f32" fallback in library code is that I don't see how it could benefit from allowing intermediate results of higher precision at the optimizer's discretion (as proposed in #2686), which would eliminate some intermediate f16->f32->f16 roundtrips. I am not entirely sure how this will play out in practice though, because if we generate the software fallback sequence in the MIR -> LLVM IR step, then we will also largely miss opportunities for higher precision. I expect it will ultimately depend on how LLVM implements bfloat16 support.
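A small illustration of those roundtrips (helper names invented; `narrow` simulates storing to bfloat16 by truncating an f32's low mantissa bits): narrowing after every operation can lose far more than accumulating in f32 and narrowing once at the end.

```rust
/// Simulate storing an f32 to bfloat16: keep only the top 16 bits.
fn narrow(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & 0xFFFF_0000)
}

/// Naive lowering: every intermediate result goes through bfloat16.
fn dot_per_op(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0;
    for (&x, &y) in a.iter().zip(b) {
        acc = narrow(acc + narrow(x * y));
    }
    acc
}

/// Higher-precision intermediates: accumulate in f32, narrow once.
fn dot_wide(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0;
    for (&x, &y) in a.iter().zip(b) {
        acc += x * y;
    }
    narrow(acc)
}

fn main() {
    let a = [1.0_f32; 300];
    let b = [1.0_f32; 300];
    // Accumulating in f32 gives the exact answer, 300.0
    // (which happens to be bfloat16-representable).
    assert_eq!(dot_wide(&a, &b), 300.0);
    // Narrowing after every add stalls at 256: with bfloat16's
    // 8-bit significand, 256 + 1 truncates back to 256.
    assert_eq!(dot_per_op(&a, &b), 256.0);
}
```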
Aside from the above implementation considerations, I am still unsatisfied with the use cases.
Moving data around doesn't need any operations; you can just as well use any 16-bit data type for the memory loads and stores. Passing scalars by value across FFI may run into ABI issues that can only be solved with some compiler support, but again there's no need for arithmetic, maybe just a different `repr`.
As I mentioned at the start, I don't see why writing those kernels in Rust is important enough to justify adding a new primitive type to Rust, for three reasons:
I have another suggestion, I'm not sure how much sense it makes:
Another approach would be to support this type via a crate, with software emulation of the operations on bfloat16, and then optimize them either in LLVM or as a MIR optimization.
So instead of exposing bfloat16 at the language/stdlib level, we make sure that operations on bfloat16 are optimized the same way as if there were a native type/intrinsics. I believe this could be implemented as "simple" peephole optimizations.
This would have an advantage for backward/forward compatibility: we would not guarantee these optimizations and thus could drop them if the cost of maintaining them grew too big. Code that uses them would still compile with older rustc versions; it would just run slower.
I don't want to make any assumptions about the ABI for passing a bfloat16 by value.
As you mentioned, that ABI could use a 16-bit integer type.
I think we've been talking past each other in the entire discussion of "intrinsics", and I'm going to update the RFC with some clarifications. I was assuming, here, that the mentions of "intrinsics" referred exclusively to platform-specific intrinsics that map one-to-one to hardware instructions, rather than to something portable provided by LLVM that's subject to all the same optimizations as any native type.
I was responding to the former when I suggested that I don't think an entirely intrinsic-based implementation makes sense. To respond to the latter: I don't have any problem with using "portable intrinsics" provided by LLVM, as long as they result in a type that Rust and LLVM can optimize just as well as the existing floating-point types.
A correction: this has nothing to do with marketing. This is about the usability and performance of having first-class support. I'm expecting that some parts of the implementation will require a lang item, which we can't use from a crate.
I don't expect Rust to bring less to the table. One of my goals is "there should never be a use case for which C is more capable or more usable than Rust".
Native support for bfloat16 is part of that goal.
I don't expect the use cases to be limited to those kernels; I listed them because they're a prominent use case. Imagine trying to list all the use cases for native floating-point support in Rust. It's a long list, and I'm expecting bfloat16 to apply to many of those.
I'd like to have both of those. I absolutely want all the target-specific intrinsics, but I'd also like to be able to write scalar code for automatic vectorization, offload engines, and similar. Currently, users of bfloat16 have to reach for intrinsics for everything.
Also, I'm expecting a somewhat smoother spectrum, with target-specific intrinsics for specific hotspots, and higher-level Rust code for anywhere that can use higher-level Rust code. Productivity is an aspect of performance; the code you haven't had time to write yet is the slowest of all, with a throughput of 0.
There's a PR to the
I think it is better for this to belong at the crate level rather than the language level, especially considering how sparse hardware support is. Exposing intrinsics seems sufficient for that.
I was asked to properly write my “concerns” here, so these are my concerns in the most constructive way I found to phrase them.
This text claims that LLVM will add a native type for bfloat16, but provides no supporting evidence.
I don't see any information or arguments suggesting that the ABI story for bfloat16 has been worked out.
The text does not mention how these problems are already solved in Rust (e.g. no mention of the existing library implementations).
The text's prior-art section should be comprehensive.
The text claims that a library solution is impossible and proposes a solution based on the assumption that the claim is true.
Prior art in Rust (e.g. tensorflow-rs) and other languages (C++ TensorFlow, Julia) shows that it is indeed possible to implement bfloat16 as a library.
If LLVM ever adds native intrinsics for bfloat16, those can be exposed as Rust intrinsics.
If LLVM also adds a native type, some of these Rust intrinsics might need to, internally and transparently to users, switch to that native type.
LLVM might also add target-specific intrinsics for this type (e.g. AVX-512 extensions). Those would be exposed via `std::arch`.
Prior art suggests that the ABI of such a type might just be unspecified. In Rust, this means that we can do whatever we want, as long as we make the type an improper C type that can't be passed directly across `extern "C"` interfaces.
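One plausible shape for such a type, purely as an illustration (the name `Bf16` and its truncating conversions are invented here, not taken from the RFC): a `#[repr(transparent)]` newtype over `u16`, so its size and alignment are pinned down by `u16` and the representation question is deferred:

```rust
/// Illustrative library-level bfloat16: layout-identical to u16.
#[repr(transparent)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Bf16(u16);

impl Bf16 {
    /// Truncating conversion from f32 (no rounding, for brevity).
    pub fn from_f32(x: f32) -> Self {
        Bf16((x.to_bits() >> 16) as u16)
    }
    /// Exact widening conversion back to f32.
    pub fn to_f32(self) -> f32 {
        f32::from_bits((self.0 as u32) << 16)
    }
}

fn main() {
    use std::mem::{align_of, size_of};
    // The transparent wrapper has exactly the layout of u16, so the
    // in-memory representation is settled even while the by-value
    // calling-convention question stays open.
    assert_eq!(size_of::<Bf16>(), size_of::<u16>());
    assert_eq!(align_of::<Bf16>(), align_of::<u16>());
    assert_eq!(Bf16::from_f32(2.0).to_f32(), 2.0);
}
```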
It has also been claimed that such a library implementation cannot be efficient, but that claim is not supported. AFAICT, the claim is incorrect. A library and the native type can be made to generate identical LLVM IR (e.g. by making the library call the same intrinsics).
Even if a library solution has no codegen drawbacks over a language-level solution, there are many trade-offs at play that might make one approach better than the other.
For example, a library solution can be more flexible (e.g. supporting different behaviors via cargo features or different libraries), iterate more quickly, and have a smaller impact on the implementation than a language-level solution. A native-type solution would properly support literals, casts, etc., and would have a "marketing advantage": Rust would have first-class support for bfloat16.
The text does not explore these trade-offs, and it critically should. For example, the better-marketing claim can easily be misread as Rust not being expressive enough to implement the type as a user-defined type (as Julia and C++ do). If the library ends up being worse than the language-level solution, which would be fine, the text should make the extra effort of identifying whether other language features could make the library solution good enough (e.g. user-defined literals and user-defined casts), since pursuing those instead might have a larger impact on the ecosystem than adding a new primitive type every time we hit such a roadblock.
I'm surprised it was submitted to the process in this form. It feels insufficiently baked and would benefit from iteration on internals or in an RFC-repo issue to collect use cases, alternatives, prior art, etc. If the tensorflow-rs implementation cannot be refactored into a library, it might make sense to implement this as a library first to gain experience, and wait to see what LLVM does. If the library is not good enough due to other limitations in the language, one can attempt to address those.
Some comments have argued that the intent was to discuss the idea. IIUC the RFC process is not the right venue for that, though that might be changing. If that was the intent, the text should instead have focused on discussing the problem that needs solving (e.g. how are Rust programs that already use bfloat16 handling it today, and where does that fall short?).
Potential compromise: add the type in the standard library (backed by a lang item) rather than as a new primitive in the global namespace.
That way you still get the benefits of a primitive type, e.g. the implementation can be more straightforward, and ABI concerns can be handled in the compiler where they belong, but it wouldn't take up any room in the global namespace.
That would still commit the Rust compiler and stdlib to forever having a notion of bfloat16, which is a valid concern. But a hypothetical standard-library type could be deprecated much more easily than a primitive type.
Which problem does this solve?
The main advantage of the intrinsics is that people can just use a crate and still get efficient code.
The intrinsics could be portable, but I don't think that needs to be a requirement, and there wouldn't be much value in it, since a library would already contain a fallback implementation (so why duplicate that in the compiler?). That is, the library would be in charge of choosing intrinsics depending on target, codegen backend, etc., and of using them correctly (all of this can be done with `cfg`).
If we end up implementing intrinsics for all the needed operations, a library can provide the user-facing type on top of them.