icelake-like backend for RVV (RISC-V vector extension) #362
This is very cool! Looking forward to seeing your code. How do you adapt the
I suppose this is due to the
Fantastic. These are very impressive numbers.
Firstly, we have an empty architecture, ppc64, that currently falls back entirely on scalar code. You can just copy that. First you need macros to recognize the target at compile time. Start with
Right? So you need something like
It is acceptable to have constraints on the compiler. For example, not all compilers will allow us to build the icelake kernel. But you need a compile-time way to check support for the overloaded intrinsics, evidently.
For practical reasons, you want a risc-v binary to run without crashing if the extension is missing. So anything that is not required by risc-v may need a runtime check. This runtime check can be moderately expensive because it is cached (it is typically done only once). You must be concerned with software engineering: supporting multiple kernels is expensive... not only in coding time, but also in bug fixing and testing. With x64, we have millions of users (simdutf is part of Node.js so it is everywhere), but it is not so with risc-v. So I recommend reducing the code surface. Note that it is possible to do things in stages, and add support for extensions later.
Unfortunately, an ISA does not come with strict performance guarantees. For example, aarch64 processors can vary quite a bit. For icelake, the trick we use is to require VBMI2. Even if we did not need VBMI2, we know that if this extension is present, then the processor is sufficiently recent that the AVX-512 instructions don't have too many gotchas. But that's easy because we know the market in detail. In practice, you often get that there is some dominance in the market. Most processors in use are X, Y, Z and you optimize for X, Y, Z while knowing that performance can be lower for other processors. I'm afraid that there is no fool-proof approach, but you can document your expectations.
@camel-cdr I am getting the impression that we'll have at least one code reviewer (@clausecker). :-)
👍
From what I can tell, checking for version 1.0 support with
I'm not sure what the correct way to check for it is, but hwprobe looks like it would work for checking
I'm not sure what
BTW, here is a great resource to get an overview of the supported instructions: https://github.com/dzaima/intrinsics-viewer
It's used to translate the various masks between the UTF-8 and the UTF-16 space, mainly for validation, but also for other things. There may be workarounds for this, but I suppose you would have mentioned it if redesigning the algorithm to no longer need this were an option.
This sounds like you're doing a different algorithm from the Icelake kernel? The algorithm used by the Icelake kernel is explained in detail in our paper. It does validation as a part of the transcoding process, but with a different approach from the previous Keiser et al. paper.
So we are assuming that RISC-V runs on Linux. Is that true? Nothing in simdutf assumes Linux thus far.
@camel-cdr You referred to icelake in your issue, and the icelake kernel is different, in part because it benefits from compress instructions (VBMI2).
It certainly doesn't run on Windows today. I think FreeBSD, for example, also has support, but I'd focus on Linux for now. Edit: and, come to think of it, Android as well.
I was referring to icelake because it uses a fully standalone implementation; the other backends partially use the
Anyway, I've uploaded the code with explicit intrinsics to my GitHub, and will look into creating a public simdutf dev branch soon:
Hi! Regarding T-HEAD: they recently added a new vendor extension (XTheadVector). There was an agreement to support this upstream (unfortunately there is too much hardware with it floating around to avoid it). IIRC this lands in GCC 14 (or at least they will try to make it). I think it got merged 7+ days ago, and there is also a separate intrinsic header.
@davidlt That's great. I was aware of the XTheadVector patches, good to hear that they are now merged. PS: the mentioned article is done now: https://camel-cdr.github.io/rvv-bench-results/articles/vector-utf.html |
So I've run into a bit of a predicament. If I understood it correctly, the current behavior for x86 is to compile all backends in a single translation unit, using gcc and clang support for target-specific function attributes. For now, I'll only enable the rvv backend if it's explicitly compiled for rvv.
Your understanding is correct. We proceed in this manner because we want to make the library available in a single-header form, without making assumptions about the build system. |
Hi, I've been working on RVV native Unicode conversion routines, and have validating utf8->utf32 and utf8->utf16 conversions, as well as a partial utf16->utf8, working and optimized (see the last part for benchmarks).
I'd like to upstream this to simdutf in a custom backend, similar to how the icelake one works.
For testing, I generate random valid input utf32, convert it to the input format, randomly perform n random bit flips on it, and validate the output against the simdutf scalar implementation. Ideally I'd also like to use coverage guided fuzzing, but I wasn't able to get fuzzing working on RISC-V yet.
The code will be published to my RVV benchmark repository soon (it still needs some cleanup), hopefully with an associated article/blog post.
Edit: Here is the code: utf8_to_utf32/utf8_to_utf16, utf16_to_utf8
There are some open questions, though.
What does one need to do/which files to touch/what tools are there, to add a new architecture to simdutf?
Should we use the explicit intrinsics or the overloaded intrinsics?
The overloaded intrinsics are IMO more readable and better refactorable.
From what I can tell, both are mandated by the RVV intrinsics spec, but while clang has supported them since it gained RVV intrinsics (clang 16 and above), gcc currently doesn't, though upstream is working on it. I expect that once RVV 1.0 hardware becomes more available, gcc will have support. There is currently only one RVV 1.0 board, the Kendryte K230, which is slowly being rolled out in batches.
Which extensions should we target?
I think we should orient ourselves by the RVA profiles and only support the standard V extension, i.e. 8 to 64 bit wide elements with a VLEN >= 128 bits, and not things like Zve64x.
Supporting Zvbb is also quite useful, as it has an endianness swap instruction, but I think we should make this optional and detect support from compiler settings.
Can/should we assume fast vrgather and vcompress?
RVV has two permutation instructions that currently vary widely in performance between processors:
* vcompress.vm:
* vrgather.vv:
...
*bobcat: note that this is an open-source proof-of-concept core, and they explicitly stated that they didn't optimize the permutation instructions
*x280: the numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a vrgather fast path for vl<=256. I think they didn't have much incentive to make this fast, as the x280 mostly targets AI.
My code currently uses e8m1 vrgather and e8m2 vcompress, which works great on the C9xx cores, but not so great on the others. I suspect, however, that we'll see future desktop cores implement fast vcompress and at least fast LMUL=1 vrgather.
For one, because vcompress implementations can scale almost linearly with vector length, which doesn't seem to be true for vrgather without exploding the gate count (although admittedly I don't know much about hardware design). Secondly, because vrgather used for 4-bit LUTs and in-lane shuffles will be among the most common operations, so vendors will need to optimize for those.
For now, I wouldn't add gather-free implementations and performance measurements, but that might become necessary in the future if I'm wrong about this.
Benchmarks
Processors:
C908: in-order at 1.6GHz, supports RVV 1.0 with VLEN=128
C920: out-of-order double issue at 2GHz, supports RVV 0.7.1 with VLEN=128
I needed to manually convert the assembly to rvv 0.7.1, which increased the code size by about 20 instructions. I've yet to do the conversion for the utf16_to_utf8 code, so there aren't any results for that below.
Implementations:
utf8_to_utf32/utf8_to_utf16: fast path for 1 byte, 1/2 byte, 1/2/3 byte, average > 2 bytes, general case
Emoji-Lipsum could probably be artificially sped up by an all-4-byte case, but I don't think that is a realistic case to optimize for, so I left it out.
utf16_to_utf8: fast path for 1-byte output; the 1/2-byte-output path consumes everything until a 3/4-byte output is encountered, which is then converted with scalar code until a 1/2-byte output is reached.
I plan on adding a 1/2/3 vectorized path, and maybe a 1/2/3/4 one, if I can figure it out.
Metric:
b/c is "input bytes processed"/cycle.

c908 utf8_to_utf32
c908 utf8_to_utf16
c908 utf16_to_utf8
c920 utf8_to_utf32
c920 utf8_to_utf16