
Introduce Risc-V Vector Intrinsic Support #642

Draft · wants to merge 4 commits into base: master

Conversation

gkalsi

@gkalsi gkalsi commented Aug 8, 2023

The RISC-V ISA has an optional "V" extension for vector support. This patch introduces vector-accelerated routines for the following methods:

  • local_lpc_compute_autocorrelation
  • local_lpc_compute_residual_from_qlp_coefficients

Building with RISC-V vector support is disabled by default, because so far the patch has been tested only for correctness under QEMU. It can be enabled with `--enable-riscv-vector-optimizations` in autotools or `-DRISCV_VECTOR=ON` in CMake.

Limitations:

  • RISC-V vector support requires a very modern compiler (Clang 16 or later) for the time being
  • The width of each vector register on RISC-V is configurable by the silicon vendor. For now, this patch assumes a reasonable width of at least 128 bits per vector register.

Future Work:

  • Only local_lpc_compute_residual_from_qlp_coefficients has been optimized so far, and the implementation was based heavily on the Intel AVX implementation. A more idiomatic RISC-V implementation is likely feasible.
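For reference, the opt-in builds described above look like this (flag names as introduced by this patch; a usage fragment, not output from the PR):

```shell
# autotools
./configure --enable-riscv-vector-optimizations

# CMake
cmake -DRISCV_VECTOR=ON .
```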

@gkalsi
Author

gkalsi commented Aug 8, 2023

Hi Flac maintainers,

I hope I'm not overstepping my bounds with this PR. I've taken the liberty of porting a handful of functions to their vectorized RISC-V equivalents. I've started with just the two methods to get some early feedback from you folks, to make sure these are patches you'd be interested in and that I'm on the right track. Any feedback and comments would be appreciated!

@ktmf01
Collaborator

ktmf01 commented Aug 8, 2023

Hi,

Thanks for wanting to contribute to FLAC! I do have a few questions though

  • Does this code need run-time detection? You say this ISA extension is optional; would it make sense to have a single binary that runs on CPUs both with and without this extension?
  • Have you measured whether these intrinsics functions provide a performance improvement?
  • If yes, could you try compiling plain C specifically with the vector ISA extension enabled and see whether these intrinsics functions outperform the autovectorized variants? As far as I know, Clang 16 got pretty good at autovectorizing FLAC code.

The main problem I have with this PR is that I cannot test it. Also, it is not covered by CI nor by fuzzing. So, the performance improvement must be pretty good over autovectorized code for this to get merged.

I would be much more inclined to merge something with only a little platform specific code, like this: https://github.com/xiph/flac/blob/master/src/libFLAC/lpc_intrin_fma.c But, that would only make sense with run-time detection. With that approach, most of the code is 'shared' and thus covered by CI and fuzzing.

@gkalsi
Author

gkalsi commented Aug 9, 2023

Thanks for the prompt and thoughtful response!

To answer your questions inline:

Does this code need run-time detection?

Yes, good point. I'll implement that and send you a follow up patch.

Have you measured whether these intrinsics functions provide performance improvement?

Unfortunately not directly. I don't have access to any hardware that supports the RISC-V vector extension at the moment. The best I've been able to do so far is estimate performance based on instruction counts. I haven't had a chance to compare how well it does against Clang's auto-vectorization, but I can certainly look into it.

The main problem I have with this PR is that I cannot test it. Also, it is not covered by CI nor by fuzzing. So, the performance improvement must be pretty good over autovectorized code for this to get merged.

So far I've been testing my changes using QEMU. The FLAC tests seem to pass with my changes. I also implemented a test harness that compares the results from my vectorized routines against the canonical C implementations. The results also seem to match up. I'm happy to share that test harness with you if that'd be helpful. Is emulation an acceptable test bench?

I would be much more inclined to merge something with only a little platform specific code, like this: https://github.com/xiph/flac/blob/master/src/libFLAC/lpc_intrin_fma.c But, that would only make sense with run-time detection.

That sounds reasonable to me. Let me get back to you with run-time detection and some considerations for minimizing platform specific code.

Thanks again!

@enh-google

That sounds reasonable to me. Let me get back to you with run-time detection and some considerations for minimizing platform specific code.

pro tip: it turns out that getauxval(AT_HWCAP) does actually report 'V' (on kernels new enough to actually support V state save/restore; and obviously we don't want to try to run on older kernels anyway), there just isn't a predefined constant for it. so as long as you're prepared to check bit (1 << ('V' - 'A')), you can go the easy route rather than needing the __riscv_hwprobe() stuff (which afaik currently only exists in Android's libc; it's still "coming soon" for other libcs, so for portable code right now you don't have much of a choice other than getauxval()).

@ktmf01
Collaborator

ktmf01 commented Aug 9, 2023

I considered buying a MangoPi some time ago, just for fun. I read, however, that its CPU (the Allwinner D1's C906 core) was designed before the vector ISA extension was finalized. Is there any similarly cheap hardware available that is up-to-date?

@gkalsi
Author

gkalsi commented Aug 9, 2023

@enh-google Thanks for the tip!
@ktmf01 Not that I know of so far 🙁 however I'm hoping some up-to-date hardware will be available soon

@ktmf01
Collaborator

ktmf01 commented Aug 9, 2023

Okay, then we will have a bit of a problem here, I guess. Recently, some ARM64 intrinsics were merged; they were slower than plain C on the first try and only slightly faster on the second. Furthermore, the 75% speed increase that was originally reported is no longer applicable, as Clang 16 had some improvements specific to that function. I'm seriously considering benchmarking again and dropping some functions there. I consider code that is not an improvement a liability, really, especially if it is not covered by oss-fuzz.

In other words: using intrinsics does not automatically result in improvements. I really need some measurements before being able to merge this. I've dropped functions specific to ppc64 in the last release because after spending a lot of time getting hold of someone with barely-working ppc64 hardware, it turned out these functions were actually no improvement at all. And for these measurements, someone needs hardware to measure...

edit: just to be clear, I feel no need to make these measurements myself, as long as they seem reasonably reliable. I've used gprof in the past with seemingly good results.

@enh-google

i don't think riscv64 auto-vectorization is anything like as well developed as arm64 atm, so it's quite possible that the optimal strategy will be to use intrinsics in the short term and delete them when the compilers[1] catch up.


  1. since Android only uses clang, i actually have no idea what state gcc is in --- but differences between compilers are another reason to assume there's not going to be a single clear answer in the short term :-(

This patch dynamically detects whether the RiscV vector unit
is available and only enables the intrinsic routines if it is.

Tested by launching QEMU with the Vector Extensions enabled
and disabled and observed that the intrinsic routines were
only patched in when vector was enabled.
@gkalsi gkalsi marked this pull request as draft August 10, 2023 04:23
@gkalsi
Author

gkalsi commented Aug 10, 2023

That makes sense. I've converted this patch to a draft until I can obtain some measurements. I suspect the best I'll be able to do in the immediate term is a comparison of retired instruction counts using an emulator. I think that might be a strong proxy for performance gains, especially if the difference is significant -- however I suppose that's a discussion we can have once I have some data :)

In the meantime I've added support for dynamically detecting the presence of the vector unit on the CPU. Cheers!

@ktmf01
Collaborator

ktmf01 commented Aug 10, 2023

i don't think riscv64 auto-vectorization is anything like as well developed as arm64 atm,

The main problem fixed with clang 16 for arm64 (and probably x86 as well, but I haven't checked) was resolving dependencies and generating proper IR. AFAIK that is mostly architecture independent and the hardest part of autovectorization.

Anyway, we'll see how this turns out.

@enh-google

In the meantime I've added support for dynamically detecting the presence of the vector unit on the CPU.

(yeah, that lgtm.)

AFAIK that is mostly architecture independent and the hardest part of autovectorization.

i'm not an expert (but can connect you with folks who are if you like), but aiui riscv64 autovectorization still has two competing proposals for reasons i never understood :-(

actually, @appujee because i know autovectorization is one of his favorite subjects, and he might be interested to have a real-world example to test with (with at least the eventual goal of being able to delete the intrinsic implementation, even if that takes a couple more years and we need the intrinsics in the meantime!).

@appujee

appujee commented Aug 11, 2023

The main problem fixed with clang 16 for arm64 (and probably x86 as well, but I haven't checked) was resolving dependencies and generating proper IR. AFAIK that is mostly architecture independent and the hardest part of autovectorization.

The llvm vectorizer is target independent (for the most part), so any improvements, like dependency analysis, translate to all architectures. However, RISC-V vectorization is slightly different because the vectors are variable length (similar to SVE on AArch64). So questions like:

  • what is the right vectorization factor for a workload
  • should the vectorization factor be hard-coded or determined at runtime
  • is it okay to have an epilogue and have SVE behave like the old vectorizer

remain unanswered. (Side note: I'm not sure if we have tested aarch64-sve for flac, CFLAGS+='-march=armv8-a+sve'.)

Even though RISC-V autovectorization in clang has made decent progress, it is a WIP with some interesting optimizations still in review. Things also break in weird ways as the vectorizer evolves with contributions from multiple parties. Because of the limited availability of hardware, it is difficult to make a case for autovectorization without measuring meaningful workloads. I agree with @enh-google that it is better to use intrinsics for now and move to autovectorization later when it makes sense.

@enh-google

remain unanswered. (Side note: I'm not sure if we have tested aarch64-sve for flac CGLAGS+='-march=armv8-a+sve')

(fwiw i did try that earlier this year [https://android-review.googlesource.com/q/topic:%22armv9%22] but all i got for my troubles was an lld crash. someone should try that again, though i suspect that the flac maintainer here is claiming that arm64 autovectorization works well for arm64 ASIMD, not arm64 SVE?)

@negge

negge commented Aug 16, 2023

I tried building this MR so I could collect some performance statistics. Unfortunately, the patch does not build with gcc-13.2.0:

$ make
[  0%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/bitmath.c.o
[  1%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/bitreader.c.o
[  2%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/bitwriter.c.o
[  3%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/cpu.c.o
/home/negge/git/flac/src/libFLAC/cpu.c: In function ‘rv64_cpu_info’:
/home/negge/git/flac/src/libFLAC/cpu.c:247:36: warning: implicit declaration of function ‘__riscv_vsetvlmax_e8m1’ [-Wimplicit-function-declaration]
  247 |                 info->rv64.vlenb = __riscv_vsetvlmax_e8m1();
      |                                    ^~~~~~~~~~~~~~~~~~~~~~
/home/negge/git/flac/src/libFLAC/cpu.c:247:36: warning: nested extern declaration of ‘__riscv_vsetvlmax_e8m1’ [-Wnested-externs]
[  4%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/crc.c.o
[  4%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/fixed.c.o
[  5%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/fixed_intrin_sse2.c.o
[  6%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/fixed_intrin_ssse3.c.o
[  7%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/fixed_intrin_sse42.c.o
[  8%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/fixed_intrin_avx2.c.o
[  8%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/float.c.o
[  9%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/format.c.o
[ 10%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc.c.o
[ 11%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc_intrin_neon.c.o
[ 12%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc_intrin_sse2.c.o
[ 12%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc_intrin_sse41.c.o
[ 13%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc_intrin_avx2.c.o
[ 14%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc_intrin_fma.c.o
[ 15%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/lpc_intrin_riscv.c.o
[ 16%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/md5.c.o
[ 16%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/memory.c.o
[ 17%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/metadata_iterators.c.o
[ 18%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/metadata_object.c.o
[ 19%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/stream_decoder.c.o
[ 20%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/stream_encoder.c.o
[ 20%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/stream_encoder_intrin_sse2.c.o
[ 21%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/stream_encoder_intrin_ssse3.c.o
[ 22%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/stream_encoder_intrin_avx2.c.o
[ 23%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/stream_encoder_framing.c.o
[ 24%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/window.c.o
[ 25%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/ogg_decoder_aspect.c.o
[ 25%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/ogg_encoder_aspect.c.o
[ 26%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/ogg_helper.c.o
[ 27%] Building C object src/libFLAC/CMakeFiles/FLAC.dir/ogg_mapping.c.o
[ 28%] Linking C static library libFLAC.a
[ 28%] Built target FLAC
[ 29%] Building CXX object src/libFLAC++/CMakeFiles/FLAC++.dir/metadata.cpp.o
[ 29%] Building CXX object src/libFLAC++/CMakeFiles/FLAC++.dir/stream_decoder.cpp.o
[ 30%] Building CXX object src/libFLAC++/CMakeFiles/FLAC++.dir/stream_encoder.cpp.o
[ 31%] Linking CXX static library libFLAC++.a
[ 31%] Built target FLAC++
[ 32%] Building C object src/share/replaygain_analysis/CMakeFiles/replaygain_analysis.dir/replaygain_analysis.c.o
[ 32%] Linking C static library libreplaygain_analysis.a
[ 32%] Built target replaygain_analysis
[ 33%] Building C object src/share/replaygain_synthesis/CMakeFiles/replaygain_synthesis.dir/replaygain_synthesis.c.o
[ 34%] Linking C static library libreplaygain_synthesis.a
[ 34%] Built target replaygain_synthesis
[ 35%] Building C object src/share/getopt/CMakeFiles/getopt.dir/getopt.c.o
[ 36%] Building C object src/share/getopt/CMakeFiles/getopt.dir/getopt1.c.o
[ 37%] Linking C static library libgetopt.a
[ 37%] Built target getopt
[ 37%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/alloc.c.o
[ 38%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/cuesheet.c.o
[ 39%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/file.c.o
[ 40%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/picture.c.o
[ 41%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/replaygain.c.o
[ 41%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/seektable.c.o
[ 42%] Building C object src/share/grabbag/CMakeFiles/grabbag.dir/snprintf.c.o
[ 43%] Linking C static library libgrabbag.a
[ 43%] Built target grabbag
[ 44%] Building C object src/share/utf8/CMakeFiles/utf8.dir/charset.c.o
[ 45%] Building C object src/share/utf8/CMakeFiles/utf8.dir/iconvert.c.o
[ 46%] Building C object src/share/utf8/CMakeFiles/utf8.dir/utf8.c.o
[ 47%] Linking C static library libutf8.a
[ 47%] Built target utf8
[ 48%] Building C object src/flac/CMakeFiles/flacapp.dir/analyze.c.o
[ 48%] Building C object src/flac/CMakeFiles/flacapp.dir/decode.c.o
[ 49%] Building C object src/flac/CMakeFiles/flacapp.dir/encode.c.o
[ 50%] Building C object src/flac/CMakeFiles/flacapp.dir/foreign_metadata.c.o
[ 51%] Building C object src/flac/CMakeFiles/flacapp.dir/main.c.o
[ 52%] Building C object src/flac/CMakeFiles/flacapp.dir/local_string_utils.c.o
[ 52%] Building C object src/flac/CMakeFiles/flacapp.dir/utils.c.o
[ 53%] Building C object src/flac/CMakeFiles/flacapp.dir/vorbiscomment.c.o
[ 54%] Linking C executable flac
/usr/lib/gcc/riscv64-unknown-linux-gnu/13/../../../../riscv64-unknown-linux-gnu/bin/ld: ../libFLAC/libFLAC.a(cpu.c.o): in function `.L0 ':
cpu.c:(.text+0x36): undefined reference to `__riscv_vsetvlmax_e8m1'
collect2: error: ld returned 1 exit status
make[2]: *** [src/flac/CMakeFiles/flacapp.dir/build.make:217: src/flac/flac] Error 1
make[1]: *** [CMakeFiles/Makefile2:729: src/flac/CMakeFiles/flacapp.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

I believe the RISC-V intrinsics API is still evolving, and this may be a discrepancy between versions 0.11 and 0.12 of the intrinsics specification.

This was causing a build failure if riscv_vector was not available
on the system.
@gkalsi
Author

gkalsi commented Aug 16, 2023

Thanks @negge -- looks like I was calling __riscv_vsetvlmax_e8m1 even when riscv_vector.h wasn't present.
I realized I should probably set a reasonable -march=... if the build system requests vector optimizations. I've been using -march=rv64gc_zve64d thus far; is that a reasonable set of extensions to expect?

@enh-google

i think Android will say the equivalent of rv64gcv (see https://android-review.googlesource.com/c/platform/build/soong/+/2679376 for example).

i wouldn't hard-code any vector size, but especially not 64, which i wouldn't expect to be common. (128 seems like the likely sweet spot for the foreseeable future, based on arm64 experience, so for qemu where we did have to hard-code something, that's the vector length we've told qemu to assume for now.)
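For reference, pinning QEMU to that assumption looks like this (user-mode emulation; the binary name is hypothetical, and the `v`/`vlen`/`elen` CPU properties are QEMU's):

```shell
# run a riscv64 binary with the V extension enabled and VLEN fixed at 128 bits
qemu-riscv64 -cpu rv64,v=true,vlen=128,elen=64 ./flac_test
```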

When detecting "riscv_vector.h" using autotools and cmake,
invoke the toolchain with -march=rv64gcv.
@gkalsi
Author

gkalsi commented Aug 17, 2023

@enh-google Thanks, that's helpful! -march=rv64gcv sounds good to me.
@negge Thanks for that; it should be working in your VM environment with my latest patches. Use `cmake -DRISCV_VECTOR=ON` or `./configure --enable-riscv-vector-optimizations` when building with cmake or autotools respectively.

@camel-cdr

camel-cdr commented May 28, 2024

Now that there are two RVV 1.0 dev boards available, this can probably be revisited. (The CanMV K230 with the XuanTie C908, and the Banana Pi BPI-F3 with the SpacemiT X60; I've got them both and can help benchmark.)

I've got some comments on the current implementation:

  • FLAC__lpc_compute_autocorrelation_intrin_riscv:
    It currently always uses LMUL=8, which will perform quite badly if lag<=8, since most implementations dispatch based on LMUL, not vl. I'd add a switch that selects a path with the appropriate minimal LMUL.

  • FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_riscv:
    Why does this set vl to 4? This would only perform well on VLEN=128 architectures. From what I can tell, there shouldn't be any dependencies between iterations, so you should be able to use vlmax instead. A vslide1up approach might also be better than the overlapping loads, but I'm not sure; this would need to be measured. Some of the code paths could also take advantage of LMUL.

A general note: I think instead of the prefix _riscv, this should use _rvv (for the RISC-V Vector extension), since not all RISC-V implementations implement RVV, and there might be other extensions in the future that could be used for separate optimizations (e.g. packed SIMD, once that's ratified).

benchmark examples/c/encode/file/encode_file average execution speed on sample file:

C908, in-order core with VLEN=128:
    RVV: 0.531 sec
    scalar: 0.503 sec
    no RVV autocorrelation: 0.502 sec

As predicted, the vectorized autocorrelation doesn't perform well yet. Without it the performance seems to match between RVV and scalar.

The RVV implementation on the C908 isn't very powerful, I'll try to run it on the X60, which has twice the VLEN and execution unit width, once I've got some more time.

Do you have a standard benchmark suite? I'd like to look into some alternative RVV implementations, once I have the time.
