-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds support for WASM 128-bit SIMD #56
Conversation
* Adds feature `wasm32_simd128` consistent with `aarch64_neon` as there is no auto-detection for SIMD in WASM. * Updates `README.md` with WASM build instructions. * Integrates with tests--though the `should_panic` test case will be ignored due to limitations of the test driver with `wasm32-wasi`. * Adds ignore for IntelliJ based IDEs (e.g., CLion). Implementation for rusticstuff#36
This adds a small `cdylib` shim library around simdutf8 with cargo configuration to target `wasm32-unknown-unknown`. The benchmarks are augmented to optionally embed Wasmer and at compile time build the WASM shim and embed it in the benchmarks, compile the WASM module copy in the test slice and invoke the shim routine to benchmark it. A limitation of this current approach is that it measures the overhead of calling across the WASM runtime boundary for each iteration. Another limitation is that it is currently using the default cranelift backend for Wasmer (which is the backend for wasmtime), but the LLVM backend is more performant according to Wasmer's documentation. The benchmarks still allow you to get a reasonable set of expectation of performance (relative numbers against std vs basic for example).
I've added preliminary benchmark integrations in this commit. The approach is basically what I posited in the TODO above--I cross compile a small C-ABI shim crate targeting On my NUC (Intel i7-10710U running Ubuntu 20.04 with performance governor on) I got some promising benchmarks that demonstrate that the WASM SIMD is definitely performing better. $ for X in std compat basic; do (cargo bench --features=simdutf8_wasm --bench="throughput_wasm_$X" -- --save-baseline wasm-$X); done Do you want me to add this feature to this PR, or do you want me to break them up (e.g., have this PR stand on its own, and have follow up PRs and make the task list above issues)? I am very sympathetic large PRs so please let me know how you'd like me to proceed. |
First of all: Great work, thank you so much! I will do a first review shortly.
You can put it in this PR, just don't rebase for now, so I know which commits I have already looked at. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again!
What I am not 100% clear on is, if we want to provide a knob so that the library user can turn on the simd implementation at runtime without having to compile with the +simd128
target feature, because for browsers there is wasm-feature-detect which could be used to check for simd availability. What do you think?
Excellent, I have pushed that commit to the PR. |
My current understanding (which could be wrong) of WASM ISA extensions such as SIMD is is that there is no way within WASM for us to detect this, so the code targeting WASM itself cannot detect SIMD or not--in the host system (e.g., a Javascript host) there may or may not be such a facility, but it is not clear to me how we could reliably interact with it from the library in a portable (as in non-JS engines) way. I do think from the library's perspective, we should definitely allow static flexibility at a minimum because ultimately we want to be able target any WASM runtime irrespective of its native capabilities. |
Co-authored-by: Hans Kratz <hans@appfour.com>
Removes the `wasm32_simd128` feature in favor of controlling this behavior with just `target_feature = simd128` which is consistent with x86 `no_std`. Updates the documentation to explain the requirement of the target feature to drive selection and moves test/dev docs for WASM into its own file as to not clutter the README. Also makes one minor change to the bench build script to make clippy happy.
This adds two end-user features: `simdutf8_wasm_cranelift` and `simdutf8_wasm_llvm`. These features enable the WASM benchmarking and select the appropriate Wasmer backend. The benchmarking doc has been updated with more detailed instructions for the WASM setup as the LLVM backend requires more setup and it makes clearer how to run all the API benchmarks under WASM. Note that the Singlepass backend was not included as it failed to load our SIMD code even though Wasmer docs indicate that Singlepass supports simd128.
Also adds inlining tests for WASM builds.
Added the CI including inlining tests, I had it configured in another branch to run with that branch, here was a run of the workflow: |
#[cfg(all( | ||
feature = "public_imp", | ||
target_arch = "wasm32", | ||
target_feature = "simd128" | ||
))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI integration exposed this bug--the test should only be exposed if WASM SIMD is enabled.
Updated `wasm-runner` fixes the issue. Adds verbosity to CI WASM run to make it clearer as to what is being run.
Looking good! It will be a few days until I have time to review the rest though. |
@almann Are you sure there is no way around the shim for benchmarking? It looks like there is some basic support for wasi benchmarking in the criterion master branch: bheisler/criterion.rs#461 (comment). |
Diving into that issue and the associated PR in the downstream dependency ( @hkratz, did I maybe miss something here? |
Refactors underlying wasm code to be generic with respect to wasmer and wasmtime. Renames feature flags to be more in line with the different WASM backends to benchmark. Updates `BENCHMARKING.md` with the changes.
I also added Wasmtime support to the benchmarks in 48eacc4. I found actually benchmarks on the different WASM runtimes to be of interest, so assuming we keep the shim approach, I think these benchmarks are useful and a user can select whichever one (or all of them) as they see fit. For example here is the comparison of Wasmer Cranelift/LLVM back-ends vs. Wasmtime (which only has a Cranelift backend) for the basic validator (on Ubuntu 20.04, Ryzen 3950X [Zen2]): The caveat of the shim benchmarks apply, but it does suggest interesting differences between runtime and this code. |
FWIW, I tried playing around with this. I can definitely patch around RUSTFLAGS="-C target-feature=+simd128" CARGO_TARGET_WASM32_WASI_RUNNER="wasm-runner wasmer" cargo bench --no-default-features --target=wasm32-wasi --bench=throughput_basic
...
thread 'main' panicked at 'The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Error { kind: Unsupported, message: "operation not supported on this platform" }) }', /home/almann/.cargo/registry/src/github.com-1ecc6299db9ec823/rayon-core-1.9.1/src/registry.rs:170:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
error: failed to run `/home/almann/CLionProjects/simdutf8/bench/target/wasm32-wasi/release/deps/throughput_basic-716ee0bc7dd8261c.wasm`
│ 1: RuntimeError: unreachable
at __rust_start_panic (throughput_basic-716ee0bc7dd8261c.wasm[2750]:0x1dd651)
at rust_panic (throughput_basic-716ee0bc7dd8261c.wasm[2728]:0x1dcd76)
at std::panicking::rust_panic_with_hook::heefe90fce7240a03 (throughput_basic-716ee0bc7dd8261c.wasm[2721]:0x1dc9e3)
at std::panicking::begin_panic_handler::{{closure}}::h4f5bcef500c8e46d (throughput_basic-716ee0bc7dd8261c.wasm[2702]:0x1dbab7)
at std::sys_common::backtrace::__rust_end_short_backtrace::h53dbd65c57a9d0e6 (throughput_basic-716ee0bc7dd8261c.wasm[2701]:0x1dba06)
at rust_begin_unwind (throughput_basic-716ee0bc7dd8261c.wasm[2720]:0x1dc531)
at core::panicking::panic_fmt::h92e73467e6a2c091 (throughput_basic-716ee0bc7dd8261c.wasm[2894]:0x1eb224)
at core::result::unwrap_failed::h03dbf875e0072fe2 (throughput_basic-716ee0bc7dd8261c.wasm[2921]:0x1ec477)
at rayon_core::registry::global_registry::hb1f293f2c8dfd05b (throughput_basic-716ee0bc7dd8261c.wasm[1292]:0x12d0e8)
at rayon_core::current_num_threads::h35bc7a6b8dbf8c63 (throughput_basic-716ee0bc7dd8261c.wasm[1303]:0x12e058)
at rayon::iter::plumbing::bridge::ha8b8b23fcd9b38a6 (throughput_basic-716ee0bc7dd8261c.wasm[1021]:0x1058ae)
at criterion::analysis::estimates::h6521ee9def717bc3 (throughput_basic-716ee0bc7dd8261c.wasm[659]:0xb4089)
at criterion::analysis::common::h567d2c8475d0f44a (throughput_basic-716ee0bc7dd8261c.wasm[244]:0x22eb7)
at criterion::benchmark_group::BenchmarkGroup<M>::bench_with_input::hb0a4d9a4d0aa4552 (throughput_basic-716ee0bc7dd8261c.wasm[254]:0x258b5)
at simdutf8_bench::bench_input::hd027083377a798b4 (throughput_basic-716ee0bc7dd8261c.wasm[115]:0x11047)
at simdutf8_bench::criterion_benchmark::hf2c8a1c63f585fec (throughput_basic-716ee0bc7dd8261c.wasm[116]:0x1143d)
at throughput_basic::main::h9dc93b5f5255a651 (throughput_basic-716ee0bc7dd8261c.wasm[245]:0x24030)
at std::sys_common::backtrace::__rust_begin_short_backtrace::h2c5b7aaf52faf128 (throughput_basic-716ee0bc7dd8261c.wasm[127]:0x1234b)
at std::rt::lang_start::{{closure}}::h557e866ad9f65275 (throughput_basic-716ee0bc7dd8261c.wasm[147]:0x13ecd)
at std::rt::lang_start_internal::h44883fd0d3691d00 (throughput_basic-716ee0bc7dd8261c.wasm[2697]:0x1db731)
at __original_main (throughput_basic-716ee0bc7dd8261c.wasm[246]:0x240dd)
at _start (throughput_basic-716ee0bc7dd8261c.wasm[20]:0x18e8)
at _start.command_export (throughput_basic-716ee0bc7dd8261c.wasm[3016]:0x1f37d1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, criterion unconditionally uses rayon for its analysis. I was misled by the comment I linked. The WASM benchmarking is a bit more complex than I would like but I don't see a way around it and it is contained in the bench.
So once you fix the nits from this review I will merge this PR. Fuzz testing can be done independently.
Co-authored-by: Hans Kratz <hans@appfour.com>
FWIW, The CI failure appears to be transient, I pushed to https://github.com/almann/simdutf8/actions/runs/1674697191 Looking at the error--there was a problem downloading Wasmer, which didn't fail that step (but failed when attempting to run the tests as the runtime was missing):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for implementing this!
In rusticstuff#56, some code review suggestions got a bit mangled, this was fixed in `lib.rs`, but not in the corresponding README.
In #56, some code review suggestions got a bit mangled, this was fixed in `lib.rs`, but not in the corresponding README.
There are some TODOs (see below), but the implementation and test suite integration has been done.
wasm32_simd128
consistent withaarch64_neon
as there is noauto-detection for SIMD in WASM.
README.md
with WASM build instructions.should_panic
test case will be ignoreddue to limitations of the test driver with
wasm32-wasi
.TODO
wasm32-wasi
for the benchmarks have issues with bothwasmer
andwasmtime
. I suspect this is partially because of criterion expectingwasm32
targets to run to havewasm-bindgen
. One option would be to compile a shim of the library as astaticlib
crate targetingwasm32-unknown-unknown
and usingwasmer
orwasmtime
to compile/link the function and benchmark it in the current framework.Implementation for #36