Better AArch64 performance #9

Open
moscicky opened this issue Jul 21, 2023 · 2 comments

@moscicky

moscicky commented Jul 21, 2023

Even though JEP 438 states that both x64 and AArch64 architectures should benefit from the new Vector API, the performance of simdjson-java on an M1 Mac is currently far worse than that of other parsers:

Benchmark                                                                   Mode  Cnt     Score    Error  Units
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_fastjson        thrpt    5  1229.991 ± 39.538  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jackson         thrpt    5  1099.877 ±  9.560  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter        thrpt    5   607.902 ± 10.469  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_jsoniter_scala  thrpt    5  1930.694 ± 41.766  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjson        thrpt    5    26.287 ±  0.295  ops/s
ParseAndSelectBenchmark.countUniqueUsersWithDefaultProfile_simdjsonPadded  thrpt    5    26.516 ±  0.686  ops/s

This may be due to the use of 256-bit vectors. I have found a thread which states that:

on AArch64 NEON, the max hardware vector size is 128 bits. So for 256-bits, we are not able to intrinsify to use SIMD directly, which will fall back to Java implementation of those APIs

When running the benchmark with '-XX:+UnlockDiagnosticVMOptions', '-XX:+PrintIntrinsics' the following output can be observed, supporting this theory:

** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=load vlen=32 etype=byte ismask=no
** not supported: arity=1 op=store vlen=32 etype=byte ismask=no
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte
** not supported: arity=0 op=broadcast vlen=32 etype=byte ismask=0 bcast_mode=0
** not supported: arity=2 op=comp/0 vlen=32 etype=byte ismask=usestore
** not supported: arity=1 op=cast#438/3 vlen2=32 etype2=byte

Obviously AArch64 support is not as important as x64, but it may be interesting to make the implementation flexible to support both architectures. Perhaps the C++ implementation can be used as a reference.

Anyway, great work so far on the Java port, the results on x64 are very impressive!

@piotrrzysko
Member

Thanks for researching that!

Would you mind adding and running the following test?

    @Test
    public void printPreferableSpecies() {
        System.out.println(ByteVector.SPECIES_PREFERRED);
    }

It would tell us what the preferred vector length is on your machine. Unfortunately, simply replacing ByteVector.SPECIES_256 with ByteVector.SPECIES_PREFERRED in the library is not straightforward. In several places, we perform bitwise operations where it's easier to know the length up front, for example to avoid using extra masks to extract the relevant part of a long.
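
To illustrate why a fixed length simplifies the bitwise code, here is a minimal sketch in plain Java (no Vector API, so it runs without the incubator module). It is not taken from simdjson-java: `laneMask` is a hypothetical stand-in for a vector comparison followed by a mask-to-long conversion, and the 64-byte block size is an assumption made for the example. With 32 lanes, the 64-bit mask is always two halves joined by one constant shift; with an unknown lane count, the chunk count and shifts become runtime values.

```java
import java.util.Arrays;

public class MaskWidthSketch {
    // Hypothetical stand-in for a vector compare: bit i of the result is set
    // when block[offset + i] == target, for i in [0, lanes).
    static long laneMask(byte[] block, int offset, int lanes, byte target) {
        long mask = 0L;
        for (int i = 0; i < lanes; i++) {
            if (block[offset + i] == target) {
                mask |= 1L << i;
            }
        }
        return mask;
    }

    // Fixed 32-lane (256-bit) case: exactly two chunks and one constant shift.
    static long bitmask256(byte[] block, byte target) {
        long lo = laneMask(block, 0, 32, target);
        long hi = laneMask(block, 32, 32, target);
        return lo | (hi << 32);
    }

    // Length-agnostic case: the number of chunks and each shift depend on
    // the lane count, which is only known at run time.
    static long bitmaskGeneric(byte[] block, int lanes, byte target) {
        long result = 0L;
        for (int offset = 0; offset < 64; offset += lanes) {
            result |= laneMask(block, offset, lanes, target) << offset;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] block = new byte[64];
        Arrays.fill(block, (byte) ' ');
        block[3] = '"';
        block[40] = '"';

        long fixed = bitmask256(block, (byte) '"');
        long generic = bitmaskGeneric(block, 16, (byte) '"');
        System.out.println(fixed == generic); // prints true
    }
}
```

Both variants produce the same mask; the difference is that the fixed-length version can hard-code its shifts and needs no loop.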

In my opinion, to support different vector lengths, we would need to provide a dedicated implementation for each length and then, based on ByteVector.SPECIES_PREFERRED, pick the one that is best for the machine the library runs on.
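
The selection step could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Implementation` enum and `select` method are hypothetical names, not part of simdjson-java, and the rule of falling back to the 256-bit variant for wider hardware is an assumption. In the library, the argument would presumably come from `ByteVector.SPECIES_PREFERRED.vectorBitSize()`; the sketch takes a plain `int` so it runs without the incubator module.

```java
public class SpeciesDispatchSketch {
    // Hypothetical set of dedicated implementations, one per vector length.
    enum Implementation {
        BITS_128(16),
        BITS_256(32);

        final int bytesPerVector;

        Implementation(int bytesPerVector) {
            this.bytesPerVector = bytesPerVector;
        }
    }

    // Picks the widest dedicated implementation that does not exceed the
    // preferred vector size reported for the machine.
    // Assumption: hardware preferring more than 256 bits uses the 256-bit variant.
    static Implementation select(int preferredBits) {
        if (preferredBits >= 256) {
            return Implementation.BITS_256;
        }
        if (preferredBits >= 128) {
            return Implementation.BITS_128;
        }
        throw new UnsupportedOperationException(
                "No implementation for " + preferredBits + "-bit vectors");
    }

    public static void main(String[] args) {
        // With the Vector API available, this value would be
        // ByteVector.SPECIES_PREFERRED.vectorBitSize().
        System.out.println(select(128)); // prints BITS_128
        System.out.println(select(256)); // prints BITS_256
    }
}
```

Doing the selection once, for example in a static initializer, would keep the per-document parsing path free of length checks.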

@moscicky
Author

moscicky commented Jul 22, 2023

No problem, the output is:

Species[byte, 16, S_128_BIT]

The approach you suggest sounds reasonable 👍🏻
