Heavily Optimized std.mem.eql with SIMD
#18389
Merged
Conversation
Co-authored-by: Protty <45520026+kprotty@users.noreply.github.com>
This allows stage2 to make use of the optimized eql.
squeek502 reviewed on Dec 28, 2023
Co-authored-by: Ryan Liptak <squeek502@hotmail.com>
Instead of just checking for a power of 2.
Vexu reviewed on Dec 28, 2023
karlseguin reviewed on Dec 28, 2023
d24b412 to 3325f52
No point in having it public, as it creates a duplicate way of doing something; this conflicts with the Zen.
sno2 reviewed on Dec 29, 2023
rootbeer reviewed on Dec 29, 2023
Vexu approved these changes on Dec 30, 2023
Member: Great work! I love to see this kind of collaboration.
rwsalie pushed a commit to rwsalie/zig that referenced this pull request on Jan 27, 2024:

* optimized memeql
* add `sched_setaffinity` to `std.os.linux`

Co-authored-by: Protty <45520026+kprotty@users.noreply.github.com>
Co-authored-by: Ryan Liptak <squeek502@hotmail.com>
The current stdlib implementation is not as good as it can be for `u8`, which is by far the most commonly used element type. @kprotty and I have taken on the task of vectorizing and accelerating this function by orders of magnitude. The idea here is to split this API into two functions: the normal `eql`, which works on anything, and the much more optimized `eqlBytes`, which specifically optimizes byte slices. `eql` calls `eqlBytes` if `T` is `u8`. For any other integers, we still use a much faster XOR-accumulator design; see benchmarks below.
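For illustration, here is a minimal sketch of that split, assuming hypothetical names `eqlSketch` and `eqlBytesSketch` and a fixed 16-byte vector width (the real `eqlBytes` picks the width from the target and handles small inputs and tails more carefully):

```zig
const std = @import("std");

// Hypothetical sketch of the split described above; `eqlSketch` and
// `eqlBytesSketch` are illustrative names, not the actual stdlib code.
// This sketch only handles integer element types.
fn eqlSketch(comptime T: type, a: []const T, b: []const T) bool {
    // Byte slices get the dedicated SIMD-friendly path.
    if (T == u8) return eqlBytesSketch(a, b);
    if (a.len != b.len) return false;
    // XOR-accumulator style loop for other integers: OR together the XOR
    // of each pair so the loop body stays branch-free, then compare once.
    var acc: T = 0;
    for (a, b) |x, y| acc |= x ^ y;
    return acc == 0;
}

// Compares one vector-sized chunk at a time; the real `eqlBytes` handles
// small inputs and tails more carefully than this.
fn eqlBytesSketch(a: []const u8, b: []const u8) bool {
    if (a.len != b.len) return false;
    const vec_len = 16; // assumed width for the sketch
    const V = @Vector(vec_len, u8);
    var i: usize = 0;
    while (i + vec_len <= a.len) : (i += vec_len) {
        const va: V = a[i..][0..vec_len].*;
        const vb: V = b[i..][0..vec_len].*;
        if (!@reduce(.And, va == vb)) return false;
    }
    // Scalar tail for the remaining bytes.
    while (i < a.len) : (i += 1) {
        if (a[i] != b[i]) return false;
    }
    return true;
}

test "sketch matches std.mem.eql" {
    try std.testing.expect(eqlSketch(u8, "same bytes!", "same bytes!"));
    try std.testing.expect(!eqlSketch(u8, "same bytes!", "same bytes?"));
    try std.testing.expect(eqlSketch(u32, &[_]u32{ 1, 2, 3 }, &[_]u32{ 1, 2, 3 }));
    try std.testing.expect(!eqlSketch(u32, &[_]u32{ 1, 2, 3 }, &[_]u32{ 1, 2, 4 }));
}
```

The point of the XOR-accumulator loop is that the body contains no per-element branch, so the compiler is free to vectorize it; only the final `acc == 0` check branches.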
Benchmarks
The benchmarks are run with as good a script as I was able to make, including using CPU counters, pinning the process to one core, and clearing caches between different function runs. Warmups happen for each function, as well as at least 1000 runs.
The CPU the benchmarks were run on:
Important to note that the CPU does have `constant_tsc` enabled.

There are 3 benchmarks for the 3 categories: `start`, `middle`, `same`.

- `start` is when the difference is at the first byte. This is meant to show the start-up cost, and how fast it exits.
- `middle` shows a sort of "middle" case. Is it able to find the difference faster in smaller inputs? Does it have an early exit?
- `same` is the worst-case scenario. Here the entire input must be checked, and it is actually the most common input: people will usually use equality expecting it to be true rather than false.

Run with:
```
zig build-exe benchmark.zig -lc -OReleaseFast
```
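As a rough idea of what such a harness looks like, here is a simplified sketch that constructs the three input cases and times `std.mem.eql` with `std.time.Timer`; the actual script additionally uses CPU performance counters, core pinning, and cache clearing, so treat this only as an illustration:

```zig
const std = @import("std");

// Illustrative harness only: builds the `start`, `middle`, and `same`
// inputs described above and times std.mem.eql after a warmup pass. The
// real benchmark pins the process to one core, clears caches between
// runs, and reads hardware cycle counters instead of a wall clock.
pub fn main() !void {
    const len = 4096;
    var a: [len]u8 = undefined;
    var b: [len]u8 = undefined;
    @memset(&a, 0xAA);
    @memset(&b, 0xAA);

    // `same`: identical buffers, so the whole input has to be scanned.
    try bench("same", &a, &b);

    // `middle`: first difference roughly in the middle of the input.
    b[len / 2] = 0x55;
    try bench("middle", &a, &b);

    // `start`: difference at the very first byte, measuring startup cost.
    b[0] = 0x55;
    try bench("start", &a, &b);
}

fn bench(name: []const u8, a: []const u8, b: []const u8) !void {
    const iterations = 1000;
    // Warmup so the timed loop does not pay cold-cache and branch-training costs.
    for (0..iterations) |_| std.mem.doNotOptimizeAway(std.mem.eql(u8, a, b));

    var timer = try std.time.Timer.start();
    for (0..iterations) |_| std.mem.doNotOptimizeAway(std.mem.eql(u8, a, b));
    const ns_per_iter = timer.read() / iterations;
    std.debug.print("{s}: {d} ns/iter\n", .{ name, ns_per_iter });
}
```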
The `start` on both versions is measured at under 100 cycles, which I consider to be getting close to the error margin. At this level there isn't any point in comparing them; with both running this fast there will be virtually no performance difference for the user.

Another benchmark I tried was running the `perf_test.zig` inside of `lib/std/zig`. I felt that the parser and tokenizer would be a good spot with many `mem.eql` usages.

parsing speed: 117.96MiB/s (before)
parsing speed: 124.25MiB/s (after)

That is a solid 6% speed increase on average. Note that I upped the iteration count from 100 to 10000 to get the most precise readings.
I also added `sched_setaffinity`, as it seems `getaffinity` was present but there was no `setaffinity`, and I needed it for the benchmarking.
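Pinning the process for the benchmark then looks roughly like the following; the exact signature of the new `std.os.linux.sched_setaffinity` wrapper is an assumption in this sketch, so check the source before relying on it:

```zig
const std = @import("std");
const linux = std.os.linux;

// Hedged sketch of pinning the benchmark process to CPU 0. The wrapper's
// exact signature is assumed here; see lib/std/os/linux.zig for the real one.
pub fn pinToCpu0() !void {
    var set: linux.cpu_set_t = std.mem.zeroes(linux.cpu_set_t);
    set[0] = 1; // lowest bit selects CPU 0
    try linux.sched_setaffinity(0, &set); // pid 0 means the calling process
}
```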