
Conversation

@Rexicon226 (Contributor) commented Dec 28, 2023

The current stdlib implementation of mem.eql is not as fast as it could be for u8, which is by far the most commonly used element type. @kprotty and I have taken on the task of vectorizing and accelerating this function by orders of magnitude. The idea is to split this API into two functions: the generic eql, which works on any type, and the much more optimized eqlBytes, which is specialized for bytes. eql calls eqlBytes when T is u8.

For all other integer types, we still use a much faster xor-accumulator design; see the benchmarks below.
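For illustration, here is the xor-accumulator idea as a minimal C sketch (the actual implementation is in Zig and generic over T; the name eql_xor and the u32 element type are mine, not from the PR):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* xor-accumulator equality: XOR corresponding elements and OR the
   results into a single accumulator. The loop body has no
   data-dependent branches, so the compiler is free to vectorize it;
   the slices are equal iff the accumulator is still zero at the end. */
static bool eql_xor(const uint32_t *a, const uint32_t *b, size_t len) {
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc |= a[i] ^ b[i];
    return acc == 0;
}
```

The trade-off is that this design always scans the whole input, which suits the case where the slices really are equal.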

Benchmarks

The benchmarks are run with as rigorous a script as I was able to make: it uses CPU counters, pins the process to one core, and clears caches between runs of different functions. Each function gets a warmup phase and at least 1000 timed runs.

The CPU the benchmarks were run on:

AMD Ryzen 9 6900HS Creator Edition (16) @ 3.293GHz

It is important to note that the CPU has constant_tsc enabled, so the timestamp counter ticks at a constant rate regardless of frequency scaling.

There are three benchmarks, one for each of three categories: start, middle, and same.

start places the difference at the first byte. This is meant to show the start-up cost, and how quickly the function exits.
middle shows a sort of "middle" case: can the function find the mismatch quickly in smaller inputs? Does it have an early exit?
same is the worst-case scenario: the entire input must be checked. It is also the most common input, since callers usually compare buffers expecting them to be equal rather than unequal.
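The three input categories can be constructed like this (a C sketch; the buffer size, contents, and the helper name make_case are arbitrary choices of mine):

```c
#include <assert.h>
#include <string.h>

enum { N = 1024 }; /* arbitrary buffer size for the sketch */

/* Build a pair of N-byte buffers for one of the three cases:
   "start"  -> differ at the first byte (start-up cost / fast exit),
   "middle" -> differ at the midpoint (partial scan / early exit),
   "same"   -> identical buffers (full scan, the common case). */
static void make_case(char a[N], char b[N], const char *kind) {
    memset(a, 'x', N);
    memset(b, 'x', N);
    if (strcmp(kind, "start") == 0)
        b[0] = 'y';
    else if (strcmp(kind, "middle") == 0)
        b[N / 2] = 'y';
    /* "same": leave the buffers identical */
}
```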

Run with: `zig build-exe benchmark.zig -lc -OReleaseFast`

[benchmark results and plot images]

The start case on both versions measures under 100 cycles, which I consider close to the error margin. At that level there is no point in comparing them; both run so fast that the user will see virtually no difference.

Another benchmark I tried was running the perf_test.zig inside lib/std/zig. I felt the parser and tokenizer would be a good spot, with many mem.eql usages.

stdlib:  parsing speed: 117.96MiB/s
xor eql: parsing speed: 124.25MiB/s

That is a solid ~5% speed increase on average. Note that I increased the iteration count from 100 to 10000 to get more precise readings.

I also added sched_setaffinity, as getaffinity was already present but setaffinity was not, and I needed it for the benchmarking.
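For reference, this is what pinning looks like through the glibc wrapper on Linux (a sketch; the new Zig binding lives in std.os.linux, and pin_to_cpu here is just an illustrative helper name):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to one CPU so the scheduler cannot migrate it
   mid-benchmark (migration perturbs caches and timing readings).
   pid 0 means "the calling thread". Linux-specific. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```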

Rexicon226 and others added 4 commits December 26, 2023 21:20
Co-authored-by: Protty <45520026+kprotty@users.noreply.github.com>
this allows stage2 to make use of the optimized eql.
Rexicon226 and others added 2 commits December 28, 2023 00:00
Co-authored-by: Ryan Liptak <squeek502@hotmail.com>
instead of just checking pow of 2.
Rexicon226 and others added 3 commits December 28, 2023 18:13
no point in having it public as it creates a duplicate way of doing something. this conflicts with the zen.
@andrewrk andrewrk merged commit 2f8e434 into ziglang:master Jan 9, 2024
@andrewrk (Member) commented Jan 9, 2024

Great work! I love to see this kind of collaboration.

your next mission...

@Rexicon226 Rexicon226 deleted the optimized-mem-eql branch January 9, 2024 07:53
rwsalie pushed a commit to rwsalie/zig that referenced this pull request Jan 27, 2024
* optimized memeql
* add `sched_setaffinity` to `std.os.linux`

Co-authored-by: Protty <45520026+kprotty@users.noreply.github.com>
Co-authored-by: Ryan Liptak <squeek502@hotmail.com>
