
Misc Updates #57

Open · Dr-Emann wants to merge 14 commits into main

Conversation

@Dr-Emann (Collaborator) commented Oct 10, 2023

  • Update to Rust edition 2018, remove unneeded extern crates
  • Remove trait bounds from type definitions (see the sketch after this list). This is unlikely to impact anyone in practice, but it avoids having to propagate bounds to any structs which contain these types.
  • Port the benchmarks to criterion, for more thorough measurement and the ability to run them on stable
  • Follow some clippy suggestions to mark functions as #[must_use]
  • Update to maintained versions of (dev) dependencies
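
A minimal sketch of the trait-bound change (illustrative, not jetscii's exact code): the bound moves off the struct definition and onto the impl block that needs it, so downstream structs embedding the type no longer have to repeat it.

    pub struct Searcher<F> {
        fallback: F,
    }

    impl<F> Searcher<F>
    where
        F: Fn(u8) -> bool,
    {
        pub fn find(&self, haystack: &[u8]) -> Option<usize> {
            haystack.iter().position(|&b| (self.fallback)(b))
        }
    }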

@Dr-Emann (Collaborator, Author) commented Oct 10, 2023

Here are the benchmark results for my machines. It looks like jetscii is somewhat behind memchr (and sometimes std) in most cases, unfortunately.

  • At least on my machines, it looks like memchr's memmem always beats our substring search.
  • It looks like our fallback implementation is pretty weak: it gets trounced by memchr, and often even by std. This is probably at least partly because of the required dynamic dispatch per character when using a static object.
    • We might be able to do something like jetscii!{ static XML5: AsciiChars = b"<>&'\""; } and allow the macro to use a monomorphized type somehow?
    • We might be able to have the fallback function do the searching itself, which would amortize the dynamic function call overhead.
  • For small-to-medium numbers of characters, even with the SSE implementation, memchr may be a better alternative if you're going to search the whole string anyway, rather than just once for the first character in a set (see the sketch below).
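
As a hedged example of that last point, for sets of up to three bytes memchr has dedicated routines (this assumes the memchr crate; the byte set mirrors the find_xml_3 benchmark):

    use memchr::memchr3;

    // First occurrence of '<', '>', or '&' in the haystack.
    fn find_xml3(haystack: &[u8]) -> Option<usize> {
        memchr3(b'<', b'>', b'&', haystack)
    }
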
Windows x64 Benchmark Results

Environment details:

Windows 10, Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, AES, AVX, AVX2, FMA3, TSX

find_last_space/ascii_chars
                        time:   [2.3670 ms 2.4117 ms 2.4535 ms]
                        thrpt:  [1.9901 GiB/s 2.0247 GiB/s 2.0629 GiB/s]
find_last_space/stdlib_find_string
                        time:   [16.520 ms 16.725 ms 16.907 ms]
                        thrpt:  [295.73 MiB/s 298.95 MiB/s 302.66 MiB/s]
find_last_space/stdlib_find_char
                        time:   [1.7436 ms 1.7464 ms 1.7493 ms]
                        thrpt:  [2.7914 GiB/s 2.7959 GiB/s 2.8004 GiB/s]
find_last_space/stdlib_find_char_set
                        time:   [17.826 ms 18.047 ms 18.238 ms]
                        thrpt:  [274.15 MiB/s 277.05 MiB/s 280.49 MiB/s]
find_last_space/stdlib_find_closure
                        time:   [17.683 ms 17.946 ms 18.181 ms]
                        thrpt:  [275.01 MiB/s 278.62 MiB/s 282.75 MiB/s]
find_last_space/stdlib_iter_position
                        time:   [13.672 ms 13.694 ms 13.720 ms]
                        thrpt:  [364.43 MiB/s 365.13 MiB/s 365.71 MiB/s]
find_last_space/memchr  time:   [480.79 µs 485.71 µs 491.34 µs]
                        thrpt:  [9.9376 GiB/s 10.053 GiB/s 10.156 GiB/s]

find_xml_3/ascii_chars  time:   [2.5123 ms 2.5158 ms 2.5195 ms]
                        thrpt:  [1.9380 GiB/s 1.9409 GiB/s 1.9436 GiB/s]
find_xml_3/stdlib_find_char_set
                        time:   [18.783 ms 19.131 ms 19.463 ms]
                        thrpt:  [256.90 MiB/s 261.35 MiB/s 266.20 MiB/s]
find_xml_3/stdlib_find_closure
                        time:   [20.092 ms 20.114 ms 20.137 ms]
                        thrpt:  [248.30 MiB/s 248.59 MiB/s 248.85 MiB/s]
find_xml_3/stdlib_iter_position
                        time:   [6.7072 ms 6.7163 ms 6.7257 ms]
                        thrpt:  [743.42 MiB/s 744.45 MiB/s 745.46 MiB/s]
find_xml_3/memchr       time:   [672.72 µs 688.70 µs 703.63 µs]
                        thrpt:  [6.9394 GiB/s 7.0899 GiB/s 7.2583 GiB/s]

find_xml_5/ascii_chars  time:   [2.4397 ms 2.4723 ms 2.5012 ms]
                        thrpt:  [1.9522 GiB/s 1.9750 GiB/s 2.0014 GiB/s]
find_xml_5/stdlib_find_char_set
                        time:   [17.258 ms 17.834 ms 18.395 ms]
                        thrpt:  [271.81 MiB/s 280.36 MiB/s 289.72 MiB/s]
find_xml_5/stdlib_find_closure
                        time:   [20.058 ms 20.079 ms 20.101 ms]
                        thrpt:  [248.74 MiB/s 249.02 MiB/s 249.28 MiB/s]
find_xml_5/stdlib_iter_position
                        time:   [6.5170 ms 6.6021 ms 6.6773 ms]
                        thrpt:  [748.80 MiB/s 757.34 MiB/s 767.22 MiB/s]
find_xml_5/memchr       time:   [1.2204 ms 1.2601 ms 1.3022 ms]
                        thrpt:  [3.7498 GiB/s 3.8749 GiB/s 4.0009 GiB/s]

find_big_16/ascii_chars time:   [2.5150 ms 2.5193 ms 2.5242 ms]
                        thrpt:  [1.9344 GiB/s 1.9381 GiB/s 1.9415 GiB/s]
find_big_16/stdlib_find_char_set
                        time:   [20.098 ms 20.123 ms 20.150 ms]
                        thrpt:  [248.14 MiB/s 248.47 MiB/s 248.78 MiB/s]
find_big_16/stdlib_find_closure
                        time:   [14.302 ms 14.907 ms 15.536 ms]
                        thrpt:  [321.84 MiB/s 335.42 MiB/s 349.61 MiB/s]
find_big_16/stdlib_iter_position
                        time:   [14.219 ms 14.408 ms 14.609 ms]
                        thrpt:  [342.25 MiB/s 347.02 MiB/s 351.64 MiB/s]
find_big_16/memchr      time:   [3.5348 ms 3.6534 ms 3.7668 ms]
                        thrpt:  [1.2963 GiB/s 1.3365 GiB/s 1.3814 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [21.674 ns 21.716 ns 21.767 ns]
                        thrpt:  [43.813 MiB/s 43.916 MiB/s 44.001 MiB/s]
find_big_16_early_return/stdlib_find_char_set
                        time:   [7.3062 ns 7.5579 ns 7.7857 ns]
                        thrpt:  [122.49 MiB/s 126.18 MiB/s 130.53 MiB/s]
find_big_16_early_return/stdlib_find_closure
                        time:   [9.2522 ns 9.2635 ns 9.2758 ns]
                        thrpt:  [102.81 MiB/s 102.95 MiB/s 103.08 MiB/s]
find_big_16_early_return/stdlib_iter_position
                        time:   [5.4156 ns 5.4234 ns 5.4331 ns]
                        thrpt:  [175.53 MiB/s 175.84 MiB/s 176.10 MiB/s]
find_big_16_early_return/memchr
                        time:   [3.5441 ms 3.5672 ms 3.5910 ms]
                        thrpt:  [278.48   B/s 280.33   B/s 282.16   B/s]

find_substring/substring
                        time:   [2.0862 ms 2.0987 ms 2.1088 ms]
                        thrpt:  [2.3154 GiB/s 2.3266 GiB/s 2.3405 GiB/s]
find_substring/stdlib_find_string
                        time:   [2.6564 ms 2.6819 ms 2.7030 ms]
                        thrpt:  [1.8064 GiB/s 1.8207 GiB/s 1.8381 GiB/s]
find_substring/memchr   time:   [767.40 µs 781.51 µs 792.49 µs]
                        thrpt:  [6.1613 GiB/s 6.2479 GiB/s 6.3628 GiB/s]

macOS M1 Pro Benchmark Results
find_last_space/ascii_chars
                        time:   [4.9585 ms 4.9618 ms 4.9654 ms]
                        thrpt:  [1007.0 MiB/s 1007.7 MiB/s 1008.4 MiB/s]
find_last_space/stdlib_find_string
                        time:   [3.2210 ms 3.2276 ms 3.2361 ms]
                        thrpt:  [1.5089 GiB/s 1.5128 GiB/s 1.5159 GiB/s]
find_last_space/stdlib_find_char
                        time:   [244.10 µs 244.25 µs 244.42 µs]
                        thrpt:  [19.977 GiB/s 19.991 GiB/s 20.004 GiB/s]
find_last_space/stdlib_find_char_set
                        time:   [3.3083 ms 3.3103 ms 3.3125 ms]
                        thrpt:  [1.4740 GiB/s 1.4750 GiB/s 1.4759 GiB/s]
find_last_space/stdlib_find_closure
                        time:   [3.3053 ms 3.3072 ms 3.3093 ms]
                        thrpt:  [1.4755 GiB/s 1.4764 GiB/s 1.4773 GiB/s]
find_last_space/stdlib_iter_position
                        time:   [1.6432 ms 1.6458 ms 1.6483 ms]
                        thrpt:  [2.9623 GiB/s 2.9668 GiB/s 2.9715 GiB/s]
find_last_space/memchr  time:   [62.832 µs 62.871 µs 62.910 µs]
                        thrpt:  [77.616 GiB/s 77.664 GiB/s 77.713 GiB/s]

find_xml_3/ascii_chars  time:   [4.9489 ms 4.9521 ms 4.9556 ms]
                        thrpt:  [1009.0 MiB/s 1009.7 MiB/s 1010.3 MiB/s]
find_xml_3/stdlib_find_char_set
                        time:   [3.5181 ms 3.5205 ms 3.5231 ms]
                        thrpt:  [1.3859 GiB/s 1.3870 GiB/s 1.3879 GiB/s]
find_xml_3/stdlib_find_closure
                        time:   [4.7493 ms 4.7526 ms 4.7561 ms]
                        thrpt:  [1.0266 GiB/s 1.0274 GiB/s 1.0281 GiB/s]
find_xml_3/stdlib_iter_position
                        time:   [2.4937 ms 2.4952 ms 2.4969 ms]
                        thrpt:  [1.9556 GiB/s 1.9569 GiB/s 1.9581 GiB/s]
find_xml_3/memchr       time:   [157.27 µs 157.51 µs 157.72 µs]
                        thrpt:  [30.958 GiB/s 31.001 GiB/s 31.048 GiB/s]

find_xml_5/ascii_chars  time:   [4.9523 ms 4.9568 ms 4.9621 ms]
                        thrpt:  [1007.6 MiB/s 1008.7 MiB/s 1009.6 MiB/s]
find_xml_5/stdlib_find_char_set
                        time:   [3.5223 ms 3.5243 ms 3.5265 ms]
                        thrpt:  [1.3846 GiB/s 1.3855 GiB/s 1.3862 GiB/s]
find_xml_5/stdlib_find_closure
                        time:   [4.4779 ms 4.4810 ms 4.4844 ms]
                        thrpt:  [1.0888 GiB/s 1.0897 GiB/s 1.0904 GiB/s]
find_xml_5/stdlib_iter_position
                        time:   [2.4932 ms 2.4950 ms 2.4970 ms]
                        thrpt:  [1.9555 GiB/s 1.9570 GiB/s 1.9585 GiB/s]
find_xml_5/memchr       time:   [275.68 µs 276.01 µs 276.34 µs]
                        thrpt:  [17.670 GiB/s 17.691 GiB/s 17.712 GiB/s]

find_big_16/ascii_chars time:   [4.9420 ms 4.9464 ms 4.9507 ms]
                        thrpt:  [1010.0 MiB/s 1010.8 MiB/s 1011.7 MiB/s]
find_big_16/stdlib_find_char_set
                        time:   [3.3028 ms 3.3075 ms 3.3131 ms]
                        thrpt:  [1.4738 GiB/s 1.4763 GiB/s 1.4784 GiB/s]
find_big_16/stdlib_find_closure
                        time:   [3.3341 ms 3.3364 ms 3.3388 ms]
                        thrpt:  [1.4625 GiB/s 1.4635 GiB/s 1.4645 GiB/s]
find_big_16/stdlib_iter_position
                        time:   [2.2261 ms 2.2276 ms 2.2291 ms]
                        thrpt:  [2.1905 GiB/s 2.1919 GiB/s 2.1935 GiB/s]
find_big_16/memchr      time:   [628.92 µs 629.29 µs 629.68 µs]
                        thrpt:  [7.7545 GiB/s 7.7592 GiB/s 7.7638 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [961.88 ps 963.12 ps 964.20 ps]
                        thrpt:  [989.08 MiB/s 990.19 MiB/s 991.47 MiB/s]
find_big_16_early_return/stdlib_find_char_set
                        time:   [947.76 ps 950.89 ps 953.64 ps]
                        thrpt:  [1000.0 MiB/s 1002.9 MiB/s 1006.2 MiB/s]
find_big_16_early_return/stdlib_find_closure
                        time:   [1.2614 ns 1.2633 ns 1.2656 ns]
                        thrpt:  [753.55 MiB/s 754.92 MiB/s 756.07 MiB/s]
find_big_16_early_return/stdlib_iter_position
                        time:   [680.30 ps 680.87 ps 681.37 ps]
                        thrpt:  [1.3668 GiB/s 1.3678 GiB/s 1.3690 GiB/s]
find_big_16_early_return/memchr
                        time:   [565.92 µs 566.40 µs 566.84 µs]
                        thrpt:  [1.7228 KiB/s 1.7241 KiB/s 1.7256 KiB/s]

find_substring/substring
                        time:   [13.200 ms 13.211 ms 13.224 ms]
                        thrpt:  [378.11 MiB/s 378.46 MiB/s 378.78 MiB/s]
find_substring/stdlib_find_string
                        time:   [497.01 µs 497.94 µs 498.79 µs]
                        thrpt:  [9.7893 GiB/s 9.8060 GiB/s 9.8243 GiB/s]
find_substring/memchr   time:   [175.29 µs 175.48 µs 175.66 µs]
                        thrpt:  [27.797 GiB/s 27.826 GiB/s 27.855 GiB/s]

@shepmaster (Owner) commented Oct 13, 2023

Hey, this is wonderful, thank you! All of the code changes look fine and I'd be happy to merge them. That being said...

> It looks like jetscii is somewhat behind memchr (and sometimes std) in most cases, unfortunately.

This is quite surprising! I ran the current set of benchmarks on my Windows machine running Ubuntu inside of WSL:

test                                          MB/s
bench::xml_delim_5_ascii_chars               13515
bench::xml_delim_5_stdlib_find_char_closure   1412
bench::xml_delim_5_stdlib_find_char_set       1812
bench::xml_delim_5_stdlib_iterator_position   4215

Comparing to the criterion benchmarks:

name                             MB/s
find_xml_5/ascii_chars           6765
find_xml_5/stdlib_find_closure   1438
find_xml_5/stdlib_find_char_set  1844
find_xml_5/stdlib_iter_position  4360

Something seems quite suspicious, as the speed of Jetscii is almost exactly half. Was the old benchmark flawed? Is criterion doing something different?

@shepmaster (Owner) commented

Hmm, CI is indicating that these can't be const yet, as the closure may need to be dropped. To my knowledge, there's no way of specifying that we only want closures that will not be dropped.

@Dr-Emann (Collaborator, Author) commented

Interesting! In the case where we know we have SSE 4.2, we don't use the passed fallback at all, so it has to be dropped in the function.

We can probably do some trickery so the macro result is const (probably with a #[doc(hidden)] somewhere), but the new fn won't be.
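
A minimal sketch of the issue (assumed shape, not jetscii's exact code): a const fn may move its generic parameter somewhere, but any path that discards it must drop it, and dropping a value of generic type is not allowed in a const fn on stable (error E0493).

    struct AsciiChars<F> {
        fallback: F,
    }

    impl<F: Fn(u8) -> bool> AsciiChars<F> {
        const fn new(fallback: F) -> Self {
            // OK: `fallback` is moved into the struct; nothing is dropped here.
            AsciiChars { fallback }
            // But a SIMD-only path that ignored `fallback` would have to drop
            // it inside this function, which fails to compile in a const fn.
        }
    }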

@Dr-Emann (Collaborator, Author) commented

cargo semver-checks reports:

     Parsing jetscii v0.5.3 (current)
     Parsing jetscii v0.5.3 (baseline, cached)
    Checking jetscii v0.5.3 -> v0.5.3 (no change)
   Completed [   0.007s] 51 checks; 50 passed, 1 failed, 0 unnecessary

--- failure inherent_method_must_use_added: inherent method #[must_use] added ---

Description:
An inherent method is now #[must_use]. Downstream crates that did not use its return value will get a compiler lint.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-must_use-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.24.0/src/lints/inherent_method_must_use_added.ron

Failed in:
  method jetscii::ByteSubstring::new in /Users/zach/Development/tmp/jetscii/src/lib.rs:321
  method jetscii::ByteSubstring::find in /Users/zach/Development/tmp/jetscii/src/lib.rs:343
  method jetscii::Substring::new in /Users/zach/Development/tmp/jetscii/src/lib.rs:359
  method jetscii::Substring::find in /Users/zach/Development/tmp/jetscii/src/lib.rs:372
       Final [   0.007s] semver requires new minor version: 0 major and 1 minor checks failed

I personally think that's fine; I don't mind introducing extra warnings in a minor version bump if the previous use was clearly useless.

Dr-Emann force-pushed the updates branch 3 times, most recently from 319143a to b3468b6 on October 14, 2023 02:17
@Dr-Emann (Collaborator, Author) commented

> Something seems quite suspicious as the speed of Jetscii is almost exactly half. Was the old benchmark flawed? Is criterion doing something different?

I agree, and I can reproduce this on my x64 machine. It's really confusing: I don't see anything wrong with either benchmark. I see no difference (within 0.2%) on my M1 Mac, or if I force the fallback impl on the x64 machine, but I do see the 2x difference if I force the SSE implementation or let the runtime switch pick the SIMD implementation, so it's somehow specific to the SSE implementation.

@Dr-Emann (Collaborator, Author) commented Oct 14, 2023

AHAH! I figured it out: see e8ace35.

Basically, I just needed to add some #[inline]s. Criterion compiles benchmarks as a separate crate, and some tiny functions not being eligible for cross-crate inlining made a huge impact. The criterion results were actually closer to what a user would have seen when using the library.
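
A minimal sketch of the fix (illustrative; the trait name comes from the commit message further down, everything else is assumed): without #[inline], tiny trait-method implementations aren't eligible for cross-crate inlining, so callers in other crates pay a real function call per use.

    pub trait PackedCompareControl {
        fn needle(&self) -> u128;
        fn needle_len(&self) -> i32;
    }

    pub struct Bytes {
        needle: u128,
        len: i32,
    }

    impl PackedCompareControl for Bytes {
        #[inline] // makes the body available for inlining into other crates
        fn needle(&self) -> u128 {
            self.needle
        }
        #[inline]
        fn needle_len(&self) -> i32 {
            self.len
        }
    }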

@Dr-Emann (Collaborator, Author) commented

I've updated the benchmark results comment above with the new numbers.

@dralley commented Oct 15, 2023

I can confirm that the inlining improves the performance in practice.

R5 3600

[dalley@localhost quick-xml]$ critcmp jetscii jetscii-fixed --filter escape
group                                       jetscii                                jetscii-fixed
-----                                       -------                                -------------
escape_text/escaped_chars_long              1.34    401.6±2.72ns        ? ?/sec    1.00    299.7±1.00ns        ? ?/sec
escape_text/escaped_chars_short             1.12    317.9±3.96ns        ? ?/sec    1.00    284.1±3.72ns        ? ?/sec
escape_text/no_chars_to_escape_long         2.02    191.4±0.51ns        ? ?/sec    1.00     94.8±0.02ns        ? ?/sec
escape_text/no_chars_to_escape_short        1.25     10.5±0.04ns        ? ?/sec    1.00      8.4±0.07ns        ? ?/sec

i7-8665U

[dalley@thinkpad quick-xml]$ critcmp jetscii jetscii-fixed --filter escape
group                                       jetscii                                jetscii-fixed
-----                                       -------                                -------------
escape_text/escaped_chars_long              1.44   559.0±55.57ns        ? ?/sec    1.00   389.5±20.32ns        ? ?/sec
escape_text/escaped_chars_short             1.01   383.4±12.31ns        ? ?/sec    1.00    379.2±8.76ns        ? ?/sec
escape_text/no_chars_to_escape_long         1.92   246.0±16.76ns        ? ?/sec    1.00    127.8±1.74ns        ? ?/sec
escape_text/no_chars_to_escape_short        1.32     12.7±0.22ns        ? ?/sec    1.00      9.6±0.73ns        ? ?/sec

@shepmaster (Owner) commented

Ok, I think I have sorted out CI. I merged my branch and yours and added a little tweak to set the target, as you had it. That combined CI run was green.

I think that means that you should be able to rebase on top of my changes, make the same tweak, and then we will be good for another review!

Dr-Emann and others added 8 commits on October 17, 2023
Add a test that looks for the first item in a long haystack
The memmap crate is unmaintained, instead, use the maintained memmap2 crate
Structs don't need the bounds, only the implementations
Mostly just adding #[must_use]
This speeds up the criterion benchmarks by almost 2x

I believe this is needed because e.g. Bytes::find is inlined, and calls `find`
generically, which will call PackedCompareControl methods. So the code calling
the methods will be inlined into the calling crate, but the implementations of
PackedCompareControl are not accessible to the code in the calling crate,
so they will end up as actual function calls. However, these functions are
_super_ simple, and inlining them helps a LOT, so adding `#[inline]` to these
functions, and making their implementations available to calling crates, has a
huge effect.

This was only seen when moving to criterion because previously, the nightly
benchmarks were implemented in the library crate itself, and so these functions
were already eligible for inlining. Criterion results were actually more
accurate to what callers of the crate would actually see!
Per suggestion from @BurntSushi [here](tafia/quick-xml#664 (comment))

On my M1, it appears to be slower but competitive with memchr up to memchr3,
then starts being the fastest from 5-16 characters.
@Dr-Emann (Collaborator, Author) commented

I'm thinking of reverting the changes that make the constructor const, since we might eventually want to use e.g. memchr's memmem searcher, which wouldn't be const, and I'm not sure we want to commit to const constructors.

We may not want to be stuck with const-constructible implementations
@Dr-Emann (Collaborator, Author) commented

The teddy results are pretty promising: on my machine, it seems to beat jetscii's SIMD implementation in everything but the "found a result on the first byte" test. I wonder if we can't just find a cutover point and do if len < SMALL { fallback } else { teddy } (a rough sketch follows).
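
A hypothetical dispatch sketch (the threshold and helpers are made up and would need tuning against real benchmarks):

    const SMALL: usize = 32; // assumed cutover point, to be tuned

    fn find_first(haystack: &[u8], needles: &[u8]) -> Option<usize> {
        if haystack.len() < SMALL {
            // Naive scan: lowest latency on tiny haystacks.
            haystack.iter().position(|b| needles.contains(b))
        } else {
            // Stand-in for the Teddy searcher, which wins on throughput.
            teddy_find(haystack, needles)
        }
    }

    fn teddy_find(haystack: &[u8], needles: &[u8]) -> Option<usize> {
        // Placeholder: in practice this would call aho-corasick's packed searcher.
        haystack.iter().position(|b| needles.contains(b))
    }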

@BurntSushi commented

@Dr-Emann Can you share your benchmark results?

@Dr-Emann (Collaborator, Author) commented Oct 18, 2023

Sure:

Skylake i5

Including only ascii_chars, "teddy" and memchr

find_last_space/ascii_chars
                        time:   [1.2851 ms 1.4119 ms 1.5502 ms]
                        thrpt:  [3.1499 GiB/s 3.4582 GiB/s 3.7995 GiB/s]
find_last_space/teddy   time:   [648.84 µs 685.98 µs 727.01 µs]
                        thrpt:  [6.7163 GiB/s 7.1180 GiB/s 7.5254 GiB/s]
find_last_space/memchr  time:   [182.70 µs 185.98 µs 188.93 µs]
                        thrpt:  [25.845 GiB/s 26.254 GiB/s 26.725 GiB/s]

find_xml_3/ascii_chars  time:   [1.5516 ms 1.6435 ms 1.7471 ms]
                        thrpt:  [2.7948 GiB/s 2.9709 GiB/s 3.1469 GiB/s]
find_xml_3/teddy        time:   [627.14 µs 645.32 µs 663.75 µs]
                        thrpt:  [7.3564 GiB/s 7.5665 GiB/s 7.7858 GiB/s]
find_xml_3/memchr       time:   [492.12 µs 513.30 µs 537.36 µs]
                        thrpt:  [9.0866 GiB/s 9.5126 GiB/s 9.9220 GiB/s]

find_xml_5/ascii_chars  time:   [1.0851 ms 1.1917 ms 1.3254 ms]
                        thrpt:  [3.6841 GiB/s 4.0975 GiB/s 4.4997 GiB/s]
find_xml_5/teddy        time:   [268.58 µs 271.36 µs 274.06 µs]
                        thrpt:  [17.817 GiB/s 17.994 GiB/s 18.180 GiB/s]
find_xml_5/memchr       time:   [735.38 µs 794.74 µs 862.71 µs]
                        thrpt:  [5.6598 GiB/s 6.1439 GiB/s 6.6399 GiB/s]

find_big_16/ascii_chars time:   [1.3726 ms 1.4775 ms 1.6049 ms]
                        thrpt:  [3.0425 GiB/s 3.3048 GiB/s 3.5573 GiB/s]
find_big_16/teddy       time:   [272.21 µs 275.01 µs 278.07 µs]
                        thrpt:  [17.560 GiB/s 17.755 GiB/s 17.937 GiB/s]
find_big_16/memchr      time:   [2.6971 ms 2.8265 ms 2.9630 ms]
                        thrpt:  [1.6479 GiB/s 1.7275 GiB/s 1.8104 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [16.542 ns 17.138 ns 17.735 ns]
                        thrpt:  [53.773 MiB/s 55.647 MiB/s 57.652 MiB/s]
find_big_16_early_return/teddy
                        time:   [83.147 ns 85.570 ns 87.869 ns]
                        thrpt:  [10.853 MiB/s 11.145 MiB/s 11.470 MiB/s]
find_big_16_early_return/memchr
                        time:   [3.0589 ms 3.1347 ms 3.2110 ms]
                        thrpt:  [311.43   B/s 319.01   B/s 326.92   B/s]

Somehow teddy got a lot faster going from 3 patterns to 5?

M1 Mac
find_last_space/ascii_chars
                        time:   [4.8982 ms 4.9039 ms 4.9105 ms]
                        thrpt:  [1018.2 MiB/s 1019.6 MiB/s 1020.8 MiB/s]
find_last_space/teddy   time:   [213.33 µs 218.76 µs 225.05 µs]
                        thrpt:  [21.696 GiB/s 22.320 GiB/s 22.888 GiB/s]
find_last_space/memchr  time:   [62.353 µs 62.567 µs 62.814 µs]
                        thrpt:  [77.735 GiB/s 78.041 GiB/s 78.309 GiB/s]

find_xml_3/ascii_chars  time:   [5.1492 ms 5.1767 ms 5.2203 ms]
                        thrpt:  [957.80 MiB/s 965.86 MiB/s 971.02 MiB/s]
find_xml_3/teddy        time:   [214.30 µs 214.78 µs 215.23 µs]
                        thrpt:  [22.687 GiB/s 22.734 GiB/s 22.785 GiB/s]
find_xml_3/memchr       time:   [156.83 µs 157.77 µs 158.84 µs]
                        thrpt:  [30.741 GiB/s 30.949 GiB/s 31.134 GiB/s]

find_xml_5/ascii_chars  time:   [4.8970 ms 4.9074 ms 4.9205 ms]
                        thrpt:  [1016.2 MiB/s 1018.9 MiB/s 1021.0 MiB/s]
find_xml_5/teddy        time:   [205.09 µs 205.62 µs 206.19 µs]
                        thrpt:  [23.681 GiB/s 23.747 GiB/s 23.808 GiB/s]
find_xml_5/memchr       time:   [267.96 µs 268.15 µs 268.38 µs]
                        thrpt:  [18.194 GiB/s 18.209 GiB/s 18.222 GiB/s]

find_big_16/ascii_chars time:   [4.8841 ms 4.8865 ms 4.8892 ms]
                        thrpt:  [1022.7 MiB/s 1023.2 MiB/s 1023.7 MiB/s]
find_big_16/teddy       time:   [204.06 µs 204.21 µs 204.41 µs]
                        thrpt:  [23.888 GiB/s 23.911 GiB/s 23.928 GiB/s]
find_big_16/memchr      time:   [648.66 µs 648.89 µs 649.21 µs]
                        thrpt:  [7.5212 GiB/s 7.5248 GiB/s 7.5276 GiB/s]

find_big_16_early_return/ascii_chars
                        time:   [940.90 ps 942.13 ps 943.39 ps]
                        thrpt:  [1010.9 MiB/s 1012.3 MiB/s 1013.6 MiB/s]
find_big_16_early_return/teddy
                        time:   [9.7040 ns 9.7161 ns 9.7293 ns]
                        thrpt:  [98.021 MiB/s 98.154 MiB/s 98.277 MiB/s]
find_big_16_early_return/memchr
                        time:   [590.33 µs 591.29 µs 592.35 µs]
                        thrpt:  [1.6486 KiB/s 1.6516 KiB/s 1.6543 KiB/s]

No big jump in teddy performance here, but it still becomes the fastest somewhere between 3 and 5 characters, for this case at least.

@BurntSushi commented

> Somehow teddy got a lot faster going from 3 patterns to 5?

Interesting. If I get a chance tomorrow, I'll port your benchmark into aho-corasick's rebar benchmark suite and see if I can do some analysis for you.

BurntSushi added a commit to BurntSushi/aho-corasick that referenced this pull request Oct 20, 2023
There was some discussion about how to compare jetscii with Teddy and
some interesting benchmark results[1]. I decided to import the
benchmarks and see what things look like here.

[1]: shepmaster/jetscii#57

@BurntSushi commented

All righty, here we go.

I started by checking out this PR and running the benchmarks as written on my i9-12900K x86-64 CPU and my M2 mac mini aarch64 CPU. This should give us a comparison point with which to ground ourselves. On x86-64 (i9-12900K):

$ critcmp base -g '([^/]+/)(?:memchr|ascii|teddy).*' -f 'find_(big_16|last_space|xml)'
group                        base/ascii_chars                       base/memchr                                base/teddy
-----                        ----------------                       -----------                                ----------
find_last_space/             6.59   390.4±10.64µs    12.5 GB/sec    1.00     59.3±0.85µs    82.4 GB/sec        1.44     85.4±1.60µs    57.2 GB/sec
find_xml_3/                  5.17    381.6±5.27µs    12.8 GB/sec    1.00     73.8±1.60µs    66.1 GB/sec        1.08     79.6±0.82µs    61.4 GB/sec
find_xml_5/                  4.88    386.6±6.72µs    12.6 GB/sec    1.84    145.9±8.94µs    33.5 GB/sec        1.00     79.2±0.84µs    61.7 GB/sec
find_big_16/                 4.84    382.9±6.51µs    12.8 GB/sec    5.27    416.5±4.98µs    11.7 GB/sec        1.00     79.1±0.67µs    61.8 GB/sec
find_big_16_early_return/    1.00      2.6±0.11ns   369.9 MB/sec    141566.54   365.0±8.07µs     2.7 KB/sec    4.94     12.7±0.06ns    74.9 MB/sec

And on aarch64 (M2 mac mini):

$ critcmp base -g '([^/]+/)(?:memchr|ascii|teddy).*' -f 'find_(big_16|last_space|xml)'
group                        base/ascii_chars                       base/memchr                                base/teddy
-----                        ----------------                       -----------                                ----------
find_last_space/             77.06     4.5±0.00ms  1112.1 MB/sec    1.00     58.3±0.17µs    83.7 GB/sec        3.21    187.5±0.54µs    26.0 GB/sec
find_xml_3/                  31.93     4.5±0.00ms  1112.1 MB/sec    1.00    140.8±0.39µs    34.7 GB/sec        1.33    187.5±0.72µs    26.0 GB/sec
find_xml_5/                  23.98     4.5±0.00ms  1112.2 MB/sec    1.32    246.5±0.58µs    19.8 GB/sec        1.00    187.4±0.56µs    26.0 GB/sec
find_big_16/                 23.98     4.5±0.02ms  1111.7 MB/sec    3.19    597.5±1.59µs     8.2 GB/sec        1.00    187.6±0.89µs    26.0 GB/sec
find_big_16_early_return/    1.00      0.9±0.01ns  1052.6 MB/sec    598135.90   541.9±1.84µs     1845 B/sec    7.79      7.1±0.02ns   135.1 MB/sec

I then ported these benchmarks into aho-corasick's rebar benchmark suite. I didn't bother with porting the memchr benchmarks, since it's a little tricky to write that generically. From the root of aho-corasick's repository:

$ rebar build
$ rebar measure -f '^jetscii/' -e rust/aho-corasick/packed -e jetscii -t # for testing
$ rebar measure -f '^jetscii/' -e rust/aho-corasick/packed -e jetscii | tee tmp/results.csv
$ rebar cmp tmp/results.csv -f repeateda
benchmark                          rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                          ---------------------------------------  ---------------------------------
jetscii/space-repeateda            53.9 GB/s (1.00x)                        12.9 GB/s (4.17x)
jetscii/xmldelim3-repeateda        56.9 GB/s (1.00x)                        12.9 GB/s (4.40x)
jetscii/xmldelim5-repeateda        58.6 GB/s (1.00x)                        12.9 GB/s (4.54x)
jetscii/big16-repeateda            59.0 GB/s (1.00x)                        12.9 GB/s (4.57x)
jetscii/big16earlyshort-repeateda  127.2 MB/s (1.00x)                       73.4 MB/s (1.73x)
jetscii/big16earlylong-repeateda   529.8 MB/s (1.42x)                       752.9 MB/s (1.00x)

And on aarch64 (M2 mac mini):

$ rebar cmp tmp/results.csv -f repeateda
benchmark                          rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                          ---------------------------------------  ---------------------------------
jetscii/space-repeateda            26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/xmldelim3-repeateda        26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/xmldelim5-repeateda        26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/big16-repeateda            26.1 GB/s (1.00x)                        1666.7 MB/s (16.03x)
jetscii/big16earlyshort-repeateda  1907.3 MB/s (1.00x)                      1907.3 MB/s (1.00x)
jetscii/big16earlylong-repeateda   14.0 GB/s (1.00x)                        14.0 GB/s (1.00x)

So the first thing that jumps out at me is that Teddy has pretty consistent timings for space, xmldelim3, xmldelim5 and big16 in both rebar and Criterion. In other words, I'm not able to reproduce your dip in find_xml_3/xmldelim3.

The other interesting thing is the discrepancy in timings for the "early return" benchmark in Criterion versus rebar. I did have to tweak them somewhat, since the benchmarks I have look for all counts instead of just the first one. (rebar doesn't require this, and I could make it only do one search, but it didn't seem worth it.) So I split it into two: one benchmark whose haystack is Pa and another whose haystack is format!("P{}", "a".repeat(14)). In the former case, Teddy is a little faster and in the latter case, jetscii is a little faster. This is definitely a "latency sensitive" benchmark where the timings are essentially a reflection of the overhead of a search call.

It is indeed somewhat common to try to optimize for the latency case by doing case analysis on the length of the haystack. memchr does this for its memmem search routine, for example. The catch is that as you add case analysis, the overhead of your search routine, and thus performance on latency sensitive workloads, tends to get worse. So in effect, there is a balancing point one might want to achieve. Ideally, Teddy would just do this automatically, and it does kind of try. Namely, for very short haystacks, it uses Rabin-Karp. I haven't spent a ton of time optimizing that case though.

One of the unfortunate things about latency sensitive workloads is that you are somewhat hosed already. You tend to burn a lot of time starting and stopping the search routine very frequently. This is why if you take just about any optimized search routine with high throughput and execute a search that has a high match count, that throughput will drop precipitously and you might not do much better (perhaps even worse) than the naive approach. The naive approach tends to do very well in latency sensitive workloads.

I don't have any good guesses as to the reason for the discrepancy in the "early return" benchmarks between Criterion and rebar. It's worth pointing out that we're dealing with low-single-digit nanosecond timings here, so even a small amount of noise could explain this discrepancy.

Another thing I did here was take a step back and look at the benchmark itself. At least the XML delimiter benchmarks look like they ought to be searching XML data. Instead, the benchmarks (except for the new "early return" one you added) are essentially all best cases for throughput: the haystack consists of the same byte repeated with no match, and finally followed by a single matching byte at the end. But what happens if we search a real XML document?

I picked a fairly small XML document describing some mental health stuff and ran the same benchmarks on it for x86-64:

$ rebar cmp tmp/results.csv -f mentalhealth
benchmark                       rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                       ---------------------------------------  ---------------------------------
jetscii/space-mentalhealth      1201.2 MB/s (1.00x)                      1206.6 MB/s (1.00x)
jetscii/xmldelim3-mentalhealth  2.3 GB/s (1.00x)                         2.1 GB/s (1.07x)
jetscii/xmldelim5-mentalhealth  1928.2 MB/s (1.02x)                      1960.7 MB/s (1.00x)
jetscii/big16-mentalhealth      5.2 GB/s (1.00x)                         4.5 GB/s (1.16x)

And aarch64 (M2 mac mini):

$ rebar cmp tmp/results.csv -f mentalhealth
benchmark                       rust/aho-corasick/packed/leftmost-first  rust/jetscii/ascii-chars/prebuilt
---------                       ---------------------------------------  ---------------------------------
jetscii/space-mentalhealth      728.9 MB/s (1.55x)                       1129.7 MB/s (1.00x)
jetscii/xmldelim3-mentalhealth  1425.9 MB/s (1.01x)                      1446.2 MB/s (1.00x)
jetscii/xmldelim5-mentalhealth  1298.8 MB/s (1.11x)                      1436.0 MB/s (1.00x)
jetscii/big16-mentalhealth      3.5 GB/s (1.00x)                         1453.9 MB/s (2.49x)

Things here are quite a bit more competitive. My guess as to why this is, is because the benchmarks shift more towards latency sensitive workloads versus the pure throughput benchmarks in this PR. Essentially, as you move more and more towards latency sensitive workloads, differences between search routines tend to shrink, especially when comparing something that has very high throughput (Teddy) versus something that is less so (jetscii). Indeed, as I understand it, jetscii is based on the pcmpestri intrinsic from SSE4.2, and it's somewhat of a tortured technique because of its (documented) extremely high latency (18 cycles as written in the pcmpestri docs). This is why you won't find it used in Hyperscan at all. Here, it does look decentish for looking for a small set of bytes, but it is absolutely terrible for substring search. From the memchr repo (which already had jetscii in its rebar benchmarks) on x86-64:

$ rebar cmp benchmarks/record/x86_64/2023-08-25.csv -e jetscii/memmem/prebuilt -e rust/memchr/memmem/prebuilt --intersection
benchmark                                                   rust/jetscii/memmem/prebuilt  rust/memchr/memmem/prebuilt
---------                                                   ----------------------------  ---------------------------
memmem/byterank/binary                                      2.9 GB/s (1.50x)              4.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  6.9 GB/s (7.78x)              53.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren            6.8 GB/s (7.87x)              53.9 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      6.9 GB/s (8.22x)              56.4 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str                   7.0 GB/s (7.60x)              53.3 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty                 6.9 GB/s (7.66x)              52.6 GB/s (1.00x)
memmem/code/rust-library-common-fn                          5.7 GB/s (5.00x)              28.4 GB/s (1.00x)
memmem/code/rust-library-common-paren                       2.3 GB/s (2.12x)              4.8 GB/s (1.00x)
memmem/code/rust-library-common-let                         4.6 GB/s (4.30x)              19.8 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        5.8 GB/s (7.15x)              41.6 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash                      6.1 GB/s (8.07x)              49.3 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky               7.7 GB/s (8.15x)              63.2 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match                1351.7 MB/s (1.34x)           1811.6 MB/s (1.00x)
memmem/pathological/rare-repeated-small-tricky              7.2 GB/s (3.61x)              25.9 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match               1305.9 MB/s (1.40x)           1821.8 MB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           4.9 GB/s (1.00x)              4.1 GB/s (1.22x)
memmem/pathological/defeat-simple-vector-freq-alphabet      7.8 GB/s (2.48x)              19.2 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  137.3 MB/s (9.03x)            1239.8 MB/s (1.00x)
memmem/subtitles/common/huge-en-that                        4.6 GB/s (8.18x)              37.4 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         3.5 GB/s (4.42x)              15.2 GB/s (1.00x)
memmem/subtitles/common/huge-en-one-space                   590.8 MB/s (2.31x)            1365.2 MB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        3.7 GB/s (10.09x)             37.2 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         1859.8 MB/s (8.28x)           15.0 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space                   954.4 MB/s (2.80x)            2.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-that                        5.9 GB/s (6.58x)              38.6 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      3.6 GB/s (5.36x)              19.4 GB/s (1.00x)
memmem/subtitles/common/huge-zh-one-space                   1887.2 MB/s (2.54x)           4.7 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  7.7 GB/s (6.64x)              51.1 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes             6.3 GB/s (8.44x)              52.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes              7.7 GB/s (8.27x)              63.5 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space                    4.0 GB/s (15.79x)             63.6 GB/s (1.00x)
memmem/subtitles/never/teeny-en-john-watson                 1161.0 MB/s (1.53x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes            1161.0 MB/s (1.53x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes             1112.6 MB/s (1.60x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   1161.0 MB/s (1.53x)           1780.2 MB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  2.6 GB/s (24.44x)             63.7 GB/s (1.00x)
memmem/subtitles/never/teeny-ru-john-watson                 1602.2 MB/s (1.56x)           2.4 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  6.6 GB/s (9.10x)              60.1 GB/s (1.00x)
memmem/subtitles/never/teeny-zh-john-watson                 1285.4 MB/s (1.53x)           1970.9 MB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               7.5 GB/s (8.26x)              62.4 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock                      7.5 GB/s (8.04x)              60.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle                 6.3 GB/s (8.83x)              55.2 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle                   7.0 GB/s (6.38x)              44.5 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle                   7.6 GB/s (6.11x)              46.4 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              953.7 MB/s (1.65x)            1570.8 MB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock                     920.8 MB/s (1.38x)            1271.6 MB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               2.6 GB/s (24.19x)             63.4 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock                      2.6 GB/s (23.83x)             61.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              1381.2 MB/s (1.53x)           2.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock                     1381.2 MB/s (1.16x)           1602.2 MB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               5.6 GB/s (9.93x)              55.6 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock                      5.6 GB/s (6.97x)              38.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              1055.9 MB/s (1.00x)           1055.9 MB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     1055.9 MB/s (1.08x)           1137.1 MB/s (1.00x)

The only case that memchr::memmem does worse on is a pathological benchmark that I specifically constructed to defeat its heuristics. And even then, it does decently compared to pcmpestri. Same deal on aarch64 (which doesn't have anything like pcmpestri and thus I believe forces jetscii into naive substring search):


$ rebar cmp benchmarks/record/aarch64/2023-08-27.csv -e jetscii/memmem/prebuilt -e rust/memchr/memmem/prebuilt --intersection
benchmark                                                   rust/jetscii/memmem/prebuilt  rust/memchr/memmem/prebuilt
---------                                                   ----------------------------  ---------------------------
memmem/byterank/binary                                      335.2 MB/s (9.55x)            3.1 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength                  401.0 MB/s (77.75x)           30.4 GB/s (1.00x)
memmem/code/rust-library-never-fn-strength-paren            401.0 MB/s (76.23x)           29.9 GB/s (1.00x)
memmem/code/rust-library-never-fn-quux                      401.0 MB/s (77.24x)           30.2 GB/s (1.00x)
memmem/code/rust-library-rare-fn-from-str                   633.8 MB/s (47.66x)           29.5 GB/s (1.00x)
memmem/code/rust-library-common-fn-is-empty                 401.0 MB/s (75.50x)           29.6 GB/s (1.00x)
memmem/code/rust-library-common-fn                          401.0 MB/s (47.33x)           18.5 GB/s (1.00x)
memmem/code/rust-library-common-paren                       393.9 MB/s (8.18x)            3.1 GB/s (1.00x)
memmem/code/rust-library-common-let                         386.2 MB/s (34.63x)           13.1 GB/s (1.00x)
memmem/pathological/md5-huge-no-hash                        662.1 MB/s (39.62x)           25.6 GB/s (1.00x)
memmem/pathological/md5-huge-last-hash                      644.1 MB/s (40.73x)           25.6 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-tricky               418.4 MB/s (76.20x)           31.1 GB/s (1.00x)
memmem/pathological/rare-repeated-huge-match                1450.4 MB/s (1.37x)           1980.7 MB/s (1.00x)
memmem/pathological/rare-repeated-small-tricky              416.9 MB/s (54.52x)           22.2 GB/s (1.00x)
memmem/pathological/rare-repeated-small-match               1431.2 MB/s (1.33x)           1909.3 MB/s (1.00x)
memmem/pathological/defeat-simple-vector-alphabet           359.3 MB/s (8.59x)            3.0 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-freq-alphabet      660.3 MB/s (23.52x)           15.2 GB/s (1.00x)
memmem/pathological/defeat-simple-vector-repeated-alphabet  173.8 MB/s (4.80x)            835.1 MB/s (1.00x)
memmem/subtitles/common/huge-en-that                        356.7 MB/s (45.56x)           15.9 GB/s (1.00x)
memmem/subtitles/common/huge-en-you                         392.6 MB/s (20.99x)           8.0 GB/s (1.00x)
memmem/subtitles/common/huge-en-one-space                   273.3 MB/s (2.63x)            717.6 MB/s (1.00x)
memmem/subtitles/common/huge-ru-that                        317.9 MB/s (60.57x)           18.8 GB/s (1.00x)
memmem/subtitles/common/huge-ru-not                         245.8 MB/s (41.39x)           9.9 GB/s (1.00x)
memmem/subtitles/common/huge-ru-one-space                   332.4 MB/s (2.96x)            984.3 MB/s (1.00x)
memmem/subtitles/common/huge-zh-that                        403.5 MB/s (49.09x)           19.3 GB/s (1.00x)
memmem/subtitles/common/huge-zh-do-not                      358.9 MB/s (31.78x)           11.1 GB/s (1.00x)
memmem/subtitles/common/huge-zh-one-space                   382.4 MB/s (6.90x)            2.6 GB/s (1.00x)
memmem/subtitles/never/huge-en-john-watson                  417.8 MB/s (75.68x)           30.9 GB/s (1.00x)
memmem/subtitles/never/huge-en-all-common-bytes             382.3 MB/s (61.10x)           22.8 GB/s (1.00x)
memmem/subtitles/never/huge-en-some-rare-bytes              417.8 MB/s (75.68x)           30.9 GB/s (1.00x)
memmem/subtitles/never/huge-en-two-space                    304.7 MB/s (113.48x)          33.8 GB/s (1.00x)
memmem/subtitles/never/teeny-en-john-watson                 635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/teeny-en-all-common-bytes            635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/teeny-en-some-rare-bytes             635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/teeny-en-two-space                   321.7 MB/s (83.00x)           26.1 GB/s (1.00x)
memmem/subtitles/never/huge-ru-john-watson                  644.9 MB/s (48.17x)           30.3 GB/s (1.00x)
memmem/subtitles/never/teeny-ru-john-watson                 953.7 MB/s (42.00x)           39.1 GB/s (1.00x)
memmem/subtitles/never/huge-zh-john-watson                  390.0 MB/s (76.61x)           29.2 GB/s (1.00x)
memmem/subtitles/never/teeny-zh-john-watson                 703.9 MB/s (42.00x)           28.9 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock-holmes               414.8 MB/s (74.88x)           30.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-sherlock                      414.8 MB/s (75.52x)           30.6 GB/s (1.00x)
memmem/subtitles/rare/huge-en-medium-needle                 637.5 MB/s (45.49x)           28.3 GB/s (1.00x)
memmem/subtitles/rare/huge-en-long-needle                   651.7 MB/s (51.53x)           32.8 GB/s (1.00x)
memmem/subtitles/rare/huge-en-huge-needle                   636.1 MB/s (52.91x)           32.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock-holmes              635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-en-sherlock                     635.8 MB/s (42.00x)           26.1 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock-holmes               644.9 MB/s (48.17x)           30.3 GB/s (1.00x)
memmem/subtitles/rare/huge-ru-sherlock                      254.3 MB/s (121.56x)          30.2 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock-holmes              953.7 MB/s (42.00x)           39.1 GB/s (1.00x)
memmem/subtitles/rare/teeny-ru-sherlock                     953.7 MB/s (42.00x)           39.1 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock-holmes               640.8 MB/s (46.13x)           28.9 GB/s (1.00x)
memmem/subtitles/rare/huge-zh-sherlock                      358.9 MB/s (84.32x)           29.6 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock-holmes              721.1 MB/s (41.00x)           28.9 GB/s (1.00x)
memmem/subtitles/rare/teeny-zh-sherlock                     703.9 MB/s (42.00x)           28.9 GB/s (1.00x)

So at least with respect to substring search, it's hard to imagine any general circumstance in which you'd want to use jetscii over memchr::memmem. For searching for sets of bytes greater than 3, the benchmark results are a bit less clear with some minor open questions. I would say that Teddy's throughput appears to be definitively better than jetscii though. But if you're searching XML for delimiters, then you're in "latency sensitive" territory probably due to how frequent the delimiters are likely to occur. (Unless you commonly search XML documents with huge data relative to the markup.) In that case, my suspicion is that your best bet is to write a bespoke SIMD algorithm. If you don't want to go down that path, then I would pick a sample of representative XML documents and bake-off Teddy versus jetscii versus memchr versus something more naive.

@shepmaster (Owner) commented

Thanks @BurntSushi, that is an awesome writeup! I'll try to respond to points that caught my eye or I think I can be useful on...

> find_big_16_early_return

The differences of 369.9 MB/sec vs 2.7 KB/sec vs 74.9 MB/sec really feel like they have to be some kind of testing error, just based on how huge they are. 😉

> At least the XML delimiter benchmarks look like they ought to be searching XML data

So my pet project since Rust 0.12 has been to write an XML parser and related machinery. Sometime in the last 15 years, I read a post (I want to say from Daniel Lemire, but I can't find it now) that talked about using PCMPSTRx for XML (and Wikipedia backs up my memory that those were even made to help XML). That's what spawned Jetscii (and cupid, peresil). My goal is all about making it go faster.

Amusingly, in my current rewrite, I use memchr[3]. On the XML file you posted, I parsed 1058794 tokens and wrote them back out to /dev/null in 78.7ms (roughly 90 MB/sec). My go-to testing files are Wikipedia dumps, and I have a 218M example that does 33428805 tokens in 1.7s (roughly 130 MB/sec). xmllint (which isn't a one-to-one comparison) takes 2.1s.

> but it is absolutely terrible for substring search.

Yep, this was never really a goal for the project; it was mostly a matter of someone (maybe me?) saying "oh, you could use it for both cases!", and I added the code.

> on aarch64 (which doesn't have anything like pcmpestri

Right, which is part of the reason I went with memchr in my current rewrite — my main development is on an M1 now, so any actual speed benefit from Jetscii won't help me anymore. Also, my brain doesn't really think in SIMD intrinsics, so I can't look at what aarch64 has available and construct something at all similar.

> hard to imagine any general circumstance in which you'd want to use jetscii over memchr::memmem.

Totally makes sense to me.

> your best bet is to write a bespoke SIMD algorithm

I've had some back-of-mind idea to perform some sort of en-masse lookup table... thing.

Specifically, in my new implementation, I have a fixed-size buffer. One thing I could see doing is searching the entire buffer for every & / < / > / etc. and then stuffing those results into a bitmask. I could then do some bit twiddling to quickly find the next relevant special character (a rough sketch of the idea follows). I haven't done any benchmarking of this, nor do I have a great idea for an x86_64 and aarch64 SIMD way to find all those bytes in the first place. I'm certainly not hoping that the really smart people reading this get nerd-sniped into helping me solve that problem 😇.
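
A scalar sketch of the bitmask idea (just the shape, no SIMD; a real version would compute the mask with vector compares):

    // Classify up to 64 bytes into a u64 where bit i is set when byte i
    // is one of the XML special characters.
    fn special_mask(chunk: &[u8]) -> u64 {
        let mut mask = 0u64;
        for (i, &b) in chunk.iter().take(64).enumerate() {
            if matches!(b, b'&' | b'<' | b'>' | b'\'' | b'"') {
                mask |= 1u64 << i;
            }
        }
        mask
    }

    // Iterate the positions of the set bits, lowest first.
    fn positions(mut mask: u64) -> impl Iterator<Item = u32> {
        std::iter::from_fn(move || {
            if mask == 0 {
                return None;
            }
            let i = mask.trailing_zeros();
            mask &= mask - 1; // clear the lowest set bit
            Some(i)
        })
    }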

@shepmaster (Owner) commented

@Dr-Emann What do you see as our next steps forward?

@Dr-Emann (Collaborator, Author) commented

I think this PR is still a good first step: just some modernization and benchmarking updates.

I think the next step would be replacing the substring search with memchr::memmem, or maybe deprecating it in favor of memmem entirely.

I think what I would like to do next would be a breaking change: change the fallback implementation to fn(&[u8]) -> Option<usize> rather than fn(u8) -> bool, to avoid the dynamic call overhead for every character in the fallback case, which would at least make things a little more competitive on aarch64 (see the sketch below). I'd also like to select the algorithm at construction time rather than at search time, and maybe try adding a special case for 1, 2, and 3 characters to use memchr.
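
A rough sketch of that direction (all names here are hypothetical; the exact API is undecided):

    use memchr::memmem;

    // Substring search: delegate to memchr::memmem entirely.
    fn find_substring(haystack: &[u8], needle: &[u8]) -> Option<usize> {
        memmem::find(haystack, needle)
    }

    // Fallback signature change: one dynamic call per search, not per byte.
    // before: fallback(u8) -> bool, invoked for every byte of the haystack
    // after:  fallback(&[u8]) -> Option<usize>, invoked once
    fn search_with_fallback(
        haystack: &[u8],
        fallback: fn(&[u8]) -> Option<usize>,
    ) -> Option<usize> {
        fallback(haystack)
    }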

I don't really have any experience at all in writing SIMD algorithms (yet?), so I don't have much input on the bespoke SIMD algorithm part.
