
SIMD vectorization behind a feature flag #120

Open
dtolnay opened this issue Jul 18, 2016 · 17 comments

Comments

@dtolnay
Member

dtolnay commented Jul 18, 2016

RapidJSON does quite a lot of this.

@maciejhirsz

It would be interesting to see hermetic tests done for that vs iteration with a LUT. You need to:

  • check how many vectors you can use (len >> 4, not a big deal)
  • vectorize the bytes - memcpy should be fast
  • do 4? instructions (check for backslash, quote, control characters with the exception of whitespace)
  • flatten the boolean vector to a single value.

And then you still need to do something about the remaining len % 16 bytes.
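The chunk bookkeeping in the steps above can be sketched like this (the helper name is mine, purely illustrative — the real parser tracks an index instead):

```rust
/// Split a remaining byte count into full 16-byte vectors plus the
/// scalar tail that still has to be checked byte-by-byte.
fn split_chunks(len: usize) -> (usize, usize) {
    (len >> 4, len & 15)
}
```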

@maciejhirsz

I made a very rough test in json-rust. Using this as test data:

{"foo":"This is a super long string, omg so long, why is it so long? It is so long because it needs to contain many multiplications of 16 bytes in order for us to get any benefits from SIMD at all! That's why! It's kind of annoying like that, but hey, what can you do..."}

Without SIMD:

test json_rust_parse ... bench:         414 ns/iter (+/- 143) = 657 MB/s

With SIMD:

test json_rust_parse ... bench:         267 ns/iter (+/- 28) = 1018 MB/s

This is pretty much the best-case scenario for this kind of thing, but it works. The implementation is pretty simple: before I start running the loop checking bytes one by one, I do this:

        for _ in 0 .. ($parser.length - $parser.index) >> 4 {
            let bytes = simd::u8x16::load($parser.source.as_bytes(), $parser.index);

            if (bytes.lt(CT_SIMD) | bytes.eq(BS_SIMD) | bytes.eq(QU_SIMD)).any() {
                break;
            }

            $parser.index += 16;
        }

Where CT_SIMD is a splat of 0x20, BS_SIMD is a splat of 0x5C and QU_SIMD is a splat of 0x22 (as you'd expect).
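For reference, here is a sketch of the same check written against stable `std::arch` SSE2 intrinsics instead of the old `simd` crate (the function name and the scalar fallback are mine, not json-rust code):

```rust
/// Advance `index` past 16-byte chunks containing no quote, backslash,
/// or control character. Sketch only; assumes `index <= src.len()`.
#[cfg(target_arch = "x86_64")]
fn skip_plain_chunks(src: &[u8], mut index: usize) -> usize {
    use std::arch::x86_64::*;
    // SSE2 is baseline on x86_64, so no runtime detection is needed here.
    unsafe {
        let quote = _mm_set1_epi8(0x22);     // '"'
        let backslash = _mm_set1_epi8(0x5C); // '\'
        let space = _mm_set1_epi8(0x20);     // control-character threshold
        while src.len() - index >= 16 {
            let bytes = _mm_loadu_si128(src.as_ptr().add(index) as *const __m128i);
            // _mm_cmplt_epi8 is a *signed* compare, so bytes >= 0x80 also
            // match "< 0x20". That is merely conservative: we fall back to
            // the scalar loop early but never skip a special byte.
            let special = _mm_or_si128(
                _mm_or_si128(
                    _mm_cmpeq_epi8(bytes, quote),
                    _mm_cmpeq_epi8(bytes, backslash),
                ),
                _mm_cmplt_epi8(bytes, space),
            );
            if _mm_movemask_epi8(special) != 0 {
                break;
            }
            index += 16;
        }
    }
    index
}

/// Portable scalar stand-in for other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn skip_plain_chunks(src: &[u8], mut index: usize) -> usize {
    while src.len() - index >= 16
        && src[index..index + 16]
            .iter()
            .all(|&b| b != b'"' && b != b'\\' && b >= 0x20)
    {
        index += 16;
    }
    index
}
```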

Gonna run this against json-benchmark, will come back with numbers.

@maciejhirsz

maciejhirsz commented Jul 19, 2016

On json-benchmark the numbers are pretty much the same on all tests. The version with SIMD is on dev branch if you want to try on your hardware. I only did it for parsing, not serializing.

@dtolnay
Member Author

dtolnay commented Jul 19, 2016

Same here, even a little slower. It may be that unaligned u8x16 loads are slow - you don't have this part that establishes 16-byte alignment. That would also serve to mitigate the impact on parsing short strings.

I would be interested in benchmarking two variants of your code - one that aligns to 16 bytes before the SIMD loop like in RapidJSON, and one that goes 16 bytes past the first 16-byte alignment before starting SIMD. It is all about balancing no-SIMD for short strings (which are most strings) and SIMD for long strings (which are slow otherwise).

@maciejhirsz

I think the simd lib does the alignment via the repr annotation:

/// A SIMD vector of 16 `u8`s.
#[repr(simd)]
#[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
#[derive(Debug, Copy)]
pub struct u8x16(u8, u8, u8, u8, u8, u8, u8, u8,
                 u8, u8, u8, u8, u8, u8, u8, u8);

I've only been catching up with low-level stuff this year, so things like byte alignment are still semi-exotic to me. I just looked this up, and the author of that answer is the author of the simd implementation.

@dtolnay
Member Author

dtolnay commented Jul 19, 2016

I am talking about alignment of $parser.source[$parser.index]. If that is not 16-byte aligned, there is nothing the SIMD library can do about it.

@maciejhirsz

Gotcha! I assume casting the pointer to usize, computing how many bytes remain to the next 16-byte boundary, and running over that many bytes with the regular check should be sufficient.

@dtolnay
Member Author

dtolnay commented Jul 19, 2016

Yes.

And then for the other test, that many bytes + 16 more.
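The two scalar-prologue lengths being discussed can be sketched as (illustrative helper, not serde code):

```rust
/// How many bytes to scan byte-by-byte before entering the SIMD loop:
/// either just enough to reach the next 16-byte boundary, or that plus
/// one extra 16-byte chunk (the second benchmark variant above).
fn prologue_len(ptr: *const u8, skip_one_extra_chunk: bool) -> usize {
    let to_align = (16 - (ptr as usize % 16)) % 16;
    if skip_one_extra_chunk { to_align + 16 } else { to_align }
}
```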

@maciejhirsz

I don't seem to get any statistically significant difference. I increased the string length to 3155 characters.

No SIMD:

test json_rust_parse ... bench:       2,592 ns/iter (+/- 1,071) = 1221 MB/s

SIMD, no alignment:

test json_rust_parse ... bench:         683 ns/iter (+/- 481) = 4633 MB/s

SIMD, with alignment:

test json_rust_parse ... bench:         754 ns/iter (+/- 502) = 4197 MB/s

SIMD, with alignment offset by 1:

test json_rust_parse ... bench:         755 ns/iter (+/- 358) = 4192 MB/s

Pushed to dev along with quickly hacked benchmarks if you want to play around.

@maciejhirsz

OK, got it: it seems u8x16::load doesn't actually use memcpy. Reworked it into this mess:

let mut bytes: simd::u8x16 = mem::uninitialized();
let bytes_ptr: *mut u8 = mem::transmute(&mut bytes);
ptr::copy_nonoverlapping(
    $parser.byte_ptr.offset($parser.index as isize),
    bytes_ptr,
    16
);

Now I get the expected results. With alignment:

test json_rust_parse ... bench:         625 ns/iter (+/- 321) = 5064 MB/s

With alignment offset 1:

test json_rust_parse ... bench:         651 ns/iter (+/- 367) = 4861 MB/s

Difference isn't staggering, but it's there.
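For what it's worth, on modern Rust the same unaligned load can be written safely, without `mem::uninitialized`; optimizers typically lower a fixed-size copy like this to a single unaligned 16-byte load (the helper name is mine):

```rust
/// Safe equivalent of the manual copy above: read 16 bytes at `index`
/// into a stack array, ready to be reinterpreted as a vector.
fn load_chunk(src: &[u8], index: usize) -> [u8; 16] {
    let mut chunk = [0u8; 16];
    chunk.copy_from_slice(&src[index..index + 16]);
    chunk
}
```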

@alexbool

Wow, that's interesting!
I also have to say that all modern (at least Intel's) CPUs have AVX2, which allows 256-bit vectors. It would be nice to try. This must be #[cfg]'d, of course.
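On modern Rust the gating can be done at runtime rather than only via #[cfg], using std's feature detection (sketch; the width constants are illustrative):

```rust
/// Pick a vector width at runtime: 256-bit AVX2 when the CPU has it,
/// otherwise the 128-bit baseline.
fn vector_width_bytes() -> usize {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return 32; // 256-bit AVX2 vectors
        }
    }
    16 // 128-bit fallback (SSE2 on x86_64)
}
```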

@arthurprs

SIMD is on track for stabilization 🎉

@ijl

ijl commented Jan 23, 2019

Has anyone done anything with this?

@mmstick

mmstick commented Feb 23, 2019

This recently became a thing: https://github.com/lemire/simdjson

@Licenser

I ported simdjson to Rust; the performance boost SIMD gives is significant. Even skipping the tape tricks and adopting a more Rust-ergonomic data representation, it is up to 300% faster than vanilla serde_json.

@JimChenWYU

How has the discussion progressed so far? It's been 7 years already.

@ifsheldon

sonic-rs also shows promising results on json-benchmark. It's developed by ByteDance and its APIs are very similar to serde_json's.
