Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: implementation of SSE 4.2 compatible parsing (incl. utf8) #30

Merged
merged 9 commits into from Jul 29, 2019

Conversation

sunnygleason
Copy link
Member

Greetings! Thank you so much for your amazing work with
simdjson-rs.

I did some work earlier this year on an SSE 4.2 port of simdjson,
and I'd love to gain more experience with Rust.

I humbly submit this code for your comment & consideration. I'm
not an expert in Rust feature detection & conditional compilation,
but hopefully this code gives an idea of what the SSE-compatible
code looks like.

If you think it is an interesting idea, I'd be happy to do whatever
work to get it in shape for possible merging.

The items that are still noticeable TODO:

  • utf8 validation (can port SSE version for conditional usage)
  • conditional compilation/feature detection

Thank you again for your consideration!

Sincerely,

-Sunny Gleason

@Licenser
Copy link
Member

Hi Sunny,
That’s a great idea! Making simdjson-rs usable on more platforms is an absolute improvement. I’ll be traveling today so I’m restricted to flaky internet and a tablet so I can’t really try the code but from what I’ve seen it looks good.

For encapsulation given that it will span multiple modules we might want to have a avx and a sse42 sub module but that would be some housekeeping not a blocker :). It’s just a bit of thinking ahead for later since arm would also be a good target later on and with 3 different targets sub modules become a little cleaner.

The conditional compilation can be a bit tricky ( and require some experimentation ) but you are on the right track with guarding the mod line.

If you don’t have access to a avx capable host I can set you up with a user on a test box of mine.

@sunnygleason
Copy link
Member Author

@Licenser thank you so much for the directions - I've done a bit of tweaking to demonstrate the parts that are avx and sse specific. I'm still new to rust modules, so I wasn't able to get the code organized exactly like I hoped with conditional compilation. I was able to hack it by using a symlink of lib.rs to the appropriate architecture-specific version.

I was hoping maybe you would know what to do when you see the patterns between the _avx2 and _sse42 versions of the files. Thank you again for your consideration, I look forward to any pointers you might have re: the modularization!

@Licenser
Copy link
Member

So there is a trick of getting a list of target features. For me the command would be:

 $ rustc --target=x86_64-apple-darwin --print target-features

Available features for this target:
    16bit-mode                    - 16-bit mode (i8086).
    32bit-mode                    - 32-bit mode (80386).
    3dnow                         - Enable 3DNow! instructions.
    3dnowa                        - Enable 3DNow! Athlon instructions.
    64bit                         - Support 64-bit instructions.
    64bit-mode                    - 64-bit mode (x86_64).
    adx                           - Support ADX instructions.
    aes                           - Enable AES instructions.
    atom                          - Intel Atom processors.
    avx                           - Enable AVX instructions.
    avx2                          - Enable AVX2 instructions.
    avx512bitalg                  - Enable AVX-512 Bit Algorithms.
    avx512bw                      - Enable AVX-512 Byte and Word Instructions.
    avx512cd                      - Enable AVX-512 Conflict Detection Instructions.
    avx512dq                      - Enable AVX-512 Doubleword and Quadword Instructions.
    avx512er                      - Enable AVX-512 Exponential and Reciprocal Instructions.
    avx512f                       - Enable AVX-512 instructions.
    avx512ifma                    - Enable AVX-512 Integer Fused Multiple-Add.
    avx512pf                      - Enable AVX-512 PreFetch Instructions.
    avx512vbmi                    - Enable AVX-512 Vector Byte Manipulation Instructions.
    avx512vbmi2                   - Enable AVX-512 further Vector Byte Manipulation Instructions.
    avx512vl                      - Enable AVX-512 Vector Length eXtensions.
    avx512vnni                    - Enable AVX-512 Vector Neural Network Instructions.
    avx512vpopcntdq               - Enable AVX-512 Population Count Instructions.
    bmi                           - Support BMI instructions.
    bmi2                          - Support BMI2 instructions.
    cldemote                      - Enable Cache Demote.
    clflushopt                    - Flush A Cache Line Optimized.
    clwb                          - Cache Line Write Back.
    clzero                        - Enable Cache Line Zero.
    cmov                          - Enable conditional move instructions.
    cx16                          - 64-bit with cmpxchg16b.
    ermsb                         - REP MOVS/STOS are fast.
    f16c                          - Support 16-bit floating point conversion instructions.
    false-deps-lzcnt-tzcnt        - LZCNT/TZCNT have a false dependency on dest register.
    false-deps-popcnt             - POPCNT has a false dependency on dest register.
    fast-11bytenop                - Target can quickly decode up to 11 byte NOPs.
    fast-15bytenop                - Target can quickly decode up to 15 byte NOPs.
    fast-bextr                    - Indicates that the BEXTR instruction is implemented as a single uop with good throughput..
    fast-gather                   - Indicates if gather is reasonably fast..
    fast-hops                     - Prefer horizontal vector math instructions (haddp, phsub, etc.) over normal vector instructions with shuffles.
    fast-lzcnt                    - LZCNT instructions are as fast as most simple integer ops.
    fast-partial-ymm-or-zmm-write - Partial writes to YMM/ZMM registers are fast.
    fast-scalar-fsqrt             - Scalar SQRT is fast (disable Newton-Raphson).
    fast-shld-rotate              - SHLD can be used as a faster rotate.
    fast-variable-shuffle         - Shuffles with variable masks are fast.
    fast-vector-fsqrt             - Vector SQRT is fast (disable Newton-Raphson).
    fma                           - Enable three-operand fused multiple-add.
    fma4                          - Enable four-operand fused multiple-add.
    fsgsbase                      - Support FS/GS Base instructions.
    fxsr                          - Support fxsave/fxrestore instructions.
    gfni                          - Enable Galois Field Arithmetic Instructions.
    glm                           - Intel Goldmont processors.
    glp                           - Intel Goldmont Plus processors.
    idivl-to-divb                 - Use 8-bit divide for positive values less than 256.
    idivq-to-divl                 - Use 32-bit divide for positive values less than 2^32.
    invpcid                       - Invalidate Process-Context Identifier.
    lea-sp                        - Use LEA for adjusting the stack pointer.
    lea-uses-ag                   - LEA instruction needs inputs at AG stage.
    lwp                           - Enable LWP instructions.
    lzcnt                         - Support LZCNT instruction.
    macrofusion                   - Various instructions can be fused with conditional branches.
    merge-to-threeway-branch      - Merge branches to a three-way conditional branch.
    mmx                           - Enable MMX instructions.
    movbe                         - Support MOVBE instruction.
    movdir64b                     - Support movdir64b instruction.
    movdiri                       - Support movdiri instruction.
    mpx                           - Support MPX instructions.
    mwaitx                        - Enable MONITORX/MWAITX timer functionality.
    nopl                          - Enable NOPL instruction.
    pad-short-functions           - Pad short functions.
    pclmul                        - Enable packed carry-less multiplication instructions.
    pconfig                       - platform configuration instruction.
    pku                           - Enable protection keys.
    popcnt                        - Support POPCNT instruction.
    prefer-256-bit                - Prefer 256-bit AVX instructions.
    prefetchwt1                   - Prefetch with Intent to Write and T1 Hint.
    prfchw                        - Support PRFCHW instructions.
    ptwrite                       - Support ptwrite instruction.
    rdpid                         - Support RDPID instructions.
    rdrnd                         - Support RDRAND instruction.
    rdseed                        - Support RDSEED instruction.
    retpoline                     - Remove speculation of indirect branches from the generated code, either by avoiding them entirely or lowering them with a speculation blocking construct..
    retpoline-external-thunk      - When lowering an indirect call or branch using a `retpoline`, rely on the specified user provided thunk rather than emitting one ourselves. Only has effect when combined with some other retpoline feature..
    retpoline-indirect-branches   - Remove speculation of indirect branches from the generated code..
    retpoline-indirect-calls      - Remove speculation of indirect calls from the generated code..
    rtm                           - Support RTM instructions.
    sahf                          - Support LAHF and SAHF instructions.
    sgx                           - Enable Software Guard Extensions.
    sha                           - Enable SHA instructions.
    shstk                         - Support CET Shadow-Stack instructions.
    slm                           - Intel Silvermont processors.
    slow-3ops-lea                 - LEA instruction with 3 ops or certain registers is slow.
    slow-incdec                   - INC and DEC instructions are slower than ADD and SUB.
    slow-lea                      - LEA instruction with certain arguments is slow.
    slow-pmaddwd                  - PMADDWD is slower than PMULLD.
    slow-pmulld                   - PMULLD instruction is slow.
    slow-shld                     - SHLD instruction is slow.
    slow-two-mem-ops              - Two memory operand instructions are slow.
    slow-unaligned-mem-16         - Slow unaligned 16-byte memory access.
    slow-unaligned-mem-32         - Slow unaligned 32-byte memory access.
    soft-float                    - Use software floating point features..
    sse                           - Enable SSE instructions.
    sse-unaligned-mem             - Allow unaligned memory operands with SSE instructions.
    sse2                          - Enable SSE2 instructions.
    sse3                          - Enable SSE3 instructions.
    sse4.1                        - Enable SSE 4.1 instructions.
    sse4.2                        - Enable SSE 4.2 instructions.
    sse4a                         - Support SSE 4a instructions.
    ssse3                         - Enable SSSE3 instructions.
    tbm                           - Enable TBM instructions.
    tremont                       - Intel Tremont processors.
    vaes                          - Promote selected AES instructions to AVX512/AVX registers.
    vpclmulqdq                    - Enable vpclmulqdq instructions.
    waitpkg                       - Wait and pause enhancements.
    wbnoinvd                      - Write Back No Invalidate.
    x87                           - Enable X87 float instructions.
    xop                           - Enable XOP instructions.
    xsave                         - Support xsave instructions.
    xsavec                        - Support xsavec instructions.
    xsaveopt                      - Support xsaveopt instructions.
    xsaves                        - Support xsaves instructions.

Use +feature to enable a feature, or -feature to disable it.
For example, rustc -C -target-cpu=mycpu -C target-feature=+feature1,-feature2

For us avx (or avx2) and sse4 are the ones that matter.

with that knowledge we can use pre compiler guards to handle dependant compilation.

if you have say stage1/avx.rs and stage1/sse4.rs you can do something like:

stage1.rs

#[cfg(target_feature = "avx")]
mod avx;
#[cfg(target_feature = "avx")]
pub use avx::*;
// if we have avx we already included the above and any avx cpu has sse4 too AFAIK
#[cfg(all(target_feature = "sse4"), not(target_feature = "avx"))] 
mod sse4;
#[cfg(all(target_feature = "sse4"), not(target_feature = "avx"))] 
pub use sse4::*;

// the pub use makes it still possible to `use stage1::*` from outside so we contain the guard.

Note: not tested since I'm traveling and don't have full access to all my computers, sorry about that.

@Licenser
Copy link
Member

Back from the travels, if you want any help with the dependant compilation I can jump in and make a PR against your branch with an example but it's up to you -I don't want to steal the learning opportunity.

@sunnygleason
Copy link
Member Author

@Licenser oh nice! I am just hopping back onto this now & should have a revised PR for you in the next 2-3h if all goes well. I love the learning and I'll take any suggestions you have as well - the goal is to get it into a form you feel proud to merge...


impl<'de> Deserializer<'de> {
#[cfg_attr(not(feature = "no-inline"), inline(always))]
pub fn parse_str_(&mut self) -> Result<&'de str> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_str_ is the only method currently platform-specific in Deserializer (made it public to ease testing, might not have to be that way)

@@ -9,6 +9,8 @@ use std::arch::x86_64::*;

use std::mem;

pub const SIMDJSON_PADDING: usize = mem::size_of::<__m256i>();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made constant architecture-specific, even though values are same


impl<'de> Deserializer<'de> {
#[cfg_attr(not(feature = "no-inline"), inline(always))]
pub fn parse_str_(&mut self) -> Result<&'de str> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_str_ is the only method currently platform-specific in Deserializer (made it public to ease testing, might not have to be that way)

*/

// all byte values must be no larger than 0xF4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implementation is largely the same as the AVX2 version, just half-width


// all byte values must be no larger than 0xF4
#[cfg_attr(not(feature = "no-inline"), inline)]
fn avxcheck_smaller_than_0xf4(current_bytes: __m128i, has_error: &mut __m128i) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the avx* names are a misnomer, I held off on a renaming party (for now) thinking we'd need to change both places (AVX2 and SSE42)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ja with rust we have modules so we don't really need function prefixes :) good catch!

src/value/generator.rs Outdated Show resolved Hide resolved
@sunnygleason
Copy link
Member Author

@Licenser ok, this is ready for you to take a look -- let me know what you think, thank you again for the tips!

I'm still stuck a tiny bit on the conditional compilation -- I think it's close, just needs sensible defaulting so things like cargo test work with the right features etc.

Interesting thing -- benchmarks tended to be within 10-20% of AVX2 (good news, "slow" is still fast on slow platforms; bad news, "fast" could probably be faster on fast platforms)

@sunnygleason
Copy link
Member Author

Laptop benchmarks (not the best, but slightly interesting)

AVX2

Benchmarking apache_builds/simd_json
Benchmarking apache_builds/simd_json: Warming up for 3.0000 s
Benchmarking apache_builds/simd_json: Collecting 100 samples in estimated 5.8026 s (20k iterations)
Benchmarking apache_builds/simd_json: Analyzing
apache_builds/simd_json time:   [234.49 us 235.77 us 237.01 us]
                        thrpt:  [512.13 MiB/s 514.83 MiB/s 517.64 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Benchmarking apache_builds/simd_json-owned
Benchmarking apache_builds/simd_json-owned: Warming up for 3.0000 s
Benchmarking apache_builds/simd_json-owned: Collecting 100 samples in estimated 7.8109 s (10k iterations)
Benchmarking apache_builds/simd_json-owned: Analyzing
apache_builds/simd_json-owned
                        time:   [733.70 us 736.96 us 740.80 us]
                        thrpt:  [163.85 MiB/s 164.70 MiB/s 165.43 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

SSE42

Benchmarking apache_builds/simd_json
Benchmarking apache_builds/simd_json: Warming up for 3.0000 s
Benchmarking apache_builds/simd_json: Collecting 100 samples in estimated 5.7037 s (20k iterations)
Benchmarking apache_builds/simd_json: Analyzing
apache_builds/simd_json time:   [250.20 us 251.54 us 252.96 us]
                        thrpt:  [479.84 MiB/s 482.55 MiB/s 485.13 MiB/s]
                 change:
                        time:   [+4.8700% +6.3486% +7.5335%] (p = 0.00 < 0.05)
                        thrpt:  [-7.0057% -5.9696% -4.6438%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Benchmarking apache_builds/simd_json-owned
Benchmarking apache_builds/simd_json-owned: Warming up for 3.0000 s
Benchmarking apache_builds/simd_json-owned: Collecting 100 samples in estimated 7.9369 s (10k iterations)
Benchmarking apache_builds/simd_json-owned: Analyzing
apache_builds/simd_json-owned
                        time:   [746.03 us 749.28 us 752.88 us]
                        thrpt:  [161.22 MiB/s 161.99 MiB/s 162.70 MiB/s]
                 change:
                        time:   [-0.6106% +0.5125% +1.4529%] (p = 0.36 > 0.05)
                        thrpt:  [-1.4321% -0.5099% +0.6144%]
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  7 (7.00%) high mild
  3 (3.00%) high severe

@sunnygleason sunnygleason changed the title RFC: implementation of SSE 4.2 compatible parsing (utf8 TODO) RFC: implementation of SSE 4.2 compatible parsing (incl. utf8) Jul 27, 2019
Copy link
Member

@Licenser Licenser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all in all looks great! I would restructure it a bit to make it more rusty and less cy.

Basically add a avx2.rs and move:

  • avx2_deser.rs -> avx2/deser.rs
  • avx2_stage1.rs -> avx2/stage1.rs
  • avx2_utf8check.rs -> avx2/utf8check.rs

(and the same for sse42 files)

I think that would clean the file structure up a bit and contain the different implementations to the degree that we only need a single compilation dependant line (or two) in lib.rs to include either the sse42 or the avx2 version.

That plus moving from crate features to cpu features and I think this is done. :)

Cargo.toml Outdated
@@ -46,6 +46,10 @@ harness = false

[features]
default = ["swar-number-parsing", "serde_impl"]
# AVX2 compatibility -- TODO, figure out default
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those should not be features they are architecture dependancies. rust exposes them already and will pick the 'right' one depending on the compilation target.

Exposing them as features can lead to comilations that are either less performat when rust uses polyfills for them or straight out won't run.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://doc.rust-lang.org/reference/conditional-compilation.html is the documentation. I think the right set would be:

#[cfg(feature = "avx2")] and #[cfg(all(not(feature = "avx2"), feature = "sse4.2")] (we need to exclude avx on the second one since avx2 CPUs usually (always?) support sse4.2 as well)

src/sse42_deser.rs Outdated Show resolved Hide resolved
Copy link
Member

@Licenser Licenser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I will through it to 24h of fuzzing just to be sure and merge it if nothing breaks!

@sunnygleason
Copy link
Member Author

@Licenser thank you again! (snuck one commit by you for the generator compile-time constant)

@Licenser
Copy link
Member

Licenser commented Jul 28, 2019

To make it compile on the test bench I had to add the following changes. Can we include them? Other then that the fuzzer is running :)

diff --git a/simd-fuzz-target/Makefile b/simd-fuzz-target/Makefile
index 4d3c548..061b8bb 100644
--- a/simd-fuzz-target/Makefile
+++ b/simd-fuzz-target/Makefile
@@ -1,7 +1,11 @@
 run: build
-	RUSTFLAGS='-C codegen-units=1' cargo +nightly afl fuzz -i in -o out target/debug/simd-fuzz-target
+	RUSTFLAGS='-C codegen-units=1 -C target-cpu=native' cargo +nightly afl fuzz -i in -o out target/debug/simd-fuzz-target
 build: 
 	RUSTFLAGS='-C codegen-units=1' cargo +nightly afl build
+run-sse: build-sse
+	RUSTFLAGS='-C codegen-units=1 -C target-cpu=native -C target-feature=-avx2' cargo +nightly afl fuzz -i in -o out target/debug/simd-fuzz-target
+build-sse: 
+	RUSTFLAGS='-C codegen-units=1 -C target-cpu=native -C target-feature=-avx2' cargo +nightly afl build
 
 copy:
 	for from in `ls out/crashes/id*`; do to=`echo $$from | sed -e 's;out/crashes/id:;crash;' -e 's;,.*;.json;'`; cp $$from ../simdjson-rs/data/crash/$$to; done
diff --git a/src/lib.rs b/src/lib.rs
index 1723909..2516a0e 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -78,31 +78,31 @@ mod charutils;
 #[macro_use]
 mod macros;
 mod error;
-mod stringparse;
 mod numberparse;
 mod parsedjson;
 mod portability;
+mod stringparse;
 
 #[cfg(target_feature = "avx2")]
 mod avx2;
 #[cfg(target_feature = "avx2")]
-const SIMDJSON_PADDING : usize = crate::avx2::stage1::SIMDJSON_PADDING;
-#[cfg(target_feature = "avx2")]
 pub use crate::avx2::deser::*;
+#[cfg(target_feature = "avx2")]
+use crate::avx2::stage1::SIMDJSON_PADDING;
 
 #[cfg(all(target_feature = "sse4.2", not(target_feature = "avx2")))]
 mod sse42;
 #[cfg(all(target_feature = "sse4.2", not(target_feature = "avx2")))]
-const SIMDJSON_PADDING : usize = crate::sse42::stage1::SIMDJSON_PADDING;
-#[cfg(all(target_feature = "sse4.2", not(target_feature = "avx2")))]
 pub use crate::sse42::deser::*;
+#[cfg(all(target_feature = "sse4.2", not(target_feature = "avx2")))]
+use crate::sse42::stage1::SIMDJSON_PADDING;
 
 mod stage2;
 pub mod value;
 
 use crate::numberparse::Number;
-use std::str;
 use std::mem;
+use std::str;
 
 pub use crate::error::{Error, ErrorType};
 pub use crate::value::*;
diff --git a/src/stage2.rs b/src/stage2.rs
index 29ebe04..7bbb1da 100644
--- a/src/stage2.rs
+++ b/src/stage2.rs
@@ -1,7 +1,10 @@
 #![allow(dead_code)]
+#[cfg(target_feature = "avx2")]
+use crate::avx2::stage1::SIMDJSON_PADDING;
 use crate::charutils::*;
-use crate::{Deserializer, Error, ErrorType, Result, SIMDJSON_PADDING};
-//use crate::portability::*;
+#[cfg(all(target_feature = "sse4.2", not(target_feature = "avx2")))]
+use crate::sse42::stage1::SIMDJSON_PADDING;
+use crate::{Deserializer, Error, ErrorType, Result};
 
 #[cfg_attr(not(feature = "no-inline"), inline(always))]
 pub fn is_valid_true_atom(loc: &[u8]) -> bool {

@Licenser
Copy link
Member

looking good so far! :D
image

@sunnygleason
Copy link
Member Author

@Licenser thanks again - patch applied (modulo one trailing whitespace char after a colon character in the makefile)!

@@ -102,7 +107,7 @@ pub trait BaseGenerator {
// quote characters that gives us a bitmask of 0x1f for that
// region, only quote (`"`) and backslash (`\`) are not in
// this range.
if is_x86_feature_detected!("avx2") {
if AVX2_PRESENT {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even easier, put this around the block:

#[cfg(target_feature = "avx2")]
{
...
}

That way the code doesn't even get generated when we don't have avx2 present :)

@Licenser Licenser merged commit efd382c into simd-lite:master Jul 29, 2019
@Licenser
Copy link
Member

Thanks! Some great work :D

@sunnygleason sunnygleason deleted the rfc-sse42-impl branch August 16, 2019 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants