Allow double quotes in URIs (fixes #87) #89

nox · 2021-03-16T13:06:23Z

No description provided.

seanmonstar · 2021-03-24T01:24:56Z

Ok, so I got a chance to look into this, and re-learned how the SIMD kind of works: it will simply try to skip bytes that are considered valid for that token, and once it finds anything not valid, it returns at that position, and the regular scalar parser continues from there. That way, the scalar parser can be the single place to say whether the invalid byte is an error, or actually a delimiter to move on to the next token.

So, even if the SIMD parsers aren't "fixed", they'll simply return the position of the specific byte it didn't like, and the scalar parser would continue from there happily. Of course, fixing it would mean it's faster even with those characters.

Then, I realized with some debug prints that your change does fix the AVX2 parsing ", but it was catching the { just before it anyways. Then, I tried to decipher what the _mm256_setr_epi8 code was trying to do, with your change, and I admit it's beyond me (I'm not thrilled about that...). But so basically, it all does "work", but if there is a { early in the token, it will bump down to the scalar parser.

nox · 2021-03-24T10:56:49Z

I can try to fix the { and we can discuss about this change a bit more too, because in the wild it turns out there are clients passing non-ASCII (I only saw valid UTF-8 though) URIs, and httparse should probably let that through too, or at least behind a flag.

vkrasnov · 2021-03-25T16:22:06Z

Hi, the way I understand the code the principle is pretty simple.

Each token is broken into two nibbles, then using two 16 byte lookup tables two values are calculated:

The bitmask for the column in the URI token table, and the actual row the token is in.

If we look at the URI token table we see:

    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
//  \0                            \n
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
//  commands
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  \w !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
//  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
//  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  del
//   ====== Extended ASCII (aka. obs-text) ======
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

So for the top nibble, it tells us the row in which the token is, as a bitmask with only one bit set. For row 0 it returns 0x01, for row 1 -> 0x02 .. for row 7 -> 0x80. It returns 0 for rows 8-15, so those are never URI token.

It ANDs the bitmask with the bitmask for the lower nibble, and if the result is nonzero it is a URI token.

So looking at a column we can understand what permutation with need for the lower nibble.

We only look at the first 8 rows for each column. If only the first raw is zero we want to zero out bit 0 in the mask, i.e. 0xfe, if two lower rows are zero we want 0xfc, for the last column where top row and first row are zero we clear bits zero and 7 and get 0x7e.

So the current table in fact has many many false positives, that are simply not tested for and can be modified significantly.

let URI: __m128i = _mm_setr_epi8(
        0xb8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c,
    );

Should really be

 let URI: __m128i = _mm_setr_epi8(
        0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0xfc, 0xf4, 0xfc, 0xf4, 0x7c,
    );

And for the AVX2 code just twice that:

    let URI: __m256i = _mm256_setr_epi8(
        0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0xfc, 0xf4, 0xfc, 0xf4, 0x7c,
        0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0xfc, 0xf4, 0xfc, 0xf4, 0x7c,
    );

vkrasnov · 2021-03-25T22:06:17Z

I edited my comment with updated values. Miscalculated on the first go.

nox · 2021-03-26T08:37:51Z

Thanks again @vkrasnov! I updated the code and added a comment. Now that I understood the code, I wonder if we could generate those sequences of values from URI_MAP at build time instead hah.

seanmonstar · 2021-03-26T21:13:42Z

generate those sequences of values from URI_MAP at build time instead hah.

I had a similar related thought, that it'd be good to have a unit test that just checks all those values against what's in the URI_MAP... No need to hold this PR up though.

Allow double quotes in URIs (fixes seanmonstar#87)

ec2989d

nox force-pushed the allow-quote-in-uri branch from 60dbc79 to ec2989d Compare March 16, 2021 13:08

Make the SIMD tables correspond exactly to URI_MAP

8bd42db

seanmonstar approved these changes Mar 26, 2021

View reviewed changes

seanmonstar merged commit a2fbe1c into seanmonstar:master Mar 26, 2021

nox mentioned this pull request Jun 18, 2022

Hyper/Rocket URI validity discord: rwf2/Rocket#2234

Closed

dyxushuai mentioned this pull request Feb 11, 2024

feat(lexer): add SIMD optimization to the lexer oxc-project/oxc#2338

Draft

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow double quotes in URIs (fixes #87) #89

Allow double quotes in URIs (fixes #87) #89

nox commented Mar 16, 2021

seanmonstar commented Mar 24, 2021 •

edited

nox commented Mar 24, 2021

vkrasnov commented Mar 25, 2021 •

edited

vkrasnov commented Mar 25, 2021

nox commented Mar 26, 2021

seanmonstar commented Mar 26, 2021

Allow double quotes in URIs (fixes #87) #89

Allow double quotes in URIs (fixes #87) #89

Conversation

nox commented Mar 16, 2021

seanmonstar commented Mar 24, 2021 • edited

nox commented Mar 24, 2021

vkrasnov commented Mar 25, 2021 • edited

vkrasnov commented Mar 25, 2021

nox commented Mar 26, 2021

seanmonstar commented Mar 26, 2021

seanmonstar commented Mar 24, 2021 •

edited

vkrasnov commented Mar 25, 2021 •

edited