Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow double quotes in URIs (fixes #87) #89

Merged
merged 2 commits into from
Mar 26, 2021

Conversation

nox
Copy link
Contributor

@nox nox commented Mar 16, 2021

No description provided.

@seanmonstar
Copy link
Owner

seanmonstar commented Mar 24, 2021

Ok, so I got a chance to look into this, and re-learned how the SIMD kind of works: it will simply try to skip bytes that are considered valid for that token, and once it finds anything not valid, it returns at that position, and the regular scalar parser continues from there. That way, the scalar parser can be the single place to say whether the invalid byte is an error, or actually a delimiter to move on to the next token.

So, even if the SIMD parsers aren't "fixed", they'll simply return the position of the specific byte it didn't like, and the scalar parser would continue from there happily. Of course, fixing it would mean it's faster even with those characters.

Then, I realized with some debug prints that your change does fix the AVX2 parsing ", but it was catching the { just before it anyways. Then, I tried to decipher what the _mm256_setr_epi8 code was trying to do, with your change, and I admit it's beyond me (I'm not thrilled about that...). But so basically, it all does "work", but if there is a { early in the token, it will bump down to the scalar parser.

@nox
Copy link
Contributor Author

nox commented Mar 24, 2021

I can try to fix the { and we can discuss about this change a bit more too, because in the wild it turns out there are clients passing non-ASCII (I only saw valid UTF-8 though) URIs, and httparse should probably let that through too, or at least behind a flag.

@vkrasnov
Copy link

vkrasnov commented Mar 25, 2021

Hi, the way I understand the code the principle is pretty simple.

Each token is broken into two nibbles, then using two 16 byte lookup tables two values are calculated:

The bitmask for the column in the URI token table, and the actual row the token is in.

If we look at the URI token table we see:

    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
//  \0                            \n
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
//  commands
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  \w !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
//  0  1  2  3  4  5  6  7  8  9  :  ;  <  =  >  ?
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  P  Q  R  S  T  U  V  W  X  Y  Z  [  \  ]  ^  _
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
//  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
//  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~  del
//   ====== Extended ASCII (aka. obs-text) ======
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

So for the top nibble, it tells us the row in which the token is, as a bitmask with only one bit set. For row 0 it returns 0x01, for row 1 -> 0x02 .. for row 7 -> 0x80. It returns 0 for rows 8-15, so those are never URI token.

It ANDs the bitmask with the bitmask for the lower nibble, and if the result is nonzero it is a URI token.

So looking at a column we can understand what permutation with need for the lower nibble.

We only look at the first 8 rows for each column. If only the first raw is zero we want to zero out bit 0 in the mask, i.e. 0xfe, if two lower rows are zero we want 0xfc, for the last column where top row and first row are zero we clear bits zero and 7 and get 0x7e.

So the current table in fact has many many false positives, that are simply not tested for and can be modified significantly.

let URI: __m128i = _mm_setr_epi8(
        0xb8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0x7c, 0x54, 0x7c, 0xd4, 0x7c,
    );

Should really be

 let URI: __m128i = _mm_setr_epi8(
        0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0xfc, 0xf4, 0xfc, 0xf4, 0x7c,
    );

And for the AVX2 code just twice that:

    let URI: __m256i = _mm256_setr_epi8(
        0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0xfc, 0xf4, 0xfc, 0xf4, 0x7c,
        0xf8, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc, 0xfc,
        0xfc, 0xfc, 0xfc, 0xfc, 0xf4, 0xfc, 0xf4, 0x7c,
    );

@vkrasnov
Copy link

I edited my comment with updated values. Miscalculated on the first go.

@nox
Copy link
Contributor Author

nox commented Mar 26, 2021

Thanks again @vkrasnov! I updated the code and added a comment. Now that I understood the code, I wonder if we could generate those sequences of values from URI_MAP at build time instead hah.

@seanmonstar
Copy link
Owner

generate those sequences of values from URI_MAP at build time instead hah.

I had a similar related thought, that it'd be good to have a unit test that just checks all those values against what's in the URI_MAP... No need to hold this PR up though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants