Allow double quotes in URIs (fixes #87) #89
Ok, so I got a chance to look into this, and re-learned how the SIMD code works: it simply skips bytes that are considered valid for that token, and once it finds a byte that isn't valid, it returns at that position, and the regular scalar parser continues from there. That way, the scalar parser is the single place that decides whether the invalid byte is an error or actually a delimiter before the next token. So, even if the SIMD parsers aren't "fixed", they'll simply return the position of the specific byte they didn't like, and the scalar parser will happily continue from there. Of course, fixing them means parsing stays fast even with those characters. Then, I realized with some debug prints that your change does fix the AVX2 parsing |
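To make the division of labour concrete, here is a minimal scalar sketch of that pattern. It is not the project's code: `fast_accepts`, `is_valid`, `skip_valid_fast` and `parse_token` are hypothetical names, the validity rule (printable ASCII except space) is an assumption, and the "fast path" is an ordinary loop over 16-byte chunks standing in for real SIMD. The fast path is deliberately stricter than the scalar parser (it rejects `"`) to show that this costs speed, not correctness.

```rust
/// Assumed validity rule for this sketch only: printable ASCII except space.
fn is_valid(b: u8) -> bool {
    (0x21..=0x7e).contains(&b)
}

/// The "unfixed" fast path: same rule, but it also stops on a double quote.
fn fast_accepts(b: u8) -> bool {
    is_valid(b) && b != b'"'
}

/// Stand-in for the SIMD code: handles 16 bytes at a time and returns the
/// position of the first byte it did not accept. It never reports an error.
fn skip_valid_fast(bytes: &[u8]) -> usize {
    let mut pos = 0;
    for chunk in bytes.chunks_exact(16) {
        // Real SIMD would classify all 16 lanes at once.
        let accepted = chunk.iter().take_while(|&&b| fast_accepts(b)).count();
        pos += accepted;
        if accepted < 16 {
            return pos;
        }
    }
    pos // a tail shorter than 16 bytes is left to the scalar loop
}

/// Scalar parser: the only place that decides "delimiter" versus "error".
/// Returns the token length, or the offending byte.
fn parse_token(bytes: &[u8]) -> Result<usize, u8> {
    let mut pos = skip_valid_fast(bytes);
    while pos < bytes.len() {
        let b = bytes[pos];
        if is_valid(b) {
            pos += 1; // includes bytes the fast path was too strict about
        } else if b == b' ' {
            return Ok(pos); // delimiter: the token ends here
        } else {
            return Err(b); // genuinely invalid byte
        }
    }
    Ok(pos)
}
```

With a quote inside a full 16-byte chunk, `skip_valid_fast` stops at the quote and `parse_token` still finds the end of the token; the remaining bytes are just handled one at a time instead of sixteen at a time.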
I can try to fix the |
Hi, the way I understand the code, the principle is pretty simple. Each byte is split into two nibbles, then two 16-byte lookup tables are used to calculate two values: the bitmask for the byte's column in the URI token table, and the row the byte is in. If we look at the URI token table we see:
So for the top nibble, it tells us the row in which the byte is, as a bitmask with only one bit set. For row 0 it returns 0x01, for row 1 -> 0x02 .. for row 7 -> 0x80. It returns 0 for rows 8-15, so those are never URI tokens. It ANDs that bitmask with the bitmask for the lower nibble, and if the result is nonzero it is a URI token. So by looking at a column we can work out which mask we need for the lower nibble. We only look at the first 8 rows of each column. If only the first row is zero we want to zero out bit 0 in the mask, i.e. 0xfe; if the two lowest rows are zero we want 0xfc; for the last column, where the top row and the first row are zero, we clear bits 0 and 7 and get 0x7e. So the current table in fact has many false positives that are simply not tested for, and it can be modified significantly.
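A scalar model of the two-table lookup may help when checking the masks by hand. This is a sketch under assumptions, not the SIMD code itself: `is_allowed` is a hypothetical allow-list (printable ASCII except space), not the project's URI_MAP, and the helper names are made up for illustration. In the SIMD version the two table lookups are done for all lanes at once with a byte shuffle; here they are plain array indexing.

```rust
/// Hypothetical allow-list for this sketch; NOT the project's URI_MAP.
fn is_allowed(b: u8) -> bool {
    (0x21..=0x7e).contains(&b)
}

/// Low-nibble table: entry `col` has bit `row` set when the byte
/// `(row << 4) | col` is allowed. Only rows 0..8 fit into the 8-bit mask.
fn build_low_table() -> [u8; 16] {
    let mut table = [0u8; 16];
    for col in 0..16u8 {
        for row in 0..8u8 {
            if is_allowed((row << 4) | col) {
                table[col as usize] |= 1u8 << row;
            }
        }
    }
    table
}

/// High-nibble table: a single bit per row for rows 0..8, and zero for
/// rows 8..16 so that bytes >= 0x80 can never match.
fn build_high_table() -> [u8; 16] {
    let mut table = [0u8; 16];
    for row in 0..8u8 {
        table[row as usize] = 1u8 << row;
    }
    table
}

/// A byte is accepted when its column mask has the bit for its row set.
fn classify(b: u8, low: &[u8; 16], high: &[u8; 16]) -> bool {
    (low[(b & 0x0f) as usize] & high[(b >> 4) as usize]) != 0
}
```

Under this assumed allow-list, column 1 has its two lowest rows (0x01 and 0x11) rejected, so its mask comes out as 0xfc, matching the rule described above. The values for the real table depend on the real map.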
Should really be
And for the AVX2 code just twice that:
|
I edited my comment with updated values. Miscalculated on the first go. |
Thanks again @vkrasnov! I updated the code and added a comment. Now that I understood the code, I wonder if we could generate those sequences of values from URI_MAP at build time instead hah. |
I had a similar related thought, that it'd be good to have a unit test that just checks all those values against what's in the URI_MAP... No need to hold this PR up though. |
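Both ideas could look roughly like the following sketch. Everything here is an assumption for illustration: `MAP` is a hypothetical `[bool; 256]` stand-in for URI_MAP (filled with printable ASCII except space), and `low_nibble_table` and `first_mismatch` are made-up names. The `const fn` derives the low-nibble masks at compile time, and `first_mismatch` is the kind of exhaustive check a unit test could run against a hand-written table.

```rust
/// Hypothetical stand-in for the project's map: true means "allowed".
const fn build_map() -> [bool; 256] {
    let mut map = [false; 256];
    let mut b = 0x21usize;
    while b <= 0x7e {
        map[b] = true;
        b += 1;
    }
    map
}

const MAP: [bool; 256] = build_map();

/// Derive the 16 low-nibble masks from the map at compile time.
/// Bit `row` of entry `col` is set when byte `(row << 4) | col` is allowed.
const fn low_nibble_table(map: &[bool; 256]) -> [u8; 16] {
    let mut table = [0u8; 16];
    let mut col: u8 = 0;
    while col < 16 {
        let mut row: u8 = 0;
        while row < 8 {
            if map[((row << 4) | col) as usize] {
                table[col as usize] |= 1u8 << row;
            }
            row += 1;
        }
        col += 1;
    }
    table
}

const LOW: [u8; 16] = low_nibble_table(&MAP);

/// Returns the first byte below 0x80 on which a table disagrees with the
/// map, or None when they match exactly (no false positives or negatives).
fn first_mismatch(table: &[u8; 16], map: &[bool; 256]) -> Option<u8> {
    (0u8..0x80).find(|&b| {
        let row_bit = 1u8 << (b >> 4);
        let in_table = (table[(b & 0x0f) as usize] & row_bit) != 0;
        in_table != map[b as usize]
    })
}
```

A test would then assert that `first_mismatch` returns `None` for the table in use. An exact comparison like this also flags false positives, not only missing bytes.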
No description provided.