fixing issue 1480 #1485
@jkeiser I tweaked the documentation a tiny bit because I do not think that the |
Will read more later, I'm getting ready to get on the road. I thought the fix would probably be following the memcmp with a check for ". But it's true that a long string would blow the padding, and I forgot about that! One note in case you hadn't got it: I'm pretty sure you can exploit raw_json_token or equivalent to get a comfortable maximum for the string. Even better, document length - index + SIMDJSON_PADDING gives you an absolute bound for a safe comparison. |
Since 99% of keys are under 32 bytes, and they are generally specified as constants, I was really hoping I could use some template array magic or initializer list stuff to do the fast comparison when length is known at compile time to be <= 32. But I didn't figure it out at the time. |
@jkeiser That's the part I don't know how to do well. The keys are typically compile-time constants. But how do we make use of that? According to Stack Overflow (see https://stackoverflow.com/questions/46919582/is-it-possible-to-test-if-a-constexpr-function-is-evaluated-at-compile-time), C++20 solves the problem:
constexpr int foo(int s)
{
  if (std::is_constant_evaluated()) // note: not "if constexpr"
    /* evaluated at compile time */;
  else
    /* evaluated at run time */;
}
Note that the compiler can do a lot. If it knows that the key is a compile-time constant, it will optimize accordingly. Well, GNU GCC and clang will; I don't know what Visual Studio will do. Visual Studio may need some C++20 magic.
@jkeiser Let me be concrete... If you take something like that...
bool is_match(const char *target, const char *base) {
  size_t pos{0};
  for (; target[pos]; pos++) {
    if (target[pos] == '"') { return false; }
    if (target[pos] != base[pos]) { return false; }
  }
  return true;
}
bool check(const char *base) {
  return is_match("fsfsdfds", base);
}
The compiler will obliterate the comparison with '"' when it knows the input string at compile time. So code that looks challenging can be optimized. (This does not seem to work with Visual Studio.)
@jkeiser What I am saying is that you may not need template magic. |
Note that it is very tempting to get clever here. The idea is to very quickly dismiss keys that are not a match. Suppose that the target has at least 7 bytes. Pick a 64-bit word. Copy up to 8 characters from the target to this 64-bit word, terminating with a quote if possible. Let us call this word FILTER. Go through the keys. Load a 64-bit word (always safe because of padding) at each key's location. Do an XOR with FILTER. You get zero if and only if the first 8 characters match (including possibly the quote). If so, investigate further; if not, move on. If the target has fewer than 7 bytes, then you need to use a mask, but there are only 7 cases so you can use a lookup table or some other cleverness. I'll open an issue. I suggest we handle clever optimizations separately. This PR is meant to be a bug fix.
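The FILTER idea in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the PR: the helper names make_filter and may_match are hypothetical, the target is assumed to have at least 7 bytes, and json_key is assumed to point just past the opening quote of a key with at least 8 readable bytes behind it (which padding would normally guarantee).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: pack the first 8 bytes of the target key into a
// 64-bit word. A 7-byte target is completed with the closing quote so that
// the quote takes part in the comparison. Assumes target.size() >= 7.
inline std::uint64_t make_filter(std::string_view target) {
  assert(target.size() >= 7);
  char buf[8];
  const std::size_t n = target.size() < 8 ? target.size() : 8;
  std::memcpy(buf, target.data(), n);
  if (n == 7) { buf[7] = '"'; }
  std::uint64_t filter;
  std::memcpy(&filter, buf, sizeof(filter));
  return filter;
}

// Hypothetical helper: json_key points just past the opening quote of a key.
// Assumes at least 8 readable bytes at json_key (e.g. thanks to padding).
// A true result only means "the first 8 bytes agree": the caller still has
// to investigate further before declaring a match.
inline bool may_match(const char *json_key, std::uint64_t filter) {
  std::uint64_t word;
  std::memcpy(&word, json_key, sizeof(word));
  return (word ^ filter) == 0;
}
```

The filter would be built once per target key and reused for every candidate key in the document, so the per-key cost is one load and one XOR.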
@lemire completely agree extra optimization should be separate. I hadn't had time to read your actual plan or code yet and was just posting things I already knew. Do you have a sense whether it did affect performance? Probably twitter.json is the one most likely to be affected (Partial Tweets). Reading code now. |
doc/ondemand_design.md
Outdated
direct raw ASCII comparisons: `key().raw()` provides direct access to the unescaped string.
You can compare `key()` with unescaped C strings (e.g., `key()=="test"`). Importantly,
the C string must not contain an unescaped quote character (`"`) which you can check with
`raw_json_string::is_free_from_unescaped_quote("test")`.
I think the more common case is covered by "you can compare directly with a raw string as long as the string you are comparing it to does not have escape characters in it." The unescaped quote character in the raw JSON is not a factor there.
In fact, I don't think this restriction ever applies, actually. The only way we could screw it up is if the user compares something with a raw string terminated with a single backslash, which is invalid JSON anyway. I think the best way to state the restriction is "if you want to compare with the raw JSON key, you must compare against a key with valid JSON (which requires escape characters if they will contain control characters, newlines, tabs, backslash and quote)."
The restriction we need to worry about is \uXXXX, actually. That's the best way to accidentally get a false negative. "This can return a false negative if the JSON file contains keys with \uXXXX in it. If this is a concern, you can call safe_equal() or is_free_from_unicode_escapes()."
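To illustrate the false negative described here: a key written as \u0061 in the JSON document decodes to "a", but its raw bytes differ from "a", so a byte-wise comparison fails. A minimal sketch of a check for such escapes follows. The name has_unicode_escape is hypothetical and is not part of simdjson; the input is assumed to be the raw (still escaped) content of a valid JSON string.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if the raw (still escaped) JSON string
// content contains a \uXXXX escape. A backslash always consumes the next
// character, so an escaped backslash followed by 'u' is not counted.
inline bool has_unicode_escape(std::string_view raw) {
  for (std::size_t i = 0; i + 1 < raw.size(); i++) {
    if (raw[i] == '\\') {
      if (raw[i + 1] == 'u') { return true; }
      i++; // skip the escaped character
    }
  }
  return false;
}

// Usage: note that the C++ literal "\\u0061" holds the six bytes \u0061.
inline void example_unicode_escape() {
  assert(has_unicode_escape("\\u0061"));    // \u0061 decodes to "a"
  assert(!has_unicode_escape("a"));         // plain key, raw comparison is fine
  assert(!has_unicode_escape("\\\\u0061")); // escaped backslash, then "u0061"
}
```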
Oh! You're talking about pre-validating your own keys. That makes sense. I think the right check (as noted in another comment) is is_normalized_json_string(), ensuring that one-character escapes are used, \u is only used for otherwise unrepresentable characters, and everything else is raw UTF-8.
However, what you did is a big step better than what we have already, which I put together rather haphazardly.
@jkeiser Suppose that you have the following JSON: {"a":1}
and I do key() == "a\":1}bababababbababa.....".
The comparison could lead to a buffer overrun, couldn't it? Clearly, a":1}bababababbababa..... is not a valid JSON string, but imagine that I am an adversary trying to make simdjson crash.
Yep! I wasn't thinking about keys being provided by users--this is more designed around keys provided by the developer. Totally makes sense.
@@ -59,6 +61,9 @@ class object {
* Use find_field() if you are sure fields will be in order (or are willing to treat it as if the
* field wasn't there when they aren't).
*
* If you have multiple fields with a matching key ({"x": 1, "x": 1}) be mindful
* that only one field is returned.
Oh, good point!
@@ -13,26 +13,142 @@ simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> ra
return result;
}

simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> raw_json_string::unescape(json_iterator &iter) const noexcept {
return unescape(iter.string_buf_loc());
simdjson_really_inline bool raw_json_string::is_free_from_unescaped_quote(std::string_view target) noexcept {
As noted before, I am not sure this is a relevant restriction if the user is given the right constraints. This, on the other hand, might help:
- is_normalized_json_string(key): validates that your string does not contain invalid characters, has no \u escapes (except for some under 0x20 with no escape character equivalent, I guess?), and is not terminated by an odd number of backslashes.
simdjson_unused simdjson_really_inline bool operator==(const raw_json_string &a, std::string_view b) noexcept {
return !memcmp(a.raw(), b.data(), b.size());

simdjson_really_inline bool raw_json_string::unsafe_is_equal(size_t length, std::string_view target) const noexcept {
I feel like it would be fine to just add max_length into raw_json_string and use peek_length(), but wouldn't hold this up for it!
@jkeiser Yes. We could do it. As you can imagine, I was trying to be non-invasive because I was concerned with messing up your code. It is not hard!!!
Not yet. But I wanted to get the design right first. Before merging, we will have to assess the performance effect. |
const char * r{raw()};
size_t pos{0};
bool escaping{false};
for(;target[pos];pos++) {
stringparsing methods already come pretty close to this, and may be the better way to go long term since we have them available. Absolutely not something we should do in this PR.
Everything I've suggested is something we should (if we decide to do it) do in followup PRs. This is a giant step forward and I am eager for it to be merged.
@jkeiser I can finish this up, and do some performance tuning and pick what is fastest, but I think we need to resolve the API issue first... So I say that the problem is with unescaped quotes, and you seem to think that they do not cause problems. We should resolve that. Then you say that we should validate "\u", but I am not sure I follow. That is, I can do the performant implementation, but we should discuss what we document. |
I eventually figured out what you were getting at, which probably made my comments confusing :) I'll put it in one comment for clarity:
I have created an issue #1489 We could both improve the performance and add more robustness. |
GNU GCC 9, AMD Rome. Current main branch:
Current PR:
So a 1.6% slowdown. The version with
I run this command: |
We should merge it and see whether we can get the perf back without the potential bugs after :) |
Upcoming commit resolves the performance degradation entirely as I expected should be possible...
The trick is that most keys will fit in 32 bytes which we can use to simplify the logic. |
…ack to something closer to the original.
@jkeiser I also decided to undo much of the change that moved us from std::string_view to explicit C char* comparison. I want to change as little as possible.
If the tests turn green, I will merge. This should have no negative effect on performance in practice. |
Merging. I think this is fine now. |
The raw_json_string is unsafe in some ways and this PR makes it better. One key design constraint is that raw_json_string is unaware of its length. It has no idea. You have to do everything manually. This exposes you to buffer overruns and to incorrect comparisons.
Two problems could happen:
A. The comparator just did a byte-by-byte (memcmp) comparison without checking for the final quote. So the keys "ab" and "ac" would both match "a". That's pretty bad.
B. Just as bad, you could pass "ababaababababababababababa...aaa" (make it very long) as a potential key against the JSON document {"a":1} and it would just happily compare, reading inside the JSON input for as long as you want (buffer overflow!!!).
So there are two types of algorithm you could use:
1. Bound the comparison by a known maximum length: you can call the peek_length() method right before accessing the key and it should give you a sensible upper bound on the string. Note that you must do this on your own because the raw_json_string does not track its length.
2. Compare until the closing quote of the key, which requires that the target contains no unescaped quote, unlike baba"fdfdsfddsfsd...fdsfd, since a key "baba" would then suffer from a buffer overrun.
For the public API, we offer raw_json_string and it does not know its length, so only option 2 is viable, unless you change raw_json_string so that it becomes more like a string_view and can track an upper bound on its length.
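The two strategies discussed in this description could look roughly like the sketch below. It is an illustration only and not the unsafe_is_equal code from the PR: the function names are hypothetical, raw is assumed to point just past the opening quote of a key inside a valid JSON document, max_length is assumed to be an upper bound on the bytes readable at raw, and targets are assumed to be valid (escaped) JSON string content.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Strategy 1 (sketch): the caller supplies an upper bound on the number of
// readable bytes at raw. Targets that cannot fit are rejected up front, so
// neither memcmp nor the closing-quote check reads past the bound.
inline bool bounded_is_equal(const char *raw, std::size_t max_length,
                             std::string_view target) {
  if (target.size() >= max_length) { return false; }
  return std::memcmp(raw, target.data(), target.size()) == 0 &&
         raw[target.size()] == '"';
}

// Strategy 2 (sketch): no length is known. The loop stops at the first
// mismatch, which happens at the key's closing quote at the latest, as long
// as the target contains no unescaped quote.
inline bool quote_terminated_is_equal(const char *raw, const char *target) {
  assert(target != nullptr);
  std::size_t pos = 0;
  for (; target[pos]; pos++) {
    if (raw[pos] != target[pos]) { return false; }
  }
  return raw[pos] == '"'; // the key must end exactly where the target ends
}
```

Both versions check the closing quote, which addresses problem A. Strategy 1 handles problem B for any target, while strategy 2 handles it only under the no-unescaped-quote assumption.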
We relied on C++ magic to do the comparison between a C string and raw_json_string. I think that what would happen is that the C string would be automatically converted to string_view (which, presumably, involves finding its length with strlen, though that can be done at compile time). I checked and it seems that at least GNU GCC is able to basically do it for free (no std::string_view instance needs to be created) when the C string is a compile-time constant (if not, then you get a call to strlen plus the comparison, which is inefficient). For now, I removed the std::string_view API within raw_json_string and put just a C string API since this is what it is designed for. As far as the public API goes, it serves us well enough.
I have a reason for removing the overload to string_view and it has to do with the fact that there are two ways to skin this cat. Internally, we now have two unsafe_is_equal comparators: one with strategy 1 and one with strategy 2. Currently the PR relies on strategy 2. My main justification is that it is the less intrusive change to the current codebase.
Expected effects of this PR:
a. For our internal functions, instead of doing a straight memcmp, we do a loop and we check the final quote. I do not think it should affect the performance. You would expect memcmp to mostly do well on longer strings or on strings that have a predictable length... which is not our case.
b. For the public API (my.key() == "a"), it is much the same. We check the final quote, which is a bit of extra work, and we do not do the memcmp.
c. The code made assumptions that the user-provided input was not adversarial and this is still true with this PR. The only added safety is that we are protected against inputs that are longer than the remaining bytes in the JSON document, but only if the user provides a non-escaped JSON key. If the user inserts a quote, we are in trouble. We can "easily" flip the problem around and produce slightly safer code with strategy 1. However, it is more likely, I feel, to come at a computational cost so we should proceed with some care.
Fixes #1480