
fixing issue 1480 #1485

Merged
merged 13 commits into from
Mar 9, 2021

Conversation

lemire
Member

@lemire lemire commented Mar 5, 2021

The raw_json_string is unsafe in some ways and this PR makes it better. One key design constraint is that raw_json_string is unaware of its length. It has no idea. You have to do everything manually. This exposes you to buffer overruns and to incorrect comparisons.

Two problems could happen:

A. The comparator just did byte-by-byte (memcmp) comparison without checking for the final quote. So the keys "ab" and "ac" would both match "a". That's pretty bad.
B. Just as bad, you could pass "ababaababababababababababa...aaa" (make it very long) as a potential key against the json document {"a":1} and it would just happily compare, reading inside the JSON input for as long as you want (buffer overflow!!!).
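To make problem A concrete, here is a minimal sketch (hypothetical helper name, not the actual simdjson code) of a length-oblivious memcmp comparator: it declares a match as soon as the candidate's bytes agree, without ever checking that the JSON string ends where the candidate does.

```cpp
#include <cassert>
#include <cstring>

// Sketch of the pre-fix behavior (hypothetical name, not simdjson's API).
// raw_json points just past the opening quote of the key inside the document.
bool buggy_is_equal(const char *raw_json, const char *candidate) {
  // Compares only strlen(candidate) bytes: a candidate that is a strict
  // prefix of the JSON key "matches", and no closing quote is verified.
  return std::memcmp(raw_json, candidate, std::strlen(candidate)) == 0;
}
```

With this sketch, the JSON keys `ab` and `ac` both "match" the candidate `a`, and a very long candidate would read far past the end of a short document.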

So there are two types of algorithm you could use:

  1. Find some bound on how far you can read within the document without a buffer overrun: if you know where you are in the JSON input and you know how long the JSON document is, then you have one such upper bound. But you can also call a peek_length() method right before accessing the key and it should give you a sensible upper bound on the string. Note that you must do this on your own because the raw_json_string does not track its length.
  2. Or else, you rely on the fact that the user is not going to give you as a possible key something evil like baba"fdfdsfddsfsd...fdsfd since a key "baba" would then suffer from a buffer overrun.

For the public API, we offer raw_json_string and it does not know its length, so only option 2 is viable, unless you change raw_json_string so that it becomes more like a string_view... and can track an upper bound on its length.
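As a rough sketch of strategy 1 (hypothetical helper names, not the PR's actual code; the bound could come from peek_length() or from the distance to the end of the document, as described above):

```cpp
#include <cstddef>

// Compare a candidate key against raw JSON bytes without ever reading more
// than max_length bytes of the document (sketch only).
bool bounded_is_equal(const char *raw_json, size_t max_length,
                      const char *candidate) {
  size_t pos = 0;
  for (; candidate[pos]; pos++) {
    if (pos >= max_length) { return false; }  // out of safe bytes to read
    if (raw_json[pos] != candidate[pos]) { return false; }
  }
  // the JSON string must end exactly where the candidate does
  return pos < max_length && raw_json[pos] == '"';
}
```

Note that the adversarial candidate from problem B above is rejected without ever reading past the bound.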

We relied on C++ magic to do the comparison between a C string and raw_json_string. I think that what would happen is that the C string would be automatically converted to string_view (which, presumably, involves finding its length with strlen, though that can be done at compile time). I checked and it seems that at least GNU GCC is able to basically do it for free (no std::string_view instance needs to be created) when the C string is a compile-time constant (if not, then you get a call to strlen + the comparison, which is inefficient). For now, I removed the std::string_view API within raw_json_string and put in just a C string API since this is what it is designed for. As far as the public API goes, it serves us well enough.

I have a reason for removing the overload to string_view and it has to do with the fact that there are two ways to skin this cat. Internally, we now have two unsafe_is_equal comparators. One with strategy 1 and one with strategy 2. Currently the PR relies on strategy 2. My main justification is that it is the less intrusive change to the current codebase.

Expected effects of this PR:

a. For our internal functions, instead of doing a straight memcmp, we do a loop and we check the final quote. I do not think it should affect the performance. You would expect memcmp to mostly do well on longer strings or on strings that have a predictable length... which is not our case.
b. For the public API (my.key() == "a"), it is much the same. We check the final quote, which is a bit of extra work, and we do not do the memcmp.
c. The code made assumptions that the user-provided input was not adversarial and this is still true with this PR. The only added safety is that we are protected against inputs that are longer than the remaining bytes in the JSON document, but only if the user provides a non-escaped JSON key. If the user inserts a quote, we are in trouble. We can "easily" flip the problem around and produce slightly safer code with strategy 1. However, it is more likely, I feel, to come at a computational cost, so we should proceed with some care.

Fixes #1480

@lemire lemire requested a review from jkeiser March 5, 2021 19:57
@lemire
Member Author

lemire commented Mar 5, 2021

@jkeiser I tweaked the documentation a tiny bit because I do not think that the raw_json_string should be the default for end users. There are better ways.

@lemire lemire added this to the 0.9 milestone Mar 5, 2021
@jkeiser
Member

jkeiser commented Mar 5, 2021

Will read more later, I'm getting ready to get on the road. I thought the fix would probably be following the memcmp with a check for ". But it's true that a long string would blow the padding, and I forgot about that!

One note in case you hadn't got it: I'm pretty sure you can exploit raw_json_token or equivalent to get a comfortable maximum for the string. Even better, document length - index + SIMDJSON_PADDING gives you an absolute bound for a safe comparison.

@jkeiser
Member

jkeiser commented Mar 5, 2021

Since 99% of keys are under 32 bytes, and they are generally specified as constants, I was really hoping I could use some template array magic or initializer list stuff to do the fast comparison when length is known at compile time to be <= 32. But I didn't figure it out at the time.

@lemire
Member Author

lemire commented Mar 5, 2021

@jkeiser That's the part I don't know how to do well. The keys are typically compile-time constants, so you can do things like static_assert to check that they fit your expectations. For example, static_assert(x.size()<32, "") would work.
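For instance (a sketch with a made-up key name), a constexpr std::string_view makes the compile-time size check straightforward in C++17:

```cpp
#include <string_view>

constexpr std::string_view key{"coordinates"}; // hypothetical key
static_assert(key.size() < 32, "key must fit the fast path");
```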

But how do we make use of that? According to Stack Overflow (see https://stackoverflow.com/questions/46919582/is-it-possible-to-test-if-a-constexpr-function-is-evaluated-at-compile-time), C++20 solves the problem:

#include <type_traits>

constexpr int foo(int s)
{
    if (std::is_constant_evaluated()) // note: not "if constexpr"
        return s; /* evaluated at compile time */
    else
        return s; /* evaluated at run time */
}

@lemire
Member Author

lemire commented Mar 5, 2021

Since 99% of keys are under 32 bytes, and they are generally specified as constants, I was really hoping I could use some template array magic or initializer list stuff to do the fast comparison when length is known at compile time to be <= 32. But I didn't figure it out at the time.

Note that the compiler can do a lot. If it knows that the key is a compile-time constant, it will optimize accordingly.

Well. GNU GCC and clang will, I don't know what Visual Studio will do. Visual Studio may need some C++20 magic.

@lemire
Member Author

lemire commented Mar 5, 2021

@jkeiser Let me be concrete... If you take something like that...

#include <cstddef>

// target: the candidate key (NUL-terminated C string)
// base:   pointer into the JSON input, just past the opening quote
bool is_match(const char *target, const char *base) {
  size_t pos{0};
  for(; target[pos]; pos++) {
      if(target[pos] == '"') { return false; } // reject candidates containing a quote
      if(target[pos] != base[pos]) { return false; }
  }
  return true;
}

bool check(const char *base) {
    return is_match("fsfsdfds", base);
}

The compiler will obliterate the comparison with '"' when it knows the input string at compile time. So code that looks challenging can be optimized.

(This does not seem to work with Visual Studio.)

@lemire
Member Author

lemire commented Mar 5, 2021

@jkeiser What I am saying is that you may not need template magic.

@lemire
Member Author

lemire commented Mar 6, 2021

Note that it is very tempting to get clever here.

The idea is to very quickly dismiss keys that are not a match.

Suppose that the target has at least 7 bytes.

Pick up a 64-bit word. Copy up to 8 characters from the target to this 64-bit word, terminating with a quote if possible. Let us call this word FILTER.

Go through the keys. Load a 64-bit word (always safe because of padding) at that location. Do an XOR with FILTER. You get zero if and only if the first 8 characters match (including possibly the quote). If so, investigate further; if not, move on.

If the target has fewer than 7 bytes, then you need to use a mask, but there are only 7 cases so you can use a lookup table or some other cleverness.
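A sketch of the filter idea for targets of at least 7 bytes (hypothetical helper, not simdjson code; it assumes, as simdjson guarantees via SIMDJSON_PADDING, that at least 8 bytes are readable at the key's location):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

bool first8_match(const char *json_key, const char *target) {
  size_t len = std::strlen(target);
  if (len < 7) { return false; } // shorter targets need a mask (see above)
  // FILTER: up to 8 bytes of the target, appending the closing quote
  // when the target has exactly 7 characters.
  char buf[8];
  std::memcpy(buf, target, len >= 8 ? 8 : len);
  if (len < 8) { buf[len] = '"'; }
  uint64_t filter;
  std::memcpy(&filter, buf, sizeof(filter));
  uint64_t word;
  std::memcpy(&word, json_key, sizeof(word)); // safe thanks to padding
  // Zero XOR means the first 8 bytes (possibly including the quote) match;
  // a real implementation would then investigate further.
  return (word ^ filter) == 0;
}
```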

I'll open an issue. I suggest we handle clever optimizations separately. This PR is meant to be a bug fix.

@jkeiser
Member

jkeiser commented Mar 8, 2021

@lemire completely agree extra optimization should be separate. I hadn't had time to read your actual plan or code yet and was just posting things I already knew. Do you have a sense whether it did affect performance? Probably twitter.json is the one most likely to be affected (Partial Tweets).

Reading code now.

direct raw ASCII comparisons: `key().raw()` provides direct access to the unescaped string.
You can compare `key()` with unescaped C strings (e.g., `key()=="test"`). Importantly,
the C string must not contain an unescaped quote character (`"`) which you can check with
`raw_json_string::is_free_from_unescaped_quote("test")`.
Member

I think the more common case is covered by "you can compare directly with a raw string as long as the string you are comparing it to does not have escape characters in it." The unescaped quote character in the raw JSON is not a factor there.

Member

In fact, I don't think this restriction ever applies, actually. The only way we could screw it up is if the user compares something with a raw string terminated with a single backslash, which is invalid JSON anyway. I think the best way to state the restriction is "if you want to compare with the raw JSON key, you must compare against a key with valid JSON (which requires escape characters if they will contain control characters, newlines, tabs, backslash and quote)."

The restriction we need to worry about is \uXXXX, actually. That's the best way to accidentally get a false negative. "This can return a false negative if the JSON file contains keys with \uXXXX in it. If this is a concern, you can call safe_equal() or is_free_from_unicode_escapes()."

Member

Oh! You're talking about pre-validating your own keys. That makes sense. I think the right check (as noted in another comment) is is_normalized_json_string(), ensuring that one-character escapes are used, \u is only used for otherwise unrepresentable characters, and everything else is raw UTF-8.

However, what you did is a big step better than what we have already, which I put together rather haphazardly.

Member Author

@jkeiser Suppose that you have the following json {"a":1} and I do key() == "a\":1}bababababbababa.....". The comparison could possibly lead to a buffer overrun... can't it? Clearly, a":1}bababababbababa..... is not a valid JSON string, but imagine that I am an adversary trying to make simdjson crash.

Member

Yep! I wasn't thinking about keys being provided by users--this is more designed around keys provided by the developer. Totally makes sense.

@@ -59,6 +61,9 @@ class object {
* Use find_field() if you are sure fields will be in order (or are willing to treat it as if the
* field wasn't there when they aren't).
*
* If you have multiple fields with a matching key ({"x": 1, "x": 1}) be mindful
* that only one field is returned.
Member

Oh, good point!

@@ -13,26 +13,142 @@ simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> ra
return result;
}

simdjson_really_inline simdjson_warn_unused simdjson_result<std::string_view> raw_json_string::unescape(json_iterator &iter) const noexcept {
return unescape(iter.string_buf_loc());
simdjson_really_inline bool raw_json_string::is_free_from_unescaped_quote(std::string_view target) noexcept {
Member

As noted before, I am not sure this is a relevant restriction if the user is given the right constraints. This, on the other hand, might help:

  • is_normalized_json_string(key): validates that your string does not contain invalid characters, has no \u escapes (except for some under 0x20 with no escape character equivalent, I guess?), and is not terminated by an odd number of backslashes.

simdjson_unused simdjson_really_inline bool operator==(const raw_json_string &a, std::string_view b) noexcept {
return !memcmp(a.raw(), b.data(), b.size());

simdjson_really_inline bool raw_json_string::unsafe_is_equal(size_t length, std::string_view target) const noexcept {
Member

I feel like it would be fine to just add max_length into raw_json_string and use peek_length(), but wouldn't hold this up for it!

Member Author

@jkeiser Yes. We could do it. As you can imagine, I was trying to be non-invasive because I was concerned with messing up your code. It is not hard!!!

@lemire
Member Author

lemire commented Mar 8, 2021

Do you have a sense whether it did affect performance?

Not yet. But I wanted to get the design right first. Before merging, we will have to assess the performance effect.

const char * r{raw()};
size_t pos{0};
bool escaping{false};
for(;target[pos];pos++) {
Member

stringparsing methods already come pretty close to this, and may be the better way to go long term since we have them available. Absolutely not something we should do in this PR.

Member

@jkeiser jkeiser left a comment

Everything I've suggested is something we should (if we decide to do it) do in followup PRs. This is a giant step forward and I am eager for it to be merged.

@lemire
Member Author

lemire commented Mar 8, 2021

@jkeiser I can finish this up, and do some performance tuning and pick what is fastest, but I think we need to resolve the API issue first... So I say that the problem is with unescaped quotes, and you seem to think that they do not cause problems. We should resolve that. Then you say that we should validate "\u", but I am not sure I follow.

That is, I can do the performant implementation, but we should discuss what we document.

@jkeiser
Member

jkeiser commented Mar 8, 2021

I eventually figured out what you were getting at, which probably made my comments confusing :) I'll put it in one comment for clarity:

  1. I see that you were mainly worried about the quote because it could cause a buffer overrun if the key is designed adversarially. That is fair and seems like a primary concern.
  2. I was talking about normalized and valid JSON because I was thinking about non-adversarial use cases that will work incorrectly. Basically, the subset of keys it's designed for is keys that are statically known and restricted to raw UTF-8 without escapes. Even if the statically known user-provided keys are in this set, it could still fail to match valid JSON if the JSON file itself has the same key, with \uXXXX modifiers (e.g. obj["a"] will fail to match { "\u0041": "hello world" }). The reverse is also true. There is value in checking for consistency in keys if they are coming from outside, but it seems like a low priority.
  3. I'm not worried about peek_length(), and fixing the bug seems more important than fixing that :)

@lemire
Member Author

lemire commented Mar 8, 2021

There is value in checking for consistency in keys if they are coming from outside, but it seems like a low priority.

I have created an issue (#1489). We could both improve the performance and add more robustness.

@lemire
Member Author

lemire commented Mar 8, 2021

GNU GCC 9, AMD Rome.

Current main branch:

partial_tweets<simdjson_ondemand>/manual_time     208761 ns       242875 ns         3366             1011           3.04671G               0              0    703.746k              1.11438          4.82444k       3.39518G          2.41881M                    3.83017                     3.43705           482.444k     1023.93   631.515k        2.8173G/s          0          0    707.16k         1.11978   4.79016k/s 3.38741G/s     2.41881M               3.83017                3.42045        100       479.016k/s [BEST: throughput=  3.05 GB/s doc_throughput=  4824 docs/s instructions=     2418808 cycles=      703746 branch_miss=    1011 cache_miss=       0 cache_ref=         0 items=       100 avg_time=    208761 ns]

Current PR:

partial_tweets<simdjson_ondemand>/manual_time     212125 ns       246173 ns         3285             1067            3.0013G               0              0    714.412k              1.13127          4.75254k       3.39527G          2.44165M                    3.86634                     3.41771           475.254k     1099.21   631.515k       2.77263G/s          0          0   718.165k         1.13721   4.71419k/s 3.38557G/s     2.44165M               3.86634                3.39985        100       471.419k/s [BEST: throughput=  3.00 GB/s doc_throughput=  4752 docs/s instructions=     2441651 cycles=      714412 branch_miss=    1067 cache_miss=       0 cache_ref=         0 items=       100 avg_time=    212125 ns]

So a 1.6% slowdown.

The version with size_t max_key_length = _json_iter->peek_length() - 2 does worse as I expected intuitively...

partial_tweets<simdjson_ondemand>/manual_time     215444 ns       250317 ns         3247           1.318k           2.96166G               0              0    723.972k              1.14641          4.68977k       3.39526G          2.43193M                    3.85095                     3.35916           468.977k    1.31279k   631.515k       2.72992G/s          0          0   727.642k         1.15222   4.64158k/s 3.37741G/s     2.43194M               3.85095                3.34221        100       464.158k/s [BEST: throughput=  2.96 GB/s doc_throughput=  4689 docs/s instructions=     2431935 cycles=      723972 branch_miss=    1318 cache_miss=       0 cache_ref=         0 items=       100 avg_time=    215443 ns]

I ran this command: for i in {1..4}; do ./buildpr/benchmark/bench_ondemand --benchmark_filter="partial_tweets<simdjson_ondemand>" --benchmark_counters_tabular=true |& grep simdjson_ondemand ; done.

@jkeiser
Member

jkeiser commented Mar 8, 2021

We should merge it and see whether we can get the perf back without the potential bugs after :)

@lemire
Member Author

lemire commented Mar 8, 2021

Upcoming commit resolves the performance degradation entirely as I expected should be possible...

partial_tweets<simdjson_ondemand>/manual_time     207547 ns       241662 ns         3358              980           3.06941G               0              0    698.546k              1.10614          4.86039k        3.3952G          2.41952M                    3.83129                     3.46365           486.039k     1004.94   631.515k       2.83378G/s          0          0   702.771k         1.11283   4.81818k/s 3.38608G/s     2.41952M               3.83129                3.44283        100       481.818k/s [BEST: throughput=  3.07 GB/s doc_throughput=  4860 docs/s instructions=     2419519 cycles=      698546 branch_miss=     980 cache_miss=       0 cache_ref=         0 items=       100 avg_time=    207547 ns]

The trick is that most keys will fit in 32 bytes which we can use to simplify the logic.
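Presumably the fast path looks something like this sketch (hypothetical names, not the commit's actual code; SIMDJSON_PADDING is 32 bytes, so reading up to 32 bytes past the current position cannot overrun):

```cpp
#include <cstddef>

constexpr size_t kShortKeyMax = 32; // mirrors SIMDJSON_PADDING (assumption)

// Compare a short candidate against raw JSON bytes; no length bound is
// needed because the padding makes the first 32 bytes always readable.
bool short_key_is_equal(const char *raw_json, const char *candidate) {
  size_t pos = 0;
  for (; candidate[pos]; pos++) {
    if (pos + 1 >= kShortKeyMax) { return false; } // needs the bounded path
    if (raw_json[pos] != candidate[pos]) { return false; }
  }
  return raw_json[pos] == '"'; // pos < kShortKeyMax here, safe to read
}
```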

@lemire
Member Author

lemire commented Mar 8, 2021

@jkeiser I also decided to undo much of the change that moved us from std::string_view to explicit C char* comparison. I want to change as little as possible.

@lemire
Member Author

lemire commented Mar 8, 2021

If the tests turn green, I will merge.

This should have no negative effect on performance in practice.

@lemire
Member Author

lemire commented Mar 9, 2021

Merging. I think this is fine now.

@lemire lemire merged commit 8e8fbc4 into master Mar 9, 2021
@lemire lemire deleted the dlemire/fix_issue_1480 branch March 9, 2021 00:31
Successfully merging this pull request may close these issues.

raw_json_string operator== returns true for unequal keys (OnDemand API)