Why we do not have a std::string_view version for the key() of field (not key_raw_json_token)? #2149

renzibei · 2024-03-15T02:54:48Z

Hi, I've been exploring simdjson's API, particularly focusing on iterating over JSON objects and accessing their keys. From my understanding and current experimentation, it appears there are several methods to access keys but with certain limitations:

field::key() returns a simdjson::ondemand::raw_json_string, which provides the raw key but not in a directly usable std::string_view format without further processing.
field::unescaped_key() does return an unescaped key as std::string_view, which is great but involves the overhead of unescaping.
We also have access to key_raw_json_token, which retains the quotes.

Considering the efficiency and ease of use of std::string_view in C++ for string operations, I'm wondering why there isn't a method like field::key_sv() (or any similarly named method) that directly returns a std::string_view of the key, akin to field::key() but without the need to deal with raw_json_string or token specifics. This addition would streamline operations that can benefit from the lightweight and efficient nature of std::string_view, especially in scenarios where string manipulation or comparisons are frequent.

Is there a technical or design rationale for this absence, or could this be considered for future implementation? Such a feature would significantly enhance usability for developers focusing on performance-critical applications.

The text was updated successfully, but these errors were encountered:

jkeiser · 2024-03-15T15:14:03Z

Sure! It's not unreasonable to have field.key().escaped() or maybe field.escaped_key() if the former isn't feasible. There's no reason not to have such a method accessible as long as the user has to type the word raw or escaped somewhere.

Just for context, the rationales behind not making key() convert automatically or easily to string_view were:

Converting to string_view requires scanning the string to get the length, which is a waste of time for many use cases (string comparison, raw copy, and unescaping), so we don't want the easiest methods to start off doing it--this is why key() returns the raw_json_string gets returned in the first place, so these operations can be done without preemptively taking that overhead.
Processing strings without unescaping should be explicit in the code, requiring you to write "escaped" or "raw" to do it. Due to the rarity of escapes, accidentally forgetting to unescape is the kind of "silent but deadly" bug that tends to make its way to production and fail mysteriously and intermittently there. In some ways, it's like asking the user to sign a waiver saying they understand the risks :) Counterpoint: escapes in keys (rather than values) are so rare that we already have one API--object["key"]--that processes escaped keys without anything explicitly indicating it in the code. We still try to minimize the number of places we do this, and point 1 still applies, though.

lemire · 2024-03-15T15:37:28Z

I agree with @jkeiser that it is reasonable to extend the API further.

lemire · 2024-03-15T15:38:29Z

Note that is it not difficult to implement. We effectively have the code already.

renzibei · 2024-03-15T18:39:46Z

I think the length of the escaped_key() should be length of key_raw_json_token() minus two bytes?
Following is the code of key_raw_json_token() I found. Should escaped_key() be very similar, but cutting one byte and the head and tail of key_raw_json_token()?

simdjson_inline std::string_view field::key_raw_json_token() const noexcept {
  SIMDJSON_ASSUME(first.buf != nullptr); // We would like to call .alive() by Visual Studio won't let us.
  return std::string_view(reinterpret_cast<const char*>(first.buf-1), second.iter._json_iter->token.peek(-1) - first.buf + 1);
}

jkeiser · 2024-03-15T19:14:17Z

Unfortunately, key_raw_json_token() will include everything from the open quote to the character just before the :. This means if there are any spaces between the key and the colon, it is included in key_raw_json_token(): for { "abc" : "def" }, the raw_json_token will have four spaces at the end, for example.

lemire · 2024-03-15T19:25:14Z

@renzibei @jkeiser We must do some non-trivial work even if we do not unescape.

It might answer this question: Is there a technical or design rationale for this absence, or could this be considered for future implementation?

The answer is that it is not free.

renzibei · 2024-03-15T19:39:06Z

It might indeed require scanning for the ending double quote to find the length of the key, but for ease of use, I think this may be worthwhile. If we compare the key multiple times with some other strings, then we've almost certainly scanned the key multiple times already. Additionally, in situations where the key is used as the key in a hash map, having the length information and a std::string_view becomes indispensable. Given these considerations, I believe the cost of determining the length could be justified.

We can remind the user that this operation to get a string_view has some cost, but smaller than unescaped_key().

lemire · 2024-03-15T19:44:16Z

@renzibei When checking for equality, we do not actually need to find the end quote... so it is not work that we do in any case, or that we could necessarily amortize in practice.

There is no argument against the fact that the feature request is valid and we will provide it.

Let me be clear : we will provide this functionality in a future release. In fact, I am openly inviting folks to provide a pull request. If nobody does it, I will.

jkeiser · 2024-03-15T19:48:10Z

Yeah, to be specific, the code to compare the key with a string is basically strncmp(field.key().buf, str.data(), str.len()) && *(field.key().buf+str.len()) == '"'. So we don't scan for it; we just check if the quote is where it should be given the length of the string we're comparing to.

renzibei · 2024-03-15T19:48:39Z

@lemire Thanks for the reply.

You mentioned that the function has been implemented somewhere already?

We effectively have the code already.

renzibei · 2024-03-15T19:51:02Z

Yeah, to be specific, the code to compare the key with a string is basically strncmp(field.key().buf, str.data(), str.len()) && *(field.key().buf+str.len()) == '"'. So we don't scan for it; we just check if the quote is where it should be given the length of the string we're comparing to.

I understand. What I'm saying is that, when we compare the key() to other strings multiple times, we may have accessed the whole key memory already. The strncmp here or memcmp can be viewed as a scan of the memory.

lemire · 2024-03-15T19:54:14Z

The strncmp here or memcmp can be viewed as a scan of the memory.

But that's not what we do in the code.

lemire · 2024-03-15T19:55:23Z

You mentioned that the function has been implemented somewhere already?

We can locate the start and the end. It is a simple matter of backtracking and finding the quote.

Except for the copy-pasting, the whole thing can be implemented with one or two extra lines of code.

The expensive part is to write the documentation and the new tests.

renzibei · 2024-03-15T21:11:13Z

This week I'm preoccupied with some commitments. If you have the bandwidth to tackle this soon, that would be fantastic. Otherwise, I'd be happy to contribute a pull request, potentially after one or two weeks. Of course, if anyone else has the capacity to jump in sooner, that would be great as well.

lemire · 2024-03-17T15:24:01Z

This will be part of the next release.

lemire added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers on demand Related to simdjson::ondemand functionality labels Mar 15, 2024

PragmaTwice mentioned this issue Mar 17, 2024

Add escaped_key method to ondemand::field #2150

Merged

lemire closed this as completed in #2150 Mar 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why we do not have a std::string_view version for the key() of field (not key_raw_json_token)? #2149

Why we do not have a std::string_view version for the key() of field (not key_raw_json_token)? #2149

renzibei commented Mar 15, 2024

jkeiser commented Mar 15, 2024 •

edited

lemire commented Mar 15, 2024

lemire commented Mar 15, 2024

renzibei commented Mar 15, 2024 •

edited

jkeiser commented Mar 15, 2024

lemire commented Mar 15, 2024

renzibei commented Mar 15, 2024

lemire commented Mar 15, 2024

jkeiser commented Mar 15, 2024

renzibei commented Mar 15, 2024

renzibei commented Mar 15, 2024

lemire commented Mar 15, 2024

lemire commented Mar 15, 2024

renzibei commented Mar 15, 2024

lemire commented Mar 17, 2024

Why we do not have a std::string_view version for the key() of field (not key_raw_json_token)? #2149

Why we do not have a std::string_view version for the key() of field (not key_raw_json_token)? #2149

Comments

renzibei commented Mar 15, 2024

jkeiser commented Mar 15, 2024 • edited

lemire commented Mar 15, 2024

lemire commented Mar 15, 2024

renzibei commented Mar 15, 2024 • edited

jkeiser commented Mar 15, 2024

lemire commented Mar 15, 2024

renzibei commented Mar 15, 2024

lemire commented Mar 15, 2024

jkeiser commented Mar 15, 2024

renzibei commented Mar 15, 2024

renzibei commented Mar 15, 2024

lemire commented Mar 15, 2024

lemire commented Mar 15, 2024

renzibei commented Mar 15, 2024

lemire commented Mar 17, 2024

jkeiser commented Mar 15, 2024 •

edited

renzibei commented Mar 15, 2024 •

edited