Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why we do not have a std::string_view version for the key() of field (not key_raw_json_token)? #2149

Closed
renzibei opened this issue Mar 15, 2024 · 15 comments · Fixed by #2150
Closed
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed on demand Related to simdjson::ondemand functionality

Comments

@renzibei
Copy link

Hi, I've been exploring simdjson's API, particularly focusing on iterating over JSON objects and accessing their keys. From my understanding and current experimentation, it appears there are several methods to access keys but with certain limitations:

  1. field::key() returns a simdjson::ondemand::raw_json_string, which provides the raw key but not in a directly usable std::string_view format without further processing.
  2. field::unescaped_key() does return an unescaped key as std::string_view, which is great but involves the overhead of unescaping.
  3. We also have access to key_raw_json_token, which retains the quotes.

Considering the efficiency and ease of use of std::string_view in C++ for string operations, I'm wondering why there isn't a method like field::key_sv() (or any similarly named method) that directly returns a std::string_view of the key, akin to field::key() but without the need to deal with raw_json_string or token specifics. This addition would streamline operations that can benefit from the lightweight and efficient nature of std::string_view, especially in scenarios where string manipulation or comparisons are frequent.

Is there a technical or design rationale for this absence, or could this be considered for future implementation? Such a feature would significantly enhance usability for developers focusing on performance-critical applications.

@jkeiser
Copy link
Member

jkeiser commented Mar 15, 2024

Sure! It's not unreasonable to have field.key().escaped() or maybe field.escaped_key() if the former isn't feasible. There's no reason not to have such a method accessible as long as the user has to type the word raw or escaped somewhere.

Just for context, the rationales behind not making key() convert automatically or easily to string_view were:

  1. Converting to string_view requires scanning the string to get the length, which is a waste of time for many use cases (string comparison, raw copy, and unescaping), so we don't want the easiest methods to start off doing it--this is why key() returns the raw_json_string gets returned in the first place, so these operations can be done without preemptively taking that overhead.

  2. Processing strings without unescaping should be explicit in the code, requiring you to write "escaped" or "raw" to do it. Due to the rarity of escapes, accidentally forgetting to unescape is the kind of "silent but deadly" bug that tends to make its way to production and fail mysteriously and intermittently there. In some ways, it's like asking the user to sign a waiver saying they understand the risks :) Counterpoint: escapes in keys (rather than values) are so rare that we already have one API--object["key"]--that processes escaped keys without anything explicitly indicating it in the code. We still try to minimize the number of places we do this, and point 1 still applies, though.

@lemire
Copy link
Member

lemire commented Mar 15, 2024

I agree with @jkeiser that it is reasonable to extend the API further.

@lemire lemire added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers on demand Related to simdjson::ondemand functionality labels Mar 15, 2024
@lemire
Copy link
Member

lemire commented Mar 15, 2024

Note that is it not difficult to implement. We effectively have the code already.

@renzibei
Copy link
Author

renzibei commented Mar 15, 2024

I think the length of the escaped_key() should be length of key_raw_json_token() minus two bytes?
Following is the code of key_raw_json_token() I found. Should escaped_key() be very similar, but cutting one byte and the head and tail of key_raw_json_token()?

simdjson_inline std::string_view field::key_raw_json_token() const noexcept {
  SIMDJSON_ASSUME(first.buf != nullptr); // We would like to call .alive() by Visual Studio won't let us.
  return std::string_view(reinterpret_cast<const char*>(first.buf-1), second.iter._json_iter->token.peek(-1) - first.buf + 1);
}

@jkeiser
Copy link
Member

jkeiser commented Mar 15, 2024

Unfortunately, key_raw_json_token() will include everything from the open quote to the character just before the :. This means if there are any spaces between the key and the colon, it is included in key_raw_json_token(): for { "abc" : "def" }, the raw_json_token will have four spaces at the end, for example.

@lemire
Copy link
Member

lemire commented Mar 15, 2024

@renzibei @jkeiser We must do some non-trivial work even if we do not unescape.

It might answer this question: Is there a technical or design rationale for this absence, or could this be considered for future implementation?

The answer is that it is not free.

@renzibei
Copy link
Author

It might indeed require scanning for the ending double quote to find the length of the key, but for ease of use, I think this may be worthwhile. If we compare the key multiple times with some other strings, then we've almost certainly scanned the key multiple times already. Additionally, in situations where the key is used as the key in a hash map, having the length information and a std::string_view becomes indispensable. Given these considerations, I believe the cost of determining the length could be justified.

We can remind the user that this operation to get a string_view has some cost, but smaller than unescaped_key().

@lemire
Copy link
Member

lemire commented Mar 15, 2024

@renzibei When checking for equality, we do not actually need to find the end quote... so it is not work that we do in any case, or that we could necessarily amortize in practice.

There is no argument against the fact that the feature request is valid and we will provide it.

Let me be clear : we will provide this functionality in a future release. In fact, I am openly inviting folks to provide a pull request. If nobody does it, I will.

@jkeiser
Copy link
Member

jkeiser commented Mar 15, 2024

Yeah, to be specific, the code to compare the key with a string is basically strncmp(field.key().buf, str.data(), str.len()) && *(field.key().buf+str.len()) == '"'. So we don't scan for it; we just check if the quote is where it should be given the length of the string we're comparing to.

@renzibei
Copy link
Author

@lemire Thanks for the reply.

You mentioned that the function has been implemented somewhere already?

We effectively have the code already.

@renzibei
Copy link
Author

Yeah, to be specific, the code to compare the key with a string is basically strncmp(field.key().buf, str.data(), str.len()) && *(field.key().buf+str.len()) == '"'. So we don't scan for it; we just check if the quote is where it should be given the length of the string we're comparing to.

I understand. What I'm saying is that, when we compare the key() to other strings multiple times, we may have accessed the whole key memory already. The strncmp here or memcmp can be viewed as a scan of the memory.

@lemire
Copy link
Member

lemire commented Mar 15, 2024

The strncmp here or memcmp can be viewed as a scan of the memory.

But that's not what we do in the code.

@lemire
Copy link
Member

lemire commented Mar 15, 2024

You mentioned that the function has been implemented somewhere already?

We can locate the start and the end. It is a simple matter of backtracking and finding the quote.

Except for the copy-pasting, the whole thing can be implemented with one or two extra lines of code.

The expensive part is to write the documentation and the new tests.

@renzibei
Copy link
Author

This week I'm preoccupied with some commitments. If you have the bandwidth to tackle this soon, that would be fantastic. Otherwise, I'd be happy to contribute a pull request, potentially after one or two weeks. Of course, if anyone else has the capacity to jump in sooner, that would be great as well.

@lemire
Copy link
Member

lemire commented Mar 17, 2024

This will be part of the next release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed on demand Related to simdjson::ondemand functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants