Replies: 6 comments 26 replies
-
I just saw how raw_json_string::unsafe_is_equal() is implemented... I guess that I could apply SIMDJSON_PADDING to my lexers... all their tokens are much smaller than SIMDJSON_PADDING...
-
A raw_json_string is basically meant to support comparisons.
The next simdjson will include a new method to allow you to get exactly that... it will be called key_raw_json_token().
Almost all data structures in On Demand are lightweight and meant to be passed by value, not by reference. However, I think that the problem you are encountering is not with simdjson per se. It is a C++ language issue. The following C++ code is invalid:

```cpp
#include <vector>

std::vector<int> get(int num) {
  return std::vector<int>(num);
}

int f(std::vector<int>& ref) {
  return ref.size();
}

int test() {
  return f(get(10)); // error: cannot bind a temporary to a non-const lvalue reference
}
```

If you want to pass a reference, you first have to create a value to reference:

```cpp
#include <vector>

std::vector<int> get(int num) {
  return std::vector<int>(num);
}

int f(std::vector<int>& ref) {
  return ref.size();
}

int test() {
  auto value = get(10);
  return f(value);
}
```

(This code is now valid.) The same applies with simdjson types. To pass an array to a function, just pass it by value. There is typically no point to the pass-by-reference paradigm in simdjson. Example:

```cpp
#include "simdjson.h"
#include <iostream>

using namespace simdjson;

// prints the content of the array as hexadecimal 64-bit integers
void f(simdjson::ondemand::array v) {
  for (uint64_t val : v) {
    std::cout << "0x" << std::hex << val << std::endl;
  }
}

int main(void) {
  simdjson::padded_string json = R"( [ 897314173811950000, 3122321 ])"_padded;
  simdjson::ondemand::parser parser;
  simdjson::ondemand::document doc = parser.iterate(json);
  f(doc.get_array());
  return EXIT_SUCCESS;
}
```
-
Hi Daniel, yes... key_raw_json_token() is going to be very appreciated in my simdjson usage... in the meantime, I have discovered raw_json()... I think I can make good use of it because I know for sure that the vast majority of strings in my documents contain zero escape chars... Perhaps another suggestion... maybe having a function to trim the quotes could be handy... I see myself doing it all over the place as soon as I start using raw_json()... I guess the main reason for keeping them is to make the string a valid JSON document by itself (especially when called on arrays or objects).
You have nailed correctly what I am encountering: attempting to pass an rvalue reference to a function expecting a reference... Another unintended consequence of your very elegant solution for allowing the API to be used with or without exceptions is that it makes it harder to use the auto keyword with return values... Concerning the simdjson lightweight object philosophy, I'll try to consider these objects a little bit like string_view... I will need to clarify that point because it is a little bit confusing...
FYI, my migration to simdjson is going well... I have so far migrated well-contained, small JSON parsers... mostly to get the hang of how to use simdjson well... No performance-sensitive JSON handling has been touched yet. The performance-sensitive part is the core of the app, where most of the JSON parsing code lives. I am keeping this part for last because it is going to be the bulk of the migration: the biggest, the hardest and the longest. For that part, there is no partial migration possible. It is an all-or-nothing situation. I have created a thin abstraction layer using type erasure that allows me to switch from one lib to another at compile time in case there is a future need for that.
One thing that I can say so far from my experience is that the binary size is significantly increasing. I am a bit surprised by that because... one caveat is that I am keeping the debug symbols in to make possible core dumps easy to analyze... maybe the code size is not that much bigger, but using simdjson generates a lot of debug symbols... I am not sure if you can comment on my last observation... about what you know of the binary size generated from using simdjson compared to other JSON libs.
-
I have another question popping into my mind... Let's say that I have an object... I fetch the first field with find_field, i.e. find_field("event"). Next, depending on the value of that field, the object is dispatched to different functions. Each function then processes the remaining fields. (JSON polymorphism?) What is the best approach to continue the field traversal from where the object was left off? If I know that "event" is going to be the first field, would it be better to access it with an iterator and pass the iterator to the functions to continue the iteration?
-
You are correct, but my understanding with get_string() is that the whole string is going to be scanned for possible escape substitution. I don't know if you think the idea is unreasonable, but I believe there would be an audience for having a raw string without quotes... I did not dig deep enough into the simdjson code to figure out whether quotes are indexed along with the other important JSON markers... so maybe it is possible to leverage SIMD magic to make that operation fast... Otherwise, the best way that I know to do it is to search for the last quote char and trim the string_view from there up to the end... I did not make any benchmark, but I would think that this is faster than performing the regular escape processing... If you tell me that you feel it is a good idea that you would like to see added to the project... I could look into it...
-
FYI, I have completed my migration from rapidjson to simdjson... The difference is not immediately obvious since my program parses very small JSON packets. The speed gain might be on the order of microseconds, while the network RTT has a standard deviation of 0.5 msec, so basically any speed gain is not immediately visible. CPU usage may have been reduced, but this has not been scientifically measured. Maybe my new simdjson code will shine during the occasional packet burst, but I'll need another 24-48h to conclude anything in that regard... Bottom line, I am glad to have replaced that ugly SAX code. I think it was worth the move for code maintainability alone, and for the ease of writing new JSON code in the future...
-
I am currently in the process of migrating my app code from rapidjson to simdjson... it is long, tedious and boring, but the task is progressing well and I am getting better at it...
it is very satisfying to be able to simplify the SAX style code and replace dozens of stateful methods with a simple loop function...
I wrote my first full specialization of a template method for a custom type!
My first difficulty or incomprehension is:
I am getting weird typecast compiler errors...
Let's say I create a function taking an ondemand::array reference... then I call
func(field.value().get_array())
and the compiler will complain that it cannot convert an ondemand::implementation::icelake::array rvalue to a reference to ondemand::array...
I guess this is some sort of implementation artifact. If I pin the return value to a local variable, I can then pass a reference to that local variable. I am just confused about having to create a local copy of the return value when the documentation specifically warns against making copies...
I am having a hard time working with raw_json_string... A common idiom in my code is: if a key is unrecognized, log it.
The API exposes an ostream operator, but it forces me into some gymnastics because my logging subsystem is more of the printf style...
Another use I make of JSON keys is to pass them to a lexer... raw_json_string would be workable, but it does not expose any length info... Having that would be very useful... The length does not need to point precisely at the end of the key; some arbitrary limit would be fine. I would be happy for the length to be set to the next marker/pointer simdjson has in its index data...
For now, I guess that I can provide the length of the longest token in the lexer specifications... It should be safe since the quote char appears nowhere in the specs, therefore if there is no match, the lexer will not go further than the closing quote...
Yeah... I just wanted to let you know that, in my opinion, there is a very small something missing from raw_json_string to make it really user friendly...