-
Notifications
You must be signed in to change notification settings - Fork 978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster and more correct serialization #1168
Conversation
@lemire Have you considered using the amazing work of Raffaello Giulietti "The Schubfach way to render doubles" to serialize 64-bit floating point numbers into the shortest and the most precise text representation? Here is the original implementation for Java from the authors. Here is C++ implementation of the algorithm by Alexander Bolz. Here is an adoption of the Schubfach way in one of the fastest JSON serializer for Scala. BTW, there is the Dragonbox algorithm with C++ implementation, that is based on the Schubfach algorithm. |
@plokhotnyuk Do you have a link to a C/C++ along with benchmarks? At this point in time, I am not even trying to achieve best performance, let alone optimal. I am moving us off the C++ streams. Period. If there is ready and tested code that I could adopt, please share... but it needs to be mature code. |
I am still flabbergasted that C++ streams are this slow. I figured it'd be 50% worse, or something, and I could live with that at the time, but wow. Just wow. |
I hear that they are the root cause of the fires on the West Coast. |
@plokhotnyuk Useful pointers. Thanks. |
@plokhotnyuk I have looked at the C++ implementations and none of them look mature enough. It may well be great work, but I don't trust the look of the code, and I do not want to spend the time at this point to examine it. |
@jkeiser This is ready for review.This modifies your code, so please review. It can be improved. At this point, it is enough that it is correct, and much faster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this is just a thought. Don't take it as a request for change, just mull it over.
If we were to rename mini_formatter -> json_formatter, and expose it, we could support string, cout, and FILE * in one interface:
template<typename T=string>
struct json_formatter {
T out;
private:
void write(const char *buf, size_t len);
};
// Might need template<> in front of each of these
void json_formatter<string>::write(const char *buf, size_t len) { out.append(buf, len); }
void json_formatter<ostream &>::write(const char *buf, size_t len) { out << string_view(buf, len); }
void json_formatter<FILE *>::write(const char *buf, size_t len) { fputs(out, buf, len); }
template<typename T>
inline ostream &operator<<(ostream &out, T value) { json_formatter(out).write(value); }
inline json_formatter &operator<<(json_formatter &out, dom::element value) { ... }
inline json_formatter &operator<<(json_formatter &out, dom::array value) { ... }
inline json_formatter &operator<<(json_formatter &out, dom::object value) { ... }
inline json_formatter &operator<<(json_formatter &out, dom::document value) { ... }
inline json_formatter &operator<<(json_formatter &out, error_code value) { ... }
template<typename T>
inline json_formatter &operator<<(json_formatter &out, simdjson_result<T> value) {
T actual;
error_code error = value.get(actual);
if (error) { out << error; }
else { out << actual; }
return *this;
}
This gives you the option to keep the formatter around between invocations, which I think is one of
the reasons you wrote string_builder :)
Usage:
// string
json_formatter formatter;
formatter << doc;
// formatter.out has the string
// ostream &
cout << doc;
// FILE *
json_formatter(stdout) << doc;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions, but nothing that screams "need fix." It's good as-is, backwards-compatible and brings substantial benefit. The only drawback is it has to malloc while printing and for large files will be waiting on cache misses a lot, which the old one theoretically was not; but I don't know how important this is.
If we want to fix that, we can fix it later. This is way better than what we had!
I'll take your concerns point by point and try to address them. It is important to get these things right. |
This might be totally fine but I don't want to mix API design with this PR. I am mostly preoccupied with getting the fundamentals right. So what I'll do is hide away everything in "internal" and make it private, so that we have the freedom of tweaking the API later. |
are not part of our public API and are subject to change at any time!
@jkeiser My latest commit hides away string_buffer and mini_formatter in the internal namespace and I have put clear warning that it is not part of our public API. So we can take a second pass and expose whatever we like later. |
My thinking was as follows. Suppose that you have an application where you routine grab a subset of a JSON document and you pass it along. The way we do things currently, we would constantly be allocating memory to hold the string content even if all of your strings are about the same size. In such a scenario, you'd want to keep reusing your buffer. But my thinking is not very relevant since I don't have a real use case for it. So let us hide all this where the users won't see it. This way, we can change it later. |
It's a good thought, and I agree anytime we have allocation, we should let people reuse it. |
Looks good, thanks :) |
Merging. |
Currently, in simdjson, we have the recently introduced
std::minify
which can convert a JSON element back to a minified strings. Internally, it works with C++ streams. There are three problems with C++ streams...This PR fixes these issues with
simdjson::minify
. It adds a new equivalent functionto_string
, as it is a standard way to convert objects to strings in C++11. I would also addto_chars
to be C++17-ish, but I stopped short of it because C++17 is not our primary concern.This PR is backward compatible: there should be no API-breaking changes. However, the serialized output will differ. But that is fine. The new serialized output should be more precise and be locale independent.
It almost identical code from the previous code, I just copied @jkeiser 's code in a stream-free manner. The result is a 10x performance boost (see numbers below). This is still very far from having optimal speed but it is probably good enough to eliminate serialization as a bottleneck.
As stated elsewhere, this should be in 0.6 as I think it is of some importance.
There are other benefits to this approach as it opens the door to the addition of a prettifier.
Fixes #933
According to @plokhotnyuk The floating-point serialization could probably be even better, but this can be handled in the future.