Optimize serde_json::from_reader
#1007
Conversation
Thank you very much for your work! Unfortunately, the author of this project (@dtolnay) lacks time for it, so be prepared to wait for years for your PR to be merged. Using a fork would be easier.
Thanks for the PR!
An internal buffer that applies to serde_json::Deserializer::from_reader is not going to be acceptable. The caller is allowed to read additional content from the stream that comes after the JSON data, directly from the original stream. For example this PR breaks the following use case:
```rust
use serde::Deserialize;
use serde_json::Value;

fn main() {
    let mut reader = br#"[1] {"...":true}"#.as_slice();
    let header = Value::deserialize(&mut serde_json::Deserializer::from_reader(&mut reader)).unwrap();
    println!("{:?}", header);
    let body = Value::deserialize(&mut serde_json::Deserializer::from_reader(&mut reader)).unwrap();
    println!("{:?}", body);
}
```
Before:
Array [Number(1)]
Object {"...": Bool(true)}
After:
Array [Number(1)]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("EOF while parsing a value", line: 1, column: 0)', src/main.rs:10:92
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
For serde_json::from_reader specifically, that concern does not apply because it enforces the stream only contains a JSON value. We potentially could buffer inside of serde_json::from_reader without buffering serde_json::Deserializer::from_reader. However, I would be concerned about the wisdom of maintaining 2 different reader implementations (and forcing users to pay for that in serde_json compile time) when a caller could achieve the same effect by using BufReader.
I haven't looked closely at what else is in the PR but you said there are "various optimizations" lumped in here aside from the internal buffering. If any of those can be decoupled from an internal buffer, I would be open to reviewing those in a separate PR.
Thanks for the response! Since the buffering approach won't work, I'll go ahead and close this.
This improves `serde_json::from_reader` performance by adding internal buffering and various optimizations, partially resolving #160. The optimization in which copying into the scratch space is avoided was inspired by @bouk's work (#160 (comment)) -- thank you, @bouk.

Buffering is used regardless of whether or not the client provides a `BufRead` reader. While this means there are two layers of buffering when the client uses a `BufRead`, the performance turns out to be better if they do use one anyway, so the main downside is the additional memory use. This approach also eliminates the foot-gun of using `from_reader` without a `BufRead`.

I'm wondering if an approach like this would be accepted. If so, I'd improve the test coverage before submitting.
Using the example from #160 (comment) as a benchmark (taking the middle of five measurements):

Before:

- Without `BufReader`: 13.460s
- With `BufReader`: 0.213s

After:

- Without `BufReader`, 128 byte internal buffer: 0.170s (98% improvement from baseline)
- Without `BufReader`, 2048 byte internal buffer: 0.069s (99% improvement from baseline, 59% from 128 byte buffer)
- Without `BufReader`, 4096 byte internal buffer: 0.066s (99.5% improvement from baseline, 61% from 128 byte buffer)
- With `BufReader`, 128 byte internal buffer: 0.067s (69% improvement from baseline)
- With `BufReader`, 2048 byte internal buffer: 0.062s (71% improvement from baseline, 7% from 128 byte buffer)
- With `BufReader`, 4096 byte internal buffer: 0.063s (70% improvement from baseline, 6% from 128 byte buffer)

`serde_json::from_vec`, in comparison, takes 0.063s including reading the file into a `Vec`.

Benchmark code can be found here -- https://gist.github.com/tgnottingham/9eb4b84ea82a6578628a7b7ae2ffc191.
Using a different benchmark:

Before:

- Without `BufReader`: 16.147s
- With `BufReader`: 0.346s

After:

- Without `BufReader`, 128 byte internal buffer: 0.317s (98% improvement from baseline)
- Without `BufReader`, 2048 byte internal buffer: 0.192s (98.8% improvement from baseline, 39% from 128 byte buffer)
- Without `BufReader`, 4096 byte internal buffer: 0.187s (98.8% improvement from baseline, 41% from 128 byte buffer)
- With `BufReader`, 128 byte internal buffer: 0.188s (46% improvement from baseline)
- With `BufReader`, 2048 byte internal buffer: 0.186s (46% improvement from baseline, 1% from 128 byte buffer)
- With `BufReader`, 4096 byte internal buffer: 0.186s (46% improvement from baseline, 1% from 128 byte buffer)

`serde_json::from_vec`, in comparison, takes 0.170s including reading the file into a `Vec`.

Benchmark code can be found here -- https://gist.github.com/tgnottingham/950569d1fd85d58116b32614bddcd03a.
A 2048 byte buffer was the smallest power-of-two size at which the performance of not using a `BufReader` started to compete with using one. A 4096 byte buffer improved performance slightly more, so that the two were almost on par.

The PR currently uses a 128 byte buffer to be conservative with memory, but I'm interested in knowing whether a larger buffer size would be accepted, since it improves run time considerably when a `BufReader` isn't used.