
Very high memory usage with serde_json::Value #635

Open
Diggsey opened this issue Mar 20, 2020 · 7 comments

Comments

@Diggsey
Contributor

Diggsey commented Mar 20, 2020

Unfortunately, due to Value being completely public I don't know how much can be done about this without breaking changes. However, a couple of times I've run into problems with exceptionally high memory usage when using a Value.

I don't think there's a bug here, just that common uses seem to be much more memory intensive than similar code in dynamic languages, where this kind of data is already heavily optimised.

I think it comes from several factors:

  • Each Value is 32 bytes on a 64-bit system even though the majority of Values will be the leaf nodes (numbers, strings, nulls, etc.) which don't need that much space. If Value were more highly optimized for leaf nodes, I think this could easily be halved.

  • Maps are optimized for access time rather than space efficiency, and this is made worse because there are lots of "empty" Value slots, each of which is another 32 bytes.

  • Strings are owned. When converting from a struct with to_value, object keys will all be known statically, and those strings will already be embedded in the program as static data, so using a Cow<'static, str> could dramatically reduce memory usage.

  • Strings are exclusively owned. When deserializing into a Value it's likely that there will be lots of duplicate strings, but there is no possibility for them to be shared with the current Value representation.
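The sharing idea in the last two bullets can be sketched with a minimal string interner. This is an illustrative sketch, not serde_json code: the `Interner` type and its `intern` method are hypothetical, and a real implementation (like the tagged-pointer scheme described below) would be more compact. It only shows that duplicate keys can share a single `Arc<str>` allocation:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical interner: duplicate key strings share one allocation.
#[derive(Default)]
struct Interner {
    pool: HashMap<Arc<str>, ()>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Arc<str> {
        // Arc<str> implements Borrow<str>, so we can look up by &str.
        if let Some((existing, _)) = self.pool.get_key_value(s) {
            return Arc::clone(existing);
        }
        let arc: Arc<str> = Arc::from(s);
        self.pool.insert(Arc::clone(&arc), ());
        arc
    }
}

fn main() {
    let mut interner = Interner::default();
    // A key like "user_id" repeated across thousands of objects would
    // be stored once instead of once per object.
    let a = interner.intern("user_id");
    let b = interner.intern("user_id");
    assert!(Arc::ptr_eq(&a, &b)); // same allocation, just a refcount bump
}
```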

I think a more space-efficient Value type could be introduced. Keys could be stored as a pointer-sized union of &'static str, Arc<String> using a tag in the low bits to differentiate. The deserializer could automatically intern strings as they are deserialized. Value could be shrunk to 16 bytes, and store short strings inline. Maps could use a simple Vec representation for small numbers of elements to avoid any wasted space. The improved cache-coherency could also improve performance. All access to "compact values" should be done via methods to allow further optimisations in the future. There would also need to be a version of the json!() macro that produced this compact type.
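The 32-byte versus 16-byte claim can be checked with `std::mem::size_of` on two hypothetical enums. `NaiveValue` mirrors the layout pattern of `serde_json::Value` (largest inline payload is a 24-byte `String`, plus a tag); `CompactValue` boxes the large payloads so every variant fits in one word. These types are sketches for sizing only, not real serde_json definitions:

```rust
use std::mem::size_of;

// Mirrors the naive layout: the String variant needs ptr + len + cap
// (24 bytes), plus an 8-byte-aligned tag => 32 bytes per node.
enum NaiveValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<NaiveValue>),
    Object(Vec<(String, NaiveValue)>),
}

// Boxing the wide payloads shrinks every variant to at most one
// pointer (or an f64), so the enum fits in tag + 8 bytes = 16 bytes.
enum CompactValue {
    Null,
    Bool(bool),
    Number(f64),
    String(Box<String>),
    Array(Box<Vec<CompactValue>>),
    Object(Box<Vec<(String, CompactValue)>>),
}

fn main() {
    // On a 64-bit target:
    assert_eq!(size_of::<NaiveValue>(), 32);
    assert_eq!(size_of::<CompactValue>(), 16);
}
```

The trade-off is an extra indirection on every string/array/object access, which is part of why such a compact type would want its own accessor methods rather than exposing the representation.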

@serde-rs serde-rs deleted a comment from GopherJ Mar 21, 2020
@serde-rs serde-rs deleted a comment from GopherJ May 3, 2020
@Diggsey
Contributor Author

Diggsey commented Nov 19, 2020

@dtolnay I started working on a crate to address these issues:

https://github.com/Diggsey/ijson
https://docs.rs/ijson

It is functionally complete but needs a lot more testing, etc. to get to a point where I can recommend people actually use it. That said, it demonstrates that significant improvements are possible.

Is this something you'd be interested in bringing into serde-json some time down the line?

@rimutaka
Contributor

This came to me as a bit of an unpleasant surprise when my AWS Lambdas started running out of memory. I was sizing them based on what is being retrieved from the DB. For example, ElasticSearch returns an 8,683KB document; I deserialize it into Value and the next RAM reading gives me a delta of 98,484KB of RAM use. That's more than 10x the original size.

@dtolnay , David, is this high memory consumption a necessary price to pay for speed?
Is 561ms using from_slice() on an 8.6MB JSON string considered fast?

@Diggsey
Contributor Author

Diggsey commented Nov 29, 2021

@rimutaka serde_json is much more efficient at deserializing into structs than into the Value type, so if that is possible for your use case, it's the best option.
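A rough intuition for why structs are leaner: field names exist only at compile time, and numeric fields are stored inline at their natural size. The types below are illustrative (the `Hit` struct and `DynValue` enum are made up for this comparison; `DynValue` just mirrors the 32-byte tagged-union pattern discussed above):

```rust
use std::mem::size_of;

// Hypothetical mirror of a dynamic JSON value: every node is a
// 32-byte tagged union, and object keys are heap-allocated Strings.
enum DynValue {
    Null,
    Number(f64),
    String(String),
    Object(Vec<(String, DynValue)>),
}

// The same record as a plain struct: two inline fields, no per-node
// tags, no heap-allocated key strings.
struct Hit {
    score: f64,
    id: u64,
}

fn main() {
    // One struct record: 16 bytes total, zero allocations.
    assert_eq!(size_of::<Hit>(), 16);
    // The dynamic form costs a 32-byte node per value, plus a String
    // per key, plus the object's Vec buffer - for the same two fields.
    assert_eq!(size_of::<DynValue>(), 32);
}
```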

@rimutaka
Contributor

@Diggsey, thanks for the suggestion. Do you know if Value is more compact if I deserialize into a struct and then convert it into Value?

@Diggsey
Contributor Author

Diggsey commented Nov 29, 2021

It would only be more compact if some fields are dropped as part of the deserialization into a struct (if, say, they are not required).

@rimutaka
Contributor

Memory allocation log for processing 10MB of JSON data:

  • JSON as String of 10,812,199 bytes => +10,150KB allocated
  • JSON converted into struct => +63,996KB allocated
  • struct converted into Value => +188,162KB allocated

I can understand high memory consumption when JSON is converted into Value, because the sizes of the collections are not known up front, so more is allocated than needed to make it faster.
When a struct is converted into Value, the sizes of the collections are known in advance.
Why do we still get such a large memory overhead? Is it inevitable, or can it be improved?
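The "more is allocated than needed" effect is easy to see with a plain `Vec`, which grows by doubling its capacity, so up to roughly half the final buffer can be slack; `shrink_to_fit` reclaims it once the size is final. This is a stdlib sketch of the general mechanism, not of serde_json's internal allocation strategy:

```rust
fn main() {
    // Pushing elements one at a time doubles the capacity on demand,
    // so capacity can overshoot the used length by up to ~2x.
    let mut v = Vec::new();
    for i in 0..1_000 {
        v.push(i);
    }
    let grown = v.capacity();
    assert!(grown >= v.len());

    // Once the final size is known, the slack can be released.
    v.shrink_to_fit();
    assert!(v.capacity() >= v.len());
    assert!(v.capacity() <= grown);
}
```

When a struct is converted with `to_value`, the collection sizes are known, so the doubling slack should not apply; the remaining overhead is the per-node cost of the Value representation itself (32 bytes per node plus owned key strings, as discussed above).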

@CinchBlue

FWIW, I think I encountered this on the current version of the google_sheets4 crate -- it uses serde_json::Value, and my server goes OOM at 20GB of usage if I try to deserialize a large spreadsheet with multiple tabs.
