
Very high memory usage with serde_json::Value #635

Open
Diggsey opened this issue Mar 20, 2020 · 7 comments

Comments

@Diggsey
Contributor

Diggsey commented Mar 20, 2020

Unfortunately, due to Value being completely public I don't know how much can be done about this without breaking changes. However, a couple of times I've run into problems with exceptionally high memory usage when using a Value.

I don't think there's a bug here, just that common uses seem to be much more memory intensive than similar code in dynamic languages, where this kind of data is already heavily optimised.

I think it comes from several factors:

  • Each Value is 32 bytes on a 64-bit system even though the majority of Values will be the leaf nodes (numbers, strings, nulls, etc.) which don't need that much space. If Value were more highly optimized for leaf nodes, I think this could easily be halved.

  • Maps are optimized for access time rather than space efficiency, and this is made worse because there are lots of "empty" Value slots, each of which is another 32 bytes.

  • Strings are owned. When converting from a struct with to_value, object keys will all be known statically, and those strings will already be embedded in the program as static data, so using a Cow<'static, str> could dramatically reduce memory usage.

  • Strings are exclusively owned. When deserializing into a Value it's likely that there will be lots of duplicate strings, but there is no possibility for them to be shared with the current Value representation.
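The sharing idea in the last two bullets can be sketched with a minimal string interner. This is an illustrative sketch, not serde_json code: the `Interner` type and its `intern` method are hypothetical, and a real implementation (like the tagged-pointer scheme described below) would be more compact. It only shows that duplicate keys can share a single `Arc<str>` allocation:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical interner: duplicate key strings share one allocation.
#[derive(Default)]
struct Interner {
    pool: HashMap<Arc<str>, ()>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Arc<str> {
        // Arc<str> implements Borrow<str>, so we can look up by &str.
        if let Some((existing, _)) = self.pool.get_key_value(s) {
            return Arc::clone(existing);
        }
        let arc: Arc<str> = Arc::from(s);
        self.pool.insert(Arc::clone(&arc), ());
        arc
    }
}

fn main() {
    let mut interner = Interner::default();
    // A key like "user_id" repeated across thousands of objects would
    // be stored once instead of once per object.
    let a = interner.intern("user_id");
    let b = interner.intern("user_id");
    assert!(Arc::ptr_eq(&a, &b)); // same allocation, just a refcount bump
}
```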

I think a more space-efficient Value type could be introduced. Keys could be stored as a pointer-sized union of &'static str, Arc<String> using a tag in the low bits to differentiate. The deserializer could automatically intern strings as they are deserialized. Value could be shrunk to 16 bytes, and store short strings inline. Maps could use a simple Vec representation for small numbers of elements to avoid any wasted space. The improved cache-coherency could also improve performance. All access to "compact values" should be done via methods to allow further optimisations in the future. There would also need to be a version of the json!() macro that produced this compact type.
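The 32-byte versus 16-byte claim can be checked with `std::mem::size_of` on two hypothetical enums. `NaiveValue` mirrors the layout pattern of `serde_json::Value` (largest inline payload is a 24-byte `String`, plus a tag); `CompactValue` boxes the large payloads so every variant fits in one word. These types are sketches for sizing only, not real serde_json definitions:

```rust
use std::mem::size_of;

// Mirrors the naive layout: the String variant needs ptr + len + cap
// (24 bytes), plus an 8-byte-aligned tag => 32 bytes per node.
enum NaiveValue {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<NaiveValue>),
    Object(Vec<(String, NaiveValue)>),
}

// Boxing the wide payloads shrinks every variant to at most one
// pointer (or an f64), so the enum fits in tag + 8 bytes = 16 bytes.
enum CompactValue {
    Null,
    Bool(bool),
    Number(f64),
    String(Box<String>),
    Array(Box<Vec<CompactValue>>),
    Object(Box<Vec<(String, CompactValue)>>),
}

fn main() {
    // On a 64-bit target:
    assert_eq!(size_of::<NaiveValue>(), 32);
    assert_eq!(size_of::<CompactValue>(), 16);
}
```

The trade-off is an extra indirection on every string/array/object access, which is part of why such a compact type would want its own accessor methods rather than exposing the representation.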

@serde-rs serde-rs deleted a comment from GopherJ Mar 21, 2020
@serde-rs serde-rs deleted a comment from GopherJ May 3, 2020
@Diggsey
Contributor Author

Diggsey commented Nov 19, 2020

@dtolnay I started working on a crate to address these issues:

https://github.com/Diggsey/ijson
https://docs.rs/ijson

It is functionally complete but needs a lot more testing, etc. to get to a point where I can recommend people actually use it. That said, it demonstrates that significant improvements are possible.

Is this something you'd be interested in bringing into serde-json some time down the line?

@rimutaka
Contributor

This came to me as a bit of an unpleasant surprise when my AWS Lambdas started running out of memory. I was sizing them based on what is being retrieved from the DB. For example, ElasticSearch returns an 8,683KB document; I deserialize it into Value and the next RAM reading gives me a delta of 98,484KB of RAM use. That's more than 10x the original size.

@dtolnay , David, is this high memory consumption a necessary price to pay for speed?
Is 561ms using from_slice() on an 8.6MB JSON string considered fast?

@Diggsey
Contributor Author

Diggsey commented Nov 29, 2021

@rimutaka serde_json is much more efficient at deserializing into structs than into the Value type, so if that is possible for your use case, it's the best option.
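A rough intuition for why structs are leaner: field names exist only at compile time, and numeric fields are stored inline at their natural size. The types below are illustrative (the `Hit` struct and `DynValue` enum are made up for this comparison; `DynValue` just mirrors the 32-byte tagged-union pattern discussed above):

```rust
use std::mem::size_of;

// Hypothetical mirror of a dynamic JSON value: every node is a
// 32-byte tagged union, and object keys are heap-allocated Strings.
enum DynValue {
    Null,
    Number(f64),
    String(String),
    Object(Vec<(String, DynValue)>),
}

// The same record as a plain struct: two inline fields, no per-node
// tags, no heap-allocated key strings.
struct Hit {
    score: f64,
    id: u64,
}

fn main() {
    // One struct record: 16 bytes total, zero allocations.
    assert_eq!(size_of::<Hit>(), 16);
    // The dynamic form costs a 32-byte node per value, plus a String
    // per key, plus the object's Vec buffer - for the same two fields.
    assert_eq!(size_of::<DynValue>(), 32);
}
```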

@rimutaka
Contributor

@Diggsey, thanks for the suggestion. Do you know if Value is more compact if I deserialize into a struct and then convert it into Value?

@Diggsey
Contributor Author

Diggsey commented Nov 29, 2021

It would only be more compact if some fields are dropped as part of the deserialization into a struct (if, say, they are not required).

@rimutaka
Contributor

Memory allocation log for processing 10MB of JSON data:

  • JSON as String of 10,812,199 bytes => +10,150KB allocated
  • JSON converted into struct => +63,996KB allocated
  • struct converted into Value => +188,162KB allocated

I can understand high memory consumption when JSON is converted into Value, because the sizes of the collections are not known up front, so more is allocated than needed to make it faster.
When a struct is converted into Value, the sizes of the collections are known in advance.
Why do we still get such a large memory overhead? Is it inevitable, or can it be improved?
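The "more is allocated than needed" effect is easy to see with a plain `Vec`, which grows by doubling its capacity, so up to roughly half the final buffer can be slack; `shrink_to_fit` reclaims it once the size is final. This is a stdlib sketch of the general mechanism, not of serde_json's internal allocation strategy:

```rust
fn main() {
    // Pushing elements one at a time doubles the capacity on demand,
    // so capacity can overshoot the used length by up to ~2x.
    let mut v = Vec::new();
    for i in 0..1_000 {
        v.push(i);
    }
    let grown = v.capacity();
    assert!(grown >= v.len());

    // Once the final size is known, the slack can be released.
    v.shrink_to_fit();
    assert!(v.capacity() >= v.len());
    assert!(v.capacity() <= grown);
}
```

When a struct is converted with `to_value`, the collection sizes are known, so the doubling slack should not apply; the remaining overhead is the per-node cost of the Value representation itself (32 bytes per node plus owned key strings, as discussed above).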

@CinchBlue

FWIW, I think I encountered this on the current version of the google_sheets4 crate -- it uses serde_json::Value, and my server goes OOM at 20GB of usage if I try to deserialize a large spreadsheet with multiple tabs.
