Key duplication #117

Open
nzok opened this issue Mar 21, 2022 · 5 comments

Comments

@nzok

nzok commented Mar 21, 2022

When you have lots of objects with the same keys, JSON requires each key to be represented over and over again, no matter how many times it is used. UBJSON does exactly the same thing, which is very odd for a "binary" format that is trying to save space.

Just for the purpose of discussion, a decoder could maintain an array of 256 values.
: would set cache[] to value.
! would use cache[].

Pretty much this idea is used in the Erlang-oriented UBF, and it pays off very well there.

Was anything like this ever considered for UBJSON, and if not, what is the argument against it?

@nzok
Author

nzok commented Mar 21, 2022

Curse markdown. My example was mangled.
`: byte value` would set `cache[byte]` to value.
`! byte` would use `cache[byte]`.
Since it's only worth caching something if it is used at least twice, `: byte value` could also use the value it just saved.
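
To make the suggestion concrete, here is a rough Python sketch of what the decoder side might look like. None of this is part of the current UBJSON spec, and `read_plain` is a hypothetical stand-in for ordinary UBJSON value decoding:

```python
# Rough sketch of the proposed ':' / '!' cache markers described above.
# The stream interface and read_plain() are hypothetical placeholders.

class CachingDecoder:
    def __init__(self, stream):
        self.stream = stream
        self.cache = [None] * 256   # one slot per possible index byte

    def read_value(self):
        marker = self.stream.read(1)
        if marker == b':':          # ": byte value" -- store the value, then yield it
            index = self.stream.read(1)[0]
            value = self.read_value()
            self.cache[index] = value
            return value
        if marker == b'!':          # "! byte" -- reuse a previously cached value
            index = self.stream.read(1)[0]
            return self.cache[index]
        return self.read_plain(marker)

    def read_plain(self, marker):
        # Hypothetical: decode a normal UBJSON value introduced by `marker`.
        raise NotImplementedError
```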

@rkalla
Collaborator

rkalla commented Mar 21, 2022 via email

@nzok
Author

nzok commented Mar 23, 2022

There are two quite different things: "schemas" (which express the inherent structure of the data) and "caching" (which makes the representation smaller). I quickly hacked together a good-enough approximation of a UBJSON writer that can be run with and without string caching. For the purpose of this experiment I did not bother trying to design a good caching scheme; I just used the simplest thing that would not have me hanging my head in shame.

Pass 1: count the strings, take the most frequent 256, and make a dictionary of those.
Pass 2: write UBJSON and, if a string is in the dictionary, use `@n string` for its first occurrence and `@n!` for later occurrences.

The "plain" form omits pass 1 and uses an empty dictionary in pass 2.
Results on some test cases:
| Input JSON file | JSON size (bytes) | Plain UBJSON | Cached UBJSON | Plain / cached |
|---|---|---|---|---|
| repos.json | 342791 | 329908 | 256043 | 1.2885 |
| citys.json | 2539325 | 2167267 | 1754269 | 1.2354 |
| ~/.vscode/extensions/vscjava.vscode-maven-0.34.1/resources/archetypes.json | 877674 | 814368 | 656459 | 1.2405 |
| /home/ok/.vscode/extensions/mikhail-arkhipov.r-0.0.26/ls/Microsoft.R.Host.Broker.deps.json | 163321 | 118454 | 70481 | 1.6807 |
| ~/.cache/gnome-software/odrs/ratings.json | 1252174 | 656349 | 463771 | 1.4152 |

I've just thought of a one-pass scheme, but the experiment has established what I wanted to know: there IS a very worthwhile saving to be had from a simple string caching scheme that knows nothing about the structure of the data.

I would be happy to provide the code if that was thought to be useful. It's about 200 SLOC of Smalltalk.
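
For anyone who wants to try the same measurement in another language, a rough sketch of the two passes might look like this (Python rather than the Smalltalk mentioned above; `write_plain` is a hypothetical callback that emits an ordinary UBJSON string, and the exact byte layout of the `@n` forms is only illustrative):

```python
from collections import Counter

def collect_strings(value, counts):
    """Pass 1 helper: count every string (object keys and string values)."""
    if isinstance(value, str):
        counts[value] += 1
    elif isinstance(value, dict):
        for k, v in value.items():
            counts[k] += 1
            collect_strings(v, counts)
    elif isinstance(value, list):
        for v in value:
            collect_strings(v, counts)

def build_dictionary(document, size=256):
    """Pass 1: map the `size` most frequent strings to one-byte indices."""
    counts = Counter()
    collect_strings(document, counts)
    return {s: i for i, (s, _) in enumerate(counts.most_common(size))}

def encode_string(s, dictionary, seen, write_plain):
    """Pass 2 helper: '@n string' on first use, '@n!' afterwards,
    and the plain form otherwise.  An empty dictionary gives the "plain" run."""
    n = dictionary.get(s)
    if n is None:
        return write_plain(s)
    if n in seen:
        return b'@' + bytes([n]) + b'!'
    seen.add(n)
    return b'@' + bytes([n]) + write_plain(s)
```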

@rkalla
Collaborator

rkalla commented Mar 23, 2022 via email

@DylanYoung

How is it an implementation detail if it changes the spec?
