Key duplication #117

Open
nzok opened this issue Mar 21, 2022 · 5 comments

Comments

@nzok

nzok commented Mar 21, 2022

When you have lots of objects with the same keys, JSON requires each key to be represented over and over again, no matter how many times it is used. UBJSON does exactly the same thing, which is very odd for a "binary" format that is trying to save space.

Just for the purpose of discussion, a decoder could maintain an array of 256 values.
: would set cache[] to value.
! would use cache[].

Pretty much this idea is used in the Erlang-oriented UBF, and it pays off very well there.

Was anything like this ever considered for UBJSON, and if not, what is the argument against it?

@nzok
Author

nzok commented Mar 21, 2022

Curse markdown. My example was mangled.
`: byte value` would set `cache[byte]` to value.
`! byte` would use `cache[byte]`.
Since it's only worth caching something if it is used at least twice, `: byte value` could also use the value it just saved.
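
To make the suggestion concrete, here is a rough Python sketch of what the decoder side might look like. None of this is part of the current UBJSON spec, and `read_plain` is a hypothetical stand-in for ordinary UBJSON value decoding:

```python
# Rough sketch of the proposed ':' / '!' cache markers described above.
# The stream interface and read_plain() are hypothetical placeholders.

class CachingDecoder:
    def __init__(self, stream):
        self.stream = stream
        self.cache = [None] * 256   # one slot per possible index byte

    def read_value(self):
        marker = self.stream.read(1)
        if marker == b':':          # ": byte value" -- store the value, then yield it
            index = self.stream.read(1)[0]
            value = self.read_value()
            self.cache[index] = value
            return value
        if marker == b'!':          # "! byte" -- reuse a previously cached value
            index = self.stream.read(1)[0]
            return self.cache[index]
        return self.read_plain(marker)

    def read_plain(self, marker):
        # Hypothetical: decode a normal UBJSON value introduced by `marker`.
        raise NotImplementedError
```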

@rkalla
Collaborator

rkalla commented Mar 21, 2022 via email

@nzok
Author

nzok commented Mar 23, 2022

There are two quite different things: "schemas" (which express the inherent structure of the data) and "caching" (which makes the representation smaller). I quickly hacked together a good-enough approximation of a UBJSON writer that can be run with and without string caching. For the purpose of this experiment I did not bother trying to design a good caching scheme; I just used the simplest thing that would not have me hanging my head in shame.

Pass 1: count the strings, take the most frequent 256, and make a dictionary of those.
Pass 2: write UBJSON and, if a string is in the dictionary, use `@n string` for its first occurrence and `@n!` for later occurrences.

The "plain" form omits pass 1 and uses an empty dictionary in pass 2.
Results on some test cases:
| Input JSON file | JSON size (bytes) | Plain UBJSON | Cached UBJSON | Plain / cached |
|---|---|---|---|---|
| repos.json | 342791 | 329908 | 256043 | 1.2885 |
| citys.json | 2539325 | 2167267 | 1754269 | 1.2354 |
| ~/.vscode/extensions/vscjava.vscode-maven-0.34.1/resources/archetypes.json | 877674 | 814368 | 656459 | 1.2405 |
| /home/ok/.vscode/extensions/mikhail-arkhipov.r-0.0.26/ls/Microsoft.R.Host.Broker.deps.json | 163321 | 118454 | 70481 | 1.6807 |
| ~/.cache/gnome-software/odrs/ratings.json | 1252174 | 656349 | 463771 | 1.4152 |

I've just thought of a one-pass scheme, but the experiment has established what I wanted to know: there IS a very worthwhile saving to be had from a simple string caching scheme that knows nothing about the structure of the data.

I would be happy to provide the code if that was thought to be useful. It's about 200 SLOC of Smalltalk.
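
For anyone who wants to try the same measurement in another language, a rough sketch of the two passes might look like this (Python rather than the Smalltalk mentioned above; `write_plain` is a hypothetical callback that emits an ordinary UBJSON string, and the exact byte layout of the `@n` forms is only illustrative):

```python
from collections import Counter

def collect_strings(value, counts):
    """Pass 1 helper: count every string (object keys and string values)."""
    if isinstance(value, str):
        counts[value] += 1
    elif isinstance(value, dict):
        for k, v in value.items():
            counts[k] += 1
            collect_strings(v, counts)
    elif isinstance(value, list):
        for v in value:
            collect_strings(v, counts)

def build_dictionary(document, size=256):
    """Pass 1: map the `size` most frequent strings to one-byte indices."""
    counts = Counter()
    collect_strings(document, counts)
    return {s: i for i, (s, _) in enumerate(counts.most_common(size))}

def encode_string(s, dictionary, seen, write_plain):
    """Pass 2 helper: '@n string' on first use, '@n!' afterwards,
    and the plain form otherwise.  An empty dictionary gives the "plain" run."""
    n = dictionary.get(s)
    if n is None:
        return write_plain(s)
    if n in seen:
        return b'@' + bytes([n]) + b'!'
    seen.add(n)
    return b'@' + bytes([n]) + write_plain(s)
```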

@rkalla
Collaborator

rkalla commented Mar 23, 2022 via email

@DylanYoung

How is it an implementation detail if it changes the spec?
