-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Key duplication #117
Comments
Curse markdown. My example was mangled. |
nzok you are exactly right - the UBJSON community spent 3 or 4 years
discussing the Pros/Cons of supporting an schema/templated representation
and never got to consensus. That was 103% my fault not being educated
enough in format designs to force a single decision after the years of
discussion.
If you look back through this github, find the 1 or 2 issues that have 100+
replies and you'll find the conversations.
FWIW, every time I thought I "had it", and would proposed it, people
smarter than me would point out Edge Case #1 - #31 and blow my mind again
at how naive I was.
Gave me a huge amount of respect for people/teams that can knock our formal
specs without a hitch, but also gave me PTSD if I'm being frank :)
…On Mon, Mar 21, 2022 at 3:47 AM nzok ***@***.***> wrote:
Curse markdown. My example was mangled.
: byte value would set cache[byte] to value.
! byte would use cache[byte].
Since it's only worth caching something if it is used at least twice,
: byte value could also use the value it just saved.
—
Reply to this email directly, view it on GitHub
<#117 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACG26BTOWO7DKSSC53JWELVBBHTPANCNFSM5RHIVJNA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
There are two quite different things: "schemas" (express the inherent structure of the data) and "caching" (make the representation smaller). I quickly hacked together a good-enough approximation of a UBJSON writer that could be run with and without string caching. For the purpose of this experiment, I did not bother trying to design a good caching scheme, I just used the simplest thing that would not have me hanging my head in shame. Pass 1: count the strings, take the most frequent 256, and make a dictionary of those. Pass 2: write UBJSON and if a string is in the dictionary, use @n string for its first occurrence and @n! for later occurrences. The "plain" form omits pass 1 and uses an empty dictionary in pass 2. 2539325 citys.json 877674 ~/.vscode/extensions/vscjava.vscode-maven-0.34.1/resources/archetypes.json 163321 /home/ok/.vscode/extensions/mikhail-arkhipov.r-0.0.26/ls/Microsoft.R.Host.Broker.deps.json 1252174 ~/.cache/gnome-software/odrs/ratings.json I've just thought of a one-pass scheme, but the experiment has established what I wanted to know: there IS a very worthwhile saving to be had from a simple string caching scheme that knows nothing about the structure of the data. I would be happy to provide the code if that was thought to be useful. It's about 200 SLOC of Smalltalk. |
Nicely done!
This is an impl detail so if you want to link to it for anyone else working
on an impl that would be great.
…On Wed, Mar 23, 2022, 4:01 AM nzok ***@***.***> wrote:
There are two quite different things: "schemas" (express the inherent
structure of the data) and "caching" (make the representation smaller). I
quickly hacked together a good-enough approximation of a UBJSON writer that
could be run with and without string caching. For the purpose of this
experiment, I did not bother trying to design a good caching scheme, I just
used the simplest thing that would not have me hanging my head in shame.
Pass 1: count the strings, take the most frequent 256, and make a
dictionary of those. Pass 2: write UBJSON and if a string is in the
dictionary, use @n <https://github.com/n> string for its first occurrence
and @n <https://github.com/n>! for later occurrences. The "plain" form
omits pass 1 and uses an empty dictionary in pass 2.
Results on some test cases:
342791 repos.json
329908 plain / 256043 cached = 1.288486699499694
2539325 citys.json
2167267 plain / 1754269 cached = 1.235424555755132
877674
~/.vscode/extensions/vscjava.vscode-maven-0.34.1/resources/archetypes.json
814368 plain / 656459 cached = 1.240546629720973
163321
/home/ok/.vscode/extensions/mikhail-arkhipov.r-0.0.26/ls/Microsoft.R.Host.Broker.deps.json
118454 plain / 70481 cached = 1.680651523105518
1252174 ~/.cache/gnome-software/odrs/ratings.json
656349 plain / 463771 cached = 1.415243730203053
I've just thought of a one-pass scheme, but the experiment has established
what I wanted to know: there IS a very worthwhile saving to be had from a
simple string caching scheme that knows nothing about the structure of the
data.
I would be happy to provide the code if that was thought to be useful.
It's about 200 SLOC of Smalltalk.
—
Reply to this email directly, view it on GitHub
<#117 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACG26GA7ULWX27XNC7CR73VBL2XNANCNFSM5RHIVJNA>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
How is it an implementation detail if it changes the spec? |
When you have a lots of objects with the same keys, JSON requires each key to be represented
over and over again, no matter how many times it is used.
UBJSON does exactly the same thing, which is very odd for a "binary" format is trying to save
space.
Just for the purpose of discussion, a decoder could maintain an array of 256 values.
: would set cache[] to value.
! would use cache[].
Pretty much this idea is used in the Erlang-oriented UBF, and it pays off very well there.
Was anything like this ever considered for UBJSON, and if not, what is the argument against it?
The text was updated successfully, but these errors were encountered: