Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Object Keys must omit [S]tring marker #28

Closed
ghost opened this issue May 29, 2013 · 5 comments
Closed

Object Keys must omit [S]tring marker #28

ghost opened this issue May 29, 2013 · 5 comments
Milestone

Comments

@ghost
Copy link

ghost commented May 29, 2013

JSON dictates that Keys can only be Strings. Spending some time reading online it looks like non-String keys, while supported by some parsers and platforms, are problomatic and can lead to hard-to-handle issues -- namely because not ALL platforms or parsers support it.

Right now it is entirely possible to write valid UBJSON that uses non-string values as Keys. This breaks the transitivity of UBJSON to JSON, back to UBJSON again.

This proposal is two fold:

  1. Require that keys are STRINGS.
  2. Require that the S markers are omitted from objects when writing out the keys, saving 1 byte per entry.

I win here is that when moving from Draft 8 to Draft 9 when we normalized lengths on the integer numeric types, we grew Strings by 1 byte each. This would be a nice savings by re-reducing the size of objects again.

Using the example from the Type Ref page:
http://ubjson.org/type-reference/

This leads to a 7% reduction in size.

I realize this requires parsers to maintain more state than they currently have, but it shouldn't be more than a toggle indicating if a KEY or VALUE is being parsed; if the toggle is KEY then the marker should be i/I/l/L -- if it isn't, then an exception.

@ghost ghost self-assigned this May 29, 2013
@kxepal
Copy link
Collaborator

kxepal commented May 29, 2013

While it's looks oblivious optimization, I feel "explicit is better than implicit" rule worth something.

However, if Draft 10 ships with STC type, UBJSON parser will be already stateful and this optimization is acceptable.

Moreover, this optimization makes more complex to check for valid object key data: previously parser just read single byte and had check it for equality to S value. If key is malformed, you'd got assertion at this point. Now it have to check it against B/i/I/l/L values, but you get an assertion only at value decoding since encoded key may be just a number. Since my thoughts aren't clear at this night, short example:

Sample hash-map data for some generic language:

{"foo": "bar", 4: "baz"}

UBJSON data:

[{]
    [S][i][3][foo][S][i][3][bar]
    [i][4][S][i][3][baz]
[}]

Draft 9 decoder will raise assertion for [i][4] bytes sequence since it's malformed object key.

Draft 10 will count [i][4] as "string key with length 4" and will try to read 4 bytes ahead. It will count [S][i][3][b] as key and will fail on decoding [a] marker raising exception about malformed value - but it is not.

Sure the encoder should validate hash map keys before serialization, but nobody perfect.

  • 0.33(3) for this change.

@ghost
Copy link
Author

ghost commented May 29, 2013

Alex, you have an uncanny ability to think of edge cases faster than anyone I've ever met; I love it :)

I want this simplification (just like I still want #27 ) but the edge case you identified concerns me. We lose the ability to accurately/quickly communicate the exact position of a malformed construct in the stream.

You identified this edge case very quickly, makes me wonder what other edge cases we don't see yet that might bite us in the ass.

@rogusdev
Copy link

Actually, in the example given, draft 10 should raise an exception on the very second byte: [S], not on [i][4]. When it encounters the [S] byte as the beginning of the key string's length, [S] is not a valid integer type, so that is the exception here, and is in fact the exact position of the malformed data, so everything still works properly, to my mind.

@ghost
Copy link
Author

ghost commented May 30, 2013

@rogusdev True, but changing the example to read:

[{]
    [i][3][foo][S][i][3][bar]
    [i][4][S][i][3][baz]
[}]

still illustrates Alex's point fairly well. Removing the markers does make parsing a bit more sloppy in the face of invalid data unfortunately.

@ghost
Copy link
Author

ghost commented May 30, 2013

Closing this -- after discussing it at length with others it does start to make the spec sloppy.

If we end up solving this with a Typed Group or something akin to that, fine, but for now I would rather OBJECTS continue be well-defined and easily parsed than save a few bytes in the name of structure.

@ghost ghost closed this as completed May 30, 2013
@ghost ghost removed their assignment Aug 15, 2016
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants