
Containers should support optional SIZE and optional TYPE markers #27

Closed
thebuzzmedia opened this Issue · 26 comments

6 participants

@thebuzzmedia

Born out of the conversation with Chris/Alex/Bjorn/Michael on #26 (a spiritual successor to #16, which constantly invokes #13). The crux of this proposal is actually based on Bjorn's suggestion from two months ago (#16 (comment))

Summary

We continue to talk about a "3rd type" of container, which we have been referring to as a "Strongly Typed Container" (STC), that would provide compact storage for homogeneous types (including support for binary data as a compact list of int8 values).

In reality this is nothing more than a highly specialized ARRAY construct that has different semantics than a standard ARRAY -- it requires both a TYPE and a LENGTH to be specified.

After the discussion with Chris Dunn on #26 about the reality that having lengths on the ARRAY and OBJECT types (as was the case up until Draft 9) was, more often than not, an easier and more optimized way to shuttle data around in UBJSON, I had the following thought, which I think addresses all 3 concerns.

Proposal

Forget adding a 3rd container type, instead, evolve the definition of ARRAY and OBJECT to the following:

ARRAY Container Definition

[[]
    [i][1] // 1
    [i][2] // 2
    [i][3] // 3
[]]

[[][#][i][3] // LENGTH, child element count (no end marker needed)
    [i][1] // 1
    [i][2] // 2
    [i][3] // 3

[[][$][i][#][i][3] // TYPE & LENGTH
    [1] // 1
    [2] // 2
    [3] // 3
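To make the size trade-off concrete, here is a minimal Python sketch (helper names are mine, and int8 ('i') elements and an int8 count are assumed) that renders [1, 2, 3] in the three proposed ARRAY forms:

```python
import struct

# Size comparison of the three proposed ARRAY forms for [1, 2, 3].
# Helper names are hypothetical; int8 ('i') elements/counts assumed.

def arr_plain(values):
    # [[] [i][v] ... []] -- every element carries its own type marker
    body = b"".join(b"i" + struct.pack("b", v) for v in values)
    return b"[" + body + b"]"

def arr_length(values):
    # [[][#][i][count] ... -- length header, no end marker
    body = b"".join(b"i" + struct.pack("b", v) for v in values)
    return b"[#i" + struct.pack("b", len(values)) + body

def arr_type_length(values):
    # [[][$][i][#][i][count] ... -- type + length, bare payload bytes
    body = b"".join(struct.pack("b", v) for v in values)
    return b"[$i#i" + struct.pack("b", len(values)) + body

# For [1, 2, 3]: 8, 10 and 9 bytes respectively.
```

For n int8 elements the forms cost 2+2n, 4+2n and 6+n bytes respectively, so the TYPE & LENGTH form only starts winning once the array has more than four elements.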

OBJECT Container Definition

[{]
    [S][i][3][lat][d][29.976481] // "lat": 29.976481
    [S][i][4][long][d][31.131301] // "long": 31.131301
    [S][i][3][alt][d][67.0] // "alt": 67.0
[}]

[{][#][i][3] // LENGTH, child element count (no end marker needed)
    [S][i][3][lat][d][29.976481] // "lat": 29.976481
    [S][i][4][long][d][31.131301] // "long": 31.131301
    [S][i][3][alt][d][67.0] // "alt": 67.0

[{][$][d][#][i][3] // TYPE & LENGTH
    [S][i][3][lat][29.976481] // "lat": 29.976481
    [S][i][4][long][31.131301] // "long": 31.131301
    [S][i][3][alt][67.0] // "alt": 67.0

Type ([$]) must always be specified before length ([#]) if both are specified.

Any type is valid for the Type ([$]) marker, even if it is another container -- all [$] does in the ARRAY and OBJECT case is pre-set the type marker for every element in the container so the parser doesn't need to check it.
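As a sanity check on the parsing side, here is a hypothetical decoder sketch for an int8-only subset of the proposal, honoring the optional [$] and [#] headers (this is not from any real UBJSON library, and only the 'i' element type is handled):

```python
import struct

# Decoder sketch for the proposal: optional [$] (type) and [#] (length)
# headers, int8 ('i') elements only. Returns (values, next_position).

def parse_array(buf, pos=0):
    assert buf[pos:pos + 1] == b"["
    pos += 1
    elem_type, count = None, None
    if buf[pos:pos + 1] == b"$":              # optional type header
        elem_type = buf[pos + 1:pos + 2]
        pos += 2
    if buf[pos:pos + 1] == b"#":              # optional length header
        assert buf[pos + 1:pos + 2] == b"i"   # int8 count, for this sketch
        count = buf[pos + 2]
        pos += 3
    out = []
    while count is None or len(out) < count:
        if count is None and buf[pos:pos + 1] == b"]":
            return out, pos + 1               # unsized form ends with ]
        marker = elem_type
        if marker is None:                    # $ pre-sets the marker...
            marker = buf[pos:pos + 1]         # ...else read it per element
            pos += 1
        assert marker == b"i"
        out.append(struct.unpack("b", buf[pos:pos + 1])[0])
        pos += 1
    return out, pos                           # sized form: no end marker
```

Feeding it the three encodings above yields [1, 2, 3] in every case; only the byte counts differ.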

@thebuzzmedia thebuzzmedia was assigned
@kxepal
Collaborator

Summary: too complex rules and "too many ways to do single thing".

Side notes:

  1. What's the actual problem? Does the proposal really solve this problem? Are there any benchmarks?

  2. Information about an array's element count without type information is useless: you're not able to decide how much memory to allocate.

  3. Arrays with type info (STC) really could be helpful, and most languages provide special modules to handle such structures (Python and Ruby, for example).

  4. Size information for objects is useless for most implementations, since objects are not preallocatable in high-level languages (unless you count objects as arrays of two-element subarrays).

  5. Type information for objects is mostly useless: keys always have string type, while values have mixed types in 80% of cases.

  6. Having the container description on the same logical level as its elements forces the parser to read at least one additional marker (two in the worst case) to decide how the container has to be decoded. This means the parser has to be stateful, which is not a good thing for streams or for the UBJSON library itself.

Counter proposal:

Introduce a new container structure that explicitly has a HEADER section (holding byte size, types, hashes, schemas, whatever) and a BODY section that holds only the data. Old and solid, and there would explicitly be only one right way to solve a given problem.

@thebuzzmedia

Summary: I respectfully disagree, but as always am open to being convinced otherwise! Thoughts below.

  1. The problem is the same thing that STC proposes to solve (efficiency) as well as adding explicit sizing back to the containers.

  2. As I've discussed in the past, this isn't about memory allocation for the entire collection (sizeOf(childElements) + sizeOf(container)), but more efficient container allocation for pointers to all the child elements.

  3. I didn't understand this point, can you clarify?

  4. See Point 2 above. Same motivation.

  5. Doh, that is my mistake. "names" are always String. I clarified the example. Do you think we should update the UBJSON spec to clarify that inside of an object, the [S] marker must be omitted for name elements? Also, generalizations are dangerous; we can't confirm 80%, BUT I understand your point. That said, I still think templating out the object can save space, especially for large objects, and is a nice win. It helps combat the size INCREASE we had with Draft 9 when we normalized all the integer size types.

  6. Yes, absolutely a fair point, but I don't think the optional nature of these fields is a big issue -- any expense you incur in parsing 2 optional header elements is won back by NOT parsing additional marker information from the body. It will make the parsers marginally more complex, but I don't think very much so -- am I missing something?

Counter Proposal Response

I want ARRAY and OBJECT to stay synchronized in design, so what we do for one I want to do for the other. Unfortunately, TYPE and SIZE cannot always both be specified -- you might want to specify SIZE for an array and NO TYPE, sometimes just a TYPE, and sometimes both (the same goes for OBJECT) -- the spec needs to account for all those use cases.

Defining 2 new types that have optional headers is just as complex as implementing that logic on ARRAY and OBJECT directly.

@rogusdev

TL;DR intro summary, I'm proposing:

[A][i][3][I]
[5000]
[10000]
[999999]

[O][i][3][S][T]
[i][2][ab][F]
[i][2][cd][T]
[i][2][ef][F]


For what it's worth, I really like sticking to the TLV structure. I am less concerned with maintaining exact 1:1 compatibility with JSON (e.g. I am fine with allowing +/- infinity in UBJSON, even if it is not officially supported in JSON, given that they are valid float/double values elsewhere), but I do love JSON's simplicity, and always aspiring to retain that simplicity while still allowing useful additional optimizations for more general data interchange is very valuable to me as well.

Having read through many, if not all, of the previous issues and the comment threads that led to this current idea of STCs, I would like to suggest that the idea that an STC type -- so long as it is independent from the existing unbounded array and object containers -- would necessarily break TLV structure or 1:1 JSON compatibility (it seemed to me that various people had suggested this at one point or another) is in fact quite false.

However, modifying the structure of these unbounded arrays and objects would in fact break that TLV structure (as in the examples above), and that is confusing enough that I, for one, strongly discourage that path. Most notably, using these new [T] and [#] bytes to identify new sections just to complicate these containers seems to stray far from the simplicity that has attracted me to UBJSON. I could say more about why I think modifying the array and object types to attempt to add length or type is a bad idea (and I may yet have to, in response to any comments that arise :) ) but for now I'd rather talk about the potential beautiful simplicity I see in a new STC container as I envision it (with great thanks to those who have discussed at length before me and given rise to these ideas now):

TLV for STC:
type: STC (not modifying existing unbounded array or object containers), for discussion purposes here: [A]
length: number of children, same length format as used in strings and high precision, e.g. [i][10]
value: child type, followed by the children in just LV or V (as appropriate for the type) format. In my head, this is still TLV format because T is implied by context. And it would work extremely simply, elegantly, efficiently for both the serializer and deserializer/parser

Unbounded containers and even STCs definitely do work as children in this structure. I'm fairly confident that for those working in C or other languages where they need to allocate memory manually, it should be entirely reasonable to create an array of pointers as they begin parsing the STC, and just fill in the pointers to the individual containers as those are parsed from there.

The only places where this proposal could break down is for the 1 byte types: null, no-op, true, false. To that, my purist sort of reaction is "why would anyone want an array of 10 nulls?" followed by "if you really wanted to, that's super easy: [A][i][10][Z]" and there's nothing else required, strictly speaking. Same goes for no-ops, again, very strictly speaking, at this point of investigating.

However, booleans, covered by two different value types ([T] and [F]), make this question a touch more confusing. It would probably not be especially interesting to be restricted to creating STCs of some number of all trues or all falses. So how could we handle what we really want, which is mixing Ts and Fs? We could make this one special exception: if T or F is the given child type, either value is allowed going forward in the children, and thus all the children must be specified. I think that exception would be hardly challenging to handle, but it does seem just a tiny bit weird to make such a special case. All the same, I am fine with it, just acknowledging it. Any other thoughts on this potential challenge/quirk?
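The boolean exception being described could look like this in a Python sketch (the [A] marker, the int8 count, and the layout are assumptions from this proposal, not the current spec):

```python
# Sketch of the proposed boolean exception: if the declared child type of
# an [A]-style STC is boolean ('T' here), every child is written as its
# own one-byte T/F marker, so values can be mixed.

def encode_bool_stc(bools):
    header = b"Ai" + bytes([len(bools)]) + b"T"   # declared child type
    return header + b"".join(b"T" if b else b"F" for b in bools)

def decode_bool_stc(buf):
    assert buf[:2] == b"Ai"
    n = buf[2]
    assert buf[3:4] in (b"T", b"F")               # boolean child type
    return [c == ord("T") for c in buf[4:4 + n]]
```

The round trip preserves mixed values, which is exactly what an all-true or all-false STC could not express.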

If this special case approach is accepted, it makes the null and no-op scenarios change, as you could not just do [A][i][10][Z] -- instead, you would either have to expand the exception for true and false as the type to include null and no-op, such that no children are required for those (which is not necessarily a huge exception, if we're already allowing it for true and false), or be obligated to list out all the children, even though they would all be the same. Frankly, I'm completely ok with either approach, perhaps more so the second, to require all the identical children, because, honestly, I cannot imagine anyone ever really wanting to do that. About the only time I could imagine it happening would be if I was serializing an array of objects in my code that just turned out to all be null at the time of serialization.

I think the one real problem with the proposal I'm making above is that it only creates an STC that equates to arrays and is not immediately usable as an STC equivalent for objects. However, creating such an STC equivalent for object (we could type it [O]) would very brightly highlight the question about whether or not objects in UBJSON are strictly required to use strings as keys, because currently there is no such restriction -- and frankly, I kind of like that, despite its divergence from official JSON. If object keys must be strings, first of all, the requirement of defining the type for keys in normal unbounded objects should probably be removed, and just asserted that they will always be strings. And then the STC object has the same structure as the STC array from above, just with implicitly typed strings for keys, and the given child type only applies to the values in the object's collection of key-value pairs. But I would actually greatly prefer that key type remain open ended and instead we say that for this STC object [O] you must define both key and value child types, e.g. [O][i][10][S][S][... 10 key-value pairs] for a dictionary/map of 10 key-value pairs where both keys and values are strings.

How does that sound?

@thebuzzmedia

@rogusdev great writeup, I am going to try and address each point you brought up.

Summary: I agree on some things and disagree on others.


  • Agreed on TLV, I like it as well when we can manage it.

  • I don't see how this would break (as in "this is unusable!") the ARRAY and OBJECT constructs -- I literally have code like this now (excuse the pseudo):

switch(byte) {
    case '[': startArray(); break;
    case ']': endArray(); break;
    default: parseElement(byte);
}

I don't see how adding 2 new cases:

    case 'T': determineType(byte); break;
    case '#': setContainerSize(byte); break;

would break what we already have -- you are already scanning every marker to determine if it is the closing marker for the container you are in. If you encounter '#' you no longer check for the end marker and instead size the container in your execution environment. If you run across 'T' it is some optional detail for you to optimize parsing.
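The pseudo-Java above can be rendered as a runnable Python sketch with the two optional header cases added. It only records which branch fires; note in passing that a real parser would still have to track position, since 'T' also doubles as the true-value marker, which is the statefulness concern raised earlier in this thread:

```python
import io

# Dispatch-loop sketch of the proposal. Handler logic is replaced by
# recorded events. '#' is assumed to be followed by an int8 count
# ([i][n]), and 'T' by a single type marker byte.

def dispatch(data):
    stream, events = io.BytesIO(data), []
    while True:
        byte = stream.read(1)
        if not byte:
            return events
        if byte == b"[":
            events.append(("startArray",))
        elif byte == b"]":
            events.append(("endArray",))
        elif byte == b"T":                       # optional type header
            events.append(("determineType", stream.read(1)))
        elif byte == b"#":                       # optional size header
            events.append(("setContainerSize", stream.read(2)))
        else:
            events.append(("parseElement", byte))
```

Two new branches, as described; the rest of the loop is untouched.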

  • I agree that this makes parsing have more possible paths of execution, especially if a Type is specified -- is that what you are concerned with? That you are no longer looking for markers, but instead just looking for V or LV? I think this is a minor cost for what could potentially be a very big win both in size reduction of UBJSON as well as finally getting a nice/cohesive support for Binary into the spec in a compatible way that doesn't introduce any special case containers (e.g. STC)

  • For 1-byte values, I agree with you, let's just have them directly in the container. Don't worry about deduplication.

  • I hate introducing incompatibility with JSON, but you bring up an interesting case where people utilizing UBJSON MIGHT very well want their names to be something other than a string. Right now (until someone talks me out of it) I think it could be an interesting use-case to support specifying both the NAME and VALUE types... the only problem is that this will become one of those edge cases where some Generator written in Python is generating UBJSON that the JavaScript parser (or Go or Erlang... just examples) can't parse because those authors ASSUMED the name fields were all Strings (just like JSON). This is really dangerous long term... I think I am 60/40 against this idea right now because of the potential pain for an unforeseeable win.

If someone is in a closed system and REALLY wants NAMEs to be something other than Strings, they can implement their custom 1-off no problem.

  • In closing, I hate the idea of re-introducing A/O again (or whatever other markers we come up with) -- we JUST removed them in Draft 9 to simplify and the biggest reason for me is that the parsing logic will have to handle OPTIONAL type and OPTIONAL length, so it will be just as complex as the unbounded container plus the additional logic.

We cannot require both TYPE and LENGTH, there are perfectly good reasons for each case making them optional, or having them both.

This is why I want to overload the existing ARRAY and OBJECT.

@rogusdev

When I said "break" for the Array and Object types, I was talking very directly about the TLV model for those two types. I never said "this is unusable", but I did very much mean that doing what you've described seemed to me to break the TLV pattern.

However, when you described adding those two case statements for 'T' and '#', my first reaction was that to do so would literally mean that 'T' and '#' would become value types in and of themselves. My second reaction was "ugh, that would be gross" -- if we're adding two new value types just for the beginning of arrays and objects, that sounds to me like a worse choice (on the basis of less transparency, at the very least) than simply adding two new container types as I have described. But that was just my second reaction, so we'll come back to this idea in a moment... ;)

Admittedly, the two new container types I described required both type and length, but, well, I guess I personally was not as concerned about specifying length without specifying type as others might be. I can see the advantage to that, so here's a relatively simple change to my suggestion so as to specify length without type: you could add yet another exception, now for when the no-op type is the child type, to indicate that no type was required when specifying length is desired but type is not. I definitely cannot envision a scenario where anyone should ever, ever use no-op as the type in an STC, so this seems valid to me.

But, very importantly, you can NOT specify type without specifying length! I thought that had been made pretty solidly clear in #25, @breese's comments. The breaking example is trivial to generate for any scenario: pick a container closing char, it has an ascii (byte) value of some integer, now strongly type your container to integer and with no specified length. How will you know when you have hit the container closing char byte vs just another integer byte in the sequence?... Without a length, you can not know. So you can not have strongly typed containers without length.
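Here is a concrete Python illustration of that ambiguity (the [A]-style layout is hypothetical): the int8 value 93 is 0x5D, the very same byte as ']', so a naive length-less parser truncates the data:

```python
# Strongly typed int8 array with NO length: the value 93 (0x5D) collides
# with the ']' end marker, so a naive parser cannot tell data from
# delimiter. Hypothetical [A]-style layout.

payload = b"Ai" + bytes([7, 93, 8]) + b"]"   # intended contents: 7, 93, 8

def naive_parse(buf):
    out = []
    for c in buf[2:]:          # skip the [A][i] header
        if c == ord("]"):      # ...but 93 IS ord(']')
            return out
        out.append(c)
    return out

# naive_parse(payload) returns [7]: everything after 93 is silently lost.
```

No amount of cleverness fixes this without escaping or a length, which is exactly the point being made above.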

However, my third reaction was that, if one was willing to accept that 'T' and '#' were indeed two new value types, that could actually resurrect @adilbaig 's type groups suggestion, also from #25, where you could specify a type before a block of values -- but then you'd also need the number of them to follow, as per the discussion above re: strong typing without length (and knowing that, you'd probably want to just combine T and # into one value type), implicitly typing those subsequent values -- until you specified a new type with the 'T' value to set the next grouping, which could still be in the same container. And then after a grouping list was done, new values would just have to go back to defining their own types -- until yet another type grouping was declared, potentially.

But then my fourth reaction was that typing for groups of values, while certainly a nice bonus for reduced number of bytes for the serialized data, does not likely solve one of the fundamental motivations for those deserializing/parsing in languages where they want to know what the type is in advance, before they start processing all the members of the container, so they can allocate space in the array to hold the data as it is read.

So, at this point, I'm kind of intrigued by the new value type of 'T' that specifies both type and length to declare a type group that could be inside a current array or object container (again, no need for '#' as well, since you have to have both together anyway in the type group scenario I described). But I'm personally still more interested in just being more explicit about new STC types: the [A] and [O] as I described with the one adjustment to specify that using the no-op type as the child type means that type is required for all child elements (no implicit typing).

@rogusdev

Oh and as for the key in the key-value pairs (it's not a "NAME", it's a key), the issue I'm bringing up is that in fact no one should have to assume one way or another at all -- we should be explicit. If someone's writing a parser in one language, and a different person is writing a serializer in another language (not that the difference in languages matters), both should be working off the same specs, not assumptions. In this case, it is extremely trivial in every language I know of to make the "object"/dictionary/map/hash keys be any of the valid value types UBJSON allows (e.g. python: http://docs.python.org/2/tutorial/datastructures.html#dictionaries, ruby: http://ruby-doc.org/core-2.0/Hash.html, java: http://docs.oracle.com/javase/6/docs/api/java/util/Map.html, php: http://php.net/manual/en/language.types.array.php, c#: http://msdn.microsoft.com/en-us/library/xfhwa508.aspx, etc). While most (all?) languages will not allow containers as keys without some requirements on the nature of those containers (namely, that they define equality in some way other than being the same reference/pointer), absolutely none other than javascript will ever restrict keys in a dictionary to be just strings. Since I am coming to UBJSON explicitly to use it as a storage mechanism for arbitrary data, I would greatly, greatly, greatly prefer to not be restricted on what types my keys can be (I will happily accept not keying off of containers -- which leaves keying off of strings, numbers and possibly booleans). At the same time, it seems eminently reasonable to me that in the different languages that might have even stronger restrictions, it is completely ok in my mind for the library maintainer to simply say "due to restrictions in [language X], we cannot support the use of non-strings as keys in objects". But that should be a language restriction, not a spec restriction.
And, most importantly, even if you disagree with me on this point about allowing non-strings as keys, we should nevertheless be explicit in the spec about the requirements on keys, one way or the other.

@thebuzzmedia

Chris, I really appreciate you taking the time to clarify your points in such detail, this is the only way the spec will get better. I will try to respond in paragraph order as best I can.

  • "Ugh, gross" - it is actually very important to me how the spec feels to people. Do they "get it" on the first read-through, or do all the special-case addendums make them start to feel confused and annoyed? So I appreciate this gut-feeling call-out. I agree with you: it starts to add complexity that might look/feel unfriendly. That said, I am not sold that this needs to be a new type (yet).

  • TYPE w/o LENGTH - great catch, I had forgotten the case of ints. So the proposed containers would have variations:

    • ARRAY
    • ARRAY-LENGTH
    • ARRAY-TYPE-LENGTH
    • OBJECT
    • OBJECT-LENGTH
    • OBJECT-TYPE-LENGTH
  • Non-Strings as Keys - Moving this point here because it applies to my next point. I am sorry I wasn't clear here. I didn't mean we wouldn't clarify in the spec how this should be done; I meant that allowing non-strings can open the door (a crack) to some really painful cross-compatibility issues. To remain compatible with JSON and be easily supported/coded/understood on language platforms that don't easily support non-string keys, UBJSON will require that Object keys continue to be STRING or CHAR fields.

As I mentioned before, if someone in a closed system wanted to optimize around this with a custom generator/parser behavior that would be fine -- something akin to how the JSON working group handles all the requests for adding comments to the spec "If your parser/generator wants to handle them, that is fine, but it is not part of the spec".

  • Type Groupings - Interesting... so you are proposing a new construct, for the sake of argument, that looks something like the following and works inside both ARRAY and OBJECT containers:
[T][#][i][3]
[T][i][#][i][3]

Resulting in containers that could potentially look like:

[[][T][#][i][3]
    [i][1]
    [i][2]
    [i][3]
[]]

[[][T][i][#][i][3]
    [1]
    [2]
    [3]
[]]

[{][T][#][i][3]     
    [i][1][a][i][1] // STRING key implied, no marker.
    [i][1][b][i][2]
    [i][1][c][i][3]
[}]

[{][T][i][#][i][3]
    [i][1][a][1] // STRING key implied, no marker.
    [i][1][b][2]
    [i][1][c][3]
[}]

The potential impacts of this design, in the context of a container, are the following:
1. T-blocks can occur 0->* times inside of a container.
2. T-blocks can occur mixed with standard field values inside of a container (for example 100 unit T-Block, then 4 additional/different field values in an object).

I think the implementation fallout from this is more complexity than just augmenting the definitions of ARRAY and OBJECT to have two additional forms (LENGTH and TYPE-LENGTH).

Summary

This discussion has helped clarify a number of things in my head:
1. LENGTH and TYPE-LENGTH are the only valid scenarios (there is no TYPE-only scenario)
2. STRING keys are required in OBJECTs, both to maintain compatibility across all language platforms (especially JavaScript) and continue 1:1 compatibility with JSON, and to allow the stripping of the 'S' marker off of OBJECT keys in the UBJSON itself. Combining that with the proposed modifications can lead to a nice level of compaction.

It is entirely possible that I am stubborn like a donkey with a thick skull, but I am still not convinced that adding those variations of ARRAY and OBJECT will introduce significant complexity to the spec, generating or parsing code.

If you are 100% convinced it will, please try and help me understand what I am missing.

@thebuzzmedia

ADDITIONAL THOUGHT: @rogusdev I was just thinking more about what you said (and what I understood you to be proposing) and @adilbaig's suggestion in #25.

Help me crystallize this in my brain (call out anything I'm getting wrong):

  1. Leave ARRAY and OBJECT alone. They have one, simple definition. (start/end markers)
  2. Define a new Typed Group (or "Chunk" or "Group" or whatever name we want to use) that can exist 0->* times inside of an ARRAY or OBJECT.
  3. Typed Groups are defined as a LENGTH, or TYPE-LENGTH construct (single TYPE to be compatible with both ARRAY and OBJECT because OBJECT keys are always String as mentioned in a previous point)
  4. Allow a mix of Typed Groups and standard values inside any of the containers.
  5. This provides slightly more flexibility in that containers can benefit from the optimizations of STC-style containers while still allowing JSON containers like ARRAY to contain a non-homogeneous collection of items if they want.

The flexibility of this while maintaining the integrity and simplicity of ARRAY or OBJECT is appealing to me the more I think about it... I don't love the complexity of the parsing of containers now (might have a group, might just be elements, if inside group, need to know if you were in an object scope or not so you don't need to look for STRING markers on the keys, etc. etc.) but if the win is big, it might not be that bad.

@rogusdev

@thebuzzmedia your "crystallized thoughts" post is certainly lining up with my thoughts, thanks :) -- I will point out the lingering concerns I have with your proposals.

A brief note on naming: "Chunking" as was discussed in #10 is a different potential container that should be discussed independently, and it is a potentially intriguing idea, so let's not name anything like that here, so as to not confuse things. (This is a challenging enough problem already.)

Type groups are, to my mind, only an optimization for serialized data length reduction. They would add decidedly non-trivial complexity to deserializing/parsing, as you noted already. That has two very, very important implications:

1) I see no reason for type groups that only define length: that merely adds to the data size while also dramatically increasing complexity for parsing it. (Despite some potential very minor convenience gains in memory allocation for those in C/etc, the complications of managing such state far outweigh the intended convenience, imo.) Lose-lose, in my book.

2) Using type groups in array or object does not actually, strictly speaking, solve the problem STCs were intended to solve, as I understand it: that, again when implementing manually memory managed (alliteration! :) ) parsing of UBJSON, or as the consumer of emitted STC data, it is quite advantageous to know that you have a completely homogeneous collection. Type groups do NOT give you that advantage. You must have a proper STC type for that, to guarantee that every child value is for sure of the exact same type.

Therefore, I am still very much proposing adding in STC types for array and object, e.g. [A] and [O].

Very important additional point: your currently proposed syntax for type group declaration is exactly the thing that I was referring to that breaks TLV -- you are putting the value (the type) before the length in all your examples. "TVL", especially if V is optional "T(V)L", is quite undesirable to my mind, please let's not do that. Let's stay with TLV everywhere.

So, I'm still proposing type groups, here's the syntax I propose, adapting your examples from above (no examples without type, per my first bullet above):

[[]
    [S][i][3][abc]
    [T][i][3][i]
    [1]
    [2]
    [3]
    [C][d]
[]]
== [ "abc", 1, 2, 3, 'd' ]

// STRING key implied, no marker
// TYPE GROUP goes around KEY-VALUE pairs, but only implies type of VALUEs
[{]
    [i][1][z][S][i][3][abc]
    [T][i][3][i]
    [i][1][a][1]
    [i][1][b][2]
    [i][1][c][3]
    [i][1][d][C][d]
[}]
== { z: "abc", a: 1, b: 2, c: 3, d: 'd' }

Note that with the type group in these cases, it is actually more bytes than it would have been without the type group. These suffice as examples all the same, with this caveat, but skillful serializers should ideally only create type groups for sequences of the same data type long enough that there is in fact a benefit!
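A quick back-of-envelope check of this caveat, assuming int8 elements and an int8 count in a [T][i][n][i] header (the exact header widths are my assumption):

```python
# For int8 elements, a type group costs a 4-byte header ([T][i][n][i])
# plus n bare payload bytes, versus 2n bytes for plain per-element
# encoding ([i][value] each). Larger counts would widen the header.

def grouped_size(n):
    return 4 + n      # [T][i][n][i] header + n bare values

def plain_size(n):
    return 2 * n      # n x ([i][value])

# n=3: 7 vs 6 (group loses); n=5: 9 vs 10 (group wins); n=100: 104 vs 200.
```

Under these assumptions the group only pays off from five int8 elements up, which matches the advice above about large enough sequences.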

Aside: at this point, I am tempted to be willing to accept keys only being strings. While as a purist I might like the flexibility of keys being different types, in practice I envision very few instances where I would want that, and the advantage of reducing the data transfer by removing those extra bytes might be worth it, as well as possible conveniences in data consumption. My only lingering concern then is that this does add noticeable overhead to parsing "object"/dictionary/map data, if we do not also introduce some kind of implicit typing elsewhere. (I am referring to the idea of parsing just an LV without the T, which would otherwise be a concept not found elsewhere.) Of course, by the same token, if we do add that restriction, it inevitably makes STCs and type groups easier to implement from there.

@adilbaig
Collaborator

Late to the party :)

@thebuzzmedia - While the type group suggestion is certainly an improvement/simplification over mine, I think this can still be made simpler.

I agree with @rogusdev's comments: type groups definitely need lengths to work, and pervasive use of TLV parsing is preferred over exceptional cases.

Where I agree is:

  • Arrays and Objects should always be defined with open/close markers.
  • Arrays MAY contain type groups. This is because for small values, plain old UBJSON encoding of types individually is more efficient. STC/Type groups only pay off above a certain size (as is clear by now).

Where I differ is:

  • Type groups should always be TYPE -> LEN -> VALUES. Nothing else can work or is simpler (as @rogusdev and @breese have said earlier).
  • Objects should NOT contain type groups at the root level. @rogusdev's example above ("TYPE GROUP goes around KEY-VALUE pairs, but only implies type of VALUEs") is complicated and breaks the simplicity of the TLV parser. It implies that TYPE groups are NOT a homogeneous group, i.e. that every OTHER value belongs to the group. It is preferable that type groups not mean something different inside an object. In my opinion, objects should just be simple key-value pairs. That keeps the parser and spec much simpler, at the cost of some bloat. Here is an example of valid object encodings:
{ name:"adil", count:1, tags : ["php", "d", "python", "erlang"] }

//Encoding 1 without TYPE groups
[{]
    [S][i][4][name][S][i][4][adil]
    [S][i][5][count][i][1][1]
    [S][i][4][tags]
        [[]
            [S][i][3][php]
            [S][i][1][d]
            [S][i][6][python]
            [S][i][6][erlang]
        []]
[}]

//Encoding 2, with TYPE groups in arrays
[{]
    [S][i][4][name][S][i][4][adil]
    [S][i][5][count][i][1][1]
    [S][i][4][tags]
        [[]
        [T][i][25][S] //Notice [i][25]? Explained below
            [i][3][php]  //5 bytes
            [i][1][d] //3 bytes
            [i][6][python] //8 bytes
            [i][6][erlang] //8 bytes
        []]
[}]

I also think it is a mistake for the LENGTH in Type Groups to mean "number of elements". It should mean "number of bytes", because:

  • how else would you know how much to pre-allocate for a series of strings?
  • it ties in with what TLV means in the rest of the spec.

Notice "[T][i][25]" in encoding 2 above: it is 1 (for the type) + 5 + 3 + 8 + 8 = 25 bytes.

@rogusdev

I definitely disagree on specifying number of bytes in the length of a type group. I don't think it would be advantageous to anyone: 1) it would be harder to serialize, since you'd need to calculate in advance exactly how long each and every element you were going to write would take up, and 2) it'd be harder to deserialize because, while you'd know how much memory to allocate, you would NOT know how many elements (pointers) you were going to need into that memory block! The second is an especially big deal because it basically makes type groups even harder for such implementations than they already would be. If you thought, "oh but wait, let's include both bytes and elements" -- the first point is still very much non-trivial.

@thebuzzmedia

@rogusdev a few thoughts:

  1. Ok let's call them Typed Group(s) -- agreed on not re-using chunks :)

  2. "No Typed Groups that only define length" -- a trivial optimization that, depending on the data, can make measurable runtime differences. Originally (up until Draft 8) all collections had a length count; that is what #26 is all about. Having LENGTH and TYPE-LENGTH is meant to address this optimization. To your point, in spirit I agree with you, but I am convinced there is a highly optimized parsing case I am not thinking of right now that someone would want to leverage, so I think it is important that it is supported. TYPE-LENGTH is an obvious winner.

  3. STC was actually intended to solve the lack of binary data in UBJSON -- as a side effect it gave us highly optimized container storage, but it was only ever envisioned in the context of an optimized ARRAY; I don't think we ever discussed object-based STCs. That is new to this discussion.

NOTE - I think I am being dense here: this is the 2nd time you've said this Typed Groups idea does not address the STC concerns, when it seems to me it does... I must be missing something. Do you just mean that a pure STC type would be even more optimal (which I agree with), or that Typed Groups don't solve the issue from a technical point of view? I agree that a pure STC is more optimized, but I must be missing why the same problems aren't solved here with Typed Groups.

  4. TLV -- I think we are missing each other's meaning here... Type-Length-Value (http://en.wikipedia.org/wiki/Type-length-value) --

STRING for example [S][i][3][bob]

  • Type: [S]
  • Length: [i][3]
  • Value: bob

In my previous example: [T][i][#][i][3]...

  • Type: [T][i]
  • Length: [#][i][3]
  • Value: ... (container payload)

What you are proposing: [T][i][3][i]...

  • Length: [i][3]
  • Type: [i]
  • Value: ... (container payload)

Did I misunderstand?
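The two orderings being compared differ only in where the implied type sits relative to the length. A minimal Python sketch of both header reads (int8 counts assumed; the [#] marker is omitted here for brevity):

```python
import io

def read_group_header_tlv(buf: io.BytesIO):
    """Type first, then length: [T][type][i][count] (TLV ordering)."""
    assert buf.read(1) == b"T"
    elem_type = buf.read(1)
    assert buf.read(1) == b"i"      # count encoded as int8 in this sketch
    return elem_type, buf.read(1)[0]

def read_group_header_ltv(buf: io.BytesIO):
    """Length first, then type: [T][i][count][type] (LTV ordering)."""
    assert buf.read(1) == b"T"
    assert buf.read(1) == b"i"
    count = buf.read(1)[0]
    return buf.read(1), count
```

Either way the parser does a fixed number of reads before the payload; the debate is purely about which order reads more naturally against the rest of the spec.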

Summary

Sometimes I can get a sense for a proposal as a discussion moves on and I am getting the sense that Typed Groups, while interesting, add complexity to containers that MOST people will find confusing.

When you are describing JSON to someone and you say "Arrays are lists and Objects are maps" they get it.

If you are describing UBJSON and have to include an addendum "unless you have a Typed Group that defines yadda yadda"... a subset of developers that don't love data formats may have their eyes glaze over.

I don't like the direction this is heading and if anything this is solidifying my desire to have this all defined at the ARRAY and OBJECT level, that way the discussion can go:

  1. ARRAYs are lists and OBJECTS are maps.
  2. conversation can optionally stop here for MOST people
  3. Oh you need a hyper optimized way to store data? Ok, well ARRAY and OBJECT actually accept 2 optional arguments to do that if you care.

I am not saying it is galvanized in my brain yet (I would like to hear others' thoughts) but this is all helping shake out the details.

From a different angle, if ARRAY and OBJECT had already been defined the way I am proposing above in the initial post (assume nicer looking markers and more clarity) and you read the spec, what would your reaction have been -- assuming there were libraries available for you to leverage, with all of this already implemented to boot?

Seems to me that once the details are worked out and all of this is behind a library for each platform, it is a big performance win both on-disk and over the wire with a simple/cohesive API (appendArray, writeOptimizedArray, etc.) without any odd terms in the APIs like "writeSTC" or "writeTypedGroup" -- non-JSON terms that make no sense to anyone except us because we discussed it for a year and a half.

This is a big issue for me -- familiarity. After the complexity of this stuff is addressed, we all get to live with the APIs that result from this and the terms we decide on for a long time to come (decades?) -- I just want to make sure we make it easy for everyone, and piggy-backing on JSON's terms/techniques/constructs will give everyone just that.

@thebuzzmedia

@adilbaig Just replied to @rogusdev and saw your update. Unfortunately there are two proposals going on in this thread... 1 proposal is the original post I made at the top and the other proposal grew out of me trying to understand @rogusdev proposal of Typed Groups.

I'd like to know your thoughts about the original post and any gaps you can see. I am not a fan of the direction the Typed Groups discussion is going.

  1. TLV -- I think I might be dyslexic... Type-Length-Value seems very clear to me, all the variable length types in UBJSON are defined that way (STRING and HUGE), yet what I understand you and @rogusdev to be suggesting is LTV.

Using your example of [T][i][37][S]
Length: [i][37]
Type: [S]
Value: container payload

"T" is a marker, it does not denote a type by itself. That might be the point of confusion we are having -- all other markers in UBJSON are themselves both a marker AND a type declaration (and in the cases of T/F/N/Z marker, type AND value).

My proposal is to declare the full type upfront: [T][S][i][37] -- so once you parse T, you know the next marker is the type of the group, then you get the length, then you get the payload (Value).

  2. Length should be BYTES not Element Count -- We had this discussion at length about a year ago when we first discussed removing the counts on ARRAY and OBJECT. The reason byte lengths were thrown out is that streaming producers who want to benefit from the compacted representation of an STC can only know the byte size of an entire grouping of data by producing, storing and counting the data before sending it -- in my discussions with the CouchDB core dev team, this was identified as a major problem (one of the reasons "Streaming Support" came into existence).

Knowing the byte size is awesome for parsing, but puts too much pain on the producer at generation time.

Because individual, variable-length elements have their own lengths associated with them, we try to bridge that optimization gap (of not having total bytes) by providing a container child count -- so at least you know the child count of the container and the size of each child individually, and don't need to constantly grow the container during parsing. The remaining gap is that you don't know the size of the ENTIRE grouping in bytes.

Because of embedded containers, this also makes things even worse: producing code has to keep state and counts for child containers. You might say "groups cannot contain containers then", which makes me like them even less... more and more special states :(
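The producer-side asymmetry is easy to see in code. With an element count, the group header can be emitted before any element is encoded; with a byte length, every element must be encoded and buffered first. A Python sketch (hypothetical function names; [T][type][i][n] header and int8 lengths assumed):

```python
def emit_with_count(values):
    """Count-prefixed group: the header is known before encoding any element."""
    out = [b"T", b"S", b"i", bytes([len(values)])]   # count known up front
    for v in values:
        data = v.encode("utf-8")
        out.append(b"i" + bytes([len(data)]) + data)
    return b"".join(out)

def emit_with_bytes(values):
    """Byte-length-prefixed group: the whole body must be buffered to size it."""
    body = b""
    for v in values:
        data = v.encode("utf-8")
        body += b"i" + bytes([len(data)]) + data
    return b"T" + b"S" + b"i" + bytes([len(body)]) + body
```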

@rogusdev

@thebuzzmedia: Ah, I see how you're viewing TLV for this now. In my head it is:

[T] == Type/Marker
[i][3] == Length
[C] == Value (the implied type for the values that follow)

The reason I think of it this way is because this TYPE GROUP is itself almost a new container, but I see now that this is unhelpful/inaccurate. More on this in a moment.

The version you have described:

[T][C] == Type (Marker + implied type for the values that follow)
[i][3] == Length
[???] == Value (the data to follow, but with implicit type on the values)

dramatically alters both serializing and deserializing to require the possibility that a "type" is more than one byte! That's a huge deal.

So, looking at all of this in that light, I realize that this new TYPE GROUP I'm suggesting is in fact almost the exact STC array we've been trying to get to, but...

The only catch is that we are currently proposing a TYPE GROUP that does two things: be a strongly typed array container AND be a type that merges its members back up into its parent. Let's not try to do both at once, but pick independent containers/whatever for each task, if we really want both.

So, if we toss out the desire for any STC object / type group inside objects just for values as @adilbaig suggests (and I'm fine with, for now at least), and focus on strongly typed array, I think [T] as I described it above, just taking out the part where it also does merging duty, totally suffices (at that point, it is exactly the same as the [A] I had proposed above, just a different marker byte). Then if we want to add the concept of merging at a later point, we could have a merging marker/type that would wrap this strongly typed container (or any other array -- possibly object inside of object?) to merge the inner container's contents up into the parent container.

But frankly, this merging idea sounds pretty complicated -- it's what's making the type group as discussed above so funky -- so I'd want to see a demo implementation of it to see how weird or not it would really turn out to be. The strongly typed array container though sounds like a super easy win right now.

Note (mostly to myself, probably): this merging idea is still not the same as chunking, a chunking container would combine all the values internally into one larger value, I've just yet to be convinced that the chunking idea would really have a great use case scenario to make it valuable. And while chunking results in a single value at the end, merging would only modify an existing parent container, without having any result value of its own, which might dramatically affect the way deserializing would have to be implemented. Again, not so sure, would need to see demo code to properly evaluate it, I think.

So, to start wrapping things up, I want to highlight why I am against the idea of modifying the existing array and/or object type definitions to include optional length and/or type: in the current extremely elegant design you have gotten to with UBJSON, there is nothing optional. Everything is required, even things like a zero length for a string (the required data is, in a certain important sense, still most definitely there: it is the bytes of empty string, which is to say, no bytes at all). I am personally strongly against adding any optional anythings at this point, notably including the ones that have been proposed here. ;)

So, I think just adding a strongly typed array container -- [A] or [T] or whatever -- that has a given element count ("length") followed by a byte for the implied type of the values to follow, and that allows N as the "implied type" to say "type is not actually implied, but do use the given length for the number of elements to follow" (while using either [T] or [F] as the implied type does not imply only the declared value/type but in fact implies that all subsequent values will be either [T] or [F])... that sounds like the goal for now. Adding containers for merging and chunking and STC objects are probably separate concerns.
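The parse loop this paragraph describes can be sketched as follows. This is illustrative Python, not a spec: it uses the [T][i][count][type] ordering from the comment above, handles only int8 and string elements, assumes int8 lengths, and treats N as "count given, types not implied":

```python
import io

def read_value(buf: io.BytesIO):
    """Plain UBJSON-style value: marker, then payload (int8 and string only)."""
    m = buf.read(1)
    if m == b"i":
        return int.from_bytes(buf.read(1), "big", signed=True)
    if m == b"S":
        assert buf.read(1) == b"i"           # int8 length for the sketch
        n = buf.read(1)[0]
        return buf.read(n).decode("utf-8")
    raise ValueError(f"unsupported marker {m!r}")

def read_typed_array(buf: io.BytesIO):
    """[T][i][count][type][values...]; type N means 'count only, markers kept'."""
    assert buf.read(1) == b"T"
    assert buf.read(1) == b"i"
    count = buf.read(1)[0]
    t = buf.read(1)
    if t == b"N":                             # length known, types not implied
        return [read_value(buf) for _ in range(count)]
    out = []
    for _ in range(count):
        if t == b"i":                         # fixed-size: raw payload byte only
            out.append(int.from_bytes(buf.read(1), "big", signed=True))
        elif t == b"S":                       # variable-size: [len-marker][len][bytes]
            assert buf.read(1) == b"i"
            n = buf.read(1)[0]
            out.append(buf.read(n).decode("utf-8"))
        else:
            raise ValueError(f"unsupported implied type {t!r}")
    return out
```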

As a small aside, I would ask that while the restriction about object/dictionary keys being only strings could be added to the specs now, the enhancement that the [S] marker byte must be left off of key values (as above, it should not be optional, it must be required to be one way or the other, otherwise parsing will get ugly) needs to be left for draft 10. (Issues of breaking backwards compatibility do not yet seem to be a concern for this spec, given the change to the meaning of markers [i], [I], [H] and [S] going from draft 8 to 9, and I personally am ok with that.)

@adilbaig
Collaborator

@adilbaig Just replied to @rogusdev and saw your update. Unfortunately there are two proposals going on in this thread... 1 proposal is the original post I made at the top and the other proposal grew out of me trying to understand @rogusdev proposal of Typed Groups.

I'd like to know your thoughts about the original post and any gaps you can see.

@thebuzzmedia - Sorry, I should have started on topic. After re-reading @cdunn2001's post I can see why this proposal was made. Given that, I think this proposal does not solve those issues. Chris needs arrays to always have lengths for debugging to be productive, optional lengths don't help (why should an encoder bother putting it in when it's optional and does not affect decoding!). It looks like we've tried to meld Chris's problem, STCs and "flowing types" from #25 (sorry for the new term! I only want to clarify what I'm saying). My gut feeling is this could be made simpler. I tried to come to a conclusion in my previous post but I'll retry here.

  • Arrays should have start/end markers, period. '[', ']'. They can contain any data type (including Type Groups, clarified below), just like JSON. They have no notion of length (for streaming purposes), and I don't like the idea of two arrays as @cdunn2001 suggested. There shouldn't be two ways to encode/decode the same thing; my feeling is this looks like a wart in the spec to new parser writers.
  • Objects - The preface in this thread suggests type hints for objects, like this:
[{][T][d] // Start, optional types of values, no markers needed
    [i][3][lat][29.976481] // "lat": 29.976481
    [i][4][long][31.131301] // "long": 31.131301
    [i][3][alt][67.0] // "alt": 67.0
[}] // End 

I am not for this. It complicates the parser because I can't recursively parse in a TLV fashion. I suggest we set this aside for now.

  • Type groups ARE a valuable suggestion (in the way coalesced in my last post, with corrections below) because they are simple, an optional optimization and they allow arrays to contain anything.

TLV -- I think I might be dyslexic... Type-Length-Value seems very clear to me, all the variable length types in UBJSON are defined that way (STRING and HUGE)

Agreed. That was my oversight.

Knowing the byte size is awesome for parsing, but puts too much pain on the producer at generation time.

I'm cool with that too.

Based on @thebuzzmedia's and @rogusdev's points that type group lengths should be counts of elements, I concur. Here is my corrected example of type groups:

{ name:"adil", count:1, tags : ["php", "d", "python", "erlang"] }

//Encoding 2, with TYPE groups in arrays
[{]
    [S][i][4][name][S][i][4][adil]
    [S][i][5][count][i][1]
    [S][i][4][tags]
        [[]
        [T][S][i][4] // 'T' followed by the Type, Length and Values, always.
            [i][3][php]  //7 bytes
            [i][1][d] //3 bytes
            [i][6][python] //13 bytes
            [i][6][erlang] //13 bytes
        []]
[}]

My conclusion to Type Groups is :

  • They are [T][Type of Data][Count of items][Values ...]
  • They are optional, and should not be a part of objects.
  • They can occur multiple times in an array to optimize groups. Ex:
[1, 2, 3, 4, 5.0, 6.0, 7.0, 8.0]

        [[]
        [T][i][i][4] 
            [1]
            [2]
            [3]
            [4]
        [T][d][i][4]
            [5.0]
            [6.0]
            [7.0]
            [8.0]
        []]
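
On the producer side, splitting a mixed array into maximal same-typed runs (one type group per run, as in the example above) is nearly a one-liner. A Python sketch; the int/float/str-to-marker mapping is an assumption for illustration:

```python
from itertools import groupby

def plan_type_groups(values):
    """Group consecutive same-typed values into (marker, run) pairs,
    using assumed markers: 'i' for int, 'd' for float, 'S' for str."""
    markers = {int: b"i", float: b"d", str: b"S"}
    return [(markers[t], list(run)) for t, run in groupby(values, key=type)]
```

Each (marker, run) pair then becomes one [T][marker][i][len(run)] group followed by the run's payloads.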
@adilbaig
Collaborator

@rogusdev

The version you have described:

[T][C] == Type (Marker + implied type for the values that follow)
[i][3] == Length
[???] == Value (the data to follow, but with implicit type on the values)

dramatically alters both serializing and deserializing to require the possibility that a "type" is more than one byte! That's a huge deal.

Umm, I'm not quite sure how this is a huge deal. I'm probably not seeing far enough here, but if you encounter a 'T', you know the next byte is a type.

As a small aside, I would ask that while the restriction about object/dictionary keys being only strings could be added to the specs now, the enhancement that the [S] marker byte must be left off

I agree with this. It keeps the parser simple and allows non-conformants (too harsh?) to use whatever key they like.

@thebuzzmedia

@adilbaig and @rogusdev I think your points about the dangers of "optional" features in the spec are starting to sink in -- optionality makes everyone's lives harder (parsers, generators and client code handling returned values).

@adilbaig You stated something that really resonated with me:

Chris needs arrays to always have lengths for debugging to be productive, optional lengths don't help (why should an encoder bother putting it in when it's optional and does not affect decoding!)

...

They have no notion of length (for streaming purposes), and I don't like the idea of two arrays as @cdunn2001 suggested. There shouldn't be two ways to encode/decode the same thing; my feeling is this looks like a wart in the spec to new parser writers.

Very good point -- having dual types of something makes the spec more confusing AND will just lead to everyone using "whatever is easiest". Going back to my own words, JSON won because it was simpler than anything else. Looking at this recursively, certain decisions INSIDE of a spec will "win" because they are simpler than others.

Excellent point. Now, back to your post...

Your final version of Typed Groups I think looks rather elegant -- one clarification though, are we all in agreement that THIS is legal?

[1, 2, 3, 4, 5.0, 6.0, 7.0, 8.0, "hello", "world", 7]

        [[]
        [T][i][i][4] 
            [1]
            [2]
            [3]
            [4]
        [T][d][i][4]
            [5.0]
            [6.0]
            [7.0]
            [8.0]
        [S][i][5][hello]
        [S][i][5][world]
        [i][7]
        []]
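Mentally executing an array like the one above with a small reader makes the "group merges into parent" semantics concrete. A Python sketch (int8 and string members only, [T][type][i][count] headers, int8 lengths assumed):

```python
import io

def read_array_with_groups(buf: io.BytesIO):
    """Array whose members may be plain values or [T][type][i][count] groups
    whose members flatten into the parent array."""
    assert buf.read(1) == b"["
    out = []
    while True:
        m = buf.read(1)
        if m == b"]":
            return out
        if m == b"i":
            out.append(int.from_bytes(buf.read(1), "big", signed=True))
        elif m == b"S":
            assert buf.read(1) == b"i"
            n = buf.read(1)[0]
            out.append(buf.read(n).decode("utf-8"))
        elif m == b"T":                        # type group: merge members up
            t = buf.read(1)
            assert buf.read(1) == b"i"
            count = buf.read(1)[0]
            for _ in range(count):
                if t == b"i":
                    out.append(int.from_bytes(buf.read(1), "big", signed=True))
                elif t == b"S":
                    assert buf.read(1) == b"i"
                    n = buf.read(1)[0]
                    out.append(buf.read(n).decode("utf-8"))
                else:
                    raise ValueError(f"unsupported implied type {t!r}")
        else:
            raise ValueError(f"unsupported marker {m!r}")
```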

It still bothers me a little bit that we don't have a solution for optimizing OBJECT representation (as it looks like we are in agreement on skipping the discussion around guaranteed STRING keys right now and stripping the 'S' markers -- probably a good idea since there doesn't seem to be consensus).

While optimized array representation is important, the majority of APIs out there speak in Objects and if possible it would be awesome if we could figure out a sensible way to optimize for them.

@adilbaig
Collaborator

Your final version of Typed Groups I think looks rather elegant -- one clarification though, are we all in agreement that THIS is legal?

[1, 2, 3, 4, 5.0, 6.0, 7.0, 8.0, "hello", "world", 7]

        [[]
        [T][i][i][4] 
            [1]
            [2]
            [3]
            [4]
        [T][d][i][4]
            [5.0]
            [6.0]
            [7.0]
            [8.0]
        [S][i][5][hello]
        [S][i][5][world]
        [i][7]
        []]

Yep, this MUST be legal. It's the beauty of type groups!

While optimized array representation is important, the majority of APIs out there speak in Objects and if possible it would be awesome if we could figure out a sensible way to optimize for them.

Agreed, I suggest we deal with that in a separate thread.

@rogusdev

Right now every value in UBJSON can be serialized like so:

  • write marker byte
  • write length if needed
  • write data (including empty data, i.e. zero bytes)

And thus deserialized like so:

  • read marker byte
  • read length if needed
  • read data (including empty data, i.e. zero bytes)

As a result, every single deserialize read of that nature returns a single value result (string, numeric, dictionary/map, bool, array/list, etc)

(The only place this currently breaks down is for no-op but I, and I expect others, have accepted this as ok, acknowledging it as a part of the communication protocol, and not a "data type" on its own.)
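The read pattern described above -- marker, optional length, data, always yielding exactly one value, with no-op skipped in between -- looks like this as a Python sketch covering a handful of Draft-9 markers (int8 lengths assumed):

```python
import io

def read_one(buf: io.BytesIO):
    """Marker -> optional length -> data; every call yields exactly one value.
    The no-op marker 'N' is protocol chatter, skipped rather than returned."""
    m = buf.read(1)
    while m == b"N":
        m = buf.read(1)
    if m in (b"T", b"F"):                    # marker is both type and value
        return m == b"T"
    if m == b"Z":
        return None
    if m == b"i":
        return int.from_bytes(buf.read(1), "big", signed=True)
    if m == b"S":
        assert buf.read(1) == b"i"           # length as int8 for the sketch
        n = buf.read(1)[0]
        return buf.read(n).decode("utf-8")
    raise ValueError(f"unsupported marker {m!r}")
```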

If we introduce this new format of [T][Type][Length][Array of values], it does not fit the above pattern. I would likely read it like so:

  • read marker byte
  • read implied type
  • read length
  • read length number of values AND add each of them to the parent container that this [T] must be declared in.

Totally could be done that way; it does have a certain elegance. I still personally think of this type group declaration as being best fit by [T][i][3][i][data]. In my head it makes it easier to see that the second [i] is applied directly to the immediately next value and X (3 here) subsequent values, but at this point I certainly grant you it's just flipping the order of the implied type vs the length, and it doesn't really matter in the bigger scheme of things (unless it helps a majority of other people to understand the usage faster). Arguments can clearly be made for TLV in both cases, ah well. The functionality would be the same either way.

However, regarding the goal of an easily transmitted byte array, I still feel quite strongly that if I, as an implementer of a deserializing library, were presented with these two options for how I would want binary data transmitted for ideal ease of deserialization, to then make it available as exactly that to consumers of my library:

[A][i][3][U]
    [41]
    [42]
    [43]

vs

[[][T][U][i][3]
    [41]
    [42]
    [43]
[]]

Two things come to mind immediately: 1) the trivial point is that the type group adds an extra two bytes, [T] and []] -- but this issue is effectively insignificant in the context of transferring potentially kilobytes or megabytes of binary data, where an extra two bytes is nothing, so I won't get hung up on that point. However, 2) when I am reading this data using the type group approach, as attractive as it is conceptually for some data serialization optimization, I still have NO guarantee that the resulting array will be made up entirely of only bytes! Someone could, in theory, send me an array of 10k bytes, neatly packaged up with the type group marker before them, but followed by one string at the very end. So I can't actually allocate just an array of bytes; I have to (potentially) allocate both an array of bytes and another array of objects, and copy the bytes over into the objects when I'm done.

At that point, with no optimizing in the UBJSON library, the consumer of my library still has no guaranteed way to know that this is an array of bytes for sure; all they know is that it is an array of arbitrarily typed data. They'd have to loop over all the elements to see if they are all bytes and cast it to an array of bytes from there (to be clear: this would be very slow). If there were a true strongly typed array type, the deserializing library would have a guarantee in advance that the child objects would be of the appropriate type, could easily allocate exactly that memory with appropriate typing, and would perform quickly as well as provide greater specificity/clarity to the end consumer of the data.
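The allocation argument in a nutshell: with a true STC header the reader can allocate the final buffer before touching the payload. A Python sketch using the hypothetical [A][i][count][U] layout from earlier in the thread:

```python
import io

def read_stc_bytes(buf: io.BytesIO) -> bytearray:
    """Hypothetical [A][i][count][U] strongly typed byte array: the guarantee
    of homogeneity lets the reader allocate the final buffer up front."""
    assert buf.read(1) == b"A"
    assert buf.read(1) == b"i"
    count = buf.read(1)[0]
    assert buf.read(1) == b"U"               # uint8 payload type
    out = bytearray(count)                   # exact allocation, no copying later
    buf.readinto(out)
    return out
```

With a type group instead, the reader cannot safely do the `bytearray(count)` allocation, because nothing forbids a differently typed value appearing after the group.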

Very good point -- having dual types of something makes the spec more confusing AND will just lead to everyone using "whatever is easiest".

I think the point of having a strongly typed container is that it will be the easier choice compared to type groups, when confronted with the challenge of "how do I inform the recipient of this data that it is guaranteed to be homogeneous".

In fact, when programmatically serializing data, I cannot imagine a time when I would really think to use the type groups concept other than when I had a completely homogeneous collection of data. So what we'd end up with is a lot of arrays that are just type groups with all the data in them, but lacking any true guarantee for the deserializer of such data that only one data type is included.

Much as I am entertained by the type group idea, I don't think it properly/satisfactorily resolves the byte array problem.

@cdunn2001

This proposal is interesting but too complex.

@adilbaig wrote:

There shouldn't be two ways to encode/decode the same thing, my feeling is this looks like a wart in the spec to new parser writers.

I agree with most of your points, but I think we need to go back to Draft 8 and start over.

As @rogusdev and others have pointed out, TLV is the key to simplicity. To me a length-prefix simplifies parser writing in general. Maybe I can convince you all with another idea.

It can be beneficial to defer much of the parsing until data are actually required. (This is for random-access storage, of course, not a network stream. Sometimes JSON is used for configuration.) File-offsets can be stored, and sub-structures can be parsed on-demand. The savings in both memory and runtime can be enormous. This is not possible with JSON, nor with UBJSON Draft 9.
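The deferred-parsing idea relies on byte lengths: a reader can record where a sub-structure starts and skip over it without parsing. A Python sketch over a hypothetical length-prefixed layout [marker][i][nbytes][payload] (this is not actual Draft 8/9 UBJSON, just the shape of the argument):

```python
import io

def skip_value(buf: io.BytesIO):
    """With a byte-length prefix, any value can be skipped without parsing it.
    Returns (marker, payload offset, payload size) for later on-demand parsing."""
    marker = buf.read(1)
    assert buf.read(1) == b"i"               # byte length as int8 (assumption)
    nbytes = buf.read(1)[0]
    offset = buf.tell()                      # where the payload starts
    buf.seek(nbytes, io.SEEK_CUR)            # defer parsing until needed
    return marker, offset, nbytes
```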

I understand the value of streaming, and of STC. I merely argue that they are extensions. They are not required. If we accept that they are extensions, then we can discuss them without artificial constraints like JSON compatibility.

@rogusdev

@cdunn2001 if you are storing file-offsets, I would imagine you could easily also store byte lengths to read for the lookup.

I for one think draft 9 is a nice upgrade over draft 8 and am not looking to go back, although I am not necessarily against including an element count in every array and a key-value pair count in every dictionary/object -- though probably using the new system of a proper numeric type for the count, rather than different markers for each size scale. (If you want that, I would point you at MessagePack -- other than its lack of explicit support for strings, which is planned to change, it probably covers that base the way you want.)

@cdunn2001

@rogusdev, You're right. MessagePack is close enough to what I want. I have to do some bit-masking, but the values are basically readable via tcpdump.

I also agree that a proper numeric type for length is preferable to MessagePack's type-encoded lengths.

The current UBJSON (Draft 9) is certainly an improvement over other binary JSON encodings and might satisfy many needs.

@logzero

Hi. I thought I'd share my approach to the issue as it is heavily inspired by your work. The main idea is to keep it simple and optimize for common cases.

I am using single byte markers to keep parsing trivial. I see a JSON array as a special case of a JSON object (one with an implicit key). Thus I add another special case (implicit key, bounded, typed), as it is quite common (and performance-critical for me). This special case is transparent to JSON and can be represented as a JSON array (the decoder has to know it, the encoder doesn't).

// base type
z nil
0 false
1 true
b int8
h int16
i int32
l int64
f float
d double
s string, [s][length][value]
n number, [n][length][value]

// unbounded generic container (defined by start, end marker)
{} with explicit key (object/map)
[] with implicit key (array)

// bounded typed container with implicit key, [id][length][values]
B int8 array
H int16 array
I int32 array
L int64 array
F float array
D double array
S string array ?
N number array ?

// bounded typed container with explicit key, [id][length][key value pairs] ?
T int8 map
U int16 map
V int32 map
W int64 map
X float map
Y double map
Z string map
Q number map

The ?-marked definitions are not part of spec (just included here for symmetry).
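The bounded typed containers above decode in a few lines. A Python sketch; the length field is read as a raw uint8 and the byte order as little-endian here, both assumptions the post leaves open:

```python
import io
import struct

# struct codes and element widths for the bounded typed arrays above
SIZES = {b"B": ("b", 1), b"H": ("h", 2), b"I": ("i", 4), b"L": ("q", 8),
         b"F": ("f", 4), b"D": ("d", 8)}

def read_typed(buf: io.BytesIO):
    """[id][length][values]: the marker picks the element type, length is a count."""
    code, width = SIZES[buf.read(1)]
    n = buf.read(1)[0]                       # length as raw uint8 (assumption)
    return list(struct.unpack("<" + code * n, buf.read(width * n)))
```

Since every payload is fixed-width, the reader knows the exact byte span of the container from the header alone, which is what makes these containers performance-friendly.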

@thebuzzmedia

@logzero The format modifications look solid, but very optimized for numeric representations (16 types just to represent ARRAY/OBJECT of specific payload type).

That said, if that is what you needed for your uses I think it is a solid design.

@logzero

@thebuzzmedia Yeah, it is a specialization towards a certain data set. What I've tried to show with my post is that there is strength in simplicity. I mean, the implementation could be auto-generated from a spec file, maybe even at runtime if one wanted to go all crazy. Although that starts to sound a bit too much like XML Schema and co.
