
Strongly typed array with type inference #13

Closed
AnyCPU opened this Issue Aug 26, 2012 · 48 comments

Collaborator

AnyCPU commented Aug 26, 2012

Follows up on issues #10 and #12.

Adoption of an any-type flat container, aka array.

Proposal: add a new container type to the spec.

< - open chevron - start of container
> - close chevron - end of container

The container's type - the type of the container's items - is automatically inferred from the type of the first element.
The container type can be any of the value types available in Draft 9, including the booleans True / False.

An array of parts of one string:
[<]
 [S] [3] [123]
      [5] [12345]
[>]

A boolean array:
[<]
 [T]
 [F]
 [F]
 [T]
 [T]
[>]

Type inference makes it possible to use boolean values.
About chunked strings: a string itself can contain string separators like \n and \r, so in general there is no big difference between one string and a string that contains other strings.
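As a sketch of the proposal (hedged assumptions: Draft 9 markers, the single inferred type marker written once after the open chevron, an int8-only integer branch - all illustrative choices, not spec), an encoder might look like:

```python
import struct

def encode_inferred_array(items):
    # Hypothetical encoding of the proposed '<' ... '>' container:
    # the item type is inferred from the first element and (per the
    # examples in this thread) written once; booleans are special,
    # since each T/F byte is both marker and value.
    if not items:
        return b"<>"                       # assumption: empty container
    out = bytearray(b"<")
    first = items[0]
    if isinstance(first, bool):            # check bool before int!
        for v in items:
            out += b"T" if v else b"F"
    elif isinstance(first, int):           # int8 only, for brevity
        out += b"i"
        for v in items:
            out += struct.pack(">b", v)
    elif isinstance(first, str):           # [S] once, then [i][len][bytes] per chunk
        out += b"S"
        for v in items:
            data = v.encode("utf-8")
            out += b"i" + struct.pack(">b", len(data)) + data
    else:
        raise TypeError(type(first))
    out += b">"
    return bytes(out)

assert encode_inferred_array([True, False]) == b"<TF>"
assert encode_inferred_array([45, 81, 111]) == b"<i-Qo>"
assert encode_inferred_array(["ab"]) == b"<Si\x02ab>"
```

Note that for an empty container the type cannot be inferred at all - a point that comes up again later in this thread.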

P.S.: I believe that 13 is a pretty lucky number.

@ghost ghost self-assigned this Aug 26, 2012

@ghost

ghost commented Aug 26, 2012

Michael,

Awesome, awesome idea, I think this even solves the issue of binary data by allowing a construct like:

[<][i]
    [45]
    [81]
    [111]
    ... more ...
[>]
  1. Am I understanding that correctly? (in which case binary is solved in a totally portable way) --> SEE UPDATE
  2. Do we allow typed arrays of NOOP?
  3. Do we allow typed arrays of NULL?
  4. Do we allow typed arrays of ARRAY?
  5. Do we allow typed arrays of OBJECT?

UPDATE ----

After reading Issue #12 more closely, @adilbaig's point - that an array (or chunked array) of ints is not necessarily converted to binary - doesn't address the binary support issue.

While technically it can be stored that way, at the library level, there is no indication of the ints being binary data or just a list of ints (e.g. values read from an input device).

@ghost

ghost commented Aug 26, 2012

UPDATE -- One way to address this is to add a new [B] type marker that only works inside the new type-inferred arrays and indicates that the stream of ints is all binary data -- [B] would essentially mean the same thing as [i], an 8-bit/1-byte integer, but in one context [i] is a number and in the other context [B] is a byte in a run of binary data.

Collaborator

kxepal commented Aug 26, 2012

  1. Am I understanding that correctly? (in which case binary is solved in a totally portable way)

Yes.

  2. Do we allow typed arrays of NOOP?

Yes, since they have a streaming nature.

  3. Do we allow typed arrays of NULL?

I suppose we do, since NULL has an "empty" / "no value" meaning.

  4. Do we allow typed arrays of ARRAY?

No, because of the nested streamed-containers problem. It also gives no benefit while producing a lot of unbalanced brackets.

  5. Do we allow typed arrays of OBJECT?

No, for the same reason as ARRAY.

UPDATE -- One way to address this is to add a new [B] type marker that only works inside the new type-inferred arrays and indicates that the stream of ints is all binary data -- [B] would essentially mean the same thing as [i], an 8-bit/1-byte integer, but in one context [i] is a number and in the other context [B] is a byte in a run of binary data.

This is a bad idea, because context depends on the client's data-processing logic. Perhaps this typed array of int8 is binary data, but I may need to apply some XOR function to decode it back to its raw binary format - better to handle it as an array of numbers before working with it as a binary file. Likewise, a typed array of int32 may be a UTF-32 string or may just be an array of huge numbers. We shouldn't make decisions about the data's nature; let the client process it in the way it expects.

Collaborator

AnyCPU commented Aug 26, 2012

  1. Yes.
  2. Yes.
  3. null is for NaN and +/- infinity - values that can't be represented.
  4. No.
  5. No.

This is bad idea, because context depends upon client data processing logic.
let client process it in way that he expects and does.

+1

We send just a model, client receives, controls and view its data.

@ghost

ghost commented Aug 26, 2012

Guys,

  1. Are we all in agreement that "support" for binary data is being pushed up a level to the library/application level, while we leave it totally agnostic at the spec level by representing it as a list of int8 (byte) values?

Put another way, if we convert UBJSON to JSON (with 'binary' in it) it gets converted as a simple array, something like:

image: [12, 33, 122, 41...]

and if we convert it back to UBJSON, we unfortunately lose the conversion back to a strongly typed array and instead convert it into an array of ints as:

[S][5][image][[]
    [i][12]
    [i][33]
    [i][122]
    [i][41]
[]]

I don't like the asymmetry we introduce here, but I think the 'win' of getting spec-compliant binary support is a big deal, and it can be mostly abstracted away at the library level in the form of methods like:

writeBinary(byte[] data, int index, int length)
byte[] readBinary()
InputStream readBinary()

(or something like that, I haven't thought much about it yet)
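In Python terms, such a hypothetical convenience layer might be sketched as follows. The `write_binary`/`read_binary` names and the `<i ... >` framing are assumptions for illustration only - and note that a raw 0x3E payload byte would collide with the '>' marker, the very flaw found later in this thread:

```python
def write_binary(data: bytes) -> bytes:
    # Hypothetical library helper: frame binary data as the proposed
    # strongly typed int8 container: '<', one 'i' marker, raw bytes, '>'.
    # Caveat: any 0x3E byte in the payload collides with '>' (see below
    # in the thread); a real design needs a length prefix instead.
    return b"<i" + bytes(data) + b">"

def read_binary(payload: bytes) -> bytes:
    # Inverse helper: strip the container framing and hand the
    # application a plain byte string.
    assert payload.startswith(b"<i") and payload.endswith(b">")
    return payload[2:-1]

blob = bytes([12, 33, 122, 41])
assert read_binary(write_binary(blob)) == blob
```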

  2. If #1 is true, then I think we need to provide the ability to have a Length argument for the strongly typed arrays per @kxepal's original message; otherwise reading and writing binary in this format is going to be untenably slow to be useful (if we have to read everything byte-by-byte and cannot read in a buffer's worth at a time).

What I am proposing is that we allow a first argument after a typed-array open of ANY of the integral numeric types (as we originally discussed), or a '*' char to indicate "unknown", something like:

[<][i][45]
    ....
[>]

or

[<][L][238712398723123]
    ....
[>]

or

[<][*]
    ....
[>]

This way optimized reads are at least an option in the format if needed (if someone is utilizing UBJSON for any significant transfer or storage of binary data this will be needed)
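A minimal sketch of just the proposed header, assuming an int8 count and the hypothetical '*' marker (both are proposal details from this comment, not spec):

```python
import struct

def stc_header(count=None):
    # Sketch of the proposed typed-array open: after '<', either an
    # integral count (int8 here, for brevity) or '*' for "unknown length".
    if count is None:
        return b"<*"
    return b"<i" + struct.pack(">b", count)

assert stc_header(45) == b"<i-"   # 45 == ord('-')
assert stc_header() == b"<*"
```

A reader seeing the count up front can allocate a buffer of that size and read the payload in one call instead of byte-by-byte.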

Thoughts?

Collaborator

AnyCPU commented Aug 26, 2012

  1. If we read an int8[] we can write it back.
[S][5][image] [ [i][12] [i][33] [i][122] [i][41] ]

If we have a strongly typed array, we will write back a strongly typed array.

[S][5][image]<i [12][33][122][41] >      <read--by design by default--write> [S][5][image]<i [12][33][122][41] >

[ ] is used for any-type arrays as proposed, something like:

[ [S][5][super][i][0x11][d][123.123] ] - like an object container, but without member names.

[<][type of length][length][strong type of items] [... items ... ] [>]

Samples:

<i2i [0x12] [0xff]> - array[2] of int8
<i3L[0x..12][0x..13][0x..14]> - array[3] of int64
<iZT[F][T][T].......[T][F][F][F]> - unbounded array of booleans
<iZF[F][T][T].......[T][F][F][F]> - still the same unbounded array of booleans

@thebuzzmedia

Are we all in agreement that the "support" for Binary data is being pushed up a level to the library/application level but we leave it totally agnostic at the spec level by representing it as a list of int8 (byte) values?

Agreed. Convention over spec. To be precise, binary will not be supported in the spec per se. To help library users, we can define a canonical convention in the docs showing how binary can be read and written. That's something for later. More importantly, without binary, the spec is air-tight.

@thebuzzmedia - It would be useful to know exactly how you intend streaming to be used. I mentioned an idea for streaming that is broader in scope, but I'm not sure I'm on the same page as you. If my question is due to my own oversight, I apologize in advance!

Collaborator

AnyCPU commented Aug 27, 2012

Also, unbounded arrays can look like:

<Zi [0x01][0x02]>
[<][Z or i2][items type specifier][...items...][>] - an unbounded array if we use [Z] as the count/size specifier, or i2 for an array[2] of the given type.

Real examples:

<Zi [0x01]...[0x12]> - unbounded array of int8
<i2i [0x01][0x12]> - array[2] of int8
Collaborator

AnyCPU commented Aug 27, 2012

So

[] becomes an any-type array or tuple (still a useful type):

[ [S][i][4][1234][d][123.456][L][0x..02] ] - for example, a tuple response from a database query
Collaborator

kxepal commented Sep 1, 2012

Heh, I found this type a little hard to implement within a classical TLV parser, because it breaks the TLV rules.
What should the algorithm for it be?

  1. Read 1 byte;
  2. Oh, that's a < tag! Delegate decoding to the < handler;
  3. Emit the next TLV structure - that's our first array item;
  4. Push the item's tag back to the parser and emit the next item;
  5. But how could the parser emit the end tag of this array in that case?

Does anyone have ideas about decoding this type without creating a separate data-reading function? (: I hope everyone here has a single function that emits data in TLV form...

Collaborator

kxepal commented Sep 1, 2012

Also, we missed the fact that with the current structure this type prevents us from having the int value 62 (0x3E), because that is the closing marker > (:
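The collision can be demonstrated with a toy reader for the unsized form (hedged sketch; int8 items only, names hypothetical):

```python
import io

def parse_typed_array(stream):
    # Toy reader for the unsized '<' form: one inferred type marker,
    # then bare int8 payload bytes until '>'. Because payload bytes
    # carry no tags, the byte 0x3E (62) is indistinguishable from the
    # closing marker.
    assert stream.read(1) == b"<"
    assert stream.read(1) == b"i"          # inferred item type: int8
    items = []
    while True:
        b = stream.read(1)
        if b == b">":                       # ...or is it the value 62?
            return items
        items.append(int.from_bytes(b, "big", signed=True))

assert parse_typed_array(io.BytesIO(b"<i\x01\x02>")) == [1, 2]
# An array that should contain [1, 62, 2] is silently truncated:
assert parse_typed_array(io.BytesIO(b"<i\x01\x3e\x02>")) == [1]
```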

Collaborator

kxepal commented Sep 1, 2012

By the way, all those problems could be solved if we explicitly defined the length of the typed array like we do for the STRING or HIDEF types: we wouldn't care about an end marker, only about for how many items our TLV parser should take the tag value not from the input data stream, but from a cache or callback.

P.S. Sorry for the multi-post; thoughts and ideas come in a chunked way at night /:

Collaborator

AnyCPU commented Sep 2, 2012

< - start of container
[length (explicitly only, no arbitrary size)]
[item type]
[...items...]
[no closing character >]

<i2i [0x3e] [0x3e]

?

@ghost

ghost commented Sep 2, 2012

@adilbaig -- To your streaming question earlier: can you clarify what your point about streaming was? I don't think I follow where the disconnect is.

@kxepal -- Regarding not being able to store the int value 62: this is the same issue we have with existing containers, and the reason it works (we can have 62) is that all the values contained within the container have finite length (using @AnyCPU's proposed format):

[<][i][33][i] 
  [62]
  [62]
  [62]
  ... 30 more entries ...
[>]

UPDATE: Ahhhh, I see the problem now... in my original draft of the example above I included the [i] marker on each byte value -- but with the strongly typed arrays, we don't have this marker, so you never know if you are hitting '>' or another value to include in the array. Good point!

@everyone -- Given Alex's find, it seems to me that this idea of the strongly-typed array only works when we have a FIXED-LENGTH container; the data cannot be unbounded, because we don't know when to stop.

@AnyCPU -- I want to make sure I understand your proposed format for the strongly-typed arrays, it is:

<[ANY INTEGRAL NUMERIC TYPE][1-byte MARKER OF ANY UBJSON VALUE TYPE]
  ... collection of those same-typed values without their type markers, just raw bytes representing values ...
>

Is that right? Additionally it seems you are proposing using 'Z' or NULL for the length (instead of an integral type) to indicate an unbounded/unknown-length container, is that right? (as opposed to my suggestion of '*')

@kxepal If I understood @AnyCPU correct in the previous question, then the parsing seems fairly straight forward to me. Are you still seeing an issue here?

@everyone -- I am getting the impression that this idea of binary data/strongly typed arrays is getting very unwieldy and we don't have it figured out correctly yet... let's keep discussing until something elegant comes out of this. What we have now isn't elegant/simple/self-explanatory, unfortunately.

@ghost

ghost commented Sep 2, 2012

Thoughts on Strongly-typed Arrays

As this discussion continues, it dawns on me that to effectively implement the strongly-typed container -- where each element uses a short-hand representation of JUST its value (no MARKER bytes, because we know everything in the container is the same type) -- it is impossible to accurately tell where the container ends, so this will require the use of a length, for example:

[<][i][5][i]
    [1]
    [2]
    [3]
    [4]
    [5]

This breaks the consistent feel of this spec, but is very space efficient.

Alternatively, to use a very spec-friendly method of a strongly typed container, we could just use the < and > markers, with no length (again, like the rest of UBJSON, all containers are unbounded) and then have individual units of data in the container using the standard UBJSON types:

[<]
    [i][1]
    [i][2]
    [i][3]
    [i][4]
    [i][5]
[>]

This looks nice and is intuitive BUT, writing binary data out like this and parsing it back in is a nightmare (1 byte/unit of data at a time).

It seems to me if we are serious about the strongly typed containers, the 1st method is superior performance-wise, but I hate the fact that it doesn't match any other aspect of the spec in design/implementation.
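The space difference between the two layouts can be measured directly (a sketch assuming int8 values and an int8 count):

```python
import struct

# Method 1: [<][i][n][i] header followed by bare value bytes (proposed compact layout)
def compact(values):
    return (b"<i" + struct.pack(">b", len(values)) + b"i"
            + struct.pack(f">{len(values)}b", *values))

# Method 2: [<] then a fully marked [i][v] pair per item, closed by [>]
def marked(values):
    return b"<" + b"".join(b"i" + struct.pack(">b", v) for v in values) + b">"

vals = [1, 2, 3, 4, 5]
assert len(compact(vals)) == 9     # n + 4 bytes overall
assert len(marked(vals)) == 12     # 2n + 2 bytes overall
```

For int8 payloads the marked form approaches twice the size of the compact form as n grows, which is why method 1 wins for binary data.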

I am wondering if we should punt on this issue combined with Binary support until Draft 10 or until we can figure out something more elegant?

Thoughts?

Collaborator

kxepal commented Sep 2, 2012

Riyad, Michael,
In that case why do we ever need closing marker for sized container?

@ghost

ghost commented Sep 2, 2012

@kxepal The sized containers were a mistake as you pointed out to me (with containers holding containers holding containers... having the parser remember all that state to ensure size was a pain).

But you can look at the individual value constructs as examples:
[i][32]
[S][i][3][bob]

You are right, if you have the length you don't need the closing marker. If you don't have a length, you need a closing marker. The reason it works with ARRAY and OBJECT is because it contains full, discrete UBJSON types -- the proposal here for strongly-typed arrays is to further the optimization by removing the MARKER from each value, so you just have a flat, blob of all value data... in that case, you cannot (consistently/cleanly) discern the closing marker from a value in the list (as you pointed out).

So in short:

  • Unbounded container: Needs open/close markers.
  • Bounded container: Only needs open marker.
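The rule above can be sketched with a toy reader (hedged assumptions: int8 lengths, int8 array items, Draft 9 markers):

```python
import io
import struct

def read_container(stream):
    # Bounded value: the length tells us where to stop, no close marker needed.
    # Unbounded ARRAY: full TLV items until the ']' close marker appears.
    kind = stream.read(1)
    if kind == b"S":                          # [S][i][length][bytes]
        assert stream.read(1) == b"i"
        (n,) = struct.unpack(">b", stream.read(1))
        return stream.read(n)                 # stop by counting
    if kind == b"[":
        items = []
        while True:
            tag = stream.read(1)
            if tag == b"]":
                return items
            assert tag == b"i"                # int8 items only, for brevity
            (v,) = struct.unpack(">b", stream.read(1))
            items.append(v)
    raise ValueError(kind)

assert read_container(io.BytesIO(b"Si\x03bob")) == b"bob"
assert read_container(io.BytesIO(b"[i\x01i\x02]")) == [1, 2]
```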
Collaborator

AnyCPU commented Sep 2, 2012

<[ANY INTEGRAL NUMERIC TYPE][1-byte MARKER OF ANY UBJSON VALUE TYPE]
  ... collection of those same-typed values without their type markers, just raw bytes representing values ...
>

Yes, and the closing chevron can be omitted.

Collaborator

AnyCPU commented Sep 2, 2012

Known length array

[<][i][5][i]
    [1]
    [2]
    [3]
    [4]
    [5]

Is that right? Additionally it seems you are proposing using 'Z'

Yes, because if we use 'Z' we stay within the spec's existing alphabet.

Unknown length array

[<][Z]
    [i][1]
    [i][2]
    [i][3]
    [i][4]
    [i][5]
[>]
Collaborator

kxepal commented Sep 3, 2012

Oh, I answered briefly because I was on mobile and didn't see such a lot of explanation. Now a more expanded answer...

Is that right? Additionally it seems you are proposing using 'Z' or NULL for the length (instead of an integral type) to indicate an unbounded/unknown-length container, is that right? (as opposed to my suggestion of '*')

Yes, that's better, but why do we need the duality of the < marker? Let's take a look at the use cases for both variants.

  1. Sized variant:
    • A more compact way to keep data;
    • It's awesome for large binary data;
    • Since it's strongly typed and sized, in low-level languages you could explicitly allocate an array of the specific type and size, or stop handling it on length overflow;
    • Acts like a TLV hack, but you are still able to write a parser without injecting knowledge about markers into it via callback values.
  2. Unsized variant:
    • Just an ARRAY with a known and fixed item type. However, you still need to check each item's type in your decoder, because the situation where you get:
[<][Z]
       [i][1]
       [i][2]
       [i][3]
       [S][i][5][BOOM!]
       [i][4]
[>]

is real. With the sized variant you'll get a decoding error in such a case, while having no worries about item types.

  • This variant is not the same as the sized one, because for an empty array you still don't know its type. Examples:
[<][i][0][i]

^^^ this is a zero-length sized typed array that could contain only INT8 items.

[<][Z]
[>]

^^^ this is a zero-length unsized typed array with an unknown acceptable item type. Since there are no elements at all, it's the same value as an empty weakly typed ARRAY until you put the first item into it. In C++ you can write char data = {};, but auto data = {}; is an error.

  • Respects TLV rules.

Any other ideas for both of them?

I suppose in most situations the more effective variant will be used. That's the first one, since it is more compact and static.
For the second one I just need to take ARRAY and add a type checker to it based on the first element.
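That type checker might be sketched as follows (hypothetical helper, operating on already-decoded values rather than on the wire format):

```python
def check_homogeneous(items):
    # The first item fixes the element type; any later mismatch is
    # rejected, as in the [S][i][5][BOOM!] example above. An empty
    # array passes because its type is still unknown.
    it = iter(items)
    try:
        first = next(it)
    except StopIteration:
        return True
    t = type(first)
    return all(type(v) is t for v in it)

assert check_homogeneous([1, 2, 3])
assert check_homogeneous([])
assert not check_homogeneous([1, 2, "BOOM!", 4])
```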

I'd like to propose removing the unsized variant of this type and rebranding it back to C as the container marker. Let's keep the < > pair for some better future usage. For example, to have native support for key-order-preserving objects - that's a real problem for JSON (;

@ghost

ghost commented Sep 3, 2012

@kxepal I agree with 2 things you said and I am not sure how they fit together yet, but essentially:

  1. Do away with the unbounded strongly typed container (just use an array).
  2. Let's go back to the idea of a "chunked" container.

I think we are losing sight of the ultimate problem we are trying to solve here... I want to write something up and will share it shortly.

@ghost

ghost commented Sep 3, 2012

Reading back through Issue #10 I think we have lost sight of the original goal a little bit.

Originally we were looking for a way to associate "chunks" of a large value (represented by individual UBJSON values) together, like:

[C]
    [S][1024][... bytes ...]
    [S][1024][... bytes ...]
    [S][1024][... bytes ...]
    [S][1024][... bytes ...]
    [S][1024][... bytes ...]
[E]

The idea is that the units of data inside the chunked container are sufficiently large themselves to make the overhead of the repeated MARKER insignificant (and therefore not necessary to change or break the existing TYPE reading/writing logic).

I think we got so focused on the idea of BINARY support and the individual INT8 types that we started to think only of this construct:

[C]
    [i][22]
    [i][33]
    [i][17]
    ...
[E]

and how inefficient the byte-bloat from the addition of 'i' over and over again was, that we set out to solve only that problem.

The whole argument for a chunked container type came out of Issue #10, trying to respond to people's request for handling really huge strings in chunks rather than trying to stream/parse individual Strings that were GBs in size. Instead you could chunk a STRING into hundreds of smaller pieces and send them, letting the receiver handle them more efficiently.

The BINARY use case I think is derailing us into the realm of incompatible types and specialized container formats.

Maybe we are thinking about this wrong? Maybe we do add a specific BINARY type to avoid this? (but I agree with @adilbaig that this breaks the format in a way that might not be worth it yet).

Collaborator

kxepal commented Sep 3, 2012

@thebuzzmedia, I think we just need to keep the sized variant of this type, with another marker, in @AnyCPU's proposed form:

C[ANY INTEGRAL NUMERIC TYPE][1-byte MARKER OF ANY UBJSON VALUE TYPE]
  ... collection of those same-typed values without their type markers, just raw bytes representing values ...

and that's all. It looks very nice and effective for its task.

I think we are losing sight of the ultimate problem we are trying to solve here
Yes, that was a hard evolution of the idea, from just chunked strings through binary containers to the typed array.

Now we're trying to solve all of the previous problems with a single type: binary data (with a compact structure), chunked strings (because all container items are single-typed), and we could also take along the previous request to support UTF-16 strings (since items may have any non-container value type).

Unfortunately, it can't be designed in a streamed way while keeping its effectiveness (or maybe I just don't see how it could be implemented without hacking things).

@ghost

ghost commented Sep 3, 2012

Alex, thanks as always for helping clarify :)

@thebuzzmedia, I think we just need to keep the sized variant of this type, with another marker, in @AnyCPU's proposed form:

C[ANY INTEGRAL NUMERIC TYPE][1-byte MARKER OF ANY UBJSON VALUE TYPE]
... collection of those same-typed values without their type markers, just raw bytes representing values ...

and that's all. It looks very nice and effective for its task.

Ahh, so binary would look like:

[C][INTEGRAL NUMERIC TYPE LENGTH][1-BYTE MARKER]
    [32]
    [57]
    [13]
    ....

and strings would look like:

[C][INTEGRAL NUMERIC TYPE LENGTH][1-BYTE MARKER]
    [i][32][...32 bytes of String data...]
    [L][321312334445][...321312334445 bytes of String data...]
    [I][17862][...17862 bytes of String data...]
    ....

where the length is mandatory (not optional), and any of the INT8/16/32/64 types, and the 1-byte marker is any non-container value type?

If so, OK, I agree. Very elegant.

Unfortunately, it can't be designed in a streamed way while keeping its effectiveness (or maybe I just don't
see how it could be implemented without hacking things).
I didn't understand this... can you clarify?

Collaborator

kxepal commented Sep 3, 2012

Ahh, so binary would look like:

Actually (binary foo):

[C][i][3][i]
    [102]
    [111]
    [111]

It's an array of int8 numbers, but you may work with it as binary data, since it is actually an array of bytes.
So we've kept compatibility with JSON. At the library level you may decide how to decode such a container with some option, but you'll certainly know how to encode binary data streams to UBJSON.

and strings would look like:

Actually:

[C][i][3][S]
    [i][3][foo]
    [i][3][bar]
    [i][3][baz]

Those are our chunked strings with a known chunk count. So you could have an option to encode large strings in chunks, and an option to assemble the chunks into a single string on decoding. In any other way it's a simple array of strings.

where the length is any of the INT8/16/32/64 types and the 1-byte Marker is any non-container value type?

Yes, the length follows the same rules as for the STRING and HIDEF types.
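A hedged decoder sketch covering exactly the two example layouts above (assumptions: int8 lengths only, and only the 'i' and 'S' element types shown in the examples):

```python
import io
import struct

def read_stc(stream):
    # Decode the proposed [C][length][type] form: 'C', the length's own
    # integral marker plus value, one element-type marker, then bare items.
    assert stream.read(1) == b"C"
    assert stream.read(1) == b"i"                 # length's integral type
    (count,) = struct.unpack(">b", stream.read(1))
    marker = stream.read(1)
    items = []
    for _ in range(count):
        if marker == b"i":                        # bare int8 values
            items.append(struct.unpack(">b", stream.read(1))[0])
        elif marker == b"S":                      # [i][len][bytes] per chunk
            assert stream.read(1) == b"i"
            (n,) = struct.unpack(">b", stream.read(1))
            items.append(stream.read(n).decode("utf-8"))
        else:
            raise ValueError(marker)
    return items

assert read_stc(io.BytesIO(b"Ci\x03ifoo")) == [102, 111, 111]
assert read_stc(io.BytesIO(b"Ci\x03Si\x03fooi\x03bari\x03baz")) == ["foo", "bar", "baz"]
```

Assembling the chunked string is then just `"".join(...)` over the result.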

Suddenly, it couldn't designed in streamed way to keep his effectiveness (or may be I just didn't see how it
could be implemented without things hacking).

I didn't understand this... can you clarify?

That was a reference to my previous comments. Never mind (:

Collaborator

AnyCPU commented Sep 3, 2012

Day summary:
Known size:

[start of stc][any value of integral type as length][one-byte marker for the array's item type]
[bytes of 1st item]
...
[bytes of n-th item]

start of strongly typed container (stc) - <
length is a value of int8, int16, int32, or int64
item type is any value type: int8, int16, int32, int64, float32, float64, string, high-precision, Z, or (T, F) as boolean
items are the bytes of numeric types, or [chunk length][bytes] in the case of string and high-precision types.
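The summary above can be sketched as an encoder (hedged assumptions: int8 length, the '<' marker, and int8/string items only; the other listed types are omitted for brevity):

```python
import struct

def encode_sized_stc(values):
    # Day-summary layout: '<', an integral length (int8 here), one
    # item-type marker, then the bare item payloads.
    out = bytearray(b"<i") + struct.pack(">b", len(values))
    if all(isinstance(v, int) for v in values):
        out += b"i"
        for v in values:
            out += struct.pack(">b", v)                      # bare int8 bytes
    elif all(isinstance(v, str) for v in values):
        out += b"S"
        for v in values:
            data = v.encode("utf-8")
            out += b"i" + struct.pack(">b", len(data)) + data  # [chunk length][bytes]
    else:
        raise TypeError("mixed item types are not allowed in an STC")
    return bytes(out)

assert encode_sized_stc([0x12, 0x34]) == b"<i\x02i\x12\x34"
assert encode_sized_stc(["ab"]) == b"<i\x01Si\x02ab"
```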

Collaborator

AnyCPU commented Sep 3, 2012

Comments:

  • the fastest arrays in terms of parser speed are numeric arrays.
  • string and high-precision arrays have ordinary speed.
  • strongly typed arrays with unknown size can be useful if we want to add an additional integrity check for arrays at the spec layer. For example, we write an STC and bytes of a string could be written into the array body by mistake, with the resulting bytes looking valid.
    So the next form of STC may be useful:

<Zi      
 [i][0x01]
 ....
 [S][i][6][adfsdf] /// error thrown at the spec level
 [i][0x02]
>

< - start of STC
Z - indicator of unknown length, used in place of [any value of integral type as length]
i - type of items
> - end of STC.
It's more verbose.

Question: can it be useful?

Collaborator

AnyCPU commented Sep 3, 2012

since each item has int8 type it will be decoded as

no.


[<][Z][i]

[i][0x01]
[S][i][0x01][a] /// error under spec level and additional checking
[i][0x02]
[>]

any chance to replace < by C or something?(: 

we have [ ] and { }, so it's organic to have < and to reserve >

Collaborator

kxepal commented Sep 3, 2012

@AnyCPU, yes-yes, I was wrong, sorry /:

However, you get the same type-validation check with the sized variant, but in a more compact form.

Collaborator

AnyCPU commented Sep 3, 2012

However, you get the same type-validation check with the sized variant, but in a more compact form.

It is possible that in some cases some bytes of an STC array of int8 will be replaced by string bytes, and then it will be as you mentioned:

[<][i][11][i]
    [0x01]
    [0x53] // S
    [0x69] // i
    [0x06] // 6
    [0x61] // a
    [0x64] // d
    [0x66] // f
    [0x73] // s
    [0x64] // d
    [0x66] // f
    [0x02]
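The substitution is easy to verify: the raw bytes of the embedded string value are all plausible int8 payloads, so a sized int8 STC has no way to detect them:

```python
# The nine bytes of the string value [S][i][6][adfsdf], taken verbatim:
string_value = b"Si\x06adfsdf"

# Read as bare int8 items, they match the hex column in the example above,
# so a sized parser accepts them silently as ordinary numbers.
as_int8 = list(string_value)
assert as_int8 == [0x53, 0x69, 0x06, 0x61, 0x64, 0x66, 0x73, 0x64, 0x66]
```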

So, the question: can it be useful? If it is difficult and convoluted, let's throw it out and keep only the STC with known size.

@ghost

ghost commented Sep 4, 2012

So, the question: can it be useful? If it is difficult and convoluted, let's throw it out and keep only the STC with known size.

STC with known size is my vote; unknown size (while it makes sense) requires that we use the type prefixes for every element so we can accurately detect the OPEN/CLOSE markers, and we lose all the efficiency of the more compact representation.

Collaborator

AnyCPU commented Sep 4, 2012

@thebuzzmedia shall we leave integrity checking to the app level?

P.S.: So, STC with known size only? (+1, my vote)

Collaborator

kxepal commented Sep 4, 2012

+1 for the known-size variant.

An unsized STC could be done in a more compact way:

[<]
    [i][0x01]
    [i][0x02]
    [i][0x03]
[>]

^^^ the difference is in duck typing: the first item sets up the type restriction for the others. By the way, this is a classical array, while our ARRAY type is actually a list - terminology tricks (;

Have we decided on the STC marker? I'm mostly for C, leaving < > for some future usage.

Collaborator

AnyCPU commented Sep 4, 2012

@kxepal in that case I think we don't need unsized STC arrays at all, because this behaviour can be checked at the app level.

+1 for < for STC, with > reserved and prohibited.

@ghost

ghost commented Sep 4, 2012

@kxepal I don't like '<' either, but if you are voting for 'C' it would seem we should leave the 'A' and 'O' usage intact and have some potential consistency like:

[A]...[E]
[O]...[E]
[C]...[E]

right?

If we do that and define 'C' (or STC) as we are proposing (fixed length), then the very next question out of everyone's mouth is "Why doesn't ARRAY and OBJECT support fixed-length as well for performance optimizations?"

I don't like this inconsistency between these constructs; it suggests to me that we haven't considered all the use-cases yet.

I know it works, I just don't think it is elegant and the original spec's simplicity was such a huge selling point for the spec.

Another idea we haven't considered is overloading the ARRAY marker by supporting the addition of a fixed length; the only problem is that a JSON Array does not have to be same-typed, but if we take this shortcut with fixed-length arrays, they DO need to be the same type.

I am sorry for the push-back... my gut is telling me that we aren't here yet, that Draft 9 is not ready.

When I look at the omissions we are making in Draft 9 (removing the lower-cased marker types) I love how simple the spec gets again.

When I look at the marker changes from A to [ and O to { -- even though it looks better in a hex viewer, I think it makes the documentation harder to read and seems a little more technical to someone reading the spec.

Then when I look at the proposed changes in STC, I think we are getting farther and farther away from "simple".

This might be what we end up using, but I just can't help feel like we are off track here and I don't know why.

Collaborator

AnyCPU commented Sep 4, 2012

@thebuzzmedia In general an STC can be implemented at the application level with little overhead.
{ } [ ] are good because I think they can speed up parsing.

Collaborator

kxepal commented Sep 4, 2012

@kxepal I don't like '<' either, but if you are voting for 'C' it would seem we should leave the 'A' and 'O' usage intact and have some potential consistency

< looks better if it is closed by >; otherwise it creates an unbalanced chevron, which also breaks consistency with the [ ] and { } pairs.

Since STC is a sized container, there is no need to close it with a tag. This means it's closer in structure to the STRING and HIDEF types than to ARRAY or OBJECT.

"Why doesn't ARRAY and OBJECT support fixed-length as well for performance optimizations?"

Because STC is actually an ARRAY optimization that is very useful for some specific cases (binary data, collections of items with similar types). There is no analog for STC in JSON, though you'll find something similar in high-level languages. While ARRAY tries to act like the JSON one, STC is just the same ARRAY with optimizations applied for the binary-data context.

In the same way, our int8/int16/int32/int64, float and double are all optimizations of JSON's single number type. But nobody is asking what these types are for, or why not just use HIDEF for numbers, right?

Then when I look at the proposed changes in STC, I think we are getting farther and farther away from "simple".
This might be what we end up using, but I just can't help feel like we are off track here and I don't know why.

I suppose these discussions have created some mess of ideas and confusion, and that's why you're in doubt. I agree, it is hard to track everything from issues #10, #12 and #13. Let's try to move another way: please describe your vision of the STC type as you'd like to see it, define its benefits and flaws - arguments for each are welcome - and send it to everyone via the mailing list. Then we'll comment on that specification without any holy wars or discussion flood, strictly to the point: what's right, and what's better to change, and WHY - moot points can be discussed separately, maybe even in this issue.

This should help you see the whole picture in one shot. Actually, I have tried to implement every one of our ideas from this issue in code, and all of it is clear to me.

Or, if you'd like, I could try to summarize everything about STC, but I'm a little biased (:

@ghost

ghost commented Sep 5, 2012

Alex,

Sorry for the diatribe yesterday; you are exactly right in your assessment that all the conversations in all the different issues are blurring together to create a lot of confusion and complexity in my mind.

When I originally created the UBJSON spec, I just opened up the JSON spec, sat down, and defined the binary representations of each of the same constructs.

Easy and intuitive (also easy to learn and easy to code).

After Draft 4 and onto Draft 8, the spec kept growing, and then in Draft 9 we realized how many mistakes we made with big-marker/little-marker values, and not just re-using the same JSON container markers for simplicity sake.

When I tried to learn from my mistakes, the one common thread I could come up with is that I felt I moved too quickly to define Drafts 5/6/7/8 and integrate proposed changes too quickly and then we had to undo them a bit in Draft 9.

When I look at the proposed STC format, it makes all the sense in the world -- it is nice and compact, relatively easy to implement and efficient (AND gives us the ability to handle binary data which is awesome).

However... my mind and natural preference tends towards symmetry. For example, if STC is effectively a specialized ARRAY where all types are the same (and thus written in short-hand), should we overload '[' in some way so it is more natural to learn?

And if the answer to that is 'Yes', then maybe we should consider an overloaded '{' OBJECT construct as well that might offer some natural optimization for Object construction?

--- My concern is that we have a solution to a very specific problem here (STC), and it is a good solution, but it is another one-off construction.

What if Draft 10 comes up and we realize that we did this all wrong and a much more elegant solution is to redefine all the container types in some other way?

(Obviously we would just accept the break and do it, but I'd love to avoid it if we could).

That is all... I just have this nagging feeling in the back of my head that there is a more holistic, simpler solution here that I am missing at the moment.

I agree with every point you made though (about how STC is specialized just like the numeric types). Sorry for the rambling, I just want to get this right.

Collaborator

AnyCPU commented Sep 5, 2012

Riyad & Alex,
as I understand it, we have the following questions:

  • Is STC (with a known length) good? I think yes: robust, quick, compact, and natural
    for primitive types.
  • Should we overload [ ]? I think no, because we keep the ability to create any-type arrays of arbitrary length. It's something like the result recordset/tuple of an SQL query. It also leaves room for custom data checks at the application level.
  • Should we overload { }? I think no, because what could be a more natural representation of an Object than <key, value> pairs? Also, an object can contain lots of custom fields that can't be predicted.

As a summary, we have an efficient and well-balanced version of the spec which can become version number 9.

{ } [ ] < > are all different types which can be combined to achieve the stated goals.
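For reference, the chevron container proposed at the top of this issue can be sketched in a few lines of Python. This is a hypothetical encoding whose marker choices follow the examples in this thread, not any ratified draft; the boolean case is the simplest, since each item is a bare T/F marker and the container's item type is inferred from the first element:

```python
# Hypothetical sketch of the proposed chevron (STC) container for booleans,
# following the [<][T][F]...[>] example from this issue. Not a ratified spec.

def encode_bool_stc(values):
    # '<' opens the container; each bool is its own one-byte marker; '>' closes.
    return b'<' + b''.join(b'T' if v else b'F' for v in values) + b'>'

def decode_bool_stc(buf):
    # The container's item type is inferred from the first element's marker.
    assert buf[:1] == b'<' and buf[-1:] == b'>', "not a chevron container"
    return [byte == ord('T') for byte in buf[1:-1]]
```

For example, `encode_bool_stc([True, False, False, True, True])` yields `b'<TFFTT>'`: seven bytes for five booleans.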

@ghost

ghost commented Sep 5, 2012

@AnyCPU basically, yes. To Point #1 and the thing that is keeping me up at night...

  • Is losing 100% JSON->UBJSON compatibility worth the addition of BINARY support?

More specifically, right now using Draft 8, I can take any kind of JSON, convert it to UBJSON and convert it back to JSON.... then back to UBJSON again and it is always equal.

If we make this change, that is no longer true.

@adilbaig suggested a few weeks ago that Draft 9 be all our other changes except this STC/Binary Array support change to maintain the maximal compatibility and I think I am starting to lean towards that.

That simplifies the spec greatly and gives us time to continue to think on BINARY / CHUNKED support.

Would you guys be violently opposed to this?

Collaborator

kxepal commented Sep 6, 2012

What if Draft 10 comes up and we realize that we did this all wrong and a much more elegant solution is to redefine all the container types in some other way?

We're not gods who make things ideal on the first try; we're humans who learn from our mistakes to make things better than before(; But we can't improve our ideas without confronting them with the real world. Actually, my proposal about streamed-only arrays/objects wasn't an idea out of nothing - it came to me because I had used Python generators for their decoders, and generators have a streamed nature. Take your idea about a custom length for STRING/HIDEF - it came to you in the context of your Java library implementation, if I recall correctly.

After Draft 4 and onto Draft 8, the spec kept growing, and then in Draft 9 we realized how many mistakes we made with big-marker/little-marker values, and not just re-using the same JSON container markers for simplicity sake.

That's the evolution of ideas; it's a normal thing for any spec.

Is losing 100% JSON->UBJSON compatibility worth the addition of BINARY support?

I repeat myself: no, if we care about the JSON part of the format's name. However, we're free to work within the JSON context.

More specifically, right now using Draft 8, I can take any kind of JSON, convert it to UBJSON and convert it back to JSON.... then back to UBJSON again and it is always equal.

- How do you encode binary data in JSON?

  • Base64 for sure!
  • Could this be true for UBJSON?
  • Of course! But with STC it's just better.
  • What is STC for JSON?
  • Just a plain array, nothing special.
  • So if I encode binary data in JSON as an array of numbers in the range 0-255, I could use it?
  • Yes, just tell your library to optimize plain arrays into STC.
  • Why do I need to store binary data in JSON in such a form?
  • Because in this form we preserve compatibility with JSON without losing features. And because this form is native for some types that operate on binary data.
  • But it produces a lot of overhead to store such binary data in JSON!
  • Yes, we must admit this is not an optimal structure for JSON. However, is it really binary data or just an array of numbers? We can't guess, and we shouldn't even try to.
  • Why not just use plain UBJSON arrays for that? I've heard they are compact!
  • In the case of binary data (remember, binary data is just an array of byte-sized integers) they produce 100% size overhead. STC keeps the data size equal to the original without size overhead (ok, 11 bytes even for GB files), while base64 encoding produces 33% overhead.
  • ...
  • Profit!
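The size comparison in that exchange can be checked with simple arithmetic. A minimal sketch, where the 2-bytes-per-int8-item figure and the fixed STC header size are assumptions taken from this thread, not measurements against any implementation:

```python
import base64

payload = bytes(range(256)) * 4  # 1024 bytes of "binary" data

# Plain UBJSON array: per item, a 1-byte 'i' marker plus the 1-byte value,
# plus the opening/closing array markers -> roughly 100% overhead.
plain_array_size = 2 + 2 * len(payload)

# STC as discussed in this thread: a small fixed header (marker, length,
# item type) followed by the raw bytes -> near-zero overhead.
stc_header = 11  # header size claimed above for large (GB-scale) payloads
stc_size = stc_header + len(payload)

# base64: 4 output characters per 3 input bytes -> ~33% overhead.
b64_size = len(base64.b64encode(payload))

print(plain_array_size, stc_size, b64_size)
```

For this 1 KiB payload the plain array roughly doubles the size, base64 adds about a third, and STC adds only its header.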

Another case:

- Hey! You said that UBJSON could be about 20-40% more compact than JSON?

  • In most cases it is...
  • How about this? ["f","o","o","b","a","r","b","a","z"] - 37 bytes. In UBJSON it's 38 bytes. Where is the profit? (Explanation: 2 bytes for the array markers, then per item 3 bytes for type and length plus 1 byte of data.)
  • Have you tried STC? With it the total size is just 14 bytes! (Explanation: 1 byte for the STC marker, 2 bytes for the length, 1 byte for the items' type, 9 bytes for data.)
  • Wow! The JSON variant produces 250% overhead!
  • Yep, we're going to be more compact in some cases.

Note: the last case fails if the JSON guy adds an item of another type to the array, but this is marketing and who cares?(:
UPDATE: oops, my math is off: not 14 bytes, but 1+2+1+9*3 = 31 bytes - still better (:
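The corrected arithmetic can be double-checked with a few lines of Python. This uses the per-item breakdowns quoted in the exchange above; the marker layout is this thread's working assumption, not a published draft:

```python
items = ["f", "o", "o", "b", "a", "r", "b", "a", "z"]

# JSON: 2 brackets + 8 commas + 3 characters per quoted one-char string.
json_size = len('["f","o","o","b","a","r","b","a","z"]')

# Plain UBJSON: 2 array markers + per item (type marker + length marker +
# length byte + 1 data byte) = 2 + 9*4 = 38 bytes, as quoted above.
plain_size = 2 + len(items) * 4

# STC: 1 container marker + 2 length bytes + 1 item-type marker +
# per item (length marker + length byte + data) = 1+2+1 + 9*3 = 31 bytes.
stc_size = 1 + 2 + 1 + len(items) * 3

print(json_size, plain_size, stc_size)
```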

Pretty simple and clear. This type doesn't break the transitive relation with JSON:

JSON array = STC array = JSON array

while providing an option to store similarly typed data in a more optimized way within an array container. It's like the OBJECT type, which has ARRAY semantics with a type restriction on odd items (keys are strings only) and automatic grouping of items into pairs.

So there is no question of adding binary support to UBJSON, just as there is no question of adding chunked support. Both are use cases of the STC type in different contexts; which one applies is up to you.

For example, take C++. Its char type is used for both cases: 1-byte integers and text data. So if you'd like to work with an STC of int8 items as an array, use signed char, since int8 is a signed value. If you'd like to treat the same data as binary, unsigned char is your choice.
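The same distinction can be illustrated in Python, where one buffer of STC payload bytes is either binary data or a list of signed int8 values, depending purely on how the application interprets it. A conceptual sketch, not part of any UBJSON library:

```python
payload = bytes([0x2D, 0x51, 0x6F, 0xF0])  # raw STC payload bytes

# Interpreted as binary data: keep it as-is (the C++ unsigned char view).
as_binary = payload

# Interpreted as an array of int8 values: reinterpret each byte as a
# signed 8-bit integer (the C++ signed char view).
as_int8 = [b - 256 if b > 127 else b for b in payload]

print(as_int8)  # 0xF0 becomes -16
```

The first three bytes here are the 45/81/111 values from the example at the top of this issue; only the application knows whether they are numbers or bytes.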

adilbaig commented Sep 6, 2012

@adilbaig suggested a few weeks ago that Draft 9 be all our other changes except this STC/Binary Array support change to maintain the maximal compatibility and I think I am starting to lean towards that.

That simplifies the spec greatly and gives us time to continue to think on BINARY / CHUNKED support.

+1 to that.

As kxepal and thebuzzmedia pointed out at multiple junctures, we're losing sight of the big picture.

As an applications developer, I can shop around for 5 (or more?) different binary formats that support JSON in one way or another: BSON, BJSON, Smile, Protocol Buffers, etc. Some are specialized for RPC, others focus on high performance, and yet others are too product-specific. None provide exact reproduction to JSON and back. That's our core strength, and that's why I voted "no" for native binary data in the spec (as a reminder, we all agree we can use an array of bytes or even base64-encoded strings. Just add application-specific metadata as objects/key-value pairs. You can never use binary data without an application-specific context anyway).

With regards to STC, it is a niche optimization that only pays dividends with very large arrays. Plus, STC (with a known array length) doesn't address streaming. The STC array specs have ironed out a lot of issues, which has been a great discussion. I agree it is well designed for what it solves (and I sincerely respect everyone's efforts). I just don't think it is a very important problem.

I have worked for many years with JSON both on the browser and on the server, and I have rarely, if ever, come across a large array that contains something useful in the form of all ints or all strings. Generally it is an array of similar objects (ex: an array of comments consists of objects with comment + date + author fields at the minimum). STC adds a layer of complexity for a use-case that rarely occurs at sizes huge enough to make STC a viable optimization. For a real-world example, look at BSON: it stores arrays as "key value" pairs, which is worse than what we have, and they're using it for dbs that are 100s of gigabytes large. Hence I asked @thebuzzmedia for use-cases for STC. I'm not sure this is a real-world problem we need to deal with now.

And now you can shoot me :)

Update : some clarifications.

Collaborator

kxepal commented Sep 6, 2012

@adilbaig ,
Totally agreed about our main feature - full transitive compatibility with JSON.

I just don't think it is a very important problem.

Actually, STC is a reaction to issues #10 and #12: to avoid introducing standalone string chunking and to avoid breaking compatibility with JSON via a binary type.

As a real-world example: CouchDB, a JSON-driven document-oriented database, allows storing attachments within documents. Already-stored attachments are represented in documents as stubs with short info about the attached file. If you want to update a document and attach some files to it in a single request, you need to inline the attachments into the document content, and inlined attachments must be encoded as base64. So if I'd like to cache such a document to send later, storing it in UBJSON format, STC wouldn't help me at all.

TL;DR

I'd like to agree with you, @adilbaig, about common problems and rare use cases. STC helps with storing binary data in UBJSON, with UTF-16 and UTF-32 strings, and with "streaming" large text in chunks, but personally I haven't faced any of these problems yet, with UBJSON or before it. Except streaming large text: chunking large data is intuitive to me, but UBJSON has the array with its streaming nature, so that problem is solved (maybe not in the most optimal way it could be, but it is).

If we can't all agree about STC, let's not rush. Let's finish the STC specification, use cases, and hints (everything needed to cut the time it takes to recall "what ever is STC?") and leave it for the future. Draft 9 already brings a lot of new things and totally breaks compatibility with old Draft 8, while STC is not a Draft 8 evolution, just an improvement and optimization for some cases.

P.S. Probably offtopic but...

For a real-world example, look at BSON, it stores arrays as "key value" pairs, which is worse than what we have and they're using it for dbs that are 100s of gigabytes large.

OMG! Why? And how does it work then..?

@ghost

ghost commented Sep 6, 2012

@kxepal and @adilbaig, I want to personally thank you both for an excellent analysis of the situation and discussion.

Alex, as always, your detailed breakdowns and descriptions were spot on and gave me a lot of clarity on the problem we are proposing to solve.

Adil, as you exactly pointed out:

As an applications developer, I can shop around for 5 (or more?) different binary formats that support JSON in one way or another: BSON, BJSON, Smile, Protocol Buffers, etc. Some are specialized for RPC, others focus on high performance, and yet others are too product-specific. None provide exact reproduction to JSON and back.

That was my core motivation when I originally started spec'ing out UBJSON, and I shouldn't lose sight of it now. I also have a sneaking suspicion that you are spot on about the "real-world impact" of STC.

I propose (as you guys mentioned) making the Draft 9 release include all our optimizations and simplifications except STC and continue to work on this.

adilbaig commented Sep 6, 2012

@kxepal :

OMG! Why? And how it works then..?

I believe they were/are strongly in favor of TLV parsing, just because it is simpler. It seems to me they don't really worry about the amount of space it takes - a valid compromise in the age of huge hard disks.

Interestingly, even though Mongo supports binary natively, they use a different structure called GridFS to store large blobs.

BTW, I do understand your argument for STC and how it is particularly optimized for binary without breaking JSON compatibility. You did remind me that unbounded arrays work as a makeshift solution for string "chunking" and "streaming", by offloading logic to the client. Agreed on both points, and I think we would benefit from a little more thought on this.

@thebuzzmedia -

I propose (as you guys mentioned) making the Draft 9 release include all our optimizations and simplifications except STC and continue to work on this.

Agreed and thank you.

syount commented Apr 16, 2013

An alternative would be for standard mixed arrays to simply allow an encoding to elide the type marker for the next item in an array when it would have the same marker as the previous item, except in cases where the first byte of the item with an elided marker is one of the bytes reserved for use as a type marker.
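That elision rule can be sketched for an int8 array. This is a hypothetical encoding: the reserved-marker set and the 'i' + raw-byte layout are assumptions chosen for illustration, not anything taken from the spec:

```python
# Reserved type-marker bytes (an assumed subset, for illustration only).
MARKERS = frozenset(b'iIlLdDSZTF[]{}<>')

def encode_int8_array(values):
    """Encode small ints, eliding the 'i' marker when the previous item had
    the same marker and the payload byte cannot be mistaken for a marker."""
    out = bytearray(b'[')
    prev_marker = None
    for v in values:
        payload = v & 0xFF
        if prev_marker == ord('i') and payload not in MARKERS:
            out.append(payload)           # marker elided
        else:
            out += b'i'                   # marker written explicitly
            out.append(payload)
            prev_marker = ord('i')
    out += b']'
    return bytes(out)
```

For example, `encode_int8_array([1, 2, 3])` gives `b'[i\x01\x02\x03]'` (one marker for three items), while a value like `0x69` (the byte for 'i' itself) forces the marker to be re-emitted so the decoder stays unambiguous.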

@ghost ghost closed this Jan 15, 2014
