Allow strongly typed arrays to contain strongly typed arrays #43

Closed
edgar-bonet opened this issue May 23, 2014 · 59 comments

@edgar-bonet

Hi!

The current (2014-05-23) version of the specification seems to only allow primitive types (called “value” types) inside strongly typed arrays. As an enhancement, I am requesting that strongly typed arrays be acceptable as elements of strongly typed arrays. This would provide optimization for multidimensional arrays, without resorting to any new syntactic construct.

Example

I have (I really do) a few huge JSON files, each holding, among other things, a thousand arrays that look like this:

[
    [1.23, 4.56],  // this is one datapoint
    [7.89, 0.12],  // another datapoint
    // a few thousand more datapoints...
]

I am looking for a more efficient, JSON-compatible, binary representation of the same data. “Unoptimized” UBJSON yields:

[[]
    [[] [d][1.23] [d][4.56] []]
    [[] [d][7.89] [d][0.12] []]
    // a few thousand more datapoints...
[]]

With this representation, each datapoint costs 8 bytes of data (float32 is enough precision for me), plus 4 bytes of overhead. That's 50% overhead, not so good.

The same array as “optimized” UBJSON is:

[[]
    [[][$][d][#][i][2] [1.23] [4.56]
    [[][$][d][#][i][2] [7.89] [0.12]
    // a few thousand more datapoints...
[]]

Now we have 8 bytes of data + 6 bytes of overhead per datapoint. That's 75% overhead, so the optimization is obviously not good for these small inner arrays.

Per the current proposal, the outer array can also be optimized, which yields the following “recursively optimized” UBJSON:

[[]                 // this is an array
    [$][[]          // of arrays
        [$][d]      // of float32
        [#][i][2]   // inner arrays have length 2
    [#][I][3200]    // outer array has length 3200
    [1.23] [4.56]   // first datapoint
    [7.89] [0.12]   // second datapoint
    // a few thousand more datapoints...

Now we have a really optimized layout with zero overhead.
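For concreteness, here is a rough byte count for this example (assuming the [I] length above is an int16, i.e. 2 bytes):

header : [ $ [ $ d # i <2> # I <3200>         = 12 bytes, paid once
payload: 3200 datapoints x 2 values x 4 bytes = 25600 bytes
total  : 25612 bytes, i.e. about 0.05% overhead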

And importantly, we are not introducing any new syntax, but only specifying that the “type marker” of a strongly typed array is:

[type of array] = [[][$][type of elements][#][length of array]

In the above example, the type marker of the outer array ("[$[$d#i<2>#I<3200>" for short) would be recursively parsed as:

level 0: [$ ┐                   ┌→ #I<3200> = array of length 3200
level 1:    └→ [$ ┐    ┌→ #i<2> ┘           = arrays of length 2
level 2:          └→ d ┘                    = float32

Regards, and thanks for the good job.

@ghost ghost added this to the Draft 12 milestone May 23, 2014
@ghost ghost self-assigned this May 23, 2014
@ghost ghost added the enhancement label May 23, 2014
@ghost

ghost commented May 23, 2014

@edgar-bonet I like the win, in this particular case, but I am having a hard time wrapping my head around the recursive implications here... for example a 3D or 4D array?

I understand the request in a 2D case:

<define header for outer and inner array>
  <data payload>

When I think of allowing 0+ definitions of type-size pairs ($-#), where each one corresponds to a lower and lower level of array, it puts more and more burden on the parser or generator to be context aware.

I'm not against this, I'm just thinking through it out loud... would like others to weigh in as well.

(Restating examples so we can reference them for further discussion...)

1. JSON (40 bytes)

[
    [1.23, 4.56],
    [7.89, 0.12],
    [3.14, 0.08]
]

2. Plain UBJSON (38 bytes)

[[]
    [[][d][1.23][d][4.56][]]
    [[][d][7.89][d][0.12][]]
    [[][d][3.14][d][0.08][]]
[]]

3. Optimized Container UBJSON (47 bytes)

[[]
    [[][$][d][#][i][2][1.23][4.56][]]
    [[][$][d][#][i][2][7.89][0.12][]]
    [[][$][d][#][i][2][3.14][0.08][]]
[]]

@edgar-bonet
Author

Well, the definition

[type of array] = [[][$][type of elements][#][length of array]

being recursive, it allows for arrays of any dimension. I am guessing that any UBJSON implementation already relies heavily on recursion. This proposal just requires a little bit more of it.

I've written a not-so-clean Qt-based implementation, but I do not have access to it now (it's in my office). Anyhow, for the parser, the general idea is to have one recursive function for parsing the “type signatures”, and a second recursive function for parsing the actual data. In JavaScript-style pseudo-code:

function parse_type_signature(byte_stream)
{
    if (byte_stream starts with "[$") {  // fixed type array
        advance byte_stream by 2;  // consume "[$"
        var inner_type = parse_type_signature(byte_stream);  // recurse!
        assert(byte_stream starts with "#");
        advance byte_stream by 1;  // consume "#"
        var count_type = parse_type_signature(byte_stream);
        assert(count_type is some integer type);
        var count = parse_data(count_type, byte_stream);
        return {
                type: FIXED_TYPE_ARRAY,
                element_type: inner_type,
                element_count: count
            };
    }
    else {
        // other types...
    }
}

function parse_data(type, byte_stream)
{
    if (type.type == FIXED_TYPE_ARRAY) {
        var data = new Array of length type.element_count;
        for (var i = 0; i < type.element_count; i++)
            data[i] = parse_data(type.element_type, byte_stream);
        return data;
    }
    else {
        // other types...
    }
}

@ghost

ghost commented Jun 4, 2014

@edgar-bonet I was letting this proposal marinate a little bit in my brain - the space-optimization win is too big to ignore and I've since come around on the upfront definition of the container for an X-dimension set of containers... it is the nature of the strongly typed container anyway, so there is no diversion from what we already have.

I'm asking for feedback from the community on this, but overall I think it's a great proposal.

@meisme

meisme commented Jun 4, 2014

Since the strongly typed arrays can only contain one type of data, these cases can always be "unnested" for serialization, moving the complexity from the (de)serializer to the application. The previous example then becomes

[[][$][d][#][i][6]
    [1.23][4.56][7.89][0.12][3.14][0.08]

This is just 30 bytes (of which 24 are the floating-point data itself). Personally I don't see the need for introducing the suggested complexity, since it will inevitably use more overhead than this. There might be a practical use-case for nested strongly typed objects, though; I haven't thought that all the way through yet.
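For illustration, the application-side unnesting could look something like this minimal sketch (assuming two-element float rows as in the example; flatten/unflatten are made-up names):

#include <cstddef>
#include <vector>

// Flatten [[a, b], [c, d], ...] into [a, b, c, d, ...] before serializing.
std::vector<float> flatten(const std::vector<std::vector<float> > &rows)
{
    std::vector<float> flat;
    for (std::size_t i = 0; i < rows.size(); ++i)
        flat.insert(flat.end(), rows[i].begin(), rows[i].end());
    return flat;
}

// Rebuild the nested rows after parsing, given the known row width (2 here).
std::vector<std::vector<float> > unflatten(const std::vector<float> &flat, std::size_t width)
{
    std::vector<std::vector<float> > rows;
    for (std::size_t i = 0; i + width <= flat.size(); i += width)
        rows.push_back(std::vector<float>(flat.begin() + i, flat.begin() + i + width));
    return rows;
}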

@kxepal
Collaborator

kxepal commented Jun 5, 2014

While the idea looks good, I fear that we'll make UBJSON too complex and ugly by attempting to handle every possible case with "optimized" constructions. UBJSON should provide plain and simple primitives to describe data structures easily and efficiently while staying robust against errors. I feel we've been drifting away from that original idea since Draft 10 (I'll open an issue to discuss this a bit later).

Actually, there is only one case you'd really want optimised: an array of pairs. It's a very common case for representing geo data, money, node relations (A->B), ip-port, and lists of properties (strings). This could easily be done with a new container P (pair), which expects the following byte to be the element type marker (primitives only) and which is the only container allowed in an array type description:

Plain array - 44 bytes: 2 + N(4 + 2*S)*

[[]
    [[][d][1.23][d][4.56][]]
    [[][d][7.89][d][0.12][]]
    [[][d][3.14][d][0.08][]]
[]]

Plain array of pairs - 38 bytes: 2 + N(2 + 2*S)*

[[]
    [P][d][1.23][4.56]
    [P][d][7.89][0.12]
    [P][d][3.14][0.08]
[]]

Optimized array of pairs - 34 bytes: 4 + 2NS

[[][$][P][d]
    [1.23][4.56]
    [7.89][0.12]
    [3.14][0.08]
[]]

For N=1000 the difference becomes more significant: 14002, 12002 and 10004 bytes respectively. Good old JSON is around 15001 bytes (as an array of strings, since we have to preserve accuracy). As for other cases, I don't feel good about overengineering UBJSON to handle X-dimension arrays. Let's make it suitable for real-world problems and let it do simple things, but in the easiest and most effective way possible.

UPDATE: Yes, you may argue that there are many other cases where nested arrays may be applicable. And you'd be right: RDF describes data in triples, so pairs wouldn't fit it well. However, that's the only exception as popular as pairs that comes to my mind, and actually you cannot optimize RDF triples with UBJSON since their elements are strings of variable length.

@kxepal
Collaborator

kxepal commented Jun 5, 2014

But again, all nested-array optimizations require either a two-pass serializer or a predefined data scheme: you cannot apply the optimization until you're sure that all of the array's elements share a common data type - that gives you O(2N) serializing cost, and the size benefit has to be worth it. On real-world high-load tasks nobody will do that, just as everybody uses their own line-per-element protocol for JSON data instead of a read-and-decode strategy.

@edgar-bonet
Author

@meisme: Unnesting would not work for me. I have an application that is already in production and works well with JSON, except that the generated files can occasionally get too big to be practical. Users are familiar with the data model, which is “natural” for their use case. I just want to provide a more efficient serializing format in a way that is transparent to the users, thus without changing the data model. And this is a major selling point of UBJSON: it is supposed to work transparently with your current JSON data model.

@kxepal: Most of the time I will indeed have long arrays of pairs. However, occasionally these can be triplets. Typically, when acquiring data through a lock-in detector I have things like

[external_parameter, signal_in_phase, signal_in_quadrature]

The arrays could even be longer if the user sweeps several parameters simultaneously.

About the complexity: I do not think the complexity for the serializer should be an issue. As with any optimization, it is important that the format stays simple for the parser if the parser has to remain universal. The serializer, even a universal serializer, is free to ignore any optimizations... or implement whatever may make sense for the application.

@edgar-bonet
Author

@kxepal: More on the complexity of the serializer...

I just realized that the O(2N) complexity has nothing to do with this proposal: it's already in the current version of the standard if one wants to leverage the optimization afforded by fixed-type arrays. Here is the algorithm I use for deciding what type to serialize some data into:

/*
 * Find a common supertype of a and b.
 */
function merge_types(a, b)
{
    if (a == b) return either;
    if (either a or b is a subset of the other)
        return the "biggest" type;
    if (a == INT8 and b == UINT8 or vice-versa)
        return INT16;

    // The only part specific to this proposal:
    if (both are FIXED_TYPE_ARRAY with same length) {
        var inner_type = merge_types(a.element_type, b.element_type);
        if (inner_type == INVALID) return INVALID;
        else return FIXED_TYPE_ARRAY of inner_type;
    }

    otherwise return INVALID;
}

/*
 * Get the proper type for serializing this piece of data.
 */
function get_type(data)
{
    if (data.isArray()) {
        var inner_type = NOP;  // compatible with all other
        for each (element of data) {
            var element_type = get_type(element);
            inner_type = merge_types(inner_type, element_type);
            if (inner_type == INVALID) return PLAIN_ARRAY;
        }
        return FIXED_TYPE_ARRAY of inner_type;
    }
    else {
        // other types...
    }
}

As stated in the comments, the only part specific to this proposal is where I try to decide whether two fixed-type arrays have compatible types. I do so by having merge_types() call itself recursively. The recursion is not expected to go deep: only as deep as the dimensionality of the multi-dimensional array.

This is for the universal serializer. It is intended to serialize whatever the user throws at it. I also have a situation where the data is generated by the application itself (a data acquisition process actually) and I know beforehand what kind of data to expect. In this case I can output the known type marker of the multidimensional array and then just stream the data in real time with no size overhead, and a very low processing cost (cast to float, convert to big endian, output).
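For illustration, the per-value work once the header has been written can be as small as this sketch (a sketch only, assuming IEEE-754 float32 and plain stdio output):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Write one float32 value in UBJSON byte order (big endian) to a stdio stream.
void write_float32_be(std::FILE *out, float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);        // grab the IEEE-754 bit pattern
    unsigned char bytes[4] = {
        static_cast<unsigned char>(bits >> 24),
        static_cast<unsigned char>(bits >> 16),
        static_cast<unsigned char>(bits >> 8),
        static_cast<unsigned char>(bits)
    };
    std::fwrite(bytes, 1, sizeof bytes, out);
}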

@meisme

meisme commented Jun 7, 2014

Just to throw it in there, for most strictly statically typed languages (e.g. C++, which I'm in the process of writing a (de)serialiser for) types (including nested ones) can often be known at compile time, meaning that the only increase in complexity will be in interpreting the data.

@Steve132
Contributor

N-D arrays of raw binary data DO have a pretty significant use in scientific computing and many other data sets....but I'm honestly VERY concerned with the amount of extra logical complexity in the parser and reader this proposal will introduce. It also sorta starts to lose 1-1 parity with javascript.

One 'middle ground' proposal that I would dramatically prefer, which would fix all the use cases here while simultaneously getting rid of the complexity in the parser and the writer and give us all the performance benefits, is as follows:

Call them 'Multi-dimensional Strongly-typed arrays'. What we do is simply make it so that after the size information in the STC array header there are 0 or more '@' patterns that optionally specify additional array dimensions for this container. (We can't reuse '#' because it introduces an ambiguity in the parser with a one-d STC array of sized objects.)

For example, here's the OP's proposal, extended to a 3D array (like a 3200x3200x3 bitmap):

[[]                 // this is an array
    [$][[]          // of arrays
    [$][[]      // of arrays
         [$][d]      // of float32
         [#][i][3]   // inner arrays have length 3
    [#][I][3200]    // middle array has length 3200
    [#][I][4223]    // outer array has length 4223

    [1.23] [4.56] [4.56]    // first datapoint
    [7.89] [0.12] [4.56]  // second datapoint
    // a few thousand more datapoints...    

That's fine, but recursively parsing and validating that is a LOT of complexity and a total rewrite of the parser, and it doesn't work if ANY of the recursive elements don't parse properly. It's 18 bytes for the header.

In contrast, here's the MSTA version

[[]                 // this is an array
    [$][d]          // of float32   
    [#][I][4223][@][I][3200][@][i][3]  // first dim is 4223, second dim is 3200, third dim is 3

    [1.23] [4.56] [4.56]    // first datapoint
    [7.89] [0.12] [4.56]  // second datapoint
    // a few thousand more datapoints...    

It's stupid easy to integrate into the parser AND the writer with no recursion or multiple-pass parsing, it has EVEN MORE storage savings than the proposal we are discussing (the header is 14 bytes), and it gives us ALL the performance benefits. Parsing is a LOT easier and the data structure for the type does not need to be recursive or recursively parsed.

@kxepal
Collaborator

kxepal commented Jun 13, 2014

More fun: array of arrays of objects with arrays of array of object with array of strings. Solution?

[[{"foo": [[{"bar": ["baz"], "boo": [...], ...}, ...], ...], "...": [...]}, ...], ...]

I'm starting to feel that the next step after the strong header section will be a full schema definition which describes the data structure, while the parser just reads the data without needing to look up markers, lengths, etc. But don't we inject application-level logic into the data format?

@kxepal
Collaborator

kxepal commented Jun 13, 2014

TBH, I'm +1 for the idea of having a "data scheme map". It opens a lot of doors for a more compact format without making it ugly and complex.

@Steve132
Contributor

kxepal: I'm highly against a schema. It's really, really, really complex. Nobody wants to parse and validate the schema before loading something, and trying to maintain the proper schema while writing tools is horrible. I hate that idea. I can't even imagine writing a parser.

@kxepal
Collaborator

kxepal commented Jun 13, 2014

@Steve132 what you had proposed is a schema too, but for a specific data type (:

@Steve132
Contributor

Arrays of arrays of objects with arrays of arrays of objects with arrays of strings are already supported perfectly fine in draft 10. Any subset of them can be sized or typed. No problem.

@Steve132
Contributor

@kxepal: I know that, but in a sense ALL 'formats' are schemas. I'm only addressing the very real performance need for a multi-dimensional array by adding exactly that: a multi-dimensional array. It's a small change as opposed to a massively complicated DSEL.

@edgar-bonet
Author

@Steve132: I have two issues with your proposal.

The minor issue is that I am not convinced that recursion is all that complex, especially given that any parser is probably already completely recursive. Is that more complex than introducing a new syntax (the [@]) for a new concept (Multi-dimensional Strongly-typed arrays)?

The major issue is that your proposal makes the format ambiguous. How can you tell apart this one-dimensional array:

[[]                 // this is an array
    [$][d]          // of float32   
    [#][I][4223]    // length is 4223
    [3.14138794]    // first datum
    [1.23]          // second datum
    // etc...

from this 2D-array:

[[]                 // this is an array
    [$][d]          // of float32   
    [#][I][4223]    // first dim is 4223
    [@][I][3200]    // second dim is 3200
    [1.23]          // first datum
    // etc...

given that [@][I][3200] and [3.14138794] have the same binary representation?
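The collision is easy to check numerically; a small self-contained sketch (assuming the big-endian byte order UBJSON uses for numeric payloads):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // The bytes '@' 'I' followed by the int16 value 3200 (0x0C80)...
    unsigned char bytes[4] = { '@', 'I', 0x0C, 0x80 };
    // ...reassembled as a big-endian float32 bit pattern:
    std::uint32_t bits = (std::uint32_t(bytes[0]) << 24) | (std::uint32_t(bytes[1]) << 16)
                       | (std::uint32_t(bytes[2]) << 8)  |  std::uint32_t(bytes[3]);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    std::printf("%.8f\n", f);   // prints approximately 3.14138794
    return 0;
}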

@meisme

meisme commented Jun 14, 2014

I much, much prefer this way of doing it. @edgar-bonet is right, though, the definition needs to be

[[]
    [$]<type>
    [@]<x>
    [#]<y>
    <data>

in order to be unique.

@kxepal
Collaborator

kxepal commented Jun 14, 2014

Why use multiple markers to define the same thing?
What if you need a third dimension? A fourth?
What if the data types on the X and Y axes were different?
How would you define complex nested containers?

What I see so far are solutions that are too complex for too specific a case. A data format should be flexible and cover more common cases. Otherwise no one will use these optimizations, because they create too many restrictions and may easily cause unpredictable behaviour: for one dataset our optimizations are applicable, the resulting binary data is compact, transfers fast and decodes fast; for another, the same dataset with slightly different values suddenly starts to require more space to store and more time to transfer and process. That's not a good situation.

@meisme

meisme commented Jun 14, 2014

@kxepal You have a point, however different data types on different axes are not supported with the current recursive strategy either. If that were to be supported you'd almost certainly need to introduce a tuple type. That might not be such a bad idea, though.
Something like

[T]
    [#]<n>
    [$]<first type><second type>...<n-th type>
    <data>

Where a tuple is considered a primitive value. You could then use it in an array like so

[[]
    [$][T]
        [#][i][2][$][d][d]
    [#][i][3]
    [1.23][4.56]
    [7.89][0.12]
    [3.14][0.08]

Naturally you'd also be allowed to specify a tuple as one of the types in a tuple.

edit: For converting UBJSON to JSON, a tuple would always expand to an array with n elements

@edgar-bonet
Author

OK, now that I have an implementation that I like, I think I can say more about the complexity. But first, a question for @kxepal: since your argument against the optimized encodings applies also to the strongly typed arrays we already have, are you proposing to drop them?

Back to the issue: The C++ implementation of the parser and serializer is not substantially more complex than the algorithms I already posted here in my first and third comments... Except for one thing: the parsed type signature is now a recursive structure, which means I had to deal with pointers. In C++, I did it more or less along these lines:

struct UbjsonType {

    // The values of the enum match the markers for easier parsing.
    enum Type {
        NOP   = 'N',
        FALSE = 'F',
        TRUE  = 'T',
        UINT7 = '7',  // internal use only
        INT8  = 'i',
        UINT8 = 'U',
        // ...
        ARRAY = '[',
        FIXED_TYPE_ARRAY,
        OBJECT = '{'
    } type;

    // The fields below are only relevant for FIXED_TYPE_ARRAY.
    UbjsonType *element_type;
    int element_count;

    // Constructor.
    UbjsonType(enum Type t, UbjsonType *subtype = 0, int count = 0)
    : type(t), element_type(subtype), element_count(count) {}

    // Destructor.
    ~UbjsonType() { delete element_type; }
};

Also, since the struct owns the pointer to the nested type, I had to implement The Big Three, which I did by means of copy and swap. This was for me the hardest part, mostly because I am not very proficient in C++ and thus I had to learn about some subtleties such as copy constructors and copy elision optimization. I guess all this should sound trivial to a more seasoned C++ programmer.
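For illustration, the copy-and-swap part might look roughly like this (a sketch of extra members inside the UbjsonType struct above, not my exact code; it needs <utility> for std::swap):

// Copy constructor: deep-copy the nested element type, if any.
UbjsonType(const UbjsonType &other)
: type(other.type),
  element_type(other.element_type ? new UbjsonType(*other.element_type) : 0),
  element_count(other.element_count) {}

// Swap: exchange members, so pointer ownership moves wholesale.
friend void swap(UbjsonType &a, UbjsonType &b)
{
    std::swap(a.type, b.type);
    std::swap(a.element_type, b.element_type);
    std::swap(a.element_count, b.element_count);
}

// Copy assignment by copy-and-swap: take the argument by value, then swap.
UbjsonType &operator=(UbjsonType other)
{
    swap(*this, other);
    return *this;
}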

Once this struct works, the rest is just a straightforward, almost verbatim translation of the algorithms above in C++.

Ah, just a word about the UINT7 type above, although it's unrelated to this proposal. I defined this type for numbers that can be stored as either UINT8 or INT8. When one such number ends up in a strongly typed array, it gets serialized into whatever type is also good for the other elements of the same array. Marking those numbers as UINT7 is a way to defer the final decision until we know more about the other elements of the array.

Just for fun, I also wrote a special parser function that can only read optimized 2D arrays of FLOAT32. This function saves me two levels of recursion when reading such arrays, and it also saves many file reads as it slurps the whole 2D array into memory in one read. Well, these optimizations only got me about 11% speedup, which for me means that all this recursive stuff is really not that expensive.

@Steve132
Contributor

@meisme and @edgar-bonet are right: In order to not be ambiguous, we have to have the @ come BEFORE the #. I didn't realize that.

Additionally, instead of each @ being a DIM, the @ could specify the NUMBER of dimensions incoming... but that's optional, I don't care either way...

@kxepal: I'll answer your questions:

Why use multiple markers to define the same thing?

Because if you don't then a multiple-dimensional array forms a number of parser ambiguities...including the one @edgar-bonet pointed out where the binary form of multiple # in a multi-dimensional array specifier is indistinguishable from a number in a one-dimensional array specifier.

What if you need a third dimension? A fourth?

A four-dimensional array of floats (12x24x4x10):

[[][$][d][@][i][12][@][i][24][@][i][4][#][i][10]
      [3.21][323.23] .. more data

What if the data types on the X and Y axes were different?

This is a nonsensical question. Even the recursive version of strongly typed arrays containing strongly typed arrays means that, eventually, there is exactly ONE type. You can't have different types along different axes. If I'm wrong, please provide a multidimensional strongly typed array that shows what you mean.

How would you define complex nested containers?

What complex strongly-typed containers could you NOT define by this proposal? Draft 10 already defines complex nested containers of dynamic size, the only advantage of this proposal is the performance boost we can get by defining multiple statically-sized statically-typed recursive containers. Can you please provide an example of the kind of container you do not believe you would be able to represent, because I can't think of one.

@edgar-bonet: Please correct me if I'm wrong, but can you think of ANY data structure that you could represent by implementing the 'recursive' version of the proposal that is NOT semantically equivalent to a multi-dimensional array? Because the way I see it, the ONLY power this proposal adds is the ability to represent a statically-typed multi-dimensional array and the additional performance that gives us. It gives us no additional power unless I'm mistaken.

GIVEN that argument, we can then look for the simplest and cheapest way to implement a multi-dimensional array in the spec, the one that gives us the fewest bytes and the simplest parser and writer... and given that, it seems to me that my proposal for a multi-dimensional array is FAR simpler.

Consider that, in your implementation example, implementing the multi-dimensional array requires promoting the TYPE marker from a simple enum to a complex parsed type-marker structure. This would dramatically change the reader and writer to do two passes of recursion, add context sensitivity to the parser and a TON of complexity for context sensitivity to the writer, and require that everywhere a TYPE marker is used or checked, the code checks whether it is one of the base type enums or a data structure containing the recursive fingerprint for the array.

In contrast, my C implementation of the parser can incorporate this change EASILY by changing:

static inline void priv_read_container_params(ubjr_context_t* ctx, UBJ_TYPE* typout, size_t* sizeout)
{
    int nextchar = priv_ubjr_context_peek(ctx);

    if (nextchar == '$')
    {
        priv_ubjr_context_getc(ctx);
        *typout = priv_ubjr_type_from_char(priv_ubjr_context_getc(ctx));
        nextchar = priv_ubjr_context_peek(ctx);
    }
    else
    {
        *typout = UBJ_MIXED;
    }

    if (nextchar == '#')
    {
        *sizeout = priv_ubjr_read_integer(ctx);
    }
    else
    {
        *sizeout = 0;
    }
}

to

static inline int priv_read_container_params(ubjr_context_t* ctx, UBJ_TYPE* typout, size_t* sizeout)
{
    int nextchar = priv_ubjr_context_peek(ctx);
    int cdim;
    if (nextchar == '$')
    {
        priv_ubjr_context_getc(ctx);
        *typout = priv_ubjr_type_from_char(priv_ubjr_context_getc(ctx));
        nextchar = priv_ubjr_context_peek(ctx);
    }
    else
    {
        *typout = UBJ_MIXED;
    }

    for(cdim=0;nextchar=='@';nextchar=priv_ubjr_context_peek(ctx))
    {
        sizeout[cdim++]=priv_ubjr_read_integer(ctx);
    }

    if (nextchar == '#')
    {
        sizeout[cdim++] = priv_ubjr_read_integer(ctx);
    }
    return cdim;
}

By adding only a few lines grabbing the dimensions, the parser now works without changing anything about the API... the amount of data to be read in one call to fread is simple to calculate linearly, as the product of the array sizes from 0 to cdim.
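To make that concrete, the calculation is just a product over the dimensions (a sketch; dims/ndims stand for the sizeout/cdim values filled in above):

#include <cstddef>

// Total element count of an N-dimensional STC is the product of its dimensions.
std::size_t total_elements(const std::size_t *dims, int ndims)
{
    std::size_t n = 1;
    for (int i = 0; i < ndims; ++i)
        n *= dims[i];
    return n;   // multiply by sizeof(element) to get the byte count for one fread()
}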

The writer is similarly trivially updated by simply outputting the appropriate number of '@' symbols in the header... a single for loop.

Again, this is a MUCH simpler way of representing a multi-dimensional array that doesn't require implementing a context-aware recursive descent parser and representing types as a dynamic data structure that has to be validated and parsed in two passes. It's also got better space performance, as it uses fewer bytes.

Considering that we don't get any additional power out of the recursive type proposal over the multi-dimensional array, it seems better to do it this way.

@kxepal
Collaborator

kxepal commented Jun 15, 2014

@kxepal: Since your argument against the optimized encodings applies also to the strongly typed arrays we already have, are you proposing to drop them?

Yes, that's the ugliest part of UBJSON. I'm working on an RFC that attempts to fix it, returning the format to the Draft 9 days when it was simple, clean and beautiful, but effective. Sorry, but having one type which acts in 3 different ways (unsized array, sized array, strongly typed array), and now trying to add another one (multidimensional) - that is the biggest mistake. In that case, better to take MessagePack or BED if you want compact data.

How would you define complex nested containers?

What complex strongly-typed containers could you NOT define by this proposal? Draft 10 already defines complex nested containers of dynamic size, the only advantage of this proposal is the performance boost we can get by defining multiple statically-sized statically-typed recursive containers. Can you please provide an example of the kind of container you do not believe you would be able to represent, because I can't think of one.

"...the performance boost we can get by defining multiple statically-sized statically-typed recursive containers" is the key phrase of this issue. Noted. We're on different pages. You're trying to solve a narrow, specific case while I'm trying to find a more general solution. Never mind. Sorry for wasting your time.

@Steve132
Contributor

I agree with you that this performance boost is the key point of this issue and I agree that it is desirable enough to warrant proposing a change for. The point I'm making is that the same performance boost can be gotten from ANY n-d array semantic, and furthermore that the proposed recursive type stuff doesn't provide ANY more semantic power than any other N-d array semantic. Therefore what we are actually discussing is "an Nd array is desirable. Therefore what/how should we implement it". The original proposal is one solution, but it is a particularly complex and inefficient way of representing an Nd array compared to my proposal

@meisme

meisme commented Jun 15, 2014

Depending on what we want accomplished, my vote goes for either the introduction of a tuple type more or less as I outlined, or @Steve132's multi-dimensional array with @ specifying the number of dimensions followed by one # marker per dimension. I'm fine with either of these choices.

@ghost

ghost commented Jul 11, 2014

@Steve132 Responding to your last post - the STC spec currently allows the [$] type marker to only be one of the Value Types, so no containers-of-containers.

@Steve132
Contributor

UBJSON imposes (unnecessary?) restrictions on you if you want to leverage strongly typed containers - in that those containers cannot contain other containers or mixed types...

I don't agree that this is written explicitly in the spec. I certainly didn't see it that way and I didn't implement it that way in ubj. I just re-called the type-parsing code but saved which type it was, including other containers. Worked fine and had fewer special cases

the STC spec currently allows the [$] type marker to only be one of the Value Types, so no containers-of-containers.

Ah, ok. So, all you have to do is remove that restriction. The "containers of containers" use case is then solved. No schema parsing or pre-recursive validating needed.

If you relaxed that restriction (and ubj shows that you can; I didn't even realize the restriction was there when I implemented it... leaving it unrestricted made for prettier code, so it didn't occur to me), that would mean you could choose to encode the "Draft 11" version of your example as

[{][#][i][3][$][[]
    [i][6][array1][#][i][3][$][{]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [i][6][array2][#][i][3][$][{]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [i][6][array3][#][i][3][$][{]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]
    [#][i][3][$][[]
        [i][5][vals1]
        [$][i][#][i][3][16][32][64]
        [i][5][vals2]
        [$][i][#][i][3][16][32][64]
        [i][5][vals3]
        [$][i][#][i][3][16][32][64]

in the current draft 11. This is actually even a little bit bigger than the current JSON, but for arrays larger than 3 the STC savings get significant, even in draft 11, without having to change anything at all. (This somewhat furthers my argument that the example given is somewhat of a pathological case, because Draft 11 with STCs starts to get smaller than Draft 11 without STCs for inner arrays of size 4, literally the next size up.)

In summary, my argument is that the use case of "Using STCs inside other STCs" is already supported by draft 11 (if you remove that restriction which I didn't notice before) and already provides size benefits except for sizes <= 3, so that use case doesn't need to be addressed,

And the use case of "efficient and simple n-dimensionality" is not supported by draft 11, but would be easily added by adding a @n specifier followed by n integers in the container header parser code, which is how most n-dimensionality libraries already work and would be a billion times simpler.

@ghost

ghost commented Jul 11, 2014

In summary, my argument is that the use case of "Using STCs inside other STCs" is already supported by draft 11 (if you remove that restriction which I didn't notice before) and already provides size benefits except for sizes <= 3, so that use case doesn't need to be addressed,

Heh, just to be clear, it's not officially supported in Draft 11, but I certainly appreciate that you have a proof of concept that shows how trivial it would be to add support for it.

Maybe we break the work into two pieces:

Draft 12 (proposed)

  1. Remove restriction of STC containing STC.
  2. Modify definition of STC to be COUNT-TYPE (not TYPE-COUNT) as requested by @edgar-bonet to make the grammar cleaner.

RESULT: No major change to the spec; no major increase in size/performance.

Draft 13 (proposed)

  1. Discuss the generalization of STC headers to be allowed to be used anywhere inside a container, leading to potentially huge space savings in select cases (e.g. scientific data) - Redefine strongly typed containers as a more flexible, repeatable, embeddable construct. #48

RESULT: Significant change to parsing/generating of STCs; potentially huge win in data size and parsing performance in special cases.

Thoughts?

@Steve132
Contributor

Regarding your points:

Proposal 48 seems entirely unnecessary to me as a way of solving the intended problem. The "problem" is that STCs can't contain containers. As I said, as an implementer I implemented draft 11 UNAWARE of this restriction and my code gracefully handles it easily. This shows that the restriction is entirely cosmetic. If the purpose of proposal 48 is to solve this problem, the text of proposal 48 should contain the following string only: "Proposal: remove the restriction that the marker following a [$] must be a value type". Solved.

Point 2: I'm not sure I understand how relevant this is, as far as I understand it neither STCs nor this Schema proposal can be used with mixed-type containers. Perhaps I'm confused?

Point 3: It's a limitation we can design out of the spec by simply deleting the offending restriction from the text of the spec. No other changes needed :)

Points 4 and 5 can be used to justify literally any change. If I proposed that we include a variant of the JPEG compressor in order to have significant savings in the use case of sending binary-encoded bitmap data in 2-D arrays, I could easily justify it by saying "Well, in certain cases the space savings are significant and we expect changes to happen over time and things mature etc." That wouldn't make my proposal a good one for UBJSON.

Regarding point 5 in particular, I think the whole point is that XML is bad because of how complex it got. All the schemas and XSLT and complicated parsing rules and interactions with utf-8 meant that in practice nobody had a conformant implementation and nobody actually liked using it anyway. When JSON came out everyone was happy BECAUSE it was so radically simple and threw away all of the things that came tacked on over time to make XML more "flexible". Now, JSON has some of that same cruft, but I don't know anyone that uses those crufty parts.

I was under the impression that the primary reason to use UBJSON over something like Smile was that UBJSON was different from Smile: low complexity, easy to parse, few corner cases or restrictions, no application-specific optimizations or data-types. I feel like part of the POINT when discussing UBJSON is that we are going to make sure that UBJSON is NOT Smile and never BECOMES Smile.

People use JSON because its NOT XML.

@Steve132
Contributor

So, I totally agree with your draft 12 btw.

Draft 13 then would be where we push the issue of how best to handle the 'scientific data' header. My argument is that the best way for scientific data is just the n-d array specification with @n and dimensions, because it leads to super fast super efficient code that maps well to how scientific data libraries currently work, and (as someone who wants to use ubjson for scientific data) it covers all the use cases I can foresee.

But discussing whether to do the schema version of the n-d array or my version would be best pushed to draft 13, I agree, and is a separate issue.

@ghost

ghost commented Jul 11, 2014

The "problem" is that STCs can't contain containers.

It's more than that - as you pointed out, that limitation is easy to solve - but the limitation of changing your data model to match UBJSON's limitations is still there... for example, consider an array with the values:

[1, 2, 3, 300, 301, 302, 1.23, 2.34, 3.45]

You cannot model that in UBJSON as a single (optimized) array - you would either need to break it into 3 arrays (problem: changing your data to meet the limitations of a spec) or use a single unoptimized array (problem: no space/performance savings).

If the purpose of proposal 48 is to solve this problem, the text of proposal 48 should contain the following string only: "Proposal: remove the restriction that the marker following a [$] must be a value type". Solved.

Proposal #48 is trying to solve much more than that. One of the biggest differences is the ability to define the STC headers for all data upfront, if your data allows it, and avoid repeating them inside the payload.

Point 2: I'm not sure I understand how relevant this is, as far as I understand it neither STCs nor this Schema proposal can be used with mixed-type containers. Perhaps I'm confused?

You are right, STC and what is being discussed in this issue cannot - that is why #48 was written as a super-solution to all of these limitations. If you get a chance to read it in detail, let me know your thoughts (it reads pretty quickly, just big examples)

Point 4 and 5 can be used to justify literally any change.

To your larger point ("let's be realistic with what we perceive as wins") I agree; I am perceiving a fairly common scenario (especially in geo-spatial data and scientific data) where allowing a singular header definition of containers can lead to a significant performance win.

I didn't think I was being too unreasonable here; any data that consists of large collections of small containers benefits from this change; the smaller the container, the bigger the savings (up to something like a 90% size reduction).

Right? I hope I'm not being obtuse here...

I think the whole point is that XML is bad because of how complex it got.

Uh oh... religious discussion :)

Now, JSON has some of that same cruft, but I don't know anyone that uses those crufty parts.

I am not deaf to your arguments, trust me, I hear them. I just think the sample size of this group is uselessly small to pretend we know exactly what JHL would do with UBJSON, or what Valve is doing with it in their OpenGL debugger, or what the Smart Meets social network in France is doing with it, or what CouchDB might do with it as a backing data storage format.

(as examples of why I think the argument of 'I don't know anybody that...' is valueless or just not needed here)

People use JSON because its NOT XML.

I understand - again, I'm not suggesting this change be made out of spite for the users of UBJSON; I am perceiving a big benefit to it that makes it worth it.

So, I totally agree with your draft 12 btw.

Ok

But discussing whether to do the schema version of the n-d array or my version would be best pushed to draft 13, I agree, and is a separate issue.

Ok - would like to give other's a chance to weigh in before any decisions are made. Thanks for the discussion!

@meisme

meisme commented Jul 12, 2014

To me (and my code) normal, count-optimised and type-optimised containers are fundamentally different, and I appreciate being able to tell them apart by the first byte. I would not like to see the # marker placed before the $ marker for type-optimised containers if draft 12 ends up being chosen.

@ghost

ghost commented Jul 12, 2014

@meisme can you give a little more detail?

@meisme

meisme commented Jul 13, 2014

Well, as it currently stands I've got more or less

Parser::_parse_array() {
    switch (_in.peek()) {
    case '#':
        _parse_array_countoptimized();
        break;
    case '$':
        _parse_array_typeoptimized();
        break;
    default:
        _parse_array_normal();
    }
}

As you can see it's easy to implement the delegate function. The _parse_array_normal and _parse_array_countoptimized functions are very similar, with countoptimized basically just priming the memory and calling the other. _parse_array_typeoptimized, on the other hand, is very different, featuring performance optimizations. It's very nice to be able to distinguish this right away.

If # were moved first, the logic would have to be as below, which is more ... spaghetti.

Parser::_parse_array() {
    if (_in.peek() == '#') {
        _parse_array_count();
    } else {
        _parse_array_normal();
    }
}

Parser::_parse_array_count() {
    int len = _parse_count();
    if (_in.peek() == '$') {
        _parse_array_typeoptimized(len);
    } else {
        // prime memory
        _parse_array_normal(len);
    }
}

I mean... sure, this works too. But it's uglier.

@edgar-bonet
Author

I personally do not see the point of the count optimized arrays. You can not really prime the memory since you do not know the size of your objects. Why not drop it from the spec?

@Steve132
Contributor

You can not really prime the memory since you do not know the size of your objects.

Yes, you can. You can't prime ALL the memory all the way down, of course,
but you can prime the size of the array you are using to store the children.

See https://github.com/Steve132/ubj/blob/master/ubjr.c#L358

In UBJ, a count-optimized array of mixed children still travels along the fast path and has minimal memory allocations, because the mixed type is implemented as a union. Because the containers only store pointers, then the union size is actually fairly small and has constant size, then if you know the size in advance you can pre-allocate N instances of the union and then populate it dynamically. This is the same code path as it is for the fixed-size types and uses O(1) memory allocations instead of O(log(n)) allocations for the unknown case. It's also simpler.

Especially if none of the mixed cases are containers: In that case, then you only need to do a single allocation for the entire array.
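A rough sketch of that idea (the types and names here are illustrative, not the actual ubj API):

#include <cstddef>
#include <vector>

// A mixed value as a small tagged union: numbers inline, strings/containers by pointer.
struct Value {
    char marker;                          // UBJSON type marker, e.g. 'i', 'd', 'S', '['
    union { double num; void *ptr; } u;
};

std::vector<Value> read_counted_mixed_array(std::size_t count)
{
    std::vector<Value> items(count);      // one O(1) allocation up front
    for (std::size_t i = 0; i < count; ++i) {
        // items[i] = read_one_value(...);   // hypothetical per-element reader
    }
    return items;
}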

@ghost

ghost commented Jul 15, 2014

@meisme thank you for the detail -- I would appreciate it if you and @edgar-bonet could come to a consensus on the order of # and $ you would like to see as you are both requesting opposite things for good reasons.

@edgar-bonet count-optimized opens the doors to potentially very highly tuned implementations of UBJSON parsing -- for example parsing same-typed arrays of known size (particularly important in binary cases).

@AnyCPU
Collaborator

AnyCPU commented Jul 16, 2014

Are there real use cases in production that use the new additions?

@ghost

ghost commented Jul 16, 2014

@AnyCPU can you clarify what you mean? which new additions specifically? (no one is using anything from this proposal or #48 because they haven't been ratified yet)

@kxepal
Collaborator

kxepal commented Jul 17, 2014

Hey guys,
I've tried to follow the discussion for the whole day, but I'm really lost about which problems you're trying to discuss and which proposal points they relate to. How about someone submits the STC proposal as a PR, so we can discuss it paragraph by paragraph and localize the issues, instead of applying them to the whole specification text?

@ghost

ghost commented Jul 17, 2014

@kxepal It is a long discussion, let me summarize:

  1. Most everyone is on board with the idea of STCs containing other STCs - in fact one of the implementations already supports it by accident :)
  2. There is a mini-proposal to change the ordering of the $ and # markers, but @meisme and @edgar-bonet have opposing justifications and haven't converged on this, so I'm not changing anything.
  3. Taking this proposal and adding your idea of repeatability of a header is what spun off Redefine strongly typed containers as a more flexible, repeatable, embeddable construct. #48, but there is some discussion about that stuck in here as well.

Guys, please summarize anything I missed.

Overall it looks like opening the doors to having STCs contain STCs will most likely happen - and the shape of #48, because of @Steve132's recent feedback, may morph a little... see my latest comment there - #48 (comment)

@AnyCPU
Collaborator

AnyCPU commented Jul 18, 2014

I'm asking where they came from. Before proposing new additions, have they been proven in production?

As I understand it, UBJSON is more of a storage data format than a structural/container format.

With a strongly typed array that has only one type but contains multiple inner arrays, we will still have reduced parsing speed, because such an array will be parsed in all cases, for example when skipping or seeking to some object.

Storage must be as simple as possible.

@ghost

ghost commented Jul 19, 2014

With a strongly typed array that has only one type but contains multiple inner arrays, we will still have reduced parsing speed, because such an array will be parsed in all cases, for example when skipping or seeking to some object.

Great point. In the DB storage case you are right: if you are seeking on disk and hit a record that you want to seek past, the current form of UBJSON would need to be augmented with a DB-centric format that carries the complete structure size, in bytes, so it could be skipped over.

I don't think this is a shortcoming of UBJSON necessarily, I think this can be easily augmented with a DB-specific format similar to how systems like CouchDB augment standard JSON payloads with their "_name" params.

In short, what I'm saying is that regardless of what we figure out here, that doesn't necessarily make the problem unsolvable.

@ghost

ghost commented Jul 19, 2014

SUMMARY (July 19, 2014)

@ghost

ghost commented Jul 29, 2014

This discussion thread ended up being very valuable - a lot of good ideas in here and I'll continue to break out sub-ideas into other proposals but tie up this discussion for the sake of moving the spec forward.

ACCEPTED - http://ubjson.org/type-reference/container-types/#optimized-format

Added to the spec the ability to define a container marker as a value for $ - easiest change.

We can pickup the discussion around headers/schemas/etc. in another proposal.

@ghost ghost closed this as completed Jul 29, 2014
@ghost

ghost commented Jul 30, 2014

I should have added this comment when closing this issue - addressing @edgar-bonet's original example:

[
    [1.23, 4.56],
    [7.89, 0.12]
]

this would now look like:

[[][$][[][#][i][2]
    [[][$][d][#][i][2][1.23][4.56]
    [[][$][d][#][i][2][7.89][0.12]

The change in this bug addressed the purpose/title of the bug (simply removing the limitation of an STC not containing an STC), but it did not address the proposal that @edgar-bonet made inside his recommendation, namely a way to efficiently define N-dimensional arrays.

There are 2 concerns that fell out from this discussion that still need to be addressed separately:

  1. N-dimensional arrays - I am not sure whether I want a special/different construct for these or whether they can be defined with a more generic format that applies to Arrays in general.
  2. Object and Array "schemas" defined in the header of the container, so that the payload for the container and subsequent nested containers is just raw data (carried into #48 "Redefine strongly typed containers as a more flexible, repeatable, embeddable construct" and #50 "Container Elements Group Schema").

This issue was closed.