Container Elements Group Schema #50
Comments
@kxepal This is a sizeable post, so I'm going to try and address it in chunks to make it more digestible. I think I've read through this 6x and here are my thoughts: I like:
I don't like:
I really don't like:
Question
What am I missing? Or do you just mean the optional-nature of the T/L values that you don't like? |
As much as you need. We shouldn't limit our users just because we think that their data is too complex. It may be too complex, but if it has a repeatable format, optimizations would be very helpful.
You're wrong about doubling storage space: as for case
You don't like the special case or that numeric types must have length specified? Because the latter is what current STC have to do (
Null semantically means "nothing". If you should have a field but you don't have a value for it, you put null there. It plays the same role for the Length and Type parts: it lets you reserve the slot without specifying its value, so the parser always knows how much to read, but is free to ignore nulls.
Because you count
Two objects plus a stream of data, while it describes a single logical object. So you introduce relations between markers into the specification, which isn't a good sign, since all the current objects are defined atomically by a single TLV structure. |
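To make the "single TLV structure" point concrete, here is a minimal sketch of an atomic Type-Length-Value reader. The marker set, sizes, and byte layout here are assumptions chosen for illustration, not the normative UBJSON spec:

```python
import struct

# Hypothetical fixed-size numeric markers: marker -> (struct format, byte length)
NUMERIC = {b"i": ("b", 1), b"I": (">h", 2), b"l": (">l", 4), b"d": (">f", 4)}

def read_value(buf, pos):
    """Read one atomic Type-Length-Value unit starting at buf[pos].

    Returns (python_value, next_position). A sketch only: real UBJSON
    has more markers and full container handling."""
    marker = buf[pos:pos + 1]
    pos += 1
    if marker == b"Z":                      # null: the type marker alone is the value
        return None, pos
    if marker in (b"T", b"F"):              # booleans optimized into the marker itself
        return marker == b"T", pos
    if marker in NUMERIC:
        fmt, size = NUMERIC[marker]
        val = struct.unpack(fmt, buf[pos:pos + size])[0]
        return val, pos + size
    if marker == b"S":                      # string: numeric length TLV, then bytes
        length, pos = read_value(buf, pos)
        return buf[pos:pos + length].decode("utf-8"), pos + length
    raise ValueError("unknown marker %r" % marker)
```

The point of the atomicity is visible here: every value is fully decodable from its own marker onward, with no reference to earlier markers.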
This is a great proposal. The biggest challenge is how much it will impact the implementation, so ideally there should be a reference implementation before it becomes standard. I agree with @thebuzzmedia about not liking the special treatment of null. If it has to be special, then we need a TypeSpec to specify that the type is optional/nullable. Nullable values may be nothing, but they may still be semantically important. Should there be a placeholder TypeSpec? Say we have an array of pairs, where the first element is an integer and the second is a heterogeneous type; then we could use underscore as an indication that more type information will be inlined (e.g.
|
@breese Awesome idea with
Can you clarify what you mean by "special treatment of null" and why it's bad? I have a few thoughts about it, but I want to be sure that we are on the same wavelength. Thanks. /cc @thebuzzmedia |
@breese I agree - Alex put together something really fantastic here and I'm using it as a guide to incrementally pull his ideas out in chunks, mix them with the existing spec and move it forward a draft at a time.
#5 is the trickiest... but this is how I've mapped out pulling his core ideas out of this and merging them in nice, digestible pieces that we can debate individually. |
@thebuzzmedia I believe that #48 will pass soon enough. Then I think that ratyfing and polishing of the typespec shall follow shortly. The typespec should resolve much of the problems of n-dimensional arrays, while maintainig mixed-type containers. But it still offers much optimizations - especially in the Record example by @kxepal . Such record specification could be great in general, I see it as a simple kind of XSD (XML Schema Definition) but limited only to syntax of messages (parts of messages). Having typespec would allow even describing messages in documentations in simple, strict format etc. |
@Miosss 100% agree, @kxepal did an awesome job with this. Most all of the ideas in Draft 12 are just me massaging what he has presented here into something that 'feels' more like the original spec or is a more minor incremental stepping stone for us forward - making small changes one at a time, stepping towards the eventual goal. You are right that the potentially biggest win here is from what Alex is calling TypeSpec (basically schemas) -- unfortunately it's also probably one of the most complex to define and implement for both generators and parsers. You'll notice our #2 Goal is 'ease of use' and schemas definitely don't fall into that category :) That said, like everything else we have to discuss, think, suggest, debate and massage these ideas to move them forward. We've been discussing the typed containers for 2 years, but they have FINALLY gotten to a place that is 10x better than I originally came up with - they are intuitive, clean, conceptually simple and (relatively) easy to implement. I want to do the same with the schemas/TypeSpec idea. |
@thebuzzmedia Point taken; I am also very reluctant about schemas like XML's. What I understand UBJSON to be is a data serialization format (maybe one of the definitions), and what I keep repeating is that I believe syntax is the key and semantics should not get anywhere near it (JSON handles this fine). Therefore, I think that typespec could be a potential win for extremely optimizing the type declaration of a container, as in the Record example. And only for that, I suppose. Any other "schema" language, message definitions, etc. was just a flash in my mind, a wrong one : ) |
@Miosss Agreed; and just for the record, I am just as excited about the potentially big space savings introduced by TypeSpec/Schemas/whatever as the next guy :) |
After the talk in #61 I have some thoughts:
Personally I would much rather forget about repeatable headers than forget typespec. Typespec does not require repeatableness (is there such a word?) : ) After a general discussion of typespec (and its benefits over ND, for example) we could get on with the forgotten topics. |
As I mentioned in #60, I do not see the need for repeatable headers at all. I don't see what it gives us that we don't already have. Isn't the point of [N] that it's a no-op, with absolutely no meaning? Whenever my implementation sees a [N] it advances a byte without changing any state. If we were to give it a semantic meaning, I don't understand what the difference between [Z] and [N] would be any more. The advantage of typespec over the ND proposal is the ability to compress nested structures with mixed types at the cost of increased complexity. I agree that it would make sense to hash out that discussion without going ahead on any other proposals. |
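A sketch of the [N] no-op behaviour described above: the parser advances past it without touching any state. The marker byte and loop shape are assumptions for illustration:

```python
def skip_noops(buf, pos):
    """Advance past any [N] no-op markers without changing parser state.

    [N] carries no semantic meaning, unlike [Z], which is a real null
    value. A sketch of the behaviour described above, not normative."""
    while pos < len(buf) and buf[pos:pos + 1] == b"N":
        pos += 1
    return pos
```

Giving [N] any semantic meaning would break exactly this property: the loop could no longer discard it unconditionally.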
@kxepal To efficiently support ND arrays with this idea, there should probably be a way to specify a long sequence of same-type data.
is a lot easier than:
So I suppose that consecutive pairs should be allowed. What about the [#] marker? |
Okay, having reread a few of these issues I have a clearer understanding of what folks are saying in this topic now. Here's my $0.02 on a few of the outstanding hesitations:
Documents that compress well have lots of repetition in them. That ought to be taken advantage of, not designed out of the spec. Something EXI does on this front is keep a string index: on the first occurrence of every string they give it an index value, and then in subsequent string positions they can use the index value instead of repeating the string. Or repeatable headers are also good...
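A rough sketch of the EXI-style string index idea, using a simple tagged in-memory representation rather than any real EXI wire format (the tuple tags are made up for this sketch):

```python
def index_strings(values):
    """EXI-style string table: the first occurrence of each string is sent
    literally and assigned the next index; later occurrences send only the
    index. Returns (table, encoded), where encoded items are either
    ("literal", s) or ("ref", index)."""
    table = {}
    encoded = []
    for s in values:
        if s in table:
            encoded.append(("ref", table[s]))
        else:
            table[s] = len(table)
            encoded.append(("literal", s))
    return table, encoded

def decode_strings(encoded):
    """Rebuild the original string sequence, growing the table as
    literals arrive."""
    table, out = [], []
    for kind, payload in encoded:
        if kind == "literal":
            table.append(payload)
            out.append(payload)
        else:
            out.append(table[payload])
    return out
```

For documents with many repeated field names, each repeat shrinks to a small integer reference instead of a full length-prefixed string.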
@kxepal, I don't see how this allows an STC to contain a [Z] for a value. An array of 4 numeric values with one of them [Z], i.e. [1, 2, [Z], 4], is not the same as [1, 2, 4]. |
@MikeFair see first post:
This behaviour strongly depends on what data the parser is handling: if it's a typed array, then that's an error - you cannot cast null to some int or float explicitly. If it's an array of objects, which is what an SQL ResultSet actually is, nulls are ok.
I saw that; what I can't see is the parser logic that implements it. Let's say I have a binary bitstream of 15 2-byte values declared in a |
if it's typed array, than that an error
Unless what you're saying here is that you can't declare a null value for a strongly typed position.
|
Right. For a bitstream, nulls are not expected, since null is effectively NaN there. That's why the parser first reads the type spec and configures a state that is used for further data decoding until the next type spec header comes (if any) or the stream of data ends. |
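A sketch of that stateful decoding loop: read a type-spec header, then decode fixed-size values of that type until the next header or the end of the stream. The '$' header marker and the type table are assumptions for illustration, not the spec:

```python
import struct

# Assumed fixed-size markers: marker byte -> (struct format, byte length)
FIXED = {b"i": ("b", 1), b"I": (">h", 2), b"d": (">f", 4)}

def parse_stream(buf):
    """Decode a stream of typed runs. '$' introduces a type-spec header
    (an assumption for this sketch) that switches the decoder state; all
    following bytes are decoded in that type until the next header.
    Note there is no room for a null inside a typed run."""
    pos, out = 0, []
    fmt = size = None
    while pos < len(buf):
        if buf[pos:pos + 1] == b"$":          # new type header: switch state
            fmt, size = FIXED[buf[pos + 1:pos + 2]]
            pos += 2
        else:                                 # raw value bytes in current type
            out.append(struct.unpack(fmt, buf[pos:pos + size])[0])
            pos += size
    return out
```

Because the state is purely "current type and width", the decoder never needs per-value markers, which is exactly why a [Z] cannot appear mid-run.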
Ok but that obviously degrades the optimization opportunity for what I As the array gets longer this space penalty increases. I guess this is good in the name of simplicity but what do you think of
On this last idea, for each typed entry there would be two data arrays; the
That array is then followed by the actual data array. The analyzer would |
@thebuzzmedia, @kxepal, While it may be something you guys have discussed before, I found a mechanism to get around always having to predefine the length of, and to allow nulls in, the bitstream of strongly typed arrays: use a second value after the magic value to determine how to interpret the preceding entry (i.e. use an escape sequence inside the bitstream). Here's how it works:
For instance, when storing a single 8-bit byte, integer treatment of these values ranges from 0-254. As the number of bytes for the type increases, the length the array has to be in order to exhaust all possible values grows exponentially. This makes it reasonable to assume that, for the most part, every strongly typed bitstream can find an escape value with no, or at least low, collisions with real values in the stream.
This method also gives a number of other commands, if you think they would be useful. For example, say a 3 means to repeat the value preceding the magic value some number of times (specified by the next value in the sequence), or a 2 means to insert some memorized predetermined pattern (like DEADBEEF) from a lookup table (the index of which is specified by the next value in the sequence). So, putting it all together, if our escape sequence was '' then our array could be: |
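A sketch of the escape/magic-value idea under the assumptions that values are single bytes, that code 0 after the escape means "this slot is null", and that code 1 means "a literal occurrence of the escape byte" (both codes are invented here):

```python
def pick_escape(data):
    """Pick a byte value not present in the data to serve as the escape
    (magic) value; fall back to the least frequent byte if all 256 occur."""
    used = set(data)
    for candidate in range(256):
        if candidate not in used:
            return candidate
    return min(range(256), key=data.count)

NULL_CODE = 0   # escape, then 0 => this slot is null   (assumed code)
SELF_CODE = 1   # escape, then 1 => literal escape byte (assumed code)

def encode_with_escape(values):
    """values: list of ints 0-255 or None. Returns (escape, encoded bytes)."""
    present = [v for v in values if v is not None]
    esc = pick_escape(bytes(present))
    out = bytearray()
    for v in values:
        if v is None:
            out += bytes([esc, NULL_CODE])
        elif v == esc:
            out += bytes([esc, SELF_CODE])
        else:
            out.append(v)
    return esc, bytes(out)

def decode_with_escape(esc, data):
    """Invert encode_with_escape: expand escape pairs back to values."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b == esc:
            out.append(None if data[i + 1] == NULL_CODE else esc)
            i += 2
        else:
            out.append(b)
            i += 1
    return out
```

When the escape value has no collisions with real data, only null slots pay the two-byte cost; everything else stays a raw typed byte.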
@MikeFair but why? Certainly, you can find a way to put a NaN into a stream of integers, but how do you support handling that on the client side? |
@kxepal There are two scenarios I'm envisioning on the client side: textual JSON gets produced, or there's some custom library that's using the bitstream to build in-app custom objects. NaN is a value they'd have to deal with... Maybe I'm just misunderstanding the question. |
@MikeFair An SQL ResultSet is an array of objects. Here nulls are ok. No magic numbers are needed. A bitstream is an array of bits. Here nulls aren't expected and aren't acceptable, since you'll wrap that stream in some typed-array thing that expects only 0 or 1, not Z. |
I should have probably used "bytestream" instead of "bitstream". Also, in SQL ResultSets I'd expect to be able to optimize the values of each column by specifying the type and allowing for nulls.
@MikeFair JSON arrays are untyped, just as JSON doesn't know anything about integers or floats - all these are just numbers to it. In the general case you cannot assign any specific UBJSON type to such data. But when you can, it's still only one type, and nulls aren't welcome there. For an SQL ResultSet you'll have such an optimisation, and nulls are allowed there. See the top post. |
Okay, this is the part I'm not seeing. So it's not clear to me what you're saying when you say nulls are allowed within objects... What I'm looking at is a case where I've got an array of objects whose typespec is date, float, float, float.
This logic is defined by ... Please clarify the case you want to understand, because I have actually answered the same thing more than once, and it seems the misunderstanding is still in the air. |
Encode this array: What I see is something like: Here the field names and their strong storage types are defined once as a type specification, and then there is an array of values matching that new complex type. Nulls in this case are a problem, because a given entry could be data or it could be null and the parser can't tell. By adding a magic value to the stream, the parser can now tell. |
@MikeFair Well, this is another layer of abstraction - @kxepal refers to NULL in UBJSON. There is no way that I can see to allow your idea, because NULL is not any special value of int8, int16 or whatever - in UBJ it is a completely different "type". Look at the original post at the top for booleans - we do not currently have a type for booleans, because in the early stages of development bools were optimized out as [T] or [F]. This is a great and easy trick, but it creates a problem with typespec when we try to say that some part of an object/array is of type boolean. In this sense there is no way to express your need with the current typespec. Think about it another way - how would you declare a typespec for two objects with the same single key "value", but with the values 7 and "seven" for this key - I mean, one is an integer and one is a string? Those types are different - we can't write one definition for that. If we used the placeholder [_] that was proposed in this thread it might be possible, but this has not yet been discussed deeply enough. To answer your question - NULL in UBJSON is a type (and a "value" at the same time). NULL in SQL is a value appropriate for any supported type - this is quite rare and comes mostly from SQL, I think. Because typespec is about defining the types of data in objects/arrays, there is no way to represent a value of NULL in the way you wanted. |
@Miosss
Ahhh! But it doesn't create a problem (at least it doesn't have to). Having special codes that convey both their type and their value is In fact, now that I'm thinking about it, all custom UBJ types could be For example, a 1 byte unsigned int is, an unsigned int, whose values have There could be lots of ways to make this subtype definition thing work in For instance, a set of ints that are big, but have a significantly reduced
all come to mind as other extended examples of "value restricted
Sure we can, we just need to be a bit creative about it. I see that special type as an instance of a value restricted subtype from Actually, first we'll need a way to dynamically define this type because So in the stream I'd expect to first see something declaring the type maybe Followed perhaps by these: (I'm just making that syntax up on the spot so if people don't like it This says, declare a new value constrained subtype of '_' whose code is 7. The interpretation of 0=the integer number 7; 1=the string "Seven"; 2=the Also declare some additional value constrained subtypes of the [*7] type; So in the value stream for a [7] you could see a 0, 1, 2, or 3. For an [_] Further using this idea, I think the boolean types could look something Which says declare a new subtype B of _ where the values are int=0 and
I've never seen null being a type before. I've seen null used as a value json.org has "null" listed under the section "values". Null as a value In javascript the typeof "null" is an object: There's also lots of other context for null to be a value instead of a type. When we code for null, we write something that tests the value of a Such a variable could still have a value, there would just be no predefined Interpreted languages deal with exactly this case (where a variable's type It seems null is being treated more like javascript's undefined at the Null as a value is definitely not unique to sql databases, every common
That's what a type's magic value would be for. It's a special value that |
This is the best example of a language where NULL is not a value (for a specified type). In non-dynamic-do-what-you-want-nobody-cares languages NULL is not a value of a type. Let's take Java - null is not a value for an integer, a string or whatever. NULL is a value for a reference, which can be set to something or to nothing. In C/C++ NULL is not a value for integers nor for floats - it is a value for pointers, which can be set to some address in memory or not be set at all (I am simplifying, actually, because due to the history of C, NULL is not something much different; at some point in history it was agreed that dereferencing memory address 0 is strictly illegal, and that's why NULL is not something different - it just points to address 0 in memory, which is illegal by pointer semantics). As you can see, in strongly typed languages null is not a universal value suitable for any type. Mostly it is used for pointers, maybe references. In the UBJSON spec we should not focus on any particular architecture, so we cannot enforce using pointers, optional values, etc. Null exists in UBJSON as a reflection of JSON's NULL. As such, it can be used as a value in any place of a (UB)JSON message - for example you could set some values in an array to NULL even if you would expect floats or strings in those places - it is up to the decoding application to respond properly to that. But typespec is a little different - it relies on the fact that you can find/hard-set a pattern in the following data types. If some integers in an array can be nullable, then such a typespec cannot be written. What's more - because UBJSON is a transport protocol - look at the example: Imagine we send an array of 1024 integers with every request from the client. Server and client logic know nothing more than that those values are integers (let's say max 4 bytes) or nulls.
But if it happens that some values are null, or the encoder does not force optimizations, we fall back to the standard form:
(Of course) the first version is way smaller - it is 1031 B, versus around 2000 B for the second (depending on the number of NULLs). The point is that those two versions can usually be translated one-to-one. BUT they can't when NULLs are involved, because @kxepal has not invented a solution for that yet : ) He has, though, for optimized bools -> the [B] is in the original post at the top. It was a lot easier there, because bool is in fact a type, one that was thrown away as an optimization. Concerning your ideas: what you proposed with those [7*] etc. is a kind of enumeration of possible values, regardless of their type. While this is ok in the sense that the type can be safely incorporated into an enumerated value in UBJ (quite an interesting idea, I must say), it does not provide any advantage in any other situation - aside from enumerations. If you follow my example with only 2 values, 7 and "seven": if you create a "dictionary" that allows one of those two values in this place in the typespec, then you gain no space nor efficiency. The idea of creating dictionaries, for example for keys in objects, may seem interesting (create a dictionary in front of a message and use only references/indexes to it in the actual JSON message), but this is more a task for a compression algorithm, while UBJSON should focus on optimizations that can be found in the JSON domain. Arrays of numerics (ND) and arrays of objects (SQL, REST APIs, etc.) seem to be the common cases. And while ND solves only the first, typespec can solve both, I believe. |
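The two sizes quoted above can be checked with a quick sketch, assuming 1-byte integers (which is what makes the 1031 B figure work out) and an assumed header layout of [ $ i # I plus a 2-byte count:

```python
def optimized_size(n):
    """Size of a strongly typed array of n int8 values.

    Assumed layout: '[', '$', 'i', '#', 'I', 2-byte count, then n raw
    data bytes. The marker layout is an assumption chosen so the total
    matches the 1031 B figure quoted above for n = 1024."""
    header = 1 + 1 + 1 + 1 + 1 + 2
    return header + n

def standard_size(values):
    """Size of the unoptimized form: '[', then per element either an 'i'
    marker plus 1 data byte, or a bare 'Z' marker for null, then ']'."""
    total = 2                         # '[' and ']'
    for v in values:
        total += 1 if v is None else 2
    return total
```

For 1024 non-null values this gives 1031 B vs 2050 B, matching the "around 2000 B" comparison; note the standard form actually shrinks per null, since [Z] is a single byte.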
What about doing this! In place of giving the entire array container a specific type and length, create a new type specifier [!] to mean "a series of" (or "a segment of"), followed by a 1-byte length, then a type. The special 1-byte value of [0] in the length field would mean "use the following [LENGTH] instead of this byte". Then embed the [!] type modifier inside the array. This way arrays can be built up in segments. By using a linked list of "segments" (where a segment has a type, an array of data, and a start/end index) instead of assuming one long contiguous block, the array still remains compact for mixed types, and it enables a decent mechanism to fairly quickly scan through the segments to find a specific array value at a specific index (if reading it). This implies bringing back the boolean type [B], which would be in addition to keeping the two specific boolean value-constrained types. (As an aside: consider other special value-specific case types aside from T/F, like creating a few types for the various integer and float sizes of 0. Or, even better, use the enumerated-list trick from above to define these specific value instances of a type.) Back to the arrays: a small gain can be had here if, when [!] occurs outside the scope of an array container (by itself as a value), it means to create a single array of that single type. If it's any kind of mixed-type array, then you must use []. Using this technique is a good way to describe a 1D vector of mixed values. From here the more general problem of containers can be solved by using the 1D vector of values and then providing metadata that describes how this vector is laid out. Similar in context, I think, to #51 (create a new container type, using [<] and [>], that means "metadata enhanced vector"). |
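A sketch of the segment idea in Python: the array is a list of typed segments rather than one contiguous block, and index lookup scans the segments. The class and method names are invented for illustration:

```python
class SegmentedArray:
    """Array built from typed segments, as in the [!] idea above: each
    segment records a type marker, its starting index, and its values,
    so a mixed-type array stays compact and lookup scans segment ranges
    instead of walking every element."""

    def __init__(self):
        self.segments = []   # list of (type_marker, start_index, values)
        self.length = 0

    def append_segment(self, type_marker, values):
        """Append one homogeneous run of values as a new segment."""
        self.segments.append((type_marker, self.length, values))
        self.length += len(values)

    def __getitem__(self, index):
        """Find the segment covering `index`, then index into it."""
        for type_marker, start, values in self.segments:
            if start <= index < start + len(values):
                return values[index - start]
        raise IndexError(index)

# A mixed-type array as two typed runs:
a = SegmentedArray()
a.append_segment("i", [1, 2, 3])       # a run of small ints
a.append_segment("S", ["x", "y"])      # then a run of strings
```

A reader seeking element 4 only inspects two segment headers instead of decoding four preceding values, which is the "fairly quickly scan" property claimed above.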
@MikeFair Currently the biggest discussion is around typespec, ND and the current STC. The most important thing is to carefully pick one of those as a feature. I strongly believe that we have to choose something, because after some consideration I think that UBJSON without any of those gives us so little advantage that it is way simpler to stick with JSON (maybe apart from encoding arrays of integers, but many people do that already without UBJ). Things such as repeatable headers, ___ as a typespec placeholder, etc. are important, but they are less important and are determined by the more general options. We are having a break now, I suppose; everyone is rethinking all the ideas and we cannot reach any consensus yet. So your fresh look is appreciated : ) |
Ok, so I reread the whole proposal here again and can make some better refinements/comments. "Repeatable headers" didn't exactly inspire the correct vision of what they are. Also, the term "containers" explicitly refers to objects or arrays for me. I created [!], which I can now see is really just reinventing a constrained version of @kxepal's [#] with a single type. So my comment is to restrict [#] to a single type and make that a 1D vector, an optimized 1D vector. It has all the expressive power of typespec for no additional cost that I can see. I also just opened #66 to request the change to length, which was another comment (but kind of orthogonal to this). Lastly, optimized containers are worthy of their own object class, like in #51. By having a more powerful 1D array type, an optimized container format can take advantage of it. For instance, in the case of #43, something like: You don't need to define the type of the array; you just define that the vector has 2048 floats in it and the optimized container explains the rest. |
I wanted to add one more example to my comment above that's more specific to this issue. If this was a 1024 element array of [iSZC] arrays (the example from the top description (though I modified the second [S] to be a [Z])), then the container descriptor becomes this (though I've rearranged the container description from my prior example a little): |
Repeatable headers make sense only in containers. In JSON even a single null, integer or string is a valid JSON message, but you rarely see something like this (except for some streaming). This is why almost every JSON is one big container: an array, or usually an object. If you remember this, it is clear that headers would rarely be used outside containers, because we (almost) never work outside containers at all. Your first example is indeed a 2D array, isn't it? If so, why are you talking about optimizing 1D? Btw.
Steve already commented on that. Data alignment is an application problem, not UBJ's. Your second example is an array of 1024 integers, then 1024 strings, then 1024 nulls and finally 1024 characters. While this is perfectly legal, I doubt it is a popular use case. In dynamic languages and webdev, data will usually be mixed and mostly consist of strings. But since it's legal, here is how it would look in typespec (with repeatables, see top post):
Repeatables were set aside for a moment for some reasons (unknown-length containers, I suppose?), and this creates a problem if we do not have them. This is what I wrote about before, Mike - how do we efficiently define 1024 consecutive types in typespec? It seems that dropping repeatable headers makes the case of mixed-type containers much more difficult. |
@Miosss
While true, this is not the same thing as saying all instances of repeatable headers are instances of a container, specifically just an array (because a repeatable header as an object doesn't really make much sense the way it's currently being constructed). An Array can contain many instances of a repeatable header; each instance appending more elements to the array.
Objects use stand-alone JSON values by themselves all the time, both as field names and as values. Since # makes absolutely no sense as a field name, it's therefore invalid there. But what's the actual error? By saying # defines an array in its entirety when used outside the scope of an open array, the decoder would likely report this error as something along the lines of "Array found where string was expected" or "An array cannot be used as the field name of an object". The other place this could happen is as the field's value. Consider this object: And the following encoding: Without this definition it would not be clear which of the following this encoding would mean: The simple/naive approach is to always force arrays to have [[] and []] surrounding them (interpretation 1). This approach has its merits, but it does burn those [[] []] characters for the simple case. However, since (like you said) repeatable headers can only appear inside the scope of an array, I figured it would be a nice optimization to just say: "If a repeatable header appears as a stand-alone value, somewhere a value of a currently open array would not appear, then it defines an entire array value". This makes the above encoding clear without forcing the additional [[] and []] around the outside of the repeatable header, and is most likely what was meant. To get object (2), foo's value would have to use the surrounding brackets [] to define that. The other ambiguous case is the jagged array of arrays. And this encoding: If # always meant "define an array container" then only the second array can be built with this. Without the surrounding []s there is no way to tell when values should stop being added to the array (as has been mentioned by others before). So in this context, # means "append this [count] of this [type] of element to the current array".
So, to enable the use of # to define a singly typed array as an object field's value in its entirety (which I think will be a very common thing to do), I tried to craft language that would clearly mean: if # is encountered where a JSON value is expected, it defines an array of exclusively that type and length, unless there is already an open array, in which case it appends its values to that array. This is obviously an optimization to eliminate the requirement of the []s in the case of an object field's value, and so is a totally optional thing to include. |
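A token-level sketch of that rule, assuming the stream has already been parsed into tuples (an assumption to keep the example small): '#' appends to a currently open array, and defines a whole array value when it stands alone:

```python
def decode(tokens):
    """Token-level sketch of the context-dependent '#' rule above.

    Tokens: "[" / "]" open and close an array; ("#", values) is a
    repeatable header carrying already-decoded same-type values;
    ("val", x) is a plain value."""
    stack = [[]]                       # stack[0] collects top-level values
    for tok in tokens:
        if tok == "[":
            stack.append([])
        elif tok == "]":
            done = stack.pop()
            stack[-1].append(done)
        elif tok[0] == "#":
            _, values = tok
            if len(stack) > 1:         # inside an open array: append elements
                stack[-1].extend(values)
            else:                      # stand-alone: defines a whole array value
                stack[-1].append(list(values))
        else:
            stack[-1].append(tok[1])
    return stack[0]
```

So a lone header yields one array value without surrounding []s, while two consecutive headers inside []s merge into a single array.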
In both these cases, the 1D value is not coming from the application; it's coming from UBJ. The second example is actually a 1024-element array of a repeating sequence of ISZC values (aka objects that have had their field names stripped and their values repacked efficiently as an array of arrays). By defining the data vector as being laid out in Column-Major instead of Row-Major order, it created a way to rotate the arrays so they transmit/decode more efficiently, but didn't actually change the layout of the original repeating sequence of mixed-type arrays at all. By creating a UBJ container that has some flexibility in how it transmits the packing of its values versus the layout of the original JSON container, it can maximize the use of repeatable headers during transmission but still recreate the original shape when it gets decoded. This is taking a 1D UBJ array, something that is very easy and simple to create/encode, and using it as the set of values for the container; the container's "vector of values" (the "data vector"). In the first case it happens to be a 2D array, but it could also easily be switched to being an array of jagged arrays that contain a total of 2048 values between them. My original was this:

@                                 // This is an optimized multidimensional array
[2]                               // it has 2 dimensions
[R]                               // The vector is laid out in Row-Major order
[ [U2] [I1024] ]                  // those dimensions have length 2 and 1024
[#0I2048][d]...                   // here is the 2048 element data vector

This exact same construct could also have defined an array of arrays that have a combined 2048 elements, using something like this:

@                                 // This is an optimized multidimensional array
[A]                               // The data vector contains an array of arrays
[ [U24] [I500] [I1000] [I500] ]   // those arrays are lengths 24, 500, 1000, and 500
[#0I2048][d]...                   // here is the 2048 element data vector

By separating the layout of the complex container from its data vector, there is flexibility to more optimally pack the data vector for compact transmission and decoding speed. So what this idea is saying is: "The way the container was originally described in JSON may not be the most efficient way for UBJ to transmit and decode it. Here's an efficiently encoded data vector that is fast to decode, and a description of how to unpack it to recreate the original container." |
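The row-major/column-major unpacking can be sketched like this for the 2-D case (handling beyond two dimensions is omitted; the 'R'/'C' order flags mirror the markers in the examples above):

```python
def unpack(vector, dims, order):
    """Rebuild a nested array from a flat data vector.

    order "R" = row-major (the original layout), order "C" = column-major
    (the rotated, transmission-friendly layout). Supports 2-D only, as a
    sketch of the idea rather than a full ND implementation."""
    rows, cols = dims
    if order == "R":
        return [vector[r * cols:(r + 1) * cols] for r in range(rows)]
    # column-major: element (r, c) lives at flat index c * rows + r
    return [[vector[c * rows + r] for c in range(cols)] for r in range(rows)]
```

The same flat vector reconstructs the same logical array either way; only the on-wire grouping of same-type values changes, which is what lets the encoder pick whichever order packs best.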
As I mentioned above, the second example was misinterpreted; sorry about not speccing it out more clearly/completely.
The JSON array that was actually defined was not an array of 4 sequences of 1024 elements of the same type. It was 1024 sequences of 4 mixed-type values. This is the power of being able to rotate the array layouts. What I probably should have done was just define it as an object in the first place, but I thought introducing objects to the fray was just going to add another layer of confusion, so I used a repeating sequence of ISZC instead. In an environment where people knew how to code, they would do exactly what you said: create four arrays, and transmit those 4 arrays with some additional metadata that explained how to put the objects back together on the other side. That's exactly what these examples are doing; however, instead of creating four (or N) separate arrays for everything, it's defining one data vector for the whole container and then using the dimensions and layout information to describe how to recreate the original JSON container from that vector. That's why I care about optimizing the 1D mixed-type array; it's used as the source data vector for describing several different kinds of optimized complex containers, using a simple repeatable handful of tools/techniques/layout descriptors. Imagine that field 3, the null, wasn't null for every object, just many of them. Also imagine that inside this array of objects, 1 object (the 800th element) had an additional field called "Temp", which was a double, but didn't have a field 4, called "Classification". Here's what the "object" version of my second example could look like with those changes included:

[%]                     // This is an optimized object
[C]                     // The data vector is laid out in Column-Major order
[[] [5] [0I1024] []]    // the data vector contains 5 fields of 1024 objects
[[]                     // Here is the array of field names
  [S][7][ObjectID]
  [S][5][Label]
  [S][8][ImageURL]
  [S][14][Classification]
  [S][4][Temp]
[]]
[[]                     // And here is the data vector
  [#0I1024i]......
  [#0I1024S]......
  [#0I500Z][#200S]......[#250Z][S]...[S]...[Z][Z][#70S]...
  [#0I799C]......[_][#224C]......
  [#0I799_][d][#224_]
[]]

I should point out the mixed-type stream used for field 3 (ImageURL), and the use of [_] to mean undefined for some of the object instances' fields. This same idea can be applied to optimize the layout for many different containers, choosing efficient layouts for commonly recognizable patterns.
The source JSON containers handed to UBJ are going to be mixed type by definition. And obviously it's going to be more efficient to pack the same typed data together where/when you can. So I see this technique (and the array extents/segments idea that goes with it) as proposing a way to maximize the same type packing, while at the same time respecting the fact that JSON containers are inherently mixed typed. The point to making the repeatable header singularly typed is to maximize the decoder's ability to bring in the same type technique as often as it can. Whether the example above is packed as 5 separate arrays (one for each field) or 1 long array they will pack and unpack almost exactly the same using a singularly typed repeatable header. It works primarily because there's metadata explaining that the provided layout of data is rotated from its original and so needs to be rotated again before handing it off to the end user (aka Column-Major). Packing it as a 1D data vector with a layout (aka Row-Major or Column-Major, the count of dimensions and their lengths), it's easy to reuse the same construct for both array and object containers, and to extend it to many different container optimizations like sparse (though the trick of using a repeatable header on the [_] type gets pretty close to sparse); Arrays of Arrays (of differing lengths); and ND Arrays. I appreciate you guys taking the time to let me comment; Thanks! |
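A sketch of the column-major object packing described above: field names are stripped once, values are packed one column per field, and missing fields travel as None (standing in for the [_] undefined marker). Function names are invented for illustration:

```python
def pack_columns(objects, fields):
    """Strip field names from an array of objects and pack the values
    column by column (one run per field), as in the optimized-object
    example above. Missing fields become None ('undefined')."""
    return [[obj.get(f) for obj in objects] for f in fields]

def unpack_columns(columns, fields):
    """Rotate the columns back into the original array of objects,
    dropping None entries so missing fields stay missing."""
    return [
        {f: col[i] for f, col in zip(fields, columns) if col[i] is not None}
        for i in range(len(columns[0]))
    ]
```

Each column is homogeneous far more often than each row, so the per-field runs are exactly where singly typed repeatable headers can do their work.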
@MikeFair
If you look at my examples of your JSONs, or kxepal's at the top, you will notice that the [ and ] markers are present. The typespec IS NOT the definition of the container; [ and ] are. This does not differ from JSON nor from the current UBJSON spec. |
Typespec can predefine keys for pairs in an object. Concerning the ISZC example, now that I understand you, it would look like the following (we transmit an array of 1024 arrays, which contain I, S, Z, and C consecutively):
So in fact, after successful parsing, you will decode an array of 1024 arrays with 4 values of different types. You would use it like:
|
@Miosss I do understand what this proposal is saying, and how the existing typespec (at least the one on the ubjson.org website) works. There are a few different comments I'm making here.
Right; and I'm not saying otherwise, I was simply saying that when a stand alone value (like as the value of a field in an object) happens to be an array that contains only one type, then '#', in that case, can define the entire array value. Yes? (Not that you agree that it should do that; only that you see that it could.)
Except that there's nothing about JSON that requires all values of the same field name, across different objects in an array, to be of the same type. This is extremely common when getting numerical data from computed values; you'll frequently encounter any of null, "NaN", "Inf", "N/A", "#NA", "@na", "DIV/0", etc. as strings where a number value is expected, because the formula was either invalid or the data simply doesn't exist. If typespec were implemented as described in this specification, then any large dataset that includes those "NaN"/"Inf" strings or null as field values simply could not be optimized this way (or it would have to be broken up into smaller sections that did successfully meet the repeating type pattern). By using '#CTLV' and a different approach to objects and arrays that maximizes the use of #CTLV, it eliminates those difficulties while remaining small (in most cases, smaller than existing proposals) and fast to decode (mostly because of the "Strongly Typed Section" semantics).
What seems to be missing here is the amount of parsing/interpretation/tracking work this approach requires the encoder/decoder to perform. For instance, by defining it as #1024[ISZC], there is no opportunity for the decoder to predeclare a strongly typed array of the given length and memcopy the values directly from the filestream; it must decode each value one at a time (which it can, because it knows their types; it just has to do it value by agonizing value).

There is also no opportunity for compressing runs of the "value explicit" types like Z, T, and F (and if more of these "value explicit" types were added, like the various integer and float sizes for 0 and 1, there'd be more of these optimized types to compress/use). Let's say that instead of [Z], field 3 of the object was usually [T], and occasionally [F]. Using the typespec approach declared in this spec, it declares this as #1024[ISBC] and actually has to put all the [T] and [F] values into the stream. By making it possible to simply rotate the array streams, it can now declare the sequences of T and F more compactly, and use the #CTLV format for the other fields too. When rotated, the entire Boolean value stream looks like this:

[#352T][#8F][#40T][#124F][#500T]

This provides all 1024 values in 21 bytes instead of 1024 bytes, and gives the decoder a way to read in the values of each Strongly Typed Section far faster than having to interpret each value as a separate type one at a time.

Further, the Object example I described, with differing fields between objects and the series of null values replaced by a mixture of nulls and strings, becomes much more difficult to encode using this specification. In the largish datasets I've worked with (typically anywhere from 100MB to 100GB binary compressed), long runs of 0 or null, or a repeating sequence of the same values over and over again, are commonplace...
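A hedged sketch of the run-length idea behind [#352T][#8F][#40T][#124F][#500T], using plain Python tuples in place of UBJSON markers:

```python
# Run-length view of the boolean stream: 352 T, 8 F, 40 T, 124 F, 500 T.
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (count, value) runs."""
    return [(len(list(group)), value) for value, group in groupby(values)]

def rle_decode(runs):
    return [value for count, value in runs for _ in range(count)]

bools = [True] * 352 + [False] * 8 + [True] * 40 + [False] * 124 + [True] * 500
runs = rle_encode(bools)

assert runs == [(352, True), (8, False), (40, True), (124, False), (500, True)]
assert rle_decode(runs) == bools and len(bools) == 1024
```

Five (count, value) runs stand in for 1024 individual markers; the byte-level encoding of each run is left out of this sketch.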
I also find type exceptions in a long series of a field's values, whether it's a null in any field, a string in a typically numeric field (as I mentioned above), or an integer in a float field and a float in an integer field, are all quite commonplace. An optimized array/object spec ought to handle that "mixed typeness" effectively and I simply don't see the repeatable headers as defined for arrays and objects in this specification as meeting the needs. All that taken together is what has me saying "Yes, I see defining '#' this way is good, I say keep it as just modifying the count of values of type T to consume; and use a different approach to handling the cases of arrays and objects (namely enabling the frequent use of #CTLV and in case it's not clear; here's one way of doing that)". |
Well, I am not convinced at all by so much data relocating, shifting, and packing into the nice, optimizable form that you propose. But there are a few points I can still comment on:
What alternatives for arrays and objects of mixed types, deep structures, subarrays, subobjects, etc. do you see?
Whether the typespec (as described by @kxepal) should appear only once or more is a discussion about "repeatable headers". The typespec itself has nothing to do with it; it can simply be repeated if needed.
If we send numerics, then IEEE 754 floats, as used in UBJ, can have values of NaN, +/-Inf, etc. As I have said a few times before, optimization is MOSTLY the task of the ENCODER (the internal machine), not the user's. When UBJSON is, for example, used as a transparent transport encoding of JSON, conversion shall be done automatically, with as many optimizations as the encoder can efficiently provide.
Could you provide some of such datasets, if of course those do not contain secret information? It would be nice to have something practical to discuss. |
@kxepal
In this example, you wrote a typespec like this:
(I suppose that in block-notation you meant something more like:
so it meant, as I understand it, that the objects in this container are in fact arrays of two fixed arrays: one of 4 U, the second of two C. OK, but what if we have arrays of arrays, for example arrays of GPS coordinates? This is fairly simple, because we have only two floats in each, so [D][D] would be enough. But what if we had, for example, arrays of arrays of 1024 integers? |
I just posted #70 to begin describing them
Currently, certain markers have special meanings to describe a more complicated repeating structure of some kind. My comment was about reserving [#] for only single-typed values, as in #66. Defining [#] to be an optional count modifier to a TLV is an elegant tweak to the spec. While I do find the additional uses of the array, object, and typespec ideas great, keeping [#] as strictly a count modifier to a TLV is a cleaner use for the [#] symbol. If there must be a multitype repeating specifier, then perhaps [$] would be a better symbol for that, rather than overloading the [#] symbol.
I disagree that mixed-type arrays aren't worth optimizing, especially since the repeatable-headers technique accounts for them so nicely: maximizing the number of values that can be compacted together in a single run, interrupting it when it needs to, and then going back to packing many same-typed values together again (and using the non-bracketed [] version to mean an exclusively singly typed array).

If an optimized container of 4095 Double values and a single string must fall back to explicitly providing the type for all 4096 values, it strikes me as a special-use-case optimization, not a general solution. It's great when it works out that every single value in every single field has exactly the type it's supposed to have, but when you start dealing with large amounts of data, it's simply not always there, or sometimes it's been miskeyed, or sometimes the value is outside the range it's supposed to be, or sometimes it's of the wrong type (like '+/- Inf'/'NaN' instead of a float). Oftentimes, you're not the one responsible for creating the data, so fixing it "upstream" can be challenging. For example, with any of the JSON datasets available online, since you're just downloading them you have little control over their quality. So if minor data discrepancies result in a complete breakdown of a more optimized encoding, that doesn't seem to me like the right solution.
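To make the 4095-doubles-plus-one-string argument concrete, here is a small sketch (the values are my own invention) of splitting a mixed stream into maximal same-typed runs:

```python
# Split a mixed stream into maximal same-typed runs: ~4k doubles and one
# stray string cost two extra run headers, not 4096 per-value type markers.
from itertools import groupby

def type_runs(values):
    """Group a value stream into (type_name, values) runs."""
    return [(t.__name__, list(g)) for t, g in groupby(values, key=type)]

values = [1.5] * 4000 + ["NaN"] + [2.5] * 95   # a type exception mid-stream
runs = type_runs(values)

# Three runs instead of 4096 individually typed values:
assert [(name, len(v)) for name, v in runs] == \
    [("float", 4000), ("str", 1), ("float", 95)]
```

The single "NaN" string interrupts the packing but does not force the whole container back to per-value typing, which is the point being argued above.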
I'll see what I can dig up, but I think http://data.gov/ which is the US gov't public data site is a really great place to start; it looks like they've got tons of real world JSON to offer. Where should it get posted? Just attach it to a new issue? upload/email it to someone? |
Abstract
UBJSON provides two types of containers:
- array
- object
These containers are simple and cool: just read values till the container close marker. This allows streaming data without pain. However, the real world doesn't work with only streaming data. The real world also deals with:
- sized data (where you need to know how much data to expect before you read the actual content)
- strongly typed data
With Draft 10, both array and object containers received a special "optimized" format which aims to cover both these cases. However, it also introduces new problems to the format:
Inconsistency. Currently, there are three different kinds of each container: the plain one, the count-optimized one, and the count-plus-type-optimized one. All of them start with the same marker. After reading it you don't know what to expect: actual data or optimization markers. So eventually you have to read one more byte just to figure out what to do in your code: construct an "optimized" container or start to put the data into a plain old one.
It's not possible to define just type optimization without count

So the case when we want to stream some strongly typed content of unknown size returns us back to the unoptimized format.
It's not possible to correctly optimize an array of numbers

The current optimization requires all container elements to have the same type marker. However, if you try to optimize an array of integers in the range 0-300, you'll notice there is only one choice for you: widen `U` to `i`, but this will make all the optimization profits fade away.
It's not possible to have optimized containers with values of various types which still follow the same schema

The Draft 10 optimizations are applied to the whole container without exceptions.
The base TLV format is broken by the `$` marker

Draft 10 made a small step forward in providing optimizations for repeatable data. Let's fix this idea and make it right.
Container Section Specification
To fix the Draft 10 container optimizations we need to do the following:
Type Specification
The Type Specification is a special declaration that all following container elements belong to the described type. The Type Specification can contain "simple" types (numbers and strings) as well as complex ones (arrays and objects).
The Type Specification is a sort of valid UBJSON data without the "value" part: `S`, `U`, `i`, `I`, `H`, etc. For example:
`[iSSC]` means that each element must be an array whose first element is an int16, whose second and third elements are strings, and whose last one is a char.
`{U3fooSU3barU}` means that each element is an object with keys `foo` and `bar`, where the value for `foo` is a string and the value for `bar` is an unsigned int8.
Container Section
The Container Section is a special construction which defines the container's elements (their type and amount) until the next Container Section or the end of the container.
The Container Section is defined by the following TLV object:
Where:
- `Tag` is the `#` (0x23) character
- `Length` is the amount of following elements
- the `Value` part contains the Type Specification for the following elements

Both Length and TypeSpec could be omitted by setting them to null (`Z`). The `[#][Z][Z]` means exactly the same as if it weren't defined: the amount of following elements is unknown and their types may vary, i.e. there is no optimization being applied.
The Length MUST contain an integer number (tag + value) or be null.
The TypeSpec MUST follow the requirements of the Type Specification section or be null.
If the TypeSpec defines a numeric type, the Length MUST NOT be null.
The Container Section MAY occur multiple times inside a single container: at the beginning, in the middle, and at the end of it.
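A minimal sketch of parsing the [#][Length][TypeSpec] header described above. The byte-level choices here (an `i`-prefixed one-byte Length, single-marker TypeSpecs only) are simplifying assumptions of mine, not part of the proposal:

```python
# Parse [#][Length][TypeSpec]. Assumptions for this sketch: Length is
# either null ('Z') or an int8 ('i' marker + one byte); TypeSpec is either
# null ('Z') or a single type marker such as 'U' or 'S'.

def parse_container_section(data, pos):
    """Return (count, typespec, next_pos) for a Container Section at pos."""
    if data[pos:pos + 1] != b"#":
        raise ValueError("Container Section must start with '#'")
    pos += 1
    if data[pos:pos + 1] == b"Z":          # null Length: count unknown
        count, pos = None, pos + 1
    elif data[pos:pos + 1] == b"i":        # int8 Length (sketch-only choice)
        count, pos = data[pos + 1], pos + 2
    else:
        raise ValueError("unsupported Length encoding in this sketch")
    if data[pos:pos + 1] == b"Z":          # null TypeSpec: types may vary
        spec, pos = None, pos + 1
    else:
        spec, pos = chr(data[pos]), pos + 1
    return count, spec, pos

# [#][i][5][U]: five uint8 values follow.
assert parse_container_section(b"#i\x05U", 0) == (5, "U", 4)
# [#][Z][Z]: the "empty" section -- no optimization applied.
assert parse_container_section(b"#ZZ", 0) == (None, None, 3)
```

Note how the nullable Length and TypeSpec fall out naturally: the parser always knows how much to read, but is free to ignore the nulls.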
Edge cases
Nulls
Values cannot be null unless TypeSpec describes the object.
Not allowed:
Allowed:
Booleans
In UBJSON there is no single marker to describe the case of `T` or `F`. I propose to reintroduce the `B` marker as a special TypeSpec entity for defining boolean values.
Use Cases
Checkpoints
The empty Container Section carries no meaning, but it could be used by applications as checkpoints:
Chunked Streams
The use case is the same as for checkpoints, but to control the amount of passed/received data:
Container with mixed elements type
This is an optimized array of 0-32768 numbers:
JSON size: 185505 bytes
UBJSON size (Draft-10): 98055 bytes
UBJSON with optimized array: 65432 bytes
3 times more compact than JSON, 50% more compact than current UBJSON.
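A back-of-the-envelope sketch of where the savings come from in the 0-32768 case. The per-type payload sizes (uint8/int16/int32) are my own assumption and are not meant to reproduce the exact byte counts above:

```python
# Marker-per-value vs. sectioned typing for the integers 0..32768.
# Payload sizes assume uint8 (0-255), int16 (256-32767), int32 (32768).

def payload_size(n):
    if n <= 255:
        return 1    # fits uint8
    if n <= 32767:
        return 2    # fits int16
    return 4        # needs int32

values = list(range(32769))

# Plain UBJSON: every value carries its own 1-byte type marker.
plain = sum(1 + payload_size(n) for n in values)

# Sectioned: types declared once per run, so payload bytes only
# (ignoring the handful of section-header bytes).
sectioned = sum(payload_size(n) for n in values)

assert plain == 98053 and sectioned == 65284
assert plain - sectioned == len(values)   # one marker byte saved per value
```

The per-value marker overhead alone is one byte per element, which is roughly the gap between the two UBJSON figures quoted above.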
Records
Records are named tuples of elements where each element almost always has the same data type. A good example is a row of any SQL table: each "cell" has its own type. Here is how the dump of a generic SQL table would look in JSON:
(726 bytes)
UBJSON:
(510 bytes)
UBJSON + optimized containers:
(238 bytes)
Twice as compact as plain UBJSON, three times more compact than JSON.
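A hedged sketch of the Records idea in Python: declare the schema once, then stream only bare value tuples per row. The schema and rows below are invented:

```python
# Records: header once, then bare value tuples -- no repeated keys per row.

schema = ["id", "name", "email"]

def encode_rows(rows):
    """Pack dict rows into a shared-schema table."""
    return {"schema": schema, "data": [[row[k] for k in schema] for row in rows]}

def decode_rows(packed):
    """Rebuild the dict rows from the shared schema."""
    return [dict(zip(packed["schema"], values)) for values in packed["data"]]

rows = [{"id": 1, "name": "ann", "email": "a@x"},
        {"id": 2, "name": "bob", "email": "b@x"}]

packed = encode_rows(rows)
assert packed["data"] == [[1, "ann", "a@x"], [2, "bob", "b@x"]]
assert decode_rows(packed) == rows
```

The size win scales with the row count: key names (and, in UBJSON, type markers) are paid for once instead of once per row.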
Multidimensional Arrays
Issue #43 is solved easily, without additional markers:
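One way to picture the multidimensional-array idea, sketched in Python with an invented 2x3 matrix: ship a flat, singly typed vector plus its dimensions, and reshape on the way out:

```python
# Flat typed vector + dimensions for a 2x3 array of ints.

def flatten(nested):
    """Row-major flatten of a 2D list into one vector."""
    flat = []
    for row in nested:
        flat.extend(row)
    return flat

def reshape(flat, rows, cols):
    """Rebuild the nested lists from the vector and its dimensions."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

matrix = [[1, 2, 3], [4, 5, 6]]
flat = flatten(matrix)            # one singly typed vector

assert flat == [1, 2, 3, 4, 5, 6]
assert reshape(flat, 2, 3) == matrix   # dimensions restore the nesting
```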
Error Cases
No-op marker inside Type Spec
Not enough elements
and also
Recommendations for implementations
separately. This allows you to handle the Type Specification with ease.
to handle a specific TypeSpec definition. For instance, you may map UBJSON records to the ORM in use.
containers when a new Container Section occurs.
Required UBJSON Draft 10 Changes
- Reintroduce `B` for booleans (`T` and `F`)
- Redefine `#` for the Container Section

Differences from existing proposals
- No `$` helper; you just reuse the semantics of existing types instead.