Container Elements Group Schema #50

kxepal opened this issue Jul 19, 2014 · 44 comments

kxepal commented Jul 19, 2014

Abstract

UBJSON provides two types of containers:

  1. Arrays: lists of elements of various types
  2. Objects: lists of key-value pairs where keys are always strings

These containers are simple and convenient: just read values until the container
close marker. This allows data to be streamed without pain. However, the real
world doesn't work only with streamed data. It also deals with:

  1. Sized containers (the parser knows the container length before it starts
    reading the actual content)
  2. Typed containers (all container elements have the same data type)

With Draft 10 both array and object containers received a special "optimized"
format which aims to cover both of these cases. However, it also introduces new
problems to the format:

  • Inconsistency. Currently, there are three different kinds of each container:

    • Plain old container
    • Sized container
    • Typed and sized container

    All of them start with the same marker. After reading it you don't know what
    to expect: actual data or optimization markers. So eventually you have to read
    one more byte just to figure out what to do in your code: construct an "optimized"
    container or start putting the data into a plain old one.

  • It's not possible to define just a type optimization without a count:

    If a type is specified, a count must also be specified.
    A type cannot be specified by itself.

    So the case where we want to stream some strongly typed content of unknown size
    sends us back to the unoptimized format.

  • It's not possible to correctly optimize an array of numbers.
    The current optimization requires all container elements to carry the same type
    marker. However, if you try to optimize an array of integers in the range 0-300,
    you'll notice that there are only two choices:

    • Use a plain old array without optimizations
    • Forge the type of the first 0-255 elements from U to i, but this makes all
      the optimization profit fade away.

  • It's not possible to have optimized containers with values of various types
    that still follow the same schema

  • The proposed optimizations are applied to the whole container without exceptions

  • The base TLV format is broken by the $ marker

Draft 10 takes a small step forward on providing optimizations for
repeatable data. Let's fix this idea and make it right.

Container Section Specification

To fix the Draft 10 container optimizations we need to do the following:

  • Drop the current optimizations
  • Define a "Type Specification" entity
  • Define a "Container Section" entity

Type Specification

The Type Specification is a special declaration that all following container
elements belong to the described type. The Type Specification can contain
"simple" types (numbers and strings) as well as complex ones (arrays and objects).

The Type Specification is essentially valid UBJSON data without the "value" part
(a parsing sketch follows the examples below):

  • TypeSpec for simple types: S, U, i, I, H
  • TypeSpec for arrays: [iSSC] - means that each element must be
    an array with an int16 first element, the second and third must be strings,
    and the last one is a char.
  • TypeSpec for objects: {U3fooSU3barU} - means that each element is
    an object with keys foo and bar, where the value of foo is a string
    and the value of bar is an unsigned int8.

Container Section

The Container Section is a special construct which defines the container's
elements (their type and count) until the next Container Section or the
container end occurs.

The Container Section is defined by the following TLV object:

    [#]<Length><TypeSpec>

Where:

  • Tag is the # (0x23) character
  • Length is the number of following elements
  • The Value part contains the Type Specification for the following elements

Both Length and TypeSpec may be omitted by setting them to null (Z):

    [{]
        [#][Z][Z]
    [}]

The [#][Z][Z] means exactly the same as if no section were defined: the number
of following elements is unknown and their types may vary, i.e. no
optimization is applied.

The Length MUST contain an integer number (tag + value) or be null.

The TypeSpec MUST follow the requirements of the Type Specification section or be
null.

If the TypeSpec defines a numeric type, the Length MUST NOT be null.

The Container Section MAY occur multiple times inside a single container: at the
beginning, in the middle and at the end of it.
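For illustration only, a sketch (Python) of how a reader might consume the [#]<Length><TypeSpec> header. It handles only U and I length tags and a single-marker TypeSpec (nested [..]/{..} specs would recurse as in the sketch above); those restrictions are assumptions for brevity, not limits of the proposal:

    import struct

    NUMERIC = b'UiIlLdDH'   # assumed set of numeric markers for the rule below

    def read_section_header(buf, pos):
        """Read one [#]<Length><TypeSpec> header starting at the '#' marker.

        Returns (count, type_marker, next_pos); count / type_marker are None
        when the corresponding part is the null marker Z.
        """
        assert buf[pos:pos + 1] == b'#'
        pos += 1

        # Length part: Z (unknown count) or an integer tag + value.
        tag, pos = buf[pos:pos + 1], pos + 1
        if tag == b'Z':
            count = None
        elif tag == b'U':
            count, pos = buf[pos], pos + 1
        elif tag == b'I':
            count, pos = struct.unpack('>h', buf[pos:pos + 2])[0], pos + 2
        else:
            raise ValueError('Length must be an integer number or Z')

        # TypeSpec part: Z (types may vary) or a single type marker here.
        marker, pos = buf[pos:pos + 1], pos + 1
        type_marker = None if marker == b'Z' else marker

        # Rule above: a numeric TypeSpec requires an explicit Length.
        if type_marker is not None and type_marker in NUMERIC and count is None:
            raise ValueError('numeric TypeSpec requires an explicit Length')
        return count, type_marker, pos

    # [#][U][4][U]  ->  count 4, element type U
    print(read_section_header(b'#U\x04U', 0))   # (4, b'U', 4)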

Edge cases

Nulls

Values cannot be null unless the TypeSpec describes an object.

Not allowed:

   [[]
      [#][U][4][U]
      [1]
      [Z]
      [2]
      [3]
   []]

Allowed:

    [[]
      [#][Z][{][U][3][foo][U][}]
      [1]
      [Z]
      [2]
      [3]
    []]

Booleans

In UBJSON there is no single marker to describe the type of T or F values.
I propose to reintroduce the B marker as a special one for the TypeSpec
entity, for defining boolean values.

    [[]
      [#][Z][B]
      [T]
      [F]
      [T]
      [F]
    []]

Use Cases

Checkpoints

An empty Container Section carries no meaning, but applications could use it
as a checkpoint:

    [[]
        [U][0]
        [U][1]
        [U][2]
        ......
        [U][99]
        [#][Z][Z]
        [U][100]
        [U][101]
        [U][102]
        ......
        [U][199]
        [#][Z][Z]
        [U][200]
        [U][201]
        [U][202]
        .....
        [U][255]
    []]

Chunked Streams

Similar to the checkpoints use case, but used to control the amount of passed/received
data:

    [[]
        [#][U][1024][U]
        [21]
        [32]
        ... 1021 elements more ...
        [232]
        [#][U][1024][U]
        [121]
        [234]
        [28]
        [81]
        ... 1020 elements more ....
        [#][U][2][U]
        [41]
        [74]
    []]

Container with mixed elements type

This is an optimized array of 0-32768 numbers:

    [[]
        [#][U][256][U]
        [0]
        [1]
        ......
        [255]
        [#][I][32512][i]
        [256]
        [257]
        .....
        [32766]
        [32767]
        [#][U][1][I]
        [32768]
    []]

JSON size: 185505 bytes
UBJSON size (Draft-10): 98055 bytes
UBJSON with optimized array: 65432 bytes

About 3 times more compact than JSON, roughly 1.5 times more compact than current UBJSON.
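For illustration only, a rough sketch (Python) of how an encoder could produce such a split: group consecutive values into runs that share the smallest sufficient marker and emit one Container Section per run. The marker ranges below follow the markers as used in this example and are an assumption of the sketch, not a normative table:

    def smallest_marker(n):
        """Pick the smallest sufficient integer marker for n (assumed ranges)."""
        if 0 <= n <= 255:
            return 'U'
        if -32768 <= n <= 32767:
            return 'i'
        if -2 ** 31 <= n <= 2 ** 31 - 1:
            return 'I'
        return 'L'

    def sections(values):
        """Group consecutive values into runs sharing one marker; each run maps
        to one [#]<count><marker> Container Section followed by its values."""
        runs = []
        for v in values:
            m = smallest_marker(v)
            if runs and runs[-1][0] == m:
                runs[-1][1] += 1
            else:
                runs.append([m, 1])
        return [tuple(r) for r in runs]

    # The 0..32768 array from the example above:
    print(sections(range(32769)))
    # -> [('U', 256), ('i', 32512), ('I', 1)]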

Records

Records are named tuples of elements where each element has its own fixed data
type. A good example is a row of any SQL table: each "cell" has its own type.
Here is how the dump of a generic SQL table looks in JSON:


    [
        {"id": 3530597, "city": "Mexiko-Stadt", "lat": 19.428472427036, "lon": -99.12766456604},
        {"id": 1701668, "city": "Manila", "lat": 14.6042, "lon": 120.9822},
        {"id": 1185241, "city": "Dhaka", "lat": 23.710395616597037, "lon": 90.40743827819824},
        {"id": 1835848, "city": "Seoul", "lat": 37.566, "lon": 126.9784},
        {"id": 1642911, "city": "Jakarta", "lat": -6.214623197035775, "lon": 106.84513092041016},
        {"id": 1850147, "city": "Tokyo", "lat": 35.6895, "lon": 139.69171},
        {"id": 1668341, "city": "Taipeh", "lat": 25.047763, "lon": 121.531846},
        {"id": 3688689, "city": "Bogot\u00e1", "lat": 4.609705849789108, "lon": -74.08175468444824},
        {"id": 1816670, "city": "Peking", "lat": 39.9074977414405, "lon": 116.397228240967},
        {"id": 1819729, "city": "Hong Kong", "lat": 22.2855225817732, "lon": 114.157691001892}
    ]

(726 bytes)

UBJSON:

    [[]
        [{]
            [S] [i] [2] [id]
            [l] [3530597]
            [S] [i] [4] [city]
            [S] [i] [12] [Mexiko-Stadt]
            [S] [i] [3] [lat]
            [d] [19.4284725189209]
            [S] [i] [3] [lon]
            [d] [-99.1276626586914]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1701668]
            [S] [i] [4] [city]
            [S] [i] [6] [Manila]
            [S] [i] [3] [lat]
            [d] [14.60420036315918]
            [S] [i] [3] [lon]
            [d] [120.9822006225586]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1185241]
            [S] [i] [4] [city]
            [S] [i] [5] [Dhaka]
            [S] [i] [3] [lat]
            [d] [23.71039581298828]
            [S] [i] [3] [lon]
            [d] [90.40744018554688]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1835848]
            [S] [i] [4] [city]
            [S] [i] [5] [Seoul]
            [S] [i] [3] [lat]
            [d] [37.566001892089844]
            [S] [i] [3] [lon]
            [d] [126.97840118408203]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1642911]
            [S] [i] [4] [city]
            [S] [i] [7] [Jakarta]
            [S] [i] [3] [lat]
            [d] [-6.214622974395752]
            [S] [i] [3] [lon]
            [d] [106.84513092041016]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1850147]
            [S] [i] [4] [city]
            [S] [i] [5] [Tokyo]
            [S] [i] [3] [lat]
            [d] [35.68949890136719]
            [S] [i] [3] [lon]
            [d] [139.69171142578125]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1668341]
            [S] [i] [4] [city]
            [S] [i] [6] [Taipeh]
            [S] [i] [3] [lat]
            [d] [25.04776382446289]
            [S] [i] [3] [lon]
            [d] [121.53184509277344]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [3688689]
            [S] [i] [4] [city]
            [S] [i] [7] [Bogotá]
            [S] [i] [3] [lat]
            [d] [4.609705924987793]
            [S] [i] [3] [lon]
            [d] [-74.08175659179688]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1816670]
            [S] [i] [4] [city]
            [S] [i] [6] [Peking]
            [S] [i] [3] [lat]
            [d] [39.90749740600586]
            [S] [i] [3] [lon]
            [d] [116.39723205566406]
        [}]
        [{]
            [S] [i] [2] [id]
            [l] [1819729]
            [S] [i] [4] [city]
            [S] [i] [9] [Hong Kong]
            [S] [i] [3] [lat]
            [d] [22.2855224609375]
            [S] [i] [3] [lon]
            [d] [114.1576919555664]
        [}]
    []]

(510 bytes)

UBJSON + optimized containers:

    [[]
        [#][Z][{][U][2][id][l][U][4][city][S][U][3][lat][d][U][3][lon][d][}]

        [3530597]
        [i] [12] [Mexiko-Stadt]
        [19.4284725189209]
        [-99.1276626586914]

        [1701668]
        [i] [6] [Manila]
        [14.60420036315918]
        [120.9822006225586]

        [1185241]
        [i] [5] [Dhaka]
        [23.71039581298828]
        [90.40744018554688]

        [1835848]
        [i] [5] [Seoul]
        [37.566001892089844]
        [126.97840118408203]

        [1642911]
        [i] [7] [Jakarta]
        [-6.214622974395752]
        [106.84513092041016]

        [1850147]
        [i] [5] [Tokyo]
        [35.68949890136719]
        [139.69171142578125]

        [1668341]
        [i] [6] [Taipeh]
        [25.04776382446289]
        [121.53184509277344]

        [3688689]
        [i] [7] [Bogotá]
        [4.609705924987793]
        [-74.08175659179688]

        [1816670]
        [i] [6] [Peking]
        [39.90749740600586]
        [116.39723205566406]

        [1819729]
        [i] [9] [Hong Kong]
        [22.2855224609375]
        [114.1576919555664]
    []]

(238 bytes)

Twice as compact as plain UBJSON, about 3 times more compact than JSON.

Multidimensional Arrays

Issue #43 is solved easily, without additional markers:

    [[]
        [#][Z][[UUUU][CC]]

        [1][2][3][4]
        [a][b]

        [5][6][7][8]
        [c][d]
    []]

Error Cases

No-op marker inside Type Spec

    [[]
        [#][N][U][10][U]
        ...
    []]

Not enough elements

    [[]
        [#][U][10][U]
        [1]
        [2]
    []]

and also

    [[]
        [1]
        [2]
        [#][U][10][U]
    []]

Recommendations for implementations

  • Split the TLV parser into three functions that parse the tag, length and value
    separately. This lets you handle Type Specifications with ease.
  • Allow the UBJSON library to register custom records, structures and classes
    to handle specific TypeSpec definitions. For instance, you may map UBJSON
    records to your ORM.
  • Ensure that you can easily update the processing context of array/object
    containers when a new Container Section occurs.
  • Instead of recursion, use a state machine to handle container bodies with a
    TypeSpec (see the sketch below).
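As an illustration of the last two points, a simplified sketch (Python) of a decoding loop that keeps the current Container Section as plain state instead of recursing. It only handles uint8 counts, single-marker TypeSpecs and U payloads; those restrictions are assumptions made to keep the sketch short:

    def decode_array(buf, pos):
        """Decode one array that may contain repeated [#]<Length><TypeSpec>
        sections, tracking the current section in (remaining, marker)."""
        assert buf[pos:pos + 1] == b'['
        pos += 1
        out = []
        remaining, marker = None, None              # current Container Section
        while True:
            if buf[pos:pos + 1] == b']':
                return out, pos + 1
            if buf[pos:pos + 1] == b'#':            # new section: replace the state
                pos += 1
                tag, pos = buf[pos:pos + 1], pos + 1
                if tag == b'Z':
                    remaining = None
                else:
                    assert tag == b'U'              # uint8 counts only in this sketch
                    remaining, pos = buf[pos], pos + 1
                m, pos = buf[pos:pos + 1], pos + 1
                marker = None if m == b'Z' else m
                continue
            m = marker
            if m is None:                           # untyped: the marker is inline
                m, pos = buf[pos:pos + 1], pos + 1
            if m == b'U':
                out.append(buf[pos])
                pos += 1
            else:
                raise NotImplementedError('only U payloads in this sketch')
            if remaining is not None:
                remaining -= 1
                if remaining == 0:                  # count exhausted; expect a new
                    remaining, marker = None, None  # section or plain elements next

    # [[] [#][U][3][U] [1][2][3] []]
    print(decode_array(b'[#U\x03U\x01\x02\x03]', 0))   # ([1, 2, 3], 9)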

Required UBJSON Draft 10 Changes

  1. Drop the existing optimized containers
  2. Define a special marker B for booleans (T and F)
  3. Define the Type Specification entity
  4. Define a special marker # for the Container Section
  5. ...
  6. Enjoy the power!

Differences from existing proposals


ghost commented Jul 29, 2014

@kxepal This is a sizeable post, so I'm going to try and address it in chunks to make it more digestible.

I think I've read through this 6x and here are my thoughts:

I like:

  • The idea of repeatable sections to a container. (I've said this before)
  • The idea that we can optimize OBJECT representation. (final missing piece from UBJSON)
  • Ability to optimize mixed-type containers
  • The ability to overload the use of repeating headers as a "chunking" or "checkpoint" mechanism for large payloads of UBJSON (e.g. every 4096 bytes) - this does get into the realm we have discussed before of putting protocol concerns into the spec, but it could be useful nonetheless.
  • The ability, namely with numeric data, to highly specialize the type instead of hunting for a "least common denominator" numeric type to store all values of a container in (your "Container with mixed elements type" section).

I don't like:

  • How complex the TypeSpec can become in more complex containers.
  • Requiring Type-Length header to always be present. The reason I optimized them out of the original spec ($-#) is because with small unoptimized containers it can double your storage space by having them over and over and over again.

I really don't like:

  • Special case requirement of "if numeric type, then MUST include length"
  • Special case treatment of Nulls

Question

  • I don't understand all the comments about "breaking TLV" (type-length-value) -- that is why the current STC's are defined the way they are:

[$][i][#][i][4][abcd] == type:i, length:4, value:abcd

What am I missing? Or do you just mean the optional-nature of the T/L values that you don't like?


kxepal commented Jul 30, 2014

How complex the TypeSpec can become in more complex containers.

As much as you need. We shouldn't limit our users just because we think that their data is too complex. It may be too complex, but if it has repeatable format optimizations would be very helpful.

Requiring Type-Length header to always be present. The reason I optimized them out of the original spec ($-#) is because with small unoptimized containers it can double your storage space by having them over and over and over again.

You're wrong about doubling the storage space: for sized-but-not-typed containers this costs you 1 byte per header; given that such containers aren't optimized anyway, that's not a big problem to care about, while it makes the header structure consistent for all cases. As for the current solution, you have to maintain two kinds of headers: Sized and Typed-and-Sized.

Special case requirement of "if numeric type, then MUST include length"

You don't like the special case, or that numeric types must have a length specified? Because the latter is what the current STC has to do (] vs 93). For other types this limitation isn't necessary. It's cheap to buffer a few numbers to count them, while it may be hard to buffer a few strings for the same goal, since you aren't aware of how big they are.

Special case treatment of Nulls

Nulls semantically mean nothing. If you must have a field but you don't have a value for it, you put null there. It plays the same role for the Length and Type parts: it allows you to reserve the slot without specifying its value, so the parser always knows how much to read, but is free to ignore nulls.

I don't understand all the comments about "breaking TLV" (type-length-value) -- that is why the current STC's are defined the way they are:

[$][i][#][i][4][abcd] == type:i, length:4, value:abcd

What am I missing? Or do you just mean the optional-nature of the T/L values that you don't like?

Because you count [$][i][#][i][4] as a single entity while it is not. # may occur independently from $. To match TLV you should read this structure as

[$][i] - Tag: $, Length: omitted, Value: i
[#][i][4] - Tag: #, Length: omitted, Value: [i][4]
[abcd] - Stream of data.

Two objects + a stream of data, while it describes a single logical object. So you introduce marker relations into the specification, which isn't a good sign, since all the current objects are defined atomically by a single TLV structure.


breese commented Jul 31, 2014

This is a great proposal. The biggest challenge is how much it will impact on the implementation, so ideally there should be a reference implementation before it becomes standard.

I agree with @thebuzzmedia about not liking the special treatment of null. If it has to be special, then we need a TypeSpec to specify that the type is optional/nullable. Nullable values may be nothing, but they may still be semantically important.

Should there be a placeholder TypeSpec? Say we have an array of pairs, where the first element is an integer and the second is a heterogeneous type; then we could use underscore as an indication that more type information will be inlined (e.g. [i_])

  [[]
    [#][Z][[i_]]

    [1]
    [S][2][h][i]

    [2]
    [i][42]

  []]


kxepal commented Jul 31, 2014

@breese Awesome idea with _ as a placeholder for an unknown type - I like it!

I agree with @thebuzzmedia about not liking the special treatment of null. If it has to be special, then we need a TypeSpec to specify that the type is optional/nullable. Nullable values may be nothing, but they may still be semantically important.

Can you clarify what you mean by "special treatment of null" and why it's bad? I have a few thoughts about it, but I want to be sure that we are on the same wavelength. Thanks. /cc @thebuzzmedia


ghost commented Aug 1, 2014

@breese I agree - Alex put together something really fantastic here and I'm using it as a guide to incrementally pull his ideas out in chunks, mix them with the existing spec and move it forward a draft at a time.

  1. [SUCCESS] STC containing STC's - Allow strongly typed arrays to contain strongly typed arrays #43
  2. [FAILURE] Repeatable headers - Redefine strongly typed containers as a more flexible, repeatable, embeddable construct. #48 (too soon, not enough details clarified)
  3. [SUCCESS] Consistent header representation - Make container header more structured and consistent in behavior #51
  4. [TODO] Repeatable headers (once 51 is ratified)
  5. [TODO] Full schema ("typespec") support to allow mass-optimization of Object and Array representation.

#5 is the trickiest... but this is how I've mapped out pulling his core ideas out of this and merging them in nice, digestible pieces that we can debate individually.


Miosss commented Dec 7, 2014

@thebuzzmedia
I think this is the core of the development and @kxepal did some great work.

I believe that #48 will pass soon enough. Then I think that ratifying and polishing of the typespec shall follow shortly. The typespec should resolve much of the problem of n-dimensional arrays, while maintaining mixed-type containers. And it still offers many optimizations - especially in the Record example by @kxepal.

Such a record specification could be great in general; I see it as a simple kind of XSD (XML Schema Definition) but limited only to the syntax of messages (or parts of messages). Having a typespec would even allow describing messages in documentation in a simple, strict format, etc.


ghost commented Dec 9, 2014

@Miosss 100% agree, @kxepal did an awesome job with this.

Most all of the ideas in Draft 12 are just me massaging what he has presented here into something that 'feels' more like the original spec or is a more minor incremental stepping stone for us forward - making small changes one at a time, stepping towards the eventual goal.

You are right that the potentially biggest win here is from what Alex is calling TypeSpec (basically schemas) -- unfortunately it's also probably one of the most complex to define and implement for both generators and parsers.

You'll notice our #2 Goal is 'ease of use' and schemas definitely don't fall into that category :)

That said, like everything else we have to discuss, think, suggest, debate and massage these ideas to move them forward. We've been discussing the typed containers for 2 years, but they have FINALLY gotten to a place that is 10x better than I originally came up with - they are intuitive, clean, conceptually simple and (relatively) easy to implement.

I want to do the same with the schemas/TypeSpec idea.


Miosss commented Dec 9, 2014

@thebuzzmedia Point taken; I am also very reluctant about schemas like in XML. What I understand UBJSON to be is a data serialization format (maybe one of its definitions), and what I keep repeating is that I believe syntax is the key and semantics should not get anywhere near it (and JSON handles that fine).
Maybe I was a bit eager when I finally understood what Alex meant : )

Therefore, I think the typespec could be a potential win in extremely optimizing the type declaration of a container, like in the Record example. And only in that, I suppose. Any other "schema" language, message definitions, etc. was just a flash in my mind, a wrong one : )


ghost commented Dec 9, 2014

@Miosss Agreed; and just for the record, I am just as excited as potentially big space savings introduced by TypeSpec/Schemas/whatever as the next guy :)


Miosss commented Dec 12, 2014

After talk in #61 I have some thoughts:

  • maybe for the sake of some of us : ) it would be better to discuss the typespec in a limited sense - for example, drop the idea of repeatable headers (for a moment) and focus on outlining the benefits of the typeschema in general (let's say only one spec per container)
  • if repeatable headers were set aside, then the requirement that a typespec is always present can be turned off too for now
  • an optional typespec does not require dropping the STC, I believe; the STC header can just be written as a typespec with almost the same syntax (and the syntax can still be discussed)
  • for example, if we called the typespec a generalized STC header, then we could change the meaning of the $# markers to support the typespec (namely: drop $ and reuse # as described in the proposal)
  • the order of count/type can be reversed, I think, without any problems (to not break TLV)
  • maybe it would be better to use [N] instead of [Z] in the typespec; [N] is semanticless in general, while [Z] may produce some problems (special treatment...), but this is just a thought
  • maybe the typespec could be omitted (set to [Z] or [N]) while the count is still specified - the same behaviour as the current STC
  • you have not specified precisely the rules for nesting containers; that's what I tried to do in Fixed-length N-Dimensional Arrays for UBJSON #61 as an example, but it requires some discussion
  • maybe we should promote the "Record" and "ND" benefits of the typespec as the most promising ones, instead of checkpoints, chunks and repeatable headers

Personally I would much rather forget about repeatable headers than forget the typespec. The typespec does not require repeatability : )

After a general discussion of the typespec (and its benefits over ND, for example) we could get on with the forgotten topics.


meisme commented Dec 12, 2014

As I mentioned in #60, I do not see the need for repeatable headers at all. I don't see what it gives us that we don't already have.

Isn't the point of [N] that it's a no-op, with absolutely no meaning? Whenever my implementation sees a [N] it advances a byte without changing any state. If we were to give it a semantic meaning, I don't understand what the difference between [Z] and [N] would be any more.

The advantage of typespec over the ND proposal is the ability to compress nested structures with mixed types at the cost of increased complexity. I agree that it would make sense to hash out that discussion without going ahead on any other proposals.


Miosss commented Dec 13, 2014

@kxepal
I think we need a way here to specify a schema for mixed arrays (for objects you even gave an example).

To efficiently support ND arrays with this idea, there should probably be a way to specify a long sequence of same-typed data.
For example, writing:

    [#][Z][[][I][1024][i][I][256][B][]]

is a lot easier than:

    [#][Z][[][Z][i][i][i]..times 1024..[i][B][B].. times 256 ..[B][]]

So I suppose that consecutive pairs should be allowed. What about the [#] marker?


MikeFair commented Mar 3, 2015

Okay, having reread a few of these issues I have a clearer understanding of what folks are saying in this topic now.

Here's my $0.02 on a few of the outstanding hesitations:

  1. Repeating headers are really important
    I'd say the single biggest data object that needs to be optimized at this time is the SQL Resultset.
    It's regularly shaped and its shape is predefinable, it's as common as dirt, and translates horrendously into JSON. Next to this are the NoSQL DB server resultsets which are almost as predefinable, oftentimes regularly shaped, and fairly common.
    If a proposal cannot represent a SQL Resultset sanely then it isn't thinking big enough.
    The biggest downfall for the SQL Resultset is the repeated headers.

Documents that compress well have lots of repetition in them. That ought to be taken advantage of, not designed out of the spec. Something EXI does on this front is keep a string index: on the first occurrence of every string it is given an index value, and in subsequent string positions the index value can be used instead of repeating the string.

Or repeatable headers is also good...

  2. Null value [Z] and no-op [N] simply aren't the same thing. [N] is effectively a keep-alive byte for streaming transfers and should be ignored ([N] can be completely stripped from a UBJ encoding with absolutely no impact on the represented structures); [Z] on the other hand is a value and can appear anywhere a value can appear, and if [Z] cannot appear as a value in a strongly typed container then STCs are incompletely defined.

@kxepal, I don't see how this allows an STC to contain a [Z] for a value,
especially in the case of the SQL Resultset.
The way SQL databases typically handle it is to keep a bitfield off to the side, one bit for each column on every row, that specifies whether the value is [Z] or whatever data is in the proper position.

An array of 4 numeric values with one of them [Z], [1, 2, [Z], 4]; is not the same as [1, 2, 4].
How is a parser supposed to know how to interpret position 3?


kxepal commented Mar 3, 2015

@MikeFair see first post:

Values cannot be null unless TypeSpec describes the object.

This behaviour strongly depends on what data the parser handles: if it's a typed array, then that's an error - you cannot cast null to some int or float explicitly. If it's an array of objects, which is what an SQL Resultset actually is, nulls are OK.


MikeFair commented Mar 3, 2015

Values cannot be null unless TypeSpec describes the object.

This behaviour strongly depends on what data the parser handles: if it's a
typed array, then that's an error - you cannot cast null to some int or float
explicitly. If it's an array of objects, which is what an SQL Resultset
actually is, nulls are OK.

I saw that, what I can't see is the parser logic that implements it.

Let's say I have a binary bitstream of 15 2-byte values declared in a
typespec. It's been declared that some of them are null.
How can I tell which of them are null?



kxepal commented Mar 3, 2015

Right. For a bitstream, nulls are not something that is expected, since null is NaN there. That's why the parser first reads the type spec and configures a state that is used for further data decoding, until the next type spec header comes (if any) or the stream of data ends.


MikeFair commented Mar 3, 2015

Right. For a bitstream, nulls are not something that is expected, since null is
NaN there.

Ok, but that obviously degrades the optimization opportunity for what I
think is a very common case. Null is a common value, and this is saying "if
any value in the stream is null, then the type for every entry in the array
must be specified inline."

As the array gets longer this space penalty increases.

I guess this is good in the name of simplicity but what do you think of
these ideas:

  • a bit flag array (one bit per element to indicate a null)
  • an array of null indices (one index address per null)
  • a runtime-selected null indicator value, with an array of indices where
    that value is actually not null (e.g. 42 == null, except positions
    2, 16, 255, and 728)

On this last idea, for each typed entry there would be two data arrays; the
first array would be either:

  • [Z] if there are no nulls for that stream
  • [0] if there are nulls but none share the null indicator; followed by the
    null indicator value
  • [N] where N is the number of places that share the null indicator value
    but aren't actually null; followed by those indices; followed by the
    special null indicator value

That array is then followed by the actual data array. The analyzer would
obviously want to choose a value that occurs infrequently (if at all) for
the null indicator value to reduce the length of the exception array.


MikeFair commented Mar 4, 2015

@thebuzzmedia, @kxepal,

While it may be something you guys have discussed before, I found a mechanism to get around always having to predefine the length, and to allow nulls in the bitstream of strongly typed arrays: use a second value after the magic value to determine how to interpret the preceding entry (i.e. use an escape sequence inside the bitstream).

Here's how it works:

  1. Select a magic value; the magic value is the "escape value". Rather than always having it be the same value (e.g. 0 or '\'), the value is specified as the first entry in the sequence of values. At first I really didn't like magic value approaches, but then, when I started developing an algorithm to dynamically find one, I realized that the space to choose a value from is typically going to be much larger than the array length, and so a truly unique value will quite regularly be available.

For instance, when storing a single 8bit byte, integer treatment of these values ranges from 0-254.
If the length of the array is < 254 then we are guaranteed to have at least one unique value to use as an escape character. Not that we need a unique value because we have a way to deal with it when the escape value appears in the bitstream, but having a unique value means we can eliminate the cases where extra data appears just to say "that was an actual value".

As the number of bytes for the type increases, the length the array has to be in order to exhaust all possible values grows exponentially. This makes it reasonable to assume that, for the most part, every strongly typed bitstream can find an escape value with no, or at least few, collisions with real values in the stream.

  2. Use [Z] for the length, [magicValue] for the first element, and [dataStream] to follow. [magicValue] is always treated as an integer with the same number of bits as the defined type. So if it's an array of 8-byte floats, the value for purposes of testing for the magicValue is a direct cast of the same 8 bytes to an unsigned int. If the number of bits for the type exceeds the hardware architecture, then only the first bits up to the length of the architecture are used (this needs to be taken into account when choosing a magic value).

  3. If the magic value is found in the datastream, the next value determines what to do:

  • a zero: that's the array terminator - stop, you've found the end of the array
  • a 1: the preceding value (the magic value itself) is a legitimate value and should be included in the output
  • all 1's: it's a null - mark it as such in the output

This method also gives a number of other commands if you think they would be useful. For example say a 3 means to repeat the value preceding the magic value some number of times (specified by the next value in the sequence), or a 2 means to insert some memorized predetermined pattern (like DEADBEEF) from a lookup table (the index of which is specified by the next value in the sequence).

So putting it all together, if our escape sequence was '\' then our array could be:

    [
    [\], // escape sequence is \
    [1], // the value 1 at position 1
    [2], // the value 2 at position 2
    [\],[254], // a null value at position 3
    [4],[\],[3],[9], // repeat the number 4 9 more times (positions 4 - 13)
    [5], // the value 5 at position 14
    [\],[1], // the value \ at position 15
    [6], // the value 6 at position 16
    [\],[2],[32] // the pattern found in the lookup table at index 32 [DEADBEEF] (positions 17 - 24)
    [\],[0] // end of the array
    ]
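To make the walk-through above concrete, here is a rough sketch (Python) of a decoder for this hypothetical scheme. The command codes follow the list above; the escape value '\' (0x5C), the use of 0xFF as the "all 1's" null code, and the PATTERNS lookup table are assumptions for this example only:

    END, LITERAL, PATTERN, REPEAT = 0, 1, 2, 3
    ALL_ONES = 0xFF
    PATTERNS = {32: [0xDE, 0xAD, 0xBE, 0xEF]}      # made-up lookup table

    def decode_escaped(values):
        """Decode a stream whose first entry declares the escape value."""
        escape, out, i = values[0], [], 1
        while True:
            v = values[i]
            if v != escape:
                out.append(v)
                i += 1
                continue
            cmd = values[i + 1]
            if cmd == END:                     # escape + 0: end of the array
                return out
            if cmd == LITERAL:                 # escape + 1: escape value is real data
                out.append(escape)
                i += 2
            elif cmd == ALL_ONES:              # escape + all 1's: a null element
                out.append(None)
                i += 2
            elif cmd == REPEAT:                # escape + 3 + n: repeat previous value
                out.extend([out[-1]] * values[i + 2])
                i += 3
            elif cmd == PATTERN:               # escape + 2 + idx: insert a pattern
                out.extend(PATTERNS[values[i + 2]])
                i += 3
            else:
                raise ValueError('unknown escape command %r' % cmd)

    stream = [0x5C, 1, 2, 0x5C, 0xFF, 4, 0x5C, 3, 9, 5, 0x5C, 1, 6, 0x5C, 2, 32, 0x5C, 0]
    print(decode_escaped(stream))   # 1, 2, null, ten 4s, 5, the escape value, 6, DE AD BE EF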


kxepal commented Mar 4, 2015

@MikeFair but why? Certainly, you can find a way to put a NaN into a stream of integers, but how do you expect the client side to handle that?


MikeFair commented Mar 4, 2015

@kxepal
what do you mean? How does the client side handle NAN values in JSON arrays and SQL Resultsets now?

There's one of two scenarios I'm envisioning on the client side, textual JSON gets produced, or there's some custom library that's using the bitstream to build in-app custom objects. NAN is a value they'd have to deal with...
It seems like the same question as why did the NAN get into the bitstream from the source in the first place? (because it was in the data)

Maybe I'm just misunderstanding the question.


kxepal commented Mar 4, 2015

@MikeFair An SQL Resultset is an array of objects. Here nulls are OK. No magic numbers are needed. A bitstream is an array of bits. Here nulls aren't expected and aren't acceptable, since you'll wrap that stream in some typed array thing that expects only 0 or 1, not Z.


MikeFair commented Mar 4, 2015

I should have probably used bytestream instead of bitstream.
I was primarily thinking about a JSON array of floats and ints of various byte sizes.

Also in SQL Resultsets I'd expected to be able to optimize the values of each column by specifying the type and allowing for nulls


kxepal commented Mar 4, 2015

@MikeFair JSON arrays are untyped, just as JSON doesn't know anything about integers or floats - these are all just numbers to it. In the general case you cannot assign any specific UBJSON type to such data. But when you can, it's still a single type, and nulls aren't welcome there.

For the SQL Resultset you'll have such an optimisation, and nulls are allowed there. See the top post.


MikeFair commented Mar 4, 2015

For SQLResultset you'll have such optimisation and nulls here are allowed. See top post.

Okay, this is part I'm not seeing.
I see an object specification that gives an array of strong types.
I see a data sequence of those strongly typed values.
What I don't see is the logic by which a parser can read the typespec, and then read an array of strongly typed values (a record) and then figure out which of those values are null (unless it reads the type info for each value in every record).

So it's not clear to me what you are saying when you say nulls are allowed within objects...

What I'm looking at is a case where I've got an array of objects whose typespec is date, float, float, float.
So I'm looking to pull date, float, float, float on each record read. Sometimes those can be null. I'm not seeing the logic that accounts for that and at the same time doesn't require all of the objects' field values to contain a type specifier.


kxepal commented Mar 4, 2015

This logic is defined by the { symbol in the typespec, which means "here an object comes", and which we treat as a record according to the further definition of its fields. It's a common situation that such objects represent JSON documents, and these documents often have typed fields which may hold a null value when no data is specified for them (to preserve the overall object structure).

Please clarify the case you want to understand, because I have actually answered the same thing more than once, and it seems the misunderstanding is still in the air.


MikeFair commented Mar 4, 2015

Encode this array:
[
{"f1": 2.0, "i1": 1, "i2": 2}
, {"f1": 2.0, "i1": null, "i2": 2}
, {"f1": null, "i1": 1, "i2": 2}
]

What I see is something like:
[ {[f1:float, i1:int, i2:int]}
[3]
[2.0, 1, 2, 2.0, null, 2, null, 1, 2]
]

Here the field names and their strong storage type are defined once as a type specification, then there is an array of values matching that new complex type.

Nulls in this case are a problem because it could be data or it could be null and the parser can't tell. By adding a Magicvalue to the stream, now the parser can tell.


Miosss commented Mar 4, 2015

@MikeFair
As I understand it, you are referring to the situation where some DB engine returns NULL as a value for a field of another type, for example integer or varchar.

Well, this is another layer of abstraction - @kxepal refers to NULL in UBJSON. There is no way that I can see to allow your idea, because NULL is not some special value of int8, int16 or whatever - in UBJ it is a completely different "type". Look at the original post at the top for booleans - we do not have a type for booleans now, because in the early stages of development bools were optimized out as [T] or [F]. This is a great and easy trick, but it creates a problem with the typespec when we try to say that some part of an object/array is of type boolean. In this sense there is no way to express your need with the current typespec.

Think about it the other way - how would you declare a typespec for two objects with the same single key "value", but having the values 7 and "seven" for this key - I mean, one is an integer and one is a string. Those types are different - we can't write one definition for that. If we used the placeholder [_] that was proposed in this thread it might be possible, but this has not yet been discussed deeply enough.

To answer your question - NULL in UBJSON is a type (and a "value" at the same time). NULL in SQL is a value appropriate for any supported type - this is quite rare and comes mostly from SQL, I think.

Because the typespec refers to defining the types of data in objects/arrays, there is no way to represent the value NULL in the way you wanted.


MikeFair commented Mar 5, 2015

@Miosss
That definitely clarifies a lot of the confusion! :)

optimized out as [T] or [F]. This is great and easy trick, but it creates
problem with typespec when we try to say that some part of object/array is
of type boolean.

Ahhh! But it doesn't create a problem (at least it doesn't have to).
There can still be the generic boolean type ([B]?); and also in the
specification can be that the type values [T] and [F] are value restricted
subtypes of the boolean type (having their respective values obviously).

Having special codes that convey both their type and their value is
conveying two dimensions of information in a single token (truth be told
it's actually three dimensions: value, interpretation (type), and size;
size is just typically conveyed as part of the type value so it oftentimes
gets overlooked, except for cases like strings, arrays, and objects where
size isn't conveyed and so we need to further specify lengths or use other
termination conditions).

In fact, now that I'm thinking about it, all custom UBJ types could be
formalized using exactly this definition! As value restricted subtypes of a
JSON type.

For example, a 1 byte unsigned int is, an unsigned int, whose values have
been restricted to 0 through 254 that is 1 byte long.

There could be lots of ways to make this subtype definition thing work in
UBJs favor.

For instance, a set of ints that are big, but have a significantly reduced
range like:

  • an int whose values only occur as 1000 through 1254,
  • a series of room temperatures in kelvin;
  • string enumeration types

all come to mind as other extended examples of "value restricted
subtypes".

Think about it other way - how would you declare typespec for two objects
with the same only key "value", but having values for this key: 7 and
"seven" - I mean one is integer and one is string. Those types are
different - we cant write one definition for that.

Sure we can, we just need to be a bit creative about it.

I see that special type as an instance of a value restricted subtype from
an enumeration type.
Let's use [7].

Actually, first we'll need a way to dynamically define this type because
it's obviously too specific and just odd to put in the actual predefined
UBJ spec. Let's use [*] to mean a dynamically defined subtype.

So in the stream I'd expect to first see something declaring the type maybe
like this:
[*7]={ [_], [ [d][7], [S]["Seven"], [f][7.0], [S]["VII"] ] }

Followed perhaps by these:
[_77]={ [7], 0 }
[_7S]={ [7], 1 }
[_7f]={ [7], 2 }
[_7R]={ [7], 3 }

(I'm just making that syntax up on the spot so if people don't like it
please help fix it.)

This says: declare a new value constrained subtype of '_' whose code is 7.
This is a 1 byte type whose values are 0, 1, 2, or 3; because it's described
as an array of values, the indices become the data values we'll see in the
stream.

The interpretation of 0=the integer number 7; 1=the string "Seven"; 2=the
float value 7.0, and 3=the Roman numeral string "VII".

Also declare some additional value constrained subtypes of the [*7] type;
one for each value.
So [77] = the custom type [7] with the value 0 (corresponding to the int
7), etc.

So in the value stream for a [7] you could see a 0, 1, 2, or 3. For an [_]
stream you could see a [7S] to mean the string "Seven" or you could see a
[7][1] to mean the same thing.

Further using this idea, I think the boolean types could look something
like:
[B]={ [_], [d][0], [d][1] }
[_T]={ [B], [d][1] }
[*F]={ [B], [d][0] }

Which says declare a new subtype B of _ where the values are int=0 and
int=1 and add two more new types, T and F, corresponding to those
respective values...

To answer your question - NULL in UBJSON is a type (and "value" at the
same time). NULL in SQL is value appriopriate for any supported type - this
is quite rare and comes from SQL mostly I think.

I've never seen null being a type before. I've seen null used as a value
for type (aka type is undefined); but never a null type.

json.org has "null" listed under the section "values". Null as a value
comes up quite a bit.

In javascript the typeof "null" is an object:
http://www.w3schools.com/js/js_typeof.asp

There's also lots of other context for null to be a value instead of a type.

When we code for null, we write something that tests the value of a
variable being null, not the type of that variable being null. A null type
would actually mean something different, that would mean that the type of
the variable has not yet been defined.

Such a variable could still have a value, there would just be no predefined
instructions on how to interpret or distinguish its value.

Interpreted languages deal with exactly this case (where a variable's type
is unknown) all the time.

It seems null is being treated more like javascript's undefined at the
moment.

Null as a value is definitely not unique to sql databases, every common
language I can think of has the concept of null as a value. C/C++ may make
it the same thing as 0 but they still have the null keyword, and you test
values

Because typespec refers to defining types of data in objects/arrays,
there is no way to represent value of NULL in the way you wanted it.

That's what a type's magic value would be for. It's a special value that
essentially means "look at the next value for more information on this
value". One of the next value indicators would indicate null while another
would indicate the value of the magic value itself.


Miosss commented Mar 5, 2015

C/C++ may make it the same thing as 0 but they still have the null keyword, and you test
values

This is the best example of a language where NULL is not a value (for a specified type). In non-dynamic, do-what-you-want-nobody-cares languages NULL is not a value of a type. Let's take Java - null is not a value for an integer, string or whatever. NULL is a value for a reference - which can be set to something, or to nothing. In C/C++ NULL is not a value for integers nor for floats - it is a value for pointers, which can be set to some address in memory or not be set at all (I am simplifying, actually, because due to the history of C, NULL is not something very different; at some point in history it was agreed that dereferencing memory address 0 is strictly illegal, and that's why NULL is not something different - it just points to address 0 in memory, which is illegal per pointer semantics).

As you can see, in strongly typed languages null is not a universal value suitable for any type. Mostly it is used for pointers, maybe references. In the UBJSON spec we should not focus on any particular architecture, so we cannot enforce using pointers, optional values, etc.

Null exists in UBJSON as a reflection of JSON's null. As such, it can be used as a value in any place of a (UB)JSON message - for example, you could set some values in an array to NULL even if you would expect floats or strings in those places - it is up to the decoding application to respond properly to that.

But the typespec is a little different - it relies on the fact that you can find/hard-set a pattern in the following data types. If some integers in an array can be nullable, then such a typespec cannot be written. What's more - because UBJSON is a transport protocol - look at the example:

Imagine we send an array of 1024 integers with every request from the client. The server and client logic do not know anything more than that those values are integers (let's say max 4 bytes) or nulls.
In one message, it turns out that all the integers are in the range 0-255. Then I imagine that some more advanced encoder would detect that and encode the array something like this:

[[]
 [#] <- we declare typespec
 [I][1024] <- 1024 consecutive values
 [U] <- of type uint8
 [...] <- data...
[]]

But if it happens that some values are null, or the encoder does not force optimizations, we fall back to the standard form:

[[]
 [U][123] <- 1st value
 [Z] <- 2nd value
 [U][1234]  <- 3rd value
 [Z]
 [...]
[]]

(Of course) the first version is way smaller - it is 1031 B, while the second is around 2000 B (depending on the number of NULLs).

The point is that those two versions can usually be translated one-to-one. BUT they can't when NULLs are involved, because @kxepal has not invented a solution for that yet : ) He has, though, for optimized bools -> the [B] is in the original post at the top. That was a lot easier, though, because bool is in fact a type, one which was thrown away as an optimization.

Concerning your ideas.

What you proposed with those [*7] etc. is a kind of enumeration of possible values, regardless of their type. While it is fine in the sense that the type can be safely incorporated into an enumerated value in UBJ (quite an interesting idea, I must say), it does not provide any advantage in other situations - aside from enumerations. Following my example with only 2 values, 7 and "seven": if you create a "dictionary" that allows one of those two values in this place in the typespec, then you gain no space nor efficiency.

The idea of creating dictionaries, for example for keys in objects, may seem interesting (create a dictionary in front of a message and use only references/indexes to it in the actual JSON message), but this is more of a task for a compression algorithm, while UBJSON should focus on optimizations that can be found in the JSON domain. Arrays of numerics (ND) and arrays of objects (SQL, REST APIs, etc.) seem to be the common cases. And while ND solves only the first, the typespec can solve both, I believe.


kxepal commented Mar 6, 2015

Thanks @Miosss ! Sorry, @MikeFair , looks like that day wasn't my best one ):


MikeFair commented Mar 6, 2015

@Miosss, @kxepal,

What about doing this!

In place of giving the entire array container a specific type and length, create a new type specifier [!] to mean " a series of " ( or "a segment of"), followed by a 1 byte length, then a type.

The special 1 byte value of [0] in the length field would mean use the following [LENGTH] instead of this byte.

Then embed the [!] type modifier inside the array.
[[]
[!][28][I]...
[Z]
[!][22][d]...
[S]...
[!][100][Z]
[!][0][I][1024][U]...
[T]
[!][12][F]
[!][37][B]...
[]]

This way arrays can be built up in segments.

By using a linked list of "segments" (where a segment has a type, an array of data, and a start/end index) instead of assuming one long contiguous block, the array still remains compact for mixed types, and it enables a decent mechanism to fairly quickly scan through the segments to find a specific array value at a specific index (if reading it).
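A small sketch (Python) of the index lookup this enables, assuming each segment has already been decoded into its own list of values; the segment layout here is an assumption for illustration, not a proposed encoding:

    from bisect import bisect_right

    def index_segments(runs):
        """runs: one list of decoded values per segment, in order.
        Returns the starting array index of each segment."""
        starts, total = [], 0
        for run in runs:
            starts.append(total)
            total += len(run)
        return starts

    def value_at(runs, starts, index):
        """Fetch the element at an array index without flattening the segments."""
        seg = bisect_right(starts, index) - 1
        return runs[seg][index - starts[seg]]

    runs = [[28, 29, 30], [None], ['a', 'b']]     # e.g. an [I] run, a [Z], a [C] run
    starts = index_segments(runs)                 # [0, 3, 4]
    print(value_at(runs, starts, 4))              # -> 'a'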

This implies bringing back the boolean type [B]. Which would be in addition to keeping the two specific boolean value constrained types. (As an aside: Consider other special value specific case types aside from T/F; like creating a few types for the various integer and float sizes of 0. (Or even better, use the enumerated list trick from above to define these specific value instances of a type.))

Back to the arrays: a small gain can be had here if, when [!] occurs outside the scope of an array container (by itself as a value), it means to create a single array of that single type.

If it's any kind of a mixed type array, then you must use [].

Using this technique is a good way to describe a 1D vector of mixed values.

From here the more general problem of containers can be solved by using the 1D vector of values, and and then providing metadata to it that describes how this vector is laid out. Similar in context I think to #51 (create a new container type, using [<] and [>], that means "metadata enhanced vector").


Miosss commented Mar 6, 2015

@MikeFair
Your current proposition is quite nice in some cases : )
In fact, this is what we call repeatable headers, and this was discussed a lot in other issues (even in this one). I admit that we have a mess now, because optimizing arrays and objects is spread across so many issues and long conversations that it is hard to catch all the ideas at once.

Currently the biggest discussion is around the typespec, ND and the current STC. The most important thing is to carefully pick one of those as a feature. I strongly believe that we have to choose something, because, after some consideration, I think that UBJSON without any of those gives us so few advantages that it is way simpler to stick to JSON (maybe apart from encoding arrays of integers, but many people do that already without UBJ).

Things such as repeatable headers, ___ as typespec placeholder, etc. are important, but they are less important and are determined by more general options.

We have some break now I suppose, everyone is rethinking all the ideas and we cannot reach any consensus yet. So your fresh look is appreciated : )


MikeFair commented Mar 7, 2015

Ok, so I reread through the whole proposal here again and can make some better refinements/comments.

"repeatable headers" didn't exactly inspire the correct vision for what they are. Also the term "Containers" explicitly refers to objects or arrays for me.

I created [!] which I can now see is really just reinventing a version of @kxepal's [#] constrained to a single type. So my comment is to restrict [#] to a single type and make that a 1D vector, an optimized 1D vector. It has all the expressive power of the typespec for no additional cost that I can see.

I also just opened #66 to request the change to length; which was another comment. (but kind of orthogonal to this).

Lastly, optimized containers are worthy of their own object class, like in #51. By having a more powerful 1D array type, an optimized container format can take advantage of that. For instance, in the case of #43, something like:

    [<] [@] // This is an optimized multidimensional array
    [2] // it has 2 dimensions
    [R] // The vector is laid out in Row-Major order
    [ [U][2] [I][1024] ] // those dimensions have length 2 and 1024
    [ [#][0][I][2048][d]... ] // here is the 2048-element vector of float data for the array

You don't need to define the type of the array, you just define that the vector has 2048 floats in it and the optimized container explains the rest.


MikeFair commented Mar 7, 2015

I wanted to add one more example to my comment above that's more specific to this issue.

If this were a 1024-element array of [iSZC] arrays (the example from the top description, though I modified the second [S] to be a [Z]), then the container descriptor becomes this (though I've rearranged the container description from my prior example a little):
    [<] [@] // This is an optimized multidimensional array
    [C] // The vector is laid out in Column-Major order
    [[] [4] [0][I][1024] []] // there are 2 dimensions and they have lengths 4 and 1024
    [[] // And here is the vector that describes the data
    [#][0][I][1024][i]...
    [#][0][I][1024][S]...
    [#][1024][Z]
    [#][0][I][1024][C]...
    []]
    [>]


Miosss commented Mar 7, 2015

"repeatable headers" didn't exactly inspire the correct vision for what they are. Also the term "Containers" explicitly refers to objects or arrays for me.

Repeatable headers make sense only in containers. In JSON even a single null, integer or string is a valid JSON message, but you rarely see something like that (except in some streaming). This is why almost every JSON document is one big container, an array or, usually, an object. If you remember this, it is clear that headers could rarely be used outside containers, because we hardly work outside containers at all.

Your first example is indeed a 2D array, isn't it? If so, why are you talking about optimizing 1D? By the way:

[C] // The vector is laid out in Column-Major order

Steve already commented about that. Data alignment is application problem, not UBJs.

Your second example is an array of 1024 integers, then 1024 strings, then 1024 nulls and finally 1024 characters. While this is perfectly legal, I doubt it is a popular use case. In dynamic languages and webdev, data will usually be mixed and mostly consist of strings.
In an environment where people know how to code, they would probably split those 4 segments into 4 different arrays. And Steve would definitely agree that having 4 arrays of fixed type is much better than having one big array of mixed types.

But since it's legal, here is how it would look in the typespec (with repeatables, see the top post):

[[]
 [#] <- typespec begins
 [U][1024] <- 1024 values
 [i] <- of type 'i'
 [your 1024 integers...]

 [#] <- repeated 'container section' occurs
 [U][1024] <- 1024 values
 [S] <- of type 'S'
 [your 1024 strings...]

 // here we do not have a way to define 1024 consecutive nulls
 // - it does not seem likely to happen, but it's a flaw
 // maybe writing [#][I][1024][Z] could mean this? I know that now it only informs the
 // parser that there will be 1024 values of some mixed types, but does it really give any advantage?

 [Z][Z]...[Z] <- 1024 values of null 

 [#] <- repeated 'container section' occurs
 [U][1024] <- 1024 values
 [C] <- of type 'C'
 [your 1024 characters...]
[]]

Repeatables were laid aside for a moment due to some reasons (unknown-length containers, I suppose?), and this creates a problem if we do not have them. This is what I wrote about before, Mike - how do we efficiently define 1024 consecutive types in a typespec? It seems that dropping repeatable headers makes the case of mixed-type containers much more difficult.

@MikeFair

@Miosss
There are a few points here, so I'm going to break it across multiple replies.

Repeatable headers make sense only in containers.

While true, this is not the same thing as saying all instances of repeatable headers are instances of a container, specifically just an array (because a repeatable header as an object doesn't really make much sense the way it's currently being constructed). An Array can contain many instances of a repeatable header; each instance appending more elements to the array.

it is clear that [repeatable] headers could rarely be used outside containers, because we do not work outside containers at all

Objects use stand-alone JSON values all the time, both as field names and as values. Since # makes absolutely no sense as a field name, it's therefore invalid there. But what's the actual error?

By saying that # defines an array in its entirety when used outside the scope of an open array, the decoder would likely report this error as something along the lines of "Array found where string was expected" or "An array cannot be used as the field name of an object".

The other place this could happen is as the field's value.

Consider this object:
{ "foo": [1,2,3,4], "bar": "seeFoo" }

And the following encoding:
[{] [S3][foo] [#4U][1][2][3][4] [S3][bar] [S6][seeFoo] [}]

Without this definition it would not be clear which of the following this encoding would mean:
(1) "INVALID: # cannot appear where JSON Value expected (must be as a value within an array)."
(2) { "foo": [1, 2, 3, 4, "bar", "seeFoo"] }
or the intended:
(3) { "foo": [1,2,3,4], "bar": "seeFoo" }

The simple/naive approach is to always force arrays to have [[] and []] surrounding them (interpretation 1). And this approach has its merits but it does burn those [[] []] characters for the simple case.

However, since (like you said) repeatable headers can only appear inside the scope of an array, I figured it would be a nice optimization to just say: "If a repeatable header appears as a stand-alone value, i.e. not where a value of a currently open array would appear, then it defines an entire array value".

This makes the above encoding clear without forcing the additional [[] and []] around the outside of the repeatable header, and it is most likely what was meant. To get object (2), foo's value would have to use the surrounding brackets [] to define that.


The other ambiguous case is the jagged array of arrays.
Compare these two:
(1) [ 1, 2, 3 , "a", "b", "c" ] // An array of 6 elements
(2) [ [1, 2, 3], ["a", "b", "c"] ] // An array of 2 arrays, each with 3 elements

And this encoding:
[[] [#3U][1][2][3] [#3C][a][b][c] []] // Open an array, Repeat 3 uint8, Repeat 3 Char, Close the array

If # always meant "Define an array container" then only the second array could be built with this. Without the surrounding []s there is no way to tell when values should stop being added to the array (as has been mentioned by others before).

So in this context, # means, "Append this [count] of this [type] of element to the current array".
Which brings us back to interpretation (1) from the object example above.

So, to enable the use of # to define a singularly typed array as an object field's value in its entirety (which I think will be a very common thing to do), I tried to craft language that would clearly mean: if # is encountered where a JSON value is expected, it defines an array of exclusively that type and length, unless there is already an open array, in which case it appends its values to that array.

This is obviously an optimization to eliminate the requirement of the []s in the case of an object field's value, and so is totally an optional thing to include.
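A minimal Python sketch of that rule; the StubReader class and its read_count/read_type_marker/read_value helpers are toy assumptions made purely for illustration, not part of any spec:

    class StubReader:
        """Toy stand-in for a UBJSON byte reader (hypothetical interface)."""
        def __init__(self, count, vtype, values):
            self._count, self._vtype, self._values = count, vtype, list(values)
        def read_count(self):        return self._count
        def read_type_marker(self):  return self._vtype
        def read_value(self, vtype): return self._values.pop(0)

    def decode_repeatable_header(reader, open_array=None):
        # '#' already consumed; read [count][type], then count values of that type.
        count = reader.read_count()
        vtype = reader.read_type_marker()
        values = [reader.read_value(vtype) for _ in range(count)]
        if open_array is not None:          # inside [ ... ]: append to the open array
            open_array.extend(values)
            return open_array
        return values                       # stand-alone value position: a whole array

    # { "foo": #4U 1 2 3 4, ... }  ->  "foo": [1, 2, 3, 4]
    foo = decode_repeatable_header(StubReader(4, 'U', [1, 2, 3, 4]))
    assert foo == [1, 2, 3, 4]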

@MikeFair

Your first example is indeed a 2D array, isn't it? If so, why are you talking about optimizing 1D? By the way:

[C] // The vector is laid out in Column-Major order
Steve already commented on that. Data alignment is an application problem, not UBJSON's.

In both these cases; the 1D value is not coming from the application, it's coming from UBJ.
What the two examples are defining are an optimized packing for the two array containers.

The second example is actually a 1024 element array of a repeating sequence of ISZC values (aka objects that have had their field names stripped, and their values repacked efficiently as an array of arrays). By defining the data vector as being laid out in Column-Major, instead of Row-Major order, it created a way to rotate the arrays so they transmit/decode more efficiently, but didn't actually change the layout of the original repeating sequence of Mixed-Type arrays at all.

By creating a UBJ container that has some flexibility in terms of how it transmits the packing of its values versus the layout of the original JSON container, it can maximize the use of repeatable headers during the transmission but still recreate the original shape when it gets decoded. This is taking a 1D UBJ Array, something that is very easy and simple to create/encode, and using it as the set of values for the container; the container's "vector of values" (the "data vector").

In the first case, it happens to be a 2D array, but it could also easily be switched to being an array of jagged arrays that contain a total of 2048 values between them.

My original was this:

    @ // This is an optimized multidimensional array
    [2] // it has 2 dimensions
    [R] // The vector is laid out in Row-Major order
    [ [U2] [I1024] ] // those dimensions have length 2 and 1024.
    [#0I2048][d]... // here is the 2048 element data vector 
    

this exact same construct could also have defined an array of arrays that have a combined 2048 elements using something like this:

    @ // This is an optimized multidimensional array
    [A] // The data vector contains an array of arrays
    [ [U24] [I500] [I1000] [I500] ] // those arrays are lengths 24, 500, 1000, and 500.
    [#0I2048][d]... // here is the 2048 element data vector 
    

By separating the layout of the complex container from its data vector, there is flexibility to more optimally pack the data vector for compact transmission and decoding speed.

So what this idea is doing is saying that "The way the container was originally described in JSON may not be the most efficient way for UBJ to transmit and decode it. Here's an efficiently encoded data vector that is fast to decode and a description for how to unpack it to recreate the original container."
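For the jagged-array variant, here is a minimal sketch (Python); lengths and flat are hypothetical stand-ins for the decoded segment lengths and the flat data vector:

    # Hypothetical decoded pieces: segment lengths and one flat data vector.
    lengths = [24, 500, 1000, 500]
    flat = list(range(sum(lengths)))

    def split_segments(flat, lengths):
        # Cut the flat data vector into consecutive sub-arrays of the given lengths.
        out, pos = [], 0
        for n in lengths:
            out.append(flat[pos:pos + n])
            pos += n
        return out

    jagged = split_segments(flat, lengths)   # 4 arrays of length 24, 500, 1000, 500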

@MikeFair

Your second example is an array of 1024 integers, then 1024 strings, then 1024 nulls, and finally 1024 characters.

As I mentioned above, the second example was misinterpreted; sorry about not speccing it out more clearly/completely.
By defining the data vector as being laid out in Column-Major order, that's really an array of 1024 repeating sequences of ISZC, AKA an array of 1024 objects with 4 fields each (an int, a string, a null, and a character), but secretly recast as an array of mixed-type arrays by the application before being given to UBJ (then UBJ repacked it again as 4 arrays of the same type for the transmission, like you observed).
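A small sketch of that rotation (Python); the four column lists are made-up placeholders for the decoded per-type runs:

    # Hypothetical decoded columns: 1024 ints, 1024 strings, 1024 nulls, 1024 chars.
    ints  = list(range(1024))
    strs  = ["s%d" % i for i in range(1024)]
    nulls = [None] * 1024
    chars = ["a"] * 1024

    # Column-major means the data vector holds whole columns; rotating it back
    # (a transpose) recreates the original 1024 rows of [i, S, Z, C] values.
    rows = [list(row) for row in zip(ints, strs, nulls, chars)]
    assert rows[0] == [0, "s0", None, "a"]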

In an environment where people know how to code, they would probably split those 4 segments into 4 different arrays.
...
But since it's legal, here is how it would look in typespec (with repeatables; see the top post):

The JSON array that was actually defined was not an array of 4 sequences of 1024 elements of the same type. It was 1024 sequences of 4 mixed type values. This is the power of being able to rotate the array layouts.

What I probably should have done was just define it as an object in the first place but I thought introducing objects to the fray was just going to add another layer of confusion so I used a repeating sequence of ISZC instead.

In an environment where people knew how to code, they would do exactly what you said, create four arrays, and transmit those 4 arrays with some additional metadata that explained how to put the objects back together on the other side. That's exactly what these examples are doing; however instead of creating four (or N) separate arrays for everything, it's defining one data vector for the whole container and then using the dimensions and layout information to describe how to recreate the original JSON container from that vector. That's why I care about optimizing the 1D mixed type array; it's used as the source data vector for describing several different kinds of optimized complex containers using a simple repeatable handful of tools/techniques/layout descriptors.

Imagine that field 3, the null, wasn't null for every object, just for many of them.
So instead of:
[#0I1024Z]
we had:
[#0I500Z][#200S]......[#250Z][S]...[S]...[Z][Z][#70S]... // 1024 Mixed types of elements

Also imagine that inside this array of objects, one object (the 800th element) had an additional field called "Temp", which was a double, but didn't have field 4, "Classification".

Here's what the "object" version of my second example could look like with those changes included:

    [%] // This is an optimized object
    [C] // The data vector is laid out in Column-Major order
    [[] [5] [0I1024] []] // the data vector contains 5 fields of 1024 objects 
    [[] // Here is the array of field names  
        [S][8][ObjectID]  [S][5][Label]  [S][8][ImageURL]  [S][14][Classification]  [S][4][Temp]
    []] 
    [[] // And here is the data vector
        [#0I1024i]......
        [#0I1024S]......
        [#0I500Z][#200S]......[#250Z][S]...[S]...[Z][Z][#70S]...
        [#0I799C]......[_][#224C]......
        [#0I799_][d][#224_]
    []]

I should point out the mixed-type stream used for field 3 (ImageURL), and the use of [_] to mean undefined for some of the object instances' fields.
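As a rough sketch of how a decoder might rebuild the objects from such a column-major layout (Python); the UNDEFINED sentinel, the field names, and the tiny 3-object columns are illustrative assumptions, not the 1024-object example above:

    UNDEFINED = object()   # stand-in for the proposed [_] "no value for this field" marker

    field_names = ["ObjectID", "Label", "ImageURL", "Classification", "Temp"]
    columns = [
        [1, 2, 3],                      # ObjectID
        ["a", "b", "c"],                # Label
        [None, "http://x", None],       # ImageURL (mixed null/string)
        ["cat", UNDEFINED, "dog"],      # Classification missing on object 2
        [UNDEFINED, 21.5, UNDEFINED],   # Temp present only on object 2
    ]

    # Rotate the columns back into row objects, skipping undefined fields.
    objects = [{name: col[i] for name, col in zip(field_names, columns)
                if col[i] is not UNDEFINED}
               for i in range(len(columns[0]))]
    assert objects[1] == {"ObjectID": 2, "Label": "b", "ImageURL": "http://x", "Temp": 21.5}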

This same idea can be applied to optimize the layout for many different containers; choosing efficient layouts for commonly recognizable patterns.

And Steve would definitely agree that having 4 arrays of fixed type is much better than having one big array of mixed types.

The source JSON containers handed to UBJ are going to be mixed type by definition. And obviously it's going to be more efficient to pack the same typed data together where/when you can. So I see this technique (and the array extents/segments idea that goes with it) as proposing a way to maximize the same type packing, while at the same time respecting the fact that JSON containers are inherently mixed typed.

The point to making the repeatable header singularly typed is to maximize the decoder's ability to bring in the same type technique as often as it can.

Whether the example above is packed as 5 separate arrays (one for each field) or as 1 long array, it will pack and unpack almost exactly the same way using a singularly typed repeatable header. It works primarily because there's metadata explaining that the provided layout of data is rotated from its original and so needs to be rotated again before handing it off to the end user (aka Column-Major).

Packing it as a 1D data vector with a layout (aka Row-Major or Column-Major, the count of dimensions and their lengths), it's easy to reuse the same construct for both array and object containers, and to extend it to many different container optimizations like sparse (though the trick of using a repeatable header on the [_] type gets pretty close to sparse); Arrays of Arrays (of differing lengths); and ND Arrays.

I appreciate you guys taking the time to let me comment; Thanks!

@Miosss

Miosss commented Mar 11, 2015

@MikeFair
I am not sure you fully understand typespec, because it does most of what you say (in this issue). And while it should be polished up, I think it is quite good (at the specification level).

This is obviously an optimization to eliminate the requirement of the []s in the case of an object field's value, and so is totally an optional thing to include.

If you look at my examples of your JSONs, or kxepal's at the top, you will notice that the [ and ] markers are present. typespec IS NOT the definition of a container; [ and ] are. This does not differ from JSON nor from the current UBJSON spec.
typespec can occur in any place IN a container (array or object) where the parser expects the next value specification. Such a situation occurs, for example, just after reading the [ marker (array start). The same situation arises when you have read, say, one integer, or one key-value pair in an object, or when you have stopped interpreting the last block of data as the previous typespec declared (repeatable headers of typespec).
Think of it more as a "Section declaration" rather than a "container declaration". typespec says that the following X bytes shall be interpreted in this container as something; after that, peek at the next byte and decide (a sketch follows this list):

  • if it is ], then the container closes
  • if it is an object, you expect another key declaration
  • if it is a value, you expect another value declaration
  • if the next byte is #, then parse this typespec and interpret the next X bytes as defined.
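Here is a toy sketch of that decision loop (Python), operating on a pre-tokenized stream rather than raw bytes; the token layout is an assumption made purely for illustration:

    def parse_array_body(tokens, pos=0):
        # Toy sketch of the decision loop above; returns (decoded_list, next_position).
        result = []
        while True:
            marker = tokens[pos]
            if marker == ']':                       # container closes
                return result, pos + 1
            if marker == '#':                       # typespec section: '#', count, type, then values
                count, vtype = tokens[pos + 1], tokens[pos + 2]
                # values are already tokenized here; a real decoder would use vtype
                # to parse 'count' raw values from the byte stream.
                result.extend(tokens[pos + 3:pos + 3 + count])
                pos += 3 + count
            else:                                   # plain value: type marker then value
                result.append(tokens[pos + 1])
                pos += 2

    # [ # 2 i 10 20  S "hi" ]  ->  [10, 20, "hi"]
    decoded, _ = parse_array_body(['#', 2, 'i', 10, 20, 'S', 'hi', ']'])
    assert decoded == [10, 20, 'hi']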

Typespec can predefine keys for pairs in object.

Concerning the ISZC example, now that I understand you, it would look like the following (we transmit an array of 1024 arrays, which contain I, S, Z and C consecutively):

[[] <- big array starts!
 [#] <- some data will be defined using this typespec!
 [U][1024] <- we will have 1024 "values" of type:
 [[] <- of type array!! (this array marker is INSIDE the typespec and is used only for its purposes)
 [I][S][Z][C] <- and this array will have four values of types: I, S, Z (this is really pointless) and C
 []] <- close of the typespec-declaration-subarray, for consistency and parsing

 // here the actual data starts, and it will look like this:

 [123] <- I
 [5][hello] <- S
 [Z] <- this is strange...
 [a] <- character

 [256]
 [5][world]
 [Z]
 [b]

 ...
[]] <- big array ends!

So in fact, after successful parsing, you will decode an array of 1024 arrays, each with 4 values of different types. You would use it like:

Value result = parse(data);
print result[0][1]; // prints "hello" (0-based: element 1 of the first sub-array)
print result[1][1]; // prints "world"
print result[1][3]; // prints 'b'

@MikeFair

@Miosss
"Strongly Typed Section"! Now we've got a word for them. :)

I do understand what this proposal is saying; and how the existing typespec (at least the one on the ubjson.org website) works.

There's a few different comments I'm making here.

  1. I totally agree that this version of typespec is great for the basic value types! The parts proposed for arrays and objects, though, are where I disagree. I see more compact and faster-to-decode alternatives that are simpler to generate and comprehend.

  2. '#' should not be applicable to objects and arrays at all; '#' should preserve the "Strongly Typed Section" semantics.

  3. Like the top post says, I agree that [$] should be eliminated from the spec and that the '#CTLV' pattern is a powerful one. I agree that Objects and Arrays require a solution (just not this one).

  4. The way I see it, '#' is best used as a modifier of the count of 'T' values to retrieve from a TLV, and nothing more. Whereas a TLV means "Fetch 1 value of T", #CTLV means "Fetch C values of T" (a sketch follows this list).
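A minimal encoder-side sketch of that reading of '#' (Python); the two handled types and the helper names are assumptions for illustration only, not a complete or official encoding:

    import struct

    def encode_tlv(marker, value):
        # Toy big-endian encoder for a couple of scalar types (illustration only).
        if marker == 'U':
            return b'U' + struct.pack('>B', value)
        if marker == 'd':
            return b'd' + struct.pack('>f', value)
        raise ValueError('marker not handled in this sketch')

    def encode_ctlv(count, marker, values):
        # '#CTLV': the count just multiplies how many values of the one type follow.
        assert len(values) == count
        payload = b''.join(encode_tlv(marker, v)[1:] for v in values)  # strip repeated type byte
        return b'#' + encode_tlv('U', count) + marker.encode() + payload

    one  = encode_tlv('U', 7)                 # "Fetch 1 value of T"
    many = encode_ctlv(4, 'U', [1, 2, 3, 4])  # "Fetch C values of T"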

typespec IS NOT the definition of a container; [ and ] are

Right; and I'm not saying otherwise. I was simply saying that when a stand-alone value (such as the value of a field in an object) happens to be an array that contains only one type, then '#', in that case, can define the entire array value.
The '#CTLV' construct can define an entire array right there on the spot, fully and completely, because when encountered as a stand-alone value, that's the only thing it could possibly be that makes sense (eliminating the [] requirement for this strongly typed array value case).

Yes? (Not that you agree that it should do that; only that you see that it could.)

Typespec can predefine keys for pairs in object

Except that there's nothing about JSON that requires all values of the same field name in different objects in an array to be of the same type. This is extremely common when getting numerical data from computed values; you'll frequently encounter any of null, "NaN", "Inf", "N/A", "#NA", "@na", "DIV/0", etc. as strings where a number value is expected, because the formula was either invalid or the data simply doesn't exist.

If typespec were implemented as described in this specification then any large dataset that includes those "NaN"/"Inf" strings or null as field values simply could not be optimized this way (or it would have to be broken up into smaller sections that did successfully meet the repeating type pattern).

By using '#CTLV' and a different approach to objects and arrays that maximizes the use of #CTLV it eliminates those difficulties while remaining small (in most cases, smaller than existing proposals) and fast to decode (mostly because of the "Strongly Typed Section" semantics).

Concerning the ISZC example, now that I understand you, it would look like the following

What seems to be missing here is the amount of parsing/interpretation/tracking work this approach requires the encoder/decoder to perform.

For instance, by defining it as #1024[ISZC], there is no opportunity for the decoder to predeclare a strongly typed array of length C and memcpy the values directly from the file stream; it must decode each value one at a time (which it can, because it knows their types; it just has to do it value by agonizing value).
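A small sketch of the difference (Python), using struct as a stand-in for the memcpy-style fast path; the int16 payload is a made-up example, not data from this thread:

    import struct

    # Stand-in payload: 1024 big-endian int16 values packed back to back,
    # as they might appear after a strongly typed [#][count][I]-style header.
    payload = struct.pack('>1024h', *range(1024))

    # Strongly typed section: one bulk unpack (the memcpy-like fast path).
    bulk = struct.unpack('>1024h', payload)

    # Mixed-type section: each value must be decoded one at a time.
    one_by_one = [struct.unpack_from('>h', payload, off)[0] for off in range(0, 2048, 2)]

    assert list(bulk) == one_by_one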

There is also no opportunity for compressing runs of the "value explicit" types like Z, T, and F (and if more of these "value explicit" types were added (like the various integer and float sizes for 0, and 1) there'd be more of these optimized types to compress/use).

Let's say instead of [Z], field 3 of the object was usually [T], and occasionally [F]. Using the typespec approach declared in this spec, it declares this as #1024[ISBC] and actually has to put all the [T] and [F] values into the stream. By making it possible to simply rotate the array streams, it can now declare the sequences of T and F more compactly, and use the #CTLV format for the other fields too.

When rotated, the entire Boolean value stream looks like this: [#352T][#8F][#40T][#124F][#500T]

This provides all 1024 values in 21 bytes instead of 1024 bytes and gives the decoder a way to translate/read in the values of each Strongly Typed Section way faster than having to interpret each value as a separate type one at a time.
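A tiny sketch of producing those runs from a Boolean column (Python); the output is marker notation as text, not real UBJSON bytes:

    from itertools import groupby

    def encode_bool_runs(column):
        # Turn a Boolean column into '#<count><T|F>' runs (marker notation only).
        return ''.join('[#%d%s]' % (len(list(group)), 'T' if value else 'F')
                       for value, group in groupby(column))

    column = [True] * 352 + [False] * 8 + [True] * 40 + [False] * 124 + [True] * 500
    print(encode_bool_runs(column))   # [#352T][#8F][#40T][#124F][#500T]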

Further, the Object example I described later, with differing fields between objects and the series of null values replaced with a mixture of nulls and strings, becomes much more difficult to encode using this specification.

When dealing with the largish datasets I've worked with (typically anywhere from 100MB to 100GB binary compressed), long runs of 0 or null or a repeating sequence of the same values over and over again is commonplace...

I also find that type exceptions in a long series of a field's values (a null in any field, a string in a typically numeric field as I mentioned above, or an integer in a float field and a float in an integer field) are all quite commonplace.

An optimized array/object spec ought to handle that "mixed typeness" effectively, and I simply don't see the repeatable headers as defined for arrays and objects in this specification as meeting that need.

All that taken together is what has me saying: "Yes, I see defining '#' this way is good; I say keep it as just modifying the count of values of type T to consume, and use a different approach to handling the cases of arrays and objects (namely, enabling the frequent use of #CTLV, and in case it's not clear, here's one way of doing that)."

@Miosss

Miosss commented Mar 12, 2015

Well, I am not at all convinced by all the data relocating, shifting, and packing into the nice, optimizable form that you propose. But there are a few points I can still comment on:

  1. I totally agree that this version of typespec is great for the basic value types! The parts proposed for arrays and objects, though, are where I disagree. I see more compact and faster-to-decode alternatives that are simpler to generate and comprehend.

What alternatives for arrays and objects of mixed types, deep structures, subarrays, subobjects, etc. do you see?

  1. '#' should not be applicable to objects and arrays at all; '#' should preserve the "Strongly Typed Section" semantics.

Whether typespec (as described by @kxepal) should appear only once or more is a discussion about "repeatable headers". typespec itself has nothing to do with it; it can simply be repeated if needed.
But why you say '#' (and what do you mean by '#'?) should have nothing to do with containers is beyond me.

you'll frequently encounter any of null, "NaN", "Inf", "N/A", "#NA", "@na", "DIV/0

If we send numerics, then IEEE 754 floats, as used in UBJ, can have values of NaN, +/- Inf, etc.
If someone sends floats, integers, and text in one place, then it is his problem.
Somebody above proposed the [_] marker, which could be used here.

As I said a few times before, optimization is MOSTLY the task of the ENCODER (the internal machine), not the user's. When UBJSON is, for example, used as a transparent transport encoding of JSON, the conversion shall be done automatically, with as many optimizations as the encoder can efficiently provide.
I am not interested in manually specifying: this container I am creating is optimal, I TELL YOU, ENCODER, TRUST ME.
I will add some array of data to the encoded message, and my parser will see that it can be optimized. And then it will automatically do this.
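A toy sketch of that "the encoder decides" policy (Python); the type mapping and the returned labels are illustrative assumptions, not spec behaviour:

    def pick_array_encoding(values):
        # Toy policy: the encoder, not the user, decides whether the
        # single-type optimized form applies (type mapping is illustrative).
        def ubj_type(v):
            if isinstance(v, bool):  return 'T' if v else 'F'
            if isinstance(v, int):   return 'U' if 0 <= v <= 255 else 'I'
            if isinstance(v, float): return 'd'
            if isinstance(v, str):   return 'S'
            return 'Z' if v is None else '?'
        types = {ubj_type(v) for v in values}
        if len(types) == 1:
            return 'optimized', types.pop()    # e.g. emit [#][count][type] + raw values
        return 'plain', None                   # fall back to the plain [ ... ] form

    assert pick_array_encoding([1, 2, 3]) == ('optimized', 'U')
    assert pick_array_encoding([1, 'x'])  == ('plain', None)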

When dealing with the largish datasets I've worked with (typically anywhere from 100MB to 100GB binary compressed), long runs of 0 or null or a repeating sequence of the same values over and over again is commonplace...

Could you provide some such datasets, if of course they do not contain secret information? It would be nice to have something practical to discuss.

@Miosss

Miosss commented Mar 13, 2015

@kxepal
I asked about it some time ago, but you may have overlooked my question.

Multidimensional Arrays

In this example, you wrote a typespec like this:

[#][Z][[UUUU][CC]]

(I suppose that in block-notation you meant something more like:

[#][Z] <- so, some number of:
 [[] <- arrays
  [[] <- of arrays
   [U][U][U][U] <- of 4 integers
  []]
  [[]  <- followed by array
   [C][C] <- of two characters
  []]
 []]

so it meant, as I understand it, that the objects in this container are in fact arrays of two fixed arrays: one of 4 U, the second of two C.

OK, but what if we have arrays of arrays, for example arrays of GPS coordinates? This is fairly simple, because we have only two floats in each, so [D][D] would be enough.

But what if we had, for example, arrays of arrays of 1024 integers?
Writing [U][U][U] ... 1024 times ... [U] is quite booooring. Do you have some solution to that? Or do we leave this as a very uncommon case?

@MikeFair

@Miosss

  1. I totally agree that this version of typespec is great for the basic value types! The parts proposed for arrays and objects, though, are where I disagree. I see more compact and faster-to-decode alternatives that are simpler to generate and comprehend.

What alternatives for arrays and objects of mixed types, deep structures, subarrays, subobjects, etc. do you see?

I just posted #70 to begin describing them.

  1. '#' should not be applicable to objects and arrays at all; '#' should preserve the "Strongly Typed Section" semantics.

But why you say '#' (and what do you mean by '#'?) should have nothing to do with containers is beyond me.

Currently

  • [#][COUNT][[]
  • [#][COUNT][{]
  • [#][COUNT][multiple type specifiers]

have special meanings to describe a more complicated repeating structure of some kind. My comment was about reserving [#] for only single typed values, as in #66.

Defining [#] to be an optional count modifier to a TLV is an elegant tweak to the spec.

While I do find the additional uses of the array, object, and typespec idea are great; keeping [#] as strictly a count modifier to a TLV is a cleaner use for the [#] symbol.

If there must be multitype repeating specifier then perhaps [$] would be a better symbol for that rather than overloading the [#] symbol.

If we send numerics, then IEEE 754 floats, as used in UBJ, can have values of NaN, +/- Inf, etc. If someone sends floats, integers, and text in one place, then it is his problem.
Somebody above proposed the [_] marker, which could be used here.

I disagree that mixed-type arrays aren't worth optimizing, especially since the repeatable headers technique accounts for them so nicely: it maximizes the number of values that can be compacted together in a single run, interrupts the run when it needs to, and then goes back to packing many same-typed values together again (and uses the non-bracketed version, without [], to mean an exclusively singly typed array).

If an optimized container of 4095 Double values and a single string must fall back to explicitly providing the type for all 4096 values, that strikes me as a special-case optimization, not a general solution.

It's great when it works out that every single value in every single field has exactly the type it's supposed to have, but when you start dealing with large amounts of data, it's simply not always there, or sometimes it's been miskeyed, or sometimes the value is outside the range it's supposed to be, or sometimes it's of the wrong type (like '+/- Inf'/'NaN' instead of a float).

Oftentimes you're not the one responsible for creating the data, so fixing it "upstream" can be challenging. For example, with any of the JSON datasets available online, since you're just downloading them you have little control over their quality. So if minor data discrepancies result in a complete breakdown of a more optimized encoding, that doesn't seem to me like the right solution.

Could you provide some of such datasets, if of course those do not contain secret information? It would be nice to have something practical to discuss.

I'll see what I can dig up, but I think http://data.gov/ which is the US gov't public data site is a really great place to start; it looks like they've got tons of real world JSON to offer. Where should it get posted? Just attach it to a new issue? upload/email it to someone?
