
Backporting BJData features to UBJSON - ND array, u/m/M/h markers and handling of NaN #109

Closed
fangq opened this issue May 13, 2020 · 23 comments


@fangq

fangq commented May 13, 2020

@Steve132, Re your comments at Steve132/ubj#8 and #108

I think the right step forward for the project would be to agree on an
array spec, agree on other changes, and move forward to a 1.0 spec, along
with reference implementations in as many languages as possible and a test
suite.

Again, I am happy to assist on backporting the new syntax to UBJSON so we do not have to maintain two separate specs.

The main additions made in BJData Draft 1 include:

  1. an optimized ND array container (via # followed by a 1D integer array object)

https://github.com/OpenJData/bjdata/blob/Draft_1/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type

  2. new dedicated markers for unsigned integers (u: uint16, m: uint32, M: uint64) and half-precision floating-point numbers (h)
    NeuroJSON/bjdata@95544dc

  3. instead of converting NaN/±infinity to null, BJData keeps their respective IEEE 754 binary forms (see the sketch below)
    NeuroJSON/bjdata@f3d338f#diff-d94a316d99f805938f1320db036a84a5L174

I will explain the rationales for each of my additions, and we can discuss/approve/disapprove in this thread.
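To illustrate item 3 concretely - a minimal sketch, not the reference codec; the helper names and the big-endian byte order (which follows UBJSON Draft 12) are assumptions:

```python
import math
import struct

# Hypothetical helpers. '>' = big-endian, as in UBJSON Draft 12;
# 'e'/'d' are IEEE 754 binary16/binary64 respectively.
def write_half(v):
    return b"h" + struct.pack(">e", v)

def write_double(v):
    return b"D" + struct.pack(">d", v)

# NaN and +/-infinity round-trip bit-exactly instead of degrading to
# the null marker ('Z'), as a Draft 12 writer would emit.
for v in (math.nan, math.inf, -math.inf):
    out = struct.unpack(">d", write_double(v)[1:])[0]
    assert math.isnan(out) if math.isnan(v) else out == v

print(write_half(1.5).hex())  # 'h' marker followed by 0x3e00
```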

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@fangq
Author

fangq commented May 13, 2020

For the ND array construct, I am aware that @Steve132 has a similar proposal using the construct below:

[[] [$] [type] [@] [1-byte Ndim] [Nx type] [Nx] [Ny type] [Ny] ... [serialized ND array in column-major order]

The ND array optimized container used in BJData was originally proposed in this mailing list post back in 2013. It is done via # + a UBJSON 1D integer array object, followed by the row-major serialized array elements, for example

[[] [$] [type] [#] [[] [$] [Nx type] [#] [Ndim type] [Ndim] [Nx Ny Nz ...]  [Nx*Ny*Nz*...*sizeof(type)]
  or
[[] [$] [type] [#] [[] [Nx type] [Nx] [Ny type] [Ny] [Nz type] [Nz] ... []] [Nx*Ny*Nz*...*sizeof(type)]

The main reasons that led me to prefer the #+[ construct are:

  1. it does not require adding handling for a new marker to an existing parser; instead, one can simply call the existing array-parsing function to read out the dimensions when #+[ is encountered (see the sketch after this list). In my opinion, this requires the least amount of work for parser updates.

  2. it does not limit the maximum number of dimensions to 255 the way the @ construct does.

  3. the row-major byte order aligns with the conventions of the majority of programming languages, such as C, C++, Python, and JavaScript typed arrays; therefore the data do not require a transpose when being saved or read.

  4. it feels intuitive, because the count of an ND array is not a single number but a 1D vector.

  5. this construct also aligns with the general template construct I am about to propose - allowing $+{ or #+{ to represent payload-less object templates, similar to what I proposed for MessagePack: Add support for typed N-D array to simplify large array storage msgpack/msgpack#267 (comment)

  6. last but not least, the #+[ construct has been used in JSONLab since 2013; JSONLab currently has about 45,000 downloads from the MATLAB File Exchange and about 1,000 clones from GitHub per week. Changing this syntax would render all previously generated files incompatible.
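A minimal sketch of point 1, assuming hypothetical helper names (read_int, parse_int_array) and big-endian integers per UBJSON Draft 12: when the byte after # is [, the same routine that parses any plain 1D integer array can be reused to read the dimension vector.

```python
import io
import struct

# Integer markers -> struct formats (big-endian, per UBJSON Draft 12;
# these helper names are illustrative, not from any reference parser).
INT_FMT = {b"U": ">B", b"i": ">b", b"I": ">h", b"l": ">i", b"L": ">q"}

def read_int(stream, marker):
    fmt = INT_FMT[marker]
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]

def parse_int_array(stream):
    # Plain-form 1D integer array: the leading '[' is already consumed,
    # so read [marker][value] pairs until the closing ']'.
    dims = []
    while True:
        m = stream.read(1)
        if m == b"]":
            return dims
        dims.append(read_int(stream, m))

def parse_sized_header(stream):
    # Called right after '#': an integer marker means a 1D count, while
    # '[' means an ND dimension vector, parsed by reusing the generic
    # integer-array routine above.
    m = stream.read(1)
    if m == b"[":
        return parse_int_array(stream)
    return read_int(stream, m)

# '#' + [2, 3] dimension vector, i.e. the bytes after [[][$][type][#]
print(parse_sized_header(io.BytesIO(b"[i\x02i\x03]")))  # -> [2, 3]
```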

@fangq
Author

fangq commented May 13, 2020

The way you use # is ambiguous with the count specifier.

I don't think so - in Draft 12, # is supposed to be followed by one of the integer markers U, i, I, l, or L, so it can in no way be confused with [.

It explicitly disallows multidimensional arrays of non-fixed or structured type (which is an unusual but reasonable use case that I think UBJ should support)

For extending to template objects, I have reserved $+{ and #+{ for this purpose, which will also capitalize on the existing object-parsing functions to handle the following structure without needing a specialized parsing routine. As with the ND array extension, I expect this to be much easier to implement from a parser's standpoint.

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@edgar-bonet

For the record, I implemented my own proposal for N-dimensional arrays that does not require extra syntax. It just adds the notion that, in a strongly-typed array, all the bytes that come before the first data field form the array's type specifier. Type specifiers can then be parsed recursively in a very natural way:

  • [[][$][d][#][i][2] = array of type float32 and length 2
  • [[][$] [[][$][d][#][i][2] [#][I][3200] = array of type (array of type float32 and length 2) and length 3200

This is efficient for the kind of data I deal with, which contains lots of very long and narrow matrices of float32, just like in this example. Binary JData would be about as efficient for me, at the cost of the extra complexity that comes with adding new syntax. The current UBJSON specification does not permit storing this efficiently.
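A minimal sketch of the recursive type-specifier parsing described above (helper names hypothetical; big-endian integers as in UBJSON Draft 12):

```python
import io
import struct

INT_FMT = {b"U": ">B", b"i": ">b", b"I": ">h", b"l": ">i", b"L": ">q"}
SCALAR_SIZE = {b"d": 4, b"D": 8, b"i": 1, b"I": 2, b"l": 4, b"L": 8, b"U": 1}

def read_int(stream, marker):
    fmt = INT_FMT[marker]
    return struct.unpack(fmt, stream.read(struct.calcsize(fmt)))[0]

def parse_type(stream):
    # A type specifier is either a scalar marker, or
    # '[' '$' <type-specifier> '#' <int>, i.e. "array of <type>, length n".
    m = stream.read(1)
    if m in SCALAR_SIZE:
        return (m.decode(), SCALAR_SIZE[m])   # (name, byte size)
    assert m == b"[" and stream.read(1) == b"$"
    inner = parse_type(stream)                # recurse on the element type
    assert stream.read(1) == b"#"
    n = read_int(stream, stream.read(1))
    return (("array", inner, n), n * inner[-1])

# [[][$] [[][$][d][#][i][2] [#][I][3200]: 3200 x (2 x float32)
spec = io.BytesIO(b"[$[$d#i\x02#I" + struct.pack(">h", 3200))
print(parse_type(spec))
```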

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@fangq
Author

fangq commented May 13, 2020

We don't want parsers to be encouraged to re-use the existing array parsing code, because if they do that's incorrect and leads to bugs.

I am not entirely sure what the main concern is against using recursive calls. In a way, we are already doing this in almost all existing parsers, because JSON/UBJSON array and object constructs can be embedded in one another. If you are worried about recursion depth and stack-overflow risks, then I think the same concerns should apply to regular array/object parsing, and are not specific to this size construct.

The only way to avoid stack overflow entirely is to change the whole parser structure, using an iterator of some sort to convert the recursion into loops.

Really what your spec says is that you need to either implement most of the array parser twice (one for the general version, and again for the constrained version) OR you need to add a 'constrain to 1d array' flag to the code.

If the goal is just to prevent abuse via deeply recursive input, adding a simple flag, like the one I just added to JSONLab, will likely eliminate the risk, and the required change is quite minimal and local:

fangq/jsonlab@86efe89
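A minimal sketch of such a depth guard (not the actual JSONLab change, which is in MATLAB; the limit, helper names, and supported markers here are illustrative):

```python
import io
import struct

MAX_DEPTH = 64  # hypothetical limit; the actual value is a policy choice

def parse_value(stream, depth=0):
    # A depth counter threaded through the recursion turns a potential
    # stack overflow into a clean, catchable error.
    if depth > MAX_DEPTH:
        raise ValueError("maximum nesting depth exceeded")
    m = stream.read(1)
    if m == b"i":                       # int8
        return struct.unpack(">b", stream.read(1))[0]
    if m == b"[":                       # plain (untyped) array form only
        items = []
        while True:
            if stream.read(1) == b"]":
                return items
            stream.seek(-1, 1)          # one-byte pushback
            items.append(parse_value(stream, depth + 1))
    raise ValueError("marker %r not handled in this sketch" % m)

print(parse_value(io.BytesIO(b"[i\x01[i\x02]]")))  # -> [1, [2]]
```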

@edgar-bonet

@Steve132 wrote:

We sort of went over this at the time

Indeed. I am commenting mostly for @fangq, who was not there at the time.

It seems like your proposal the way you are using it provides the same semantics as the other two proposals

Of course. This is all about providing a representation of the most common forms of multidimensional arrays that is more efficient without altering their semantics.

it's much more complicated to parse

I disagree. Recursion makes it easy. Recursion is what makes the JSON data model both simple and powerful: containers hold values, which can themselves be containers, which hold values, which...

you have to lookahead [...]

Yes, but never more than one byte:

  1. When the type parser sees [ or {, it reads the following byte in order to see whether it is $ or #, which would indicate an optimized representation of the container. If it is neither (a non-optimized container), it “unreads” that byte (with the equivalent of ungetc()) before calling the value parser to read the first contained element (sketched after this list).

  2. When reading a non-optimized container, the value parser has to read one byte past the current element in order to check whether it is the container's closing delimiter. If it is not, it unreads the byte before calling itself recursively to read the next element.
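A minimal sketch of case 1, emulating ungetc() with a one-byte relative seek (the function name is hypothetical):

```python
import io

def parse_container_start(stream):
    # After reading '[', look ahead exactly one byte: '$' or '#' signals
    # an optimized container; anything else is pushed back ("unread",
    # like C's ungetc()) before parsing the first element normally.
    la = stream.read(1)
    if la in (b"$", b"#"):
        return ("optimized", la)
    stream.seek(-1, 1)
    return ("plain", None)

buf = io.BytesIO(b"$d#i\x02")          # the bytes after [[] in [[][$][d][#][i][2]
print(parse_container_start(buf))      # -> ('optimized', b'$')
```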

But note that this has nothing to do with my proposal, which does not introduce any extra lookahead requirements.

For example [ [ ] [$] [ [ ][#][[][i][2] [#][3200]

I can't visually parse your example. [#][[][i][2] seems to imply that the count field is an array, which is legit in fangq's proposal but not in mine.

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@Steve132
Contributor

To be clear, I don't necessarily think the bug caused by people wanting to re-use the array code is so severe as to veto using the JData standard for the syntax. As long as we warn implementers in the spec not to do it, it's easy enough to avoid. But the JData standard does use more characters as a result as well... so it's worth considering.

@edgar-bonet

@Steve132 wrote:

[[] [$] [{] [#] [U] [3] [{][}][{][}][{][}] is a valid 1-D array of 3 empty objects

My understanding is that, under the current spec, it is not. Either the array is untyped and the values have to be spelled out in full:

[[] [#][U][3] [{][}][{][}][{][}]

or the array is typed and the type marker (a single byte) has to be removed from each value:

[[] [$][{] [#][U][3] [}][}][}]

What's easier to parse and understand [...]

I personally find “((float 3) 10) 4” easier, but that may be just me. I can understand that humans may find “floats, 3 dimensions: 4 rows 10 columns 3 channels” easier. :-)

@Steve132
Contributor

Steve132 commented May 13, 2020 via email

@fangq
Author

fangq commented May 13, 2020

@edgar-bonet, thanks for your proposal - what this tells us is that we DO NEED to include a native ND array header in the spec because it is such an important data type to support.

@Steve132 - Re your argument regarding whether it is appropriate to reuse an array parser

You can certainly do it in more than one way - as you suggested, one can write a dedicated 1D array parser, which just needs a few extra lines (involving a loop); one can do that for either the @-based header or the #[-based header. However, I don't see why it is a bad idea to reuse the generic array parser (as in three of my existing implementations) and then test the shape/values of what it returns. The second approach is easier to program and easier to extend in the future - for example, if we later extend this dimension vector to 2D, we could perhaps handle tensors or other types of data-partitioning schemes. There is no right or wrong here, just trade-offs: writing more code vs. less code with possible overhead, or progressively detecting a non-conformant data structure as you read vs. loading the data first and then validating it as a whole object.

I agree that my ubj patch does not "validate" the array size vector, but that can certainly be added.
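For instance, a minimal sketch of such post-hoc validation (the function name is hypothetical):

```python
def validate_dims(dims):
    # After the generic array parser returns the '#[' header, check that
    # it is a non-empty, flat vector of positive integers before trusting
    # it as an ND-array shape.
    if not isinstance(dims, list) or not dims:
        raise ValueError("dimension vector must be a non-empty array")
    if not all(isinstance(d, int) and d > 0 for d in dims):
        raise ValueError("dimensions must be positive integers: %r" % dims)
    return dims

print(validate_dims([2, 3, 4]))  # ok -> [2, 3, 4]
```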

@edgar-bonet, I think your proposal is one of many possible ways to add this, but the downside is the need to use the stack for each additional dimension, which comes with a bigger risk of stack overflow and slower performance.

@fangq
Author

fangq commented May 13, 2020

@Steve132, for some reason, I feel this discussion has steered toward what is the best practice for writing a parser, which is not really the focus of this thread - we just define a format and let parser writers come up with parsers that work with it. Whether a parser is 100% compliant, has a robust implementation, or follows best practices is really up to its writer. A good thing about JSON and UBJSON is that the format is so simple that writing a parser is not that difficult, which gives people space to write their own flavors with different trade-offs (complexity vs. utility).

I am looking forward to hearing about your proposal for a BJData-compatible array construct. I don't mind if you list both constructs as valid UBJSON formats, as long as these new features (which really come from the research needs of my community) are supported, and subsequently supported by the related parsers.

On a related note - I really like the Markdown version of the UBJSON spec that the py-ubjson team wrote, which I used as the starting point for writing the BJData spec. It takes advantage of the convenient formatting supported by GitHub and allows people to collaboratively develop a document. Compared with the HTML version of the UBJSON spec, I feel the Markdown version is much more efficient, easier to read, and easier to revise. I really wish UBJSON would take this new form to continue its development and future discussions (especially given that the ubjson.org website is not functional).

@Steve132
Contributor

we just define a format and let parser writers come up with parsers that work with it.

Part of a standard is making sure that it's easy to implement without bugs. It does matter.

I don't mind if you list both constructs as valid UBJSON formats, as long as these new features (which really come from the research needs of my community) are supported, and subsequently supported by the related parsers.

I probably won't include multiple constructs that do the same thing in UBJSON. But I do think a lot of the other issues you've brought up are important and worth looking into. I think a big open question is how much we can avoid fragmentation in this space: are you willing to standardize on UBJSON to avoid fragmentation if it ends up being modified to sufficiently meet your needs (e.g., eliminating the BJData spec)?

It takes advantage of the convenient formatting supported by GitHub and allows people to collaboratively develop a document. Compared with the HTML version of the UBJSON spec, I feel the Markdown version is much more efficient, easier to read, and easier to revise.

I agree 100%. Another big issue I'm currently working on is versioning. I think we need an "alpha" version (spec 12), a "beta" version (spec 13), and a "1.0" version (including whatever we end up deciding).

@fangq
Author

fangq commented May 29, 2020

@Steve132, any update on this?

@Steve132
Contributor

Steve132 commented May 30, 2020 via email

@fangq
Author

fangq commented Sep 16, 2021

@Steve132, @rkalla, and other UBJSON contributors and maintainers - I am glad to share an update:

I just received 5-year funding support from the NIH (National Institutes of Health) to develop a new-generation neuroimaging data exchange platform named NeuroJSON (http://neurojson.org, https://github.com/NeuroJSON). The goal of the project is to develop a series of specifications that map major neuroimaging data files, standardized by the BIDS project (https://bids.neuroimaging.io/), to a unified JSON/BJData-based exchange format, gaining human readability, easy parsing, reuse, and integration with NoSQL databases (such as MongoDB and CouchDB).

As I mentioned above, the BJData (Binary JData) spec was largely based upon UBJSON, with the extensions described in this ticket.

I see that the updated UBJSON website now has a placeholder for UBJSON 2.0 - it would be wonderful if my proposed extensions above could be reviewed and incorporated into the UBJSON spec. It would be mutually beneficial to join forces so that our user communities can benefit from the work of both initiatives.

By the way, we are recruiting a full-time programmer to drive the NeuroJSON project, develop new parsers/libraries/specifications, and incorporate them into prominent neuroimaging software such as FieldTrip, SPM, FreeSurfer, and BIDS. If any UBJSON developers/contributors are interested in joining this effort, please check out our job posting:

https://careersmanager.pageuppeople.com/879/ci/en-us/job/507821/scientific-programmer

@rkalla
Collaborator

rkalla commented Sep 17, 2021 via email

@to-miz

to-miz commented Dec 9, 2021

I think number 2 is a sensible extension to UBJSON, since missing unsigned types can introduce incompatibilities in some programming languages or platforms. Relatedly, I also don't see which binary representation signed integers use in the spec (I am assuming two's complement; this should definitely be added to the spec), which is technically not a given for all platforms, although true in practice.
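To make the two's-complement assumption concrete - Python's struct module encodes signed integers in two's complement, so a short check documents the byte patterns a spec statement would pin down:

```python
import struct

# Under two's complement, int8 -1 is the byte 0xFF and -128 is 0x80;
# an explicit spec statement to this effect would remove the ambiguity.
assert struct.pack(">b", -1) == b"\xff"
assert struct.pack(">b", -128) == b"\x80"
assert struct.unpack(">b", b"\xff")[0] == -1
```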

I don't think the extensions in 1 and 3 should be added to UBJSON; I think UBJSON should remain universal and as compatible with JSON itself as possible. One of the main advantages of UBJSON over other binary formats is that it can be a drop-in replacement for JSON, because they are so similar, and this should be kept, with the same limitations that JSON has. You gain smaller transfers and faster parsing, while keeping compatible representations of the data.

This is why I don't think an extension for N-D arrays makes sense: it is too specific, and thus not universal, as well as not really compatible with JSON. An N-D array should itself know how to serialize to UBJSON/JSON and how to deserialize and validate itself; this should not be part of the serialization format. It imposes a kind of schema on a schemaless format. It is also limiting and not useful in a general sense: small N-D arrays can be serialized in a linear fashion, but for big matrices, for instance, a better format would probably store and serialize them sparsely. N-D arrays don't help there.

Although UBJSON is a binary format, I think it is more important to keep compatibility with JSON, since JSON is more widespread. As such, in my opinion, UBJSON should not be extended to enable the serialization of specific binary data at all, since JSON itself is not suitable for that. UBJSON makes it more convenient to store binary data than JSON does, but that should not be the reason to use it; there are better binary formats for that.
