Bounded container types - for efficient parsing #26

Closed
cdunn2001 opened this Issue May 25, 2013 · 19 comments

4 participants

@cdunn2001

The ARRAY and OBJECT types no longer have a length argument; all containers (like JSON) are considered to be unbounded. In the existing spec, this is known as Streaming support.

Please add bounded container types. I understand how these streaming types can be useful, but I want everyone to see that bounded types are important too. I want to be able to pre-allocate arrays. I should not need to reallocate memory when parsing UBJSON. A parser should be simple and efficient.

This is a very big deal!

@kxepal
Collaborator

Hi @cdunn2001 !
I understand your reasons, but how can you preallocate arrays/objects that may contain data of mixed types? Preallocation is only reasonable if an array holds data of a single type.

See issue #13 for more information about STC: it is a data type friendly to preallocation.

@thebuzzmedia

@cdunn2001 Just to save you the time of reading all of Issue #13 (though if you do have the time, I would appreciate your input on it):

To stay 1:1 compatible with JSON, as @kxepal pointed out, there is NO guarantee that ARRAYS contain elements of the same type so we changed the definition of ARRAY and OBJECT to exactly match that of JSON.

That said, we recognize that for performance reasons, pre-defining container sizes (and types) is a highly desired feature.

This is what the Strongly Typed Container (STC) discussion in #13 is about -- specifically, adding a new container type: an array with a type, and I believe we settled on also requiring a size (see the discussion on #16 ).

This container type would be highly optimized for reading and writing and solve not only these performance cases, but allow UBJSON to support Binary data in a format-agnostic way (would still remain compatible with JSON).
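The parsing win that @thebuzzmedia describes can be sketched in Python. The marker table and the 4-byte big-endian count below are illustrative assumptions, not the wording of any finalized draft:

```python
import io
import struct

# Hypothetical STC layout: [type marker][count][values...].
# The marker-to-format table is an assumption for this sketch.
ELEM = {b'I': ('>i', 4), b'L': ('>q', 8), b'D': ('>d', 8)}

def parse_stc(stream):
    """Parse a strongly typed container: one type marker, one count,
    then `count` fixed-size values with no per-element markers."""
    fmt, size = ELEM[stream.read(1)]
    (count,) = struct.unpack('>i', stream.read(4))
    # The payload length is known up front, so it can be read in one
    # operation and the result preallocated -- no reallocation loop.
    payload = stream.read(count * size)
    return [struct.unpack_from(fmt, payload, i * size)[0]
            for i in range(count)]

data = b'I' + struct.pack('>i', 3) + struct.pack('>3i', 10, 20, 30)
parse_stc(io.BytesIO(data))  # [10, 20, 30]
```

Because the element size is fixed, the reader never inspects a marker per element, which is exactly the read-path optimization being discussed.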

@thebuzzmedia

Addendum: The current design for STC is only for ARRAYS and only for primitive types (no arrays containing arrays or objects).

@cdunn2001

The UBJ philosophy (same as JSON): All datatypes are unknown.

STC and UBJ are orthogonal, and I doubt that you will ever reach consensus on STC.

My use-case for UBJ is that I know my datatypes, but I want to be able to debug encoded bytes without knowing the types. Protobuf-encoded bytes are meaningless without a .proto file, so tcpdump is useless.

Also, most of my data are long strings, not integers. Protobuf optimizes for integer byte-length over CPU cycles, which is good for network communication but not for IPC.

UBJ is basically what I want, except for array decoding efficiency. When I deserialize, I know the datatype, but I still cannot preallocate the arrays. When parsing, I need to know the size in advance. (The arrays are variable-length in order to support backward-compatible schema changes, something Protobuf does well.)

Besides my specific use-case, the main value of UBJ is trivially simple parsing. You've lost that by adding trailing delimiters. Streamable datatypes should be distinct. Call that UBJ+ or UBJS. Then you can include chunked strings, if you want.

@kxepal
Collaborator

I believe you're looking for something like a schema for UBJSON that describes the decoded data. For now, as with JSON, you can generate such a schema as a separate structure and use it as a helper during decoding.

@thebuzzmedia

@cdunn2001 I want to understand why STC is not a good solution for you. You mentioned you are primarily passing (long) strings rather than primitive numeric types; can you give me a more detailed example of your use-case, so we can see whether there is a gap in our planned definition of STC and whether we need to consider another construct to optimize for the case you present?

You've lost that by adding trailing delimiters. Streamable datatypes should be distinct. Call that UBJ+ or UBJS.

I made the same arguments initially (the original spec had sized containers), but what UBJSON had defined by default were containers with weak semantics that didn't exist in JSON. We changed that in Draft 9 by aligning ARRAY and OBJECT exactly with JSON, and then worked toward a more optimal ARRAY-based container that addresses those shortcomings.

Then you can include chunked strings, if you want.

Long discussion on Issue #10 about that -- maybe you meant some other implementation of a "chunked string"; if so, can you clarify?

Again, I want to understand your use case clearly to see if we need to reconsider how we are proposing STC.

This is why I have been sitting on STC so long; it will be the lynchpin feature to UBJSON and I want to make sure it is defined appropriately.

Thank you for your feedback Chris.

@cdunn2001

@buzz, I do not understand why passing the number of array elements is so terrible. You are forcing my parser to test for ] repeatedly, when all I should need is a simple for loop.

I do understand the value of streaming, but that can be an extension, using new datatypes. I also see the value of STC, but that should be an alternative to UBJ, not UBJ itself.

I don't understand the argument about semantics and JSON. An array which includes its length is still 1:1 with JSON; i.e. JSON <-> UBJ is still a reversible transformation.

I think you're making UBJ too complicated while losing sight of its main value: trivially simple parsing. I was very excited when I first learned about this project because it seemed that finally someone else had understood the value of simplicity. What you had before Draft 9 was quite elegant.

@kxepal, I can handle the schema layer myself. I have other plans for that. I am only interested in an encoding that can be debugged without the schema. That seems obviously useful to me, and yet nobody ever gets that. It is the primary failing of protobuf. Has no-one else used tcpdump, tcpflow, hexdump, or od to decode a binary message?

@cdunn2001

Maybe I can explain my grief with a code example. Here is parsing for Draft 9:

def ParseElement(input):
  b = input.read(1)
  if b == '[':
    return ParseArrayTail(input)
  elif b == ']':
    raise EndOfArray
  else:
    ...

def ParseArrayTail(input):
  a = []
  while True:
    try:
      a.append(ParseElement(input))
    except EndOfArray:
      return a

In other words, code which expects a new element must do something different when it encounters a final delimiter. But it is much simpler for Draft 8:

def ParseElement(input):
  b = input.read(1)
  if b == 'a':
    return ParseArrayTail(input)
  else:
    ...

def ParseArrayTail(input):
  length = ParseNumber(input)
  a = [None]*length
  for i in range(length):
    a[i] = ParseElement(input)
  return a

It's easy to see how to translate that to C++, perhaps using boost::variant, or a simple type hierarchy.

@kxepal
Collaborator

@cdunn2001
Draft-8 and Draft-9 array parsing in Python. Both look pretty simple. For C++, just use a linked-list type or something similar for unsized arrays.

@kxepal
Collaborator

@kxepal, I can handle the schema layer myself. I have other plans for that. I am only interested in an encoding that can be debugged without the schema. That seems obviously useful to me, and yet nobody ever gets that. It is the primary failing of protobuf. Has no-one else used tcpdump, tcpflow, hexdump, or od to decode a binary message?

Hm.. I use the pprint function from the simpleubjson library, with help from hexdump, when things go wrong. Actually, only decoding has to be debugged. Encoding is just the way to serialize language data structures into some format (UBJSON in our case); if the result can't be decoded back, or it violates the specification, the encoder is simply broken.

@kxepal
Collaborator

@buzz, I do not understand why passing the number of array elements is so terrible. You are forcing my parser to test for ] repeatedly, when all I should need is a simple for loop.

But you still have to check for invalid markers (N, E) -- how is this different from repeatedly testing for ] ? Malformed data is a real-world case, and your parser should be ready for it.

I do understand the value of streaming, but that can be an extension, using new datatypes. I also see the value of STC, but that should be an alternative to UBJ, not UBJ itself.

STC is just an optimization of the existing array type. It's not a new type in itself, but an optimal way to handle one special case: arrays whose elements are all of a single type. Since every element has the same type, there is no need to repeat the marker again and again, so STC becomes a more size-optimal array structure, nothing more.
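The size argument can be made concrete by counting marker bytes. The encodings below are simplified sketches with illustrative marker bytes ('l' for a 4-byte int, '$'/'#' for the STC header), not exact draft wire formats:

```python
import struct

def encode_plain(values):
    # Regular array: every element repeats its own type marker.
    body = b''.join(b'l' + struct.pack('>i', v) for v in values)
    return b'[' + body + b']'

def encode_stc(values):
    # STC sketch: one type marker and one count up front, then the
    # raw payload with no per-element markers.
    header = b'[' + b'$' + b'l' + b'#' + struct.pack('>i', len(values))
    return header + b''.join(struct.pack('>i', v) for v in values)

plain = encode_plain(range(1000))  # 5002 bytes
stc = encode_stc(range(1000))      # 4008 bytes
```

For 1000 ints, the STC form drops 999 repeated type markers plus the closing delimiter: roughly a 20% saving at this element width, and proportionally more for smaller elements.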

I don't understand the argument about semantics and JSON. An array which includes its length is still 1:1 with JSON; i.e. JSON <-> UBJ is still a reversible transformation.

The JSON <-> UBJSON transformation rule still works even with the STC type. We'd just take a step toward a kind of two-step transformation. In the first step you encode data to UBJSON as-is, getting a regular array. In the second you apply optimizations to the produced data, making it more size-optimal. Decoding the result back to JSON doesn't and shouldn't differ (in most cases; there are exceptions, like issue #12 ).

I think you're making UBJ too complicated while losing sight of its main value: trivially simple parsing. I was very excited when I first learned about this project because it seemed that finally someone else had understood the value of simplicity. What you had before Draft 9 was quite elegant.

I assure you - UBJSON is still a very simple format, and writing a parser could serve as a 1-2 hour homework assignment for students ( @thebuzzmedia, sorry for such a comparison), if you don't think about optimization tricks, edge cases, and protection against malformed input.

@breese

We could add an optional count marker (as suggested in #16.) This could be regarded as a processing hint, that will allow the parser to do pre-allocation.

This should be an optional feature, because in the general case a count simply moves the inefficiency from the parser to the generator (e.g. getting the count of a list is O(N).)
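The asymmetry @breese describes shows up naturally with a Python generator: the producer has no length without exhausting the source first. The names here are illustrative:

```python
def rows_from_source():
    # Stand-in for a streaming producer (a DB cursor, a socket, a
    # large file): items become available one at a time.
    for i in range(5):
        yield {"id": i}

# To write a *sized* container, the generator side must buffer every
# element first just to learn the count: O(N) time and memory move
# to the writer.
buffered = list(rows_from_source())
count = len(buffered)

# An *unsized* (streamed) container lets the writer emit each element
# as soon as it is produced; the reader pays for growth instead.
for row in rows_from_source():
    pass  # emit(row) immediately, no buffering
```

This is why a count works best as an optional hint: writers that already hold the data in memory can supply it cheaply, while streaming writers can omit it.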

@thebuzzmedia

@cdunn2001 Can you take a look at the comment @breese is linking to (#16 (comment)) -- he proposed a solution for optional sizes a few months ago, and I'd like to hear your thoughts.

@thebuzzmedia

@cdunn2001 -- I formalized my thoughts into a modification to the spec, I'd like to know what you think of #27

@breese This idea popped into my head after I re-read your proposal from #16 that you linked above. Let me know what you think (I took it 1 step further, but otherwise it is the core of your proposal).

@kxepal @AnyCPU it goes without saying that I'd like your thoughts on #27 ;)

@cdunn2001

Malformed data is a real world case and your parser should be ready for it.

I would throw an exception. The problem with a trailing delimiter is not the if-statement, but rather the confused program flow.

We could add an optional count marker (as suggested in #16.) This could be regarded as a processing hint, that will allow the parser to do pre-allocation.
This should be an optional feature, because in the general case a count simply moves the inefficiency from the parser to the generator (e.g. getting the count of a list is O(N).)

I'll try to explain my use-case.

I want two different parsers:

  1. A generic parser, for debugging.
  2. A generated parser/deserializer, for Remote Procedure Calls (RPC).

Protobuf provides (2), but it cannot provide (1). JSON provides (1), but it cannot be efficient for (2).

For RPC, the generated parser already knows the datatypes. What it doesn't know is the presence or absence of optional fields, and the number of repeated fields. (I am using Protobuf nomenclature because this is something Protobuf does well.)

The optional fields are critical for backward-compatible schema changes. Those arguments are passed as an unordered array of tuples -- effectively as a map, though an actual map is never needed. For this case, I would prefer to see the array length in order to simplify the generated parser, but it's not a matter of efficiency, as these elements are destined to become fields in structs.

The repeated fields are ordered, but their number of course cannot be known to the parser. STC could work here, but only if there is a clear distinction between typed and untyped arrays. Let's suppose we let streamed arrays be untyped, while fixed-length arrays are uniform-type. I'd be fine with:

[A][4][2][A][3][{]
...

That would be an array of 2 arrays of 3 generic arrays. The 4 is the number of bytes in the type-descriptor, which is the number of bytes that my generated parser would ignore as it already knows the datatypes. (It could parse those in a verification mode.)
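This sketched header could be parsed as below. The one-byte widths for the descriptor length and count, and the descriptor bytes in the demo, are illustrative assumptions -- the thread leaves the exact encoding open:

```python
import io

def parse_typed_array_header(stream):
    # Layout per the sketch above: [A][desc_len][count][descriptor].
    assert stream.read(1) == b'A'
    desc_len = stream.read(1)[0]
    count = stream.read(1)[0]
    # A generated parser that already knows the schema can simply skip
    # the descriptor bytes; a generic parser (or a verification mode)
    # would interpret them instead.
    descriptor = stream.read(desc_len)
    return count, descriptor

count, desc = parse_typed_array_header(io.BytesIO(b'A\x03\x02A\x03{'))
# count == 2, desc == b'A\x03{'
```

The point of carrying `desc_len` explicitly is that a schema-aware parser can seek past the descriptor in one step instead of parsing it.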

@kxepal
Collaborator

I want two different parsers:

A generic parser, for debugging.
A generated parser/deserializer, for Remote Procedure Calls (RPC).

Protobuf provides (2), but it cannot provide (1). JSON provides (1), but it cannot be efficient for (2).

I believe these problems are quite separate from the UBJSON data format itself. When you're building an RPC protocol, you're free to exchange data schemas first, during the handshake phase, for friendly debug output later. Moreover, in an RPC implementation the server and client may have data-object mappings which handle incoming data, parse it, and wrap it into specified objects. I see at least 3 completely different logic layers there, which may be built on top of the UBJSON format.

Keep in mind that Protobuf is mostly a schema language designed to solve RPC tasks, rather than just a data format, which is what UBJSON is.

@cdunn2001

@kxepal, You might have misunderstood my use-case. I'm not sure what problem you're trying to solve, or even what you're arguing about. I will not have 3 logic layers. I am not interested in a binary version of XML. I will have a single parser/deserializer, generated from my own trivially simple schema-specification. (.proto is hard to parse, which might be Google's way of locking us into Protobuf.)

All I need is a binary language that is:

  • easy to parse efficiently
  • easy to translate (i.e. generic, not schema-specific)
  • easy to debug via hexdump

UBJSON is tantalizingly close. Nothing else is. The things which make UBJSON wonderful are:

  • length always precedes data (i.e. no trailing delimiters)
  • integers are encoded in a normal way (i.e. not compressed as in Protobuf)
  • types are 8 bits (i.e. visible to hexdump)
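The third point is easy to demonstrate: with 8-bit markers and length-prefixed payloads, the structure stays visible in a plain hex dump. The 's' marker and one-byte length below are Draft 8-style simplifying assumptions:

```python
# Sketch of a Draft 8-style string: 's' marker, one-byte length,
# UTF-8 payload (marker and length width assumed for illustration).
def encode_string(s):
    data = s.encode('utf-8')
    return b's' + bytes([len(data)]) + data

def hexdump(b):
    # Minimal hexdump-style rendering: hex bytes plus ASCII column.
    hexpart = ' '.join(f'{x:02x}' for x in b)
    asciipart = ''.join(chr(x) if 32 <= x < 127 else '.' for x in b)
    return f'{hexpart}  |{asciipart}|'

print(hexdump(encode_string('hello')))
# 73 05 68 65 6c 6c 6f  |s.hello|
```

Both the type marker and the payload read directly off the dump, which is exactly what a Protobuf varint-packed message denies you.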

I really don't see the problem with the array type in Draft 8. New datatypes can be added for streaming. Another new datatype can be added for strong-typing.

I see the value in a 1:1 transformation between JSON and a subset of UBJSON. To me, the JSON arrays map to the old a/A arrays of Draft 8. The streaming arrays are an extension, with a different datatype, [. STC would be another datatype. Those extended datatypes would not map to JSON, but I see nothing wrong with that.

The most important goal is simplicity, "the singular tenet that made JSON so successful".

@thebuzzmedia

@cdunn2001 If you had a moment I'd like to hear your thoughts on #27 that I filed as a result of this discussion.
