
Universal Integer Encoding for use in UBJ #76

Open
MikeFair opened this issue Feb 1, 2016 · 2 comments

Comments

MikeFair commented Feb 1, 2016

In working through the kinks of the encoding proposal in #66, a nice issue came up: how to terminate a repeating list of values that each start with a length (like the field names in an object). It became clear that a "Not a Length" or "Null Length" value was needed, in addition to the values in the first byte that indicate the size of any value that follows.

@lightmare was instrumental in discussing that part; the rest of us had earlier hashed through the fact that, regardless of the decision, any change made to how length is encoded would be a breaking change.

I really liked the proposal overall and it could be part of creating a single unified Integer encoding for the UBJ types (in place of the multiple type codes used today). So I coded up a working reference in C# and posted it as a gist: https://gist.github.com/MikeFair/a3c6bd817ff716ce090f

To briefly restate the format: values less than 240, null, and the value 255 are encoded in a single byte. All other values are encoded using the first byte as a type descriptor, followed by the fewest bytes that will represent the value; the type descriptor also indicates whether the encoding is big- or little-endian.
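
As a rough sketch of that first-byte rule (my own illustration, not code from the gist; the descriptor byte values are taken from the TypeCodes enum quoted later in this thread, and only the big-endian descriptors are shown):

        // Hypothetical illustration of the single-byte / multibyte split described above.
        static byte[] EncodeExample(ulong? value)
        {
            if (value == null) return new byte[] { 0xFF };         // "Not a Length" / null
            if (value == 255)  return new byte[] { 0xFE };         // 255 gets its own single-byte code
            if (value < 240)   return new byte[] { (byte)value };  // 0..239: the byte is the value

            // Everything else: a descriptor byte, then the fewest bytes that hold the value.
            int width = value <= 0xFFFF ? 2 : value <= 0xFFFFFFFF ? 4 : 8;
            byte descriptor = width == 2 ? (byte)0xFD               // UWordBig
                            : width == 4 ? (byte)0xF9               // UDoubleWordBig
                            : (byte)0xF5;                           // UQuadWordBig

            var bytes = new byte[1 + width];
            bytes[0] = descriptor;
            for (int i = 0; i < width; i++)                         // big-endian payload
                bytes[1 + i] = (byte)(value.Value >> (8 * (width - 1 - i)));
            return bytes;
        }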

The code in the gist provides a UBJUInt class for sending/receiving Unsigned Integer values as a new encoding for UBJ. This is in addition to using the same encoding as a [LENGTH] wherever a [LENGTH] is required by the spec. The code should translate fairly easily into other languages.

Some highlights:

  • a nullable ulong "Value" property for the caller to read and update
  • "ReadFromBinaryStream" and "ReadFromByteArray" as a constructor or updater to the value
  • "GetBytes", "FillBytes", and "WriteToBinaryWriter" routines
  • construct or update directly from a provided ulong value
  • Supports endian types: Big, Little, MachineLocal, and MatchRemote

Setting the endian type to Big, Little, or MachineLocal will always encode the bytes in the described endian type. MatchRemote will use MachineLocal until it has been updated from a message, after which it will encode the value as Big or Little depending on the encoding read in. The intention is that the parser/encoder would only need to create one instance of the class and then reuse that instance via the routines provided to encode/decode values as needed.
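
For illustration, reusing one instance might look roughly like this (the method and property names follow the highlights above, but the exact signatures, the EndianType property name, and whether EndianTypes is nested in UBJUInt are my guesses, not copied from the gist):

        // Hypothetical usage sketch of the gist's UBJUInt class; signatures are assumed.
        static void Echo(System.IO.BinaryReader incoming, System.IO.BinaryWriter outgoing)
        {
            var num = new UBJUInt();                           // one instance, reused for the stream
            num.EndianType = UBJUInt.EndianTypes.MatchRemote;  // adopt whatever endianness the sender used

            num.ReadFromBinaryStream(incoming);                // decode the next encoded integer
            ulong? received = num.Value;                       // null means "Not a Length" / no value

            num.Value = (received ?? 0) + 1;                   // reuse the same instance to reply
            num.WriteToBinaryWriter(outgoing);
        }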

Please take a look and tell me what you think.

With some additional work, this class could be extended to handle Big and Little Endian variants of the "String of Bytes" type by receiving/providing a byte[] directly from/to the caller... A similar class could also be written for arbitrary decimal, float, and double values. All in all, this could become a single UBJ type code character to handle all of the Unsigned Integer Value types.

What prevented encoding Signed values in the same class was the single-byte case: there is simply no way, using only a single byte, to encode both the signed and unsigned variants of the same value.

I was thinking UBJ could use the preceding type code to distinguish Unsigned from Signed values:

  • "=" would mean Unsigned Values
  • "+" would mean Signed Values

The type code would determine how the bits are interpreted.

Instead of trying to coerce signedness for a single byte into the format, using the UBJ type code would provide the foreknowledge required to set the class instance to read or encode the data as either signed or unsigned. Nothing about the encoded bits on the wire would need to change; the class would provide the features and settings to allow sending/retrieving the value as either a Signed or Unsigned value.
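
A minimal sketch of that idea (my own illustration; the payload width would already be known from the descriptor byte): the wire bytes are identical, and only the marker changes how they are read back.

        // Hypothetical sketch: "=" returns the raw bits as unsigned, "+" reinterprets the
        // same bits as two's complement of the payload's width.
        static object Interpret(char typeCode, ulong rawBits, int payloadBytes)
        {
            if (typeCode == '=')
                return rawBits;                                // unsigned: the bits are the value

            int shift = 64 - (payloadBytes * 8);               // '+': sign-extend to 64 bits
            return (long)(rawBits << shift) >> shift;
        }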

Thanks

@tsieprawski

This surely would introduce lots of incompatibilities with existing software.

  • The obvious one is the overhaul of the type markers for the (u)int* types.
  • Parsing of integers gets a bit more complicated (but not to a level where a typical developer would hang himself). Instead of explicitly getting 1/2/4/8 bytes after the integer type marker, we get 1 byte, examine its value, optionally get 1 more, examine the whole value, optionally get more, etc.
  • I can imagine that the possible memory pre-allocation optimization for integer-typed containers may vanish. Right now, given the type and the number of contained integers, we know exactly how much memory to allocate. Now we would need to examine each value one by one (because any int's length can vary), which is practically the same as non-optimized containers today.
  • New type markers for sign conflict with our current definition of a type marker: exactly 1 byte. [1]

https://github.com/ubjson/universal-binary-json/blob/master/spec12/index.html#L101


MikeFair commented Nov 9, 2016

I took another look at the gist; the clearest part of the proposal is buried in the middle of the code (my bad):

        // The valid UBJUInt transmission code values
        public enum TypeCodes
        {
              UByte = 0                                         // 0x00-0xEF: the byte is the literal value
            , Null = 255                                        // 0xFF: "Not a Length" / no value
            , N255 = 254                                        // 0xFE: the literal value 255
            , UWordBig = 253, UWordLittle = 252                 // 0xFD/0xFC: 2-byte value follows
            , UDoubleWordBig = 249, UDoubleWordLittle = 248     // 0xF9/0xF8: 4-byte value follows
            , UQuadWordBig = 245, UQuadWordLittle = 244         // 0xF5/0xF4: 8-byte value follows
        };

        public enum EndianTypes
        {
            Big,            // always encode big-endian
            Little,         // always encode little-endian
            LocalMachine,   // encode in the local machine's native byte order
            MatchRemote     // encode to match whatever the remote side last sent
        }

The format can only actually encode Big or Little endian on the wire; the other two values inform the encoder how it should encode the values it is sending (with MatchRemote being the default in most cases, and LocalMachine intended for embedded systems).

This surely would introduce lots of incompatibilities with existing software.

Agreed; we discussed at length options that attempt to avoid breaking changes. None of the non-breaking proposals so far has produced any meaningful savings. They were different, but not meaningfully better.

This proposal could still be largely compatible because it doesn't break or change the behavior of the existing i, U, I, l, and L integer types; it supersedes them with two currently unused type codes, = and +.

The related proposal makes [LENGTH] into its own distinct UBJ metadata type; it produces a meaningful reduction in transmission size, is a breaking change, and is essentially this same encoding.

Parsing of integers gets a bit more complicated (but not to a level where a typical developer would hang himself). Instead of explicitly getting 1/2/4/8 bytes after the integer type marker, we get 1 byte, examine its value, optionally get 1 more, examine the whole value, optionally get more, etc.

This isn't a recursive lookup. After the first byte, you will know either exactly what the value is, or the endianness and exactly how many bytes the value that follows uses.

Values less than 240, the value 255, and the value "not a number" are encoded in a single byte (which covers most string lengths, where these numbers get used a lot). Everything else (up to 8 bytes) is 1 byte plus its actual size. (The only exceptions are the values 240-254; those 15 values take three bytes.)

And there is also the type byte to add in, making this layout 1 byte bigger for multibyte values.

The goals were:

  • free up the type letters currently assigned to numbers;
  • make a hex dump easier to follow and numbers easier to spot/read, using "=" and "+", symbols people in many languages equate with numbers;
  • make parsers/decoders easier to write by eliminating (or combining) the distinct parse routines for the different integer types (one function to rule them all); and
  • reduce the burden on small embedded hardware and increase encoding/decoding speed by adding endianness to the information provided (this improves speed because "same endian systems" can avoid translating their bytes into a different endianness, and small embedded systems can always encode the streams they send in their own native endian format).

I can imagine that the possible memory pre-allocation optimization for integer-typed containers may vanish. Right now, given the type and the number of contained integers, we know exactly how much memory to allocate. Now we would need to examine each value one by one (because any int's length can vary), which is practically the same as non-optimized containers today.

Not sure exactly what you are saying here, but I think you're seeing that the size of an integer loses predictability. On the contrary, this proposal, and the [LENGTH] proposal it came from, works hard to preserve the ability to preallocate exactly the required amount of memory, and to minimize waste, before consuming any data from the buffer. The proposal guarantees certainty about exactly how much memory is required, and what endianness the incoming data will be in, before consuming more than a single byte.
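
To illustrate (a sketch of my own, using the descriptor values from the gist's TypeCodes enum): the first byte alone tells the decoder how many payload bytes to allocate and read.

        // Hypothetical helper: given only the first byte, how many payload bytes follow?
        // 0 means that byte was itself the whole value (or 0xFE = 255, or 0xFF = NaN).
        static int PayloadBytes(byte descriptor)
        {
            switch (descriptor)
            {
                case 0xFD: case 0xFC: return 2;   // UWordBig / UWordLittle
                case 0xF9: case 0xF8: return 4;   // UDoubleWordBig / UDoubleWordLittle
                case 0xF5: case 0xF4: return 8;   // UQuadWordBig / UQuadWordLittle
                default:              return 0;   // literal 0..239, 0xFE (255), or 0xFF (NaN)
            }
        }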

When you see one of these "numbers" in the data stream, it will start with "+" (signed) or "=" (unsigned), followed by a byte: xFF means "NaN", xFE means "255", and any other value starting with xF means "a multibyte integer follows", with the specific value telling you how many bytes it is. Anything else not starting with xF is exactly its numeric value.
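
A minimal decode sketch along those lines (my own illustration, big-endian descriptors only; the "+"/"=" marker is assumed to have been consumed already):

        // Hypothetical unsigned decode for the layout described above.
        static ulong? DecodeUnsigned(System.IO.BinaryReader reader)
        {
            byte first = reader.ReadByte();
            if (first == 0xFF) return null;         // NaN / "Not a Length"
            if (first == 0xFE) return 255;          // 255 gets its own single-byte code
            if (first < 0xF0)  return first;        // 0..239: the byte is the value

            int width = first == 0xFD ? 2           // UWordBig
                      : first == 0xF9 ? 4           // UDoubleWordBig
                      : first == 0xF5 ? 8           // UQuadWordBig
                      : throw new System.IO.InvalidDataException("unhandled descriptor");

            ulong value = 0;
            for (int i = 0; i < width; i++)
                value = (value << 8) | reader.ReadByte();   // accumulate big-endian
            return value;
        }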

New type markers for sign conflict with our current definition of a type marker: exactly 1 byte. [1]

These types are still exactly 1 byte: "=" and "+" express two types of numbers. The first, "=", is an unsigned numeric value in traditional binary format, and the second, "+", is a signed numeric value in traditional two's-complement binary format.

In a TLV encoding, "=" and "+" are [T].

The next byte consumed is an [L], using [LENGTH] as its own encoding. If [L] is xFF, xFE, or less than 240, it means "the length was 1 byte, and the byte already consumed also gives [V]"; if [L] is between xF0 and xFD, [L] gives the length as 2, 4, or 8 bytes and the endianness of [V].

This is exactly where and how an [L] gets read in (just after [T]) so the decoder would do what it typically does for types with a [LENGTH] and parse the [LENGTH] value.

This proposal is recommending that for the types "=" and "+", the value of [LENGTH] is also the binary representation of the encoded number. So after reading in [L] there will be no more data to fetch.
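
To make that concrete, here are a few worked examples of what the bytes would look like under this proposal (my own examples, using the big-endian descriptor values from the gist):

        [=] [2A]              // unsigned 42: one byte, the byte is the value
        [=] [FE]              // unsigned 255: the special-cased single byte
        [=] [FF]              // "Not a Length" / no value
        [=] [FD] [03] [E8]    // unsigned 1000: UWordBig descriptor + 2 payload bytes
        [+] [FD] [FC] [18]    // signed -1000: same layout, bits read as two's complement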

The thinking is that it simplifies numeric integer handling, brings in the new [LENGTH] encoding for use as [L], adds endianness to the information, frees up the current integer type codes for other uses, and shifts the type codes used for integers to characters internationally recognized for numerics.

I think it's also an aesthetic enhancement to the hex-dump version of the encoded UBJ, which is a stated goal for UBJ, but obviously that's also a matter of taste.

I admit it technically adds a byte for the multibyte encodings, and I am halfway to thinking that the reduction in parser/decoder complexity (one function to rule them all) could be part of what compensates for breaking historical compatibility.
