
Small numbers encoded with uint8 type instead of FixInt. #24

Closed
aarondl opened this issue Nov 14, 2013 · 13 comments

Comments


aarondl commented Nov 14, 2013

The below snippet produces perhaps sub-optimal msgpack.

Notice that the struct is created from small numbers (even though the field types are big); when serializing with MsgPack we should be able to encode these numbers without any standalone type descriptor by using the positive FixInt type from the spec:

positive fixint 0xxxxxxx    0x00 - 0x7f

If you look at the output of the snippet you'll notice several 0xcc bytes in there describing each of the integers as a uint8 when those bytes don't need to exist.

Removing the 0xcc bytes and decoding with this package produces the expected result, so it seems to be an encoding problem only.

package main

import (
    "bytes"
    "log"

    "github.com/ugorji/go/codec"
)

type bstruct struct {
    A, B, C uint64
}

func main() {
    b := bstruct{1, 2, 3}

    log.Println(b)

    var m codec.MsgpackHandle
    m.StructToArray = true
    m.WriteExt = true

    var buf bytes.Buffer
    encoder := codec.NewEncoder(&buf, &m)
    if err := encoder.Encode(b); err != nil {
        log.Fatal("Failed to encode:", err)
    }

    log.Printf("%#v", buf.Bytes())

}

Output:

2013/11/14 17:24:29 {1 2 3}
2013/11/14 17:24:29 []byte{0x93, 0xcc, 0x1, 0xcc, 0x2, 0xcc, 0x3}

ugorji commented Nov 14, 2013

Fixint is for small signed integers; your snippet uses unsigned fields, so the values are encoded as uint8. Change your snippet to a signed type (e.g. int8) and you will see it work.

I actually thought long and hard about this one, and it was easiest to stay true to spec.

During decoding, we try to fit the values, so you can decode an unsigned int in the stream even into a float. But during encoding, we are precise.


aarondl commented Nov 14, 2013

The spec has both a positive and a negative fixint. Doesn't this mean that uint and int can both safely be encoded into a fixint if the value is within range and absolutely no harm can come of it? We know that value is positive, so there can be no confusion as to what sign it is.


ugorji commented Nov 14, 2013

Let's say we encode an unsigned value of 5, and then try to decode it schemaless (i.e. decode into a nil interface{}). Do we return an int64 or a uint64?

var v uint64 = 64
err = NewEncoder(buf).Encode(v)
var v2 interface{}
err = NewDecoder(buf).Decode(&v2)
// v2 should now hold a uint64, not int64 or uint64 depending on the value v happened to have.

The encoders try to be true to the spec, allowing users to decode into equivalent value without knowing the type beforehand.

To do this, we had to resolve ambiguities, and the main one was with FixInt. By reading the spec precisely, I saw that it was in the spirit of the spec to be precise. The spec calls out positive and negative fixint, allowing me to interpret that as "fixints are signed values in the range [-32, 127]".

It also improves interop across languages that have separate signed vs unsigned values. I had to jump through hoops to test interop with python implementation, C implementation, etc.

I actually had it like you are requesting before. Looking through the source I struggled with the decision, but I knew it was the right one when doing schema-less decoding and not knowing what type of value to expect. It made sense to encode strictly.

I understand if you don't agree right away - it took me a while to decide that this was best.


aarondl commented Nov 14, 2013

From the spec:

Source Type | Output Format
------------|------------------------------------------------------
Integer     | int format family (positive fixint, negative fixint,
            | int 8/16/32/64 or uint 8/16/32/64)

"If an object can be represented in multiple possible output formats, serializers SHOULD use the format which represents the data in the smallest number of bytes."

By this definition the encoder is actually in violation of the spec, rather than adhering to it as you've said here.


ugorji commented Nov 14, 2013

I worked with them on the revised spec for a few months.

The spec is written the way it is, because some types are not in every language. For example, Java doesn't have unsigned types, etc. In a language with strict signed vs unsigned integers, correctness comes into play.

The message from the spec you quoted above really relates to "should I encode 8 as int8 or fixint?", not to "should I encode an unsigned as a positive fixint?".

Also, in terms of spec writing, SHOULD is not MUST. So encoding 8 as fixint, int8, ... int64 is allowed, but not recommended, since msgpack strives for packing.

In the issue you raised, it is about losing some specificity, so it's more about correctness. I explained why the correctness was important, and how the spec was interpreted such that the library is still strictly in line with it, packing to the full extent without losing correctness. The correctness in this case comes from resolving the ambiguity of fixint (is it a signed or an unsigned value, sometimes one and sometimes the other, or always signed?) by interpreting it as "a fixint is a small signed integer in the range [-32, 127], such that it is always decoded back as a signed integer".


aarondl commented Nov 14, 2013

As you said, msgpack strives for packing, and we're not doing it right here.

There's no need for this correctness because there's no ambiguity in the value, only in the type, and the type is irrelevant since every single type that can contain this value can consume the ranges of FixInt and Negative FixInt with absolutely no ambiguity.

On a technical level: this will work in any language with no ambiguity, because the neat thing about msgpack's FixInt and Negative FixInt is that they use value ranges that cannot be misinterpreted no matter what type you deserialize them into. Focusing on the positive version of FixInt: it has a min-max range of 0-127, which is safe inside both int8 and uint8. This means that even if the decoder selects int when you intended uint, it makes no difference at all, because the correct value will be delivered.
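That range argument can be sketched with a hand-written classifier for single msgpack type bytes (classify is hypothetical, not this library's API): the positive fixint range decodes to the same value whether the reader treats the byte as signed or unsigned.

```go
package main

import "fmt"

// classify interprets one msgpack type byte. The positive fixint range
// (0x00-0x7f) carries its value in the byte itself, and that value fits
// both int8 and uint8, so no reader can get the value wrong.
func classify(b byte) (kind string, value int64) {
	switch {
	case b <= 0x7f:
		return "positive fixint", int64(b)
	case b >= 0xe0:
		return "negative fixint", int64(int8(b)) // 0xe0-0xff => -32..-1
	case b == 0xcc:
		return "uint 8 marker", 0 // value is in the next byte
	default:
		return "other", 0
	}
}

func main() {
	for _, b := range []byte{0x05, 0x7f, 0xe0, 0xcc} {
		k, v := classify(b)
		fmt.Printf("0x%02x => %s (%d)\n", b, k, v)
	}
}
```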

On a userspace level: if you really care about the type (for what reason I can't imagine yet), you're going to have the schema, and if you have the schema it will decode correctly in either case, no matter how we've encoded it, since FixInt can go into any size integer of any sign with no problems (this is due to the ranges declared in the spec). And if you don't have the schema, there's still no way to get an incorrect value; and why would you care about the type if you don't even know what data you're going to get back anyway?

So I guess this is more of a question: what does this correctness actually give us? Besides an inflation of the size of small integer values, I can't see any benefits. You say it resolves some ambiguity, but there is none. And you've mentioned it improves interoperability with other languages, but that also seems unlikely, since Go is as tight a language as there is (even more so than C) and there's no way it can decode the value wrong.


ugorji commented Nov 14, 2013

The ambiguity is in the type, not the value. And in a strongly-typed language, the type matters, because people will type assert to a type they expect.

However, it gives us precision in schema-less decoding i.e. based on the stream content, we can expect a specific type returned.

i.e. I wanted the constraint that:

  • signed integers are encoded as fixint or intX in the stream
  • unsigned integers are encoded as uintX in the stream

And vice versa, when decoding schemaless:

  • intX or fixint is decoded into an int64
  • uintX is decoded into a uint64

You care about the type if you do.

For example, I expect to decode either a numeric id (uint64) or a string id (string). My code does not expect a signed value coming back; I only do type checks for uint64 or string. If an unsigned value were encoded as a fixint and decoded back as signed, my uint64 check would miss and my code would blow up, because there was ambiguity in the encoding-decoding contract.

    var v interface{}
    err = codec.NewDecoder(buf).Decode(&v)
    switch v.(type) {
    case uint64:
        // appengine uint64 id
    case string:
        // appengine string id
    }

Another example: you have a map of properties, map[string]interface{}. One of the properties is an ID, which should be a uint64. You should be able to reliably say:

     var m map[string]interface{}
     err = codec.NewDecoder(buf).Decode(&m)
     var id uint64 = m["ID"].(uint64)       // this type assertion reliably succeeds
     var title string = m["TITLE"].(string)
     // ...

In a strongly typed language, precision in the type is critical. Else users have to code not based on their expectations, but based on the idiosyncrasies of your library.

This actually bit me in my own application leveraging the msgpack lib. I had to do an unnecessary type-switch for this short range of integer values. And then I had to document it. And it wasn't spec required.

And I didn't like having to document:

  • When decoding schema-lessly, an unsigned value you encoded may be decoded as a signed value if it is less than 128, so check for that.

I prefer the expectation that:

  • If you encode a signed integer, it will come out decoded (schema-less) as an int64.
  • If you encode an unsigned integer, it will come out decoded (schema-less) as a uint64.
  • If you encode a float, it will come out decoded (schema-less) as a float64.
  • ...
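That contract can be sketched with a hand-rolled schemaless reader (decodeOne is hypothetical, not this library's API), returning int64 for the signed formats and uint64 for the unsigned ones:

```go
package main

import "fmt"

// decodeOne reads one msgpack integer schemalessly: signed formats
// (fixint, int 8) come back as int64, unsigned formats (uint 8) as
// uint64. Only the one-byte-payload widths are sketched here.
func decodeOne(b []byte) (interface{}, int) {
	switch {
	case b[0] <= 0x7f:
		return int64(b[0]), 1 // positive fixint => int64 under the strict reading
	case b[0] >= 0xe0:
		return int64(int8(b[0])), 1 // negative fixint => int64
	case b[0] == 0xcc:
		return uint64(b[1]), 2 // uint 8 => uint64
	case b[0] == 0xd0:
		return int64(int8(b[1])), 2 // int 8 => int64
	}
	panic("wider formats omitted in this sketch")
}

func main() {
	for _, stream := range [][]byte{{0x05}, {0xcc, 0xc8}, {0xd0, 0xfb}} {
		v, _ := decodeOne(stream)
		fmt.Printf("% x => %T(%d)\n", stream, v, v)
	}
}
```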

It's a subtle difference, but it's correct and to the spec.

This msgpack lib actually treated fixint as signed or unsigned, based exactly on your argument above, for about a year. I thought about this for a while, and saw all the points you made here back then. It took me a while to arrive here, and I reckon it will probably take you a while too, if you come around to seeing my point. And you may come to appreciate it (like I did) when you can count on the strictness (and not have to do extra type assertions, etc).

@ugorji ugorji closed this as completed Nov 14, 2013

aarondl commented Nov 14, 2013

To me the size of the packed data is much more important than an additional type assertion for the very odd case that someone is decoding without a schema.

What's more, typically if we're decoding our own encoded data we're going to have the schema for sure. It's when we're decoding others' data that we're unlikely to have it, which means it probably has not been encoded by this particular package. Which then also means that we've got that extra type assertion anyway, because other libs have taken a different stance on this (which is what started this issue in the first place: I have a lib that packs correctly instead of worrying about some imagined correctness).

That's to say: if you care about the data types, you're going to have the schema. And if you don't have the schema it's very likely this package is not involved in the encoding anyways and you have this problem.

What's worse: by making this decision in this package alone while other encoders behave differently, the behavior isn't something we get to learn from this package's documentation. We have to find out through trial and error that msgpack encoders behave this way.

You closed this as I was typing, so obviously I was unable to convince you. I think that in this scenario one additional type assertion is FAR less brutal an overhead than additional network traffic. As you know, the bottleneck in most programs is not CPU or memory but IO, and what we're doing here is increasing the sizes involved in IO.

I guess if I really want this I'll have to fork. But I appreciate the conversation we were able to have here. Despite disagreeing in the end I think it was productive.


ugorji commented Nov 14, 2013

Part of the reason why it is hard to convince me of this one, is because I had all your argument points before. They were all mine. That was the reason why I had it the way you are requesting before.

However, the cost of the change is mitigated.

  • this only applies to small unsigned integers (most people use signed integers). So the impact is contained/mitigated. If using unsigned numbers for ids, most of them will be > 127 anyway. If using regular numbers, then you will most likely be using signed integers. (I know - flags are a major use case for unsigned and are not helped here).
  • I don't know that other encoders behave that way. Many languages with msgpack codecs do not have native unsigned integers (e.g. python, java, javascript, etc).
  • Also, their documentation is non-existent, and everything has to be found by trial and error (I had to look through crazy msgpack C/C++ code and do tests to see how they encode so I can interop). Python for instance encodes negative numbers as msgpack signed intX and positive numbers as msgpack unsigned intX.
  • I'm not sure that users will typically have the schema. In your use-case, that may be true. But I have had many users ask me about decoding into a blank interface{}, or map[string]interface{}, or map[interface{}]interface{}, or []interface{}, etc. For many users, those may be more prevalent than decoding into a struct with no interfaces.

I know that this position is an opinion, and I could have gone either way. I went one way in 2012, and I changed to a different way in 2013. I'm still comfortable with that position, though I definitely understand and appreciate where you are coming from. Trust that I was there.

Also, this would count as a potentially breaking change - users that have come to depend on this would have assertion panics in their code.

I agree it was productive, and I appreciate your willingness to take the time and make your argument and position precise.

BTW - i enjoyed reading your blog/site.

P.S. Things tend to play around in my head for a while. If I can figure out a way to get comfortable with a change, I will update and let you know. But at this time, I don't see that happening.

ugorji added a commit that referenced this issue Nov 15, 2013
…ular array.

This requires us checking currentEncodedType in the stream first, before
deciding how to decode the value when decoding into a slice.

There's a slight performance hit, but correctness is more important.

Also, put in flag to "conditionally" enable encoding uint as positive fixint,
and update tests so they work across binc and msgpack and regardless of flag.
(This means ensuring that all unsigned int values in test are > 128).
(I probably will not enable this flag by default, but it's there ... for placeholding and reminder).

Fixes issue #26 .

Updates issue #24 .

ugorji commented Nov 15, 2013

FYI: I made the change for you under a flag. It's not permanent, as you know my position on it today.

But if you need it privately, feel free to change the const mpEncodeUintAsFixnum (line 67).


aarondl commented Nov 15, 2013

Thank you very much for your consideration. As well as your constantly swift action against all my issues. Much appreciated.


ugorji commented Nov 15, 2013

This issue has been bothering me since yesterday. I don't like to be uncomfortable with decisions.

I looked through the C implementation (the first one that the msgpack author wrote, which powers most other implementations e.g. python, etc).

What they do is:

If integer value >= 0, encode as positive fixnum or unsigned
else encode as negative fixnum or signed

As badly as this turns my stomach, I think I have to maintain that contract.

It also makes interop with those implementations easier.

So yes, I have come full circle. Thank you for dragging me along ;)

I'd update in a few and deal with the repercussions as they occur. I just hope there was a way to notify all users of this.

@ugorji ugorji reopened this Nov 15, 2013
@ugorji ugorji closed this as completed in f8f25d7 Nov 15, 2013

aarondl commented Nov 15, 2013

Nice, wasn't expecting this. This is the problem with Go currently: there's no versioning mechanism yet to let you flip a major version number. Hopefully that will be addressed one day. Until then, the best thing you can do is make sure it's well documented. I said before, with a lot of confidence, that anyone using this decoder is probably working with a schema; it's so much more work not to in Go, and because of the json/xml encoder implementations people are used to that style. So I think you can rest easy: this probably won't nip anyone. Thanks for this!
