
erlang_protobuffs compatibility option #63

Closed
lukebakken opened this issue Jul 14, 2016 · 26 comments

@lukebakken
Contributor

Hi Tomas,

I'd like to propose an option in gpb to enable compatibility with erlang_protobuffs. There are a few subtle differences between the libraries and having an option to make gpb compatible would be a great way for Basho to direct erlang_protobuffs (epb) users to your library.

Here are the areas I've found so far -

  • gpb uses encode_msg / decode_msg and epb uses encode / decode
  • epb allows using 0 and 1 to specify true and false
  • epb will encode a list of integers as a sequence of Unicode code points for bytes fields in a proto message
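A hypothetical call in each style, for illustration (the module, record, and field names below are made up; only the function names and accepted value shapes reflect the two libraries):

    %% epb style: 0/1 for bool and a charlist for bytes are accepted:
    BinEpb = my_pb:encode(#flags{enabled = 1, payload = "abc"}),
    %% gpb style (without a compatibility option): true/false and a binary:
    BinGpb = my_pb:encode_msg(#flags{enabled = true, payload = <<"abc">>}).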

Do you think this could be a useful option?

Thanks -
Luke

@tomas-abrahamsson
Owner

Might be interesting. What does epb decode booleans and bytes to? I'm thinking that if they are always decoded to true|false and binary(), then for the data types it is mostly a matter of allowing extra formats on encoding; maybe it wouldn't even need to be an option. Perhaps the situation is the same or similar for the functions: they could be provided in parallel, so that encode could just call encode_msg?
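For instance, a generated parallel function could simply delegate (a sketch of the idea, not existing gpb output):

    encode(Msg) -> encode_msg(Msg).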

@tomas-abrahamsson
Owner

By the way, are there any differences in file names? For example, if the module file ends in _pb.erl, does the header file end in _pb.hrl too? (I vaguely remember there being some difference in naming for the .erl vs .hrl files for epb, but I'm not sure about it.)

If there is a difference, then I think it is not very well handled by the mod_suffix options, thus making the case for a compatibility option stronger.

@tomas-abrahamsson
Owner

I'll have to read up a bit on the epb interface too, to weigh that in.

@lrascao
Contributor

lrascao commented Jul 15, 2016

From my understanding, erlang_protobuffs accepts both strings and binaries as proto strings. Won't this also pose an issue for gpb, since it accepts either strings or binaries?

@tomas-abrahamsson
Owner

@lrascao, for strings, on encoding gpb actually accepts strings, iolists and binaries alike, since it calls unicode:characters_to_binary to turn the input into a UTF-8 binary. On decoding, though, the strings_as_binaries option (-strbin) controls whether to return binaries or strings (lists). (Hmm... I now notice that the type specs do not reflect this... will fix.)

Similarly, for float and double, on encoding gpb accepts both floating point values and integers, mostly because the underlying bit syntax N:32/float-little happens to accept N as either a float or an integer, but on decoding a floating point value is returned.

(From that point of view, 0 and 1 as encoding alternatives for booleans, and lists as an encoding alternative for bytes, would fit an existing pattern.)
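As a quick illustration of that flexibility on the string side, the three input forms below all normalize to the same UTF-8 binary via the standard unicode module:

    <<"hello">> = unicode:characters_to_binary("hello"),
    <<"hello">> = unicode:characters_to_binary(["he", $l, <<"lo">>]),
    <<"hello">> = unicode:characters_to_binary(<<"hello">>).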

@lukebakken
Contributor Author

Answers to your questions -

What does epb decode booleans and bytes to?

Booleans are decoded to the atoms true and false; bytes fields are decoded to binary().

By the way, are there any differences in file names? For example, if the module file ends in _pb.erl, does the header file end in _pb.hrl too?

Yep! I used these gpb options when compiling .proto files in riak_pb.

Here is one compatibility issue. epb allows lists as data for a bytes field:

https://github.com/basho/erlang_protobuffs/blob/master/src/protobuffs.erl#L152-L153

I noticed this because Riak has one message that has a bytes field that sometimes receives list data. This field should have been declared as a string. I added a function to "fix up" the data:

https://github.com/basho/riak_pb/blob/features/lrb/use-gpb/src/riak_pb_codec.erl#L406-L412

I see that, in the spec, string and bytes have the same on-the-wire format. I should be able to change bytes to string in this message.

However, other messages use bytes fields and, without a lot of code review, I can't be 100% sure they always have binary() input. What do you think about adding the ability to use a list as input for bytes fields in gpb?
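For reference, a minimal sketch of such a fix-up (a hypothetical helper, not the actual riak_pb_codec code) that always hands a binary to the encoder, treating a list as Unicode code points the way epb does:

    to_bytes(Bin) when is_binary(Bin) -> Bin;
    to_bytes(List) when is_list(List) -> unicode:characters_to_binary(List).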

@tomas-abrahamsson
Owner

I see that, in the spec, string and bytes have the same on-the-wire format. I should be able to change bytes to string in this message.

In the protobuf format (on the wire), a string field must always be UTF-8, whereas a bytes field does not have to be. For instance, the octets 255,255,255 on the wire are valid as bytes, but not decodable as a string. That value is of course not a realistic error message, but if some side encodes an error message in Latin-1, then it is not necessarily decodable as UTF-8, and even if it were, the text itself might change. If the encoded characters are always UTF-8, or 7-bit ASCII, then it is fine to switch from bytes to string.
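This is easy to check in an Erlang shell with the standard unicode module:

    %% Not valid UTF-8, so this returns an {error, _, _} tuple:
    unicode:characters_to_list(<<255,255,255>>, utf8),
    %% plain 7-bit ASCII is valid UTF-8 and decodes to "abc":
    unicode:characters_to_list(<<"abc">>, utf8).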

@tomas-abrahamsson
Owner

For the record, all of the items in the first post are now included in 3.24.0, but keeping this open for a while in case we come to think of anything related.

@lukebakken
Contributor Author

Hi Tomas -

I found these functions that erlang_protobuffs creates:

What do you think about doing the same in gpb? Since these are literally the only two places in the Riak codebase where these functions are used, it's not as important as the other epb compatibility functions that were added.

@tomas-abrahamsson
Owner

What do these functions do?

@lukebakken
Contributor Author

Decoding

    riak_core_pb:decode_riakobject_pb(DecodedObject)

That is the same as this, using gpb:

    riak_core_pb:decode_msg(DecodedObject, riakobject_pb)

Encoding

    riak_core_pb:encode_riakobject_pb(#riakobject_pb{bucket=Bucket, key=Key, val=Value})

This is the following in gpb:

    riak_core_pb:encode_msg(#riakobject_pb{bucket=Bucket, key=Key, val=Value})

These are very much "syntactic sugar" functions that would exist in gpb to provide a "drop-in" migration path from epb.
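A sketch of what such sugar could look like in a gpb-generated module (hypothetical generated code that just wraps the existing entry points):

    encode_riakobject_pb(Msg) ->
        encode_msg(Msg).

    decode_riakobject_pb(Bin) ->
        decode_msg(Bin, riakobject_pb).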

@tomas-abrahamsson
Owner

Ah, I see, so for every <Msg>, there's also an encode_<Msg>/1 and a decode_<Msg>/1.

I fired up the epb just now, to actually take a look at the code it generates. For every encode_<Msg>/1, there are two function clauses: one if the argument is a list and one if the argument is a record. The second clause is the equivalent of encode_msg(<Msg>), so it seems straightforward to generate compatibility functions for (but I'd like to meditate a little more over this).

However, the first clause calls an internal helper called delimited_encode. It seems this expects a list, [<Msg>], and encodes this list into Size1+encode(MsgElem1)+Size2+encode(MsgElem2)+... (The + here meaning roughly iolist concatenation.) The only thing in the protobuf encoding spec I can think of for this is a field which is repeated and has the option packed. For such a field, the encoded result would look like below, where the tail matches the result of the delimited_encode:

    <field number and wire type> + \
        total size + \
            Size1 + encode(MsgElem1) + \
            Size2 + encode(MsgElem2) + ...
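
A minimal sketch of what that delimited encoding amounts to (assumed semantics, with a local varint helper; not epb's actual code):

    %% Prefix each encoded element with its varint-encoded byte size.
    delimited_encode(EncodeFun, Msgs) ->
        [begin
             Bin = iolist_to_binary(EncodeFun(Msg)),
             [varint(byte_size(Bin)), Bin]
         end || Msg <- Msgs].

    %% Base-128 varint, least significant 7-bit group first.
    varint(N) when N < 16#80 -> <<N>>;
    varint(N) -> <<1:1, (N band 16#7f):7, (varint(N bsr 7))/binary>>.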

I do not want to write compatibility functions for this first clause, mostly because I find it odd (more below) and also because I think it would complicate the gpb code generator quite a bit to expose this. The oddness I perceive is twofold: not complete, and not general. About not complete: it seems to offer the possibility of producing a result that is not a complete decodable unit, since the leading two parts are missing. About not general: if we see the Msg as the type, then the encoding can be generalized only to a few other types: string and bytes would also be encoded like this, but the rest (the majority) would not; there would be no Size1, Size2, etc.

Similarly, for decoding, there's a decode_delimited_<Msg>, doing the opposite, which I don't really want to support either.

@lukebakken
Contributor Author

lukebakken commented Aug 3, 2016

Tomas, I agree that the delimited encode function that takes a list argument is not worth the work (nor is decode_delimited), and to be honest, I didn't even know they existed!

@tomas-abrahamsson
Owner

There's an obvious collision for compatibility encode/decode functions per message, if the message happens to be named just msg: the function name would be encode_msg, but for gpb, that's also the major entry function for encoding. One might argue that the arity differs, but if one would like to add a second argument, Options, then it collides.

(Reasoning pre-emptively: why would one want to add an Options argument when there's no such thing in epb? Well, for a person without the epb background, it might seem strange that the encode_<Msg> API does not support options, and that it is only possible to specify options when using encode_msg.)

So the question this builds up to: do you think it would be much of a nuisance if the compatibility functions were named encode_msg_<Msg> rather than encode_<Msg>? The collision outlined above would be avoided, but on the other hand, one would need to edit code to switch from epb to gpb. (Since editing would be needed anyway, maybe that kills the idea of almost-compatibility functions; one could just as well edit the code to call encode_msg directly instead?)

I'd like the name collision to get a resolution, or else drop the idea. (Not being able to generate code if a message happens to be named msg is not ok.)

Another thought (brainstorming): reduce the area of collision by only generating compatibility functions if an option is specified. It could still collide, but I could accept those cases. Cons: the other compatibility functions should also be under this option, and the code-generating code gets a little messier.

@lukebakken
Contributor Author

An epb_compatibility option could just enable the per-message encode / decode case discussed here and the global encode / decode functions that were a part of #65. The boolean and iolist functionality I added could be seen as a general gpb enhancement that also just happens to make gpb more compatible with epb.

@tomas-abrahamsson
Owner

Agree. I have something locally for this now

@tomas-abrahamsson
Owner

Hi, 3.24.4 (just pushed) contains the option epb_compatibility (-epb on the command line), which activates generation of the functions encode/decode/encode_<MsgName>/decode_<MsgName>. None of these functions are generated when the epb_compatibility option is not specified, so this is an incompatibility with the previous behaviour.

I also added a generation-time check for a message named msg; if one exists, there's an error saying it would collide with the standard encode_msg/decode_msg functions (so it errors as early as possible).
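For anyone trying it out, enabling the option from Erlang could look like this (the .proto file name is made up):

    gpb_compile:file("riak_kv.proto", [epb_compatibility]).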

@tomas-abrahamsson
Owner

Revisiting... Now that we have an option (a good thing, I think), maybe this option should also imply:

    {module_name_suffix, "_pb"}

and

    {msg_name_to_lower, true}

to make life easier for anyone transitioning?
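In other words (a sketch; the file name is made up), specifying

    gpb_compile:file("riak.proto", [epb_compatibility]).

would then behave as if one had written:

    gpb_compile:file("riak.proto", [epb_compatibility,
                                    {module_name_suffix, "_pb"},
                                    {msg_name_to_lower, true}]).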

@tsloughter
Contributor

Hm, I'd like to have epb_compatibility, but it complains that I can't have that and maps together. I mostly just wanted the encode and decode function names, since I have a bunch of other code that already uses them, but I was also wanting maps support, which is why I started looking at gpb. Is there any way to have both?

@tomas-abrahamsson
Owner

For the decode function, things could work straight off with maps. But for the encode function, it would need an extra argument: the name of the message. This is because with maps, the message is just a map of its fields, nothing more, so something is needed to say which message the map is. With such an extra argument to encode, the code would have to be reshaped anyway, or so I reasoned.

(About maps vs records, I don't know if this is a hot potato or not, but here goes: one could have a special '%@type' field or such, to make maps more like records, so that the value also contains its type or purpose, but I think the right thing is to avoid such a type field. No other Erlang value contains its purpose: integers are integers regardless of whether they're used for timeouts or indices; references are references regardless of whether they are for monitoring a process or just created using make_ref(); and so on. Writing programs tends to go just fine without encoding the purpose of a value into it, so I figured this would be how one would use messages as maps too.)

However, the encode_<MsgName> functions could work with maps (but currently don't), although for your use case, I suppose going that road would also require you to edit the code(?)
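A sketch of the two shapes being discussed (the module name and the two-argument form are assumptions for illustration, not an existing API):

    %% With records, the record tag identifies the message:
    Bin1 = my_pb:encode(#riakobject_pb{bucket = <<"b">>,
                                       key    = <<"k">>,
                                       val    = <<"v">>}),
    %% With maps, the map carries only the fields, so the message name
    %% would have to be passed explicitly:
    Bin2 = my_pb:encode(#{bucket => <<"b">>,
                          key    => <<"k">>,
                          val    => <<"v">>},
                        riakobject_pb).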

@tsloughter
Contributor

My bad! I actually misunderstood what the maps option was for. I was thinking it was related to whether the proto map<_,_> type is decoded to an Erlang map, not whether the encode function takes a map or a record. I prefer a record since it is a strict representation anyway :). Thanks!

@tomas-abrahamsson
Owner

Ah, ok, no problems, easy enough to confuse the two

@tomas-abrahamsson
Owner

In just-pushed 3.25.0, I've updated the epb_compatibility option to also imply the two options discussed above:

    {module_name_suffix, "_pb"}
    {msg_name_to_lower, true}

With this, I'm closing this issue. Should any of you come to think of some more compatibility things, just either re-open this issue or open a new one.

@lukebakken
Contributor Author

@tomas-abrahamsson thanks! I am checking in after being on vacation for the past few weeks.

@lukebakken
Contributor Author

@tomas-abrahamsson epb_compatibility and the other changes you made are great. Thanks so much again.

@tomas-abrahamsson
Owner

@lukebakken hehe, you did much of it yourself! Thanks to you too!
