rfc: deterministic proto bytes serialization #7427
Conversation
This definitely makes sense. I think that defining the addition of fields to messages that require a canonical byte representation as a breaking change allows us to circumvent this, no?

For cases where we need to verify digital signatures for votes, we re-produce the byte string on each validator by pulling fields out of the block and vote structure, putting these into a canonical vote struct, and marshalling them to proto bytes. We then check that these recomputed bytes match the digital signature we received. All this to say that, since we reproduce the serialized structure locally in these cases, it doesn't really make sense to allow unknown fields, since we would not be able to reproduce them.

EDIT: perhaps a better way of saying the above: in cases where we verify the integrity of a message, we usually also need to verify the content, and what would it mean to verify the content of an unknown field?
> The protocol buffer [encoding spec][proto-spec-encoding] does not guarantee that the byte representation
> of a protocol buffer message will be the same between two calls to an encoder.
> While there is a mode to force the encoder to produce the same byte representation
Side note: I somehow always thought that this mode only existed in gogo/proto: https://github.com/gogo/protobuf/blob/226206f39bd7276e88ec684ea0028c18ec2c91ae/proto/lib.go#L378-L398
TIL that it also exists in default protobuf:
https://github.com/golang/protobuf/blob/ae97035608a719c7a1c1c41bed0ae0744bdb0c6f/proto/buffer.go#L61-L81
Yeah, it's written down in the spec; I know it exists in C++ as well.
https://github.com/protocolbuffers/protobuf/releases/tag/v3.0.0
> This seems promising, although more research is needed to guarantee gogoproto always
> produces a consistent set of bytes on serialized messages. This would solve the problem
> within Tendermint as written in Go, but would require ensuring that there are similar
> serializers written in other languages that produce the same output as gogoproto.
Hence I think test-vectors are necessary in the spec even if "it's just protobuf with some extra rules".
I think a test harness defined in JSON or YAML that also includes the expected output of marshaling and bit-munging may be worth producing. This would allow projects in different languages to check their implementations against our own.
> This could be implemented as a function in many languages that performed the following steps:
>
> 1. Reordered all fields to be in tag-sorted order.
> 2. Reordered all `repeated` sub-fields to be in lexicographically sorted order.
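A rough Go sketch of those two steps applied directly to wire-format bytes. This is a simplification of the idea, not wirepb itself: it handles only varint and length-delimited wire types and does not recurse into sub-messages.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"sort"
)

// field is one top-level field as it sits on the wire: its tag number and
// its complete encoded bytes (key + payload).
type field struct {
	tag uint64
	raw []byte
}

// canonicalize re-emits a wire-format message with fields in tag-sorted
// order and fields sharing a tag (repeated) in lexicographic order of
// their encoded bytes.
func canonicalize(msg []byte) ([]byte, error) {
	var fields []field
	for i := 0; i < len(msg); {
		key, n := binary.Uvarint(msg[i:])
		if n <= 0 {
			return nil, fmt.Errorf("bad field key at offset %d", i)
		}
		start := i
		i += n
		switch key & 7 { // low three bits of the key: wire type
		case 0: // varint payload
			_, m := binary.Uvarint(msg[i:])
			if m <= 0 {
				return nil, fmt.Errorf("bad varint at offset %d", i)
			}
			i += m
		case 2: // length-delimited payload
			l, m := binary.Uvarint(msg[i:])
			if m <= 0 {
				return nil, fmt.Errorf("bad length at offset %d", i)
			}
			i += m + int(l)
		default:
			return nil, fmt.Errorf("unhandled wire type %d", key&7)
		}
		if i > len(msg) {
			return nil, fmt.Errorf("field overruns buffer")
		}
		fields = append(fields, field{tag: key >> 3, raw: msg[start:i]})
	}
	sort.SliceStable(fields, func(a, b int) bool {
		if fields[a].tag != fields[b].tag {
			return fields[a].tag < fields[b].tag // step 1: tag-sorted order
		}
		return bytes.Compare(fields[a].raw, fields[b].raw) < 0 // step 2: lexicographic
	})
	var out []byte
	for _, f := range fields {
		out = append(out, f.raw...)
	}
	return out, nil
}

func main() {
	// Input: field 2 = "hi" first, then field 1 = 1.
	in := []byte{0x12, 0x02, 'h', 'i', 0x08, 0x01}
	out, err := canonicalize(in)
	if err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", out) // field 1 now precedes field 2
}
```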
Q: Is this already the case? I know that, for instance, ValidatorSets are sorted already, but is, e.g., Evidence lexicographically sorted?
There's no guarantee that sorting the fields before encoding affects the encoded output. In Go it does, and in practice it's a sensible assumption, but it could change at any time.
I think it's worth noting here that ValidatorSet is actually merkelized in a somewhat custom way. (I have spent time staring at the different places we call hash functions on these 😆)
The code for this is here. It gets used in the light client, so it's meaningful for multiple languages since light-client verification happens in multiple languages, but this is not necessarily as risky as could be feared.
We 'hash' the set by looping over the list of validators, which is in deterministic ordering of voting power, and marshalling each individually. We produce a merkle tree from this list of byte slices, and return the root of the tree as the 'hash'.
A similar process is done for the evidence hash as well. Perhaps a longer version of this document or full ADR may need to address all of the current use cases for the newly proposed hashing scheme but I wanted to surface the problem and some suggestions first 🙂
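The hash-by-looping pattern described above might be sketched as below. Note `merkleRoot` here is an illustrative balanced-split construction, not Tendermint's exact tree (the real one is RFC 6962-style, with domain separation between leaves and inner nodes and a power-of-two split).

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// merkleRoot hashes a list of byte slices into a single root. The caller is
// responsible for putting items in a deterministic order first (e.g. the
// validator set's voting-power ordering) and for marshalling each item.
func merkleRoot(items [][]byte) [32]byte {
	switch len(items) {
	case 0:
		return sha256.Sum256(nil)
	case 1:
		return sha256.Sum256(items[0])
	}
	mid := (len(items) + 1) / 2
	l, r := merkleRoot(items[:mid]), merkleRoot(items[mid:])
	return sha256.Sum256(append(l[:], r[:]...))
}

func main() {
	// Each "validator" marshalled individually, list already in canonical order.
	leaves := [][]byte{[]byte("val-a"), []byte("val-b"), []byte("val-c")}
	root := merkleRoot(leaves)
	fmt.Printf("root: %x\n", root)
}
```

Because each element is marshalled individually before hashing, the list's ordering problem is solved outside the proto encoder entirely, which is why this trick sidesteps the repeated-field determinism question.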
> Finally, we should add clear documentation to the Tendermint codebase every time we
> compare hashes of proto messages or use proto-serialized bytes to produce a
> digital signature, noting that we have been careful to ensure that the hashes are
> performed properly.
👍🏼
> A prototype implementation by @creachadair of this can be found in [the wirepb repo][wire-pb].
> This could be implemented in multiple languages more simply than ensuring that there are
> canonical proto serializers that match in each language.
It remains a bit vague what wirepb actually is. Is it some kind of linter that enforces the above rules? Or does it take the raw bytes of any proto serializer and additionally canonicalize them (by applying the above rules)? From quickly skimming through that repo it seems to be the latter, and the introductory paragraph also speaks of "re-ordering the serialized bytes". I think it could still be clearer, because the above steps seem to operate on fields (which at first sounds like this is about pre-serialization).
Independent from the text: This is an interesting approach and I like it :)
> It remains a bit vague what wirepb actually is.
Right now it is just a quick proof-of-concept. I just added some more documentation to the README, that I hope may help clarify. In summary: It is not just a linter, it actually permutes the content of the wire message in binary format. It operates only on the wire format, and does not know anything about the language-specific type structure or the descriptor schema.
However, it does have some important limitations, which I've spelled out explicitly in the README file.
Another approach, which is potentially more robust, is to make the caller provide a protobuf `Descriptor` message for the expected type along with the byte string to be hashed. Given the descriptor, the limitations of the prototype could be eliminated. However, this would mean every tool that wanted to check the hash of a protobuf message would need to (a) know the expected protobuf type, (b) have access to the descriptor table, and (c) have access to a descriptor-aware implementation of hashing. Implementing it this way would probably also be very inefficient.
Protobuf wire format is simple enough that it's easy to re-implement this prototype algorithm in any language. Adding in descriptor support would make it a much bigger job. It could maybe be worthwhile, if the alternative is having to maintain a completely-parallel deterministic code generator, but there are a lot of unknowns.
In Go specifically, it might be possible to shim in a deterministic marshaling rule via the Message interface, but it's not at all clear how we could keep such an implementation synchronized with other languages (and I know at least Rust and JS potentially need to harmonize with the Go implementation—perhaps there are others).
The more complex the implementation, the more difficult it will be to make it agree across languages in all cases. I would prefer that we stick with a simpler implementation.
Per the limitations in that document:
> The algorithm cannot distinguish between a field of opaque string type that contains the wire encoding of a message, and a field of message type.
This is worth keeping in mind if we ever allow opaque strings to represent messages in our types. The obvious example that comes to mind is vote-extensions. If we allow an extension to appear as bytes within our structure, we risk mangling the message if we re-encode it. I think being clear that this is strictly for hashing is a fine restriction to put on this algorithm. I.e., do not use this as any kind of intermediate representation ever. Just for comparison of hashes and signatures.
> The algorithm does not unify default-valued fields with unset (omitted) fields. In proto3, the only default values are equivalent to omitting the field from the message, and most encoders do omit them. However, if an encoder does include default-valued fields in its output, Algorithm 2 will generate a different string than if the field was omitted.
I'll do some minor research to determine which encoders produce which type of output. While we are not guaranteed that this will change in the future, we can at least guide users towards 'safe' encoders in their language and caution users in new languages about what behavior their encoder must adhere to.
> The algorithm does not handle "packed" repeated fields of scalar type. Packed repeated fields are encoded as a wire-type string of concatenated varint values. The values within the string do not have the structure of a message, so the algorithm does not "fix" the order of the packed values.
It appears from my research that when we use a message containing a list, we implement an additional `hash` method on top of it that loops over the list in a deterministic order, marshals the list elements into byte slices, and then merkelizes the resulting slices. So we are generally in the clear on this point so far, and can continue to use the same trick in the future.
> The algorithm cannot distinguish between a field of opaque string type that contains the wire encoding of a message, and a field of message type.

> This is worth keeping in mind if we ever allow opaque strings to represent messages in our types.
We need to account for it even if we don't do that: The workaround is to assume all fields with the string wire-type could be messages. So an opaque string that happens to be a syntactically valid wire-format message would also be permuted, even if we never intended to treat it as a message.
We're still hashing all the same bytes in that case—but in a different order from the original. Simple cases (e.g., "\n\tabcdefghi" is one example) would not be changed, but more complex ones would (e.g., "\n\tbcdefghij\n\tabcdefghi" would become "\n\tabcdefghi\n\tbcdefghij"). As a rule, most strings won't look syntactically like messages, but we can't hang anything important on that.
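The ambiguity can be checked mechanically. The sketch below (stdlib only, accepting just varint and length-delimited wire types) shows that the opaque string above parses cleanly as a wire-format message, while most ordinary strings do not.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parsesAsMessage reports whether b parses end-to-end as a sequence of
// wire-format fields. Sketch only: fixed32/fixed64 wire types are rejected.
func parsesAsMessage(b []byte) bool {
	for i := 0; i < len(b); {
		key, n := binary.Uvarint(b[i:])
		if n <= 0 || key>>3 == 0 {
			return false // bad varint, or illegal field number 0
		}
		i += n
		switch key & 7 {
		case 0: // varint payload
			_, m := binary.Uvarint(b[i:])
			if m <= 0 {
				return false
			}
			i += m
		case 2: // length-delimited payload
			l, m := binary.Uvarint(b[i:])
			if m <= 0 {
				return false
			}
			i += m + int(l)
			if i > len(b) {
				return false
			}
		default:
			return false
		}
	}
	return true
}

func main() {
	// The opaque string from the discussion: on the wire it reads as two
	// length-delimited field-1 entries, so a canonicalizer would permute it.
	fmt.Println(parsesAsMessage([]byte("\n\tbcdefghij\n\tabcdefghi"))) // true
	// Most ordinary strings fail to parse and would be left alone.
	fmt.Println(parsesAsMessage([]byte("hello, world"))) // false
}
```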
> I'll do some minor research to determine which encoders produce which type of output.
It would also be possible to filter such fields out, but as far as I know the proto3 encoders should already be doing this.
> It appears from my research that when we use a message containing a list, we implement an additional `hash` method on top of it that loops over the list in a deterministic order
Fortunately, field packing only applies to scalar numeric types, so repeated messages, strings, etc., are not affected.
> We need to account for it even if we don't do that: The workaround is to assume all fields with the string wire-type could be messages.
To clarify: I more meant that we need to take it into account so as to not accidentally ever deserialize out of the bytes being used for hashing. But yes, the canonicalizer needs to always treat strings as though it's acting on a sub-message since it can't disambiguate.
From my quick investigation, it looks like `protobufjs` does include the serialized bytes on default values:
https://github.com/williambanfield/proto-research/blob/master/js/test.js#L26
https://github.com/williambanfield/proto-research/blob/master/js/output
This is really well written. I'm curious as to whether this problem is also addressed by similar software and how they go about ensuring a deterministic array of bytes for hashes and verification.
Don't forget to add the link to this RFC in the table of contents.
> According to [Cosmos-SDK ADR-27][cosmos-sdk-adr-27], when message types obey a simple
> set of rules, gogoproto produces a consistent byte representation of serialized messages.
> This seems promising, although more research is needed to guarantee gogoproto always
Don't we also have the intention of removing gogoproto?
I believe that there is interest in doing this in the SDK, although I'm not sure I fully understand the timeline. It's likely a reasonable thing to do seeing as gogoproto is currently in maintenance limbo. However, in terms of concrete plans to remove it from Tendermint completely, I'm not aware of any at the moment.
Co-authored-by: Callum Waters <cmwaters19@gmail.com>
I think this outlines the parameters of the problem space really well, and I don't see any reason why we shouldn't commit this as soon as the details are polished. I'm also not sure that we need to make any code changes imminently (it's definitely a loophole, though perhaps one of the specification), but the reality is that the implementation is likely stable for the short term. I come down, weakly, on the side of making the hashing protocol a function of the specification (e.g. read x values as serialized under y rules, in z order) rather than the implementation, but I can be convinced in other directions.
Major thanks to @creachadair for guidance with regard to the proto spec and for quickly prototyping a serialized-proto canonicalizer.