
rfc: deterministic proto bytes serialization #7427

Merged: 11 commits into master from wb/rfc-deterministic-proto-bytes on Dec 15, 2021

Conversation

@williambanfield (Contributor) commented Dec 10, 2021:

Major thanks to @creachadair for guidance regarding the proto spec and for quickly prototyping a serialized-proto canonicalizer.

@williambanfield (Contributor Author) commented Dec 10, 2021:

> We need to be careful about this, though: Discarding unknown fields would make it easier to get a consistent hash, but may also complicate migrations.
>
> The reason for keeping unknown fields is to allow tools that exchange messages to be upgraded incrementally. An intermediate tool (for example, a relayer) that drops unknown fields may have to be upgraded in lockstep with one or both sides.

This definitely makes sense. I think that defining the addition of fields to messages that require a canonical byte representation as a breaking change allows us to circumvent this, no?

For cases where we need to verify digital signatures for votes, we reproduce the byte string on each validator by pulling fields out of the block and vote structures, putting them into a canonical vote struct, and marshalling that to proto bytes. We then check that these recomputed bytes match the digital signature we received. All this to say that, since we reproduce the serialized structure locally in these cases, it doesn't really make sense to allow unknown fields, since we would not be able to reproduce them.

EDIT: perhaps a better way of saying the above: in cases where we verify the integrity of a message, we usually also need to verify its content, and what would it mean to verify the content of an unknown field?
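A rough illustration of the flow described above (a minimal sketch, not Tendermint's actual API: `canonical` stands in for the locally reconstructed canonical-vote message, and the deterministic-marshaling flag shown is Go-specific):

```go
package vote

import (
	"crypto/ed25519"

	"google.golang.org/protobuf/proto"
)

// verifySignBytes re-creates the canonical byte string locally and
// checks the received signature against it. canonical is the
// locally reconstructed canonical-vote message built from fields of
// the block and vote.
func verifySignBytes(pub ed25519.PublicKey, sig []byte, canonical proto.Message) (bool, error) {
	// Deterministic mode asks the Go encoder for stable output, but,
	// per this RFC, the proto spec does not guarantee that this output
	// is stable across library versions or languages.
	bz, err := proto.MarshalOptions{Deterministic: true}.Marshal(canonical)
	if err != nil {
		return false, err
	}
	return ed25519.Verify(pub, bz, sig), nil
}
```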


The protocol buffer [encoding spec][proto-spec-encoding] does not guarantee that the byte representation
of a protocol buffer message will be the same between two calls to an encoder.
While there is a mode to force the encoder to produce the same byte representation

@williambanfield (Contributor Author):

Yeah, it's written down in the spec; I know it exists in C++ as well.

https://github.com/protocolbuffers/protobuf/releases/tag/v3.0.0

This seems promising, although more research is needed to guarantee gogoproto always
produces a consistent set of bytes on serialized messages. This would solve the problem
within Tendermint as written in Go, but would require ensuring that there are similar
serializers written in other languages that produce the same output as gogoproto.
Contributor:

Hence I think test vectors are necessary in the spec, even if "it's just protobuf with some extra rules".

@williambanfield (Contributor Author):

I think a test harness defined in JSON or YAML that also includes the expected output of marshaling and bit-munging may be worth producing. This would allow projects in different languages to check their implementations against our own.
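For instance, each test case could pair an input encoding with the expected canonical output. A hypothetical vector shape (names and fields are illustrative, not from the RFC), expressed as the Go structure a JSON file would decode into:

```go
package vectors

// TestVector is one cross-language conformance case: an implementation
// in any language canonicalizes Input and compares against Canonical.
type TestVector struct {
	Name      string `json:"name"`      // human-readable case name
	Input     []byte `json:"input"`     // raw wire-format bytes (base64 in JSON)
	Canonical []byte `json:"canonical"` // expected canonicalized bytes
	Digest    string `json:"digest"`    // hex SHA-256 of Canonical, for quick checks
}
```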

This could be implemented as a function in many languages that performed the following steps:

1. Reordered all fields to be in tag-sorted order.
2. Reordered all `repeated` sub-fields to be in lexicographically sorted order.
Contributor:

Q: Is this already the case? I know that, for instance, ValidatorSets are already sorted, but is e.g. Evidence lexicographically sorted?

Contributor:

There's no guarantee that sorting the fields before encoding carries through to the encoding. In Go it does, and in practice it's a sensible assumption, but it could change at any time.

@williambanfield (Contributor Author):

I think it's worth noting here that ValidatorSet is actually merkelized in a somewhat custom way. (I have spent time staring at the different places we call hash functions on these 😆 )

The code for this is here. It gets used in the light client, so it matters for multiple languages, since light-client verification happens in several of them; but this is not necessarily as risky as might be feared.

We 'hash' the set by looping over the list of validators, which is in a deterministic order by voting power, and marshalling each one individually. We produce a merkle tree from this list of byte slices and return the root of the tree as the 'hash'.

A similar process is used for the evidence hash as well. Perhaps a longer version of this document or a full ADR will need to address all of the current use cases for the newly proposed hashing scheme, but I wanted to surface the problem and some suggestions first 🙂
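A simplified sketch of the hashing pattern just described (items already in a deterministic order, each marshaled separately; the real Tendermint tree uses RFC 6962-style leaf/inner domain prefixes, omitted here for brevity):

```go
package merkle

import "crypto/sha256"

// rootHash computes the Merkle root of a list of already-marshaled
// byte slices. Leaves are hashed individually; inner nodes hash the
// concatenation of their children's hashes.
func rootHash(items [][]byte) [sha256.Size]byte {
	switch len(items) {
	case 0:
		return sha256.Sum256(nil)
	case 1:
		return sha256.Sum256(items[0])
	default:
		k := splitPoint(len(items))
		l := rootHash(items[:k])
		r := rootHash(items[k:])
		return sha256.Sum256(append(l[:], r[:]...))
	}
}

// splitPoint returns the largest power of two strictly less than n.
func splitPoint(n int) int {
	k := 1
	for k*2 < n {
		k *= 2
	}
	return k
}
```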

Comment on lines 92 to 95
Finally, we should add clear documentation to the Tendermint codebase every place we
compare hashes of proto messages or use proto-serialized bytes to produce a
digital signature, noting that we have been careful to ensure that the hashes are
performed properly.
Contributor:

👍🏼


A prototype implementation of this by @creachadair can be found in [the wirepb repo][wire-pb].
This could be implemented in multiple languages more simply than ensuring that there are
canonical proto serializers that match in each language.
Contributor:

It remains a bit vague what wirepb actually is. Is it some kind of linter that enforces the above rules? Or does it take the raw bytes of any proto serializer and additionally canonicalize them (by applying the above rules)? From quickly skimming through that repo it seems to be the latter, and the introductory paragraph also speaks of "re-ordering the serialized bytes". I think it could still be clearer, because the above steps seem to operate on fields (which at first sounds like this is about pre-serialization).

Independent from the text: This is an interesting approach and I like it :)

Contributor:

> It remains a bit vague what wirepb actually is.

Right now it is just a quick proof of concept. I just added some more documentation to the README, which I hope helps clarify. In summary: it is not a linter; it actually permutes the content of the wire message in binary format. It operates only on the wire format and does not know anything about the language-specific type structure or the descriptor schema.

However, it does have some important limitations, which I've spelled out explicitly in the README file.

Another approach, which is potentially more robust, is to make the caller provide a protobuf Descriptor message for the expected type along with the byte string to be hashed. Given the descriptor, the limitations of the prototype could be eliminated. However, this would mean every tool that wanted to check the hash of a protobuf message would need to (a) know the expected protobuf type, (b) have access to the descriptor table, and (c) have access to a descriptor-aware implementation of hashing. Implementing it this way would probably also be very inefficient.

Protobuf wire format is simple enough that it's easy to re-implement this prototype algorithm in any language. Adding in descriptor support would make it a much bigger job. It could maybe be worthwhile, if the alternative is having to maintain a completely-parallel deterministic code generator, but there are a lot of unknowns.

In Go specifically, it might be possible to shim in a deterministic marshaling rule via the Message interface, but it's not at all clear how we could keep such an implementation synchronized with other languages (and I know at least Rust and JS potentially need to harmonize with the Go implementation—perhaps there are others).
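To make the above concrete, here is a minimal Go sketch of such a wire-format canonicalizer (not the actual wirepb code). It applies the tag-sorting and lexicographic-sorting rules discussed above using the google.golang.org/protobuf/encoding/protowire package, and it handles the string/message ambiguity by optimistically recursing into any length-delimited field whose payload parses as a message:

```go
package wirecanon

import (
	"bytes"
	"sort"

	"google.golang.org/protobuf/encoding/protowire"
)

// canonicalize permutes a wire-format message into tag-sorted order,
// ordering repeated occurrences of the same tag lexicographically by
// their encoding, and recursing into any length-delimited field whose
// payload happens to parse as a valid message (the ambiguity noted in
// the wirepb README). It reports false if msg is not itself a
// syntactically valid message.
func canonicalize(msg []byte) ([]byte, bool) {
	type field struct {
		num protowire.Number
		raw []byte // full tag+value encoding of one field occurrence
	}
	var fields []field
	for len(msg) > 0 {
		num, typ, n := protowire.ConsumeField(msg)
		if n < 0 {
			return nil, false // not a valid wire-format message
		}
		raw := msg[:n]
		if typ == protowire.BytesType {
			// Optimistically treat the payload as a nested message;
			// if it parses, re-encode it in canonical order.
			_, _, tagLen := protowire.ConsumeTag(raw)
			payload, _ := protowire.ConsumeBytes(raw[tagLen:])
			if sub, ok := canonicalize(payload); ok {
				re := protowire.AppendTag(nil, num, protowire.BytesType)
				raw = protowire.AppendBytes(re, sub)
			}
		}
		fields = append(fields, field{num, raw})
		msg = msg[n:]
	}
	// Sort by field number first, then lexicographically by encoding
	// to fix the order of repeated fields.
	sort.SliceStable(fields, func(i, j int) bool {
		if fields[i].num != fields[j].num {
			return fields[i].num < fields[j].num
		}
		return bytes.Compare(fields[i].raw, fields[j].raw) < 0
	})
	var out []byte
	for _, f := range fields {
		out = append(out, f.raw...)
	}
	return out, true
}
```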

@williambanfield (Contributor Author):

The more complex the implementation, the more difficult it will be to make it agree across languages in all cases. I would prefer that we stick with a simpler implementation.

Per the limitations in that document:

> The algorithm cannot distinguish between a field of opaque string type that contains the wire encoding of a message, and a field of message type.

This is worth keeping in mind if we ever allow opaque strings to represent messages in our types. The obvious example that comes to mind is vote-extensions. If we allow an extension to appear as bytes within our structure, we risk mangling the message if we re-encode it. I think being clear that this is strictly for hashing is a fine restriction to put on this algorithm. I.e., do not use this as any kind of intermediate representation ever. Just for comparison of hashes and signatures.

> The algorithm does not unify default-valued fields with unset (omitted) fields. In proto3, the only default values are equivalent to omitting the field from the message, and most encoders do omit them. However, if an encoder does include default-valued fields in its output, Algorithm 2 will generate a different string than if the field was omitted.

I'll do some minor research to determine which encoders produce which type of output. While we have no guarantee that this won't change in the future, we can at least guide users toward 'safe' encoders in their language and caution users in new languages about what behavior their encoder must adhere to.
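The wire-level consequence is easy to see; for example, for field 1 as a varint with value 0:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	explicit := []byte{0x08, 0x00} // tag (field 1, varint) + value 0 encoded explicitly
	omitted := []byte{}            // proto3 encoders normally omit default values entirely
	fmt.Println(sha256.Sum256(explicit) == sha256.Sum256(omitted)) // false: different hashes
}
```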

> The algorithm does not handle "packed" repeated fields of scalar type. Packed repeated fields are encoded as a wire-type string of concatenated varint values. The values within the string do not have the structure of a message, so the algorithm does not "fix" the order of the packed values.

It appears from my research that when we use a message containing a list, we implement an additional hash method on top of it that loops over the list in a deterministic order, marshals the list elements into byte slices, and then merkelizes the resulting slices. So we are generally in the clear on this point so far and can continue to use the same trick in the future.
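For reference, a packed repeated field is a single length-delimited value whose payload is just concatenated varints, with no per-element framing to reorder. Illustrative bytes for a hypothetical `repeated int32 xs = 1` holding the values 3, 1, 2:

```go
package wirecanon

// One field, wire type 2 (length-delimited); the three varints inside
// the payload carry no tags, so a wire-level canonicalizer has nothing
// to sort them by.
var packed = []byte{
	0x0a,             // tag: field 1, wire type 2
	0x03,             // payload length: 3 bytes
	0x03, 0x01, 0x02, // varints 3, 1, 2, in declaration order
}
```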

Contributor:

> The algorithm cannot distinguish between a field of opaque string type that contains the wire encoding of a message, and a field of message type.
>
> This is worth keeping in mind if we ever allow opaque strings to represent messages in our types.

We need to account for it even if we don't do that: The workaround is to assume all fields with the string wire-type could be messages. So an opaque string that happens to be a syntactically valid wire-format message would also be permuted, even if we never intended to treat it as a message.

We're still hashing all the same bytes in that case, but in a different order from the original. Simple cases (e.g., "\n\tabcdefghi") would not be changed, but more complex ones would (e.g., "\n\tbcdefghij\n\tabcdefghi" would become "\n\tabcdefghi\n\tbcdefghij"). As a rule, most strings won't look syntactically like messages, but we can't hang anything important on that.
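This example can be checked directly against the canonicalizer sketch shown earlier in this thread (hypothetical usage):

```go
in := []byte("\n\tbcdefghij\n\tabcdefghi")
out, ok := canonicalize(in)
fmt.Printf("%v %q\n", ok, out) // true "\n\tabcdefghi\n\tbcdefghij"
```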

> I'll do some minor research to determine which encoders produce which type of output.

It would also be possible to filter such fields out, but as far as I know the proto3 encoders should already be doing this.

> It appears from my research that when we use a message containing a list, we implement an additional hash method on top of it that loops over the list in a deterministic order

Fortunately, field packing only applies to scalar numeric types, so repeated messages, strings, etc., are not affected.

@williambanfield (Contributor Author):

> We need to account for it even if we don't do that: The workaround is to assume all fields with the string wire-type could be messages.

To clarify: I meant more that we need to take it into account so as never to accidentally deserialize from the bytes being used for hashing. But yes, the canonicalizer always needs to treat strings as though it's acting on a sub-message, since it can't disambiguate.

@williambanfield (Contributor Author):

From my quick investigation, it looks like protobufjs does include the serialized bytes of default values:

https://github.com/williambanfield/proto-research/blob/master/js/test.js#L26
https://github.com/williambanfield/proto-research/blob/master/js/output

@cmwaters (Contributor) left a comment:

This is really well written. I'm curious whether this problem is also addressed by similar software, and how they go about ensuring a deterministic array of bytes for hashing and verification.

Don't forget to add the link to this RFC in the table of contents.


According to [Cosmos-SDK ADR-27][cosmos-sdk-adr-27], when message types obey a simple
set of rules, gogoproto produces a consistent byte representation of serialized messages.
This seems promising, although more research is needed to guarantee gogoproto always
Contributor:

Don't we also have the intention of removing gogoproto?

@williambanfield (Contributor Author):

I believe that there is interest in doing this in the SDK, although I'm not sure I fully understand the timeline. It's likely a reasonable thing to do seeing as gogoproto is currently in maintenance limbo. However, in terms of concrete plans to remove it from Tendermint completely, I'm not aware of any at the moment.

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
@tychoish (Contributor):

I think this outlines the parameters of the problem space really well, and I don't see any reason why we shouldn't commit this as soon as the details are polished.

I'm also not sure that we need to make any code changes imminently (it's definitely a loophole, though perhaps one of the specification), but the reality is that the implementation is likely stable for the short term.

I think I come down, weakly, on the side of making the hashing protocol a function of the specification (e.g., read x values as serialized under y rules, in z order) rather than of the implementation, but I can be convinced in other directions.

@williambanfield added and then removed the S:automerge (Automatically merge PR when requirements pass) label on Dec 15, 2021
mergify bot merged commit 20c547a into master on Dec 15, 2021
mergify bot deleted the wb/rfc-deterministic-proto-bytes branch on December 15, 2021 at 22:52
@williambanfield restored the wb/rfc-deterministic-proto-bytes branch on September 9, 2022 at 16:12