
rfc: deterministic proto bytes serialization #7427

Merged: 11 commits into master from wb/rfc-deterministic-proto-bytes on Dec 15, 2021

Conversation

@williambanfield (Contributor) commented Dec 10, 2021:

Major thanks to @creachadair for guidance regarding the proto spec and for quickly prototyping a serialized-proto canonicalizer.

@williambanfield (Contributor Author) commented Dec 10, 2021:

> We need to be careful about this, though: Discarding unknown fields would make it easier to get a consistent hash, but may also complicate migrations.
>
> The reason for keeping unknown fields is to allow tools that exchange messages to be upgraded incrementally. An intermediate tool (for example, a relayer) that drops unknown fields may have to be upgraded in lockstep with one or both sides.

This definitely makes sense. I think that defining the addition of fields to messages that require a canonical byte representation as a breaking change allows us to circumvent this, no?

For cases where we need to verify digital signatures for votes, we reproduce the byte string on each validator by pulling fields out of the block and vote structures, putting them into a canonical vote struct, and marshalling that to proto bytes. We then check that these recomputed bytes match the digital signature we received. All this to say that, since we reproduce the serialized structure locally in these cases, it doesn't really make sense to allow unknown fields, since we would not be able to reproduce them.

EDIT: perhaps a better way of saying the above: in cases where we verify the integrity of a message, we usually also need to verify its content, and what would it mean to verify the content of an unknown field?
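A rough illustration of the flow described above (a minimal sketch, not Tendermint's actual API: `canonical` stands in for the locally reconstructed canonical-vote message, and the deterministic-marshaling flag shown is Go-specific):

```go
package vote

import (
	"crypto/ed25519"

	"google.golang.org/protobuf/proto"
)

// verifySignBytes re-creates the canonical byte string locally and
// checks the received signature against it. canonical is the
// locally reconstructed canonical-vote message built from fields of
// the block and vote.
func verifySignBytes(pub ed25519.PublicKey, sig []byte, canonical proto.Message) (bool, error) {
	// Deterministic mode asks the Go encoder for stable output, but,
	// per this RFC, the proto spec does not guarantee that this output
	// is stable across library versions or languages.
	bz, err := proto.MarshalOptions{Deterministic: true}.Marshal(canonical)
	if err != nil {
		return false, err
	}
	return ed25519.Verify(pub, bz, sig), nil
}
```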


The protocol buffer [encoding spec][proto-spec-encoding] does not guarantee that the byte representation
of a protocol buffer message will be the same between two calls to an encoder.
While there is a mode to force the encoder to produce the same byte representation

@williambanfield (Contributor Author):

Yeah, it's written down in the spec; I know it exists in C++ as well.

https://github.com/protocolbuffers/protobuf/releases/tag/v3.0.0

This seems promising, although more research is needed to guarantee gogoproto always
produces a consistent set of bytes on serialized messages. This would solve the problem
within Tendermint as written in Go, but would require ensuring that there are similar
serializers written in other languages that produce the same output as gogoproto.
Contributor:

Hence I think test vectors are necessary in the spec, even if "it's just protobuf with some extra rules".

@williambanfield (Contributor Author):

I think a test harness defined in JSON or YAML that also includes the expected output of marshaling and bit-munging may be worth producing. This would allow projects in different languages to check their implementations against our own.
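For instance, each test case could pair an input encoding with the expected canonical output. A hypothetical vector shape (names and fields are illustrative, not from the RFC), expressed as the Go structure a JSON file would decode into:

```go
package vectors

// TestVector is one cross-language conformance case: an implementation
// in any language canonicalizes Input and compares against Canonical.
type TestVector struct {
	Name      string `json:"name"`      // human-readable case name
	Input     []byte `json:"input"`     // raw wire-format bytes (base64 in JSON)
	Canonical []byte `json:"canonical"` // expected canonicalized bytes
	Digest    string `json:"digest"`    // hex SHA-256 of Canonical, for quick checks
}
```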

This could be implemented as a function in many languages that performed the following steps:

1. Reordered all fields to be in tag-sorted order.
2. Reordered all `repeated` sub-fields to be in lexicographically sorted order.
Contributor:

Q: Is this already the case? I know that, for instance, ValidatorSets are already sorted, but is e.g. Evidence lexicographically sorted?

Contributor:

There's no guarantee that sorting the fields before encoding carries through to the encoding. In Go it does, and in practice it's a sensible assumption, but it could change at any time.

@williambanfield (Contributor Author):

I think it's worth noting here that ValidatorSet is actually merkelized in a somewhat custom way. (I have spent time staring at the different places we call hash functions on these 😆 )

The code for this is here. It gets used in the light client, so it matters for multiple languages, since light-client verification happens in several of them; but this is not necessarily as risky as might be feared.

We 'hash' the set by looping over the list of validators, which is in a deterministic order by voting power, and marshalling each one individually. We produce a merkle tree from this list of byte slices and return the root of the tree as the 'hash'.

A similar process is used for the evidence hash as well. Perhaps a longer version of this document or a full ADR will need to address all of the current use cases for the newly proposed hashing scheme, but I wanted to surface the problem and some suggestions first 🙂
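A simplified sketch of the hashing pattern just described (items already in a deterministic order, each marshaled separately; the real Tendermint tree uses RFC 6962-style leaf/inner domain prefixes, omitted here for brevity):

```go
package merkle

import "crypto/sha256"

// rootHash computes the Merkle root of a list of already-marshaled
// byte slices. Leaves are hashed individually; inner nodes hash the
// concatenation of their children's hashes.
func rootHash(items [][]byte) [sha256.Size]byte {
	switch len(items) {
	case 0:
		return sha256.Sum256(nil)
	case 1:
		return sha256.Sum256(items[0])
	default:
		k := splitPoint(len(items))
		l := rootHash(items[:k])
		r := rootHash(items[k:])
		return sha256.Sum256(append(l[:], r[:]...))
	}
}

// splitPoint returns the largest power of two strictly less than n.
func splitPoint(n int) int {
	k := 1
	for k*2 < n {
		k *= 2
	}
	return k
}
```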

Comment on lines 92 to 95
Finally, we should add clear documentation to the Tendermint codebase every place we
compare hashes of proto messages or use proto-serialized bytes to produce a
digital signature, noting that we have been careful to ensure that the hashes are
performed properly.
Contributor:

👍🏼


A prototype implementation of this by @creachadair can be found in [the wirepb repo][wire-pb].
This could be implemented in multiple languages more simply than ensuring that there are
canonical proto serializers that match in each language.
Contributor:

It remains a bit vague what wirepb actually is. Is it some kind of linter that enforces the above rules? Or does it take the raw bytes of any proto serializer and additionally canonicalize them (by applying the above rules)? From quickly skimming through that repo it seems to be the latter, and the introductory paragraph also speaks of "re-ordering the serialized bytes". I think it could still be clearer, because the above steps seem to operate on fields (which at first sounds like this is about pre-serialization).

Independent from the text: This is an interesting approach and I like it :)

Contributor:

> It remains a bit vague what wirepb actually is.

Right now it is just a quick proof of concept. I just added some more documentation to the README, which I hope helps clarify. In summary: it is not a linter; it actually permutes the content of the wire message in binary format. It operates only on the wire format and does not know anything about the language-specific type structure or the descriptor schema.

However, it does have some important limitations, which I've spelled out explicitly in the README file.

Another approach, which is potentially more robust, is to make the caller provide a protobuf Descriptor message for the expected type along with the byte string to be hashed. Given the descriptor, the limitations of the prototype could be eliminated. However, this would mean every tool that wanted to check the hash of a protobuf message would need to (a) know the expected protobuf type, (b) have access to the descriptor table, and (c) have access to a descriptor-aware implementation of hashing. Implementing it this way would probably also be very inefficient.

Protobuf wire format is simple enough that it's easy to re-implement this prototype algorithm in any language. Adding in descriptor support would make it a much bigger job. It could maybe be worthwhile, if the alternative is having to maintain a completely-parallel deterministic code generator, but there are a lot of unknowns.

In Go specifically, it might be possible to shim in a deterministic marshaling rule via the Message interface, but it's not at all clear how we could keep such an implementation synchronized with other languages (and I know at least Rust and JS potentially need to harmonize with the Go implementation—perhaps there are others).
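To make the above concrete, here is a minimal Go sketch of such a wire-format canonicalizer (not the actual wirepb code). It applies the tag-sorting and lexicographic-sorting rules discussed above using the google.golang.org/protobuf/encoding/protowire package, and it handles the string/message ambiguity by optimistically recursing into any length-delimited field whose payload parses as a message:

```go
package wirecanon

import (
	"bytes"
	"sort"

	"google.golang.org/protobuf/encoding/protowire"
)

// canonicalize permutes a wire-format message into tag-sorted order,
// ordering repeated occurrences of the same tag lexicographically by
// their encoding, and recursing into any length-delimited field whose
// payload happens to parse as a valid message (the ambiguity noted in
// the wirepb README). It reports false if msg is not itself a
// syntactically valid message.
func canonicalize(msg []byte) ([]byte, bool) {
	type field struct {
		num protowire.Number
		raw []byte // full tag+value encoding of one field occurrence
	}
	var fields []field
	for len(msg) > 0 {
		num, typ, n := protowire.ConsumeField(msg)
		if n < 0 {
			return nil, false // not a valid wire-format message
		}
		raw := msg[:n]
		if typ == protowire.BytesType {
			// Optimistically treat the payload as a nested message;
			// if it parses, re-encode it in canonical order.
			_, _, tagLen := protowire.ConsumeTag(raw)
			payload, _ := protowire.ConsumeBytes(raw[tagLen:])
			if sub, ok := canonicalize(payload); ok {
				re := protowire.AppendTag(nil, num, protowire.BytesType)
				raw = protowire.AppendBytes(re, sub)
			}
		}
		fields = append(fields, field{num, raw})
		msg = msg[n:]
	}
	// Sort by field number first, then lexicographically by encoding
	// to fix the order of repeated fields.
	sort.SliceStable(fields, func(i, j int) bool {
		if fields[i].num != fields[j].num {
			return fields[i].num < fields[j].num
		}
		return bytes.Compare(fields[i].raw, fields[j].raw) < 0
	})
	var out []byte
	for _, f := range fields {
		out = append(out, f.raw...)
	}
	return out, true
}
```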

@williambanfield (Contributor Author):

The more complex the implementation, the more difficult it will be to make it agree across languages in all cases. I would prefer that we stick with a simpler implementation.

Per the limitations in that document:

> The algorithm cannot distinguish between a field of opaque string type that contains the wire encoding of a message, and a field of message type.

This is worth keeping in mind if we ever allow opaque strings to represent messages in our types. The obvious example that comes to mind is vote-extensions. If we allow an extension to appear as bytes within our structure, we risk mangling the message if we re-encode it. I think being clear that this is strictly for hashing is a fine restriction to put on this algorithm. I.e., do not use this as any kind of intermediate representation ever. Just for comparison of hashes and signatures.

> The algorithm does not unify default-valued fields with unset (omitted) fields. In proto3, the only default values are equivalent to omitting the field from the message, and most encoders do omit them. However, if an encoder does include default-valued fields in its output, Algorithm 2 will generate a different string than if the field was omitted.

I'll do some minor research to determine which encoders produce which type of output. While we have no guarantee that this won't change in the future, we can at least guide users toward 'safe' encoders in their language and caution users in new languages about what behavior their encoder must adhere to.
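The wire-level consequence is easy to see; for example, for field 1 as a varint with value 0:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	explicit := []byte{0x08, 0x00} // tag (field 1, varint) + value 0 encoded explicitly
	omitted := []byte{}            // proto3 encoders normally omit default values entirely
	fmt.Println(sha256.Sum256(explicit) == sha256.Sum256(omitted)) // false: different hashes
}
```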

> The algorithm does not handle "packed" repeated fields of scalar type. Packed repeated fields are encoded as a wire-type string of concatenated varint values. The values within the string do not have the structure of a message, so the algorithm does not "fix" the order of the packed values.

It appears from my research that when we use a message containing a list, we implement an additional hash method on top of it that loops over the list in a deterministic order, marshals the list elements into byte slices, and then merkelizes the resulting slices. So we are generally in the clear on this point so far and can continue to use the same trick in the future.
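For reference, a packed repeated field is a single length-delimited value whose payload is just concatenated varints, with no per-element framing to reorder. Illustrative bytes for a hypothetical `repeated int32 xs = 1` holding the values 3, 1, 2:

```go
package wirecanon

// One field, wire type 2 (length-delimited); the three varints inside
// the payload carry no tags, so a wire-level canonicalizer has nothing
// to sort them by.
var packed = []byte{
	0x0a,             // tag: field 1, wire type 2
	0x03,             // payload length: 3 bytes
	0x03, 0x01, 0x02, // varints 3, 1, 2, in declaration order
}
```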

Contributor:

> The algorithm cannot distinguish between a field of opaque string type that contains the wire encoding of a message, and a field of message type.
>
> This is worth keeping in mind if we ever allow opaque strings to represent messages in our types.

We need to account for it even if we don't do that: The workaround is to assume all fields with the string wire-type could be messages. So an opaque string that happens to be a syntactically valid wire-format message would also be permuted, even if we never intended to treat it as a message.

We're still hashing all the same bytes in that case, but in a different order from the original. Simple cases (e.g., "\n\tabcdefghi") would not be changed, but more complex ones would (e.g., "\n\tbcdefghij\n\tabcdefghi" would become "\n\tabcdefghi\n\tbcdefghij"). As a rule, most strings won't look syntactically like messages, but we can't hang anything important on that.
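This example can be checked directly against the canonicalizer sketch shown earlier in this thread (hypothetical usage):

```go
in := []byte("\n\tbcdefghij\n\tabcdefghi")
out, ok := canonicalize(in)
fmt.Printf("%v %q\n", ok, out) // true "\n\tabcdefghi\n\tbcdefghij"
```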

> I'll do some minor research to determine which encoders produce which type of output.

It would also be possible to filter such fields out, but as far as I know the proto3 encoders should already be doing this.

> It appears from my research that when we use a message containing a list, we implement an additional hash method on top of it that loops over the list in a deterministic order

Fortunately, field packing only applies to scalar numeric types, so repeated messages, strings, etc., are not affected.

@williambanfield (Contributor Author):

> We need to account for it even if we don't do that: The workaround is to assume all fields with the string wire-type could be messages.

To clarify: I meant more that we need to take it into account so as never to accidentally deserialize from the bytes being used for hashing. But yes, the canonicalizer always needs to treat strings as though it's acting on a sub-message, since it can't disambiguate.

@williambanfield (Contributor Author):

From my quick investigation, it looks like protobufjs does include the serialized bytes of default values:

https://github.com/williambanfield/proto-research/blob/master/js/test.js#L26
https://github.com/williambanfield/proto-research/blob/master/js/output

@cmwaters (Contributor) left a comment:

This is really well written. I'm curious whether this problem is also addressed by similar software, and how they go about ensuring a deterministic array of bytes for hashing and verification.

Don't forget to add the link to this RFC in the table of contents.


According to [Cosmos-SDK ADR-27][cosmos-sdk-adr-27], when message types obey a simple
set of rules, gogoproto produces a consistent byte representation of serialized messages.
This seems promising, although more research is needed to guarantee gogoproto always
Contributor:

Don't we also have the intention of removing gogoproto?

@williambanfield (Contributor Author):

I believe that there is interest in doing this in the SDK, although I'm not sure I fully understand the timeline. It's likely a reasonable thing to do seeing as gogoproto is currently in maintenance limbo. However, in terms of concrete plans to remove it from Tendermint completely, I'm not aware of any at the moment.

Co-authored-by: Callum Waters <cmwaters19@gmail.com>
@tychoish (Contributor):

I think this outlines the parameters of the problem space really well, and I don't see any reason why we shouldn't commit this as soon as the details are polished.

I'm also not sure that we need to make any code changes imminently (it's definitely a loophole, though perhaps one of the specification), but the reality is that the implementation is likely stable for the short term.

I think I come down, weakly, on the side of making the hashing protocol a function of the specification (e.g., read x values as serialized under y rules, in z order) rather than of the implementation, but I can be convinced in other directions.

@williambanfield added and then removed the S:automerge (Automatically merge PR when requirements pass) label on Dec 15, 2021
mergify bot merged commit 20c547a into master on Dec 15, 2021
mergify bot deleted the wb/rfc-deterministic-proto-bytes branch on December 15, 2021 at 22:52
@williambanfield restored the wb/rfc-deterministic-proto-bytes branch on September 9, 2022 at 16:12