Custom Options and Proto2 Extensions #674

Open

nswarm opened this issue Jun 30, 2022 · 9 comments

@nswarm commented Jun 30, 2022

Overview

This is a proposal for support for extensions (proto2 only) and custom options (proto2 and 3).

The general idea of Extensions is that users can extend third-party protobuf definitions with additional data that will not break deserialization for anything that relies only on the base version of the protobuf. For example, your service might report health check information in some standard-compliant way while also including Extension data that another service in your ecosystem (one with your specific protobuf defs) can use.

Custom Options are Extensions on the core *Options types like MessageOptions and FileOptions that are available to protoc generators, including custom protoc plugins and prost.

Protobuf Extensions

Given a protobuf message like this:

// Provided by original message author.
message MyMessage {
	extensions 100 to 199;
}

Users can write extensions like this:

// Written by users of the "MyMessage" type.
extend MyMessage {
	optional string my_extension = 100;
}

Then they can use extension data like this (what it looks like in my PR #591):

// Encoding extension data.
let mut message = MyMessage::default();
message
	.set_extension_data(MY_EXTENSION, "some data")
	.expect("Failed to set ext data");
// Note: MY_EXTENSION is a generated static const value.
// Encode as normal.
	
// Later, assuming encoded_vec has a serialized `MyMessage`.
// Decoding extension data. ExtensionRegistry is used to tell the parser what
// extensions exist and how to parse them.
let mut registry = ExtensionRegistry::new();
registry.register(MY_EXTENSION);
let message = MyMessage::decode_with_extensions(&mut encoded_vec.as_slice(), registry)
	.expect("Failed to decode.");
	
// Using extension data.
let ext_data = message
	.extension_data(MY_EXTENSION)
	.expect("Failed to get ext data");
assert_eq!(ext_data, "some data");
Ok(())

Implementation

The original PR (#591) implements extensions similarly to how other language implementations in the core protobuf project handle them, which I'll describe here.

There are three major components:

  1. Storing extension data in memory.
  2. Encoding and decoding the in-memory data.
  3. Getting and setting extension data in memory.

1. In-Memory Extension Data

Given a protobuf like:

// Provided by original message author.
message MyMessage {
	extensions 100 to 199;
}

We generate code like:

#[derive(Clone, PartialEq, ::prost::Message, ::prost::Extendable)]
pub struct MyMessage {
    #[prost(extension_set)]
    pub extension_set: ::prost::ExtensionSet<Self>,
}
impl MyMessage {
    const EXTENDABLE_TYPE_ID: &'static str = ".MyMessage";
}

  • ::prost::ExtensionSet: at runtime, holds type-erased extension values keyed by protobuf tag.
  • EXTENDABLE_TYPE_ID: used as a validator to ensure a particular Extension is for this protobuf type. (Will come back to this.)
  • ::prost::Extendable: a derive that generates accessors for the above two.

The ExtensionSet is simply a mapping from protobuf tag to a type-erased value, similar to this:

tag_to_value: BTreeMap<FieldTag, Box<dyn ExtensionValue>>,
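
For illustration, here is a minimal, self-contained sketch of that shape. This is not the #591 code: FieldTag, the Any-based type erasure, and the set/get methods are stand-ins, and the real ExtensionSet is generic over its host message type.

use std::any::Any;
use std::collections::BTreeMap;

// Simplified sketch only; not the #591 implementation. The real ExtensionSet
// is generic over its host message and stores richer ExtensionValue objects.
pub type FieldTag = u32;

// Each stored value can expose itself as Any so the typed accessor below can
// downcast back to the concrete Rust type.
pub trait ExtensionValue: Any {
    fn as_any(&self) -> &dyn Any;
}

impl<T: Any> ExtensionValue for T {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

#[derive(Default)]
pub struct ExtensionSet {
    // Protobuf tag -> type-erased extension value.
    tag_to_value: BTreeMap<FieldTag, Box<dyn ExtensionValue>>,
}

impl ExtensionSet {
    pub fn set(&mut self, tag: FieldTag, value: impl ExtensionValue) {
        self.tag_to_value.insert(tag, Box::new(value));
    }

    pub fn get<T: Any>(&self, tag: FieldTag) -> Option<&T> {
        self.tag_to_value
            .get(&tag)
            .and_then(|value| value.as_any().downcast_ref::<T>())
    }
}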

2. Encoding and Decoding Extension Data

Given a type with an ExtensionSet, we need to be able to encode/decode that data to and from the wire. Users supply this information via the protobuf extension declaration itself:

extend MyMessage {
	optional string my_extension = 100;
}

We generate code like:

pub const MY_EXTENSION: &'static ::prost::ExtensionImpl<String> = &::prost::ExtensionImpl::<String> {
	extendable_type_id: ".MyMessage",
	field_tag: 100,
	proto_int_type: ::prost::ProtoIntType::Default,
};

extendable_type_id and field_tag together compose the ID of a specific field on a specific type.

The generic type and proto_int_type give us all the info we need to encode/decode the data. proto_int_type supplies additional info because there is no 1:1 mapping of Rust int types to proto scalar types.
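
For illustration, a hypothetical shape of that extra hint (the variant names here are guesses; only Default appears in the generated constant above). It exists because several proto scalar types share one Rust type, e.g. int32, sint32, and sfixed32 all map to i32 but use different wire encodings:

// Hypothetical sketch; the actual variants in #591 may differ.
enum ProtoIntType {
    Default, // int32 / int64: plain varint encoding
    Sint,    // sint32 / sint64: zigzag varint encoding
    Fixed,   // fixed32 / fixed64 / sfixed32 / sfixed64: fixed-width encoding
}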

Decoding

When decoding, fields in the generated Rust code normally map 1:1 to the protobuf. For extensions we don't have generated fields; instead we have a single ExtensionSet field, which serves as a container for every unknown field that has an Extension registered for it. This is why we need the ExtensionRegistry, which must be populated with those Extension types before parsing.

The ExtensionRegistry is simply a store of ExtensionImpls. It is queried while parsing the unknown fields of a message, keyed by the message type and field tag; on a match, the generic type and proto_int_type are used to decode the field from the serialized data and store it in the ExtensionSet field on the object via the merge method.
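
A simplified, self-contained sketch of that lookup (not the #591 API; the real registry stores ExtensionImpl values, reduced here to a placeholder ExtensionInfo):

use std::collections::HashMap;

// Placeholder for the decode info an ExtensionImpl carries (generic type,
// proto_int_type, etc.); the fields here are illustrative only.
pub struct ExtensionInfo {
    pub extendable_type_id: &'static str,
    pub field_tag: u32,
}

#[derive(Default)]
pub struct ExtensionRegistry {
    // (extendable_type_id, field_tag) -> how to decode that extension field.
    entries: HashMap<(&'static str, u32), ExtensionInfo>,
}

impl ExtensionRegistry {
    pub fn register(&mut self, ext: ExtensionInfo) {
        self.entries
            .insert((ext.extendable_type_id, ext.field_tag), ext);
    }

    // Queried while parsing a field the generated struct has no member for:
    // a hit means "decode this field into the ExtensionSet via merge",
    // a miss means "skip it like any other unknown field".
    pub fn lookup(&self, type_id: &'static str, tag: u32) -> Option<&ExtensionInfo> {
        self.entries.get(&(type_id, tag))
    }
}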

Encoding

When encoding, the ExtensionSet's encode is called, which in turn tells each of its extension values to encode itself.

Merge and Encode Traits

The various merge and encode methods already exist for generated prost types via the ::prost::Message derive. I added the Merge and Encode traits, which are implemented by the same derive and call into those same methods. This way we can call these methods in a type-erased context, such as when encoding/decoding through the generic types of ExtensionImpl.
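
A rough sketch of that idea; the exact names and signatures in #591 may differ, and there the impls come from the derive, while here blanket impls keep the sketch self-contained:

use prost::bytes::{Buf, BufMut};
use prost::{DecodeError, EncodeError};

// Object-safe wrappers around the generated prost::Message methods, so a
// type-erased extension value can still be merged and encoded.
pub trait Merge {
    fn merge_dyn(&mut self, buf: &mut dyn Buf) -> Result<(), DecodeError>;
}

pub trait Encode {
    fn encode_dyn(&self, buf: &mut dyn BufMut) -> Result<(), EncodeError>;
}

impl<M: prost::Message> Merge for M {
    fn merge_dyn(&mut self, buf: &mut dyn Buf) -> Result<(), DecodeError> {
        // &mut dyn Buf itself implements Buf, so the generic method accepts it.
        self.merge(buf)
    }
}

impl<M: prost::Message> Encode for M {
    fn encode_dyn(&self, mut buf: &mut dyn BufMut) -> Result<(), EncodeError> {
        // Reborrow so the generic parameter is the sized type &mut dyn BufMut.
        self.encode(&mut buf)
    }
}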

3. Getting and Setting Extension Data

The Extendable trait provides a function to get the type's ExtensionSet. From there you can use the static ExtensionImpl type to tell the ExtensionSet which data you want to operate on.

// Inside ExtensionSet
pub fn extension_data<T>(
	&mut self,
	extension: &ExtensionImpl<T>, // Using the "ID" properties `extendable_type_id` and `field_tag`
) -> Result<&T, ExtensionSetError> { ... }

// also:
// extension_data_mut
// set_extension_data
// has_extension
// clear_extension

I think it's plausible to break this down into PRs roughly along the lines above if that would make it easier to consume. The feature is kind of a monster, though, and not particularly useful without everything.

Let me know if anything is unclear; it's fairly complicated and took some digging to truly understand what was going on. I'm also probably not explaining it as well as I think I am. :)

@LucioFranco

@nswarm (Author) commented Jul 23, 2022

@LucioFranco Following up: I have some time coming up. Would you like me to move ahead with some small PRs, or would you like to review the design first?

I could also recreate the existing PR but cut the commits into better logical slices so it's easier to follow the progression described here; then we could work off that.

@LucioFranco (Member)

@nswarm sorry for the delay on this! I would say for now hold off on doing more work until I get some time to read through this and think about it. I know that's probably not the answer you want to hear, but my time is a bit limited atm. I appreciate you doing this work and I will try to get to this in the next month or so. Again, sorry about the delay.

@nswarm (Author) commented Jul 29, 2022

No problem, I'm in no rush. Thanks for getting back to me. I'll let it simmer :)

@slinkydeveloper

Hi @nswarm @LucioFranco, I'm catching up on all the discussions about this feature, and I'm gonna give my 2 cents:

Unknown extensions and ExtensionRegistry

The "we don't need an ExtensionRegistry" and "unknown extensions" arguments have come up a couple of times. As far as I understand, it should be possible in the ExtensionSet#merge_field function to not parse the extension type, but rather keep an unparsed buffer, so that when the user accesses the extension, ExtensionValue#merge can be used to parse it into the final type. If that's the case, we can tackle this feature from two directions:

  • We statically parse the extensions as in Support for custom options and proto2 extensions #591, hence you need an ExtensionRegistry. We can then support unknown extensions as a next feature by keeping, for unknown extension tags, the unparsed buffer in a separate map of the ExtensionSet. The user can then access it by providing a manually instantiated ExtensionImpl containing the type info to parse that buffer.
  • We partially parse the extensions every time, as described in the previous point. This entails introducing Extendable, ExtensionSet, ExtensionImpl (and related), but doesn't need the changes in prost-build to generate ExtensionImpl. We can then, as a next feature, support generating ExtensionImpl in prost-build and introduce the ExtensionRegistry API (if we want to).

Specifically about the ExtensionRegistry API, I agree it's not nice to expose it the way it is in #591. But if we want to have such an API, I don't really see much of an alternative, unless we go down the road of a static ExtensionRegistry (which is essentially what the Java implementation does). About that, I think it's not that bad, as it aligns with other protobuf implementations out there in other languages, but I understand how a static singleton struct looks evil to Rustaceans 😄 (I'm also unsure whether this approach works fine for no_std; I don't have enough experience with it).

My use case

In my use case, I'm using the #591 branch for reading extensions provided only by my application, so I'm fine with skipping all the other unknown extensions on my read path.

That said, I see a use case where I might need unknown extensions support: not to access them, but to be able to encode them back into the descriptor. In other words, an application that reads a protobuf descriptor, applies some changes to it, and then writes it back needs this feature in order to write back extensions not known to that particular app. Even if we don't provide a way to access unknown extensions, we should at least guarantee that we can write them back.

I hope it helps, and I'll be happy to contribute to this feature.

@leviska (Contributor) commented Aug 29, 2022

Hello, I'm very interested in this feature and would like to help with anything: reviewing, discussion, implementing some of the features.

@LucioFranco (Member)

Hi, sorry this has taken me so long to finally review. I started digesting it over the last few days, and I don't feel like I am fully there yet on what the right idea is, but I'll add some of my current thoughts and notes to spark some discussion.

  • Can ExtensionImpl be Extension<impl Message>? I think the biggest issue we will face with using Message here is that it's not object safe. But the workaround is to have it accept &mut dyn BufMut etc. Related issue: Make Message object safe #742
  • I think adding an extension_set item on the struct is interesting. I wonder if we want to add a Message::extension_set() -> Option<ExtensionSet> that gets a default impl of None, but is overridden and mapped from the field when the extension set macro is added. (Rough sketch after this list.)
  • I think what we will want to do also is lazily parse the extensions; if I understand them correctly, they are just extra tagged items, so we can store everything in the extension range in the ExtensionSet to be decoded at some point later.
  • I'm not a big fan of having both a decode and a decode_with_extensions. It would be nice if it integrated into the main API.
  • If we used Extension<impl Message> we could avoid the proto_int_type stuff as well, I believe?
  • It would be nice to also write out what the other impls do to understand what trade-offs they choose.
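
A rough sketch of the extension_set() accessor idea from the second bullet. In the real proposal this would be a default method on prost::Message itself; here it is a standalone trait with a placeholder ExtensionSet and a hypothetical, simplified MyMessage, just to show the shape:

// Sketch only; ExtensionSet and MyMessage are simplified placeholders.
pub struct ExtensionSet;

pub trait MessageExt {
    // Default for plain messages: no extension set.
    fn extension_set(&self) -> Option<&ExtensionSet> {
        None
    }
}

// What the derive would generate for a message that does carry the field.
pub struct MyMessage {
    pub extension_set: ExtensionSet,
}

impl MessageExt for MyMessage {
    fn extension_set(&self) -> Option<&ExtensionSet> {
        Some(&self.extension_set)
    }
}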

Anyways, here are some quick thoughts. I will continue thinking about this and look forward to seeing what y'all say. Again, thanks for being so patient <3

@slinkydeveloper commented Nov 14, 2022

I wonder if we want to add a Message::extension_set() -> Option<ExtensionSet> that gets a default impl of None, but is overridden and mapped from the field when the extension set macro is added.

Perhaps a no-op ExtensionSet is nicer than the Option?

It would be nice to also write out what the other impls do to understand what trade offs they choose.

Just checked the Java implementation: what they do is hardcode the protobuf descriptor bytes in the generated Java class, then in a static context they initialize a global variable with the extension value. When the user accesses the extension, the value is cast using the generic type of the generated extension class. This is roughly similar to lazily parsing the extensions from the ExtensionSet at access time. Test.zip

@nswarm (Author) commented Nov 24, 2022

  • Can ExtensionImpl be Extension<impl Message>? I think the biggest issue we will face with using Message here is that it's not object safe. But the workaround is to have it accept &mut dyn BufMut etc. Related issue: Make Message object safe #742

By this do you mean bounding ExtensionImpl<T: Message>/ExtensionValueImpl<T: Message> (T is pretty loose right now)? That way we could use the Message's own generated methods rather than going through the Encode/Merge traits I added?

I spiked this out quickly and I think that would work great, assuming we could call the merge/encode methods without generics (e.g. with &mut dyn Buf/BufMut) as you said.

  • If we used Extension<impl Message> we could avoid the proto_int_type stuff as well I believe?

Yes. It would nuke a lot of oddities of my implementation like the Encode/Merge traits as well.

(LucioFranco)

  • I think adding an extension_set item on the struct is interesting. I wonder if we want to add a Message::extension_set() -> Option<ExtensionSet> that gets a default impl of None, but is overridden and mapped from the field when the extension set macro is added.

(slinkydeveloper)
Perhaps a no-op ExtensionSet is nicer than the Option?

The way I had it is effectively the no-op version, where the top-level Option is handled by the ExtensionSet itself, so even if you dug in you wouldn't run into the Option.

msg.extension_set().extension_data(...) // returns Option<T>

If the Option were around the ExtensionSet, you'd get two Options on every call:

msg.extension_set().unwrap().extension_data(...) // returns Option<T>
        ^ also returns option

That said, since the Extendable trait has forwarding methods into the extension set, you can use the following, and with either implementation above it collapses the Option checks:

msg.extension_data(...) // returns Option<T>

I don't have much opinion either way.

  • I think what we will want to do also is lazily parse the extensions; if I understand them correctly, they are just extra tagged items, so we can store everything in the extension range in the ExtensionSet to be decoded at some point later.
  • I'm not a big fan of having both a decode and a decode_with_extensions. It would be nice if it integrated into the main API.

(Thinking out loud a bit here...)

WRT the interface, you always need to supply byte layout info to the parser, whether up front or during data access. The tradeoff between an extension registry and lazily parsed extensions is when you supply that data.

With an extension registry, you supply layout info up front when you decode.

With lazily parsed extensions, you supply the layout info whenever you do an operation on the extension set (get/set/has). As it happens, in my impl you already need to supply a static Extension reference that tells the ExtensionSet what you're looking for, so arguably the byte layout info required to parse could come along with that Extension reference.

There's also a performance tradeoff in when you parse the extension data. It's probably not significant enough to consider without a clear use case and measurements, as long as you properly store the result on first evaluation. Lazy parsing could even provide a small perf benefit when re-encoding, if you're OK with the higher memory footprint of hanging onto the encoded version while it hasn't changed. There's also the potential benefit that if you aren't accessing extensions often, you don't pay the cost.

tl;dr:

  1. I believe we could replace the Encode/Merge traits and proto_int_type by bounding extension generic type T: Message, pending Make Message object safe #742
  2. lazy eval seems totally do-able. We'd just need to figure out how to get the descriptor for the field into the ExtensionImpl (like slinky mentions) and also store the bytes in the ExtensionValue to be decoded at will (rough sketch below).
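
A minimal, self-contained sketch of the lazy idea in point 2 (illustration only, not the #591 design): keep the raw field bytes at decode time and only parse them into the concrete type on first access. Keeping the raw form around is also what would allow unknown extensions to be re-encoded without ever being parsed, as discussed above.

use prost::{DecodeError, Message};

// Sketch only: a lazily decoded extension value. The raw bytes captured while
// decoding the containing message are parsed into T on first access and cached.
pub enum LazyExtensionValue<T: Message + Default> {
    Raw(Vec<u8>),
    Parsed(T),
}

impl<T: Message + Default> LazyExtensionValue<T> {
    pub fn get(&mut self) -> Result<&T, DecodeError> {
        if let LazyExtensionValue::Raw(bytes) = self {
            let parsed = T::decode(bytes.as_slice())?;
            *self = LazyExtensionValue::Parsed(parsed);
        }
        match self {
            LazyExtensionValue::Parsed(value) => Ok(value),
            LazyExtensionValue::Raw(_) => unreachable!("replaced above"),
        }
    }
}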

@LucioFranco (Member)

I believe we could replace the Encode/Merge traits and proto_int_type by bounding extension generic type T: Message, pending #742

Agreed. I won't have time to take this on, but if someone wants to start on it, that would be good. I think I posted a proposal in there. The big thing here is running enough benches to be sure there is no perf impact.

lazy eval seems totally do-able. We'd just need to figure out how to get the descriptor for the field into the ExtensionImpl (like slinky mentions) and also store the bytes in the ExtensionValue to be decoded at will.

I think it would be nice if we could do get::<proto::MyExtMessage>(), though I wonder if this will run into type conflicts, because I assume you could have two fields with different tags but the same type. I will need to look at a Java example to see how they extract it. I think the laziness will also allow us to support passing extensions through, which would be a big win as well.
