External protos #278

Open: wants to merge 6 commits into master
Conversation

xd009642
Contributor

As mentioned in #272, I believe this will simplify the current release/build process for these bindings. Using an external crate for the protos means some of the fields currently serialized as byte vectors can be worked with by users with less hassle and without internal changes, and it gives greater ability to extend the bindings without relying on changes in this crate.

Also, I've generated a whole bunch of protos you don't even need yet, so it's easy to just add another feature and get more brought into this crate. Personally, I found that interacting with tensorflow-proto-codegen before was difficult to get working.

I'm aware that relying on an external crate for the proto implementations may be seen as somewhat of a risk, so I'm willing to add you as a maintainer/contributor, @adamcrume 😄

@xd009642
Contributor Author

All the failures seem to be of the form:

/tmp/rustdoctestC1cNkE/rust_out: error while loading shared libraries: libtensorflow.so.1: cannot open shared object file: No such file or directory

Is there an issue with CI?

@adamcrume
Contributor

The doc tests have been broken for a while because of rust-lang/cargo#8531; that's what the shared library errors in Travis are.

What are the benefits of putting the protos in a separate crate, instead of keeping them in the tensorflow crate and just making them public API? I do have some concerns around coordinating releases.

I'll have to think about it more when I've had more sleep.

@xd009642
Contributor Author

I would prefer that the proto types become public and that the deprecated fields all be removed in a minor version bump 👀 It would be nice if there were a minimum number of versions a deprecated field lived for before deletion.

Well, a lot of the proto types aren't yet publicly exposed; the only form you can access them in is a Vec<u8> or buffer. Without exposing them, users need to generate the same proto code themselves (which is a faff), or start grabbing private internals of these bindings and adding them into their own code, which is something I've seen people have to do to get certain things working.
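To give a flavour of the "faff": without generated types, a user holding a raw Vec<u8> has to decode the protobuf wire format by hand. A minimal self-contained sketch (not code from this PR; the example bytes are the classic wire-format example of field 1 set to 150):

```rust
// Decode a protobuf varint from the front of `bytes`, returning the
// decoded value and the number of bytes consumed. Varints store 7 bits
// per byte, least-significant group first; the high bit marks "more".
fn decode_varint(bytes: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    let mut shift = 0;
    for (i, &b) in bytes.iter().enumerate() {
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return (value, i + 1);
        }
        shift += 7;
    }
    panic!("truncated varint");
}

fn main() {
    // Serialized message with field 1 (varint wire type) = 150:
    // 0x08 is the tag byte (field 1 << 3 | wire type 0),
    // 0x96 0x01 is the varint encoding of 150.
    let buf: Vec<u8> = vec![0x08, 0x96, 0x01];
    let (tag, used) = decode_varint(&buf);
    assert_eq!(tag >> 3, 1); // field number 1
    let (value, _) = decode_varint(&buf[used..]);
    assert_eq!(value, 150);
    println!("field 1 = {}", value);
}
```

With generated proto types, all of this is a single typed field access instead of manual wire-format work.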

My hope with this is that there's less pressure to update the bindings to get certain functionality: users can easily convert the Vec<u8> into compatible types, and if need be use the tensorflow-sys crate together with the proto crate for functionality not yet in the tensorflow crate. Some of the gaps between releases have been semi-painful to work around at my own job, where we're using these bindings, so I'd like to minimise that pain without demanding that you work on this more. I suppose another option could be finding more trusted contributors?

Also, I got here because I was attempting to add a proto in another PR I was working on, and working out all the protos I needed to add to the existing build script for dependencies was a massive pain, as was working out how to run the codegen itself (the docs didn't quite help; I eventually worked it out from something like a CI script). My method generates code for all the protos, works out the dependencies, and sets up feature gates for every proto file you'd want to add, including its dependencies. For any new protos in the future it's then just a case of enabling a feature, which is definitely easier 👀

Another option could be to move my code inline into this repo and give you publish access to the protobufs crate I published. Then everything is here.

@adamcrume
Contributor

I agree that this project should move toward using actual protobuf types and not just Vec<u8>. I know that manually working with protobuf codegen is a headache for most users who want to use protos. I just have concerns about the details and how we get from here to there.

I don't think we want to get rid of the methods which return Vec<u8>, because some users may already be using other protobuf libraries. In fact, this is the main reason why the project hasn't committed to using a particular protobuf library in its API, yet: there hasn't been a clear winner in terms of which protobuf crate to use. protobuf is pretty popular, and it's what we're using internally and seems to work well, but some users may be using prost or something else.

One question I had was whether it made sense for the generated protos to go into the tensorflow crate or a separate one. One benefit in separating them is that code which needs TensorFlow protos but not TensorFlow itself (e.g. something generating Example protos or reading output) could use the proto crate without the tensorflow-sys crate, so that could be a reasonable argument for keeping them separate.

I'd be a bit more comfortable if we feature-gate the use of proto types in the API for now (i.e. a single feature named something like protobuf_protos), at least for a few releases. I'd also be more comfortable if the code were in this Git repo (even if in a separate crate), and if I had publish access on the proto crate. My concern is that if something happens and you're not able to release the proto crate, then the tensorflow crate wouldn't be able to use new protos/fields, and switching to using generated proto code in a different crate would be a breaking change.

By the way, if there is any documentation which is unclear, please file an issue or submit a PR. We should fix bad or incomplete docs.

I'm also definitely interested in having more contributors to the project, especially for the Windows build. I always need help with the Windows build. :(

@xd009642
Contributor Author

> I don't think we want to get rid of the methods which return Vec<u8>, because some users may already be using other protobuf libraries. In fact, this is the main reason why the project hasn't committed to using a particular protobuf library in its API, yet: there hasn't been a clear winner in terms of which protobuf crate to use. protobuf is pretty popular, and it's what we're using internally and seems to work well, but some users may be using prost or something else.

Hmm, I can adapt my crate to generate code using different protobuf crates and feature-gate each one. I use prost in other projects just because tonic uses it, so it would be nice to cut down dependencies. I can probably get it to work so you can just do features=["prost"] / features=["protobuf"], and if neither is supplied the Vec<u8> is supplied to the user 🤔
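A sketch of how such gating might look in the consuming crate's Cargo.toml, with the Vec<u8> accessors always available so each feature stays purely additive (all crate and feature names here are hypothetical, not taken from this PR):

```toml
[dependencies]
# Generated rust-protobuf bindings, pulled in only when requested.
tensorflow-protos = { version = "0.1", optional = true }
# Generated prost bindings, likewise optional.
tensorflow-prost-protos = { version = "0.1", optional = true }

[features]
# Each feature only adds API surface; the Vec<u8> methods remain
# available regardless, so enabling one or both stays additive.
protobuf = ["tensorflow-protos"]
prost = ["tensorflow-prost-protos"]
```

Because Cargo unifies features across a dependency graph, this layout only works if enabling both features simultaneously compiles cleanly.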

> My concern is that if something happens and you're not able to release the proto crate, then the tensorflow crate wouldn't be able to use new protos/fields, and switching to using generated proto code in a different crate would be a breaking change.

That's why I suggested making you a contributor on the repo/crates.io as well; that would give you push/merge/publish rights 😄 With that in mind, would you still want it inside this repo? It might end up substantially larger with other protobuf options included.

@xd009642
Contributor Author

Either way, I'll start working on changes to my proto crate to prototype other protobuf generation crates 👍

@xd009642
Contributor Author

https://github.com/xd009642/tensorflow-protos-rs/tree/prost_support if you want to have a look at prototype prost support. However, prost lays out the generated files differently: instead of a file for each proto, it's a file for each include directory. This means I can't generate the per-file feature gates, and a lot more code will be pulled in as a result.

I did notice there was a tensorflow_grpc proto file, so I guess for fuller support I should consider generating the gRPC service definitions as well if I'm also adding prost support...

@adamcrume
Contributor

We shouldn't hide the Vec<u8> from the user if they enable a feature, because features are generally considered to be additive. Basically, if code compiles with a feature disabled, it should continue to compile with the feature enabled. That also implies that if we have both prost_protos and protobuf_protos features, it should be possible to enable both at the same time. Imagine what happens if crate A uses tensorflow with the prost_protos feature, and crate B uses tensorflow with the protobuf_protos feature, and then crate C uses both crates A and B.

If we support both protobuf and prost explicitly, we'd have to do something like e.g.

```rust
fn graph_def(&self) -> Result<Vec<u8>>;

#[cfg(feature = "protobuf_protos")]
fn graph_def_protobuf(&self) -> Result<protobuf_protos::GraphDef>;

#[cfg(feature = "prost_protos")]
fn graph_def_prost(&self) -> Result<prost_protos::GraphDef>;
```

If the generated protos are going to be in a separate crate anyway, it probably makes sense to have one for protobuf TensorFlow protos and one for prost TensorFlow protos. That's simpler than having both in the same crate and feature gating between the two proto libraries.

We also don't have to add prost support immediately. I would just like to make sure we leave room for it later.

ramon-garcia pushed a commit to ramon-garcia/tensorflow-rust that referenced this pull request May 20, 2023