feat: generated protobuf classes #1
| @@ -0,0 +1,10 @@ | |||
| def test_imports(): | |||
| from pysubstrait.proto.pysubstrait.algebra_pb2 import Expression | |||
Just curious about the pysubstrait.proto.pysubstrait... could we make it something like
from pysubstrait.proto.core.algebra_pb2 import Expression or just from pysubstrait.proto.algebra_pb2 import Expression
I would also have preferred that, but the second pysubstrait was added as a namespace for the generated protobuf classes so as not to conflict with other implementations like C++. I used this script in gen_proto.sh to generate it: https://github.com/substrait-io/substrait/blob/main/tools/proto_prefix.py. I'm open to a better naming convention if anyone has one!
We can always merge as-is and change it in the future while the project is still pre-v1.0. For reference, ibis-substrait sort of does it this way, too: https://github.com/ibis-project/ibis-substrait/tree/main/ibis_substrait/proto/substrait/ibis
Note that this approach has some dangers when using extensions (e.g. google.protobuf.Any), though my understanding was that ibis-substrait had found some way to solve this?
I think we haven't actually run into this in ibis-substrait yet, but I don't know that anyone has tried it.
Do you mind elaborating on the dangers with an example?
Sure. Let's pretend I created a message (EstimatedSelectivity) that was intended to be used as an advanced extension on RelCommon::Hint and we merged it into Substrait:
message RelCommon {
  ...
  Hint hint = 3;
  ...
  message Hint {
    ...
    substrait.extensions.AdvancedExtension advanced_extension = 10;
    ...
  }
  ...
}
message EstimatedSelectivity {
  double selectivity = 1;
}
Then, I attached that hint onto a filter relation. Here is the plan, serialized as JSON:
{
  ...
  "filter": {
    "common": {
      "hint": {
        "optimization": {
          "@type": "substrait.EstimatedSelectivity",
          "selectivity": 0.03
        }
      }
    },
    "condition": { ... },
    "input": { ... }
  }
  ...
}
As you can see, the Substrait package is included as part of the @type when an Any is serialized.
However, I don't know of any existing use like this. In Acero we use google.protobuf.Any to add new nodes (currently as-of join and segmented aggregation). In both cases the new message is not part of the core Substrait set of messages. It's a part of Acero's custom protobuf. This custom protobuf uses Substrait messages (e.g. an as-of join has field references in it and these are Substrait expressions) but the only type name that gets serialized is the extension itself.
I think it's reasonable to expect that will never be the case. For example, consider the above example. One could argue that a new "standard hint" should be included in the hint definition as:
message RelCommon {
  ...
  Hint hint = 3;
  ...
  message Hint {
    ...
    EstimatedSelectivity estimated_selectivity = 11;
    substrait.extensions.AdvancedExtension advanced_extension = 10;
    ...
  }
  ...
}
In other words, any time google.protobuf.Any is used it must refer to an object that is not part of the core Substrait protobuf files.
However, I'm not completely convinced it will never happen.
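To make the hazard concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the protobuf library: the `resolve_any_type` helper and the two reader registries are hypothetical, and the only assumption carried over from the example above is that a reader looks up the fully qualified name written in `@type`.

```python
def resolve_any_type(type_url, known_messages):
    """Return the class registered under the name in `type_url`, or None."""
    # A type URL may carry a host prefix; the message name is the last segment.
    full_name = type_url.rsplit("/", 1)[-1]
    return known_messages.get(full_name)


class EstimatedSelectivity:
    """Stand-in for a generated message class (hypothetical)."""


# What a producer using the upstream package name would write into the plan.
written_type = "substrait.EstimatedSelectivity"

# A reader built from the unmodified proto files knows the upstream name.
upstream_reader = {"substrait.EstimatedSelectivity": EstimatedSelectivity}
# A reader built from prefixed proto files knows the same message by another name.
prefixed_reader = {"pysubstrait.EstimatedSelectivity": EstimatedSelectivity}

found_upstream = resolve_any_type(written_type, upstream_reader)
found_prefixed = resolve_any_type(written_type, prefixed_reader)

print(found_upstream)  # the class is found
print(found_prefixed)  # None: the prefixed reader cannot resolve the type
```

The mismatch only matters when the packed message itself lives in the renamed package, which is why extension messages defined outside the core Substrait files are unaffected.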
I believe what Weston is describing is how I planned to approach doing my own extensions. I think the most concise example is here, where I use ibis-substrait to translate and then I use a google.protobuf.Any for an ExtensionLeafRel:
https://github.com/drin/mohair-extension/blob/mainline/mohair_extension/extension.py#L100
Using just the standard string representation, here's what I generate:
https://media.githubusercontent.com/media/drin/mohair-extension/mainline/resources/projection.mohair.txt
> an object that is not part of the core Substrait protobuf files

I am assuming that if something becomes core, then I would change the way this is done for that specific message type. Using ibis-substrait's approach to translation makes that an easy change, though I'm not sure the same holds for other large projects.
Thank you, both! The examples are super helpful. It's a good thing to note!
| ## Generate protocol buffers | ||
| Generate the protobuf files manually. Requires protobuf `v3.20.1`. | ||
| ``` | ||
| ./gen_proto.sh |
Just curious:
Could we make this part of the setup which would automatically run?
I don't think we want to do that -- one of the conveniences here should be that the files are already generated (using the correct version of protobuf) and bundled into the package for the end-user.
gforsyth left a comment:
I think this is a good starting point and the general structure is good in re: protobuf generation, namespacing, etc.
My preference would be to call the module and the pypi package substrait instead of pysubstrait, but that's not a blocker.
Since there are a lot of opinions on the two names, is it worth voting in the ML? It's probably best that we definitively pick one early on.
westonpace left a comment:
Looks like a fine starting point to me.
| # Getting Started | ||
| ``` | ||
| git clone --recursive https://github.com/substrait-io/substrait-python.git | ||
| cd substrait-python | ||
| ``` | ||
| # Setting up your environment | ||
| ## Conda env | ||
| Create a conda environment with developer dependencies. | ||
| ``` | ||
| conda env create -f environment.yml | ||
| conda activate pysubstrait | ||
| ``` | ||
| # Build | ||
| ## Python package | ||
| ### Editable installation | ||
| ``` | ||
| pip install -e . | ||
| ``` | ||
| ## Generate protocol buffers | ||
| Generate the protobuf files manually. Requires protobuf `v3.20.1`. | ||
| ``` | ||
| ./gen_proto.sh |
Are these instructions for pysubstrait developers? Or are these instructions for pysubstrait users?
If these are for developers should they be in a CONTRIBUTING.md? If they are for users then why does it list gen_proto.sh? Is that something a user has to do?
Everything at and below "Getting Started" is for devs! I'll move it to CONTRIBUTING.md. Great idea!
| @@ -0,0 +1,10 @@ | |||
| def test_imports(): | |||
| from pysubstrait.proto.pysubstrait.algebra_pb2 import Expression | |||
Actually "standard" (as in, the message defining the hint is in the core substrait proto files) hints will probably be more of a danger than extensions.
Looks good to me.
I haven't familiarized myself with voting procedures for the project, but it's an option. I think it's worth looking at what other languages have done: Java (original implementation), refers to itself as All but
To be fair, the cmake project name (which is probably a closer equivalent) is actually substrait-cpp. That being said, I'm in favor of
Sorry! I didn't mean to misrepresent 😅
Okay, I think there is a pretty substantial majority leaning towards I'll still nickname this project PySubstrait from time to time outside of this repo, though. ;)
| @@ -0,0 +1,12 @@ | |||
| name: pysubstrait | |||
How do we feel about leaving the conda environment as pysubstrait? Alternatively we can rename to the repo: substrait-python
I have minimal opinion since I don't know what it affects. Naively, it seems most straightforward for it to match the package name
Yeah, this you can leave alone. A person can always override it if they want to, and it won't impact the package name on conda-forge (which should probably be substrait-python to follow convention there)
| @@ -0,0 +1,10 @@ | |||
| def test_imports(): | |||
| from substrait.proto.pysubstrait.algebra_pb2 import Expression | |||
What should the protobuf class namespace be in the package? pysubstrait? python? python.substrait? etc.
I think pysubstrait -> substrait. That matches ibis-substrait's naming scheme, so I don't think it is problematic.
I think if it's proto.substrait.algebra_pb2 that it can collide with the C descriptor pool.
What if we namespace it to proto and then move the whole thing up a directory, so you have substrait.proto.algebra_pb2?
I think that works with all the relative imports? But I could be wrong.
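The collision concern can be sketched with a toy registry. This is not the protobuf API: `ToyPool` is a hypothetical stand-in, and the assumption (my understanding of how a shared descriptor pool behaves) is that entries are keyed by proto file path, so two libraries registering different builds of the same path in one process conflict. The `pysubstrait/algebra.proto` path is likewise only illustrative.

```python
class ToyPool:
    """Toy stand-in for a process-wide descriptor pool (not the protobuf API)."""

    def __init__(self):
        self._files = {}

    def add(self, file_name, definition):
        existing = self._files.get(file_name)
        # Re-registering an identical definition is harmless; a different
        # definition under the same name is treated as a conflict.
        if existing is not None and existing != definition:
            raise TypeError(f"conflicting definitions for {file_name!r}")
        self._files[file_name] = definition


pool = ToyPool()
pool.add("substrait/algebra.proto", "definition bundled by library A")

try:
    pool.add("substrait/algebra.proto", "definition bundled by library B")
    collided = False
except TypeError:
    collided = True

# Registering under a distinct, prefixed path does not conflict.
pool.add("pysubstrait/algebra.proto", "definition bundled by library B")
```

Under that assumption, whatever namespace is chosen has to change the names the generated code registers, not only the Python import path.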
| @@ -1,4 +0,0 @@ | |||
| try: | |||
Hmm, this gets overwritten by ./gen_proto.sh when we don't use a subdirectory within the package.
hm, I was looking to see if we could move this somewhere, but I'm suddenly unsure why this is getting overwritten if it should be putting files into a proto directory.
I think I prefer to keep the generated files in substrait.proto.<namespace>
I guess it generates an empty __init__.py as a safety precaution? Not ideal in this scenario, haha.
| @@ -0,0 +1 @@ | |||
| from . import proto | |||
I was trying to see if we could move the version info to this .pyi file you have, but it seems like .pyi files are only supposed to have type information. Does it make sense to have this stub file if there's only this import?
I don't think so. It gets autogenerated from protol I believe. It used to generate .pyi files for all the classes, but stopped (maybe my version was updated locally mid-development?). Maybe we just add *.pyi to .gitignore.
| tmp_dir=./proto | ||
| dest_dir=./src/substrait/proto | ||
| tmp_dir=./buf_work_dir | ||
| dest_dir=./src/substrait |
I think the hack-y way is to make this an intermediate dir and then rename it or put it in a different namespace (like pysubstrait) and then alias the imports via the __init__.py file?
edit: I suspect that this is overwriting the __init__.py because it by default creates the wrappers as a submodule into a directory it expects to be empty; maybe just modify the gen_proto.sh script to not generate an empty __init__.py file, since it wouldn't be necessary (you're not putting it in an empty directory)?
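A rough sketch of the aliasing idea, under stated assumptions: the package name `demo_proto`, the nested `pysubstrait` directory, and the placeholder `Expression` class are all hypothetical, and the files are written to a temporary directory only so the example runs on its own. The top-level `__init__.py` simply re-exports the nested module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_proto/pysubstrait/algebra_pb2.py
root = Path(tempfile.mkdtemp())
pkg = root / "demo_proto"
nested = pkg / "pysubstrait"
nested.mkdir(parents=True)
(nested / "__init__.py").write_text("")
(nested / "algebra_pb2.py").write_text("class Expression:\n    pass\n")

# The alias: the outer __init__.py re-exports the nested generated module.
(pkg / "__init__.py").write_text("from .pysubstrait import algebra_pb2\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

from demo_proto import algebra_pb2                          # short, aliased path
from demo_proto.pysubstrait import algebra_pb2 as nested_mod  # full path

same_module = algebra_pb2 is nested_mod
```

Note that a plain re-export makes `from demo_proto import algebra_pb2` work, but not `import demo_proto.algebra_pb2`; the latter would need the module to exist at that path or an explicit `sys.modules` entry.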
JK, the script just calls protol. Maybe remove the --create-package option and it will stop creating the empty __init__.py file
Ooh, I'll give this a try!
It doesn't quite work as we'd like. --create-package will recursively create init files, which we do want (except at the top level). Changing --in-place to --not-in-place would disallow overwriting of files, which we don't quite want either. Ultimately the cleanest approach to me still feels like placing these files in a proto subdir.
Unless I'm misunderstanding you, @gforsyth 's suggestion is a proto subdir. I just assumed that the protol tool is creating an extra __init__.py file at the same level as the proto subdir.
I imagine the directory layout to be something like:
substrait
├── __init__.py  (protol generates this by default)
└── proto        (this is the proto subdir)
    ├── algebra_pb2.py
    ├── plan_pb2.py
    └── ...
but that being said, I see you added a gen directory which is probably fine in the near-term. I'm not sure what's generating that __init__.py file
Yep, your initial understanding is correct! IMO it would be better to generate files into a subdir if protol doesn't support exactly what we want. To reduce bugs in the future, I'm hesitant to introduce any opaque behavior for devs. Having to choose between overwriting the top-level __init__.py and not recursively generating __init__.py files seems bad. I'd rather have both, even if it means generating files into a subdir instead of the root dir.
Oh and I forgot to mention that protol was deleting the top level __init__.py file when removing the --create-package option!
That was because I had --in-place option set (aka overwrite).
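For comparison, the alternative to a subdirectory would be to guard the hand-written top-level `__init__.py` around the generation step. The sketch below is only an illustration of that trade-off: `run_preserving` and `fake_generate` are hypothetical, and `fake_generate` merely imitates a tool that clobbers the file; it does not call protol.

```python
import tempfile
from pathlib import Path


def run_preserving(path, generate):
    """Run generate(), then restore `path` to its previous contents."""
    original = path.read_text() if path.exists() else None
    try:
        generate()
    finally:
        if original is not None:
            path.write_text(original)


# Throwaway package directory with a hand-written top-level __init__.py.
pkg = Path(tempfile.mkdtemp()) / "substrait"
pkg.mkdir()
init_file = pkg / "__init__.py"
init_file.write_text("__version__ = '0.0.0'\n")


def fake_generate():
    # Stand-in for the real generator: overwrites __init__.py and emits a module.
    init_file.write_text("")
    (pkg / "algebra_pb2.py").write_text("")


run_preserving(init_file, fake_generate)
```

This is exactly the kind of extra, less obvious step the thread is wary of, which is part of why writing generated files into their own subdirectory reads as the simpler choice.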
| @@ -0,0 +1,11 @@ | |||
| def test_imports(): | |||
| """Temporary sanity test""" | |||
| from substrait.gen.proto.algebra_pb2 import Expression | |||
How do we feel about substrait.gen.proto?
I'm satisfied with this implementation if folks are as well. IMO it is about ready to merge!
I feel like now it is in a good shape. @gforsyth any thoughts?
gforsyth left a comment:
This looks great @danepitkin! Thanks for accommodating all of the change discussions.
I'll leave this open for the rest of today-ish in case anyone else wants to take a look, but I'll merge it in tonight.
Thanks everyone for all the feedback, it was a huge help! Looking forward to seeing how Substrait Python grows from here.
Thanks for leading the charge on this, @danepitkin!
A Python package containing the generated protobuf classes.
buf and protoletariat to generate the protobuf classes
Testing: