Skip to content

feat: generated protobuf classes#1

Merged
gforsyth merged 12 commits intosubstrait-io:mainfrom
danepitkin:danepitkin/gen-proto
Apr 24, 2023
Merged

feat: generated protobuf classes#1
gforsyth merged 12 commits intosubstrait-io:mainfrom
danepitkin:danepitkin/gen-proto

Conversation

@danepitkin
Copy link
Copy Markdown
Contributor

A python package containing the generated protobuf classes.

  • Use buf and protoletariat to generate the protobuf classes
  • Use setuptools for python packaging and dynamic versioning
  • Pin Substrait submodule to latest release (v0.28.2)

Testing:

  • Ran through README instructions to generate proto, build python package, run pytests, and import module locally.

Comment thread tests/test_proto.py Outdated
@@ -0,0 +1,10 @@
def test_imports():
from pysubstrait.proto.pysubstrait.algebra_pb2 import Expression
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious about the pysubstrait.proto.pysubstrait... could we make it something like
from pysubstrait.proto.core.algebra_pb2 import Expression or just from pysubstrait.proto.algebra_pb2 import Expression

Copy link
Copy Markdown
Contributor Author

@danepitkin danepitkin Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also have preferred that, but the second pysubstrait was added as a namespace for the generated protobuf classes so as not to conflict with other implementations like C++. I used this script in gen_proto.sh to generate it: https://github.com/substrait-io/substrait/blob/main/tools/proto_prefix.py. I'm open to a better naming convention if anyone has one!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, I understand your point. 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always merge as-is and change it in the future while the project is still pre- v1.0. For reference, Ibis-substrait sort of does it this way, too: https://github.com/ibis-project/ibis-substrait/tree/main/ibis_substrait/proto/substrait/ibis

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this approach has some dangers when using extensions (e.g. google.protobuf.Any) Though my understanding was that Ibis-substrait had found some way to solve this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we haven't actually run into this in ibis-substrait yet, but I don't know that anyone has tried it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind elaborating on the dangers with an example?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Let's pretend I created a message (EstimatedSelectivity) that was intended to be used as an advanced extension on RelCommon::Hint and we merged it into Substrait:

message RelCommon {
  ...
  Hint hint = 3;
  ...
  message Hint {
    ...
    substrait.extensions.AdvancedExtension advanced_extension = 10;
    ...
  }
  ...
}

message EstimatedSelectivity {
  double selectivity = 1;
}

Then, I attached that hint onto a filter relation. Here is the plan, serialized as JSON:

{
  ...
  "filter": {
    "common": {
      "hint": {
        "optimization": {
          "@type": "substrait.EstimatedSelectivity",
          "selectivity": 0.03
        }
      }
    },
    "condition": { ... },
    "input": { ... }
  }
  ...
}

As you can see, the Substrait package is included as part of the @type when an Any is serialized.

However, I don't know of any existing use like this. In Acero we use google.protobuf.Any to add new nodes (currently as-of join and segmented aggregation). In both cases the new message is not part of the core Substrait set of messages. It's a part of Acero's custom protobuf. This custom protobuf uses Substrait messages (e.g. an as-of join has field references in it and these are Substrait expressions) but the only type name that gets serialized is the extension itself.

I think it's reasonable to expect that will never be the case. For example, consider the above example. One could argue that a new "standard hint" should be included in the hint definition as:

message RelCommon {
  ...
  Hint hint = 3;
  ...
  message Hint {
    ...
    EstimatedSelectivity estimated_selectivity = 11;
    substrait.extensions.AdvancedExtension advanced_extension = 10;
    ...
  }
  ...
}

In other words, any time google.protobuf.Any is used it must refer to an object that is not part of the core Substrait protobuf files.

However, I'm not completely convinced it will never happen.

Copy link
Copy Markdown
Member

@drin drin Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe what Weston is describing is how I planned to approach doing my own extensions. I think the most concise example is here, where I use ibis-substrait to translate and then I use a google.protobuf.Any for an ExtensionLeafRel:
https://github.com/drin/mohair-extension/blob/mainline/mohair_extension/extension.py#L100

Using just the standard string representation, here's what I generate:
https://media.githubusercontent.com/media/drin/mohair-extension/mainline/resources/projection.mohair.txt

an object that is not part of the core Substrait protobuf files

I am assuming that if something becomes core, then I would change the way this is done for that specific message type. Using ibis-substrait's approach to translation makes it an easy change, not sure that approach holds true for any other, large projects.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, both! The examples are super helpful. It's a good thing to note!

Comment thread README.md Outdated
## Generate protocol buffers
Generate the protobuf files manually. Requires protobuf `v3.20.1`.
```
./gen_proto.sh
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious:

Could we make this part of the setup which would automatically run?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to do that -- one of the conveniences here should be that the files are already generated (using the correct version of protobuf) and bundled into the package for the end-user.

@vibhatha vibhatha added documentation Improvements or additions to documentation enhancement New feature or request labels Apr 18, 2023
Copy link
Copy Markdown
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good starting point and the general structure is good in re: protobuf generation, namespacing, etc.

My preference would be to call the module and the pypi package substrait instead of pysubstrait, but that's not a blocker.

@danepitkin
Copy link
Copy Markdown
Contributor Author

I think this is a good starting point and the general structure is good in re: protobuf generation, namespacing, etc.

My preference would be to call the module and the pypi package substrait instead of pysubstrait, but that's not a blocker.

Since there are a lot of opinions on the two names, is it worth voting in the ML? It's probably best that we definitively pick one early on.

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a fine starting point to me.

Comment thread README.md Outdated
Comment on lines +163 to +187
# Getting Started
```
git clone --recursive https://github.com/substrait-io/substrait-python.git
cd substrait-python
```

# Setting up your environment
## Conda env
Create a conda environment with developer dependencies.
```
conda env create -f environment.yml
conda activate pysubstrait
```

# Build
## Python package
### Editable installation
```
pip install -e .
```

## Generate protocol buffers
Generate the protobuf files manually. Requires protobuf `v3.20.1`.
```
./gen_proto.sh
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these instructions for pysubstrait developers? Or are these instructions for pysubstrait users?

If these are for developers should they be in a CONTRIBUTING.md? If they are for users then why does it list gen_proto.sh? Is that something a user has to do?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything at and below "Getting started" is for devs! I'll move it to CONTRIBUTING.MD. Great idea!

Comment thread tests/test_proto.py Outdated
@@ -0,0 +1,10 @@
def test_imports():
from pysubstrait.proto.pysubstrait.algebra_pb2 import Expression
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually "standard" (as in, the message defining the hint is in the core substrait proto files) hints will probably be more of a danger than extensions.

Copy link
Copy Markdown
Member

@drin drin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.

@vibhatha
Copy link
Copy Markdown

Looks good to me.

@gforsyth
Copy link
Copy Markdown
Member

Since there are a lot of opinions on the two names, is it worth voting in the ML? It's probably best that we definitively pick one early on.

I haven't familiarized myself with voting procedures for the project, but it's an option.

I think it's worth looking at what other languages have done:

Java (original implementation), refers to itself as io.substrait: https://mvnrepository.com/artifact/io.substrait
Rust, refers to itself as substrait: https://docs.rs/substrait/latest/substrait/
Go, refers to itself as substrait-go: https://pkg.go.dev/github.com/substrait-io/substrait-go
C++, uses the namespace io::substrait::common: https://github.com/substrait-io/substrait-cpp/blob/main/include/substrait/common/Exceptions.h#L9

All but go elide the language abbreviation from the name, and I believe go automatically removes go- and -go when referring to libraries internally (needs a fact check).

@westonpace
Copy link
Copy Markdown
Member

C++, uses the namespace io::substrait::common: https://github.com/substrait-io/substrait-cpp/blob/main/include/substrait/common/Exceptions.h#L9

To be fair, the cmake project name (which is probably a closer equivalent) is actually substrait-cpp. That being said, I'm in favor of substrait over pysubstrait.

@gforsyth
Copy link
Copy Markdown
Member

To be fair, the cmake project name (which is probably a closer equivalent) is actually substrait-cpp.

Sorry! I didn't mean to misrepresent 😅

@danepitkin
Copy link
Copy Markdown
Contributor Author

danepitkin commented Apr 20, 2023

Okay, I think there is a pretty substantial majority leaning towards substrait over pysubstrait. I will update this PR with that change!

I'll still nickname this project PySubstrait from time to time outside of this repo, though. ;)

Comment thread environment.yml Outdated
@@ -0,0 +1,12 @@
name: pysubstrait
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we feel about leaving the conda environment as pysubstrait? Alternatively we can rename to the repo: substrait-python

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have minimal opinion since I don't know what it affects. Naively, it seems most straightforward for it to match the package name

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this you can leave alone. A person can always override it if they want to, and it won't impact the package name on conda-forge (which should probably be substrait-python to follow convention there)

Comment thread tests/test_proto.py Outdated
@@ -0,0 +1,10 @@
def test_imports():
from substrait.proto.pysubstrait.algebra_pb2 import Expression
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should the protobuf class namespace be in the package? pysubstrait? python? python.substrait? etc.

Copy link
Copy Markdown
Member

@drin drin Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think pysubstrait -> substrait. that matches ibis-substrait's naming scheme so I don't think it is problematic

Copy link
Copy Markdown
Member

@gforsyth gforsyth Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if it's proto.substrait.algebra_pb2 that it can collide with the C descriptor pool.
what if we namespace it to proto and then move the whole thing up a directory, so you have substrait.proto.algebra_pb2?

I think that works with all the relative imports? But I could be wrong.

Comment thread src/substrait/__init__.py Outdated
@@ -1,4 +0,0 @@
try:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this gets overwritten by ./gen_proto.sh when we don't use a subdirectory within the package.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, I was looking to see if we could move this somewhere, but I'm suddenly unsure why this is getting overwritten if it should be putting files into a proto directory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I prefer to keep the generated files in substrait.proto.<namespace>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it generates an empty init.py as a safety precaution? not ideal in this scenario haha

Comment thread src/substrait/__init__.pyi Outdated
@@ -0,0 +1 @@
from . import proto
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to see if we could move the version info to this .pyi file you have, but it seems like pyi files are only supposed to have type information. does it make sense to have this stub file if there's only this import?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. It gets autogenerated from protol I believe. It used to generate .pyi files for all the classes, but stopped (maybe my version was updated locally mid-development?). Maybe we just add *.pyi to .gitignore.

Comment thread gen_proto.sh Outdated
tmp_dir=./proto
dest_dir=./src/substrait/proto
tmp_dir=./buf_work_dir
dest_dir=./src/substrait
Copy link
Copy Markdown
Member

@drin drin Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the hack-y way is to make this an intermediate dir and then rename it or put it in a different namespace (like pysubstrait) and then alias the imports via the __init__.py file?

edit: I suspect that this is overwriting the __init__.py because it by default creates the wrappers as a submodule into a directory it expects to be empty; maybe just modify the gen_proto.sh script to not generate an empty __init__.py file, since it wouldn't be necessary (you're not putting it in an empty directory)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JK, the script just calls protol. Maybe remove the --create-package option and it will stop creating the empty __init__.py file

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooh ill give this a try!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't quite work as we'd like. --create-package will recursively create init files, which we do want (except at the top level). Changing --in-place to --not-in-place would disallow overwriting of files, which we don't quite want either. Ultimately the cleanest approach to me still feels like placing these files in a proto subdir.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misunderstanding you, @gforsyth 's suggestion is a proto subdir. I just assumed that the protol tool is creating an extra __init__.py file at the same level as the proto subdir.

I imagine the directory layout to be something like:

substrait
    ├── __init__.py (protol generates this by default)
    ├── proto (this is the proto subdir)
        ├── algebra_pb2.py
        ├── plan_pb2.py
        ├── ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that being said, I see you added a gen directory which is probably fine in the near-term. I'm not sure what's generating that __init__.py file

Copy link
Copy Markdown
Contributor Author

@danepitkin danepitkin Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep your initial understanding is correct! IMO it would be better to generate files into a subdir if protol doesn't support exactly what we want. I'm hesitant to introduce any opaque behavior for devs to reduce bugs in the future. Having to choose between overwriting the top level __init__.py or not recursively generating __init__.py's seems bad. I'd rather have both if it means generating files into a subdir instead of root dir.

Copy link
Copy Markdown
Contributor Author

@danepitkin danepitkin Apr 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and I forgot to mention that protol was deleting the top level __init__.py file when removing the --create-package option!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was because I had --in-place option set (aka overwrite).

Comment thread tests/test_proto.py
@@ -0,0 +1,11 @@
def test_imports():
"""Temporary sanity test"""
from substrait.gen.proto.algebra_pb2 import Expression
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we feel about substrait.gen.proto?

@danepitkin
Copy link
Copy Markdown
Contributor Author

I'm satisfied with this implementation if folks are as well. IMO it is about ready to merge!

@vibhatha
Copy link
Copy Markdown

I feel like now it is in a good shape. @gforsyth any thoughts?

Copy link
Copy Markdown
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

This looks great @danepitkin ! Thanks for accommodating all of the change discussions.

@gforsyth
Copy link
Copy Markdown
Member

I'll leave this open for the rest of today-ish in case anyone else wants to take a look, but I'll merge it in tonight.

@danepitkin
Copy link
Copy Markdown
Contributor Author

Thanks everyone for all the feedback, it was a huge help! Looking forward to seeing how Substrait Python grows from here.

@gforsyth gforsyth merged commit 70a3cb8 into substrait-io:main Apr 24, 2023
@gforsyth
Copy link
Copy Markdown
Member

Thanks for leading the charge on this, @danepitkin !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants