@benbellick
Member

@benbellick benbellick commented Oct 23, 2025

Introduces a graceful migration from the previous usage of URIs to URNs

Closes #95, #120, #121

@github-actions

ACTION NEEDED

Substrait follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@tokoko
Contributor

tokoko commented Oct 24, 2025

Looks like the problem is in the DuckDB extension's JSON-to-Substrait conversion: it is too strict and fails on unknown JSON fields. You can get around it by switching to from_substrait instead of from_substrait_json in test_sql_to_substrait.py:

sql = "CALL from_substrait(?)"
substrait_out = conn.sql(sql, params=[plan.SerializeToString()])

That being said, we should probably open a ticket on the DuckDB extension as well to loosen its JSON conversion checks.
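
To illustrate the failure mode described above, here is a minimal stdlib-only sketch. It is not the DuckDB extension's actual code; `KNOWN_FIELDS`, `parse_strict`, and `parse_lenient` are hypothetical names, and the JSON field names are assumptions chosen for illustration. It contrasts a decoder that rejects unknown top-level fields with one that ignores them.

```python
import json

# Hypothetical set of top-level fields an older consumer knows about.
KNOWN_FIELDS = {"extensionUris", "extensions", "relations", "version"}


def parse_strict(payload: str) -> dict:
    """Reject any top-level field the consumer does not recognise."""
    doc = json.loads(payload)
    unknown = sorted(set(doc) - KNOWN_FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {unknown}")
    return doc


def parse_lenient(payload: str) -> dict:
    """Silently drop unrecognised top-level fields."""
    doc = json.loads(payload)
    return {key: value for key, value in doc.items() if key in KNOWN_FIELDS}


# A plan that carries both the legacy and the new (assumed) field name.
plan_json = json.dumps({"extensionUris": [], "extensionUrns": [], "relations": []})
```

Here `parse_strict(plan_json)` raises `ValueError`, while `parse_lenient(plan_json)` returns the plan without the unrecognised field. Binary protobuf parsing tolerates unknown fields by design, which is presumably why passing `plan.SerializeToString()` to `from_substrait` sidesteps the problem.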

@benbellick
Member Author

Opened up an issue here: substrait-io/duckdb-substrait-extension#166

@benbellick benbellick force-pushed the uri-urn-migration branch 4 times, most recently from e3fe8cd to 6c8cddb Compare October 24, 2025 17:22
@tokoko
Contributor

tokoko commented Oct 24, 2025

@benbellick Can you introduce make codegen-extensions changes also?

@benbellick benbellick force-pushed the uri-urn-migration branch 2 times, most recently from becf09b to 70b193f Compare October 24, 2025 18:06
@benbellick benbellick changed the title from "WIP: Uri urn migration" to "feat: graceful URI -> URN migration" Oct 24, 2025
@benbellick benbellick force-pushed the uri-urn-migration branch 3 times, most recently from 135e586 to e24c5a9 Compare October 24, 2025 19:34
@benbellick benbellick marked this pull request as ready for review October 24, 2025 19:54
@benbellick benbellick requested a review from tokoko October 24, 2025 21:41
@benbellick benbellick force-pushed the uri-urn-migration branch 2 times, most recently from d145e28 to f6201bb Compare October 25, 2025 16:45
Contributor

@tokoko tokoko left a comment

looks good.

P.S. it'd be good to run make antlr after substrait upgrades as well, but it needs some setting up (like in devcontainer) so we can do that separately.

@nielspardon
Member

P.S. it'd be good to run make antlr after substrait upgrades as well, but it needs some setting up (like in devcontainer) so we can do that separately.

I guess this step is still missing in this PR before we can merge?

Note, this is not passing any of the tests, and more work is necessary.
Had to temporarily disable the duckdb extension tests until the
dependency on the duckdb-substrait-extension can handle URNs.

This PR introduces handling of the migration from URIs to URNs
for extension references. As an intermediate step, both URI and URN
are emitted from produced plans.

Closes #95

The duckdb extension throws an error when unexpected fields are
present in JSON on invocation of `from_substrait_json`, so we instead
switch to using `from_substrait`.

It is not actually set anywhere and isn't in the substrait spec.
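
The intermediate step mentioned in the commit messages above (emitting both URI and URN) could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `ExtensionRegistry`, its methods, the output keys `extension_urns` and `extension_uris`, and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExtensionRegistry:
    """Assigns anchors to extensions and tracks both identifier forms."""

    # Hypothetical mapping from legacy URI to URN; real values would come
    # from the extension definitions.
    uri_to_urn: dict[str, str]
    _anchors: dict[str, int] = field(default_factory=dict)

    def resolve(self, uri_or_urn: str) -> str:
        """Accept either form from a consumed plan and normalise to a URN."""
        return self.uri_to_urn.get(uri_or_urn, uri_or_urn)

    def anchor_for(self, urn: str) -> int:
        """Return a stable 1-based anchor for the URN, creating it if needed."""
        return self._anchors.setdefault(urn, len(self._anchors) + 1)

    def emit(self) -> dict[str, list[dict]]:
        """Emit URN and URI declarations that share the same anchors."""
        urn_to_uri = {urn: uri for uri, urn in self.uri_to_urn.items()}
        urns = [{"anchor": a, "urn": u} for u, a in self._anchors.items()]
        uris = [
            {"anchor": a, "uri": urn_to_uri[u]}
            for u, a in self._anchors.items()
            if u in urn_to_uri  # skip URNs that have no legacy URI
        ]
        return {"extension_urns": urns, "extension_uris": uris}
```

Consumers that only understand URIs keep working because the same anchor is declared under both forms, while newer consumers can rely on the URN alone.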
@benbellick benbellick requested review from nielspardon and tokoko and removed request for tokoko October 28, 2025 19:03
@benbellick
Member Author

I have rebased off of main and run make antlr.

  from google.protobuf import symbol_database as _symbol_database
  from google.protobuf.internal import builder as _builder
- _runtime_version.ValidateProtobufRuntimeVersion(_runtime_version.Domain.PUBLIC, 5, 29, 5, '', 'proto/algebra.proto')
Member

Since this line got removed, it seems you did not use the same protoc version that the devcontainer is configured to use. We had the same observation in this closed PR: #113 (review)

Member

The devcontainer uses protoc v29.5, which is built for protobuf v5.29.5, the most recent version compatible with the protoletariat tool used in the proto code generation.
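
One way to catch this kind of mismatch early is to read the version that protoc wrote into the generated file. The sketch below is only an illustration: `gencode_version` and `matches_expected` are hypothetical helpers, and the pattern assumes the generated call looks like the `ValidateProtobufRuntimeVersion(...)` line quoted in this thread.

```python
import re

# Matches the version arguments of the runtime check that protoc writes
# into generated *_pb2.py files (shape assumed from the line quoted above).
_VERSION_CALL = re.compile(
    r"ValidateProtobufRuntimeVersion\(\s*[\w.]+\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)"
)


def gencode_version(source: str):
    """Return (major, minor, patch) from generated code, or None if absent."""
    match = _VERSION_CALL.search(source)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def matches_expected(source: str, expected=(5, 29, 5)) -> bool:
    """True only if the file was generated for the expected protobuf version."""
    return gencode_version(source) == expected
```

A file without the call at all (as in the diff above) yields `None`, so `matches_expected` reports a mismatch rather than passing silently.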

Member

I guess we could add a GitHub Actions check that runs the codegen and verifies that the git workspace contains no uncommitted changes afterwards, to help catch cases where codegen was skipped or run with the wrong versions.

Contributor

There's too much code generation going on; we'll probably have a hard time keeping track of all the versions in two places unless we build/run the devcontainer in a GH action.

Member

Yes, it makes sense to do it via the devcontainer: build it from the Dockerfile in GitHub Actions and run the codegen and verification inside the container.
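
The verification step discussed in this thread could look roughly like the sketch below, run after the codegen targets inside the container. It is an assumption about how such a check might be written, not the workflow that was eventually added; `changed_paths` and `workspace_is_clean` are hypothetical names.

```python
import subprocess


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` output."""
    paths = []
    for line in porcelain.splitlines():
        if line.strip():
            # Each entry is two status characters, a space, then the path.
            paths.append(line[3:])
    return paths


def workspace_is_clean(repo_dir: str = ".") -> bool:
    """Run after codegen; False means committed generated files are stale."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    dirty = changed_paths(result.stdout)
    for path in dirty:
        print(f"out of date: {path}")
    return not dirty
```

A CI job would exit non-zero when `workspace_is_clean()` returns `False`, so a PR that forgot to regenerate, or regenerated with different tool versions, fails visibly.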

Member

I created this issue for updating the github actions build: #120

Member

if we can improve the build in this PR that would be amazing. otherwise we do it in a follow-up

  # generated by datamodel-codegen:
  # filename: simple_extensions_schema.yaml
- # timestamp: 2025-06-06T08:43:35+00:00
+ # timestamp: 2025-10-24T17:55:30+00:00
Member

I guess we could disable the timestamp by adding the --disable-timestamp flag in the Makefile, so that repeated make codegen-extensions runs produce reproducible output.
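
Until the flag is in place, a comparison that ignores the timestamp header can show whether two codegen runs otherwise agree. This is a different technique from the flag suggested above, sketched here only for illustration; `strip_timestamp` and `same_ignoring_timestamp` are hypothetical helpers.

```python
def strip_timestamp(generated: str) -> str:
    """Drop '# timestamp: ...' comment lines from generated source."""
    kept = [
        line
        for line in generated.splitlines()
        # Only comment lines qualify, so a field named `timestamp` survives.
        if not (line.startswith("#") and line.lstrip("# ").startswith("timestamp:"))
    ]
    return "\n".join(kept)


def same_ignoring_timestamp(old: str, new: str) -> bool:
    """True if two generated files differ at most in their timestamp header."""
    return strip_timestamp(old) == strip_timestamp(new)
```

With the header from this diff, two runs that differ only in the `# timestamp:` line compare as equal, while any real change to the generated classes still shows up.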

  # Generate the new python protobuf files
  buf generate
- protol --in-place --create-package --python-out "$dest_dir" buf
+ uv run protol --in-place --create-package --python-out "$dest_dir" buf
Member Author

This was to make CI/CD not complain about missing protol

Member

@nielspardon nielspardon left a comment

LGTM, thank you for also helping with adding the codegen verification

Contributor

@tokoko tokoko left a comment

good to merge, thanks for bearing with us :)

@benbellick benbellick merged commit 890f84b into main Oct 30, 2025
23 checks passed
Development

Successfully merging this pull request may close these issues.

Migrate extensions from URIs to URNs

4 participants