[CLAUDE] [OPUS 4.7] fix(annotate): register typing for FirstValue and RegexpExtract #7577
Merged
georgesittas merged 1 commit into tobymao:main on Apr 30, 2026
FirstValue had no entry in base EXPRESSION_METADATA, so all dialects except BigQuery returned UNKNOWN, despite its sister LastValue being registered. Move it next to LastValue in the first-arg propagation block and drop the now-redundant BigQuery duplicate.

RegexpExtract had no base entry either, so only BigQuery and Snowflake typed it. Register it as constant-VARCHAR in the Hive typing module (covers Hive/Spark2/Spark/Databricks through the existing chain). Keep BigQuery's _annotate_by_args override since BigQuery genuinely overloads on STRING vs BYTES input. Snowflake's existing entry is preserved.

Scoping the registration to Hive (not base) avoids leaking VARCHAR onto dialects with different semantics, most notably DuckDB, where REGEXP_EXTRACT can return a STRUCT when group names are passed.

Adds fixture coverage in annotate_functions.sql:
- cross-dialect FIRST_VALUE on BIGINT and STRING
- spark/databricks REGEXP_EXTRACT on STRING and BINARY input (proves the dialect's constant-STRING behavior, distinct from BigQuery's input-type overload)
- snowflake REGEXP_SUBSTR on STRING
- duckdb REGEXP_EXTRACT pinned at UNKNOWN to lock in the Hive-only scoping
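The Hive-only scoping described above can be illustrated with a small, self-contained sketch. This is a toy model, not sqlglot's actual implementation: the dictionaries, the `PARENT` chain, and the `return_type` helper are hypothetical names chosen for illustration. It only shows why an entry registered on Hive reaches Spark2, Spark, and Databricks, while a dialect outside that chain (such as DuckDB) falls through to UNKNOWN.

```python
from typing import Dict, Optional

# Hypothetical stand-ins for per-dialect typing tables (not sqlglot's real structures).
BASE_TYPES: Dict[str, str] = {}  # RegexpExtract is deliberately NOT registered here
DIALECT_TYPES: Dict[str, Dict[str, str]] = {
    "hive": {"RegexpExtract": "VARCHAR"},
    "snowflake": {"RegexpExtract": "VARCHAR"},
}

# Assumed inheritance chain: each child dialect falls back to its parent's table.
PARENT: Dict[str, str] = {"spark2": "hive", "spark": "spark2", "databricks": "spark"}


def return_type(dialect: str, func: str) -> str:
    """Walk the dialect chain, then the base table, else report UNKNOWN."""
    current: Optional[str] = dialect
    while current is not None:
        table = DIALECT_TYPES.get(current, {})
        if func in table:
            return table[func]
        current = PARENT.get(current)
    return BASE_TYPES.get(func, "UNKNOWN")


if __name__ == "__main__":
    for name in ("hive", "databricks", "snowflake", "duckdb"):
        print(name, return_type(name, "RegexpExtract"))
```

Registering the entry in `BASE_TYPES` instead would make every dialect, including the toy `duckdb`, report VARCHAR, which is the leak the description is trying to avoid.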
22fd6aa to ec241a4
Contributor
Author
I believe that the
Comment on lines +6032 to +6034

    # dialect: duckdb
    REGEXP_EXTRACT(tbl.str_col, pattern, 0);
    UNKNOWN;
Collaborator
This shouldn't be UNKNOWN, right? The right type is JSON for DuckDB. We should either type it as such or omit this test altogether.
Collaborator
I'll get this in and take it to the finish line. Thank you for the contribution!
Register `FirstValue` in base `EXPRESSION_METADATA` and `RegexpExtract` in Hive `EXPRESSION_METADATA` so the type annotator returns the right type instead of UNKNOWN.

- `FirstValue`: propagates the type of the `this` arg, which is correct for every dialect. Presumed harmless in dialects (like SQLite) where FirstValue is not supported.
- `RegexpExtract`: VARCHAR in Hive (and descendants: Spark2, Spark, Databricks) and Snowflake. Left out of base since DuckDB handles regexp_extract unlike other implementations, so DuckDB `REGEXP_EXTRACT` → UNKNOWN.
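As a rough illustration of what "propagates the type of the `this` arg" means, here is a minimal sketch using toy node classes. `Column`, `Func`, `PROPAGATE_FROM_THIS`, and `annotate` are hypothetical names invented for this example; they are not sqlglot's expression classes or annotator API.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """Toy column node that already carries a resolved type."""
    name: str
    type: str


@dataclass
class Func:
    """Toy function node; `this` is its first argument."""
    name: str
    this: Column
    type: str = "UNKNOWN"


# Hypothetical "first-arg propagation" group: result type equals the type of `this`.
PROPAGATE_FROM_THIS = {"FirstValue", "LastValue"}


def annotate(node: Func) -> Func:
    """Copy the first argument's type onto registered functions; leave others UNKNOWN."""
    if node.name in PROPAGATE_FROM_THIS:
        node.type = node.this.type
    return node


if __name__ == "__main__":
    print(annotate(Func("FirstValue", Column("x", "BIGINT"))).type)
    print(annotate(Func("Unregistered", Column("x", "BIGINT"))).type)
```

In this toy model, a function missing from the propagation set stays UNKNOWN, which mirrors the behaviour the PR describes for `FirstValue` before it was registered alongside `LastValue`.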